Monday, February 23, 2015

Wikipedia XML Dump Search Program

Did you know that Wikipedia regularly releases dumps of its current article set encoded as XML with BZ2 compression?

As part of the Information Retrieval and Extraction Course at IIIT-H, a mini project was given to create an indexer and searcher for such dumps without using any 3rd party indexing implementations.

The Given Problem

The given problem was to design and develop a scalable and efficient search engine using the Wikipedia data.
  • ~50 GB of Wikipedia data (Downloaded compressed file is ~11GB)
  • Results obtained in less than a sec (even for long queries) 
  • Supports field queries (ex: title) 
  • Index size should be less than 1/4 of the data size. 
  • You have to build your own indexing mechanism, i.e. you cannot use Nutch or Lucene to index the Wikipedia data.
  • OS: Preferably Linux
  • Languages: Java/C++/Python

My Approach

There are a number of ways to parse an XML file. The worst way, but brute force way is to maintain the entire XML structure in memory by loading the entire file into memory. This would work for small files, but for huge files, it would not scale.

The optimal way to do it is by using the SAX parser (Simple API for XML)
The SAX parser traverses the file line by line and triggers specific functions whenever there is an opening tag, content, and closing tag. This allows us to parse through the file without having to load the entire structure into memory.

Thus we can decide how many articles to process during the parsing itself.

Using this approach, we extract each Wikipedia article as an object, and then pass it to an Indexer Module. This module analyses the article, takes the necessary text components, tokenizes them, and creates a frequency map of the word and documents it occurs in. The tokenization uses case folding and stemming (Porter Stemmer in Python) so as to cover various word cases.

Now this map can not be stored in memory at once, so dump the map to temporary files for every 10,000 Wikipedia articles, and then merge them together at the end. Then we sort the final file and output compressed part files for us to access in an easy way, and reduce the index size.

The searching takes a query and breaks it down the same way (tokenization, case folding, stemming) and tries to extract the frequency from the index. Then we calculate the TFIDF (Term Frequency - Inverse Document Frequency) for each word and document and generate a ranked list of Wikipedia article titles as the end result.

My Implementation

I wrote the following python program that can parse a Wikipedia dump XML file and index it in index files. These can then be used to search for articles through a simple searching algorithm.

This implementation is very slow and not optimized, it is far from the ideal, but it gets the job done.

I thought I might share it with you. The code is present on Github : WikiSearchMachine

Once you clone the project (or download the zip and uncompress it) you will see ans src folder, and some other files.

The src folder contains the python code split into various .py files for ease of maintenance

In the main folder you will find some .sh (for Linux) files or .bat (for Windows) files.

Though the code works on Windows, I would not advice running it in windows as there are memory constraints that windows might fail to satisfy (a 64-bit windows gives at max 2GB of RAM to a 32-bit process such as our python script, and the indexer might fail to index; Linux does a better job of giving about 3GB ). Of course this memory constraint is if you run it on the ~50 GB Wikipedia dump file, not the sample 100 article subset that I provided.

In order to run the indexer, create a folder named Index the main directory (alongside src)
I provided a sample file with 100 Wikipedia articles that illustrates the concept of indexing and searching.

On Linux open the terminal and run:
     sh ./sampleXML.xml ./Index/index

On Windows open the command prompt and run:
     run_indexer.bat sampleXML.xml Index\index

The indexer should start running, it might take a while, and you will get a completion message with time statistics displayed in counts of milliseconds taken.

For testing the searching, you can try 2 types of queries:
  1. Regular Queries - just plain text
  2. Fielded Queries - words with specific criteria, like t:lord b:rings, where t: means search in title, b: means search in body. You can use 4 types, namely t: for title, b: for body, c: for category, i: for infobox

For testing the query create a text file in the main directory with the first line as the number of queries, and subsequent lines will contain one query per line. An example is given in testQuery.txt

On Linux open the terminal and run:
     sh ./Index/index < testQuery.txt
On Windows, open the command prompt and run:
     run.bat Index\index < testQuery.txt

You should see the top 10 results available (maximum) per each query

If you have any Questions or Comments, please add them below.


Friday, February 20, 2015

Learning Web Development Part 0 - Introduction: A Primer For Newbies

Web Development is the process of developing websites or web services. Anything that users can consume online via web browsers can be considered a part of web development.

Learning basic web development is easy.
It involves learning about 4 different parts.
2. Javascript
3. CSS
4. Server Technologies

A web-based system generally looks like:        Web Client <-----------> Web Server

The client may request the server using different supported protocols. Websites may be static based, or dynamic based. For a static website, all we need to know is that there is an http web server that serves static files. The files that are generally served are HTML files. You web browser is a client that requests a web server for information, and usually the server responds with html if it’s an http server.

This is the introduction post for a series of posts that will cover various aspects of web development, and lets see how in depth we can get. I will cover the basics in the initial posts, and will try to touch a bit more serious stuff later on.

My next post will include things related to basics of :

  • HTML stands of Hyper Text Mark Up Language. A markup language basically adds meta information to content. This markup may add semantic meaning to the content, or it may allow a different presentation.
  • JavaScript is a scripting language used by most of the browsers for adding enhanced experience to the webpage. Things like triggering click events, and adding fancy overlays, basically handling all events in a web page programmatically is done through JavaScript.
  • CSS stands for Cascading Style Sheets. It is used to design the page to look a certain way.

Later I will try to get into setting up simple basic webservers with various languages and frameworks.
I look forward to seeing you there ;)