Thursday, July 2, 2015

My Thoughts on the Dunning-Kruger Effect and Its Influence on the Perception of Skill and Time

I was skimming through the internet when I happened to come across this question and answer on Quora:
Read Piaw Na's answer to I am confident that I am going to build a search engine that will compete with Google at least in the smallest scale possible first, but for now I don't know any programming. What should I do? on Quora

Intrigued, I started reading about the Dunning-Kruger effect. It states, roughly, that people who lack a complex skill tend to overestimate their abilities and underestimate that skill's difficulty, while people who possess complex skills tend to underestimate their relative abilities and their knowledge of those skills.

I realized that I have often been affected by this, in both directions, underestimating and overestimating my skill level, especially when it comes to computer science and programming.

I realized that everyone at some point in their lives will experience this effect. It especially occurs when you try teaching something you are good at to a novice. The things about that skill that are most obvious to you are so difficult for the other person that you feel they are not trying hard enough. In reality, it might simply take time for them to get accustomed to that difficulty level.

Another realization came when I was reading about failed startups and the reasons for their failure. Probably the number one reason startups fail is that the founders overestimate their skills and/or underestimate the difficulty of the problem. Now, if they are capable people, they will learn how to overcome their difficulties. But in most cases, it simply takes too much time and the startup cannot be sustained, or the skills required are too difficult to learn at that point in time.

My final "Aha!" moment came when I realized that time is one of the crucial factors affecting how you perceive a problem and how well you understand your own skill level. Time also influences your learning ability, and your perception of the problem's difficulty greatly influences how efficiently you learn. So it's this "A influences B influences C influences A ..." cyclic relation, where time influences your skill, which influences your perception, which influences your estimation of time, and so on.

This is precisely why you cannot really learn something without spending enough time on it. This is exactly what I learned from Prof. Yegnanarayana of IIIT-H in his Artificial Neural Networks class: a neural network requires multiple iterations over an aspect of learning before it is committed to long-term memory, and multiple iterations are only possible when enough time is given to that aspect.
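The idea can be sketched with the simplest possible "neural network": a single weight trained by gradient descent. This toy example is my own illustration, not course material. One pass over the data leaves the weight far from its target of 2; only repeated iterations settle it there.

```python
# A toy "one-neuron network" learning y = 2x by gradient descent.
# One epoch is not enough; the weight converges only over many iterations.

def train(epochs, lr=0.05):
    w = 0.0                        # start knowing nothing
    data = [(1, 2), (2, 4), (3, 6)]
    for _ in range(epochs):
        for x, y in data:
            error = w * x - y      # how wrong the current weight is
            w -= lr * error * x    # nudge the weight toward the target
    return w

print(train(1))    # after one epoch: still well short of 2
print(train(50))   # after fifty epochs: essentially 2
```

More data, or more passes over the same data, is what "spending enough time" looks like to a learner, artificial or human.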

So the next time you see "Crash Course! Learn blah blah blah in 30 days", remember that it will most definitely take more than 30 days to gain real proficiency. And do remember the Dunning-Kruger effect the next time someone says "That's so easy, anyone can do it!"

Cheers ;)

Monday, February 23, 2015

Wikipedia XML Dump Search Program

Did you know that Wikipedia regularly releases dumps of its current article set encoded as XML with BZ2 compression?

As part of the Information Retrieval and Extraction course at IIIT-H, we were given a mini project: build an indexer and searcher for such dumps without using any third-party indexing implementations.

The Given Problem

The given problem was to design and develop a scalable and efficient search engine using the Wikipedia data.
  • ~50 GB of Wikipedia data (Downloaded compressed file is ~11GB)
  • Results must be obtained in less than a second (even for long queries) 
  • Supports field queries (ex: title) 
  • Index size should be less than 1/4 of the data size. 
  • You have to build your own indexing mechanism, i.e. you cannot use Nutch or Lucene to index the Wikipedia data.
  • OS: Preferably Linux
  • Languages: Java/C++/Python

My Approach

There are a number of ways to parse an XML file. The worst, brute-force way is to load the entire file and keep the whole XML structure in memory. This works for small files, but for huge files it does not scale.

The optimal way is to use a SAX parser (Simple API for XML).
A SAX parser streams through the file and triggers specific callback functions whenever it encounters an opening tag, text content, or a closing tag. This lets us parse the whole file without ever loading its entire structure into memory.

Thus we can decide how many articles to process during the parsing itself.
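Here is a minimal sketch of that callback style using Python's built-in xml.sax module. The tag names are simplified stand-ins for the real Wikipedia schema; the handler collects every <title> without ever holding the whole document in memory.

```python
# SAX sketch: three callbacks fire as the parser streams through the XML.
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.in_title = False
        self.buf = []
        self.titles = []

    def startElement(self, name, attrs):   # fires on every opening tag
        if name == "title":
            self.in_title = True
            self.buf = []

    def characters(self, content):         # fires on text content
        if self.in_title:
            self.buf.append(content)

    def endElement(self, name):            # fires on every closing tag
        if name == "title":
            self.in_title = False
            self.titles.append("".join(self.buf))

handler = TitleHandler()
xml.sax.parseString(
    b"<mediawiki><page><title>Earth</title></page>"
    b"<page><title>Mars</title></page></mediawiki>",
    handler)
print(handler.titles)  # → ['Earth', 'Mars']
```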

Using this approach, we extract each Wikipedia article as an object and pass it to an indexer module. This module analyses the article, takes the necessary text components, tokenizes them, and builds a frequency map from each word to the documents it occurs in. The tokenization uses case folding and stemming (a Porter stemmer in Python) to cover variant word forms.
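A toy version of that tokenize-and-count step looks like this (a crude "drop trailing s" rule stands in for the real Porter stemmer, and the document IDs are made up):

```python
# Build a word -> {doc_id: frequency} map from a handful of fake articles.
import re
from collections import defaultdict

def tokenize(text):
    words = re.findall(r"[a-z0-9]+", text.lower())            # case folding
    return [w[:-1] if w.endswith("s") else w for w in words]  # crude stemming

def index_articles(articles):
    index = defaultdict(lambda: defaultdict(int))  # word -> {doc_id: freq}
    for doc_id, text in articles.items():
        for token in tokenize(text):
            index[token][doc_id] += 1
    return index

idx = index_articles({1: "The rings of Saturn", 2: "Lord of the Rings"})
print(dict(idx["ring"]))  # → {1: 1, 2: 1}
```

Because "Rings" and "rings" both reduce to "ring", one posting list covers all the variant forms.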

Now, this map cannot be held in memory all at once, so we dump it to a temporary file after every 10,000 Wikipedia articles and merge the files together at the end. We then sort the final file and output compressed part files, which keeps the index small and easy to access.
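The merge step works because each temporary file is written in sorted order, so the partial indexes can be combined in one streaming pass. A sketch with made-up file contents, using Python's heapq.merge:

```python
# Streaming k-way merge of sorted partial index files.
import heapq, os, tempfile

def merge_runs(run_files, out_file):
    files = [open(f) for f in run_files]
    with open(out_file, "w") as out:
        for line in heapq.merge(*files):   # never loads all lines at once
            out.write(line)
    for f in files:
        f.close()

# tiny demo: two sorted partial indexes merged into one
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, "part0"), os.path.join(tmp, "part1")]
with open(paths[0], "w") as f:
    f.write("apple:1\nzebra:4\n")
with open(paths[1], "w") as f:
    f.write("mango:2\n")
merged = os.path.join(tmp, "merged")
merge_runs(paths, merged)
with open(merged) as f:
    print(f.read())  # apple, mango, zebra lines in sorted order
```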

Searching takes a query, breaks it down the same way (tokenization, case folding, stemming), and looks up each term's frequencies in the index. We then calculate the TF-IDF (Term Frequency - Inverse Document Frequency) score for each word and document, and generate a ranked list of Wikipedia article titles as the end result.
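A bare-bones sketch of that scoring step over the word → {document: frequency} map (plain log-scaled IDF here; the actual project's weighting may differ):

```python
# Score each document as the sum over query terms of tf * idf,
# then return document IDs sorted best-first.
import math

def rank(query_tokens, index, total_docs):
    scores = {}
    for term in query_tokens:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(total_docs / len(postings))  # rarer term -> bigger weight
        for doc, tf in postings.items():
            scores[doc] = scores.get(doc, 0.0) + tf * idf
    return sorted(scores, key=scores.get, reverse=True)

index = {"ring": {1: 3, 2: 1}, "lord": {2: 2}}
print(rank(["lord", "ring"], index, total_docs=10))  # → [2, 1]
```

Document 2 wins because it matches both query terms, and "lord" is rarer than "ring" across the collection.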

My Implementation

I wrote the following Python program that can parse a Wikipedia dump XML file and build index files from it. These can then be used to search for articles through a simple searching algorithm.

This implementation is slow and unoptimized, far from ideal, but it gets the job done.

I thought I might share it with you. The code is available on GitHub: WikiSearchMachine

Once you clone the project (or download the zip and uncompress it), you will see a src folder and some other files.

The src folder contains the Python code, split into various .py files for ease of maintenance.

In the main folder you will find some .sh files (for Linux) and .bat files (for Windows).

Though the code works on Windows, I would not advise running it there, as Windows may fail to satisfy the memory requirements (64-bit Windows gives at most 2 GB of RAM to a 32-bit process such as our Python script, and the indexer might fail to index; Linux does a better job, giving about 3 GB). Of course, this memory constraint only applies if you run it on the ~50 GB Wikipedia dump file, not the sample 100-article subset that I provided.

In order to run the indexer, create a folder named Index in the main directory (alongside src).
I have provided a sample file with 100 Wikipedia articles that illustrates the concept of indexing and searching.

On Linux open the terminal and run:
     sh run_indexer.sh ./sampleXML.xml ./Index/index

On Windows open the command prompt and run:
     run_indexer.bat sampleXML.xml Index\index

The indexer should start running; it might take a while, and at the end you will get a completion message with time statistics in milliseconds.

For testing the searching, you can try 2 types of queries:
  1. Regular Queries - just plain text
  2. Fielded Queries - words with specific criteria, like t:lord b:rings, where t: means search in title and b: means search in body. You can use 4 field types: t: for title, b: for body, c: for category, and i: for infobox
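For illustration, splitting such a query into (field, word) pairs can look like this sketch (the function name and the "all" default are my own, not necessarily how the repository implements it):

```python
# Split "t:lord b:rings hobbit" into (field, word) pairs;
# untagged words default to searching everywhere.
FIELDS = {"t": "title", "b": "body", "c": "category", "i": "infobox"}

def parse_query(query):
    parts = []
    for tok in query.split():
        if len(tok) > 2 and tok[1] == ":" and tok[0] in FIELDS:
            parts.append((FIELDS[tok[0]], tok[2:]))
        else:
            parts.append(("all", tok))
    return parts

print(parse_query("t:lord b:rings hobbit"))
# → [('title', 'lord'), ('body', 'rings'), ('all', 'hobbit')]
```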

For testing the queries, create a text file in the main directory whose first line is the number of queries; each subsequent line contains one query. An example is given in testQuery.txt

On Linux open the terminal and run:
     sh run.sh ./Index/index < testQuery.txt
On Windows, open the command prompt and run:
     run.bat Index\index < testQuery.txt

You should see up to the top 10 results for each query.

If you have any Questions or Comments, please add them below.


Friday, February 20, 2015

Learning Web Development Part 0 - Introduction: A Primer For Newbies

Web Development is the process of developing websites or web services. Anything that users can consume online via web browsers can be considered a part of web development.

Learning basic web development is easy.
It involves learning about 4 different parts:
1. HTML
2. JavaScript
3. CSS
4. Server Technologies

A web-based system generally looks like:        Web Client <-----------> Web Server

The client requests the server using one of the supported protocols. Websites may be static or dynamic. For a static website, all we need to know is that there is an HTTP web server that serves static files, generally HTML files. Your web browser is a client that requests information from a web server, and an HTTP server usually responds with HTML.
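To make the static case concrete, Python's standard library can play the role of such an HTTP server in a few lines (Python chosen here only because it needs no setup; any server technology works the same way):

```python
# Minimal static file server: serves files (e.g. index.html)
# from the current directory over HTTP.
from http.server import HTTPServer, SimpleHTTPRequestHandler

def make_server(port=8000):
    return HTTPServer(("localhost", port), SimpleHTTPRequestHandler)

# To actually serve, run:
#   make_server().serve_forever()
# then open http://localhost:8000 in your browser.
```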

This is the introductory post for a series that will cover various aspects of web development, and let's see how deep we can get. I will cover the basics in the initial posts, and will try to touch on more serious stuff later on.

My next post will cover the basics of:

  • HTML stands for HyperText Markup Language. A markup language adds meta information to content. This markup may add semantic meaning to the content, or it may allow a different presentation.
  • JavaScript is a scripting language used by most browsers to enhance the experience of a web page. Things like triggering click events and adding fancy overlays, basically handling all the events in a web page programmatically, are done through JavaScript.
  • CSS stands for Cascading Style Sheets. It is used to style the page to look a certain way.

Later I will try to get into setting up simple basic webservers with various languages and frameworks.
I look forward to seeing you there ;)


Monday, September 8, 2014

NodeJS: A Primer For Newbies

Node.js is an excellent server-side JavaScript runtime.

Node.js was created by stripping the V8 JavaScript engine out of Chrome and running it as a native application. Thus you can use JavaScript just as you would use Java, Python, or Ruby.

The use of JavaScript is really cool because of a concept called platform equalization.
This means using the same language on both the server side and the client side for data manipulation and presentation. A lot of code can be reused directly, since web browsers come with JavaScript by default, and you gain by using it on the server side as well because data can be used directly without having to convert it between a variety of formats.

JSON, or JavaScript Object Notation, is a form of serialization used to transfer data back and forth between server and client. It is slowly starting to replace XML, because it has a cleaner structure, takes up fewer characters, and integrates directly with JavaScript (since it is JavaScript object notation).

JSON follows a structure of a dictionary, or key-value mappings.
e.g.   { "key1": "value1", "key2": "value2" }
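For a quick feel of serialization, here is the same structure round-tripped with Python's json module; in Node.js the equivalents are JSON.parse and JSON.stringify:

```python
# Serialized text <-> native dictionary, both directions.
import json

text = '{"key1": "value1", "key2": "value2"}'
obj = json.loads(text)        # serialized text -> dictionary
print(obj["key1"])            # → value1
print(json.dumps(obj))        # dictionary -> serialized text
```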

JavaScript is event-driven and asynchronous. It runs in a single thread, but handles operations asynchronously through callback functions.

A callback function is any function that is to be executed after the completion of something. Usually in JavaScript, we have events that trigger callback functions to be executed. For example, a server waiting for a request from a client will trigger an event that a request has arrived, and will pass on the request to a callback function.
The advantage of this is that the initial method can go back to listening for requests. Thus it doesn't get blocked from executing. 
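The shape of that pattern can be sketched in a few lines (Python here for consistency with the rest of the blog; JavaScript's version has the same structure). Events queue up, and a loop dispatches each one to its registered callback, so no single handler blocks the others from being picked up:

```python
# A minimal event-loop sketch: register callbacks, queue events, dispatch.
from collections import deque

callbacks = {}
events = deque()

def on(event, fn):              # register a callback for an event name
    callbacks[event] = fn

def emit(event, data):          # queue an event
    events.append((event, data))

def run():                      # the "event loop"
    log = []
    while events:
        name, data = events.popleft()
        log.append(callbacks[name](data))
    return log

on("request", lambda path: "served " + path)
emit("request", "/index.html")
emit("request", "/about.html")
result = run()
print(result)  # → ['served /index.html', 'served /about.html']
```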

The disadvantage of JavaScript is, again, its single-threaded architecture. This means that any computationally intensive task will clog up execution. Thus, to utilize its full potential, JavaScript should be used for tasks that are many in number but individually cheap.

The official Node.js site:

For resources on how to start with Node.js, check the following question on Stack Overflow:

Have fun JavaScripting!

Sunday, September 7, 2014

Experience: A Biased View Of Life

"Have you done this before? We need you to start working right away, you need prior experience..."

Experience is what lets us do stuff effortlessly. Unfortunately, we might not always be experienced at something.

Fortunately, it can be gained by engaging with stuff you haven't done yet. New things always give us a new perspective, and that's what experience is all about. The slight problem, though, is that you might not always have an opportunity to try something new. This implies that you do the same stuff every day, over and over, in a predefined routine. This is good in a way: if you are, for example, working a job, then even though you might not be doing something new, it will add up to your job experience of having a particular skill.

If you want to try out something new, then you have to find a way to break free from your current routine.

Consider the case of playing the video games Assassin's Creed 2 (AC2) and Assassin's Creed Brotherhood. As you play the role of the assassin Ezio, you try so hard to become the best assassin you can be in AC2. But once you complete that game and move on to Brotherhood, all that goes away: Ezio loses his fancy armor and weapons, and you have to start over from scratch. But you have experience from playing the first game. And that itself is extremely useful in starting over.

It's kind of like college life. Working so hard to prove to everyone how hardworking and dedicated you can be is a daunting task. But once you start employment, or join a different university for higher education, it all starts over again. But you have experience to help you out.

So the next time someone asks you about experience, don't be afraid. Tell them that experience only comes through practice, and that only comes through first having an opportunity. Ask them for that opportunity if they are in a position to give it.


Sunday, August 24, 2014

Sleep: A Biased View Of Life

Sleep. It's such an essential part of the human way of life that we tend to overlook its importance.
Why do we sleep? What makes it so important that we spend one third of our life sleeping?

Sleep is a way for the brain to focus on passive maintenance of the body. When we are awake, our brain can only focus on the current activities that need to be addressed. When we sleep, the brain's resources are freed and can be used for other vital activities such as cell regeneration, healing, relaxing stressed body areas, and so on.

Research has shown that sleep deprivation is extremely detrimental to health. Here is an awesome video that shows the effects of sleep deprivation on facial appearance.

But many don't take this so seriously. We feel that sleeping less is synonymous with having a busy life, and having a busy life is a sign of working hard in this socially fast-paced world.

An excerpt I found in the article Bring back the 40-hour work week - Salon:
Research by the US military has shown that losing just one hour of sleep per night for a week will cause a level of cognitive degradation equivalent to a .10 blood alcohol level. Worse: most people who’ve fallen into this state typically have no idea of just how impaired they are. It’s only when you look at the dramatically lower quality of their output that it shows up. 

So, had some good sleep lately? If not, you definitely should. There are so many benefits to getting a good night's rest. You will be more active throughout the day and will have soaring energy levels. Your body will be relatively more fit, and you will most likely be in a better mood. Your concentration will be better, and thus your decision-making and memory skills will be better too.

Have a wonderful sleep time ;)

Saturday, August 23, 2014

The Life of a Techie: A Biased View Of Life

Half the screen in white. The other half in red. A head just half a foot away from the screen. Motionless.

"What in the world are you talking about!" you might exclaim. No, nothing horrifying here.
Welcome to the life of a software professional.

"What's with the red, white, and motionless stuff?" you might wonder. Just the average day for a techie who churns out many bugs and errors as the lines of code exponentially pile up - then stares at the screen as all hell breaks loose and wonders what just happened.

According to Google,
techie (noun) - a person who is expert in or enthusiastic about technology, especially computing.

A Techie is synonymous with any technologist. Or a Developer. Or a Tester. Or an Entrepreneur. Whoever dons the role of a techie, however, will get used to one thing: trying out new stuff and utterly failing, only to try again and again. Until Success. (Maybe.)

This infinite loop of developing new stuff and then maintaining it, especially in software, is termed the software development cycle. Of course, this cycle is applicable to any stream of technology. The thing to remember: this cycle is endless. Create and Maintain. The moment one stage is broken, the cycle breaks and the product fails. Period.

A Techie must have exceedingly high patience and determination. Why? The simple answer: he has to live through multiple iterations of the development cycle. A techie needs determination to finish what he starts and patience to see the maintenance through.

But come to think of it, aren't all humans techies in their own right? Every person is the architect of his or her life, and goes through the development cycle of that life. Everyone fails at some point until they learn from their mistakes and succeed. It all seems to be part of a bigger picture, the grand picture of Life.