In Part 1 of this two-part post we were introduced to the problem of finding the ten most commonly occurring words across all the English-language books in a country. We identified a number of practical difficulties in answering this question with conventional approaches, the biggest ones being:
- How to estimate how big a library would be required (or how many reasonably sized libraries), given that the number of books in a country must be a really large figure, and that new books get added every day.
- How to continuously pay for the huge amount of labour and skills involved in carrying books to a set of individuals, getting the English ones picked out and read, and getting all the words in them counted and then sorted in order of occurrence.
- How to figure out how many people to bring together to process all the books.
What Hadoop does, in effect, is solve the problem by breaking the processing down into smaller pieces, each of which is easily affordable, and whose number can be increased incrementally as required.
First of all there's the question of storage. It would be virtually impossible to construct a building that could keep accommodating all the books that already exist plus the new ones that will keep coming in. Even if we were to build a large building to start with and then add more buildings as required, it's still a pretty expensive proposition. In our Hadoop solution, we'd instead start with an ordinary bookshelf, the kind that is readily available and far less intimidating to procure than the infrastructure for a library. We could have a number of bookshelves (and CDs or flash drives for audiobooks and digital books) to start with, and keep adding new ones if and when needed. Hadoop similarly works with ordinary commodity hardware, and more such hardware can be added as and when required. This means there's no need to worry up front about estimating the cost and size of the largest possible servers, either for storage or for processing.
Next, extracting and carrying books to the people who would have to read them is pretty expensive. In our Hadoop analogy, what we'd do instead is station those people at each bookshelf, where they'd pick out the English books and read each one right there. As new bookshelves are added, new people to read them would be added at the same time. While reading, each person writes down every word they read, word for word. Working locally like this eliminates the need to haul all the books in a vehicle to readers located elsewhere. All these locally produced notes would then need to be sorted, so that a count of each word could be obtained. After that, all the notepads would be brought together for one last step: consolidating and collating the contents of the individual sorted lists. Bringing in notepads of summarized content is a much easier task than bringing in all the books themselves.
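The bookshelf-and-notepad process above is essentially Hadoop's MapReduce model: count locally at each shelf, then merge the small summaries. For readers who'd like to see it concretely, here is a minimal sketch in plain Python (not Hadoop itself); the shelf contents and function names are made up purely for illustration:

```python
from collections import Counter

# "Map" phase: each reader counts words in the books on their own
# shelf, locally, producing a small per-shelf tally (the "notepad").
def count_shelf(books):
    counts = Counter()
    for book in books:
        counts.update(book.lower().split())
    return counts

# "Reduce" phase: merge the per-shelf notepads into one combined
# tally, then read off the most common words.
def top_words(shelf_counts, n=10):
    total = Counter()
    for counts in shelf_counts:
        total += counts
    return total.most_common(n)

# Two toy "bookshelves", each processed where it stands.
shelf_a = count_shelf(["the cat sat on the mat", "the dog barked"])
shelf_b = count_shelf(["the cat and the dog"])

# Only the two small tallies travel, never the books themselves.
print(top_words([shelf_a, shelf_b], n=3))  # 'the' tops the list with 5
```

Note that only `count_shelf` touches the raw text; `top_words` works entirely on the compact per-shelf summaries, which is the analogue of shipping notepads rather than books.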
Reading the top ten entries on the list would give us the answer to our question. Having that list would also enable other kinds of questions to be asked if required.
So, in a nutshell, what Hadoop does is break the total volume of data down into smaller lots that can be stored on commodity hardware. It then has each lot partially processed locally, avoiding the need to move large volumes of data back and forth across a network between traditional centralized storage and processing servers. It prepares partially summarized (reduced-volume) output in steps, and then makes the results available for further querying. It can scale up incrementally to handle large volumes, and it can accept many kinds of data because it doesn't actually store it relationally.
There's more that can be said, but this should present an adequate beginning to help a non-technical leader understand the concept of how Hadoop works.