The Library of Congress, the de facto national library of the United States, has about 35 million cataloged books. Assuming each book has 64,000 words on average, we are looking at around 2,240,000,000,000 or – save your breath from counting – 2.24 trillion words. If an average adult, with a reading speed of 250 words per minute, were to read every single book in the Library of Congress, it would take that person around 9 billion minutes, which means roughly 150 million hours, over 6 million days or simply put…… more than 17,000 years!
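The estimate can be checked with a few lines of back-of-the-envelope arithmetic, using the same figures as above:

```python
# Back-of-the-envelope reading-time estimate for the
# Library of Congress collection (figures from the text above).
books = 35_000_000          # cataloged books
words_per_book = 64_000     # assumed average words per book
reading_speed = 250         # words per minute for an average adult

total_words = books * words_per_book    # 2.24 trillion words
minutes = total_words / reading_speed   # ~9 billion minutes
hours = minutes / 60
days = hours / 24
years = days / 365

print(f"{total_words:,} words")  # 2,240,000,000,000 words
print(f"{years:,.0f} years")     # ~17,000 years
```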
In a parallel world, the World Wide Web - or WWW - currently comprises well over a billion websites. Each website is like a book, each webpage like a page in that book, and the WWW is one big library. Assuming there are at least 5-6 pages to a website, with each webpage having around 300 words, we are looking at…… well, you get the point. This isn’t about the math. The data we are talking about is of gargantuan proportions. Mind you, this is considering just the surface web, the portion of the web accessible to the general public and the part of the web that actually matters to you and me.
The current world replicates the Library of Congress in gigantic proportions. Data is no longer constrained to a fixed location or form. It is not stored just in hardbound books anymore; it is intangible. The libraries of the past, which proudly housed pearls of wisdom, no longer hold the elite distinction of being data containers; the World Wide Web has taken over knowledge sharing with a fervor that was unforeseen and will be unmatched for eons to come.
Imagine a person asking the chief librarian to retrieve a specific book, no wait, a specific page out of a book from the entire library based on a few words of description. The chief librarian would have to have read every book and remembered every single detail to achieve a feat like that. Let us add a level of difficulty: imagine the person gives only 5 seconds to perform this task. I am sure the chief librarian would opt for a less demanding job, like being the librarian at the smallest library in the world, based out of Cardigan, Canada. I hope this builds enough context to introduce you to the star of the moment - the Search Engine.
Search Engines - What exactly do they do?
Search engines do what the chief librarian of the Library of Congress could never do: retrieve a particular webpage for you out of that enormous stack in the blink of an eye. While purists would stick to technical terms and geeky connotations, I cannot help but notice the human side of search engines; or at least, noticing it helps me understand and relate to them better. To the untrained eye, understanding search engines may seem beyond imagination, but their mere functioning is so fascinating that, for me, it is one of the greatest marvels of human creation and a concept every person should fall in love with. I will refer to the Google Search Engine in this piece, though, to be honest, after all the hours spent reading about it, I have learnt only a very small fraction of how it works.
The chief librarian for Google is called Googlebot, colloquially known as ‘Spider’. Spider spends all its time reading through every page in the ever-expanding WWW and retaining every word of it. Every last word of it. This process of reading through the WWW is called ‘crawling’, and the process of remembering and retaining everything it reads by storing information in its personal diary (the search engine database) is called ‘indexing’. Seen from a human angle, though, Spider is simply a reader with an extremely voracious speed and enough patience to note down whatever it reads. Straight out of a DC superhero comic book. Spider not only remembers the words on each page, it also understands what the page is talking about. It goes a level beyond to evaluate its experience while reading through the page. And yes, if you make it read in unfavorable conditions (meaning a badly designed webpage), Spider will hold that against you.
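The crawl-and-index loop Spider runs can be sketched in miniature. This is a toy model, not Googlebot’s actual code: the tiny in-memory ‘web’, its page contents, and the `crawl_and_index` helper are all invented for illustration, but they show the two steps - follow links to read every reachable page (crawling), then record which pages contain each word (indexing):

```python
from collections import defaultdict

# A toy "web": page name -> (text, links to other pages).
# These pages and links are invented for illustration.
WEB = {
    "home": ("find books and maps here", ["books", "maps"]),
    "books": ("books hold pearls of wisdom", ["home"]),
    "maps": ("maps of the world wide web", ["home", "books"]),
}

def crawl_and_index(start):
    """Visit every reachable page (crawling) and record which
    pages contain each word (indexing)."""
    index = defaultdict(set)      # word -> set of pages containing it
    to_visit, seen = [start], set()
    while to_visit:
        page = to_visit.pop()
        if page in seen:          # never re-read a page in this pass
            continue
        seen.add(page)
        text, links = WEB[page]
        for word in text.split(): # "remember every last word"
            index[word].add(page)
        to_visit.extend(links)    # follow links to discover new pages
    return index

index = crawl_and_index("home")
print(sorted(index["books"]))  # → ['books', 'home']
```

The real thing differs in scale, not in kind: the diary holding the word-to-page mapping is exactly the search engine database described above.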
Gathering, Understanding and Evaluating Information
Spider is a hard customer to please and its standards are very high. You need to have won its trust, met its expectations and overall made sure it likes your website. Take a moment to think about this again: Your childhood friend is visiting your city and asks you for the best pizza place; would you not recommend a place that you trust, a place that has always met your expectations and overall a place that you love going? Now think about this - is Spider doing anything different than a human would? Spider considers you its best friend and it wants to make you happy by always delivering what you want.
Also consider this: you take your childhood friend to this pizza place yourself. How excited would you be to see your friend try out your choice? Would you not watch his or her face with barely contained excitement as they eat the pizza? Would you not take note of every aspect of their body language and process it to figure out whether they liked it or not? How would you feel if they walked out after having just one bite? If this happened more than once, would you bring other friends to the same place again? Spider displays these behaviors all the time! It accompanies you perpetually while you wade through the WWW. It remembers how you behave on the internet; it figures out whether you liked a page it showed you, and based on your reaction it decides whether the page is good enough to be shown to its other ‘friends’. It is this ability of Spider to ‘think’ that makes it more of a human than just a bot for me.
How do you get what you are looking for?
Spider has an associate (the Ranking Algorithm) that helps it serve you the information you search for. Spider has taught this associate the order in which to bring you information. The beauty of Spider and team is their swift reflexes and nimble toes. The moment you type a query into the search engine and hit the return/enter key, the associate rushes to the personal diary (the search engine database). It searches through the pages it ‘remembers’ and ‘trusts’, shortlists the relevant options it could show you, weighs them against each other based on over 200 parameters, ranks the best 10 options and serves them to you on a platter. All in the blink of an eye. The process is set.
Basically, over and above crawling and storing information about every webpage (the ones available for crawling), Spider and team assign a score to each page based on over 200 parameters, which helps in ranking and delivering results when a search query is fired. The motto of Spider and its associate is simple: crawl, index, deliver, repeat. Non-stop and relentless.
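The score-then-rank step can be sketched as follows. The page names, the signals (‘trust’, ‘speed’) and their weights below are all invented for illustration; the real algorithm weighs hundreds of signals whose exact details are not public. The shape of the process is the point: shortlist the relevant pages, score each one, sort, serve the top results.

```python
# Toy ranking: each page carries pre-computed quality signals, and a
# query-time relevance score decides the final order. All names,
# signals and weights here are hypothetical.
PAGES = {
    "pizza-guide": {"text": "best pizza places in the city", "trust": 0.9, "speed": 0.8},
    "pizza-blog":  {"text": "my pizza adventures and places", "trust": 0.5, "speed": 0.6},
    "car-news":    {"text": "latest car news", "trust": 0.8, "speed": 0.9},
}

WEIGHTS = {"relevance": 0.6, "trust": 0.3, "speed": 0.1}

def score(page, query_words):
    """Blend query relevance with the page's stored quality signals."""
    words = page["text"].split()
    relevance = sum(w in words for w in query_words) / len(query_words)
    return (WEIGHTS["relevance"] * relevance
            + WEIGHTS["trust"] * page["trust"]
            + WEIGHTS["speed"] * page["speed"])

def search(query, top=10):
    query_words = query.lower().split()
    # Shortlist: keep only pages mentioning at least one query word.
    shortlist = [name for name, p in PAGES.items()
                 if any(w in p["text"].split() for w in query_words)]
    # Weigh the options against each other and rank the best ones.
    ranked = sorted(shortlist, key=lambda n: score(PAGES[n], query_words),
                    reverse=True)
    return ranked[:top]

print(search("best pizza places"))  # → ['pizza-guide', 'pizza-blog']
```

Note how the pizza page you ‘trust’ outranks the one that merely mentions pizza: that is the chief-librarian judgment from the analogy above, reduced to a weighted sum.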