Q: I use Google for everything! But I’ve heard there’s stuff out on the web that they can’t find. What’s up with that?
Due to its intuitive single search box and ever-improving algorithms, it’s easy to rely 100% on Google when it comes to finding information online. You simply type in a few words to instantly view the webpages that have been identified as the most relevant to that search.
The beauty of Google lies in its automation. No employee has to manually add new websites to the database; Google’s web crawlers are programmed to process the code of enormous numbers of webpages and follow every link they encounter, adding new pages to Google’s index as they go.
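To make the "follow every link" idea concrete, here is a minimal sketch of a crawler in Python. It is not how Google's crawlers are actually built. The tiny in-memory `TOY_WEB` dictionary, its example.com URLs, and the `crawl` function are all made up for illustration, and a real crawler would fetch pages over the network instead of looking them up in a dictionary.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML source (invented for this sketch).
TOY_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">Home</a>',
    "http://example.com/orphan": "No other page links here.",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, web):
    """Breadth-first crawl: index each reachable page, then queue its links."""
    index = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in index or url not in web:
            continue  # already indexed, or the page does not exist
        html = web[url]
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(parser.links)
    return index


index = crawl("http://example.com/", TOY_WEB)
print(sorted(index))
# ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

Notice that the "orphan" page never makes it into the index: no other page links to it, so a crawler that discovers pages only by following links has no way to find it.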
These web crawlers do a great job, at least for publicly accessible webpages, which are known collectively as the visible web. But it’s important to understand that there’s more out there. A lot more.
The Deep Web Explained
Anything on the web that cannot be found by search engine bots is part of something known as the “deep web” or “invisible web.”
The two biggest categories of the deep web are private resources (content that requires a login and password) and dynamic pages (pages generated on the fly, usually in response to user input).
This makes sense. First, many content providers do not allow free access to their material but still want it available online for those who pay. The login screen is a dead end for web crawlers, so they cannot index any of the content that lives behind it.
Second, because the possible search terms in something like an online catalog or other database are practically unlimited, a search engine can’t (and wouldn’t want to) index every combination.
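A rough Python sketch shows why dynamic pages are so hard to index. Everything here is hypothetical: the three-title `CATALOG`, the `results_page` function, and the 10,000-word vocabulary are invented for illustration and do not describe any real catalog.

```python
from html import escape

# Hypothetical toy catalog (titles invented for this sketch).
CATALOG = ["Deep Web Basics", "Searching the Web", "Library Research Methods"]


def results_page(query):
    """Build a results page on the fly; nothing exists until someone searches."""
    hits = [title for title in CATALOG if query.lower() in title.lower()]
    items = "".join(f"<li>{escape(title)}</li>" for title in hits)
    return f"<h1>Results for {escape(query)}</h1><ul>{items}</ul>"


def count_queries(vocabulary_size, max_words):
    """Count distinct ordered queries of 1 to max_words words."""
    return sum(vocabulary_size ** n for n in range(1, max_words + 1))


print(results_page("web"))
print(count_queries(10_000, 3))  # 1000100010000
```

Even with an assumed vocabulary of only 10,000 words and queries capped at three words, there are more than a trillion possible result pages, and none of them exists until a user types the query.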
Staggering Statistics: Deep Web vs. Visible Web
According to one study, back in 2006 Google had indexed 25 billion pages. In contrast, the deep web contained some 900 billion pages! An astonishing statistic and a strong reminder that although Google is becoming more and more powerful, it still has its limits. (Source: NYTimes Bits Tech Talk Podcast)
There are also other types of deep web resources, including websites that don’t have any links pointing to them, certain file formats that search engines can’t handle, and sites whose webmasters have intentionally blocked crawler access for various reasons.
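One common way webmasters block crawlers is a robots.txt file, which well-behaved crawlers check before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module; the robots.txt rules, the "ExampleBot" name, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is an invented crawler name.
public_ok = parser.can_fetch("ExampleBot", "http://example.com/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://example.com/private/report.html")

print(public_ok)   # True
print(private_ok)  # False
```

Keep in mind that robots.txt is a request, not a lock. It keeps polite crawlers away, but it does not protect content the way a login does.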
A Deep Web Example
A good example of the deep web is the body of resources available on a library website. When you search the online library catalog, for example, it queries a database and generates a list of results on the fly, a type of page that Google or any other search engine cannot index.
But more importantly, when you search the site for an article you need to read, you pass through as an authenticated user of the library. Since the library has paid to provide that article for you, the full text will be accessible. The URL you see in your browser’s address bar might look ordinary enough, but an unaffiliated searcher who types the article title into Google will not be able to read the article.
This situation often causes problems, because the library tries to make the integration as seamless as possible, and many library users do not realize they are accessing a “deep web” subscription resource.
In an era of rising costs and more and more knowledge hiding behind subscription logins, it’s becoming increasingly clear that Google does not find everything, no matter how quickly it throws back a tidy list of results for any query.
When you’re thinking about finding information on the web today, remember that Google still can’t solve all your problems!