Normally, when you search the various Web search engines, you retrieve pages from indexed Web sites. Subject directories such as Yahoo also display these sites. Some people call this the Surface Web. The Deep Web (or Invisible Web) is what you cannot retrieve: content that never appears in response to your search statements, plus of course all the URLs contained in these types of sites. There are seven main divisions here:
1) HTTP is but one of several Internet protocols. There are also FTP (file transfer), E-mail, news, Telnet, and pre-Web Gopher, none of which are searchable by Web search engines.
2) Excluded pages: not every Web site wants to be fully included in search engine reports. The HTML source code provides a mechanism for turning away the spider search bots so that they do not index a particular page. In effect, the page's own code opts its URL out of the index.
3) Databases: many Web sites offer thousands of specialized searchable databases that you can query via the Web. You can get results from these databases, but only in answer to your one specific query; you cannot access the whole database. In fact, the answer pages may not even be stored online, but generated only when a response is needed. "It is easier and cheaper to dynamically generate the answer page for each query than to store all the possible pages containing all the possible answers to all the possible queries people could make to the database" (Berkeley, see below). Thus the search devices cannot find or create these pages. Tabular formats are a bitch to display without appropriate software. Even simple layouts such as crossword puzzles have a display component, and hence are normally not indexed (check out www.ecrostic.com). Databases with tables created by Access, Oracle, SQL Server, and DB2 are accessible only by query. There is a lot of information out there on the Web in databases: content on the Deep Web may be 500 times larger than the normal Google-searchable Web. The 60 largest Deep Web sources contain 84 billion pages of content, about 750 terabytes of information. Top dogs are the US National Climatic Data Center (366,000 GB, 42 billion records); US NASA EOSDIS (219,000 GB, 25 billion records); and the US National Oceanographic Data Center (a mere 33,000 GB, 4 billion records). By contrast, Google indexes only about 6 billion pages. And 95% of the Deep Web is publicly accessible information, not subject to fees or subscriptions. Awesome.
4) For a variety of technical reasons (easy to understand, but long and cumbersome to explain), there are "static" unlinked pages on the Web. These reside on servers, waiting to be retrieved when their URL is used in an HTTP request. But since no other page links to them, spiders never find them.
5) Some sites require a password or login ID, and these sites are closed to spiders. Passworded sites include indexing services, encyclopedias, directories, Lexis and Nexis. In fact, any site that is not free requires a password. There are thousands of such sites, although some will let you in with teasers or partials, such as the Wall Street Journal. In 2005 Yahoo made a small part of the Deep Web searchable by creating Yahoo Subscriptions, which searches through a few subscription-only Web sites.
6) Non-HTML formatted pages: these are files in formats incompatible with HTML; the links to them can be indexed, but not the actual page contents. Search engines have a hard time with Adobe .pdf files (although Google has a reformatting tool), image databases, spreadsheets (.xls), multimedia files, PostScript (.ps), Flash, Shockwave, PowerPoint (.ppt), and even word-processing files (Word .doc, WordPerfect .wp). There is no problem downloading these materials once they are found; the major trick is finding them in the first place!
7) Script-based pages, with a ? (question mark) in their URL: these are particularly devilish for spiders to locate. Most spiders will not follow such URLs because of script problems and, believe it or not, spider traps (scripts that generate an endless supply of URLs for the spider to crawl).
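The page-exclusion mechanism in division 2 can be sketched in a few lines. This is a minimal, hypothetical illustration (using Python's standard html.parser, with made-up sample HTML) of how a spider recognizes the robots meta tag that tells it to skip a page:

```python
from html.parser import HTMLParser

# Minimal sketch: detect <meta name="robots" content="noindex">,
# the in-page signal that turns a spider away. Sample page is made up;
# real crawlers also honor a site-wide robots.txt file.
class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = {k: (v or "") for k, v in attrs}
            if d.get("name", "").lower() == "robots" and \
               "noindex" in d.get("content", "").lower():
                self.noindex = True

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
parser = RobotsMetaParser()
parser.feed(page)
print(parser.noindex)  # True: a well-behaved spider skips this page
```

A spider that honors this flag simply drops the URL from its index, which is why such pages never surface in search results.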
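The Berkeley quotation in division 3 (generating answer pages on demand instead of storing them) can be illustrated with a toy sketch. The table name, data, and HTML template below are all invented for illustration, using Python's built-in sqlite3 module:

```python
import sqlite3

# Sketch of why database content stays in the Deep Web: the answer
# page exists only after a specific query arrives. Table and figures
# are hypothetical stand-ins for a real data center's holdings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stations (name TEXT, records INTEGER)")
conn.executemany("INSERT INTO stations VALUES (?, ?)",
                 [("NCDC", 42), ("EOSDIS", 25), ("NODC", 4)])

def answer_page(query_name):
    # The HTML is built on demand; no static URL for this content
    # ever exists, so a spider has nothing to crawl.
    row = conn.execute("SELECT records FROM stations WHERE name = ?",
                       (query_name,)).fetchone()
    return f"<html><body>{query_name}: {row[0]} billion records</body></html>"

print(answer_page("NCDC"))
```

Every distinct query would produce a different page, which is exactly why it is cheaper to generate them than to store them all.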
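The question-mark URLs of division 7 are easy to recognize mechanically. A rough sketch, using Python's standard urllib.parse and a hypothetical example URL, of how a spider spots a script-generated page:

```python
from urllib.parse import urlparse, parse_qs

# Sketch: a "?" in a URL marks script-generated content. Many spiders
# skip such URLs to avoid spider traps (scripts that spin out endless
# parameter combinations). The URL below is a made-up example.
url = "http://example.com/search.cgi?topic=climate&page=3"
parsed = urlparse(url)
is_dynamic = bool(parsed.query)   # True whenever a query string is present
params = parse_qs(parsed.query)   # {'topic': ['climate'], 'page': ['3']}
print(is_dynamic, params)
```

Since the parameter values can vary without limit, a cautious spider treats every such URL as potentially bottomless and leaves it unindexed.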
The basic Invisible Web, of course, comprises the various Intranets put up by businesses, governments, and universities. These are locally connected Web sites meant for just the organization's use; sometimes passwords are required. All manner of documents, many unclassified, are posted: terabytes of information. And they are a major concern for internal security, since they can be hacked and also accessed by rogue employees. There is no outside index to these sites, since they are purely local, and all are hidden behind firewalls. I cannot tell you how many times people have told me that a particular document is on a Web site ("just go over and follow the links"), only to find out that it is on their Intranet and hence inaccessible to me. Actually, I can tell you: about a score of times.
Also, "dynamically changing new information" will be part of the Invisible Web. This includes news, job postings, travel data (airline flights, hotels, etc.), and stock market postings.
How do you find the Deep Web? One way is through academic search tools such as Infomine, Librarians Index, and AcademicInfo. You could try Direct Search at www.freepint.com/gary/direct.htm. There are also www.profusion.com and www.completeplanet.com. Another way is through your usual search engine. Just type in a short subject term with the word "database" (e.g., biomedical database). If the database's own pages include the word "database", then bingo! (Bob's your uncle?). If you drill through a directory such as Yahoo, then be sure to also use the term "database": this will pick up additional listings. Many search engines feature searchable databases as part of their service. Google, for example, has separate searches for audio-visual material, images, news, and non-HTML formats. These are just one click away from the main HTML search.
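The "add the word database" trick above amounts to building a slightly longer query string. A small sketch, using Python's standard urllib.parse; the Google endpoint format shown is an assumption for illustration:

```python
from urllib.parse import quote_plus

# Sketch of the search trick: append "database" to a short subject
# term and URL-encode the result. The /search?q= endpoint is assumed
# here for illustration.
subject = "biomedical"
query = quote_plus(f"{subject} database")
url = f"http://www.google.com/search?q={query}"
print(url)
```

The same two-word pattern works when drilling through a directory: the extra term "database" is what surfaces the searchable-database listings that a bare subject term would miss.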
Some interesting Deep Web sites include:
* AnimalSearch (animalsearch.net): family-safe animal-related sites; search by group, type, and region.
* On-Line Encyclopedia of Integer Sequences (www.research.att.com/~njas/sequences): "Type in a series of numbers and this database will complete the sequence and provide the sequence name, along with its mathematical formula, structure, references, and links."
* MagPortal.com (magportal.com): freely available magazine articles on the Web, using keyword searching or category browsing methods.
* Directory of Open Access Journals (www.doaj.org): a one-stop open access directory, providing no-cost access to the full text of over 2,000 journals, with over 500 journals searchable at the article level (over 83,000 articles available) in the sciences and humanities/social sciences.
* Cryptome (cryptome.org): specializes in posting previously classified or under-publicized US federal documents, along with similar documents from other jurisdictions. There could be half a dozen posted every business day; just go over to the site, and the home page lists the latest docs. Typical titles include "Expansion of the Strategic Petroleum Reserve", "Calendar of 2,482 US Military Dead in Iraqi War", "Security Measures for Radioactive Materials", "Outer Continental Shelf Polluters Fined", and "CIA Creation Documents". There is also an index to off-site documents, dealing with topics such as the Israeli lobby and US foreign policy, Al Qaeda documents, and New York City public safety.
For the immediate future, you should expect a big impact from two sources. One is the court system. The Canadian Judicial Council (the organization of Canada's top judges) has recommended that access to court records via the Internet be restricted. Many of these records may move over to local intranets and never be accessible via the open Internet. You'll soon have to visit your nearest courthouse to view legal documents, much as you have to now just to view the paper copies.
Another is change of ownership. While most of the databases within the Deep Web are government-owned or non-profit, there are still vast areas, such as E-mail and FTP, which are in private hands. Every time someone buys an Internet property, there are policy changes. What should we expect from the newest batch of dot-com purchases by the media itself? How will this play out for data searching? NBC Universal has bought iVillage, the top women-oriented site on the Internet, with over 30 million unique visitors a month. News Corp (Murdoch) has bought MySpace, the fastest-growing social networking site on the Web, with about 50 million unique visitors a month. News Corp also bought IGN, a top gaming and entertainment site for young hot males, with under 20 million unique users a month. Viacom (owner of MTV and Paramount) has bought Neopets, a young person's community site with virtual pets. Viacom has also bought iFilm (where users track the film industry and post their own videos), GameTrailers (a competitor to IGN, with more hot males), and GoCityKids (via Nickelodeon). The New York Times has bought About.com, an online advice site with over 60 million unique users a month.
Other hot properties appear to be photo- and video-sharing sites. Murdoch still has $2 billion earmarked for such purchases, coming up real soon. The big audiences of all the new acquisitions can link to each other within and without their communities. And they could be susceptible to database searching by the new owners, or positioned for a sell-off of contents to database searchers.
For more details on the Invisible Web and the Deep Web, try these URLs: