The X' (ML) Files - Finding Data in the Deep Web

With so much data being created daily (161 exabytes in 2006 alone, according to IDC; one exabyte is a 1 followed by 18 zeros, counted in bytes), finding complete answers to any search requires the ability to query textual and database information anywhere, internally or on the Web. The problem is that most of that data isn’t contained in a static webpage but in a database whose contents are displayed only in response to a user action.

“In 2000/2001 we did some analysis and realized that the quantity of documents from these deep web databases was far bigger than what everyone was calling the Internet,” said Jerry Tardif, vice-president of BrightPlanet Corp., a search firm headquartered in Sioux Falls, S.D.

You can’t just plug some search terms into Google to access all this data. It requires the use of a federated search tool.

“Google makes search look simple, but in fact search is not simple, particularly when completeness is important,” said David Fuess, a computer scientist in Lawrence Livermore National Laboratory’s Nonproliferation, Homeland and International Security (NHI) directorate. His team uses BrightPlanet’s Deep Query Manager (DQM) to look for information on end users of export-controlled goods that might have military uses.

“To be effective you must strike a proper balance that maximizes the probability that the information you seek is in the results and that the results can be reviewed within the response time allowed.”

Federated Search

Traditional Web searches, or searches against an organization’s own content, consist of building a database of the words in those documents and then running a query against that database. Federated search is the ability to execute a query against multiple databases at the same time.

Internally, for example, a user could run a query on a customer that would turn up both the invoices contained in the finance system and that customer’s contract contained in the document management system. Expanding its scope to outside sources, the federated search engine could also pull up the most recent stock quote from Dow Jones, the bond rating from Moody’s and the customer’s latest filings with the Securities and Exchange Commission.
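The customer example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any product named in this article works: each "system" is a small in-memory word index, and the names build_index, search_index and federated_search, along with the sample records, are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_index(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


def federated_search(indexes, query):
    """Run the same query against every named index concurrently."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(search_index, index, query)
                   for name, index in indexes.items()}
        return {name: sorted(f.result()) for name, f in futures.items()}


# Hypothetical sample data standing in for two separate internal systems.
finance = build_index({"inv-1": "Invoice for Acme Corp",
                       "inv-2": "Invoice for Globex"})
contracts = build_index({"doc-9": "Acme Corp master services contract"})

results = federated_search({"finance": finance, "contracts": contracts},
                           "acme corp")
print(results)
```

A single query returns hits grouped by the system they came from. A real deployment would replace the in-memory indexes with calls to each system's own search interface.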

Creating a federated search engine is more than just a matter of installing some software. “IT staff needs to understand that this is not a trivial undertaking,” said Abe Lederman, president of Deep Web Technologies, which develops the Explorit federated search software. “It is very unlikely that this is something an IT person can just purchase a copy of it, set it up and run it.”

The first step is surveying what resources are available to be searched. This is relatively simple when dealing with data the organization owns, but gets far more complex when locating outside sources. No one knows exactly how many publicly available databases exist on the Internet, but the CompletePlanet directory has a searchable and browsable list of more than 70,000 online databases and specialty search engines.

“If an agency is federating search on their own databases, they generally know what they have, where it is, and the type of information that is in there,” said BrightPlanet’s Tardif. “But if they are doing something on the outside, they need subject matter expertise on what public sources are available.”

Once you have selected the databases to include in the search, there is the matter of creating links and writing the code needed to execute the query on each of those databases. This can include writing appropriate login scripts. These scripts need to be checked regularly and updated whenever the structure of an underlying database changes. A final step is to refine the user interface that aggregates the data from these different sources and presents it to the end user.
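The per-database work described above (a login script, a query routine and an aggregation step) might be organized along the following lines. This is a minimal sketch, assuming each source can be wrapped in a small connector object; the Connector class, the aggregate function and the sample records are hypothetical and are not drawn from DQM, Explorit or any other product mentioned here.

```python
from dataclasses import dataclass, field


@dataclass
class Connector:
    """Hypothetical wrapper for one source: a login step plus a query step."""
    name: str
    records: list                 # stand-in for the remote database
    requires_login: bool = False
    logged_in: bool = field(default=False, init=False)

    def login(self):
        # A real login script would submit credentials and keep a session.
        self.logged_in = True

    def query(self, term):
        if self.requires_login and not self.logged_in:
            raise PermissionError(f"{self.name}: login required")
        term = term.lower()
        # Assumes every record has a "title" field; fails if the schema changes.
        return [r for r in self.records if term in r["title"].lower()]


def aggregate(connectors, term):
    """Query every source, tag each hit with its origin, and record failures."""
    merged, failures = [], {}
    for conn in connectors:
        try:
            if conn.requires_login:
                conn.login()
            for record in conn.query(term):
                merged.append({"source": conn.name, **record})
        except Exception as exc:  # e.g. a renamed field or a rejected login
            failures[conn.name] = repr(exc)
    return merged, failures


# Hypothetical sources; the third simulates a database whose structure changed.
sources = [
    Connector("filings", [{"title": "Acme Corp annual report"}],
              requires_login=True),
    Connector("ratings", [{"title": "Globex bond rating"}]),
    Connector("news", [{"headline": "Acme Corp opens new plant"}]),
]

merged, failures = aggregate(sources, "acme")
print(merged)
print(failures)
```

Collecting failures separately, rather than letting one broken source abort the whole search, is one way to notice that a script needs updating after a database changes while still returning results from the sources that work.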

Roy Tennant, the User Services architect for the California Digital Library (the group that provides centralized digital access to the collections of all University of California campuses, as well as hundreds of other databases), found that an off-the-shelf product didn’t provide the needed functions without extensive customization.