Simon Bisson 

Appliance of Google’s science

Simon Bisson looks at how familiar search techniques are brought back into the corporate network
  
  


Modern businesses thrive on electronic information. But as time passes, this information piles up in server directories, effectively locked away for good: no one can remember what HV365-1093.DOC actually contains, and nobody really has the time to look. But what if it were the text of a document that won substantial amounts of business, or the original memo that defined the company's current business model?

Keeping track of what is in all those megabytes of files is a complex task. Not everyone wants to use a document management system, checking their work in and out, and manually filling in forms full of metadata. However, there's value in understanding what information is available. It's time to bring another web technology back inside the firewall.

The Google search engine now accounts for a large percentage of all the searches carried out on the web. What businesses need is an enterprise Google of their own, able to crawl through file systems and intranet servers. Armed with an index and a set of search forms, anyone should be able to find anything they need.
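The crawl-and-index idea can be sketched in a few lines of Python. This is only an illustration of a basic inverted index, not how Google's or any other vendor's product works; the function names and the sample documents are invented for the example, and the crawl helper assumes plain-text files only.

```python
import re
from collections import defaultdict
from pathlib import Path

TOKEN = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())

def crawl(root):
    """Read every .txt file under root (a hypothetical file share) into a dict."""
    return {str(p): p.read_text(errors="ignore") for p in Path(root).rglob("*.txt")}

def build_index(docs):
    """Map each term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return index

def search(index, query):
    """Return the documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Invented sample documents standing in for a crawled directory.
docs = {
    "HV365-1093.DOC": "Original memo defining the company business model",
    "pitch.txt": "Text of the proposal that won new business",
}
index = build_index(docs)
```

With this index, search(index, "business model") picks out the cryptically named memo without anyone having to open it. A real system would add ranking, access control and parsers for binary formats on top of this.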

It's not surprising that Google is already in this market with the Google Search Appliance, a standalone search tool that will search any web servers in a business's network. Two hundred different file types can be indexed, including output from common productivity suites such as Microsoft Office. Bright yellow single rack units, Google's appliances can index between 150,000 and 15m documents, depending on the configuration purchased.

Companies such as National Semiconductor are using Google appliances to manage large-scale information intranets, where large numbers of servers spread across the world contain hundreds of thousands of documents. E-government is also getting results from implementing Google's yellow search systems, including the much-praised San Diego implementation. Here a $20,000 investment in a single server opened the city's information stores to the general public as well as intranet users.

Solutions such as Google's can be expensive, though not as expensive as implementing a Verity or Autonomy search solution. At the high end of search technology, both systems offer advanced document search based on sophisticated Bayesian search techniques. While Google's tools are restricted to working with web servers, Autonomy's enterprise search tools can search file systems, databases, email servers and document stores. This opens up a larger selection of documents, though a lot of work is required in choosing which servers are to be searched, and how information is to be delivered.

Companies that have standardised on Microsoft technologies can take advantage of the extended search tools built into SharePoint Portal Server. These can crawl file systems, web servers and mail servers (including Lotus Notes). Developers can extend SharePoint, adding tools for searching other data sources through a standard set of interfaces. While the search facilities in SharePoint are similar to those offered by most vendors, end users can also subscribe to searches, and are notified if the results change.
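The subscription idea is easy to picture in code. The sketch below is a generic illustration in Python and makes no claim about SharePoint's actual interfaces: the SavedSearch class, the simple_search helper and the sample corpus are all hypothetical. Each saved search remembers its last result set and reports a change only when a later run differs.

```python
class SavedSearch:
    """A query a user has subscribed to (hypothetical, not a SharePoint API)."""

    def __init__(self, user, query):
        self.user = user
        self.query = query
        self.last_results = None  # no baseline until the first check

    def check(self, run_query):
        """Re-run the query; return a notification dict if the results changed."""
        current = set(run_query(self.query))
        previous = self.last_results
        self.last_results = current
        if previous is None or current == previous:
            return None  # first run, or nothing new to report
        return {
            "user": self.user,
            "query": self.query,
            "added": sorted(current - previous),
            "removed": sorted(previous - current),
        }

def simple_search(corpus):
    """Build a toy search function: a document matches if it contains
    every query word as a substring (case-insensitive)."""
    def run(query):
        terms = query.lower().split()
        return {
            name
            for name, text in corpus.items()
            if all(term in text.lower() for term in terms)
        }
    return run

# Invented sample corpus.
corpus = {"a.doc": "Quarterly sales report"}
subscription = SavedSearch("alice", "sales")
first = subscription.check(simple_search(corpus))   # establishes the baseline

corpus["b.doc"] = "Sales forecast for next year"
second = subscription.check(simple_search(corpus))  # reports b.doc as added
```

In practice the check would run on a schedule after each crawl, and the notification would be delivered by email or on a portal page.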

There are also open source enterprise search solutions. One free, GPLed search tool is ht://dig. This will index web documents across a multi-server intranet, though external parsers are required to index other document types. Another option, though not fully open, is the combination of the Glimpse index engine and the Webglimpse front end. Out of the box this can index a wide selection of web-accessible files, including Word, Excel and PDF. While a licence fee is required for commercial installations, all the search engine code is available to users.

One of the biggest problems facing organisations is the number of different systems they want to index. Compatibility is a key issue, as information will be stored in many different formats, and inside as many different applications. Any enterprise search system will need to be able to winkle information out from the tiniest crevices, and present it in a readable form to an indexing engine.

If enterprise search is going to be at the heart of a knowledge management strategy, then appropriate investment will be needed. Full access to all files does not come cheap, but good results can be achieved just by improving access to common document types. Once a system has been widely accepted, it is time to dig out legacy information from obsolete stores, convert it to current formats and make it available to the index.

Links

Google Search Appliance:

www.google.com/appliance

Autonomy:

www.autonomy.com

Verity:

www.verity.com

Webglimpse:

www.webglimpse.net

ht://dig:

www.htdig.org

Microsoft SharePoint:

www.microsoft.com/sharepoint

 
