Yahoo, Google, and MSN hold a huge lead in search engine technology over open-source alternatives. These search giants are battling among themselves to be a computer user’s default search site.
Where can a computer user go to find an adequate open-source alternative to mainstream engines? Choices appear to be limited. A few established open-source projects provide corporate IT managers with some additional choices; however, a new offering from the founder of Wikipedia may soon change the search engine landscape.
Finding essential information with the fewest keyword refinements is a challenge for both the searcher and the search engine company. Searching for information online and within local storage drives is an integral part of the workflow.
The need for an open search tool that can catalog and retrieve data stored within the user’s network, as well as find information on the Internet, creates room for innovation from open-source projects. However, few alternatives exist today in open-source search engine technology.
“The difference in using Google or Yahoo is the ability to search inside my firewall or search privately. You can buy a proprietary product [for intranet searching], but very few open-source search engines are in use,” David Christian, chief technical officer of Mindbridge, told LinuxInsider. Mindbridge is a provider of business process outsourcing (BPO) services.
Some critics of existing search engine products say there is a growing need for alternatives to the proprietary search companies and the big business associated with sponsored information and ad revenue from search results. A few innovators are conducting a quest for new search engines and an alternative to the influences of ranking done by proprietary search platforms.
Take the experience of Matt Burkhardt, chief executive officer of Impari Systems, as an example of the growing user need for new search engine options. Impari Systems is a startup focusing on bringing open-source software to schools.
Burkhardt is unhappy with the results of his efforts to distribute his information through Google news feeds. He put out two press releases only to find that soon after posting, they disappeared. Even worse, his notices seemed to be replaced with competing information that was two years old.
That experience and others convinced Burkhardt that search is broken on the Internet. He is hoping that something better comes along.
“Existing open source caters to [a] vertical market. We need something more mainstream,” he told LinuxInsider.
Search engines such as Google, Yahoo, and MSN differ in their methodologies and search algorithms. The technology is mostly secret, given the proprietary nature of these platforms.
Preferences for one search engine over another sometimes reach fanatic status, as users rely on a favorite search platform to find content. One of the leading search product alternatives, according to Mindbridge’s Christian, is Apache Lucene.
Most open-source searching involves a component embedded into a larger project, he noted. Similarly, most of the open-source projects using full-text search are built with Lucene as the basis.
These alternative open-source search projects include both desktop technologies and server-side technologies, alone or in combination, he explained.
The Lucene Model
Apache Lucene is an open-source, full-featured text search engine library written in Java that supports cross-platform searching. It is available for free download.
Its June update adds new features, including a payload package for query mechanisms. The new version can boost a search term’s relevancy score based on the value of the payload located at that term.
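To make the payload idea concrete, here is a minimal sketch in Python. It is not Lucene's API: the `PayloadIndex` class, its method names, and the weighting scheme (3.0 for a term in a title, 1.0 for body text) are all assumptions made for illustration. It shows only the general principle that a value stored alongside each term occurrence can raise or lower that term's contribution to a document's score.

```python
from collections import defaultdict


class PayloadIndex:
    """Toy inverted index whose postings carry a per-occurrence payload.

    Hypothetical illustration only; this is not how Lucene is implemented.
    """

    def __init__(self):
        # term -> list of (doc_id, payload) pairs, one per occurrence
        self.postings = defaultdict(list)

    def add(self, doc_id, tokens):
        """Index a document given as (term, payload) pairs."""
        for term, payload in tokens:
            self.postings[term.lower()].append((doc_id, payload))

    def search(self, term):
        """Rank documents by the summed payloads of the matching term."""
        scores = defaultdict(float)
        for doc_id, payload in self.postings.get(term.lower(), []):
            scores[doc_id] += payload
        # Highest score first; ties broken by document id for stable output.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = PayloadIndex()
# Assumed weighting: a term in a title carries payload 3.0, in body text 1.0.
index.add("doc-a", [("search", 1.0), ("engine", 1.0), ("search", 1.0)])
index.add("doc-b", [("search", 3.0), ("tools", 1.0)])

ranked = index.search("search")
print(ranked)
```

In this sketch "doc-b" outranks "doc-a" even though it mentions the term only once, because its single occurrence carries the heavier payload.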
Lucene is now able to use “point-in-time” searching over NFS (network file system) structures. It also has a new API (application programming interface) for pre-analyzed fields.
A Starting Point
Using the Lucene platform as a basis for new open-source search products may offer more choices. It is capable of integrating current technology.
“From a programmer’s perspective, Apache Lucene has a robust API and .NET and Java compatibility. Lucene is the basis for a number of search platforms,” said Christian.
The .NET Framework is a software component developed by Microsoft that is included in the Microsoft Windows operating system. It provides a large library of pre-coded instructions. Java is a programming language developed by Sun Microsystems.
Developing new search engine strategies for both Internet and intranet use runs the risk of other problems for potential users, warned Christian.
For example, one problem with using an alternative search product is that components may not talk to all data containers. Another problem is that most people are not good at managing metadata (mechanisms that help define the structure of various document types).
“We need to search multiple indexes and return results in a cohesive fashion. We see some companies just beginning to explore this. We need a search vehicle that will pull everything together,” Christian said.
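One common way to approach the problem Christian describes is to query each index separately, rescale the scores so they are comparable, and merge the hits into a single ranked list. The sketch below, in Python, assumes each index returns `(document, raw_score)` pairs. The function names, the divide-by-top-score normalization, and the sample documents are illustrative assumptions, not a description of any product mentioned here.

```python
def normalize(hits):
    """Scale one index's raw scores into [0, 1] by dividing by its top score."""
    if not hits:
        return []
    top = max(score for _, score in hits)
    if top <= 0:
        return [(doc, 0.0) for doc, _ in hits]
    return [(doc, score / top) for doc, score in hits]


def federated_search(results_by_index, limit=10):
    """Merge per-index hit lists into one ranked list of (doc, score, source)."""
    merged = {}
    for source, hits in results_by_index.items():
        for doc, score in normalize(hits):
            # Keep the best normalized score when a document is in several indexes.
            if doc not in merged or score > merged[doc][0]:
                merged[doc] = (score, source)
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(merged.items(), key=lambda item: (-item[1][0], item[0]))
    return [(doc, score, source) for doc, (score, source) in ranked[:limit]]


# Hypothetical hits from an intranet index and a Web index with different scales.
results = federated_search({
    "intranet": [("wiki/vpn-setup", 12.0), ("hr/policy.pdf", 6.0)],
    "web": [("example.org/vpn", 0.8), ("wiki/vpn-setup", 0.2)],
})
for doc, score, source in results:
    print(f"{score:.2f}  {source:<8}  {doc}")
```

Dividing by the top score is the simplest possible normalization. It treats every index's best hit as equally relevant, which is rarely true in practice, so a real system would likely need a more careful scheme.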
Perhaps one of the most promising new open-source search offerings will become available by the end of this year from Wikia, which recently completed a purchase of the Grub Web crawler tool from LookSmart.
Jimmy Wales, Wikia chairman and Wikipedia founder, told LinuxInsider he will release the code for Grub, until now a proprietary product, as open source.
Grub is a Web crawler that creates an index of the World Wide Web by borrowing the processing power donated by volunteer computers, similar to the SETI@home project, which looks for extraterrestrial life. This will allow Wales to jumpstart his new search product without having to develop his own computer network to crawl the Web to build and maintain a catalog of content.
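The volunteer-computing model can be pictured as a central coordinator that hands out URLs and collects whatever the volunteer machines report back. The Python sketch below is a simplified illustration under that assumption, not Grub's actual design or protocol. The `CrawlCoordinator` class, its methods, and the in-memory `FAKE_WEB` dictionary standing in for real page fetches are all hypothetical.

```python
from collections import deque


class CrawlCoordinator:
    """Hands out URLs to volunteer clients and collects what they report back."""

    def __init__(self, seed_urls):
        self.pending = deque(seed_urls)  # URLs waiting to be assigned
        self.seen = set(seed_urls)       # every URL ever queued
        self.index = {}                  # url -> page summary reported by a client

    def assign(self, batch_size):
        """Give a volunteer up to batch_size URLs to fetch."""
        batch = []
        while self.pending and len(batch) < batch_size:
            batch.append(self.pending.popleft())
        return batch

    def report(self, url, summary, links):
        """Record a crawled page and queue any links not seen before."""
        self.index[url] = summary
        for link in links:
            if link not in self.seen:
                self.seen.add(link)
                self.pending.append(link)


# Stand-in for the real Web: url -> (page summary, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("Page A", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("Page B", ["http://a.example/"]),
    "http://c.example/": ("Page C", []),
}

coordinator = CrawlCoordinator(["http://a.example/"])
while True:
    batch = coordinator.assign(batch_size=2)
    if not batch:
        break
    for url in batch:  # in the volunteer model, a donated machine does this part
        summary, links = FAKE_WEB[url]
        coordinator.report(url, summary, links)

print(sorted(coordinator.index))
```

The sketch leaves out everything that makes the approach hard in practice, such as reassigning work when a volunteer never reports back and checking that reported results can be trusted.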
“We plan to build all the software needed for free licensing for searching. I want to make all content available license free. Nothing like this exists today,” Wales said.
Wales’ plan for a new open-source-based search engine calls for an expansion of previous open-source efforts begun by projects such as Lucene. His goal is to create an open and transparent search tool that does not mask its methodologies and search algorithms.
“There were several open-source search projects. They were a start. Some of the pieces have existed. Now we are trying to give it full support,” he said.
Wales plans to release some form of a very rough first cut of his new search offering by the first of the year. He will use an ad-based model for the Web site but is not sure about the rest of the business model yet.