on 08 August 19
Many people know what a search engine is, what it does and even how it works with keywords. But to delve deeper into the mechanics of the familiar search engines we are fond of, such as Google and Yahoo, it is useful to understand more about information retrieval through an open source search engine.
Unlike the consumer-facing search engines that most people know, an open source search engine can be thought of as an intermediate component in an ordinary, everyday search process. For example, if you are on a Google page and would like to find out more about the history of football (or soccer, if you are from the United States), you simply type in the keywords and a large array of search results appears in less than a second. Within those few milliseconds, Google would utilize a search tool of this kind to retrieve the relevant information from its index database (spread across thousands of servers), where documents have been neatly categorized into specific indexes. That is what gives us the Search Engine Results Pages (SERPs) we have today.
For start-up businesses trying to make their websites visible on the internet, open source search engines can also be regarded as an alternative to the conventional search engines we use. Google and Yahoo may not be practical for these businesses because of their costly fees and because these conventional search engines focus on well-established websites. Hence, many small enterprises choose open source search engines: they are free, the software is actively maintained, and you can customize the source code to suit specific preferences.
Now that you have a better idea of how an open source search engine works and how useful it can be, let's dive into my recommended list of open source search engines and delve further into the technicalities.
The list was recently updated with Elasticsearch and Apache Solr.
Elasticsearch is a highly scalable open-source full-text search and analytics engine based on Lucene. It is developed in Java and provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
Using Elasticsearch you can store all kinds of documents, search, and analyze big volumes of data quickly and in near real time.
Programming language: Java, with clients available for .NET (C#), PHP, Python, Apache Groovy and many other languages.
License: Apache License 2.0
Ranking of search results: Relevance Scoring
Indexing style: Elasticsearch is distributed, which means that indices can be divided into shards and each shard can have zero or more replicas. Each node hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s). Re-balancing and routing are done automatically. Related data is often stored in the same index, which consists of one or more primary shards and zero or more replica shards. Once an index has been created, the number of primary shards cannot be changed.
To learn Elasticsearch and build your own search engine, here are some good online courses for you:
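To see why the number of primary shards is fixed once an index exists, it helps to sketch hash-based routing. The snippet below is a simplified illustration, not Elasticsearch's actual implementation: it assumes a document is placed by hashing a routing key (here the document ID) modulo the number of primary shards, and it uses `zlib.crc32` as a stand-in hash. The function name `pick_shard` and the document IDs are hypothetical.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (for example a document ID) to a primary shard number."""
    # zlib.crc32 is a stand-in hash chosen for this sketch (an assumption);
    # it is deterministic, so the same key always lands on the same shard.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


doc_ids = ["doc-1", "doc-2", "doc-3", "doc-4"]

# Placement with 5 primary shards versus 6 primary shards.
placement_5 = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
placement_6 = {doc_id: pick_shard(doc_id, 6) for doc_id in doc_ids}

for doc_id in doc_ids:
    print(doc_id, "5 shards ->", placement_5[doc_id], "| 6 shards ->", placement_6[doc_id])
```

Because the shard number depends on the shard count, changing the count would generally send existing documents to different shards, so lookups by ID would miss them unless everything were reindexed.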
- Complete Elasticsearch Masterclass with Logstash and Kibana (Rated 4.5 / 5 by 548 students)
- Complete Guide to Elasticsearch (Rated 4.5 / 5 by 900+ students)
- ElasticSearch, LogStash, Kibana ELK #1 - Learn ElasticSearch (Rated 4.5/5 by 340 students)
2. Apache Solr
It is an open source enterprise search platform programmed in Java to provide full-text search, real-time indexing, hit highlighting, dynamic clustering, faceted search, database integration, and rich document (e.g., Word, PDF) handling.
Since it is based on Lucene, Solr uses Lucene's Java search library at its core for full-text indexing and search, and it exposes REST-like HTTP/XML and JSON APIs, which makes it usable from most popular programming languages. It is designed for scalability and fault tolerance. Solr's external configuration allows it to be tailored to many types of applications without programming in Java, and its plugin architecture supports more advanced customization.
Programming language: Java
License: Apache License 2.0
Ranking of search results: Based on the TF-IDF scoring model used in Lucene, which involves factors such as term frequency, inverse document frequency, coordination factor and field length.
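The four ranking factors above can be sketched in a few lines of Python. This is a simplified illustration in the spirit of Lucene's classic TF-IDF similarity, not Solr's actual code: it leaves out query normalization and boosts, and the function name, sample documents and query are all made up for the example.

```python
import math


def classic_tfidf_score(query_terms, doc_tokens, doc_freq, num_docs):
    """Score one document against a query with a simplified TF-IDF formula."""
    if not doc_tokens or not query_terms:
        return 0.0
    # Field length: shorter fields get a higher normalization factor.
    length_norm = 1.0 / math.sqrt(len(doc_tokens))
    matched = 0
    total = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        matched += 1
        tf = math.sqrt(freq)  # term frequency, dampened
        idf = 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))  # rarer terms weigh more
        total += tf * idf * idf * length_norm
    # Coordination factor: fraction of query terms found in the document.
    coord = matched / len(query_terms)
    return coord * total


docs = {
    "a": "history of football".split(),
    "b": "football football rules and football clubs".split(),
    "c": "history of tennis".split(),
}
query = ["history", "football"]
doc_freq = {term: sum(term in tokens for tokens in docs.values()) for term in query}

scores = {
    name: classic_tfidf_score(query, tokens, doc_freq, len(docs))
    for name, tokens in docs.items()
}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

In this toy collection the short document that contains both query terms ranks first, which is the combined effect of the coordination factor and field length.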
To learn Apache Solr and build your own search engine, here are some good online courses for you:
- Getting Started with Enterprise Search Using Apache Solr (Rated 4 / 5 by 271 students)
- Learn Apache Solr with Big Data and Cloud Computing (Rated 3/5 by 119 students)
Lucene is one of the more established open source search engines out there, with a text search engine library written entirely in Java. It can be used in any application that requires full-text search.
Lucene can be used across platforms, has a configurable storage engine (Codecs) and has many powerful query types such as proximity queries and phrase queries.
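Phrase and proximity queries both rely on the index remembering where each term occurs, not just whether it occurs. The sketch below illustrates the idea with a tiny positional index in Python. It is not Lucene's implementation or API (Lucene's proximity matching uses a more refined notion of "slop"), and the helper names are hypothetical.

```python
from collections import defaultdict


def build_positions(tokens):
    """Build a positional index: term -> list of positions where it occurs."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    return positions


def phrase_match(positions, phrase):
    """True if the phrase terms occur consecutively and in order."""
    if not phrase:
        return False
    return any(
        all(start + offset in positions.get(term, []) for offset, term in enumerate(phrase))
        for start in positions.get(phrase[0], [])
    )


def proximity_match(positions, term_a, term_b, max_distance):
    """True if the two terms occur within max_distance positions of each other."""
    return any(
        abs(pa - pb) <= max_distance
        for pa in positions.get(term_a, [])
        for pb in positions.get(term_b, [])
    )


tokens = "the history of association football in england".split()
index = build_positions(tokens)

print(phrase_match(index, ["association", "football"]))  # consecutive, in order
print(proximity_match(index, "history", "football", 3))  # within 3 positions
```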
The project is available as a free download, and Twitter uses Lucene for real-time search.
Programming Language: Originally Java but ported to other languages such as: Delphi, Perl, C#, C++, Python, Ruby, and PHP
License: Apache License 2.0
Ranking of search results: Versatile (follows popular choices)
Indexing style: multiple-index searching with merged results
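"Multiple-index searching with merged results" can be pictured as running the same query against several indexes and interleaving the hits by score. The Python sketch below shows only the merge step and is not Lucene code. It assumes each index returns hits already sorted by descending score and that scores are comparable across indexes, which in practice requires consistent scoring statistics. The hit lists are made up.

```python
import heapq
from itertools import islice


def merge_results(per_index_hits, top_n):
    """Merge per-index hit lists (each sorted by descending score) into one ranked list."""
    merged = heapq.merge(*per_index_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, top_n))


# Hypothetical (score, document id) hits from two separate indexes.
hits_index_a = [(0.9, "a1"), (0.4, "a2")]
hits_index_b = [(0.7, "b1"), (0.5, "b2"), (0.1, "b3")]

top_hits = merge_results([hits_index_a, hits_index_b], top_n=3)
print(top_hits)
```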
Even in 2017, Lucene is still preferred by some employers.
"It's probably the most advanced library out there today – open source or not," says Shay Banon, the founder of Elasticsearch.
Sphinx is an open source full-text search server designed with search quality (relevance) and integration simplicity in mind.
Sphinx allows flexible text processing: its indexing features include full support for SBCS and UTF-8 encodings, stopword removal and optional hit position removal (hitless indexing); morphology and synonym processing through word forms dictionaries and stemmers; exceptions and blended characters; and many more.
Sphinx is easy to integrate into applications through 3 different APIs: a native library for many programming languages, a pluggable storage engine for MySQL, and a query interface that lets an application use a standard MySQL client library and SQL-like syntax.
Websites such as Craigslist, Living Social, MetaCafe and Groupon have adopted Sphinx for their searches.
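To make a few of these indexing features concrete, here is a toy Python pipeline that applies stopword removal, a word forms mapping and a very naive stemmer to a piece of text. It only illustrates the concepts and is not how Sphinx implements or orders them. The stopword list, the word forms dictionary and the function names are all hypothetical.

```python
STOPWORDS = {"the", "of", "a", "in"}   # illustrative stopword list
WORDFORMS = {"soccer": "football"}     # illustrative word forms dictionary


def naive_stem(token):
    """Toy suffix stripper; real stemmers (such as Porter) are far more careful."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, drop stopwords, map word forms, then stem each token."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [WORDFORMS.get(t, t) for t in tokens]
    return [naive_stem(t) for t in tokens]


print(analyze("The history of soccer clubs"))
```

Running the same pipeline over both documents and queries is what lets a search for "soccer club" find a page that only mentions "football clubs".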
Programming Language: C++
License: GPLv2 and commercial
Ranking of search results: Versatile
Indexing style: SQL database indexing and Non-SQL storage indexing.
The demand for Sphinx among employers is not as high as that for Lucene.
Xapian, described as an open source probabilistic information retrieval library, provides a full-text search engine library for programmers.
It supports a rich set of Boolean query operators alongside ranking based on probabilistic weights. Boolean filters can also be used to restrict a probabilistic search.
Xapian also supports synonyms, both explicitly and as an automatic form of query expansion.
Also, if you're looking for a fully packaged search engine built on Xapian, you can install Omega on your site. A great aspect of Xapian is its versatility, which allows you to extend Omega to meet your needs as they grow.
Currently, Xapian is used as a search engine for the Library of the University of Cologne and Die Zeit (a popular German newspaper).
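Synonym-based query expansion can be sketched as replacing each query term with a group of alternatives, any of which may satisfy that term. The Python below illustrates the idea only and does not use Xapian's API. The synonym table and function names are hypothetical.

```python
SYNONYMS = {"football": ["soccer"], "movie": ["film"]}  # illustrative synonym table


def expand_query(terms, synonyms=SYNONYMS):
    """Turn each query term into a group: the term itself plus its synonyms."""
    return [[term] + synonyms.get(term, []) for term in terms]


def matches(doc_tokens, expanded):
    """A document matches if every group has at least one alternative in it."""
    doc = set(doc_tokens)
    return all(any(alt in doc for alt in group) for group in expanded)


expanded = expand_query(["football", "history"])
print(expanded)
print(matches("soccer history in brazil".split(), expanded))
```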
Programming Language: C++
License: GNU General Public License
Ranking of search results: Flexible (important words become more probable than unimportant words)
Indexing Style: Filing system
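The combination described above, probabilistic ranking restricted by a Boolean filter, can be sketched in Python. The example uses the BM25 weighting formula, which belongs to the probabilistic family, in a common textbook form. It does not call Xapian's API, and the documents, tags and function names are made up for illustration.

```python
import math


def bm25_score(query_terms, doc_tokens, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Textbook BM25: rarer terms and shorter documents contribute more weight."""
    score = 0.0
    for term in set(query_terms):
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        n = doc_freq[term]
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        norm = freq + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * freq * (k1 + 1) / norm
    return score


def search(docs, query_terms, required_tag=None):
    """Rank docs by BM25; optionally keep only docs carrying required_tag (Boolean filter)."""
    num_docs = len(docs)
    avg_len = sum(len(tokens) for tokens, _ in docs.values()) / num_docs
    doc_freq = {
        term: sum(term in tokens for tokens, _ in docs.values())
        for term in set(query_terms)
    }
    results = []
    for doc_id, (tokens, tags) in docs.items():
        if required_tag is not None and required_tag not in tags:
            continue  # the Boolean filter removes the doc before ranking
        score = bm25_score(query_terms, tokens, doc_freq, num_docs, avg_len)
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


docs = {
    "d1": ("football history and football rules".split(), {"sport"}),
    "d2": ("history of rome".split(), {"history"}),
    "d3": ("football transfer news".split(), {"news"}),
}

ranked = search(docs, ["football"])
filtered = search(docs, ["football"], required_tag="news")
print(ranked)
print(filtered)
```

The filter does not change how the remaining documents are weighted; it only decides which documents are eligible to be ranked at all.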
Below are a couple more open source search engines that may be of interest to you.
Indri is an open source search engine that offers state-of-the-art text search and a rich structured query language for text collections of up to 50 million documents (single machine) or 500 million documents (distributed search). Indri is multi-platform and runs on Linux, Solaris, Windows and Mac OS X.
Indri supports UTF-8 encoded text and can parse PDF, HTML, XML and TREC documents. It also recognizes text annotations.
One of Indri's significant roles is as the search engine component of the Lemur Toolkit, an open-source (BSD license) software framework for building language modeling and information retrieval software. The toolkit came out of a partnership between the Center for Intelligent Information Retrieval and the Language Technologies Institute at Carnegie Mellon University.
Programming Language: C++, with an API usable from Java, PHP, or C++
License: BSD style license
Ranking of search results: Versatile (Explicit term weighting and Robust query language)
Indexing style: Flexible indexing with tokenization
Written and designed by the Search Engine Group at RMIT University, Zettair is a compact and fast text search engine which allows you to index and search HTML (or TREC) collections. It is designed for simplicity as well as speed and flexibility, and one of its fundamental features is the ability to handle large amounts of text.
Other features of Zettair include Boolean, ranked and phrase querying, a modular C API and an easy-to-use command-line interface. The search engine also runs on many platforms, including Solaris and Linux.
Programming language: C
License: BSD style License
Ranking of search results: simple and straightforward
Indexing style: Single Executable (when an index doesn't exist, Zettair will create one for you based on the parameters you provide)
If you feel a better or noteworthy open source search engine is missing from this list, please don't hesitate to leave a comment below!