Top 5 Open Source Search Engines
Published on 12 August 15
Many people know what a search engine is, what it does, and even how it works with keywords. But to delve deeper into the mechanics of the familiar search engines we are fond of, such as Google and Yahoo, it is useful to understand more about information retrieval through an open source search engine.
Unlike the consumer search engines that most people know, an open source search engine can be considered an intermediate component of the everyday search process. For example, if you are on a Google page and would like to find out more about the history of football (or soccer, if you are from the United States), you simply type in the keywords and a large array of search results appears in front of you in less than a second. Within those few milliseconds, a search tool of the kind described here retrieves the relevant information from its index (which can span thousands of servers), where the documents have already been organized, and returns the Search Engine Results Pages (SERPs) we know today.
For start-ups trying to make their websites visible on the internet, open source search engines can also be regarded as an alternative to the conventional search engines we use. Google and Yahoo may not be practical for these businesses because of their costly fees and because these conventional search engines focus on well-established websites. Hence, many small enterprises choose open source search engines: they are free, the software is actively maintained, and you can customize the code to your specific preferences.
This list was recently updated to include Elasticsearch and Apache Solr.
1. Elasticsearch
Elasticsearch is a highly scalable open-source full-text search and analytics engine based on Lucene. It is developed in Java and provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
Using Elasticsearch you can store all kinds of documents, search, and analyze big volumes of data quickly and in near real time.
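To make the "HTTP web interface and JSON documents" idea concrete, here is a minimal sketch that builds (but does not send) a full-text search request. The node address, the index name `articles` and the field name `title` are assumptions made up for illustration; only the `_search` endpoint and the `match` query shape come from Elasticsearch itself.

```python
import json
from urllib.request import Request

# Assumption: a node listening locally; adjust to your own setup.
BASE_URL = "http://localhost:9200"


def build_search_request(index, field, text):
    """Build (without sending) a full-text search request for one field."""
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "articles" and "title" are hypothetical names used only for this sketch.
req = build_search_request("articles", "title", "history of football")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would actually send it, assuming a node is running at that address.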
Programming language: Java, with clients available in .NET (C#), PHP, Python, Apache Groovy and many other languages.
License: Apache License 2.0
Ranking of search results: Relevance Scoring
Indexing style: Elasticsearch is distributed, which means that indices can be divided into shards and each shard can have zero or more replicas. Each node hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s). Re-balancing and routing are done automatically. Related data is often stored in the same index, which consists of one or more primary shards, and zero or more replica shards. Once an index has been created, the number of primary shards cannot be changed.
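A rough sketch of why the number of primary shards is fixed: a document is typically routed to a shard by hashing a routing key and taking the remainder modulo the shard count. The function name `pick_shard` is hypothetical, and `zlib.crc32` is only a stand-in for whatever hash the engine really uses.

```python
import zlib


def pick_shard(routing_key, num_primary_shards):
    """Map a routing key (for example a document ID) to a primary shard number."""
    # Assumption: crc32 stands in for the engine's real hash function.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


doc_ids = ["doc-1", "doc-2", "doc-3", "doc-4"]
with_5_shards = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
with_6_shards = {doc_id: pick_shard(doc_id, 6) for doc_id in doc_ids}
print(with_5_shards)
print(with_6_shards)
```

Because the shard number depends on the shard count, changing the count would generally send existing documents to different shards, so lookups by ID would miss them unless everything were re-indexed.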
To learn Elasticsearch and build your own search engine, here are some good online courses for you:
- Complete Elasticsearch Masterclass with Logstash and Kibana (Rated 4.5 / 5 by 548 students)
- Complete Guide to Elasticsearch (Rated 4.5 / 5 by 900+ students)
- ElasticSearch, LogStash, Kibana ELK #1 - Learn ElasticSearch (Rated 4.5/5 by 340 students)
2. Apache Solr
Apache Solr is an open source enterprise search platform written in Java that provides full-text search, real-time indexing, hit highlighting, dynamic clustering, faceted search, database integration, and rich document (e.g., Word, PDF) handling.
Since it is based on Lucene, Solr uses Lucene's Java search library at its core for full-text indexing and search. It offers REST-like HTTP/XML and JSON APIs, which make it usable from most popular programming languages. It is designed for scalability and fault tolerance. Solr's external configuration allows it to be tailored to many types of applications without programming in Java, and its plugin architecture supports more advanced customization.
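As an illustration of the REST-like HTTP interface, the sketch below only builds a query URL for Solr's `select` handler; it does not contact a server. The host, the core name `articles` and the field `title` are assumptions for the example.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's select handler, asking for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core and field names.
url = build_select_url("http://localhost:8983", "articles", "title:football")
print(url)
```

Since the request is a plain HTTP GET, any language with an HTTP client can issue it, which is what makes Solr usable from most programming languages.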
Programming language: Java
License: Apache License 2.0
Ranking of search results: Based on the TF-IDF scoring model used in Lucene, which involves factors such as term frequency, inverse document frequency, coordination factor and field length.
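The factors named above can be combined roughly as follows. This is a simplified sketch modeled loosely on the classic Lucene-style formula, not Solr's actual implementation: query normalization and boosts are left out, and the three-document corpus is invented for illustration.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rare terms weigh more than common ones."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Field length factor: matches in shorter fields count for more."""
    return 1.0 / math.sqrt(num_terms)


def score(query_terms, doc_tokens, corpus):
    matched = [t for t in query_terms if t in doc_tokens]
    # Coordination factor: fraction of the query terms found in this document.
    coord = len(matched) / len(query_terms)
    total = 0.0
    for term in matched:
        doc_freq = sum(1 for d in corpus if term in d)
        total += tf(doc_tokens.count(term)) * idf(len(corpus), doc_freq) ** 2
    return coord * total * length_norm(len(doc_tokens))


# Made-up corpus: each document is a list of tokens.
corpus = [
    "football history in england".split(),
    "history of the world".split(),
    "cooking pasta at home".split(),
]
query = ["football", "history"]
ranked = sorted(
    range(len(corpus)),
    key=lambda i: score(query, corpus[i], corpus),
    reverse=True,
)
print(ranked)  # [0, 1, 2]
```

The first document wins because it matches both query terms and contains the rarer term "football"; the last document matches nothing and scores zero.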
To learn Apache Solr and build your own search engine, here are some good online courses for you:
- Getting Started with Enterprise Search Using Apache Solr (Rated 4 / 5 by 271 students)
- Learn Apache Solr with Big Data and Cloud Computing (Rated 3/5 by 119 students)
3. Lucene
Lucene is one of the more established open source search engines: a text search engine library written entirely in Java. It can be used for any application that requires full-text search.
Lucene can be used across platforms, has a configurable storage engine (Codecs) and has many powerful query types such as proximity queries and phrase queries.
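Phrase and proximity queries rely on an index that records word positions, not just which documents contain a word. The sketch below is a toy version of that idea and not Lucene's API: the function names are hypothetical, only two-word ordered phrases are handled, and `slop` here simply means the number of extra words allowed in between.

```python
from collections import defaultdict


def build_index(docs):
    """Positional inverted index: term -> doc_id -> list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_match(index, first, second, slop=0):
    """Docs where `second` follows `first` with at most `slop` words in between."""
    hits = set()
    for doc_id, first_positions in index.get(first, {}).items():
        second_positions = index.get(second, {}).get(doc_id, [])
        for p in first_positions:
            if any(0 < q - p <= slop + 1 for q in second_positions):
                hits.add(doc_id)
    return sorted(hits)


# Made-up documents for illustration.
docs = {
    1: "open source search engine",
    2: "search for an open engine",
    3: "engine search",
}
index = build_index(docs)
print(phrase_match(index, "search", "engine"))          # [1]
print(phrase_match(index, "search", "engine", slop=3))  # [1, 2]
```

With `slop=0` this behaves like an exact phrase query; raising `slop` turns it into a proximity query.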
Programming Language: Originally Java but ported to other languages such as: Delphi, Perl, C#, C++, Python, Ruby, and PHP
License: Apache License 2.0
Ranking of search results: Versatile (follows popular choices)
Indexing style: multiple-index searching with merged results
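"Multiple-index searching with merged results" can be pictured as follows: each index returns its own hits sorted by score, and the lists are merged into one ranking. This is a sketch under the assumption that scores from different indexes are directly comparable; the function name and the sample hits are made up.

```python
import heapq


def merged_search(result_lists, limit):
    """Merge per-index result lists, each already sorted by descending score."""
    merged = heapq.merge(*result_lists, key=lambda hit: hit[0], reverse=True)
    return list(merged)[:limit]


# Hypothetical (score, document) hits from two separate indexes.
index_a_hits = [(0.9, "a-doc1"), (0.4, "a-doc2")]
index_b_hits = [(0.7, "b-doc1"), (0.5, "b-doc2")]
top = merged_search([index_a_hits, index_b_hits], limit=3)
print(top)  # [(0.9, 'a-doc1'), (0.7, 'b-doc1'), (0.5, 'b-doc2')]
```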
Even in 2017, Lucene is still preferred by some employers.
"It's probably the most advanced library out there today – open source or not," says Shay Banon, the founder of ElasticSearch
4. Sphinx
Sphinx is an open source full text search server designed with search relevance and integration simplicity in mind.
Sphinx allows flexible text processing: its indexing features include full support for SBCS and UTF-8 encodings, stopword removal and optional hit position removal (hitless indexing); morphology and synonym processing through word forms dictionaries and stemmers; exceptions and blended characters; and many more.
Sphinx is easy to integrate into applications through three different APIs: a native library available for many programming languages (SphinxAPI), a pluggable storage engine for MySQL (SphinxSE), and a query interface that uses the MySQL client library and syntax (SphinxQL).
Websites such as Craigslist, Living Social, MetaCafe and Groupon have adopted Sphinx for their searches.
Programming Language: C++
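To show what stopword removal, word forms and stemming do to text before it is indexed, here is a toy analysis pipeline. It is not Sphinx code: the stopword list, the word forms table and the naive suffix stripper are all made up for illustration, and real stemmers are considerably more careful.

```python
# All three tables below are hypothetical and deliberately tiny.
STOPWORDS = {"the", "a", "of", "and", "in"}
WORDFORMS = {"mice": "mouse", "went": "go"}  # irregular forms -> base form
SUFFIXES = ("ing", "es", "s")


def stem(word):
    """Naive suffix stripping that keeps at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, strip punctuation, drop stopwords, then normalize each word."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?")
        if not word or word in STOPWORDS:
            continue
        # The word forms dictionary takes priority over the stemmer.
        tokens.append(WORDFORMS.get(word, stem(word)))
    return tokens


print(analyze("The mice went searching in the indexes."))
# ['mouse', 'go', 'search', 'index']
```

Applying the same pipeline to both documents and queries is what lets a search for "index" find a page that only mentions "indexes".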
License: GPLv2 and commercial
Ranking of search results: Versatile
Indexing style: SQL database indexing and Non-SQL storage indexing.
The demand for Sphinx among employers is not as high as that for Lucene.
5. Xapian
Xapian, described as an open source probabilistic information retrieval library, provides a full text search engine library for programmers.
It supports a wide range of structured Boolean search operators, and results are ranked using probabilistic weights. Boolean filters can also be used to restrict a probabilistic search.
Xapian also supports synonyms, both explicitly and as an automatic form of query expansion.
Also, if you're looking for a fully packaged search engine built on Xapian, you can install Omega on your site. A great aspect of Xapian is its versatility, which allows you to extend Omega to meet your needs as they grow.
Currently, Xapian is used as the search engine for the Library of the University of Cologne and Die Zeit (a popular German newspaper).
Programming Language: C++
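The two ideas above, a Boolean filter that restricts a weighted search and synonyms used as query expansion, can be sketched in a few lines. This is not Xapian's API: the documents, the synonym table and the function names are hypothetical, and a simple count of matching terms stands in for real probabilistic weights.

```python
# Hypothetical synonym table and documents.
SYNONYMS = {"football": {"soccer"}}

docs = {
    1: {"terms": ["football", "history"], "lang": "en"},
    2: {"terms": ["soccer", "rules"], "lang": "en"},
    3: {"terms": ["football", "geschichte"], "lang": "de"},
}


def expand(query_terms):
    """Query expansion: add the known synonyms of each query term."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def search(query_terms, lang=None, use_synonyms=False):
    terms = expand(query_terms) if use_synonyms else set(query_terms)
    results = []
    for doc_id, doc in docs.items():
        # Boolean filter: it only includes or excludes, it never changes weights.
        if lang is not None and doc["lang"] != lang:
            continue
        # Assumption: a match count stands in for a probabilistic weight.
        weight = sum(1 for t in doc["terms"] if t in terms)
        if weight > 0:
            results.append((weight, doc_id))
    ordered = sorted(results, key=lambda r: (-r[0], r[1]))
    return [doc_id for _, doc_id in ordered]


print(search(["football", "history"]))                                # [1, 3]
print(search(["football", "history"], lang="en"))                     # [1]
print(search(["football", "history"], lang="en", use_synonyms=True))  # [1, 2]
```

The filter narrows the candidate set without affecting the order of what remains, while the expansion widens the set of terms that can contribute weight.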
License: GNU General Public License
Ranking of search results: Flexible (important words get more weight than unimportant words)
Indexing Style: Filing system
Below are a couple more open source search engines that may be of interest to you.
Indri
Indri is an open source search engine that offers state-of-the-art text search and a rich structured query language for text collections of up to 50 million documents (single machine) or 500 million documents (distributed search). Indri is multi-platform and runs on Linux, Solaris, Windows and Mac OS X.
Indri supports UTF-8 encoded text and can parse PDF, HTML, XML and TREC documents. It also recognizes text annotations.
Programming Language: C++, with an API that can be used from Java, PHP, or C++
License: BSD style license
Ranking of search results: Versatile (Explicit term weighting and Robust query language)
Indexing style: Flexible indexing with tokenization
Zettair
Written and designed by the Search Engine Group at RMIT University, Zettair is a compact and fast text search engine which allows you to index and search HTML (or TREC) collections. It is also designed for simplicity as well as speed and flexibility, and one of its fundamental features is the ability to handle large amounts of text.
Programming language: C
License: BSD style License
Ranking of search results: simple and straightforward
Indexing style: Single Executable (when an index doesn't exist, Zettair will create one for you based on the parameters you provide)
If you feel a better or noteworthy open source search engine is missing from this list, please don't hesitate to leave a comment below!
References:
http://sphinxsearch.com/about/sphinx/
I will recommend ElasticSearch
i want to start a search engine . like yahoo or google in future. Which one you suggest?