In-memory Search made Easy
A library that helps you build an in-memory search index out of the data residing in your database/persistence layer. This should be possible as long as you are able to pipe data out of the persistence layer into your application.
Say you have a small amount of data in your primary datastore, but you want simple search capabilities on top of it. Would you spin up an entire search engine for this, like a dedicated Elasticsearch or Solr cluster? Or would you start creating indexes left and right and bloat up your database? Even then, you still would not be able to do free-text search on it.
There are some obvious problems with whatever approach you take:
- Overkill: It is definitely overkill in most use-cases, like when your database has only a few thousand rows
- Expensive: Depending on what hardware/cloud you choose to host the search engine on
- Latencies: Search engines today are really fast (especially if you are generous with the hardware), but whatever you do, you still incur the cost of a network hop
This library attempts to solve the above by creating a simple search index in every application node's memory.
We've finished the What and the Why; now let's look at the How.
At its heart is Lucene. Why Lucene, you ask? Well, Lucene is the most evolved open-source Java search-engine library out there. It powers Nutch, Solr, Elasticsearch, etc. It is well maintained, supported by the Apache Software Foundation, and has continuous contributions. Need I say more?!
Now how do you make your database searchable?
Essentially, the problem can be divided into 4 critical steps:
- Bootstrapping: Ship all data from your database and index it in Lucene
- Periodic Update: Do this at regular intervals (to account for changes in your database)
- Indexing Rules: Be able to define which parts of the data, i.e., which fields, you want indexed in Lucene
- Search Queries: Be able to retrieve documents by querying the indexed fields.
Bootstrapping
This is where we retrieve all data elements from your database and send them to the consumer. Your DAO layer should implement a `Bootstrapper`, and its `bootstrap()` method is where you would scan your persistence layer. In this method, call `consumer.accept()` for every data item you want indexed in the search engine.
- The consumer handles parallel callbacks
- It also ensures single-threaded processing of those callbacks (i.e., indexing into Lucene), as illustrated in the sketch below
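To make that concrete, here is a minimal illustrative sketch of the pattern (not the library's actual implementation): parallel producers enqueue items onto a blocking queue, and a single drainer thread indexes them one at a time.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Illustrative only: parallel producers enqueue, a single thread drains into the index
class QueuedConsumer<T> implements Consumer<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();

    QueuedConsumer(final Consumer<T> indexer) {
        final Thread drainer = new Thread(() -> {
            try {
                while (true) {
                    indexer.accept(queue.take()); // single-threaded indexing into Lucene
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        drainer.setDaemon(true);
        drainer.start();
    }

    @Override
    public void accept(final T item) {
        queue.add(item); // safe to call from parallel bootstrap callbacks
    }
}
```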
Periodic Update
You can define how often the full bootstrap happens. A `PeriodicUpdateEngine` ensures that the bootstrapping process is invoked at regular intervals. The interval is configurable, based on what you think is right for your use-case.
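Conceptually, this is no more than a scheduled re-run of the bootstrap. A rough sketch of the idea (the actual wiring of `PeriodicUpdateEngine` is shown in Step 2 of the Usage section; `bootstrapper` and `consumer` here are assumed to be in scope):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: re-run the full bootstrap at a fixed, configurable interval
final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> bootstrapper.bootstrap(consumer), 0, 60, TimeUnit.SECONDS);
```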
Indexing Rules
You should be able to decide which fields get indexed. As such, the `Bootstrapper` implementation's consumer takes an `IndexableDocument`, wherein you can choose how the item is indexed as a document. The examples in the Usage section should make this clearer.
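For a taste, using the Book example from the Usage section below, you could choose to index only the title and the page count:

```java
// Only the fields you hand to the ForageDocument get indexed; everything else stays out of Lucene
itemConsumer.accept(new ForageDocument(book.getId(), book, ImmutableList
        .of(new TextField("title", book.getTitle()),
            new IntField("numPage", new int[]{book.getNumPage()}))));
```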
Search Queries
You should be able to express your retrieval strategies using the `ForageQuery` class. There are several static helpers in `QueryBuilder` that make constructing queries easy.
- One important prerequisite is that you should be able to pull all data from your database, i.e., you should be able to stream it out as a batched select query (on a relational DB) or a scan (Aerospike, Redis, HBase, or any other non-relational DB), depending on what database you are using (see the JDBC sketch after this list)
- The size of the data should be limited. While it totally depends on how much heap you supply to your Java application, we presume it shouldn't be in the range of tens of millions of rows. This library has been tested for 100k rows in memory. (todo) mention details
- Ensure your application is supplied with sufficient memory. A ballpark for calculating the memory for your base Java application is to (todo)
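For the first point, here is a rough sketch of what streaming out of a relational database could look like with plain JDBC and keyset pagination (the table, columns, and Book constructor are hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.function.Consumer;

// Illustrative only: pull rows out in batches, so the whole table is never held in memory at once
void scanAllBooks(final Connection connection, final Consumer<Book> consumer) throws SQLException {
    String lastId = "";
    while (true) {
        int count = 0;
        try (PreparedStatement ps = connection.prepareStatement(
                "SELECT id, title, author FROM books WHERE id > ? ORDER BY id LIMIT 1000")) {
            ps.setString(1, lastId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    lastId = rs.getString("id");
                    consumer.accept(new Book(lastId, rs.getString("title"), rs.getString("author")));
                    count++;
                }
            }
        }
        if (count == 0) {
            return; // no more rows to stream
        }
    }
}
```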
Add the following Maven dependency:

```xml
<dependency>
    <groupId>com.livetheoogway.forage</groupId>
    <artifactId>forage-search-engine</artifactId>
    <version>${forage.version}</version> <!-- look for the latest version on top -->
</dependency>
```
Let's go the full mile and see what a complete integration would look like.
The sample shows how `Book` items stored in some database can be made searchable. Assume a `Book` with typical properties like `title`, `author`, `rating`, and `numPage`.
Step 1
You would typically start with your datastore/DAO implementations. The following is a good example of what it would look like:
```java
import com.google.common.collect.ImmutableList;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.stream.Collectors;

class DataStore implements Bootstrapper<IndexableDocument>, Store<Book> {
    private final Map<String, Book> books; // This would be your DB connection

    public DataStore() {
        this.books = new HashMap<>(); // You would be initializing your DB connections here
    }

    public void saveBook(final Book book) {
        books.put(book.getId(), book); // You would be saving this in your database
    }

    @Override
    public void bootstrap(final Consumer<IndexableDocument> itemConsumer) {
        // THIS IS THE MAIN IMPLEMENTATION
        // You would scan all rows of your database here, create an individual ForageDocument
        // for each item, and supply it to the consumer.
        // All rules on which fields need to be indexed, and how, should be applied here.
        for (final Book book : books.values()) {
            itemConsumer.accept(new ForageDocument(book.getId(), book, ImmutableList
                    .of(new TextField("title", book.getTitle()),
                        new TextField("author", book.getAuthor()),
                        new FloatField("rating", new float[]{book.getRating()}),
                        new IntField("numPage", new int[]{book.getNumPage()}))));
        }
    }

    // The following function will be called during search operations, to fetch the currently
    // stored data for the matching doc ids. Replace this with an implementation that
    // retrieves the ids from your actual datastore.
    @Override
    public Map<String, Book> get(final List<String> ids) {
        return ids.stream().collect(Collectors.toMap(Function.identity(), books::get));
    }
}
```
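To try it out, you can populate the store before handing it to the search engine in Step 2 (the Book constructor signature is assumed here for illustration):

```java
final DataStore dataStore = new DataStore();
// Book's constructor is assumed to take (id, title, author, rating, numPage)
dataStore.saveBook(new Book("book-1", "The Adventures of Tom Sawyer", "Mark Twain", 4.5f, 327));
```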
Step 2
Your next step involves creating and initializing the SearchEngine, and using it for retrieval:
```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.concurrent.TimeUnit;

@Singleton
public class Container {
    private final SearchEngine<ForageQuery, ForageQueryResult<Book>> searchEngine;

    public Container(final DataStore dataStore) {
        final ForageSearchEngineBuilder<Book> engineBuilder = ForageSearchEngineBuilder.<Book>builder()
                .withDataStore(dataStore)
                .withObjectMapper(new ObjectMapper());
        this.searchEngine = new ForageEngine<>(engineBuilder);
        final PeriodicUpdateEngine<IndexableDocument> updateEngine =
                new PeriodicUpdateEngine<>(
                        dataStore,
                        new AsyncQueuedConsumer<>(searchEngine),
                        60, TimeUnit.SECONDS // how often you want to bootstrap from the database
                );
        updateEngine.start();
    }

    // And while searching, you can do this:
    public void sampleSearch() {
        // retrieve the top 10 books that have numPage between 600 and 800
        final ForageQueryResult<Book> results =
                searchEngine.search(QueryBuilder.intRangeQuery("numPage", 600, 800).buildForageQuery(10));

        // retrieve all books that have "rowling" in author, and "prince" in title
        final ForageQueryResult<Book> result = searchEngine.search(
                QueryBuilder.booleanQuery()
                        .query(new MatchQuery("author", "rowling"))
                        .query(new MatchQuery("title", "prince"))
                        .clauseType(ClauseType.MUST)
                        .buildForageQuery());
    }
}
```
There is a much simpler integration available if your application is a Dropwizard application.
Add the following dependency
```xml
<dependency>
    <groupId>com.livetheoogway.forage</groupId>
    <artifactId>forage-dropwizard-bundle</artifactId>
    <version>${forage.version}</version> <!-- look for the latest version on top -->
</dependency>
```
In your Application, register the `ForageBundle`:

```java
public class MyApplication extends Application<MyConfiguration> {
    // other stuff

    @Override
    public void initialize(final Bootstrap<MyConfiguration> bootstrap) {
        bootstrap.addBundle(new ForageBundle<>() {
            @Override
            public Store<Book> dataStore(final MyConfiguration configuration) {
                return store; // the one that retrieves data given ids
            }

            @Override
            public Bootstrapper<IndexableDocument> bootstrap(final MyConfiguration configuration) {
                return store; // the one that implements the bootstrap
            }

            @Override
            public ForageConfiguration forageConfiguration(final MyConfiguration configuration) {
                return configuration.getForageConfiguration(); // have ForageConfiguration as part of your main config class
            }
        });
    }
}
```
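For the last override, your main configuration class would carry a ForageConfiguration. A minimal sketch (the field name and YAML key are assumptions):

```java
public class MyConfiguration extends Configuration {
    private ForageConfiguration forageConfiguration; // assumed to map from a "forageConfiguration" block in your config YAML

    public ForageConfiguration getForageConfiguration() {
        return forageConfiguration;
    }
}
```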
The following types of queries are currently supported:

- Simple Term Match: `QueryBuilder.matchQuery("title", "sawyer").buildForageQuery()`
- Fuzzy Query: you can try a fuzzy match for retrieving results: `QueryBuilder.fuzzyMatchQuery("title", "sayyer").buildForageQuery()`
- Range Queries: `QueryBuilder.intRangeQuery("numPage", 600, 800).buildForageQuery()`
- Boolean Queries:

```java
QueryBuilder.booleanQuery()
        .query(new MatchQuery("author", "rowling"))
        .query(new MatchQuery("title", "prince"))
        .clauseType(ClauseType.MUST) // or SHOULD, MUST_NOT, FILTER
        .buildForageQuery();
```

- Page Queries and Paginated Results:

```java
ForageQueryResult<Book> result = searchEngine.search(QueryBuilder.matchQuery("author", "rowling").buildForageQuery(15)); // first 15 items
ForageQueryResult<Book> result2 = searchEngine.search(new PageQuery(result.getNextPage(), 20)); // next 20 items
```

- Phrase Match Query: `QueryBuilder.phraseMatchQuery("title", "Tom Sawyer").buildForageQuery()`
- Match All Query: `QueryBuilder.matchAllQuery().buildForageQuery()`
The library is built on:
- Java 11
- Lucene 9.1.0
- Dropwizard 2.1.0 (optional)
Please raise issues, bugs, or any feature requests at GitHub Issues.
If you plan on contributing to the code, fork the repository and raise a Pull Request back here.
- Core and the bootstrapper diagram with the queued listeners
- Lucene internals being masked
- Searchers
- Attributes being stored for field conversion
- Helpers for query creation
- Fuzzy Query Support
- Dropwizard bundle for simpler integrations
- Expose Scoring and boosting
- Phrase Query Support
- Auto complete query Support
- Expose explain query (`IndexSearcher.explain(Query, doc)`)