In-memory Search made Easy
A library that helps you build an in-memory search index out of the data residing in your database/persistence layer. This should be possible as long as you are able to pipe data out of the persistence layer into your application.
Say you have a small amount of data in your primary datastore, but you want simple search capabilities on top of it. Would you spin up an entire search engine for this, like a dedicated Elasticsearch or Solr cluster? Or would you start creating indexes left and right and bloat up your database? Even then, you still would not be able to do free-text search on it.
There are some obvious problems with whatever approach you take:
- Overkill: It is definitely overkill in most use-cases, like when your database has only a few thousand rows
- Expensive: Depending on what hardware/cloud you choose to host the search engine on
- Latencies: Search engines today are really fast (especially if you are generous with the hardware), but whatever you do, you still incur the cost of a network hop
This library attempts to solve the above by creating a simple search index in every application node's memory.
We've finished the What and the Why; now let's look at the How.
At its heart is Lucene. Why Lucene, you ask? Well, Lucene is the most evolved open-source Java search-engine library out there. It powers Nutch, Solr, Elasticsearch, etc. It is well maintained, supported by the Apache Software Foundation, and has continuous contributions. Need I say more?!
Now how do you make your database searchable?
Essentially, the problem can be divided into 4 critical steps:
- Bootstrapping: Ship all data from your database and index it in Lucene
- Periodic Update: Do this at regular intervals (to account for changes in your database)
- Indexing Rules: Be able to define which parts of the data, i.e., which fields, you want indexed in Lucene
- Search Queries: Be able to retrieve documents by querying the indexed fields.
Bootstrapping
This is where we retrieve all data elements from your database and send them to the consumer. Your DAO layer should implement a `Bootstrapper`, and its `bootstrap()` method is where you would scan your persistence layer. In this method, call `consumer.accept()` for every data item you want indexed in the search engine.
- The consumer handles parallel callbacks
- It also ensures single-threaded processing of those callbacks (i.e., indexing into Lucene), as illustrated in the sketch below
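To make that concrete, here is a minimal illustrative sketch of the pattern (not the library's actual implementation): parallel producers enqueue items onto a blocking queue, and a single drainer thread indexes them one at a time.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Illustrative only: parallel producers enqueue, a single thread drains into the index
class QueuedConsumer<T> implements Consumer<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();

    QueuedConsumer(final Consumer<T> indexer) {
        final Thread drainer = new Thread(() -> {
            try {
                while (true) {
                    indexer.accept(queue.take()); // single-threaded indexing into Lucene
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        drainer.setDaemon(true);
        drainer.start();
    }

    @Override
    public void accept(final T item) {
        queue.add(item); // safe to call from parallel bootstrap callbacks
    }
}
```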
Periodic Update
You can define how often the full bootstrap happens. A `PeriodicUpdateEngine` ensures that the bootstrapping process is invoked at regular intervals. The interval is configurable, based on what you think is right for your use-case.
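Conceptually, this is no more than a scheduled re-run of the bootstrap. A rough sketch of the idea (the actual wiring of `PeriodicUpdateEngine` is shown in Step 2 of the Usage section; `bootstrapper` and `consumer` here are assumed to be in scope):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: re-run the full bootstrap at a fixed, configurable interval
final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> bootstrapper.bootstrap(consumer), 0, 60, TimeUnit.SECONDS);
```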
Indexing Rules
You should be able to decide which fields get indexed. As such, the `Bootstrapper` implementation's consumer takes an `IndexableDocument`, wherein you can choose how the item is indexed as a document. The examples in the Usage section should make this clearer.
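For a taste, using the Book example from the Usage section below, you could choose to index only the title and the page count:

```java
// Only the fields you hand to the ForageDocument get indexed; everything else stays out of Lucene
itemConsumer.accept(new ForageDocument(book.getId(), book, ImmutableList
        .of(new TextField("title", book.getTitle()),
            new IntField("numPage", new int[]{book.getNumPage()}))));
```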
Search Queries
You should be able to express your retrieval strategies using the `ForageQuery` class. There are several static helpers in `QueryBuilder` that make constructing queries easy.
- One important prerequisite is that you should be able to pull all data from your database, i.e., you should be able to stream it out as a batched select query (on a relational DB) or a scan (Aerospike, Redis, HBase, or any other non-relational DB), depending on what database you are using (see the JDBC sketch after this list)
- The size of the data should be limited. While it totally depends on how much heap you supply to your Java application, we presume it shouldn't be in the range of tens of millions of rows. This library has been tested for 100k rows in memory. (todo) mention details
- Ensure your application is supplied with sufficient memory. A ballpark for calculating the memory for your base Java application is to (todo)
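For the first point, here is a rough sketch of what streaming out of a relational database could look like with plain JDBC and keyset pagination (the table, columns, and Book constructor are hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.function.Consumer;

// Illustrative only: pull rows out in batches, so the whole table is never held in memory at once
void scanAllBooks(final Connection connection, final Consumer<Book> consumer) throws SQLException {
    String lastId = "";
    while (true) {
        int count = 0;
        try (PreparedStatement ps = connection.prepareStatement(
                "SELECT id, title, author FROM books WHERE id > ? ORDER BY id LIMIT 1000")) {
            ps.setString(1, lastId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    lastId = rs.getString("id");
                    consumer.accept(new Book(lastId, rs.getString("title"), rs.getString("author")));
                    count++;
                }
            }
        }
        if (count == 0) {
            return; // no more rows to stream
        }
    }
}
```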
Add the following Maven dependency:

```xml
<dependency>
    <groupId>com.livetheoogway.forage</groupId>
    <artifactId>forage-search-engine</artifactId>
    <version>${forage.version}</version> <!-- look for the latest version on top -->
</dependency>
```
Let's go the full mile and see what a complete integration would look like.
The sample shows how `Book` items stored in some database can be made searchable. Assume a `Book` with typical properties like `title`, `author`, `rating`, and `numPage`.
Step 1
You would typically start with your datastore/DAO implementations. The following is a good example of what it would look like:
```java
import com.google.common.collect.ImmutableList;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.stream.Collectors;

class DataStore implements Bootstrapper<IndexableDocument>, Store<Book> {
    private final Map<String, Book> books; // This would be your DB connection

    public DataStore() {
        this.books = new HashMap<>(); // You would be initializing your DB connections here
    }

    public void saveBook(final Book book) {
        books.put(book.getId(), book); // You would be saving this in your database
    }

    @Override
    public void bootstrap(final Consumer<IndexableDocument> itemConsumer) {
        // THIS IS THE MAIN IMPLEMENTATION
        // You would scan all rows of your database here, create an individual ForageDocument
        // for each item, and supply it to the consumer.
        // All rules on which fields need to be indexed, and how, should be applied here.
        for (final Book book : books.values()) {
            itemConsumer.accept(new ForageDocument(book.getId(), book, ImmutableList
                    .of(new TextField("title", book.getTitle()),
                        new TextField("author", book.getAuthor()),
                        new FloatField("rating", new float[]{book.getRating()}),
                        new IntField("numPage", new int[]{book.getNumPage()}))));
        }
    }

    // The following function will be called during search operations, to fetch the currently
    // stored data for the matching doc ids. Replace this with an implementation that
    // retrieves the ids from your actual datastore.
    @Override
    public Map<String, Book> get(final List<String> ids) {
        return ids.stream().collect(Collectors.toMap(Function.identity(), books::get));
    }
}
```
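To try it out, you can populate the store before handing it to the search engine in Step 2 (the Book constructor signature is assumed here for illustration):

```java
final DataStore dataStore = new DataStore();
// Book's constructor is assumed to take (id, title, author, rating, numPage)
dataStore.saveBook(new Book("book-1", "The Adventures of Tom Sawyer", "Mark Twain", 4.5f, 327));
```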
Step 2
Your next step involves creating and initializing the SearchEngine, and using it for retrieval:
```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.concurrent.TimeUnit;

@Singleton
public class Container {
    private final SearchEngine<ForageQuery, ForageQueryResult<Book>> searchEngine;

    public Container(final DataStore dataStore) {
        final ForageSearchEngineBuilder<Book> engineBuilder = ForageSearchEngineBuilder.<Book>builder()
                .withDataStore(dataStore)
                .withObjectMapper(new ObjectMapper());
        this.searchEngine = new ForageEngine<>(engineBuilder);
        final PeriodicUpdateEngine<IndexableDocument> updateEngine =
                new PeriodicUpdateEngine<>(
                        dataStore,
                        new AsyncQueuedConsumer<>(searchEngine),
                        60, TimeUnit.SECONDS // how often you want to bootstrap from the database
                );
        updateEngine.start();
    }

    // And while searching, you can do this:
    public void sampleSearch() {
        // retrieve the top 10 books that have numPage between 600 and 800
        final ForageQueryResult<Book> results =
                searchEngine.search(QueryBuilder.intRangeQuery("numPage", 600, 800).buildForageQuery(10));

        // retrieve all books that have "rowling" in author, and "prince" in title
        final ForageQueryResult<Book> result = searchEngine.search(
                QueryBuilder.booleanQuery()
                        .query(new MatchQuery("author", "rowling"))
                        .query(new MatchQuery("title", "prince"))
                        .clauseType(ClauseType.MUST)
                        .buildForageQuery());
    }
}
```
There is a much simpler integration available if your application is a Dropwizard application.
Add the following dependency
```xml
<dependency>
    <groupId>com.livetheoogway.forage</groupId>
    <artifactId>forage-dropwizard-bundle</artifactId>
    <version>${forage.version}</version> <!-- look for the latest version on top -->
</dependency>
```
In your Application, register the `ForageBundle`:

```java
public class MyApplication extends Application<MyConfiguration> {
    // other stuff

    @Override
    public void initialize(final Bootstrap<MyConfiguration> bootstrap) {
        bootstrap.addBundle(new ForageBundle<>() {
            @Override
            public Store<Book> dataStore(final MyConfiguration configuration) {
                return store; // the one that retrieves data given ids
            }

            @Override
            public Bootstrapper<IndexableDocument> bootstrap(final MyConfiguration configuration) {
                return store; // the one that implements the bootstrap
            }

            @Override
            public ForageConfiguration forageConfiguration(final MyConfiguration configuration) {
                return configuration.getForageConfiguration(); // have ForageConfiguration as part of your main config class
            }
        });
    }
}
```
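For the last override, your main configuration class would carry a ForageConfiguration. A minimal sketch (the field name and YAML key are assumptions):

```java
public class MyConfiguration extends Configuration {
    private ForageConfiguration forageConfiguration; // assumed to map from a "forageConfiguration" block in your config YAML

    public ForageConfiguration getForageConfiguration() {
        return forageConfiguration;
    }
}
```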
The following types of queries are currently supported:

- Simple Term Match: `QueryBuilder.matchQuery("title", "sawyer").buildForageQuery()`
- Fuzzy Query: you can try a fuzzy match for retrieving results: `QueryBuilder.fuzzyMatchQuery("title", "sayyer").buildForageQuery()`
- Range Queries: `QueryBuilder.intRangeQuery("numPage", 600, 800).buildForageQuery()`
- Boolean Queries:

```java
QueryBuilder.booleanQuery()
        .query(new MatchQuery("author", "rowling"))
        .query(new MatchQuery("title", "prince"))
        .clauseType(ClauseType.MUST) // or SHOULD, MUST_NOT, FILTER
        .buildForageQuery();
```

- Page Queries and Paginated Results:

```java
ForageQueryResult<Book> result = searchEngine.search(QueryBuilder.matchQuery("author", "rowling").buildForageQuery(15)); // first 15 items
ForageQueryResult<Book> result2 = searchEngine.search(new PageQuery(result.getNextPage(), 20)); // next 20 items
```

- Phrase Match Query: `QueryBuilder.phraseMatchQuery("title", "Tom Sawyer").buildForageQuery()`
- Match All Query: `QueryBuilder.matchAllQuery().buildForageQuery()`
The library is built on:
- Java 11
- Lucene 9.1.0
- Dropwizard 2.1.0 (optional)
Please raise issues, bugs, or any feature requests at GitHub Issues.
If you plan on contributing to the code, fork the repository and raise a Pull Request back here.
- Core and the bootstrapper diagram with the queued listeners
- Lucene internals being masked
- Searchers
- Attributes being stored for field conversion
- Helpers for query creation
- Fuzzy Query Support
- Dropwizard bundle for simpler integrations
- Expose Scoring and boosting
- Phrase Query Support
- Auto complete query Support
- Expose explain query (`IndexSearcher.explain(Query, doc)`)