Connecting HBase to Elasticsearch in 10 minutes or less

This might be useful to Elasticsearch users who want to store their data in HBase, and to HBase users who wish to enable full-text search on their existing tables via a REST API.

Elasticsearch is a distributed search engine based on Apache Lucene; HBase is a distributed database built on top of Apache Hadoop. There is some overlap between the two, since Elasticsearch also functions as a database, storing both its search indexes and the underlying text data. There has been a good deal of discussion lately on reasons NOT to use Elasticsearch as a primary datastore -- among them weak consistency guarantees that can result in data loss -- with the recommended alternative being to use it primarily as an indexing engine while delegating storage to a safer database. Data ingest can be handled by the JDBC River, a pluggable service running within the Elasticsearch cluster that pulls data from (or is pushed data by) a JDBC source and indexes it into the cluster.

Elasticsearch JDBC River

While still under active development, HBase is a relatively mature and reliable distributed database based on Google's BigTable design, with very good scalability and performance. It supports random, realtime read/write access to very large tables -- billions of rows by millions of columns -- atop clusters of commodity hardware. In fact, Facebook Messages' search engine is built with an inverted index stored in HBase.

Apache Phoenix is a SQL skin over HBase that essentially turns HBase into a distributed RDBMS: for example, you can execute standard relational operations such as SQL SELECT or JOIN over multiple tables containing billions of rows spread across hundreds of machines. The Phoenix JDBC driver allows easy integration with the Elasticsearch JDBC River, as shown below.
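For a flavor of the SQL surface, here is a sketch of a join as it would look in Phoenix's SQLLine client (test.customers and its columns are hypothetical and not created in the walkthrough below):

> SELECT o.id, c.name FROM test.orders o JOIN test.customers c ON o.customer_id = c.id;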

Setup

Prerequisites: JDK 7 installed and JAVA_HOME set

$ export JAVA_HOME=`/usr/libexec/java_home -v 1.7`
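To double-check which JDK will be used (optional):

$ $JAVA_HOME/bin/java -version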

1. Download HBase
$ wget http://mirrors.ibiblio.org/apache/hbase/hbase-0.94.22/hbase-0.94.22.tar.gz
$ tar xvzf hbase-0.94.22.tar.gz
$ cd hbase-0.94.22; export HBASE_HOME=`pwd`; cd ../
2. Download Phoenix
$ wget http://www.motorlogy.com/apache/phoenix/phoenix-3.1.0/bin/phoenix-3.1.0-bin.tar.gz
$ tar xvzf phoenix-3.1.0-bin.tar.gz
$ cd phoenix-3.1.0-bin; export PHOENIX_HOME=`pwd`;  cd ../
Note: if you're using HBase 0.98 with Hadoop2, you'll want to download the 4.1.0 release tar instead.

3. Copy the Phoenix core jar to the HBase lib directory
$ cp $PHOENIX_HOME/common/phoenix-core-3.1.0.jar $HBASE_HOME/lib/
Note: if you're using HBase 0.98 with Hadoop2, copy hadoop2/phoenix-4.1.0-server-hadoop2.jar to the HBase lib directory instead.

4. Start HBase in standalone mode
$ $HBASE_HOME/bin/start-hbase.sh
Check that HBase is up: http://localhost:60010
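Alternatively, the HBase shell offers a quick health check from the command line (a minimal sketch; status and exit are built-in shell commands):

$ $HBASE_HOME/bin/hbase shell
> status
> exit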

5. Create a table in HBase using SQLLine
$ $PHOENIX_HOME/hadoop1/bin/sqlline.py localhost
> CREATE TABLE test.orders ( id BIGINT not null primary key, name VARCHAR);
> UPSERT INTO TEST.ORDERS(ID,NAME) VALUES(123,'foo');
> UPSERT INTO TEST.ORDERS(ID,NAME) VALUES(456,'bar');
> SELECT * FROM test.orders;
> !quit
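Since Phoenix tables are backed by plain HBase tables, you can optionally confirm from the HBase shell that the table exists (Phoenix also creates its own SYSTEM tables, so expect a few extra entries):

$ $HBASE_HOME/bin/hbase shell
> list
> exit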
6. Download Elasticsearch
$ wget http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.2.zip
$ unzip elasticsearch-1.3.2.zip;
$ cd elasticsearch-1.3.2; export ELS_HOME=`pwd`
7. Install Elasticsearch JDBC plugin
$ $ELS_HOME/bin/plugin --install jdbc --url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.3.0.4/elasticsearch-river-jdbc-1.3.0.4-plugin.zip
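To verify the plugin was installed (optional):

$ $ELS_HOME/bin/plugin --list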
8. Copy the Phoenix JDBC driver to the Elasticsearch plugins directory
$ cp $PHOENIX_HOME/hadoop1/phoenix-3.1.0-client-hadoop1.jar $ELS_HOME/plugins/jdbc/ 
Note: if you're using HBase 0.98 with Hadoop2, copy hadoop2/phoenix-4.1.0-client-hadoop2.jar to the plugins directory instead.

9. Start Elasticsearch in standalone mode
$ nohup $ELS_HOME/bin/elasticsearch &
$ tail nohup.out
Check if it's running:
$ curl -X GET http://localhost:9200/
10. Create a JDBC River named 'phoenix_jdbc_river'
$ curl -XPUT 'localhost:9200/_river/phoenix_jdbc_river/_meta' -d '{
    "type" : "jdbc",
    "jdbc" : {
        "url" : "jdbc:phoenix:localhost",
        "user" : "",
        "password" : "",
        "sql" : "select * from test.orders"
    }
}'
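The river begins polling shortly after creation. To confirm it was registered, read back the configuration document you just stored:

$ curl 'localhost:9200/_river/phoenix_jdbc_river/_meta?pretty'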
11. Get data via the Search API
$ curl 'localhost:9200/jdbc/_search?pretty&q=*'
Results:
{
  "took" : 46,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "jdbc",
      "_type" : "jdbc",
      "_id" : "twfnoYwNT0W4sRILyjfepA",
      "_score" : 1.0,
      "_source":{"ID":456,"NAME":"bar"}
    }, {
      "_index" : "jdbc",
      "_type" : "jdbc",
      "_id" : "BjLNiqOwS7m8HZiuwlCgow",
      "_score" : 1.0,
      "_source":{"ID":123,"NAME":"foo"}
    } ]
  }
}
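With the rows indexed, field-level full-text queries work as expected; for example, searching the NAME field from the results above:

$ curl 'localhost:9200/jdbc/_search?pretty&q=NAME:foo'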

Thanks to James Taylor (@JamesRTaylor) and Jörg Prante (@jprante) for making this quick integration possible.