ZomboDB - A Short History

Apr 6, 2017 3:38 pm EDT in history

ZomboDB is a Postgres extension that allows you to CREATE INDEX-es that use Elasticsearch. ZomboDB allows you to horizontally scale your indices to provide extremely fast text search and analytics to your applications without directly communicating with Elasticsearch. Everything you do with ZomboDB is transaction-safe and happens through standard Postgres SQL via the Postgres drivers/API/UIs you already use today.

Origins

ZomboDB started as an internal development project at Technology Concepts & Design, Inc. TCDI is a leading e-Discovery company that develops, hosts, and provides support services around its own software for document review, search, reporting, analytics, and production.

In 2013, we had outgrown the database system we were using (which shall remain unnamed). Despite being
a 32-bit DBMS, it had fairly sophisticated and unique text search capabilities. And despite its sophisticated text-search capabilities, it had severe limitations around concurrency, the amount of data it could index/search, performance, and most importantly: correctness.

We were already using PostgreSQL and Solr (as separate data stores) for various applications, so we started to look into how we could leverage them together to whole-sale replace our legacy DBMS without losing any of the sophisticated full-text search capabilities it provided our applications and users.

Basically, we wanted Postgres to be the "source of truth" and Solr to provide the text search capabilities, but we didn't want our applications to talk directly to Solr.

The legacy DBMS was JDBC-compatible, so we also wanted to limit the amount of changes we'd need to make to our existing applications. We wanted to stay with the Java JDBC model and at worst, just change the applications' query generators to support whatever our custom solution would support.

Solr was pretty quickly ruled out as the "text index" for various reasons and we began to investigate Elasticsearch instead. Elasticsearch, even in 2013, had very robust documentation, was extremely simple to ingest data, scaled easily, and was incredibly fast, even for extremely complicated full-text queries.

After some quick prototyping it was time to move forward with integrating Postgres and Elasticsearch so that we could provide a single point of entry for both relational and text-search.

v1.0: A Foreign Data Wrapper

In 2013, Postgres 9.3 was the current version, and it provided a new interface called "Foreign Data Wrappers". FDWs let you implement a way to expose data from a remote (and generally distinct-from-Postgres) data-store as if it were a local Postgres table. With Postgres 9.3, this was a read-only API – SELECT statements were supported but INSERT/UPDATE/DELETE were not.

We created a Postgres extension that implemented a Foreign Data Wrapper that knew how to talk to Elasticsearch, along with a series of SQL functions, triggers to synchronize the remote Elasticsearch index, and SQL-level management functions for creating/dropping indices.

This approach worked, but there was one gigantic hole (which is typically the same of any Postgres-to-Elasticsearch synchronization scheme): no transaction isolation!

After awhile it became clear that it was extremely simple to cause concurrent Postgres sessions to see incorrect results from the Elasticsearch index. Each session had full visibility into every document in the remote index including documents from inserted/updated/deleted rows in Postgres that had not yet been committed. Even worse, sessions could see in Elasticsearch what were technically aborted rows in Postgres.

Not only did this cause problems for general SELECT statements but it also caused various Elasticsearch-specific aggregate functions to return incorrect results.

Something had to be done about this, and fast!

v2.0: An Access Method

The good news here is Postgres. We decided from the start that Postgres was going to be our source of truth, so it's not as if we had "corrupt data". Simply that our remote Elasticsearch indices weren't staying in sync, in a transactionally-isolated manner, with Postgres. Further, our applications never made decisions about updating/deleting rows based on the results of a text-search (results from Elasticsearch), so all our data in Postgres was sound.

Postgres has an API called "Access Methods" that defines a strict interface around implementing an index type. Postgres' internal btree, hash, gin, and gist indexes are all implemented with this API. It's assumed that a new implementation of this API will work similarly to the built-in types in that the new implementation will store its index tuples in local storage, going through Postgres buffer management and such. But that's not strictly required, per se.

The primary benefit of the Postgres Access Method API is that it, if implemented correctly, gives queries using indexes of that type full MVCC transaction isolation, along with tie-ins to (auto)vacuum for background cleanup. Additionally, with proper statistics estimations, Postgres will use the index and plan queries with Index Scans or Bitmap Index Scans. Finally, Access Methods receive new tuples during INSERT/UPDATE statements at the proper times during query execution, which means, for ZomboDB, that its synchronization to Elasticsearch would always be correct.

After many months of development, ZomboDB transformed from a Foreign Data Wrapper into an Access Method, and our internal applications were updated, yet again, to support the SQL-level syntax changes this required.

Shortly after this development work was finished and rolled into production, life necessitated that I quit TCDI and move half-way across the country. TCDI was generous enough to open-source ZomboDB, and I've continued its development as my full-time career.

What's Next?

That was July 2015 when ZomboDB was open-sourced. I've since formed a company around its development and support, and have lots of great things planned.

This blog will continue to be updated with more details around ZomboDB – subjects ranging from internal implementation details, interesting problems it can solve, insidious bugs that have long-since been fixed, new features, etc.

Thanks for reading!

Recent Posts

ZomboDB - A Short History

Apr 6 2017

ZomboDB is a Postgres extension that allows you to CREATE INDEX-es that use Elasticsearch. ZomboDB allows you to horizontally scale your indices to provide extremely fast text search and analytics to your applications without directly communicating...

Read More