This is ZomboDB v3.0. It brings significant performance, robustness, and compatibility improvements over the v2.6 series.
ZomboDB now supports Postgres 9.3, 9.4, and 9.5. From v3.0 forward, every ZomboDB release will include rpm/deb/tgz packages for all three versions.
USING zombodbindex in all databases. It may be more simple to do something like:
$ for db in $(psql -qAtl | cut -f1 -d\|); do
echo "Processing $db"
psql $db -c "REINDEX DATABASE $db;"
- ZomboDB is now highly sensitive to Postgres' "visibility_map" (read below) so it might be a good idea to
VACUUM; each of your databases as well
- ZomboDB no longer supports "Bitmap Index Scan" plans. Supporting these introduced more overhead than a regular index scan, and complicated the code. This is nothing to be concerned about as Postgres knows that ZomboDB doesn't support "Bitmap Index Scans" so it won't even attempt to consider one
Previously, ZomboDB stored row visibility information in Elasticsearch using a custom trigger on each table with a ZomboDB index. This violated the contract of the Postgres Access Method API which is that indexes don't store visibility information – only the heap. Additionally, this necessitated that ZomboDB use parent/child queries in Elasticsearch to answer certain queries. Parent/child queries introduce quite a bit of search overhead, not to mention the 2x increase in the number of rows ES has to manage per index.
With v3.0, this is no longer the case. Instead, ZomboDB builds what it calls an "invisibility map" at search time to provide a list of rows that cannot be seen by the current transaction. The implementation detail here is that ZomboDB probes Postgres' "visibility_map", page-by-page, and for any page that is not known to be "all visible", inspects each tuple on that page to determine its visibility to the current transaction snapshot. Any tuples that are invisible are added to an exclusion list when the query is sent to Elasticsearch.
This process only needs to happen when ZomboDB's aggregate functions or zdb_estimate_count() are called. Normal SELECT statements do not need to build an "invisibility map".
ZomboDB also stores the creating transaction's TransactionId with the row in Elasticsearch (using the 64bit-epoch encoded TransactionId so that indexes are immune to wraparound effects). This value is used to also filter out concurrent modifications being made by other sessions.
Because of the "invisibility map", it is extremely important that tables be vacuumed as aggressively as possible. Make sure that autovacuum is turned on and that its thresholds are as low as your I/O subsystem can afford.
My recommendation is to set, in postgresql.conf:
Or something nearly as aggressive. The more dead tuples that collect in your table, the slower ZomboDB aggregate and zdb_estimate_count() queries will be, because a list of the dead tuple ids need to be sent to Elasticsearch to be excluded from the query.
It's also important to note that ZomboDB <=2.6 had caveats around VACUUM FREEZE and wraparound vacuums leaving ZDB indexes in an inconsistent state. These caveats no longer exist.
Additionally, it was previously possible to cancel a running INSERT/UPDATE/DELETE/COPY and leave the ZomboDB index in an inconsistent state. This too, is no longer an issue – DML can be cancelled without discretion.
ZomboDB runs user queries through a number of optimizations to generate the most efficient Elasticsearch QueryDSL possible. ZDB's optimizations for solving index links has been improved such that SIREn can be automatically used, providing significant performance improvements when a query references fields from linked indexes.
SIREn is detected at runtime, and if it is installed on the Elasticsearch cluster, will be used automatically. If not installed, ZomboDB's naive implementation will be used (which is still quite fast).
In either case, ZomboDB does some estimation on the number of rows that will be returned from a linked index and if that number is greater than 50% of the total index size, it doubly inverts the query such that the smaller set of terms will be "joined". By itself, this provides significant performance improvements for typical query patterns that involve linked indexes.
A few new index parameters have been added to aid in managing performance in certain situations.
refresh_interval: allows control of when the underlying Elasticsearch indexes are refreshed and new documents are visible. Defaults to -1, meaning "as soon as Postgres dictates"
ignore_visibility: allows ZomboDB to ignore row visibility per index. This can improve performance of aggregate queries if you don't mind having dead/dirty rows counted in the results
always_resolve_joins: ZomboDB has the ability to target the specific index used when executing an aggregate query that only uses one linked index. If the relationship to that index is known to be one-to-many (or many-to-one or many-to-many), this parameter should be set to true. The default is false.
- upper-bound limits on
bulk_concurrency have been increased to allow more flexibility in tuning indexing performance
These are controlled per session using the
zombodb.batch_mode: allows ZomboDB to batch changes until transaction commit, instead of per statement. Can significantly improve performance of long-running transactions that, for example, INSERT/UPDATE many rows, but one at a time. The default is false.
zombodb.ignore_visibility: similar to the index setting of the same name, but can be controlled per session.
In no particular order, this is a list of specific areas where performance has been improved. - If the table being indexed does not have any columns of type 'json', ZomboDB no longer tries to strip CRLF's from the row_to_json() encoded form of the row, slightly improving indexing performance - Single-term wildcard queries against a field of type 'fulltext_with_shingles' are rewritten to use a regular expression query that matches fewer terms - As noted above, index links have generally been improved, and even more-so if SIREn is installed
zdb_dump_query(table_name regclass, query text): This function shows the Elasticsearch QueryDSL that will be used for the specified query. Very useful for debugging query execution
zdb_update_mapping(table_name regclass): When an
ALTER TABLEstatement is executed, you should call this function to push the mapping or index configuration changes to Elasticsearch. Hopefully a future version of ZomboDB will make this automatic, negating the need for this function.
Because index link path solving has been significantly improved, it's now possible for the left-hand field to reference another named link. No longer is it required that the left-hand field always be from the base table. As long as there's some path through all the defined links, ZomboDB will find it. This allows for more sophisticated "join" queries.
ZomboDB now enables doc values for all types that support it. This should help improve Elasticsearch memory consumption on highly-loaded clusters and also improve aggregation performance.
The (new-to Postgres 9.4)
jsonb Postgres data type is fully supported. It is indexed as a "nested object" in Elasticsearch just as the
Many of the documentation files have been updated to reflect new things in v3.0 or to generally clarify (or actually document) existing features.
I'd like to thank @taspotts, @tcdi, and @ryancutter for their testing and general support. Would also like to thank all the random folks that have been pressuring the project to support Postgres 9.4 and 9.5