Elasticsearch

Over the last year or so I’ve been looking quite a bit at Elasticsearch for use as a general purpose time series database for operational data.

Whilst there are definitely a lot of options in this space, most notably InfluxDB, I keep coming back to Elasticsearch when we’re talking about large volumes of data with heavy analytical workloads.

More than a few times, I’ve been asked to explain what Elasticsearch looks like to the would-be developer/operations person. This isn’t too surprising: the documentation isn’t great at giving a real-world architectural overview, which can really help contextualise the rest of the documentation.

Having been asked again today, I’ve decided to write one up here - so I can save some time explaining the same thing over and over.

It’s important to note for the would-be reader that this is my imperfect understanding of Elasticsearch. If you spot any glaringly obvious errors, please let me know and I will update this accordingly. You’ll have my eternal thanks for helping me grok this a little better.

So, as they say: the show must go on!

Elasticsearch concepts

Elasticsearch is a search engine built on top of Apache Lucene. It’s great for full text search (duh) as well as analytical workloads with ad hoc queries and aggregations.

In NoSQL parlance, it would be classified as a document-oriented database, and that’s primarily how your application will interact with it.

You insert your DOCUMENT into an INDEX.

An index is a namespace for documents and is analogous to a database table
Unlike a table, there is no schema per se
You define what fields are indexed for search at this level

An INDEX is backed by one or more SHARDS

When you create an index, you specify the number of primary shards your data will be split across.
Elasticsearch hashes a routing value (the document ID by default) to determine which shard your document is stored in and accessed from.
You cannot change the number of primary shards after the index is created.
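
To see why the primary shard count is fixed, here is a minimal sketch of hash-based routing. It assumes the shard is picked as the hash of the document ID modulo the number of primary shards. `zlib.crc32` stands in for whatever hash Elasticsearch really uses, and `pick_shard` is a hypothetical helper for illustration, not an Elasticsearch API.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a shard number (illustrative only).

    zlib.crc32 is a stand-in hash, not the function Elasticsearch uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


doc_ids = [f"event-{n}" for n in range(1000)]

# Count how many documents would land on a different shard if the
# number of primary shards changed from 5 to 6.
moved = sum(1 for d in doc_ids if pick_shard(d, 5) != pick_shard(d, 6))
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

For a fixed shard count the same ID always maps to the same shard. Once the count changes, many IDs map somewhere else, so existing documents could no longer be found without reindexing them.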

A SHARD can have zero or more REPLICA SHARDS

For each primary shard in your index you will have X replica copies (defined in the index settings)
If the node hosting the primary shard fails, a replica shard will be promoted to primary
Replica shards are used for scaling out read performance - the more replica shards you have, the more reads you can service.
Unlike primary shards, you can change the number of replica shards any time
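
As a rough sketch of the arithmetic, the total number of shard copies the cluster has to place is the primary count multiplied by one plus the replica count. The setting names `number_of_shards` and `number_of_replicas` are the index settings as I understand them. The values and the `total_shard_copies` helper are made up for illustration.

```python
def total_shard_copies(primary_shards: int, replicas_per_primary: int) -> int:
    """Each primary has `replicas_per_primary` copies, plus the primary itself."""
    return primary_shards * (1 + replicas_per_primary)


# Settings you might supply when creating an index (example values).
create_settings = {"settings": {"number_of_shards": 5, "number_of_replicas": 1}}

# Body for a later settings update: only the replica count changes,
# because the primary shard count is fixed at creation time.
update_settings = {"index": {"number_of_replicas": 2}}

before = total_shard_copies(5, 1)
after = total_shard_copies(5, 2)
print(before, after)  # 10 15
```

Going from one replica to two adds five more shards for the cluster to allocate, so keep an eye on how many nodes you have to host them.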

ELASTICSEARCH runs as a CLUSTER made up of NODES

Nodes automatically form a cluster when correctly configured
Elasticsearch will automatically distribute (and move) shards as needed by your index configuration
Your application can talk to any node in the cluster and they will forward your request to a node with the data to service your request
If you have a busy cluster, you can deploy proxy nodes (Elasticsearch calls these client or coordinating-only nodes) - these are nodes that don’t store shards and can be used to direct incoming requests

A SHARD is a Lucene index

Every time your query needs to access a shard, the Lucene engine needs to be running for that data
Don’t confuse a Lucene index (a shard) with an Elasticsearch Index (a collection of shards)

Care and feeding of your cluster

I won’t cover setting up and maintaining quorum in the cluster, because that’s pretty well covered elsewhere. If you’re running on AWS there’s even a managed product available which helps simplify things a lot.

For the keen observer, that last bullet point raises some interesting constraints when managing your cluster.

Elasticsearch manages Lucene, but remember that they don’t use the same memory: Elasticsearch itself runs in the JVM heap, while Lucene’s index data is held outside the heap. Because your index data is mapped into virtual memory, make sure you leave enough memory free for it (Lucene).

Don’t allocate more than 32GB of heap to Elasticsearch. Due to how Java addresses memory (beyond roughly 32GB the JVM can no longer use compressed object pointers), this will slow things down heaps (see what I did there?).
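
The two constraints above can be combined into a simple sizing rule. This is only a sketch under my own assumptions: the 50% split between heap and index data is a common rule of thumb rather than something stated here, the 31 GiB ceiling is just a safety margin below the 32GB threshold, and `suggested_heap_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3

# Assumptions for this sketch, not official limits:
HEAP_CEILING_BYTES = 31 * GIB  # stay safely under the ~32GB threshold
HEAP_FRACTION = 0.5            # leave the other half for Lucene's index data


def suggested_heap_bytes(physical_ram_bytes: int) -> int:
    """Suggest a JVM heap size for a node with the given physical RAM."""
    return min(int(physical_ram_bytes * HEAP_FRACTION), HEAP_CEILING_BYTES)


for ram_gib in (16, 64, 128):
    heap = suggested_heap_bytes(ram_gib * GIB)
    print(f"{ram_gib} GiB RAM -> {heap // GIB} GiB heap")
```

On a 16 GiB node this suggests an 8 GiB heap. On 64 GiB and 128 GiB nodes the ceiling kicks in and the heap stays at 31 GiB, leaving the rest for index data.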

Read this. No really.

Working with Elasticsearch

In short, deploying Elasticsearch for your search application requires some careful planning on both the ingest and query side.

Primarily you want your cluster to have enough memory so your busy shards (read or write) stay resident in physical memory. Otherwise your nodes will spend all their time paging data in and out of disk, which defeats the point!

Read the designing for scale part of the Elasticsearch guide.

Your best bet is to split the data across indexes by a meaningful criterion. For time series data, splitting by time period is a natural fit.

Elasticsearch has a feature called index templates, which are super useful for dynamically creating indexes with specific settings. Writes are directed to the correct index and you can automatically have the new index added to an index alias for reads.
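
Here is a minimal sketch of what that could look like for daily indexes. It only builds the index name and the template body, without talking to a cluster. The `metrics` prefix, the `metrics-read` alias, the shard counts and both helper functions are made up for illustration, and the template field names follow the legacy template format as I understand it, so check the documentation for your Elasticsearch version before relying on them.

```python
from datetime import date


def daily_index_name(prefix: str, day: date) -> str:
    """Build a per-day index name such as 'metrics-2016.03.01'."""
    return f"{prefix}-{day:%Y.%m.%d}"


def build_index_template(prefix: str, read_alias: str) -> dict:
    """Build a template body that matches every '<prefix>-*' index.

    Field names are an assumption based on the legacy template format;
    verify them against the documentation for your version.
    """
    return {
        "index_patterns": [f"{prefix}-*"],
        "settings": {"number_of_shards": 3, "number_of_replicas": 1},
        # Every index created from this template joins the read alias.
        "aliases": {read_alias: {}},
    }


template = build_index_template("metrics", "metrics-read")
write_index = daily_index_name("metrics", date(2016, 3, 1))
print(write_index)  # metrics-2016.03.01
```

Your writer sends each document to the index for its day, the template applies the settings when that index is first created, and readers query the alias without needing to know which daily indexes exist.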

Conclusion

Elasticsearch is a great tool, but requires you to plan ahead. Hopefully I’ve given you a good introduction to how things hang together and where the sharp edges are.

I highly recommend reading through the entire Elasticsearch: The Definitive Guide. With the above information in mind ahead of time, I think it makes for a much more cohesive read.

Good luck and happy hacking!

Thanks to @ZacharyTong for pointing out that Lucene does in fact support paging index segments (statement removed), and @warkolm for spotting a mistake regarding number of replicas and an opportunity to clarify!