Over the last year or so I've been looking quite a bit at Elasticsearch for use as a general purpose time series database for operational data.
Whilst there are definitely a lot of options in this space, most notably InfluxDB, I keep coming back to Elasticsearch when we're talking about large volumes of data with a lot of analytical workloads.
More than a few times, I've been asked to explain what Elasticsearch looks like to the would-be developer/operations person. This isn't too surprising; the documentation isn't great at giving a real-world architectural overview, which would really help contextualise the rest of the docs.
Having been asked again today, I've decided to write one up here - so I can save some time explaining the same thing over and over.
It's important to note for the would-be reader that this is my imperfect understanding of Elasticsearch. If you spot any glaringly obvious errors, please let me know and I will update this accordingly; you'll have my eternal thanks for helping me grok this a little better.
So, as they say: the show must go on!
Elasticsearch is a search engine built on top of Apache Lucene. It's great for full-text search (duh) as well as analytical workloads with ad hoc queries and aggregations.
In NoSQL parlance, it would be classified as a document-oriented database, and that's primarily how your application will interact with it:
- You insert your DOCUMENT into an INDEX.
- An INDEX is backed by one or more SHARDS.
- A SHARD can have zero or more REPLICA SHARDS.
- ELASTICSEARCH runs as a CLUSTER made up of NODES.
- A SHARD is a Lucene index.
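To make the document-to-shard relationship concrete, here's a rough sketch of how Elasticsearch decides which primary shard a document lands on. The real implementation hashes a routing value (the document's `_id` by default) with murmur3; I've substituted Python's `zlib.crc32` purely for illustration.

```python
# Sketch of Elasticsearch's shard routing: shard = hash(routing) % num_primary_shards.
# Elasticsearch actually uses murmur3; zlib.crc32 stands in here for illustration.
import zlib

def route_to_shard(routing_value: str, num_primary_shards: int) -> int:
    """Return the primary shard a document with this routing value lands on."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# The same routing value always maps to the same shard, which is exactly
# why the number of primary shards is fixed when the index is created:
# changing it would re-route every existing document.
shard = route_to_shard("doc-42", 5)
assert 0 <= shard < 5
```

This routing formula is also the reason behind the constraints I mention below: you can add replicas to an existing index at any time, but you can't change the number of primary shards without reindexing.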
I won't cover setting up and maintaining quorum in the cluster, because that's pretty well covered elsewhere. If you're running on AWS there's even a managed product available which helps simplify things a lot.
For the keen observer, that last bullet point raises some interesting constraints when managing your cluster.
Elasticsearch manages the underlying Lucene indexes, but remember that Lucene's index data doesn't live in the JVM heap. Because your index files are memory-mapped into virtual memory, make sure you leave enough physical memory free, outside the heap, for your index data (Lucene).
Don't allocate more than 32GB of heap to Elasticsearch. Past that point the JVM can no longer use compressed object pointers, which will slow things down heaps (see what I did there?).
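Those two points combine into a rule of thumb (this is common sizing guidance, not an official formula): give Elasticsearch roughly half the box's RAM, capped safely below the 32GB compressed-pointers threshold, and leave the rest for Lucene and the filesystem cache.

```python
# Rough heap-sizing rule of thumb (common guidance, not an official formula):
# heap = min(half of physical RAM, ~31GB), leaving the remainder for the
# OS filesystem cache that Lucene's memory-mapped index data depends on.
def suggest_heap_gb(physical_ram_gb: int) -> int:
    COMPRESSED_OOPS_LIMIT_GB = 31  # stay safely under the 32GB threshold
    return min(physical_ram_gb // 2, COMPRESSED_OOPS_LIMIT_GB)

print(suggest_heap_gb(16))   # 8GB heap, 8GB left for Lucene / page cache
print(suggest_heap_gb(128))  # capped at 31GB despite the extra RAM
```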
Read this. No really.
In short, deploying Elasticsearch for your search application requires some careful planning on both the ingest and query side.
Primarily you want your cluster to have enough memory so your busy shards (read or write) stay resident in physical memory. Otherwise your nodes will spend all their time paging data in and out of disk, which defeats the point!
Read the designing for scale part of the Elasticsearch guide.
Your best bet is to split the data across indexes by some meaningful criterion, and for time series data, splitting by time period is a natural fit.
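For example, a common pattern is one index per day, so writes always hit a small "hot" index and old data can be expired by deleting whole indexes rather than individual documents. A minimal sketch (the `metrics` prefix and date format are made-up examples):

```python
# Sketch: deriving a daily index name for time-series data, so that writes
# go to a small "hot" index and retention is just deleting old indexes.
# The "metrics" prefix is a hypothetical example.
from datetime import datetime, timezone

def index_for(timestamp: datetime, prefix: str = "metrics") -> str:
    """Return the daily index name a data point should be written to."""
    return f"{prefix}-{timestamp.strftime('%Y.%m.%d')}"

ts = datetime(2016, 3, 14, 9, 30, tzinfo=timezone.utc)
print(index_for(ts))  # metrics-2016.03.14
```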
Elasticsearch has a feature called index templates, which are super useful for dynamically creating indexes with specific settings. Writes are directed to the correct index and you can automatically have the new index added to an index alias for reads.
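To sketch what an index template body looks like, here's one expressed as a Python dict you would PUT to the template API (this uses the older `_template`-style body; the names, shard counts, and alias here are made-up examples, so check the template API docs for your Elasticsearch version):

```python
# Sketch of an (older-style) index template body as a Python dict.
# All names and settings are hypothetical examples.
import json

metrics_template = {
    "template": "metrics-*",      # applies to any new index matching this pattern
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1,
    },
    "aliases": {
        "metrics-read": {},       # each new daily index joins this read alias
    },
}

print(json.dumps(metrics_template, indent=2))
```

With something like this in place, writing to `metrics-2016.03.14` creates the index with the right settings automatically, and reads against the alias span every daily index at once.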
Elasticsearch is a great tool, but it requires you to plan ahead. Hopefully I've given you a good introduction to how things hang together and where the sharp edges are.
I highly recommend reading through the whole of Elasticsearch: The Definitive Guide; with the above information in mind ahead of time, I think it makes for a much more cohesive read.
Good luck and happy hacking!
Thanks to @ZacharyTong for pointing out that Lucene does in fact support paging index segments (statement removed), and @warkolm for spotting a mistake regarding number of replicas and an opportunity to clarify!