
Migrating data to ClickStack from Elastic

Parallel operation strategy

When migrating from Elastic to ClickStack for observability use cases, we recommend a parallel operation approach rather than attempting to migrate historical data. This strategy offers several advantages:

  1. Minimal risk: by running both systems concurrently, you maintain access to existing data and dashboards while validating ClickStack and familiarizing your users with the new system.
  2. Natural data expiration: most observability data has a limited retention period (typically 30 days or less), allowing for a natural transition as data expires from Elastic.
  3. Simplified migration: no need for complex data transfer tools or processes to move historical data between systems.

Migrating data

We demonstrate an approach for migrating essential data from Elasticsearch to ClickHouse in the section "Migrating data". This approach should not be used for larger datasets, as it is rarely performant: it is limited by Elasticsearch's ability to export data efficiently, with only the JSON format supported.

Implementation steps

  1. Configure Dual Ingestion

Set up your data collection pipeline to send data to both Elastic and ClickStack simultaneously. How this is achieved depends on your current agents for collection - see "Migrating Agents".

  2. Adjust Retention Periods

Configure Elastic's retention settings (its index lifecycle policies) to match your desired retention period. Set up the ClickStack TTL to maintain data for the same duration.

  3. Validate and Compare:

  • Run queries against both systems to ensure data consistency
  • Compare query performance and results
  • Migrate dashboards and alerts to ClickStack. This is currently a manual process.
  • Verify that all critical dashboards and alerts work as expected in ClickStack
  4. Gradual Transition:

  • As data naturally expires from Elastic, users will increasingly rely on ClickStack
  • Once confidence in ClickStack is established, you can begin redirecting queries and dashboards

Long-term retention

For organizations requiring longer retention periods:

  • Continue running both systems in parallel until all data has expired from Elastic
  • ClickStack's tiered storage capabilities can help manage long-term data efficiently.
  • Consider using materialized views to maintain aggregated or filtered historical data while allowing raw data to expire.
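
A minimal sketch of the materialized-view approach follows. It assumes a hypothetical raw logs table with timestamp, service, and level columns; all table and column names here are illustrative:

```sql
-- Illustrative only: roll up daily error counts per service so the aggregate
-- survives after the raw rows expire via TTL.
CREATE TABLE logs_daily_errors
(
    day         Date,
    service     LowCardinality(String),
    error_count UInt64
)
ENGINE = SummingMergeTree
ORDER BY (day, service);

CREATE MATERIALIZED VIEW logs_daily_errors_mv TO logs_daily_errors AS
SELECT
    toDate(timestamp) AS day,
    service,
    count()           AS error_count
FROM logs
WHERE level = 'error'
GROUP BY day, service;
```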

Migration timeline

The migration timeline will depend on your data retention requirements:

  • 30-day retention: Migration can be completed within a month.
  • Longer retention: Continue parallel operation until data expires from Elastic.
  • Historical data: If absolutely necessary, consider using the approach described in "Migrating data" below to import specific historical data.

Migrating settings

When migrating from Elastic to ClickStack, your indexing and storage settings will need to be adapted to fit ClickHouse's architecture. While Elasticsearch relies on horizontal scaling and sharding for performance and fault tolerance and thus has multiple shards by default, ClickHouse is optimized for vertical scaling and typically performs best with fewer shards.

We recommend starting with a single shard and scaling vertically. This configuration is suitable for most observability workloads and simplifies both management and query performance tuning.

  • ClickHouse Cloud: Uses a single-shard, multi-replica architecture by default. Storage and compute scale independently, making it ideal for observability use cases with unpredictable ingest patterns and read-heavy workloads.
  • ClickHouse OSS: In self-managed deployments, we recommend:
    • Starting with a single shard
    • Scaling vertically with additional CPU and RAM
    • Using tiered storage to extend local disk with S3-compatible object storage
    • Using ReplicatedMergeTree if high availability is required (a brief sketch follows this list)
    • For fault tolerance, a single replica per shard is typically sufficient for observability workloads
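
The following is a minimal sketch of a replicated table for a self-managed deployment; the table definition, cluster name, and Keeper path are assumptions and will depend on your own cluster configuration:

```sql
-- Illustrative only: a replicated MergeTree table for high availability.
-- '{shard}' and '{replica}' come from each server's macros configuration.
CREATE TABLE logs ON CLUSTER my_cluster
(
    timestamp DateTime,
    hostname  String,
    message   String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/logs', '{replica}')
ORDER BY (hostname, timestamp);
```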

When to shard

Sharding may be necessary if:

  • Your ingest rate exceeds the capacity of a single node (typically >500K rows/sec)
  • You need tenant isolation or regional data separation
  • Your total dataset is too large for a single server, even with object storage

If you do need to shard, refer to Horizontal scaling for guidance on shard keys and distributed table setup.

Retention and TTL

ClickHouse uses TTL clauses on MergeTree tables to manage data expiration. TTL policies can:

  • Automatically delete expired data
  • Move older data to cold object storage
  • Retain only recent, frequently queried logs on fast disk

We recommend aligning your ClickHouse TTL configuration with your existing Elastic retention policies to maintain a consistent data lifecycle during the migration. For examples, see ClickStack production TTL setup.
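
As a sketch only, a TTL clause combining tiering and expiry might look like the following. The table, column names, and the 'cold' volume / 'tiered' storage policy are assumptions that must exist in your own storage configuration:

```sql
-- Illustrative only: keep 30 days of data in total, moving parts older than
-- 7 days to a pre-configured object-storage-backed volume named 'cold'.
CREATE TABLE logs
(
    timestamp DateTime,
    hostname  String,
    message   String
)
ENGINE = MergeTree
ORDER BY (hostname, timestamp)
TTL timestamp + INTERVAL 7 DAY TO VOLUME 'cold',
    timestamp + INTERVAL 30 DAY DELETE
SETTINGS storage_policy = 'tiered';
```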

Migrating data

While we recommend parallel operation for most observability data, there are specific cases where direct data migration from Elasticsearch to ClickHouse may be necessary:

  • Small lookup tables used for data enrichment (e.g., user mappings, service catalogs)
  • Business data stored in Elasticsearch that needs to be correlated with observability data, with ClickHouse's SQL capabilities and Business Intelligence integrations making it easier to maintain and query the data compared to Elasticsearch's more limited query options.
  • Configuration data that needs to be preserved across the migration

This approach is only viable for datasets under 10 million rows, as Elasticsearch's export capabilities are limited to JSON over HTTP and don't scale well for larger datasets.

The following steps allow the migration of a single index from Elasticsearch to ClickHouse.

Migrate schema

Create a table in ClickHouse for the index being migrated from Elasticsearch. Users can map Elasticsearch types to their ClickHouse equivalents. Alternatively, users can simply rely on the JSON data type in ClickHouse, which will dynamically create columns of the appropriate type as data is inserted.

Consider the following Elasticsearch mapping for an index containing syslog data:

Elasticsearch mapping
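
The original mapping is not reproduced here; the sketch below is illustrative only, using typical syslog-style fields (the field names are assumptions) to show the kinds of Elasticsearch types being mapped:

```json
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "hostname":  { "type": "keyword" },
      "message":   { "type": "keyword" },
      "host": {
        "properties": {
          "name":          { "type": "keyword" },
          "ip":            { "type": "ip" },
          "mac":           { "type": "keyword" },
          "uptime":        { "type": "long" },
          "containerized": { "type": "boolean" }
        }
      },
      "syslog": { "type": "object" }
    }
  }
}
```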

The equivalent ClickHouse table schema:

ClickHouse schema
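
Again, this is a sketch consistent with the notes that follow rather than a definitive schema; column names follow the illustrative mapping above:

```sql
-- Illustrative only: nested fields become named Tuples, host.ip/host.mac are
-- explicit Arrays, and the unpredictable syslog object uses the JSON type.
CREATE TABLE logs_system_syslog
(
    `timestamp` DateTime,
    `hostname`  String,
    `message`   String,
    `host` Tuple(
        name          String,
        ip            Array(Variant(IPv4, IPv6)),
        mac           Array(String),
        uptime        Int64,
        containerized UInt8
    ),
    `syslog` JSON
)
ENGINE = MergeTree
ORDER BY (timestamp, hostname);
```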

Note that:

  • Tuples are used to represent nested structures instead of dot notation
  • Appropriate ClickHouse types are used based on the mapping:
    • keyword → String
    • date → DateTime
    • boolean → UInt8
    • long → Int64
    • ip → Array(Variant(IPv4, IPv6)). We use Variant(IPv4, IPv6) here as the field contains a mixture of IPv4 and IPv6 addresses.
    • object → JSON for the syslog object, whose structure is unpredictable.
  • Columns host.ip and host.mac are declared as explicit Array types, unlike in Elasticsearch, where any field can hold one or more values without an explicit array type.
  • An ORDER BY clause is added using timestamp and hostname for efficient time-based queries
  • MergeTree, which is optimal for log data, is used as the engine type

This approach of statically defining the schema and using the JSON type selectively where required is recommended.

This strict schema has a number of benefits:

  • Data validation – enforcing a strict schema avoids the risk of column explosion outside of specific, deliberately dynamic structures.
  • Avoids risk of column explosion: although the JSON type scales to potentially thousands of columns, where subcolumns are stored as dedicated columns, this can lead to a column file explosion in which an excessive number of column files are created, impacting performance. To mitigate this, the underlying Dynamic type used by JSON offers a max_dynamic_paths parameter, which limits the number of unique paths stored as separate column files. Once the threshold is reached, additional paths are stored in a shared column file using a compact encoded format, maintaining performance and storage efficiency while supporting flexible data ingestion. Accessing this shared column file is, however, not as performant. Note, however, that the JSON column can be used with type hints. "Hinted" columns will deliver the same performance as dedicated columns.
  • Simpler introspection of paths and types: although the JSON type supports introspection functions to determine the types and paths that have been inferred, static structures can be simpler to explore e.g. with DESCRIBE.

Alternatively, users can simply create a table with one JSON column.
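
A minimal sketch of this single-column approach, assuming the table name syslog_json used later in this guide and hinting the paths used in the ordering key:

```sql
-- Illustrative only: one JSON column with type hints for the ordering-key paths.
CREATE TABLE syslog_json
(
    `json` JSON(host.name String, timestamp DateTime)
)
ENGINE = MergeTree
ORDER BY (json.host.name, json.timestamp);
```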

note

We provide type hints for the host.name and timestamp columns in the JSON definition, as we use them in the ordering/primary key. This helps ClickHouse know these paths won't be null and ensures it knows which sub-columns to use (there may otherwise be multiple sub-columns for a path, one per inferred type, making the choice ambiguous).

This latter approach, while simpler, is best for prototyping and data engineering tasks. For production, use JSON only for dynamic sub-structures where necessary.

For more details on using the JSON type in schemas, and how to efficiently apply it, we recommend the guide "Designing your schema".

Install elasticdump

We recommend elasticdump for exporting data from Elasticsearch. This tool requires Node.js and should be installed on a machine with network proximity to both Elasticsearch and ClickHouse. We recommend a dedicated server with at least 4 cores and 16GB of RAM for most exports.
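
elasticdump is distributed as an npm package, so a typical installation looks like:

```bash
# Requires Node.js and npm to be installed
npm install elasticdump -g
```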

elasticdump offers several advantages for data migration:

  • Interacts directly with the Elasticsearch REST API, ensuring proper data export.
  • Maintains data consistency during the export process using the Point-in-Time (PIT) API - this creates a consistent snapshot of the data at a specific moment.
  • Exports data directly to JSON format, which can be streamed to the ClickHouse client for insertion.

Where possible, we recommend running ClickHouse, Elasticsearch, and elasticdump in the same availability zone or data center to minimize network egress and maximize throughput.

Install ClickHouse client

Ensure the ClickHouse client is installed on the server on which elasticdump is located. Do not start a ClickHouse server - these steps only require the client.
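
One way to do this is the official single-binary download, which bundles the client:

```bash
# Download the single clickhouse binary; it includes the client, so no server needs to be started
curl https://clickhouse.com/ | sh
./clickhouse client --version
```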

Stream data

To stream data between Elasticsearch and ClickHouse, use the elasticdump command, piping its output directly to the ClickHouse client. The following inserts the data into our well-structured table logs_system_syslog.
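
The command below is a sketch assembled from the flags and client settings described next; the Elasticsearch URL, index name, ClickHouse host, and credentials are placeholders to replace with your own values:

```bash
# If you installed the single binary above, invoke it as './clickhouse client' instead of 'clickhouse-client'.
elasticdump \
  --input=https://<elasticsearch-host>:9200 \
  --type=data \
  --input-index=<index-name> \
  --output=$ \
  --sourceOnly \
  --searchAfter \
  --pit=true \
| clickhouse-client \
  --host <clickhouse-host> --secure --password <password> \
  --max_insert_block_size=1000 \
  --min_insert_block_size_bytes=0 \
  --min_insert_block_size_rows=1000 \
  --query="INSERT INTO logs_system_syslog FORMAT JSONEachRow"
```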

Note the use of the following flags for elasticdump:

  • --type=data - limits the response to the document content of the Elasticsearch index only.
  • --input-index - our Elasticsearch input index.
  • --output=$ - redirects all results to stdout.
  • --sourceOnly - ensures we omit metadata fields from the response.
  • --searchAfter - uses the search_after API for efficient pagination of results.
  • --pit=true - ensures consistent results between queries using the Point-in-Time API.

Our ClickHouse client parameters here (aside from credentials):

  • max_insert_block_size=1000 - ClickHouse client will send data once this number of rows is reached. Increasing improves throughput at the expense of time to formulate a block - thus increasing time till data appears in ClickHouse.
  • min_insert_block_size_bytes=0 - Turns off server block squashing by bytes.
  • min_insert_block_size_rows=1000 - Squashes blocks from clients on the server side. In this case, we set it to match max_insert_block_size so rows appear immediately. Increase to improve throughput.
  • query="INSERT INTO logs_system_syslog FORMAT JSONEachRow" - Inserts the data in the JSONEachRow format. This is appropriate if sending to a well-defined schema such as logs_system_syslog.

Users can expect throughput in the order of thousands of rows per second.

Inserting into a single JSON column

If inserting into a single JSON column (see the syslog_json schema above), the same insert command can be used. However, users must specify JSONAsObject as the format instead of JSONEachRow, e.g.:
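
Only the query changes relative to the earlier command; hosts, index, and credentials remain placeholders:

```bash
elasticdump --input=https://<elasticsearch-host>:9200 --type=data \
  --input-index=<index-name> --output=$ --sourceOnly --searchAfter --pit=true \
| clickhouse-client --host <clickhouse-host> --secure --password <password> \
  --max_insert_block_size=1000 --min_insert_block_size_bytes=0 --min_insert_block_size_rows=1000 \
  --query="INSERT INTO syslog_json FORMAT JSONAsObject"
```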

See "Reading JSON as an object" for further details.

Transform data (optional)

The above commands assume a 1:1 mapping of Elasticsearch fields to ClickHouse columns. Users often need to filter and transform Elasticsearch data before insertion into ClickHouse.

This can be achieved using the input table function, which allows us to execute any SELECT query over the incoming data stream.

Suppose we wish to store only the timestamp and hostname fields from our earlier data. The ClickHouse schema:
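
A minimal sketch of such a table (the name logs_slim is hypothetical):

```sql
-- Illustrative only: a reduced table keeping just the two fields of interest.
CREATE TABLE logs_slim
(
    `timestamp` DateTime,
    `hostname`  String
)
ENGINE = MergeTree
ORDER BY (hostname, timestamp);
```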

To insert from elasticdump into this table, we can simply use the input table function - using the JSON type to dynamically detect and select the required columns. Note this SELECT query could easily contain a filter.
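
A sketch of such an insert, reusing the placeholder connection details from above; the field extraction and casts are illustrative and should be adapted to your own document structure:

```bash
elasticdump --input=https://<elasticsearch-host>:9200 --type=data \
  --input-index=<index-name> --output=$ --sourceOnly --searchAfter --pit=true \
| clickhouse-client --host <clickhouse-host> --secure --password <password> \
  --query="
    INSERT INTO logs_slim
    SELECT
        parseDateTimeBestEffort(toString(json.\`@timestamp\`)) AS timestamp,
        toString(json.host.name) AS hostname
    FROM input('json JSON')
    FORMAT JSONAsObject"
```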

Note the need to escape the @timestamp field name and use the JSONAsObject input format.