Project Antalya Quick Start Guide

Getting up to speed with Project Antalya

Project Antalya delivers new features that make ClickHouse® even more powerful than before. There are three concepts we’ll deal with in this guide:

  • Swarms - Swarms are pools of stateless ClickHouse clusters. With Project Antalya, ClickHouse can use a swarm to distribute the processing load of a query, giving you much faster query times. Swarms can be spun up or down as needed, and they register (and unregister) themselves with Keeper automatically. They can also cut your compute costs significantly by running on spot instances, which Amazon says can be up to 90% cheaper than On-Demand instances.
  • Data Lakes - Project Antalya implements data lakes that use Iceberg as their table format, store data as columns in Parquet, and host everything on inexpensive, S3-compatible storage. Most importantly, Project Antalya’s data lakes can be used by multiple applications. Analytics workloads with ClickHouse, AI applications, and batch jobs can all use the same Iceberg catalogs, eliminating silos of data and greatly reducing your storage costs.
  • Hybrid Tables - Project Antalya delivers the Hybrid table engine, which allows you to divide a dataset between block storage and object storage. Putting your lesser-used data into object storage can yield significant cost savings. And even though your data is stored in different places, Hybrid tables let you analyze all of your data with a single query.
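
To make the Hybrid table idea concrete, here is a minimal conceptual sketch in Python. It is not Project Antalya's implementation or syntax. It assumes a hypothetical date cutoff (`WATERMARK`) that separates hot rows from cold rows, and it uses two in-memory lists as stand-ins for block storage and object storage. All names and sample values are made up for illustration.

```python
from datetime import date

# Hypothetical cutoff: rows on or after this date are "hot", older rows are "cold".
WATERMARK = date(2024, 1, 1)

# Stand-ins for the two storage tiers (illustrative data only).
hot_rows = [
    {"pickup_date": date(2024, 3, 5), "fare": 18.5},
    {"pickup_date": date(2024, 6, 1), "fare": 42.0},
]
cold_rows = [
    {"pickup_date": date(2015, 7, 4), "fare": 9.0},
    {"pickup_date": date(2023, 12, 31), "fare": 27.5},
]


def scan_all(hot, cold, watermark):
    """Yield rows from both tiers, filtering each tier by the cutoff."""
    # Filtering both sides means a row that temporarily exists in both
    # tiers (for example, during a migration) is only counted once.
    for row in hot:
        if row["pickup_date"] >= watermark:
            yield row
    for row in cold:
        if row["pickup_date"] < watermark:
            yield row


def total_fare(hot, cold, watermark):
    """One logical query over both tiers: the sum of all fares."""
    return sum(row["fare"] for row in scan_all(hot, cold, watermark))


total = total_fare(hot_rows, cold_rows, WATERMARK)
print(total)
```

The caller asks one question (`total_fare`) and never needs to know which tier holds a given row, which is the property the Hybrid table engine provides at the SQL level.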

If you’d like a more in-depth look at these topics, see the Project Antalya concepts guide.

Throughout this guide we’ll look at two different datasets:

  • The AWS Public Blockchain dataset - This has 15+ years of data, with thousands of Parquet files, one for each day. It’s a great way to show the benefits of swarm clusters, since we can use multiple threads to read and process those thousands of files in parallel.
  • The New York Taxi and Limousine Commission dataset - This has more than fifteen years’ worth of data on taxi rides. It’s a great dataset to illustrate the power of Hybrid tables. Analytics against this data tends to focus on time-based queries. If there’s a clear line between hot data and cold data, the ability to move cold data to much cheaper object storage yet still query all our data with a single SQL statement has substantial benefits.

Creating Swarm Clusters

Getting started with swarms

Querying Data with Swarms

Using swarms to speed up queries

Working with Data Lakes

Creating a data lake and querying it

Working with Hybrid Tables

Using a single query against hot and cold storage

Bringing It All Together

Querying data lakes with swarm clusters