Working with Data Lakes

Creating our own data lake and querying it

Our last step is to create our own data lake. With the data lake created, we’ll run queries against it with the swarm cluster we created earlier.

Here are the steps we’ll go through:

  1. Enable an Iceberg catalog for our Altinity.Cloud environment
  2. Get the connection details for our Iceberg catalog
  3. Write data to our Iceberg catalog (this takes place outside the ACM)
  4. Create a ClickHouse database that’s connected to our Iceberg catalog
  5. Use a swarm cluster to run queries against that database

Pretty straightforward, eh? Let’s go!

Enabling an Iceberg catalog

Before we can work with data lakes, we need to enable an Iceberg catalog for our Altinity.Cloud environment. From the ACTIONS menu on the Environments tab, select the Iceberg Catalog menu item. You’ll see this dialog:

Figure 1 - Enabling an Iceberg catalog

Click the button to enable the catalog. (You may need to click the button a few times while the catalog is being created.) When the catalog is enabled, you’ll see the connection details for the catalog:

Figure 2 - Connection details for the Iceberg catalog

Complete information about working with Iceberg catalogs is in the Enabling an Iceberg Catalog documentation.

Getting the connection details later

Any time you need the connection details, you can get back to this dialog by clicking the Enabled link next to Iceberg Catalogs on the Environment summary panel:

Figure 3 - Getting connection details from the Environment overview panel

Clicking that link takes you back to Figure 2 above. You’ll need these values to connect to the catalog and write data to it. Which brings us to our next step….

Writing data to the Iceberg catalog

Now it’s time to take the credentials from Figure 2 above and use them to load data into our catalog. There are a number of tools that can do this, but we’ll use Ice, an open-source tool from Altinity.

First, we’ll put the connection details into the file .ice.yaml:

uri: https://iceberg-catalog.altinity-maddie.altinity.cloud
bearerToken: abcdef0123456789abcdef0123456789

(Notice that the field in the YAML file is uri, not url.)

With the YAML file configured, we’ll load a Parquet file from the AWS public blockchain dataset into the catalog entry named blockchain.data:

 ice insert blockchain.data -p https://aws-public-blockchain.s3-us-east-2.amazonaws.com/v1.0/btc/transactions/date=2025-08-01/part-00000-6ad97917-542b-409c-9bfc-86efe51edbba-c000.snappy.parquet

For this example, we loaded the blockchain data from the first ten days of August 2025, running a similar command for each day’s Parquet files.
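
If you’d like a quick sanity check on the source data before loading it, you can point ClickHouse at the same Parquet file directly. Here’s a minimal sketch that uses ClickHouse’s url() table function with the file from the ice command above; it’s optional and isn’t part of the Ice workflow itself:

-- Optional sanity check: count the rows in one day's Parquet file
SELECT count()
FROM url('https://aws-public-blockchain.s3-us-east-2.amazonaws.com/v1.0/btc/transactions/date=2025-08-01/part-00000-6ad97917-542b-409c-9bfc-86efe51edbba-c000.snappy.parquet', 'Parquet')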

Creating a database from the Iceberg catalog

Using the connection details from Figure 2, we were able to load data into the catalog. Now we need to create a ClickHouse database so we can query the data in our catalog.

From the Configure menu of a cluster, select the Data Lake Catalogs menu item. You’ll see this dialog:

Figure 4 - Connecting to a data lake catalog

The dialog in Figure 4 specifies a catalog type (AWS Glue is an alternative catalog type, and others are coming soon), the catalog (default is the only catalog), and the name of the database we’re creating. The new database will be connected to our data lake.

NOTE: Support for multiple catalogs and write access to a catalog are in development. For now, we have a single, read-only catalog. Also, depending on your configuration, you may need to enable the catalog first.

See Configuring Data Lake Catalogs for all the details.

When this operation is complete, you can look at the Schema tab of the Cluster Explorer and see the tables in the ClickHouse cluster:

Figure 5 - The database table created from our data lake

We’ve got a table named blockchain.data in the maddie database. It has an engine of IcebergS3 and contains more than 4 million rows.
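
You can also verify the new database from any SQL client connected to the cluster. Here’s a minimal sketch, assuming the database and table names shown above (maddie and blockchain.data):

-- List the tables ClickHouse created from the Iceberg catalog
SHOW TABLES FROM maddie;

-- Inspect the table definition; the engine should be IcebergS3
SHOW CREATE TABLE maddie.`blockchain.data`;

-- Count the rows visible through the catalog
SELECT count() FROM maddie.`blockchain.data`;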

Using swarm clusters to run queries against our Iceberg catalog

At this point we can use our swarm cluster to run queries against this table, which is actually a link to our Iceberg catalog of Parquet files stored on S3-compatible storage. We’ll see performance improvements similar to those in the previous section. As we mentioned above, our sample catalog contains public blockchain data from the first ten days of August 2025.

We’ll start by running the query from the previous section against our data lake without swarms:

SELECT date, sum(output_value)
FROM maddie.`blockchain.data`
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date

We get these results:

    ┌─date───────┬──sum(output_value)─┐
 1. │ 2025-08-01 │  765015.4144215047 │
 2. │ 2025-08-02 │   468510.929370189 │
 3. │ 2025-08-03 │ 465258.31995901733 │
 4. │ 2025-08-04 │  683513.1045118246 │
 5. │ 2025-08-05 │  609936.6882308874 │
 6. │ 2025-08-06 │  626958.1893022532 │
 7. │ 2025-08-07 │  605012.0557144039 │
 8. │ 2025-08-08 │  625050.3468524646 │
 9. │ 2025-08-09 │   438600.899152057 │
10. │ 2025-08-10 │ 456397.45688583393 │
    └────────────┴────────────────────┘

With these statistics:

query time: 1.887s, read rows: 4161101, read bytes: 253276483

Now we’ll add SETTINGS object_storage_cluster='maddie-swarm' to the end of the query:

SELECT date, sum(output_value)
FROM maddie.`blockchain.data`
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date
SETTINGS object_storage_cluster='maddie-swarm'

We get the same results, only much faster:

query time: 0.819s, read rows: 4161101, read bytes: 253276483

Those are the results from our first swarm-assisted run. Running the query again with a loaded filesystem cache is even faster:

query time: 0.305s, read rows: 4161101, read bytes: 253276483

Even though this is a very small data lake, a swarm-assisted query is over 6X faster once the cache is loaded. As you work with larger data lakes and larger swarm clusters, the performance benefits will be even greater.
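
If you want to reproduce the comparison yourself, ClickHouse records the timing and read statistics for every query in the system.query_log table. Here’s a hedged sketch; filtering on the query text is just one way to find our runs, and SYSTEM DROP FILESYSTEM CACHE (which lets you measure another cold run) requires the appropriate privileges:

-- Optional: clear the filesystem cache to measure a cold run again
SYSTEM DROP FILESYSTEM CACHE;

-- Compare the plain and swarm-assisted runs of our query
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS read_size
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%blockchain.data%'
ORDER BY event_time DESC
LIMIT 10;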

Summary

Now we have a complete example that highlights all of the capabilities of Project Antalya. We went through these steps:

  • We enabled our ClickHouse cluster for swarms
  • With swarms enabled, we created a swarm cluster
  • We used our swarm cluster to query a public data lake
  • We enabled an Iceberg catalog, which created an S3 bucket that holds the catalog’s Parquet files
  • We loaded data into the Iceberg catalog
  • We created a ClickHouse database connected to that catalog, with a table for every Iceberg table in the catalog
  • We used swarms to query the data in the Iceberg catalog

Together, all of these pieces use the full power of Project Antalya.