Working with Data Lakes

Creating our own data lake and querying it

Our last step is to create our own data lake. As mentioned earlier, the data lake we’re working with will be an Iceberg catalog that uses Parquet files stored on S3 object storage.

Here are the steps we’ll go through:

  1. Enable an Iceberg catalog for our Altinity.Cloud environment
  2. Get the connection details for our Iceberg catalog
  3. Write data to our Iceberg catalog (this takes place outside the ACM)
  4. Create a ClickHouse database that’s connected to our Iceberg catalog
  5. Use a swarm cluster to run queries against that database

Pretty straightforward, eh? Let’s go!

Enabling an Iceberg catalog

Before we can work with data lakes, we need to enable an Iceberg catalog for our Altinity.Cloud environment. From the ACTIONS menu on the Environments tab, select the Iceberg Catalog menu item. You’ll see this dialog:

Figure 1 - Enabling an Iceberg catalog

Click the button to enable the catalog. (You may need to click the button a few times while the catalog is being created.) When the catalog is enabled, you’ll see the connection details for the catalog:

Figure 2 - Connection details for the Iceberg catalog

Complete information about working with Iceberg catalogs is in the Enabling an Iceberg Catalog documentation.

Getting the connection details later

Any time you need the connection details, you can get back to this dialog by clicking the Enabled link next to Iceberg Catalogs on the Environment summary panel:

Figure 3 - Getting connection details from the Environment overview panel

Clicking that link takes you back to Figure 2 above. You’ll need these values to connect to the catalog and write data to it. Which brings us to our next step…

Writing data to the Iceberg catalog

Now it’s time to take the credentials from Figure 2 above and use them to load data into our catalog. There are a number of tools that can do this, but we’ll use Ice, an open-source tool from Altinity.

First, we’ll put the connection details into the file .ice.yaml:

uri: https://iceberg-catalog.altinity-maddie.altinity.cloud
bearerToken: abcdef0123456789abcdef0123456789

(Notice that the field in the YAML file is uri, not url.)

With the YAML file configured, we’ll load a Parquet file from the New York taxi data set into an Iceberg table named nyc.taxis:

ice insert nyc.taxis -p \
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet
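
If you’d like to spot-check the source file first, ClickHouse can read the same Parquet file directly over HTTP with its url() table function. This step is purely optional, and the column name below is an assumption based on the public yellow-taxi schema:

-- Optional spot check of the source Parquet file from any ClickHouse client.
-- tpep_pickup_datetime is assumed from the public yellow-taxi schema.
SELECT
    count() AS rows,
    min(tpep_pickup_datetime) AS first_pickup,
    max(tpep_pickup_datetime) AS last_pickup
FROM url('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet', Parquet);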

Creating a database from the Iceberg catalog

By using the connection details in Figure 2 above, we were able to load data into the catalog. But now we need to create a ClickHouse database to query the data in our catalog.

From the Configure menu of a cluster, select the Data Lake Catalogs menu item. You’ll see this dialog:

Figure 4 - Connecting to a data lake catalog

The dialog in Figure 4 specifies a catalog type (AWS Glue is an alternative, and others are coming soon), the catalog (default is the only catalog), and the name of the database we’re creating. The new database will be connected to our data lake.

NOTE: Support for multiple catalogs, as well as write access to the catalog, is in development. For now, we’ll keep it simple with a single, read-only catalog. Also, depending on your configuration, you may need to enable the catalog first.

See Configuring Data Lake Catalogs for all the details.
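
The dialog issues the SQL for you, so there’s nothing to type. If you’re curious what the equivalent DDL roughly looks like, recent ClickHouse builds offer a DataLakeCatalog database engine for REST catalogs; treat the sketch below as illustrative only, since the exact engine, settings, and authentication details the ACM uses may differ from these assumptions:

-- Illustrative sketch only: the ACM runs the real statement for you, and the
-- engine name and settings here are assumptions that vary by ClickHouse version.
CREATE DATABASE maddie
ENGINE = DataLakeCatalog('https://iceberg-catalog.altinity-maddie.altinity.cloud')
SETTINGS
    catalog_type = 'rest',   -- an Iceberg REST catalog
    warehouse = 'default';   -- the single default catalog
-- (Authentication settings, taken from the connection details in Figure 2, are omitted here.)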

When this operation is complete, you can look at the Schema tab of the Cluster Explorer and see the tables in the ClickHouse cluster:

Figure 5 - The database table created from our data lake

We’ve got a table named nyc.taxis in the maddie database. Its engine is IcebergS3, and it contains more than 10 million rows.
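
You can confirm the same thing from a SQL client if you prefer. The names below follow Figure 5 (a database called maddie and a table literally named nyc.taxis); because the table name contains a dot, it has to be quoted with backticks:

-- Confirm the new database and its engine (names follow Figure 5).
SELECT name, engine FROM system.databases WHERE name = 'maddie';

-- The table name contains a dot, so quote it with backticks.
SELECT count() FROM maddie.`nyc.taxis`;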

Using swarm clusters to run queries against our Iceberg catalog

At this point we can use our swarm cluster to run queries against this table, which is actually a link to our Iceberg catalog of Parquet files stored on S3 object storage. We’ll see performance improvements similar to those in the previous section.
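
As a concrete sketch, a query like the one below aggregates the taxi data through the swarm. The swarm cluster name ('swarm') and the object_storage_cluster setting are assumptions based on the Antalya builds; adjust both to match the swarm cluster you created earlier:

-- A sketch of a swarm-accelerated aggregation. The cluster name and the
-- object_storage_cluster setting are assumptions; adjust them to your environment.
SELECT
    toDate(tpep_pickup_datetime) AS day,
    count() AS trips,
    round(avg(trip_distance), 2) AS avg_distance
FROM maddie.`nyc.taxis`
GROUP BY day
ORDER BY day
SETTINGS object_storage_cluster = 'swarm';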

Summary

Now we have a complete example that highlights all of the capabilities of Project Antalya. We went through these steps:

  • We enabled our ClickHouse cluster for swarms
  • With swarms enabled, we created a swarm cluster
  • We used our swarm cluster to query a public data lake
  • We created a data lake catalog, which created an S3 bucket with an Iceberg catalog of Parquet files
  • We loaded data into the Iceberg catalog
  • We created a ClickHouse database connected to that S3 bucket, with that database having a table for every Iceberg table in the catalog
  • We used swarms to query data in the Iceberg catalog

All of these pieces together use the full power of Project Antalya.