Querying Data with Swarms

Using swarms to speed up queries

At this point, we’ve enabled swarms and created a swarm cluster alongside our regular cluster. You can use swarms to run queries against your data lake, of course, but we haven’t set one up yet. Still, we can query a public dataset (the AWS public blockchain dataset, in this example) and see the benefits of swarm clusters.

Running the query without a swarm

Here’s the basic query we’ll run against the blockchain dataset:

SELECT date, sum(output_value)
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=*/*.parquet', NOSIGN)
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date

Our results look like this:

     ┌─date───────┬──sum(output_value)─┐
  1. │ 2025-01-01 │  539953.5324481277 │
  2. │ 2025-01-02 │  634979.2361196541 │
  3. │ 2025-01-03 │  674687.3423463742 │
  4. │ 2025-01-04 │  474874.7069853526 │
  5. │ 2025-01-05 │  450109.2146900331 │
  6. │ 2025-01-06 │  737125.5617075353 │
  7. │ 2025-01-07 │  808630.9692842694 │
  8. │ 2025-01-08 │  784665.2639110681 │
  9. │ 2025-01-09 │  750688.5840565205 │
  . . .

That’s great, but we’re all about performance here. Let’s check the statistics:

query time: 31.493s, read rows: 86145787, read bytes: 1216327350

Note that you may need to increase the timeout value for your queries. In the ACM, the default timeout is 30 seconds, which isn’t enough time for this query without swarms (it took 31.493 seconds above).

Running the query with a swarm

Now we’ll update the query’s SETTINGS to use the swarm cluster:

SELECT date, sum(output_value)
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=*/*.parquet', NOSIGN)
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date
SETTINGS object_storage_cluster='maddie-swarm'

The SETTINGS clause tells ClickHouse to run the query on the swarm cluster. The query is roughly 9X faster:

query time: 3.544s, read rows: 86145787, read bytes: 1216327350

NOTE: We ran this query in the Altinity Cloud Manager (ACM), which enables filesystem caching by default. If you’re running Antalya in some other environment, you’ll need to add enable_filesystem_cache=1 to the SETTINGS of your query. See the Antalya Command and Configuration Reference for complete details.
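
Outside the ACM, the swarm query with caching turned on explicitly might look like the sketch below. It reuses the query and the maddie-swarm cluster name from above; the only addition is the enable_filesystem_cache setting mentioned in the note.

```sql
-- Same swarm query as above, with filesystem caching enabled explicitly.
-- Only needed outside the ACM, where caching is not on by default.
SELECT date, sum(output_value)
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=*/*.parquet', NOSIGN)
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date
SETTINGS object_storage_cluster = 'maddie-swarm',
         enable_filesystem_cache = 1
```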

The cache is empty the first time we run the query, of course. Re-running the query with a loaded cache is even faster:

query time: 2.427s, read rows: 86145787, read bytes: 1216327350

Our final query time is almost 13X faster than our original, non-swarm query. Combining swarms and caching delivers dramatically better results. And with swarms running on cheaper spot instances, you may save money as well as time.
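
As a quick check on those speedup figures, the ratios can be computed from the three query times reported above. This is only arithmetic on the numbers already shown; the column aliases are names chosen for illustration.

```sql
-- Speedups relative to the original 31.493s non-swarm query.
SELECT
    round(31.493 / 3.544, 1) AS swarm_speedup,        -- swarm, empty cache
    round(31.493 / 2.427, 1) AS swarm_cache_speedup   -- swarm, loaded cache
```

The two ratios come out to about 8.9 and 13.0, which is where the "roughly 9X" and "almost 13X" figures come from.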

Troubleshooting your query

There are a couple of error messages you may see when you run your query. The first tells you that the object_storage_cluster setting, which names the swarm cluster, isn’t supported:

Code: 115. DB::Exception: Setting object_storage_cluster is neither a builtin setting nor started with the prefix 'SQL_' registered for user-defined settings. (UNKNOWN_SETTING) (version 25.3.6.10034.altinitystable (altinity build))

The object_storage_cluster setting isn’t recognized, which means this cluster is running an Altinity Stable build. The crucial clue is at the end of the message: the version string contains altinitystable. You’ll need to upgrade your cluster to an Antalya build before swarm queries will work. See the Upgrading a Cluster documentation for all the details.
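
If you’d rather check the build before running a swarm query, a sketch like the following should work. It assumes that version() returns the same version string that appears in the error messages, including the altinitystable or altinityantalya suffix; the column aliases are illustrative.

```sql
-- Show the server version and whether it looks like an Antalya build.
-- Assumption: the build flavor (altinitystable / altinityantalya)
-- is part of the string returned by version().
SELECT
    version() AS server_version,
    version() LIKE '%altinityantalya%' AS is_antalya_build
```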

The other common message tells us our swarm cluster can’t be found:

Code: 701. DB::Exception: Requested cluster 'maddie-swarm' not found. (CLUSTER_DOESNT_EXIST) (version 25.3.3.20186.altinityantalya (altinity build))

This one can be more frustrating. The swarm cluster is active, we’re running an Antalya build (note altinityantalya in the version string), and there isn’t anything wrong with the maddie-swarm cluster itself. So what’s going on? The swarm cluster can’t be found because the cluster you’re running the query from isn’t enabled for swarms. See the section on Enabling Swarms for the details. It’s an easy fix, but keep in mind that enabling swarms applies to a single cluster, not to every cluster in your environment, so you need to enable swarms for each cluster that needs them.
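
One way to see which clusters your current cluster knows about is to query the system.clusters table. This is a hedged sketch: it assumes that a swarm cluster such as maddie-swarm shows up in system.clusters once swarms are enabled for the cluster you’re connected to.

```sql
-- List the clusters visible from the server running the query.
-- Assumption: 'maddie-swarm' appears here only after swarms
-- have been enabled for this cluster.
SELECT DISTINCT cluster
FROM system.clusters
ORDER BY cluster
```

If the swarm cluster’s name is missing from the list, enabling swarms for the current cluster is the first thing to check.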

Now that we’ve got a working query that uses a swarm cluster, we’ll set up our own data lake and use swarms to query it.
