Querying Data with Swarms

Using swarms to speed up queries

At this point, we’ve enabled swarms and created a swarm cluster alongside our regular cluster. Swarms are designed for running queries against your data lake, but we haven’t set one up yet. In the meantime, we can query a public dataset (the AWS public blockchain dataset, in this example) and still see the benefits of swarm clusters.

Running the query without a swarm

Here’s the basic query we’ll run against the blockchain dataset:

SELECT date, sum(output_value)
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=*/*.parquet', NOSIGN)
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date

Our results look like this:

     ┌─date───────┬──sum(output_value)─┐
  1. │ 2025-01-01 │  539953.5324481277 │
  2. │ 2025-01-02 │  634979.2361196541 │
  3. │ 2025-01-03 │  674687.3423463742 │
  4. │ 2025-01-04 │  474874.7069853526 │
  5. │ 2025-01-05 │  450109.2146900331 │
  6. │ 2025-01-06 │  737125.5617075353 │
  7. │ 2025-01-07 │  808630.9692842694 │
  8. │ 2025-01-08 │  784665.2639110681 │
  9. │ 2025-01-09 │  750688.5840565205 │
  . . .

That’s great, but we’re all about performance here. Let’s check the statistics:

query time: 31.493s, read rows: 86145787, read bytes: 1216327350
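
If your client doesn’t report these statistics, one place to look is the system.query_log table. The following is a minimal sketch, assuming query logging is enabled on your server and that you run it on the node that executed the query; the LIKE filter and the LIMIT are illustrative choices, so adjust them for your environment.

```sql
-- Sketch: timing and read statistics for recent runs of the blockchain query.
-- Assumes query logging is enabled; new entries can take a few seconds to appear.
SELECT
    event_time,
    query_duration_ms / 1000 AS query_time_s,
    read_rows,
    read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'                     -- completed queries only
  AND is_initial_query = 1                     -- skip sub-queries sent to other nodes
  AND query LIKE '%aws-public-blockchain%'     -- illustrative filter on the query text
  AND query NOT LIKE '%system.query_log%'      -- exclude this lookup itself
ORDER BY event_time DESC
LIMIT 5
```

The values logged here may differ slightly from what your client displays, since the client’s timing can include network overhead.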

Running the query with a swarm

Now we’ll add a SETTINGS clause to the query so that it uses the swarm cluster:

SELECT date, sum(output_value)
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=*/*.parquet', NOSIGN)
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date
SETTINGS object_storage_cluster='maddie-swarm'

The object_storage_cluster setting tells ClickHouse to run the query on the swarm cluster named maddie-swarm (substitute the name of your own swarm). The query is roughly 9X faster:

query time: 3.544s, read rows: 86145787, read bytes: 1216327350

NOTE: We ran this query in the Altinity Cloud Manager (ACM), which enables filesystem caching by default. If you’re running Antalya in some other environment, you’ll need to add enable_filesystem_cache=1 to the SETTINGS of your query.
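
Outside the ACM, the swarm query with caching enabled explicitly might look like the following sketch. It is the same query as above with enable_filesystem_cache=1 added to the SETTINGS clause; maddie-swarm is the swarm name from our example, so substitute your own.

```sql
-- Same swarm query, with filesystem caching turned on explicitly.
-- Only needed where caching is not already enabled by default.
SELECT date, sum(output_value)
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=*/*.parquet', NOSIGN)
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date
SETTINGS object_storage_cluster='maddie-swarm', enable_filesystem_cache=1
```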

The cache is empty the first time we use it, of course. Re-running the query with a loaded cache is even faster:

query time: 2.427s, read rows: 86145787, read bytes: 1216327350

Our final query is almost 13X faster than the original, non-swarm query (2.427s versus 31.493s). Combining swarms and caching delivers dramatically better results. And with swarms running on cheaper spot instances, you may save money as well as time.
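
The speedup figures come straight from the query times reported above. As a quick check, you can do the arithmetic in ClickHouse itself; the column aliases here are just illustrative names.

```sql
-- Speedups relative to the 31.493s non-swarm run.
SELECT
    round(31.493 / 3.544, 1) AS swarm_speedup,         -- swarm, empty cache
    round(31.493 / 2.427, 1) AS swarm_cached_speedup   -- swarm, loaded cache
```

These work out to about 8.9 and 13, which is where the “roughly 9X” and “almost 13X” figures come from. Your own timings will vary with swarm size, network conditions, and cache state.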

Next, we’ll set up our own data lake and use swarms to query it.

👉 Next: Working with data lakes