Querying Data with Swarms
At this point, we’ve enabled swarms and created a swarm cluster alongside our regular cluster. Swarms can run queries against your data lake, but we haven’t set one up yet. In the meantime, we can query a public dataset (the AWS public blockchain dataset, in this example) to see the benefits of swarm clusters.
Running the query without a swarm
Here’s the basic query we’ll run against the blockchain dataset:
SELECT date, sum(output_value)
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=*/*.parquet', NOSIGN)
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date
Our results look like this:
   ┌─date───────┬──sum(output_value)─┐
1. │ 2025-01-01 │  539953.5324481277 │
2. │ 2025-01-02 │  634979.2361196541 │
3. │ 2025-01-03 │  674687.3423463742 │
4. │ 2025-01-04 │  474874.7069853526 │
5. │ 2025-01-05 │  450109.2146900331 │
6. │ 2025-01-06 │  737125.5617075353 │
7. │ 2025-01-07 │  808630.9692842694 │
8. │ 2025-01-08 │  784665.2639110681 │
9. │ 2025-01-09 │  750688.5840565205 │
. . .
That’s great, but we’re all about performance here. Let’s check the statistics:
query time: 31.493s, read rows: 86145787, read bytes: 1216327350
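If you’re not running queries in a tool that reports these statistics for you, one way to get similar numbers is to look the query up in ClickHouse’s system.query_log table. The following is a minimal sketch, assuming query logging is enabled and that you can read system.query_log; the ILIKE filter on the bucket name is just an illustrative way to find the query. On a multi-node cluster each node keeps its own log, so run this on the node that handled the query, and allow a few seconds for the log entry to appear.

```sql
-- Sketch: fetch timing and read statistics for recent runs of the query.
-- Assumes query logging is enabled on this node.
SELECT
    event_time,
    query_duration_ms / 1000 AS query_time_s,
    read_rows,
    read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'                        -- completed queries only
  AND query ILIKE '%aws-public-blockchain%'       -- illustrative filter
  AND query NOT ILIKE '%system.query_log%'        -- skip this lookup itself
ORDER BY event_time DESC
LIMIT 5
```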
Running the query with a swarm
Now we’ll update the query’s SETTINGS to use the swarm cluster:
SELECT date, sum(output_value)
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=*/*.parquet', NOSIGN)
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date
SETTINGS object_storage_cluster='maddie-swarm'
The object_storage_cluster setting tells ClickHouse to run the query on the swarm cluster. The query is now roughly 9X faster:
query time: 3.544s, read rows: 86145787, read bytes: 1216327350
NOTE: We ran this query in the Altinity Cloud Manager (ACM), which enables filesystem caching by default. If you’re running Antalya in some other environment, you’ll need to add enable_filesystem_cache=1 to the SETTINGS of your query.
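Outside the ACM, the query with both settings might look like the sketch below. It reuses the query and the swarm name maddie-swarm from this example; substitute the name of your own swarm cluster.

```sql
SELECT date, sum(output_value)
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=*/*.parquet', NOSIGN)
WHERE date >= '2025-01-01'
GROUP BY date
ORDER BY date
SETTINGS object_storage_cluster='maddie-swarm',  -- run on the swarm cluster
         enable_filesystem_cache=1               -- needed outside the ACM
```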
The cache is empty the first time we use it, of course. Re-running the query with a loaded cache is even faster:
query time: 2.427s, read rows: 86145787, read bytes: 1216327350
Our final query time is almost 13X faster than our original, non-swarm query. Combining swarms and caching delivers dramatically better results. And with swarms running on cheaper spot instances, you may save money as well as time.
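The speedup figures come straight from the three query times reported above. As a quick check, the arithmetic can be run in ClickHouse itself; the column aliases here are only illustrative names.

```sql
-- Speedup = non-swarm query time / swarm query time, using the times above.
SELECT
    round(31.493 / 3.544, 1) AS swarm_speedup,         -- swarm, empty cache
    round(31.493 / 2.427, 1) AS swarm_cached_speedup   -- swarm, loaded cache
```

The two ratios work out to about 8.9 and 13.0, matching the "roughly 9X" and "almost 13X" figures. Your own timings will vary with swarm size, network conditions, and cache state.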
Next, we’ll set up our own data lake and use swarms to query it.