Querying Data with Swarms

Using swarms to speed up queries

At this point, we’ve enabled swarms and created a swarm cluster alongside our regular cluster. We’ll run some queries against the AWS Public Blockchain dataset and see how swarms impact our performance.

Running a query without a swarm

We’ll start with a simple query. Here’s how to get the number of transactions and the total value of all outputs from those transactions for each day in 2024:

SELECT
    date,
    count() AS tx_count,
    round(sum(output_value), 2) AS total_btc
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=2024*/*.snappy.parquet', NOSIGN)
GROUP BY date
ORDER BY date

Our results look like this:

     ┌─date───────┬─tx_count─┬──total_btc─┐
  1. │ 2024-01-01 │   657752 │  531084.15 │
  2. │ 2024-01-02 │   367319 │   849250.1 │
  3. │ 2024-01-03 │   502749 │   859376.1 │
  4. │ 2024-01-04 │   482557 │ 1054460.97 │
  5. │ 2024-01-05 │   420884 │  786108.94 │
  6. │ 2024-01-06 │   382140 │  562957.52 │
  7. │ 2024-01-07 │   348688 │  574159.19 │
  8. │ 2024-01-08 │   414359 │  847723.86 │
  9. │ 2024-01-09 │   468088 │  1141429.2 │
 10. │ 2024-01-10 │   383774 │ 1314482.11 │
 . . .

That’s great, but we’re all about performance here. Let’s check the statistics:

query time: 97.606s, read rows: 192248608, read bytes: 1730237472

NOTE: If you’re working with massive datasets, you may need to modify the timeout value for your queries. In the ACM, the default timeout value is 30 seconds. You can change it by editing the Timeout entry field above the Query window. You can also add SETTINGS max_execution_time = 150 to the end of your queries.
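For example, the first query with a longer timeout looks like this (the 150-second value is arbitrary; pick one that suits your data):

SELECT
    date,
    count() AS tx_count,
    round(sum(output_value), 2) AS total_btc
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=2024*/*.snappy.parquet', NOSIGN)
GROUP BY date
ORDER BY date
SETTINGS max_execution_time = 150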

Running the query with a swarm

Now we’ll update the query’s SETTINGS to use the swarm cluster and see how that impacts performance:

SELECT
    date,
    count() AS tx_count,
    round(sum(output_value), 2) AS total_btc
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=2024*/*.snappy.parquet', NOSIGN)
GROUP BY date
ORDER BY date
SETTINGS object_storage_cluster='maddie-swarm'

Substantially better: a little over 16X faster:

query time: 5.877s, read rows: 192248608, read bytes: 1730237472

Running the query again gives us the benefit of caching:

query time: 3.578s, read rows: 192248608, read bytes: 1730237472

That’s just over 27X faster. Those results come from a four-node swarm, but you can create a larger swarm if you need one. This query is a good candidate for a larger swarm because we have lots of Parquet files to download. You can also add the setting object_storage_max_nodes to control how many of the swarm’s nodes are used in the query:

SETTINGS object_storage_cluster='maddie-swarm',
         object_storage_max_nodes=2

The ability to download all of those Parquet files in parallel across the swarm’s nodes gives us significant benefits.
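If you want to re-measure the uncached path, one option is to disable the filesystem cache for a single run. This is a sketch; it assumes the swarm nodes use ClickHouse’s standard filesystem cache for object storage reads:

SELECT
    date,
    count() AS tx_count,
    round(sum(output_value), 2) AS total_btc
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=2024*/*.snappy.parquet', NOSIGN)
GROUP BY date
ORDER BY date
SETTINGS object_storage_cluster = 'maddie-swarm',
         enable_filesystem_cache = 0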

Another query

Now let’s look at the top ten biggest days by transaction volume:

SELECT
    date,
    count() AS tx_count,
    round(sum(output_value) / 1e8, 4) AS total_btc
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=2024*/*.snappy.parquet', NOSIGN)
GROUP BY date
ORDER BY total_btc DESC
LIMIT 10

Our results look like this:

    ┌─date───────┬─tx_count─┬─total_btc─┐
 1. │ 2024-03-05 │   424347 │    0.0168 │
 2. │ 2024-03-13 │   390118 │    0.0163 │
 3. │ 2024-02-28 │   413800 │    0.0154 │
 4. │ 2024-12-05 │   512490 │    0.0146 │
 5. │ 2024-01-12 │   375872 │    0.0145 │
 6. │ 2024-11-21 │   498512 │    0.0144 │
 7. │ 2024-03-06 │   381934 │    0.0142 │
 8. │ 2024-09-10 │   568630 │    0.0139 │
 9. │ 2024-03-12 │   341867 │    0.0138 │
10. │ 2024-11-12 │   503677 │    0.0138 │
    └────────────┴──────────┴───────────┘

We get our results in a little over a minute and a half:

query time: 98.477s, read rows: 192248608, read bytes: 1730237472

Now we add swarms:

query time: 3.503s, read rows: 192248608, read bytes: 1730237472

That’s just over 28X faster. But we have caching enabled, so let’s run it again:

query time: 2.766s, read rows: 192248608, read bytes: 1730237472

That’s more than 35X faster! In this query, the swarm lets us load thousands of Parquet files in parallel, making the query much, much faster.

Joins

All of our queries to this point have been against a single table. What if we want to do joins? That’s a problem, because a swarm node doesn’t have any persistent storage where the joined table might be stored.

We’ll set up a query against the Ethereum data that’s part of the AWS dataset. We’ll look at a Parquet file of transaction data; each row has a contract address, but it doesn’t have the token name associated with that address. To find the token name for each address, we’ll create a lookup table with a few address / token pairs:

CREATE DATABASE IF NOT EXISTS eth;
CREATE TABLE eth.token_addresses
(
    contract_address  String,
    token_name        String,
    decimals          UInt8
)
ENGINE = MergeTree
ORDER BY contract_address;

INSERT INTO eth.token_addresses VALUES
    ('0xdac17f958d2ee523a2206206994597c13d831ec7', 'USDT',  6),
    ('0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48', 'USDC',  6),
    ('0x6b175474e89094c44da98b954eedeac495271d0f', 'DAI',  18),
    ('0x4fabb145d64652a948d72533023f6e7a623c7c53', 'BUSD', 18),
    ('0x2260fac5e5542a773aa44fbcfedf7c193bc2c599', 'WBTC',  8),
    ('0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2', 'WETH', 18);
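A quick sanity check confirms that the lookup table is populated:

SELECT contract_address, token_name, decimals
FROM eth.token_addresses
ORDER BY token_name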

Now we have a mapping between contract addresses and token names. Let’s query some Ethereum transfer data and use a JOIN to add the token names, which aren’t part of the Parquet files:

SELECT
    t.to_address                                            AS recipient,
    tok.token_name                                          AS token,
    count()                                                 AS transfer_count,
    round(sum(t.value) / pow(10, any(tok.decimals)), 2)     AS total_value
FROM s3(
    's3://aws-public-blockchain/v1.0/eth/token_transfers/date=2026-01-*/*.parquet',
    NOSIGN
) AS t
INNER JOIN eth.token_addresses AS tok
    ON t.token_address = tok.contract_address
WHERE
    tok.token_name IN ('USDT', 'USDC', 'DAI', 'BUSD')
GROUP BY recipient, token
ORDER BY total_value DESC
LIMIT 10
SETTINGS
    max_execution_time = 150,
    object_storage_cluster = 'maddie-swarm'

We get this error:

Code: 81. DB::Exception: Received from chi-maddie-swarm-maddie-swarm-0-0-0.chi-maddie-swarm-maddie-swarm-0-0.altinity-cloud-managed-clickhouse.svc.cluster.local:9000. DB::Exception: Database eth does not exist. (UNKNOWN_DATABASE) (version 25.8.16.20002.altinityantalya (altinity build))

The swarm cluster is being asked to do a join against the eth.token_addresses lookup table. But remember, the swarm cluster is ephemeral; it has no permanent storage, so it can’t find the lookup table. The solution is to use the object_storage_cluster_join_mode = 'local' setting, which tells the initiator node to do the join locally. If we add that to the query, everything works:

SELECT
    t.to_address                                            AS recipient,
    tok.token_name                                          AS token,
    count()                                                 AS transfer_count,
    round(sum(t.value) / pow(10, any(tok.decimals)), 2)     AS total_value
FROM s3(
    's3://aws-public-blockchain/v1.0/eth/token_transfers/date=2026-01-*/*.parquet',
    NOSIGN
) AS t
INNER JOIN eth.token_addresses AS tok
    ON t.token_address = tok.contract_address
WHERE
    tok.token_name IN ('USDT', 'USDC', 'DAI', 'BUSD')
GROUP BY recipient, token
ORDER BY total_value DESC
LIMIT 10
SETTINGS
    max_execution_time = 150,
    object_storage_cluster = 'maddie-swarm',
    object_storage_cluster_join_mode = 'local'

Now we get the results we want:

    ┌─recipient──────────────────────────────────────────────────────────┬─token─┬─transfer_count─┬─────total_value─┐
 1. │ 0x000000000000000000000000bbbbbbbbbb9cc5e90e3b3af64bdaf62c37eeffcb │ USDC  │          65384 │ 531355531771.01 │
 2. │ 0x00000000000000000000000006cff7088619c7178f5e14f0b119458d08d2f5ef │ USDC  │          27048 │ 201980849864.27 │
 3. │ 0x00000000000000000000000000eb00c6f847740000884d00e03f00c761998feb │ USDC  │           1756 │ 147836777091.44 │
 4. │ 0x000000000000000000000000d226997439ecfbeff8e110c8c78c8a7eefd19f89 │ USDC  │          15466 │ 137901259512.23 │
 5. │ 0x000000000000000000000000bbbbbbbbbb9cc5e90e3b3af64bdaf62c37eeffcb │ USDT  │          31954 │   83812169102.2 │
 6. │ 0x00000000000000000000000006cff7088619c7178f5e14f0b119458d08d2f5ef │ USDT  │          32405 │  44274013722.55 │
 7. │ 0x0000000000000000000000000000000000000000000000000000000000000000 │ DAI   │          13262 │  39937896537.86 │
 8. │ 0x000000000000000000000000f6e72db5454dd049d0788e411b06cfaf16853042 │ DAI   │          44589 │  26970182709.54 │
 9. │ 0x000000000000000000000000d226997439ecfbeff8e110c8c78c8a7eefd19f89 │ USDT  │          16200 │   26037873287.5 │
10. │ 0x000000000000000000000000a188eec8f81263234da3622a406892f3d630f98c │ DAI   │          19134 │  25526230223.27 │
    └────────────────────────────────────────────────────────────────────┴───────┴────────────────┴─────────────────┘

Whenever you get a “Database does not exist” error from a swarm cluster, it’s most likely because you’re doing a JOIN without the object_storage_cluster_join_mode setting.

That’s all we’ll do with the Ethereum data; we’ll go back to the BTC data for our next examples.

Where swarms make a big impact

Swarms gave us a massive performance improvement. Here are the workload characteristics that are especially well suited to swarms:

  • Many small files — The BTC dataset has thousands of Parquet files (one per day going back to 2009). A swarm can fetch those files in parallel across its nodes, dramatically faster than a single node can.
  • External data joins — joining a large local table against a large remote S3 dataset, where each node can pull its share of the remote data independently.
  • Very large local tables — once you’re in the tens of billions of rows, the parallelism across nodes becomes meaningful for aggregations.
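As a sketch of the external data join pattern, a query might look like the following. The btc.wallet_labels table and its columns are hypothetical, invented for illustration; note the object_storage_cluster_join_mode setting, which is needed because the lookup table lives only on the regular cluster:

SELECT
    labels.label    AS wallet,
    count()         AS tx_count
FROM s3('s3://aws-public-blockchain/v1.0/btc/transactions/date=2024*/*.snappy.parquet', NOSIGN) AS t
INNER JOIN btc.wallet_labels AS labels    -- hypothetical local lookup table
    ON t.hash = labels.tx_hash
GROUP BY wallet
ORDER BY tx_count DESC
LIMIT 10
SETTINGS object_storage_cluster = 'maddie-swarm',
         object_storage_cluster_join_mode = 'local'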

Troubleshooting your query

There are three common error messages you may get when you run your query against a swarm.

Swarm syntax not supported

The first error tells you that the setting used to specify the swarm cluster name isn’t supported:

Code: 115. DB::Exception: Setting object_storage_cluster is neither a builtin setting nor started with the prefix 'SQL_' registered for user-defined settings. (UNKNOWN_SETTING) (version 25.3.6.10034.altinitystable (altinity build))

This tells us that the object_storage_cluster setting isn’t recognized…which means this cluster is running a non-Antalya build. (Scroll over to the end of the message for the crucial clue.) You’ll need to upgrade your cluster to an Antalya build before swarm queries will work. See the Upgrading a Cluster documentation for all the details.

Swarm cluster can’t be found

Another common message tells us our swarm cluster can’t be found:

Code: 701. DB::Exception: Requested cluster 'maddie-swarm' not found. (CLUSTER_DOESNT_EXIST) (version 25.3.3.20186.altinityantalya (altinity build))

This one can be more frustrating. We’re running an Antalya build, the maddie-swarm cluster is active, and there’s nothing wrong with it, so what’s going on? The swarm cluster can’t be found because the cluster you’re querying from isn’t enabled for swarms. See the section on Enabling Swarms for the details. It’s an easy fix, but nothing works until you enable swarms for each cluster that needs them; enabling swarms affects only a single cluster, not all the clusters in your environment.
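One quick diagnostic is to list the clusters your server can see. If the swarm doesn’t appear in the output, swarms haven’t been enabled for the cluster you’re querying from:

SELECT DISTINCT cluster
FROM system.clusters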

A swarm cluster says a database does not exist

Finally, you may get a message from a swarm cluster that a database does not exist:

Code: 81. DB::Exception: Received from chi-maddie-swarm-maddie-swarm-0-0-0.chi-maddie-swarm-maddie-swarm-0-0.altinity-cloud-managed-clickhouse.svc.cluster.local:9000. DB::Exception: Database eth does not exist. (UNKNOWN_DATABASE) (version 25.8.16.20002.altinityantalya (altinity build))

That’s almost certainly a JOIN issue; see the section on joins above for the fix.

Now that we’ve got a working query that uses a swarm cluster, we’ll set up a data lake and use swarms to query it.

👉 Next: Working with data lakes