Adding Persistent Storage to Your Cluster

Making sure your data survives when your pods go away

As our ClickHouse® cluster starts to take shape, we need persistent storage; we can’t afford to lose data every time something goes wrong with a pod.

Creating persistent storage

We’ll create persistent storage and add it to the definition of our cluster. Copy and paste the following into manifest02.yaml:

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: cluster01
spec:
  templates:
    podTemplates:
      - name: clickhouse-pod-template
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:24.8.14.10459.altinitystable
              volumeMounts:
                - name: clickhouse-storage
                  mountPath: /var/lib/clickhouse
    volumeClaimTemplates:
      - name: clickhouse-storage
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 5Gi
          storageClassName: standard
  configuration:
    clusters:
      - name: cluster01
        layout:
          shardsCount: 1
          replicasCount: 1
        templates:
          podTemplate: clickhouse-pod-template

Several things are new here, as you would expect:

  • We added persistent storage (volumeMounts) to the podTemplate for our ClickHouse cluster. The storage is defined in a template named clickhouse-storage, and it is mounted at /var/lib/clickhouse on each pod.
  • We have a volumeClaimTemplates section that defines the parameters of the storage our pods will use. Each claim requests 5Gi of space, and its storageClassName is standard. (More on storage classes in a minute.)
  • At the end of the file, the definition of our cluster hasn’t changed. But clickhouse-pod-template now includes persistent storage.
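Before touching anything else, you can ask the Kubernetes API server to validate the manifest without actually changing the cluster. A quick sketch, assuming the Altinity operator and its ClickHouseInstallation CRD are already installed and you’re working in the quick namespace:

# server-side dry run: validates the resource but creates/updates nothing
kubectl apply --dry-run=server -f manifest02.yaml -n quick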

One thing to keep in mind: the values for storageClassName can vary from one platform to the next. The command kubectl get storageclasses will show you what’s available in your current environment.

Running kubectl get storageclasses on AWS gives us two options:

NAME                      PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2                       kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  3h12m
gp3-encrypted (default)   ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   3h4m

Azure has several options:

NAME                    PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
azurefile               file.csi.azure.com   Delete          Immediate              true                   20m
azurefile-csi           file.csi.azure.com   Delete          Immediate              true                   20m
azurefile-csi-premium   file.csi.azure.com   Delete          Immediate              true                   20m
azurefile-premium       file.csi.azure.com   Delete          Immediate              true                   20m
default (default)       disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   20m
managed                 disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   20m
managed-csi             disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   20m
managed-csi-premium     disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   20m
managed-premium         disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   20m

We get three storage classes from GCP:

NAME                     PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE  
premium-rwo              pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   12d  
standard                 kubernetes.io/gce-pd    Delete          Immediate              true                   12d  
standard-rwo (default)   pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   12d

Running the same command on minikube gives us only one option:

NAME                 PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE  
standard (default)   k8s.io/minikube-hostpath   Delete          Immediate           false                  143m

Edit manifest02.yaml so that it has the right storageClassName for your platform.
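If you’d rather make that edit from the command line, a one-liner along these lines works (a sketch that assumes GNU sed and the gp3-encrypted class from the AWS listing above; substitute the right class for your platform):

# swap the storage class in place (on macOS/BSD sed, use sed -i '' instead)
sed -i 's/storageClassName: standard/storageClassName: gp3-encrypted/' manifest02.yaml

Then go ahead and apply the update: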

kubectl apply -f manifest02.yaml -n quick

You’ll see a success message:

clickhouseinstallation.clickhouse.altinity.com/cluster01 configured

This redeploys your ClickHouse cluster with the new settings. You can look at the chi to see how things are progressing:

kubectl get chi -o wide -n quick

You’ll likely see a status of InProgress for a while, but eventually you’ll see Completed:

NAME        VERSION   CLUSTERS   SHARDS   HOSTS   TASKID                                 STATUS      HOSTS-COMPLETED   HOSTS-UPDATED   HOSTS-ADDED   HOSTS-DELETED   ENDPOINT                                       AGE   SUSPEND
cluster01   0.25.0    1          1        1       f2b5c79a-2cac-420b-9c5f-e417a9236e63   Completed                                                                   clickhouse-cluster01.quick.svc.cluster.local   11m

(You may need to scroll to the right to see the status because we used the -o wide option.)
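If you’d rather not re-run that command by hand, kubectl get can watch the resource and print a new line each time the status changes; press Ctrl+C once you see Completed:

kubectl get chi -n quick -o wide --watch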

Let’s take a quick look at the resources in our namespace:

kubectl get all -n quick

We’ve got everything we expect:

NAME                                READY   STATUS    RESTARTS   AGE
pod/chi-cluster01-cluster01-0-0-0   1/1     Running   0          5m15s

NAME                                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
service/chi-cluster01-cluster01-0-0   ClusterIP   None         <none>        9000/TCP,8123/TCP,9009/TCP   9m45s
service/clickhouse-cluster01          ClusterIP   None         <none>        8123/TCP,9000/TCP            9m33s

NAME                                           READY   AGE
statefulset.apps/chi-cluster01-cluster01-0-0   1/1     5m15s
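
One thing kubectl get all doesn’t show is the PersistentVolumeClaim the operator created from our volumeClaimTemplates section. You can check it separately:

kubectl get pvc -n quick

You should see a single claim, with a name along the lines of clickhouse-storage-chi-cluster01-cluster01-0-0-0 (the template name plus the pod name), a status of Bound, and a capacity of 5Gi.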

And now for some data

Now let’s give our ClickHouse cluster some data to work with. We’ll create a database, then create a table in that database, then put data in the table. First, connect to the cluster:

kubectl exec -it chi-cluster01-cluster01-0-0-0 -n quick -- clickhouse-client

Now copy and paste these commands:

CREATE DATABASE analytics;

USE analytics;

CREATE TABLE page_views
(
    `event_time` DateTime,
    `user_id` UInt32,
    `page_url` String,
    `referrer_url` String,
    `device` String,
    `country` String
)
ENGINE = MergeTree
ORDER BY event_time;

INSERT INTO page_views (event_time, user_id, page_url, referrer_url, device, country) VALUES  
('2025-01-01 12:00:00', 101, '/home', 'google.com', 'mobile', 'USA'),  
('2025-01-01 12:05:00', 102, '/products', 'facebook.com', 'desktop', 'Canada'),  
('2025-01-01 12:10:00', 103, '/cart', 'twitter.com', 'tablet', 'UK'),  
('2025-01-02 14:00:00', 101, '/checkout', 'google.com', 'mobile', 'USA'),  
('2025-01-06 08:20:00', 110, '/blog', 'twitter.com', 'desktop', 'Australia');

Run SELECT * FROM analytics.page_views; to verify that your data is in the database:

SELECT *  
FROM analytics.page_views

And there it is:

   ┌──────────event_time─┬─user_id─┬─page_url──┬─referrer_url─┬─device──┬─country───┐
1. │ 2025-01-01 12:00:00 │     101 │ /home     │ google.com   │ mobile  │ USA       │
2. │ 2025-01-01 12:05:00 │     102 │ /products │ facebook.com │ desktop │ Canada    │
3. │ 2025-01-01 12:10:00 │     103 │ /cart     │ twitter.com  │ tablet  │ UK        │
4. │ 2025-01-02 14:00:00 │     101 │ /checkout │ google.com   │ mobile  │ USA       │
5. │ 2025-01-06 08:20:00 │     110 │ /blog     │ twitter.com  │ desktop │ Australia │
   └─────────────────────┴─────────┴───────────┴──────────────┴─────────┴───────────┘

5 rows in set. Elapsed: 0.004 sec.

👉 Type exit to end the kubectl exec session.
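
By the way, you don’t have to open an interactive session for every query; clickhouse-client’s -q flag runs a single statement and exits. For example, here’s a quick aggregation over the sample rows (the query itself is just an illustration):

kubectl exec -it chi-cluster01-cluster01-0-0-0 -n quick -- clickhouse-client -q "SELECT country, count() AS views FROM analytics.page_views GROUP BY country ORDER BY views DESC"

With the five rows we inserted, USA should show two views and the other countries one each.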

Testing persistent storage

Okay, we’ve got everything set up, so let’s make sure it’s actually working before we move on to replication. Delete the pod; the StatefulSet will recreate it automatically, and if persistent storage is working, our data should still be there. Here we go:

kubectl delete pod chi-cluster01-cluster01-0-0-0 -n quick

You’ll get a message that the pod has been deleted. Check kubectl get pods -n quick until the replacement pod is running and ready.
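If you’d rather block until the pod is Ready instead of polling, kubectl wait can do that; a quick sketch, assuming the StatefulSet brings the pod back under the same name (if you run it before the new pod exists, just run it again):

kubectl wait --for=condition=Ready pod/chi-cluster01-cluster01-0-0-0 -n quick --timeout=300s

Now connect to the restarted pod and query the analytics.page_views table to see if our data is still there: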

kubectl exec -it chi-cluster01-cluster01-0-0-0 -n quick -- clickhouse-client -q "SELECT * FROM analytics.page_views;"

Everything looks good:

2025-01-01 12:00:00	101	/home	google.com	mobile	USA
2025-01-01 12:05:00	102	/products	facebook.com	desktop	Canada
2025-01-01 12:10:00	103	/cart	twitter.com	tablet	UK
2025-01-02 14:00:00	101	/checkout	google.com	mobile	USA
2025-01-06 08:20:00	110	/blog	twitter.com	desktop	Australia
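
The reason the data survived is that /var/lib/clickhouse lives on the persistent volume, not on the pod’s ephemeral filesystem, so the replacement pod simply re-attached to the same volume. If you’re curious, you can see the mount from inside the new pod; a quick check, assuming the image ships the usual df utility:

kubectl exec chi-cluster01-cluster01-0-0-0 -n quick -- df -h /var/lib/clickhouse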

Having persistent storage for our data is great, but any highly available system will have multiple copies (replicas) of important data. Which brings us to our next topic…

👉 Next: Enabling replication