Adding Persistent Storage to Your Cluster
As our ClickHouse® cluster is starting to take shape, we need persistent storage. We can’t lose data whenever something goes wrong with a pod.
Creating persistent storage
We’ll create persistent storage and add it to the definition of our cluster. Copy and paste the following into manifest02.yaml:
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: cluster01
spec:
  templates:
    podTemplates:
      - name: clickhouse-pod-template
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:24.8.14.10459.altinitystable
              volumeMounts:
                - name: clickhouse-storage
                  mountPath: /var/lib/clickhouse
    volumeClaimTemplates:
      - name: clickhouse-storage
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 5Gi
          storageClassName: standard
  configuration:
    clusters:
      - name: cluster01
        layout:
          shardsCount: 1
          replicasCount: 1
        templates:
          podTemplate: clickhouse-pod-template
Several things are new here, as you would expect:
- We added persistent storage (volumeMounts) to the podTemplate for our ClickHouse cluster. The storage is defined in a template named clickhouse-storage, and it is mounted at /var/lib/clickhouse on each pod.
- We have a volumeClaimTemplates section that defines the parameters of the storage our pods will use. The storage will have five gigabytes of space, and its storageClassName is standard. (More on storage classes in a minute.)
- At the end of the file, the definition of our cluster hasn’t changed. But clickhouse-pod-template now includes persistent storage.
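As an optional check before you apply anything, you can validate the manifest against the operator’s CRD with a server-side dry run. This is plain kubectl, nothing specific to the operator, and it assumes the quick namespace used throughout this guide:
kubectl apply -f manifest02.yaml -n quick --dry-run=server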
One thing to keep in mind: the values for storageClassName can vary from one platform to the next. The command kubectl get storageclasses will show you what’s available in your current environment.
Running kubectl get storageclasses on AWS gives us two options:
NAME                      PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2                       kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  3h12m
gp3-encrypted (default)   ebs.csi.aws.com         Delete          WaitForFirstConsumer   true                   3h4m
Azure has several options:
NAME                    PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
azurefile               file.csi.azure.com   Delete          Immediate              true                   20m
azurefile-csi           file.csi.azure.com   Delete          Immediate              true                   20m
azurefile-csi-premium   file.csi.azure.com   Delete          Immediate              true                   20m
azurefile-premium       file.csi.azure.com   Delete          Immediate              true                   20m
default (default)       disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   20m
managed                 disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   20m
managed-csi             disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   20m
managed-csi-premium     disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   20m
managed-premium         disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   20m
We get three storageclasses from GCP:
NAME                     PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
premium-rwo              pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   12d
standard                 kubernetes.io/gce-pd    Delete          Immediate              true                   12d
standard-rwo (default)   pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   12d
Running the same command on minikube gives us only one option:
NAME                 PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
standard (default)   k8s.io/minikube-hostpath   Delete          Immediate           false                  143m
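If you’d rather script this step than eyeball the table, the default storage class is marked with the standard annotation storageclass.kubernetes.io/is-default-class, which you can query with a jsonpath expression. This one-liner is just a convenience, and it assumes your cluster uses that annotation:
kubectl get storageclasses -o jsonpath='{.items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")].metadata.name}'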
Edit manifest02.yaml so that it has the right storageClassName, then go ahead and apply the update:
kubectl apply -f manifest02.yaml -n quick
You’ll see a success message:
clickhouseinstallation.clickhouse.altinity.com/cluster01 configured
This redeploys your ClickHouse cluster with the new settings. You can look at the chi to see how things are progressing:
kubectl get chi -o wide -n quick
You’ll likely see a status of InProgress for a while, but eventually you’ll see Completed:
NAME        VERSION   CLUSTERS   SHARDS   HOSTS   TASKID                                 STATUS      HOSTS-COMPLETED   HOSTS-UPDATED   HOSTS-ADDED   HOSTS-DELETED   ENDPOINT                                       AGE   SUSPEND
cluster01   0.25.0    1          1        1       f2b5c79a-2cac-420b-9c5f-e417a9236e63   Completed                                                                   clickhouse-cluster01.quick.svc.cluster.local   11m
(You may need to scroll to the right to see the status because we used the -o wide option.)
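If the status seems stuck in InProgress, kubectl describe on the resource shows its detailed status and any events the operator has recorded as it reconciles; this is standard kubectl behavior for any custom resource:
kubectl describe chi cluster01 -n quick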
Let’s take a quick look at how things stand:
kubectl get all -n quick
We’ve got everything we expect:
NAME                                READY   STATUS    RESTARTS   AGE
pod/chi-cluster01-cluster01-0-0-0   1/1     Running   0          5m15s

NAME                                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
service/chi-cluster01-cluster01-0-0   ClusterIP   None         <none>        9000/TCP,8123/TCP,9009/TCP   9m45s
service/clickhouse-cluster01          ClusterIP   None         <none>        8123/TCP,9000/TCP            9m33s

NAME                                          READY   AGE
statefulset.apps/chi-cluster01-cluster01-0-0   1/1     5m15s
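One thing kubectl get all does not list is the PersistentVolumeClaim the operator created from our volumeClaimTemplate. To confirm the 5Gi volume exists and is bound, check it directly:
kubectl get pvc -n quick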
And now for some data
Now let’s give our ClickHouse cluster some data to work with. We’ll create a database, then create a table in that database, then put data in the table. First, connect to the cluster:
kubectl exec -it chi-cluster01-cluster01-0-0-0 -n quick -- clickhouse-client
Now copy and paste these commands:
CREATE DATABASE analytics;
USE analytics;
CREATE TABLE page_views
(
`event_time` DateTime,
`user_id` UInt32,
`page_url` String,
`referrer_url` String,
`device` String,
`country` String
)
ENGINE = MergeTree
ORDER BY event_time;
INSERT INTO page_views (event_time, user_id, page_url, referrer_url, device, country) VALUES
('2025-01-01 12:00:00', 101, '/home', 'google.com', 'mobile', 'USA'),
('2025-01-01 12:05:00', 102, '/products', 'facebook.com', 'desktop', 'Canada'),
('2025-01-01 12:10:00', 103, '/cart', 'twitter.com', 'tablet', 'UK'),
('2025-01-02 14:00:00', 101, '/checkout', 'google.com', 'mobile', 'USA'),
('2025-01-06 08:20:00', 110, '/blog', 'twitter.com', 'desktop', 'Australia');
Run SELECT * FROM analytics.page_views; to verify that your data is in the database:
SELECT *
FROM analytics.page_views
And there it is:
┌──────────event_time─┬─user_id─┬─page_url──┬─referrer_url─┬─device──┬─country───┐
1. │ 2025-01-01 12:00:00 │ 101 │ /home │ google.com │ mobile │ USA │
2. │ 2025-01-01 12:05:00 │ 102 │ /products │ facebook.com │ desktop │ Canada │
3. │ 2025-01-01 12:10:00 │ 103 │ /cart │ twitter.com │ tablet │ UK │
4. │ 2025-01-02 14:00:00 │ 101 │ /checkout │ google.com │ mobile │ USA │
5. │ 2025-01-06 08:20:00 │ 110 │ /blog │ twitter.com │ desktop │ Australia │
└─────────────────────┴─────────┴───────────┴──────────────┴─────────┴───────────┘
5 rows in set. Elapsed: 0.004 sec.
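The sample rows are small, but they’re enough to try a quick aggregation while you’re connected; for example, page views per country (any query against analytics.page_views works here):
SELECT
    country,
    count() AS views
FROM analytics.page_views
GROUP BY country
ORDER BY views DESC;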
👉 Type exit to end the kubectl exec session.
Testing persistent storage
Okay, we’ve got everything set up, so let’s make sure it’s actually working before we move on to replication. Delete the pod. It will be restarted, of course, and if persistent storage is working, our data should still be there. Here we go:
kubectl delete pod chi-cluster01-cluster01-0-0-0 -n quick
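If you’d like to watch the StatefulSet controller recreate the pod in real time, you can follow along in another terminal with kubectl’s watch flag (press Ctrl+C to stop watching):
kubectl get pods -n quick -w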
You’ll get a message that the pod has been deleted. Check kubectl get pods -n quick until it says the pod is running and ready. Now connect to the restarted pod and query the analytics.page_views table to see if our data is still there:
kubectl exec -it chi-cluster01-cluster01-0-0-0 -n quick -- clickhouse-client -q "SELECT * FROM analytics.page_views;"
Everything looks good:
2025-01-01 12:00:00   101   /home       google.com     mobile    USA
2025-01-01 12:05:00   102   /products   facebook.com   desktop   Canada
2025-01-01 12:10:00   103   /cart       twitter.com    tablet    UK
2025-01-02 14:00:00   101   /checkout   google.com     mobile    USA
2025-01-06 08:20:00   110   /blog       twitter.com    desktop   Australia
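If you’re curious where that data actually lives on the persistent volume, you can list ClickHouse’s data directory inside the pod. The /var/lib/clickhouse path is the mountPath from our podTemplate; the layout underneath it is managed by ClickHouse itself, so the exact contents may vary by version:
kubectl exec chi-cluster01-cluster01-0-0-0 -n quick -- ls /var/lib/clickhouse/data/analytics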
Having persistent storage for our data is great, but any highly available system will have multiple copies (replicas) of important data. Which brings us to our next topic….