We have already talked about how and why we like Rook: to a significant extent, it simplifies working with storage in Kubernetes clusters. However, this simplicity brings certain difficulties with it. We hope this new material will help you understand those difficulties before they even manifest themselves.
To make the reading more interesting, we start with the consequences of a hypothetical problem in a cluster.
"Everything is lost!"
Imagine that you once configured and launched Rook in your K8s cluster, were pleased with how it worked, and then at some "wonderful" moment the following happens:
- New pods cannot mount RBD images from Ceph.
- Commands like lsblk and df do not work on the Kubernetes hosts. This automatically means that "something is wrong" with the RBD images mounted on the nodes: they cannot be read, which points to the monitors being unavailable...
- Indeed, there are no operational monitors in the cluster. Moreover, there are no OSD pods and no MGR pods either.
And when was the rook-ceph-operator pod launched? Not nearly as long ago as it was deployed. Why does that matter? Because the Rook operator has decided to create a brand-new cluster... So how do we now restore the cluster and the data in it?
To start, let's take the longer and more interesting way: a thoughtful investigation into Rook's "internals" and a step-by-step restoration of its components. Of course, there is also a shorter and more correct way: using backups. As you know, admins come in two kinds: those who don't do backups yet and those who already do... But more about that after the investigation.
A little practice, or the long way
Inspecting and restoring the monitors
So, let's look at the list of ConfigMaps: there are rook-ceph-config and rook-config-override, the ones needed for a backup. They appear upon successful deployment of the cluster.
NB: In newer versions, after this PR was adopted, ConfigMaps have ceased to be an indicator of a successful cluster deployment.
To proceed, we need a hard reboot of all servers that have RBD images mounted (ls /dev/rbd*). It must be done via sysrq (or "on foot" to the data center). This requirement comes from the need to detach the mounted RBDs, which a regular reboot cannot handle (it will unsuccessfully try to unmount them in the normal way).
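A minimal sketch of such a hard reboot via sysrq, run as root on each affected node:

echo 1 > /proc/sys/kernel/sysrq      # make sure sysrq is enabled
echo b > /proc/sysrq-trigger         # reboot immediately, without syncing or unmounting filesystems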
As the saying goes, a theater begins with its cloakroom, and a Ceph cluster begins with its monitors. Let's look at them.
Rook mounts the following entities in the monitor pod:
Volumes:
  rook-ceph-config:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        rook-ceph-config
  rook-ceph-mons-keyring:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-mons-keyring
  rook-ceph-log:
    Type:        HostPath (bare host directory volume)
    Path:        /var/lib/rook/kube-rook/log
  ceph-daemon-data:
    Type:        HostPath (bare host directory volume)
    Path:        /var/lib/rook/mon-a/data
Mounts:
  /etc/ceph from rook-ceph-config (ro)
  /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
  /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw)
  /var/log/ceph from rook-ceph-log (rw)
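This output comes from describing a monitor pod. Assuming the cluster namespace is kube-rook (matching the cluster name seen in the host paths) and the usual app=rook-ceph-mon label, it can be reproduced with:

kubectl -n kube-rook describe pod -l app=rook-ceph-mon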
Let's see what is in the rook-ceph-mons-keyring secret:
kind: Secret
data:
  keyring: LongBase64EncodedString=
We decode it and get the usual keyring with permissions for the admin and the monitors:
[mon.]
    key = AQAhT19dlUz0LhBBINv5M5G4YyBswyU43RsLxA==
    caps mon = "allow *"
[client.admin]
    key = AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
    caps mds = "allow *"
    caps mon = "allow *"
    caps osd = "allow *"
    caps mgr = "allow *"
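For reference, the same keyring can be pulled and decoded straight from the secret, for example like this (the kube-rook namespace is an assumption, adjust to your setup):

kubectl -n kube-rook get secret rook-ceph-mons-keyring -o jsonpath='{.data.keyring}' | base64 -d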
Remember it. Now look at the keyring in the rook-ceph-admin-keyring secret:
kind: Secret
data:
  keyring: anotherBase64EncodedString=
What is in it?
[client.admin]
    key = AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
    caps mds = "allow *"
    caps mon = "allow *"
    caps osd = "allow *"
    caps mgr = "allow *"
The same. Let's look further... Here, for example, is the rook-ceph-mgr-a-keyring secret:
[mgr.a]
    key = AQBZR19dbVeaIhBBXFYyxGyusGf8x1bNQunuew==
    caps mon = "allow *"
    caps mds = "allow *"
    caps osd = "allow *"
Finally, we find a few more secrets in rook-ceph-mon:
kind: Secret
data:
  admin-secret: AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
  cluster-name: a3ViZS1yb29r
  fsid: ZmZiYjliZDMtODRkOS00ZDk1LTczNTItYWY4MzZhOGJkNDJhCg==
  mon-secret: AQAhT19dlUz0LhBBINv5M5G4YyBswyU43RsLxA==
This is the initial set of keys, from which all the secrets described above are derived.
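These fields are plain base64 and decode like any other Kubernetes secret; for example, cluster-name resolves to the cluster name already seen in the monitor pod's host paths:

echo a3ViZS1yb29r | base64 -d    # prints: kube-rook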
As you know (see dataDirHostPath in the documentation), Rook stores this data in two places. So let's go to the nodes and look at the keyrings lying in the directories that are mounted into the monitor and OSD pods. To do this, find /var/lib/rook/mon-a/data/keyring on the nodes and look inside:
# cat /var/lib/rook/mon-a/data/keyring
[mon.]
    key = AXAbS19d8NNUXOBB+XyYwXqXI1asIzGcGlzMGg==
    caps mon = "allow *"
Surprise: the key turns out to be different, not the one stored in the cluster's secrets.
What about the admin keyring? We also have it:
# cat /var/lib/rook/kube-rook/client.admin.keyring
[client.admin]
    key = AXAbR19d8GGSMUBN+FyYwEqGI1aZizGcJlHMLgx=
    caps mds = "allow *"
    caps mon = "allow *"
    caps osd = "allow *"
    caps mgr = "allow *"
And here is the problem: there was a failure, the cluster was recreated... although in reality it was not.
It becomes clear that the secrets hold newly generated keyrings, not the ones from our old cluster. Therefore:
- take the monitor keyring from the file /var/lib/rook/mon-a/data/keyring (or from a backup);
- replace the keyring in the rook-ceph-mons-keyring secret;
- put the admin and monitor keys into rook-ceph-mon (the admin-secret and mon-secret fields);
- delete the pod controllers (Deployments) of the monitors (a command sketch of all these steps follows below).
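A rough sketch of these steps, assuming the kube-rook namespace and the standard Rook labels (both are assumptions, adjust to your setup):

# encode the keyring taken from the node (the secret should contain both the [mon.] and [client.admin] sections)
base64 -w0 /var/lib/rook/mon-a/data/keyring
# paste the resulting value into the secret
kubectl -n kube-rook edit secret rook-ceph-mons-keyring
# fix mon-secret and admin-secret here as well
kubectl -n kube-rook edit secret rook-ceph-mon
# delete the monitor deployments so they are recreated with the fixed keyring
kubectl -n kube-rook delete deployment -l app=rook-ceph-mon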
The miracle does not take long: the monitors appear and start up. Hooray, that's a start!
Restoring the OSDs
We go into the rook-operator pod: calling ceph mon dump shows that all the monitors are in place, and ceph -s shows that they are in quorum. However, if you look at the OSD tree (ceph osd tree), you will see something strange in it: OSDs have started to appear, but they are empty. It turns out they also need to be restored somehow. But how?
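If you want to run the same checks yourself, the Ceph CLI is available right inside the operator pod (or in a toolbox pod, if you have one). Something like this works; the namespace and label here are assumptions based on a typical deployment:

OPERATOR_POD=$(kubectl -n kube-rook get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-rook exec -it "$OPERATOR_POD" -- ceph mon dump
kubectl -n kube-rook exec -it "$OPERATOR_POD" -- ceph -s
kubectl -n kube-rook exec -it "$OPERATOR_POD" -- ceph osd tree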
Meanwhile, the much-needed rook-ceph-config and rook-config-override, as well as many other ConfigMaps with names of the form rook-ceph-osd-$nodename-config, have appeared. Let's look at them:
kind: ConfigMap
data:
  osd-dirs: '{"/mnt/osd1":16,"/mnt/osd2":18}'
Everything is wrong, everything is mixed up!
Let's scale the operator pod to zero, delete the generated OSD Deployments, and fix these ConfigMaps (a command sketch for this cleanup is given below, after we figure out the correct layout). But where do we get the correct map of OSDs per node?
- Let's dig into the /mnt/osd[1-2] directories on the nodes again, in the hope of catching hold of something there.
- There are two subdirectories in the /mnt/osd1 directory: osd0 and osd16. Isn't the latter exactly the ID specified in the ConfigMap (16)?
- Checking the sizes, we see that osd0 is much larger than osd16.
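The size check from the last point is a simple one-liner on the node:

du -sh /mnt/osd1/*    # osd0 and osd16 side by side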
We conclude that osd0 is the OSD we actually need, the one that should be mapped to /mnt/osd1 in the ConfigMap (we use directory-based OSDs).
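A rough sketch of the cleanup and the fix described above; the namespaces, node name, and corrected osd-dirs value here are assumptions for illustration only:

# stop the operator so it does not interfere
kubectl -n kube-rook scale deployment rook-ceph-operator --replicas=0
# remove the wrongly generated OSD deployments
kubectl -n kube-rook delete deployment -l app=rook-ceph-osd
# fix the per-node ConfigMap, e.g. point /mnt/osd1 at the real ID 0 instead of 16:
#   osd-dirs: '{"/mnt/osd1":0,"/mnt/osd2":18}'
kubectl -n kube-rook edit configmap rook-ceph-osd-node1-config
# ...repeat for every node, then bring the operator back
kubectl -n kube-rook scale deployment rook-ceph-operator --replicas=1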
Step by step, we check all the nodes and edit their ConfigMaps. Once everything is fixed, we can start the Rook operator pod again and read its logs (a tail command is shown after the list below). And everything in them is wonderful:
- I am the cluster operator;
- I found drives on the nodes;
- I found the monitors;
- the monitors made friends, i.e. formed a quorum;
- the OSD deployments are starting...
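The logs themselves can be tailed with something like this (the operator's namespace may differ in your setup):

kubectl -n kube-rook logs -f deployment/rook-ceph-operator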
Let's go back to the Rook operator pod and check the cluster's health... oops, we made a small mistake in our conclusions about the OSD names on some nodes! No matter: we fix the ConfigMaps again, delete the extra directories created for the new OSDs, and arrive at the long-awaited HEALTH_OK state!
Check the images in the pool:
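For example, from the operator or toolbox pod (the pool name here is just a placeholder; substitute the one used by your CephBlockPool):

rbd ls -p kube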
Everything is in place - the cluster is saved!
Being lazy and doing backups, or the fast way
If backups were made for Rook, the recovery procedure becomes much simpler and boils down to the following (a command-level sketch follows the list):
- Scale the Rook operator deployment to zero;
- Delete all deployments except the Rook operator's;
- Restore all secrets and ConfigMaps from the backup;
- Restore the contents of /var/lib/rook/mon-* on the nodes;
- Restore the CephCluster, CephFilesystem, CephBlockPool, CephNFS, and CephObjectStore custom resources (if they were suddenly lost);
- Scale the Rook operator deployment back to 1.
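In command form, the procedure might look roughly like this; the namespaces, labels, and backup file names are assumptions:

kubectl -n kube-rook scale deployment rook-ceph-operator --replicas=0
kubectl -n kube-rook delete deployment -l 'app in (rook-ceph-mon,rook-ceph-osd,rook-ceph-mgr)'
kubectl -n kube-rook apply -f rook-secrets-and-configmaps-backup.yaml
# on each monitor node, restore the data directories, e.g.:
rsync -a /backup/var/lib/rook/mon-a/ /var/lib/rook/mon-a/
# restore the custom resources if they were lost
kubectl apply -f rook-custom-resources-backup.yaml
kubectl -n kube-rook scale deployment rook-ceph-operator --replicas=1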
Useful Tips
Make backups!
And to avoid situations when you have to recover from them:
- Before large-scale work on the cluster that involves server reboots, scale the Rook operator deployment to zero so that it does not do anything unnecessary.
- Add nodeAffinity to the monitors in advance.
- Pay attention to pre-configuring the ROOK_MON_HEALTHCHECK_INTERVAL and ROOK_MON_OUT_TIMEOUT timeouts (see the fragments after this list).
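For reference, the timeouts are set as environment variables on the Rook operator Deployment, and monitor nodeAffinity lives under placement.mon in the CephCluster resource. The values and the node label below are placeholders, not recommendations:

# fragment of the rook-ceph-operator Deployment
env:
- name: ROOK_MON_HEALTHCHECK_INTERVAL
  value: "45s"
- name: ROOK_MON_OUT_TIMEOUT
  value: "600s"

# fragment of the CephCluster resource
spec:
  placement:
    mon:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role              # placeholder label; use whatever you label your mon nodes with
              operator: In
              values: ["ceph-mon"]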
Instead of a conclusion
There is no point arguing about it: Rook, being an additional "layer" in the overall scheme of organizing storage in Kubernetes, simplifies a lot, but it also adds new difficulties and potential problems to the infrastructure. Only a "small" thing remains: to make a deliberate, informed choice between these risks on the one hand and the benefits the solution brings in your particular case on the other.
By the way, the section “Adopt an existing Rook Ceph cluster into a new Kubernetes cluster” has recently
been added to the Rook documentation. It describes in more detail what needs to be done in order to move existing data to a new Kubernetes cluster or to restore a cluster that has collapsed for one reason or another.