The OS SSD died on my Kubernetes/Rook/Ceph cluster, which left the cluster with catastrophic monitor store loss. The process for recovering from just the OSDs is not well documented, especially not when combined with Rook and a CephFS filesystem with multiple data pools. The last of these may never have been done before.
I am writing this document after the fact, so I may be missing some things, or they may be out of order. I will try my best though as I think it could be helpful.
- Recover rook ceph after catastrophic monitor store loss
- Supplementary documentation
- Prerequisites
- Recover the Ceph cluster
- Restore CephFS
- Identify CSI volumes and create static PVs
- Takeaways
Reinstall Kubernetes and Rook. I used a newer version of Kubernetes, but kept the Rook and Ceph versions the same.
Ensure Rook is configured to use only one mon when creating the Ceph cluster, as the mon data must be restored to a single mon and quorum re-established from it.
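For reference, a minimal sketch of the relevant part of the CephCluster spec; the rook-ceph name and namespace are the usual defaults, and the rest of the spec is omitted:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  mon:
    count: 1
    allowMultiplePerNode: false
  # ... rest of the cluster spec unchanged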
To recover a Ceph cluster from nothing but OSDs, we need to rebuild the monitor store, find the original fsid and import it into a new Ceph cluster. It's a bit tedious, but doable.
Hopefully Rook will have recreated the deployments for the OSDs, and assuming that is the case we can 'debug' one of the OSD deployments to find the original fsid.
❯ kubectl rook-ceph debug start rook-ceph-osd-0
2024/10/21 23:07:53 maxprocs: Updating GOMAXPROCS=1: using minimum allowed GOMAXPROCS
Info: fetching the deployment rook-ceph-osd-0 to be running
Info: deployment rook-ceph-osd-0 exists
Info: setting debug command to main container
Info: deployment rook-ceph-osd-0 scaled down
Info: waiting for the deployment pod rook-ceph-osd-0-57c9766f64-rtphk to be deleted
Info: ensure the debug deployment rook-ceph-osd-0 is scaled up
Info: pod rook-ceph-osd-0-57c9766f64-rtphk is ready for debugging
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- cat /var/lib/ceph/osd/ceph-0/ceph_fsid
2a7f4468-5f3b-4063-9bea-24026ea40b4c
# or ...
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- ceph-bluestore-tool show-label --dev /dev/nvme0n1
If not, a live ISO of Ubuntu may be required. I started off with this and concluded it was a waste of time; the only things it was useful for were obtaining the original fsid and the reassurance that recovery might be possible.
Assuming a live ISO of Ubuntu, install Ceph and get the fsid from one of the OSDs.
❯ apt install -y ceph ceph-common ceph-osd
❯ sudo ceph-bluestore-tool show-label --dev /dev/nvme0n1
The new Rook Ceph cluster will have a new fsid, which is useless and needs to be replaced with the original fsid.
Where 51ea9bae-86ce-4f9f-b15e-1fa7533b3508 is the original fsid:
kubectl patch secret rook-ceph-mon -p '{
  "stringData": {
    "fsid": "51ea9bae-86ce-4f9f-b15e-1fa7533b3508"
  }
}'
Rook should eventually reconcile and create the OSD deployments if it hasn't already, and propagate the new fsid to /var/lib/rook/rook-ceph/rook-ceph.config.
The new mon store needs to be updated by each OSD. This is easy on a single node, and a bit annoying on multiple nodes. This is written for recovery of a single node, but can be extrapolated for multiple nodes.
Essentially:
❯ kubectl rook-ceph debug start rook-ceph-osd-0
❯ kubectl exec -it deploy/rook-ceph-osd-0-debug -- bash
Defaulted container "osd" out of: osd, log-collector, activate (init), expand-bluefs (init), chown-container-data-dir (init)
[root@rook-ceph-osd-0-debug-695fd4c7b-c4w9q ceph]# mkdir /tmp/monstore/
[root@rook-ceph-osd-0-debug-695fd4c7b-c4w9q ceph]# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-0 --op update-mon-db --mon-store-path /tmp/monstore --no-mon-config
osd.0 : 0 osdmaps trimmed, 653 osdmaps added.
❯ kubectl rook-ceph debug stop rook-ceph-osd-0
This can be automated with the following script:
# make sure to `mkdir -p /var/lib/rook/monstore-recovery` before running
for osd in $(kubectl get deploy -lapp=rook-ceph-osd -ojsonpath='{.items[*].metadata.labels.osd}'); do
  kubectl rook-ceph debug start "rook-ceph-osd-$osd"
  kubectl exec "deploy/rook-ceph-osd-$osd-debug" -- ceph-objectstore-tool --type bluestore --data-path "/var/lib/ceph/osd/ceph-$osd" --op update-mon-db --mon-store-path /var/lib/rook/monstore-recovery --no-mon-config
  kubectl rook-ceph debug stop "rook-ceph-osd-$osd"
done
Once the mon db has been updated from every OSD, the mon store can be rebuilt.
❯ kubectl rook-ceph debug start rook-ceph-osd-0
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- ceph-monstore-tool /var/lib/rook/monstore-recovery rebuild -- --keyring /var/lib/rook/rook-ceph/client.admin.keyring
❯ kubectl rook-ceph debug stop rook-ceph-osd-0
Once the mon store has been rebuilt, the empty mon can be replaced with the rebuilt mon store.
❯ kubectl rook-ceph debug start rook-ceph-osd-0
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- cp -r /var/lib/rook/mon-a /var/lib/rook/mon-a-backup
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- rm -rf /var/lib/rook/mon-a/data/store.db
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- cp -r /var/lib/rook/monstore-recovery/store.db/ /var/lib/rook/mon-a/data/store.db
❯ kubectl rook-ceph debug stop rook-ceph-osd-0
Start a debug session with any OSD.
❯ kubectl rook-ceph debug start rook-ceph-osd-0
Scale down the operator and the mon.
❯ kubectl scale deploy/rook-ceph-operator --replicas=0
deployment.apps/rook-ceph-operator scaled
❯ kubectl scale deploy/rook-ceph-mon-a --replicas=0
deployment.apps/rook-ceph-mon-a scaled
Update the mon map from the OSD debug session. There should only be one mon in the mon map, but if there are more, they can be removed.
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- ceph-mon -i a --extract-monmap /tmp/monmap --mon-data /var/lib/rook/mon-a/data
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- monmaptool /tmp/monmap --print
monmaptool: monmap file /tmp/monmap
epoch 0
fsid 00000000-0000-0000-0000-000000000000
last_changed 2024-10-21T23:57:38.363259+0000
created 2024-10-21T23:57:38.363259+0000
min_mon_release 0 (unknown)
election_strategy: 1
0: v2:10.99.179.26:3300/0 mon.a
1: v2:10.104.57.166:3300/0 mon.b
2: v2:10.105.210.150:3300/0 mon.c
Remove superfluous mons if they exist (they shouldn't, but they did in my case because I hadn't configured Rook to use only a single mon).
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- monmaptool /tmp/monmap --rm b
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- monmaptool /tmp/monmap --rm c
Set the original fsid on the mon map, where 51ea9bae-86ce-4f9f-b15e-1fa7533b3508 is your fsid:
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- monmaptool /tmp/monmap --fsid 51ea9bae-86ce-4f9f-b15e-1fa7533b3508
Get the new mon address from Kubernetes.
❯ kubectl get svc rook-ceph-mon-a -o jsonpath='{.spec.clusterIP}'
10.104.57.166
Where 10.104.57.166 is the IP of the new mon:
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- monmaptool /tmp/monmap --rm a
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- monmaptool /tmp/monmap --add a 10.104.57.166:3300
Print the new mon map to ensure it is correct, and then commit it.
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- monmaptool /tmp/monmap --print
monmaptool: monmap file /tmp/monmap
epoch 1
fsid 2a7f4468-5f3b-4063-9bea-24026ea40b4c
last_changed 2024-10-21T23:57:38.363259+0000
created 2024-10-21T23:57:38.363259+0000
min_mon_release 0 (unknown)
election_strategy: 1
0: v2:10.104.57.166:3300/0 mon.a
❯ kubectl exec deploy/rook-ceph-osd-0-debug -- ceph-mon -i a --inject-monmap /tmp/monmap --mon-data /var/lib/rook/mon-a/data
❯ kubectl rook-ceph debug stop rook-ceph-osd-0
❯ kubectl scale deploy/rook-ceph-operator --replicas=1
deployment.apps/rook-ceph-operator scaled
❯ kubectl scale deploy/rook-ceph-mon-a --replicas=1
deployment.apps/rook-ceph-mon-a scaled
This step should not be necessary if Rook was correctly configured to use only a single mon, but I didn't do that and had to manually update the mon config.
I saw vague errors like the following, because the mon couldn't establish quorum:
CEPH cluster became unresponsive: e5 handle_auth_request failed to assign global_id
❯ mon_host=$(kubectl get svc rook-ceph-mon-a -o jsonpath='{.spec.clusterIP}')
kubectl patch secret rook-ceph-config -p '{"stringData": {"mon_host": "[v2:'"${mon_host}"':3300]", "mon_initial_members": "a"}}'
In addition, the mons may need to be removed from the mon endpoints configmap.
❯ mon_host=$(kubectl get svc rook-ceph-mon-a -o jsonpath='{.spec.clusterIP}')
kubectl patch cm rook-ceph-mon-endpoints -p '{"data": {"data": "a='"${mon_host}"':3300"}}'
The OSDs don't have the mgr cap, and it needs to be added manually.
Error EINVAL: key for osd.14 exists but cap mgr does not match
The caps can be added in the toolbox pod with:
❯ ceph auth caps osd.14 mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *'
To save time, this script will do it for all OSDs:
for osd in $(kubectl get deploy -lapp=rook-ceph-osd -ojsonpath='{.items[*].metadata.labels.osd}'); do
  # the osd label holds the bare ID (e.g. "14"), so prefix it with "osd."
  kubectl exec deploy/rook-ceph-tools -- ceph auth caps "osd.$osd" mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *'
done
At this point, the mon should be active and the Ceph cluster should be "healthy".
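A quick way to confirm is to check the cluster status from the toolbox pod:
❯ kubectl exec deploy/rook-ceph-tools -- ceph status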
Increase the number of mons from 1 to 3 in the CephCluster custom resource.
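If the CephCluster manifest is managed declaratively, change it there; otherwise a merge patch works, assuming the CephCluster is named rook-ceph as in the default Rook examples:
❯ kubectl patch cephcluster rook-ceph --type merge -p '{"spec":{"mon":{"count":3}}}'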
Then, the mons should join and establish quorum.
If not, we can follow the guide for restoring mon quorum.
❯ kubectl rook-ceph mons restore-quorum a
The guidance for restoring CephFS with a single data pool should mostly just work.
Where main is the name of the Ceph filesystem.
❯ kubectl exec deploy/rook-ceph-tools -- ceph fs new main main-metadata main-default
If, like in my case, there are multiple data pools, or erasure coding is used, then we have to do some hacks.
The existing data pools cannot be added to the Ceph filesystem, because Ceph thinks they are already part of the filesystem, even though they aren't. This can be resolved by removing the application association from each data pool. The docs suggest there could be data loss, but my understanding is that it simply changes some metadata, and there is no other way to do this anyway.
❯ kubectl exec deploy/rook-ceph-tools -- ceph fs ls
name: main, metadata pool: main-metadata, data pools: [main-default ]
❯ kubectl exec deploy/rook-ceph-tools -- ceph fs add_data_pool main main-nvme-ec
Error EINVAL: RADOS pool 'main-nvme-ec' is already used by filesystem 'main' as a 'data' pool for application 'cephfs'
❯ kubectl exec deploy/rook-ceph-tools -- ceph osd pool application get main-nvme-ec
{
    "cephfs": {
        "data": "main"
    }
}
❯ kubectl exec deploy/rook-ceph-tools -- ceph osd pool application rm main-nvme-ec cephfs data
removed application 'cephfs' key 'data' on pool 'main-nvme-ec'
❯ kubectl exec deploy/rook-ceph-tools -- ceph fs add_data_pool main main-nvme-ec
Pool 'main-nvme-ec' (id '41') has pg autoscale mode 'on' but is not marked as bulk.
Consider setting the flag by running # ceph osd pool set main-nvme-ec bulk true
added data pool 41 to fsmap
Other than that, the rest of the instructions should work.
❯ kubectl exec deploy/rook-ceph-tools -- ceph fs set main joinable true
main marked joinable; MDS may join as newly active.
The MDS pods may need to be restarted, but otherwise the Ceph filesystem should now be healthy.
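If they do, deleting the MDS pods and letting their deployments recreate them is usually enough (assuming the standard app=rook-ceph-mds label):
❯ kubectl delete pod -l app=rook-ceph-mds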
❯ kubectl exec deploy/rook-ceph-tools -- ceph fs status
main - 1 clients
====
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active main-b Reqs: 0 /s 0 0 0 0
0-s standby-replay main-a Evts: 0 /s 0 0 0 0
POOL TYPE USED AVAIL
main-metadata metadata 765M 3811G
main-default data 24.0k 3811G
main-nvme-ec data 6684G 7622G
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
I have opened an issue on the tracker to officially document, or support, restoring Ceph filesystems with multiple data pools.
This bit is painful. The mapping of CSI volumes to persistent volumes is lost if the Kubernetes control plane is lost. This means that each CSI volume has to be inspected manually and identified. I recommend a spreadsheet for this.
Some CSI volumes are pretty easy to identify just by their size, but the rest really do need manual inspection. There will be a lot of CSI volumes which are either unused, duplicates, invalid or just left over from a 'retain' reclaim policy.
❯ kubectl exec deploy/rook-ceph-tools -- rbd ls replicapool-nvme
...
If the size of a CSI volume is a useful identifier, I wrote this script to list the volumes and their sizes:
# run inside the toolbox pod: kubectl exec -it deploy/rook-ceph-tools -- bash
for image in $(rbd ls replicapool-nvme); do
  echo -n "$image,"
  # extract "250 GiB" from "size 250 GiB in 64000 objects"
  rbd info replicapool-nvme/$image | grep size | awk '{print $2, $3}' | tr -d ' ' | tr -d 'B' | numfmt --from=auto --to-unit=G
done
Follow the guide for creating the direct mount pod.
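If it isn't already running, the direct mount pod can be created from the example manifest in the Rook repository; the path below assumes a checkout of the Rook source tree and may vary by version:
❯ kubectl create -f deploy/examples/direct-mount.yaml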
The direct mount pod can be used to mount RBD images and inspect them quickly. In my case, I had about 60 RBD images to inspect.
Map, mount and list the contents of the RBD image:
❯ kubectl exec -it deploy/rook-direct-mount -- rbd map replicapool-nvme/csi-vol-10fb6512-7888-4416-bac4-810d5623e542
❯ kubectl exec -it deploy/rook-direct-mount -- mount /dev/rbd0 /mnt
❯ kubectl exec -it deploy/rook-direct-mount -- ls -al /mnt
Unmount and unmap the RBD image:
❯ kubectl exec -it deploy/rook-direct-mount -- umount /mnt
❯ kubectl exec -it deploy/rook-direct-mount -- rbd unmap /dev/rbd0
A script which templates all the commands may be helpful to make this go a bit quicker. This one is JavaScript; run it with node and paste the output into a shell.
console.log(`
csi-vol-10fb6512-7888-4416-bac4-810d5623e542
csi-vol-47e9037c-2187-4272-bec9-5ac0b69479f7
csi-vol-498f8772-43fe-4a7f-8676-d4c054b2b006
csi-vol-89b44878-6472-4b2b-a172-ec3eb708f399
csi-vol-aeed2194-ad77-41b0-95be-3bc625b2afa0
csi-vol-505470ff-002c-44cc-9e33-03f3c622720b
csi-vol-54b3fe0e-eb75-45a5-80dd-8957d40fecc3
csi-vol-702f5574-61ab-42ad-ac66-df3c7d160cdd
csi-vol-994b927b-379a-40c4-9c8c-3477cbe21a4d
csi-vol-b5ec04d9-5fc4-4cba-960b-08855921fc63
csi-vol-d923cd76-ad32-4862-9dbb-be33064a71a0
csi-vol-f61d70ad-2eb4-4b25-8c96-63226bb87811
`.trim().split('\n').map(v => `
# inspect ${v}
kubectl exec -it deploy/rook-direct-mount -- rbd map replicapool-nvme/${v}
kubectl exec -it deploy/rook-direct-mount -- mount /dev/rbd0 /mnt
kubectl exec -it deploy/rook-direct-mount -- ls -al /mnt
# cleanup ${v}
kubectl exec -it deploy/rook-direct-mount -- umount /mnt
kubectl exec -it deploy/rook-direct-mount -- rbd unmap /dev/rbd0
`.trim()).join('\n\n'))
Where main is the name of the Ceph filesystem.
❯ kubectl exec deploy/rook-ceph-tools -- ceph fs subvolume ls main csi
[
    {
        "name": "csi-vol-95d94b9d-e3f7-4b95-bd79-6005576ee3c1"
    },
    {
        "name": "csi-vol-deea4df5-96ec-4a3c-b452-36a9ab2b9c1f"
    }
]
Truthfully, I used the Ceph dashboard to inspect the Ceph filesystems, but it is also possible to do this from the command line.
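For the dashboard route, a port-forward to the mgr dashboard service is enough; a sketch assuming Rook's default dashboard service name, SSL port and password secret:
❯ kubectl port-forward svc/rook-ceph-mgr-dashboard 8443:8443
❯ kubectl get secret rook-ceph-dashboard-password -o jsonpath='{.data.password}' | base64 -d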
❯ kubectl exec deploy/rook-ceph-tools -- ceph fs subvolume ls main csi/csi-vol-95d94b9d-e3f7-4b95-bd79-6005576ee3c1
[
    {
        "name": "45c65a90-bd64-46d9-aaf0-80d229aeb994"
    }
]
❯ kubectl exec deploy/rook-ceph-tools -- ceph fs subvolume ls main csi/csi-vol-95d94b9d-e3f7-4b95-bd79-6005576ee3c1/45c65a90-bd64-46d9-aaf0-80d229aeb994
[
    {
        "name": "shows"
    },
    {
        "name": "books"
    },
    {
        "name": "music"
    },
    {
        "name": "downloads"
    },
    {
        "name": "movies"
    }
]
This part is also a bit painful, but workable.
The documentation for creating static PVs is helpful, but hopefully this document is better as it caters directly to Rook.
Instead of using the volumeName attribute in stateful sets, I'd suggest just creating the PVCs manually. It's a shame, but there is no other way around it if the stateful set has multiple replicas. Even if it only has one replica, I feel it's probably cleaner and more portable to just provision the PVCs manually. When doing this, create the PVC before the stateful set so that the stateful set doesn't create a new PVC.
For each volume, ensure the following:
- Storage requests must match.
- The volumeName attribute on the PVC must match the name of the PV.
- The volumeHandle attribute must match the name of the RBD image.
Other than that, this should "just work".
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: config-jellyfin-0
  namespace: media
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-nvme
  volumeName: media-config-jellyfin-0
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-config-jellyfin-0
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Gi
  csi:
    controllerExpandSecretRef:
      name: rook-csi-rbd-provisioner
      namespace: rook-ceph
    driver: rook-ceph.rbd.csi.ceph.com
    fsType: ext4
    nodeStageSecretRef:
      name: rook-csi-rbd-node
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      imageFeatures: layering,fast-diff,object-map,deep-flatten,exclusive-lock
      pool: replicapool-nvme
      staticVolume: "true"
    volumeHandle: csi-vol-4d763654-d39f-41ac-b4e2-12024c2e297e
  mountOptions:
    - discard
  persistentVolumeReclaimPolicy: Retain
  storageClassName: rook-ceph-nvme
  volumeMode: Filesystem
CephFS is much the same as RBD, except it needs the credentials to be formatted differently. This is explained in the CephFS shared volume docs.
❯ kubectl get secret rook-csi-cephfs-node -oyaml > rook-csi-cephfs-node.yaml
Change the name of the object to rook-csi-cephfs-node-user, change the keys from adminID and adminKey to userID and userKey, then create the new object.
❯ kubectl apply -f rook-csi-cephfs-node.yaml
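Alternatively, the same secret can be created in one go without editing YAML; a sketch assuming the source secret's keys are adminID and adminKey as above:
❯ adminID=$(kubectl get secret rook-csi-cephfs-node -o jsonpath='{.data.adminID}' | base64 -d)
❯ adminKey=$(kubectl get secret rook-csi-cephfs-node -o jsonpath='{.data.adminKey}' | base64 -d)
❯ kubectl create secret generic rook-csi-cephfs-node-user --from-literal=userID="$adminID" --from-literal=userKey="$adminKey"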
Like before, ensure the following:
- Storage requests must match.
- The volumeName attribute on the PVC must match the name of the PV.
- The volumeHandle attribute can be anything; either use the name of the PV or the CSI volume.
- The rootPath attribute must match the path of the CephFS subvolume.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media
  namespace: media
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 8Ti
  storageClassName: rook-cephfs-nvme
  volumeMode: Filesystem
  volumeName: media-media
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-media
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 8Ti
  csi:
    controllerExpandSecretRef:
      name: rook-csi-cephfs-provisioner
      namespace: rook-ceph
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-node-user
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: main
      rootPath: /volumes/csi/csi-vol-95d94b9d-e3f7-4b95-bd79-6005576ee3c1/45c65a90-bd64-46d9-aaf0-80d229aeb994
      staticVolume: "true"
    volumeHandle: media-media
  persistentVolumeReclaimPolicy: Retain
  storageClassName: rook-cephfs-nvme
  volumeMode: Filesystem
I did encounter this error when creating the CephFS volumes, because I had passed the 'discard' mount option; CephFS didn't like that.
stderr: unable to get monitor info from DNS SRV with service name: ceph-mon
2023-01-09T17:51:39.771+0000 7f9a29f55f40 -1 failed for service _ceph-mon._tcp
- Use enterprise SSDs.
- Back up important data.
- Distribute monitors across multiple machines if possible.
- Back up, or actively mirror (RAID 1), the OS SSD to prevent the loss of etcd data, Rook data and mon data.
- Ceph is surprisingly resilient, even if the restoration process is hard and convoluted.