#!/usr/bin/env bash
#
# Author: Markus (MawKKe) [email protected]
# Date: 2018-03-19
#
#
# What?
#
# Linux dm-crypt + dm-integrity + dm-raid (RAID1)
#
# = A secure, redundant array with data integrity protection
#
# Why?
#
# You see, RAID1 is a dead-simple tool for disk redundancy,
# but it does NOT protect you from bit rot. There is no way
# for RAID1 to distinguish which drive has the correct data if rot occurs.
# This is a silent killer.
#
# With dm-integrity, you can now have error detection
# at the block level. However, it alone does not provide error correction
# and is pretty useless with just one disk (disks fail, shit happens).
#
# But if you use dm-integrity *below* RAID1, now you have disk redundancy,
# AND error checking AND error correction. Invalid data received from
# a drive will cause a checksum error, which the RAID array notices and
# repairs with the correct data.
#
# If you throw encryption into the mix, you'll have a secure,
# redundant array. Oh, and the data integrity can be protected with
# authenticated encryption, so no one can maliciously tamper with your data.
#
# How cool is that?
#
# Also: If you use RAID1 arrays as LVM physical volumes, the overall
# architecture is quite similar to ZFS! All with native Linux tools,
# and no hacky Solaris compatibility layers or licensing issues!
#
# (I guess you can use whatever RAID level you want, but RAID1 is the
# simplest and fastest to set up)
#
#
# Let's try it out!
#
# ---
# NOTE: The dm-integrity target is available since Linux kernel version 4.12.
# NOTE: This example requires LUKS2, which was only recently released (2018-03)
# NOTE: The authenticated encryption is still experimental (2018-03)
# ---
set -eux
# 1) Make dummy disks
cd /tmp
truncate -s 500M disk1.img
truncate -s 500M disk2.img
# Format the disks with luksFormat:
dd if=/dev/urandom of=key.bin bs=512 count=1
cryptsetup luksFormat -q --type luks2 --integrity hmac-sha256 disk1.img key.bin
cryptsetup luksFormat -q --type luks2 --integrity hmac-sha256 disk2.img key.bin
# The luksFormat calls might take a while, since --integrity causes the disks to be wiped.
# dm-integrity is usually configured with 'integritysetup' (see below), but as
# it happens, cryptsetup can do all the integrity configuration automatically if
# the --integrity flag is specified.
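# (Optional sanity check, not required for the setup: you can inspect what got
# configured with
#
# $ cryptsetup luksDump disk1.img
#
# and look for the integrity setting in the data segment / keyslot info.
# Exact output varies between cryptsetup versions.)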
# Open/attach the encrypted disks
cryptsetup luksOpen disk1.img disk1luks --key-file key.bin
cryptsetup luksOpen disk2.img disk2luks --key-file key.bin
# Create the RAID1 array:
mdadm \
    --create \
    --verbose --level 1 \
    --metadata=1.2 \
    --raid-devices=2 \
    /dev/md/mdtest \
    /dev/mapper/disk1luks \
    /dev/mapper/disk2luks
# Create a filesystem, add it to an LVM volume group, etc...
mkfs.ext4 /dev/md/mdtest
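# (Alternatively, skip the mkfs above and use the array as an LVM physical
# volume. A rough sketch, not part of this demo; the volume group and LV names
# are just placeholders:
#
# $ pvcreate /dev/md/mdtest
# $ vgcreate vg_test /dev/md/mdtest
# $ lvcreate -L 200M -n lv_test vg_test
# $ mkfs.ext4 /dev/vg_test/lv_test
# )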
# Cool! Now you can 'scrub' the raid setup, which verifies
# the contents of each drive. Ordinarily, detecting an error would
# be problematic (which copy is correct?), but since we are now using
# dm-integrity, the RAID1 *knows* which drive has the correct data,
# and is able to fix it automatically.
#
# To scrub the array:
#
# $ echo check > /sys/block/md127/md/sync_action
#
# ... wait a while
#
# $ dmesg | tail -n 30
#
# You should see
#
# [957578.661711] md: data-check of RAID array md127
# [957586.932826] md: md127: data-check done.
#
#
# Let's simulate disk corruption:
#
# $ dd if=/dev/urandom of=disk2.img seek=30000 count=30 bs=1k conv=notrunc
#
# (this writes 30 kB of random data into disk2.img)
#
#
# Run scrub again:
#
# $ echo check > /sys/block/md127/md/sync_action
#
# ... wait a while
#
# $ dmesg | tail -n 30
#
# Now you should see
# ...
# [959146.618086] md: data-check of RAID array md127
# [959146.962543] device-mapper: crypt: INTEGRITY AEAD ERROR, sector 39784
# [959146.963086] device-mapper: crypt: INTEGRITY AEAD ERROR, sector 39840
# [959154.932650] md: md127: data-check done.
#
# But now if you run scrub yet again:
# ...
# [959212.329473] md: data-check of RAID array md127
# [959220.566150] md: md127: data-check done.
#
# And since we didn't get any errors the second time, we can deduce that the invalid
# data was repaired automatically.
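# (If you want an extra sanity check: 'cat /proc/mdstat' should show the array
# as clean, and /sys/block/md127/md/sync_action should read 'idle' once the
# check has finished.)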
#
# Great! We are done.
#
# --------
#
# If you don't need encryption, you can use 'integritysetup' instead of cryptsetup.
# It works in a similar fashion:
#
# $ integritysetup format --integrity sha256 disk1.img
# $ integritysetup format --integrity sha256 disk2.img
# $ integritysetup open --integrity sha256 disk1.img disk1int
# $ integritysetup open --integrity sha256 disk2.img disk2int
# $ mdadm --create ...
#
# ...and so on. You can still detect and repair disk errors, but you have no
# protection against malicious cold-storage attacks, and the data is readable
# by anybody.
#
# 2018-03 NOTE:
#
# If you override the default --integrity value (whatever it is) during formatting,
# then you must specify it again when opening, like in the example above. For some
# reason the algorithm is not autodetected. I guess there is no header written onto
# the disk like there is with LUKS?
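# (For what it's worth: 'integritysetup dump disk1.img' shows the dm-integrity
# superblock that is written to disk (tag size, sector size, etc.), but it does
# not appear to record the tag algorithm itself, which would explain why it has
# to be passed again on open.)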
#
# ----------
#
# Read more:
# https://fosdem.org/2018/schedule/event/cryptsetup/
# https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt
# https://gitlab.com/cryptsetup/cryptsetup/wikis/DMIntegrity
# https://mirrors.edge.kernel.org/pub/linux/utils/cryptsetup/v2.0/v2.0.0-rc0-ReleaseNotes
Are there any known disadvantages with encryption for ZFS shipped with FreeBSD?
Obviously encryption happens in a separate layer there.
Main thread: https://forums.raspberrypi.com/viewtopic.php?p=2089261#p2089261
Currently (as of July 2023) there are some unresolved bugs with ZFS's native encryption and edge cases involving send/recv, that can potentially cause data corruption on both the home pool and a snapshot receiving pool.
See the following for references:
- https://discourse.practicalzfs.com/t/is-native-encryption-ready-for-production-use/532/3
- https://docs.google.com/spreadsheets/d/1OfRSXibZ2nIE9DGK6swwBZXgXwdCPKgp4SbPZwTexCg/htmlview
- https://github.com/openzfs/zfs/issues?q=is%3Aopen+is%3Aissue+label%3A%22Component%3A+Encryption%22
While these seem to be reported mostly on Linux systems, that's likely just because that's where most of the users are; I personally wouldn't expect FreeBSD to be free of these issues.
Basically, if you don't send/recv an encrypted dataset and instead use something like rsync, you should be fine.
I have read through the thread. Great discussion, and a lot of useful info!
Does anyone know whether disabling journaling for dm-integrity (--integrity-no-journal) is safe with a dm-integrity > md-raid > filesystem kind of setup?
I'm trying to research this, but there is hardly any info. My logic is this: if journaling is off and there is corruption in the dm-integrity layer, e.g. due to a power cut, then this would be presented to the md-raid layer as a normal unrecoverable read error. Then md-raid would correct the error the same way as in any other case. The fact that the error was a result of disabled journaling is irrelevant.
The only problem I can see with this setup is if the integrity error is not discovered in time and is only found once the md-raid layer no longer has a copy (RAID-1) or parity (RAID-5/6) for the lost data. E.g. there is a power cut -> a faulty block is written in the dm-integrity layer -> some time later a disk dies -> we try to rebuild the array, but the faulty block causes a problem.
But thinking about it further: if we do a RAID scrub every time after a power cut, then is it safe to disable journaling? Am I right with this logic?
Apart from a power cut or yanking a disk out mid-operation, are there any situations where dm-integrity would write faulty blocks silently, and where this wouldn't happen with journaling enabled?
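For concreteness, the kind of setup I have in mind would be something like this (just a sketch reusing the commands from the gist above, and assuming integritysetup accepts --integrity-no-journal the same way cryptsetup does):

integritysetup format --integrity sha256 disk1.img
integritysetup format --integrity sha256 disk2.img
integritysetup open --integrity sha256 --integrity-no-journal disk1.img disk1int
integritysetup open --integrity sha256 --integrity-no-journal disk2.img disk2int
mdadm --create /dev/md/mdtest --level 1 --raid-devices=2 /dev/mapper/disk1int /dev/mapper/disk2int
# ...and after every unclean shutdown, force a scrub:
echo check > /sys/block/md127/md/sync_action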
Given that both copies will likely be written at once, there's a high chance that both copies will end up mismatched with their checksums. So no, it's definitely not safe in general.
This is a nice example to get ZFS-like error detection with regular filesystems like EXT4 or XFS, but what about alignment issues and write amplification? Do we have to adjust the mdadm or filesystem chunk/stride calculations for alignment, or is that taken care of by default?
The linked article https://securitypitfalls.wordpress.com/2020/09/27/making-raid-work-dm-integrity-with-md-raid/ suggests:
- avoiding read-modify-write by passing --chunk 2048 to mdadm
- setting /sys/block/md0/md/stripe_cache_size to 4096
- using stripe unit and stripe width when doing mkfs to reflect the hardware realities, as the mdadm device wrongly looks like 4Kn
(rough sketch of these below)
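Putting those together, a rough sketch of what I mean (untested; assuming a 4-disk RAID5 on top of the dm-integrity devices and 4 KiB filesystem blocks, device names are placeholders):

mdadm --create /dev/md0 --level 5 --chunk 2048 --raid-devices=4 \
    /dev/mapper/disk1int /dev/mapper/disk2int /dev/mapper/disk3int /dev/mapper/disk4int
echo 4096 > /sys/block/md0/md/stripe_cache_size
# 2048 KiB chunk / 4 KiB blocks = stride 512; 3 data disks -> stripe-width 1536
mkfs.ext4 -E stride=512,stripe-width=1536 /dev/md0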
Alignment issues are no worse than with any RAID: for a file system it looks like regular MD RAID.
There is journaling at the integrity level, so write amplification is on the horrible side. It may be better if you have a device with a configurable sector size (like 520 instead of 512 bytes).
If this is for the root drive, one potential advantage of doing LUKS over mdadm RAID1 instead of the reverse is that LUKS over a degraded RAID will boot normally since this patch, whereas a degraded RAID over LUKS may fail the boot process and drop into an initramfs shell due to a missing encrypted root disk. That may require typing exit to continue, although this may not be a big deal to some.
To journal or not to journal?
What worries me is not losing 1 sector, but possibility of losing N sectors. This is because of how metadata and data are inter-leaved: it is 1 sector full of IVs and checksums ("tags") followed by N sectors that are covered by it (source). So, what if the metadata sector becomes corrupt? Then all N sectors could be seen as having invalid checksums.
I think there's a possibility of silent corruption: suppose the filesystem is trying to write just 1 of the N sectors, and something happens and all N sectors get corrupted. The filesystem would have no idea that the other N-1 sectors could've been left dirty as well, because it thinks it was writing only the one. Corruption of the others would be a side effect.
Thankfully it seems that most hardware writes sectors atomically, so you'd either end up with N old checksums or N new checksums, and the difference would only be in the checksums for the sectors you were about to write.
I sent this e-mail to dm-crypt at saout.de (is that mailing list still working?) to try to get some answers:
Hi All,
I could not find a place where some deeper implications of --integrity-no-journal are explained.
I understand that power loss would leave the disk in a dirty state and the checksum(s) won't match, so on the next read it would return an i/o error (until overwritten or recalculated).
That much I was able to find explained in many places. I already found a previous thread where lots of questions were answered, which already helped my understanding a lot: https://lore.kernel.org/all/[email protected]/.
However, some details about the no-journal mode are still unclear. What's not clear is: exactly how many sectors can be affected, and would a journaling file-system (like ext4) be informed that some adjacent sectors need checking (I guess not)?
I already found a paper that documents the inter-leaved metadata sectors approach (Section 4.3 of https://arxiv.org/pdf/1807.00309.pdf): 1 sector of metadata (IVs + tags) followed by N sectors of encrypted data, where N depends on how much metadata we're able to pack in 1 metadata sector.
Does this mean that metadata sector is always written first, before writing the data?
Then, in case of power loss during data write, we'd only corrupt 1 data sector, is that correct?
Ext4 would know that the sector is dirty because it'd have the journal, correct? But what if power loss happens during a metadata sector write?
Would it affect only the sectors that were being written to by the filesystem or could such a write corrupt IVs & tags of sectors that weren't being written?
If that's the case then ext4 journal wouldn't know that there's these extra sectors that were affected because, from ext4 point-of-view, it wasn't trying to touch the sectors at all.
So, power loss would result in hidden data corruption, is that correct?
Is there a way to tell ext4 that a write is affecting these other sectors so that the ext4 journal can know? If we'd use 2 dm-integrity devices to make an mdadm raid1 (https://gist.github.com/MawKKe/caa2bbf7edcc072129d73b61ae7815fb) then we could think we're safe without a journal, because the affected sectors could be recovered from the other device.
But still, what if the same sector is being written at the same time on both devices?
Power loss would corrupt both and the error would be unrecoverable, correct? Would it be possible to somehow make mdadm not write the same sectors at the same time?
Then, in case of power loss we'd have the sector(s) corrupt on 1 device, but not on the other.
Best regards,
bitcoincashautist
https://gitlab.com/0353F40E
As the post you link to states, torn pages are rare, but not unheard of.
And that assumes just the simple failure scenario: a power outage. It doesn't consider more extreme situations, where the firmware enters some failure state and starts to misbehave.
My approach to storage media is: if it can fail in some weird way, it will fail in some weird way.
It looks like cryptsetup/LUKS disables TRIM/discards with integrity enabled (from man cryptsetup-open):
--allow-discards
Allow the use of discard (TRIM) requests for the device. This
is also not supported for LUKS2 devices with data integrity
protection.
I wonder if there's a layering that preserves TRIM support through the stack but still gives this wonderful RAID1 scrub repair and supports encryption?
Maybe something like:
dm-integrity (with integritysetup --allow-discards) > md-raid1 > LUKS (with --allow-discards) > mkfs.*
This would avoid the duplicate encryption cost as well.
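A rough sketch of the commands, just to make the layering concrete (untested; names are placeholders, and the integritysetup/cryptsetup flags are the ones mentioned above):

integritysetup format --integrity sha256 disk1.img
integritysetup format --integrity sha256 disk2.img
integritysetup open --integrity sha256 --allow-discards disk1.img disk1int
integritysetup open --integrity sha256 --allow-discards disk2.img disk2int
mdadm --create /dev/md/mdtest --level 1 --raid-devices=2 /dev/mapper/disk1int /dev/mapper/disk2int
cryptsetup luksFormat -q --type luks2 /dev/md/mdtest key.bin
cryptsetup luksOpen --allow-discards --key-file key.bin /dev/md/mdtest mdcrypt
mkfs.ext4 /dev/mapper/mdcrypt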
I haven't tried this yet, but thanks for the great info and discussion here!
@MawKKe about the note on the default integrity algorithm for cryptsetup:
https://gitlab.com/cryptsetup/cryptsetup/-/issues/754