feature(local-disks): Add support for raid10 on instance storage #1549
Conversation
force-pushed from ab2e1d7 to 26731d7
My only concern here is the initial resync bit. The documentation implies that an initial resync is never really necessary, but also... is? I don't know enough about RAID to make a call one way or the other, but we need to be confident that the array will be healthy and performant in a production environment before we merge this.
RAID5 does need an initial resync per the documentation too, and it is not safe to skip it there; I would scope the raid5 case out here, though. What happens in an initial resync is going through each block on each disk of a mirror and writing a zero to ensure the disks are identical. Nothing more. This is actually even harmful for SSDs. The reason the initial resync isn't "strictly" needed is that when there's a write to the filesystem, the block is written to all the mirrored disks anyway, so there's no potential for inconsistent data on reads. Is there a performance concern if one disk is trimmed and one disk is not? I guess there could be, but we also have a documented[2] guarantee that the disks are fully trimmed. Some other corner cases could come up in recovery situations where we have a disk that has data and we try a recovery. On the other hand, that's a situation we should never end up in with AWS Instance Storage. Is there something more I could do to bring some confidence?
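For readers following along: the resync state being discussed here can be inspected from userspace. A minimal sketch, assuming an md array at `/dev/md0` (the array name is an assumption, not something this PR defines):

```bash
# Show all md arrays; while an initial resync is running, a progress
# line like "resync = 12.3% ... finish=87.4min" appears here.
cat /proc/mdstat

# Per-array detail, including the current sync action (idle/resync/recover).
mdadm --detail /dev/md0
cat /sys/block/md0/md/sync_action
```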
That's my concern; if someone is counting on this redundancy and it doesn't work when they need it to, that's pretty terrible. That being said, this will likely always be opt-in. Can you add some notes to the doc that, for now, this option is "experimental" or "use at your own risk" until we have some feedback in the wild?
This doesn't affect redundancy, because there's no way to replace a single SSD on the instance. The whole instance will be replaced at once, and then all the disks are clean ones. This just helps with graceful termination, so that workloads can be migrated away before the instance is replaced.
Yes, makes sense, will do.
Ah, I see! Thanks. I think we're good to go after the doc changes. 👍
force-pushed from 26731d7 to 2f1bcc4
Would that do? https://github.com/awslabs/amazon-eks-ami/pull/1549/files#diff-da030e6a3883ea1c1765967f4609e0e675451db33557df79c65c2a07b02c8e6fR390
@cartermckinnon would we be happy with the doc changes mentioned above?
This LGTM. Can you rebase to address the conflicts? This PR got caught in the lurch of our branch rename.
force-pushed from 2f1bcc4 to ea76763
This removes the wait block for raid resync for two reasons:

1) raid0 does not have redundancy and therefore no initial resync[1]
2) with raid10, the resync time for 4x 1.9TB disks takes from tens of minutes to multiple hours, depending on the sysctl params `dev.raid.speed_limit_min` and `dev.raid.speed_limit_max` and the speed of the disks. Initial resync for raid10 is not strictly needed[1]

Filesystem creation: by default `mkfs.xfs` attempts to TRIM the drive. This can also take tens of minutes or hours, depending on the size of the drives. TRIM can be skipped, as instances are delivered with disks fully trimmed[2].

[1] https://raid.wiki.kernel.org/index.php/Initial_Array_Creation
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html#InstanceStoreTrimSupport
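For context, the throughput limits named in the commit message are global kernel sysctls. A minimal sketch of inspecting and raising them (the value shown is illustrative, not from this PR):

```bash
# Current kernel-wide RAID resync throughput bounds, in KiB/s per device.
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# Raising the floor makes a pending resync finish sooner at the cost of
# competing with regular I/O; 200000 KiB/s is an arbitrary example value.
sudo sysctl -w dev.raid.speed_limit_min=200000
```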
force-pushed from ea76763 to 3276ef3
@cartermckinnon now rebased
Well.. this is embarrassing, but there was an ASG depending on my branch with the old path before the repo restructuring, so I think I have to create a new pull request, as the PR branch can't be changed as far as I can tell. Sorry about that 😅 Opened a new one: #1900
Description of changes:
I would like to be able to migrate workloads away from a node gracefully in the case of an instance storage drive failure. RAID10 would provide redundancy at the cost of disk space.
Adding support for creating raid10 in addition to raid0. This also removes the wait block for raid resync, for two reasons:

1. raid0 does not have redundancy and therefore no initial resync[1]
2. with raid10, the resync time for 4x 1.9TB disks takes from tens of minutes to multiple hours, depending on the sysctl params `dev.raid.speed_limit_min` and `dev.raid.speed_limit_max` and the speed of the disks. Initial resync for raid10 is not strictly needed[1]

Filesystem creation: by default `mkfs.xfs` attempts to TRIM the drive. This is also something that can take tens of minutes or hours, depending on the size of the drives. TRIM can be skipped, as instances are delivered with disks fully trimmed[2] (a sketch of both shortcuts follows below).

[1] https://raid.wiki.kernel.org/index.php/Initial_Array_Creation
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html#InstanceStoreTrimSupport
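To make the two shortcuts concrete, here is a hedged sketch of creating a raid10 array without the initial resync and formatting it without the mkfs-time TRIM. The device names are hypothetical, and `--assume-clean` is one way to skip the resync, not necessarily the exact mechanism this PR's script uses:

```bash
# Hypothetical instance-store NVMe devices.
DISKS="/dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1"

# Create a raid10 array; --assume-clean skips the initial resync. This
# leans on the guarantee that instance-store disks arrive fully trimmed
# (reading as zeroes), so all mirror halves are already identical[2].
mdadm --create /dev/md0 --level=10 --raid-devices=4 --assume-clean $DISKS

# -K tells mkfs.xfs not to discard (TRIM) blocks at mkfs time,
# relying on the same fully-trimmed guarantee[2].
mkfs.xfs -K /dev/md0
```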
Testing Done

On `m6id.metal`, with kernel defaults:

With increased resync limits:
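A hedged sketch of how such a timing comparison could be run (device names are assumptions; `mdadm --wait` blocks until any running resync finishes):

```bash
# Create the array *with* an initial resync, then time how long the
# resync takes under the current dev.raid.speed_limit_* settings.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
  /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
time mdadm --wait /dev/md0
```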