
feature(local-disks): Add support for raid10 on instance storage #1549

Closed · wants to merge 1 commit

Conversation

@lassizci commented Dec 20, 2023

Description of changes:
I would like to be able to migrate workloads away from a node gracefully in case of an instance storage drive failure. Raid10 would provide redundancy, trading off disk space.

Adding support for creating raid10 in addition to raid0. This also removes the wait block for raid resync for two reasons:

  1. raid0 does not have redundancy and therefore has no initial resync[1]
  2. with raid10, the resync time for 4x 1.9TB disks ranges from tens of minutes to multiple hours, depending on the sysctl params `dev.raid.speed_limit_min` and `dev.raid.speed_limit_max` and the speed of the disks. The initial resync for raid10 is not strictly needed[1]

Filesystem creation: by default `mkfs.xfs` attempts to TRIM the drive. This is also something that can take tens of minutes or hours, depending on the size of the drives. TRIM can be skipped, as instances are delivered with disks fully trimmed[2].

[1] https://raid.wiki.kernel.org/index.php/Initial_Array_Creation
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html#InstanceStoreTrimSupport
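For reference, the resulting flow is roughly the following (a minimal sketch, not the exact script diff; the array name and device list match the testing output below):

# Create the array. There is no longer a wait for the initial resync: raid0
# has none, and for raid10 it proceeds in the background without affecting
# correctness [1].
mdadm --create --force --verbose /dev/md/kubernetes \
    --level=10 --name=kubernetes --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

# Create the filesystem. -K skips the discard (TRIM) pass, which is safe
# because instance storage disks are delivered fully trimmed [2].
mkfs.xfs -K /dev/md/kubernetes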

Testing Done

on m6id.metal with kernel defaults:

# uname -a
Linux ip-10-24-0-65.eu-west-1.compute.internal 5.10.199-190.747.amzn2.x86_64 #1 SMP Sat Nov 4 16:55:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

#  sysctl dev.raid.speed_limit_min
dev.raid.speed_limit_min = 1000

# sysctl dev.raid.speed_limit_max
dev.raid.speed_limit_max = 200000

# mdadm --create --force --verbose /dev/md/kubernetes --level=10 --name=kubernetes --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 1855337472K
mdadm: automatically enabling write-intent bitmap on large array
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/kubernetes started.

# cat /proc/mdstat 
Personalities : [raid10] 
md127 : active raid10 nvme3n1[3] nvme2n1[2] nvme1n1[1] nvme0n1[0]
      3710674944 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [>....................]  resync =  1.1% (41396352/3710674944) finish=304.3min speed=200910K/sec
      bitmap: 28/28 pages [112KB], 65536KB chunk

With increased resync limits:

# sysctl -w dev.raid.speed_limit_min=2146999999 ; sysctl -w dev.raid.speed_limit_max=2146999999
dev.raid.speed_limit_min = 2146999999
dev.raid.speed_limit_max = 2146999999

# cat /proc/mdstat 
Personalities : [raid10] 
md127 : active raid10 nvme3n1[3] nvme2n1[2] nvme1n1[1] nvme0n1[0]
      3710674944 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [===>.................]  resync = 19.9% (740172096/3710674944) finish=20.4min speed=2418848K/sec
      bitmap: 23/28 pages [92KB], 65536KB chunk

@lassizci force-pushed the raid-10-support branch 3 times, most recently from ab2e1d7 to 26731d7, on December 20, 2023
@cartermckinnon (Member) commented

My only concern here is the initial resync bit. The documentation implies that an initial resync is never really necessary, but also...is? I don't know enough about RAID to make a call one way or the other; but we need to be confident that the array will be healthy and performant in a production environment before we merge this.

@lassizci (Author) commented Jan 5, 2024

> My only concern here is the initial resync bit. The documentation implies that an initial resync is never really necessary, but also...is? I don't know enough about RAID to make a call one way or the other; but we need to be confident that the array will be healthy and performant in a production environment before we merge this.

RAID5 does need the initial resync per the documentation too, and it is not safe to skip it. I would scope the raid5 case out here, though.

So what would happen in the initial resync is going through each block on each disk of a mirror and writing a zero to ensure the disks are identical. Nothing more. This is actually even harmful for SSDs. The reason the initial resync isn't "strictly" needed is that if there's a write to the filesystem, the block is written to all the mirrored disks anyway, so there's no potential for inconsistent data on reads. Is there a performance concern if one disk is trimmed and one disk is not? I guess there could be, but we also have a documented[2] guarantee that the disks are fully trimmed.
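For what it's worth, md can also verify mirror consistency on demand via its sysfs scrub interface; a sketch, assuming the array is assembled as md127 as in the testing output above:

# trigger a scrub that compares the mirrored copies
echo check > /sys/block/md127/md/sync_action

# once sync_action reads "idle" again, zero here means the copies are identical
cat /sys/block/md127/md/mismatch_cnt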

Some other corner cases could come up in recovery situations where we have a disk that has data and we perhaps try a recovery. On the other hand, that's a situation we should never end up in with AWS instance storage.

Is there something more I could do to bring some confidence?

@cartermckinnon (Member) commented

> Some other corner cases could come up in recovery situations where we have a disk that has data and we perhaps try a recovery.

That's my concern; if someone is counting on this redundancy and it doesn't work when they need it to, that's pretty terrible.

That being said, this will likely always be opt-in. Can you add some notes to the doc that, for now, this option is "experimental" or "use at your own risk" until we have some feedback in the wild?

@lassizci (Author) commented

> That's my concern; if someone is counting on this redundancy and it doesn't work when they need it to, that's pretty terrible.

This doesn't affect redundancy, because there's no way to replace a single SSD on the instance. The whole instance will be replaced at once and then all the disks are clean ones. This just helps for graceful termination, so that workloads can be migrated away before the instance is replaced.
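To make that concrete, a degraded array can be detected and the node drained with something like the following (a sketch, not part of this PR; `mdadm --detail --test` sets a non-zero exit status for a degraded or failed array, and using the hostname as the node name is an assumption):

# exits 0 if the array is healthy, non-zero if degraded or failed
if ! mdadm --detail --test /dev/md/kubernetes >/dev/null 2>&1; then
    # migrate workloads away gracefully before the instance is replaced
    kubectl drain "$(hostname)" --ignore-daemonsets --delete-emptydir-data
fi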

> That being said, this will likely always be opt-in. Can you add some notes to the doc that, for now, this option is "experimental" or "use at your own risk" until we have some feedback in the wild?

Yes, makes sense, will do.

@cartermckinnon (Member) commented

> This doesn't affect redundancy, because there's no way to replace a single SSD on the instance. The whole instance will be replaced at once and then all the disks are clean ones. This just helps for graceful termination, so that workloads can be migrated away before the instance is replaced.

Ah, I see! Thanks.

I think we're good to go after the doc changes. 👍

@lassizci (Author) commented May 8, 2024

@cartermckinnon would we be happy with the doc changes mentioned above?

@cartermckinnon changed the base branch from master to main on May 8, 2024
@cartermckinnon (Member) commented

@lassizci

This LGTM. Can you rebase to address the conflicts? This PR got caught in the lurch of our branch rename.

Commit message:

This removes the wait block for raid resync for two reasons:
1) raid0 does not have redundancy and therefore no initial resync[1]
2) with raid10 the resync time for 4x 1.9TB disks takes from tens of minutes
   to multiple hours, depending on sysctl params `dev.raid.speed_limit_min` and
   `dev.raid.speed_limit_max` and the speed of the disks. Initial resync for
   raid10 is not strictly needed[1]

Filesystem creation: by default `mkfs.xfs` attempts to TRIM the drive. This is
also something that can take tens of minutes or hours, depending on the size of
drives. TRIM can be skipped, as instances are delivered with disks fully
trimmed[2].

[1] https://raid.wiki.kernel.org/index.php/Initial_Array_Creation
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html#InstanceStoreTrimSupport
@lassizci (Author) commented

@cartermckinnon now rebased

@ndbaker1 changed the title from "Add support for raid10 on instance storage" to "feature(local-disks): Add support for raid10 on instance storage" on Jul 16, 2024
@ndbaker1 added the enhancement label on Jul 16, 2024
@lassizci (Author) commented

Well… this is embarrassing, but there was an ASG depending on my branch with the old path before the repo restructuring, so I think I have to create a new pull request, as the PR branch can't be changed as far as I can tell. Sorry about that 😅

Opened a new one: #1900

@lassizci closed this on Jul 26, 2024