Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support raw block volume #2421

Closed
slaperche-scality opened this issue Apr 17, 2020 · 13 comments
Closed

Support raw block volume #2421

slaperche-scality opened this issue Apr 17, 2020 · 13 comments
Assignees
Labels
complexity:medium Something that requires one or few days to fix topic:storage Issues related to storage

Comments

@slaperche-scality
Copy link
Contributor

Component:

operator, salt

Why this is needed:

There are some specialized applications (e.g.: database, SDS system, …) that require direct access to a block device because, for example, the file system layer introduces unneeded overhead.

To address this use case, we should be able to manage Volume that will backed by PersistentVolume with volumeMode set to Block.

What should be done:

Currently, the volumeMode is hardcoded to Filesystem in the operator, this need to change.
In the same vein, the formatting step is always executed by the Salt state, this also need to be changed.

Implementation proposal:

We need a way to know if we want to format the volume or not.
This could be either at the Volume level or at the StorageClass level.

I suggest that we control that at the StorageClass level, because:

  • that's already the place where we configure the formatting, so it's consistent
  • if we put it at the Volume level, we will have to sanity check against illogical combination (a StorageClass with ext4 on a non-formatted Volume)
  • by setting it as the StorageClass level it could apply to both SparseLoopDevice and RawBlockDevice.

The idea is to accept a special value "none" for the parameter fsType in the StorageClass.
When fsType is set to "none" then:

  • PersistentVolume are created with volumeMode set to Block
  • Salt state skip the formatting.

We could have made the field optional instead of using a special value, but I think that it's better to be explicit here.

Test plan:

If we can add support for SparseLoopDevice (may be tricky because we can't store the ID in the FS in that case), then we can add tests with SparseLoopDevice: create a volume and check that the PV has the right mode and the backing storage is not formatted.

Otherwise manuel testing I guess (since CI doesn't attach "real" volume to the VM).

@slaperche-scality slaperche-scality added the topic:storage Issues related to storage label Apr 17, 2020
@slaperche-scality
Copy link
Contributor Author

Comments are welcome! @NicolasT @gdemonet @benoit-a

Or I can make a design section in the doc, like what I've done for volume v1 (will be volume v1.1 this time 😛 )

@gdemonet
Copy link
Contributor

Comments are welcome! @NicolasT @gdemonet @benoit-a

I don't have any strong opinion - the implementation proposed looks acceptable to me, especially since it's a feature supported by PersistentVolumes in the first place.

Or I can make a design section in the doc, like what I've done for volume v1 (will be volume v1.1 this time )

On this point, I think we need to adjust the Volume design document to match the other "Architecture documents" we have in the repo. We found that having a set of "KEP-style" docs was too heavy for our needs, and simply editing the content of such docs seems enough for us (rather than creating new docs to invalidate/supersede the initial design).

This format would need some standardization however, to make it simpler when external contributors want to suggest a new design or modifications to an existing one.

@slaperche-scality slaperche-scality added the complexity:medium Something that requires one or few days to fix label Apr 17, 2020
@slaperche-scality slaperche-scality self-assigned this Jun 4, 2020
slaperche-scality added a commit that referenced this issue Jun 4, 2020
slaperche-scality added a commit that referenced this issue Jun 4, 2020
Document the new possible value for fsType: none.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jun 4, 2020
If `fsType` is "none", we set the volume mode to Block.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jun 4, 2020
TODO support not sparse

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
@slaperche-scality
Copy link
Contributor Author

slaperche-scality commented Jun 8, 2020

Hum, in the end I'm no longer convinced that putting it into the StorageClass is the right approach…

Without formatting, it seems impossible to have a constant device path for a SparseLoopDevice (for RawBlockDevice, we can fallback on the UUID of the device itself, but that won't work for a sparse file…).

If this parameter becomes RawBlockDevice-only, then it's better to add it into the RawBlockDevice spec rather than using the StorageClass.

So I think I'll move back this new settings inside the Volume object itself (it will be optional with a default value that will keep the current behavior (use formatting), thus it shouldn't break the compat)


Another issue: if one enter /dev/sda as devicePath, when creating the PV I dont' want to use such path and should use a by-uuid/… path instead
=> how do I get the UUID of a given volume?

Until now if wasn't an issue because we were using the Volume UUID as FS UUID and everything worked fine: the operator can compute the path from the Volume alone.
But that's no longer the case…

I can add an extra Salt call during the Volume provisionning (using disk.blkid), but then where do I store this info between the moment I get it and the moment I create the PV?
The CR is the only persistence I have accross operator restart/reconciliation iteration.
I can:

  • patch devicePath, but this may surprise the user…
  • add a condition: could help monitoring the progress of the deployment, but conditions are more about "problem" than "progression"
  • something else?

Edit: Hum, after some thoughts I think I'll go for a hashmap (volume, storage uuid).
It will avoid packing more stuff into the CR status, and will be persistent accross reconciliation iteration.

If we're unlucky enough to crash/restart between the time we fetch the UUID and the creation of the PV, we only have to call disk.blkid again (it's call that doesn't modify anything, no problem if we call it multiple times) and resume the creation: it should be unlikely enough to be acceptable.

If this works well for UUID, I may use the same approach for disk.dump and revert my hackish implementation that pack stuff as string into the Salt Job Handle…

@gdemonet
Copy link
Contributor

gdemonet commented Jun 8, 2020

We could add more status fields to the CR for the Operator to use as state (incl. block device UUID, size...), no?
For SparseLoopDevice _Volume_s, if using the "raw block" option, a consumer could read the device path from status.devicePath.

@slaperche-scality
Copy link
Contributor Author

slaperche-scality commented Jun 8, 2020

We could add more status fields to the CR for the Operator to use as state (incl. block device UUID, size...), no?

In the first implementation of Volume (which never got merged, because I rewrote it according to review's feedback), we had a lot of "computed" state inside the status and I think that's not something we want.

For SparseLoopDevice _Volume_s, if using the "raw block" option, a consumer could read the device path from status.devicePath.

But you would need to refresh this info at some point, because after a reboot your sparse loop device may be loop3 instead of loop1. And then you probably need to update the PV as well, otherwise kubelet will mount the wrong device…
Sounds like a lot of work for a feature we're not supposed to use in prod.

@NicolasT
Copy link
Contributor

NicolasT commented Jun 8, 2020

  • I think choosing between block or file should be done at the Volume level, not in the StorageClass. Indeed, we specify formatting options in the StorageClass, which may be a slight stretch, but it kinda fits. Remember a StorageClass ought to define the 'class' of storage (performance characteristics, local/attached,...), not how this storage is consumed.

  • For the device identifiers: we currently use FS UUIDs for File Volumes which works great: it allows us to uniquely identify a single file volume (in /dev/disk/by-uuid/), ensure we (the automation) formatted it properly,... We want to keep the same benefits for block volumes. In general, a plain block device (unpartitioned, unformatted) doesn't have any associated UUID.

In a rawBlockDevice Volume that'll become a block PersistentVolume, we are passed the path to a device. This can either be a raw device (/dev/sda, /dev/mapper/vg-data,...) or a partition (/dev/sda1).

In case of a raw device, we could enforce this device to have a related World Wide Name, i.e., an entry pointing to it in /dev/disk/by-id/wwn-*, and use this as the device path in the generated PersistentVolume.
For LVM LVs, we can resolve the given 'device' to the entry in /dev/disk/by-id/dm-uuid-LVM-*, relying on LVMs UUID mechanism.
Finally, if we're passed a 'regular' partition, we could decide to not support MBR/BIOS disks (heck, this is 2020...) and only support GPT partitioning, in which case the partition has a GPT UUID, and the volume can use /dev/disk/by-partuuid/... to reference it (or have our automation change its GPT UUID?)

In the GPT case, we could even enforce the use of specific GPT partition types.

Now, when using sparseLoopDevice, we can

  • Create the sparse file
  • Create a GPT partition on it
  • Create a single GPT partition in it, setting the Volume UUID as its partition UUID
  • losetup --find --partscan ...
  • Use /dev/disk/by-partuuid in the PV

@slaperche-scality
Copy link
Contributor Author

Thanks for the feedback.

Let's put the selection between File/Block at the Volume level then (which make more sense indeed, same as volumeMode is at the PV level, there is some symmetry here 🙂 )


About the reliable device path, IIUC instead of using disk.blkid, I can write a custom Salt module that would resolve the devicePath of a volume, e.g:

  • /dev/sda returns /dev/disk/by-id/wwn-0x5001b444a4bbea26
  • /dev/dm-2 returns /dev/disk/by-id/dm-uuid-LVM-ALKa3ucymYV7FMBOAGWXpvsGdYz90e5cE3UzfO24GNUidBmOtPpCb47HqL6ekqxa
  • /dev/sda1 returns /dev/disk/by-partuuid/b8b36029-01

And we don't support MBR/BIOS partition.

(and I like the idea for SparseLoopDevice 👍 )

Is that right?


As to where to store device size/device path until we create the PV, I guess the volatile hashtable is a good enough solution, right?

There isn't much point in persisting those in the CR (even if it's in the status) as they are only needed between the time we fetch them and the time we create the PV (at which point they become persisted in the PV).

@NicolasT
Copy link
Contributor

NicolasT commented Jun 8, 2020

Let's put the selection between File/Block at the Volume level then (which make more sense indeed, same as volumeMode is at the PV level, there is some symmetry here )

Yup.

About the reliable device path, IIUC instead of using disk.blkid, I can write a custom Salt module that would resolve the devicePath of a volume, e.g:

  • /dev/sda returns /dev/disk/by-id/wwn-0x5001b444a4bbea26
  • /dev/dm-2 returns /dev/disk/by-id/dm-uuid-LVM-ALKa3ucymYV7FMBOAGWXpvsGdYz90e5cE3UzfO24GNUidBmOtPpCb47HqL6ekqxa
  • /dev/sda1 returns /dev/disk/by-partuuid/b8b36029-01

Well, the input is not necessarily /dev/dm-2, for example (I'd likely use /dev/mapper/vg-lv or /dev/vg/lv myself as input... The /dev/dm-* nodes are not really something a human would use, right?), so this modules would first need to resolve symlinks to an actual block device node, then figure out the relevant /dev/disk/by-id path of it.

And we don't support MBR/BIOS partition.

We need to validate this with TS.

(and I like the idea for SparseLoopDevice )

Yeah, having sparse support for block volumes is a must, for test and dev at least.

As to where to store device size/device path until we create the PV, I guess the volatile hashtable is a good enough solution, right?

There isn't much point in persisting those in the CR (even if it's in the status) as they are only needed between the time we fetch them and the time we create the PV (at which point they become persisted in the PV).

Indeed. One can't know at which exact point in time we look those up anyway.

@NicolasT
Copy link
Contributor

NicolasT commented Jun 8, 2020

(As an aside: even when a whole disk is to be used, I think a GPT table should be created on it, with a single partition spanning block 2048 to the end of the disk, and using a partuuid, instead of using the raw disk)

@slaperche-scality
Copy link
Contributor Author

slaperche-scality commented Jun 8, 2020

Well, the input is not necessarily /dev/dm-2, for example (I'd likely use /dev/mapper/vg-lv or /dev/vg/lv myself as input... The /dev/dm-* nodes are not really something a human would use, right?), so this modules would first need to resolve symlinks to an actual block device node, then figure out the relevant /dev/disk/by-id path of it.

Yup indeed, we also have this indirection to handle.

And we don't support MBR/BIOS partition.

We need to validate this with TS.

Will wait for a confirmation then.

(As an aside: even when a whole disk is to be used, I think a GPT table should be created on it, with a single partition spanning block 2048 to the end of the disk, and using a partuuid, instead of using the raw disk)

Yeah, we could do it that way.
WWW ID aren't reliable? Or it's just to limit the number different of cases?

@xaf-scality
Copy link
Contributor

For me, GPT is a good solution for that. We should not use boot drive => no need to support MBR/BIOS partition and in some case (PoC) if we need to simulate multiple drive on one drive LVM+GPT will do the trick.

@NicolasT
Copy link
Contributor

NicolasT commented Jun 9, 2020

WWW ID aren't reliable? Or it's just to limit the number different of cases?

WWNs are reliable (at least that's the idea). However, indeed, since not all cases can be covered by using WWNs, maybe using an alternative that covers all cases by construction is 'better'.

@slaperche-scality
Copy link
Contributor Author

So, to sum up the feedback we got (here and in the mail thread):

  • RawBlockDevice with a real disk => use GPT partition + a single partition (by-partuuid/<volume-uuid> as path)
  • RawBlockDevice with LVM => use LVM UUID (by-id/dm-uuid-LVM-… as path)
  • RawBlockDevice with a partition => require GPT (error out if not) and use partition UUID from GPT (by-partuuid/<partition-uuid> as path)
  • SparseLoopDevice => create a sparse file and create a GPT partition on it (by-partuuid/<volume-UUID> as path)

slaperche-scality added a commit that referenced this issue Jun 26, 2020
This options allows to skip the formatting of the device.
It's optional and will be set to false by default (i.e., format the
device) to preserve the backward compatibility.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jun 26, 2020
slaperche-scality added a commit that referenced this issue Jun 26, 2020
This will replace call to `disk.dump`.
By returning both size and path, we can keep a single Salt call for
unformatted volume (for which we will need to get a reliable path from
Salt).

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jun 26, 2020
Since we will now need more information than just the size, we replace
the call to `disk.dump` by a call to our function
`metalk8s_volumes.device_info`.

For now, we simply plug the new call, we don't use information other
than size yet.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jun 26, 2020
We now call `metalk8s_volumes.device_info` instead of `disk.dump`.
Unlike the later, this new function relies on the pillar content and we
have a problem here because calling a Salt module doesn't refresh the
pillar.

That being said, the state we call before (volumes.prepared) does
automatially refresh the pillar because it works fine and it does look
up the volume from the pillar.
But it seems this pillar isn't kept around, since calling the module
right after fails.
If we explicitely refresh the pillar during our state, then it's kept
around and our module can reuse the fetched pillar: that's why we add a
refresh_pillar call to our state.

Now, in the unlikely case that we have a crash/restart after the state
execution BUT before executing the module, we won't have access to a
cached pillar: in such case we have a last-chance try/except that
refresh the pillar from inside the module.

I'm not super happy with the current approach…, but I don't see a better
one:
- using a state is not possible (because I want to return data)
- calling `saltutil.refresh_pillar` from the operator will complicate
  the code a bit and won't even solve the "I have a crash/reboot/restart
  just before calling my module and thus no cached pillar!" situation.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jun 26, 2020
Instead of persisting device information into the status (which is
hackish), we now store them in a hashmap.
It will be good enough for most of the cases, and in the worst case (we
crash after getting the device info but before creating the PV), we can
simply reschedule the Salt call to fetch the device information.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jun 26, 2020
We no longer use it (since we're using a volatile cache in the
operator).
We still keep some compatibility with this format, in case we have
volume with serialized Result.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jun 29, 2020
This options allows to skip the formatting of the device.
It's optional and will be set to false by default (i.e., format the
device) to preserve the backward compatibility.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jun 29, 2020
slaperche-scality added a commit that referenced this issue Jul 3, 2020
We have to handle different cases here, the device path can points to:
- a LVM volume: we have nothing to do (we'll use LVM UUID)
- a disk partition (we make sure the partition's label is the volume UUID)
- an entire disk: we create a GPT table and a single partition with the
  right label.

Since the last case is the same as SparseLoopDeviceNoFormat, we factor
out the common code into `prepare_noformat`.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
slaperche-scality added a commit that referenced this issue Jul 3, 2020
It seems that, in some cases, you may have a small delay where the
partition is not visible under `/dev` after some operation (like
changing the label):

    [root@test-master-1 centos]# ls /dev | grep vdb
    vdb
    vdb1
    [root@test-master-1 centos]# parted /dev/vdb name 1 foo && ls /dev/ | grep vdb
    vdb
    [root@test-master-1 centos]# ls /dev | grep vdb
    vdb
    vdb1

Because of that, device name resolution may suffer transient failure.
So let's retry a few times before giving up.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
This makes Volume closer to the underlying PersistentVolume (which
simplify the code and also helps the user already familiar with
PersistentVolume).

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
Will be less confusing.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
In some case, this trigger an infinite reconciliation loop…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
Will be less confusing.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
In some case, this trigger an infinite reconciliation loop…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
This makes Volume closer to the underlying PersistentVolume (which
simplify the code and also helps the user already familiar with
PersistentVolume).

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
Will be less confusing.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
In some case, this trigger an infinite reconciliation loop…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
By using sgdisk we can change the partition GUID to match the volume
UUID. Haven't found how to do it with parted…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
In some case, this trigger an infinite reconciliation loop…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
By using sgdisk we can change the partition GUID to match the volume
UUID. Haven't found how to do it with parted…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
In some case, this trigger an infinite reconciliation loop…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
By using sgdisk we can change the partition GUID to match the volume
UUID. Haven't found how to do it with parted…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 3, 2020
In some case, this trigger an infinite reconciliation loop…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 6, 2020
This makes Volume closer to the underlying PersistentVolume (which
simplify the code and also helps the user already familiar with
PersistentVolume).

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 6, 2020
Will be less confusing.

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 6, 2020
By using sgdisk we can change the partition GUID to match the volume
UUID. Haven't found how to do it with parted…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 6, 2020
In some case, this trigger an infinite reconciliation loop…

Refs: #2421
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue Jul 8, 2020
@bert-e bert-e closed this as completed in 35d026c Jul 9, 2020
TeddyAndrieux added a commit that referenced this issue Jul 16, 2020
In `metalk8s_volumes` salt module `format` was renamed to `prepare`
in 5e071c9 but one was missing for
RawBlockDevice, the only impact was that we did not have the
`force=True` set when formatting the device

Refs: #2421
TeddyAndrieux added a commit that referenced this issue Jul 17, 2020
In #2421 `format` function of `metalk8s_volumes` salt module get renamed
to `prepare` so this commit reflect this change in the salt unit tests
for this module
TeddyAndrieux added a commit that referenced this issue Jul 17, 2020
In `metalk8s_volumes` salt module `format` was renamed to `prepare`
in 5e071c9 but one was missing for
RawBlockDevice, the only impact was that we did not have the
`force=True` set when formatting the device

Refs: #2421
TeddyAndrieux added a commit that referenced this issue Jul 17, 2020
In #2421 `format` function of `metalk8s_volumes` salt module get renamed
to `prepare` so this commit reflect this change in the salt unit tests
for this module
TeddyAndrieux added a commit that referenced this issue Jul 23, 2020
In `metalk8s_volumes` salt module `format` was renamed to `prepare`
in 5e071c9 but one was missing for
RawBlockDevice, the only impact was that we did not have the
`force=True` set when formatting the device

Refs: #2421
TeddyAndrieux added a commit that referenced this issue Jul 23, 2020
In #2421 `format` function of `metalk8s_volumes` salt module get renamed
to `prepare` so this commit reflect this change in the salt unit tests
for this module
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
complexity:medium Something that requires one or few days to fix topic:storage Issues related to storage
Projects
None yet
Development

No branches or pull requests

4 participants