Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add class RestProfile for local maintenance of users' REST parameters. #5463

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

bekadavis9
Copy link
Contributor

@bekadavis9 bekadavis9 commented Feb 19, 2025

Add class RestProfile for local maintenance of users' REST parameters.

class RestProfile is the first step toward configuration unification. This internal API stores a user's REST parameters after they invoke TileDB-Cloud-Py::login upstream. Once a user has created a RestProfile and set their desired parameters, RestProfile::save() will serialize the entire profile onto the local, on-disk file, $HOME/.tiledb/profiles.json.

Previously, the upstream API would save users' REST credentials to a different local file, $HOME/.tiledb/cloud.json. To maintain backward compatibility, RestProfile will search for that file and inherit the parameters within, saving them to the new local file, $HOME/.tiledb/profiles.json. Going forward, we expect users to interact only with the latter of the local files. The TileDB-Cloud-Py API will be updated to reflect this change.

A RestProfile default-initializes its parameter-value mapping, and is therefore complete for saving upon creation. Attempting to create multiple profiles of the same name will overwrite the local file with the latest data. Neither local file should be changed by the unit tests, so the test wrapper uses a TemporaryLocalDirectory as a pseudo-$HOME. As such, the local files are written to /tmp during the test and removed at the end of the TEST_CASE.

A followup PR will make use of this class in class Config.

[sc-62705]


TYPE: NO_HISTORY
DESC: Add class RestProfile for local maintenance of users' REST parameters.

void save();

/** Removes this profile from the local file. */
void remove();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer it this way.

Suggested change
void remove();
static void remove(const std::string& name);

Copy link
Member

@ypatia ypatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments to begin with, I am reviewing actively so I'll be probably posting more

Copy link
Member

@ypatia ypatia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a concern around atomicity that we'd need to re-work. But keep me honest if I have misunderstood something.

if (std::filesystem::exists(filepath_)) {
// Temporarily append filename with a random label to guarantee atomicity.
filepath_ = filepath_ + random_label();
if (std::rename(original_filepath.c_str(), filepath_.c_str()) != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a safe way to guarantee atomicity. What if we rename and then another process finds there is no file and recreates it? Then its racy, and we'll end up with partial data as well.
Couldn't we just use some mutex for loading, saving and removing the file? Also I think this is a write after read case, and loading and saving should happen in the same atomic block for guaranteeing correctness.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to synchronize across processes, we would need a named mutex, which is not available in the C++ standard library.

I'm not sure atomicity is important to have; modifying profiles is not a frequent operation either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I read that we cared about it in the design doc @bekadavis9 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few bits to untangle here, but the gist of it is that we're renaming backwards. I shall resist the urge to go too far down the rabbit hole but I will try and explain a bit to make sure we're all on the same page. Also, I'm gonna have to rely on @teo-tsirpanis et al to figure out how this applies to Windows APIs. I'm vaguely remember that there's an equivalent, but my Windows knowledge is slim to none.

First things first, we do want to be using rename here, but not quite like what exists in the PR. Updating the profile.json should always follow this process:

  1. Create a temporary file. Adding a random label is perfect here (though I'd probably add a . separator for aesthetic reasons)
  2. Write new contents to temporary file:
    a. open file
    b. write json to file
    c. fsync file (yes really)
    d. close file
  3. Rename temporary file to the expected filename

The reason this works is because rename is guaranteed to be "atomic" which means that the inode pointed at by the hard link is swapped from one to the other. This notably doesn't mean "guaranteed" or "orderable" or anything beyond the very narrow of "it either happened or didn't, the contents of the destination file are either unchanged or exactly the contents of the temp file".

This means a few things:

  1. The profiles.json file will either exist or it won't. It'll never exist in a partially written state which gives resilience to things like a segfault or someone tripping on the power cord while the profile is being updated.
  2. If someone does have multiple processes updating this file we can still guarantee consistency of the file, which notably is not the same as guaranteeing the contents of the file. That's the best we can do though and it's a pattern that works fine for these sorts of things given the frequency of updates. Someone finding issue with this is most definitely Doing it Wrong®.
  3. That's pretty much it.

Some concrete suggestions for tidying up the PR:

  1. I'd create two simple methods that are "save_file" and "load_file". All I/O happens in these two methods and nowhere else.
  2. All of the "add/update/remove" logic is external to file handling. I noticed a few different comments around that logic which I haven't got much opinion on other than try and be internally consistent at least.
  3. That... is also pretty much it.

Also, I should probably note for anyone wondering, this approach to file handling also avoids issues on reads. And if you're wondering how, its because files don't have to be linked on disk to be readable or writable. This means you only have two states to worry about:

  1. The file doesn't exist. Fine, return an empty object.
  2. The file does exist, yay we got a reference to its inode.

Notably, step 2 doesn't care if that inode is replaced while its reading the file since the OS guarantees that the file will be readable (If memory serves Windows diverges here a bit in that you can't delete a file while its being read, absolutely no clue how that plays out here).

And one last note since I've already written this much: The bottom line here is to avoid corrupting that profile.json file where "not corrupt" means "if it exists, it contains a valid JSON object". We can't make guarantees about the contents of that file under concurrent writes, but we can at least guarantee that its valid. If I were to write a test for this, I'd spawn N threads that all do random mutations to this file for some largish number of iterations (i.e., ~1000 * N seems reasonable). Then another thread is spun up and all it does is attempt to repeatedly call load_file or w/e as fast as possible ensuring that either the file doesn't exist (only at the very beginning of the test, once it exists, it should always exist until the user deletes it) or contains valid JSON.

Thank you for coming to my Ted Talk.

// If a profile of the given name already exists, remove it.
if (data.contains(name_)) {
data.erase(name_);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the behavior we concluded we want? If we concluded that profiles should be immutable, I'd suggest that we just throw in this case, with an informative error that the profile with this name already exists, and if the user wish to replace it then he should first explicitly ask to remove it.

@kounelisagis kounelisagis self-requested a review February 27, 2025 00:48
Copy link
Contributor

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes around file update semantics.

if (std::filesystem::exists(filepath_)) {
// Temporarily append filename with a random label to guarantee atomicity.
filepath_ = filepath_ + random_label();
if (std::rename(original_filepath.c_str(), filepath_.c_str()) != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few bits to untangle here, but the gist of it is that we're renaming backwards. I shall resist the urge to go too far down the rabbit hole but I will try and explain a bit to make sure we're all on the same page. Also, I'm gonna have to rely on @teo-tsirpanis et al to figure out how this applies to Windows APIs. I'm vaguely remember that there's an equivalent, but my Windows knowledge is slim to none.

First things first, we do want to be using rename here, but not quite like what exists in the PR. Updating the profile.json should always follow this process:

  1. Create a temporary file. Adding a random label is perfect here (though I'd probably add a . separator for aesthetic reasons)
  2. Write new contents to temporary file:
    a. open file
    b. write json to file
    c. fsync file (yes really)
    d. close file
  3. Rename temporary file to the expected filename

The reason this works is because rename is guaranteed to be "atomic" which means that the inode pointed at by the hard link is swapped from one to the other. This notably doesn't mean "guaranteed" or "orderable" or anything beyond the very narrow of "it either happened or didn't, the contents of the destination file are either unchanged or exactly the contents of the temp file".

This means a few things:

  1. The profiles.json file will either exist or it won't. It'll never exist in a partially written state which gives resilience to things like a segfault or someone tripping on the power cord while the profile is being updated.
  2. If someone does have multiple processes updating this file we can still guarantee consistency of the file, which notably is not the same as guaranteeing the contents of the file. That's the best we can do though and it's a pattern that works fine for these sorts of things given the frequency of updates. Someone finding issue with this is most definitely Doing it Wrong®.
  3. That's pretty much it.

Some concrete suggestions for tidying up the PR:

  1. I'd create two simple methods that are "save_file" and "load_file". All I/O happens in these two methods and nowhere else.
  2. All of the "add/update/remove" logic is external to file handling. I noticed a few different comments around that logic which I haven't got much opinion on other than try and be internally consistent at least.
  3. That... is also pretty much it.

Also, I should probably note for anyone wondering, this approach to file handling also avoids issues on reads. And if you're wondering how, its because files don't have to be linked on disk to be readable or writable. This means you only have two states to worry about:

  1. The file doesn't exist. Fine, return an empty object.
  2. The file does exist, yay we got a reference to its inode.

Notably, step 2 doesn't care if that inode is replaced while its reading the file since the OS guarantees that the file will be readable (If memory serves Windows diverges here a bit in that you can't delete a file while its being read, absolutely no clue how that plays out here).

And one last note since I've already written this much: The bottom line here is to avoid corrupting that profile.json file where "not corrupt" means "if it exists, it contains a valid JSON object". We can't make guarantees about the contents of that file under concurrent writes, but we can at least guarantee that its valid. If I were to write a test for this, I'd spawn N threads that all do random mutations to this file for some largish number of iterations (i.e., ~1000 * N seems reasonable). Then another thread is spun up and all it does is attempt to repeatedly call load_file or w/e as fast as possible ensuring that either the file doesn't exist (only at the very beginning of the test, once it exists, it should always exist until the user deletes it) or contains valid JSON.

Thank you for coming to my Ted Talk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants