Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add paths module to influxdb3_write #24579

Merged
merged 8 commits into from
Jan 19, 2024
Merged

Conversation

mgattozzi
Copy link
Contributor

This commit introduces 3 new types in the paths module for the influxdb3_write crate. They are:

  • ParquetFilePath
  • CatalogFilePath
  • SegmentFilePath

Each of these corresponds to an object store path and also on disk path that we can use to address the needed files in a consistent way and not need to have path construction be duplicated to address these files.

These types also Deref/AsRef to the Path type so that they can be used in places that expect the type such as various std::fs methods and so that we can use methods like exist() without needing to implement them for each type as they are thin wrappers around PathBuf.

This commit adds some tests to make sure that the path construction works as intended and also updates the wal.rs file to use the new SegmentFilePath instead of just a PathBuf. Currently it assumes that the prefix is just the dir handed to it.

Closes: #24578

Note: This is just the first step as I'll need to start serializing these files to disk as part of the persistor and eventually to object store.

This commit introduces 3 new types in the paths module for the
influxdb3_write crate. They are:

- ParquetFilePath
- CatalogFilePath
- SegmentFilePath

Each of these corresponds to an object store path and also on disk path
that we can use to address the needed files in a consistent way and not
need to have path construction be duplicated to address these files.

These types also Deref/AsRef to the Path type so that they can be used
in places that expect the type such as various std::fs methods and so
that we can use methods like `exist()` without needing to implement them
for each type as they are thin wrappers around PathBuf.

This commit adds some tests to make sure that the path construction
works as intended and also updates the `wal.rs` file to use the new
`SegmentFilePath` instead of just a `PathBuf`. Currently it assumes that
the prefix is just the dir handed to it.

Closes: #24578
@mgattozzi mgattozzi requested a review from pauldix January 16, 2024 20:29
@pauldix
Copy link
Member

pauldix commented Jan 16, 2024

One quick comment on this before I go to review is that there should be 4 different kinds of files:

  • Catalog file (in object storage)
  • Parquet file (in object storage)
  • SegmentInfo file (in object storage) (this is where we keep summary of all parquet files persisted for a given segment)
  • SegmentWal file (on local disk) (this is the individual write ahead log batches, it's different from the SegmentInfo file that gets persisted to object store)

I'm not sure if that changes how you'd structure this, but we should get all 4 represented.

@mgattozzi
Copy link
Contributor Author

@pauldix I changed it up a bit so that the wal works the same again (using the new type, but I would be fine just reverting that part if it doesn't feel right) and addresses local disk while all the others will create an Obj Store compatible path

const PARQUET_FILE_EXTENSION: &str = "parquet";

/// File extension for segment files
const SEGMENT_FILE_EXTENSION: &str = "wal";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should be SEGMENT_WAL_FILE_EXTENSION to differentiate. Then we'd have SEGMENT_INFO_FILE_EXTENSION, which would be json.

impl SegmentFilePath {
pub fn new(prefix: impl Into<PathBuf>, segment_id: SegmentId) -> Self {
let mut path = prefix.into();
path.push("segments");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wal files don't need to have this directory since the only thing in the wal directory will be segment files.

pub struct CatalogFilePath(PathBuf);

impl CatalogFilePath {
pub fn new(prefix: impl Into<PathBuf>, sequence_number: u64) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sequence_number should be a u32. No need for a u64 here and limiting the size ensures that 10 digit padding on the string conversion won't break for any value that is valid.

pub fn new(prefix: impl Into<PathBuf>, sequence_number: u64) -> Self {
let mut path = prefix.into();
path.push("catalogs");
path.push(format!("{sequence_number:010}"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't going to work as a naming convention for the Catalog or SegmentInfo file names. We need file names such that when ordered lexicographically, the highest numbered ones will be returned first. Changing the file to be named format!("{(u32::MAX - sequence_number):010}") would achieve this. Same applies for SegmentInfo files.

For wal files we don't need this naming convention, but we could follow it to keep things consistent. It would also become handy if we ever want to copy the wal segments into object storage for access by other systems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted out of this for wal files since they don't really need to be human readable, but for the other types I did the u32::MAX trick so that they show up in order.

prefix: impl Into<PathBuf>,
db_name: &str,
table_name: &str,
year: u16,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to have this be a time for the parquet file's min_time and then convert that to a string, rather than taking three separate arguments.

path.push("dbs");
path.push(db_name);
path.push(table_name);
path.push(format!("{year}-{month:02}-{day:02}"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use min_time as an argument, probably better to use a strtime equivalent format string here. Think we'll need to use Chrono for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These days it or time are both maintained now, but I think to keep it consistent I'll use chrono

@mgattozzi mgattozzi requested a review from pauldix January 17, 2024 19:07
@mgattozzi
Copy link
Contributor Author

@pauldix added in what you asked for!

pub fn new(segment_id: SegmentId) -> Self {
let path = ObjPath::from(format!(
"catalogs/{:010}.{}",
u32::MAX - segment_id.0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is a pattern that shows up elsewhere, maybe refactor into a method on SegmentId? object_store_file_stem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in d1fbbe8

impl ParquetFilePath {
pub fn new(db_name: &str, table_name: &str, date: DateTime<Utc>, file_number: u32) -> Self {
let path = ObjPath::from(format!(
"dbs/{db_name}/{table_name}/{}/{:010}.{}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

db_name and table_name need to be converted into object store safe character set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in d1fbbe8

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so I think what you're doing here is attempting to parse the name and returning an error if it isn't valid. That's not quite what we want. We want the db_name and table_name strings to be properly escaped so that they use valid object store names. Thus the properly escaped names would be used and the ObjPath::parse of the resulting name would be expected to never error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah if you look at what you linked then I could just use from and it should work the same.

@mgattozzi mgattozzi requested a review from pauldix January 18, 2024 20:17
Copy link
Member

@pauldix pauldix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚢

@mgattozzi
Copy link
Contributor Author

I switched it back to from and added a test to make sure it was percent encoded like the docs said. After tests pass I'll merge.

@mgattozzi mgattozzi merged commit e13cc47 into main Jan 19, 2024
12 checks passed
@mgattozzi mgattozzi deleted the mgattozzi/persister/paths branch January 19, 2024 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create ObjectStorage Paths for InfluxDB Edge
2 participants