-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add paths module to influxdb3_write #24579
Conversation
This commit introduces 3 new types in the paths module for the influxdb3_write crate. They are: - ParquetFilePath - CatalogFilePath - SegmentFilePath Each of these corresponds to an object store path and also on disk path that we can use to address the needed files in a consistent way and not need to have path construction be duplicated to address these files. These types also Deref/AsRef to the Path type so that they can be used in places that expect the type such as various std::fs methods and so that we can use methods like `exist()` without needing to implement them for each type as they are thin wrappers around PathBuf. This commit adds some tests to make sure that the path construction works as intended and also updates the `wal.rs` file to use the new `SegmentFilePath` instead of just a `PathBuf`. Currently it assumes that the prefix is just the dir handed to it. Closes: #24578
One quick comment on this before I go to review is that there should be 4 different kinds of files:
I'm not sure if that changes how you'd structure this, but we should get all 4 represented. |
@pauldix I changed it up a bit so that the wal works the same again (using the new type, but I would be fine just reverting that part if it doesn't feel right) and addresses local disk while all the others will create an Obj Store compatible path |
influxdb3_write/src/paths.rs
Outdated
const PARQUET_FILE_EXTENSION: &str = "parquet"; | ||
|
||
/// File extension for segment files | ||
const SEGMENT_FILE_EXTENSION: &str = "wal"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this should be SEGMENT_WAL_FILE_EXTENSION
to differentiate. Then we'd have SEGMENT_INFO_FILE_EXTENSION
, which would be json
.
influxdb3_write/src/paths.rs
Outdated
impl SegmentFilePath { | ||
pub fn new(prefix: impl Into<PathBuf>, segment_id: SegmentId) -> Self { | ||
let mut path = prefix.into(); | ||
path.push("segments"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wal files don't need to have this directory since the only thing in the wal directory will be segment files.
influxdb3_write/src/paths.rs
Outdated
pub struct CatalogFilePath(PathBuf); | ||
|
||
impl CatalogFilePath { | ||
pub fn new(prefix: impl Into<PathBuf>, sequence_number: u64) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sequence_number should be a u32. No need for a u64 here and limiting the size ensures that 10 digit padding on the string conversion won't break for any value that is valid.
influxdb3_write/src/paths.rs
Outdated
pub fn new(prefix: impl Into<PathBuf>, sequence_number: u64) -> Self { | ||
let mut path = prefix.into(); | ||
path.push("catalogs"); | ||
path.push(format!("{sequence_number:010}")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't going to work as a naming convention for the Catalog or SegmentInfo file names. We need file names such that when ordered lexicographically, the highest numbered ones will be returned first. Changing the file to be named format!("{(u32::MAX - sequence_number):010}")
would achieve this. Same applies for SegmentInfo
files.
For wal files we don't need this naming convention, but we could follow it to keep things consistent. It would also become handy if we ever want to copy the wal segments into object storage for access by other systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opted out of this for wal files since they don't really need to be human readable, but for the other types I did the u32::MAX
trick so that they show up in order.
influxdb3_write/src/paths.rs
Outdated
prefix: impl Into<PathBuf>, | ||
db_name: &str, | ||
table_name: &str, | ||
year: u16, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to have this be a time
for the parquet file's min_time and then convert that to a string, rather than taking three separate arguments.
influxdb3_write/src/paths.rs
Outdated
path.push("dbs"); | ||
path.push(db_name); | ||
path.push(table_name); | ||
path.push(format!("{year}-{month:02}-{day:02}")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use min_time
as an argument, probably better to use a strtime
equivalent format string here. Think we'll need to use Chrono for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These days it or time are both maintained now, but I think to keep it consistent I'll use chrono
@pauldix added in what you asked for! |
influxdb3_write/src/paths.rs
Outdated
pub fn new(segment_id: SegmentId) -> Self { | ||
let path = ObjPath::from(format!( | ||
"catalogs/{:010}.{}", | ||
u32::MAX - segment_id.0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is a pattern that shows up elsewhere, maybe refactor into a method on SegmentId? object_store_file_stem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented in d1fbbe8
impl ParquetFilePath { | ||
pub fn new(db_name: &str, table_name: &str, date: DateTime<Utc>, file_number: u32) -> Self { | ||
let path = ObjPath::from(format!( | ||
"dbs/{db_name}/{table_name}/{}/{:010}.{}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
db_name
and table_name
need to be converted into object store safe character set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented in d1fbbe8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so I think what you're doing here is attempting to parse the name and returning an error if it isn't valid. That's not quite what we want. We want the db_name
and table_name
strings to be properly escaped so that they use valid object store names. Thus the properly escaped names would be used and the ObjPath::parse
of the resulting name would be expected to never error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah if you look at what you linked then I could just use from
and it should work the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🚢
I switched it back to from and added a test to make sure it was percent encoded like the docs said. After tests pass I'll merge. |
This commit introduces 3 new types in the paths module for the influxdb3_write crate. They are:
Each of these corresponds to an object store path and also on disk path that we can use to address the needed files in a consistent way and not need to have path construction be duplicated to address these files.
These types also Deref/AsRef to the Path type so that they can be used in places that expect the type such as various std::fs methods and so that we can use methods like
exist()
without needing to implement them for each type as they are thin wrappers around PathBuf.This commit adds some tests to make sure that the path construction works as intended and also updates the
wal.rs
file to use the newSegmentFilePath
instead of just aPathBuf
. Currently it assumes that the prefix is just the dir handed to it.Closes: #24578
Note: This is just the first step as I'll need to start serializing these files to disk as part of the persistor and eventually to object store.