feat: Add paths module to influxdb3_write #24579

mgattozzi · 2024-01-16T20:29:20Z

This commit introduces 3 new types in the paths module for the influxdb3_write crate. They are:

ParquetFilePath
CatalogFilePath
SegmentFilePath

Each of these corresponds to an object store path and also an on-disk path that we can use to address the needed files consistently, so that path construction does not have to be duplicated wherever these files are addressed.

These types also implement Deref/AsRef to the Path type, so they can be used wherever a Path is expected, such as various std::fs methods. It also lets us call methods like exists() without implementing them for each type, since they are thin wrappers around PathBuf.
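A minimal sketch of the thin-wrapper idea described above, using only std. The constructor signature, the `u32` segment id, the zero-padded file name, and the `is_wal_file` helper are assumptions for illustration, not the code in this PR.

```rust
use std::ops::Deref;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for one of the wrapper types: a newtype over `PathBuf`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SegmentFilePath(PathBuf);

impl SegmentFilePath {
    /// Builds `<prefix>/segments/<id>.wal`. The layout and padding are assumptions.
    pub fn new(prefix: impl Into<PathBuf>, segment_id: u32) -> Self {
        let mut path = prefix.into();
        path.push("segments");
        path.push(format!("{segment_id:010}.wal"));
        Self(path)
    }
}

/// `Deref` exposes every `Path` method (`exists()`, `file_name()`, ...) on the wrapper.
impl Deref for SegmentFilePath {
    type Target = Path;

    fn deref(&self) -> &Path {
        &self.0
    }
}

/// `AsRef<Path>` lets the wrapper be passed to APIs that take `impl AsRef<Path>`.
impl AsRef<Path> for SegmentFilePath {
    fn as_ref(&self) -> &Path {
        &self.0
    }
}

/// Hypothetical helper that accepts anything path-like, as many `std::fs` functions do.
pub fn is_wal_file(path: impl AsRef<Path>) -> bool {
    path.as_ref().extension().map_or(false, |ext| ext == "wal")
}
```

With this in place, a value such as `SegmentFilePath::new("wal_dir", 7)` can be handed to `is_wal_file` or `std::fs::File::open` directly, and `Path` methods resolve through `Deref`.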

This commit adds some tests to make sure that the path construction works as intended and also updates the wal.rs file to use the new SegmentFilePath instead of just a PathBuf. Currently it assumes that the prefix is just the dir handed to it.

Closes: #24578

Note: This is just the first step, as I'll need to start serializing these files to disk as part of the persister and eventually to object store.

pauldix · 2024-01-16T20:47:11Z

One quick comment on this before I go to review is that there should be 4 different kinds of files:

Catalog file (in object storage)
Parquet file (in object storage)
SegmentInfo file (in object storage) (this is where we keep summary of all parquet files persisted for a given segment)
SegmentWal file (on local disk) (this is the individual write ahead log batches, it's different from the SegmentInfo file that gets persisted to object store)

I'm not sure if that changes how you'd structure this, but we should get all 4 represented.

mgattozzi · 2024-01-16T21:09:40Z

@pauldix I changed it up a bit so that the wal works the same way again and addresses local disk, while all the others create an Obj Store compatible path. The wal uses the new type, but I would be fine just reverting that part if it doesn't feel right.

pauldix · 2024-01-16T20:48:31Z

influxdb3_write/src/paths.rs

+const PARQUET_FILE_EXTENSION: &str = "parquet";
+
+/// File extension for segment files
+const SEGMENT_FILE_EXTENSION: &str = "wal";


maybe this should be SEGMENT_WAL_FILE_EXTENSION to differentiate. Then we'd have SEGMENT_INFO_FILE_EXTENSION, which would be json.

pauldix · 2024-01-16T20:50:41Z

influxdb3_write/src/paths.rs

+impl SegmentFilePath {
+    pub fn new(prefix: impl Into<PathBuf>, segment_id: SegmentId) -> Self {
+        let mut path = prefix.into();
+        path.push("segments");


The wal files don't need to have this directory since the only thing in the wal directory will be segment files.

pauldix · 2024-01-16T21:39:18Z

influxdb3_write/src/paths.rs

+pub struct CatalogFilePath(PathBuf);
+
+impl CatalogFilePath {
+    pub fn new(prefix: impl Into<PathBuf>, sequence_number: u64) -> Self {


sequence_number should be a u32. No need for a u64 here and limiting the size ensures that 10 digit padding on the string conversion won't break for any value that is valid.
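A small sketch of why the width matters, assuming two hypothetical helpers that apply the same `:010` padding to each integer type. The `010` in the format spec is a minimum width, so a `u64` can overflow it and produce longer names, which also breaks lexicographic ordering.

```rust
/// Zero-pads a `u32` to 10 digits. `u32::MAX` is 4294967295, which is exactly 10 digits,
/// so every valid value yields a name of the same length.
pub fn padded_u32(n: u32) -> String {
    format!("{n:010}")
}

/// Same format spec applied to a `u64`. `:010` is a minimum width, not a truncation,
/// so values above 9_999_999_999 produce names longer than 10 characters.
pub fn padded_u64(n: u64) -> String {
    format!("{n:010}")
}
```

For example, `padded_u64(10_000_000_000)` is 11 characters long and sorts before `padded_u64(9_999_999_999)` as a string, even though it is numerically larger.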

pauldix · 2024-01-16T21:42:46Z

influxdb3_write/src/paths.rs

+    pub fn new(prefix: impl Into<PathBuf>, sequence_number: u64) -> Self {
+        let mut path = prefix.into();
+        path.push("catalogs");
+        path.push(format!("{sequence_number:010}"));


This isn't going to work as a naming convention for the Catalog or SegmentInfo file names. We need file names such that when ordered lexicographically, the highest numbered ones will be returned first. Changing the file to be named format!("{:010}", u32::MAX - sequence_number) would achieve this. Same applies for SegmentInfo files.

For wal files we don't need this naming convention, but we could follow it to keep things consistent. It would also become handy if we ever want to copy the wal segments into object storage for access by other systems.

I opted out of this for wal files since they don't really need to be human readable, but for the other types I did the u32::MAX trick so that they show up in order.
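The u32::MAX trick discussed above can be sketched as follows. The function names are hypothetical; only the `u32::MAX - sequence_number` subtraction and the 10-digit padding come from the review comment.

```rust
/// Names a file so that higher sequence numbers sort first lexicographically:
/// the number is subtracted from `u32::MAX` and zero-padded to 10 digits.
pub fn inverted_name(sequence_number: u32) -> String {
    format!("{:010}", u32::MAX - sequence_number)
}

/// Sorts the generated names the way a lexicographic listing would return them.
pub fn newest_first(sequence_numbers: &[u32]) -> Vec<String> {
    let mut names: Vec<String> = sequence_numbers
        .iter()
        .map(|&n| inverted_name(n))
        .collect();
    names.sort();
    names
}
```

Because every name has the same length, plain string ordering matches numeric ordering of the inverted values, so a listing in ascending order yields the most recent sequence number first.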

pauldix · 2024-01-16T21:43:47Z

influxdb3_write/src/paths.rs

+        prefix: impl Into<PathBuf>,
+        db_name: &str,
+        table_name: &str,
+        year: u16,


I think it would be better to have this be a time for the parquet file's min_time and then convert that to a string, rather than taking three separate arguments.

pauldix · 2024-01-16T22:03:59Z

influxdb3_write/src/paths.rs

+        path.push("dbs");
+        path.push(db_name);
+        path.push(table_name);
+        path.push(format!("{year}-{month:02}-{day:02}"));


If we use min_time as an argument, it's probably better to use a strftime-equivalent format string here. I think we'll need to use Chrono for this?

These days both it and time are maintained, but to keep things consistent I'll use chrono.

mgattozzi · 2024-01-17T19:07:30Z

@pauldix added in what you asked for!

pauldix · 2024-01-17T22:47:54Z

influxdb3_write/src/paths.rs

+    pub fn new(segment_id: SegmentId) -> Self {
+        let path = ObjPath::from(format!(
+            "catalogs/{:010}.{}",
+            u32::MAX - segment_id.0,


since this is a pattern that shows up elsewhere, maybe refactor into a method on SegmentId? object_store_file_stem?

Implemented in d1fbbe8
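A sketch of what such a refactor might look like. The method name follows the suggestion above; the `SegmentId(u32)` definition, the `catalog_path` helper, and passing the extension as a parameter are assumptions, and the code in d1fbbe8 may differ.

```rust
/// Hypothetical newtype matching the `segment_id.0` access in the snippet under review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SegmentId(pub u32);

impl SegmentId {
    /// File stem that sorts newest-first: `u32::MAX - id`, zero-padded to 10 digits.
    pub fn object_store_file_stem(&self) -> String {
        format!("{:010}", u32::MAX - self.0)
    }
}

/// Hypothetical helper that reuses the stem; the extension is supplied by the caller.
pub fn catalog_path(segment_id: SegmentId, extension: &str) -> String {
    format!("catalogs/{}.{extension}", segment_id.object_store_file_stem())
}
```

Keeping the subtraction and padding in one method means each path type only has to supply its directory and extension.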

pauldix · 2024-01-17T22:49:32Z

influxdb3_write/src/paths.rs

+impl ParquetFilePath {
+    pub fn new(db_name: &str, table_name: &str, date: DateTime<Utc>, file_number: u32) -> Self {
+        let path = ObjPath::from(format!(
+            "dbs/{db_name}/{table_name}/{}/{:010}.{}",


db_name and table_name need to be converted into an object-store-safe character set

Implemented in d1fbbe8

Ah, so I think what you're doing here is attempting to parse the name and returning an error if it isn't valid. That's not quite what we want. We want the db_name and table_name strings to be properly escaped so that they use valid object store names. Thus the properly escaped names would be used and the ObjPath::parse of the resulting name would be expected to never error.

I think you'd use: https://docs.rs/object_store/latest/object_store/path/struct.Path.html#encode

Ah if you look at what you linked then I could just use from and it should work the same.
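To illustrate the difference between validating a name and escaping it, here is a std-only sketch of percent-encoding a single path segment. This is not the object_store crate's implementation: the `encode_segment` name and the set of characters left unescaped are assumptions, and the crate's own rules may be more permissive.

```rust
/// Percent-encodes every byte outside `[A-Za-z0-9._-]` so that the result is
/// always a usable path segment, instead of rejecting names with other characters.
pub fn encode_segment(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for byte in name.bytes() {
        match byte {
            // Characters assumed safe in this sketch are copied through unchanged.
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'.' | b'_' | b'-' => {
                out.push(byte as char)
            }
            // Everything else becomes `%` followed by two uppercase hex digits.
            other => out.push_str(&format!("%{other:02X}")),
        }
    }
    out
}
```

With escaping, a db_name such as `my db` becomes `my%20db` and path construction never has to return an error for a user-supplied name.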

pauldix

LGTM! 🚢

mgattozzi · 2024-01-19T15:39:49Z

I switched it back to from and added a test to make sure it was percent encoded like the docs said. After tests pass I'll merge.

mgattozzi requested a review from pauldix January 16, 2024 20:29

mgattozzi added 2 commits January 16, 2024 16:04

fix: clippy and add SegmentInfoFilePath

4e0b907

fix: Have SegmentWalPath be on disk not Obj Store

c973beb

pauldix requested changes Jan 16, 2024

fix: Address comments

c0d2a6b

mgattozzi requested a review from pauldix January 17, 2024 19:07

fix: Make paths work for ObjectStore

0c2b9e7

pauldix reviewed Jan 17, 2024

mgattozzi added 2 commits January 18, 2024 15:11

fix: Update deps for h2 advisory

6eeaf9b

feat: Add validation for ParquetFilePath

d1fbbe8

mgattozzi requested a review from pauldix January 18, 2024 20:17

pauldix approved these changes Jan 18, 2024

fix: Change ParquetFilePath to use from, add test

bb9c479

mgattozzi merged commit e13cc47 into main Jan 19, 2024
12 checks passed

mgattozzi deleted the mgattozzi/persister/paths branch January 19, 2024 15:58

mgattozzi mentioned this pull request Jan 22, 2024

feat: Implement Persister for PersisterImpl #24588

Merged
