-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: binary search tree for attributes #9
Conversation
WalkthroughThis pull request integrates comprehensive attribute indexing capabilities across multiple components. A new Changes
Sequence Diagram(s)sequenceDiagram
participant U as User
participant C as CLI (Ser Command)
participant FW as FcbWriter
participant HW as HeaderWriter
participant S as Serializer
U->>C: Invoke serialize with attr_index option
C->>FW: Pass input file, output file, and attr_index list
FW->>HW: Build header_options (including attribute_indices)
HW->>S: Call to_fcb_header with attribute_indices_info
S->>HW: Return header with AttributeIndex info
HW->>FW: Provide header buffer
FW->>FW: Write attribute index buffer & sorted feature buffer
FW->>U: Output serialized FCB file
sequenceDiagram
participant U as User
participant FR as FcbReader
participant MQ as MultiIndex
participant Q as Query Engine
U->>FR: Request attribute query with conditions
FR->>FR: Read header with attribute_index information
FR->>MQ: Build MultiIndex from attribute index entries
MQ->>Q: Process Query with conditions (equality, range, etc.)
Q->>MQ: Return matching offsets
MQ->>FR: Provide query result (FeatureIter)
FR->>U: Deliver matching features
Poem
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Here's the code health analysis summary for commits Analysis Summary
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
🧹 Nitpick comments (37)
src/rust/fcb_core/tests/read.rs (4)
16-22
: Consider more explicit error context for file reading.
Currently, a simple "?" error propagation is used (adequate in a test), but for clarity, you could add more context to the error message (e.g., which file failed).
24-33
: Consider using named constants for repeated attribute identifiers.
The hard-coded strings (e.g., "b3_h_dak_50p") might be better extracted as constants or configuration values for maintainability.
34-44
: Document or make the node size configurable.
The choice of node_size=16 may warrant a comment explaining the rationale or a configuration option for different datasets.
46-48
: Potential performance enhancement for feature insertion.
If there's a large number of features, consider adding a bulk insertion method to minimize repeated calls.src/rust/bst/src/query.rs (5)
5-14
: Consider extending operator support.
The current operators cover basic comparisons. You may later introduce “Between” or “In” for more advanced querying.
16-25
: Typed keys could improve clarity.
Storing keys as bytes is flexible, but type-safe keys can reduce parse runtime errors and clarify usage.
27-31
: Potential for OR/NOT conditions.
Right now, conditions are AND-combined. If you need OR or complex filtering in the future, consider a more expressive query structure.
57-105
: Optimize candidate set intersection.
Currently, all offsets are collected before intersecting. For large datasets, consider sorting conditions by expected match size or using streaming intersection to improve performance.
114-123
: Review necessity of final sorting.
Sorting the final intersection is O(n log n). If order is not strictly required, you could skip or make it optional.src/rust/bst/src/sorted_index.rs (5)
9-13
: Consider memory strategies for storing offsets.
If many entries share the same key, you might investigate more advanced data structures to reduce memory usage.
27-39
: Sorting approach could be documented more clearly.
You're using a simple comparison-based sort. If stable ordering among equals matters, consider a stable sort (e.g., sort_by with stable sort logic) or mention if duplicates are expected.
57-100
: Half-open range queries may require extra clarity.
The interval [lower, upper) might confuse callers if they expect inclusive upper bounds. Documenting or exposing an inclusive-range variant is helpful.
102-161
: Serialization versioning or checksums recommended.
Consider embedding version fields or checksums to detect corruption or incompatible formats.
163-189
: Guard against deserialization failures for AnyIndex.
When converting bytes to type T, parse errors could occur. Currently, these may silently fail. Consider stronger error handling or a fallback logic.src/rust/bst/src/byte_serializable.rs (2)
112-119
: Avoid panicking on invalid UTF-8 for String.The current approach uses String::from_utf8(...).unwrap(), which panics if the byte slice is not valid UTF-8. Consider returning a Result or handling the error gracefully.
fn from_bytes(bytes: &[u8]) -> Self { - String::from_utf8(bytes.to_vec()).unwrap() + // A safer approach returning a default or an error: + String::from_utf8(bytes.to_vec()).unwrap_or_default() }
4-11
: Return a Result from from_bytes to handle potential conversion errors.All from_bytes methods currently return the type itself, possibly panicking or producing incorrect data on invalid inputs. Consider changing the trait to return a Result to gracefully handle malformed byte slices.
Also applies to: 117-118, 215-231
src/rust/fcb_core/src/reader/attr_query.rs (2)
19-135
: Efficient attribute query implementation.The approach of selectively deserializing only indexes used in the query is optimal. You might consider partial skipping of any irrelevant index bytes to save I/O if the file layout permits.
138-263
: Avoid code duplication across query methods.select_attr_query_seq and select_attr_query share much logic. Consider extracting a shared internal function that both can call to reduce repetition and potential maintenance overhead.
src/rust/bst/src/lib.rs (3)
20-70
: Consider extracting record definitions into a helper function.
Repeatedly defining the same structure for test data (ID, city, height, year) can be error-prone and clutter tests. Extracting a helper function to generate these records would make tests more readable and maintainable.
74-120
: Reduce boilerplate in building various index entries.
Inserting data into each index (ID, city, height, year) follows a similar pattern of checking for an existing key-value and appending offsets or creating a new entry. Consider extracting a reusable function that handles this routine logic and reduces repetition.
289-349
: Potential improvement for error handling in serialization.
All serialization calls use “unwrap()” without handling potential I/O or deserialization errors. Consider replacing “unwrap()” with error propagation (e.g., “? operator”) or custom error handling to maintain robust error reporting.src/rust/fcb_core/src/writer/attr_index.rs (1)
310-312
: Avoid using println! for logging.
Using “println!” in a library context may cause undesired output in client applications. Consider using a logging crate (e.g., log or tracing) or returning an error if needed.src/rust/fcb_core/src/reader/mod.rs (1)
501-505
: Consider unifying item_filter and item_attr_filter logic.
If both filters exist, the iteration logic might become fragmented. Coordinating feature offsets (R-Tree filter) and attribute offsets (item_attr_filter) can get complicated. Consider a unified filter approach or a consistent fallback order.src/rust/fcb_core/src/writer/attribute.rs (2)
8-8
: Consider clarifying the doc comment further.
While the comment helpfully explains the High-Level concept of the schema, you might also enumerate or briefly mention the tuple structure (index, type) within the comment for quick reference.
202-216
: Check performance implications for large datasets.
This function iterates over city objects to build index entries. If the number of objects is very large, consider potential optimizations (e.g., streamed or parallel processing). Otherwise, this is a neat approach to generate typed index entries.src/rust/fcb_core/tests/http.rs (1)
8-12
: Document the significance of these coordinates.The hardcoded coordinates appear to be specific geographic coordinates. Consider adding a comment explaining what area these coordinates represent and why these specific values were chosen for the test.
+ // Coordinates representing <specific area> let minx = 84227.77; let miny = 445377.33; let maxx = 85323.23; let maxy = 446334.69;
src/rust/fcb_core/src/bin/write.rs (1)
27-33
: Make attribute indices configurable.The attribute names are currently hardcoded. Consider making these configurable through command-line arguments or a configuration file for better flexibility.
Example approach:
use clap::{App, Arg}; // In main() let matches = App::new("FCB Writer") .arg(Arg::new("attr-indices") .long("attr-indices") .help("Comma-separated list of attributes to index") .takes_value(true)) .get_matches(); let attr_indices = matches.value_of("attr-indices") .map(|s| s.split(',').map(String::from).collect::<Vec<_>>());src/rust/fcb_core/src/writer/header_writer.rs (2)
19-20
: Improve documentation for attribute indices.The documentation comment for
attribute_indices_info
could be more descriptive, explaining the purpose and structure of the attribute indices.- /// Attribute indices + /// Information about indexed attributes, including their names and corresponding feature offsets. + /// This is used to build a binary search tree for efficient attribute-based queries. pub(super) attribute_indices_info: Option<Vec<AttributeIndexInfo>>,
30-31
: Add detailed documentation for attribute indices option.The documentation for
attribute_indices
in HeaderWriterOptions should explain its purpose and usage.- /// Attribute indices + /// List of attribute names to be indexed for efficient querying. + /// When provided, binary search trees will be built for these attributes. pub attribute_indices: Option<Vec<String>>,src/rust/fcb_core/src/writer/feature_writer.rs (1)
27-32
: Add documentation for AttributeFeatureOffset struct.The new struct lacks documentation explaining its purpose and fields.
+/// Stores information about attribute offsets within a feature. +/// +/// This struct is used to track the position and size of attribute data +/// within the serialized feature, along with any index entries for +/// efficient attribute-based querying. #[derive(Clone, PartialEq, Debug)] pub(super) struct AttributeFeatureOffset { + /// The offset of the attribute data within the serialized feature pub(super) offset: usize, + /// The size of the attribute data in bytes pub(super) size: usize, + /// Index entries for attributes that need to be indexed pub(super) index_entries: Vec<AttributeIndexEntry>, }src/rust/fcb_core/tests/attr_index.rs (3)
46-46
: Consider parameterizing the test with different attribute combinations.The test currently uses hardcoded attribute names. Consider parameterizing the test to cover different combinations of attributes.
-let attr_indices = vec!["b3_h_dak_50p".to_string(), "identificatie".to_string()]; +#[test] +fn test_attr_index_with_params() -> Result<()> { + for attrs in vec![ + vec!["b3_h_dak_50p"], + vec!["identificatie"], + vec!["b3_h_dak_50p", "identificatie"], + ] { + let attr_indices = attrs.into_iter().map(String::from).collect::<Vec<_>>(); + // ... rest of the test logic + } + Ok(()) +}
81-88
: Consider adding early return for performance.The loop continues checking features even after reaching
feat_count
. Consider adding an early return.while let Ok(Some(feat_buf)) = reader.next() { let feature = feat_buf.cur_cj_feature()?; deserialized_features.push(feature); feat_num += 1; - if feat_num >= feat_count { - break; - } + if feat_num == feat_count { + return Ok(()); + } }
91-107
: Simplify attribute verification logic.The nested loops and multiple flag variables make the code harder to follow. Consider using
any
orall
combinators.-let mut contains_b3_h_dak_50p = false; -let mut contains_identificatie = false; -for co in feature.city_objects.values() { - if co.attributes.is_some() { - let attrs = co.attributes.as_ref().unwrap(); - if let Some(b3_h_dak_50p) = attrs.get("b3_h_dak_50p") { - if b3_h_dak_50p.as_f64().unwrap() > 2.0 { - contains_b3_h_dak_50p = true; - } - } - if let Some(identificatie) = attrs.get("identificatie") { - if identificatie.as_str().unwrap() == "NL.IMBAG.Pand.0503100000012869" { - contains_identificatie = true; - } - } - } -} +let contains_b3_h_dak_50p = feature.city_objects.values().any(|co| { + co.attributes.as_ref().and_then(|attrs| { + attrs.get("b3_h_dak_50p").and_then(|v| v.as_f64()) + }).map_or(false, |v| v > 2.0) +}); + +let contains_identificatie = feature.city_objects.values().any(|co| { + co.attributes.as_ref().and_then(|attrs| { + attrs.get("identificatie").and_then(|v| v.as_str()) + }).map_or(false, |v| v == "NL.IMBAG.Pand.0503100000012869") +});src/fbs/header.fbs (1)
63-66
: Consider adding documentation comments.The new
AttributeIndex
struct lacks documentation explaining its purpose and field meanings.+// Represents an index entry for an attribute in the binary search tree struct AttributeIndex { + // Zero-based index of the attribute in the attributes array index: ushort; + // Length of the attribute data in bytes length: uint; }src/rust/cli/src/main.rs (1)
34-36
: Add example usage in help text.Consider adding an example in the help text to show the expected format of the attribute list.
/// Comma-separated list of attributes to create index for +/// Example: --attr-index "height,width,depth" #[arg(long)] attr_index: Option<String>,
src/rust/fcb_core/tests/e2e.rs (2)
115-149
: Remove debug print statements.These debug print statements should be removed as they're marked with "remove these lines later".
- // ===============remove these lines later================= - println!( - "is CityObject same? {:?}", - orig_co == des_feat.city_objects.get(id).unwrap() - ); - - println!( - "is attribute same======? {:?}", - orig_co.attributes == des_feat.city_objects.get(id).unwrap().attributes - ); - if orig_co.attributes != des_feat.city_objects.get(id).unwrap().attributes { - println!(" attributes======:"); - - let orig_attrs = orig_co.attributes.as_ref().unwrap(); - let des_attrs = des_feat - .city_objects - .get(id) - .unwrap() - .attributes - .as_ref() - .unwrap(); - if orig_attrs.is_object() && des_attrs.is_object() { - for (key, value) in orig_attrs.as_object().unwrap() { - let des_value = des_attrs.get(key); - if des_value.is_none() { - println!(" key not found: {:?}", key); - } else if value != des_value.unwrap() { - println!(" key: {:?}", key); - println!(" original: {:?}", value); - println!(" deserialized: {:?}", des_value.unwrap()); - } - } - } - } - // ===============remove these lines later=================
181-186
: Address TODO comment.There's a TODO comment about implementing attributes that should be addressed.
Would you like me to help implement the attribute comparison logic mentioned in the TODO comment?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (28)
src/fbs/header.fbs
(1 hunks)src/rust/Cargo.toml
(2 hunks)src/rust/bst/Cargo.toml
(1 hunks)src/rust/bst/src/byte_serializable.rs
(1 hunks)src/rust/bst/src/lib.rs
(1 hunks)src/rust/bst/src/query.rs
(1 hunks)src/rust/bst/src/sorted_index.rs
(1 hunks)src/rust/cli/src/main.rs
(4 hunks)src/rust/fcb_core/Cargo.toml
(1 hunks)src/rust/fcb_core/src/bin/read.rs
(0 hunks)src/rust/fcb_core/src/bin/write.rs
(1 hunks)src/rust/fcb_core/src/fb/feature_generated.rs
(1 hunks)src/rust/fcb_core/src/fb/header_generated.rs
(10 hunks)src/rust/fcb_core/src/http_reader/mod.rs
(0 hunks)src/rust/fcb_core/src/reader/attr_query.rs
(1 hunks)src/rust/fcb_core/src/reader/mod.rs
(14 hunks)src/rust/fcb_core/src/writer/attr_index.rs
(1 hunks)src/rust/fcb_core/src/writer/attribute.rs
(2 hunks)src/rust/fcb_core/src/writer/feature_writer.rs
(5 hunks)src/rust/fcb_core/src/writer/header_writer.rs
(6 hunks)src/rust/fcb_core/src/writer/mod.rs
(8 hunks)src/rust/fcb_core/src/writer/serializer.rs
(4 hunks)src/rust/fcb_core/tests/attr_index.rs
(1 hunks)src/rust/fcb_core/tests/e2e.rs
(1 hunks)src/rust/fcb_core/tests/http.rs
(1 hunks)src/rust/fcb_core/tests/read.rs
(1 hunks)src/rust/makefile
(1 hunks)src/rust/wasm/src/lib.rs
(0 hunks)
💤 Files with no reviewable changes (3)
- src/rust/fcb_core/src/bin/read.rs
- src/rust/fcb_core/src/http_reader/mod.rs
- src/rust/wasm/src/lib.rs
✅ Files skipped from review due to trivial changes (1)
- src/rust/bst/Cargo.toml
🔇 Additional comments (35)
src/rust/fcb_core/tests/read.rs (3)
2-6
: Imports look appropriate and comprehensive.
No issues observed with the newly introduced imports from fcb_core and standard I/O libraries.
50-52
: Memory buffer usage is solid.
Writing to and then seeking in a memory buffer is a neat approach for in-memory testing. No issues found.
59-59
: Bounding box query integration looks correct.
The method call to select and test bounding boxes is consistent with your approach.src/rust/bst/src/query.rs (1)
33-36
: MultiIndex definition is straightforward.
No issues noted; the design seems minimal and reusable.src/rust/bst/src/sorted_index.rs (3)
5-6
: Clear offset alias usage.
Using a type alias for offsets keeps code readable and flexible.
16-19
: SortedIndex structure is well-organized.
Maintains a simple vector of KeyValue pairs, which is straightforward and efficient for typical use cases.
42-55
: SearchableIndex trait is clear.
The trait-based approach allows for flexible query strategies. No issues noted here.src/rust/fcb_core/src/writer/mod.rs (2)
3-41
: Good integration of attribute indexing.The addition of the attribute_index_entries HashMap, alongside tracking each feature’s offsets, is a clear and maintainable approach to building an index.
94-109
: Validate final sorted offsets for features and attribute indices.Because the code reorders features based on the Hilbert curve, ensure that node.offset usage matches the correct feature in feat_offsets. Also confirm that the final assignment of each attribute index entry matches the sorted user data.
Would you like a script to confirm offset correctness by comparing the final sorted_feature_buf layout with the recorded offsets?
Also applies to: 110-114, 145-149, 175-198, 200-220, 222-224, 225-235, 239-245
src/rust/bst/src/lib.rs (1)
18-70
: Good test coverage for date-based queries.
The tests for date comparisons (e.g., year <= 2000-01-01, year between 1950 and 2010) offer strong coverage for chronological edge cases. It's great to see these tests ensuring correct filtering logic.src/rust/fcb_core/src/reader/mod.rs (3)
120-130
: Double-check rtree_index vs. attributes offset usage.
Here, “rtree_index: index_size” is set to the sum of both R-Tree and attribute indexes, while “attributes: self.attr_index_size()” only accounts for attribute index size. Ensure that the final offsets align with the intended meaning of each field for consistent file positioning.
164-180
: Confirm correctness of skipping the attribute index within select_bbox_seq.
The code discards “attr_index_size” bytes before constructing FeatureOffset. Make sure the order of skipping R-Tree vs. attribute index is consistent across different methods, especially if you plan to combine bounding-box and attribute filters.
348-357
: Ensure consistent offset updates when seeking with item_attr_filter.
In the Seekable iterator, when “item_attr_filter” is present, you jump to “self.feature_offset.total_size() + item_offset” on every iteration, then set “self.cur_pos = item_offset.” Verify these offset calculations won't skip or overlap any features, and that they align with how index data is laid out immediately after R-Tree data.src/rust/fcb_core/src/writer/attribute.rs (2)
3-4
: New imports seem appropriately utilized.
These imports (chrono::NaiveDateTime and cjseq::CityJSONFeature) appear necessary for the newly introduced functions dealing with date-time parsing and city JSON features.
189-200
: Robust typed attribute handling.
The new AttributeIndexEntry enum covers the majority of data types used in this schema. Its typed approach is clear and maintainable. Just confirm that deliberately unsupported types (like Byte, UByte, etc.) are handled elsewhere or gracefully reported.src/rust/fcb_core/src/writer/serializer.rs (5)
10-10
: Importing AttributeIndex and related structs is valid.
These imports appear to integrate the new attribute indexing into the serializer mechanism seamlessly.
23-27
: Struct name and fields are concise.
AttributeIndexInfo (index, length) is well-named. Confirm that “length” consistently refers to the same concept (e.g., byte length, item count, etc.) throughout the codebase.
43-43
: Function signature extension for attribute indices.
Allowing an Option<&[AttributeIndexInfo]> provides flexibility and avoids overhead when no indices are needed. This approach is a good design choice.
50-60
: Conditional creation of attribute index vector.
The logic for constructing the FlatBuffer vector only when attribute_indices_info is Some is straightforward and avoids unnecessary allocations.
119-119
: Passing the new attribute_index to Header.
This ensures the attribute index information is properly serialized in the header. The change is consistent with the extended Header struct usage.src/rust/fcb_core/src/fb/header_generated.rs (8)
157-157
: Added Default derive for Vector.
Deriving Default is helpful for initialization, especially in scenarios that require zeroed or uninitialized placeholders. No issues here.
477-591
: New AttributeIndex struct for FlatBuffers integration.
This struct adds first-class representation for attribute index data (index, length) in the FlatBuffer schema. Implementation looks solid.
1127-1145
: Added VT_ATTRIBUTE_INDEX constants.
These offsets properly integrate the new attribute index field into the header’s FlatBuffer layout.
1268-1279
: New getter method for attribute_index.
Exposing Option<flatbuffers::Vector<'a, AttributeIndex>> is consistent with how other optional fields are handled.
1486-1491
: Verification step for attribute_index.
Ensures that any stored AttributeIndex data is structurally sound in the FlatBuffer. Helps prevent invalid data usage downstream.
1584-1584
: Extending HeaderArgs with attribute_index.
Adding attribute_index to this argument struct further cements the field’s integration into the header.
1666-1674
: Builder method to set attribute_index.
This method aligns well with FlatBuffers’ builder pattern, making it easy to attach the new field.
1830-1830
: Expanded debug output with attribute_index.
Including attribute_index in debug statements aids traceability during debugging or logging.src/rust/cli/src/main.rs (1)
98-103
: 🏁 Script executed:#!/bin/bash # Description: Check if there's any validation of attribute names against schema ast-grep --pattern 'fn validate_attribute_names($_, $_) { $$$ }'Length of output: 68
src/rust/fcb_core/src/fb/feature_generated.rs (2)
10-59
: LGTM! Appropriate deprecation notices for enum constants.The deprecation notices for
ENUM_MIN_CITY_OBJECT_TYPE
,ENUM_MAX_CITY_OBJECT_TYPE
, andENUM_VALUES_CITY_OBJECT_TYPE
are well-documented with clear migration guidance to use associated constants.
505-505
: LGTM! Improved derive attribute consolidation.The consolidation of derive attributes for the
Vertex
struct improves code readability.src/rust/Cargo.toml (2)
7-7
: LGTM! Added bst to workspace members.The addition of "bst" to workspace members is consistent with the PR objectives for implementing a binary search tree.
31-34
: LGTM! Added appropriate dependencies for binary search tree implementation.The new dependencies are well-chosen for the implementation:
- bincode: For efficient binary serialization
- chrono: For date/time handling
- ordered-float: For floating-point comparisons
- once_cell: For lazy static initialization
src/rust/makefile (1)
10-10
: LGTM! Updated clippy command with correct package name.The clippy command now correctly uses
fcb_wasm
instead ofwasm
, maintaining consistency with the project's package naming.src/rust/fcb_core/Cargo.toml (1)
31-33
: LGTM! Added required dependencies for binary search tree functionality.The new dependencies are appropriately configured:
- bst: Local dependency for binary search tree implementation
- ordered-float and chrono: Using workspace-level versions for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
src/rust/fcb_core/tests/read.rs (1)
83-91
: Improve bbox intersection logic.The current implementation might have edge cases where features are incorrectly included/excluded. Consider using a more robust intersection test.
- // MEMO: it retrieves all features which has intersection with the bbox - if x_s.iter().any(|x| *x >= minx as i64) - || y_s.iter().any(|y| *y >= miny as i64) - || x_s.iter().any(|x| *x <= maxx as i64) - || y_s.iter().any(|y| *y <= maxy as i64) + // Check if any vertex is inside the bbox or if the feature's bounds intersect the bbox + let feat_min_x = x_s.iter().min().copied().unwrap_or(0); + let feat_max_x = x_s.iter().max().copied().unwrap_or(0); + let feat_min_y = y_s.iter().min().copied().unwrap_or(0); + let feat_max_y = y_s.iter().max().copied().unwrap_or(0); + if !(feat_max_x < minx as i64 || feat_min_x > maxx as i64 || + feat_max_y < miny as i64 || feat_min_y > maxy as i64)
🧹 Nitpick comments (6)
src/rust/fcb_core/tests/read.rs (2)
24-31
: Consider extracting attribute schema building into a helper function.The attribute schema building logic is duplicated in both test functions. Extract it into a helper function to improve code reusability and maintainability.
+fn build_attr_schema(cj_seq: &CityJSONSeq) -> AttributeSchema { + let mut attr_schema = AttributeSchema::new(); + for feature in cj_seq.features.iter() { + for (_, co) in feature.city_objects.iter() { + if let Some(attributes) = &co.attributes { + attr_schema.add_attributes(attributes); + } + } + } + attr_schema +}
105-113
: Add Debug trait implementation for NonSeekableCursor.Implementing Debug would help with debugging and error messages.
+ #[derive(Debug)] struct NonSeekableCursor<T: AsRef<[u8]>> { inner: Cursor<T>, }
src/rust/fcb_core/tests/attr_index.rs (1)
92-110
: Extract feature validation into a helper function.The feature validation logic is duplicated and could be made more reusable.
+fn validate_feature(feature: &CityJSONFeature) -> Result<()> { + let mut contains_b3_h_dak_50p = false; + let mut contains_identificatie = false; + for co in feature.city_objects.values() { + if let Some(attrs) = &co.attributes { + if let Some(b3_h_dak_50p) = attrs.get("b3_h_dak_50p") { + if b3_h_dak_50p.as_f64().unwrap() > 2.0 { + contains_b3_h_dak_50p = true; + } + } + if let Some(identificatie) = attrs.get("identificatie") { + if identificatie.as_str().unwrap() == "NL.IMBAG.Pand.0503100000012869" { + contains_identificatie = true; + } + } + } + } + assert!(contains_b3_h_dak_50p); + assert!(contains_identificatie); + Ok(()) +}src/rust/fcb_core/src/reader/attr_query.rs (1)
92-117
: Consider using a builder pattern for query construction.The query construction logic could be made more maintainable using a builder pattern.
+struct QueryBuilder { + conditions: Vec<QueryCondition>, +} + +impl QueryBuilder { + fn new() -> Self { + Self { + conditions: Vec::new(), + } + } + + fn add_condition(mut self, name: String, operator: Operator, value: ByteSerializableValue) -> Self { + self.conditions.push(QueryCondition { + field: name, + operator, + key: value.to_bytes(), + }); + self + } + + fn build(self) -> Query { + Query { + conditions: self.conditions, + } + } +}src/rust/fcb_core/src/reader/mod.rs (2)
50-56
: Add documentation for FeatureOffset struct.The struct lacks documentation explaining its purpose and the meaning of each field.
+/// Represents the offsets of different sections in the FCB file format. +/// +/// # Fields +/// +/// * `magic_bytes` - Size of the magic bytes section (typically 8 bytes) +/// * `header` - Size of the header section +/// * `rtree_index` - Size of the R-tree index section +/// * `attributes` - Size of the attributes section #[doc(hidden)] pub(super) struct FeatureOffset { magic_bytes: u64, header: u64, rtree_index: u64, attributes: u64, }
299-328
: Consider using a state machine pattern for feature iteration.The feature iteration logic could be made clearer using a proper state machine pattern.
+#[derive(Debug, PartialEq, Eq)] +enum IterationState { + Init, + ReadFirstFeatureSize, + Reading { + current_offset: u64, + feature_number: usize, + }, + Finished, +} + +impl IterationState { + fn advance(&mut self, new_offset: u64) { + if let Self::Reading { current_offset, feature_number } = self { + *current_offset = new_offset; + *feature_number += 1; + } + } +}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/rust/fcb_core/src/reader/attr_query.rs
(1 hunks)src/rust/fcb_core/src/reader/mod.rs
(15 hunks)src/rust/fcb_core/tests/attr_index.rs
(1 hunks)src/rust/fcb_core/tests/read.rs
(2 hunks)
match col.type_() { | ||
ColumnType::Int => { | ||
let index = SortedIndex::<i32>::deserialize(&mut buffer.as_slice())?; | ||
multi_index.add_index(col.name().to_string(), Box::new(index)); | ||
} | ||
ColumnType::Long => { | ||
let index = SortedIndex::<i64>::deserialize(&mut buffer.as_slice())?; | ||
multi_index.add_index(col.name().to_string(), Box::new(index)); | ||
} | ||
ColumnType::Float => { | ||
let index = SortedIndex::<OrderedFloat<f32>>::deserialize( | ||
&mut buffer.as_slice(), | ||
)?; | ||
multi_index.add_index(col.name().to_string(), Box::new(index)); | ||
} | ||
ColumnType::Double => { | ||
let index = SortedIndex::<OrderedFloat<f64>>::deserialize( | ||
&mut buffer.as_slice(), | ||
)?; | ||
multi_index.add_index(col.name().to_string(), Box::new(index)); | ||
} | ||
ColumnType::String => { | ||
let index = SortedIndex::<String>::deserialize(&mut buffer.as_slice())?; | ||
multi_index.add_index(col.name().to_string(), Box::new(index)); | ||
} | ||
ColumnType::Bool => { | ||
let index = SortedIndex::<bool>::deserialize(&mut buffer.as_slice())?; | ||
multi_index.add_index(col.name().to_string(), Box::new(index)); | ||
} | ||
ColumnType::DateTime => { | ||
let index = | ||
SortedIndex::<DateTime<Utc>>::deserialize(&mut buffer.as_slice())?; | ||
multi_index.add_index(col.name().to_string(), Box::new(index)); | ||
} | ||
_ => { | ||
return Err(anyhow!("unsupported column type")); | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Extract column type deserialization into a separate function.
The column type deserialization logic is duplicated between seekable and non-seekable implementations. Consider extracting it into a reusable function.
+fn deserialize_column_index(
+ col_type: ColumnType,
+ buffer: &[u8],
+ col_name: String,
+ multi_index: &mut MultiIndex,
+) -> Result<()> {
+ match col_type {
+ ColumnType::Int => {
+ let index = SortedIndex::<i32>::deserialize(buffer)?;
+ multi_index.add_index(col_name, Box::new(index));
+ }
+ ColumnType::Long => {
+ let index = SortedIndex::<i64>::deserialize(buffer)?;
+ multi_index.add_index(col_name, Box::new(index));
+ }
+ // ... other cases ...
+ _ => return Err(anyhow!("unsupported column type")),
+ }
+ Ok(())
+}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (6)
src/rust/fcb_core/src/reader/mod.rs (2)
26-49
: Consider documenting the new fields inFeatureIter
.While most fields have documentation comments, the new fields
item_attr_filter
,feature_offset
, andtotal_feat_count
lack documentation explaining their purpose.Add documentation comments for the new fields:
/// Selected features or None if no bbox filter item_filter: Option<Vec<packed_rtree::SearchResultItem>>, + /// Selected attributes or None if no attribute filter item_attr_filter: Option<Vec<ValueOffset>>, /// Number of selected features (None for undefined feature count) count: Option<usize>, /// Current feature number feat_no: usize, /// File offset within feature section cur_pos: u64, /// Reading state state: State, /// Whether or not the underlying reader is Seek seekable_marker: PhantomData<S>, + /// Offsets to different sections in the file feature_offset: FeatureOffset, + /// Total number of features in the file total_feat_count: u64,
52-57
: Consider makingFeatureOffset
fields private.The fields in
FeatureOffset
are currently public by default. Since this is apub(super)
struct, consider making the fields private and providing accessor methods if needed.pub(super) struct FeatureOffset { - magic_bytes: u64, - header: u64, - rtree_index: u64, - attributes: u64, + magic_bytes: u64, + header: u64, + rtree_index: u64, + attributes: u64, }src/rust/fcb_core/src/bin/read_cj.rs (2)
8-8
: Add return type documentation.The function's return type
Result<()>
should be documented to clarify what errors might be returned.-fn read_cj(input: &str) -> Result<()> { +/// Reads a CityJSON file and extracts specific attributes from features. +/// +/// # Errors +/// Returns an error if: +/// - The file cannot be opened or read +/// - The JSON parsing fails +/// - The data is not a valid CityJSONSeq +fn read_cj(input: &str) -> Result<()> {
26-30
: Extract attribute names as constants.The attribute names "documentnummer" and "identificatie" are hardcoded. Consider extracting them as constants for better maintainability.
+const DOCUMENT_NUMBER_ATTR: &str = "documentnummer"; +const IDENTIFICATION_ATTR: &str = "identificatie"; + fn read_cj(input: &str) -> Result<()> { // ... if let Some(attributes) = &co.attributes { if let Some(document_number) = - attributes.get("documentnummer").and_then(|v| v.as_str()) + attributes.get(DOCUMENT_NUMBER_ATTR).and_then(|v| v.as_str()) { document_nummers.push(document_number.to_string()); } if let Some(identification) = - attributes.get("identificatie").and_then(|v| v.as_str()) + attributes.get(IDENTIFICATION_ATTR).and_then(|v| v.as_str()) { identifications.push(identification.to_string()); } }Also applies to: 31-35
src/rust/fcb_core/benches/read_attr.rs (2)
52-63
: Use constants for query parameters.The attribute name and values in the query are hardcoded. Use the same constants as defined above for consistency.
let query: AttrQuery = vec![ ( - "b3_h_dak_50p".to_string(), + ATTR_NAME.to_string(), Operator::Gt, - ByteSerializableValue::F64(OrderedFloat(20.0)), + ByteSerializableValue::F64(OrderedFloat(MIN_VALUE)), ), ( - "b3_h_dak_50p".to_string(), + ATTR_NAME.to_string(), Operator::Lt, - ByteSerializableValue::F64(OrderedFloat(50.0)), + ByteSerializableValue::F64(OrderedFloat(MAX_VALUE)), ), ];
104-137
: Add documentation for benchmark parameters.The benchmark function lacks documentation about its parameters and what it's measuring.
+/// Benchmarks FCB file reading with and without attribute indexing. +/// +/// # Parameters +/// - `c`: Criterion instance for benchmarking +/// +/// The benchmark compares: +/// - Reading without attribute index: Full scan of features +/// - Reading with attribute index: Using indexed attributes for faster access +/// +/// Results are printed in a table format showing mean execution times. pub fn read_benchmark(c: &mut Criterion) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
src/rust/fcb_core/Cargo.toml
(2 hunks)src/rust/fcb_core/benches/read_attr.rs
(1 hunks)src/rust/fcb_core/src/bin/read_cj.rs
(1 hunks)src/rust/fcb_core/src/reader/attr_query.rs
(1 hunks)src/rust/fcb_core/src/reader/mod.rs
(15 hunks)src/rust/makefile
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/rust/makefile
- src/rust/fcb_core/src/reader/attr_query.rs
🔇 Additional comments (3)
src/rust/fcb_core/src/reader/mod.rs (2)
3-3
: LGTM! Module organization looks good.The new imports and module declarations are well-organized:
bst::ValueOffset
for attribute indexing- New
attr_query
module with proper visibilityAlso applies to: 16-16, 18-18
317-328
: Verify attribute filtering logic in both iterator implementations.The attribute filtering logic differs between seekable and non-seekable iterators:
- Non-seekable: Uses relative offsets and skips bytes
- Seekable: Uses absolute offsets with
seek_from_start
Please verify that both implementations handle edge cases correctly:
- When attributes are filtered but no spatial filtering is applied
- When both spatial and attribute filters are applied
- When neither filter is applied
Also applies to: 363-372
❌ Verification inconclusive
🏁 Script executed:
#!/bin/bash # Search for test cases covering these scenarios rg -A 10 "test.*attr.*filter"Length of output: 31
Action Required: Manual Verification of Attribute Filtering Edge Cases
The automated search did not yield any test cases verifying the attribute filtering logic for both seekable and non-seekable iterator implementations. This makes it difficult to conclusively assess the correctness of handling the following edge cases:
- When attributes are filtered but no spatial filtering is applied
- When both spatial and attribute filters are applied
- When neither filter is applied
Please manually verify these cases with appropriate tests or additional runtime checks to ensure consistency between the two implementations.
src/rust/fcb_core/Cargo.toml (1)
31-33
: LGTM! Dependencies and targets are well-structured.The new dependencies and targets align well with the PR's objectives. The workspace-level version management for
ordered-float
andchrono
ensures consistency across the project.Also applies to: 64-66, 72-74
src/rust/fcb_core/src/reader/mod.rs
Outdated
fn attr_index_size(&self) -> u64 { | ||
let header = self.buffer.header(); | ||
header | ||
.attribute_index() | ||
.map(|attr_index| attr_index.iter().map(|ai| ai.length()).sum::<u32>()) | ||
.unwrap_or(0) as u64 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider adding error handling for malformed attribute indices.
The attr_index_size
method assumes the attribute index data is well-formed. Consider adding validation to handle potential corruption in the flatbuffer data.
fn attr_index_size(&self) -> u64 {
let header = self.buffer.header();
- header
- .attribute_index()
- .map(|attr_index| attr_index.iter().map(|ai| ai.length()).sum::<u32>())
- .unwrap_or(0) as u64
+ header
+ .attribute_index()
+ .map(|attr_index| {
+ attr_index
+ .iter()
+ .try_fold(0u32, |acc, ai| {
+ let len = ai.length();
+ if len > u32::MAX - acc {
+ Err(anyhow!("Attribute index size overflow"))
+ } else {
+ Ok(acc + len)
+ }
+ })
+ .unwrap_or(0)
+ })
+ .unwrap_or(0) as u64
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
fn attr_index_size(&self) -> u64 { | |
let header = self.buffer.header(); | |
header | |
.attribute_index() | |
.map(|attr_index| attr_index.iter().map(|ai| ai.length()).sum::<u32>()) | |
.unwrap_or(0) as u64 | |
} | |
fn attr_index_size(&self) -> u64 { | |
let header = self.buffer.header(); | |
header | |
.attribute_index() | |
.map(|attr_index| { | |
attr_index | |
.iter() | |
.try_fold(0u32, |acc, ai| { | |
let len = ai.length(); | |
if len > u32::MAX - acc { | |
Err(anyhow!("Attribute index size overflow")) | |
} else { | |
Ok(acc + len) | |
} | |
}) | |
.unwrap_or(0) | |
}) | |
.unwrap_or(0) as u64 | |
} |
if let Some(b3_h_dak_50p) = attributes.get("b3_h_dak_50p") { | ||
if b3_h_dak_50p.as_f64().unwrap() > 20.0 | ||
&& b3_h_dak_50p.as_f64().unwrap() < 50.0 | ||
{ | ||
// println!("b3_h_dak_50p: {:?}", b3_h_dak_50p); | ||
target_feat_num += 1; | ||
continue; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Extract duplicate feature counting logic.
The feature counting logic is duplicated between read_fcb_without_attr_index
and read_fcb_with_attr_index
. Consider extracting it into a shared function.
+const ATTR_NAME: &str = "b3_h_dak_50p";
+const MIN_VALUE: f64 = 20.0;
+const MAX_VALUE: f64 = 50.0;
+
+fn count_matching_features(attributes: &serde_json::Map<String, serde_json::Value>) -> bool {
+ if let Some(value) = attributes.get(ATTR_NAME) {
+ if let Some(num) = value.as_f64() {
+ return num > MIN_VALUE && num < MAX_VALUE;
+ }
+ }
+ false
+}
+
// In both functions, replace the duplicate logic with:
if let Some(attributes) = &co.attributes {
- if let Some(b3_h_dak_50p) = attributes.get("b3_h_dak_50p") {
- if b3_h_dak_50p.as_f64().unwrap() > 20.0
- && b3_h_dak_50p.as_f64().unwrap() < 50.0
- {
- target_feat_num += 1;
- continue;
- }
+ if count_matching_features(attributes) {
+ target_feat_num += 1;
+ continue;
}
}
Also applies to: 74-82
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
src/rust/fcb_core/src/writer/attribute.rs (1)
69-69
:⚠️ Potential issueAdd proper error handling for unwrap() calls.
Multiple unwrap() calls could lead to panics if the values are null or of unexpected types.
Apply this diff to add proper error handling:
- size_of::<u32>() + colval.as_str().unwrap().len() + size_of::<u32>() + colval.as_str().unwrap_or_default().len() - size_of::<u32>() + colval.as_str().unwrap().len(), //TODO: check if this is correct + size_of::<u32>() + colval.as_str().unwrap_or_default().len() //TODO: check if this is correctAlso applies to: 75-75
♻️ Duplicate comments (1)
src/rust/fcb_core/src/writer/attr_index.rs (1)
63-185
: 🛠️ Refactor suggestionReduce code duplication in match arms.
The match arms follow an identical pattern with only the types and variant names changing. This duplication can be reduced using a macro or helper function.
Here's a suggested approach using a macro:
macro_rules! handle_attribute_type { ($schema_index:expr, $attribute_entries:expr, $type:ty, $variant:pat => $val:expr) => { build_index_generic::<$type, _>($schema_index, $attribute_entries, |entry| { if let $variant = entry { if index == $schema_index { Some($val) } else { None } } else { None } }) }; } // Usage in match: match *coltype { ColumnType::Bool => handle_attribute_type!( *schema_index, attribute_entries, bool, AttributeIndexEntry::Bool { index, val } => *val ), ColumnType::Int => handle_attribute_type!( *schema_index, attribute_entries, i32, AttributeIndexEntry::Int { index, val } => *val ), // ... similar for other types }
🧹 Nitpick comments (7)
src/rust/fcb_core/src/writer/attribute.rs (3)
45-45
: Consider improving numeric type guessing.The TODO comment indicates uncertainty about defaulting to ULong. Consider implementing range-based type inference or accepting type hints from the schema.
Would you like me to help implement a more accurate type guessing mechanism that considers value ranges?
235-236
: Replace println! with proper logging.Using println! for error handling is not ideal for production code. Consider using a proper logging framework.
Apply this diff to use proper logging:
- println!("Attribute {} not found in schema", attr); + log::warn!("Attribute {} not found in schema", attr); - eprintln!("Failed to parse DateTime: {}", e); + log::error!("Failed to parse DateTime: {}", e); - println!("Attribute {} is not supported for indexing", attr); + log::warn!("Attribute {} is not supported for indexing", attr);Also, consider adding the
log
crate to your dependencies:[dependencies] +log = "0.4"
Also applies to: 307-311, 317-318
377-461
: Consider adding more test cases.While the current test cases are good, consider adding tests for:
- DateTime parsing
- Binary data handling
- Invalid/malformed attribute values
- Edge cases for numeric types
Would you like me to help implement these additional test cases?
src/rust/fcb_core/src/reader/attr_query.rs (2)
107-108
: Improve error messages with more context.The error messages for missing attribute index and columns could be more descriptive to help with debugging:
- .ok_or_else(|| anyhow!("attribute index not found"))?; + .ok_or_else(|| anyhow!("attribute index not found in header"))?; - .ok_or_else(|| anyhow!("no columns found in header"))?; + .ok_or_else(|| anyhow!("no columns found in header schema"))?;Also applies to: 110-111
164-165
: Extract error handling into constants for consistency.The error handling is duplicated between
select_attr_query
andselect_attr_query_seq
. Consider extracting the error messages into constants:+const ERR_ATTRIBUTE_INDEX_NOT_FOUND: &str = "attribute index not found in header"; +const ERR_NO_COLUMNS_FOUND: &str = "no columns found in header schema"; - .ok_or_else(|| anyhow::anyhow!("attribute index not found"))?; + .ok_or_else(|| anyhow::anyhow!(ERR_ATTRIBUTE_INDEX_NOT_FOUND))?; - .ok_or_else(|| anyhow::anyhow!("no columns found in header"))? + .ok_or_else(|| anyhow::anyhow!(ERR_NO_COLUMNS_FOUND))?Also applies to: 167-169
src/rust/fcb_core/src/reader/mod.rs (1)
329-340
: Add comments explaining the attribute filtering logic.The attribute filtering logic in both seekable and non-seekable implementations would benefit from comments explaining:
- Why we need to check if we haven't reached the attribute offset yet
- The difference in seeking strategies between seekable and non-seekable implementations
Also applies to: 375-384
src/rust/fcb_core/src/writer/attr_index.rs (1)
15-23
: Add documentation for the generic function.Consider adding documentation to explain:
- Purpose of the function
- Generic type constraints
- Parameters and return value
- Example usage
+/// Builds a sorted index for attribute values of type T. +/// +/// # Type Parameters +/// * `T` - The attribute value type that must implement `Ord + Clone + ByteSerializable` +/// * `F` - A closure type that extracts Option<T> from AttributeIndexEntry +/// +/// # Arguments +/// * `schema_index` - The index of the attribute in the schema +/// * `attribute_entries` - Map of feature offsets to their attribute entries +/// * `extract` - Closure that extracts the typed value from an AttributeIndexEntry +/// +/// # Returns +/// * `Option<(Vec<u8>, AttributeIndexInfo)>` - The serialized index and its metadata fn build_index_generic<T, F>(
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
src/rust/bst/src/byte_serializable.rs
(1 hunks)src/rust/fcb_core/src/bin/read_cj.rs
(1 hunks)src/rust/fcb_core/src/reader/attr_query.rs
(1 hunks)src/rust/fcb_core/src/reader/mod.rs
(15 hunks)src/rust/fcb_core/src/writer/attr_index.rs
(1 hunks)src/rust/fcb_core/src/writer/attribute.rs
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/rust/fcb_core/src/bin/read_cj.rs
- src/rust/bst/src/byte_serializable.rs
🔇 Additional comments (10)
src/rust/fcb_core/src/writer/attribute.rs (3)
1-9
: LGTM! Clean imports and type definitions.The new imports and type definitions are well-organized and properly documented.
189-200
: LGTM! Well-structured attribute index entry enum.The enum is well-designed with appropriate derive macros and covers all necessary attribute types.
202-216
: LGTM! Clear and focused implementation.The function has a single responsibility and effectively processes CityJSON features for indexing.
src/rust/fcb_core/src/reader/attr_query.rs (3)
18-67
: Extract column type deserialization into a separate function.The column type deserialization logic is duplicated between different column types. Consider extracting it into a reusable function to improve maintainability and reduce code duplication.
69-87
: LGTM!The function is well-structured, handles all possible variants of
ByteSerializableValue
, and follows a consistent pattern.
89-100
: LGTM!The function is concise, focused, and efficiently maps query conditions.
src/rust/fcb_core/src/reader/mod.rs (3)
52-57
: LGTM!The
FeatureOffset
struct is well-organized and provides a clean abstraction for managing file offsets. Thetotal_size
method is simple and correct.Also applies to: 303-305
515-517
: Consider handling potential overflow infeatures_count
.The unchecked conversion from
u64
tousize
could cause issues on 32-bit platforms.
281-299
: LGTM!The
attr_index_size
method is well-implemented with proper overflow checking and error handling.src/rust/fcb_core/src/writer/attr_index.rs (1)
1-13
: LGTM! Well-organized imports.The imports are logically organized and all are actively used in the implementation.
This change is
Summary by CodeRabbit
New Features
Chores