-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(clp-s): Chunk output by size (in bytes) during ordered decompression. #600
Conversation
WalkthroughThe pull request modifies the command-line argument parsing in the Changes
Possibly related PRs
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Co-authored-by: haiqi96 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
components/core/src/clp_s/JsonConstructor.hpp (1)
33-33
: Add documentation for ordered_chunk_size memberWhile the initialization change to default initialization is good, this new functionality needs documentation. Consider adding a comment explaining that this member controls the chunk size in bytes during ordered decompression, and that a value of 0 means no chunking.
Apply this diff to add documentation:
+ /// Size in bytes for chunking during ordered decompression. + /// When set to 0 (default), output will not be divided into chunks. size_t ordered_chunk_size{};components/core/src/clp_s/JsonConstructor.cpp (1)
162-164
: Document potential chunk size varianceConsider adding a comment explaining that chunks may slightly exceed the requested size since we finalize only after exceeding the threshold. This behaviour aligns with the PR objectives but should be documented for maintainability.
+ // Note: Actual chunk size may slightly exceed ordered_chunk_size since we + // always include the complete record that causes the size threshold to be exceeded if (0 != m_option.ordered_chunk_size && chunk_size >= m_option.ordered_chunk_size) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
components/core/src/clp_s/CommandLineArguments.cpp
(1 hunks)components/core/src/clp_s/CommandLineArguments.hpp
(1 hunks)components/core/src/clp_s/JsonConstructor.cpp
(2 hunks)components/core/src/clp_s/JsonConstructor.hpp
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- components/core/src/clp_s/CommandLineArguments.hpp
🔇 Additional comments (4)
components/core/src/clp_s/JsonConstructor.cpp (3)
84-86
: LGTM: Clean variable initialization
The initialization of tracking variables follows modern C++ practices and provides clear intent.
168-168
: LGTM: Proper handling of final chunk
The condition correctly ensures that partial chunks are properly finalized while preventing the creation of empty files.
152-160
: Verify buffer length accumulation for multi-byte characters
The implementation correctly accumulates buffer lengths for chunk size tracking. However, let's verify that the buffer length calculation handles multi-byte characters consistently.
✅ Verification successful
Let me check the specific UTF-8 handling in the JsonConstructor and related components.
Let me check how the buffer is populated and if there's any string encoding conversion happening in the message processing chain.
Let me check one final aspect - how the actual message content is handled in the schema reader implementation.
Buffer length calculation is safe for multi-byte characters
The implementation is correct as it uses std::string::length()
which returns the number of bytes (not characters) in the UTF-8 encoded string. The chunk size accumulation is byte-accurate since:
- The message content is handled as raw bytes throughout the processing chain
- No character-level operations or encoding conversions are performed
- The buffer's byte length is directly used for chunk size calculations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any UTF-8 or multi-byte character handling in the codebase
rg -l "utf-?8|encoding|multi.*byte|mb.*len|wcs.*len" --type cpp
Length of output: 2737
Script:
#!/bin/bash
# Check how buffer and string handling is done in JsonConstructor and related files
rg -A 5 "buffer\.(length|size|c_str)" components/core/src/clp_s/JsonConstructor.cpp
rg -A 5 "std::string.*buffer" components/core/src/clp_s/JsonConstructor.cpp
# Check if there's any specific string encoding handling
rg -A 5 "encoding|utf-?8" components/core/src/clp_s/JsonConstructor.cpp components/core/src/clp_s/JsonParser.cpp
Length of output: 2136
Script:
#!/bin/bash
# Check how messages are processed and how the buffer is populated
rg -A 10 "get_next_message.*buffer" components/core/src/clp_s/
rg -A 10 "class.*SchemaReader" components/core/src/clp_s/
# Check if there's any string encoding conversion in the schema reader
rg -A 10 "class.*JsonReader" components/core/src/clp_s/
Length of output: 2578
Script:
#!/bin/bash
# Check the SchemaReader implementation for message handling
rg -A 10 "void.*get_next_message" components/core/src/clp_s/SchemaReader.cpp components/core/src/clp_s/SchemaReader.hpp
# Look for any string encoding or conversion in the message processing chain
rg -A 5 "string.*message" components/core/src/clp_s/SchemaReader.cpp components/core/src/clp_s/SchemaReader.hpp
Length of output: 5055
components/core/src/clp_s/CommandLineArguments.cpp (1)
308-309
: LGTM! Clear and accurate description update.
The updated description accurately reflects that the chunk size is measured in bytes, which aligns with the PR's objective of modifying the chunking behavior to focus on byte-size rather than record count.
if (0 != m_option.ordered_chunk_size | ||
&& num_records_marshalled >= m_option.ordered_chunk_size) | ||
{ | ||
if (0 != m_option.ordered_chunk_size && chunk_size >= m_option.ordered_chunk_size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can remove 0 != m_option.ordered_chunk_size
. See comments below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its worth keeping this behaviour because its pretty much the only way an actual CLI user (including me) will use this command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
@@ -30,7 +30,7 @@ struct JsonConstructorOption { | |||
std::string archive_id; | |||
std::string output_dir; | |||
bool ordered{false}; | |||
size_t ordered_chunk_size{0}; | |||
size_t ordered_chunk_size{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: @kirkrodrigues wonder should we give it a non-0 default value? A default = 0 may look unintuitive to user.
If we don't have any other use case for ordered-decompression other than log viewer, we can use a default like 128MB which is what CLP is using. Or we can just set it to a large number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to keep the current default behaviour. We can add more to the description of the argument to describe that 0 gives you no chunking.
I think it makes more sense for the package to pick its behaviour inside of the package config instead of forcing it into the CLI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, now I agree that it should be in the package config. It is never there though, maybe I should consider changing this behavior in another PR to let the default to be configurable through the package
Can we agree on a new default for log-viewer, it's currently set to 100'000 records. Please also update the default values in components/clp-package-utils/clp_package_utils/scripts/decompress.py |
@@ -299,7 +299,7 @@ def main(argv): | |||
json_extraction_parser = command_args_parser.add_parser(EXTRACT_JSON_CMD) | |||
json_extraction_parser.add_argument("archive_id", type=str, help="Archive ID") | |||
json_extraction_parser.add_argument( | |||
"--target-chunk-size", type=int, help="Target chunk size.", required=True | |||
"--target-chunk-size", type=int, help="Target chunk size (B)." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with removing the default value, but can you make sure the script still works without specifying target-chunk-size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, already tested on my end and it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good to me. @kirkrodrigues do you want to do another pass?
@@ -178,7 +178,7 @@ class CommandLineArguments { | |||
size_t m_max_document_size{512ULL * 1024 * 1024}; // 512 MB | |||
bool m_structurize_arrays{false}; | |||
bool m_ordered_decompression{false}; | |||
size_t m_ordered_chunk_size{0}; | |||
size_t m_ordered_chunk_size{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming this to m_target_ordered_chunk_size
to indicate that it's a threshold that may be exceeded? Ditto for the CLI option name and JsonConstructorOption
.
Co-authored-by: kirkrodrigues <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (3)
components/core/src/clp_s/JsonConstructor.cpp (1)
162-166
: Consider improving readability of chunk size conditionWhile the logic is correct, the condition could be more readable by combining the checks into a single expression or using early return.
Here's a suggested improvement:
- if (0 != m_option.target_ordered_chunk_size - && chunk_size >= m_option.target_ordered_chunk_size) + const bool should_finalize_chunk = (m_option.target_ordered_chunk_size > 0 + && chunk_size >= m_option.target_ordered_chunk_size); + if (should_finalize_chunk)components/core/src/clp_s/CommandLineArguments.cpp (2)
305-309
: Enhance the description of target-ordered-chunk-size parameter.The description is clear but could be more specific about the behaviour when the chunk size is exceeded. Consider adding details about the actual chunking behaviour.
- "Chunk size (B) for each output file when decompressing records in log order." - " When set to 0, no chunking is performed." + "Target chunk size (B) for each output file when decompressing records in log order." + " Records are added to a chunk until the size exceeds this value." + " When set to 0, no chunking is performed."
372-375
: Improve error message clarity for the ordered-chunk-size validation.The error message should reference the exact command-line flag name for better user experience.
- throw std::invalid_argument("ordered-chunk-size must be used with ordered argument" + throw std::invalid_argument("--target-ordered-chunk-size must be used with --ordered"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
components/core/src/clp_s/CommandLineArguments.cpp
(2 hunks)components/core/src/clp_s/CommandLineArguments.hpp
(2 hunks)components/core/src/clp_s/JsonConstructor.cpp
(2 hunks)components/core/src/clp_s/JsonConstructor.hpp
(1 hunks)components/core/src/clp_s/clp-s.cpp
(1 hunks)components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp_s/CommandLineArguments.hpp
🔇 Additional comments (5)
components/core/src/clp_s/JsonConstructor.hpp (2)
33-33
: LGTM! The variable name accurately reflects its purpose.
The rename to target_ordered_chunk_size
better indicates that this is a target size that may be slightly exceeded during chunking, which aligns well with the size-based chunking behaviour.
33-33
: Size validation could be strengthened
The previous review comment about adding size validation to prevent excessive memory usage is still applicable.
components/core/src/clp_s/JsonConstructor.cpp (3)
84-86
: LGTM: Clean variable initialization
The default initialization of tracking variables follows modern C++ practices and maintains clear intent.
170-171
: LGTM: Proper handling of final chunk
The condition ensures that any remaining data is properly handled and no records are lost, which is essential for data integrity.
152-160
: 🛠️ Refactor suggestion
Consider integer overflow protection
When accumulating chunk_size
, there's a potential risk of integer overflow when adding buffer.length()
. Consider adding a check to ensure the addition won't overflow.
Here's a suggested improvement:
- chunk_size += buffer.length();
+ if (chunk_size > std::numeric_limits<size_t>::max() - buffer.length()) {
+ throw OperationFailed(
+ ErrorCodeFailure,
+ __FILE__,
+ __LINE__,
+ "Chunk size would overflow"
+ );
+ }
+ chunk_size += buffer.length();
command.append("--target-ordered-chunk-size") | ||
command.append(str(extract_json_config.target_chunk_size)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Potential inconsistency found in argument name transition
There appears to be a remaining reference to the old ordered-chunk-size
in an error message within components/core/src/clp_s/CommandLineArguments.cpp
. While the parameter name has been updated to target-ordered-chunk-size
, the error message still uses the old terminology, which could cause confusion.
components/core/src/clp_s/CommandLineArguments.cpp
: Update error message to use the new argument nametarget-ordered-chunk-size
instead ofordered-chunk-size
🔗 Analysis chain
The argument name change aligns with PR objectives
The update from --ordered-chunk-size
to --target-ordered-chunk-size
correctly implements the requested change for byte-based chunking. The implementation properly handles the optional nature of the parameter.
Let's verify the consistency of this change across the codebase:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the argument name change is consistent across the codebase
# Expected: No occurrences of the old argument name remain
# Check for any remaining instances of the old argument name
rg --type py --type cpp --type hpp "ordered-chunk-size"
# Verify the new argument name is used consistently
rg --type py --type cpp --type hpp "target-ordered-chunk-size"
Length of output: 174
Script:
#!/bin/bash
# Let's try without the hpp extension and use a more focused search
# Check for any remaining instances of the old argument name
echo "Searching for old argument name:"
rg --type py --type cpp "ordered-chunk-size"
echo -e "\nSearching for new argument name:"
rg --type py --type cpp "target-ordered-chunk-size"
# Also search in any potential test files or configuration
echo -e "\nSearching in all files (including others):"
rg "ordered-chunk-size"
Length of output: 1408
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the PR title, how about:
feat(clp-s): Chunk output by size (in bytes) during ordered decompression.
…sion. (y-scope#600) Co-authored-by: haiqi96 <[email protected]> Co-authored-by: kirkrodrigues <[email protected]>
Description
This PR changes the behaviour of chunking during ordered decompression to target a chunk size in bytes instead of number of records. Generating chunks in this way should allow the log viewer to have more consistent memory usage and computation required per chunk.
Chunks are generated by appending records to them until the total chunk size exceeds that specified by the
--ordered-chunk-size
argument. That is,--ordered-chunk-size
specifies a lower bound for the chunk size which may be slightly exceeded.As before, specifying
--ordered-chunk-size 0
(which is also the default) will result in the output not being divided into chunks.Validation performed
--ordered-chunk-size 0
--target-chunk-size
argument still functions correctly in the package.Summary by CodeRabbit
Summary by CodeRabbit
New Features
Bug Fixes
Chores