
fix(wren-ai-service): Optimize SQL Data Preprocessing with Progressive Data Reduction #1331

Merged: 3 commits into main from fix/limit-row-for-context-window, Feb 25, 2025

Conversation

paopa (Member) commented on Feb 24, 2025

This PR improves the SQL data preprocessing pipeline by implementing a progressive data reduction approach for handling large datasets.

Key Updates:

  • Introduced a new reduce_data_size helper function that gradually reduces data size in steps
  • Replaced static data truncation with an iterative reduction process
  • Added logging for better visibility into the data reduction process

Technical Details:

  • Data is now reduced in configurable steps (default: 50 rows) until the token count is at or below 100,000 (a sketch of the helper follows this list)
  • Maintains original column structure and dtypes while only reducing row count
  • Added logging statements to track token count and data size during reduction
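For reference, a condensed sketch of the helper, mirroring the implementation quoted in the review diffs below (it assumes a module-level logger is in scope; token counting stays in the caller):

def reduce_data_size(data: list, reduction_step: int = 50) -> list:
    """Reduce the size of data by removing elements from the end."""
    if reduction_step <= 0:
        raise ValueError("reduction_step must be positive")
    # Never go negative: an oversized step just yields an empty list
    elements_to_keep = max(0, len(data) - reduction_step)
    returned_data = data[:elements_to_keep]
    logger.info(
        f"Reducing data size by {reduction_step} rows. "
        f"Original size: {len(data)}, New size: {len(returned_data)}"
    )
    return returned_data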

Benefits:

  • More controlled approach to data reduction
  • Better visibility into the reduction process
  • More flexible than the previous hard-coded 250-row limit

Testing:

Please verify the following (a test sketch follows the checklist):

  • Large datasets are properly reduced below the 100,000 token threshold
  • Data structure integrity is maintained
  • Logging provides adequate visibility into the reduction process
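A minimal pytest sketch for the trimming and validation behavior (a hypothetical test module; it assumes reduce_data_size is importable, e.g. after lifting it out of preprocess):

import pytest

def test_trims_rows_from_the_end():
    data = [{"id": i} for i in range(100)]
    reduced = reduce_data_size(data, reduction_step=50)
    assert reduced == data[:50]  # remaining rows keep their original structure

def test_rejects_nonpositive_step():
    with pytest.raises(ValueError):
        reduce_data_size([1, 2, 3], reduction_step=0)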

Summary by CodeRabbit

  • New Features
    • Data handling now adapts to input size: large datasets are reduced progressively, so token-heavy operations stay within model limits and run more smoothly.
    • Introduced a method for shrinking large data lists, with input validation and a safeguard that caps the number of reduction iterations.

coderabbitai bot (Contributor) commented on Feb 24, 2025

Walkthrough

The changes introduce a new inner function reduce_data_size within the preprocess function in wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py. The function iteratively shrinks the data list by removing a fixed number of elements from the end until the token count is at or below 100,000. The previous logic that truncated the data to the first 250 entries has been removed, an iteration counter guards against infinite loops, and the updated token count is logged after each reduction. The return statement remains unchanged.
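For orientation, a condensed sketch of that loop, reconstructed from the diff context quoted later in this conversation (encoding, logger, and the surrounding pipeline setup are assumed to be in scope):

iteration = 0
_token_count = len(encoding.encode(str(sql_data)))
while _token_count > 100_000:
    if iteration > 1000:
        break  # safety valve against an infinite loop
    iteration += 1
    data = sql_data.get("data", [])
    sql_data["data"] = reduce_data_size(data)  # trims 50 rows per pass
    num_rows_used_in_llm = len(sql_data.get("data", []))
    _token_count = len(encoding.encode(str(sql_data)))
    logger.info(f"Token count: {_token_count}")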

Changes

File Path: wren-ai-service/.../preprocess_sql_data.py
Change Summary: Added a reduce_data_size function inside preprocess that uses a while loop to remove elements in fixed steps until the token count is ≤ 100,000; the updated token count is logged after each pass.

Sequence Diagram(s)

sequenceDiagram
    participant Pre as preprocess()
    participant RD as reduce_data_size()
    participant Log as Logger

    Pre->>RD: Call reduce_data_size(data, reduction_step)
    loop while token count > 100,000
        RD->>Pre: Remove reduction_step items from data
        Pre->>Log: Log updated token count
    end
    Pre-->>Pre: Return sql_data, num_rows_used_in_llm, tokens

Suggested reviewers

  • cyyeh

Poem

I'm a rabbit in a code field,
Hopping through lines both bold and mild,
Trimming data with each little bite,
Logging tokens through the day and night,
With every reduction, I cheer with delight!
🐇💻

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (1)

16-43: Consider architectural improvements for better flexibility and performance.

The current implementation could benefit from several architectural improvements:

  1. Consider caching token counts to avoid recalculation
  2. Make the reduction strategy configurable
  3. Add metrics for monitoring the reduction process

Here are some suggestions:

  1. Add configuration parameters to the pipeline class:
class PreprocessSqlData(BasicPipeline):
    def __init__(
        self,
        llm_provider: LLMProvider,
        max_token_count: int = 100_000,
        reduction_strategy: str = "progressive",
        reduction_step: int = 50,
        **kwargs,
    ):
        # Hypothetical body: keep the settings where preprocess() can read them
        self._max_token_count = max_token_count
        self._reduction_strategy = reduction_strategy
        self._reduction_step = reduction_step
  2. Consider implementing a token count cache:
from functools import lru_cache

import tiktoken

@lru_cache(maxsize=1000)
def calculate_token_count(data_str: str, encoding: tiktoken.Encoding) -> int:
    # Cache keys on (data_str, encoding); only repeated identical payloads hit
    return len(encoding.encode(data_str))
  3. Add metrics for monitoring (a usage sketch follows the list):
from prometheus_client import Counter, Histogram

data_reduction_iterations = Counter(
    'sql_data_reduction_iterations_total',
    'Number of iterations needed to reduce data size'
)
token_reduction_time = Histogram(
    'sql_data_reduction_seconds',
    'Time spent reducing data size'
)
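A hypothetical wiring of these metrics into the reduction loop (it assumes the Counter/Histogram defined above plus reduce_data_size from this PR, and is not part of the actual change):

def reduce_with_metrics(sql_data: dict, encoding) -> dict:
    with token_reduction_time.time():  # observes total time spent reducing
        token_count = len(encoding.encode(str(sql_data)))
        # Extra guard on non-empty data so an irreducible payload cannot spin
        while token_count > 100_000 and sql_data.get("data"):
            data_reduction_iterations.inc()  # one increment per trim pass
            sql_data["data"] = reduce_data_size(sql_data["data"])
            token_count = len(encoding.encode(str(sql_data)))
    return sql_data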

Would you like me to create an issue to track these architectural improvements?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bdaa1b1 and d6effa3.

📒 Files selected for processing (1)
  • wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)

paopa force-pushed the fix/limit-row-for-context-window branch from d6effa3 to 215a4f6 (February 24, 2025 08:04)
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (2)

22-46: Consider returning a copy to prevent unintended modifications.

While the function has been improved with input validation, documentation, and edge case handling, it could be further enhanced by:

  1. Returning a copy to prevent unintended modifications of the original data
  2. Being more specific in the docstring about the return type being a shallow copy

Apply this diff to implement these improvements:

     def reduce_data_size(data: list, reduction_step: int = 50) -> list:
         """Reduce the size of data by removing elements from the end.
 
         Args:
             data: The input list to reduce
             reduction_step: Number of elements to remove (must be positive)
 
         Returns:
-            list: A list with reduced size
+            list: A shallow copy of the input list with reduced size
 
         Raises:
             ValueError: If reduction_step is not positive
         """
         if reduction_step <= 0:
             raise ValueError("reduction_step must be positive")
 
         elements_to_keep = max(0, len(data) - reduction_step)
-        returned_data = data[:elements_to_keep]
+        returned_data = data[:elements_to_keep].copy()
 
         logger.info(
             f"Reducing data size by {reduction_step} rows. "
             f"Original size: {len(data)}, New size: {len(returned_data)}"
         )
 
         return returned_data
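For reference, the clamping behavior of the helper as written above (a quick check, assuming the function is in scope):

rows = [{"id": i} for i in range(120)]
print(len(reduce_data_size(rows)))       # 70: default step removes 50 rows
print(len(reduce_data_size(rows, 200)))  # 0: max(0, ...) clamps at empty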

52-66: ❓ Verification inconclusive

Enhance the data reduction loop implementation.

While the loop has been improved with infinite loop protection, consider these additional enhancements:

  1. Define the token limit as a constant
  2. Add more detailed logging including token reduction progress
  3. Consider caching or optimizing token count calculation

Apply this diff to implement these improvements:

+    MAX_TOKEN_COUNT = 100_000
+
-    while _token_count > 100_000:
+    while _token_count > MAX_TOKEN_COUNT:
         if iteration > 1000:
-            """
-            Avoid infinite loop
-            If the token count is still too high after 1000 iterations, break
-            """
+            logger.error(
+                f"Failed to reduce token count below {MAX_TOKEN_COUNT} "
+                f"after 1000 iterations. Current count: {_token_count}"
+            )
             break
 
         iteration += 1
 
         data = sql_data.get("data", [])
+        previous_token_count = _token_count
         sql_data["data"] = reduce_data_size(data)
         num_rows_used_in_llm = len(sql_data.get("data", []))
         _token_count = len(encoding.encode(str(sql_data)))
-        logger.info(f"Token count: {_token_count}")
+        token_reduction = previous_token_count - _token_count
+        logger.info(
+            f"Iteration {iteration}: Token count reduced by {token_reduction} "
+            f"({previous_token_count} -> {_token_count})"
+        )

Run this script to check if the token count reduction is effective:


🏁 Script executed:

#!/bin/bash
# Description: Check if token count reduction is effective
# Look for patterns where token count doesn't decrease significantly

rg "Token count" | grep -E "Token count.*->.*" | \
awk -F'[()]' '{
    split($2, counts, "->")
    before = counts[1]
    after = counts[2]
    reduction = before - after
    if (reduction < 1000) {
        print "Small reduction detected:", $0
    }
}'

Length of output: 253


Enhance the data reduction loop in wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py.

The improvements for infinite loop protection and logging are a good step forward. As further enhancements, please:

  • Define the token limit as a constant (e.g., MAX_TOKEN_COUNT) to improve readability and maintainability.
  • Enrich the logging to clearly show the progress of token reduction between iterations.
  • Consider ways to optimize or cache the token count calculation if it becomes a performance concern.

The shell script provided to verify token count reduction did not produce output, so I recommend manual verification to ensure that each iteration is indeed making significant progress in reducing the token count.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d6effa3 and 215a4f6.

📒 Files selected for processing (1)
  • wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)

paopa added the module/ai-service and ci/ai-service labels on Feb 24, 2025
paopa requested a review from wwwy3y3 on February 24, 2025 09:25
paopa force-pushed the fix/limit-row-for-context-window branch from 215a4f6 to 9825d2f (February 24, 2025 09:40)
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (2)

22-46: Consider additional improvements to the helper function.

While the function has been improved with input validation and documentation, consider these additional enhancements:

  1. Return a copy of the data to prevent unintended modifications
  2. Add type hints (list[Any], via from typing import Any) for better IDE support and static type checking

Apply this diff to implement these improvements:

-    def reduce_data_size(data: list, reduction_step: int = 50) -> list:
+    def reduce_data_size(data: list[Any], reduction_step: int = 50) -> list[Any]:
         """Reduce the size of data by removing elements from the end.
         
         Args:
             data: The input list to reduce
             reduction_step: Number of elements to remove (must be positive)
         
         Returns:
-            list: A list with reduced size
+            list[Any]: A new list with reduced size, preserving the original
         
         Raises:
             ValueError: If reduction_step is not positive
         """
         if reduction_step <= 0:
             raise ValueError("reduction_step must be positive")

         elements_to_keep = max(0, len(data) - reduction_step)
-        returned_data = data[:elements_to_keep]
+        returned_data = data[:elements_to_keep].copy()

         logger.info(
             f"Reducing data size by {reduction_step} rows. "
             f"Original size: {len(data)}, New size: {len(returned_data)}"
         )

         return returned_data

52-66: Enhance the token reduction loop with constants and better logging.

The loop implementation could benefit from these improvements:

  1. Define constants for magic numbers
  2. Enhance logging to show token count reduction
  3. Improve the break condition message

Apply this diff to implement these improvements:

+    MAX_TOKEN_COUNT = 100_000
+    MAX_ITERATIONS = 1000
+    
+    logger.info(f"Initial token count: {_token_count}, rows: {num_rows_used_in_llm}")
-    while _token_count > 100_000:
-        if iteration > 1000:
+    while _token_count > MAX_TOKEN_COUNT:
+        if iteration > MAX_ITERATIONS:
             """
             Avoid infinite loop
-            If the token count is still too high after 1000 iterations, break
+            If the token count is still too high after MAX_ITERATIONS iterations, break
             """
+            logger.error(
+                f"Failed to reduce token count below {MAX_TOKEN_COUNT} "
+                f"after {MAX_ITERATIONS} iterations. "
+                f"Final token count: {_token_count}, rows: {num_rows_used_in_llm}"
+            )
             break

         iteration += 1

         data = sql_data.get("data", [])
+        previous_token_count = _token_count
         sql_data["data"] = reduce_data_size(data)
         num_rows_used_in_llm = len(sql_data.get("data", []))
         _token_count = len(encoding.encode(str(sql_data)))
-        logger.info(f"Token count: {_token_count}")
+        token_reduction = previous_token_count - _token_count
+        logger.info(
+            f"Iteration {iteration}: Token count reduced by {token_reduction} "
+            f"({previous_token_count} -> {_token_count})"
+        )

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 215a4f6 and 9825d2f.

📒 Files selected for processing (1)
  • wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pytest
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (1)

18-72: LGTM! The integration is well-structured.

The overall integration of the progressive data reduction logic within the preprocess function is clean and maintains backward compatibility:

  • Helper function is properly scoped
  • Token count tracking is maintained
  • Function signature and return values remain unchanged

paopa merged commit 1084eba into main on Feb 25, 2025 (10 checks passed)
paopa deleted the fix/limit-row-for-context-window branch on February 25, 2025 11:52