
fix(wren-ai-service): Optimize SQL Data Preprocessing with Progressive Data Reduction #1331

Merged: 3 commits into main from fix/limit-row-for-context-window, Feb 25, 2025

Conversation

paopa (Member) commented on Feb 24, 2025

This PR improves the SQL data preprocessing pipeline by implementing a progressive data reduction approach for handling large datasets.

Key Updates:

  • Introduced a new reduce_data_size helper function that gradually reduces data size in steps
  • Replaced static data truncation with an iterative reduction process
  • Added logging for better visibility into the data reduction process

Technical Details:

  • Data is now reduced in configurable steps (default: 50 rows) until the token count is at or below 100,000 (a sketch of the helper follows this list)
  • Maintains original column structure and dtypes while only reducing row count
  • Added logging statements to track token count and data size during reduction
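For reference, a condensed sketch of the helper, mirroring the implementation quoted in the review diffs below (it assumes a module-level logger is in scope; token counting stays in the caller):

def reduce_data_size(data: list, reduction_step: int = 50) -> list:
    """Reduce the size of data by removing elements from the end."""
    if reduction_step <= 0:
        raise ValueError("reduction_step must be positive")
    # Never go negative: an oversized step just yields an empty list
    elements_to_keep = max(0, len(data) - reduction_step)
    returned_data = data[:elements_to_keep]
    logger.info(
        f"Reducing data size by {reduction_step} rows. "
        f"Original size: {len(data)}, New size: {len(returned_data)}"
    )
    return returned_data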

Benefits:

  • More controlled approach to data reduction
  • Better visibility into the reduction process
  • More flexible than the previous hard-coded 250-row limit

Testing:

Please verify the following (a test sketch follows the checklist):

  • Large datasets are properly reduced below the 100,000 token threshold
  • Data structure integrity is maintained
  • Logging provides adequate visibility into the reduction process
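A minimal pytest sketch for the trimming and validation behavior (a hypothetical test module; it assumes reduce_data_size is importable, e.g. after lifting it out of preprocess):

import pytest

def test_trims_rows_from_the_end():
    data = [{"id": i} for i in range(100)]
    reduced = reduce_data_size(data, reduction_step=50)
    assert reduced == data[:50]  # remaining rows keep their original structure

def test_rejects_nonpositive_step():
    with pytest.raises(ValueError):
        reduce_data_size([1, 2, 3], reduction_step=0)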

Summary by CodeRabbit

  • New Features
    • Data handling now adapts to input size: large datasets are reduced progressively, so token-heavy operations stay within model limits and run more smoothly.
    • Introduced a method for shrinking large data lists, with input validation and a safeguard that caps the number of reduction iterations.

coderabbitai bot (Contributor) commented on Feb 24, 2025

Walkthrough

The changes introduce a new inner function reduce_data_size within the preprocess function in wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py. The function iteratively shrinks the data list by removing a fixed number of elements from the end until the token count is at or below 100,000. The previous logic that truncated the data to the first 250 entries has been removed, an iteration counter guards against infinite loops, and the updated token count is logged after each reduction. The return statement remains unchanged.
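For orientation, a condensed sketch of that loop, reconstructed from the diff context quoted later in this conversation (encoding, logger, and the surrounding pipeline setup are assumed to be in scope):

iteration = 0
_token_count = len(encoding.encode(str(sql_data)))
while _token_count > 100_000:
    if iteration > 1000:
        break  # safety valve against an infinite loop
    iteration += 1
    data = sql_data.get("data", [])
    sql_data["data"] = reduce_data_size(data)  # trims 50 rows per pass
    num_rows_used_in_llm = len(sql_data.get("data", []))
    _token_count = len(encoding.encode(str(sql_data)))
    logger.info(f"Token count: {_token_count}")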

Changes

File Path: wren-ai-service/.../preprocess_sql_data.py
Change Summary: Added a reduce_data_size function inside preprocess that uses a while loop to remove elements in fixed steps until the token count is ≤ 100,000; the updated token count is logged after each pass.

Sequence Diagram(s)

sequenceDiagram
    participant Pre as preprocess()
    participant RD as reduce_data_size()
    participant Log as Logger

    Pre->>RD: Call reduce_data_size(data, reduction_step)
    loop while token count > 100,000
        RD->>Pre: Remove reduction_step items from data
        Pre->>Log: Log updated token count
    end
    Pre-->>Pre: Return sql_data, num_rows_used_in_llm, tokens

Suggested reviewers

  • cyyeh

Poem

I'm a rabbit in a code field,
Hopping through lines both bold and mild,
Trimming data with each little bite,
Logging tokens through the day and night,
With every reduction, I cheer with delight!
🐇💻

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (1)

16-43: Consider architectural improvements for better flexibility and performance.

The current implementation could benefit from several architectural improvements:

  1. Consider caching token counts to avoid recalculation
  2. Make the reduction strategy configurable
  3. Add metrics for monitoring the reduction process

Here are some suggestions:

  1. Add configuration parameters to the pipeline class:
class PreprocessSqlData(BasicPipeline):
    def __init__(
        self,
        llm_provider: LLMProvider,
        max_token_count: int = 100_000,
        reduction_strategy: str = "progressive",
        reduction_step: int = 50,
        **kwargs,
    ):
        # Hypothetical body: keep the settings where preprocess() can read them
        self._max_token_count = max_token_count
        self._reduction_strategy = reduction_strategy
        self._reduction_step = reduction_step
  2. Consider implementing a token count cache:
from functools import lru_cache

import tiktoken

@lru_cache(maxsize=1000)
def calculate_token_count(data_str: str, encoding: tiktoken.Encoding) -> int:
    # Cache keys on (data_str, encoding); only repeated identical payloads hit
    return len(encoding.encode(data_str))
  3. Add metrics for monitoring (a usage sketch follows the list):
from prometheus_client import Counter, Histogram

data_reduction_iterations = Counter(
    'sql_data_reduction_iterations_total',
    'Number of iterations needed to reduce data size'
)
token_reduction_time = Histogram(
    'sql_data_reduction_seconds',
    'Time spent reducing data size'
)
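A hypothetical wiring of these metrics into the reduction loop (it assumes the Counter/Histogram defined above plus reduce_data_size from this PR, and is not part of the actual change):

def reduce_with_metrics(sql_data: dict, encoding) -> dict:
    with token_reduction_time.time():  # observes total time spent reducing
        token_count = len(encoding.encode(str(sql_data)))
        # Extra guard on non-empty data so an irreducible payload cannot spin
        while token_count > 100_000 and sql_data.get("data"):
            data_reduction_iterations.inc()  # one increment per trim pass
            sql_data["data"] = reduce_data_size(sql_data["data"])
            token_count = len(encoding.encode(str(sql_data)))
    return sql_data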

Would you like me to create an issue to track these architectural improvements?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bdaa1b1 and d6effa3.

📒 Files selected for processing (1)
  • wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)

paopa force-pushed the fix/limit-row-for-context-window branch from d6effa3 to 215a4f6 (February 24, 2025 08:04)
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (2)

22-46: Consider returning a copy to prevent unintended modifications.

While the function has been improved with input validation, documentation, and edge case handling, it could be further enhanced by:

  1. Returning a copy to prevent unintended modifications of the original data
  2. Being more specific in the docstring about the return type being a shallow copy

Apply this diff to implement these improvements:

     def reduce_data_size(data: list, reduction_step: int = 50) -> list:
         """Reduce the size of data by removing elements from the end.
 
         Args:
             data: The input list to reduce
             reduction_step: Number of elements to remove (must be positive)
 
         Returns:
-            list: A list with reduced size
+            list: A shallow copy of the input list with reduced size
 
         Raises:
             ValueError: If reduction_step is not positive
         """
         if reduction_step <= 0:
             raise ValueError("reduction_step must be positive")
 
         elements_to_keep = max(0, len(data) - reduction_step)
-        returned_data = data[:elements_to_keep]
+        returned_data = data[:elements_to_keep].copy()
 
         logger.info(
             f"Reducing data size by {reduction_step} rows. "
             f"Original size: {len(data)}, New size: {len(returned_data)}"
         )
 
         return returned_data
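For reference, the clamping behavior of the helper as written above (a quick check, assuming the function is in scope):

rows = [{"id": i} for i in range(120)]
print(len(reduce_data_size(rows)))       # 70: default step removes 50 rows
print(len(reduce_data_size(rows, 200)))  # 0: max(0, ...) clamps at empty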

52-66: ❓ Verification inconclusive

Enhance the data reduction loop implementation.

While the loop has been improved with infinite loop protection, consider these additional enhancements:

  1. Define the token limit as a constant
  2. Add more detailed logging including token reduction progress
  3. Consider caching or optimizing token count calculation

Apply this diff to implement these improvements:

+    MAX_TOKEN_COUNT = 100_000
+
-    while _token_count > 100_000:
+    while _token_count > MAX_TOKEN_COUNT:
         if iteration > 1000:
-            """
-            Avoid infinite loop
-            If the token count is still too high after 1000 iterations, break
-            """
+            logger.error(
+                f"Failed to reduce token count below {MAX_TOKEN_COUNT} "
+                f"after 1000 iterations. Current count: {_token_count}"
+            )
             break
 
         iteration += 1
 
         data = sql_data.get("data", [])
+        previous_token_count = _token_count
         sql_data["data"] = reduce_data_size(data)
         num_rows_used_in_llm = len(sql_data.get("data", []))
         _token_count = len(encoding.encode(str(sql_data)))
-        logger.info(f"Token count: {_token_count}")
+        token_reduction = previous_token_count - _token_count
+        logger.info(
+            f"Iteration {iteration}: Token count reduced by {token_reduction} "
+            f"({previous_token_count} -> {_token_count})"
+        )

Run this script to check if the token count reduction is effective:


🏁 Script executed:

#!/bin/bash
# Description: Check if token count reduction is effective
# Look for patterns where token count doesn't decrease significantly

rg "Token count" | grep -E "Token count.*->.*" | \
awk -F'[()]' '{
    split($2, counts, "->")
    before = counts[1]
    after = counts[2]
    reduction = before - after
    if (reduction < 1000) {
        print "Small reduction detected:", $0
    }
}'

Length of output: 253


Enhance the data reduction loop in wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py.

The improvements for infinite loop protection and logging are a good step forward. As further enhancements, please:

  • Define the token limit as a constant (e.g., MAX_TOKEN_COUNT) to improve readability and maintainability.
  • Enrich the logging to clearly show the progress of token reduction between iterations.
  • Consider ways to optimize or cache the token count calculation if it becomes a performance concern.

The shell script provided to verify token count reduction did not produce output, so I recommend manual verification to ensure that each iteration is indeed making significant progress in reducing the token count.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d6effa3 and 215a4f6.

📒 Files selected for processing (1)
  • wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)

paopa added the module/ai-service and ci/ai-service labels on Feb 24, 2025
paopa requested a review from wwwy3y3 on February 24, 2025 09:25
paopa force-pushed the fix/limit-row-for-context-window branch from 215a4f6 to 9825d2f (February 24, 2025 09:40)
coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (2)

22-46: Consider additional improvements to the helper function.

While the function has been improved with input validation and documentation, consider these additional enhancements:

  1. Return a copy of the data to prevent unintended modifications
  2. Add type hints (list[Any], via from typing import Any) for better IDE support and static type checking

Apply this diff to implement these improvements:

-    def reduce_data_size(data: list, reduction_step: int = 50) -> list:
+    def reduce_data_size(data: list[Any], reduction_step: int = 50) -> list[Any]:
         """Reduce the size of data by removing elements from the end.
         
         Args:
             data: The input list to reduce
             reduction_step: Number of elements to remove (must be positive)
         
         Returns:
-            list: A list with reduced size
+            list[Any]: A new list with reduced size, preserving the original
         
         Raises:
             ValueError: If reduction_step is not positive
         """
         if reduction_step <= 0:
             raise ValueError("reduction_step must be positive")

         elements_to_keep = max(0, len(data) - reduction_step)
-        returned_data = data[:elements_to_keep]
+        returned_data = data[:elements_to_keep].copy()

         logger.info(
             f"Reducing data size by {reduction_step} rows. "
             f"Original size: {len(data)}, New size: {len(returned_data)}"
         )

         return returned_data

52-66: Enhance the token reduction loop with constants and better logging.

The loop implementation could benefit from these improvements:

  1. Define constants for magic numbers
  2. Enhance logging to show token count reduction
  3. Improve the break condition message

Apply this diff to implement these improvements:

+    MAX_TOKEN_COUNT = 100_000
+    MAX_ITERATIONS = 1000
+    
+    logger.info(f"Initial token count: {_token_count}, rows: {num_rows_used_in_llm}")
-    while _token_count > 100_000:
-        if iteration > 1000:
+    while _token_count > MAX_TOKEN_COUNT:
+        if iteration > MAX_ITERATIONS:
             """
             Avoid infinite loop
-            If the token count is still too high after 1000 iterations, break
+            If the token count is still too high after MAX_ITERATIONS iterations, break
             """
+            logger.error(
+                f"Failed to reduce token count below {MAX_TOKEN_COUNT} "
+                f"after {MAX_ITERATIONS} iterations. "
+                f"Final token count: {_token_count}, rows: {num_rows_used_in_llm}"
+            )
             break

         iteration += 1

         data = sql_data.get("data", [])
+        previous_token_count = _token_count
         sql_data["data"] = reduce_data_size(data)
         num_rows_used_in_llm = len(sql_data.get("data", []))
         _token_count = len(encoding.encode(str(sql_data)))
-        logger.info(f"Token count: {_token_count}")
+        token_reduction = previous_token_count - _token_count
+        logger.info(
+            f"Iteration {iteration}: Token count reduced by {token_reduction} "
+            f"({previous_token_count} -> {_token_count})"
+        )

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 215a4f6 and 9825d2f.

📒 Files selected for processing (1)
  • wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pytest
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
wren-ai-service/src/pipelines/retrieval/preprocess_sql_data.py (1)

18-72: LGTM! The integration is well-structured.

The overall integration of the progressive data reduction logic within the preprocess function is clean and maintains backward compatibility:

  • Helper function is properly scoped
  • Token count tracking is maintained
  • Function signature and return values remain unchanged

paopa merged commit 1084eba into main on Feb 25, 2025 (10 checks passed)
paopa deleted the fix/limit-row-for-context-window branch on February 25, 2025 11:52