Support the input of a large DataFrame when generating follow-up questions #761

Open · wants to merge 1 commit into base: main
Conversation


@Flyfoxs commented Jan 29, 2025

…tions.

Submitting a large DataFrame to the LLM can fail with this error:

  File "/workspace/venv/lib/python3.10/site-packages/vanna/base/base.py", line 317, in generate_followup_questions
    llm_response = self.submit_prompt(message_log, **kwargs)
  File "/workspace/venv/lib/python3.10/site-packages/vanna/bedrock/bedrock_converse.py", line 85, in submit_prompt
    raise Exception(f"A Bedrock client error occurred: {message}")
Exception: A Bedrock client error occurred: Input is too long for requested model.

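The change under review addresses this by truncating the DataFrame to its first 25 rows before it is rendered into the prompt. A minimal sketch of that shape (the helper name below is illustrative; the PR itself only changes how the DataFrame is embedded in generate_followup_questions):

    import pandas as pd

    def build_followup_system_message(question: str, sql: str, df: pd.DataFrame) -> str:
        # Only the first 25 rows are rendered, mirroring the PR's df.head(25).to_markdown() call
        return (
            f"You are a helpful data assistant. The user asked the question: '{question}'\n\n"
            f"The SQL query for this question was: {sql}\n\n"
            "The following is a pandas DataFrame with the results of the query:\n"
            f"{df.head(25).to_markdown()}"
        )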

@llamapreview bot left a comment


Auto Pull Request Review from LlamaPReview

1. Overview

1.1 PR Summary

  • Business value and requirements alignment: This PR aims to support the input of large DataFrames when generating follow-up questions, addressing an issue where large DataFrames cause errors due to exceeding the token limit of the LLM.
  • Key components modified: The primary modification is in the generate_followup_questions function within src/vanna/base/base.py.
  • Impact assessment: The change reduces the risk of "Input too long" errors but may affect the relevance of generated follow-up questions due to DataFrame truncation.
  • System dependencies and integration impacts: The modification impacts the interaction with the Bedrock LLM service by reducing the input size, potentially affecting the quality of follow-up questions.

1.2 Architecture Changes

  • System design modifications: The PR modifies the way DataFrames are included in the prompt sent to the LLM, truncating them to the first 25 rows.
  • Component interactions: The change directly affects the generate_followup_questions function and its interaction with the Bedrock LLM service.
  • Integration points: The modification is localized to the generate_followup_questions function, but it has broader implications for the quality of follow-up questions generated.

2. Detailed Technical Analysis

2.1 Code Logic Deep-Dive

Core Logic Changes

  • src/vanna/base/base.py - generate_followup_questions
    • Submitted PR Code:
    {df.head(25).to_markdown()}
  • Analysis:
    • The PR modifies the generate_followup_questions function to truncate the DataFrame df to its first 25 rows before converting it to markdown and including it in the prompt to the LLM.
    • The primary issue is that df.head(25) might not be representative of the entire dataset, especially if the DataFrame is sorted or has a specific ordering. This could lead to the LLM generating follow-up questions that are only relevant to the first 25 rows and miss important patterns or insights present in the rest of the data.
    • For instance, if the DataFrame is sorted by date in ascending order and the user's question pertains to recent trends, head(25) would only show the oldest data, potentially misleading the LLM and resulting in irrelevant follow-up questions.
  • LlamaPReview Suggested Improvements:
    # Option 1: Sample rows instead of head
    {df.sample(min(25, len(df))).to_markdown()}

    # Option 2: Include descriptive statistics along with head
    f"""
    Descriptive statistics of the DataFrame:
    {df.describe().to_markdown()}

    First 25 rows of the DataFrame:
    {df.head(25).to_markdown()}
    """

    # Option 3: Conditionally include full DataFrame for smaller datasets
    dataframe_markdown = df.to_markdown() if len(df) <= 25 else df.head(25).to_markdown()
    f"""
    The following is a pandas DataFrame with the results of the query:
    {dataframe_markdown}
    """
  • Improvement rationale:
    • Option 1 (Sampling): Using df.sample(min(25, len(df))) provides a more randomized and potentially representative subset of the DataFrame, especially if the data has inherent ordering. min(25, len(df)) ensures it handles cases where the DataFrame has fewer than 25 rows gracefully.

      • Technical benefits: Increases the likelihood of capturing diverse data points, potentially leading to more relevant follow-up questions compared to just using head(25).
      • Business value: Better user experience by providing more insightful follow-up questions.
      • Risk assessment: Slightly increases randomness but reduces the risk of biased follow-up questions due to non-representative head.
    • Option 2 (Statistics + Head): Including descriptive statistics (df.describe()) provides the LLM with a high-level summary of the data distribution, even when the full DataFrame is truncated. Combining this with head(25) gives both a statistical overview and a glimpse into the actual data.

      • Technical benefits: Provides richer context to the LLM without significantly increasing token count, potentially improving the quality of generated questions.
      • Business value: Enhanced question relevance and user understanding of the data.
      • Risk assessment: Slightly increases prompt length but provides valuable summary information.
    • Option 3 (Conditional Inclusion): For DataFrames with 25 rows or fewer, the full DataFrame is included, preserving all information. For larger DataFrames, truncation to head(25) is applied as in the original PR. (A combined sketch of Options 1 and 3 follows this list.)

      • Technical benefits: Optimizes for smaller datasets where full context is feasible while still addressing the token limit issue for larger datasets.
      • Business value: Balances between providing full information when possible and mitigating token limit errors for large datasets.
      • Risk assessment: Adds a simple conditional check but maintains the core benefit of handling large DataFrames.
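A self-contained sketch combining Options 1 and 3 (the helper name dataframe_snippet and the fixed random_state are illustrative assumptions, not part of the PR):

    import pandas as pd

    def dataframe_snippet(df: pd.DataFrame, max_rows: int = 25) -> str:
        """Render df for the LLM prompt: include small frames in full, sample larger ones."""
        if len(df) <= max_rows:
            # Option 3: small frames fit within the token budget, so keep everything
            return df.to_markdown()
        # Option 1: a random sample is more likely to be representative than head()
        # random_state keeps the prompt stable across retries; drop it if variety is preferred
        return df.sample(max_rows, random_state=0).to_markdown()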

Core Logic Changes

  • src/vanna/base/base.py - generate_followup_questions
    • Submitted PR Code:
    {df.head(25).to_markdown()}
  • Analysis:
    • The current implementation directly uses to_markdown() on the DataFrame (or its head). While markdown is human-readable, it might not be the most efficient format for LLMs to process tabular data, especially when considering token limits. Markdown representation can be verbose, including formatting characters and potentially redundant information for the LLM's understanding of the data's structure and content.
    • The initial review does not consider alternative data serialization formats that might be more token-efficient and equally or more informative for the LLM.
  • LlamaPReview Suggested Improvements:
    # Option 1: Convert DataFrame to CSV string (more compact than markdown)
    import io
    csv_buffer = io.StringIO()
    df.head(25).to_csv(csv_buffer, index=False)
    dataframe_representation = csv_buffer.getvalue()

    # Option 2: Convert DataFrame to JSON (structured and potentially more parsable)
    dataframe_representation = df.head(25).to_json(orient="records")

    # Option 3:  Provide column names and data types separately, then top rows (hybrid approach)
    column_info = [{"name": col, "type": str(df[col].dtype)} for col in df.columns] # Simplified type representation
    top_rows = df.head(25).to_dict(orient="records")
    dataframe_representation = f"Columns: {column_info}\nTop Rows: {top_rows}"

    f"""
    The following is a pandas DataFrame with the results of the query:
    {dataframe_representation}
    """
  • Improvement rationale:
    • Option 1 (CSV): CSV is generally more compact than markdown for tabular data. It removes markdown formatting overhead and focuses on the raw data values and delimiters. (A quick size-comparison sketch follows this list.)

      • Technical benefits: Reduces token count compared to markdown, potentially allowing for slightly larger data context within the token limit. CSV is still easily parsable by LLMs.
      • Business value: More efficient use of tokens, potentially leading to cost savings and/or allowing for more context to be included in the prompt.
      • Risk assessment: Minimal risk, CSV is a standard format.
    • Option 2 (JSON): JSON provides a structured representation of the data, which can be beneficial for LLMs that are trained to process structured data. orient="records" creates a list of dictionaries, where each dictionary represents a row.

      • Technical benefits: Highly structured format, potentially easier for LLMs to parse and understand the data schema and values.
      • Business value: Potentially improved LLM understanding of the data, leading to better follow-up questions.
      • Risk assessment: Slightly more verbose than CSV in some cases, but the structured nature can be advantageous.
    • Option 3 (Hybrid: Column Info + Top Rows): This approach explicitly provides column names and data types, followed by the top rows as a list of record dictionaries. This gives the LLM explicit schema information, which can be very helpful for understanding the data.

      • Technical benefits: Provides explicit schema information, which can significantly improve LLM's understanding of the data structure and types.
      • Business value: Potentially leads to the most relevant and accurate follow-up questions by providing clear data context.
      • Risk assessment: Slightly more complex to implement but offers potentially the most informative representation within token limits. The type representation is simplified and might need refinement for complex data types in a production setting.
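As a rough check on the compactness claims above, the three representations can be compared directly by character count (character length is only a proxy for token count, which depends on the model's tokenizer):

    import pandas as pd

    # Small illustrative frame; real query results will differ
    df = pd.DataFrame({"region": ["EU", "US", "APAC"], "revenue": [1200, 3400, 980]})
    head = df.head(25)

    representations = {
        "markdown": head.to_markdown(),
        "csv": head.to_csv(index=False),
        "json (records)": head.to_json(orient="records"),
    }
    for name, text in representations.items():
        print(f"{name}: {len(text)} characters")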

Core Logic Changes

  • src/vanna/base/base.py - generate_followup_questions
    • Submitted PR Code:
    llm_response = self.submit_prompt(message_log, **kwargs)
  • Analysis:
    • While the PR addresses the "Input too long" error by truncating the DataFrame, it doesn't explicitly handle the case where even the truncated DataFrame (e.g., head(25)) might still result in a prompt that exceeds the LLM's token limit, especially when combined with the question, SQL query, and system/user messages.
    • The initial review does not mention explicit error handling for situations where the prompt remains too large even after truncation. The code currently relies on the submit_prompt function to raise an exception if the input is still too long, but this is a reactive approach rather than proactive size management.
  • LlamaPReview Suggested Improvements:
    import re
    import pandas as pd

    def generate_followup_questions(self, question: str, sql: str, df: pd.DataFrame, n_questions: int = 3, **kwargs):
        max_dataframe_rows = 25 # Configurable parameter
        dataframe_head = df.head(max_dataframe_rows)
        dataframe_markdown = dataframe_head.to_markdown()

        system_message_content = f"""You are a helpful data assistant. The user asked the question: '{question}'

        The SQL query for this question was: {sql}

        The following is a pandas DataFrame with the results of the query:
        {dataframe_markdown}
        """

        message_log = [
            self.system_message(system_message_content),
            self.user_message(
                f"Generate a list of {n_questions} followup questions that the user might ask about this data. Respond with a list of questions, one per line. Do not answer with any explanations -- just the questions. Remember that there should be an unambiguous SQL query that can be generated from the question. Prefer questions that are answerable outside of the context of this conversation. Prefer questions that are slight modifications of the SQL query that was generated that allow digging deeper into the data. Each question will be turned into a button that the user can click to generate a new SQL query so don't use 'example' type questions. Each question must have a one-to-one correspondence with an instantiated SQL query." +
                self._response_language()
            ),
        ]

        prompt_text = "".join([msg["content"] for msg in message_log]) # Reconstruct prompt for length check
        estimated_prompt_tokens = len(prompt_text.split()) # Simple token estimation - consider using tokenizer for accuracy

        max_model_tokens = 4096 # Example max token limit - should be model-specific and configurable
        token_margin = 200 # Buffer for response and other overhead

        if estimated_prompt_tokens > max_model_tokens - token_margin:
            print(f"Warning: Estimated prompt tokens ({estimated_prompt_tokens}) exceeds recommended limit. Further truncating DataFrame.") # Log warning
            dataframe_markdown = "DataFrame data too large to include in full. Please refer to the query results directly." # Or more aggressive truncation/summarization
            system_message_content = f"""You are a helpful data assistant. The user asked the question: '{question}'

            The SQL query for this question was: {sql}

            {dataframe_markdown}
            """
            message_log[0]["content"] = system_message_content # Update message log with truncated content

        llm_response = self.submit_prompt(message_log, **kwargs) # Submit prompt after potential further truncation

        numbers_removed = re.sub(r"^\d+\.\s*", "", llm_response, flags=re.MULTILINE)
        return [q.strip() for q in numbers_removed.strip().split("\n") if q.strip()]
  • Improvement rationale:
    • Proactive Token Estimation and Handling: The improved code includes a basic token estimation (splitting by spaces; for more accurate estimation, a tokenizer library should be used, as sketched after this list). It checks if the estimated token count exceeds a threshold before calling submit_prompt.
    • Further Truncation/Alternative Message: If the prompt is still too long, it further reduces the DataFrame representation to a simple message indicating that the data is too large. More sophisticated strategies could be employed here, such as more aggressive summarization or feature selection, but a simple message is a reasonable starting point for handling extreme cases.
    • Warning Logging: A warning message is printed when further truncation occurs, which can be helpful for debugging and monitoring.
    • Configurable Parameters: max_dataframe_rows, max_model_tokens, and token_margin are introduced as parameters (though currently hardcoded for demonstration). These should ideally be configurable to adapt to different LLMs and use cases.
      • Technical benefits: Prevents runtime errors due to exceeding token limits more proactively. Provides a fallback mechanism when even truncated data is too large.
      • Business value: Improves system robustness and user experience by gracefully handling large inputs and providing informative feedback (warning message).
      • Risk assessment: Reduces the risk of unexpected errors and improves the system's ability to handle a wider range of inputs. The token estimation is simplified and should be replaced with a proper tokenizer for production use for better accuracy. The max_model_tokens should be dynamically retrieved or configured based on the actual LLM being used.
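A sketch of a tokenizer-based estimate, as suggested above (this assumes the tiktoken package and uses its cl100k_base encoding as a rough proxy; the actual tokenizer depends on the Bedrock model in use, so treat the counts as approximate):

    import tiktoken  # assumption: requires `pip install tiktoken`

    _ENCODING = tiktoken.get_encoding("cl100k_base")  # proxy encoding, not the Bedrock model's own

    def estimate_tokens(text: str) -> int:
        """Approximate how many tokens a prompt string will consume."""
        return len(_ENCODING.encode(text))

    # Drop-in replacement for the whitespace estimate in the sketch above:
    # estimated_prompt_tokens = estimate_tokens(prompt_text)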

2.2 Implementation Quality

  • Code organization and structure: The PR is well-organized, focusing on a single function modification.
  • Design patterns usage: The PR does not introduce new design patterns but modifies an existing function to handle large DataFrames.
  • Error handling approach: The PR addresses the "Input too long" error by truncating the DataFrame. However, it does not handle cases where the truncated DataFrame still results in a prompt that exceeds the token limit.
  • Resource management: The PR does not introduce new resource management concerns.

3. Critical Findings

3.1 Potential Issues

  • 🔴 Critical Issues

    • DataFrame Truncation: The current implementation uses df.head(25) to truncate the DataFrame, which may not be representative of the entire dataset, leading to potentially irrelevant follow-up questions.

      • Impact: Reduced relevance and quality of follow-up questions.
      • Recommendation: Consider using sampling or including descriptive statistics to provide a more representative subset of the data.
    • Token Limit Handling: The PR does not handle cases where the truncated DataFrame still results in a prompt that exceeds the token limit.

      • Impact: Potential runtime errors and reduced system robustness.
      • Recommendation: Implement proactive token estimation and further truncation or summarization strategies.
  • 🟡 Warnings

    • Markdown Representation: Using markdown for DataFrame representation can be verbose and inefficient for LLMs.
      • Potential risks: Increased token usage and reduced efficiency.
      • Suggested improvements: Consider using more compact formats like CSV or JSON for DataFrame representation.

3.2 Code Quality Concerns

  • Maintainability aspects: The PR introduces a simple truncation strategy, which may need to be revisited for more complex datasets.
  • Readability issues: The code is readable, but the truncation logic could be better documented.
  • Performance bottlenecks: The PR does not introduce new performance bottlenecks.

4. Security Assessment

  • Authentication/Authorization impacts: No direct impacts.
  • Data handling concerns: No direct concerns.
  • Input validation: The PR does not introduce new input validation concerns.
  • Security best practices: The PR adheres to security best practices.
  • Potential security risks: No direct security risks identified.
  • Mitigation strategies: No additional mitigation strategies required.
  • Security testing requirements: No additional security testing requirements identified.

5. Testing Strategy

5.1 Test Coverage

  • Unit test analysis: The PR should include unit tests to verify the handling of large DataFrames and the truncation logic.
  • Integration test requirements: Integration tests should assess the end-to-end impact on question generation quality.
  • Edge cases coverage: Tests should cover edge cases where the DataFrame is very large or very small.

5.2 Test Recommendations

Suggested Test Cases

  import pandas as pd

  def test_generate_followup_questions_large_df(vn):
      # Test with a large DataFrame to ensure truncation and error handling
      # (`vn` is assumed to be a configured Vanna instance, e.g. provided by a pytest fixture)
      large_df = pd.DataFrame({'column': range(1000)})
      questions = vn.generate_followup_questions(question="Test question", sql="SELECT * FROM table", df=large_df)
      assert len(questions) > 0

  def test_generate_followup_questions_small_df(vn):
      # Test with a small DataFrame to ensure full inclusion
      small_df = pd.DataFrame({'column': range(10)})
      questions = vn.generate_followup_questions(question="Test question", sql="SELECT * FROM table", df=small_df)
      assert len(questions) > 0
  • Coverage improvements: Ensure tests cover a variety of DataFrame sizes and structures (an empty-DataFrame sketch follows below).
  • Performance testing needs: No additional performance testing needs identified.
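An additional edge-case sketch (the expectation that an empty result set should not raise is an assumption about desired behavior, not something the PR asserts):

  def test_generate_followup_questions_empty_df(vn):
      # Empty result sets should still produce a valid (possibly empty) list of questions
      # (`vn` is assumed to be a configured Vanna instance, e.g. provided by a pytest fixture)
      empty_df = pd.DataFrame({'column': []})
      questions = vn.generate_followup_questions(question="Test question", sql="SELECT * FROM table", df=empty_df)
      assert isinstance(questions, list)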

6. Documentation & Maintenance

  • Documentation updates needed: Update the documentation to reflect the changes in DataFrame handling and the potential impacts on follow-up question quality.
  • Long-term maintenance considerations: Monitor the quality of follow-up questions generated with truncated DataFrames and consider alternative strategies if issues arise.
  • Technical debt and monitoring requirements: No significant technical debt introduced, but ongoing monitoring of follow-up question quality is recommended.

7. Deployment & Operations

  • Deployment impact and strategy: The change should be deployed with careful monitoring of follow-up question quality.
  • Key operational considerations: Ensure that the system can handle large DataFrames gracefully and that users are aware of the potential impacts on question quality.

8. Summary & Recommendations

8.1 Key Action Items

  1. Implement sampling or descriptive statistics to provide a more representative subset of the DataFrame.
  2. Add proactive token estimation and further truncation or summarization strategies.
  3. Consider using more compact formats like CSV or JSON for DataFrame representation.
  4. Update documentation to reflect changes in DataFrame handling.

8.2 Future Considerations

  • Technical evolution path: Explore more advanced DataFrame summarization techniques to improve follow-up question quality.
  • Business capability evolution: Continuously monitor and improve the quality of follow-up questions to enhance user experience.
  • System integration impacts: Ensure that any future changes to DataFrame handling are well-documented and tested to maintain system robustness.
