feat(wren-ai-service): Add LLM-based evaluation metrics for SQL generation #1303
Conversation
Caution: Review failed. The pull request is closed.

Walkthrough

This update enhances the evaluation process by improving documentation and readability within the evaluation module.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Evaluator
    participant Formatter
    participant MetricsInitiator
    participant Pipeline
    participant Judge
    participant LLMProvider
    User->>Evaluator: Call eval(meta, predictions)
    Evaluator->>Formatter: Call formatter(prediction, meta)
    Formatter-->>Evaluator: Return formatted data (incl. "reasoning")
    Evaluator->>MetricsInitiator: Call metrics_initiator(pipeline, dataset, pipe_components)
    MetricsInitiator->>Pipeline: Call metrics(engine_info, enable_semantics, component)
    Pipeline->>Judge: Invoke judge methods (QuestionToReasoning, ReasoningToSql, SqlSemantics)
    Judge->>LLMProvider: Call a_measure (async evaluation)
    LLMProvider-->>Judge: Return evaluation result
    Judge-->>Pipeline: Return formatted EvalResult
    Pipeline-->>MetricsInitiator: Return aggregated metrics
    MetricsInitiator-->>Evaluator: Return metrics report
```
Force-pushed from 79f33d7 to 8977e2a
Actionable comments posted: 0
🧹 Nitpick comments (5)
wren-ai-service/eval/metrics/llm/__init__.py (4)
`1-29`: Consider renaming the `format` function to avoid overshadowing built-in behavior.

Although Python does not prohibit using built-in names, it can cause confusion. A more descriptive name like `parse_eval_response` might improve clarity.
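As a rough sketch, the rename might look like the following; the reply shape (a JSON string with `score` and `reason` keys) is an assumption for illustration, not the module's actual contract:

```python
import json
from typing import Any


def parse_eval_response(raw_reply: str) -> dict[str, Any]:
    """Parse a raw LLM reply (assumed to be JSON text) into a plain dict.

    Renamed from `format` so the built-in `format` is no longer shadowed
    inside this module.
    """
    payload = json.loads(raw_reply)
    return {"score": float(payload["score"]), "reason": payload["reason"]}
```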
`32-78`: Reduce repeated code by introducing a shared base class for all LLM-based judges.

All three judges define similar initialization and measurement behavior. Consider creating a base judge class (e.g., `LLMBaseJudge`) to consolidate the shared logic (e.g., setting up prompts, executing the model, validating responses) and reduce duplication.

```diff
+class LLMBaseJudge(BaseMetric):
+    def __init__(self, llm_provider: LLMProvider, system_prompt: str, template: str, **_):
+        self.threshold = 0
+        self.score = 0
+        self.llm_provider = llm_provider
+        self.llm = llm_provider.get_generator(
+            system_prompt=system_prompt,
+            generation_kwargs=_MODEL_KWARGS,
+        )
+        self.prompt_builder = PromptBuilder(template=template)
+
+    def measure(self, test_case: LLMTestCase):
+        return asyncio.run(self.a_measure(test_case))
+
+    # ...
```
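To illustrate the payoff, a concrete judge built on such a base class could shrink to little more than its prompts. The prompt text and template below are placeholders for illustration, not the PR's actual prompts:

```python
class QuestionCoherenceJudge(LLMBaseJudge):
    _system_prompt = """
    You are an expert evaluator. Judge whether the question is coherent and
    answerable, and reply with a score in the range 0.0~1.0 plus a reason.
    """
    _template = "Question: {{ question }}\nReasoning: {{ reasoning }}"

    def __init__(self, llm_provider: LLMProvider, **kwargs):
        # Prompt wiring and the measure()/a_measure() plumbing live in
        # LLMBaseJudge; the subclass only supplies its own prompts.
        super().__init__(
            llm_provider,
            system_prompt=self._system_prompt,
            template=self._template,
            **kwargs,
        )
```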
`80-126`: Allow configurable thresholds for success criteria.

Currently, `self.threshold = 0` means any positive score qualifies as success. Making this an adjustable parameter could improve flexibility and accuracy for different use cases.
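A minimal, standalone sketch of the idea (the keyword name `threshold` and the 0.7 default are assumptions, and the class here is illustrative rather than one of the actual judges):

```python
class ConfigurableThresholdJudge:
    """Illustrative only: the success threshold becomes a constructor argument."""

    def __init__(self, threshold: float = 0.7):
        # Hardcoding 0 makes any positive score a pass; now the caller decides.
        self.threshold = threshold
        self.score = 0.0

    def is_successful(self) -> bool:
        return self.score >= self.threshold


# Hypothetical usage: a stricter judge for release-gating evaluation runs.
strict_judge = ConfigurableThresholdJudge(threshold=0.8)
```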
`128-174`: Consider normalizing SQL before semantic comparison.

SQL syntax can vary by whitespace, case, or vendor-specific features. Using a standard SQL parser and normalizing the queries before comparison could improve the consistency and reliability of semantic checks.
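A hedged sketch of such normalization using `sqlparse` (the library choice is an assumption; `sqlglot` would work just as well):

```python
import sqlparse


def normalize_sql(sql: str) -> str:
    """Normalize whitespace, keyword/identifier case, and comments so that
    two semantically similar queries compare more consistently."""
    return sqlparse.format(
        sql,
        keyword_case="upper",
        identifier_case="lower",
        strip_comments=True,
        reindent=True,
    ).strip()


# Both the actual and the expected SQL would be normalized before judging.
print(normalize_sql("select  id from Users -- lookup"))
print(normalize_sql("SELECT id\nFROM users"))
```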
wren-ai-service/eval/evaluation.py (1)
`100-103`: Graceful handling of ignored errors in `evaluate`.

Passing `ignore_errors=True` helps the evaluation continue despite potential exceptions, but any repeated or systematic errors could be silently missed. Consider extra logging or warnings to debug recurring issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- wren-ai-service/eval/evaluation.py (4 hunks)
- wren-ai-service/eval/metrics/__init__.py (2 hunks)
- wren-ai-service/eval/metrics/llm/__init__.py (1 hunks)
- wren-ai-service/eval/pipelines.py (7 hunks)
- wren-ai-service/tools/config/config.example.yaml (1 hunks)
- wren-ai-service/tools/config/config.full.yaml (1 hunks)
🔇 Additional comments (9)
wren-ai-service/eval/metrics/__init__.py (1)
`7-11`: Imports and exports look good.

Adding `QuestionCoherenceJudge`, `ReasoningValidityJudge`, and `SqlSemanticsJudge` to both the import statements and the `__all__` list appears consistent and ensures these new metrics are publicly accessible as intended.

Also applies to: 25-27
wren-ai-service/eval/evaluation.py (3)
`22-37`: Thank you for adding detailed docstrings to `formatter`.

This improves maintainability and clarifies the function’s responsibilities and return values. The docstring is well-structured and easy to follow.

`51-51`: Verify whether the `reasoning` field contains sensitive or private information.

Before logging or storing the `reasoning` text, ensure it does not expose personally identifiable information (PII) that could be traced back to users.
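A hedged sketch of a pre-logging redaction step; the two patterns below are illustrative only and nowhere near an exhaustive PII check:

```python
import re

# Illustrative patterns only; production PII detection needs a dedicated tool.
_PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),            # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),    # US-style phone numbers
]


def redact_pii(text: str) -> str:
    """Mask obvious PII before the reasoning text is logged or stored."""
    for pattern in _PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```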
`173-175`: New parameter usage in `metrics_initiator` aligns well with the updated pipeline flow.

Passing `dataset`, `pipe_components`, and `semantics` ensures the new LLM-based metrics integrate seamlessly. This refactor helps keep the code consistent with the new features.

wren-ai-service/eval/pipelines.py (3)
`28-30`: LGTM! New LLM-based judges added for comprehensive evaluation.

The addition of `QuestionCoherenceJudge`, `ReasoningValidityJudge`, and `SqlSemanticsJudge` aligns well with the PR objective of enhancing SQL generation assessment.
`464-471`: LGTM! Improved metrics initialization with dataset and components.

The `metrics_initiator` function now accepts `dataset` and `pipe_components` parameters, providing better context for metrics initialization. The engine configuration is correctly derived from the dataset's MDL and components.
`295-299`: ❓ Verification inconclusive
Verify the initialization of LLM-based judges.
The new judges are initialized with the evaluation component. Ensure that the component configuration contains all required parameters for these judges.
Run this script to check the judge initialization parameters:
Also applies to: 311-313
🏁 Script executed:
```bash
#!/bin/bash
# Description: Check for judge initialization parameters in the codebase

ast-grep --pattern 'class $_(PipelineComponent):
  def __init__(self, $$$):
    $$$'
```

Length of output: 86
Action Required: Manually Verify LLM-based Judge Initialization
The automated check using the ast-grep pattern did not return any matches, which makes it inconclusive whether the required judge initialization parameters are correctly provided. Please manually verify that the evaluation component’s initialization (specifically in the sections around lines 295–299 and 311–313 in `wren-ai-service/eval/pipelines.py`) properly incorporates and configures all required parameters for LLM-based judges.

wren-ai-service/tools/config/config.full.yaml (1)
`148-149`: LGTM! Evaluation pipeline properly configured.

The evaluation pipeline is correctly configured to use the same LLM model (gpt-4o-mini-2024-07-18) as other components, ensuring consistency in the evaluation process.
wren-ai-service/tools/config/config.example.yaml (1)
`148-149`: LGTM! Configuration consistency maintained.

The evaluation pipeline configuration is consistently implemented across both example and full configuration files.
Force-pushed from 8977e2a to d4e7a29
Actionable comments posted: 0
🧹 Nitpick comments (3)
wren-ai-service/eval/metrics/llm/__init__.py (3)
`32-78`: Enhance the system prompt and make the threshold configurable.

Two suggestions for improvement:
- The system prompt could be more specific about what constitutes "making sense" in the context of SQL questions.
- The threshold should be configurable rather than hardcoded to 0.
Consider these changes:
```diff
 _system_prompt = """
-    You are an expert evaluator. Your task is to analyze the reasoning provided for a given question and determine if it makes sense.
+    You are an expert SQL evaluator. Your task is to analyze if the question is well-formed, unambiguous, and contains all necessary context for SQL query generation.
     Provide a score in the range 0.0~1.0 and a detailed explanation for your evaluation.
 """

 def __init__(self, llm_provider: LLMProvider, **_):
-    self.threshold = 0
+    self.threshold = _.get('threshold', 0.7)  # Default to 0.7 for stricter evaluation
```
`80-126`: Enhance the system prompt with specific SQL reasoning criteria.

The system prompt should outline specific criteria for evaluating SQL reasoning validity.
Consider this enhancement:
```diff
 _system_prompt = """
-    You are an expert evaluator. Your task is to analyze the reasoning provided for a given SQL query and determine if it makes sense.
+    You are an expert SQL evaluator. Your task is to analyze if the reasoning:
+    1. Correctly identifies the required tables and their relationships
+    2. Properly explains the filtering conditions and their purpose
+    3. Justifies any joins, aggregations, or complex operations
+    4. Aligns with the actual SQL output
     Provide a score in the range 0.0~1.0 and a detailed explanation for your evaluation.
 """

 def __init__(self, llm_provider: LLMProvider, **_):
-    self.threshold = 0
+    self.threshold = _.get('threshold', 0.7)  # Default to 0.7 for stricter evaluation
```
`128-174`: Enhance the system prompt with specific SQL semantic comparison criteria.

The system prompt should outline specific criteria for evaluating SQL semantic equivalence.
Consider this enhancement:
```diff
 _system_prompt = """
-    You are an expert evaluator. Your task is to analyze the actual SQL query and the expected SQL query and determine if they are semantically equivalent.
+    You are an expert SQL evaluator. Your task is to analyze if the queries are semantically equivalent by checking:
+    1. Both queries return the same columns (or equivalent derived columns)
+    2. Both queries apply equivalent filtering conditions
+    3. Both queries join the same tables in a logically equivalent way
+    4. Both queries handle NULL values similarly
+    5. Both queries produce the same results despite potential syntactic differences
     Provide a score in the range 0.0~1.0 and a detailed explanation for your evaluation.
 """

 def __init__(self, llm_provider: LLMProvider, **_):
-    self.threshold = 0
+    self.threshold = _.get('threshold', 0.7)  # Default to 0.7 for stricter evaluation
```
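If the three diffs above land, callers could then tune each judge separately; a hypothetical construction (the `threshold` keyword follows the suggestions above) might read:

```python
# Hypothetical wiring; the real construction happens inside the eval pipelines.
judges = [
    QuestionCoherenceJudge(llm_provider, threshold=0.7),
    ReasoningValidityJudge(llm_provider, threshold=0.7),
    SqlSemanticsJudge(llm_provider, threshold=0.8),  # stricter for SQL equivalence
]
```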
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- wren-ai-service/eval/evaluation.py (4 hunks)
- wren-ai-service/eval/metrics/__init__.py (2 hunks)
- wren-ai-service/eval/metrics/llm/__init__.py (1 hunks)
- wren-ai-service/eval/pipelines.py (7 hunks)
- wren-ai-service/tools/config/config.example.yaml (1 hunks)
- wren-ai-service/tools/config/config.full.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- wren-ai-service/tools/config/config.example.yaml
- wren-ai-service/tools/config/config.full.yaml
- wren-ai-service/eval/metrics/__init__.py
- wren-ai-service/eval/evaluation.py
🔇 Additional comments (3)
wren-ai-service/eval/metrics/llm/__init__.py (1)
`11-14`: LGTM! Well-structured evaluation result model.

The Pydantic model provides a clean interface for structured evaluation results with appropriate field types.
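For readers without the diff open, a model in this spirit might look roughly like the following; the class and field names are inferred from the review, not copied from the PR:

```python
from pydantic import BaseModel


class EvalResult(BaseModel):
    # Assumed fields: a numeric score plus a textual explanation.
    score: float
    reason: str
```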
wren-ai-service/eval/pipelines.py (2)
`294-316`: LGTM! Well-integrated evaluation metrics.

The new LLM-based judges are properly integrated into both pipeline classes with appropriate component configuration.
Also applies to: 411-436
`464-483`: LGTM! Clean refactor of metrics initialization.

The metrics initialization is properly updated to handle the new component-based configuration while maintaining backward compatibility.
Force-pushed from b4712cd to b24d278
lgtm
This PR introduces new LLM-based evaluation metrics to enhance our SQL generation assessment capabilities.
Key changes:
Summary by CodeRabbit