Update job chat prompt and add prompt evaluation for issue 97 #118
Short Description
This PR primarily adjusts the system prompt in the job_chat service to be less strict about external information. It also adds a notebook to evaluate both the current online prompt and the new candidate prompt.
Fixes #97; partially addresses #108.
Implementation Details
To address issue #97, the prompt was edited to allow the assistant to provide information about external platforms and services.
To address issue #108, this PR also adds a notebook that generates a prompt test dataset targeting the issue in question. The notebook provides an initial case study of how we can track and evaluate the effects of changes to the LLM pipeline more thoroughly, rather than relying only on qualitative evaluation and spot-checking.
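The dataset-generation step could be sketched roughly as below. This is a hypothetical illustration, not the notebook's actual code: the function names, the prompt texts, and the `answer_fn` stub are all assumptions made for the example.

```python
def build_eval_rows(questions, prompts, answer_fn):
    """Pair each test question with the answer produced under each prompt version.

    answer_fn(prompt, question) -> answer; a stub standing in for the real LLM call.
    """
    rows = []
    for question in questions:
        for version, prompt in prompts.items():
            rows.append({
                "question": question,
                "prompt_version": version,
                "answer": answer_fn(prompt, question),
            })
    return rows


# Example usage with a stubbed "LLM" so the sketch runs offline.
rows = build_eval_rows(
    questions=["Can the job connect to Google Sheets?"],
    prompts={"v1": "strict system prompt", "v2": "relaxed system prompt"},
    answer_fn=lambda prompt, q: f"[{prompt}] answer to: {q}",
)
```

In the real notebook the stub would be replaced by a call to the deployed LLM pipeline, and the resulting rows would be graded to produce the result column described below.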
The small, fully generated evaluation dataset is also added in this PR; it can be used as part of a routine test suite and expanded as we target different issue areas. The dataset records the LLM outputs for the same set of questions under the online v1 prompt and the candidate v2 prompt. The generated result column indicates whether each answer successfully addressed the question, and can be used to compute a success score. On the external-information issue, the dataset shows the new prompt improving the success rate from 20% to 60%.
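Computing the success score from that dataset amounts to a per-prompt-version mean over the graded results. A minimal sketch, assuming a hypothetical layout with one row per (question, prompt version) and a boolean `success` column derived from the result grading:

```python
import pandas as pd

# Toy stand-in for the evaluation dataset added in this PR; the column
# names and values here are assumptions for illustration.
df = pd.DataFrame({
    "question":       ["q1", "q1", "q2", "q2"],
    "prompt_version": ["v1", "v2", "v1", "v2"],
    "success":        [False, True, False, True],
})

# Success rate per prompt version (mean of a boolean column is the
# fraction of successes).
success_rate = df.groupby("prompt_version")["success"].mean()
print(success_rate)
```

Tracking this one number per prompt version is what lets a prompt change be reported as, e.g., "20% to 60%" rather than a purely qualitative judgment.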
AI Usage
Please disclose how you've used AI in this work (it's cool, we just want to know!):
You can read more details in our Responsible AI Policy