Citator Experiments: Bedrock Claude #5070

Open
rachlllg opened this issue Feb 13, 2025 · 5 comments
  1. Use Pau's dataset + excerpts + prompt to run through Bedrock Claude models to establish a Claude baseline
  2. Tune the prompt to better suit Claude models + consider using tool use to force output in JSON format (a sketch of this follows the model list below)
  3. Analyze the results, identify the best model + prompt, and identify shortfalls

Models:

  • Claude 3.5 Haiku (Cheapest & Fastest)
  • Claude 3.5 Sonnet v2 (Most intelligent)
  • Claude 3 Haiku (Older but finetunable)
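
For context, here is a minimal sketch of what the baseline run could look like using the Bedrock Converse API with a forced tool call to guarantee JSON output. The tool name, schema, prompt text, and model IDs are illustrative assumptions rather than the actual experiment code; the exact model IDs (or inference profiles) available depend on the AWS region.

```python
# Minimal sketch (not the actual experiment code): call a Bedrock Claude model
# via the Converse API and force a JSON verdict through a required tool call.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical tool schema: forcing this tool call makes the model return
# JSON matching the schema instead of free-form text.
TOOL_CONFIG = {
    "tools": [
        {
            "toolSpec": {
                "name": "record_overrule_verdict",
                "description": "Record whether the Acting Case overrules the Target Case.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "overruled": {"type": "boolean"},
                            "reasoning": {"type": "string"},
                        },
                        "required": ["overruled", "reasoning"],
                    }
                },
            }
        }
    ],
    "toolChoice": {"tool": {"name": "record_overrule_verdict"}},  # force the tool call
}


def classify_excerpt(prompt: str, model_id: str = "anthropic.claude-3-5-haiku-20241022-v1:0") -> dict:
    """Send one excerpt+prompt to Claude on Bedrock and return the tool call's JSON input."""
    response = bedrock.converse(
        modelId=model_id,  # assumed ID; confirm the exact model ID/profile in your region
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        toolConfig=TOOL_CONFIG,
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            return block["toolUse"]["input"]  # already a parsed dict, no JSON cleanup needed
    raise ValueError("Model did not return the expected tool call")
```
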
@rachlllg (Author)

Summarized below are the experiment results.

I went through quite a number of prompt iterations by reviewing the false predictions (FNs & FPs) from the various models; this only lists the results that I think are worth keeping for reference.

[Image: experiment results table]

Conclusion:

  1. Compared to the baseline prompt, my prompts generally include more extensive definitions of what constitutes an overrule and what does not, and I included 2 examples to construct a few-shot prompt (a sketch of the structure follows this list).
  2. The v217 prompt version is the best overall based on F1; unsurprisingly, there is a trade-off between precision & recall (compare v213 & v217). The v213 prompt is a close second with much higher recall.
  3. The v214 prompt employed a plan, write, and revise technique, but it did not improve performance, so I removed the "plan, write, and revise" logic in later iterations of the prompt.
  4. In reviewing the false predictions, the notable shortfalls are:
    a. The model gets the Acting Case & Target Case confused.
      • Proposed solution: get a list of the Acting & Target Case names and citations & include them in the prompt.
    b. The model makes mistakes when the language of overrule is ambiguous.
      • Proposed solution: I think I can benefit from a chat with an SME, as I too find some of these cases hard to distinguish.
    c. The excerpts that discuss the overrule are missing.
      • Proposed solution: this should be fixed after the Eyecite enhancements.
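
As a rough illustration of the few-shot structure mentioned in point 1, a sketch is below; the definitions, examples, and wording are placeholders, not the actual v213/v217 prompt text.

```python
# Illustrative prompt-assembly sketch only; the real definitions and examples
# from the v21x prompts are not reproduced here.
DEFINITIONS = (
    "An Acting Case OVERRULES a Target Case when it explicitly rejects the "
    "Target Case's holding ...\n"
    "An Acting Case does NOT overrule a Target Case when it merely "
    "distinguishes, limits, or criticizes it ...\n"
)  # placeholder definitions

FEW_SHOT_EXAMPLES = (
    'Example 1 (overruled): <excerpt> ... -> {"overruled": true, "reasoning": "..."}\n'
    'Example 2 (not overruled): <excerpt> ... -> {"overruled": false, "reasoning": "..."}\n'
)  # placeholder examples


def build_prompt(excerpt: str, target_case: str) -> str:
    """Assemble definitions + two examples + the excerpt to classify."""
    return (
        f"{DEFINITIONS}\n{FEW_SHOT_EXAMPLES}\n"
        f"Target Case: {target_case}\n"
        f"Excerpts from the Acting Case:\n{excerpt}\n\n"
        'Respond with JSON: {"overruled": true|false, "reasoning": "..."}'
    )
```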

Immediate next steps:

  1. Get a list of the Acting & Target Case names and citations & include them in the prompt.
  2. Meet with an SME to discuss some of the failed cases where the language seems ambiguous and brainstorm how an SME would think through these tough cases. I would also like the SME to review the 2 examples included in the prompt.
    • @mlissner: would you prefer that I schedule a meeting with Rebecca or post these questions to Slack as an open discussion?
  3. Craft a final pipeline + prompt to run through the Claude models (Haiku & Sonnet) one last time, then move on to using the same pipeline + prompt on other models to compare models.

Long term ideas:

  1. Consider finetuning
  2. Consider RAG (instead of extracting only the excerpts that are near the case name, embed the entire opinion and feed the embedded opinion as an input)
  3. Consider model as evaluator (as the model is currently often confidently wrong, add an evaluator model to the current pipeline to evaluate the model's results and have the model revise itself); a rough sketch of this loop follows below.
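
A minimal sketch of what such an evaluator pass could look like, reusing the hypothetical classify_excerpt helper from the Bedrock sketch above; the critique prompt and revision rule are illustrative assumptions, not a tested design.

```python
# Illustrative evaluator loop: a second, stronger model reviews the first
# verdict; if they disagree, the reviewer's verdict is adopted and re-checked.
def evaluate_and_revise(prompt: str, max_rounds: int = 2) -> dict:
    verdict = classify_excerpt(prompt)  # first-pass classification (Haiku)
    for _ in range(max_rounds):
        critique_prompt = (
            f"{prompt}\n\n"
            f"Another model answered: {verdict}\n"
            "Act as a skeptical reviewer. Is this verdict supported by the excerpts? "
            'Respond with JSON: {"overruled": true|false, "reasoning": "..."}'
        )
        review = classify_excerpt(
            critique_prompt,
            model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",  # assumed Sonnet v2 ID
        )
        if review["overruled"] == verdict["overruled"]:
            break  # classifier and evaluator agree; stop early
        verdict = review
    return verdict
```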

Note on metrics:
P: Precision (Of all predicted positives, how many were actual positives)
R: Recall (Of all actual positives, how many were predicted to be positives)
F1: A balanced metric considering both Precision & Recall
I'm not reporting Accuracy here because our dataset is imbalanced, making accuracy a less-than-ideal metric. The same goes for Specificity, which evaluates the negative class (not overruled), i.e. the majority class.
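
For reference, these can be computed directly from the predictions, e.g. with scikit-learn (the labels below are placeholder data):

```python
# Sketch: computing P/R/F1 for the binary overruled (1) / not-overruled (0) labels.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 0, 1]  # ground-truth labels (placeholder data)
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]  # model predictions (placeholder data)

print("P :", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("R :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:", f1_score(y_true, y_pred))         # harmonic mean of P and R
```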

I decided to hold off on using tool use to force the output format until a later time, as the current prompt is already producing JSON with minimal errors.

Also, I decided to hold off on finetuning until a later time because:

  1. We need time to create (and an SME to review) a dataset to be used for finetuning. We would need to select (maybe ~100?) both overrule and not-overrule examples with varying language and circumstances, and to hold our test dataset fixed, this finetuning dataset needs to use cases not already present in the test dataset.
  2. The excerpts we are using for the evaluation need to be enhanced; at least 1/3 of the false predictions occur because the language dictating overrule (or not overrule) is not included in the excerpts, or because the excerpts do not separate the lead opinion from the concurring and dissenting opinions.
  3. There is still potential for improvement from prompt engineering & the model provider alone.

instructions.txt

@mlissner (Member)

Yes, I think if you reach out to Rebecca, that'd be great; just be sure she has time. If she's busy, there are others we can reach out to, though she's probably the best!

Get a list of the Acting & Target Case names and citations & include in the prompt

I'm not sure what you mean by this. Do you mean that if we have something like:

injury is sufficient to present an actual case or controversy. Younger v. Harris, 401 U. S., at 41-42;

You want a list of things like:

  • Younger v. Harris, 401 U.S. at 41-42

Or am I misunderstanding?

@rachlllg (Author) commented Feb 18, 2025

I'm not sure what you mean by this.

I think I want to include a list of what a case could be referred to as. Since we currently only enclose the short case name 401 U. S., at 41-42 (because that's where the data_id appears), it appears the model sometimes doesn't know that 401 U. S., at 41-42 is Younger v. Harris. So I would add to the prompt something like:

'''
The Target Case is Younger v. Harris, 401 U. S., at 41-42, below are all the names this case could be referred to as:

  • Younger v. Harris, 401 U. S., at 41-42
  • Younger v. Harris
  • 401 U. S., at 41-42
  • Younger, supra
  • ...

'''

I think I can get this from Eyecite/Reporters-DB, right?
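
For what it's worth, here is a minimal sketch of how eyecite's citation resolution could group the different ways a case is referenced in the text; the attribute names reflect my reading of the current eyecite API and should be double-checked, and resolving short/supra forms depends on a full citation appearing earlier in the text.

```python
# Sketch: group full, short-form, and supra references to the same case with
# eyecite. The canonical case name would still need to come from the DB.
from eyecite import get_citations, resolve_citations

text = (
    "In Younger v. Harris, 401 U.S. 37, 41-42 (1971), the Court held ... "
    "That injury is sufficient to present an actual case or controversy. "
    "401 U.S., at 41-42; see Younger, supra, at 43."
)

citations = get_citations(text)
resolutions = resolve_citations(citations)  # Resource -> [full, short, supra, id cites]

for resource, cites in resolutions.items():
    full = resource.citation  # the full citation this resource was built from
    print(f"Case: {full.metadata.plaintiff} v. {full.metadata.defendant}")
    print("Referred to in the text as:")
    for cite in cites:
        print("  •", cite.matched_text())
```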

@mlissner (Member)

Just catching up now. Yeah, I'm not sure how good of results you can get from eyecite right now. I think it will soon have a mechanism for pulling case names from the DB, but I'm not sure if that's landed yet. @flooie or @grossir will know, if you want to ping one of them.

One way or another, you'll probably want to pull these from the DB though, since parsing the case name from the text is pretty difficult. Either eyecite can help pull from the DB, or if it can't and you end up doing it by hand, the field you want is OpinionCluster.case_name.

Note that that field can occasionally have very long values. We have an issue to fix that, but until we do, you'll probably want to discard them from your prompt, since they're not much use. The case law team can probably provide an example of these too, but the simple thing is probably to only add cases that have fewer than 10 or 15 words in them to the prompt — something like that.
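
A rough sketch of what pulling and filtering those names from the DB could look like, assuming Django ORM access to OpinionCluster; the import path and the 15-word cutoff are assumptions to verify.

```python
# Sketch (run inside the CourtListener Django environment): fetch a cluster's
# case_name and skip the occasionally very long values mentioned above.
from cl.search.models import OpinionCluster  # assumed model path; verify locally

MAX_WORDS = 15  # hypothetical cutoff for discarding overly long case names


def case_name_for_prompt(cluster_id: int) -> str | None:
    """Return the cluster's case name, or None if it is too long to be useful."""
    name = (
        OpinionCluster.objects
        .values_list("case_name", flat=True)
        .get(pk=cluster_id)
    )
    if not name or len(name.split()) > MAX_WORDS:
        return None  # not much use in the prompt
    return name
```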

@rachlllg (Author)

Thanks Mike! This is very helpful. I also chatted with Gianfranco briefly and it sounds like what I want may be more difficult than I anticipated!

I also forgot to note down the conclusions after discussing with Rebecca, so here they are:

  • The examples and the prompt look good, with a minor comment on "outdated" language, which I will update.
  • Headmatter may or may not be a good source for determining overrule or not, as different jurisdictions have different rules and the arguments presented in the headmatter can sometimes be misleading; I think I will stick with the opinion body itself for now.
  • In general, if there is a lead opinion, the lead opinion carries more weight, but concurring and dissenting opinions may be helpful in interpreting the lead opinion.
  • Sometimes different matters within a case may have a different number of jurors; these can get really complicated.
  • All 4 cases Rebecca helped to review, where the model failed, are "close call" cases; it's often not so black and white. We should strongly consider adding another category for these close-call cases. This is where our dataset would need to evolve to accommodate them.

A running list of open items/ideas:

Dataset & Excerpts:

  1. Review the labels and consider adding a 3rd class for "close call" cases
  2. Identify the different opinion types (plurality opinion, combined opinion, unanimous opinion, lead opinion, dissenting opinion, and concurring opinion) in the excerpts
  3. Extract the "list of case names" and provide as part of the prompt
  4. Perhaps if the opinion itself is less than xx tokens, use the entire opinion instead of excerpts

Model:

  1. Consider finetuning
  2. Consider RAG (instead of extracting only the excerpts that are near the case name, embed the entire opinion and feed the embedded opinion as an input)
  3. Consider model as evaluator (as the model is currently often confidently wrong, add an evaluator model to the current pipeline to evaluate the model results and have the model revise itself)

For now, I will move on to Cohere (with the prompt that has decent results) to compare model intelligence before getting back to these open items/ideas.
