
Different optimization results between 2.5.16 -> 2.5.20 #1722

Open
nielsgl opened this issue Oct 30, 2024 · 2 comments
Comments


nielsgl commented Oct 30, 2024

Hi!

I created a simple module and a set of 10 questions and answers to evaluate it against a single PDF loaded into ChromaDB. When evaluating with DSPy version 2.5.16 like this:

evaluate = dspy.Evaluate(
    devset=data, metric=metric, num_threads=24, display_progress=True, display_table=3
)
evaluate(rag)

I get a semantic F1 score of 69; then, when I run the optimization (which takes about 15 minutes) and evaluate the optimized program, I get a score of about 79.
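For context, the metric is DSPy's semantic F1 and the data is just the 10 question/answer pairs; a rough sketch of the setup (the field names and the qa_pairs list are simplified placeholders):

import dspy
from dspy.evaluate import SemanticF1

# Semantic F1 metric used for both evaluation and optimization.
metric = SemanticF1()

# 10 question/answer pairs about the PDF (qa_pairs is a placeholder here).
data = [
    dspy.Example(question=q, response=a).with_inputs("question")
    for q, a in qa_pairs
]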

tp = dspy.MIPROv2(
    metric=metric, auto="medium", num_threads=24
)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(
    RAG(),
    trainset=data[:7],
    valset=data[7:],
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
    requires_permission_to_run=False,
    seed=0
)

evaluate(optimized_rag)

However, when I run this with version 2.5.20, I first get a score of 61, and after optimization I get a score of 69. These results seem quite different from each other and significantly lower. Everything is the same except that I upgraded the DSPy library. Interestingly, the optimization now finishes in about 2 minutes, which is significantly faster. Any thoughts on these differences?
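For reference, the module being compiled is essentially the standard DSPy RAG pattern; roughly like this (the search helper stands in for the actual ChromaDB similarity query, and the signature fields are simplified):

import dspy

class RAG(dspy.Module):
    def __init__(self, num_docs=5):
        super().__init__()
        self.num_docs = num_docs
        self.respond = dspy.ChainOfThought("context, question -> response")

    def forward(self, question):
        # search() stands in for the ChromaDB similarity query over the PDF chunks.
        context = search(question, k=self.num_docs)
        return self.respond(context=context, question=question)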

okhat (Collaborator) commented Oct 30, 2024

Hey @nielsgl! We adjusted the adapters layer (which sits between signatures and LMs) in DSPy 2.5.19; you can find the details on the releases page: https://github.com/stanfordnlp/dspy/releases
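In the meantime, you can make the adapter explicit in your script so it's clear which formatting/parsing logic each run is using; a rough sketch, assuming the default ChatAdapter and a placeholder model name:

import dspy

# The adapter sits between signatures and the LM; configuring it explicitly
# makes it visible which formatting/parsing logic a given run used.
lm = dspy.LM("openai/gpt-4o-mini")  # placeholder model
dspy.configure(lm=lm, adapter=dspy.ChatAdapter())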

Perhaps we should save the adapter logic as part of the saved program, actually, so when you load it in the future, it's exactly identical in behavior to your older runs.

What do you think?

(Separately, I wouldn't read too much into the 69 vs. 79 scores: you're working with a valset of only 3 examples, so there's a lot of room for noise.)
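A quick back-of-the-envelope with hypothetical per-example scores shows how much a single example can move the average:

# With 3 validation examples, each one carries ~1/3 of the overall score.
per_example = [90, 80, 37]      # hypothetical per-example semantic F1 values
print(sum(per_example) / 3)     # = 69
per_example[2] = 67             # one example answered somewhat better
print(sum(per_example) / 3)     # = 79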

@okhat okhat closed this as completed Nov 18, 2024
@okhat okhat reopened this Nov 18, 2024
chenmoneygithub (Collaborator) commented
@okhat We can save the adapter code with cloudpickle, so it's technically doable. But I don't think our adapter change should cause a performance regression: conceptually it just parses the input and output, and if there is a true regression, that could indicate we are doing something wrong. So instead of officially supporting serializing a DSPy program together with the adapter code, maybe we should ensure that the newer-version adapter has no negative performance effect?
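For completeness, serializing the program together with its adapter could look roughly like this (not an existing DSPy API, just cloudpickle over the compiled program plus the adapter it was optimized under):

import cloudpickle
import dspy

# Bundle the compiled program with the adapter that was active when it was optimized.
bundle = {"program": optimized_rag, "adapter": dspy.ChatAdapter()}

with open("optimized_rag.pkl", "wb") as f:
    cloudpickle.dump(bundle, f)

# Later: restore both, so the loaded program behaves like the original run.
with open("optimized_rag.pkl", "rb") as f:
    bundle = cloudpickle.load(f)

dspy.configure(adapter=bundle["adapter"])
optimized_rag = bundle["program"]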
