ThreadPoolExecutor breaks dspy.Evaluate config in parallel execution #1766

Open
glesperance opened this issue Nov 6, 2024 · 1 comment

@glesperance
Contributor

Using ThreadPoolExecutor to parallelize dspy calls breaks internal config management for threaded dspy.Evaluate.

Specifically, doing this:

import dspy
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

class QuestionAnswer(dspy.Signature):
    question: str = dspy.InputField(description="The question")
    answer: int = dspy.OutputField(description="The answer to the question")

solver = dspy.ChainOfThought(QuestionAnswer)

# trainset is a list of dspy.Example objects (see the linked notebook for its construction).
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(tqdm(executor.map(lambda x: solver(**x.inputs()), trainset), total=len(trainset)))

breaks the threaded dspy.Evaluate calls that follow:

# This breaks: both evaluations report the same score. All threads run on gpt4o_mini, which was
# the last model configured before the ThreadPoolExecutor was created.
evaluator = dspy.Evaluate(devset=devset, metric=is_correct, num_threads=10, display_progress=True)

dspy.configure(lm=gpt4o)
evaluator(solver)

dspy.configure(lm=gpt4o_mini)
evaluator(solver)

# ###### Output ######
# >>>>> gpt4o
# Average Metric: 47 / 50  (94.0): 100%|██████████| 50/50 [00:00<00:00, 1608.90it/s]
# 2024/11/06 11:49:23 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)
# >>>>> gpt4o_mini
# Average Metric: 47 / 50  (94.0): 100%|██████████| 50/50 [00:00<00:00, 1795.91it/s]
# 2024/11/06 11:49:23 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)

That is, all subsequent calls to dspy.configure are ignored by the threaded evaluator.

Turning off threading on dspy.Evaluate works properly:

# Turning off threading works as expected: both models have different scores again.

evaluator = dspy.Evaluate(devset=devset, metric=is_correct, num_threads=1, display_progress=True)

dspy.configure(lm=gpt4o)
evaluator(solver)

dspy.configure(lm=gpt4o_mini)
evaluator(solver)

# ###### Output ######
# >>>>> gpt4o
# Average Metric: 48 / 50  (96.0): 100%|██████████| 50/50 [00:00<00:00, 892.83it/s] 
# 2024/11/06 11:49:24 INFO dspy.evaluate.evaluate: Average Metric: 48 / 50 (96.0%)

# >>>>> gpt4o_mini
# Average Metric: 47 / 50  (94.0): 100%|██████████| 50/50 [00:00<00:00, 976.47it/s] 
# 2024/11/06 11:49:24 INFO dspy.evaluate.evaluate: Average Metric: 47 / 50 (94.0%)

See this notebook for full repro.
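In the meantime, one way to stop depending on the global configuration for the manual ThreadPoolExecutor step is to pin the LM per call. This is only a sketch; it assumes the dspy.context context manager (dspy.settings.context in some versions) is available in the installed release:

import dspy
from concurrent.futures import ThreadPoolExecutor

# Each worker overrides the LM for its own call instead of reading the
# process-wide value set by dspy.configure. solve_with is a hypothetical
# helper name, not part of DSPy.
def solve_with(lm, example):
    with dspy.context(lm=lm):
        return solver(**example.inputs())

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(lambda x: solve_with(gpt4o, x), trainset))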

@isaacbmiller
Collaborator

This is unfortunately a known issue, and the best ways to address it are:

  1. As of the time of writing: run everything inside an Evaluate call with a dummy metric (lambda x, y, z=None, w=None: 1) and collect your outputs using the corresponding kwarg in Evaluate (see the sketch after this list).
  2. Once #1690 (Add support for native parallel execution in DSPy) merges, use that for parallel execution.
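A minimal sketch of option 1, assuming the Evaluate kwarg in question is return_outputs and that, when set, the call returns a (score, results) pair where results holds (example, prediction, score) tuples; the exact kwarg name and return shape may differ between DSPy versions:

# Dummy metric so Evaluate only drives the (correctly threaded) forward passes.
dummy_metric = lambda x, y, z=None, w=None: 1

runner = dspy.Evaluate(
    devset=trainset,          # the examples you want to map over in parallel
    metric=dummy_metric,
    num_threads=10,
    display_progress=True,
    return_outputs=True,      # assumed kwarg for collecting the raw outputs
)

dspy.configure(lm=gpt4o)
_, results = runner(solver)
predictions = [pred for _, pred, _ in results]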

We should catch/warn in this situation @okhat @krypticmouse

@isaacbmiller added the bug label on Nov 6, 2024