
[WIP] New planner and plan following for Gaia #99

Closed. Wants to merge 65 commits.

Commits (65):
89672cf
new detailed planned, answer formalisation node
ollmer Nov 14, 2024
bc06b4b
explicit links to previous steps
ollmer Nov 14, 2024
170d6c6
support calling subagent from the mononode
ollmer Nov 15, 2024
62d2e4a
make plain text node common
ollmer Nov 15, 2024
238e2ff
controllable plan execution, full outline with node contents
ollmer Nov 15, 2024
83ef3e0
implement gaia new nodes, use free form thought for reflection everyw…
ollmer Nov 15, 2024
6f3e293
better debugging
ollmer Nov 15, 2024
a44ec5c
fix
ollmer Nov 15, 2024
85916dd
consistent node names, camel case
ollmer Nov 15, 2024
37d32f7
limit subagents scope using tape view stack
ollmer Nov 15, 2024
e8dff83
use free form reasoning for some thoughts
ollmer Nov 15, 2024
9921457
better subagent calls
ollmer Nov 15, 2024
f2495f4
proper return of subagent results
ollmer Nov 15, 2024
f092eda
pass any step positions as call args
ollmer Nov 15, 2024
173cb92
fix
ollmer Nov 15, 2024
f201acf
allow browser to show steps with missed prompts
ollmer Nov 18, 2024
1ebe86d
make formalize part of ThinkingNode, pass relevant facts to executor …
ollmer Nov 18, 2024
1c04dda
appendable web cache file
ollmer Nov 18, 2024
e91b5d5
update prompts, fix facts ledger update
ollmer Nov 18, 2024
4931d63
improve facts ledger
ollmer Nov 18, 2024
21834a4
pass subagent args using reference node and respond from subagent usi…
ollmer Nov 18, 2024
32459f3
simplify executor call, fix ledger update
ollmer Nov 18, 2024
440c648
remove print task in debug
ollmer Nov 18, 2024
aa022f1
adjust render
ollmer Nov 19, 2024
94e4304
mononode v2
ollmer Nov 19, 2024
c7b18ae
better debug
ollmer Nov 19, 2024
e9ed2e3
remove n_attempts and subtasks from old gaia
ollmer Nov 19, 2024
bffd013
reasoner subagent, use old gaia agent as executor
ollmer Nov 19, 2024
ca3c753
first eval of v2
ollmer Nov 19, 2024
a3f4e39
fix reasoner call
ollmer Nov 19, 2024
12f4b5a
update prompts, more fact format options
ollmer Nov 20, 2024
0d54e11
concurrent llm cache
ollmer Nov 20, 2024
7235004
allow load tapes with legacy steps in browser and results aggregation
ollmer Nov 20, 2024
575b69c
prompt adjust
ollmer Nov 20, 2024
3474156
fix
ollmer Nov 20, 2024
9092779
gaia node v2
ollmer Nov 20, 2024
7e53c5a
code sandbox
ollmer Nov 20, 2024
20f3fb3
fix v2 agent and test
ollmer Nov 21, 2024
79eadd5
fixes
ollmer Nov 21, 2024
48fb5b3
pass subtask result to the dependent task, looped reasoner with reading
ollmer Nov 21, 2024
5178f84
fix code execution
ollmer Nov 21, 2024
2668be6
reasonser start guidance
ollmer Nov 21, 2024
17dfb8a
coder agent, stop replan after 3 attempts
ollmer Nov 21, 2024
3c4f6ad
fix
ollmer Nov 21, 2024
be109b4
llm info and cost tracking
rizar Nov 21, 2024
a0b151c
actually log llm info
rizar Nov 21, 2024
0c9d187
better replan, guess when failed
ollmer Nov 22, 2024
ff35365
record loop termination on the tape
ollmer Nov 22, 2024
22cc688
show terminations in tape browser
ollmer Nov 22, 2024
0e83540
more loops for new arch
ollmer Nov 22, 2024
48e68d9
fix
ollmer Nov 22, 2024
442ada4
fix guess node
ollmer Nov 22, 2024
d9820ce
fix most tests
rizar Nov 22, 2024
cf24342
Merge branch 'gaia_planner' into llm_costs_and_info
rizar Nov 22, 2024
fc20b08
rm comment
rizar Nov 22, 2024
3791205
don't crash for local models because of LLM cost
rizar Nov 22, 2024
4e89b87
Merge pull request #108 from ServiceNow/llm_costs_and_info
rizar Nov 22, 2024
508ee8d
Merge branch 'main' into gaia_planner
ollmer Dec 2, 2024
dd49555
fix
ollmer Dec 2, 2024
3c5712f
fix test
ollmer Dec 2, 2024
7eb7fa5
update workarena agent, better llm replay errors
ollmer Dec 2, 2024
2eb1342
fix old agent
ollmer Dec 2, 2024
9c42610
update gaia test
ollmer Dec 2, 2024
31b5a41
update intro test
ollmer Dec 2, 2024
4ce872c
Merge branch 'main' into gaia_planner
ollmer Dec 2, 2024
3 changes: 1 addition & 2 deletions conf/gaia_llama.yaml
@@ -8,7 +8,6 @@ split: validation
 n_attempts: 1
 agent:
   planning_mode: simple
-  subtasks: false

env:
safe_calculator: false
@@ -17,4 +16,4 @@ env:

 hydra:
   run:
-    dir: outputs/gaia/runs/${exp_name}
+    dir: ${exp_path}
5 changes: 2 additions & 3 deletions conf/gaia_openai.yaml
@@ -2,14 +2,13 @@ defaults:
   - llm: gpt4o_mini
   - _self_

-exp_name: gpt4o_mini_val_batch32_6
+exp_name: gpt4o_mini_val_l1test1
 exp_path: outputs/gaia/runs/${exp_name}
 split: validation
 n_attempts: 1
 batch: 32
 agent:
   planning_mode: simple
-  subtasks: false

env:
safe_calculator: false
@@ -18,4 +17,4 @@ env:

 hydra:
   run:
-    dir: outputs/gaia/runs/${exp_name}
+    dir: ${exp_path}
27 changes: 9 additions & 18 deletions examples/gaia_agent/agent.py
@@ -13,14 +13,12 @@
 from .steps import (
     ActionExecutionFailure,
     CalculationResultObservation,
-    FinishSubtask,
     GaiaAgentStep,
     GaiaQuestion,
     ListOfFactsThought,
     PageObservation,
     PlanThought,
     PreviousFactsObservation,
-    PythonCodeAction,
     SearchResultsObservation,
     SourcesThought,
     UseCalculatorAction,
@@ -42,14 +40,14 @@ class PlanningMode(str, Enum):
 class GaiaNode(MonoNode):
     system_prompt: str = PromptRegistry.system_prompt
     steps_prompt: str = PromptRegistry.allowed_steps
-    agent_step_cls: Any = Field(exclude=True, default=GaiaAgentStep)
+    output_cls: Any = Field(exclude=True, default=GaiaAgentStep)

-    def get_steps_description(self, tape: GaiaTape, agent: Any) -> str:
+    def get_steps_description(self, tape: GaiaTape) -> str:
         """
         Allow different subset of steps based on the agent's configuration
         """
         plan_thoughts = not tape.has_fact_schemas()
-        allowed_steps = get_allowed_steps(agent.subtasks, plan_thoughts)
+        allowed_steps = get_allowed_steps(plan_thoughts)
         return self.steps_prompt.format(allowed_steps=allowed_steps)

     def prepare_tape(self, tape: GaiaTape, max_chars: int = 200) -> GaiaTape:
@@ -75,7 +73,7 @@ def postprocess_step(self, tape: GaiaTape, new_steps: list[Step], step: Step) ->
         if isinstance(step, ListOfFactsThought):
             # remove empty facts produced by the model
             step.given_facts = [fact for fact in step.given_facts if fact.value is not None and fact.value != ""]
-        elif isinstance(step, (UseCalculatorAction, PythonCodeAction)):
+        elif isinstance(step, (UseCalculatorAction)):
             # if calculator or code action is used, add the facts to the action call
             step.facts = tape.model_copy(update=dict(steps=tape.steps + new_steps)).facts()
         return step
@@ -84,9 +82,7 @@ def trim_tape(self, tape: GaiaTape) -> GaiaTape:
         """
         Make tape shorter to fit llm context size limits
         """
-        finish_subtask_positions = [i for i, step in enumerate(tape) if isinstance(step, FinishSubtask)]
-        # trim either after last finished subtask or at 2/3 of the tape
-        summarization_border = (finish_subtask_positions[-1] + 1) if finish_subtask_positions else int(len(tape) * 0.66)
+        summarization_border = int(len(tape) * 0.66)  # trim at 2/3 of the tape
         short_tape = tape.model_copy(update=dict(steps=[]))
         pre_tape: GaiaTape = tape[:summarization_border]  # type: ignore
         for step in pre_tape.steps:
@@ -100,22 +96,19 @@ def trim_tape(self, tape: GaiaTape) -> GaiaTape:


 class GaiaAgent(Agent):
-    subtasks: bool
-
     @classmethod
     def create(
         cls,
         llm: LLM,
         planning_mode: PlanningMode = PlanningMode.simple,
-        subtasks: bool = False,
     ):
-        nodes = cls.prepare_guidance(planning_mode, subtasks)
-        return super().create(llm, nodes=nodes, max_iterations=2, subtasks=subtasks)
+        nodes = cls.prepare_guidance(planning_mode)
+        return super().create(llm, nodes=nodes, max_iterations=2)

     @classmethod
-    def prepare_guidance(cls, planning_mode: PlanningMode, subtasks: bool) -> list[GaiaNode]:
+    def prepare_guidance(cls, planning_mode: PlanningMode) -> list[GaiaNode]:
         """
-        Prepare mononodes based on the planning mode and subtasks flag
+        Prepare mononodes based on the planning mode
         """
         guidance_nodes = []
         if planning_mode == PlanningMode.simple:
@@ -179,6 +172,4 @@ def prepare_guidance(cls, planning_mode: PlanningMode, subtasks: bool) -> list[G
             ]
         else:
             raise ValueError(f"Unknown planning mode: {planning_mode}")
-        if subtasks:
-            guidance_nodes.append(GaiaNode(name="check_subtask_finished", guidance=PromptRegistry.is_subtask_finished))
         return guidance_nodes
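The `trim_tape` change above replaces the subtask-aware summarization border with a fixed fraction of the tape length. A standalone sketch of the new rule (the helper name here is mine, not from the PR):

```python
def summarization_border(tape_len: int) -> int:
    # With FinishSubtask steps removed, the border is always at
    # roughly 2/3 of the tape length (int() truncates toward zero).
    return int(tape_len * 0.66)
```

In the diff, steps before this border are filtered down to the essentials while everything after it is kept verbatim, so the agent retains its recent context in full.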
45 changes: 40 additions & 5 deletions examples/gaia_agent/environment.py
@@ -2,6 +2,7 @@

 from tapeagents.environment import Environment
 from tapeagents.tools.calculator import calculate
+from tapeagents.tools.container_executor import CodeBlock, ContainerExecutor
 from tapeagents.tools.python_interpreter import python_calculate, run_python_code
 from tapeagents.tools.simple_browser import SimpleTextBrowser
 from tapeagents.utils import FatalError
@@ -27,8 +28,14 @@


 class GaiaEnvironment(Environment):
-    def __init__(self, safe_calculator: bool = True, **kwargs) -> None:
+    def __init__(
+        self,
+        safe_calculator: bool = True,
+        code_sandbox: ContainerExecutor | None = None,
+        **kwargs,
+    ) -> None:
         super().__init__()
+        self.code_sandbox = code_sandbox
         self.browser = SimpleTextBrowser(**kwargs)
         self.calculate = calculate if safe_calculator else python_calculate

@@ -76,10 +83,28 @@ def react(self, tape: GaiaTape) -> GaiaTape:
                     result = self.calculate(action.expression, action.facts or {})
                     tape = tape.append(CalculationResultObservation(name=action.fact_name, result=result))
                 case PythonCodeAction():
-                    result, stdout, stderr = run_python_code(action.code, action.facts or {})
-                    tape = tape.append(
-                        CodeResultObservation(name=action.fact_name, result=result, stdout=stdout, stderr=stderr)
-                    )
+                    if self.code_sandbox is not None:
+                        result = self.code_sandbox.execute_code_blocks(
+                            [CodeBlock(code=print_last_line(action.code), language="python")]
+                        )
+                        obs = CodeResultObservation(
+                            result=result.output.strip(),
+                            stdout=f"Exit code: {result.exit_code}",
+                            stderr="",
+                        )
+                    else:
+                        # TODO: remove this option and permutations crutch
+                        logger.warning(f"Code sandbox is not provided, running code locally!\n{action.code}")
+                        if "permutations" in action.code:
+                            result, stdout, stderr = "", "", "Execution timeout"
+                        else:
+                            result, stdout, stderr = run_python_code(action.code, {})
+                        obs = CodeResultObservation(
+                            result=result,
+                            stdout=stdout,
+                            stderr=stderr,
+                        )
+                    tape = tape.append(obs)
                 case LLMOutputParsingFailureAction():
                     pass
                 case _:
@@ -91,3 +116,13 @@ def react(self, tape: GaiaTape) -> GaiaTape:
                 tape = tape.append(ActionExecutionFailure(error=str(e)))
                 break
         return tape
+
+
+def print_last_line(python_code: str) -> str:
+    lines = python_code.splitlines()
+    if " = " in lines[-1]:
+        name = lines[-1].split("=")[0].strip()
+        lines.append(f"print({name})")
+    else:
+        lines[-1] = f"print({lines[-1]})"
+    return "\n".join(lines)
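The `print_last_line` helper added at the end of `environment.py` rewrites a code snippet so that its final value is printed, which lets the container executor capture the result from stdout. A standalone copy with comments, matching the behavior in the diff:

```python
def print_last_line(python_code: str) -> str:
    lines = python_code.splitlines()
    if " = " in lines[-1]:
        # Last line is an assignment: append a print of the assigned name.
        name = lines[-1].split("=")[0].strip()
        lines.append(f"print({name})")
    else:
        # Last line is a bare expression: wrap it in print().
        lines[-1] = f"print({lines[-1]})"
    return "\n".join(lines)
```

Note the helper assumes the last line is either a simple single-target assignment or a bare expression; compound statements or multi-target assignments on the final line would need more careful handling.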
79 changes: 28 additions & 51 deletions examples/gaia_agent/eval.py
@@ -9,14 +9,15 @@

 from huggingface_hub import snapshot_download
 from termcolor import colored

+from tapeagents.core import TerminationStep
 from tapeagents.io import load_tapes, save_json_tape
-from tapeagents.orchestrator import main_loop
+from tapeagents.orchestrator import MainLoopStatus, main_loop
 from tapeagents.renderers import step_view

 from .agent import GaiaAgent
 from .environment import GaiaEnvironment
 from .scorer import question_scorer
-from .steps import GaiaAnswer, GaiaQuestion, PlanThought
+from .steps import GaiaQuestion
 from .tape import GaiaMetadata, GaiaTape

 logger = logging.getLogger(__name__)
@@ -96,56 +97,32 @@ def load_dataset(split: str):
     return tasks


-def solve_task(task: dict, agent: GaiaAgent, env: GaiaEnvironment, n_attempts: int = 1) -> GaiaTape:
+def solve_task(task: dict, agent: GaiaAgent, env: GaiaEnvironment, level: int, tries: int = 1) -> GaiaTape:
     question = task_to_question_step(task, env)
-    tapes: list[GaiaTape] = []
-    results: list[Any] = []
-    previous_plans: list[str] = []
-    while len(tapes) < n_attempts:
-        predicted = None
-        tries = 3
-        while not predicted and tries:
-            tape = GaiaTape(steps=[question])
-            logger.info(colored(f"Attempt {len(tapes)+1}", "green"))
-            discard_attempt = False
-            planned = False
-            step = None
-            try:
-                for event in main_loop(agent, tape, env, max_loops=30):
-                    if event.agent_event and event.agent_event.step:
-                        step = event.agent_event.step
-                        tape = tape.append(step)  # type: ignore
-                        if isinstance(step, PlanThought) and not planned:
-                            plan_dump = "\n".join(step.plan)
-                            if plan_dump in previous_plans:
-                                logger.info("Plan already been used, discard attempt")
-                                discard_attempt = True
-                                break
-                            else:
-                                planned = True
-                                previous_plans.append(plan_dump)
-                    if event.observation:
-                        tape = tape.append(event.observation)  # type: ignore
-                if discard_attempt:
-                    continue
-            except Exception as e:
-                tape.metadata.error = str(e)
-                logger.exception(f"Failed to solve task: {e}")
-                break
-            predicted = step.answer if isinstance(step, GaiaAnswer) else None
-            tries -= 1
-        predicted = str(predicted)
-        tapes.append(tape)
-        results.append(predicted)
-        logger.info(f"Expected: {task['Final answer']}, Agent produced: {predicted}")
-    logger.info(f"Produced {len(tapes)} tapes, vote")
-    best = majority_vote(results)
-    logger.info(f"Majority vote best non-empty result: {best}, out of {results}")
-    best_tape = tapes[best]
-    best_tape.metadata = GaiaMetadata.model_validate(
-        best_tape.metadata.model_dump() | {"task": task, "result": results[best]}
-    )
-    return best_tape
+    result = None
+    tape = GaiaTape(steps=[question])
+    while not result and tries:
+        tape = GaiaTape(steps=[question])
+        try:
+            for event in main_loop(agent, tape, env, max_loops=60):
+                if event.agent_event and event.agent_event.step:
+                    tape = tape.append(event.agent_event.step)  # type: ignore
+                if event.observation:
+                    tape = tape.append(event.observation)  # type: ignore
+                if event.status == MainLoopStatus.TERMINATED:
+                    tape = tape.append(TerminationStep())
+                    tape.metadata.terminated = True
+        except Exception as e:
+            tape.metadata.error = str(e)
+            logger.exception(f"Fatal Error. Failed to solve task: {e}")
+            break
+        result = getattr(tape[-1], "answer", None)
+        tries -= 1
+    result = getattr(tape[-1], "answer", None)
+    logger.info(f"Expected: {task['Final answer']}, Agent produced: {result}")
+    tape.metadata = GaiaMetadata.model_validate(tape.metadata.model_dump() | {"task": task, "result": str(result)})
+    tape.metadata.level = level
+    return tape


def task_to_question_step(task: dict, env: GaiaEnvironment, max_doc_length: int = 8000) -> GaiaQuestion:
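The new `solve_task` drops the majority-vote-over-attempts logic in favor of a simple retry: rebuild the tape and rerun until an answer appears or the try budget runs out. A minimal sketch of that control flow (function and parameter names here are mine, not from the PR):

```python
def retry_until_result(run_once, tries: int = 1):
    # Rerun until a truthy result appears or the budget is exhausted,
    # mirroring the `while not result and tries` loop in solve_task.
    result = None
    while not result and tries:
        result = run_once()
        tries -= 1
    return result
```

Each iteration starts from a fresh tape, so a failed attempt leaves no state behind; the last tape produced is returned regardless of whether it contains an answer.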