This folder contains sample test instances for evaluating an LLM-based agent. The instances were created manually, based on the test instance patterns.
The evaluation can be run via the `main.py` file (assuming the necessary dependencies have been installed as described in the installation instructions).
The test instances depend on the architecture specifications defined in the `examples` folder. There are currently three variants of the architecture specification DSL to test. The desired variant is selected via the `variant` variable in the `main.py` file.
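As a purely illustrative example, the selection could look like the following; the values shown here are assumptions based on the example folder names and may not match the identifiers actually accepted by `main.py`:

```python
# Hypothetical excerpt -- check main.py for the exact accepted values.
variant = "2_assembled"  # e.g. "1_separate_files", "2_assembled", or "3_expanded"
```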
The first variant is the DSL developed for the XXP project. In this variant, workflows can be derived from other workflows (reusing and modifying them). In a derived workflow (called an assembled workflow in the project), the configuration of tasks can be updated, but no new control flow or data flow can be added. To fully understand a derived workflow (in particular, to reason about its control and data flow), it is necessary to obtain the specification of the base workflow, which is referenced from the derived workflow.
Tasks can be either primitive (an implementation file is specified) or composite (composed of other tasks). For composite tasks, the specification references a sub-workflow in which the sub-tasks are defined.
An example workflow specification can be found in the `/examples/artificial_workflow/1_separate_files` folder.
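To make the terminology concrete, the sketch below models these concepts as plain Python data classes. It is only a conceptual illustration of the structure described above, not the DSL syntax itself; all class and field names are our own.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    """A task is primitive (implementation file) or composite (reference to a sub-workflow)."""
    name: str
    implementation: Optional[str] = None  # path to the implementation file (primitive tasks)
    sub_workflow: Optional[str] = None    # name of the referenced sub-workflow (composite tasks)
    params: dict = field(default_factory=dict)  # task configuration, e.g. values set via 'param'

@dataclass
class Workflow:
    """A workflow; a derived (assembled) workflow references its base workflow."""
    name: str
    base_workflow: Optional[str] = None  # set for derived (assembled) workflows
    tasks: list[Task] = field(default_factory=list)
    control_flow: list[tuple[str, str]] = field(default_factory=list)  # (from_task, to_task)
    data_flow: list[tuple[str, str]] = field(default_factory=list)     # (data_object, task) links
```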
The second variant is a simplified version of the first variant, in which only assembled workflows exist. Instead of referencing the base workflow, all the information is copied into the specification of the derived workflow (and slightly simplified syntactically).
An example workflow specification can be found in the `/examples/artificial_workflow/2_assembled` folder.
The third variant is created from the second variant by inlining all the sub-workflows. Instead of a reference to a sub-workflow, the whole sub-workflow specification is copied inside the composite task. This can lead to deeply nested tasks.
An example workflow specification can be found in the `/examples/artificial_workflow/3_expanded` folder.
We manually created 25 test instances based on the test patterns. The tables below summarize the number of instances and the score for each pattern and variant, grouped by how the workflow specifications are provided to the LLM (agent mode vs. up-front mode).
Note: The Basic functionality patterns are labeled `semantics` in the source code and in the raw results.
Agent mode: the LLM has to ask for the workflow specifications.
| Pattern category | Pattern name | Number of instances | Variant 1 Score | Variant 2 Score | Variant 3 Score |
|---|---|---|---|---|---|
| Structure | List of tasks | 5 | | | |
| Structure | Links in flow | 4 | | | |
| Structure | Task after task | 4 | | | |
| Structure | Next tasks in flow | 4 | | | |
| Structure | Flow cycle | 4 | | | |
| Structure | Total | 21 | | | |
| Behavior | Mutually exclusive conditional flow | 2 | | | |
| Behavior | Total | 2 | | | |
| Basic functionality | Inconsistent task name and description | 1 | | | |
| Basic functionality | Inconsistent descriptions | 1 | | | |
| Basic functionality | Total | 2 | | | |
Up-front mode: all workflow specifications are presented in the initial prompt.
Note that for Variant 3, it does not make sense to differentiate between the agent and up-front modes, as there is always only one WADL specification file.
| Pattern category | Pattern name | Number of instances | Variant 1 Score | Variant 2 Score |
|---|---|---|---|---|
| Structure | List of tasks | 5 | | |
| Structure | Links in flow | 4 | | |
| Structure | Task after task | 4 | | |
| Structure | Next tasks in flow | 4 | | |
| Structure | Flow cycle | 4 | | |
| Structure | Total | 21 | | |
| Behavior | Mutually exclusive conditional flow | 2 | | |
| Behavior | Total | 2 | | |
| Basic functionality | Inconsistent task name and description | 1 | TODO | TODO |
| Basic functionality | Inconsistent descriptions | 1 | TODO | TODO |
| Basic functionality | Total | 2 | TODO | TODO |
It should be noted that the responses of LLMs are stochastic; thus running the experiment with the same test instances again may produce slightly different results.
We noticed that it is necessary to formulate the questions as precisely as possible. For example, when the LLM is asked to "list all tasks that have a parameter", it sometimes also lists tasks that depend on the `Hyperparameters` data object. However, when we refined the question by adding "(specified via the 'param' keyword)", we got the answers we wanted.
Another example is the "Links in flow" and "Task after task" patterns. Although they technically ask about the same workflow feature (control flow links), the questions are formulated differently, and the results differ. Interestingly, the test instance that scored the worst asked whether the `ModelTraining` task is "directly after" the `ModelEvaluation` task in the control flow (the order was intentionally semantically incorrect in the specification). On the other hand, the LLM answers mostly correctly when the question asks whether there "is a control flow link" between the two tasks.
In our evaluation setup, we do not present the workflow specifications to the LLM in the initial prompt. Rather, the LLM works as an agent (based on the ReAct approach and function calling) and can request the workflow specification files it needs. We chose this approach because we wanted to see whether the LLM is able to choose which information is necessary. This is aligned with our long-term goal of building an assistant that could also work, for example, with experiment results, and always presenting all the available information to the LLM can become costly.
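As a rough illustration of this setup, the sketch below exposes a single tool for fetching workflow specification files via OpenAI function calling and runs a minimal ReAct-style loop. The tool name, prompts, and file layout are assumptions and do not reflect the actual code in `main.py`.

```python
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()
SPEC_DIR = Path("examples/artificial_workflow/1_separate_files")  # assumed location

# The single tool the agent may call: fetch a workflow specification file by name.
tools = [{
    "type": "function",
    "function": {
        "name": "get_workflow_specification",  # hypothetical tool name
        "description": "Return the content of a workflow specification file.",
        "parameters": {
            "type": "object",
            "properties": {"file_name": {"type": "string"}},
            "required": ["file_name"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You answer questions about workflow specifications. "
                                  "Request the specification files you need."},
    {"role": "user", "content": "Is there a control flow link between ModelTraining and ModelEvaluation?"},
]

# Minimal ReAct-style loop: let the model call the tool until it produces a final answer.
while True:
    response = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
    message = response.choices[0].message
    messages.append(message)
    if not message.tool_calls:
        print(message.content)  # final answer
        break
    for call in message.tool_calls:
        file_name = json.loads(call.function.arguments)["file_name"]
        content = (SPEC_DIR / file_name).read_text()
        messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
```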
We evaluated all the basic functionality patterns using the recall component of the ROUGE-1 metric. In short, this counts how many words from the reference answer appear in the LLM's answer. Clearly, this metric depends on the concrete formulation of the answer, so a correct answer phrased in different words will not get a perfect score. For reference, we also checked all the answers by hand, and the LLM's responses contained the correct answer for all the basic functionality instances. That means that the scores for the basic functionality (`semantics`) patterns in the tables above correspond to correct answers.
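For reference, ROUGE-1 recall is essentially the fraction of unigrams from the reference answer that also appear in the candidate answer. A minimal sketch (whitespace tokenization, no stemming, so full ROUGE implementations may give slightly different numbers):

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that also occur in the candidate (clipped counts)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

# A correct answer phrased differently does not reach a perfect score:
print(rouge1_recall("the task name is inconsistent with its description",
                    "the name of the task does not match its description"))  # 0.625
```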
It should also be noted that even a wrong answer could score well on ROUGE-1 if it contained all the words from the reference answer (possibly in a different order or with different meanings). Further investigation is necessary to assess whether other metrics (e.g., BERTScore, G-Eval) are more appropriate for evaluating open questions.
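As an illustration of one such alternative, BERTScore compares contextual embeddings instead of exact word overlap; below is a minimal sketch using the `bert-score` package (the example sentences are ours), shown only to indicate what such an investigation might start from:

```python
from bert_score import score  # pip install bert-score

candidates = ["The name of the task does not match its description."]
references = ["The task name is inconsistent with its description."]

# Returns precision, recall, and F1 tensors computed from contextual embeddings.
precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.item():.3f}")
```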
We noticed that the LLM performed significantly better when the chain-of-thought technique was applied: we asked the LLM to first think about the answer and explain it, and only then submit the final answer.
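One possible way to request this two-step output is sketched below; the instruction wording and the `ANSWER:` marker are assumptions, not the exact prompt used in the evaluation:

```python
# Hypothetical chain-of-thought instruction: reasoning first, final answer on the last line.
COT_INSTRUCTION = (
    "First think about the question step by step and explain your reasoning. "
    "Then write a final line starting with 'ANSWER:' followed by your answer."
)

def split_answer(llm_output: str) -> tuple[str, str]:
    """Separate the explanation from the final answer (assumes the format requested above)."""
    explanation, marker, answer = llm_output.rpartition("ANSWER:")
    if not marker:  # the model did not follow the requested format
        return llm_output.strip(), ""
    return explanation.strip(), answer.strip()
```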
There is no guarantee that the explanation will correspond to the answer, but it can give us insights into why the LLM does not understand a particular type of question. For instance, it can show us that the LLM cannot interpret parts of the DSL correctly and understands it differently than we meant.
This can be taken one step further by asking the LLM what information it needs to know to answer the question. Again, giving that information to the LLM is not guaranteed to improve the answer, but it can at least give us ideas on what to try.
In our setup (LLM inputs: a system prompt with general information, the workflow specification in our DSL, and the test instance question; LLM outputs: the answer to the test instance and an explanation of the answer; LLM model: GPT-4 Turbo), the cost of running the evaluation is roughly 0.02 USD per test instance.
In our setup, it takes approximately 10 seconds to process one test instance, which totals approximately 4-5 minutes for the evaluation with 25 test instances.