This is a full list of test instance patterns that can be used to evaluate how well LLMs understand workflow architectures.
To create test instances from the patterns, substitute all the parameters. Note that it might also be necessary to adapt the question formulations to the language of the specific workflow architecture.
Rationale: Can the LLM list the tasks in a workflow and filter them?
Parameters:
- $P$: property of the tasks (e.g., the task has a parameter); can be empty (to list all the tasks)
- $N$: number of tasks satisfying $P$
- $T$: total number of tasks
- $W$: workflow name
Architecture: Workflow
Question: List all tasks in workflow $W$ that satisfy $P$.
Reference answer: a set of $N$ tasks satisfying $P$ (depends on the parameter substitution for the test instance)
Evaluation metric: Jaccard index
Example instance: "List all tasks in workflow 'MLTrainingAndEvaluation' that have a parameter."
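Set-valued patterns like this one are scored with the Jaccard index; a minimal sketch of the scoring (the answer sets below are made up for illustration, not taken from an actual workflow):

```python
def jaccard_index(llm_answer: set, reference: set) -> float:
    """Size of the intersection divided by the size of the union.
    Two empty sets are treated as a perfect match (index 1.0)."""
    if not llm_answer and not reference:
        return 1.0
    return len(llm_answer & reference) / len(llm_answer | reference)

# Hypothetical answer sets for an instance of this pattern:
reference = {"DataRetrieval", "MLModelTraining"}
llm_answer = {"MLModelTraining", "MLModelEvaluation"}
print(jaccard_index(llm_answer, reference))  # 1 shared / 3 total = 0.333...
```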
Rationale: Can the LLM understand flow links between tasks?
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
Architecture: Workflow
Question: In workflow $W$, is there a $F$ link from $T_1$ to $T_2$?
Reference answer: yes
Evaluation metric: correctness
Example instance: "In workflow 'MLTrainingAndEvaluation', is there a control flow link from 'MLModelTraining' to 'MLModelEvaluation'?"
Rationale: Can the LLM understand flow links between tasks? (negative test)
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
Architecture: Workflow
Question: In workflow $W$, is there a $F$ link from $T_1$ to $T_2$?
Reference answer: no
Evaluation metric: correctness
Example instance: "In workflow 'MLTrainingAndEvaluation', is there a control flow link from 'MLModelEvaluation' to 'MLModelTraining'?"
Rationale: Can the LLM understand flow links between tasks?
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
Architecture: Workflow
Question: In workflow $W$, does $T_2$ directly follow $T_1$ in $F$?
Reference answer: yes
Evaluation metric: correctness
Example instance: "In workflow 'MLTrainingAndEvaluation', does 'MLModelEvaluation' directly follow 'MLModelTraining' in control flow?"
Rationale: Can the LLM understand flow links between tasks? (negative test)
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
Architecture: Workflow
Question: In workflow $W$, does $T_2$ directly follow $T_1$ in $F$?
Reference answer: no
Evaluation metric: correctness
Example instance: "In workflow 'MLTrainingAndEvaluation', does 'MLModelTraining' directly follow 'MLModelEvaluation' in control flow?"
Rationale: Can the LLM understand flow links between tasks?
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
- $T_0$: task in $W$
- $T_1, \dots, T_N$: tasks in $W$
Architecture: Workflow
Question: In workflow $W$, which tasks come directly after $T_0$ in $F$?
Reference answer: the set of tasks $\{T_1, \dots, T_N\}$ (depends on the parameter substitution for the test instance)
Evaluation metric: Jaccard index
Example instance: "In workflow 'MainWorkflow', which tasks come directly after 'Task2' in the control flow?", reference answer: { "Task1", "Task3" } (block of parallel tasks)
Rationale: Can the LLM find cycles in the flow?
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $C$: length of the cycle
- $W$: workflow name
Architecture: Workflow
Question: In workflow $W$, is there a cycle in flow $F$?
Reference answer: yes
Evaluation metric: correctness
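The reference answer for the cycle patterns can be derived mechanically from the flow graph. A sketch using depth-first search with the usual three-color marking (the flow graphs below are invented examples):

```python
def has_cycle(edges: dict) -> bool:
    """Detect a cycle in a directed flow graph (task -> successor tasks)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in edges}
    for succs in edges.values():
        for s in succs:
            color.setdefault(s, WHITE)

    def visit(node) -> bool:
        color[node] = GRAY                    # node is on the current path
        for succ in edges.get(node, []):
            if color[succ] == GRAY:           # back edge -> cycle found
                return True
            if color[succ] == WHITE and visit(succ):
                return True
        color[node] = BLACK                   # fully explored
        return False

    return any(color[n] == WHITE and visit(n) for n in list(color))

# Hypothetical flow: Task1 -> Task2 -> Task3 -> Task1 (a cycle of length 3)
print(has_cycle({"Task1": ["Task2"], "Task2": ["Task3"], "Task3": ["Task1"]}))  # True
print(has_cycle({"Task1": ["Task2"], "Task2": ["Task3"]}))  # False
```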
Rationale: Can the LLM find cycles in the flow? (negative test)
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
Architecture: Workflow
Question: In workflow $W$, is there a cycle in flow $F$?
Reference answer: no
Evaluation metric: correctness
Rationale: Can the LLM understand task hierarchy (if the architecture is hierarchical)?
Parameters:
- $W$: workflow name
- $T$: task in the workflow $W$ that is composite (has sub-tasks)
- $S$: sub-task of $T$
Architecture: Workflow
Question: Is task $S$ a part of task $T$ (from workflow $W$)?
Reference answer: yes
Evaluation metric: correctness
Example instance: "Is task 'HyperparameterProposal' a part of task 'HyperparameterOptimization' (from workflow 'FailurePredictionInManufacture')?"
Rationale: Can the LLM understand task hierarchy (if the architecture is hierarchical)? (negative test)
Parameters:
- $W$: workflow name
- $T$: task in the workflow $W$ (it might be composite)
- $S$: a different task from $W$ (or another workflow) that is not a sub-task of $T$
Architecture: Workflow
Question: Is task $S$ a part of task $T$ (from workflow $W$)?
Reference answer: no
Evaluation metric: correctness
Example instance: "Is task 'DataRetrieval' a part of task 'HyperparameterOptimization' (from workflow 'FailurePredictionInManufacture')?"
Rationale: Can the LLM detect infinite recursion in references (e.g., sub-workflows if the architecture is hierarchical)?
Parameters:
- $W$: workflow name
Architecture: Workflow
Question: In workflow $W$, is there infinite recursion in the references?
Reference answer: yes
Evaluation metric: correctness
Note: The recursion might also be more complicated than just a simple self-reference, i.e., an indirect cycle through a chain of references.
Rationale: Can the LLM detect infinite recursion in references (e.g., sub-workflows if the architecture is hierarchical)? (negative test)
Parameters:
- $W$: workflow name
Architecture: Workflow
Question: In workflow $W$, is there infinite recursion in the references?
Reference answer: no
Evaluation metric: correctness
Rationale: Can the LLM understand dependencies (in the flow)?
Parameters:
- $W$: workflow name
- $E$: entity (e.g., task, data) in the workflow
- $T$: task in the workflow that depends on $E$ (through e.g., control flow, data flow)
Architecture: Workflow
Question: In workflow $W$, does task $T$ depend on $E$?
Reference answer: yes
Evaluation metric: correctness
Example instances:
- "In workflow 'MLTrainingAndEvaluation' does task 'MLModelEvaluation' depend on task 'MLModelTraining'?"
- "In workflow 'MLTrainingAndEvaluation' does task 'MLModelTraining' depend on data 'TrainingData'?"
Rationale: Can the LLM understand dependencies (in the flow)? (negative test)
Parameters:
- $W$: workflow name
- $E$: entity (e.g., task, data) in the workflow
- $T$: task in the workflow that does not depend on $E$ (through e.g., control flow, data flow)
Architecture: Workflow
Question: In workflow $W$, does task $T$ depend on $E$?
Reference answer: no
Evaluation metric: correctness
Example instances:
- "In workflow 'MLTrainingAndEvaluation' does task 'MLModelTraining' depend on task 'MLModelEvaluation'?"
- "In workflow 'MLTrainingAndEvaluation' does task 'MLModelTraining' depend on data 'TestData'?"
Rationale: Can the LLM list dependencies (in the flow)?
Parameters:
- $W$: workflow name
- $E$: entity type (e.g., task, data)
- $E_1, \dots, E_K$: entities (of type $E$) in the workflow
- $T$: task in the workflow that depends on $E_1, \dots, E_K$
Architecture: Workflow
Question: List all entities of type $E$ that task $T$ (from workflow $W$) depends on.
Reference answer: $\{E_1, \dots, E_K\}$ (depends on the parameter substitution for the test instance)
Evaluation metric: Jaccard index
Example instances:
- "List all data that task 'MLModelEvaluation' (from workflow 'MLTrainingAndEvaluation') depends on.", reference answer: { MLModel, TestFeatures }
- "List all tasks that must run before task 'MLModelEvaluation' in workflow 'MLTrainingAndEvaluation'.", reference answer: { FeatureExtraction, ModelTraining }
Rationale: Can the LLM detect which task produces the given data?
Parameters:
- $W$: workflow name
- $D$: data (or other entity) in the workflow
- $T$: task which produces $D$ as its output
Architecture: Workflow
Question: In workflow $W$, which task produces $D$?
Reference answer: $T$
Evaluation metric: correctness
Example instance: "In workflow 'MLTrainingAndEvaluation' which task produces 'MLModel'?", reference answer: 'MLModelTraining'
Rationale: Can the LLM determine if a trace of tasks can occur?
Parameters:
- $W$: workflow name
- $T_1, \dots, T_K$: tasks in the workflow $W$
- $S$: situation (e.g., parameter values)
- $R_1, \dots, R_L$ ($L \le K$): tasks that will run when the workflow is executed with initial situation $S$
Architecture: Workflow
Question: Can the trace of tasks $R_1, \dots, R_L$ occur in workflow $W$?
Reference answer: yes
Evaluation metric: correctness
Example instance: "Can the trace of tasks 'FeatureExtraction', 'MLModelTraining', 'MLModelEvaluation' occur in workflow 'MLTrainingAndEvaluation'?"
Rationale: Can the LLM determine if a trace of tasks can occur? (negative test)
Parameters:
- $W$: workflow name
- $T_1, \dots, T_K$: tasks in the workflow $W$
- $R_1, \dots, R_L$: tasks (from $W$ or a different workflow)
Architecture: Workflow
Question: Can the trace of tasks $R_1, \dots, R_L$ occur in workflow $W$?
Reference answer: no
Evaluation metric: correctness
Example instance: "Can the trace of tasks 'TrainTestSplit', 'MLModelEvaluation', 'MLModelTraining' occur in workflow 'MLTrainingAndEvaluation'?"
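Assuming no conditional flow and each task running at most once, a trace is feasible exactly when it lists workflow tasks in an order consistent with the flow links; a simplified reference check under those assumptions (task and edge names are illustrative):

```python
def trace_can_occur(trace: list, edges: list, tasks: set) -> bool:
    """A trace is feasible if every task belongs to the workflow and no
    flow edge (a, b) is violated, i.e., b never runs before a.
    Simplification: no conditional flow, each task runs at most once."""
    if not set(trace) <= tasks:
        return False
    position = {task: i for i, task in enumerate(trace)}
    for a, b in edges:
        if a in position and b in position and position[b] < position[a]:
            return False
    return True

# Hypothetical linear flow: TrainTestSplit -> MLModelTraining -> MLModelEvaluation
tasks = {"TrainTestSplit", "MLModelTraining", "MLModelEvaluation"}
edges = [("TrainTestSplit", "MLModelTraining"),
         ("MLModelTraining", "MLModelEvaluation")]
print(trace_can_occur(["TrainTestSplit", "MLModelTraining", "MLModelEvaluation"],
                      edges, tasks))  # True
print(trace_can_occur(["TrainTestSplit", "MLModelEvaluation", "MLModelTraining"],
                      edges, tasks))  # False (evaluation before training)
```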
Rationale: Can the LLM understand the order of tasks in the workflow without conditional flow?
Parameters:
- $W$: workflow name
- $T_1, \dots, T_K$: tasks in the workflow $W$
Architecture: Workflow
Question: List all the tasks in workflow $W$ in the order in which they run.
Reference answer: the sequence $T_1, \dots, T_K$ (in execution order)
Evaluation metric: Damerau–Levenshtein distance (note: special care must be given to the order of parallel tasks)
Example instance: "List all the tasks in workflow 'MLTrainingAndEvaluation' in order in which they run.", reference answer: FeatureExtraction, MLModelTraining, MLModelEvaluation
Rationale: Can the LLM understand conditional flow guards?
Parameters:
- $F$: flow type (e.g., control flow)
- $W$: workflow name
- $T_0, T_1, T_2$: tasks in the workflow
- $C_1, C_2$: conditions for conditional links (in flow $F$) that are mutually exclusive
Architecture: Workflow
Question: Are the conditional links in flow $F$ going from task $T_0$ (in workflow $W$) mutually exclusive?
Reference answer: yes
Evaluation metric: correctness
Example instance: Workflow with a parameter $p$ and conditional links from $T_0$ guarded by mutually exclusive conditions (e.g., $p > 0$ and $p \le 0$)
Rationale: Can the LLM understand conditional flow guards? (negative test)
Parameters:
-
$F$ : flow type (e.g., control flow) -
$W$ : workflow name -
$T_0, T_1, T_2$ : tasks in the workflow -
$C_1, C_2$ : conditions for conditional links (in flow$F$ ) that are not mutually exclusive
Architecture: Workflow
Question: Are the conditional links in flow $F$ going from task $T_0$ (in workflow $W$) mutually exclusive?
Reference answer: no
Evaluation metric: correctness
Example instance: Workflow with a parameter $p$ and conditional links from $T_0$ guarded by conditions that are not mutually exclusive (e.g., $p > 0$ and $p < 10$)
Rationale: Can the LLM understand conditional flow guards?
Parameters:
- $F$: flow type (e.g., control flow)
- $W$: workflow name
- $T$: task in the workflow $W$
Architecture: Workflow
Question: Will the task $T$ run in every execution of workflow $W$?
Reference answer: yes
Evaluation metric: correctness
Example instance: TODO
Rationale: Can the LLM evaluate conditional flow?
Parameters:
- $F$: flow type (e.g., control flow)
- $W$: workflow name
- $T_0, \dots, T_K$: tasks in the workflow
- $C_1, \dots, C_K$: conditions for conditional links (in flow $F$)
- $S$: situation (e.g., parameter values)
Architecture: Workflow
Question: Given the situation $S$, which task will follow $T_0$ in flow $F$ in workflow $W$?
Reference answer: the task $T_i$ whose condition $C_i$ holds in situation $S$
Evaluation metric: correctness
Example instance: "Which task will follow 'HyperparameterProposal' in control flow in workflow 'HyperparameterOptimization'?"
Rationale: Can the LLM understand the order of tasks in the workflow with conditional flow?
Parameters:
- $W$: workflow name
- $T_1, \dots, T_K$: tasks in the workflow $W$
- $S$: situation (e.g., parameter values)
- $R_1, \dots, R_L$ ($L \le K$): tasks that will run when the workflow is executed with initial situation $S$
Architecture: Workflow
Question: Given the initial situation $S$, list all the tasks that will run when workflow $W$ is executed, in the order in which they will run.
Reference answer: the sequence $R_1, \dots, R_L$ (in execution order)
Evaluation metric: Damerau–Levenshtein distance (note: special care must be given to the order of parallel tasks)
Example instance: "Given the initial situation p=0, list all the tasks that will run when workflow 'Workflow3' is executed in order in which they will run.", reference answer: Task7, Task8
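Reference traces for the conditional-flow patterns can be produced by a small interpreter that walks the flow and evaluates the guards against the situation $S$. A toy sketch, assuming sequential flow with at most one satisfied guard per branch point; the 'Task7'/'Task8' guards are hypothetical, chosen only to mirror the example above:

```python
def execute(start, links, situation):
    """Walk the control flow from `start`, taking the first outgoing link
    whose guard holds in `situation` (guard None = unconditional link).
    Simplification: sequential flow only, no parallel branches."""
    trace, current = [], start
    while current is not None:
        trace.append(current)
        current = next(
            (target for guard, target in links.get(current, [])
             if guard is None or guard(situation)),
            None)  # no satisfiable outgoing link -> the run ends
    return trace

# Hypothetical guards: with p = 0 the p > 0 branch is skipped,
# so only 'Task7' and 'Task8' run.
links = {
    "Task7": [(lambda s: s["p"] > 0, "Task9"),
              (lambda s: s["p"] <= 0, "Task8")],
}
print(execute("Task7", links, {"p": 0}))  # ['Task7', 'Task8']
```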
In the source code and raw results, these patterns are labeled semantics.
Rationale: Can the LLM detect inconsistent task name and description?
Parameters:
- $W$: workflow name
- $T$: task name
- $D_T$: task description that is inconsistent with name $T$
Architecture: Workflow
Question: Identify potential errors in the specification of workflow $W$.
Reference answer: The description of task $T$ is inconsistent with its name (depends on the test instance)
Evaluation metric: ROUGE
Example instance: Task named 'BinaryClassificationModelTraining' with description 'Training of a regression ML model'
Rationale: Can the LLM detect inconsistent descriptions of workflow and tasks?
Parameters:
- $W$: workflow name
- $D_W$: workflow description
- $D_T$: task description that is inconsistent with $D_W$
Architecture: Workflow containing a task with inconsistent description.
Question: Identify potential errors in the specification of workflow $W$.
Reference answer: description of the inconsistency (depends on the test instance)
Evaluation metric: ROUGE
Example instance: A workflow specifying an ML pipeline where the ML goal is said to be "binary classification" in the workflow description. At the same time, the tasks perform training of a "regression" ML model (which is inconsistent with "binary classification").
Rationale: Can the LLM detect tasks that are clearly in wrong order (semantically)?
Parameters:
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
- $F$: flow type (e.g., control flow)
Architecture: Workflow
Question: Identify potential errors in the specification of workflow $W$.
Reference answer: Tasks $T_1$ and $T_2$ are in the wrong order in flow $F$ (depends on the test instance)
Evaluation metric: ROUGE
Example instance: Workflow with task 'MLModelTraining' after 'MLModelEvaluation'.
Rationale: Can the LLM understand meaning of tasks (e.g., task performs an operation that is not directly mentioned in the name)?
Parameters:
- $W$: workflow name
- $P$: property of task (e.g., the task is part of data preprocessing)
- $T_1$: task in $W$ with property $P$
- $T_2$: task in $W$
Architecture: Workflow
Question: Does a task with property $P$ run before $T_2$?
Reference answer: yes
Evaluation metric: correctness
Example instance: Task 'FeatureExtraction' is labeled as "data preprocessing" and it precedes 'MLModelTraining'; question: "Does a data preprocessing task run before 'MLModelTraining'?"
Note: Other variants of this pattern can also be created, e.g., "Does a task with property $P$ run after $T_2$?"
Rationale: Can the LLM understand meaning of tasks (e.g., task performs an operation that is not directly mentioned in the name)? (negative test)
Parameters:
- $W$: workflow name
- $P$: property of task (e.g., the task is part of data preprocessing)
- $T_2$: task in $W$
Architecture: Workflow
Question: Does a task with property $P$ run before $T_2$?
Reference answer: no
Evaluation metric: correctness
Example instance: "Does a data preprocessing task run before 'DataRetrieval'?" (some of the later tasks might be labeled as "data preprocessing")
Note: Other variants of this pattern can also be created, e.g., "Does a task with property $P$ run after $T_2$?"
- correctness: $1$ if the LLM's answer equals the reference answer, $0$ otherwise
- Jaccard index: the size of the intersection divided by the size of the union of the sets (LLM's answer set and the reference answer set)
- Damerau–Levenshtein distance: edit distance between two sequences allowing insertions, deletions, substitutions, and transposition (swap) of two adjacent elements
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): the word overlap of the reference answer and the LLM output
- BERTScore: the cosine similarity of word embeddings (that capture the meaning of words)
- manual: manual assessment of the LLM's output
- LLM as a judge (e.g., G-Eval): use another LLM to assess the LLM's output
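For the sequence-valued patterns, the Damerau–Levenshtein distance above can be computed with the standard optimal-string-alignment recurrence (a common restricted variant of the metric); a sketch over task-name sequences:

```python
def damerau_levenshtein(a: list, b: list) -> int:
    """Optimal-string-alignment variant of the Damerau-Levenshtein
    distance: insertions, deletions, substitutions, and transpositions
    of adjacent elements, each with cost 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# One adjacent swap separates the reference order from the LLM's order:
reference = ["FeatureExtraction", "MLModelTraining", "MLModelEvaluation"]
swapped = ["FeatureExtraction", "MLModelEvaluation", "MLModelTraining"]
print(damerau_levenshtein(reference, swapped))  # 1
```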