Is it possible to generate a dataset based on existing documents using Ollama, OpenAI? #229
-
I am working on automating the generation of structured datasets using Kiln AI's TaskRun. My goal is to take a summarized text input and generate an expected dataset in the format shown in the UI (see attached image). My task instruction is to create a dataset from existing documents. In the UI, I can directly input a text and generate a single dataset entry like this. I want to generate a whole dataset this way, so I need to automate the process. I referred to the Kiln AI core docs and found that I can create multiple TaskRun instances in a loop. However, when using the following code to generate a single TaskRun, I am unable to leverage an LLM to produce the expected dataset from the summarized text.
Expected Outcome:
Request for Help:
Any help or suggestions would be greatly appreciated!
Replies: 5 comments
-
You probably should be creating Tasks, and using them to produce TaskRuns. Creating TaskRuns directly would only be better if you already had the results. Below is an example pulled from the tests you can modify to do this. Roughly:
1. Create a Task (see build_structured_output_test_task for example). Saving it is optional.
2. Create an adapter: adapter = adapter_for_task(task, model_name=model_name, provider=provider)
3. Run it: await adapter.invoke("input"). Note that the example calls invoke_returning_raw, which doesn't produce a TaskRun.
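To illustrate the steps above end to end (create the task once, build one adapter, invoke it once per document so each call yields one dataset row), here is a minimal, self-contained sketch. The `FakeAdapter` class is a hypothetical stand-in for whatever `adapter_for_task(...)` returns; everything except the overall shape of the loop is an assumption, not Kiln AI's actual API.

```python
import asyncio


class FakeAdapter:
    """Hypothetical stand-in for the adapter returned by adapter_for_task();
    a real Kiln adapter would call an LLM via Ollama/OpenAI instead."""

    async def invoke(self, text: str) -> dict:
        # Pretend each invocation produces one structured dataset row.
        return {"input": text, "output": text.upper()}


async def build_dataset(documents: list[str]) -> list[dict]:
    # In Kiln this would be: adapter_for_task(task, model_name=..., provider=...)
    adapter = FakeAdapter()
    rows = []
    for doc in documents:
        # One invocation per document; in Kiln each invoke yields a TaskRun.
        rows.append(await adapter.invoke(doc))
    return rows


if __name__ == "__main__":
    docs = ["first summary", "second summary"]
    for row in asyncio.run(build_dataset(docs)):
        print(row)
```

The point of the loop is that the Task and adapter are created once, outside the loop, while only the invocation repeats per document.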
-
Thank you @scosman for your answer. I will try this and update here.
-
I followed your guidance and everything is working now. I've updated the function to use my existing task and project, and here's an example function that demonstrates it successfully:

```python
import asyncio

import kiln_ai.datamodel as datamodel
from kiln_ai.datamodel import Project
from kiln_ai.adapters.model_adapters.langchain_adapters import LangchainAdapter


async def run_existing_project_and_task(
    project_path: str, task_path: str, model_name: str, provider: str
) -> None:
    """
    Loads an existing project and task from file, creates an adapter,
    and invokes the task with a test input.

    Args:
        project_path (str): Path to the project file.
        task_path (str): Path to the task file.
        model_name (str): The model's name.
        provider (str): The provider name.
    """
    # Load the existing project and print details
    project = Project.load_from_file(project_path)
    print("Project:", project.name, "-", project.description)

    # Load the existing task from file
    task = datamodel.Task.load_from_file(task_path)

    # Create the LangchainAdapter with the loaded task.
    # Make sure to pass None for custom_model if you're not using one.
    adapter = LangchainAdapter(
        task, custom_model=None, model_name=model_name, provider=provider
    )

    # Provide an appropriate input for your task; adjust as necessary.
    test_input = (
        "Language is a system of communication that uses sounds, gestures, "
        "or written symbols to convey meaning"
    )

    # Invoke the task and print the result
    result = await adapter.invoke_returning_raw(test_input)
    print("Task output:", result)


if __name__ == "__main__":
    # Specify the paths for the project and task files
    project_file = "your_project_location/project.kiln"
    task_file = "your_project_location/task_location/task.kiln"
    model_name = "phi4"  # modify based on your model
    provider = "ollama"

    # Run the function in the event loop
    asyncio.run(
        run_existing_project_and_task(project_file, task_file, model_name, provider)
    )
```
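To get from this single invocation back to the original goal, a whole dataset, the same pattern can be repeated over many summarized inputs and each result persisted, for example as one JSON line per run. The sketch below is self-contained and hedged: `fake_invoke` is a hypothetical stub standing in for a real `adapter.invoke_returning_raw(...)` call, and the JSONL layout is just one reasonable choice, not Kiln AI's storage format.

```python
import asyncio
import json


async def fake_invoke(text: str) -> dict:
    # Hypothetical stub for an adapter call; a real implementation
    # would send `text` to the LLM and return its structured output.
    return {"input": text, "output": f"structured row for: {text}"}


async def generate_dataset(inputs: list[str], out_path: str) -> int:
    """Invoke the (stub) adapter once per input, writing one JSON line per row."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for text in inputs:
            row = await fake_invoke(text)
            f.write(json.dumps(row) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    n = asyncio.run(generate_dataset(["doc one", "doc two"], "dataset.jsonl"))
    print(f"wrote {n} rows")
```

Writing rows incrementally like this also means a crash partway through a long run loses only the rows not yet written, rather than the whole batch.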
-
That works. I'd suggest using the adapter_for_task method to create adapters; hard-coding the Langchain adapter isn't guaranteed to work forever, as we're moving away from Langchain. adapter_for_task will get you the right adapter for the model.
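The reasoning behind this suggestion is a standard factory pattern: callers ask a registry for an adapter by provider name instead of hard-coding one concrete class, so the library can swap implementations without breaking user code. Here is a toy, self-contained illustration of that idea; all class and function names are made up for the example and are not Kiln AI's real API.

```python
# Toy registry mirroring the adapter_for_task idea: the caller names a
# provider, and the factory picks the matching adapter class for them.
class OllamaAdapter:
    def invoke(self, text: str) -> str:
        return f"[ollama] {text}"


class OpenAIAdapter:
    def invoke(self, text: str) -> str:
        return f"[openai] {text}"


_REGISTRY = {"ollama": OllamaAdapter, "openai": OpenAIAdapter}


def adapter_for_provider(provider: str):
    """Factory: return a fresh adapter for the named provider."""
    try:
        return _REGISTRY[provider]()
    except KeyError:
        raise ValueError(f"unknown provider: {provider}")


print(adapter_for_provider("ollama").invoke("hi"))  # prints "[ollama] hi"
```

If the library later replaces `OllamaAdapter` with a different implementation, only the registry entry changes; every caller of the factory keeps working unchanged, which is exactly why hard-coding `LangchainAdapter` is fragile.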
-
Okay, got it. I have modified the method as you suggested. Thank you so much for the guidance. :)