Is it possible to generate a dataset based on existing documents using Ollama, OpenAI? #229
-
I am working on automating the generation of structured datasets using Kiln AI's TaskRun. My goal is to take a summarized text input and generate an expected dataset in the format shown in the UI (see attached image). My task instruction is to create a dataset from existing documents. In the UI, I can directly input a text and generate a single dataset entry like this. I want to generate a whole dataset this way, so I need to automate the process. I referred to the Kiln AI core docs and found that I can create multiple TaskRun instances in a loop. However, when using the following code to generate a single TaskRun, I am unable to leverage an LLM to produce the expected dataset from the summarized text.
Expected Outcome:
Request for Help:
Any help or suggestions would be greatly appreciated!
Replies: 5 comments
-
You probably should be creating Tasks, and using them to produce TaskRuns. Creating TaskRuns directly would only be better if you already had the results. Below is an example pulled from the tests you can modify to do this. Roughly:
1. Create a Task (see build_structured_output_test_task for example). Saving it is optional.
2. Create an adapter: adapter = adapter_for_task(task, model_name=model_name, provider=provider)
3. Run it: await adapter.invoke("input"). Note that the example calls invoke_returning_raw, which doesn't produce a TaskRun.
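To illustrate the steps above end to end (create the task once, build one adapter, invoke it once per document so each call yields one dataset row), here is a minimal, self-contained sketch. The `FakeAdapter` class is a hypothetical stand-in for whatever `adapter_for_task(...)` returns; everything except the overall shape of the loop is an assumption, not Kiln AI's actual API.

```python
import asyncio


class FakeAdapter:
    """Hypothetical stand-in for the adapter returned by adapter_for_task();
    a real Kiln adapter would call an LLM via Ollama/OpenAI instead."""

    async def invoke(self, text: str) -> dict:
        # Pretend each invocation produces one structured dataset row.
        return {"input": text, "output": text.upper()}


async def build_dataset(documents: list[str]) -> list[dict]:
    # In Kiln this would be: adapter_for_task(task, model_name=..., provider=...)
    adapter = FakeAdapter()
    rows = []
    for doc in documents:
        # One invocation per document; in Kiln each invoke yields a TaskRun.
        rows.append(await adapter.invoke(doc))
    return rows


if __name__ == "__main__":
    docs = ["first summary", "second summary"]
    for row in asyncio.run(build_dataset(docs)):
        print(row)
```

The point of the loop is that the Task and adapter are created once, outside the loop, while only the invocation repeats per document.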
-
Thank you @scosman for your answer. I will try this and update here.
-
I followed your guidance and everything is working now. I've updated the function to use my existing task and project, and here's an example function that demonstrates it successfully:

```python
import asyncio

import kiln_ai.datamodel as datamodel
from kiln_ai.datamodel import Project
from kiln_ai.adapters.model_adapters.langchain_adapters import LangchainAdapter


async def run_existing_project_and_task(
    project_path: str, task_path: str, model_name: str, provider: str
) -> None:
    """
    Loads an existing project and task from file, creates an adapter,
    and invokes the task with a test input.

    Args:
        project_path (str): Path to the project file.
        task_path (str): Path to the task file.
        model_name (str): The model's name.
        provider (str): The provider name.
    """
    # Load the existing project and print details
    project = Project.load_from_file(project_path)
    print("Project:", project.name, "-", project.description)

    # Load the existing task from file
    task = datamodel.Task.load_from_file(task_path)

    # Create the LangchainAdapter with the loaded task.
    # Make sure to pass None for custom_model if you're not using one.
    adapter = LangchainAdapter(
        task, custom_model=None, model_name=model_name, provider=provider
    )

    # Provide an appropriate input for your task; adjust as necessary.
    test_input = (
        "Language is a system of communication that uses sounds, gestures, "
        "or written symbols to convey meaning"
    )

    # Invoke the task and print the result
    result = await adapter.invoke_returning_raw(test_input)
    print("Task output:", result)


if __name__ == "__main__":
    # Specify the paths for the project and task files
    project_file = "your_project_location/project.kiln"
    task_file = "your_project_location/task_location/task.kiln"
    model_name = "phi4"  # modify based on your model
    provider = "ollama"

    # Run the function in the event loop
    asyncio.run(
        run_existing_project_and_task(project_file, task_file, model_name, provider)
    )
```
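To get from this single invocation back to the original goal, a whole dataset, the same pattern can be repeated over many summarized inputs and each result persisted, for example as one JSON line per run. The sketch below is self-contained and hedged: `fake_invoke` is a hypothetical stub standing in for a real `adapter.invoke_returning_raw(...)` call, and the JSONL layout is just one reasonable choice, not Kiln AI's storage format.

```python
import asyncio
import json


async def fake_invoke(text: str) -> dict:
    # Hypothetical stub for an adapter call; a real implementation
    # would send `text` to the LLM and return its structured output.
    return {"input": text, "output": f"structured row for: {text}"}


async def generate_dataset(inputs: list[str], out_path: str) -> int:
    """Invoke the (stub) adapter once per input, writing one JSON line per row."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for text in inputs:
            row = await fake_invoke(text)
            f.write(json.dumps(row) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    n = asyncio.run(generate_dataset(["doc one", "doc two"], "dataset.jsonl"))
    print(f"wrote {n} rows")
```

Writing rows incrementally like this also means a crash partway through a long run loses only the rows not yet written, rather than the whole batch.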
-
That works. I'd suggest using the adapter_for_task method to create adapters; hard-coding the Langchain adapter isn't guaranteed to work forever, as we're moving away from Langchain. adapter_for_task will get you the right adapter for the model.
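The reasoning behind this suggestion is a standard factory pattern: callers ask a registry for an adapter by provider name instead of hard-coding one concrete class, so the library can swap implementations without breaking user code. Here is a toy, self-contained illustration of that idea; all class and function names are made up for the example and are not Kiln AI's real API.

```python
# Toy registry mirroring the adapter_for_task idea: the caller names a
# provider, and the factory picks the matching adapter class for them.
class OllamaAdapter:
    def invoke(self, text: str) -> str:
        return f"[ollama] {text}"


class OpenAIAdapter:
    def invoke(self, text: str) -> str:
        return f"[openai] {text}"


_REGISTRY = {"ollama": OllamaAdapter, "openai": OpenAIAdapter}


def adapter_for_provider(provider: str):
    """Factory: return a fresh adapter for the named provider."""
    try:
        return _REGISTRY[provider]()
    except KeyError:
        raise ValueError(f"unknown provider: {provider}")


print(adapter_for_provider("ollama").invoke("hi"))  # prints "[ollama] hi"
```

If the library later replaces `OllamaAdapter` with a different implementation, only the registry entry changes; every caller of the factory keeps working unchanged, which is exactly why hard-coding `LangchainAdapter` is fragile.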
-
Okay, got it. I have modified the method as you suggested. Thank you so much for the guidance. :)