
[DMP 2024] Generate Job expressions #620

Open

christad92 opened this issue Mar 5, 2024 · 30 comments
Labels
DMP 2024 Submission for DMP

Comments

@christad92

christad92 commented Mar 5, 2024

Overview

When building workflows, users spend most of their time writing simple to advanced jobs on OpenFn. We'd like to harness AI to write job expressions based on English-language requirements.

This feature should generate a job expression given some sample input data, the adaptor specification (name and version), and a text description of the desired output or instruction. The generated job expression should be executable in the CLI.

Deliverables

  • A new service in the apollo repo called gen_job
  • The service should receive a set of inputs as JSON and return JSON containing a string expression

We are looking for a new Python module to be implemented in Apollo. This can be called from the existing openfn apollo command, but has no direct CLI implementation.

Inputs

We expect the following JSON payload to be submitted to the endpoint:

{
	"expression": "the user's existing js expression",
	"adaptor": "@openfn/[email protected]",
	"state": { /* input state */ },
	"instruction": "a natural language command"
}

All of these inputs should be considered optional - if any are excluded, the generation should continue as best it can.

We may also require metadata to be included with this request, such as which model to use, parameters to drive the model, and an API key.
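
As a sketch of how the service might normalise this payload, treating every field as optional (the dataclass and helper names below are illustrative assumptions, not an agreed interface):

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GenJobPayload:
    expression: Optional[str] = None   # the user's existing job expression, if any
    adaptor: Optional[str] = None      # adaptor specifier (name and version)
    state: dict = field(default_factory=dict)   # sample input state
    instruction: Optional[str] = None  # natural language command
    api_key: Optional[str] = None      # optional request metadata


def parse_payload(raw: dict) -> GenJobPayload:
    """Build a payload from the request JSON, tolerating missing keys."""
    known = set(GenJobPayload.__dataclass_fields__)
    return GenJobPayload(**{k: v for k, v in raw.items() if k in known})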

Output

The service should return the expression nested in a JSON object:

{
	"expression": "fn(s => s)"
}

Note that the expression string should be pure code, suitable for inserting into a code editor. No natural language, no markdown annotations of any kind.
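
Since LLM responses often arrive wrapped in Markdown fences or surrounding prose, the service will likely need a cleanup step before returning the expression. A minimal sketch, assuming a simple regex is enough (the helper name is an assumption):

import re

# Strip a surrounding Markdown code fence, if present, so only raw code is returned.
FENCE_RE = re.compile(r"^```[\w-]*\n(.*?)\n```\s*$", re.DOTALL)


def extract_code(model_output: str) -> str:
    text = model_output.strip()
    match = FENCE_RE.match(text)
    return match.group(1).strip() if match else text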

Implementation Notes

Highly valuable to use, and likely critical to this work, are the following two issues:

Once a basic job expression generator has been created, it may be wise to implement and integrate these issues.

Sample Inputs

  1. One or more sample inputs (valid JSON) which can serve as the initial state for the job
{
  "data": {
    "name": "bukayo saka",
    "gender": "male"
  }
}
  2. The adaptor specification will be in the form of “@openfn/[email protected]” or “@openfn/[email protected]” (a parsing sketch follows this list)
  3. The text instructions will be in the form of: “Create a new object based on the patient object, and set its status attribute to ‘enrolled’” or “Create a trackedEntityInstance record in DHIS2 using the data from state.person”, for example:
Create a new trackedEntityInstance "person" in dhis2 for the "dWOAzMcK2Wt" orgUnit.
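
Because the adaptor arrives as a single specifier string, the service will need to split it into a name and a version. A minimal sketch (the helper name is an assumption; specifiers without a version fall back to None):

from typing import Optional, Tuple


def parse_adaptor_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split a specifier like "@openfn/<name>@<version>" into (name, version)."""
    name, sep, version = spec.rpartition("@")
    if not sep or not name or "/" in version:
        # The only "@" is the scope prefix, so there is no version segment.
        return spec, None
    return name, version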

Sample Output

Given the inputs above, we'd expect the output code to be:

create('trackedEntityInstances', {
  orgUnit: "dWOAzMcK2Wt",
  trackedEntityType: 'nEenWmSyUEp',
  attributes: [
    { attribute: 'w75KJ2mc4zz', value: state.data.name.split(' ')[0] },
    { attribute: 'zDhUuAYrxNC', value: state.data.name.split(' ')[1] },
    { attribute: 'cejWyOfXge6', value: state.data.gender },
  ],
});

Background

OpenFn is an open source platform for data integration and workflow automation, accessible to users through a CLI or a web UI.

To use OpenFn, users build workflows which are made up of one or more steps—at the time of writing these are all JavaScript-based "jobs" (the JS code itself is called a "job expression"). These jobs use adaptors to perform their tasks, e.g. make a request to an API endpoint, update a record in a database, aggregate data, or send data to an external platform.

Here is an example of a job that uses the common adaptor to transform input data (in state) into a new object, transformedPatient:

fn(state => {
  const transformedPatient = { ...state.data.patient, status: "enrolled" }
  return { ...state, transformedPatient };
})

And here is another job expression that uses the dhis2 adaptor to create a new patient record referred to as trackedEntityInstance.

create('trackedEntityInstances', {
  orgUnit: "dWOAzMcK2Wt" /*Alkalia CHP*/,
  trackedEntityType: 'nEenWmSyUEp' /*Person*/,
  attributes: [
    { attribute: 'w75KJ2mc4zz', value: state.person.first_name },
    { attribute: 'zDhUuAYrxNC', value: state.person.last_name },
    { attribute: 'cejWyOfXge6', value: state.person.gender },
  ],
});

Learn more here about adaptors and how they are used in OpenFn workflows.

Documentation:

@christad92 christad92 added the DMP 2024 Submission for DMP label Mar 5, 2024
@github-project-automation github-project-automation bot moved this to New Issues in v2 Mar 5, 2024
@christad92 christad92 moved this from New Issues to Icebox in v2 Mar 5, 2024
@taylordowns2000 taylordowns2000 changed the title from "[DMP 2024] Generate Job expressions (expression.js) from 3 sample inputs and a desired output" to "[DMP 2024] Generate Job expressions" Mar 5, 2024
@josephjclark

This comment was marked as outdated.

@AbhimanyuSamagra

Do not ask process-related questions about how to apply and who to contact in the above ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, you can refer to the instructions listed on Unstop, and any further queries can be taken up on our Discord channel titled DMP queries. Here's a Video Tutorial on how to submit a proposal for a project.

@maverickcodex18

Hey mentors, please assign this job to me.
I am also a bit confused about the use of AI in this project.

@falgun143

Hello @christad92, can I work on this issue? I have strong experience with JavaScript and I have worked with large language models. Please let me know.

@falgun143

falgun143 commented May 2, 2024

@josephjclark I am highly interested in this project. These are some of my contributions to a GSoC organisation:
https://github.com/sugarlabs/musicblocks/commits/master/?author=falgun143

Can you please tell me how I should proceed? I cloned the repo and ran the command pnpm run setup and I see the error below.
[screenshot]
Should I first create an image using Docker and then run the command?

@christad92
Author

Hi @falgun143, thanks for your interest in this project. To be considered as a contributor, please apply through the Unstop platform. The mentors will then shortlist the best proposals and select a contributor.

I hope this helps.

@falgun143

@christad92 Is there a channel on the Discord under Code4GovTech? I am not able to find one. If there is a channel, please let me know.

@christad92
Author

@falgun143 I can confirm that and get back to you.

@falgun143

@christad92 Can you help me with the setup error above? Also, is it necessary to complete all the GitHub Classroom assignments, or is it enough to understand more about the project and submit the proposal? And are there any updates on the Discord channel?

@Saksham0303

Greetings @christad92,
I want to express my sincere interest in the development of this project, and I can assure you of my best dedication to it. I bring graphic design and UI/UX design skills, front-end development skills in ReactJS and JavaScript, and a passion for creating intuitive user experiences. My technical expertise, combined with a keen eye for design and functionality, positions me well to contribute effectively to the development of this project.

These are the approaches I have identified:

  1. Scripting Approach: This approach is paramount due to its ability to provide precise control over the generation process. By writing custom scripts, you can tailor the file generation logic to match the specific requirements of the task. This approach is highly adaptable and scalable, making it well-suited for handling diverse input-output scenarios efficiently.

  2. Graph-Based Approach: Represent the input-output relationships as a graph, where nodes represent the sample inputs and desired output, and edges denote the transformations between them. Use graph algorithms to traverse the graph and generate the job expression.js file based on the discovered paths and transformations.

  3. Evolutionary Algorithm Approach: Employ evolutionary algorithms, such as genetic algorithms or genetic programming, to evolve candidate solutions for the job expression.js file. Represent potential solutions as individuals in a population, and iteratively apply genetic operators (e.g., mutation, crossover) to produce offspring with improved fitness. Evaluate the fitness of each individual based on its ability to match the provided sample inputs and desired output, ultimately generating a high-quality job expression.js file.

Here is my Resume : https://drive.google.com/file/d/1e4cOxVAfIjehLf7LemzX4oxPFhWd4y4D/view?usp=drive_link

@DGRYZER

DGRYZER commented May 3, 2024

Hello,
My name is Debajyoti Ghosh. I am a Jr. Frontend Developer (fresher). I have studied the project description and I am sharing my thoughts on how to achieve it.
To implement this project concept, we can utilize a combination of natural language processing (NLP) techniques along with code generation capabilities. Here's a high-level solution outline:
Solution Outline:

  1. Input Parsing:
  • Parse the provided sample inputs (initial state for the job), adaptor specification, and text instructions.
  2. Natural Language Understanding (NLU):
  • Utilize NLP techniques to extract key information from the text instructions.
  • Identify entities such as object names, attributes, actions, and adaptor-specific details.
  3. Code Generation:
  • Generate JavaScript code based on the extracted information and sample inputs.
  • Ensure adherence to the conventions specified by the adaptor documentation.
  • Use string interpolation or templating to insert dynamic values from the sample inputs into the generated code (a minimal sketch of this idea appears after the Development Process list below).
  4. Integration with CLI:
  • Develop a CLI tool or integrate the solution with an existing CLI environment.
  • Accept input parameters from the CLI, including sample inputs, adaptor specification, and text instructions.
  • Output the generated job expression in a format compatible with OpenFn's CLI execution.
  5. Testing and Validation:
  • Implement testing procedures to ensure the generated job expressions meet the desired functionality and adhere to adaptor conventions.
  • Validate the generated code by executing it against sample inputs and verifying the output.
  6. Documentation and User Guidance:
  • Provide comprehensive documentation on how to use the tool, including CLI commands and input parameters.
  • Offer guidance on writing effective text instructions to maximize the accuracy of code generation.
  • Include examples and best practices for different use cases.
  7. Iterative Improvement:
  • Collect feedback from users and incorporate improvements to enhance the accuracy and efficiency of the code generation process.
  • Continuously update the tool to support new adaptor specifications and accommodate evolving requirements.
    Technology Stack:
  • JavaScript: For code generation and CLI implementation.
  • NLP Libraries (e.g., spaCy, NLTK): For natural language understanding and entity extraction.
  • Python (Optional): For preprocessing tasks or integration with NLP libraries.
  • CLI Frameworks (e.g., Commander.js, yargs): For building the command-line interface.
  • Testing Frameworks (e.g., Jest, Mocha): For automated testing of generated code.
    Development Process:
  1. Requirement Analysis: Understand the specific requirements and constraints of the project.
  2. Design and Architecture: Define the architecture of the solution and the workflow for code generation.
  3. Implementation: Develop the code generation logic, CLI interface, and integration with NLP components.
  4. Testing: Conduct thorough testing to ensure functionality and reliability.
  5. Documentation: Create user guides, API documentation, and tutorials.
  6. Deployment: Distribute the tool through appropriate channels and ensure ease of installation.
  7. Maintenance and Updates: Address bug fixes, incorporate feedback, and release updates as needed.
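
As a minimal sketch of the string-templating idea mentioned in the Code Generation step above (the template, names and fields are invented purely for illustration; a real generator would be driven by the adaptor and the extracted entities rather than a fixed template):

from string import Template

# Toy template for a dhis2-style create() call; purely illustrative.
JOB_TEMPLATE = Template(
    "create('$resource', {\n"
    "  orgUnit: \"$org_unit\",\n"
    "  attributes: $attributes,\n"
    "});"
)


def render_job(resource: str, org_unit: str, attributes: str) -> str:
    # `attributes` is expected to be pre-rendered JavaScript, e.g. "[{ ... }]".
    return JOB_TEMPLATE.substitute(
        resource=resource, org_unit=org_unit, attributes=attributes
    )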

By following this approach, we can create a robust solution that automates the generation of job expressions based on sample inputs and text instructions, thereby enhancing the efficiency of workflow development in OpenFn.

Thank You.
DEBAJYOTI GHOSH

@REC-1104

REC-1104 commented May 6, 2024

Hello @christad92 ,
The links for "how to write jobs?" and "writing jobs" under the Documentation sub-heading are not working.
[screenshot]

I want to know the conventions for writing jobs. Can you please help me?

@christad92
Author

@REC-1104 the link has been fixed. Thank you

@falgun143

Hello @christad92, I have emailed you my proposal. Please have a look at it and let me know of any changes so that I can finally submit it on the Unstop website.
@josephjclark I couldn't find your email anywhere. Can you please share it so that I can send you my proposal for review?

@Saksham0303

Greetings @christad92,
I've sent you my proposal via email. Could you please take a moment to review it and provide any feedback or suggestions for improvement? Once finalized, I'll be ready to submit it on the website.

@SatyamMattoo
Contributor

SatyamMattoo commented Jun 11, 2024

Hey @josephjclark @christad92,
I hope you're both doing well. I wanted to inform you that I have been selected for this DMP project. Over the past few days, I've been contemplating various approaches to solve this, and one of the most feasible options seems to be leveraging the existing Apollo (formerly Gen) repository. I noticed that you have set up the initial framework for making calls to Apollo services through the CLI.

To move forward, I propose creating a service for job generation using the following inputs. Users would provide these inputs via a .json file:

{
  "api_key": "apiKey",
  "adaptor": "@openfn/[email protected]",
  "data": {
    "name": "bukayo saka",
    "gender": "male"
  },
  "signature": "Create a new trackedEntityInstance 'person' in dhis2 for the 'dWOAzMcK2Wt' orgUnit."
}

The CLI command openfn apollo job_expression_generator tmp/data.json -o tmp/output.json would then be used to call the job generation service on the Apollo server and return the desired result.

For job generation on the server, we can create a job_expression_generator service. This service would parse inputs from the .json file and generate the required output. Below is a sample implementation:

from util import DictObj, createLogger

from .utils import (
    generate_job_prompt,
)

from inference import inference


logger = createLogger("job_expression_generator")


class Payload(DictObj):
    api_key: str
    adaptor: str
    signature: str
    data: dict


# Generate job expression based on the input data, adaptor specification, and instructions
def main(dataDict) -> str:
    data = Payload(dataDict)
    logger.info("Running job expression generator with adaptor {}".format(data.adaptor))
    result = generate(data.adaptor, data.signature, data.data, data.get("api_key"))
    logger.success("Job expression generation complete!")
    return result


def generate(adaptor_spec, instructions, sample_input, key) -> str:
    prompt = generate_job_prompt(adaptor_spec, instructions, sample_input)

    result = inference.generate("gpt3_turbo", prompt, {"key": key})

    return result

The prompt for this might look like:

prompts = {
    "job_expression": (
        "You are a helpful Javascript code assistant.",
        "Below is a description of a task along with the adaptor specification and sample input data. "
        "Generate a JavaScript job expression that performs the task described. Ensure the job expression "
        "follows the conventions defined in the adaptor documentation.\n\n"
        "Adaptor: {adaptor}\n"
        "Instructions: {signature}\n"
        "Sample Input: {sample_input}\n"
        "====",
    ),
}
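
For reference, one possible shape for the generate_job_prompt helper used above (a sketch only; it assumes the prompts dict shown here, and the real helper may differ):

import json


def generate_job_prompt(adaptor: str, signature: str, sample_input: dict) -> tuple:
    # Assumes the `prompts` dict defined above; fill its template with the request fields.
    system, user_template = prompts["job_expression"]
    user = user_template.format(
        adaptor=adaptor,
        signature=signature,
        sample_input=json.dumps(sample_input, indent=2),
    )
    return system, user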

For testing, we can run this with sample inputs from the CLI, write tests in the Apollo repo itself, or both.

I believe this approach aligns with what you're looking for. Could you please provide feedback on whether I am on the right track or suggest any improvements? Your guidance would be greatly appreciated.

PS: Apologies for bringing this up here, but I'm encountering some issues setting up the Apollo repo. I've used Docker to set up the repo locally for now, but the conventional path throws errors related to $PATH not being found (for Poetry). Apart from this, I would love to contribute towards the development of the Apollo services.

Best regards

@josephjclark
Collaborator

Hi @SatyamMattoo

Sorry for the late reply - and congrats! I'm delighted you've been chosen.

Unfortunately I'm tied up with various things this week and I can't get back to you right away. As you've seen a few things have changed since we put the issue up!

We're going to set up a kick off call late next week to go through this. I need to do a bit of planning beforehand. We'll be in touch soon to get that arranged, then we can let you loose!

@SatyamMattoo
Contributor

SatyamMattoo commented Jun 19, 2024

Weekly Learnings & Updates

Week 1

  • Implemented bind mounts using Docker to set up the Apollo repository.
  • Used basic commands, such as echo, to test the functionality of the CLI.
  • Gained introductory knowledge on RAG (Retrieval-Augmented Generation).

Week 2

  • Set up the repo locally without Docker.
  • Resolved Bun-related issues.
  • Delayed by exams.

Week 3

  • Learned about vector databases and LangChains to support RAG.
  • Gained knowledge about embedding models and various vector databases.
  • Introduced to Milvus.
  • Designed an approach to add the vector database to Apollo.

Week 4

  • Learned about Sentence Transformers and used it locally.
  • Implemented the Milvus vector database into Apollo.
  • Set up Milvus locally for a hardcoded list of strings.
  • Learned about Zilliz (cloud database for Milvus).
  • Studied various Docker concepts, including build secrets.

Week 5

  • Learnt more about Milvus.
  • Researched and implemented better approaches to embed the docs and the search queries.
  • Initially used the HuggingFace API to embed the corpus, but switched to OpenAI later.
  • Started exploring ways to embed the actual docs during the Docker build.

Week 6

  • Embedded actual docs from the OpenFn docs repo into the vector database.
  • Explored better ways to improve search results.

Week 7

  • Integration of search service and job generation service to add more context to the prompts.
  • Adding the adaptor description service to include adaptor info in the prompts along with relevant information from the docs.

@josephjclark
Collaborator

@SatyamMattoo Did you get your local environment set up?

I don't think you should be using the docker build locally; that's just going to make life hard for yourself. Just install Poetry etc. on your machine, per the instructions, and you'll have a much better dev experience.

You'll be the first person, so far as I know, to set up and run Apollo locally. So any feedback on the documentation and getting-started material would be much appreciated. Please raise issues (or even PRs) over there for anything you struggle with!

@SatyamMattoo
Contributor

Hey @josephjclark,

I attempted to set it up locally following the provided steps, but I encountered an issue where the virtual environment could not locate the $PATH to Poetry. After explicitly setting the path, it was unable to find the Python command. Both Poetry and Python are installed on my system and added to the virtual environment.

Despite following all the steps mentioned, there might be something I am missing. After spending two days trying to resolve these errors, I decided to use Docker bind mounts. It is working fine with Docker. While debugging, I noticed that there might be something missing in the documentation regarding the ENV PATH we add during the Dockerization of the repository.

Here is a screenshot of the error I received after running openfn apollo echo ./tmp/input.json -o output.json --local:

[screenshot]

@josephjclark
Collaborator

@SatyamMattoo What do you mean by "added to the virtual environment"? What operating system are you using?

What does poetry --version return from inside the apollo repo?

Can you run bun py echo tmp/test.json? You may need a simple json file at tmp/test.json, but a file not found error would suggest that your poetry installation is working

@SatyamMattoo
Contributor

SatyamMattoo commented Jun 22, 2024

Hey @josephjclark,
I am using Ubuntu. I meant that the .venv/bin folder contains the Python binary, yet it is unable to find the python command.

[screenshot]

Running poetry --version returns the current version of Poetry:
[screenshot]

Running bun py echo tmp/test.json returns:
[screenshot]

@josephjclark
Collaborator

Hmm. What version of Ubuntu? I updated to 24.04 on Thursday evening and I seem to have a similar error now. All was working on Thursday afternoon...

@josephjclark
Collaborator

I've just force-reinstalled Poetry, and after re-running poetry install my setup is working again (it had been reporting a broken install).

@josephjclark
Collaborator

Looking at your error again it's coming out of bash trying to execute poetry run. It's like there's something funny with your bash environment. Nothing to do with the venv.

Can you run:

poetry run python services/entry.py echo tmp/test.json

?

You might get a list out of range exception but that would mean your environment is working (and I'm about to merge a fix to that into main)

@SatyamMattoo
Contributor

SatyamMattoo commented Jun 22, 2024

I am using Ubuntu 22.04.4 LTS. The command is working as expected.
[screenshot]

Okay, I will try reinstalling everything and setting up the repo again to see if that fixes the issue. Do you think this might be due to a different version of Ubuntu? If so, I will update it.

@josephjclark
Collaborator

No, I don't think it's related to the Ubuntu version. The upgrade broke my setup and I wondered if you'd also updated.

The problem seems to be in the bun environment. When running a bun script, bun is invoking bash and bash can't find poetry. But if you run those same commands directly, they work.

So it's something in the bun setup. Or perhaps your shell I suppose. What shell are you using? Anything strange in your setup?

To prove it, you can add a bun script to package.json which just calls poetry --version, and I expect it to fail.

@SatyamMattoo
Contributor

SatyamMattoo commented Jun 22, 2024

Thank you @josephjclark! You were absolutely right; the error was in the Bun setup. The Bun documentation does not mention adding these env variables to .bashrc:

export BUN_INSTALL="$HOME/.bun"
export PATH="$BUN_INSTALL/bin:$PATH"

After adding them, the commands run exactly as expected. Thank you again for your assistance.

@josephjclark
Collaborator

How strange!

I'll add a note in the readme about this in case it helps someone out.

@hustler0109

hustler0109 commented Jun 29, 2024

Weekly Goals

Week 1

  • Setting up the Apollo repository and using it with the CLI.
  • Updating the GitHub issue
  • Implementing the basic gen_job service

Week 2:

  • Setting up the Apollo repository locally without using Docker.
  • Resolving any issues related to the bun package manager.
  • Delays due to exams.

Week 3:

  • More about vector databases and LangChains to support Retrieval-Augmented Generation (RAG).
  • More about embedding models and various vector databases.
  • Introducing and designing an approach to integrate the Milvus vector database with Apollo.

Week 4:

  • Practical experience with Sentence Transformers and using them locally.
  • Implementing the Milvus vector database into the Apollo system.
  • Setting up Milvus locally and testing with a hardcoded list of strings.
  • More about Zilliz, and experimenting with the cloud database for Milvus.
  • Applying various Docker concepts, including build secrets.

Week 5:

  • More about Milvus and its capabilities.
  • Researching and implementing better approaches for embedding documents or search queries.
  • Transitioning from using HuggingFace API to OpenAI for embedding the corpus.
  • Exploring methods to embed actual documents during the Docker build process.

Week 6:

  • Embedding actual documents from the OpenFn docs repository to the vector database.
  • Finding better ways to improve search results.

Week 7:

  • Integrating the search service and job generation service to add more context to the prompts.
  • Adding the adaptor description service to include adaptor info to the prompts along with relevant information from the docs.
