Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First attempt at a parametrized JobCreate #740

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

javiermtorres
Copy link
Contributor

@javiermtorres javiermtorres commented Jan 24, 2025

What's changing

The JobCreate schema is changed to include a separate specific job_config. The openapi produced includes a oneOf constraint:

      "JobCreate": {
        "properties": {
          "name": {
            "type": "string",
            "title": "Name"
          },
[...]
          "job_config": {
            "oneOf": [
              {
                "$ref": "#/components/schemas/JobEvalConfig"
              },
              {
                "$ref": "#/components/schemas/JobEvalLiteConfig"
              },
              {
                "$ref": "#/components/schemas/JobInferenceConfig"
              },
              {
                "$ref": "#/components/schemas/JobAnnotateConfig"
              }
            ],
            "title": "Job Config",
            "discriminator": {
              "propertyName": "job_type",
              "mapping": {
                "annotate": "#/components/schemas/JobAnnotateConfig",
                "eval_lite": "#/components/schemas/JobEvalLiteConfig",
                "evaluate": "#/components/schemas/JobEvalConfig",
                "inference": "#/components/schemas/JobInferenceConfig"
              }
            }
          }
        },

The jobs and experiments services are changed accordingly.

Closes #706

How to test it

Tests should run correctly.

Additional notes for reviewers

N/A

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if required
    • No DB migration needed

@github-actions github-actions bot added backend api Changes which impact API/presentation layer schemas Changes to schemas (which may be public facing) labels Jan 24, 2025
@javiermtorres javiermtorres force-pushed the javiermtorres/issue-706-organize-creation-records branch from cf6d9aa to d8ae072 Compare January 24, 2025 17:28
Copy link
Contributor

@njbrake njbrake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far! Made a code suggestion but looks like a very logical refactor. My only question would be whether you plan on addressing the custom logic of _get_job_params in this PR. If you don't plan on addressing it here, can you make a separate issue to track elevating that out of the service layer?

lumigator/python/mzai/backend/backend/api/routes/jobs.py Outdated Show resolved Hide resolved
lumigator/python/mzai/backend/backend/api/routes/jobs.py Outdated Show resolved Hide resolved
lumigator/python/mzai/backend/backend/services/jobs.py Outdated Show resolved Hide resolved
@javiermtorres
Copy link
Contributor Author

The SDK needs to be updated. I have checked the backend unit and integration tests locally and they seem to work.

@github-actions github-actions bot added the sdk label Jan 28, 2025
@javiermtorres javiermtorres force-pushed the javiermtorres/issue-706-organize-creation-records branch 2 times, most recently from 72585b0 to a17f040 Compare January 28, 2025 15:30
@javiermtorres
Copy link
Contributor Author

The SDK and notebook tests have been updated. @veekaybee @aittalam I've changed the code of the notebook slightly. One important difference is that I have removed the model param in the eval lite job. AFAICT, it's not needed there. The notebook takes it from the initial model spec in the notebook and not from the output of the summarization job. Since the output is a csv, it didn't make sense to put the model there, but I'll check the results metadata.

@javiermtorres javiermtorres force-pushed the javiermtorres/issue-706-organize-creation-records branch 2 times, most recently from 0c8ef33 to 4ca7e65 Compare January 29, 2025 15:22
@javiermtorres javiermtorres marked this pull request as ready for review January 29, 2025 15:55
@javiermtorres javiermtorres force-pushed the javiermtorres/issue-706-organize-creation-records branch 2 times, most recently from 9159569 to e6f13f3 Compare January 31, 2025 11:43
@javiermtorres javiermtorres force-pushed the javiermtorres/issue-706-organize-creation-records branch from ab23ff5 to aa1f92a Compare February 3, 2025 08:51
@javiermtorres javiermtorres force-pushed the javiermtorres/issue-706-organize-creation-records branch from c9357b2 to 293ae98 Compare February 4, 2025 16:20
Copy link
Contributor

@njbrake njbrake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that this PR is dropping support for the JobType.EVALUATION, which is needed to support the current frontend design. I may misunderstand the code. Other than that, only minor comments. Thanks for the work on this! (Let me know about JobType.EVALUATION and then I'll approve once that's worked out).

Comment on lines +73 to +82
inference_job_create_config_dict = job_create_request.job_config.dict()
inference_job_create_config_dict["model"] = "hf://facebook/bart-large-cnn"
inference_job_create_config_dict["output_field"] = "ground_truth"
inference_job_create_config_dict["store_to_dataset"] = True
inference_job_create_config_dict["job_type"] = JobType.INFERENCE

inference_job_create_request_dict = job_create_request.model_dump()
inference_job_create_request_dict["job_config"] = inference_job_create_config_dict

inference_job_create_request = JobCreate(**inference_job_create_request_dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code confuses me slightly: the job_create_request comes in as a JobCreate, then the code dumps it out of that class, changes a handful of content, and then re-puts it in as the same JobCreate class that it came in on, only now it has different params? Would it work instead to change the params on the job_create_request object? Something like `job_create_request.model = "hf://facebook/bart-large-cnn".

Or, did you mean for lin 82 to be inference_job_create_request = JobInferenceCreate(**inference_job_create_request_dict) instead of JobCreate?

Now here comes a "scope creep" comment, since your changes don't introduce this error but it might be an easy one to fix. Could we have some sort of warning if the key was already set to something else? I'm especially thinking of the "model" parameter. Let's say I make a call to the backend and I want to use the ""hf://facebook/bart-base-cnn" model instead of the bart-large-cnn model. This code as it stands now would quietly overwrite my request settings. Might be nice to log some sort of warning that we are throwing away a request parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have some sort of warning if the key was already set to something else?

The idea would be to prevent setting any param that would not be acceptable for an Annotation job config. However, there are params like model that should perhaps be moved to the jobs config, since they don't deal with job lifecycle. Maybe as an interim measure we could log a warning until we settle on the schemas.

Conceptually, an annotation is a different kind of job, even though internally is also mapped to an inference (for example, a model cannot be set in an annotation since the annotation job is designed to use exclusively the BART model [please @ividal confirm]).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR: an annotation job is a constrained inference job (one where the user is not choosing model or parameters). We should at least be transparent about the constraints (from API to UI).

This is more a product decision than a technical one: Lumigator at this point is supposed to be opinionated about the model used to generate annotations. So an annotation job request should not make the user believe they can set any model they like, since it would be quietly ignored (replaced currently by BART, tomorrow who knows).

That's why I'd worry about the API offering (as this PR introduces):

{
  "name": "string",
  ...
  "job_config": {
    "job_type": "evaluate", # ouch, minor gripe, but this is a default taken from swagger for an annotation job
    "model": "string", # misleading: this would be promptly ignored by the backend
    "model_url": "string", # same
    "system_prompt": "string",
    "skip_inference": false
  }
}

What do you think - is there any reasonable way to avoid the confusions mentioned above (specially related to the model)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm not getting this, but the annotation does not allow a model to be set:
https://github.com/mozilla-ai/lumigator/pull/740/files#diff-851bf090de2849773e9c7486bebd855b586c8a24f6b20cdd2eb10d65139dd1c2R100-R103

If we are talking about the standalone EVALUATE job, then we should remove this one ASAP and port the current usage to INFERENCE+EVAL_LITE.

lumigator/backend/backend/services/jobs.py Show resolved Hide resolved

def create_job(
self,
request: JobEvalCreate | JobEvalLiteCreate | JobInferenceCreate,
request: JobCreate,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function dropping support for JobEval? From looking at the logic a few lines lower it seems like only JobType.EVALUATION_LITE and JobType.INFERENCE are supported, which will break the frontend API which is currently only using EVALUATION and not EVALUATION_LITE + INFERENCE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the current jobs are handled via the experiments endpoint, afaict, not directly via the jobs endpoint, so the UI should not see a difference.

lumigator/backend/backend/services/jobs.py Show resolved Hide resolved
Comment on lines +73 to +82
inference_job_create_config_dict = job_create_request.job_config.dict()
inference_job_create_config_dict["model"] = "hf://facebook/bart-large-cnn"
inference_job_create_config_dict["output_field"] = "ground_truth"
inference_job_create_config_dict["store_to_dataset"] = True
inference_job_create_config_dict["job_type"] = JobType.INFERENCE

inference_job_create_request_dict = job_create_request.model_dump()
inference_job_create_request_dict["job_config"] = inference_job_create_config_dict

inference_job_create_request = JobCreate(**inference_job_create_request_dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR: an annotation job is a constrained inference job (one where the user is not choosing model or parameters). We should at least be transparent about the constraints (from API to UI).

This is more a product decision than a technical one: Lumigator at this point is supposed to be opinionated about the model used to generate annotations. So an annotation job request should not make the user believe they can set any model they like, since it would be quietly ignored (replaced currently by BART, tomorrow who knows).

That's why I'd worry about the API offering (as this PR introduces):

{
  "name": "string",
  ...
  "job_config": {
    "job_type": "evaluate", # ouch, minor gripe, but this is a default taken from swagger for an annotation job
    "model": "string", # misleading: this would be promptly ignored by the backend
    "model_url": "string", # same
    "system_prompt": "string",
    "skip_inference": false
  }
}

What do you think - is there any reasonable way to avoid the confusions mentioned above (specially related to the model)?

@ividal
Copy link
Contributor

ividal commented Feb 5, 2025

Thanks for this! One note, @javiermtorres this PR should still be in sync with the UI and keep an eye on how it interacts with /experiments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Changes which impact API/presentation layer backend schemas Changes to schemas (which may be public facing) sdk
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consolidate the different creation jobs into an organized class/package.
3 participants