
API for incremental model re-hashing #160

Open
laurentsimon opened this issue Apr 17, 2024 · 11 comments
Labels
enhancement New feature or request
Comments

@laurentsimon
Collaborator

laurentsimon commented Apr 17, 2024

Provide an API that allows re-computing the hash of a subset of files in a model. This is useful in cases where only a (small) set of files have changed, e.g., if a user has updated a README in a Hugging Face repo containing several hundred GB of files.

The API may look something like the following:

sign(..., updated_files)

with

updated_files = list [
   (src_file, action, dst_file)
]

Example:

updated_files = [
  ("path/to/file1", "RENAMED", "path/to/file2"),
  ("path/to/file3", "DELETED", ""),
  ("path/to/file4", "UPDATED", "path/to/file4"),
]

Something along these lines. The API would use the existing manifest file and update it by re-computing only the hashes for files that need to be re-hashed.
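
As a rough sketch (assuming, for illustration only, a manifest represented as a path-to-digest dict; none of these names are the actual library API):

import hashlib
import pathlib

def update_manifest(manifest: dict[str, str],
                    updated_files: list[tuple[str, str, str]],
                    model_root: pathlib.Path) -> dict[str, str]:
    # Re-hash only the files named in updated_files; every other entry
    # keeps the digest already recorded in the manifest.
    for src, action, dst in updated_files:
        if action == "DELETED":
            del manifest[src]
        elif action == "RENAMED":
            manifest[dst] = manifest.pop(src)  # reuse the existing digest
        elif action == "UPDATED":
            data = (model_root / dst).read_bytes()
            manifest[dst] = hashlib.sha256(data).hexdigest()
    return manifest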

@laurentsimon laurentsimon added the enhancement New feature or request label Apr 17, 2024
@laurentsimon laurentsimon changed the title API for incremental model re-computation API for incremental model re-hashing Apr 17, 2024
@laurentsimon
Collaborator Author

A simple first iteration could be to support just a list of files to recompute hashes for. The serializer would list the files on disk and re-serialize those that need to be recomputed; for the others, it would take the hashes from the existing signature file.
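
A minimal sketch of that iteration (the reserialize name and dict-shaped manifest are assumptions for illustration):

import hashlib
import pathlib

def reserialize(model_root: pathlib.Path,
                old_manifest: dict[str, str],
                recompute_files: set[str]) -> dict[str, str]:
    # Walk the model tree, re-hashing only the requested (or new) files
    # and copying the old digest for everything else. Deleted files drop
    # out naturally, since only files present on disk are visited.
    new_manifest = {}
    for path in sorted(p for p in model_root.rglob("*") if p.is_file()):
        rel = path.relative_to(model_root).as_posix()
        if rel in recompute_files or rel not in old_manifest:
            new_manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            new_manifest[rel] = old_manifest[rel]
    return new_manifest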

@mihaimaruseac
Collaborator

I think we always need 2 lists: one for files that got deleted and one for files that need to be rehashed.

For a deleted file, we just remove it from the manifest, error if it was not present.

For files that need to be rehashed, error if they're not present, recompute hash otherwise.

Moving/renaming files and just copying the hash would speed up the update process, since we won't need to re-read the file's contents, but we risk a misuse of the API where a file that is both moved and modified ends up with the wrong hash.
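
A sketch of those semantics (names are illustrative, not the actual API):

import hashlib
import pathlib

def incremental_update(manifest: dict[str, str],
                       deleted: list[str],
                       rehash: list[str],
                       model_root: pathlib.Path) -> None:
    # Deletions must refer to existing manifest entries.
    for rel in deleted:
        if rel not in manifest:
            raise ValueError(f"cannot delete {rel}: not in the manifest")
        del manifest[rel]
    # Rehashed files must exist on disk.
    for rel in rehash:
        path = model_root / rel
        if not path.is_file():
            raise FileNotFoundError(f"cannot rehash {rel}: not on disk")
        manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()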

@laurentsimon
Collaborator Author

laurentsimon commented May 14, 2024

I think we always need 2 lists: one for files that got deleted and one for files that need to be rehashed.

For a deleted file, we just remove it from the manifest, error if it was not present.

For files that need to be rehashed, error if they're not present, recompute hash otherwise.

It depends on what semantics we define for the list. If we allow files in the list to be absent from disk, the caller could pass a list of changed files (deleted, renamed, or added) without needing to know the details of what changed. Our lib would figure out what to recompute by listing the existing files on disk and using the list to determine which ones to recompute (deleted files would simply be ignored, since they won't be on disk).

I'm thinking that's as simple as it gets for callers. They would be able to use existing git commands to list the changed files, without parsing what sort of changes were made. To be honest, I have not played with git yet to see what command is needed to get the diff between a commit and the latest push :)

A slightly more advanced solution is for callers to pass the type of changes, either via 2 lists as you suggest, or via a more structured parameter (as in the original description).

@mihaimaruseac mihaimaruseac mentioned this issue May 14, 2024
@mihaimaruseac
Collaborator

Oh, you're thinking from the point of view of the CLI usage. I was thinking from the point of view of an ML training pipeline that needs to orchestrate between runs. Both are valid; we should support both scenarios.

git status --porcelain lists the files, but it will need some additional bash scripting to strip the status markers at the start of each line.
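
For instance, a small Python sketch of that post-processing (the helper name is hypothetical):

import subprocess

def changed_paths() -> list[str]:
    # Each porcelain line is "XY <path>" (or "XY <old> -> <new>" for
    # renames): drop the two status characters and the separator, and
    # keep only the destination path for renames.
    out = subprocess.run(["git", "status", "--porcelain"],
                         capture_output=True, text=True, check=True).stdout
    paths = []
    for line in out.splitlines():
        path = line[3:]
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        paths.append(path)
    return paths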

@mihaimaruseac mihaimaruseac added this to the V1 release milestone May 14, 2024
@laurentsimon
Collaborator Author

laurentsimon commented May 15, 2024

Oh, you're thinking from the point of view of the CLI usage. I was thinking from the point of view of an ML training pipeline that needs to orchestrate between runs. Both are valid; we should support both scenarios.

Can you provide more details about the requirements for this? What are the constraints? How are models stored between runs? Where does #160 (comment) fail for this use case?

@laurentsimon
Collaborator Author

laurentsimon commented May 15, 2024

Oh, you're thinking from the point of view of the CLI usage. I was thinking from the point of view of an ML training pipeline that needs to orchestrate between runs. Both are valid; we should support both scenarios.

git status --porcelain lists the files, but it will need some additional bash scripting to strip the status markers at the start of each line.

@McPatate how would you imagine an integration would look? Would the HF API invoke the git command or use an existing Python git API? Or do you imagine a dedicated CLI for it, separate from the HF Python framework API?

@mihaimaruseac
Collaborator

Can you provide more details about the requirements for this? What are the constraints? How are models stored between runs? Where does #160 (comment) fail for this use case?

We want to have the ML training pipeline code call a library directly. If the pipeline knows that it is going to write files X, Y, Z, it should be able to call the signing library saying that only those files changed.

Using a CLI is a solution, but then it would require a separate process that some MLOps engineers might not follow. Having this in the pipeline code itself would uplift it everywhere.
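
For example (everything here, including the signing_lib module and the sign signature, is a hypothetical placeholder, not an API we have today):

import pathlib
import signing_lib  # hypothetical: stands in for whatever library API we ship

# The pipeline knows exactly which files this run wrote...
written = ["weights/layer-0.safetensors", "weights/layer-1.safetensors"]

# ...so it tells the library that only those changed; everything else keeps
# its digest from the previous run's manifest.
signing_lib.sign(model_path=pathlib.Path("path/to/model"),
                 changed_files=written)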

@laurentsimon
Collaborator Author

Can you provide more details about the requirements for this? What are the constraints? How are models stored between runs? Where does #160 (comment) fail for this use case?

We want to have the ML training pipeline code call a library directly. If the pipeline knows that it is going to write files X, Y, Z, it should be able to call the signing library saying that only those files changed.

I was not thinking in terms of CLI vs. API (both will be available), but in terms of whether the API interface works for the use case or not. In #160 (comment), would the "simple" recompute_files parameter work for this use case?

@mihaimaruseac
Collaborator

Oh, I misunderstood, given the git scenario for listing the files that have a diff. But I see upsides and downsides to either approach.

With explicit lists of changed and deleted files we don't need to scan the model tree and compare it against a list, but we can miss some modifications if library users don't pass a file in the proper list. And we'll have to define the error scenarios.

With a recompute_files parameter we need to scan the directory tree and then compare it with the entire manifest (i.e., traverse every key in a dictionary) to compute the 2 lists above. We guarantee completeness, but we'd have a slowdown. Still, this would likely be simpler when using a git show command to list the files and then piping that list into a CLI to update the hashes.
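
A sketch of that scan-and-compare step (names illustrative):

import pathlib

def diff_against_manifest(model_root: pathlib.Path,
                          manifest: dict[str, str]) -> tuple[list[str], list[str]]:
    # Compare the on-disk tree against the manifest keys to recover the
    # two lists above: deleted entries and new files. Modified files
    # still have to come from the caller's recompute_files list.
    on_disk = {p.relative_to(model_root).as_posix()
               for p in model_root.rglob("*") if p.is_file()}
    deleted = sorted(set(manifest) - on_disk)
    added = sorted(on_disk - set(manifest))
    return deleted, added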

@laurentsimon
Collaborator Author

With a recompute_files parameter we need to scan the directory tree and then compare it with the entire manifest (i.e., traverse every key in a dictionary) to compute the 2 lists above. We guarantee completeness, but we'd have a slowdown. Still, this would likely be simpler when using a git show command to list the files and then piping that list into a CLI to update the hashes.

Yeah, I think it's simple, and the slowdown is likely negligible: even hundreds of files on disk are fast to list and compare. It's also easier for the caller to avoid mistakes.

@McPatate

McPatate commented May 21, 2024

There exist repos with ungodly file structures, but let's not consider them; they are pretty rare 😄

Even https://huggingface.co/datasets/HuggingFaceFW/fineweb "only" has a couple thousand files.

@McPatate how would you imagine an integration would look? Would the HF API invoke the git command or use an existing Python git API? Or do you imagine a dedicated CLI for it, separate from the HF Python framework API?

I'm not exactly sure how huggingface_hub computes its diffs, but I believe it has the information about what changed in the repo (cf. _commit_api.py).
I think the API you chose for re-hashing makes sense. IMO, I would go with creating a dedicated CLI as a first step before integrating with huggingface_hub. I'm happy to lend a hand here!
