API for incremental model re-hashing #160

laurentsimon · 2024-04-17T03:41:22Z

Provide an API that allows re-computing the hash of a subset of files in a model. This is useful in cases where only a (small) set of files have changed, eg if a user has updated a README in a huggingface repo with files of several hundred of GB.

The API may look something like the following:

sign(..., updated_files)

with

updated_files = list [
   (src_file, action, dst_file)
]

Example:

updated_files = [
  "path/to/file1", "RENAMED", "path/to/file2",
  "path/to/file3", "DELETED", "",
  "path/to/file4", "UPDATED", "path/to/file4"
]

Something along these lines. The API would use the existing manifest file and update it by re-computing only the hashes for files that need to be re-hashed.

The text was updated successfully, but these errors were encountered:

laurentsimon · 2024-05-13T23:47:42Z

A simple first iteration could be to support just a list of files to recompute a hash for. The serializer would list files on disk, and for those that needs to be recomputed, it would re-serialize them. For the others, it would take it from the existing signature file.

mihaimaruseac · 2024-05-14T17:32:36Z

I think we always need 2 lists: one for files that got deleted and one for files that need to be rehashed.

For a deleted file, we just remove it from the manifest, error if it was not present.

For files that need to be rehashed, error if they're not present, recompute hash otherwise.

Moving/renaming files and just copying the hash will speed up the update process as we won't need to re-read the file's content, but we can risk a misuse of the API where a file that's both moved and modified will have the wrong hash.

laurentsimon · 2024-05-14T17:41:57Z

I think we always need 2 lists: one for files that got deleted and one for files that need to be rehashed.

For a deleted file, we just remove it from the manifest, error if it was not present.

For files that need to be rehashed, error if they're not present, recompute hash otherwise.

Depends what semantics we define for the list. If we allow files in the list to not be present on disk, it would allow the caller to pass a list of changed files (either deleted, renamed or added) without the need for them to know the details of what was changed. Our lib would figure what to recompute by listing existing files on disk and using the list to determine which to recompute (deleted files would simply be ignored since they won't be on disk).

I'm thinking that's as simple as it gets for callers. They would be able to use existing git command to list files changed, without parsing what sort of changes were made. To be honest I have not not played with git yet to see what command is needed to get the diff between a commit and the latest push :)

A slightly more advanced solution is for callers to pass the type of changes (either via 2 lists as you suggest) or via a more structured parameters (in the original description).

mihaimaruseac · 2024-05-14T17:47:25Z

Oh, you're thinking from the point of view of the CLI usage. I was thinking from the point of view of a ML training pipeline that needs to orchestrate between runs. Both are valid, we should support both scenarios.

git status --porcelain lists the files, but will need some additional bash scripting to remove the markers at the start.

laurentsimon · 2024-05-15T00:43:15Z

Oh, you're thinking from the point of view of the CLI usage. I was thinking from the point of view of a ML training pipeline that needs to orchestrate between runs. Both are valid, we should support both scenarios.

Can you provide more details about the requirements for this? What are the constraints? How are models stored between runs? Where does #160 (comment) fail for this use case?

laurentsimon · 2024-05-15T00:46:07Z

Oh, you're thinking from the point of view of the CLI usage. I was thinking from the point of view of a ML training pipeline that needs to orchestrate between runs. Both are valid, we should support both scenarios.

git status --porcelain lists the files, but will need some additional bash scripting to remove the markers at the start.

@McPatate how would you imagine an integration look like? Would HF API invoke the git command or use an existing Python git API? Or do you imagine a dedicated CLI for it, separate from HF Python framework API?

mihaimaruseac · 2024-05-15T18:03:52Z

Can you provide more details about the requirements for this? What are the constraints? How are models stored between runs? Where does #160 (comment) fail for this use case?

We want to have the ML training pipeline code call directly a library. If the pipeline knows that they are going to write files X, Y, Z, they should be able to call the signing library saying that only those files changed.

Using a CLI is a solution, but then it would require a separate process, that some MLOps engineers might not follow. Having this in the pipeline code itself would uplift it everywhere.

laurentsimon · 2024-05-15T19:42:46Z

Can you provide more details about the requirements for this? What are the constraints? How are models stored between runs? Where does #160 (comment) fail for this use case?

We want to have the ML training pipeline code call directly a library. If the pipeline knows that they are going to write files X, Y, Z, they should be able to call the signing library saying that only those files changed.

I was not thinking in term of CLI vs API (both will be available), but in terms of whether the API interface works for the use case or not. In #160 (comment), would the "simple" recompute_files parameter work for this use case?

mihaimaruseac · 2024-05-15T19:55:00Z

Oh, I misunderstood, given the git scenario to list the files that have a diff. But, I see upsides and downsides to either approach.

With an explicit list of files that are changed and deleted we don't need to scan the model tree and compare with a list, but we can miss some modifications if the library users don't pass the file in the proper list. And we'll have to define the error scenarios.

With a recompute_files we need to scan the directory tree and then we need to compare with the entire manifest (i.e., traversing every key in a dictionary) to compute the 2 lists above. We guarantee completeness but we would have a slowdown. But this would likely be simpler when doing a git show command to list the files and then piping that list into a CLI to update hashes.

laurentsimon · 2024-05-15T22:33:01Z

With a recompute_files we need to scan the directory tree and then we need to compare with the entire manifest (i.e., traversing every key in a dictionary) to compute the 2 lists above. We guarantee completeness but we would have a slowdown. But this would likely be simpler when doing a git show command to list the files and then piping that list into a CLI to update hashes.

yeah I think it's simple and the slowdown is likely negligible: even 100's files on disk is fairly fast to list them and compare them. It's also easier for the caller to not make mistakes.

McPatate · 2024-05-21T14:54:38Z

There exists repos with ungodly file structures, but let's not consider them, they are pretty rare 😄

Even https://huggingface.co/datasets/HuggingFaceFW/fineweb "only" has a couple thousand files.

@McPatate how would you imagine an integration look like? Would HF API invoke the git command or use an existing Python git API? Or do you imagine a dedicated CLI for it, separate from HF Python framework API?

I'm not exactly sure how huggingface_hub computes its diffs, but I believe it has the information of what changed in the repo (cf _commit_api.py).
I think the API you chose for re-hashing makes sense. Imo I would go with creating a dedicated CLI as a first step before integrating to huggingface_hub. I'm happy to lend a hand here!

laurentsimon added the enhancement New feature or request label Apr 17, 2024

laurentsimon changed the title ~~API for incremental model re-computation~~ API for incremental model re-hashing Apr 17, 2024

mihaimaruseac mentioned this issue May 14, 2024

v1 release #175

Open

10 tasks

laurentsimon mentioned this issue May 14, 2024

APIs design #172

Open

mihaimaruseac added this to the V1 release milestone May 14, 2024

mihaimaruseac mentioned this issue Jul 23, 2024

Add signing and verification CLI scripts. #240

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API for incremental model re-hashing #160

API for incremental model re-hashing #160

laurentsimon commented Apr 17, 2024 •

edited

Loading

laurentsimon commented May 13, 2024

mihaimaruseac commented May 14, 2024

laurentsimon commented May 14, 2024 •

edited

Loading

mihaimaruseac commented May 14, 2024

laurentsimon commented May 15, 2024 •

edited

Loading

laurentsimon commented May 15, 2024 •

edited

Loading

mihaimaruseac commented May 15, 2024

laurentsimon commented May 15, 2024

mihaimaruseac commented May 15, 2024

laurentsimon commented May 15, 2024

McPatate commented May 21, 2024 •

edited

Loading

API for incremental model re-hashing #160

API for incremental model re-hashing #160

Comments

laurentsimon commented Apr 17, 2024 • edited Loading

laurentsimon commented May 13, 2024

mihaimaruseac commented May 14, 2024

laurentsimon commented May 14, 2024 • edited Loading

mihaimaruseac commented May 14, 2024

laurentsimon commented May 15, 2024 • edited Loading

laurentsimon commented May 15, 2024 • edited Loading

mihaimaruseac commented May 15, 2024

laurentsimon commented May 15, 2024

mihaimaruseac commented May 15, 2024

laurentsimon commented May 15, 2024

McPatate commented May 21, 2024 • edited Loading

laurentsimon commented Apr 17, 2024 •

edited

Loading

laurentsimon commented May 14, 2024 •

edited

Loading

laurentsimon commented May 15, 2024 •

edited

Loading

laurentsimon commented May 15, 2024 •

edited

Loading

McPatate commented May 21, 2024 •

edited

Loading