Caching of assets? #707

Closed
siefkenj opened this issue Mar 19, 2024 · 18 comments · Fixed by #906

Comments

@siefkenj
Contributor

Currently there is some support for rebuilding assets only if they've changed, but it seems to rely on document structure. Since assets are extracted and then compiled in isolation, I imagine that if you stored <md5sum>.svg files in some .cache folder, you could just detect whether the asset contents were the same and copy over the cached version instead of running compile again. This method would not rely on document structure at all.

@StevenClontz
Member

+1

So we have an element like <latex-image xml:id="bar">FOO</latex-image>, we checksum FOO to abc123, then save the result to .cache/latex-image/abc123.svg as well as generated-assets/latex-image/bar.svg. Then on future builds, we simply copy .cache/latex-image/abc123.svg to generated-assets/latex-image/bar.svg (or wherever it should be, in case the filename changes).
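A minimal sketch of the two paths described above, assuming SHA-256 as the checksum (the thread mentions md5sum; any stable digest would do). The function name `asset_paths` and its parameters are hypothetical and not part of the PreTeXt CLI or core.

```python
import hashlib
from pathlib import Path


def asset_paths(asset_type: str, xml_id: str, source: str, fmt: str) -> tuple[Path, Path]:
    """Return (cached, output) paths for one asset.

    The cache path depends only on the asset's source text;
    the output path depends only on its xml:id.
    """
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cached = Path(".cache") / asset_type / f"{digest}.{fmt}"
    output = Path("generated-assets") / asset_type / f"{xml_id}.{fmt}"
    return cached, output


cached, output = asset_paths("latex-image", "bar", "FOO", "svg")
print(cached)
print(output)
```

Renaming the xml:id changes only the second path, so the cached file can still be found and copied to the new name.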

@rbeezer
Collaborator

rbeezer commented Mar 19, 2024 via email

@oscarlevin
Member

I'm not sure I understand what issue this resolves. Currently, if you have an asset with xml:id="bar" (or if bar is the id of the youngest ancestor of the asset that has an xml:id), then we store the hash of the asset with the xml:id. If the author changes the asset, then the hashes won't match, so we ask for the asset to be regenerated (and put into generated-assets).

With this proposal, we keep a copy of the generated asset in .cache. If the author changes the asset, the hash will no longer match, so we regenerate the asset (and put it in .cache and generated-assets).

In both cases, if the asset isn't changed, nothing gets regenerated.

Last case: the asset isn't changed, but the xml:id is changed. With the current behavior, the asset is regenerated. Under the proposal, the asset isn't regenerated, but a new copy is made with the new name. I see there is an advantage here, but the disadvantage is keeping every version of the generated asset in the cache and copying over every asset from the cache to generated-assets.

What am I missing?

@StevenClontz
Member

Another potential use case: the user has <latex-image xml:id="foo">BAR</latex-image> and later <latex-image xml:id="baz">BAR</latex-image>. Maybe it's an anti-pattern that should have been solved with an xref, but this would avoid building the same image twice.

@siefkenj
Contributor Author

This would also mean images are cached without assigning an ID to them.

@StevenClontz
Member

StevenClontz commented Jun 16, 2024

I'm waiting on https://github.com/TeamBasedInquiryLearning/precalculus/actions/runs/9538778663 and I'm seeing a lot of duplication of assets being generated. This could probably be avoided through cleverer configuration of the action, but I still think having a .generated-cache directory that contains a bunch of ELEMENT/FORMAT/HASH.FMT files that is checked before every build and copied over (barring some kind of --force-regenerate) would be excellent.

Another use case: I change my sageplot from blue to green, then hate it, then change it back to blue. The old blue version is still cached so I get it immediately.

@oscarlevin
Member

I am coming around to really liking this idea. I think this would be handled by core though, correct? So definitely something we will want to collaborate on.

@StevenClontz
Member

> I think this would be handled by core though, correct?

💯 - and this is a good week to do it

@StevenClontz
Member

Caching should be used in tandem with https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows to speed up CI/CD for PreTeXt projects

@StevenClontz
Member

@oscarlevin
Member

Okay, @StevenClontz and @siefkenj, I'm going to implement this this week. Here is what I'm thinking; please feel free to nudge me in a different direction if you have time to consider this.

When a user runs pretext build, we want to ensure the generated assets are up to date. We do the following in order:

  1. Hash assets to see if any appear to have changed. For any type of asset that has a change, we add that type to the list that should be generated.
  2. For each asset type that should be generated, we call core.generate_* on the entire document (not subsetting like we do now, unless this was explicitly requested). We also pass in our version of "individual_generate_asset" using the provided hook from core (already implemented for asymptote, latex-image, and sage).
  3. Our custom generator function gets passed an individual file that will be generated. We hash this and check whether hash.ext exists in .generated_cache. If so, we just copy it over where it should go. If not, we call core's individual generation function and get the new asset that way, but in addition, make a copy into our .generated_cache folder with the appropriate hash and ext.
  4. After all the assets are successfully generated (or copied), we update the hash table for that target.

I think that the .generated_cache folder should live in the root of the project and be added to .gitignore. It could also go inside generated-assets, in which case we would never copy it over when we build.

Of course we would keep all the forced generation flags we have. We should probably also add a way to clear the generated cache (perhaps pretext generate --clean; or should this be done whenever a forced generate happens?).

@StevenClontz
Member

I'm a little hazy on what gets hashed and what we compare with.

I imagine something like this workflow (which may be exactly what you're suggesting):

  1. When generating assets for <element>...</element> (which should be expanded for any xi:includes) for format *.fmt, first hash the string <element>...</element> as HASH.
  2. Check if HASH.fmt exists in .cache. (note: I would call it .cache to not be confused with generated-assets)
    • If it does:
      1. Copy HASH.fmt to the correct filename within generated-assets
    • If it does not:
      1. Use core PreTeXt routine to create correct file in generated-assets
      2. Copy it as HASH.fmt to the .cache directory
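
The workflow above can be sketched as a small wrapper around whatever routine actually builds an asset. This is only an illustration under stated assumptions: `generate_with_cache` and the `generate` callback are hypothetical names standing in for the real generation routine, not the core PreTeXt API, and the `force` parameter merely mimics the idea of a forced regenerate.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable


def generate_with_cache(
    source: str,
    dest: Path,
    cache_dir: Path,
    generate: Callable[[str, Path], None],
    force: bool = False,
) -> bool:
    """Produce dest from source, reusing a cached copy when one exists.

    Returns True on a cache hit, False when generate() had to run.
    generate is a hypothetical stand-in for the real build routine.
    """
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cached = cache_dir / f"{digest}{dest.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)

    if cached.exists() and not force:
        shutil.copy2(cached, dest)  # hit: copy HASH.fmt to the final filename
        return True

    generate(source, dest)  # miss (or forced): build the asset the slow way
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(dest, cached)  # remember the result under its hash
    return False
```

Because the key is the content alone, this one function would cover the cases raised earlier in the thread: a renamed xml:id, two elements with identical content, and reverting an asset to a previous version. Nothing here ever evicts old entries, so the cache only grows until it is cleared.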

@oscarlevin oscarlevin mentioned this issue Jan 21, 2025
@oscarlevin
Member

That's basically the plan, except that we will hash the output of the extract_*.xsl where * is the asset type. So we are hashing the actual latex/tikz, not the xml source. This is better anyway, since there is no way to tell core to just build the xml element; it would need to extract all the tex code anyway.

@StevenClontz
Member

Does every format have an extract_ file? I think this might also be useful for preview images for interactives and YouTube videos, to avoid network calls and headless browsers.

@StevenClontz
Member

And would <latex-image>foo</latex-image> and <sageplot>foo</sageplot> be hashed the same or different?

@oscarlevin
Member

Yeah, probably the same, although I don't know exactly what the extract templates do.

I don't think this is an issue for latex-image, as you wouldn't have \begin{tikzpicture} in either a sageplot or an asymptote element. But perhaps there would be a collision between those two?

I suppose we could always prepend the asset type to the hash.

@StevenClontz
Member

Another option: .cache/latex-image/HASH.fmt vs .cache/sageplot/HASH.fmt.
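
To make the two options concrete, here is a small sketch. It assumes the hashed text is a plain string and uses SHA-256; the helper names `digest`, `prefixed_digest`, and `namespaced_path` are made up for illustration.

```python
import hashlib
from pathlib import Path


def digest(text: str) -> str:
    """Hex digest of the extracted source text alone."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def prefixed_digest(asset_type: str, text: str) -> str:
    """Option 1: fold the asset type into the hashed bytes."""
    return digest(f"{asset_type}\n{text}")


def namespaced_path(cache_root: Path, asset_type: str, text: str, fmt: str) -> Path:
    """Option 2: keep the plain digest, separate asset types by directory."""
    return cache_root / asset_type / f"{digest(text)}.{fmt}"


root = Path(".cache")
latex = namespaced_path(root, "latex-image", "foo", "svg")
sage = namespaced_path(root, "sageplot", "foo", "svg")
print(latex.name == sage.name)  # file names agree, since the text is identical
print(latex == sage)            # full paths differ, since the directories differ
```

Either option keeps identical text in two different asset types from sharing one cache entry. The directory layout has the side benefit that the cache for a single asset type can be inspected or deleted on its own.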

@oscarlevin
Member

Great idea. Implemented in #908
