
Implement caching using Souin #154

Closed

Conversation

mac-chaffee (Contributor)

Fixes #149: Improving pgs asset serving performance (brainstorming)

This PR implements in-memory caching of pgs response bodies inside Caddy using a plugin called Souin.

Background

Currently, serving a single asset from pgs is a little slower than it could be due to DNS lookups, DB queries, and re-reading _headers and _redirects. Experiments in #151 showed that all of these expensive operations are necessary and cannot simply be cached in-process, since pgs-web and pgs-ssh are separate processes that both need control over the cache. Thus we have to use a cache like Souin, which runs separately from pgs-web and pgs-ssh. But since it runs inside Caddy, no separate infrastructure is needed.

How does the caching work?

When any user requests an asset like https://example.com/styles.css for the first time, the asset is fetched and returned to the user as normal, but a copy of the response body and headers is now stored by Souin (specifically in Otter, a high-performance in-memory store with impressive benchmarks that requires no special parameter tuning, making it a good choice IMO).

The cached response body is assigned a TTL of 5 minutes (which we could increase later once we're more confident in our cache-flushing code). It's also associated with two keys (see the sketch after this list):

  • There's a main key which looks like GET-https-example.com-/styles.css. This key is used for quickly serving existing cached files when new requests arrive for the same assets.
  • We also set a "surrogate key" which looks like fakeuser-www-proj (this matches the *.pgs.sh subdomain or the value in the TXT entry). This key is used for purging the cache for an entire project when any files are modified.
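
A minimal sketch of how these two keys fit together (the variable names are hypothetical; the exact key format is whatever Souin builds internally):

package main

import "fmt"

func main() {
    // Main key: method, scheme, host, and path joined with dashes.
    method, scheme, host, path := "GET", "https", "example.com", "/styles.css"
    fmt.Printf("%s-%s-%s-%s\n", method, scheme, host, path) // GET-https-example.com-/styles.css

    // Surrogate key: identifies the whole project so it can be purged at once.
    user, project := "fakeuser", "www-proj"
    fmt.Println(user + "-" + project) // fakeuser-www-proj
}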

If any other users request the same file within 5 minutes, the responses will be served directly by Caddy from the in-memory cache. After 5 minutes, the cache for that asset expires.

Caveat: The cache is limited to 10,000 items. Otter uses a special algorithm to evict rarely-accessed items from the cache to make room for new items.
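
For reference, a minimal standalone sketch of Otter with these limits, based on its documented builder API (illustrative only, not the code Souin runs; the "special algorithm" above is Otter's S3-FIFO-based eviction policy):

package main

import (
    "fmt"
    "time"

    "github.com/maypok86/otter"
)

func main() {
    // A cache capped at 10,000 entries with a 5-minute TTL; expired or
    // rarely-accessed entries are evicted automatically.
    cache, err := otter.MustBuilder[string, []byte](10_000).
        WithTTL(5 * time.Minute).
        Build()
    if err != nil {
        panic(err)
    }

    cache.Set("GET-https-example.com-/styles.css", []byte("body { color: red }"))
    if body, ok := cache.Get("GET-https-example.com-/styles.css"); ok {
        fmt.Println(string(body))
    }
}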

How does cache purging work?

Souin has an API for examining and purging entries from the cache. Unfortunately, there's no way to expose this API on a different port, so we have to protect it from abuse with Basic Auth.

When any files are written or deleted (like with rsync), we use a golang channel to asynchronously purge the cache for the entire project using surrogate keys. This involves sending an HTTP PURGE request to https://pgs.sh/souin-api/souin/. These purge requests are debounced so that we only purge the cache once per site per 5 seconds.
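
A rough sketch of what the channel-based, debounced purge could look like (helper names and credentials are placeholders, and the Surrogate-Key header is an assumption based on Souin's surrogate-keys support):

package main

import (
    "net/http"
    "time"
)

// purgeWorker collects surrogate keys from the write path and issues
// at most one batch of PURGE requests per 5-second tick.
func purgeWorker(keys <-chan string) {
    pending := map[string]struct{}{}
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case k := <-keys:
            pending[k] = struct{}{} // coalesce repeated writes to the same site
        case <-ticker.C:
            for k := range pending {
                purgeSurrogate(k)
            }
            pending = map[string]struct{}{}
        }
    }
}

// purgeSurrogate asks Souin to drop every cached response tagged with key.
func purgeSurrogate(key string) error {
    req, err := http.NewRequest("PURGE", "https://pgs.sh/souin-api/souin/", nil)
    if err != nil {
        return err
    }
    req.Header.Set("Surrogate-Key", key)     // e.g. "fakeuser-www-proj"
    req.SetBasicAuth("testuser", "password") // placeholder credentials
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    return resp.Body.Close()
}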

This API is reachable from the public internet, just protected with Basic Auth. So in case of emergency, admins can do things like curl -i -X PURGE -u testuser:password https://pgs.sh/souin-api/souin/flush to purge the entire cache.

FAQ

  • How much RAM will this use?
    • The 50th-percentile asset size seems to be around 500KB, so a full cache (10,000 items) could be 5GB. The theoretical max is 10GB due to the 1MB max_cacheable_body_bytes.
  • Are there any infrastructure changes required?
    • Mostly no. You'll need to rebuild/repush the Caddy image and you should generate a new basic_auth username/password.
  • What isn't cached?
    • Errors, and the /check endpoint. Everything else (including the assets for https://pgs.sh) is cached.
  • Are private pages still kept private?
    • I think so, since those requests bypass Caddy and go straight to pgs-ssh.

Still TODO

  • Actually execute this code. I tested individual parts but not the whole thing yet.
  • Make sure this doesn't break other services like imgs or prose.
  • Figure out what to do about analytics. Cached responses never hit pgs-web, so it can't increment counters in the DB. An easy/temporary option would be to not cache HTML files, but then we'd lose a lot of the benefits of caching. Later, we could possibly pull page view counts from Caddy's Prometheus metrics.

Open to feedback and questions!

@mac-chaffee (Contributor, Author) commented Oct 22, 2024

Was able to test this locally and found the surrogate key system doesn't behave according to the spec; the keys are never stored, so the purging does nothing. I filed an issue upstream: darkweak/souin#563

Everything else seems correct though.

@mac-chaffee (Contributor, Author) commented Oct 22, 2024

Some benchmarking on my local laptop (I expect the performance difference to be even greater in production since my laptop has DNS caching and a very fast SSD):

Without caching

$ oha -n 20000 -c 50 -t 10s --insecure -H "Host: picouser-www.localhost" https://localhost
Summary:
  Success rate:	100.00%
  Total:	64.1613 secs
  Slowest:	2.5051 secs
  Fastest:	0.0433 secs
  Average:	0.1604 secs
  Requests/sec:	311.7142

With caching

$ oha -n 20000 -c 50 -t 10s --insecure -H "Host: picouser-www.localhost" https://localhost
Summary:
  Success rate:	100.00%
  Total:	8.7945 secs
  Slowest:	2.4364 secs
  Fastest:	0.0005 secs
  Average:	0.0219 secs
  Requests/sec:	2274.1586

@neurosnap (Member) commented Oct 22, 2024

Thanks so much for taking the time to submit this PR! I've been discussing it with my colleague @antoniomika.

One thing we are planning right now is multi-region pgs; I wonder how caching would work when we are serving sites from multiple regions (e.g. cache busting)?

Further, we would also need to come up with a solution for analytics since that is a big feature for us.

Given these requirements, it makes us wonder if your original in-app cache would be better suited to our needs.

It kind of seems like we might want to implement pgs global first and then think about a holistic caching system that can accommodate it. Thoughts?

For example, we just released https://pipe.pico.sh and are wondering if incorporating an event system into pgs caching makes sense.

@mac-chaffee (Contributor, Author) commented Oct 22, 2024

Multi-region pgs sounds like it would require the cache storage and the cache purging to be multi-region too, right?

Do you all have a stance on the use of CDNs? IMO, "just swipe the credit card and use a CDN" would probably be easiest, but so far I've enjoyed the challenge of trying to find ways to avoid CDNs and additional infrastructure.

With any of those options, we can mostly reuse the code in this PR, just removing the Caddy changes and changing the format of the API call in the purgeCache function (and the header name, of course). Analytics would have to be pulled from the CDN.

If you want to do multi-region caching without a CDN, one option is to simply replace Otter with one of the other Souin storage backends that support clustering, such as Redis. We'd still have to pull analytics from Caddy/Prometheus.

Another, more ambitious idea could be using/building a golang HTTP caching middleware like https://github.com/victorspringer/http-cache (rough sketch below), combined with something like https://github.com/riverqueue/river to coordinate cache purging through Postgres. EDIT: This option has the benefit of allowing us to keep analytics mostly unchanged.
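
A minimal sketch of what the http-cache option could look like, adapted from that library's README (the capacity and TTL mirror this PR; untested against pgs):

package main

import (
    "net/http"
    "time"

    cache "github.com/victorspringer/http-cache"
    "github.com/victorspringer/http-cache/adapter/memory"
)

func main() {
    // In-memory adapter capped at 10,000 cached responses, LRU eviction.
    adapter, err := memory.NewAdapter(
        memory.AdapterWithAlgorithm(memory.LRU),
        memory.AdapterWithCapacity(10_000),
    )
    if err != nil {
        panic(err)
    }

    client, err := cache.NewClient(
        cache.ClientWithAdapter(adapter),
        cache.ClientWithTTL(5*time.Minute),
        cache.ClientWithRefreshKey("purge"), // query param that forces a refresh
    )
    if err != nil {
        panic(err)
    }

    assets := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("hello from pgs"))
    })
    // The middleware wraps any http.Handler, caching its responses.
    http.ListenAndServe(":8080", client.Middleware(assets))
}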

I don't have a strong opinion, I'm happy with any option that makes my pgs site blazing fast, whether it includes my Go code or not :)

@darkweak

Hey, I'm the creator of Souin. First, thank you for considering adding this HTTP cache to this great project.
Regarding Cloudflare, Akamai, and Fastly support: Souin can invalidate by tags and can store data in these CDNs.
If you want a PG storage, I can try to find time to add it to the storage repository.

Open to discussing your needs and what we could do together.

@neurosnap (Member) commented Oct 28, 2024

@mac-chaffee Thanks so much for taking the time to research and write a thoughtful response, much appreciated.

@darkweak Greetings! Thanks for finding our little project and providing an awesome library (Souin), also much appreciated.

@antoniomika and I have had a chat, and because of the way we record analytics and consider them a first-class feature at pico, we are leaning towards implementing the caching mechanism in our app code, although we can be convinced otherwise if there's a better way to outsource HTTP caching (which is a solved problem that doesn't need re-inventing).

@mac-chaffee I like your idea of using http-cache since it simply wraps the standard Handler interface via Handler.ServeHTTP. However, it doesn't solve the problem of recording analytics when there's a cache hit. Further, we would ideally like to be standards compliant (RFC 7234), and it appears http-cache isn't quite there yet.

Here's a patch that we could use to record analytics for cache hits, which we could try to upstream:
https://erock.pastes.sh/http-cache.diff
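
The diff itself isn't reproduced here, but one way to illustrate the general idea: wrap the cache middleware in an outer analytics middleware so both hits and misses are counted (recordView and siteHandler are hypothetical stand-ins, not pico's actual code):

package main

import "net/http"

// recordView is a stand-in for pico's site-usage analytics code.
func recordView(host, path string) {}

// withAnalytics records a view before delegating to the next handler,
// so it runs even when the cache layer answers the request.
func withAnalytics(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        go recordView(r.Host, r.URL.Path)
        next.ServeHTTP(w, r)
    })
}

// usage: http.Handle("/", withAnalytics(cacheClient.Middleware(siteHandler)))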

However, this still doesn't solve how we are going to clear the cache when a site gets uploaded, since these are two separate services without a great way to communicate between them. river is definitely a possibility, although we could also use http-cache's refreshKey, which would involve sending an HTTP request for each URI in a project (roughly sketched below).
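
That loop could look roughly like this (assuming the cache client was built with ClientWithRefreshKey("purge") as in the earlier sketch; projectURIs would come from the project's file list):

package main

import "net/http"

// refreshProject force-refreshes every URI in a project after an upload.
func refreshProject(host string, projectURIs []string) {
    for _, uri := range projectURIs {
        // The refresh-key query param tells http-cache to evict and refetch.
        resp, err := http.Get("https://" + host + uri + "?purge=true")
        if err != nil {
            continue
        }
        resp.Body.Close() // response is discarded; only the refresh matters
    }
}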

Another important issue here is that we support proxying requests to external services: https://pico.sh/pgs#proxy-to-another-service. We do not want to cache those responses. If we went with http-cache, we would need to figure out how to skip caching for some responses, ideally with a flag or something.

In summary:

  • We need to record site-usage analytics regardless of cache hits
  • We must not cache HTTP responses from proxied external services
  • We need a way to clear cache keys when a user uploads new static files (which is in a separate service from the web service that uses the cache)

I'd be happy to continue facilitating this conversation so lemme know what you need from me!

@darkweak

> However, this still doesn't solve how we are going to clear the cache when a site gets uploaded since these are two separate services without a great way to communicate between them. river is definitely a possibility. Although, we could also use http-cache.refreshKey which would involve us sending an http request for each uri in a project.

You can tag your resources in Souin and invalidate by tag when needed; Souin exposes an API for that (you can purge by regexp, tag, or per resource/URI).

> Another important issue here is we support proxying requests to external services: https://pico.sh/pgs#proxy-to-another-service. We do not want to cache those responses. If we went with http-cache we would need to figure out how to ignore caching some responses, ideally with a flag or something.

Returning a Cache-Control: no-store directive should be enough for the responses that are proxied (sketch below).
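
With Go's standard reverse proxy, that could be done in ModifyResponse (the upstream URL is a placeholder):

package main

import (
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    upstream, _ := url.Parse("https://external.example.com") // placeholder
    proxy := httputil.NewSingleHostReverseProxy(upstream)

    // Overwrite Cache-Control so any RFC 7234 cache in front skips storing these.
    proxy.ModifyResponse = func(resp *http.Response) error {
        resp.Header.Set("Cache-Control", "no-store")
        return nil
    }

    http.ListenAndServe(":8080", proxy)
}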

@mac-chaffee (Contributor, Author)

Let's move the conversation to #149. I'll go ahead and close this but will leave the branch intact.

@neurosnap (Member) commented Oct 29, 2024

@darkweak Those are really nice features! Do you have a recommendation on how we could still record cache hits in our site-usage analytics systems?

https://pico.sh/analytics
