Arguments for/against embedding tzdb #201

Open
ariebovenberg opened this issue Jan 20, 2025 · 10 comments
Labels
discussion Discussion is needed before proceeding

Comments

@ariebovenberg
Owner

The timezone db can generally be retrieved in two different ways:

  1. bundled in the library itself
  2. from the platform itself, if available (on mac/linux). This is the current approach.
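
For reference, Python's own zoneinfo already combines both options: it searches the directories in zoneinfo.TZPATH first and falls back to the tzdata package if that is installed. Below is a minimal sketch of that lookup order. The helper name find_tz_source and its search_path parameter are made up for illustration; they are not part of zoneinfo or whenever.

```python
import importlib.util
import os
import zoneinfo
from importlib import resources


def find_tz_source(key, search_path=None):
    """Report where TZif data for `key` would come from.

    Returns "system", "tzdata", or None if the key is not found anywhere.
    """
    if search_path is None:
        search_path = zoneinfo.TZPATH  # platform directories, may be empty
    for root in search_path:
        if os.path.isfile(os.path.join(root, *key.split("/"))):
            return "system"
    # Fall back to the bundled data shipped in the optional tzdata package.
    if importlib.util.find_spec("tzdata") is not None:
        node = resources.files("tzdata").joinpath("zoneinfo")
        for part in key.split("/"):
            node = node.joinpath(part)
        if node.is_file():
            return "tzdata"
    return None
```

On a typical Linux or macOS machine find_tz_source("Europe/Amsterdam") would be expected to report "system"; on Windows it depends on whether tzdata is installed.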

Each approach has its advantages.

Pros of embedding

  • Faster timezone calculations, since ZoneInfo has quite some overhead (method call and timedelta). I'd expect this to be about 3-4x faster
  • Guarantees the same result in a platform-independent way, as long as you're on the same version of whenever
  • ZonedDateTime can be made fully threadsafe on noGIL builds. In time, the zoneinfo module will likely enable this, but it's unclear how long this will take. (see CPython issue 116738)
  • Faster object creation. Embedding allows pre-parsing the TZif file. Compared to zoneinfo, this can be a significant difference
  • ZonedDateTime object size can be marginally smaller (24 to 20) since it could store some kind of "index" instead of a full *mut PyObject. Tiny benefit, but still

Pros of not embedding

  • No extra effort required to develop or maintain
  • Timezone logic will have the same result as the standard library and other code using the platform tzdb
  • Users don't need to update whenever to get updated timezone data. No extra releases needed when tzdb updates.
  • No extra storage required. Compiled tzdb is around 3 MB. Not negligible, but not a big deal for most Python programs either

Other notes

The performance advantages of embedding are primarily due to the fact that ZoneInfo has significant overhead. It's also possible to not embed, but to write a custom TZif parser to replace zoneinfo.
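
As a rough idea of what a custom parser involves, here is a minimal sketch that reads only the version-1 (32-bit) data block described in RFC 8536. It is not whenever's or zoneinfo's implementation; the names TZif and parse_tzif_v1 are hypothetical. A real parser would also need the 64-bit version-2+ block and the POSIX TZ footer, and some tzdb builds may leave the version-1 block nearly empty.

```python
import struct
from typing import NamedTuple


class TZif(NamedTuple):
    transitions: tuple   # UTC transition times, seconds since the Unix epoch
    type_indices: tuple  # for each transition, an index into `types`
    types: tuple         # (utc_offset_seconds, is_dst, abbreviation) records


def parse_tzif_v1(data: bytes) -> TZif:
    """Parse the version-1 data block of a TZif file."""
    if data[:4] != b"TZif":
        raise ValueError("not a TZif file")
    # Header: magic (4), version (1), reserved (15), then six big-endian counts.
    isutcnt, isstdcnt, leapcnt, timecnt, typecnt, charcnt = struct.unpack_from(
        ">6L", data, 20
    )
    pos = 44
    transitions = struct.unpack_from(f">{timecnt}l", data, pos)
    pos += 4 * timecnt
    type_indices = struct.unpack_from(f">{timecnt}B", data, pos)
    pos += timecnt
    # Each local time type record: int32 offset, uint8 is_dst, uint8 abbreviation index.
    raw_types = [struct.unpack_from(">lBB", data, pos + 6 * i) for i in range(typecnt)]
    pos += 6 * typecnt
    chars = data[pos:pos + charcnt]  # NUL-terminated abbreviations
    types = tuple(
        (utoff, bool(isdst), chars[idx:chars.index(b"\0", idx)].decode("ascii"))
        for utoff, isdst, idx in raw_types
    )
    # Leap-second records and the std/ut indicators that follow are ignored here.
    return TZif(transitions, type_indices, types)
```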

How do other libraries/languages do it?

  • Python standard library has the option to embed (tzdata package)
  • java.time, NodaTime (C#) embed the tzdb
  • chrono (Rust) embeds the tzdb
  • jiff (Rust) has the option to embed, but by default only does so on Windows
@ariebovenberg ariebovenberg added the discussion Discussion is needed before proceeding label Jan 20, 2025
@BurntSushi

Do you know the source of the overhead for ZoneInfo?

jiff (Rust) has the option to embed, but by default only does so on Windows

Yeah. Basically, I perceive a very strong benefit to using the system tzdb if it's available.

Jiff will also by default embed tzdb for WASM as well.

The only environments with a "system" tzdb as far as I'm aware are Unix environments.

@ariebovenberg
Owner Author

Do you know the source of the overhead for ZoneInfo?

To clarify, the ZoneInfo I'm talking about here is the Python ZoneInfo class. whenever uses it since it's already available in Python's standard library, which saves rolling your own TZif parser. ZoneInfo is implemented in C, so it's pretty fast in itself. The performance overhead comes from the fact that any calculation using this class still needs to go through the Python method-calling machinery. There's also extra overhead in constructing the input and destructuring the output (all must be refcounted PyObjects). I'm currently using DateTime_FromTimeStamp from Python datetime's C API (it was the fastest overall), but you can see here how much extra stuff is going on. The overhead makes sense of course, as the method needs to handle arbitrary tzinfo subclasses (not just ZoneInfo, which is one concrete subclass).

My "3-4x" improvement was based on an earlier benchmark I did comparing zoneinfo to chrono in #145. My assumption here is that any manual implementation would be in the same ballpark, if not faster (since no conversion to chrono types is needed)

Yeah. Basically, I perceive a very strong benefit to using the system tzdb if it's available.

I'm assuming here that performance in jiff is the same, whether the embedded version is used or not? TZif parsing is done regardless of the source? You load them on demand and have some kind of cache mechanism?

@BurntSushi

BurntSushi commented Jan 20, 2025

I see, thanks for explaining!

I'm assuming here that performance in jiff is the same, whether the embedded version is used or not? TZif parsing is done regardless of the source? You load them on demand and have some kind of cache mechanism?

Yeah that's right. TZif reading and parsing is cached. The in-memory data structure is oriented so that time zone transition lookups are fast (binary search).
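
A minimal sketch of that kind of lookup, written in Python for brevity and not taken from Jiff's Rust code: with the transition times kept in a sorted array, finding the offset for an instant is a single binary search. The function name and the flat transitions/offsets layout are assumptions for illustration.

```python
from bisect import bisect_right


def offset_at(transitions, offsets, initial_offset, timestamp):
    """Return the UTC offset (in seconds) in effect at `timestamp`.

    `transitions` is a sorted list of UTC transition times and `offsets[i]`
    is the offset that applies from `transitions[i]` onward. Before the
    first transition, `initial_offset` applies.
    """
    i = bisect_right(transitions, timestamp)  # O(log n)
    return initial_offset if i == 0 else offsets[i - 1]
```

Because the arrays are immutable once parsed, they can be cached per time zone and shared between lookups.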

I think there are still some optimizations to perform, but I coded up the "obvious" thing until the desirable fast paths are better known while still doing the "obvious" optimizations (i.e., not reading and parsing TZif for every single time zone lookup).

FWIW, the TZif reading and parsing is not terrible to implement. You could probably copy Jiff's implementation (including its POSIX TZ string implementation) almost verbatim. There's some utility code you'll need to copy, and maybe don't use Jiff's internal ranged integers, but the only other major thing you'd need is a way to move between instants and civil time, which I assume you already have. (This is also another area where TZ lookups can be slow, e.g., if you're trying to find the offset for a civil datetime in a particular time zone.)
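
To illustrate the civil-time side, here is a rough Python sketch (not Jiff's code): the widely used days-from-civil arithmetic to turn a civil date into an epoch day count, plus a lookup that finds which instants map to a given civil datetime. All names and the flat transitions/offsets layout are assumptions for illustration. The lookup shows why this direction costs more: every candidate offset has to be checked, and the answer can be zero (gap), one, or two (fold) instants.

```python
from bisect import bisect_right


def days_from_civil(year, month, day):
    """Days since 1970-01-01 for a proleptic Gregorian date."""
    if month <= 2:
        year -= 1  # the year is taken to start in March, so Jan/Feb belong to the previous one
    era = year // 400  # floor division, so negative years work too
    year_of_era = year - era * 400
    month_from_march = month - 3 if month > 2 else month + 9
    day_of_year = (153 * month_from_march + 2) // 5 + day - 1
    day_of_era = year_of_era * 365 + year_of_era // 4 - year_of_era // 100 + day_of_year
    return era * 146097 + day_of_era - 719468


def local_seconds(year, month, day, hour=0, minute=0, second=0):
    """Seconds since the epoch, reading the civil datetime as if it were UTC."""
    return days_from_civil(year, month, day) * 86400 + hour * 3600 + minute * 60 + second


def resolve_civil(transitions, offsets, initial_offset, local):
    """Return every UTC timestamp whose local time equals `local`.

    An empty list means the civil time falls in a gap; two entries mean a fold.
    """
    def offset_at(ts):
        i = bisect_right(transitions, ts)
        return initial_offset if i == 0 else offsets[i - 1]

    # A timestamp ts is a match iff ts + offset_at(ts) == local, so it is
    # enough to try each distinct offset once.
    candidates = {initial_offset, *offsets}
    return sorted(local - off for off in candidates if offset_at(local - off) == off)
```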

@ariebovenberg
Owner Author

ariebovenberg commented Jan 22, 2025

I see now that the relevant dilemma isn't embedding, but whether to replace Python's zoneinfo module with a Rust implementation. This can be done regardless of embedding, and could even reuse Python’s own tzdata mechanism.

But then…the question becomes how to model this, and I arrive at the same Copy dilemma that you do here @BurntSushi. My initial thought was also to have “some kind of integer handle into a global time zone database”. I’d be curious to hear more about the options you tried, and why they weren’t satisfactory. My first plan would be some kind of optimistic concurrency model where writes are accepted to be slow(ish), so that reads can be lock-free (of course assuming the values are immutable). I think this would allow tz calculations to bypass synchronization entirely, while only slowing down constructors and civil->zoned conversions 🤔. I’m probably overlooking things though, as I haven’t ever done synchronization at such a low level (coming from Python, after all 😁)
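
A rough Python sketch of that idea, to make the shape concrete. It is illustrative only and says nothing about how whenever or Jiff are implemented; the class and method names are hypothetical. Writers take a lock and publish a new immutable tuple, while readers index whatever snapshot is current without locking. The sketch assumes that rebinding an attribute is atomic (as it is in CPython) and that entries are never removed, which is what keeps an integer handle valid forever.

```python
import threading


class TzRegistry:
    """Append-only registry mapping integer handles to immutable tz data."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = ()  # immutable snapshot, indexed by handle
        self._by_key = {}   # key -> handle, only touched while holding the lock

    def get(self, handle):
        # Read path: no lock, just index the current snapshot.
        return self._entries[handle]

    def intern(self, key, load):
        # Write path: slow(ish) by design. Copies the tuple on first use of a key.
        with self._lock:
            handle = self._by_key.get(key)
            if handle is None:
                handle = len(self._entries)
                self._entries = self._entries + (load(key),)
                self._by_key[key] = handle
            return handle
```

Here load stands in for whatever reads and parses the TZif data for a key; it is called at most once per key.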

One difference with jiff here is that I'd like to limit whenever.ZonedDateTime to tz DB only. I think that this should mitigate the need for garbage collection (there's only so many TZ DB entries, and they don't take up that much space), and perhaps solve some of the synchronization issues you'd get from accepting arbitrary TZifs and POSIX strings...

@BurntSushi

But then…the question becomes how to model this, and I arrive at the same Copy dilemma that you do here @BurntSushi. My initial thought was also to have “some kind of integer handle into a global time zone database”. I’d be curious to hear more about the options you tried, and why they weren’t satisfactory.

What Jiff currently does is the only thing I actually tried, but I did consider other designs. I remember ruminating quite a bit on it, but I didn't really write any of it down. When I was thinking about this, one of the use cases I had in mind was something like, "Jiff should make using an implicit global time zone database very easy, but it shouldn't be required." I was thinking about use cases like, "how would someone using Jiff update their copy of tzdb without restarting the process?" Use cases like that really demand a form of garbage collection. And if you're throwing TimeZone objects around everywhere that are actually just integer handles to some global database, doing garbage collection is actually quite difficult. i.e., How do you know it's safe to throw away in-memory cached copies of time zone objects? You kinda don't. So then you're stuck with designs where the handle itself encodes enough information to re-capitulate whatever time zone object it was pointing to. And that in turn becomes tricky because even if you figure out how to do it, you better be damn sure that you're returning the same time zone transitions it was pointing to, lest you wind up in situations where the length of a day changes throughout a particular routine.

Now if you don't care about this constraint, i.e., you're totally cool to leak the time zone object such that its memory can never be reclaimed, then yes, I do think there are some feasible designs here. But I would caution against it personally. It's the kind of thing where it's probably fine for a ton of use cases, but then your library gets popular and you find the really niche use cases from folks wanting to use your library but can't for whatever reason. With that said, I could absolutely be over-stating things here. Honestly, it's hard to know.

With all that said, Python has a garbage collector. Is there any way you can just reuse that? Like, I was chasing a Zoned that implements Copy for ergonomics as a Rust library. But presumably in Python land, the Copy and non-Copy distinction isn't nearly as important? What is motivating the pursuit of Copy here?

One difference with jiff here is that I'd like to limit whenever.ZonedDateTime to tz DB only. I think that this should mitigate the need for garbage collection (there's only so many TZ DB entries, and they don't take up that much space), and perhaps solve some of the synchronization issues you'd get from accepting arbitrary TZifs and POSIX strings...

Yeah I can see how it might look that way, but I don't think arbitrary TZifs or POSIX strings really change anything fundamental here. Whatever infrastructure you're using to turn tzdb time zones into handles should be usable to do the same thing for arbitrary TZifs or POSIX strings. But maybe you're saying that means if you're only limited to tzdb, then the pressure to permit freeing them at some point is lessened. That is perhaps true. But I'd still be apprehensive about designing a system around the concept of leaking user provided data (where /usr/share/zoneinfo is "user provided data").

Caveat emptor is that while I've worked on a smattering of low level synchronization primitives in the past, I would not qualify myself as an expert. Therefore, you should not consider my statements about what is possible as authoritative. I could have absolutely missed points in the design space.

@ariebovenberg
Owner Author

ariebovenberg commented Jan 24, 2025

"how would someone using Jiff update their copy of tzdb without restarting the process?" Use cases like that really demand a form of garbage collection

haha wow—I always love being reminded how some use cases can be more common in other languages 😁. You're absolutely right of course that you can't ignore garbage collection if there's no tight lid on the number of timezones that users would create during the run of a program—especially if long-running. I see your point about assuming too much here too. As you mention at the end, zoneinfo is essentially user-provided data. There's no guaranteed, safe, "lid" on it.

Looking at C++ libraries, I see now my instincts about the relative importance of Copy are off: they all opt for non-static references to timezone info, i.e. not Copy. I was surprised to see that even chrono’s DateTime isn’t Copy, even though it could arguably have been with its static tz DB.

With all that said, Python has a garbage collector. Is there any way you can just reuse that? [...]
But presumably in Python land, the Copy and non-Copy distinction isn't nearly as important? What is motivating the pursuit of Copy here?

Python's reference-counting mechanism can indeed be used to clean up unused timezone definitions. Nice and neat, but...
my hidden agenda here was to keep the door open for some kind of integration with Python's dataframe/array libraries (remember pola-rs/polars#20471). These are really commonly used in Python data science circles and could be a real 'killer feature'. These libraries rely on vectorization, parallelization, SIMD etc. to crunch through datasets quickly. My assumption here is that these speedups work most effectively (if not exclusively) on bitwise copyable data (i.e. no reference counters and stuff).

But…having dug into the polars code, I see their datetime type is just a 64-bit timestamp under the hood. A sensible decision for datetime arithmetic—so long as users are aware of the bounds. Each column can only have one timezone. I might follow up with them if they’ve ever considered changing this.

@BurntSushi

BurntSushi commented Jan 24, 2025

But…having dug into the polars code, I see their datetime type is just a 64-bit timestamp under the hood. A sensible decision for datetime arithmetic, so long as users are aware of the bounds. Each column can only have one timezone. I might follow up with them if they’ve ever considered changing this.

Yeah I've looked at Polars too to see if Jiff could meet their demands. They've often complained about chrono (the Rust crate) being slow. But I actually think it's the data model mismatch. Not only do they use 64-bit timestamps, but I believe they also use Unix epoch days to represent dates. This makes certain calculations faster (like determining the weekday or even just doing arithmetic on days).
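
A small Python sketch of why that representation is convenient. It is illustrative only and not taken from Polars; the function name is made up, and nanosecond resolution is an assumption chosen to show the bounds, since the thread does not say which unit is used. With dates stored as days since 1970-01-01, the weekday is a single modulo and day arithmetic is plain integer addition.

```python
from datetime import datetime, timedelta


def weekday_from_epoch_days(days):
    """Weekday with Monday=0 .. Sunday=6, like date.weekday()."""
    # 1970-01-01 (day 0) was a Thursday, i.e. weekday 3.
    # Python's % always returns a non-negative result, so negative days work.
    return (days + 3) % 7


# Bounds of a signed 64-bit timestamp, assuming nanosecond resolution.
INT64_MAX = 2**63 - 1
latest_ns = datetime(1970, 1, 1) + timedelta(microseconds=INT64_MAX // 1000)
```

Under that assumption latest_ns lands in the year 2262, which is the kind of bound users need to be aware of.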

My understanding is that the Polars use case is very hard to serve, and that you basically need a design like C++'s chrono library to do it. That is, where users can bring their own representation and precision. But, this comes with a fair bit of API complexity.

@ariebovenberg
Owner Author

Well, that settles the "static TZ DB" discussion for now 👏

On to the "essentially reimplementing Python's zoneinfo in Rust" topic. @BurntSushi would you be up for splitting the tzif parser into a separate crate? I see you're considering something similar with zic.

So splitting this out would either mean splitting Zic's civil times into a separate crate that both jiff and jiff-zic could depend upon, or it would mean copying at least some parts of the civil time code. (For example, the code that converts between civil time and timestamps, and the code for determining the nth weekday. And maybe other things.)

(from BurntSushi/jiff#20)

Perhaps you'll end up with some kind of jiff-base (for common functionality), jiff-zic and jiff-tzif 🙃. If it'd be an option, I wouldn't mind doing some of the tedious setup/maintenance for jiff-tzif initially.

@BurntSushi

Definitely not unfortunately. The tedious aspect of the initial setup is barely a blip in the total cost of such an enterprise. I did exactly the same thing in regex (which is split into regex-automata, regex-syntax, memchr and aho-corasick) and even for ripgrep, and it is quite honestly a pretty large burden. The burden comes from having a bunch of semver boundaries among the internals of the library. Moreover, I suspect the initial effort here is way bigger than you imagine. For Jiff in particular, the TZif parser isn't exactly de-coupled from the rest of the library. Its API uses jiff::Timestamp and jiff::civil::DateTime, for example, and there is non-trivial code involved here connecting these pieces. And the cherry on top is that Jiff uses a special ranged integer abstraction internally that is excellent at finding bugs but has very poor ergonomics.

I've had other requests to split Jiff into libraries, but they generally come with the same motivation: serving DRY in some way. But the split doesn't really benefit users directly (aside from the benefits that might come from following DRY, e.g., in theory fewer bugs).

My philosophy on dependencies is overall rather conservative. At the same time, I acknowledge there are of course benefits. Which is why I split regex up. But if I ever split Jiff up, it will probably be further off in the future when the use cases are better understood. And especially use cases that benefit users directly.

With that said, I am very happy to have you just copy the code you need out of Jiff. Just leave a comment in the source code for where you got it from. That is the far easier solution IMO. You'll need to plug in your own timestamp and civil datetime types, and you'll want to rip out the ranged integers types I used in places, but it should overall be pretty mechanical and pretty easy, I think. Note that you'll need both the TZif parser and the POSIX TZ string parser.

@BurntSushi

And yeah, I know I considered it with Zic, but I think my position has hardened against it over time. At least for now anyway.
