Arguments for/against embedding tzdb #201
Comments
Do you know the source of the overhead for
Yeah. Basically, I perceive a very strong benefit to using the system tzdb if it's available. Jiff will also embed tzdb by default for WASM. The only environments with a "system" tzdb as far as I'm aware are Unix environments.
To clarify, the […] My "3-4x" improvement was based on an earlier benchmark I did comparing zoneinfo to chrono in #145. My assumption here is that any manual implementation would be in the same ballpark, if not faster (since no conversion to chrono types is needed).
I'm assuming here that performance in jiff is the same, whether the embedded version is used or not? TZif parsing is done regardless of the source? You load them on demand and have some kind of cache mechanism?
I see, thanks for explaining!
Yeah that's right. TZif reading and parsing is cached. The in-memory data structure is oriented so that time zone transition lookups are fast (binary search). I think there are still some optimizations to perform, but I coded up the "obvious" thing until the desirable fast paths are better known while still doing the "obvious" optimizations (i.e., not reading and parsing TZif for every single time zone lookup).
FWIW, the TZif reading and parsing is not terrible to implement. You could probably copy Jiff's implementation (including its POSIX TZ string implementation) almost verbatim. There's some utility code you'll need to copy, and maybe don't use Jiff's internal ranged integers, but the only other major thing you'd need is a way to move between instants and civil time, which I assume you already have. (This is also another area where TZ lookups can be slow, e.g., if you're trying to find the offset for a civil datetime in a particular time zone.)
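To make the "cached parse plus binary search" idea concrete, here is a minimal Python sketch. It is not Jiff's actual data structure: the `Zone` layout, the `load_zone` helper, and the toy transition data are all assumptions made for illustration, and the real TZif parsing step is replaced by a dictionary lookup.

```python
from bisect import bisect_right
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Zone:
    # Hypothetical layout: sorted Unix timestamps at which the offset changes.
    transitions: tuple
    # offsets[i] is the UTC offset (seconds) in effect after i transitions,
    # so len(offsets) == len(transitions) + 1.
    offsets: tuple

    def offset_at(self, unix_ts: int) -> int:
        # bisect_right counts the transitions at or before unix_ts,
        # which is the index of the offset currently in effect.
        return self.offsets[bisect_right(self.transitions, unix_ts)]


# Toy data standing in for a parsed TZif file (values are made up).
_TOY_TZDB = {"Toy/Zone": ((100, 200), (3600, 7200, 3600))}


@lru_cache(maxsize=None)
def load_zone(name: str) -> Zone:
    # Stand-in for "read and parse TZif once"; later calls hit the cache.
    transitions, offsets = _TOY_TZDB[name]
    return Zone(transitions, offsets)
```

Each lookup is O(log n) in the number of transitions, and repeated `load_zone` calls for the same name return the same parsed object.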
I see now that the relevant dilemma isn't embedding, but whether to substitute Python's […]
But then…the question becomes how to model this, and I arrive at the same Copy dilemma that you do here @BurntSushi. My initial thought was also to have “some kind of integer handle into a global time zone database”. I’d be curious to hear more about the options you tried, and why they weren’t satisfactory.
My first plan would be some kind of optimistic concurrency model where writes are accepted to be slow(ish), so that reads can be lock-free (of course assuming the values are immutable). I think this would allow tz calculations to bypass synchronization entirely, while only slowing down constructors and civil->zoned conversions 🤔. I’m probably overlooking things though, as I haven’t ever done synchronization at such a low level (coming from Python, after all 😁).
One difference with jiff here is that I'd like to limit […]
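A rough Python sketch of the "integer handle into a global time zone database" idea, with slow locked writes and lock-free reads. The `ZoneRegistry` class and its method names are hypothetical. The sketch also leans on CPython's built-in list being safe for concurrent append and index operations; a lower-level implementation would need its own synchronization primitives.

```python
import threading


class ZoneRegistry:
    """Append-only table of immutable zones; a handle is just an index."""

    def __init__(self):
        self._zones = []      # entries are never mutated or removed
        self._by_name = {}    # name -> handle
        self._write_lock = threading.Lock()

    def intern(self, name, zone) -> int:
        # Slow path: only constructors take the lock.
        with self._write_lock:
            handle = self._by_name.get(name)
            if handle is None:
                handle = len(self._zones)
                self._zones.append(zone)
                self._by_name[name] = handle
            return handle

    def get(self, handle: int):
        # Fast path: no lock, because a published entry never changes.
        return self._zones[handle]
```

Note that entries are never freed in this design: a handle stays valid for the life of the process, which is what makes it cheap to copy, and also what makes the memory impossible to reclaim.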
What Jiff currently does is the only thing I actually tried, but I did consider other designs. I remember ruminating quite a bit on it, but I didn't really write any of it down.
When I was thinking about this, one of the use cases I had in mind was something like, "Jiff should make using an implicit global time zone database very easy, but it shouldn't be required." I was thinking about use cases like, "how would someone using Jiff update their copy of tzdb without restarting the process?" Use cases like that really demand a form of garbage collection. And if you're throwing […]
Now if you don't care about this constraint, i.e., you're totally cool to leak the time zone object such that its memory can never be reclaimed, then yes, I do think there are some feasible designs here. But I would caution against it personally. It's the kind of thing where it's probably fine for a ton of use cases, but then your library gets popular and you find the really niche use cases from folks wanting to use your library but can't for whatever reason. With that said, I could absolutely be over-stating things here. Honestly, it's hard to know.
With all that said, Python has a garbage collector. Is there any way you can just reuse that? Like, I was chasing a […]
Yeah I can see how it might look that way, but I don't think arbitrary TZifs or POSIX strings really change anything fundamental here. Whatever infrastructure you're using to turn tzdb time zones into handles should be usable to do the same thing for arbitrary TZifs or POSIX strings. But maybe you're saying that means if you're only limited to tzdb, then the pressure to permit freeing them at some point is lessened. That is perhaps true. But I'd still be apprehensive about designing a system around the concept of leaking user provided data (where […]
Caveat emptor is that while I've worked on a smattering of low level synchronization primitives in the past, I would not qualify myself as an expert. Therefore, you should not consider my statements about what is possible as authoritative. I could have absolutely missed points in the design space.
haha wow, I always love being reminded how some use cases can be more common in other languages 😁. You're absolutely right of course that you can't ignore garbage collection if there's no tight lid on the number of timezones that users would create during the run of a program, especially if it is long-running. I see your point about assuming too much here too. As you mention at the end, zoneinfo is essentially user-provided data. There's no guaranteed, safe "lid" on it.
Looking at C++ libraries, I see now my instincts about the relative importance of Copy are off: they all opt for non-static references to timezone info, i.e. not Copy. I was surprised to see that even chrono’s DateTime isn’t Copy, even though it could arguably have been, given its static tz DB.
Python's reference-counting mechanism can indeed be used to clean up unused timezone definitions. Nice and neat, but...
But…having dug into the polars code, I see their datetime type is just a 64-bit timestamp under the hood. A sensible decision for datetime arithmetic, so long as users are aware of the bounds. Each column can only have one timezone. I might follow up with them to ask whether they’ve ever considered changing this.
Yeah I've looked at Polars too to see if Jiff could meet their demands. They've often complained about […]
My understanding is that the Polars use case is very hard to serve, and that you basically need a design like C++'s […]
Well, that settles the "static TZ DB" discussion for now 👏 On to the "essentially reimplementing Python's
Perhaps you'll end up with some kind of
Definitely not unfortunately. The tedious aspect of the initial setup is barely a blip in the total cost of such an enterprise. I did exactly the same thing in […]
I've had other requests to split Jiff into libraries, but they generally come with the same motivation: serving DRY in some way. But the split doesn't really benefit users directly (aside from the benefits that might come from following DRY, e.g., in theory fewer bugs). My philosophy on dependencies is overall rather conservative. At the same time, I acknowledge there are of course benefits. Which is why I split […]
With that said, I am very happy to have you just copy the code you need out of Jiff. Just leave a comment in the source code for where you got it from. That is the far easier solution IMO. You'll need to plug in your own timestamp and civil datetime types, and you'll want to rip out the ranged integers types I used in places, but it should overall be pretty mechanical and pretty easy, I think. Note that you'll need both the TZif parser and the POSIX TZ string parser.
And yeah, I know I considered it with Zic, but I think my position has hardened against it over time. At least for now anyway.
The timezone db can generally be retrieved in two different ways:
Each approach has its advantages.
Pros of embedding
ZoneInfo has quite some overhead (method call and timedelta). I'd expect this to be about 3-4x faster
whenever ZonedDateTime can be made fully threadsafe on noGIL builds. In time, the zoneinfo module will likely enable this, but it's unclear how long this will take. (see CPython issue 116738)
[…] zoneinfo, this can be a significant difference
ZonedDateTime object size can be marginally smaller (24 to 20) since it could store some kind of "index" instead of a full *mut PyObject. Tiny benefit, but still
Pros of not embedding
[…] whenever to get updated timezone data. No extra releases needed when tzdb updates.
Other notes
The performance advantages of embedding are primarily due to the fact that ZoneInfo has significant overhead. It's also possible to not embed, but to write a custom TZif parser to replace zoneinfo.
How do other libraries/languages do it?
[…] tzdata package)
java.time, NodaTime (C#) embed the tzdb