
Lost timezone info when reading from cache #135

Open
JrtPec opened this issue Apr 20, 2016 · 8 comments
@JrtPec (Member) commented Apr 20, 2016

TL;DR: There is a bug in iso8601. It has been fixed upstream, but the fix has not been published on PyPI yet.

Here I go again with the long story of a wild bug chase:

By default, pandas.read_csv() does not "really" support timezone-aware timestamps. I'm using scare quotes because pandas does parse them, but it converts them to UTC and returns them timezone-naive. In caching.py, @saroele has added the line df.index = df.index.tz_localize('UTC'), so technically you get back the correct instant, albeit in UTC instead of the timezone you used when you wrote the cache.

I'm caching weather data per day, my timestamps look something like 2016-04-20 00:00:00+02:00, and after caching they look like 2016-04-19 22:00:00+00:00. This gives me really stupid errors and a lot of headaches when I try to compare dates.
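A minimal sketch of the date mismatch (the values are illustrative):

```python
import pandas as pd

# The timestamp as written to the cache: local midnight, +02:00 summer offset.
local = pd.Timestamp("2016-04-20 00:00:00+02:00")

# The same instant after the cache round-trip, now expressed in UTC.
utc = local.tz_convert("UTC")

print(utc)                       # 2016-04-19 22:00:00+00:00
print(local == utc)              # True: the same instant in time...
print(local.date(), utc.date())  # ...but two different calendar dates
```

Both timestamps point at the same moment, yet a naive date comparison puts them on different days, which is exactly what breaks the per-day cache lookups.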

The solution is easy: instead of using the default pandas date parser, you can set the date_parser argument in pandas.read_csv() to use the iso8601 library, like so: pandas.read_csv(date_parser=iso8601.parse_date). This works great: it parses the timezone-aware timestamp and uses a FixedOffset to represent the timezone.

However, when I try .truncate() or .loc[] on the resulting frame, iso8601 gets stuck in an infinite loop... a bug that was fixed on 2015-11-18 and will be included in the NEXT RELEASE!

So I'm writing all this down so that I don't forget to check the iso8601 PyPI page someday to see if they have released version 0.1.12... In the meantime I'll figure out some workaround... bleeurg, I hate timezones.

@JrtPec JrtPec self-assigned this Apr 20, 2016
@JrtPec JrtPec added the bug label Apr 20, 2016
@JrtPec (Member, Author) commented Apr 20, 2016

Or, instead of using the iso8601 package I could just use dateutil.parser.parse.

I need a drink.

@JrtPec JrtPec closed this as completed Apr 20, 2016
@saroele (Member) commented Apr 21, 2016

and you deserve one
+1 for bleeurg I hate timezones

The question is: what is the desired behaviour if you ask for the cached consumption of 20/04/2016? I agree that you expect to get the data of that day, in local time. Timezones and dates get messed up, as you have illustrated in your example above.

An often-used principle is to store everything in UTC. This works for e.g. hourly timeseries. However, caching daily data in UTC may have been the wrong idea, and the timestamps (and data) should be taken according to local time... I guess we can discuss it at grid:camp :-)

@JrtPec (Member, Author) commented Apr 21, 2016

You are correct about the desired behaviour, this is exactly what Forecast.io does. It returns its 'daily' report with a timestamp at midnight, locally. If you store this in UTC you can get a date mismatch, so you'd need to localise it again, but from what? The timezone information has been thrown away.

So there are two solutions: do a tz-aware caching, or store the desired timezone for each site in the Houseprint (which will bring along a whole different set of issues I'm sure)

@saroele (Member) commented Apr 21, 2016

We need to combine both solutions. Timezone is a sensor characteristic. Thank god we work with buildings and not container-tracking sensors :-) So we need to have it at sensor level, and then, for redundancy, we can also include it in the timestamps in the cache.


@JrtPec (Member, Author) commented Apr 21, 2016

The saga continues:

A pandas DatetimeIndex only allows either a fully localised index (e.g. 'UTC' or 'Europe/Brussels') or an index with a single fixed offset (e.g. +02:00) that applies to all timestamps.

So when we cache a localised index that spans a change to or from DST, the parser cannot convert it back to a tz-aware index, because the offset changes from +01:00 to +02:00.
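A sketch of the DST problem, with dates chosen around the Brussels switch on 27 March 2016:

```python
import pandas as pd
from dateutil.parser import parse

# Two local midnights straddling the DST switch: +01:00 before, +02:00 after.
stamps = ["2016-03-26 00:00:00+01:00", "2016-03-28 00:00:00+02:00"]

# With mixed offsets pandas cannot build a single fixed-offset DatetimeIndex;
# it falls back to a plain object index, so time-based slicing is lost.
mixed = pd.Index([parse(s) for s in stamps])
print(isinstance(mixed, pd.DatetimeIndex))  # False

# Converting everything to UTC does give a proper DatetimeIndex again,
# which is what pushes us towards the store-in-UTC approach.
utc = pd.to_datetime(stamps, utc=True)
print(isinstance(utc, pd.DatetimeIndex))  # True
```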

So this might be the final straw for me, and I'm inclined to say that we do have to save everything in UTC and add a timezone field to the houseprint...

@JrtPec JrtPec reopened this Apr 21, 2016
@JrtPec (Member, Author) commented Apr 21, 2016

I have a solution: instead of saving to CSV and having to go through parsing everything, why don't we just save to pickle? That way we're sure the data is read from cache in exactly the same format as how we've written it in the first place.
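A sketch of the pickle round-trip (the data is made up; the real cache would write to a file instead of an in-memory buffer):

```python
import io

import pandas as pd

# An illustrative day of cached data with a fully localised index.
df = pd.DataFrame(
    {"temp": [12.5]},
    index=pd.DatetimeIndex(["2016-04-20 00:00:00"]).tz_localize("Europe/Brussels"),
)

# Pickle round-trip: no parsing on the way back in, so nothing is lost.
buf = io.BytesIO()
df.to_pickle(buf)
buf.seek(0)
restored = pd.read_pickle(buf)

print(restored.index.tz)    # Europe/Brussels, fully preserved
print(restored.equals(df))  # True
```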

@saroele (Member) commented Apr 21, 2016

The advantage of the CSV is that it also happens to be useful for Excel champions and for import into other tools or platforms. With pickle, we lose this again. Maybe the pandas-json-pandas route works? And JSON is somewhat more universal than pickle.


@JrtPec (Member, Author) commented Apr 21, 2016

Caching to JSON exhibits the same behaviour as caching to CSV. I expect every format where pandas has to do some parsing to behave the same way.
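A minimal check of the JSON case (data is made up): pandas serialises tz-aware timestamps in UTC when writing JSON, so the original offset never even reaches the file.

```python
import pandas as pd

df = pd.DataFrame(
    {"temp": [12.5]},
    index=pd.to_datetime(["2016-04-20 00:00:00+02:00"]),
)

# to_json writes the instant in UTC; the +02:00 offset is already gone
# in the serialised output, before any parsing on the way back happens.
out = df.to_json(date_format="iso")
print(out)  # key is the UTC instant 2016-04-19T22:00:00, not 2016-04-20+02:00
```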
