
Lost timezone info when reading from cache #135

Open
JrtPec opened this issue Apr 20, 2016 · 8 comments
@JrtPec (Member) commented Apr 20, 2016

TL;DR: There is a bug in iso8601. It has been fixed upstream, but the fix has not been published on PyPI yet.

Here I go again with the long story of a wild bug chase:

By default, pandas.read_csv() does not "really" support timezone-aware timestamps. I'm using scare quotes because pandas does parse them, but it converts them to UTC and returns them timezone-naive. In caching.py, @saroele has added the line df.index = df.index.tz_localize('UTC'), so technically you get back the correct instant, albeit in UTC instead of the timezone you used when you wrote the cache.

I'm caching weather data per day, my timestamps look something like 2016-04-20 00:00:00+02:00, and after caching they look like 2016-04-19 22:00:00+00:00. This gives me really stupid errors and a lot of headaches when I try to compare dates.
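A minimal sketch of the date mismatch (the values are illustrative):

```python
import pandas as pd

# The timestamp as written to the cache: local midnight, +02:00 summer offset.
local = pd.Timestamp("2016-04-20 00:00:00+02:00")

# The same instant after the cache round-trip, now expressed in UTC.
utc = local.tz_convert("UTC")

print(utc)                       # 2016-04-19 22:00:00+00:00
print(local == utc)              # True: the same instant in time...
print(local.date(), utc.date())  # ...but two different calendar dates
```

Both timestamps point at the same moment, yet a naive date comparison puts them on different days, which is exactly what breaks the per-day cache lookups.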

The solution is easy: instead of using the default pandas date parser, you can set the date_parser argument in pandas.read_csv() to use the iso8601 library, like so: pandas.read_csv(date_parser=iso8601.parse_date). This works great: it parses the timezone-aware timestamp and uses a FixedOffset to represent the timezone.

However, when I try .truncate() or .loc[] on the resulting frame, iso8601 gets stuck in an infinite loop... a bug that was fixed on 2015-11-18 and will be included in the NEXT RELEASE!

So I'm writing all this down so that I don't forget to check the iso8601 PyPI page someday to see if they have released version 0.1.12... In the meantime I'll figure out some workaround... bleeurg, I hate timezones.

@JrtPec JrtPec self-assigned this Apr 20, 2016
@JrtPec JrtPec added the bug label Apr 20, 2016
@JrtPec (Member, Author) commented Apr 20, 2016

Or, instead of using the iso8601 package I could just use dateutil.parser.parse.

I need a drink.

@JrtPec JrtPec closed this as completed Apr 20, 2016
@saroele (Member) commented Apr 21, 2016

and you deserve one
+1 for bleeurg I hate timezones

The question is: what is the desired behaviour if you ask for the cached consumption of 20/04/2016? I agree that you expect to get the data of that day, in local time. Timezones and dates get messed up, as you have illustrated in your example above.

An often-used principle is to store everything in UTC. This works for e.g. hourly timeseries. However, caching daily data in UTC may have been the wrong idea, and the timestamps (and data) should be taken according to local time... I guess we can discuss it at grid:camp :-)

@JrtPec (Member, Author) commented Apr 21, 2016

You are correct about the desired behaviour, this is exactly what Forecast.io does. It returns its 'daily' report with a timestamp at midnight, locally. If you store this in UTC you can get a date mismatch, so you'd need to localise it again, but from what? The timezone information has been thrown away.

So there are two solutions: do a tz-aware caching, or store the desired timezone for each site in the Houseprint (which will bring along a whole different set of issues I'm sure)

@saroele (Member) commented Apr 21, 2016

We need to combine both solutions. Timezone is a sensor characteristic. Thank god we work with buildings and not container-tracking sensors :-) So we need to have it at sensor level, and then, for redundancy, we can also include it in the timestamps in the cache.


@JrtPec (Member, Author) commented Apr 21, 2016

The saga continues:

A pandas DatetimeIndex only allows either a fully localised index (e.g. 'UTC' or 'Europe/Brussels') or an index with a single fixed offset (e.g. +02:00) that applies to all timestamps.

So when we cache a localised index that spans a change to or from DST, the parser cannot convert it back to a tz-aware index, because the offset changes from +01:00 to +02:00.
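A sketch of the DST problem, with dates chosen around the Brussels switch on 27 March 2016:

```python
import pandas as pd
from dateutil.parser import parse

# Two local midnights straddling the DST switch: +01:00 before, +02:00 after.
stamps = ["2016-03-26 00:00:00+01:00", "2016-03-28 00:00:00+02:00"]

# With mixed offsets pandas cannot build a single fixed-offset DatetimeIndex;
# it falls back to a plain object index, so time-based slicing is lost.
mixed = pd.Index([parse(s) for s in stamps])
print(isinstance(mixed, pd.DatetimeIndex))  # False

# Converting everything to UTC does give a proper DatetimeIndex again,
# which is what pushes us towards the store-in-UTC approach.
utc = pd.to_datetime(stamps, utc=True)
print(isinstance(utc, pd.DatetimeIndex))  # True
```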

So this might be the final straw for me, and I'm inclined to say that we do have to save everything in UTC and add a timezone field to the houseprint...

@JrtPec JrtPec reopened this Apr 21, 2016
@JrtPec (Member, Author) commented Apr 21, 2016

I have a solution: instead of saving to CSV and having to go through parsing everything, why don't we just save to pickle? That way we're sure the data is read from cache in exactly the same format as how we've written it in the first place.
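A sketch of the pickle round-trip (the data is made up; the real cache would write to a file instead of an in-memory buffer):

```python
import io

import pandas as pd

# An illustrative day of cached data with a fully localised index.
df = pd.DataFrame(
    {"temp": [12.5]},
    index=pd.DatetimeIndex(["2016-04-20 00:00:00"]).tz_localize("Europe/Brussels"),
)

# Pickle round-trip: no parsing on the way back in, so nothing is lost.
buf = io.BytesIO()
df.to_pickle(buf)
buf.seek(0)
restored = pd.read_pickle(buf)

print(restored.index.tz)    # Europe/Brussels, fully preserved
print(restored.equals(df))  # True
```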

@saroele (Member) commented Apr 21, 2016

The advantage of the CSV is that it also happens to be useful for Excel champions and for import into other tools or platforms. With pickle, we lose this again. Maybe the pandas-json-pandas route works? And JSON is somewhat more universal than pickle.


@JrtPec (Member, Author) commented Apr 21, 2016

Caching to JSON exhibits the same behaviour as caching to CSV. I expect every format where pandas has to do some parsing to behave the same way.
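A minimal check of the JSON case (data is made up): pandas serialises tz-aware timestamps in UTC when writing JSON, so the original offset never even reaches the file.

```python
import pandas as pd

df = pd.DataFrame(
    {"temp": [12.5]},
    index=pd.to_datetime(["2016-04-20 00:00:00+02:00"]),
)

# to_json writes the instant in UTC; the +02:00 offset is already gone
# in the serialised output, before any parsing on the way back happens.
out = df.to_json(date_format="iso")
print(out)  # key is the UTC instant 2016-04-19T22:00:00, not 2016-04-20+02:00
```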
