-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove datetime workaround logic #273
Conversation
@spencerahill given conda-forge/cftime-feedstock#2 (comment) I think it makes sense to switch to testing on the Win-64 platform rather than Win-32; do you agree? |
Awesome! I wanted to give you a quick response on a few things, but then unfortunately this next week is really crazy for me. So it's unlikely that I'll be able to give a proper review until the week of the 27th. Sorry about that. Don't forget to pin xarray version to >=0.10.4
Yes, agreed.
Do you have a sense of how difficult this will be to implement? I don't want to blow up the scope of this PR, which is already necessarily pretty large. But this would be a great feature and would be nice to go straight to there instead of requiring switching from Python's datetime objects to numpy's and/or cftime's. Or, conversely, would it be possible to still accept
Given our intention of generalizing our time reduction functionality soon (#208), how does that modify this picture? Conversely, what are the prospects for getting resample support added to CFTimeIndex? One last thought: we're probably due for a new release, v0.3, once we get this in...even without it we've already done quite a bit since 0.2. So probably in June or July. |
Totally fine! Thanks for taking the time to write these initial comments.
For the most part adding string support hopefully shouldn't be too hard. The one place I'm a little unsure of is how to handle things in
I think this is reasonably straightforward. We can support all three types; if a
I think generalizing the time reduction functionality is somewhat orthogonal to this. If we wanted to support eddy calculations we would need support of monthly resampling in some fashion. Getting general resample support added for CFTimeIndex in xarray is doable. Stephan has outlined roughly how this might look in pydata/xarray#1252 (comment).
I agree! |
aospy will now automatically convert user-input date ranges to the types required for indexing the data read in. Accepted types are: - np.datetime64 - datetime.datetime - cftime.datetime
Thanks for going ahead with the
OK. In that case I am receptive to your initial idea of "creating a one-off function for that purpose" In thinking about this, it occurs to me that the eddy functionality is one of the uglier bits of
OK. I opened pydata/xarray#2191 to keep tabs on that endeavor. |
@spencerahill thanks for opening those issues! I'm currently cleaning up the tests here (moving |
@spencerahill I've finished up cleaning up the tests; the main motivation for the refactor was so that I could parametrize the tests using different types of dates.
Sounds good -- I've left things alone with that logic in this PR for now. I was not using that functionality either. Whatever worked with data indexed by a DatetimeIndex will still work here, and it will just error if a CFTimeIndex is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good overall. Most of my comments are about docs/comments or are clarifying questions.
I do have one bigger question though: why are strings handled differently throughout than the other time types? In my (quite possibly naive) mind, I was thinking you'd cast string arguments to a datetime-like type (not sure which one) right away, and then all the other functions/methods could count on a datetime-like type. As written, there is if/else logic for strings vs. datetime-like objects in a lot of places.
aospy/data_loader.py
Outdated
@@ -180,7 +179,6 @@ def _prep_time_data(ds): | |||
The processed Dataset and minimum and maximum years in the loaded data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update docstring now that only ds
is returned
aospy/utils/times.py
Outdated
|
||
Parameters | ||
---------- | ||
date : str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str -> datetime-like object or str
aospy/utils/times.py
Outdated
|
||
Parameters | ||
---------- | ||
date : datetime-like object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datetime-like obj or str
aospy/utils/times.py
Outdated
|
||
Returns | ||
------- | ||
pd.Timestamp or cftime.datetime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describe the conditions under which either type is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this; this function was was leftover from my earlier commits and is no longer even used! I'll go ahead and delete it.
aospy/utils/times.py
Outdated
|
||
|
||
def maybe_convert_to_index_date_type(index, date): | ||
"""Convert a datetime-like object to the appropriate type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Here and elsewhere, let's end all the one-line summaries in the docstrings with a period. I think it's a little cleaner, and it's more consistent with other packages docs. I realize we probably haven't been 100% consistent on this in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning of "appropriate type" isn't clear in this docstring...one or two sentences expanding this would help in cases like this one where a developer is returning to this after a while or is new to it.
aospy/test/test_utils_times.py
Outdated
@@ -421,6 +315,94 @@ def test_assert_has_data_for_time(): | |||
_assert_has_data_for_time(da, start_date_bad, end_date_bad) | |||
|
|||
|
|||
CFTIME_DATE_TYPES = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see here you test all of the calendar types (c.f. my other comment). If you think that it's sufficient to do so at this lower level, then fine by me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right to strive to be thorough in testing all kinds of cftime.datetime
objects at this low level. I could be more thorough in these tests (particularly for infer_year
and maybe_convert_to_index_date_type
). I'll improve those.
start_date_bad = '1999-12-31' | ||
end_date_bad = '2000-04-01' | ||
|
||
# With strings these checks are disabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this difference in behavior between strings and datetime-like types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial datetime strings are somewhat complicated. It involves a fair bit of code (and tests) to fully support parsing them and interpreting what they mean as bounds. While much of that code exists in xarray (see here in cftimeindex.py
), it is private API. At least to me, this sanity check in aospy is not important enough to merit suggesting that xarray support parse_iso8601
, _parse_iso8601_with_reso
, and _parsed_string_to_bounds
as public API. While they are used by xarray, these functions are outside its scope (i.e. xarray is not a date-parsing library).
If we want to support this check with strings, it then leaves us with two options:
- Port all of that code (and tests) to aospy.
- Use those functions from xarray, despite the fact that they are private API.
In your opinion, are either of those worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's helpful. Is this the same reason for the different handling of strings vs. other types elsewhere in the code? Frankly, it does concern me a little that, for a user, behavior could be different whether they supply dates as strings or as a datetime-like type.
Putting the above question another way: let's suppose we went ahead and ported that code or imported it. Would that enable you to, as soon as dates are read in as strings, convert them to a datetime-like type? And, if yes, would that then remove the need for separate logic? Purely from a design perspective, to me the optimal solution would be to create an "adapter" (something that takes strings or any of the supported datetime-like types and outputs a datetime object), apply it right away, and then have the remainder of the pipeline ingest/output a single type. But from the technical/implementation perspective, perhaps there are good reasons that would argue against this that I'm still not appreciating.
Hope that makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would that enable you to, as soon as dates are read in as strings, convert them to a datetime-like type?
I'm pretty reluctant to do that :) The beauty of using strings is that we don't need to modify them at all! We can just pass them directly as slice bounds when we call sel
to subset a Dataset in time, regardless of whether the time coordinate is indexed by a DatetimeIndex or CFTimeIndex. For the most part that is why the logic is different for them throughout what I've implemented here. If it weren't for _assert_has_data_for_time
, strings would be significantly easier to deal with than date objects (because we don't need to worry about their type).
Let me think about things a little more. There may be a solution that doesn't involve porting significant amounts of existing code or using private API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that makes sense to me. In that case, what about, just within _assert_has_data_for_time
, casting the string to one of the datetime objects? Sorry if I'm still being naive here. If not, at least maybe we should add a note in the docs and/or a logging message that string dates don't undergo this check?
Ultimately I trust your judgment on this one and don't want to bog you down on this point. I don't think at all that the current implementation will incur a lot of technical debt (quite the opposite in fact). So let's go with whatever you decide is best at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, this is exactly what I had in mind when discussing the options here: #273 (comment).
Sorry, right of course. So that probably is overkill; I think a warning/docs note would be fine.
Would it help you understand where I'm coming from if I provided an explicit example for why casting string inputs as dates is difficult and something that might best be left for upstream libraries?
An example would definitely help, but that might be overkill just for my behalf. On the other hand, it occurred to me: would this functionality be appropriate in cftime? In which case maybe a clean example would be good as part of an Issue on it in that repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example would definitely help
For sure! Let me provide some context here.
Ideally we would like to allow users to specify partial-datetime strings as time bounds for their calculations. Partial-datetime strings are a very convenient shorthand; they allow you to select data between the earliest datetime possible for the lower bound, all the way to the greatest datetime possible on the upper bound.
For instance, if you wanted to compute things for data between year 2000 and 2010 (inclusive), you could just specify a date range of ('2000', '2010')
. This is in contrast to using explicit datetimes, which would look like (datetime(2000, 1, 1), datetime(2010, 12, 31, 23, 59, 59, 999999))
. You can continue refining your interval by filling in more of the bounds. So for instance if you wanted to select data between February, 2000 and February 2010, you could use ('2000-02', '2010-02')
and this would map to (datetime(2000, 2, 1), datetime(2010, 2, 28, 23, 59, 59, 999999))
.
For this reason, it is not a simple matter of "casting" a string to a particular datetime. The datetime you transform it to matters based on the context. For instance if '2000-02'
is the lower bound, then it maps to datetime(2000, 2, 1)
, while if it is the upper bound it maps to datetime(2000, 2, 29, 23, 59, 59, 999999)
. pandas and now xarray have logic for doing this for numpy and cftime dates, so it sort of feels heavy-handed to re-implement it in aospy.
So that probably is overkill; I think a warning/docs note would be fine.
Sounds good. I'll mull this over for the next few days. If I can't think of a simple solution, I'll go forward with the warning/docs note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok that makes total sense. Thanks.
I'll mull this over for the next few days. If I can't think of a simple solution, I'll go forward with the warning/docs note.
Sounds good. My only remaining question is re: if this functionality should/could go in cftime eventually. Perhaps Jeremy's comment on this thread on the xarray mailing list is relevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only remaining question is re: if this functionality should/could go in cftime eventually. Perhaps Jeremy's comment on this thread on the xarray mailing list is relevant?
I agree with Jeremy and think a cftime version of datetime.strptime
would be great; it could help in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. I opened an issue to start that discussion: Unidata/cftime#51
Who knows if/when that will progress though, so let's not wait up for it for this PR.
@@ -472,6 +397,9 @@ def _assert_has_data_for_time(da, start_date, end_date): | |||
AssertionError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above in this docstring: netCDF4.netcdftime -> cftime (maybe grep for the former throughout the package)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. This turns out to be the only place (and I fixed it).
aospy/utils/times.py
Outdated
int | ||
""" | ||
if isinstance(date, str): | ||
pattern = '(?P<year>\d{4})' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment to describe in words what this pattern is looking for, and/or an example?
docs/examples.rst
Outdated
description='Precipitation generated by convective parameterization', | ||
) | ||
precip_largescale = Var( | ||
name='precip_largescale', # name used by aospy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to retain the 4 spaces indentation on the continuation lines
Thanks for the helpful comments @spencerahill! When you get a chance, this should be ready for another review. |
Thanks @spencerkclark for the new commits...everything looks good, but still do want to make sure I understand the design decisions c.f. my in-line comment above before merging. |
@spencerahill thanks for opening the issue in cftime! I couldn't think of a simple solution, so I went ahead with the warning and docs note. This should be ready for a final review. |
Awesome. Thanks much @spencerkclark! Great to finally have this in. |
Closes #94
Closes #98
Closes #243
Now that Xarray version 0.10.4 has been released, we can begin the process of removing our logic for working around Timestamp limitations in pandas. For the most part this was actually pretty painless.
A few notes:
One breaking change is that now instead of usingUpdated to preserve support fordatetime.datetime
objects as input for specifying dates or time ranges, we require eithernp.datetime64
orcftime.datetime
objects. This is because some dates, e.g. February 30th in a 360 day calendar, are not representable asdatetime.datetime
objects.datetime.datetime
objects.Eventually it would be good to accept strings as inputs (i.e. for partial datetime string indexing), but I think we should save that for a future PR. Strings are nice, because they are calendar-agnostic.I went ahead and added support for this in this PR.CFTimeIndex
. We do use resample when computing transient eddy quantities (to downsample to a monthly mean time series). In the meantime though, we might think about creating a one-off function for that purpose (i.e. not worry about supporting general frequency resampling and only handle the monthly case).cftime.datetime
objects are not exact to microsecond precision, I needed to add a tolerance toassert_has_data_for_time
as discussed in Selecting datetimes within a certain range (days) without having to specify the start and end dates precisely (hours or higher) #243.@spencerahill does this seem reasonable
(in particular the breaking change)? If it does, I'll try to clean things up over the next few days.