-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up scmdata testing #226
Comments
There is a commit here #214 that may/may not be related |
@mikapfl suggested that we also expose a ScmRun object with some known data. It can then be used in the docs etc |
@znicholls I'd also propose we flip the logic to be similar to pandas etc. Instead of "allow_unordered" the option would be "check_order" (defaults to False instead of True?) or perhaps "check_like" which is the name pandas uses. The Also I'd propose adding a How about an initial signature of: def assert_scmrun_almost_equal(
left: BaseScmRun,
right: BaseScmRun,
check_order: bool = False,
check_exact: bool = False,
check_exact_units: bool = False,
rtol: float = 1.0e-5,
atol: float = 1.0e-8,
) -> None:
... We could also leave old function as is with an additional deprecation warning and add the new implementation separately. That way, it isn't a breaking change |
Copying pandas' names as much as possible will reduce cognitive overload I think so I would go with that. I don't really understand why we have any idea of ordering. Our data model doesn't make any guarantees about ordering does it? So why would our tests imply that is something we support? Given the above, I'd pull out def assert_scmrun_almost_equal(
left: BaseScmRun,
right: BaseScmRun,
rtol: float = 1.0e-5,
atol: float = 1.0e-8,
equal_nan: bool = False,
check_exact_units: bool = False,
time_axis: str | None = None,
# this gives the user full control to use whatever pandas functionality they want
# in the timeseries comparison step
assert_frame_equal_kwargs: dict[str, any] | None = None:
) -> None: Click to see my sketch implementationdef assert_scmrun_almost_equal(
left: BaseScmRun,
right: BaseScmRun,
rtol: float = 1.0e-5,
atol: float = 1.0e-8,
equal_nan: bool = False,
check_exact_units: bool = False,
time_axis: str | None = None,
assert_frame_equal_kwargs: dict[str, any] | None = None:
) -> None:
"""
Assert scmrun almost equal using :func:`np.isclose`
"""
if assert_frame_equal_kwargs is None:
assert_frame_equal_kwargs = {}
# Something like
# Align timeseries (this is done via pandas but can probably be optimised
# to suit our internal storage better)
left_ts = left.timeseries(time_axis=time_axis)
right_ts = right.timeseries(time_axis=time_axis)
# not sure which way round axis should be
# We could guard this behind a `check_like`, same behaviour as pandas (
# i.e. we will order data before checking) but I don't fully understand
# why we wouldn't just have this always be True. Our data model doesn't
# make any guarantees about preserving order does it, so shouldn't we
# signal that to the user by always aligning before checking equivalence?
left_ts, right_ts = left_ts.align(right_ts, axis="rows")
if not check_exact_units:
# Drop out units before checking indexes
# check that indexes align, if they don't we can fast exit (unless we keep
# some ability to check that values are same even if metadata isn't, I
# know we have that currently but I don't really understand what it is
# useful for (why would we be ok if we had two ScmRun with different
# labelling but the same numbers?!))
if not left.index.equals(right.index):
# raise some error using
only_in_left = left_ts.index.difference(right_ts.index)
only_in_right = right_ts.index.difference(left_ts.index)
raise IndexAlignmentError(only_in_left, only_in_right)
if not check_exact_units:
# Convert to same units here
# The transpose makes it easier to see differences as pandas is less
# likely to truncate rows than columns
left_ts_t = left_ts.T
right_ts_t = right_ts.T
differences = ~np.isclose(
left_ts_t.values,
right_ts_t.values,
rtol=rtol,
atol=atol,
equal_nan=equal_nan,
)
if differences.any().any():
# Let pandas render a nice error for us, showing only the areas
# of the timeseries that actually have differences
pdt.assert_frame_equal(
left.loc[differences.any(axis=1), differences.any(axis=0)],
right.loc[differences.any(axis=1), differences.any(axis=0)],
# Let user pass in any other pandas kwargs they want
**assert_frame_equal_kwargs,
)
Very smart, let's do that. |
Is your feature request related to a problem? Please describe.
The
assert_scmdf_almost_equal
function is a bit messier and harder to use than is necessary.Describe the solution you'd like
We clean up the function. Immediate things to do:
assert_scmrun_almost_equal
allow_unordered
andcheck_ts_names
arguments are still relevant. If they are, try to better capture what they doBaseScmRun
pdt.assert_frame_equal
and we can removenpt.assert_allclose
(in theory, the pandas function also uses assert all close so this feels like it should be possible, we may have issues with the time point comparisons of course)Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
This would be a breaking change, so would need appropriate deprecation warnings (probably over at least two minor releases)
The text was updated successfully, but these errors were encountered: