55/refactor fix_latlon_coord #63
Conversation
@@ -1,6 +1,8 @@
import unittest.mock as mock
from dataclasses import dataclass
from collections import namedtuple
import numpy as np
from iris.exceptions import CoordinateNotFoundError

import umpost.um2netcdf as um2nc
Out of interest, how am I meant to get this working for testing? I can't pip install -e the umpost directory in its current state, so I'm getting test collection errors.
Whoops, due to prior history this project has started oddly. Spencer is using a pre-existing gadi environment & I'm reusing a local virtualenv from another related project. There were no setup docs until last week, when I grabbed docs from a related project. Do you feel like yet another PR review for docs? #62
(these docs are missing pip -e)
I do like me some good docs...
Wow, there is much more testing than I initially thought!
Some initial thoughts: fixing um2nc is tricky on account of uncertain requirements (gathered from the code), retrofitting tests, & the fact that the processing code contains substantial nested boolean logic. The latter causes a combinatorial testing blowout (more so than for the fix naming funcs).
In both the earlier cube renaming & lat/long fixes, our testing approach is blowing out in an attempt to cover most/all code paths. Part of the problem with being this thorough is writing tests against internal implementation details. Some of the mocks in test_fix_lat_coord_name() & the process/masking fixes are a symptom of coupling to internal details (although some of this is unavoidable as we work around constraints).
Additionally, I assume writing the lat/long tests was tricky, which implies future maintenance/code changes could be harder (e.g. changing processing code potentially breaks multiple tests). This could happen if cube modification requirements change.
Thus, I've been rethinking our testing strategy, helped by these TDD videos (~1 hour each):
- Ian Cooper: “TDD, Where did it all go wrong?” https://www.youtube.com/watch?v=EZ05e7EMOLM
- Ian Cooper: “TDD Revisited” (more instructional) https://www.youtube.com/watch?v=IN9lftH0cJc
One key message is focusing testing on external interfaces, avoiding references to internal implementation details. This suggests focusing tests around fix_latlon_coords() & skipping direct testing of the sub-functions. Calling the sub-functions can be treated as an implementation detail, with the tests asserting correct end results (thus, we treat the check functions as correct/tested if the cube vars are correct). Also, as um2nc cube mod functions often don't return anything, we'll have to make assertions on the modified cubes. In decoupling from details, we should move away from assertions on mocks (except for the DummyCubes). Some mocking is needed as an intermediate stopgap while the codebase is redesigned, but I think we can reduce it iteratively (there's a rough sketch of this style of test at the end of this comment).
As a next step, I think we should try an experiment to see how this works, trying these steps:
- Temporarily comment out the grid type & bounds tests as an implementation detail
  - e.g. test_is_lat/lon_river_grid, test_add_latlon_coord_bounds_has_bounds()
- Test a common use case with a test_fix_latlon_coords_type_change() type function
  - Remove mock calls, so fix_latlon_coords() executes code for grids, bounds & coord naming fixes etc.
  - Are more compound data fixtures needed to make a valid input?
- Try assertions on the modded cube (check bounds, data types etc.)
- Run coverage tests to explore how this higher level testing covers the code
If the above is straightforward:
- configure a cube with different data to execute other branches
- run tests & re-analyse the coverage
Otherwise, if the testing is fiddly/tricky with setup or otherwise:
- adjust the experiment, testing against the mid-level funcs add_latlon_coord_bounds() & fix_..._coord_name()
- explore the coverage...
Analysing higher level testing & coverage is probably a good pair-dev exercise. It's also likely our code repair efforts will require starting with coupled tests, then iteratively refactoring the processing code & tests. Thus, while things are harder now, it's temporary while we iterate towards a better design.
Hopefully this makes sense!
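To make that concrete, here's a rough sketch of the style of test I mean. It is illustrative only: make_n96_dummy_cube doesn't exist, and the exact fix_latlon_coords signature, grid type & spacing values are assumptions rather than the current code.

import numpy as np
import pytest

import umpost.um2netcdf as um2nc


@pytest.fixture
def n96_cube():
    # Hypothetical fixture: build a small cube whose latitude & longitude
    # coords mimic an N96 output (float32 points, no bounds). The helper
    # called here stands in for whatever compound fixture we settle on.
    return make_n96_dummy_cube()


def test_fix_latlon_coords_standard_grid(n96_cube):
    # Exercise the top-level function; argument names/order are assumed.
    um2nc.fix_latlon_coords(n96_cube, grid_type="EG", dlat=1.25, dlon=1.875)

    lat = n96_cube.coord("latitude")
    lon = n96_cube.coord("longitude")

    # Assert on the end state of the cube rather than on which helpers ran.
    assert lat.points.dtype == np.float64
    assert lon.points.dtype == np.float64
    assert lat.has_bounds()
    assert lon.has_bounds()
    assert lat.var_name == um2nc.VAR_NAME_LAT_STANDARD
    assert lon.var_name == um2nc.VAR_NAME_LON_STANDARD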
Wow, that required substantially more effort than the initial skim indicated. Apologies!
This looks good to go & will help with the process of merging all the current PRs. @marc-white do you want to do a final check?
# TODO: This test will not catch changes to the exceptions in um2nc. E.g.
# if we decided to instead raise a RuntimeError in um2nc when timeseries
# are encountered, the conversion of ESM1.5 outputs would undesirably
# crash, but this test would say everything is fine, since we're
# prescribing the error that is raised.
You could define the error type raised in this instance as a global variable in conversion_driver_esm1p5.py, and then import it for use in the test as the expected error type?
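A minimal sketch of that idea, with entirely hypothetical names (the real module will have its own error class, entry point and fixtures):

# conversion_driver_esm1p5.py (sketch)
import umpost.um2netcdf as um2nc

# Single source of truth for the exception expected when timeseries are hit;
# the class name on the right is an assumption for illustration only.
TIMESERIES_ERROR = um2nc.UnsupportedTimeSeriesError


# test module (sketch)
import pytest
import umpost.conversion_driver_esm1p5 as esm1p5_convert


def test_timeseries_inputs_are_reported(mock_um2nc_process):
    # mock_um2nc_process is a hypothetical fixture standing in for the
    # conversion call made by the driver.
    mock_um2nc_process.side_effect = esm1p5_convert.TIMESERIES_ERROR

    # If the driver ever switches to raising a different error type, the
    # test follows automatically because it imports the same constant.
    with pytest.raises(esm1p5_convert.TIMESERIES_ERROR):
        esm1p5_convert.convert_fields_file("fake_fields_file")  # hypothetical entry point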
test/test_um2netcdf.py
Outdated
D_LAT_N96 = 1.25
D_LON_N96 = 1.875
Could you comment what these numbers are for clarity?
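For example, something along these lines might do it (just a suggested wording; N96 has 192 x 144 grid points, which gives these spacings):

# Degree spacing of the N96 horizontal grid (192 x 144 points):
# 1.25 degrees between latitudes, 1.875 degrees between longitudes.
D_LAT_N96 = 1.25
D_LON_N96 = 1.875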
Are these meant to be DELTA_LAT_N96 & DELTA_LON_N96?
Yes, I'll just go with the comment.
Comments added in 21dc396
umpost/um2netcdf.py
Outdated
-    lat = cube.coord('latitude')
-
-    # Force to double for consistency with CMOR
-    lat.points = lat.points.astype(np.float64)
-    _add_coord_bounds(lat)
-    lon = cube.coord('longitude')
-    lon.points = lon.points.astype(np.float64)
-    _add_coord_bounds(lon)
-
-    lat = cube.coord('latitude')
-    if len(lat.points) == 180:
-        lat.var_name = 'lat_river'
-    elif (lat.points[0] == -90 and grid_type == 'EG') or \
-         (np.allclose(-90.+0.5*dlat, lat.points[0]) and grid_type == 'ND'):
-        lat.var_name = 'lat_v'
+def fix_lat_coord_name(lat_coordinate, grid_type, dlat):
+    """
+    Add a 'var_name' attribute to a latitude coordinate object
+    based on the grid it lies on.
+
+    NB - Grid spacing dlat only refers to variables on the main
+    horizontal grids, and not the river grid.
+
+    Parameters
+    ----------
+    lat_coordinate: coordinate object from iris cube (edits in place).
+    grid_type: (string) model horizontal grid type.
+    dlat: (float) meridional spacing between latitude grid points.
+    """
+
+    if lat_coordinate.name() != LATITUDE:
+        raise ValueError(
+            f"Wrong coordinate {lat_coordinate.name()} supplied. "
+            f"Expected {LATITUDE}."
+        )
+
+    if is_lat_river_grid(lat_coordinate.points):
+        lat_coordinate.var_name = VAR_NAME_LAT_RIVER
+    elif is_lat_v_grid(lat_coordinate.points, grid_type, dlat):
+        lat_coordinate.var_name = VAR_NAME_LAT_V
+    else:
+        lat_coordinate.var_name = VAR_NAME_LAT_STANDARD
+
+
+def fix_lon_coord_name(lon_coordinate, grid_type, dlon):
+    """
+    Add a 'var_name' attribute to a longitude coordinate object
+    based on the grid it lies on.
+
+    NB - Grid spacing dlon only refers to variables on the main
+    horizontal grids, and not the river grid.
+
+    Parameters
+    ----------
+    lon_coordinate: coordinate object from iris cube (edits in place).
+    grid_type: (string) model horizontal grid type.
+    dlon: (float) zonal spacing between longitude grid points.
+    """
+
+    if lon_coordinate.name() != LONGITUDE:
+        raise ValueError(
+            f"Wrong coordinate {lon_coordinate.name()} supplied. "
+            f"Expected {LONGITUDE}."
+        )
+
+    if is_lon_river_grid(lon_coordinate.points):
+        lon_coordinate.var_name = VAR_NAME_LON_RIVER
+    elif is_lon_u_grid(lon_coordinate.points, grid_type, dlon):
+        lon_coordinate.var_name = VAR_NAME_LON_U
+    else:
+        lon_coordinate.var_name = VAR_NAME_LON_STANDARD
Given these two functions have an almost identical structure, is it worth considering refactoring them as follows:
- A private helper function that does the actual work;
- Turn the existing functions into wrappers that call the private function with the expected coordinate name, and comparison options for that coordinate?
I've put together a trial of this - let me know if it's along the lines of what you were thinking! The first attempt was using a private function:
def _fix_horizontal_coord_name(coordinate, grid_type, grid_spacing,
                               river_grid_check, river_grid_name,
                               staggered_grid_check, staggered_name,
                               base_name):
    if river_grid_check(coordinate.points):
        coordinate.var_name = river_grid_name
    elif staggered_grid_check(coordinate.points, grid_type, grid_spacing):
        coordinate.var_name = staggered_name
    else:
        coordinate.var_name = base_name


def fix_lat_coord_name(lat_coordinate, grid_type, dlat):
    _fix_horizontal_coord_name(coordinate=lat_coordinate,
                               grid_type=grid_type,
                               grid_spacing=dlat,
                               river_grid_check=is_lat_river_grid,
                               river_grid_name=VAR_NAME_LAT_RIVER,
                               staggered_grid_check=is_lat_v_grid,
                               staggered_name=VAR_NAME_LAT_V,
                               base_name=VAR_NAME_LAT_STANDARD)
And another option was to put together a dictionary holding the specified checks:
HORIZONTAL_GRID_NAMING_DATA = {
    LATITUDE: {
        "river_grid_check": is_lat_river_grid,
        "river_grid_name": VAR_NAME_LAT_RIVER,
        "staggered_grid_check": is_lat_v_grid,
        "staggered_name": VAR_NAME_LAT_V,
        "base_name": VAR_NAME_LAT_STANDARD,
    },
    LONGITUDE: {
        "river_grid_check": is_lon_river_grid,
        "river_grid_name": VAR_NAME_LON_RIVER,
        "staggered_grid_check": is_lon_u_grid,
        "staggered_name": VAR_NAME_LON_U,
        "base_name": VAR_NAME_LON_STANDARD,
    }
}


def fix_latlon_coord_name(coordinate, grid_type, grid_spacing):
    naming_data = HORIZONTAL_GRID_NAMING_DATA[coordinate.name()]
    if naming_data["river_grid_check"](coordinate.points):
        coordinate.var_name = naming_data["river_grid_name"]
    elif naming_data["staggered_grid_check"](coordinate.points, grid_type, grid_spacing):
        coordinate.var_name = naming_data["staggered_name"]
    else:
        coordinate.var_name = naming_data["base_name"]
In terms of readability, I think I find the separate versions a bit easier to follow, though they do involve some duplicate code. Happy to hear other opinions and ideas though!
Are these experiments on other branches?
While the code & tests work here (with a bit of duplication), refining the code can be deferred in favour of modularisation & test coverage. @blimlim What are your thoughts on adding a neatening task to issue #27, linking it to this discussion & returning to it later?
That sounds like a good idea, I've added a task to #27. The two experiments are up on the following branches:
https://github.com/ACCESS-NRI/um2nc-standalone/tree/55/refactor-fix_latlon_coord-private-name-func
https://github.com/ACCESS-NRI/um2nc-standalone/tree/55/refactor-fix_latlon_coord-dictionary-name-func
Deferred action sounds like a good idea.
My original thought was like your option 2, but even more general - each coordinate could have an iterable of (check, name) tuples that you go through until you hit a match. The last tuple in the iterable would then be (True, <default name>).
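For reference, a rough sketch of that shape (purely illustrative, not code from the PR or its branches; the lambdas just adapt the existing check functions, whose signatures differ):

# Illustrative only: an ordered set of (check, name) rules for latitude.
LAT_NAMING_RULES = (
    (lambda pts, grid_type, dlat: is_lat_river_grid(pts), VAR_NAME_LAT_RIVER),
    (lambda pts, grid_type, dlat: is_lat_v_grid(pts, grid_type, dlat), VAR_NAME_LAT_V),
    (lambda pts, grid_type, dlat: True, VAR_NAME_LAT_STANDARD),  # the (True, <default name>) fallback
)


def _apply_naming_rules(coordinate, grid_type, grid_spacing, rules):
    # Walk the rules in order; the always-True final rule guarantees a
    # default var_name is applied when no river/staggered grid matches.
    for check, var_name in rules:
        if check(coordinate.points, grid_type, grid_spacing):
            coordinate.var_name = var_name
            return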
umpost/um2netcdf.py
Outdated
if coordinate_name not in [LONGITUDE, LATITUDE]:
    raise ValueError(
        f"Wrong coordinate {coordinate_name} supplied. "
        f"Expected one of {LONGITUDE}, {LATITUDE}."
    )
Are we ever likely to get coordinate names other than LATITUDE and LONGITUDE? If so, it might be worth parameterizing out the 'set of coordinate names I expect' for use in this situation. You could also do it using GLOBAL_COORDS_BOUNDS.keys().
I added this error to try to guard against add_latlon_coord_bounds being inadvertently used on the wrong coordinates, and so here we should only be expecting LATITUDE or LONGITUDE.
Would adding a constant, e.g. HORIZONTAL_COORD_NAMES = [LATITUDE, LONGITUDE], and then checking if coordinate_name not in HORIZONTAL_COORD_NAMES be a bit cleaner here?
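i.e. something like this (sketch only, against the snippet above):

HORIZONTAL_COORD_NAMES = [LATITUDE, LONGITUDE]

if coordinate_name not in HORIZONTAL_COORD_NAMES:
    raise ValueError(
        f"Wrong coordinate {coordinate_name} supplied. "
        f"Expected one of {HORIZONTAL_COORD_NAMES}."
    )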
Precisely, although you've possibly already got the 'constant' available as GLOBAL_COORDS_BOUNDS.keys().
Is GLOBAL_COORDS_BOUNDS likely to change, gaining keys that coordinate_name should not be allowed to match?
The advantage of the current line is that the test condition is obvious.
Also a minor formatting thing: lines 266-268 look over-indented.
Have updated the indentation in 37049c7
The comments I've made are only tweaks that you're free to think about and implement or not at your discretion - good job!
As a general observation, I think staggering the two reviews has been beneficial. The later second review brings fresh eyes after a few refinements. I'm pretty sure I inadvertently started to skim the code as it got more familiar.
Thanks @truth-quark and @marc-white for the very helpful reviews! And apologies it took so much work on your end!
This pull request closes #55. Apologies that there's a bit in here... It refactors the fix_latlon_coord function, which was used to add bounds, convert types, and add variable names for each cube's horizontal coordinates.
The main structural changes are to split the function into several more modular ones in order to help with readability and unit testing. Other changes to try and improve readability and testability include:
- … the fix_latlon_coord function, and raises a more specific exception.
To help with the unit tests, the pull request adds a DummyCoordinate class to mimic iris cube coordinate objects, with imitations of the name and has_bounds methods needed to test the new functions. It also adds an extension to the DummyCube class, DummyCubeWithCoords, which can be used to add DummyCoordinates to a DummyCube.
Any suggestions about the structure, logic, tests, and anything else would be great!
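For anyone skimming, a rough illustration of the kind of test double being described (the real DummyCoordinate in the PR may differ in fields and detail; this is only a sketch of the idea):

from dataclasses import dataclass


@dataclass
class DummyCoordinate:
    # Illustrative stand-in for an iris coordinate: just enough surface
    # area (name(), has_bounds(), points/bounds/var_name) to exercise the
    # refactored coordinate-fixing functions without real cube files.
    coordname: str
    points: object = None
    bounds: object = None
    var_name: str = None

    def name(self):
        return self.coordname

    def has_bounds(self):
        return self.bounds is not None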