-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
human-readable, type safe serialization #296
Comments
Something I also noticed when looking at dtypes is that the meta handling in scmdata also gobbles |
def test_correct_type_meta():
df = pd.DataFrame(
{
"unit": pd.Series([np.nan, np.nan], dtype=str),
"variable": pd.Series(["1", "2"], dtype=str),
"extra": pd.Series([1, 2], dtype=int),
2015: pd.Series([1.0, 2.0], dtype=float),
},
)
run = scmdata.run.BaseScmRun(pd.DataFrame(df))
assert run.meta.dtypes["extra"] == int
assert run.meta.dtypes["variable"] in (str, object)
assert run.meta.dtypes["unit"] in (str, object) fails when asserting the |
Wow that's super rough.
My quick thoughts
- git diffable means binary formats are out
- type safe means pure csv and other standard formats are out
So I think we're left with hybrid formats (csv plus metadata) like what
primap2 has or what is done with something like th data package format. I
think scmdata should be able to support that fairly easily. I would
normally hesitate to add a new data format because supporting it cn be
tricky (an experience we had a bit with the Merced format), but I can see
the use case very clearly here so I think it's worth that pain.
Plan b of only serializing to binary and then building basic tooling for
doing diffs also seems like a viable option (maybe even the better one).
On the Nan float conversion column thing. Urgh yes our initialisation paths
are not good in hindsight.
…On Thu, Jan 11, 2024, 13:46 Mika Pflüger ***@***.***> wrote:
Something I also noticed when looking at dtypes is that the meta handling
in scmdata also gobbles np.nan columns and coerces their type to float.
def test_correct_type_meta():
df = pd.DataFrame(
{
"unit": pd.Series([np.nan, np.nan], dtype=str),
"variable": pd.Series(["1", "2"], dtype=str),
"extra": pd.Series([1, 2], dtype=int),
2015: pd.Series([1.0, 2.0], dtype=float),
},
)
run = scmdata.run.BaseScmRun(pd.DataFrame(df))
assert run.meta.dtypes["extra"] == int
assert run.meta.dtypes["variable"] in (str, object)
assert run.meta.dtypes["unit"] in (str, object)
fails when asserting the dtype for unit, because it is float now.
—
Reply to this email directly, view it on GitHub
<#296 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFUH5G4BESUP4AVQ67BOCA3YN7UKRAVCNFSM6AAAAABBWOWBPGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBXGE4TIOBTGE>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
The PRIMAP2 style of separate metadata file is at the top of my list. If we have a CSV files (which is git friendly) and the metadata coerces the types at runtime, does that pass the requirements? Pros:
Cons:
This would be a good opportunity to migrate to a |
Yes, if the separate metadata file contains type information, it passes the requirements. Note that the primap2 interchange format on-disk representation as it stands does not contain type information. We expect that non-Excel users use the netcdf files, so we didn't consider adding type information. But would probably be easy enough to add type information. The only downside is that we need to maintain the casting and coercion logic ourselves, which is a bit awkward. |
Well, at least the definition says that in the primap2 interchange format, all text must be quoted, so it is possible to decide between |
Is your feature request related to a problem? Please describe.
I just chased a very mysterious bug for 4 hours. I couldn't believe my eyes because the results made zero sense and I thought I messed up some very basic functions. Well, turns out that I serialized an
ScmRun
to CSV, but one of my metadata columns (containing IPCC 2006 categories) had only top-level categories, which all could be parsed as integer. When deserializing thisScmRun
, the inferred data type was integer for this metadata column. Then, selecting for the category using a string would not turn up anything. This is hard to debug because data types are not shown for columns by default by pandas or ScmRun representations.Describe the solution you'd like
I want a human-readable, gitlab-diffable serialization of an ScmRun which is type safe.
My ideal serialization would be:
Nothing which is built into pandas fits the bill fully. There is
df.to_json(orient="records", lines=True)
which looks nice but isn't type safe for all-null columns (a string column with all-null values will be converted to a float column when serialized and deserialized again). There isdf.to_json(orient="table")
which is fully type safe but one giant line. You can prettyprint that with any json pretty printer, but it leads to every value on its own line, not one line per row like CSV.An alternative would maybe involve something which is based on CSV but with an additional header which looks like
df.to_json(orient="table")["schema"]
, but it wouldn't be deserializable any more with standard pandas functionality, you have to use some custom wrapper.Describe alternatives you've considered
Absolutely forbid CSV in any context and always use binary, type safe formats. Then, build infrastructure to make changes in the binary formats reviewable. Dunno, a bot which adds visual diffs or just a smallish CLI which can be given two git branches and a file name and makes nice diffs.
Additional context
Maybe I'm biased right now because I wasted so much time on this, but I think we really can't tolerate not-type-safe serializations. If I have to worry about details like "oh no, my categories are only top-level after filtering this, now I can't serialize it any more" when building higher-level functionality, it will be an ongoing mental capacity cost. Not only when debugging, but every time I have to serialize or deserialize anything. I would probably be paranoid and at least start
assert
ing dtypes everywhere on myScmRun
s.The text was updated successfully, but these errors were encountered: