Add schema hinting to new sheets functionality (C4-1088) #280

Merged Oct 31, 2023 (68 commits)

Commits
4c84f0b
First cut at tools for parsing workbooks.
netsettler Aug 14, 2023
7b73a67
Refactor to separate some functionality into a separate sevice class.
netsettler Aug 14, 2023
3d4573f
Add a csv file for testing.
netsettler Aug 14, 2023
f4e5cfa
Add some negative testing.
netsettler Aug 14, 2023
e9d2465
Update lock file.
netsettler Aug 14, 2023
6e9060f
Document new sheets_utils module.
netsettler Aug 14, 2023
df12c91
Issue a beta for this functionality.
netsettler Aug 15, 2023
6a39c8a
Fix documentation for sheet_utils.
netsettler Aug 15, 2023
eedb5c6
Add some declarations. Small refactors to improve modularity.
netsettler Aug 16, 2023
a6b68fe
Rearrange some methods for presentational reasons.
netsettler Aug 16, 2023
3ff63a9
First cut at useful functionality.
netsettler Aug 17, 2023
39bd2e0
Some name changes to make things more abstract. workbook becomes read…
netsettler Aug 17, 2023
77b72f6
Rename sheetname to tabname throughout, to be more clear that this is…
netsettler Aug 17, 2023
ba8c55c
Add some doc strings. Rename load_table_set to just load. Arrange for…
netsettler Aug 17, 2023
50488cb
Add load_items function. Fix some test names. Update changelog.
netsettler Aug 17, 2023
807e525
Experimental bug fix from Will to hopefully make get_schema_names work.
netsettler Aug 17, 2023
2a8e81a
update changelog
netsettler Aug 17, 2023
718054a
Update dcicutils/sheet_utils.py
netsettler Aug 17, 2023
682c95a
Merge branch 'master' into kmp_sheet_utils
netsettler Aug 17, 2023
582f002
Merge branch 'kmp_sheet_utils' into kmp_sheet_utils_refactor_for_csv
netsettler Aug 17, 2023
56d1459
Add some comments in response to Doug's code review.
netsettler Aug 17, 2023
2facf9e
Support TSV files.
netsettler Aug 17, 2023
bcc4e63
Add changelog info about tsv files.
netsettler Aug 17, 2023
9de282e
Add a missing data file.
netsettler Aug 17, 2023
8d6495f
First stable cut at schema hinting. Doesn't find schemas automaticall…
netsettler Aug 23, 2023
3a103ee
Merge branch 'master' into kmp_sheet_utils
netsettler Aug 23, 2023
56f702a
Mark chardet as an acceptable license for use.
netsettler Aug 24, 2023
08d428e
Merge branch 'kmp_sheet_utils' into kmp_sheet_utils_refactor_for_csv
netsettler Aug 24, 2023
42ad579
Merge branch 'kmp_sheet_utils_refactor_for_csv' into kmp_sheet_utils_…
netsettler Aug 24, 2023
60ada3f
Backport some small fixes and cosmetics from the schemas branch.
netsettler Aug 24, 2023
690a833
Cosmetic fix.
netsettler Aug 24, 2023
946b998
Add some missing newlines in data files.
netsettler Aug 24, 2023
36e7de0
Support for coping with .tsv files where trailing whitespace is 'help…
netsettler Aug 24, 2023
a51fb27
PEP8
netsettler Aug 24, 2023
6f097a6
Document our choice of why is_uuid is defined here as it is.
netsettler Aug 24, 2023
477c7a2
PEP8
netsettler Aug 24, 2023
09b4c43
Fix error handling to be clearer.
netsettler Aug 24, 2023
f3bd815
Fix CHANGELOG to reflect recent renamings.
netsettler Aug 24, 2023
7627f6f
Fix a type hint and some PEP8.
netsettler Aug 24, 2023
98cd37c
Implement a cut at escaping for tsv files.
netsettler Aug 24, 2023
3852e56
Add a test case for all of the pieces of parsing and schema hinting p…
netsettler Aug 25, 2023
660df9c
Small cosmetic changes and some additional support for upcoming work.
netsettler Aug 25, 2023
34d528b
Fix a unit test to conform to new google account name.
netsettler Aug 25, 2023
1c34ad0
Fix typo in comment (dcicutils/misc_utils.py)
netsettler Aug 25, 2023
41fad79
Add some doc strings and comments.
netsettler Aug 28, 2023
6e8ce2c
Rename tabname to tab_name throughout the sheet_utils interfaces.
netsettler Aug 30, 2023
04eb58c
Add support for reading inserts dirs, .json, .jsonl (two formats), an…
netsettler Aug 31, 2023
ce9f9bc
Bump beta version.
netsettler Aug 31, 2023
0ea5b62
Add yaml formats.
netsettler Aug 31, 2023
bcc1128
Add class AbstractItemManager. Rename InsertsItemManager to InsertsDi…
netsettler Sep 1, 2023
7de093a
Rename ._parser() to ._parse_json_data(). Factor type checks out of .…
netsettler Sep 1, 2023
b01e34b
Rename _parse_json_data, _load_json_data, and _check_json_data, respe…
netsettler Sep 1, 2023
0ae48ee
WIP. Testing good.
netsettler Sep 1, 2023
1e2c5a9
WIP. Tests passing.
netsettler Sep 2, 2023
b8a4c39
Rearrange the way escaping= works so both csv an tsv files can using …
netsettler Sep 2, 2023
a2fe079
Separate registration of regular table set managers from registration…
netsettler Sep 2, 2023
91ddce0
Stub in checking of required headers.
netsettler Sep 2, 2023
142a20b
Bump beta version.
netsettler Sep 5, 2023
e09af07
PEP8
netsettler Sep 5, 2023
70762c6
Merge branch 'kmp_schemas_from_vapp' into kmp_sheet_utils_with_vapp
netsettler Sep 7, 2023
7d2ecaa
Fix a bug in newly proposed ff_utils.get_schemas with vapp.
netsettler Sep 7, 2023
5e46273
Extend VirtualApp to amke it easier to test by adding an AbstractVirt…
netsettler Sep 7, 2023
53de60a
Implement portal_vapp= in sheet_utils.
netsettler Sep 7, 2023
630720f
Simplifications per Will's code review.
netsettler Sep 7, 2023
5a07b69
Merge utils 7.10.0 from master.
netsettler Sep 7, 2023
295adfe
Merge pull request #282 from 4dn-dcic/kmp_sheet_utils_with_vapp
netsettler Sep 7, 2023
486adce
Merge branch 'master' into kmp_sheet_utils_schema_hinting
netsettler Sep 9, 2023
54c51aa
Add support for zipped files.
netsettler Sep 12, 2023
28 changes: 28 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,33 @@ Change Log
----------


7.12.0
======

* New module ``sheet_utils`` for loading workbooks.

* Important things of interest:

* Class ``ItemManager`` for loading Item-style data
from any ``.xlsx``, ``.csv`` or ``.tsv`` files.

* Function ``load_items`` that does the same as ``ItemManager.load``.

* Various lower-level implementation classes such as:

* Classes ``XlsxManager``, ``CsvManager`` and ``TsvManager`` for loading raw data
from ``.xlsx``, ``.csv``, and ``.tsv`` files, respectively.

* Classes ``XlsxItemManager``, ``CsvItemManager``, and ``TsvItemManager`` for loading Item-style data
from ``.xlsx``, ``.csv``, and ``.tsv`` files, respectively.

* New functionality in ``misc_utils``:

* New function ``is_uuid`` (migrated from Fourfront)
* New function ``pad_to``
* New class ``JsonLinesReader``


7.11.0
======

@@ -16,6 +43,7 @@ Change Log
* Fix in ``get_schema`` and ``get_schemas`` for the ``portal_vapp`` returning webtest.response.TestResponse
which has a ``json`` object property rather than a function.


7.10.0
======

96 changes: 95 additions & 1 deletion dcicutils/misc_utils.py
@@ -9,6 +9,7 @@
import inspect
import math
import io
import json
import os
import logging
import pytz
@@ -191,7 +192,11 @@ class _VirtualAppHelper(webtest.TestApp):
pass


class VirtualApp:
class AbstractVirtualApp:
pass


class VirtualApp(AbstractVirtualApp):
"""
Wrapper class for TestApp, to allow custom control over submitting Encoded requests,
simulating a number of conditions, including permissions.
@@ -1352,6 +1357,25 @@ def capitalize1(s):
return s[:1].upper() + s[1:]


"""
Python's UUID ignores all dashes, whereas Postgres is more strict
http://www.postgresql.org/docs/9.2/static/datatype-uuid.html
See also http://www.postgresql.org/docs/9.2/static/datatype-uuid.html
And, anyway, this pattern is what our portals have been doing
for quite a while, so it's the most stable choice for us now.
"""

Review thread attached to the docstring above:

dmichaels-harvard (Contributor) commented Aug 24, 2023:

Maybe I'm just misreading, but somehow this comment confused me - I thought you were saying you want more strict checking (i.e. only confirm as uuid if it has the dashes) but you want less strict (dashes optional). And interestingly, I did not realize curly braces are sometimes OK around UUIDs; weird that they're allowed to not match, though, but that looks like what we actually accept in portal URLs anyway.

But I think this is not quite right; as it is, for example, it says that this is a valid UUID: 08AF90EB-C847-43A7-8B3E-2E64EBAC4683-xyzzy

I think you want the regex to begin/end with ^$, i.e.: r'^(?i)[{]?(?:[0-9a-f]{4}-?){8}[}]?$'

Oh, interesting, I see from tests, and from trying with portal URLs, that some trailing stuff is tolerated - so maybe this was intentional - hmm - (I'd like to track down our code that actually parses this).

netsettler (Collaborator, Author) replied:

This is used already in schema validation just as it is. I'm going to leave this precisely how it is because it is what the portals are defined to do, and the whole point was to borrow the definition they were using for other uses and not to have multiple definitions. If someone reports a bug in what the portals do and we decide to fix it, my code will tolerate the fix since it's operating on the same data.

netsettler (Collaborator, Author) replied:

I had already updated the comment when I saw this. I agree it looked confused. Thanks for confirming that. :)

uuid_re = re.compile(r'(?i)[{]?(?:[0-9a-f]{4}-?){8}[}]?')
netsettler marked this conversation as resolved.


def is_uuid(instance):
"""
Predicate returns true for any group of 32 hex characters with optional hyphens every four characters.
We insist on lowercase to make matching faster. See other notes on this design choice above.
"""
return bool(uuid_re.match(instance))
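The anchoring behavior discussed in the review thread above is easy to check directly. This is a minimal self-contained sketch reproducing the same pattern (so it runs without dcicutils installed):

```python
import re

# Same pattern as uuid_re above: optional braces, then eight groups of
# four hex digits, each group optionally followed by a hyphen.
uuid_re = re.compile(r'(?i)[{]?(?:[0-9a-f]{4}-?){8}[}]?')

def is_uuid(instance):
    # re.match anchors only at the start of the string, so trailing
    # characters after a valid UUID are tolerated, as noted in the review.
    return bool(uuid_re.match(instance))

print(is_uuid('08AF90EB-C847-43A7-8B3E-2E64EBAC4683'))        # True
print(is_uuid('08af90ebc84743a78b3e2e64ebac4683'))            # True (no hyphens)
print(is_uuid('{08AF90EB-C847-43A7-8B3E-2E64EBAC4683}'))      # True (braces)
print(is_uuid('08AF90EB-C847-43A7-8B3E-2E64EBAC4683-xyzzy'))  # True (trailing text tolerated)
print(is_uuid('not-a-uuid'))                                  # False
```

Switching to `re.fullmatch` (or adding `^...$` anchors) would reject the trailing-text case; as the thread explains, the looser behavior was kept deliberately to match what the portals already do.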


def string_list(s):
"""
Turns a comma-separated list into an actual list, trimming whitespace and ignoring nulls.
@@ -2313,3 +2337,73 @@ def parse_in_radix(text: str, *, radix: int):
except Exception:
pass
raise ValueError(f"Unable to parse: {text!r}")


def pad_to(target_size: int, data: list, *, padding=None):
"""
Pads a list out to a given target size using the given padding value; lists already at or above the target size are returned unchanged.
e.g., pad_to(3, [1, 2]) will return [1, 2, None]
"""
actual_size = len(data)
if actual_size < target_size:
data = data + [padding] * (target_size - actual_size)
return data
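A quick illustration of the behavior, using a self-contained copy of the function above so it can run without dcicutils:

```python
def pad_to(target_size, data, *, padding=None):
    # Pad data out to target_size with the given padding value.
    # Lists already at or above target_size are returned unchanged
    # (pad_to never truncates).
    actual_size = len(data)
    if actual_size < target_size:
        data = data + [padding] * (target_size - actual_size)
    return data

print(pad_to(3, [1, 2]))             # [1, 2, None]
print(pad_to(3, [1, 2, 3, 4]))       # [1, 2, 3, 4]
print(pad_to(4, ['a'], padding=''))  # ['a', '', '', '']
```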


class JsonLinesReader:

def __init__(self, fp, padded=False, padding=None):
"""
Given an fp (the conventional name for a "file pointer", the thing a call to io.open returns),
this creates an object that can be used to iterate across the lines in the JSON lines file
that the fp is reading from.

There are two possible input formats.

For files that contain a series of dictionaries, such as:
{"something": 1, "else": "a"}
{"something": 2, "else": "b"}
...etc
this will just yield those dictionaries one-by-one when iterated over.

The same set of dictionaries will also be yielded by a file containing:
["something", "else"]
[1, "a"]
[2, "b"]
...etc
where the first line is treated as a list of headers and each subsequent list
is zipped with those headers to form a dictionary.

NOTES:

* In the second case, shorter lists on subsequent lines yield only partial dictionaries,
unless padded=True, in which case they are padded to full length with the given padding value.
* In the second case, longer lists on subsequent lines will quietly drop any extra elements.
"""

self.fp = fp
netsettler marked this conversation as resolved.
self.padded: bool = padded
self.padding = padding
self.headers = None # Might change after we see first line

def __iter__(self):
first_line = True
n_headers = 0
for raw_line in self.fp:
line = json.loads(raw_line)
if first_line:
first_line = False
if isinstance(line, list):
self.headers = line
n_headers = len(line)
continue
# If the line is longer than the headers, zip will quietly drop the extras,
# letting the user put comments beyond the table. If it is shorter and
# padded=True, extend it to full length with the padding value.
if self.headers:
if not isinstance(line, list):
raise Exception("If the first line is a list, all lines must be.")
if self.padded and len(line) < n_headers:
line = pad_to(n_headers, line, padding=self.padding)
yield dict(zip(self.headers, line))
elif isinstance(line, dict):
yield line
else:
raise Exception(f"If the first line is not a list, all lines must be dictionaries: {line!r}")
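The two accepted formats can be exercised with in-memory streams. This is a minimal self-contained sketch of the same iteration logic (a generator function rather than the class, and omitting the mixed-format error checks, so it runs without dcicutils):

```python
import io
import json

def read_json_lines(fp, padded=False, padding=None):
    # Sketch of JsonLinesReader's iteration: if the first line is a list,
    # it is taken as a header row and later lists are zipped with it;
    # otherwise each line is expected to be a dictionary and yielded as-is.
    headers = None
    first_line = True
    for raw_line in fp:
        line = json.loads(raw_line)
        if first_line:
            first_line = False
            if isinstance(line, list):
                headers = line  # header row of the list-per-line format
                continue
        if headers is not None:
            if padded and len(line) < len(headers):
                line = line + [padding] * (len(headers) - len(line))
            yield dict(zip(headers, line))  # zip drops any extra elements
        else:
            yield line  # dict-per-line format

dict_style = io.StringIO('{"something": 1, "else": "a"}\n{"something": 2, "else": "b"}\n')
list_style = io.StringIO('["something", "else"]\n[1, "a"]\n[2]\n')

print(list(read_json_lines(dict_style)))
# [{'something': 1, 'else': 'a'}, {'something': 2, 'else': 'b'}]
print(list(read_json_lines(list_style, padded=True)))
# [{'something': 1, 'else': 'a'}, {'something': 2, 'else': None}]
```

Note that the real class also raises an exception when a file mixes the two formats, which this sketch does not attempt to reproduce.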