-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some tools for working with spreadsheets (C4-1088) #275
Changes from all commits
4c84f0b
7b73a67
3d4573f
f4e5cfa
e9d2465
6e9060f
df12c91
6a39c8a
eedb5c6
a6b68fe
682c95a
3a103ee
56f702a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,218 @@ | ||
import copy | ||
|
||
from dcicutils.common import AnyJsonData | ||
from openpyxl import load_workbook | ||
from openpyxl.worksheet.worksheet import Worksheet | ||
from openpyxl.workbook.workbook import Workbook | ||
from typing import Any, Dict, List, Optional, Union | ||
|
||
|
||
Header = str | ||
Headers = List[str] | ||
ParsedHeader = List[Union[str, int]] | ||
ParsedHeaders = List[ParsedHeader] | ||
SheetCellValue = Union[int, float, str] | ||
|
||
|
||
class ItemTools: | ||
""" | ||
Implements operations on table-related data without pre-supposing the specific representation of the table. | ||
It is assumed this can be used for data that was obtained from .json, .csv, .tsv, and .xlsx files because | ||
it does not presuppose the source of the data nor where it will be written to. | ||
|
||
For the purpose of this class: | ||
|
||
* a 'header' is a string representing the top of a column. | ||
|
||
* a 'parsed header' is a list of strings and/or ints, after splitting at uses of '#' or '.', so that | ||
"a.b.c" is represented as ["a", "b", "c"], and "x.y#0" is represented as ["x", "y", 0], and representing | ||
each numeric token as an int instead of a string. | ||
|
||
* a 'headers' object is just a list of strings, each of which is a 'header'. | ||
|
||
* a 'parsed headers' object is a non-empty list of lists, each of which is a 'parsed header'. | ||
e..g., the headers ["a.b.c", "x.y#0"] is represented as parsed hearders [["a", "b", "c"], ["x", "y", 0]]. | ||
|
||
""" | ||
|
||
@classmethod | ||
def parse_sheet_header(cls, header: Header) -> ParsedHeader: | ||
result = [] | ||
token = "" | ||
for i in range(len(header)): | ||
ch = header[i] | ||
if ch == '.' or ch == '#': | ||
if token: | ||
result.append(int(token) if token.isdigit() else token) | ||
token = "" | ||
else: | ||
token += ch | ||
if token: | ||
result.append(int(token) if token.isdigit() else token) | ||
return result | ||
|
||
@classmethod | ||
def parse_sheet_headers(cls, headers: Headers): | ||
return [cls.parse_sheet_header(header) | ||
for header in headers] | ||
|
||
@classmethod | ||
def compute_patch_prototype(cls, parsed_headers: ParsedHeaders): | ||
prototype = {} | ||
for parsed_header in parsed_headers: | ||
parsed_header0 = parsed_header[0] | ||
if isinstance(parsed_header0, int): | ||
raise ValueError(f"A header cannot begin with a numeric ref: {parsed_header0}") | ||
cls.assure_patch_prototype_shape(parent=prototype, keys=parsed_header) | ||
return prototype | ||
|
||
@classmethod | ||
def assure_patch_prototype_shape(cls, *, parent: Union[Dict, List], keys: ParsedHeader): | ||
[key0, *more_keys] = keys | ||
key1 = more_keys[0] if more_keys else None | ||
if isinstance(key1, int): | ||
placeholder = [] | ||
elif isinstance(key1, str): | ||
placeholder = {} | ||
else: | ||
placeholder = None | ||
if isinstance(key0, int): | ||
n = len(parent) | ||
if key0 == n: | ||
parent.append(placeholder) | ||
elif key0 > n: | ||
raise Exception("Numeric items must occur sequentially.") | ||
Comment on lines
+83
to
+84
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When parsing these spreadsheets, we generally won't want to raise exceptions outright that won't be caught quickly because we want to accumulate all errors across the spreadsheet before reporting them back to the user for fixing. This could certainly be caught downstream, but may be worth considering here from the get-go. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, there might be some finessing to do on this. I had certainly been trying to minimize errors because indeed raising errors at all means you only get one bit of feedback instead of many and there's no obvious handler anyway, but some of these things I didn't work all the way through and figured I can come back to. |
||
elif isinstance(key0, str): | ||
if key0 not in parent: | ||
parent[key0] = placeholder | ||
if key1 is not None: | ||
cls.assure_patch_prototype_shape(parent=parent[key0], keys=more_keys) | ||
return parent | ||
|
||
@classmethod | ||
def parse_value(cls, value: SheetCellValue) -> AnyJsonData: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would describe semantics of this processing in docstring There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I'm adding some doc strings in parallel with your review. |
||
if isinstance(value, str): | ||
lvalue = value.lower() | ||
# TODO: We could consult a schema to make this less heuristic, but this may do for now | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If the purpose of this module is parsing spreadsheets purely for generating items to validate against schema, I think the schema parsing is required here. Otherwise, we will likely be casting to one type here only to check the schema later and cast back, for example if the field is a string per the schema and alphanumeric (including only numeric) inputs are accepted. If we want this module for parsing spreadsheets more generically, then no issues with this at all. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was just talking to Will about that today. It's not just validating. I think of it (well, Excel, really, or some other spreadsheet editor like OpenOffice or LibreOffice) as a kind of user interface layer that accepts more flexible syntax and offers more interesting editing capabilities. Your original ask was to tolerate differences in case, etc. That can't be done without schemas, I agree. That part is coming downstream. Stay tuned. Just trying to keep a separation of concerns. |
||
if lvalue == 'true': | ||
return True | ||
elif lvalue == 'false': | ||
return False | ||
elif lvalue == 'null' or lvalue == '': | ||
return None | ||
elif '|' in value: | ||
return [cls.parse_value(subvalue) for subvalue in value.split('|')] | ||
else: | ||
ch0 = value[0] | ||
if ch0 == '+' or ch0 == '-' or ch0.isdigit(): | ||
try: | ||
return int(value) | ||
except Exception: | ||
pass | ||
try: | ||
return float(value) | ||
except Exception: | ||
pass | ||
return value | ||
else: # presumably a number (int or float) | ||
return value | ||
|
||
@classmethod | ||
def set_path_value(cls, datum: Union[List, Dict], path: ParsedHeader, value: Any, force: bool = False): | ||
if (value is None or value == '') and not force: | ||
return | ||
[key, *more_path] = path | ||
if not more_path: | ||
datum[key] = value | ||
else: | ||
cls.set_path_value(datum[key], more_path, value) | ||
|
||
|
||
class WorkbookManager: | ||
|
||
@classmethod | ||
def load_workbook(cls, filename: str): | ||
wb = cls(filename) | ||
return wb.load_content() | ||
|
||
def __init__(self, filename: str): | ||
self.filename: str = filename | ||
self.workbook: Optional[Workbook] = None | ||
self.headers_by_sheetname: Dict[str, List[str]] = {} | ||
self.content_by_sheetname: Dict[str, List[Any]] = {} | ||
Comment on lines
+139
to
+142
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Interested to hear your perspective on whether it's preferable to update these variables (as is done here) or just assign them more explicitly, possibly caching the call. I've been gravitating more and more towards using frozen dataclasses with explicit properties or functions for getting particular pieces of information, such as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not expensive to get, though. It's all O(1). And these are really only set in one place. The only thing that's done a lot is accesses, and mainly this is just making those references a little prettier. Dataclasses look like a lot more computational and textual overhead than I need here, but I don't use them, so you're welcome to send me an example or I'd be happy to take time to discuss them. |
||
|
||
def sheet_headers(self, sheetname: str) -> List[str]: | ||
return self.headers_by_sheetname[sheetname] | ||
|
||
def sheet_content(self, sheetname: str) -> List[Any]: | ||
return self.content_by_sheetname[sheetname] | ||
|
||
@classmethod | ||
def _all_rows(cls, sheet: Worksheet): | ||
row_max = sheet.max_row | ||
for row in range(2, row_max + 1): | ||
yield row | ||
|
||
@classmethod | ||
def _all_cols(cls, sheet: Worksheet): | ||
col_max = sheet.max_column | ||
for col in range(1, col_max + 1): | ||
yield col | ||
Comment on lines
+144
to
+160
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another stylistic thing you're welcome to ignore but for which I'm interested to hear your input given your background: I tend to utilize noun-only names for class properties and variables, whereas I would start these methods with verbs since deliver output dependent on input (e.g. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These need to be verbs so that it's clear to someone that a new one is made on each call and that it's not accessing a cache. (Otherwise if two people looped at the same time, they'd reuse the same info. There were bugs in the early collection classes in python around this.) I mostly use the style rule you say, but sometimes the system is overconstrained and it matters more that it looks right in context, which is:
and you don't want to say
or
|
||
|
||
def _load_headers(self, sheet: Worksheet): | ||
headers: List[str] = [str(sheet.cell(row=1, column=col).value) | ||
for col in self._all_cols(sheet)] | ||
self.headers_by_sheetname[sheet.title] = headers | ||
|
||
def _load_row(self, *, sheet: Worksheet, row: int): | ||
headers = self.sheet_headers(sheet.title) | ||
row_dict: Dict[str, Any] = {headers[col-1]: sheet.cell(row=row, column=col).value | ||
for col in self._all_cols(sheet)} | ||
return row_dict | ||
|
||
def load_content(self): | ||
workbook: Workbook = load_workbook(self.filename) | ||
self.workbook = workbook | ||
for sheetname in workbook.sheetnames: | ||
sheet: Worksheet = workbook[sheetname] | ||
self._load_headers(sheet) | ||
content = [] | ||
for row in self._all_rows(sheet): | ||
row_dict = self._load_row(sheet=sheet, row=row) | ||
content.append(row_dict) | ||
self.content_by_sheetname[sheetname] = content | ||
return self.content_by_sheetname | ||
|
||
|
||
class ItemManager(ItemTools, WorkbookManager): | ||
|
||
def __init__(self, filename: str): | ||
super().__init__(filename=filename) | ||
self.patch_prototypes_by_sheetname: Dict[str, Dict] = {} | ||
self.parsed_headers_by_sheetname: Dict[str, List[List[Union[int, str]]]] = {} | ||
Comment on lines
+191
to
+192
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Similar to the comment above, consider more explicitly building/returning these rather than updating them in separate class methods. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure I agree with this. We could talk about what the advantage is. The primary reason I think that's a bad idea is it impedes or slows unit testing or possibly other operations where you don't need to construct these because they're not used, or where there might be unused sheets you're never going to touch. I think demand-creating them makes sense, and a dictionary is a good way to do that. |
||
|
||
def sheet_patch_prototype(self, sheetname: str) -> Dict: | ||
return self.patch_prototypes_by_sheetname[sheetname] | ||
|
||
def sheet_parsed_headers(self, sheetname: str) -> List[List[Union[int, str]]]: | ||
return self.parsed_headers_by_sheetname[sheetname] | ||
|
||
def _load_headers(self, sheet: Worksheet): | ||
super()._load_headers(sheet) | ||
self._compile_sheet_headers(sheet.title) | ||
|
||
def _compile_sheet_headers(self, sheetname: str): | ||
headers = self.headers_by_sheetname[sheetname] | ||
parsed_headers = self.parse_sheet_headers(headers) | ||
self.parsed_headers_by_sheetname[sheetname] = parsed_headers | ||
prototype = self.compute_patch_prototype(parsed_headers) | ||
self.patch_prototypes_by_sheetname[sheetname] = prototype | ||
|
||
def _load_row(self, *, sheet: Worksheet, row: int): | ||
parsed_headers = self.sheet_parsed_headers(sheet.title) | ||
patch_item = copy.deepcopy(self.sheet_patch_prototype(sheet.title)) | ||
for col in self._all_cols(sheet): | ||
value = sheet.cell(row=row, column=col).value | ||
parsed_value = self.parse_value(value) | ||
self.set_path_value(patch_item, parsed_headers[col - 1], parsed_value) | ||
return patch_item |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
[tool.poetry] | ||
name = "dcicutils" | ||
version = "7.9.0" | ||
version = "7.9.0.1b2" # to become "7.10.0" | ||
description = "Utility package for interacting with the 4DN Data Portal and other 4DN resources" | ||
authors = ["4DN-DCIC Team <[email protected]>"] | ||
license = "MIT" | ||
|
@@ -37,6 +37,7 @@ classifiers = [ | |
|
||
[tool.poetry.dependencies] | ||
python = ">=3.7,<3.10" | ||
|
||
boto3 = "^1.17.39" | ||
botocore = "^1.20.39" | ||
# The DCIC portals (cgap-portal and fourfront) are very particular about which ElasticSearch version. | ||
|
@@ -45,20 +46,21 @@ elasticsearch = "7.13.4" | |
aws-requests-auth = ">=0.4.2,<1" | ||
docker = "^4.4.4" | ||
gitpython = "^3.1.2" | ||
openpyxl = "^3.1.2" | ||
opensearch-py = "^2.0.1" | ||
pyOpenSSL = "^23.1.1" | ||
PyJWT = "^2.6.0" | ||
pytz = ">=2020.4" | ||
PyYAML = ">=5.1,<5.5" | ||
redis = "^4.5.1" | ||
requests = "^2.21.0" | ||
rfc3986 = "^1.4.0" | ||
structlog = "^19.2.0" | ||
toml = ">=0.10.1,<1" | ||
tqdm = "^4.65.0" | ||
typing-extensions = ">=3.8" # Fourfront uses 3.8 | ||
urllib3 = "^1.26.6" | ||
webtest = "^2.0.34" | ||
opensearch-py = "^2.0.1" | ||
redis = "^4.5.1" | ||
pyOpenSSL = "^23.1.1" | ||
PyJWT = "^2.6.0" | ||
tqdm = "^4.65.0" | ||
|
||
|
||
[tool.poetry.dev-dependencies] | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
name,age,mother.name,mother.age,father.name,father.age,friends#0.name,friends#0.age,friends#1.name,friends#1.age | ||
bill,23,mary,58,fred,63,sam,22,arthur,19 | ||
joe,9,estrella,35,anthony,34,anders,9,, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if
key0 < n
? If someone submits something along the lines offriend#-1.age
as a header, we should either handle it appropriately or report some kind of error.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually,
< n
is not that big a deal. but< 0
would be a problem. I could add that. The reason it wants to worry about> n
is to make someone not use a huge number, because that would allocate storage beyond what is reasonable. I'm less worried about negative storage allocation, but it's still reasonable to have an error check for nonsensical info. And it's possible someone might think-1
means "last", so that's worth heading off.