Some tools for working with spreadsheets (C4-1088) #275

Merged (13 commits) on Oct 31, 2023
10 changes: 10 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,16 @@ Change Log
----------


7.10.0
======

* New module ``sheet_utils`` for loading workbooks.

* class ``WorkbookManager`` for loading raw data

* class ``ItemManager`` for loading item data


7.9.0
=====

6 changes: 6 additions & 0 deletions dcicutils/license_utils.py
@@ -810,6 +810,12 @@ class C4InfrastructureLicenseChecker(LicenseChecker):
'pytest-timeout', # MIT Licensed
],

# Linking = With Restrictions, Private Use = Yes
# Ref: https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses
'GNU Lesser General Public License v2 or later (LGPLv2+)': [
'chardet' # used at runtime during server operation (ingestion), but not modified or distributed
],

# Linking = With Restrictions, Private Use = Yes
# Ref: https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses
'GNU Lesser General Public License v3 or later (LGPLv3+)': [
218 changes: 218 additions & 0 deletions dcicutils/sheet_utils.py
@@ -0,0 +1,218 @@
import copy

from dcicutils.common import AnyJsonData
from openpyxl import load_workbook
from openpyxl.worksheet.worksheet import Worksheet
from openpyxl.workbook.workbook import Workbook
from typing import Any, Dict, List, Optional, Union


Header = str
Headers = List[str]
ParsedHeader = List[Union[str, int]]
ParsedHeaders = List[ParsedHeader]
SheetCellValue = Union[int, float, str]


class ItemTools:
"""
Implements operations on table-related data without pre-supposing the specific representation of the table.
It is assumed this can be used for data that was obtained from .json, .csv, .tsv, and .xlsx files because
it does not presuppose the source of the data nor where it will be written to.

For the purpose of this class:

* a 'header' is a string representing the top of a column.

* a 'parsed header' is a list of strings and/or ints, after splitting at uses of '#' or '.', so that
"a.b.c" is represented as ["a", "b", "c"], and "x.y#0" is represented as ["x", "y", 0], and representing
each numeric token as an int instead of a string.

* a 'headers' object is just a list of strings, each of which is a 'header'.

* a 'parsed headers' object is a non-empty list of lists, each of which is a 'parsed header'.
      e.g., the headers ["a.b.c", "x.y#0"] are represented as the parsed headers [["a", "b", "c"], ["x", "y", 0]].

"""

@classmethod
def parse_sheet_header(cls, header: Header) -> ParsedHeader:
result = []
token = ""
for i in range(len(header)):
ch = header[i]
if ch == '.' or ch == '#':
if token:
result.append(int(token) if token.isdigit() else token)
token = ""
else:
token += ch
if token:
result.append(int(token) if token.isdigit() else token)
return result

@classmethod
def parse_sheet_headers(cls, headers: Headers):
return [cls.parse_sheet_header(header)
for header in headers]

@classmethod
def compute_patch_prototype(cls, parsed_headers: ParsedHeaders):
prototype = {}
for parsed_header in parsed_headers:
parsed_header0 = parsed_header[0]
if isinstance(parsed_header0, int):
raise ValueError(f"A header cannot begin with a numeric ref: {parsed_header0}")
cls.assure_patch_prototype_shape(parent=prototype, keys=parsed_header)
return prototype

@classmethod
def assure_patch_prototype_shape(cls, *, parent: Union[Dict, List], keys: ParsedHeader):
[key0, *more_keys] = keys
key1 = more_keys[0] if more_keys else None
if isinstance(key1, int):
placeholder = []
elif isinstance(key1, str):
placeholder = {}
else:
placeholder = None
if isinstance(key0, int):
n = len(parent)
if key0 == n:
parent.append(placeholder)
elif key0 > n:
raise Exception("Numeric items must occur sequentially.")
Comment on lines +81 to +84

Contributor: What if key0 < n? If someone submits something along the lines of friend#-1.age as a header, we should either handle it appropriately or report some kind of error.

Collaborator (Author): Actually, < n is not that big a deal, but < 0 would be a problem. I could add that. The reason for worrying about > n is to keep someone from using a huge number, because that would allocate storage beyond what is reasonable. I'm less worried about negative storage allocation, but it's still reasonable to have an error check for nonsensical input. And it's possible someone might think -1 means "last", so that's worth heading off.
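To make the discussion concrete, here is a minimal sketch of the extra guard being proposed; the function name and messages are hypothetical, and this is not the code that was merged:

    from typing import List


    def check_numeric_ref(key0: int, parent: List) -> None:
        # Hypothetical guard for integer keys, mirroring how
        # assure_patch_prototype_shape treats numeric refs.
        n = len(parent)
        if key0 < 0:
            # Head off headers like "friend#-1.age"; -1 does not mean "last" here.
            raise ValueError(f"Numeric refs may not be negative: {key0}")
        if key0 > n:
            # Refusing to skip ahead keeps a huge index from allocating
            # an unreasonable amount of storage.
            raise ValueError(f"Numeric items must occur sequentially; got {key0} with only {n} present.")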

Comment on lines +83 to +84

Contributor: When parsing these spreadsheets, we generally won't want to raise exceptions outright that won't be caught quickly, because we want to accumulate all errors across the spreadsheet before reporting them back to the user for fixing. This could certainly be caught downstream, but it may be worth considering here from the get-go.

Collaborator (Author): Yeah, there might be some finessing to do on this. I had certainly been trying to minimize errors, because raising errors at all means you only get one bit of feedback instead of many and there's no obvious handler anyway, but some of these things I didn't work all the way through and figured I could come back to them.
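For illustration, one possible shape for the error accumulation described here; the class and method names are hypothetical and nothing like this exists in the PR:

    from typing import List


    class ValidationProblemCollector:
        """Hypothetical collector that gathers problems across a whole workbook
        before reporting them, rather than raising on the first one."""

        def __init__(self):
            self.problems: List[str] = []

        def note(self, sheetname: str, row: int, message: str) -> None:
            self.problems.append(f"{sheetname}, row {row}: {message}")

        def report(self) -> None:
            if self.problems:
                raise ValueError("Problems found:\n" + "\n".join(self.problems))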

elif isinstance(key0, str):
if key0 not in parent:
parent[key0] = placeholder
if key1 is not None:
cls.assure_patch_prototype_shape(parent=parent[key0], keys=more_keys)
return parent

@classmethod
def parse_value(cls, value: SheetCellValue) -> AnyJsonData:
Member: I would describe the semantics of this processing in a docstring.

Collaborator (Author): Yeah, I'm adding some docstrings in parallel with your review.

if isinstance(value, str):
lvalue = value.lower()
# TODO: We could consult a schema to make this less heuristic, but this may do for now
Contributor: If the purpose of this module is parsing spreadsheets purely for generating items to validate against a schema, I think the schema parsing is required here. Otherwise, we will likely be casting to one type here only to check the schema later and cast back, for example if the field is a string per the schema and alphanumeric (including purely numeric) inputs are accepted. If we want this module for parsing spreadsheets more generically, then no issues with this at all.

Collaborator (Author): I was just talking to Will about that today. It's not just validating. I think of it (well, Excel, really, or some other spreadsheet editor like OpenOffice or LibreOffice) as a kind of user interface layer that accepts more flexible syntax and offers more interesting editing capabilities. Your original ask was to tolerate differences in case, etc. That can't be done without schemas, I agree. That part is coming downstream. Stay tuned. Just trying to keep a separation of concerns.
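For reference, a few conversions the parse_value heuristic in this diff performs (a usage sketch based on the code as merged here; behavior could change in later revisions):

    from dcicutils.sheet_utils import ItemTools

    assert ItemTools.parse_value("TRUE") is True        # case-insensitive booleans
    assert ItemTools.parse_value("null") is None        # 'null' and '' become None
    assert ItemTools.parse_value("17") == 17            # digit-like strings become ints
    assert ItemTools.parse_value("3.5") == 3.5          # then floats are tried
    assert ItemTools.parse_value("alpha|3|true") == ["alpha", 3, True]  # '|' splits into a list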

if lvalue == 'true':
return True
elif lvalue == 'false':
return False
elif lvalue == 'null' or lvalue == '':
return None
elif '|' in value:
return [cls.parse_value(subvalue) for subvalue in value.split('|')]
else:
ch0 = value[0]
if ch0 == '+' or ch0 == '-' or ch0.isdigit():
try:
return int(value)
except Exception:
pass
try:
return float(value)
except Exception:
pass
return value
else: # presumably a number (int or float)
return value

@classmethod
def set_path_value(cls, datum: Union[List, Dict], path: ParsedHeader, value: Any, force: bool = False):
if (value is None or value == '') and not force:
return
[key, *more_path] = path
if not more_path:
datum[key] = value
else:
cls.set_path_value(datum[key], more_path, value)


class WorkbookManager:

@classmethod
def load_workbook(cls, filename: str):
wb = cls(filename)
return wb.load_content()

def __init__(self, filename: str):
self.filename: str = filename
self.workbook: Optional[Workbook] = None
self.headers_by_sheetname: Dict[str, List[str]] = {}
self.content_by_sheetname: Dict[str, List[Any]] = {}
Comment on lines +139 to +142

Contributor: Interested to hear your perspective on whether it's preferable to update these variables (as is done here) or just assign them more explicitly, possibly caching the call. I've been gravitating more and more towards using frozen dataclasses with explicit properties or functions for getting particular pieces of information, such as headers_by_sheetname here, and possibly caching the result if it's expensive or may be called multiple times. Given that the input here fixes everything that follows (assuming the filename and its contents are fixed when parsing), it seems like a natural fit here as well, with the benefit that one does not have to track where the variable is updated throughout the class.

Collaborator (Author): It's not expensive to get, though. It's all O(1). And these are really only set in one place. The only thing that's done a lot is accesses, and mainly this is just making those references a little prettier. Dataclasses look like a lot more computational and textual overhead than I need here, but I don't use them, so you're welcome to send me an example or I'd be happy to take time to discuss them.
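For comparison, a rough sketch of the frozen-dataclass approach the reviewer describes; this is hypothetical (and cached_property requires Python 3.8+), not what was merged:

    from dataclasses import dataclass
    from functools import cached_property
    from typing import Dict, List

    from openpyxl import load_workbook
    from openpyxl.workbook.workbook import Workbook


    @dataclass(frozen=True)
    class FrozenWorkbookReader:
        filename: str

        @cached_property
        def workbook(self) -> Workbook:
            return load_workbook(self.filename)

        @cached_property
        def headers_by_sheetname(self) -> Dict[str, List[str]]:
            # Computed once on first access and cached on the instance.
            return {sheet.title: [str(sheet.cell(row=1, column=col).value)
                                  for col in range(1, sheet.max_column + 1)]
                    for sheet in self.workbook.worksheets}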


def sheet_headers(self, sheetname: str) -> List[str]:
return self.headers_by_sheetname[sheetname]

def sheet_content(self, sheetname: str) -> List[Any]:
return self.content_by_sheetname[sheetname]

@classmethod
def _all_rows(cls, sheet: Worksheet):
row_max = sheet.max_row
for row in range(2, row_max + 1):
yield row

@classmethod
def _all_cols(cls, sheet: Worksheet):
col_max = sheet.max_column
for col in range(1, col_max + 1):
yield col
Comment on lines +144 to +160

Contributor: Another stylistic thing you're welcome to ignore, but for which I'm interested to hear your input given your background: I tend to use noun-only names for class properties and variables, whereas I would start these methods with verbs since they deliver output dependent on input (e.g. _get_sheet_headers). I know we don't have a style guide, but sometimes I wish we did.

Collaborator (Author): These need to be verbs so that it's clear that a new one is made on each call and that it's not accessing a cache. (Otherwise, if two people looped at the same time, they'd reuse the same info. There were bugs in the early collection classes in Python around this.) I mostly use the style rule you describe, but sometimes the system is overconstrained and it matters more that it looks right in context, which is:

    for x in foo._all_cols(...):

and you don't want to say

    for x in foo._get_all_cols():

or _compute_all_cols() or whatever. But I also don't want:

    for x in foo._all_cols:

def _load_headers(self, sheet: Worksheet):
headers: List[str] = [str(sheet.cell(row=1, column=col).value)
for col in self._all_cols(sheet)]
self.headers_by_sheetname[sheet.title] = headers

def _load_row(self, *, sheet: Worksheet, row: int):
headers = self.sheet_headers(sheet.title)
row_dict: Dict[str, Any] = {headers[col-1]: sheet.cell(row=row, column=col).value
for col in self._all_cols(sheet)}
return row_dict

def load_content(self):
workbook: Workbook = load_workbook(self.filename)
self.workbook = workbook
for sheetname in workbook.sheetnames:
sheet: Worksheet = workbook[sheetname]
self._load_headers(sheet)
content = []
for row in self._all_rows(sheet):
row_dict = self._load_row(sheet=sheet, row=row)
content.append(row_dict)
self.content_by_sheetname[sheetname] = content
return self.content_by_sheetname


class ItemManager(ItemTools, WorkbookManager):

def __init__(self, filename: str):
super().__init__(filename=filename)
self.patch_prototypes_by_sheetname: Dict[str, Dict] = {}
self.parsed_headers_by_sheetname: Dict[str, List[List[Union[int, str]]]] = {}
Comment on lines +191 to +192

Contributor: Similar to the comment above, consider more explicitly building/returning these rather than updating them in separate class methods.

Collaborator (Author): Not sure I agree with this. We could talk about what the advantage is. The primary reason I think that's a bad idea is that it impedes or slows unit testing, or possibly other operations where you don't need to construct these because they're not used, or where there might be unused sheets you're never going to touch. I think demand-creating them makes sense, and a dictionary is a good way to do that.


def sheet_patch_prototype(self, sheetname: str) -> Dict:
return self.patch_prototypes_by_sheetname[sheetname]

def sheet_parsed_headers(self, sheetname: str) -> List[List[Union[int, str]]]:
return self.parsed_headers_by_sheetname[sheetname]

def _load_headers(self, sheet: Worksheet):
super()._load_headers(sheet)
self._compile_sheet_headers(sheet.title)

def _compile_sheet_headers(self, sheetname: str):
headers = self.headers_by_sheetname[sheetname]
parsed_headers = self.parse_sheet_headers(headers)
self.parsed_headers_by_sheetname[sheetname] = parsed_headers
prototype = self.compute_patch_prototype(parsed_headers)
self.patch_prototypes_by_sheetname[sheetname] = prototype

def _load_row(self, *, sheet: Worksheet, row: int):
parsed_headers = self.sheet_parsed_headers(sheet.title)
patch_item = copy.deepcopy(self.sheet_patch_prototype(sheet.title))
for col in self._all_cols(sheet):
value = sheet.cell(row=row, column=col).value
parsed_value = self.parse_value(value)
self.set_path_value(patch_item, parsed_headers[col - 1], parsed_value)
return patch_item
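Taken together, a usage sketch of the two entry points added in this PR (the spreadsheet path refers to the test fixture added below; the printed structure depends on that file's contents):

    from dcicutils.sheet_utils import ItemManager, WorkbookManager

    # Raw values: {sheetname: [row_dict, ...]}, keyed by the literal header strings.
    raw = WorkbookManager.load_workbook("test/data_files/sample_items.xlsx")

    # Item-shaped values: headers such as "mother.name" and "friends#0.age" are
    # expanded into nested dicts/lists, and cell values are parsed heuristically.
    items = ItemManager.load_workbook("test/data_files/sample_items.xlsx")

    for sheetname, rows in items.items():
        print(sheetname, rows)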
7 changes: 7 additions & 0 deletions docs/source/dcicutils.rst
@@ -281,6 +281,13 @@ secrets_utils
:members:


sheet_utils
^^^^^^^^^^^

.. automodule:: dcicutils.sheet_utils
:members:


snapshot_utils
^^^^^^^^^^^^^^

29 changes: 28 additions & 1 deletion poetry.lock

Some generated files are not rendered by default.

14 changes: 8 additions & 6 deletions pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "dcicutils"
version = "7.9.0"
version = "7.9.0.1b2" # to become "7.10.0"
description = "Utility package for interacting with the 4DN Data Portal and other 4DN resources"
authors = ["4DN-DCIC Team <[email protected]>"]
license = "MIT"
@@ -37,6 +37,7 @@ classifiers = [

[tool.poetry.dependencies]
python = ">=3.7,<3.10"

boto3 = "^1.17.39"
botocore = "^1.20.39"
# The DCIC portals (cgap-portal and fourfront) are very particular about which ElasticSearch version.
@@ -45,20 +46,21 @@ elasticsearch = "7.13.4"
aws-requests-auth = ">=0.4.2,<1"
docker = "^4.4.4"
gitpython = "^3.1.2"
openpyxl = "^3.1.2"
opensearch-py = "^2.0.1"
pyOpenSSL = "^23.1.1"
PyJWT = "^2.6.0"
pytz = ">=2020.4"
PyYAML = ">=5.1,<5.5"
redis = "^4.5.1"
requests = "^2.21.0"
rfc3986 = "^1.4.0"
structlog = "^19.2.0"
toml = ">=0.10.1,<1"
tqdm = "^4.65.0"
typing-extensions = ">=3.8" # Fourfront uses 3.8
urllib3 = "^1.26.6"
webtest = "^2.0.34"
opensearch-py = "^2.0.1"
redis = "^4.5.1"
pyOpenSSL = "^23.1.1"
PyJWT = "^2.6.0"
tqdm = "^4.65.0"


[tool.poetry.dev-dependencies]
Binary file added test/data_files/sample_items.xlsx
Binary file not shown.
3 changes: 3 additions & 0 deletions test/data_files/sample_items_sheet2.csv
@@ -0,0 +1,3 @@
name,age,mother.name,mother.age,father.name,father.age,friends#0.name,friends#0.age,friends#1.name,friends#1.age
bill,23,mary,58,fred,63,sam,22,arthur,19
joe,9,estrella,35,anthony,34,anders,9,,
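For orientation, the first data row above would expand under this header scheme roughly as follows (an illustration, not generated output):

    first_row = {
        "name": "bill",
        "age": 23,
        "mother": {"name": "mary", "age": 58},
        "father": {"name": "fred", "age": 63},
        "friends": [
            {"name": "sam", "age": 22},
            {"name": "arthur", "age": 19},
        ],
    }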