Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some tools for working with spreadsheets (C4-1088) #275

Merged
merged 13 commits into from
Oct 31, 2023
Merged

Conversation

netsettler
Copy link
Collaborator

This is a preliminary cut at some tools for parsing excel workbooks.

@coveralls
Copy link

coveralls commented Aug 14, 2023

Pull Request Test Coverage Report for Build 5958607573

  • 156 of 160 (97.5%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.3%) to 78.039%

Changes Missing Coverage Covered Lines Changed/Added Lines %
dcicutils/sheet_utils.py 156 160 97.5%
Totals Coverage Status
Change from base Build 5943632943: 0.3%
Covered Lines: 8390
Relevant Lines: 10751

💛 - Coveralls

Copy link
Member

@willronchetti willronchetti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very minor stylistic questions, maybe add a few docstings here and there, but nothing blocking on my end.

I would make sure that Doug has a chance to compare this to his spec before merging just so we are aligned on that front.

Comment on lines 36 to 46
@classmethod
def _all_rows(cls, sheet: Worksheet):
row_max = sheet.max_row
for row in range(2, row_max + 1):
yield row

@classmethod
def _all_cols(cls, sheet: Worksheet):
col_max = sheet.max_column
for col in range(1, col_max + 1):
yield col
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to belabor the point as we've discussed this before, but I don't really understand the use of @classmethod here when there are no references to cls. I looked this up and as it turns out most styling standards do not really mind that you do this, so I guess it's ok, but generally I find it confusing as when I see a @classmethod I am implicitly assuming there is a dependency for that method elsewhere in the class ie: it cannot be ported as a standalone function outside of the class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like @staticmethod because it's not a concept in many languages. A compiler can figure out what's a static method and there should be no need to declare it. If I want to call other class methods, I like having cls already available as a variable so I don't have to distract myself by adding data flow to receive the class name. If something is truly promised to be a static method, I just make it a function. There is no need to go through the syntax of calling it indirect through a class. It could be called by any class and it's weird to hide it somewhere that makes it syntactically hard for other things to get to it. If it's part of a class, then it should have access to the class methods.

Also, as a practical matter, Python fusses if something is an instance method but could be a static method but it does not fuss if it's a class method and could be a static method, probably because it wants to accommodate exactly the programming style that I use.

Comment on lines 66 to 70
def _load_row(self, *, sheet: Worksheet, row: int):
headers = self.sheet_headers(sheet.title)
row_dict: Dict[str, Any] = {headers[col-1]: sheet.cell(row=row, column=col).value
for col in self._all_cols(sheet)}
return row_dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit-pick: I'd move this definition above the usage in the prior method

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I want to rearrange some other stuff, too, so I'll do that when I do. I didn't want to move them out from under you once you were reviewing things. I guess now is a good time.

cls.set_path_value(datum[key], more_path, value)

@classmethod
def parse_value(cls, value: SheetCellValue) -> AnyJsonData:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would describe semantics of this processing in docstring

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm adding some doc strings in parallel with your review.

Copy link
Contributor

@drio18 drio18 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a couple style comments that may be worth discussing, as well as asking one important question as to the scope of what this module is designed to do with respect to the casting of values performed here.

Comment on lines 19 to 20
self.headers_by_sheetname: Dict[List[str]] = {}
self.content_by_sheetname: Dict[List[Any]] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For dictionary-like type annotations, consider annotating the key type as well as the value type.

Suggested change
self.headers_by_sheetname: Dict[List[str]] = {}
self.content_by_sheetname: Dict[List[Any]] = {}
self.headers_by_sheetname: Dict[str, List[str]] = {}
self.content_by_sheetname: Dict[str, List[Any]] = {}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that got fixed later. I don't even think the way I annotated it was legit. I was just asleep while typing.

But this is a really crappy way to notate dictionaries. Newer python (I think in 3.8) has better ways to declare this. Dictionaries are often not uniform in output type and this notation has trouble handling that.

def parse_value(cls, value: SheetCellValue) -> AnyJsonData:
if isinstance(value, str):
lvalue = value.lower()
# TODO: We could consult a schema to make this less heuristic, but this may do for now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the purpose of this module is parsing spreadsheets purely for generating items to validate against schema, I think the schema parsing is required here. Otherwise, we will likely be casting to one type here only to check the schema later and cast back, for example if the field is a string per the schema and alphanumeric (including only numeric) inputs are accepted.

If we want this module for parsing spreadsheets more generically, then no issues with this at all.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just talking to Will about that today. It's not just validating. I think of it (well, Excel, really, or some other spreadsheet editor like OpenOffice or LibreOffice) as a kind of user interface layer that accepts more flexible syntax and offers more interesting editing capabilities. Your original ask was to tolerate differences in case, etc. That can't be done without schemas, I agree. That part is coming downstream. Stay tuned. Just trying to keep a separation of concerns.

Comment on lines +139 to +142
self.filename: str = filename
self.workbook: Optional[Workbook] = None
self.headers_by_sheetname: Dict[str, List[str]] = {}
self.content_by_sheetname: Dict[str, List[Any]] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interested to hear your perspective on whether it's preferable to update these variables (as is done here) or just assign them more explicitly, possibly caching the call. I've been gravitating more and more towards using frozen dataclasses with explicit properties or functions for getting particular pieces of information, such as headers_by_sheetname here, and possibly caching the result if it's expensive or may be called multiple times. Given that the input here fixes everything that follows (assuming the filename and its contents are fixed when parsing), it seems like a natural fit here for me as well, with the benefit that one does not have to see where the variable is updated throughout the class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not expensive to get, though. It's all O(1). And these are really only set in one place. The only thing that's done a lot is accesses, and mainly this is just making those references a little prettier. Dataclasses look like a lot more computational and textual overhead than I need here, but I don't use them, so you're welcome to send me an example or I'd be happy to take time to discuss them.

Comment on lines +144 to +160
def sheet_headers(self, sheetname: str) -> List[str]:
return self.headers_by_sheetname[sheetname]

def sheet_content(self, sheetname: str) -> List[Any]:
return self.content_by_sheetname[sheetname]

@classmethod
def _all_rows(cls, sheet: Worksheet):
row_max = sheet.max_row
for row in range(2, row_max + 1):
yield row

@classmethod
def _all_cols(cls, sheet: Worksheet):
col_max = sheet.max_column
for col in range(1, col_max + 1):
yield col
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another stylistic thing you're welcome to ignore but for which I'm interested to hear your input given your background: I tend to utilize noun-only names for class properties and variables, whereas I would start these methods with verbs since deliver output dependent on input (e.g. _get_sheet_headers). I know we don't have a style guide, but sometimes I wish we did.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need to be verbs so that it's clear to someone that a new one is made on each call and that it's not accessing a cache. (Otherwise if two people looped at the same time, they'd reuse the same info. There were bugs in the early collection classes in python around this.) I mostly use the style rule you say, but sometimes the system is overconstrained and it matters more that it looks right in context, which is:

for x in foo._all_cols(...):

and you don't want to say

for x in foo._get_all_cols():

or _compute_all_cols() or whatever. But I also don't want:

for x in foo._all_cols:

Comment on lines +81 to +84
if key0 == n:
parent.append(placeholder)
elif key0 > n:
raise Exception("Numeric items must occur sequentially.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if key0 < n? If someone submits something along the lines of friend#-1.age as a header, we should either handle it appropriately or report some kind of error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, < n is not that big a deal. but < 0 would be a problem. I could add that. The reason it wants to worry about > n is to make someone not use a huge number, because that would allocate storage beyond what is reasonable. I'm less worried about negative storage allocation, but it's still reasonable to have an error check for nonsensical info. And it's possible someone might think -1 means "last", so that's worth heading off.

Comment on lines +83 to +84
elif key0 > n:
raise Exception("Numeric items must occur sequentially.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When parsing these spreadsheets, we generally won't want to raise exceptions outright that won't be caught quickly because we want to accumulate all errors across the spreadsheet before reporting them back to the user for fixing. This could certainly be caught downstream, but may be worth considering here from the get-go.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there might be some finessing to do on this. I had certainly been trying to minimize errors because indeed raising errors at all means you only get one bit of feedback instead of many and there's no obvious handler anyway, but some of these things I didn't work all the way through and figured I can come back to.

Comment on lines +191 to +192
self.patch_prototypes_by_sheetname: Dict[str, Dict] = {}
self.parsed_headers_by_sheetname: Dict[str, List[List[Union[int, str]]]] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the comment above, consider more explicitly building/returning these rather than updating them in separate class methods.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I agree with this. We could talk about what the advantage is. The primary reason I think that's a bad idea is it impedes or slows unit testing or possibly other operations where you don't need to construct these because they're not used, or where there might be unused sheets you're never going to touch. I think demand-creating them makes sense, and a dictionary is a good way to do that.

@netsettler netsettler changed the title Some tools for working with spreadsheets Some tools for working with spreadsheets (C4-1088) Aug 28, 2023
@netsettler netsettler merged commit 08d428e into master Oct 31, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants