Some tools for working with spreadsheets (C4-1088) #275

netsettler · 2023-08-14T12:42:10Z

This is a preliminary cut at some tools for parsing excel workbooks.

coveralls · 2023-08-14T12:49:08Z

Pull Request Test Coverage Report for Build 5958607573

156 of 160 (97.5%) changed or added relevant lines in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.3%) to 78.039%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
dcicutils/sheet_utils.py	156	160	97.5%

Totals
Change from base Build 5943632943:	0.3%
Covered Lines:	8390
Relevant Lines:	10751

💛 - Coveralls

willronchetti

Some very minor stylistic questions, maybe add a few docstings here and there, but nothing blocking on my end.

I would make sure that Doug has a chance to compare this to his spec before merging just so we are aligned on that front.

willronchetti · 2023-08-16T14:50:05Z

dcicutils/sheet_utils.py

+    @classmethod
+    def _all_rows(cls, sheet: Worksheet):
+        row_max = sheet.max_row
+        for row in range(2, row_max + 1):
+            yield row
+
+    @classmethod
+    def _all_cols(cls, sheet: Worksheet):
+        col_max = sheet.max_column
+        for col in range(1, col_max + 1):
+            yield col


Not to belabor the point as we've discussed this before, but I don't really understand the use of @classmethod here when there are no references to cls. I looked this up and as it turns out most styling standards do not really mind that you do this, so I guess it's ok, but generally I find it confusing as when I see a @classmethod I am implicitly assuming there is a dependency for that method elsewhere in the class ie: it cannot be ported as a standalone function outside of the class.

I do not like @staticmethod because it's not a concept in many languages. A compiler can figure out what's a static method and there should be no need to declare it. If I want to call other class methods, I like having cls already available as a variable so I don't have to distract myself by adding data flow to receive the class name. If something is truly promised to be a static method, I just make it a function. There is no need to go through the syntax of calling it indirect through a class. It could be called by any class and it's weird to hide it somewhere that makes it syntactically hard for other things to get to it. If it's part of a class, then it should have access to the class methods.

Also, as a practical matter, Python fusses if something is an instance method but could be a static method but it does not fuss if it's a class method and could be a static method, probably because it wants to accommodate exactly the programming style that I use.

willronchetti · 2023-08-16T14:50:55Z

dcicutils/sheet_utils.py

+    def _load_row(self, *, sheet: Worksheet, row: int):
+        headers = self.sheet_headers(sheet.title)
+        row_dict: Dict[str, Any] = {headers[col-1]: sheet.cell(row=row, column=col).value
+                                    for col in self._all_cols(sheet)}
+        return row_dict


Nit-pick: I'd move this definition above the usage in the prior method

Yeah, I want to rearrange some other stuff, too, so I'll do that when I do. I didn't want to move them out from under you once you were reviewing things. I guess now is a good time.

willronchetti · 2023-08-16T14:52:23Z

dcicutils/sheet_utils.py

+            cls.set_path_value(datum[key], more_path, value)
+
+    @classmethod
+    def parse_value(cls, value: SheetCellValue) -> AnyJsonData:


I would describe semantics of this processing in docstring

Yeah, I'm adding some doc strings in parallel with your review.

drio18

Leaving a couple style comments that may be worth discussing, as well as asking one important question as to the scope of what this module is designed to do with respect to the casting of values performed here.

drio18 · 2023-08-15T15:31:38Z

dcicutils/sheet_utils.py

+        self.headers_by_sheetname: Dict[List[str]] = {}
+        self.content_by_sheetname: Dict[List[Any]] = {}


For dictionary-like type annotations, consider annotating the key type as well as the value type.

Suggested change

self.headers_by_sheetname: Dict[List[str]] = {}

self.content_by_sheetname: Dict[List[Any]] = {}

self.headers_by_sheetname: Dict[str, List[str]] = {}

self.content_by_sheetname: Dict[str, List[Any]] = {}

Yeah, that got fixed later. I don't even think the way I annotated it was legit. I was just asleep while typing.

But this is a really crappy way to notate dictionaries. Newer python (I think in 3.8) has better ways to declare this. Dictionaries are often not uniform in output type and this notation has trouble handling that.

drio18 · 2023-08-17T19:58:14Z

dcicutils/sheet_utils.py

+    def parse_value(cls, value: SheetCellValue) -> AnyJsonData:
+        if isinstance(value, str):
+            lvalue = value.lower()
+            # TODO: We could consult a schema to make this less heuristic, but this may do for now


If the purpose of this module is parsing spreadsheets purely for generating items to validate against schema, I think the schema parsing is required here. Otherwise, we will likely be casting to one type here only to check the schema later and cast back, for example if the field is a string per the schema and alphanumeric (including only numeric) inputs are accepted.

If we want this module for parsing spreadsheets more generically, then no issues with this at all.

I was just talking to Will about that today. It's not just validating. I think of it (well, Excel, really, or some other spreadsheet editor like OpenOffice or LibreOffice) as a kind of user interface layer that accepts more flexible syntax and offers more interesting editing capabilities. Your original ask was to tolerate differences in case, etc. That can't be done without schemas, I agree. That part is coming downstream. Stay tuned. Just trying to keep a separation of concerns.

drio18 · 2023-08-17T20:12:42Z

dcicutils/sheet_utils.py

+        self.filename: str = filename
+        self.workbook: Optional[Workbook] = None
+        self.headers_by_sheetname: Dict[str, List[str]] = {}
+        self.content_by_sheetname: Dict[str, List[Any]] = {}


Interested to hear your perspective on whether it's preferable to update these variables (as is done here) or just assign them more explicitly, possibly caching the call. I've been gravitating more and more towards using frozen dataclasses with explicit properties or functions for getting particular pieces of information, such as headers_by_sheetname here, and possibly caching the result if it's expensive or may be called multiple times. Given that the input here fixes everything that follows (assuming the filename and its contents are fixed when parsing), it seems like a natural fit here for me as well, with the benefit that one does not have to see where the variable is updated throughout the class.

It's not expensive to get, though. It's all O(1). And these are really only set in one place. The only thing that's done a lot is accesses, and mainly this is just making those references a little prettier. Dataclasses look like a lot more computational and textual overhead than I need here, but I don't use them, so you're welcome to send me an example or I'd be happy to take time to discuss them.

drio18 · 2023-08-17T20:21:40Z

dcicutils/sheet_utils.py

+    def sheet_headers(self, sheetname: str) -> List[str]:
+        return self.headers_by_sheetname[sheetname]
+
+    def sheet_content(self, sheetname: str) -> List[Any]:
+        return self.content_by_sheetname[sheetname]
+
+    @classmethod
+    def _all_rows(cls, sheet: Worksheet):
+        row_max = sheet.max_row
+        for row in range(2, row_max + 1):
+            yield row
+
+    @classmethod
+    def _all_cols(cls, sheet: Worksheet):
+        col_max = sheet.max_column
+        for col in range(1, col_max + 1):
+            yield col


Another stylistic thing you're welcome to ignore but for which I'm interested to hear your input given your background: I tend to utilize noun-only names for class properties and variables, whereas I would start these methods with verbs since deliver output dependent on input (e.g. _get_sheet_headers). I know we don't have a style guide, but sometimes I wish we did.

These need to be verbs so that it's clear to someone that a new one is made on each call and that it's not accessing a cache. (Otherwise if two people looped at the same time, they'd reuse the same info. There were bugs in the early collection classes in python around this.) I mostly use the style rule you say, but sometimes the system is overconstrained and it matters more that it looks right in context, which is:

for x in foo._all_cols(...):

and you don't want to say

for x in foo._get_all_cols():

or _compute_all_cols() or whatever. But I also don't want:

for x in foo._all_cols:

drio18 · 2023-08-17T20:27:21Z

dcicutils/sheet_utils.py

+            if key0 == n:
+                parent.append(placeholder)
+            elif key0 > n:
+                raise Exception("Numeric items must occur sequentially.")


What if key0 < n? If someone submits something along the lines of friend#-1.age as a header, we should either handle it appropriately or report some kind of error.

Actually, < n is not that big a deal. but < 0 would be a problem. I could add that. The reason it wants to worry about > n is to make someone not use a huge number, because that would allocate storage beyond what is reasonable. I'm less worried about negative storage allocation, but it's still reasonable to have an error check for nonsensical info. And it's possible someone might think -1 means "last", so that's worth heading off.

drio18 · 2023-08-17T20:29:54Z

dcicutils/sheet_utils.py

+            elif key0 > n:
+                raise Exception("Numeric items must occur sequentially.")


When parsing these spreadsheets, we generally won't want to raise exceptions outright that won't be caught quickly because we want to accumulate all errors across the spreadsheet before reporting them back to the user for fixing. This could certainly be caught downstream, but may be worth considering here from the get-go.

Yeah, there might be some finessing to do on this. I had certainly been trying to minimize errors because indeed raising errors at all means you only get one bit of feedback instead of many and there's no obvious handler anyway, but some of these things I didn't work all the way through and figured I can come back to.

drio18 · 2023-08-17T20:31:44Z

dcicutils/sheet_utils.py

+        self.patch_prototypes_by_sheetname: Dict[str, Dict] = {}
+        self.parsed_headers_by_sheetname: Dict[str, List[List[Union[int, str]]]] = {}


Similar to the comment above, consider more explicitly building/returning these rather than updating them in separate class methods.

Not sure I agree with this. We could talk about what the advantage is. The primary reason I think that's a bad idea is it impedes or slows unit testing or possibly other operations where you don't need to construct these because they're not used, or where there might be unused sheets you're never going to touch. I think demand-creating them makes sense, and a dictionary is a good way to do that.

netsettler added 4 commits August 14, 2023 07:21

First cut at tools for parsing workbooks.

4c84f0b

Refactor to separate some functionality into a separate sevice class.

7b73a67

Add a csv file for testing.

3d4573f

Add some negative testing.

f4e5cfa

netsettler requested review from willronchetti, drio18 and dmichaels-harvard August 14, 2023 12:42

Update lock file.

e9d2465

netsettler added 4 commits August 14, 2023 10:28

Document new sheets_utils module.

6e9060f

Issue a beta for this functionality.

df12c91

Fix documentation for sheet_utils.

6a39c8a

Add some declarations. Small refactors to improve modularity.

eedb5c6

willronchetti approved these changes Aug 16, 2023

View reviewed changes

Rearrange some methods for presentational reasons.

a6b68fe

netsettler mentioned this pull request Aug 17, 2023

sheet_utils refactor to add csv functionality (C4-1088) #276

Open

drio18 reviewed Aug 17, 2023

View reviewed changes

netsettler added 3 commits August 17, 2023 17:40

Merge branch 'master' into kmp_sheet_utils

682c95a

Merge branch 'master' into kmp_sheet_utils

3a103ee

Mark chardet as an acceptable license for use.

56f702a

dmichaels-harvard approved these changes Aug 24, 2023

View reviewed changes

netsettler changed the title ~~Some tools for working with spreadsheets~~ Some tools for working with spreadsheets (C4-1088) Aug 28, 2023

netsettler merged commit 08d428e into master Oct 31, 2023
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some tools for working with spreadsheets (C4-1088) #275

Some tools for working with spreadsheets (C4-1088) #275

netsettler commented Aug 14, 2023

coveralls commented Aug 14, 2023 •

edited

Loading

willronchetti left a comment

willronchetti Aug 16, 2023

netsettler Aug 16, 2023

willronchetti Aug 16, 2023

netsettler Aug 16, 2023

willronchetti Aug 16, 2023

netsettler Aug 16, 2023

drio18 left a comment

drio18 Aug 15, 2023

netsettler Aug 23, 2023

drio18 Aug 17, 2023

netsettler Aug 23, 2023

drio18 Aug 17, 2023

netsettler Aug 23, 2023

drio18 Aug 17, 2023

netsettler Aug 23, 2023

drio18 Aug 17, 2023

netsettler Aug 23, 2023

drio18 Aug 17, 2023

netsettler Aug 23, 2023

drio18 Aug 17, 2023

netsettler Aug 23, 2023

		self.headers_by_sheetname: Dict[List[str]] = {}
		self.content_by_sheetname: Dict[List[Any]] = {}

		elif key0 > n:
		raise Exception("Numeric items must occur sequentially.")

		self.patch_prototypes_by_sheetname: Dict[str, Dict] = {}
		self.parsed_headers_by_sheetname: Dict[str, List[List[Union[int, str]]]] = {}

Some tools for working with spreadsheets (C4-1088) #275

Some tools for working with spreadsheets (C4-1088) #275

Conversation

netsettler commented Aug 14, 2023

coveralls commented Aug 14, 2023 • edited Loading

Pull Request Test Coverage Report for Build 5958607573

💛 - Coveralls

willronchetti left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

drio18 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coveralls commented Aug 14, 2023 •

edited

Loading