Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schema hinting to new sheets functionality (C4-1088) #280

Merged
merged 68 commits into from
Oct 31, 2023

Conversation

netsettler
Copy link
Collaborator

@netsettler netsettler commented Aug 24, 2023

This is getting close to where it might be usable. Features available, including those which are available in some of the prevoius PRs. I've tried to highlight the areas of current development here:

  • New module sheet_utils for loading workbooks.

    • Important things of interest:

      • Class ItemManager for loading Item-style data from any .xlsx, .csv or .tsv files. (Planned support, only partly implemented, for JSON via .json files and JSON LInes via .jsonl files.)

      • Function load_items that does the same as ItemManager.load. Features:

        • type hinting (see class TypeHint and its subclasses) to normalize/repair various kinds of inputs. Work ongoing in this area. Doug has requested this be done differently in various ways.

        • autoloading schemas based on tab names

        • object cross-references. Work on this "instaguid" feature is in flux and an area of ongoing work. Experimentation and considering of variations of implementations still planned before converging. WIll and Doug have asked questions. Kent plans a document soon itemizing alternatives and issues.

    • Various lower-level implementation classes such as:

      • Classes XlsxManager, CsvManager and TsvManager for loading raw data from .xlsx, .csv, and .tsv files, respectively.

      • Classes XlsxItemManager, CsvItemManager, and TsvItemManager for loading Item-style data from .xlsx, .csv, and .tsv files, respectively.

  • New utility functionality in misc_utils:

    • New function is_uuid (migrated from Fourfront)
    • New function pad_to
    • New class JsonLinesReader

Still to do:

  • Better doc strings
  • Additional unit testing for coverage
  • Redesign of instaguids after planned design discussion offline.
  • Better framework for defining type hints. It would help to have input about kinds of transformations desired.

netsettler and others added 30 commits August 14, 2023 07:21
… not the workbook level artifact. Better handling of init args.
… ItemManager.load to take a tab_name argument so that CSV files can perhaps infer a type name.
Copy link
Contributor

@drio18 drio18 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping a few comments, including one important one regarding the need to use the schema information more thoroughly. 4DN does much of this work already in Submit4DN (mostly here).

dcicutils/misc_utils.py Show resolved Hide resolved
dcicutils/misc_utils.py Outdated Show resolved Hide resolved
Comment on lines +274 to +297
def finder(subheader, subschema):
if not parsed_header:
return None
else:
[key1, *other_headers] = subheader
if isinstance(key1, str) and isinstance(subschema, dict):
if subschema.get('type') == 'object':
def1 = subschema.get('properties', {}).get(key1)
if not other_headers:
if def1 is not None:
t = def1.get('type')
if t == 'string':
enum = def1.get('enum')
if enum:
mapping = {e.lower(): e for e in enum}
return EnumHint(mapping)
elif t == 'boolean':
return BoolHint()
else:
pass # fall through to asking super()
else:
pass # fall through to asking super()
else:
return finder(subheader=other_headers, subschema=def1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider breaking some of logic into more discrete parts here to clarify the process and facilitate simpler testing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll do that.

cls.set_path_value(datum[key], more_path, value)

@classmethod
def find_type_hint(cls, parsed_header: Optional[ParsedHeader], schema: Any):
Copy link
Contributor

@drio18 drio18 Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there needs to be additional logic for using the schema here; I could be mistaken, but I don't believe there are tests for such cases at least.

For example, prefer_number above is automatically converting things that look like numbers to ints or floats, but the schema may require the field to be a string. Hence, there's the possibility of a bit of casting back and forth happening here that needs to be handled. As currently written, some information could be lost by this casting back and forth as well, if for example "+1" is cast to an int of 1 and then a string of "1".

Similarly, if the schema defines the field to be an array of strings and the spreadsheet value is one string without the "|" character (e.g. "dog"), the string needs to be cast to an array of strings (["dog"]). On the flip side, if the schema dictates the field is a string, and someone submits a string with the "|" character, it will be converted to an array of strings via parse_item_value, and the validation will fail even though the string was valid.

My personal preference for handling this would be to cast all values to appropriate types and do the "type hinting" all in one location; currently, these operations are occurring in multiple places/stages of the processing, and the logic becomes more difficult to follow as to where an error arose. As the cases above demonstrate, it may not make sense to do anything with a cell value until the type from the schema is known.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is trying to emulate what Excel does (in the raw layer0. I really highly doubt too many people ever manage their cell type to match the schemas, so we have to assume that the data coming in will be either string or number, and it therefore doesn't hurt to do this as a hack to make csv files feel like they do similar things. The semantic (schema-hinting) layer will need to assume both types of data come in to it.

Honestly, I don't expect us to ever use the raw layer, but as a matter of abstraction it seemed important to build this in steps.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. I actually wrote this capability for the lower level and only stubbed the other part in here before I had schemas, but yeah, I guess I can make that be optional or just remove it at that level. Even the conversion to boolean might not be helpful at this level, since it can be schema-driven now. Somewhat the thing to see is that this is evolutionary and I've been changing the nature of some of these things organically. But I generally agree with you about expanding the capabilities and making it more schema-driven.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, yes, I left the testing at this level light figuring it would change a lot. The testing, such as it is, is in the spreadsheets, which allows some of this stuff to be refactored to do similar things a different way without major changes to the tests while we're in transition. This isn't to say there are no tests, but they are more end-to-end and not down to the individual function layer at this point in development. I'll fill the testing in better when it's more stable. But if you ask me to refactor everything, you should know you're asking to have all the affected tests rewritten, so you should be glad there's little such friction. Means you're not asking for a heavy lift. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also gets involved in the error-handling behavior you want, though my present theory is that while it's hard to know what to do about bad column headers, if we can at least locate the data correctly, we can just pass through a string and let the validator complain about it.

@netsettler netsettler changed the title Add schema hinting to new sheets functionality Add schema hinting to new sheets functionality (C4-1088) Aug 28, 2023
netsettler and others added 23 commits August 30, 2023 13:38
…rectoryItemManager. Rename _JsonInsertsDataItemManager to InsertsItemManager.
…ctively, to _parse_inserts_data, _load_inserts_data, and _check_inserts_data.
Add portal_vapp= functionality to sheet_utils
with TemporaryDirectory() as temp_dir:
temp_base = os.path.join(temp_dir, target_base_part)
temp_filename = temp_base + target_ext
_do_shell_command(['cp', filename, temp_filename])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we want to do as much as possible using Python libraries (e.g. file copy) as possible, rather than execing out to shell?

@netsettler netsettler merged commit f820233 into master Oct 31, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants