Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update from frictionless>=4.4,<5 to frictionless>=5,<6 across the board #3293

Open
jdangerx opened this issue Jan 24, 2024 · 2 comments
Open

Comments

@jdangerx
Copy link
Member

Once #1420 is closed, we will be using one version of frictionless across the board.

Unfortunately, that version will be 4.40.8, which is missing critical SQL-describing functionality.

We should upgrade to version 5! Migration guide is here.

Fortunately, our existing usage of the frictionless libraries is very limited - just checking if it can parse a datapackage.json and then checking to see if the metadata is valid:

ferc-xbrl-extractor/src/ferc-xbrl-extractor/xbrl.py

    if datapackage_path:
        # Verify that datapackage descriptor is valid before outputting
        frictionless_package = Package(descriptor=datapackage.dict(by_alias=True))
        if not frictionless_package.metadata_valid:
            raise RuntimeError(
                f"Generated datapackage is invalid - {frictionless_package.metadata_errors}"
            )

pudl/src/pudl/workspace/datastore.py

        dp = frictionless.Package(datapackage_json)
        if not dp.metadata_validate():

Which is pretty easy to change to the frictionless 5 Package.validate_descriptor() method.

Unfortunately, this is partly an artifact of us making mirror classes that look sort of like the frictionless classes, in pudl-archiver, ferc-xbrl-extractor, and pudl. These classes sort of replicate the frictionless classes, but also include lots of custom logic for our own purposes.

My proposal is:

  1. rename our Fake Frictionless classes to PudlResource, XbrlResource, and ArchiverResource, respectively (plus the Package/Schema/etc. equivalents)
  2. add to_frictionless methods that allow us to convert these to their frictionless counterparts
  3. Use real frictionless classes to serialize datapackage.jsons that are valid, in ferc-xbrl-extractor/pudl-archiver
  4. Create new, valid archives if necessary
  5. Use real frictionless classes to validate the datapackages in pudl/workspace/datastore.py

My guess for this work is ~20h.

Note - since our classes have required functionality that frictionless classes don't have, we should not implement from_frictionless methods. For example, every XbrlResource has to respond to get_fact_tables(), but frictionless has no such notion. So then from_frictionless can only ever make some half-functional XbrlResource - so we shouldn't do that at all.

@bendnorman
Copy link
Member

I'm new to this issue but why not have our Resources be subclasses of the frictionless framework classes? I think we'll have 3 types of extensions we'd like to make:

  1. extensions to frictionless classes that make sense to add directly to the frictionless-py package
  2. extensions that don't fit into ^ but we'd like to use in all of our projects. For example, we want to store additional metadata for all of our projects that don't fit into the frictionless specification. Another example would be if we want a method to convert a frictionless schema to a pandera schema but frictionless isn't interested in providing that functionality. To handle these types of extensions we'd probably need a Catalyst specific package that extends the frictionless tools.
  3. extensions that are project specific (XBRL, PUDL, client projects...) For this we can just extend the catalyst or frictionless package.

@jdangerx
Copy link
Member Author

I think it's probably better to have our Resource contain references to frictionless classes - roughly corresponding to the idea of preferring "composition over inheritance."

The benefit of this, to me, is that this reduces the coupling between our PudlResource interface and the implementation of the frictionless classes it depends on. You'll only need to know the public frictionless APIs instead of the implementation details. But we would still be able to use the frictionless code to reduce duplication as necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

2 participants