Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile Releases Transform - include records table #63

Closed
ghost opened this issue Feb 6, 2019 · 14 comments
Closed

Compile Releases Transform - include records table #63

ghost opened this issue Feb 6, 2019 · 14 comments
Labels
steps Relating to specific steps (transforms)
Milestone

Comments

@ghost
Copy link

ghost commented Feb 6, 2019

This currently works off releases table - does it also need to look for releases in the record table?

@jpmckinney jpmckinney changed the title Transform - Compile Releases - include records table? Compile Releases Transform - include records table? Mar 12, 2019
@jpmckinney jpmckinney added the steps Relating to specific steps (transforms) label Mar 12, 2019
@jpmckinney jpmckinney changed the title Compile Releases Transform - include records table? Compile Releases Transform - include records table Mar 12, 2019
@jpmckinney
Copy link
Member

I think so – though if there is already a compiled release, we might want to decide on the default behavior (re-compile, or trust the existing compiled release) and offer a flag to change the behavior.

@jpmckinney
Copy link
Member

@ghost
Copy link
Author

ghost commented Mar 18, 2020

though if there is already a compiled release, we might want to decide on the default behavior (re-compile, or trust the existing compiled release) and offer a flag to change the behavior.

Maybe this shouldn't be a flag - we already have a transform framework and we could have two different types of compile release transforms - one for each option. Then it's easy for the operator to choose which one they want when they create a new collection with a transform. It would be up to us to decide which one we used in our standard workflow.

Having this as two different transforms also opens the door to us taking a raw collection that has compiled releases, running it through both transforms and then comparing the results to see if publishers release compilation process has problems.

The rest looks ok.

Change process/ocdskingfisherprocess/transform/compile_releases.py

Change get_ocids Function to also get ocid's from record and compiled releases table.

Maybe break process_ocid into sub functions. One to look for existing compiled releases in several places, the other to look for releases in more places and use them to make a compiled release. Note existing code and comments about releases tagged compiled.

The two different Transforms can then call the sub functions in different orders depending on what their priorities are.

And it should be relatively easy to write good tests for this

@jpmckinney
Copy link
Member

jpmckinney commented Mar 18, 2020

I don't think we should have a new transform for every option. In the long run, this will lead to a lot of transforms (exponential growth to represent every combination of options…). It's better to use hierarchy (add options to transforms, rather than reify every option as a new transform) to limit complexity for the user.

For example, in the Django branch there's an options JSON dict on the collection table for, e.g., #222. To support #222 with just a transform name (each JSON path can be assigned a merge behavior), you'd need some sort of 'dynamic' transform name parsing, at which point you've just implemented a flag/option in a more complicated way.

To compare the re-compiled releases to the original compiled releases, we can just compare the data linked from the compiled_release table to the data->compiledRelease linked from the records table. There's no need to run both transforms, especially considering that the 'pass-through' transform will still fallback to re-compiling if the compiled release isn't present, which will make the above analysis harder.

The only reason for prioritizing this issue is so that sources that only have records can be processed by Pelican. As such, let's just implement the simple option of compiling based on a record's releases (instead of passing through its compiledRelease).

The Django rewrite is still planned, so let's keep the changes minimal where possible.

ghost pushed a commit that referenced this issue Mar 24, 2020
…ns for behavoir

Options for behavoir around what to do with existing compiled releases it finds

#63
@jpmckinney jpmckinney added this to the V1 milestone Mar 24, 2020
@ghost
Copy link
Author

ghost commented Mar 25, 2020

Have come across a massive can of worms on this one:

https://standard.open-contracting.org/latest/en/schema/records_reference/#package-metadata

A record must contain: An array of releases about this contracting process - either by providing a URL for where these releases can be found, or embedding a full copy of the release

So, if we pull a release out of a record that only has a "url" field we need to get that data and fill it in BEFORE we do the merge.

But that's a problem; we can't just get the data from the "url" at the point we process the transform. We might be working with historical data, or data we local loaded. We want to work with the data we have for consistency.

So let's look for that URL in the existing data we have in the source collection - but I can see many problems there:

  • There is no guarantee the spider will have downloaded that data and put it into the source collection. Even if it has, there is no guarantee that exactly the same URL will be recorded against it.
  • Depending on how we saved and local loaded historical data, there may not even be URL's recorded against the data.
  • can we do anything with "uri" fields in packages?
  • not to mention handling edge cases where the "url" field is in the data, but it's another record, or garbage, etc

I'm going to put this on hold until I hear from @jpmckinney , basically our options are

  • Put this bit of work on hold
  • Do the work while ignoring this problem and leaving the default option as "use existing compiled releases" (I've pretty much done this - just a few tests to sort out) Just add errors against the transformed data if "url" fields are seen.
  • Tackle this can of worms now

ghost pushed a commit that referenced this issue Mar 25, 2020
…ns for behavoir

Options for behavoir around what to do with existing compiled releases it finds

#63
@jpmckinney
Copy link
Member

jpmckinney commented Mar 25, 2020

Let's do a different version of the second option. I also don't know what you mean, specifically, by "just add errors".

  1. if a record has embedded releases, use them.

  2. If a record has linked releases:

    1. Use the compiledRelease if available. If so, add a collection note once for the entire transformation:

      [compile-releases] At least one record (%(ocid)s) uses linked releases. Its compiledRelease was copied into the compiled_release table, instead of merging its individual releases.

    2. If no compiledRelease is available, add a collection note once for the entire transformation:

      [compile-releases] At least one record (%(ocid)s) uses linked releases and has no compiledRelease. Its OCID was skipped.

@ghost
Copy link
Author

ghost commented Mar 26, 2020

Ok. Looking at existing behaviour, the transform will use existing compiled releases if there are any.

So when trying to use existing compiled releases:

  • if has exactly one release tagged compile or a compiledRelease, just use that
  • if has more than one, pick one at random and use that with an message (this is current behaviour)
  • if has releases, but any one of them is linked, just skip this OCID and put a message
  • if has releases and none are linked, compile ourselves!

When in the mode that we always compile, would it be:

  • if has releases and none are linked, compile ourselves!
  • if has exactly one release tagged compile or a compiledRelease, just use that
  • if has more than one, pick one at random and use that with an message (this is standard behavoir)
  • if has releases, but any one of them is linked, just skip this OCID and put a message

Or do we not want to do modes at the moment? We could ignore modes and just stick with the existing behaviour of using existing compiled releases, for a faster completion and deploy?

@jpmckinney
Copy link
Member

For clarity: Merging compiled releases with records as input should have a separate code path than with releases as input – though they can call common sub-methods to avoid duplicate code.

We can leave the logic for releases as input as-is.

For records as input, we just need one mode, which is closer to 'always compile'. Amending for clarity:

  • if has releases and none are linked, compile ourselves.
  • if has compiledRelease, use it. Add a warning as in my earlier comment.
  • if has releases and exactly one release is tagged compiled (not expected to occur in valid data), use it. Add a warning.
  • if has releases and many releases are tagged compiled (not expected to occur in valid data), use one. Add a warning.
  • otherwise, skip and add a warning.

@jpmckinney
Copy link
Member

Just to note: getting the right behavior takes priority over fast completion.

@jpmckinney
Copy link
Member

jpmckinney commented Mar 26, 2020

For posterity, the reason the order of preference is different for releases and records is based on earlier comments: #147 (comment)

To summarize earlier discussion: When there are releases as input, a release with a tag of 'compiled' is an implementation error by the publisher, which occurs rarely. Ideally, the analyst would be able to choose which logic to follow. The default logic is to take a compiled release if present, for the reasons in the previous issue. We didn't fuss too much about which to have as the default, as this is a rare error by the publisher.

For records, we assume that the publisher has correctly included all releases in the record's releases (which is a required field in OCDS), but we do not assume that they implemented the merge routine correctly, as it is easy to implement it incorrectly. We therefore re-merge where possible. If not possible, we go through increasingly unlikely cases that get progressively further from a correct implementation.

@ghost
Copy link
Author

ghost commented Mar 27, 2020

This seems clear, and it looks like we don't need modes. Thanks.

If an OCID has both a record and releases, which process should we use?

(This is very unlikely as when an API offers both records and releases, we tend to write 2 different spiders so the data would be in separate collections. But it could happen.)

If there are > 1 records for an OCID, should we just pick one at random and note that we did that?

@jpmckinney
Copy link
Member

jpmckinney commented Mar 27, 2020

If a collection has both releases and records for an OCID, let's prefer the records code path. (We could add a third "mixed" code path that falls back to releases if the best option for records isn't available, but this is such an edge case that I'm fine with a suboptimal path.)

As you describe, we try to treat sources of records and releases as independent datasets (two representations of the same reality), so there should never be a case that a collection contains a mix of both (which would be a doubling of, or inconsistent organization of, reality).

If a collection has multiple records for an OCID, let's just go with the first one (and make a note) like we do when there are multiple compiled releases for an OCID (no need for true 'random').

ghost pushed a commit that referenced this issue Apr 14, 2020
#63

Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start
ghost pushed a commit that referenced this issue Apr 14, 2020
#63

Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start
ghost pushed a commit that referenced this issue Apr 14, 2020
#63

Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start
ghost pushed a commit that referenced this issue Apr 14, 2020
#63

Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start
ghost pushed a commit that referenced this issue Apr 14, 2020
#63

Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start
ghost pushed a commit that referenced this issue Apr 15, 2020
ghost pushed a commit that referenced this issue Apr 15, 2020
#63

Continues 2d04529
- adds more tests
- adds warnings and notes
@ghost
Copy link
Author

ghost commented Apr 20, 2020

ghost pushed a commit that referenced this issue Apr 20, 2020
#63

Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start
ghost pushed a commit that referenced this issue Apr 20, 2020
#63

Continues 2d04529
- adds more tests
- adds warnings and notes
@ghost
Copy link
Author

ghost commented Apr 20, 2020

Deployed - think can now close?

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
steps Relating to specific steps (transforms)
Projects
None yet
Development

No branches or pull requests

1 participant