Compile Releases Transform - include records table #63

ghost · 2019-02-06T11:05:01Z

This currently works off releases table - does it also need to look for releases in the record table?

jpmckinney · 2019-03-12T19:39:18Z

I think so – though if there is already a compiled release, we might want to decide on the default behavior (re-compile, or trust the existing compiled release) and offer a flag to change the behavior.

jpmckinney · 2020-03-03T15:18:25Z

Process: Add support for records to compile release transform

ghost · 2020-03-18T15:35:08Z

though if there is already a compiled release, we might want to decide on the default behavior (re-compile, or trust the existing compiled release) and offer a flag to change the behavior.

Maybe this shouldn't be a flag - we already have a transform framework and we could have two different types of compile release transforms - one for each option. Then it's easy for the operator to choose which one they want when they create a new collection with a transform. It would be up to us to decide which one we used in our standard workflow.

Having this as two different transforms also opens the door to us taking a raw collection that has compiled releases, running it through both transforms and then comparing the results to see if publishers release compilation process has problems.

The rest looks ok.

Change process/ocdskingfisherprocess/transform/compile_releases.py

Change get_ocids Function to also get ocid's from record and compiled releases table.

Maybe break process_ocid into sub functions. One to look for existing compiled releases in several places, the other to look for releases in more places and use them to make a compiled release. Note existing code and comments about releases tagged compiled.

The two different Transforms can then call the sub functions in different orders depending on what their priorities are.

And it should be relatively easy to write good tests for this

jpmckinney · 2020-03-18T16:13:04Z

I don't think we should have a new transform for every option. In the long run, this will lead to a lot of transforms (exponential growth to represent every combination of options…). It's better to use hierarchy (add options to transforms, rather than reify every option as a new transform) to limit complexity for the user.

For example, in the Django branch there's an options JSON dict on the collection table for, e.g., #222. To support #222 with just a transform name (each JSON path can be assigned a merge behavior), you'd need some sort of 'dynamic' transform name parsing, at which point you've just implemented a flag/option in a more complicated way.

To compare the re-compiled releases to the original compiled releases, we can just compare the data linked from the compiled_release table to the data->compiledRelease linked from the records table. There's no need to run both transforms, especially considering that the 'pass-through' transform will still fallback to re-compiling if the compiled release isn't present, which will make the above analysis harder.

The only reason for prioritizing this issue is so that sources that only have records can be processed by Pelican. As such, let's just implement the simple option of compiling based on a record's releases (instead of passing through its compiledRelease).

The Django rewrite is still planned, so let's keep the changes minimal where possible.

…ns for behavoir Options for behavoir around what to do with existing compiled releases it finds #63

ghost · 2020-03-25T11:36:35Z

Have come across a massive can of worms on this one:

https://standard.open-contracting.org/latest/en/schema/records_reference/#package-metadata

A record must contain: An array of releases about this contracting process - either by providing a URL for where these releases can be found, or embedding a full copy of the release

So, if we pull a release out of a record that only has a "url" field we need to get that data and fill it in BEFORE we do the merge.

But that's a problem; we can't just get the data from the "url" at the point we process the transform. We might be working with historical data, or data we local loaded. We want to work with the data we have for consistency.

So let's look for that URL in the existing data we have in the source collection - but I can see many problems there:

There is no guarantee the spider will have downloaded that data and put it into the source collection. Even if it has, there is no guarantee that exactly the same URL will be recorded against it.
Depending on how we saved and local loaded historical data, there may not even be URL's recorded against the data.
can we do anything with "uri" fields in packages?
not to mention handling edge cases where the "url" field is in the data, but it's another record, or garbage, etc

I'm going to put this on hold until I hear from @jpmckinney , basically our options are

Put this bit of work on hold
Do the work while ignoring this problem and leaving the default option as "use existing compiled releases" (I've pretty much done this - just a few tests to sort out) Just add errors against the transformed data if "url" fields are seen.
Tackle this can of worms now

…ns for behavoir Options for behavoir around what to do with existing compiled releases it finds #63

jpmckinney · 2020-03-25T18:31:15Z

Let's do a different version of the second option. I also don't know what you mean, specifically, by "just add errors".

if a record has embedded releases, use them.
If a record has linked releases:
1. Use the compiledRelease if available. If so, add a collection note once for the entire transformation:
  
  [compile-releases] At least one record (%(ocid)s) uses linked releases. Its compiledRelease was copied into the compiled_release table, instead of merging its individual releases.
2. If no compiledRelease is available, add a collection note once for the entire transformation:
  
  [compile-releases] At least one record (%(ocid)s) uses linked releases and has no compiledRelease. Its OCID was skipped.

ghost · 2020-03-26T17:07:44Z

Ok. Looking at existing behaviour, the transform will use existing compiled releases if there are any.

So when trying to use existing compiled releases:

if has exactly one release tagged compile or a compiledRelease, just use that
if has more than one, pick one at random and use that with an message (this is current behaviour)
if has releases, but any one of them is linked, just skip this OCID and put a message
if has releases and none are linked, compile ourselves!

When in the mode that we always compile, would it be:

if has releases and none are linked, compile ourselves!
if has exactly one release tagged compile or a compiledRelease, just use that
if has more than one, pick one at random and use that with an message (this is standard behavoir)
if has releases, but any one of them is linked, just skip this OCID and put a message

Or do we not want to do modes at the moment? We could ignore modes and just stick with the existing behaviour of using existing compiled releases, for a faster completion and deploy?

jpmckinney · 2020-03-26T17:31:21Z

For clarity: Merging compiled releases with records as input should have a separate code path than with releases as input – though they can call common sub-methods to avoid duplicate code.

We can leave the logic for releases as input as-is.

For records as input, we just need one mode, which is closer to 'always compile'. Amending for clarity:

if has releases and none are linked, compile ourselves.
if has compiledRelease, use it. Add a warning as in my earlier comment.
if has releases and exactly one release is tagged compiled (not expected to occur in valid data), use it. Add a warning.
if has releases and many releases are tagged compiled (not expected to occur in valid data), use one. Add a warning.
otherwise, skip and add a warning.

jpmckinney · 2020-03-26T17:32:46Z

Just to note: getting the right behavior takes priority over fast completion.

jpmckinney · 2020-03-26T17:45:57Z

For posterity, the reason the order of preference is different for releases and records is based on earlier comments: #147 (comment)

To summarize earlier discussion: When there are releases as input, a release with a tag of 'compiled' is an implementation error by the publisher, which occurs rarely. Ideally, the analyst would be able to choose which logic to follow. The default logic is to take a compiled release if present, for the reasons in the previous issue. We didn't fuss too much about which to have as the default, as this is a rare error by the publisher.

For records, we assume that the publisher has correctly included all releases in the record's releases (which is a required field in OCDS), but we do not assume that they implemented the merge routine correctly, as it is easy to implement it incorrectly. We therefore re-merge where possible. If not possible, we go through increasingly unlikely cases that get progressively further from a correct implementation.

ghost · 2020-03-27T09:18:55Z

This seems clear, and it looks like we don't need modes. Thanks.

If an OCID has both a record and releases, which process should we use?

(This is very unlikely as when an API offers both records and releases, we tend to write 2 different spiders so the data would be in separate collections. But it could happen.)

If there are > 1 records for an OCID, should we just pick one at random and note that we did that?

jpmckinney · 2020-03-27T17:45:17Z

If a collection has both releases and records for an OCID, let's prefer the records code path. (We could add a third "mixed" code path that falls back to releases if the best option for records isn't available, but this is such an edge case that I'm fine with a suboptimal path.)

As you describe, we try to treat sources of records and releases as independent datasets (two representations of the same reality), so there should never be a case that a collection contains a mix of both (which would be a doubling of, or inconsistent organization of, reality).

If a collection has multiple records for an OCID, let's just go with the first one (and make a note) like we do when there are multiple compiled releases for an OCID (no need for true 'random').

#63 Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start

#278 #63

#63 Continues 2d04529 - adds more tests - adds warnings and notes

ghost · 2020-04-20T10:52:43Z

When this is done, I'll comment on Check all columns work with release_type = 'record' (compiled releases) kingfisher-summarize#43 as that issue will need review after this. [ Reminder for me! ]

#63 Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start

#63 Continues 2d04529 - adds more tests - adds warnings and notes

ghost · 2020-04-20T15:10:52Z

Deployed - think can now close?

#63

jpmckinney changed the title ~~Transform - Compile Releases - include records table?~~ Compile Releases Transform - include records table? Mar 12, 2019

jpmckinney added the steps Relating to specific steps (transforms) label Mar 12, 2019

jpmckinney changed the title ~~Compile Releases Transform - include records table?~~ Compile Releases Transform - include records table Mar 12, 2019

This was referenced Mar 2, 2020

Check all columns work with release_type = 'record' (compiled releases) open-contracting/kingfisher-summarize#43

Closed

In release_summary_with_data show compiled release not record. open-contracting/kingfisher-summarize#5

Closed

ghost pushed a commit that referenced this issue Mar 24, 2020

transform: Compile Releases Transform - include records table & optio…

1dd3348

…ns for behavoir Options for behavoir around what to do with existing compiled releases it finds #63

ghost mentioned this issue Mar 24, 2020

WIP: transform: Compile Releases Transform - include records table & options for behaviour #268

Closed

jpmckinney added this to the V1 milestone Mar 24, 2020

ghost pushed a commit that referenced this issue Mar 25, 2020

transform: Compile Releases Transform - include records table & optio…

c327bfd

…ns for behavoir Options for behavoir around what to do with existing compiled releases it finds #63

jpmckinney mentioned this issue Mar 25, 2020

Put all system messages in collection note table #242

Closed

1 task

ghost pushed a commit that referenced this issue Apr 14, 2020

Compile Releases Transform - include records

855483b

#63 Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start

ghost pushed a commit that referenced this issue Apr 14, 2020

Compile Releases Transform - include records

2f082a5

#63 Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start

ghost pushed a commit that referenced this issue Apr 14, 2020

Compile Releases Transform - include records

c292571

#63 Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start

ghost pushed a commit that referenced this issue Apr 14, 2020

Compile Releases Transform - include records

7c4020b

#63 Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start

ghost pushed a commit that referenced this issue Apr 14, 2020

Compile Releases Transform - include records

2d04529

#63 Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start

ghost mentioned this issue Apr 14, 2020

Compile Releases Transform - include records #278

Merged

ghost pushed a commit that referenced this issue Apr 15, 2020

requirements: Upgrade ocdskit and ocdsmerge

62e5745

#278 #63

ghost pushed a commit that referenced this issue Apr 15, 2020

Compile Releases Transform - include records

7c35b87

#63 Continues 2d04529 - adds more tests - adds warnings and notes

ghost pushed a commit that referenced this issue Apr 20, 2020

Compile Releases Transform - include records

686a798

#63 Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start

ghost pushed a commit that referenced this issue Apr 20, 2020

Compile Releases Transform - include records

af8f1b8

#63 Continues 2d04529 - adds more tests - adds warnings and notes

ghost closed this as completed Apr 20, 2020

ghost mentioned this issue Apr 22, 2020

Compile Releases Transform may end up trying duplicate work #283

Closed

ghost pushed a commit that referenced this issue May 6, 2020

docs: Add Transform Compiled Release docs

1a877c8

#63

ghost mentioned this issue May 6, 2020

docs: Add Transform Compiled Release docs #289

Merged

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compile Releases Transform - include records table #63

Compile Releases Transform - include records table #63

ghost commented Feb 6, 2019

jpmckinney commented Mar 12, 2019

jpmckinney commented Mar 3, 2020

ghost commented Mar 18, 2020

jpmckinney commented Mar 18, 2020 •

edited

Loading

ghost commented Mar 25, 2020 •

edited by ghost

Loading

jpmckinney commented Mar 25, 2020 •

edited

Loading

ghost commented Mar 26, 2020

jpmckinney commented Mar 26, 2020

jpmckinney commented Mar 26, 2020

jpmckinney commented Mar 26, 2020 •

edited

Loading

ghost commented Mar 27, 2020 •

edited by ghost

Loading

jpmckinney commented Mar 27, 2020 •

edited

Loading

ghost commented Apr 20, 2020 •

edited by ghost

Loading

ghost commented Apr 20, 2020

Compile Releases Transform - include records table #63

Compile Releases Transform - include records table #63

Comments

ghost commented Feb 6, 2019

jpmckinney commented Mar 12, 2019

jpmckinney commented Mar 3, 2020

ghost commented Mar 18, 2020

jpmckinney commented Mar 18, 2020 • edited Loading

ghost commented Mar 25, 2020 • edited by ghost Loading

jpmckinney commented Mar 25, 2020 • edited Loading

ghost commented Mar 26, 2020

jpmckinney commented Mar 26, 2020

jpmckinney commented Mar 26, 2020

jpmckinney commented Mar 26, 2020 • edited Loading

ghost commented Mar 27, 2020 • edited by ghost Loading

jpmckinney commented Mar 27, 2020 • edited Loading

ghost commented Apr 20, 2020 • edited by ghost Loading

ghost commented Apr 20, 2020

jpmckinney commented Mar 18, 2020 •

edited

Loading

ghost commented Mar 25, 2020 •

edited by ghost

Loading

jpmckinney commented Mar 25, 2020 •

edited

Loading

jpmckinney commented Mar 26, 2020 •

edited

Loading

ghost commented Mar 27, 2020 •

edited by ghost

Loading

jpmckinney commented Mar 27, 2020 •

edited

Loading

ghost commented Apr 20, 2020 •

edited by ghost

Loading