Compile Releases Transform - include records table #63
Comments
I think so – though if there is already a compiled release, we might want to decide on the default behavior (re-compile, or trust the existing compiled release) and offer a flag to change the behavior.
Maybe this shouldn't be a flag. We already have a transform framework, and we could have two different types of compile releases transform, one for each option. Then it's easy for the operator to choose which one they want when they create a new collection with a transform. It would be up to us to decide which one we use in our standard workflow. Having this as two different transforms also opens the door to taking a raw collection that has compiled releases, running it through both transforms, and comparing the results to see whether the publisher's release compilation process has problems. The rest looks OK. Change process/ocdskingfisherprocess/transform/compile_releases.py: change the get_ocids function to also get OCIDs from the record and compiled releases tables. Maybe break process_ocid into sub-functions: one to look for existing compiled releases in several places, the other to look for releases in more places and use them to make a compiled release. Note the existing code and comments about releases tagged compiled. The two different transforms can then call the sub-functions in different orders depending on what their priorities are. It should also be relatively easy to write good tests for this.
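A minimal sketch of the split proposed above, for illustration only. It assumes the data for one OCID has already been loaded into a plain dict; the function names, the dict keys `compiled_releases`, `records` and `releases`, and `naive_merge` are all hypothetical and are not the actual contents of compile_releases.py. `naive_merge` is a placeholder, not the OCDS merge routine.

```python
from typing import Callable, List, Optional


def find_existing_compiled_release(data: dict) -> Optional[dict]:
    """Return the first compiled release found for one OCID, or None."""
    # 1. Rows from the compiled releases table (hypothetical key).
    for compiled in data.get("compiled_releases", []):
        return compiled
    # 2. The compiledRelease field of any record.
    for record in data.get("records", []):
        if record.get("compiledRelease"):
            return record["compiledRelease"]
    # 3. A release that the publisher tagged 'compiled'.
    for release in data.get("releases", []):
        if "compiled" in release.get("tag", []):
            return release
    return None


def compile_from_releases(
    data: dict, merge: Callable[[List[dict]], dict]
) -> Optional[dict]:
    """Gather releases from the releases table and from records, then merge."""
    releases = list(data.get("releases", []))
    for record in data.get("records", []):
        releases.extend(record.get("releases", []))
    # Releases tagged 'compiled' are not treated as inputs to a merge.
    releases = [r for r in releases if "compiled" not in r.get("tag", [])]
    return merge(releases) if releases else None


def prefer_existing(data: dict, merge) -> Optional[dict]:
    """Ordering 1: trust an existing compiled release, else compile."""
    found = find_existing_compiled_release(data)
    return found if found is not None else compile_from_releases(data, merge)


def always_compile(data: dict, merge) -> Optional[dict]:
    """Ordering 2: compile from releases, else fall back to an existing one."""
    compiled = compile_from_releases(data, merge)
    return compiled if compiled is not None else find_existing_compiled_release(data)


def naive_merge(releases: List[dict]) -> dict:
    """Placeholder only: later releases overwrite earlier top-level fields."""
    merged: dict = {}
    for release in sorted(releases, key=lambda r: r.get("date", "")):
        merged.update(release)
    return merged
```

Because both orderings share the same two helpers, each helper can be tested on its own, and comparing the output of the two orderings for the same OCID gives the publisher check described above.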
I don't think we should have a new transform for every option. In the long run, this will lead to a lot of transforms (exponential growth to represent every combination of options…). It's better to use hierarchy (add options to transforms, rather than reify every option as a new transform) to limit complexity for the user. For example, in the Django branch there's an options JSON dict on the collection table for, e.g., #222. To support #222 with just a transform name (each JSON path can be assigned a merge behavior), you'd need some sort of 'dynamic' transform name parsing, at which point you've just implemented a flag/option in a more complicated way. To compare the re-compiled releases to the original compiled releases, we can just compare the The only reason for prioritizing this issue is so that sources that only have records can be processed by Pelican. As such, let's just implement the simple option of compiling based on a record's The Django rewrite is still planned, so let's keep the changes minimal where possible.
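To make the trade-off concrete, here is a small sketch of an options dict with defaults, next to the count of transform names that the one-transform-per-combination approach would need. The option key `use_existing_compiled_release` and both function names are hypothetical; the thread only says that the Django branch stores an options JSON dict on the collection table.

```python
import json
from typing import Optional

# Hypothetical default: matches the existing behaviour of trusting a
# compiled release when one is present.
DEFAULT_OPTIONS = {"use_existing_compiled_release": True}


def load_options(raw_json: Optional[str]) -> dict:
    """Overlay options stored as JSON on top of the defaults."""
    options = dict(DEFAULT_OPTIONS)
    if raw_json:
        options.update(json.loads(raw_json))
    return options


def transform_variants_needed(boolean_option_count: int) -> int:
    """Distinct transforms needed if every combination gets its own name."""
    return 2 ** boolean_option_count
```

With one transform and an options dict, adding a new boolean option adds one key; with named transforms, it doubles the number of variants.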
…ns for behaviour Options for behaviour around what to do with existing compiled releases it finds #63
Have come across a massive can of worms on this one: https://standard.open-contracting.org/latest/en/schema/records_reference/#package-metadata
So, if we pull a release out of a record that only has a "url" field, we need to get that data and fill it in BEFORE we do the merge. But that's a problem: we can't just fetch the data from the "url" at the point we process the transform. We might be working with historical data, or data we loaded locally. We want to work with the data we have, for consistency. So let's look for that URL in the existing data we have in the source collection, but I can see many problems there:
I'm going to put this on hold until I hear from @jpmckinney. Basically, our options are:
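As a rough illustration of the "look for that URL in the data we already have" idea raised above, the sketch below resolves linked releases against a local lookup and reports the ones it cannot find instead of fetching them. Everything here is an assumption for illustration: `local_index` (a dict from URL to a stored release), the function names, and the rule that a linked release is one carrying only `url`, `date` and `tag` keys.

```python
from typing import Dict, List, Tuple

# Assumption: a linked release carries only these keys, while an embedded
# release carries the full release data.
LINKED_RELEASE_KEYS = {"url", "date", "tag"}


def is_linked_release(release: dict) -> bool:
    """True if the entry only points at a release rather than embedding it."""
    return "url" in release and set(release) <= LINKED_RELEASE_KEYS


def resolve_linked_releases(
    record_releases: List[dict], local_index: Dict[str, dict]
) -> Tuple[List[dict], List[str]]:
    """Swap linked releases for stored copies; never fetch over the network."""
    resolved: List[dict] = []
    unresolved: List[str] = []
    for release in record_releases:
        if not is_linked_release(release):
            resolved.append(release)  # embedded release: use as-is
        elif release["url"] in local_index:
            resolved.append(local_index[release["url"]])
        else:
            unresolved.append(release["url"])  # report, do not download
    return resolved, unresolved
```

The second return value lets the caller decide what to do when some URLs cannot be matched locally, which is the part the thread leaves open.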
Let's do a different version of the second option. I also don't know what you mean, specifically, by "just add errors".
Ok. Looking at existing behaviour, the transform will use existing compiled releases if there are any. So when trying to use existing compiled releases:
When in the mode that we always compile, would it be:
Or do we not want to do modes at the moment? We could ignore modes and just stick with the existing behaviour of using existing compiled releases, for a faster completion and deploy?
For clarity: merging compiled releases with records as input should have a separate code path from merging with releases as input, though the two can call common sub-methods to avoid duplicate code. We can leave the logic for releases as input as-is. For records as input, we just need one mode, which is closer to 'always compile'. Amending for clarity:
Just to note: getting the right behavior takes priority over fast completion.
For posterity, the reason the order of preference is different for releases and records is based on earlier comments: #147 (comment). To summarize earlier discussion: When there are releases as input, a release with a tag of 'compiled' is an implementation error by the publisher, which occurs rarely. Ideally, the analyst would be able to choose which logic to follow. The default logic is to take a compiled release if present, for the reasons in the previous issue. We didn't fuss too much about which to have as the default, as this is a rare error by the publisher. For records, we assume that the publisher has correctly included all releases in the record's
This seems clear, and it looks like we don't need modes. Thanks. If an OCID has both a record and releases, which process should we use? (This is very unlikely: when an API offers both records and releases, we tend to write two different spiders, so the data would be in separate collections. But it could happen.) If there is more than one record for an OCID, should we just pick one at random and note that we did that?
If a collection has both releases and records for an OCID, let's prefer the records code path. (We could add a third "mixed" code path that falls back to releases if the best option for records isn't available, but this is such an edge case that I'm fine with a suboptimal path.) As you describe, we try to treat sources of records and releases as independent datasets (two representations of the same reality), so there should never be a case that a collection contains a mix of both (which would be a doubling of, or inconsistent organization of, reality). If a collection has multiple records for an OCID, let's just go with the first one (and make a note) like we do when there are multiple compiled releases for an OCID (no need for true 'random').
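The two rules agreed above (prefer the records code path, and take the first record with a note when there are several) can be sketched as a small selection function. The name `choose_input`, the return shape and the note wording are hypothetical, not taken from the deployed code.

```python
from typing import List, Optional, Tuple


def choose_input(
    records: List[dict], releases: List[dict]
) -> Tuple[str, Optional[object], List[str]]:
    """Pick the code path for one OCID and collect notes about the choice."""
    notes: List[str] = []
    if records:
        if len(records) > 1:
            # No true randomness needed: take the first and say so.
            notes.append(f"{len(records)} records found for this OCID; used the first")
        if releases:
            notes.append("Both records and releases found; used the records code path")
        return "records", records[0], notes
    if releases:
        return "releases", releases, notes
    return "none", None, notes
```

For example, `choose_input([rec_a, rec_b], [])` returns the `"records"` path with `rec_a` and a single note about the extra record.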
#63 Also, in ocdskingfisherprocess/transform/compile_releases.py marked private methods as private with a _ at start
Deployed. I think this can now be closed?
This currently works off the releases table. Does it also need to look for releases in the record table?