
Move datasets to delete first in line #14

Merged
merged 1 commit into from
Oct 25, 2021

Conversation

jbrown-xentity

We have reports, now confirmed, of datasets that get re-harvested with an extra `1` in the URL.
The harvester does its best to diagnose whether a dataset is new or not, but it still fails in some circumstances.
This probably won't fix the bug, but it will mitigate it. By running the dataset removals first, if the spatial harvester is essentially doing a "delete and add" when it should be replacing, the name of the new dataset won't collide with the one that is marked for deletion but still in the system.

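The reordering the description talks about can be sketched as follows. This is an illustrative sketch only, not the actual ckanext-spatial code; the dict structure and the `"delete"` status value are assumptions.

```python
# Sketch of the fix: run deletions before creates/updates so a replacement
# dataset can reuse the name freed by the deleted one. The harvest-object
# shape here is hypothetical, not the real ckanext-spatial model.

def order_harvest_objects(harvest_objects):
    """Return harvest objects with deletions first, preserving order otherwise."""
    deletions = [o for o in harvest_objects if o.get("status") == "delete"]
    others = [o for o in harvest_objects if o.get("status") != "delete"]
    return deletions + others
```

With deletions moved to the front of the queue, the old dataset's name is already free by the time the "new" dataset is created.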
@jbrown-xentity jbrown-xentity requested a review from a team October 22, 2021 19:54
@codecov-commenter

codecov-commenter commented Oct 22, 2021

Codecov Report

Merging #14 (77a8b0f) into datagov (9354d52) will not change coverage.
The diff coverage is 0.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           datagov      #14   +/-   ##
========================================
  Coverage    42.16%   42.16%           
========================================
  Files           46       46           
  Lines         3166     3166           
========================================
  Hits          1335     1335           
  Misses        1831     1831           
Impacted Files                       Coverage Δ
ckanext/spatial/harvesters/waf.py    17.51% <0.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9354d52...77a8b0f. Read the comment docs.

@nickumia-reisys

Do we want to write a test to make sure that the dataset URL doesn't have a `1` in it after this sequence of operations? Or do we suspect this is a short-term fix, with upstream fixing the real problem of replacing a dataset?

@jbrown-xentity
Author

Do we want to write a test to make sure that the dataset URL doesn't have a `1` in it after this sequence of operations? Or do we suspect this is a short-term fix, with upstream fixing the real problem of replacing a dataset?

The issue is with the provided data itself, especially CSDGM/FGDC or ISO records without a unique identifier. These WAF harvest sources can't assume the unique identifier is there, because it's not required, so the harvester uses a combination of attributes to test whether it has "seen" a harvest object before. If the URL of the source/WAF changes, or if the title changes, the harvester can't track the record: it assumes it's a "new" dataset, and the "old" dataset, no longer matched, gets removed. The problem is the order; since removal currently happens last, the "new" dataset has a name collision with the old one, and downstream users of data.gov see a URL change. If we fully remove the old dataset before adding the new one, the downtime is minimal and the URL should stay the same.
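The collision mechanism can be illustrated with a CKAN-style unique-name routine. This is a sketch under the assumption that CKAN appends a numeric suffix when a dataset name is already taken; `munge_unique_name` is a hypothetical helper, not CKAN's actual function.

```python
def munge_unique_name(base_name, existing_names):
    """Sketch of CKAN-style name deduplication: append a counter when the
    name is taken. Because the "old" dataset is still present when the
    "new" one is created, the new one ends up as e.g. 'roads1'."""
    if base_name not in existing_names:
        return base_name
    counter = 1
    while base_name + str(counter) in existing_names:
        counter += 1
    return base_name + str(counter)
```

Running deletions first means `base_name` is free again, so the replacement keeps the original name and URL.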
Upstream removed all harvest tests, and replicating those would be complex (to say the least). We'll validate that the harvesters still work in catalog.data.gov, but replicating this issue in a test would require writing custom timing code to edit the nginx harvest endpoint after the first harvest and then validate the result: a lot of work for a small issue.

@jbrown-xentity jbrown-xentity merged commit 3828c6e into datagov Oct 25, 2021
@jbrown-xentity jbrown-xentity deleted the bugfix/dataset-renaming branch October 25, 2021 16:55
@nickumia-reisys

@FuhuXia @jbrown-xentity Did this fix the name changing issue?

@jbrown-xentity
Author

Unknown. We could do an analysis to see how many of these exist out there, using the API (checking whether the last character of a dataset name is a `1`, and scanning for how many entries there are, when they were created, etc.). Or we could test this manually by harvesting a CSDGM source locally, changing the file name, and then re-harvesting to see what happens.
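A rough version of that API analysis might look like the following. The filtering is local and deliberately crude; fetching the names from a live catalog is left as a comment, and the endpoint URL is an assumption.

```python
def find_suspect_names(names):
    """Return dataset names ending in '1', the signature of the
    collision-renaming bug. Crude heuristic: legitimate names that
    happen to end in 1 (e.g. 'census-2021') are false positives."""
    return [n for n in names if n.endswith("1")]

# In practice the names would come from the catalog API, e.g.:
#   GET https://catalog.data.gov/api/3/action/package_list
# and each hit would then be checked for creation date, a sibling
# dataset without the suffix, etc.
```

This would only size the problem; confirming each hit still requires looking at the harvest source.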

@jbrown-xentity
Author

See upstream PR and comments here: ckan#261
