
Move datasets to delete first in line #14

Merged
merged 1 commit into from
Oct 25, 2021

Conversation

jbrown-xentity

We have reports, now confirmed, of datasets that get re-harvested with an extra `1` in the URL.
The harvester does its best to diagnose whether a dataset is new or not, but it still fails in some circumstances.
This probably won't fix the bug, but it will mitigate it. By running the dataset removals first, if the spatial harvester is essentially doing a "delete and add" when it should be replacing, the name of the new dataset won't collide with the one that is marked for deletion but still in the system.

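The reordering the description talks about can be sketched as follows. This is an illustrative sketch only, not the actual ckanext-spatial code; the dict structure and the `"delete"` status value are assumptions.

```python
# Sketch of the fix: run deletions before creates/updates so a replacement
# dataset can reuse the name freed by the deleted one. The harvest-object
# shape here is hypothetical, not the real ckanext-spatial model.

def order_harvest_objects(harvest_objects):
    """Return harvest objects with deletions first, preserving order otherwise."""
    deletions = [o for o in harvest_objects if o.get("status") == "delete"]
    others = [o for o in harvest_objects if o.get("status") != "delete"]
    return deletions + others
```

With deletions moved to the front of the queue, the old dataset's name is already free by the time the "new" dataset is created.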
@jbrown-xentity jbrown-xentity requested a review from a team October 22, 2021 19:54
@codecov-commenter

codecov-commenter commented Oct 22, 2021

Codecov Report

Merging #14 (77a8b0f) into datagov (9354d52) will not change coverage.
The diff coverage is 0.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           datagov      #14   +/-   ##
========================================
  Coverage    42.16%   42.16%           
========================================
  Files           46       46           
  Lines         3166     3166           
========================================
  Hits          1335     1335           
  Misses        1831     1831           
Impacted Files                       Coverage Δ
ckanext/spatial/harvesters/waf.py    17.51% <0.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9354d52...77a8b0f. Read the comment docs.

@nickumia-reisys

Do we want to write a test to make sure that the dataset URL doesn't have a `1` in it after this sequence of operations? Or do we suspect this is a short-term fix, with upstream fixing the real problem of replacing a dataset?

@jbrown-xentity
Author

Do we want to write a test to make sure that the dataset URL doesn't have a `1` in it after this sequence of operations? Or do we suspect this is a short-term fix, with upstream fixing the real problem of replacing a dataset?

The issue is with the provided data itself, especially CSDGM/FGDC or ISO records without a unique identifier. These WAF harvest sources can't assume the unique identifier is there, because it's not required, so the harvester uses a combination of attributes to test whether it has "seen" a harvest object before. If the URL of the source/WAF changes, or if the title changes, the harvester can't track the record: it assumes it's a "new" dataset, and the "old" dataset, no longer matched, gets removed. The problem is the order; since removal currently happens last, the "new" dataset has a name collision with the old one, and downstream users of data.gov see a URL change. If we fully remove the old dataset before adding the new one, the downtime is minimal and the URL should stay the same.
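The collision mechanism can be illustrated with a CKAN-style unique-name routine. This is a sketch under the assumption that CKAN appends a numeric suffix when a dataset name is already taken; `munge_unique_name` is a hypothetical helper, not CKAN's actual function.

```python
def munge_unique_name(base_name, existing_names):
    """Sketch of CKAN-style name deduplication: append a counter when the
    name is taken. Because the "old" dataset is still present when the
    "new" one is created, the new one ends up as e.g. 'roads1'."""
    if base_name not in existing_names:
        return base_name
    counter = 1
    while base_name + str(counter) in existing_names:
        counter += 1
    return base_name + str(counter)
```

Running deletions first means `base_name` is free again, so the replacement keeps the original name and URL.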
Upstream removed all harvest tests, and replicating those would be complex (to say the least). We'll validate that the harvesters still work in catalog.data.gov, but replicating this issue in a test would require writing custom timing code to edit the nginx harvest endpoint after the first harvest and then validate the result: a lot of work for a small issue.

@jbrown-xentity jbrown-xentity merged commit 3828c6e into datagov Oct 25, 2021
@jbrown-xentity jbrown-xentity deleted the bugfix/dataset-renaming branch October 25, 2021 16:55
@nickumia-reisys

@FuhuXia @jbrown-xentity Did this fix the name changing issue?

@jbrown-xentity
Author

Unknown. We could do an analysis to see how many of these exist out there, using the API (checking whether the last character of a dataset name is a `1`, and scanning for how many entries there are, when they were created, etc.). Or we could test this manually by harvesting a CSDGM source locally, changing the file name, and then re-harvesting to see what happens.
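A rough version of that API analysis might look like the following. The filtering is local and deliberately crude; fetching the names from a live catalog is left as a comment, and the endpoint URL is an assumption.

```python
def find_suspect_names(names):
    """Return dataset names ending in '1', the signature of the
    collision-renaming bug. Crude heuristic: legitimate names that
    happen to end in 1 (e.g. 'census-2021') are false positives."""
    return [n for n in names if n.endswith("1")]

# In practice the names would come from the catalog API, e.g.:
#   GET https://catalog.data.gov/api/3/action/package_list
# and each hit would then be checked for creation date, a sibling
# dataset without the suffix, etc.
```

This would only size the problem; confirming each hit still requires looking at the harvest source.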

@jbrown-xentity
Author

See upstream PR and comments here: ckan#261
