Add zenodo archiving script #177

e-belfer · 2024-09-17T21:59:30Z

Overview

Closes #176.

What problem does this address?
Queries Zenodo stats for each version of each record in the Catalyst Cooperative community.

What did you change in this PR?

Add save_zenodo_metrics.py
Add this script to the save_daily_metrics.yml GitHub Action
Make the archive runs in save_daily_metrics.yml independent of one another's success

Out of scope:

Further improvements to the general archiving approach (e.g., adding retries and improving the robustness of the queries) are catalogued in Further save-daily-metrics.yml improvements #175.

Testing

How did you make sure this worked? How can a reviewer verify this?
See the successful run and corresponding archives in the GCS bucket.

To-do list

Tasks

Add save_zenodo_metrics.py to save_daily_metrics GHA

bendnorman

Looks good! Just a couple of small code clean up requests and some questions.

bendnorman · 2024-09-19T19:25:43Z

src/usage_metrics/scripts/save_zenodo_metrics.py

+        version_df = pd.DataFrame(
+            [
+                dict(
+                    version_records[item].stats.__dict__,
+                    doi=version_records[item].doi,
+                    title=version_records[item].title,
+                )
+                for item in range(len(version_records))
+            ]
+        )


You could simplify this by using the objects in the iterator instead of indexing:

version_df = pd.DataFrame(
    [
        dict(
            version_record.stats.dict(),  # pydantic models have a dict() method you can use instead of accessing __dict__ directly
            doi=version_record.doi,
            title=version_record.title,
        )
        for version_record in version_records
    ]
)
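
The row-building pattern in this suggestion can be tried in isolation. The sketch below is only an illustration: it uses standard-library dataclasses as stand-ins for the pydantic models, and the class names, the downloads and views fields, and the DOI strings are all made-up placeholders rather than the real Zenodo schema.

```python
from dataclasses import asdict, dataclass


@dataclass
class Stats:
    """Hypothetical stand-in for the stats model; field names are assumptions."""

    downloads: int
    views: int


@dataclass
class VersionRecord:
    """Hypothetical stand-in for one versioned record."""

    doi: str
    title: str
    stats: Stats


# Placeholder values, not real DOIs or real metrics.
version_records = [
    VersionRecord("placeholder-doi-v1", "Example v1", Stats(downloads=5, views=12)),
    VersionRecord("placeholder-doi-v2", "Example v2", Stats(downloads=3, views=7)),
]

# Iterate over the objects directly; dict(mapping, **extra) merges the stats
# fields with the doi and title into one flat row per version.
rows = [
    dict(
        asdict(version_record.stats),
        doi=version_record.doi,
        title=version_record.title,
    )
    for version_record in version_records
]
```

A list of flat dictionaries like rows can be passed straight to pd.DataFrame, which is what the suggested code does with the real records.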

bendnorman · 2024-09-19T19:27:54Z

src/usage_metrics/scripts/save_zenodo_metrics.py

+        versions_url = f"https://zenodo.org/api/records/{record.recid}/versions"
+        record_versions = requests.get(versions_url, timeout=100)
+        version_records = record_versions.json()["hits"]["hits"]
+        version_records = [CommunityMetadata(**record) for record in version_records]


I would rename the loop variable in the list comprehension to version_record so the name doesn't collide with the record variable in the parent for loop.
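
For context, a small sketch of what the name reuse does at runtime. In Python 3 a list comprehension has its own scope, so the outer loop variable is not overwritten; the rename is mainly about readability. The values below are made up for illustration.

```python
records = ["dataset-a", "dataset-b"]  # made-up stand-ins for dataset records
seen = []

for record in records:
    # The comprehension reuses the name "record", but its variable lives in
    # the comprehension's own scope and does not leak into the loop body.
    versions = [record for record in ["v1", "v2"]]
    # "record" still refers to the outer loop's current item here.
    seen.append(record)
```

Even though the code behaves correctly, a reader has to stop and check which record is meant on each line, which a distinct name such as version_record avoids.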

bendnorman · 2024-09-19T19:31:14Z

src/usage_metrics/scripts/save_zenodo_metrics.py

+    community_url = "https://zenodo.org/api/communities/14454015-63f1-4f05-80fd-1a9b07593c9e/records"
+    community_records = requests.get(community_url, timeout=100)
+    catalyst_records = community_records.json()["hits"]["hits"]
+    catalyst_records = [CommunityMetadata(**record) for record in catalyst_records]


How do the catalyst_records differ from the version_records? Are they structurally the same, and we just need to grab the generic records in order to get all the individual versioned records?

Basically this gives us the record information for each Catalyst dataset, and then for each dataset we iterate through and get all the versions and their corresponding metrics. I can rename this to dataset_records to make it more obvious and add some docstrings.
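
The two-level traversal described in this reply can be sketched roughly as follows. This is a simplified illustration, not the script itself: collect_version_rows and fake_get_json are hypothetical names, the HTTP call is replaced by an injected function so the example runs offline, the JSON keys recid, doi, title and stats are assumed to mirror the attributes used in the diff, the canned values are invented, and pagination is ignored.

```python
from typing import Any, Callable

# Community URL copied from the diff above.
COMMUNITY_URL = (
    "https://zenodo.org/api/communities/"
    "14454015-63f1-4f05-80fd-1a9b07593c9e/records"
)


def collect_version_rows(
    get_json: Callable[[str], dict[str, Any]],
) -> list[dict[str, Any]]:
    """Return one flat row of stats per version of each dataset record."""
    rows = []
    # First level: one record per dataset in the community.
    for dataset_record in get_json(COMMUNITY_URL)["hits"]["hits"]:
        versions_url = (
            f"https://zenodo.org/api/records/{dataset_record['recid']}/versions"
        )
        # Second level: every version of that dataset, each with its own stats.
        for version_record in get_json(versions_url)["hits"]["hits"]:
            rows.append(
                dict(
                    version_record["stats"],
                    doi=version_record["doi"],
                    title=version_record["title"],
                )
            )
    return rows


def fake_get_json(url: str) -> dict[str, Any]:
    """Canned responses with made-up values, standing in for the Zenodo API."""
    if url == COMMUNITY_URL:
        return {"hits": {"hits": [{"recid": 1}]}}
    return {
        "hits": {
            "hits": [
                {"doi": "placeholder-doi-v1", "title": "Example v1", "stats": {"downloads": 5}},
                {"doi": "placeholder-doi-v2", "title": "Example v2", "stats": {"downloads": 3}},
            ]
        }
    }


rows = collect_version_rows(fake_get_json)
```

Against the live API, the injected function could be something like lambda url: requests.get(url, timeout=100).json(), matching the calls shown in the diff.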

bendnorman · 2024-09-19T19:34:09Z

.github/workflows/save_daily_metrics.yml

        env:
          KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
          KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
        run: |
          python src/usage_metrics/scripts/save_kaggle_metrics.py

+      - shell: bash -l {0}


Non-blocking, but we could use the GitHub Actions matrix strategy to run multiple archives in parallel, as we do in the pudl-archiver repo.

I'll add this suggestion to #175.

bendnorman

🎉

Add zenodo archiving script

ae08409

e-belfer added the zenodo label (Relating to Zenodo usage metrics) Sep 17, 2024

e-belfer self-assigned this Sep 17, 2024

e-belfer added 5 commits September 18, 2024 14:23

Add Zenodo metrics to daily archiver, make steps independent

7e9bce3

Actually run zenodo metrics

02e63a4

Remove cruft from save_zenodo_metrics

f015bbd

Minor formatting changes

6d016e9

Merge branch 'zenodo-archiver' of https://github.com/catalyst-coopera…

d7988f4

…tive/pudl-usage-metrics into zenodo-archiver

e-belfer requested a review from jdangerx September 19, 2024 13:22

e-belfer marked this pull request as ready for review September 19, 2024 13:23

e-belfer requested review from a team and removed request for jdangerx September 19, 2024 14:51

e-belfer mentioned this pull request Sep 19, 2024

ETL Kaggle and Github metrics #168

Merged

bendnorman requested changes Sep 19, 2024

Update docstrings, avoid variable collision

68ab0d9

e-belfer requested a review from bendnorman September 20, 2024 13:49

e-belfer mentioned this pull request Sep 20, 2024

Write an ETL script to process archived Zenodo records #181

Closed

bendnorman approved these changes Sep 20, 2024

e-belfer merged commit 653ae7a into main Sep 20, 2024
5 checks passed

e-belfer deleted the zenodo-archiver branch September 20, 2024 18:45
