Add zenodo archiving script #177

Merged on Sep 20, 2024 (7 commits)

@e-belfer (Member) commented Sep 17, 2024

Overview

Closes #176.

What problem does this address?
Queries Zenodo stats for each version of each record in the Catalyst Cooperative community.

What did you change in this PR?

  • Add save_zenodo_metrics.py
  • Add this script to the save_daily_metrics.yml GitHub Action
  • Make the archive runs in save_daily_metrics.yml independent of one another's success

Out of scope:

Testing

How did you make sure this worked? How can a reviewer verify this?
See the successful run and corresponding archives in the GCS bucket.

To-do list


@e-belfer added the zenodo label (Relating to Zenodo usage metrics) on Sep 17, 2024
@e-belfer self-assigned this on Sep 17, 2024
@e-belfer requested a review from jdangerx on September 19, 2024 13:22
@e-belfer marked this pull request as ready for review on September 19, 2024 13:23
@e-belfer requested review from a team and removed the request for jdangerx on September 19, 2024 14:51
@bendnorman (Member) left a comment

Looks good! Just a couple of small code cleanup requests and some questions.

Comment on lines 75 to 84
    version_df = pd.DataFrame(
        [
            dict(
                version_records[item].stats.__dict__,
                doi=version_records[item].doi,
                title=version_records[item].title,
            )
            for item in range(len(version_records))
        ]
    )

You could simplify this by using the objects in the iterator instead of indexing:

    version_df = pd.DataFrame(
        [
            dict(
                # pydantic models have a dict() method you can use instead of accessing __dict__ directly
                version_record.stats.dict(),
                doi=version_record.doi,
                title=version_record.title,
            )
            for version_record in version_records
        ]
    )
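
For illustration, here is a minimal sketch of the row-building pattern suggested above: iterate over the objects directly and merge each record's stats with its DOI and title. To keep it runnable with the standard library alone, it uses hypothetical dataclasses (`Stats`, `VersionRecord`) as stand-ins for the PR's pydantic models, and `dataclasses.asdict` in place of the model's own serializer. In pydantic the equivalent call would be `.dict()` (v1) or `.model_dump()` (v2). The field names `downloads` and `views` and all values are placeholders, not data from the PR.

```python
from dataclasses import asdict, dataclass


@dataclass
class Stats:
    """Hypothetical stand-in for the stats model used in the PR."""

    downloads: int
    views: int


@dataclass
class VersionRecord:
    """Hypothetical stand-in for the PR's CommunityMetadata model."""

    doi: str
    title: str
    stats: Stats


def build_rows(version_records):
    """Flatten each record's stats and attach its DOI and title."""
    return [
        dict(
            asdict(version_record.stats),  # stats fields become columns
            doi=version_record.doi,
            title=version_record.title,
        )
        for version_record in version_records  # iterate objects, no indexing
    ]


# Placeholder records; the DOIs and numbers are made up for the example.
records = [
    VersionRecord(doi="10.0000/example.1", title="Example v1", stats=Stats(10, 25)),
    VersionRecord(doi="10.0000/example.2", title="Example v2", stats=Stats(3, 8)),
]
rows = build_rows(records)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`, as in the code under review.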

versions_url = f"https://zenodo.org/api/records/{record.recid}/versions"
record_versions = requests.get(versions_url, timeout=100)
version_records = record_versions.json()["hits"]["hits"]
version_records = [CommunityMetadata(**record) for record in version_records]

I would rename the loop variable to version_record so the name doesn't collide with the variable in the parent for loop.
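
As a side note, a small sketch of what reusing the name does at runtime. In Python 3 a list comprehension's loop variable has its own scope, so reusing `record` inside the comprehension does not overwrite the outer loop's `record`. The rename is therefore mostly about readability. The data below is made up for the example and is not taken from the PR.

```python
# Placeholder data standing in for the dataset records.
dataset_records = [{"recid": 1}, {"recid": 2}]
seen_after_comprehension = []

for record in dataset_records:
    # The comprehension reuses the name `record`, as in the code under review.
    version_ids = [record["recid"] for record in [{"recid": 10}, {"recid": 20}]]
    # The comprehension variable is scoped to the comprehension, so the outer
    # `record` still refers to the current dataset record here.
    seen_after_comprehension.append(record["recid"])
```

A distinct name such as `version_record` still makes it obvious which object each expression refers to.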

community_url = "https://zenodo.org/api/communities/14454015-63f1-4f05-80fd-1a9b07593c9e/records"
community_records = requests.get(community_url, timeout=100)
catalyst_records = community_records.json()["hits"]["hits"]
catalyst_records = [CommunityMetadata(**record) for record in catalyst_records]

How do the catalyst_records differ from the version_records? Are they structurally the same, and we just need to grab the generic records to get all the individual versioned records?

@e-belfer (Member, Author) replied:

Basically this gives us the record information for each Catalyst dataset, and then for each dataset we iterate through and get all the versions and their corresponding metrics. I can rename this to dataset_records to make it more obvious and add some docstrings.
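
The two-level iteration described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the script from the PR: the `fetch_json` parameter, the `collect_version_stats` function, the placeholder community URL, and the fake responses are all hypothetical, and the raw JSON keys (`recid`, `doi`, `title`, `stats`) are assumed to mirror the attribute names used in the code under review. Only the `["hits"]["hits"]` shape and the versions URL pattern come from the snippets above.

```python
from typing import Any, Callable

# URL pattern taken from the code under review.
VERSIONS_URL = "https://zenodo.org/api/records/{recid}/versions"
# Placeholder community URL for the example only.
COMMUNITY_URL = "https://zenodo.org/api/communities/example/records"


def collect_version_stats(
    fetch_json: Callable[[str], dict[str, Any]], community_url: str
) -> list[dict[str, Any]]:
    """Return one row of stats per version of each dataset in a community."""
    rows = []
    # First level: one record per dataset in the community.
    dataset_records = fetch_json(community_url)["hits"]["hits"]
    for dataset_record in dataset_records:
        versions_url = VERSIONS_URL.format(recid=dataset_record["recid"])
        # Second level: every version of that dataset, with its own stats.
        for version_record in fetch_json(versions_url)["hits"]["hits"]:
            rows.append(
                dict(
                    version_record["stats"],
                    doi=version_record["doi"],
                    title=version_record["title"],
                )
            )
    return rows


# Fake responses so the sketch runs offline; all values are made up.
FAKE_RESPONSES = {
    COMMUNITY_URL: {"hits": {"hits": [{"recid": "111"}]}},
    VERSIONS_URL.format(recid="111"): {
        "hits": {
            "hits": [
                {"doi": "10.0000/example.1", "title": "Example v1", "stats": {"downloads": 10}},
                {"doi": "10.0000/example.2", "title": "Example v2", "stats": {"downloads": 3}},
            ]
        }
    },
}
rows = collect_version_stats(FAKE_RESPONSES.__getitem__, COMMUNITY_URL)
```

Against the live API, `fetch_json` could be something like `lambda url: requests.get(url, timeout=100).json()`, matching the calls shown in the review snippets.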

  env:
    KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
    KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
  run: |
    python src/usage_metrics/scripts/save_kaggle_metrics.py

- shell: bash -l {0}

Non-blocking, but we could use the GitHub Actions matrix strategy to run multiple archives in parallel, like we do in the pudl-archiver repo.

@e-belfer (Member, Author) replied:

I'll add this suggestion to #175.

@bendnorman (Member) left a comment

🎉

@e-belfer merged commit 653ae7a into main on Sep 20, 2024
5 checks passed
@e-belfer deleted the zenodo-archiver branch on September 20, 2024 18:45
Linked issue: Write a script to collect stats for all Zenodo archives and save them to a GCS bucket