Add zenodo archiving script #177
Looks good! Just a couple of small code clean up requests and some questions.
version_df = pd.DataFrame(
    [
        dict(
            version_records[item].stats.__dict__,
            doi=version_records[item].doi,
            title=version_records[item].title,
        )
        for item in range(len(version_records))
    ]
)
You could simplify this by using the objects in the iterator instead of indexing:
version_df = pd.DataFrame(
    [
        dict(
            version_record.stats.dict(),  # pydantic models have a dict() method you can use instead of accessing __dict__ directly
            doi=version_record.doi,
            title=version_record.title,
        )
        for version_record in version_records
    ]
)
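The suggestion above can be sketched without pydantic or pandas installed. In this minimal sketch the `Stats` and `VersionRecord` dataclasses are hypothetical stand-ins for the PR's pydantic models, `dataclasses.asdict` plays the role of the model's `dict()` method, and the resulting list of row dicts is what would be handed to `pd.DataFrame`. The field names inside `Stats` and the DOIs are placeholders, not values from the script.

```python
from dataclasses import asdict, dataclass


@dataclass
class Stats:
    # Hypothetical metric fields; the real stats model may differ.
    downloads: int
    views: int


@dataclass
class VersionRecord:
    # Hypothetical stand-in for the PR's record model.
    doi: str
    title: str
    stats: Stats


def build_rows(version_records):
    """Flatten each record's stats into one row dict, adding doi and title."""
    return [
        dict(
            asdict(version_record.stats),
            doi=version_record.doi,
            title=version_record.title,
        )
        # Iterate over the objects directly instead of indexing by position.
        for version_record in version_records
    ]


version_records = [
    VersionRecord(doi="10.0000/example.1", title="Example v1", stats=Stats(downloads=3, views=10)),
    VersionRecord(doi="10.0000/example.2", title="Example v2", stats=Stats(downloads=5, views=12)),
]
rows = build_rows(version_records)
```

If the project is on pydantic v2, `model_dump()` is the non-deprecated equivalent of `dict()`.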
versions_url = f"https://zenodo.org/api/records/{record.recid}/versions"
record_versions = requests.get(versions_url, timeout=100)
version_records = record_versions.json()["hits"]["hits"]
version_records = [CommunityMetadata(**record) for record in version_records]
I would rename the item to version_record so the name/var doesn't collide with the var in the parent for loop.
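To make the naming concern concrete: in Python 3 a list comprehension has its own scope, so reusing `record` inside one does not overwrite the outer loop variable, whereas a nested `for` statement with the same name would. A minimal sketch with made-up data (the values below are illustrative, not taken from the script):

```python
outer_seen = []
for record in ["dataset-a", "dataset-b"]:
    # The comprehension variable shadows `record` only inside the brackets.
    versions = [record.upper() for record in ["v1", "v2"]]
    outer_seen.append(record)  # still the outer loop's value

clobbered = []
for record in ["dataset-a", "dataset-b"]:
    for record in ["v1", "v2"]:  # a nested for statement rebinds the outer name
        pass
    clobbered.append(record)  # now holds the last inner value, not the dataset
```

So for the comprehension shown, the rename is mostly about readability rather than a change in behavior; it would start to matter if the comprehension were ever unrolled into a regular loop.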
community_url = "https://zenodo.org/api/communities/14454015-63f1-4f05-80fd-1a9b07593c9e/records"
community_records = requests.get(community_url, timeout=100)
catalyst_records = community_records.json()["hits"]["hits"]
catalyst_records = [CommunityMetadata(**record) for record in catalyst_records]
How do the catalyst_records differ from the version_records? Are they structurally the same, and we just need to grab the generic records to get all the individual versioned records?
Basically, this gives us the record information for each Catalyst dataset; then, for each dataset, we iterate through and get all the versions and their corresponding metrics. I can rename this to dataset_records to make it more obvious and add some docstrings.
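The two-level traversal described in this reply might look roughly like the sketch below. It is not the script's implementation: `collect_version_records` and `fake_fetch_json` are hypothetical helpers, the fetcher is injected so the example runs offline, and the raw JSON keys `recid`, `title`, and `doi` are assumptions about the response shape. Only the `["hits"]["hits"]` nesting and the versions URL pattern come from the code under review.

```python
def collect_version_records(fetch_json, community_url):
    """Return (dataset title, version DOI) pairs for every version of every dataset."""
    pairs = []
    dataset_records = fetch_json(community_url)["hits"]["hits"]
    for dataset_record in dataset_records:
        # The "recid" key is an assumption about the raw JSON payload.
        versions_url = f"https://zenodo.org/api/records/{dataset_record['recid']}/versions"
        for version_record in fetch_json(versions_url)["hits"]["hits"]:
            pairs.append((dataset_record["title"], version_record["doi"]))
    return pairs


def fake_fetch_json(url):
    """Offline stand-in for requests.get(url, timeout=100).json()."""
    responses = {
        "https://example.org/community/records": {
            "hits": {"hits": [{"recid": "1", "title": "Dataset A"}]}
        },
        "https://zenodo.org/api/records/1/versions": {
            "hits": {"hits": [{"doi": "10.0000/a.1"}, {"doi": "10.0000/a.2"}]}
        },
    }
    return responses[url]


pairs = collect_version_records(fake_fetch_json, "https://example.org/community/records")
```

Using distinct names for the outer and inner loop variables (`dataset_record`, `version_record`) keeps the two levels easy to tell apart.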
  env:
    KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}
    KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
  run: |
    python src/usage_metrics/scripts/save_kaggle_metrics.py

- shell: bash -l {0}
Non-blocking, but we could use the GitHub Actions matrix strategy to run multiple archives in parallel, like we do in the pudl-archiver repo.
I'll add this suggestion to #175.
🎉
Overview
Closes #176.
What problem does this address?
Queries Zenodo stats for each version of each record in the Catalyst Cooperative community.
What did you change in this PR?
save_zenodo_metrics.py
save_daily_metrics.yml github action
save_daily_metrics.yml independent of one another's success
Out of scope:
save-daily-metrics.yml improvements #175.
Testing
How did you make sure this worked? How can a reviewer verify this?
See the successful run and corresponding archives in the GCS bucket.