Releases: WordPress/openverse-catalog
v1.3.2
- Update CODEOWNERS (#677) @zackkrida
Improvements
- Refactor Wikimedia Commons to use ProviderDataIngester (#614) @stacimc
- Refactor Brooklyn Museum to use ProviderDataIngester (#701) @AetherUnbound
- Refactor Metropolitan Museum of Art to use ProviderDataIngester (#674) @rwidom
- Always record provider run duration (#694) @AetherUnbound
- Allow DAGs to silence only errors matching predicate (#654) @stacimc
- Bump iNaturalist timeouts to 5 days (#691) @AetherUnbound
- Add iNaturalist.org metadata (#549) @rwidom
Internal Improvements
- 🔄 Synced file(s) with WordPress/openverse (#728) @openverse-bot
- Add `DEPLOYMENT.md` & deployment-related files (#711) @AetherUnbound
- Standardize on datetime over pendulum (#678) @AetherUnbound
- Add Openverse email to DAG default args (#683) @AetherUnbound
- Use Python 3.10 everywhere (#656) @sarayourfriend
Bug Fixes
- 🔄 Synced file(s) with WordPress/openverse (#728) @openverse-bot
- Add none check for Cleveland `image_data` (#709) @stacimc
- Remove error swallowing during ingestion (#713) @stacimc
- Allow string as exceptions in `on_failure_callback` (#695) @AetherUnbound
- Always use Jamendo's "streaming" audio (#706) @AetherUnbound
- Fix dagrun conf for provider scripts (#708) @stacimc
- Initialize iNaturalist with dagrun conf (#707) @stacimc
- Hardcode the test ingestion limit to 1,000,000 (#705) @ramadanomar
- Update audioset_view to use most recently updated f_id/provider pair (#660) @AetherUnbound
Credits
Thanks to @AetherUnbound, @openverse-bot, @ramadanomar, @rwidom, @sarayourfriend, @stacimc and @zackkrida for their contributions!
v1.3.1
This release marks the close of the v1.3.0 milestone! 🎉 🎉 🎉
Thank you to everyone involved, including the many community members who contributed to this release 🥳
- Updates Handbook Link (#662) @ramadanomar
- Fix typo in README (#652) @grumpyp
- Unify header added (#610) @Justinjdaniel
New Features
- Add configuration options to skip ingestion errors (#650) @stacimc
- Omit DAGs that are known to fail from alerts (#643) @stacimc
- Add `filetype` to Phylopic script (#547) @obulat
- Add `filetype` test to Metropolitan script (#568) @obulat
- Add audio_set_foreign_identifier to the audio materialized view (#565) @obulat
- Add `filetype` and `filesize` to Cleveland Museum of Art API script (#537) @obulat
- Add `filetype` and `filesize` to SMK script (#542) @obulat
- Add a helper function to extract extension from the media URL (#545) @obulat
- Add DAG to report reported media pending review (#513) @stacimc
Improvements
- Tighten exception handling, always flush buffer (#645) @AetherUnbound
- Automatic DAG documentation generation (#649) @AetherUnbound
- Data refresh record difference reporting (#636) @AetherUnbound
- Use the default provider categories during ingestion (#635) @obulat
- Refactor Museum Victoria to use ProviderDataIngester (#600) @obulat
- Update Finnish Museums to use base class (#579) @stacimc
- Update data refresh DAG to account for manual go-live (#578) @dhruvkb
- Generate TSV filenames in separate step (#620) @AetherUnbound
- Turn on catchup for dated DAGs to allow backfill (#602) @stacimc
- Ignore DS_Store files (#627) @stacimc
- Add date range to ingestion load reports (#613) @AetherUnbound
- Add test to check for import errors for all DAGs in the dags dir (#580) @stacimc
- Refactor StockSnap to use ProviderDataIngester (#601) @rwidom
- Consolidate provider workflows using dynamic DAGs and dataclasses (#540) @stacimc
- Create DAG objects at top level (#551) @stacimc
- Remove thumbnails from images (#526) @obulat
Internal Improvements
- Re-ping if PR is updated and don't ping if 2 approvals exist (#642) @sarayourfriend
- Partition TSVs by date (#632) @AetherUnbound
- Refactor Science museum to use ProviderDataIngester (#576) @obulat
- 🔄 Synced file(s) with WordPress/openverse (#604) @openverse-bot
- 🔄 Synced file(s) with WordPress/openverse (#603) @openverse-bot
- Add missing `MD5` hash to foreign id comparison (#575) @obulat
- Add base class for Provider API scripts (#555) @stacimc
- Post comments using JSON instead of form data (#570) @sarayourfriend
- Add PR review reminder DAG (#553) @sarayourfriend
Bug Fixes
- Upgrade Airflow to v2.3.3 (#664) @AetherUnbound
- Only delete dag runs/task instances during testing that match pattern (#651) @AetherUnbound
- Only drop load table if it exists (#634) @AetherUnbound
- Re-raise pytest-socket errors within DelayedRequester (#629) @AetherUnbound
- Adjust load data timeout and retries (#626) @stacimc
- Patch Stocksnap tests that called out to external API (#628) @AetherUnbound
- Update Openverse URL in the user agent string (#612) @PrajwalBorkar
- 🔄 Synced file(s) with WordPress/openverse (#604) @openverse-bot
- 🔄 Synced file(s) with WordPress/openverse (#603) @openverse-bot
- Fix module import for PR review reminder DAG (#566) @AetherUnbound
- Add flag to strip slash in urls while validating (#556) @thedevhaider
- Correct order of None handling in Cleveland provider script (#544) @stacimc
- Unconditionally destroy buckets after testing (#516) @AetherUnbound
- Simplify WP Photo Directory script and get missing authors (#515) @krysal
Credits
Thanks to @AetherUnbound, @Justinjdaniel, @PrajwalBorkar, @dhruvkb, @grumpyp, @krysal, @obulat, @openverse-bot, @ramadanomar, @rwidom, @sarayourfriend, @stacimc and @thedevhaider for their contributions!
v1.3.0
Improvements
- Generate DAGs to recreate popularity calculations using a factory (#507) @stacimc
- Merge popularity calculations and data refresh into a single DAG (#496) @stacimc
Internal Improvements
- airflow dockerfile: set `PYTHONPATH` to DAGs folder (#514) @tal66
- Upgrade Airflow to 2.3, python to 3.10 (#502) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#509) @openverse-bot
- 🔄 Synced file(s) with WordPress/openverse (#505) @openverse-bot
Bug Fixes
- Ensure SMK images don't timeout on validation (#506) @stacimc
- airflow dockerfile: set `PYTHONPATH` to DAGs folder (#514) @tal66
- Retry flaky request when Smithsonian provider script detects no unit codes (#508) @stacimc
- 🔄 Synced file(s) with WordPress/openverse (#509) @openverse-bot
- Don't delete custom pools during test cleanup (#501) @stacimc
- 🔄 Synced file(s) with WordPress/openverse (#505) @openverse-bot
- Add human readable description for durations under 1 second (#500) @rwidom
Credits
Thanks to @AetherUnbound, @openverse-bot, @rwidom, @stacimc and @tal66 for their contributions!
v1.2.2
Internal Improvements
Bug Fixes
- Recreate the audioset matview after full popularity recalculation (#493) @stacimc
- Enable reporting when there is no data to load (#492) @rwidom
- Wikimedia: Catch bit rates that are greater than the int max (#475) @AetherUnbound
- Fix `alt_files` duplicates (#479) @AetherUnbound
Credits
Thanks to @AetherUnbound, @rwidom and @stacimc for their contributions!
v1.2.1
- Rename Thingiverse.py to thingiverse.py (#472) @zackkrida
Improvements
- Update Smithsonian Unit code checker DAG to alert to Slack (#452) @stacimc
- Show duplicate record count in completion slack message (#442) @AetherUnbound
- Use safe_search param to restrict results from Flickr (#460) @stacimc
- Send single slack notification per provider on TSV load complete (#434) @stacimc
Internal Improvements
- Change docker-compose restart policy for local development (#474) @AetherUnbound
- Re-introduce pytest-socket (#467) @AetherUnbound
- Upgrade black to 22.3.0 (#463) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#462) @openverse-bot
- 🔄 Synced file(s) with WordPress/openverse (#459) @openverse-bot
- Remove `apt upgrade` from PG image, upgrade to 13.6 (#455) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#444) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#441) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#440) @dhruvkb
Bug Fixes
- Improved load reporting (#471) @AetherUnbound
- Adjust timeouts for Data Refresh `wait_for_completion` step (#458) @stacimc
- Upgrade black to 22.3.0 (#463) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#462) @openverse-bot
- 🔄 Synced file(s) with WordPress/openverse (#459) @openverse-bot
- Remove `apt upgrade` from PG image, upgrade to 13.6 (#455) @AetherUnbound
- Handle case where Wikimedia has no audio metadata (#443) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#444) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#441) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#440) @dhruvkb
Credits
Thanks to @AetherUnbound, @dhruvkb, @openverse-bot, @stacimc and @zackkrida for their contributions!
v1.2.0 - Data refresh scheduling, deduplication, and provider fixes
Improvements
- Add data refresh to Airflow (#397) @stacimc
- Change PhyloPic date range & schedule interval (#423) @AetherUnbound
- Round duration for provider ingestion completion message (#422) @AetherUnbound
- Enable XCom pickling in Airflow (#421) @stacimc
- Handle duplicate keys in load_data task (#395) @stacimc
- Make 'sound' category more specific (#402) @AetherUnbound
- Group test runs by module or class (#409) @stacimc
- Report the environment in TSV loader Slack notifications (#382) @stacimc
Internal Improvements
- Add LRU cache to `is_valid_license_info` (#424) @AetherUnbound
- Use published Docker image in primary docker-compose.yml (#417) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#404) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#403) @dhruvkb
Bug Fixes
- Fix invalid license urls from Finnish Museum API (#418) @stacimc
- Reduce noise in NYPL ingestion (#415) @AetherUnbound
- Add ConnectionError to acceptable flaky exceptions for Freesound (#413) @AetherUnbound
- Fix schedule intervals on Cleveland Museum & Wikimedia Commons (#416) @AetherUnbound
- Update API requests for Museum Victoria DAG (#414) @stacimc
- Add OFEO-SG subprovider (#412) @stacimc
- 🔄 Synced file(s) with WordPress/openverse (#404) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#403) @dhruvkb
Credits
Thanks to @AetherUnbound, @dhruvkb and @stacimc for their contributions!
v1.1.0
New Features
- Add slack message on TSV load complete (#369) @stacimc
- Add provider media type to DAG tags (#360) @AetherUnbound
Improvements
- Trigger TSV loading immediately after workflow (#357) @AetherUnbound
- Differentiate between slack channels (#359) @AetherUnbound
Internal Improvements
- Use Airflow Variables for storing API keys (#362) @AetherUnbound
- Use pytest-xdist for testing (#337) @AetherUnbound
Bug Fixes
- Updated user agent for Wikimedia Commons #140 (#355) @yavik-kapadia
- Remove buckets after testing (#344) @AetherUnbound
- Ensure Freesound tests are isolated (#340) @AetherUnbound
- Change minio ports from 500X to 501X (#341) @AetherUnbound
Credits
Thanks to @AetherUnbound, @stacimc and @yavik-kapadia for their contributions!
v1.0.0 - Openverse's First Release
Under the purview of Openverse
Now that Openverse is fully up and running, we're releasing version 1.0.0 of the catalog! There is plenty of work and numerous improvements yet to come - stay on the lookout for those exciting changes 🚀
General additions
- Update acknowledgements section (#172) @zackkrida
- Use dag_factory for Provider API DAG creation (#163) @obulat
- Add black and isort and apply to all files (#159) @sarayourfriend
- Add justfile with scripts and update README (#153) @sarayourfriend
- Add pre-commit (#157) @sarayourfriend
- Add manually run healthcheck DAG (#151) @sarayourfriend
- Create S3 bucket to emulate remote logging locally (#156) @krysal
- Switch local postgres to use volumes (#154) @sarayourfriend
- Replace os.path with pathlib in provider API script template (#149) @obulat
- Update Apache Airflow version (#148) @obulat
- Fix resource path string in provider template (#147) @sarayourfriend
- Log cleanup DAG (#139) @zackkrida
- Simplify catalog folder structure (#133) @obulat
- Allow running the catalog and the API at the same time (#145) @sarayourfriend
- [API integration] Add StockSnap (#114) @krysal
- Bump urllib3 from 1.25.11 to 1.26.5 in /src/cc_catalog_airflow (#86) @dependabot
- Bump flask-appbuilder from 3.2.3 to 3.3.0 in /src/cc_catalog_airflow (#75) @dependabot
- Bump lxml from 4.4.2 to 4.6.3 in /src/cc_catalog_airflow (#70) @dependabot
New Features
- Slack alerting for DAG failures (#297) @AetherUnbound
- Add Provider API script for Freesound (#95) @obulat
- Slack alerting utilities (#279) @AetherUnbound
- Add DAG tags, remove health check workflow (#277) @AetherUnbound
- Add production deployment documentation (#271) @AetherUnbound
- OAuth2 DAGs and Machinery (#246) @AetherUnbound
- Add sample WordPress REST API script (#223) @krysal
- [Audio] Add Wikimedia as Audio source (#197) @obulat
- Add new columns to MediaStore and database (#196) @obulat
- Add recreate recipe (#179) @sarayourfriend
- [API integration] Add Jamendo provider API script (#113) @obulat
- Add a PR template to the repository (#131) @dhruvkb
- Add a script to create provider API script template (#128) @obulat
- Create a Provider API script template (#93) @obulat
- Add support for other media types to popularity calculations (#124) @obulat
- Add Audio to the database (#111) @obulat
- Add AudioStorage entity (#85) @obulat
Improvements
- Change request info log to debug to prevent spam (#312) @AetherUnbound
- Upgrade to Airflow 2.2.3 (#308) @AetherUnbound
- Add unique indices to catalog (#306) @AetherUnbound
- Add Image Categories (#302) @krysal
- Remove `get_*_operator` functions, simplify commoncrawl logic (#301) @AetherUnbound
- Remove unnecessary logging.basicConfig calls (#299) @AetherUnbound
- Specific error message for auth errors on request, improve tests (#295) @AetherUnbound
- Retire common_api_workflows, clean up config (#296) @AetherUnbound
- Reduce TSV loader complexity (#289) @AetherUnbound
- Retire legacy ingestion column fix (#287) @AetherUnbound
- Retire cleaner_workflow, pg_cleaner (#288) @AetherUnbound
- Remove tsv_to_postgres_loader_overwrite (#286) @AetherUnbound
- Repository restructure (#276) @AetherUnbound
- Retire update workflows, refactor operators (#266) @AetherUnbound
- Add docker entrypoint to ensure db migration on startup (#270) @AetherUnbound
- Improve `.env` documentation & structure, update values (#251) @AetherUnbound
- Remove prefixes from issue template titles (#250) @AetherUnbound
- Remove `get_log_operator` usage (#238) @zackkrida
- Use new issue forms feature for source and provider issue templates (#230) @lyu4321
- Docker optimization & repository restructuring (#226) @AetherUnbound
- Implement stocksnap popularity and popularity documentation (#221) @zackkrida
- Move dev-specific services into compose overrides file (#217) @AetherUnbound
- Update README.md with Airflow Credentials (#194) @zackkrida
- Modify audio columns (#130) @krysal
- Add missing `watermarked` column to audio loading table (#125) @obulat
- Ingest wikimedia images marked with CC0 and PDM (#119) @obulat
- Set default output dir for commoncrawl (#118) @obulat
- Extract media type from staged tsv file name for loader (#110) @obulat
- Extract MediaStorage entity as parent to ImageStore (#83) @obulat
- Add ingestion column to MediaStore when using provider API (#72) @obulat
- Remove mutable parameters in provider api scripts (#100) @obulat
Internal Improvements
- Set up CI/CD with ghcr.io (#332) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#317) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#314) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#294) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#293) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#274) @dhruvkb
- Add docker entrypoint to ensure db migration on startup (#270) @AetherUnbound
- Replace moto server with Minio (#254) @AetherUnbound
- Add pip upgrade command, docker optimizations (#265) @AetherUnbound
- Add `justfile` deployment recipe (#267) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#269) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#268) @dhruvkb
- Add args option to db-shell recipe (#259) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#258) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#256) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#255) @dhruvkb
- Add pgcli to postgres container, db-shell recipe (#252) @AetherUnbound
- Improve `.env` documentation & structure, update values (#251) @AetherUnbound
- Remove prefixes from issue template titles (#250) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#249) @dhruvkb
- Move dev-specific services into compose overrides file (#217) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#206) @dhruvkb
- Move storage module up and deduplicate MediaStore tests (#192) @obulat
- Issue templates (#195) @obulat
- 🔄 Synced file(s) with WordPress/openverse (#190) @dhruvkb
- Streamline dag lists in README.md and add StockSnap (#187) @zackkrida
- 🔄 Synced file(s) with WordPress/openverse (#185) @dhruvkb
- Add recreate recipe (#179) @sarayourfriend
- 🔄 Synced file(s) with WordPress/openverse (#180) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#174) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#173) @dhruvkb
- Add example vars for airflow remote logging (#136) @zackkrida
- Delete docs folder (#135) @krysal
- 🔄 Synced file(s) with WordPress/openverse (#134) @dhruvkb
- Update issue templates (#116) @dhruvkb
- Run release drafter action on push to main branch (#104) @obulat
- Run ci on main push only (#98) @obulat
- Re-add the lint and test workflows (#76) @obulat
- Create a CODEOWNERS file (#80) @dhruvkb
- Update README.md (#68) @obulat
- Move the common package up a level to simplify imports and testing (#78) @obulat
- Add configuration and workflow for Release Drafter (#73) @dhruvkb
- Update Airflow to version 2 (#63) @obulat
- readme updates (#62) @zackkrida
- Update to postgres 13 and apache-airflow 1.10.15 (#54) @obulat
- Add documentation on generating a Flickr API token (#56) @zackkrida
- Initial Migration (#1) @zackkrida
Bug Fixes
- Fix inconsistent alignment in slack message text (#328) @AetherUnbound
- Properly handle "None" values returned from Freesound API (#327) @AetherUnbound
- Add audioset_view to catalog DDL (#320) @AetherUnbound
- Set default timeout to 12 hours (#311) @AetherUnbound
- Make commoncrawl bucket configurable, change default (#318) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#317) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#314) @dhruvkb
- Extend Jamendo's timeout to 24 hours (#310) @AetherUnbound
- Disable TSV loader scheduling (#309) @AetherUnbound
- Bump lxml from 4.6.3 to 4.6.5 (#303) @dependabot
- Refactor delay tests to prevent them from being flaky (#298) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#294) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#293) @dhruvkb
- Add index creation for matviews (#280) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#274) @dhruvkb
- Add pip upgrade command, docker optimizations (#265) @AetherUnbound
- 🔄 Synced file(s) with WordPress/openverse (#269) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#268) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#258) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#256) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#255) @dhruvkb
- Edit wikimedia_audio name in popularity sql (#253) @obulat
- 🔄 Synced file(s) with WordPress/openverse (#249) @dhruvkb
- Make Category a StringColumn (not an ArrayColumn) (#243) @obulat
- Fix type in contributing.md (#245) @MuhammadFaizanHaidar
- Update provider template, refactor DAG parsing tests (#237) @AetherUnbound
- Remove `trackid` query parameter from set thumbnail url (#239) @obulat
- Update test to use dag context (#240) @obulat
- 🔄 Synced file(s) with WordPress/openverse (#206) @dhruvkb
- Organize & document `justfile`, fix issue with recreate command (#198) @AetherUnbound
- Issue templates (#195) @obulat
- 🔄 Synced file(s) with WordPress/openverse (#190) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#185) @dhruvkb
- Renamed the source suggestion issue template (#184) @MuhammadFaizanHaidar
- 🔄 Synced file(s) with WordPress/openverse (#180) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#174) @dhruvkb
- 🔄 Synced file(s) with WordPress/openverse (#173) @dhruvkb
- Replace `genre` property with `genres` in tests (#137) @obulat
- 🔄 Synced file(s) with WordPress/openverse (#134) @dhruvkb
- Make wikimedia script pass license_info, not license_url (#129) @obulat
- Delete duplicated CommonCrawl providers (#126) @krysal
- [Qual...