The OSDG Community Dataset (OSDG-CD) is the direct result of the work of hundreds of volunteers who have contributed to our understanding of Sustainable Development Goals (SDGs) via the OSDG Community Platform (OSDG-CP). It contains thousands of text excerpts which were labelled by the community volunteers with respect to SDGs. The data can be used to derive insights into the nature of SDGs using either ontology-based or machine learning approaches. The OSDG Community Dataset will be updated on a quarterly basis.
Please note that all versions of the dataset are hosted on Zenodo. This repository is only intended to provide examples of how the dataset can be used in practice. You can access different versions of the dataset using DOI handles above. The Most Recent handle always resolves to the latest version.
Version | DOI Handle |
---|---|
Most Recent | |
Version 2023.10 | |
Version 2023.07 | |
Version 2023.04 | |
Version 2023.01 | |
Version 2022.10 | |
Version 2022.07 | |
Version 2022.04 | |
Version 2022.01 | |
Version 2021.09 |
The OSDG Community Platform is an ambitious attempt to bring together volunteers and subject matter experts from all around the world to create a large and accurate source of textual information on SDGs. It uses publicly available texts such as publications, reports and other written data sources. Each text is broken down into smaller pieces of paragraph length. These smaller pieces are then being labelled by the Community volunteers. Since the texts we collect have suggested labels associated with them – these usually come from the data source and do not necessarily reflect the content of a particular paragraph – each volunteer is presented with a single simple question that asks if the suggested label is indeed relevant for the short text at hand. Texts are labelled by multiple volunteers to ensure a high degree of quality.
The OSDG-CD dataset is provided in a .csv
format on Zenodo. It is a flat tabular dataset that contains the following columns:
-
doi
- Digital Object Identifier of the original document; -
text_id
- unique text identifier; -
text
- text excerpt from the document; -
sdg
- the SDG the text is validated against; -
labels_negative
- the number of volunteers who rejected the suggested SDG label; -
labels_positive
- the number of volunteers who accepted the suggested SDG label; -
agreement
- agreement score based on the formula$\text{agreement} = \frac{|labels_{positive} - labels_{negative}|}{labels_{positive} + labels_{negative}}$ ;
Pukelis, L., Bautista-Puig, N., Statulevičiūtė, G., Stančiauskas, V., Dikmener, G., & Akylbekova, D. (2022, November 21). OSDG 2.0: A multilingual tool for classifying text data by UN Sustainable Development Goals (SDGs). arXiv.org. https://doi.org/10.48550/arXiv.2211.11252
Pukelis, L., Puig, N. B., Skrynik, M., & Stanciauskas, V. (2020, May 29). OSDG -- Open-source approach to classify text data by UN Sustainable Development Goals (sdgs). arXiv.org. https://arxiv.org/abs/2005.14569
Examples of text classification using OSDG-CD can be found under the examples
directory:
osdg-cd-example-classifier-sklearn.ipynb
(open in nbviewer)
The OSDG Community Dataset (OSDG-CD) is made available for research purposes. We are making the data open with the hope to enable researchers to discover new insights into and meaningful connections among Sustainable Development Goals.
We would like to know what you discover in the data. So do not hesitate to share with us your outputs, be it a research paper, a machine learning model, a blog post, or just an interesting observation. Send us an email at [email protected].
If you are using the dataset in a research paper, please cite the original version as follows:
OSDG, UNDP IICPSD SDG AI Lab, & PPMI. (2021). OSDG Community Dataset (OSDG-CD) (2021.09) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5550238.
To cite a specific version, use the template provided on Zenodo.
This dataset is made possible because of a large community effort. We would be glad to see your contribution to the project too. You can join our Community Platform to help us collect more labelled data. If you have a more technical background, you can also contribute to the OSDG Labelling Tool here. If you want to contribute to the project in some other way, do let us know via this contact form.
To learn more about the OSDG project, visit osdg.ai.