Skip to content

Commit

Permalink
Update
Browse files Browse the repository at this point in the history
  • Loading branch information
alanramponi committed Sep 1, 2023
1 parent c5a98ba commit 07f803e
Show file tree
Hide file tree
Showing 4 changed files with 15,070 additions and 2 deletions.
33 changes: 31 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,40 @@
# DiatopIt: A Corpus of Social Media Posts for the Study of Diatopic Language Variation in Italy

This repository contains code and resources associated to the paper:
This repository contains materials associated to the paper:

Alan Ramponi and Camilla Casula. 2023. **DiatopIt: A Corpus of Social Media Posts for the Study of Diatopic Language Variation in Italy**. In *Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)*, pages 187–199, Dubrovnik, Croatia. Association for Computational Linguistics. [[cite]](#citation) [[paper]](https://aclanthology.org/2023.vardial-1.19/)


:construction: The dataset is currently part of the [GeoLingIt shared task](https://sites.google.com/view/geolingit) at [EVALITA 2023](https://www.evalita.it/campaigns/evalita-2023/). The material for DiatopIt will thus be made available at the end of the shared task: stay tuned! :construction:
The dataset has recently been used in the context of the [GeoLingIt shared task](https://github.com/dhfbk/geolingit-evalita2023) at [EVALITA 2023](https://www.evalita.it/campaigns/evalita-2023/).


## Data

The corpus is released in the form of tweet IDs to be hydrated, in compliance to the Twitter developer policy. Data is in the `data/` folder and includes three files:

- `train.txt`: the training set, totalling 13,669 posts;
- `dev.txt`: the development set, totalling 552 posts;
- `test.txt`: the test set, totalling 818 posts.

The text of the post can be rehydrated from the [Tweet object](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet) of the Twitter json response. Just iterate over the posts (i.e., `for post in json_object["data"]`) and get the `post["text"]` content.

The location labels can be rehydrated from the [Place object](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/place) of the Twitter json response. Just iterate over the places for the posts (i.e., `for place in json_object["includes"]["places"]`) and get the relevant information:
- **region**: `place["name"].split(", ")[1]`;
- **latitude**: `from_bbox_to_point(place["geo"]["bbox"])[0]`;
- **longitude**: `from_bbox_to_point(place["geo"]["bbox"])[1]`.

The `from_bbox_to_point()` function computes latitude and longitude coordinates by
taking the central point from the 4-point bounding box of city areas given by the Twitter APIs. It is defined as follows:

```python
def from_bbox_to_point(bbox):
w_long, s_lat, e_long, n_lat = bbox
latitude = (s_lat + n_lat) / 2
longitude = (w_long + e_long) / 2
return [latitude, longitude]
```

If you find any issues, please get in touch with us.


## Citation
Expand Down
Loading

0 comments on commit 07f803e

Please sign in to comment.