Commit 49755b0 by izzymetzger, committed Mar 25, 2020 (1 parent: c1c6b17). Showing 26 changed files with 66,303 additions and 235 deletions.
```
@@ -133,3 +133,4 @@ data-orig/*.csv
*.gz
*.jsonl
data-fasttext/
docs/*.zip
```
@@ -1,6 +1,20 @@
## SM4H - Team **RxSpace**!

### DETAILS

## Table of Contents
* [Competition Details](#competition-details)
* [Team Members](#team)
* [Text Corpora](#text-corpora)
* [Requirements](#requirements)
* [Word Embeddings](#embeddings)
* [Snorkel](#snorkel)
* [Model Training](#model-training)
* [Evaluation](#evaluation)
* [References](#references)
* [Acknowledgments](#acknowledgments)
* [Tags](#tags)

## Competition Details
This repository contains code for tackling Task 4 of the SMM4H 2020 shared task.

The Social Media Mining for Health Applications (#SMM4H) Shared Task involves natural language processing (NLP) challenges of using social media data for health research, including informal, colloquial expressions and misspellings of clinical concepts, noise, data sparsity, ambiguity, and multilingual posts. For each of the five tasks, participating teams are provided with a set of annotated tweets for developing systems, followed by a three-day window during which they run their systems on unlabeled test data and upload the predictions to CodaLab. Information about registration, data access, paper submissions, and presentations can be found below.

@@ -20,13 +34,8 @@ System predictions for test data due: April 5, 2020 (23:59 CodaLab server time)
* Workshop: September 13, 2020 <br>
* All deadlines, except for system predictions (see above), are 23:59 UTC (“anywhere on Earth”). <br>

## Team
### Team members
* Isabel Metzger - [email protected] <br>
* Allison Black - [email protected] <br>
@@ -37,11 +46,150 @@ System predictions for test data due: April 5, 2020 (23:59 CodaLab server time)
* Natasha Zaliznyak - [email protected]
* Whitley Yi - [email protected] <br>

## Text Corpora
### Supervised Learning
* Original train/validation split:
  * We use train.csv and validation.csv as provided by the competition.
  * Train size: 10,537 samples
| class | class counts | class % |
| :---- | -----------: | ------: |
| m     | 5488         | 52.08   |
| c     | 2940         | 27.90   |
| a     | 1685         | 15.99   |
| u     | 424          | 4.02    |
validation/dev: 2,635 samples

| class | class counts | class % |
| :---- | -----------: | ------: |
| m     | 1353         | 51.35   |
| c     | 730          | 27.70   |
| a     | 448          | 17.00   |
| u     | 104          | 3.95    |
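
The distributions above can be reproduced with a short pandas sketch (the label column name `class` is an assumption here; adjust it to the actual CSV schema):

```python
import pandas as pd

# Competition-provided splits, as referenced above.
train = pd.read_csv("train.csv")
dev = pd.read_csv("validation.csv")

def class_distribution(df: pd.DataFrame, label_col: str = "class") -> pd.DataFrame:
    # Per-class counts and percentages, rounded to two decimal places.
    counts = df[label_col].value_counts()
    pcts = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"class counts": counts, "class %": pcts})

print(class_distribution(train))
print(class_distribution(dev))
```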

* Multiple Splits:
  * For our ensemble of multiple text classification models, we train each model on a different 70:30 split of the combined train + validation data, shuffled and stratified by class (a sketch follows below). <br>
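
A minimal sketch of one such split, assuming the combined data sits in a single DataFrame with a `class` label column (the column name and file paths are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Combine the provided competition splits before re-splitting.
combined = pd.concat([pd.read_csv("train.csv"), pd.read_csv("validation.csv")])

# One 70:30 split, shuffled and stratified by class; varying random_state
# yields the different splits used to train the ensemble members.
train_df, val_df = train_test_split(
    combined,
    test_size=0.30,
    shuffle=True,
    stratify=combined["class"],  # label column name is an assumption
    random_state=42,
)
```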
### Unsupervised Learning
We created word embeddings using health-related social media posts from Twitter and other public datasets. We used [ekphrasis](https://github.com/cbaziotis/ekphrasis) and the NLTK tweet tokenizer for tokenization and sentence splitting. Preprocessing can be found in the preprocessing notebook.
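
A sketch of tweet-aware tokenization with the NLTK tweet tokenizer (these flag settings are common choices for tweets, not necessarily the exact ones used in the preprocessing notebook):

```python
from nltk.tokenize import TweetTokenizer

# Lowercase, strip @handles, and squash character elongations ("soooo" -> "sooo").
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

print(tokenizer.tokenize("@user I feel soooo much better on this med!! #recovery"))
# ['i', 'feel', 'sooo', 'much', 'better', 'on', 'this', 'med', '!', '!', '#recovery']
```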

Corpus sources and sizes:

| Sources        | Sentences/Tweets | Tokens |
| :------------- | ---------------: | -----: |
| Twitter (SM4H) |                  |        |
| Drug Reviews   |                  |        |
| Wikipedia      |                  |        |

## Requirements
* Important packages/frameworks utilized include [spacy](https://github.com/explosion/spaCy), [fastText](https://github.com/facebookresearch/fastText), [ekphrasis](https://github.com/cbaziotis/ekphrasis), [allennlp](https://github.com/allenai/allennlp), [PyTorch](https://github.com/pytorch/pytorch), and [snorkel](https://github.com/snorkel-team/snorkel/)
* To use the allennlp config (nlp_configs/text_classification.json) with pre-trained SciBERT embeddings, download and unpack them as follows (a training sketch follows this list):
```bash
wget https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar
tar -xvf scibert_scivocab_uncased.tar
```
* Exact requirements can be found in the requirements.txt file
* For specific processing done in Jupyter notebooks, please find the packages listed in the beginning cells of each notebook
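
With the SciBERT archive unpacked, a model can be trained from the config. A minimal sketch using allennlp's Python entry point (the serialization directory is a placeholder, and the exact call reflects allennlp releases from around this period; it may differ in other versions):

```python
from allennlp.commands.train import train_model_from_file

# Train the classifier defined in the repo's config; "model-output/" is a
# placeholder serialization directory. Equivalent to the CLI:
#   allennlp train nlp_configs/text_classification.json -s model-output/
train_model_from_file("nlp_configs/text_classification.json", "model-output/")
```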

## Embeddings
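
A minimal sketch of how skip-gram embeddings can be trained with fastText on the preprocessed corpus (the corpus file name and hyperparameters here are illustrative placeholders, not the settings actually used):

```python
import fasttext

# Train skip-gram embeddings on a tokenized corpus, one sentence/tweet per line.
# "corpus.txt", dim, and epoch are placeholders for illustration.
model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100, epoch=10)
model.save_model("embeddings.bin")

# Nearest neighbors give a quick sanity check of embedding quality.
print(model.get_nearest_neighbors("adderall")[:5])
```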

## Repo Layout
```
* rx_twitterspace
* nlp_configs
* preds
* data-orig
* docs
```

## Evaluation

### Text classification
* fastText model: `python evaluation.py` (see the scoring sketch below)
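
evaluation.py itself is not shown in this diff; a minimal sketch of the kind of multi-class scoring it would perform over the four classes (the labels here are toy placeholders):

```python
from sklearn.metrics import classification_report, f1_score

# Toy gold and predicted labels over the four classes (m, c, a, u).
y_true = ["m", "c", "a", "u", "m", "c", "a", "m"]
y_pred = ["m", "c", "a", "m", "m", "a", "a", "m"]

# Per-class precision/recall/F1 plus micro- and macro-averaged F1.
print(classification_report(y_true, y_pred, digits=4))
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```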

@@ -77,4 +225,6 @@ python -m gensim.scripts.glove2word2vec --input "/Users/isabelmetzger/PycharmPro

```bash
gzip glove.twitter.27B.100d.w2v.txt
python -m spacy init-model en twitter-glove --vectors-loc glove.twitter.27B.100d.w2v.txt.gz
```
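
Once converted, the vectors can be loaded back through spaCy for downstream use. A sketch, assuming the `twitter-glove` model directory created by the `init-model` call above:

```python
import spacy

# Load the vectors-only model produced by `spacy init-model` (spaCy v2 API).
nlp = spacy.load("twitter-glove")

doc = nlp("xanax withdrawal is rough")
print(doc[0].vector[:5])  # first few GloVe components for "xanax"
```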

## Tags
* data augmentation, weak supervision, noisy labeling, word embeddings, text classification, multi-label, multi-class, scalability