Commit

updating readme with newest
izzymetzger committed Mar 25, 2020
1 parent c1c6b17 commit 49755b0
Showing 26 changed files with 66,303 additions and 235 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -133,3 +133,4 @@ data-orig/*.csv
*.gz
*.jsonl
data-fasttext/
docs/*.zip
170 changes: 160 additions & 10 deletions README.md
@@ -1,6 +1,20 @@
## SM4H - Team **RxSpace**!


## Table of Contents
* [Competition Details](#competition-details)
* [Team Members](#team)
* [Text Corpora](#text-corpora)
* [Requirements](#requirements)
* [Word Embeddings](#embeddings)
* [Snorkel](#snorkel)
* [Model Training](#model-training)
* [Evaluation](#evaluation)
* [References](#references)
* [Acknowledgments](#acknowledgments)
* [Tags](#tags)

## Competition Details
This repository contains code for tackling Task 4 of the SMM4H 2020 Shared Task.

The Social Media Mining for Health Applications (#SMM4H) Shared Task involves natural language processing (NLP) challenges in using social media data for health research, including informal, colloquial expressions and misspellings of clinical concepts, noise, data sparsity, ambiguity, and multilingual posts. For each of the five tasks below, participating teams will be provided with a set of annotated tweets for developing systems, followed by a three-day window during which they will run their systems on unlabeled test data and upload the predictions of their systems to CodaLab. Information about registration, data access, paper submissions, and presentations can be found below.
@@ -20,13 +34,8 @@ System predictions for test data due: April 5, 2020 (23:59 CodaLab server time)
* Workshop: September 13, 2020 <br>
* All deadlines, except for system predictions (see above), are 23:59 UTC (“anywhere on Earth”). <br>

## Team
### Team members
* Isabel Metzger - [email protected] <br>
* Allison Black - [email protected] <br>
@@ -37,11 +46,150 @@ System predictions for test data due: April 5, 2020 (23:59 CodaLab server time)
* Natasha Zaliznyak - [email protected]
* Whitley Yi - [email protected] <br>

## Text Corpora
### Supervised Learning
* Original train/validation split:
  * We use `train.csv` and `validation.csv` as provided by the competition.
  * Train set: 10,537 samples

| class | class counts | class % |
| :---- | -----------: | ------: |
| m | 5488 | 52.08 |
| c | 2940 | 27.90 |
| a | 1685 | 15.99 |
| u | 424 | 4.02 |
Validation/dev set: 2,635 samples
| class | class counts | class % |
| :---- | -----------: | ------: |
| m | 1353 | 51.35 |
| c | 730 | 27.70 |
| a | 448 | 17.00 |
| u | 104 | 3.95 |
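
The class-distribution tables above are the kind of summary a short pandas snippet produces; a hedged sketch is shown below (the label column name `class` is an assumption, not confirmed by the repo):

```python
# Hedged sketch of how the class-distribution tables above can be produced;
# the label column name "class" is an assumption.
import pandas as pd

def class_distribution(path: str) -> pd.DataFrame:
    labels = pd.read_csv(path)["class"]
    counts = labels.value_counts()
    return pd.DataFrame({
        "class counts": counts,
        "class %": (100 * counts / counts.sum()).round(2),
    })

print(class_distribution("train.csv"))
print(class_distribution("validation.csv"))
```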

* Multiple Splits:
  * For our ensemble of text classification models, we train models on several different 70:30 splits of the combined train + validation data, shuffled and stratified by class (a sketch of one such split follows this list). <br>
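
As a rough illustration (not the repository's exact code), the sketch below builds one stratified 70:30 split with pandas and scikit-learn; the label column name `class` and the random seed are assumptions. Repeating it with different seeds yields the multiple splits used for the ensemble.

```python
# Hedged sketch of one 70:30 stratified split of the combined train + validation
# data; the column name "class" and the seed are assumptions, not the repo's code.
import pandas as pd
from sklearn.model_selection import train_test_split

combined = pd.concat(
    [pd.read_csv("train.csv"), pd.read_csv("validation.csv")],
    ignore_index=True,
)

# Shuffle and stratify by the label column so each split keeps the m/c/a/u ratios.
train_split, dev_split = train_test_split(
    combined,
    test_size=0.30,
    stratify=combined["class"],
    shuffle=True,
    random_state=42,
)

print(train_split["class"].value_counts(normalize=True).round(4))
print(dev_split["class"].value_counts(normalize=True).round(4))
```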

### Unsupervised Learning
We created word embeddings using health-related social media posts from Twitter and other public datasets. We used [ekphrasis](https://github.com/cbaziotis/ekphrasis) and the NLTK tweet tokenizer for sentence splitting and tokenization; the preprocessing steps are in the preprocessing notebook.
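
A minimal sketch of this kind of preprocessing is shown below, assuming ekphrasis's `TextPreProcessor` with its Twitter-trained segmenter/corrector and NLTK's `TweetTokenizer`; the exact options used in the preprocessing notebook may differ, and the example tweet is made up.

```python
# Hedged sketch of tweet preprocessing with ekphrasis + NLTK; the exact options
# in the repo's preprocessing notebook may differ.
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from nltk.tokenize import TweetTokenizer

text_processor = TextPreProcessor(
    normalize=["url", "user", "email", "phone", "number"],  # map these to special tokens
    annotate={"hashtag", "elongated", "allcaps"},
    segmenter="twitter",        # word statistics learned from Twitter
    corrector="twitter",
    unpack_hashtags=True,       # e.g. "#drugabuse" -> "drug abuse"
    spell_correct_elong=False,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

nltk_tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

tweet = "Ran out of my Adderall againnn... anyone know a refill trick?? #meds"
print(text_processor.pre_process_doc(tweet))   # ekphrasis tokens with normalization
print(nltk_tokenizer.tokenize(tweet))          # plain NLTK tweet tokens
```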


| Sources | Sentences/Tweets | Tokens |
| :------ | --------: | -----: |
| Twitter (SM4H) | | |
| Drug Reviews | | |
| Wikipedia | | |

## Requirements
* Key packages/frameworks include [spacy](https://github.com/explosion/spaCy), [fastText](https://github.com/facebookresearch/fastText), [ekphrasis](https://github.com/cbaziotis/ekphrasis), [allennlp](https://github.com/allenai/allennlp), [PyTorch](https://github.com/pytorch/pytorch), and [snorkel](https://github.com/snorkel-team/snorkel/).
* To use the allennlp config (`nlp_configs/text_classification.json`) with pre-trained SciBERT embeddings, download and unpack them as below (a training sketch follows this list):
```bash
wget https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar
tar -xvf scibert_scivocab_uncased.tar
```
* Exact requirements are listed in the `requirements.txt` file.
* For processing done in Jupyter notebooks, the required packages are listed in the beginning cells of each notebook.
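
As a hedged sketch (not the repository's documented entry point), training with that config from Python might look like the following; the serialization directory name is an assumption, and the config is expected to point at the unpacked `scibert_scivocab_uncased` weights.

```python
# Hedged sketch: train the AllenNLP text classifier described by
# nlp_configs/text_classification.json. The output directory name is made up;
# the config itself is assumed to reference the unpacked SciBERT weights.
from allennlp.commands.train import train_model_from_file

train_model_from_file(
    "nlp_configs/text_classification.json",   # experiment config from this repo
    "output/scibert_text_clf",                 # hypothetical serialization directory
)
```

The equivalent CLI invocation would be `allennlp train nlp_configs/text_classification.json -s <output_dir>`.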

## Embeddings

## Repo Layout
```
rx_twitterspace/
nlp_configs/
preds/
data-orig/
docs/
```

## Evaluation

### Text classification
* fastText model: `python evaluation.py` (a hedged evaluation sketch appears at the end of this section)
@@ -77,4 +225,6 @@ python -m gensim.scripts.glove2word2vec --input "/Users/isabelmetzger/PycharmPro
```
gzip glove.twitter.27B.100d.w2v.txt
python -m spacy init-model en twitter-glove --vectors-loc glove.twitter.27B.100d.w2v.txt.gz
```
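
The repo's `evaluation.py` is not reproduced here; below is a hedged sketch of what evaluating a trained fastText classifier against `validation.csv` could look like. The model path `fasttext_model.bin`, the column names `tweet`/`class`, and the `__label__` prefix convention are assumptions.

```python
# Hedged sketch, not the repo's evaluation.py: score a trained fastText text
# classifier on validation.csv. Model path and column names are assumptions.
import fasttext
import pandas as pd
from sklearn.metrics import classification_report, f1_score

model = fasttext.load_model("fasttext_model.bin")      # hypothetical model file
dev = pd.read_csv("validation.csv")

y_true, y_pred = [], []
for _, row in dev.iterrows():
    text = str(row["tweet"]).replace("\n", " ")        # fastText predicts one line at a time
    labels, _probs = model.predict(text, k=1)
    y_pred.append(labels[0].replace("__label__", ""))  # strip fastText's label prefix
    y_true.append(str(row["class"]))

print(classification_report(y_true, y_pred, digits=4))
print("micro-F1:", round(f1_score(y_true, y_pred, average="micro"), 4))
```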
## Tags
* data augmentation, weak supervision, noisy labeling, snorkel labeling functions, word embeddings, ELMo, CNNs, text classification, multi-label, multi-class, scalability
Binary file added docs/output_14_0.png
Binary file added docs/output_15_0.png
