Commit 49755b0 by izzymetzger, committed Mar 25, 2020 (1 parent: c1c6b17). Showing 26 changed files with 66,303 additions and 235 deletions.
```
@@ -133,3 +133,4 @@ data-orig/*.csv
*.gz
*.jsonl
data-fasttext/
docs/*.zip
```
@@ -1,6 +1,20 @@
## SM4H - Team **RxSpace**!

### DETAILS

## Table of Contents
* [Competition Details](#competition-details)
* [Team Members](#team)
* [Text Corpora](#text-corpora)
* [Requirements](#requirements)
* [Word Embeddings](#embeddings)
* [Snorkel](#snorkel)
* [Model Training](#model-training)
* [Evaluation](#evaluation)
* [References](#references)
* [Acknowledgments](#acknowledgments)
* [Tags](#tags)

## Competition Details
This repository contains code for tackling Task 4 of the SMM4H 2020 shared task.

The Social Media Mining for Health Applications (#SMM4H) Shared Task involves natural language processing (NLP) challenges of using social media data for health research, including informal, colloquial expressions and misspellings of clinical concepts, noise, data sparsity, ambiguity, and multilingual posts. For each of the five tasks, participating teams are provided with a set of annotated tweets for developing systems, followed by a three-day window during which they run their systems on unlabeled test data and upload the predictions to CodaLab. Information about registration, data access, paper submissions, and presentations can be found below.

@@ -20,13 +34,8 @@ System predictions for test data due: April 5, 2020 (23:59 CodaLab server time)
* Workshop: September 13, 2020 <br>
* All deadlines, except for system predictions (see above), are 23:59 UTC (“anywhere on Earth”). <br>

## Team
### Team members
* Isabel Metzger - [email protected] <br>
* Allison Black - [email protected] <br>
@@ -37,11 +46,150 @@ System predictions for test data due: April 5, 2020 (23:59 CodaLab server time)
* Natasha Zaliznyak - [email protected]
* Whitley Yi - [email protected] <br>

## Text Corpora
### Supervised Learning
* Original train/validation split:
  * We use train.csv and validation.csv as provided by the competition.
  * Train size: 10,537 samples
| class | class counts | class % |
| :---- | -----------: | ------: |
| m     | 5488         | 52.08   |
| c     | 2940         | 27.90   |
| a     | 1685         | 15.99   |
| u     | 424          | 4.02    |
validation/dev: 2,635 samples

| class | class counts | class % |
| :---- | -----------: | ------: |
| m     | 1353         | 51.35   |
| c     | 730          | 27.70   |
| a     | 448          | 17.00   |
| u     | 104          | 3.95    |
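
The distributions above can be reproduced with a short pandas sketch (the label column name `class` is an assumption here; adjust it to the actual CSV schema):

```python
import pandas as pd

# Competition-provided splits, as referenced above.
train = pd.read_csv("train.csv")
dev = pd.read_csv("validation.csv")

def class_distribution(df: pd.DataFrame, label_col: str = "class") -> pd.DataFrame:
    # Per-class counts and percentages, rounded to two decimal places.
    counts = df[label_col].value_counts()
    pcts = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"class counts": counts, "class %": pcts})

print(class_distribution(train))
print(class_distribution(dev))
```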

* Multiple Splits:
  * For our ensemble of multiple text classification models, we train each model on a different 70:30 split of the combined train + validation data, shuffled and stratified by class (a sketch follows below). <br>
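
A minimal sketch of one such split, assuming the combined data sits in a single DataFrame with a `class` label column (the column name and file paths are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Combine the provided competition splits before re-splitting.
combined = pd.concat([pd.read_csv("train.csv"), pd.read_csv("validation.csv")])

# One 70:30 split, shuffled and stratified by class; varying random_state
# yields the different splits used to train the ensemble members.
train_df, val_df = train_test_split(
    combined,
    test_size=0.30,
    shuffle=True,
    stratify=combined["class"],  # label column name is an assumption
    random_state=42,
)
```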
### Unsupervised Learning
We created word embeddings using health-related social media posts from Twitter and other public datasets. We used [ekphrasis](https://github.com/cbaziotis/ekphrasis) and the NLTK tweet tokenizer for tokenization and sentence splitting. Preprocessing can be found in the preprocessing notebook.
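
A sketch of tweet-aware tokenization with the NLTK tweet tokenizer (these flag settings are common choices for tweets, not necessarily the exact ones used in the preprocessing notebook):

```python
from nltk.tokenize import TweetTokenizer

# Lowercase, strip @handles, and squash character elongations ("soooo" -> "sooo").
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

print(tokenizer.tokenize("@user I feel soooo much better on this med!! #recovery"))
# ['i', 'feel', 'sooo', 'much', 'better', 'on', 'this', 'med', '!', '!', '#recovery']
```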

Corpus sources and sizes:

| Sources        | Sentences/Tweets | Tokens |
| :------------- | ---------------: | -----: |
| Twitter (SM4H) |                  |        |
| Drug Reviews   |                  |        |
| Wikipedia      |                  |        |

## Requirements
* Important packages/frameworks utilized include [spacy](https://github.com/explosion/spaCy), [fastText](https://github.com/facebookresearch/fastText), [ekphrasis](https://github.com/cbaziotis/ekphrasis), [allennlp](https://github.com/allenai/allennlp), [PyTorch](https://github.com/pytorch/pytorch), and [snorkel](https://github.com/snorkel-team/snorkel/)
* To use the allennlp config (nlp_configs/text_classification.json) with pre-trained SciBERT embeddings, download and unpack them as follows (a training sketch follows this list):
```bash
wget https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar
tar -xvf scibert_scivocab_uncased.tar
```
* Exact requirements can be found in the requirements.txt file
* For specific processing done in Jupyter notebooks, please find the packages listed in the beginning cells of each notebook
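
With the SciBERT archive unpacked, a model can be trained from the config. A minimal sketch using allennlp's Python entry point (the serialization directory is a placeholder, and the exact call reflects allennlp releases from around this period; it may differ in other versions):

```python
from allennlp.commands.train import train_model_from_file

# Train the classifier defined in the repo's config; "model-output/" is a
# placeholder serialization directory. Equivalent to the CLI:
#   allennlp train nlp_configs/text_classification.json -s model-output/
train_model_from_file("nlp_configs/text_classification.json", "model-output/")
```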

## Embeddings
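
A minimal sketch of how skip-gram embeddings can be trained with fastText on the preprocessed corpus (the corpus file name and hyperparameters here are illustrative placeholders, not the settings actually used):

```python
import fasttext

# Train skip-gram embeddings on a tokenized corpus, one sentence/tweet per line.
# "corpus.txt", dim, and epoch are placeholders for illustration.
model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100, epoch=10)
model.save_model("embeddings.bin")

# Nearest neighbors give a quick sanity check of embedding quality.
print(model.get_nearest_neighbors("adderall")[:5])
```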

## Repo Layout
```
* rx_twitterspace
* nlp_configs
* preds
* data-orig
* docs
```

## Evaluation

### Text classification
* fastText model: `python evaluation.py` (see the scoring sketch below)
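
evaluation.py itself is not shown in this diff; a minimal sketch of the kind of multi-class scoring it would perform over the four classes (the labels here are toy placeholders):

```python
from sklearn.metrics import classification_report, f1_score

# Toy gold and predicted labels over the four classes (m, c, a, u).
y_true = ["m", "c", "a", "u", "m", "c", "a", "m"]
y_pred = ["m", "c", "a", "m", "m", "a", "a", "m"]

# Per-class precision/recall/F1 plus micro- and macro-averaged F1.
print(classification_report(y_true, y_pred, digits=4))
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```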

@@ -77,4 +225,6 @@ python -m gensim.scripts.glove2word2vec --input "/Users/isabelmetzger/PycharmPro

```bash
gzip glove.twitter.27B.100d.w2v.txt
python -m spacy init-model en twitter-glove --vectors-loc glove.twitter.27B.100d.w2v.txt.gz
```
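
Once converted, the vectors can be loaded back through spaCy for downstream use. A sketch, assuming the `twitter-glove` model directory created by the `init-model` call above:

```python
import spacy

# Load the vectors-only model produced by `spacy init-model` (spaCy v2 API).
nlp = spacy.load("twitter-glove")

doc = nlp("xanax withdrawal is rough")
print(doc[0].vector[:5])  # first few GloVe components for "xanax"
```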

## Tags
* data augmentation, weak supervision, noisy labeling, word embeddings, text classification, multi-label, multi-class, scalability