diff --git a/data/data_statement.md b/data/data_statement.md
new file mode 100644
index 0000000..116ac63
--- /dev/null
+++ b/data/data_statement.md
@@ -0,0 +1,30 @@
+We currently rely on 3 datasets for our research and modeling efforts:
+
+1. Waseem, Zeerak, and Dirk Hovy. "Hateful symbols or hateful people? Predictive features for hate speech detection on
+Twitter." Proceedings of the NAACL student research workshop. 2016. (see the
+[Hate speech](#Hate-speech) section)
+
+2. Anzovino, Maria, Elisabetta Fersini, and Paolo Rosso. "Automatic identification and classification of misogynistic
+language on Twitter." International Conference on Applications of Natural Language to Information Systems.
+Springer, Cham, 2018. (see the [Automatic Misogyny Identification](#Automatic-Misogyny-Identification) section)
+
+3. A dataset that we collected and labeled. See the [Our Annotations](#Our-Annotations) section for a full description
+of our process.
+
+
+These 3 datasets are combined into what we call the **gold dataset**. 
+
+The next 3 sections describe how each dataset was collected and labeled, in the form of data statements
+([Bender, Emily M., and Batya Friedman](https://www.aclweb.org/anthology/Q18-1041/)).
+
+# Hate speech  
+to-do
+
+# Automatic Misogyny Identification 
+to-do
+
+# Our Annotations
+to-do
+
+
+