A big data system to handle weather and crime data | Final Project MPCS 53013 Big Data
The code in this repo implements a lambda architecture to feed a large-scale data application that takes weather and crime data, ingests it to HDFS, then automatically runs batch views of the data for user availability while also allowing for real-time updates on the fly.
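The core lambda-architecture idea above can be sketched as a query-time merge: precomputed batch views answer for all data up to the last batch run, and the speed layer fills in everything since. A minimal illustration in Python (names and values are hypothetical; in the real system both views live in HBase, not in dicts):

```python
# Minimal sketch of the lambda-architecture query merge.
# Hypothetical data: crime counts per district.

# Batch view: precomputed by the batch layer up to the last run.
batch_view = {"district_01": 120, "district_02": 85}

# Speed view: real-time increments accumulated since the last batch run.
speed_view = {"district_01": 3, "district_03": 1}

def query(district):
    """Answer = batch-view value + speed-layer delta for the same key."""
    return batch_view.get(district, 0) + speed_view.get(district, 0)

print(query("district_01"))  # 123: 120 from batch + 3 from speed
print(query("district_03"))  # 1: appeared only since the last batch run
```

When a new batch run completes, its output absorbs the speed-layer deltas and the speed view is reset; this is what lets the system serve both precomputed views and on-the-fly updates.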
You can see the Speed Layer interface here. Fair warning: it's not very pretty, given the time constraints on this project; the effort went, necessarily, into the back-end functionality.
- Assumes a Hadoop HDFS file system hosted on Google Cloud
- Apache Kafka for Serving Layer data collection
- Apache Storm topology for Serving Layer ingestion
- Apache Thrift data structure for fact-based, schema-on-read data storage
- Apache Pig for Batch Layer pre-computed view construction
- Apache HBase for pre-computed view storage, Serving Layer data storage, and data access in Speed Layer
- Basic HTML with Python back-end for Speed Layer data access
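The "fact-based, schema-on-read" storage style mentioned above means immutable, timestamped facts are appended at write time, and a schema is only imposed when the data are read. A rough Python sketch of that idea (field names and values are invented for illustration, not taken from the repo's Thrift schema):

```python
from collections import namedtuple

# A fact is an immutable, timestamped observation about one entity.
Fact = namedtuple("Fact", ["entity", "property", "value", "timestamp"])

# Facts as they might be appended during ingestion (hypothetical values).
facts = [
    Fact("station_725300", "temperature_f", 41.2, 1511900000),
    Fact("station_725300", "wind_speed_kt", 9.8, 1511900000),
    Fact("district_01", "crime_type", "THEFT", 1511903600),
]

def view(entity):
    """Schema-on-read: reconstruct an entity's state from its facts,
    letting later timestamps win. No schema is enforced at write time."""
    state = {}
    for f in sorted((f for f in facts if f.entity == entity),
                    key=lambda f: f.timestamp):
        state[f.property] = f.value
    return state

print(view("station_725300"))  # {'temperature_f': 41.2, 'wind_speed_kt': 9.8}
```

Because facts are never mutated, the batch layer can always recompute views from scratch, which is what makes the Pig-built precomputed views safe to regenerate.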
- `set-up` contains the necessary shell code for running various aspects of the system
- `frontEnd` contains the Speed Layer code for data access
- `ingestFiles` contains the ingestion code for HDFS serialization
- `thriftFiles` contains the Thrift schema for serialization
- `pigFiles` contains all the Pig code for Batch Layer runs
- `stormFiles` contains the code for the Serving Layer Storm topology
- `jars` contains the necessary jars and uberjars for Java applications
- `mvn` and `pig` contain the necessary open-source application jars for implementation
Data are from NOAA's Global Summary of the Day (ftp://ftp.ncdc.noaa.gov/pub/data/gsod/) and the City of Chicago Data Portal.
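For reference, the GSOD archive is organized by year, with one gzipped file per station-year named by the station's USAF and WBAN identifiers. A small helper that builds those FTP paths (the layout and the example identifiers for Chicago O'Hare are assumptions about the public archive; verify against the server before relying on them):

```python
GSOD_BASE = "ftp://ftp.ncdc.noaa.gov/pub/data/gsod"

def gsod_url(usaf, wban, year):
    """Build the FTP URL for one station-year GSOD file.
    Assumed layout: <base>/<year>/<usaf>-<wban>-<year>.op.gz"""
    return f"{GSOD_BASE}/{year}/{usaf}-{wban}-{year}.op.gz"

# Hypothetical example: Chicago O'Hare (USAF 725300, WBAN 94846) for 2017.
print(gsod_url("725300", "94846", 2017))
# ftp://ftp.ncdc.noaa.gov/pub/data/gsod/2017/725300-94846-2017.op.gz
```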