Data exploration of Kaggle data sets, developed in Spark and Scala.
This is an initial program that filters distinct values from the London crime data set. It currently extracts and saves the following information (a sketch of a few of these aggregations follows the list):
- Distinct boroughs found in the data set
- Distinct boroughs and LSOA codes
- Distinct major crime categories
- Distinct minor crime categories
- Total number of crimes by borough in descending order
- Total number of crimes by major category in descending order
- Total number of crimes by borough and year
- Total number of crimes by major category and year
- Total number of crimes by minor category and year
- Total number of crimes by year
- Total number of crimes by year and month
- Crime category percentage by year (2016-2008)
- Total crimes by year and LSOA codes
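The aggregations above map directly to standard Spark DataFrame operations. Below is a minimal sketch of a few of them, assuming the Kaggle CSV schema (lsoa_code, borough, major_category, minor_category, value, year, month) and a local SparkSession; the object, variable, and column names are illustrative and not the project's actual code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("london-crime-sketch")
      .master("local[*]")
      .getOrCreate()

    // The Kaggle file has columns: lsoa_code, borough, major_category,
    // minor_category, value (crime count), year, month.
    val crimes = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("london_crime_by_lsoa.csv")

    // Distinct boroughs found in the data set.
    val boroughs = crimes.select("borough").distinct()

    // Total number of crimes by borough, in descending order.
    val totalByBorough = crimes
      .groupBy("borough")
      .agg(sum("value").alias("total"))
      .orderBy(desc("total"))

    // Crime category percentage for a given year (2016 as an example).
    val crimes2016 = crimes.filter(col("year") === 2016)
    val grandTotal = crimes2016.agg(sum("value")).first().getLong(0)
    val percentages2016 = crimes2016
      .groupBy("major_category")
      .agg(sum("value").alias("total"))
      .withColumn("percentage", col("total") / grandTotal * 100)

    boroughs.show()
    totalByBorough.show()
    percentages2016.show()

    spark.stop()
  }
}
```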
In order to run it, the CSV file downloaded from the Kaggle website must be located in the project's root folder, and the program executed as follows:
$ sbt
...
$ run --local london_crime_by_lsoa.csv --destination D:\temp --master local[*]
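The command-line flags above select the input source, output location, and Spark master. As an illustration only, a configuration could be collected from them roughly as in the sketch below; the `ExtractConfig` case class and the parsing loop are hypothetical, not the project's actual code:

```scala
// Hypothetical configuration holder for the CLI flags shown in this README.
case class ExtractConfig(
  awsS3Path: Option[String] = None,   // --aws-s3 bucket/key of the input CSV
  localPath: Option[String] = None,   // --local path to the input CSV
  destination: String = ".",          // --destination output folder
  s3Bucket: Option[String] = None,    // --s3-bucket optional upload bucket
  s3Prefix: Option[String] = None,    // --s3-prefix optional upload prefix
  master: String = "local[*]")        // --master Spark master URL

object ExtractConfig {
  // Very small flag/value parser: walks the argument list in pairs.
  def parse(args: List[String], config: ExtractConfig = ExtractConfig()): ExtractConfig =
    args match {
      case "--aws-s3" :: value :: rest      => parse(rest, config.copy(awsS3Path = Some(value)))
      case "--local" :: value :: rest       => parse(rest, config.copy(localPath = Some(value)))
      case "--destination" :: value :: rest => parse(rest, config.copy(destination = value))
      case "--s3-bucket" :: value :: rest   => parse(rest, config.copy(s3Bucket = Some(value)))
      case "--s3-prefix" :: value :: rest   => parse(rest, config.copy(s3Prefix = Some(value)))
      case "--master" :: value :: rest      => parse(rest, config.copy(master = value))
      case Nil                              => config
      case unknown :: _                     => sys.error(s"Unknown option: $unknown")
    }
}
```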
At the end, the following folders are created under the destination folder (D:\temp in the example above):
- borough_csv
- lsoa_csv
- categories_csv
- major_category_csv
- minor_category_csv
- total_crimes_by_borough_csv
- total_crimes_by_major_category_csv
- total_crimes_by_minor_category_csv
- total_crimes_by_borough_year_csv
- total_crimes_by_major_category_year_csv
- total_crimes_by_minor_category_year_csv
- total_crimes_by_year_csv
- total_crimes_by_year_month_csv
- crime_percentage_2016_csv
- crime_percentage_2015_csv
- crime_percentage_2014_csv
- crime_percentage_2013_csv
- crime_percentage_2012_csv
- crime_percentage_2011_csv
- crime_percentage_2010_csv
- crime_percentage_2009_csv
- crime_percentage_2008_csv
- total_crimes_by_year_lsoa_code_csv
The following folders, containing the Parquet output, are also created:
- borough_parquet
- lsoa_parquet
- categories_parquet
- major_category_parquet
- minor_category_parquet
- total_crimes_by_borough_parquet
- total_crimes_by_major_category_parquet
- total_crimes_by_minor_category_parquet
- total_crimes_by_borough_year_parquet
- total_crimes_by_major_category_year_parquet
- total_crimes_by_minor_category_year_parquet
- total_crimes_by_year_parquet
- total_crimes_by_year_month_parquet
- crime_percentage_2016_parquet
- crime_percentage_2015_parquet
- crime_percentage_2014_parquet
- crime_percentage_2013_parquet
- crime_percentage_2012_parquet
- crime_percentage_2011_parquet
- crime_percentage_2010_parquet
- crime_percentage_2009_parquet
- crime_percentage_2008_parquet
- total_crimes_by_year_lsoa_code_parquet
Each folder contains the result of the corresponding transformation applied to the original data set; a sketch of the write step follows.
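Writing each result both as CSV and as Parquet is a pair of standard DataFrame writes. A minimal sketch, assuming a helper like the one below (the `saveResult` name, the `coalesce(1)`, and the header option are assumptions, not the project's actual code):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object OutputWriterSketch {
  // Writes one result DataFrame twice: as CSV with a header and as Parquet,
  // producing the <name>_csv and <name>_parquet folders listed above.
  def saveResult(df: DataFrame, destination: String, name: String): Unit = {
    df.coalesce(1) // results are small; a single part file per folder is convenient
      .write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .csv(s"$destination/${name}_csv")

    df.write
      .mode(SaveMode.Overwrite)
      .parquet(s"$destination/${name}_parquet")
  }
}

// Example usage: OutputWriterSketch.saveResult(totalByBorough, "D:/temp", "total_crimes_by_borough")
```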
In order to run the unit tests inside the project, the following commands can be executed:
$ sbt
...
$ test
Wait for the unit tests to finish to see the results.
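As an illustration only, a unit test for one of the aggregations could look like the hypothetical ScalaTest suite below, which runs against a tiny in-memory DataFrame instead of the full Kaggle file; the suite name, traits, and data are assumptions, not the project's actual tests:

```scala
import org.scalatest.{FlatSpec, Matchers}
import org.apache.spark.sql.SparkSession

// Hypothetical test suite illustrating how an aggregation could be unit-tested.
class AggregationSpec extends FlatSpec with Matchers {

  private val spark = SparkSession.builder()
    .appName("aggregation-spec")
    .master("local[1]")
    .getOrCreate()

  import spark.implicits._

  "Total crimes by borough" should "sum the value column per borough" in {
    val sample = Seq(
      ("E01000001", "Camden", "Theft", "Other Theft", 2, 2016, 11),
      ("E01000002", "Camden", "Theft", "Other Theft", 3, 2016, 12),
      ("E01000003", "Hackney", "Burglary", "Burglary in a Dwelling", 1, 2016, 1)
    ).toDF("lsoa_code", "borough", "major_category", "minor_category", "value", "year", "month")

    val totals = sample.groupBy("borough").sum("value")
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap

    totals("Camden") shouldBe 5
    totals("Hackney") shouldBe 1
  }
}
```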
In order to run the program in standalone mode using the CSV from S3, the following command can be used (assuming a Windows machine):
$ sbt
$ sbt:kaggle-london-crime-data-spark> run --aws-s3 org.sharpsw.spark/kaggle-london-crime-data/input/london_crime_by_lsoa.csv --destination D:\temp --s3-bucket bucket --s3-prefix s3-prefix --master local[*]
In order to run the program in standalone mode using a local CSV, the following command can be used (assuming a Windows machine):
$ sbt
$ sbt:kaggle-london-crime-data-spark> run --local london_crime_by_lsoa.csv --destination D:\temp --s3-bucket bucket --s3-prefix s3-prefix --master local[*]
In order to run the program on a Spark installation via spark-submit, using the CSV from S3, the following command can be used (assuming a Windows machine):
$ spark-submit --master local[*] --jars aws-java-sdk-kms-1.11.354.jar,aws-java-sdk-s3-1.11.354.jar,aws-java-sdk-core-1.11.354.jar --class org.sharpsw.spark.ExtractLondonCrimeData target\scala-2.11\kaggle-london-crime-data-spark_2.11-<appVersion>.jar --aws-s3 org.sharpsw.spark/kaggle-london-crime-data/input/london_crime_by_lsoa.csv --destination D:\temp --s3-bucket bucket --s3-prefix s3-prefix
In order to run the program on a Spark installation via spark-submit, using a local CSV, the following command can be used (assuming a Windows machine):
$ spark-submit --master local[*] --jars aws-java-sdk-kms-1.11.354.jar,aws-java-sdk-s3-1.11.354.jar,aws-java-sdk-core-1.11.354.jar --class org.sharpsw.spark.ExtractLondonCrimeData target\scala-2.11\kaggle-london-crime-data-spark_2.11-<appVersion>.jar --local london_crime_by_lsoa.csv --destination D:\temp --s3-bucket bucket --s3-prefix s3-prefix
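The `--s3-bucket` and `--s3-prefix` options, together with the aws-java-sdk jars passed to spark-submit, indicate that results can be pushed to S3 after the job completes. A minimal sketch of such an upload with the AWS SDK for Java v1 (the client setup, directory walk, and key layout below are assumptions, not the project's actual code):

```scala
import java.io.File
import com.amazonaws.services.s3.AmazonS3ClientBuilder

object S3UploadSketch {
  // Recursively uploads every file under `destination` to s3://bucket/prefix/...,
  // preserving the relative path as the object key.
  def uploadResults(destination: String, bucket: String, prefix: String): Unit = {
    val s3 = AmazonS3ClientBuilder.defaultClient()
    val root = new File(destination)

    def walk(file: File): Seq[File] =
      if (file.isDirectory) file.listFiles().toSeq.flatMap(walk) else Seq(file)

    walk(root).foreach { file =>
      val relative = root.toURI.relativize(file.toURI).getPath
      s3.putObject(bucket, s"$prefix/$relative", file)
    }
  }
}
```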
All changes are listed in CHANGELOG.md.
London crime data, 2008-2016 (Kaggle) - last used in May 2018