Spark program for analyzing parking violations in NYC, using datasets from the NYC Open Data web site.
The program loads the violation CSVs from the NYC Open Data web site and performs filtering and grouping of the traffic violations that occurred in New York City.
Some of the data transformations performed are (a minimal sketch follows the list):
- Filter violations by year
- Aggregate (count) violations by year
- Aggregate (count) violations by plate type
- Aggregate (count) violations by registration state
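A minimal sketch of these transformations, assuming the input path, the CSV headers (Issue Date, Plate Type, Registration State), and the date format shown below, which may differ from the project's actual ones:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{to_date, year}

object TransformationsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nyc-parking-violations").getOrCreate()
    import spark.implicits._

    // Hypothetical input path; the real files come from --data-folder.
    val violations = spark.read
      .option("header", "true")
      .csv("/home/user/kaggle/nyc-parking-violations/data/violations-2017.csv")

    // Filter violations by year, deriving the year from the issue date.
    val issued2017 = violations
      .filter(year(to_date($"Issue Date", "MM/dd/yyyy")) === 2017)

    // Aggregate (count) violations by year, plate type, and registration state.
    violations.groupBy(year(to_date($"Issue Date", "MM/dd/yyyy")).as("year")).count().show()
    violations.groupBy($"Plate Type").count().show()
    violations.groupBy($"Registration State").count().show()

    issued2017.show()
  }
}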
One of the main structures used in this program is Spark's Dataset, which requires case classes for the records being stored and manipulated. Therefore, all major data structures loaded by this program are modeled as case classes and handled through Datasets.
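For illustration, a hypothetical case class and the conversion from the raw CSV into a typed Dataset could look like this (the field and column names are assumptions, not the project's actual ones):

import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; the project's real case classes may carry more fields.
case class Violation(plateType: String, registrationState: String, issueDate: String)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nyc-parking-violations").getOrCreate()
    import spark.implicits._

    // Select and rename the columns of interest, then map each row to the case class.
    val violations: Dataset[Violation] = spark.read
      .option("header", "true")
      .csv("/home/user/kaggle/nyc-parking-violations/data/violations-2017.csv")
      .select(
        $"Plate Type".as("plateType"),
        $"Registration State".as("registrationState"),
        $"Issue Date".as("issueDate"))
      .as[Violation]

    // Typed access: field names are checked at compile time.
    violations.filter(_.plateType == "PAS").count()
  }
}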
Another mechanism I tried was to bulk load all violation CSVs at once instead of loading them individually. Since Spark provides this functionality out of the box, the code becomes smaller; once everything was in memory, I simply filtered the data by year.
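A sketch of that approach, reusing the SparkSession and imports from the first sketch; the glob pattern is an assumption about how the yearly files are laid out on disk:

// Spark's CSV reader accepts glob patterns, so every yearly file can be
// loaded in a single call.
val allViolations = spark.read
  .option("header", "true")
  .csv("/home/user/kaggle/nyc-parking-violations/data/*.csv")

// With every file in memory, filtering by year is a single transformation.
val only2017 = allViolations
  .filter(year(to_date($"Issue Date", "MM/dd/yyyy")) === 2017)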
To build and run the project locally, use the following commands:
$ sbt compile
$ sbt "run --app-token token_app --data-folder /home/user/kaggle/nyc-parking-violations/data --csv-folder /home/andersonkmi/kaggle/nyc-parking-violations --destination-folder /home/user/temp"
The token_app value refers to an application token that must be obtained from the NYC Open Data web site. It is required because the program uses the site's API to retrieve the violation codes directly. See the DOF Parking Violation Codes reference below.
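NYC Open Data is served by the Socrata platform, which accepts application tokens via the X-App-Token request header. A rough sketch of such a request follows; the dataset id in the URL is a placeholder, and this is not necessarily how the project itself issues the call:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object ViolationCodesSketch {
  def main(args: Array[String]): Unit = {
    // "dataset-id" is a placeholder; look up the real id of the DOF Parking
    // Violation Codes dataset on the portal.
    val request = HttpRequest.newBuilder()
      .uri(URI.create("https://data.cityofnewyork.us/resource/dataset-id.json"))
      .header("X-App-Token", sys.env("APP_TOKEN")) // the token obtained from the portal
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // the violation codes as JSON
  }
}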
The --data-folder option indicates the source folder of the CSV files containing the violations.
The --csv-folder option indicates the folder for plate type and state information CSV files.
The --destination-folder option indicates where the data transformation results will be written. If not provided, the default destination is the project root folder.
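For illustration only, the options above could be collected into a small configuration object with a pattern-matching parser like the one below; the project's actual argument handling may differ:

object ArgumentsSketch {
  // Defaults mirror the documented behavior: the destination falls back to
  // the project root folder when --destination-folder is not provided.
  case class AppConfig(appToken: String = "",
                       dataFolder: String = "",
                       csvFolder: String = "",
                       destinationFolder: String = ".")

  @annotation.tailrec
  def parseArgs(args: List[String], config: AppConfig = AppConfig()): AppConfig = args match {
    case "--app-token" :: value :: rest          => parseArgs(rest, config.copy(appToken = value))
    case "--data-folder" :: value :: rest        => parseArgs(rest, config.copy(dataFolder = value))
    case "--csv-folder" :: value :: rest         => parseArgs(rest, config.copy(csvFolder = value))
    case "--destination-folder" :: value :: rest => parseArgs(rest, config.copy(destinationFolder = value))
    case Nil                                     => config
    case _ :: rest                               => parseArgs(rest, config) // skip unknown tokens
  }

  def main(args: Array[String]): Unit =
    println(parseArgs(args.toList))
}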
All modifications developed for this project are listed in CHANGELOG.md.