seismos for NFP Stability

#### Paul Mack, Sam Sun, Alden Golab

A prediction system for anticipating not-for-profit service provider fiscal stability year-over-year. Final project for Machine Learning in Public Policy, Spring 2016.

Requirements

We have included a requirements.txt file in the setup folder. It lists the packages that should be installed with pip before doing any work with the files in this repo. All code is written in Python 3.

Getting the Data

Data are provided by:

US Internal Revenue Service (IRS):

US Department of Commerce:

City GDP Per Capita

By running merge_years.py with the data from the setup folder, you should be able to replicate the exact raw data we used for this project. merge_years.py explicitly reads the following files:

py12_990.dat
py13_990.dat
py14_990.dat
15eofinextract990.dat
eo1.csv
eo2.csv
eo3.csv
eo4.csv

The GDP per capita data are not included in this list: they were merged separately into the zipmsa.csv file and renamed. We plan to add support for that step eventually; in the meantime, it is a relatively easy process to carry out on your own if you need to.

Strangely, and perhaps worrisomely, the IRS data contain repeated EINs, which means that some organizations appear more than once in the data, often with conflicting values. Because EINs are the de facto unique IDs for the data, we de-duplicate on EIN and keep the first entry we encounter for each one.
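The de-duplication rule described above can be sketched as follows. This is a minimal illustration in plain Python, not the code from merge_years.py; the `dedupe_first` helper, the `total_revenue` field, and the sample EIN values are hypothetical.

```python
def dedupe_first(records, key="EIN"):
    """Keep the first record seen for each key value, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        value = rec[key]
        if value in seen:
            continue  # a later, possibly conflicting, entry for the same EIN
        seen.add(value)
        unique.append(rec)
    return unique


# Hypothetical rows: the third row repeats the first EIN with a different value.
rows = [
    {"EIN": "000000001", "total_revenue": 100},
    {"EIN": "000000002", "total_revenue": 250},
    {"EIN": "000000001", "total_revenue": 175},
]
deduped = dedupe_first(rows)
```

If the data are held in a pandas DataFrame, `df.drop_duplicates(subset="EIN", keep="first")` applies the same rule. Note that "first" depends on row order, so the order in which the yearly files are merged determines which entry survives.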

Building Features

The setup folder also contains the feature_generation.py file. This script fully automates feature generation for the project: it takes the output of merge_years.py and creates a new .csv containing only the features to be used.

Modeling

The model folder contains the code used to run our models. acg_process.py contains a model loop that takes a feature set as input and prints the results from the selected models to the screen. As currently configured, it runs all models on all features and carries out the necessary data transformations and cleaning.

From these results, we then run model_pickle.py, which takes the specifications of the model that performed best in acg_process.py and re-runs it against splits of the dataset in a Monte Carlo style simulation. We modified this file on the fly to run Monte Carlo cross-validation with a pickled file; we will work to add support for this functionality, but it is not here yet.
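For readers unfamiliar with the idea, Monte Carlo cross-validation repeatedly draws a random train/test split and averages the scores. The sketch below is a generic illustration under stated assumptions, not the logic in model_pickle.py: `monte_carlo_splits`, `majority_label`, and the toy labels are hypothetical, and a majority-class baseline stands in for the real model.

```python
import random


def monte_carlo_splits(n_rows, n_iter=10, test_frac=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random shuffles."""
    rng = random.Random(seed)
    n_test = max(1, round(n_rows * test_frac))
    for _ in range(n_iter):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]


def majority_label(labels):
    """Stand-in 'model': always predict the most common training label."""
    return max(set(labels), key=labels.count)


# Toy binary labels (hypothetical), e.g. 1 = unstable, 0 = stable.
y = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

accuracies = []
for train_idx, test_idx in monte_carlo_splits(len(y), n_iter=5, test_frac=0.3, seed=42):
    guess = majority_label([y[i] for i in train_idx])
    hits = sum(1 for i in test_idx if y[i] == guess)
    accuracies.append(hits / len(test_idx))

mean_accuracy = sum(accuracies) / len(accuracies)
```

Unlike k-fold cross-validation, the test sets drawn here may overlap between iterations, so the number of iterations and the test fraction can be chosen independently.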

run_validation.py runs the pickled model against a validation year. This step is essential: given the structure of IRS data reporting, a model trained on previous years must be able to predict the following year, because the data for that following year won't be available until the year after it.
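The year-based hold-out described above can be sketched as follows. This is a minimal illustration, not the code in run_validation.py; the `split_by_year` helper, the `tax_year` and `stable` fields, and the sample records are hypothetical.

```python
def split_by_year(records, validation_year, year_key="tax_year"):
    """Train on every year before validation_year; validate on that year only."""
    train = [r for r in records if r[year_key] < validation_year]
    valid = [r for r in records if r[year_key] == validation_year]
    return train, valid


# Hypothetical records spanning three filing years.
records = [
    {"EIN": "000000001", "tax_year": 2012, "stable": 1},
    {"EIN": "000000002", "tax_year": 2012, "stable": 0},
    {"EIN": "000000001", "tax_year": 2013, "stable": 1},
    {"EIN": "000000002", "tax_year": 2013, "stable": 0},
    {"EIN": "000000001", "tax_year": 2014, "stable": 0},
]
train, valid = split_by_year(records, validation_year=2014)
```

Splitting on the year, rather than at random, keeps later-year information out of training, which mirrors how the model would be used once new filings arrive.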

