This project contains code to estimate vote shares and seat
distributions using CEO Barometre data. The code is designed to be run
from the Snakefile (see the section on execution), but it can also be
run interactively.
- src contains all the R scripts. The purpose of each script is listed below.
- dta will host all the relevant results created by the scripts as well as the input data. Note that the scripts expect the raw data to live in dta/raw-dta. All the estimated models will be written to dta/models.
- img will host the images generated by the scripts.
- config includes a config.yaml file that defines configuration variables that are used throughout the code. Most of these variables are just the locations/names of the relevant folders.
The project is structured in the following way. Each script performs a
single task. All the scripts read in some data and write out the
results of the corresponding operations. The input and output data for
each script are documented in the Snakefile.
Note that all the scripts read from config/config.yaml. This is a
YAML configuration that sets the main paths that are used in the
project, like the location and name of the folders, and the final
colors and names used for each party. The variables defined in the
configuration file are attached to the R global environment for ease
of use.
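As a rough illustration of what attaching the configuration to the global environment can look like, the sketch below copies the entries of a list into the global environment with base R. The list is a hypothetical stand-in for the parsed file: DTA_FOLDER and IMG_FOLDER are assumed names chosen for the example, and the actual scripts may load and attach the file differently.

```r
# Hypothetical stand-in for the parsed contents of config/config.yaml.
# With the yaml package installed, the list could instead come from
# yaml::read_yaml("config/config.yaml").
config <- list(
  DTA_FOLDER = "dta",  # assumed variable name, for illustration only
  IMG_FOLDER = "img",  # assumed variable name, for illustration only
  FOLDS = 5,
  REPEATS = 5
)

# Copy every top-level entry of the list into the global environment,
# so that later code can refer to DTA_FOLDER, FOLDS, etc. directly.
list2env(config, envir = globalenv())

# The attached variables can now be used to build paths.
models_folder <- file.path(DTA_FOLDER, "models")
```

The convenience comes at the cost of global state: any script that reuses one of these names overwrites the configured value.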
- 01-data-cleaning.R reads in the raw data in SPSS format and selects and transforms the variables that are used during the rest of the pipeline. It is worth noting that the file also transforms the names of the different parties to a format that can be used as factor names by R.
- 02-past-behavior.R estimates weights that match the reported electoral behavior and the distribution of primary language of the respondents to known frame values. A proportion of respondents say that they don't remember who they voted for or even whether they voted in the last election. For these individuals, model predictions are used.
- 03-expected-behavior.R estimates the electoral behavior of all respondents in the survey at the individual level. There are two behaviors of interest: whether the respondent will vote and the party they will vote for. A proportion of respondents do not report one or both behaviors, and for them the expected behavior is, as before, assigned using two predictive models. The model for party choice assigns a party to each respondent, but the model for turnout assigns a probability of voting. A cutoff probability is then estimated from the ROC curve of the model.
- 04-vote-shares.R uses the individual predictions about past and expected behavior and estimates vote shares at the Catalonia level. Individuals who reported that they don't know who they will vote for are assigned the predictions from the party choice model. Individuals with a probability of voting below the cutoff are expected not to vote.
- 05-district-shares.R estimates district-level vote shares using a combination of survey data at the district level and some priors to compensate for the small sample size. The priors are set to the expected deviation between the electoral results of each district and those of Catalonia as a whole in the previous election. This script uses the package dshare, which needs to be installed separately.
- 06-seat-estimates.R uses the district-level vote shares to simulate the distribution of seats for each party. This script uses the package escons, which needs to be installed separately.
- 07-report-figures.R prepares the final figures included in the report. Note: This file is not executed via the Snakefile and will likely contain dependencies different from those listed in the renv.lock.
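The turnout cutoff mentioned for 03-expected-behavior.R can be illustrated with a small base R sketch. The text does not say which criterion is used to pick the point on the ROC curve, so the example assumes Youden's J (sensitivity + specificity - 1), which is one common choice. The data are simulated and the names voted and p_vote are hypothetical.

```r
# Simulated data, for illustration only: observed turnout (1 = voted) and
# predicted probabilities of voting from a hypothetical turnout model.
set.seed(1)
voted <- rbinom(500, size = 1, prob = 0.7)
p_vote <- plogis(rnorm(500, mean = ifelse(voted == 1, 1.5, -0.5)))

# Candidate cutoffs: every distinct predicted probability.
cutoffs <- sort(unique(p_vote))

# Sensitivity and specificity of the rule "votes if p_vote >= cutoff".
sens <- sapply(cutoffs, function(k) mean(p_vote[voted == 1] >= k))
spec <- sapply(cutoffs, function(k) mean(p_vote[voted == 0] < k))

# Youden's J picks the ROC point farthest above the diagonal
# (an assumed criterion; the project may use a different one).
cutoff <- cutoffs[which.max(sens + spec - 1)]

# Respondents below the cutoff are expected not to vote.
will_vote <- p_vote >= cutoff
```

Other criteria, such as the ROC point closest to the top-left corner, would give a different cutoff from the same curve.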
The Snakefile will ensure that the scripts performing data analysis
are executed in the correct order.
The project can be executed via the Snakefile, which will run all the
scripts in the correct order. More information about this file can be
found in the Snakemake documentation.
Make sure that Snakemake is installed on your machine, for instance, using
pip3 install snakemake
and then run:
snakemake --cores all
Alternatively, the scripts can be run separately from the shell or
from an interactive session. In this case, it is important to remember
that all paths are currently set relative to the top folder. In other
words, make sure that getwd() points to the folder in which the
Snakefile lives (that is, the folder above where all the R scripts
live).
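A small guard can catch a wrong working directory before any relative path fails. The helper below is a hypothetical sketch and not part of the project: check_project_root is an invented name, and it simply treats the folder that contains the Snakefile as the project root.

```r
# Stop early if the given folder is not the project root, which is taken
# here to be the folder that contains the Snakefile.
# check_project_root is a hypothetical helper, not part of the project.
check_project_root <- function(path = getwd()) {
  if (!file.exists(file.path(path, "Snakefile"))) {
    stop("Run the scripts from the folder that contains the Snakefile; ",
         "the folder checked was ", path, call. = FALSE)
  }
  invisible(path)
}
```

Calling check_project_root() at the start of an interactive session either returns silently or stops with a message naming the folder that was checked.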
The order of execution is the order in which the files are listed above. The full project can be run manually using:
Rscript src/01-data-cleaning.R
Rscript src/02-past-behavior.R
Rscript src/03-expected-behavior.R
Rscript src/04-vote-shares.R
Rscript src/05-district-shares.R
Rscript src/06-seat-estimates.R
The project dependencies are listed in renv.lock. Check the renv
package for more information about how to install them in a separate
environment for reproducibility.
The project uses three machine learning models: one to estimate past behavior, another to estimate expected party choice, and a third to estimate whether the respondent will vote. All three models use the same structure (including similar right-hand-side variables) and very similar code. Keep in mind that these models may take several hours to run.
One way to reduce runtime is to limit the size of the grid used for parameter search. For instance, the following snippet defines a search grid with 180 search points:
# 3 values of eta x 3 values of max_depth x 20 values of nrounds = 180 points
grid_partychoice <- expand.grid(eta = c(.01, .005, .001),
                                max_depth = c(1, 2, 3),
                                min_child_weight = 1,
                                subsample = 0.8,
                                colsample_bytree = 0.8,
                                nrounds = seq(1, 15, length.out = 20) * 100,
                                gamma = 0)
The grid is then run 5 times over each of the 5 folds (see the
variables FOLDS and REPEATS in config/config.yaml). That means
4,500 runs of a given model. It is possible to make the size of the
grid smaller by more carefully selecting or searching some of the
parameters above, perhaps by setting eta to a single, small value and
focusing on identifying good values of nrounds.
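The arithmetic behind these numbers can be checked directly. The sketch below rebuilds the grid from the snippet above, counts the model fits, and compares it with a smaller grid. The smaller grid is only an assumed example (a single eta and a coarser range of nrounds), not a recommendation from the project, and n_fits is a hypothetical helper.

```r
# Resampling settings; both are 5 according to the text above.
FOLDS <- 5
REPEATS <- 5

# Full grid, as in the snippet above.
grid_full <- expand.grid(eta = c(.01, .005, .001),
                         max_depth = c(1, 2, 3),
                         min_child_weight = 1,
                         subsample = 0.8,
                         colsample_bytree = 0.8,
                         nrounds = seq(1, 15, length.out = 20) * 100,
                         gamma = 0)

# Hypothetical smaller grid: one small eta and a coarser range of nrounds.
grid_small <- expand.grid(eta = .01,
                          max_depth = c(1, 2, 3),
                          min_child_weight = 1,
                          subsample = 0.8,
                          colsample_bytree = 0.8,
                          nrounds = seq(100, 1500, by = 100),
                          gamma = 0)

# Number of model fits: grid points x folds x repeats.
n_fits <- function(grid) nrow(grid) * FOLDS * REPEATS

n_fits(grid_full)   # 180 points x 25 resamples = 4500
n_fits(grid_small)  # 45 points x 25 resamples = 1125
```

Under these assumptions the smaller grid needs a quarter of the fits, at the cost of not exploring alternative values of eta.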