Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add margin outlier filter #114

Open
wants to merge 15 commits into
base: develop
Choose a base branch
from
Open

Conversation

lennybronner
Copy link
Collaborator

@lennybronner lennybronner commented Oct 12, 2024

Description

Adding an outlier detection for margin outliers also (in addition to the turnout outliers, which we already have). Also added the ability to blacklist units (or even entire states).

Jira Ticket

Test Steps

Running the testbed like this will show 3352100 and 21083 be printed as unexpected units. And all Maryland units also:

python run.py 2020-11-03_USA_G redo --office_id P --estimands="['margin']" --features "['baseline_normalized_margin', 'ethnicity_likely_african_american', 'percent_bachelor_or_higher']" --current_run_name "rerun_2020_remove_outlier" --agg_model_preds --start_timestamp 2020-11-03T19:54:29-05:00 --end_timestamp 2020-11-04T06:07:13-05:00 --redo_run_model_every_n 100 --model_parameters "{'unit_blacklist': ['3352100', '21083'], 'postal_code_blacklist': ['MD']}"

@lennybronner lennybronner requested a review from a team as a code owner October 12, 2024 20:17
@dmnapolitano dmnapolitano self-assigned this Oct 16, 2024
@@ -163,24 +196,77 @@ def _get_unexpected_units(self, aggregates):

return unexpected_units

def _get_non_modeled_units(self, percent_reporting_threshold, turnout_factor_lower, turnout_factor_upper):
expected_geographic_units = self._get_expected_geographic_unit_fips().tolist()
def _fit_outlier_detection_model(self, reporting_units, response_variable, outlier_z_threshold):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a sense of:

  1. how much better this works compared to say, mean ± (z * standard deviation) or the IQR method?
  2. how much additional runtime this adds on to the model?

🤔



if "margin" in self.estimands:
# reporting_units["normalized_margin_change"] = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to keep this commented-out code? 🤔

Also can you remove the extra blank line above if "margin" in self.estimands? I am genuinely surprised pre-commit didn't remove that lol 🤔

@@ -1024,6 +1024,7 @@ def compute_bootstrap_errors(
self.weighted_z_test_pred = z_test_pred * weights_test
self.ran_bootstrap = True
self.n_contests = aggregate_indicator.shape[1]
# import IPython; IPython.embed()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove this? (Also I had never heard of IPython.embed() before and it's cool 😄 )

assert reporting_data.shape[0] == 133 - over


def test_unit_blacklist(va_governor_county_data, rng):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with either "blocklist" or "blacklist", so can you either change this to "blocklist" or change CombinedDataHandler to "blacklist"? 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants