Function to identify variable that can be best predicted from a set of base variables #38

MaxGhenis · 2020-07-29T22:08:21Z

This would help for defining the sequence of variables to impute or synthesize. Something like this would fit well in other functions:

def most_predictable(df, base_cols, candidate_cols, algorithm):
    """ Identifies the most predictable column from a set of base columns.
    
    Args:
        df: DataFrame with base and candidate columns.
        base_cols: List of column names to predict from.
        candidate_cols: List of column names to compare on predictability given base_cols.
        algorithm: Algorithm for determining predictability.

    Returns:
        Column name from candidate_cols which is most predictable from base_cols.
    """

This could be done with something like correlations, or algorithms like random forests (after standardizing data, and the standardization technique might be another arg).
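As a rough illustration of the simplest variant, here is a minimal sketch that scores each candidate by the in-sample R² of a least-squares fit on the base columns. Treating `algorithm` as a callable that takes `(X, y)` and returns a score is an assumption made for this sketch, as are the helper name `ols_r_squared` and the demo data; a random forest scorer or a standardization step could be swapped in the same way.

```python
import numpy as np
import pandas as pd


def ols_r_squared(X, y):
    """In-sample R^2 of a least-squares fit of y on X plus an intercept.

    Hypothetical helper, used only as the default scorer in this sketch.
    """
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    total = ((y - y.mean()) ** 2).sum()
    # Guard against division by zero for a constant column; scoring it 0
    # is an arbitrary choice made for this sketch.
    return 0.0 if total == 0 else 1.0 - (residuals ** 2).sum() / total


def most_predictable(df, base_cols, candidate_cols, algorithm=ols_r_squared):
    """Returns the candidate column with the highest predictability score.

    Assumption: `algorithm` is a callable taking (X, y) arrays and
    returning a score where higher means more predictable.
    """
    X = df[base_cols].to_numpy(dtype=float)
    scores = {
        col: algorithm(X, df[col].to_numpy(dtype=float))
        for col in candidate_cols
    }
    return max(scores, key=scores.get)


# Made-up demo data: one candidate closely tracks x, the other is pure noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "near_copy": 2 * x + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})
best = most_predictable(demo, ["x"], ["noise", "near_copy"])
```

In-sample R² favors whatever a linear model fits well, so a cross-validated score would likely be a better default in practice.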

cc @rickecon, per our chat if you can take a stab at this that'd be awesome.


MaxGhenis · 2020-07-30T18:18:41Z

Some relevant sections from OSPC's synthetic PUF working paper:

Visit sequence definition. Some analysts have shared that they define the sequence according to a presumed causal chain, which in our case could mean income preceding the charitable deduction.

We based the visit sequence largely on descending size of the variables (their weighted values in the PUF). We did not experiment extensively with the order of synthesis, and this may or may not be the best way to approach the issues. Drechsler and Reiter (2011) describe several approaches to this.

That links to Drechsler and Reiter (2011), “An Empirical Evaluation of Easily Implemented, Nonparametric Methods for Generating Synthetic Datasets.”

cc @donboyd5 who looked at the visit sequence in that project (Don, we're looking at defining a sequence for imputation across datasets).

donboyd5 · 2020-07-31T13:25:31Z

Thanks, @MaxGhenis. I am really glad to see you working on improving the synthesis process - I think there are a lot of ways that it can be improved relative to our initial effort and this issue in particular seems ripe for exploration. Is the idea that you would define the full sequence by calling most_predictable iteratively, removing the previously found most predictable variable from candidate_cols and moving it into base_cols at the start of each new iteration? That certainly seems intuitively appealing and testable.

MaxGhenis · 2020-07-31T22:43:55Z

> Is the idea that you would define the full sequence by calling most_predictable iteratively, removing the previously found most predictable variable from candidate_cols and moving it into base_cols at the start of each new iteration?

@donboyd5 that's exactly right.
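For concreteness, the iterative procedure could be sketched roughly as follows. The name `visit_sequence` and the `pick` parameter are hypothetical, and `pick_by_correlation` is only a stand-in with the same call shape as `most_predictable` (minus the `algorithm` argument) so that the example runs on its own.

```python
import numpy as np
import pandas as pd


def pick_by_correlation(df, base_cols, candidate_cols):
    """Hypothetical stand-in for most_predictable.

    Picks the candidate with the largest absolute Pearson correlation
    with any single base column.
    """
    corr = df.corr().abs()
    return corr.loc[candidate_cols, base_cols].max(axis=1).idxmax()


def visit_sequence(df, base_cols, candidate_cols, pick=pick_by_correlation):
    """Orders candidate_cols by repeatedly picking the most predictable one.

    After each pick, the chosen column moves from the remaining candidates
    into the base columns used for the next iteration.
    """
    base = list(base_cols)            # copies, so the caller's lists are untouched
    remaining = list(candidate_cols)
    order = []
    while remaining:
        best = pick(df, base, remaining)
        order.append(best)
        remaining.remove(best)
        base.append(best)
    return order


# Made-up demo data: b tracks a closely, and c tracks b more loosely.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = b + 0.5 * rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "c": c})

sequence = visit_sequence(demo, ["a"], ["c", "b"])
```

Each iteration rescores every remaining candidate, so the number of scoring calls grows quadratically with the number of candidates. That may matter if the scorer is something heavier, such as a random forest.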
