I’ll probably have to fix this one; just taking notes here for now.
We use the KL divergence to test whether the joint distribution of (X,Y) on the samples for which the contributor Z is not NA, P(XY)|Z_notNA, is too different from the original P(XY).
If it is very different, then the value of I(X;Y|Z) does not really give information about Z as a contributor; see this extreme example:
For now the value KL(P(XY)|Z_notNA || P(XY)) is compared to log(N_nonNA), which probably captures the worst cases of selection bias but may not be what we want.
One obvious flaw is that log(N_nonNA) increases with the subsample size, whereas we expect it to be harder to create a strong selection bias when adding more samples to the subsample, so the threshold should arguably decrease instead.
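For concreteness, here is a minimal sketch of this check on toy data (everything below is hypothetical and illustrative, not the actual implementation; it assumes integer-coded discrete X and Y, and KL in nats):

```python
import numpy as np

def empirical_joint(x, y, kx, ky):
    """Empirical joint distribution of two integer-coded discrete variables."""
    counts = np.zeros((kx, ky))
    np.add.at(counts, (x, y), 1)
    return counts / counts.sum()

def kl_divergence(p, q):
    """KL(p || q) in nats; cells with p == 0 contribute nothing.
    Safe here because the subsample's support is contained in the full sample's."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy data: x, y integer-coded in {0..2}, z_na marks samples where Z is NA.
rng = np.random.default_rng(0)
n, kx, ky = 1000, 3, 3
x = rng.integers(0, kx, n)
y = rng.integers(0, ky, n)
z_na = rng.random(n) < 0.3          # ~30% of Z missing (at random here)

p_full = empirical_joint(x, y, kx, ky)              # P(XY)
keep = ~z_na
p_sub = empirical_joint(x[keep], y[keep], kx, ky)   # P(XY)|Z_notNA

n_non_na = int(keep.sum())
kl = kl_divergence(p_sub, p_full)
print(f"KL = {kl:.4f}  vs  log(N_nonNA) = {np.log(n_non_na):.4f}")
```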
In this image the blue distributions are the empirical distributions of 10K KL divergences for random subsampling (the null hypothesis), and the red line is log(N_nonNA) (grey number to the right), along with its empirical p-value.
The threshold should instead be defined as a p-value against the null (what can we expect from the null distribution, i.e. if data were really missing at random?), probably relative to either I(X;Y) (or H(X,Y)?): if I(X;Y) is already very low, it may be a good idea to be very strict about the value of the KL divergence.
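A minimal sketch of that calibration, reusing the helpers from the sketch above: draw random subsamples of size N_nonNA, compute their KL divergences against the full joint, and take the empirical p-value of the observed KL (again just an illustration under the same toy assumptions, not the actual implementation):

```python
def null_kl_distribution(x, y, kx, ky, n_sub, n_draws=10_000, rng=None):
    """KL divergences of uniformly random subsamples of size n_sub against the
    full joint: the null distribution for 'Z is missing completely at random'."""
    rng = rng if rng is not None else np.random.default_rng()
    p_full = empirical_joint(x, y, kx, ky)
    kls = np.empty(n_draws)
    for i in range(n_draws):
        idx = rng.choice(len(x), size=n_sub, replace=False)
        kls[i] = kl_divergence(empirical_joint(x[idx], y[idx], kx, ky), p_full)
    return kls

null_kls = null_kl_distribution(x, y, kx, ky, n_non_na, rng=rng)
# Empirical p-value: how often a random-subsampling KL is at least as large
# as the one observed for the actual non-NA subsample.
pvalue = float(np.mean(null_kls >= kl))
print(f"empirical p-value = {pvalue:.4f}")
```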
This may be considered a special case of the two-sample test (though many such tests require the two samples to be independent, which is not the case here since the subsample is drawn from the full sample).