I’ll probably have to fix this one; just taking notes here for now.
We use the KL divergence to test whether the joint distribution of (X,Y) on the samples for which the contributor Z is not NA, P(XY)|Z_notNA, is too different from the original P(XY).
If it is very different, then the value of I(X;Y|Z) does not really give information about Z as a contributor; see this extreme example:
For now the value KL(P(XY)|Z_notNA || P(XY)) is compared to log(N_nonNA), which probably captures the worst cases of selection bias but may not be what we want.
One obvious flaw is that log(N_nonNA) increases with the subsample size, whereas we expect it to be harder to create a strong selection bias when adding more samples to the subsample, so the threshold should arguably decrease instead.
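For concreteness, here is a minimal sketch of this check on toy data (everything below is hypothetical and illustrative, not the actual implementation; it assumes integer-coded discrete X and Y, and KL in nats):

```python
import numpy as np

def empirical_joint(x, y, kx, ky):
    """Empirical joint distribution of two integer-coded discrete variables."""
    counts = np.zeros((kx, ky))
    np.add.at(counts, (x, y), 1)
    return counts / counts.sum()

def kl_divergence(p, q):
    """KL(p || q) in nats; cells with p == 0 contribute nothing.
    Safe here because the subsample's support is contained in the full sample's."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy data: x, y integer-coded in {0..2}, z_na marks samples where Z is NA.
rng = np.random.default_rng(0)
n, kx, ky = 1000, 3, 3
x = rng.integers(0, kx, n)
y = rng.integers(0, ky, n)
z_na = rng.random(n) < 0.3          # ~30% of Z missing (at random here)

p_full = empirical_joint(x, y, kx, ky)              # P(XY)
keep = ~z_na
p_sub = empirical_joint(x[keep], y[keep], kx, ky)   # P(XY)|Z_notNA

n_non_na = int(keep.sum())
kl = kl_divergence(p_sub, p_full)
print(f"KL = {kl:.4f}  vs  log(N_nonNA) = {np.log(n_non_na):.4f}")
```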
In this image the blue distributions are the empirical distributions of 10K KL divergences for random subsampling (the null hypothesis), and the red line is log(N_nonNA) (grey number to the right), along with its empirical p-value.
The threshold should instead be defined as a p-value against the null (what can we expect from the null distribution, i.e. if data were really missing at random?), probably relative to either I(X;Y) (or H(X,Y)?): if I(X;Y) is already very low, it may be a good idea to be very strict about the value of the KL divergence.
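A minimal sketch of that calibration, reusing the helpers from the sketch above: draw random subsamples of size N_nonNA, compute their KL divergences against the full joint, and take the empirical p-value of the observed KL (again just an illustration under the same toy assumptions, not the actual implementation):

```python
def null_kl_distribution(x, y, kx, ky, n_sub, n_draws=10_000, rng=None):
    """KL divergences of uniformly random subsamples of size n_sub against the
    full joint: the null distribution for 'Z is missing completely at random'."""
    rng = rng if rng is not None else np.random.default_rng()
    p_full = empirical_joint(x, y, kx, ky)
    kls = np.empty(n_draws)
    for i in range(n_draws):
        idx = rng.choice(len(x), size=n_sub, replace=False)
        kls[i] = kl_divergence(empirical_joint(x[idx], y[idx], kx, ky), p_full)
    return kls

null_kls = null_kl_distribution(x, y, kx, ky, n_non_na, rng=rng)
# Empirical p-value: how often a random-subsampling KL is at least as large
# as the one observed for the actual non-NA subsample.
pvalue = float(np.mean(null_kls >= kl))
print(f"empirical p-value = {pvalue:.4f}")
```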
This may be considered a special case of the two-sample test (though many such tests require the two samples to be independent, which is not the case here since the subsample is drawn from the full sample).