The experiments use the library mlconfound. However, the original library contains a number of utility functions that make it harder to read and understand. There are essentially three steps in the CPT algorithm for computing the p-value:
- estimate the probability density $q(c|y)$. Note that we condition the probability density estimation on $\textbf{y}$, which can be either continuous or categorical.

```python
import numpy as np
from pygam import LinearGAM
from scipy.stats import norm


def conditional_log_likelihood(X, C, xdtype='categorical'):
    default_kwargs = {'n_splines': 8, 'dtype': [xdtype]}
    # fit E[X | C] with a penalized-spline GAM
    fit = LinearGAM(**default_kwargs).gridsearch(y=X, X=C.reshape(-1, 1), progress=False)
    mu = np.array(fit.predict(C))
    sigma = np.repeat(np.std(X - mu), len(C))
    # entry [i, j] is the Gaussian log-density of X[i] under the model conditioned on C[j]
    return np.array([norm.logpdf(X, loc=m, scale=sigma) for m in mu]).T
```
A GAM model is used to estimate

$$c = \alpha + \beta f(y) + e$$

where the feature function $f$ is built using penalized B-splines. If we write $\mu = \alpha + \beta f(y)$ and let $\sigma$ denote the standard deviation of the residual $e$, the conditional density is modelled as the Gaussian $q(c|y) = \mathcal{N}(c;\, \mu, \sigma^2)$, and the function above returns the matrix of log-densities $\ell(c_i|y_j)$.
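To make the shape and meaning of that matrix concrete, here is a minimal self-contained sketch on synthetic data. It swaps the penalized-spline GAM for an ordinary least-squares line but keeps everything else from the function above: a Gaussian conditional with fitted mean and residual standard deviation, and a matrix whose entry $[i, j]$ is $\ell(c_i|y_j)$. The variable names are illustrative, not mlconfound's.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(size=200)
c = 0.5 * y + rng.normal(scale=0.8, size=200)  # confound correlated with y

# ordinary least-squares line as a simplified stand-in for the GAM
beta, alpha = np.polyfit(y, c, deg=1)
mu = alpha + beta * y                # E[c | y]
sigma = np.std(c - mu)               # residual standard deviation

# log_likelihood_mat[i, j] = log q(c_i | y_j)
log_likelihood_mat = norm.logpdf(c[:, None], loc=mu[None, :], scale=sigma)
print(log_likelihood_mat.shape)      # (200, 200)
```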
- sample permuted copies $c^{(i)}$ of the confound using a Markov chain Monte Carlo sampler (the slowest part of the algorithm)

```python
def generate_X_CPT_MC(nstep, log_likelihood_mat, Pi, random_state=None):
    # Pi is a permutation of the array indices 0, ..., n-1
    n = len(Pi)
    npair = np.floor(n / 2).astype(int)
    rng = np.random.default_rng(random_state)
    for istep in range(nstep):
        perm = rng.choice(n, n, replace=False)
        inds_i = perm[0:npair]
        inds_j = perm[npair:(2 * npair)]
        # for each k=1,...,npair, decide whether to swap Pi[inds_i[k]] with Pi[inds_j[k]]
        log_odds = log_likelihood_mat[Pi[inds_i], inds_j] + log_likelihood_mat[Pi[inds_j], inds_i] \
            - log_likelihood_mat[Pi[inds_i], inds_i] - log_likelihood_mat[Pi[inds_j], inds_j]
        swaps = rng.binomial(1, 1 / (1 + np.exp(-np.maximum(-500, log_odds))))
        Pi[inds_i], Pi[inds_j] = Pi[inds_i] + swaps * (Pi[inds_j] - Pi[inds_i]), \
            Pi[inds_j] - swaps * (Pi[inds_j] - Pi[inds_i])
    return Pi
```
It draws disjoint pairs in parallel and randomly decides whether or not to swap each pair, according to the odds ratio calculated from the conditional densities of the original and the swapped data:

$$\ln\frac{q(c_j|y_i)\,q(c_i|y_j)}{q(c_i|y_i)\,q(c_j|y_j)} = \ell(c_j|y_i)+\ell(c_i|y_j)-\ell(c_i|y_i)-\ell(c_j|y_j)$$

where $\ell$ denotes the log-likelihood from step 1.
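Continuing the synthetic sketch from step 1, the sampler can be driven roughly like this to obtain `num_perm` permuted copies of the confound: one burn-in run starting from the identity permutation, then a short chain per draw. The step counts and the names `num_perm`, `nstep` and `c_perm` are illustrative choices, not mlconfound's defaults.

```python
num_perm, nstep = 1000, 50
n = len(c)

# burn-in: let the chain mix, starting from the identity permutation
Pi_init = generate_X_CPT_MC(nstep, log_likelihood_mat,
                            np.arange(n, dtype=int), random_state=0)

# each draw continues the chain for nstep further swap sweeps
c_perm = np.zeros((num_perm, n))
for i in range(num_perm):
    Pi = generate_X_CPT_MC(nstep, log_likelihood_mat, Pi_init.copy(), random_state=i)
    c_perm[i, :] = c[Pi]
```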
- calculate the p-value

```python
# test statistic on the original data
t_x_y = np.corrcoef(x, y)[0, 1]

# the same statistic for each of the num_perm permuted samples
t_xpi_y = np.zeros(num_perm)
y_tile = np.tile(y, (num_perm, 1))
for i in range(num_perm):
    t_xpi_y[i] = np.corrcoef(x_perm[i, :], y_tile[i, :])[0, 1]

# fraction of permutations with a statistic at least as large as the observed one
p = np.sum(t_xpi_y >= t_x_y) / len(t_xpi_y)
```
$$p = \frac{\sum^m_{i=1}{\mathbb{1}\left[T(c^{(i)}, \hat{y}, y) \geq T(c, \hat{y}, y)\right]}}{m}$$

where $m$ is the number of permutations and $T$ is the test statistic; the snippet above uses a plain correlation coefficient as $T$.
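For example, with $m = 1000$ permutations, if 37 of the permuted statistics are at least as large as the observed one, the test reports $p = 37/1000 = 0.037$.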