Principal components not separating the data well in PCA-PRIM? #333

ostenst · 2023-12-13T14:56:55Z

Hi! I have a question related to the PCA-PRIM implementation of the Workbench. It seems like the pca_preprocess() function produces unexpected values, or maybe they are ordered in an unexpected way? I will illustrate the issue based on the sd_prim_PCA_flu.py example.

My main question is this: why are the first two principal components not separating the data very well? Perhaps I am plotting this wrong, or misunderstanding the dataframes rotated_experiments and rotation_matrix. But when I plot the rotated experiments on r_0 against r_1, the two classes are not clearly distinguishable under the classification applied:
y = outcomes["deceased population region 1"][:, -1] > 1000000

I guess they should be. I checked the eigenvalues of the covariance matrix of the original data, and the first and second eigenvalues are relatively large, so their principal components should actually separate the data well. Does anyone have an explanation? You can run the code below to see what I mean.

P.S. amazing work putting this Workbench together. Admirable!
Best regards,
Oscar Stenström

import matplotlib.pyplot as plt
import numpy as np
import ema_workbench.analysis.prim as prim
from ema_workbench import ema_logging, load_results

#
# .. (original) codeauthor:: jhkwakkel <j.h.kwakkel (at) tudelft (dot) nl>

ema_logging.log_to_stderr(level=ema_logging.INFO)

# load data
fn = r"1000 flu cases no policy.tar.gz"
x, outcomes = load_results(fn)

# specify y
y = outcomes["deceased population region 1"][:, -1] > 1000000

rotated_experiments, rotation_matrix = prim.pca_preprocess(x, y, exclude=["model", "policy"])

### I am interested in how well this rotation separates the data based on the outcome y (I think? Since y contains True/False based on policy failures/successes).
# To my understanding, the 1st and 2nd principal components should explain most variance. So I plot these:
plt.figure(figsize=(8, 6))
plt.scatter(rotated_experiments['r_0'], rotated_experiments['r_1'],c=y, alpha=0.75, edgecolors='k', s=100)
plt.title('r_0 and r_1 do not separate the data well')
plt.xlabel('r_0')
plt.ylabel('r_1')
plt.show()
print("The first two principal components are not separating the data very well, based on y.")

# However, these components are not separating the data very well based on y! Why is this?
# It could be that the components r_i are not ordered by (descending) influence, so I check the original eigenvalues:
x_original = x.drop("model",axis=1).drop("policy",axis=1)
x_cov = x_original.cov()
eigenvalues, eigenvectors = np.linalg.eig(x_cov.values)
print("The eigenvalues are: \n ",eigenvalues)
print("But the first two eigenvalues are very large, and their principal components should cause most of the variance. Strange!")


# # perform prim on modified results tuple
# prim_obj = prim.Prim(rotated_experiments, y, threshold=0.8)
# box = prim_obj.find_box()

# box.show_tradeoff()
# box.inspect(22)
# plt.show()

quaquel (Maintainer) · 2023-12-13T15:12:35Z

The PCA is done not on the entire experiments matrix but only on the subset of experiments that is of interest (i.e., TRUE). So your eigenvalues are based on all experiments, yet the rotation is fitted on a subset of them. Have you checked the paper by Dalal et al.?

To be clear: it is still conceivable that there is a mistake in the current code. It is not the most used or most extensively tested part of the code base.
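
The point about the subset can be sketched in plain NumPy. This is a minimal illustration of the idea (fit the rotation on the cases of interest, then apply it to every experiment), not the Workbench's actual code. The helper name `rotate_on_subset`, the synthetic data, and the choice to standardize with the statistics of all experiments are assumptions made for the example.

```python
import numpy as np


def rotate_on_subset(x, y):
    """Hypothetical helper: PCA rotation fitted on x[y], applied to all of x."""
    # Assumption: standardize every column using all experiments.
    z = (x - x.mean(axis=0)) / x.std(axis=0)

    # Covariance of the cases of interest only.
    cov = np.cov(z[y], rowvar=False)

    # eigh is meant for symmetric matrices and returns ascending eigenvalues,
    # so reorder to put the largest-variance direction first.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    rotation = eigenvectors[:, order]

    return z @ rotation, rotation, eigenvalues[order]


rng = np.random.default_rng(0)
x = rng.random((500, 3))            # made-up experiments, three inputs
y = x[:, 0] + x[:, 1] > 1.2         # made-up cases of interest

rotated, rotation, eigenvalues = rotate_on_subset(x, y)

# The eigenvalues describe the spread of the cases of interest only ...
print(np.var(rotated[y], axis=0, ddof=1))
# ... not the spread of the full set of experiments that ends up in a scatter plot.
print(np.var(rotated, axis=0, ddof=1))
```

In this sketch the eigenvalues match the per-axis variance of the rotated cases of interest, while the variance of all rotated experiments can look quite different, which is why eigenvalues from one set and a plot of the other are hard to compare.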

ostenst Dec 15, 2023
Author

Thanks for pointing that out, my mistake. Yes, I did look into the Dalal study, and tried to follow their procedure.

However, even when only selecting the cases of interest (TRUE), the problem remains: the first two eigenvalues are very large, but their principal components are not separating the data well:

fn = r"1000 flu cases no policy.tar.gz"
x, outcomes = load_results(fn)
y = outcomes["deceased population region 1"][:, -1] > 1000000

x = x.drop("model",axis=1).drop("policy",axis=1)
x_interesting = x[y]
x_cov = x_interesting.cov()
eigenvalues, eigenvectors = np.linalg.eig(x_cov.values)
print("The eigenvalues are: \n ",eigenvalues)

rotated_experiments, rotation_matrix = prim.pca_preprocess(x, y)
figure, ax = plt.subplots(figsize=(8, 6))
ax.scatter(rotated_experiments['r_0'], rotated_experiments['r_1'],c=y, alpha=0.75, edgecolors='k', s=100)
ax.set_xlabel('r_0')
ax.set_ylabel('r_1')
plt.show()

That said, you did illustrate below that the pca_preprocess() function works as intended in your test case. So, maybe the issue I highlight is just due to a peculiarity of the flu data set.
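
Two things may be worth ruling out when comparing eigenvalues by hand. First, `np.linalg.eig` does not return eigenvalues in any particular order. Second, the eigenvalues of a raw covariance matrix are dominated by whichever columns have the largest numeric range. The sketch below uses made-up data (not the flu results) to show the scale effect with sorted eigenvalues; whether this explains the flu case is not established in this thread.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up inputs: three independent uniform columns with very different ranges.
x = np.column_stack([
    rng.uniform(0, 1, 1000),
    rng.uniform(0, 100, 1000),
    rng.uniform(0, 0.01, 1000),
])


def sorted_eigenvalues(data):
    # eigvalsh assumes a symmetric matrix and returns ascending eigenvalues,
    # so reverse to get the largest first.
    return np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]


raw = sorted_eigenvalues(x)
standardized = sorted_eigenvalues((x - x.mean(axis=0)) / x.std(axis=0))

# Share of total variance per component.
print(raw / raw.sum())                    # first component dominates purely because of scale
print(standardized / standardized.sum())  # roughly equal shares for independent columns
```

If the eigenvalues computed by hand and the rotation used for the plot are based on differently scaled data, a "very large" eigenvalue may simply reflect the units of one input.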

quaquel (Maintainer) · 2023-12-13T15:43:35Z

I did a quick follow-up test.

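import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import ema_workbench.analysis.prim as prim

# two uniform inputs; the cases of interest lie above the line a + b = 1.2
x = pd.DataFrame(np.random.rand(250, 2), columns=['a', 'b'])
y = np.sum(x, axis=1) > 1.2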

rotated_experiments, rotation_matrix = prim.pca_preprocess(x, y)

figure, ax = plt.subplots(figsize=(8, 6))
ax.scatter(rotated_experiments['r_0'], rotated_experiments['r_1'],c=y, alpha=0.75, edgecolors='k', s=100)
ax.set_xlabel('r_0')
ax.set_ylabel('r_1')


plt.show()

This does produce the expected behavior:
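
The geometry of this two-variable test can also be checked in plain NumPy, without calling the Workbench (whose scaling and component ordering may differ from this sketch). Here the cases of interest form a triangle that is wider along a - b than along a + b, so the direction that separates the classes is the component with the smaller eigenvalue. Under these assumptions, a large eigenvalue by itself says nothing about how well a component separates the classes.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.random((2000, 2))          # columns play the role of 'a' and 'b'
y = x.sum(axis=1) > 1.2

# PCA fitted on the cases of interest only.
cov = np.cov(x[y], rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Alignment of each component with the separating direction (1, 1) / sqrt(2).
separating = np.array([1.0, 1.0]) / np.sqrt(2)
alignment = np.abs(eigenvectors.T @ separating)

print(eigenvalues)   # largest first
print(alignment)     # close to 0 for the first component, close to 1 for the second
```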
