Defining Necessary Number of Dimensions #32

Open
BradKML opened this issue Oct 25, 2022 · 1 comment
Labels
question A question regarding usage, implementation etc

Comments

BradKML commented Oct 25, 2022

For fun I also borrowed some other data from This Link to see how personality and test performance can be condensed into a dimensionally reduced model.
personality_score.csv

Question 1: What is the proper way of selecting a sufficient number of dimensions so as to preserve the data while avoiding noise? Kaiser–Meyer–Olkin, Levene, and others all seem to be better descriptors than the "eigenvalue > 1" rule (a sketch follows these questions).
Question 2: Can PCA be integrated with something else so that it behaves like PCR or Lasso regression? (i.e. reducing the number of unnecessary columns before trying to fit an accurate model; a sketch follows the code and figure below)
Question 3: Can ICA be used to discover significant columns? It is sometimes presented as a way to isolate components after using PCA to assess the proper dimension count.
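
A possible angle on Question 1: scikit-learn's PCA accepts a fractional n_components, which keeps the smallest number of components whose cumulative explained variance reaches that fraction. This is only one heuristic among several (parallel analysis is another); the 0.95 threshold and the reuse of the CSV below are assumptions for illustration, not recommendations from this library.

from pandas import read_csv
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = read_csv('https://files.catbox.moe/4nztka.csv')
X = df.drop(columns=[df.columns[0], 'AFQT'])  # drop index column and target

# A float n_components keeps the fewest components reaching that share of variance.
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(StandardScaler().fit_transform(X))
print(X_reduced.shape[1], 'components retained')
print(pca_95.explained_variance_ratio_.cumsum())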

!pip install pca
from pandas import read_csv
from pca import pca

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])    # drop the index column from the CSV
y = df[['AFQT']]                       # test-performance target
X = df.drop(columns=['AFQT'])          # personality items
model = pca(normalize=True)            # standardize the columns before PCA
results = model.fit_transform(X)
print(model.results['explained_var'])  # explained variance of the retained components
fig, ax = model.plot()                 # explained-variance plot
fig.savefig('personality_performance.png')

personality_performance.png
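
On Question 2, one PCR-style option is to chain PCA with a regressor in a scikit-learn Pipeline and compare it against Lasso, which prunes unhelpful columns by shrinking their coefficients towards zero. A rough sketch, assuming the same CSV, no missing values, and an arbitrary component count of six:

from pandas import read_csv
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import cross_val_score

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])
X, y = df.drop(columns=['AFQT']), df['AFQT']

# Principal component regression: scale, project onto a few components, regress.
pcr = make_pipeline(StandardScaler(), PCA(n_components=6), LinearRegression())
# Lasso keeps all columns but shrinks unhelpful coefficients to zero.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))

print('PCR   R^2:', cross_val_score(pcr, X, y, cv=5).mean())
print('Lasso R^2:', cross_val_score(lasso, X, y, cv=5).mean())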


BradKML commented Oct 25, 2022

Current finding: with RobustScaler, six components are enough to describe the data under the eigenvalue > 1 rule, [9.136, 2.683, 2.078, 1.420, 1.328, 1.090], similar to the count without scaling.
MaxAbsScaler yields only one weak component, while StandardScaler yielded 159 components, with the first six being [52.051, 12.815, 9.561, 8.205, 6.741, 5.902]. It seems that normalization does not help with clearing out noise in some cases.

Question 4: How can one check the significance of an ICA component?
Question 5: If one were to use the 159 components, what is the strategy for determining the most useful columns in each component? (a sketch follows the code below)

from pandas import read_csv, DataFrame
from sklearn.preprocessing import MaxAbsScaler, RobustScaler, StandardScaler
from sklearn.decomposition import PCA, FastICA

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])  # drop the index column from the CSV
X, y = df.drop(columns=['AFQT']), df[['AFQT']]

# X_transformed = MaxAbsScaler().fit_transform(X)   # in case of Yes/No questions
X_transformed = RobustScaler().fit_transform(X)     # in case of Likert scales
# X_transformed = StandardScaler().fit_transform(X) # in case of aggregates

# Keep only the components with eigenvalue > 1 (Kaiser rule).
pca = PCA()
X_transformed_pca = pca.fit_transform(X_transformed)
suff_len = len([i for i in pca.explained_variance_ if i > 1])
print(pca.explained_variance_[:suff_len])

# Re-run ICA with the component count suggested by PCA; each row of
# df_comp holds the loadings of one independent component on the columns.
ica = FastICA(n_components=suff_len)
X_transformed_ica = ica.fit_transform(X_transformed)
df_comp = DataFrame(ica.components_, columns=X.columns)
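
FastICA itself does not provide a significance test (Question 4); one rough heuristic is the non-Gaussianity of each recovered source, e.g. its excess kurtosis, since sources that look Gaussian carry little independent structure. For Question 5, the most influential columns of each component can be read off the largest absolute loadings in df_comp. A sketch continuing from the variables above; the kurtosis criterion and the cutoff of five columns are assumptions, not an established test:

from scipy.stats import kurtosis

for i in range(suff_len):
    # Excess kurtosis of the i-th recovered source (0 would be Gaussian).
    k = kurtosis(X_transformed_ica[:, i])
    # Columns with the largest absolute loadings on this component.
    top_cols = df_comp.iloc[i].abs().sort_values(ascending=False).head(5)
    print(f'Component {i}: excess kurtosis = {k:.2f}')
    print(top_cols)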
