Defining Necessary Number of Dimensions #32

Open
BradKML opened this issue Oct 25, 2022 · 1 comment
Labels
question A question regarding usage, implementation etc

Comments

BradKML commented Oct 25, 2022

For fun I also borrowed some other data from This Link to see how personality and test performance can be condensed into a dimensionally reduced model.
personality_score.csv

Question 1: What is the proper way of selecting a sufficient number of dimensions so as to preserve the data while avoiding noise? Kaiser–Meyer–Olkin, Levene, and others all seem to be better descriptors than the "eigenvalue > 1" rule (a sketch follows these questions).
Question 2: Can PCA be integrated with something else so that it behaves like PCR or Lasso regression? (i.e. reducing the number of unnecessary columns before trying to fit an accurate model; a sketch follows the code and figure below)
Question 3: Can ICA be used to discover significant columns? It is sometimes presented as a way to isolate components after using PCA to assess the proper dimension count.
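
A possible angle on Question 1: scikit-learn's PCA accepts a fractional n_components, which keeps the smallest number of components whose cumulative explained variance reaches that fraction. This is only one heuristic among several (parallel analysis is another); the 0.95 threshold and the reuse of the CSV below are assumptions for illustration, not recommendations from this library.

from pandas import read_csv
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = read_csv('https://files.catbox.moe/4nztka.csv')
X = df.drop(columns=[df.columns[0], 'AFQT'])  # drop index column and target

# A float n_components keeps the fewest components reaching that share of variance.
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(StandardScaler().fit_transform(X))
print(X_reduced.shape[1], 'components retained')
print(pca_95.explained_variance_ratio_.cumsum())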

!pip install pca
from pandas import read_csv
from pca import pca

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])    # drop the index column from the CSV
y = df[['AFQT']]                       # test-performance target
X = df.drop(columns=['AFQT'])          # personality items
model = pca(normalize=True)            # standardize the columns before PCA
results = model.fit_transform(X)
print(model.results['explained_var'])  # explained variance of the retained components
fig, ax = model.plot()                 # explained-variance plot
fig.savefig('personality_performance.png')

personality_performance.png
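
On Question 2, one PCR-style option is to chain PCA with a regressor in a scikit-learn Pipeline and compare it against Lasso, which prunes unhelpful columns by shrinking their coefficients towards zero. A rough sketch, assuming the same CSV, no missing values, and an arbitrary component count of six:

from pandas import read_csv
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import cross_val_score

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])
X, y = df.drop(columns=['AFQT']), df['AFQT']

# Principal component regression: scale, project onto a few components, regress.
pcr = make_pipeline(StandardScaler(), PCA(n_components=6), LinearRegression())
# Lasso keeps all columns but shrinks unhelpful coefficients to zero.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))

print('PCR   R^2:', cross_val_score(pcr, X, y, cv=5).mean())
print('Lasso R^2:', cross_val_score(lasso, X, y, cv=5).mean())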


BradKML commented Oct 25, 2022

Current finding: with RobustScaler, six components are enough to describe the data under the eigenvalue > 1 rule, [9.136, 2.683, 2.078, 1.420, 1.328, 1.090], similar to the count without scaling.
MaxAbsScaler yields only one weak component, while StandardScaler yielded 159 components, with the first six being [52.051, 12.815, 9.561, 8.205, 6.741, 5.902]. It seems that normalization does not help with clearing out noise in some cases.

Question 4: How can one check the significance of an ICA component?
Question 5: If one were to use the 159 components, what is the strategy for determining the most useful columns in each component? (a sketch follows the code below)

from pandas import read_csv, DataFrame
from sklearn.preprocessing import MaxAbsScaler, RobustScaler, StandardScaler
from sklearn.decomposition import PCA, FastICA

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])  # drop the index column from the CSV
X, y = df.drop(columns=['AFQT']), df[['AFQT']]

# X_transformed = MaxAbsScaler().fit_transform(X)   # in case of Yes/No questions
X_transformed = RobustScaler().fit_transform(X)     # in case of Likert scales
# X_transformed = StandardScaler().fit_transform(X) # in case of aggregates

# Keep only the components with eigenvalue > 1 (Kaiser rule).
pca = PCA()
X_transformed_pca = pca.fit_transform(X_transformed)
suff_len = len([i for i in pca.explained_variance_ if i > 1])
print(pca.explained_variance_[:suff_len])

# Re-run ICA with the component count suggested by PCA; each row of
# df_comp holds the loadings of one independent component on the columns.
ica = FastICA(n_components=suff_len)
X_transformed_ica = ica.fit_transform(X_transformed)
df_comp = DataFrame(ica.components_, columns=X.columns)
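
FastICA itself does not provide a significance test (Question 4); one rough heuristic is the non-Gaussianity of each recovered source, e.g. its excess kurtosis, since sources that look Gaussian carry little independent structure. For Question 5, the most influential columns of each component can be read off the largest absolute loadings in df_comp. A sketch continuing from the variables above; the kurtosis criterion and the cutoff of five columns are assumptions, not an established test:

from scipy.stats import kurtosis

for i in range(suff_len):
    # Excess kurtosis of the i-th recovered source (0 would be Gaussian).
    k = kurtosis(X_transformed_ica[:, i])
    # Columns with the largest absolute loadings on this component.
    top_cols = df_comp.iloc[i].abs().sort_values(ascending=False).head(5)
    print(f'Component {i}: excess kurtosis = {k:.2f}')
    print(top_cols)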
