how to make the balanced dataset? #6

Open
eggpom opened this issue Feb 13, 2023 · 2 comments
Comments

eggpom commented Feb 13, 2023

First of all, thank you so much for sharing your work; it has been very helpful. I still have a small problem, though, and hope you can help. The balanced data file was not generated after I ran the code (cicids2017.py). How can I get the balanced data? Looking forward to your reply, and thank you again!

@baixiaobaicai

Thank you for the author's work. I am reproducing this paper and encountered the same problem. May I ask how to solve it? Thank you!

foresthao commented Mar 25, 2024

Well, I have encountered the same problem, but I think it is easy to solve: the training data just needs to be resampled. Here is what I did. I rewrote the scale() function in ./preprocessing/cicids2017.py:

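# Requires `from sklearn.utils import resample` among the imports at the top of cicids2017.py.
def scale(self, training_set, validation_set, testing_set):
        """Scale the features, encode the labels, and balance the training set by per-class resampling."""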
        (X_train, y_train), (X_val, y_val), (X_test, y_test) = training_set, validation_set, testing_set
        
        categorical_features = self.features.select_dtypes(exclude=["number"]).columns
        numeric_features = self.features.select_dtypes(exclude=[object]).columns

        preprocessor = ColumnTransformer(transformers=[
            ('categoricals', OneHotEncoder(drop='first', sparse=False, handle_unknown='error'), categorical_features),
            ('numericals', QuantileTransformer(), numeric_features)
        ])

        # Preprocess the features
        columns = numeric_features.tolist()

        X_train = pd.DataFrame(preprocessor.fit_transform(X_train), columns=columns)
        X_val = pd.DataFrame(preprocessor.transform(X_val), columns=columns)
        X_test = pd.DataFrame(preprocessor.transform(X_test), columns=columns)

        # Preprocess the labels
        le = LabelEncoder()

        y_train = pd.DataFrame(le.fit_transform(y_train), columns=["label"])
        y_val = pd.DataFrame(le.transform(y_val), columns=["label"])
        y_test = pd.DataFrame(le.transform(y_test), columns=["label"])

        # Resample the training data to address class imbalance
        train_data = pd.concat([X_train, y_train], axis=1)  # Combine features and labels
        resampled_data = []  # List to store resampled data
        min_samples = 20000
        # Iterate over each class label
        for label_value in y_train["label"].unique():
            # Resample data for the current class
            class_data = train_data[train_data["label"] == label_value]
            # Oversample (with replacement) classes smaller than min_samples;
            # undersample (without replacement) all other classes.
            replace = len(class_data) < min_samples
            resampled_class_data = resample(class_data, n_samples=min_samples, random_state=123, replace=replace)
            resampled_data.append(resampled_class_data)

        # Combine the resampled data for all classes
        resampled_data_cat = pd.concat(resampled_data)
        X_train_resampled = resampled_data_cat.drop("label", axis=1)
        y_train_resampled = resampled_data_cat["label"]

        return (X_train_resampled, y_train_resampled), (X_val, y_val), (X_test, y_test)
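
For anyone who wants to try the idea outside the repository, here is a minimal, self-contained sketch of the same per-class resampling on a toy DataFrame. The helper name `balance_classes`, the column names, and the target of 5 rows per class are made up for illustration and are not taken from the repository. Unlike the function above, this sketch also shuffles the result and resets the index, which is an optional extra step rather than part of the original fix.

```python
import pandas as pd
from sklearn.utils import resample


def balance_classes(df, label_col="label", n_per_class=5, seed=123):
    """Return a copy of df with exactly n_per_class rows per label value.

    Hypothetical helper for illustration; not part of the repository.
    """
    parts = []
    for _, group in df.groupby(label_col):
        # Sample with replacement only when the class has too few rows.
        parts.append(
            resample(
                group,
                n_samples=n_per_class,
                replace=len(group) < n_per_class,
                random_state=seed,
            )
        )
    # Shuffle so that the classes are not stored in contiguous blocks.
    balanced = pd.concat(parts).sample(frac=1, random_state=seed)
    return balanced.reset_index(drop=True)


# Toy data: 12 rows of class 0 and only 2 rows of class 1.
toy = pd.DataFrame({"f1": range(14), "label": [0] * 12 + [1] * 2})
balanced = balance_classes(toy)

# Each class should now contribute the same number of rows.
print(balanced["label"].value_counts().to_dict())
```

Checking `value_counts()` on the resampled labels like this is a quick way to confirm that the training set returned by scale() is actually balanced.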
