Could not generate code for seaborn diamonds dataset due to applying np.log1p to categorical columns #105

Open
kimusaku opened this issue Jan 24, 2025 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@kimusaku
Contributor

Describe the bug
I applied the latest version to the seaborn diamonds dataset (sns.load_dataset("diamonds")) and got the error TypeError: Object with dtype category cannot perform the numpy op log1p when executing the generated code.
It looks like np.log1p is applied to the categorical columns before their values are encoded: clarity and color are listed in NUMERIC_COLS_TO_SCALE in the generated code.
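
The failure can be isolated without SapientML. The following is a minimal sketch, assuming a small hand-built DataFrame as a stand-in for the diamonds data (the values and the helper try_log1p are hypothetical and only for illustration). It applies np.log1p to a numeric column and to a column with dtype category and records whether a TypeError is raised.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diamonds data:
# one numeric column and one column with dtype "category".
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "color": pd.Categorical(["E", "E", "I"]),
})


def try_log1p(series):
    """Return None if np.log1p succeeds, otherwise the TypeError message."""
    try:
        np.log1p(series)
        return None
    except TypeError as exc:
        return str(exc)


numeric_error = try_log1p(df["carat"])      # numeric dtype: no error expected
categorical_error = try_log1p(df["color"])  # category dtype: TypeError expected

print("carat:", numeric_error)
print("color:", categorical_error)
```

Only the categorical column should report an error, which is consistent with the log step running before the OneHotEncoder step in the generated pipeline.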

To Reproduce

script
import pandas as pd
from sapientml import SapientML
from sapientml.util.logging import setup_logger
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import seaborn as sns

train_data = sns.load_dataset('diamonds')

train_data, test_data = train_test_split(train_data)

y_true = test_data["cut"].reset_index(drop=True)
test_data.drop(["cut"], axis=1, inplace=True)

cls = SapientML(["cut"])
cls.fit(train_data)
y_pred = cls.predict(test_data)

print("Accuracy:", accuracy_score(y_true, y_pred))
generated code
# *** GENERATED PIPELINE ***

# LOAD DATA
import pandas as pd
train_dataset = pd.read_pickle(r"/home/kimura/lectures/begin-python-2024/outputs/training.pkl")

# TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
def split_dataset(dataset, train_size=0.75, random_state=17):
    train_dataset, test_dataset = train_test_split(dataset, train_size=train_size, random_state=random_state, stratify=dataset["cut"])
    return train_dataset, test_dataset
train_dataset, test_dataset = split_dataset(train_dataset)
train_dataset, validation_dataset = split_dataset(train_dataset)

# SUBSAMPLE
# If the number of rows of train_dataset is larger than sample_size, sample rows to sample_size for speedup.
from lib.sample_dataset import sample_dataset
train_dataset = sample_dataset(
    dataframe=train_dataset,
    sample_size=100000,
    target_columns=['cut'],
    task_type='classification'
)

test_dataset = validation_dataset


# PREPROCESSING-1
# Component: Preprocess:Log
# Efficient Cause: Preprocess:Log is required in this pipeline since the dataset has ['feature:target_imbalance_score', 'feature:str_category_presence', 'feature:max_normalized_stddev']. The relevant features are: ['carat', 'clarity', 'color', 'depth', 'price', 'table'].
# Purpose: Return the natural logarithm of one plus the input array, element-wise.
# Form:
#   Input: array_like
#   Key hyperparameters used: None
# Alternatives: Although [Preprocess:StandardScaler] can also be used for this dataset, Preprocess:Log is used because it has more feature:target_imbalance_score than feature:max_normalized_stddev.
# Order: Preprocess:Log should be applied  
import numpy as np
NUMERIC_COLS_TO_SCALE = ['carat', 'clarity', 'color', 'depth', 'price', 'table']
train_dataset[NUMERIC_COLS_TO_SCALE] = np.log1p(train_dataset[NUMERIC_COLS_TO_SCALE]).replace([np.inf, -np.inf], np.nan).fillna(train_dataset[NUMERIC_COLS_TO_SCALE].mean())
NUMERIC_COLS_TO_SCALE_FOR_TEST = list(set(test_dataset.columns) & set(NUMERIC_COLS_TO_SCALE))
test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST] = np.log1p(test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST]).replace([np.inf, -np.inf], np.nan).fillna(test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST].mean())

# DETACH TARGET
TARGET_COLUMNS = ['cut']
feature_train = train_dataset.drop(TARGET_COLUMNS, axis=1)
target_train = train_dataset[TARGET_COLUMNS].copy()
feature_test = test_dataset.drop(TARGET_COLUMNS, axis=1)
target_test = test_dataset[TARGET_COLUMNS].copy()

# PREPROCESSING-2
# Component: Preprocess:OneHotEncoder
# Efficient Cause: Preprocess:OneHotEncoder is required in this pipeline since the dataset has ['feature:str_category_presence', 'feature:str_category_binary_presence', 'feature:str_category_small_presence']. The relevant features are: ['clarity', 'color'].
# Purpose: Encode categorical features as a one-hot numeric array.
# Form:
#   Input: list of arrays
#   Key hyperparameters used: 
#		 "handle_unknown: {‘error’, ‘ignore’}, default=’error’" :: Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
#		 "sparse: bool, default=True" :: Will return sparse matrix if set True else will return an array.
# Alternatives: Although [Preprocess:OrdinalEncoder] can also be used for this dataset, Preprocess:OneHotEncoder is used because it has more feature:str_category_binary_presence than feature:str_category_small_presence.
# Order: Preprocess:OneHotEncoder should be applied  
from sklearn.preprocessing import OneHotEncoder
CATEGORICAL_COLS = ['clarity', 'color']
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
train_encoded = pd.DataFrame(onehot_encoder.fit_transform(feature_train[CATEGORICAL_COLS]), columns=onehot_encoder.get_feature_names_out(), index=feature_train.index)
feature_train = pd.concat([feature_train, train_encoded ], axis=1)
feature_train.drop(CATEGORICAL_COLS, axis=1, inplace=True)
test_encoded = pd.DataFrame(onehot_encoder.transform(feature_test[CATEGORICAL_COLS]), columns=onehot_encoder.get_feature_names_out(), index=feature_test.index)
feature_test = pd.concat([feature_test, test_encoded ], axis=1)
feature_test.drop(CATEGORICAL_COLS, axis=1, inplace=True)

# MODEL
import numpy as np
from sklearn.ensemble import RandomForestClassifier
random_state_model = 42
model = RandomForestClassifier(random_state=random_state_model, )
model.fit(feature_train, target_train.values.ravel())
y_pred = model.predict(feature_test)

#EVALUATION
from sklearn import metrics
f1 = metrics.f1_score(target_test, y_pred, average='macro')
print('RESULT: F1 Score: ' + str(f1))

Expected behavior
With sapientml-core==0.6.2, code was generated for this dataset without any error.
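
For reference, one possible shape for the log step is to restrict it to columns that are actually numeric. This is only a sketch under assumptions, not SapientML's implementation or its eventual fix: the function log1p_numeric_only and the sample data are hypothetical, and the inf/NaN handling mirrors the generated code above.

```python
import numpy as np
import pandas as pd


def log1p_numeric_only(dataset, candidate_cols):
    """Apply np.log1p only to those candidate columns that have a numeric dtype.

    Hypothetical helper for illustration. Returns the transformed copy and
    the list of columns that were actually transformed.
    """
    # Keep only numeric candidates; categorical/object columns are skipped.
    numeric_cols = (
        dataset[candidate_cols].select_dtypes(include="number").columns.tolist()
    )
    result = dataset.copy()
    # Column means of the untransformed data, as in the generated code.
    fill_values = result[numeric_cols].mean()
    transformed = np.log1p(result[numeric_cols]).replace([np.inf, -np.inf], np.nan)
    result[numeric_cols] = transformed.fillna(fill_values)
    return result, numeric_cols


# Hypothetical sample mimicking the column types of the diamonds dataset.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "color": pd.Categorical(["E", "E", "I"]),
    "price": [326, 326, 334],
})

transformed_sample, used_cols = log1p_numeric_only(
    sample, ["carat", "color", "price"]
)
print(used_cols)
print(transformed_sample.dtypes)
```

With such a guard, the categorical columns would reach the OneHotEncoder step unchanged.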

Environment (please complete the following information):

  • OS: Ubuntu 22.04.5
  • Python Version: 3.11.10
  • SapientML Version: sapientml==0.4.15 sapientml-core==0.7.1