Could not generate code for seaborn diamonds dataset due to applying np.log1p to categorical columns #105

kimusaku · 2025-01-24T01:15:00Z

Describe the bug
I ran the latest version on the seaborn diamonds dataset (sns.load_dataset("diamonds")), and executing the generated code failed with: TypeError: Object with dtype category cannot perform the numpy op log1p.
It looks like np.log1p is applied to the categorical columns before their values are encoded.
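
For reference, the error can be reproduced in isolation without SapientML. This is a minimal sketch that assumes a tiny hand-made frame standing in for the diamonds data (the values are illustrative only); it calls np.log1p on a single category-dtype column and captures the resulting exception.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the diamonds data (illustrative values, not the real dataset).
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "color": pd.Categorical(["E", "E", "I"]),
})

# log1p works on the numeric column ...
numeric_result = np.log1p(df["carat"])

# ... but pandas refuses to apply the ufunc to a category-dtype column.
try:
    np.log1p(df["color"])
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```

The generated pipeline hits the same failure because NUMERIC_COLS_TO_SCALE includes 'clarity' and 'color', which are category-dtype columns in this dataset.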

To Reproduce

script

import pandas as pd
from sapientml import SapientML
from sapientml.util.logging import setup_logger
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import seaborn as sns

train_data = sns.load_dataset('diamonds')

train_data, test_data = train_test_split(train_data)

y_true = test_data["cut"].reset_index(drop=True)
test_data.drop(["cut"], axis=1, inplace=True)

cls = SapientML(["cut"])
cls.fit(train_data)
y_pred = cls.predict(test_data)

print("Accuracy:", accuracy_score(y_true, y_pred))

generated code

# *** GENERATED PIPELINE ***

# LOAD DATA
import pandas as pd
train_dataset = pd.read_pickle(r"/home/kimura/lectures/begin-python-2024/outputs/training.pkl")

# TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
def split_dataset(dataset, train_size=0.75, random_state=17):
    train_dataset, test_dataset = train_test_split(dataset, train_size=train_size, random_state=random_state, stratify=dataset["cut"])
    return train_dataset, test_dataset
train_dataset, test_dataset = split_dataset(train_dataset)
train_dataset, validation_dataset = split_dataset(train_dataset)

# SUBSAMPLE
# If the number of rows of train_dataset is larger than sample_size, sample rows to sample_size for speedup.
from lib.sample_dataset import sample_dataset
train_dataset = sample_dataset(
    dataframe=train_dataset,
    sample_size=100000,
    target_columns=['cut'],
    task_type='classification'
)

test_dataset = validation_dataset


# PREPROCESSING-1
# Component: Preprocess:Log
# Efficient Cause: Preprocess:Log is required in this pipeline since the dataset has ['feature:target_imbalance_score', 'feature:str_category_presence', 'feature:max_normalized_stddev']. The relevant features are: ['carat', 'clarity', 'color', 'depth', 'price', 'table'].
# Purpose: Return the natural logarithm of one plus the input array, element-wise.
# Form:
#   Input: array_like
#   Key hyperparameters used: None
# Alternatives: Although [Preprocess:StandardScaler] can also be used for this dataset, Preprocess:Log is used because it has more feature:target_imbalance_score than feature:max_normalized_stddev.
# Order: Preprocess:Log should be applied  
import numpy as np
NUMERIC_COLS_TO_SCALE = ['carat', 'clarity', 'color', 'depth', 'price', 'table']
train_dataset[NUMERIC_COLS_TO_SCALE] = np.log1p(train_dataset[NUMERIC_COLS_TO_SCALE]).replace([np.inf, -np.inf], np.nan).fillna(train_dataset[NUMERIC_COLS_TO_SCALE].mean())
NUMERIC_COLS_TO_SCALE_FOR_TEST = list(set(test_dataset.columns) & set(NUMERIC_COLS_TO_SCALE))
test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST] = np.log1p(test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST]).replace([np.inf, -np.inf], np.nan).fillna(test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST].mean())

# DETACH TARGET
TARGET_COLUMNS = ['cut']
feature_train = train_dataset.drop(TARGET_COLUMNS, axis=1)
target_train = train_dataset[TARGET_COLUMNS].copy()
feature_test = test_dataset.drop(TARGET_COLUMNS, axis=1)
target_test = test_dataset[TARGET_COLUMNS].copy()

# PREPROCESSING-2
# Component: Preprocess:OneHotEncoder
# Efficient Cause: Preprocess:OneHotEncoder is required in this pipeline since the dataset has ['feature:str_category_presence', 'feature:str_category_binary_presence', 'feature:str_category_small_presence']. The relevant features are: ['clarity', 'color'].
# Purpose: Encode categorical features as a one-hot numeric array.
# Form:
#   Input: list of arrays
#   Key hyperparameters used: 
#		 "handle_unknown: {‘error’, ‘ignore’}, default=’error’" :: Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
#		 "sparse: bool, default=True" :: Will return sparse matrix if set True else will return an array.
# Alternatives: Although [Preprocess:OrdinalEncoder] can also be used for this dataset, Preprocess:OneHotEncoder is used because it has more feature:str_category_binary_presence than feature:str_category_small_presence.
# Order: Preprocess:OneHotEncoder should be applied  
from sklearn.preprocessing import OneHotEncoder
CATEGORICAL_COLS = ['clarity', 'color']
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
train_encoded = pd.DataFrame(onehot_encoder.fit_transform(feature_train[CATEGORICAL_COLS]), columns=onehot_encoder.get_feature_names_out(), index=feature_train.index)
feature_train = pd.concat([feature_train, train_encoded ], axis=1)
feature_train.drop(CATEGORICAL_COLS, axis=1, inplace=True)
test_encoded = pd.DataFrame(onehot_encoder.transform(feature_test[CATEGORICAL_COLS]), columns=onehot_encoder.get_feature_names_out(), index=feature_test.index)
feature_test = pd.concat([feature_test, test_encoded ], axis=1)
feature_test.drop(CATEGORICAL_COLS, axis=1, inplace=True)

# MODEL
import numpy as np
from sklearn.ensemble import RandomForestClassifier
random_state_model = 42
model = RandomForestClassifier(random_state=random_state_model, )
model.fit(feature_train, target_train.values.ravel())
y_pred = model.predict(feature_test)

#EVALUATION
from sklearn import metrics
f1 = metrics.f1_score(target_test, y_pred, average='macro')
print('RESULT: F1 Score: ' + str(f1))

Expected behavior
With sapientml-core==0.6.2, code generation succeeded and the generated code ran without errors.
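
As a possible workaround while this is open, the Log step could be restricted to columns that actually have a numeric dtype. The sketch below is only an assumption about what a guard might look like, not the project's actual fix; the helper name log1p_numeric_only and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd


def log1p_numeric_only(df, candidate_cols):
    """Apply np.log1p only to the candidate columns with a numeric dtype.

    Hypothetical helper for illustration; not part of SapientML.
    """
    # select_dtypes(include="number") drops category/object columns.
    numeric_cols = df[candidate_cols].select_dtypes(include="number").columns.tolist()
    out = df.copy()
    transformed = np.log1p(out[numeric_cols]).replace([np.inf, -np.inf], np.nan)
    # Mirror the generated code: fill gaps with the column means of the input.
    out[numeric_cols] = transformed.fillna(out[numeric_cols].mean())
    return out, numeric_cols


# Illustrative stand-in for the diamonds data.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "price": [326, 326, 334],
    "color": pd.Categorical(["E", "E", "I"]),
})

result, used = log1p_numeric_only(df, ["carat", "color", "price"])
print(used)
print(result.dtypes)
```

With a guard like this, 'clarity' and 'color' would be left untouched by the Log step and handled later by the OneHotEncoder step.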

Environment (please complete the following information):

OS: Ubuntu 22.04.5
Python Version: 3.11.10
SapientML Version: sapientml==0.4.15 sapientml-core==0.7.1


kimusaku added the bug Something isn't working label Jan 24, 2025

kimusaku assigned AkiraUra and HimanshuRRai Jan 24, 2025

HimanshuRRai mentioned this issue Feb 5, 2025

fix: TypeError: Object with dtype category cannot perform the NumPy op log1p #108

Open
