Describe the bug
I tried to apply the latest version to the seaborn diamonds dataset (sns.load_dataset("diamonds")) and got the error TypeError: Object with dtype category cannot perform the numpy op log1p when executing the generated code.
It looks like np.log1p is applied to the categorical columns before their values are encoded.
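For reference, the dtype mismatch can be reproduced outside the generated pipeline in a few lines. The snippet below is only an illustration (it assumes a recent seaborn, where the diamonds columns cut, color, and clarity are loaded as pandas categoricals), not part of the pipeline itself:

# Minimal reproduction of the reported error, independent of the generated pipeline.
import numpy as np
import seaborn as sns

df = sns.load_dataset("diamonds")
print(df.dtypes[["cut", "color", "clarity"]])  # all three are category dtype

np.log1p(df["clarity"])  # raises: TypeError: Object with dtype category cannot perform the numpy op log1p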
To Reproduce
script
generated code:

# *** GENERATED PIPELINE ***
# LOAD DATA
import pandas as pd
train_dataset = pd.read_pickle(r"/home/kimura/lectures/begin-python-2024/outputs/training.pkl")

# TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
def split_dataset(dataset, train_size=0.75, random_state=17):
    train_dataset, test_dataset = train_test_split(dataset, train_size=train_size, random_state=random_state, stratify=dataset["cut"])
    return train_dataset, test_dataset
train_dataset, test_dataset = split_dataset(train_dataset)
train_dataset, validation_dataset = split_dataset(train_dataset)

# SUBSAMPLE
# If the number of rows of train_dataset is larger than sample_size, sample rows to sample_size for speedup.
from lib.sample_dataset import sample_dataset
train_dataset = sample_dataset(
    dataframe=train_dataset,
    sample_size=100000,
    target_columns=['cut'],
    task_type='classification'
)
test_dataset = validation_dataset

# PREPROCESSING-1
# Component: Preprocess:Log
# Efficient Cause: Preprocess:Log is required in this pipeline since the dataset has ['feature:target_imbalance_score', 'feature:str_category_presence', 'feature:max_normalized_stddev']. The relevant features are: ['carat', 'clarity', 'color', 'depth', 'price', 'table'].
# Purpose: Return the natural logarithm of one plus the input array, element-wise.
# Form:
# Input: array_like
# Key hyperparameters used: None
# Alternatives: Although [Preprocess:StandardScaler] can also be used for this dataset, Preprocess:Log is used because it has more feature:target_imbalance_score than feature:max_normalized_stddev.
# Order: Preprocess:Log should be applied
import numpy as np
NUMERIC_COLS_TO_SCALE = ['carat', 'clarity', 'color', 'depth', 'price', 'table']
train_dataset[NUMERIC_COLS_TO_SCALE] = np.log1p(train_dataset[NUMERIC_COLS_TO_SCALE]).replace([np.inf, -np.inf], np.nan).fillna(train_dataset[NUMERIC_COLS_TO_SCALE].mean())
NUMERIC_COLS_TO_SCALE_FOR_TEST = list(set(test_dataset.columns) & set(NUMERIC_COLS_TO_SCALE))
test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST] = np.log1p(test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST]).replace([np.inf, -np.inf], np.nan).fillna(test_dataset[NUMERIC_COLS_TO_SCALE_FOR_TEST].mean())

# DETACH TARGET
TARGET_COLUMNS = ['cut']
feature_train = train_dataset.drop(TARGET_COLUMNS, axis=1)
target_train = train_dataset[TARGET_COLUMNS].copy()
feature_test = test_dataset.drop(TARGET_COLUMNS, axis=1)
target_test = test_dataset[TARGET_COLUMNS].copy()

# PREPROCESSING-2
# Component: Preprocess:OneHotEncoder
# Efficient Cause: Preprocess:OneHotEncoder is required in this pipeline since the dataset has ['feature:str_category_presence', 'feature:str_category_binary_presence', 'feature:str_category_small_presence']. The relevant features are: ['clarity', 'color'].
# Purpose: Encode categorical features as a one-hot numeric array.
# Form:
# Input: list of arrays
# Key hyperparameters used:
# "handle_unknown: {‘error’, ‘ignore’}, default=’error’" :: Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
# "sparse: bool, default=True" :: Will return sparse matrix if set True else will return an array.
# Alternatives: Although [Preprocess:OrdinalEncoder] can also be used for this dataset, Preprocess:OneHotEncoder is used because it has more feature:str_category_binary_presence than feature:str_category_small_presence.
# Order: Preprocess:OneHotEncoder should be applied
from sklearn.preprocessing import OneHotEncoder
CATEGORICAL_COLS = ['clarity', 'color']
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
train_encoded = pd.DataFrame(onehot_encoder.fit_transform(feature_train[CATEGORICAL_COLS]), columns=onehot_encoder.get_feature_names_out(), index=feature_train.index)
feature_train = pd.concat([feature_train, train_encoded], axis=1)
feature_train.drop(CATEGORICAL_COLS, axis=1, inplace=True)
test_encoded = pd.DataFrame(onehot_encoder.transform(feature_test[CATEGORICAL_COLS]), columns=onehot_encoder.get_feature_names_out(), index=feature_test.index)
feature_test = pd.concat([feature_test, test_encoded], axis=1)
feature_test.drop(CATEGORICAL_COLS, axis=1, inplace=True)

# MODEL
import numpy as np
from sklearn.ensemble import RandomForestClassifier
random_state_model = 42
model = RandomForestClassifier(random_state=random_state_model)
model.fit(feature_train, target_train.values.ravel())
y_pred = model.predict(feature_test)

# EVALUATION
from sklearn import metrics
f1 = metrics.f1_score(target_test, y_pred, average='macro')
print('RESULT: F1 Score: ' + str(f1))
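The failure appears to come from PREPROCESSING-1: NUMERIC_COLS_TO_SCALE mixes the numeric columns with 'clarity' and 'color', which are category dtype in this dataset, so np.log1p fails before the one-hot encoding in PREPROCESSING-2 is ever reached. A small self-contained sketch of that mismatch (an illustration only, not a proposed patch to the generator):

# Sketch: splitting the generated column list by dtype shows which columns log1p can handle.
import numpy as np
import seaborn as sns
from pandas.api.types import is_numeric_dtype

df = sns.load_dataset("diamonds")
cols = ['carat', 'clarity', 'color', 'depth', 'price', 'table']  # as in PREPROCESSING-1

numeric_cols = [c for c in cols if is_numeric_dtype(df[c])]
print(numeric_cols)  # ['carat', 'depth', 'price', 'table']; 'clarity' and 'color' are category dtype

df[numeric_cols] = np.log1p(df[numeric_cols])  # succeeds once the categorical columns are excluded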
Expected behavior
With sapientml-core==0.6.2, code generation succeeds and the generated code runs without this error.
Environment (please complete the following information):