get_dump(dump_format='json') and dump_model(..., dump_format='json') return invalid JSON strings when special characters are present in feature names
#9352
Closed
KWiecko opened this issue
Jun 30, 2023
· 2 comments
· Fixed by #9474
At the end of the model training workflow we need to convert an xgboost booster to a coremltools MLModel. For that we use the coremltools converter, which under the hood calls get_dump(dump_format='json') to perform the conversion. It looks like get_dump(dump_format='json') returns invalid JSON strings if special characters are present in feature names.
Detailed bug description:
Feature names which contain 'non standard' characters (e.g. double quotes ' " ', tab '\t', newline '\n' symbols) are not properly dumped to JSON by get_dump(dump_format='json') and dump_model(..., dump_format='json') methods. All feature names (no matter which characters they contain) seem to just be wrapped with double quotes which leads to invalid JSON strings if a feature name contains a 'special' character.
I'm not 100% sure, but my guess is that the JsonGenerator class (located in xgboost/src/tree/tree_model.cc) is responsible for dumping trees to JSON, so maybe replacing the "" wrapping (..., "split": "{fname}", ...) with a json.dumps() statement would solve this issue?
save_model('*.json') always dumps model to a valid JSON representation.
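To illustrate the suspected cause, here is a minimal pure-Python sketch that does not call xgboost at all. The helper names (naive_split_entry, escaped_split_entry, is_valid_json) are hypothetical and only mimic the behaviour guessed at above: wrapping a feature name in double quotes without escaping, versus letting json.dumps() escape it.

```python
import json


def naive_split_entry(fname):
    # Assumption: mimics the reported behaviour, i.e. the name is only
    # wrapped in double quotes and no characters are escaped.
    return '{"split": "' + fname + '"}'


def escaped_split_entry(fname):
    # json.dumps() escapes quotes and control characters in the name.
    return '{"split": ' + json.dumps(fname) + '}'


def is_valid_json(text):
    # True if the standard library parser accepts the string.
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


names = ['feature 0', '"feature 0"', '\tfeature\n1']

# Maps each name to (naive output parses, escaped output parses).
results = {
    name: (is_valid_json(naive_split_entry(name)),
           is_valid_json(escaped_split_entry(name)))
    for name in names
}

for name, (naive_ok, escaped_ok) in results.items():
    print(repr(name), 'naive valid:', naive_ok, 'escaped valid:', escaped_ok)
```

Under these assumptions only the plain name survives the naive wrapping, while the escaped variant parses for all three names and round-trips the original string.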
Environment
xgboost: 1.4.2 and 1.7.6
ubuntu 22.04, Amazon Linux 2
installed via pip
Minimum reproducible example:
import numpy as np
import json
import xgboost as xgb

if __name__ == '__main__':
    x = np.random.rand(100, 3)
    y = np.random.rand(100, 1)

    # set dummy params
    params = {'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'max_depth': 4}

    # following feature names will not cause problems when calling
    # .get_dump(dump_format='json') or dump_model(..., dump_format='json')
    correctly_dumped_feature_names = ['feature 0', 'feature 1', 'feature 2']
    dm_correctly_dumped_feature_names = xgb.DMatrix(x, label=y, feature_names=correctly_dumped_feature_names)
    correctly_dumped_booster = xgb.train(params=params, dtrain=dm_correctly_dumped_feature_names, num_boost_round=3)

    # get dump and attempt to iterate over trees JSONs
    correct_model_dump = correctly_dumped_booster.get_dump(dump_format='json')
    for tree_json_index, tree_json in enumerate(correct_model_dump):
        # all trees correctly loaded
        json.loads(tree_json)

    # dump_model dumps model to valid JSON format only if feature names do not contain any special characters
    # for current feature names correctly_dumped_booster will be dumped to a valid JSON
    correctly_dumped_booster.dump_model('correctly_dumped_booster_dump_model.json', dump_format='json')
    # save_model always saves model to valid JSON format
    correctly_dumped_booster.save_model('correctly_dumped_booster_save_model.json')

    # following feature names will cause problems when calling
    # .get_dump(dump_format='json') or dump_model(..., dump_format='json')
    incorrectly_dumped_feature_names = ['"feature 0"', '\tfeature\n1', 'feature 2']
    dm_incorrectly_dumped_feature_names = xgb.DMatrix(x, label=y, feature_names=incorrectly_dumped_feature_names)
    incorrectly_dumped_booster = xgb.train(params=params, dtrain=dm_incorrectly_dumped_feature_names, num_boost_round=3)

    # get dump and attempt to iterate over trees JSONs
    incorrect_model_dump = incorrectly_dumped_booster.get_dump(dump_format='json')
    for tree_json_index, tree_json in enumerate(incorrect_model_dump):
        try:
            json.loads(tree_json)
        except json.JSONDecodeError as exc:
            print(f'Error when parsing tree with id: {tree_json_index}')
            print(exc)

    # dump_model dumps model to valid JSON format only if feature names do not contain any special characters
    # for current feature names incorrectly_dumped_booster will be dumped to an invalid JSON
    incorrectly_dumped_booster.dump_model('incorrectly_dumped_booster_dump_model.json', dump_format='json')
    # save_model always saves model to valid JSON format
    incorrectly_dumped_booster.save_model('incorrectly_dumped_booster_save_model.json')
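Until the dump itself escapes feature names, one possible workaround is to train with placeholder names that need no escaping and map them back after parsing. The sketch below is only an illustration under stated assumptions: make_placeholder_mapping and restore_names are hypothetical helpers, the tree string is a hand-written stand-in for one element of get_dump(dump_format='json'), and the "split" / "children" keys are assumed to match the layout of the JSON tree dump.

```python
import json


def make_placeholder_mapping(feature_names):
    # Hypothetical helper: map safe placeholders (f0, f1, ...) to the
    # original names. Assumes no original name is itself "f<index>".
    return {f'f{i}': name for i, name in enumerate(feature_names)}


def restore_names(node, placeholder_to_original):
    # Recursively replace placeholder names in an already parsed tree.
    if 'split' in node:
        node['split'] = placeholder_to_original.get(node['split'], node['split'])
    for child in node.get('children', []):
        restore_names(child, placeholder_to_original)
    return node


original_names = ['"feature 0"', '\tfeature\n1', 'feature 2']
placeholder_to_original = make_placeholder_mapping(original_names)

# Hand-written stand-in for one tree dumped from a booster that was
# trained with the placeholder names; not real xgboost output.
tree_json = (
    '{"nodeid": 0, "split": "f1", "split_condition": 0.5, '
    '"yes": 1, "no": 2, "missing": 1, "children": ['
    '{"nodeid": 1, "leaf": 0.1}, {"nodeid": 2, "leaf": -0.1}]}'
)

tree = restore_names(json.loads(tree_json), placeholder_to_original)

# json.dumps() escapes the restored name, so the result parses again.
restored_json = json.dumps(tree)
print(restored_json)
```

The placeholders would be passed as feature_names when building the DMatrix, so the dump only ever contains names that are safe inside a double-quoted string. This does not help when a third-party converter needs the original names inside the dump it reads.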