get_dump(dump_format='json') and dump_model(..., dump_format='json') return invalid JSON strings when special characters are present in feature names #9352

KWiecko · 2023-06-30T21:20:21Z

Bug description

Rationale:

At the end of the model training workflow we need to convert an xgboost Booster to a coremltools MLModel. For that we use the coremltools converter, which calls get_dump(dump_format='json') under the hood to perform the conversion. It looks like get_dump(dump_format='json') returns invalid JSON strings if special characters are present in feature names.

Detailed bug description:

Feature names that contain 'non-standard' characters (e.g. a double quote '"', a tab '\t', or a newline '\n') are not properly escaped by the get_dump(dump_format='json') and dump_model(..., dump_format='json') methods. Every feature name, whatever characters it contains, seems to be wrapped in double quotes as-is, which produces an invalid JSON string whenever the name contains such a character.

I'm not 100% sure, but my guess is that the JsonGenerator class (located in xgboost/src/tree/tree_model.cc) is responsible for dumping trees to JSON, so maybe replacing the plain "" wrapping (..., "split": "{fname}", ...) with proper JSON string escaping (the equivalent of a json.dumps() call in Python) would solve this issue?
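To illustrate the suspected cause without involving xgboost at all, here is a minimal sketch using only the Python standard library. The helpers naive_split_entry, escaped_split_entry and is_valid_json are hypothetical names made up for this illustration; they only mimic the "wrap the raw name in double quotes" behaviour guessed at above and are not taken from the xgboost source.

```python
import json


def naive_split_entry(feature_name: str) -> str:
    """Mimic the suspected behaviour: wrap the raw name in double quotes."""
    return '{"split": "' + feature_name + '"}'


def escaped_split_entry(feature_name: str) -> str:
    """Let json.dumps escape quotes and control characters in the name."""
    return '{"split": ' + json.dumps(feature_name) + '}'


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == '__main__':
    # same feature names as in the reproducible example
    for name in ['feature 0', '"feature 0"', '\tfeature\n1']:
        naive_ok = is_valid_json(naive_split_entry(name))
        escaped_ok = is_valid_json(escaped_split_entry(name))
        print(f'{name!r}: naive valid={naive_ok}, escaped valid={escaped_ok}')
```

With naive wrapping only the plain name parses, while the escaped variant parses for all three names and round-trips the original feature name.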

By contrast, save_model('*.json') always writes a valid JSON representation of the model.

Environment

xgboost: 1.4.2 and 1.7.6
ubuntu 22.04, Amazon Linux 2
installed via pip

Minimum reproducible example:

import numpy as np
import json
import xgboost as xgb


if __name__ == '__main__':

    x = np.random.rand(100, 3)
    y = np.random.rand(100, 1)

    # set dummy params
    params = {'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'max_depth': 4}

    # following feature names will not cause problems when calling .get_dump(dump_format='json') or dump_model(..., dump_format='json')
    correctly_dumped_feature_names = ['feature 0', 'feature 1', 'feature 2']
    dm_correctly_dumped_feature_names = xgb.DMatrix(x, label=y, feature_names=correctly_dumped_feature_names)
    correctly_dumped_booster = xgb.train(params=params, dtrain=dm_correctly_dumped_feature_names, num_boost_round=3)

    # get dump and attempt to iterate over trees JSONs
    correct_model_dump = correctly_dumped_booster.get_dump(dump_format='json')

    for tree_json_index, tree_json in enumerate(correct_model_dump):
        # all trees correctly loaded
        json.loads(tree_json)

    # dump_model writes valid JSON only if feature names do not contain any special characters
    # for the current feature names correctly_dumped_booster will be dumped to valid JSON
    correctly_dumped_booster.dump_model('correctly_dumped_booster_dump_model.json', dump_format='json')
    # save_model always saves model to valid JSON format
    correctly_dumped_booster.save_model('correctly_dumped_booster_save_model.json')

    # following feature names will cause problems when calling .get_dump(dump_format='json') or dump_model(..., dump_format='json')
    incorrectly_dumped_feature_names = ['"feature 0"', '\tfeature\n1', 'feature 2']
    dm_incorrectly_dumped_feature_names = xgb.DMatrix(x, label=y, feature_names=incorrectly_dumped_feature_names)
    incorrectly_dumped_booster = xgb.train(params=params, dtrain=dm_incorrectly_dumped_feature_names, num_boost_round=3)

    # get dump and attempt to iterate over trees JSONs
    incorrect_model_dump = incorrectly_dumped_booster.get_dump(dump_format='json')

    for tree_json_index, tree_json in enumerate(incorrect_model_dump):
        try:
            json.loads(tree_json)
        except json.JSONDecodeError as exc:
            print(f'Error when parsing tree with id: {tree_json_index}')
            print(exc)

    # dump_model writes valid JSON only if feature names do not contain any special characters
    # for the current feature names incorrectly_dumped_booster will be dumped to invalid JSON
    incorrectly_dumped_booster.dump_model('incorrectly_dumped_booster_dump_model.json', dump_format='json')
    # save_model always saves model to valid JSON format
    incorrectly_dumped_booster.save_model('incorrectly_dumped_booster_save_model.json')

Tanupriya-Singh · 2023-07-29T05:12:09Z

Hello @KWiecko , thanks for the detailed explanation. I can work on this. Let's see if the repository owners will approve the PR.

trivialfis · 2023-08-11T12:03:13Z

Apologies for the slow reply, here's a quick fix #9474 .

trivialfis added the type: bug label Aug 11, 2023

trivialfis mentioned this issue Aug 11, 2023

Handle special characters in JSON model dump. #9474

Merged

trivialfis closed this as completed in #9474 Aug 14, 2023
