get_dump(dump_format='json') and dump_model(..., dump_format='json') return invalid JSON strings when special characters are present in feature names #9352

Closed
KWiecko opened this issue Jun 30, 2023 · 2 comments · Fixed by #9474

KWiecko commented Jun 30, 2023

Bug description

Rationale:

At the end of our model training workflow we need to convert an xgboost Booster to a coremltools MLModel. We use the coremltools converter for this, which calls get_dump(dump_format='json') under the hood. It looks like get_dump(dump_format='json') returns invalid JSON strings if feature names contain special characters.

Detailed bug description:

Feature names that contain 'non-standard' characters (e.g. double quote '"', tab '\t', or newline '\n') are not properly escaped when dumped to JSON by the get_dump(dump_format='json') and dump_model(..., dump_format='json') methods. All feature names, whatever characters they contain, seem to be simply wrapped in double quotes, which produces an invalid JSON string whenever a name contains a character that must be escaped.

I'm not 100% sure, but my guess is that the JsonGenerator class (located in xgboost/src/tree/tree_model.cc) is responsible for dumping trees to JSON, so maybe replacing the plain "" wrapping (..., "split": "{fname}", ...) with proper JSON string escaping (the equivalent of a json.dumps() call in Python) would solve this issue?
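
To illustrate the suspected failure mode without involving xgboost at all, here is a minimal pure-Python sketch. The helper names (naive_split_field, escaped_split_field, is_valid_json) are made up for this illustration and are not part of xgboost; the "naive" variant only mimics what the dump appears to do, it is not the actual C++ code.

```python
import json


def naive_split_field(fname: str) -> str:
    # Assumption: mimics the observed behaviour, i.e. the feature name is
    # wrapped in double quotes without any escaping.
    return '{"split": "' + fname + '"}'


def escaped_split_field(fname: str) -> str:
    # json.dumps() escapes quotes, backslashes and control characters.
    return '{"split": ' + json.dumps(fname) + '}'


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == '__main__':
    for name in ['feature 0', '"feature 0"', '\tfeature\n1']:
        print(repr(name),
              'naive valid:', is_valid_json(naive_split_field(name)),
              'escaped valid:', is_valid_json(escaped_split_field(name)))
```

With escaping, the original name also survives a round trip through json.loads(), which is what a downstream consumer such as a converter would need.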

By contrast, save_model('*.json') always writes a valid JSON representation of the model.

Environment

  • xgboost: 1.4.2 and 1.7.6
  • ubuntu 22.04, Amazon Linux 2
  • installed via pip

Minimum reproducible example:

import numpy as np
import json
import xgboost as xgb


if __name__ == '__main__':

    x = np.random.rand(100, 3)
    y = np.random.rand(100, 1)

    # set dummy params
    params = {'objective': 'reg:squarederror', 'eval_metric': 'rmse', 'max_depth': 4}

    # following feature names will not cause problems when calling .get_dump(dump_format='json') or dump_model(..., dump_format='json')
    correctly_dumped_feature_names = ['feature 0', 'feature 1', 'feature 2']
    dm_correctly_dumped_feature_names = xgb.DMatrix(x, label=y, feature_names=correctly_dumped_feature_names)
    correctly_dumped_booster = xgb.train(params=params, dtrain=dm_correctly_dumped_feature_names, num_boost_round=3)

    # get dump and attempt to iterate over trees JSONs
    correct_model_dump = correctly_dumped_booster.get_dump(dump_format='json')

    for tree_json_index, tree_json in enumerate(correct_model_dump):
        # all trees correctly loaded
        json.loads(tree_json)

    # dump_model writes valid JSON only if feature names do not contain any special characters;
    # for the current feature names, correctly_dumped_booster is dumped to valid JSON
    correctly_dumped_booster.dump_model('correctly_dumped_booster_dump_model.json', dump_format='json')
    # save_model always saves model to valid JSON format
    correctly_dumped_booster.save_model('correctly_dumped_booster_save_model.json')

    # following feature names will cause problems when calling .get_dump(dump_format='json') or dump_model(..., dump_format='json')
    incorrectly_dumped_feature_names = ['"feature 0"', '\tfeature\n1', 'feature 2']
    dm_incorrectly_dumped_feature_names = xgb.DMatrix(x, label=y, feature_names=incorrectly_dumped_feature_names)
    incorrectly_dumped_booster = xgb.train(params=params, dtrain=dm_incorrectly_dumped_feature_names, num_boost_round=3)

    # get dump and attempt to iterate over trees JSONs
    incorrect_model_dump = incorrectly_dumped_booster.get_dump(dump_format='json')

    for tree_json_index, tree_json in enumerate(incorrect_model_dump):
        try:
            json.loads(tree_json)
        except json.JSONDecodeError as exc:
            print(f'Error when parsing tree with id: {tree_json_index}')
            print(exc)

    # dump_model writes valid JSON only if feature names do not contain any special characters;
    # for the current feature names, incorrectly_dumped_booster is dumped to invalid JSON
    incorrectly_dumped_booster.dump_model('incorrectly_dumped_booster_dump_model.json', dump_format='json')
    # save_model always saves model to valid JSON format
    incorrectly_dumped_booster.save_model('incorrectly_dumped_booster_save_model.json')
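
Possible workaround until this is fixed (a sketch only, not verified against coremltools): train with safe placeholder feature names and map them back after parsing the dump. The helpers make_safe_aliases and restore_names are hypothetical names introduced here. The code assumes each dumped tree is a nested dict where split nodes carry a "split" key and a "children" list, which is what the JSON dump looked like in my runs.

```python
import json


def make_safe_aliases(feature_names):
    """Return (aliases, alias_to_original) using plain names 'f0', 'f1', ..."""
    aliases = [f'f{i}' for i in range(len(feature_names))]
    return aliases, dict(zip(aliases, feature_names))


def restore_names(node, alias_to_original):
    """Recursively replace alias names in a parsed tree with the originals."""
    if 'split' in node:
        node['split'] = alias_to_original.get(node['split'], node['split'])
    for child in node.get('children', []):
        restore_names(child, alias_to_original)
    return node


if __name__ == '__main__':
    original_names = ['"feature 0"', '\tfeature\n1', 'feature 2']
    aliases, alias_to_original = make_safe_aliases(original_names)

    # With xgboost one would pass feature_names=aliases to xgb.DMatrix and
    # call get_dump(dump_format='json') as usual. A hand-written tree string
    # stands in for one dumped tree here.
    tree_json = ('{"nodeid": 0, "split": "f1", "children": ['
                 '{"nodeid": 1, "leaf": 0.1}, '
                 '{"nodeid": 2, "split": "f0", "children": ['
                 '{"nodeid": 3, "leaf": 0.2}, {"nodeid": 4, "leaf": 0.3}]}]}')
    tree = restore_names(json.loads(tree_json), alias_to_original)
    print(json.dumps(tree))
```

This only helps when the dump is parsed by our own code; a converter that calls get_dump internally would still see the placeholder names.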
@Tanupriya-Singh

Hello @KWiecko, thanks for the detailed explanation. I can work on this. Let's see if the repository owners will approve the PR.

@trivialfis
Member

Apologies for the slow reply, here's a quick fix: #9474.
