DMatrix handling of one-hot labels (Python) #10095

Closed
loewenm opened this issue Mar 7, 2024 · 4 comments

@loewenm

loewenm commented Mar 7, 2024

It appears that DMatrix does not handle one-hot encoded labels appropriately.

Symptom:

DMatrix flattens one-hot encoded labels into a 1-D array of shape [samples * classes,] instead of preserving the original 2-D shape.

Example:

import xgboost as xgb
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(
    n_samples=100, n_classes=5, n_labels=3, random_state=0
)
print("X: {} | y: {}".format(X.shape, y.shape))

>>> X: (100, 20) | y: (100, 5)
dmatrix_dataset = xgb.DMatrix(
    X,
    label=y,
)
print(
    "data: {} | labels: {}".format(
        dmatrix_dataset.get_data().shape, dmatrix_dataset.get_label().shape
    )
)

>>> data: (100, 20) | labels: (500,)

Explanation of output:

The original y variable in the above example has shape [100, 5] (i.e. five one-hot encoded label columns).

However, when I extract the labels from the DMatrix, they have been flattened to shape [100 * 5,].

Question:

Is this working as intended? Does DMatrix not support one-hot encoded labels?

Additional Notes:

There does not appear to be any parameter to fix this issue in the official docs.

@trivialfis
Member

At the moment, DMatrix supports consuming 2-D labels but doesn't support returning them. We implemented the receiving part so that basic multi-target/label training works; the returning part is still a work in progress as we try to improve support for custom objectives with multi-target/label.

@loewenm
Author

loewenm commented Mar 7, 2024

I see. In the meantime, suppose I needed to calculate a custom metric or objective on predicted values vs. DMatrix.get_label(). Should I simply refer to the original dataset instead of pulling the labels out of DMatrix as a workaround?

Any idea on when the returning functionality will become available?

@trivialfis
Member

Any idea on when the returning functionality will become available?

Unfortunately, it's not a high priority at the moment.

Should I simply refer to the original dataset instead of pulling the labels out of DMatrix as a workaround?

If you can use the original dataset, use it, with or without the feature support in DMatrix. One less conversion is more efficient. The get_label is for when XGBoost is deeply embedded in auto ML pipelines with data being split for training/validation and there's no way to obtain the original data.
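For readers who want to see what "use the original dataset" could look like, here is a minimal sketch of a metric that closes over the original 2-D label array and ignores the labels stored in the DMatrix. It follows the `(predt, dtrain) -> (name, value)` shape that XGBoost's custom metrics use. The names `make_hamming_metric` and `hamming_loss` are made up for illustration, and the sketch assumes a single evaluation set whose rows are in the same order as `y_true`, with predictions that are probabilities in [0, 1].

```python
import numpy as np


def make_hamming_metric(y_true):
    """Build a metric that compares predictions with the ORIGINAL 2-D labels.

    Hypothetical helper: assumes one evaluation set, in the same row order
    as y_true, and predictions that are probabilities in [0, 1].
    """
    y_true = np.asarray(y_true)

    def hamming_loss(predt, dmatrix):
        # dmatrix is deliberately unused: the labels come from the closure.
        # reshape raises ValueError if the number of predictions does not match.
        predt = np.asarray(predt).reshape(y_true.shape)
        predicted = (predt > 0.5).astype(y_true.dtype)
        # Fraction of (sample, label) cells that were predicted wrongly.
        return "hamming_loss", float(np.mean(predicted != y_true))

    return hamming_loss


# Small stand-alone check with hand-written predictions (no XGBoost needed).
y = np.array([[1, 0], [0, 1], [1, 1]])
metric = make_hamming_metric(y)
name, value = metric(np.array([[0.9, 0.2], [0.4, 0.4], [0.8, 0.7]]), None)
print(name, value)
```

Such a function could then be passed as the `custom_metric` argument of `xgb.train`. With several evaluation sets the closure would need a way to tell them apart, which this sketch does not attempt.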

That said, the returned label is a flattened row-major matrix (this is internal knowledge rather than something we want to document), so you can use that to reshape the NumPy array accordingly.
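A minimal sketch of that reshape, assuming the row-major layout described above still holds (it is stated to be internal and undocumented, so it may change). `restore_label_shape` is a hypothetical helper name, and the flat array here is produced with NumPy as a stand-in for `dmatrix.get_label()`.

```python
import numpy as np


def restore_label_shape(flat_labels, n_samples):
    """Reshape a flattened row-major label vector to (n_samples, n_targets).

    Hypothetical helper; relies on the row-major layout, which is an
    internal detail rather than documented behaviour.
    """
    flat_labels = np.asarray(flat_labels)
    if n_samples <= 0 or flat_labels.size % n_samples != 0:
        raise ValueError("label count is not a multiple of n_samples")
    # NumPy's default order ("C") is row-major, matching the layout above.
    return flat_labels.reshape(n_samples, -1)


# Stand-in for dmatrix.get_label(): a (4, 3) label matrix flattened row by row.
y = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]], dtype=np.float32)
flat = y.ravel()
restored = restore_label_shape(flat, n_samples=4)
print(restored.shape)
```

With a real DMatrix the call would look like `restore_label_shape(dmatrix.get_label(), dmatrix.num_row())`.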

@trivialfis
Member

Closing in favor of #9043
