DMatrix handling of one-hot labels (Python) #10095

loewenm · 2024-03-07T14:24:27Z

It appears that DMatrix does not handle one-hot encoded labels appropriately.

Symptom:

DMatrix flattens one-hot encoded labels into a 1-D array of shape [samples * classes,] instead of preserving the original shape.

Example:

import xgboost as xgb
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(
    n_samples=100, n_classes=5, n_labels=3, random_state=0
)
print("X: {} | y: {}".format(X.shape, y.shape))

>>> X: (100, 20) | y: (100, 5)

dmatrix_dataset = xgb.DMatrix(
    X,
    label=y,
)
print(
    "data: {} | labels: {}".format(
        dmatrix_dataset.get_data().shape, dmatrix_dataset.get_label().shape
    )
)

>>> data: (100, 20) | labels: (500,)

Explanation of output:

The original y variable in the above example has shape [100, 5] (i.e. five one-hot encoded labels).

However, when I extract the labels from the DMatrix, they have been flattened to shape [100 * 5,].

Question:

Is this working as intended? Does DMatrix not support one-hot encoded labels?

Additional Notes:

There does not appear to be any parameter to fix this issue in the official docs.


trivialfis · 2024-03-07T17:03:27Z

At the moment, DMatrix supports consuming 2-D labels but doesn't support returning them. We implemented the receiving part for DMatrix so that we can get basic multi-target/label training to work; the returning part is still a work in progress as we are trying to improve the support for custom objectives with multi-target/label.

loewenm · 2024-03-07T18:13:46Z

I see. In the meantime, suppose I needed to calculate a custom metric or objective on predicted values vs. DMatrix.get_label(). Should I simply reference the original dataset instead of pulling the labels out of DMatrix as a workaround?

Any idea on when the returning functionality will become available?

trivialfis · 2024-03-07T19:23:22Z

> Any idea on when the returning functionality will become available?

Unfortunately, it's not a high priority at the moment.

> Should I simply reference the original dataset instead of pulling it out of DMatrix as a workaround?

If you can use the original dataset, use it, with or without the feature support in DMatrix. One less conversion is more efficient. The get_label is for when XGBoost is deeply embedded in auto ML pipelines with data being split for training/validation and there's no way to obtain the original data.
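One way to follow this advice is to let the metric close over the original labels and ignore the DMatrix labels entirely. The following is only a sketch: the names `make_multilabel_error` and `multilabel_error` are hypothetical, the 0.5 threshold assumes probability-like predictions, and the `(predt, dmatrix)` argument order is assumed to match what a custom metric callback receives. Because the closure is tied to one label array, it would not be suitable as-is when several evaluation sets are scored.

```python
import numpy as np


def make_multilabel_error(y_true):
    """Build a metric function that closes over the original 2-D labels."""
    y_true = np.asarray(y_true)

    def multilabel_error(predt, dmatrix):
        # dmatrix is accepted only to match the assumed (predt, DMatrix)
        # callback shape; its labels are deliberately not read.
        predt = np.asarray(predt)
        if predt.size != y_true.size:
            raise ValueError("predictions do not match the captured labels")
        predt = predt.reshape(y_true.shape)
        # Fraction of label entries where the thresholded prediction is wrong.
        error = float(np.mean((predt > 0.5) != (y_true > 0.5)))
        return "multilabel_error", error

    return multilabel_error


# Tiny illustration with made-up labels and predictions.
y = np.array([[1, 0, 1], [0, 1, 0]])
pred = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]])
metric = make_multilabel_error(y)
name, value = metric(pred, None)  # None stands in for the DMatrix here
print(name, value)
```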

That said, the returned labels are laid out in row-major order (this is internal knowledge rather than something we want to document), so you can use this to reshape the NumPy array accordingly.
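As a minimal sketch of the reshape described above: assuming the flattened labels really are row-major (described in this thread as internal behaviour, not a documented guarantee), the 2-D shape can be recovered from the row count. The helper name `restore_label_shape` is hypothetical, and the flattened array is simulated with NumPy so the snippet runs without XGBoost.

```python
import numpy as np


def restore_label_shape(flat_labels, n_rows):
    """Reshape a flattened, row-major label vector to (n_rows, n_targets)."""
    flat_labels = np.asarray(flat_labels)
    if n_rows <= 0 or flat_labels.size % n_rows != 0:
        raise ValueError("label count is not a multiple of the row count")
    # NumPy's reshape uses row-major (C) order by default.
    return flat_labels.reshape(n_rows, -1)


# Simulate the report: labels of shape (100, 5) flattened to (500,).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=(100, 5)).astype(np.float32)
flat = y.reshape(-1)

restored = restore_label_shape(flat, n_rows=100)
print(flat.shape, restored.shape)
```

With a real DMatrix, the call would presumably look like `restore_label_shape(dmatrix_dataset.get_label(), dmatrix_dataset.num_row())`, using the row count to infer the number of targets.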

trivialfis · 2024-03-09T11:08:29Z

Closing in favor of #9043

trivialfis closed this as completed Mar 9, 2024
