Formula documentation for `predict_partial_hazard` function with categorical features #1645

tle4336 · 2024-11-22T06:10:20Z

Does anyone happen to know the formula used by the `predict_partial_hazard` function of the class CoxPHFitter when the features include categorical variables, each of which might have 3 or more values (e.g. IDs, day of week)?


tle4336 · 2024-11-22T23:43:33Z

Could anyone please help with the above question?

CamDavidsonPilon · 2024-11-24T16:03:16Z

reading the code, categorical inputs are transformed into one-hot columns, and the mean of that column from the training set is subtracted, then betas are applied.
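To make the three steps above concrete, here is a minimal NumPy sketch of the computation as described. It is not the lifelines source: the helper `one_hot_drop_first`, the toy day-of-week data, and the `betas` values are all hypothetical, and dropping the first category as a baseline is an assumption about how the encoding is set up.

```python
import numpy as np

def one_hot_drop_first(values, categories):
    """One-hot encode `values`; the first category is the baseline (assumed)."""
    values = np.asarray(values)
    return np.stack([(values == c).astype(float) for c in categories[1:]], axis=1)

categories = ["Mon", "Tue", "Wed"]           # hypothetical categorical feature
train = ["Mon", "Tue", "Tue", "Wed"]         # hypothetical training column

X_train = one_hot_drop_first(train, categories)  # columns: Tue, Wed
col_means = X_train.mean(axis=0)                 # training means of the one-hot columns
betas = np.array([0.3, -0.2])                    # hypothetical fitted coefficients

def partial_hazard(values):
    """exp((x - training mean) . beta), following the description above."""
    X = one_hot_drop_first(values, categories)
    return np.exp((X - col_means) @ betas)

ph = partial_hazard(["Tue"])
# For "Tue": (1 - 0.5) * 0.3 + (0 - 0.25) * (-0.2) = 0.2, so ph is exp(0.2)
```

Note that the columns for categories the row does not belong to still contribute a `(0 - mean) * beta` term.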

tle4336 · 2024-11-24T16:18:49Z

@CamDavidsonPilon Thank you very much for your help with my question, really appreciate your help. From your answer, I have two quick clarification questions:

1. Is the mean of a categorical-input column the same as the mean obtained from the method norm_mean of a trained CoxPHFitter model? For numerical-input columns these two are the same, but I just want to make sure that also holds for categorical columns.
2. When subtracting the training-set mean of that column, I understand the code computes (1 - mean) and (0 - mean) on the one-hot values, rather than taking the raw value of the original categorical-input column and subtracting the mean of the corresponding one-hot column (e.g. xi_{categorical} - mean). Can you please confirm whether this is the case?

CamDavidsonPilon · 2024-11-24T16:34:42Z

1. Yes.
2. I don't understand your question.

tle4336 · 2024-11-24T22:53:24Z

@CamDavidsonPilon Thank you very much for your quick reply.

Let me rephrase question 2 with a concrete example: let's say we have student ID as one of the categorical-input columns, where its value is an integer ranging from 20 to 40 (inclusive). From what you have described, CoxPHFitter would have 20 one-hot encoded columns x_21 to x_40, and each of these columns would have its mean computed from the training data. All good at this point.
Now, let's say the input for inference contains student ID = 21. In the calculation of the exponent of the partial hazard term, do we then actually have this sum: (1 - mean of column x_21) * beta_{x21} + (0 - mean of column x_22) * beta_{x22} + ... + (0 - mean of column x_40) * beta_{x40} + [other terms associated with other predictors]?

(https://web.archive.org/web/20070630025831/https://www.stat.nus.edu.sg/%7Estachenz/ST3242Notes3.pdf: according to page 2 of these slides, without de-meaning we would not have the terms (0 - mean of column x_22) * beta_{x22} + ... + (0 - mean of column x_40) * beta_{x40}. But because of de-meaning, we do have them?)

CamDavidsonPilon · 2024-11-26T02:05:42Z

> do we actually have this sum: (1 - mean of column x_21) * beta_{x21} + (0 - mean of column x_22) * beta_{x22} + ... + (0 - mean of column x_40) * beta_{x40} + [other terms associated with other predictors]?

We do, yes.

Authors may choose to include demeaning in their formulas or not, but implementations must choose. Demeaning typically leads to better numerical stability, so lifelines demeans. Demeaning isn't that important, either: the output of predict_partial_hazard is a meaningless number on its own, only good for ranking / ratios, and whether the prediction is demeaned or not doesn't affect those.
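The point about ranking and ratios can be checked numerically. The sketch below uses made-up one-hot rows and hypothetical `betas` (nothing here comes from lifelines itself). Subtracting the column means multiplies every prediction by the same constant, exp(-mean . beta), so ratios between subjects and their ordering are unchanged.

```python
import numpy as np

# Hypothetical one-hot design matrix: three category columns plus a baseline row
X = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],   # baseline category: all zeros
])
betas = np.array([0.4, -0.1, 0.25])   # hypothetical coefficients
means = X.mean(axis=0)                # stand-in for the training-set column means

raw = np.exp(X @ betas)                  # partial hazard without demeaning
centered = np.exp((X - means) @ betas)   # partial hazard with demeaning

# The two versions differ only by one constant factor shared by all rows
factor = raw / centered

# Ratios between any two subjects are therefore identical
ratio_raw = raw[0] / raw[1]
ratio_centered = centered[0] / centered[1]

# ...and so is the ranking of subjects
same_ranking = np.array_equal(np.argsort(raw), np.argsort(centered))
```

Here `ratio_raw` and `ratio_centered` both equal exp(0.4 - (-0.1)), the hazard ratio between the first two rows.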

tle4336 changed the title ~~Math formula documentation for predict_partial_hazard with categorical features~~ Formula documentation for predict_partial_hazard function with categorical features Nov 22, 2024
