Formula documentation for predict_partial_hazard function with categorical features #1645

Open
tle4336 opened this issue Nov 22, 2024 · 6 comments

@tle4336

tle4336 commented Nov 22, 2024

Does anyone know the formula used by the predict_partial_hazard function of the CoxPHFitter class when the features include categorical variables, each of which might have three or more values (e.g. IDs, day of week)?

@tle4336 tle4336 changed the title from "Math formula documentation for predict_partial_hazard with categorical features" to "Formula documentation for predict_partial_hazard function with categorical features" Nov 22, 2024
@tle4336
Author

tle4336 commented Nov 22, 2024

Could anyone please help with the above question?

@CamDavidsonPilon
Owner

Reading the code: categorical inputs are transformed into one-hot columns, the training-set mean of each column is subtracted, and then the betas are applied.
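
A minimal sketch of the three steps described above (one-hot encode, subtract the training-set column means, apply the betas), written with plain NumPy rather than lifelines itself. The data, the coefficient values, the helper names `one_hot` and `partial_hazard`, and the choice of "Mon" as a dropped reference level are all assumptions made for illustration; they are not taken from the lifelines source.

```python
import numpy as np

# Hypothetical toy data: one categorical feature with three levels.
levels = ["Mon", "Tue", "Wed"]   # "Mon" is treated as the dropped reference level (assumption)
train = ["Mon", "Tue", "Tue", "Wed"]
betas = np.array([0.5, -0.2])    # made-up coefficients for the "Tue" and "Wed" columns


def one_hot(values):
    """Indicator columns for every level except the reference level."""
    return np.array([[float(v == lvl) for lvl in levels[1:]] for v in values])


# Training-set mean of each one-hot column.
col_means = one_hot(train).mean(axis=0)


def partial_hazard(values):
    """exp((x - training mean) . beta) for each input row."""
    return np.exp((one_hot(values) - col_means) @ betas)


ph = partial_hazard(["Tue", "Mon"])
print(col_means)   # per-column training means
print(np.log(ph))  # the demeaned linear predictors
```

For the "Tue" row the exponent is (1 - 0.5) * 0.5 + (0 - 0.25) * (-0.2), and for the reference-level "Mon" row every indicator is 0, so only the subtracted means contribute.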

@tle4336
Author

tle4336 commented Nov 24, 2024

@CamDavidsonPilon Thank you very much for your help with my question; I really appreciate it. Based on your answer, I have two quick clarifying questions:

  1. Is the mean of a categorical-input column the same as the mean obtained from the method norm_mean of a trained CoxPHFitter model? For numerical-input columns these two are the same, but I want to confirm that this also holds for categorical columns.

  2. When the training-set mean of that column is subtracted, I understand that the code computes (1 - mean) and (0 - mean) on the one-hot columns, rather than taking the raw value of the original categorical-input column and subtracting the mean of the corresponding one-hot column from it (e.g. xi_{categorical} - mean). Can you please confirm whether this is the case?

@CamDavidsonPilon
Owner

  1. Yes.
  2. I don't understand your question

@tle4336
Author

tle4336 commented Nov 24, 2024

@CamDavidsonPilon Thank you very much for your quick reply.

Let me rephrase question 2 with a concrete example: say student ID is one of the categorical-input columns, with integer values ranging from 20 to 40 (inclusive). From what you have described, CoxPHFitter would have 20 one-hot encoded columns, x_21 to x_40, each with its mean computed from the training data. All good at this point.
Now, let's say the input contains student ID = 21 for inference. Then in the calculation of the exponent of the partial hazard term, do we actually have this sum: (1 - mean of column x_21) * beta_{x21} + (0 - mean of column of x_22) * beta_{x22} +.... + (0 - mean of column of x_40) * beta_{x40} + [other terms associated with other predictors] ?

(Reference: https://web.archive.org/web/20070630025831/https://www.stat.nus.edu.sg/%7Estachenz/ST3242Notes3.pdf. According to page 2 of these slides, without demeaning we would not have the terms (0 - mean of column of x_22) * beta_{x22} +.... + (0 - mean of column of x_40) * beta_{x40}. But because of the demeaning, we do have them?)

@CamDavidsonPilon
Owner

do we actually have this sum: (1 - mean of column x_21) * beta_{x21} + (0 - mean of column of x_22) * beta_{x22} +.... + (0 - mean of column of x_40) * beta_{x40} + [other terms associated with other predictors] ?

We do, yes.
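
Written out for the student ID = 21 example, the confirmed sum can be rearranged as below. The symbols are notation introduced here for illustration: \bar{x}_k is the training-set mean of one-hot column x_k, \beta_k is its coefficient, and \eta is the exponent of the partial hazard.

```latex
\begin{aligned}
\eta_{\mathrm{ID}=21}
  &= (1 - \bar{x}_{21})\,\beta_{21}
     + \sum_{k=22}^{40} (0 - \bar{x}_{k})\,\beta_{k}
     + (\text{other terms}) \\
  &= \beta_{21}
     - \sum_{k=21}^{40} \bar{x}_{k}\,\beta_{k}
     + (\text{other terms})
\end{aligned}
```

The subtracted sum does not depend on which student is being scored, so relative to the non-demeaned exponent \beta_{21} + (other terms) it is the same constant offset for every row.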

Authors may choose to include demeaning in their formulas or not, but an implementation has to pick one. Demeaning typically leads to better numerical stability, so lifelines demeans. Demeaning isn't that important, either: the output of predict_partial_hazard is a meaningless number on its own, only good for ranking / ratios, and whether the prediction is demeaned or not doesn't affect these.
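
A small NumPy check of the ranking / ratio claim, as a sketch under stated assumptions: the design matrix `X` and the coefficients are made up, and nothing here calls lifelines. Demeaning multiplies every prediction by one shared constant, exp(-mean . beta), so ratios between rows and their ordering come out the same either way.

```python
import numpy as np

# Hypothetical one-hot design matrix (last row is the reference level) and made-up betas.
X = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])
betas = np.array([0.4, -0.7, 0.1])

raw = np.exp(X @ betas)                          # no demeaning
demeaned = np.exp((X - X.mean(axis=0)) @ betas)  # training means subtracted first

# The two differ only by a constant factor shared by all rows.
scale = np.exp(-X.mean(axis=0) @ betas)

same_ratios = np.allclose(demeaned / demeaned[0], raw / raw[0])
same_ranking = np.array_equal(
    np.argsort(raw, kind="stable"), np.argsort(demeaned, kind="stable")
)
print(same_ratios, same_ranking)
```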


2 participants