I am trying to use XGBoost to model claims frequency on data generated from unequal-length exposure periods, but have been unable to get the model to treat the exposure correctly.
I have now worked out how to do this using setinfo to change the base_margin attribute to be the offset (as a linear predictor), i.e.:
setinfo(xgtrain, "base_margin", log(d$exposure))
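For context, here is a minimal sketch of the full workflow that call sits in. The data frame d with columns x1, x2, count and exposure, and the parameter values, are placeholders for illustration:

    library(xgboost)

    # Features and claim counts (assumed column names)
    X <- as.matrix(d[, c("x1", "x2")])
    xgtrain <- xgb.DMatrix(data = X, label = d$count)

    # Supply log(exposure) as the offset on the log-link scale
    setinfo(xgtrain, "base_margin", log(d$exposure))

    fit <- xgb.train(
      params = list(objective = "count:poisson", eta = 0.1, max_depth = 3),
      data = xgtrain,
      nrounds = 100
    )

    # Predictions include the base_margin, so they are expected counts for
    # each row's exposure; divide by exposure to recover frequency.
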
At least with the glm function in R, modeling count ~ x1 + x2 + offset(log(exposure)) with family=poisson(link='log') is equivalent to modeling I(count/exposure) ~ x1 + x2 with family=poisson(link='log') and weights=exposure. That is, normalize your count by exposure to get frequency, and model the frequency with exposure as the weight. Your estimated coefficients should be the same in both cases when using glm for Poisson regression. Try it for yourself on a sample data set, as in the sketch below.
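One way to check the equivalence is on simulated data; the data-generating process and coefficient values below are made up for illustration:

    # Simulate Poisson counts with varying exposure
    set.seed(42)
    n <- 5000
    d <- data.frame(
      x1 = rnorm(n),
      x2 = rnorm(n),
      exposure = runif(n, 0.1, 2)
    )
    d$count <- rpois(n, lambda = d$exposure * exp(0.3 + 0.5 * d$x1 - 0.2 * d$x2))

    # (1) Count response with log(exposure) as an offset
    fit_offset <- glm(count ~ x1 + x2 + offset(log(exposure)),
                      family = poisson(link = "log"), data = d)

    # (2) Frequency response (count/exposure) with exposure as the prior weight
    fit_weight <- glm(I(count / exposure) ~ x1 + x2,
                      family = poisson(link = "log"), weights = exposure, data = d)

    # Coefficients agree; glm will warn about non-integer responses in (2),
    # which is expected and harmless here.
    cbind(offset = coef(fit_offset), weighted = coef(fit_weight))
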
I'm not exactly sure what objective='count:poisson' corresponds to, but I would expect that setting your target variable to frequency (count/exposure) and using exposure as the weight in xgboost would be the way to go when exposures vary.
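A sketch of that alternative, reusing the simulated data frame d and the xgboost package from the sketches above (parameter values are arbitrary, and I have not verified how count:poisson handles non-integer labels in every version):

    # Frequency target with exposure as the instance weight
    X <- as.matrix(d[, c("x1", "x2")])
    dtrain <- xgb.DMatrix(
      data = X,
      label = d$count / d$exposure,  # frequency
      weight = d$exposure            # exposure as case weight
    )

    fit_freq <- xgb.train(
      params = list(objective = "count:poisson", eta = 0.1, max_depth = 3),
      data = dtrain,
      nrounds = 100
    )

    # Predictions are frequencies; multiply by exposure to get expected counts.
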