
Feature Request: Count-Based Target Encoder (Dracula)? #420

Open
bking124 opened this issue Sep 11, 2023 · 1 comment
bking124 commented Sep 11, 2023

I recently stumbled upon a categorical encoding idea dubbed "Distributed Robust Algorithm for Count-based Learning" (aka Dracula), described in this Microsoft blog as well as this talk. It seems to mix ideas from CountEncoder and TargetEncoder. Has anybody heard of this approach before, and has there been any thought of introducing such an encoder into the package? I'm interested in comparing this approach with the typical TargetEncoder.

Thanks for the wonderful package!

@PaulWestenthanner (Collaborator) commented

Hi @bking124

I haven't heard of the approach before. Searching for "Dracula encoder" or "CTR encoder" (as mentioned in the talk) also doesn't yield much. Since the talk and blog post are already 8 years old and the idea hasn't gained much traction since, I'd be surprised if it yields great results.
On the other hand, we could include it in the package; I think it should be fairly straightforward to implement.
From what I understood, the encoded value is calculated as follows:

  1. Calculate the counts for each label: `df.groupBy(col, label).count()`. This can only be done for the top N categories; the rest go into a "rest" category.
  2. Use as the encoded value for a category x: `counts[x, target=0], counts[x, target=1], ..., log-odds, flag_is_rest`.
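The two steps above could be sketched roughly like this for binary classification. The function names, the top-N cutoff, and the smoothing term are illustrative assumptions on my part, not the blog's actual implementation:

```python
# Sketch of a count-based (Dracula-style) encoding for a binary target.
# Features per category: (count_y0, count_y1, log_odds, is_rest).
import math
from collections import Counter

def fit_count_encoding(categories, labels, top_n=2, smoothing=1.0):
    """Build a lookup table of count features, keeping only the top_n
    most frequent categories; everything else falls into a rest bucket."""
    totals = Counter(categories)
    top = {c for c, _ in totals.most_common(top_n)}

    # Count (category, label) pairs, routing rare categories to __rest__.
    pair_counts = Counter()
    for c, y in zip(categories, labels):
        key = c if c in top else "__rest__"
        pair_counts[(key, y)] += 1

    table = {}
    for c in top | {"__rest__"}:
        n0 = pair_counts[(c, 0)]
        n1 = pair_counts[(c, 1)]
        # Smoothed log-odds to avoid log(0) when a label count is zero.
        log_odds = math.log((n1 + smoothing) / (n0 + smoothing))
        table[c] = (n0, n1, log_odds)
    return table

def transform(table, category):
    """Encode a single category value; unseen values map to the rest bucket."""
    key = category if category in table else "__rest__"
    n0, n1, log_odds = table[key]
    return [n0, n1, log_odds, float(key == "__rest__")]

tab = fit_count_encoding(["a", "a", "b", "b", "b", "c"],
                         [1, 0, 1, 1, 0, 0], top_n=2)
print(transform(tab, "a"))  # counts for "a", its log-odds, is_rest=0
print(transform(tab, "z"))  # unseen category falls into the rest bucket
```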

I'm not quite sure how to handle the regression case. Probably we'd need some binning of the target variable there?
Also, small categories might lead to overfitting if the classifier basically ignores the counts and just uses the log-odds (which it will). This is a potential issue, just like in target encoding with too little regularisation.
In fact, this is pretty much what you'd get by encoding a variable with both the count encoder and the target encoder (with no regularisation).
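For the regression case speculated about above, one possible approach would be to bin the continuous target into quantiles and then treat each bin index as a class label for the counting scheme. This is purely an illustrative sketch; `quantile_bins` and its parameters are hypothetical names, not part of any existing API:

```python
# Turn a continuous target into discrete quantile-bin labels, so the
# count-based scheme above can be applied to a regression target.
def quantile_bins(y, n_bins=4):
    """Assign each value its quantile bin index (0 .. n_bins-1) by rank."""
    ranked = sorted(range(len(y)), key=lambda i: y[i])
    bins = [0] * len(y)
    for rank, i in enumerate(ranked):
        # Integer division maps equal-sized rank ranges to the same bin.
        bins[i] = min(rank * n_bins // len(y), n_bins - 1)
    return bins

print(quantile_bins([10, 20, 30, 40], n_bins=2))  # -> [0, 0, 1, 1]
```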
