Memory increase of WOEEncoder for newer category_encoders version #364

Open
Piecer-plc opened this issue Jul 19, 2022 · 2 comments


@Piecer-plc

Memory increase of WOEEncoder for category_encoders version >=2.0.0

Hi, I noticed another memory issue, this time with WOEEncoder. I reported a similar bug in #335; the two reports differ in the encoder used and in the dataset. I am filing a separate report so that the two encoder APIs can be tracked independently.

Expected Behavior

Similar memory usage across category_encoders versions.

Actual Behavior

According to the experiment results, when the category_encoders version is 2.0.0 or higher, the memory usage of weight_enc.fit(train[weight_encode], train['target']) increases from 58 MB to more than 200 MB.

Memory (MB)    Version
209            2.3.0
209            2.2.2
209            2.1.0
209            2.0.0
58             1.3.0

Steps to Reproduce the Problem

Step 1: Download the dataset

train.zip

Step 2: Install category_encoders

pip install category_encoders==<version>

Step 3: Run the script below for each category_encoders version and record the memory usage

import tracemalloc

import numpy as np
import pandas as pd
import category_encoders as ce

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
columns = [x for x in train.columns if x != 'target']
object_col_label = ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']
one_hot_encode = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4']
target_encode = ['nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']
weight_encode = target_encode + ['ord_4', 'ord_5', 'ord_3'] + one_hot_encode + object_col_label

weight_enc = ce.woe.WOEEncoder(cols=weight_encode)

# Start tracing right before fit() so only its allocations are measured
tracemalloc.start()
weight_enc.fit(train[weight_encode], train['target'])
current3, peak3 = tracemalloc.get_traced_memory()
print(f"WOEEncoder.fit memory usage is {current3 / 1024 / 1024:.1f} MB; peak memory was {peak3 / 1024 / 1024:.1f} MB")
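
A note on reading these numbers: tracemalloc only counts allocations made after tracemalloc.start(), so the figures printed by the script exclude the DataFrames already loaded by pd.read_csv and reflect what fit() itself allocates. A minimal standard-library sketch of that behavior (the buffer sizes are arbitrary and chosen only for illustration):

```python
import tracemalloc

MB = 1024 * 1024

# Allocated before tracing starts: tracemalloc does not see this buffer.
before = bytearray(20 * MB)

tracemalloc.start()
# Allocated while tracing: this buffer is counted.
during = bytearray(5 * MB)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

current_mb = current / MB
peak_mb = peak / MB
print(f"current: {current_mb:.1f} MB, peak: {peak_mb:.1f} MB")
```

Only the 5 MB buffer shows up in the traced totals, even though 25 MB are held in total.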

Specifications

Version: 2.3.0, 2.2.2, 2.1.0, 2.0.0, 1.3.0
Platform: Ubuntu 16.04
OS : Ubuntu
CPU : Intel(R) Core(TM) i9-9900K CPU
GPU : TITAN V

@glevv
Contributor

glevv commented Aug 6, 2022

This happens because WOEEncoder relies on OrdinalEncoder (OE), and OE deep-copies the input data:

X = X_in.copy(deep=True)
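
To illustrate the cost of that line, here is a minimal sketch, not taken from category_encoders itself, that traces a deep copy of a synthetic DataFrame. The frame, its column names, and its size are assumptions made for the example; the point is only that copy(deep=True) allocates roughly as much memory as the data it duplicates.

```python
import tracemalloc

import numpy as np
import pandas as pd

MB = 1024 * 1024
n_rows = 1_000_000  # arbitrary size, chosen for illustration

# Synthetic stand-in for the training data (not the dataset from this issue).
frame = pd.DataFrame({
    "a": np.arange(n_rows, dtype=np.int64),
    "b": np.arange(n_rows, dtype=np.float64),
})
frame_mb = frame.memory_usage(index=False).sum() / MB

# Trace only the deep copy, mirroring X = X_in.copy(deep=True).
tracemalloc.start()
duplicate = frame.copy(deep=True)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

copy_peak_mb = peak / MB
print(f"frame: {frame_mb:.1f} MB, peak during deep copy: {copy_peak_mb:.1f} MB")
```

If the encoder keeps or further transforms the copied frame during fit(), the traced memory would be expected to grow with the size of the input, which is consistent with the jump reported above.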

@bmreiniger
Contributor

(When) do we actually need to copy inputs?
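
One case where a copy is needed is when the encoder writes to the frame it receives. A minimal sketch, using hypothetical helper functions (encode_in_place, encode_with_copy) and a toy color column that are not part of category_encoders:

```python
import pandas as pd


def encode_in_place(X, mapping):
    # Hypothetical encoder step: overwrites the column on the caller's frame.
    X["color"] = X["color"].map(mapping)
    return X


def encode_with_copy(X, mapping):
    # Same step, but on a deep copy, so the caller's frame is left untouched.
    X = X.copy(deep=True)
    X["color"] = X["color"].map(mapping)
    return X


mapping = {"red": 1, "blue": 2}
original = pd.DataFrame({"color": ["red", "blue", "red"]})

encoded = encode_with_copy(original, mapping)
after_copy = original["color"].tolist()      # caller's data is unchanged

encode_in_place(original, mapping)
after_in_place = original["color"].tolist()  # caller's data was overwritten
```

A step that only reads from the input, or that builds its output in a new object, would not need the copy under this reasoning; whether that applies to every code path in OrdinalEncoder is not something this sketch establishes.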
