
Add support to multilabel #340

Closed
glemaitre opened this issue Sep 3, 2017 · 25 comments

@glemaitre
Member

We should add support for multilabel y when it can be converted back to multiclass, i.e. when each row of the indicator matrix sums to one.

@chkoar
Member

chkoar commented Sep 4, 2017

Are we talking about multilabel or multioutput/multiclass?

@glemaitre
Member Author

Those are always confusing. An example will speak for itself (but it should be a multilabel case encoding a multiclass problem):

[[0 0 1]
 [1 0 0]
 [0 1 0]]

is a multilabel-indicator type encoding the following:

[[2]
 [0]
 [1]]
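
For reference, a minimal round-trip sketch with plain NumPy (the variable names are just illustrative, not part of imbalanced-learn's API):

import numpy as np

y_indicator = np.array([[0, 0, 1],
                        [1, 0, 0],
                        [0, 1, 0]])

# Every row sums to one, so the indicator matrix encodes exactly one class per sample
assert (y_indicator.sum(axis=1) == 1).all()

# Indicator -> multiclass: argmax recovers the class index
y_multiclass = y_indicator.argmax(axis=1)      # array([2, 0, 1])

# Multiclass -> indicator: one-hot encode again
y_back = np.eye(3, dtype=int)[y_multiclass]
assert (y_back == y_indicator).all()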

@chkoar
Member

chkoar commented Sep 4, 2017

I wouldn't call it multilabel. It is a binarized version of the target, right?
I am -1 for adding that logic inside the algorithms. We could use the LabelBinarizer for that, no?
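
For what it's worth, a minimal sketch of that idea with scikit-learn's LabelBinarizer (variable names are illustrative):

from sklearn.preprocessing import LabelBinarizer

y = [2, 0, 1]
lb = LabelBinarizer()
y_bin = lb.fit_transform(y)               # the indicator matrix shown above
y_original = lb.inverse_transform(y_bin)  # back to array([2, 0, 1])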

@massich
Contributor

massich commented Sep 4, 2017

@chkoar I think that @glemaitre is referring to providing the same support for y as scikit-learn does (see here).

@MarcoNiemann

MarcoNiemann commented Nov 9, 2017

Well, shouldn't multi-label be:

[[0,1,1],
 [1,0,0],
 [0,1,0],
 [1,0,1],
 [1,0,1],
 ...]

Because the version mentioned by @glemaitre appears, as stated by @chkoar, to be a binarized version of a multi-class problem. The difference between multi-class and multi-label is that multi-class only allows assigning a single class to each target instance, whereas multi-label allows an arbitrary number of class assignments per instance.

For an implementation, one might consider the label powerset transformation of multi-label data into a multi-class data set. E.g., for the data set above one might apply the following transformation:

[[1],
 [2],
 [3],
 [4],
 [4],
 ...]

For anyone searching for a quick and dirty solution, I have had some success with the following:

from skmultilearn.problem_transformation import LabelPowerset
from imblearn.over_sampling import RandomOverSampler

# Import a dataset with X and multi-label y

lp = LabelPowerset()
ros = RandomOverSampler(random_state=42)

# Applies the above stated multi-label (ML) to multi-class (MC) transformation.
yt = lp.transform(y)

X_resampled, y_resampled = ros.fit_sample(X, yt)

# Inverts the ML-MC transformation to recreate the ML set
y_resampled = lp.inverse_transform(y_resampled)

(The skmultilearn package is used for convenience's sake, to avoid writing a custom transformation!)

@glemaitre
Member Author

imblearn accepts one-vs-all encoding by default from now on.

@j-greer

j-greer commented Jul 19, 2018

@MarcoNiemann your solution works well when the imbalance occurs across the rows (i-th dimension) of y rather than the columns (j-th).

Expanding upon your example:

[
[0,1,1],
[0,1,1],
[1,1,1],
[1,1,1],
[1,1,1],
[1,1,1],

 ...
]

Can be considered imbalanced along rows but take the following example:

[
[0,0,1],
[1,0,0],
[1,0,0],
[1,1,0],
[1,1,0],
 ...
]

This is imbalanced in the sense that the third label column (y_i3) is mostly zero. Do you know of a way of addressing this type of imbalance problem using imbalanced-learn? @glemaitre
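
As a side note, a quick way to quantify this column-wise imbalance (just a sketch with plain NumPy, nothing imbalanced-learn specific):

import numpy as np

y = np.array([[0, 0, 1],
              [1, 0, 0],
              [1, 0, 0],
              [1, 1, 0],
              [1, 1, 0]])

# Fraction of positive samples per label (column): the third label is much rarer
print(y.mean(axis=0))   # [0.8 0.4 0.2]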

@rjurney

rjurney commented Jul 31, 2019

@glemaitre This seems to be an unsolved problem in the Python space. Support for this would be amazing.

@glemaitre
Member Author

glemaitre commented Aug 6, 2019

@rjurney The issue is that the literature does not address this problem, so I am not really sure how we could go forward. It would be nice to have an overview of the full literature; it has been a while since I last looked at it.

@HabeebullahEbrahemi

# Just correcting the import for my case (Python 3.7):
from skmultilearn.problem_transform import LabelPowerset

@daanvdn

daanvdn commented Oct 17, 2019

@glemaitre, I found the article below that proposes MLSMOTE, an adaptation of SMOTE to multi-label problems:

Charte, Francisco, et al. "MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation." Knowledge-Based Systems 89 (2015): 385-397.

There is also an (open-source) Java implementation on GitHub: https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MLSMOTE.java
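
For anyone curious, here is a rough Python sketch of the core MLSMOTE idea (a simplification of the paper's pseudocode, not a tested or reference implementation; names and defaults are illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mlsmote_sketch(X, y, n_neighbors=5, random_state=None):
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X, dtype=float), np.asarray(y)

    # Minority labels: those whose imbalance ratio (IRLbl) exceeds the mean (MeanIR)
    counts = y.sum(axis=0).astype(float)
    irlbl = counts.max() / np.maximum(counts, 1)
    minority_labels = np.flatnonzero(irlbl > irlbl.mean())

    new_X, new_y = [], []
    for label in minority_labels:
        min_idx = np.flatnonzero(y[:, label] == 1)
        if min_idx.size <= 1:
            continue
        k = min(n_neighbors, min_idx.size - 1)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X[min_idx])
        neigh = min_idx[nn.kneighbors(X[min_idx], return_distance=False)[:, 1:]]
        for seed, seed_neigh in zip(min_idx, neigh):
            # SMOTE-style interpolation between the seed sample and a random neighbour
            ref = X[rng.choice(seed_neigh)]
            new_X.append(X[seed] + rng.random() * (ref - X[seed]))
            # Label set by majority vote over the seed and its neighbours (simplified ranking)
            votes = y[seed_neigh].sum(axis=0) + y[seed]
            new_y.append((votes > (len(seed_neigh) + 1) / 2).astype(y.dtype))

    if not new_X:
        return X, y
    return np.vstack([X, new_X]), np.vstack([y, new_y])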

@aamin21

aamin21 commented Oct 17, 2019

Any update on this? Stuck on this one.

@woolr

woolr commented Jan 9, 2020

@daanvdn do you know if anyone has implemented this in Python?

@daanvdn

daanvdn commented Jan 10, 2020 via email

@alfredsasko

@daanvdn, @glemaitre I read the article referenced by @daanvdn. The researchers claim that MLSMOTE is superior on highly imbalanced multi-label datasets compared to other popular algorithms like BR, RAkEL, and CLR. They also provide pseudocode for the algorithm. I am trying to implement it in my project; once I succeed I will share the code with you.

@t-lini

t-lini commented Mar 25, 2020

It might be worth also considering ML-ROS and ML-RUS as multilabel random over- and undersampling methods respectively, which were introduced by the authors of the article referenced by @daanvdn in an article prior to MLSMOTE, see:
F. Charte, A.J. Rivera, M.J. del Jesus, F. Herrera, Addressing imbalance in multilabel classification: measures and random resampling algorithms, Neurocomputing 163(9) (2015) 3–16, http://dx.doi.org/10.1016/j.neucom.2014.08.091.
These algorithms might be a good choice if you do not want to or cannot use synthetic resampling methods. Implementations in Java are also available in the MULAN package:
https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MultiLabelRandomOverSampling.java
https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MutilLabelRandomUnderSampling.java
I will try to implement these methods in Python.
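
Roughly, the ML-ROS idea I have in mind looks like this (just a sketch based on my reading of the paper; the per-label budget split and stopping rule are simplified, not the reference implementation):

import numpy as np

def ml_ros_sketch(X, y, sample_fraction=0.25, random_state=None):
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)

    # IRLbl per label and MeanIR, as defined by Charte et al.
    counts = y.sum(axis=0).astype(float)
    irlbl = counts.max() / np.maximum(counts, 1)
    minority_labels = np.flatnonzero(irlbl > irlbl.mean())

    # Cloning budget: a fraction of the dataset size, split evenly across minority labels
    budget = int(len(X) * sample_fraction)
    per_label = max(budget // max(len(minority_labels), 1), 1)

    clone_idx = []
    for label in minority_labels:
        candidates = np.flatnonzero(y[:, label] == 1)
        if candidates.size:
            clone_idx.extend(rng.choice(candidates, size=per_label, replace=True))

    clone_idx = np.asarray(clone_idx, dtype=int)
    return np.vstack([X, X[clone_idx]]), np.vstack([y, y[clone_idx]])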

@chkoar
Member

chkoar commented Mar 25, 2020

I will try to implement these methods in Python.

That would be a great addition

@SimonErm

SimonErm commented May 5, 2020

I have tried to implement MLSMOTE in Python, but since I am not an experienced Python programmer, it consists of a lot of Stack Overflow solutions and ugly code. As far as the logic is concerned, it should be correct.
https://gist.github.com/SimonErm/b06c236cafdeb79fdf7adb90aef04fec

@chkoar
Member

chkoar commented May 5, 2020

@SimonErm I encourage you to add docstrings, write comments with your intention wherever you think it is appropriate, write some tests and open a PR in draft mode, so we could discuss your code in the PR.

@Vishnux0pa

@SimonErm I tried your code and it works, but it generates a seemingly random number of samples, i.e. I can't specify how many samples I need. Is there a way to do that? Also, it would be good if you could share the paper.

@SimonErm

@Vishnux0pa That's because the number of generated samples is driven by the imbalance ratio of each label, which is described in the paper. You can find a reference in the description of the PR. It's the same one mentioned by @daanvdn:

I found the article below that proposes MLSMOTE, an adaptation of SMOTE to multi-label problems:

Charte, Francisco, et al. "MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation." Knowledge-Based Systems 89 (2015): 385-397.

There is also an (open-source) java implementation on github: https://github.com/tsoumakas/mulan/blob/master/mulan/src/main/java/mulan/sampling/MLSMOTE.java
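
To make that concrete, this is roughly how the paper decides which labels receive new samples (a small sketch of the IRLbl/MeanIR rule; the total number of generated samples follows from it, which is why it cannot be specified directly):

import numpy as np

def minority_labels(y):
    # IRLbl: count of the most frequent label divided by each label's count
    counts = np.asarray(y).sum(axis=0).astype(float)
    irlbl = counts.max() / np.maximum(counts, 1)
    return np.flatnonzero(irlbl > irlbl.mean())   # labels MLSMOTE would oversample

y = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]])
print(minority_labels(y))   # [2] -> only the rarest label triggers synthetic samples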

@xelandar

xelandar commented Aug 30, 2020

As far as I can see, another implementation of MLSMOTE can be found here (via this Medium article). I haven't tested it yet, but thought it would be good to share it in this relevant thread.

@chkoar
Member

chkoar commented Aug 30, 2020

@xelandar there is already a PR here, but it hasn't been reviewed yet, probably due to lack of time.

@balvisio

I have created a new PR that implements MLSMOTE: #927.

@imaspol

imaspol commented Aug 30, 2024

Hi, it would be great to have a version of classification_report_imbalanced for multilabel imbalanced data. Do you plan to implement it?
