
Add CLARA Clustering algorithm #83

Merged (24 commits) — Jun 25, 2021

Conversation

TimotheeMathieu
Contributor

Linked to issue #23 and PR #73

This PR implements the CLARA algorithm.
CLARA (Clustering for LARge Applications) extends the k-medoids approach to large numbers of objects. CLARA applies PAM iteratively on multiple subsamples (each new subsample is composed of the medoids from the previous iterations plus random points), and then keeps the best result (with respect to the inertia on the whole dataset).

This can be seen as a clever sub-sampling scheme.

The algorithm is linear in sample_size for both time and space complexity.

Example: even on the (relatively) small digits dataset, I gain a factor of 2 in computation time. However, you will see that the clustering is similar but definitely not the same; there is a complexity/accuracy trade-off going on here.
[image: comparison of the two clusterings]
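The subsampling scheme described above can be sketched in a few lines of NumPy. This is a toy illustration, not the PR's implementation: a greedy BUILD step stands in for full PAM, and the names (`clara`, `greedy_kmedoids`, `sample_size`) are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def inertia(X, medoids):
    # Sum over all points of the distance to their nearest medoid.
    d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=-1)
    return d.min(axis=1).sum()

def greedy_kmedoids(X, k):
    # Stand-in for PAM: the greedy BUILD step only (no SWAP phase).
    chosen = []
    for _ in range(k):
        best_j, best_cost = None, np.inf
        for j in range(len(X)):
            if j in chosen:
                continue
            cost = inertia(X, X[chosen + [j]])
            if cost < best_cost:
                best_j, best_cost = j, cost
        chosen.append(best_j)
    return X[chosen]

def clara(X, k, n_subsamples=5, sample_size=40):
    best_medoids, best_cost = None, np.inf
    medoids = np.empty((0, X.shape[1]))
    for _ in range(n_subsamples):
        # Each subsample = medoids of the previous iteration + fresh random points.
        idx = rng.choice(len(X), size=sample_size, replace=False)
        sub = np.vstack([medoids, X[idx]])
        medoids = greedy_kmedoids(sub, k)
        cost = inertia(X, medoids)  # score on the FULL dataset, not the subsample
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Demo on three well-separated 2-D blobs.
X = np.vstack([rng.normal(loc=c, size=(60, 2)) for c in (0.0, 5.0, 10.0)])
medoids, cost = clara(X, k=3, n_subsamples=3, sample_size=30)
```

Only the subsample is ever passed to the (expensive) medoid search, which is where the linear time/space complexity in the number of samples comes from.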

@TimotheeMathieu TimotheeMathieu changed the title Clara Add CLARA Clustering algorithm Nov 30, 2020
@rth
Contributor

rth commented Dec 18, 2020

Thanks @TimotheeMathieu ! Nice work.

We need to add this estimator to https://github.com/scikit-learn-contrib/scikit-learn-extra/blob/master/sklearn_extra/tests/test_common.py#L14 and also investigate why the CI fails and fix it before merging.

Do you know someone who might be interested in reviewing this PR? Sorry I don't have much availability for reviews at the moment.

For more minor changes (that don't change the API or add new algorithms or solvers, i.e. not this one), if you see there is no review after some number of days and you are confident it will not break any existing use cases, you should go ahead and merge. We have very few active maintainers in this project, and otherwise PRs will just sit there.

@TimotheeMathieu
Contributor Author

For the review of the PR, maybe @kno10 would be interested ?

I will look into the CI; this algorithm is not really meant to be efficient on small datasets, so some of the tests are not well suited to it.

Concerning minor changes, this seems to contradict the comment in #73 (comment). Is there an exception for minor changes?

@rth
Contributor

rth commented Dec 18, 2020

> Concerning minor changes, this seems to contradict the comment in #73 (comment). Is there an exception for minor changes?

About @adrinjalali's comment: yes, ideally we should review each PR before merging. However, all maintainers are busy or unavailable. Personally I don't plan to review every new PR in this repo, but since you are motivated to improve this project, for which I'm very grateful, it would be a shame to prevent you from doing that; otherwise this project will just die. So I feel a review by contributors and/or external experts would also be very valuable, and that for very simple fixes you should be able to merge your PRs directly if there are no reviews.

It's a bit of a compromise between scikit-learn, where each PR needs 2 reviews but there are 8+ active developers doing reviews, and other scikit-learn-contrib projects where, after initial project acceptance, maintainers can basically do whatever they want.

@adrinjalali
Member

I'd be more than happy to engage more contributors from the community on the review process here, but I'd be wary of merging w/o any review (except the trivial minor changes). Especially when it comes to new algorithms, I'd be happy if at least one expert other than the PR creator reviews the PR, that process always makes the code much more maintainable and in the long run it's going to benefit the project.


@rth
Contributor

rth commented Dec 20, 2020

Agreed, I meant more minor changes, new algorithms should certainly be reviewed #83 (comment)

@TimotheeMathieu
Contributor Author

Just to fix ideas: this PR is indeed not minor, and I think, for instance, that PR #85 is minor.
But is PR #78 considered minor?

@TimotheeMathieu
Contributor Author

@rth I noticed that the codecov CI check disappeared (which is why the CI became all green here). Is that expected?

@chkoar
Member

chkoar commented Apr 18, 2021

@TimotheeMathieu it is probably related to this: https://about.codecov.io/security-update/

@jfigui

jfigui commented Apr 23, 2021

I see that you want to implement CLARA as a separate class from KMedoids, right?

May I suggest implementing CLARA as one of the possible methods of the KMedoids algorithm? That way you could offer an 'auto' method where KMedoids itself decides which method best attacks the problem as a function of the sample size.

Matlab uses a similar approach: https://www.mathworks.com/help/stats/kmedoids.html?s_tid=srchtitle

In any case I am looking forward to using CLARA in the future. I have 200000+ samples to cluster and it is a bit problematic at the moment.

Nice work you are doing!

@jfigui

jfigui commented Jun 10, 2021

@TimotheeMathieu: Thanks for your reply. Yes, I can always use the corresponding git branch for testing, but I intend to integrate this algorithm into a radar software package: https://github.com/MeteoSwiss/pyrad. It is not convenient for end users to install dependencies from source, hence my interest in having the algorithm included in a release. I will be happy to help if that can speed things up. I don't know of other Python packages that include the k-medoids algorithm and are this compatible with scikit-learn.

@TimotheeMathieu
Contributor Author

A review & test of the CLARA code would be much appreciated and would help me in the process of merging CLARA into main. In particular, because CLARA is meant to be a faster version of KMedoids, we have to be careful that its code is as fast as possible.
Concerning the release, I am not familiar with the exact conditions for a new release; @rth, can you comment on this?

@rth
Contributor

rth commented Jun 10, 2021

> A review & test of the CLARA code would be much appreciated and would help me in the process of merging CLARA into main.

+1. Even just reporting that you tried on some practical example and provide your feedback on the results would already be helpful.

About the release, I don't think we have any specific rules. Usually we release when there are some significant features / bug fixes. It takes some effort to build wheels for PyPI, but it's manageable. Releasing within a month or two sounds reasonable, assuming we merge this and a few of the other PRs. I'll write instructions on how to make a release, and @TimotheeMathieu it would be great if you could do it when we decide to go forward, so you become familiar with the process :)

@rth
Contributor

rth commented Jun 10, 2021

Also +1 to keeping this a separate class.

@jfigui

jfigui commented Jun 11, 2021

What I can do is install from source and test the CLARA algorithm within my processing framework. My main concern is memory consumption rather than speed, but I can report on that as well.

@TimotheeMathieu
Contributor Author

TimotheeMathieu commented Jun 11, 2021

Also, I don't know why the original authors decided to implement k-medoids in scikit-learn-extra, but eventually it may be interesting to include k-medoids in scikit-learn itself, because k-medoids meets the conditions for inclusion.
scikit-learn-extra is primarily meant for algorithms that can't be included in scikit-learn for whatever reason.

@jfigui

jfigui commented Jun 16, 2021

@TimotheeMathieu , I just wanted to let you know that I cloned your CLARA branch (as you may have noticed) and I am currently testing it within my code. At the moment I can already say that I managed to install it properly :)

@jfigui

jfigui commented Jun 17, 2021

@TimotheeMathieu , @rth ,
Some feedback on my experience: the CLARA algorithm manages to cluster 211,593 samples in 1288.87 seconds.

I used these settings:

kmedoids = CLARA(
    n_clusters=9, metric='seuclidean', init='k-medoids++', max_iter=100,
    random_state=None, sampling_size=10000, samples=5,
).fit(fm_sample)

I ran my algorithm on a server at work. I don't know its exact specifications, but I can ask if needed.

For what it's worth, I fully support the inclusion of the KMedoids and CLARA algorithms in scikit-learn. For data with outliers I think it is a better solution than the k-means algorithm implemented there, and it can be very useful in many applications.

If you need further tests or further information, let me know. Any suggestions on how to run the algorithm are most welcome.

If you can speed up the new release somehow we would truly appreciate it.

@TimotheeMathieu
Contributor Author

TimotheeMathieu commented Jun 17, 2021

Thanks @jfigui
Some questions (feel free to answer only some of them if you don't have time for all):

  • What is the dimension of your problem? (i.e. the number of columns in X)
  • Can you also run KMedoids on this, or is it too computationally intensive?
  • Do you have any way to check the results of CLARA? For instance, if the answer to question 2 is yes, you can compare the results of CLARA and KMedoids; if that is not possible, would you be able to say from expert knowledge that the clustering makes sense?
  • Did the optimization finish? i.e., did you get the warning asking you to increase max_iter?
  • Is the result very different if you change the initialization, for instance with "heuristic"?
  • I see you changed samples and sampling_size; was it for a better result?

PS: A small suggestion: for more in-depth benchmarking you may want to use a proper benchmarking tool, since you may want to test datasets of increasing size and check the memory/time used on each. For instance with neurtu:

import numpy as np
import neurtu
from sklearn_extra.cluster import CLARA

X = np.random.normal(size=(100_000, 100))  # dataset definition

def cases():
    for sampling_size in [50, 100, 200, 500, 1000]:  # parameter values to test
        tags = {'sampling_size': sampling_size}
        yield neurtu.delayed(CLARA(n_clusters=9, sampling_size=sampling_size), tags=tags).fit(X)

bench = neurtu.Benchmark(wall_time=True, cpu_time=True, peak_memory=True)  # metrics to collect
bench(cases())  # run the benchmark

From which you get

               wall_time  cpu_time  peak_memory
sampling_size                                  
50              0.471997  2.243844     0.464844
100             0.394750  2.427610     0.000000
200             0.467486  2.759086     0.070312
500             0.714508  3.129938     0.140625
1000            1.234070  4.370417     0.269531

@jfigui
Copy link

jfigui commented Jun 17, 2021

@TimotheeMathieu ,

  • The dimensionality was 5.
  • I could not run KMedoids on more than 40000-60000 samples: I was running out of memory due to the samples × samples matrix needed to compute the distances. By the way, it would be really nice if the distances kept the original data type of the data (e.g. float32, float16) when computing KMedoids.
  • A proper check on CLARA would be difficult for me, since the clustering is a small part of what I am doing, but I can tell you that I get plausible results out of the clustering.
  • The optimization finished without any issues.
  • I have not checked the results with other initializations.
  • I did not really change samples (the default is 5 anyway); my reference there was the default of the Matlab implementation. I did change sampling_size: I thought the default subsample size was very small for my problem. The number of samples per class is not homogeneous (I have more samples for some classes than others, not ideal I know), and I thought there was a real risk of simply not having enough samples for a particular class.
  • I appreciate your suggestion for proper benchmarking, but I cannot dedicate too much time to this particular issue. Sorry about that.

If you want to know more about what I am doing with the CLARA algorithm, have a look at this paper: https://amt.copernicus.org/articles/9/4425/2016/
I am basically coding the open-source version of what is described there. You can find the specific code here:
https://github.com/MeteoSwiss/pyart/blob/master/pyart/retrieve/echo_class.py (start with the function compute_centroids).
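The memory wall at 40000-60000 samples follows directly from the dense samples × samples distance matrix; a quick back-of-envelope check (with n = 60,000, the upper end reported above):

```python
n = 60_000                 # samples beyond which KMedoids ran out of memory
bytes_f64 = n * n * 8      # dense n x n distance matrix, 8 bytes per float64
gib = bytes_f64 / 2**30
print(f"{gib:.1f} GiB")    # ≈ 26.8 GiB
```

Nearly 27 GiB for the matrix alone exceeds typical workstation RAM, whereas CLARA only ever materialises distances on a subsample.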

@TimotheeMathieu
Contributor Author

> The dimensionality was 5.
> I could not run KMedoids on more than 40000-60000 samples: I was running out of memory due to the samples × samples matrix needed to compute the distances. By the way, it would be really nice if the distances kept the original data type of the data (e.g. float32, float16) when computing KMedoids.

Ok, I will try; indeed it would be a good idea to speed things up if the data can stay float16, for instance.

> A proper check on CLARA would be difficult for me, since the clustering is a small part of what I am doing, but I can tell you that I get plausible results out of the clustering.

Good, plausible results are already a validation for us. Thanks.

> The optimization finished without any issues.
> I have not checked the results with other initializations.
> I did not really change samples (the default is 5 anyway); my reference there was the default of the Matlab implementation. I did change sampling_size: I thought the default subsample size was very small for my problem. The number of samples per class is not homogeneous, and I thought there was a real risk of simply not having enough samples for a particular class.
> I appreciate your suggestion for proper benchmarking, but I cannot dedicate too much time to this particular issue. Sorry about that.

Ok no worries.

@TimotheeMathieu
Contributor Author

TimotheeMathieu commented Jun 18, 2021

Ok, now we can use float32, and from experiments we gain a factor of 2 in computation time when using float32!
Using float16 is a bit difficult because Cython does not handle float16, so in KMedoids float16 works only for methods other than "pam" and inits other than "build". CLARA relies on "pam" (otherwise it would not be CLARA), so CLARA can only handle float32 and float64, not float16.
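A pure-NumPy illustration of why keeping float32 matters (this is not the PR's Cython code, just the memory argument): squared-distance computations done in float32 produce a distance matrix with half the footprint of the float64 equivalent.

```python
import numpy as np

# 1000 samples in 5 dimensions, stored as float32.
X = np.random.default_rng(0).normal(size=(1000, 5)).astype(np.float32)

# Squared Euclidean distances computed entirely in float32: the n x n
# matrix takes 4 bytes per entry instead of 8.
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
```

Because NumPy preserves float32 through these operations, no intermediate is silently promoted to float64, which is also where part of the factor-2 speed-up comes from.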

@jfigui

jfigui commented Jun 22, 2021

I have not timed it exactly, but I can confirm that it is much faster and also decreases memory consumption a lot. Do you think anything else should be done before this pull request gets accepted?

@rth
Contributor

rth commented Jun 22, 2021

Thanks for testing it and providing feedback @jfigui !

Would you be by any chance interested in reviewing the code in this PR as well ? :)

Edit: see https://www.youtube.com/watch?v=dyxS9KKCNzA for some ideas on how to review PRs (though the video is quite long).


@rth rth left a comment


Thanks @TimotheeMathieu !

Would it be possible to move changes related to adding support for 32bit in KMedoids to a separate PR, merged first?

A few comments below, otherwise LGTM. Please add a changelog entry.

[Review comments on doc/modules/cluster.rst and sklearn_extra/cluster/_k_medoids.py — resolved]
@rth
Contributor

rth commented Jun 24, 2021

Could you please merge master into this PR, to sync it with your other merged PR?


@rth rth left a comment


Thanks, please add a changelog entry.

[Review comments on sklearn_extra/cluster/_k_medoids.py — resolved]
@TimotheeMathieu
Contributor Author

Ok, thank you for the review @rth, merging.

@TimotheeMathieu TimotheeMathieu merged commit 5c47ba2 into scikit-learn-contrib:main Jun 25, 2021
@TimotheeMathieu TimotheeMathieu deleted the clara branch June 25, 2021 07:30
@jfigui

jfigui commented Jun 28, 2021

Hi @TimotheeMathieu and @rth ,

I was on holiday, so I did not have the chance to review the pull request, but I see you managed perfectly well without me :).

I will install the main branch of scikit-learn-extra and make sure it can be used within the context of my project. I will report back if I see anything odd. I am looking forward to a new release of scikit-learn-extra that our users can easily install.

@ajuric

ajuric commented Aug 30, 2022

Has this not been released, or am I just not seeing it?

I see that this was merged in June 2021, but the latest release is from April 2021.

I would like to try the CLARA algorithm, since I have a lot of samples (>100000) to cluster.

@TimotheeMathieu
Contributor Author

It is just not released yet. You can install the dev version, which contains CLARA, by installing from the git source (for instance with pip install git+https://github.com/scikit-learn-contrib/scikit-learn-extra).
