
Add CLARA Clustering algorithm #83

Merged (24 commits) — Jun 25, 2021

Conversation

TimotheeMathieu
Contributor

Linked to issue #23 and PR #73

This PR implements the CLARA algorithm.
CLARA (Clustering for LARge Applications) extends the k-medoids approach to large numbers of objects. CLARA applies PAM iteratively on multiple subsamples (each new subsample is composed of the medoids from the previous iterations plus random points), and then keeps the best result (with respect to the inertia on the whole dataset).

This can be seen as a clever sub-sampling scheme.

The algorithm is linear in sample_size for both time and space complexity.

Example: even on the (relatively) small digits dataset, I gain a factor of 2 in computation time. However, you will see that the clustering is similar but definitely not the same; there is a complexity/accuracy trade-off going on here.
[image: comparison of the two clusterings]
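The subsampling scheme described above can be sketched in a few lines of NumPy. This is a toy illustration, not the PR's implementation: a greedy BUILD step stands in for full PAM, and the names (`clara`, `greedy_kmedoids`, `sample_size`) are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def inertia(X, medoids):
    # Sum over all points of the distance to their nearest medoid.
    d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=-1)
    return d.min(axis=1).sum()

def greedy_kmedoids(X, k):
    # Stand-in for PAM: the greedy BUILD step only (no SWAP phase).
    chosen = []
    for _ in range(k):
        best_j, best_cost = None, np.inf
        for j in range(len(X)):
            if j in chosen:
                continue
            cost = inertia(X, X[chosen + [j]])
            if cost < best_cost:
                best_j, best_cost = j, cost
        chosen.append(best_j)
    return X[chosen]

def clara(X, k, n_subsamples=5, sample_size=40):
    best_medoids, best_cost = None, np.inf
    medoids = np.empty((0, X.shape[1]))
    for _ in range(n_subsamples):
        # Each subsample = medoids of the previous iteration + fresh random points.
        idx = rng.choice(len(X), size=sample_size, replace=False)
        sub = np.vstack([medoids, X[idx]])
        medoids = greedy_kmedoids(sub, k)
        cost = inertia(X, medoids)  # score on the FULL dataset, not the subsample
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Demo on three well-separated 2-D blobs.
X = np.vstack([rng.normal(loc=c, size=(60, 2)) for c in (0.0, 5.0, 10.0)])
medoids, cost = clara(X, k=3, n_subsamples=3, sample_size=30)
```

Only the subsample is ever passed to the (expensive) medoid search, which is where the linear time/space complexity in the number of samples comes from.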

@TimotheeMathieu TimotheeMathieu changed the title Clara Add CLARA Clustering algorithm Nov 30, 2020
@rth
Contributor

rth commented Dec 18, 2020

Thanks @TimotheeMathieu ! Nice work.

We need to add this estimator to https://github.com/scikit-learn-contrib/scikit-learn-extra/blob/master/sklearn_extra/tests/test_common.py#L14 and also investigate why the CI fails and fix it before merging.

Do you know someone who might be interested in reviewing this PR? Sorry I don't have much availability for reviews at the moment.

For more minor changes (that don't change the API or add new algorithms or solvers, i.e. not this one), if you see there is no review after some number of days and you are confident it will not break any existing use cases, you should go ahead and merge. We have very few active maintainers in this project, and otherwise PRs will just sit there.

@TimotheeMathieu
Contributor Author

For the review of the PR, maybe @kno10 would be interested ?

I will look into the CI; this algorithm is not really meant to be efficient on small datasets, so some of the tests are not well suited to it.

Concerning minor changes, this seems to contradict the comment in #73 (comment). Is there an exception for minor changes?

@rth
Contributor

rth commented Dec 18, 2020

> Concerning minor changes, this seems to contradict the comment in #73 (comment). Is there an exception for minor changes?

About @adrinjalali's comment: yes, ideally we should review each PR before merging. However, all maintainers are busy or unavailable. Personally I don't plan to review every new PR in this repo, but since you are motivated to improve this project, for which I'm very grateful, it would be a shame to prevent you from doing that; otherwise this project will just die. So I feel a review by contributors and/or external experts would also be very valuable, and that for very simple fixes you should be able to merge your PRs directly if there are no reviews.

It's a bit of a compromise between scikit-learn, where each PR needs 2 reviews but there are 8+ active developers doing reviews, and other scikit-learn-contrib projects where, after initial project acceptance, maintainers can basically do whatever they want.

@adrinjalali
Member

I'd be more than happy to engage more contributors from the community on the review process here, but I'd be wary of merging w/o any review (except the trivial minor changes). Especially when it comes to new algorithms, I'd be happy if at least one expert other than the PR creator reviews the PR, that process always makes the code much more maintainable and in the long run it's going to benefit the project.


@rth
Contributor

rth commented Dec 20, 2020

Agreed, I meant more minor changes, new algorithms should certainly be reviewed #83 (comment)

@TimotheeMathieu
Contributor Author

Just to fix ideas: this PR is indeed not minor, and I think, for instance, that PR #85 is minor.
But is PR #78 considered minor?

@TimotheeMathieu
Contributor Author

@rth I noticed that the codecov CI check disappeared (which is why the CI became all green here). Is that expected?

@chkoar
Member

chkoar commented Apr 18, 2021

@TimotheeMathieu it is probably related to this: https://about.codecov.io/security-update/

@jfigui

jfigui commented Apr 23, 2021

I see that you want to implement CLARA as a separate class from KMedoids, right?

May I suggest implementing CLARA as one of the possible methods of the KMedoids algorithm? That way you could offer an 'auto' method where KMedoids itself decides which method best attacks the problem as a function of the sample size.

Matlab uses a similar approach: https://www.mathworks.com/help/stats/kmedoids.html?s_tid=srchtitle

In any case I am looking forward to using CLARA in the future. I have 200000+ samples to cluster and it is a bit problematic at the moment.

Nice work you are doing!

@jfigui

jfigui commented Jun 10, 2021

@TimotheeMathieu: Thanks for your reply. Yes, I can always use the corresponding git branch for testing, but I intend to integrate this algorithm into a radar software package: https://github.com/MeteoSwiss/pyrad. It is not convenient for end users to install dependencies from source, hence my interest in having the algorithm included in a release. I will be happy to help if that can speed things up. I don't know of other Python packages that include the k-medoids algorithm and are this compatible with scikit-learn.

@TimotheeMathieu
Contributor Author

A review & test of the CLARA code would be much appreciated and would help me in the process of merging CLARA into main. In particular, because CLARA is meant to be a faster version of KMedoids, we have to be careful that its code is as fast as possible.
Concerning the release, I am not familiar with the exact conditions for a new release; @rth, can you comment on this?

@rth
Contributor

rth commented Jun 10, 2021

> A review & test of the CLARA code would be much appreciated and would help me in the process of merging CLARA into main.

+1. Even just reporting that you tried on some practical example and provide your feedback on the results would already be helpful.

About the release, I don't think we have any specific rules. Usually we release when there are some significant features / bug fixes. It takes some effort to build wheels for PyPI, but it's manageable. Releasing within a month or two sounds reasonable, assuming we merge this and a few of the other PRs. I'll write instructions on how to make a release, and @TimotheeMathieu it would be great if you could do it when we decide to go forward, so you become familiar with the process :)

@rth
Contributor

rth commented Jun 10, 2021

Also +1 to keeping this a separate class.

@jfigui

jfigui commented Jun 11, 2021

What I can do is install from source and test the CLARA algorithm within my processing framework. My main concern is memory consumption rather than speed, but I can report on that as well.

@TimotheeMathieu
Contributor Author

TimotheeMathieu commented Jun 11, 2021

Also, I don't know why the original authors decided to implement k-medoids in scikit-learn-extra, but eventually it may be interesting to include k-medoids in scikit-learn itself, because k-medoids meets the conditions for inclusion.
scikit-learn-extra is primarily meant for algorithms that can't be included in scikit-learn for whatever reason.

@jfigui

jfigui commented Jun 16, 2021

@TimotheeMathieu , I just wanted to let you know that I cloned your CLARA branch (as you may have noticed) and I am currently testing it within my code. At the moment I can already say that I managed to install it properly :)

@jfigui

jfigui commented Jun 17, 2021

@TimotheeMathieu , @rth ,
Some feedback on my experience: the CLARA algorithm manages to cluster 211,593 samples in 1288.87 seconds.

I used these settings:

kmedoids = CLARA(
    n_clusters=9, metric='seuclidean', init='k-medoids++', max_iter=100,
    random_state=None, sampling_size=10000, samples=5,
).fit(fm_sample)

I ran my algorithm on a server at work. I don't know its exact specifications, but I can ask if needed.

For what it's worth, I fully support the inclusion of the KMedoids and CLARA algorithms in scikit-learn. For data with outliers I think it is a better solution than the k-means algorithm implemented there, and it can be very useful in many applications.

If you need further tests or further information, let me know. Any suggestions on how to run the algorithm are most welcome.

If you can speed up the new release somehow we would truly appreciate it.

@TimotheeMathieu
Contributor Author

TimotheeMathieu commented Jun 17, 2021

Thanks @jfigui
Some questions (feel free to answer only some of them if you don't have time for all):

  • What is the dimension of your problem? (i.e. the number of columns in X)
  • Can you also run KMedoids on this, or is it too computationally intensive?
  • Do you have any way to check the results of CLARA? For instance, if the answer to question 2 is yes, you can compare the results of CLARA and KMedoids; if that is not possible, would you be able to say from expert knowledge that the clustering makes sense?
  • Did the optimization finish? i.e., did you get the warning asking you to increase max_iter?
  • Is the result very different if you change the initialization, for instance with "heuristic"?
  • I see you changed samples and sampling_size; was it for a better result?

PS: A small suggestion: for more in-depth benchmarking you may want to use a proper benchmarking tool, since you may want to test datasets of increasing size and check the memory/time used on each. For instance with neurtu:

import numpy as np
import neurtu
from sklearn_extra.cluster import CLARA

X = np.random.normal(size=(100_000, 100))  # dataset definition

def cases():
    for sampling_size in [50, 100, 200, 500, 1000]:  # parameter values to test
        tags = {'sampling_size': sampling_size}
        yield neurtu.delayed(CLARA(n_clusters=9, sampling_size=sampling_size), tags=tags).fit(X)

bench = neurtu.Benchmark(wall_time=True, cpu_time=True, peak_memory=True)  # metrics to collect
bench(cases())  # run the benchmark

From which you get

               wall_time  cpu_time  peak_memory
sampling_size                                  
50              0.471997  2.243844     0.464844
100             0.394750  2.427610     0.000000
200             0.467486  2.759086     0.070312
500             0.714508  3.129938     0.140625
1000            1.234070  4.370417     0.269531

@jfigui
Copy link

jfigui commented Jun 17, 2021

@TimotheeMathieu ,

  • The dimensionality was 5.
  • I could not run KMedoids on more than 40000-60000 samples: I was running out of memory due to the samples × samples matrix needed to compute the distances. By the way, it would be really nice if the distances kept the original data type of the data (e.g. float32, float16) when computing KMedoids.
  • A proper check on CLARA would be difficult for me, since the clustering is a small part of what I am doing, but I can tell you that I get plausible results out of the clustering.
  • The optimization finished without any issues.
  • I have not checked the results with other initializations.
  • I did not really change samples (the default is 5 anyway); my reference there was the default of the Matlab implementation. I did change sampling_size: I thought the default subsample size was very small for my problem. The number of samples per class is not homogeneous (I have more samples for some classes than others, not ideal I know), and I thought there was a real risk of simply not having enough samples for a particular class.
  • I appreciate your suggestion for proper benchmarking, but I cannot dedicate too much time to this particular issue. Sorry about that.

If you want to know more about what I am doing with the CLARA algorithm, have a look at this paper: https://amt.copernicus.org/articles/9/4425/2016/
I am basically coding the open-source version of what is described there. You can find the specific code here:
https://github.com/MeteoSwiss/pyart/blob/master/pyart/retrieve/echo_class.py (start with the function compute_centroids).
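The memory wall at 40000-60000 samples follows directly from the dense samples × samples distance matrix; a quick back-of-envelope check (with n = 60,000, the upper end reported above):

```python
n = 60_000                 # samples beyond which KMedoids ran out of memory
bytes_f64 = n * n * 8      # dense n x n distance matrix, 8 bytes per float64
gib = bytes_f64 / 2**30
print(f"{gib:.1f} GiB")    # ≈ 26.8 GiB
```

Nearly 27 GiB for the matrix alone exceeds typical workstation RAM, whereas CLARA only ever materialises distances on a subsample.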

@TimotheeMathieu
Contributor Author

> The dimensionality was 5.
> I could not run KMedoids on more than 40000-60000 samples: I was running out of memory due to the samples × samples matrix needed to compute the distances. By the way, it would be really nice if the distances kept the original data type of the data (e.g. float32, float16) when computing KMedoids.

Ok, I will try; indeed it would be a good idea to speed things up if the data can stay float16, for instance.

> A proper check on CLARA would be difficult for me, since the clustering is a small part of what I am doing, but I can tell you that I get plausible results out of the clustering.

Good, plausible results are already a validation for us. Thanks.

> The optimization finished without any issues.
> I have not checked the results with other initializations.
> I did not really change samples (the default is 5 anyway); my reference there was the default of the Matlab implementation. I did change sampling_size: I thought the default subsample size was very small for my problem. The number of samples per class is not homogeneous, and I thought there was a real risk of simply not having enough samples for a particular class.
> I appreciate your suggestion for proper benchmarking, but I cannot dedicate too much time to this particular issue. Sorry about that.

Ok no worries.

@TimotheeMathieu
Contributor Author

TimotheeMathieu commented Jun 18, 2021

Ok, now we can use float32, and from experiments we gain a factor of 2 in computation time when using float32!
Using float16 is a bit difficult because Cython does not handle float16, so in KMedoids float16 works only for methods other than "pam" and inits other than "build". CLARA relies on "pam" (otherwise it would not be CLARA), so CLARA can only handle float32 and float64, not float16.
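A pure-NumPy illustration of why keeping float32 matters (this is not the PR's Cython code, just the memory argument): squared-distance computations done in float32 produce a distance matrix with half the footprint of the float64 equivalent.

```python
import numpy as np

# 1000 samples in 5 dimensions, stored as float32.
X = np.random.default_rng(0).normal(size=(1000, 5)).astype(np.float32)

# Squared Euclidean distances computed entirely in float32: the n x n
# matrix takes 4 bytes per entry instead of 8.
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
```

Because NumPy preserves float32 through these operations, no intermediate is silently promoted to float64, which is also where part of the factor-2 speed-up comes from.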

@jfigui

jfigui commented Jun 22, 2021

I have not timed it exactly, but I can confirm that it is much faster and also decreases memory consumption a lot. Do you think anything else should be done before this pull request gets accepted?

@rth
Contributor

rth commented Jun 22, 2021

Thanks for testing it and providing feedback @jfigui !

Would you be by any chance interested in reviewing the code in this PR as well ? :)

Edit: see https://www.youtube.com/watch?v=dyxS9KKCNzA for some ideas on how to review PRs (though the video is quite long).


@rth rth left a comment


Thanks @TimotheeMathieu !

Would it be possible to move changes related to adding support for 32bit in KMedoids to a separate PR, merged first?

A few comments below, otherwise LGTM. Please add a changelog entry.

[Review comments on doc/modules/cluster.rst and sklearn_extra/cluster/_k_medoids.py — resolved]
@rth
Contributor

rth commented Jun 24, 2021

Could you please merge master into this PR, to sync it with your other merged PR?


@rth rth left a comment


Thanks, please add a changelog entry.

[Review comments on sklearn_extra/cluster/_k_medoids.py — resolved]
@TimotheeMathieu
Contributor Author

Ok, thank you for the review @rth, merging.

@TimotheeMathieu TimotheeMathieu merged commit 5c47ba2 into scikit-learn-contrib:main Jun 25, 2021
@TimotheeMathieu TimotheeMathieu deleted the clara branch June 25, 2021 07:30
@jfigui

jfigui commented Jun 28, 2021

Hi @TimotheeMathieu and @rth ,

I was on holiday, so I did not have the chance to review the pull request, but I see you managed perfectly well without me :).

I will install the main branch of scikit-learn-extra and make sure it can be used within the context of my project. I will report back if I see anything odd. I am looking forward to a new release of scikit-learn-extra that our users can easily install.

@ajuric

ajuric commented Aug 30, 2022

Has this not been released, or am I just not seeing it?

I see that this was merged in June 2021, but the latest release is from April 2021.

I would like to try the CLARA algorithm, since I have a lot of samples (>100000) to cluster.

@TimotheeMathieu
Contributor Author

It is just not released yet. You can install the dev version, which contains CLARA, by installing from the git source (for instance with pip install git+https://github.com/scikit-learn-contrib/scikit-learn-extra).
