
Reduce memory requirements in KMedoids #23

Open · 3 tasks · rth opened this issue Jul 29, 2019 · 6 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@rth (Contributor) commented Jul 29, 2019

KMedoids currently pre-computes a full distance matrix with pairwise_distances, resulting in large memory usage and making it unsuitable for datasets with more than 20-50k samples.
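For a rough sense of scale, a back-of-the-envelope calculation (assuming the dense float64 matrix that pairwise_distances returns by default):

n = 50_000
# dense n x n float64 distance matrix, 8 bytes per entry
print(n * n * 8 / 2**30)  # ~18.6 GiB at 50k samples, and ~4x that at 100k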

To improve the situation somewhat, the following approaches could be possible:

  • use pairwise_distances_chunked
  • make sure that for float32 input the distance matrix is also 32-bit
  • investigate re-computing distances in each iteration (Implementing KMedoids in scikit-learn-extra #12 (comment)); a rough sketch of this idea follows the list. This will reduce the memory requirements at the cost of additional compute time. I'm not sure it would be worth it.
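To make the third bullet concrete, here is a minimal sketch of recomputing distances only against the current medoids in each iteration, so the working set is n × k instead of n × n. The function name, the variable names (X, medoid_idx) and the naive update rule are illustrative only, not the actual KMedoids implementation:

import numpy as np
from sklearn.metrics import pairwise_distances

def assign_and_update(X, medoid_idx, metric="euclidean"):
    # Distances from all points to the k current medoids: shape (n, k),
    # never the full (n, n) matrix.
    d_to_medoids = pairwise_distances(X, X[medoid_idx], metric=metric)
    labels = d_to_medoids.argmin(axis=1)

    # Naive medoid update: within each cluster, pick the point minimizing
    # the summed distance to the other members of that cluster.
    new_medoid_idx = np.array(medoid_idx, copy=True)
    for c in range(len(medoid_idx)):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        intra = pairwise_distances(X[members], metric=metric).sum(axis=0)
        new_medoid_idx[c] = members[intra.argmin()]
    return labels, new_medoid_idx

The update step still builds a per-cluster distance matrix, so the per-iteration cost grows with the largest cluster, which is the compute-vs-memory trade-off mentioned in the bullet.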
@rth rth changed the title Reduce memory use in KMedoids Reduce memory requirements in KMedoids Jul 29, 2019
@rth rth added the enhancement New feature or request label Jul 29, 2019
@rth rth added the help wanted Extra attention is needed label Feb 25, 2020
@tooHotSpot

Too good an idea not to be implemented. It would be especially worthwhile for binary data clustering with Manhattan distance, for example.

@lefnire commented Aug 4, 2020

+1, I've hit this snag and am having trouble thinking of a workaround (sampling from the data is a bit tough in my situation). I was hoping there was a way to continue training via fit or something similar.

investigate re-computing distances in each iteration (#12 (comment)). This will reduce the memory requirements at the cost of additional compute time. I'm not sure it would be worth it.

I vote worth it. Many clustering algorithms are super heavy on training anyway when there are many rows; it's usually inference we care about. At the least, if it works for now as a holdover, more people could use the package.

@lefnire commented Aug 4, 2020

I think pairwise_distances_chunked might still hit the memory wall: you'd need to collect the intermediate chunks over time, or get fancy with np.empty and setting slices. I tried both methods on my 75k rows and still get an error.

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

# attempt 1: stack all chunks at once
pdc = pairwise_distances_chunked(x, metric='cosine', working_memory=64)
dists = np.vstack(list(pdc))  # vstack needs a sequence, so materialize the generator

# attempt 2: preallocate and fill slice by slice
# (chunks can differ in size, so track a running offset instead of i*sz)
pdc = pairwise_distances_chunked(x, metric='cosine', working_memory=64)
dists = np.empty((x.shape[0], x.shape[0]))
start = 0
for chunk in pdc:
    stop = start + chunk.shape[0]
    dists[start:stop] = chunk
    start = stop

Either way the full matrix has to be allocated:

dists = np.empty((x.shape[0], x.shape[0]))
MemoryError: Unable to allocate 43.0 GiB for an array with shape (75970, 75970) and data type float64

I haven't sat down with the KMedoids code yet (hopefully I'll find time); maybe it doesn't need the full n*n matrix after the pairwise distances and the work can be reduced within each chunk iteration.
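One way to avoid ever holding the n x n matrix while staying with pairwise_distances_chunked would be to pass a reduce_func that keeps only what each chunk is needed for, e.g. the nearest current medoid per row. A hedged sketch under that assumption; medoid_idx and the cosine metric are placeholders, not anything KMedoids does today:

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def nearest_medoid_labels(X, medoid_idx, metric="cosine", working_memory=64):
    # Each yielded value is the per-row argmin over the medoid columns of one
    # chunk, so peak memory is one chunk rather than the full (n, n) matrix.
    def reduce_func(D_chunk, start):
        return D_chunk[:, medoid_idx].argmin(axis=1)

    gen = pairwise_distances_chunked(
        X, metric=metric, working_memory=working_memory, reduce_func=reduce_func
    )
    return np.concatenate(list(gen))

Each chunk still computes distances to every column and then throws most of them away, so passing Y=X[medoid_idx] directly would skip that wasted work; the point here is only that the full matrix never needs to be stored.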

@jnothman (Member) commented Aug 4, 2020 via email

@TimotheeMathieu (Contributor)

We could also implement CLARA, which aggregates PAM runs on sub-samples (see Wikipedia). It is an approximate algorithm, so there is a risk that the result would not be as good, but at least it would be tractable; the algorithm is a trade-off between efficiency and complexity, and there is a parameter to tune this trade-off. It could easily be implemented as yet another method for KMedoids. Ref: Finding Groups in Data (eds L. Kaufman and P.J. Rousseeuw), doi:10.1002/9780470316801.ch2. This algorithm is already implemented in R, Matlab, ELKI...
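A rough sketch of the CLARA idea on top of the existing estimator, assuming the current sklearn_extra.cluster.KMedoids API (n_clusters, metric, cluster_centers_); the sample size, number of repetitions and scoring below are illustrative choices, not the reference parameters from Kaufman & Rousseeuw:

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn_extra.cluster import KMedoids

def clara(X, n_clusters, sample_size=1000, n_repeats=5, metric="euclidean", seed=0):
    rng = np.random.RandomState(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_repeats):
        # Run full KMedoids on a small random sub-sample only.
        idx = rng.choice(X.shape[0], size=min(sample_size, X.shape[0]), replace=False)
        km = KMedoids(n_clusters=n_clusters, metric=metric).fit(X[idx])
        medoids = km.cluster_centers_
        # Score the candidate medoids on the full dataset: only n x k distances.
        cost = pairwise_distances(X, medoids, metric=metric).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost

Increasing sample_size and n_repeats trades compute for solution quality, which is the efficiency/complexity knob mentioned above.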

Another possibility is this article, which from what I understood should not decrease the quality of KMedoids but may save time/space; this would be more experimental since the method is much more recent, and may be for later.

@kno10 (Contributor) commented Feb 2, 2021

K-Medoids will need all distances several times, so unless you have a very cheap distance that you can afford to compute repeatedly, you will likely be trading memory for even heavier CPU usage.
There are some variants (e.g., CLARA, BanditPAM) that aim at this use case, where you do not want to compute all pairwise distances.
