
Reduce memory requirements in KMedoids #23

Open · 3 tasks · rth opened this issue Jul 29, 2019 · 6 comments

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@rth (Contributor) commented Jul 29, 2019

KMedoids currently pre-computes a full distance matrix with pairwise_distances, resulting in large memory usage and making it unsuitable for datasets with more than 20-50k samples.
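For a rough sense of scale, a back-of-the-envelope calculation (assuming the dense float64 matrix that pairwise_distances returns by default):

n = 50_000
# dense n x n float64 distance matrix, 8 bytes per entry
print(n * n * 8 / 2**30)  # ~18.6 GiB at 50k samples, and ~4x that at 100k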

To improve the situation somewhat, the following approaches could be possible:

  • use pairwise_distances_chunked
  • make sure that for float32 input the distance matrix is also 32-bit
  • investigate re-computing distances in each iteration (Implementing KMedoids in scikit-learn-extra #12 (comment)); a rough sketch of this idea follows the list. This will reduce the memory requirements at the cost of additional compute time. I'm not sure it would be worth it.
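To make the third bullet concrete, here is a minimal sketch of recomputing distances only against the current medoids in each iteration, so the working set is n × k instead of n × n. The function name, the variable names (X, medoid_idx) and the naive update rule are illustrative only, not the actual KMedoids implementation:

import numpy as np
from sklearn.metrics import pairwise_distances

def assign_and_update(X, medoid_idx, metric="euclidean"):
    # Distances from all points to the k current medoids: shape (n, k),
    # never the full (n, n) matrix.
    d_to_medoids = pairwise_distances(X, X[medoid_idx], metric=metric)
    labels = d_to_medoids.argmin(axis=1)

    # Naive medoid update: within each cluster, pick the point minimizing
    # the summed distance to the other members of that cluster.
    new_medoid_idx = np.array(medoid_idx, copy=True)
    for c in range(len(medoid_idx)):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        intra = pairwise_distances(X[members], metric=metric).sum(axis=0)
        new_medoid_idx[c] = members[intra.argmin()]
    return labels, new_medoid_idx

The update step still builds a per-cluster distance matrix, so the per-iteration cost grows with the largest cluster, which is the compute-vs-memory trade-off mentioned in the bullet.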
@rth rth changed the title Reduce memory use in KMedoids Reduce memory requirements in KMedoids Jul 29, 2019
@rth rth added the enhancement New feature or request label Jul 29, 2019
@rth rth added the help wanted Extra attention is needed label Feb 25, 2020
@tooHotSpot

Too good an idea not to be implemented. It would be especially worthwhile for binary data clustering with Manhattan distance, for example.

@lefnire commented Aug 4, 2020

+1, I've hit this snag and am having trouble thinking of a workaround (sampling from the data is a bit tough in my situation). I was hoping there was a way to continue training via fit or something similar.

investigate re-computing distances in each iteration (#12 (comment)). This will reduce the memory requirements at the cost of additional compute time. I'm not sure it would be worth it.

I vote worth it. Many clustering algorithms are super heavy on training anyway when there are many rows; it's usually inference we care about. At the least, if it works for now as a holdover, more people could use the package.

@lefnire commented Aug 4, 2020

I think pairwise_distances_chunked might still hit the memory wall: you'd need to collect the intermediate chunks over time, or get fancy with np.empty and setting slices. I tried both methods on my 75k rows and still get an error.

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

# attempt 1: stack all chunks at once
pdc = pairwise_distances_chunked(x, metric='cosine', working_memory=64)
dists = np.vstack(list(pdc))  # vstack needs a sequence, so materialize the generator

# attempt 2: preallocate and fill slice by slice
# (chunks can differ in size, so track a running offset instead of i*sz)
pdc = pairwise_distances_chunked(x, metric='cosine', working_memory=64)
dists = np.empty((x.shape[0], x.shape[0]))
start = 0
for chunk in pdc:
    stop = start + chunk.shape[0]
    dists[start:stop] = chunk
    start = stop

Either way the full matrix has to be allocated:

dists = np.empty((x.shape[0], x.shape[0]))
MemoryError: Unable to allocate 43.0 GiB for an array with shape (75970, 75970) and data type float64

I haven't sat down with the KMedoids code yet (hopefully I'll find time); maybe it doesn't need the full n*n matrix after the pairwise distances and the work can be reduced within each chunk iteration.
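One way to avoid ever holding the n x n matrix while staying with pairwise_distances_chunked would be to pass a reduce_func that keeps only what each chunk is needed for, e.g. the nearest current medoid per row. A hedged sketch under that assumption; medoid_idx and the cosine metric are placeholders, not anything KMedoids does today:

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def nearest_medoid_labels(X, medoid_idx, metric="cosine", working_memory=64):
    # Each yielded value is the per-row argmin over the medoid columns of one
    # chunk, so peak memory is one chunk rather than the full (n, n) matrix.
    def reduce_func(D_chunk, start):
        return D_chunk[:, medoid_idx].argmin(axis=1)

    gen = pairwise_distances_chunked(
        X, metric=metric, working_memory=working_memory, reduce_func=reduce_func
    )
    return np.concatenate(list(gen))

Each chunk still computes distances to every column and then throws most of them away, so passing Y=X[medoid_idx] directly would skip that wasted work; the point here is only that the full matrix never needs to be stored.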

@jnothman (Member) commented Aug 4, 2020 via email

@TimotheeMathieu (Contributor)

We could also implement CLARA, which aggregates PAM runs on sub-samples (see Wikipedia). It is an approximate algorithm, so there is a risk that the result would not be as good, but at least it would be tractable; the algorithm is a trade-off between efficiency and complexity, and there is a parameter to tune this trade-off. It could easily be implemented as yet another method for KMedoids. Ref: Finding Groups in Data (eds L. Kaufman and P.J. Rousseeuw), doi:10.1002/9780470316801.ch2. This algorithm is already implemented in R, Matlab, ELKI...
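A rough sketch of the CLARA idea on top of the existing estimator, assuming the current sklearn_extra.cluster.KMedoids API (n_clusters, metric, cluster_centers_); the sample size, number of repetitions and scoring below are illustrative choices, not the reference parameters from Kaufman & Rousseeuw:

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn_extra.cluster import KMedoids

def clara(X, n_clusters, sample_size=1000, n_repeats=5, metric="euclidean", seed=0):
    rng = np.random.RandomState(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_repeats):
        # Run full KMedoids on a small random sub-sample only.
        idx = rng.choice(X.shape[0], size=min(sample_size, X.shape[0]), replace=False)
        km = KMedoids(n_clusters=n_clusters, metric=metric).fit(X[idx])
        medoids = km.cluster_centers_
        # Score the candidate medoids on the full dataset: only n x k distances.
        cost = pairwise_distances(X, medoids, metric=metric).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost

Increasing sample_size and n_repeats trades compute for solution quality, which is the efficiency/complexity knob mentioned above.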

Another possibility is this article, which from what I understood should not decrease the quality of KMedoids but may save time/space; this would be more experimental since the method is much more recent, and may be for later.

@kno10 (Contributor) commented Feb 2, 2021

K-Medoids will need all distances several times, so unless you have a very cheap distance that you can afford to compute repeatedly, you will likely be trading memory for even heavier CPU usage.
There are some variants (e.g., CLARA, BanditPAM) that aim at this use case, where you do not want to compute all pairwise distances.
