HDBScan performance issue with large dataset #645

divya-agrawal3103 opened this issue Jul 12, 2024 · 3 comments

@divya-agrawal3103

Hi Team,

We are currently running the HDBSCAN algorithm on a large and diverse dataset, executing the Python script through one of our products. Below is the script we are using, along with the input data:

from datetime import datetime
import pandas as pd
import modelerpy
modelerpy.installPackage('scikit-learn')
import sklearn
modelerpy.installPackage('cython')
modelerpy.installPackage('hdbscan')
import hdbscan
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, OneHotEncoder
import pkg_resources
from sklearn.decomposition import PCA
 
data = pd.read_csv("sample.csv")
cluster_data = data.drop(['Customer ID', 'Campaign ID', 'Response'], axis=1)
categorical_features = ['Gender', 'Marital Status']
numeric_features = list(set(cluster_data.columns) - set(categorical_features))
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), categorical_features),
    ('numeric', RobustScaler(), numeric_features)
], remainder='passthrough')
normalized = preprocessor.fit_transform(cluster_data)
normalized_df = pd.DataFrame(normalized, columns=preprocessor.get_feature_names_out())
pca = PCA(n_components=2)
pca_result = pca.fit_transform(normalized_df)
print('build model start')
print(datetime.now().time())
try:
    model = hdbscan.HDBSCAN(
        min_cluster_size=1000,
        min_samples=5,
        metric="euclidean",
        alpha=1.0,
        p=1.5,
        algorithm="prims_kdtree",
        leaf_size=30,
        approx_min_span_tree=True,
        cluster_selection_method="eom",
        allow_single_cluster=False,
        gen_min_span_tree=True,
        prediction_data=True
    ).fit(pca_result)
    print('build model end')
    print(datetime.now().time())
    #print(model)
    print("Cluster labels:")
    print(model.labels_)
    print("\nNumber of clusters:")
    print(len(set(model.labels_)) - (1 if -1 in model.labels_ else 0))
    print("\nCluster membership probabilities:")
    print(model.probabilities_)
    print("\nOutlier scores:")
    print(model.outlier_scores_)
except Exception as e:
    # Code to handle any exception
    print(f"An error occurred: {e}")

Sample file: sample.csv

We have performed preprocessing steps including OneHotEncoding, Scaling, and Dimensionality Reduction.
The script executes in approximately 8 minutes.
However, switching the algorithm from "prims_kdtree" to "best", "boruvka_kdtree", or "boruvka_balltree" results in a failure within a few minutes with the error message:

"An error occurred: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by excessive memory usage causing the Operating System to kill the worker."

Note: When executing the script using Jupyter Notebook, we obtain results for "best", "boruvka_kdtree", "boruvka_balltree", "prims_balltree", and "prims_kdtree" algorithms within a reasonable time.

Could you please help us with the following questions?

  1. Why do "best", "boruvka_kdtree", and "boruvka_balltree" algorithms fail while "prims_balltree" and "prims_kdtree" do not?
  2. What are the recommended best practices for optimizing HDBSCAN algorithm performance with large and varied datasets?
  3. Does HDBSCAN support spilling to disk?

Your insights and guidance would be greatly appreciated.

@Bokang-ctrl

Since you mentioned that the execution is successful in a Jupyter Notebook, the problem could be memory usage. It seems the environment that runs your script is less stable (or more tightly memory-constrained) than the notebook environment.

To optimize, I would suggest making sure you have enough memory and CPU resources to handle the process. You could also look into GPU acceleration.
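
The "worker process managed by the executor was unexpectedly terminated" message appears to come from a joblib/loky worker pool (hdbscan parallelizes some of its work with joblib), so one low-effort experiment is to dial that parallelism down and see whether the run becomes stable. This is only a minimal sketch, assuming pca_result is the array produced by the PCA step in the script above; core_dist_n_jobs is an existing hdbscan.HDBSCAN parameter, and 1 is just a conservative choice, not a recommended value:

import hdbscan

# Sketch only: same kind of model as in the original script, but with HDBSCAN's own
# parallelism reduced so fewer worker processes compete for memory at once.
model = hdbscan.HDBSCAN(
    min_cluster_size=1000,
    min_samples=5,
    metric="euclidean",
    algorithm="best",
    approx_min_span_tree=True,
    core_dist_n_jobs=1,        # fewer joblib workers -> lower peak memory
    prediction_data=True,
).fit(pca_result)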

@divya-agrawal3103
Author

Hi @Bokang-ctrl
Thanks for your response.
Could you please also clarify the two questions below?

1. What are the recommended best practices for optimizing HDBSCAN algorithm performance with large and varied datasets?
2. Does HDBSCAN support spilling to disk?

Thanks

@Bokang-ctrl

Bokang-ctrl commented Jul 17, 2024

Hi @divya-agrawal3103. Apologies for the delayed response. To answer your questions:

I would recommend using PCA for dimensionality reduction, which will reduce the number of features and make the model more efficient. Also try different scaling techniques (RobustScaler, StandardScaler, and MinMaxScaler) and check which one gives the best results.
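
As a rough sketch of that scaler comparison, reusing cluster_data, categorical_features, and numeric_features from the script in the original post (nothing here is specific to this dataset):

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, RobustScaler, StandardScaler

# Try each scaler in the same preprocessing pipeline and compare the resulting clusters.
scalers = {"robust": RobustScaler(), "standard": StandardScaler(), "minmax": MinMaxScaler()}

for name, scaler in scalers.items():
    preprocessor = ColumnTransformer(transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('numeric', scaler, numeric_features)
    ], remainder='passthrough')
    reduced = PCA(n_components=2).fit_transform(preprocessor.fit_transform(cluster_data))
    # ...fit HDBSCAN on `reduced` here and compare cluster quality across scalers...
    print(name, reduced.shape)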

Try tuning your parameters; the attached picture shows how I tuned mine. I'm pretty sure there are other ways, but these are what I can think of.
[Attached image: HDBSCAN hyperparameter tuning example]
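
Since the attached picture is not reproduced here, the following is only a rough sketch of one way to tune min_cluster_size and min_samples: fit a small grid and compare hdbscan's relative_validity_ score (a DBCV approximation that is available when gen_min_span_tree=True). The grid values are arbitrary, and pca_result is assumed to be the array from the original script:

import hdbscan

best_score, best_params = float('-inf'), None
for mcs in (250, 500, 1000, 2000):          # candidate min_cluster_size values (arbitrary)
    for ms in (5, 15, 50):                  # candidate min_samples values (arbitrary)
        m = hdbscan.HDBSCAN(
            min_cluster_size=mcs,
            min_samples=ms,
            metric='euclidean',
            gen_min_span_tree=True,         # required for relative_validity_
        ).fit(pca_result)
        if m.relative_validity_ > best_score:
            best_score, best_params = m.relative_validity_, (mcs, ms)

print('best (min_cluster_size, min_samples):', best_params, 'score:', best_score)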

For spilling to disk, I asked ChatGPT and this was the response:
ChatGPT: HDBSCAN itself does not natively support spilling to disk. The algorithm is designed to work in-memory, which means it requires sufficient RAM to handle the dataset being processed. However, you can manage large datasets using the following strategies:

  • Dask Integration: Use Dask to handle large datasets and parallelize computations. Dask can spill intermediate results to disk, allowing you to work with datasets larger than your available memory.
  • Memory-Mapped Arrays: Use NumPy.memmap to handle large datasets. This technique allows you to store data on disk while treating it as if it were in memory.
  • External Libraries: For large-scale clustering, consider external libraries like Faiss, which can handle large datasets efficiently and integrate with HDBSCAN for nearest-neighbor search.
  • Data Subsetting: Process subsets of your data sequentially and then combine the results, if possible, to manage memory constraints.
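
For example, the "Data Subsetting" strategy above could look roughly like the sketch below: fit HDBSCAN on a random subsample and assign the remaining rows with hdbscan.approximate_predict, which requires prediction_data=True (already set in the original script). The sample size of 200,000 is arbitrary, and clustering a subsample can change the density structure, so treat this as an illustration rather than a drop-in fix:

import numpy as np
import hdbscan

rng = np.random.default_rng(42)
n = len(pca_result)
sample_idx = rng.choice(n, size=min(200_000, n), replace=False)   # arbitrary subsample size

# Fit on the subsample only; prediction_data=True keeps what approximate_predict needs.
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=1000,
    min_samples=5,
    prediction_data=True,
).fit(pca_result[sample_idx])

# Assign every remaining row to the nearest learned cluster (or noise, label -1).
rest_idx = np.setdiff1d(np.arange(n), sample_idx)
rest_labels, rest_strengths = hdbscan.approximate_predict(clusterer, pca_result[rest_idx])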
    
