Clustering with multiindex DataFrame

Question

Berthold.LR

May 17, 2022, 15:23

I have a huge amount of multiindexed data that, very simplified, looks like this:

Code:

import pandas as pd

channel_list = ['A','B','C']

df1=pd.DataFrame([[1.1,4,9],[1.2,5,9],[1.3,6,9],[1.4,7,9]],columns=pd.MultiIndex.from_product([['Test1'],channel_list],names = ['Test ID','Channel']))
df2=pd.DataFrame([[1.1,9,6],[1.2,8,6],[1.3,7,6],[1.4,6,6]],columns=pd.MultiIndex.from_product([['Test2'],channel_list],names = ['Test ID','Channel']))
df3=pd.DataFrame([[1.1,1,4],[1.2,2,4],[1.3,3,4],[1.4,4,4]],columns=pd.MultiIndex.from_product([['Test3'],channel_list],names = ['Test ID','Channel']))
df4=pd.DataFrame([[1.1,7,9],[1.2,6,9],[1.3,5,9],[1.4,4,9]],columns=pd.MultiIndex.from_product([['Test4'],channel_list],names = ['Test ID','Channel']))
df5=pd.DataFrame([[1.1,9,9],[1.2,8,9],[1.3,7,9],[1.4,6,9]],columns=pd.MultiIndex.from_product([['Test5'],channel_list],names = ['Test ID','Channel']))
df6=pd.DataFrame([[1.1,1,5],[1.2,2,5],[1.3,3,5],[1.4,4,5]],columns=pd.MultiIndex.from_product([['Test6'],channel_list],names = ['Test ID','Channel']))
df7=pd.DataFrame([[1.1,2,1],[1.2,3,1],[1.3,4,1],[1.4,5,1]],columns=pd.MultiIndex.from_product([['Test7'],channel_list],names = ['Test ID','Channel']))
df8=pd.DataFrame([[1.1,4,3],[1.2,5,3],[1.3,6,3],[1.4,7,3]],columns=pd.MultiIndex.from_product([['Test8'],channel_list],names = ['Test ID','Channel']))

df_all=pd.concat([df1,df2,df3,df4,df5,df6,df7,df8],axis=1).T

My goal is to cluster the data by the highest level 'Test ID', so for example:

Cluster 1: Test1, Test3, Test6, Test7, Test8
Cluster 2: Test2, Test4, Test5

In my example, all elements of cluster 1 have an ascending channel B and all elements of cluster 2 have a descending channel B. In my real data I have about 200 channels, and the correlation could span multiple channels.

I tried to cluster with KMeans from sklearn like this:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
kmeans.fit(df_all)
y_kmeans = kmeans.predict(df_all)

but it always ignores the MultiIndex and treats all 8*3 rows as independent samples. After learning more about k-means clustering, I'm not even sure it is the right clustering method for my problem. Can anyone give me a hint on how to transform my data and which clustering method I could use?
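To make the problem concrete, here is a minimal sketch of one possible reshaping, not a verified solution. It rebuilds the same toy data compactly, then uses `unstack('Channel')` so that each Test ID becomes a single row with one column per (time step, channel) pair. Using step-to-step differences as features is my own assumption, chosen only because the toy clusters are defined by an ascending or descending channel B; the names `tests`, `steps`, `features` and `labels` are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Same toy data as in the question, built compactly:
# each test maps to (channel B values, constant channel C value).
a = [1.1, 1.2, 1.3, 1.4]
tests = {
    'Test1': ([4, 5, 6, 7], 9),
    'Test2': ([9, 8, 7, 6], 6),
    'Test3': ([1, 2, 3, 4], 4),
    'Test4': ([7, 6, 5, 4], 9),
    'Test5': ([9, 8, 7, 6], 9),
    'Test6': ([1, 2, 3, 4], 5),
    'Test7': ([2, 3, 4, 5], 1),
    'Test8': ([4, 5, 6, 7], 3),
}
index = pd.MultiIndex.from_product(
    [list(tests), list('ABC')], names=['Test ID', 'Channel'])
data = [vals for b, c in tests.values() for vals in (a, b, [c] * 4)]
df_all = pd.DataFrame(data, index=index)  # 24 rows: what KMeans saw before

# Assumption: the trend matters, not the level, so use differences
# between consecutive time steps (the first column is NaN and is dropped).
steps = df_all.diff(axis=1).iloc[:, 1:]

# One row per Test ID, one column per (time step, channel) pair.
features = steps.unstack('Channel')

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = pd.Series(kmeans.fit_predict(features.to_numpy()),
                   index=features.index, name='cluster')
print(labels)
```

On this toy data the two groups separate by the direction of channel B. Whether differences are a sensible feature for the real data with about 200 channels is an open question, and the channels would probably need scaling before clustering.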

Topic scikit-learn pandas python clustering machine-learning

Category Data Science
