Cluster a set of items represented as feature vectors

I have to cluster a dataset of houses and their water consumption, given in this form: $$ \text{House}_1 = (x_{1}, x_{2}, \ldots, x_{n});\\ \text{House}_2 = (y_{1}, y_{2}, \ldots, y_{n});\\ \text{House}_3 = (z_{1}, z_{2}, \ldots, z_{n}); $$ where $x_{i}$ is the daily consumption in liters on day $i$ and $n$ is a fixed parameter (the length of the dataset).

I need to group these houses into $k$ clusters based on their water consumption.

My question is: how should I handle data in this form so that I can feed it to a clustering algorithm? Do I have to aggregate each vector into a single real value?



One approach is to treat each house as a time series and then cluster the time series. Some methods adapt k-means to time series data; see the R package kml.
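To make the idea concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) applied directly to the daily-consumption vectors, using only the Python standard library. It is not what kml does internally; the function names and the toy data are my own assumptions, and in practice you would probably reach for an existing library.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(series, k, n_iter=100, seed=0):
    """Plain k-means: returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct series picked at random.
    centroids = [list(v) for v in rng.sample(series, k)]
    labels = [None] * len(series)
    for _ in range(n_iter):
        # Assignment step: each series goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in series
        ]
        if new_labels == labels:
            break  # assignments are stable, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(series, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_vector(members)
    return labels, centroids

# Hypothetical data: 4 houses, n = 4 days of consumption in liters.
houses = [
    [100, 110, 105, 98],
    [102, 108, 101, 99],
    [300, 310, 290, 305],
    [295, 305, 300, 310],
]
labels, centroids = kmeans(houses, k=2)
print(labels)
```

The two low-consumption houses end up in one cluster and the two high-consumption houses in the other; which cluster gets index 0 depends on the random initialisation.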

I also like mixture models for similar problems, since the underlying time series model can be quite flexible. I have an example using PyTorch, and the R package flexmix is quite good as well.

flexmix also handles data with unequal time stamps: you pass the data in long format and give it a grouping factor.
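In case "long format" is unfamiliar: it means one row per observation, with a column saying which house the observation belongs to (that column plays the role of the grouping factor). Below is a rough sketch of the reshaping in Python. It only illustrates the data layout and is not flexmix's API; the column names `house`, `day` and `liters` and the sample readings are made up.

```python
def to_long_format(readings_by_house):
    """Turn {house: [(day, liters), ...]} into one row per observation."""
    rows = []
    for house, readings in readings_by_house.items():
        for day, liters in readings:
            rows.append({"house": house, "day": day, "liters": liters})
    return rows

# Hypothetical readings with unequal time stamps per house.
readings_by_house = {
    "House1": [(1, 120.0), (2, 115.0), (4, 130.0)],
    "House2": [(1, 300.0), (3, 310.0)],
}

long_rows = to_long_format(readings_by_house)
for row in long_rows:
    print(row)
```

Because every row carries its own day, houses do not need to share the same measurement days or the same number of measurements.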

A final approach I have seen is to estimate various time series characteristics for each individual series (e.g. by fitting an ARIMA model) and then cluster those characteristics. In that scenario each house is reduced to a single row of features.
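As a rough illustration of the "one row per house" idea, the sketch below swaps the ARIMA fit for three much simpler characteristics: mean, standard deviation and lag-1 autocorrelation. That choice of features is my assumption, not a recommendation from any package, and the data is invented.

```python
from statistics import mean, pstdev

def lag1_autocorrelation(series):
    """Lag-1 autocorrelation; returns 0.0 for a constant series."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        return 0.0
    num = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
    return num / denom

def series_features(series):
    """Summarise one house as [mean, std. deviation, lag-1 autocorrelation]."""
    return [mean(series), pstdev(series), lag1_autocorrelation(series)]

# Hypothetical daily consumption in liters.
houses = {
    "House1": [100, 120, 110, 130, 125],
    "House2": [300, 290, 310, 305, 295],
}

# One feature row per house; these rows are what you would cluster.
feature_rows = {name: series_features(s) for name, s in houses.items()}
for name, row in feature_rows.items():
    print(name, row)
```

Since the features live on very different scales (liters versus a correlation between -1 and 1), it usually makes sense to standardise the columns before clustering them.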


1. You just have to represent the features as a numeric vector, e.g. [2, 4, 6, 8, 10].

2. It is good practice to normalize each vector so that its values lie between 0 and 1. I simply divided each element by the sum of all elements, which for the vector above gives [0.06666666666666667, 0.13333333333333333, 0.2, 0.26666666666666666, 0.3333333333333333].

3. Feed the vectors into a clustering algorithm (you can try k-means).
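The normalization in step 2 can be sketched as below. The helper name `normalize_by_sum` is mine, and the input [2, 4, 6, 8, 10] is the vector that reproduces the normalized values quoted in step 2.

```python
def normalize_by_sum(vector):
    """Divide each element by the sum of all elements.

    For non-negative inputs the result lies in [0, 1] and sums to 1.
    An all-zero vector is returned as zeros to avoid dividing by zero.
    """
    total = sum(vector)
    if total == 0:
        return [0.0 for _ in vector]
    return [x / total for x in vector]

normalized = normalize_by_sum([2, 4, 6, 8, 10])
print(normalized)
```

Note that this throws away the total volume: a house using 2, 4, 6 liters and one using 200, 400, 600 liters get the same normalized vector, so the clustering will group houses by the shape of their consumption rather than by how much they consume. Whether that is what you want depends on the question you are asking.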
