How to cluster texts by most relevant words

Question


AlexMIEL

April 25, 2022, 16:02

I have a large number of documents, and each document has its own portrait: a list of entries with the structure (document_id, word, weight). TF-IDF, basically.

I want to cluster these documents into different clusters, say, 10.

I'm trying to implement the K-Means algorithm with sklearn, but I have almost zero experience with data science whatsoever. All tutorials that I found get texts as input from Wikipedia or somewhere else, but I don't have access to the texts themselves. I have only their portraits. Hope that makes sense.

Is this achievable with sklearn, and if so, can you point me to where to dig or what to look at?

Topic clustering

Category Data Science

Palak Bansal · Accepted Answer · July 25, 2021, 20:15

I do not know the nature of the words you have, but you could start with cosine similarity. It compares two documents through the angle between their word-weight vectors, so only the words the two documents share contribute to the score.
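As a rough illustration of cosine similarity on portraits, here is a minimal sketch that treats each portrait as a `{word: weight}` dictionary. The function name `cosine` and the sample weights are made up for the example; they are not from the question.

```python
import math

def cosine(a, b):
    """Cosine similarity between two {word: weight} portraits."""
    # Only words present in both portraits contribute to the dot product.
    dot = sum(w * b[word] for word, w in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty portrait is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical portraits for three documents.
doc_a = {"cat": 0.8, "dog": 0.3}
doc_b = {"dog": 0.9, "leash": 0.2}
doc_c = {"stock": 0.7, "market": 0.6}

sim_ab = cosine(doc_a, doc_b)  # share "dog", so the score is positive
sim_ac = cosine(doc_a, doc_c)  # no shared words, so the score is zero
```

Documents with no words in common always score zero, which is why this measure struggles when portraits rarely overlap.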

You could then extend this to cover semantically similar words by using word embeddings, which translate words into numeric vectors that you can then perform operations on.

If simple cosine similarity doesn't work, you will have to read about other measures of similarities or word embeddings.

Erwan · Accepted Answer · February 24, 2021, 12:19

You can use these words with their weights as a vector representation of the document. The important point is to make every document a vector over the full vocabulary, so that any position $i$ in any vector always represents the same word $w_i$. This means that a vector should contain zeros in all the positions corresponding to words which are not in the document.
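One possible way to build such vectors from (document_id, word, weight) triples is a sparse matrix, which stores only the non-zero weights. This is a sketch, assuming the triples fit in memory; the sample data and the helper name `build_matrix` are hypothetical.

```python
from scipy.sparse import csr_matrix

# Hypothetical portraits as (document_id, word, weight) triples.
portraits = [
    ("d1", "cat", 0.8), ("d1", "dog", 0.3),
    ("d2", "dog", 0.9),
    ("d3", "stock", 0.7), ("d3", "market", 0.6),
]

def build_matrix(triples):
    """Return a documents-by-vocabulary sparse matrix plus the index maps."""
    doc_index = {}   # document_id -> row number
    word_index = {}  # word -> column number, shared by all documents
    rows, cols, vals = [], [], []
    for doc_id, word, weight in triples:
        rows.append(doc_index.setdefault(doc_id, len(doc_index)))
        cols.append(word_index.setdefault(word, len(word_index)))
        vals.append(weight)
    shape = (len(doc_index), len(word_index))
    # Positions that never appear in the triples are implicitly zero.
    return csr_matrix((vals, (rows, cols)), shape=shape), doc_index, word_index

X, doc_index, word_index = build_matrix(portraits)
```

Each row of `X` is one document, and column `word_index[w]` means the same word `w` in every row, which is the property described above.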

Using these vectors you can indeed use k-means to cluster the documents. Of course the quality of the results depends on the data: if the documents have very few words in common, it cannot work very well.
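A minimal sketch of the clustering step with sklearn, on a toy documents-by-words matrix with made-up weights. The L2 normalization is my own addition rather than part of this answer: on unit-length rows, Euclidean k-means behaves much like clustering by cosine similarity. For the question's case you would set `n_clusters=10`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Toy matrix: 4 documents (rows) over a 4-word vocabulary (columns).
X = csr_matrix(np.array([
    [0.8, 0.3, 0.0, 0.0],
    [0.2, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.6],
    [0.0, 0.0, 0.5, 0.9],
]))

# Assumption: scale every row to unit length before clustering.
X_unit = normalize(X)

# KMeans accepts sparse input; random_state makes the run repeatable.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X_unit)  # one cluster id per document
```

The cluster ids themselves are arbitrary; what matters is which documents end up sharing an id.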
