Association between categorical variables with no hierarchy in Python
I have a dataset in which over 100 possible values (codes) can occur across 20 columns. At first glance the problem seemed to fit hierarchical clustering, so I started testing Agglomerative Clustering as described in the scikit-learn documentation. The documentation also mentions using a connectivity matrix, which is not available for this problem. However, after working with the stakeholder to improve my business understanding, I found that there is no pathing in the process, so hierarchical clustering is not appropriate. The data looks like this:
col_1 | col_2 | col_3 |
---|---|---|
code 1 | code 80 | code 87 |
code 80 | code 53 | NaN |
Each row represents a customer's application for a product. The application runs through a series of automated checks to determine eligibility. The checks identify several issue codes that an individual must resolve manually before passing the application on. Sometimes duplicate codes are identified at the same time (the stakeholder is unsure why). Some applications have one error, some have up to 20.
The intention is to apply unsupervised learning, likely a clustering technique, to determine whether there are strong associations between the occurrence of any two, three, or more codes. However, most of my experience is in NLP and classification. From what I've researched, dummy variables may be appropriate to create a flag for the presence of each code. I have tested them, but have not been successful so far because each row has a variable width and the resulting shape is inconsistent. A colleague suggested pairwise correlation, but since the data is categorical rather than numeric, I do not know whether coercing to numeric affects the outcome of the correlation. I have tested the pairwise correlation by coercing the data type from `object` to `int`, but the results show no apparent relationship between variables.
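A minimal sketch of the dummy-variable idea, assuming pandas and toy data copied from the sample table. The names `app_id`, `flags`, and `cooc` are illustrative only and do not come from the real dataset. Reshaping to long format first sidesteps the variable row width, and clipping counts at 1 collapses duplicate codes within an application.

```python
import pandas as pd

# Toy data shaped like the sample table; the real data has 20 columns.
df = pd.DataFrame({
    "col_1": ["code 1", "code 80"],
    "col_2": ["code 80", "code 53"],
    "col_3": ["code 87", float("nan")],
})

# Long format: one (application, code) pair per row, NaN padding dropped.
# "app_id" is a hypothetical name for the row index.
long = (
    df.rename_axis("app_id")
      .reset_index()
      .melt(id_vars="app_id", value_name="code")
      .dropna(subset=["code"])
)

# One 0/1 flag column per code; clip turns duplicate counts into presence.
flags = pd.crosstab(long["app_id"], long["code"]).clip(upper=1)

# Code-by-code co-occurrence counts: entry (a, b) is the number of
# applications in which both codes appear; the diagonal is each code's count.
cooc = flags.T @ flags

print(flags)
print(cooc)
```

Every row of `flags` has the same width regardless of how many codes the application had, so it can be passed to tools that expect a rectangular matrix. An application with no codes at all would drop out of `flags` in this sketch. Pairwise correlation computed on 0/1 flags like these measures co-occurrence of codes, which is different from correlating the code labels coerced to integers.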
Any suggestions on an appropriate modeling or data mining technique?
Topic scikit-learn python clustering data-mining machine-learning
Category Data Science