Identifying subsets of values significant to the total sum

Imagine a set of products in a store, each with a number of attributes assigned to it. Some of these attributes are hierarchical (e.g. categories) and some are not (e.g. brand), but none of them is continuous (if that is even important here). For each product, we know how much (in monetary value) we sold last year and how much we sold this year. The sum of the per-product differences in sales is equal to the difference in total sales between the two years.

What we're interested in is finding some nice rules which describe the biggest sales-change-driving sets of products. For example: Smartphones have dropped in sales by 10% (\$123456), Apple products increased in sales by 20% (\$31234), etc. In other words, an end user is interested in learning what's driving our sales up and down in some easily consumable format. Hierarchical attributes should be taken into account here, as well as the combination of orthogonal attributes.

My question is not about forming those sentences, but about how to find those structured rules in general. Additionally, how can composite rules best be identified, such as "Smartphones are dropping in sales, apart from Apple smartphones, which are growing"?

A very important additional question is how to balance the relative and absolute change of these rules. Are there best-practice approaches for dealing with this tradeoff?

A somewhat naive approach would be to build a decision tree that predicts the sales change (either relative or absolute) for each product, and then use that tree as a foundation for the rules, or maybe just plot the tree itself and present it to the end user.

All ideas, even ones very remote to the main question, are very welcome!

Topic decision-trees

Category Data Science


I don't think ML is needed here; this is an SQL (or at least pandas) exercise. You have the change in sales for each of the lowest-level labels, and you also have the hierarchical relationships between labels: Smartphones -> Apple Smartphones -> iPhone 2. So you basically need to aggregate from the lower to the higher label levels and look at the higher-level labels you care about. Sort them from low to high by the aggregated change, and those are your rules. There might be some rare but expensive items, so it might be worth weighting by number of sales, volume, or another important metric, but that is about it.
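A minimal sketch of that aggregation, written in plain Python so it runs without dependencies (a pandas `groupby` or an SQL `GROUP BY` would do the same job). The product rows, the attribute names `category` and `brand`, and the helper names `aggregate_changes` and `rank_rules` are all made up for illustration. It sums sales for every attribute value and every combination of values, then sorts the groups from low to high by absolute change, keeping the relative change alongside.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical product-level data: attributes plus sales in each year.
products = [
    {"category": "Smartphones", "brand": "Apple", "last_year": 1000.0, "this_year": 1200.0},
    {"category": "Smartphones", "brand": "Samsung", "last_year": 2000.0, "this_year": 1500.0},
    {"category": "Smartphones", "brand": "Nokia", "last_year": 500.0, "this_year": 300.0},
    {"category": "Laptops", "brand": "Apple", "last_year": 1500.0, "this_year": 1800.0},
    {"category": "Laptops", "brand": "Lenovo", "last_year": 800.0, "this_year": 820.0},
]
ATTRIBUTES = ("category", "brand")  # assumed attribute columns


def aggregate_changes(rows, attributes):
    """Sum both years' sales for every attribute value and every
    combination of values across different attributes."""
    totals = defaultdict(lambda: [0.0, 0.0])  # group key -> [last_year, this_year]
    for row in rows:
        for size in range(1, len(attributes) + 1):
            for attrs in combinations(attributes, size):
                key = tuple((a, row[a]) for a in attrs)
                totals[key][0] += row["last_year"]
                totals[key][1] += row["this_year"]
    return totals


def rank_rules(totals):
    """Turn group totals into candidate rules sorted from the biggest
    drop to the biggest gain (by absolute change)."""
    rules = []
    for key, (last, this) in totals.items():
        absolute = this - last
        relative = absolute / last if last else None  # undefined without prior sales
        rules.append({"group": dict(key), "absolute": absolute, "relative": relative})
    rules.sort(key=lambda r: r["absolute"])
    return rules


rules = rank_rules(aggregate_changes(products, ATTRIBUTES))
for rule in rules:
    rel = "n/a" if rule["relative"] is None else f"{rule['relative']:+.1%}"
    print(f"{rule['group']}: {rule['absolute']:+.2f} ({rel})")
```

In this toy data the Smartphones group as a whole falls while the Smartphones and Apple combination rises, which is the kind of composite pattern asked about. Ranking by absolute change alone is a choice made for the sketch; weighting by volume or another metric, as suggested above, would replace the sort key.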
