Encoding features for multi-class classification

I have a question regarding how to set up a dataset for modeling.

Let’s say I have a dataset representing which car a person will buy depending on some characteristics:

The dependent variable is the chosen car (Car 1, Car 2, … Car 100).

The independent variables are:

Budget (of the buyer)

Favorite Color (of buyer)

…..

…..

Color (of Car 1)

Color (of Car 2)

….

Color (of Car 100)

MPG (of Car 1)

MPG (of Car 2)

…..

MPG (of Car 100)

Let’s assume this is a multi-class classification problem. So, only one of the cars can be chosen in each situation.

My question is: is it appropriate to have independent variables like that, each specific to one of the dependent variables (Color of Car X, MPG of Car X, …)? Is it appropriate to just fit a row like that into a model? How does the model know that each of the Color columns describes the same feature, color?

Lastly, is there a name for this type of data/problem? I'm not sure how to look for it on Google.


Although I'm not very articulate, I'll try to detail some of my thoughts on your question.

First answering your questions:

  1. Yes. Since your goal is to predict which car the client will choose out of the 100, there isn't anything wrong with including features about those 100 cars. That said, in my opinion, the way you frame your model is a bit strange, but I will address that below.
  2. Yes, see above.
  3. The model doesn't know that the car color features all describe color, unless that is somehow hard-coded in. Given enough data, and depending on the model, it may eventually form some sort of association between those features, but the association is purely implicit in the model's parameters.
  4. I don't know the specific name for your problem, but it's along the lines of multi-class classification.

Suggestions:

As I said above, the way you frame the problem is slightly strange in my opinion. You don't have to follow anything I suggest since you know your problem best, but here are some of my thoughts.

You want to predict which car will be picked out of 100. I would instead create a dataset where each row represents a single car and the columns are that car's features (color, MPG, etc.). The model would then predict whether the client will buy that specific car, which turns this into a binary classification problem. After training the model on many cars, I would have it score each of the 100 candidates, that is, predict for each one whether the client would buy it, and pick the car with the highest predicted probability as the final prediction.

Again, I haven't done this kind of problem before, so I don't fully know. A similar variant would be to predict which car the client would pick, but with features describing the client instead of the cars.
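The reframing above can be sketched in code. This is a minimal illustration, not a definitive recipe: the feature names, the synthetic data, the toy buying rule, and the choice of logistic regression are all assumptions I'm making for the example.

```python
# Sketch of the per-car binary reframing: one row per (client, car) pair,
# label = whether that car was bought, then argmax over candidates at
# prediction time. All data here is synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical features per row: [client_budget, car_price, car_mpg].
X_train = rng.normal(size=(200, 3))
# Toy labeling rule: the client buys when budget exceeds price (plus noise).
y_train = (X_train[:, 0] - X_train[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# At prediction time, score every candidate car for one client
# (in your case, all 100 cars; three shown here)...
candidates = np.array([
    [1.0, 0.5, 0.2],   # car A: price close to budget
    [1.0, 2.0, 0.9],   # car B: price above budget
    [1.0, -1.0, 0.1],  # car C: price well below budget
])
buy_prob = model.predict_proba(candidates)[:, 1]

# ...and pick the car the model is most confident the client will buy.
best_car = int(np.argmax(buy_prob))
```

With the toy rule above, the cheapest car relative to the budget (car C) ends up with the highest predicted probability, so `best_car` points at it.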

On an unrelated note: As for categorical features like color, I would go with one-hot encoding.


Color is a categorical feature.

One of the most common methods to encode categorical features is one-hot encoding. Color could be encoded as an indicator vector. The color of the current car would have a 1 at the appropriate index. For example, [1, 0, 0, …, 0] for a red car and [0, 1, 0, …, 0] for a blue car.

There are other options for encoding categorical features such as binary, count, hash, or label encoding.
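For concreteness, here is a minimal sketch of one-hot encoding a color column with pandas; the column name and color values are made up for illustration.

```python
# One-hot encode a categorical "color" column: each row gets a 1 in
# exactly one indicator column and 0 elsewhere. Values are illustrative.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# get_dummies creates one indicator column per distinct color,
# in alphabetical order: color_blue, color_green, color_red.
encoded = pd.get_dummies(df["color"], prefix="color").astype(int)
```

So the first row ("red") becomes the indicator vector [0, 0, 1] over (blue, green, red), matching the [1, 0, 0, …] pattern described above up to column ordering.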
