How to determine irrelevant data in binary classification?

Question

dbalagula23

May 6, 2022, 22:00

Suppose I were looking at social media to find users' intent to trade stocks. I might have a binary classification model that predicts "buy" or "sell". However, most social media posts that mention a company are obviously not about buying or selling its stock. Even if I were to look only at places on the Internet where the main topic of discussion is buying and selling stocks, there would still be plenty of posts that are in a sense "off-topic" (e.g. "I applied to Microsoft today." or "What does everyone here think about Alphabet?").

My question is: how would one go about recognizing when a social media post does not suggest that a user would buy or sell the stock? I had three quick ideas:

1. Create rules that differentiate relevant from irrelevant posts.
2. Create a second binary classifier that differentiates between relevant and irrelevant posts, and then run the main classifier on only the relevant posts.
3. Change the binary classifier into a classifier that can detect buy, sell, and off-topic documents.
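
As a rough illustration of the first two ideas, the sketch below chains a rule-based relevance filter with a stand-in intent classifier. The keyword sets and the function names `is_relevant`, `classify_intent`, and `predict` are hypothetical, and the toy buy/sell logic is only a placeholder for a trained model.

```python
import re

# Hypothetical trading vocabulary; a real rule set would be far larger.
TRADING_TERMS = {"buy", "bought", "buying", "sell", "sold", "selling",
                 "shares", "calls", "puts"}
SELL_TERMS = {"sell", "sold", "selling", "puts"}


def tokenize(post):
    """Lowercase the post and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", post.lower())


def is_relevant(post):
    """Stage 1 (rules): relevant if the post contains any trading term."""
    return any(token in TRADING_TERMS for token in tokenize(post))


def classify_intent(post):
    """Stage 2 placeholder: stands in for the trained buy/sell model."""
    return "sell" if set(tokenize(post)) & SELL_TERMS else "buy"


def predict(post):
    """Run the buy/sell classifier only on posts that pass stage 1."""
    if not is_relevant(post):
        return "off-topic"
    return classify_intent(post)


if __name__ == "__main__":
    for post in ["I applied to Microsoft today.",
                 "Just bought 10 shares of MSFT.",
                 "Selling all my GOOG tomorrow."]:
        print(post, "->", predict(post))
```

Swapping `is_relevant` for a trained relevant/irrelevant classifier turns this into the second idea without changing `predict`.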

Is there a customary approach to this problem?

Topic classification

Category Data Science

fuwiak · Accepted Answer · March 12, 2020, 14:28

Is there a customary approach to this problem?

Yes, it's called feature selection. It is used to remove irrelevant or partially relevant features that could negatively impact model performance. Some of the easiest methods are:

Univariate Selection
Feature Importance
Correlation Matrix with Heatmap
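
To make the first of these concrete, here is a minimal sketch of univariate selection using a hand-computed chi-square statistic on binary word-presence features. The helper names `chi2_score` and `select_k_best`, the feature names, and the six-post toy data are all made up for illustration; a library implementation may compute its score differently.

```python
def chi2_score(feature, labels):
    """Pearson chi-square statistic for a binary feature vs. a binary label."""
    n = len(labels)
    # Observed counts for the 2x2 contingency table (feature value, label).
    observed = {(f, y): 0 for f in (0, 1) for y in (0, 1)}
    for f, y in zip(feature, labels):
        observed[(f, y)] += 1
    score = 0.0
    for f in (0, 1):
        for y in (0, 1):
            row_total = observed[(f, 0)] + observed[(f, 1)]
            col_total = observed[(0, y)] + observed[(1, y)]
            expected = row_total * col_total / n
            if expected > 0:  # skip empty cells to avoid dividing by zero
                score += (observed[(f, y)] - expected) ** 2 / expected
    return score


def select_k_best(columns, labels, k):
    """Return the names of the k features with the highest chi-square score."""
    ranked = sorted(columns,
                    key=lambda name: chi2_score(columns[name], labels),
                    reverse=True)
    return ranked[:k]


# Toy data (assumed): 1 = "buy" post, 0 = "sell" post.
labels = [1, 1, 1, 0, 0, 0]
# Each column records whether a word appears in each post.
columns = {
    "stock": [1, 1, 1, 1, 1, 1],  # appears everywhere, carries no signal
    "today": [1, 0, 1, 0, 1, 0],  # weakly related to the label
    "buy":   [1, 1, 1, 0, 0, 0],  # perfectly aligned with the label
}

if __name__ == "__main__":
    print(select_k_best(columns, labels, k=2))
```

A word that appears in every post scores zero and is dropped first, which is the sense in which the method removes irrelevant features.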

You can find example implementations of these methods at this link:

https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e

Sean Owen · Accepted Answer · February 11, 2020, 13:33

I agree that you could reasonably classify the stock that it relates to, and whether it concerns a trading intent or not, separately. Presumably, trading intent looks similar across stocks, so the two are not related much.

Your second and third ideas therefore sound relevant to me.
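
The three-class variant, with off-topic folded into the label set, might look like the sketch below. It uses a small hand-written multinomial Naive Bayes purely for illustration; the class name `NaiveBayes`, the training posts, and their labels are invented, and any multiclass text classifier could take its place.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words unseen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (assumed); real data would need many labeled posts.
train_texts = [
    "buying more shares today",
    "going long on this stock",
    "selling my shares now",
    "time to dump and sell this stock",
    "i applied for a job there",
    "what does everyone think about the company",
]
train_labels = ["buy", "buy", "sell", "sell", "off-topic", "off-topic"]

model = NaiveBayes().fit(train_texts, train_labels)

if __name__ == "__main__":
    for post in ["I am buying shares", "Time to sell",
                 "I applied for a job at Microsoft"]:
        print(post, "->", model.predict(post))
```

Compared with the two-stage setup, this needs only one model, but the off-topic class must be well represented in the training data.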

It's also possible to use a multi-task classifier, built on a deep learning architecture, that tries to solve both tasks at once. There might be small advantages to that, but it is more complex.
