How to determine irrelevant data in binary classification?

Suppose I were looking at social media to find users' intents on trading stocks. I might have a binary classification model that predicts "buy" or "sell". However, most social media posts that mention a company are clearly not about buying or selling its stock. Even if I looked only at places on the Internet where the main topic of discussion is buying and selling stocks, there would still be a number of posts that are in a sense "off-topic" (e.g. "I applied to Microsoft today." or "What does everyone here think about Alphabet?").

My question is: how would one go about recognizing when a social media post does not suggest that a user would buy or sell the stock? I had three quick ideas:

  1. Create rules that differentiate relevant from irrelevant posts.

  2. Create a second binary classifier that differentiates between relevant and irrelevant posts, and then use the main classifier on only the relevant posts.

  3. Change the binary classifier into a three-class classifier that can detect buy, sell, and off-topic documents.

Is there a customary approach to this problem?



Is there a customary approach to this problem?

Yes, it's called feature selection. It is used to remove irrelevant or only partially relevant features that could hurt model performance. Some of the simplest methods are:

  1. Univariate Selection
  2. Feature Importance
  3. Correlation Matrix with Heatmap

You can find example implementations of these methods at the following link:

https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e


I agree that you could reasonably treat which stock a post relates to, and whether it expresses a trading intent at all, as separate problems. Presumably trading intent looks similar across stocks, so the two are largely unrelated.

Your second and third ideas therefore sound relevant to me.
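
To make the second and third ideas concrete, here is a minimal sketch of both, assuming scikit-learn with a TF-IDF plus logistic regression pipeline. The model choice, the helper names (`make_model`, `predict_two_stage`) and the toy posts are assumptions made for illustration, not something taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration; a real model needs far more labelled posts.
posts = [
    "buying more MSFT shares today",
    "going long on GOOGL, adding shares",
    "selling my MSFT shares now",
    "dumping GOOGL, selling everything",
    "I applied to Microsoft today",
    "What does everyone here think about Alphabet?",
]
labels = ["buy", "buy", "sell", "sell", "off-topic", "off-topic"]


def make_model():
    """Hypothetical helper: a fresh TF-IDF + logistic regression pipeline."""
    return make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))


# Idea 2: a relevance filter followed by the original buy/sell classifier.
relevance = ["irrelevant" if lab == "off-topic" else "relevant" for lab in labels]
relevance_model = make_model().fit(posts, relevance)

# The buy/sell model is trained on the relevant posts only.
relevant_posts = [p for p, r in zip(posts, relevance) if r == "relevant"]
relevant_labels = [l for l, r in zip(labels, relevance) if r == "relevant"]
intent_model = make_model().fit(relevant_posts, relevant_labels)


def predict_two_stage(texts):
    """Return buy/sell for posts judged relevant, 'off-topic' otherwise."""
    stage_one = relevance_model.predict(texts)
    # For brevity the intent model scores every post; its output is
    # discarded for posts that the first stage marks as irrelevant.
    stage_two = intent_model.predict(texts)
    return [
        intent if rel == "relevant" else "off-topic"
        for rel, intent in zip(stage_one, stage_two)
    ]


# Idea 3: a single classifier over the three classes.
three_class_model = make_model().fit(posts, labels)

new_posts = ["selling my shares tomorrow", "I applied to Alphabet"]
two_stage_predictions = predict_two_stage(new_posts)
three_class_predictions = list(three_class_model.predict(new_posts))
print(two_stage_predictions)
print(three_class_predictions)
```

With so few examples the predictions themselves are not meaningful; the point is only the structure. The two-stage version lets you tune and evaluate the relevance filter independently, while the three-class version is simpler to train and deploy.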

You could also consider a multi-task classifier built on a deep learning architecture, one that tries to solve both tasks at once. That might bring small advantages, but it is more complex.
