Feasibility study of machine learning

How can I tell whether machine learning is feasible for a given data set? I have been given a data set and need to check whether machine learning can be applied to it. How do I do that, and how do I reach a conclusion either way?


A feasibility study usually goes hand in hand with a business value analysis. In your case the study entails the following steps:

  1. ML problem definition and desired outcome
    First, define your problem from the ML perspective: what are the features and what is the output? Second, make sure the task is complex enough to need ML; otherwise, simple heuristic rules will do.
  2. Exploratory data analysis (EDA)
    Talk with domain experts and do some correlation analysis among features and between features and labels.
  3. Data pre-processing
    Check whether the data is noisy (for example missing values, outliers or mislabeled rows) and whether that noise can be handled.
  4. Hypothesis testing
    Begin with the simplest baselines and, in frequent communication with the business team, check whether the incremental business impact is worth the time spent. If the impact is high, invest more time in more complex models; otherwise, just try a few newer algorithms quickly.
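
The baseline comparison in step 4 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy data, the even/odd hold-out split and the choice of mean absolute error are all hypothetical, and a real study would use the project's own data and metric.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical toy data: one feature, a roughly linear target and a small
# deterministic wiggle standing in for noise.
xs = list(range(40))
ys = [2 * x + 1 + ((x * 7) % 5 - 2) for x in xs]

# Hold out every second row so both predictors are scored on unseen data.
train_x, test_x = xs[0::2], xs[1::2]
train_y, test_y = ys[0::2], ys[1::2]

# Baseline: always predict the training mean.
baseline_mae = mae(test_y, [mean(train_y)] * len(test_y))

# Simple model: a straight line fitted on the training rows.
slope, intercept = fit_line(train_x, train_y)
model_mae = mae(test_y, [slope * x + intercept for x in test_x])

print(f"baseline MAE:     {baseline_mae:.2f}")
print(f"linear model MAE: {model_mae:.2f}")
print(f"improvement:      {baseline_mae - model_mae:.2f}")
```

If the simple model barely beats the constant baseline, that is an early sign that the features carry little signal for this target, or that a different model family is needed.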

I have asked myself the same question many times. To add a bit of context: I work with relatively small data sets (~100 observations per experiment) of an environmental nature, which are often sparse and/or imbalanced.

My empirical answer is very simple: just try it! Also, do not forget that ML is a loosely defined term, and some "classical" statistical tools may well fall under it.

To begin, use domain knowledge to set up appropriate research questions and think about what your data can tell you. Then start exploring the basics: draw a correlation plot, check whether the data is normally distributed, etc. Next, you can apply unsupervised learning to look for more complicated relationships. Afterwards, you can perhaps try supervised learning to make predictions.
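
As a rough sketch of the correlation step, the snippet below computes pairwise Pearson correlations using only the standard library. The column names and values are made up for illustration (they are not from a real experiment); with real data one would more likely load a table and plot the matrix.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical measurements: three features and one label ("growth").
columns = {
    "temperature": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
    "humidity": [71, 67, 66, 64, 61, 61, 57, 57, 53, 52, 51, 47],
    "wind": [5, 2, 8, 3, 7, 1, 1, 7, 3, 8, 2, 5],
    "growth": [30, 35, 35, 40, 40, 45, 49, 50, 56, 57, 58, 64],
}

# Print the correlation for every pair of columns.
for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
    print(f"{name_a:>12} vs {name_b:<12} r = {pearson(col_a, col_b):+.2f}")
```

Features that correlate strongly with the label are promising predictors. Features that correlate strongly with each other are partly redundant. A near-zero coefficient only rules out a linear relationship, not a relationship altogether.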

One important remark. On the one hand, do not get discouraged by poor PCA performance: it may well be that the relationships in your data are not linear. On the other hand, do not expect to build a neural network for every single problem: often one is not needed. Just go from a low level of complexity to a higher one for as long as it makes sense to continue.
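
A small illustration of why a poor linear result is not the end of the story. In this sketch (toy data, assumed purely for demonstration) the target depends on the square of the feature, so a straight line explains almost nothing, while one extra step in complexity, fitting on the squared feature, explains nearly everything.

```python
from statistics import mean


def r_squared_of_line(xs, ys):
    """Fit y = slope * x + intercept by least squares and return R^2."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: the target is quadratic in x plus a small wiggle.
xs = list(range(-10, 11))
ys = [x * x + (x % 3 - 1) for x in xs]

# Lowest complexity: a straight line in x.
r2_linear = r_squared_of_line(xs, ys)

# One step up: the same fit, but on the transformed feature x^2.
r2_squared = r_squared_of_line([x * x for x in xs], ys)

print(f"R^2 with x:   {r2_linear:.3f}")
print(f"R^2 with x^2: {r2_squared:.3f}")
```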

Hope it helps!


For this feasibility study, the high-level steps are as follows:

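  1. For each feature, fit a simple supervised model with the rest of the features as train_x and that feature as train_y. (PCA on its own is unsupervised and takes no train_y, so a regression or similar model is the better fit for this check; PCA can still show how much of the variance a few components capture.) If you find a feature that can be predicted from the other features, the data set contains learnable structure and ML can be applied to it.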
  2. Can a human solve it? As a person, can you find patterns for a given feature based on the other features?
  3. Exploratory data analysis with Weka, DataFrame + Matplotlib or similar tools: https://datascienceguide.github.io/exploratory-data-analysis
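
The first step boils down to asking whether any column can be predicted from the others. Below is a minimal sketch of that check in plain Python, assuming a leave-one-out nearest-neighbour predictor as the model (an illustrative choice; any simple supervised model would serve). The columns a, b and c are hypothetical toy data, and features are left unscaled to keep the sketch short; real data would normally be standardised first.

```python
from statistics import mean


def loo_nearest_neighbour_r2(rows, target):
    """Leave-one-out R^2 when column `target` is predicted from all other
    columns by copying the target value of the closest other row."""
    y = [row[target] for row in rows]
    X = [[v for k, v in enumerate(row) if k != target] for row in rows]
    preds = []
    for i, xi in enumerate(X):
        # Index of the nearest other row (squared Euclidean distance).
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((p - q) ** 2 for p, q in zip(xi, X[j])),
        )
        preds.append(y[nearest])
    y_bar = mean(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - y_bar) ** 2 for t in y)
    return 1 - ss_res / ss_tot


# Hypothetical toy table: b closely follows a, c is an unrelated cycle.
names = ["a", "b", "c"]
rows = [(i, 2 * i + i % 3, (7 * i) % 10) for i in range(30)]

# Treat each column in turn as train_y and the remaining ones as train_x.
r2 = {name: loo_nearest_neighbour_r2(rows, k) for k, name in enumerate(names)}

for name, score in r2.items():
    print(f"{name}: leave-one-out R^2 = {score:.2f}")
```

A score close to 1 means the column is largely determined by the others. A score at or below 0 means this predictor does no better than always guessing the column mean.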
