When to split Test and Training data from the full Dataset

Question

joesan

May 6, 2022 06:54

I'm about to put my implementation into a pipeline, and I'm now facing the question of when to actually split the data into training and test sets. These are the steps I currently run (the names are self-explanatory):

DistinctValuesCleanser
OutlierCleanser
FeatureCoRelationAnalyzer
FeatureVarianceThresholdAnalyzer
DataEncoder
SimpleImputer

Some more EDA (Exploratory Data Analysis) steps will probably follow.

So, the question: do I run all of these on the entire dataset and then split, or do I split first and run these steps only on the training set?

Topic dataset

Category Data Science

Gius · Accepted Answer · May 6, 2022 06:54

You should split the dataset into training and test sets first. In a real environment, where your model is deployed, you don't have the test data in advance. The test set exists to check how well the model generalizes to data it has not seen, so nothing learned from it should influence preprocessing or training.

For example, if you run your 'SimpleImputer' step (e.g. filling null values with the mean of each feature) on the full dataset, you compute that mean over the training and test sets together. That is not right: you need to act as if the test set doesn't exist, so you fill null values with the feature means computed on the training samples only, which are the samples you use to train the model. If you use the test set to compute the mean that replaces the null values, the imputed training samples depend on the test set, so you can no longer use it to estimate the generalization error, because you have "already seen" the test data.
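As a minimal sketch of the imputation point above, assuming the 'SimpleImputer' step corresponds to scikit-learn's `SimpleImputer` (the question does not say so explicitly) and using made-up toy data: the mean is learned from the training rows only and then reused to fill gaps in the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy single-feature dataset with missing values (illustrative only).
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan]])
y = np.array([0, 0, 0, 1, 1, 1])

# Split first, before any statistic is computed from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# fit_transform learns the per-feature mean from the training rows only.
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)

# transform reuses the training mean; the test rows never influence it.
X_test_imputed = imputer.transform(X_test)
```

The key design choice is calling `fit_transform` on the training set and only `transform` on the test set, so `imputer.statistics_` holds training-set means.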

The same goes for the 'OutlierCleanser' step: you shouldn't remove outliers from the test set, because in a real environment you will face cases in which outliers appear. Remove them only from the training set, since that is the data you have control over.
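A rough sketch of that idea, assuming an interquartile-range rule for what counts as an outlier (the question does not say how 'OutlierCleanser' works; `iqr_bounds` and the sample values are hypothetical): the bounds are computed from the training values and used to filter the training values only, while the test values are left untouched.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences computed from the quartiles of `values`."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical feature values after the train/test split.
train = [10.0, 11.0, 12.0, 12.5, 13.0, 14.0, 95.0]
test = [11.5, 13.5, 120.0]

# Fences come from the training data only.
low, high = iqr_bounds(train)

# Drop outliers from the training data ...
train_clean = [v for v in train if low <= v <= high]

# ... but keep every test value, including extreme ones, for evaluation.
test_eval = list(test)
```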

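The same reasoning applies to covariance analysis and the other steps: compute anything that depends on the data from the training set only, then apply the result to the test set.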
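One way to enforce this for several steps at once is to wrap them in a pipeline that is fitted on the training set only. The sketch below assumes scikit-learn and stands in `SimpleImputer`, `VarianceThreshold` and `LogisticRegression` for the asker's own steps and model, which are not shown in the question; the data is synthetic.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data: 3 features, the last one constant, some values missing.
rng = np.random.default_rng(0)
y = np.tile([0, 1], 20)
X = rng.normal(size=(40, 3))
X[:, 1] += 3 * y          # make feature 1 informative
X[:, 2] = 1.0             # zero-variance feature
X[::7, 0] = np.nan        # missing values in feature 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("variance", VarianceThreshold(threshold=0.0)),
    ("model", LogisticRegression()),
])

# fit() learns means, variances and model weights from the training set only.
pipe.fit(X_train, y_train)

# score() applies the already-fitted steps to the test set without refitting.
score = pipe.score(X_test, y_test)
```

Because every step is fitted inside `pipe.fit`, the same object can also be passed to cross-validation helpers, which then refit the preprocessing on each training fold.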
