When to split Test and Training data from the full Dataset
I'm about to put my implementation into a pipeline, and I'm unsure when to actually split the data into training and test sets. These are the steps I currently run (the names are self-explanatory):
- DistinctValuesCleanser
- OutlierCleanser
- FeatureCoRelationAnalyzer
- FeatureVarianceThresholdAnalyzer
- DataEncoder
- SimpleImputer
And perhaps some more EDA (Exploratory Data Analysis) steps will follow!
So the question is: do I run all of these steps on the entire dataset and then split, or do I split first and then run these steps only on the training set?
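To make the two orderings concrete, here is a minimal sketch in plain Python. `MeanImputer` and `split` are hypothetical stand-ins I made up for illustration (they are not my actual pipeline classes and not any library's API), and the data is a toy list. It only shows where the fitted statistic comes from in each case.

```python
import random
from statistics import mean


class MeanImputer:
    """Hypothetical stand-in for a fitted step such as an imputer."""

    def fit(self, rows):
        # Learn the fill value only from the rows passed in.
        observed = [x for x in rows if x is not None]
        self.fill_value_ = mean(observed)
        return self

    def transform(self, rows):
        # Replace missing entries with the value learned in fit().
        return [self.fill_value_ if x is None else x for x in rows]


def split(rows, test_fraction=0.25, seed=0):
    """Hypothetical helper: shuffle a copy and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy column with two missing values (illustrative data only).
data = [1.0, 2.0, None, 4.0, 5.0, None, 7.0, 100.0]

# Ordering 1: fit on the entire dataset, then split.
# The fill value is computed from rows that later land in the test set.
imputer_all = MeanImputer().fit(data)
train_a, test_a = split(imputer_all.transform(data))

# Ordering 2: split first, fit on the training rows only,
# then apply the already-fitted step to the test rows.
train_raw, test_raw = split(data)
imputer_train = MeanImputer().fit(train_raw)
train_b = imputer_train.transform(train_raw)
test_b = imputer_train.transform(test_raw)
```

In the second ordering the test rows are still transformed, but only with values learned from the training rows. The same fit/transform distinction would presumably apply to my other steps that learn something from the data.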
Topic dataset
Category Data Science