Does Cross Validation require splitting/shuffling and fitting of data beforehand?

I am trying to evaluate a logistic regression classifier using k-fold cross-validation. I want to know whether I need to shuffle the data beforehand when using cross_val_predict, and whether I need to fit the model beforehand as well:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.pipeline import make_pipeline
from pyts.multivariate.transformation import WEASELMUSE

# THIS DOES A RANDOM SHUFFLE
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

transformer = WEASELMUSE(strategy='uniform', word_size=4, window_sizes=np.arange(5, 70))
logistic = LogisticRegression(solver='liblinear', multi_class='ovr')
clf = make_pipeline(transformer, logistic)

# DO I NEED TO FIT THE DATA?
clf.fit(X_train, y_train)


# DO I PASS IN THE X_test OR THE FULL DATASET x?
p = cross_val_predict(clf, x, y, cv=5)

What if I do not use train_test_split? Would I then need to do the following:

from sklearn.utils import shuffle

transformer = WEASELMUSE(strategy='uniform', word_size=4, window_sizes=np.arange(5, 70))
logistic = LogisticRegression(solver='liblinear', multi_class='ovr')
clf = make_pipeline(transformer, logistic)

# SHUFFLE x AND y HERE, BEFORE CROSS-VALIDATING
x, y = shuffle(x, y, random_state=42)


p = cross_val_predict(clf, x, y, cv=5)

It all depends on whether your data was initially randomized or not.

If the data is organized in a specific order (for example, sorted by class label), you must shuffle it first and then split it into train/test subsets.

Otherwise, your predictions will be misleading, because a learning model needs to see a variety of potential configurations, and the best way to do that is to train and test on randomly drawn data. Of course, training requires more data (usually 70% to 80%) than testing (20% to 30%) in order to ensure that many configurations are learned.
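
As a side note, cross_val_predict clones and fits the estimator inside each fold, so a prior clf.fit call is unnecessary, and it expects the full dataset (x, y) rather than X_test. Below is a minimal sketch of the two shuffling options; the plain LogisticRegression and the synthetic data are assumptions for illustration, standing in for the question's WEASELMUSE pipeline:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.utils import shuffle

# Synthetic stand-in for the question's data (assumption for illustration).
x, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(solver='liblinear')

# Option 1: shuffle the rows once, then let cv=5 split them in order.
x_s, y_s = shuffle(x, y, random_state=42)
p1 = cross_val_predict(clf, x_s, y_s, cv=5)

# Option 2: keep the rows as-is and shuffle inside the splitter instead.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
p2 = cross_val_predict(clf, x, y, cv=cv)

The second option is usually the cleaner choice, since it leaves the original arrays untouched and keeps the randomization inside the cross-validation itself.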
