Does Cross Validation require splitting/shuffling and fitting of data beforehand?

I am trying to evaluate a logistic regression classifier using k-fold cross-validation. I want to know whether I need to shuffle the data beforehand when using cross_val_predict, and whether I need to fit the model beforehand as well:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.pipeline import make_pipeline
from pyts.multivariate.transformation import WEASELMUSE

# THIS DOES A RANDOM SHUFFLE
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

transformer = WEASELMUSE(strategy='uniform', word_size=4, window_sizes=np.arange(5, 70))
logistic = LogisticRegression(solver='liblinear', multi_class='ovr')
clf = make_pipeline(transformer, logistic)

# DO I NEED TO FIT THE DATA?
clf.fit(X_train, y_train)


# DO I PASS IN THE X_test OR THE FULL DATASET x?
p = cross_val_predict(clf, x, y, cv=5)

What if I do not use train_test_split? Would I then need to do the following:

from sklearn.utils import shuffle

transformer = WEASELMUSE(strategy='uniform', word_size=4, window_sizes=np.arange(5, 70))
logistic = LogisticRegression(solver='liblinear', multi_class='ovr')
clf = make_pipeline(transformer, logistic)

# SHUFFLE x AND y HERE, BEFORE CROSS-VALIDATING
x, y = shuffle(x, y, random_state=42)


p = cross_val_predict(clf, x, y, cv=5)

It all depends on whether your data was initially randomized or not.

If the data is organized in a specific order (for example, sorted by class label), you must shuffle it first and then split it into train/test subsets.

Otherwise, your predictions will be misleading, because a learning model needs to see a variety of potential configurations, and the best way to do that is to train and test on randomly drawn data. Of course, training requires more data (usually 70% to 80%) than testing (20% to 30%) in order to ensure that many configurations are learned.
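
As a side note, cross_val_predict clones and fits the estimator inside each fold, so a prior clf.fit call is unnecessary, and it expects the full dataset (x, y) rather than X_test. Below is a minimal sketch of the two shuffling options; the plain LogisticRegression and the synthetic data are assumptions for illustration, standing in for the question's WEASELMUSE pipeline:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.utils import shuffle

# Synthetic stand-in for the question's data (assumption for illustration).
x, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(solver='liblinear')

# Option 1: shuffle the rows once, then let cv=5 split them in order.
x_s, y_s = shuffle(x, y, random_state=42)
p1 = cross_val_predict(clf, x_s, y_s, cv=5)

# Option 2: keep the rows as-is and shuffle inside the splitter instead.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
p2 = cross_val_predict(clf, x, y, cv=cv)

The second option is usually the cleaner choice, since it leaves the original arrays untouched and keeps the randomization inside the cross-validation itself.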
