Why should the data be shuffled for machine learning tasks?

In machine learning tasks it is common to shuffle the data and to normalize it. The purpose of normalization is clear (to put all feature values on the same range). But, after struggling a lot, I did not find any convincing reason for shuffling the data.

I have read this post discussing when we need to shuffle the data, but it is not obvious why we should do so. Furthermore, I have frequently seen that algorithms such as Adam or SGD use mini-batch gradient descent (the data is split into mini-batches and a batch size has to be specified). According to this post, it is vital to shuffle the data before each epoch so that each batch gets different data. So the data is shuffled and, more importantly, the batches change.

Why do we do this?

Topic: deep-learning, neural-network, machine-learning

Category: Data Science


Most ML algorithms are likely to pick up patterns even in the order in which the data is fed to them. To avoid inadvertently learning a pattern that does not exist, it is important to feed the model shuffled data.


Since the SGD algorithm selects instances at random, it is quite possible that it picks some instances many times per epoch while never seeing others; this randomness is also what helps the cost function jump out of local minima on its way toward the global minimum.

If the training instances are shuffled and then traversed in order, the chance of repeatedly selecting the same instances is much lower (see the sketch below).

Source of information: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron).
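
A quick way to see the repeat-selection effect (a toy sketch, assuming NumPy): drawing indices at random with replacement leaves roughly a third of the instances unseen in an epoch, while shuffling and then iterating visits every instance exactly once.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000  # hypothetical number of training instances

# Pure random selection (with replacement): repeats some instances, misses others
picks = rng.integers(0, n, size=n)
print("unique instances:", np.unique(picks).size, "of", n)  # about 632

# Shuffle-then-iterate: each instance is visited exactly once per epoch
order = rng.permutation(n)
print("unique instances:", np.unique(order).size, "of", n)  # exactly 1000
```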


Well, after years, now I finally know why we shuffle the data! The idea is very simple, but somehow we fail to consider it.

For making the cost function, we explicitly assume that the samples are i.i.d. For instance, in binary cross-entropy you can easily see that we have a summation. That summation was originally a product (the joint likelihood), and after taking the logarithm it turned into a sum. In the formulation of that cost function we have discarded the full joint probability, because it is difficult to compute; under the i.i.d. assumption it factorizes, which gives the current cost function. Now suppose our task is learning with different mini-batches, and these mini-batches are not identical: shuffling is precisely what makes each mini-batch look like an i.i.d. draw from the data, so the assumption baked into the cost function approximately holds for every batch.
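
To spell out the step this refers to (a sketch of the standard maximum-likelihood derivation; $\hat{y}_i$ denotes the predicted probability for sample $i$): the i.i.d. assumption factorizes the joint likelihood into a product of Bernoulli terms, and the negative logarithm turns that product into the binary cross-entropy sum.

$$
p(y_1, \ldots, y_N \mid X; W) \overset{\text{i.i.d.}}{=} \prod_{i=1}^{N} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}
\quad\Longrightarrow\quad
-\log p = -\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
$$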




For the best model accuracy, it is always recommended that the training data contain the full variety of the data.

Shuffling the training data helps us achieve this.


As @Josh notes in his answer below: because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average, i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent.

Complementing @Josh's answer, I would like to add that, for the same reason, shuffling needs to be done before batching. Otherwise, you get the same finite set of loss surfaces in every epoch.


We need to shuffle only for mini-batch/SGD training; there is no need for full-batch gradient descent.

If the data is not shuffled, it may be sorted, or similar data points will lie next to each other, which leads to slow convergence:

  • Similar samples produce similar loss surfaces (one surface per sample), so their gradients point in similar directions; this shared direction rarely points at the minimum, and following it can drive the parameters far from the minimum
  • “Best direction”: the average of the gradients over all surfaces (batch gradient descent), which points directly at the minimum
  • “Mini-batch direction”: the average of a variety of directions points closer to the minimum, although none of them points exactly at it
  • “1-sample direction”: points farther from the minimum than the mini-batch direction

I drew a plot of the L2 loss surfaces for linear regression on y = 2x here.
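
To put rough numbers on these directions, here is a toy sketch (assuming NumPy; it reuses the y = 2x linear regression mentioned above, and all names are illustrative) comparing the full-batch, mini-batch, and 1-sample gradients at a single weight value:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 1000)
y = 2 * x + rng.normal(0, 0.5, 1000)

w = 0.0  # current weight, far from the optimum w* ≈ 2

# Gradient of the per-sample squared loss (w*x_i - y_i)^2 with respect to w
per_sample = 2 * (w * x - y) * x

full_batch = per_sample.mean()                              # “best direction”
mini_batch = per_sample[rng.permutation(1000)[:32]].mean()  # one shuffled mini-batch
one_sample = per_sample[0]                                  # a single sample

print(f"full-batch gradient: {full_batch:.2f}")  # the true average direction
print(f"mini-batch gradient: {mini_batch:.2f}")  # noisy, but usually close
print(f"1-sample gradient:   {one_sample:.2f}")  # can be far off in magnitude
```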


Based on What should we do when a question posted on DataScience is a duplicate of a question posted on CrossValidated?, I am reposting my answer to the same question asked on CrossValidated (https://stats.stackexchange.com/a/311318/89653).

Note: throughout this answer I refer to minimization of training loss and I do not discuss stopping criteria such as validation loss. The choice of stopping criteria does not affect the process/concepts described below.

The process of training a neural network is to find the minimum value of a loss function $ℒ_X(W)$, where $W$ represents a matrix (or several matrices) of weights between neurons and $X$ represents the training dataset. I use a subscript for $X$ to indicate that our minimization of $ℒ$ occurs only over the weights $W$ (that is, we are looking for $W$ such that $ℒ$ is minimized) while $X$ is fixed.

Now, if we assume that we have $P$ elements in $W$ (that is, there are $P$ weights in the network), $ℒ$ is a surface in a $P+1$-dimensional space. To give a visual analogue, imagine that we have only two neuron weights ($P=2$). Then $ℒ$ has an easy geometric interpretation: it is a surface in a 3-dimensional space. This arises from the fact that for any given matrices of weights $W$, the loss function can be evaluated on $X$ and that value becomes the elevation of the surface.

But there is the problem of non-convexity; the surface I described will have numerous local minima, and therefore gradient descent algorithms are susceptible to becoming "stuck" in those minima while a deeper/lower/better solution may lie nearby. This is likely to occur if $X$ is unchanged over all training iterations, because the surface is fixed for a given $X$; all its features are static, including its various minima.

A solution to this is mini-batch training combined with shuffling. By shuffling the rows and training on only a subset of them during a given iteration, $X$ changes with every iteration, and it is actually quite possible that no two iterations over the entire sequence of training iterations and epochs will be performed on the exact same $X$. The effect is that the solver can easily "bounce" out of a local minimum. Imagine that the solver is stuck in a local minimum at iteration $i$ with training mini-batch $X_i$. This local minimum corresponds to $ℒ$ evaluated at a particular value of weights; we'll call it $ℒ_{X_i}(W_i)$. On the next iteration the shape of our loss surface actually changes because we are using $X_{i+1}$, that is, $ℒ_{X_{i+1}}(W_i)$ may take on a very different value from $ℒ_{X_i}(W_i)$ and it is quite possible that it does not correspond to a local minimum! We can now compute a gradient update and continue with training. To be clear: the shape of $ℒ_{X_{i+1}}$ will -- in general -- be different from that of $ℒ_{X_{i}}$. Note that here I am referring to the loss function $ℒ$ evaluated on a training set $X$; it is a complete surface defined over all possible values of $W$, rather than the evaluation of that loss (which is just a scalar) for a specific value of $W$. Note also that if mini-batches are used without shuffling there is still a degree of "diversification" of loss surfaces, but there will be a finite (and relatively small) number of unique error surfaces seen by the solver (specifically, it will see the same exact set of mini-batches -- and therefore loss surfaces -- during each epoch).

One thing I deliberately avoided was a discussion of mini-batch sizes, because there are a million opinions on this and it has significant practical implications (greater parallelization can be achieved with larger batches). However, I believe the following is worth mentioning. Because $ℒ$ is evaluated by computing a value for each row of $X$ (and summing or taking the average; i.e., a commutative operator) for a given set of weight matrices $W$, the arrangement of the rows of $X$ has no effect when using full-batch gradient descent (that is, when each batch is the full $X$, and iterations and epochs are the same thing).
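
To illustrate this numerically (a toy sketch, assuming NumPy; the one-weight linear model and all names are made up for illustration): the loss evaluated at one fixed weight value differs from mini-batch to mini-batch, i.e., the solver really does see a different surface $ℒ_{X_i}$ at each step.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=1000)
y = 3 * X + rng.normal(0, 1, 1000)  # data for a hypothetical one-weight model

w = 2.5  # one fixed point W in weight space

def batch_loss(idx, w):
    """Mean squared loss of the model y ≈ w*x on the mini-batch `idx`."""
    return np.mean((w * X[idx] - y[idx]) ** 2)

batches = rng.permutation(1000).reshape(10, 100)  # ten shuffled mini-batches

# The same W sits at a different height on each mini-batch's loss surface
print([round(batch_loss(b, w), 2) for b in batches])
```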


Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.

The obvious case where you'd shuffle your data is if your data is sorted by their class/target. Here, you will want to shuffle to make sure that your training/test/validation sets are representative of the overall distribution of the data.

For mini-batch gradient descent, the same logic applies. The idea is that by calculating the gradient on a single batch, you will usually get a fairly good estimate of the "true" gradient. That way, you save computation time by not having to calculate the "true" gradient over the entire dataset every time.

You want to shuffle your data after each epoch because you will always have the risk to create batches that are not representative of the overall dataset, and therefore, your estimate of the gradient will be off. Shuffling your data after each epoch ensures that you will not be "stuck" with too many bad batches.

In regular stochastic gradient descent, when each batch has size 1, you still want to shuffle your data after each epoch to keep your learning general. Indeed, if data point 17 is always used after data point 16, its own gradient will be biased with whatever updates data point 16 is making on the model. By shuffling your data, you ensure that each data point creates an "independent" change on the model, without being biased by the same points before them.
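
In code, per-epoch shuffling usually amounts to drawing a fresh permutation of the indices at the top of every epoch. A minimal sketch (assuming NumPy; the one-weight linear regression and the hyperparameters are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=500)
y = 2 * X + rng.normal(0, 0.1, 500)
w, lr, batch_size = 0.0, 0.1, 32

for epoch in range(10):
    order = rng.permutation(len(X))          # fresh shuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grad = np.mean(2 * (w * X[idx] - y[idx]) * X[idx])
        w -= lr * grad                       # mini-batch gradient step

print(w)  # converges near the true slope of 2
```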


Suppose the data is sorted in a specific order, for example a dataset sorted by class. If you select data for training, validation, and testing without taking this into account, each subset will end up with different classes, and the process will fail.

Hence, to prevent this kind of problem, the simple solution is to shuffle the data so that you get representative training, validation, and test sets, as in the sketch below.
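
For example, with scikit-learn (a sketch; the tiny class-sorted dataset is made up for illustration), train_test_split shuffles before splitting by default, and stratify keeps the class proportions in every split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset sorted by class: all class-0 rows first, then all class-1 rows
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0
)
print(y_train, y_test)  # both splits now contain both classes
```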

Regarding mini-batches, the answers to this post may resolve your question.
