Hi Ashish!

Data is sometimes sorted, e.g. in chronological order. If you take the gradient of a minibatch whose samples are highly correlated, you get a biased estimate of the gradient rather than the true gradient. By shuffling the data you decorrelate the minibatches, so each gradient estimate is more accurate and the updates move more reliably toward the true minimum.
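Here is a minimal sketch of that effect on toy linear-regression data (all names and numbers below are illustrative, not from any particular library): when the data is sorted by the input, the first minibatch sees only one corner of the distribution and its gradient is far from the full-dataset gradient, while a shuffled minibatch lands much closer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, deliberately sorted by x to mimic chronologically ordered data.
n = 1000
x = np.sort(rng.uniform(-3, 3, size=n))
y = 2.0 * x + rng.normal(scale=0.5, size=n)

w = 0.0  # single-weight model: y_hat = w * x

def grad(w, xb, yb):
    """Gradient of mean squared error 0.5 * (w*x - y)^2 with respect to w."""
    return np.mean((w * xb - yb) * xb)

full_grad = grad(w, x, y)  # "true" gradient over the whole dataset

batch = 32
# First minibatch of the sorted data: all samples come from the left tail.
sorted_grad = grad(w, x[:batch], y[:batch])

# A shuffled minibatch: samples drawn from the whole distribution.
idx = rng.permutation(n)[:batch]
shuffled_grad = grad(w, x[idx], y[idx])

print("full:", full_grad)
print("sorted minibatch error:  ", abs(sorted_grad - full_grad))
print("shuffled minibatch error:", abs(shuffled_grad - full_grad))
```

The sorted minibatch's gradient error is an order of magnitude larger than the shuffled one's, which is exactly the bias that shuffling removes.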

Your question actually made me look up the Deep Learning book to see if there are any other reasons. I've posted an excerpt here. Hope this helps.

  • Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
  • Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process.
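The "less than linear returns" point in the first bullet can be seen numerically: the standard error of a minibatch gradient estimate falls like 1/sqrt(batch size), so quadrupling the batch only halves the noise. A quick simulation sketch (the modeled noise scale and batch sizes here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Model per-sample gradients as noisy draws around a true gradient of 1.0.
true_grad = 1.0
samples = true_grad + rng.normal(scale=1.0, size=10_000)

def estimate_std(batch_size, trials=2000):
    """Std of minibatch-mean gradient estimates at a given batch size."""
    idx = rng.integers(0, len(samples), size=(trials, batch_size))
    return samples[idx].mean(axis=1).std()

for b in (16, 64, 256):
    # Each 4x increase in batch size shrinks the noise by only about 2x.
    print(b, estimate_std(b))
```

So doubling gradient accuracy costs four times the compute per step, which is why very large batches stop paying off.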

I’m an Engineering Manager at Scale AI and this is my notepad for Applied Math / CS / Deep Learning topics. Follow me on Twitter for more!
