If you look at the softmax function, it divides each element by the sum of the whole vector, so every output element depends on every input element. That's why the gradient cannot have the same shape as the input: you need to take the derivative of every output with respect to every input, so a single item (x_i) "expands" into a full row of partials. Read about the Jacobian matrix and you'll get a better idea.
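A minimal NumPy sketch of this (the function names `softmax` and `softmax_jacobian` are my own): for a length-n input, the derivative is an n×n Jacobian with entries s_i * (δ_ij − s_j), not a length-n vector.

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(x):
    # d softmax_i / d x_j = s_i * (delta_ij - s_j)
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 3.0])
J = softmax_jacobian(x)
print(J.shape)  # (3, 3): a length-3 input expands into a 3x3 Jacobian
```

Each column of the Jacobian sums to zero, because the softmax outputs always sum to 1, so perturbing any single input cannot change that total.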

The dimensions of the input and its derivatives don't have to match to perform backprop (it's basically the chain rule). Think about the start of backprop: you start with a scalar loss, but your gradients are vectors and matrices.
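A small NumPy sketch of that point (the toy loss L = ||W x||² is my own example): the loss is a single scalar, yet its gradient with respect to the weight matrix W has W's shape.

```python
import numpy as np

W = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([0.5, -1.0])

# forward pass: vector output, scalar loss
y = W @ x
L = float(y @ y)

# backward pass (chain rule):
# dL/dy = 2y is a vector; dL/dW = (dL/dy) x^T is a matrix
dL_dy = 2 * y
dL_dW = np.outer(dL_dy, x)
print(dL_dW.shape)  # (2, 2): scalar loss, matrix-shaped gradient
```

The shapes differ at every step (scalar, vector, matrix), but the chain rule composes them consistently.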

I’m an Engineering Manager at Scale AI and this is my notepad for Applied Math / CS / Deep Learning topics.
