If you look at the softmax function, it divides each exponentiated element by the sum of the exponentials over the whole vector, so every output depends on every input. That's why the gradient cannot have the same shape as the input. You need to take the derivative of each output w.r.t. every input element, so a single item (x_i) expands into a whole row of partial derivatives. Read about the Jacobian matrix and you will get a better idea.
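
To make the shape argument concrete, here is a minimal NumPy sketch. It assumes a 1-D input vector, and the helper names `softmax` and `softmax_jacobian` are mine, not from any particular library. It uses the standard closed form J = diag(s) - s sᵀ, where s is the softmax output.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def softmax_jacobian(x):
    # Hypothetical helper: J[i, j] = d softmax(x)[i] / d x[j]
    #                              = s[i] * (delta_ij - s[j])
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 3.0])   # arbitrary example input
J = softmax_jacobian(x)
print(x.shape, J.shape)          # the input is a vector, the Jacobian is a matrix
```

For an input of length n the Jacobian is n x n: one row per output, one column per input.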

The dimensions of the input and the derivatives don't have to be the same to perform backprop (which is basically the chain rule). Think about the start of backprop: you start with a scalar loss, but your gradients are vectors/matrices.
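
As a rough illustration of how the shapes work out in the chain rule, the sketch below assumes a cross-entropy loss on top of softmax. The loss choice and the `target` index are my assumptions for the example, not something stated above. The scalar loss gives a vector gradient w.r.t. the softmax output, and multiplying by the (n x n) Jacobian brings it back to the shape of the input.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])   # arbitrary example input
target = 0                       # assumed class index for the example

s = softmax(x)
loss = -np.log(s[target])        # scalar: where backprop starts

# Gradient of the scalar loss w.r.t. the softmax output: a vector, shape (n,)
dL_ds = np.zeros_like(s)
dL_ds[target] = -1.0 / s[target]

# Softmax Jacobian, shape (n, n): J[i, j] = d s[i] / d x[j]
J = np.diag(s) - np.outer(s, s)

# Chain rule: (n, n) transposed times (n,) gives (n,), same shape as x
dL_dx = J.T @ dL_ds

print(np.shape(loss), dL_ds.shape, J.shape, dL_dx.shape)
```

For this particular loss the product collapses to the familiar `s - one_hot(target)`, which is why many implementations never build the full Jacobian.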

I’m an Engineering Manager at Scale AI and this is my notepad for Applied Math / CS / Deep Learning topics. Follow me on Twitter for more!
