mygrad.Tensor.backward

Tensor.backward(grad: Optional[ArrayLike] = None)

Trigger backpropagation and compute the derivatives of this tensor.

Designating this tensor as ℒ, compute dℒ/dx for each (non-constant) tensor x that preceded ℒ in its computational graph, and store each derivative in x.grad.

Once back-propagation is finished, the present tensor is removed from all computational graphs, and the preceding graph is cleared.

If ℒ is a non-scalar tensor (i.e. ℒ.ndim is greater than 0), then calling ℒ.backward() will behave as if ℒ was first reduced to a scalar via summation. I.e. it will behave identically to ℒ.sum().backward(); this ensures that each element of any dℒ/dx will represent a derivative of a scalar function.

Parameters
grad : Optional[ArrayLike] (must be broadcast-compatible with self)

By default, the present tensor is treated as the terminus of the computational graph (ℒ). Otherwise, one can specify a “downstream” derivative, representing dℒ/d(self). This can be used to effectively connect otherwise separate computational graphs (see the final example below).

Examples

>>> import mygrad as mg
>>> x = mg.tensor(2)
>>> y = mg.tensor(3)
>>> w = x * y
>>> ℒ = 2 * w
>>> ℒ.backward()  # computes dℒ/dℒ, dℒ/dw, dℒ/dy, and dℒ/dx
>>> ℒ.grad  # dℒ/dℒ == 1 by identity
array(1.)
>>> w.grad  # dℒ/dw
array(2.)
>>> y.grad  # dℒ/dy = dℒ/dw * dw/dy
array(4.)
>>> x.grad  # dℒ/dx = dℒ/dw * dw/dx
array(6.)
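Constant tensors are skipped during backpropagation, as noted above: their grad attribute is left as None. A minimal sketch, using mygrad's constant=True flag to mark a tensor as a constant:

>>> c = mg.tensor(3.0, constant=True)
>>> v = mg.tensor(2.0)
>>> (c * v).backward()
>>> v.grad  # dℒ/dv = c
array(3.)
>>> c.grad is None  # constants receive no derivative
True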

Calling ℒ.backward() on a non-scalar tensor is equivalent to first summing that tensor.

>>> tensor = mg.tensor([2.0, 4.0, 8.0])
>>> ℒ = tensor * tensor[::-1]  # [x0*x2, x1*x1, x2*x0]
>>> ℒ.backward()  # behaves like ℒ = x0*x2 + x1*x1 + x2*x0
>>> tensor.grad
array([16.,  8.,  4.])
>>> tensor = mg.tensor([2.0, 4.0, 8.0])
>>> ℒ = tensor * tensor[::-1]
>>> ℒ.sum().backward()
>>> tensor.grad
array([16.,  8.,  4.])
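Once backpropagation finishes, the graph is cleared, so a subsequent computation involving the same tensor starts a fresh graph. The following sketch assumes mygrad's behavior of nulling a tensor's gradient when it is involved in a new operation:

>>> x = mg.tensor(2.0)
>>> (3 * x).backward()
>>> x.grad
array(3.)
>>> (4 * x).backward()  # the old graph was cleared; only the new graph contributes
>>> x.grad
array(4.)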

Specifying a value for grad

>>> x = mg.tensor(1.)
>>> x.backward(2.)
>>> x.grad  # Would normally be dℒ/dℒ == 1
array(2.)
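A minimal sketch of using grad to connect two otherwise-separate graphs: a second graph is started from the first graph's output value, and its gradient is fed back into the first graph. The intermediate names y, z, and f are illustrative, not part of the API.

>>> x = mg.tensor(3.0)
>>> y = 2 * x              # graph 1: y = 2x
>>> z = mg.tensor(y.data)  # graph 2 begins from y's value
>>> f = z ** 2
>>> f.backward()           # dℒ/dz = 2z
>>> z.grad
array(12.)
>>> y.backward(grad=z.grad)  # feed dℒ/dz into graph 1 as dℒ/dy
>>> x.grad                 # dℒ/dx = dℒ/dy * dy/dx = 12 * 2
array(24.)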