# (1) Get Gradients of the Intermediate Variables

To save memory, PyTorch does not keep the gradients of intermediate variables; it only stores the gradients of leaf tensors. So after calling the backward function we cannot read the gradient of an intermediate node in the computation graph directly. The torch.autograd module provides a way to get the values of these intermediate gradients.

import torch

A = torch.tensor([1,2,3])                        # leaf tensor, does not require gradients
B = torch.tensor([2,3,4.], requires_grad=True)   # leaf tensor, requires gradients
C = A * B                                        # intermediate (non-leaf) tensor
D = torch.sum(C)

D.backward()
print(C.grad)   # gradients of non-leaf tensors are not retained
"""output
None
"""

With the torch.autograd module:

import torch
import torch.autograd as autograd

A = torch.tensor([1,2,3])
B = torch.tensor([2,3,4.], requires_grad=True)
C = A * B 
D = torch.sum(C)
C_grad = autograd.grad(D, C, retain_graph=True)   # returns dD/dC as a tuple
print(C_grad)
"""output
(tensor([1., 1., 1.]),)
"""

# (2) Cannot Modify Variables Needed for Gradient Computation by an Inplace Operation

import torch
import torch.autograd as autograd

A = torch.tensor([1,2,3])
B = torch.tensor([2,3,4.], requires_grad=True)
C = A * B + B    # autograd saves A in order to compute the gradient of B
D = torch.sum(C)
A[0] = .1        # in-place modification of A after it has been used in the graph
D.backward()
"""exception
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.LongTensor [3]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
"""

The code above throws an exception because tensor A was modified in place after it had been used to compute C but before the backward pass. In the following code, A is modified before it is used, so the code runs without error, yet it produces an unintended result.

import torch
import torch.autograd as autograd

A = torch.tensor([1,2,3])
A[0] = .1   # A is an integer (Long) tensor, so 0.1 is truncated and stored as 0

B = torch.tensor([2,3,4.], requires_grad=True)

C = A * B + B
D = torch.sum(C)
D.backward()
print(B.grad)
"""output
tensor([1., 3., 4.])
"""

tensor([1., 3., 4.]) is not the result you would expect from the original A. Since C = A * B + B, the gradient of B is A + 1, which should be tensor([2., 3., 4.]) for A = [1, 2, 3]. But A was modified before it entered the graph, and because A is an integer tensor the assignment A[0] = .1 actually stores 0, so backward silently computes the gradient with the modified A: [0, 2, 3] + 1 = [1., 3., 4.]. In-place changes to A therefore change B's gradient without any warning, which easily leads to wrong results.
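
One way to avoid the exception in the first snippet is to build the graph from a copy of A (for example with A.clone()), so the tensor saved for the backward pass is the copy and a later in-place modification of A no longer invalidates the graph. A minimal sketch:

import torch

A = torch.tensor([1,2,3])
B = torch.tensor([2,3,4.], requires_grad=True)
C = A.clone() * B + B   # the clone, not A itself, is saved for backward
D = torch.sum(C)
A[0] = 0                # modifying A no longer touches a saved tensor
D.backward()
print(B.grad)
"""output
tensor([2., 3., 4.])
"""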

# (3) Need to Retain the Compute Graph to Call the Backward Method a Second Time

To reduce memory usage, during a .backward() call with no arguments all the intermediate results are freed as soon as they are no longer needed. Hence, if you call .backward() again, those intermediate results no longer exist, so the backward pass cannot be performed and the program throws an exception, as in the following code.

import torch
import torch.autograd as autograd

A = torch.tensor([1,2,3])
A[0] = .1

B = torch.tensor([2,3,4.], requires_grad=True)
# first backward
C = A * B + B
D = torch.sum(C)
D.backward()
print(B.grad)
# second backward
B.grad = None
C.grad = None
D.backward()
print(B.grad)
"""output
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
"""

The compute graph for the above example is quite simple: C = A * B + B, D = torch.sum(C). However, after the .backward() call the graph is not retained, so a second backward pass cannot proceed without the intermediate values. To perform the backward pass a second time there are two options: one is to call .backward() with retain_graph=True, which may consume more memory; the other is to build the compute graph again before calling .backward(). See the code below.

import torch
import torch.autograd as autograd

A = torch.tensor([1,2,3])
A[0] = .1

B = torch.tensor([2,3,4.], requires_grad=True)

C = A * B + B
D = torch.sum(C)

# first option: keep the graph with retain_graph=True
D.backward(retain_graph=True)
print(B.grad)

# second option: rebuild the compute graph before the second backward
C = A * B + B
D = torch.sum(C)
B.grad = None
C.grad = None
D.backward()
print(B.grad)
"""output
tensor([1., 3., 4.])
tensor([1., 3., 4.])
"""
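
As an aside, B.grad is reset to None between the two backward passes because PyTorch accumulates gradients into .grad instead of overwriting them. A minimal sketch of that accumulation:

import torch

B = torch.tensor([2,3,4.], requires_grad=True)
D = torch.sum(B * B)
D.backward(retain_graph=True)
print(B.grad)   # first backward: 2 * B
D.backward()
print(B.grad)   # second backward accumulates another 2 * B
"""output
tensor([4., 6., 8.])
tensor([ 8., 12., 16.])
"""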

# (4) Set requires_grad = False

Conceptually, autograd keeps a record of tensors and all executed operations in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors and roots are the output tensors. By tracing this DAG from roots to leaves, the gradients can be calculated automatically with the chain rule.

autograd automatically tracks computations on every tensor whose requires_grad flag is set to True. For tensors that don't need gradients, setting this attribute to False excludes them from gradient computation in the DAG. The output tensor of an operation requires gradients only if at least one of its input tensors has requires_grad set to True.
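
A minimal sketch of this propagation rule:

import torch

x = torch.rand(3)                       # requires_grad defaults to False
w = torch.rand(3, requires_grad=True)

print((x * x).requires_grad)   # False: no input requires gradients
print((x * w).requires_grad)   # True: at least one input requires gradients
"""output
False
True
"""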

Parameters that don't need to compute gradients are usually called frozen parameters. It is useful to freeze part of your model if you know in advance that those parameters should not be updated. Another common use case where exclusion from the DAG matters is finetuning a pretrained network.

In finetuning, we freeze most of the parameters and typically only modify the classifier layers to make predictions on new labels. The following small example demonstrates this.

First, we load a pretrained MobileNetV2 model and freeze all of its parameters. Assume we are going to finetune this model on a new dataset with 2 classes. In MobileNetV2 the classifier is the last linear layer, model.classifier[1], so we simply replace it with a new linear layer that serves as our classifier. Now all parameters in the model except those of the new classifier are frozen, and only the new linear layer needs to compute gradients. As the output below shows, although we register all of the model's parameters in the optimizer, the only parameters actually updated by gradient descent are the weights and bias of the classifier.

from torch import nn, optim
from torchvision import models
import torch

model = models.mobilenet_v2(pretrained=True)
# freeze all parameters of the pretrained network
for param in model.parameters():
    param.requires_grad = False

# replace the classifier with a new 2-class linear layer (requires_grad=True by default)
model.classifier[1] = nn.Linear(in_features=1280, out_features=2, bias=True)
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
x = torch.rand(4, 3, 224, 224)
y_p = model(x)
y_t = torch.randint(2, (4,))
print("before update")
print(model.features[1].state_dict()['conv.0.0.weight'][0])
print(model.classifier[1].state_dict()['weight'][0][:10])
loss = nn.functional.cross_entropy(y_p, y_t)
loss.backward()
optimizer.step()
print("\nafter update")
print(model.features[1].state_dict()['conv.0.0.weight'][0])
print(model.classifier[1].state_dict()['weight'][0][:10])

"""output
before update
tensor([[[-0.0091, -0.0109, -0.0089],
         [-0.0183,  0.0038,  0.1027],
         [-0.0102, -0.0084,  0.0075]]])
tensor([-0.0234,  0.0231,  0.0003, -0.0230,  0.0057,  0.0165,  0.0163,  0.0257,
        -0.0085, -0.0010])

after update
tensor([[[-0.0091, -0.0109, -0.0089],
         [-0.0183,  0.0038,  0.1027],
         [-0.0102, -0.0084,  0.0075]]])
tensor([-0.0240,  0.0206, -0.0001, -0.0231,  0.0028,  0.0139,  0.0140,  0.0230,
        -0.0121, -0.0015])
"""

Another way to disable gradient computation is torch.no_grad. This API is a context manager that disables gradient calculation, which is helpful for model inference when you are sure you will not call .backward(). In this mode, the result of every computation has requires_grad=False even if the inputs have requires_grad=True, and skipping gradient tracking reduces the memory cost of the computation.

import torch

A = torch.tensor([1,2,3])
B = torch.tensor([2,3,4.], requires_grad=True)

C = A * B + B
D = torch.sum(C)

print(f"before set no_grad: ", D.requires_grad)

# model.eval()
with torch.no_grad():
    C = A * B + B
    D = torch.sum(C)
    print(f"after set no_grad: ", D.requires_grad)

"""output
before set no_grad:  True
after set no_grad:  False
"""
