Fastai Layerwise Sequential Unit Variance (LSUV)

This Post is based on the Notebok by the Fastai Course Part2
=> With LSUV the computer will figure out how to init our net proberly, and will initalize our network with unit Variance

#collapse
%load_ext autoreload
%autoreload 2

%matplotlib inline

#collapse
from exp.nb_07 import *

Getting the MNIST data and a CNN

Jump_to lesson 11 video

#collapse
x_train,y_train,x_valid,y_valid = get_data()

x_train,x_valid = normalize_to(x_train,x_valid)
train_ds,valid_ds = Dataset(x_train, y_train),Dataset(x_valid, y_valid)

nh,bs = 50,512
c = y_train.max().item()+1
loss_func = F.cross_entropy

data = DataBunch(*get_dls(train_ds, valid_ds, bs), c)

#collapse_show
mnist_view = view_tfm(1,28,28)
cbfs = [Recorder,
        partial(AvgStatsCallback,accuracy),
        CudaCallback,
        partial(BatchTransformXCallback, mnist_view)]

#collapse
nfs = [8,16,32,64,64]

#collapse_show
class ConvLayer(nn.Module):
    def __init__(self, ni, nf, ks=3, stride=2, sub=0., **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=True)
        self.relu = GeneralRelu(sub=sub, **kwargs)
    
    def forward(self, x): return self.relu(self.conv(x))
    
    @property
    def bias(self): return -self.relu.sub
    @bias.setter
    def bias(self,v): self.relu.sub = -v
    @property
    def weight(self): return self.conv.weight

#collapse
learn,run = get_learn_run(nfs, data, 0.6, ConvLayer, cbs=cbfs)

Now we're going to look at the paper All You Need is a Good Init, which introduces Layer-wise Sequential Unit-Variance (LSUV). We initialize our neural net with the usual technique, then we pass a batch through the model and check the outputs of the linear and convolutional layers. We can then rescale the weights according to the actual variance we observe on the activations, and subtract the mean we observe from the initial bias. That way we will have activations that stay normalized.

We repeat this process until we are satisfied with the mean/variance we observe.

Let's start by looking at a baseline:

#collapse_show
run.fit(2, learn)

train: [2.31547421875, tensor(0.1591, device='cuda:0')]
valid: [2.2347482421875, tensor(0.2266, device='cuda:0')]
train: [1.050069921875, tensor(0.6582, device='cuda:0')]
valid: [0.2111361328125, tensor(0.9348, device='cuda:0')]

Now we recreate our model and we'll try again with LSUV. Hopefully, we'll get better results!

#collapse_show
learn,run = get_learn_run(nfs, data, 0.6, ConvLayer, cbs=cbfs)

Helper function to get one batch of a given dataloader, with the callbacks called to preprocess it.

#collapse_show
def get_batch(dl, run):
    run.xb,run.yb = next(iter(dl))
    for cb in run.cbs: cb.set_runner(run)
    run('begin_batch')
    return run.xb,run.yb

#collapse
xb,yb = get_batch(data.train_dl, run)

We only want the outputs of convolutional or linear layers. To find them, we need a recursive function. We can use sum(list, []) to concatenate the lists the function finds (sum applies the + operate between the elements of the list you pass it, beginning with the initial state in the second argument).

#collapse_show
def find_modules(m, cond):
    if cond(m): return [m]
    a = sum([find_modules(o,cond) for o in m.children()],[]) 
    print(a)
    return a

def is_lin_layer(l):
    lin_layers = (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear, nn.ReLU)
    return isinstance(l, lin_layers)

#collapse_show
mods = find_modules(learn.model, lambda o: isinstance(o,ConvLayer))

[]
[]
[]
[ConvLayer(
  (conv): Conv2d(1, 8, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
  (relu): GeneralRelu()
), ConvLayer(
  (conv): Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (relu): GeneralRelu()
), ConvLayer(
  (conv): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (relu): GeneralRelu()
), ConvLayer(
  (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (relu): GeneralRelu()
), ConvLayer(
  (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
  (relu): GeneralRelu()
)]

#collapse_show
mods

[ConvLayer(
   (conv): Conv2d(1, 8, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))
   (relu): GeneralRelu()
 ),
 ConvLayer(
   (conv): Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ),
 ConvLayer(
   (conv): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ),
 ConvLayer(
   (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 ),
 ConvLayer(
   (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
   (relu): GeneralRelu()
 )]

This is a helper function to grab the mean and std of the output of a hooked layer.

#collapse_show
def append_stat(hook, mod, inp, outp):
    d = outp.data
    hook.mean,hook.std = d.mean().item(),d.std().item()

#collapse_show
mdl = learn.model.cuda()

So now we can look at the mean and std of the conv layers of our model.

#collapse_show
with Hooks(mods, append_stat) as hooks:
    mdl(xb)
    for hook in hooks: print(hook.mean,hook.std)

0.4263218641281128 1.0437159538269043
0.4034019708633423 0.8199178576469421
0.3987013101577759 0.6961370706558228
0.3476980924606323 0.5620085000991821
0.24456343054771423 0.3626357316970825

We first adjust the bias terms to make the means 0, then we adjust the standard deviations to make the stds 1 (with a threshold of 1e-3). The mdl(xb) is not None clause is just there to pass xb through mdl and compute all the activations so that the hooks get updated.

#collapse_show
def lsuv_module(m, xb):
    h = Hook(m, append_stat)

    while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean
    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std

    h.remove()
    return h.mean,h.std

We execute that initialization on all the conv layers in order:

#collapse_show
for m in mods: print(lsuv_module(m, xb))

(-0.017856407910585403, 0.9999998807907104)
(0.16777339577674866, 0.9999999403953552)
(0.14251837134361267, 1.0)
(0.152313232421875, 0.9999999403953552)
(0.2786397337913513, 1.0000001192092896)

Note that the mean doesn't exactly stay at 0. since we change the standard deviation after by scaling the weight.

Then training is beginning on better grounds.

#collapse_show
%time run.fit(2, learn)

train: [0.5833533203125, tensor(0.8165, device='cuda:0')]
valid: [0.14450025634765626, tensor(0.9564, device='cuda:0')]
train: [0.111139541015625, tensor(0.9656, device='cuda:0')]
valid: [0.09887520751953124, tensor(0.9700, device='cuda:0')]
CPU times: user 2.33 s, sys: 24 ms, total: 2.35 s
Wall time: 1.96 s

LSUV is particularly useful for more complex and deeper architectures that are hard to initialize to get unit variance at the last layer.