Unlike MAML, in Meta-SGD, along with finding the optimal parameter value, $\theta$, we also find the optimal learning rate, $\alpha$, and the update direction.
The learning rate is implicitly implemented in the adaptation term. So, in Meta-SGD, we don't initialize the learning rate with a small scalar value. Instead, we initialize $\alpha$ with random values of the same shape as $\theta$ and learn it along with $\theta$.
The update equation involving the learning rate can be expressed as $\theta' = \theta - \alpha \circ \nabla_{\theta} \mathcal{L}_{T_i}(f_{\theta})$, where $\circ$ denotes element-wise multiplication.
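As a minimal sketch of this per-parameter update, assuming hypothetical names such as `theta`, `alpha`, and `grad` (NumPy is used here purely for illustration), the following shows how $\alpha$ is initialized with the same shape as $\theta$ and applied element-wise:

```python
import numpy as np

rng = np.random.default_rng(0)

# theta: model parameters (a flat vector for illustration)
theta = rng.normal(size=10)

# alpha: per-parameter learning rates, initialized randomly with the
# same shape as theta and learned jointly with theta (an assumption of
# this sketch: real initialization ranges vary by implementation)
alpha = rng.uniform(low=0.001, high=0.1, size=theta.shape)

def adapt(theta, alpha, grad):
    """Meta-SGD inner update: element-wise learning rate times gradient."""
    return theta - alpha * grad

# grad would come from the task loss; random here for illustration only
grad = rng.normal(size=theta.shape)
theta_prime = adapt(theta, alpha, grad)
```

Because `alpha` is a full array rather than a scalar, each parameter effectively has its own learned step size, and its sign can even flip the update direction for that parameter.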
We sample $n$ tasks, run SGD for a few iterations on each of the sampled tasks, and then update our model parameters in a direction that is common to all the tasks, as shown in the sketch below.
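Below is an illustrative sketch of that loop, assuming a hypothetical `sample_tasks` helper and approximating the "common direction" as the average adaptation direction across tasks; it is a sketch under those assumptions, not a definitive implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_tasks(n):
    # Placeholder: each task is represented by a gradient function.
    # A real task would compute the gradient of its own loss at theta.
    return [lambda th: rng.normal(size=th.shape) for _ in range(n)]

def meta_train_step(theta, alpha, n_tasks=5, inner_steps=3, meta_lr=0.01):
    """Sample n tasks, adapt on each for a few SGD steps, then move
    theta in the direction common to all the adapted parameters."""
    adapted = []
    for task_grad in sample_tasks(n_tasks):
        th = theta.copy()
        for _ in range(inner_steps):
            th = th - alpha * task_grad(th)   # inner-loop adaptation
        adapted.append(th)
    # Common direction: average of (adapted - theta) across tasks
    common = np.mean([th - theta for th in adapted], axis=0)
    return theta + meta_lr * common

theta = rng.normal(size=10)
alpha = rng.uniform(low=0.001, high=0.1, size=theta.shape)
theta = meta_train_step(theta, alpha)
```

In the full algorithm, $\theta$ and $\alpha$ are both updated from the meta-gradient of the post-adaptation loss; the averaged direction above stands in for that step to keep the sketch short.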