
General


MAML (Model-Agnostic Meta-Learning) is a meta-learning algorithm that learns the model’s parameters \theta such that a small number of gradient updates leads to fast learning on a new task.

Properties

  • MAML does not expand the number of learned parameters
  • No constraints on the model architecture
  • Can be applied to any architecture trained with gradient descent, such as MLPs, CNNs, and RNNs

Problem Setup

Single Task


Model: f_{\theta}(\mathbf{y}|\mathbf{x})
Dataset: \mathcal{D} = \left\{ (\mathbf{x}, \mathbf{y}) \right\}_i
Goal: \min_{\theta} \mathcal{L}(\theta, \mathcal{D})
Task: \mathcal{T} \equiv \left\{ p_i(\mathbf{x}), p_i(\mathbf{y}| \mathbf{x}), \mathcal{L}_i\right\}
p_i denotes the data-generating distributions of the task; this formulation covers supervised learning (e.g. regression and classification) as well as reinforcement learning.
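As an illustration, here is a minimal single-task setup in PyTorch, using sine-regression tasks like those in the paper's experiments (the amplitude, phase, and network sizes below are illustrative choices):

```python
# A minimal single-task setup (a sketch): sine-regression tasks; amplitude,
# phase, and network sizes are illustrative.
import torch

def sample_task_data(amplitude=1.0, phase=0.0, n=10):
    x = torch.empty(n, 1).uniform_(-5.0, 5.0)   # x ~ p_i(x)
    y = amplitude * torch.sin(x + phase)        # y ~ p_i(y | x)
    return x, y

# f_theta(y | x): a small MLP regressor
model = torch.nn.Sequential(
    torch.nn.Linear(1, 40), torch.nn.ReLU(), torch.nn.Linear(40, 1))
loss_fn = torch.nn.MSELoss()                    # L_i

x, y = sample_task_data()
loss = loss_fn(model(x), y)                     # L(theta, D)
```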

Multi Task


Model: f_{\theta}(\mathbf{y}|\mathbf{x}, z_i)
where z_i is the task index.
Dataset: \mathcal{D}_i = \left\{ (\mathbf{x}, \mathbf{y}) \right\}_i for each task i
Goal: \min_{\theta}\sum_i \mathcal{L}_i(\theta, \mathcal{D}_i)
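For contrast, a joint multi-task baseline can be sketched as follows: the task index z_i enters through a (hypothetical) task embedding, and the summed loss over tasks is minimized directly, with no notion of fast adaptation:

```python
# A joint multi-task baseline (not yet MAML): the task index z_i enters via a
# task embedding, and the summed loss over tasks is minimised directly.
# All names and sizes here are illustrative.
import torch

num_tasks, emb_dim = 5, 8
task_emb = torch.nn.Embedding(num_tasks, emb_dim)          # encodes z_i
net = torch.nn.Sequential(torch.nn.Linear(1 + emb_dim, 40),
                          torch.nn.ReLU(), torch.nn.Linear(40, 1))
loss_fn = torch.nn.MSELoss()

def multi_task_loss(datasets):
    """datasets: list of (x_i, y_i) tensors, one pair per task."""
    total = 0.0
    for i, (x, y) in enumerate(datasets):
        z = task_emb(torch.full((x.shape[0],), i, dtype=torch.long))
        total = total + loss_fn(net(torch.cat([x, z], dim=1)), y)
    return total                                           # sum_i L_i(theta, D_i)
```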

Example


Assume we have a model f_{\theta} with parameters \theta. When adapting to a new task \mathcal{T}_i, the parameters \theta become \theta_i^{'}. The new parameters \theta_i^{'} are computed using one or more gradient descent updates on task \mathcal{T}_i. A single gradient update is thus


    \begin{align*}\theta_i^{'} = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i} (f_\theta)\end{align*}
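In code, this inner update can be written functionally so that \theta_i^{'} remains a differentiable function of \theta, which is needed later for the meta-gradient (a sketch; the helper name adapt and the value of \alpha are illustrative):

```python
# Inner update theta_i' = theta - alpha * grad_theta L_Ti(f_theta), written so
# that theta_i' stays differentiable with respect to theta.
import torch

alpha = 0.01

def adapt(params, task_loss):
    grads = torch.autograd.grad(task_loss, params, create_graph=True)
    return [p - alpha * g for p, g in zip(params, grads)]  # theta_i'
```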


The step size \alpha can be learned or set as a fixed hyperparameter. The model parameters are trained by optimizing the performance of f_{\theta_i^{'}} with respect to \theta across tasks sampled from p(\mathcal{T}). The meta-objective is thus:


    \begin{align*}\min_{\theta} \sum_{\mathcal{T}_{i} \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{{\theta}_i^{'}}) = \sum_{\mathcal{T}_{i} \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i} (f_\theta)}) \end{align*}
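Differentiating this objective with respect to \theta passes through the inner update, so the meta-gradient contains second-order terms. With \theta_i^{'} = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta), the chain rule gives

    \begin{align*}\nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i^{'}}) = \left( I - \alpha \nabla^2_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_\theta) \right) \nabla_{\theta_i^{'}} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i^{'}})\end{align*}

The first-order approximation discussed in the paper drops the Hessian term and backpropagates only \nabla_{\theta_i^{'}} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i^{'}}).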


In meta-optimization, we optimize the model parameters \theta, whereas the objective is computed using the updated model parameters \theta_i^{'}. MAML aims to optimize the model parameters such that one or a small number of gradient steps on a new task produces maximally effective behavior on that task. The meta-optimization across tasks is performed via stochastic gradient descent (SGD), so that the model parameters \theta are updated as follows:


    \begin{align*}\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{T}_{i} \sim  p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_{i}^{'}})\end{align*}


where \beta is the meta step size.
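Putting the pieces together, a minimal MAML meta-training loop might look as follows (a sketch, assuming the sine-regression tasks from above, a single inner gradient step, and a small 2-layer MLP; all names and hyperparameter values are illustrative):

```python
# A minimal MAML meta-training loop (sketch): inner adaptation per task, then
# an outer SGD step on theta through the adapted parameters.
import torch

model = torch.nn.Sequential(torch.nn.Linear(1, 40),
                            torch.nn.ReLU(), torch.nn.Linear(40, 1))
loss_fn = torch.nn.MSELoss()
alpha, beta = 0.01, 0.001
meta_opt = torch.optim.SGD(model.parameters(), lr=beta)   # outer SGD, step size beta

def forward(params, x):
    # apply the 2-layer MLP with an explicit parameter list [W1, b1, W2, b2]
    h = torch.relu(x @ params[0].t() + params[1])
    return h @ params[2].t() + params[3]

for step in range(1000):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                                     # sample tasks T_i ~ p(T)
        amp = 1.0 + 4.0 * torch.rand(1).item()
        phase = 3.14 * torch.rand(1).item()
        x_tr = torch.empty(10, 1).uniform_(-5.0, 5.0)      # adaptation data
        y_tr = amp * torch.sin(x_tr + phase)
        x_val = torch.empty(10, 1).uniform_(-5.0, 5.0)     # data for the meta-objective
        y_val = amp * torch.sin(x_val + phase)

        params = list(model.parameters())
        inner_loss = loss_fn(forward(params, x_tr), y_tr)
        grads = torch.autograd.grad(inner_loss, params, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(params, grads)]   # theta_i'

        meta_loss = meta_loss + loss_fn(forward(adapted, x_val), y_val)
    meta_loss.backward()          # gradient w.r.t. theta, through the inner update
    meta_opt.step()               # theta <- theta - beta * grad
```

Because create_graph=True keeps the graph of the inner update, the backward pass includes the second-order terms noted above; passing create_graph=False instead would recover the first-order approximation.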



Sources: paperswithcode.com; Finn, Abbeel, Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", https://arxiv.org/abs/1703.03400v3
