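I wrote these notes after studying BERT pre-training, where the losses of the two tasks (masked language modeling and next sentence prediction) are simply added together. That made me want to learn how the individual losses can be weighted instead.
The loss-weighting methods referenced here come from a Tianchi competition:
- The first paper is Multi-task learning using uncertainty to weigh losses for scene geometry and semantics
- The second paper is Dynamic task prioritization for multitask learning
- The third paper is End-to-End Multi-Task Learning with Attention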

Paper 1

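This paper comes from computer vision, but it shows how to combine two kinds of loss: one for a continuous output (e.g. predicting the distance of an object, the depth regression in the paper) and one for a classification output (e.g. semantic segmentation, where every pixel is classified, for instance as boundary or not).
Suppose the model output is $\mathbf{f}^{\mathbf{W}}(\mathbf{x})$, where $\mathbf{x}$ is the model input (e.g. a tensor of shape [batch_size, seq_len, hidden_size]) and $\mathbf{W}$ are the model parameters.
Below, $\mathbf{y_i} \ i=1,\dots,K$ are the targets of the individual tasks; for example $\mathbf{y_1}$ belongs to a regression task and $\mathbf{y_2}$ to a classification task.
- There is a strong assumption here: given the model output, the predictions of the tasks are independent of each other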

$p\left(\mathbf{y}_{1}, \ldots, \mathbf{y}_{K} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)=p\left(\mathbf{y}_{1} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \ldots p\left(\mathbf{y}_{K} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)$
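
Taking the negative log of this factorization turns the product into a sum, which is why the combined loss ends up as a sum of per-task terms:

$-\log p\left(\mathbf{y}_{1}, \ldots, \mathbf{y}_{K} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)=-\sum_{i=1}^{K} \log p\left(\mathbf{y}_{i} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)$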

For a regression task, how do we compute $p\left(\mathbf{y}_{1} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)$?

$p\left(\mathbf{y}_{1} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)=\mathcal{N}\left(\mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma^{2}\right) \\ \log p\left(\mathbf{y}_{1} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \propto-\frac{1}{2 \sigma^{2}}\left\|\mathbf{y}_{1}-\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right\|^{2}-\log \sigma$

For a classification task, how do we compute $p\left(\mathbf{y}_{2} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)$?

$p\left(\mathbf{y}_{2} \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)=\operatorname{Softmax}\left(\frac{1}{\sigma^{2}} \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)\\ \log p\left(\mathbf{y_2}=c \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma\right) =\frac{1}{\sigma^{2}} f_{c}^{\mathbf{W}}(\mathbf{x}) -\log \sum_{c^{\prime}} \exp \left(\frac{1}{\sigma^{2}} f_{c^{\prime}}^{\mathbf{W}}(\mathbf{x})\right)$

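In other words, for regression the conditional distribution of the target is assumed to be Gaussian; for classification the model output is passed through a scaled softmax.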
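
A minimal numeric sketch of the two likelihoods above, in plain Python. The helper names `gaussian_log_likelihood` and `scaled_softmax` and the sample values are made up for illustration, and the Gaussian term drops the additive constant, as in the proportionality above.

```python
import math

def gaussian_log_likelihood(y, f, sigma):
    """log N(y; f, sigma^2) up to an additive constant, for scalar y and f."""
    return -((y - f) ** 2) / (2 * sigma ** 2) - math.log(sigma)

def scaled_softmax(logits, sigma):
    """Softmax of logits / sigma^2, with the max subtracted for numerical stability."""
    scaled = [z / sigma ** 2 for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = scaled_softmax(logits, sigma=0.5)  # small sigma: peaked distribution
flat = scaled_softmax(logits, sigma=2.0)   # large sigma: closer to uniform
print(sharp)
print(flat)
print(gaussian_log_likelihood(y=1.0, f=1.5, sigma=1.0))
```

A small $\sigma$ sharpens the softmax and a large $\sigma$ flattens it, so $\sigma$ behaves like a temperature on the logits.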

Example

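Concretely, suppose the model has two tasks and therefore two losses; again $\mathbf{y_1}$ belongs to the regression task and $\mathbf{y_2}$ to the classification task.
The overall loss function $\mathcal{L}\left(\mathbf{W}, \sigma_{1}, \sigma_{2}\right)$ is then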

$\begin{align} L&=-\log p\left(\mathbf{y}_{1}, \mathbf{y}_{2}=c \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right) \\ &=-\log \mathcal{N}\left(\mathbf{y}_{1} ; \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma_{1}^{2}\right) \cdot \operatorname{Softmax}\left(\mathbf{y}_{2}=c ; \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma_{2}\right) \\ &=\frac{1}{2 \sigma_{1}^{2}}\left\|\mathbf{y}_{1}-\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right\|^{2}+\log \sigma_{1}-\log p\left(\mathbf{y}_{2}=c \mid \mathbf{f}^{\mathbf{W}}(\mathbf{x}), \sigma_{2}\right) \\ &=\frac{1}{2 \sigma_{1}^{2}} \mathcal{L}_{1}(\mathbf{W})+\frac{1}{\sigma_{2}^{2}} \mathcal{L}_{2}(\mathbf{W})+\log \sigma_{1}+\log \frac{\sum_{c^{\prime}} \exp \left(\frac{1}{\sigma_{2}^{2}} f_{c^{\prime}}^{\mathbf{W}}(\mathbf{x})\right)}{\left(\sum_{c^{\prime}} \exp \left(f_{c^{\prime}}^{\mathbf{W}}(\mathbf{x})\right)\right)^{\frac{1}{\sigma_{2}^{2}}}} \\ &\approx \frac{1}{2 \sigma_{1}^{2}} \mathcal{L}_{1}(\mathbf{W})+\frac{1}{\sigma_{2}^{2}} \mathcal{L}_{2}(\mathbf{W})+\log \sigma_{1}+\log \sigma_{2} \end{align}$

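where $\mathcal{L}_{1}(\mathbf{W})=\left\|\mathbf{y}_{1}-\mathbf{f}^{\mathbf{W}}(\mathbf{x})\right\|^{2}$ and $\mathcal{L}_{2}(\mathbf{W})=-\log \operatorname{Softmax}\left(\mathbf{y}_{2}, \mathbf{f}^{\mathbf{W}}(\mathbf{x})\right)$.
Note that $\mathcal{L}_{2}(\mathbf{W})$ above is the ordinary cross-entropy loss, computed on the unscaled output $\mathbf{f}^{\mathbf{W}}(\mathbf{x})$. The last step uses the approximation $\frac{1}{\sigma_{2}} \sum_{c^{\prime}} \exp \left(\frac{1}{\sigma_{2}^{2}} f_{c^{\prime}}^{\mathbf{W}}(\mathbf{x})\right) \approx\left(\sum_{c^{\prime}} \exp \left(f_{c^{\prime}}^{\mathbf{W}}(\mathbf{x})\right)\right)^{\frac{1}{\sigma_{2}^{2}}}$, which is exact when $\sigma_{2}=1$.
Training therefore has to update $\mathbf{W}$ together with the two parameters $\sigma_{1}$ and $\sigma_{2}$.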
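
To see what the $\log \sigma$ terms do: for a fixed $\mathcal{L}_{1}$, the regression part $\frac{1}{2\sigma_1^2}\mathcal{L}_{1}+\log \sigma_{1}$ has derivative $-\mathcal{L}_{1}/\sigma_1^3+1/\sigma_1$, which vanishes at $\sigma_1^2=\mathcal{L}_{1}$. A task with a larger loss therefore settles on a smaller weight, while $\log \sigma_1$ keeps $\sigma_1$ from growing without bound. The sketch below checks this numerically; the function name and the value 4.0 are illustrative only.

```python
import math

def regression_term(task_loss, sigma):
    """task_loss / (2 sigma^2) + log(sigma): the regression part of the combined loss."""
    return task_loss / (2 * sigma ** 2) + math.log(sigma)

task_loss = 4.0
# Scan sigma over the grid 0.01, 0.02, ..., 10.00 and keep the minimiser.
grid = [k / 100 for k in range(1, 1001)]
best_sigma = min(grid, key=lambda s: regression_term(task_loss, s))
# The grid minimiser should coincide with the closed form sqrt(task_loss).
print(best_sigma, math.sqrt(task_loss))
```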

Code

Below is a PyTorch version of this loss. The code on GitHub is actually problematic: it has no cross_entropy term.

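import torch
import torch.nn.functional as F
from torch import optim

# One learnable log-variance s_i = log(sigma_i^2) per task; 0 means sigma_i = 1
log_var_a = torch.zeros((1,), requires_grad=True)
log_var_b = torch.zeros((1,), requires_grad=True)
# Both sigmas have to be trained, so they are handed to Adam together with
# the model parameters (`model` is assumed to be defined elsewhere)
params = list(model.parameters()) + [log_var_a, log_var_b]
optimizer = optim.Adam(params)

def criterion(y_pred, y_true, log_vars, num_regression, num_classification):
  # y_pred, y_true and log_vars are per-task lists, regression tasks first
  loss = 0
  for i in range(num_regression):
    precision = torch.exp(-log_vars[i])  # 1 / sigma_i^2
    # squared error per sample, assuming shape [batch, dim]
    diff = torch.sum((y_pred[i] - y_true[i]) ** 2., -1)
    # 1/(2 sigma^2) * L_i + log(sigma), where log(sigma) = log_var / 2
    loss += torch.mean(0.5 * precision * diff + 0.5 * log_vars[i])
  for i in range(num_regression, num_regression + num_classification):
    precision = torch.exp(-log_vars[i])
    # cross entropy on the unscaled logits, averaged over the batch
    diff = F.cross_entropy(y_pred[i], y_true[i])
    # 1/sigma^2 * L_i + log(sigma)
    loss += torch.sum(precision * diff + 0.5 * log_vars[i], -1)
  return torch.mean(loss)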

For reference, here is the source code of torch.nn.CrossEntropyLoss(): wrappers inside wrappers!

class CrossEntropyLoss(_WeightedLoss):
    __constants__ = ['ignore_index', 'reduction']
    ignore_index: int
    def __init__(self, weight: Optional[Tensor] = None, size_average=None, ignore_index: int = -100,
                 reduce=None, reduction: str = 'mean') -> None:
        super(CrossEntropyLoss, self).__init__(weight, size_average, reduce, reduction)
        self.ignore_index = ignore_index

    def forward(self, input: Tensor, target: Tensor) -> Tensor:
        return F.cross_entropy(input, target, weight=self.weight,
                               ignore_index=self.ignore_index, reduction=self.reduction)
    
# CrossEntropyLoss calls the cross_entropy function
def cross_entropy(input, target, weight=None, size_average=None, ignore_index=-100,
                  reduce=None, reduction='mean'):
    if not torch.jit.is_scripting():
        tens_ops = (input, target)
        if any([type(t) is not Tensor for t in tens_ops]) and has_torch_function(tens_ops):
            return handle_torch_function(
                cross_entropy, tens_ops, input, target, weight=weight,
                size_average=size_average, ignore_index=ignore_index, reduce=reduce,
                reduction=reduction)
    if size_average is not None or reduce is not None:
        reduction = _Reduction.legacy_get_string(size_average, reduce)
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)

# This is how the loss function is usually called
>>> loss = nn.CrossEntropyLoss()
>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.empty(3, dtype=torch.long).random_(5)
>>> output = loss(input, target)
>>> output.backward()
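
Stripped of the wrappers, the call chain above ends in `nll_loss(log_softmax(input, 1), target)`. The plain-Python sketch below reproduces that computation for `reduction='mean'`, leaving out `weight` and `ignore_index`; the helper names are illustrative and are not PyTorch APIs.

```python
import math

def log_softmax(row):
    """Log-softmax of one row of logits, via the log-sum-exp trick."""
    m = max(row)
    lse = m + math.log(sum(math.exp(z - m) for z in row))
    return [z - lse for z in row]

def cross_entropy_mean(logits, targets):
    """Mean over the batch of -log_softmax(logits)[target]."""
    losses = [-log_softmax(row)[t] for row, t in zip(logits, targets)]
    return sum(losses) / len(losses)

logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
targets = [1, 0]
print(cross_entropy_mean(logits, targets))
```

For uniform logits over three classes the per-sample loss is $\log 3$, which is a quick sanity check for any implementation.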

Paper 2

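The idea of this paper is to give priority to difficult tasks, so the key question is how to define a difficult task.
The paper defines a total loss over $|T|$ tasks, in which $\mathrm{FL}\left(\bar{\kappa}_{t} ; \gamma_{t}\right)$ can be read as the weight of each task and is itself a focal loss.
$\mathcal{L}_{t}^{*}(\cdot)$ is the ordinary loss of a task, e.g. cross entropy. In the equation below it is written as an example-level focal loss with parameter $\gamma_{0}$, which reduces to cross entropy when $\gamma_{0}=0$.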

$\begin{align} \mathcal{L}_{\text {Total }} &=\sum_{t=1}^{|T|} \lambda_{t} \mathcal{L}_{t} \\ &=\sum_{t=1}^{|T|} \mathrm{FL}\left(\bar{\kappa}_{t} ; \gamma_{t}\right) \mathcal{L}_{t}^{*}(\cdot) \\ &=\sum_{t=1}^{|T|} \mathrm{FL}\left(\bar{\kappa}_{t} ; \gamma_{t}\right) \left\{- \frac{1}{N} \sum_{i=1}^{N} (1-p_c)^{\gamma_{0}} \log(p_c) \right\}\\ &=\sum_{t=1}^{|T|} \left\{ -\left(1-\bar{\kappa}_{t}\right)^{\gamma_{t}} \log \left(\bar{\kappa}_{t}\right) \right\} \left\{- \frac{1}{N} \sum_{i=1}^{N} (1-p_c)^{\gamma_{0}} \log(p_c) \right\} \end{align}$

What is focal loss

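Focal loss was designed mainly to address the severe imbalance between positive and negative samples in one-stage object detection. It reduces the weight that the many easy negative samples carry during training, and can also be understood as a form of hard example mining.
The ordinary binary cross entropy is:
- $loss =-y\log(p)-(1-y)\log(1-p)$. Note that for any given sample only one of the two terms is non-zero: $-\log(p)$ when $y=1$, or $-\log(1-p)$ when $y=0$
- What is the problem with cross entropy? We would like the predicted $p$ to separate the classes clearly, e.g. a very high $p$ for samples with $y=1$, but this is hard to achieve. If $p$ lies between 0.4 and 0.6, how do we decide whether such a sample is positive or negative?
- Su Jianlin's blog points out that the model should stop paying attention to positive samples with $p>0.5$ and negative samples with $p<0.5$, i.e. to samples it already predicts well
- So how should the loss be changed? Note that $\hat{y}=p$ below. Class imbalance is essentially a symptom of differences in classification difficulty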

$L_{f l}=\left\{\begin{array}{l} -(1-\hat{y})^{\gamma} \log \hat{y}, \text { if } y=1 \\ -\hat{y}^{\gamma} \log (1-\hat{y}), \text { if } y=0 \end{array}\right.$

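The expression above is the focal loss.
Compared with ordinary cross entropy, the $y=1$ case of focal loss has an extra factor $(1-\hat{y})^{\gamma}=(1-p)^{\gamma}$. The smaller $p$ is, the larger this factor becomes, which raises the loss weight of that sample relative to samples that are already predicted well.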
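
The effect of the extra factor is easy to see on a few numbers. The sketch below compares binary cross entropy and focal loss for positive samples of increasing difficulty; the function names are illustrative and $\gamma=2$ is just an assumed value.

```python
import math

def binary_cross_entropy(p, y):
    """-y*log(p) - (1-y)*log(1-p) for one sample with label y in {0, 1}."""
    return -math.log(p) if y == 1 else -math.log(1 - p)

def binary_focal_loss(p, y, gamma=2.0):
    """Cross entropy scaled by (1 - p_t)^gamma, p_t being the probability of the true class."""
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)

for p in (0.9, 0.6, 0.1):  # positive samples, from easy to hard
    ce = binary_cross_entropy(p, 1)
    fl = binary_focal_loss(p, 1)
    print(f"p={p}: CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.2f}")
```

The ratio FL/CE equals $(1-p)^{\gamma}$ and never exceeds 1, so focal loss works by shrinking the loss of easy samples much more than that of hard ones, not by enlarging any loss in absolute terms.
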
$\bar{\kappa}_{t}$ is defined as follows, where $\alpha$ is a hyperparameter and $\kappa_{t}^{(\tau)}$ is some performance metric (e.g. accuracy) of task $t$ at training step $\tau$

$\bar{\kappa}_{t}^{(\tau)}=\alpha \kappa_{t}^{(\tau)}+(1-\alpha) \bar{\kappa}_{t}^{(\tau-1)}$
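
A small sketch of how the moving average and the task weight $\mathrm{FL}\left(\bar{\kappa}_{t} ; \gamma_{t}\right)$ interact, in plain Python. The accuracy histories, the task names, $\alpha=0.5$, $\gamma_t=1$ and the choice to start the average at the first measurement are all assumptions made for illustration.

```python
import math

def update_kpi(kappa_bar_prev, kappa, alpha):
    """Exponential moving average: alpha * kappa + (1 - alpha) * kappa_bar_prev."""
    return alpha * kappa + (1 - alpha) * kappa_bar_prev

def task_weight(kappa_bar, gamma):
    """FL(kappa_bar; gamma) = -(1 - kappa_bar)^gamma * log(kappa_bar)."""
    return -((1 - kappa_bar) ** gamma) * math.log(kappa_bar)

alpha, gamma = 0.5, 1.0
accuracies = {"easy_task": [0.80, 0.90, 0.95], "hard_task": [0.30, 0.40, 0.45]}
weights = {}
for name, history in accuracies.items():
    kappa_bar = history[0]  # assumption: start the average at the first measurement
    for kappa in history[1:]:
        kappa_bar = update_kpi(kappa_bar, kappa, alpha)
    weights[name] = task_weight(kappa_bar, gamma)
    print(name, round(kappa_bar, 4), round(weights[name], 4))
```

The task with the lower smoothed accuracy ends up with the larger weight, which is the "difficult tasks first" behaviour the paper is after.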

Code

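# A sketch of the loss as understood above, wrapped in a class
import torch

def focal(p, gamma):
    # FL(p; gamma) = -(1 - p)^gamma * log(p); p is clamped so the log stays finite
    p = torch.clamp(p, min=1e-6, max=1.0)
    return -((1.0 - p) ** gamma) * torch.log(p)

class Focal_loss:
    def __init__(self, nums_of_task, gamma, alpha=0.5):
        self.nums_of_task = nums_of_task
        # Hyperparameters: gamma[0] is the example-level focusing parameter,
        # gamma[t] the task-level one for task t; alpha is the moving-average rate
        self.gamma = gamma
        self.alpha = alpha
        # Moving average of each task's performance; index 0 is unused
        self.kappa_hat = [0.0] * (nums_of_task + 1)
        self.step = 0

    def forward(self, y_pred, y_true, kappa):
        # All three arguments are lists indexed by task, from 1 to nums_of_task:
        # y_pred[t] holds logits [batch, classes], y_true[t] class ids [batch],
        # kappa[t] the task's current performance (e.g. accuracy in [0, 1])
        loss = 0.0
        for t in range(1, self.nums_of_task + 1):
            # Moving average of this task's performance
            self.kappa_hat[t] = self.alpha * kappa[t] + (1 - self.alpha) * self.kappa_hat[t]
            # Task weight FL(kappa_hat; gamma_t), a constant with respect to the model
            FL_task_weight = focal(torch.tensor(self.kappa_hat[t]), self.gamma[t])
            # Example-level focal loss of this task, averaged over the batch
            p_c = torch.softmax(y_pred[t], dim=-1).gather(1, y_true[t].unsqueeze(1)).squeeze(1)
            example_loss = focal(p_c, self.gamma[0]).mean()
            loss = loss + FL_task_weight * example_loss
        self.step += 1
        return loss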