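Overview

Transformer kept coming up while I was reading about BERT and GPT, so I wrote this post to record what I learned and to have something to come back to when I forget.
A Transformer consists of 6 encoder layers plus 6 decoder layers (the number can be changed).
A rough view of the structure is shown in the figure below, taken from 龙心尘-CSDN博客_transformer.
This post also draws on the original paper, Attention Is All You Need, and on a personal implementation of the transformer.
The example in that implementation is not very friendly, so I modified the code to use the same example as the rest of this post.
I first walk through the data flow, then describe the structure, and finally show the code.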

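Throughout the post the input is Je suis etudiant, the task is translation, and the output is I am a student.
- So how does the transformer perform this translation?
- First, the encoder encodes the input sentence Je suis etudiant, that is, it extracts features.
- Next, these features and [BOS] are fed into the decoder, which outputs I.
- Then the features and [BOS] I are fed into the decoder, which outputs am.
- Then the features and [BOS] I am are fed into the decoder, which outputs a.
- Finally, the features and [BOS] I am a are fed into the decoder, which outputs student.
- Joining the translated tokens gives the final output I am a student.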
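
The loop described in the list above can be sketched as follows. This is only a toy illustration: `toy_encoder` and `toy_decoder_step` are hypothetical stand-ins (the decoder step reads the next word from a fixed table instead of running attention), and the `[END]` stop token is an assumption of this sketch.

```python
# Toy sketch of step-by-step (greedy) decoding.
# toy_encoder and toy_decoder_step are hypothetical stand-ins for a trained model.

def toy_encoder(src_tokens):
    # A real encoder returns feature vectors; here the tokens are passed through.
    return tuple(src_tokens)

def toy_decoder_step(features, generated):
    # A real decoder attends over `features` and `generated`.
    # This stand-in hard-codes the translation to keep the example runnable.
    reference = ["I", "am", "a", "student", "[END]"]
    return reference[len(generated) - 1]  # `generated` always starts with [BOS]

def greedy_translate(src_tokens, max_len=10):
    features = toy_encoder(src_tokens)
    generated = ["[BOS]"]
    for _ in range(max_len):
        next_token = toy_decoder_step(features, generated)
        if next_token == "[END]":
            break
        generated.append(next_token)  # the new token is fed back in at the next step
    return generated[1:]  # drop [BOS]

result = greedy_translate(["Je", "suis", "etudiant"])
print(" ".join(result))
```

The point of the sketch is the feedback loop: every step sees the same encoder features plus everything generated so far.
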
The detailed architecture is shown below; the figure is from the original paper, Attention Is All You Need.

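In the figure above, the encoder is on the left and the decoder on the right. The $\text{N}\times$ next to the encoder box means that the block is repeated $\text{N}$ times.
The first self-attention layer of the decoder differs from the others: it is labelled Masked, meaning that a mask is applied to it.
In fact, two kinds of mask appear in the model:
- The first is the padding mask. Sentences have different lengths, so they are padded to a common length. The added [PAD] tokens carry no meaning, so they are masked out to keep the self-attention matrix from attending to them.
- The second is the sequence mask, used in the decoder to prevent it from seeing future tokens; an example is given further down.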

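Data flow

Step 1: preparing the data

Tokenize the sentences above, that is, split them into individual words:
- x = [Je, suis, etudiant]
- y = [I, am, a, student]
- Note: the corpus here has a single example. In practice sentences have different lengths, so all of them must be padded to the same length.
- Suppose the padded lengths of $x$ and $y$ are $n=5$ and $m=4$. After padding, $x$ and $y$ become
  - Je, suis, etudiant, [PAD], [PAD]
  - [BOS] I am a student
- [BOS] tells the sequence model to start producing output, and corresponds to "shifted right" in the figure above. The output sequence is therefore one token longer, which is why code usually sets $m \leftarrow m+1$.
The tokens are then mapped to ids through a vocabulary to form the model inputs: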

src_vocab = {'[PAD]': 0, 'Je': 1, 'suis': 2, 'etudiant': 3}
tgt_vocab = {'[PAD]': 0, 'I': 1, 'am': 2, 'a': 3, 'student': 4, '[BOS]': 5}
# Represent each token by its id in the vocabulary
"""
enc_inputs: tensor([[1, 2, 3, 0, 0]])
dec_inputs: tensor([[5, 1, 2, 3, 4]])
"""

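As a minimal sketch of how the ids above can be produced: the helper `encode` is a hypothetical name introduced here, not part of the original code, and it assumes whitespace tokenization.

```python
PAD = "[PAD]"
src_vocab = {"[PAD]": 0, "Je": 1, "suis": 2, "etudiant": 3}
tgt_vocab = {"[PAD]": 0, "I": 1, "am": 2, "a": 3, "student": 4, "[BOS]": 5}

def encode(sentence, vocab, max_len, prefix=()):
    # Split on whitespace, optionally prepend special tokens, then pad on the right.
    tokens = list(prefix) + sentence.split()
    tokens = tokens + [PAD] * (max_len - len(tokens))
    return [vocab[token] for token in tokens]

# One sentence per batch, so each input is a list holding a single row.
enc_inputs = [encode("Je suis etudiant", src_vocab, max_len=5)]
dec_inputs = [encode("I am a student", tgt_vocab, max_len=5, prefix=("[BOS]",))]
print(enc_inputs, dec_inputs)
```
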
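Step 2: encoder

Before the first encoder layer, a [PAD] mask matrix is computed by get_attn_pad_mask.
- For example, enc_inputs: tensor([[1, 2, 3, 0, 0]]), i.e. Je, suis, etudiant, [PAD], [PAD],
- yields a $1 \times n \times n$ matrix.

# Row 1: when extracting features for `Je`, attention must not go to columns 4 and 5 (the `[PAD]` tokens)
# Row 2: when extracting features for `suis`, do not attend to `[PAD]` either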
tensor([[[False, False, False, True, True],
         [False, False, False, True, True],
         [False, False, False, True, True],
         [False, False, False, True, True],
         [False, False, False, True, True]]])
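
The matrix above can be reproduced with a few lines of plain Python. This is a sketch with nested lists instead of tensors; `pad_mask` is a hypothetical helper name, and it assumes that id 0 is [PAD].

```python
def pad_mask(query_ids, key_ids, pad_id=0):
    # One row per query position; True marks key positions that hold [PAD].
    key_is_pad = [key == pad_id for key in key_ids]
    return [list(key_is_pad) for _ in query_ids]

enc_ids = [1, 2, 3, 0, 0]  # Je, suis, etudiant, [PAD], [PAD]
mask = pad_mask(enc_ids, enc_ids)
for row in mask:
    print(row)
```

Every row is identical because the mask depends only on which keys are padding, not on which token is asking.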

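The enc_inputs entering the first encoder layer pass through the word embedding and the positional encoding, which turns them into a $\text{batch_size} \times n\times d_{model}$ matrix, denoted $\mathbf{z}$.
Each encoder layer has two sub-layers: multi-head attention and a feed-forward network.
Each encoder layer computes the following:
1. Multiply $\mathbf{z}$ by the matrices $W_x, W_k, W_v$ to obtain, for each $\mathbf{z_i}$, the query vector $q_i=\mathbf{z_i}W_x$, the key vector $k_i=\mathbf{z_i}W_k$ and the value vector $v_i=\mathbf{z_i}W_v$. Here $\mathbf{z_i}$ is a single token in the batch, so its shape is $1\times d_{model}$.
2. In matrix form this is $Q=\mathbf{z}W_x$, and likewise $K=\mathbf{z}W_k$ and $V=\mathbf{z}W_v$.
3. Compute $\text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$, where $QK^{T}$ is the attention score matrix.
4. The score matrix has shape $\text{batch_size} \times n_{heads} \times n \times n$. The pad mask is applied to it before the softmax: positions where the mask is True are filled with a very negative value (-1e9 in the code), so they receive almost zero weight.
5. After $\text { Attention }(Q, K, V)$ is computed, the heads are concatenated and multiplied by a weight matrix $W^{O}$, giving a $\text{batch_size} \times n \times d_{model}$ matrix.
6. Add the matrix from step 5 to the input $\mathbf{z}$ from step 1 (the residual connection), then apply layernorm. This completes multi-head attention.
7. Denote the output of multi-head attention by $\mathbf{z}_{atten}$.
8. Pass $\mathbf{z}_{atten}$ through the feed-forward network, add $\mathbf{z}_{atten}$ back as the residual, and apply layernorm. This completes the second sub-layer.
9. The output of step 8 is the input to the next encoder layer.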
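
Steps 6 and 8 both apply the same "add the residual, then layernorm" pattern. A minimal plain-Python sketch of that pattern follows; it assumes a layernorm without the learnable scale and shift that a real LayerNorm has, and the numbers are made up.

```python
import math

def layer_norm(vector, eps=1e-5):
    # Normalize one token vector to zero mean and unit variance.
    # The learnable scale and shift of a real LayerNorm are omitted here.
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]

def add_and_norm(inputs, sublayer_outputs):
    # inputs and sublayer_outputs: n x d_model, as lists of token vectors.
    return [
        layer_norm([x + s for x, s in zip(x_row, s_row)])
        for x_row, s_row in zip(inputs, sublayer_outputs)
    ]

z = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.5, 1.5]]                 # made-up input, n=2, d_model=4
attention_out = [[0.1, -0.2, 0.3, 0.0], [1.0, 0.0, -1.0, 0.0]]   # made-up sub-layer output
z_atten = add_and_norm(z, attention_out)
print(z_atten)
```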

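Symbol	Shape	Actual shape ($n=4$, as in the code)	Meaning
$\mathbf{z}$	$\text{batch_size} \times n \times d_{model}$	$1 \times 4 \times 512$	a batch of inputs; $n$ is the maximum input length and $d_{model}$ is the model dimension
$\mathbf{z}W_x$	$\text{batch_size} \times n \times (d_{x}*n_{heads})$	$1 \times 4 \times (64*8)$	query projections; the 8 scaled dot-product heads are computed in parallel
$\mathbf{z}W_k$	$\text{batch_size} \times n \times (d_{k}*n_{heads})$	$1 \times 4 \times (64*8)$	key projections
$\mathbf{z}W_v$	$\text{batch_size} \times n \times (d_{v}*n_{heads})$	$1 \times 4 \times (64*8)$	value projections
$Q$	$\text{batch_size} \times n_{heads} \times n \times d_{x}$	$1 \times 8 \times 4 \times 64$	query of each token, per head
$K$	$\text{batch_size} \times n_{heads} \times n \times d_{k}$	$1 \times 8 \times 4 \times 64$	key of each token, per head
$V$	$\text{batch_size} \times n_{heads} \times n \times d_{v}$	$1 \times 8 \times 4 \times 64$	value of each token, per head
$\text {Attention }(Q, K, V)$	$\text{batch_size} \times n_{heads} \times n \times d_{v}$	$1 \times 8 \times 4 \times 64$	attention output
$W^{O}$	$(n_{heads} \times d_{v}) \times d_{model}$	$(8 \times 64) \times 512$	simply a fully connected layer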

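Step 3: decoder

The first decoder layer takes dec_inputs as input, i.e. dec_inputs: tensor([[5, 1, 2, 3, 4]]).
It is embedded and added to the positional encoding, giving a $\text{batch_size} \times m \times d_{model}$ matrix, again denoted $\mathbf{z}$.
Two mask matrices are computed here, dec_self_attn_mask and dec_enc_attn_mask.

# dec_self_attn_mask: batch_size * m * m
# This mask is the key idea. Consider how translation proceeds:
# at the first step we feed in the source sentence and the start token [BOS],
# so the translation so far consists of the single token [BOS],
# and the decoder can only encode [BOS] and extract features from it.
# The model then outputs the first token, I.
# To produce the second token, am, the model receives the source sentence and the sequence translated so far ([BOS] I [PAD] [PAD] [PAD]),
# so it can use the information in the two tokens [BOS] I.
# In general, to produce the k-th token we feed in the source sentence and the sequence of k-1 tokens translated so far.
# The mask below expresses this: row i may only attend to columns 1..i.
# During training all of these steps are computed in parallel in a single pass;
# at inference time the sequence is fed through the model one step at a time.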
tensor([[[False, True,  True,  True,  True],
         [False, False,  True,  True,  True],
         [False, False, False,  True,  True],
         [False, False, False, False,  True],
         [False, False, False, False,  False]]])

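# dec_enc_attn_mask: batch_size * m * n
# Note on n and m: in this post n = m
# In this attention mask,
# row 1 is the attention of [BOS] over x = 'Je suis etudiant [PAD] [PAD]';
# True means that [BOS] must not attend to the 4th and 5th tokens of the input x.
# Row 2 is for the token I, once it has been translated: which source tokens it may attend to.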
tensor([[[False, False, False, True, True],
         [False, False, False, True, True],
         [False, False, False, True, True],
         [False, False, False, True, True],
         [False, False, False, True, True]]])
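
Both decoder masks can be rebuilt in plain Python, as a sketch. The helper names `pad_mask`, `subsequent_mask` and `combine` are hypothetical; the self-attention mask is the element-wise OR of the padding mask and the look-ahead mask.

```python
def pad_mask(query_ids, key_ids, pad_id=0):
    # True marks key positions that hold [PAD]; one row per query position.
    key_is_pad = [key == pad_id for key in key_ids]
    return [list(key_is_pad) for _ in query_ids]

def subsequent_mask(length):
    # True above the diagonal: position `row` must not see any later position.
    return [[col > row for col in range(length)] for row in range(length)]

def combine(mask_a, mask_b):
    # A position is masked if either mask says so.
    return [[a or b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(mask_a, mask_b)]

enc_ids = [1, 2, 3, 0, 0]  # Je suis etudiant [PAD] [PAD]
dec_ids = [5, 1, 2, 3, 4]  # [BOS] I am a student

dec_self_attn_mask = combine(pad_mask(dec_ids, dec_ids), subsequent_mask(len(dec_ids)))
dec_enc_attn_mask = pad_mask(dec_ids, enc_ids)  # m rows (queries) by n columns (keys)
```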

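The decoder receives dec_inputs, enc_inputs and enc_outputs; every decoder layer receives the same enc_outputs and the same two masks.
Each decoder layer computes the following from dec_outputs, enc_outputs, dec_self_attn_mask and dec_enc_attn_mask:
1. Start with the masked multi-head attention, which takes dec_outputs and dec_self_attn_mask.
2. Multiply dec_outputs by $W_x$, $W_k$, $W_v$ to obtain the query, key and value vectors.
3. Compute $\text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$, where $QK^{T}$ is the attention score matrix.
4. The score matrix has shape $\text{batch_size} \times n_{heads} \times m \times m$. dec_self_attn_mask is applied to it before the softmax: positions where the mask is True are filled with a very negative value.
5. After $\text { Attention }(Q, K, V)$ is computed, multiply by a weight matrix $W^{O}$ to get a $\text{batch_size} \times m \times d_{model}$ matrix.
6. Add the matrix from step 5 to the input from step 1 (the residual connection), then apply layernorm. This completes the first multi-head attention sub-layer.
7. Multiply the output of step 6 by $W_x$ to get the query vectors, and multiply the output of the last encoder layer by $W_k$, $W_v$ to get the key and value vectors.
8. Compute $\text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$, where $QK^{T}$ is the attention score matrix.
9. The score matrix has shape $\text{batch_size} \times n_{heads} \times m \times n$. dec_enc_attn_mask is applied to it in the same way.
10. After $\text { Attention }(Q, K, V)$ is computed, multiply by a weight matrix $W^{O}$ to get a $\text{batch_size} \times m \times d_{model}$ matrix, add the output of step 6 as the residual, then apply layernorm. This completes the second multi-head attention sub-layer.
11. Denote the value computed in step 10 by $\mathbf{z}_{atten}$.
12. Pass $\mathbf{z}_{atten}$ through the feed-forward network, add $\mathbf{z}_{atten}$ back as the residual, and apply layernorm. This completes the third sub-layer.
13. The output of step 12 is the input to the next decoder layer.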

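Note that steps 2~6 and steps 7~10 above are both multi-head attention layers. They differ as follows:
In the first multi-head attention layer, the queries, keys and values all come from the decoder.
In the second multi-head attention layer, the queries come from the decoder, but the keys and values are the output of the last encoder layer.

	First sub-layer	Second sub-layer
query	output of the previous decoder layer	output of the first sub-layer
key	output of the previous decoder layer	output of the last encoder layer
value	output of the previous decoder layer	output of the last encoder layer
Mask shape	$\text{batch_size} \times m \times m$	$\text{batch_size} \times m \times n$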

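Step 4: predicting the sequence

The decoder output has shape $\text{batch_size} \times m \times d_{model}$.
A linear layer of shape $d_{model} \times \text{vocab_output}$ is applied, where $\text{vocab_output}$ is the size of the output vocabulary.
The result is passed through a softmax, which gives the probability of every token in the vocabulary.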
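
A tiny sketch of this last step for a single position, in plain Python. The sizes ($d_{model}=2$, a vocabulary of 3 tokens) and all numbers are made up for illustration; `project` and `softmax` are hypothetical helper names.

```python
import math

def project(hidden, weight):
    # hidden: d_model values; weight: d_model x vocab_size. Returns one logit per token.
    vocab_size = len(weight[0])
    return [sum(h * row[j] for h, row in zip(hidden, weight)) for j in range(vocab_size)]

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [1.0, 2.0]              # decoder output for one position (made up)
weight = [[0.1, 0.5, -0.2],
          [0.3, 0.9, 0.0]]       # d_model x vocab_size (made up)

probs = softmax(project(hidden, weight))
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_id)
```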

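Embeddings and positional encoding

For the first encoder and decoder layers, the $1\times 4$ and $1\times 5$ id vectors above must be converted to word vectors.
The position of each token must be recorded as well.
Below, $p=512$ is the dimension given in the paper, chosen for convenience; $\mathbf{y}$ is handled the same way.
The word vectors are looked up from a learnable weight matrix with one row per vocabulary entry. For $x$ the lookup yields a $5 \times 512$ matrix, where 5 is the sequence length of $x$ and each row corresponds to one token.

Index	token	Word vector	Shape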
1	Je	$[x_{11}, x_{12}, …, x_{1p}]$	$1\times 512$
2	suis	$[x_{21}, x_{22}, …, x_{2p}]$	$1\times 512$
3	etudiant	$[x_{31}, x_{32}, …, x_{3p}]$	$1\times 512$
0	[PAD]	$[x_{01}, x_{02}, …, x_{0p}]$	$1\times 512$
0	[PAD]	$[x_{01}, x_{02}, …, x_{0p}]$	$1\times 512$

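Besides the word vector, position information is added, i.e. where each token sits in the sentence. There are two ways to compute it; the one used in Attention Is All You Need is as follows:
- Take the token suis: its position in the sentence is 2, so $p o s= 2$.
- In the formula, $i$ ranges over $i=0,\dots, 255$ (for $d_{model}=512$), and in the $PE$ vector the even dimensions use $\sin$ and the odd dimensions use $\cos$.
- In some torch and tensorflow implementations, however, the first 256 dimensions are $\sin$ and the last 256 are $\cos$; the source code is get_timing_signal_1d.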

$\begin{equation} \begin{aligned} P E_{(p o s, 2 i)} &=\sin \left(p o s / 10000^{2 i / d_{\text {model }}}\right) \\ P E_{(p o s, 2 i+1)} &=\cos \left(p o s / 10000^{2 i / d_{\text {model }}}\right) \end{aligned} \end{equation}$

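Once the word vectors and the positional encodings are computed, the two are summed: $\mathbf{z} = \mathbf{x} + PE$. The $PE$ matrix is fixed, while the word-vector matrix $\mathbf{x}$ changes during training.
This $\mathbf{z}$ is the input to the encoder.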
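
As a sketch, the interleaved sin/cos formula above can be written in plain Python. `positional_encoding` is a hypothetical helper name; dimension $j$ uses $i = \lfloor j/2 \rfloor$, with sin on even $j$ and cos on odd $j$.

```python
import math

def positional_encoding(pos, d_model):
    # Dimension j uses i = j // 2: sin for even j (2i), cos for odd j (2i + 1).
    encoding = []
    for j in range(d_model):
        i = j // 2
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return encoding

pe_suis = positional_encoding(pos=2, d_model=512)  # the token at position 2
print(len(pe_suis), pe_suis[:4])
```

Since the values depend only on the position and the dimension, the whole table can be computed once and kept frozen.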

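Attention layer

Before describing the attention layer, a few words about the intuition behind it.
The attention mechanism mirrors how people read a book or look at a picture: they do not take in all of the information, but pick out the key parts of the sentence or image to aid understanding.
The common approach to sequence-to-sequence tasks today is an encoder + decoder structure:
- The encoder encodes the input sequence $x=(x_1, \dots, x_n)$ into $z=F(x_1, \dots, x_n)$; this $z$ can be understood as the semantics extracted from the original sequence.
- The decoder's task is to generate the $i$-th word from the semantics $z$ and the words already output, $(y_1, \dots, y_{i-1})$.

$y_i = G(z, y_1, \dots, y_{i-1})$

Every $y_i$ is generated this way; in particular, when $y_1$ is generated, the output sequence so far contains only [BOS].
The formula above does not yet reflect attention: $y_1, y_2, y_3$ and so on are all generated from the same $z$, so every word of the source sentence is attended to at every step.
How can we pick out of $z$ the part that matters for the $y_i$ about to be generated? Simply use a different $z$ each time a $y_i$ is generated, as below.
This $z_{(i)}$ can be seen as the degree to which each word of the source sentence should be attended to when generating the $i$-th word (that is, a probability distribution).

$\begin{align} y_1 &= G(z_{(1)}) \\ y_2 &= G(z_{(2)}, y_1) \\ y_3 &= G(z_{(3)}, y_1, y_2) \end{align}$

A sentence can be viewed as a set of (K, V) pairs: $x=(x_1, \dots, x_n)$ has $n$ (K, V) pairs, i.e. [(K_1, V_1), (K_2, V_2), ...].

$attention(Query, x)= \sum_{i=1}^{n} similarity(Query, K_i) * V_i$

When a word is to be generated, it is turned into a $Query$ and compared with each of the $n$ (K, V) pairs. In the transformer, $similarity(Query, K_i)$ is an inner product; it yields a number that represents how much the $Query$ should attend to the $i$-th word.
In computing terms, K can be seen as an address and V as a value: given the address K, the value V is fetched. Ordinary addressing fetches a single value, whereas the attention mechanism fetches all the values and takes their weighted sum.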
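
The "fetch every value and take a weighted sum" idea can be sketched in plain Python. As an assumption of this sketch, the inner-product similarities are normalized with a softmax so that the weights sum to 1; `soft_lookup` is a hypothetical name and the numbers are made up.

```python
import math

def soft_lookup(query, keys, values):
    # Similarity of the query with every key (inner product).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over all values, instead of fetching a single one.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = soft_lookup([0.0, 3.0], keys, values)
print(weights, output)
```

The query lines up with the second key, so the second value dominates the result, but the other values still contribute a little.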
The figure below, which has been shown to death, illustrates how attention is computed.

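scaled dot product attention

The input to the first encoder layer is the sum of the word vectors and the positional encodings; its shape is $\text{batch_size} \times \text{len_input} \times p = 1\times 5\times 512$.
The inputs to encoder layers 2 through 6 are the outputs of the previous layer, also $1\times 5\times 512$.
This input $\mathbf{z}$ is fed into the encoder. Every encoder layer works the same way, so a single layer is described below.

Feed $\mathbf{z} = [\mathbf{z_1}, \dots, \mathbf{z_{5}}]^{T}$ into the encoder.
Multiply each $\mathbf{z_i}$ by $W_x, W_k, W_v$ to obtain its query vector $q_i=\mathbf{z_i}W_x$, key vector $k_i=\mathbf{z_i}W_k$ and value vector $v_i=\mathbf{z_i}W_v$.
- Here the query, key and value vectors are all $1\times64$.
- The dimensions $d_q, d_k, d_v$ could be set separately, but the paper uses the same value for all of them to simplify computation.
- So $W_x, W_k$ are $512\times64$ matrices and $W_v$ is $512\times d_v$.
- Each $q_i$ is multiplied with every $k_j \quad (j=1, \dots, 5)$ and the results go through a softmax, giving a score vector of length $5$.
- This score vector holds the attention of token $i$ over the 5 tokens.
- For example, if the masked scores before the softmax are [0.1, 0.4, 0.5, -1e9, -1e9], the two [PAD] positions get a weight of essentially 0 after the softmax, and token $i$ attends most to the token whose score is $0.5$.
- Multiplying $\mathbf{z}$ by $W_x, W_k, W_v$ is the matrix form of the procedure above.
- Let $Q,K,V$ denote the three resulting matrices; each has shape $5\times 64$. Below, $d_k=64$.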

$\begin{equation} \text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \end{equation}$

MatMul is matrix multiplication. There is also an optional Mask (opt.) step, which applies the mask to the attention score matrix.
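
Putting the formula and the mask together, here is a plain-Python sketch of masked scaled dot-product attention for one head. The tiny sizes (3 tokens, $d_k=2$, last token treated as [PAD]) and all numbers are made up; -1e9 is the fill value used in the code later in this post.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, mask):
    d_k = len(K[0])
    attn = []
    for q, mask_row in zip(Q, mask):
        # MatMul + Scale: q . k / sqrt(d_k) for every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Mask (opt.): masked positions get a very negative score.
        scores = [-1e9 if masked else s for s, masked in zip(scores, mask_row)]
        attn.append(softmax(scores))
    # MatMul with V: weighted sum of the value vectors.
    context = [
        [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        for weights in attn
    ]
    return context, attn

Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [[False, False, True]] * 3  # the third token plays the role of [PAD]
context, attn = scaled_dot_product_attention(Q, K, V, mask)
print(attn)
```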

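Multi-Head Attention

Note the purple box in the figure: Scaled Dot-Product Attention is the procedure described above.
The parameter to focus on here is $h$, the number of heads, i.e. the number of Scaled Dot-Product Attention units; they can be computed in parallel.
The paper uses $h=8$: each encoder layer contains 8 such attention units, that is, eight sets of matrices.
- When an input $\mathbf{z}$ comes in, $Q,K,V$ are computed from it.
- $Q,K,V$ then enter the 8 attention heads simultaneously.
- Each $\text{head_i}$ has shape $\text{len_seq} \times d_k$.
- Concat below means concatenation: the 8 heads are joined into a $\text{len_seq} \times (64*8)$ matrix.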

$\begin{equation} \begin{aligned} \operatorname{MultiHead}(Q, K, V) &=\operatorname{Concat}\left(\mathrm{head}_{1}, \ldots, \text { head }_{\mathrm{h}}\right) W^{O} \\ & \text { where head }_{\mathrm{i}}=\operatorname{Attention}\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned} \end{equation}$
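
The split into heads and the Concat step are pure reshaping, which a small plain-Python sketch can show. It uses made-up sizes (5 tokens, 8 heads of width 2 instead of 64) and leaves out the attention itself and the final $W^{O}$; `split_heads` and `concat_heads` are hypothetical names.

```python
def split_heads(matrix, n_heads):
    # matrix: len_seq x (n_heads * d_head) -> n_heads matrices of len_seq x d_head
    d_head = len(matrix[0]) // n_heads
    return [
        [row[h * d_head:(h + 1) * d_head] for row in matrix]
        for h in range(n_heads)
    ]

def concat_heads(heads):
    # n_heads matrices of len_seq x d_head -> len_seq x (n_heads * d_head)
    len_seq = len(heads[0])
    return [[x for head in heads for x in head[t]] for t in range(len_seq)]

len_seq, n_heads, d_head = 5, 8, 2
projected = [[t * 100 + c for c in range(n_heads * d_head)] for t in range(len_seq)]

heads = split_heads(projected, n_heads)   # each head would run its own attention
merged = concat_heads(heads)              # Concat(head_1, ..., head_h)
print(len(heads), len(heads[0]), len(heads[0][0]), len(merged[0]))
```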

Feed-forward network

This network takes one input, $\mathbf{z}$, of shape $\text{batch_size} \times n \times d_{model}$ or $\text{batch_size} \times m \times d_{model}$.
Its expression is

$\begin{equation} \operatorname{FFN}(z)=\max \left(0, z W_{1}+b_{1}\right) W_{2}+b_{2} \end{equation}$

The weight matrix $W_1$ has shape $512\times 2048$, i.e. $d_{ff}=2048$ in the paper, which is $4 \times d_{model}$; I do not know why this value was chosen.

Weight	Shape	Actual shape
$W_1$	$d_{model} \times d_{ff}$	$512\times 2048$
$b_1$	$1 \times d_{ff}$	$1 \times 2048$
$W_2$	$d_{ff} \times d_{model}$	$2048 \times 512$
$b_2$	$1 \times d_{model}$	$1\times 512$
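
A plain-Python sketch of $\operatorname{FFN}(z)$ for a single token vector, using tiny made-up sizes ($d_{model}=2$, $d_{ff}=3$) and made-up weights:

```python
def ffn(z, W1, b1, W2, b2):
    # hidden = max(0, z W1 + b1), shape 1 x d_ff
    hidden = [
        max(0.0, sum(z_i * W1[i][j] for i, z_i in enumerate(z)) + b1[j])
        for j in range(len(b1))
    ]
    # output = hidden W2 + b2, shape 1 x d_model
    return [
        sum(h_j * W2[j][k] for j, h_j in enumerate(hidden)) + b2[k]
        for k in range(len(b2))
    ]

z = [1.0, -1.0]                                # 1 x d_model
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]       # d_model x d_ff
b1 = [0.0, 0.0, 0.5]                           # 1 x d_ff
W2 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # d_ff x d_model
b2 = [0.1, 0.2]                                # 1 x d_model

out = ffn(z, W1, b1, W2, b2)
print(out)
```

The same weights are applied to every position independently, which is why the code later in this post can implement the layer with kernel-size-1 convolutions.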

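Estimating the number of parameters

Let us count the parameters.
The following hyperparameters appear in total:

Hyperparameter	Meaning
$n$	maximum input sequence length
$m$	maximum output sequence length
$d_{model}$	model dimension, also the word-vector dimension
$d_{ff}$	hidden dimension of the feed-forward network
$d_x, d_k, d_v$	dimensions of the query, key and value
$n_{heads}$	number of heads
$\text{batch_size}$	number of examples in a batch

Count the parameters of each part, leaving out batch_size (i.e. assuming it is 1, a single token going in).
Word-vector matrices: $n \times d_{model}$ and $m \times d_{model}$
Multi-head attention layer:
- The query, key and value weight matrices are $d_{model} \times d_x$, $d_{model} \times d_k$, $d_{model} \times d_v$ per head, so there are $n_{heads} \times d_{model} \times (d_x + d_k + d_v)$ parameters in total.
- The weight matrix $W^{O}$, applied after attention is computed and the heads are concatenated, is $(n_{heads}*d_v) \times d_{model}$.
Feed-forward network:
- The encoder and decoder pass in a $1 \times d_{model}$ value, and the output is $\operatorname{FFN}(x)=\max \left(0, x W_{1}+b_{1}\right) W_{2}+b_{2}$.
- So the weights $W_1$, $b_1$, $W_2$, $b_2$ of this network have shapes $d_{model} \times d_{ff}$, $1\times d_{ff}$, $d_{ff} \times d_{model}$, $1 \times d_{model}$.
Summary: one transformer has 6 encoder layers and 6 decoder layers.
The encoders therefore have $6 \times n_{heads} \times d_{model} \times (d_x + d_k + d_v) + 6\times (n_{heads}*d_v) \times d_{model} + 12 \times d_{ff} \times d_{model} + 6 \times d_{model} + 6 \times d_{ff}$ parameters in total.
The decoders have $12 \times n_{heads} \times d_{model} \times (d_x + d_k + d_v) + 12\times (n_{heads}*d_v) \times d_{model} + 12 \times d_{ff} \times d_{model} + 6 \times d_{model} + 6 \times d_{ff}$ parameters in total.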
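With batch_size = 1, n = 1 and m = 1, and with $d_{model}=512$, $d_{ff}=2048$, $n_{heads}=8$, $d_x=d_k=d_v=64$, the two expressions above give 18889728 + 25181184 = 44070912 parameters, plus 1024 for the two word-vector rows.
The correct count is 44152832. My estimate differs because it leaves out some terms, such as the bias terms of the attention projections.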
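
To make the arithmetic easy to recheck, the encoder and decoder expressions above can be evaluated in a few lines of Python. This sketch assumes the paper's base settings ($d_{model}=512$, $d_{ff}=2048$, 8 heads, 6 layers, $d_x=d_k=d_v=64$) and counts only the terms that appear in those two expressions.

```python
d_model, d_ff, n_heads, n_layers = 512, 2048, 8, 6
d_x = d_k = d_v = 64

# One multi-head attention block: Q/K/V projections for all heads, then W^O.
qkv_params = n_heads * d_model * (d_x + d_k + d_v)
w_o_params = (n_heads * d_v) * d_model
attention_params = qkv_params + w_o_params

# One feed-forward block: W_1, W_2 and their biases b_1, b_2.
ffn_params = 2 * d_ff * d_model + d_ff + d_model

# An encoder layer has one attention block, a decoder layer has two.
encoder_params = n_layers * (attention_params + ffn_params)
decoder_params = n_layers * (2 * attention_params + ffn_params)

print("encoders:", encoder_params)
print("decoders:", decoder_params)
print("total:", encoder_params + decoder_params)
```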

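Code

I read several other people's implementations and added comments based on my own understanding.
A few caveats about the code below:
- The maximum input and output sequence lengths are hard-coded constants rather than derived from the data.
- The positional encoding part is not written very well.
- Be careful with self.pos_emb: it should be indexed with the positions 0, 1, ..., len-1 derived from enc_inputs and dec_inputs, not with a hard-coded tensor such as torch.LongTensor([[5,1,2,3,4]]).
- A better example would use different lengths for the input and output sequences, so that the two can be told apart.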

# %%
# code by Tae Hwan Jung(Jeff Jung) @graykode, Derek Miller @dmmiller612
# Reference : https://github.com/jadore801120/attention-is-all-you-need-pytorch
#           https://github.com/JayParks/transformer
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

def make_batch(sentences):
    input_batch = [[src_vocab[n] for n in sentences[0].split()]]
    output_batch = [[tgt_vocab[n] for n in sentences[1].split()]]
    target_batch = [[tgt_vocab[n] for n in sentences[2].split()]]
    return torch.LongTensor(input_batch), torch.LongTensor(output_batch), torch.LongTensor(target_batch)

def get_sinusoid_encoding_table(n_position, d_model):
    def cal_angle(position, hid_idx):
        return position / np.power(10000, 2 * (hid_idx // 2) / d_model)
    def get_posi_angle_vec(position):
        return [cal_angle(position, hid_j) for hid_j in range(d_model)]

    sinusoid_table = np.array([get_posi_angle_vec(pos_i) for pos_i in range(n_position)])
    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1
    return torch.FloatTensor(sinusoid_table)

def get_attn_pad_mask(seq_q, seq_k):
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()
    # Token id 0 is [PAD]
    # unsqueeze(1) inserts a new dimension at axis=1
    # e.g. for seq_k = [[1, 2, 3, 0]]: pad_attn_mask = tensor([[[False, False, False,  True]]])
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)  # batch_size x 1 x len_k, True marks positions to mask
    """
    expand repeats this row len_q times, once per query position,
    because every query has to ignore the same [PAD] keys:
    tensor([[[False, False, False,  True],
         [False, False, False,  True],
         [False, False, False,  True],
         [False, False, False,  True]]])
    """
    return pad_attn_mask.expand(batch_size, len_q, len_k)  # batch_size x len_q x len_k

def get_attn_subsequent_mask(seq):
    # The decoder input is seq = dec_inputs = [[5, 1, 2, 3, 4]]
    attn_shape = [seq.size(0), seq.size(1), seq.size(1)] # 1*5*5
    # Strictly upper-triangular 1*5*5 matrix: the diagonal and the lower triangle are 0
    """
    array([[[0., 1., 1., 1., 1.],
        [0., 0., 1., 1., 1.],
        [0., 0., 0., 1., 1.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0.]]])
    """
    subsequent_mask = np.triu(np.ones(attn_shape), k=1)
    # the byte (uint8) type takes one byte per element
    subsequent_mask = torch.from_numpy(subsequent_mask).byte()
    return subsequent_mask

class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K, V, attn_mask):
        # In the encoder, Q, K, V are q_s, k_s, v_s of shape 1*8*4*64, and attn_mask is 1*8*4*4
        # scores: 1*8*4*4
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k) # scores : [batch_size x n_heads x len_q x len_k]
        # Where attn_mask is True the score is replaced by -1e9, a very small value, so the current token pays (almost) no attention to that position (i.e. [PAD])
        scores.masked_fill_(attn_mask, -1e9) # Fills elements of self tensor with value where mask is True.
        # Softmax(dim=-1) normalizes over the last axis (the keys), so each query row of each head sums to 1
        attn = nn.Softmax(dim=-1)(scores) # the attention of every token over every token
        context = torch.matmul(attn, V) # matrix product of 1*8*4*4 and 1*8*4*64
        # context 1*8*4*64
        return context, attn

class MultiHeadAttention(nn.Module):
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        # Multiplying by n_heads projects all heads at once, so they are computed in parallel
        self.W_Q = nn.Linear(d_model, d_k * n_heads) # 512*(64*8)
        self.W_K = nn.Linear(d_model, d_k * n_heads) # 512*(64*8)
        self.W_V = nn.Linear(d_model, d_v * n_heads) # 512*(64*8)
        self.linear = nn.Linear(n_heads * d_v, d_model) # (8*64) * 512
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, Q, K, V, attn_mask):
        # In the encoder the inputs are enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask, of shapes 1*4*512 and 1*4*4
        # q: [batch_size x len_q x d_model], k: [batch_size x len_k x d_model], v: [batch_size x len_k x d_model]
        # residual keeps the input for the residual connection
        residual, batch_size = Q, Q.size(0) # Q.size(0)=1
        # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
        # W_Q is the query weight matrix; W_Q(Q) has shape 1*4*(64*8)
        # view() reshapes 1*4*(64*8) into 1*4*8*64
        # transpose swaps axis=1 and axis=2, so the result is 1*8*4*64
        q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # q_s: [batch_size x n_heads x len_q x d_k]
        k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # k_s: [batch_size x n_heads x len_k x d_k]
        v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1,2)  # v_s: [batch_size x n_heads x len_k x d_v]
        # attn_mask.unsqueeze(1) has shape 1*1*4*4
        # repeat copies it n_heads times along axis=1, giving 1*8*4*4
        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1) # attn_mask : [batch_size x n_heads x len_q x len_k]
        # With q, k, v computed, run Scaled Dot Product Attention
        # context: [batch_size x n_heads x len_q x d_v], attn: [batch_size x n_heads x len_q x len_k]
        context, attn = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
        # context 1*8*4*64
        # after transpose it is 1*4*8*64; view turns it into 1*4*(8*64)
        # view is used here to concatenate the n_heads matrices
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v) # context: [batch_size x len_q x n_heads * d_v]
        output = self.linear(context)
        return self.layer_norm(output + residual), attn # output: [batch_size x len_q x d_model]

class PoswiseFeedForwardNet(nn.Module):
    def __init__(self):
        super(PoswiseFeedForwardNet, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, inputs):
        residual = inputs # inputs : [batch_size, len_q, d_model]
        output = nn.ReLU()(self.conv1(inputs.transpose(1, 2)))
        output = self.conv2(output).transpose(1, 2)
        return self.layer_norm(output + residual)

class EncoderLayer(nn.Module):
    def __init__(self):
        super(EncoderLayer, self).__init__()
        self.enc_self_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, enc_inputs, enc_self_attn_mask):
        # Multi-head attention; enc_inputs here is already the embedded matrix, 1*4*512
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask) # enc_inputs to same Q,K,V
        # Then the feed-forward network, ReLU(x W_1 + b_1) W_2 + b_2 (two kernel-size-1 Conv1d layers, bias included), followed by the residual connection and layer_norm
        enc_outputs = self.pos_ffn(enc_outputs) # enc_outputs: [batch_size x len_q x d_model]
        return enc_outputs, attn


class DecoderLayer(nn.Module):
    def __init__(self):
        super(DecoderLayer, self).__init__()
        self.dec_self_attn = MultiHeadAttention()
        self.dec_enc_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, dec_inputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask):
        # Inputs from Decoder: dec_outputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask
        # For the first decoder layer dec_inputs is the embedded target sequence; for later layers it is the output of the previous decoder layer
        # enc_outputs is the output of the last encoder layer, the same for every decoder layer
        # First multi-head attention (masked self-attention): q, k, v are all dec_inputs
        dec_outputs, dec_self_attn = self.dec_self_attn(dec_inputs, dec_inputs, dec_inputs, dec_self_attn_mask)
        # Second multi-head attention (encoder-decoder attention): q, k, v are dec_outputs, enc_outputs, enc_outputs
        # q comes from the sub-layer above, while k and v are the output of the last encoder layer
        dec_outputs, dec_enc_attn = self.dec_enc_attn(dec_outputs, enc_outputs, enc_outputs, dec_enc_attn_mask)
        dec_outputs = self.pos_ffn(dec_outputs)
        return dec_outputs, dec_self_attn, dec_enc_attn

class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()
        self.src_emb = nn.Embedding(src_vocab_size, d_model) # 4 * 512, word-embedding matrix
        # Positional-encoding table of shape 4*512 (src_len * d_model); it is frozen, so it is never updated
        self.pos_emb = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(src_len, d_model),freeze=True)
        # Stack n_layers (= 6) encoder layers
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])

    def forward(self, enc_inputs): # enc_inputs : [batch_size x source_len]
        # Forward pass for enc_inputs = "Je suis etudiant [PAD]"
        # enc_inputs: tensor([[1, 2, 3, 0]])
        # Position indices 0..len-1 (not token ids) select the rows of the sinusoid table
        positions = torch.arange(enc_inputs.size(1)).unsqueeze(0) # 1*4
        enc_outputs = self.src_emb(enc_inputs) + self.pos_emb(positions) # 1*4*512
        enc_self_attn_mask = get_attn_pad_mask(enc_inputs, enc_inputs) # 1*4*4 boolean matrix
        enc_self_attns = []
        for layer in self.layers:
            # 1*4*512
            enc_outputs, enc_self_attn = layer(enc_outputs, enc_self_attn_mask)
            # Keep the attention of every encoder layer; it is used at the end for visualization (showgraph)
            enc_self_attns.append(enc_self_attn)
        return enc_outputs, enc_self_attns

class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        self.tgt_emb = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_emb = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(tgt_len, d_model),freeze=True)
        self.layers = nn.ModuleList([DecoderLayer() for _ in range(n_layers)])

    def forward(self, dec_inputs, enc_inputs, enc_outputs): # dec_inputs : [batch_size x target_len]
        # Inputs: dec_inputs, enc_inputs, enc_outputs
        # dec_inputs = [[5,1,2,3,4]] = '[START] i am a student'
        # Word embedding plus positional encoding; positions 0..tgt_len-1 index the sinusoid table
        # (a hard-coded [[5,1,2,3,4]] would be out of range for a table with tgt_len = 5 rows)
        positions = torch.arange(dec_inputs.size(1)).unsqueeze(0) # 1*5
        dec_outputs = self.tgt_emb(dec_inputs) + self.pos_emb(positions)
        # dec_inputs contains no 0, so dec_self_attn_pad_mask is a 1*5*5 matrix whose entries are all False
        dec_self_attn_pad_mask = get_attn_pad_mask(dec_inputs, dec_inputs)
        # Sequence (look-ahead) mask specific to the decoder: a 1*5*5 strictly upper-triangular matrix
        """
        dec_self_attn_subsequent_mask
        tensor([[[0, 1, 1, 1, 1],
                [0, 0, 1, 1, 1],
                [0, 0, 0, 1, 1],
                [0, 0, 0, 0, 1],
                [0, 0, 0, 0, 0]]], dtype=torch.uint8)
        """
        dec_self_attn_subsequent_mask = get_attn_subsequent_mask(dec_inputs)

        # torch.gt(a, b) compares elementwise: True where a is strictly greater than b, False otherwise
        """
        dec_self_attn_mask
        tensor([[[False,  True,  True,  True,  True],
         [False, False,  True,  True,  True],
         [False, False, False,  True,  True],
         [False, False, False, False,  True],
         [False, False, False, False, False]]])
        """
        dec_self_attn_mask = torch.gt((dec_self_attn_pad_mask + dec_self_attn_subsequent_mask), 0)

        # Mask for the encoder-decoder attention matrix, of shape batch_size * max output length * max input length = 1*5*4
        """
        dec_enc_attn_mask
        tensor([[[False, False, False,  True],
         [False, False, False,  True],
         [False, False, False,  True],
         [False, False, False,  True],
         [False, False, False,  True]]])
        """
        dec_enc_attn_mask = get_attn_pad_mask(dec_inputs, enc_inputs)

        dec_self_attns, dec_enc_attns = [], []
        for layer in self.layers:
            dec_outputs, dec_self_attn, dec_enc_attn = layer(dec_outputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask)
            dec_self_attns.append(dec_self_attn)
            dec_enc_attns.append(dec_enc_attn)
        return dec_outputs, dec_self_attns, dec_enc_attns

class Transformer(nn.Module):
    def __init__(self):
        super(Transformer, self).__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
        self.projection = nn.Linear(d_model, tgt_vocab_size, bias=False) # final fully connected layer, predicts which vocabulary token comes next
    def forward(self, enc_inputs, dec_inputs):
        # enc_inputs is a 1*4 matrix
        enc_outputs, enc_self_attns = self.encoder(enc_inputs)
        # enc_outputs is the output of the last encoder layer: 1*4*512
        # enc_self_attns holds the self-attention of every encoder layer: a list of tensors, each 1*8*4*4
        # dec_inputs is 1*5 and enc_outputs is 1*4*512; the decoder takes the decoder input, the encoder input and the encoder output
        dec_outputs, dec_self_attns, dec_enc_attns = self.decoder(dec_inputs, enc_inputs, enc_outputs)
        dec_logits = self.projection(dec_outputs)
        # dec_logits : [batch_size x tgt_len x tgt_vocab_size]
        return dec_logits.view(-1, dec_logits.size(-1)), enc_self_attns, dec_self_attns, dec_enc_attns


def showgraph(attn):
    attn = attn[-1].squeeze(0)[0]
    attn = attn.squeeze(0).data.numpy()
    fig = plt.figure(figsize=(n_heads, n_heads)) # [n_heads, n_heads]
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attn, cmap='viridis')
    ax.set_xticklabels(['']+sentences[0].split(), fontdict={'fontsize': 14}, rotation=90)
    ax.set_yticklabels(['']+sentences[2].split(), fontdict={'fontsize': 14})
    plt.show()

if __name__ == '__main__':
    sentences = ['Je suis etudiant [PAD]', '[START] i am a student', 'i am a student [END]']

    src_vocab = {'[PAD]': 0, 'Je': 1, 'suis': 2, 'etudiant': 3}
    
    tgt_vocab = {'[PAD]': 0, 'i': 1, 'am': 2, 'a': 3, 'student': 4, '[START]': 5, '[END]': 6}
    number_dict = {i: w for i, w in enumerate(tgt_vocab)}

    src_vocab_size = len(src_vocab)
    tgt_vocab_size = len(tgt_vocab)

    src_len = 4 # length of source
    tgt_len = 5 # length of target

    d_model = 512  # Embedding Size
    d_ff = 2048  # FeedForward dimension
    d_k = d_v = 64  # dimension of K(=Q), V
    n_layers = 6  # number of Encoder of Decoder Layer
    n_heads = 8  # number of heads in Multi-Head Attention

    model = Transformer()

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # The inputs here are already padded
    enc_inputs, dec_inputs, target_batch = make_batch(sentences)

    for epoch in range(20):
        optimizer.zero_grad()
        # Forward pass: the input is a 1*4 matrix (1 example, sequence length 4)
        outputs, enc_self_attns, dec_self_attns, dec_enc_attns = model(enc_inputs, dec_inputs)
        loss = criterion(outputs, target_batch.contiguous().view(-1))
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))
        # Backward pass
        loss.backward()
        optimizer.step()

    # Test
    predict, _, _, _ = model(enc_inputs, dec_inputs)
    predict = predict.data.max(1, keepdim=True)[1]
    print(sentences[0], '->', [number_dict[n.item()] for n in predict.squeeze()])

    print('first head of last state ')
    showgraph(enc_self_attns)

    print('first head of last state dec_self_attns')
    showgraph(dec_self_attns)

    print('first head of last state dec_enc_attns')
    showgraph(dec_enc_attns)