{"introduction_zh":"

扩散模型[1][2]是当前图像生成和视频生成使用的主要方式,但由于其晦涩的理论,很多工程师并不能很好地理解。本文将提供一种非常直观易懂的方式,方便读者理解把握扩散模型的原理。特别地,将以互动的形式,以一维随机变量的扩散模型进行举例,直观解释扩散模型的多个有趣的性质。

\n

扩散模型是一个概率模型。概率模型主要提供两方面的功能:计算给定样本出现的概率;采样生成新样本。扩散模型侧重于第二方面,方便采样新样本,从而实现\"生成\"的任务。

\n

扩散模型与一般的概率模型(如GMM)不同:后者直接建模随机变量的概率分布,而扩散模型采用一种间接方式,利用"随机变量变换"的方式(如图1a),逐步将待建模的概率分布(数据分布)转变成"标准正态分布",同时,建模学习各个变换对应的后验概率分布(图1b-c)。有了最终的标准正态分布和各个后验概率分布,则可通过祖先采样(Ancestral Sampling)的方式,从反向逐步采样得到各个随机变量 $Z_T \ldots Z_2, Z_1, X$ 的样本。同时也可通过贝叶斯公式和全概率公式确定初始的数据分布 $q(x)$。

\n

可能会有这样的疑问:间接的方式需要建模学习T个后验概率分布,直接方式只需要建模学习一个概率分布,为什么要选择间接的方式呢?是这样子的:初始的数据分布可能很复杂,很难用一个概率模型直接表示;而对于间接的方式,各个后验概率分布的复杂度会简单许多,可以用简单的概率模型进行拟合。下面将会看到,当满足一些条件时,后验概率分布将非常接近高斯分布,所以可以使用简单的条件高斯模型进行建模。

\n
\n
Figure 1: Diffusion model schematic
","transform_zh":"

为了将初始的数据分布转换为简单的标准正态分布,扩散模型采用如下的变换方式\n\\begin{align}\n Z = \\sqrt{\\alpha} X + \\sqrt{1-\\alpha}\\epsilon \\qquad where \\quad \\alpha < 1, \\quad \\epsilon \\sim \\mathcal{N}(0, I) \\tag{1.1}\n\\end{align}\n其中 $X\\sim q(x)$ 是任意的随机变量,$Z\\sim q(z)$ 是变换后的随机变量。

\n

此变换可分为两个子变换。

\n

第一个子变换是对随机变量 $X$ 执行一个线性变换($\\sqrt{\\alpha}X$),根据文献[3]的结论,线性变换使 $X$ 的概率分布“变窄变高”,并且 $\\alpha$ 越小,“变窄变高”的程度越明显。具体可看Demo 1,左1图为随机生成的一维的概率分布,左2图是经过线性变换后的概率分布,可以看出,与左1图相比,左2图的曲线“变窄变高”了。读者可亲自测试不同的 $\\alpha$ 值,获得更直观的理解。

\n

第二个子变换是“加上独立的随机噪声”($\\sqrt{1-\\alpha}\\epsilon$),根据文献[4]的结论,“加上独立的随机变量”等效于对两个概率分布执行卷积,由于随机噪声的概率分布为高斯形状,所以相当于执行“高斯模糊”的操作。经过模糊后,原来的概率分布将变得更加平滑,与标准正态分布将更加相似。模糊的程度与噪声大小($1-\\alpha$)正相关。具体可看Demo 1,左1图是随机生成的一维概率分布,左3图是经过变换后的结果,可以看出,变换后的曲线变光滑了,棱角变少了。读者可测试不同的 $\\alpha$ 值,感受噪声大小对概率分布曲线形状的影响。左4图是综合两个子变换后的结果。
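下面给出一段示意性的Python草图(并非Demo 1的实现;网格范围、$q(x)$ 的形状与 $\\alpha$ 的取值均为随意假设),按上述两个子变换对离散化的一维概率密度做数值变换:

```python
import numpy as np

# 在均匀网格上离散化一维概率密度,演示式1.1的两个子变换
# (示意性草图,网格范围[-3,3]、alpha=0.5等均为假设值)
x = np.linspace(-3, 3, 601)
dx = x[1] - x[0]
q_x = np.exp(-((x - 1.0) ** 2) / 0.1) + 0.5 * np.exp(-((x + 1.5) ** 2) / 0.3)
q_x /= np.sum(q_x) * dx                      # 归一化成概率密度

alpha = 0.5

# 子变换1:线性变换 X -> sqrt(alpha)*X,密度“变窄变高”
q_linear = np.interp(x / np.sqrt(alpha), x, q_x, left=0, right=0) / np.sqrt(alpha)

# 子变换2:加独立高斯噪声,相当于与 N(0, 1-alpha) 的密度做卷积(高斯模糊)
noise_pdf = np.exp(-x**2 / (2 * (1 - alpha))) / np.sqrt(2 * np.pi * (1 - alpha))
q_z = np.convolve(q_linear, noise_pdf, mode="same") * dx

print("sum q_z * dx =", np.sum(q_z) * dx)    # 应接近1,说明输出仍是概率密度
```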

\n
","likelihood_zh":"

由变换的方式(式1.1)可以看出,前向条件概率 $q(z|x)$ 的概率分布为高斯分布,且只与 $\\alpha$ 的值有关,与 $q(x)$ 的概率分布无关。\n\\begin{align}\n q(z|x) &= \\mathcal{N}(\\sqrt{\\alpha}x,\\ 1-\\alpha) \\tag{2.1}\n\\end{align}\n具体可看Demo 2,左3图展示了 $q(z|x)$ 的形状,从图中可以看到一条均匀的斜线,这意味着 $q(z|x)$ 的均值与 $x$ 线性相关,方差固定不变。$\\alpha$ 值的大小将决定斜线宽度和倾斜程度。

\n
","posterior_zh":"

后验概率分布没有闭合的形式,但可以通过一些方法,推断其大概的形状,并分析影响其形状的因素。

\n

根据Bayes公式,有\n\\begin{align}\n q(x|z) = \\frac{q(z|x)q(x)}{q(z)} \\tag{3.1}\n\\end{align}

\n

zzz是取固定值时,q(z)q(z)q(z)是常数,所以q(xz)q(x|z)q(xz)是关于xxx的概率密度函数,并且其形状只与q(zx)q(x){q(z|x)q(x)}q(zx)q(x)有关。\nq(xz)=q(zx)q(x)where z is fixed\\begin{align}\n q(x|z) &=\\propto q(z|x)q(x) \\qquad \\text{where z is fixed} \\tag{3.2}\n\\end{align}q(xz)=∝q(zx)q(x)where z is fixed(3.2)

\n

实际上,$q(z)=\\int q(z|x)q(x)dx$,也就是说,$q(z)$ 是对函数 $q(z|x)q(x)$ 遍历 $x$ 求积分,所以,$q(z|x)q(x)$ 除以 $q(z)$ 相当于对 $q(z|x)q(x)$ 执行归一化。\n\\begin{align}\n q(x|z) = \\operatorname{Normalize}\\big(q(z|x)q(x)\\big) \\tag{3.3}\n\\end{align}

\n

由式2.1可知,$q(z|x)$ 为高斯分布,于是有\n\\begin{align}\n q(x|z) &\\propto \\frac{1}{\\sqrt{2\\pi(1-\\alpha)}}\\exp{\\frac{-(z-\\sqrt{\\alpha}x)^2}{2(1-\\alpha)}}\\ q(x)& \\qquad &\\text{where z is fixed} \\notag \\newline\n &= \\frac{1}{\\sqrt{\\alpha}}\\frac{1}{\\sqrt{2\\pi\\frac{1-\\alpha}{\\alpha}}}\\exp{\\frac{-(\\frac{z}{\\sqrt{\\alpha}}-x)^2}{2\\frac{1-\\alpha}{\\alpha}}}\\ q(x)& \\notag \\newline\n &= \\frac{1}{\\sqrt{\\alpha}} \\underbrace{\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}}_{\\text{GaussFun}}\\ q(x)& \\qquad &\\text{where}\\ \\mu=\\frac{z}{\\sqrt{\\alpha}}\\quad \\sigma=\\sqrt{\\frac{1-\\alpha}{\\alpha}} \\tag{3.4}\n\\end{align}

\n

可以看出,GaussFun部分是关于 $x$ 的高斯函数,均值为 $\\frac{z}{\\sqrt{\\alpha}}$,标准差为 $\\sqrt{\\frac{1-\\alpha}{\\alpha}}$,所以 $q(x|z)$ 的形状由“GaussFun与 $q(x)$ 相乘”决定。
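下面是按式3.3/3.4在离散网格上计算固定 $z$ 条件下后验 $q(x|z)$ 的示意代码(草图;$q(x)$、$\\alpha$、$z$ 的取值均为随意假设):

```python
import numpy as np

# 按式3.3/3.4:q(x|z) ∝ GaussFun(x) * q(x),再归一化
x = np.linspace(-3, 3, 601)
dx = x[1] - x[0]
q_x = np.exp(-((x - 1.0) ** 2) / 0.1) + 0.5 * np.exp(-((x + 1.5) ** 2) / 0.3)
q_x /= np.sum(q_x) * dx

alpha, z = 0.9, 0.5
mu = z / np.sqrt(alpha)                      # GaussFun的均值
sigma = np.sqrt((1 - alpha) / alpha)         # GaussFun的标准差
gauss_fun = np.exp(-((x - mu) ** 2) / (2 * sigma**2))

posterior = gauss_fun * q_x                  # 与 q(z|x)q(x) 只差一个常数因子
posterior /= np.sum(posterior) * dx          # 归一化,得到 q(x|z=0.5)
```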

\n

根据”乘法“的特点,可以总结q(xz)q(x|z)q(xz)函数形状具有的特点。

\n
    \n
  • $q(x|z)$ 的支撑集应该包含于GaussFun的支撑集,GaussFun的支撑集是一个超球体,中心位于均值 $\\mu$,半径约为3倍标准差 $\\sigma$。
  • \n
  • 当高斯函数的方差较小(较小噪声),或者 $q(x)$ 线性变化时,$q(x|z)$ 的形状将近似于高斯函数,函数形式较简单,方便建模学习。
  • \n
  • 当高斯函数的方差较大(较大噪声),或者 $q(x)$ 剧烈变化时,$q(x|z)$ 的形状将较复杂,与高斯函数有较大的差别,难以建模学习。
  • \n
\n\n

Appendix B给出了较严谨的分析:当 $\\sigma$ 满足一些条件时,$q(x|z)$ 将近似于高斯分布。

\n

具体可看Demo 2,左4图给出后验概率分布 $q(x|z)$ 的形态,可以看出,其形状较不规则,像一条弯曲且不均匀的曲线。当 $\\alpha$ 较大时(噪声较小),曲线将趋向于均匀且笔直。读者可调整不同的 $\\alpha$ 值,观察后验概率分布与噪声大小的关系;左5图,$\\textcolor{blue}{\\text{蓝色虚线}}$ 给出 $q(x)$,$\\textcolor{green}{\\text{绿色虚线}}$ 给出式3.4中的GaussFun,$\\textcolor{orange}{\\text{黄色实线}}$ 给出两者相乘并归一化的结果,即固定 $z$ 条件下的后验概率 $q(x|z=fixed)$。读者可调整不同的 $z$ 值,观察 $q(x)$ 的波动变化对后验概率 $q(x|z)$ 形态的影响。

\n

两个特殊状态下的后验概率分布q(xz)q(x|z)q(xz)值得考虑一下。

\n
    \n
  • α0\\alpha \\to 0α0时,GaussFun的标准差趋向于无穷大,GaussFun变成一个很大支撑集的近似的均匀分布,q(x)q(x)q(x)与均匀分布相乘结果仍为q(x)q(x)q(x),所以,不同zzz值对应的q(xz)q(x|z)q(xz)几乎变成一致,并与q(x)q(x)q(x)几乎相同。读者可在Demo 2中,将α\\alphaα设置为0.001,观察具体的结果。
  • \n
  • α1\\alpha \\to 1α1时,GaussFun的标准差趋向于无穷小,不同zzz值的q(xz)q(x|z)q(xz)收缩成一系列不同偏移量的Dirac delta函数, 偏移量等于zzz。但有一些例外,当q(x)q(x)q(x)存在为零的区域时,其对应的q(xz)q(x|z)q(xz)将不再为Dirac delta函数,而是零函数。可在Demo 2中,将α\\alphaα设置为0.999,观察具体的结果。
  • \n
\n\n

有一点需要注意一下,当 $\\alpha \\to 0$ 时,较大 $z$ 值对应的GaussFun的均值($\\mu=\\frac{z}{\\sqrt{\\alpha}}$)也急剧变大,也就是说,GaussFun位于离原点较远的地方,此时,$q(x)$ 的支撑集对应的GaussFun部分的“均匀程度”会略微有所下降,从而会略微降低 $q(x|z)$ 与 $q(x)$ 的相似度,但这种影响会随着 $\\alpha$ 减小而进一步降低。读者可在Demo 2中观察此影响,将 $\\alpha$ 设置为0.001,$q(x|z=-2)$ 与 $q(x)$ 会略微有一点差别,但 $q(x|z=0)$ 与 $q(x)$ 却看不出区别。

\n

关于高斯函数的\"均匀程度\",有如下两个特点:标准差越大,均匀程度越大;离均值越远,均匀程度越小。

\n
","forward_process_zh":"

对于任意的数据分布q(x)q(x)q(x),均可连续应用上述的变换(如式4.1~4.4),随着变换的次数的增多,输出的概率分布将变得越来越接近于标准正态分布。对于较复杂的数据分布,需要较多的次数或者较大的噪声。

\n

具体可看Demo 3.1,第一子图是随机生成的一维概率分布,经过7次的变换后,最终的概率分布与标准正态分布非常相似。相似的程度与迭代的次数和噪声大小正相关。对于相同的相似程度,如果每次所加的噪声较大(较小的α\\alphaα值),那所需变换的次数将较少。读者可尝试不同的α\\alphaα值和次数,观测最终概率分布的相似程度。

\n

起始概率分布的复杂度会比较高,随着变换的次数增多,概率分布q(zt)q(z_t)q(zt)的复杂度将会下降。根据第3节结论,更复杂的概率分布对应更复杂的后验概率分布,所以,为了保证后验概率分布与高斯函数较相似(较容易学习),在起始阶段,需使用较大的α\\alphaα(较小的噪声),后期阶段可适当使用较小的α\\alphaα(较大的噪声),加快向标准正态分布转变。

\n

Demo 3.1的例子可以看到,随着变换次数增多,q(zt)q(z_t)q(zt)的棱角变得越来越少,同时,后验概率分布q(zt1zt)q(z_{t-1}|z_t)q(zt1zt)图中的斜线变得越来越笔直匀称,越来越像条件高斯分布。

\n

\\begin{align}\n Z_1 &= \\sqrt{\\alpha_1} X + \\sqrt{1-\\alpha_1}\\epsilon_1 \\tag{4.1} \\newline\n Z_2 &= \\sqrt{\\alpha_2} Z_1 + \\sqrt{1-\\alpha_2}\\epsilon_2 \\tag{4.2} \\newline\n &\\dots \\notag \\newline\n Z_{t} &= \\sqrt{\\alpha_t}Z_{t-1} + \\sqrt{1-\\alpha_t}\\epsilon_{t} \\tag{4.3} \\newline\n &\\dots \\notag \\newline\n Z_{T} &= \\sqrt{\\alpha_T}Z_{T-1} + \\sqrt{1-\\alpha_T}\\epsilon_{T} \\tag{4.4} \\newline\n &where \\quad \\alpha_t < 1 \\qquad t\\in {1,2,\\dots,T} \\notag\n\\end{align}

\n

把式4.1代入式4.2,同时利用高斯分布的性质,可得出 $q(z_2|x)$ 的概率分布的形式\n\\begin{align}\n z_2 &= \\sqrt{\\alpha_2}(\\sqrt{\\alpha_1}x + \\sqrt{1-\\alpha_1}\\epsilon_1) + \\sqrt{1-\\alpha_2}\\epsilon_2 \\tag{4.5} \\newline\n &= \\sqrt{\\alpha_2\\alpha_1}x + \\sqrt{\\alpha_2-\\alpha_2\\alpha_1}\\epsilon_1 + \\sqrt{1-\\alpha_2}\\epsilon_2 \\tag{4.6} \\newline\n &= \\mathcal{N}(\\sqrt{\\alpha_1\\alpha_2}x,\\ 1-\\alpha_1\\alpha_2) \\tag{4.7}\n\\end{align}

\n

同理,可递推得出\n\\begin{align}\n q(z_t|x) &= \\mathcal{N}(\\sqrt{\\alpha_1\\alpha_2\\cdots\\alpha_t}x,\\ 1-\\alpha_1\\alpha_2\\cdots\\alpha_t) = \\mathcal{N}(\\sqrt{\\bar{\\alpha_t}}x,\\ 1-\\bar{\\alpha_t}) \\qquad where\\ \\bar{\\alpha_t} \\triangleq \\prod_{j=1}^t\\alpha_j \\tag{4.8}\n\\end{align}

\n

比较式4.8和式2.1的形式,可发现,两者的形式是完全一致的。

\n

如果只关注首尾两个变量之间的关系,那么连续t次的小变换可用一次大变换替代,大变换的α\\alphaα是各个小变换的α\\alphaα累积,因为两种变换对应的联合概率分布相同。

\n

读者可在Demo 3.1中做一个实验,对同样的输入分布 $q(x)$,使用两种不同的变换方式:1)使用三个变换,$\\alpha$ 均为0.95;2)使用一个变换,$\\alpha$ 设置为0.857375。分别执行变换,然后比较变换后的两个分布,将会看到,两个分布是完全相同的。
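下面的小片段按式4.8对这一等价性做一个简单的数值核对(示意性草图,$x$ 的取值为随意假设):

```python
import numpy as np

# 按式4.8:q(z_t|x) = N(sqrt(alpha_bar)*x, 1-alpha_bar)
alphas = [0.95, 0.95, 0.95]
alpha_bar_small = np.prod(alphas)        # 三个小变换的alpha累积
alpha_bar_big = 0.857375                 # 一次大变换的alpha

print(alpha_bar_small)                   # 0.857375,与一次大变换一致
x = 0.7                                  # 任取一个x
for a in (alpha_bar_small, alpha_bar_big):
    print("mean =", np.sqrt(a) * x, " var =", 1 - a)
```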

\n

在DDPM[2]论文中,作者使用了1000步(T=1000),将数据分布 $q(x)$ 转换至 $q(z_T)$。$q(z_T|x)$ 的概率分布如下:\n\\begin{align}\n q(z_T|x) &= \\mathcal{N}(0.00635\\ x,\\ 0.99998) \\tag{4.9}\n\\end{align}

\n

如果只考虑 $X,Z_T$ 的联合分布 $q(x,z_T)$,也可使用一次变换代替,变换如下:\n\\begin{align}\n Z_T = \\sqrt{0.0000403}\\ X + \\sqrt{1-0.0000403}\\ \\epsilon = 0.00635\\ X + 0.99998\\ \\epsilon \\tag{4.10}\n\\end{align}\n可以看出,应用两种变换后,变换后的分布 $q(z_T|x)$ 相同,因此,$q(x, z_T)$ 也相同。
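作为参考,下面的小片段按DDPM论文中的线性 $\\beta$ 调度($\\beta$ 从1e-4线性增大到0.02,共1000步,$\\alpha_t=1-\\beta_t$)计算累积的 $\\bar{\\alpha}_T$,结果与式4.10中的系数一致(示意性验证):

```python
import numpy as np

# DDPM的线性beta调度:beta从1e-4线性增大到0.02,共1000步,alpha_t = 1 - beta_t
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar_T = np.prod(1.0 - betas)

print(alpha_bar_T)                 # 约 4.0e-5,对应式4.10中的 0.0000403
print(np.sqrt(alpha_bar_T))        # 约 0.00635
print(np.sqrt(1 - alpha_bar_T))    # 约 0.99998
```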

\n
","backward_process_zh":"

如果知道了最终的概率分布q(zT)q(z_T)q(zT)及各个转换过程的后验概率q(xz),q(zt1zt)q(x|z),q(z_{t-1}|z_t)q(xz),q(zt1zt),则可通过“贝叶斯公式”和“全概率公式”恢复数据分布q(x)q(x)q(x),见式5.1~5.4。当最终的概率分布q(zT)q(z_T)q(zT)与标准正态分布很相似时,可用标准正态分布代替。

\n

具体可看Demo 3.2。示例中 $q(z_T)$ 使用 $\\mathcal{N}(0,1)$ 代替,同时通过JS Div给出了误差大小。恢复的概率分布 $q(z_t)$ 及 $q(x)$ 使用 $\\textcolor{green}{\\text{绿色曲线}}$ 标识,原始的概率分布使用 $\\textcolor{blue}{\\text{蓝色曲线}}$ 标识。可以看出,数据分布 $q(x)$ 能够被很好地恢复回来,并且误差(JS Divergence)会小于用标准正态分布替换 $q(z_T)$ 所引起的误差。\n\\begin{align}\n q(z_{T-1}) &= \\int q(z_{T-1},z_T)dz_T = \\int q(z_{T-1}|z_T)q(z_T)dz_T \\tag{5.1} \\newline\n & \\dots \\notag \\newline\n q(z_{t-1}) &= \\int q(z_{t-1},z_t)dz_t = \\int q(z_{t-1}|z_t)q(z_t)dz_t \\tag{5.2} \\newline\n & \\dots \\notag \\newline\n q(z_1) &= \\int q(z_1,z_2) dz_2 = \\int q(z_1|z_2)q(z_2)dz_2 \\tag{5.3} \\newline\n q(x) &= \\int q(x,z_1) dz_1 = \\int q(x|z_1)q(z_1)dz_1 \\tag{5.4} \\newline\n\\end{align}\n在本文中,将上述恢复过程(式5.1~5.4)所使用的变换称之为“后验概率变换”。例如,在式5.4中,变换的输入为概率分布函数 $q(z_1)$,输出为概率分布函数 $q(x)$,整个变换由后验概率分布 $q(x|z_1)$ 决定。此变换也可看作一组基函数的线性加权和,基函数为不同条件下的 $q(x|z_1)$,各个基函数的权重为 $q(z_1)$。在第7节,将会进一步介绍此变换的一些有趣性质。
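下面给出一段示意性的Python草图(网格、$q(x)$ 与 $\\alpha$ 均为随意假设),在离散网格上构造后验概率分布并执行一步“后验概率变换”(式5.4),验证能够恢复出 $q(x)$:

```python
import numpy as np

# 离散近似一步“后验概率变换”:q(x) = ∫ q(x|z) q(z) dz
x = np.linspace(-3, 3, 401)
dx = x[1] - x[0]
q_x = np.exp(-((x - 1.0) ** 2) / 0.1) + 0.5 * np.exp(-((x + 1.5) ** 2) / 0.3)
q_x /= q_x.sum() * dx

alpha = 0.9
z = x
# 前向条件概率 q(z|x) = N(sqrt(alpha)x, 1-alpha),离散成矩阵,行对应z,列对应x
q_z_given_x = np.exp(-(z[:, None] - np.sqrt(alpha) * x[None, :]) ** 2 / (2 * (1 - alpha)))
q_z_given_x /= q_z_given_x.sum(axis=0, keepdims=True) * dx

q_z = (q_z_given_x * q_x[None, :]).sum(axis=1) * dx          # 前向边缘分布 q(z)
q_x_given_z = q_z_given_x * q_x[None, :] / q_z[:, None]      # 后验 q(x|z),按行归一化

q_x_restored = (q_x_given_z * q_z[:, None]).sum(axis=0) * dx # 一步后验概率变换
print(np.abs(q_x_restored - q_x).max())                      # 应接近0,即恢复出q(x)
```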

\n

第3节中,我们考虑了两个特殊的后验概率分布。接下来,分析其对应的”后验概率变换“。

\n
    \n
  • α0\\alpha \\to 0α0时,不同zzz值的q(xz)q(x|z)q(xz)均与q(x)q(x)q(x)几乎相同,也就是说,线性加权和的基函数几乎相同。此状态下,不管输入如何变化,变换的输出总为q(x)q(x)q(x)
  • \n
  • α1\\alpha \\to 1α1时,不同zzz值的q(xz)q(x|z)q(xz)收缩成一系列不同偏移量的Dirac delta函数及零函数。此状态下,只要输入分布的支撑集(support set)包含于q(x)q(x)q(x)的支撑集,变换的输出与输入将保持一致。
  • \n
\n\n

第4节中提到,DDPM[2]论文所使用的1000次变换可使用一次变换表示:\n\\begin{align}\n Z_T = \\sqrt{0.0000403}\\ X + \\sqrt{1-0.0000403}\\ \\epsilon = 0.00635\\ X + 0.99998\\ \\epsilon \\tag{5.5}\n\\end{align}\n由于 $\\alpha=0.0000403$ 非常小,其对应的GaussFun(式3.4)的标准差达到157.52。如果把 $q(x)$ 的支撑集限制在单位超球范围内($\\lVert x \\rVert_2 < 1$),那当 $z_T \\in [-2, +2]$ 时,对应的各个 $q(x|z_T)$ 均与 $q(x)$ 非常相似。在这种状态下,对于 $q(x|z_T)$ 相应的后验概率变换,不管输入分布的形状如何,只要支撑集在 $[-2,+2]$ 范围内,其输出分布都将是 $q(x)$。

\n

所以,可以总结,在DPM模型中,如果q(x)q(x)q(x)的支撑集是有限的,并且最终变量ZTZ_TZT的信噪比足够大,那恢复q(x)q(x)q(x)的过程可以使用任意的分布,不必一定需要使用标准正态分布。

\n

读者可亲自做一个类似的实验。在Demo 3.1中,将start_alpha设置为0.25,end_alpha也设置为0.25,step设置为7,此时 $Z_7=\\sqrt{0.000061}X + \\sqrt{1-0.000061}\\epsilon$,与DDPM的 $q(z_T)$ 基本相似。点击apply执行前向变换($\\textcolor{blue}{\\text{蓝色曲线}}$),为接下来的反向恢复做准备。在Demo 3.2中,将noise_ratio设置为1,为末端分布 $q(z_7)$ 引入100%的噪声,切换nose_random_seed的值可改变噪声的分布,取消选择backward_pdf,减少画面的干扰。点击apply将通过后验概率变换恢复 $q(x)$,将会看到,不管输入的 $q(z_7)$ 的形状如何,恢复的 $q(x)$ 均与原始的 $q(x)$ 完全相同,JS Divergence为0,恢复的过程使用 $\\textcolor{red}{\\text{红色曲线}}$ 画出。

\n

另外有一点值得注意一下,在深度学习任务中,常将输入样本的各个维度缩放在[-1,1]范围内,也就是说在一个超立方体内(hypercube)。超立方体内任意两点的最大欧氏距离会随着维度的增多而变大,比如,对于一维,最大距离为 $2$;对于二维,最大距离为 $2\\sqrt{2}$;对于三维,最大距离为 $2\\sqrt{3}$;对于n维,最大距离为 $2\\sqrt{n}$。所以,对于维度较高的数据,需要 $Z_T$ 变量有更高的信噪比,才能让恢复过程的起始分布接受任意的分布。\n

\n
","fit_posterior_zh":"

第3节前半部分可知,各个后验概率分布是未知的,并且与q(x)q(x)q(x)有关。所以,为了恢复数据分布或者从数据分布中采样,需要对各个后验概率分布进行学习估计。

\n

第3节后半部分可知,当满足一定条件时,各个后验概率分布q(xz)q(zt1zt)q(x|z)、q(z_{t-1}|z_t)q(xz)q(zt1zt)近似于高斯概率分布,所以可通过构建一批条件高斯概率模型p(xz),p(zt1zt)p(x|z),p(z_{t-1}|z_t)p(xz),p(zt1zt),学习拟合对应的q(xz),q(zt1zt)q(x|z),q(z_{t-1}|z_t)q(xz),q(zt1zt)

\n

由于模型表示能力和学习能力的局限性,拟合过程会存在一定的误差,进一步会影响恢复q(x)q(x)q(x)的准确性。拟合误差大小与后验概率分布的形状有关。由第3节可知,当q(x)q(x)q(x)较复杂或者所加噪声较大时,后验概率分布会较复杂,与高斯分布差别较大,从而导致拟合误差,进一步影响恢复q(x)q(x)q(x)

\n

具体可看Demo 3.3,读者可测试不同复杂程度的q(x)q(x)q(x)α\\alphaα,观看后验概率分布q(zt1zt)q(z_{t-1}|z_t)q(zt1zt)的拟合程度,以及恢复q(x)q(x)q(x)的准确度。恢复的概率分布使用橙色\\textcolor{orange}{橙色}橙色标识,同时也通过JS divergence给出误差。

\n

关于拟合的目标函数,与其它概率模型类似,可 $\\textcolor{red}{\\text{优化交叉熵损失}}$,使 $p(z_{t-1}|z_t)$ 逼近于 $q(z_{t-1}|z_t)$。由于 $(z_{t-1}|z_t)$ 是条件概率,所以需要综合考虑各个条件,以各个条件发生的概率 $q(z_t)$ 加权平均各个条件对应的交叉熵。最终的损失函数形式如下:\n\\begin{align}\n loss &= -\\int q(z_t)\\ \\overbrace{\\int q(z_{t-1}|z_t) \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}}^{\\text{Cross Entropy}}\\ dz_t \\tag{6.1} \\newline\n &= -\\iint q(z_{t-1},z_t) \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}dz_t \\tag{6.2} \n\\end{align}\n也可以KL散度作为目标函数进行优化,KL散度与交叉熵是等价的[10]。\n\\begin{align}\nloss &= \\int q(z_t) KL(q(z_{t-1}|z_t) \\Vert \\textcolor{blue}{p(z_{t-1}|z_t)})dz_t \\tag{6.3} \\newline &= \\int q(z_t) \\int q(z_{t-1}|z_t) \\log \\frac{q(z_{t-1}|z_t)}{\\textcolor{blue}{p(z_{t-1}|z_t)}} dz_{t-1} dz_t \\tag{6.4} \\newline &= -\\int q(z_t)\\ \\underbrace{\\int q(z_{t-1}|z_t) \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}}_{\\text{Cross Entropy}}\\ dz_t + \\underbrace{\\int q(z_t) \\int q(z_{t-1}|z_t) \\log q(z_{t-1}|z_t) dz_{t-1} dz_t}_{\\text{Is Constant}} \\tag{6.5}\n\\end{align}

\n

式6.2的积分没有闭合的形式,不能直接优化。可使用蒙特卡罗(Monte Carlo)积分近似计算,新的目标函数如下:\n\\begin{align}\n loss &= -\\iint q(z_{t-1},z_t) \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}dz_t \\tag{6.6} \\newline\n &\\approx -\\frac{1}{N}\\sum_{i=1}^N \\log \\textcolor{blue}{p(Z_{t-1}^i|Z_t^i)} \\qquad where \\quad (Z_{t-1}^i,Z_t^i) \\sim q(z_{t-1},z_t) \\tag{6.7} \n\\end{align}

\n

上述的样本 $(Z_{t-1}^i,Z_t^i)$ 服从联合概率分布 $q(z_{t-1},z_t)$,可通过祖先采样的方式采样得到。具体方式如下:通过正向转换的方式(式4.1~4.4),逐步采样 $X,Z_1,Z_2\\dots Z_{t-1},Z_t$,然后留下 $(Z_{t-1},Z_t)$ 作为一个样本。但这种采样方式比较慢,可利用 $q(z_t|x)$ 概率分布已知的特点(式4.8)加速采样:先从 $q(x)$ 采样 $X$,然后由 $q(z_{t-1}|x)$ 采样 $Z_{t-1}$,最后由 $q(z_t|z_{t-1})$ 采样 $Z_t$,于是得到一个样本 $(Z_{t-1},Z_t)$。
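下面是按上述加速方式采样 $(Z_{t-1}, Z_t)$ 样本对的示意代码(草图;其中代替从 $q(x)$ 采样的样本集合与 $\\alpha_t$ 序列均为随意假设):

```python
import numpy as np

rng = np.random.default_rng(0)

alphas = np.array([0.98, 0.95, 0.9, 0.85])   # 假设的alpha序列(示意用)
t = 3                                        # 采样 (Z_{t-1}, Z_t) = (Z_2, Z_3)
alpha_bar_prev = np.prod(alphas[:t-1])       # alpha_1 * ... * alpha_{t-1}

# 代替从q(x)采样:假设的双峰样本集合
data_x = rng.choice([-1.5, 1.0], size=10000) + 0.1 * rng.standard_normal(10000)

# 1) 由 q(z_{t-1}|x) = N(sqrt(alpha_bar_{t-1}) x, 1-alpha_bar_{t-1}) 采样 Z_{t-1}
z_prev = np.sqrt(alpha_bar_prev) * data_x \
         + np.sqrt(1 - alpha_bar_prev) * rng.standard_normal(data_x.shape)

# 2) 由 q(z_t|z_{t-1}) = N(sqrt(alpha_t) z_{t-1}, 1-alpha_t) 采样 Z_t
a_t = alphas[t-1]
z_t = np.sqrt(a_t) * z_prev + np.sqrt(1 - a_t) * rng.standard_normal(z_prev.shape)

# (z_prev, z_t) 即服从 q(z_{t-1}, z_t) 的样本对,可代入式6.7做蒙特卡罗估计
```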

\n

可能有些人会有疑问,式6.3的形式跟DPM[1]和DDPM[2]论文里的形式不太一样。实际上,这两个目标函数是等价的,下面给出证明。

\n

对于一致项(Consistent Term),证明如下:

\n

\\begin{align}\n loss &= -\\iint q(z_{t-1},z_t)\\ \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}dz_t \\tag{6.8} \\newline\n &= -\\iint \\int q(x)q(z_{t-1}, z_t|x)dx\\ \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}dz_t \\tag{6.9} \\newline\n &= \\overbrace{\\iint \\int q(x)q(z_{t-1}, z_t|x) \\log q(z_{t-1}|z_t,x)dxdz_{t-1}dz_t}^{\\text{This Term Is Constant And Is Denoted As}\\ \\textcolor{orange}{C_1}} \\tag{6.10} \\newline\n &\\quad - \\iint \\int q(x)q(z_{t-1}, z_t|x) \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dxdz_{t-1}dz_t - \\textcolor{orange}{C_1} \\tag{6.11} \\newline\n &= \\iint \\int q(x)q(z_{t-1},z_t|x) \\log \\frac{q(z_{t-1}|z_t,x)}{\\textcolor{blue}{p(z_{t-1}|z_t)}}dxdz_{t-1}dz_t - \\textcolor{orange}{C_1} \\tag{6.12} \\newline\n &= \\iint q(x)q(z_t|x)\\int q(z_{t-1}|z_t,x) \\log \\frac{q(z_{t-1}|z_t,x)}{\\textcolor{blue}{p(z_{t-1}|z_t)}}dz_{t-1}\\ dz_tdx - \\textcolor{orange}{C_1} \\tag{6.13} \\newline\n &= \\iint \\ q(x)q(z_t|x) KL(q(z_{t-1}|z_t,x) \\Vert \\textcolor{blue}{p(z_{t-1}|z_t)}) dz_t dx - \\textcolor{orange}{C_1} \\tag{6.14} \\newline\n &\\propto \\iint \\ q(x)q(z_t|x) KL(q(z_{t-1}|z_t,x) \\Vert \\textcolor{blue}{p(z_{t-1}|z_t)}) dz_t dx \\tag{6.15} \\newline\n\\end{align}

\n

上式中的 $C_1$ 项是一个固定值,不包含待优化的参数,其中,$q(x)$ 是固定的概率分布,$q(z_{t-1}|z_t)$ 也是固定的概率分布,具体形式由 $q(x)$ 及系数 $\\alpha$ 确定。

\n

对于重构项(Reconstruction Term),可通过类似的方式证明:\n\\begin{align}\n loss &= -\\int q(z_1)\\overbrace{\\int q(x|z_1)\\log \\textcolor{blue}{p(x|z_1)}dx}^{\\text{Cross Entropy}}\\ dz_1 \\tag{6.16} \\newline\n &= -\\iint q(z_1,x)\\log \\textcolor{blue}{p(x|z_1)}dxdz_1 \\tag{6.17} \\newline\n &= -\\int q(x)\\int q(z_1|x)\\log \\textcolor{blue}{p(x|z_1)}dz_1\\ dx \\tag{6.18}\n\\end{align}

\n

因此,式6.1的目标函数与DPM的目标函数是等价的。

\n

根据一致项证明的结论,以及交叉熵与KL散度的关系,可得出一个有趣的结论:\n\\begin{align}\n\\mathop{\\min}_{\\textcolor{blue}{p}} \\int q(z_t) KL(q(z_{t-1}|z_t) \\Vert \\textcolor{blue}{p(z_{t-1}|z_t)})dz_t \\iff \\mathop{\\min}_{\\textcolor{blue}{p}} \\iint \\ q(z_t)q(x|z_t) KL(q(z_{t-1}|z_t,x) \\Vert \\textcolor{blue}{p(z_{t-1}|z_t)})dxdz_t \\tag{6.19} \n\\end{align}\n比较左右两边的式子,可以看出,右边的目标函数比左边的目标函数多了一个条件变量 $X$,同时也多了一个关于 $X$ 的积分,并且以 $X$ 发生的概率 $q(x|z_t)$ 作为积分的加权系数。

\n

依照类似的思路,可推导出一个更通用的关系:\n\\begin{align}\n\\mathop{\\min}_{\\textcolor{blue}{p}} KL(q(z) \\Vert \\textcolor{blue}{p(z)}) \\iff \\mathop{\\min}_{\\textcolor{blue}{p}} \\int \\ q(x) KL(q(z|x) \\Vert \\textcolor{blue}{p(z)})dx \\tag{6.20} \n\\end{align}\n关于此结论的详细推导,可见Appendix A。
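下面的小片段在离散网格上对式6.20做一个数值核对:两侧目标函数关于 $p$ 只相差一个与 $p$ 无关的常数,因此最优的 $p$ 相同(示意性草图,$q(x)$ 与两个候选 $p$ 均为随意构造):

```python
import numpy as np

x = np.linspace(-3, 3, 301)
z = np.linspace(-4, 4, 401)
dx, dz = x[1] - x[0], z[1] - z[0]

q_x = np.exp(-((x - 1.0) ** 2) / 0.2) + 0.6 * np.exp(-((x + 1.2) ** 2) / 0.3)
q_x /= q_x.sum() * dx

alpha = 0.7
q_z_given_x = np.exp(-(z[:, None] - np.sqrt(alpha) * x[None, :]) ** 2 / (2 * (1 - alpha)))
q_z_given_x /= q_z_given_x.sum(axis=0, keepdims=True) * dz
q_z = (q_z_given_x * q_x[None, :]).sum(axis=1) * dx

def kl(a, b):                      # 离散近似的KL散度
    return np.sum(a * np.log((a + 1e-12) / (b + 1e-12))) * dz

def lhs(p):                        # KL(q(z) || p(z))
    return kl(q_z, p)

def rhs(p):                        # ∫ q(x) KL(q(z|x) || p(z)) dx
    return sum(q_x[j] * kl(q_z_given_x[:, j], p) for j in range(len(x))) * dx

p1 = np.exp(-z**2 / 2); p1 /= p1.sum() * dz
p2 = np.exp(-np.abs(z)); p2 /= p2.sum() * dz
print(lhs(p1) - lhs(p2), rhs(p1) - rhs(p2))   # 两个差值应几乎相等
```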

\n
","posterior_transform_zh":"

Non-expanding mapping and Stationary Distribution

\n\\begin{align}\n q(x) &= \\int q(x,z) dz = \\int q(x|z)q(z)dz \\tag{7.1}\n\\end{align}\n\n

根据Appendix B的Corollary 1和Corollary 2可知,后验概率变换是一个non-expanding mapping。也就是说,对任意的两个概率分布 $q_{i1}(z)$ 和 $q_{i2}(z)$,经过后验概率变换后得到 $q_{o1}(x)$ 和 $q_{o2}(x)$,$q_{o1}(x)$ 与 $q_{o2}(x)$ 的距离总是小于或等于 $q_{i1}(z)$ 与 $q_{i2}(z)$ 的距离。这里的距离可使用KL Divergence或Total Variation度量。\n\\begin{align}\n d(q_{o1}(x),\\ q_{o2}(x)) \\le d(q_{i1}(z),\\ q_{i2}(z)) \\tag{7.2}\n\\end{align}\n根据Appendix B的分析可知,在大多数情况下,上述的等号并不会成立。并且,$\\alpha$ 越小时(噪声越多),$d(q_{o1},q_{o2})$ 相比 $d(q_{i1},q_{i2})$ 会越小。

\n

读者可查看Demo 4.1,左侧三个图呈现一个变换的过程,左1图是任意的数据分布 $q(x)$,左3图是变换后的概率分布,左2图是后验概率分布。可更改随机种子生成新的数据分布,调整 $\\alpha$ 值引入不同程度的噪声。左侧最后两个图展示变换的“压缩性质”,左4图展示随机生成的两个输入分布,同时给出其距离度量值 $div_{in}$;左5图展示经过变换后的两个输出分布,输出分布之间的距离标识为 $div_{out}$。读者可改变输入的随机种子,切换不同的输入。可在图中看到,对于任意的输入,$div_{out}$ 总是小于 $div_{in}$。另外,也可改变 $\\alpha$ 的值,将会看到,$\\alpha$ 越小(噪声越大),$\\frac{div_{out}}{div_{in}}$ 的比值也越小,即收缩率越大。

\n

根据Appendix C的分析可知:后验概率变换可视为markov chain的一步跳转,并且,当 $q(x)$ 和 $\\alpha$ 满足一些条件时,此markov chain会收敛于唯一的稳态分布。另外,通过大量实验发现,稳态分布与数据分布 $q(x)$ 非常相似,当 $\\alpha$ 越小时,稳态分布与 $q(x)$ 越相似。特别地,根据第5节的结论,当 $\\alpha \\to 0$ 时,经过一步变换后,输出分布即是 $q(x)$,所以稳态分布必定是 $q(x)$。

\n

读者可看Demo 4.2,此部分展示迭代收敛的例子。选择合适的迭代次数,点中“apply iteration transform”,将逐步画出迭代的过程,每个子图均会展示各自变换后的输出分布(绿色曲线\\textcolor{green}{绿色曲线}绿色曲线),收敛的参考点分布q(x)q(x)q(x)蓝色曲线\\textcolor{blue}{蓝色曲线}蓝色曲线画出,同时给出输出分布与q(x)q(x)q(x)之间的距离distdistdist。可以看出,随着迭代的次数增加,输出分布与q(x)q(x)q(x)越来越相似,并最终会稳定在q(x)q(x)q(x)附近。对于较复杂的分布,可能需要较多迭代的次数或者较大的噪声。迭代次数可以设置为上万步,但会花费较长时间。

\n

对于一维离散的情况,$q(x|z)$ 将离散成一个矩阵(记为 $Q_{x|z}$),$q(z)$ 离散成一个向量(记为 $\\boldsymbol{q_i}$),积分操作 $\\int q(x|z)q(z)dz$ 将离散成“矩阵-向量”乘法操作,所以后验概率变换可写成\n\\begin{align}\n \\boldsymbol{q_o} &= Q_{x|z}\\ \\boldsymbol{q_i} & \\quad\\quad &\\text{1 iteration} \\tag{7.3} \\newline\n \\boldsymbol{q_o} &= Q_{x|z}\\ Q_{x|z}\\ \\boldsymbol{q_i} & \\quad\\quad &\\text{2 iteration} \\tag{7.4} \\newline\n & \\dots & \\notag \\newline\n \\boldsymbol{q_o} &= (Q_{x|z})^n\\ \\boldsymbol{q_i} & \\quad\\quad &\\text{n iteration} \\tag{7.5} \\newline\n\\end{align}\n于是,为了更深入地理解变换的特点,Demo 4.2也画出矩阵 $(Q_{x|z})^n$ 的结果。从图里可以看到,当迭代趋向收敛时,矩阵 $(Q_{x|z})^n$ 的行向量将变成一个常数向量,即向量的各分量都相等。在二维密度图里将表现为一条横线。
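下面给出一段示意性的Python草图($q(x)$、$\\alpha$ 与初始输入分布均为随意假设),把后验概率变换离散成矩阵 $Q_{x|z}$ 并反复迭代,观察输出分布趋向稳态分布:

```python
import numpy as np

# 将后验概率变换离散成矩阵 Q_{x|z},反复迭代 q_o = Q_{x|z} q_i
x = np.linspace(-3, 3, 301)
dx = x[1] - x[0]
q_x = np.exp(-((x - 1.0) ** 2) / 0.2) + 0.6 * np.exp(-((x + 1.2) ** 2) / 0.3)
q_x /= q_x.sum() * dx

alpha = 0.5
q_z_given_x = np.exp(-(x[:, None] - np.sqrt(alpha) * x[None, :]) ** 2 / (2 * (1 - alpha)))
q_z_given_x /= q_z_given_x.sum(axis=0, keepdims=True) * dx       # 每个x对应的z密度
q_z = (q_z_given_x * q_x[None, :]).sum(axis=1) * dx
Q_x_given_z = (q_z_given_x * q_x[None, :] / q_z[:, None]).T      # 行对应x,列对应z

q_i = np.ones_like(x) / (x[-1] - x[0])                           # 任意输入:均匀分布
for n in range(200):
    q_i = Q_x_given_z @ q_i * dx                                 # 一步后验概率变换
print(np.abs(q_i - q_x).max())   # 迭代结果应与 q(x) 非常接近(与本节实验结论一致)
```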

\n

对于一维离散的markov chain,收敛速度与转移概率矩阵的第二大特征值的绝对值($\\lvert \\lambda_2 \\rvert$)反相关,$\\lvert \\lambda_2 \\rvert$ 越小,收敛速度越快。经过大量的实验发现,$\\alpha$ 与 $\\lvert \\lambda_2 \\rvert$ 有着明确的线性关系,$\\alpha$ 越小,$\\lvert \\lambda_2 \\rvert$ 也越小。所以,$\\alpha$ 越小(噪声越大),收敛速度越快。特别地,当 $\\alpha \\to 0$ 时,由第3节的结论可知,各个 $z$ 对应的后验概率分布趋向一致,而由文献[21]的Theorem 21可知,$\\lvert \\lambda_2 \\rvert$ 小于任意两个 $z$ 对应的后验概率分布的L1距离,所以,可知 $\\lvert \\lambda_2 \\rvert \\to 0$。
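下面的小片段沿用上一段代码的离散化方式,计算不同 $\\alpha$ 下转移矩阵第二大特征值的绝对值(示意性草图,仅用于数值观察):

```python
import numpy as np

def second_eigenvalue(alpha, x, q_x, dx):
    q_z_given_x = np.exp(-(x[:, None] - np.sqrt(alpha) * x[None, :]) ** 2 / (2 * (1 - alpha)))
    q_z_given_x /= q_z_given_x.sum(axis=0, keepdims=True) * dx
    q_z = (q_z_given_x * q_x[None, :]).sum(axis=1) * dx
    Q = (q_z_given_x * q_x[None, :] / q_z[:, None]).T * dx   # 列随机矩阵,最大特征值为1
    eig = np.sort(np.abs(np.linalg.eigvals(Q)))[::-1]
    return eig[1]

x = np.linspace(-3, 3, 201)
dx = x[1] - x[0]
q_x = np.exp(-((x - 1.0) ** 2) / 0.2) + 0.6 * np.exp(-((x + 1.2) ** 2) / 0.3)
q_x /= q_x.sum() * dx
for a in (0.1, 0.5, 0.9):
    print(a, second_eigenvalue(a, x, q_x, dx))    # 观察 |λ2| 随 alpha 的变化
```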

\n
\n

Anti-noise Capacity In Restoring Data Distribution

\n由上面的分析可知,在大多数情况下,“后验概率变换”是一个收缩映射,所以存在如下的关系:\n\\begin{align}\n d(q(x),\\ q_o(x)) < d(q(z),\\ q_i(z)) \\tag{7.12}\n\\end{align}\n其中,$q(z)$ 是理想的输入分布,$q(x)$ 是理想的输出分布,$q(x)=\\int q(x|z)q(z)dz$;$q_i(z)$ 是任意的输入分布,$q_o(x)$ 是变换后的输出分布,$q_o(x)=\\int q(x|z)q_i(z)dz$。\n\n

上式表明,输出的分布 $q_o(x)$ 与理想输出分布 $q(x)$ 之间的距离总会小于输入分布 $q_i(z)$ 与理想输入分布 $q(z)$ 的距离。所以,“后验概率变换”天然具备一定的抵抗噪声能力。这意味着,在恢复 $q(x)$ 的过程中(第5节),哪怕输入的“末尾分布 $q(z_T)$”存在一定的误差,经过一系列变换后,输出的“数据分布 $q(x)$”的误差也会比输入的误差更小。

\n

具体可看Demo 3.2,通过增加“noise ratio”的值可以向“末尾分布 $q(z_T)$”添加噪声,点击“apply”按钮将逐步画出恢复的过程,恢复的分布以 $\\textcolor{red}{\\text{红色曲线}}$ 画出,同时也会通过JS散度标出误差的大小。将会看到,恢复的 $q(x)$ 的误差总是小于 $q(z_T)$ 的误差。

\n

由上面的讨论可知,α\\alphaα越小(即变换过程中使用的噪声越大),收缩映射的收缩率越大,相应地,抗噪声的能力也越强。特别地,当α0\\alpha \\to 0α0时,抗噪声能力无限大,不论多大噪声的输入,输出都为q(x)q(x)q(x)

\n
\n

Markov Chain Monte Carlo Sampling

\n\n

在DPM模型中,通常是通过Ancestral Sampling的方式进行采样。由上面的分析可知,当 $\\alpha$ 足够小时,后验概率变换的迭代会收敛于 $q(x)$,所以,也可通过Markov Chain Monte Carlo的方式进行采样,如图7.1所示。图中 $\\alpha$ 代表一个具有较大噪声的后验概率变换,较大的噪声使稳态分布更接近于数据分布 $q(x)$,但由第3节可知,较大噪声的后验概率分布不利于拟合,所以把较大噪声的后验概率变换分成多个小噪声的后验概率变换。

\n
\n
Figure 7.1: Markov Chain Monte Carlo Sampling
","deconvolution_zh":"

第1节中提到,式1.1的变换可分为两个子变换,第一个子变换为“线性变换”,第二个为“加上独立高斯噪声”。线性变换相当于对概率分布进行拉伸变换,所以存在逆变换。“加上独立高斯噪声”相当于对概率分布执行卷积操作,卷积操作可通过逆卷积恢复。所以,理论上,可通过“逆线性变换”和“逆卷积”从最终的概率分布 $q(z_T)$ 恢复数据分布 $q(x)$。

\n

但实际上,会存在一些问题。由于逆卷积对误差极为敏感,具有很高的输入灵敏度,很小的输入噪声就会引起输出极大的变化[11][12]。而在扩散模型中,会使用标准正态分布近似代替q(zT)q(z_T)q(zT),因此,在恢复的起始阶段就会引入噪声。虽然噪声较小,但由于逆卷积的敏感性,噪声会逐步放大,影响恢复。

\n

另外,也可以从另一个角度理解“逆卷积恢复”的不可行性。由于前向变换的过程(式4.1~4.4)是确定的,所以卷积核是固定的,因此,相应的“逆卷积变换”也是固定的。由于起始的数据分布 $q(x)$ 可以是任意的分布,所以,通过一系列固定的“卷积正变换”,可以将任意的概率分布转换成近似 $\\mathcal{N}(0,I)$ 的分布。如果“逆卷积变换”可行,则意味着,可用一个固定的“逆卷积变换”,将 $\\mathcal{N}(0,I)$ 分布恢复成任意的数据分布 $q(x)$,这明显是一个悖论:同一个输入,同一个变换,不可能会有多个输出。

\n
","reference_zh":"
","about_zh":"

APP: 本Web APP是使用Gradio开发,并部署在HuggingFace。由于资源有限(2核,16G内存),所以可能会响应较慢。为了更好地体验,建议从github复制源代码,在本地机器运行。本APP只依赖Gradio, SciPy, Matplotlib。

\n

Author: 郑镇鑫,资深视觉算法工程师,十年算法开发经历,曾就职于腾讯京东等互联网公司,目前专注于视频生成(类似Sora)。

\n

Email: blair.star@163.com

\n
","introduction_en":"

The Diffusion Probability Model[1][2] is currently the main method used in image and video generation, but due to its abstruse theory, many engineers are unable to understand it well. This article will provide a very easy-to-understand method to help readers grasp the principles of the Diffusion Model. Specifically, it will illustrate the Diffusion Model using examples of one-dimensional random variables in an interactive way, explaining several interesting properties of the Diffusion Model in an intuitive manner.

\n

The diffusion model is a probabilistic model. Probabilistic models mainly offer two functions: calculating the probability of a given sample appearing; and generating new samples. The diffusion model focuses on the latter aspect, facilitating the production of new samples, thus realizing the task of generation.

\n

The diffusion model differs from general probability models (such as GMM), which directly model the probability distribution of the random variable. The diffusion model adopts an indirect approach, which utilizes a random variable transform (shown in Figure 1a) to gradually convert the data distribution (the probability distribution to be modeled) into the standard normal distribution, and meanwhile models the posterior probability distribution corresponding to each transform (Figure 1b-c). Upon obtaining the final standard normal distribution and the posterior probability distributions, one can generate samples of each random variable $Z_T \\ldots Z_2,Z_1,X$ in reverse order through Ancestral Sampling. Simultaneously, the initial data distribution $q(x)$ can be determined by employing Bayes theorem and the law of total probability.

\n

One might wonder: indirect methods require modeling and learning T posterior probability distributions, while direct methods only need to model one probability distribution, Why would we choose the indirect approach? Here's the reasoning: the initial data distribution might be quite complex and hard to represent directly with a probability model. In contrast, the complexity of each posterior probability distribution in indirect methods is significantly simpler, allowing it to be approximated by simple probability models. As we will see later, given certain conditions, posterior probability distributions can closely resemble Gaussian distributions, thus a simple conditional Gaussian model can be used for modeling.

\n
\n
Figure 1: Diffusion probability model schematic
","transform_en":"

To transform the initial data distribution into a simple standard normal distribution, the diffusion model uses the following transformation: \n\\begin{align}\n Z = \\sqrt{\\alpha} X + \\sqrt{1-\\alpha}\\epsilon \\qquad where \\quad \\alpha < 1, \\quad \\epsilon \\sim \\mathcal{N}(0, I) \\tag{1.1}\n\\end{align}\nwhere $X\\sim q(x)$ is any random variable and $Z\\sim q(z)$ is the transformed random variable.

\n

This transformation can be divided into two sub-transformations.

\n

The first sub-transformation performs a linear transformation ($\\sqrt{\\alpha}X$) on the random variable $X$. According to the conclusion of the literature[3], the linear transformation makes the probability distribution of $X$ narrower and taller, and the smaller $\\alpha$ is, the more pronounced the narrowing and heightening become.

\n

This can be specifically seen in Demo 1, where the first figure depicts a randomly generated one-dimensional probability distribution, and the second figure represents the probability distribution after the linear transformation. It can be observed that the curve of the second figure has become narrower and taller compared to the first figure. Readers can experiment with different $\\alpha$ to gain a more intuitive understanding.

\n

The second sub-transformation is adding independent random noise ($\\sqrt{1-\\alpha}\\epsilon$). According to the conclusion of the literature[4], adding independent random variables is equivalent to performing convolution on the two probability distributions. Since the probability distribution of the random noise is Gaussian, it is equivalent to performing a Gaussian Blur operation. After blurring, the original probability distribution will become smoother and more similar to the standard normal distribution. The degree of blurring is directly proportional to the noise level ($\\sqrt{1-\\alpha}$).

\n

For specifics, one can see Demo 1, where the first figure is a randomly generated one-dimensional probability distribution, and the third figure is the result after the transformation. It can be seen that the transformed probability distribution curve is smoother and has fewer corners. Readers can test different $\\alpha$ values to see how the noise level affects the shape of the probability distribution. The last figure is the result after applying both sub-transformations.

\n
","likelihood_en":"

From the transformation method (equation 1.1), it can be seen that the forward conditional probability $q(z|x)$ is a Gaussian distribution, which is only related to the value of $\\alpha$, regardless of the probability distribution of $q(x)$. \n\\begin{align}\n q(z|x) &= \\mathcal{N}(\\sqrt{\\alpha}x,\\ 1-\\alpha) \\tag{2.1}\n\\end{align}\nIt can be understood by concrete examples in Demo 2. The third figure depicts the shape of $q(z|x)$. From the figure, a uniform slanting line can be observed. This implies that the mean of $q(z|x)$ is linearly related to $x$, and the variance is fixed. The magnitude of $\\alpha$ will determine the width and incline of the slanting line.
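A minimal numerical sketch of equation 2.1 is given below (the values of x, alpha and the sample size are arbitrary assumptions, not part of the demos):

```python
import numpy as np

# Empirically check q(z|x) = N(sqrt(alpha)*x, 1-alpha) for a fixed x (equation 2.1)
rng = np.random.default_rng(0)
alpha, x = 0.8, 1.3                       # arbitrary assumed values
eps = rng.standard_normal(1_000_000)
z = np.sqrt(alpha) * x + np.sqrt(1 - alpha) * eps

print(z.mean(), np.sqrt(alpha) * x)       # sample mean      vs  sqrt(alpha)*x
print(z.var(),  1 - alpha)                # sample variance  vs  1-alpha
```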

\n
","posterior_en":"

The posterior probability distribution does not have a closed form, but its shape can be inferred approximately through some technique.

\n

According to Bayes formula, we have\n\\begin{align}\n q(x|z) = \\frac{q(z|x)q(x)}{q(z)} \\tag{3.1}\n\\end{align}

\n

When $z$ is fixed, $q(z)$ is a constant, so $q(x|z)$ is a probability density function with respect to $x$, and its shape depends only on $q(z|x)q(x)$. \n\\begin{align}\n q(x|z) \\propto q(z|x)q(x) \\qquad where\\ z\\ is\\ fixed \\tag{3.2}\n\\end{align}

\n

In fact, $q(z)=\\int q(z|x)q(x)dx$, which means that $q(z)$ is the integral over $x$ of the function $q(z|x)q(x)$. Therefore, dividing $q(z|x)q(x)$ by $q(z)$ is equivalent to normalizing $q(z|x)q(x)$. \n\\begin{align}\n q(x|z) = \\operatorname{Normalize}\\big(q(z|x)q(x)\\big) \\tag{3.3}\n\\end{align}

\n

From Equation 2.1, we can see that $q(z|x)$ is a Gaussian distribution, so we have\n\\begin{align}\n q(x|z) &\\propto \\frac{1}{\\sqrt{2\\pi(1-\\alpha)}}\\exp{\\frac{-(z-\\sqrt{\\alpha}x)^2}{2(1-\\alpha)}}\\ q(x)& \\qquad &\\text{where z is fixed} \\notag \\newline\n &= \\frac{1}{\\sqrt{\\alpha}}\\frac{1}{\\sqrt{2\\pi\\frac{1-\\alpha}{\\alpha}}}\\exp{\\frac{-(\\frac{z}{\\sqrt{\\alpha}}-x)^2}{2\\frac{1-\\alpha}{\\alpha}}}\\ q(x)& \\notag \\newline\n &= \\frac{1}{\\sqrt{\\alpha}} \\underbrace{\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}}_{\\text{GaussFun}}\\ q(x)& \\qquad &\\text{where}\\ \\mu=\\frac{z}{\\sqrt{\\alpha}}\\quad \\sigma=\\sqrt{\\frac{1-\\alpha}{\\alpha}} \\tag{3.4}\n\\end{align}

\n

It can be observed that the GaussFun part is a Gaussian function of $x$, with a mean of $\\frac{z}{\\sqrt{\\alpha}}$ and a standard deviation of $\\sqrt{\\frac{1-\\alpha}{\\alpha}}$, so the shape of $q(x|z)$ is determined by the product of GaussFun and $q(x)$.

\n

According to the characteristics of multiplication, the characteristics of the shape of the q(xz)q(x|z)q(xz) function can be summarized.

\n
    \n
  • The support of $q(x|z)$ should be contained within the support of GaussFun. The support of GaussFun is a hypersphere, centered at the mean $\\mu$ with a radius of approximately 3 times the standard deviation $\\sigma$.
  • \n\n
  • When the variance of the Gaussian function is small (small noise), or when $q(x)$ changes linearly, the shape of $q(x|z)$ will approximate the Gaussian function and have a simpler form, which is convenient for modeling and learning.
  • \n \n
  • When the variance of the Gaussian function is large (large noise), or when $q(x)$ changes drastically, the shape of $q(x|z)$ will be more complex and differ greatly from a Gaussian function, which makes it difficult to model and learn.
  • \n
\n\n

Appendix B provides a more rigorous analysis: when $\\sigma$ satisfies certain conditions, $q(x|z)$ approximates a Gaussian distribution.

\n

The specifics can be seen in Demo 2. The fourth figure presents the shape of the posterior $q(x|z)$, which is irregular and resembles a curved and uneven line. As $\\alpha$ increases (noise decreases), the curve tends to be uniform and straight. Readers can adjust different $\\alpha$ values and observe the relationship between the shape of the posterior and the level of noise. In the last figure, the $\\textcolor{blue}{\\text{blue dash line}}$ represents $q(x)$, the $\\textcolor{green}{\\text{green dash line}}$ represents GaussFun in equation 3.4, and the $\\textcolor{orange}{\\text{orange curve}}$ represents the result of multiplying the two functions and normalizing, which is the posterior probability $q(x|z=fixed)$ under a fixed $z$ condition. Readers can adjust different values of $z$ to observe how the fluctuation of $q(x)$ affects the shape of the posterior probability $q(x|z)$.

\n

The posterior $q(x|z)$ under two special states is worth considering.

\n
    \n
  • As $\\alpha \\to 0$, the variance of GaussFun tends to $\\infty$, and GaussFun almost becomes a uniform distribution over a very large support. The result of multiplying $q(x)$ by the uniform distribution is still $q(x)$; therefore, $q(x|z)$ for different $z$ become almost identical, and almost the same as $q(x)$. Readers can set $\\alpha$ to 0.001 in Demo 2 to observe the specific results.
  • \n \n
  • As $\\alpha \\to 1$, the variance of GaussFun tends to $0$, and the $q(x|z)$ for different $z$ values contract into a series of Dirac delta functions with different offsets equal to $z$. However, there are some exceptions. When there are regions where $q(x)$ is zero, the corresponding $q(x|z)$ will no longer be a Dirac delta function, but a zero function. Readers can set $\\alpha$ to 0.999 in Demo 2 to observe the specific results.
  • \n
\n\n

There is one point to note. when α0\\alpha \\to 0α0, the mean of GaussFun corresponding for larger zzz values(μ=zα\\mu = \\frac{z}{\\sqrt{\\alpha}}μ=αz) also increases sharply. This means that GaussFun is located farther from the support of q(x)q(x)q(x). In this case, the \"uniformity\" of the part of GaussFun corresponding to the support of q(x)q(x)q(x) will slightly decrease, thereby slightly reducing the similarity between q(xz)q(x|z)q(xz) and q(x)q(x)q(x). However, this effect will further diminish as α\\alphaα decreases. Readers can observe this effect in Demo 2. Set α\\alphaα to 0.001, and you will see a slight difference between q(xz=2)q(x|z=-2)q(xz=2) and q(x)q(x)q(x), but no noticeable difference between q(xz=0)q(x|z=0)q(xz=0) and q(x)q(x)q(x).

\n

Regarding the \"uniformity\" of the Gaussian function, there are two characteristics: the larger the standard deviation, the greater the uniformity; the farther away from the mean, the smaller the uniformity.

\n
","forward_process_en":"

For any arbitrary data distribution $q(x)$, the transform in Section 1 (equation 1.1) can be applied continuously (equations 4.1~4.4). As the number of transforms increases, the output probability distribution becomes increasingly closer to the standard normal distribution. For more complex data distributions, more iterations or larger noise are needed.

\n

Specific details can be observed in Demo 3.1. The first figure illustrates a randomly generated one-dimensional probability distribution. After seven transforms, this distribution looks very similar to the standard normal distribution. The degree of similarity increases with the number of iterations and the level of the noise. Given the same degree of similarity, fewer transforms are needed if the noise added at each step is larger (smaller α\\alphaα value). Readers can try different α\\alphaα values and numbers of transforms to see how similar the final probability distribution is.

\n

The complexity of the initial probability distribution tends to be high, but as the number of transforms increases, the complexity of the probability distribution $q(z_t)$ will decrease. As concluded in Section 3, a more complex probability distribution corresponds to a more complex posterior probability distribution. Therefore, in order to ensure that the posterior probability distribution is more similar to a conditional Gaussian (easier to learn), a larger value of $\\alpha$ (smaller noise) should be used in the initial phase, and a smaller value of $\\alpha$ (larger noise) can be appropriately used in the later phase to accelerate the transition to the standard normal distribution.

\n

In the example of Demo 3.1, it can be seen that as the number of transforms increases, the corners of q(zt)q(z_t)q(zt) become fewer and fewer. Meanwhile, the slanting lines in the plot of the posterior probability distribution q(zt1zt)q(z_{t-1}|z_t)q(zt1zt) become increasingly straight and uniform, resembling more and more the conditional Gaussian distribution.

\n

Z1=α1X+1α1ϵ1Z2=α2Z1+1α2ϵ2Zt=αtZt1+1αtϵtZT=αTZT1+1αTϵTwhereαt<1t1,2,,T\\begin{align}\n Z_1 &= \\sqrt{\\alpha_1} X + \\sqrt{1-\\alpha_1}\\epsilon_1 \\tag{4.1} \\newline\n Z_2 &= \\sqrt{\\alpha_2} Z_1 + \\sqrt{1-\\alpha_2}\\epsilon_2 \\tag{4.2} \\newline\n &\\dots \\notag \\newline\n Z_{t} &= \\sqrt{\\alpha_t}Z_{t-1} + \\sqrt{1-\\alpha_t}\\epsilon_{t} \\tag{4.3} \\newline\n &\\dots \\notag \\newline\n Z_{T} &= \\sqrt{\\alpha_T}Z_{T-1} + \\sqrt{1-\\alpha_T}\\epsilon_{T} \\tag{4.4} \\newline\n &where \\quad \\alpha_t < 1 \\qquad t\\in {1,2,\\dots,T} \\notag\n\\end{align}Z1Z2ZtZT=α1X+1α1ϵ1=α2Z1+1α2ϵ2=αtZt1+1αtϵt=αTZT1+1αTϵTwhereαt<1t1,2,,T(4.1)(4.2)(4.3)(4.4)

\n

By substituting Equation 4.1 into Equation 4.2, and utilizing the properties of Gaussian distribution, we can derive the form of q(z2x)q(z_2|x)q(z2x) \nz2=α2(α1x+1α1ϵ1)+1α2ϵ2=α2α1x+α2α2α1ϵ1+1α2ϵ2=N(α1α2x, 1α1α2)\\begin{align}\n z_2 &= \\sqrt{\\alpha_2}(\\sqrt{\\alpha_1}x + \\sqrt{1-\\alpha_1}\\epsilon_1) + \\sqrt{1-\\alpha_2}\\epsilon_2 \\tag{4.5} \\newline\n &= \\sqrt{\\alpha_2\\alpha_1}x + \\sqrt{\\alpha_2-\\alpha_2\\alpha_1}\\epsilon_1 + \\sqrt{1-\\alpha_2}\\epsilon_2 \\tag{4.6} \\newline\n &= \\mathcal{N}(\\sqrt{\\alpha_1\\alpha_2}x,\\ 1-\\alpha_1\\alpha_2) \\tag{4.7}\n\\end{align}z2=α2(α1x+1α1ϵ1)+1α2ϵ2=α2α1x+α2α2α1ϵ1+1α2ϵ2=N(α1α2x, 1α1α2)(4.5)(4.6)(4.7)

\n

In the same way, it can be deduced recursively that\nq(ztx)=N(α1α2αtx, 1α1α2αt)=N(αtˉx, 1αtˉ)where αtˉj=1tαj\\begin{align}\n q(z_t|x) &= \\mathcal{N}(\\sqrt{\\alpha_1\\alpha_2\\cdots\\alpha_t}x,\\ 1-\\alpha_1\\alpha_2\\cdots\\alpha_t) = \\mathcal{N}(\\sqrt{\\bar{\\alpha_t}}x,\\ 1-\\bar{\\alpha_t}) \\qquad where\\ \\bar{\\alpha_t} \\triangleq \\prod_{j=1}^t\\alpha_j \\tag{4.8}\n\\end{align}q(ztx)=N(α1α2αtx, 1α1α2αt)=N(αtˉx, 1αtˉ)where αtˉj=1tαj(4.8)

\n

Comparing the forms of Equation 4.8 and Equation 2.1, it can be found that their forms are completely consistent.

\n

If we only focus on the relationship between the initial and final random variables, then a sequence of t small transforms can be replaced by one large transform, and the α\\alphaα of the large transform is the accumulation of the α\\alphaα from each small transform, because the joint probability distributions corresponding to both types of transforms are the same.

\n

Readers can perform an experiment in Demo 3.1 using the same input distribution q(x)q(x)q(x) but with two different transform methods: 1) using three transformations, each with α\\alphaα equal to 0.95; 2) using a single transform with α\\alphaα set to 0.857375. Perform the transformations separately and then compare the two resulting distributions. You will see that the two distributions are identical.

\n

In the DDPM[2] paper, the authors used 1000 steps (T=1000) to transform the data distribution q(x)q(x)q(x) to q(zT)q(z_T)q(zT). The probability distribution of q(zTx)q(z_T|x)q(zTx) is as follows:\nq(zTx)=N(0.00635 x, 0.99998)\\begin{align}\n q(z_T|x) &= \\mathcal{N}(0.00635\\ x,\\ 0.99998) \\tag{4.9}\n\\end{align}q(zTx)=N(0.00635 x, 0.99998)(4.9)

\n

If only considering the joint distribution q(x,zT)q(x, z_T)q(x,zT), a single transformation can also be used as a substitute, which is as follows:\nZT=0.0000403 X+10.0000403 ϵ=0.00635 X+0.99998 ϵ\\begin{align}\n Z_T = \\sqrt{0.0000403}\\ X + \\sqrt{1-0.0000403}\\ \\epsilon = 0.00635\\ X + 0.99998\\ \\epsilon \\tag{4.10}\n\\end{align}ZT=0.0000403 X+10.0000403 ϵ=0.00635 X+0.99998 ϵ(4.10)\nIt can be seen that, after applying two transforms, the transformed distributions q(zTx)q(z_T|x)q(zTx) are the same. Thus, q(x,zT)q(x,z_T)q(x,zT) is also the same.

\n
","backward_process_en":"

If the final probability distribution q(zT)q(z_T)q(zT) and the posterior probabilities of each transform q(xz),q(zt1zt)q(x|z),q(z_{t-1}|z_t)q(xz),q(zt1zt) are known, the data distribution q(x)q(x)q(x) can be recovered through the Bayes Theorem and the Law of Total Probability, as shown in equations 5.1~5.4. When the final probability distribution q(zT)q(z_T)q(zT) is very similar to the standard normal distribution, the standard normal distribution can be used as a substitute.

\n

Specifics can be seen in Demo 3.2. In the example, $q(z_T)$ is replaced by $\\mathcal{N}(0,1)$, and the error magnitude is given through JS Divergence. The restored probability distributions $q(z_t)$ and $q(x)$ are plotted with the $\\textcolor{green}{\\text{green curve}}$, and the original probability distributions with the $\\textcolor{blue}{\\text{blue curve}}$. It can be observed that the data distribution $q(x)$ can be restored well, and the error (JS Divergence) will be smaller than the error caused by replacing $q(z_T)$ with the standard normal distribution.\n\\begin{align}\n q(z_{T-1}) &= \\int q(z_{T-1},z_T)dz_T = \\int q(z_{T-1}|z_T)q(z_T)dz_T \\tag{5.1} \\newline\n & \\dots \\notag \\newline\n q(z_{t-1}) &= \\int q(z_{t-1},z_t)dz_t = \\int q(z_{t-1}|z_t)q(z_t)dz_t \\tag{5.2} \\newline\n & \\dots \\notag \\newline\n q(z_1) &= \\int q(z_1,z_2) dz_2 = \\int q(z_1|z_2)q(z_2)dz_2 \\tag{5.3} \\newline\n q(x) &= \\int q(x,z_1) dz_1 = \\int q(x|z_1)q(z_1)dz_1 \\tag{5.4} \\newline\n\\end{align}\nIn this article, the transform used in the above restoring process (equations 5.1~5.4) is referred to as the Posterior Transform. For example, in equation 5.4, the input of the transform is the probability distribution function $q(z_1)$, and the output is the probability distribution function $q(x)$. The entire transform is determined by the posterior $q(x|z_1)$. This transform can also be considered as a linear weighted sum of a set of basis functions, where the basis functions are $q(x|z_1)$ under different $z_1$, and the weight of each basis function is $q(z_1)$. Some interesting properties of this transform will be introduced in Section 7.

\n

In Section 3, we have considered two special posterior probability distributions. Next, we analyze their corresponding posterior transforms.

\n
    \n
  • When $\\alpha \\to 0$, the $q(x|z)$ for different $z$ are almost the same as $q(x)$. In other words, the basis functions of the linear weighted sum are almost identical. In this state, no matter how the input changes, the output of the transform is always $q(x)$.
  • \n
  • When $\\alpha \\to 1$, the $q(x|z)$ for different $z$ values contract into a series of Dirac delta functions and zero functions. In this state, as long as the support of the input distribution is contained in the support of $q(x)$, the output of the transform will be identical to the input.
  • \n
\n\n

In Section 4, it is mentioned that the 1000 transforms used in DDPM[2] can be represented by a single transform:\n\\begin{align}\n Z_T = \\sqrt{0.0000403}\\ X + \\sqrt{1-0.0000403}\\ \\epsilon = 0.00635\\ X + 0.99998\\ \\epsilon \\tag{5.5}\n\\end{align}

\n

Since $\\alpha=0.0000403$ is very small, the corresponding standard deviation of GaussFun (Equation 3.4) reaches 157.52. If we constrain the support of $q(x)$ within the unit hypersphere ($\\lVert x \\rVert_2 < 1$), then for $z_T$ in the range $[-2, +2]$, each corresponding $q(x|z_T)$ is very similar to $q(x)$. In this state, for the posterior transform corresponding to $q(x|z_T)$, regardless of the shape of the input distribution, as long as its support is within the range $[-2,+2]$, the output distribution will be $q(x)$.
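The standard deviation quoted above follows directly from the definition of GaussFun in equation 3.4; a one-line check (sketch):

```python
import numpy as np

alpha = 0.0000403
sigma = np.sqrt((1 - alpha) / alpha)   # std of GaussFun in equation 3.4
print(sigma)                           # approximately 157.5
```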

\n

Furthermore, we can conclude that in the DPM model, if the support of q(x)q(x)q(x) is finite and the signal-to-noise ratio of the final variable ZTZ_TZT is sufficiently high, the process of restoring q(x)q(x)q(x) can use any distribution; it doesn't necessarily have to use the standard normal distribution.

\n

Readers can conduct a similar experiment themselves. In Demo 3.1, set start_alpha to 0.25, end_alpha to 0.25, and step to 7. At this point, $Z_7=\\sqrt{0.000061}X + \\sqrt{1-0.000061}\\epsilon$, which is roughly equivalent to DDPM's $q(z_T)$. Click apply to perform the forward transform (plotted with $\\textcolor{blue}{\\text{blue curves}}$), which prepares for the subsequent restoring process. In Demo 3.2, set noise_ratio to 1, introducing 100% noise into the tail distribution $q(z_7)$. Changing the value of nose_random_seed will change the distribution of the noise. Deselect backward_pdf to reduce screen clutter. Click apply to restore $q(x)$ through the posterior transform. You will see that, no matter what the shape of the input $q(z_7)$ may be, the restored $q(x)$ is always exactly the same as the original $q(x)$, and the JS Divergence is zero. The restoration process is plotted with the $\\textcolor{red}{\\text{red curve}}$.

\n

There is another point worth noting. In deep learning tasks, it is common to scale each dimension of the input within the range [-1, 1], which means within a unit hypercube. The maximum Euclidean distance between any two points in the hypercube increases with the dimensionality. For example, in one dimension the maximum distance is $2$, in two dimensions it is $2\\sqrt{2}$, in three dimensions $2\\sqrt{3}$, and in $n$ dimensions $2\\sqrt{n}$. Therefore, for data with higher dimensions, the variable $Z_T$ needs a higher signal-to-noise ratio to allow the starting distribution of the recovery process to be an arbitrary distribution.

\n
","fit_posterior_en":"

From the front part of Section 3, it is known that the posterior probability distributions are unknown and related to q(x)q(x)q(x). Therefore, in order to recover the data distribution or sample from it, it is necessary to learn and estimate each posterior probability distribution.

\n

From the latter part of Section 3, it can be understood that when certain conditions are met, each posterior probability distribution q(xz),q(zt1zt)q(x|z), q(z_{t-1}|z_t)q(xz),q(zt1zt) approximates the Gaussian probability distribution. Therefore, by constructing a set of conditional Gaussian probability models p(xz),p(zt1zt)p(x|z), p(z_{t-1}|z_t)p(xz),p(zt1zt), we can learn to fit the corresponding q(xz),q(zt1zt)q(x|z), q(z_{t-1}|z_t)q(xz),q(zt1zt).

\n

Due to the limitations of the model's representative and learning capabilities, there will be certain errors in the fitting process, which will further impact the accuracy of restored q(x)q(x)q(x). The size of the fitting error is related to the complexity of the posterior probability distribution. As can be seen from Section 3, when q(x)q(x)q(x) is more complex or the added noise is large, the posterior probability distribution will be more complex, and it will differ greatly from the Gaussian distribution, thus leading to fitting errors and further affecting the restoration of q(x)q(x)q(x).

\n

Refer to Demo 3.3 for the specifics. The reader can test different $q(x)$ and $\\alpha$, and observe the fitting degree of the posterior probability distribution $q(z_{t-1}|z_t)$ as well as the accuracy of the restored $q(x)$. The restored probability distribution is plotted in $\\textcolor{orange}{\\text{orange}}$, and the error is also measured by JS divergence.

\n

Regarding the objective function for fitting, similar to other probability models, the cross-entropy loss can be optimized to make p(zt1zt)p(z_{t-1}|z_t)p(zt1zt) approaching q(zt1zt)q(z_{t-1}|z_t)q(zt1zt). Since (zt1zt)(z_{t-1}|z_t)(zt1zt) is a conditional probability, it is necessary to fully consider all conditions. This can be achieved by averaging the cross-entropy corresponding to each condition weighted by the probability of each condition happening. The final form of the loss function is as follows.\nloss=q(zt)q(zt1zt)logp(zt1zt)dzt1Cross Entropy dzt=q(zt1,zt)logp(zt1zt)dzt1dzt\\begin{align}\n loss &= -\\int q(z_t) \\overbrace{\\int q(z_{t-1}|z_t) \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}}^{\\text{Cross Entropy}}\\ dz_t \\tag{6.1} \\newline\n &= -\\iint q(z_{t-1},z_t) \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}dz_t \\tag{6.2} \n\\end{align}loss=q(zt)q(zt1zt)logp(zt1zt)dzt1Cross Entropy dzt=q(zt1,zt)logp(zt1zt)dzt1dzt(6.1)(6.2)

\n

KL divergence can also be optimized as the objective function; KL divergence and cross-entropy are equivalent[10].\n\\begin{align}\nloss &= \\int q(z_t) KL(q(z_{t-1}|z_t) \\Vert \\textcolor{blue}{p(z_{t-1}|z_t)})dz_t \\tag{6.3} \\newline &= \\int q(z_t) \\int q(z_{t-1}|z_t) \\log \\frac{q(z_{t-1}|z_t)}{\\textcolor{blue}{p(z_{t-1}|z_t)}} dz_{t-1} dz_t \\tag{6.4} \\newline &= -\\int q(z_t)\\ \\underbrace{\\int q(z_{t-1}|z_t) \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}}_{\\text{Cross Entropy}}\\ dz_t + \\underbrace{\\int q(z_t) \\int q(z_{t-1}|z_t) \\log q(z_{t-1}|z_t) dz_{t-1} dz_t}_{\\text{Is Constant}} \\tag{6.5}\n\\end{align}

\n

The integral in equation 6.2 does not have a closed form and cannot be directly optimized. Monte Carlo integration can be used for approximate calculation. The new objective function is as follows:\n\\begin{align}\n loss &= -\\iint q(z_{t-1},z_t) \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}dz_t \\tag{6.6} \\newline\n &\\approx -\\frac{1}{N}\\sum_{i=1}^N \\log \\textcolor{blue}{p(Z_{t-1}^i|Z_t^i)} \\qquad where \\quad (Z_{t-1}^i,Z_t^i) \\sim q(z_{t-1},z_t) \\tag{6.7} \n\\end{align}

\n

The aforementioned samples $(Z_{t-1}^i,Z_t^i)$ follow the joint probability distribution $q(z_{t-1},z_t)$, and can be drawn via Ancestral Sampling. The specific method is as follows: sample $X,Z_1,Z_2 \\dots Z_{t-1},Z_t$ step by step through the forward transforms (equations 4.1~4.4), and then keep $(Z_{t-1},Z_t)$ as one sample. This sampling process is relatively slow. To speed it up, we can take advantage of the fact that the distribution $q(z_t|x)$ is known (equation 4.8): first sample $X$ from $q(x)$, then sample $Z_{t-1}$ from $q(z_{t-1}|x)$, and finally sample $Z_t$ from $q(z_t|z_{t-1})$. Thus, a sample $(Z_{t-1},Z_t)$ is obtained.

\n

Some people may question that the objective function in Equation 6.3 seems different from those in the DPM[1] and DDPM[2] papers. In fact, these two objective functions are equivalent, and the proof is given below.

\n

For Consistent Terms, the proof is as follows:

\n

\\begin{align}\n loss &= -\\iint q(z_{t-1},z_t)\\ \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}dz_t \\tag{6.8} \\newline\n &= -\\iint \\int q(x)q(z_{t-1}, z_t|x)dx\\ \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dz_{t-1}dz_t \\tag{6.9} \\newline\n &= \\overbrace{\\iint \\int q(x)q(z_{t-1}, z_t|x) \\log q(z_{t-1}|z_t,x)dxdz_{t-1}dz_t}^{\\text{This Term Is Constant And Is Denoted As}\\ \\textcolor{orange}{C_1}} \\tag{6.10} \\newline\n &\\quad - \\iint \\int q(x)q(z_{t-1}, z_t|x) \\log \\textcolor{blue}{p(z_{t-1}|z_t)}dxdz_{t-1}dz_t - \\textcolor{orange}{C_1} \\tag{6.11} \\newline\n &= \\iint \\int q(x)q(z_{t-1},z_t|x) \\log \\frac{q(z_{t-1}|z_t,x)}{\\textcolor{blue}{p(z_{t-1}|z_t)}}dxdz_{t-1}dz_t - \\textcolor{orange}{C_1} \\tag{6.12} \\newline\n &= \\iint q(x)q(z_t|x)\\int q(z_{t-1}|z_t,x) \\log \\frac{q(z_{t-1}|z_t,x)}{\\textcolor{blue}{p(z_{t-1}|z_t)}}dz_{t-1}\\ dz_t dx - \\textcolor{orange}{C_1} \\tag{6.13} \\newline\n &= \\iint \\ q(x)q(z_t|x) KL(q(z_{t-1}|z_t,x) \\Vert \\textcolor{blue}{p(z_{t-1}|z_t)}) dz_t dx - \\textcolor{orange}{C_1} \\tag{6.14} \\newline\n &\\propto \\iint \\ q(x)q(z_t|x) KL(q(z_{t-1}|z_t,x) \\Vert \\textcolor{blue}{p(z_{t-1}|z_t)}) dz_t dx \\tag{6.15} \\newline\n\\end{align}

\n

In the above formula, the term C1C_1C1 is a fixed value, which does not contain parameters to be optimized. Here, q(x)q(x)q(x) is a fixed probability distribution, and q(zt1zt)q(z_{t-1}|z_t)q(zt1zt) is also a fixed probability distribution, whose specific form is determined by q(x)q(x)q(x) and the coefficient α\\alphaα.

\n

For the Reconstruction Term, it can be proven in a similar way.

\n

\\begin{align}\n loss &= -\\int q(z_1)\\overbrace{\\int q(x|z_1)\\log \\textcolor{blue}{p(x|z_1)}dx}^{\\text{Cross Entropy}}\\ dz_1 \\tag{6.16} \\newline\n &= -\\iint q(z_1,x)\\log \\textcolor{blue}{p(x|z_1)}dxdz_1 \\tag{6.17} \\newline\n &= -\\int q(x)\\int q(z_1|x)\\log \\textcolor{blue}{p(x|z_1)}dz_1\\ dx \\tag{6.18}\n\\end{align}

\n

Therefore, the objective function in Equation 6.1 is equivalent to the original objective function of DPM.

\n

Based on the conclusion of the Consistent Terms proof and the relationship between cross entropy and KL divergence, an interesting conclusion can be drawn:\nminpq(zt)KL(q(zt1zt)p(zt1zt))dzt    minp q(zt)q(xzt)KL(q(zt1zt,x)p(zt1zt))dxdzt\\begin{align}\n\\mathop{\\min}_{\\textcolor{blue}{p}} \\int q(z_t) KL(q(z_{t-1}|z_t) \\Vert \\textcolor{blue}{p(z_{t-1}|z_t)})dz_t \\iff \\mathop{\\min}_{\\textcolor{blue}{p}} \\iint \\ q(z_t)q(x|z_t) KL(q(z_{t-1}|z_t,x) \\Vert \\textcolor{blue}{p(z_{t-1}|z_t)})dxdz_t \\tag{6.19} \n\\end{align}minpq(zt)KL(q(zt1zt)p(zt1zt))dztminp q(zt)q(xzt)KL(q(zt1zt,x)p(zt1zt))dxdzt(6.19)\nBy comparing the expressions on the left and right, it can be observed that the objective function on the right side includes an additional variable XXX compared to the left side. At the same time, there is an additional integral with respect to XXX, with the occurrence probability of XXX, denoted as q(xzt)q(x|z_t)q(xzt), serving as the weighting coefficient for the integral.

\n

Following a similar proof method, a more general relationship can be derived:\nminpKL(q(z)p(z))    minp q(x)KL(q(zx)p(z))dx\\begin{align}\n\\mathop{\\min}_{\\textcolor{blue}{p}} KL(q(z) \\Vert \\textcolor{blue}{p(z)}) \\iff \\mathop{\\min}_{\\textcolor{blue}{p}} \\int \\ q(x) KL(q(z|x) \\Vert \\textcolor{blue}{p(z)})dx \\tag{6.20} \n\\end{align}minpKL(q(z)p(z))minp q(x)KL(q(zx)p(z))dx(6.20)\nA detailed derivation of this conclusion can be found in Appendix A.
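As a quick sanity check of this equivalence, the sketch below evaluates both sides of Formula 6.20 on a one-dimensional grid for several candidate p(z); the two objectives should differ only by the constant H(Z) − H(Z|X), so they share the same minimizer. All distributions and grid settings here are arbitrary toy choices.

```python
import numpy as np

# Sketch: evaluate both sides of Formula 6.20 on a 1-D grid for several candidate
# p(z). The difference should be the constant H(Z) - H(Z|X), so both objectives
# share the same minimizer. q(x), q(z|x) and the grid are arbitrary toy choices.

rng = np.random.default_rng(1)
nx, nz = 8, 400
z = np.linspace(-4, 4, nz); dz = z[1] - z[0]

q_x = rng.random(nx); q_x /= q_x.sum()                        # discrete q(x)
mu_x = rng.normal(size=nx)
q_z_given_x = np.exp(-(z[None, :] - mu_x[:, None])**2 / 0.5)  # unnormalized rows
q_z_given_x /= q_z_given_x.sum(axis=1, keepdims=True) * dz    # q(z|x) densities
q_z = (q_x[:, None] * q_z_given_x).sum(axis=0)                # q(z) = sum_x q(x)q(z|x)

def kl(a, b):                                                 # KL on the grid
    m = a > 0
    return np.sum(a[m] * np.log(a[m] / b[m])) * dz

for scale in [0.5, 1.0, 2.0]:                                 # candidate p(z)
    p_z = np.exp(-z**2 / (2 * scale**2)); p_z /= p_z.sum() * dz
    lhs = kl(q_z, p_z)                                        # KL(q(z) || p(z))
    rhs = sum(q_x[i] * kl(q_z_given_x[i], p_z) for i in range(nx))
    print(f"scale={scale}: rhs - lhs = {rhs - lhs:.6f}")      # same value each time
```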

\n
","posterior_transform_en":"

\n

Non-expanding Mapping and Stationary Distribution

\nq(x)=q(x,z)dz=q(xz)q(z)dz\\begin{align}\n q(x) &= \\int q(x,z) dz = \\int q(x|z)q(z)dz \\tag{7.1}\n\\end{align}q(x)=q(x,z)dz=q(xz)q(z)dz(7.1)\n\n

According to Corollary 1 and Corollary 2 in Appendix B, the posterior transform is a non-expanding mapping. This means that for any two input probability distributions q_{i1}(z) and q_{i2}(z), the distance between the transformed distributions q_{o1}(x) and q_{o2}(x) is always less than or equal to the distance between q_{i1}(z) and q_{i2}(z). The distance here can be measured using KL Divergence or Total Variance.
\begin{align}
   d(q_{o1}(x),\ q_{o2}(x)) \le d(q_{i1}(z),\ q_{i2}(z)) \tag{7.2}
\end{align}
According to the analysis in Appendix B, the equality does not hold in most cases, and the posterior transform becomes a shrinking mapping. Furthermore, the smaller α is (the more noise), the smaller d(q_{o1}, q_{o2}) is compared to d(q_{i1}, q_{i2}).
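A minimal numerical counterpart of Demo 4.1 is sketched below: it discretizes q(x|z) from Formula 3.4 for a toy q(x) and checks that the Total Variance distance between two outputs never exceeds the distance between the inputs. The grid, α and q(x) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Sketch (a numerical counterpart of Demo 4.1): discretize q(x|z) from Formula 3.4
# for a toy q(x) and check that the Total Variance distance between two outputs
# never exceeds the distance between the inputs. Grid, alpha and q(x) are toy choices.

alpha = 0.7
x = np.linspace(-3, 3, 300); dx = x[1] - x[0]
z = x.copy()

q_x = np.exp(-(x + 1)**2 / 0.1) + 0.5 * np.exp(-(x - 1.2)**2 / 0.2)
q_x /= q_x.sum() * dx

mu = z / np.sqrt(alpha); sigma = np.sqrt((1 - alpha) / alpha)
Q = norm.pdf(x[None, :], loc=mu[:, None], scale=sigma) * q_x[None, :]  # row i: z_i
Q /= Q.sum(axis=1, keepdims=True) * dx                                  # q(x|z_i)

def transform(q_in):                     # q_out(x) = \int q(x|z) q_in(z) dz
    return Q.T @ q_in * dx

def tv(a, b):                            # Total Variance (L1) distance
    return np.abs(a - b).sum() * dx

q1 = norm.pdf(z, -0.5, 0.8); q1 /= q1.sum() * dx
q2 = norm.pdf(z,  1.0, 0.4); q2 /= q2.sum() * dx
print(tv(transform(q1), transform(q2)), "<=", tv(q1, q2))
```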

\n

Readers can refer to Demo 4.1, where the first three figures present a transform process. The first figure shows an arbitrary data distribution q(x), the second figure shows the posterior probability distribution q(x|z), and the third figure shows the transformed probability distribution. You can change the random seed to generate a new data distribution q(x), and adjust the value of α to introduce different degrees of noise.

\n

The last two figures show the contraction of the transform. The fourth figure displays two randomly generated input distributions and their distance div_{in}. The fifth figure displays the two output distributions after the transform, with their distance denoted as div_{out}.

\n

Readers can change the input random seed to toggle different inputs. It can be observed from the figures that div_{out} is always smaller than div_{in} for any input. Additionally, if you change the value of α, you will see that the smaller the α (larger noise), the smaller the ratio div_{out}/div_{in}, indicating a larger rate of contraction.

\n

According to the analysis in Appendix C: the posterior transform can be seen as a one-step jump of a Markov chain, and when q(x)q(x)q(x) and α\\alphaα meet certain conditions, this Markov chain will converge to a unique stationary distribution. Additionally, numerous experiments have shown that the stationary distribution is very similar to the data distribution q(x)q(x)q(x), and the smaller α\\alphaα is, the more similar the stationary distribution is to q(x)q(x)q(x). Specifically, according to the conclusion in Section 5, when α0\\alpha \\to 0α0, after one step of transform, the output distribution will be q(x)q(x)q(x), so the stationary distribution must be q(x)q(x)q(x).

\n

Readers can refer to Demo 4.2, which illustrates an example of applying the posterior transform iteratively. Choose an appropriate number of iterations and click the Apply button, and the iteration process will be drawn step by step. Each subplot shows the transformed output distribution (\textcolor{green}{\text{green curve}}) of each transform, together with the reference distribution q(x) shown as a \textcolor{blue}{\text{blue curve}}, as well as the distance div between the output distribution and q(x). It can be seen that as the number of iterations increases, the output distribution becomes more and more similar to q(x), and eventually stabilizes near q(x). For more complicated distributions, more iterations or greater noise may be required. The maximum number of iterations can be set to tens of thousands, but this will take longer.

\n

For the one-dimensional discrete case, q(x|z) is discretized into a matrix (denoted as Q_{x|z}) and q(z) is discretized into a vector (denoted as \boldsymbol{q_i}). The integration operation \int q(x|z)q(z)dz is discretized into a matrix-vector multiplication, so the posterior transform can be written as
\begin{align}
   \boldsymbol{q_o} &= Q_{x|z}\ \boldsymbol{q_i} & \quad\quad &\text{1 iteration} \tag{7.3} \newline
   \boldsymbol{q_o} &= Q_{x|z}\ Q_{x|z}\ \boldsymbol{q_i} & \quad\quad &\text{2 iterations} \tag{7.4} \newline
   & \dots & \notag \newline
   \boldsymbol{q_o} &= (Q_{x|z})^n\ \boldsymbol{q_i} & \quad\quad &\text{n iterations} \tag{7.5} \newline
\end{align}
To better understand the properties of the transform, the matrix (Q_{x|z})^n is also plotted in Demo 4.2. From the demo we can see that, as the iterations converge, each row vector of the matrix (Q_{x|z})^n becomes a constant vector, that is, all of its components are the same, which appears as a horizontal band in the density plot.
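The following sketch builds such a discrete, column-stochastic Q_{x|z} for a toy q(x) and raises it to a power; as described above, every row of (Q_{x|z})^n flattens to a constant, and every column approaches the same stationary vector, which is roughly q(x) when α is small. The grid, α and q(x) below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Sketch: build a discrete, column-stochastic Q_{x|z} (Formula 7.3) for a toy q(x)
# and raise it to a power. Every row of (Q_{x|z})^n flattens to a constant, and
# every column approaches the same stationary vector, roughly q(x) for small alpha.
# Grid, alpha and q(x) are toy choices.

alpha = 0.2
x = np.linspace(-3, 3, 200)
q_x = np.exp(-(x + 1)**2 / 0.1) + 0.6 * np.exp(-(x - 1)**2 / 0.3)
q_x /= q_x.sum()                                    # discrete q(x)

mu = x / np.sqrt(alpha); sigma = np.sqrt((1 - alpha) / alpha)
Q = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma) * q_x[:, None]
Q /= Q.sum(axis=0, keepdims=True)                   # column n is q(x | z = x_n)

Qn = np.linalg.matrix_power(Q, 50)                  # 50 posterior transforms
print(np.ptp(Qn, axis=1).max())                     # ~0: each row is constant
print(np.abs(Qn[:, 0] - q_x).max())                 # small: columns roughly q(x)
```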

\n

For a one-dimensional discrete Markov chain, the convergence rate is inversely related to the absolute value of the second largest eigenvalue of the transition probability matrix (λ2\\lvert \\lambda_2 \\rvertλ2). The smaller λ2\\lvert \\lambda_2 \\rvertλ2 is, the faster the convergence. Numerous experiments have shown that α\\alphaα has a clear linear relationship with λ2\\lvert \\lambda_2 \\rvertλ2; the smaller α\\alphaα is, the smaller λ2\\lvert \\lambda_2 \\rvertλ2 is. Therefore, the smaller α\\alphaα (the greater the noise), the faster the convergence rate. Specifically, when α0\\alpha \\to 0α0, according to the conclusion in Section 3, the posterior probability distributions corresponding to different zzz tend to be consistent. Additionally, according to Theorem 21 in [21], λ2\\lvert \\lambda_2 \\rvertλ2 is smaller than the L1 distance between any two posterior probability distributions corresponding to different zzz, so it can be concluded that λ20\\lvert \\lambda_2 \\rvert \\to 0λ20.
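The relationship between α and |λ₂| can be probed numerically. The sketch below computes the two largest eigenvalue magnitudes of the discretized transform matrix for a few values of α; the toy q(x) and the grid are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Sketch: compute the two largest eigenvalue magnitudes of the discretized
# posterior-transform matrix for several alpha values; smaller alpha should give a
# smaller |lambda_2| (faster convergence). q(x) and the grid are toy choices.

x = np.linspace(-3, 3, 200)
q_x = np.exp(-(x + 1)**2 / 0.1) + 0.6 * np.exp(-(x - 1)**2 / 0.3)
q_x /= q_x.sum()

for alpha in [0.9, 0.5, 0.1]:
    mu = x / np.sqrt(alpha); sigma = np.sqrt((1 - alpha) / alpha)
    Q = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma) * q_x[:, None]
    Q /= Q.sum(axis=0, keepdims=True)
    mags = np.sort(np.abs(np.linalg.eigvals(Q)))[::-1]
    print(f"alpha={alpha}: |lambda_1|={mags[0]:.3f}  |lambda_2|={mags[1]:.3f}")
```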

\n
\n

Anti-noise Capacity In Restoring Data Distribution

\n\n

From the above analysis, it can be seen that, in most cases, the posterior transform is a shrinking mapping, which means the following relationship

\n

d(q(x), qo(x))<d(q(z), qi(z))\\begin{align}\n d(q(x),\\ q_o(x)) < d(q(z),\\ q_i(z)) \\tag{7.12}\n\\end{align}d(q(x), qo(x))<d(q(z), qi(z))(7.12)

\n

Here, q(z) is the ideal input distribution and q(x) is the ideal output distribution, with q(x) = \int q(x|z) q(z) dz; q_i(z) is an arbitrary input distribution, and q_o(x) is its transformed output, q_o(x) = \int q(x|z) q_i(z) dz.

\n

The above inequality indicates that the distance between the output distribution q_o(x) and the ideal output distribution q(x) is always smaller than the distance between the input distribution q_i(z) and the ideal input distribution q(z). Hence, the posterior transform naturally possesses a certain ability to resist noise. This means that during the process of restoring q(x) (Section 5), even if the tail distribution q(z_T) contains some error, the error of the output distribution will be smaller than the error of the input after undergoing a series of transforms.

\n

Refer specifically to Demo 3.2, where noise can be added to the tail distribution q(z_T) by increasing the noise ratio. Clicking the "apply" button will gradually draw the restoration process, with the restored distribution shown as a \textcolor{red}{\text{red curve}} and the error measured by JS divergence. You will see that the error of the restored q(x) is always smaller than the error of q(z_T).

\n

From the above discussion, it can be seen that the smaller α is (the larger the noise used in the transform), the greater the shrinking rate of the mapping, and correspondingly, the stronger the resistance to errors. In particular, when α → 0, the noise-resistance capability becomes infinite: regardless of the magnitude of the error in the input, the output will always be q(x).

\n
\n

Markov Chain Monte Carlo Sampling

\n\n

In DPM models, sampling is typically performed using Ancestral Sampling. From the analysis above, it can be inferred that when α is sufficiently small, the posterior transform converges to q(x). Therefore, sampling can also be conducted with Markov Chain Monte Carlo (MCMC) methods, as depicted in Figure 7.1. In the figure, α represents a posterior transform with relatively large noise; the larger the noise, the closer the steady-state distribution is to the data distribution q(x). However, as discussed in Section 3, posterior transforms with larger noise are harder to fit. Therefore, a transform with larger noise is split into multiple transforms with smaller noise.

\n
\n
Figure 7.1: Markov Chain Monte Carlo Sampling
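Below is a rough sketch of this MCMC-style sampling with a single, fixed large-noise posterior transform: the discretized q(x|z) is used as a transition kernel, many chains are run for a short burn-in, and the final states are treated as approximate samples of q(x). The toy q(x), α, grid and chain length are illustrative assumptions, not the app's implementation.

```python
import numpy as np
from scipy.stats import norm

# Rough sketch of the MCMC-style sampling: use one fixed large-noise posterior
# q(x|z) as the transition kernel of a Markov chain, run many chains for a short
# burn-in, and keep the final states as approximate samples of q(x).
# q(x), alpha, the grid and the chain length are illustrative assumptions.

rng = np.random.default_rng(3)
alpha = 0.1
x = np.linspace(-3, 3, 300)
q_x = np.exp(-(x + 1)**2 / 0.1) + 0.6 * np.exp(-(x - 1)**2 / 0.3)
q_x /= q_x.sum()

mu = x / np.sqrt(alpha); sigma = np.sqrt((1 - alpha) / alpha)
Q = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma) * q_x[:, None]
Q /= Q.sum(axis=0, keepdims=True)                 # column n: q(x | z = x_n)

n_chains, burn_in = 2000, 20
state = rng.integers(0, len(x), size=n_chains)    # arbitrary starting states
for _ in range(burn_in):
    # one jump per chain: draw the next grid index from the current state's column
    state = np.array([rng.choice(len(x), p=Q[:, s]) for s in state])

samples = x[state]                                # approximately distributed as q(x)
```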
","deconvolution_en":"

As mentioned in Section 1, the transform of Equation 1.1 can be divided into two sub-transforms: the first is a linear transform and the second is the addition of independent Gaussian noise. The linear transform amounts to a scaling of the probability distribution, so it has an inverse. Adding independent Gaussian noise amounts to convolving the probability distribution with a Gaussian kernel, which can in principle be undone through deconvolution. Therefore, theoretically, the data distribution q(x) can be recovered from the final probability distribution q(z_T) through inverse linear transforms and deconvolutions.

\n

In practice, however, problems arise. Deconvolution is extremely sensitive to errors in its input: even a small amount of input noise can lead to large changes in the output[11][12]. Meanwhile, the diffusion model uses the standard normal distribution as an approximation of q(z_T), so some noise is already introduced at the initial stage of the restoration. Although this noise is relatively small, because of the sensitivity of deconvolution it is gradually amplified and degrades the recovery.
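This sensitivity can be reproduced in a few lines: blur a toy density with a fixed Gaussian kernel, perturb the blurred result by a tiny amount, and invert the convolution by FFT division. The grid, kernel width and noise level below are arbitrary illustrative choices.

```python
import numpy as np

# Sketch: blur a toy density with a fixed Gaussian kernel (the "add noise"
# sub-transform seen as a convolution), perturb the blurred result very slightly,
# then undo the convolution by FFT division. The tiny perturbation is amplified
# enormously. Grid, kernel width and noise level are arbitrary toy choices.

rng = np.random.default_rng(4)
n = 256
x = np.linspace(-6, 6, n); dx = x[1] - x[0]
q = np.exp(-(x + 1)**2 / 0.1) + 0.6 * np.exp(-(x - 1)**2 / 0.3)
q /= q.sum() * dx

kernel = np.exp(-x**2 / (2 * 0.3**2)); kernel /= kernel.sum() * dx
K = np.fft.fft(np.fft.ifftshift(kernel))                       # kernel spectrum
blurred = np.fft.ifft(np.fft.fft(q) * K).real * dx             # q convolved with kernel

noisy = blurred + rng.normal(scale=1e-6, size=n)               # tiny input error
recovered = np.fft.ifft(np.fft.fft(noisy) / (K * dx)).real     # naive deconvolution

print(np.abs(noisy - blurred).max())                           # ~1e-6
print(np.abs(recovered - q).max())                             # many orders larger
```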

\n

In addition, the infeasibility of recovery by deconvolution can be understood from another perspective. Since the forward transforms (Equations 4.1 to 4.4) are fixed, the convolution kernels are fixed, and therefore the corresponding deconvolution transforms are also fixed. Since the initial data distribution q(x) is arbitrary, any probability distribution can be transformed into an approximation of N(0,I) through this series of fixed linear transforms and convolutions. If recovery by deconvolution were feasible, it would mean that a fixed deconvolution could recover any data distribution q(x) from N(0,I), which is clearly contradictory: the same input passed through the same transform cannot produce multiple different outputs.

\n
","reference_en":"
","about_en":"

APP: This Web APP is developed using Gradio and deployed on HuggingFace. Due to limited resources (2 cores, 16G memory), the response may be slow. For a better experience, it is recommended to clone the source code from github and run it locally. This program only relies on Gradio, SciPy, and Matplotlib.

\n

Author: Zhenxin Zheng, senior computer vision engineer with ten years of algorithm development experience, formerly employed by Tencent and JD.com, currently focusing on image and video generation.

\n

Email: blair.star@163.com.

\n
","cond_kl_zh":"

本节主要介绍KL散度条件KL散度之间的关系。在正式介绍之前,先简单介绍条件熵的定义,以及两者之间存在的不等式关系,为后面的证明作准备。

\n

熵及条件熵

\n对于任意两个随机变量Z,XZ,XZ,X(Entropy)定义如下[16]:\nH(Z)=p(z)logp(z)dz\\begin{align}\n \\mathbf{H}(Z) = \\int -p(z)\\log{p(z)}dz \\tag{A.1}\n\\end{align}H(Z)=p(z)logp(z)dz(A.1)\n条件熵(Conditional Entropy)的定义如下[17]:\nH(ZX)=p(x)p(zx)logp(zx)dzEntropy dx\\begin{align}\n \\mathbf{H}(Z|X) = \\int p(x) \\overbrace{\\int -p(z|x)\\log{p(z|x)}dz}^{\\text{Entropy}}\\ dx \\tag{A.2}\n\\end{align}H(ZX)=p(x)p(zx)logp(zx)dzEntropy dx(A.2)\n两者存在如下的不等式关系:\nH(ZX)H(Z)\\begin{align}\n \\mathbf{H}(Z|X) \\le \\mathbf{H}(Z) \\tag{A.3}\n\\end{align}H(ZX)H(Z)(A.3)\n也就是说,条件熵总是小于或者等于熵,当且仅当X与Z相互独立时,两者相等。此关系的证明可看文献[17]。\n\n

KL散度及条件KL散度

\n仿照条件熵定义的方式,引入一个新定义,条件KL散度,记为KLCKL_{\\mathcal{C}}KLC。由于KL散度的定义是非对称的,所以存在两种形式,如下:\nKLC(q(zx)p(z))= q(x)KL(q(zx)p(z))dxKLC(q(z)p(zx))= p(x)KL(q(z)p(zx))dx\\begin{align}\n KL_{\\mathcal{C}}(q(z|x) \\Vert \\textcolor{blue}{p(z)}) = \\int \\ q(x) KL(q(z|x) \\Vert \\textcolor{blue}{p(z)})dx \\tag{A.4} \\newline\n KL_{\\mathcal{C}}(q(z) \\Vert \\textcolor{blue}{p(z|x)}) = \\int \\ \\textcolor{blue}{p(x)} KL(q(z) \\Vert \\textcolor{blue}{p(z|x)})dx \\tag{A.5}\n\\end{align}KLC(q(zx)p(z))= q(x)KL(q(zx)p(z))dxKLC(q(z)p(zx))= p(x)KL(q(z)p(zx))dx(A.4)(A.5)\n\n

与条件熵类似,两种形式的条件KL散度也都存在类似的不等式关系:\nKLC(q(zx)p(z))KL(q(z)p(z))KLC(q(z)p(zx))KL(q(z)p(z))\\begin{align}\n KL_{\\mathcal{C}}(q(z|x) \\Vert \\textcolor{blue}{p(z)}) \\ge KL(q(z) \\Vert \\textcolor{blue}{p(z)}) \\tag{A.6} \\newline\n KL_{\\mathcal{C}}(q(z) \\Vert \\textcolor{blue}{p(z|x)}) \\ge KL(q(z) \\Vert \\textcolor{blue}{p(z)}) \\tag{A.7}\n\\end{align}KLC(q(zx)p(z))KL(q(z)p(z))KLC(q(z)p(zx))KL(q(z)p(z))(A.6)(A.7)\n也就是说,条件KL散度总是大于或者等于KL散度,当且仅当X与Z相互独立时,两者相等。

\n

下面对式A.6和式A.7的结论分别证明。

\n

对于式A.6,证明如下:
\begin{align}
   KL_{\mathcal{C}}(q(z|x) \Vert \textcolor{blue}{p(z)}) &= \int \ q(x) KL(q(z|x) \Vert \textcolor{blue}{p(z)})dx \tag{A.8} \newline
   &= \iint q(x) q(z|x) \log \frac{q(z|x)}{\textcolor{blue}{p(z)}}dzdx \tag{A.9} \newline
   &= -\overbrace{\iint - q(x)q(z|x) \log q(z|x) dzdx}^{\text{Conditional Entropy }\mathbf{H}_q(Z|X)} - \iint q(x) q(z|x) \log \textcolor{blue}{p(z)} dzdx \tag{A.10} \newline
   &= -\mathbf{H}_q(Z|X) - \int \left\lbrace \int q(x) q(z|x)dx \right\rbrace \log \textcolor{blue}{p(z)}dz \tag{A.11} \newline
   &= -\mathbf{H}_q(Z|X) + \overbrace{\int - q(z) \log p(z)dz}^{\text{Cross Entropy}} \tag{A.12} \newline
   &= -\mathbf{H}_q(Z|X) + \int q(z)\left\lbrace \log\frac{q(z)}{\textcolor{blue}{p(z)}} -\log q(z)\right\rbrace dz \tag{A.13} \newline
   &= -\mathbf{H}_q(Z|X) + \int q(z)\log\frac{q(z)}{\textcolor{blue}{p(z)}}dz + \overbrace{\int - q(z)\log q(z)dz}^{\text{Entropy } \mathbf{H}_q(Z)} \tag{A.14} \newline
   &= KL(q(z) \Vert \textcolor{blue}{p(z)}) + \overbrace{\mathbf{H}_q(Z) - \mathbf{H}_q(Z|X)}^{\ge 0} \tag{A.15} \newline
   &\ge KL(q(z) \Vert \textcolor{blue}{p(z)}) \tag{A.16}
\end{align}
其中,式A.15应用了"条件熵总是小于或者等于熵"的结论;由于该项非负,于是得到式A.6的关系。

\n

对于式A.7,证明如下:\nKL(q(z)p(z))=q(z)logq(z)p(z)dz=q(z)logq(z)p(zx)p(x)dxdz=p(x)dxq(z)logq(z)dzq(z)logp(zx)p(x)dxdz p(x)dx=1p(x)q(z)logq(z)dzdxq(z)p(x)logp(zx)dxdz jensen inequality=p(x)q(z)logq(z)dzdxp(x)q(z)logp(zx)dzdx=p(x)q(z)(logq(z)logp(zx))dzdx=p(x)q(z)logq(z)p(zx)dzdx=p(x){q(z)logq(z)p(zx)dz}dx=p(x)KL(q(z)p(zx))dx=KLC(q(z)p(zx))\\begin{align}\n KL(\\textcolor{blue}{q(z)} \\Vert p(z)) &= \\int \\textcolor{blue}{q(z)}\\log\\frac{\\textcolor{blue}{q(z)}}{p(z)}dz \\tag{A.15} \\newline\n &= \\int q(z)\\log\\frac{q(z)}{\\int p(z|x)p(x)dx}dz \\tag{A.16} \\newline\n &= \\textcolor{orange}{\\int p(x)dx}\\int q(z)\\log q(z)dz - \\int q(z)\\textcolor{red}{\\log\\int p(z|x)p(x)dx}dz \\qquad \\ \\textcolor{orange}{\\int p(x)dx=1} \\tag{A.17} \\newline\n &\\le \\iint p(x) q(z)\\log q(z)dzdx - \\int q(z)\\textcolor{red}{\\int p(x)\\log p(z|x)dx}dz \\ \\qquad \\textcolor{red}{\\text{jensen\\ inequality}} \\tag{A.18} \\newline\n &= \\iint p(x)q(z)\\log q(z)dzdx - \\iint p(x)q(z)\\log p(z|x)dzdx \\tag{A.19} \\newline\n &= \\iint p(x)q(z)(\\log q(z) - \\log p(z|x))dzdx \\tag{A.20} \\newline\n &= \\iint p(x)q(z)\\log \\frac{q(z)}{p(z|x)}dzdx \\tag{A.21} \\newline\n &= \\int p(x)\\left\\lbrace \\int q(z)\\log \\frac{q(z)}{p(z|x)}dz\\right\\rbrace dx \\tag{A.22} \\newline\n &= \\int p(x)KL(\\textcolor{blue}{q(z)} \\Vert p(z|x))dx \\tag{A.23} \\newline\n &= KL_{\\mathcal{C}}(q(z) \\Vert \\textcolor{blue}{p(z|x)}) \\tag{A.24}\n\\end{align}KL(q(z)p(z))=q(z)logp(z)q(z)dz=q(z)logp(zx)p(x)dxq(z)dz=p(x)dxq(z)logq(z)dzq(z)logp(zx)p(x)dxdz p(x)dx=1p(x)q(z)logq(z)dzdxq(z)p(x)logp(zx)dxdz jensen inequality=p(x)q(z)logq(z)dzdxp(x)q(z)logp(zx)dzdx=p(x)q(z)(logq(z)logp(zx))dzdx=p(x)q(z)logp(zx)q(z)dzdx=p(x){q(z)logp(zx)q(z)dz}dx=p(x)KL(q(z)p(zx))dx=KLC(q(z)p(zx))(A.15)(A.16)(A.17)(A.18)(A.19)(A.20)(A.21)(A.22)(A.23)(A.24)\n于是,得到式A.7的关系。

\n

从式A.15可得出另外一个重要的结论

\n

KL散度常用于拟合数据的分布。在此场景中,数据潜在的分布用q(z)q(z)q(z)表示,参数化的模型分布用pθ(z)\\textcolor{blue}{p_\\theta(z)}pθ(z)表示。在优化的过程中,由于q(zx)q(z|x)q(zx)q(x)q(x)q(x)均保持不变,所以式A.15中的H(Z)H(ZX)\\mathbf{H}(Z) - \\mathbf{H}(Z|X)H(Z)H(ZX)为一个常数项。于是,可得到如下的关系\nminpθKL(q(z)pθ(z))    minpθ q(x)KL(q(zx)pθ(z))dx\\begin{align}\n\\mathop{\\min}_{\\textcolor{blue}{p_\\theta}} KL(q(z) \\Vert \\textcolor{blue}{p_\\theta(z)}) \\iff \\mathop{\\min}_{\\textcolor{blue}{p_\\theta}} \\int \\ q(x) KL(q(z|x) \\Vert \\textcolor{blue}{p_\\theta(z)})dx \\tag{A.25}\n\\end{align}minpθKL(q(z)pθ(z))minpθ q(x)KL(q(zx)pθ(z))dx(A.25)

\n

把上述的关系与Denoised Score Matching[18]作比较,可发现一些相似的地方。两者均引入一个新变量XXX,并且将拟合的目标分布q(z)代替为q(z|x)。代替后,由于q(z|x)是条件概率分布,所以,两者均考虑了所有的条件,并以条件发生的概率q(x)q(x)q(x)作为权重系数执行加权和。\nminψθ12q(z)ψθ(z)q(z)z2dz    minψθq(x) 12q(zx)ψθ(z)q(zx)z2dzScore Matching of q(zx) dx\\begin{align}\n\\mathop{\\min}_{\\textcolor{blue}{\\psi_\\theta}} \\frac{1}{2} \\int q(z) \\left\\lVert \\textcolor{blue}{\\psi_\\theta(z)} - \\frac{\\partial q(z)}{\\partial z} \\right\\rVert^2 dz \\iff \\mathop{\\min}_{\\textcolor{blue}{\\psi_\\theta}} \\int q(x)\\ \\overbrace{\\frac{1}{2} \\int q(z|x) \\left\\lVert \\textcolor{blue}{\\psi_\\theta(z)} - \\frac{\\partial q(z|x)}{\\partial z} \\right\\rVert^2 dz}^{\\text{Score Matching of }q(z|x)}\\ dx \\tag{A.26}\n\\end{align}minψθ21q(z)ψθ(z)zq(z)2dzminψθq(x) 21q(zx)ψθ(z)zq(zx)2dzScore Matching of q(zx) dx(A.26)

\n

上述加权和的操作有点类似于\"全概率公式消元\"。\nq(z)=q(z,x)dx=q(x)q(zx)dx\\begin{align}\n q(z) = \\int q(z,x) dx = \\int q(x) q(z|x) dx \\tag{A.27}\n\\end{align}q(z)=q(z,x)dx=q(x)q(zx)dx(A.27)

\n
","proof_ctr_zh":"
\n
Figure 2: Only one component in support
\n\n

本节将证明,当q(x)q(x)q(x)α\\alphaα满足一些条件时,后验概率变换是一个压缩映射,并存在惟一收敛点。

\n

下面分四种情况进行证明。证明的过程假设随机变量是离散型的,因此,后验概率变换可看作是一个离散Markov Chain的一步转移,后验概率q(xz)q(x|z)q(xz)对应于转移矩阵(Transfer Matrix)。连续型的变量可认为是无限多状态的离散型变量。

\n
    \n
  1. q(x)q(x)q(x)均大于0时,后验概率变换矩阵q(xz)q(x|z)q(xz)将大于0,于是此矩阵是一个不可约非周期\\textcolor{red}{不可约}\\textcolor{green}{非周期}不可约非周期的Markov Chain的转移矩阵,根据文献[13]的结论,此变换是一个关于Total Variance度量的压缩映射,于是,根据Banach fixed-point theorem,此变换存在惟一定点(收敛点)。
  2. \n \n
  3. q(x)q(x)q(x)部分大于0,并且q(x)q(x)q(x)的支撑集(q(x)q(x)q(x)大于0的区域)只存在一个连通域时(图2),由式(3.4)可分析出几个结论:\n\n
      \n
    1. zzzxxx在支撑集内时,由于q(x)q(x)q(x)和GaussFun均大于0,所以,转移矩阵的对角元素{q(xz)z=x}\\{q(x|z)|z=x\\}{q(xz)z=x}大于0。这意味着,支撑集内的状态是非周期\\textcolor{green}{非周期}非周期的。
    2. \n\n
    3. zzzxxx在支撑集内时,由于GaussFun的支撑集存在一定的半径,所以,在对角元素上下附近区域内的{q(xz)x=z+ϵ}\\{q(x|z)|x=z+\\epsilon\\}{q(xz)x=z+ϵ}也大于0。这意味着,支撑集内的状态可相互访问(accessible),形成一个Communication Class\\textcolor{red}{\\text{Communication Class}}Communication Class[14]
    4. \n \n
    5. zzz在支撑集内xxx在支撑集外时,q(xz){q(x|z)}q(xz)全为0。这意味着,支撑集内的状态不可访问支撑集外的状态(图2b的inaccessible区域)。
    6. \n \n
    7. zzz在支撑集外xxx在支撑集内时,由于GaussFun的支撑集存在一定的范围,所以,存在部分扩展区域(图2b的extension区域),其对应的{q(xz)xsupport}\\{q(x|z)|x\\in support\\}{q(xz)xsupport}不全为0。这意味着,此部分扩展区域的状态可单向访问(access)支撑集内的状态(图2b的unidirectional区域)。
    8. \n \n
    9. zzz在支撑集外xxx在支撑集外时,对应的q(xz)q(x|z)q(xz)全为0。这意味着,支撑集外的状态不会转移至支撑集外的状态,也就是说,支撑集外的状态只来源于支撑集内的状态。
    10. \n\n

      \n由(c)可知,支撑集内的状态不会转移到支撑集外的状态,由(a)和(b)可知,支撑集内的状态是非周期且构成一个Communicate Class,所以,支撑集内的状态独立构成一个不可约且非周期的Markov Chain,根据文献[7]中Theorem 11.4.1的结论,当n+n\\to+\\inftyn+时,q(xz)nq(x|z)^nq(xz)n收敛于一个固定矩阵,并且矩阵每个列向量都相同。这意味着,对于不同的z,q(xz)nq(x|z)^nq(xz)n都相同(可见图2c)。另外,由(d)和(e)可知,存在部分支撑集外的z状态,能转移至支撑集内,并且会带着支撑集内的信息转移回支撑集外,于是,此部分z状态对应的q(xz)q(x|z)q(xz)(图2c的q(xzex)q(x|z_{ex})q(xzex)区域)也会等于支撑集内对应的q(xz)q(x|z)q(xz)(图2c的q(xzsup)q(x|z_{sup})q(xzsup)区域)。\n

      \n\n

      \n所以,可以得出结论,当状态限制在支撑集和两个扩展区域内时,limnq(xz)n\\lim_{n\\to\\infty}{q(x|z)^n}limnq(xz)n会收敛于一个固定矩阵,并且每个列向量均相同。于是,对于任意的输入分布,如果连续应用足够多后验概率变换,最终会收敛于一个固定分布,此分布等于收敛的矩阵的列向量。根据文献[9]的结论,当迭代变换收敛于惟一定点时,此变换是关于某个metric的Contraction Mapping。\n

      \n\n
    \n
  4. \n\n
  5. q(x)q(x)q(x)部分大于0,q(x)q(x)q(x)的支撑集存在多个连通域,并且各个连通域的最大距离被相应的GaussFun的支撑集所覆盖时,那各个连通域内的状态构成一个Communicate Class。如图3所示,q(x)q(x)q(x)存在两个连通域,在第一个连通域的边缘,q(xz=0.3)q(x|z=-0.3)q(xz=0.3)对应的GaussFun的支撑集能跨越间隙到达第二个连通域,于是第一个连通域的状态能访问第二个连通域的状态;在第二个连通域的边缘,q(xz=0)q(x|z=0)q(xz=0)对应的GaussFun的支撑集也能跨越间隙到达第一个连通域,于是第二个连通域的状态能访问第一个连通域的状态,所以两个连通域构成一个Communicate Class。因此,与单个连通域的情况类似,当状态限制在各个连通域、间隙及扩展区域内时,后验概率变换存在惟一一个迭代收敛点,并且是关于某个metric的压缩映射。
  6. \n\n
  7. q(x)q(x)q(x)部分大于0,q(x)q(x)q(x)的支撑集存在多个连通域时,并且各个连通域的最大距离不能被相应的GaussFun的支撑集所覆盖时,那各个连通域内的状态构成多个Communicate Class,如图4所示。此情况下,当nn\\to\\inftyn时,q(xz)nq(x|z)^nq(xz)n也会收敛于一个固定矩阵,但每个列向量不尽相同。所以,后验概率变换不是一个严格的压缩映射。但当输入分布的状态限制在单个Communicate Class及相应的扩展范围内时,后验概率变换也是一个压缩映射,存在惟一收敛点。
  8. \n
\n\n
\n
Figure 3: Two component which can communicate with each other
\n\n
\n
Figure 4: Two component which cannot communicate with each other
\n\n

另外,后验概率变换存在一个更通用的关系,与q(xz)q(x|z)q(xz)的具体值无关: 两个输出分布的之间的Total Variance距离总是会小于等于对应输入分布之间的Total Variance距离,即\ndist(qo1(x), qo2(x))dist(qi1(z), qi2(z))\\begin{align}\n dist(q_{o1}(x),\\ q_{o2}(x)) \\le dist(q_{i1}(z),\\ q_{i2}(z)) \\tag{B.1}\n\\end{align}dist(qo1(x), qo2(x))dist(qi1(z), qi2(z))(B.1)\n下面通过离散的形式给出证明:\nqo1qo2TV=Qxzqi1Qxzqi2TV=mnQxz(m,n)qi1(n)nQxz(m,n)qi2(n)=mnQxz(m,n)(qi1(n)qi2(n))mnQxz(m,n)(qi1(n)qi2(n))Absolute value inequality=n(qi1(n)qi2(n))mQxz(m,n)mQxz(m,n)=1=n(qi1(n)qi2(n))\\begin{align}\n \\lVert q_{o1}-q_{o2}\\rVert_{TV} &= \\lVert Q_{x|z}q_{i1} - Q_{x|z}q_{i2}\\rVert_{TV} \\tag{B.2} \\newline\n &= \\sum_{m}\\textcolor{red}{|}\\sum_{n}Q_{x|z}(m,n)q_{i1}(n) - \\sum_{n}Q_{x|z}(m,n)q_{i2}(n)\\textcolor{red}{|} \\tag{B.3} \\newline\n &= \\sum_{m}\\textcolor{red}{|}\\sum_{n}Q_{x|z}(m,n)(q_{i1}(n) - q_{i2}(n))\\textcolor{red}{|} \\tag{B.4} \\newline\n &\\leq \\sum_{m}\\sum_{n}Q_{x|z}(m,n)\\textcolor{red}{|}(q_{i1}(n) - q_{i2}(n))\\textcolor{red}{|} \\qquad \\qquad \\qquad \\text{Absolute value inequality} \\tag{B.5} \\newline\n &= \\sum_{n}\\textcolor{red}{|}(q_{i1}(n) - q_{i2}(n))\\textcolor{red}{|} \\sum_{m} Q_{x|z}(m,n) \\qquad \\qquad \\qquad \\sum_{m} Q_{x|z}(m,n) = 1 \\tag{B.6} \\newline\n &= \\sum_{n}\\textcolor{red}{|}(q_{i1}(n) - q_{i2}(n))\\textcolor{red}{|} \\tag{B.7}\n\\end{align}qo1qo2TV=Qxzqi1Qxzqi2TV=mnQxz(m,n)qi1(n)nQxz(m,n)qi2(n)=mnQxz(m,n)(qi1(n)qi2(n))mnQxz(m,n)(qi1(n)qi2(n))Absolute value inequality=n(qi1(n)qi2(n))mQxz(m,n)mQxz(m,n)=1=n(qi1(n)qi2(n))(B.2)(B.3)(B.4)(B.5)(B.6)(B.7)\n其中,Qxz(m,n)Q_{x|z}(m,n)Qxz(m,n)表示矩阵QxzQ_{x|z}Qxz的第m行第n列的元素,qi1(n)q_{i1}(n)qi1(n)表示向量qi1q_{i1}qi1的第n个元素。

\n
","cond_kl_en":"

This section mainly introduces the relationship between KL divergence and conditional KL divergence. Before the formal introduction, we will briefly introduce the definitions of Entropy and Conditional Entropy, as well as the inequality relationship between them, in preparation for the subsequent proof.

\n
\n

Entropy and Conditional Entropy

\nFor any two random variables Z,XZ, XZ,X, the Entropy is defined as follows[16]:\nH(Z)=p(z)logp(z)dz\\begin{align}\n \\mathbf{H}(Z) = \\int -p(z)\\log{p(z)}dz \\tag{A.1}\n\\end{align}H(Z)=p(z)logp(z)dz(A.1)\nThe Conditional Entropy is defined as followed [17]:\nH(ZX)=p(x)p(zx)logp(zx)dzEntropy dx\\begin{align}\n \\mathbf{H}(Z|X) = \\int p(x) \\overbrace{\\int -p(z|x)\\log{p(z|x)}dz}^{\\text{Entropy}}\\ dx \\tag{A.2}\n\\end{align}H(ZX)=p(x)p(zx)logp(zx)dzEntropy dx(A.2)\nThe following inequality relationship exists between the two:\nH(ZX)H(Z)\\begin{align}\n \\mathbf{H}(Z|X) \\le \\mathbf{H}(Z) \\tag{A.3}\n\\end{align}H(ZX)H(Z)(A.3)\nIt is to say that the Conditional Entropy is always less than or equal to the Entropy, and they are equal only when X and Z are independent. The proof of this relationship can be found in the literature [17].\n\n
\n

KL Divergence and Conditional KL Divergence

\nIn the same manner as the definition of Conditional Entropy, we introduce a new definition, Conditional KL Divergence, denoted as KLCKL_{\\mathcal{C}}KLC. Since KL Divergence is non-symmetric, there exist two forms as follows. \nKLC(q(zx)p(z))= q(x)KL(q(zx)p(z))dxKLC(q(z)p(zx))= p(x)KL(q(z)p(zx))dx\\begin{align}\n KL_{\\mathcal{C}}(q(z|x) \\Vert \\textcolor{blue}{p(z)}) = \\int \\ q(x) KL(q(z|x) \\Vert \\textcolor{blue}{p(z)})dx \\tag{A.4} \\newline\n KL_{\\mathcal{C}}(q(z) \\Vert \\textcolor{blue}{p(z|x)}) = \\int \\ \\textcolor{blue}{p(x)} KL(q(z) \\Vert \\textcolor{blue}{p(z|x)})dx \\tag{A.5}\n\\end{align}KLC(q(zx)p(z))= q(x)KL(q(zx)p(z))dxKLC(q(z)p(zx))= p(x)KL(q(z)p(zx))dx(A.4)(A.5)\n\n

Similar to Conditional Entropy, both forms of the Conditional KL Divergence satisfy an analogous inequality:
\begin{align}
   KL_{\mathcal{C}}(q(z|x) \Vert \textcolor{blue}{p(z)}) \ge KL(q(z) \Vert \textcolor{blue}{p(z)}) \tag{A.6} \newline
   KL_{\mathcal{C}}(q(z) \Vert \textcolor{blue}{p(z|x)}) \ge KL(q(z) \Vert \textcolor{blue}{p(z)}) \tag{A.7}
\end{align}
That is to say, the Conditional KL Divergence is always greater than or equal to the KL Divergence, with equality if and only if X and Z are independent.

\n

The following provides proofs for the conclusions in Equation A.6 and Equation A.7, respectively.

\n

For equation A.6, the proof is as follows:
\begin{align}
   KL_{\mathcal{C}}(q(z|x) \Vert \textcolor{blue}{p(z)}) &= \int \ q(x) KL(q(z|x) \Vert \textcolor{blue}{p(z)})dx \tag{A.8} \newline
   &= \iint q(x) q(z|x) \log \frac{q(z|x)}{\textcolor{blue}{p(z)}}dzdx \tag{A.9} \newline
   &= -\overbrace{\iint - q(x)q(z|x) \log q(z|x) dzdx}^{\text{Conditional Entropy }\mathbf{H}_q(Z|X)} - \iint q(x) q(z|x) \log \textcolor{blue}{p(z)} dzdx \tag{A.10} \newline
   &= -\mathbf{H}_q(Z|X) - \int \left\lbrace \int q(x) q(z|x)dx \right\rbrace \log \textcolor{blue}{p(z)}dz \tag{A.11} \newline
   &= -\mathbf{H}_q(Z|X) + \overbrace{\int - q(z) \log p(z)dz}^{\text{Cross Entropy}} \tag{A.12} \newline
   &= -\mathbf{H}_q(Z|X) + \int q(z)\left\lbrace \log\frac{q(z)}{\textcolor{blue}{p(z)}} -\log q(z)\right\rbrace dz \tag{A.13} \newline
   &= -\mathbf{H}_q(Z|X) + \int q(z)\log\frac{q(z)}{\textcolor{blue}{p(z)}}dz + \overbrace{\int - q(z)\log q(z)dz}^{\text{Entropy } \mathbf{H}_q(Z)} \tag{A.14} \newline
   &= KL(q(z) \Vert \textcolor{blue}{p(z)}) + \overbrace{\mathbf{H}_q(Z) - \mathbf{H}_q(Z|X)}^{\ge 0} \tag{A.15} \newline
   &\ge KL(q(z) \Vert \textcolor{blue}{p(z)}) \tag{A.16}
\end{align}
In this context, equation A.15 applies the conclusion that the Conditional Entropy is always less than or equal to the Entropy, so the braced term is non-negative. Thus, the relationship in equation A.6 is derived.

\n

For equation A.7, the proof is as follows:
\begin{align}
   KL(\textcolor{blue}{q(z)} \Vert p(z)) &= \int \textcolor{blue}{q(z)}\log\frac{\textcolor{blue}{q(z)}}{p(z)}dz \tag{A.15} \newline
   &= \int q(z)\log\frac{q(z)}{\int p(z|x)p(x)dx}dz \tag{A.16} \newline
   &= \textcolor{orange}{\int p(x)dx}\int q(z)\log q(z)dz - \int q(z)\textcolor{red}{\log\int p(z|x)p(x)dx}dz \qquad \ \textcolor{orange}{\int p(x)dx=1} \tag{A.17} \newline
   &\le \iint p(x) q(z)\log q(z)dzdx - \int q(z)\textcolor{red}{\int p(x)\log p(z|x)dx}dz \ \qquad \textcolor{red}{\text{Jensen's inequality}} \tag{A.18} \newline
   &= \iint p(x)q(z)\log q(z)dzdx - \iint p(x)q(z)\log p(z|x)dzdx \tag{A.19} \newline
   &= \iint p(x)q(z)(\log q(z) - \log p(z|x))dzdx \tag{A.20} \newline
   &= \iint p(x)q(z)\log \frac{q(z)}{p(z|x)}dzdx \tag{A.21} \newline
   &= \int p(x)\left\lbrace \int q(z)\log \frac{q(z)}{p(z|x)}dz\right\rbrace dx \tag{A.22} \newline
   &= \int p(x)KL(\textcolor{blue}{q(z)} \Vert p(z|x))dx \tag{A.23} \newline
   &= KL_{\mathcal{C}}(q(z) \Vert \textcolor{blue}{p(z|x)}) \tag{A.24}
\end{align}
Thus, the relationship in equation A.7 is obtained.
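As with Equation A.6, Equation A.7 can be checked numerically; the sketch below compares KL_C(q(z)‖p(z|x)) with KL(q(z)‖p(z)) for arbitrary toy distributions on a grid.

```python
import numpy as np

# Sketch: numeric check of Equation A.7 on a grid — the conditional KL divergence
# KL_C(q(z) || p(z|x)) is never smaller than KL(q(z) || p(z)). The discrete p(x),
# the Gaussian p(z|x) and q(z) below are arbitrary toy choices.

z = np.linspace(-5, 5, 400); dz = z[1] - z[0]

def kl(a, b):
    m = a > 0
    return np.sum(a[m] * np.log(a[m] / b[m])) * dz

p_x = np.array([0.3, 0.5, 0.2])                         # p(x) over 3 states
mu = np.array([-2.0, 0.5, 2.5])
p_z_given_x = np.exp(-(z[None, :] - mu[:, None])**2 / 2.0)
p_z_given_x /= p_z_given_x.sum(axis=1, keepdims=True) * dz
p_z = (p_x[:, None] * p_z_given_x).sum(axis=0)          # p(z) = sum_x p(x) p(z|x)

q_z = np.exp(-(z - 1.0)**2 / 1.5); q_z /= q_z.sum() * dz

kl_c = sum(p_x[i] * kl(q_z, p_z_given_x[i]) for i in range(len(p_x)))
print(kl_c, ">=", kl(q_z, p_z))
```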

\n

Another important conclusion can be drawn from equation A.15.

\n

The KL Divergence is often used to fit the distribution of data. In this scenario, the distribution of the data is denoted by q(z)q(z)q(z) and the parameterized model distribution is denoted by pθ(z)\\textcolor{blue}{p_\\theta(z)}pθ(z). During the optimization process, since both q(zx)q(z|x)q(zx) and q(x)q(x)q(x) remain constant, the term H(Z)H(ZX)\\mathbf{H}(Z) - \\mathbf{H}(Z|X)H(Z)H(ZX) in Equation A.15 is a constant. Thus, the following relationship is obtained:\nminpθKL(q(z)pθ(z))    minpθ q(x)KL(q(zx)pθ(z))dx\\begin{align}\n\\mathop{\\min}_{\\textcolor{blue}{p_\\theta}} KL(q(z) \\Vert \\textcolor{blue}{p_\\theta(z)}) \\iff \\mathop{\\min}_{\\textcolor{blue}{p_\\theta}} \\int \\ q(x) KL(q(z|x) \\Vert \\textcolor{blue}{p_\\theta(z)})dx \\tag{A.25}\n\\end{align}minpθKL(q(z)pθ(z))minpθ q(x)KL(q(zx)pθ(z))dx(A.25)

\n

Comparing the above relationship with Denoised Score Matching [18](equation A.26), some similarities can be observed. Both introduce a new variable XXX, and substitute the targeted fitting distribution q(z) with q(z|x). After the substitution, since q(z|x) is a conditional probability distribution, both consider all conditions and perform a weighted sum using the probability of the conditions occurring, q(x)q(x)q(x), as the weight coefficient.\nminψθ12q(z)ψθ(z)q(z)z2dz    minψθq(x) 12q(zx)ψθ(z)q(zx)z2dzScore Matching of q(zx) dx\\begin{align}\n\\mathop{\\min}_{\\textcolor{blue}{\\psi_\\theta}} \\frac{1}{2} \\int q(z) \\left\\lVert \\textcolor{blue}{\\psi_\\theta(z)} - \\frac{\\partial q(z)}{\\partial z} \\right\\rVert^2 dz \\iff \\mathop{\\min}_{\\textcolor{blue}{\\psi_\\theta}} \\int q(x)\\ \\overbrace{\\frac{1}{2} \\int q(z|x) \\left\\lVert \\textcolor{blue}{\\psi_\\theta(z)} - \\frac{\\partial q(z|x)}{\\partial z} \\right\\rVert^2 dz}^{\\text{Score Matching of }q(z|x)}\\ dx \\tag{A.26}\n\\end{align}minψθ21q(z)ψθ(z)zq(z)2dzminψθq(x) 21q(zx)ψθ(z)zq(zx)2dzScore Matching of q(zx) dx(A.26)

\n

The operation of the above weighted sum is somewhat similar to Elimination by Total Probability Formula .\nq(z)=q(z,x)dx=q(x)q(zx)dx\\begin{align}\n q(z) = \\int q(z,x) dx = \\int q(x) q(z|x) dx \\tag{A.27}\n\\end{align}q(z)=q(z,x)dx=q(x)q(zx)dx(A.27)

\n
","proof_ctr_en":"
\n
Figure 2: Only one component in support
\n\n

The following will prove that, under certain conditions, the posterior transform is a contraction mapping with a unique fixed point, which is also the convergence point.

\n

The proof will be divided into several cases, and assumes that the random variable is discrete, so the posterior transform can be regarded as a single step transition of a discrete Markov Chain. The posterior q(xz)q(x|z)q(xz) corresponds to the transfer matrix. Continuous variables can be considered as discrete variables with infinite states.

\n
    \n
  1. When q(x) is everywhere greater than 0, the posterior transform matrix q(x|z) is also everywhere greater than 0. Therefore, this matrix is the transition matrix of an \textcolor{red}{\text{irreducible}} \textcolor{green}{\text{aperiodic}} Markov Chain. According to the conclusion of the literature [13], this transform is a contraction mapping with respect to the Total Variance metric. Therefore, according to the Banach fixed-point theorem, the transform has a unique fixed point (the convergence point).
  2. \n \n
  3. When q(x) is partially greater than 0, and the support of q(x) (the region where q(x) is greater than 0) consists of only one connected component (Figure 2), several conclusions can be drawn from equation (3.4):
      \n
    1. When zzz and xxx are within the support set, since both q(x)q(x)q(x) and GaussFun are greater than 0, the diagonal elements of the transfer matrix {q(xz)z=x}\\{q(x|z)|z=x\\}{q(xz)z=x} are greater than 0. This means that the state within the support set is aperiodic\\textcolor{green}{\\text{aperiodic}}aperiodic.
    2. \n\n
    3. When z and x are within the support set, since GaussFun's support set has a certain range, the elements above and below the diagonal, \{q(x|z)|x=z+\epsilon\}, are also greater than 0. This means that states within the support set are accessible to each other, forming a \textcolor{red}{\text{Communication Class}}[14], as shown in Figure 2b.
    4. \n \n
    5. When z is within the support set and x is outside the support set, q(x|z) is entirely 0. This means that states within the support set cannot access states outside the support set (the Inaccessible Region in Figure 2b).
    6. \n \n
    7. When zzz is outside the support set and xxx is inside the support set, due to the existence of a certain range of the support set of GaussFun, there are some extension areas (Extension Region in Figure 2b), where the corresponding {q(xz)xsupport}\\{q(x|z)|x \\in support\\}{q(xz)xsupport} is not all zero. This means that the state of this part of the extension area can unidirectionally access the state inside the support set (Unidirectional Region in Figure 2b).
    8. \n \n
    9. When zzz is outside the support set and xxx is outside the support set, the corresponding q(xz)q(x|z)q(xz) is entirely zero. This implies that, states outside the support set will not transit to states outside the support set. In other words, states outside the support set only originate from states within the support set.
    10. \n\n
    \n

    \nFrom (c), we know that states within the support set will not transition to states outside of the support set. From (a) and (b), we know that the states within the support set are non-periodic and form a Communicate Class. Therefore, the states within the support set independently form an irreducible and non-periodic Markov Chain. According to the conclusion of Theorem 11.4.1 in reference [7], as nn\\to\\inftyn, q(xz)nq(x|z)^nq(xz)n will converge to a constant matrix, with each column vector in the matrix being identical. This implies that for different values of z, q(xz)nq(x|z)^nq(xz)n are the same (as seen in Figure 2c). In Addition, according to (d) and (e), there exist some states z, which are outside of the support set, that can transition into the support set and will carry information from within the support set back to the outside. Thus, the corresponding q(xz)nq(x|z)^nq(xz)n for these z states (the q(xzex)q(x|z_{ex})q(xzex) region in Figure 2c) will equal the corresponding q(xz)nq(x|z)^nq(xz)n in the support set (the q(xzsup)q(x|z_{sup})q(xzsup) region in Figure 2c).\n

    \n\n

    \nTherefore, it can be concluded that when the state is confined within the support set and two extension regions, limnq(xz)n\\lim_{n\\to\\infty}{q(x|z)^n}limnq(xz)n will converge to a fixed matrix, and each column vector is identical. Hence, for any input distribution, if posterior transforms are continuously applied, it will eventually converge to a fixed distribution, which is equal to the column vector of the converged matrix. Based on the conclusion from the literature [9], when a iterative transform converges to a unique fixed point, this transform is a Contraction Mapping with respect to a certain metric. \n

    \n
  4. \n\n
  5. When q(x) is partially greater than 0, multiple connected components exist in the support set of q(x), and the maximum gap between the connected components can be covered by the support set of the corresponding GaussFun, the states of all the connected components together constitute a single Communicate Class. As shown in Figure 3, q(x) has two connected components. On the edge of the first component, the support set of the GaussFun corresponding to q(x|z=-0.3) can span the gap to reach the second component, so the states of the first component can access the states of the second component. On the edge of the second component, the support set of the GaussFun corresponding to q(x|z=0) can also span the gap to reach the first component, so the states of the second component can access the states of the first component. Hence the two components form one Communicate Class. Therefore, similar to the case with a single component, when states are confined to the components, the gaps, and the extension areas, the posterior transform has a unique iterative convergence point and is a contraction mapping with respect to a certain metric.
  6. \n\n
  7. When q(x) is partially greater than 0, multiple connected components exist in the support set of q(x), and the maximum gap between the connected components cannot be covered by the support set of the corresponding GaussFun, the states of the different components constitute multiple Communicate Classes, as shown in Figure 4. Under such circumstances, as n→∞, q(x|z)^n will also converge to a fixed matrix, but not all the column vectors are identical. Therefore, the posterior transform is not a strict contraction mapping. However, when the states of the input distribution are confined to a single Communicate Class and its corresponding extension region, the posterior transform is again a contraction mapping with a unique convergence point.
  8. \n
\n\n
\n
Figure 3: Two components which can communicate with each other
\n\n
\n
Figure 4: Two components which cannot communicate with each other
\n\n

Additionally, there exists a more generalized relation about the posterior transform that is independent of q(xz)q(x|z)q(xz): the Total Variance distance between two output distributions will always be less than or equal to the Total Variance distance between their corresponding input distributions, that is\ndist(qo1(x), qo2(x))<=dist(qi1(z), qi2(z))\\begin{align}\n dist(q_{o1}(x),\\ q_{o2}(x)) <= dist(q_{i1}(z),\\ q_{i2}(z)) \\tag{B.1}\n\\end{align}dist(qo1(x), qo2(x))<=dist(qi1(z), qi2(z))(B.1)\nThe proof is given below in discrete form:\nqo1qo2TV=Qxzqi1Qxzqi2TV=mnQxz(m,n)qi1(n)nQxz(m,n)qi2(n)=mnQxz(m,n)(qi1(n)qi2(n))mnQxz(m,n)(qi1(n)qi2(n))Absolute value inequality=n(qi1(n)qi2(n))mQxz(m,n)mQxz(m,n)=1=n(qi1(n)qi2(n))\\begin{align}\n \\lVert q_{o1}-q_{o2}\\rVert_{TV} &= \\lVert Q_{x|z}q_{i1} - Q_{x|z}q_{i2}\\rVert_{TV} \\tag{B.2} \\newline\n &= \\sum_{m}\\textcolor{red}{|}\\sum_{n}Q_{x|z}(m,n)q_{i1}(n) - \\sum_{n}Q_{x|z}(m,n)q_{i2}(n)\\textcolor{red}{|} \\tag{B.3} \\newline\n &= \\sum_{m}\\textcolor{red}{|}\\sum_{n}Q_{x|z}(m,n)(q_{i1}(n) - q_{i2}(n))\\textcolor{red}{|} \\tag{B.4} \\newline\n &\\leq \\sum_{m}\\sum_{n}Q_{x|z}(m,n)\\textcolor{red}{|}(q_{i1}(n) - q_{i2}(n))\\textcolor{red}{|} \\qquad \\qquad \\qquad \\text{Absolute value inequality} \\tag{B.5} \\newline\n &= \\sum_{n}\\textcolor{red}{|}(q_{i1}(n) - q_{i2}(n))\\textcolor{red}{|} \\sum_{m} Q_{x|z}(m,n) \\qquad \\qquad \\qquad \\sum_{m} Q_{x|z}(m,n) = 1 \\tag{B.6} \\newline\n &= \\sum_{n}\\textcolor{red}{|}(q_{i1}(n) - q_{i2}(n))\\textcolor{red}{|} \\tag{B.7}\n\\end{align}qo1qo2TV=Qxzqi1Qxzqi2TV=mnQxz(m,n)qi1(n)nQxz(m,n)qi2(n)=mnQxz(m,n)(qi1(n)qi2(n))mnQxz(m,n)(qi1(n)qi2(n))Absolute value inequality=n(qi1(n)qi2(n))mQxz(m,n)mQxz(m,n)=1=n(qi1(n)qi2(n))(B.2)(B.3)(B.4)(B.5)(B.6)(B.7)\nIn this context, Qxz(m,n)Q_{x|z}(m,n)Qxz(m,n) represents the element at the m-th row and n-th column of the matrix QxzQ_{x|z}Qxz, and qi1(n)q_{i1}(n)qi1(n) represents the n-th element of the vector qi1q_{i1}qi1.\n
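Since the derivation B.2–B.7 only uses the fact that the columns of Q_{x|z} sum to one, the inequality can be illustrated with any column-stochastic matrix; the random matrix below is a toy stand-in for a discretized posterior.

```python
import numpy as np

# Sketch: the derivation B.2-B.7 only uses the fact that the columns of Q_{x|z}
# sum to one, so any column-stochastic matrix illustrates the inequality.

rng = np.random.default_rng(6)
n = 50
Q = rng.random((n, n)); Q /= Q.sum(axis=0, keepdims=True)   # column-stochastic

q1 = rng.random(n); q1 /= q1.sum()
q2 = rng.random(n); q2 /= q2.sum()

tv_in  = np.abs(q1 - q2).sum()
tv_out = np.abs(Q @ q1 - Q @ q2).sum()
print(tv_out, "<=", tv_in)
```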

\n
","approx_gauss_zh":"

由式3.4可知,q(xz)q(x|z)q(xz)有如下的形式\nq(xz)=Normalize( 12πσexp(xμ)22σ2 q(x) )where μ=zασ=1αα12πσexp(xμ)22σ2GaussFun q(x)\\begin{align}\n q(x|z) &= \\operatorname{Normalize} \\Big(\\ \\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}\\ q(x)\\ \\Big)& \\qquad &\\text{where}\\ \\mu=\\frac{z}{\\sqrt{\\alpha}}\\quad \\sigma=\\sqrt{\\frac{1-\\alpha}{\\alpha}} \\tag{B.1} \\newline\n &\\propto \\underbrace{\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}}_{\\text{GaussFun}}\\ q(x) \\tag{B.2}\n\\end{align}q(xz)=Normalize( 2πσ1exp2σ2(xμ)2 q(x) )GaussFun2πσ1exp2σ2(xμ)2 q(x)where μ=αzσ=α1α(B.1)(B.2)

\n

下面证明,如果满足如下两个假设,q(xz)q(x|z)q(xz)近似于高斯分布。

\n
    \n
  • \n假设在GaussFun的支撑集内,q(x)q(x)q(x)是线性变化的。以GaussFun的均值为中心,对q(x)q(x)q(x)进行泰勒展开。由泰勒展开的性质可知,当GaussFun的标准差σ\\sigmaσ足够小时,上述假设可以满足。\nq(x)q(μ)+xq(μ)(xμ)whereq(μ)q(x)x=μxq(μ)xq(x)x=μ=q(μ)(1+xq(μ)q(μ)(xμ))=q(μ)(1+xlogq(μ)(xμ))wherexlogq(μ)xlogq(x)x=μ\\begin{align}\n q(x) &\\approx q(\\mu) + \\nabla_xq(\\mu)(x-\\mu)& \\quad &\\text{where}\\quad q(\\mu)\\triangleq q(x)\\bigg|_{x=\\mu} \\quad \\nabla_xq(\\mu)\\triangleq \\nabla_xq(x)\\bigg|_{x=\\mu} \\tag{B.3} \\newline\n &= q(\\mu)\\big(1+ \\frac{\\nabla_xq(\\mu)}{q(\\mu)}(x-\\mu)\\big)& \\tag{B.4} \\newline\n &= q(\\mu)\\big(1+ \\nabla_x\\log{q(\\mu)}(x-\\mu)\\big)& \\quad &\\text{where}\\quad \\nabla_x\\log{q(\\mu)}\\triangleq \\nabla_x\\log{q(x)}\\bigg|_{x=\\mu} \\tag{B.5}\n\\end{align}q(x)q(μ)+xq(μ)(xμ)=q(μ)(1+q(μ)xq(μ)(xμ))=q(μ)(1+xlogq(μ)(xμ))whereq(μ)q(x)x=μxq(μ)xq(x)x=μwherexlogq(μ)xlogq(x)x=μ(B.3)(B.4)(B.5)\n
  • \n
  • \n假设在GaussFun的支撑集内,log(1+xlogq(μ)(xμ))\\log\\big(1+\\nabla_x\\log{q(\\mu)}(x-\\mu)\\big)log(1+xlogq(μ)(xμ))可近似为 xlogq(μ)(xμ)\\nabla_x\\log{q(\\mu)}(x-\\mu)xlogq(μ)(xμ)。对log(1+y)\\log(1+y)log(1+y)进行泰勒展开,由泰勒展开的性质可知,当y2\\lVert y\\rVert_2y2较小时,log(1+y)\\log(1+y)log(1+y)可近似为yyy。当σ\\sigmaσ足够小时,xu2\\lVert x-u\\rVert_2xu2将较小,xlogq(μ)(xμ)\\nabla_x\\log{q(\\mu)}(x-\\mu)xlogq(μ)(xμ)也将较小,所以上述假设可以满足。一般情况下,当xlogq(μ)(xμ)<0.1\\nabla_x\\log{q(\\mu)}(x-\\mu)<0.1xlogq(μ)(xμ)<0.1时,近似的误差较小,可忽略。\nlog(1+y)log(1+y)y=0+ylog(1+y)y=0(y0)=y\\begin{align}\n \\log(1+y) &\\approx \\log(1+y)\\bigg|_{y=0} + \\nabla_y\\log(1+y)\\bigg|_{y=0}(y-0) \\tag{B.6} \\newline\n &= y \\tag{B.7}\n\\end{align}log(1+y)log(1+y)y=0+ylog(1+y)y=0(y0)=y(B.6)(B.7)\n
  • \n
\n利用上面的两个假设,可对q(xz)q(x|z)q(xz)进行如下的推导:\n\n

q(xz)12πσexp(xμ)22σ2 q(x)12πσexp(xμ)22σ2 q(μ)(1+xlogq(μ)(xμ))=q(μ)2πσexp((xμ)22σ2+log(1+xlogq(μ)(xμ)))q(μ)2πσexp((xμ)22σ2+xlogq(μ)(xμ))=q(μ)2πσexp((xμ)22σ2xlogq(μ)(xμ)2σ2)=q(μ)2πσexp((xμσ2xlogq(μ))22σ2+(σ2xlogq(μ))22σ2)=exp((xμσ2xlogq(μ))22σ2)q(μ)2πσexp(12(σxlogq(μ))2)const\\begin{align}\n q(x|z) &\\propto \\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}\\ q(x) \\tag{B.8} \\newline\n &\\approx \\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}\\ q(\\mu)\\big(1+ \\nabla_x\\log{q(\\mu)}(x-\\mu)\\big) \\tag{B.9} \\newline\n &= \\frac{q(\\mu)}{\\sqrt{2\\pi}\\sigma}\\exp\\left(\\frac{-(x-\\mu)^2}{2\\sigma^2}+\\log\\big(1+ \\nabla_x\\log{q(\\mu)}(x-\\mu)\\big)\\right) \\tag{B.10} \\newline\n &\\approx \\frac{q(\\mu)}{\\sqrt{2\\pi}\\sigma}\\exp\\left(\\frac{-(x-\\mu)^2}{2\\sigma^2}+\\nabla_x\\log{q(\\mu)}(x-\\mu)\\right) \\tag{B.11} \\newline\n &= \\frac{q(\\mu)}{\\sqrt{2\\pi}\\sigma}\\exp\\left(-\\frac{(x-\\mu)^2-2\\sigma^2\\nabla_x\\log{q(\\mu)}(x-\\mu)}{2\\sigma^2}\\right) \\tag{B.12} \\newline\n &= \\frac{q(\\mu)}{\\sqrt{2\\pi}\\sigma}\\exp\\left(-\\frac{\\big(x-\\mu-\\sigma^2\\nabla_x\\log{q(\\mu)}\\big)^2}{2\\sigma^2}+\\frac{\\big(\\sigma^2\\nabla_x\\log{q(\\mu)}\\big)^2}{2\\sigma^2}\\right) \\tag{B.13} \\newline\n &= \\exp\\left(-\\frac{\\big(x-\\mu-\\sigma^2\\nabla_x\\log{q(\\mu)}\\big)^2}{2\\sigma^2}\\right) \\underbrace{\\frac{q(\\mu)}{\\sqrt{2\\pi}\\sigma} \\exp\\left( \\frac{1}{2}\\big(\\sigma\\nabla_x\\log{q(\\mu)}\\big)^2\\right)}_{\\text{const}} \\tag{B.14}\n\\end{align}q(xz)2πσ1exp2σ2(xμ)2 q(x)2πσ1exp2σ2(xμ)2 q(μ)(1+xlogq(μ)(xμ))=2πσq(μ)exp(2σ2(xμ)2+log(1+xlogq(μ)(xμ)))2πσq(μ)exp(2σ2(xμ)2+xlogq(μ)(xμ))=2πσq(μ)exp(2σ2(xμ)22σ2xlogq(μ)(xμ))=2πσq(μ)exp(2σ2(xμσ2xlogq(μ))2+2σ2(σ2xlogq(μ))2)=exp(2σ2(xμσ2xlogq(μ))2)const2πσq(μ)exp(21(σxlogq(μ))2)(B.8)(B.9)(B.10)(B.11)(B.12)(B.13)(B.14)

\n

其中,式B.9应用了假设1的结论,式B.11应用了假设2的结论。

\n

式B.14中的const项是常数项,不会影响函数的形状。另外,由上面可知,q(xz)q(x|z)q(xz)具有自归一化的功能,所以,q(xz)q(x|z)q(xz)是一个高斯概率密度函数,均值为μ+σ2xlogq(μ)\\mu+\\sigma^2\\nabla_x\\log{q(\\mu)}μ+σ2xlogq(μ),方差为σ2\\sigma^2σ2
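The conclusion of B.14 can also be checked numerically: for α close to 1 (small σ), the exact normalized q(x|z) on a grid should be close to a Gaussian with mean μ + σ²∇ₓlog q(μ) and standard deviation σ. The smooth toy q(x), α and z below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Sketch: numeric check of the conclusion above (B.14). For alpha close to 1
# (small sigma), the exact normalized q(x|z) on a grid should be close to a
# Gaussian with mean mu + sigma^2 * d/dx log q(mu) and standard deviation sigma.
# The smooth toy q(x), alpha and z below are illustrative choices.

alpha, z_val = 0.98, 0.4
x = np.linspace(-3, 3, 2000); dx = x[1] - x[0]
q_x = np.exp(-(x - 0.2)**2 / 0.8) * (1.2 + np.sin(2 * x))   # smooth, positive
q_x /= q_x.sum() * dx

mu = z_val / np.sqrt(alpha); sigma = np.sqrt((1 - alpha) / alpha)
post = norm.pdf(x, mu, sigma) * q_x
post /= post.sum() * dx                                     # exact q(x|z)

grad_log_q = np.gradient(np.log(q_x), dx)                   # d/dx log q(x)
i = np.argmin(np.abs(x - mu))                               # grid point nearest mu
approx = norm.pdf(x, mu + sigma**2 * grad_log_q[i], sigma)  # Gaussian of B.14

print(np.abs(post - approx).sum() * dx)                     # small L1 error
```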

\n
","non_expanding_zh":"

Corollary 1

\n

以KL Divergence为度量,markov chain的转移变换是non-expanding的[23],即\nKL(p(x),q(x))KL(p(z),q(z))\\begin{align}\n KL\\big(p(x), q(x)\\big) &\\le KL\\big(p(z), q(z)\\big) \\tag{C.1} \\newline\n\\end{align}KL(p(x),q(x))KL(p(z),q(z))(C.1)\n其中,p(z)p(z)p(z)q(z)q(z)q(z)是任意的概率密度函数,r(xz)r(x|z)r(xz)是markov chain的转移概率密度函数,p(x)=r(xz)p(z)dzp(x) = \\int r(x|z)p(z)dzp(x)=r(xz)p(z)dzq(x)=r(xz)q(z)dzq(x) = \\int r(x|z) q(z) dzq(x)=r(xz)q(z)dz

\n

证明:

\n

对于p(x,z)和q(x,z)的KL divergence,存在如下的关系:
\begin{align}
   KL\big(p(x,z), q(x,z)\big) &= \iint p(x,z)\log \frac{p(x,z)}{q(x,z)}dxdz \tag{C.2} \newline
   &= \iint p(x,z)\log \frac{p(z)p(x|z)}{q(z)q(x|z)}dxdz \tag{C.3} \newline
   &= \iint p(x,z)\log \frac{p(z)}{q(z)}dxdz + \iint p(x,z) \log\frac{p(x|z)}{q(x|z)} dxdz \tag{C.4} \newline
   &= \int \int p(x,z) dx\ \log \frac{p(z)}{q(z)}dz + \int p(z)\int p(x|z) \log\frac{p(x|z)}{q(x|z)} dx\ dz \tag{C.5} \newline
   &= KL\big(p(z), q(z)\big) + \int p(z) KL\big(p(x|z), q(x|z)\big)dz \tag{C.6} \newline
\end{align}

\n

类似地,调换Z和X的顺序,可得到下面的关系:
\begin{align}
   KL\big(p(x,z), q(x,z)\big) &= KL\big(p(x), q(x)\big) + \int p(x) KL\big(p(z|x), q(z|x)\big)dx \tag{C.7}
\end{align}

\n

比较两个关系式,可得:
\begin{align}
   KL\big(p(z), q(z)\big) + \int p(z) KL\big(p(x|z), q(x|z)\big)dz = KL\big(p(x), q(x)\big) + \int p(x) KL\big(p(z|x), q(z|x)\big)dx \tag{C.8}
\end{align}

\n

由于q(xz)q(x|z)q(xz)p(xz)p(x|z)p(xz)都是markov chain的转移概率密度,均等于r(xz)r(x|z)r(xz),所以p(z)KL(p(xz),q(xz))dz\\int p(z) KL\\big(p(x|z), q(x|z)\\big)dzp(z)KL(p(xz),q(xz))dz等于0。于是,上式简化为:\nKL(p(x),q(x))=KL(p(z),q(z))p(x)KL(p(zx),q(zx))dx\\begin{align}\n KL\\big(p(x), q(x)\\big) = KL\\big(p(z), q(z)\\big) - \\int p(x) KL\\big(p(z|x), q(z|x)\\big)dx \\tag{C.9}\n\\end{align}KL(p(x),q(x))=KL(p(z),q(z))p(x)KL(p(zx),q(zx))dx(C.9)

\n

由于KL divergence总是大于或者等于0,所以,加权和p(x)KL(p(zx),q(zx))dx\\int p(x) KL\\big(p(z|x), q(z|x)\\big)dxp(x)KL(p(zx),q(zx))dx也是大于等于0。于是,可得:\nKL(p(x),q(x))KL(p(z),q(z))\\begin{align}\n KL\\big(p(x), q(x)\\big) \\le KL\\big(p(z), q(z)\\big) \\tag{C.10}\n\\end{align}KL(p(x),q(x))KL(p(z),q(z))(C.10)

\n
\n\n

上式等号成立的条件是p(x)KL(p(zx),q(zx))dx\\int p(x) KL\\big(p(z|x), q(z|x)\\big)dxp(x)KL(p(zx),q(zx))dx等于0,这要求对不同的条件xxxp(zx)p(z|x)p(zx)q(zx)q(z|x)q(zx)均要相等。在大多数情况下,当p(z)p(z)p(z)q(z)q(z)q(z)不同时,p(zx)p(z|x)p(zx)也和q(zx)q(z|x)q(zx)不同。这意味着,在大多数情况下,有\nKL(p(x),q(x))<KL(p(z),q(z))\\begin{align}\n KL\\big(p(x), q(x)\\big) < KL\\big(p(z), q(z)\\big) \\tag{C.11}\n\\end{align}KL(p(x),q(x))<KL(p(z),q(z))(C.11)
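Corollary 1 (Equations C.1 and C.11) can also be illustrated numerically with an arbitrary random transition kernel; the sketch below is a toy check, not tied to any particular q(x|z).

```python
import numpy as np

# Sketch: toy check of C.1/C.11 — apply one random transition kernel r(x|z) to two
# different input distributions and compare the KL divergence before and after.

rng = np.random.default_rng(7)
n = 50
R = rng.random((n, n)); R /= R.sum(axis=0, keepdims=True)   # columns: r(x|z)

p_z = rng.random(n); p_z /= p_z.sum()
q_z = rng.random(n); q_z /= q_z.sum()
p_x, q_x = R @ p_z, R @ q_z

kl = lambda a, b: np.sum(a * np.log(a / b))
print(kl(p_x, q_x), "<", kl(p_z, q_z))
```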

\n



\nCorollary 2

\n

以Total Variance(L1 distance)为度量,markov chain的转移变换是non-expanding,即\np(x)q(x)1  p(z)q(z)1\\begin{align}\n \\left\\lVert p(x)-q(x) \\right\\rVert_1\\ &\\le\\ \\left\\lVert p(z) - q(z) \\right\\rVert_1 \\tag{C.12}\n\\end{align}p(x)q(x)1  p(z)q(z)1(C.12)

\n

其中,p(z)p(z)p(z)q(z)q(z)q(z)是任意的概率密度函数,r(xz)r(x|z)r(xz)是markov chain的转移概率密度函数,p(x)=r(xz)p(z)dzp(x) = \\int r(x|z)p(z)dzp(x)=r(xz)p(z)dzq(x)=r(xz)q(z)dzq(x) = \\int r(x|z) q(z) dzq(x)=r(xz)q(z)dz

\n

证明:\np(x)q(x)1 =p(x)q(x)dx=r(xz)p(z)dzr(xz)q(z)dzdx=r(xz)(p(z)q(z))dzdxr(xz)(p(z)q(z))dzdx=r(xz)dx(p(z)q(z))dz=(p(z)q(z))dz=p(z)q(z)1\\begin{align}\n \\left\\lVert p(x)-q(x) \\right\\rVert_1\\ &= \\int \\big\\lvert p(x) - q(x) \\big\\rvert dx \\tag{C.13} \\newline\n &= \\int \\left\\lvert \\int r(x|z) p(z) dz - \\int r(x|z)q(z)dz \\right\\rvert dx \\tag{C.14} \\newline\n &= \\int \\left\\lvert \\int r(x|z) \\big(p(z)-q(z)\\big) dz \\right\\rvert dx \\tag{C.15} \\newline\n &\\le \\int \\int r(x|z) \\left\\lvert \\big(p(z)-q(z)\\big) \\right\\rvert dz dx \\tag{C.16} \\newline\n &= \\int \\int r(x|z)dx \\left\\lvert \\big(p(z)-q(z)\\big) \\right\\rvert dz \\tag{C.17} \\newline\n &= \\int \\left\\lvert \\big(p(z)-q(z)\\big) \\right\\rvert dz \\tag{C.18} \\newline\n &= \\left\\lVert p(z) - q(z) \\right\\rVert_1 \\tag{C.19}\n\\end{align}p(x)q(x)1 =p(x)q(x)dx=r(xz)p(z)dzr(xz)q(z)dzdx=r(xz)(p(z)q(z))dzdx∫∫r(xz)(p(z)q(z))dzdx=∫∫r(xz)dx(p(z)q(z))dz=(p(z)q(z))dz=p(z)q(z)1(C.13)(C.14)(C.15)(C.16)(C.17)(C.18)(C.19)

\n

其中,式C.16应用了绝对值不等式,式C.18利用了r(xz)r(x|z)r(xz)是概率分布的性质。

\n

证明完毕。

\n
\n\n

图C.1展示了一个一维随机变量的例子,可以更直观地理解上述推导的过程。

\n

上述等式的成立的条件是:各个绝对值括号内的非零项均是同样的符号。如图C.1(a),包含5个绝对值括号,每个对应一行,每个括号内有5项,当且仅当每行各个非零项同号时,上述的等式才成立。如果出现不同号的情况,则会导致p(x)q(x)1 < p(z)q(z)1\\lVert p(x)-q(x) \\rVert_1\\ <\\ \\lVert p(z) - q(z) \\rVert_1p(x)q(x)1 < p(z)q(z)1。不同号出现的数量与转移概率矩阵的非零元素有关,一般情况下,非零元素越多,不同号出现的数量会越多。

\n

在后验概率变换中,一般情况下,当α\\alphaα越小(噪声越多)时,转移概率密度函数会有越多的非零元素,如图C.2(a)所示;当α\\alphaα越大(噪声越小)时,转移概率密度函数会有越少的非零元素,如图C.2(b)所示。

\n

所以,有这么一个规律:α\\alphaα越小时,则会导致p(x)q(x)1\\lVert p(x)-q(x) \\rVert_1p(x)q(x)1越小于p(z)q(z)1\\lVert p(z) - q(z) \\rVert_1p(z)q(z)1,也就是说,这个变换的压缩率越大

\n
\n
Figure C.1: Non-expanding under L1 norm
\n
\n
\n
Figure C.2: More non-zero elements as α\\alphaα gets smaller
","stationary_zh":"

根据文献[19]Theorem 3的结论,非周期(aperiodic)不可约(irreducible)的markov chain会收敛于惟一的稳态分布

\n

下面将表明,当满足一定的条件时,后验概率变换是一个非周期不可约的markov chain的转移概率密度函数。

\n

为了表述方便,下面以一个更通用的形式来描述扩散模型的前向变换。\nZ=αX+β ϵ\\begin{align}\n Z = \\sqrt{\\alpha}X + \\sqrt{\\beta}\\ \\epsilon \\tag{D.1} \\newline\n\\end{align}Z=αX+β ϵ(D.1)

\n

第1节可知,αX\\sqrt{\\alpha}XαX会对XXX的概率密度函数执行缩放,所以α\\alphaα控制着缩放的强度,β\\betaβ控制着添加噪声的大小。当β=1α\\beta = 1-\\alphaβ=1α时,上述的变换与式1.1一致。

\n

新变换对应的后验概率分布的形式如下:\nq(xz=c)=Normalize( 12πσexp(xμ)22σ2GaussFun q(x) )where μ=cασ=βαc is a fixed value\\begin{align}\n q(x|z=c) = \\operatorname{Normalize} \\Big(\\ \\overbrace{\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}}^{\\text{GaussFun}}\\ q(x)\\ \\Big) \\tag{D.2} \\newline\n \\text{where}\\ \\mu=\\frac{c}{\\sqrt{\\alpha}}\\qquad \\sigma=\\sqrt{\\frac{\\beta}{\\alpha}} \\qquad \\text{$c$ is a fixed value} \\notag\n\\end{align}q(xz=c)=Normalize( 2πσ1exp2σ2(xμ)2GaussFun q(x) )where μ=αcσ=αβc is a fixed value(D.2)

\n

β=1α\\beta = 1-\\alphaβ=1α时,上述的变换与式3.4一致。

\n

为了表述方便,下面以g(x)g(x)g(x)表示式D.2中GaussFun。

\n

由于αX\\sqrt{\\alpha}XαX会缩放XXX的概率密度函数q(x)q(x)q(x),这会使分析转移概率密度函数q(xz)q(x|z)q(xz)的非周期性和不可约性变得更复杂。所以,为了分析方便,先假设α=1\\alpha=1α=1,后面再分析α1\\alpha \\neq 1α=1β=1α\\beta = 1-\\alphaβ=1α的情况。

\n
\n
Figure D.1: Only one component in support
\n\n
\n
Figure D.2: One component which can communicate with each other
\n \n
\n

α=1\\alpha=1α=1

\n \n

α=1\\alpha=1α=1时,如果q(x)q(x)q(x)β\\betaβ满足下面两个条件之一,则q(xz)q(x|z)q(xz)对应的markov chain是非周期且不可约的。

\n
    \n
  1. 如果q(x)q(x)q(x)的支撑集只存在一个connected component。
  2. \n
  3. 如果q(x)q(x)q(x)的支撑集存在多个connected component,但各个connected component之间的距离小于333σ\\sigmaσ。也就是说,间隙能被g(x)g(x)g(x)的有效区域的半径所覆盖。
  4. \n
\n\n

证明如下:

\n
    \n
  1. \n对q(x)q(x)q(x)支撑集内的任意点ccc,当z=cz=cz=cx=cx=cx=c时,q(x=c)>0q(x=c)>0q(x=c)>0;由式D.2可知,g(x)g(x)g(x)的中心位于ccc,所以g(x)g(x)g(x)x=cx=cx=c处也大于0。于是,根据式D.2中相乘的关系可知,q(x=cz=c)>0q(x=c|z=c)>0q(x=cz=c)>0。因此,q(xz)q(x|z)q(xz)对应的markov chain是非周期的。 \n\n

    q(x)q(x)q(x)支撑集内的任意点ccc,当z=cz=cz=c时,g(x)g(x)g(x)的中心位于ccc, 所以存在一个以ccc为中心的超球(xc2<δ\\lVert x-c\\rVert_2 < \\deltaxc2<δ),在此超球内,q(xz=c)>0q(x|z=c)>0q(xz=c)>0,也就是说,状态ccc可以访问(access)附近的其它状态。由于支撑集内每个状态都具有此性质,所以,整个支撑集内的状态构成一个Communicate Class\\textcolor{red}{\\text{Communicate Class}}Communicate Class[14]。因此,q(xz)q(x|z)q(xz)对应的markov chain是不可约的。

    \n

    所以,满足条件1的markov chain是非周期和不可约的。可看图D.1的例子,其展示了单个connected component的例子。

    \n
  2. \n\n
  3. \n当q(x)q(x)q(x)支撑集存在多个connected component时,markov chain可能存在多个communicate class。但当各间隙小于g(x)g(x)g(x)的3倍标准差时,那各个component的状态的将可互相访问(access),因此,q(xz)q(x|z)q(xz)对应的markov chain也只存在一个communicate class,与条件1的情况相同。所以,满足条件2的markov chain是非周期和不可约的。\n\n

    可看图d2的例子,其展示了多个connected component的例子。

    \n
  4. \n
\n\n
\n
Figure D.3: Two component which cannot communicate with each other
\n\n
\n

α1\\alpha \\neq 1α=1

\n\n

α1\\alpha \\neq 1α=1时,对q(x)q(x)q(x)支撑集内的任意点ccc,由式D.2可知,g(x)g(x)g(x)的中心不再是ccc,而是cα\\frac{c}{\\sqrt{\\alpha}}αc。也就是说g(x)g(x)g(x)的中心会偏离ccc,偏离的距离为c(1αα)\\lVert c\\rVert(\\frac{1-\\sqrt{\\alpha}}{\\sqrt{\\alpha}})c(α1α)。可以看出,c\\lVert c\\rVertc越大,偏离越多。具体可看图D.4(c)和图D.4(d)的例子,在图D.4(d)中,当z=2.0z=2.0z=2.0g(x)g(x)g(x)的中心明显偏离x=2.0x=2.0x=2.0。本文将此现象称之为中心偏离现象

\n

中心偏离现象将会影响markov chain一些状态的性质。

\n

当偏离的距离明显大于3σ3\\sigma3σ时,g(x)g(x)g(x)x=cx=cx=c及其附近可能均为零,于是,q(x=cz=c)q(x=c|z=c)q(x=cz=c)可能等于0,并且在x=cx=cx=c附近q(xz=c)q(x|z=c)q(xz=c)也可能等于0。所以,状态ccc不一定可访问附近的状态。这一点与α=1\\alpha=1α=1的情况不同。具体可图D.5的例子,绿色曲线\\textcolor{green}{\\text{绿色曲线}}绿色曲线z=6.0z=6.0z=6.0g(x)g(x)g(x)黄线曲线\\textcolor{orange}{\\text{黄线曲线}}黄线曲线q(xz=6.0)q(x|z=6.0)q(xz=6.0),由于g(x)g(x)g(x)的中心偏离x=6.0x=6.0x=6.0太多,导致q(x=6.0z=6.0)=0q(x=6.0|z=6.0)=0q(x=6.0∣z=6.0)=0

\n

当偏离的距离明显小于3σ3\\sigma3σ时,g(x)g(x)g(x)x=cx=cx=c及其附近均不为零,于是,q(x=cz=c)q(x=c|z=c)q(x=cz=c)不等于0,并且在x=cx=cx=c附近q(xz=c)q(x|z=c)q(xz=c)也不等于0。所以,状态ccc可访问附近的状态,并且是非周期的。

\n

ccc满足什么要求时,g(x)g(x)g(x)中心的偏离距离会小于3σ3\\sigma3σ呢?\nc(1αα) < 3βαc < 3β1α\\begin{align}\n \\lVert c\\rVert(\\frac{1-\\sqrt{\\alpha}}{\\sqrt{\\alpha}})\\ <\\ 3\\frac{\\sqrt{\\beta}}{\\sqrt{\\alpha}} \\qquad \\Rightarrow \\qquad \\lVert c\\rVert \\ <\\ 3\\frac{\\sqrt{\\beta}}{1-\\sqrt{\\alpha}} \\tag{D.3} \\newline\n\\end{align}c(α1α) < 3αβc < 31αβ(D.3)

\n

由上可知,存在一个上限,只要c\\lVert c\\rVertc小于这个上限,可保证偏离量小于3σ3\\sigma3σ

\n

β=1α\\beta=1-\\alphaβ=1α时,上式变为\nc < 31α1α\\begin{align}\n \\lVert c\\rVert \\ <\\ 3\\frac{\\sqrt{1-\\alpha}}{1-\\sqrt{\\alpha}} \\tag{D.4} \\newline\n\\end{align}c < 31α1α(D.4)

\n

31α1α3\frac{\sqrt{1-\alpha}}{1-\sqrt{\alpha}}31α1α与α\alphaα有着严格的单调递增的关系:利用1α=(1α)(1+α)1-\alpha=(1-\sqrt{\alpha})(1+\sqrt{\alpha})1α=(1α)(1+α),可将其改写为31+α1α3\sqrt{\frac{1+\sqrt{\alpha}}{1-\sqrt{\alpha}}}31α1+α,该式随α\alphaα增大而增大,且当α0\alpha \to 0α0时趋于333。

\n

α(0,1)\\alpha \\in (0, 1)α(0,1)时,\n31α1α>3\\begin{align}\n 3\\frac{\\sqrt{1-\\alpha}}{1-\\sqrt{\\alpha}} > 3 \\tag{D.5} \\newline\n\\end{align}31α1α>3(D.5)

\n

根据上面的分析,可总结出以下的结论:

\n
    \n
  1. \n如果q(x)q(x)q(x)的支撑集只存在一个connected component,并且支撑集的点离原点的距离均小于31α1α 3\\frac{\\sqrt{1-\\alpha}}{1-\\sqrt{\\alpha}}31α1α,那么q(xz)q(x|z)q(xz)对应的markov chain是非周期和不可约的。\n
  2. \n如果q(x)q(x)q(x)的支撑集存在多个connected component,由于g(x)g(x)g(x)的中心偏离效应,准确判断两个component之间是否可以互相访问变得更加复杂,这里不再详细分析。但下面给出一个保守的结论:如果支撑集的点离原点的距离均小于111,并且各个connected component之间的间隙均小于2σ2\sigma2σ,那么q(xz)q(x|z)q(xz)对应的markov chain是非周期和不可约的。\n
\n\n
\n
Figure D.4: Center Deviation of the GaussFun
\n
\n
\n
Figure D.5: Deviation is More Than 3σ3\\sigma3σ
","approx_gauss_en":"

From equation 3.4, it can be seen that q(xz)q(x|z)q(xz) takes the following form:\nq(xz)=Normalize( 12πσexp(xμ)22σ2 q(x) )where μ=zασ=1αα12πσexp(xμ)22σ2GaussFun q(x)\\begin{align}\n q(x|z) &= \\operatorname{Normalize} \\Big(\\ \\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}\\ q(x)\\ \\Big)& \\qquad &\\text{where}\\ \\mu=\\frac{z}{\\sqrt{\\alpha}}\\quad \\sigma=\\sqrt{\\frac{1-\\alpha}{\\alpha}} \\tag{B.1} \\newline\n &\\propto \\underbrace{\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}}_{\\text{GaussFun}}\\ q(x) \\tag{B.2}\n\\end{align}q(xz)=Normalize( 2πσ1exp2σ2(xμ)2 q(x) )GaussFun2πσ1exp2σ2(xμ)2 q(x)where μ=αzσ=α1α(B.1)(B.2)

\n

Below we will prove that if the following two assumptions are satisfied, q(xz)q(x|z)q(xz) approximates a Gaussian distribution.

\n
    \n
  • \nAssume that within the support of GaussFun, q(x)q(x)q(x) undergoes linear changes. Expand q(x)q(x)q(x) around the mean of GaussFun using a Taylor series. According to the properties of Taylor expansion, these assumptions can be satisfied when the standard deviation σ\\sigmaσ of GaussFun is sufficiently small.\nq(x)q(μ)+xq(μ)(xμ)whereq(μ)q(x)x=μxq(μ)xq(x)x=μ=q(μ)(1+xq(μ)q(μ)(xμ))=q(μ)(1+xlogq(μ)(xμ))wherexlogq(μ)xlogq(x)x=μ\\begin{align}\n q(x) &\\approx q(\\mu) + \\nabla_xq(\\mu)(x-\\mu)& \\quad &\\text{where}\\quad q(\\mu)\\triangleq q(x)\\bigg|_{x=\\mu} \\quad \\nabla_xq(\\mu)\\triangleq \\nabla_xq(x)\\bigg|_{x=\\mu} \\tag{B.3} \\newline\n &= q(\\mu)\\big(1+ \\frac{\\nabla_xq(\\mu)}{q(\\mu)}(x-\\mu)\\big)& \\tag{B.4} \\newline\n &= q(\\mu)\\big(1+ \\nabla_x\\log{q(\\mu)}(x-\\mu)\\big)& \\quad &\\text{where}\\quad \\nabla_x\\log{q(\\mu)}\\triangleq \\nabla_x\\log{q(x)}\\bigg|_{x=\\mu} \\tag{B.5}\n\\end{align}q(x)q(μ)+xq(μ)(xμ)=q(μ)(1+q(μ)xq(μ)(xμ))=q(μ)(1+xlogq(μ)(xμ))whereq(μ)q(x)x=μxq(μ)xq(x)x=μwherexlogq(μ)xlogq(x)x=μ(B.3)(B.4)(B.5)\n
  • \n
  • \nAssuming within the support of GaussFun, log(1+xlogq(μ)(xμ))\\log\\big(1+\\nabla_x\\log{q(\\mu)}(x-\\mu)\\big)log(1+xlogq(μ)(xμ)) can be approximated by xlogq(μ)(xμ)\\nabla_x\\log{q(\\mu)}(x-\\mu)xlogq(μ)(xμ). By expanding log(1+y)\\log(1+y)log(1+y) using Taylor series, according to the properties of Taylor expansion, when y2\\lVert y\\rVert_2y2 is small, log(1+y)\\log(1+y)log(1+y) can be approximated by yyy. When σ\\sigmaσ is sufficiently small, xu2\\lVert x-u\\rVert_2xu2 will be small, and xlogq(μ)(xμ)\\nabla_x\\log{q(\\mu)}(x-\\mu)xlogq(μ)(xμ)will also be small, hence the above assumption can be satisfied. Generally, when xlogq(μ)(xμ)<0.1\\nabla_x\\log{q(\\mu)}(x-\\mu)<0.1xlogq(μ)(xμ)<0.1, the approximation error is small enough to be negligible.\nlog(1+y)log(1+y)y=0+ylog(1+y)y=0(y0)=y\\begin{align}\n \\log(1+y) &\\approx \\log(1+y)\\bigg|_{y=0} + \\nabla_y\\log(1+y)\\bigg|_{y=0}(y-0) \\tag{B.6} \\newline\n &= y \\tag{B.7}\n\\end{align}log(1+y)log(1+y)y=0+ylog(1+y)y=0(y0)=y(B.6)(B.7)\n
  • \n
\nUsing the above two assumptions, q(xz)q(x|z)q(xz) can be transformed into the following form:\n\n

q(xz)12πσexp(xμ)22σ2 q(x)12πσexp(xμ)22σ2 q(μ)(1+xlogq(μ)(xμ))=q(μ)2πσexp((xμ)22σ2+log(1+xlogq(μ)(xμ)))q(μ)2πσexp((xμ)22σ2+xlogq(μ)(xμ))=q(μ)2πσexp((xμ)22σ2xlogq(μ)(xμ)2σ2)=q(μ)2πσexp((xμσ2xlogq(μ))22σ2+(σ2xlogq(μ))22σ2)=exp((xμσ2xlogq(μ))22σ2)q(μ)2πσexp(12(σxlogq(μ))2)const\\begin{align}\n q(x|z) &\\propto \\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}\\ q(x) \\tag{B.8} \\newline\n &\\approx \\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}\\ q(\\mu)\\big(1+ \\nabla_x\\log{q(\\mu)}(x-\\mu)\\big) \\tag{B.9} \\newline\n &= \\frac{q(\\mu)}{\\sqrt{2\\pi}\\sigma}\\exp\\left(\\frac{-(x-\\mu)^2}{2\\sigma^2}+\\log\\big(1+ \\nabla_x\\log{q(\\mu)}(x-\\mu)\\big)\\right) \\tag{B.10} \\newline\n &\\approx \\frac{q(\\mu)}{\\sqrt{2\\pi}\\sigma}\\exp\\left(\\frac{-(x-\\mu)^2}{2\\sigma^2}+\\nabla_x\\log{q(\\mu)}(x-\\mu)\\right) \\tag{B.11} \\newline\n &= \\frac{q(\\mu)}{\\sqrt{2\\pi}\\sigma}\\exp\\left(-\\frac{(x-\\mu)^2-2\\sigma^2\\nabla_x\\log{q(\\mu)}(x-\\mu)}{2\\sigma^2}\\right) \\tag{B.12} \\newline\n &= \\frac{q(\\mu)}{\\sqrt{2\\pi}\\sigma}\\exp\\left(-\\frac{\\big(x-\\mu-\\sigma^2\\nabla_x\\log{q(\\mu)}\\big)^2}{2\\sigma^2}+\\frac{\\big(\\sigma^2\\nabla_x\\log{q(\\mu)}\\big)^2}{2\\sigma^2}\\right) \\tag{B.13} \\newline\n &= \\exp\\left(-\\frac{\\big(x-\\mu-\\sigma^2\\nabla_x\\log{q(\\mu)}\\big)^2}{2\\sigma^2}\\right) \\underbrace{\\frac{q(\\mu)}{\\sqrt{2\\pi}\\sigma} \\exp\\left( \\frac{1}{2}\\big(\\sigma\\nabla_x\\log{q(\\mu)}\\big)^2\\right)}_{\\text{const}} \\tag{B.14}\n\\end{align}q(xz)2πσ1exp2σ2(xμ)2 q(x)2πσ1exp2σ2(xμ)2 q(μ)(1+xlogq(μ)(xμ))=2πσq(μ)exp(2σ2(xμ)2+log(1+xlogq(μ)(xμ)))2πσq(μ)exp(2σ2(xμ)2+xlogq(μ)(xμ))=2πσq(μ)exp(2σ2(xμ)22σ2xlogq(μ)(xμ))=2πσq(μ)exp(2σ2(xμσ2xlogq(μ))2+2σ2(σ2xlogq(μ))2)=exp(2σ2(xμσ2xlogq(μ))2)const2πσq(μ)exp(21(σxlogq(μ))2)(B.8)(B.9)(B.10)(B.11)(B.12)(B.13)(B.14)

\n

Here, Equation B.9 uses Assumption 1, and Equation B.11 uses Assumption 2.

\n

The const term in Equation B.14 does not depend on xxx and therefore does not affect the shape of the function. Moreover, since q(xz)q(x|z)q(xz) is itself a normalized probability density, it follows that q(xz)q(x|z)q(xz) is approximately a Gaussian probability density function with mean μ+σ2xlogq(μ)\mu + \sigma^2 \nabla_x \log{q(\mu)}μ+σ2xlogq(μ) and variance σ2\sigma^2σ2.
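As a quick numerical sanity check of this approximation, the sketch below compares the posterior computed directly from Equation B.1 on a grid with the Gaussian approximation derived above. It is only an illustration: the 1-D density q(x), the grid, and the values of α and z are assumptions chosen for the example.

```python
import numpy as np

# Grid and an arbitrary smooth 1-D density q(x) (a mixture of two Gaussians).
x = np.linspace(-4, 4, 4001)
dx = x[1] - x[0]
q = 0.3 * np.exp(-0.5 * (x - 1.0) ** 2 / 0.5 ** 2) \
    + 0.7 * np.exp(-0.5 * (x + 0.8) ** 2 / 0.7 ** 2)
q /= q.sum() * dx                       # normalize to a density on the grid

alpha = 0.98                            # alpha close to 1 -> small sigma -> good approximation
z = 0.5                                 # a fixed value of z
mu = z / np.sqrt(alpha)
sigma = np.sqrt((1 - alpha) / alpha)

# Exact posterior on the grid: Normalize(GaussFun(x) * q(x))   (Equation B.1)
gauss_fun = np.exp(-0.5 * (x - mu) ** 2 / sigma ** 2)
post = gauss_fun * q
post /= post.sum() * dx

# Gaussian approximation: mean mu + sigma^2 * d/dx log q(mu), variance sigma^2   (Equation B.14)
grad_log_q = np.gradient(np.log(q), dx)
mu_hat = mu + sigma ** 2 * np.interp(mu, x, grad_log_q)
approx = np.exp(-0.5 * (x - mu_hat) ** 2 / sigma ** 2)
approx /= approx.sum() * dx

# The L1 difference should be small when sigma is small (alpha close to 1).
print("L1 difference:", np.abs(post - approx).sum() * dx)
```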

\n
","non_expanding_en":"

Corollary 1

\n

Using KL Divergence as a metric, the transition transform of Markov chain is non-expanding[23], which means\nKL(p(x),q(x))KL(p(z),q(z))\\begin{align}\n KL\\big(p(x), q(x)\\big) &\\le KL\\big(p(z), q(z)\\big) \\tag{C.1} \\newline\n\\end{align}KL(p(x),q(x))KL(p(z),q(z))(C.1)\nHere, p(z)p(z)p(z) and q(z)q(z)q(z) are arbitrary probability density functions, and r(xz)r(x|z)r(xz) is the transition probability density function of the Markov chain. We have p(x)=r(xz)p(z)dzp(x) = \\int r(x|z)p(z)dzp(x)=r(xz)p(z)dz and q(x)=r(xz)q(z)dzq(x) = \\int r(x|z)q(z)dzq(x)=r(xz)q(z)dz.

\n

Proof:

\n

For the KL divergence between p(x,z)p(x,z)p(x,z) and q(x,z)q(x,z)q(x,z), factorizing each joint density as the marginal of zzz times the conditional of xxx given zzz, the following relationship exists:
\begin{align}
    KL\big(p(x,z), q(x,z)\big) &= \iint p(x,z)\log \frac{p(x,z)}{q(x,z)}dxdz \tag{C.2} \newline
    &= \iint p(x,z)\log \frac{p(z)p(x|z)}{q(z)q(x|z)}dxdz \tag{C.3} \newline
    &= \iint p(x,z)\log \frac{p(z)}{q(z)}dxdz + \iint p(x,z) \log\frac{p(x|z)}{q(x|z)} dxdz \tag{C.4} \newline
    &= \int \Big(\int p(x,z) dx\Big)\ \log \frac{p(z)}{q(z)}dz + \int p(z)\int p(x|z) \log\frac{p(x|z)}{q(x|z)} dx\ dz \tag{C.5} \newline
    &= KL\big(p(z), q(z)\big) + \int p(z) KL\big(p(x|z), q(x|z)\big)dz \tag{C.6} \newline
\end{align}

\n

Similarly, by swapping the roles of ZZZ and XXX (factorizing each joint density as the marginal of xxx times the conditional of zzz given xxx), the following relationship can be obtained:
\begin{align}
    KL\big(p(x,z), q(x,z)\big) &= KL\big(p(x), q(x)\big) + \int p(x) KL\big(p(z|x), q(z|x)\big)dx \tag{C.7}
\end{align}

\n

Comparing the two decompositions, we can obtain:
\begin{align}
    KL\big(p(z), q(z)\big) + \int p(z) KL\big(p(x|z), q(x|z)\big)dz = KL\big(p(x), q(x)\big) + \int p(x) KL\big(p(z|x), q(z|x)\big)dx \tag{C.8}
\end{align}

\n

Since q(xz)q(x|z)q(xz) and p(xz)p(x|z)p(xz) are both transition probability densities of the Markov chain, equal to r(xz)r(x|z)r(xz), the integral p(z)KL(p(xz),q(xz))dz\\int p(z) KL\\big(p(x|z), q(x|z)\\big)dzp(z)KL(p(xz),q(xz))dz equals 0. Therefore, the above equation simplifies to:\nKL(p(x),q(x))=KL(p(z),q(z))p(x)KL(p(zx),q(zx))dx\\begin{align}\n KL\\big(p(x), q(x)\\big) = KL\\big(p(z), q(z)\\big) - \\int p(x) KL\\big(p(z|x), q(z|x)\\big)dx \\tag{C.9}\n\\end{align}KL(p(x),q(x))=KL(p(z),q(z))p(x)KL(p(zx),q(zx))dx(C.9)

\n

Since KL divergence is always greater than or equal to 0, the weighted sum p(x)KL(p(zx),q(zx))dx\\int p(x) KL\\big(p(z|x), q(z|x)\\big)dxp(x)KL(p(zx),q(zx))dx is also greater than or equal to 0. Therefore, we can conclude:\nKL(p(x),q(x))KL(p(z),q(z))\\begin{align}\n KL\\big(p(x), q(x)\\big) \\le KL\\big(p(z), q(z)\\big) \\tag{C.10}\n\\end{align}KL(p(x),q(x))KL(p(z),q(z))(C.10)

\n
\n\n

The condition for the above equation to hold is that p(x)KL(p(zx),q(zx))dx\\int p(x) KL\\big(p(z|x), q(z|x)\\big)dxp(x)KL(p(zx),q(zx))dx equals 0, which requires that for different conditions xxx, p(zx)p(z|x)p(zx) and q(zx)q(z|x)q(zx) must be equal. In most cases, when p(z)p(z)p(z) and q(z)q(z)q(z) are different, p(zx)p(z|x)p(zx) and q(zx)q(z|x)q(zx) are also different. This means that in most cases, we have\nKL(p(x),q(x))<KL(p(z),q(z))\\begin{align}\n KL\\big(p(x), q(x)\\big) < KL\\big(p(z), q(z)\\big) \\tag{C.11}\n\\end{align}KL(p(x),q(x))<KL(p(z),q(z))(C.11)
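Corollary 1 is easy to check numerically in a discrete setting. The sketch below is a minimal example under assumed inputs: a random column-stochastic matrix R plays the role of r(x|z), and p_z, q_z are two arbitrary distributions over z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Column-stochastic matrix: R[x, z] ~ r(x|z), each column sums to 1.
R = rng.random((n, n))
R /= R.sum(axis=0, keepdims=True)

# Two arbitrary distributions over z.
p_z = rng.random(n); p_z /= p_z.sum()
q_z = rng.random(n); q_z /= q_z.sum()

# Apply the transition: p(x) = sum_z r(x|z) p(z), and likewise for q.
p_x = R @ p_z
q_x = R @ q_z

kl = lambda a, b: np.sum(a * np.log(a / b))
print("KL before:", kl(p_z, q_z))
print("KL after :", kl(p_x, q_x))     # expected: KL after <= KL before (Equation C.1)
```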

\n



\nCorollary 2

\n

Using Total Variation (L1 distance) as a metric, the transition transform of a Markov chain is non-expanding, which means
p(x)q(x)1  p(z)q(z)1\\begin{align}\n \\left\\lVert p(x)-q(x) \\right\\rVert_1\\ &\\le\\ \\left\\lVert p(z) - q(z) \\right\\rVert_1 \\tag{C.12}\n\\end{align}p(x)q(x)1  p(z)q(z)1(C.12)

\n

Here, p(z)p(z)p(z) and q(z)q(z)q(z) are arbitrary probability density functions, and r(xz)r(x|z)r(xz) is the transition probability density function of a Markov chain. We have p(x)=r(xz)p(z)dzp(x) = \\int r(x|z)p(z)dzp(x)=r(xz)p(z)dz and q(x)=r(xz)q(z)dzq(x) = \\int r(x|z) q(z) dzq(x)=r(xz)q(z)dz.

\n

Proof:\np(x)q(x)1 =p(x)q(x)dx=r(xz)p(z)dzr(xz)q(z)dzdx=r(xz)(p(z)q(z))dzdxr(xz)(p(z)q(z))dzdx=r(xz)dx(p(z)q(z))dz=(p(z)q(z))dz=p(z)q(z)1\\begin{align}\n \\left\\lVert p(x)-q(x) \\right\\rVert_1\\ &= \\int \\big\\lvert p(x) - q(x) \\big\\rvert dx \\tag{C.13} \\newline\n &= \\int \\left\\lvert \\int r(x|z) p(z) dz - \\int r(x|z)q(z)dz \\right\\rvert dx \\tag{C.14} \\newline\n &= \\int \\left\\lvert \\int r(x|z) \\big(p(z)-q(z)\\big) dz \\right\\rvert dx \\tag{C.15} \\newline\n &\\le \\int \\int r(x|z) \\left\\lvert \\big(p(z)-q(z)\\big) \\right\\rvert dz dx \\tag{C.16} \\newline\n &= \\int \\int r(x|z)dx \\left\\lvert \\big(p(z)-q(z)\\big) \\right\\rvert dz \\tag{C.17} \\newline\n &= \\int \\left\\lvert \\big(p(z)-q(z)\\big) \\right\\rvert dz \\tag{C.18} \\newline\n &= \\left\\lVert p(z) - q(z) \\right\\rVert_1 \\tag{C.19}\n\\end{align}p(x)q(x)1 =p(x)q(x)dx=r(xz)p(z)dzr(xz)q(z)dzdx=r(xz)(p(z)q(z))dzdx∫∫r(xz)(p(z)q(z))dzdx=∫∫r(xz)dx(p(z)q(z))dz=(p(z)q(z))dz=p(z)q(z)1(C.13)(C.14)(C.15)(C.16)(C.17)(C.18)(C.19)

\n

Here, Equation C.16 applies the absolute value inequality (the absolute value of an integral is at most the integral of the absolute value), while Equation C.18 uses the fact that r(xz)r(x|z)r(xz) is a probability density in xxx, so it integrates to 1 over xxx.

\n

Proof completed.

\n
\n\n

Figure C.1 shows an example of a one-dimensional random variable, which can help better understand the derivation process described above.

\n

The condition for the above equation to hold is that all non-zero terms inside the absolute value brackets have the same sign. As shown in Figure C.1(a), there are five absolute value brackets, each corresponding to a row, with five terms in each bracket. The above equation holds if and only if all non-zero terms in each row have the same sign. If different signs occur, this will lead to p(x)q(x)1 < p(z)q(z)1\\lVert p(x)-q(x) \\rVert_1\\ <\\ \\lVert p(z) - q(z) \\rVert_1p(x)q(x)1 < p(z)q(z)1. The number of different signs is related to the nonzero elements of the transition probability matrix. In general, the more nonzero elements there are, the more different signs there will be.
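The equality/strict-inequality behaviour described above can be reproduced in a small discrete example; the matrices and distributions below are assumptions chosen for illustration. A permutation matrix has a single nonzero element per column, so no signs mix and the L1 distance is preserved, whereas a dense column-stochastic matrix mixes signs and shrinks the distance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
p = rng.random(n); p /= p.sum()
q = rng.random(n); q /= q.sum()

# Permutation matrix: one nonzero element per column -> no sign mixing -> equality.
P = np.eye(n)[rng.permutation(n)]
# Dense column-stochastic matrix: many nonzero elements -> signs mix -> strict shrinkage.
D = rng.random((n, n)); D /= D.sum(axis=0, keepdims=True)

l1 = lambda a, b: np.abs(a - b).sum()
print("original      :", l1(p, q))
print("permutation   :", l1(P @ p, P @ q))   # equal to the original distance
print("dense matrix  :", l1(D @ p, D @ q))   # strictly smaller in general
```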

\n

For the posterior transform, generally, when α\\alphaα decreases (more noise), the transition probability density function will have more nonzero elements, as shown in Figure C.2(a); whereas when α\\alphaα increases (less noise), the transition probability density function will have fewer nonzero elements, as shown in Figure C.2(b).

\n

This leads to the following property: as α\alphaα decreases, p(x)q(x)1\lVert p(x)-q(x) \rVert_1p(x)q(x)1 tends to become smaller relative to p(z)q(z)1\lVert p(z) - q(z) \rVert_1p(z)q(z)1; in other words, the shrinking rate of the posterior transform becomes larger.
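The sketch below illustrates this tendency numerically. It builds the posterior transition matrix column by column from Equation B.1 on a grid and compares the L1 shrinking ratio for a small and a large α; the discretized q(x), the grid, the test distributions, and the two α values are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 301)
q = np.exp(-0.5 * (x - 1.0) ** 2 / 0.4 ** 2) \
    + 0.5 * np.exp(-0.5 * (x + 1.0) ** 2 / 0.6 ** 2)
q /= q.sum()

def posterior_matrix(alpha):
    """T[i, j] = q(x_i | z = x_j), obtained by normalizing GaussFun * q(x) for each column."""
    sigma = np.sqrt((1 - alpha) / alpha)
    mu = x / np.sqrt(alpha)                 # one GaussFun center per value of z
    g = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2 / sigma ** 2)
    T = g * q[:, None]
    return T / T.sum(axis=0, keepdims=True)

p = rng.random(x.size); p /= p.sum()
r = rng.random(x.size); r /= r.sum()

for alpha in (0.3, 0.95):
    T = posterior_matrix(alpha)
    ratio = np.abs(T @ p - T @ r).sum() / np.abs(p - r).sum()
    # Typically: smaller alpha (more noise) -> smaller ratio -> stronger shrinkage.
    print(f"alpha = {alpha}: L1 shrinking ratio = {ratio:.3f}")
```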

\n
\n
Figure C.1: Non-expanding under L1 norm
\n
\n
\n
Figure C.2: More non-zero elements as α\\alphaα gets smaller
","stationary_en":"

According to the conclusion of Theorem 3 in [19], an aperiodic and irreducible Markov chain will converge to a unique stationary distribution.

\n

The following will show that under certain conditions, the posterior transform is the transition probability density function of an aperiodic and irreducible Markov chain.

\n

For convenience, the forward transform of the diffusion model is described below in a more general form.\nZ=αX+β ϵ\\begin{align}\n Z = \\sqrt{\\alpha}X + \\sqrt{\\beta}\\ \\epsilon \\tag{D.1} \\newline\n\\end{align}Z=αX+β ϵ(D.1)

\n

As described in Section 1, αX\sqrt{\alpha}XαX narrows the probability density function of XXX, so α\alphaα controls the narrowing intensity, while β\betaβ controls the amount of noise added (smoothing). When β=1α\beta = 1 - \alphaβ=1α, the above transform is consistent with Equation 1.1.
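For concreteness, the generalized forward step in Equation D.1 can be written as a one-line sampler. The sketch below is only an illustration; the assumed q(x) (a shifted Gaussian), the sample size, and the values of α and β are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x, alpha, beta):
    """One step of Equation D.1: Z = sqrt(alpha) * X + sqrt(beta) * epsilon, epsilon ~ N(0, I)."""
    return np.sqrt(alpha) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

x = rng.standard_normal(10_000) * 0.5 + 1.0   # samples of X from an assumed q(x)
z = forward_step(x, alpha=0.9, beta=0.1)      # beta = 1 - alpha recovers Equation 1.1
print(z.mean(), z.var())                      # roughly sqrt(alpha)*1.0 and alpha*0.25 + beta
```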

\n

The form of the posterior probability distribution corresponding to the new transformation is as follows:\nq(xz=c)=Normalize( 12πσexp(xμ)22σ2GaussFun q(x) )where μ=cασ=βαc is a fixed value\\begin{align}\n q(x|z=c) = \\operatorname{Normalize} \\Big(\\ \\overbrace{\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp{\\frac{-(x-\\mu)^2}{2\\sigma^2}}}^{\\text{GaussFun}}\\ q(x)\\ \\Big) \\tag{D.2} \\newline\n \\text{where}\\ \\mu=\\frac{c}{\\sqrt{\\alpha}}\\qquad \\sigma=\\sqrt{\\frac{\\beta}{\\alpha}} \\qquad \\text{$c$ is a fixed value} \\notag\n\\end{align}q(xz=c)=Normalize( 2πσ1exp2σ2(xμ)2GaussFun q(x) )where μ=αcσ=αβc is a fixed value(D.2)

\n

When β=1α\beta = 1 - \alphaβ=1α, the above posterior distribution is consistent with Equation 3.4.

\n

For convenience, let g(x)g(x)g(x) represent GaussFun in Equation D.2.

\n

Since αX\sqrt{\alpha}XαX narrows the probability density function q(x)q(x)q(x) of XXX, analyzing the aperiodicity and irreducibility of the transition probability density function q(xz)q(x|z)q(xz) becomes more complex. Therefore, for ease of analysis, we first assume α=1\alpha = 1α=1 and later analyze the case where α1\alpha \neq 1α=1 and β=1α\beta = 1 - \alphaβ=1α.

\n
\n
Figure D.1: Only one component in support
\n\n
\n
Figure D.2: Multiple components which can communicate with each other
\n\n
\n

α=1\\alpha=1α=1

\n\n

When α=1\\alpha=1α=1, if q(x)q(x)q(x) and β\\betaβ satisfy either of the following two conditions, the Markov chain corresponding to q(xz)q(x|z)q(xz) is aperiodic and irreducible.

\n
    \n
  1. If the support of q(x)q(x)q(x) contains only one connected component.
  2. If the support of q(x)q(x)q(x) has multiple connected components, but the distance between adjacent connected components is less than 333 times σ\sigmaσ. In other words, the gaps can be covered by the radius of the effective region of g(x)g(x)g(x).
\n\n

Proof:

\n
    \n
  1. \nFor any point ccc in the support of q(x)q(x)q(x), when z=cz=cz=c and x=cx=cx=c, q(x=c)>0q(x=c)>0q(x=c)>0; from Equation D.2, we know that the center of g(x)g(x)g(x) is located at ccc, so g(x)g(x)g(x) is also greater than 0 at x=cx=cx=c. Therefore, according to characteristics of multiplication in the equation D.2, q(x=cz=c)>0q(x=c|z=c)>0q(x=cz=c)>0. Hence, the Markov chain corresponding to q(xz)q(x|z)q(xz) is aperiodic. \n\n

    For any point ccc in the support of q(x)q(x)q(x), when z=cz=cz=c, the center of g(x)g(x)g(x) is located at ccc, so there exists a hypersphere with ccc as its center (xc2<δ\\lVert x-c\\rVert_2 < \\deltaxc2<δ). Within this hypersphere, q(xz=c)>0q(x|z=c)>0q(xz=c)>0, which means that state ccc can access nearby states. Since every state in the support has this property, all states within the entire support form a Communicate Class\\textcolor{red}{\\text{Communicate Class}}Communicate Class [14]. Therefore, the Markov chain corresponding to q(xz)q(x|z)q(xz) is irreducible.

    \n

    Therefore, a Markov chain that satisfies condition 1 is aperiodic and irreducible. See the example in Figure D.1, which illustrates a single connected component.

    \n
  2. \nWhen the support of q(x)q(x)q(x) has multiple connected components, the Markov chain may have multiple communicate classes. However, if the gaps between components are smaller than 3σ3\sigma3σ (the standard deviation of g(x)g(x)g(x)), the states of the different components can access each other. Thus, the Markov chain corresponding to q(xz)q(x|z)q(xz) will have only one communicate class, similar to the case in condition 1. Therefore, a Markov chain that satisfies condition 2 is aperiodic and irreducible.\n\n

    In Figure D.2, an example of multiple connected components is shown.

    \n
\n\n
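The α = 1 case can also be illustrated numerically. The sketch below is only an example under assumed inputs (the discretized q(x), the grid, and the value of β): it builds the transition matrix q(x|z) from Equation D.2 with α = 1, checks that a power of the matrix is strictly positive (so the discretized chain is irreducible and aperiodic), and verifies that iterating the posterior transform from two different initial distributions converges to the same stationary distribution, as Theorem 3 of [19] predicts.

```python
import numpy as np

x = np.linspace(-3, 3, 201)
# A density whose (discretized) support is a single connected component (condition 1).
q = np.exp(-0.5 * x ** 2 / 0.8 ** 2) + 0.4 * np.exp(-0.5 * (x - 1.5) ** 2 / 0.3 ** 2)
q /= q.sum()

beta = 0.1                                   # alpha = 1, so sigma = sqrt(beta)
sigma = np.sqrt(beta)
g = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / sigma ** 2)   # GaussFun centered at z itself
T = g * q[:, None]
T /= T.sum(axis=0, keepdims=True)            # T[i, j] = q(x_i | z = x_j)

# Irreducible + aperiodic: some power of T has all entries > 0.
print("all entries of T^8 positive:", bool(np.all(np.linalg.matrix_power(T, 8) > 0)))

# Iterating the posterior transform from two different starts should converge to one distribution.
p1 = np.ones_like(q) / q.size
p2 = np.zeros_like(q); p2[0] = 1.0
for _ in range(500):
    p1, p2 = T @ p1, T @ p2
print("distance between the two limits:", np.abs(p1 - p2).sum())   # expected to be ~0
```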
\n
Figure D.3: Two components which cannot communicate with each other
\n\n
\n

α1\\alpha \\neq 1α=1

\n\n

When α1\\alpha \\neq 1α=1, for any point ccc within the support of q(x)q(x)q(x), it follows from Equation D.2 that the center of g(x)g(x)g(x) is no longer ccc but rather cα\\frac{c}{\\sqrt{\\alpha}}αc. That is to say, the center of g(x)g(x)g(x) deviates from ccc, with the deviation distance being c(1αα)\\lVert c\\rVert(\\frac{1-\\sqrt{\\alpha}}{\\sqrt{\\alpha}})c(α1α). It can be observed that the larger c\\lVert c\\rVertc is, the greater the deviation. See the examples in Figures D.4(c) and D.4(d) for specifics. In Figure D.4(d), when z=2.0z=2.0z=2.0, the center of g(x)g(x)g(x) is noticeably offset from x=2.0x=2.0x=2.0. This phenomenon is referred to in this article as the Center Deviation Phenomenon.

\n

The Center Deviation Phenomenon will affect the properties of some states in the Markov chain.

\n

When the deviation distance is significantly greater than 3σ3\\sigma3σ, g(x)g(x)g(x) may be zero at x=cx = cx=c and its vicinity. Consequently, q(x=cz=c)q(x=c|z=c)q(x=cz=c) may also be zero, and q(xz=c)q(x|z=c)q(xz=c) in the vicinity of x=cx = cx=c may also be zero. Therefore, state ccc may not be able to access nearby states and may be periodic. This is different from the case when α=1\\alpha=1α=1. Refer to the example in Figure D.5: the green curve\\textcolor{green}{\\text{green curve}}green curve represents g(x)g(x)g(x) for z=6.0z=6.0z=6.0, and the orange curve\\textcolor{orange}{\\text{orange curve}}orange curve represents q(xz=6.0)q(x|z=6.0)q(xz=6.0). Because the center of g(x)g(x)g(x) deviates too much from x=6.0x=6.0x=6.0, q(x=6.0z=6.0)=0q(x=6.0|z=6.0)=0q(x=6.0∣z=6.0)=0.

\n

When the deviation distance is significantly less than 3σ3\\sigma3σ, g(x)g(x)g(x) is non-zero at x=cx = cx=c and its vicinity. Consequently, q(x=cz=c)q(x=c|z=c)q(x=cz=c) will not be zero, and q(xz=c)q(x|z=c)q(xz=c) in the vicinity of x=cx = cx=c will also not be zero. Therefore, state ccc can access nearby states and is aperiodic.

\n

Under what conditions for ccc will the deviation distance of the center of g(x)g(x)g(x) be less than 3σ3\\sigma3σ?

\n

c(1αα) < 3βαc < 3β1α\\begin{align}\n \\lVert c\\rVert(\\frac{1-\\sqrt{\\alpha}}{\\sqrt{\\alpha}})\\ <\\ 3\\frac{\\sqrt{\\beta}}{\\sqrt{\\alpha}} \\qquad \\Rightarrow \\qquad \\lVert c\\rVert \\ <\\ 3\\frac{\\sqrt{\\beta}}{1-\\sqrt{\\alpha}} \\tag{D.3} \\newline\n\\end{align}c(α1α) < 3αβc < 31αβ(D.3)

\n

From the above, it is known that there exists an upper limit such that as long as c\\lVert c\\rVertc is less than this upper limit, the deviation amount will be less than 3σ3\\sigma3σ.

\n

When β=1α\\beta=1-\\alphaβ=1α, the above expression becomes\nc < 31α1α\\begin{align}\n \\lVert c\\rVert \\ <\\ 3\\frac{\\sqrt{1-\\alpha}}{1-\\sqrt{\\alpha}} \\tag{D.4} \\newline\n\\end{align}c < 31α1α(D.4)

\n

31α1α3\frac{\sqrt{1-\alpha}}{1-\sqrt{\alpha}}31α1α is strictly monotonically increasing in α\alphaα: using 1α=(1α)(1+α)1-\alpha=(1-\sqrt{\alpha})(1+\sqrt{\alpha})1α=(1α)(1+α), it can be rewritten as 31+α1α3\sqrt{\frac{1+\sqrt{\alpha}}{1-\sqrt{\alpha}}}31α1+α, which grows with α\alphaα and tends to 333 as α0\alpha \to 0α0.

\n

When α(0,1)\\alpha \\in (0, 1)α(0,1),\n31α1α>3\\begin{align}\n 3\\frac{\\sqrt{1-\\alpha}}{1-\\sqrt{\\alpha}} > 3 \\tag{D.5} \\newline\n\\end{align}31α1α>3(D.5)
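A quick numerical check of the behaviour of this bound (the sample values of α below are arbitrary):

```python
import numpy as np

alpha = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 0.99])
bound = 3 * np.sqrt(1 - alpha) / (1 - np.sqrt(alpha))
# The values increase with alpha and always exceed 3, consistent with
# 3*sqrt(1-alpha)/(1-sqrt(alpha)) = 3*sqrt((1+sqrt(alpha))/(1-sqrt(alpha))).
print(bound)
```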

\n

Based on the analysis above, the following conclusions can be drawn:

\n
    \n
  1. \nIf the support of q(x)q(x)q(x) contains only one connected component, and the points of the support set are all within a distance less than 31α1α3\\frac{\\sqrt{1-\\alpha}}{1-\\sqrt{\\alpha}}31α1α from the origin, then the Markov chain corresponding to q(xz)q(x|z)q(xz) will be aperiodic and irreducible.\n
  2. \nIf the support of q(x)q(x)q(x) contains multiple connected components, accurately determining whether two components can access each other becomes more complex due to the Center Deviation Phenomenon of g(x)g(x)g(x); we do not analyze this in detail here, but give a conservative conclusion: if the points of the support are all within a distance less than 111 from the origin, and the gaps between adjacent connected components are all less than 2σ2\sigma2σ, then the Markov chain corresponding to q(xz)q(x|z)q(xz) will be aperiodic and irreducible.\n
\n\n
\n
Figure D.4: Center Deviation of the GaussFun
\n
\n
\n
Figure D.5: Deviation is More Than 3σ3\\sigma3σ
"}