When 0:00:00 - 0:00:02
I see these scenes: "An image with the slogan 'Regularizing your neural network'"
I found this content: "a black circle"
I detected these tags: "black | logo | circle | spiral | sign | vortex | text"
I recognized this text: "Regularizing your neural network Regularization deeplearning.ai"
I heard someone say: "If you suspect your neural network is overfitting your data,"


When 0:00:02 - 0:00:07
I see these scenes: "An image with the slogan 'Regularizing your neural network'"
I found this content: "a black circle"
I detected these tags: "black | logo | circle | spiral | sign | vortex | text"
I recognized this text: "Regularizing your neural network Regularization deeplearning.ai"
I heard someone say: "If you suspect your neural network is overfitting your data, that is, if you have a high variance problem, one of the first things you should try is probably regularization."


When 0:00:07 - 0:00:10
I see these scenes: "A man in a shirt and tie sitting in front of a computer"
I found this content: "a man with his hands in his pockets, a black-and-white picture frame, the logo is white"
I detected these tags: "circle | computer | computer monitor | computer screen | desk/table | dress shirt | figure | man | office | screen | shirt | sit/place | stand/stall | tie"
I recognized this text: "Regularizing your neural network Regularization deeplearning.ai"
I heard someone say: "One of the first things you should try is probably regularization. The other way to address high variance is to get more training data,"


When 0:00:10 - 0:00:41
I see these scenes: "An image with the slogan 'Regularizing your neural network'"
I found this content: "a black circle"
I detected these tags: "black | logo | circle | spiral | sign | vortex | text"
I recognized this text: "Regularizing your neural network Regularization deeplearning.ai"
I heard someone say: "The other way to address high variance is to get more training data; that's also quite reliable. But we can't always get more training data, and it could be expensive to get more data. Adding regularization, though, will often help to prevent overfitting or to reduce variance in your network. So, let's see how regularization works. Let's develop these ideas using logistic regression. Recall that for logistic regression, you try to minimize the cost function J, which is defined as this cost function: a sum over your training examples of the losses of the individual predictions on the different examples,"


When 0:00:41 - 0:01:12
I see these scenes: "A computer screen showing handwritten notes on logistic regression"
I found this content: "tone"
I detected these tags: "numbers | lines | pointing/direction | slope | solution"
I recognized this text: "Logistic regression  w ∈ R^nx, b ∈ R  min_{w,b} J(w,b)  Andrew Ng"
I heard someone say: "the individual predictions on the different examples, where you recall that w and b in logistic regression are the parameters. So, w is an n_x-dimensional parameter vector and b is a real number. And so, to add regularization to logistic regression, what you do is add to it this thing lambda, which is called the regularization parameter, I'll say more about that in a second, but lambda over 2m times the norm of w squared. So here, the norm of w squared is just equal to the sum from j equals 1 to n_x of w_j squared."

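Putting the pieces of this segment together, the regularized cost the narration describes can be written as follows (a reconstruction from the spoken description, not a copy of the slide):

    J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
              + \frac{\lambda}{2m} \lVert w \rVert_2^2,
    \qquad
    \lVert w \rVert_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^{\top} w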

When 0:01:12 - 0:01:57
I see these scenes: "Handwritten notes with several different calculations"
I found this content: "a wall on the side of a building, a white wall"
I detected these tags: "numbers | lines | pointing/direction | slope | solution"
I recognized this text: "Logistic regression  w ∈ R^nx, b ∈ R  min_{w,b} J(w,b)  λ/2m  Andrew Ng"
I heard someone say: "So here, the norm of w squared is just equal to the sum from j equals 1 to n_x of w_j squared. Or this can also be written w transpose w; it's just the squared Euclidean norm of the parameter vector w. And this is called L2 regularization, because here you're using the Euclidean norm, also called the L2 norm, of the parameter vector w. Now, why do you regularize just the parameter w? Why don't we add something here, you know, about b as well? In practice, you could do this, but I usually just omit it, because if you look at your parameters, w is usually a pretty high dimensional parameter vector,"

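A minimal NumPy sketch of this L2-regularized logistic-regression cost (the function and variable names here are illustrative, not taken from the course materials):

    import numpy as np

    def l2_regularized_cost(w, b, X, Y, lambd):
        """Cross-entropy cost plus the L2 penalty (lambd / 2m) * ||w||_2^2."""
        m = X.shape[1]                                    # number of training examples
        A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))          # sigmoid predictions, shape (1, m)
        cross_entropy = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
        l2_penalty = (lambd / (2.0 * m)) * np.sum(np.square(w))   # equivalently (lambd / 2m) * w.T @ w
        return cross_entropy + l2_penalty

Note that only w is penalized; b is left out, matching the point made in this segment.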

When 0:01:57 - 0:02:07
I see these scenes: "Notes written on a sheet of paper, with the logistic regression equations"
I found this content: "a wall on the side of a building, a white wall"
I detected these tags: "numbers | lines | pointing/direction | slope | solution"
I recognized this text: "Logistic regression  w ∈ R^nx, b ∈ R  min_{w,b} J(w,b)  λ/2m  Andrew Ng"
I heard someone say: "w is usually a pretty high dimensional parameter vector, especially with a high variance problem. Maybe w just has a lot of parameters, so you aren't fitting all the parameters well, whereas b is just a single number. So almost all the parameters are in w rather than b."


When 0:02:07 - 0:02:33
I see these scenes: "A whiteboard with math problems written on it"
I found this content: "the text is black, a green park bench in a park, white walls"
I detected these tags: "numbers | lines | pointing/direction | slope | solution | triangle"
I recognized this text: "Logistic regression  w ∈ R^nx, b ∈ R  min_{w,b} J(w,b)  λ/2m  Andrew Ng"
I heard someone say: "So almost all the parameters are in w rather than b. And if you add this last term, in practice it won't make much of a difference, because b is just one parameter out of a very large number of parameters. In practice I usually just don't bother to include it, but you can if you want. So L2 regularization is the most common type of regularization. You might have also heard some people talk about L1 regularization, and that's when you add, instead of this L2 norm,"


When 0:02:33 - 0:03:52
I see these scenes: "A note written on paper with some math"
I found this content: "white, a green letter k, the text is black, the wall is white"
I detected these tags: "numbers | lines | pointing/direction | slope | triangle"
I recognized this text: "Logistic regression  w ∈ R^nx, b ∈ R  min_{w,b} J(w,b)  regularization  Andrew Ng"
I heard someone say: "And that's when you add, instead of this L2 norm, a term that is lambda over m times the sum of the absolute values of w. And this is also called the L1 norm of the parameter vector w, so a little subscript 1 down there. And I guess whether you put m or 2m in the denominator is just a scaling constant. So that's L1 regularization. If you use L1 regularization, then w will end up being sparse, and what that means is that the w vector will have a lot of zeros in it. Some people say that this can help with compressing the model, because if some parameters are zero, then you need less memory to store the model. Although I find that, in practice, L1 regularization to make your model sparse helps only a little bit, so I don't think it's used that much, at least not for the purpose of compressing your model. When people train networks, L2 regularization is just used much, much more often. Sorry, just fixing up some of the notation here. So one last detail: lambda here is called the regularization parameter, and usually you set this using your development set, or using hold-out cross validation, where you try a variety of values and see what does best"

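As a rough sketch of the two penalty terms contrasted in this segment (illustrative helper names; the narration itself favors the L2 form):

    import numpy as np

    def l2_penalty(w, lambd, m):
        # (lambd / 2m) * sum_j w_j^2 -- shrinks all weights smoothly toward zero
        return (lambd / (2.0 * m)) * np.sum(np.square(w))

    def l1_penalty(w, lambd, m):
        # (lambd / m) * sum_j |w_j| -- tends to drive many weights exactly to zero (sparse w)
        return (lambd / m) * np.sum(np.abs(w))

Whether the denominator is m or 2m is just a scaling constant, as the narration notes.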

When 0:03:52 - 0:04:37
I see these scenes: "A whiteboard with math and writing on it"
I found this content: "a white wall with scribbles on it, a green letter k, the text is black"
I detected these tags: "numbers | lines | plot | pointing/direction | slope | triangle | handwriting"
I recognized this text: "Logistic regression  w ∈ R^nx, b ∈ R  min_{w,b} J(w,b)  lambda  regularization  Andrew Ng"
I heard someone say: "where you try a variety of values and see what does best in terms of trading off between doing well on your training set versus also keeping the L2 norm of your parameters small, which helps prevent overfitting. So lambda is another hyperparameter that you might have to tune. And by the way, for the programming exercises, lambda is a reserved keyword in the Python programming language, so in the programming exercises we'll write lambd, without the a, so as not to clash with the reserved keyword in Python. We'll use lambd to represent the lambda regularization parameter. So this is how you implement L2 regularization for logistic regression. How about a neural network? In a neural network, you have a cost function."

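For reference, lambda really is a reserved keyword in Python (it introduces anonymous functions), which is why the narration suggests spelling the hyperparameter lambd; a tiny illustrative sketch:

    # lambda = 0.7   # SyntaxError: "lambda" is a reserved keyword in Python
    lambd = 0.7      # so the regularization hyperparameter is conventionally spelled "lambd"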

When 0:04:37 - 0:09:42
I see these scenes: "A whiteboard with math and writing on it"
I found this content: "a white wall, black text, text at the front and to the right"
I detected these tags: "numbers | lines | plot | pointing/direction | slope | handwriting"
I recognized this text: "Neural network  J(w^[1], b^[1], ..., w^[L], b^[L])  Frobenius norm  backprop  Andrew Ng"
I heard someone say: "In a neural network, you have a cost function that's a function of all of your parameters, w1, b1 through w capital L, b capital L, where capital L is the number of layers in your neural network. And so the cost function is this: the sum of the losses, summed over your m training examples. And as regularization, you add lambda over 2m of the sum over all of your parameter matrices w of, let's call it, the squared norm, where this norm of the matrix, really the squared norm, is defined as the sum over i, sum over j, of each element of that matrix squared. And if you want the indices of the summation, this is the sum from i equals 1 through n of l minus 1, and the sum from j equals 1 through n of l, because w is an n of l minus 1 by n of l dimensional matrix, where these are the number of hidden units, or number of units, in layer l minus 1 and layer l. So this matrix norm, it turns out, is called the Frobenius norm of a matrix, denoted with an F in the subscript. For arcane linear algebra technical reasons, this is not called the L2 norm of a matrix; instead, it's called the Frobenius norm. I know it sounds like it would be more natural to just call it the L2 norm of the matrix, but for really arcane reasons that you don't need to know, by convention this is called the Frobenius norm. It just means the sum of squares of the elements of a matrix. So how do you implement gradient descent with this? Previously, we would compute dw using backprop, where backprop would give us the partial derivative of J with respect to w, or really w for any given l, and then you update wl as wl minus the learning rate times dwl. So this is before we added this extra regularization term to the objective. Now that we've added this regularization term to the objective, what you do is take dw and add to it lambda over m times wl, and then you just compute this update, same as before. And it turns out that with this new definition of dwl, this new dwl is still a correct definition of the derivative of your cost function with respect to your parameters, now that you've added the extra regularization term at the end. And it's for this reason that L2 regularization is sometimes also called weight decay. So if I take this definition of dwl and just plug it in here, then you see that the update is: wl gets updated as wl minus the learning rate alpha times, you know, the thing from backprop plus lambda over m times wl. So there's the minus sign there. And so this is equal to wl minus alpha lambda over m times wl, minus alpha times the thing you got from backprop. And so this term shows that whatever the matrix wl is, you're going to make it a little bit smaller, right? This is actually as if you're taking the matrix w and multiplying it by one minus alpha lambda over m. You're really taking the matrix w and subtracting alpha lambda over m times w; it's like you're multiplying the matrix w by this number, which is going to be a little bit less than one. So this is why L2 norm regularization is also called weight decay: because it's just like ordinary gradient descent, where you update w by subtracting alpha times the original gradient you got from backprop, but now you're also multiplying w by this thing, which is a little bit less than one. So the alternative name for L2 regularization is weight decay. I'm not really going to use that name,
but the intuition for why it's called weight decay is that this first term here is equal to this: you're just multiplying the weight matrix by a number slightly less than one. So that's how you implement L2 regularization in a neural network. Now, one question that people sometimes ask me is, hey, Andrew, why does regularization prevent overfitting? Let's take a quick look at the next video and gain some intuition for how regularization prevents overfitting."
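A short NumPy sketch of the two pieces described here, the Frobenius-norm penalty and the gradient step with the regularization term folded into dW (the parameters/grads dictionary layout is an assumption for illustration, not the course's exact code):

    import numpy as np

    def frobenius_penalty(parameters, lambd, m, L):
        """(lambd / 2m) * sum over layers of ||W^[l]||_F^2, i.e. the sum of squared entries."""
        return (lambd / (2.0 * m)) * sum(
            np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1)
        )

    def gradient_step_with_l2(parameters, grads, lambd, learning_rate, m, L):
        """One gradient-descent step with the L2 term added to dW (a.k.a. weight decay)."""
        for l in range(1, L + 1):
            W = parameters["W" + str(l)]
            dW = grads["dW" + str(l)] + (lambd / m) * W   # backprop gradient plus (lambd/m) * W
            db = grads["db" + str(l)]                     # b is typically left unregularized
            # Equivalent view: W is scaled by (1 - learning_rate * lambd / m) before the
            # ordinary backprop step is subtracted, hence the name "weight decay".
            parameters["W" + str(l)] = W - learning_rate * dW
            parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * db
        return parameters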