Chapter 7
Regularization for Deep Learning
A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs. Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization.
A great many forms of regularization are available to the deep learning practitioner. In fact, developing more effective regularization strategies has been one of the major research efforts in the field.
Chapter 5 introduced the basic concepts of generalization, underfitting, overfitting, bias, variance and regularization. If you are not already familiar with these notions, please refer to that chapter before continuing with this one.
In this chapter, we describe regularization in more detail, focusing on regularization strategies for deep models or models that may be used as building blocks to form deep models.
Some sections of this chapter deal with standard concepts in machine learning. If you are already familiar with these concepts, feel free to skip the relevant sections. However, most of this chapter is concerned with the extension of these basic concepts to the particular case of neural networks.
In section 5.2.2, we defined regularization as “any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.” There are many regularization strategies.
Some put extra constraints on a machine learning model, such as adding restrictions on the parameter values. Some add extra terms in the objective function that can be thought of as corresponding to a soft constraint on the parameter values. If chosen carefully, these extra constraints and penalties can lead to improved performance on the test set.
Sometimes these constraints and penalties are designed to encode specific kinds of prior knowledge. Other times, these constraints and penalties are designed to express a generic preference for a simpler model class in order to promote generalization. Sometimes penalties and constraints are necessary to make an underdetermined problem determined.
Other forms of regularization, known as ensemble methods, combine multiple hypotheses that explain the training data.
In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
When we discussed generalization and overfitting in chapter 5, we focused on three situations, where the model family being trained either (1) excluded the true data-generating process, corresponding to underfitting and inducing bias; (2) matched the true data-generating process; or (3) included the generating process but also many other possible generating processes, the overfitting regime where variance rather than bias dominates the estimation error. The goal of regularization is to take a model from the third regime into the second regime.
In practice, an overly complex model family does not necessarily include the target function or the true data-generating process, or even a close approximation of either. We almost never have access to the true data-generating process, so we can never know for sure if the model family being estimated includes the generating process or not.
Most applications of deep learning algorithms, however, are to domains where the true data-generating process is almost certainly outside the model family. Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe.
To some extent, we are always trying to fit a square peg (the data-generating process) into a round hole (our model family).
What this means is that controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters. Instead, we might find (and indeed in practical deep learning scenarios, we almost always do find) that the best fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately. We now review several strategies for how to create such a large, deep regularized model.
7.1 Parameter Norm Penalties
Regularization has been used for decades prior to the advent of deep learning. Linear models such as linear regression and logistic regression allow simple, straightforward, and effective regularization strategies.
Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty $\Omega(\boldsymbol{\theta})$ to the objective function $J$. We denote the regularized objective function by $\tilde{J}$:

$$\tilde{J}(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) = J(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) + \alpha \Omega(\boldsymbol{\theta}), \tag{7.1}$$

where $\alpha \in [0, \infty)$ is a hyperparameter that weights the relative contribution of the norm penalty term, $\Omega$, relative to the standard objective function $J$. Setting $\alpha$ to 0 results in no regularization. Larger values of $\alpha$ correspond to more regularization.
When our training algorithm minimizes the regularized objective function $\tilde{J}$, it will decrease both the original objective $J$ on the training data and some measure of the size of the parameters $\boldsymbol{\theta}$ (or some subset of the parameters). Different choices for the parameter norm $\Omega$ can result in different solutions being preferred. In this section, we discuss the effects of the various norms when used as penalties on the model parameters.
Before delving into the regularization behavior of different norms, we note that for neural networks, we typically choose to use a parameter norm penalty $\Omega$ that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. The biases typically require less data than the weights to fit accurately. Each weight specifies how two variables interact. Fitting the weight well requires observing both variables in a variety of conditions. Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized. Also, regularizing the bias parameters can introduce a significant amount of underfitting.

We therefore use the vector $\boldsymbol{w}$ to indicate all the weights that should be affected by a norm penalty, while the vector $\boldsymbol{\theta}$ denotes all the parameters, including both $\boldsymbol{w}$ and the unregularized parameters.
In the context of neural networks, it is sometimes desirable to use a separate penalty with a different $\alpha$ coefficient for each layer of the network. Because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers just to reduce the size of the search space.
7.1.1 $L^2$ Parameter Regularization
We have already seen, in section 5.2.2, one of the simplest and most common kinds of parameter norm penalty: the $L^2$ parameter norm penalty commonly known as weight decay. This regularization strategy drives the weights closer to the origin by adding a regularization term $\Omega(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{w}\|_2^2$ to the objective function. In other academic communities, $L^2$ regularization is also known as ridge regression or Tikhonov regularization.
We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so $\boldsymbol{\theta}$ is just $\boldsymbol{w}$. Such a model has the following total objective function:

$$\tilde{J}(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}) = \frac{\alpha}{2}\boldsymbol{w}^\top\boldsymbol{w} + J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}), \tag{7.2}$$

with the corresponding parameter gradient

$$\nabla_{\boldsymbol{w}}\tilde{J}(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}) = \alpha\boldsymbol{w} + \nabla_{\boldsymbol{w}}J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}). \tag{7.3}$$

To take a single gradient step to update the weights, we perform this update:

$$\boldsymbol{w} \leftarrow \boldsymbol{w} - \epsilon\left(\alpha\boldsymbol{w} + \nabla_{\boldsymbol{w}}J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y})\right). \tag{7.4}$$

Written another way, the update is

$$\boldsymbol{w} \leftarrow (1 - \epsilon\alpha)\boldsymbol{w} - \epsilon\nabla_{\boldsymbol{w}}J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}). \tag{7.5}$$

We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. This describes what happens in a single step. But what happens over the entire course of training?
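As a quick numerical check, the two forms of the update in equations 7.4 and 7.5 can be compared directly. This is a minimal sketch; the quadratic objective and all the values below are assumed stand-ins, not anything prescribed by the text:

```python
import numpy as np

# Stand-in unregularized objective J(w) = 0.5 * ||A w - b||^2 and its gradient.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
grad_J = lambda w: A.T @ (A @ w - b)

eps, alpha = 0.01, 0.1          # learning rate and weight decay coefficient
w = rng.standard_normal(3)

# Equation 7.4: step on the regularized gradient alpha*w + grad_J(w).
w_eq74 = w - eps * (alpha * w + grad_J(w))

# Equation 7.5: shrink w by (1 - eps*alpha), then take the usual gradient step.
w_eq75 = (1 - eps * alpha) * w - eps * grad_J(w)

same = np.allclose(w_eq74, w_eq75)   # the two forms are identical
```

The multiplicative shrinkage factor $(1 - \epsilon\alpha)$ is visible in the second form: it is applied before the ordinary gradient step on $J$.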
We will further simplify the analysis by making a quadratic approximation to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, $\boldsymbol{w}^* = \arg\min_{\boldsymbol{w}} J(\boldsymbol{w})$. If the objective function is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect. The approximation $\hat{J}$ is given by

$$\hat{J}(\boldsymbol{\theta}) = J(\boldsymbol{w}^*) + \frac{1}{2}(\boldsymbol{w} - \boldsymbol{w}^*)^\top \boldsymbol{H}(\boldsymbol{w} - \boldsymbol{w}^*), \tag{7.6}$$

where $\boldsymbol{H}$ is the Hessian matrix of $J$ with respect to $\boldsymbol{w}$ evaluated at $\boldsymbol{w}^*$.

(More generally, we could regularize the parameters to be near any specific point in space and, surprisingly, still get a regularization effect, but better results will be obtained for a value closer to the true one, with zero being a default value that makes sense when we do not know if the correct value should be positive or negative. Since it is far more common to regularize the model parameters toward zero, we will focus on this special case in our exposition.)
There is no first-order term in this quadratic approximation, because $\boldsymbol{w}^*$ is defined to be a minimum, where the gradient vanishes. Likewise, because $\boldsymbol{w}^*$ is the location of a minimum of $J$, we can conclude that $\boldsymbol{H}$ is positive semidefinite. The minimum of $\hat{J}$ occurs where its gradient

$$\nabla_{\boldsymbol{w}}\hat{J}(\boldsymbol{w}) = \boldsymbol{H}(\boldsymbol{w} - \boldsymbol{w}^*) \tag{7.7}$$

is equal to $0$.
To study the effect of weight decay, we modify equation 7.7 by adding the weight decay gradient. We can now solve for the minimum of the regularized version of $\hat{J}$. We use the variable $\tilde{\boldsymbol{w}}$ to represent the location of the minimum:

$$\alpha\tilde{\boldsymbol{w}} + \boldsymbol{H}(\tilde{\boldsymbol{w}} - \boldsymbol{w}^*) = 0, \tag{7.8}$$
$$(\boldsymbol{H} + \alpha\boldsymbol{I})\tilde{\boldsymbol{w}} = \boldsymbol{H}\boldsymbol{w}^*, \tag{7.9}$$
$$\tilde{\boldsymbol{w}} = (\boldsymbol{H} + \alpha\boldsymbol{I})^{-1}\boldsymbol{H}\boldsymbol{w}^*. \tag{7.10}$$

As $\alpha$ approaches 0, the regularized solution $\tilde{\boldsymbol{w}}$ approaches $\boldsymbol{w}^*$. But what happens as $\alpha$ grows?
Because $\boldsymbol{H}$ is real and symmetric, we can decompose it into a diagonal matrix $\boldsymbol{\Lambda}$ and an orthonormal basis of eigenvectors, $\boldsymbol{Q}$, such that $\boldsymbol{H} = \boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^\top$. Applying the decomposition to equation 7.10, we obtain:

$$\tilde{\boldsymbol{w}} = (\boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^\top + \alpha\boldsymbol{I})^{-1}\boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^\top\boldsymbol{w}^* \tag{7.11}$$
$$= \left[\boldsymbol{Q}(\boldsymbol{\Lambda} + \alpha\boldsymbol{I})\boldsymbol{Q}^\top\right]^{-1}\boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^\top\boldsymbol{w}^* \tag{7.12}$$
$$= \boldsymbol{Q}(\boldsymbol{\Lambda} + \alpha\boldsymbol{I})^{-1}\boldsymbol{\Lambda}\boldsymbol{Q}^\top\boldsymbol{w}^*. \tag{7.13}$$

We see that the effect of weight decay is to rescale $\boldsymbol{w}^*$ along the axes defined by the eigenvectors of $\boldsymbol{H}$. Specifically, the component of $\boldsymbol{w}^*$ that is aligned with the $i$-th eigenvector of $\boldsymbol{H}$ is rescaled by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$. (You may wish to review how this kind of scaling works, first explained in figure 2.3.)

Along the directions where the eigenvalues of $\boldsymbol{H}$ are relatively large, for example, where $\lambda_i \gg \alpha$, the effect of regularization is relatively small. Yet components with $\lambda_i \ll \alpha$ will be shrunk to have nearly zero magnitude. This effect is illustrated in figure 7.1.
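The rescaling in equation 7.13 can be checked numerically against equation 7.10. The sketch below uses an assumed random positive definite matrix as the Hessian and assumed toy values for $\boldsymbol{w}^*$ and $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
H = M @ M.T + np.eye(4)          # symmetric positive definite Hessian
w_star = rng.standard_normal(4)  # unregularized minimizer
alpha = 0.5

# Equation 7.10: regularized minimizer.
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Equation 7.13: same point via the eigendecomposition H = Q diag(lam) Q^T.
lam, Q = np.linalg.eigh(H)
w_eig = Q @ (np.diag(lam / (lam + alpha)) @ (Q.T @ w_star))
agree = np.allclose(w_tilde, w_eig)

# In the eigenbasis, each component of w* is rescaled by lam_i / (lam_i + alpha).
coords_ratio = (Q.T @ w_tilde) / (Q.T @ w_star)
ratio_ok = np.allclose(coords_ratio, lam / (lam + alpha))
```

Directions with large $\lambda_i$ keep their coordinates nearly intact, while directions with $\lambda_i \ll \alpha$ are shrunk toward zero, exactly as the analysis predicts.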
Figure 7.1: An illustration of the effect of $L^2$ (or weight decay) regularization on the value of the optimal $\boldsymbol{w}$. The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the $L^2$ regularizer. At the point $\tilde{\boldsymbol{w}}$, these competing objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of $J$ is small. The objective function does not increase much when moving horizontally away from $\boldsymbol{w}^*$. Because the objective function does not express a strong preference along this direction, the regularizer has a strong effect on this axis. The regularizer pulls $w_1$ close to zero. In the second dimension, the objective function is very sensitive to movements away from $\boldsymbol{w}^*$. The corresponding eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of $w_2$ relatively little.
Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact. In directions that do not contribute to reducing the objective function, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient. Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training.
So far we have discussed weight decay in terms of its effect on the optimization of an abstract, general quadratic cost function. How do these effects relate to machine learning in particular? We can find out by studying linear regression, a model for which the true cost function is quadratic and therefore amenable to the same kind of analysis we have used so far. Applying the analysis again, we will be able to obtain a special case of the same results, but with the solution now phrased in terms of the training data.
For linear regression, the cost function is the sum of squared errors:

$$(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}). \tag{7.14}$$

When we add $L^2$ regularization, the objective function changes to

$$(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}) + \frac{1}{2}\alpha\boldsymbol{w}^\top\boldsymbol{w}. \tag{7.15}$$

This changes the normal equations for the solution from

$$\boldsymbol{w} = (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{y} \tag{7.16}$$

to

$$\boldsymbol{w} = (\boldsymbol{X}^\top\boldsymbol{X} + \alpha\boldsymbol{I})^{-1}\boldsymbol{X}^\top\boldsymbol{y}. \tag{7.17}$$
The matrix $\boldsymbol{X}^\top\boldsymbol{X}$ in equation 7.16 is proportional to the covariance matrix $\frac{1}{m}\boldsymbol{X}^\top\boldsymbol{X}$. Using $L^2$ regularization replaces this matrix with $(\boldsymbol{X}^\top\boldsymbol{X} + \alpha\boldsymbol{I})$ in equation 7.17. The new matrix is the same as the original one, but with the addition of $\alpha$ to the diagonal. The diagonal entries of this matrix correspond to the variance of each input feature. We can see that $L^2$ regularization causes the learning algorithm to “perceive” the input $\boldsymbol{X}$ as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
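A minimal sketch of the two normal equations (7.16 and 7.17) on assumed synthetic data, illustrating that adding $\alpha$ to the diagonal shrinks the solution:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(50)
alpha = 1.0
I = np.eye(5)

w_ols   = np.linalg.solve(X.T @ X, X.T @ y)              # equation 7.16
w_ridge = np.linalg.solve(X.T @ X + alpha * I, X.T @ y)  # equation 7.17

# The regularized solution has a smaller norm than the unregularized one.
shrinks = np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```

The true coefficient vector and noise level here are arbitrary assumptions; the qualitative shrinkage holds for any nondegenerate data.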
7.1.2 $L^1$ Regularization
While $L^2$ weight decay is the most common form of weight decay, there are other ways to penalize the size of the model parameters. Another option is to use $L^1$ regularization.
Formally, $L^1$ regularization on the model parameter $\boldsymbol{w}$ is defined as

$$\Omega(\boldsymbol{w}) = \|\boldsymbol{w}\|_1 = \sum_i |w_i|, \tag{7.18}$$

that is, as the sum of absolute values of the individual parameters.
We will now discuss the effect of $L^1$ regularization on the simple linear regression model, with no bias parameter, that we studied in our analysis of $L^2$ regularization. In particular, we are interested in delineating the differences between $L^1$ and $L^2$ forms of regularization.
As with $L^2$ weight decay, $L^1$ weight decay controls the strength of the regularization by scaling the penalty $\Omega$ using a positive hyperparameter $\alpha$. Thus, the regularized objective function $\tilde{J}(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y})$ is given by

$$\tilde{J}(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}) = \alpha\|\boldsymbol{w}\|_1 + J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}), \tag{7.19}$$

with the corresponding gradient (actually, subgradient)

$$\nabla_{\boldsymbol{w}}\tilde{J}(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}) = \alpha\,\mathrm{sign}(\boldsymbol{w}) + \nabla_{\boldsymbol{w}}J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}), \tag{7.20}$$

where $\mathrm{sign}(\boldsymbol{w})$ is simply the sign of $\boldsymbol{w}$ applied element-wise.
By inspecting equation 7.20, we can see immediately that the effect of $L^1$ regularization is quite different from that of $L^2$ regularization. Specifically, we can see that the regularization contribution to the gradient no longer scales linearly with each $w_i$; instead it is a constant factor with a sign equal to $\mathrm{sign}(w_i)$. One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of $J(\boldsymbol{X}, \boldsymbol{y}; \boldsymbol{w})$ as we did for $L^2$ regularization.
Our simple linear model has a quadratic cost function that we can represent via its Taylor series. Alternately, we could imagine that this is a truncated Taylor series approximating the cost function of a more sophisticated model. The gradient in this setting is given by

$$\nabla_{\boldsymbol{w}}\hat{J}(\boldsymbol{w}) = \boldsymbol{H}(\boldsymbol{w} - \boldsymbol{w}^*), \tag{7.21}$$

where, again, $\boldsymbol{H}$ is the Hessian matrix of $J$ with respect to $\boldsymbol{w}$ evaluated at $\boldsymbol{w}^*$.
Because the $L^1$ penalty does not admit clean algebraic expressions in the case of a fully general Hessian, we will also make the further simplifying assumption that the Hessian is diagonal, $\boldsymbol{H} = \mathrm{diag}([H_{1,1}, \dots, H_{n,n}])$, where each $H_{i,i} > 0$. This assumption holds if the data for the linear regression problem has been preprocessed to remove all correlation between the input features, which may be accomplished using PCA.

(As with $L^2$ regularization, we could regularize the parameters toward a value that is not zero, but instead toward some parameter value $\boldsymbol{w}^{(o)}$. In that case, the $L^1$ regularization would introduce the term $\Omega(\boldsymbol{w}) = \|\boldsymbol{w} - \boldsymbol{w}^{(o)}\|_1$.)

Our quadratic approximation of the $L^1$ regularized objective function decomposes into a sum over the parameters:

$$\hat{J}(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y}) = J(\boldsymbol{w}^*; \boldsymbol{X}, \boldsymbol{y}) + \sum_i \left[\frac{1}{2}H_{i,i}(w_i - w_i^*)^2 + \alpha|w_i|\right]. \tag{7.22}$$
The problem of minimizing this approximate cost function has an analytical solution (for each dimension $i$), with the following form:

$$w_i = \mathrm{sign}(w_i^*)\max\left\{|w_i^*| - \frac{\alpha}{H_{i,i}},\; 0\right\}. \tag{7.23}$$

Consider the situation where $w_i^* > 0$ for all $i$. There are two possible outcomes:
1. The case where $w_i^* \leq \frac{\alpha}{H_{i,i}}$. Here the optimal value of $w_i$ under the regularized objective is simply $w_i = 0$. This occurs because the contribution of $J(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y})$ to the regularized objective $\tilde{J}(\boldsymbol{w}; \boldsymbol{X}, \boldsymbol{y})$ is overwhelmed, in direction $i$, by the $L^1$ regularization, which pushes the value of $w_i$ to zero.

2. The case where $w_i^* > \frac{\alpha}{H_{i,i}}$. In this case, the regularization does not move the optimal value of $w_i$ to zero but instead just shifts it in that direction by a distance equal to $\frac{\alpha}{H_{i,i}}$.

A similar process happens when $w_i^* < 0$, but with the $L^1$ penalty making $w_i$ less negative by $\frac{\alpha}{H_{i,i}}$, or 0.
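Equation 7.23 is a soft-thresholding operation, and both outcomes are easy to see in code. The values of $\boldsymbol{w}^*$, the diagonal Hessian, and $\alpha$ below are assumed toy inputs:

```python
import numpy as np

def l1_solution(w_star, H_diag, alpha):
    """Equation 7.23: per-dimension minimizer under a diagonal Hessian."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([0.05, 0.5, -0.3, -0.02])
H_diag = np.array([1.0, 1.0, 1.0, 1.0])
w = l1_solution(w_star, H_diag, alpha=0.1)
# Components with |w_i*| <= alpha/H_ii are clipped to exactly zero;
# the rest are shifted toward zero by alpha/H_ii:
# w == [0.0, 0.4, -0.2, 0.0]
```

The first and last components fall inside the threshold and become exactly zero, which is the sparsity discussed next.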
In comparison to $L^2$ regularization, $L^1$ regularization results in a solution that is more sparse. Sparsity in this context refers to the fact that some parameters have an optimal value of zero.
The sparsity of $L^1$ regularization is a qualitatively different behavior than arises with $L^2$ regularization. Equation 7.13 gave the solution $\tilde{\boldsymbol{w}}$ for $L^2$ regularization. If we revisit that equation using the assumption of a diagonal and positive definite Hessian $\boldsymbol{H}$ that we introduced for our analysis of $L^1$ regularization, we find that $\tilde{w}_i = \frac{H_{i,i}}{H_{i,i} + \alpha}w_i^*$. If $w_i^*$ was nonzero, then $\tilde{w}_i$ remains nonzero. This demonstrates that $L^2$ regularization does not cause the parameters to become sparse, while $L^1$ regularization may do so for large enough $\alpha$.
The sparsity property induced by $L^1$ regularization has been used extensively as a feature selection mechanism. Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used. In particular, the well-known LASSO (Tibshirani, 1995) (least absolute shrinkage and selection operator) model integrates an $L^1$ penalty with a linear model and a least-squares cost function. The $L^1$ penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
In section 5.6.1, we saw that many regularization strategies can be interpreted as MAP Bayesian inference, and that in particular, $L^2$ regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights. For $L^1$ regularization, the penalty $\alpha\Omega(\boldsymbol{w}) = \alpha\sum_i |w_i|$ used to regularize a cost function is equivalent to the log-prior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution (equation 3.26) over $\boldsymbol{w} \in \mathbb{R}^n$:

$$\log p(\boldsymbol{w}) = \sum_i \log \mathrm{Laplace}\!\left(w_i; 0, \tfrac{1}{\alpha}\right) = -\alpha\|\boldsymbol{w}\|_1 + n\log\alpha - n\log 2. \tag{7.24}$$

From the point of view of learning via maximization with respect to $\boldsymbol{w}$, we can ignore the $n\log\alpha - n\log 2$ terms because they do not depend on $\boldsymbol{w}$.
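Equation 7.24 can be verified numerically by evaluating the Laplace log-density, $\log\left[\frac{\alpha}{2}e^{-\alpha|w_i|}\right]$ per component, directly. The values of $\alpha$ and $\boldsymbol{w}$ below are assumptions for illustration:

```python
import numpy as np

alpha, n = 0.7, 4
w = np.array([0.5, -1.2, 0.0, 2.0])

# Isotropic Laplace log-prior: density (alpha/2) * exp(-alpha * |w_i|) per component.
log_prior = np.sum(np.log(alpha / 2.0) - alpha * np.abs(w))

# Right-hand side of equation 7.24.
rhs = -alpha * np.sum(np.abs(w)) + n * np.log(alpha) - n * np.log(2.0)

match = np.allclose(log_prior, rhs)
```

Only the $-\alpha\|\boldsymbol{w}\|_1$ term varies with $\boldsymbol{w}$, which is why maximizing this log-prior matches minimizing the $L^1$ penalty.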
7.2 Norm Penalties as Constrained Optimization
Consider the cost function regularized by a parameter norm penalty:

$$\tilde{J}(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) = J(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) + \alpha\Omega(\boldsymbol{\theta}). \tag{7.25}$$
Recall from section 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied.
If we wanted to constrain $\Omega(\boldsymbol{\theta})$ to be less than some constant $k$, we could construct a generalized Lagrange function

$$\mathcal{L}(\boldsymbol{\theta}, \alpha; \boldsymbol{X}, \boldsymbol{y}) = J(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) + \alpha\left(\Omega(\boldsymbol{\theta}) - k\right). \tag{7.26}$$

The solution to the constrained problem is given by

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \max_{\alpha,\,\alpha \geq 0} \mathcal{L}(\boldsymbol{\theta}, \alpha). \tag{7.27}$$
As described in section 4.4, solving this problem requires modifying both $\boldsymbol{\theta}$ and $\alpha$. Section 4.5 provides a worked example of linear regression with an $L^2$ constraint. Many different procedures are possible (some may use gradient descent, while others may use analytical solutions for where the gradient is zero), but in all procedures $\alpha$ must increase whenever $\Omega(\boldsymbol{\theta}) > k$ and decrease whenever $\Omega(\boldsymbol{\theta}) < k$. All positive $\alpha$ encourage $\Omega(\boldsymbol{\theta})$ to shrink. The optimal value $\alpha^*$ will encourage $\Omega(\boldsymbol{\theta})$ to shrink, but not so strongly as to make $\Omega(\boldsymbol{\theta})$ become less than $k$.
To gain some insight into the effect of the constraint, we can fix $\alpha^*$ and view the problem as just a function of $\boldsymbol{\theta}$:

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}, \alpha^*) = \arg\min_{\boldsymbol{\theta}}\; J(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) + \alpha^*\Omega(\boldsymbol{\theta}). \tag{7.28}$$
This is exactly the same as the regularized training problem of minimizing $\tilde{J}$. We can thus think of a parameter norm penalty as imposing a constraint on the weights. If $\Omega$ is the $L^2$ norm, then the weights are constrained to lie in an $L^2$ ball. If $\Omega$ is the $L^1$ norm, then the weights are constrained to lie in a region of limited $L^1$ norm.
Usually we do not know the size of the constraint region that we impose by using weight decay with coefficient $\alpha^*$, because the value of $\alpha^*$ does not directly tell us the value of $k$. In principle, one can solve for $k$, but the relationship between $k$ and $\alpha^*$ depends on the form of $J$.
While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing $\alpha$ in order to grow or shrink the constraint region. Larger $\alpha$ will result in a smaller constraint region. Smaller $\alpha$ will result in a larger constraint region.
Sometimes we may wish to use explicit constraints rather than penalties. As described in section 4.4, we can modify algorithms such as stochastic gradient descent to take a step downhill on $J(\boldsymbol{\theta})$ and then project $\boldsymbol{\theta}$ back to the nearest point that satisfies $\Omega(\boldsymbol{\theta}) < k$. This can be useful if we have an idea of what value of $k$ is appropriate and do not want to spend time searching for the value of $\alpha$ that corresponds to this $k$.
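One descend-then-project step of this kind can be sketched in a few lines. The toy objective, its minimizer, and the radius $k$ below are all assumptions for illustration, using the $L^2$ norm ball as $\Omega$:

```python
import numpy as np

def project_l2_ball(theta, k):
    """Reproject theta to the nearest point with ||theta||_2 <= k."""
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

# One projected-SGD step on an assumed toy objective J(theta) = 0.5*||theta - c||^2.
c = np.array([3.0, 4.0])
theta = np.zeros(2)
eps, k = 1.0, 1.0
theta = theta - eps * (theta - c)      # gradient step lands exactly at c = (3, 4)
theta = project_l2_ball(theta, k)      # reproject onto the unit L2 ball
# theta is now c / ||c|| = (0.6, 0.8)
```

Unlike a penalty, the projection does nothing at all while the iterate stays inside the constraint region; it acts only when a step would leave it.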
Another reason to use explicit constraints and reprojection rather than enforcing constraints with penalties is that penalties can cause nonconvex optimization procedures to get stuck in local minima corresponding to small $\boldsymbol{\theta}$. When training neural networks, this usually manifests as neural networks that train with several “dead units.” These are units that do not contribute much to the behavior of the function learned by the network because the weights going into or out of them are all very small. When training with a penalty on the norm of the weights, these configurations can be locally optimal, even if it is possible to significantly reduce $J$ by making the weights larger.
Explicit constraints implemented by reprojection can work much better in these cases because they do not encourage the weights to approach the origin. Explicit constraints implemented by reprojection have an effect only when the weights become large and attempt to leave the constraint region.
Finally, explicit constraints with reprojection can be useful because they impose some stability on the optimization procedure. When using high learning rates, it is possible to encounter a positive feedback loop in which large weights induce large gradients, which then induce a large update to the weights. If these updates consistently increase the size of the weights, then $\boldsymbol{\theta}$ rapidly moves away from the origin until numerical overflow occurs. Explicit constraints with reprojection prevent this feedback loop from continuing to increase the magnitude of the weights without bound.
Hinton et al. (2012c) recommend using constraints combined with a high learning rate to enable rapid exploration of parameter space while maintaining some stability. In particular, Hinton et al. (2012c) recommend a strategy introduced by Srebro and Shraibman (2005): constraining the norm of each column of the weight matrix of a neural net layer, rather than constraining the Frobenius norm of the entire weight matrix.
Constraining the norm of each column separately prevents any one hidden unit from having very large weights. If we converted this constraint into a penalty in a Lagrange function, it would be similar to $L^2$ weight decay but with a separate KKT multiplier for the weights of each hidden unit. Each of these KKT multipliers would be dynamically updated separately to make each hidden unit obey the constraint. In practice, column norm limitation is always implemented as an explicit constraint with reprojection.
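A per-column reprojection of this kind can be sketched as follows; the radius $k$, matrix shape, and weight values are assumptions for illustration:

```python
import numpy as np

def max_norm_columns(W, k):
    """Reproject each column of W onto the L2 ball of radius k."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, k / np.maximum(norms, 1e-12))
    return W * scale   # columns already within the limit are left untouched

rng = np.random.default_rng(3)
W = rng.standard_normal((8, 4)) * 5.0   # some columns exceed the limit
W_proj = max_norm_columns(W, k=1.0)
col_norms = np.linalg.norm(W_proj, axis=0)
within = np.all(col_norms <= 1.0 + 1e-9)
```

In a training loop, this projection would be applied to each layer's weight matrix immediately after every gradient update.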
7.3 Regularization and Under-Constrained Problems
In some cases, regularization is necessary for machine learning problems to be properly defined. Many linear models in machine learning, including linear regression and PCA, depend on inverting the matrix $\boldsymbol{X}^\top\boldsymbol{X}$. This is not possible when $\boldsymbol{X}^\top\boldsymbol{X}$ is singular. This matrix can be singular whenever the data-generating distribution truly has no variance in some direction, or when no variance is observed in some direction because there are fewer examples (rows of $\boldsymbol{X}$) than input features (columns of $\boldsymbol{X}$). In this case, many forms of regularization correspond to inverting $\boldsymbol{X}^\top\boldsymbol{X} + \alpha\boldsymbol{I}$ instead. This regularized matrix is guaranteed to be invertible. These linear problems have closed form solutions when the relevant matrix is invertible.
It is also possible for a problem with no closed form solution to be underdetermined. An example is logistic regression applied to a problem where the classes are linearly separable. If a weight vector $\boldsymbol{w}$ is able to achieve perfect classification, then $2\boldsymbol{w}$ will also achieve perfect classification and higher likelihood.
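This growth in likelihood is easy to verify on a separable toy dataset; the data and the perfectly classifying weight vector below are assumptions for illustration:

```python
import numpy as np

# Linearly separable toy data: labels match the sign of the first feature.
X = np.array([[ 1.0, 0.2], [ 2.0, -0.5], [-1.0, 0.3], [-2.0, 0.1]])
y = np.array([1.0, 1.0, 0.0, 0.0])

def log_likelihood(w):
    p = 1.0 / (1.0 + np.exp(-X @ w))          # logistic regression probabilities
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.array([1.0, 0.0])                       # achieves perfect classification
better = log_likelihood(2 * w) > log_likelihood(w)   # scaling w up is strictly better
```

Because doubling $\boldsymbol{w}$ always helps, the likelihood has no finite maximizer on separable data, which is exactly why an iterative optimizer keeps growing the weights.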
An iterative optimization procedure like stochastic gradient descent will continually increase the magnitude of $\boldsymbol{w}$ and, in theory, will never halt. In practice, a numerical implementation of gradient descent will eventually reach sufficiently large weights to cause numerical overflow, at which point its behavior will depend on how the programmer has decided to handle values that are not real numbers.
Most forms of regularization are able to guarantee the convergence of iterative methods applied to underdetermined problems. For example, weight decay will cause gradient descent to quit increasing the magnitude of the weights when the slope of the likelihood is equal to the weight decay coefficient.
The idea of using regularization to solve underdetermined problems extends beyond machine learning. The same idea is useful for several basic linear algebra problems. As we saw in section 2.9, we can solve underdetermined linear equations using the Moore-Penrose pseudoinverse. Recall that one definition of the pseudoinverse $\boldsymbol{X}^+$ of a matrix $\boldsymbol{X}$ is

$$\boldsymbol{X}^+ = \lim_{\alpha \searrow 0}\left(\boldsymbol{X}^\top\boldsymbol{X} + \alpha\boldsymbol{I}\right)^{-1}\boldsymbol{X}^\top. \tag{7.29}$$

We can now recognize equation 7.29 as performing linear regression with weight decay. Specifically, equation 7.29 is the limit of equation 7.17 as the regularization coefficient $\alpha$ shrinks to zero. We can thus interpret the pseudoinverse as stabilizing underdetermined problems using regularization.
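This limit is easy to observe numerically. The sketch below compares the pseudoinverse solution with equation 7.17 at an assumed small value of $\alpha$, on an assumed underdetermined system:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 5))   # underdetermined: fewer examples than features
y = rng.standard_normal(3)

w_pinv = np.linalg.pinv(X) @ y    # Moore-Penrose solution

alpha = 1e-8                       # equation 7.17 with a vanishing coefficient
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)

close = np.allclose(w_pinv, w_ridge, atol=1e-4)
```

Here $\boldsymbol{X}^\top\boldsymbol{X}$ is singular (rank 3 in a 5-dimensional space), so the unregularized normal equations cannot be solved at all, while the regularized version approaches the pseudoinverse solution as $\alpha \to 0$.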
7.4 Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more data. Of course, in practice, the amount of data we have is limited. One way to get around this problem is to create fake data and add it to the training set.
For some machine learning tasks, it is reasonably straightforward to create new fake data. This approach is easiest for classification. A classifier needs to take a complicated, high-dimensional input $\boldsymbol{x}$ and summarize it with a single category identity $y$. This means that the main task facing a classifier is to be invariant to a wide variety of transformations. We can generate new $(\boldsymbol{x}, y)$ pairs easily just by transforming the $\boldsymbol{x}$ inputs in our training set.

This approach is not as readily applicable to many other tasks. For example, it is difficult to generate new fake data for a density estimation task unless we have already solved the density estimation problem.
Dataset augmentation has been a particularly effective technique for a specific classification problem: object recognition. Images are high dimensional and include an enormous range of factors of variation, many of which can be easily simulated. Operations like translating the training images a few pixels in each direction can often greatly improve generalization, even if the model has already been designed to be partially translation invariant by using the convolution and pooling techniques described in chapter 9. Many other operations, such as rotating the image or scaling the image, have also proved quite effective.
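A minimal sketch of augmentation by translation, using an assumed toy 4x4 image and one-pixel shifts in each direction:

```python
import numpy as np

def translate(image, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, zero-padding the exposed border."""
    out = np.zeros_like(image)
    h, w = image.shape
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return out

image = np.arange(16.0).reshape(4, 4)
# One training image becomes nine: every one-pixel shift, plus the original.
augmented = [translate(image, dy, dx)
             for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
```

Each shifted copy keeps the same class label, so the classifier sees the same object at several positions.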
One must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between “b” and “d” and the difference between “6” and “9,” so horizontal flips and 180° rotations are not appropriate ways of augmenting datasets for these tasks.
There are also transformations that we would like our classifiers to be invariant to but that are not easy to perform. For example, out-of-plane rotation cannot be implemented as a simple geometric operation on the input pixels.
Dataset augmentation is effective for speech recognition tasks as well (Jaitly and Hinton, 2013). Injecting noise in the input to a neural network (Sietsma and Dow, 1991) can also be seen as a form of data augmentation.
For many classification and even some regression tasks, the task should still be possible to solve even if small random noise is added to the input. Neural networks prove not to be very robust to noise, however (Tang and Eliasmith, 2010). One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs.
Input noise injection is part of some unsupervised learning algorithms, such as the denoising autoencoder (Vincent et al., 2008). Noise injection also works when the noise is applied to the hidden units, which can be seen as doing dataset augmentation at multiple levels of abstraction. Poole et al. (2014) recently showed that this approach can be highly effective provided that the magnitude of the noise is carefully tuned.
Dropout, a powerful regularization strategy that will be described in section 7.12, can be seen as a process of constructing new inputs by multiplying by noise.
When comparing machine learning benchmark results, taking the effect of dataset augmentation into account is important. Often, hand-designed dataset augmentation schemes can dramatically reduce the generalization error of a machine learning technique. To compare the performance of one machine learning algorithm to another, it is necessary to perform controlled experiments. When comparing machine learning algorithm A and machine learning algorithm B, make sure that both algorithms are evaluated using the same hand-designed dataset augmentation schemes. Suppose that algorithm A performs poorly with no dataset augmentation, and algorithm B performs well when combined with numerous synthetic transformations of the input. In such a case, the synthetic transformations likely caused the improved performance, rather than the use of machine learning algorithm B.
Sometimes deciding whether an experiment has been properly controlled requires subjective judgment. For example, machine learning algorithms that inject noise into the input are performing a form of dataset augmentation. Usually, operations that are generally applicable (such as adding Gaussian noise to the input) are considered part of the machine learning algorithm, while operations that are specific to one application domain (such as randomly cropping an image) are considered to be separate preprocessing steps.
7.5 Noise Robustness
Section 7.4 has motivated the use of noise applied to the inputs as a dataset augmentation strategy. For some models, the addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a). In the general case, it is important to remember that noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units. Noise applied to the hidden units is such an important topic that it merits its own separate discussion; the dropout algorithm described in section 7.12 is the main development of that approach.
Another way that noise has been used in the service of regularizing models is by adding it to the weights. This technique has been used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). This can be interpreted as a stochastic implementation of Bayesian inference over the weights. The Bayesian treatment of learning would consider the model weights to be uncertain and representable via a probability distribution that reflects this uncertainty. Adding noise to the weights is a practical, stochastic way to reflect this uncertainty.
Noise applied to the weights can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization, encouraging stability of the function to be learned. Consider the regression setting, where we wish to train a function $\hat{y}(\boldsymbol{x})$ that maps a set of features $\boldsymbol{x}$ to a scalar using the least-squares cost function between the model predictions $\hat{y}(\boldsymbol{x})$ and the true values $y$:

$$J = \mathbb{E}_{p(\boldsymbol{x},y)}\left[(\hat{y}(\boldsymbol{x}) - y)^2\right]. \tag{7.30}$$

The training set consists of $m$ labeled examples $\{(\boldsymbol{x}^{(1)}, y^{(1)}), \dots, (\boldsymbol{x}^{(m)}, y^{(m)})\}$.
238
CHAPTER
7.
REGULARIZA
TION
FOR
DEEP
LEARNING
no
assume
that
with
eac
input
presen
tation
we
also
include
random
erturbation
of
the
net
ork
weigh
ts. Let
us
imagine
that
ha
standard
-la
er
MLP
denote
the
erturb
ed
mo
del
as
Despite
the
injection
of
noise,
are
still
in
terested
in
minimizing
the
squared
error
of
the
output
of
the
net
ork.
The
ob
jective
function
th
us
ecomes
,y
(7.31)
,y
(7.32)
or
small
the
minimization
of
with
added
eight
noise
(with
cov
ariance
) is
equiv
alen
t to
minimization of
with an
additional regularization
ter-
m:
,y
∇
This
form
of
regularization
encourages
the
parameters
to
go
to
regions
of
parameter
space
where
small
erturbations
of
the
eigh
ts
ha
relatively
small
influence
on
the
output.
In
other
ords,
it
pushes
the
mo
del
in
to
regions
where
the
mo
del
is
relativ
ely
insensitive
to
small
ariations
in
the
eigh
ts,
finding
oin
ts
that
are
not
merely
minima,
but
minima
surrounded
flat
regions
Hochreiter
and
Schmidh
ub
er
1995
).
In
the
simplified
case
of
linear
regression
(where,
for
instance,
) =
),
this
regularization
term
collapses
in
to
whic
is
not
function
of
parameters
and
therefore
do
es
not
con
tribute
to
the
gradien
of
with
resp
ect
to
the
model
parameters.
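For the linear-regression special case just described, the equivalence is easy to check numerically. The sketch below (all data, weights, and constants are made up for illustration) estimates the noisy objective of equation 7.31 by Monte Carlo and compares it with J plus the η E[‖x‖²] term; for a linear model the identity is exact, so only Monte Carlo error remains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-regression setup: y_hat(x) = w . x
n, d = 500, 3
X = rng.normal(size=(n, d))
w = np.array([1.0, -2.0, 0.5])
y = X @ w + 0.1 * rng.normal(size=n)

eta = 1e-2  # variance of the weight noise
plain = np.mean((X @ w - y) ** 2)

# Monte Carlo estimate of E_eps[(y_hat_{w+eps}(x) - y)^2]
k = 5000
E = rng.normal(scale=np.sqrt(eta), size=(k, d))
noisy = np.mean((X @ (w + E).T - y[:, None]) ** 2)

# For a linear model, grad_w y_hat(x) = x, so the extra term is
# eta * E[||x||^2] exactly (no small-eta approximation needed).
penalty = eta * np.mean(np.sum(X ** 2, axis=1))
```

With these sizes, `noisy` matches `plain + penalty` up to small Monte Carlo error, illustrating that the weight noise acts as the stated penalty.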
7.5.1 Injecting Noise at the Output Targets

Most datasets have some number of mistakes in the y labels. It can be harmful to maximize log p(y | x) when y is a mistake. One way to prevent this is to explicitly model the noise on the labels. For example, we can assume that for some small constant ε, the training set label y is correct with probability 1 − ε, and otherwise any of the other possible labels might be correct. This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples. For example, label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of ε/(k − 1) and 1 − ε, respectively. The standard cross-entropy loss may then be used with these soft targets. Maximum likelihood learning with a softmax classifier and hard targets may actually never converge: the softmax can never predict a probability of exactly 0 or exactly 1, so it will continue to learn larger and larger weights, making more extreme predictions forever. It is possible to prevent this scenario using other regularization strategies like weight decay. Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification. This strategy has been used since the 1980s and continues to be featured prominently in modern neural networks (Szegedy et al., 2015).
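A minimal sketch of the soft-target construction described above (the smoothing value ε = 0.1 is just an illustrative choice):

```python
import numpy as np

def smooth_labels(labels, k, eps=0.1):
    """Replace hard one-hot targets with label-smoothed soft targets.

    The correct class gets probability 1 - eps; the remaining eps is
    spread uniformly over the other k - 1 classes (eps / (k - 1) each).
    """
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

soft = smooth_labels(np.array([0, 2]), k=3, eps=0.1)
# each row is a valid probability distribution summing to 1
```

The resulting rows can be fed directly to a standard cross-entropy loss in place of one-hot vectors.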
7.6 Semi-Supervised Learning

In the paradigm of semi-supervised learning, both unlabeled examples from P(x) and labeled examples from P(x, y) are used to estimate P(y | x) or predict y from x.

In the context of deep learning, semi-supervised learning usually refers to learning a representation h = f(x). The goal is to learn a representation so that examples from the same class have similar representations. Unsupervised learning can provide useful clues for how to group examples in representation space. Examples that cluster tightly in the input space should be mapped to similar representations. A linear classifier in the new space may achieve better generalization in many cases (Belkin and Niyogi, 2002; Chapelle et al., 2003). A long-standing variant of this approach is the application of principal components analysis as a preprocessing step before applying a classifier (on the projected data).

Instead of having separate unsupervised and supervised components in the model, one can construct models in which a generative model of either P(x) or P(x, y) shares parameters with a discriminative model of P(y | x). One can then trade off the supervised criterion −log P(y | x) with the unsupervised or generative one (such as −log P(x) or −log P(x, y)). The generative criterion then expresses a particular form of prior belief about the solution to the supervised learning problem (Lasserre et al., 2006), namely that the structure of P(x) is connected to the structure of P(y | x) in a way that is captured by the shared parametrization. By controlling how much of the generative criterion is included in the total criterion, one can find a better trade-off than with a purely generative or a purely discriminative training criterion (Lasserre et al., 2006; Larochelle and Bengio, 2008).

Salakhutdinov and Hinton (2008) describe a method for learning the kernel function of a kernel machine used for regression, in which the usage of unlabeled examples for modeling P(x) improves P(y | x) quite significantly.

See Chapelle et al. (2006) for more information about semi-supervised learning.
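The long-standing PCA-as-preprocessing variant mentioned above can be sketched as follows. The data, sizes, and the trivial nearest-centroid rule are all hypothetical; the point is only that the projection is fit on inputs alone, without using any labels:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a large unlabeled pool and a small labeled subset.
X_pool = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))
X_labeled = X_pool[:40]
y_labeled = (X_labeled[:, 0] > X_labeled[:, 0].mean()).astype(int)

# Fit the projection on ALL inputs; no labels are needed for this step.
mu = X_pool.mean(axis=0)
_, _, Vt = np.linalg.svd(X_pool - mu, full_matrices=False)

def project(M):
    return (M - mu) @ Vt[:2].T  # keep the top two principal components

# The supervised step (a toy nearest-centroid rule here) then uses
# only the projected labeled examples.
Z = project(X_labeled)
centroids = np.stack([Z[y_labeled == c].mean(axis=0) for c in (0, 1)])
```

Any classifier could replace the centroid rule; the unsupervised projection is what exploits the unlabeled pool.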
7.7 Multitask Learning

Multitask learning (Caruana, 1993) is a way to improve generalization by pooling the examples (which can be seen as soft constraints imposed on the parameters) arising out of several tasks. In the same way that additional training examples put more pressure on the parameters of the model toward values that generalize well, when part of a model is shared across tasks, that part of the model is more constrained toward good values (assuming the sharing is justified), often yielding better generalization.

Figure 7.2 illustrates a very common form of multitask learning, in which different supervised tasks (predicting y(i) given x) share the same input x, as well as some intermediate-level representation h(shared) capturing a common pool of factors. The model can generally be divided into two kinds of parts and associated parameters:

1. Task-specific parameters (which only benefit from the examples of their task to achieve good generalization). These are the upper layers of the neural network in figure 7.2.

2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks). These are the lower layers of the neural network in figure 7.2.

Improved generalization and generalization error bounds (Baxter, 1995) can be achieved because of the shared parameters, for which statistical strength can be greatly improved (in proportion with the increased number of examples for the shared parameters, compared to the scenario of single-task models). Of course this will happen only if some assumptions about the statistical relationship between the different tasks are valid, meaning that there is something shared across some of the tasks.

From the point of view of deep learning, the underlying prior belief is the following: among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.
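The division into shared and task-specific parameters can be sketched as a tiny forward pass. The layer sizes, the ReLU choice, and the two task heads are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

d_in, d_shared = 8, 16
W_shared = 0.1 * rng.normal(size=(d_in, d_shared))  # generic, pooled parameters
W_task = {t: 0.1 * rng.normal(size=(d_shared, 1)) for t in ("y1", "y2")}

def forward(x, task):
    h_shared = np.maximum(0.0, x @ W_shared)  # shared representation h(shared)
    return h_shared @ W_task[task]            # task-specific upper layer

x = rng.normal(size=(5, d_in))
out1, out2 = forward(x, "y1"), forward(x, "y2")
# During training, gradients w.r.t. W_shared pool signal from both tasks.
```

Only `W_shared` receives gradient contributions from every task's examples, which is where the statistical-strength benefit comes from.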
7.8 Early Stopping

When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but
Figure 7.2: Multitask learning can be cast in several ways in deep learning frameworks, and this figure illustrates the common situation where the tasks share a common input x but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters (associated respectively with the weights into and from h(1) and h(2)) can be learned on top of those yielding a shared representation h(shared). The underlying assumption is that there exists a common pool of factors that explain the variations in the input x, while each task is associated with a subset of these factors. In this example, it is additionally assumed that top-level hidden units h(1) and h(2) are specialized to each task (respectively predicting y(1) and y(2)), while some intermediate-level representation h(shared) is shared across all tasks. In the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks (h(3)): these are the factors that explain some of the input variations but are not relevant for predicting y(1) or y(2).
Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over time (indicated as number of training iterations over the dataset, or epochs). In this example, we train a maxout network on MNIST. Observe that the training objective decreases consistently over time, but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
validation set error begins to rise again. See figure 7.3 for an example of this behavior, which occurs reliably.

This means we can obtain a model with better validation set error (and thus, hopefully, better test set error) by returning to the parameter setting at the point in time with the lowest validation set error. Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than the latest parameters. The algorithm terminates when no parameters have improved over the best recorded validation error for some pre-specified number of iterations. This procedure is specified more formally in algorithm 7.1.

This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.
One way to think of early stopping is as a very efficient hyperparameter selection algorithm. In this view, the number of training steps is just another hyperparameter. We can see in figure 7.3 that this hyperparameter has a U-shaped validation set performance curve. Most hyperparameters that control model capacity have such a U-shaped validation set performance curve, as illustrated in figure 5.3. In the case of early stopping, we are controlling the effective capacity of the model by determining how many steps it can take to fit the training set. Most hyperparameters must be chosen using an expensive guess and check process, where we set a hyperparameter at the start of training, then run training for several steps to see its effect. The "training time" hyperparameter is unique in that, by definition, a single run of training tries out many values of the hyperparameter. The only significant cost to choosing this hyperparameter automatically via early stopping is running the validation set evaluation periodically during training. Ideally, this is done in parallel to the training process, on a separate machine, separate CPU, or separate GPU from the main training process. If such resources are not available, then the cost of these periodic evaluations may be reduced by using a validation set that is
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

    Let n be the number of steps between evaluations.
    Let p be the "patience," the number of times to observe worsening
      validation set error before giving up.
    Let θ₀ be the initial parameters.
    θ ← θ₀, i ← 0, j ← 0, v ← ∞, θ* ← θ, i* ← i
    while j < p do
        Update θ by running the training algorithm for n steps.
        i ← i + n
        v′ ← ValidationSetError(θ)
        if v′ < v then
            j ← 0, θ* ← θ, i* ← i, v ← v′
        else
            j ← j + 1
        end if
    end while
    Best parameters are θ*, best number of training steps is i*.
small compared to the training set, or by evaluating the validation set error less frequently and obtaining a lower-resolution estimate of the optimal training time.
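Algorithm 7.1 translates almost line for line into code. In this sketch, `train_steps` and `validation_error` are assumed callbacks standing in for a real training loop and validation evaluation:

```python
import copy

def early_stopping_train(theta, train_steps, validation_error, n=1, patience=3):
    """Algorithm 7.1 sketch: train n steps at a time, remember the parameters
    with the lowest validation error, and stop after `patience` evaluations
    in a row without improvement. `train_steps(theta, n)` and
    `validation_error(theta)` are assumed callbacks."""
    best_theta, best_step, best_err = copy.deepcopy(theta), 0, float("inf")
    step = fails = 0
    while fails < patience:
        theta = train_steps(theta, n)
        step += n
        err = validation_error(theta)
        if err < best_err:
            best_theta, best_step, best_err = copy.deepcopy(theta), step, err
            fails = 0
        else:
            fails += 1
    return best_theta, best_step, best_err

# Toy check: training increments theta; validation error is minimized at 3.
best, i_star, err = early_stopping_train(0, lambda th, n: th + n,
                                         lambda th: (th - 3) ** 2)
# best == 3, i_star == 3, err == 0
```

The `deepcopy` mirrors the "store a copy of the model parameters" step; for large models this copy would live in slower memory, as discussed below.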
An additional cost to early stopping is the need to maintain a copy of the best parameters. This cost is generally negligible, because it is acceptable to store these parameters in a slower and larger form of memory (for example, training in GPU memory but storing the optimal parameters in host memory or on a disk drive). Since the best parameters are written to infrequently and never read during training, these occasional slow writes have little effect on the total training time.
Early stopping is an unobtrusive form of regularization, in that it requires almost no change in the underlying training procedure, the objective function, or the set of allowable parameter values. This means that it is easy to use early stopping without damaging the learning dynamics. This is in contrast to weight decay, where one must be careful not to use too much weight decay and trap the network in a bad local minimum corresponding to a solution with pathologically small weights.

Early stopping may be used either alone or in conjunction with other regularization strategies. Even when using regularization strategies that modify the objective function to encourage better generalization, it is rare for the best generalization to occur at a local minimum of the training objective.
Early stopping requires a validation set, which means some training data is not fed to the model. To best exploit this extra data, one can perform extra training after the initial training with early stopping has completed. In the second, extra training step, all the training data is included. There are two basic strategies one can use for this second training procedure.
One strategy (algorithm 7.2) is to initialize the model again and retrain on all the data. In this second training pass, we train for the same number of steps as the early stopping procedure determined was optimal in the first pass. There are some subtleties associated with this procedure. For example, there is not a good way of knowing whether to retrain for the same number of parameter updates or the same number of passes through the dataset. On the second round of training, each pass through the dataset will require more parameter updates, because the training set is bigger.

Another strategy for using all the data is to keep the parameters obtained from the first round of training and then continue training, but now using all the data. At this stage, we no longer have a guide for when to stop in terms of number of steps. Instead, we can monitor the average loss function on the validation set and continue training until it falls below the value of the training set objective at which the early stopping procedure halted. This strategy avoids the high cost of
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

    Let X(train) and y(train) be the training set.
    Split X(train) and y(train) into (X(subtrain), X(valid)) and
      (y(subtrain), y(valid)), respectively.
    Run early stopping (algorithm 7.1) starting from random θ, using
      X(subtrain) and y(subtrain) for training data and X(valid) and y(valid)
      for validation data. This returns i*, the optimal number of steps.
    Set θ to random values again.
    Train on X(train) and y(train) for i* steps.
Algorithm 7.3 Meta-algorithm using early stopping to determine at what objective value we start to overfit, then continue training until that value is reached.

    Let X(train) and y(train) be the training set.
    Split X(train) and y(train) into (X(subtrain), X(valid)) and
      (y(subtrain), y(valid)), respectively.
    Run early stopping (algorithm 7.1) starting from random θ, using
      X(subtrain) and y(subtrain) for training data and X(valid) and y(valid)
      for validation data. This updates θ.
    ε ← J(θ, X(subtrain), y(subtrain))
    while J(θ, X(valid), y(valid)) > ε do
        Train on X(train) and y(train) for n steps.
    end while
retraining the model from scratch but is not as well behaved. For example, the objective on the validation set may never reach the target value, so this strategy is not even guaranteed to terminate. This procedure is presented more formally in algorithm 7.3.
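The continue-training phase of algorithm 7.3 can be sketched as follows. Here `train_steps`, `train_obj`, and `valid_obj` are assumed callbacks, and the `max_rounds` guard reflects the lack of a termination guarantee noted above:

```python
def continue_training(theta, train_steps, train_obj, valid_obj, n=1,
                      max_rounds=1000):
    """Algorithm 7.3 sketch, second phase: after early stopping on the
    subtraining set, keep training on ALL data until the validation
    objective falls to the subtraining objective value at the halt point.
    The three callbacks are assumed stand-ins; max_rounds guards against
    the possible non-termination noted in the text."""
    target = train_obj(theta)  # epsilon in algorithm 7.3
    for _ in range(max_rounds):
        if valid_obj(theta) <= target:
            break
        theta = train_steps(theta, n)
    return theta

# Toy check: each step lowers theta by 1; stop once valid_obj reaches 2.
theta = continue_training(10, lambda th, n: th - n, lambda th: 2, lambda th: th)
# theta == 2
```

In practice `valid_obj` would be the average validation loss and `target` the recorded subtraining-set objective at the early stopping halt.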
Early stopping is also useful because it reduces the computational cost of the training procedure. Besides the obvious reduction in cost due to limiting the number of training iterations, it also has the benefit of providing regularization without requiring the addition of a penalty term to the cost function or the computation of the gradients of such additional terms.
How early stopping acts as a regularizer: So far we have stated that early stopping is a regularization strategy, but we have supported this claim only by showing learning curves where the validation set error has a U-shaped curve. What
Figure 7.4: An illustration of the effect of early stopping. (Left) The solid contour lines indicate the contours of the negative log-likelihood. The dashed line indicates the trajectory taken by SGD beginning from the origin. Rather than stopping at the point w* that minimizes the cost, early stopping results in the trajectory stopping at an earlier point w̃. (Right) An illustration of the effect of L² regularization for comparison. The dashed circles indicate the contours of the L² penalty, which causes the minimum of the total cost to lie nearer the origin than the minimum of the unregularized cost.
is the actual mechanism by which early stopping regularizes the model?
Bishop (1995a) and Sjöberg and Ljung (1995) argued that early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θ₀, as illustrated in figure 7.4. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) and with learning rate ε. We can view the product ετ as a measure of effective capacity. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θ₀. In this sense, ετ behaves as if it were the reciprocal of the coefficient used for weight decay.

Indeed, we can show how, in the case of a simple linear model with a quadratic error function and simple gradient descent, early stopping is equivalent to L² regularization.

To compare with classical L² regularization, we examine a simple setting where the only parameters are linear weights (θ = w). We can model the cost function J with a quadratic approximation in the neighborhood of the empirically optimal value of the weights w*:

    Ĵ(θ) = J(w*) + ½ (w − w*)⊤ H (w − w*),   (7.33)

where H is the Hessian matrix of J with respect to w evaluated at w*. Given the assumption that w* is a minimum of J(w), we know that H is positive semidefinite.
Under a local Taylor series approximation, the gradient is given by

    ∇_w Ĵ(w) = H(w − w*).   (7.34)

We are going to study the trajectory followed by the parameter vector during training. For simplicity, let us set the initial parameter vector to the origin, that is, w(0) = 0. (For neural networks, to obtain symmetry breaking between hidden units, we cannot initialize all the parameters to 0, as discussed in section 6.2. However, the argument holds for any other initial value w(0).) Let us study the approximate behavior of gradient descent on J by analyzing gradient descent on Ĵ:

    w(τ) = w(τ−1) − ε ∇_w Ĵ(w(τ−1))   (7.35)
         = w(τ−1) − ε H (w(τ−1) − w*)   (7.36)
    w(τ) − w* = (I − εH)(w(τ−1) − w*).   (7.37)

Let us now rewrite this expression in the space of the eigenvectors of H, exploiting the eigendecomposition of H: H = QΛQ⊤, where Λ is a diagonal matrix and Q is an orthonormal basis of eigenvectors.

    w(τ) − w* = (I − εQΛQ⊤)(w(τ−1) − w*)   (7.38)
    Q⊤(w(τ) − w*) = (I − εΛ) Q⊤(w(τ−1) − w*)   (7.39)

Assuming that w(0) = 0 and that ε is chosen to be small enough to guarantee |1 − ελᵢ| < 1, the parameter trajectory during training after τ parameter updates is as follows:

    Q⊤w(τ) = [I − (I − εΛ)^τ] Q⊤w*.   (7.40)

Now, the expression for Q⊤w̃ in equation 7.13 for L² regularization can be rearranged as

    Q⊤w̃ = (Λ + αI)⁻¹ Λ Q⊤w*   (7.41)
    Q⊤w̃ = [I − (Λ + αI)⁻¹ α] Q⊤w*.   (7.42)

Comparing equation 7.40 and equation 7.42, we see that if the hyperparameters ε, α and τ are chosen such that

    (I − εΛ)^τ = (Λ + αI)⁻¹ α,   (7.43)
then L² regularization and early stopping can be seen as equivalent (at least under the quadratic approximation of the objective function). Going even further, by taking logarithms and using the series expansion for log(1 + x), we can conclude that if all λᵢ are small (that is, ελᵢ ≪ 1 and λᵢ/α ≪ 1), then

    τ ≈ 1/(εα),   (7.44)
    α ≈ 1/(τε).   (7.45)

That is, under these assumptions, the number of training iterations τ plays a role inversely proportional to the L² regularization parameter, and the inverse of τε plays the role of the weight decay coefficient.

Parameter values corresponding to directions of significant curvature (of the objective function) are regularized less than directions of less curvature. Of course, in the context of early stopping, this really means that parameters that correspond to directions of significant curvature tend to learn early relative to parameters corresponding to directions of less curvature.

The derivations in this section have shown that a trajectory of length τ ends at a point that corresponds to a minimum of the L²-regularized objective. Early stopping is of course more than the mere restriction of the trajectory length; instead, early stopping typically involves monitoring the validation set error in order to stop the trajectory at a particularly good point in space. Early stopping therefore has the advantage over weight decay that it automatically determines the correct amount of regularization, while weight decay requires many training experiments with different values of its hyperparameter.
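The equivalence can be checked numerically on a small quadratic problem. The sketch below (with Hessian eigenvalues and step counts chosen so that ελᵢ ≪ 1 and λᵢ/α ≪ 1 roughly hold) compares τ steps of gradient descent from the origin against the L²-regularized solution w̃ = (H + αI)⁻¹Hw* at the matched coefficient α = 1/(τε):

```python
import numpy as np

rng = np.random.default_rng(2)

# Quadratic objective J(w) = 1/2 (w - w*)^T H (w - w*) with small eigenvalues,
# so the regime of equations 7.44-7.45 applies.
lams = np.array([0.002, 0.005, 0.008, 0.012])
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
H = Q @ np.diag(lams) @ Q.T
w_star = rng.normal(size=4)

eps_lr, tau = 0.5, 50                  # learning rate and number of steps
w = np.zeros(4)                        # w(0) = 0, as in the derivation
for _ in range(tau):
    w = w - eps_lr * H @ (w - w_star)  # gradient descent on the quadratic

alpha = 1.0 / (tau * eps_lr)           # equation 7.45: matched weight decay
w_ridge = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

rel = np.linalg.norm(w - w_ridge) / np.linalg.norm(w_ridge)
# rel is small: early stopping ~ L2 regularization in this regime
```

Shrinking the eigenvalues further (or τε) tightens the agreement, matching the small-λᵢ condition in the text.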
7.9 Parameter Tying and Parameter Sharing

Thus far, in this chapter, when we have discussed adding constraints or penalties to the parameters, we have always done so with respect to a fixed region or point. For example, L² regularization (or weight decay) penalizes model parameters for deviating from the fixed value of zero. Sometimes, however, we may need other ways to express our prior knowledge about suitable values of the model parameters. Sometimes we might not know precisely what values the parameters should take, but we know, from knowledge of the domain and model architecture, that there should be some dependencies between the model parameters.

A common type of dependency that we often want to express is that certain parameters should be close to one another. Consider the following scenario:
we have two models performing the same classification task (with the same set of classes) but with somewhat different input distributions. Formally, we have model A with parameters w(A) and model B with parameters w(B). The two models map the input to two different but related outputs: ŷ(A) = f(w(A), x) and ŷ(B) = g(w(B), x).

Let us imagine that the tasks are similar enough (perhaps with similar input and output distributions) that we believe the model parameters should be close to each other: for all i, wᵢ(A) should be close to wᵢ(B). We can leverage this information through regularization. Specifically, we can use a parameter norm penalty of the form Ω(w(A), w(B)) = ‖w(A) − w(B)‖₂². Here we used an L² penalty, but other choices are also possible.

This kind of approach was proposed by Lasserre et al. (2006), who regularized the parameters of one model, trained as a classifier in a supervised paradigm, to be close to the parameters of another model, trained in an unsupervised paradigm (to capture the distribution of the observed input data). The architectures were constructed such that many of the parameters in the classifier model could be paired to corresponding parameters in the unsupervised model.

While a parameter norm penalty is one way to regularize parameters to be close to one another, the more popular way is to use constraints: to force sets of parameters to be equal. This method of regularization is often referred to as parameter sharing, because we interpret the various models or model components as sharing a unique set of parameters. A significant advantage of parameter sharing over regularizing the parameters to be close (via a norm penalty) is that only a subset of the parameters (the unique set) needs to be stored in memory. In certain models, such as the convolutional neural network, this can lead to significant reduction in the memory footprint of the model.
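The penalty Ω(w(A), w(B)) = ‖w(A) − w(B)‖₂² and its gradients fit in a few lines of code; the λ coefficient and the toy vectors are illustrative assumptions:

```python
import numpy as np

def tying_penalty(w_a, w_b, lam=0.1):
    """Omega(w_a, w_b) = ||w_a - w_b||_2^2, scaled by a coefficient lam,
    added to the two models' task losses."""
    return lam * np.sum((w_a - w_b) ** 2)

def tying_grads(w_a, w_b, lam=0.1):
    """Gradients of the penalty: pull each parameter set toward the other."""
    g = 2.0 * lam * (w_a - w_b)
    return g, -g

w_a, w_b = np.array([1.0, 2.0]), np.array([0.5, 2.5])
# tying_penalty(w_a, w_b) == 0.1 * (0.25 + 0.25) == 0.05
```

Parameter sharing is the limiting case: instead of adding this penalty, both models simply index the same array, so the difference is constrained to be exactly zero.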
7.9.1 Convolutional Neural Networks

By far the most popular and extensive use of parameter sharing occurs in convolutional neural networks (CNNs) applied to computer vision. Natural images have many statistical properties that are invariant to translation. For example, a photo of a cat remains a photo of a cat if it is translated one pixel to the right. CNNs take this property into account by sharing parameters across multiple image locations. The same feature (a hidden unit with the same weights) is computed over different locations in the input. This means that we can find a cat with the same cat detector whether the cat appears at column i or column i + 1 in the image.

Parameter sharing has enabled CNNs to dramatically lower the number of unique model parameters and to significantly increase network sizes without requiring a corresponding increase in training data. It remains one of the best examples of how to effectively incorporate domain knowledge into the network architecture.

CNNs are discussed in more detail in chapter 9.
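Sharing one kernel across all positions can be sketched with a 1D cross-correlation; the tiny "edge detector" kernel here is an illustrative assumption:

```python
import numpy as np

def conv1d_valid(x, k):
    """Cross-correlate input x with one shared kernel k: the SAME weights k
    are applied at every position, instead of one weight per location."""
    return np.array([x[i:i + len(k)] @ k for i in range(len(x) - len(k) + 1)])

# A step-edge detector responds identically wherever the edge appears.
k = np.array([-1.0, 1.0])
a = conv1d_valid(np.array([0.0, 0.0, 1.0, 1.0, 1.0]), k)
b = conv1d_valid(np.array([0.0, 0.0, 0.0, 1.0, 1.0]), k)
# a peaks at position 1, b at position 2: same detector, shifted response.
```

The detector has two parameters regardless of the input length, which is the memory saving parameter sharing provides.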
7.10 Sparse Representations

Weight decay acts by placing a penalty directly on the model parameters. Another strategy is to place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse. This indirectly imposes a complicated penalty on the model parameters.

We have already discussed (in section 7.1.2) how L¹ penalization induces a sparse parametrization, meaning that many of the parameters become zero (or close to zero). Representational sparsity, on the other hand, describes a representation where many of the elements of the representation are zero (or close to zero). A simplified view of this distinction can be illustrated in the context of linear regression:

[Equations 7.46 and 7.47 are concrete numerical examples that do not survive this extraction. Equation 7.46 shows y = Ax with a sparse parameter matrix A (most entries zero) applied to a dense input x; equation 7.47 shows y = Bh with a dense matrix B applied to a sparse representation h (most entries zero).]

In the first expression, we have an example of a sparsely parametrized linear regression model. In the second, we have linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.

Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization. Norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation. This penalty is denoted Ω(h). As before, we denote the regularized loss function by J̃:

    J̃(θ; X, y) = J(θ; X, y) + αΩ(h),   (7.48)

where α ∈ [0, ∞) weights the relative contribution of the norm penalty term, with larger values of α corresponding to more regularization.

Just as an L¹ penalty on the parameters induces parameter sparsity, an L¹ penalty on the elements of the representation induces representational sparsity: Ω(h) = ‖h‖₁ = Σᵢ |hᵢ|. Of course, the L¹ penalty is only one choice of penalty that can result in a sparse representation. Others include the penalty derived from a Student t prior on the representation (Olshausen and Field, 1996; Bergstra, 2011) and KL divergence penalties (Larochelle and Bengio, 2008), which are especially useful for representations with elements constrained to lie on the unit interval. Lee et al. (2008) and Goodfellow et al. (2009) both provide examples of strategies based on regularizing the average activation across several examples to be near some target value, such as a vector with .01 for each entry.

Other approaches obtain representational sparsity with a hard constraint on the activation values. For example, orthogonal matching pursuit (Pati et al., 1993) encodes an input x with the representation h that solves the constrained optimization problem

    arg min_{h, ‖h‖₀ < k} ‖x − Wh‖²,   (7.49)

where ‖h‖₀ is the number of nonzero entries of h. This problem can be solved efficiently when W is constrained to be orthogonal. This method is often called OMP-k, with the value of k specified to indicate the number of nonzero features allowed. Coates and Ng (2011) demonstrated that OMP-1 can be a very effective feature extractor for deep architectures.

Essentially any model that has hidden units can be made sparse. Throughout this book, we see many examples of sparsity regularization used in various contexts.
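A sketch of equation 7.48 with Ω(h) = ‖h‖₁ for a one-hidden-layer regressor; all sizes, the ReLU choice, and the α value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical one-hidden-layer network; alpha weights the sparsity penalty.
W1 = 0.1 * rng.normal(size=(10, 32))
W2 = 0.1 * rng.normal(size=(32, 1))

def regularized_loss(X, y, alpha=0.01):
    h = np.maximum(0.0, X @ W1)              # representation h (ReLU, h >= 0)
    mse = np.mean(((h @ W2).ravel() - y) ** 2)
    omega = np.mean(np.abs(h).sum(axis=1))   # Omega(h) = ||h||_1 per example
    return mse + alpha * omega               # J-tilde = J + alpha * Omega(h)

X, y = rng.normal(size=(8, 10)), rng.normal(size=8)
```

Training against this loss pressures individual activations toward zero, whereas an L¹ penalty on W1 and W2 themselves would instead produce sparse parameters.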
7.11 Bagging and Other Ensemble Methods

Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models (Breiman, 1994). The idea is to train several different models separately, then have all the models vote on the output for test examples. This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.

The reason that model averaging works is that different models will usually not make all the same errors on the test set.

Consider for example a set of k regression models. Suppose that each model makes an error εᵢ on each example, with the errors drawn from a zero-mean multivariate normal distribution with variances E[εᵢ²] = v and covariances E[εᵢεⱼ] = c. Then the error made by the average prediction of all the ensemble models is (1/k) Σᵢ εᵢ. The expected squared error of the ensemble predictor is

    E[( (1/k) Σᵢ εᵢ )²] = (1/k²) E[ Σᵢ ( εᵢ² + Σ_{j≠i} εᵢεⱼ ) ]   (7.50)
                        = (1/k) v + ((k − 1)/k) c.   (7.51)

In the case where the errors are perfectly correlated and c = v, the mean squared error reduces to v, so the model averaging does not help at all. In the case where the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only (1/k)v. This means that the expected squared error of the ensemble is inversely proportional to the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than its members.
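Equation 7.51 can be checked by simulation; the values of k, v, and c below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

k, v, c = 10, 1.0, 0.25  # ensemble size, error variance, error covariance
# Covariance matrix with v on the diagonal and c elsewhere (PSD for c < v).
cov = np.full((k, k), c) + (v - c) * np.eye(k)

# Draw correlated member errors and average them, as in equation 7.50.
eps = rng.multivariate_normal(np.zeros(k), cov, size=200_000)
mse_ensemble = np.mean(eps.mean(axis=1) ** 2)

predicted = v / k + (k - 1) * c / k  # equation 7.51: 0.1 + 0.225 = 0.325
```

Setting c = 0 in the same simulation recovers the v/k case, and c = v recovers v, matching the two limiting cases discussed above.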
Different ensemble methods construct the ensemble of models in different ways. For example, each member of the ensemble could be formed by training a completely different kind of model using a different algorithm or objective function. Bagging is a method that allows the same kind of model, training algorithm and objective function to be reused several times.

Specifically, bagging involves constructing k different datasets. Each dataset has the same number of examples as the original dataset, but each dataset is constructed by sampling with replacement from the original dataset. This means that, with high probability, each dataset is missing some of the examples from the
Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an 8 detector on the dataset depicted above, containing an 8, a 6, and a 9. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the 9 and repeats the 8. On this dataset, the detector learns that a loop on top of the digit corresponds to an 8. On the second dataset, we repeat the 9 and omit the 6. In this case, the detector learns that a loop on the bottom of the digit corresponds to an 8. Each of these individual classification rules is brittle, but if we average their output, then the detector is robust, achieving maximal confidence only when both loops of the 8 are present.
original dataset and contains several duplicate examples. Model i is then trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models. See figure 7.5 for an example.
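The resampling step can be sketched in a few lines; the dataset here is just the integers 0..m−1, since only membership in the resample matters:

```python
import numpy as np

rng = np.random.default_rng(6)

m = 1000
dataset = np.arange(m)  # stand-in for m training examples

# One bagged dataset: m draws with replacement from the original.
resampled = rng.choice(dataset, size=m, replace=True)
missing_frac = 1.0 - np.unique(resampled).size / m

# Expected missing fraction is (1 - 1/m)^m, approaching 1/e for large m.
expected = (1.0 - 1.0 / m) ** m
```

Each ensemble member trains on one such resample, so each member sees a slightly different, partially overlapping view of the data.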
Neural networks reach a wide enough variety of solution points that they can often benefit from model averaging even if all the models are trained on the same dataset. Differences in random initialization, in random selection of minibatches, in hyperparameters, or in outcomes of nondeterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors.
Model averaging is an extremely powerful and reliable method for reducing generalization error. Its use is usually discouraged when benchmarking algorithms for scientific papers, because any machine learning algorithm can benefit substantially from model averaging, at the price of increased computation and memory.
(Footnote: When both the original and the resampled dataset contain m examples, the exact proportion of examples missing from the new dataset is (1 − 1/m)^m. This is the chance that a particular example is not chosen among the m possible source examples for all m draws used to create the new dataset. As m approaches infinity, this quantity converges to 1/e ≈ 0.368, which is slightly larger than 1/3.)
For this reason, benchmark comparisons are usually made using a single model. Machine learning contests are usually won by methods using model averaging over dozens of models. A recent prominent example is the Netflix Grand Prize (Koren, 2009).
Not
all
tec
hniques
for
constructing
ensem
bles
are
designed
to
make
the
ensemble
more
regularized
than
the
individual
mo
dels.
or
example,
technique
called
osting
reund
and
Sc
hapire
1996b
constructs
an
ensemble
with
higher
capacit
than
the
individual
models.
Bo
osting
has
een
applied
to
build
ensem
bles
of
neural
net
orks
Sc
enk
and
Bengio
1998
incremen
tally
adding
neural
net
orks
to
the
ensem
ble.
Bo
osting
has
also
een
applied
interpreting
an
individual
neural
net
ork
as
an
ensem
ble
Bengio
et
al.
2006a
),
incremen
tally
adding
hidden
units
to
the
net
ork.
7.12 Dropout
Dropout (Srivastava et al., 2014) provides a computationally inexpensive but powerful method of regularizing a broad family of models. To a first approximation, dropout can be thought of as a method of making bagging practical for ensembles of very many large neural networks. Bagging involves training multiple models and evaluating multiple models on each test example. This seems impractical when each model is a large neural network, since training and evaluating such networks is costly in terms of runtime and memory. It is common to use ensembles of five to ten neural networks (Szegedy et al. (2014a) used six to win the ILSVRC), but more than this rapidly becomes unwieldy. Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks. Specifically, dropout trains the ensemble consisting of all subnetworks that can be formed by removing nonoutput units from an underlying base network, as illustrated in figure 7.6.
In most modern neural networks, based on a series of affine transformations and nonlinearities, we can effectively remove a unit from a network by multiplying its output value by zero. This procedure requires some slight modification for models such as radial basis function networks, which take the difference between the unit's state and some reference value. Here, we present the dropout algorithm in terms of multiplication by zero for simplicity, but it can be trivially modified to work with other operations that remove a unit from the network.

Recall that to learn with bagging, we define k different models, construct
Base network          Ensemble of subnetworks

Figure 7.6: Dropout trains an ensemble consisting of all subnetworks that can be constructed by removing nonoutput units from an underlying base network. Here, we begin with a base network with two visible units and two hidden units. There are sixteen possible subsets of these four units. We show all sixteen subnetworks that may be formed by dropping out different subsets of units from the original network. In this small example, a large proportion of the resulting networks have no input units or no path connecting the input to the output. This problem becomes insignificant for networks with wider layers, where the probability of dropping all possible paths from inputs to outputs becomes smaller.
k different datasets by sampling from the training set with replacement, and then train model i on dataset i. Dropout aims to approximate this process, but with an exponentially large number of neural networks. Specifically, to train with dropout, we use a minibatch-based learning algorithm that makes small steps, such as stochastic gradient descent. Each time we load an example into a minibatch, we randomly sample a different binary mask to apply to all the input and hidden units in the network. The mask for each unit is sampled independently from all the others. The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter fixed before training begins. It is not a function of the current value of the model parameters or the input example. Typically, an input unit is included with probability 0.8, and a hidden unit is included with probability 0.5. We then run forward propagation, back-propagation, and the learning update as usual. Figure 7.7 illustrates how to run forward propagation with dropout.
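This stochastic forward pass can be sketched in a few lines of numpy. This is an illustrative toy, not a reference implementation; the network shape and the names (`dropout_forward`, the mask variables `mu_in` and `mu_h`) are our own, and the inclusion probabilities follow the typical values just mentioned (0.8 for inputs, 0.5 for hidden units):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W1, b1, W2, b2, p_input=0.8, p_hidden=0.5):
    """One stochastic forward pass with dropout: sample an independent
    binary mask for the input and hidden units, multiply the unit
    states by the mask, then proceed through the network as usual."""
    mu_in = rng.binomial(1, p_input, size=x.shape)   # input mask
    h = np.maximum(0, W1 @ (x * mu_in) + b1)         # ReLU hidden layer
    mu_h = rng.binomial(1, p_hidden, size=h.shape)   # hidden mask
    return W2 @ (h * mu_h) + b2                      # output unit

# A tiny example network with 4 inputs, 3 hidden units, 1 output.
x = rng.normal(size=4)
W1 = rng.normal(size=(3, 4)); b1 = np.zeros(3)
W2 = rng.normal(size=(1, 3)); b2 = np.zeros(1)
y = dropout_forward(x, W1, b1, W2, b2)
```

Repeated calls on the same input produce different outputs because each call samples a fresh mask, i.e., a fresh subnetwork from the ensemble.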
More formally, suppose that a mask vector μ specifies which units to include, and J(θ, μ) defines the cost of the model defined by parameters θ and mask μ. Then dropout training consists of minimizing E_μ J(θ, μ). The expectation contains exponentially many terms, but we can obtain an unbiased estimate of its gradient by sampling values of μ.
Dropout training is not quite the same as bagging training. In the case of bagging, the models are all independent. In the case of dropout, the models share parameters, with each model inheriting a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory. In the case of bagging, each model is trained to convergence on its respective training set. In the case of dropout, typically most models are not explicitly trained at all; usually, the model is large enough that it would be infeasible to sample all possible subnetworks within the lifetime of the universe. Instead, a tiny fraction of the possible subnetworks are each trained for a single step, and the parameter sharing causes the remaining subnetworks to arrive at good settings of the parameters. These are the only differences. Beyond these, dropout follows the bagging algorithm. For example, the training set encountered by each subnetwork is indeed a subset of the original training set sampled with replacement.

To make a prediction, a bagged ensemble must accumulate votes from all its members. We refer to this process as inference in this context. So far, our description of bagging and dropout has not required that the model be explicitly probabilistic. Now, we assume that the model's role is to output a probability distribution. In the case of bagging, each model i produces a probability distribution p^(i)(y | x).
Figure 7.7: An example of forward propagation through a feedforward network using dropout. (Top) In this example, we use a feedforward network with two input units, one hidden layer with two hidden units, and one output unit. (Bottom) To perform forward propagation with dropout, we randomly sample a vector μ with one entry for each input or hidden unit in the network. The entries of μ are binary and are sampled independently from each other. The probability of each entry being 1 is a hyperparameter, usually 0.5 for the hidden layers and 0.8 for the input. Each unit in the network is multiplied by the corresponding mask, and then forward propagation continues through the rest of the network as usual. This is equivalent to randomly selecting one of the subnetworks from figure 7.6 and running forward propagation through it.
The prediction of the ensemble is given by the arithmetic mean of all these distributions,

$$\frac{1}{k} \sum_{i=1}^{k} p^{(i)}(y \mid x). \tag{7.52}$$

In the case of dropout, each submodel defined by mask vector μ defines a probability distribution p(y | x, μ). The arithmetic mean over all masks is given by

$$\sum_{\mu} p(\mu)\, p(y \mid x, \mu), \tag{7.53}$$

where p(μ) is the probability distribution that was used to sample μ at training time. Because this sum includes an exponential number of terms, it is intractable to evaluate except when the structure of the model permits some form of simplification. So far, deep neural nets are not known to permit any tractable simplification. Instead, we can approximate the inference with sampling, averaging together the output from many masks. Even 10–20 masks are often sufficient to obtain good performance.
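A minimal sketch of this Monte Carlo averaging for a one-layer softmax model (our own toy setup, not from the text, with the assumed inclusion probability p = 0.5) might look like:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mc_dropout_predict(v, W, b, n_masks=20, p=0.5):
    """Approximate the ensemble prediction by averaging the predicted
    distributions of n_masks randomly sampled subnetworks (eq. 7.53,
    with the sum over all masks replaced by a Monte Carlo estimate)."""
    probs = np.zeros(b.shape)
    for _ in range(n_masks):
        d = rng.binomial(1, p, size=v.shape)  # sample one mask
        probs += softmax(W @ (d * v) + b)     # arithmetic mean of distributions
    return probs / n_masks

v = rng.normal(size=5)
W = rng.normal(size=(3, 5)); b = np.zeros(3)
p_hat = mc_dropout_predict(v, W, b)
```

Each pass through the loop evaluates one sampled subnetwork; the average of the resulting distributions is itself a valid probability distribution.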
An even better approach, however, allows us to obtain a good approximation to the predictions of the entire ensemble, at the cost of only one forward propagation. To do so, we change to using the geometric mean rather than the arithmetic mean of the ensemble members' predicted distributions. Warde-Farley et al. (2014) present arguments and empirical evidence that the geometric mean performs comparably to the arithmetic mean in this context.
The geometric mean of multiple probability distributions is not guaranteed to be a probability distribution. To guarantee that the result is a probability distribution, we impose the requirement that none of the submodels assigns probability 0 to any event, and we renormalize the resulting distribution. The unnormalized probability distribution defined directly by the geometric mean is given by

$$\tilde{p}_{\text{ensemble}}(y \mid x) = \sqrt[2^{d}]{\prod_{\mu} p(y \mid x, \mu)}, \tag{7.54}$$

where d is the number of units that may be dropped. Here we use a uniform distribution over μ to simplify the presentation, but nonuniform distributions are also possible. To make predictions we must renormalize the ensemble:

$$p_{\text{ensemble}}(y \mid x) = \frac{\tilde{p}_{\text{ensemble}}(y \mid x)}{\sum_{y'} \tilde{p}_{\text{ensemble}}(y' \mid x)}. \tag{7.55}$$
A key insight (Hinton et al., 2012c) involved in dropout is that we can approximate p_ensemble by evaluating p(y | x) in one model: the model with all units, but with the weights going out of unit i multiplied by the probability of including unit i. The motivation for this modification is to capture the right expected value of the output from that unit. We call this approach the weight scaling inference rule. There is not yet any theoretical argument for the accuracy of this approximate inference rule in deep nonlinear networks, but empirically it performs very well.

Because we usually use an inclusion probability of 1/2, the weight scaling rule usually amounts to dividing the weights by 2 at the end of training, and then using the model as usual. Another way to achieve the same result is to multiply the states of the units by 2 during training. Either way, the goal is to make sure that the expected total input to a unit at test time is roughly the same as the expected total input to that unit at train time, even though half the units at train time are missing on average.
For many classes of models that do not have nonlinear hidden units, the weight scaling inference rule is exact. For a simple example, consider a softmax regression classifier with n input variables represented by the vector v:

$$P(y = \mathrm{y} \mid \mathbf{v}) = \operatorname{softmax}\left(\mathbf{W}^\top \mathbf{v} + \mathbf{b}\right)_{\mathrm{y}}. \tag{7.56}$$

We can index into the family of submodels by element-wise multiplication of the input with a binary vector d:

$$P(y = \mathrm{y} \mid \mathbf{v}; \mathbf{d}) = \operatorname{softmax}\left(\mathbf{W}^\top (\mathbf{d} \odot \mathbf{v}) + \mathbf{b}\right)_{\mathrm{y}}. \tag{7.57}$$

The ensemble predictor is defined by renormalizing the geometric mean over all ensemble members' predictions:

$$P_{\text{ensemble}}(y = \mathrm{y} \mid \mathbf{v}) = \frac{\tilde{P}_{\text{ensemble}}(y = \mathrm{y} \mid \mathbf{v})}{\sum_{\mathrm{y}'} \tilde{P}_{\text{ensemble}}(y = \mathrm{y}' \mid \mathbf{v})}, \tag{7.58}$$

where

$$\tilde{P}_{\text{ensemble}}(y = \mathrm{y} \mid \mathbf{v}) = \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} P(y = \mathrm{y} \mid \mathbf{v}; \mathbf{d})}. \tag{7.59}$$

To see that the weight scaling rule is exact, we can simplify $\tilde{P}_{\text{ensemble}}$:

$$\tilde{P}_{\text{ensemble}}(y = \mathrm{y} \mid \mathbf{v}) = \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} P(y = \mathrm{y} \mid \mathbf{v}; \mathbf{d})} \tag{7.60}$$

$$= \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \operatorname{softmax}\left(\mathbf{W}^\top (\mathbf{d} \odot \mathbf{v}) + \mathbf{b}\right)_{\mathrm{y}}} \tag{7.61}$$

$$= \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \frac{\exp\left(\mathbf{W}_{\mathrm{y},:}^\top (\mathbf{d} \odot \mathbf{v}) + b_{\mathrm{y}}\right)}{\sum_{\mathrm{y}'} \exp\left(\mathbf{W}_{\mathrm{y}',:}^\top (\mathbf{d} \odot \mathbf{v}) + b_{\mathrm{y}'}\right)}} \tag{7.62}$$

$$= \frac{\sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \exp\left(\mathbf{W}_{\mathrm{y},:}^\top (\mathbf{d} \odot \mathbf{v}) + b_{\mathrm{y}}\right)}}{\sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \sum_{\mathrm{y}'} \exp\left(\mathbf{W}_{\mathrm{y}',:}^\top (\mathbf{d} \odot \mathbf{v}) + b_{\mathrm{y}'}\right)}}. \tag{7.63}$$

Because $\tilde{P}$ will be normalized, we can safely ignore multiplication by factors that are constant with respect to y:

$$\tilde{P}_{\text{ensemble}}(y = \mathrm{y} \mid \mathbf{v}) \propto \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \exp\left(\mathbf{W}_{\mathrm{y},:}^\top (\mathbf{d} \odot \mathbf{v}) + b_{\mathrm{y}}\right)} \tag{7.64}$$

$$= \exp\left(\frac{1}{2^n} \sum_{\mathbf{d} \in \{0,1\}^n} \mathbf{W}_{\mathrm{y},:}^\top (\mathbf{d} \odot \mathbf{v}) + b_{\mathrm{y}}\right) \tag{7.65}$$

$$= \exp\left(\frac{1}{2} \mathbf{W}_{\mathrm{y},:}^\top \mathbf{v} + b_{\mathrm{y}}\right). \tag{7.66}$$

Substituting this back into equation 7.58, we obtain a softmax classifier with weights $\frac{1}{2}\mathbf{W}$.
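For small n, the exactness of this derivation can be verified directly by enumerating all 2^n masks. The brute-force check below is our own illustration, not from the text:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n = 6                                    # small enough to enumerate all 2^n masks
v = rng.normal(size=n)
W = rng.normal(size=(4, n)); b = rng.normal(size=4)

# Geometric mean over all 2^n submodels, then renormalize (eqs. 7.58-7.59).
log_geo = np.zeros(4)
for d in itertools.product([0, 1], repeat=n):
    log_geo += np.log(softmax(W @ (np.array(d) * v) + b))
log_geo /= 2 ** n
p_ensemble = np.exp(log_geo) / np.exp(log_geo).sum()

# Weight scaling inference rule: a single softmax with weights W / 2.
p_scaled = softmax((W / 2) @ v + b)
```

The two distributions agree to floating-point precision, because each submodel's softmax normalizer is constant with respect to y and cancels after renormalization, exactly as in the derivation above.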
The weight scaling rule is also exact in other settings, including regression networks with conditionally normal outputs as well as deep networks that have hidden layers without nonlinearities. However, the weight scaling rule is only an approximation for deep models that have nonlinearities. Though the approximation has not been theoretically characterized, it often works well, empirically. Goodfellow et al. (2013a) found experimentally that the weight scaling approximation can work better (in terms of classification accuracy) than Monte Carlo approximations to the ensemble predictor. This held true even when the Monte Carlo approximation was allowed to sample up to 1,000 subnetworks. Gal and Ghahramani (2015) found that some models obtain better classification accuracy using twenty samples and the Monte Carlo approximation. It appears that the optimal choice of inference approximation is problem dependent.

Srivastava et al. (2014) showed that dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter
norm constraints, and sparse activity regularization. Dropout may also be combined with other forms of regularization to yield a further improvement.

One advantage of dropout is that it is very computationally cheap. Using dropout during training requires only O(n) computation per example per update, to generate n random binary numbers and multiply them by the state. Depending on the implementation, it may also require O(n) memory to store these binary numbers until the back-propagation stage. Running inference in the trained model has the same cost per example as if dropout were not used, though we must pay the cost of dividing the weights by 2 once before beginning to run inference on examples.
Another significant advantage of dropout is that it does not significantly limit the type of model or training procedure that can be used. It works well with nearly any model that uses a distributed representation and can be trained with stochastic gradient descent. This includes feedforward neural networks, probabilistic models such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a). Many other regularization strategies of comparable power impose more severe restrictions on the architecture of the model.

Though the cost per step of applying dropout to a specific model is negligible, the cost of using dropout in a complete system can be significant. Because dropout is a regularization technique, it reduces the effective capacity of a model. To offset this effect, we must increase the size of the model. Typically, the optimal validation set error is much lower when using dropout, but this comes at the cost of a much larger model and many more iterations of the training algorithm. For very large datasets, regularization confers little reduction in generalization error. In these cases, the computational cost of using dropout and larger models may outweigh the benefit of regularization.
When extremely few labeled training examples are available, dropout is less effective. Bayesian neural networks (Neal, 1996) outperform dropout on the Alternative Splicing Dataset (Xiong et al., 2011), where fewer than 5,000 examples are available (Srivastava et al., 2014). When additional unlabeled data is available, unsupervised feature learning can gain an advantage over dropout.

Wager et al. (2013) showed that, when applied to linear regression, dropout is equivalent to L2 weight decay, with a different weight decay coefficient for each input feature. The magnitude of each feature's weight decay coefficient is determined by its variance. Similar results hold for other linear models.
For deep models, dropout is not equivalent to weight decay.

The stochasticity used while training with dropout is not necessary for the
approach's success. It is just a means of approximating the sum over all submodels. Wang and Manning (2013) derived analytical approximations to this marginalization. Their approximation, known as fast dropout, resulted in faster convergence time due to the reduced stochasticity in the computation of the gradient. This method can also be applied at test time, as a more principled (but also more computationally expensive) approximation to the average over all subnetworks than the weight scaling approximation. Fast dropout has been used to nearly match the performance of standard dropout on small neural network problems, but it has not yet yielded a significant improvement or been applied to a large problem.

Just as stochasticity is not necessary to achieve the regularizing effect of dropout, it is also not sufficient. To demonstrate this, Warde-Farley et al. (2014) designed control experiments using a method called dropout boosting, which they designed to use exactly the same mask noise as traditional dropout but lack its regularizing effect. Dropout boosting trains the entire ensemble to jointly maximize the log-likelihood on the training set. In the same sense that traditional dropout is analogous to bagging, this approach is analogous to boosting. As intended, experiments with dropout boosting show almost no regularization effect compared to training the entire network as a single model. This demonstrates that the interpretation of dropout as bagging has value beyond the interpretation of dropout as robustness to noise. The regularization effect of the bagged ensemble is only achieved when the stochastically sampled ensemble members are trained to perform well independently of each other.
Dropout has inspired other stochastic approaches to training exponentially large ensembles of models that share weights. DropConnect is a special case of dropout where each product between a single scalar weight and a single hidden unit state is considered a unit that can be dropped (Wan et al., 2013). Stochastic pooling is a form of randomized pooling (see section 9.3) for building ensembles of convolutional networks, with each convolutional network attending to different spatial locations of each feature map. So far, dropout remains the most widely used implicit ensemble method.

One of the key insights of dropout is that training a network with stochastic behavior and making predictions by averaging over multiple stochastic decisions implements a form of bagging with parameter sharing. Earlier, we described dropout as bagging an ensemble of models formed by including or excluding units. Yet this model averaging strategy does not need to be based on inclusion and exclusion. In principle, any kind of random modification is admissible. In practice, we must choose modification families that neural networks are able to learn to resist. Ideally, we should also use model families that allow a fast approximate
inference rule. We can think of any form of modification parametrized by a vector μ as training an ensemble consisting of p(y | x, μ) for all possible values of μ. There is no requirement that μ have a finite number of values. For example, μ can be real-valued. Srivastava et al. (2014) showed that multiplying the weights by μ ∼ N(1, I) can outperform dropout based on binary masks. Because E[μ] = 1, the standard network automatically implements approximate inference in the ensemble, without needing any weight scaling.
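A quick sanity check of the E[μ] = 1 property (our own illustration, not from the text): multiplying an activation by μ ∼ N(1, σ²) leaves its expected value unchanged, which is why no weight scaling is needed at inference time.

```python
import numpy as np

rng = np.random.default_rng(3)

# Multiplicative Gaussian noise mu ~ N(1, sigma^2): since E[mu] = 1, the
# noisy unit has the same expected activation as the clean unit, so the
# unmodified network already performs approximate ensemble inference.
h = 2.5                                             # some hidden unit activation
noisy = h * rng.normal(1.0, 0.5, size=1_000_000)    # one million noisy samples
print(noisy.mean())                                 # close to h
```

Contrast this with binary masks sampled with inclusion probability p, whose mean is p rather than 1, which is what makes the weight scaling correction necessary in that case.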
So far we have described dropout purely as a means of performing efficient, approximate bagging. Another view of dropout goes further than this. Dropout trains not just a bagged ensemble of models, but an ensemble of models that share hidden units. This means each hidden unit must be able to perform well regardless of which other hidden units are in the model. Hidden units must be prepared to be swapped and interchanged between models. Hinton et al. (2012c) were inspired by an idea from biology: sexual reproduction, which involves swapping genes between two different organisms, creates evolutionary pressure for genes to become not just good but readily swapped between different organisms. Such genes and such features are robust to changes in their environment because they are not able to incorrectly adapt to unusual features of any one organism or model. Dropout thus regularizes each hidden unit to be not merely a good feature but a feature that is good in many contexts.
Warde-Farley et al. (2014) compared dropout training to training of large ensembles and concluded that dropout offers additional improvements to generalization error beyond those obtained by ensembles of independent models.

It is important to understand that a large portion of the power of dropout arises from the fact that the masking noise is applied to the hidden units. This can be seen as a form of highly intelligent, adaptive destruction of the information content of the input rather than destruction of the raw values of the input. For example, if the model learns a hidden unit h_i that detects a face by finding the nose, then dropping h_i corresponds to erasing the information that there is a nose in the image. The model must learn another h_i, either one that redundantly encodes the presence of a nose or one that detects the face by another feature, such as the mouth. Traditional noise injection techniques that add unstructured noise at the input are not able to randomly erase the information about a nose from an image of a face unless the magnitude of the noise is so great that nearly all the information in the image is removed. Destroying extracted features rather than original values allows the destruction process to make use of all the knowledge about the input distribution that the model has acquired so far.

Another important aspect of dropout is that the noise is multiplicative. If the
noise were additive with fixed scale, then a rectified linear hidden unit h_i with added noise ε could simply learn to have h_i become very large in order to make the added noise ε insignificant by comparison. Multiplicative noise does not allow such a pathological solution to the noise robustness problem.

Another deep learning algorithm, batch normalization, reparametrizes the model in a way that introduces both additive and multiplicative noise on the hidden units at training time. The primary purpose of batch normalization is to improve optimization, but the noise can have a regularizing effect, and it sometimes makes dropout unnecessary. Batch normalization is described further in section 8.7.1.
7.13 Adversarial Training
In many cases, neural networks have begun to reach human performance when evaluated on an i.i.d. test set. It is natural, therefore, to wonder whether these models have obtained a true human-level understanding of these tasks. To probe the level of understanding a network has of the underlying task, we can search for examples that the model misclassifies. Szegedy et al. (2014b) found that even neural networks that perform at human-level accuracy have a nearly 100 percent error rate on examples that are intentionally constructed by using an optimization procedure to search for an input x′ near a data point x such that the model output is very different at x′. In many cases, x′ can be so similar to x that a
x                          + .007 × sign(∇_x J(θ, x, y))          = x + ε sign(∇_x J(θ, x, y))
“panda”                    “nematode”                              “gibbon”
w/ 57.7% confidence        w/ 8.2% confidence                      w/ 99.3% confidence

Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet's classification of the image. Reproduced with permission from Goodfellow et al. (2014b).
human observer cannot tell the difference between the original example and the adversarial example, but the network can make highly different predictions. See figure 7.8 for an example.
Adversarial examples have many implications, for example, in computer security, that are beyond the scope of this chapter. They are interesting in the context of regularization, however, because one can reduce the error rate on the original i.i.d. test set via adversarial training: training on adversarially perturbed examples from the training set (Szegedy et al., 2014b; Goodfellow et al., 2014b).

Goodfellow et al. (2014b) showed that one of the primary causes of these adversarial examples is excessive linearity. Neural networks are built out of primarily linear building blocks. In some experiments the overall function they implement proves to be highly linear as a result. These linear functions are easy to optimize. Unfortunately, the value of a linear function can change very rapidly if it has numerous inputs. If we change each input by ε, then a linear function with weights w can change by as much as ε‖w‖₁, which can be a very large amount if w is high-dimensional.
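The ε‖w‖₁ bound can be illustrated for a linear model: perturbing the input by ε sign(w), the fast gradient sign direction of Goodfellow et al. (2014b), changes the output by exactly ε‖w‖₁. The dimensions and values below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# For a linear model f(x) = w . x, the perturbation that maximally
# changes the output under an L-infinity budget epsilon is
# eta = epsilon * sign(w): the fast gradient sign direction.
n = 1000
w = rng.normal(size=n)
x = rng.normal(size=n)
epsilon = 0.007

eta = epsilon * np.sign(w)          # imperceptibly small per coordinate
delta = w @ (x + eta) - w @ x       # resulting change in the model output
# delta equals epsilon * ||w||_1, which grows with the dimensionality of w.
```

Although each coordinate moves by only 0.007, with a thousand inputs the output shifts by roughly 0.007 times the L1 norm of the weights, which is why small perturbations can flip the prediction of a high-dimensional nearly linear model.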
Adversarial training discourages this highly sensitive locally linear behavior by encouraging the network to be locally constant in the neighborhood of the training data. This can be seen as a way of explicitly introducing a local constancy prior into supervised neural nets.

Adversarial training helps to illustrate the power of using a large function family in combination with aggressive regularization. Purely linear models, like logistic regression, are not able to resist adversarial examples because they are forced to be linear. Neural networks are able to represent functions that can range from nearly linear to nearly locally constant and thus have the flexibility to capture linear trends in the training data while still learning to resist local perturbation.
Adversarial examples also provide a means of accomplishing semi-supervised learning. At a point x that is not associated with a label in the dataset, the model itself assigns some label ŷ. The model's label ŷ may not be the true label, but if the model is high quality, then ŷ has a high probability of providing the true label. We can seek an adversarial example x′ that causes the classifier to output a label y′ with y′ ≠ ŷ. Adversarial examples generated using not the true label but a label provided by a trained model are called virtual adversarial examples (Miyato et al., 2015). The classifier may then be trained to assign the same label to x and x′. This encourages the classifier to learn a function that is robust to small changes anywhere along the manifold where the unlabeled data lie. The assumption motivating this approach is that different classes usually lie on disconnected manifolds, and a small perturbation should not be able to jump from one class manifold to another class manifold.
7.14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier
Many machine learning algorithms aim to overcome the curse of dimensionality by assuming that the data lies near a low-dimensional manifold, as described in section 5.11.3.

One of the early attempts to take advantage of the manifold hypothesis is the tangent distance algorithm (Simard et al., 1993, 1998). It is a nonparametric nearest neighbor algorithm in which the metric used is not the generic Euclidean distance but one that is derived from knowledge of the manifolds near which probability concentrates. It is assumed that we are trying to classify examples and that examples on the same manifold share the same category. Since the classifier should be invariant to the local factors of variation that correspond to movement on the manifold, it would make sense to use as nearest neighbor distance between points x₁ and x₂ the distance between the manifolds M₁ and M₂ to which they respectively belong. Although that may be computationally difficult (it would require solving an optimization problem to find the nearest pair of points on M₁ and M₂), a cheap alternative that makes sense locally is to approximate Mᵢ by its tangent plane at xᵢ and measure the distance between the two tangents, or between a tangent plane and a point. That can be achieved by solving a low-dimensional linear system (in the dimension of the manifolds). Of course, this algorithm requires one to specify the tangent vectors.
In a related spirit, the tangent prop algorithm (Simard et al., 1992) (figure 7.9) trains a neural net classifier with an extra penalty to make each output f(x) of the neural net locally invariant to known factors of variation. These factors of variation correspond to movement along the manifold near which examples of the same class concentrate. Local invariance is achieved by requiring ∇ₓf(x) to be orthogonal to the known manifold tangent vectors v⁽ⁱ⁾ at x, or equivalently that the directional derivative of f at x in the directions v⁽ⁱ⁾ be small, by adding a regularization penalty Ω:

$$\Omega(f) = \sum_i \left( \left(\nabla_x f(x)\right)^\top v^{(i)} \right)^2. \tag{7.67}$$

This regularizer can of course be scaled by an appropriate hyperparameter, and, for most neural networks, we would need to sum over many outputs rather than the lone output f(x) described here for simplicity. As with the tangent distance algorithm, the tangent vectors are derived a priori, usually from the formal knowledge of the effect of transformations, such as translation, rotation, and scaling in images.
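Equation 7.67 is straightforward to compute once the input gradient is available. The sketch below is our own illustration, not from the text: for a linear output f(x) = wᵀx the input gradient is simply w, which lets us evaluate the penalty directly and confirm that it vanishes when the gradient is orthogonal to every tangent direction:

```python
import numpy as np

rng = np.random.default_rng(5)

def tangent_prop_penalty(grad_f, tangents):
    """Omega(f) = sum_i ((grad_x f(x))^T v^(i))^2  (equation 7.67).
    grad_f: gradient of one network output with respect to the input x.
    tangents: rows are the known manifold tangent vectors v^(i)."""
    return float(((tangents @ grad_f) ** 2).sum())

# For a linear output f(x) = w . x, the input gradient is simply w.
w = rng.normal(size=3)
v = rng.normal(size=(2, 3))            # two tangent directions in R^3
penalty = tangent_prop_penalty(w, v)   # generically positive

# The penalty vanishes exactly when the gradient is orthogonal
# to every tangent direction:
w_orth = np.cross(v[0], v[1])          # orthogonal to both tangents in R^3
zero_penalty = tangent_prop_penalty(w_orth, v)
```

In a real network the gradient would come from automatic differentiation, and the penalty would be added, suitably scaled, to the training objective.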
Normal          Tangent

Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al., 1992) and manifold tangent classifier (Rifai et al., 2011c), which both regularize the classifier output function f(x). Each curve represents the manifold for a different class, illustrated here as a one-dimensional manifold embedded in a two-dimensional space. On one curve, we have chosen a single point and drawn a vector that is tangent to the class manifold (parallel to and touching the manifold) and a vector that is normal to the class manifold (orthogonal to the manifold). In multiple dimensions there may be many tangent directions and many normal directions. We expect the classification function to change rapidly as it moves in the direction normal to the manifold, and not to change as it moves along the class manifold. Both tangent propagation and the manifold tangent classifier regularize f(x) to not change very much as x moves along the manifold.
Tangent propagation requires the user to manually specify functions that compute the tangent directions (such as specifying that small translations of images remain in the same class manifold), while the manifold tangent classifier estimates the manifold tangent directions by training an autoencoder to fit the training data. The use of autoencoders to estimate manifolds is described in chapter 14.
Tangent prop has been used not just for supervised learning (Simard et al., 1992) but also in the context of reinforcement learning (Thrun, 1995).

Tangent propagation is closely related to dataset augmentation. In both cases, the user of the algorithm encodes his or her prior knowledge of the task by specifying a set of transformations that should not alter the output of the network. The difference is that in the case of dataset augmentation, the network is explicitly trained to correctly classify distinct inputs that were created by applying more than an infinitesimal amount of these transformations.
Tangent propagation does not require explicitly visiting a new input point. Instead, it analytically regularizes the model to resist perturbation in the directions corresponding to the specified transformation. While this analytical approach is intellectually elegant, it has two major drawbacks. First, it only regularizes the model to resist infinitesimal perturbation. Explicit dataset augmentation confers resistance to larger perturbations. Second, the infinitesimal approach poses difficulties for models based on rectified linear units. These models can only shrink their derivatives by turning units off or shrinking their weights. They are not able to shrink their derivatives by saturating at a high value with large weights, as sigmoid or tanh units can. Dataset augmentation works well with rectified linear units because different subsets of rectified units can activate for different transformed versions of each original input.
Tangent propagation is also related to double backprop (Drucker and LeCun, 1992) and adversarial training (Szegedy et al., 2014b; Goodfellow et al., 2014b). Double backprop regularizes the Jacobian to be small, while adversarial training finds inputs near the original inputs and trains the model to produce the same output on these as on the original inputs. Tangent propagation and dataset augmentation using manually specified transformations both require that the model should be invariant to certain specified directions of change in the input. Double backprop and adversarial training both require that the model should be invariant to all directions of change in the input as long as the change is small. Just as dataset augmentation is the non-infinitesimal version of tangent propagation, adversarial training is the non-infinitesimal version of double backprop.
The manifold tangent classifier (Rifai et al., 2011c) eliminates the need to know the tangent vectors a priori. As we will see in chapter 14, autoencoders can estimate the manifold tangent vectors. The manifold tangent classifier makes use of this technique to avoid needing user-specified tangent vectors. As illustrated in figure 14.10, these estimated tangent vectors go beyond the classical invariants that arise out of the geometry of images (such as translation, rotation, and scaling) and include factors that must be learned because they are object-specific (such as
moving body parts). The algorithm proposed with the manifold tangent classifier is therefore simple: (1) use an autoencoder to learn the manifold structure by unsupervised learning, and (2) use these tangents to regularize a neural net classifier as in tangent prop (equation 7.67).

In this chapter, we have described most of the general strategies used to regularize neural networks. Regularization is a central theme of machine learning and as such will be revisited periodically in most of the remaining chapters. Another central theme of machine learning is optimization, described next.