Chapter 8

Optimization for Training Deep Models
Deep learning algorithms involve optimization in many contexts. For example, performing inference in models such as PCA involves solving an optimization problem. We often use analytical optimization to write proofs or design algorithms. Of all the many optimization problems involved in deep learning, the most difficult is neural network training. It is quite common to invest days to months of time on hundreds of machines to solve even a single instance of the neural network training problem. Because this problem is so important and so expensive, a specialized set of optimization techniques have been developed for solving it. This chapter presents these optimization techniques for neural network training.
If you are unfamiliar with the basic principles of gradient-based optimization, we suggest reviewing chapter 4. That chapter includes a brief overview of numerical optimization in general.
This chapter focuses on one particular case of optimization: finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.
We begin with a description of how optimization used as a training algorithm for a machine learning task differs from pure optimization. Next, we present several of the concrete challenges that make optimization of neural networks difficult. We then define several practical algorithms, including both optimization algorithms themselves and strategies for initializing the parameters. More advanced algorithms adapt their learning rates during training or leverage information contained in the second derivatives of the cost function. Finally, we conclude with a review of several optimization strategies that are formed by combining simple optimization algorithms into higher-level procedures.
8.1 How Learning Differs from Pure Optimization
Optimization algorithms used for training of deep models differ from traditional optimization algorithms in several ways.
Machine learning usually acts indirectly. In most machine learning scenarios, we care about some performance measure P, which is defined with respect to the test set and may also be intractable. We therefore optimize P only indirectly: we reduce a different cost function J(θ) in the hope that doing so will improve P. This is in contrast to pure optimization, where minimizing J is a goal in and of itself.
Optimization algorithms for training deep models also typically include some specialization on the specific structure of machine learning objective functions. Typically, the cost function can be written as an average over the training set, such as

    J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y),        (8.1)
where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution. In the supervised learning case, y is the target output.
Throughout this chapter, we develop the unregularized supervised case, where the arguments to L are f(x; θ) and y. It is trivial to extend this development, for example, to include θ or x as arguments, or to exclude y as an argument, to develop various forms of regularization or unsupervised learning.
Equation 8.1 defines an objective function with respect to the training set. We would usually prefer to minimize the corresponding objective function where the expectation is taken across the data-generating distribution p_data rather than just over the finite training set:

    J*(θ) = E_{(x,y)∼p_data} L(f(x; θ), y).        (8.2)
8.1.1 Empirical Risk Minimization
The goal of a machine learning algorithm is to reduce the expected generalization error given by equation 8.2. This quantity is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution p_data. If we knew the true distribution p_data(x, y), risk minimization would be an optimization task solvable by an optimization algorithm. When we do not know p_data(x, y) but only have a training set of samples, however, we have a machine learning problem.
The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk

    E_{(x,y)∼p̂_data}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)),        (8.3)

where m is the number of training examples. The training process based on minimizing this average training error is known as empirical risk minimization.
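As a concrete illustration (not from the original text), the following minimal NumPy sketch evaluates the empirical risk of equation 8.3 for a hypothetical linear model f(x; θ) = θᵀx under squared-error loss; the model, loss, and toy data are assumptions made purely for this example.

```python
import numpy as np

def empirical_risk(theta, X, y, loss):
    """Average per-example loss over the training set (equation 8.3)."""
    predictions = X @ theta  # a toy linear model f(x; theta) = theta^T x
    return np.mean([loss(p, t) for p, t in zip(predictions, y)])

squared_error = lambda pred, target: 0.5 * (pred - target) ** 2

# toy data: m = 4 examples, 2 features (purely illustrative)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
theta = np.array([1.0, 2.0])
print(empirical_risk(theta, X, y, squared_error))
```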
In this setting, machine learning is still very similar to straightforward optimization. Rather than optimizing the risk directly, we optimize the empirical risk and hope that the risk decreases significantly as well. A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts. Nonetheless, empirical risk minimization is prone to overfitting. Models with high capacity can simply memorize the training set.
In many cases, empirical risk minimization is not really feasible. The most effective modern optimization algorithms are based on gradient descent, but many useful loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere). These two problems mean that, in the context of deep learning, we rarely use empirical risk minimization. Instead, we must use a slightly different approach, in which the quantity that we actually optimize is even more different from the quantity that we truly want to optimize.
8.1.2 Surrogate Loss Functions and Early Stopping
Sometimes, the loss function we actually care about (say, classification error) is not one that can be optimized efficiently. For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a linear classifier (Marcotte and Savard, 1992). In such situations, one typically optimizes a surrogate loss function instead, which acts as a proxy but has advantages.
For example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss. The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if the model can do that well, then it can pick the classes that yield the least classification error in expectation.
In some cases, a surrogate loss function actually results in being able to learn more. For example, the test set 0-1 loss often continues to decrease for a long time after the training set 0-1 loss has reached zero, when training using the log-likelihood surrogate. This is because even when the expected 0-1 loss is zero, one can improve the robustness of the classifier by further pushing the classes apart from each other, obtaining a more confident and reliable classifier, thus extracting more information from the training data than would have been possible by simply minimizing the average 0-1 loss on the training set.
A very important difference between optimization in general and optimization as we use it for training algorithms is that training algorithms do not usually halt at a local minimum. Instead, a machine learning algorithm usually minimizes a surrogate loss function but halts when a convergence criterion based on early stopping (section 7.8) is satisfied. Typically, the early stopping criterion is based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins to occur. Training often halts while the surrogate loss function still has large derivatives, which is very different from the pure optimization setting, where an optimization algorithm is considered to have converged when the gradient becomes very small.
8.1.3 Batch and Minibatch Algorithms
One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples. Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value of the cost function estimated using only a subset of the terms of the full cost function.
For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over each example:

    θ_ML = arg max_θ Σ_{i=1}^{m} log p_model(x^(i), y^(i); θ).        (8.4)

Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:

    J(θ) = E_{(x,y)∼p̂_data} log p_model(x, y; θ).        (8.5)
Most of the properties of the objective function J used by most of our optimization algorithms are also expectations over the training set. For example, the most commonly used property is the gradient:

    ∇_θ J(θ) = E_{(x,y)∼p̂_data} ∇_θ log p_model(x, y; θ).        (8.6)
Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset. In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset, then taking the average over only those examples.
Recall that the standard error of the mean (equation 5.46) estimated from n samples is given by σ/√n, where σ is the true standard deviation of the value of the samples. The denominator of √n shows that there are less than linear returns to using more examples to estimate the gradient.
Compare two hypothetical estimates of the gradient, one based on 100 examples and another based on 10,000 examples. The latter requires 100 times more computation than the former but reduces the standard error of the mean only by a factor of 10.
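The following short sketch (added here for illustration, not part of the original text) checks this scaling numerically by estimating the spread of a sample mean for batches of 100 and 10,000 draws.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # true standard deviation of the per-example quantity

for n in (100, 10_000):
    # repeat the estimate many times to measure its spread empirically
    estimates = [rng.normal(0.0, sigma, size=n).mean() for _ in range(2_000)]
    print(n, np.std(estimates), sigma / np.sqrt(n))
# the measured spread tracks sigma / sqrt(n): 100x more samples, only 10x less error
```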
Most optimization algorithms converge much faster (in terms of total computation, not in terms of number of updates) if they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient.
Another consideration motivating statistical estimation of the gradient from a small number of samples is redundancy in the training set. In the worst case, all m samples in the training set could be identical copies of each other. A sampling-based estimate of the gradient could compute the correct gradient with a single sample, using m times less computation than the naive approach. In practice, we are unlikely to encounter this worst-case situation, but we may find large numbers of examples that all make very similar contributions to the gradient.
Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they process all the training examples simultaneously in a large batch. This terminology can be somewhat confusing because the word “batch” is also often used to describe the minibatch used by minibatch stochastic gradient descent. Typically the term “batch gradient descent” implies the use of the full training set, while the use of the term “batch” to describe a group of examples does not.
For example, it is common to use the term “batch size” to describe the size of a minibatch.
Optimization algorithms that use only a single example at a time are sometimes called stochastic and sometimes online methods. The term “online” is usually reserved for when the examples are drawn from a stream of continually created examples rather than from a fixed-size training set over which several passes are made.
Most algorithms used for deep learning fall somewhere in between, using more than one but fewer than all the training examples. These were traditionally called minibatch or minibatch stochastic methods, and it is now common to call them simply stochastic methods. The canonical example of a stochastic method is stochastic gradient descent, presented in detail in section 8.3.1.
Minibatch sizes are generally driven by the following factors:
- Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
- Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
- If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
- Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
- Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
Different kinds of algorithms use different kinds of information from the minibatch in various ways. Some algorithms are more sensitive to sampling error than others, either because they use information that is difficult to estimate accurately with few samples, or because they use information in ways that amplify sampling errors more. Methods that compute updates based only on the gradient g are usually relatively robust and can handle smaller batch sizes, like 100.
Second-order methods, which also use the Hessian matrix H and compute updates such as H⁻¹g, typically require much larger batch sizes, like 10,000. These large batch sizes are required to minimize fluctuations in the estimates of H⁻¹g. Suppose that H is estimated perfectly but has a poor condition number. Multiplication by H or its inverse amplifies pre-existing errors, in this case, estimation errors in g. Very small changes in the estimate of g can thus cause large changes in the update H⁻¹g, even if H were estimated perfectly. Of course, g is estimated only approximately, so the update H⁻¹g will contain even more error than we would predict from applying a poorly conditioned operation to the estimate of g.
It is also crucial that the minibatches be selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires that those samples be independent. We also wish for two subsequent gradient estimates to be independent from each other, so two subsequent minibatches of examples should also be independent from each other.
Many datasets are most naturally arranged in a way where successive examples are highly correlated. For example, we might have a dataset of medical data with a long list of blood sample test results. This list might be arranged so that first we have five blood samples taken at different times from the first patient, then we have three blood samples taken from the second patient, then the blood samples from the third patient, and so on.
If we were to draw examples in order from this list, then each of our minibatches would be extremely biased, because it would represent primarily one patient out of the many patients in the dataset. In cases such as these, where the order of the dataset holds some significance, it is necessary to shuffle the examples before selecting minibatches.
For very large datasets, for example, datasets containing billions of examples in a data center, it can be impractical to sample examples truly uniformly at random every time we want to construct a minibatch. Fortunately, in practice it is usually sufficient to shuffle the order of the dataset once and then store it in shuffled fashion. This will impose a fixed set of possible minibatches of consecutive examples that all models trained thereafter will use, and each individual model will be forced to reuse this ordering every time it passes through the training data. This deviation from true random selection does not seem to have a significant detrimental effect. Failing to ever shuffle the examples in any way can seriously reduce the effectiveness of the algorithm.
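As a concrete illustration of the shuffle-once strategy just described (an assumption-laden sketch, not code from the book), the following NumPy snippet shuffles a dataset a single time and then iterates over consecutive fixed minibatches on every pass.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle once, then yield consecutive minibatches on every pass."""
    perm = rng.permutation(len(X))      # one global shuffle, stored up front
    X, y = X[perm], y[perm]
    while True:                         # each loop over the data is one epoch
        for start in range(0, len(X), batch_size):
            yield X[start:start + batch_size], y[start:start + batch_size]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
batches = minibatches(X, y, batch_size=100, rng=rng)
xb, yb = next(batches)   # first minibatch of the first epoch
```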
Many optimization problems in machine learning decompose over examples well enough that we can compute entire separate updates over different examples in parallel. In other words, we can compute the update that minimizes J(X) for one minibatch of examples X at the same time that we compute the update for several other minibatches. Such asynchronous parallel distributed approaches are discussed further in section 12.1.3.
An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error (equation 8.2) as long as no examples are repeated. Most implementations of minibatch stochastic gradient descent shuffle the dataset once and then pass through it multiple times.
On the first pass, each minibatch is used to compute an unbiased estimate of the true generalization error. On the second pass, the estimate becomes biased because it is formed by resampling values that have already been used, rather than obtaining new fair samples from the data-generating distribution.
The fact that stochastic gradient descent minimizes generalization error is easiest to see in the online learning case, where examples or minibatches are drawn from a stream of data. In other words, instead of receiving a fixed-size training set, the learner is similar to a living being who sees a new example at each instant, with every example (x, y) coming from the data-generating distribution p_data(x, y). In this scenario, examples are never repeated; every experience is a fair sample from p_data.
The equivalence is easiest to derive when both x and y are discrete. In this case, the generalization error (equation 8.2) can be written as a sum,

    J*(θ) = Σ_x Σ_y p_data(x, y) L(f(x; θ), y),        (8.7)

with the exact gradient

    g = ∇_θ J*(θ) = Σ_x Σ_y p_data(x, y) ∇_θ L(f(x; θ), y).        (8.8)
We have already seen the same fact demonstrated for the log-likelihood in equation 8.5 and equation 8.6; we observe now that this holds for other functions L besides the likelihood. A similar result can be derived when x and y are continuous, under mild assumptions regarding p_data and L.
Hence, we can obtain an unbiased estimator of the exact gradient of the generalization error by sampling a minibatch of examples {x^(1), ..., x^(m)} with corresponding targets y^(i) from the data-generating distribution p_data, then computing the gradient of the loss with respect to the parameters for that minibatch:

    ĝ = (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i)).        (8.9)

Updating θ in the direction of −ĝ performs SGD on the generalization error.
Of course, this interpretation applies only when examples are not reused. Nonetheless, it is usually best to make several passes through the training set, unless the training set is extremely large. When multiple such epochs are used, only the first epoch follows the unbiased gradient of the generalization error, but of course, the additional epochs usually provide enough benefit due to decreased training error to offset the harm they cause by increasing the gap between training error and test error.
With some datasets growing rapidly in size, faster than computing power, it is becoming more common for machine learning applications to use each training example only once or even to make an incomplete pass through the training set. When using an extremely large training set, overfitting is not an issue, so underfitting and computational efficiency become the predominant concerns. See also Bottou and Bousquet (2008) for a discussion of the effect of computational bottlenecks on generalization error, as the number of training examples grows.
8.2 Challenges in Neural Network Optimization
Optimization in general is an extremely difficult task. Traditionally, machine learning has avoided the difficulty of general optimization by carefully designing the objective function and constraints to ensure that the optimization problem is convex. When training neural networks, we must confront the general nonconvex case. Even convex optimization is not without its complications. In this section, we summarize several of the most prominent challenges involved in optimization for training deep models.
8.2.1 Ill-Conditioning
Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix H. This is a very general problem in most numerical optimization, convex or otherwise, and is described in more detail in section 4.3.1. The ill-conditioning problem is generally believed to be present in neural network training problems.
Ill-conditioning can manifest by causing SGD to get “stuck” in the sense that even very small steps increase the cost function. Recall from equation 4.9 that a second-order Taylor series expansion of the cost function predicts that a gradient descent step of −εg will add

    (1/2) ε² gᵀHg − ε gᵀg        (8.10)

to the cost. Ill-conditioning of the gradient becomes a problem when (1/2) ε² gᵀHg exceeds ε gᵀg.
[Figure 8.1 plots: gradient norm (left) and classification error rate (right) versus training time in epochs.]
Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In this example, the gradient norm increases throughout training of a convolutional network used for object detection. (Left) A scatterplot showing how the norms of individual gradient evaluations are distributed over time. To improve legibility, only one gradient norm is plotted per epoch. The running average of all gradient norms is plotted as a solid curve. The gradient norm clearly increases over time, rather than decreasing as we would expect if the training process converged to a critical point. (Right) Despite the increasing gradient, the training process is reasonably successful. The validation set classification error decreases to a low level.
To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm gᵀg and the gᵀHg term. In many cases, the gradient norm does not shrink significantly throughout learning, but the gᵀHg term grows by more than an order of magnitude. The result is that learning becomes very slow despite the presence of a strong gradient because the learning rate must be shrunk to compensate for even stronger curvature.
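One way to make this monitoring concrete (a hypothetical sketch, not code from the book; the Hessian-vector product is approximated by finite differences and the toy cost is an arbitrary assumption) is to track gᵀg and gᵀHg during training:

```python
import numpy as np

def curvature_terms(grad_fn, theta, eps=1e-4):
    """Return g^T g and g^T H g, using a finite-difference Hessian-vector product."""
    g = grad_fn(theta)
    # H g is approximately (grad(theta + eps*g) - grad(theta - eps*g)) / (2*eps)
    Hg = (grad_fn(theta + eps * g) - grad_fn(theta - eps * g)) / (2 * eps)
    return g @ g, g @ Hg

# toy quadratic cost J(theta) = 0.5 * theta^T A theta with an ill-conditioned A
A = np.diag([1.0, 100.0])
grad_fn = lambda theta: A @ theta
print(curvature_terms(grad_fn, np.array([1.0, 1.0])))
```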
Figure 8.1 shows an example of the gradient increasing significantly during the successful training of a neural network.
Though ill-conditioning is present in other settings besides neural network training, some of the techniques used to combat it in other contexts are less applicable to neural networks. For example, Newton’s method is an excellent tool for minimizing convex functions with poorly conditioned Hessian matrices, but as we argue in subsequent sections, Newton’s method requires significant modification before it can be applied to neural networks.
8.2.2 Local Minima
One of the most prominent features of a convex optimization problem is that it can be reduced to the problem of finding a local minimum. Any local minimum is guaranteed to be a global minimum. Some convex functions have a flat region at the bottom rather than a single global minimum point, but any point within such a flat region is an acceptable solution.
When optimizing a convex function, we know that we have reached a good solution if we find a critical point of any kind. With nonconvex functions, such as neural nets, it is possible to have many local minima. Indeed, nearly any deep model is essentially guaranteed to have an extremely large number of local minima. As we will see, however, this is not necessarily a major problem.
Neural networks and any models with multiple equivalently parametrized latent variables all have multiple local minima because of the model identifiability problem. A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model’s parameters. Models with latent variables are often not identifiable because we can obtain equivalent models by exchanging latent variables with each other.
For example, we could take a neural network and modify layer 1 by swapping the incoming weight vector for unit i with the incoming weight vector for unit j, then do the same for the outgoing weight vectors. If we have m layers with n units each, then there are n!^m ways of arranging the hidden units. This kind of nonidentifiability is known as weight space symmetry.
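To make weight space symmetry tangible, the short sketch below (an illustration with an assumed two-layer tanh network, not code from the book) permutes the hidden units of a small network and checks that the output is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # hidden layer: 4 units, 3 inputs
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # output layer

def net(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

perm = rng.permutation(4)                               # any permutation of the hidden units
x = rng.normal(size=3)
# permuting incoming weights/biases and outgoing weights together leaves f(x) unchanged
print(np.allclose(net(x, W1, b1, W2, b2),
                  net(x, W1[perm], b1[perm], W2[:, perm], b2)))
```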
In addition to weight space symmetry, many kinds of neural networks have additional causes of nonidentifiability. For example, in any rectified linear or maxout network, we can scale all the incoming weights and biases of a unit by α if we also scale all its outgoing weights by 1/α. This means that, if the cost function does not include terms such as weight decay that depend directly on the weights rather than the models’ outputs, every local minimum of a rectified linear or maxout network lies on an (m × n)-dimensional hyperbola of equivalent local minima.
These model identifiability issues mean that a neural network cost function can have an extremely large or even uncountably infinite amount of local minima. However, all these local minima arising from nonidentifiability are equivalent to each other in cost function value. As a result, these local minima are not a problematic form of nonconvexity.
Local minima can be problematic if they have high cost in comparison to the global minimum. One can construct small neural networks, even without hidden units, that have local minima with higher cost than the global minimum (Sontag and Sussman, 1989; Brady et al., 1989; Gori and Tesi, 1992).
If local minima with high cost are common, this could pose a serious problem for gradient-based optimization algorithms. Whether networks of practical interest have many local minima of high cost and whether optimization algorithms encounter them remain open questions.
For many years, most practitioners believed that local minima were a common problem plaguing neural network optimization. Today, that does not appear to be the case. The problem remains an active area of research, but experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value, and that it is not important to find a true global minimum rather than to find a point in parameter space that has low but not minimal cost (Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska et al., 2014).
Many practitioners attribute nearly all difficulty with neural network optimization to local minima. We encourage practitioners to carefully test for specific problems. A test that can rule out local minima as the problem is plotting the norm of the gradient over time. If the norm of the gradient does not shrink to insignificant size, the problem is neither local minima nor any other kind of critical point. In high-dimensional spaces, positively establishing that local minima are the problem can be very difficult. Many structures other than local minima also have small gradients.
8.2.3 Plateaus, Saddle Points and Other Flat Regions
For many high-dimensional nonconvex functions, local minima (and maxima) are in fact rare compared to another kind of point with zero gradient: a saddle point. Some points around a saddle point have greater cost than the saddle point, while others have a lower cost. At a saddle point, the Hessian matrix has both positive and negative eigenvalues. Points lying along eigenvectors associated with positive eigenvalues have greater cost than the saddle point, while points lying along negative eigenvalues have lower value. We can think of a saddle point as being a local minimum along one cross-section of the cost function and a local maximum along another cross-section. See figure 4.5 for an illustration.
Many classes of random functions exhibit the following behavior: in low-dimensional spaces, local minima are common. In higher-dimensional spaces, local minima are rare, and saddle points are more common. For a function of this type, the expected ratio of the number of saddle points to local minima grows exponentially with the dimensionality n. To understand the intuition behind this behavior, observe that the Hessian matrix at a local minimum has only positive eigenvalues.
The Hessian matrix at a saddle point has a mixture of positive and negative eigenvalues. Imagine that the sign of each eigenvalue is generated by flipping a coin. In a single dimension, it is easy to obtain a local minimum by tossing a coin and getting heads once. In n-dimensional space, it is exponentially unlikely that all n coin tosses will be heads. See Dauphin et al. (2014) for a review of the relevant theoretical work.
An amazing property of many random functions is that the eigenvalues of the Hessian become more likely to be positive as we reach regions of lower cost. In our coin tossing analogy, this means we are more likely to have our coin come up heads n times if we are at a critical point with low cost. It also means that local minima are much more likely to have low cost than high cost. Critical points with high cost are far more likely to be saddle points. Critical points with extremely high cost are more likely to be local maxima. This happens for many classes of random functions. Does it happen for neural networks?
Baldi and Hornik (1989) showed theoretically that shallow autoencoders (feedforward networks trained to copy their input to their output, described in chapter 14) with no nonlinearities have global minima and saddle points but no local minima with higher cost than the global minimum. They observed without proof that these results extend to deeper networks without nonlinearities.
The output of such networks is a linear function of their input, but they are useful to study as a model of nonlinear neural networks because their loss function is a nonconvex function of their parameters. Such networks are essentially just multiple matrices composed together.
Saxe et al. (2013) provided exact solutions to the complete learning dynamics in such networks and showed that learning in these models captures many of the qualitative features observed in the training of deep models with nonlinear activation functions.
Dauphin et al. (2014) showed experimentally that real neural networks also have loss functions that contain very many high-cost saddle points. Choromanska et al. (2014) provided additional theoretical arguments, showing that another class of high-dimensional random functions related to neural networks does so as well.
What are the implications of the proliferation of saddle points for training algorithms? For first-order optimization, algorithms that use only gradient information, the situation is unclear. The gradient can often become very small near a saddle point. On the other hand, gradient descent empirically seems able to escape saddle points in many cases. Goodfellow et al. (2015) provided visualizations of several learning trajectories of state-of-the-art neural networks, with an example given in figure 8.2. These visualizations show a flattening of the cost function near a prominent saddle point, where the weights are all zero, but they also show the gradient descent trajectory rapidly escaping this region.
[Figure 8.2 axes: two projections of θ and the cost J(θ).]
Figure 8.2: A visualization of the cost function of a neural network. These visualizations appear similar for feedforward neural networks, convolutional networks, and recurrent networks applied to real object recognition and natural language processing tasks. Surprisingly, these visualizations usually do not show many conspicuous obstacles. Prior to the success of stochastic gradient descent for training very large models beginning in roughly 2012, neural net cost function surfaces were generally believed to have much more nonconvex structure than is revealed by these projections. The primary obstacle revealed by this projection is a saddle point of high cost near where the parameters are initialized, but, as indicated by the blue path, the SGD training trajectory escapes this saddle point readily. Most of training time is spent traversing the relatively flat valley of the cost function, perhaps because of high noise in the gradient, or poor conditioning of the Hessian matrix in this region, or simply the need to circumnavigate the tall “mountain” visible in the figure via an indirect arcing path. Image adapted with permission from Goodfellow et al. (2015).
Goodfellow et al. (2015) also argue that continuous-time gradient descent may be shown analytically to be repelled from, rather than attracted to, a nearby saddle point, but the situation may be different for more realistic uses of gradient descent.
For Newton’s method, saddle points clearly constitute a problem. Gradient descent is designed to move “downhill” and is not explicitly designed to seek a critical point. Newton’s method, however, is designed to solve for a point where the gradient is zero. Without appropriate modification, it can jump to a saddle point.
The proliferation of saddle points in high-dimensional spaces presumably explains why second-order methods have not succeeded in replacing gradient descent for neural network training. Dauphin et al. (2014) introduced a saddle-free Newton method for second-order optimization and showed that it improves significantly over the traditional version. Second-order methods remain difficult to scale to large neural networks, but this saddle-free approach holds promise if it can be scaled.

Figure 8.3: The objective function for highly nonlinear deep neural networks or for recurrent neural networks often contains sharp nonlinearities in parameter space resulting from the multiplication of several parameters. These nonlinearities give rise to very high derivatives in some places. When the parameters get close to such a cliff region, a gradient descent update can catapult the parameters very far, possibly losing most of the optimization work that has been done. Figure adapted with permission from Pascanu et al. (2013).
There are other kinds of points with zero gradient besides minima and saddle points. Maxima are much like saddle points from the perspective of optimization: many algorithms are not attracted to them, but unmodified Newton’s method is. Maxima of many classes of random functions become exponentially rare in high-dimensional space, just as minima do.
There may also be wide, flat regions of constant value. In these locations, the gradient and the Hessian are all zero. Such degenerate locations pose major problems for all numerical optimization algorithms. In a convex problem, a wide, flat region must consist entirely of global minima, but in a general optimization problem, such a region could correspond to a high value of the objective function.
8.2.4 Cliffs and Exploding Gradients
Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in figure 8.3. These result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off the cliff structure altogether. The cliff can be dangerous whether we approach it from above or from below,
but fortunately its most serious consequences can be avoided using the gradient clipping heuristic described in section 10.11.1. The basic idea is to recall that the gradient specifies not the optimal step size, but only the optimal direction within an infinitesimal region. When the traditional gradient descent algorithm proposes making a very large step, the gradient clipping heuristic intervenes to reduce the step size, making it less likely to go outside the region where the gradient indicates the direction of approximately steepest descent.
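A minimal sketch of norm-based clipping, one common variant of this heuristic (the threshold value here is an arbitrary assumption, and section 10.11.1 discusses the heuristic in detail), looks like this:

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so that its norm never exceeds threshold (norm clipping)."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)   # keep the direction, shrink the step
    return g

g = np.array([30.0, -40.0])             # a "cliff" gradient with norm 50
print(clip_gradient(g, threshold=5.0))  # same direction, norm reduced to 5
```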
Cliff structures are most common in the cost functions for recurrent neural networks, because such models involve a multiplication of many factors, with one factor for each time step. Long temporal sequences thus incur an extreme amount of multiplication.
8.2.5 Long-Term Dependencies
Another difficulty that neural network optimization algorithms must overcome arises when the computational graph becomes extremely deep. Feedforward networks with many layers have such deep computational graphs. So do recurrent networks, described in chapter 10, which construct very deep computational graphs by repeatedly applying the same operation at each time step of a long temporal sequence. Repeated application of the same parameters gives rise to especially pronounced difficulties.
For example, suppose that a computational graph contains a path that consists of repeatedly multiplying by a matrix W. After t steps, this is equivalent to multiplying by W^t. Suppose that W has an eigendecomposition W = V diag(λ) V⁻¹. In this simple case, it is straightforward to see that

    W^t = (V diag(λ) V⁻¹)^t = V diag(λ)^t V⁻¹.        (8.11)

Any eigenvalues λ_i that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude.
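The following toy sketch (added for illustration; the particular matrix is an arbitrary assumption) shows how repeated multiplication by the same matrix makes components grow or shrink according to the magnitude of the eigenvalues.

```python
import numpy as np

# diagonal W for clarity: eigenvalues 1.1 (explodes) and 0.9 (vanishes)
W = np.diag([1.1, 0.9])
h = np.array([1.0, 1.0])
for t in (10, 50, 100):
    print(t, np.linalg.matrix_power(W, t) @ h)
# the first component grows like 1.1**t, the second shrinks like 0.9**t
```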
The vanishing and exploding gradient problem refers to the fact that gradients through such a graph are also scaled according to diag(λ)^t. Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable. The cliff structures described earlier that motivate gradient clipping are an example of the exploding gradient phenomenon.
The repeated multiplication by W at each time step described here is very similar to the power method algorithm used to find the largest eigenvalue of a matrix W and the corresponding eigenvector. From this point of view, it is not surprising that xᵀW^t will eventually discard all components of x that are orthogonal to the principal eigenvector of W.
Recurrent networks use the same matrix W at each time step, but feedforward networks do not, so even very deep feedforward networks can largely avoid the vanishing and exploding gradient problem (Sussillo, 2014).
We defer further discussion of the challenges of training recurrent networks until section 10.7, after recurrent networks have been described in more detail.
8.2.6 Inexact Gradients
Most optimization algorithms are designed with the assumption that we have access to the exact gradient or Hessian matrix. In practice, we usually have only a noisy or even biased estimate of these quantities. Nearly every deep learning algorithm relies on sampling-based estimates, at least insofar as using a minibatch of training examples to compute the gradient.
In other cases, the objective function we want to minimize is actually intractable. When the objective function is intractable, typically its gradient is intractable as well. In such cases we can only approximate the gradient.
These issues mostly arise with the more advanced models covered in part III. For example, contrastive divergence gives a technique for approximating the gradient of the intractable log-likelihood of a Boltzmann machine. Various neural network optimization algorithms are designed to account for imperfections in the gradient estimate. One can also avoid the problem by choosing a surrogate loss function that is easier to approximate than the true loss.
8.2.7 Poor Correspondence between Local and Global Structure
Many of the problems we have discussed so far correspond to properties of the loss function at a single point: it can be difficult to make a single step if J(θ) is poorly conditioned at the current point, or if θ lies on a cliff, or if θ is a saddle point hiding the opportunity to make progress downhill from the gradient. It is possible to overcome all these problems at a single point and still perform poorly if the direction that results in the most improvement locally does not point toward distant regions of much lower cost.
Goodfellow et al. (2015) argue that much of the runtime of training is due to the length of the trajectory needed to arrive at the solution. Figure 8.2 shows that the learning trajectory spends most of its time tracing out a wide arc around a mountain-shaped structure.
Figure 8.4: Optimization based on local downhill moves can fail if the local surface does not point toward the global solution. Here we provide an example of how this can occur, even if there are no saddle points or local minima. This example cost function contains only asymptotes toward low values, not minima. The main cause of difficulty in this case is being initialized on the wrong side of the “mountain” and not being able to traverse it. In higher-dimensional space, learning algorithms can often circumnavigate such mountains, but the trajectory associated with doing so may be long and result in excessive training time, as illustrated in figure 8.2.
Much of the research into the difficulties of optimization has focused on whether training arrives at a global minimum, a local minimum, or a saddle point, but in practice, neural networks do not arrive at a critical point of any kind. Figure 8.1 shows that neural networks often do not arrive at a region of small gradient. Indeed, such critical points do not even necessarily exist. For example, the loss function −log p(y | x; θ) can lack a global minimum point and instead asymptotically approach some value as the model becomes more confident. For a classifier with discrete y and p(y | x) provided by a softmax, the negative log-likelihood can become arbitrarily close to zero if the model is able to correctly classify every example in the training set, but it is impossible to actually reach the value of zero.
Likewise, a model of real values p(y | x) = N(y; f(θ), β⁻¹) can have negative log-likelihood that asymptotes to negative infinity: if f(θ) is able to correctly predict the value of all training set targets, the learning algorithm will increase β without bound. See figure 8.4 for an example of a failure of local optimization to find a good cost function value even in the absence of any local minima or saddle points.
Future research will need to develop further understanding of the factors that influence the length of the learning trajectory and better characterize the outcome of the process.
Many existing research directions are aimed at finding good initial points for problems that have difficult global structure, rather than at developing algorithms that use nonlocal moves.
Gradient descent and essentially all learning algorithms that are effective for training neural networks are based on making small local moves. The previous sections have primarily focused on how the correct direction of these local moves can be difficult to compute.
We may be able to compute some properties of the objective function, such as its gradient, only approximately, with bias or variance in our estimate of the correct direction. In these cases, local descent may or may not define a reasonably short path to a valid solution, but we are not actually able to follow the local descent path.
The objective function may have issues, such as poor conditioning or discontinuous gradients, causing the region where the gradient provides a good model of the objective function to be very small. In these cases, local descent with steps of size ε may define a reasonably short path to the solution, but we are only able to compute the local descent direction with steps of size δ much smaller than ε. In these cases, local descent may define a path to the solution, but the path contains many steps, so following it incurs a high computational cost.
Sometimes local information provides us no guide, such as when the function has a wide flat region, or if we manage to land exactly on a critical point (usually this latter scenario only happens to methods that solve explicitly for critical points, such as Newton’s method). In these cases, local descent does not define a path to a solution at all.
In other cases, local moves can be too greedy and lead us along a path that moves downhill but away from any solution, as in figure 8.4, or along an unnecessarily long trajectory to the solution, as in figure 8.2.
Currently, we do not understand which of these problems are most relevant to making neural network optimization difficult, and this is an active area of research.
Regardless of which of these problems are most significant, all of them might be avoided if there exists a region of space connected reasonably directly to a solution by a path that local descent can follow, and if we are able to initialize learning within that well-behaved region. This last view suggests research into choosing good initial points for traditional optimization algorithms to use.
8.2.8 Theoretical Limits of Optimization
Several theoretical results show that there are limits on the performance of any optimization algorithm we might design for neural networks (Blum and Rivest, 1992; Judd, 1989; Wolpert and MacReady, 1997). Typically these results have little bearing on the use of neural networks in practice.
Some theoretical results apply only when the units of a neural network output discrete values. Most neural network units output smoothly increasing values that make optimization via local search feasible. Some theoretical results show that there exist problem classes that are intractable, but it can be difficult to tell whether a particular problem falls into that class.
Other results show that finding a solution for a network of a given size is intractable, but in practice we can find a solution easily by using a larger network for which many more parameter settings correspond to an acceptable solution.
Moreover, in the context of neural network training, we usually do not care about finding the exact minimum of a function, but seek only to reduce its value sufficiently to obtain good generalization error. Theoretical analysis of whether an optimization algorithm can accomplish this goal is extremely difficult. Developing more realistic bounds on the performance of optimization algorithms therefore remains an important goal for machine learning research.
8.3 Basic Algorithms
We have previously introduced the gradient descent (section 4.3) algorithm that follows the gradient of an entire training set downhill. This may be accelerated considerably by using stochastic gradient descent to follow the gradient of randomly selected minibatches downhill, as discussed in section 5.9 and section 8.1.3.
8.3.1 Stochastic Gradient Descent
Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular. As discussed in section 8.1.3, it is possible to obtain an unbiased estimate of the gradient by taking the average gradient on a minibatch of m examples drawn i.i.d. from the data-generating distribution. Algorithm 8.1 shows how to follow this estimate of the gradient downhill.
A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described SGD as using a fixed learning rate ε. In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at iteration k as ε_k. This is because the SGD gradient estimator introduces a source of noise (the random sampling of m training examples) that does not vanish even when we arrive at a minimum. By comparison, the true gradient of the total cost function becomes small and then 0 when we approach and reach a minimum using batch gradient descent, so batch gradient descent can use a fixed learning rate.
Algorithm 8.1 Stochastic gradient descent (SGD) update

Require: Learning rate schedule ε_1, ε_2, ...
Require: Initial parameter θ
  while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: ĝ ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Apply update: θ ← θ − ε ĝ
  end while
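A minimal NumPy sketch of this update loop (an illustration with an assumed mean-squared-error objective and a fixed number of steps in place of a real stopping criterion; none of the names here come from the book):

```python
import numpy as np

def sgd(X, y, grad_fn, theta, lr_schedule, batch_size, num_steps, rng):
    """Follow minibatch gradient estimates downhill, as in algorithm 8.1."""
    for k in range(num_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a minibatch
        g = grad_fn(theta, X[idx], y[idx])                        # gradient estimate
        theta = theta - lr_schedule(k) * g                        # apply the update
    return theta

# toy linear regression: gradient of the mean squared error
grad_mse = lambda theta, Xb, yb: Xb.T @ (Xb @ theta - yb) / len(Xb)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)
theta = sgd(X, y, grad_mse, np.zeros(3), lambda k: 0.1, 32, 500, rng)
print(theta)  # close to true_theta
```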
A sufficient condition to guarantee convergence of SGD is that

    Σ_{k=1}^{∞} ε_k = ∞, and        (8.12)

    Σ_{k=1}^{∞} ε_k² < ∞.        (8.13)
In practice, it is common to decay the learning rate linearly until iteration τ:

    ε_k = (1 − α) ε_0 + α ε_τ,        (8.14)

with α = k/τ. After iteration τ, it is common to leave ε constant.
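A small sketch of this linear decay schedule (illustrative only; the specific ε_0, ε_τ, and τ values are arbitrary assumptions):

```python
def linear_decay(k, eps0=0.1, eps_tau=0.001, tau=1000):
    """Linearly decay the learning rate until iteration tau, then hold it constant."""
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau

print(linear_decay(0), linear_decay(500), linear_decay(2000))  # 0.1, ~0.05, 0.001
```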
The learning rate may be chosen by trial and error, but it is usually best to choose it by monitoring learning curves that plot the objective function as a function of time. This is more of an art than a science, and most guidance on this subject should be regarded with some skepticism.
When using the linear schedule, the parameters to choose are ε_0, ε_τ, and τ. Usually τ may be set to the number of iterations required to make a few hundred passes through the training set. Usually ε_τ should be set to roughly 1 percent the value of ε_0.
The main question is how to set ε_0. If it is too large, the learning curve will show violent oscillations, with the cost function often increasing significantly. Gentle oscillations are fine, especially if training with a stochastic cost function, such as the cost function arising from the use of dropout. If the learning rate is too low, learning proceeds slowly, and if the initial learning rate is too low, learning may become stuck with a high cost value.
Typically, the optimal initial learning rate, in terms of total training time and the final cost value, is higher than the learning rate that yields the best performance after the first 100 iterations or so. Therefore, it is usually best to monitor the first several iterations and use a learning rate that is higher than the best-performing learning rate at this time, but not so high that it causes severe instability.
The most important property of SGD and related minibatch or online gradient-based optimization is that computation time per update does not grow with the number of training examples. This allows convergence even when the number of training examples becomes very large. For a large enough dataset, SGD may converge to within some fixed tolerance of its final test set error before it has processed the entire training set.
To study the convergence rate of an optimization algorithm, it is common to measure the excess error J(θ) − min_θ J(θ), which is the amount by which the current cost function exceeds the minimum possible cost. When SGD is applied to a convex problem, the excess error is O(1/√k) after k iterations, while in the strongly convex case, it is O(1/k). These bounds cannot be improved unless extra conditions are assumed.
Batch gradient descent enjoys better convergence rates than stochastic gradient descent in theory. However, the Cramér-Rao bound (Cramér, 1946; Rao, 1945) states that generalization error cannot decrease faster than O(1/k). Bottou and Bousquet (2008) argue that it therefore may not be worthwhile to pursue an optimization algorithm that converges faster than O(1/k) for machine learning tasks; faster convergence presumably corresponds to overfitting.
Moreover, the asymptotic analysis obscures many advantages that stochastic gradient descent has after a small number of steps. With large datasets, the ability of SGD to make rapid initial progress while evaluating the gradient for only very few examples outweighs its slow asymptotic convergence.
Most of the algorithms described in the remainder of this chapter achieve benefits that matter in practice but are lost in the constant factors obscured by the asymptotic analysis.
One can also trade off the benefits of both batch and stochastic gradient descent by gradually increasing the minibatch size during the course of learning. For more information on SGD, see Bottou (1998).
8.3.2 Momentum
While stochastic gradient descent remains a popular optimization strategy, learning with it can sometimes be slow. The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. The effect of momentum is illustrated in figure 8.5.
Figure 8.5: Momentum aims primarily to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient. Here, we illustrate how momentum overcomes the first of these two problems. The contour lines depict a quadratic loss function with a poorly conditioned Hessian matrix. The red path cutting across the contours indicates the path followed by the momentum learning rule as it minimizes this function. At each step along the way, we draw an arrow indicating the step that gradient descent would take at that point. We can see that a poorly conditioned quadratic objective looks like a long, narrow valley or canyon with steep sides. Momentum correctly traverses the canyon lengthwise, while gradient steps waste time moving back and forth across the narrow axis of the canyon. Compare also figure 4.6, which shows the behavior of gradient descent without momentum.
Formally, the momentum algorithm introduces a variable v that plays the role of velocity: it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient.
The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton’s laws of motion. Momentum in physics is mass times velocity. In the momentum learning algorithm, we assume unit mass, so the velocity vector v may also be regarded as the momentum of the particle.
A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients exponentially decay. The update rule is given by

    v ← α v − ε ∇_θ ((1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))),        (8.15)

    θ ← θ + v.        (8.16)
Algorithm 8.2 Stochastic gradient descent (SGD) with momentum

Require: Learning rate ε, momentum parameter α
Require: Initial parameter θ, initial velocity v
  while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute velocity update: v ← α v − ε g
    Apply update: θ ← θ + v
  end while
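The core of algorithm 8.2 reduces to two lines per iteration. A minimal NumPy sketch (reusing the toy gradient-function convention from the earlier examples; the hyperparameter values and the quadratic test problem are arbitrary assumptions):

```python
import numpy as np

def sgd_momentum(grad_fn, theta, lr=0.01, alpha=0.9, num_steps=100):
    """SGD with momentum: accumulate a decaying average of gradients in v."""
    v = np.zeros_like(theta)
    for _ in range(num_steps):
        g = grad_fn(theta)          # in practice, a minibatch gradient estimate
        v = alpha * v - lr * g      # velocity update (equation 8.15)
        theta = theta + v           # parameter update (equation 8.16)
    return theta

# poorly conditioned quadratic: J(theta) = 0.5 * (theta_0**2 + 100 * theta_1**2)
grad_fn = lambda theta: np.array([theta[0], 100.0 * theta[1]])
print(sgd_momentum(grad_fn, np.array([10.0, 1.0])))
```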
The velocity v accumulates the gradient elements ∇_θ((1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))). The larger α is relative to ε, the more previous gradients affect the current direction. The SGD algorithm with momentum is given in algorithm 8.2.
Previously, the size of the step was simply the norm of the gradient multiplied by the learning rate. Now, the size of the step depends on how large and how aligned a sequence of gradients are. The step size is largest when many successive gradients point in exactly the same direction.
If the momentum algorithm always observes gradient g, then it will accelerate in the direction of −g, until reaching a terminal velocity where the size of each step is

    ε ||g|| / (1 − α).        (8.17)
It is thus helpful to think of the momentum hyperparameter in terms of 1/(1 − α). For example, α = 0.9 corresponds to multiplying the maximum speed by 10 relative to the gradient descent algorithm. Common values of α used in practice include 0.5, 0.9, and 0.99.
Like the learning rate, α may also be adapted over time. Typically it begins with a small value and is later raised. Adapting α over time is less important than shrinking ε over time.
We can view the momentum algorithm as simulating a particle subject to continuous-time Newtonian dynamics. The physical analogy can help build intuition for how the momentum and gradient descent algorithms behave. The position of the particle at any point in time is given by θ(t). The particle experiences net force f(t). This force causes the particle to accelerate:

    f(t) = ∂²θ(t)/∂t².        (8.18)
Rather than viewing this as a second-order differential equation of the position, we can introduce the variable v(t), representing the velocity of the particle at time t, and rewrite the Newtonian dynamics as a first-order differential equation:

    v(t) = ∂θ(t)/∂t,        (8.19)

    f(t) = ∂v(t)/∂t.        (8.20)
The momentum algorithm then consists of solving the differential equations via numerical simulation. A simple numerical method for solving differential equations is Euler’s method, which simply consists of simulating the dynamics defined by the equation by taking small, finite steps in the direction of each gradient. This explains the basic form of the momentum update, but what specifically are the forces?
One force is proportional to the negative gradient of the cost function: −∇_θ J(θ). This force pushes the particle downhill along the cost function surface. The gradient descent algorithm would simply take a single step based on each gradient, but the Newtonian scenario used by the momentum algorithm instead uses this force to alter the velocity of the particle.
We can think of the particle as being like a hockey puck sliding down an icy surface. Whenever it descends a steep part of the surface, it gathers speed and continues sliding in that direction until it begins to go uphill again.
One other force is necessary. If the only force is the gradient of the cost function, then the particle might never come to rest. Imagine a hockey puck sliding down one side of a valley and straight up the other side, oscillating back and forth forever, assuming the ice is perfectly frictionless.
resolve
this
problem,
we
add
one
other
force,
prop
ortional
to
In
physics
terminology
this
force
corresp
onds
to
viscous
drag,
as
if
the
particle
must
push
through
resistant
medium
such
as
syrup.
This
causes
the
particle
to
gradually
lose
energy
ov
er
time
and
even
tually
con
erge
to
lo
cal
minim
um.
Wh
do
use
and
viscous
drag
in
particular? P
art
of
the
reason
to
use
is
mathematical
con
enience—an
integer
ow
er
of
the
velocity
is
easy
to
work
with.
et
other
physical
systems
ha
other
kinds
of
drag
based
on
other
in
teger
ow
ers
of
the
velocity
or
example,
particle
trav
eling
through
the
air
exp
eriences
turbulent
drag,
with
force
prop
ortional
to
the
square
of
the
velocity
while
particle
mo
ving
along
the
ground
exp
eriences
dry
friction,
with
force
of
constant
magnitude.
can
reject
eac
of
these
options.
urbulent
drag,
prop
ortional
to
the
square
of
the
velocity
ecomes
ery
eak
when
the
velocity
is
small.
It
is
not
ow
erful
enough
to
force
the
particle
to
come
to
rest.
particle
with
nonzero
initial
elo
cit
that
exp
eriences
only
the
force
of
turbulent
drag
will
mo
from
its
initial
osition
forever,
with
the
distance
from
the
starting
oin
growing
like
log
must
therefore
use
low
er
er
of
the
elo
cit
If
we
use
ow
er
of
zero,
represen
ting
dry
friction,
then
the
force
is
to
strong.
When
the
force
due
to
the
gradien
of
the
cost
function
is
small
but
nonzero,
the
constan
force
due
to
friction
can
cause
the
particle
to
come
to
rest
efore
reac
hing
lo
cal
minim
um.
Viscous
drag
av
oids
oth
of
these
problems—it
is
weak
enough
that
the
gradient
can
con
tinue
to
cause
motion
until
minim
um
is
reached,
but
strong
enough
to
preven
motion
if
the
gradient
do
es
not
justify
moving.
8.3.3 Nesterov Momentum

Sutskever et al. (2013) introduced a variant of the momentum algorithm that was inspired by Nesterov's accelerated gradient method (Nesterov, 1983, 2004). The update rules in this case are given by
\[ v \leftarrow \alpha v - \epsilon \nabla_{\theta}\left[ \frac{1}{m}\sum_{i=1}^{m} L\big(f(x^{(i)};\theta + \alpha v),\, y^{(i)}\big) \right], \tag{8.21} \]
\[ \theta \leftarrow \theta + v, \tag{8.22} \]
where the parameters \(\alpha\) and \(\epsilon\) play a similar role as in the standard momentum method. The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. With Nesterov momentum, the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum. The complete Nesterov momentum algorithm is presented in algorithm 8.3.

In the convex batch gradient case, Nesterov momentum brings the rate of convergence of the excess error from \(O(1/k)\) (after \(k\) steps) to \(O(1/k^2)\), as shown by Nesterov (1983). Unfortunately, in the stochastic gradient case, Nesterov momentum does not improve the rate of convergence.
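The following sketch is not from the original text; it shows one way the Nesterov update of equations 8.21 and 8.22 might be written in NumPy. The minibatch gradient function grad_fn is a hypothetical placeholder; the only difference from the standard momentum sketch shown earlier is that the gradient is evaluated at the interim point \(\theta + \alpha v\).

    import numpy as np

    def sgd_nesterov(grad_fn, theta, lr=0.01, alpha=0.9, n_steps=1000):
        """Sketch of SGD with Nesterov momentum; grad_fn returns a minibatch gradient."""
        v = np.zeros_like(theta)
        for _ in range(n_steps):
            g = grad_fn(theta + alpha * v)   # gradient at the interim point (eq. 8.21)
            v = alpha * v - lr * g           # velocity update
            theta = theta + v                # parameter update (eq. 8.22)
        return theta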
8.4
arameter
Initialization
Strategies
Some
optimization
algorithms
are
not
iterative
by
nature
and
simply
solve
for
solution
oint.
Other
optimization
algorithms
are
iterative
by
nature
but,
when
applied
to
the
right
class
of
optimization
problems,
conv
erge
to
acceptable
solutions
in
an
acceptable
amoun
of
time
regardless
of
initialization.
Deep
learning
training
algorithms
usually
do
not
hav
either
of
these
luxuries. T
raining
algorithms
for
Algorithm 8.3 Stochastic gradient descent (SGD) with Nesterov momentum
Require: Learning rate \(\epsilon\), momentum parameter \(\alpha\)
Require: Initial parameter \(\theta\), initial velocity \(v\)
while stopping criterion not met do
    Sample a minibatch of \(m\) examples from the training set \(\{x^{(1)}, \dots, x^{(m)}\}\) with corresponding labels \(y^{(i)}\)
    Apply interim update: \(\tilde{\theta} \leftarrow \theta + \alpha v\)
    Compute gradient (at interim point): \(g \leftarrow \frac{1}{m}\nabla_{\tilde{\theta}}\sum_i L(f(x^{(i)};\tilde{\theta}), y^{(i)})\)
    Compute velocity update: \(v \leftarrow \alpha v - \epsilon g\)
    Apply update: \(\theta \leftarrow \theta + v\)
end while
deep
learning
mo
dels
are
usually
iterativ
and
thus
require
the
user
to
sp
ecify
some
initial
oint
from
whic
to
egin
the
iterations.
Moreo
ver,
training
deep
mo
dels
is
sufficiently
difficult
task
that
most
algorithms
are
strongly
affected
the
choice
of
initialization.
The
initial
oint
can
determine
whether
the
algorithm
con
erges
at
all,
with
some
initial
oints
eing
so
unstable
that
the
algorithm
encounters
umerical
difficulties
and
fails
altogether.
When
learning
do
es
conv
erge,
the
initial
oin
can
determine
ho
quickly
learning
con
erges
and
whether
it
conv
erges
to
oint
with
high
or
lo
cost.
Also,
oints
of
comparable
cost
can
hav
wildly
arying
generalization
error,
and
the
initial
oin
can
affect
the
generalization
as
well.
Mo
dern
initialization
strategies
are
simple
and
heuristic.
Designing
impro
ed
initialization
strategies
is
difficult
task
ecause
neural
net
ork
optimization
is
not
et
ell
understoo
d.
Most
initialization
strategies
are
based
on
achieving
some
nice
prop
erties
when
the
netw
ork
is
initialized.
Ho
wev
er,
do
not
ha
ve
go
understanding
of
whic
of
these
prop
erties
are
preserv
ed
under
whic
circumstances
after
learning
egins
to
pro
ceed. A
further
difficulty
is
that
some
initial
oints
ma
eneficial
from
the
viewp
oint
of
optimization
but
detrimen
tal
from
the
viewp
oin
of
generalization.
Our
understanding
of
how
the
initial
point
affects
generalization
is
esp
ecially
primitiv
e,
offering
little
to
no
guidance
for
how
to
select
the
initial
oin
t.
erhaps
the
only
prop
erty
kno
wn
with
complete
certaint
is
that
the
initial
parameters
need
to
“break
symmetry”
et
een
different
units.
If
tw
hidden
units
with
the
same
activ
ation
function
are
connected
to
the
same
inputs,
then
these
units
must
hav
different
initial
parameters. If
they
hav
the
same
initial
parameters,
then
deterministic
learning
algorithm
applied
to
deterministic
cost
and
mo
del
will
constantly
up
date
oth
of
these
units
in
the
same
wa
Ev
en
if
the
mo
del
or
training
algorithm
is
capable
of
using
sto
chasticit
to
compute
different
up
dates
for
differen
units
(for
example,
if
one
trains
with
drop
out),
it
is
usually
est
to
initialize
eac
unit
to
compute
different
function
from
all
the
other
units.
This
may
help
to
make
sure
that
no
input
patterns
are
lost
in
the
null
space
of
forward
propagation
and
that
no
gradient
patterns
are
lost
in
the
null
space
of
back-propagation.
The
goal
of
having
each
unit
compute
different
function
motiv
ates
random
initialization
of
the
parameters.
could
explicitly
searc
for
large
set
of
basis
functions
that
are
all
utually
different
from
eac
other,
but
this
often
incurs
noticeable
computational
cost.
or
example,
if
we
ha
at
most
as
many
outputs
as
inputs,
we
could
use
Gram-Schmidt
orthogonalization
on
an
initial
eigh
matrix
and
guaranteed
that
each
unit
would
compute
very
differen
function
from
eac
other
unit.
Random
initialization
from
high-entrop
distribution
er
high-dimensional
space
is
computationally
cheaper
and
unlikely
to
assign
an
units
to
compute
the
same
function
as
eac
other.
ypically
we
set
the
biases
for
each
unit
to
heuristically
chosen
constan
ts,
and
initialize
only
the
weigh
ts
randomly
Extra
parameters—for
example,
parameters
enco
ding
the
conditional
ariance
of
prediction—are
usually
set
to
heuristically
hosen
constants
uc
like
the
biases
are.
almost
alw
ys
initialize
all
the
eights
in
the model
to
alues
dra
wn
randomly
from
Gaussian
or
uniform
distribution.
The
hoice
of
Gaussian
or
uniform
distribution
do
es
not
seem
to
matter
muc
but
has
not
een
exhaustively
studied.
The
scale
of
the
initial
distribution,
how
ev
er,
do
es
ha
large
effect
on
oth
the
outcome
of
the
optimization
pro
cedure
and
the
ability
of
the
net
ork
to
generalize.
Larger
initial
weigh
ts
will
yield
stronger
symmetry-breaking
effect,
helping
to
av
oid
redundant
units.
They
also
help
to
av
oid
losing
signal
during
forward
or
bac
k-propagation
through
the
linear
comp
onent
of
eac
lay
er—larger
alues
in
the
matrix
result
in
larger
outputs
of
matrix
ultiplication.
Initial
eigh
ts
that
are
to
large
ma
how
ever,
result
in
explo
ding
alues
during
forward
propagation
or
bac
k-propagation.
In
recurrent
net
works,
large
eights
can
also
result
in
haos
(suc
extreme
sensitivity
to
small
erturbations
of
the
input
that
the
eha
vior
of
the
deterministic
forward
propagation
pro
cedure
app
ears
random).
some
exten
t,
the
explo
ding
gradient
problem
can
mitigated
by
gradient
clipping
(thresholding
the
alues
of
the
gradients
efore
erforming
gradien
descent
step).
Large
weigh
ts
may
also
result
in
extreme
alues
that
cause
the
activ
ation
function
to
saturate,
causing
complete
loss
of
gradient
through
saturated
units.
These
comp
eting
factors
determine
the
ideal
initial
scale
of
the
weigh
ts.
The
ersp
ectiv
es
of
regularization
and
optimization
can
giv
ery
differen
insigh
ts
in
to
ho
we
should
initialize
netw
ork.
The
optimization
ersp
ective
suggests
that
the
weigh
ts
should
large
enough
to
propagate
information
suc-
cessfully
but
some
regularization
concerns
encourage
making
them
smaller.
The
use
of
an
optimization
algorithm,
such
as
sto
hastic
gradient
descen
t,
that
makes
small
incremen
tal
changes
to
the
weigh
ts
and
tends
to
halt
in
areas
that
are
nearer
to
the
initial
parameters
(whether
due
to
getting
stuck
in
region
of
low
gradien
t,
or
due
to
triggering
some
early
stopping
criterion
based
on
ov
erfitting)
expresses
prior
that
the
final
parameters
should
close
to
the
initial
parameters.
Recall
from
section
7.8
that
gradient
descen
with
early
stopping
is
equiv
alent
to
weigh
deca
for
some
mo
dels.
In
the
general
case,
gradient
descen
with
early
stopping
is
not
the
same
as
eigh
decay
but
it
do
es
pro
vide
lo
ose
analogy
for
thinking
ab
out
the
effect
of
initialization.
can
think
of
initializing
the
parameters
to
as
eing
similar
to
imp
osing
Gaussian
prior
with
mean
rom
this
oin
of
view,
it
mak
es
sense
to
choose
to
near
0.
This
prior
says
that
it
is
more
likely
that
units
do
not
in
teract
with
eac
other
than
that
they
do
interact.
Units
interact
only
if
the
likelihoo
term
of
the
ob
jectiv
function
expresses
strong
preference
for
them
to
interact.
On
the
other
hand,
if
we
initialize
to
large
alues,
then
our
prior
sp
ecifies
which
units
should
interact
with
each
other,
and
how
they
should
interact.
Some heuristics are available for choosing the initial scale of the weights. One heuristic is to initialize the weights of a fully connected layer with \(m\) inputs and \(n\) outputs by sampling each weight from \(U\!\left(-\frac{1}{\sqrt{m}}, \frac{1}{\sqrt{m}}\right)\), while Glorot and Bengio (2010) suggest using the normalized initialization
\[ W_{i,j} \sim U\!\left( -\sqrt{\frac{6}{m+n}},\; \sqrt{\frac{6}{m+n}} \right). \tag{8.23} \]
This
latter
heuristic
is
designed
to
compromise
et
een
the
goal
of
initializing
all
lay
ers
to
ha
ve
the
same
activ
ation
ariance
and
the
goal
of
initializing
all
la
ers
to
hav
the
same
gradien
ariance.
The
formula
is
deriv
ed
using
the
assumption
that
the
net
work
consists
only
of
hain
of
matrix
ultiplications,
with
no
nonlinearities. Real
neural
netw
orks
ob
viously
violate
this
assumption,
but
many
strategies
designed
for
the
linear
mo
del
erform
reasonably
well
on
its
nonlinear
counterparts.
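A sampling sketch for the two heuristics above, not from the original text; m and n denote fan-in and fan-out as in equation 8.23, and the function names are illustrative:

    import numpy as np

    def glorot_uniform(m, n, rng=None):
        """Sample an n-by-m weight matrix from the normalized initialization of eq. 8.23."""
        rng = np.random.default_rng() if rng is None else rng
        limit = np.sqrt(6.0 / (m + n))          # m inputs, n outputs
        return rng.uniform(-limit, limit, size=(n, m))

    def scaled_uniform(m, n, rng=None):
        """Sample from U(-1/sqrt(m), 1/sqrt(m)), the simpler heuristic mentioned above."""
        rng = np.random.default_rng() if rng is None else rng
        limit = 1.0 / np.sqrt(m)
        return rng.uniform(-limit, limit, size=(n, m))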
Saxe
et
al.
2013
recommend
initializing
to
random
orthogonal
matrices,
with
carefully
hosen
scaling
or
gain
factor
that
accoun
ts
for
the
nonlinearity
applied
at
each
lay
er.
They
deriv
sp
ecific
alues
of
the
scaling
factor
for
different
yp
es
of
nonlinear
activ
ation
functions.
This
initialization
sc
heme
is
also
motiv
ated
by
mo
del
of
deep
net
ork
as
sequence
of
matrix
multiplies
without
nonlinearities.
Under
suc
mo
del,
this
initialization
sc
heme
guaran
tees
that
the
total
num
er
of
training
iterations
required
to
reac
conv
ergence
is
indep
endent
of
depth.
Increasing
the
scaling
factor
pushes
the
net
work
to
ward
the
regime
where
activ
ations
increase
in
norm
as
they
propagate
forw
ard
through
the
netw
ork
and
gradien
ts
increase
in
norm
as
they
propagate
backw
ard.
Sussillo
2014
show
ed
that
setting
the
gain
factor
correctly
is
sufficien
to
train
net
orks
as
deep
as
1,000
lay
ers,
without
needing
to
use
orthogonal
initializations.
ey
insight
of
this
approach
is
that
in
feedforw
ard
netw
orks,
activ
ations
and
gradien
ts
can
grow
or
shrink
on
eac
step
of
forward
or
bac
k-propagation,
following
random
alk
eha
vior.
This
is
ecause
feedforw
ard
netw
orks
use
different
eigh
matrix
at
each
la
er.
If
this
random
walk
is
tuned
to
preserv
norms,
then
feedforw
ard
netw
orks
can
mostly
oid
the
anishing
and
explo
ding
gradients
problem
that
arises
when
the
same
eigh
matrix
is
used
at
each
step,
as
describ
ed
in
section
8.2.5
Unfortunately
these
optimal
criteria
for
initial
weigh
ts
often
do
not
lead
to
optimal
erformance.
This
may
for
three
differen
reasons.
First,
we
may
using
the
wrong
criteria—it
may
not
actually
eneficial
to
preserve
the
norm
of
signal
throughout
the
entire
net
ork.
Second,
the
prop
erties
imp
osed
at
initialization
may
not
ersist
after
learning
has
egun
to
pro
ceed.
Third,
the
criteria
might
succeed
at
improving
the
sp
eed
of
optimization
but
inadverten
tly
increase
generalization
error.
In
practice,
we
usually
need
to
treat
the
scale
of
the
eigh
ts
as
hyperparameter
whose
optimal
alue
lies
somewhere
roughly
near
but
not
exactly
equal
to
the
theoretical
predictions.
One
drawbac
to
scaling
rules
that
set
all
the
initial
weigh
ts
to
ha
ve
the
same
standard
deviation,
such
as
is
that
every
individual
weigh
ecomes
extremely
small
when
the
lay
ers
ecome
large.
Martens
2010
introduced
an
alternative
initialization
sc
heme
called
sparse
initialization
in
whic
each
unit
is
initialized
to
hav
exactly
nonzero
weigh
ts.
The
idea
is
to
keep
the
total
amount
of
input
to
the
unit
indep
endent
from
the
num
er
of
inputs
without
making
the
magnitude
of
individual
weigh
elements
shrink
with
Sparse
initialization
helps
to
achiev
more
diversit
among
the
units
at
initialization
time.
Ho
wev
er,
it
also
imp
oses
very
strong
prior
on
the
weigh
ts
that
are
chosen
to
ha
ve
large
Gaussian
alues.
Because
it
takes
long
time
for
gradien
descen
to
shrink
“incorrect”
large
alues,
this
initialization
scheme
can
cause
problems
for
units,
such
as
maxout
units,
that
ha
several
filters
that
ust
carefully
co
ordinated
with
each
other.
When
computational
resources
allo
it,
it
is
usually
go
idea
to
treat
the
initial
scale
of
the
eights
for
each
lay
er
as
hyperparameter,
and
to
choose
these
scales
using
yp
erparameter
algorithm
describ
ed
in
section
11.4.2
such
as
random
search.
The
choice
of
whether
to
use
dense
or
sparse
initialization
can
also
made
yp
erparameter.
Alternately
one
can
man
ually
for
the
est
initial
scales.
go
rule
of
thum
for
choosing
the
initial
scales
is
to
lo
ok
at
the
range
or
standard
deviation
of
activ
ations
or
gradients
on
single
minibatc
of
data.
If
the
weigh
ts
are
to
small,
the
range
of
activ
ations
across
the
minibatc
will
shrink
as
the
activ
ations
propagate
forward
through
the
netw
ork.
By
rep
eatedly
identifying
the
first
lay
er
with
unacceptably
small
activ
ations
and
increasing
its
weigh
ts,
it
is
ossible
to
even
tually
obtain
net
ork
with
reasonable
initial
activ
ations
throughout.
If
learning
is
still
to
slow
at
this
oint,
it
can
useful
to
lo
ok
at
the
range
or
standard
deviation
of
the
gradien
ts
as
well
as
the
activ
ations. This
pro
cedure
can
in
principle
automated
and
is
generally
less
computationally
costly
than
hyperparameter
optimization
based
on
alidation
set
error
ecause
it
is
based
on
feedback
from
the
ehavior
of
the
initial
mo
del
on
single
batc
of
data,
rather
than
on
feedback
from
trained
mo
del
on
the
alidation
set.
While
long
used
heuristically
this
proto
col
has
recently
een
sp
ecified
more
formally
and
studied
by
Mishkin
and
Matas
2015
).
So far w
e ha
e focused on the initialization of
the w
eights.
ortunately
initialization
of
other
parameters
is
typically
easier.
The
approac
for
setting
the
biases
ust
co
ordinated
with
the
approac
for
setting
the
weigh
ts.
Setting
the
biases
to
zero
is
compatible
with
most
weigh
initialization
schemes.
There
are
few
situations
where
we
ma
set
some
biases
to
nonzero
alues:
If
bias
is
for
an
output
unit,
then
it
is
often
eneficial
to
initialize
the
bias
to
obtain
the
righ
marginal
statistics
of
the
output.
do
this,
assume
that
the
initial
eigh
ts
are
small
enough
that
the
output
of
the
unit
is
determined
only
the
bias.
This
justifies
setting
the
bias
to
the
in
erse
of
the
activ
ation
function
applied
to
the
marginal
statistics
of
the
output
in
the
training
set.
For example, if the output is a distribution over classes, and this distribution is a highly skewed distribution with the marginal probability of class \(i\) given by element \(c_i\) of some vector \(c\), then we can set the bias vector \(b\) by solving the equation \(\mathrm{softmax}(b) = c\) (a worked sketch appears after this list). This applies not only to classifiers but also to models we will encounter in Part III, such as autoencoders and Boltzmann machines. These models have layers whose output should resemble the input data \(x\), and it can be very helpful to initialize the biases of such layers to match the marginal distribution over the output.
Sometimes
e ma
an
to
ho
ose
the
bias
to
av
oid causing
to
muc
saturation
at
initialization.
or
example,
ma
set
the
bias
of
ReLU
hidden
unit
to
0.1
rather
than
to
av
oid
saturating
the
ReLU
at
initialization.
This
approach
is
not
compatible
with
weigh
initialization
schemes
that
do
not
exp
ect
strong
input
from
the
biases
though.
or
example, it
is
not
recommended
for
use
with
random
walk
initialization
Sussillo
2014
).
Sometimes
unit
controls
whether
other
units
are
able
to
participate
in
function.
In
suc
situations,
hav
unit
with
output
and
another
unit
[0
1]
and
they
are
multiplied
together
to
pro
duce
an
output
uh
. W
can
view
as
gate
that
determines
whether
uh
or
uh
. In
these
situations,
we
wan
to
set
the
bias
for
so
that
most
of
the
time
at
initialization.
Otherwise
do
es
not
hav
chance
to
learn.
or
example,
Jozefo
wicz
et
al.
2015
adv
cate
setting
the
bias
to
for
the
forget
gate
of
the
LSTM
mo
del,
describ
ed
in
section
10.10
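Returning to the first item in this list, here is a small sketch (not from the original text) of setting output-unit biases from class marginals: since softmax(log c) = c for any probability vector c, taking the log of the smoothed empirical class frequencies solves softmax(b) = c up to an additive constant that softmax ignores. The helper name and smoothing constant are illustrative choices.

    import numpy as np

    def init_output_bias(train_labels, n_classes, smoothing=1.0):
        """Bias vector b such that softmax(b) matches the (smoothed) class marginals."""
        counts = np.bincount(train_labels, minlength=n_classes) + smoothing
        c = counts / counts.sum()      # empirical marginal probability of each class
        return np.log(c)               # softmax(log c) = c, up to the usual shift invariance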
Another common type of parameter is a variance or precision parameter. For example, we can perform linear regression with a conditional variance estimate using the model
\[ p(y \mid x) = \mathcal{N}\!\big(y \mid w^{\top}x + b,\; 1/\beta\big), \tag{8.24} \]
where \(\beta\) is a precision parameter. We can usually initialize variance or precision parameters to 1 safely.
Another
approach
is
to
assume
the
initial
weigh
ts
are
close
enough
to
zero
that
the
biases
may
set
while
ignoring
the
effect
of
the
weigh
ts,
then
set
the
biases
to
pro
duce
the
correct
marginal
mean
of
the
output,
and
set
the
ariance
parameters
to
the
marginal
ariance
of
the
output
in
the
training
set.
Besides
these
simple
constant
or
random
metho
ds
of
initializing
mo
del
parame-
ters,
it
is
ossible
to
initialize
mo
del
parameters
using
mach
ine
learning.
common
strategy
discussed
in
part
of
this
ok
is
to
initialize
sup
ervised
mo
del
with
the
parameters
learned
by
an
unsup
ervised
mo
del
trained
on
the
same
inputs.
One
can
also
erform
sup
ervised
training
on
related
task.
Even
erforming
sup
ervised
training
on
an
unrelated
task
can
sometimes
yield
an
initialization
that
offers
faster
conv
ergence
than
random
initialization.
Some
of
these
initialization
strategies
may
yield
faster
conv
ergence
and
etter
generalization
ecause
they
enco
de
information
ab
out
the
distribution
in
the
initial
parameters
of
the
mo
del.
Others
apparently
erform
we
ll
primarily
ecause
they
set
the
parameters
to
hav
the
righ
scale
or
set
differen
units
to
compute
different
functions
from
eac
other.
8.5
Algorithms
with
daptiv
Learning
Rates
Neural
net
ork
researc
hers
ha
long
realized
that
the
learning
rate
is
reliably
one
of
the
most
difficult
to
set
hyperparameters
ecause
it
significantly
affects
mo
del
erformance.
As
we
discuss
in
sections
4.3
and
8.2
the
cost
is
often
highly
sensitiv
to
some
directions
in
parameter
space
and
insensitiv
to
others.
The
momen
tum
algorithm
can
mitigate
these
issues
somewhat,
but
it
do
es
so
at
the
exp
ense
of
introducing
another
yp
erparameter.
In
the
face
of
this,
it
is
natural
to
ask
if
there
is
another
wa
If
we
eliev
that
the
directions
of
sensitivity
are
somewhat
axis
aligned,
it
can
make
sense
to
use
separate
learning
rate
for
eac
parameter
and
automatically
adapt
these
learning
rates
throughout
the
course
of
learning.
The
delta-bar-delta
algorithm
Jacobs
1988
is
an
early
heuristic
approach
to
adapting
individual
learning
rates
for
mo
del
parameters
during
training.
The
approac
is
based
on
simple
idea:
if
the
partial
deriv
ativ
of
the
loss,
with
resp
ect
to
given
model
parameter,
remains
the
same
sign,
then
the
learning
rate
should
increase.
If
that
partial
deriv
ative
hanges
sign,
then
the
learning
rate
should
decrease.
Of
course,
this
kind
of
rule
can
only
applied
to
full
batch
optimization.
More
recently
num
er
of
incremental
(or
mini
batch-based)
metho
ds
hav
een
in
tro
duced
that
adapt
the
learning
rates
of
mo
del
parameters.
In
this
section,
briefly
review
few
of
these
algorithms.
8.5.1
daGrad
The
daGrad
algorithm,
shown
in
algorithm
8.4
individually
adapts
the
learning
rates
of
all
mo
del
parameters
by
scaling
them
in
ersely
prop
ortional
to
the
square
ro
ot
of
the
sum
of
all
the
historical
squared
alues
of
the
gradient
Duc
hi
et
al.
2011
).
The
parameters
with the
largest
partial deriv
ative
of the
loss ha
ve
corresp
ondingly
rapid
decrease
in
their
learning
rate,
while
parameters
with
small
partial
deriv
atives
ha
relatively
small
decrease
in
their
learning
rate.
The
net
effect
is
greater
progress
in
the
more
gently
slop
ed
directions
of
parameter
space.
In
the
context
of
conv
ex
optimization,
the
daGrad
algorithm
enjoys
some
desirable
theoretical
prop
erties.
Empirically
ho
ev
er,
for
training
deep
neural
net
ork
mo
dels,
the
accumulation
of
squared
gradients
fr
om
the
ginning
of
tr
aining
can
result
in
premature
and
excessiv
decrease
in
the
effective
learning
rate.
AdaGrad
erforms
well
for
some
but
not
all
deep
learning
mo
dels.
8.5.2
RMSProp
The
RMSProp
algorithm
Hinton
2012
mo
difies
AdaGrad
to
erform
etter
in
the
nonconv
ex
setting
by
changing
the
gradient
accumulation
in
to
an
exp
onen
tially
eigh
ted
moving
erage.
AdaGrad
is
designed
to
conv
erge
rapidly
when
applied
to
conv
ex
function.
When
applied
to
noncon
ex
function
to
train
neural
netw
ork,
the
learning
tra
jectory
may
pass
through
many
different
structures
and
ev
entually
arriv
at
region
that
is
lo
cally
con
ex
owl.
AdaGrad
shrinks
the
learning
rate
according
to
the
en
tire
history
of
the
squared
gradient
and
may
hav
made
the
learning
rate
to
small
efore
arriving
at
such
conv
ex
structure.
RMSProp
uses
an
exponentially
decaying
av
erage
to
discard
history
from
the
extreme
past
so
that
it
can
conv
erge
rapidly
after
finding
con
vex
wl,
as
if
it
ere
an
instance
of
the
daGrad
algorithm
initialized
within
that
owl.
Algorithm 8.4 The AdaGrad algorithm
Require: Global learning rate \(\epsilon\)
Require: Initial parameter \(\theta\)
Require: Small constant \(\delta\), perhaps \(10^{-7}\), for numerical stability
Initialize gradient accumulation variable \(r = 0\)
while stopping criterion not met do
    Sample a minibatch of \(m\) examples from the training set \(\{x^{(1)}, \dots, x^{(m)}\}\) with corresponding targets \(y^{(i)}\)
    Compute gradient: \(g \leftarrow \frac{1}{m}\nabla_{\theta}\sum_i L(f(x^{(i)};\theta), y^{(i)})\)
    Accumulate squared gradient: \(r \leftarrow r + g \odot g\)
    Compute update: \(\Delta\theta \leftarrow -\frac{\epsilon}{\delta + \sqrt{r}} \odot g\)   (Division and square root applied element-wise)
    Apply update: \(\theta \leftarrow \theta + \Delta\theta\)
end while
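The per-parameter rescaling in algorithm 8.4 can be sketched as follows. This is not from the original text; grad_fn is a hypothetical minibatch-gradient function.

    import numpy as np

    def adagrad(grad_fn, theta, lr=0.01, delta=1e-7, n_steps=1000):
        """Sketch of AdaGrad (algorithm 8.4): accumulate squared gradients from the start."""
        r = np.zeros_like(theta)                      # accumulated squared gradients
        for _ in range(n_steps):
            g = grad_fn(theta)
            r = r + g * g                             # accumulate squared gradient
            theta = theta - lr * g / (delta + np.sqrt(r))   # element-wise rescaled step
        return theta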
Algorithm 8.5 The RMSProp algorithm
Require: Global learning rate \(\epsilon\), decay rate \(\rho\)
Require: Initial parameter \(\theta\)
Require: Small constant \(\delta\), usually \(10^{-6}\), used to stabilize division by small numbers
Initialize accumulation variable \(r = 0\)
while stopping criterion not met do
    Sample a minibatch of \(m\) examples from the training set \(\{x^{(1)}, \dots, x^{(m)}\}\) with corresponding targets \(y^{(i)}\)
    Compute gradient: \(g \leftarrow \frac{1}{m}\nabla_{\theta}\sum_i L(f(x^{(i)};\theta), y^{(i)})\)
    Accumulate squared gradient: \(r \leftarrow \rho r + (1-\rho)\, g \odot g\)
    Compute parameter update: \(\Delta\theta = -\frac{\epsilon}{\sqrt{\delta + r}} \odot g\)   (\(\frac{\epsilon}{\sqrt{\delta + r}}\) applied element-wise)
    Apply update: \(\theta \leftarrow \theta + \Delta\theta\)
end while
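The corresponding sketch for algorithm 8.5, again not from the original text and using the same hypothetical grad_fn, differs from the AdaGrad sketch only in how the accumulator is updated:

    import numpy as np

    def rmsprop(grad_fn, theta, lr=0.001, rho=0.9, delta=1e-6, n_steps=1000):
        """Sketch of RMSProp (algorithm 8.5): the accumulator is an exponential moving average."""
        r = np.zeros_like(theta)
        for _ in range(n_steps):
            g = grad_fn(theta)
            r = rho * r + (1.0 - rho) * g * g     # decaying average of squared gradients
            theta = theta - lr * g / np.sqrt(delta + r)
        return theta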
RMSProp
is
shown
in
its
standard
form
in
algorithm
8.5
and
combined
with
Nestero
momentum
in
algorithm
8.6
Compared
to
AdaGrad,
the
use
of
the
mo
ving
verage
introduces
new
yp
erparameter,
that
controls
the
length
scale
of
the
mo
ving
av
erage.
Empirically
RMSProp
has
een
shown
to
an
effectiv
and
practical
op-
timization
algorithm
for
deep
neural
net
wor
ks.
It
is
currently
one
of
the
go-to
optimization
metho
ds
eing
employ
ed
routinely
by
deep
learning
practitioners.
8.5.3
dam
dam
Kingma
and
Ba
2014
is
yet
another
adaptive
learning
rate
optimization
algorithm
and
is
presented
in
algorithm
8.7
The
name
“Adam” deriv
es
from
the
phrase
“adaptive
moments.”
In
the
context
of
the
earlier
algorithms,
it
is
erhaps
est
seen
as
arian
on
the
com
bination
of
RMSProp
and
momentum
with
few
important
distinctions.
First,
in
Adam,
momen
tum
is
incorp
orated
directly
as
an
estimate
of
the
first-order
moment
(with
exp
onen
tial
weigh
ting)
of
the
gradient.
The
most
straightforw
ard
wa
to
add
momentum
to
RMSProp
is
to
apply
momentum
to
the
rescaled
gradien
ts.
The
use
of
momen
tum
in
combination
with
rescaling
do
es
not
hav
clear
theoretical
motiv
ation.
Second,
Adam
includes
bias
corrections
to
the
estimates
of
oth
the
first-order
moments
(the
momentum
term)
and
the
(uncentered)
second-order
momen
ts
to
accoun
for
their
initialization
at
the
origin
(see
algorithm
8.7
).
RMSProp
also
incorp
orates
an
estimate
of
the
Algorithm 8.6 RMSProp algorithm with Nesterov momentum
Require: Global learning rate \(\epsilon\), decay rate \(\rho\), momentum coefficient \(\alpha\)
Require: Initial parameter \(\theta\), initial velocity \(v\)
Initialize accumulation variable \(r = 0\)
while stopping criterion not met do
    Sample a minibatch of \(m\) examples from the training set \(\{x^{(1)}, \dots, x^{(m)}\}\) with corresponding targets \(y^{(i)}\)
    Compute interim update: \(\tilde{\theta} \leftarrow \theta + \alpha v\)
    Compute gradient: \(g \leftarrow \frac{1}{m}\nabla_{\tilde{\theta}}\sum_i L(f(x^{(i)};\tilde{\theta}), y^{(i)})\)
    Accumulate gradient: \(r \leftarrow \rho r + (1-\rho)\, g \odot g\)
    Compute velocity update: \(v \leftarrow \alpha v - \frac{\epsilon}{\sqrt{r}} \odot g\)   (\(\frac{1}{\sqrt{r}}\) applied element-wise)
    Apply update: \(\theta \leftarrow \theta + v\)
end while
(uncen
tered)
second-order
moment;
ho
ev
er,
it
lacks
the
correction
factor.
Th
us,
unlik
in
Adam,
the
RMSProp
second-order
momen
estimate
may
hav
high
bias
early
in
training.
dam
is
generally
regarded
as
eing
fairly
robust
to
the
choice
of
hyperparameters,
though
the
learning
rate
sometimes
needs
to
hanged
from
the
suggested
default.
8.5.4
Cho
osing
the
Righ
Optimization
Algorithm
hav
discussed
series
of
related
algorithms
that
eac
seek
to
address
the
hallenge
of
optimizing
deep
mo
dels
by
adapting
the
learning
rate
for
eac
mo
del
parameter.
this
oint,
natural
question
is:
whic
algorithm
should
one
choose?
Unfortunately
there
is
currently
no
consensus
on
this
oint.
Schaul
et
al.
2014
presen
ted
aluable
comparison
of
large
num
er
of
optimization
algorithms
across
wide
range
of
learning
tasks.
While
the
results
suggest
that
the
family
of
algorithms
with
adaptive
learning
rates
(represen
ted
by
RMSProp
and
AdaDelta)
Algorithm 8.7 The Adam algorithm
Require: Step size \(\epsilon\) (Suggested default: 0.001)
Require: Exponential decay rates for moment estimates, \(\rho_1\) and \(\rho_2\) in \([0, 1)\) (Suggested defaults: 0.9 and 0.999, respectively)
Require: Small constant \(\delta\) used for numerical stabilization (Suggested default: \(10^{-8}\))
Require: Initial parameters \(\theta\)
Initialize 1st and 2nd moment variables \(s = 0\), \(r = 0\)
Initialize time step \(t = 0\)
while stopping criterion not met do
    Sample a minibatch of \(m\) examples from the training set \(\{x^{(1)}, \dots, x^{(m)}\}\) with corresponding targets \(y^{(i)}\)
    Compute gradient: \(g \leftarrow \frac{1}{m}\nabla_{\theta}\sum_i L(f(x^{(i)};\theta), y^{(i)})\)
    \(t \leftarrow t + 1\)
    Update biased first moment estimate: \(s \leftarrow \rho_1 s + (1-\rho_1) g\)
    Update biased second moment estimate: \(r \leftarrow \rho_2 r + (1-\rho_2)\, g \odot g\)
    Correct bias in first moment: \(\hat{s} \leftarrow \frac{s}{1 - \rho_1^t}\)
    Correct bias in second moment: \(\hat{r} \leftarrow \frac{r}{1 - \rho_2^t}\)
    Compute update: \(\Delta\theta = -\epsilon \frac{\hat{s}}{\sqrt{\hat{r}} + \delta}\)   (operations applied element-wise)
    Apply update: \(\theta \leftarrow \theta + \Delta\theta\)
end while
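A compact sketch of algorithm 8.7, not from the original text; grad_fn is again a hypothetical minibatch-gradient function, and the defaults mirror the suggested values above:

    import numpy as np

    def adam(grad_fn, theta, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8, n_steps=1000):
        """Sketch of Adam (algorithm 8.7) with bias-corrected moment estimates."""
        s = np.zeros_like(theta)                 # first-moment estimate
        r = np.zeros_like(theta)                 # second-moment estimate
        for t in range(1, n_steps + 1):
            g = grad_fn(theta)
            s = rho1 * s + (1.0 - rho1) * g
            r = rho2 * r + (1.0 - rho2) * g * g
            s_hat = s / (1.0 - rho1 ** t)        # bias correction for initialization at zero
            r_hat = r / (1.0 - rho2 ** t)
            theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
        return theta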
erformed
fairly
robustly
no
single
est
algorithm
has
emerged.
Curren
tly
the
most
opular
optimization
algorithms
activ
ely
in
use
include
SGD,
SGD
with
momentum,
RMSProp,
RMSProp
with
momen
tum,
daDelta,
and
Adam.
The
choice
of
which
algorithm
to
use,
at
this
oin
t,
seems
to
dep
end
largely
on
the
user’s
familiarity
with
the
algorithm
(for
ease
of
hyperparameter
tuning).
8.6 Approximate Second-Order Methods

In this section we discuss the application of second-order methods to the training of deep networks. See LeCun et al. (1998a) for an earlier treatment of this subject. For simplicity of exposition, the only objective function we examine is the empirical risk:
\[ J(\theta) = \mathbb{E}_{x, y \sim \hat{p}_{\text{data}}}\big[ L(f(x;\theta), y) \big] = \frac{1}{m}\sum_{i=1}^{m} L\big(f(x^{(i)};\theta),\, y^{(i)}\big). \tag{8.25} \]
The methods we discuss here extend readily, however, to more general objective functions, such as those that include parameter regularization terms, as discussed in chapter 7.
8.6.1 Newton's Method

In section 4.3, we introduced second-order gradient methods. In contrast to first-order methods, second-order methods make use of second derivatives to improve optimization. The most widely used second-order method is Newton's method. We now describe Newton's method in more detail, with emphasis on its application to neural network training.

Newton's method is an optimization scheme based on using a second-order Taylor series expansion to approximate \(J(\theta)\) near some point \(\theta_0\), ignoring derivatives of higher order:
\[ J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^{\top}\nabla_{\theta}J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^{\top} H (\theta - \theta_0), \tag{8.26} \]
where \(H\) is the Hessian of \(J\) with respect to \(\theta\) evaluated at \(\theta_0\). If we then solve for the critical point of this function, we obtain the Newton parameter update rule:
\[ \theta^{*} = \theta_0 - H^{-1}\nabla_{\theta}J(\theta_0). \tag{8.27} \]
Th
us
for
lo
cally
quadratic
function
(with
positive
definite
),
rescaling
the
gradien
by
Newton’s
metho
jumps
directly
to
the
minimum. If
the
ob
jectiv
function
is
con
ex
but
not
quadratic
(there
are
higher-order
terms),
this
up
date
can
iterated,
yielding
the
training
algorithm
asso
ciated
with
Newton’s
metho
d,
given
in
algorithm
8.8
or
surfaces
that
are
not
quadratic,
as
long
as
the
Hessian
remains
ositive
definite,
Newton’s
metho
can
applied
iteratively
This
implies
tw
o-step
iterativ
pro
cedure.
First,
up
date
or
compute
the
inv
erse
Hessian
(i.e.,
by
up
dat-
ing
the
quadratic
approximation).
Second,
up
date
the
parameters
according
to
equation
8.27
In section 8.2.3, we discuss how Newton's method is appropriate only when the Hessian is positive definite. In deep learning, the surface of the objective function is typically nonconvex, with many features, such as saddle points, that are problematic for Newton's method. If the eigenvalues of the Hessian are not all positive, for example, near a saddle point, then Newton's method can actually cause updates to move in the wrong direction. This situation can be avoided by regularizing the Hessian. Common regularization strategies include adding a constant, \(\alpha\), along the diagonal of the Hessian. The regularized update becomes
\[ \theta^{*} = \theta_0 - \big[ H(f(\theta_0)) + \alpha I \big]^{-1} \nabla_{\theta} f(\theta_0). \tag{8.28} \]
This
regularization
strategy
is
used
in
approximations
to
Newton’s
metho
d,
suc
as
the
Lev
en
erg-Marquardt
algorithm
Lev
enberg
1944
Marquardt
1963
),
and
orks
fairly
well
as
long
as
the
negative
eigen
alues
of
the
Hessian
are
still
relatively
close
to
zero.
When
there
are
more
extreme
directions
of
curv
ature,
the
alue
of
ould
ha
to
sufficiently
large
to
offset
the
negativ
eigen
alues. As
increases
in
size,
ho
ever,
the
Hessian
ecomes
dominated
by
the
diagonal,
Algorithm 8.8 Newton's method with objective \(J(\theta) = \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)})\)
Require: Initial parameter \(\theta_0\)
Require: Training set of \(m\) examples
while stopping criterion not met do
    Compute gradient: \(g \leftarrow \frac{1}{m}\nabla_{\theta}\sum_i L(f(x^{(i)};\theta), y^{(i)})\)
    Compute Hessian: \(H \leftarrow \frac{1}{m}\nabla^{2}_{\theta}\sum_i L(f(x^{(i)};\theta), y^{(i)})\)
    Compute Hessian inverse: \(H^{-1}\)
    Compute update: \(\Delta\theta = -H^{-1} g\)
    Apply update: \(\theta = \theta + \Delta\theta\)
end while
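For very small parameter vectors, one regularized Newton step (equations 8.27 and 8.28) can be sketched directly. This is not from the original text; it solves the linear system rather than forming the inverse Hessian explicitly, and reg=0 recovers the plain Newton update.

    import numpy as np

    def newton_step(grad, hessian, theta, reg=0.0):
        """One (optionally regularized) Newton update for a small parameter vector theta."""
        H = hessian + reg * np.eye(theta.shape[0])   # add a constant along the diagonal (eq. 8.28)
        delta = np.linalg.solve(H, grad)             # solve H x = g instead of computing H^-1
        return theta - delta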
and
the
direction
chosen
Newton’s
metho
conv
erges
to
the
standard
gradient
divided
by
. When
strong
negative
curv
ature
is
present,
ma
need
to
so
large
that
Newton’s
metho
ould
make
smaller
steps
than
gradien
descent
with
prop
erly
hosen
learning
rate.
Bey
ond
the
challenges
created
by
certain
features
of
the
ob
jective
function,
suc
as
saddle
oin
ts,
the
application
of
Newton’s
metho
for
training
large
neural
net
orks
is
limited
the
significan
computational
burden
it
imposes.
The
um
er
of
elemen
ts
in
the
Hessian
is
squared
in
the
num
er
of
parameters,
so
with
parameters
(and
for
ev
en
very
small
neural
netw
orks,
the
um
er
of
parameters
can
in
the
millions),
Newton’s
metho
would
require
the
inv
ersion
of
matrix—with
computational
complexit
of
Also,
since
the
parameters
will
hange
with
every
up
date,
the
in
erse
Hessian
has
to
computed
at
every
tr
aining
iter
ation
As
consequence,
only
netw
orks
with
very
small
num
er
of
parameters
can
practically
trained
via
Newton’s
metho
d.
In
the
remainder
of
this
section,
discuss
alternatives
that
attempt
to
gain
some
of
the
adv
an
tages
of
Newton’s
metho
while
side-stepping
the
computational
hurdles.
8.6.2
Conjugate
Gradien
ts
Conjugate
gradients
is
metho
to
efficien
tly
av
oid
the
calculation
of
the
inv
erse
Hessian
by
iteratively
descending
conjugate
directions
The
inspiration
for
this
approac
follows
from
careful
study
of
the
weakness
of
the
metho
of
steep
est
descen
(see
section
4.3
for
details),
where
line
searches
are
applied
iteratively
in
the
direction
asso
ciated
with
the
gradient.
Figure
8.6
illustrates
how
the
metho
of
steep
est
descen
t,
when
applied
in
quadratic
wl,
progresses
in
rather
ineffectiv
bac
k-and-forth
zig-zag
pattern.
This
happ
ens
ecause
each
line
direction,
when
given
the
gradient,
is
guaranteed
to
orthogonal
to
the
previous
line
searc
direction.
Let
the
previous
searc
direction
the
minimum,
where
the
line
searc
terminates,
the
directional
deriv
ative
is
zero
in
direction
Since
the
gradient
at
this
oin
defines
the
current
direction,
will
hav
no
contribution
in
the
direction
Thus
is
orthogonal
to
This
relationship
etw
een
and
is
illustrated
in
figure
8.6
for
ultiple
iterations
of
steep
est
descen
t.
As
demonstrated
in
the
figure,
the
hoice
of
orthogonal
directions
of
descent
do
not
preserve
the
minimum
along
the
previous
searc
directions.
This
giv
es
rise
to
the
zig-zag
pattern
of
progress,
where
by
descending
to
the
minim
um
in
the
curren
gradien
direction,
we
ust
reminimize
the
ob
jectiv
in
the
previous
gradient
direction.
Thus,
follo
wing
the
gradient
at
the
end
of
each
line
searc
we
are,
in
sense,
undoing
progress
we
ha
already
Figure 8.6: The method of steepest descent applied to a quadratic cost surface. The method of steepest descent involves jumping to the point of lowest cost along the line defined by the gradient at the initial point on each step. This resolves some of the problems seen with using a fixed learning rate in figure 4.6, but even with the optimal step size the algorithm still makes back-and-forth progress toward the optimum. By definition, at the minimum of the objective along a given direction, the gradient at the final point is orthogonal to that direction.
made
in
the
direction
of
the
previous
line
search.
The
metho
of
conjugate
gradients
seeks
to
address
this
problem.
In
the
metho
of
conjugate
gradients,
seek
to
find
searc
direction
that
is
conjugate
to
the
previous
line
direction;
that
is,
it
will
not
undo
progress
made
in
that
direction.
training
iteration
the
next
direction
tak
es
the
form:
(8.29)
where
is
co
efficient
whose
magnitude
controls
how
uch
of
the
direction,
should
add
back
to
the
curren
direction.
directions,
and
are
defined
as
conjugate
if
= 0
where
is
the
Hessian
matrix.
The
straightforw
ard
wa
to
imp
ose
conjugacy
would
in
vo
lve
calculation
of
the
eigen
ectors
of
to
choose
whic
would
not
satisfy
our
goal
of
developing
metho
that
is
more
computationally
viable
than
Newton’s
metho
for
large
problems. Can
we
calculate
the
conjugate
directions
without
resorting
to
these
calculations?
ortunately
the
answer
to
that
is
yes.
Popular methods for computing the \(\beta_t\) are:

1. Fletcher-Reeves:
\[ \beta_t = \frac{\nabla_{\theta}J(\theta_t)^{\top}\nabla_{\theta}J(\theta_t)}{\nabla_{\theta}J(\theta_{t-1})^{\top}\nabla_{\theta}J(\theta_{t-1})} \tag{8.30} \]

2. Polak-Ribière:
\[ \beta_t = \frac{\big(\nabla_{\theta}J(\theta_t) - \nabla_{\theta}J(\theta_{t-1})\big)^{\top}\nabla_{\theta}J(\theta_t)}{\nabla_{\theta}J(\theta_{t-1})^{\top}\nabla_{\theta}J(\theta_{t-1})} \tag{8.31} \]
or
quadratic
surface,
the
conjugate
directions
ensure
that
the
gradient
along
the
previous
direction
do
es
not
increase
in
magnitude.
therefore
stay
at
the
minim
um
along
the
previous
directions.
As
consequence,
in
-dimensional
parameter
space,
the
conjugate
gradien
metho
requires
at
most
line
searches
to
ac
hiev
the
minimum.
The
conjugate
gradient
algorithm
is
given
in
algorithm
8.9
Algorithm 8.9 The conjugate gradient method
Require: Initial parameters \(\theta_0\)
Require: Training set of \(m\) examples
Initialize \(\rho_0 = 0\)
Initialize \(g_0 = 0\)
Initialize \(t = 1\)
while stopping criterion not met do
    Initialize the gradient \(g_t = 0\)
    Compute gradient: \(g_t \leftarrow \frac{1}{m}\nabla_{\theta}\sum_i L(f(x^{(i)};\theta), y^{(i)})\)
    Compute \(\beta_t = \frac{(g_t - g_{t-1})^{\top} g_t}{g_{t-1}^{\top} g_{t-1}}\)   (Polak-Ribière)
    (Nonlinear conjugate gradient: optionally reset \(\beta_t\) to zero, for example if \(t\) is a multiple of some constant \(k\), such as \(k = 5\))
    Compute search direction: \(\rho_t = -g_t + \beta_t \rho_{t-1}\)
    Perform line search to find: \(\epsilon^{*} = \operatorname{argmin}_{\epsilon} \frac{1}{m}\sum_{i=1}^{m} L(f(x^{(i)};\theta_t + \epsilon\rho_t), y^{(i)})\)
    (On a truly quadratic cost function, analytically solve for \(\epsilon^{*}\) rather than explicitly searching for it)
    Apply update: \(\theta_{t+1} = \theta_t + \epsilon^{*}\rho_t\)
end while
Nonlinear
Conjugate
Gradien
ts:
So
far
hav
discussed
the
metho
of
conjugate
gradien
ts
as
it
is
applied
to
quadratic
ob
jectiv
functions. Of
course,
our
primary
interest
in
this
chapter
is
to
explore
optimization
metho
ds
for
training
neural
netw
orks
and
other
related
deep
learning
mo
dels
where
the
corresp
onding
ob
jectiv
function
is
far
from
quadratic.
Perhaps
surprisingly
, the
metho
of
conjugate
gradien
ts
is
still
applicable
in
this
setting,
though
with
some
mo
dification.
Without
any
assurance
that
the
ob
jectiv
is
quadratic,
the
conjugate
directions
are
no
longer
assured
to
remain
at
the
minimum
of
the
ob
jectiv
for
previous
directions.
As
result,
the
nonlinear
conjugate
gradients
algorithm
includes
ccasional
resets
where
the
metho
of
conjugate
gradien
ts
is
restarted
with
line
searc
along
the
unaltered
gradien
t.
Practitioners
report
reasonable
results
in
applications
of
the
nonlinear
conjugate
gradien
ts
algorithm
to
training
neural
netw
orks,
though
it
is
often
eneficial
to
initialize
the
optimization
with
few
iterations
of
sto
chastic
gradient
descen
efore
commencing
nonlinear
conjugate
gradients.
Also,
while
the
(nonlinear)
conjugate
gradien
ts
algorithm
has
traditionally
een
cast
as
batch
metho
d,
minibatc
ersions
hav
een
used
successfully
for
training
neural
netw
orks
Le
et
al.
2011
).
daptations
of
conjugate
gradients
sp
ecifically
for
neural
netw
orks
hav
een
prop
osed
earlier,
suc
as
the
scaled
conjugate
gradients
algorithm
Moller
1993
).
8.6.3 BFGS

The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm attempts to bring some of the advantages of Newton's method without the computational burden. In that respect, BFGS is similar to the conjugate gradient method. However, BFGS takes a more direct approach to the approximation of Newton's update. Recall that Newton's update is given by
\[ \theta^{*} = \theta_0 - H^{-1}\nabla_{\theta}J(\theta_0), \tag{8.32} \]
where
is
the
Hessian
of
with
resp
ect
to
ev
aluated
at
The
primary
computational
difficult
in
applying
Newton’s
up
date
is
the
calculation
of
the
in
erse
Hessian
The
approac
adopted
quasi-Newton
metho
ds
(of
which
the
BFGS
algorithm
is
the
most
prominen
t)
is
to
approximate
the
in
erse
with
matrix
that
is
iterativ
ely
refined
lo
w-rank
up
dates
to
ecome
etter
appro
ximation
of
The
sp
ecification
and
deriv
ation
of
the
BFGS
appro
ximation
is
giv
en
in
many
textb
oks
on
optimization,
including
in
Luenberger
1984
).
Once the inverse Hessian approximation \(M_t\) is updated, the direction of descent, \(\rho_t\), is determined by \(\rho_t = M_t g_t\). A line search is performed in this direction to determine the size of the step, \(\epsilon^{*}\), taken in this direction. The final update to the parameters is given by
\[ \theta_{t+1} = \theta_t + \epsilon^{*}\rho_t. \tag{8.33} \]
Lik
the
metho
of
conjugate
gradients,
the
BFGS
algorithm
iterates
series
of
line
searc
hes
with
the
direction
incorp
orating
second-order
information.
Unlik
conjugate
gradients,
how
ev
er,
the
success
of
the
approach
is
not
heavily
dep
enden
on
the
line
finding
oint
ery
close
to
the
true
minimum
along
the
line.
Th
us,
relative
to
conjugate
gradien
ts,
BFGS
has
the
adv
an
tage
that
it
can
sp
end
less
time
refining
each
line
search.
On
the
other
hand,
the
BFGS
algorithm
must
store
the
inv
erse
Hessian
matrix,
that
requires
memory
making
BFGS
impractical
for
most
mo
dern
deep
learning
mo
dels
that
typically
ha
millions
of
parameters.
Limited Memory BF
GS (or L-BF
GS)
The memory
costs of
the BF
GS
algorithm
can
significantly
decreased
by
oiding
storing
the
complete
inv
erse
Hessian
approximation
The
L-BFGS
algorithm
computes
the
appro
ximation
using
the
same
metho
as
the
BF
GS
algorithm
but
eginning
with
the
assumption
that
1)
is
the
iden
tit
matrix,
rather
than
storing
the
approximation
from
one
step
to
the
next.
If
used
with
exact
line
searches,
the
directions
defined
by
L-BFGS
are
mutually
conjugate.
Ho
ev
er,
unlike
the
metho
of
conjugate
gradients,
this
pro
cedure
remains
well
eha
ved
when
the
minimum
of
the
line
searc
is
reac
hed
only
approximately
The
L-BFGS
strategy
with
no
storage
describ
ed
here
can
generalized
to
include
more
information
ab
out
the
Hessian
by
storing
some
of
the
ectors
used
to
up
date
at
each
time
step,
which
costs
only
er
step.
8.7
Optimization
Strategies
and
Meta-Algorithms
Man
optimization
tec
hniques
are
not
exactly
algorithms
but
rather
general
tem-
plates
that
can
sp
ecialized
to
yield
algorithms,
or
subroutines
that
can
incorp
orated
into
man
different
algorithms.
8.7.1
Batc
Normalization
Batc
normalization
Ioffe
and
Szegedy
2015
is
one
of
the
most
exciting
recent
inno
ations
in
optimizing
deep
neural
netw
orks,
and
it
is
actually
not
an
opti-
mization
algorithm
at
all.
Instead,
it
is
metho
of
adaptive
reparametrization,
motiv
ated
by
the
difficulty
of
training
very
deep
mo
dels.
ery
deep
mo
dels
inv
olve
the
comp
osition
of
sev
eral
functions,
or
lay
ers.
The
gradien
tells
how
to
up
date
each
parameter,
under
the
assumption
that
the
other
la
ers
do
not
change.
In
practice,
we
up
date
all
the
la
ers
simultaneously
When
we
mak
the
up
date,
unexp
ected
results
can
happ
en
ecause
man
functions
comp
osed
together
are
changed
simultaneously
using
up
dates
that
were
computed
under
the
assumption
that
the
other
functions
remain
constant.
As
simple
example,
supp
ose
we
ha
deep
neural
netw
ork
that
has
only
one
unit
er
lay
er
and
do
es
not
use
an
activ
ation
function
at
each
hidden
la
er:
xw
Here,
pro
vides
the
weigh
used
by
la
er
The
output
of
lay
er
is
The
output
is
linear
function
of
the
input
but
nonlinear
function
of
the
weigh
ts
Suppose
our
cost
function
has
put
gradient
of
on
so
wish
to
decrease
sligh
tly
The
back-propagation
algorithm
can
then
compute
gradien
Consider
what
happ
ens
when
we
make
an
up
date
. The
first-order
ylor
series
appro
ximation
of
predicts
that
the
alue
of
will
decrease
If
we
wan
ted
to
decrease
this
first-order
information
av
ailable
in
the
gradien
suggests
could
set
the
learning
rate
to
et,
the
actual
up
date
will
include
second-order
and
third-order
effects,
on
up
to
effects
of
order
The new value of \(\hat{y}\) is given by
\[ \hat{y} = x\,(w_1 - \epsilon g_1)(w_2 - \epsilon g_2)\cdots(w_l - \epsilon g_l). \tag{8.34} \]
An
example
of
one
second-order
term
arising
from
this
up
date
is
=3
This
term
migh
negligible
if
=3
is
small,
or
might
exp
onen
tially
large
if
the
weigh
ts
on
lay
ers
through
are
greater
than
This
mak
es
it
ery
hard
to
choose
an
appropriate
learning
rate,
ecause
the
effects
of
an
up
date
to
the
parameters
for
one
lay
er
dep
end
so
strongly
on
all
the
other
lay
ers.
Second-order
optimization
algorithms
address
this
issue
computing
an
up
date
that
tak
es
these
second-order
interactions
into
accoun
t,
but
we
can
see
that
in
very
deep
net
orks,
ev
en
higher-order
interactions
can
significan
t.
Ev
en
second-order
optimization
algorithms
are
exp
ensive
and
usually
require
umerous
approximations
that
preven
them
from
truly
accounting
for
all
significant
second-order
interactions.
Building
an
-th
order
optimization
algorithm
for
thus
seems
hop
eless.
What
can
we
do
instead?
Batc
normalization
provides
an
elegant
wa
of
reparametrizing
almost
any
deep
net
ork.
The
reparametrization
significantly
reduces
the
problem
of
co
ordinating
up
dates
across
man
lay
ers.
Batch
normalization
can
applied
to
any
input
or
hidden
la
er
in
netw
ork.
Let
minibatc
of
activ
ations
of
the
lay
er
to
normalize,
arranged
as
design
matrix,
with
the
activ
ations
for
eac
example
app
earing
in
row
of
the
matrix.
To normalize \(H\), we replace it with
\[ H' = \frac{H - \mu}{\sigma}, \tag{8.35} \]
where \(\mu\) is a vector containing the mean of each unit and \(\sigma\) is a vector containing the standard deviation of each unit. The arithmetic here is based on broadcasting the vector \(\mu\) and the vector \(\sigma\) to be applied to every row of the matrix \(H\). Within each row, the arithmetic is element-wise, so \(H_{i,j}\) is normalized by subtracting \(\mu_j\) and dividing by \(\sigma_j\). The rest of the network then operates on \(H'\) in exactly the same way that the original network operated on \(H\).

At training time,
\[ \mu = \frac{1}{m}\sum_{i} H_{i,:} \tag{8.36} \]
and
\[ \sigma = \sqrt{\delta + \frac{1}{m}\sum_{i}(H - \mu)^{2}_{i}}, \tag{8.37} \]
where \(\delta\) is a small positive value such as \(10^{-8}\) imposed to avoid encountering the undefined gradient of \(\sqrt{z}\) at \(z = 0\).
Crucially
we
ack-pr
op
agate
thr
ough
these
op
er
ations
for
computing
the
mean
and
the
standard
deviation,
and
for
applying
them
to
normalize
This
means
that
the
gradient
will
never
prop
ose
an
operation
that
acts
simply
to
increase
the
standard
deviation or
mean
of
the
normalization
op
erations
remov
the
effect
of
such
an
action
and
zero
out
its
comp
onent
in
the
gradient.
This
as
ma
jor
innov
ation
of
the
batch
normalization
approach. Previous
approaches
had
inv
olved
adding
enalties
to
the
cost
function
to
encourage
units
to
hav
normalized
activ
ation
statistics
or
in
olv
ed
interv
ening
to
renormalize
unit
statistics
after
each
gradient
descen
step.
The
former
approach
usually
resulted
in
imp
erfect
normalization
and
the
latter
usually
resulted
in
significant
wasted
time,
as
the
learning
algorithm
rep
eatedly
prop
osed
changing
the
mean
and
ariance,
and
the
normalization
step
rep
eatedly
undid
this
change.
Batch
normalization
reparametrizes
the
mo
del
to
make
some
units
alwa
ys
standardized
definition,
deftly
sidestepping
oth
problems.
test
time,
and
ma
be
replaced
by
running
erages
that
were
collected
during
training
time.
This
allows
the
mo
del
to
ev
aluated
on
single
example,
without
needing
to
use
definitions
of
and
that
dep
end
on
an
en
tire
minibatch.
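A minimal training-time sketch of equations 8.35 to 8.37 follows. It is not from the original text; it also applies the learned scale and shift parameters gamma and beta that are discussed later in this section, and it omits the running averages that would be maintained for test time.

    import numpy as np

    def batch_norm_forward(H, gamma, beta, eps=1e-8):
        """Training-time batch norm for a minibatch H (rows = examples, columns = units)."""
        mu = H.mean(axis=0)                                    # per-unit mean (eq. 8.36)
        sigma = np.sqrt(eps + ((H - mu) ** 2).mean(axis=0))    # per-unit std (eq. 8.37)
        H_norm = (H - mu) / sigma                              # eq. 8.35, broadcast across rows
        return gamma * H_norm + beta                           # learned rescaling, discussed below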
Revisiting
the
xw
example,
we
see
that
we
can
mostly
resolve
the
difficulties
in
learning
this
mo
del
by
normalizing
Supp
ose
that
is
drawn
from
unit
Gaussian.
Then
will
also
come
from
Gaussian,
ecause
the
transformation
from
to
is
linear.
Ho
wev
er,
will
no
longer
hav
zero
mean
and
unit
ariance.
After
applying
batch
normalization,
we
obtain
the
normalized
that
restores
the
zero
mean
and
unit
ariance
prop
erties.
or
almost
an
up
date
to
the
lo
er
lay
ers,
will
remain
unit
Gaussian.
The
output
ma
then
learned
as
simple
linear
function
Learning
in
this
mo
del
is
no
very
simple
ecause
the
parameters
at
the
low
er
la
ers
do
not
hav
an
effect
in
most
cases;
their
output
is
alwa
ys
renormalized
to
unit
Gaussian. In
some
corner
cases,
the
low
er
la
ers
can
ha
an
effect.
Changing
one
of
the
low
er
la
yer
eigh
ts
to
can
mak
the
output
ecome
degenerate,
and
changing
the
sign
of
one
of
the
low
er
eights
can
flip
the
relationship
etw
een
and
These
situations
are
very
rare.
Without
normalization,
nearly
ev
ery
up
date
would
ha
ve
an
extreme
effect
on
the
statistics
of
Batc
normalization
has
thus
made
this
mo
del
significan
tly
easier
to
learn.
In
this
example,
the
ease
of
learning
of
course
came
at
the
cost
of
making
the
low
er
lay
ers
useless.
In
our
linear
example,
the
low
er
lay
ers
no
longer
ha
an
harmful
effect,
but
they
also
no
longer
ha
an
eneficial
effect.
This
is
ecause
we
hav
normalized
out
the
first-
and
second-order
statistics,
which
is
all
that
linear
netw
ork
can
influence.
In
deep
neural
netw
ork
with
nonlinear
activ
ation
functions,
the
lo
er
la
ers
can
erform
nonlinear
transformations
of
the
data,
so
they
remain
useful.
Batc
normalization
acts
to
standardize
only
the
mean
and
ariance
of
each
unit
in
order
to
stabilize
learning,
but
it
allows
the
relationships
etw
een
units
and
the
nonlinear
statistics
of
single
unit
to
change.
Because
the
final
lay
er
of
the
net
ork
is
able
to
learn
linear
transformation,
ma
actually
wish
to
remo
all
linear
relationships
et
een
units
within
la
er.
Indeed,
this
is
the
approac
taken
Desjardins
et
al.
2015
),
who
provided
the
inspiration
for
batch
normalization.
Unfortunately
, eliminating
all
linear
in
teractions
is
muc
more
exp
ensive
than
standardizing
the
mean
and
standard
deviation
of
eac
individual
unit,
and
so
far
batc
normalization
remains
the
most
practical
approach.
Normalizing
the
mean
and
standard
deviation
of
unit
can
reduce
the
expressiv
er
of
the
neural
net
ork
con
taining
that
unit.
main
tain
the
expressive
er
of
the
netw
ork,
it
is
common
to
replace
the
batch
of
hidden
unit
activ
ations
with
rather
than
simply
the
normalized
The
ariables
and
are
learned
parameters
that
allow
the
new
ariable
to
hav
an
mean
and
standard
deviation.
At
first
glance,
this
may
seem
useless—wh
did
set
the
mean
to
and
then
introduce
parameter
that
allo
ws
it
to
set
back
to
an
arbitrary
alue
The
answ
er
is
that
the
new
parametrization
can
represent
the
same
family
of
functions
of
the
input
as
the
old
parametrization,
but
the
new
parametrization
has
different
learning
dynamics.
In
the
old
parametrization,
the
mean
of
as
determined
by
complicated
interaction
et
een
the
parameters
in
the
lay
ers
elo
In
the
new
parametrization,
the
mean
of
is
determined
solely
The
new
parametrization
is
muc
easier
to
learn
with
gradient
descent.
Most
neural
netw
ork
lay
ers
tak
the
form
of
where
is
some
fixed
nonlinear
activ
ation
function
such
as
the
rectified
linear
transformation.
It
is
natural
to
wonder
whether
we
should
apply
batch
normalization
to
the
input
or
to
the
transformed
alue
Ioffe
and
Szegedy
2015
recommend
the
latter.
More
sp
ecifically
should
replaced
by
normalized
version
of
The
bias
term
should
omitted
because
it
ecomes
redundant
with
the
parameter
applied
by
the
batch
normalization
reparametrization.
The
input
to
lay
er
is
usually
the
output
of
nonlinear
activ
ation
function
such
as
the
rectified
linear
function
in
previous
lay
er.
The
statistics
of
the
input
are
th
us
more
non-Gaussian
and
less
amenable
to
standardization
by
linear
op
erations.
In
conv
olutional
netw
orks,
describ
ed
in
chapter
it
is
imp
ortan
to
apply
the
same
normalizing
and
at
every
spatial
lo
cation
within
feature
map,
so
that
the
statistics
of
the
feature
map
remain
the
same
regardless
of
spatial
lo
cation.
8.7.2
Co
ordinate
Descent
In
some
cases,
it
may
ossible
to
solve
an
optimization
problem
quickly
breaking
it
into
separate
pieces.
If
we
minimize
with
resp
ect
to
single
ariable
, then
minimize
it
with
resp
ect
to
another v
ariable
, and
so
on,
rep
eatedly
cycling
through
all
ariables,
we
are
guaran
teed
to
arrive
at
(lo
cal)
minim
um.
This
practice
is
known
as
co
ordinate
descent
ecause
we
optimize
one
co
ordinate
at
time. More
generally
blo
co
ordinate
descen
refers
to
minimizing
with
resp
ect
to
subset
of
the
ariables
simultaneously
The
term
“co
ordinate
descent”
is
often
used
to
refer
to
blo
ck
co
ordinate
descent
as
ell
as
the
strictly
individual
co
ordinate
descen
t.
Co
ordinate
descent
makes
the
most
sense
when
the
different
ariables
in
the
optimization
problem
can
clearly
separated
into
groups
that
pla
relatively
isolated
roles,
or
when
optimization
with
resp
ect
to
one
group
of
ariables
is
significan
tly
more
efficient
than
optimization
with
resp
ect
to
all
of
the
ariables.
For example, consider the cost function
\[ J(H, W) = \sum_{i,j} |H_{i,j}| + \sum_{i,j} \big( X - W^{\top} H \big)^{2}_{i,j}. \tag{8.38} \]
This
function
describ
es
learning
problem
called
sparse
co
ding,
where
the
goal
is
to
find
weigh
matrix
that
can
linearly
deco
de
matrix
of
activ
ation
alues
to
reconstruct
the
training
set
Most
applications
of
sparse
co
ding
also
in
olv
eigh
deca
or
constraint
on
the
norms
of
the
columns
of
to
prev
en
the
pathological
solution
with
extremely
small
and
large
The
function
is
not
conv
ex.
Ho
ev
er, w
can
divide
the
inputs
to
the
training
algorithm
in
to
wo
sets:
the
dictionary
parameters
and
the
code
represen
tations
Minimizing
the
ob
jectiv
function
with
resp
ect
to
either
one
of
these
sets
of
ariables
is
conv
ex
problem.
Block
co
ordinate
descent
th
us
gives
us
an
optimization
strategy
that
allows
us
to
use
efficient
conv
ex
optimization
algorithms,
by
alternating
etw
een
optimizing
with
fixed,
then
optimizing
with
fixed.
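A structural sketch of this alternation, not from the original text, is shown below; update_H and update_W stand for hypothetical solvers of the two convex subproblems (for example, a per-column lasso for H and least squares for W).

    def block_coordinate_descent(X, update_H, update_W, H0, W0, n_iters=10):
        """Alternate the two convex subproblems of eq. 8.38: codes H with W fixed, then W with H fixed."""
        H, W = H0, W0
        for _ in range(n_iters):
            H = update_H(X, W)   # convex in H when W is held fixed
            W = update_W(X, H)   # convex in W when H is held fixed
        return H, W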
Co
ordinate
descent
is
not
very
go
strategy
when
the
alue
of
one
ariable
strongly
influences
the
optimal
alue
of
another
ariable,
as
in
the function
\[ f(x) = (x_1 - x_2)^2 + \alpha\,(x_1^2 + x_2^2), \]
where \(\alpha\) is a positive constant.
The
first
term
encourages
the
tw
ariables
to
ha
ve
similar
alue,
while
the
second
term
encourages
them
to
near
zero.
The
solution
is
to
set
oth
to
zero.
Newton’s
metho
can
solve
the
problem
in
single
step
ecause
it
is
ositiv
definite
quadratic
problem.
or
small
how
ever,
co
ordinate
descen
will
mak
very
slow
progress
ecause
the
first
term
do
es
not
allo
single
ariable
to
hanged
to
alue
that
differs
significan
tly
from
the
current
alue
of
the
other
ariable.
8.7.3
oly
ak
eraging
oly
ak
av
eraging
Poly
ak
and
Juditsky
1992
consists
of
av
eraging
several
oin
ts
in
the
tra
jectory
through
parameter
space
visited
by
an
optimization
algorithm.
If
iterations
of
gradient
descent
visit
oints
(1)
then
the
output
of
the
oly
ak
av
eraging
algorithm
is
On
some
problem
classes,
suc
as
gradien
descen
applied
to
con
ex
problems,
this
approac
has
strong
conv
ergence
guaran
tees.
When
applied
to
neural
net
orks,
its
justification
is
more
heuristic,
but
it
erforms
ell
in
practice.
The
basic
idea
is
that
the
optimization
algorithm
ma
leap
back
and
forth
across
alley
several
times
without
ever
visiting
oin
near
the
ottom
of
the
alley
The
av
erage
of
all
the
lo
cations
on
either
side
should
close
to
the
ottom
of
the
alley
though.
In nonconvex problems, the path taken by the optimization trajectory can be very complicated and visit many different regions. Including points in parameter space from the distant past that may be separated from the current point by large barriers in the cost function does not seem like a useful behavior. As a result, when applying Polyak averaging to nonconvex problems, it is typical to use an exponentially decaying running average:
\[ \hat{\theta}^{(t)} = \alpha\,\hat{\theta}^{(t-1)} + (1 - \alpha)\,\theta^{(t)}. \tag{8.39} \]
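A one-line sketch of this running average, not from the original text: the averaged copy of the parameters is maintained alongside the parameters that the optimizer actually updates, and the decay constant is an illustrative choice.

    import numpy as np

    def polyak_ema(theta_avg, theta, alpha=0.999):
        """One step of the exponentially decaying running average of eq. 8.39."""
        return alpha * theta_avg + (1.0 - alpha) * theta

    # Typical use: after each optimizer step, theta_avg = polyak_ema(theta_avg, theta),
    # and theta_avg (not theta) is used at evaluation time.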
The
running
erage
approach
is
used
in
numerous
applications.
See
Szegedy
et
al.
2015
for
recent
example.
8.7.4
Sup
ervised
Pretraining
Sometimes,
directly
training
mo
del
to
solve
sp
ecific
task
can
to
ambitious
if
the
mo
del
is
complex
and
hard
to
optimize
or
if
the
task
is
very
difficult.
It
is
sometimes
more
effectiv
to
train
simpler
mo
del
to
solve
the
task,
then
make
the
mo
del
more
complex.
It
can
also
more
effective
to
train
the
mo
del
to
solve
simpler
task,
then
mov
on
to
confront
the
final
task.
These
strategies
that
in
olv
training
simple
mo
dels
on
simple
tasks
efore
confronting
the
hallenge
of
training
the
desired
mo
del
to
erform
the
desired
task
are
collectively
known
as
pretraining
Greedy
algorithms
break
problem
in
to
many
comp
onents,
then
solve
for
the
optimal
ersion
of
each
comp
onen
in
isolation.
Unfortunately
combining
the
individually
optimal
comp
onents
is
not
guaran
teed
to
yield
an
optimal
complete
solution.
Nonetheless,
greedy
algorithms
can
computationally
muc
cheaper
than
algorithms
that
solve
for
the
est
joint
solution,
and
the
quality
of
greedy
solution
is
often
acceptable
if
not
optimal.
Greedy
algorithms
may
also
follow
ed
fine-tuning
stage
in
which
joint
optimization
algorithm
searc
hes
for
an
optimal
solution
to
the
full
problem.
Initializing
the
join
optimization
algorithm
with
greedy
solution
can
greatly
sp
eed
it
up
and
impro
ve
the
qualit
of
the
solution
it
finds.
Pretraining,
and
especially
greedy
pretraining,
algorithms
are
ubiquitous
in
deep
learning.
In
this
section,
describ
sp
ecifically
those
pretraining
algorithms
that
break
sup
ervised
learning
problems
in
to
other
simpler
sup
ervised
learning
problems.
This
approach
is
known
as
greedy
sup
ervised
pretraining
In
the
original
Bengio
et
al.
2007
version
of
greedy
sup
ervised
pretraining,
eac
stage
consists
of
sup
ervised
learning
training
task
inv
olving
only
subset
of
the
lay
ers
in
the
final
neural
net
ork.
An
example
of
greedy
sup
ervised
pretraining
is
illustrated
in
figure
8.7
in
which
eac
added
hidden
lay
er
is
pretrained
as
part
of
shallow
sup
ervised
MLP
taking
as
input
the
output
of
the
previously
trained
hidden
lay
er.
Instead
of
pretraining
one
lay
er
at
time,
Simony
an
and
Zisserman
2015
pretrain
deep
conv
olutional
netw
ork
(eleven
weigh
lay
ers)
and
then
use
the
first
four
and
last
three
la
yers
from
this
netw
ork
to
initialize
even
deep
er
net
orks
(with
up
to
nineteen
la
yers
of
weigh
ts). The
middle
lay
ers
of
the
new,
ery
deep
netw
ork
are
initialized
randomly
The
new
net
ork
is
then
jointly
trained.
Another
option,
explored
by
et
al.
2010
),
is
to
use
the
outputs
of
the
previously
trained
MLPs,
as
well
as
the
raw
input,
as
inputs
for
each
added
stage.
Wh
y w
ould greedy sup
ervised pretraining
help?
The hypothesis initially
discussed
by
Bengio
et
al.
2007
is
that
it
helps
to
pro
vide
etter
guidance
to
the
Figure 8.7: Illustration of one form of greedy supervised pretraining (Bengio et al., 2007). (a) We start by training a sufficiently shallow architecture. (b) Another drawing of the same architecture. (c) We keep only the input-to-hidden layer of the original network and discard the hidden-to-output layer. We send the output of the first hidden layer as input to another supervised single hidden layer MLP that is trained with the same objective as the first network was, thus adding a second hidden layer. This can be repeated for as many layers as desired. (d) Another drawing of the result, viewed as a feedforward network. To further improve the optimization, we can jointly fine-tune all the layers, either only at the end or at each stage of this process.
In general, pretraining may help both in terms of optimization and in terms of generalization.
An approach related to supervised pretraining extends the idea to the context of transfer learning: Yosinski et al. (2014) pretrain a deep convolutional net with eight layers of weights on a set of tasks (a subset of the 1,000 ImageNet object categories) and then initialize a same-size network with the first k layers of the first net. All the layers of the second network (with the upper layers initialized randomly) are then jointly trained to perform a different set of tasks (another subset of the 1,000 ImageNet object categories), with fewer training examples than for the first set of tasks. Other approaches to transfer learning with neural networks are discussed in section 15.2.
Another related line of work is the FitNets (Romero et al., 2015) approach. This approach begins by training a network that has low enough depth and great enough width (number of units per layer) to be easy to train. This network then becomes a teacher for a second network, designated the student. The student network is much deeper and thinner (eleven to nineteen layers) and would be difficult to train with SGD under normal circumstances. The training of the student network is made easier by training the student network not only to predict the output for the original task, but also to predict the value of the middle layer of the teacher network. This extra task provides a set of hints about how the hidden layers should be used and can simplify the optimization problem. Additional parameters are introduced to regress the middle layer of the five-layer teacher network from the middle layer of the deeper student network. Instead of predicting the final classification target, however, the objective is to predict the middle hidden layer of the teacher network. The lower layers of the student network thus have two objectives: to help the outputs of the student network accomplish their task, as well as to predict the intermediate layer of the teacher network.

Although a thin and deep network appears to be more difficult to train than a wide and shallow network, the thin and deep network may generalize better and certainly has lower computational cost if it is thin enough to have far fewer parameters. Without the hints on the hidden layer, the student network performs very poorly in the experiments, on both the training and the test set. Hints on middle layers may thus be one of the tools to help train neural networks that otherwise seem difficult to train, but other optimization techniques or changes in the architecture may also solve the problem.
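The following is a minimal sketch of how such a hint term might be combined with the task loss (our own illustration, not the FitNets reference code; the regressor parameters W_r, b_r and the weighting lam are assumptions of this sketch):

```python
import numpy as np

def hint_loss(student_mid, teacher_mid, W_r, b_r):
    """Regress the teacher's middle-layer activations from the student's
    middle-layer activations through extra parameters (W_r, b_r)."""
    predicted_teacher = student_mid @ W_r + b_r
    return np.mean((predicted_teacher - teacher_mid) ** 2)

def student_objective(task_loss, student_mid, teacher_mid, W_r, b_r, lam=0.5):
    """Total student loss: the original task loss plus the weighted hint term,
    so the lower student layers pursue both objectives described above."""
    return task_loss + lam * hint_loss(student_mid, teacher_mid, W_r, b_r)
```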
8.7.5 Designing Models to Aid Optimization
To improve optimization, the best strategy is not always to improve the optimization algorithm. Instead, many improvements in the optimization of deep models have come from designing the models to be easier to optimize.
In principle, we could use activation functions that increase and decrease in jagged nonmonotonic patterns, but this would make optimization extremely difficult. In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm.
Most of the advances in neural network learning over the past thirty years have been obtained by changing the model family rather than changing the optimization procedure. Stochastic gradient descent with momentum, which was used to train neural networks in the 1980s, remains in use in modern state-of-the-art neural network applications.
Specifically, modern neural networks reflect a design choice to use linear transformations between layers and activation functions that are differentiable almost everywhere, with significant slope in large portions of their domain. In particular, model innovations like the LSTM, rectified linear units, and maxout units have all moved toward using more linear functions than previous models like deep networks based on sigmoidal units. These models have nice properties that make optimization easier. The gradient flows through many layers provided that the Jacobian of the linear transformation has reasonable singular values. Moreover, linear functions consistently increase in a single direction, so even if the model's output is very far from correct, it is clear simply from computing the gradient which direction its output should move to reduce the loss function.
In other words, modern neural nets have been designed so that their local gradient information corresponds reasonably well to moving toward a distant solution.
Other model design strategies can help to make optimization easier. For example, linear paths or skip connections between layers reduce the length of the shortest path from the lower layer's parameters to the output, and thus mitigate the vanishing gradient problem (Srivastava et al., 2015).
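For example, the following schematic block (a NumPy sketch of our own, not any specific published architecture) adds its input back to its output, so the shortest path from the input to the block's output has length one:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def block_with_skip(x, W1, W2):
    """Two-layer transformation plus an identity (skip) connection.
    The added x term gives lower layers a short path to the output,
    which helps mitigate vanishing gradients."""
    h = relu(x @ W1)
    return x + h @ W2
```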
A related idea to skip connections is adding extra copies of the output that are attached to the intermediate hidden layers of the network, as in GoogLeNet (Szegedy et al., 2014a) and deeply supervised nets (Lee et al., 2014). These “auxiliary heads” are trained to perform the same task as the primary output at the top of the network to ensure that the lower layers receive a large gradient. When training is complete, the auxiliary heads may be discarded. This is an alternative to the pretraining strategies, which were introduced in the previous section.
In this way, one can train jointly all the layers in a single phase but change the architecture, so that intermediate layers (especially the lower ones) can get some hints about what they
should do, via a shorter path. These hints provide an error signal to the lower layers.
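Schematically, the combined training objective might look like the following sketch (our own illustration; the down-weighting factor is an assumption, and aux_losses stands for per-head task losses computed at intermediate layers):

```python
def loss_with_auxiliary_heads(primary_loss, aux_losses, aux_weight=0.3):
    """Sum the primary task loss with down-weighted losses from auxiliary
    heads attached to intermediate layers, so the lower layers receive a
    large gradient through a short path; the heads are discarded after training."""
    return primary_loss + aux_weight * sum(aux_losses)
```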
8.7.6 Continuation Methods and Curriculum Learning
As argued in section 8.2.7, many of the challenges in optimization arise from the global structure of the cost function and cannot be resolved merely by making better estimates of local update directions. The predominant strategy for overcoming this problem is to attempt to initialize the parameters in a region that is connected to the solution by a short path through parameter space that local descent can discover.
Continuation methods are a family of strategies that can make optimization easier by choosing initial points to ensure that local optimization spends most of its time in well-behaved regions of space.
The idea behind continuation methods is to construct a series of objective functions over the same parameters. To minimize a cost function $J(\boldsymbol{\theta})$, we construct new cost functions $\{J^{(0)}, \dots, J^{(n)}\}$. These cost functions are designed to be increasingly difficult, with $J^{(0)}$ being fairly easy to minimize and $J^{(n)}$, the most difficult, being $J(\boldsymbol{\theta})$, the true cost function motivating the entire process.
When we say that $J^{(i)}$ is easier than $J^{(i+1)}$, we mean that it is well behaved over more of $\boldsymbol{\theta}$ space. A random initialization is more likely to land in the region where local descent can minimize the cost function successfully, because this region is larger.
The series of cost functions are designed so that a solution to one is a good initial point of the next. We thus begin by solving an easy problem, then refine the solution to solve incrementally harder problems, until we arrive at a solution to the true underlying problem.
Traditional continuation methods (predating the use of continuation methods for neural network training) are usually based on smoothing the objective function. See Wu (1997) for an example of such a method and a review of some related methods. Continuation methods are also closely related to simulated annealing, which adds noise to the parameters (Kirkpatrick et al., 1983).
Continuation methods have been extremely successful in recent years. See Mobahi and Fisher (2015) for an overview of recent literature, especially for AI applications.
Continuation methods traditionally were mostly designed with the goal of overcoming the challenge of local minima. Specifically, they were designed to reach a global minimum despite the presence of many local minima. To do so, these continuation methods would construct easier cost functions by “blurring” the original cost function. This blurring operation can be done by approximating
$$J^{(i)}(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}' \sim \mathcal{N}(\boldsymbol{\theta}'; \boldsymbol{\theta}, \sigma^{(i)\,2})}\, J(\boldsymbol{\theta}') \tag{8.40}$$
via sampling. The intuition for this approach is that some nonconvex functions become approximately convex when blurred. In many cases, this blurring preserves
enough information about the location of a global minimum that we can find the global minimum by solving progressively less-blurred versions of the problem.
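Equation 8.40 can be approximated by Monte Carlo sampling, as in this small sketch (our own illustration; the number of samples and the seed are arbitrary assumptions):

```python
import numpy as np

def blurred_cost(J, theta, sigma, n_samples=100, seed=0):
    """Monte Carlo estimate of equation 8.40: the expectation of J over
    parameters drawn from a Gaussian centered at theta with scale sigma."""
    rng = np.random.default_rng(seed)
    samples = theta + sigma * rng.standard_normal((n_samples,) + theta.shape)
    return np.mean([J(s) for s in samples])

# A continuation method would minimize blurred_cost with a large sigma first,
# then re-solve with progressively smaller sigma, recovering the true J as sigma -> 0.
```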
This approach can break down in three different ways. First, it might successfully define a series of cost functions where the first is convex and the optimum tracks from one function to the next, arriving at the global minimum, but it might require so many incremental cost functions that the cost of the entire procedure remains high. NP-hard optimization problems remain NP-hard, even when continuation methods are applicable.
The other two ways that continuation methods fail both correspond to the method not being applicable. First, the function might not become convex, no matter how much it is blurred. Consider, for example, the function $J(\boldsymbol{\theta}) = -\boldsymbol{\theta}^\top\boldsymbol{\theta}$. Second, the function may become convex as a result of blurring, but the minimum of this blurred function may track to a local rather than a global minimum of the original cost function.
Though continuation methods were mostly originally designed to deal with the problem of local minima, local minima are no longer believed to be the primary problem for neural network optimization. Fortunately, continuation methods can still help. The easier objective functions introduced by the continuation method can eliminate flat regions, decrease variance in gradient estimates, improve conditioning of the Hessian matrix, or do anything else that will either make local updates easier to compute or improve the correspondence between local update directions and progress toward a global solution.
Bengio et al. (2009) observed that an approach called curriculum learning, or shaping, can be interpreted as a continuation method. Curriculum learning is based on the idea of planning a learning process to begin by learning simple concepts and progress to learning more complex concepts that depend on these simpler concepts. This basic strategy was previously known to accelerate progress in animal training (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009) and in machine learning (Solomonoff, 1989; Elman, 1993; Sanger, 1994).
Bengio et al. (2009) justified this strategy as a continuation method, where the earlier $J^{(i)}$ are made easier by increasing the influence of simpler examples (either by assigning their contributions to the cost function larger coefficients, or by sampling them more frequently), and experimentally demonstrated that better results could be obtained by following a curriculum on a large-scale neural language modeling task.
Curriculum learning has been successful on a wide range of natural language (Spitkovsky et al., 2010; Collobert et al., 2011a; Mikolov et al., 2011b; Tu and Honavar, 2011) and computer vision (Kumar et al., 2010; Lee and Grauman, 2011; Supancic and Ramanan, 2013) tasks.
Curriculum learning was also verified as being consistent with the way in which humans teach (Khan et al., 2011): teachers start by showing easier and more prototypical examples and then help the learner refine the decision surface
CHAPTER
8.
OPTIMIZA
TION
FOR
TRAINING
DEEP
MODELS
with the less obvious cases. Curriculum-based strategies are more effective for teaching humans than strategies based on uniform sampling of examples and can also increase the effectiveness of other teaching strategies (Basu and Christensen, 2013).
Another important contribution to research on curriculum learning arose in the context of training recurrent neural networks to capture long-term dependencies: Zaremba and Sutskever (2014) found that much better results were obtained with a stochastic curriculum, in which a random mix of easy and difficult examples is always presented to the learner, but where the average proportion of the more difficult examples (here, those with longer-term dependencies) is gradually increased. With a deterministic curriculum, no improvement over the baseline (ordinary training from the full training set) was observed.
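A minimal sketch of such a stochastic curriculum follows (our own illustration; the linear schedule for the proportion of difficult examples and the batch size are assumptions):

```python
import numpy as np

def curriculum_batch(easy_idx, hard_idx, step, total_steps, batch_size=32, rng=None):
    """Draw a minibatch that mixes easy and difficult examples at random, with
    the average proportion of difficult examples increasing over training."""
    rng = rng or np.random.default_rng()
    p_hard = min(1.0, step / total_steps)          # gradually increased proportion
    n_hard = rng.binomial(batch_size, p_hard)
    batch = np.concatenate([rng.choice(hard_idx, size=n_hard),
                            rng.choice(easy_idx, size=batch_size - n_hard)])
    rng.shuffle(batch)
    return batch
```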
We have now described the basic family of neural network models and how to regularize and optimize them. In the chapters ahead, we turn to specializations of the neural network family that allow neural networks to scale to very large sizes and process input data that has special structure. The optimization methods discussed in this chapter are often directly applicable to these specialized architectures with little or no modification.