Chapter 6

Deep Feedforward Networks
Deep feedforward networks, also called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation. These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself.

When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, as presented in chapter 10.

Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.
Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model. The name "deep learning" arose from this terminology.
The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x). The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f*(x). The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data do not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f*. Because the training data does not show the desired output for each of these layers, they are called hidden layers.
Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector valued. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representations is drawn from neuroscience. The choice of the functions used to compute these representations is also loosely guided by neuroscientific observations about the functions that biological neurons compute. Modern neural network research, however, is guided by many mathematical and engineering disciplines, and the goal of neural networks is not to perfectly model the brain. It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.
One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations. Linear models, such as logistic regression and linear regression, are appealing because they can be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.

To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation. Equivalently, we can apply the kernel trick described in section 5.7.2 to obtain a nonlinear learning algorithm based on implicitly applying the φ mapping. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

The question is then how to choose the mapping φ.
1. One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.
2. Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. It requires decades of human effort for each separate task, with practitioners specializing in different domains, such as speech recognition or computer vision, and with little transfer between domains.
3. The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)⊤w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as φ(x; θ) and use the optimization algorithm to find the θ that corresponds to a good representation. If we wish, this approach can capture the benefit of the first approach by being highly generic; we do so by using a very broad family φ(x; θ).
Deep learning can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function.

This general principle of improving models by learning features extends beyond the feedforward networks described in this chapter. It is a recurring theme of deep learning that applies to all the kinds of models described throughout this book. Feedforward networks are the application of this principle to learning deterministic mappings from x to y that lack feedback connections. Other models, presented later, apply these principles to learning stochastic mappings, functions with feedback, and probability distributions over a single vector.
We begin this chapter with a simple example of a feedforward network. Next, we address each of the design decisions needed to deploy a feedforward network. First, training a feedforward network requires making many of the same design decisions as are necessary for a linear model: choosing the optimizer, the cost function, and the form of the output units. We review these basics of gradient-based learning, then proceed to confront some of the design decisions that are unique to feedforward networks.

Feedforward networks have introduced the concept of a hidden layer, and this requires us to choose the activation functions that will be used to compute the hidden layer values. We must also design the architecture of the network, including how many layers the network should contain, how these layers should be connected to each other, and how many units should be in each layer. Learning in deep neural networks requires computing the gradients of complicated functions. We present the back-propagation algorithm and its modern generalizations, which can be used to efficiently compute these gradients. Finally, we close with some historical perspective.
6.1 Example: Learning XOR
To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.

The XOR function ("exclusive or") is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ), and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.

In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points X = {[0, 0]⊤, [0, 1]⊤, [1, 0]⊤, [1, 1]⊤}. We will train the network on all four of these points. The only challenge is to fit the training set.
We can treat this problem as a regression problem and use a mean squared error loss function. We have chosen this loss function to simplify the math for this example as much as possible. In practical applications, MSE is usually not an appropriate cost function for modeling binary data. More appropriate approaches are described in section 6.2.2.2.

Evaluated on our whole training set, the MSE loss function is

    J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))².     (6.1)
Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

    f(x; w, b) = x⊤w + b.     (6.2)
We can minimize J(θ) in closed form with respect to w and b using the normal equations. After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? Figure 6.1 shows how a linear model is not able to represent the XOR function.
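To make this concrete, here is a minimal NumPy sketch (our own illustration, not code from the text) that solves the normal equations for this four-point dataset. A column of ones is appended to the inputs so the bias b is learned as an extra weight:

```python
import numpy as np

# Four XOR input points (one per row) and their targets.
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# Append a column of ones so the bias b is learned as an extra weight.
A = np.hstack([X, np.ones((4, 1))])

# Solve the normal equations (A^T A) theta = A^T y.
theta = np.linalg.solve(A.T @ A, A.T @ y)
w, b = theta[:2], theta[2]

print(w, b)       # w = [0. 0.], b = 0.5
print(A @ theta)  # the model outputs 0.5 on every point
```

The solution is w = [0, 0] and b = 0.5, so the fitted linear model outputs 0.5 on all four points, matching the result stated above.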
One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution. Specifically, we will introduce a simple feedforward network with one hidden layer containing two hidden units. See figure 6.2 for an illustration of this model.
This feedforward network has a vector of hidden units h that are computed by a function f^(1)(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to h rather than to x. The network now contains two functions chained together, h = f^(1)(x; W, c) and y = f^(2)(h; w, b), with the complete model being f(x; W, c, w, b) = f^(2)(f^(1)(x)).

What function should f^(1) compute? Linear models have served us well so far, and it may be tempting to make f^(1) linear as well. Unfortunately, if f^(1) were linear, then the feedforward network as a whole would remain a linear function of its input. Ignoring the intercept terms for the moment, suppose f^(1)(x) = W⊤x and f^(2)(h) = h⊤w. Then f(x) = w⊤W⊤x. We could represent this function as f(x) = x⊤w′, where w′ = Ww.
Clearly, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function. We use that strategy here, by defining h = g(W⊤x + c), where W provides the weights of a linear transformation and c the biases. Previously, to describe a linear regression model, we used a vector of weights and a scalar bias parameter to describe an affine transformation from an input vector to an output scalar.
[Two plots: "Original space" and "Learned space."]

Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]⊤ and x = [0, 1]⊤ to a single point in feature space, h = [1, 0]⊤. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is explicit and unambiguous, but for networks larger than this example, it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer's activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.
[Plot of g(z) = max{0, z}.]

Figure 6.3: The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. The function remains very close to linear, however, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well.
A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.
Now we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters c is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(x⊤W_{:,i} + c_i).
In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in figure 6.3.

We can now specify our complete network as

    f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.     (6.3)
We can now specify a solution to the XOR problem. Let

    W = [[1, 1],
         [1, 1]],     (6.4)

    c = [0, −1]⊤,     (6.5)

    w = [1, −2]⊤,     (6.6)

and b = 0.
We can now walk through how the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

    X = [[0, 0],
         [0, 1],
         [1, 0],
         [1, 1]].     (6.7)

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

    XW = [[0, 0],
          [1, 1],
          [1, 1],
          [2, 2]].     (6.8)

Next, we add the bias vector c, to obtain

    [[0, −1],
     [1, 0],
     [1, 0],
     [2, 1]].     (6.9)
In this space, all the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

    [[0, 0],
     [1, 0],
     [1, 0],
     [2, 1]].     (6.10)
This transformation has changed the relationship between the examples. They no longer lie on a single line. As shown in figure 6.1, they now lie in a space where a linear model can solve the problem. We finish by multiplying by the weight vector w:

    [0, 1, 1, 0]⊤.     (6.11)

The neural network has obtained the correct answer for every example in the batch.
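The walkthrough above can be verified numerically. The following NumPy sketch (our illustration, using the parameter values of equations 6.4-6.6) runs the forward pass of equation 6.3 on the whole batch:

```python
import numpy as np

# Parameters of the XOR solution from equations 6.4-6.6.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

# Design matrix X from equation 6.7: one binary input point per row.
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])

# Forward pass of equation 6.3: f(x) = w^T max{0, W^T x + c} + b.
h = np.maximum(0.0, X @ W + c)   # rectified linear hidden layer (eq. 6.10)
y = h @ w + b                    # linear output layer (eq. 6.11)

print(y)  # [0. 1. 1. 0.], the XOR of each row of X
```

The intermediate array h reproduces the matrix in equation 6.10, and the final output matches equation 6.11 exactly.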
In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one presented here.
6.2 Gradient-Based Learning
Designing and training a neural network is not much different from training any other machine learning model with gradient descent. In section 5.10, we described how to build a machine learning algorithm by specifying an optimization procedure, a cost function, and a model family.
The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become nonconvex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. Convex optimization converges starting from any initial parameters (in theory; in practice it is robust but can encounter numerical problems). Stochastic gradient descent applied to nonconvex loss functions has no such convergence guarantee and is sensitive to the values of the initial parameters. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values.
The iterative gradient-based optimization algorithms used to train feedforward networks and almost all other deep models are described in detail in chapter 8, with parameter initialization in particular discussed in section 8.4. For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another. The specific algorithms are improvements and refinements on the ideas of gradient descent, introduced in section 4.3, and, more specifically, are most often improvements of the stochastic gradient descent algorithm, introduced in section 5.9.
We can of course train models such as linear regression and support vector machines with gradient descent too, and in fact this is common when the training set is extremely large. From this point of view, training a neural network is not much different from training any other model. Computing the gradient is slightly more complicated for a neural network but can still be done efficiently and exactly. In section 6.5, we describe how to obtain the gradient using the back-propagation algorithm and modern generalizations of the back-propagation algorithm.

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. We now revisit these design considerations with special emphasis on the neural networks scenario.
6.2.1 Cost Functions
An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x; θ), and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function. Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions enable us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term. We have already seen some simple examples of regularization applied to linear models in section 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks are described in chapter 7.
6.2.1.1 Learning Conditional Distributions with Maximum Likelihood
Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

    J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).     (6.12)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model.
The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

    J(θ) = (1/2) E_{x,y∼p̂_data} ||y − f(x; θ)||² + const,     (6.13)

up to a scaling factor of 1/2 and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize.
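As a quick numerical check of this equivalence (our own illustration, not from the text): the Gaussian negative log-likelihood with identity covariance differs from the squared-error cost only by a constant, so differences in cost between any two candidate predictions agree exactly:

```python
import numpy as np

# Check numerically that the Gaussian negative log-likelihood with
# identity covariance equals 0.5 * ||y - f||^2 plus a constant.
rng = np.random.default_rng(0)
y = rng.normal(size=3)    # a target vector
f1 = rng.normal(size=3)   # two candidate predictions f(x; theta)
f2 = rng.normal(size=3)

def gaussian_nll(y, mean):
    # -log N(y; mean, I) for a d-dimensional Gaussian.
    d = y.size
    return 0.5 * np.sum((y - mean) ** 2) + 0.5 * d * np.log(2 * np.pi)

# The difference in NLL between two predictions depends only on the
# squared-error terms; the constant 0.5 * d * log(2*pi) cancels.
lhs = gaussian_nll(y, f1) - gaussian_nll(y, f2)
rhs = 0.5 * np.sum((y - f1) ** 2) - 0.5 * np.sum((y - f2) ** 2)
print(np.isclose(lhs, rhs))  # True
```

Because the constant cancels, any minimizer of the Gaussian NLL is also a minimizer of the mean squared error, and vice versa.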
Previously, we saw that the equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.

An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function log p(y | x).
One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to avoid this problem for many models. Several output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units. We will discuss the interaction between the cost function and the choice of output unit in section 6.2.2.
One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model. For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.
6.2.1.2 Learning Conditional Statistics
Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x. For example, we may have a predictor f(x; θ) that we wish to employ to predict the mean of y.

If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. From this point of view, we can view the cost function as being a functional rather than just a function. A functional is a mapping from functions to real numbers. We can thus think of learning as choosing a function rather than merely choosing a set of parameters. We can design our cost functional to have its minimum occur at some specific function we desire. For example, we can design the cost functional to have its minimum lie on the function that maps x to the expected value of y given x.
Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations, described in section 19.4.2. It is not necessary to understand calculus of variations to understand the content of this chapter. At the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.
Our first result derived using calculus of variations is that solving the optimization problem

    f* = arg min_f E_{x,y∼p_data} ||y − f(x)||²     (6.14)

yields

    f*(x) = E_{y∼p_data(y|x)} [y],     (6.15)

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function would give a function that predicts the mean of y for each value of x. Different cost functions give different statistics.
A second result derived using calculus of variations is that

    f* = arg min_f E_{x,y∼p_data} ||y − f(x)||₁     (6.16)

yields a function that predicts the median value of y for each x, as long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.
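As an illustrative sketch (our own, not from the text), we can verify both results on a one-dimensional sample by searching over constant predictions: the squared-error cost is minimized near the sample mean, while the absolute-error cost is minimized near the sample median:

```python
import numpy as np

# A small sample standing in for draws of y at one fixed x.
y = np.array([0.0, 1.0, 1.0, 2.0, 10.0])

# Evaluate both costs over a grid of candidate constant predictions.
candidates = np.linspace(-5.0, 15.0, 20001)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best_mse = candidates[np.argmin(mse)]  # close to mean(y)   = 2.8
best_mae = candidates[np.argmin(mae)]  # close to median(y) = 1.0

print(best_mse, np.mean(y))
print(best_mae, np.median(y))
```

Note how the outlier at 10 pulls the squared-error minimizer (the mean) away from the bulk of the data, while the absolute-error minimizer (the median) stays put.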
Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).
6.2.2 Output Units
The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function.

Any kind of neural network unit that may be used as an output can also be used as a hidden unit. Here, we focus on the use of these units as outputs of the model, but in principle they can be used internally as well. We revisit these units with additional detail about their use as hidden units in section 6.3.

Throughout this section, we suppose that the feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.
6.2.2.1 Linear Units for Gaussian Output Distributions
One simple kind of output unit is based on an affine transformation with no nonlinearity. These are often just called linear units. Given features h, a layer of linear output units produces a vector ŷ = W⊤h + b.

Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

    p(y | x) = N(y; ŷ, I).     (6.17)

Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.
The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Approaches to modeling the covariance are described shortly, in section 6.2.2.4.

Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.
6.2.2.2 Sigmoid Units for Bernoulli Output Distributions
Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form.

The maximum likelihood approach is to define a Bernoulli distribution over y conditioned on x. A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1]. Satisfying this constraint requires some careful design effort.
Suppose we were to use a linear unit and threshold its value to obtain a valid probability:

    P(y = 1 | x) = max{0, min{1, w⊤h + b}}.     (6.18)

This would indeed define a valid conditional distribution, but we would not be able to train it very effectively with gradient descent. Any time that w⊤h + b strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be 0. A gradient of 0 is typically problematic because the learning algorithm no longer has a guide for how to improve the corresponding parameters.
Instead, it is better to use a different approach that ensures there is always a strong gradient whenever the model has the wrong answer. This approach is based on using sigmoid output units combined with maximum likelihood. A sigmoid output unit is defined by

    ŷ = σ(w⊤h + b),     (6.19)

where σ is the logistic sigmoid function described in section 3.10.
We can think of the sigmoid output unit as having two components. First, it uses a linear layer to compute z = w⊤h + b. Next, it uses the sigmoid activation function to convert z into a probability. We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z.
The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:

    log P̃(y) = yz,     (6.20)
    P̃(y) = exp(yz),     (6.21)
    P(y) = exp(yz) / Σ_{y′=0}^{1} exp(y′z),     (6.22)
    P(y) = σ((2y − 1)z).     (6.23)
Probability distributions based on exponentiation and normalization are common throughout the statistical modeling literature. The variable z defining such a distribution over binary variables is called a logit. This approach to predicting the probabilities in log space is natural to use with maximum likelihood learning.
Because the cost function used with maximum likelihood is −log P(y | x), the log in the cost function undoes the exp of the sigmoid. Without this effect, the saturation of the sigmoid could prevent gradient-based learning from making good progress. The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is

    J(θ) = −log P(y | x)     (6.24)
         = −log σ((2y − 1)z)     (6.25)
         = ζ((1 − 2y)z).     (6.26)

This derivation makes use of some properties from section 3.10.
By
rewriting
the
loss
in
of
the
softplus
function,
we
can
see
that
it
saturates
only
when
(1
is
very
negative.
Saturation
thus
ccurs
only
when
the
mo
del
already
has
the
righ
answer—when
= 1
and
is
very
ositiv
e,
or
= 0
and
is
very
negativ
e.
When
has
the
wrong
sign,
the
argument
to
the
softplus
function,
(1
may
simplified
to
As
ecomes
large
while
has
the
wrong
sign,
the
softplus
function
asymptotes
tow
ard
simply
returning
its
argumen
The
deriv
ative
with
resp
ect
to
asymptotes
to
sign
so,
in
the
limit
of
extremely
incorrect
the
softplus
function
do
es
not
shrink
the
gradien
at
all.
This
prop
ert
is
useful
ecause
it
means
that
gradient-based
learning
can
act
to
quic
kly
correct
mistaken
When we use other loss functions, such as mean squared error, the loss can saturate anytime \sigma(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive. The gradient can shrink too small to be useful for learning when this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.
Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]. In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of \hat{y} = \sigma(z). If the sigmoid function underflows to zero, then taking the logarithm of \hat{y} yields negative infinity.
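A minimal sketch of this numerical point, in plain Python (function names are illustrative, not from the text): writing the loss as \zeta((1 - 2y)z) via a stable softplus stays finite even where the \hat{y}-based form would take the log of an underflowed zero.

```python
import math

def softplus(x):
    # Stable evaluation of zeta(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def nll_from_z(y, z):
    # Equation 6.26 written directly in terms of z: zeta((1 - 2y) z).
    return softplus((1.0 - 2.0 * y) * z)

def nll_from_yhat(y, z):
    # Naive version: -log sigma(z) or -log(1 - sigma(z)).
    # Fails for extreme z, where sigma(z) rounds to exactly 0 or 1.
    yhat = 1.0 / (1.0 + math.exp(-z))
    return -math.log(yhat if y == 1 else 1.0 - yhat)

# The two forms agree for moderate z ...
assert abs(nll_from_z(1, 2.0) - nll_from_yhat(1, 2.0)) < 1e-9
# ... but only the z-based form survives extremely incorrect z:
assert nll_from_z(1, -800.0) == 800.0  # finite, and equal to |z| as derived above
```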
6.2.2.3 Softmax Units for Multinoulli Output Distributions
Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function, which was used to represent a probability distribution over a binary variable.

Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.
In the case of binary variables, we wished to produce a single number

\hat{y} = P(y = 1 \mid x).    (6.27)

Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number z = \log \tilde{P}(y = 1 \mid x). Exponentiating and normalizing gave us a Bernoulli distribution controlled by the sigmoid function.
To generalize to the case of a discrete variable with n values, we now need to produce a vector \hat{y}, with \hat{y}_i = P(y = i \mid x). We require not only that each element of \hat{y} be between 0 and 1, but also that the entire vector sums to 1, so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution. First, a linear layer predicts unnormalized log probabilities:

z = W^\top h + b,    (6.28)

where z_i = \log \tilde{P}(y = i \mid x). The softmax function can then exponentiate and normalize z to obtain the desired \hat{y}. Formally, the softmax function is given by

\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}.    (6.29)
As with the logistic sigmoid, the use of the exp function works very well when training the softmax to output a target value y using maximum log-likelihood. In this case, we wish to maximize \log P(y = i; z) = \log \mathrm{softmax}(z)_i. Defining the softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:

\log \mathrm{softmax}(z)_i = z_i - \log \sum_j \exp(z_j).    (6.30)
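Equation 6.30 can be transcribed directly (a plain-Python sketch, deliberately naive about overflow; a stable variant appears with equation 6.33): exponentiating the result recovers a valid probability distribution.

```python
import math

def log_softmax(z):
    # Direct transcription of equation 6.30:
    # log softmax(z)_i = z_i - log sum_j exp(z_j).
    lse = math.log(sum(math.exp(zj) for zj in z))
    return [zi - lse for zi in z]

# Exponentiating the log probabilities yields values that sum to 1.
probs = [math.exp(v) for v in log_softmax([1.0, 2.0, 3.0])]
assert abs(sum(probs) - 1.0) < 1e-12
```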
The first term of equation 6.30 shows that the input z_i always has a direct contribution to the cost function. Because this term cannot saturate, we know that learning can proceed, even if the contribution of z_i to the second term of equation 6.30 becomes very small. When maximizing the log-likelihood, the first term encourages z_i to be pushed up, while the second term encourages all of z to be pushed down. To gain some intuition for the second term, \log \sum_j \exp(z_j), observe that this term can be roughly approximated by \max_j z_j. This approximation is based on the idea that \exp(z_j) is insignificant for any z_j that is noticeably less than \max_j z_j. The intuition we can gain from this approximation is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, then the -z_i term and the \log \sum_j \exp(z_j) \approx \max_j z_j = z_i term will roughly cancel. This example will then contribute little to the overall training cost, which will be dominated by other examples that are not yet correctly classified.
So far we have discussed only a single example. Overall, unregularized maximum likelihood will drive the model to learn parameters that drive the softmax to predict the fraction of counts of each outcome observed in the training set:

\mathrm{softmax}(z(x; \theta))_i \approx \frac{\sum_{j=1}^{m} \mathbf{1}_{y^{(j)} = i,\, x^{(j)} = x}}{\sum_{j=1}^{m} \mathbf{1}_{x^{(j)} = x}}.    (6.31)

Because maximum likelihood is a consistent estimator, this is guaranteed to happen as long as the model family is capable of representing the training distribution. In practice, limited model capacity and imperfect optimization will mean that the model is only able to approximate these fractions.
Many objective functions other than the log-likelihood do not work as well with the softmax function. Specifically, objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish. In particular, squared error is a poor loss function for softmax units and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions (Bridle, 1990).
To understand why these other loss functions can fail, we need to examine the softmax function itself. Like the sigmoid, the softmax activation can saturate. The sigmoid function has a single output that saturates when its input is extremely negative or extremely positive. The softmax has multiple output values. These output values can saturate when the differences between input values become extreme. When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activation function.
To see that the softmax function responds to the difference between its inputs, observe that the softmax output is invariant to adding the same scalar c to all its inputs:

\mathrm{softmax}(z) = \mathrm{softmax}(z + c).    (6.32)

Using this property, we can derive a numerically stable variant of the softmax:

\mathrm{softmax}(z) = \mathrm{softmax}(z - \max_i z_i).    (6.33)
The reformulated version enables us to evaluate softmax with only small numerical errors, even when z contains extremely large or extremely negative numbers.
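The stable variant of equation 6.33 can be sketched in plain Python (names are illustrative): subtracting \max_i z_i keeps every exponent at or below zero, so exp never overflows.

```python
import math

def softmax(z):
    # Equation 6.33: shifting by max(z) leaves the output unchanged
    # (equation 6.32 with c = -max(z)) but keeps every exponent <= 0.
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

# Works even for inputs that would overflow a naive exp(z_i):
p = softmax([1000.0, 1001.0, 1002.0])
assert abs(sum(p) - 1.0) < 1e-12
assert p[2] > p[1] > p[0]
```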
Examining the numerically stable variant, we see that the softmax function is driven by the amount that its arguments deviate from \max_i z_i. An output \mathrm{softmax}(z)_i saturates to 1 when the corresponding input is maximal (z_i = \max_i z_i) and z_i is much greater than all the other inputs. The output \mathrm{softmax}(z)_i can also saturate to 0 when z_i is not maximal and the maximum is much greater. This is a generalization of the way that sigmoid units saturate and can cause similar difficulties for learning if the loss function is not designed to compensate for it.
The argument z to the softmax function can be produced in two different ways. The most common is simply to have an earlier layer of the neural network output every element of z, as described above using the linear layer z = W^\top h + b. While straightforward, this approach actually overparametrizes the distribution. The constraint that the n outputs must sum to 1 means that only n - 1 parameters are necessary; the probability of the n-th value may be obtained by subtracting the first n - 1 probabilities from 1. We can thus impose a requirement that one element of z be fixed. For example, we can require that z_n = 0. Indeed, this is exactly what the sigmoid unit does. Defining P(y = 1 \mid x) = \sigma(z) is equivalent to defining P(y = 1 \mid x) = \mathrm{softmax}(z)_1 with a two-dimensional z and z_1 = 0.
Both the n - 1 argument and the n argument approaches to the softmax can describe the same set of probability distributions but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.
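The equivalence between the sigmoid and a restricted two-way softmax can be verified directly (a minimal plain-Python sketch, names illustrative): fixing the first logit to 0 makes the second softmax output coincide with \sigma(z).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

# Restricted parametrization: a two-dimensional z with its first element
# fixed to 0 reproduces the sigmoid exactly.
for z in (-4.0, 0.0, 2.5):
    assert abs(softmax([0.0, z])[1] - sigmoid(z)) < 1e-12
```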
From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it: the softmax outputs always sum to 1, so an increase in the value of one unit necessarily corresponds to a decrease in the value of others. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex. At the extreme (when the difference between the maximal value and the others is large in magnitude), it becomes a form of winner-take-all (one of the outputs is nearly 1, and the others are nearly 0).
The name “softmax” can be somewhat confusing. The function is more closely related to the arg max function than to the max function. The term “soft” derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous or differentiable. The softmax function thus provides a “softened” version of the arg max. The corresponding soft version of the maximum function is \mathrm{softmax}(z)^\top z. It would perhaps be better to call the softmax function “softargmax,” but the current name is an entrenched convention.
6.2.2.4 Other Output Types
The linear, sigmoid, and softmax output units described above are the most common. Neural networks can generalize to almost any kind of output layer that we wish. The principle of maximum likelihood provides a guide for how to design a good cost function for nearly any kind of output layer.

In general, if we define a conditional distribution p(y \mid x; \theta), the principle of maximum likelihood suggests we use -\log p(y \mid x; \theta) as our cost function.

In general, we can think of the neural network as representing a function f(x; \theta). The outputs of this function are not direct predictions of the value y. Instead, f(x; \theta) = \omega provides the parameters for a distribution over y. Our loss function can then be interpreted as -\log p(y; \omega(x)).
For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance \sigma^2 is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value. A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution p(y \mid x) that is controlled by \omega = f(x; \theta). The negative log-likelihood -\log p(y; \omega(x)) will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance.
In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into \omega. This new parameter might be \sigma itself, or it could be a parameter v representing \sigma^2, or it could be a parameter \beta representing \frac{1}{\sigma^2}, depending on how we choose to parametrize the distribution. We may wish our model to predict a different amount of variance in y for different values of x. This is called a heteroscedastic model. In the heteroscedastic case, we simply make the specification of the variance be one of the values output by f(x; \theta). A typical way to do this is to formulate the Gaussian distribution using precision, rather than variance, as described in equation 3.22.
In the multivariate case, it is most common to use a diagonal precision matrix

\mathrm{diag}(\beta).    (6.34)

This formulation works well with gradient descent, because the formula for the log-likelihood of the Gaussian distribution parametrized by \beta involves only multiplication by \beta_i and addition of \log \beta_i. The gradient of multiplication, addition, and logarithm operations is well behaved. By comparison, if we parametrized the output in terms of variance, we would need to use division. The division function becomes arbitrarily steep near zero.
While large gradients can help learning, arbitrarily large gradients usually result in instability. If we parametrized the output in terms of standard deviation, the log-likelihood would still involve division, as well as squaring. The gradient through the squaring operation can vanish near zero, making it difficult to learn parameters that are squared.
Regardless of whether we use standard deviation, variance, or precision, we must ensure that the covariance matrix of the Gaussian is positive definite. Because the eigenvalues of the precision matrix are the reciprocals of the eigenvalues of the covariance matrix, this is equivalent to ensuring that the precision matrix is positive definite. If we use a diagonal matrix, or a scalar times the diagonal matrix, then the only condition we need to enforce on the output of the model is positivity.
If we suppose that a is the raw activation of the model used to determine the diagonal precision, we can use the softplus function to obtain a positive precision vector: \beta = \zeta(a). This same strategy applies equally if using variance or standard deviation rather than precision, or if using a scalar times identity rather than a diagonal matrix.
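The precision parametrization described above can be sketched for a scalar y (plain Python; the function names and the choice of a univariate Gaussian are assumptions for illustration): the raw activation passes through softplus to guarantee a positive precision, and the resulting negative log-likelihood involves only multiplication by \beta and addition of \log \beta, with no division.

```python
import math

def softplus(a):
    # zeta(a) = log(1 + exp(a)), evaluated stably; output is always > 0.
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def gaussian_nll(y, mean, raw_precision):
    # Negative log-likelihood of N(y; mean, 1/beta) with beta = softplus(a).
    # Note: only multiplication by beta and addition of log beta appear.
    beta = softplus(raw_precision)
    return 0.5 * (beta * (y - mean) ** 2 - math.log(beta) + math.log(2 * math.pi))

# The NLL is smaller when the predicted mean matches the observation.
assert gaussian_nll(1.0, 1.0, 0.3) < gaussian_nll(1.0, 2.0, 0.3)
# The precision stays strictly positive even for very negative activations.
assert softplus(-50.0) > 0.0
```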
It is rare to learn a covariance or precision matrix with richer structure than diagonal. If the covariance is full and conditional, then a parametrization must be chosen that guarantees positive definiteness of the predicted covariance matrix. This can be achieved by writing \Sigma(x) = B(x) B^\top(x), where B is an unconstrained square matrix. One practical issue if the matrix is full rank is that computing the likelihood is expensive, with a d \times d matrix requiring O(d^3) computation for the determinant and inverse of \Sigma(x) (or equivalently, and more commonly done, its eigendecomposition or that of B(x)).
We often want to perform multimodal regression, that is, to predict real values from a conditional distribution p(y \mid x) that can have several different peaks in y space for the same value of x. In this case, a Gaussian mixture is a natural representation for the output (Jacobs et al., 1991; Bishop, 1994). Neural networks with Gaussian mixtures as their output are often called mixture density networks. A Gaussian mixture output with n components is defined by the conditional probability distribution

p(y \mid x) = \sum_{i=1}^{n} p(c = i \mid x) \, \mathcal{N}(y; \mu^{(i)}(x), \Sigma^{(i)}(x)).    (6.35)
The neural network must have three outputs: a vector defining p(c = i \mid x), a matrix providing \mu^{(i)}(x) for all i, and a tensor providing \Sigma^{(i)}(x) for all i. These outputs must satisfy different constraints:
1. Mixture components p(c = i \mid x): these form a multinoulli distribution over the n different components associated with latent variable c, and can typically be obtained by a softmax over an n-dimensional vector, to guarantee that these outputs are positive and sum to 1.
2. Means \mu^{(i)}(x): these indicate the center or mean associated with the i-th Gaussian component and are unconstrained (typically with no nonlinearity at all for these output units). If y is a d-vector, then the network must output an n \times d matrix containing all n of these d-dimensional vectors. Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode. We only want to update the mean for the component that actually produced the observation. In practice, we do not know which component produced each observation. The expression for the negative log-likelihood naturally weights each example's contribution to the loss for each component by the probability that the component produced the example.
3. Covariances \Sigma^{(i)}(x): these specify the covariance matrix for each component i. As when learning a single Gaussian component, we typically use a diagonal matrix to avoid needing to compute determinants. As with learning the means of the mixture, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component. Gradient descent will automatically follow the correct process if given the correct specification of the negative log-likelihood under the mixture model.
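The three constrained outputs and the mixture negative log-likelihood of equation 6.35 can be sketched for a scalar y (plain Python; the function, its argument names, and the softplus-precision parametrization are assumptions for illustration, not the book's code): logits go through a softmax, raw precisions through softplus, and the sum over components is evaluated with a log-sum-exp so small responsibilities do not underflow.

```python
import math

def mixture_nll(y, logits, means, raw_precisions):
    # -log p(y) for p(y) = sum_i softmax(logits)_i N(y; mu_i, 1/beta_i),
    # a scalar-y instance of equation 6.35 with softplus-constrained precision.
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    log_weights = [l - lse for l in logits]          # log softmax(logits)
    log_comps = []
    for lw, mu, a in zip(log_weights, means, raw_precisions):
        beta = max(a, 0.0) + math.log1p(math.exp(-abs(a)))  # softplus > 0
        log_n = 0.5 * (math.log(beta) - math.log(2 * math.pi)
                       - beta * (y - mu) ** 2)
        log_comps.append(lw + log_n)
    # log-sum-exp over components, again shifted for stability.
    c = max(log_comps)
    return -(c + math.log(sum(math.exp(v - c) for v in log_comps)))

# A point near one component's mean is more likely (lower NLL) than a
# point far from every mean.
near = mixture_nll(0.0, [0.0, 0.0], [0.0, 5.0], [1.0, 1.0])
far = mixture_nll(10.0, [0.0, 0.0], [0.0, 5.0], [1.0, 1.0])
assert near < far
```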
It has been reported that gradient-based optimization of conditional Gaussian mixtures (on the output of neural networks) can be unreliable, in part because one gets divisions (by the variance) which can be numerically unstable (when some variance gets to be small for a particular example, yielding very large gradients). One solution is to clip gradients (see section 10.11.1), while another is to scale the gradients heuristically (Uria et al., 2014).
Gaussian mixture outputs are particularly effective in generative models of speech (Schuster, 1999) and movements of physical objects (Graves, 2013). The mixture density strategy gives a way for the network to represent multiple output modes and to control the variance of its output, which is crucial for obtaining a high degree of quality in these real-valued domains. An example of a mixture density network is shown in figure 6.4.

[Footnote: We consider c to be latent because we do not observe it in the data: given input x and target y, it is not possible to know with certainty which Gaussian component was responsible for y, but we can imagine that y was generated by picking one of them, and we can make that unobserved choice a random variable.]

Figure 6.4: Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution, and the output y is sampled from p_{\mathrm{model}}(y \mid x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution. These parameters include the probabilities governing which of three mixture components will generate the output, as well as the parameters for each mixture component. Each mixture component is Gaussian with predicted mean and variance. All these aspects of the output distribution are able to vary with respect to the input x, and to do so in nonlinear ways.
In general, we may wish to continue to model larger vectors y containing more variables, and to impose richer and richer structures on these output variables. For example, if we want our neural network to output a sequence of characters that forms a sentence, we might continue to use the principle of maximum likelihood applied to our model p(y; \omega(x)). In this case, the model we use to describe y would become complex enough to be beyond the scope of this chapter. Chapter 10 describes how to use recurrent neural networks to define such models over sequences, and part III describes advanced techniques for modeling arbitrary probability distributions.
6.3 Hidden Units
So far we have focused our discussion on design choices for neural networks that are common to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is unique to feedforward neural networks: how to choose the type of hidden unit to use in the hidden layers of the model.
The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.

Rectified linear units are an excellent default choice of hidden unit. Many other types of hidden units are available. It can be difficult to determine when to use which kind (though rectified linear units are usually an acceptable choice). We describe here some of the basic intuitions motivating each type of hidden unit. These intuitions can help decide when to try out each of these units. Predicting in advance which will work best is usually impossible. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.
Some of the hidden units included in this list are not actually differentiable at all input points. For example, the rectified linear function g(z) = \max\{0, z\} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks. This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly, as shown in figure 4.3. (These ideas are described further in chapter 8.) Because we do not expect training to actually reach a point where the gradient is 0, it is acceptable for the minima of the cost function to correspond to points with undefined gradient.
Hidden units that are not differentiable are usually nondifferentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z. A function is differentiable at z only if both the left derivative and the right derivative are defined and equal to each other. The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives. In the case of g(z) = \max\{0, z\}, the left derivative at z = 0 is 0, and the right derivative is 1.
Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value \epsilon that was rounded to 0. In some contexts, more theoretically pleasing justifications are available, but these usually do not apply to neural network training. The important point is that in practice one can safely disregard the nondifferentiability of the hidden unit activation functions described below.
Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = W^\top x + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
6.3.1 Rectified Linear Units and Their Generalizations
Rectified linear units use the activation function g(z) = \max\{0, z\}.

These units are easy to optimize because they are so similar to linear units. The only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent. The second derivative of the rectifying operation is 0 almost everywhere, and the derivative of the rectifying operation is 1 everywhere that the unit is active. This means that the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation:

h = g(W^\top x + b).    (6.36)
When initializing the parameters of the affine transformation, it can be good practice to set all elements of b to a small positive value, such as 0.1. Doing so makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allow the derivatives to pass through.
Several generalizations of rectified linear units exist. Most of these generalizations perform comparably to rectified linear units and occasionally perform better. One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero. Various generalizations of rectified linear units guarantee that they receive gradient everywhere.
Three generalizations of rectified linear units are based on using a nonzero slope \alpha_i when z_i < 0: h_i = g(z, \alpha)_i = \max(0, z_i) + \alpha_i \min(0, z_i). Absolute value rectification fixes \alpha_i = -1 to obtain g(z) = |z|. It is used for object recognition from images (Jarrett et al., 2009), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination. Other generalizations of rectified linear units are more broadly applicable. A leaky ReLU (Maas et al., 2013) fixes \alpha_i to a small value like 0.01, while a parametric ReLU, or PReLU, treats \alpha_i as a learnable parameter (He et al., 2015).
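All three generalizations share the same formula and differ only in the choice of \alpha, which a short plain-Python sketch makes concrete (the function name is illustrative):

```python
def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i): alpha = 0 gives the plain
    # ReLU, alpha = -1 absolute value rectification, and a small alpha
    # such as 0.01 a leaky ReLU.
    return [max(0.0, zi) + ai * min(0.0, zi) for zi, ai in zip(z, alpha)]

z = [-2.0, 3.0]
assert generalized_relu(z, [0.0, 0.0]) == [0.0, 3.0]     # ReLU
assert generalized_relu(z, [-1.0, -1.0]) == [2.0, 3.0]   # absolute value, |z|
leaky = generalized_relu(z, [0.01, 0.01])                # small nonzero slope
assert leaky[0] < 0.0 and leaky[1] == 3.0                # negative side leaks
```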
Maxout units (Goodfellow et al., 2013a) generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one of these groups:

g(z)_i = \max_{j \in \mathbb{G}^{(i)}} z_j,    (6.37)

where \mathbb{G}^{(i)} is the set of indices into the inputs for group i, \{(i-1)k + 1, \ldots, ik\}. This provides a way of learning a piecewise linear function that responds to multiple directions in the input space. A maxout unit can learn a piecewise linear, convex function with up to k pieces.
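Equation 6.37 amounts to a grouped max over consecutive values, as in this plain-Python sketch (names illustrative):

```python
def maxout(z, k):
    # Divide z into groups of k consecutive values and output the max of
    # each group (equation 6.37 with G^(i) = {(i-1)k + 1, ..., ik}).
    assert len(z) % k == 0, "z must divide evenly into groups of k"
    return [max(z[i:i + k]) for i in range(0, len(z), k)]

# Four inputs with k = 2 yield two maxout units.
assert maxout([5.0, -1.0, -3.0, 2.0], 2) == [5.0, 2.0]
```

With k = 2 and one member of each group pinned to 0, the grouped max reduces to \max\{0, z\}, which is one way to see that a two-piece maxout layer can recover the rectified linear activation.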
Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units. With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, the absolute value rectification function, or the leaky or parametric ReLU, or it can learn to implement a totally different function altogether. The maxout layer will of course be parametrized differently from any of these other layer types, so the learning dynamics will be different even in the cases where maxout learns to implement the same function of x as one of the other layer types.
Each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than rectified linear units. They can work well without regularization if the training set is large and the number of pieces per unit is kept low (Cai et al., 2013).
Maxout units have a few other benefits. In some cases, one can gain some statistical and computational advantages by requiring fewer parameters. Specifically, if the features captured by n different linear filters can be summarized without losing information by taking the max over each group of k features, then the next layer can get by with k times fewer weights.

Because each unit is driven by multiple filters, maxout units have some redundancy that helps them resist a phenomenon called catastrophic forgetting, in which neural networks forget how to perform tasks that they were trained on in the past (Goodfellow et al., 2014a).
Rectified linear units and all these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear. This same general principle of using linear behavior to obtain easier optimization also applies in other contexts besides deep linear networks. Recurrent networks can learn from sequences and produce a sequence of states and outputs. When training them, one needs to propagate information through several time steps, which is much easier when some linear computations (with some directional derivatives being of magnitude near 1) are involved. One of the best-performing recurrent network architectures, the LSTM, propagates information through time via summation, a particularly straightforward kind of linear activation. This is discussed further in section 10.10.
6.3.2 Logistic Sigmoid and Hyperbolic Tangent
Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function

g(z) = \sigma(z)    (6.38)

or the hyperbolic tangent activation function

g(z) = \tanh(z).    (6.39)

These activation functions are closely related because \tanh(z) = 2\sigma(2z) - 1.

We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal units saturate across most of their domain: they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are only strongly sensitive to their input when z is near 0.
The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged. Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.
When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid. It resembles the identity function more closely, in the sense that \tanh(0) = 0, while \sigma(0) = \frac{1}{2}. Because tanh is similar to the identity function near 0, training a deep neural network \hat{y} = w^\top \tanh(U^\top \tanh(V^\top x)) resembles training a linear model \hat{y} = w^\top U^\top V^\top x, as long as the activations of the network can be kept small. This makes training the tanh network easier.
Sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.
6.3.3 Other Hidden Units
Many other types of hidden units are possible but are used less frequently. In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones. To provide a concrete example, we tested a feedforward network using h = \cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1 percent, which is competitive with results obtained using more conventional activation functions.
During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that usually new hidden unit types are published only if they are clearly demonstrated to provide a significant improvement. New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting. It would be impractical to list all the hidden unit types that have appeared in the literature. We highlight a few especially useful and distinctive ones.
One possibility is to not have an activation g(z) at all. One can also think of this as using the identity function as the activation function. We have already seen that a linear unit can be useful as the output of a neural network. It may also be used as a hidden unit. If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear.
Consider a neural network layer with n inputs and p outputs, h = g(W^\top x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V. If the first layer has no activation function, then we have essentially factored the weight matrix of the original layer based on W = UV. The factored approach is to compute h = g(V^\top U^\top x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters. For small q, this can be a considerable saving in parameters. It comes at the cost of constraining the linear transformation to be low rank, but these low-rank relationships are often sufficient. Linear hidden units thus offer an effective way of reducing the number of parameters in a network.
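The parameter-count arithmetic above is easy to check with a small sketch (plain Python, bias terms ignored for simplicity; the function names and the 1024/64 sizes are illustrative choices, not from the text):

```python
def dense_params(n, p):
    # Weights in a single n-input, p-output layer: the n x p matrix W.
    return n * p

def factored_params(n, p, q):
    # Replace W with U (n x q) followed by V (q x p): (n + p) q weights.
    return n * q + q * p

# For a 1024-to-1024 layer with a rank-64 bottleneck, the factored form
# needs exactly 8x fewer weights.
full = dense_params(1024, 1024)          # 1,048,576
low = factored_params(1024, 1024, 64)    # 131,072
assert low * 8 == full
```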
Softmax units are another kind of unit that is usually used as an output (as described in section 6.2.2.3) but may sometimes be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch. These kinds of hidden units are usually used only in more advanced architectures that explicitly learn to manipulate memory, as described in section 10.12.
A few other reasonably common hidden unit types include:

Radial basis function (RBF) unit: h_i = exp(−(1/σ_i²) ‖W_{:,i} − x‖²). This function becomes more active as x approaches a template W_{:,i}. Because it saturates to 0 for most x, it can be difficult to optimize.
Softplus: g(a) = ζ(a) = log(1 + eᵃ). This is a smooth version of the rectifier, introduced by Dugas et al. (2001) for function approximation and by Nair and Hinton (2010) for the conditional distributions of undirected probabilistic models. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter. The use of the softplus is generally discouraged.
The softplus demonstrates that the performance of hidden unit types can be very counterintuitive: one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.
Hard tanh: this is shaped similarly to the tanh and the rectifier, but unlike the latter, it is bounded, g(a) = max(−1, min(1, a)). It was introduced by Collobert (2004).

Hidden unit design remains an active area of research, and many useful hidden unit types remain to be discovered.
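As a sketch (not from the book), the softplus and hard tanh units above can be written directly from their definitions; the sample arguments are arbitrary:

```python
import math

# Softplus: zeta(a) = log(1 + exp(a)), a smooth version of the rectifier.
def softplus(a):
    return math.log1p(math.exp(a))

# Hard tanh: bounded, g(a) = max(-1, min(1, a)).
def hard_tanh(a):
    return max(-1.0, min(1.0, a))

# Softplus approaches the rectifier max(0, a) far from zero,
# but is differentiable everywhere.
print(softplus(0.0))  # log 2
print(hard_tanh(2.5), hard_tanh(-2.5), hard_tanh(0.3))
```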
6.4 Architecture Design

Another key design consideration for neural networks is determining the architecture. The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it.
In this structure, the first layer is given by

h^(1) = g^(1)(W^(1)⊤ x + b^(1)), (6.40)

the second layer is given by

h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2)), (6.41)

and so on.
In these chain-based architectures, the main architectural considerations are choosing the depth of the network and the width of each layer. As we will see, a network with even one hidden layer is sufficient to fit the training set. Deeper networks are often able to use far fewer units per layer and far fewer parameters, as well as frequently generalizing better to the test set, but they also tend to be harder to optimize. The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.
6.4.1 Universal Approximation Properties and Depth

A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train, because many loss functions result in convex optimization problems when applied to linear models. Unfortunately, we often want our systems to learn nonlinear functions.
At first glance, we might presume that learning a nonlinear function requires designing a specialized model family for the kind of nonlinearity we want to learn. Fortunately, feedforward networks with hidden layers provide a universal approximation framework. Specifically, the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units. The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well (Hornik et al., 1990).
The concept of Borel measurability is beyond the scope of this book; for our purposes it suffices to say that any continuous function on a closed and bounded subset of ℝⁿ is Borel measurable and therefore may be approximated by a neural network. A neural network may also approximate any function mapping from any finite-dimensional discrete space to another.
While the original theorems were first stated in terms of units with activation functions that saturate for both very negative and very positive arguments, universal approximation theorems have also been proved for a wider class of activation functions, which includes the now commonly used rectified linear unit (Leshno et al., 1993).
The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function. We are not guaranteed, however, that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons. First, the optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function. Second, the training algorithm might choose the wrong function as a result of overfitting.
Recall from section 5.2.1 that the "no free lunch" theorem shows that there is no universally superior machine learning algorithm. Feedforward networks provide a universal system for representing functions in the sense that, given a function, there exists a feedforward network that approximates the function. There is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.
According to the universal approximation theorem, there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be. Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required.
This is easiest to see in the binary case: the number of possible binary functions on vectors v ∈ {0, 1}ⁿ is 2^(2ⁿ), and selecting one such function requires 2ⁿ bits, which will in general require O(2ⁿ) degrees of freedom.
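This count is easy to verify by brute force for tiny n. A quick sketch (not from the book):

```python
from itertools import product

# Count the binary functions on {0,1}^n: each function assigns one output
# bit to each of the 2^n possible inputs, so there are 2^(2^n) functions.
def count_binary_functions(n):
    n_inputs = len(list(product([0, 1], repeat=n)))  # 2^n
    return 2 ** n_inputs

print(count_binary_functions(2))  # 16 = 2^(2^2)
print(count_binary_functions(3))  # 256 = 2^(2^3)
```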
In summary, a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.
Various families of functions can be approximated efficiently by an architecture with depth greater than some value d, but they require a much larger model if depth is restricted to be less than or equal to d. In many cases, the number of hidden units required by the shallow model is exponential in n.
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks, formally shown by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry (which is now repeated four times, with two hidden layers). Figure reproduced with permission from Montufar et al. (2014).
Such results were first proved for models that do not resemble the continuous, differentiable neural networks used for machine learning but have since been extended to these models.
The first results were for circuits of logic gates (Håstad, 1986). Later work extended these results to linear threshold units with nonnegative weights (Håstad and Goldmann, 1991; Hajnal et al., 1993), and then to networks with continuous-valued activations (Maass, 1992; Maass et al., 1994).

Many modern neural networks use rectified linear units. Leshno et al. (1993) demonstrated that shallow networks with a broad family of non-polynomial activation functions, including rectified linear units, have universal approximation properties, but these results do not address the questions of depth or efficiency; they specify only that a sufficiently wide rectifier network could represent any function.
Montufar et al. (2014) showed that functions representable with a deep rectifier network can require an exponential number of hidden units with a shallow (one hidden layer) network. More precisely, they showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network.
Figure 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions that can capture all kinds of regular (e.g., repeating) patterns.
The main theorem in Montufar et al. (2014) states that the number of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer is

O( C(n, d)^(d(l−1)) n^d ), (6.42)

where C(n, d) denotes the binomial coefficient "n choose d"; that is, the count is exponential in the depth l. In the case of maxout networks with k filters per unit, the number of linear regions is

O( k^((l−1)+d) ). (6.43)
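The growth in depth can be seen by evaluating the dominant terms of these two bounds for small hypothetical networks. This is our own sketch, not from the book, and the helper names are ours:

```python
from math import comb

# Dominant term of equation 6.42: C(n, d)^(d(l-1)) * n^d,
# for a rectifier network with d inputs, depth l, n units per layer.
def rectifier_region_term(n, d, l):
    return comb(n, d) ** (d * (l - 1)) * n ** d

# Dominant term of equation 6.43: k^((l-1) + d), maxout with k filters.
def maxout_region_term(k, d, l):
    return k ** ((l - 1) + d)

# Increasing the depth l multiplies the exponent rather than the base.
print([rectifier_region_term(4, 2, l) for l in (1, 2, 3)])
print([maxout_region_term(3, 2, l) for l in (1, 2, 3)])
```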
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property. We may also want to choose a deep model for statistical reasons. Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions.
This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation.
Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output. These intermediate outputs are not necessarily factors of variation but can instead be analogous to counters or pointers that the network uses to organize its internal processing.
Empirically, greater depth does seem to result in better generalization for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Bengio, 2009; Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a). See figure 6.6 and figure 6.7 for examples of some of these empirical results. These results suggest that using deep architectures does indeed express a useful prior over the space of functions the model learns.
6.4.2 Other Architectural Considerations

So far we have described neural networks as being simple chains of layers, with the main considerations being the depth of the network and the width of each layer. In practice, neural networks show considerably more diversity.
Figure 6.6: Effect of depth. Empirical results showing that deeper networks generalize better when used to transcribe multidigit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See figure 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.
Many neural network architectures have been developed for specific tasks. Specialized architectures for computer vision called convolutional networks are described in chapter 9. Feedforward networks may also be generalized to the recurrent neural networks for sequence processing, described in chapter 10, which have their own architectural considerations.
In general, the layers need not be connected in a chain, even though this is the most common practice. Many architectures build a main chain but then add extra architectural features to it, such as skip connections going from layer i to layer i + 2 or higher. These skip connections make it easier for the gradient to flow from output layers to layers nearer the input.
Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output
layer. These strategies for decreasing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network, but are often highly problem dependent.
Figure 6.7: Effect of number of parameters. Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance, as illustrated in this figure. The legend indicates the depth of the network used to make each curve and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters, while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. Specifically, it expresses a belief that the function should consist of many simpler functions composed together. This could result either in learning a representation that is composed in turn of simpler representations (e.g., corners defined in terms of edges) or in learning a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).
For example, convolutional networks, described in chapter 9, use specialized patterns of sparse connections that are very effective for computer vision problems.
In this chapter, it is difficult to give more specific advice concerning the architecture of a generic neural network. In subsequent chapters, we develop the particular architectural strategies that have been found to work well for different application domains.
6.5 Back-Propagation and Other Differentiation Algorithms

When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The input x provides the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost J(θ).
The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to then flow backward through the network in order to compute the gradient. Computing an analytical expression for the gradient is straightforward, but numerically evaluating such an expression can be computationally expensive. The back-propagation algorithm does so using a simple and inexpensive procedure.
The term back-propagation is often misunderstood as meaning the whole learning algorithm for multilayer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient.
Furthermore, back-propagation is often misunderstood as being specific to multilayer neural networks, but in principle it can compute derivatives of any function (for some functions, the correct response is to report that the derivative of the function is undefined). Specifically, we will describe how to compute the gradient ∇ₓ f(x, y) for an arbitrary function f, where x is a set of variables whose derivatives are desired, and y is an additional set of variables that are inputs to the function but whose derivatives are not required.
In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, ∇_θ J(θ). Many machine learning tasks involve computing other derivatives, either as part of the learning process, or to analyze the learned model. The back-propagation algorithm can be applied to these tasks as well and is not restricted to computing the gradient of the cost function with respect to the parameters.
The idea of computing derivatives by propagating information through a network is very general and can be used to compute values such as the Jacobian of a function f with multiple outputs. We restrict our description here to the most commonly used case, where f has a single output.
6.5.1 Computational Graphs

So far we have discussed neural networks with a relatively informal graph language. To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language. Many ways of formalizing computation as graphs are possible. Here, we use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.
To formalize our graphs, we also need to introduce the idea of an operation. An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.
Without loss of generality, we define an operation to return only a single output variable. This does not lose generality, because the output variable can have multiple entries, such as a vector. Software implementations of back-propagation usually support operations with multiple outputs, but we avoid this case in our description because it introduces many extra details that are not important to conceptual understanding.
If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y. We sometimes annotate the output node with the name of the operation applied, and other times omit this label when the operation is clear from context. Examples of computational graphs are shown in figure 6.8.
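A minimal version of this graph language can be sketched in code (our own illustration, not from the book): each node is a variable, and non-input nodes record the operation and parent nodes that produce them, as in figure 6.8(a).

```python
import operator

# A node in a computational graph: a variable, optionally produced by
# applying an operation to parent nodes.
class Node:
    def __init__(self, name, op=None, parents=()):
        self.name, self.op, self.parents = name, op, parents

def evaluate(node, values):
    # values maps input-node names to numbers.
    if node.op is None:
        return values[node.name]
    args = [evaluate(p, values) for p in node.parents]
    return node.op(*args)

x = Node("x")
y = Node("y")
z = Node("z", op=operator.mul, parents=(x, y))  # z = x * y
print(evaluate(z, {"x": 3.0, "y": 4.0}))  # 12.0
```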
6.5.2 Chain Rule of Calculus

The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.
Let x be a real number, and let f and g both be functions mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then
Figure 6.8: Examples of computational graphs. (a) The graph using the × operation to compute z = xy. (b) The graph for the logistic regression prediction ŷ = σ(x⊤w + b). Some of the intermediate expressions do not have names in the algebraic expression but need names in the graph. We simply name the i-th such variable u^(i). (c) The computational graph for the expression H = max{0, XW + b}, which computes a design matrix of rectified linear unit activations H given a design matrix containing a minibatch of inputs X. (d) Examples a–c applied at most one operation to each variable, but it is possible to apply more than one operation. Here we show a computation graph that applies more than one operation to the weights w of a linear regression model. The weights are used to make both the prediction ŷ and the weight decay penalty λΣᵢwᵢ².
the chain rule states that

dz/dx = (dz/dy)(dy/dx). (6.44)

We can generalize this beyond the scalar case. Suppose that x ∈ ℝᵐ, y ∈ ℝⁿ, g maps from ℝᵐ to ℝⁿ, and f maps from ℝⁿ to ℝ. If y = g(x) and z = f(y), then

∂z/∂xᵢ = Σⱼ (∂z/∂yⱼ)(∂yⱼ/∂xᵢ). (6.45)

In vector notation, this may be equivalently written as

∇ₓz = (∂y/∂x)⊤ ∇_y z, (6.46)
where ∂y/∂x is the n × m Jacobian matrix of g. From this we see that the gradient of a variable x can be obtained by multiplying a Jacobian matrix ∂y/∂x by a gradient ∇_y z. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.
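Equation 6.46 can be checked numerically on a small hypothetical example of our choosing (not from the book), g(x) = (x₁x₂, sin x₁) and f(y) = y₁ + y₂², comparing the Jacobian-gradient product against finite differences:

```python
import math

# Hypothetical example: y = g(x) in R^2, z = f(y) in R.
def g(x1, x2):
    return (x1 * x2, math.sin(x1))

def f(y1, y2):
    return y1 + y2 ** 2

def z(x1, x2):
    return f(*g(x1, x2))

x1, x2 = 0.7, -1.3
y1, y2 = g(x1, x2)

grad_y = (1.0, 2.0 * y2)            # gradient of f with respect to y
jac = ((x2, x1),                    # dy1/dx1, dy1/dx2
       (math.cos(x1), 0.0))         # dy2/dx1, dy2/dx2
# Equation 6.46: grad_x z = (dy/dx)^T grad_y z
grad_x = tuple(sum(jac[j][i] * grad_y[j] for j in range(2))
               for i in range(2))

eps = 1e-6                          # central finite differences
num = ((z(x1 + eps, x2) - z(x1 - eps, x2)) / (2 * eps),
       (z(x1, x2 + eps) - z(x1, x2 - eps)) / (2 * eps))
print(grad_x, num)
```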
Usually we apply the back-propagation algorithm to tensors of arbitrary dimensionality, not merely to vectors. Conceptually, this is exactly the same as back-propagation with vectors. The only difference is how the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients.
In
this
rearranged
view,
bac
k-propagation
is
still
just
ultiplying
Jacobians
gradients.
denote
the
gradient
of
alue
with
resp
ect
to
tensor
we
write
just
as
if
ere
vector.
The
indices
in
to
no
hav
multiple
co
ordinates—for
example,
3-D
tensor
is
indexed
by
three
co
ordinates.
can
abstract
this
ay
using
single
ariable
to
represent
the
complete
tuple
of
indices.
or
all
ossible
index
tuples
giv
es
This
is
exactly
the
same
as
how
for
all
ossible
in
teger
indices
in
to
ector,
giv
es
Using
this
notation,
we
can
write
the
chain
rule
as
it
applies
to
tensors.
If
and
then
(6.47)
6.5.3 Recursively Applying the Chain Rule to Obtain Backprop

Using the chain rule, it is straightforward to write down an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar. Actually evaluating that expression in a computer, however, introduces some extra considerations.
Specifically, many subexpressions may be repeated several times within the overall expression for the gradient. Any procedure that computes the gradient will need to choose whether to store these subexpressions or to recompute them several times. An example of how these repeated subexpressions arise is given in figure 6.9.
In some cases, computing the same subexpression twice would simply be wasteful. For complicated graphs, there can be exponentially many of these wasted computations, making a naive implementation of the chain rule infeasible. In other cases, computing the same subexpression twice could be a valid way to reduce memory consumption at the cost of higher runtime.
We begin with a version of the back-propagation algorithm that specifies the actual gradient computation directly (algorithm 6.2, along with algorithm 6.1 for the associated forward computation), in the order it will actually be done and according to the recursive application of the chain rule. One could either directly perform these computations or view the description of the algorithm as a symbolic specification of the computational graph for computing the back-propagation. However, this formulation does not make explicit the manipulation and the construction of the symbolic graph that performs the gradient computation. Such a formulation is presented in section 6.5.6, with algorithm 6.5, where we also generalize to nodes that contain arbitrary tensors.
First consider a computational graph describing how to compute a single scalar u^(n) (say, the loss on a training example). This scalar is the quantity whose gradient we want to obtain, with respect to the nᵢ input nodes u^(1) to u^(nᵢ). In other words, we wish to compute ∂u^(n)/∂u^(i) for all i ∈ {1, 2, …, nᵢ}. In the application of back-propagation to computing gradients for gradient descent over parameters, u^(n) will be the cost associated with an example or a minibatch, while u^(1) to u^(nᵢ) correspond to the parameters of the model.
del.
will
assume
that
the
no
des
of
the
graph
hav
een
ordered
in
suc
that
can
compute
their
output
one
after
the
other,
starting
at
+1)
and
going
up
to
As
defined
in
algorithm
6.1
each
no
de
is
asso
ciated
with
an
op
eration
and
is
computed
by
ev
aluating
the
function
(6.48)
where
is
the
set
of
all
no
des
that
are
parents
of
Figure 6.9: A computational graph that results in repeated subexpressions when computing the gradient. Let w ∈ ℝ be the input to the graph. We use the same function f: ℝ → ℝ as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y). To compute ∂z/∂w, we apply equation 6.44 and obtain:

∂z/∂w (6.49)
= (∂z/∂y)(∂y/∂x)(∂x/∂w) (6.50)
= f′(y) f′(x) f′(w) (6.51)
= f′(f(f(w))) f′(f(w)) f′(w). (6.52)

Equation 6.51 suggests an implementation in which we compute the value of f(w) only once and store it in the variable x. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by equation 6.52, where the subexpression f(w) appears more than once. In the alternative approach, f(w) is recomputed each time it is needed. When the memory required to store the value of these expressions is low, the back-propagation approach of equation 6.51 is clearly preferable because of its reduced runtime. However, equation 6.52 is also a valid implementation of the chain rule and is useful when memory is limited.
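The two strategies can be compared concretely with a hypothetical stand-in f(a) = a² (our choice, so f′(a) = 2a); both yield the same derivative of z = f(f(f(w))):

```python
# Hypothetical f for illustration: f(a) = a^2, so f'(a) = 2a.
def f(a):
    return a * a

def fp(a):
    return 2 * a

w = 1.5

# Equation 6.51: the forward pass stores x = f(w) and y = f(x),
# so each intermediate value is computed exactly once.
x = f(w)
y = f(x)
grad_stored = fp(y) * fp(x) * fp(w)

# Equation 6.52: recompute f(w) wherever it appears,
# trading extra runtime for lower memory use.
grad_recomputed = fp(f(f(w))) * fp(f(w)) * fp(w)

print(grad_stored, grad_recomputed)  # equal: d(w^8)/dw = 8 w^7
```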
That algorithm specifies the forward propagation computation, which we could put in a graph G. To perform back-propagation, we can construct a computational graph that depends on G and adds to it an extra set of nodes. These form a subgraph B with one node per node of G.
Computation in B proceeds in exactly the reverse of the order of computation in G, and each node of B computes the derivative ∂u^(n)/∂u^(i) associated with the forward graph node u^(i). This is done using the chain rule with respect to the scalar output u^(n):

∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j)), (6.53)

as specified by algorithm 6.2. The subgraph B contains exactly one edge for each edge from node u^(j) to node u^(i) of G. The edge from u^(j) to u^(i) is associated with the computation of ∂u^(i)/∂u^(j).
In addition, a dot product is performed for each node, between the gradient already computed with respect to nodes u^(i) that are children of u^(j) and the vector containing the partial derivatives ∂u^(i)/∂u^(j) for the same children nodes u^(i).
To summarize, the amount of computation required for performing the back-propagation scales linearly with the number of edges in G, where the computation for each edge corresponds to computing a partial derivative (of one node with respect to one of its parents) as well as performing one multiplication and one addition. Below, we generalize this analysis to tensor-valued nodes, which is just a way to group multiple scalar values in the same node and enable more efficient implementations.
Algorithm 6.1 A procedure that performs the computations mapping nᵢ inputs u^(1) to u^(nᵢ) to an output u^(n). This defines a computational graph where each node computes numerical value u^(i) by applying a function f^(i) to the set of arguments A^(i) that comprises the values of previous nodes u^(j), j < i, with j ∈ Pa(u^(i)). The input to the computational graph is the vector x, and is set into the first nᵢ nodes u^(1) to u^(nᵢ). The output of the computational graph is read off the last (output) node u^(n).

for i = 1, …, nᵢ do
  u^(i) ← xᵢ
end for
for i = nᵢ + 1, …, n do
  A^(i) ← { u^(j) | j ∈ Pa(u^(i)) }
  u^(i) ← f^(i)(A^(i))
end for
return u^(n)
The back-propagation algorithm is designed to reduce the number of common subexpressions without regard to memory. Specifically, it performs on the order of one Jacobian product per node in the graph. This can be seen from the fact that backprop (algorithm 6.2) visits each edge from node u^(j) to node u^(i) of the graph exactly once in order to obtain the associated partial derivative ∂u^(i)/∂u^(j).
Back-propagation thus avoids the exponential explosion in repeated subexpressions. Other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may be able to conserve memory by recomputing rather than storing some subexpressions. We revisit these ideas after describing the back-propagation algorithm itself.
Algorithm 6.2 Simplified version of the back-propagation algorithm for computing the derivatives of u^(n) with respect to the variables in the graph. This example is intended to further understanding by showing a simplified case where all variables are scalars, and we wish to compute the derivatives with respect to u^(1), …, u^(nᵢ). This simplified version computes the derivatives of all nodes in the graph. The computational cost of this algorithm is proportional to the number of edges in the graph, assuming that the partial derivative associated with each edge requires a constant time. This is of the same order as the number of computations for the forward propagation. Each ∂u^(i)/∂u^(j) is a function of the parents u^(j) of u^(i), thus linking the nodes of the forward graph to those added for the back-propagation graph.

Run forward propagation (algorithm 6.1 for this example) to obtain the activations of the network.
Initialize grad_table, a data structure that will store the derivatives that have been computed. The entry grad_table[u^(i)] will store the computed value of ∂u^(n)/∂u^(i).
grad_table[u^(n)] ← 1
for j = n − 1 down to 1 do
  The next line computes ∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j)) using stored values:
  grad_table[u^(j)] ← Σ_{i : j ∈ Pa(u^(i))} grad_table[u^(i)] (∂u^(i)/∂u^(j))
end for
return { grad_table[u^(i)] | i = 1, …, nᵢ }
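Algorithms 6.1 and 6.2 can be transcribed almost line for line for a small concrete graph. This sketch (ours, not from the book) computes u₄ = sin(u₁u₂) and then fills grad_table in reverse order:

```python
import math

# Graph for u3 = u1 * u2, u4 = sin(u3): nodes 1..4, inputs u1 and u2.
parents = {3: (1, 2), 4: (3,)}
ops     = {3: lambda a, b: a * b, 4: lambda a: math.sin(a)}
# partials[(i, j)](u) = d u_i / d u_j evaluated at the stored values u.
partials = {
    (3, 1): lambda u: u[2],
    (3, 2): lambda u: u[1],
    (4, 3): lambda u: math.cos(u[3]),
}

def forward(x1, x2, n=4):
    # Algorithm 6.1: compute each node's value in topological order.
    u = {1: x1, 2: x2}
    for i in range(3, n + 1):
        u[i] = ops[i](*(u[j] for j in parents[i]))
    return u

def children_of(j):
    # Nodes i that have j as a parent.
    return [i for i, ps in parents.items() if j in ps]

def backprop(u, n=4):
    # Algorithm 6.2: fill grad_table in reverse order, one
    # stored-value product per edge.
    grad_table = {n: 1.0}
    for j in range(n - 1, 0, -1):
        grad_table[j] = sum(grad_table[i] * partials[(i, j)](u)
                            for i in children_of(j))
    return grad_table

u = forward(0.5, 2.0)
g = backprop(u)
print(g[1], g[2])  # d sin(x1*x2)/dx1 = x2 cos(x1*x2), and symmetrically
```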
6.5.4 Back-Propagation Computation in Fully Connected MLP

To clarify the above definition of the back-propagation computation, let us consider the specific graph associated with a fully connected multilayer MLP. Algorithm 6.3 first shows the forward propagation, which maps parameters to the supervised loss L(ŷ, y) associated with a single (input, target) training example (x, y), with ŷ the output of the neural network when x is provided as input. Algorithm 6.4 then shows the corresponding computation to be done for applying the back-propagation algorithm to this graph.
Algorithms 6.3 and 6.4 are demonstrations chosen to be simple and straightforward to understand. However, they are specialized to one specific problem. Modern software implementations are based on the generalized form of back-propagation described in section 6.5.6 below, which can accommodate any computational graph by explicitly manipulating a data structure for representing symbolic computation.
Algorithm 6.3 Forward propagation through a typical deep neural network and the computation of the cost function. The loss L(ŷ, y) depends on the output ŷ and on the target y (see section 6.2.1.1 for examples of loss functions). To obtain the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains all the parameters (weights and biases). Algorithm 6.4 shows how to compute gradients of J with respect to parameters W and b. For simplicity, this demonstration uses only a single input example x. Practical applications should use a minibatch. See section 6.5.7 for a more realistic demonstration.

Require: Network depth, l
Require: W^(i), i ∈ {1, …, l}, the weight matrices of the model
Require: b^(i), i ∈ {1, …, l}, the bias parameters of the model
Require: x, the input to process
Require: y, the target output
h^(0) = x
for k = 1, …, l do
  a^(k) = b^(k) + W^(k) h^(k−1)
  h^(k) = f(a^(k))
end for
ŷ = h^(l)
J = L(ŷ, y) + λΩ(θ)
Algorithm 6.4
Bac
kward
computation for
the deep
neural net
work of
algo-
rithm
6.3
which
uses,
in
addition
to
the
input
target
This
computation
yields
the
gradien
ts
on
the
activ
ations
for
eac
la
yer
starting
from
the
output
lay
er
and
going
backw
ards
to
the
first
hidden
la
yer.
rom
these
gradien
ts,
whic
can
interpreted
as
an
indication
of
how
each
lay
er’s
output
should
change
to
reduce
error,
one
can
obtain
the
gradient
on
the
parameters
of
each
la
yer.
The
gradien
ts
on
eights
and
biases
can
immediately
used
as
part
of
sto
has-
tic
gradient
update
(p
erforming
the
up
date
right
after
the
gradients
ha
een
computed)
or
used
with
other
gradient-based
optimization
metho
ds.
  After the forward computation, compute the gradient on the output layer:
  g ← ∇ŷ J = ∇ŷ L(ŷ, y)
  for k = l, l − 1, . . . , 1 do
    Convert the gradient on the layer's output into a gradient on the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
    g ← ∇a(k) J = g ⊙ f′(a(k))
    Compute gradients on weights and biases (including the regularization term, where needed):
    ∇b(k) J = g + λ∇b(k) Ω(θ)
    ∇W(k) J = g h(k−1)⊤ + λ∇W(k) Ω(θ)
    Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
    g ← ∇h(k−1) J = W(k)⊤ g
  end for
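A matching sketch of this backward pass, for a small ReLU network with no biases and no regularizer (λ = 0). All names are hypothetical; the forward pass stores the pre-activations a(k) that the backward pass needs.

```python
def forward(weights, x):
    """Forward pass storing pre-activations a(k) and activations h(k)."""
    hs, pre = [x], []
    for W in weights:
        a = [sum(w * h for w, h in zip(row, hs[-1])) for row in W]
        pre.append(a)
        hs.append([max(0.0, v) for v in a])  # ReLU at every layer
    return hs, pre

def backward(weights, hs, pre, grad_out):
    """Algorithm 6.4 for one example: ReLU units, no biases, lambda = 0."""
    grads_W, g = [], grad_out
    for k in range(len(weights), 0, -1):
        # g <- g * f'(a(k)); the ReLU derivative is 1 where a(k) > 0
        g = [gi * (1.0 if ai > 0 else 0.0) for gi, ai in zip(g, pre[k - 1])]
        # gradient on W(k) is g h(k-1)^T
        grads_W.insert(0, [[gi * hj for hj in hs[k - 1]] for gi in g])
        # g <- W(k)^T g : propagate to the previous layer's activations
        W = weights[k - 1]
        g = [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(W[0]))]
    return grads_W

weights = [[[1.0, -1.0], [0.5, 0.5]], [[1.0, 1.0]]]
hs, pre = forward(weights, [2.0, 1.0])          # output hs[-1] = [2.5]
# Loss 0.5 * (yhat - 2)^2, so the output gradient is yhat - 2 = 0.5.
grads = backward(weights, hs, pre, [hs[-1][0] - 2.0])
# grads[1] = [[0.5, 0.75]]; grads[0] = [[1.0, 0.5], [1.0, 0.5]]
```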
6.5.5 Symbol-to-Symbol Derivatives
Algebraic expressions and computational graphs both operate on symbols, or variables that do not have specific values. These algebraic and graph-based representations are called symbolic representations. When we actually use or train a neural network, we must assign specific values to these symbols. We replace the symbolic input to the network x with a specific numeric value, such as [1.2, 3.765, −1.8]⊤.
Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach symbol-to-number differentiation. This is the approach used by libraries such as Torch (Collobert et al., 2011b) and Caffe (Jia, 2013).
Another approach is to take a computational graph and add additional nodes
Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In this approach, the back-propagation algorithm does not need to ever access any actual specific numeric values. Instead, it adds nodes to the computational graph describing how to compute these derivatives. A generic graph evaluation engine can later compute the derivatives for any specific numeric values. (Left) In this example, we begin with a graph representing z = f(f(f(w))). (Right) We run the back-propagation algorithm, instructing it to construct the graph for the expression corresponding to dz/dw. In this example, we do not explain how the back-propagation algorithm works. The purpose is only to illustrate what the desired result is: a computational graph with a symbolic description of the derivative.
to the graph that provide a symbolic description of the desired derivatives. This is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012) and TensorFlow (Abadi et al., 2015). An example of how this approach works is illustrated in figure 6.10.
The primary advantage of this approach is that the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to run back-propagation again, differentiating the derivatives to obtain higher derivatives. (Computation of higher-order derivatives is described in section 6.5.10.)
We will use the latter approach and describe the back-propagation algorithm in terms of constructing a computational graph for the derivatives. Any subset of the graph may then be evaluated using specific numerical values at a later time. This allows us to avoid specifying exactly when each operation should be computed. Instead, a generic graph evaluation engine can evaluate every node as soon as its parents' values are available.
The description of the symbol-to-symbol based approach subsumes the symbol-to-number approach. The symbol-to-number approach can be understood as performing exactly the same computations as are done in the graph built by the symbol-to-symbol approach. The key difference is that the symbol-to-number approach does not expose the graph.
6.5.6 General Back-Propagation
The back-propagation algorithm is very simple. To compute the gradient of some scalar z with respect to one of its ancestors x in the graph, we begin by observing that the gradient with respect to z is given by dz/dz = 1. We can then compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z. We continue multiplying by Jacobians, traveling backward through the graph in this way until we reach x. For any node that may be reached by going backward from z through two or more paths, we simply sum the gradients arriving from different paths at that node.
More formally, each node in the graph G corresponds to a variable. To achieve maximum generality, we describe this variable as being a tensor V. Tensors in general can have any number of dimensions. They subsume scalars, vectors, and matrices. We assume that each variable V is associated with the following subroutines:
• get_operation(V): This returns the operation that computes V, represented by the edges coming into V in the computational graph. For example, there may be a Python or C++ class representing the matrix multiplication operation, and the get_operation function. Suppose we have a variable that is created by matrix multiplication, C = AB. Then get_operation(C) returns a pointer to an instance of the corresponding C++ class.
• get_consumers(V, G): This returns the list of variables that are children of V in the computational graph G.
• get_inputs(V, G): This returns the list of variables that are parents of V in the computational graph G.
Each operation op is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by equation 6.47. This is how the back-propagation algorithm is able to achieve great generality. Each operation is responsible for knowing how to back-propagate through the edges in the graph that it participates in. For example, we might use a matrix multiplication operation to create a variable C = AB. Suppose that the gradient of a scalar z with respect to C is given by G.
The matrix multiplication operation is responsible for defining two back-propagation rules, one for each of its input arguments.
If we call the bprop method to request the gradient with respect to A, given that the gradient on the output is G, then the bprop method of the matrix multiplication operation must state that the gradient with respect to A is given by GB⊤.
Likewise, if we call the bprop method to request the gradient with respect to B, then the matrix operation is responsible for implementing the bprop method and specifying that the desired gradient is given by A⊤G.
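These two rules can be written out directly, with plain Python lists standing in for matrices. The helper names are local to this sketch, not a library API.

```python
def matmul(A, B):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul_bprop_A(G, B):
    """Gradient of z w.r.t. A, given G = dz/dC for C = AB: returns G B^T."""
    return matmul(G, transpose(B))

def matmul_bprop_B(G, A):
    """Gradient of z w.r.t. B, given G = dz/dC for C = AB: returns A^T G."""
    return matmul(transpose(A), G)

# Check on a scalar-output case: z = sum of the entries of C, so G is all ones.
A, B = [[1.0, 2.0]], [[3.0], [4.0]]   # C = AB = [[11.0]]
G = [[1.0]]
# dz/dA = G B^T = [[3.0, 4.0]]; dz/dB = A^T G = [[1.0], [2.0]]
```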
The back-propagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop rules with the right arguments.
Formally, op.bprop(inputs, X, G) must return

  Σ_i (∇_X op.f(inputs)_i) G_i,   (6.54)

which is just an implementation of the chain rule as expressed in equation 6.47. Here, inputs is a list of inputs that are supplied to the operation, op.f is the mathematical function that the operation implements, X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation.
The op.bprop method should always pretend that all its inputs are distinct from each other, even if they are not. For example, if the mul operator is passed two copies of x to compute x², the op.bprop method should still return x as the derivative with respect to both inputs. The back-propagation algorithm will later add both of these arguments together to obtain 2x, which is the correct total derivative on x.
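This convention is easy to demonstrate. In the sketch below (hypothetical names, scalar multiply), the bprop rule treats each argument position as independent, and summing the two contributions recovers the total derivative 2x.

```python
def mul_bprop(inputs, wrt_index, G):
    """bprop for scalar multiply c = x * y: the gradient w.r.t. one input
    is G times the *other* input, with the two inputs treated as distinct
    even when they are the same object."""
    x, y = inputs
    return G * (y if wrt_index == 0 else x)

x = 3.0
# z = x * x: request the derivative with respect to each argument position.
d0 = mul_bprop((x, x), 0, 1.0)  # pretend the first x is independent -> 3.0
d1 = mul_bprop((x, x), 1, 1.0)  # pretend the second x is independent -> 3.0
total = d0 + d1                 # back-propagation sums the paths: 6.0 = 2x
```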
Software implementations of back-propagation usually provide both the operations and their bprop methods, so that users of deep learning software libraries are able to back-propagate through graphs built using common operations like matrix multiplication, exponents, logarithms, and so on.
Software engineers who build a new implementation of back-propagation or advanced users who need to add their own operation to an existing library must usually derive the op.bprop method for any new operations manually.
The back-propagation algorithm is formally described in algorithm 6.5.
In section 6.5.2, we explained that back-propagation was developed in order to avoid computing the same subexpression in the chain rule multiple times. The naive algorithm could have exponential runtime due to these repeated subexpressions.
Now that we have specified the back-propagation algorithm, we can understand its computational cost. If we assume that each operation evaluation has roughly the
Algorithm 6.5 The outermost skeleton of the back-propagation algorithm. This portion does simple setup and cleanup work. Most of the important work happens in the build_grad subroutine of algorithm 6.6.
Require: T, the target set of variables whose gradients must be computed
Require: G, the computational graph
Require: z, the variable to be differentiated
  Let G′ be G pruned to contain only nodes that are ancestors of z and descendants of nodes in T
  Initialize grad_table, a data structure associating tensors to their gradients
  grad_table[z] ← 1
  for V in T do
    build_grad(V, G, G′, grad_table)
  end for
  Return grad_table restricted to T
Algorithm 6.6 The inner loop subroutine build_grad(V, G, G′, grad_table) of the back-propagation algorithm, called by the back-propagation algorithm defined in algorithm 6.5.
Require: V, the variable whose gradient should be added to G and grad_table
Require: G, the graph to modify
Require: G′, the restriction of G to nodes that participate in the gradient
Require: grad_table, a data structure mapping nodes to their gradients
  if V is in grad_table then
    Return grad_table[V]
  end if
  i ← 1
  for C in get_consumers(V, G′) do
    op ← get_operation(C)
    D ← build_grad(C, G, G′, grad_table)
    G(i) ← op.bprop(get_inputs(C, G′), V, D)
    i ← i + 1
  end for
  G ← Σ_i G(i)
  grad_table[V] = G
  Insert G and the operations creating it into G
  Return G
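Algorithms 6.5 and 6.6 can be condensed into a runnable sketch for scalar variables. The class and function names are hypothetical, and the pruning of G to G′ is omitted for brevity; the memoization through grad_table and the sum over consumers are the essential parts.

```python
class Var:
    """A scalar node in a tiny computational graph (illustrative only)."""
    def __init__(self, value, op=None, inputs=()):
        self.value, self.op, self.inputs = value, op, inputs
        self.consumers = []
        for p in inputs:
            p.consumers.append(self)

def mul(a, b):
    return Var(a.value * b.value, op='mul', inputs=(a, b))

def add(a, b):
    return Var(a.value + b.value, op='add', inputs=(a, b))

def bprop(node, wrt, D):
    """op.bprop: gradient contribution flowing through `node` to input `wrt`."""
    a, b = node.inputs
    if node.op == 'add':
        return D
    if node.op == 'mul':  # inputs treated as distinct (see text above)
        return D * (b.value if wrt is a else a.value)

def build_grad(V, grad_table, z):
    """Algorithm 6.6: memoized gradient of z with respect to V."""
    if V in grad_table:
        return grad_table[V]
    total = 0.0
    for C in V.consumers:  # sum the gradients over every path leaving V
        D = build_grad(C, grad_table, z)
        total += bprop(C, V, D)
    grad_table[V] = total
    return total

def grad(targets, z):
    """Algorithm 6.5: outer skeleton, seeded with dz/dz = 1."""
    grad_table = {z: 1.0}
    return [build_grad(V, grad_table, z) for V in targets]

x, y = Var(2.0), Var(3.0)
z = add(mul(x, y), mul(x, x))   # z = xy + x^2
gx, gy = grad([x, y], z)        # dz/dx = y + 2x = 7.0; dz/dy = x = 2.0
```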
same cost, then we may analyze the computational cost in terms of the number of operations executed. Keep in mind here that we refer to an operation as the fundamental unit of our computational graph, which might actually consist of several arithmetic operations (for example, we might have a graph that treats matrix multiplication as a single operation).
Computing a gradient in a graph with n nodes will never execute more than O(n²) operations or store the output of more than O(n²) operations. Here we are counting operations in the computational graph, not individual operations executed by the underlying hardware, so it is important to remember that the runtime of each operation may be highly variable. For example, multiplying two matrices that each contain millions of entries might correspond to a single operation in the graph.
We can see that computing the gradient requires at most O(n²) operations, because the forward propagation stage will at worst execute all n nodes in the original graph (depending on which values we want to compute, we may not need to execute the entire graph). The back-propagation algorithm adds one Jacobian-vector product, which should be expressed with O(1) nodes, per edge in the original graph. Because the computational graph is a directed acyclic graph, it has at most O(n²) edges.
For the kinds of graphs that are commonly used in practice, the situation is even better. Most neural network cost functions are roughly chain-structured, causing back-propagation to have O(n) cost. This is far better than the naive approach, which might need to execute exponentially many nodes.
This potentially exponential cost can be seen by expanding and rewriting the recursive chain rule (equation 6.53) nonrecursively:

  ∂u(n)/∂u(1) = Σ_{path (u(π1), u(π2), . . . , u(πt)) from π1 = 1 to πt = n} Π_{k=2}^{t} ∂u(πk)/∂u(πk−1).   (6.55)

Since the number of paths from node 1 to node n can grow exponentially in the length of these paths, the number of terms in the above sum, which is the number of such paths, can grow exponentially with the depth of the forward propagation graph. This large cost would be incurred because the same computation for ∂u(i)/∂u(j) would be redone many times. To avoid such recomputation, we can think of back-propagation as a table-filling algorithm that takes advantage of storing intermediate results ∂u(n)/∂u(i). Each node in the graph has a corresponding slot in a table to store the gradient for that node. By filling in these table entries in order, back-propagation avoids repeating many common subexpressions. This table-filling strategy is sometimes called dynamic programming.
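The dynamic-programming effect can be demonstrated on a toy graph where node u(k) depends on u(k−1) and u(k−2), so the number of paths from u(1) to u(n) grows like the Fibonacci numbers. All names and the constant local derivative are illustrative.

```python
from functools import lru_cache

n = 20
local = 0.5  # every local partial derivative, for illustration

calls = 0
def naive(k):
    """Sum over all paths by plain recursion: exponentially many calls."""
    global calls
    calls += 1
    if k <= 2:
        return 1.0 if k == 1 else local
    return local * naive(k - 1) + local * naive(k - 2)

@lru_cache(maxsize=None)
def memo(k):
    """Table-filling version: each node's gradient is computed once."""
    if k <= 2:
        return 1.0 if k == 1 else local
    return local * memo(k - 1) + local * memo(k - 2)

naive_val = naive(n)   # `calls` ends up in the thousands for n = 20
memo_val = memo(n)     # only n distinct evaluations
```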
6.5.7 Example: Back-Propagation for MLP Training
As an example, we walk through the back-propagation algorithm as it is used to train a multilayer perceptron. Here we develop a very simple multilayer perceptron with a single hidden layer. To train this model, we will use minibatch stochastic gradient descent. The back-propagation algorithm is used to compute the gradient of the cost on a single minibatch.
Specifically, we use a minibatch of examples from the training set formatted as a design matrix X and a vector of associated class labels y. The network computes a layer of hidden features H = max{0, XW(1)}. To simplify the presentation, we do not use biases in this model.
We assume that our graph language includes a relu operation that can compute max{0, Z} element-wise. The predictions of the unnormalized log probabilities over classes are then given by HW(2).
We assume that our graph language includes a cross_entropy operation that computes the cross-entropy between the targets y and the probability distribution defined by these unnormalized log probabilities. The resulting cross-entropy defines the cost JMLE. Minimizing this cross-entropy performs maximum likelihood estimation of the classifier.
However, to make this example more realistic, we also include a regularization term. The total cost

  J = JMLE + λ ( Σ_{i,j} (W(1)_{i,j})² + Σ_{i,j} (W(2)_{i,j})² )   (6.56)

consists of the cross-entropy and a weight decay term with coefficient λ. The computational graph is illustrated in figure 6.11.
The computational graph for the gradient of this example is large enough that it would be tedious to draw or to read. This demonstrates one of the benefits of the back-propagation algorithm, which is that it can automatically generate gradients that would be straightforward but tedious for a software engineer to derive manually.
We can roughly trace out the behavior of the back-propagation algorithm by looking at the forward propagation graph in figure 6.11. To train, we wish to compute both ∇W(1) J and ∇W(2) J. There are two different paths leading backward from J to the weights: one through the cross-entropy cost, and one through the weight decay cost. The weight decay cost is relatively simple; it will always contribute 2λW(i) to the gradient on W(i). The other path through the cross-entropy cost is slightly more complicated. Let G be the gradient on the unnormalized log probabilities U(2) provided by the cross_entropy operation. The back-propagation algorithm now needs to
Figure 6.11: The computational graph used to compute the cost to train our example of a single-layer MLP using the cross-entropy loss and weight decay.
explore two different branches. On the shorter branch, it adds H⊤G to the gradient on W(2), using the back-propagation rule for the second argument to the matrix multiplication operation. The other branch corresponds to the longer chain descending further along the network. First, the back-propagation algorithm computes GW(2)⊤ using the back-propagation rule for the first argument to the matrix multiplication operation. Next, the relu operation uses its back-propagation rule to zero out components of the gradient corresponding to entries of U(1) that are less than 0. Let the result be called G′. The last step of the back-propagation algorithm is to use the back-propagation rule for the second argument of the matmul operation to add X⊤G′ to the gradient on W(1).
After these gradients have been computed, the gradient descent algorithm, or another optimization algorithm, uses these gradients to update the parameters.
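The whole trace can be checked numerically on a tiny instance. This is a minimal sketch with hypothetical helper names, and a squared-error surrogate L = ½U(2)² standing in for the cross_entropy node so that G = U(2); both gradient paths (the loss branch and the 2λW weight decay branch) are summed exactly as described above.

```python
lam = 0.1
X  = [[1.0, 2.0]]                 # one example, two features
W1 = [[1.0, -1.0], [0.5, 1.0]]    # 2x2
W2 = [[1.0], [2.0]]               # 2x1

def mm(A, B):
    return [[sum(a * b for a, b in zip(r, c)) for c in zip(*B)] for r in A]

def T(A):
    return [list(c) for c in zip(*A)]

U1 = mm(X, W1)                               # pre-activations, 1x2
H  = [[max(0.0, v) for v in r] for r in U1]  # relu
U2 = mm(H, W2)                               # unnormalized outputs, 1x1
G  = U2                                      # dL/dU2 for the surrogate loss

# Shorter branch H^T G plus weight decay 2*lam*W2:
gW2 = [[a + 2 * lam * w for a, w in zip(ra, rw)]
       for ra, rw in zip(mm(T(H), G), W2)]
# Longer branch: G W2^T, relu zeroes entries where U1 <= 0, then X^T G'.
Gp  = [[g * (1.0 if u > 0 else 0.0) for g, u in zip(rg, ru)]
       for rg, ru in zip(mm(G, T(W2)), U1)]
gW1 = [[a + 2 * lam * w for a, w in zip(ra, rw)]
       for ra, rw in zip(mm(T(X), Gp), W1)]
```

For these numbers, U1 = [[2.0, 1.0]], U2 = [[4.0]], gW2 ≈ [[8.2], [4.4]] and gW1 ≈ [[4.2, 7.8], [8.1, 16.2]].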
For the MLP, the computational cost is dominated by the cost of matrix multiplication. During the forward propagation stage, we multiply by each weight matrix, resulting in O(w) multiply-adds, where w is the number of weights. During the backward propagation stage, we multiply by the transpose of each weight matrix, which has the same computational cost.
The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer.
This value is stored from the time it is computed until the backward pass has returned to the same point. The memory cost is thus O(mnh), where m is the number of examples in the minibatch and nh is the number of hidden units.
6.5.8 Complications
Our description of the back-propagation algorithm here is simpler than the implementations actually used in practice. As noted above, we have restricted the definition of an operation to a function that returns a single tensor. Most software implementations need to support operations that can return more than one tensor. For example, if we wish to compute both the maximum value in a tensor and the index of that value, it is best to compute both in a single pass through memory, so it is most efficient to implement this procedure as a single operation with two outputs.
We have not described how to control the memory consumption of back-propagation. Back-propagation often involves summation of many tensors together. In the naive approach, each of these tensors would be computed separately, then all of them would be added in a second step. The naive approach has an overly high memory bottleneck that can be avoided by maintaining a single buffer and adding each value to that buffer as it is computed.
Real-world implementations of back-propagation also need to handle various data types, such as 32-bit floating point, 64-bit floating point, and integer values. The policy for handling each of these types takes special care to design.
Some operations have undefined gradients, and it is important to track these cases and determine whether the gradient requested by the user is undefined.
Various other technicalities make real-world differentiation more complicated. These technicalities are not insurmountable, and this chapter has described the key intellectual tools needed to compute derivatives, but it is important to be aware that many more subtleties exist.
6.5.9 Differentiation outside the Deep Learning Community
The deep learning community has been somewhat isolated from the broader computer science community and has largely developed its own cultural attitudes concerning how to perform differentiation. More generally, the field of automatic differentiation is concerned with how to compute derivatives algorithmically.
The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called reverse mode accumulation. Other approaches evaluate the subexpressions of the chain rule in different orders.
In general, determining the order of evaluation that results in the lowest computational cost is a difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-complete (Naumann, 2008), in the sense that it may require simplifying algebraic expressions into their least expensive form.
For example, suppose we have variables p1, p2, . . . , pn representing probabilities, and variables z1, z2, . . . , zn representing unnormalized log probabilities. Suppose we define

  q_i = exp(z_i) / Σ_i exp(z_i),   (6.57)

where we build the softmax function out of exponentiation, summation and division operations, and construct a cross-entropy loss J = −Σ_i p_i log q_i.
A human mathematician can observe that the derivative of J with respect to z_i takes a very simple form: q_i − p_i.
The back-propagation algorithm is not capable of simplifying the gradient this way and will instead explicitly propagate gradients through all the logarithm and exponentiation operations in the original graph.
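The simple closed form q_i − p_i is easy to verify numerically. In the sketch below (hypothetical helper names), a central finite difference stands in for the explicit graph propagation and is compared against q − p.

```python
import math

def softmax(z):
    m = max(z)                       # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(p, z):
    # J = -sum_i p_i log q_i with q = softmax(z)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, softmax(z)))

p = [0.2, 0.3, 0.5]                  # target probabilities
z = [0.1, -1.0, 2.0]                 # unnormalized log probabilities
q = softmax(z)

eps = 1e-6
diffs = []
for i in range(len(z)):
    zp, zm = list(z), list(z)
    zp[i] += eps
    zm[i] -= eps
    numeric = (cross_entropy(p, zp) - cross_entropy(p, zm)) / (2 * eps)
    diffs.append(abs(numeric - (q[i] - p[i])))  # compare with q_i - p_i
# every entry of diffs is tiny
```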
Some software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012) are able to perform some kinds of algebraic substitution to improve over the graph proposed by the pure back-propagation algorithm.
When the forward graph G has a single output node and each partial derivative ∂u(i)/∂u(j) can be computed with a constant amount of computation, back-propagation guarantees that the number of computations for the gradient computation is of the same order as the number of computations for the forward computation: this can be seen in algorithm 6.2, because each local partial derivative ∂u(i)/∂u(j) needs to be computed only once, along with an associated multiplication and addition for the recursive chain-rule formulation (equation 6.53).
The overall computation is therefore O(# edges). It can potentially be reduced, however, by simplifying the computational graph constructed by back-propagation, and this is an NP-complete task. Implementations such as Theano and TensorFlow use heuristics based on matching known simplification patterns to iteratively attempt to simplify the graph.
We defined back-propagation only for computing the gradient of a scalar output, but back-propagation can be extended to compute a Jacobian (either of k different scalar nodes in the graph, or of a tensor-valued node containing k values). A naive implementation may then need k times more computation: for each scalar internal node in the original forward graph, the naive implementation computes k gradients instead of a single gradient.
When the number of outputs of the graph is larger than the number of inputs, it is sometimes preferable to use another form of automatic differentiation called forward mode accumulation. Forward mode accumulation has been proposed for obtaining real-time computation of gradients in recurrent networks, for example (Williams and Zipser, 1989).
This approach also avoids the need to store the values and gradients for the whole graph, trading off computational efficiency for memory.
relationship
et
ween
forw
ard
mode
and
bac
kward
mo
de
is
analogous
to
the
relationship
betw
een
left-m
ultiplying
versus
righ
t-m
ultiplying
sequence
of
matrices,
such
as
AB
(6.58)
where
the
matrices
can
thought
of
as
Jacobian.
For example, if D is a column vector while A has many rows, the graph will have a single output and many inputs, and starting the multiplications from the end and going backward requires only matrix-vector products. This order corresponds to the backward mode. Instead, starting to multiply from the left would involve a series of matrix-matrix products, which makes the whole computation much more expensive.
If A has fewer rows than D has columns, however, it is cheaper to run the multiplications left-to-right, corresponding to the forward mode.
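The cost gap is easy to make concrete by counting multiplications for the chain ABCD when D is a column vector. The sizes below are illustrative.

```python
def matvec_muls(rows, cols):
    # multiplications needed for an (rows x cols) matrix times a vector
    return rows * cols

def matmat_muls(n, m, p):
    # multiplications needed for an (n x m) times (m x p) matrix product
    return n * m * p

n = 1000
# Jacobian chain A B C D with A, B, C each n x n and D an n x 1 column vector.
# Backward mode (right-to-left): three matrix-vector products.
backward_muls = 3 * matvec_muls(n, n)                        # 3 * 10^6
# Left-to-right: two matrix-matrix products before the final matrix-vector
# product, which dominates the cost.
left_to_right_muls = 2 * matmat_muls(n, n, n) + matvec_muls(n, n)
```

Here the right-to-left order costs about 3 million multiplications while the left-to-right order costs about 2 billion, matching the argument above.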
In many communities outside machine learning, it is more common to implement differentiation software that acts directly on traditional programming language code, such as Python or C code, and automatically generates programs that differentiate functions written in these languages.
In the deep learning community, computational graphs are usually represented by explicit data structures created by specialized libraries.
The specialized approach has the drawback of requiring the library developer to define the bprop methods for every operation and limiting the user of the library to only those operations that have been defined. Yet the specialized approach also has the benefit of allowing customized back-propagation rules to be developed for each operation, enabling the developer to improve speed or stability in nonobvious ways that an automatic procedure would presumably be unable to replicate.
Back-propagation is therefore not the only way or the optimal way of computing the gradient, but it is a practical method that continues to serve the deep learning community well.
In the future, differentiation technology for deep networks may improve as deep learning practitioners become more aware of advances in the broader field of automatic differentiation.
6.5.10 Higher-Order Derivatives
Some software frameworks support the use of higher-order derivatives. Among the deep learning software frameworks, this includes at least Theano and TensorFlow.
These libraries use the same kind of data structure to describe the expressions for derivatives as they use to describe the original function being differentiated. This means that the symbolic differentiation machinery can be applied to derivatives.
In the context of deep learning, it is rare to compute a single second derivative of a scalar function. Instead, we are usually interested in properties of the Hessian matrix. If we have a function f : ℝⁿ → ℝ, then the Hessian matrix is of size n × n. In typical deep learning applications, n will be the number of parameters in the model, which could easily number in the billions. The entire Hessian matrix is thus infeasible to even represent.
Instead of explicitly computing the Hessian, the typical deep learning approach is to use Krylov methods. Krylov methods are a set of iterative techniques for performing various operations, such as approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products.
To use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix H and an arbitrary vector v. A straightforward technique (Christianson, 1992) for doing so is to compute

  Hv = ∇_x [ (∇_x f(x))⊤ v ].   (6.59)
Both gradient computations in this expression may be computed automatically by the appropriate software library. Note that the outer gradient expression takes the gradient of a function of the inner gradient expression. If v is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced v.
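The identity behind equation 6.59 can be checked on a small function. In this sketch the test function and helper names are hypothetical, the inner gradient is written in closed form, and a central finite difference along v stands in for the outer symbolic gradient.

```python
def grad_f(x, y):
    # Closed-form gradient of the test function f(x, y) = x^2 y + y^3.
    return (2.0 * x * y, x * x + 3.0 * y * y)

def hessian_vector_product(x, y, v, eps=1e-6):
    """Approximate Hv using Hv = d/dt grad f((x, y) + t v) at t = 0,
    a numerical stand-in for the nested gradient of equation 6.59."""
    gp = grad_f(x + eps * v[0], y + eps * v[1])
    gm = grad_f(x - eps * v[0], y - eps * v[1])
    return tuple((a - b) / (2.0 * eps) for a, b in zip(gp, gm))

# At (1, 2) the Hessian is [[2y, 2x], [2x, 6y]] = [[4, 2], [2, 12]],
# so H @ (1, 1) should be close to (6, 14).
hv = hessian_vector_product(1.0, 2.0, (1.0, 1.0))
```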
While computing the Hessian is usually not advisable, it is possible to do with Hessian-vector products. One simply computes He(i) for all i = 1, . . . , n, where e(i) is the one-hot vector with e(i)_i = 1 and all other entries equal to 0.
6.6 Historical Notes
Feedforward networks can be seen as efficient nonlinear function approximators based on using gradient descent to minimize the error in a function approximation. From this point of view, the modern feedforward network is the culmination of centuries of progress on the general function approximation task.
The chain rule that underlies the back-propagation algorithm was invented in the seventeenth century (Leibniz, 1676; L'Hôpital, 1696). Calculus and algebra have long been used to solve optimization problems in closed form, but gradient descent was not introduced as a technique for iteratively approximating the solution to optimization problems until the nineteenth century (Cauchy, 1847).
Beginning in the 1940s, these function approximation techniques were used to motivate machine learning models such as the perceptron. However, the earliest models were based on linear models.
Critics including Marvin Minsky pointed out several of the flaws of the linear model family, such as its inability to learn the XOR function, which led to a backlash against the entire neural network approach.
Learning nonlinear functions required the development of the multilayer perceptron and a means of computing the gradient through such a model.
Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973), but also for sensitivity analysis (Linnainmaa, 1976).
Werbos (1981) proposed applying these techniques to training artificial neural networks. The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a).
The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks.
The ideas put forward by the authors of that book, particularly by Rumelhart and Hinton, go much beyond back-propagation. They include crucial ideas about the possible computational implementation of several central aspects of cognition and learning, which came under the name "connectionism" because of the importance this school of thought places on the connections between neurons as the locus of learning and memory. In particular, these ideas include the notion of distributed representation (Hinton et al., 1986).
Following the success of back-propagation, neural network research gained popularity and reached a peak in the early 1990s. Afterwards, other machine learning techniques became more popular until the modern deep learning renaissance that began in 2006.
The core ideas behind modern feedforward networks have not changed substantially since the 1980s. The same back-propagation algorithm and the same approaches to gradient descent are still in use.
Most of the improvement in neural network performance from 1986 to 2015 can be attributed to two factors. First, larger datasets have reduced the degree to which statistical generalization is a challenge for neural networks. Second, neural networks have become much larger, because of more powerful computers and better software infrastructure.
A small number of algorithmic changes have also improved the performance of neural networks noticeably.
One of these algorithmic changes was the replacement of mean squared error with the cross-entropy family of loss functions. Mean squared error was popular in the 1980s and 1990s but was gradually replaced by cross-entropy losses and the principle of maximum likelihood as ideas spread between the statistics community and the machine learning community.
The use of cross-entropy losses greatly improved the performance of models with sigmoid and softmax outputs, which had previously suffered from saturation and slow learning when using the mean squared error loss.
The other major algorithmic change that has greatly improved the performance of feedforward networks was the replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units.
Rectification using the max{0, z} function was introduced in early neural network models and dates back at least as far as the cognitron and neocognitron (Fukushima, 1975, 1980). These early models did not use rectified linear units but instead applied rectification to nonlinear functions.
Despite the early popularity of rectification, it was largely replaced by sigmoids in the 1980s, perhaps because sigmoids perform better when neural networks are very small. As of the early 2000s, rectified linear units were avoided because of a somewhat superstitious belief that activation functions with nondifferentiable points must be avoided. This began to change in about 2009.
Jarrett et al. (2009) observed that "using a rectifying nonlinearity is the single most important factor in improving the performance of a recognition system," among several different factors of neural network architecture design.
For small datasets, Jarrett et al. (2009) observed that using rectifying nonlinearities is even more important than learning the weights of the hidden layers. Random weights are sufficient to propagate useful information through a rectified linear network, enabling the classifier layer at the top to learn how to map different feature vectors to class identities. When more data is available, learning begins to extract enough useful knowledge to exceed the performance of randomly chosen parameters.
Glorot et al. (2011a) showed that learning is far easier in deep rectified linear networks than in deep networks that have curvature or two-sided saturation in their activation functions.
Rectified linear units are also of historical interest because they show that neuroscience has continued to have an influence on the development of deep learning algorithms. Glorot et al. (2011a) motivated rectified linear units from biological considerations.
The half-rectifying nonlinearity was intended to capture these properties of biological neurons: (1) For some inputs, biological neurons are completely inactive. (2) For some inputs, a biological neuron's output is proportional to its input. (3) Most of the time, biological neurons operate in the regime where they are inactive (i.e., they should have sparse activations).
When the modern resurgence of deep learning began in 2006, feedforward networks continued to have a bad reputation. From about 2006 to 2012, it was widely believed that feedforward networks would not perform well unless they were assisted by other models, such as probabilistic models. Today, it is now known that with the right resources and engineering practices, feedforward networks perform very well.
Today, gradient-based learning in feedforward networks is used as a tool to develop probabilistic models, such as the variational autoencoder and generative adversarial networks, described in chapter 20.
Rather than being viewed as an unreliable technology that must be supported by other techniques, gradient-based learning in feedforward networks has been viewed since 2012 as a powerful technology that can be applied to many other machine learning tasks.
In 2006, the community used unsupervised learning to support supervised learning, and now, ironically, it is more common to use supervised learning to support unsupervised learning.
Feedforward networks continue to have unfulfilled potential. In the future, we expect they will be applied to many more tasks, and that advances in optimization algorithms and model design will improve their performance even further. This chapter has primarily described the neural network family of models. In the subsequent chapters, we turn to how to use these models: how to regularize and train them.