Chapter 9

Convolutional Networks
Convolutional networks (LeCun, 1989), also known as convolutional neural networks, or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. Examples include time-series data, which can be thought of as a 1-D grid taking samples at regular time intervals, and image data, which can be thought of as a 2-D grid of pixels. Convolutional networks have been tremendously successful in practical applications. The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
In this chapter, we first describe what convolution is. Next, we explain the motivation behind using convolution in a neural network. We then describe an operation called pooling, which almost all convolutional networks employ. Usually, the operation used in a convolutional neural network does not correspond precisely to the definition of convolution as used in other fields, such as engineering or pure mathematics. We describe several variants on the convolution function that are widely used in practice for neural networks. We also show how convolution may be applied to many kinds of data, with different numbers of dimensions. We then discuss means of making convolution more efficient. Convolutional networks stand out as an example of neuroscientific principles influencing deep learning. We discuss these neuroscientific principles, then conclude with comments about the role convolutional networks have played in the history of deep learning.
One topic this chapter does not address is how to choose the architecture of your convolutional network. The goal of this chapter is to describe the kinds of tools that convolutional networks provide, while chapter 11 describes general guidelines for choosing which tools to use in which circumstances. Research into convolutional network architectures proceeds so rapidly that a new best architecture for a given benchmark is announced every few weeks to months, rendering it impractical to describe the best architecture in print. Nonetheless, the best architectures have consistently been composed of the building blocks described here.
9.1 The Convolution Operation

In its most general form, convolution is an operation on two functions of a real-valued argument. To motivate the definition of convolution, we start with examples of two functions we might use.
Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output x(t), the position of the spaceship at time t. Both x and t are real valued, that is, we can get a different reading from the laser sensor at any instant in time.
Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship's position, we would like to average several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

s(t) = \int x(a) \, w(t - a) \, da.   (9.1)
This operation is called convolution. The convolution operation is typically denoted with an asterisk:

s(t) = (x \ast w)(t).   (9.2)
In our example, w needs to be a valid probability density function, or the output will not be a weighted average. Also, w needs to be 0 for all negative arguments, or it will look into the future, which is presumably beyond our capabilities. These limitations are particular to our example, though. In general, convolution is defined for any functions for which the above integral is defined and may be used for other purposes besides taking weighted averages.
In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input, and the second argument (in this example, the function w) as the kernel. The output is sometimes referred to as the feature map.
In our example, the idea of a laser sensor that can provide measurements at every instant in time is not realistic. Usually, when we work with data on a computer, time will be discretized, and our sensor will provide data at regular intervals. In our example, it might be more realistic to assume that our laser provides a measurement once per second. The time index t can then take on only integer values. If we now assume that x and w are defined only on integer t, we can define the discrete convolution:

s(t) = (x \ast w)(t) = \sum_{a=-\infty}^{\infty} x(a) \, w(t - a).   (9.3)
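As a minimal illustration of equation 9.3, the following NumPy sketch (function names and example values are made up for illustration) computes the discrete convolution over the finite set of stored values and applies it to the smoothing example above.

```python
import numpy as np

def discrete_convolution_1d(x, w):
    """Discrete convolution s(t) = sum_a x(a) w(t - a), keeping only the
    positions where the kernel fully overlaps the stored values of x."""
    k = len(w)
    w_flipped = w[::-1]                      # convolution flips the kernel
    return np.array([np.dot(x[t:t + k], w_flipped)
                     for t in range(len(x) - k + 1)])

# Smooth noisy position measurements with a weighting function w that gives
# more weight to recent measurements (w sums to 1, so the output is a weighted average).
measurements = np.array([0.0, 0.9, 2.1, 2.9, 4.2, 5.1, 5.8])
weights = np.array([0.5, 0.3, 0.2])          # w(0) = 0.5 weights the newest sample
smoothed = discrete_convolution_1d(measurements, weights)
print(np.allclose(smoothed, np.convolve(measurements, weights, mode="valid")))  # True
```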
In machine learning applications, the input is usually a multidimensional array of data, and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm. We will refer to these multidimensional arrays as tensors. Because each element of the input and kernel must be explicitly stored separately, we usually assume that these functions are zero everywhere but in the finite set of points for which we store the values. This means that in practice, we can implement the infinite summation as a summation over a finite number of array elements.
Finally, we often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

S(i, j) = (I \ast K)(i, j) = \sum_m \sum_n I(m, n) \, K(i - m, j - n).   (9.4)
Convolution is commutative, meaning we can equivalently write

S(i, j) = (K \ast I)(i, j) = \sum_m \sum_n I(i - m, j - n) \, K(m, n).   (9.5)
Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of m and n.

The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation. Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:

S(i, j) = (I \ast K)(i, j) = \sum_m \sum_n I(i + m, j + n) \, K(m, n).   (9.6)
Many machine learning libraries implement cross-correlation but call it convolution. In this text we follow this convention of calling both operations convolution and specify whether we mean to flip the kernel or not in contexts where kernel flipping is relevant.
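The difference between the two operations is easy to state in code. The sketch below, a minimal NumPy illustration with made-up function names, implements the "valid" form of the cross-correlation in equation 9.6 and recovers the flipped-kernel convolution by reversing the kernel.

```python
import numpy as np

def cross_correlate_2d(image, kernel):
    """'Valid' 2-D cross-correlation (equation 9.6): no kernel flipping."""
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve_2d(image, kernel):
    """True 2-D convolution: cross-correlation with a flipped kernel."""
    return cross_correlate_2d(image, kernel[::-1, ::-1])

image = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate_2d(image, kernel))   # what most deep learning libraries call "convolution"
print(convolve_2d(image, kernel))          # the definition with kernel flipping
```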
In the context of machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an algorithm without the flipping. It is also rare for convolution to be used alone in machine learning; instead convolution is used simultaneously with other functions, and the combination of these functions does not commute regardless of whether the convolution operation flips its kernel or not. See figure 9.1 for an example of convolution (without kernel flipping) applied to a 2-D tensor.
Discrete convolution can be viewed as multiplication by a matrix, but the matrix has several entries constrained to be equal to other entries. For example, for univariate discrete convolution, each row of the matrix is constrained to be equal to the row above shifted by one element. This is known as a Toeplitz matrix. In two dimensions, a doubly block circulant matrix corresponds to convolution. In addition to these constraints that several elements be equal to each other, convolution usually corresponds to a very sparse matrix (a matrix whose entries are mostly equal to zero). This is because the kernel is usually much smaller than the input image. Any neural network algorithm that works with matrix multiplication and does not depend on specific properties of the matrix structure should work with convolution, without requiring any further changes to the neural network. Typical convolutional neural networks do make use of further specializations in order to deal with large inputs efficiently, but these are not strictly necessary from a theoretical perspective.
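The matrix view can be made explicit for the univariate case. In the following sketch (a NumPy illustration with made-up names), each row of the Toeplitz matrix holds the flipped kernel shifted one column to the right of the row above, and most entries are zero because the kernel is much smaller than the input.

```python
import numpy as np

def conv_as_toeplitz(x, w):
    """Express 'valid' 1-D convolution as multiplication by a (sparse) Toeplitz matrix."""
    n, k = len(x), len(w)
    T = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        T[i, i:i + k] = w[::-1]    # each row repeats the flipped kernel, shifted by one
    return T @ x, T

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0])
w = np.array([0.25, 0.5, 0.25])
y, T = conv_as_toeplitz(x, w)
print(np.allclose(y, np.convolve(x, w, mode="valid")))  # True
```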
9.2 Motivation

Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing, and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size. We now describe each of these ideas in turn.

Figure 9.1: An example of 2-D convolution without kernel flipping. We restrict the output to only positions where the kernel lies entirely within the image, called "valid" convolution in some contexts. We draw boxes with arrows to indicate how the upper-left element of the output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.
Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means that every output unit interacts with every input unit. Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights). This is accomplished by making the kernel smaller than the input. For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large. If there are m inputs and n outputs, then matrix multiplication requires m × n parameters, and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime. For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k several orders of magnitude smaller than m. For graphical demonstrations of sparse connectivity, see figure 9.2 and figure 9.3.
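The scale of these savings is easy to check with a back-of-the-envelope calculation; the numbers below are arbitrary and chosen only to illustrate the m × n versus k × n comparison.

```python
# Illustrative parameter/operation counts for dense vs. sparsely connected layers.
m = 1_000_000   # input units (e.g., pixels)
n = 1_000_000   # output units
k = 100         # connections per output unit in the sparse (convolutional) case

dense_parameters  = m * n   # one weight per input-output pair
sparse_parameters = k * n   # each output connects to only k inputs

print(dense_parameters // sparse_parameters)  # 10,000x fewer parameters and operations
```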
In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input, as shown in figure 9.4. This allows the network to efficiently describe complicated interactions between many variables by constructing such interactions from simple building blocks that each describe only sparse interactions.
Figure 9.2: Sparse connectivity, viewed from below. We highlight one input unit, x3, and the output units in s that are affected by this unit. (Top) When s is formed by convolution with a kernel of width three, only three outputs are affected by x3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all the outputs are affected by x3.

Figure 9.3: Sparse connectivity, viewed from above. We highlight one output unit, s3, and the input units in x that affect this unit. These units are known as the receptive field of s3. (Top) When s is formed by convolution with a kernel of width three, only three inputs affect s3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all the inputs affect s3.

Figure 9.4: The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers. This effect increases if the network includes architectural features like strided convolution (figure 9.12) or pooling (section 9.3). This means that even though direct connections in a convolutional net are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image.
Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural net, each element of the weight matrix is used exactly once when computing the output of a layer. It is multiplied by one element of the input and then never revisited. As a synonym for parameter sharing, one can say that a network has tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere. In a convolutional neural net, each member of the kernel is used at every position of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding the boundary). The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This does not affect the runtime of forward propagation, which is still O(k × n), but it does further reduce the storage requirements of the model to k parameters. Recall that k is usually several orders of magnitude smaller than m. Since m and n are usually roughly the same size, k is practically insignificant compared to m × n. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the memory requirements and statistical efficiency. For a graphical depiction of how parameter sharing works, see figure 9.5.
As an example of both of these first two principles in action, figure 9.6 shows how sparse connectivity and parameter sharing can dramatically improve the efficiency of a linear function for detecting edges in an image.

Figure 9.5: Parameter sharing. Black arrows indicate the connections that use a particular parameter in two different models. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Because of parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing, so the parameter is used only once.

Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall. The input image is 320 pixels wide, while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating-point operations (two multiplications and one addition per output pixel) to compute using convolution. To describe the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating-point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication and convolution would require the same number of floating-point operations to compute. The matrix would still need to contain 2 × 319 × 280 = 178,640 entries. Convolution is an extremely efficient way of describing transformations that apply the same linear transformation of a small local region across the entire input. Photo credit: Paula Goodfellow.
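The edge detector of figure 9.6 is simple enough to write out directly. The sketch below (a NumPy illustration with a made-up toy image) applies the two-element kernel as a "valid" cross-correlation along each row.

```python
import numpy as np

def vertical_edge_strength(image):
    """Edge detector from figure 9.6: each output pixel is the input pixel minus
    its neighbor to the left, i.e. a 'valid' cross-correlation with the kernel [-1, 1]."""
    return image[:, 1:] - image[:, :-1]

# A toy 4x5 "image" with a vertical edge between columns 2 and 3.
image = np.array([[0.0, 0.0, 0.0, 1.0, 1.0]] * 4)
print(vertical_edge_strength(image))
# Only two parameters (the kernel [-1, 1]) describe this transformation; a dense
# weight matrix mapping the 4x5 input to the 4x4 output would need 20 * 16 = 320 entries.
```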
In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation. To say a function is equivariant means that if the input changes, the output changes in the same way. Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In the case of convolution, if we let g be any function that translates the input, that is, shifts it, then the convolution function is equivariant to g. For example, let I be a function giving image brightness at integer coordinates. Let g be a function mapping one image function to another image function, such that I' = g(I) is the image function with I'(x, y) = I(x - 1, y). This shifts every pixel of I one unit to the right. If we apply this transformation to I, then apply convolution, the result will be the same as if we applied convolution to I, then applied the transformation g to the output.
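This equality can be checked numerically. The sketch below (a NumPy illustration; the helper names are made up) shifts an image one pixel to the right, convolves, and compares the result against convolving first and shifting afterward; away from the leftmost output column, which is a boundary effect, the two orders agree.

```python
import numpy as np

def cross_correlate_2d(image, kernel):
    """'Valid' 2-D cross-correlation, as defined earlier in this chapter."""
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def shift_right(image):
    """g(I): shift every pixel one unit to the right, padding the left edge with zeros."""
    shifted = np.zeros_like(image)
    shifted[:, 1:] = image[:, :-1]
    return shifted

rng = np.random.default_rng(0)
I = rng.standard_normal((6, 6))
K = rng.standard_normal((3, 3))

shift_then_conv = cross_correlate_2d(shift_right(I), K)
conv_then_shift = shift_right(cross_correlate_2d(I, K))
# Away from the leftmost output column, the two orders agree: equivariance to translation.
print(np.allclose(shift_then_conv[:, 1:], conv_then_shift[:, 1:]))  # True
```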
When processing time-series data, this means that convolution produces a sort of timeline that shows when different features appear in the input. If we move an event later in time in the input, the exact same representation of it will appear in the output, just later. Similarly with images, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move the same amount in the output. This is useful for when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations. For example, when processing images, it is useful to detect edges in the first layer of a convolutional network. The same edges appear more or less everywhere in the image, so it is practical to share parameters across the entire image.
In some cases, we may not wish to share parameters across the entire image. For example, if we are processing images that are cropped to be centered on an individual's face, we probably want to extract different features at different locations: the part of the network processing the top of the face needs to look for eyebrows, while the part of the network processing the bottom of the face needs to look for a chin.
Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image. Other mechanisms are necessary for handling these kinds of transformations.
Finally, some kinds of data cannot be processed by neural networks defined by matrix multiplication with a fixed-shape matrix. Convolution enables processing of some of these kinds of data. We discuss this further in section 9.7.
9.3 Pooling

A typical layer of a convolutional network consists of three stages (see figure 9.7). In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations. In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage. In the third stage, we use a pooling function to modify the output of the layer further.
A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling (Zhou and Chellappa, 1988) operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.
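A minimal sketch of max pooling over a row of detector outputs follows (a NumPy illustration with made-up values); it also shows the downsampling effect obtained by spacing the pooling regions more than one unit apart, which is discussed later in this section.

```python
import numpy as np

def max_pool_1d(detector_outputs, width=3, stride=1):
    """Max pooling over a 1-D row of detector-stage outputs.

    Reports the maximum within each window of `width` units, with windows
    spaced `stride` units apart (stride=1 gives the densest pooling).
    """
    n = len(detector_outputs)
    return np.array([detector_outputs[start:start + width].max()
                     for start in range(0, n - width + 1, stride)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
print(max_pool_1d(detector, width=3, stride=1))  # dense pooling, one output per position
print(max_pool_1d(detector, width=3, stride=2))  # downsampled pooling, regions spaced two apart
```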
Figure 9.7: The components of a typical convolutional neural network layer. There are two commonly used sets of terminology for describing these layers. (Left) In this terminology, the convolutional net is viewed as a small number of relatively complex layers, with each layer having many "stages." In this terminology, there is a one-to-one mapping between kernel tensors and network layers. In this book we generally use this terminology. (Right) In this terminology, the convolutional net is viewed as a larger number of simple layers; every step of processing is regarded as a layer in its own right. This means that not every "layer" has parameters. [Diagram: in the complex-layer terminology, the input to the layer passes through a convolution stage (affine transform), a detector stage (nonlinearity, e.g., rectified linear), and a pooling stage before reaching the next layer; in the simple-layer terminology, the convolution layer, detector layer, and pooling layer are each counted as separate layers.]
In all cases, pooling helps to make the representation approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change. See figure 9.8 for an example of how this works.
Invariance to local translation can be a useful property if we care more about whether some feature is present than exactly where it is. For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy, we just need to know that there is an eye on the left side of the face and an eye on the right side of the face. In other contexts, it is more important to preserve the location of a feature. For example, if we want to find a corner defined by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether they meet.

Figure 9.8: Max pooling introduces invariance. (Top) A view of the middle of the output of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling region width of three pixels. (Bottom) A view of the same network, after the input has been shifted to the right by one pixel. Every value in the bottom row has changed, but only half of the values in the top row have changed, because the max pooling units are sensitive only to the maximum value in the neighborhood, not its exact location.
The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.
Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to (see figure 9.9).
Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, by reporting summary statistics for pooling regions spaced k pixels apart rather than one pixel apart. See figure 9.10 for an example. This improves the computational efficiency of the network because the next layer has roughly k times fewer inputs to process. When the number of parameters in the next layer is a function of its input size (such as when the next layer is fully connected and based on matrix multiplication), this reduction in the input size can also result in improved statistical efficiency and reduced memory requirements for storing the parameters.

Figure 9.9: Example of learned invariances. A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a handwritten 5. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated. We show here how the network processes two different inputs, resulting in two different detector units being activated. The effect on the pooling unit is roughly the same either way. This principle is leveraged by maxout networks (Goodfellow et al., 2013a) and other convolutional networks. Max pooling over spatial positions is naturally invariant to translation; this multichannel approach is only necessary for learning other transformations.

Figure 9.10: Pooling with downsampling. Here we use max pooling with a pool width of three and a stride between pools of two. This reduces the representation size by a factor of two, which reduces the computational and statistical burden on the next layer. Note that the rightmost pooling region has a smaller size, but must be included if we do not want to ignore some of the detector units.
For many tasks, pooling is essential for handling inputs of varying size. For example, if we want to classify images of variable size, the input to the classification layer must have a fixed size. This is usually accomplished by varying the size of an offset between pooling regions so that the classification layer always receives the same number of summary statistics regardless of the input size. For example, the final pooling layer of the network may be defined to output four sets of summary statistics, one for each quadrant of an image, regardless of the image size.
Some theoretical work gives guidance as to which kinds of pooling one should use in various situations (Boureau et al., 2010). It is also possible to dynamically pool features together, for example, by running a clustering algorithm on the locations of interesting features (Boureau et al., 2011). This approach yields a different set of pooling regions for each image. Another approach is to learn a single pooling structure that is then applied to all images (Jia et al., 2012).
Pooling can complicate some kinds of neural network architectures that use top-down information, such as Boltzmann machines and autoencoders. These issues are discussed further when we present these types of networks in part III. Pooling in convolutional Boltzmann machines is presented in section 20.6. The inverse-like operations on pooling units needed in some differentiable networks are covered in section 20.10.6.
Some examples of complete convolutional network architectures for classification using convolution and pooling are shown in figure 9.11.
9.4 Convolution and Pooling as an Infinitely Strong Prior

Recall the concept of a prior probability distribution from section 5.6. This is a probability distribution over the parameters of a model that encodes our beliefs about what models are reasonable, before we have seen any data.
Priors can be considered weak or strong depending on how concentrated the probability density in the prior is. A weak prior is a prior distribution with high entropy, such as a Gaussian distribution with high variance. Such a prior allows the data to move the parameters more or less freely. A strong prior has very low entropy, such as a Gaussian distribution with low variance. Such a prior plays a more active role in determining where the parameters end up.
Figure 9.11: Examples of architectures for classification with convolutional networks. The specific strides and depths used in this figure are not advisable for real use; they are designed to be very shallow to fit onto the page. Real convolutional networks also often involve significant amounts of branching, unlike the chain structures used here for simplicity. (Left) A convolutional network that processes a fixed image size. After alternating between convolution and pooling for a few layers, the tensor for the convolutional feature map is reshaped to flatten out the spatial dimensions. The rest of the network is an ordinary feedforward network classifier, as described in chapter 6. (Center) A convolutional network that processes a variably sized image but still maintains a fully connected section. This network uses a pooling operation with variably sized pools but a fixed number of pools, in order to provide a fixed-size vector of 576 units to the fully connected portion of the network. (Right) A convolutional network that does not have any fully connected weight layer. Instead, the last convolutional layer outputs one feature map per class. The model presumably learns a map of how likely each class is to occur at each spatial location. Averaging a feature map down to a single value provides the argument to the softmax classifier at the top. [Diagram summary: all three networks start from a 256x256x3 input image, followed by convolution + ReLU (256x256x64), pooling with stride 4 (64x64x64), and convolution + ReLU (64x64x64). The left network pools with stride 4 to 16x16x64, reshapes to a 16,384-unit vector, applies a matrix multiply to 1,000 units, and a softmax over 1,000 class probabilities. The center network pools to a fixed 3x3 grid (3x3x64), giving a 576-unit vector before the same matrix multiply and softmax. The right network pools with stride 4 to 16x16x64, applies a convolution producing 16x16x1,000 maps, average pools to 1x1x1,000, and applies the softmax.]

An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data give to those values.
We can imagine a convolutional net as being similar to a fully connected net, but with an infinitely strong prior over its weights. This infinitely strong prior says that the weights for one hidden unit must be identical to the weights of its neighbor but shifted in space. The prior also says that the weights must be zero, except for in the small, spatially contiguous receptive field assigned to that hidden unit.
Overall, we can think of the use of convolution as introducing an infinitely strong prior probability distribution over the parameters of a layer. This prior says that the function the layer should learn contains only local interactions and is equivariant to translation. Likewise, the use of pooling is an infinitely strong prior that each unit should be invariant to small translations.
Of course, implementing a convolutional net as a fully connected net with an infinitely strong prior would be extremely wasteful computationally. But thinking of a convolutional net as a fully connected net with an infinitely strong prior can give us some insights into how convolutional nets work.
One key insight is that convolution and pooling can cause underfitting. Like any prior, convolution and pooling are only useful when the assumptions made by the prior are reasonably accurate. If a task relies on preserving precise spatial information, then using pooling on all features can increase the training error. Some convolutional network architectures (Szegedy et al., 2014a) are designed to use pooling on some channels but not on other channels, in order to get both highly invariant features and features that will not underfit when the translation invariance prior is incorrect. When a task involves incorporating information from very distant locations in the input, then the prior imposed by convolution may be inappropriate.
Another key insight from this view is that we should only compare convolutional models to other convolutional models in benchmarks of statistical learning performance. Models that do not use convolution would be able to learn even if we permuted all the pixels in the image. For many image datasets, there are separate benchmarks for models that are permutation invariant and must discover the concept of topology via learning, and for models that have the knowledge of spatial relationships hardcoded into them by their designer.
9.5 Variants of the Basic Convolution Function

When discussing convolution in the context of neural networks, we usually do not refer exactly to the standard discrete convolution operation as it is usually understood in the mathematical literature. The functions used in practice differ slightly. Here we describe these differences in detail and highlight some useful properties of the functions used in neural networks.
First, when we refer to convolution in the context of neural networks, we usually actually mean an operation that consists of many applications of convolution in parallel. This is because convolution with a single kernel can extract only one kind of feature, albeit at many spatial locations. Usually we want each layer of our network to extract many kinds of features, at many locations.
Additionally, the input is usually not just a grid of real values. Rather, it is a grid of vector-valued observations. For example, a color image has a red, green and blue intensity at each pixel. In a multilayer convolutional network, the input to the second layer is the output of the first layer, which usually has the output of many different convolutions at each position. When working with images, we usually think of the input and output of the convolution as being 3-D tensors, with one index into the different channels and two indices into the spatial coordinates of each channel. Software implementations usually work in batch mode, so they will actually use 4-D tensors, with the fourth axis indexing different examples in the batch, but we will omit the batch axis in our description here for simplicity.
Because convolutional networks usually use multichannel convolution, the linear operations they are based on are not guaranteed to be commutative, even if kernel flipping is used. These multichannel operations are only commutative if each operation has the same number of output channels as input channels.
Assume we have a 4-D kernel tensor K with element K_{i,j,k,l} giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit. Assume our input consists of observed data V with element V_{i,j,k} giving the value of the input unit within channel i at row j and column k. Assume our output consists of Z with the same format as V. If Z is produced by convolving K across V without flipping K, then

Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m-1,\, k+n-1} \, K_{i,l,m,n},   (9.7)

where the summation over l, m, and n is over all values for which the tensor indexing operations inside the summation are valid. In linear algebra notation, we index into arrays using a 1 for the first entry. This necessitates the -1 in the above formula. Programming languages such as C and Python index starting from 0, rendering the above expression even simpler.
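To make the indexing concrete, the following sketch (a NumPy illustration using 0-based indexing, as the remark above suggests) implements equation 9.7 directly with loops over the output positions.

```python
import numpy as np

def multichannel_conv(V, K):
    """Equation 9.7 with 0-based indexing: Z[i,j,k] = sum_{l,m,n} V[l, j+m, k+n] K[i,l,m,n].

    V : input tensor, shape (in_channels, height, width)
    K : kernel tensor, shape (out_channels, in_channels, kernel_height, kernel_width)
    """
    out_c, in_c, kh, kw = K.shape
    _, h, w = V.shape
    Z = np.zeros((out_c, h - kh + 1, w - kw + 1))
    for i in range(out_c):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * K[i])
    return Z

V = np.random.randn(3, 8, 8)       # e.g., an RGB image
K = np.random.randn(5, 3, 3, 3)    # five output channels, 3x3 kernels
print(multichannel_conv(V, K).shape)   # (5, 6, 6)
```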
We may want to skip over some positions of the kernel to reduce the computational cost (at the expense of not extracting our features as finely). We can think of this as downsampling the output of the full convolution function. If we want to sample only every s pixels in each direction in the output, then we can define a downsampled convolution function c such that

Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} \left[ V_{l,\, (j-1)\times s+m,\, (k-1)\times s+n} \, K_{i,l,m,n} \right].   (9.8)

We refer to s as the stride of this downsampled convolution. It is also possible to define a separate stride for each direction of motion. See figure 9.12 for an illustration.
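A sketch of equation 9.8 follows (a NumPy illustration, again with 0-based indexing). The final check confirms the equivalence illustrated in figure 9.12: strided convolution produces the same values as unit-stride convolution followed by downsampling.

```python
import numpy as np

def strided_multichannel_conv(V, K, s):
    """Equation 9.8 with 0-based indexing: sample the full convolution every s pixels."""
    out_c, in_c, kh, kw = K.shape
    _, h, w = V.shape
    out_h, out_w = (h - kh) // s + 1, (w - kw) // s + 1
    Z = np.zeros((out_c, out_h, out_w))
    for i in range(out_c):
        for j in range(out_h):
            for k in range(out_w):
                Z[i, j, k] = np.sum(V[:, j * s:j * s + kh, k * s:k * s + kw] * K[i])
    return Z

V = np.random.randn(3, 9, 9)
K = np.random.randn(4, 3, 3, 3)
dense = strided_multichannel_conv(V, K, s=1)
strided = strided_multichannel_conv(V, K, s=2)
print(np.allclose(strided, dense[:, ::2, ::2]))  # True: stride = convolve, then downsample
```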
One essential feature of any convolutional network implementation is the ability to implicitly zero pad the input V to make it wider. Without this feature, the width of the representation shrinks by one pixel less than the kernel width at each layer. Zero padding the input allows us to control the kernel width and the size of the output independently. Without zero padding, we are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels; both scenarios significantly limit the expressive power of the network. See figure 9.13 for an example.
Three special cases of the zero-padding setting are worth mentioning. One is the extreme case in which no zero padding is used whatsoever, and the convolution kernel is allowed to visit only positions where the entire kernel is contained entirely within the image. In MATLAB terminology, this is called valid convolution. In this case, all pixels in the output are a function of the same number of pixels in the input, so the behavior of an output pixel is somewhat more regular. However, the size of the output shrinks at each layer. If the input image has width m and the kernel has width k, the output will be of width m - k + 1. The rate of this shrinkage can be dramatic if the kernels used are large. Since the shrinkage is greater than 0, it limits the number of convolutional layers that can be included in the network. As layers are added, the spatial dimension of the network will eventually drop to 1 × 1, at which point additional layers cannot meaningfully be considered convolutional.
Another special case of the zero-padding setting is when just enough zero padding is added to keep the size of the output equal to the size of the input. MATLAB calls this same convolution. In this case, the network can contain as many convolutional layers as the available hardware can support, since the operation of convolution does not modify the architectural possibilities available to the next layer. The input pixels near the border, however, influence fewer output pixels than the input pixels near the center. This can make the border pixels somewhat underrepresented in the model. This motivates the other extreme case, which MATLAB refers to as full convolution, in which enough zeros are added for every pixel to be visited k times in each direction, resulting in an output image of width m + k - 1. In this case, the output pixels near the border are a function of fewer pixels than the output pixels near the center. This can make it difficult to learn a single kernel that performs well at all positions in the convolutional feature map. Usually the optimal amount of zero padding (in terms of test set classification accuracy) lies somewhere between "valid" and "same" convolution.

Figure 9.12: Convolution with a stride. In this example, we use a stride of two. (Top) Convolution with a stride length of two implemented in a single operation. (Bottom) Convolution with a stride greater than one pixel is mathematically equivalent to convolution with unit stride followed by downsampling. Obviously, the two-step approach involving downsampling is computationally wasteful, because it computes many values that are then discarded.

Figure 9.13: The effect of zero padding on network size. Consider a convolutional network with a kernel of width six at every layer. In this example, we do not use any pooling, so only the convolution operation itself shrinks the network size. (Top) In this convolutional network, we do not use any implicit zero padding. This causes the representation to shrink by five pixels at each layer. Starting from an input of sixteen pixels, we are only able to have three convolutional layers, and the last layer does not ever move the kernel, so arguably only two of the layers are truly convolutional. The rate of shrinking can be mitigated by using smaller kernels, but smaller kernels are less expressive, and some shrinking is inevitable in this kind of architecture. (Bottom) By adding five implicit zeros to each layer, we prevent the representation from shrinking with depth. This allows us to make an arbitrarily deep convolutional network.
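The three padding schemes can be summarized by their effect on the output width. The helper below is a small illustration (unit stride assumed) of the 1-D output width in each case.

```python
def conv_output_width(input_width, kernel_width, padding):
    """Output width of a 1-D convolution under the three zero-padding schemes above."""
    if padding == "valid":     # no padding: kernel must fit entirely inside the input
        return input_width - kernel_width + 1
    if padding == "same":      # pad so the output matches the input size
        return input_width
    if padding == "full":      # pad so every input pixel is visited kernel_width times
        return input_width + kernel_width - 1
    raise ValueError(padding)

for mode in ("valid", "same", "full"):
    print(mode, conv_output_width(input_width=16, kernel_width=6, padding=mode))
# valid 11, same 16, full 21
```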
In some cases, we do not actually want to use convolution, but want to use locally connected layers instead (LeCun, 1986, 1989). In this case, the adjacency matrix in the graph of our MLP is the same, but every connection has its own weight, specified by a 6-D tensor W. The indices into W are respectively: i, the output channel; j, the output row; k, the output column; l, the input channel; m, the row offset within the input; and n, the column offset within the input. The linear part of a locally connected layer is then given by

Z_{i,j,k} = \sum_{l,m,n} \left[ V_{l,\, j+m-1,\, k+n-1} \, W_{i,j,k,l,m,n} \right].   (9.9)

This is sometimes also called unshared convolution, because it is a similar operation to discrete convolution with a small kernel, but without sharing parameters across locations. Figure 9.14 compares local connections, convolution, and full connections.
Locally connected layers are useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space. For example, if we want to tell if an image is a picture of a face, we only need to look for the mouth in the bottom half of the image.
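The sketch below (a NumPy illustration with 0-based indexing) implements the linear part of a locally connected layer from equation 9.9; the only difference from the earlier multichannel convolution sketch is that the weight tensor carries separate kernel entries for every output location.

```python
import numpy as np

def locally_connected(V, W):
    """Equation 9.9 with 0-based indexing: like convolution, but with a separate
    kernel at every output location (no parameter sharing).

    V : input, shape (in_channels, height, width)
    W : weights, shape (out_channels, out_height, out_width, in_channels, kh, kw)
    """
    out_c, out_h, out_w, in_c, kh, kw = W.shape
    Z = np.zeros((out_c, out_h, out_w))
    for i in range(out_c):
        for j in range(out_h):
            for k in range(out_w):
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * W[i, j, k])
    return Z

V = np.random.randn(3, 6, 6)
W = np.random.randn(2, 4, 4, 3, 3, 3)   # a different 3x3 kernel at each of the 4x4 output positions
print(locally_connected(V, W).shape)    # (2, 4, 4)
```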
It can also be useful to make versions of convolution or locally connected layers in which the connectivity is further restricted, for example to constrain each output channel to be a function of only a subset of the input channels. A common way to do this is to make the first group of output channels connect to only the first group of input channels, the second group of output channels connect to only the second group of input channels, and so on. See figure 9.15 for an example. Modeling interactions between few channels allows the network to have fewer parameters, reducing memory consumption, increasing statistical efficiency, and reducing the amount of computation needed to perform forward and back-propagation. It accomplishes these goals without reducing the number of hidden units.
Tiled convolution (Gregor and LeCun, 2010a; Le et al., 2010) offers a compromise between a convolutional layer and a locally connected layer. Rather than learning a separate set of weights at every spatial location, we learn a set of kernels that we rotate through as we move through space. This means that immediately neighboring locations will have different filters, as in a locally connected layer, but the memory requirements for storing the parameters will increase only by a factor of the size of this set of kernels, rather than by the size of the entire output feature map. See figure 9.16 for a comparison of locally connected layers, tiled convolution, and standard convolution.
Figure 9.14: Comparison of local connections, convolution, and full connections. (Top) A locally connected layer with a patch size of two pixels. Each edge is labeled with a unique letter to show that each edge is associated with its own weight parameter. (Center) A convolutional layer with a kernel width of two pixels. This model has exactly the same connectivity as the locally connected layer. The difference lies not in which units interact with each other, but in how the parameters are shared. The locally connected layer has no parameter sharing. The convolutional layer uses the same two weights repeatedly across the entire input, as indicated by the repetition of the letters labeling each edge. (Bottom) A fully connected layer resembles a locally connected layer in the sense that each edge has its own parameter (there are too many to label explicitly with letters in this diagram). It does not, however, have the restricted connectivity of the locally connected layer.

To define tiled convolution algebraically, let K be a 6-D tensor, where two of the dimensions correspond to different locations in the output map. Rather than having a separate index for each location in the output map, output locations cycle through a set of t different choices of kernel stack in each direction. If t is equal to the output width, this is the same as a locally connected layer.

Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m-1,\, k+n-1} \, K_{i,l,m,n,\, j\%t+1,\, k\%t+1},   (9.10)

where % is the modulo operation, with t % t = 0, (t + 1) % t = 1, and so on. It is straightforward to generalize this equation to use a different tiling range for each dimension.
Figure 9.15: A convolutional network with the first two output channels connected to only the first two input channels, and the second two output channels connected to only the second two input channels.

Figure 9.16: A comparison of locally connected layers, tiled convolution, and standard convolution. All three have the same sets of connections between units, when the same size of kernel is used. This diagram illustrates the use of a kernel that is two pixels wide. The differences between the methods lie in how they share parameters. (Top) A locally connected layer has no sharing at all. We indicate that each connection has its own weight by labeling each connection with a unique letter. (Center) Tiled convolution has a set of t different kernels. Here we illustrate the case of t = 2. One of these kernels has edges labeled "a" and "b," while the other has edges labeled "c" and "d." Each time we move one pixel to the right in the output, we move on to using a different kernel. This means that, like the locally connected layer, neighboring units in the output have different parameters. Unlike the locally connected layer, after we have gone through all t available kernels, we cycle back to the first kernel. If two output units are separated by a multiple of t steps, then they share parameters. (Bottom) Traditional convolution is equivalent to tiled convolution with t = 1. There is only one kernel, and it is applied everywhere, as indicated in the diagram by using the kernel with weights labeled "a" and "b" everywhere.
Locally connected layers and tiled convolutional layers both have an interesting interaction with max pooling: the detector units of these layers are driven by different filters. If these filters learn to detect different transformed versions of the same underlying features, then the max-pooled units become invariant to the learned transformation (see figure 9.9). Convolutional layers are hard coded to be invariant specifically to translation.
Other operations besides convolution are usually necessary to implement a convolutional network. To perform learning, one must be able to compute the gradient with respect to the kernel, given the gradient with respect to the outputs. In some simple cases, this operation can be performed using the convolution operation, but many cases of interest, including the case of stride greater than 1, do not have this property.
Recall that convolution is a linear operation and can thus be described as a matrix multiplication (if we first reshape the input tensor into a flat vector). The matrix involved is a function of the convolution kernel. The matrix is sparse, and each element of the kernel is copied to several elements of the matrix. This view helps us to derive some of the other operations needed to implement a convolutional network.
Multiplication by the transpose of the matrix defined by convolution is one such operation. This is the operation needed to back-propagate error derivatives through a convolutional layer, so it is needed to train convolutional networks that have more than one hidden layer. This same operation is also needed if we wish to reconstruct the visible units from the hidden units (Simard et al., 1992). Reconstructing the visible units is an operation commonly used in the models described in part III of this book, such as autoencoders, RBMs, and sparse coding. Transpose convolution is necessary to construct convolutional versions of those models. Like the kernel gradient operation, this input gradient operation can in some cases be implemented using a convolution, but in general it requires a third operation to be implemented. Care must be taken to coordinate this transpose operation with the forward propagation.
The size of the output that the transpose operation should return depends on the zero-padding policy and stride of the forward propagation operation, as well as the size of the forward propagation's output map. In some cases, multiple sizes of input to forward propagation can result in the same size of output map, so the transpose operation must be explicitly told what the size of the original input was.
These three operations (convolution, backprop from output to weights, and backprop from output to inputs) are sufficient to compute all the gradients needed to train any depth of feedforward convolutional network, as well as to train convolutional networks with reconstruction functions based on the transpose of convolution. See Goodfellow (2010) for a full derivation of the equations in the fully general multidimensional, multiexample case. To give a sense of how these equations work, we present the two-dimensional, single example version here.
Suppose we want to train a convolutional network that incorporates strided convolution of a kernel stack K applied to a multichannel image V with stride s, as defined by c(K, V, s) in equation 9.8. Suppose we want to minimize some loss function J(V, K). During forward propagation, we will need to use c itself to output Z, which is then propagated through the rest of the network and used to compute the cost function J. During back-propagation, we will receive a tensor G such that G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K).
the
netw
ork,
we
need
to
compute
the
deriv
atives
with
resp
ect
to
the
eigh
ts
in
the
kernel.
do
so,
we
can
use
function
i,j,k,l
i,j,k,l
) =
m,n
i,m,n
j,
1)
k,
1)
(9.11)
If this layer is not the bottom layer of the network, we will need to compute the gradient with respect to V to back-propagate the error further down. To do so, we can use a function

h(K, G, s)_{i,j,k} = \frac{\partial}{\partial V_{i,j,k}} J(V, K)   (9.12)
= \sum_{\substack{l,m \\ \text{s.t.}\ (l-1)\times s+m = j}} \ \sum_{\substack{n,p \\ \text{s.t.}\ (n-1)\times s+p = k}} \ \sum_q K_{q,i,m,p} \, G_{q,l,n}.   (9.13)
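The kernel-gradient function g of equation 9.11 can be checked against a finite-difference approximation. The sketch below is a NumPy illustration with 0-based indexing; the loss J = (1/2) sum(Z^2) is made up so that the incoming gradient G is simply Z.

```python
import numpy as np

def strided_conv(V, K, s):
    """Forward pass c(K, V, s) from equation 9.8, 0-based indexing."""
    out_c, in_c, kh, kw = K.shape
    _, h, w = V.shape
    out_h, out_w = (h - kh) // s + 1, (w - kw) // s + 1
    Z = np.zeros((out_c, out_h, out_w))
    for i in range(out_c):
        for j in range(out_h):
            for k in range(out_w):
                Z[i, j, k] = np.sum(V[:, j * s:j * s + kh, k * s:k * s + kw] * K[i])
    return Z

def kernel_gradient(G, V, s, kernel_shape):
    """Equation 9.11: gradient of the loss with respect to the kernel,
    given G[i, m, n] = dJ/dZ[i, m, n]."""
    dK = np.zeros(kernel_shape)
    out_c, in_c, kh, kw = kernel_shape
    for i in range(out_c):
        for m in range(G.shape[1]):
            for n in range(G.shape[2]):
                dK[i] += G[i, m, n] * V[:, m * s:m * s + kh, n * s:n * s + kw]
    return dK

rng = np.random.default_rng(0)
V = rng.standard_normal((2, 7, 7))
K = rng.standard_normal((3, 2, 3, 3))
s = 2
G = strided_conv(V, K, s)                 # for J = (1/2) sum(Z**2), dJ/dZ = Z
analytic = kernel_gradient(G, V, s, K.shape)
eps, idx = 1e-6, (1, 0, 2, 1)
K_plus = K.copy(); K_plus[idx] += eps
numeric = (np.sum(strided_conv(V, K_plus, s) ** 2) - np.sum(strided_conv(V, K, s) ** 2)) / (2 * eps)
print(np.isclose(analytic[idx], numeric))  # True
```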
Autoencoder networks, described in chapter 14, are feedforward networks trained to copy their input to their output. A simple example is the PCA algorithm, which copies its input x to an approximate reconstruction r using the function W^T W x. It is common for more general autoencoders to use multiplication by the transpose of the weight matrix just as PCA does. To make such models convolutional, we can use the function h to perform the transpose of the convolution operation. Suppose we have hidden units H in the same format as Z and we define a reconstruction

R = h(K, H, s).   (9.14)

To train the autoencoder, we will receive the gradient with respect to R as a tensor E. To train the decoder, we need to obtain the gradient with respect to K. This is given by g(H, E, s). To train the encoder, we need to obtain the gradient with respect to H. This is given by c(K, E, s). It is also possible to differentiate through g using c and h, but these operations are not needed for the back-propagation algorithm on any standard network architectures.
Generally, we do not use only a linear operation to transform from the inputs to the outputs in a convolutional layer. We generally also add some bias term to each output before applying the nonlinearity. This raises the question of how to share parameters among the biases. For locally connected layers, it is natural to give each unit its own bias, and for tiled convolution, it is natural to share the biases with the same tiling pattern as the kernels. For convolutional layers, it is typical to have one bias per channel of the output and share it across all locations within each convolution map. If the input is of known, fixed size, however, it is also possible to learn a separate bias at each location of the output map. Separating the biases may slightly reduce the statistical efficiency of the model, but it allows the model to correct for differences in the image statistics at different locations. For example, when using implicit zero padding, detector units at the edge of the image receive less total input and may need larger biases.
9.6 Structured Outputs

Convolutional networks can be used to output a high-dimensional structured object, rather than just predicting a class label for a classification task or a real value for a regression task. Typically this object is just a tensor, emitted by a standard convolutional layer. For example, the model might emit a tensor S, where S_{i,j,k} is the probability that pixel (j, k) of the input to the network belongs to class i. This allows the model to label every pixel in an image and draw precise masks that follow the outlines of individual objects.
One issue that often comes up is that the output plane can be smaller than the input plane, as shown in figure 9.13. In the kinds of architectures typically used for classification of a single object in an image, the greatest reduction in the spatial dimensions of the network comes from using pooling layers with large stride. To produce an output map of similar size as the input, one can avoid pooling altogether (Jain et al., 2007). Another strategy is to simply emit a lower-resolution grid of labels (Pinheiro and Collobert, 2014, 2015). Finally, in principle, one could use a pooling operator with unit stride.
One strategy for pixel-wise labeling of images is to produce an initial guess of the image labels, then refine this initial guess using the interactions between neighboring pixels. Repeating this refinement step several times corresponds to using the same convolutions at each stage, sharing weights between the last layers of the deep net (Jain et al., 2007). This makes the sequence of computations performed by the successive convolutional layers with weights shared across layers a particular kind of recurrent network (Pinheiro and Collobert, 2014, 2015). Figure 9.17 shows the architecture of such a recurrent convolutional network.

Figure 9.17: An example of a recurrent convolutional network for pixel labeling. The input is an image tensor, with axes corresponding to image rows, image columns, and channels (red, green, blue). The goal is to output a tensor of labels, with a probability distribution over labels for each pixel. This tensor has axes corresponding to image rows, image columns, and the different classes. Rather than outputting the labels in a single shot, the recurrent network iteratively refines its estimate by using the previous estimate of the labels as input for creating a new estimate. The same parameters are used for each updated estimate, and the estimate can be refined as many times as we wish. One tensor of convolution kernels is used on each step to compute the hidden representation given the input image. Another kernel tensor is used to produce an estimate of the labels given the hidden values. On all but the first step, a third set of kernels is convolved over the previous label estimate to provide additional input to the hidden layer. On the first time step, this term is replaced by zero. Because the same parameters are used on each step, this is an example of a recurrent network, as described in chapter 10.
Once a prediction for each pixel is made, various methods can be used to further process these predictions to obtain a segmentation of the image into regions (Briggman et al., 2009; Turaga et al., 2010; Farabet et al., 2013). The general idea is to assume that large groups of contiguous pixels tend to be associated with the same label. Graphical models can describe the probabilistic relationships between neighboring pixels. Alternatively, the convolutional network can be trained to maximize an approximation of the graphical model training objective (Ning et al., 2005; Thompson et al., 2014).
9.7 Data Types

The data used with a convolutional network usually consist of several channels, each channel being the observation of a different quantity at some point in space or time. See table 9.1 for examples of data types with different dimensionalities and number of channels. For an example of convolutional networks applied to video, see Chen et al. (2010).
).
So
far
ha
discussed
only
the
case
where
ev
ery
example
in
the
train
and
test
data
has
the
same
spatial
dimensions.
One
adv
antage
to
conv
olutional
netw
orks
is
that
they
can
also
pro
cess
inputs
with
arying
spatial
extents.
These
kinds
of
input
simply
cannot
represented
by
traditional,
matrix
multiplication-based
neural
netw
orks.
This
provides
comp
elling
reason
to
use
conv
olutional
netw
orks
ev
en
when
computational
cost
and
ov
erfitting
are
not
significan
issues.
or
example,
consider
collection
of
images
in
whic
each
image
has
differen
width
and
height.
It
is
unclear
ho
to
mo
del
suc
inputs
with
weigh
matrix
of
fixed
size.
Conv
olution
is
straightforw
ard
to
apply;
the
kernel
is
simply
applied
differen
um
er
of
times
dep
ending
on
the
size
of
the
input,
and
the
output
of
the
con
volutio
op
eration
scales
accordingly
Conv
olution
ma
viewed
as
matrix
ultiplication;
the
same
con
volution
ernel
induces
different
size
of
doubly
blo
circulan
matrix
for
each
size
of
input.
Sometimes
the
output
of
the
netw
ork
as
ell
as
the
input
is
allo
wed
to
ha
ve
ariable
size,
for
example,
if
we
wan
to
assign
class
label
to
eac
pixel
of
the
input.
In
this
case,
no
further
design
ork
is
necessary
In
other
cases,
the
netw
ork
must
pro
duce
some
fixed-size
output,
for
example,
if
we
wan
to
assign
single
class
lab
el
to
the
en
tire
image.
In
this
case,
must
mak
some
additional
design
steps,
like
inserting
ooling
la
yer
whose
ooling
regions
scale
in
size
prop
ortional
to
the
size
of
the
input,
to
main
tain
fixed
num
ber
of
ooled
outputs.
Some
examples
of
this
kind
of
strategy
are
shown
in
figure
9.11
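As a small illustration (not an example from the text), the following NumPy/SciPy snippet applies one fixed kernel to images of several different sizes; the weights never change, only the output size does.

```python
import numpy as np
from scipy.signal import convolve2d

kernel = np.random.randn(3, 3)                 # one fixed set of weights
for shape in [(32, 48), (100, 80), (7, 200)]:  # images of varying width and height
    image = np.random.randn(*shape)
    output = convolve2d(image, kernel, mode="valid")
    print(shape, "->", output.shape)           # e.g. (32, 48) -> (30, 46)
```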
Note that the use of convolution for processing variably sized inputs makes sense only for inputs that have variable size because they contain varying amounts of observation of the same kind of thing: different lengths of recordings over time, different widths of observations over space, and so forth. Convolution does not make sense if the input has variable size because it can optionally include different kinds of observations. For example, if we are processing college applications, and our features consist of both grades and standardized test scores, but not every applicant took the standardized test, then it does not make sense to convolve the same weights over the features corresponding to the grades as well as the features corresponding to the test scores.
1-D, single channel. Audio waveform: the axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step.

1-D, multichannel. Skeleton animation data: animations of 3-D computer-rendered characters are generated by altering the pose of a "skeleton" over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character's skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint.

2-D, single channel. Audio data that has been preprocessed with a Fourier transform: we can transform the audio waveform into a 2-D tensor with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution in the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation, but at a different height in the network's output.

2-D, multichannel. Color image data: one channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and the vertical axes of the image, conferring translation equivariance in both directions.

3-D, single channel. Volumetric data: a common source of this kind of data is medical imaging technology, such as CT scans.

3-D, multichannel. Color video data: one axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.

Table 9.1: Examples of different formats of data that can be used with convolutional networks.
9.8 Efficient Convolution Algorithms
Modern convolutional network applications often involve networks containing more than one million units. Powerful implementations exploiting parallel computation resources, as discussed in section 12.1, are essential. In many cases, however, it is also possible to speed up convolution by selecting an appropriate convolution algorithm.
Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.
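The following sketch (illustrative NumPy code, not an implementation from the text) checks this equivalence for 1-D signals: zero-pad, transform, multiply point-wise, and transform back; the result matches naive discrete convolution up to floating-point error.

```python
import numpy as np

def fft_convolve_1d(x, k):
    """Convolve x with k via the frequency domain (convolution theorem)."""
    n = len(x) + len(k) - 1          # pad so circular convolution equals linear convolution
    X = np.fft.rfft(x, n)            # Fourier transform of the (padded) input
    K = np.fft.rfft(k, n)            # Fourier transform of the (padded) kernel
    return np.fft.irfft(X * K, n)    # point-wise product, then inverse transform

x = np.random.randn(1000)
k = np.random.randn(32)
assert np.allclose(fft_convolve_1d(x, k), np.convolve(x, k))
```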
When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. When the kernel is separable, naive convolution is inefficient. It is equivalent to compose d one-dimensional convolutions with each of these vectors. The composed approach is significantly faster than performing one d-dimensional convolution with their outer product. The kernel also takes fewer parameters to represent as vectors. If the kernel is w elements wide in each dimension, then naive multidimensional convolution requires O(w^d) runtime and parameter storage space, while separable convolution requires O(w * d) runtime and parameter storage space. Of course, not every convolution can be represented in this way.
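A minimal sketch of the savings from separability (illustrative NumPy code; SciPy is used only to verify the result): convolving with the two 1-D vectors reproduces convolution with their outer product.

```python
import numpy as np
from scipy.signal import convolve2d   # used only to check the separable result

def separable_convolve2d(image, v, h):
    """Convolution with the kernel np.outer(v, h) done as two 1-D passes."""
    rows = np.apply_along_axis(np.convolve, 1, image, h)  # 1-D convolution along each row
    return np.apply_along_axis(np.convolve, 0, rows, v)   # then along each column

image = np.random.randn(64, 64)
v, h = np.random.randn(5), np.random.randn(5)
assert np.allclose(separable_convolve2d(image, v, h),
                   convolve2d(image, np.outer(v, h)))     # same result, fewer multiplies
```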
Devising faster ways of performing convolution or approximate convolution without harming the accuracy of the model is an active area of research. Even techniques that improve the efficiency of only forward propagation are useful because, in the commercial setting, it is typical to devote more resources to deployment of a network than to its training.
9.9 Random or Unsupervised Features
Typically, the most expensive part of convolutional network training is learning the features. The output layer is usually relatively inexpensive because of the small number of features provided as input to this layer after passing through several layers of pooling. When performing supervised training with gradient descent, every gradient step requires a complete run of forward propagation and backward propagation through the entire network.
One way to reduce the cost of convolutional network training is to use features that are not trained in a supervised fashion. There are three basic strategies for obtaining convolution kernels without supervised training. One is to simply initialize them randomly. Another is to design them by hand, for example, by setting each kernel to detect edges at a certain orientation or scale. Finally, one can learn the kernels with an unsupervised criterion. For example, Coates et al. (2011) apply k-means clustering to small image patches, then use each learned centroid as a convolution kernel. In part III, we describe many more unsupervised learning approaches.
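A rough sketch of the patch-clustering idea follows. The function name, the grayscale input assumption, the patch size, the normalization, and the use of scikit-learn's KMeans are choices made here for illustration, not details taken from Coates et al. (2011).

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_kernels(images, patch_size=6, num_kernels=64, num_patches=10000, seed=0):
    """Cluster random grayscale image patches and return the centroids,
    reshaped so that each one can be used as a convolution kernel.

    images: array of shape (num_images, rows, cols); all settings are illustrative."""
    rng = np.random.default_rng(seed)
    n, rows, cols = images.shape
    patches = np.empty((num_patches, patch_size * patch_size))
    for p in range(num_patches):
        i = rng.integers(n)
        r = rng.integers(rows - patch_size + 1)
        c = rng.integers(cols - patch_size + 1)
        patches[p] = images[i, r:r + patch_size, c:c + patch_size].ravel()
    patches -= patches.mean(axis=1, keepdims=True)    # simple per-patch normalization
    centroids = KMeans(n_clusters=num_kernels, n_init=10).fit(patches).cluster_centers_
    return centroids.reshape(num_kernels, patch_size, patch_size)
```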
Learning the features with an unsupervised criterion allows them to be determined separately from the classifier layer at the top of the architecture. One can then extract the features for the entire training set just once, essentially constructing a new training set for the last layer. Learning the last layer is then typically a convex optimization problem, assuming the last layer is something like logistic regression or an SVM.
Random filters often work surprisingly well in convolutional networks (Jarrett et al., 2009; Saxe et al., 2011; Pinto et al., 2011; Cox and Pinto, 2011). Saxe et al. (2011) showed that layers consisting of convolution followed by pooling naturally become frequency selective and translation invariant when assigned random weights. They argue that this provides an inexpensive way to choose the architecture of a convolutional network: first, evaluate the performance of several convolutional network architectures by training only the last layer; then take the best of these architectures and train the entire architecture using a more expensive approach.
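A minimal sketch of this cheap evaluation strategy is shown below (illustrative code: fixed random kernels followed by ReLU and average pooling stand in for one candidate architecture, and scikit-learn's LogisticRegression plays the role of the trained last layer).

```python
import numpy as np
from scipy.signal import convolve
from sklearn.linear_model import LogisticRegression

def avg_pool(m, pool):
    """Non-overlapping average pooling of a 2-D feature map."""
    h, w = (m.shape[0] // pool) * pool, (m.shape[1] // pool) * pool
    return m[:h, :w].reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))

def random_conv_features(images, num_kernels=8, kernel_size=5, pool=4, seed=0):
    """Fixed random convolution kernels, ReLU, then average pooling (no learning)."""
    rng = np.random.default_rng(seed)
    kernels = rng.standard_normal((num_kernels, kernel_size, kernel_size))
    return np.array([
        np.concatenate([
            avg_pool(np.maximum(convolve(img, k, mode="valid"), 0), pool).ravel()
            for k in kernels])
        for img in images])

def score_architecture(train_x, train_y, test_x, test_y, **arch):
    """Cheap score for one candidate architecture: train only the last (linear) layer."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(random_conv_features(train_x, **arch), train_y)
    return clf.score(random_conv_features(test_x, **arch), test_y)
```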
An intermediate approach is to learn the features, but using methods that do not require full forward and back-propagation at every gradient step. As with multilayer perceptrons, we use greedy layer-wise pretraining, to train the first layer in isolation, then extract all features from the first layer only once, then train the second layer in isolation given those features, and so on.
In chapter 8, we described how to perform supervised greedy layer-wise pretraining, and in part III we extend this to greedy layer-wise pretraining using an unsupervised criterion at each layer. The canonical example of greedy layer-wise pretraining of a convolutional model is the convolutional deep belief network (Lee et al., 2009).
Convolutional networks offer us the opportunity to take the pretraining strategy one step further than is possible with multilayer perceptrons. Instead of training an entire convolutional layer at a time, we can train a model of a small patch, as Coates et al. (2011) do with k-means. We can then use the parameters from this patch-based model to define the kernels of a convolutional layer. This means that it is possible to use unsupervised learning to train a convolutional network without ever using convolution during the training process. Using this approach, we can train very large models and incur a high computational cost only at inference time (Ranzato et al., 2007b; Jarrett et al., 2009; Kavukcuoglu et al., 2010; Coates et al., 2013).
This approach was popular from roughly 2007 to 2013, when labeled datasets were small and computational power was more limited. Today, most convolutional networks are trained in a purely supervised fashion, using full forward and back-propagation through the entire network on each training iteration.
As with other approaches to unsupervised pretraining, it remains difficult to tease apart the cause of some of the benefits seen with this approach. Unsupervised pretraining may offer some regularization relative to supervised training, or it may simply allow us to train much larger architectures because of the reduced computational cost of the learning rule.
9.10 The Neuroscientific Basis for Convolutional Networks
Convolutional networks are perhaps the greatest success story of biologically inspired artificial intelligence. Though convolutional networks have been guided by many other fields, some of the key design principles of neural networks were drawn from neuroscience.

The history of convolutional networks begins with neuroscientific experiments long before the relevant computational models were developed. Neurophysiologists David Hubel and Torsten Wiesel collaborated for several years to determine many of the most basic facts about how the mammalian vision system works (Hubel and Wiesel, 1959, 1962, 1968). Their accomplishments were eventually recognized with a Nobel prize. Their findings that have had the greatest influence on contemporary deep learning models were based on recording the activity of individual neurons in cats. They observed how neurons in the cat's brain responded to images projected in precise locations on a screen in front of the cat. Their great discovery was that neurons in the early visual system responded most strongly to very specific patterns of light, such as precisely oriented bars, but responded hardly at all to other patterns.

Their work helped to characterize many aspects of brain function that are beyond the scope of this book.
From the point of view of deep learning, we can focus on a simplified, cartoon view of brain function. In this simplified view, we focus on a part of the brain called V1, also known as the primary visual cortex. V1 is the first area of the brain that begins to perform significantly advanced processing of visual input. In this cartoon view, images are formed by light arriving in the eye and stimulating the retina, the light-sensitive tissue in the back of the eye. The neurons in the retina perform some simple preprocessing of the image but do not substantially alter the way it is represented. The image then passes through the optic nerve and a brain region called the lateral geniculate nucleus. The main role, as far as we are concerned here, of both of these anatomical regions is primarily just to carry the signal from the eye to V1, which is located at the back of the head.
A convolutional network layer is designed to capture three properties of V1:
1. V1 is arranged in a spatial map. It actually has a two-dimensional structure, mirroring the structure of the image in the retina. For example, light arriving at the lower half of the retina affects only the corresponding half of V1. Convolutional networks capture this property by having their features defined in terms of two-dimensional maps.
2. V1 contains many simple cells. A simple cell's activity can to some extent be characterized by a linear function of the image in a small, spatially localized receptive field. The detector units of a convolutional network are designed to emulate these properties of simple cells.
3. V1 also contains many complex cells. These cells respond to features that are similar to those detected by simple cells, but complex cells are invariant to small shifts in the position of the feature. This inspires the pooling units of convolutional networks. Complex cells are also invariant to some changes in lighting that cannot be captured simply by pooling over spatial locations. These invariances have inspired some of the cross-channel pooling strategies in convolutional networks, such as maxout units (Goodfellow et al., 2013a).
Though we know the most about V1, it is generally believed that the same basic principles apply to other areas of the visual system. In our cartoon view of the visual system, the basic strategy of detection followed by pooling is repeatedly applied as we move deeper into the brain. As we pass through multiple anatomical layers of the brain, we eventually find cells that respond to some specific concept and are invariant to many transformations of the input. These cells have been nicknamed "grandmother cells": the idea is that a person could have a neuron that activates when seeing an image of their grandmother, regardless of whether she appears in the left or right side of the image, whether the image is a close-up of her face or a zoomed-out shot of her entire body, whether she is brightly lit or in shadow, and so on.
These grandmother cells have been shown to actually exist in the human brain, in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers tested whether individual neurons would respond to photos of famous individuals. They found what has come to be called the "Halle Berry neuron": an individual neuron that is activated by the concept of Halle Berry. This neuron fires when a person sees a photo of Halle Berry, a drawing of Halle Berry, or even text containing the words "Halle Berry." Of course, this has nothing to do with Halle Berry herself; other neurons responded to the presence of Bill Clinton, Jennifer Aniston, and so forth. These medial temporal lobe neurons are somewhat more general than modern convolutional networks, which would not automatically generalize to identifying a person or object when reading its name.
The closest analog to a convolutional network's last layer of features is a brain area called the inferotemporal cortex (IT). When viewing an object, information flows from the retina, through the LGN, to V1, then onward to V2, then V4, then IT. This happens within the first 100ms of glimpsing an object. If a person is allowed to continue looking at the object for more time, then information will begin to flow backward as the brain uses top-down feedback to update the activations in the lower level brain areas. If we interrupt the person's gaze, however, and observe only the firing rates that result from the first 100ms of mostly feedforward activation, then IT proves to be similar to a convolutional network. Convolutional networks can predict IT firing rates and perform similarly to (time-limited) humans on object recognition tasks (DiCarlo, 2013).
That being said, there are many differences between convolutional networks and the mammalian vision system. Some of these differences are well known to computational neuroscientists but outside the scope of this book. Some of these differences are not yet known, because many basic questions about how the mammalian vision system works remain unanswered. As a brief list:
• The human eye is mostly very low resolution, except for a tiny patch called the fovea. The fovea only observes an area about the size of a thumbnail held at arm's length. Though we feel as if we can see an entire scene in high resolution, this is an illusion created by the subconscious part of our brain, as it stitches together several glimpses of small areas. Most convolutional networks actually receive large full-resolution photographs as input.
• The human brain makes several eye movements called saccades to glimpse the most visually salient or task-relevant parts of a scene. Incorporating similar attention mechanisms into deep learning models is an active research direction. In the context of deep learning, attention mechanisms have been most successful for natural language processing, as described in section 12.4.5.1. Several visual models with foveation mechanisms have been developed but so far have not become the dominant approach (Larochelle and Hinton, 2010; Denil et al., 2012).
• The human visual system is integrated with many other senses, such as hearing, and factors like our moods and thoughts. Convolutional networks so far are purely visual.
• The human visual system does much more than just recognize objects. It is able to understand entire scenes, including many objects and relationships between objects, and it processes rich 3-D geometric information needed for our bodies to interface with the world. Convolutional networks have been applied to some of these problems, but these applications are in their infancy.
• Even simple brain areas like V1 are heavily affected by feedback from higher levels. Feedback has been explored extensively in neural network models but has not yet been shown to offer a compelling improvement.
• While feedforward IT firing rates capture much of the same information as convolutional network features, it is not clear how similar the intermediate computations are.
• The brain probably uses very different activation and pooling functions. An individual neuron's activation probably is not well characterized by a single linear filter response. A recent model of V1 involves multiple quadratic filters for each neuron (Rust et al., 2005). Indeed, our cartoon picture of "simple cells" and "complex cells" might create a nonexistent distinction; simple cells and complex cells might both be the same kind of cell but with their "parameters" enabling a continuum of behaviors ranging from what we call "simple" to what we call "complex."
It is also worth mentioning that neuroscience has told us relatively little about how to train convolutional networks. Model structures with parameter sharing across multiple spatial locations date back to early connectionist models of vision (Marr and Poggio, 1976), but these models did not use the modern back-propagation algorithm and gradient descent. For example, the neocognitron (Fukushima, 1980) incorporated most of the model architecture design elements of the modern convolutional network but relied on a layer-wise unsupervised clustering algorithm.
Lang and Hinton (1988) introduced the use of back-propagation to train time-delay neural networks (TDNNs). To use contemporary terminology, TDNNs are one-dimensional convolutional networks applied to time series. Back-propagation applied to these models was not inspired by any neuroscientific observation and is considered by some to be biologically implausible. Following the success of back-propagation-based training of TDNNs, LeCun et al. (1989) developed the modern convolutional network by applying the same training algorithm to 2-D convolution applied to images.
So far we have described how simple cells are roughly linear and selective for certain features, complex cells are more nonlinear and become invariant to some transformations of these simple cell features, and stacks of layers that alternate between selectivity and invariance can yield grandmother cells for specific phenomena. We have not yet described precisely what these individual cells detect. In a deep nonlinear network, it can be difficult to understand the function of individual cells. Simple cells in the first layer are easier to analyze, because their responses are driven by a linear function. In an artificial neural network, we can just display an image of the convolution kernel to see what the corresponding channel of a convolutional layer responds to.
In a biological neural network, we do not have access to the weights themselves. Instead, we put an electrode in the neuron, display several samples of white noise images in front of the animal's retina, and record how each of these samples causes the neuron to activate. We can then fit a linear model to these responses to obtain an approximation of the neuron's weights. This approach is known as reverse correlation (Ringach and Shapley, 2004).
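The idea is easy to simulate (a toy illustration, not an experimental protocol from the text): if a neuron's response is approximately linear in the image, regressing its responses to white-noise stimuli back onto those stimuli recovers an estimate of its weights.

```python
import numpy as np

rng = np.random.default_rng(0)
true_weights = rng.standard_normal(16 * 16)      # the hidden "neuron" we want to estimate

stimuli = rng.standard_normal((5000, 16 * 16))   # white-noise images shown to the neuron
responses = stimuli @ true_weights + 0.1 * rng.standard_normal(5000)  # noisy responses

# Fit a linear model to the (stimulus, response) pairs; for white-noise input the
# least-squares solution is essentially the spike-triggered average.
estimated, *_ = np.linalg.lstsq(stimuli, responses, rcond=None)
print(np.corrcoef(true_weights, estimated)[0, 1])  # close to 1: weights are recovered
```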
Reverse correlation shows us that most V1 cells have weights that are described by Gabor functions. The Gabor function describes the weight at a 2-D point in the image. We can think of an image as being a function of 2-D coordinates, I(x, y). Likewise, we can think of a simple cell as sampling the image at a set of locations, defined by a set of x coordinates, X, and a set of y coordinates, Y, then applying weights that are also a function of the location, w(x, y). From this point of view, the response of a simple cell to an image is given by

s(I) = \sum_{x \in X} \sum_{y \in Y} w(x, y) I(x, y).    (9.15)
Specifically, w(x, y) takes the form of a Gabor function:

w(x, y; \alpha, \beta_x, \beta_y, f, \phi, x_0, y_0, \tau) = \alpha \exp\left(-\beta_x x'^2 - \beta_y y'^2\right) \cos(f x' + \phi),    (9.16)

where

x' = (x - x_0)\cos(\tau) + (y - y_0)\sin(\tau)    (9.17)

and

y' = -(x - x_0)\sin(\tau) + (y - y_0)\cos(\tau).    (9.18)
Figure 9.18: Gabor functions with a variety of parameter settings. White indicates large positive weight, black indicates large negative weight, and the background gray corresponds to zero weight. (Left) Gabor functions with different values of the parameters that control the coordinate system: x_0, y_0, and τ. Each Gabor function in this grid is assigned a value of x_0 and y_0 proportional to its position in its grid, and τ is chosen so that each Gabor filter is sensitive to the direction radiating out from the center of the grid. For the other two plots, x_0, y_0, and τ are fixed to zero. (Center) Gabor functions with different Gaussian scale parameters β_x and β_y. Gabor functions are arranged in increasing width (decreasing β_x) as we move left to right through the grid, and increasing height (decreasing β_y) as we move top to bottom. For the other two plots, the β values are fixed to 1.5 times the image width. (Right) Gabor functions with different sinusoid parameters f and φ. As we move top to bottom, f increases, and as we move left to right, φ increases. For the other two plots, φ is fixed to zero and f is fixed to a constant multiple of the image width.
Here, α, β_x, β_y, f, φ, x_0, y_0, and τ are parameters that control the properties of the Gabor function. Figure 9.18 shows some examples of Gabor functions with different settings of these parameters.
The parameters x_0, y_0, and τ define a coordinate system. We translate and rotate x and y to form x' and y'. Specifically, the simple cell will respond to image features centered at the point (x_0, y_0), and it will respond to changes in brightness as we move along a line rotated τ radians from the horizontal.

Viewed as a function of x' and y', the function w then responds to changes in brightness as we move along the x' axis. It has two important factors: one is a Gaussian function, and the other is a cosine function. The Gaussian factor α exp(-β_x x'^2 - β_y y'^2) can be seen as a gating term that ensures the simple cell will respond only to values near where x' and y' are both zero, in other words, near the center of the cell's receptive field. The scaling factor α adjusts the total magnitude of the simple cell's response, while β_x and β_y control how quickly its receptive field falls off.
The cosine factor cos(f x' + φ) controls how the simple cell responds to changing brightness along the x' axis. The parameter f controls the frequency of the cosine, and φ controls its phase offset.

Altogether, this cartoon view of simple cells means that a simple cell responds to a specific spatial frequency of brightness in a specific direction at a specific location. Simple cells are most excited when the wave of brightness in the image has the same phase as the weights. This occurs when the image is bright where the weights are positive and dark where the weights are negative. Simple cells are most inhibited when the wave of brightness is fully out of phase with the weights: when the image is dark where the weights are positive and bright where the weights are negative.
The cartoon view of a complex cell is that it computes the norm of the 2-D vector containing two simple cells' responses: c(I) = \sqrt{s_0(I)^2 + s_1(I)^2}. An important special case occurs when s_1 has all the same parameters as s_0 except for φ, and φ is set such that s_1 is one quarter cycle out of phase with s_0. In this case, s_0 and s_1 form a quadrature pair. A complex cell defined in this way responds when the Gaussian reweighted image I(x, y) exp(-β_x x'^2 - β_y y'^2) contains a high-amplitude sinusoidal wave with frequency f in direction τ near (x_0, y_0), regardless of the phase offset of this wave. In other words, the complex cell is invariant to small translations of the image in direction τ, or to negating the image (replacing black with white and vice versa).
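The Gabor weights of equations 9.16-9.18 and the quadrature-pair construction are easy to generate directly. The sketch below (illustrative NumPy code, with arbitrarily chosen default parameter values and a square grayscale image assumed) builds one Gabor kernel and computes the complex-cell response of a quadrature pair.

```python
import numpy as np

def gabor(size, alpha=1.0, beta_x=0.05, beta_y=0.05, f=0.5, phi=0.0,
          x0=0.0, y0=0.0, tau=0.0):
    """Gabor weights w(x, y) on a size-by-size grid, following equations 9.16-9.18."""
    coords = np.arange(size) - size // 2
    x, y = np.meshgrid(coords, coords, indexing="ij")
    xp = (x - x0) * np.cos(tau) + (y - y0) * np.sin(tau)    # rotated, translated coordinates
    yp = -(x - x0) * np.sin(tau) + (y - y0) * np.cos(tau)
    return alpha * np.exp(-beta_x * xp ** 2 - beta_y * yp ** 2) * np.cos(f * xp + phi)

def complex_cell_response(image, **params):
    """Quadrature pair: two Gabors a quarter cycle apart, combined by the norm."""
    s0 = np.sum(image * gabor(image.shape[0], **params))
    shifted = dict(params, phi=params.get("phi", 0.0) + np.pi / 2)
    s1 = np.sum(image * gabor(image.shape[0], **shifted))
    return np.sqrt(s0 ** 2 + s1 ** 2)
```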
Some of the most striking correspondences between neuroscience and machine learning come from visually comparing the features learned by machine learning models with those employed by V1. Olshausen and Field (1996) showed that a simple unsupervised learning algorithm, sparse coding, learns features with receptive fields similar to those of simple cells. Since then, we have found that an extremely wide variety of statistical learning algorithms learn features with Gabor-like functions when applied to natural images. This includes most deep learning algorithms, which learn these features in their first layer. Figure 9.19 shows some examples.
Because so many different learning algorithms learn edge detectors, it is difficult to conclude that any specific learning algorithm is the "right" model of the brain just based on the features it learns (though it can certainly be a bad sign if an algorithm does not learn some sort of edge detector when applied to natural images). These features are an important part of the statistical structure of natural images and can be recovered by many different approaches to statistical modeling. See Hyvärinen et al. (2009) for a review of the field of natural image statistics.
Figure 9.19: Many machine learning algorithms learn features that detect edges or specific colors of edges when applied to natural images. These feature detectors are reminiscent of the Gabor functions known to be present in the primary visual cortex. (Left) Weights learned by an unsupervised learning algorithm (spike and slab sparse coding) applied to small image patches. (Right) Convolution kernels learned by the first layer of a fully supervised convolutional maxout network. Neighboring pairs of filters drive the same maxout unit.
9.11 Convolutional Networks and the History of Deep Learning
Convolutional networks have played an important role in the history of deep learning. They are a key example of a successful application of insights obtained by studying the brain to machine learning applications. They were also some of the first deep models to perform well, long before arbitrary deep models were considered viable. Convolutional networks were also some of the first neural networks to solve important commercial applications and remain at the forefront of commercial applications of deep learning today.
For example, in the 1990s, the neural network research group at AT&T developed a convolutional network for reading checks (LeCun et al., 1998b). By the end of the 1990s, this system deployed by NCR was reading over 10 percent of all the checks in the United States. Later, several OCR and handwriting recognition systems based on convolutional nets were deployed by Microsoft (Simard et al., 2003). See chapter 12 for more details on such applications and more modern applications of convolutional networks. See LeCun et al. (2010) for a more in-depth history of convolutional networks up to 2010.
Convolutional networks were also used to win many contests. The current intensity of commercial interest in deep learning began when Krizhevsky et al. (2012) won the ImageNet object recognition challenge, but convolutional networks had been used to win other machine learning and computer vision contests with less impact for years earlier.
Convolutional nets were some of the first working deep networks trained with back-propagation. It is not entirely clear why convolutional networks succeeded when general back-propagation networks were considered to have failed. It may simply be that convolutional networks were more computationally efficient than fully connected networks, so it was easier to run multiple experiments with them and tune their implementation and hyperparameters. Larger networks also seem to be easier to train.
With modern hardware, large fully connected networks appear to perform reasonably on many tasks, even when using datasets that were available and activation functions that were popular during the times when fully connected networks were believed not to work well. It may be that the primary barriers to the success of neural networks were psychological (practitioners did not expect neural networks to work, so they did not make a serious effort to use neural networks). Whatever the case, it is fortunate that convolutional networks performed well decades ago. In many ways, they carried the torch for the rest of deep learning and paved the way to the acceptance of neural networks in general.
Convolutional networks provide a way to specialize neural networks to work with data that has a clear grid-structured topology and to scale such models to very large size. This approach has been the most successful on a two-dimensional image topology. To process one-dimensional sequential data, we turn next to another powerful specialization of the neural networks framework: recurrent neural networks.