Chapter 3
Probability and Information Theory
In this chapter, we describe probability theory and information theory.

Probability theory is a mathematical framework for representing uncertain statements. It provides a means of quantifying uncertainty as well as axioms for deriving new uncertain statements. In artificial intelligence applications, we use probability theory in two major ways. First, the laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI systems.

Probability theory is a fundamental tool of many disciplines of science and engineering. We provide this chapter to ensure that readers whose background is primarily in software engineering, with limited exposure to probability theory, can understand the material in this book.

While probability theory allows us to make uncertain statements and to reason in the presence of uncertainty, information theory enables us to quantify the amount of uncertainty in a probability distribution.

If you are already familiar with probability theory and information theory, you may wish to skip this chapter except for section 3.14, which describes the graphs we use to describe structured probabilistic models for machine learning. If you have absolutely no prior experience with these subjects, this chapter should be sufficient to successfully carry out deep learning research projects, but we do suggest that you consult an additional resource, such as Jaynes (2003).
3.1 Why Probability?
branches
of
computer
science
deal
mostly
with
entities
that
are
en
tirely
deterministic
and
certain.
programmer
can
usually
safely
assume
that
CPU
will
execute
eac
mac
hine
instruction
flawlessly
Errors
in
hardware
do
ccur
but
are
rare
enough
that
most
soft
are
applications
do
not
need
to
designed
to
accoun
for
them.
Giv
en
that
many
computer
scien
tists
and
soft
are
engineers
work
in
relativ
ely
clean
and
certain
environmen
t,
it
can
surprising
that
mac
hine
learning
mak
es
hea
vy
use
of
probabilit
theory
Mac
hine
learning
must
alwa
ys
deal
with
uncertain
quan
tities
and
sometimes
sto
hastic
(nondeterministic)
quantities.
Uncertaint
and
sto
chasticit
can
arise
from
man
sources.
Researchers
hav
made
compelling
arguments
for
quantifying
uncertain
using
probability
since
at
least
the
1980s.
Man
of
the
argumen
ts
presen
ted
here
are
summarized
from
or
inspired
earl
1988
).
Nearly
all
activities
require
some
abilit
to
reason
in
the
presence
of
uncertaint
In
fact,
eyond
mathematical
statements
that
are
true
by
definition,
it
is
difficult
to
think
of
any
prop
osition
that
is
absolutely
true
or
any
even
that
is
absolutely
guaran
teed
to
ccur.
There
are
three
ossible
sources
of
uncertain
y:
1. Inherent stochasticity in the system being modeled. For example, most interpretations of quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can also create theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game where we assume that the cards are truly shuffled into a random order.
2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all the variables that drive the behavior of the system. For example, in the Monty Hall problem, a game show contestant is asked to choose between three doors and wins a prize held behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given the contestant's choice is deterministic, but from the contestant's point of view, the outcome is uncertain.
3. Incomplete modeling. When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model's predictions. For example, suppose we build a robot that can exactly observe the location of every object around it. If the robot discretizes space when predicting the future location of these objects, then the discretization makes the robot immediately become uncertain about the precise position of objects: each object could be anywhere within the discrete cell that it was observed to occupy.
In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule. For example, the simple rule "Most birds fly" is cheap to develop and is broadly useful, while a rule of the form, "Birds fly, except for very young birds that have not yet learned to fly, sick or injured birds that have lost the ability to fly, flightless species of birds including the cassowary, ostrich and kiwi," is expensive to develop, maintain and communicate and, after all this effort, is still brittle and prone to failure.
While it should be clear that we need a means of representing and reasoning about uncertainty, it is not immediately obvious that probability theory can provide all the tools we want for artificial intelligence applications. Probability theory was originally developed to analyze the frequencies of events. It is easy to see how probability theory can be used to study events like drawing a certain hand of cards in a poker game. These kinds of events are often repeatable.
When we say that an outcome has a probability $p$ of occurring, it means that if we repeated the experiment (e.g., drawing a hand of cards) infinitely many times, then a proportion $p$ of the repetitions would result in that outcome. This kind of reasoning does not seem immediately applicable to propositions that are not repeatable. If a doctor analyzes a patient and says that the patient has a 40 percent chance of having the flu, this means something very different: we cannot make infinitely many replicas of the patient, nor is there any reason to believe that different replicas of the patient would present with the same symptoms yet have varying underlying conditions. In the case of the doctor diagnosing the patient, we use probability to represent a degree of belief, with 1 indicating absolute certainty that the patient has the flu and 0 indicating absolute certainty that the patient does not have the flu.
The former kind of probability, related directly to the rates at which events occur, is known as frequentist probability, while the latter, related to qualitative levels of certainty, is known as Bayesian probability.

If we list several properties that we expect common sense reasoning about uncertainty to have, then the only way to satisfy those properties is to treat Bayesian probabilities as behaving exactly the same as frequentist probabilities.
For example, if we want to compute the probability that a player will win a poker game given that she has a certain set of cards, we use exactly the same formulas as when we compute the probability that a patient has a disease given that she has certain symptoms. For more details about why a small set of common sense assumptions implies that the same axioms must control both kinds of probability, see Ramsey (1926).
Probability can be seen as the extension of logic to deal with uncertainty. Logic provides a set of formal rules for determining what propositions are implied to be true or false given the assumption that some other set of propositions is true or false. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.
3.2 Random Variables
A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lowercase letter in plain typeface, and the values it can take on with lowercase script letters. For example, $x_1$ and $x_2$ are both possible values that the random variable $\mathrm{x}$ can take on. For vector-valued variables, we would write the random variable as $\mathbf{x}$ and one of its values as $\boldsymbol{x}$. On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states is.

Random variables may be discrete or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily the integers; they can also just be named states that are not considered to have any numerical value. A continuous random variable is associated with a real value.
3.3 Probability Distributions
A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.
3.3.1 Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital $P$. Often we associate each random variable with a different probability mass function and the reader must infer which PMF to use based on the identity of the random variable, rather than on the name of the function; $P(\mathrm{x})$ is usually not the same as $P(\mathrm{y})$.

The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state. The probability that $\mathrm{x} = x$ is denoted as $P(x)$, with a probability of 1 indicating that $\mathrm{x} = x$ is certain and a probability of 0 indicating that $\mathrm{x} = x$ is impossible. Sometimes, to disambiguate which PMF to use, we write the name of the random variable explicitly: $P(\mathrm{x} = x)$. Sometimes we define a variable first, then use $\sim$ notation to specify which distribution it follows later: $\mathrm{x} \sim P(\mathrm{x})$.
Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution. $P(\mathrm{x} = x, \mathrm{y} = y)$ denotes the probability that $\mathrm{x} = x$ and $\mathrm{y} = y$ simultaneously. We may also write $P(x, y)$ for brevity.

To be a PMF on a random variable $\mathrm{x}$, a function $P$ must satisfy the following properties:
- The domain of $P$ must be the set of all possible states of $\mathrm{x}$.
- $\forall x \in \mathrm{x}, 0 \le P(x) \le 1$. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.
- $\sum_{x \in \mathrm{x}} P(x) = 1$. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.
For example, consider a single discrete random variable $\mathrm{x}$ with $k$ different states. We can place a uniform distribution on $\mathrm{x}$, that is, make each of its states equally likely, by setting its PMF to

$P(\mathrm{x} = x_i) = \frac{1}{k}$   (3.1)

for all $i$. We can see that this fits the requirements for a probability mass function. The value $\frac{1}{k}$ is positive because $k$ is a positive integer. We also see that

$\sum_i P(\mathrm{x} = x_i) = \sum_i \frac{1}{k} = \frac{k}{k} = 1,$   (3.2)

so the distribution is properly normalized.
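The requirements can be mirrored numerically. The following sketch (our own illustration, not from the text; the value of $k$ and the state labels are arbitrary) builds the uniform PMF of equation 3.1 and checks the properties listed above:

```python
# Uniform PMF over k named states: every state gets probability 1/k.
k = 6
states = [f"s{i}" for i in range(k)]
pmf = {s: 1.0 / k for s in states}

# Each probability lies in [0, 1] ...
assert all(0.0 <= p <= 1.0 for p in pmf.values())
# ... and the probabilities sum to 1 (equation 3.2), up to float rounding.
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```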
3.3.2 Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function.
To be a probability density function, a function $p$ must satisfy the following properties:

- The domain of $p$ must be the set of all possible states of $\mathrm{x}$.
- $\forall x \in \mathrm{x}, p(x) \ge 0$. Note that we do not require $p(x) \le 1$.
- $\int p(x)\, dx = 1$.

A probability density function $p(x)$ does not give the probability of a specific state directly; instead the probability of landing inside an infinitesimal region with volume $\delta x$ is given by $p(x)\,\delta x$.

We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that $x$ lies in some set $\mathbb{S}$ is given by the integral of $p(x)$ over that set. In the univariate example, the probability that $x$ lies in the interval $[a, b]$ is given by $\int_{[a,b]} p(x)\, dx$.
For an example of a PDF corresponding to a specific probability density over a continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this with a function $u(x; a, b)$, where $a$ and $b$ are the endpoints of the interval, with $b > a$. The ";" notation means "parametrized by"; we consider $x$ to be the argument of the function, while $a$ and $b$ are parameters that define the function. To ensure that there is no probability mass outside the interval, we say $u(x; a, b) = 0$ for all $x \notin [a, b]$. Within $[a, b]$, $u(x; a, b) = \frac{1}{b - a}$. We can see that this is non-negative everywhere. Additionally, it integrates to 1. We often denote that $\mathrm{x}$ follows the uniform distribution on $[a, b]$ by writing $\mathrm{x} \sim U(a, b)$.
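As an illustration (our own sketch; the endpoints and grid size are arbitrary choices), we can approximate the integral of $u(x; a, b)$ numerically and confirm that the density is properly normalized:

```python
def uniform_pdf(x, a, b):
    # u(x; a, b): density 1/(b - a) inside [a, b], zero outside.
    return 1.0 / (b - a) if a <= x <= b else 0.0

# Midpoint Riemann-sum approximation of the integral over [a, b];
# it should be close to 1, matching the normalization property.
a, b, n = -2.0, 3.0, 100000
dx = (b - a) / n
total = sum(uniform_pdf(a + (i + 0.5) * dx, a, b) * dx for i in range(n))
assert abs(total - 1.0) < 1e-6
```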
3.4 Marginal Probability
Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution.

For example, suppose we have discrete random variables $\mathrm{x}$ and $\mathrm{y}$, and we know $P(\mathrm{x}, \mathrm{y})$. We can find $P(\mathrm{x})$ with the sum rule:

$\forall x \in \mathrm{x}, P(\mathrm{x} = x) = \sum_y P(\mathrm{x} = x, \mathrm{y} = y).$   (3.3)
The name "marginal probability" comes from the process of computing marginal probabilities on paper. When the values of $P(\mathrm{x}, \mathrm{y})$ are written in a grid with different values of $x$ in rows and different values of $y$ in columns, it is natural to sum across a row of the grid, then write $P(x)$ in the margin of the paper just to the right of the row.

For continuous variables, we need to use integration instead of summation:

$p(x) = \int p(x, y)\, dy.$   (3.4)
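The discrete sum rule is easy to mirror in code. The joint table below is a made-up example of ours, not from the text:

```python
# A joint PMF P(x, y) over two discrete variables, stored as a dict
# keyed by (x, y) pairs. The numbers are arbitrary and sum to 1.
joint = {
    ("x0", "y0"): 0.1, ("x0", "y1"): 0.2,
    ("x1", "y0"): 0.3, ("x1", "y1"): 0.4,
}

# Sum rule (equation 3.3): P(x) = sum over y of P(x, y).
marginal_x = {}
for (x, _y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p

assert abs(marginal_x["x0"] - 0.3) < 1e-9
assert abs(marginal_x["x1"] - 0.7) < 1e-9
```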
3.5 Conditional Probability
In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that $\mathrm{y} = y$ given $\mathrm{x} = x$ as $P(\mathrm{y} = y \mid \mathrm{x} = x)$. This conditional probability can be computed with the formula

$P(\mathrm{y} = y \mid \mathrm{x} = x) = \frac{P(\mathrm{y} = y, \mathrm{x} = x)}{P(\mathrm{x} = x)}.$   (3.5)
The conditional probability is only defined when $P(\mathrm{x} = x) > 0$. We cannot compute the conditional probability conditioned on an event that never happens.
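Equation 3.5 can be sketched directly from a joint table (the table values are our own toy numbers):

```python
# Conditional probability from a joint table (equation 3.5):
# P(y | x) = P(x, y) / P(x), defined only when P(x) > 0.
joint = {
    ("x0", "y0"): 0.1, ("x0", "y1"): 0.2,
    ("x1", "y0"): 0.3, ("x1", "y1"): 0.4,
}

def conditional(y, x):
    p_x = sum(p for (xi, _yi), p in joint.items() if xi == x)
    if p_x == 0.0:
        raise ValueError("conditioning on an event of probability zero")
    return joint[(x, y)] / p_x

# P(y1 | x0) = 0.2 / 0.3
assert abs(conditional("y1", "x0") - 2.0 / 3.0) < 1e-9
```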
It is important not to confuse conditional probability with computing what would happen if some action were undertaken. The conditional probability that a person is from Germany given that they speak German is quite high, but if a randomly selected person is taught to speak German, their country of origin does not change. Computing the consequences of an action is called making an intervention query. Intervention queries are the domain of causal modeling, which we do not explore in this book.
3.6 The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:

$P(\mathrm{x}^{(1)}, \dots, \mathrm{x}^{(n)}) = P(\mathrm{x}^{(1)}) \prod_{i=2}^{n} P(\mathrm{x}^{(i)} \mid \mathrm{x}^{(1)}, \dots, \mathrm{x}^{(i-1)}).$   (3.6)

This observation is known as the chain rule, or product rule, of probability. It follows immediately from the definition of conditional probability in equation 3.5.
For example, applying the definition twice, we get

$P(\mathrm{a}, \mathrm{b}, \mathrm{c}) = P(\mathrm{a} \mid \mathrm{b}, \mathrm{c}) P(\mathrm{b}, \mathrm{c}),$
$P(\mathrm{b}, \mathrm{c}) = P(\mathrm{b} \mid \mathrm{c}) P(\mathrm{c}),$
$P(\mathrm{a}, \mathrm{b}, \mathrm{c}) = P(\mathrm{a} \mid \mathrm{b}, \mathrm{c}) P(\mathrm{b} \mid \mathrm{c}) P(\mathrm{c}).$
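The chain rule can be spot-checked numerically. In this sketch of ours, the joint over three binary variables is a randomly generated toy table, and equation 3.6 is verified for every assignment:

```python
# Verify P(a, b, c) = P(a | b, c) P(b | c) P(c) on a random toy joint.
import itertools
import random

random.seed(0)
weights = {abc: random.random() for abc in itertools.product([0, 1], repeat=3)}
z = sum(weights.values())
p = {abc: w / z for abc, w in weights.items()}  # normalized joint PMF

def marg(fixed):
    # Marginal probability of the partial assignment in `fixed`
    # (a dict mapping variable index -> value).
    return sum(pr for abc, pr in p.items()
               if all(abc[i] == v for i, v in fixed.items()))

for a, b, c in itertools.product([0, 1], repeat=3):
    p_c = marg({2: c})
    p_bc = marg({1: b, 2: c})
    p_abc = marg({0: a, 1: b, 2: c})
    # P(a | b, c) * P(b | c) * P(c)
    chain = (p_abc / p_bc) * (p_bc / p_c) * p_c
    assert abs(chain - p[(a, b, c)]) < 1e-12
```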
3.7 Independence and Conditional Independence
Two random variables $\mathrm{x}$ and $\mathrm{y}$ are independent if their probability distribution can be expressed as a product of two factors, one involving only $\mathrm{x}$ and one involving only $\mathrm{y}$:

$\forall x \in \mathrm{x}, y \in \mathrm{y}, p(\mathrm{x} = x, \mathrm{y} = y) = p(\mathrm{x} = x)\, p(\mathrm{y} = y).$   (3.7)

Two random variables $\mathrm{x}$ and $\mathrm{y}$ are conditionally independent given a random variable $\mathrm{z}$ if the conditional probability distribution over $\mathrm{x}$ and $\mathrm{y}$ factorizes in this way for every value of $\mathrm{z}$:

$\forall x \in \mathrm{x}, y \in \mathrm{y}, z \in \mathrm{z}, p(\mathrm{x} = x, \mathrm{y} = y \mid \mathrm{z} = z) = p(\mathrm{x} = x \mid \mathrm{z} = z)\, p(\mathrm{y} = y \mid \mathrm{z} = z).$   (3.8)
We can denote independence and conditional independence with compact notation: $\mathrm{x} \perp \mathrm{y}$ means that $\mathrm{x}$ and $\mathrm{y}$ are independent, while $\mathrm{x} \perp \mathrm{y} \mid \mathrm{z}$ means that $\mathrm{x}$ and $\mathrm{y}$ are conditionally independent given $\mathrm{z}$.
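Equation 3.7 can be checked on a joint distribution built as a product of marginals (a toy construction of ours, independent by design):

```python
# Build a joint as the outer product of two marginals, so x and y are
# independent by construction, then verify the factorization of eq. 3.7.
px = {0: 0.25, 1: 0.75}
py = {0: 0.6, 1: 0.4}
joint = {(x, y): px[x] * py[y] for x in px for y in py}

for (x, y), p_xy in joint.items():
    marg_x = sum(joint[(x, yy)] for yy in py)
    marg_y = sum(joint[(xx, y)] for xx in px)
    # The joint equals the product of its marginals: independence holds.
    assert abs(p_xy - marg_x * marg_y) < 1e-12
```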
3.8 Expectation, Variance and Covariance
The expectation, or expected value, of some function $f(x)$ with respect to a probability distribution $P(\mathrm{x})$ is the average, or mean value, that $f$ takes on when $x$ is drawn from $P$. For discrete variables this can be computed with a summation:

$\mathbb{E}_{\mathrm{x} \sim P}[f(x)] = \sum_x P(x) f(x),$   (3.9)

while for continuous variables, it is computed with an integral:

$\mathbb{E}_{\mathrm{x} \sim p}[f(x)] = \int p(x) f(x)\, dx.$   (3.10)
When the identity of the distribution is clear from the context, we may simply write the name of the random variable that the expectation is over, as in $\mathbb{E}_{\mathrm{x}}[f(x)]$. If it is clear which random variable the expectation is over, we may omit the subscript entirely, as in $\mathbb{E}[f(x)]$. By default, we can assume that $\mathbb{E}[\cdot]$ averages over the values of all the random variables inside the brackets. Likewise, when there is no ambiguity, we may omit the square brackets.
Expectations are linear; for example,

$\mathbb{E}_{\mathrm{x}}[\alpha f(x) + \beta g(x)] = \alpha\, \mathbb{E}_{\mathrm{x}}[f(x)] + \beta\, \mathbb{E}_{\mathrm{x}}[g(x)],$   (3.11)

when $\alpha$ and $\beta$ are not dependent on $x$.
The variance gives a measure of how much the values of a function of a random variable $\mathrm{x}$ vary as we sample different values of $x$ from its probability distribution:

$\mathrm{Var}(f(x)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right].$   (3.12)

When the variance is low, the values of $f(x)$ cluster near their expected value. The square root of the variance is known as the standard deviation.
The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:

$\mathrm{Cov}(f(x), g(y)) = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])\right].$   (3.13)
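These quantities can be estimated from samples. The sketch below (our own illustration, with arbitrary toy distributions) checks the Monte Carlo estimates against known closed-form values:

```python
# Empirical estimates of expectation (3.9), variance (3.12) and
# covariance (3.13) from samples.
import random

random.seed(0)
n = 200000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]  # linearly related to x

mean_x = sum(xs) / n
mean_y = sum(ys) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Uniform on [-1, 1] has mean 0 and variance 1/3; since the noise is
# independent of x, Cov(x, 2x + noise) is approximately 2 Var(x).
assert abs(mean_x) < 0.01
assert abs(var_x - 1.0 / 3.0) < 0.01
assert abs(cov_xy - 2.0 * var_x) < 0.01
```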
High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.
The notions of covariance and dependence are related but distinct concepts. They are related because two variables that are independent have zero covariance, and two variables that have nonzero covariance are dependent. Independence, however, is a distinct property from covariance. For two variables to have zero covariance, there must be no linear dependence between them. Independence is a stronger requirement than zero covariance, because independence also excludes nonlinear relationships. It is possible for two variables to be dependent but have zero covariance.
For example, suppose we first sample a real number $x$ from a uniform distribution over the interval $[-1, 1]$. We next sample a random variable $s$. With probability $\frac{1}{2}$, we choose the value of $s$ to be 1. Otherwise, we choose the value of $s$ to be $-1$. We can then generate a random variable $y$ by assigning $y = sx$. Clearly, $x$ and $y$ are not independent, because $x$ completely determines the magnitude of $y$. However, $\mathrm{Cov}(x, y) = 0$.
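This construction is easy to simulate. The following sketch reproduces the example above and confirms that the sample covariance is near zero even though the magnitude of $y$ is completely determined by $x$:

```python
# x ~ U(-1, 1); s = +1 or -1 with equal probability; y = s * x.
import random

random.seed(0)
n = 200000
xs, ys = [], []
for _ in range(n):
    x = random.uniform(-1.0, 1.0)
    s = 1.0 if random.random() < 0.5 else -1.0
    xs.append(x)
    ys.append(s * x)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

assert abs(cov) < 0.01  # covariance is (approximately) zero ...
# ... yet y is fully dependent on x: |y| always equals |x|.
assert all(abs(y) == abs(x) for x, y in zip(xs, ys))
```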
The covariance matrix of a random vector $\boldsymbol{x} \in \mathbb{R}^n$ is an $n \times n$ matrix, such that

$\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(\mathrm{x}_i, \mathrm{x}_j).$   (3.14)

The diagonal elements of the covariance give the variance:

$\mathrm{Cov}(\mathrm{x}_i, \mathrm{x}_i) = \mathrm{Var}(\mathrm{x}_i).$   (3.15)
3.9 Common Probability Distributions
Several simple probability distributions are useful in many contexts in machine learning.
3.9.1 Bernoulli Distribution
The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter $\phi \in [0, 1]$, which gives the probability of the random variable being equal to 1. It has the following properties:

$P(\mathrm{x} = 1) = \phi$   (3.16)
$P(\mathrm{x} = 0) = 1 - \phi$   (3.17)
$P(\mathrm{x} = x) = \phi^x (1 - \phi)^{1 - x}$   (3.18)
$\mathbb{E}_{\mathrm{x}}[\mathrm{x}] = \phi$   (3.19)
$\mathrm{Var}_{\mathrm{x}}(\mathrm{x}) = \phi(1 - \phi)$   (3.20)
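The mean and variance in equations 3.19 and 3.20 can be confirmed by Monte Carlo sampling (a sketch of ours; the value of $\phi$ is an arbitrary choice):

```python
# Sample from Bernoulli(phi) and check E[x] = phi, Var(x) = phi(1 - phi).
import random

random.seed(0)
phi = 0.3
n = 200000
samples = [1 if random.random() < phi else 0 for _ in range(n)]

mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n

assert abs(mean - phi) < 0.01              # equation 3.19
assert abs(var - phi * (1 - phi)) < 0.01   # equation 3.20
```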
3.9.2 Multinoulli Distribution
The multinoulli, or categorical, distribution is a distribution over a single discrete variable with $k$ different states, where $k$ is finite.¹ It is parametrized by a vector $\boldsymbol{p} \in [0, 1]^{k-1}$, where $p_i$ gives the probability of the $i$-th state. The final, $k$-th state's probability is given by $1 - \boldsymbol{1}^\top \boldsymbol{p}$. Note that we must constrain $\boldsymbol{1}^\top \boldsymbol{p} \le 1$. Multinoulli distributions are often used to refer to distributions over categories of objects, so we do not usually assume that state 1 has numerical value 1, and so on. For this reason, we do not usually need to compute the expectation or variance of multinoulli-distributed random variables.

¹ "Multinoulli" is a term that was recently coined by Gustavo Lacerda and popularized by Murphy (2012). The multinoulli distribution is a special case of the multinomial distribution. A multinomial distribution is the distribution over vectors in $\{0, \dots, n\}^k$ representing how many times each of the $k$ categories is visited when $n$ samples are drawn from a multinoulli distribution. Many texts use the term "multinomial" to refer to multinoulli distributions without clarifying that they are referring only to the $n = 1$ case.
The Bernoulli and multinoulli distributions are sufficient to describe any distribution over their domain. They are able to describe any distribution over their domain not so much because they are particularly powerful but rather because their domain is simple; they model discrete variables for which it is feasible to enumerate all the states. When dealing with continuous variables, there are uncountably many states, so any distribution described by a small number of parameters must impose strict limits on the distribution.
3.9.3 Gaussian Distribution
The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

$\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right).$   (3.21)

See figure 3.1 for a plot of the normal distribution density function.
The two parameters $\mu \in \mathbb{R}$ and $\sigma \in (0, \infty)$ control the normal distribution. The parameter $\mu$ gives the coordinate of the central peak. This is also the mean of the distribution: $\mathbb{E}[\mathrm{x}] = \mu$. The standard deviation of the distribution is given by $\sigma$, and the variance by $\sigma^2$.

When we evaluate the PDF, we need to square and invert $\sigma$. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter $\beta \in (0, \infty)$ to control the precision, or inverse variance, of the distribution:

$\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{1}{2}\beta(x - \mu)^2\right).$   (3.22)
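Equations 3.21 and 3.22 describe the same density with $\beta = 1/\sigma^2$, which can be confirmed numerically (a sketch of ours; the parameter values and evaluation points are arbitrary):

```python
# Evaluate the normal density both ways and check they agree.
import math

def normal_pdf(x, mu, sigma2):
    # Equation 3.21: variance parametrization.
    return math.sqrt(1.0 / (2.0 * math.pi * sigma2)) * \
        math.exp(-((x - mu) ** 2) / (2.0 * sigma2))

def normal_pdf_precision(x, mu, beta):
    # Equation 3.22: precision (inverse variance) parametrization.
    return math.sqrt(beta / (2.0 * math.pi)) * \
        math.exp(-0.5 * beta * (x - mu) ** 2)

mu, sigma2 = 1.5, 0.8
for x in [-2.0, 0.0, 1.5, 3.0]:
    a = normal_pdf(x, mu, sigma2)
    b = normal_pdf_precision(x, mu, 1.0 / sigma2)
    assert abs(a - b) < 1e-12
```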
Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.
Figure 3.1: The normal distribution. The normal distribution $\mathcal{N}(x; \mu, \sigma^2)$ exhibits a classic "bell curve" shape, with the $x$ coordinate of its central peak given by $\mu$ and the width of its peak controlled by $\sigma$. The maximum of the density occurs at $x = \mu$, and the inflection points occur at $x = \mu \pm \sigma$. In this example, we depict the standard normal distribution, with $\mu = 0$ and $\sigma = 1$.
First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that in practice, many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.
Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools and is postponed to section 19.4.2.
The normal distribution generalizes to $\mathbb{R}^n$, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix $\boldsymbol{\Sigma}$:

$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sqrt{\frac{1}{(2\pi)^n \det(\boldsymbol{\Sigma})}} \exp\left(-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu})\right).$   (3.23)
The parameter $\boldsymbol{\mu}$ still gives the mean of the distribution, though now it is vector valued. The parameter $\boldsymbol{\Sigma}$ gives the covariance matrix of the distribution. As in the univariate case, when we wish to evaluate the PDF several times for many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert $\boldsymbol{\Sigma}$ to evaluate the PDF. We can instead use a precision matrix $\boldsymbol{\beta}$:

$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\beta}^{-1}) = \sqrt{\frac{\det(\boldsymbol{\beta})}{(2\pi)^n}} \exp\left(-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^\top \boldsymbol{\beta} (\boldsymbol{x} - \boldsymbol{\mu})\right).$   (3.24)
We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.
3.9.4 Exponential and Laplace Distributions
In the context of deep learning, we often want to have a probability distribution with a sharp point at $x = 0$. To accomplish this, we can use the exponential distribution:

$p(x; \lambda) = \lambda\, \mathbf{1}_{x \ge 0} \exp(-\lambda x).$   (3.25)

The exponential distribution uses the indicator function $\mathbf{1}_{x \ge 0}$ to assign probability zero to all negative values of $x$.

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point $\mu$ is the Laplace distribution:

$\mathrm{Laplace}(x; \mu, \gamma) = \frac{1}{2\gamma} \exp\left(-\frac{|x - \mu|}{\gamma}\right).$   (3.26)
3.9.5 The Dirac Distribution and Empirical Distribution
In some cases, we wish to specify that all the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, $\delta(x)$:

$p(x) = \delta(x - \mu).$   (3.27)

The Dirac delta function is defined such that it is zero valued everywhere except 0, yet integrates to 1. The Dirac delta function is not an ordinary function that associates each value $x$ with a real-valued output; instead it is a different kind of mathematical object called a generalized function that is defined in terms of its properties when integrated. We can think of the Dirac delta function as being the limit point of a series of functions that put less and less density on all points other than zero.

By defining $p(x)$ to be $\delta$ shifted by $-\mu$, we obtain an infinitely narrow and infinitely high peak of probability density where $x = \mu$.
A common use of the Dirac delta distribution is as a component of an empirical distribution,

$\hat{p}(\boldsymbol{x}) = \frac{1}{m} \sum_{i=1}^{m} \delta(\boldsymbol{x} - \boldsymbol{x}^{(i)}),$   (3.28)

which puts probability mass $\frac{1}{m}$ on each of the $m$ points $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(m)}$, forming a given data set or collection of samples.
The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated with each possible input value that is simply equal to the empirical frequency of that value in the training set.
We can view the empirical distribution formed from a dataset of training examples as specifying the distribution that we sample from when we train a model on this dataset. Another important perspective on the empirical distribution is that it is the probability density that maximizes the likelihood of the training data (see section 5.5).
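For the discrete case, the empirical distribution is just the table of observed frequencies. The dataset below is a made-up illustration of ours:

```python
# Empirical distribution over a discrete variable: each value gets
# probability equal to its relative frequency in the data.
data = ["cat", "dog", "cat", "bird", "cat", "dog"]

counts = {}
for v in data:
    counts[v] = counts.get(v, 0) + 1
empirical = {v: c / len(data) for v, c in counts.items()}

assert abs(empirical["cat"] - 0.5) < 1e-12   # 3 of 6 observations
assert abs(sum(empirical.values()) - 1.0) < 1e-12  # a valid multinoulli
```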
3.9.6 Mixtures of Distributions
It is also common to define probability distributions by combining other simpler probability distributions. One common way of combining distributions is to construct a mixture distribution. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution should generate the sample is determined by sampling a component identity from a multinoulli distribution:

$P(\mathrm{x}) = \sum_i P(\mathrm{c} = i) P(\mathrm{x} \mid \mathrm{c} = i),$   (3.29)

where $P(\mathrm{c})$ is the multinoulli distribution over component identities.
We have already seen one example of a mixture distribution: the empirical distribution over real-valued variables is a mixture distribution with one Dirac component for each training example.
The mixture model is one simple strategy for combining probability distributions to create a richer distribution. In chapter 16, we explore the art of building complex probability distributions from simple ones in more detail.
The mixture model allows us to briefly glimpse a concept that will be of paramount importance later: the latent variable. A latent variable is a random variable that we cannot observe directly. The component identity variable $\mathrm{c}$ of the mixture model provides an example. Latent variables may be related to $\mathrm{x}$ through the joint distribution, in this case, $P(\mathrm{x}, \mathrm{c}) = P(\mathrm{x} \mid \mathrm{c}) P(\mathrm{c})$. The distribution $P(\mathrm{c})$ over the latent variable and the distribution $P(\mathrm{x} \mid \mathrm{c})$ relating the latent variables to the visible variables determine the shape of the distribution $P(\mathrm{x})$, even though it is possible to describe $P(\mathrm{x})$ without reference to the latent variable. Latent variables are discussed further in section 16.5.
A very powerful and common type of mixture model is the Gaussian mixture model, in which the components $p(\mathbf{x} \mid \mathrm{c} = i)$ are Gaussians. Each component has a separately parametrized mean $\boldsymbol{\mu}^{(i)}$ and covariance $\boldsymbol{\Sigma}^{(i)}$. Some mixtures can have more constraints. For example, the covariances could be shared across components via the constraint $\boldsymbol{\Sigma}^{(i)} = \boldsymbol{\Sigma}, \forall i$. As with a single Gaussian distribution, the mixture of Gaussians might constrain the covariance matrix for each component to be diagonal or isotropic.
In addition to the means and covariances, the parameters of a Gaussian mixture specify the prior probability $\alpha_i = P(\mathrm{c} = i)$ given to each component $i$. The word "prior" indicates that it expresses the model's beliefs about $\mathrm{c}$ before it has observed $\mathbf{x}$. By comparison, $P(\mathrm{c} \mid \boldsymbol{x})$ is a posterior probability, because it is computed after observation of $\mathbf{x}$.
A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components. Figure 3.2 shows samples from a Gaussian mixture model.
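Sampling from a mixture follows equation 3.29 directly: first draw the component identity $\mathrm{c}$ from the multinoulli, then draw from that component. The one-dimensional sketch below uses arbitrary parameters of our own choosing:

```python
# Ancestral sampling from a 1-D Gaussian mixture model.
import random

random.seed(0)
weights = [0.2, 0.5, 0.3]   # P(c = i), the multinoulli over components
means = [-4.0, 0.0, 5.0]    # mu of each Gaussian component
stds = [1.0, 0.5, 2.0]      # sigma of each Gaussian component

def sample_mixture():
    # Step 1: sample the component identity c.
    c = random.choices(range(len(weights)), weights=weights)[0]
    # Step 2: sample x from the chosen Gaussian component.
    return c, random.gauss(means[c], stds[c])

samples = [sample_mixture() for _ in range(100000)]
# The empirical component frequencies should match the mixture weights.
freq0 = sum(1 for c, _x in samples if c == 0) / len(samples)
assert abs(freq0 - weights[0]) < 0.01
```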
3.10 Useful Properties of Common Functions
Certain functions arise often while working with probability distributions, especially the probability distributions used in deep learning models.

One of these functions is the logistic sigmoid:

$\sigma(x) = \frac{1}{1 + \exp(-x)}.$   (3.30)
Figure 3.2: Samples from a Gaussian mixture model. In this example, there are three components. From left to right, the first component has an isotropic covariance matrix, meaning it has the same amount of variance in each direction. The second has a diagonal covariance matrix, meaning it can control the variance separately along each axis-aligned direction. This example has more variance along the $x_2$ axis than along the $x_1$ axis. The third component has a full-rank covariance matrix, enabling it to control the variance separately along an arbitrary basis of directions.
The logistic sigmoid is commonly used to produce the $\phi$ parameter of a Bernoulli distribution because its range is $(0, 1)$, which lies within the valid range of values for the $\phi$ parameter. See figure 3.3 for a graph of the sigmoid function. The sigmoid function saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input.
Another commonly encountered function is the softplus function (Dugas et al., 2001):

$\zeta(x) = \log(1 + \exp(x)).$   (3.31)
The softplus function can be useful for producing the $\beta$ or $\sigma$ parameter of a normal distribution because its range is $(0, \infty)$. It also arises commonly when manipulating expressions involving sigmoids. The name of the softplus function comes from the fact that it is a smoothed, or "softened," version of

$x^{+} = \max(0, x).$   (3.32)

See figure 3.4 for a graph of the softplus function.
Figure 3.3: The logistic sigmoid function.

Figure 3.4: The softplus function.

The following properties are all useful enough that you may wish to memorize them:

$\sigma(x) = \frac{\exp(x)}{\exp(x) + \exp(0)}$   (3.33)
$\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$   (3.34)
$1 - \sigma(x) = \sigma(-x)$   (3.35)
$\log \sigma(x) = -\zeta(-x)$   (3.36)
$\frac{d}{dx}\zeta(x) = \sigma(x)$   (3.37)
$\forall x \in (0, 1), \sigma^{-1}(x) = \log\left(\frac{x}{1 - x}\right)$   (3.38)
$\forall x > 0, \zeta^{-1}(x) = \log(\exp(x) - 1)$   (3.39)
$\zeta(x) = \int_{-\infty}^{x} \sigma(y)\, dy$   (3.40)
$\zeta(x) - \zeta(-x) = x$   (3.41)
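Several of these identities can be spot-checked numerically (a sketch of ours, using finite differences for the derivative properties):

```python
# Spot-check equations 3.34-3.37 and 3.41 at a few points.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))  # log(1 + exp(x))

for x in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert abs((1.0 - sigmoid(x)) - sigmoid(-x)) < 1e-12       # (3.35)
    assert abs(math.log(sigmoid(x)) + softplus(-x)) < 1e-12    # (3.36)
    assert abs((softplus(x) - softplus(-x)) - x) < 1e-12       # (3.41)
    # (3.34) and (3.37): derivatives checked by central differences.
    h = 1e-6
    dsig = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
    dsp = (softplus(x + h) - softplus(x - h)) / (2.0 * h)
    assert abs(dsig - sigmoid(x) * (1.0 - sigmoid(x))) < 1e-6
    assert abs(dsp - sigmoid(x)) < 1e-6
```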
The function $\sigma^{-1}(x)$ is called the logit in statistics, but this term is rarely used in machine learning.

Equation 3.41 provides extra justification for the name "softplus." The softplus function is intended as a smoothed version of the positive part function, $x^{+} = \max\{0, x\}$. The positive part function is the counterpart of the negative part function, $x^{-} = \max\{0, -x\}$. To obtain a smooth function that is analogous to the negative part, one can use $\zeta(-x)$. Just as $x$ can be recovered from its positive part and its negative part via the identity $x^{+} - x^{-} = x$, it is also possible to recover $x$ using the same relationship between $\zeta(x)$ and $\zeta(-x)$, as shown in equation 3.41.
3.11 Bayes' Rule
We often find ourselves in a situation where we know $P(\mathrm{y} \mid \mathrm{x})$ and need to know $P(\mathrm{x} \mid \mathrm{y})$. Fortunately, if we also know $P(\mathrm{x})$, we can compute the desired quantity using Bayes' rule:

$P(\mathrm{x} \mid \mathrm{y}) = \frac{P(\mathrm{x}) P(\mathrm{y} \mid \mathrm{x})}{P(\mathrm{y})}.$   (3.42)
Note that while $P(\mathrm{y})$ appears in the formula, it is usually feasible to compute $P(\mathrm{y}) = \sum_x P(\mathrm{y} \mid x) P(x)$, so we do not need to begin with knowledge of $P(\mathrm{y})$.
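Bayes' rule can be sketched with a small numeric example (a made-up diagnostic-test scenario of ours, not from the text):

```python
# Bayes' rule (equation 3.42) on a toy diagnostic-test example.
p_disease = 0.01                 # P(x): prior probability of disease
p_pos_given_disease = 0.95       # P(y | x): test sensitivity
p_pos_given_healthy = 0.05       # P(y | not x): false positive rate

# P(y) obtained by marginalizing over x, as described above.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1.0 - p_disease))

# P(x | y) = P(x) P(y | x) / P(y)
p_disease_given_pos = p_disease * p_pos_given_disease / p_pos

# With a rare disease, even a positive test leaves the posterior modest.
assert 0.15 < p_disease_given_pos < 0.17
```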
Bayes' rule is straightforward to derive from the definition of conditional probability, but it is useful to know the name of this formula since many texts refer to it by name. It is named after the Reverend Thomas Bayes, who first discovered a special case of the formula. The general version presented here was independently discovered by Pierre-Simon Laplace.
3.12 Technical Details of Continuous Variables
A proper formal understanding of continuous random variables and probability density functions requires developing probability theory in terms of a branch of mathematics known as measure theory. Measure theory is beyond the scope of this textbook, but we can briefly sketch some of the issues that measure theory is employed to resolve.
In section 3.3.2, we saw that the probability of a continuous vector-valued x lying in some set S is given by the integral of p(x) over the set S. Some choices of set S can produce paradoxes. For example, it is possible to construct two sets S₁ and S₂ such that p(x ∈ S₁) + p(x ∈ S₂) > 1 but S₁ ∩ S₂ = ∅. These sets are generally constructed making very heavy use of the infinite precision of real numbers, for example, by making fractal-shaped sets or sets that are defined by transforming the set of rational numbers.¹
One of the key contributions of measure theory is to provide a characterization of the set of sets we can compute the probability of without encountering paradoxes. In this book, we integrate only over sets with relatively simple descriptions, so this aspect of measure theory never becomes a relevant concern.
For our purposes, measure theory is more useful for describing theorems that apply to most points in ℝⁿ but do not apply to some corner cases. Measure theory provides a rigorous way of describing that a set of points is negligibly small. Such a set is said to have measure zero. We do not formally define this concept in this textbook. For our purposes, it is sufficient to understand the intuition that a set of measure zero occupies no volume in the space we are measuring. For example, within ℝ², a line has measure zero, while a filled polygon has positive measure. Likewise, an individual point has measure zero. Any union of countably many sets that each have measure zero also has measure zero (so the set of all the rational numbers has measure zero, for instance).
Another useful term from measure theory is almost everywhere. A property that holds almost everywhere holds throughout all space except for on a set of measure zero. Because the exceptions occupy a negligible amount of space, they can be safely ignored for many applications. Some important results in probability theory hold for all discrete values but hold "almost everywhere" only for continuous values.

¹ The Banach-Tarski theorem provides a fun example of such sets.
Another technical detail of continuous variables relates to handling continuous random variables that are deterministic functions of one another. Suppose we have two random variables, x and y, such that y = g(x), where g is an invertible, continuous, differentiable transformation. One might expect that p_y(y) = p_x(g⁻¹(y)). This is actually not the case.

As a simple example, suppose we have two scalar random variables x and y. Suppose y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y), then p_y will be 0 everywhere except the interval [0, 1/2], and it will be 1 on this interval. This means

∫ p_y(y) dy = 1/2,    (3.43)
which violates the definition of a probability distribution. This is a common mistake. The problem with this approach is that it fails to account for the distortion of space introduced by the function g. Recall that the probability of x lying in an infinitesimally small region with volume δx is given by p(x)δx. Since g can expand or contract space, the infinitesimal volume surrounding x in x space may have a different volume in y space.
To see how to correct the problem, we return to the scalar case. We need to preserve the property

|p_y(g(x)) dy| = |p_x(x) dx|.    (3.44)

Solving from this, we obtain

p_y(y) = p_x(g⁻¹(y)) |∂x/∂y|    (3.45)

or equivalently

p_x(x) = p_y(g(x)) |∂g(x)/∂x|.    (3.46)

In higher dimensions, the derivative generalizes to the determinant of the Jacobian matrix, that is, the matrix with J_{i,j} = ∂x_i/∂y_j. Thus, for real-valued vectors x and y,

p_x(x) = p_y(g(x)) |det( ∂g(x)/∂x )|.    (3.47)
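The scalar example above can be checked numerically. The sketch below integrates the naive density p_x(2y), which ignores the Jacobian and sums to 1/2, against the corrected density p_x(2y)·|dx/dy| = 2, which sums to 1:

```python
# x ~ U(0, 1) and y = x / 2, so g^{-1}(y) = 2y and |dx/dy| = 2 on [0, 1/2].
def p_x(x):
    # Density of the uniform distribution on [0, 1].
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

n = 100000
dy = 0.5 / n
ys = [(i + 0.5) * dy for i in range(n)]  # midpoint rule on [0, 1/2]

naive = sum(p_x(2 * y) * dy for y in ys)          # ignores the Jacobian
corrected = sum(p_x(2 * y) * 2 * dy for y in ys)  # includes |dx/dy| = 2

print(round(naive, 3), round(corrected, 3))  # -> 0.5 1.0
```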
3.13 Information Theory

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. It was originally invented to study sending messages from discrete alphabets over a noisy channel, such as communication via radio transmission. In this context, information theory tells us how to design optimal codes and calculate the expected length of messages sampled from specific probability distributions using various encoding schemes. In the context of machine learning, we can also apply information theory to continuous variables where some of these message length interpretations do not apply.
This field is fundamental to many areas of electrical engineering and computer science. In this textbook, we mostly use a few key ideas from information theory to characterize probability distributions or to quantify similarity between probability distributions. For more detail on information theory, see Cover and Thomas (2006) or MacKay (2003).
The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying "the sun rose this morning" is so uninformative as to be unnecessary to send, but a message saying "there was a solar eclipse this morning" is very informative. We would like to quantify information in a way that formalizes this intuition.
• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
• Less likely events should have higher information content.
• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.

To satisfy all three of these properties, we define the self-information of an event x = x to be

I(x) = −log P(x).    (3.48)
In this book, we always use log to mean the natural logarithm, with base e. Our definition of I(x) is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.
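The relationship between the two units is a constant rescaling by log 2, which a short sketch makes concrete:

```python
import math

def self_information(p, base=math.e):
    # I(x) = -log P(x); natural log gives nats, base 2 gives bits/shannons.
    return -math.log(p) / math.log(base)

# One nat: an event of probability 1/e.
assert abs(self_information(1.0 / math.e) - 1.0) < 1e-12

# One bit: a fair coin flip, measured in base 2.
assert abs(self_information(0.5, base=2) - 1.0) < 1e-12

# Bits are a rescaling of nats by a factor of log 2.
p = 0.3
assert abs(self_information(p) - self_information(p, base=2) * math.log(2)) < 1e-12
```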
When x is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.
Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)],    (3.49)
also denoted H(P). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. See figure 3.5 for a demonstration.
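A minimal sketch of equation 3.49 for a discrete distribution, using the 0 log 0 = 0 convention, illustrates how entropy tracks how close a distribution is to uniform:

```python
import math

def entropy(probs):
    # Shannon entropy in nats, with the 0 log 0 = 0 convention.
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
deterministic = [1.0, 0.0, 0.0, 0.0]

# The uniform distribution attains the maximum, log 4 nats;
# near-deterministic distributions have low entropy.
assert abs(entropy(uniform) - math.log(4)) < 1e-12
assert entropy(peaked) < entropy(uniform)
assert entropy(deterministic) == 0.0
```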
When x is continuous, the Shannon entropy is known as the differential entropy.
If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

D_KL(P ‖ Q) = E_{x∼P}[ log( P(x) / Q(x) ) ] = E_{x∼P}[ log P(x) − log Q(x) ].    (3.50)
In the case of discrete variables, it is the extra amount of information (measured in bits if we use the base-2 logarithm, but in machine learning we usually use nats and the natural logarithm) needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.
The KL divergence has many useful properties, most notably being non-negative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal "almost everywhere" in the case of continuous variables. Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. It is not a true distance measure because it is not symmetric: D_KL(P ‖ Q) ≠ D_KL(Q ‖ P) for some P and Q. This asymmetry means that there are important consequences to the choice of whether to use D_KL(P ‖ Q) or D_KL(Q ‖ P). See figure 3.6 for more detail.
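Both the non-negativity and the asymmetry are easy to observe numerically; a minimal sketch for discrete distributions:

```python
import math

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)), in nats.
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]   # a peaked distribution
q = [0.5, 0.5]   # the uniform distribution

# Non-negative, and zero when the distributions match...
assert kl(p, p) == 0.0 and kl(p, q) > 0.0

# ...but asymmetric, so it is not a true distance measure.
assert abs(kl(p, q) - kl(q, p)) > 1e-6
```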
Figure 3.5: Shannon entropy of a binary random variable (vertical axis: Shannon entropy in nats). This plot shows how distributions that are closer to deterministic have low Shannon entropy, while distributions that are close to uniform have high Shannon entropy. On the horizontal axis, we plot p, the probability of a binary random variable being equal to 1. The entropy is given by (p − 1) log(1 − p) − p log p. When p is near 0, the distribution is nearly deterministic, because the random variable is nearly always 0. When p is near 1, the distribution is nearly deterministic, because the random variable is nearly always 1. When p = 0.5, the entropy is maximal, because the distribution is uniform over the two outcomes.
A quantity that is closely related to the KL divergence is the cross-entropy H(P, Q) = H(P) + D_KL(P ‖ Q), which is similar to the KL divergence but lacking the term on the left:

H(P, Q) = −E_{x∼P} log Q(x).    (3.51)

Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term.
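The decomposition H(P, Q) = H(P) + D_KL(P ‖ Q) can be verified term by term for any discrete pair of distributions; a short sketch:

```python
import math

def entropy(p):
    # H(P) = -sum_x P(x) log P(x), in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) = -E_{x~P} log Q(x)  (equation 3.51)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    # D_KL(P || Q), in nats.
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

# H(P, Q) = H(P) + D_KL(P || Q)
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```

Since H(P) does not depend on Q, minimizing either quantity over Q gives the same result.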
When computing many of these quantities, it is common to encounter expressions of the form 0 log 0. By convention, in the context of information theory, we treat these expressions as lim_{x→0} x log x = 0.
3.14 Structured Probabilistic Models

Machine learning algorithms often involve probability distributions over a very large number of random variables. Often, these probability distributions involve direct interactions between relatively few variables. Using a single function to describe the entire joint probability distribution can be very inefficient (both computationally and statistically).
Figure 3.6: The KL divergence is asymmetric. (Left panel title: q* = argmin_q D_KL(p ‖ q); right panel title: q* = argmin_q D_KL(q ‖ p); vertical axes: probability density.) Suppose we have a distribution p(x) and wish to approximate it with another distribution q(x). We have the choice of minimizing either D_KL(p ‖ q) or D_KL(q ‖ p). We illustrate the effect of this choice using a mixture of two Gaussians for p, and a single Gaussian for q. The choice of which direction of the KL divergence to use is problem dependent. Some applications require an approximation that usually places high probability anywhere that the true distribution places high probability, while other applications require an approximation that rarely places high probability anywhere that the true distribution places low probability. The choice of the direction of the KL divergence reflects which of these considerations takes priority for each application. (Left) The effect of minimizing D_KL(p ‖ q). In this case, we select a q that has high probability where p has high probability. When p has multiple modes, q chooses to blur the modes together, in order to put high probability mass on all of them. (Right) The effect of minimizing D_KL(q ‖ p). In this case, we select a q that has low probability where p has low probability. When p has multiple modes that are sufficiently widely separated, as in this figure, the KL divergence is minimized by choosing a single mode, to avoid putting probability mass in the low-probability areas between modes of p. Here, we illustrate the outcome when q is chosen to emphasize the left mode. We could also have achieved an equal value of the KL divergence by choosing the right mode. If the modes are not separated by a sufficiently strong low-probability region, then this direction of the KL divergence can still choose to blur the modes.
Instead of using a single function to represent a probability distribution, we can split a probability distribution into many factors that we multiply together. For example, suppose we have three random variables: a, b and c. Suppose that a influences the value of b, and b influences the value of c, but that a and c are independent given b. We can represent the probability distribution over all three variables as a product of probability distributions over two variables:

p(a, b, c) = p(a) p(b | a) p(c | b).    (3.52)
These factorizations can greatly reduce the number of parameters needed to describe the distribution. Each factor uses a number of parameters that is exponential in the number of variables in the factor. This means that we can greatly reduce the cost of representing a distribution if we are able to find a factorization into distributions over fewer variables.
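The savings can be made concrete by counting parameters for binary variables. A full joint table over n binary variables needs 2ⁿ − 1 free parameters, while a chain factorization like the one in equation 3.52, extended to n variables, needs only a handful per factor. A sketch:

```python
# Parameters needed to represent n binary variables two different ways.
n = 10

# Full joint table: one probability per configuration, minus one
# degree of freedom because the probabilities must sum to 1.
full_joint = 2 ** n - 1

# Chain factorization p(x1) p(x2 | x1) ... p(xn | x_{n-1}):
# p(x1) needs 1 parameter; each conditional needs 2 (one per parent value).
chain = 1 + 2 * (n - 1)

print(full_joint, chain)  # -> 1023 19
```

The gap widens exponentially as n grows, which is the whole point of factorized representations.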
We can describe these kinds of factorizations using graphs. Here, we use the word "graph" in the sense of graph theory: a set of vertices that may be connected to each other with edges. When we represent the factorization of a probability distribution with a graph, we call it a structured probabilistic model, or graphical model.
There are two main kinds of structured probabilistic models: directed and undirected. Both kinds of graphical models use a graph G in which each node in the graph corresponds to a random variable, and an edge connecting two random variables means that the probability distribution is able to represent direct interactions between those two random variables.
Directed models use graphs with directed edges, and they represent factorizations into conditional probability distributions, as in the example above. Specifically, a directed model contains one factor for every random variable x_i in the distribution, and that factor consists of the conditional distribution over x_i given the parents of x_i, denoted Pa_G(x_i):

p(x) = ∏_i p( x_i | Pa_G(x_i) ).    (3.53)
See figure 3.7 for an example of a directed graph and the factorization of probability distributions it represents.
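One practical use of a directed factorization is drawing samples variable by variable, each conditioned on its parents (a procedure known as ancestral sampling). The sketch below uses invented conditional probability tables for the three-variable chain p(a, b, c) = p(a) p(b | a) p(c | b) from equation 3.52:

```python
import random

random.seed(0)

# Hypothetical conditional probability tables for binary a, b, c.
p_a1 = 0.6                        # P(a = 1)
p_b1_given_a = {0: 0.2, 1: 0.8}   # P(b = 1 | a)
p_c1_given_b = {0: 0.3, 1: 0.9}   # P(c = 1 | b)

def sample():
    # Ancestral sampling: draw each variable given its parents, in order.
    a = int(random.random() < p_a1)
    b = int(random.random() < p_b1_given_a[a])
    c = int(random.random() < p_c1_given_b[b])
    return a, b, c

samples = [sample() for _ in range(100000)]
freq_a = sum(s[0] for s in samples) / len(samples)
print(round(freq_a, 2))  # close to P(a = 1) = 0.6
```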
Undirected models use graphs with undirected edges, and they represent factorizations into a set of functions; unlike in the directed case, these functions are usually not probability distributions of any kind. Any set of nodes that are all connected to each other in G is called a clique. Each clique C⁽ⁱ⁾ in an undirected model is associated with a factor φ⁽ⁱ⁾(C⁽ⁱ⁾). These factors are just functions, not
not
75
CHAPTER
3.
PROBABILITY
AND
INF
ORMA
TION
THEOR
Figure
3.7:
directed
graphical
mo
del
ov
er
random
ariables
and
This
graph
corresp
onds
to
probability
distributions
that
can
factored
as
) =
(3.54)
This
graphical
mo
del
enables
us
to
quickly
see
some
prop
erties
of
the
distribution.
or
example,
and
interact
directly
but
and
interact
only
indirectly
via
c.
probabilit
distributions. The
output of each factor must be non-negative, but there is no constraint that the factor must sum or integrate to 1 like a probability distribution.

The probability of a configuration of random variables is proportional to the product of all these factors: assignments that result in larger factor values are more likely. Of course, there is no guarantee that this product will sum to 1. We therefore divide by a normalizing constant Z, defined to be the sum or integral over all states of the product of the φ functions, in order to obtain a normalized probability distribution:

p(x) = (1/Z) ∏_i φ⁽ⁱ⁾( C⁽ⁱ⁾ ).    (3.55)
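For small discrete models, Z can be computed by brute-force enumeration of all states. The factors below are invented for illustration; the point is only that dividing the factor product by Z yields a distribution that sums to 1, as equation 3.55 requires:

```python
from itertools import product

# Hypothetical non-negative factors over binary variables.
def phi_ab(a, b):
    return 2.0 if a == b else 1.0   # favors a == b

def phi_bc(b, c):
    return 3.0 if b != c else 1.0   # favors b != c

def p_tilde(a, b, c):
    # Unnormalized product of factors.
    return phi_ab(a, b) * phi_bc(b, c)

# Z: the sum of the factor product over all states (equation 3.55).
Z = sum(p_tilde(a, b, c) for a, b, c in product([0, 1], repeat=3))

def p(a, b, c):
    return p_tilde(a, b, c) / Z

# The normalized distribution sums to 1.
total = sum(p(a, b, c) for a, b, c in product([0, 1], repeat=3))
assert abs(total - 1.0) < 1e-12
```

For large models this enumeration is intractable, which is one reason undirected models are harder to work with than directed ones.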
See figure 3.8 for an example of an undirected graph and the factorization of probability distributions it represents.
Keep in mind that these graphical representations of factorizations are a language for describing probability distributions. They are not mutually exclusive families of probability distributions. Being directed or undirected is not a property of a probability distribution; it is a property of a particular description of a probability distribution, but any probability distribution may be described in both ways.
Figure 3.8: An undirected graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

p(a, b, c, d, e) = (1/Z) φ⁽¹⁾(a, b, c) φ⁽²⁾(b, d) φ⁽³⁾(c, e).    (3.56)

This graphical model enables us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.

Throughout parts I and II of this book, we use structured probabilistic models
merely as a language to describe which direct probabilistic relationships different machine learning algorithms choose to represent. No further understanding of structured probabilistic models is needed until the discussion of research topics, in part III, where we explore structured probabilistic models in much greater detail.
This chapter has reviewed the basic concepts of probability theory that are most relevant to deep learning. One more set of fundamental mathematical tools remains: numerical methods.