Chapter 2

Linear Algebra
Linear algebra is a branch of mathematics that is widely used throughout science and engineering. Yet because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms. We therefore precede our introduction to deep learning with a focused presentation of the key linear algebra prerequisites.

If you are already familiar with linear algebra, feel free to skip this chapter. If you have previous experience with these concepts but need a detailed reference sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen, 2006). If you have had no exposure at all to linear algebra, this chapter will teach you enough to read this book, but we highly recommend that you also consult another resource focused exclusively on teaching linear algebra, such as Shilov (1977). This chapter completely omits many important linear algebra topics that are not essential for understanding deep learning.
2.1 Scalars, Vectors, Matrices and Tensors
The study of linear algebra involves several types of mathematical objects:
- Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics. We usually give scalars lowercase variable names. When we introduce them, we specify what kind of number they are. For example, we might say "Let $s \in \mathbb{R}$ be the slope of the line," while defining a real-valued scalar, or "Let $n \in \mathbb{N}$ be the number of units," while defining a natural number scalar.
- Vectors: A vector is an array of numbers. The numbers are arranged in order. We can identify each individual number by its index in that ordering. Typically we give vectors lowercase names in bold typeface, such as $\mathbf{x}$. The elements of the vector are identified by writing its name in italic typeface, with a subscript. The first element of $\mathbf{x}$ is $x_1$, the second element is $x_2$, and so on. We also need to say what kind of numbers are stored in the vector. If each element is in $\mathbb{R}$, and the vector has $n$ elements, then the vector lies in the set formed by taking the Cartesian product of $\mathbb{R}$ $n$ times, denoted as $\mathbb{R}^n$. When we need to explicitly identify the elements of a vector, we write them as a column enclosed in square brackets:

  $$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}. \tag{2.1}$$

  We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.

  Sometimes we need to index a set of elements of a vector. In this case, we define a set containing the indices and write the set as a subscript. For example, to access $x_1$, $x_3$ and $x_6$, we define the set $S = \{1, 3, 6\}$ and write $\mathbf{x}_S$. We use the $-$ sign to index the complement of a set. For example, $\mathbf{x}_{-1}$ is the vector containing all elements of $\mathbf{x}$ except for $x_1$, and $\mathbf{x}_{-S}$ is the vector containing all elements of $\mathbf{x}$ except for $x_1$, $x_3$ and $x_6$.
- Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one. We usually give matrices uppercase variable names with bold typeface, such as $\mathbf{A}$. If a real-valued matrix $\mathbf{A}$ has a height of $m$ and a width of $n$, then we say that $\mathbf{A} \in \mathbb{R}^{m \times n}$. We usually identify the elements of a matrix using its name in italic but not bold font, and the indices are listed with separating commas. For example, $A_{1,1}$ is the upper left entry of $\mathbf{A}$ and $A_{m,n}$ is the bottom right entry. We can identify all the numbers with vertical coordinate $i$ by writing a ":" for the horizontal coordinate. For example, $\mathbf{A}_{i,:}$ denotes the horizontal cross section of $\mathbf{A}$ with vertical coordinate $i$. This is known as the $i$-th row of $\mathbf{A}$. Likewise, $\mathbf{A}_{:,i}$ is the $i$-th column of $\mathbf{A}$. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:

  $$\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}. \tag{2.2}$$

  Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we use subscripts after the expression but do not convert anything to lowercase. For example, $f(\mathbf{A})_{i,j}$ gives element $(i, j)$ of the matrix computed by applying the function $f$ to $\mathbf{A}$.
- Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named "A" with this typeface: $\mathsf{A}$. We identify the element of $\mathsf{A}$ at coordinates $(i, j, k)$ by writing $\mathsf{A}_{i,j,k}$.
One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. See figure 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix $\mathbf{A}$ as $\mathbf{A}^\top$, and it is defined such that

$$(\mathbf{A}^\top)_{i,j} = A_{j,i}. \tag{2.3}$$
Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a matrix with only one row. Sometimes we define a vector by writing out its elements in the text inline as a row matrix, then using the transpose operator to turn it into a standard column vector, for example $\mathbf{x} = [x_1, x_2, x_3]^\top$.

[Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.]

A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own transpose: $a = a^\top$.
We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: $\mathbf{C} = \mathbf{A} + \mathbf{B}$, where $C_{i,j} = A_{i,j} + B_{i,j}$.

We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of a matrix: $\mathbf{D} = a \cdot \mathbf{B} + c$, where $D_{i,j} = a \cdot B_{i,j} + c$.
In the context of deep learning, we also use some less conventional notation. We allow the addition of a matrix and a vector, yielding another matrix: $\mathbf{C} = \mathbf{A} + \mathbf{b}$, where $C_{i,j} = A_{i,j} + b_j$. In other words, the vector $\mathbf{b}$ is added to each row of the matrix. This shorthand eliminates the need to define a matrix with $\mathbf{b}$ copied into each row before doing the addition. This implicit copying of $\mathbf{b}$ to many locations is called broadcasting.
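Broadcasting is easy to see in a few lines of NumPy; the following is a small illustrative sketch (the array values are arbitrary):

```python
import numpy as np

# Broadcasting: add a vector b to each row of a matrix A,
# i.e. C[i, j] = A[i, j] + b[j], without explicitly tiling b.
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])   # shape (2, 3)
b = np.array([10., 20., 30.])  # shape (3,)

C = A + b  # b is implicitly copied into each row
print(C)
# [[11. 22. 33.]
#  [14. 25. 36.]]
```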
2.2 Multiplying Matrices and Vectors
One of the most important operations involving matrices is multiplication of two matrices. The matrix product of matrices $\mathbf{A}$ and $\mathbf{B}$ is a third matrix $\mathbf{C}$. In order for this product to be defined, $\mathbf{A}$ must have the same number of columns as $\mathbf{B}$ has rows. If $\mathbf{A}$ is of shape $m \times n$ and $\mathbf{B}$ is of shape $n \times p$, then $\mathbf{C}$ is of shape $m \times p$. We can write the matrix product just by placing two or more matrices together, for example,

$$\mathbf{C} = \mathbf{A}\mathbf{B}. \tag{2.4}$$

The product operation is defined by

$$C_{i,j} = \sum_k A_{i,k} B_{k,j}. \tag{2.5}$$
Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise product, or Hadamard product, and is denoted as $\mathbf{A} \odot \mathbf{B}$.

The dot product between two vectors $\mathbf{x}$ and $\mathbf{y}$ of the same dimensionality is the matrix product $\mathbf{x}^\top \mathbf{y}$. We can think of the matrix product $\mathbf{C} = \mathbf{A}\mathbf{B}$ as computing $C_{i,j}$ as the dot product between row $i$ of $\mathbf{A}$ and column $j$ of $\mathbf{B}$.
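These three products are easy to distinguish numerically. A brief NumPy sketch (with arbitrary example values):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

C = A @ B   # matrix product: C[i, j] = sum_k A[i, k] * B[k, j]
H = A * B   # element-wise (Hadamard) product

x = np.array([1., 2.])
y = np.array([3., 4.])
dot = x @ y  # dot product x^T y, a scalar

# C[i, j] equals the dot product of row i of A and column j of B:
assert np.allclose(C[0, 1], A[0, :] @ B[:, 1])
```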
Matrix product operations have many useful properties that make mathematical analysis of matrices more convenient. For example, matrix multiplication is distributive:

$$\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}. \tag{2.6}$$

It is also associative:

$$\mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{A}\mathbf{B})\mathbf{C}. \tag{2.7}$$
Matrix multiplication is not commutative (the condition $\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A}$ does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative:

$$\mathbf{x}^\top \mathbf{y} = \mathbf{y}^\top \mathbf{x}. \tag{2.8}$$

The transpose of a matrix product has a simple form:

$$(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top \mathbf{A}^\top. \tag{2.9}$$

This enables us to demonstrate equation 2.8 by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose:

$$\mathbf{x}^\top \mathbf{y} = \left(\mathbf{x}^\top \mathbf{y}\right)^\top = \mathbf{y}^\top \mathbf{x}. \tag{2.10}$$
Since the focus of this textbook is not linear algebra, we do not attempt to develop a comprehensive list of useful properties of the matrix product here, but the reader should be aware that many more exist.
We now know enough linear algebra notation to write down a system of linear equations:

$$\mathbf{A}\mathbf{x} = \mathbf{b}, \tag{2.11}$$
where $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a known matrix, $\mathbf{b} \in \mathbb{R}^m$ is a known vector, and $\mathbf{x} \in \mathbb{R}^n$ is a vector of unknown variables we would like to solve for. Each element $x_i$ of $\mathbf{x}$ is one of these unknown variables. Each row of $\mathbf{A}$ and each element of $\mathbf{b}$ provide another constraint. We can rewrite equation 2.11 as

$$\mathbf{A}_{1,:}\mathbf{x} = b_1 \tag{2.12}$$
$$\mathbf{A}_{2,:}\mathbf{x} = b_2 \tag{2.13}$$
$$\dots \tag{2.14}$$
$$\mathbf{A}_{m,:}\mathbf{x} = b_m \tag{2.15}$$

or even more explicitly as

$$A_{1,1}x_1 + A_{1,2}x_2 + \dots + A_{1,n}x_n = b_1 \tag{2.16}$$
$$A_{2,1}x_1 + A_{2,2}x_2 + \dots + A_{2,n}x_n = b_2 \tag{2.17}$$
$$\dots \tag{2.18}$$
$$A_{m,1}x_1 + A_{m,2}x_2 + \dots + A_{m,n}x_n = b_m. \tag{2.19}$$

Matrix-vector product notation provides a more compact representation for equations of this form.
2.3 Identity and Inverse Matrices
Linear algebra offers a powerful tool called matrix inversion that enables us to analytically solve equation 2.11 for many values of $\mathbf{A}$.

To describe matrix inversion, we first need to define the concept of an identity matrix. An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves $n$-dimensional vectors as $\mathbf{I}_n$. Formally, $\mathbf{I}_n \in \mathbb{R}^{n \times n}$, and

$$\forall \mathbf{x} \in \mathbb{R}^n, \ \mathbf{I}_n \mathbf{x} = \mathbf{x}. \tag{2.20}$$

The structure of the identity matrix is simple: all the entries along the main diagonal are 1, while all the other entries are zero. See figure 2.2 for an example.
The matrix inverse of $\mathbf{A}$ is denoted as $\mathbf{A}^{-1}$, and it is defined as the matrix such that

$$\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n. \tag{2.21}$$

We can now solve equation 2.11 using the following steps:

$$\mathbf{A}\mathbf{x} = \mathbf{b} \tag{2.22}$$
$$\mathbf{A}^{-1}\mathbf{A}\mathbf{x} = \mathbf{A}^{-1}\mathbf{b} \tag{2.23}$$
$$\mathbf{I}_n \mathbf{x} = \mathbf{A}^{-1}\mathbf{b} \tag{2.24}$$
$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}. \tag{2.25}$$

[Figure 2.2: Example identity matrix: This is $\mathbf{I}_3$.]
Of course, this process depends on it being possible to find $\mathbf{A}^{-1}$. We discuss the conditions for the existence of $\mathbf{A}^{-1}$ in the following section.

When $\mathbf{A}^{-1}$ exists, several different algorithms can find it in closed form. In theory, the same inverse matrix can then be used to solve the equation many times for different values of $\mathbf{b}$. $\mathbf{A}^{-1}$ is primarily useful as a theoretical tool, however, and should not actually be used in practice for most software applications. Because $\mathbf{A}^{-1}$ can be represented with only limited precision on a digital computer, algorithms that make use of the value of $\mathbf{b}$ can usually obtain more accurate estimates of $\mathbf{x}$.
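To make this practical advice concrete, the following NumPy sketch contrasts the textbook route through $\mathbf{A}^{-1}$ with a direct solver, which is what libraries use in practice (the matrix values are arbitrary):

```python
import numpy as np

A = np.array([[3., 1.], [1., 2.]])
b = np.array([9., 8.])

x_inv = np.linalg.inv(A) @ b      # the textbook route: x = A^{-1} b
x_solve = np.linalg.solve(A, b)   # preferred in practice: more stable

assert np.allclose(x_inv, x_solve)
assert np.allclose(A @ x_solve, b)
```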
2.4 Linear Dependence and Span
For $\mathbf{A}^{-1}$ to exist, equation 2.11 must have exactly one solution for every value of $\mathbf{b}$. It is also possible for the system of equations to have no solutions or infinitely many solutions for some values of $\mathbf{b}$. It is not possible, however, to have more than one but less than infinitely many solutions for a particular $\mathbf{b}$; if both $\mathbf{x}$ and $\mathbf{y}$ are solutions, then

$$\mathbf{z} = \alpha \mathbf{x} + (1 - \alpha)\mathbf{y} \tag{2.26}$$

is also a solution for any real $\alpha$.
To analyze how many solutions the equation has, think of the columns of $\mathbf{A}$ as specifying different directions we can travel in from the origin (the point specified by the vector of all zeros), then determine how many ways there are of reaching $\mathbf{b}$. In this view, each element of $\mathbf{x}$ specifies how far we should travel in each of these directions, with $x_i$ specifying how far to move in the direction of column $i$:

$$\mathbf{A}\mathbf{x} = \sum_i x_i \mathbf{A}_{:,i}. \tag{2.27}$$
In general, this kind of operation is called a linear combination. Formally, a linear combination of some set of vectors $\{\mathbf{v}^{(1)}, \dots, \mathbf{v}^{(n)}\}$ is given by multiplying each vector $\mathbf{v}^{(i)}$ by a corresponding scalar coefficient and adding the results:

$$\sum_i c_i \mathbf{v}^{(i)}. \tag{2.28}$$
The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors.

Determining whether $\mathbf{A}\mathbf{x} = \mathbf{b}$ has a solution thus amounts to testing whether $\mathbf{b}$ is in the span of the columns of $\mathbf{A}$. This particular span is known as the column space, or the range, of $\mathbf{A}$.
In order for the system $\mathbf{A}\mathbf{x} = \mathbf{b}$ to have a solution for all values of $\mathbf{b} \in \mathbb{R}^m$, we therefore require that the column space of $\mathbf{A}$ be all of $\mathbb{R}^m$. If any point in $\mathbb{R}^m$ is excluded from the column space, that point is a potential value of $\mathbf{b}$ that has no solution. The requirement that the column space of $\mathbf{A}$ be all of $\mathbb{R}^m$ implies immediately that $\mathbf{A}$ must have at least $m$ columns, that is, $n \ge m$. Otherwise, the dimensionality of the column space would be less than $m$. For example, consider a $3 \times 2$ matrix. The target $\mathbf{b}$ is 3-D, but $\mathbf{x}$ is only 2-D, so modifying the value of $\mathbf{x}$ at best enables us to trace out a 2-D plane within $\mathbb{R}^3$. The equation has a solution if and only if $\mathbf{b}$ lies on that plane.
Having $n \ge m$ is only a necessary condition for every point to have a solution. It is not a sufficient condition, because it is possible for some of the columns to be redundant. Consider a $2 \times 2$ matrix where both of the columns are identical. This has the same column space as the $2 \times 1$ matrix containing only one copy of the replicated column. In other words, the column space is still just a line and fails to encompass all of $\mathbb{R}^2$, even though there are two columns.
Formally, this kind of redundancy is known as linear dependence. A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors. If we add a vector to a set that is a linear combination of the other vectors in the set, the new vector does not add any points to the set's span. This means that for the column space of the matrix to encompass all of $\mathbb{R}^m$, the matrix must contain at least one set of $m$ linearly independent columns. This condition is both necessary and sufficient for equation 2.11 to have a solution for every value of $\mathbf{b}$.
Note that the requirement is for a set to have exactly $m$ linearly independent columns, not at least $m$. No set of $m$-dimensional vectors can have more than $m$ mutually linearly independent columns, but a matrix with more than $m$ columns may have more than one such set.
For the matrix to have an inverse, we additionally need to ensure that equation 2.11 has at most one solution for each value of $\mathbf{b}$. To do so, we need to make certain that the matrix has at most $m$ columns. Otherwise there is more than one way of parametrizing each solution.
Together, this means that the matrix must be square, that is, we require that $m = n$ and that all the columns be linearly independent. A square matrix with linearly dependent columns is known as singular.
If $\mathbf{A}$ is not square, or is square but singular, solving the equation is still possible, but we cannot use the method of matrix inversion to find the solution.
So far we have discussed matrix inverses as being multiplied on the left. It is also possible to define an inverse that is multiplied on the right:

$$\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}. \tag{2.29}$$

For square matrices, the left inverse and right inverse are equal.
2.5 Norms
Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm. Formally, the $L^p$ norm is given by

$$\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{\frac{1}{p}} \tag{2.30}$$

for $p \in \mathbb{R}$, $p \ge 1$.
Norms, including the $L^p$ norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector $\mathbf{x}$ measures the distance from the origin to the point $\mathbf{x}$. More rigorously, a norm is any function $f$ that satisfies the following properties:

- $f(\mathbf{x}) = 0 \Rightarrow \mathbf{x} = \mathbf{0}$
- $f(\mathbf{x} + \mathbf{y}) \le f(\mathbf{x}) + f(\mathbf{y})$ (the triangle inequality)
- $\forall \alpha \in \mathbb{R}, \ f(\alpha \mathbf{x}) = |\alpha| f(\mathbf{x})$
The $L^2$ norm, with $p = 2$, is known as the Euclidean norm, which is simply the Euclidean distance from the origin to the point identified by $\mathbf{x}$. The $L^2$ norm is used so frequently in machine learning that it is often denoted simply as $\|\mathbf{x}\|$, with the subscript 2 omitted. It is also common to measure the size of a vector using the squared $L^2$ norm, which can be calculated simply as $\mathbf{x}^\top \mathbf{x}$.

The squared $L^2$ norm is more convenient to work with mathematically and computationally than the $L^2$ norm itself. For example, each derivative of the squared $L^2$ norm with respect to each element of $\mathbf{x}$ depends only on the corresponding element of $\mathbf{x}$, while all the derivatives of the $L^2$ norm depend on the entire vector.
In many contexts, the squared $L^2$ norm may be undesirable because it increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but that retains mathematical simplicity: the $L^1$ norm.
The $L^1$ norm may be simplified to

$$\|\mathbf{x}\|_1 = \sum_i |x_i|. \tag{2.31}$$

The $L^1$ norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of $\mathbf{x}$ moves away from 0 by $\epsilon$, the $L^1$ norm increases by $\epsilon$.
We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the "$L^0$ norm," but this is incorrect terminology. The number of nonzero entries in a vector is not a norm, because scaling the vector by $\alpha$ does not change the number of nonzero entries. The $L^1$ norm is often used as a substitute for the number of nonzero entries.
One other norm that commonly arises in machine learning is the $L^\infty$ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector,

$$\|\mathbf{x}\|_\infty = \max_i |x_i|. \tag{2.32}$$
Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm

$$\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}, \tag{2.33}$$

which is analogous to the $L^2$ norm of a vector.
The dot product of two vectors can be rewritten in terms of norms. Specifically,

$$\mathbf{x}^\top \mathbf{y} = \|\mathbf{x}\|_2 \|\mathbf{y}\|_2 \cos\theta, \tag{2.34}$$

where $\theta$ is the angle between $\mathbf{x}$ and $\mathbf{y}$.
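All of these norms are available directly in NumPy; a minimal sketch with arbitrary values:

```python
import numpy as np

x = np.array([3., -4., 0.])

l2 = np.linalg.norm(x)            # L^2 (Euclidean) norm: 5.0
l2_sq = x @ x                     # squared L^2 norm: 25.0
l1 = np.linalg.norm(x, 1)         # L^1 norm: 7.0
linf = np.linalg.norm(x, np.inf)  # L^infinity (max) norm: 4.0
nnz = np.count_nonzero(x)         # number of nonzeros (not a norm): 2

A = np.array([[1., 2.], [3., 4.]])
fro = np.linalg.norm(A, 'fro')    # Frobenius norm of a matrix
```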
2.6 Special Kinds of Matrices and Vectors
Some special kinds of matrices and vectors are particularly useful.
Diagonal matrices consist mostly of zeros and have nonzero entries only along the main diagonal. Formally, a matrix $\mathbf{D}$ is diagonal if and only if $D_{i,j} = 0$ for all $i \ne j$. We have already seen one example of a diagonal matrix: the identity matrix, where all the diagonal entries are 1. We write $\mathrm{diag}(\mathbf{v})$ to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector $\mathbf{v}$. Diagonal matrices are of interest in part because multiplying by a diagonal matrix is computationally efficient. To compute $\mathrm{diag}(\mathbf{v})\mathbf{x}$, we only need to scale each element $x_i$ by $v_i$. In other words, $\mathrm{diag}(\mathbf{v})\mathbf{x} = \mathbf{v} \odot \mathbf{x}$.
Inverting a square diagonal matrix is also efficient. The inverse exists only if every diagonal entry is nonzero, and in that case, $\mathrm{diag}(\mathbf{v})^{-1} = \mathrm{diag}([1/v_1, \dots, 1/v_n]^\top)$. In many cases, we may derive some very general machine learning algorithm in terms of arbitrary matrices but obtain a less expensive (and less descriptive) algorithm by restricting some matrices to be diagonal.
Not all diagonal matrices need be square. It is possible to construct a rectangular diagonal matrix. Nonsquare diagonal matrices do not have inverses, but we can still multiply by them cheaply. For a nonsquare diagonal matrix $\mathbf{D}$, the product $\mathbf{D}\mathbf{x}$ will involve scaling each element of $\mathbf{x}$ and either concatenating some zeros to the result, if $\mathbf{D}$ is taller than it is wide, or discarding some of the last elements of the vector, if $\mathbf{D}$ is wider than it is tall.
A symmetric matrix is any matrix that is equal to its own transpose:

$$\mathbf{A} = \mathbf{A}^\top. \tag{2.35}$$

Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, if $\mathbf{A}$ is a matrix of distance measurements, with $A_{i,j}$ giving the distance from point $i$ to point $j$, then $A_{i,j} = A_{j,i}$ because distance functions are symmetric.
A unit vector is a vector with unit norm:

$$\|\mathbf{x}\|_2 = 1. \tag{2.36}$$
A vector $\mathbf{x}$ and a vector $\mathbf{y}$ are orthogonal to each other if $\mathbf{x}^\top \mathbf{y} = 0$. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In $\mathbb{R}^n$, at most $n$ vectors may be mutually orthogonal with nonzero norm. If the vectors not only are orthogonal but also have unit norm, we call them orthonormal.
An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

$$\mathbf{A}^\top \mathbf{A} = \mathbf{A}\mathbf{A}^\top = \mathbf{I}. \tag{2.37}$$

This implies that

$$\mathbf{A}^{-1} = \mathbf{A}^\top, \tag{2.38}$$

so orthogonal matrices are of interest because their inverse is very cheap to compute.
Pay careful attention to the definition of orthogonal matrices. Counterintuitively, their rows are not merely orthogonal but fully orthonormal. There is no special term for a matrix whose rows or columns are orthogonal but not orthonormal.
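A short NumPy sketch of these special matrices (the rotation matrix is a standard example of an orthogonal matrix; all values here are arbitrary):

```python
import numpy as np

v = np.array([1., 2., 3.])
D = np.diag(v)                    # square diagonal matrix diag(v)
x = np.array([4., 5., 6.])
assert np.allclose(D @ x, v * x)  # diag(v) x = v (elementwise) x
assert np.allclose(np.linalg.inv(D), np.diag(1. / v))

theta = 0.3  # a 2-D rotation matrix Q is orthogonal
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(Q.T @ Q, np.eye(2))  # Q^T Q = I, so Q^{-1} = Q^T
```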
2.7 Eigendecomposition
Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them.
For example, integers can be decomposed into prime factors. The way we represent the number 12 will change depending on whether we write it in base ten or in binary, but it will always be true that $12 = 2 \times 2 \times 3$. From this representation we can conclude useful properties, for example, that 12 is not divisible by 5, and that any integer multiple of 12 will be divisible by 3.
Much as we can discover something about the true nature of an integer by decomposing it into prime factors, we can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.
One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.
An eigenvector of a square matrix $\mathbf{A}$ is a nonzero vector $\mathbf{v}$ such that multiplication by $\mathbf{A}$ alters only the scale of $\mathbf{v}$:

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}. \tag{2.39}$$
The scalar $\lambda$ is known as the eigenvalue corresponding to this eigenvector. (One can also find a left eigenvector such that $\mathbf{v}^\top \mathbf{A} = \lambda \mathbf{v}^\top$, but we are usually concerned with right eigenvectors.)
If $\mathbf{v}$ is an eigenvector of $\mathbf{A}$, then so is any rescaled vector $s\mathbf{v}$ for $s \in \mathbb{R}$, $s \ne 0$. Moreover, $s\mathbf{v}$ still has the same eigenvalue. For this reason, we usually look only for unit eigenvectors.
Suppose that a matrix $\mathbf{A}$ has $n$ linearly independent eigenvectors $\{\mathbf{v}^{(1)}, \dots, \mathbf{v}^{(n)}\}$ with corresponding eigenvalues $\{\lambda_1, \dots, \lambda_n\}$. We may concatenate all the eigenvectors to form a matrix $\mathbf{V}$ with one eigenvector per column: $\mathbf{V} = [\mathbf{v}^{(1)}, \dots, \mathbf{v}^{(n)}]$. Likewise, we can concatenate the eigenvalues to form a vector $\boldsymbol{\lambda} = [\lambda_1, \dots, \lambda_n]^\top$. The eigendecomposition of $\mathbf{A}$ is then given by

$$\mathbf{A} = \mathbf{V} \, \mathrm{diag}(\boldsymbol{\lambda}) \, \mathbf{V}^{-1}. \tag{2.40}$$
We have seen that constructing matrices with specific eigenvalues and eigenvectors enables us to stretch space in desired directions. Yet we often want to decompose matrices into their eigenvalues and eigenvectors. Doing so can help us analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.
Not every matrix can be decomposed into eigenvalues and eigenvectors. In some cases, the decomposition exists but involves complex rather than real numbers. Fortunately, in this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

$$\mathbf{A} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^\top, \tag{2.41}$$

where $\mathbf{Q}$ is an orthogonal matrix composed of eigenvectors of $\mathbf{A}$, and $\boldsymbol{\Lambda}$ is a diagonal matrix. The eigenvalue $\Lambda_{i,i}$ is associated with the eigenvector in column $i$ of $\mathbf{Q}$, denoted as $\mathbf{Q}_{:,i}$. Because $\mathbf{Q}$ is an orthogonal matrix, we can think of $\mathbf{A}$ as scaling space by $\lambda_i$ in direction $\mathbf{v}^{(i)}$. See figure 2.3 for an example.

[Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix $\mathbf{A}$ with two orthonormal eigenvectors, $\mathbf{v}^{(1)}$ with eigenvalue $\lambda_1$ and $\mathbf{v}^{(2)}$ with eigenvalue $\lambda_2$. (Left) We plot the set of all unit vectors $\mathbf{u} \in \mathbb{R}^2$ as a unit circle. (Right) We plot the set of all points $\mathbf{A}\mathbf{u}$. By observing the way that $\mathbf{A}$ distorts the unit circle, we can see that it scales space in direction $\mathbf{v}^{(i)}$ by $\lambda_i$.]
While any real symmetric matrix $\mathbf{A}$ is guaranteed to have an eigendecomposition, the eigendecomposition may not be unique. If any two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors with that eigenvalue, and we could equivalently choose a $\mathbf{Q}$ using those eigenvectors instead. By convention, we usually sort the entries of $\boldsymbol{\Lambda}$ in descending order. Under this convention, the eigendecomposition is unique only if all the eigenvalues are unique.
The eigendecomposition of a matrix tells us many useful facts about the matrix. The matrix is singular if and only if any of the eigenvalues are zero.
The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x}$ subject to $\|\mathbf{x}\|_2 = 1$. Whenever $\mathbf{x}$ is equal to an eigenvector of $\mathbf{A}$, $f$ takes on the value of the corresponding eigenvalue. The maximum value of $f$ within the constraint region is the maximum eigenvalue, and its minimum value within the constraint region is the minimum eigenvalue.
A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero valued, it is negative semidefinite. Positive semidefinite matrices are interesting because they guarantee that $\forall \mathbf{x}, \ \mathbf{x}^\top \mathbf{A} \mathbf{x} \ge 0$. Positive definite matrices additionally guarantee that $\mathbf{x}^\top \mathbf{A} \mathbf{x} = 0 \Rightarrow \mathbf{x} = \mathbf{0}$.
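These facts can be verified numerically. A minimal NumPy sketch for a small, arbitrarily chosen real symmetric matrix:

```python
import numpy as np

# A real symmetric matrix, so A = Q diag(lam) Q^T with real eigenvalues.
A = np.array([[2., 1.], [1., 2.]])

lam, Q = np.linalg.eigh(A)  # eigh handles symmetric matrices; lam ascending
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)

# f(x) = x^T A x over unit vectors is bounded by the extreme eigenvalues;
# at the top unit eigenvector, f attains the largest eigenvalue.
x = Q[:, -1]
assert np.isclose(x @ A @ x, lam[-1])

assert np.all(lam > 0)  # all eigenvalues positive: A is positive definite
```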
2.8 Singular Value Decomposition
In section 2.7, we saw how to decompose a matrix into eigenvectors and eigenvalues. The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD enables us to discover some of the same kind of information as the eigendecomposition reveals; however, the SVD is more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.
Recall that the eigendecomposition involves analyzing a matrix $\mathbf{A}$ to discover a matrix $\mathbf{V}$ of eigenvectors and a vector of eigenvalues $\boldsymbol{\lambda}$ such that we can rewrite $\mathbf{A}$ as

$$\mathbf{A} = \mathbf{V} \, \mathrm{diag}(\boldsymbol{\lambda}) \, \mathbf{V}^{-1}. \tag{2.42}$$
The singular value decomposition is similar, except this time we will write $\mathbf{A}$ as a product of three matrices:

$$\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^\top. \tag{2.43}$$
Suppose that $\mathbf{A}$ is an $m \times n$ matrix. Then $\mathbf{U}$ is defined to be an $m \times m$ matrix, $\mathbf{D}$ to be an $m \times n$ matrix, and $\mathbf{V}$ to be an $n \times n$ matrix. Each of these matrices is defined to have a special structure. The matrices $\mathbf{U}$ and $\mathbf{V}$ are both defined to be orthogonal matrices. The matrix $\mathbf{D}$ is defined to be a diagonal matrix. Note that $\mathbf{D}$ is not necessarily square.
The elements along the diagonal of $\mathbf{D}$ are known as the singular values of the matrix $\mathbf{A}$. The columns of $\mathbf{U}$ are known as the left-singular vectors. The columns of $\mathbf{V}$ are known as the right-singular vectors.
We can actually interpret the singular value decomposition of $\mathbf{A}$ in terms of the eigendecomposition of functions of $\mathbf{A}$. The left-singular vectors of $\mathbf{A}$ are the eigenvectors of $\mathbf{A}\mathbf{A}^\top$. The right-singular vectors of $\mathbf{A}$ are the eigenvectors of $\mathbf{A}^\top \mathbf{A}$. The nonzero singular values of $\mathbf{A}$ are the square roots of the eigenvalues of $\mathbf{A}^\top \mathbf{A}$. The same is true for $\mathbf{A}\mathbf{A}^\top$.
Perhaps the most useful feature of the SVD is that we can use it to partially generalize matrix inversion to nonsquare matrices, as we will see in the next section.
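The decomposition and its relation to the eigendecomposition can be checked with NumPy. Note that NumPy returns the singular values as a vector rather than as the rectangular matrix $\mathbf{D}$; the sketch below (with arbitrary values) rebuilds $\mathbf{D}$ explicitly:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])   # a 2 x 3 (nonsquare) matrix

U, s, Vt = np.linalg.svd(A)    # s holds the singular values
D = np.zeros_like(A)           # rebuild the 2 x 3 diagonal matrix D
np.fill_diagonal(D, s)
assert np.allclose(U @ D @ Vt, A)

# Nonzero singular values are square roots of the eigenvalues of A A^T:
eigvals = np.linalg.eigvalsh(A @ A.T)
assert np.allclose(np.sort(s**2), np.sort(eigvals))
```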
2.9 The Moore-Penrose Pseudoinverse
Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse $\mathbf{B}$ of a matrix $\mathbf{A}$ so that we can solve a linear equation

$$\mathbf{A}\mathbf{x} = \mathbf{y} \tag{2.44}$$

by left-multiplying each side to obtain

$$\mathbf{x} = \mathbf{B}\mathbf{y}. \tag{2.45}$$
Depending on the structure of the problem, it may not be possible to design a unique mapping from $\mathbf{A}$ to $\mathbf{B}$. If $\mathbf{A}$ is taller than it is wide, then it is possible for this equation to have no solution. If $\mathbf{A}$ is wider than it is tall, then there could be multiple possible solutions.
The Moore-Penrose pseudoinverse enables us to make some headway in these cases. The pseudoinverse of $\mathbf{A}$ is defined as the matrix

$$\mathbf{A}^+ = \lim_{\alpha \searrow 0} \left(\mathbf{A}^\top \mathbf{A} + \alpha \mathbf{I}\right)^{-1} \mathbf{A}^\top. \tag{2.46}$$
Practical algorithms for computing the pseudoinverse are based not on this definition, but rather on the formula

$$\mathbf{A}^+ = \mathbf{V}\mathbf{D}^+\mathbf{U}^\top, \tag{2.47}$$

where $\mathbf{U}$, $\mathbf{D}$ and $\mathbf{V}$ are the singular value decomposition of $\mathbf{A}$, and the pseudoinverse $\mathbf{D}^+$ of a diagonal matrix $\mathbf{D}$ is obtained by taking the reciprocal of its nonzero elements then taking the transpose of the resulting matrix.
When $\mathbf{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\mathbf{x} = \mathbf{A}^+\mathbf{y}$ with minimal Euclidean norm $\|\mathbf{x}\|_2$ among all possible solutions.
When $\mathbf{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A}\mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2$.
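A short NumPy sketch (arbitrary values) of the tall-matrix case, checking that the pseudoinverse agrees with a least-squares solver:

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])     # taller than wide: Ax = y may have no solution
y = np.array([1., 0., 1.])

A_pinv = np.linalg.pinv(A)   # computed via the SVD, as in equation 2.47
x = A_pinv @ y               # minimizes ||Ax - y||_2

x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(x, x_lstsq)
```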
2.10 The Trace Operator
The trace operator gives the sum of all the diagonal entries of a matrix:

$$\mathrm{Tr}(\mathbf{A}) = \sum_i A_{i,i}. \tag{2.48}$$
The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using matrix products and the trace operator. For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

$$\|\mathbf{A}\|_F = \sqrt{\mathrm{Tr}(\mathbf{A}\mathbf{A}^\top)}. \tag{2.49}$$
Writing an expression in terms of the trace operator opens up opportunities to manipulate the expression using many useful identities. For example, the trace operator is invariant to the transpose operator:

$$\mathrm{Tr}(\mathbf{A}) = \mathrm{Tr}(\mathbf{A}^\top). \tag{2.50}$$
The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:

$$\mathrm{Tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \mathrm{Tr}(\mathbf{C}\mathbf{A}\mathbf{B}) = \mathrm{Tr}(\mathbf{B}\mathbf{C}\mathbf{A}), \tag{2.51}$$

or more generally,

$$\mathrm{Tr}\left(\prod_{i=1}^n \mathbf{F}^{(i)}\right) = \mathrm{Tr}\left(\mathbf{F}^{(n)} \prod_{i=1}^{n-1} \mathbf{F}^{(i)}\right). \tag{2.52}$$
This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times m}$, we have

$$\mathrm{Tr}(\mathbf{A}\mathbf{B}) = \mathrm{Tr}(\mathbf{B}\mathbf{A}) \tag{2.53}$$

even though $\mathbf{A}\mathbf{B} \in \mathbb{R}^{m \times m}$ and $\mathbf{B}\mathbf{A} \in \mathbb{R}^{n \times n}$.

Another useful fact to keep in mind is that a scalar is its own trace: $a = \mathrm{Tr}(a)$.
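Both identities are easy to confirm numerically; a minimal NumPy sketch with random (arbitrary) matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# Tr(AB) = Tr(BA) even though AB is 3x3 and BA is 4x4 (equation 2.53):
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# Frobenius norm via the trace (equation 2.49):
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T)))
```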
2.11 The Determinant
The determinant of a square matrix, denoted $\det(\mathbf{A})$, is a function that maps matrices to real scalars. The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all its volume. If the determinant is 1, then the transformation preserves volume.
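As a quick numerical check (arbitrary values): for a diagonal matrix, the determinant is visibly the product of the eigenvalues, and it equals the factor by which the matrix scales area.

```python
import numpy as np

A = np.array([[2., 0.], [0., 3.]])  # scales areas by 2 * 3 = 6

assert np.isclose(np.linalg.det(A), np.prod(np.linalg.eigvals(A)))
```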
2.12 Example: Principal Components Analysis
One simple machine learning algorithm, principal components analysis (PCA), can be derived using only knowledge of basic linear algebra.
Suppose we have a collection of $m$ points $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(m)}\}$ in $\mathbb{R}^n$ and we want to apply lossy compression to these points. Lossy compression means storing the points in a way that requires less memory but may lose some precision. We want to lose as little precision as possible.
One way we can encode these points is to represent a lower-dimensional version of them. For each point $\mathbf{x}^{(i)} \in \mathbb{R}^n$, we will find a corresponding code vector $\mathbf{c}^{(i)} \in \mathbb{R}^l$. If $l$ is smaller than $n$, storing the code points will take less memory than storing the original data.
We will want to find some encoding function that produces the code for an input, $f(\mathbf{x}) = \mathbf{c}$, and a decoding function that produces the reconstructed input given its code, $\mathbf{x} \approx g(f(\mathbf{x}))$.
PCA is defined by our choice of the decoding function. Specifically, to make the decoder very simple, we choose to use matrix multiplication to map the code back into $\mathbb{R}^n$. Let $g(\mathbf{c}) = \mathbf{D}\mathbf{c}$, where $\mathbf{D} \in \mathbb{R}^{n \times l}$ is the matrix defining the decoding.
Computing the optimal code for this decoder could be a difficult problem. To keep the encoding problem easy, PCA constrains the columns of $\mathbf{D}$ to be orthogonal to each other. (Note that $\mathbf{D}$ is still not technically "an orthogonal matrix" unless $l = n$.)
With the problem as described so far, many solutions are possible, because we can increase the scale of $\mathbf{D}_{:,i}$ if we decrease $c_i$ proportionally for all points. To give the problem a unique solution, we constrain all the columns of $\mathbf{D}$ to have unit norm.
In order to turn this basic idea into an algorithm we can implement, the first thing we need to do is figure out how to generate the optimal code point $\mathbf{c}^*$ for each input point $\mathbf{x}$. One way to do this is to minimize the distance between the input point $\mathbf{x}$ and its reconstruction, $g(\mathbf{c}^*)$. We can measure this distance using a norm. In the principal components algorithm, we use the $L^2$ norm:

$$\mathbf{c}^* = \arg\min_{\mathbf{c}} \|\mathbf{x} - g(\mathbf{c})\|_2. \tag{2.54}$$
We can switch to the squared $L^2$ norm instead of using the $L^2$ norm itself, because both are minimized by the same value of $\mathbf{c}$. Both are minimized by the same value of $\mathbf{c}$ because the $L^2$ norm is non-negative and the squaring operation is monotonically increasing for non-negative arguments:

$$\mathbf{c}^* = \arg\min_{\mathbf{c}} \|\mathbf{x} - g(\mathbf{c})\|_2^2. \tag{2.55}$$
The function being minimized simplifies to

$$(\mathbf{x} - g(\mathbf{c}))^\top (\mathbf{x} - g(\mathbf{c})) \tag{2.56}$$

(by the definition of the $L^2$ norm, equation 2.30)

$$= \mathbf{x}^\top \mathbf{x} - \mathbf{x}^\top g(\mathbf{c}) - g(\mathbf{c})^\top \mathbf{x} + g(\mathbf{c})^\top g(\mathbf{c}) \tag{2.57}$$

(by the distributive property)

$$= \mathbf{x}^\top \mathbf{x} - 2\mathbf{x}^\top g(\mathbf{c}) + g(\mathbf{c})^\top g(\mathbf{c}) \tag{2.58}$$

(because the scalar $g(\mathbf{c})^\top \mathbf{x}$ is equal to the transpose of itself).
We can now change the function being minimized again, to omit the first term, since this term does not depend on $\mathbf{c}$:

$$\mathbf{c}^* = \arg\min_{\mathbf{c}} -2\mathbf{x}^\top g(\mathbf{c}) + g(\mathbf{c})^\top g(\mathbf{c}). \tag{2.59}$$
To make further progress, we must substitute in the definition of $g(\mathbf{c})$:

$$\mathbf{c}^* = \arg\min_{\mathbf{c}} -2\mathbf{x}^\top \mathbf{D}\mathbf{c} + \mathbf{c}^\top \mathbf{D}^\top \mathbf{D}\mathbf{c} \tag{2.60}$$
$$= \arg\min_{\mathbf{c}} -2\mathbf{x}^\top \mathbf{D}\mathbf{c} + \mathbf{c}^\top \mathbf{I}_l \mathbf{c} \tag{2.61}$$

(by the orthogonality and unit norm constraints on $\mathbf{D}$)

$$= \arg\min_{\mathbf{c}} -2\mathbf{x}^\top \mathbf{D}\mathbf{c} + \mathbf{c}^\top \mathbf{c}. \tag{2.62}$$
We can solve this optimization problem using vector calculus (see section 4.3 if you do not know how to do this):

$$\nabla_{\mathbf{c}} \left(-2\mathbf{x}^\top \mathbf{D}\mathbf{c} + \mathbf{c}^\top \mathbf{c}\right) = \mathbf{0} \tag{2.63}$$
$$-2\mathbf{D}^\top \mathbf{x} + 2\mathbf{c} = \mathbf{0} \tag{2.64}$$
$$\mathbf{c} = \mathbf{D}^\top \mathbf{x}. \tag{2.65}$$
This makes the algorithm efficient: we can optimally encode $\mathbf{x}$ using just a matrix-vector operation. To encode a vector, we apply the encoder function

$$f(\mathbf{x}) = \mathbf{D}^\top \mathbf{x}. \tag{2.66}$$
Using a further matrix multiplication, we can also define the PCA reconstruction operation:

$$r(\mathbf{x}) = g(f(\mathbf{x})) = \mathbf{D}\mathbf{D}^\top \mathbf{x}. \tag{2.67}$$
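Equations 2.66 and 2.67 translate directly into code; a minimal NumPy sketch (the function names are ours, chosen to mirror $f$, $g$ and $r$):

```python
import numpy as np

def encode(D, x):
    """PCA encoder f(x) = D^T x (equation 2.66)."""
    return D.T @ x

def decode(D, c):
    """PCA decoder g(c) = D c."""
    return D @ c

def reconstruct(D, x):
    """PCA reconstruction r(x) = D D^T x (equation 2.67)."""
    return decode(D, encode(D, x))
```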
Next, we need to choose the encoding matrix $\mathbf{D}$. To do so, we revisit the idea of minimizing the $L^2$ distance between inputs and reconstructions. Since we will use the same matrix $\mathbf{D}$ to decode all the points, we can no longer consider the points in isolation. Instead, we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:

$$\mathbf{D}^* = \arg\min_{\mathbf{D}} \sqrt{\sum_{i,j} \left(x_j^{(i)} - r(\mathbf{x}^{(i)})_j\right)^2} \quad \text{subject to} \quad \mathbf{D}^\top \mathbf{D} = \mathbf{I}_l. \tag{2.68}$$
To derive the algorithm for finding $\mathbf{D}^*$, we start by considering the case where $l = 1$. In this case, $\mathbf{D}$ is just a single vector, $\mathbf{d}$. Substituting equation 2.67 into equation 2.68 and simplifying $\mathbf{D}$ into $\mathbf{d}$, the problem reduces to

$$\mathbf{d}^* = \arg\min_{\mathbf{d}} \sum_i \|\mathbf{x}^{(i)} - \mathbf{d}\mathbf{d}^\top \mathbf{x}^{(i)}\|_2^2 \quad \text{subject to} \quad \|\mathbf{d}\|_2 = 1. \tag{2.69}$$
The above formulation is the most direct way of performing the substitution but is not the most stylistically pleasing way to write the equation. It places the scalar value $\mathbf{d}^\top \mathbf{x}^{(i)}$ on the right of the vector $\mathbf{d}$. Scalar coefficients are conventionally written on the left of the vector they operate on. We therefore usually write such a formula as

$$\mathbf{d}^* = \arg\min_{\mathbf{d}} \sum_i \|\mathbf{x}^{(i)} - \mathbf{d}^\top \mathbf{x}^{(i)} \mathbf{d}\|_2^2 \quad \text{subject to} \quad \|\mathbf{d}\|_2 = 1, \tag{2.70}$$
or, exploiting the fact that a scalar is its own transpose, as

$$\mathbf{d}^* = \arg\min_{\mathbf{d}} \sum_i \|\mathbf{x}^{(i)} - \mathbf{x}^{(i)\top} \mathbf{d}\mathbf{d}\|_2^2 \quad \text{subject to} \quad \|\mathbf{d}\|_2 = 1. \tag{2.71}$$
The reader should aim to become familiar with such cosmetic rearrangements.
At this point, it can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors. This will enable us to use more compact notation. Let $\mathbf{X} \in \mathbb{R}^{m \times n}$ be the matrix defined by stacking all the vectors describing the points, such that $\mathbf{X}_{i,:} = \mathbf{x}^{(i)\top}$. We can now rewrite the problem as

$$\mathbf{d}^* = \arg\min_{\mathbf{d}} \|\mathbf{X} - \mathbf{X}\mathbf{d}\mathbf{d}^\top\|_F^2 \quad \text{subject to} \quad \mathbf{d}^\top \mathbf{d} = 1. \tag{2.72}$$
Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:

$$\arg\min_{\mathbf{d}} \|\mathbf{X} - \mathbf{X}\mathbf{d}\mathbf{d}^\top\|_F^2 \tag{2.73}$$
$$= \arg\min_{\mathbf{d}} \mathrm{Tr}\left(\left(\mathbf{X} - \mathbf{X}\mathbf{d}\mathbf{d}^\top\right)^\top \left(\mathbf{X} - \mathbf{X}\mathbf{d}\mathbf{d}^\top\right)\right) \tag{2.74}$$

(by equation 2.49)

$$= \arg\min_{\mathbf{d}} \mathrm{Tr}\left(\mathbf{X}^\top\mathbf{X} - \mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top - \mathbf{d}\mathbf{d}^\top\mathbf{X}^\top\mathbf{X} + \mathbf{d}\mathbf{d}^\top\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top\right) \tag{2.75}$$
$$= \arg\min_{\mathbf{d}} \mathrm{Tr}(\mathbf{X}^\top\mathbf{X}) - \mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) - \mathrm{Tr}(\mathbf{d}\mathbf{d}^\top\mathbf{X}^\top\mathbf{X}) + \mathrm{Tr}(\mathbf{d}\mathbf{d}^\top\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) \tag{2.76}$$
$$= \arg\min_{\mathbf{d}} -\mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) - \mathrm{Tr}(\mathbf{d}\mathbf{d}^\top\mathbf{X}^\top\mathbf{X}) + \mathrm{Tr}(\mathbf{d}\mathbf{d}^\top\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) \tag{2.77}$$

(because terms not involving $\mathbf{d}$ do not affect the $\arg\min$)

$$= \arg\min_{\mathbf{d}} -2\,\mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) + \mathrm{Tr}(\mathbf{d}\mathbf{d}^\top\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) \tag{2.78}$$

(because we can cycle the order of the matrices inside a trace, equation 2.52)

$$= \arg\min_{\mathbf{d}} -2\,\mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) + \mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top\mathbf{d}\mathbf{d}^\top) \tag{2.79}$$

(using the same property again).
At this point, we reintroduce the constraint:

$$\arg\min_{\mathbf{d}} -2\,\mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) + \mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top\mathbf{d}\mathbf{d}^\top) \quad \text{subject to} \quad \mathbf{d}^\top\mathbf{d} = 1 \tag{2.80}$$
$$= \arg\min_{\mathbf{d}} -2\,\mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) + \mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) \quad \text{subject to} \quad \mathbf{d}^\top\mathbf{d} = 1 \tag{2.81}$$

(due to the constraint)

$$= \arg\min_{\mathbf{d}} -\mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) \quad \text{subject to} \quad \mathbf{d}^\top\mathbf{d} = 1 \tag{2.82}$$
$$= \arg\max_{\mathbf{d}} \mathrm{Tr}(\mathbf{X}^\top\mathbf{X}\mathbf{d}\mathbf{d}^\top) \quad \text{subject to} \quad \mathbf{d}^\top\mathbf{d} = 1 \tag{2.83}$$
$$= \arg\max_{\mathbf{d}} \mathrm{Tr}(\mathbf{d}^\top\mathbf{X}^\top\mathbf{X}\mathbf{d}) \quad \text{subject to} \quad \mathbf{d}^\top\mathbf{d} = 1. \tag{2.84}$$
This optimization problem may be solved using eigendecomposition. Specifically, the optimal $\mathbf{d}$ is given by the eigenvector of $\mathbf{X}^\top\mathbf{X}$ corresponding to the largest eigenvalue.
This derivation is specific to the case of $l = 1$ and recovers only the first principal component. More generally, when we wish to recover a basis of principal components, the matrix $\mathbf{D}$ is given by the $l$ eigenvectors corresponding to the largest eigenvalues. This may be shown using proof by induction. We recommend writing this proof as an exercise.
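The whole derivation fits in a few lines of NumPy. The following sketch (synthetic data of our own choosing, stretched along one axis so that the first principal component is unambiguous) finds the optimal $\mathbf{d}$ as the top eigenvector of $\mathbf{X}^\top\mathbf{X}$ and applies the reconstruction $r(\mathbf{x}) = \mathbf{d}\mathbf{d}^\top\mathbf{x}$ to every point:

```python
import numpy as np

rng = np.random.default_rng(0)
# m = 200 points in R^2, stretched along the first axis.
X = rng.standard_normal((200, 2)) @ np.array([[3., 0.], [0., 0.5]])
X = X - X.mean(axis=0)  # PCA is usually applied to centered data

# Optimal d for l = 1: top eigenvector of X^T X (eigh sorts ascending).
lam, Q = np.linalg.eigh(X.T @ X)
d = Q[:, -1]

# Reconstruct every point; rows of X are x^(i)T, so r(X) = X d d^T.
X_rec = X @ np.outer(d, d)
print(d, np.linalg.norm(X - X_rec, 'fro'))
```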
Linear algebra is one of the fundamental mathematical disciplines necessary to understanding deep learning. Another key area of mathematics that is ubiquitous in machine learning is probability theory, presented next.