next generation video: Introducing Daala

next generation video: Introducing Daala
next generation video: Introducing Daala
(...up to the main demo page)
aala is a new general-purpose video codec currently under
development at Xiph.Org. Our performance target is roughly
a generation beyond current 'next-generation' codecs like
VP9 and HEVC, making Daala a
next-next-generation
effort. As with other Xiph codecs, the Daala format is and
will always be royalty-free with a permissive FOSS
license.
On May 30th 2013, our in-development
Daala prototype encoded and decoded its first streams. Two
hours later, Mozilla's David "oneman" Richards streamed the
first live Daala video over the Internet. Although the
project is still pre-pre-alpha, I think it's time to
introduce Daala to the world.
Update:
Introducing Daala part 2
now posted!
Meet
Crazy Eddie
The next-generation VP9 and HEVC codecs are the latest
incremental refinements of a basic codec design that dates back 25
years to h.261. This conservative, linear development strategy
evolving a proven design has yielded reliable improvement with
relatively low risk, but the law of diminishing returns is setting
in. Recent performance increases are coming at exponentially
increasing computational cost.
Daala tries for a larger leap forward— by first leaping
sideways— to a new codec design and numerous novel coding
techniques. In addition to the technical freedom of starting
fresh, this new design consciously avoids most of the patent
thicket surrounding mainstream block-DCT-based codecs. At its
very core, for example, Daala is based on lapped transforms, not
the traditional DCT.
...so let's start there.
Strengths and Weaknesses of the DCT
Modern codecs typically encode video at bitrates that work out
to substantially less than one bit per pixel. They do this by
grouping pixels together, typically into 4x4, 8x8 or 16x16
blocks, then transforming the blocks with a
two-dimensional
Discrete Cosine Transform (DCT)
Above: Illustration of a basic video encoder; each video frame
is broken into individual 8x8 pixel blocks, then each block is
transformed by the DCT. More complex encoders still perform
these high level steps.
The DCT converts the blocks of pixels into same-sized blocks of
frequency coefficients. It also compacts most of the energy from
the original pixels into just a few coefficients, arranged such
that reduced precision in the frequency domain is far less
noticeable than reduced precision in the spatial domain. Thus,
the DCT makes it possible to encode the image in less space, as
the most important aspects of the image are represented in a
more compact form.
Above: Illustration of a single 8x8 block of grayscale pixels
transformed into an 8x8 block of frequency coefficients by the
DCT. Although the DCT outputs the same number of frequency
coefficients as input pixels, the DCT coefficients typically
represent the same information in a more compact form that can
be stored in fewer bits. In addition, purposely losing
precision in the DCT domain is much less noticable than losing
the same amount of information in the spatial domain. This is
the primary basis of operation of most modern lossy video
codecs.
An encoder then quantizes the DCT coefficients to a lower
precision, usually via a complex heuristic algorithm that's been
carefully tuned. Let's handwave that away for now, and do the
simplest possible thing: quantize all the coefficients by a fixed
quantizer. The decoder then uses an inverse 2D DCT to reconstruct
the original image block by block.
Above: Illustration of a basic decoder reconstructing the
original image from blocks of quantized DCT coefficients. Each
block of quantized coefficients is converted back into a block
of pixels by an inverse DCT. The blocks of pixels are then
reassembled into a complete image.
There's an obvious change here; we can clearly distinguish the
sharp boundary of each individiual block in the reconstructed
image. A DCT is a circular [edit: braino there, I meant
symmetric not circular] transform, and so the block edges
introduce strong discontinuities. When we invert the transform,
the spatial error introduced by quantization causes the block
edges to no longer 'line up'. Average spatial error is also
greater at the block edges, which further exacerbates the
problem.
Above: Illustration of the 'screen door' effect, in which the
reconstructed blocks of a DCT-based codec show an obvious grid
of boundaries. The effect is particularly noticeable in video,
where the grid stays fixed while objects in the video move.
Even motion-compensated codecs can display this effect when the
underlying blocks do not move perfectly in sync with objects in
the image.
For this reason, virtually all video codecs use
deblocking
filter
to smooth over the edges between blocks. This
filter may be a purely post-processing step, or it may exist
within the encoding loop in which case it's called a
loop
filter
. This filter began life decades ago as something of
an afterthought, but it's an integral part of modern codecs
accounting for a significant portion of the design complexity
and CPU cost.
A deblocking filter is at best a necessary evil. It discards
legitimate detail at block edges at the same time it smooths the
boundary discontinuities. Further, the details the deblocking
filter unintentionally discards cost bits and CPU time to
encode. Finally, good deblocking filters are complex and
heuristic, requiring detailed analysis (and in some cases,
additional signaling) to control conditional application. The
heuristics have become quite good, but they can go wrong.
A lapped transform renders this necessary evil unneccessary.
It has some other nice benefits as well.
Above: Comparison of original image to images transformed and
quantized using the DCT and Daala's lapped transform. Both
transforms used the same scaling and coarse flat quantizer to
simulate low-bitrate encoding. The summed log-energy after
quantization was the same for both transforms.
The Lapped Transform
A lapped transform is any transform where each block overlaps
its neigboring blocks. Audio has used overlapping transforms
since approximately forever; the best known example is
the
Modified
Discrete Cosine Transform (MDCT)
that powers MP3,
Vorbis, AAC, and Opus. Many other transforms also fall under
the 'lapped transform' definition (and it's important to point
out Daala does
not
use the MDCT), so let's be more
specific.
First, consider a deblocking filter that's
invertible
; a
wide variety of such filters are possible, if not widely used.
Next suppose that the inverse of the deblocking filter is
applied
before
the forward transform as a
pre-filter
, and the deblocking
filter is applied after the inverse transform as a
post-filter. If we use the DCT as our transform as above, the
pre-filter + DCT is now perfectly inverted by the inverse DCT +
post-filter.
Above: Illustration of one possible decomposition of a lapped
transform, consisting of the DCT with pre-filters
) and post-filters
-1
straddling block boundaries.
The post-filter is similar to our deblocking/smoothing filter
from earlier. The prefilter is the inverse, so it's literally
the opposite; it purposely makes the input image very blocky by
reducing the circular discontinuity at the edges. When paired
with the DCT, this means it compacts more energy into the DC
component by reducing energy leakage into higher-frequency
bins.
Above: We construct a lapped transform from the DCT and spatial
pre- and post-filtering. The post-filter has a deblocking
function, merging block contents together seamlessly. The
pre-filter does the opposite, giving the image an intentionally
blocky appearence. The blockiness is a result of reducing
spurious high frequency energy from the edge discontinuity,
flattening the apparent content of each block and increasing
average energy compaction.
The remaining goal is to find a specific
(and
-1
that, along with the DCT, forms a new lapped transform with the
visual, coding, and computational properties we desire.
Lifting Implementation
Using spatial-domain lapping and lapping transforms for images
and video is not a new idea, even if it's not mainstream. Papers on
the subject date back to Malvar et. al. in the early 1990s, so we
have a decent amount of preexisting research to draw upon.
Lifting
filters are another idea that's not new, though
they are relatively recent. Originally a product of wavelet
transform research, the generalized idea is powerful: A lifting
filter is comprised of sequence of serialized
lifting steps
also simply called
lifts
. Each step updates a single
variable
in place from an arbitrary
function
f()
that does not depend
on
. That
is:
{1 or -1}*
f(
,...,
i-1
i+1
,...,
The function
f()
is usually quite
simple, e.g. just one other value scaled. Lifting steps are often
arranged in
alternating lifts
where
value
is
updated from
value
and then
value
is
updated from
value
The literature often refers to the alternating lift steps as
'predict' (
) and 'update'
or
steps as a result of their origin in second-generation wavelet
transforms.
A lifting filter is always exactly reversible; the steps can
simply be unwound in the opposite order.
Daala uses pre/post lifting filters based on the filters
proposed by Tran, Liang, and Yu
in
Lapped
Regularity-Constraind Pre- and Post-Filtering for Block DCT
Based Systems
. The prefilter
consists of forward and inverse lifting butterflies and
transform matrices
and
. It turns out
that
and
duplicate each others' degrees
of freedom, so we set
to the
identity matrix and eliminate it from the filter design.
Above: This illustration shows the proposed Daala pre- and
post-filter structures as implemented for an 8-point lapped
transform. The
matrix adds no
additional degrees of freedom over
the
matrix, thus we leave it out.
Research literature suggests several possible models for
matrix
; the 'type-III' and 'type-IV'
models allow easy lifting implementations that suit our desired
transform structure. These simplified models allow selection and
optimization of the parameters via nonlinear optimization
techniques or even exhaustive search.
Above: Figure from
Regularity-Constrained Pre- and
Post-Filtering for Block DCT Based Systems
(Tran
et. al. 2001) illustrating two lifting-based implementation
models for matrix
The DCT itself can also be implemented via lifting, and we've
implemented one with both perfect reconstruction and orthonormal
scaling. Placed together with the pre-filter, we can implement the
entire forward lapped transform as a complete lifting
structure.
Above: Daala's complete 8x16 forward lapped transform
implemented as a monolithic lifting filter with a
type-III
matrix. Mouse over the
diagram to see the filter's decomposition
into
and the forward DCT.
As any lifting filter is inherently invertible, constructing
the inverse transform is a purely mechanical process. Lifting
also makes it trivial to implement exact reconstruction as an
integer transform. Quantization error carried through the forward
transform is exactly reversed in lock-step fashion by the inverse
transform. This can both dramatically reduce required numeric
resolution (Daala's filters currently use six-bit coefficients)
and also opens the possibility of fully lossless operation.
Coming Soon
In the next demo (update: now
ready!)
, I'll continue exploring some of the interesting new
(or at least, non-mainstream) techniques used in Daala. There's
more to cover regarding implications of the lapped transform, and
that leads nicely into frequency-domain intra prediction.
—Monty
monty@xiph.org)
June 20, 2013
Additional Resources
First and foremost:
The Daala Project Homepage
Daala development code is available from our
working repository
Dr. Timothy Terriberry (lead of the Daala Project) gave a
long and excellent talk introducing Daala and video coding in
general to Mozilla in 2012. Slides 1 through 33 cover much of
the same ground as this demo with considerably more interesting
technical asides albeit less
text:
Introduction
to Video Coding [slide deck]
Current catalog of filter coefficient search progress is
at the Xiph.Org Wiki
Join our development discussion in
#daala at irc.freenode.net
(→
web interface
H.S. Malvar:
Extended Lapped Transforms: Properties, Applications, and Fast
Algorithms.
IEEE Transactions on Acoustics, Speech, and Signal Processing,
40(11):2703-2714, Nov. 1992.
T.D. Tran:
Lapped Transform via Time-Domain Pre- and Post-Filtering
IEEE
Transactions on Signal Processing 51(6):1557-1571, Jun. 2003.
W. Dai and T.D. Tran:
Regularity-Constrained Pre- and Post-Filtering for Block
DCT-based Systems
IEEE Transactions on Signal Processing 51(10):2568-2581, Oct. 2003.
Monty's Daala documentation work is sponsored by Red Hat Emerging Technologies.
(C) Copyright 2013 Red Hat Inc. and Xiph.Org