Specification Notes

Matroska versioning
Matroska is based upon the principle that a reading application does not have to support
100% of the specifications in order to be able to play the file. A Matroska file therefore
contains version indicators that tell a reading application what to expect.
It is possible and valid to have the version fields indicate that the file contains
Matroska
Elements
from a higher specification version number while signaling that a
reading application
MUST
only support a lower version number properly in order to play
it back (possibly with a reduced feature set).
The EBML Header of each Matroska document informs the reading application of what
version of Matroska to expect. The Elements within the EBML Header with jurisdiction
over this information are DocTypeVersion and DocTypeReadVersion.
DocTypeVersion
MUST
be equal to or greater than the highest Matroska version number of
any
Element
present in the Matroska file. For example, a file using the
SimpleBlock Element
((#simpleblock-element))
MUST
have a
DocTypeVersion
equal to or greater than 2. A file containing
CueRelativePosition
Elements ((#cuerelativeposition-element))
MUST
have a
DocTypeVersion
equal to or greater than 4.
The DocTypeReadVersion MUST contain the minimum version number that a reading application
needs to support in order to play the file back, optionally with a reduced feature
set. For example, if a file contains only Elements of version 2 or lower except for
CueRelativePosition (which is a version 4 Matroska Element), then DocTypeReadVersion
SHOULD still be set to 2 and not 4 because evaluating CueRelativePosition is not
necessary for standard playback; it only makes seeking more precise if used.
A reading application supporting Matroska version V MUST NOT refuse to read a file with
DocTypeReadVersion equal to or lower than V, even if DocTypeVersion is greater than V.
A reading application supporting at least Matroska version V and reading a file whose
DocTypeReadVersion field is equal to or lower than V MUST skip Matroska/EBML Elements it
encounters but does not know about, if that unknown element fits into the size constraints
set by the current Parent Element.
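The version rule above can be sketched as a small check. This is only an illustration: the function name and parameters are hypothetical, and a real reader would take both values from the parsed EBML Header.

```python
def may_refuse_file(supported_version: int,
                    doc_type_version: int,
                    doc_type_read_version: int) -> bool:
    """Return True only when the reader is allowed to refuse the file.

    Hypothetical helper: `supported_version` is the highest Matroska
    version the reader implements. DocTypeVersion is deliberately not
    part of the decision; only DocTypeReadVersion matters.
    """
    return doc_type_read_version > supported_version


# A file using version 4 Elements that is still playable by version 2 readers.
print(may_refuse_file(supported_version=2, doc_type_version=4,
                      doc_type_read_version=2))
```

A reader that accepts such a file still has to skip the unknown Elements it meets, as described above.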
Stream Copy
It is sometimes necessary to create a Matroska file from another Matroska file, for example to add subtitles in a language
or to edit out a portion of the content.
Some values from the original Matroska file need to be kept the same in the destination file.
For example the SamplingFrequency of an audio track wouldn’t change between the two files.
Some other values may change between the two files, for example the TrackNumber of an audio track when another track has been added.
An Element is marked with a property:
stream copy: True
when the values of that Element need to be kept identical between the source and destination file.
If that property is not set, elements may or may not keep the same value between the source and destination.
DefaultDecodedFieldDuration
The DefaultDecodedFieldDuration Element can signal to the displaying application how
often fields of a video sequence will be available for displaying. It can be used for both
interlaced and progressive content. If the video sequence is signaled as interlaced,
then the period between two successive fields at the output of the decoding process
equals DefaultDecodedFieldDuration.
For video sequences signaled as progressive, the period between two successive frames
is twice the value of DefaultDecodedFieldDuration.
These values are valid at the end of the decoding process before post-processing
(such as deinterlacing or inverse telecine) is applied.
Examples:
Blu-ray movie: 1000000000ns/(48/1.001) = 20854167ns
PAL broadcast/DVD: 1000000000ns/(50/1.000) = 20000000ns
N/ATSC broadcast: 1000000000ns/(60/1.001) = 16683333ns
hard-telecined DVD: 1000000000ns/(60/1.001) = 16683333ns (60 encoded interlaced fields per second)
soft-telecined DVD: 1000000000ns/(60/1.001) = 16683333ns (48 encoded interlaced fields per second, with “repeat_first_field = 1”)
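The arithmetic in these examples can be reproduced with a short sketch. The function name is hypothetical, and rounding to the nearest nanosecond is an assumption that matches the figures listed above.

```python
from fractions import Fraction


def field_duration_ns(fields_per_second: Fraction) -> int:
    """Duration of one field in nanoseconds, rounded to the nearest integer."""
    return round(Fraction(1_000_000_000) / fields_per_second)


# Field rates written as exact fractions: 48/1.001, 50/1.000 and 60/1.001.
rates = {
    "Blu-ray movie": Fraction(48000, 1001),
    "PAL broadcast/DVD": Fraction(50),
    "N/ATSC broadcast": Fraction(60000, 1001),
}
for name, rate in rates.items():
    print(name, field_duration_ns(rate))
```

Exact fractions are used rather than floats so that the rounding step is the only source of approximation.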
Block Structure
Bit 0 is the most significant bit.
Frames using references
SHOULD
be stored in “coding order”. That means the references first, and then
the frames referencing them. A consequence is that timestamps might not be consecutive.
But a frame with a past timestamp
MUST
reference a frame already known, otherwise it’s considered bad/void.
Block Header
| Offset | Player | Description |
|:-------|:-------|:------------|
| 0x00+ | MUST | Track Number (Track Entry). It is coded in EBML-like form (1 octet if the value is < 0x80, 2 if < 0x4000, etc.) (most significant bits set to increase the range). |
| 0x01+ | MUST | Timestamp (relative to Cluster timestamp, signed int16) |
Table: Block Header base parts{#blockHeaderBase}
Block Header Flags
| Offset | Bit | Player | Description |
|:-------|:----|:-------|:------------|
| 0x03+ | 0-3 | - | Reserved, set to 0 |
| 0x03+ | 4 | - | Invisible, the codec SHOULD decode this frame but not display it |
| 0x03+ | 5-6 | MUST | Lacing: 00 = no lacing, 01 = Xiph lacing, 11 = EBML lacing, 10 = fixed-size lacing |
| 0x03+ | 7 | - | not used |
Table: Block Header flags part{#blockHeaderFlags}
SimpleBlock Structure
The
SimpleBlock
is inspired by the Block structure; see (#block-structure).
The main differences are the added Keyframe flag and Discardable flag. Otherwise everything is the same.
Bit 0 is the most significant bit.
SimpleBlock Header
| Offset | Player | Description |
|:-------|:-------|:------------|
| 0x00+ | MUST | Track Number (Track Entry). It is coded in EBML-like form (1 octet if the value is < 0x80, 2 if < 0x4000, etc.) (most significant bits set to increase the range). |
| 0x01+ | MUST | Timestamp (relative to Cluster timestamp, signed int16) |
Table: SimpleBlock Header base parts{#simpleblockHeaderBase}
SimpleBlock Header Flags
| Offset | Bit | Player | Description |
|:-------|:----|:-------|:------------|
| 0x03+ | 0 | - | Keyframe, set when the Block contains only keyframes |
| 0x03+ | 1-3 | - | Reserved, set to 0 |
| 0x03+ | 4 | - | Invisible, the codec SHOULD decode this frame but not display it |
| 0x03+ | 5-6 | MUST | Lacing: 00 = no lacing, 01 = Xiph lacing, 11 = EBML lacing, 10 = fixed-size lacing |
| 0x03+ | 7 | - | Discardable, the frames of the Block can be discarded during playing if needed |
Table: SimpleBlock Header flags part{#simpleblockHeaderFlags}
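As an illustration of the header layout described above, here is a minimal parsing sketch. The function name and the returned dictionary are hypothetical. It assumes big-endian storage of the 16-bit timestamp (the usual EBML convention) and that the Invisible flag sits at bit 4 and the Discardable flag at bit 7, counting from the most significant bit.

```python
import struct

LACING = {0b00: "none", 0b01: "Xiph", 0b11: "EBML", 0b10: "fixed-size"}


def parse_simpleblock_header(data: bytes) -> dict:
    """Parse the track number, relative timestamp and flags of a SimpleBlock."""
    first = data[0]
    if first == 0:
        raise ValueError("track number longer than 8 octets: not handled here")
    # The position of the first set bit gives the length of the track number.
    length, marker = 1, 0x80
    while not first & marker:
        length += 1
        marker >>= 1
    track = first & (marker - 1)          # drop the length marker bit
    for octet in data[1:length]:
        track = (track << 8) | octet
    (timestamp,) = struct.unpack_from(">h", data, length)  # signed int16
    flags = data[length + 2]
    return {
        "track_number": track,
        "timestamp": timestamp,
        "keyframe": bool(flags & 0x80),      # bit 0 (most significant)
        "invisible": bool(flags & 0x08),     # bit 4
        "lacing": LACING[(flags >> 1) & 0x03],  # bits 5-6
        "discardable": bool(flags & 0x01),   # bit 7
        "header_size": length + 3,
    }


print(parse_simpleblock_header(bytes([0x81, 0xFF, 0xD8, 0x80])))
```

The same code applies to a Block header if the Keyframe and Discardable entries are ignored, since those bits are reserved or unused there.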
Block Lacing
Lacing is a mechanism to save space when storing data. It is typically used for small blocks
of data (referred to as frames in Matroska). It packs multiple frames into a single
Block or SimpleBlock.
Lacing MUST NOT be used to store a single frame in a Block or SimpleBlock.
There are 3 types of lacing:
Xiph, inspired by what is found in the Ogg container [@?RFC3533]
EBML, which is the same with sizes coded differently
fixed-size, where the size is not coded
When lacing is not used, i.e. to store a single frame, the lacing bits 5 and 6 of the
Block
or
SimpleBlock
MUST
be set to zero.
For example, a user wants to store 3 frames of the same track. The first frame is 800 octets long,
the second is 500 octets long and the third is 1000 octets long. As these data are small,
they can be stored in a lace to save space.
It is possible not to use lacing at all and just store a single frame without any extra data.
When the FlagLacing – (#flaglacing-element) – is set to “0” all blocks of that track
MUST NOT
use lacing.
No lacing
When no lacing is used, the number of frames in the lace is omitted and only one frame can be stored in the Block.
Bits 5-6 of the Block Header flags are set to 00.
The Block for an 800-octet frame is as follows:
| Block Octets | Value | Description |
|:-------------|:------|:------------|
| 4-803 | | Single frame data |
Table: No lacing{#blockNoLacing}
When a Block contains a single frame, it
MUST
use this No lacing mode.
Xiph lacing
The Xiph lacing uses the same coding of size as found in the Ogg container [@?RFC3533].
The bits 5-6 of the Block Header flags are set to
01
The Block data with laced frames is stored as follows:
Lacing Head on 1 Octet: Number of frames in the lace minus 1.
Lacing size of each frame except the last one.
Binary data of each frame consecutively.
Each lacing size is stored as a series of unsigned octets whose sum is the size: as many octets of value 255 as needed, followed by a final octet lower than 255. For example, 500 is coded 255;245 or [0xFF 0xF5].
A frame with a size multiple of 255 is coded with a 0 at the end of the size – for example, 765 is coded 255;255;255;0 or [0xFF 0xFF 0xFF 0x00].
The size of the last frame is deduced from the size remaining in the Block after the other frames.
Because large sizes result in large coding of the sizes, it is
RECOMMENDED
to use Xiph lacing only with small frames.
In our example, the 800, 500 and 1000 frames are stored with Xiph lacing in a Block as follows:
| Block Octet | Value | Description |
|:------------|:------|:------------|
| 4 | 0x02 | Number of frames minus 1 |
| 5-8 | 0xFF 0xFF 0xFF 0x23 | Size of the first frame (255;255;255;35) |
| 9-10 | 0xFF 0xF5 | Size of the second frame (255;245) |
| 11-810 | | First frame data |
| 811-1310 | | Second frame data |
| 1311-2310 | | Third frame data |
Table: Xiph lacing example{#blockXiphLacing}
The Block is 2311 octets large and the last frame starts at 1311, so we can deduce the size of the last frame is 2311 - 1311 = 1000.
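The Xiph size coding can be sketched as follows. The function names are hypothetical, and the sketch only builds the lacing head and sizes, not the Block header or frame data.

```python
def xiph_lace_size(size: int) -> bytes:
    """Code one frame size: octets of 255, then a final octet below 255."""
    full, rest = divmod(size, 255)
    return bytes([255] * full + [rest])


def xiph_lace_header(frame_sizes: list[int]) -> bytes:
    """Lacing head (frame count minus 1) followed by all sizes but the last."""
    out = bytes([len(frame_sizes) - 1])
    for size in frame_sizes[:-1]:
        out += xiph_lace_size(size)
    return out


print(xiph_lace_header([800, 500, 1000]).hex(" "))
```

Because `divmod` always yields a final octet below 255, a size that is a multiple of 255 ends with a 0 octet, as the text requires.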
EBML lacing
The EBML lacing encodes the frame size with an EBML-like encoding [@!RFC8794].
The bits 5-6 of the Block Header flags are set to
11
The Block data with laced frames is stored as follows:
Lacing Head on 1 Octet: Number of frames in the lace minus 1.
Lacing size of each frame except the last one.
Binary data of each frame consecutively.
The first frame size is encoded as an EBML Variable-Size Integer value, also known as VINT in [@!RFC8794].
The remaining frame sizes are encoded as signed values using the difference between the frame size and the previous frame size.
These signed values are encoded as VINT, with a mapping from signed to unsigned numbers.
Decoding the unsigned number stored in the VINT to a signed number is done by subtracting 2^((7*n)-1) - 1, where
n is the octet size of the VINT.
| Bit Representation of signed VINT | Possible Value Range |
|:----------------------------------|:---------------------|
| 1xxx xxxx | 2^7 values from -(2^6 - 1) to 2^6 |
| 01xx xxxx xxxx xxxx | 2^14 values from -(2^13 - 1) to 2^13 |
| 001x xxxx xxxx xxxx xxxx xxxx | 2^21 values from -(2^20 - 1) to 2^20 |
| 0001 xxxx xxxx xxxx xxxx xxxx xxxx xxxx | 2^28 values from -(2^27 - 1) to 2^27 |
| 0000 1xxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx | 2^35 values from -(2^34 - 1) to 2^34 |
Table: EBML Lacing signed VINT bits usage{#ebmlLacingBits}
In our example, the 800, 500 and 1000 frames are stored with EBML lacing in a Block as follows:
| Block Octets | Value | Description |
|:-------------|:------|:------------|
| 4 | 0x02 | Number of frames minus 1 |
| 5-6 | 0x43 0x20 | Size of the first frame (800 = 0x320 + 0x4000) |
| 7-8 | 0x5E 0xD3 | Size of the second frame (500 - 800 = -300 = - 0x12C + 0x1FFF + 0x4000) |
| 9-808 | | First frame data |
| 809-1308 | | Second frame data |
| 1309-2308 | | Third frame data |
Table: EBML lacing example{#blockEbmlLacing}
The Block is 2309 octets large and the last frame starts at 1309, so we can deduce the size of the last frame is 2309 - 1309 = 1000.
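The size coding used by EBML lacing can be sketched as follows. All function names are hypothetical. The sketch picks the shortest VINT length that fits each value and, as an assumption, avoids the all-ones pattern for the first (unsigned) size.

```python
def encode_vint(value: int, length: int) -> bytes:
    """Store `value` in `length` octets with the VINT length marker bit set."""
    return ((1 << (7 * length)) | value).to_bytes(length, "big")


def encode_first_size(size: int) -> bytes:
    length = 1
    while size >= (1 << (7 * length)) - 1:
        length += 1
    return encode_vint(size, length)


def encode_size_delta(delta: int) -> bytes:
    """Signed difference with the previous size, biased by 2^((7*n)-1) - 1."""
    length = 1
    while not -((1 << (7 * length - 1)) - 1) <= delta <= (1 << (7 * length - 1)):
        length += 1
    bias = (1 << (7 * length - 1)) - 1
    return encode_vint(delta + bias, length)


def decode_size_delta(raw: bytes) -> int:
    length = len(raw)
    stored = int.from_bytes(raw, "big") & ((1 << (7 * length)) - 1)
    return stored - ((1 << (7 * length - 1)) - 1)


def ebml_lace_header(frame_sizes: list[int]) -> bytes:
    out = bytes([len(frame_sizes) - 1]) + encode_first_size(frame_sizes[0])
    for prev, cur in zip(frame_sizes, frame_sizes[1:-1]):
        out += encode_size_delta(cur - prev)
    return out


print(ebml_lace_header([800, 500, 1000]).hex(" "))
```

As with Xiph lacing, the size of the last frame is never written; the reader derives it from the remaining octets of the Block.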
Fixed-size lacing
The Fixed-size lacing doesn’t store the frame size, only the number of frames in the lace.
Each frame
MUST
have the same size. The frame size of each frame is deduced from the total size of the Block.
The bits 5-6 of the Block Header flags are set to
10
The Block data with laced frames is stored as follows:
Lacing Head on 1 Octet: Number of frames in the lace minus 1.
Binary data of each frame consecutively.
For example, for 3 frames of 800 octets each:
| Block Octets | Value | Description |
|:-------------|:------|:------------|
| 4 | 0x02 | Number of frames minus 1 |
| 5-804 | | First frame data |
| 805-1604 | | Second frame data |
| 1605-2404 | | Third frame data |
Table: Fixed-size lacing example{#blockFixedSizeLacing}
This gives a Block of 2405 octets. When reading the Block we find that there are 3 frames (Octet 4).
The data start at Octet 5, so the size of each frame is (2405 - 5) / 3 = 800.
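The deduction above can be written as a small sketch. The function name is hypothetical, and the default header size of 4 octets assumes a one-octet track number, as in the example.

```python
def split_fixed_size_lace(block: bytes, header_size: int = 4) -> list[bytes]:
    """Split the payload of a fixed-size laced Block into equal frames."""
    frame_count = block[header_size] + 1       # lacing head stores count - 1
    payload = block[header_size + 1:]
    frame_size, remainder = divmod(len(payload), frame_count)
    if remainder:
        raise ValueError("payload is not a multiple of the frame count")
    return [payload[i * frame_size:(i + 1) * frame_size]
            for i in range(frame_count)]


# Header (track 1, timestamp 0, fixed-size lacing), lacing head, 3 x 800 octets.
block = bytes([0x81, 0x00, 0x00, 0x04, 0x02]) + bytes(2400)
print([len(frame) for frame in split_fixed_size_lace(block)])
```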
Laced Frames Timestamp
A Block only contains a single timestamp value. But when lacing is used, it contains more than one frame.
Each frame originally has its own timestamp, or Presentation Timestamp (PTS). The Block timestamp applies to
the first frame in the lace.
In the lace, each frame after the first one has an underdetermined timestamp.
But each of these frames
MUST
be contiguous – i.e. the decoded data
MUST NOT
contain any gap
between them. If there is a gap in the stream, the frames around the gap
MUST NOT
be in the same Block.
Lacing is only useful for small contiguous data to save space. This is usually the case for audio tracks
and not the case for video – which use a lot of data – or subtitle tracks – which have long gaps.
For audio, there is usually a fixed output sampling frequency for the whole track.
So the decoder should be able to recover the timestamp of each sample, knowing each
output sample is contiguous with a fixed frequency.
For subtitles this is usually not the case so lacing
SHOULD NOT
be used.
Random Access Points
Random Access Points (RAP) are positions where the parser can seek to and start playback without decoding
what came before. In Matroska,
BlockGroups
and
SimpleBlocks
can be RAPs.
To seek to these elements it is still necessary to seek to the
Cluster
containing them,
read the Cluster Timestamp
and start playback from the
BlockGroup
or
SimpleBlock
that is a RAP.
Because a Matroska File is usually composed of multiple tracks playing at the same time
(video, audio, and subtitles), each selected track must be taken into account to seek
properly to a RAP. Usually all audio and subtitle BlockGroup or SimpleBlock Elements are RAPs.
They are independent of each other and can be played randomly.
Video tracks on the other hand often use references to previous and future frames for better
coding efficiency. Frames with such reference
MUST
either contain one or more
ReferenceBlock
Elements in their
BlockGroup
or
MUST
be marked
as non-keyframe in a
SimpleBlock
; see (#simpleblock-header-flags).
BlockGroup with a frame that references another frame, with the EBML tree shown as XML:

123456

-40

...

SimpleBlock with a frame that references another frame, with the EBML tree shown as XML:

123456

(octet 3 bit 0 not set)
...

Frames that are RAP – i.e. they don’t depend on other frames –
MUST
set the keyframe
flag if they are in a
SimpleBlock
or their parent
BlockGroup
MUST NOT
contain
ReferenceBlock
BlockGroup with a frame that references no other frame, with the EBML tree shown as XML:

123456

...

SimpleBlock with a frame that references no other frame, with the EBML tree shown as XML:

123456

(octet 3 bit 0 set)
...

There may be cases where the use of BlockGroup is necessary, as the frame may need a
BlockDuration, BlockAdditions, CodecState, or DiscardPadding element.
For those cases a SimpleBlock MUST NOT be used;
the reference information SHOULD be recovered for non-RAP frames.
SimpleBlock with a frame that references another frame, with the EBML tree shown as XML:

123456

(octet 3 bit 0 not set)
...

Same frame that references another frame put inside a BlockGroup to add
BlockDuration
, with the EBML tree shown as XML:

123456

-40

20

...

When a frame in a BlockGroup is not a RAP, all references SHOULD be listed as
ReferenceBlock Elements, or at least some of them, even if not accurate, or one ReferenceBlock
with the value “0” corresponding to a self or unknown reference.
The lack of ReferenceBlock would mean such a frame is a RAP, and seeking to that
frame, which actually depends on other frames, MAY create bogus output or even crash.
Same frame that references another frame put inside a BlockGroup but the reference could not be recovered, with the EBML tree shown as XML:

123456

20

...

BlockGroup with a frame that references two other frames, with the EBML tree shown as XML:

123456

-80

40

...

Intra-only video frames, such as the ones found in AV1 or VP9, can be decoded without any other
frame, but they don’t reset the codec state. So seeking to these frames is not possible
as the next frames may need frames that are not known from this seeking point.
Such intra-only frames
MUST NOT
be considered as keyframes so the keyframe flag
MUST NOT
be set in the
SimpleBlock
or a
ReferenceBlock
MUST
be used
to signify the frame is not a RAP. The timestamp value of the
ReferenceBlock
MUST
be “0”, meaning it’s referencing itself.
Intra-only frame not an RAP, with the EBML tree shown as XML:

123456

...

Because a video SimpleBlock has less reference information than a video BlockGroup,
it is possible to remux a video track using BlockGroup into a SimpleBlock,
as long as it doesn’t use any other BlockGroup features than ReferenceBlock.
Timestamps
Historically timestamps in Matroska were mistakenly called timecodes. The
Timestamp Element
was called Timecode, the
TimestampScale Element
was called TimecodeScale, the
TrackTimestampScale Element
was called TrackTimecodeScale and the
ReferenceTimestamp Element
was called ReferenceTimeCode.
Timestamp Ticks
All timestamp values in Matroska are expressed in multiples of a tick.
They are usually stored as integers.
There are three types of ticks possible:
Matroska Ticks
For such elements, the timestamp value is stored directly in nanoseconds.
The elements storing values in Matroska Ticks/nanoseconds are:
TrackEntry\DefaultDuration
; defined in (#defaultduration-element)
TrackEntry\DefaultDecodedFieldDuration
; defined in (#defaultdecodedfieldduration-element)
TrackEntry\SeekPreRoll
; defined in (#seekpreroll-element)
TrackEntry\CodecDelay
; defined in (#codecdelay-element)
BlockGroup\DiscardPadding
; defined in (#discardpadding-element)
ChapterAtom\ChapterTimeStart
; defined in (#chaptertimestart-element)
ChapterAtom\ChapterTimeEnd
; defined in (#chaptertimeend-element)
CuePoint\CueTime
; defined in (#cuetime-element)
CueReference\CueRefTime
; defined in (#cuetime-element)
Segment Ticks
Elements in Segment Ticks involve the use of the
TimestampScale Element
of the Segment to get the timestamp
in nanoseconds of the element, with the following formula:
timestamp in nanosecond = element value * TimestampScale
This allows storing smaller integer values in the elements.
When using the default value of
TimestampScale
of “1,000,000”, one Segment Tick represents one millisecond.
The elements storing values in Segment Ticks are:
Cluster\Timestamp
; defined in (#timestamp-element)
Info\Duration
is stored as a floating point but the same formula applies; defined in (#duration-element)
CuePoint\CueTrackPositions\CueDuration
; defined in (#cueduration-element)
Track Ticks
Elements in Track Ticks involve the use of the
TimestampScale Element
of the Segment and the
TrackTimestampScale Element
of the Track
to get the timestamp in nanoseconds of the element, with the following formula:
timestamp in nanoseconds =
element value * TrackTimestampScale * TimestampScale
This allows storing smaller integer values in the elements.
The resulting floating point values of the timestamps are still expressed in nanoseconds.
When using the default values for
TimestampScale
and
TrackTimestampScale
of “1,000,000” and of “1.0” respectively, one Track Tick represents one millisecond.
The elements storing values in Track Ticks are:
Cluster\BlockGroup\Block
and
Cluster\SimpleBlock
timestamps; detailed in (#block-timestamps)
Cluster\BlockGroup\BlockDuration
; defined in (#blockduration-element)
Cluster\BlockGroup\ReferenceBlock
; defined in (#referenceblock-element)
When the
TrackTimestampScale
is interpreted as “1.0”, Track Ticks are equivalent to Segment Ticks
and give an integer value in nanoseconds. This is the most common case as
TrackTimestampScale
is usually omitted.
A value of
TrackTimestampScale
other than “1.0”
MAY
be used
to scale the timestamps more in tune with each Track sampling frequency.
For historical reasons, a lot of Matroska readers don’t take the TrackTimestampScale
value into account.
So using a value other than “1.0” might not work in many places.
Block Timestamps
Block Element
and
SimpleBlock Element
timestamp is the time when the decoded data of the first
frame in the Block/SimpleBlock
MUST
be presented, if the track of that Block/SimpleBlock is selected for playback.
This is also known as the Presentation Timestamp (PTS).
The
Block Element
and
SimpleBlock Element
store their timestamps as signed integers, relative
to the
Cluster\Timestamp
value of the
Cluster
they are stored in.
To get the timestamp of a
Block
or
SimpleBlock
in nanoseconds you have to use the following formula:
( Cluster\Timestamp + ( block timestamp * TrackTimestampScale ) ) *
TimestampScale
The
Block Element
and
SimpleBlock Element
store their timestamps as 16bit signed integers,
allowing a range from “-32768” to “+32767” Track Ticks.
Although these values can be negative, when added to the
Cluster\Timestamp
, the resulting frame timestamp
SHOULD NOT
be negative.
When a CodecDelay Element is set, its value MUST be subtracted from each Block timestamp of that track.
To get the timestamp in nanoseconds of the first frame in a
Block
or
SimpleBlock
, the formula becomes:
( ( Cluster\Timestamp + ( block timestamp * TrackTimestampScale ) ) *
TimestampScale ) - CodecDelay
The resulting frame timestamp
SHOULD NOT
be negative.
During playback, when a frame has a negative timestamp, the content
MUST
be decoded by the decoder but not played to the user.
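The formula above can be sketched as a function. The function and parameter names are hypothetical. The defaults mirror the default values quoted earlier (TimestampScale of 1,000,000 and TrackTimestampScale of 1.0), and rounding to the nearest nanosecond is an assumption for the case where TrackTimestampScale is not 1.0.

```python
def block_timestamp_ns(cluster_timestamp: int,
                       block_timestamp: int,
                       timestamp_scale: int = 1_000_000,
                       track_timestamp_scale: float = 1.0,
                       codec_delay_ns: int = 0) -> int:
    """Timestamp in nanoseconds of the first frame of a Block or SimpleBlock."""
    if not -32768 <= block_timestamp <= 32767:
        raise ValueError("block timestamp must fit in a signed 16-bit integer")
    ticks = cluster_timestamp + block_timestamp * track_timestamp_scale
    return round(ticks * timestamp_scale) - codec_delay_ns


# A Block stored 40 Track Ticks before its Cluster timestamp of 123456.
print(block_timestamp_ns(123456, -40))
print(block_timestamp_ns(123456, -40, codec_delay_ns=6_500_000))
```

A negative result would mean the frame is decoded but not presented, following the playback rule stated above.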
TimestampScale Rounding
The default Track Tick duration is one millisecond.
The TrackTimestampScale is a floating-point value, which is usually 1.0. But when it’s not, the multiplied
Block Timestamp is a floating-point value in nanoseconds.
The Matroska Reader SHOULD round to the nearest nanosecond to get
the proper nanosecond timestamp of a Block. This allows some clever TrackTimestampScale values
to have more refined timestamp precision per frame.
Language Codes
Matroska versions 1 through 3 use language codes that can be either the three-letter
bibliographic ISO-639-2 form [@!ISO639-2] (like “fre” for French),
or such a language code followed by a dash and a country code for specialities in languages (like “fre-ca” for Canadian French).
The
ISO 639-2 Language Elements
are “Language Element”, “TagLanguage Element”, and “ChapLanguage Element”.
Starting in Matroska version 4, either [@!ISO639-2] or [@!BCP47]
MAY
be used,
although
BCP 47
is
RECOMMENDED
. The
BCP 47 Language Elements
are “LanguageBCP47 Element”,
“TagLanguageBCP47 Element”, and “ChapLanguageBCP47 Element”. If a
BCP 47 Language Element
and an
ISO 639-2 Language Element
are used within the same
Parent Element
, then the
ISO 639-2 Language Element
MUST
be ignored and precedence given to the
BCP 47 Language Element
Country Codes
Country codes are the [@!BCP47] two-letter region subtag, without the UK exception.
Encryption
Encryption in Matroska is designed in a very generic style to allow people to
implement whatever form of encryption is best for them. It is possible to use the
encryption framework in Matroska as a type of DRM (Digital Rights Management).
This document does not specify any kind of standard for encrypting elements.
The issues of key scheduling, authorisation, and authentication are out of scope.
External entities have used these elements in proprietary ways.
Because encryption occurs within the
Block Element
, it is possible to manipulate
encrypted streams without decrypting them. The streams could potentially be copied,
deleted, cut, appended, or any number of other possible editing techniques without
decryption. The data can be used without having to expose it or go through the decrypting process.
Encryption can also be layered within Matroska. This means that two completely different
types of encryption can be used, requiring two separate keys to be able to decrypt a stream.
Encryption information is stored in the
ContentEncodings Element
under the
ContentEncryption Element
For encryption systems sharing public/private keys, the creation of the keys and the exchange of keys
are not covered by this document. They have to be handled by the system using Matroska.
The ContentEncodingScope Element gives an idea of which parts of the track are encrypted.
But each ContentEncAlgo Element and its sub-elements, like AESSettingsCipherMode,
define exactly how the encrypted data should be interpreted.
The AES-CTR system, which corresponds to
ContentEncAlgo
= 5 ((#contentencalgo-element)) and
AESSettingsCipherMode
= 1 ((#aessettingsciphermode-element)),
is defined in the [@?WebM-Enc] document.
Image Presentation
Cropping
The PixelCrop Elements (PixelCropTop, PixelCropBottom, PixelCropRight, and PixelCropLeft)
indicate when, and by how much, encoded video frames SHOULD be cropped for display.
These Elements allow edges of the frame that are not intended for display, such as the
sprockets of a full-frame film scan or the VANC area of a digitized analog videotape,
to be stored but hidden.
PixelCropTop
and
PixelCropBottom
store an integer of how many
rows of pixels
SHOULD
be cropped from the top and bottom of the image (respectively).
PixelCropLeft
and
PixelCropRight
store an integer of how many columns of pixels
SHOULD
be cropped from the left and right of the image (respectively). For example,
a pillar-boxed video that stores a 1440x1080 visual image within the center of a padded
1920x1080 encoded image
MAY
set both
PixelCropLeft
and
PixelCropRight
to “240”,
so that a
Matroska Player
SHOULD
crop off 240 columns of pixels from the left and
right of the encoded image to present the image with the pillar-boxes hidden.
Cropping has to be performed before resizing and the display dimensions given by
DisplayWidth, DisplayHeight, and DisplayUnit apply to the already cropped image.
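The cropping arithmetic can be illustrated with a short sketch. The function name is hypothetical; the parameters stand for the encoded frame size and the four PixelCrop values.

```python
def cropped_size(pixel_width: int, pixel_height: int,
                 crop_left: int = 0, crop_right: int = 0,
                 crop_top: int = 0, crop_bottom: int = 0) -> tuple[int, int]:
    """Size of the image left after cropping, before any resizing."""
    width = pixel_width - crop_left - crop_right
    height = pixel_height - crop_top - crop_bottom
    if width <= 0 or height <= 0:
        raise ValueError("crop values remove the whole image")
    return width, height


# The pillar-boxed example: 240 columns removed on each side of 1920x1080.
print(cropped_size(1920, 1080, crop_left=240, crop_right=240))
```

The display dimensions would then be applied to the 1440x1080 result, not to the 1920x1080 encoded image.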
Rotation
The ProjectionPoseRoll Element (see (#projectionposeroll-element)) can be used to indicate
that the image from the associated video track
SHOULD
be rotated for presentation.
For instance, the following representation of the Projection Element (see (#projection-element))
and the ProjectionPoseRoll Element represents a video track where the image SHOULD be
presented with a 90 degree counter-clockwise rotation, with the EBML tree shown as XML:

90

Figure: Rotation example.
Segment Position
The
Segment Position
of an
Element
refers to the position of the first octet of the
Element ID
of that
Element
, measured in octets, from the beginning of the
Element Data
section of the containing
Segment Element
. In other words, the
Segment Position
of an
Element
is the distance in octets from the beginning of its containing
Segment Element
minus the size of the
Element ID
and
Element Data Size
of that
Segment Element
The
Segment Position
of the first
Child Element
of the
Segment Element
is 0.
An Element that is not stored within a Segment Element, such as the Elements of
the EBML Header, does not have a Segment Position.
Segment Position Exception
Elements
that are defined to store a
Segment Position
MAY
define reserved values to
indicate a special meaning.
Example of Segment Position
This table presents an example of Segment Position by showing a hexadecimal representation
of a very small Matroska file with labels to show the offsets in octets. The file contains
a Segment Element with an Element ID of “0x18538067” and a MuxingApp Element with an
Element ID of “0x4D80”.
     0                             1                             2
     0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5  6  7  8  9  0
     +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
   0 |1A|45|DF|A3|8B|42|82|88|6D|61|74|72|6F|73|6B|61|
      ^ EBML Header
   0 |                                               |18|53|80|67|
                                                      ^ Segment ID
  20 |93|
      ^ Segment Data Size
  20 |  |15|49|A9|66|8E|4D|80|84|69|65|74|66|57|41|84|69|65|74|66|
         ^ Start of Segment data
  20 |                 |4D|80|84|69|65|74|66|57|41|84|69|65|74|66|
                        ^ MuxingApp start
In the above example, the
Element ID
of the
Segment Element
is stored at offset 16,
the
Element Data Size
of the
Segment Element
is stored at offset 20, and the
Element Data
of the
Segment Element
is stored at offset 21.
The
MuxingApp Element
is stored at offset 26. Since the
Segment Position
of
an
Element
is calculated by subtracting the position of the
Element Data
of
the containing
Segment Element
from the position of that
Element
, the
Segment Position
of
MuxingApp Element
in the above example is ‘26 - 21’ or ‘5’.
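The calculation can be checked against the bytes of the example file. This is only a sketch: it locates Elements by searching for their Element IDs, which is a shortcut that works for this tiny file and is not how a real parser walks the EBML tree. All variable names are hypothetical.

```python
# The 40 octets of the example file, in the order shown in the diagram.
data = bytes.fromhex(
    "1A45DFA38B4282886D6174726F736B61"   # EBML Header
    "18538067" "93"                      # Segment ID and Segment Data Size
    "1549A9668E"                         # first Child Element of the Segment
    "4D808469657466"                     # MuxingApp Element
    "57418469657466"
)

segment_id_offset = data.index(bytes.fromhex("18538067"))
# Segment Element Data starts after the 4-octet ID and the 1-octet size.
segment_data_offset = segment_id_offset + 4 + 1
muxing_app_offset = data.index(bytes.fromhex("4D80"))

segment_position = muxing_app_offset - segment_data_offset
print(segment_id_offset, segment_data_offset, muxing_app_offset, segment_position)
```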
Linked Segments
Matroska provides several methods to link two or more Segment Elements together to create
a Linked Segment. A Linked Segment is a set of multiple Segments linked together into
a single presentation by using Hard Linking or Medium Linking.
All
Segments
within a
Linked Segment
MUST
have a
SegmentUUID
All
Segments
within a
Linked Segment
SHOULD
be stored within the same directory
or be accessible quickly based on their
SegmentUUID
in order to have seamless transition between segments.
All
Segments
within a
Linked Segment
MAY
set a
SegmentFamily
with a common value to make
it easier for a
Matroska Player
to know which
Segments
are meant to be played together.
The
SegmentFilename
PrevFilename
and
NextFilename
elements
MAY
also give hints on
the original filenames that were used when the Segment links were created, in case some
SegmentUUID
are damaged.
Hard Linking
Hard Linking, also called splitting, is the process of creating a
Linked Segment
by linking multiple
Segment Elements
using the
NextUUID
and
PrevUUID
Elements.
All
Segments
within a
Hard Linked Segment
MUST
use the same
Tracks
list and
TimestampScale
Within a
Linked Segment
, the timestamps of
Block
and
SimpleBlock
MUST
follow consecutively
the timestamps of
Block
and
SimpleBlock
from the previous
Segment
in linking order.
With Hard Linking, the chapters of any
Segment
within the
Linked Segment
MUST
only reference the current
Segment
The
NextUUID
and
PrevUUID
reference the respective
SegmentUUID
values of the next and previous
Segments
The first
Segment
of a
Linked Segment
MUST NOT
have a
PrevUUID Element
The last
Segment
of a
Linked Segment
MUST NOT
have a
NextUUID Element
For each node of the chain of
Segments
of a
Linked Segment
at least one
Segment
MUST
reference the other
Segment
of the node.
In a chain of
Segments
of a
Linked Segment
the
NextUUID
always takes precedence over the
PrevUUID
So if SegmentA has a
NextUUID
to SegmentB and SegmentB has a
PrevUUID
to SegmentC,
the link to use is
NextUUID
between SegmentA and SegmentB, SegmentC is not part of the Linked Segment.
If SegmentB has a
PrevUUID
to SegmentA but SegmentA has no
NextUUID
, then the Matroska Player
MAY
consider these two Segments linked as SegmentA followed by SegmentB.
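One possible way to resolve the chain under these precedence rules is sketched below. The data model is hypothetical: each Segment is keyed by a short name standing in for its SegmentUUID and carries optional "prev" and "next" entries standing in for PrevUUID and NextUUID. Treating a PrevUUID-only link as valid is the optional (MAY) behavior described above.

```python
def hard_link_chain(segments: dict[str, dict], start: str) -> list[str]:
    """Return the playback order of the Linked Segment containing `start`."""
    successor: dict[str, str] = {}
    predecessor: dict[str, str] = {}
    # NextUUID links are registered first so that they take precedence.
    for uid, links in segments.items():
        nxt = links.get("next")
        if nxt in segments and nxt not in predecessor:
            successor[uid] = nxt
            predecessor[nxt] = uid
    # PrevUUID links only fill the gaps left by missing NextUUID links.
    for uid, links in segments.items():
        prev = links.get("prev")
        if prev in segments and uid not in predecessor and prev not in successor:
            predecessor[uid] = prev
            successor[prev] = uid
    # Rewind to the first Segment, then walk forward, guarding against loops.
    head, seen = start, {start}
    while head in predecessor and predecessor[head] not in seen:
        head = predecessor[head]
        seen.add(head)
    chain = [head]
    while chain[-1] in successor and successor[chain[-1]] not in chain:
        chain.append(successor[chain[-1]])
    return chain


segments = {
    "A": {"next": "B"},
    "B": {"prev": "C"},   # ignored: the NextUUID of A already leads to B
    "C": {},
}
print(hard_link_chain(segments, "A"))
```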
As an example, three
Segments
can be Hard Linked as a
Linked Segment
through
cross-referencing each other with
SegmentUUID
PrevUUID
, and
NextUUID
, as in this table:
| file name | SegmentUUID | PrevUUID | NextUUID |
|:----------|:------------|:---------|:---------|
| start.mkv | 71000c23cd310998 53fbc94dd984a5dd | Invalid | a77b3598941cb803 eac0fcdafe44fac9 |
| middle.mkv | a77b3598941cb803 eac0fcdafe44fac9 | 71000c23cd310998 53fbc94dd984a5dd | 6c92285fa6d3e827 b198d120ea3ac674 |
| end.mkv | 6c92285fa6d3e827 b198d120ea3ac674 | a77b3598941cb803 eac0fcdafe44fac9 | Invalid |
Table: Usual Hard Linking UIDs{#hardLinkingUIDs}
Another example where only the NextUUID Element is used:
| file name | SegmentUUID | PrevUUID | NextUUID |
|:----------|:------------|:---------|:---------|
| start.mkv | 71000c23cd310998 53fbc94dd984a5dd | Invalid | a77b3598941cb803 eac0fcdafe44fac9 |
| middle.mkv | a77b3598941cb803 eac0fcdafe44fac9 | n/a | 6c92285fa6d3e827 b198d120ea3ac674 |
| end.mkv | 6c92285fa6d3e827 b198d120ea3ac674 | n/a | Invalid |
Table: Hard Linking without PrevUUID{#hardLinkingWoPrevUUID}
An example where only the
PrevUUID
Element is used:
| file name | SegmentUUID | PrevUUID | NextUUID |
|:----------|:------------|:---------|:---------|
| start.mkv | 71000c23cd310998 53fbc94dd984a5dd | Invalid | n/a |
| middle.mkv | a77b3598941cb803 eac0fcdafe44fac9 | 71000c23cd310998 53fbc94dd984a5dd | n/a |
| end.mkv | 6c92285fa6d3e827 b198d120ea3ac674 | a77b3598941cb803 eac0fcdafe44fac9 | Invalid |
Table: Hard Linking without NextUUID{#hardLinkingWoNextUUID}
In this example only the
middle.mkv
is using the
PrevUUID
and
NextUUID
Elements:
| file name | SegmentUUID | PrevUUID | NextUUID |
|:----------|:------------|:---------|:---------|
| start.mkv | 71000c23cd310998 53fbc94dd984a5dd | Invalid | n/a |
| middle.mkv | a77b3598941cb803 eac0fcdafe44fac9 | 71000c23cd310998 53fbc94dd984a5dd | 6c92285fa6d3e827 b198d120ea3ac674 |
| end.mkv | 6c92285fa6d3e827 b198d120ea3ac674 | n/a | Invalid |
Table: Hard Linking with mixed UID links{#hardLinkingMixedUIDs}
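The precedence rule and the tables above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Segment` class and `hard_link_chain` function are hypothetical names, and both "Invalid" and "n/a" entries are modelled as `None`. It follows a NextUUID whenever one is present and falls back to a PrevUUID pointing at the current Segment only when no NextUUID exists.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    name: str
    uid: str
    prev_uid: Optional[str] = None  # None stands for "Invalid" or "n/a"
    next_uid: Optional[str] = None  # None stands for "Invalid" or "n/a"


def hard_link_chain(segments: List[Segment], first: str) -> List[str]:
    """Walk forward from the Segment named `first`; return file names in play order."""
    by_uid = {s.uid: s for s in segments}
    current = next(s for s in segments if s.name == first)
    chain, seen = [current], {current.uid}
    while True:
        if current.next_uid is not None:
            # NextUUID always takes precedence over any PrevUUID.
            following = by_uid.get(current.next_uid)
        else:
            # No NextUUID: a player MAY follow a PrevUUID that points back here.
            following = next((s for s in segments if s.prev_uid == current.uid), None)
        if following is None or following.uid in seen:
            break
        chain.append(following)
        seen.add(following.uid)
        current = following
    return [s.name for s in chain]


START = "71000c23cd31099853fbc94dd984a5dd"
MIDDLE = "a77b3598941cb803eac0fcdafe44fac9"
END = "6c92285fa6d3e827b198d120ea3ac674"

# The "mixed UID links" case: only middle.mkv carries PrevUUID and NextUUID.
mixed = [
    Segment("start.mkv", START),
    Segment("middle.mkv", MIDDLE, prev_uid=START, next_uid=END),
    Segment("end.mkv", END),
]
print(hard_link_chain(mixed, "start.mkv"))  # ['start.mkv', 'middle.mkv', 'end.mkv']
```

A real player would also have to locate candidate files and decide where a chain starts; the sketch leaves both out.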
Medium Linking
Medium Linking creates relationships between
Segments
using Ordered Chapters ((#editionflagordered)) and the
ChapterSegmentUUID Element
. A
Chapter Edition
with Ordered Chapters
MAY
contain
Chapter elements that reference timestamp ranges from other
Segments
. The
Segment
referenced by the Ordered Chapter via the
ChapterSegmentUUID Element
SHOULD
be played as
part of a Linked Segment.
The timestamps of Segment content referenced by Ordered Chapters
MUST
be adjusted according to the cumulative duration of the previous Ordered Chapters.
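The timestamp adjustment can be illustrated with a small Python sketch. It assumes each Ordered Chapter is given as a `(ChapterTimeStart, ChapterTimeEnd)` pair in the timeline of the Segment it plays from; the function names and the millisecond unit are hypothetical choices for the example.

```python
from typing import List, Tuple


def virtual_timeline(chapters: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return the (start, end) of each Ordered Chapter on the presented timeline."""
    timeline = []
    elapsed = 0  # cumulative duration of the previous Ordered Chapters
    for start, end in chapters:
        duration = end - start
        timeline.append((elapsed, elapsed + duration))
        elapsed += duration
    return timeline


def to_presented(chapters: List[Tuple[int, int]], index: int, timestamp: int) -> int:
    """Map a timestamp inside chapter `index` to the presented timeline."""
    start, end = chapters[index]
    if not start <= timestamp <= end:
        raise ValueError("timestamp is outside the chapter")
    return virtual_timeline(chapters)[index][0] + (timestamp - start)


# Hypothetical edition: 30 s taken from one Segment, then 10 s..70 s of another.
chapters_ms = [(0, 30_000), (10_000, 70_000)]
print(virtual_timeline(chapters_ms))          # [(0, 30000), (30000, 90000)]
print(to_presented(chapters_ms, 1, 15_000))   # 35000
```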
As an example, a file named
intro.mkv
could have a
SegmentUUID
of “0xb16a58609fc7e60653a60c984fc11ead”.
Another file called
program.mkv
could use a Chapter Edition that contains two Ordered Chapters.
The first chapter references the
Segment
of
intro.mkv
with the use of a
ChapterSegmentUUID,
ChapterSegmentEditionUID,
ChapterTimeStart
, and optionally a
ChapterTimeEnd
element.
The second chapter references content within the
Segment
of
program.mkv
. A
Matroska Player
SHOULD
recognize the
Linked Segment
created by the use of
ChapterSegmentUUID
in an enabled
Edition
and present the referenced content of the two
Segments
as a single presentation.
The
ChapterSegmentUUID
represents the Segment that holds the content to play in place of the
Linked Chapter.
The
ChapterSegmentUUID
MUST NOT
be the
SegmentUUID
of its own
Segment.
There are two ways to use a chapter link: Linked-Duration linking and Linked-Edition linking.
Linked-Duration
A Matroska Player MUST play the content of the linked Segment from the ChapterTimeStart until the ChapterTimeEnd timestamp in place of the Linked Chapter.
ChapterTimeStart and ChapterTimeEnd represent timestamps in the Linked Segment matching the value of ChapterSegmentUUID. Their values MUST be in the range of the linked Segment duration.
The ChapterTimeEnd value MUST be set when using linked-duration chapter linking.
ChapterSegmentEditionUID MUST NOT be set.
Linked-Edition
A Matroska Player MUST play the whole linked Edition of the linked Segment in place of the Linked Chapter.
ChapterSegmentEditionUID represents a valid Edition from the Linked Segment matching the value of ChapterSegmentUUID.
When using linked-edition chapter linking, ChapterTimeEnd is OPTIONAL.
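The constraints of the two linking modes can be checked mechanically. The sketch below is a hypothetical validator, not part of any Matroska library: `ChapterLink` and `check_chapter_link` are made-up names, and it assumes that the presence of ChapterSegmentEditionUID is what distinguishes linked-edition from linked-duration linking.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ChapterLink:
    own_segment_uuid: str                      # SegmentUUID of the Segment holding the chapter
    chapter_segment_uuid: str                  # ChapterSegmentUUID
    time_start: int                            # ChapterTimeStart
    time_end: Optional[int] = None             # ChapterTimeEnd
    segment_edition_uid: Optional[int] = None  # ChapterSegmentEditionUID


def check_chapter_link(link: ChapterLink, linked_duration: int) -> Tuple[str, List[str]]:
    """Return the linking mode and a list of rule violations (empty when valid)."""
    problems = []
    if link.chapter_segment_uuid == link.own_segment_uuid:
        problems.append("ChapterSegmentUUID refers to its own Segment")
    if link.segment_edition_uid is not None:
        # Linked-Edition: the whole Edition is played, ChapterTimeEnd is optional.
        return "linked-edition", problems
    # Linked-Duration: ChapterTimeEnd is required and both times must be in range.
    if link.time_end is None:
        problems.append("ChapterTimeEnd is not set")
    for name, value in (("ChapterTimeStart", link.time_start),
                        ("ChapterTimeEnd", link.time_end)):
        if value is not None and not 0 <= value <= linked_duration:
            problems.append(f"{name} is outside the linked Segment duration")
    return "linked-duration", problems
```

For example, `check_chapter_link(ChapterLink("a", "b", 0, 30), linked_duration=60)` reports a valid linked-duration link, while a link whose ChapterSegmentUUID equals its own SegmentUUID is flagged in either mode.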
Track Flags
Default flag
The “default track” flag is a hint for a
Matroska Player
indicating that a given track
SHOULD
be eligible to be automatically selected as the default track for a given
language. If no tracks in a given language have the default track flag set, then all tracks
in that language are eligible for automatic selection. This can be used to indicate that
a track provides “regular service” suitable for users with default settings, as opposed to
specialized services, such as commentary, hearing-impaired captions, or descriptive audio.
The
Matroska Player
MAY
override the “default track” flag for any reason, including
user preferences to prefer tracks providing accessibility services.
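As a minimal sketch of the eligibility rule, assuming tracks are represented as plain dictionaries with hypothetical `lang` and `default` keys:

```python
def eligible_tracks(tracks, language):
    """Tracks in `language` that may be auto-selected according to the default flag."""
    in_language = [t for t in tracks if t["lang"] == language]
    flagged = [t for t in in_language if t["default"]]
    # If no track in this language has the flag set, all of them are eligible.
    return flagged or in_language


tracks = [
    {"id": 1, "lang": "eng", "default": True},
    {"id": 2, "lang": "eng", "default": False},
    {"id": 3, "lang": "fra", "default": False},
    {"id": 4, "lang": "fra", "default": False},
]
print([t["id"] for t in eligible_tracks(tracks, "eng")])  # [1]
print([t["id"] for t in eligible_tracks(tracks, "fra")])  # [3, 4]
```

Since the flag is only a hint, a player remains free to apply its own preferences on top of this filter.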
Forced flag
The “forced” flag tells the
Matroska Player
that it
SHOULD
display this subtitle track,
even if user preferences usually would not call for any subtitles to be displayed alongside
the currently selected audio track. This can be used to indicate that a track contains translations
of onscreen text, or of dialogue spoken in a different language than the track’s primary one.
Hearing-impaired flag
The “hearing impaired” flag tells the
Matroska Player
that it
SHOULD
prefer this track
when selecting a default track for a hearing-impaired user, and that it
MAY
prefer to select
a different track when selecting a default track for a non-hearing-impaired user.
Visual-impaired flag
The “visual impaired” flag tells the
Matroska Player
that it
SHOULD
prefer this track
when selecting a default track for a visually-impaired user, and that it
MAY
prefer to select
a different track when selecting a default track for a non-visually-impaired user.
Descriptions flag
The “descriptions” flag tells the
Matroska Player
that this track is suitable to play via
a text-to-speech system for a visually-impaired user, and that it
SHOULD NOT
automatically
select this track when selecting a default track for a non-visually-impaired user.
Original flag
The “original” flag tells the
Matroska Player
that this track is in the original language,
and that it
SHOULD
prefer it if configured to prefer original-language tracks of this
track’s type.
Commentary flag
The “commentary” flag tells the
Matroska Player
that this track contains commentary on
the content.
Track Operation
TrackOperation
allows combining multiple tracks to make a virtual one. It uses
two separate systems to combine tracks: one to create a 3D “composition” (left/right/background planes)
and one to simply join two tracks together to make a single track.
A track created with
TrackOperation
is a proper track with a UID and all its flags.
However, the codec ID is meaningless because each “sub” track needs to be decoded by its
own decoder before the “operation” is applied. The
Cues Elements
corresponding to such
a virtual track
SHOULD
be the sum of the
Cues Elements
for each of the tracks it’s composed of (when the
Cues
are defined per track).
In the case of
TrackJoinBlocks
, the
Block Elements
(from
BlockGroup
and
SimpleBlock)
of all the tracks
SHOULD
be used as if they were defined for this new virtual
Track.
When two
Block Elements
have overlapping start or end timestamps, it’s up to the underlying
system to either drop some of these frames or render them the way they overlap.
This situation
SHOULD
be avoided when creating such tracks as you can never be sure
of the end result on different platforms.
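The joining behaviour can be pictured as a merge of the Blocks of every source track by timestamp. The sketch below is an assumption-laden illustration: each Block is modelled as a hypothetical `(start, end, label)` tuple, each track is already sorted by start time, and overlaps are only reported, since the text leaves their handling to the underlying system.

```python
import heapq


def join_blocks(*tracks):
    """Merge per-track block lists into one virtual track and report overlaps."""
    merged = list(heapq.merge(*tracks, key=lambda block: block[0]))
    overlaps = []
    latest = None  # block with the furthest end timestamp seen so far
    for block in merged:
        if latest is not None and block[0] < latest[1]:
            overlaps.append((latest, block))
        if latest is None or block[1] > latest[1]:
            latest = block
    return merged, overlaps


track_a = [(0, 40, "a0"), (40, 80, "a1")]
track_b = [(80, 120, "b0"), (100, 140, "b1")]
merged, overlaps = join_blocks(track_a, track_b)
print([label for _, _, label in merged])  # ['a0', 'a1', 'b0', 'b1']
print(overlaps)                           # [((80, 120, 'b0'), (100, 140, 'b1'))]
```

A muxer could run a check like this before writing a joined track and refuse, or warn, when `overlaps` is not empty.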
Overlay Track
Overlay tracks
SHOULD
be rendered in the same channel as the track they are linked to.
When content is found in such a track, it
SHOULD
be played on the rendering channel
instead of the original track.
Multi-planar and 3D videos
There are two different ways to compress 3D videos: either store each eye's view in a separate track,
or store both eyes combined in one track (which is more efficient, compression-wise).
Matroska supports both ways.
For the single track variant, there is the
StereoMode Element
, which defines how planes are
assembled in the track (mono or left-right combined). Odd values of StereoMode mean the left
plane comes first for more convenient reading. The pixel count of the track (PixelWidth/PixelHeight)
is the raw amount of pixels, for example 3840x1080 for full HD side by side, and the DisplayWidth/DisplayHeight
in pixels is the amount of pixels for one plane (1920x1080 for that full HD stream).
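The relation between the raw pixel count and the size of one plane is simple arithmetic. In the sketch below the layout names are hypothetical strings rather than StereoMode values, and the top-bottom case is an added assumption that goes beyond the side-by-side example in the text.

```python
def plane_size(pixel_width, pixel_height, layout):
    """Size of a single plane given the raw PixelWidth/PixelHeight of the track."""
    if layout == "mono":
        return pixel_width, pixel_height
    if layout == "side-by-side":
        return pixel_width // 2, pixel_height
    if layout == "top-bottom":  # assumption: planes stacked vertically
        return pixel_width, pixel_height // 2
    raise ValueError(f"unsupported layout: {layout}")


print(plane_size(3840, 1080, "side-by-side"))  # (1920, 1080)
```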
Old stereo 3D movies were displayed using anaglyph (cyan and red colors separated).
For compatibility with such movies, there is a value of the StereoMode that corresponds to anaglyph.
There is also a “packed” mode (values 13 and 14) which consists of packing two frames together
in a
Block
using lacing. The first frame is the left eye and the other frame is the right eye
(or vice versa). The frames
SHOULD
be decoded in that order and are possibly dependent
on each other (P and B frames).
For separate tracks, Matroska needs to define exactly which track does what.
TrackOperation
with
TrackCombinePlanes
does that. For more details on how TrackOperation works, see
(#track-operation).
The 3D support is still in its infancy and may evolve to support more features.
The StereoMode used to be part of Matroska v2 but it didn’t meet the requirement
for multiple tracks. There was also a bug in libmatroska prior to 0.9.0 that would save/read
it as 0x53B9 instead of 0x53B8; see OldStereoMode ((#oldstereomode-element)).
Matroska Readers
may support these legacy files by checking
for Matroska v2 or the 0x53B9 Element ID.
The older values of StereoMode were 0: mono, 1: right eye, 2: left eye, 3: both eyes; these are the only values that can be found in OldStereoMode.
They are not compatible with the StereoMode values found in Matroska v3 and above.
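A reader that wants to support such legacy files could branch on the two conditions named above. The following sketch is hypothetical: `interpret_stereo` is a made-up helper, and it assumes that "Matroska v2" is detected through a DocTypeVersion of 2.

```python
STEREO_MODE_ID = 0x53B8
OLD_STEREO_MODE_ID = 0x53B9
OLD_STEREO_VALUES = {0: "mono", 1: "right eye", 2: "left eye", 3: "both eyes"}


def interpret_stereo(element_id, value, doc_type_version):
    """Classify a stereo element as legacy or current and decode legacy values."""
    if element_id not in (STEREO_MODE_ID, OLD_STEREO_MODE_ID):
        raise ValueError("not a StereoMode or OldStereoMode element")
    # Legacy when written under the old ID or in a Matroska v2 file (assumption).
    if element_id == OLD_STEREO_MODE_ID or doc_type_version == 2:
        return "legacy", OLD_STEREO_VALUES.get(value, "unknown")
    # Matroska v3 and above: the value uses the current StereoMode numbering.
    return "current", value


print(interpret_stereo(0x53B9, 3, 4))  # ('legacy', 'both eyes')
print(interpret_stereo(0x53B8, 1, 4))  # ('current', 1)
```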
Default track selection
This section provides some example sets of Tracks and hypothetical user settings, along with
indications of which ones a similarly-configured
Matroska Player
SHOULD
automatically
select for playback by default in such a situation. A player
MAY
provide additional settings
with more detailed controls for more nuanced scenarios. These examples are provided as guidelines
to illustrate the intended usages of the various supported Track flags, and their expected behaviors.
Track names are shown in English for illustrative purposes; actual files may have titles
in the language of each track, or provide titles in multiple languages.
Audio Selection
Example track set:
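| No. | Type | Lang | Layout | Original | Default | Other flags | Name |
|---|---|---|---|---|---|---|---|
| 1 | Video | und | N/A | N/A | N/A | None | |
| 2 | Audio | eng | 5.1 | Yes | Yes | None | |
| 3 | Audio | eng | 2.0 | Yes | Yes | None | |
| 4 | Audio | eng | 2.0 | Yes | No | Visual-impaired | Descriptive audio |
| 5 | Audio | esp | 5.1 | No | Yes | None | |
| 6 | Audio | esp | 2.0 | No | No | Visual-impaired | Descriptive audio |
| 7 | Audio | eng | 2.0 | Yes | No | Commentary | Director’s Commentary |
| 8 | Audio | eng | 2.0 | Yes | No | None | Karaoke |

Table: Audio Tracks for default selection{#audioTrackSelection}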
Here we have a file with 7 audio tracks, of which 5 are in English and 2 are in Spanish.
The English tracks all have the Original flag, indicating that English is the original content language.
Generally the player will first consider the track languages: if the player has an option to prefer
original-language audio and the user has enabled it, then it should prefer one of the Original-flagged tracks.
If configured to specifically prefer audio tracks in English or Spanish, the player should select one of
the tracks in the corresponding language. The player may also wish to prefer an Original-flagged track
if no tracks matching any of the user’s explicitly-preferred languages are available.
Two of the tracks have the Visual-impaired flag. If the player has been configured to prefer such tracks,
it should select one; otherwise, it should avoid them if possible.
If selecting an English track, when other settings have left multiple possible options,
it may be useful to exclude the tracks that lack the Default flag: here, one provides descriptive service for
the visually impaired (which has its own flag and may be automatically selected by user configuration,
but is unsuitable for users with default-configured players), one is a commentary track
(which has its own flag, which the player may or may not have specialized handling for),
and the last contains karaoke versions of the music that plays during the film, which is an unusual
specialized audio service that Matroska has no built-in support for indicating, so it’s indicated
in the track name instead. By not setting the Default flag on these specialized tracks, the file’s author
hints that they should not be automatically selected by a default-configured player.
Having narrowed its choices down, our example player now may have to select between tracks 2 and 3.
The only difference between these tracks is their channel layouts: 2 is 5.1 surround, while 3 is stereo.
If the player is aware that the output device is a pair of headphones or stereo speakers, it may wish
to prefer the stereo mix automatically. On the other hand, if it knows that the device is a surround system,
it may wish to prefer the surround mix.
If the player finishes analyzing all of the available audio tracks and finds that multiple seem equally
and maximally preferable, it
SHOULD
default to the first of the group.
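The reasoning above can be condensed into a short Python sketch. It is one possible ordering of the checks, not a required algorithm: `AudioTrack`, `narrow`, and `select_audio` are hypothetical names, the Original and Default values of the English tracks follow the discussion above, and the flag values given to the two Spanish tracks are assumptions made for the example.

```python
from typing import FrozenSet, NamedTuple


class AudioTrack(NamedTuple):
    number: int
    lang: str
    layout: str
    original: bool
    default: bool
    flags: FrozenSet[str] = frozenset()


def narrow(tracks, keep):
    """Keep the matching tracks, unless that would leave nothing to choose from."""
    kept = [t for t in tracks if keep(t)]
    return kept or list(tracks)


def select_audio(tracks, languages=(), prefer_original=False,
                 prefer_visual_impaired=False, output_layout=None):
    if prefer_original:
        candidates = narrow(tracks, lambda t: t.original)
    else:
        in_language = [t for t in tracks if t.lang in languages]
        # No track in a preferred language: fall back to Original-flagged tracks.
        candidates = in_language or narrow(tracks, lambda t: t.original)
    candidates = narrow(
        candidates, lambda t: ("visual-impaired" in t.flags) == prefer_visual_impaired)
    candidates = narrow(candidates, lambda t: t.default)
    if output_layout is not None:
        candidates = narrow(candidates, lambda t: t.layout == output_layout)
    # Several equally preferable tracks: default to the first of the group.
    return candidates[0] if candidates else None


VI = frozenset({"visual-impaired"})
AUDIO = [
    AudioTrack(2, "eng", "5.1", True, True),
    AudioTrack(3, "eng", "2.0", True, True),
    AudioTrack(4, "eng", "2.0", True, False, VI),
    AudioTrack(5, "esp", "5.1", False, True),    # flag values assumed
    AudioTrack(6, "esp", "2.0", False, False, VI),  # flag values assumed
    AudioTrack(7, "eng", "2.0", True, False, frozenset({"commentary"})),
    AudioTrack(8, "eng", "2.0", True, False),
]

print(select_audio(AUDIO, languages=["eng"]).number)                       # 2
print(select_audio(AUDIO, languages=["eng"], output_layout="2.0").number)  # 3
```

Each step only narrows the candidate list when at least one track survives, which mirrors the "avoid them if possible" wording of the text.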
Subtitle selection
Example track set:
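| No. | Type | Lang | Original | Default | Forced | Other flags | Name |
|---|---|---|---|---|---|---|---|
| 1 | Video | und | N/A | N/A | N/A | None | |
| 2 | Audio | fra | Yes | Yes | N/A | None | |
| 3 | Audio | por | No | Yes | N/A | None | |
| 4 | Subtitles | fra | Yes | Yes | No | None | |
| 5 | Subtitles | fra | Yes | No | No | Hearing-impaired | Captions for the hearing-impaired |
| 6 | Subtitles | por | No | Yes | No | None | |
| 7 | Subtitles | por | No | No | Yes | None | Signs |
| 8 | Subtitles | por | No | No | No | Hearing-impaired | SDH |

Table: Subtitle Tracks for default selection{#subtitleTrackSelection}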
Here we have 2 audio tracks and 5 subtitle tracks. As we can see, French is the original language.
We’ll start by discussing the case where the user prefers French (or Original-language)
audio (or has explicitly selected the French audio track), and also prefers French subtitles.
In this case, if the player isn’t configured to display captions when the audio matches their
preferred subtitle languages, the player doesn’t need to select a subtitle track at all.
If the user
has
indicated that they want captions to be displayed, the selection simply
comes down to whether Hearing-impaired subtitles are preferred.
The situation for a user who prefers Portuguese subtitles starts out somewhat analogous.
If they select the original French audio (either by explicit audio language preference,
preference for Original-language tracks, or by explicitly selecting that track), then the
selection once again comes down to the hearing-impaired preference.
However, the case where the Portuguese audio track is selected has an important catch:
a Forced track in Portuguese is present. This may contain translations of onscreen text
from the video track, or of portions of the audio that are not translated (music, for instance).
This means that even if the user’s preferences wouldn’t normally call for captions here,
the Forced track should be selected nonetheless, rather than selecting no track at all.
On the other hand, if the user’s preferences
do
call for captions, the non-Forced tracks
should be preferred, as the Forced track will not contain captioning for the dialogue.
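The subtitle cases discussed above can be sketched as follows. The names `SubtitleTrack` and `select_subtitles` are hypothetical, the sketch assumes that the Portuguese "Signs" track is the Forced one, and it treats "the audio language differs from the preferred subtitle language" as the condition under which captions are wanted by default.

```python
from typing import NamedTuple


class SubtitleTrack(NamedTuple):
    number: int
    lang: str
    forced: bool = False
    hearing_impaired: bool = False


def select_subtitles(tracks, audio_lang, subtitle_lang,
                     always_show_captions=False, prefer_hearing_impaired=False):
    in_language = [t for t in tracks if t.lang == subtitle_lang]
    wants_captions = always_show_captions or audio_lang != subtitle_lang
    if wants_captions:
        # Full captions wanted: skip Forced tracks, which lack the dialogue.
        full = [t for t in in_language if not t.forced]
        matching = [t for t in full if t.hearing_impaired == prefer_hearing_impaired]
        chosen = matching or full
        return chosen[0] if chosen else None
    # No captions wanted: still show a Forced track if one exists.
    forced = [t for t in in_language if t.forced]
    return forced[0] if forced else None


SUBTITLES = [
    SubtitleTrack(4, "fra"),
    SubtitleTrack(5, "fra", hearing_impaired=True),
    SubtitleTrack(6, "por"),
    SubtitleTrack(7, "por", forced=True),  # "Signs", assumed to be the Forced track
    SubtitleTrack(8, "por", hearing_impaired=True),
]

print(select_subtitles(SUBTITLES, "fra", "fra"))         # None
print(select_subtitles(SUBTITLES, "por", "por").number)  # 7
print(select_subtitles(SUBTITLES, "fra", "por").number)  # 6
```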