HTML Standard
13.2
Parsing HTML documents
13.2.1
Overview of the parsing model
13.2.2
Parse errors
13.2.3
The input byte stream
13.2.3.1
Parsing with a known character encoding
13.2.3.2
Determining the character encoding
13.2.3.3
Character encodings
13.2.3.4
Changing the encoding while parsing
13.2.3.5
Preprocessing the input stream
13.2.4
Parse state
13.2.4.1
The insertion mode
13.2.4.2
The stack of open elements
13.2.4.3
The list of active formatting elements
13.2.4.4
The element pointers
13.2.4.5
Other parsing state flags
13.2.5
Tokenization
13.2.5.1
Data state
13.2.5.2
RCDATA state
13.2.5.3
RAWTEXT state
13.2.5.4
Script data state
13.2.5.5
PLAINTEXT state
13.2.5.6
Tag open state
13.2.5.7
End tag open state
13.2.5.8
Tag name state
13.2.5.9
RCDATA less-than sign state
13.2.5.10
RCDATA end tag open state
13.2.5.11
RCDATA end tag name state
13.2.5.12
RAWTEXT less-than sign state
13.2.5.13
RAWTEXT end tag open state
13.2.5.14
RAWTEXT end tag name state
13.2.5.15
Script data less-than sign state
13.2.5.16
Script data end tag open state
13.2.5.17
Script data end tag name state
13.2.5.18
Script data escape start state
13.2.5.19
Script data escape start dash state
13.2.5.20
Script data escaped state
13.2.5.21
Script data escaped dash state
13.2.5.22
Script data escaped dash dash state
13.2.5.23
Script data escaped less-than sign state
13.2.5.24
Script data escaped end tag open state
13.2.5.25
Script data escaped end tag name state
13.2.5.26
Script data double escape start state
13.2.5.27
Script data double escaped state
13.2.5.28
Script data double escaped dash state
13.2.5.29
Script data double escaped dash dash state
13.2.5.30
Script data double escaped less-than sign state
13.2.5.31
Script data double escape end state
13.2.5.32
Before attribute name state
13.2.5.33
Attribute name state
13.2.5.34
After attribute name state
13.2.5.35
Before attribute value state
13.2.5.36
Attribute value (double-quoted) state
13.2.5.37
Attribute value (single-quoted) state
13.2.5.38
Attribute value (unquoted) state
13.2.5.39
After attribute value (quoted) state
13.2.5.40
Self-closing start tag state
13.2.5.41
Bogus comment state
13.2.5.42
Markup declaration open state
13.2.5.43
Comment start state
13.2.5.44
Comment start dash state
13.2.5.45
Comment state
13.2.5.46
Comment less-than sign state
13.2.5.47
Comment less-than sign bang state
13.2.5.48
Comment less-than sign bang dash state
13.2.5.49
Comment less-than sign bang dash dash state
13.2.5.50
Comment end dash state
13.2.5.51
Comment end state
13.2.5.52
Comment end bang state
13.2.5.53
DOCTYPE state
13.2.5.54
Before DOCTYPE name state
13.2.5.55
DOCTYPE name state
13.2.5.56
After DOCTYPE name state
13.2.5.57
After DOCTYPE public keyword state
13.2.5.58
Before DOCTYPE public identifier state
13.2.5.59
DOCTYPE public identifier (double-quoted) state
13.2.5.60
DOCTYPE public identifier (single-quoted) state
13.2.5.61
After DOCTYPE public identifier state
13.2.5.62
Between DOCTYPE public and system identifiers state
13.2.5.63
After DOCTYPE system keyword state
13.2.5.64
Before DOCTYPE system identifier state
13.2.5.65
DOCTYPE system identifier (double-quoted) state
13.2.5.66
DOCTYPE system identifier (single-quoted) state
13.2.5.67
After DOCTYPE system identifier state
13.2.5.68
Bogus DOCTYPE state
13.2.5.69
CDATA section state
13.2.5.70
CDATA section bracket state
13.2.5.71
CDATA section end state
13.2.5.72
Character reference state
13.2.5.73
Named character reference state
13.2.5.74
Ambiguous ampersand state
13.2.5.75
Numeric character reference state
13.2.5.76
Hexadecimal character reference start state
13.2.5.77
Decimal character reference start state
13.2.5.78
Hexadecimal character reference state
13.2.5.79
Decimal character reference state
13.2.5.80
Numeric character reference end state
13.2.6
Tree construction
13.2.6.1
Creating and inserting nodes
13.2.6.2
Parsing elements that contain only text
13.2.6.3
Closing elements that have implied end tags
13.2.6.4
The rules for parsing tokens in HTML content
13.2.6.4.1
The "initial" insertion mode
13.2.6.4.2
The "before html" insertion mode
13.2.6.4.3
The "before head" insertion mode
13.2.6.4.4
The "in head" insertion mode
13.2.6.4.5
The "in head noscript" insertion mode
13.2.6.4.6
The "after head" insertion mode
13.2.6.4.7
The "in body" insertion mode
13.2.6.4.8
The "text" insertion mode
13.2.6.4.9
The "in table" insertion mode
13.2.6.4.10
The "in table text" insertion mode
13.2.6.4.11
The "in caption" insertion mode
13.2.6.4.12
The "in column group" insertion mode
13.2.6.4.13
The "in table body" insertion mode
13.2.6.4.14
The "in row" insertion mode
13.2.6.4.15
The "in cell" insertion mode
13.2.6.4.16
The "in template" insertion mode
13.2.6.4.17
The "after body" insertion mode
13.2.6.4.18
The "in frameset" insertion mode
13.2.6.4.19
The "after frameset" insertion mode
13.2.6.4.20
The "after after body" insertion mode
13.2.6.4.21
The "after after frameset" insertion mode
13.2.6.5
The rules for parsing tokens in foreign content
13.2.7
The end
13.2.8
Speculative HTML parsing
13.2.9
Coercing an HTML DOM into an infoset
13.2.10
An introduction to error handling and strange cases in the parser
13.2.10.1
Misnested tags: <b><i></b></i>
13.2.10.2
Misnested tags: <b><p></b></p>
13.2.10.3
Unexpected markup in tables
13.2.10.4
Scripts that modify the page as it is being parsed
13.2.10.5
The execution of scripts that are moving across multiple documents
13.2.10.6
Unclosed formatting elements
13.3
Serializing HTML fragments
13.4
Parsing HTML fragments
13.2
Parsing HTML documents
This section only applies to user agents, data mining tools, and conformance
checkers.
The rules for parsing XML documents into DOM trees are covered by the next
section, entitled "The XML syntax".
User agents must use the parsing rules described in this section to generate the DOM trees from
text/html resources. Together, these rules define what is referred to as the HTML parser.
While the HTML syntax described in this specification bears a close resemblance to SGML and
XML, it is a separate language with its own parsing rules.
Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used
SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for
HTML documents; the only user agents to strictly handle HTML as an SGML application have
historically been validators. The resulting confusion — with validators claiming documents
to have one representation while widely deployed web browsers interoperably implemented a
different representation — has wasted decades of productivity. This version of HTML thus
returns to a non-SGML basis.
For the purposes of conformance checkers, if a resource is determined to be in the HTML syntax,
then it is an HTML document.
As stated in the terminology section, references to element types that do not explicitly specify a
namespace always refer to elements in the HTML namespace. For example, if the spec talks about
"a menu element", then that is an element with the local name "menu", the namespace
"http://www.w3.org/1999/xhtml", and the interface HTMLMenuElement. Where possible, references to
such elements are hyperlinked to their definition.
13.2.1
Overview of the parsing model
The input to the HTML parsing process consists of a stream of code points, which is passed through
a tokenization stage followed by a tree construction stage. The output is a Document object.
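As a rough illustration of the two-stage shape described above, the following Python sketch feeds a string of code points through a toy tokenizer and then a toy tree builder. It is not the algorithm defined by this specification: the names `tokenize`, `construct_tree`, `outline`, and `Node` are hypothetical, and the sketch assumes input made only of simple start tags, end tags, and text (no attributes, comments, DOCTYPEs, or character references).

```python
import re
from dataclasses import dataclass, field

# Toy token pattern: an end tag, a start tag, or a run of text.
# Anything else (attributes, comments, DOCTYPEs) is outside this sketch.
TOKEN_RE = re.compile(r"</([a-zA-Z][a-zA-Z0-9]*)>|<([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")


@dataclass
class Node:
    name: str
    data: str = ""
    children: list = field(default_factory=list)


def tokenize(code_points):
    """Stage 1: turn a string of code points into a stream of tokens."""
    for match in TOKEN_RE.finditer(code_points):
        end_tag, start_tag, text = match.groups()
        if end_tag:
            yield ("end", end_tag.lower())
        elif start_tag:
            yield ("start", start_tag.lower())
        else:
            yield ("text", text)


def construct_tree(tokens):
    """Stage 2: consume tokens and build a tree rooted at a document node."""
    document = Node("#document")
    open_elements = [document]
    for kind, value in tokens:
        if kind == "start":
            element = Node(value)
            open_elements[-1].children.append(element)
            open_elements.append(element)
        elif kind == "end":
            # Pop back to the matching open element; ignore stray end tags.
            names = [node.name for node in open_elements]
            if value in names[1:]:
                while open_elements.pop().name != value:
                    pass
        else:
            open_elements[-1].children.append(Node("#text", data=value))
    return document


def outline(node, depth=0):
    """Render the tree as indented lines, one node per line."""
    label = repr(node.data) if node.name == "#text" else node.name
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


document = construct_tree(tokenize("<p>Hello <b>world</b></p>"))
print("\n".join(outline(document)))
```

Because `tokenize` is a generator, the tree builder pulls tokens one at a time, which loosely mirrors how the tree construction stage handles each token as it is emitted.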
Implementations that do not support scripting do not have to actually create a DOM Document
object, but the DOM tree in such cases is still used as the model for the rest of the
specification.
In the common case, the data handled by the tokenization stage comes from the network, but
it can also come from script running in the user agent, e.g. using the document.write() API.
There is only one set of states for the tokenizer stage and the tree
construction stage, but the tree construction stage is reentrant, meaning that while the tree
construction stage is handling one token, the tokenizer might be resumed, causing further tokens
to be emitted and processed before the first token's processing is complete.
In the following example, the tree construction stage will be called upon to handle a "p"
start tag token while handling the "script" end tag token:
...
<script>
 document.write('<p>');
</script>
...
To handle these cases, parsers have a script nesting level, which must be initially set to zero,
and a parser pause flag, which must be initially set to false.
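The reentrancy described above can be sketched with a toy model in Python. This is a simplification under stated assumptions, not the specification's algorithm: `ToyParser`, its `scripts` mapping, and the `write_p` script name are all hypothetical, a script's text is treated as a key that selects a Python callable, and `write()` simply processes the written markup in a nested call instead of modelling the insertion point in the input stream. The parser pause flag is initialized as the text requires but is not exercised here.

```python
import re

# Toy token pattern: an end tag, a start tag, or a run of text.
TOKEN_RE = re.compile(r"</(\w+)>|<(\w+)>|([^<]+)")


class ToyParser:
    """Toy model of a reentrant parser (illustrative only)."""

    def __init__(self, scripts):
        self.scripts = scripts          # hypothetical: script text -> callable(parser)
        self.script_nesting_level = 0   # initially zero
        self.parser_pause_flag = False  # initially false; unused in this toy
        self.log = []

    def write(self, markup):
        """Stand-in for document.write(): tokenize and handle markup now."""
        self.process(markup)

    def process(self, markup):
        script_text = None  # not None while inside a script element
        for match in TOKEN_RE.finditer(markup):
            end_tag, start_tag, text = match.groups()
            if start_tag:
                self.log.append(f"start {start_tag}")
                if start_tag == "script":
                    script_text = ""
            elif end_tag == "script" and script_text is not None:
                self.log.append("end script: begin")
                self.script_nesting_level += 1
                # Running the script may call write(), which re-enters process().
                self.scripts[script_text.strip()](self)
                self.script_nesting_level -= 1
                self.log.append("end script: done")
                script_text = None
            elif end_tag:
                self.log.append(f"end {end_tag}")
            elif script_text is not None:
                script_text += text
            else:
                self.log.append(f"text {text!r}")


parser = ToyParser({"write_p": lambda p: p.write("<p>")})
parser.process("<script>write_p</script>")
print(parser.log)
```

In the resulting log, the "start p" entry sits between "end script: begin" and "end script: done", which is the situation the example describes: the "p" start tag token is handled while the "script" end tag token is still being processed.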
13.2.2
Parse errors
This specification defines the parsing rules for HTML documents, whether they are syntactically
correct or not. Certain points in the parsing algorithm are said to be parse errors. The error
handling for parse errors is well-defined (that's the processing rules described throughout this
specification), but user agents, while parsing an HTML document, may abort the parser at the first
parse error that they encounter for which they do not wish to apply the rules described in this
specification.
Conformance checkers must report at least one parse error condition to the user if one or more
parse error conditions exist in the document and must not report parse error conditions if none
exist in the document. Conformance checkers may report more than one parse error condition if more
than one parse error condition exists in the document.
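The reporting requirement for conformance checkers can be read as a small decision rule. The Python sketch below is one hedged interpretation; the function name `select_errors_to_report` and the `report_all` switch are hypothetical and are not defined by this specification.

```python
def select_errors_to_report(parse_errors, report_all=True):
    """Pick which parse error conditions a checker shows to the user.

    parse_errors: the conditions found in the document, in document order.
    report_all:   assumed checker policy; reporting more than one is optional.
    """
    if not parse_errors:
        # No parse error conditions exist, so none may be reported.
        return []
    if report_all:
        return list(parse_errors)
    # At least one condition must be reported when any exist.
    return [parse_errors[0]]


found = ["duplicate-attribute", "end-tag-with-attributes"]
print(select_errors_to_report(found, report_all=False))
```

Either policy satisfies the requirement as long as the result is empty exactly when the input is empty.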
Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors,
conformance checkers will also verify that the document obeys all the other conformance
requirements described in this specification.
Some parse errors have dedicated codes outlined in the table below that should be used by
conformance checkers in reports.
Error descriptions in the table below are non-normative.
Code
Description
abrupt-closing-of-empty-comment
This error occurs if the parser encounters an empty comment that is abruptly closed by a
U+003E (>) code point (i.e., <!--> or <!--->). The parser behaves as if the comment is closed
correctly.
abrupt-doctype-public-identifier
This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE public
identifier (e.g., <!DOCTYPE html PUBLIC "foo>). In such a case, if the DOCTYPE is correctly
placed as a document preamble, the parser sets the Document to quirks mode.
abrupt-doctype-system-identifier
This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE system
identifier (e.g., <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "foo>). In such a case,
if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to
quirks mode.
absence-of-digits-in-numeric-character-reference
This error occurs if the parser encounters a numeric character reference that doesn't contain
any digits (e.g., &#qux;). In this case the parser doesn't resolve the character reference.
cdata-in-html-content
This error occurs if the parser encounters a CDATA section outside of foreign content (SVG or
MathML). The parser treats such CDATA sections (including leading "[CDATA[" and trailing "]]")
as comments.
character-reference-outside-unicode-range
This error occurs if the parser encounters a numeric character reference that references a
code point that is greater than the valid Unicode range. The parser resolves such a character
reference to a U+FFFD REPLACEMENT CHARACTER.
control-character-in-input-stream
This error occurs if the input stream contains a control code point that is not ASCII
whitespace or U+0000 NULL. Such code points are parsed as-is and usually, where parsing
rules don't apply any additional restrictions, make their way into the DOM.
control-character-reference
This error occurs if the parser encounters a numeric character reference that references a
control code point that is not ASCII whitespace or is a U+000D CARRIAGE RETURN. The parser
resolves such character references as-is except C1 control references that are replaced
according to the numeric character reference end state.
duplicate-attribute
This error occurs if the parser encounters an attribute in a tag that already has an attribute
with the same name. The parser ignores all such duplicate occurrences of the attribute.
end-tag-with-attributes
This error occurs if the parser encounters an end tag with attributes. Attributes in end tags
are ignored and do not make their way into the DOM.
end-tag-with-trailing-solidus
This error occurs if the parser encounters an end tag that has a U+002F (/) code point right
before the closing U+003E (>) code point (e.g.,