HTML Standard

https://html.spec.whatwg.org/multipage/parsing.html Archived on 2026-04-24 22:25 UTC

13.2 Parsing HTML documents
13.2.1 Overview of the parsing model
13.2.2 Parse errors
13.2.3 The input byte stream
13.2.3.1 Parsing with a known character encoding
13.2.3.2 Determining the character encoding
13.2.3.3 Character encodings
13.2.3.4 Changing the encoding while parsing
13.2.3.5 Preprocessing the input stream
13.2.4 Parse state
13.2.4.1 The insertion mode
13.2.4.2 The stack of open elements
13.2.4.3 The list of active formatting elements
13.2.4.4 The element pointers
13.2.4.5 Other parsing state flags
13.2.5 Tokenization
13.2.5.1 Data state
13.2.5.2 RCDATA state
13.2.5.3 RAWTEXT state
13.2.5.4 Script data state
13.2.5.5 PLAINTEXT state
13.2.5.6 Tag open state
13.2.5.7 End tag open state
13.2.5.8 Tag name state
13.2.5.9 RCDATA less-than sign state
13.2.5.10 RCDATA end tag open state
13.2.5.11 RCDATA end tag name state
13.2.5.12 RAWTEXT less-than sign state
13.2.5.13 RAWTEXT end tag open state
13.2.5.14 RAWTEXT end tag name state
13.2.5.15 Script data less-than sign state
13.2.5.16 Script data end tag open state
13.2.5.17 Script data end tag name state
13.2.5.18 Script data escape start state
13.2.5.19 Script data escape start dash state
13.2.5.20 Script data escaped state
13.2.5.21 Script data escaped dash state
13.2.5.22 Script data escaped dash dash state
13.2.5.23 Script data escaped less-than sign state
13.2.5.24 Script data escaped end tag open state
13.2.5.25 Script data escaped end tag name state
13.2.5.26 Script data double escape start state
13.2.5.27 Script data double escaped state
13.2.5.28 Script data double escaped dash state
13.2.5.29 Script data double escaped dash dash state
13.2.5.30 Script data double escaped less-than sign state
13.2.5.31 Script data double escape end state
13.2.5.32 Before attribute name state
13.2.5.33 Attribute name state
13.2.5.34 After attribute name state
13.2.5.35 Before attribute value state
13.2.5.36 Attribute value (double-quoted) state
13.2.5.37 Attribute value (single-quoted) state
13.2.5.38 Attribute value (unquoted) state
13.2.5.39 After attribute value (quoted) state
13.2.5.40 Self-closing start tag state
13.2.5.41 Bogus comment state
13.2.5.42 Markup declaration open state
13.2.5.43 Comment start state
13.2.5.44 Comment start dash state
13.2.5.45 Comment state
13.2.5.46 Comment less-than sign state
13.2.5.47 Comment less-than sign bang state
13.2.5.48 Comment less-than sign bang dash state
13.2.5.49 Comment less-than sign bang dash dash state
13.2.5.50 Comment end dash state
13.2.5.51 Comment end state
13.2.5.52 Comment end bang state
13.2.5.53 DOCTYPE state
13.2.5.54 Before DOCTYPE name state
13.2.5.55 DOCTYPE name state
13.2.5.56 After DOCTYPE name state
13.2.5.57 After DOCTYPE public keyword state
13.2.5.58 Before DOCTYPE public identifier state
13.2.5.59 DOCTYPE public identifier (double-quoted) state
13.2.5.60 DOCTYPE public identifier (single-quoted) state
13.2.5.61 After DOCTYPE public identifier state
13.2.5.62 Between DOCTYPE public and system identifiers state
13.2.5.63 After DOCTYPE system keyword state
13.2.5.64 Before DOCTYPE system identifier state
13.2.5.65 DOCTYPE system identifier (double-quoted) state
13.2.5.66 DOCTYPE system identifier (single-quoted) state
13.2.5.67 After DOCTYPE system identifier state
13.2.5.68 Bogus DOCTYPE state
13.2.5.69 CDATA section state
13.2.5.70 CDATA section bracket state
13.2.5.71 CDATA section end state
13.2.5.72 Character reference state
13.2.5.73 Named character reference state
13.2.5.74 Ambiguous ampersand state
13.2.5.75 Numeric character reference state
13.2.5.76 Hexadecimal character reference start state
13.2.5.77 Decimal character reference start state
13.2.5.78 Hexadecimal character reference state
13.2.5.79 Decimal character reference state
13.2.5.80 Numeric character reference end state
13.2.6 Tree construction
13.2.6.1 Creating and inserting nodes
13.2.6.2 Parsing elements that contain only text
13.2.6.3 Closing elements that have implied end tags
13.2.6.4 The rules for parsing tokens in HTML content
13.2.6.4.1 The "initial" insertion mode
13.2.6.4.2 The "before html" insertion mode
13.2.6.4.3 The "before head" insertion mode
13.2.6.4.4 The "in head" insertion mode
13.2.6.4.5 The "in head noscript" insertion mode
13.2.6.4.6 The "after head" insertion mode
13.2.6.4.7 The "in body" insertion mode
13.2.6.4.8 The "text" insertion mode
13.2.6.4.9 The "in table" insertion mode
13.2.6.4.10 The "in table text" insertion mode
13.2.6.4.11 The "in caption" insertion mode
13.2.6.4.12 The "in column group" insertion mode
13.2.6.4.13 The "in table body" insertion mode
13.2.6.4.14 The "in row" insertion mode
13.2.6.4.15 The "in cell" insertion mode
13.2.6.4.16 The "in template" insertion mode
13.2.6.4.17 The "after body" insertion mode
13.2.6.4.18 The "in frameset" insertion mode
13.2.6.4.19 The "after frameset" insertion mode
13.2.6.4.20 The "after after body" insertion mode
13.2.6.4.21 The "after after frameset" insertion mode
13.2.6.5 The rules for parsing tokens in foreign content
13.2.7 The end
13.2.8 Speculative HTML parsing
13.2.9 Coercing an HTML DOM into an infoset
13.2.10 An introduction to error handling and strange cases in the parser
13.2.10.1 Misnested tags: <b><i></b></i>
13.2.10.2 Misnested tags: <b><p></b></p>
13.2.10.3 Unexpected markup in tables
13.2.10.4 Scripts that modify the page as it is being parsed
13.2.10.5 The execution of scripts that are moving across multiple documents
13.2.10.6 Unclosed formatting elements
13.3 Serializing HTML fragments
13.4 Parsing HTML fragments

13.2 Parsing HTML documents

This section only applies to user agents, data mining tools, and conformance checkers.

The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XML syntax".

User agents must use the parsing rules described in this section to generate the DOM trees from text/html resources. Together, these rules define what is referred to as the HTML parser.

While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.

Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion, with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation, has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.

For the purposes of conformance checkers, if a resource is determined to be in the HTML syntax, then it is an HTML document.

As stated in the terminology section, references to element types that do not explicitly specify a namespace always refer to elements in the HTML namespace. For example, if the spec talks about "a menu element", then that is an element with the local name "menu", the namespace "http://www.w3.org/1999/xhtml", and the interface HTMLMenuElement. Where possible, references to such elements are hyperlinked to their definition.

13.2.1 Overview of the parsing model

The input to the HTML parsing process consists of a stream of code points, which is passed through a tokenization stage followed by a tree construction stage. The output is a Document object.

Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.

In the common case, the data handled by the tokenization stage comes from the network, but it can also come from script running in the user agent, e.g. using the document.write() API.

There is only one set of states for the tokenizer stage and the tree construction stage, but the tree construction stage is reentrant, meaning that while the tree construction stage is handling one token, the tokenizer might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete.

In the following example, the tree construction stage will be called upon to handle a "p" start tag token while handling the "script" end tag token:

    ...
    <script>
     document.write('<p>');
    </script>
    ...

To handle these cases, parsers have a script nesting level, which must be initially set to zero, and a parser pause flag, which must be initially set to false.

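The reentrancy described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyParser`, its `write()` method, the token shape, and the `run` callback are all hypothetical stand-ins, not anything defined by the specification. It shows a "p" start tag token being handled while the "script" end tag token is still in progress, with a counter playing the role of the script nesting level.

```javascript
// Toy model: tokens are plain objects; a "script" end tag token may carry a
// hypothetical run() callback standing in for the script's execution.
class ToyParser {
  constructor() {
    this.scriptNestingLevel = 0;   // initially zero
    this.parserPauseFlag = false;  // initially false (not exercised in this toy)
    this.log = [];
  }

  // Stand-in for document.write(): the written tokens are processed
  // immediately, before the caller's token has finished processing.
  write(tokens) {
    for (const token of tokens) this.handleToken(token);
  }

  // Stand-in for the tree construction stage.
  handleToken(token) {
    this.log.push(`begin ${token.kind} ${token.name} (nesting ${this.scriptNestingLevel})`);
    if (token.kind === "endTag" && token.name === "script" && token.run) {
      this.scriptNestingLevel += 1;
      token.run(this);             // may re-enter handleToken()
      this.scriptNestingLevel -= 1;
    }
    this.log.push(`end ${token.kind} ${token.name}`);
  }
}

const toyParser = new ToyParser();
toyParser.handleToken({
  kind: "endTag",
  name: "script",
  run: (parser) => parser.write([{ kind: "startTag", name: "p" }]),
});
```

In this toy, the log records the "p" start tag beginning and ending between the begin and end entries of the "script" end tag, which is the nesting the example markup produces.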
13.2.2 Parse errors

This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.

Conformance checkers must report at least one parse error condition to the user if one or more parse error conditions exist in the document and must not report parse error conditions if none exist in the document. Conformance checkers may report more than one parse error condition if more than one parse error condition exists in the document.

Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors, conformance checkers will also verify that the document obeys all the other conformance requirements described in this specification.

Some parse errors have dedicated codes outlined in the table below that should be used by conformance checkers in reports.

Error descriptions in the table below are non-normative.

Code
    Description

abrupt-closing-of-empty-comment
    This error occurs if the parser encounters an empty comment that is abruptly closed by a U+003E (>) code point (i.e., <!--> or <!--->). The parser behaves as if the comment is closed correctly.

abrupt-doctype-public-identifier
    This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE public identifier (e.g., <!DOCTYPE html PUBLIC "foo>). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

abrupt-doctype-system-identifier
    This error occurs if the parser encounters a U+003E (>) code point in the DOCTYPE system identifier (e.g., <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "foo>). In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

absence-of-digits-in-numeric-character-reference
    This error occurs if the parser encounters a numeric character reference that doesn't contain any digits (e.g., &#qux;). In this case the parser doesn't resolve the character reference.

cdata-in-html-content
    This error occurs if the parser encounters a CDATA section outside of foreign content (SVG or MathML). The parser treats such CDATA sections (including leading "[CDATA[" and trailing "]]") as comments.

character-reference-outside-unicode-range
    This error occurs if the parser encounters a numeric character reference that references a code point that is greater than the valid Unicode range. The parser resolves such a character reference to a U+FFFD REPLACEMENT CHARACTER.

control-character-in-input-stream
    This error occurs if the input stream contains a control code point that is not ASCII whitespace or U+0000 NULL. Such code points are parsed as-is and usually, where parsing rules don't apply any additional restrictions, make their way into the DOM.

control-character-reference
    This error occurs if the parser encounters a numeric character reference that references a control code point that is not ASCII whitespace or is a U+000D CARRIAGE RETURN. The parser resolves such character references as-is except C1 control references that are replaced according to the numeric character reference end state.

duplicate-attribute
    This error occurs if the parser encounters an attribute in a tag that already has an attribute with the same name. The parser ignores all such duplicate occurrences of the attribute.

end-tag-with-attributes
    This error occurs if the parser encounters an end tag with attributes. Attributes in end tags are ignored and do not make their way into the DOM.

end-tag-with-trailing-solidus
    This error occurs if the parser encounters an end tag that has a U+002F (/) code point right before the closing U+003E (>) code point (e.g., </div/>). Such a tag is treated as a regular end tag.

eof-before-tag-name
    This error occurs if the parser encounters the end of the input stream where a tag name is expected. In this case the parser treats the beginning of a start tag (i.e., <) or an end tag (i.e., </) as text content.

eof-in-cdata
    This error occurs if the parser encounters the end of the input stream in a CDATA section. The parser treats such CDATA sections as if they are closed immediately before the end of the input stream.

eof-in-comment
    This error occurs if the parser encounters the end of the input stream in a comment. The parser treats such comments as if they are closed immediately before the end of the input stream.

eof-in-doctype
    This error occurs if the parser encounters the end of the input stream in a DOCTYPE. In such a case, if the DOCTYPE is correctly placed as a document preamble, the parser sets the Document to quirks mode.

eof-in-script-html-comment-like-text
    This error occurs if the parser encounters the end of the input stream in text that resembles an HTML comment inside script element content (e.g., <script><!-- foo).

[...]

[...] "</script>", or having a p element that contains a ul element (as the ul element's start tag would imply the end tag for the p).

This can enable cross-site scripting attacks. An example of this would be a page that lets the user enter some font family names that are then inserted into a CSS style block via the DOM and which then uses the innerHTML IDL attribute to get the HTML serialization of that style element: if the user enters "</style><script>attack</script>" as a font family name, innerHTML will return markup that, if parsed in a different context, would contain a script node, even though no script node existed in the original DOM.

For example, consider the following markup:

    <form id="outer"><div></form><form id="inner"><input>

This will be parsed into:

    html
      head
      body
        form id="outer"
          div
            form id="inner"
              input

The input element will be associated with the inner form element. Now, if this tree structure is serialized and reparsed, the <form id="inner"> start tag will be ignored, and so the input element will be associated with the outer form element instead.

    <html><head></head><body><form id="outer"><div><form id="inner"><input></form></div></form></body></html>

    html
      head
      body
        form id="outer"
          div
            input

As another example, consider the following markup:

    <a><table><a>

This will be parsed into:

    html
      head
      body
        a
          a
          table

That is, the a elements are nested, because the second a element is foster parented. After a serialize-reparse roundtrip, the a elements and the table element would all be siblings, because the second <a> start tag implicitly closes the first a element.

    <html><head></head><body><a><a></a><table></table></a></body></html>

    html
      head
      body
        a
        a
        table

For historical reasons, this algorithm does not round-trip an initial U+000A (LF) character in pre, textarea, or listing elements, even though (in the first two cases) the markup being round-tripped can be conforming. The HTML parser will drop such a character during parsing, but this algorithm does not serialize an extra U+000A (LF) character.

For example, consider the following markup:

    <pre>

    Hello.</pre>

When this document is first parsed, the pre element's child text content starts with a single newline character. After a serialize-reparse roundtrip, the pre element's child text content is simply "Hello.".

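The lost newline can be traced with a small model of the two halves of the roundtrip. This is a sketch only: `parsedTextContent` and `serializeText` are hypothetical helpers that model just the "drop one leading LF" parser behavior and a serializer that adds nothing back. Escaping is ignored, on the assumption that the text contains no characters that need it.

```javascript
// Hypothetical helper: models only the parser rule that drops a single
// U+000A (LF) immediately after a pre, textarea, or listing start tag.
function parsedTextContent(localName, rawInnerMarkup) {
  const dropsLeadingNewline = ["pre", "textarea", "listing"].includes(localName);
  if (dropsLeadingNewline && rawInnerMarkup.startsWith("\n")) {
    return rawInnerMarkup.slice(1);
  }
  return rawInnerMarkup;
}

// Hypothetical helper: the serializer emits the text as-is and does not
// re-add the newline that the parser would drop.
function serializeText(text) {
  return text;
}

// Markup from the example: "<pre>", LF, LF, "Hello.</pre>".
const firstParse = parsedTextContent("pre", "\n\nHello.");
const secondParse = parsedTextContent("pre", serializeText(firstParse));
```

Here `firstParse` keeps one newline before "Hello.", while `secondParse` has none, matching the behavior described above.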
Because of the special role of the is attribute in signaling the creation of customized built-in elements, in that it provides a mechanism for parsed HTML to set the element's is value, we special-case its handling during serialization. This ensures that an element's is value is preserved through serialize-parse roundtrips.

When creating a customized built-in element via the parser, a developer uses the is attribute directly; in such cases serialize-parse roundtrips work fine.

    <script>
    window.SuperP = class extends HTMLParagraphElement {};
    customElements.define("super-p", SuperP, { extends: "p" });
    </script>

    <div id="container"><p is="super-p">Superb!</p></div>

    <script>
    console.log(container.innerHTML); // <p is="super-p">Superb!</p>
    container.innerHTML = container.innerHTML;
    console.log(container.innerHTML); // <p is="super-p">Superb!</p>
    console.assert(container.firstChild instanceof SuperP);
    </script>

But when creating a customized built-in element via its constructor or via createElement(), the is attribute is not added. Instead, the is value (which is what the custom elements machinery uses) is set without intermediating through an attribute.

    <script>
    container.innerHTML = "";
    const p = document.createElement("p", { is: "super-p" });
    container.appendChild(p);

    // The is attribute is not present in the DOM:
    console.assert(!p.hasAttribute("is"));

    // But the element is still a super-p:
    console.assert(p instanceof SuperP);
    </script>

To ensure that serialize-parse roundtrips still work, the serialization process explicitly writes out the element's is value as an is attribute:

    <script>
    console.log(container.innerHTML); // <p is="super-p"></p>
    container.innerHTML = container.innerHTML;
    console.log(container.innerHTML); // <p is="super-p"></p>
    console.assert(container.firstChild instanceof SuperP);
    </script>

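The special case for the is value can be sketched outside the DOM with a plain-object element model. Everything here is an assumption made for illustration: the `{ localName, isValue, attributes }` shape and the `serializeStartTag` name are hypothetical, and attribute values are assumed to need no escaping.

```javascript
// Hypothetical element model:
//   { localName: string, isValue: string | null, attributes: [name, value][] }
// Assumption: values contain no characters that would need escaping.
function serializeStartTag(element) {
  let out = "<" + element.localName;

  // Write the is value out as an is attribute, but only when the element
  // does not already carry a real is attribute.
  const hasIsAttribute = element.attributes.some(([name]) => name === "is");
  if (element.isValue !== null && !hasIsAttribute) {
    out += ` is="${element.isValue}"`;
  }

  for (const [name, value] of element.attributes) {
    out += ` ${name}="${value}"`;
  }
  return out + ">";
}

// Created via createElement("p", { is: "super-p" }): is value set, no attribute.
const fromCreateElement = serializeStartTag({
  localName: "p",
  isValue: "super-p",
  attributes: [],
});

// Created by the parser from <p is="super-p">: attribute and is value both set.
const fromParser = serializeStartTag({
  localName: "p",
  isValue: "super-p",
  attributes: [["is", "super-p"]],
});
```

Both calls produce the same start tag in this model, which is why the roundtrip preserves the customized built-in element in either case.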
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

1. Replace any occurrence of the "&" character by "&amp;".
2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by "&nbsp;".
3. Replace any occurrences of the "<" character by "&lt;".
4. Replace any occurrences of the ">" character by "&gt;".
5. If the algorithm was invoked in the attribute mode, then replace any occurrences of the """ character by "&quot;".

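The escaping steps above translate almost directly into code. The following is a minimal sketch; the function name `escapeString` and the boolean `attributeMode` parameter are illustrative choices, not names taken from the specification.

```javascript
// Sketch of the escaping steps listed above, applied in the same order.
// The ampersand is replaced first so that the "&" characters introduced by
// the later replacements are not escaped a second time.
function escapeString(input, attributeMode = false) {
  let out = input
    .replace(/&/g, "&amp;")
    .replace(/\u00A0/g, "&nbsp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  if (attributeMode) {
    out = out.replace(/"/g, "&quot;");
  }
  return out;
}

const escapedText = escapeString('a < b & "c"');
const escapedAttribute = escapeString('a < b & "c"', true);
```

With the default mode, double quotes pass through unchanged; in attribute mode they become `&quot;`.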
13.4 Parsing HTML fragments

The HTML fragment parsing algorithm, given an Element node context, string input, an optional boolean allowDeclarativeShadowRoots (default false), and an optional parser scripting mode scriptingMode (default Inert) is the following steps. They return a list of zero or more nodes.

Parts marked fragment case in algorithms in the HTML parser section are parts that only occur if the parser was created for the purposes of this algorithm. The algorithms have been annotated with such markings for informational purposes only; such markings have no normative weight. If it is possible for a condition described as a fragment case to occur even when the parser wasn't created for the purposes of handling this algorithm, then that is an error in the specification.

1. Assert: scriptingMode is either Inert or Fragment.

2. Let document be a Document node whose type is "html".

3. Let contextDocument be context's node document.

4. If contextDocument is in quirks mode, then set document's mode to "quirks".

5. Otherwise, if context's node document is in limited-quirks mode, then set document's mode to "limited-quirks".

6. If allowDeclarativeShadowRoots is true, then set document's allow declarative shadow roots to true.

7. Create a new HTML parser, and associate it with document.

8. If contextDocument's scripting is disabled, then set scriptingMode to Disabled.

9. Set the parser's scripting mode to scriptingMode.

10. Set the state of the HTML parser's tokenization stage as follows, switching on the context element:

    title, textarea
        Switch the tokenizer to the RCDATA state.
    style, xmp, iframe, noembed, noframes
        Switch the tokenizer to the RAWTEXT state.
    script
        Switch the tokenizer to the script data state.
    noscript
        If scripting mode is not Disabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.
    plaintext
        Switch the tokenizer to the PLAINTEXT state.
    Any other element
        Leave the tokenizer in the data state.

    For performance reasons, an implementation that does not report errors and that uses the actual state machine described in this specification directly could use the PLAINTEXT state instead of the RAWTEXT and script data states where those are mentioned in the list above. Except for rules regarding parse errors, they are equivalent, since there is no appropriate end tag token in the fragment case, yet they involve far fewer state transitions.

11. Let root be the result of creating an element given document, "html", the HTML namespace, null, null, false, and context's custom element registry.

12. Append root to document.

13. Set up the HTML parser's stack of open elements so that it contains just the single element root.

14. If context is a template element, then push "in template" onto the stack of template insertion modes so that it is the new current template insertion mode.

15. Create a start tag token whose name is the local name of context and whose attributes are the attributes of context. Let this start tag token be the start tag token of context; e.g. for the purposes of determining if it is an HTML integration point.

16. Reset the parser's insertion mode appropriately. The parser will reference the context element as part of that algorithm.

17. Set the HTML parser's form element pointer to the nearest node to context that is a form element (going straight up the ancestor chain, and including the element itself, if it is a form element), if any. (If there is no such form element, the form element pointer keeps its initial value, null.)

18. Place the input into the input stream for the HTML parser just created. The encoding confidence is irrelevant.

19. Start the HTML parser and let it run until it has consumed all the characters just inserted into the input stream.

20. Return root's children, in tree order.
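
The tokenizer-state selection in the algorithm above (the step that switches on the context element) can be sketched as a lookup. This is an illustration under stated assumptions: the function name `initialTokenizerState` is hypothetical, states are represented as plain strings, and the context element is assumed to be in the HTML namespace and identified by its local name.

```javascript
// Sketch: pick the tokenizer's starting state from the context element's
// local name and the parser's scripting mode ("Inert", "Fragment", or
// "Disabled"). Assumes the context element is in the HTML namespace.
function initialTokenizerState(contextLocalName, scriptingMode) {
  switch (contextLocalName) {
    case "title":
    case "textarea":
      return "RCDATA";
    case "style":
    case "xmp":
    case "iframe":
    case "noembed":
    case "noframes":
      return "RAWTEXT";
    case "script":
      return "script data";
    case "noscript":
      return scriptingMode !== "Disabled" ? "RAWTEXT" : "data";
    case "plaintext":
      return "PLAINTEXT";
    default:
      return "data";
  }
}

const stateForTextarea = initialTokenizerState("textarea", "Inert");
const stateForNoscriptDisabled = initialTokenizerState("noscript", "Disabled");
```

A context element such as div falls through to the default branch and leaves the tokenizer in the data state.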
