Briefly, xmlformat processes an XML document using the following steps:
Read the document into memory as a single string.
Parse the document into a list of tokens.
Convert the list of tokens into nodes in a tree structure, tagging each node according to the token type.
Discard extraneous whitespace nodes and normalize text nodes. (The meaning of "normalize" is described in Section 3.3, “Text Handling”.)
Process the tree to produce a single string representing the reformatted document.
Print the string.
xmlformat is not an XSLT processor. In essence, all it does is add or delete whitespace to control line breaking, indentation, and text normalization.
xmlformat uses the REX parser developed by Robert D. Cameron (see Section 7, “References”). REX performs a parse based on a regular expression that operates on a string representing the XML document. The parse produces a list of tokens. REX does a pure lexical scan that performs no alteration of the text except to tokenize it. In particular:
REX doesn't normalize any whitespace, including line endings. This is true for text elements, and for whitespace within tags (including between attributes and within attribute values). Any normalization or reformatting to be done is performed in later stages of xmlformat operation.
REX leaves entity references untouched. It doesn't try to resolve them. This means it doesn't complain about undefined entities, which to my mind is an advantage. (A pretty printer shouldn't have to read a DTD or a schema.)
If the XML is malformed, errors can be detected easily: REX produces error tokens that begin with "<" but do not end with ">".
xmlformat expects its input documents to be legal XML. It does not consider fixing broken documents to be its job, so if xmlformat finds error tokens in the result produced by REX, it lists them and exits.
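The tokenizing stage and the error-token check can be sketched as follows. This is a simplified stand-in, not the actual REX expression, which is considerably more thorough; the names `TOKEN_RE`, `tokenize`, and `error_tokens` are hypothetical and chosen only for illustration.

```python
import re

# Simplified stand-in for REX (an assumption, not the real expression).
# Alternatives are tried in order; the last one catches a "<" that never
# becomes complete markup, which yields an error token.
TOKEN_RE = re.compile(
    r"""
      <!--.*?-->            # comment
    | <\?.*?\?>             # processing instruction
    | <!\[CDATA\[.*?\]\]>   # CDATA section
    | <!DOCTYPE[^>]*>       # DOCTYPE declaration (no internal subset)
    | </?[^<>]+>            # element opening, closing, or empty tag
    | [^<]+                 # text
    | <[^<>]*               # error token: begins with "<", lacks final ">"
    """,
    re.DOTALL | re.VERBOSE,
)


def tokenize(doc):
    """Split the document into tokens without altering any characters."""
    return [m.group(0) for m in TOKEN_RE.finditer(doc)]


def error_tokens(tokens):
    """Error tokens begin with "<" but do not end with ">"."""
    return [t for t in tokens if t.startswith("<") and not t.endswith(">")]


if __name__ == "__main__":
    tokens = tokenize("<p>Hi &undefined; <b>x</b>\r\n</p>")
    print(tokens)
    print(error_tokens(tokenize("<p>text <b oops")))
```

Because the scan is purely lexical, joining the tokens reproduces the input exactly: line endings, whitespace inside tags, and entity references such as `&undefined;` pass through untouched.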
Assuming the document contains no error tokens, xmlformat uses the token list to construct a tree structure. It categorizes each token based on its initial characters:
Initial Characters | Token Type |
---|---|
<!-- | comment |
<? | processing instruction (this includes the <?xml?> instruction) |
<!DOCTYPE | DOCTYPE declaration |
<![ | CDATA section |
</ | element closing tag |
< | element opening tag |
Any token not beginning with one of the sequences shown in the preceding table is a text token.
The token categorization determines the types of the nodes in the document tree. Each node has a label that identifies the node type:
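A minimal sketch of this categorization, assuming a simple prefix test applied in table order (the names `TOKEN_TYPES` and `token_type` are hypothetical). The order matters: the more specific prefixes must be tested before the bare "<" that identifies an opening tag.

```python
# Prefixes are checked in the order of the table above, so "<!--", "<?",
# "<!DOCTYPE", "<![" and "</" are all tried before the bare "<".
TOKEN_TYPES = [
    ("<!--", "comment"),
    ("<?", "processing instruction"),
    ("<!DOCTYPE", "DOCTYPE declaration"),
    ("<![", "CDATA section"),
    ("</", "element closing tag"),
    ("<", "element opening tag"),
]


def token_type(token):
    """Return the token type based on the token's initial characters."""
    for prefix, name in TOKEN_TYPES:
        if token.startswith(prefix):
            return name
    return "text"  # anything not matching a prefix is a text token


if __name__ == "__main__":
    for tok in ["<!-- note -->", "<?xml version='1.0'?>", "<abc>", "</abc>", "plain text"]:
        print(repr(tok), "->", token_type(tok))
```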
Label | Node Type |
---|---|
comment | comment node |
pi | processing instruction node |
DOCTYPE | DOCTYPE declaration node |
CDATA | CDATA section node |
elt | element node |
text | text node |
If the document is not well-formed, tree construction will fail. In this case, xmlformat displays one or more error messages and exits. For example, this document is not well-formed because the <strong> element is never closed:
<p>This is a <strong>malformed document.</p>
Running that document through xmlformat produces the following result:
MISMATCH open (strong), close (p); malformed document?
Non-empty tag stack; malformed document?
Non-empty children stack; malformed document?
Cannot continue.
That is admittedly cryptic, but remember that it's not xmlformat's job to repair (or even diagnose) bad XML. If a document is not well-formed, you may find Tidy a useful tool for fixing it up.
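To see where messages of this kind come from, here is a hedged sketch of a stack-based nesting check. It is not xmlformat's actual code: the function names are hypothetical, and the message wording is borrowed from the output shown above purely for illustration.

```python
import re


def tag_name(tag):
    """Extract the element name from an opening or closing tag."""
    return re.match(r"</?\s*([^\s/>]+)", tag).group(1)


def check_nesting(tokens):
    """Return a list of error messages; an empty list means tags nest properly."""
    errors = []
    tag_stack = []  # names of elements opened but not yet closed
    for tok in tokens:
        if not tok.startswith("<") or tok.startswith(("<!", "<?")):
            continue  # text, comment, DOCTYPE, CDATA, or processing instruction
        if tok.startswith("</"):
            close = tag_name(tok)
            if not tag_stack:
                errors.append(f"close ({close}) with no open element")
                continue
            opened = tag_stack.pop()
            if opened != close:
                errors.append(f"MISMATCH open ({opened}), close ({close})")
        elif not tok.endswith("/>"):
            tag_stack.append(tag_name(tok))  # single-tag elements are never pushed
    if tag_stack:
        errors.append("Non-empty tag stack")
    return errors


if __name__ == "__main__":
    tokens = ["<p>", "This is a ", "<strong>", "malformed document.", "</p>"]
    for message in check_nesting(tokens):
        print(message)
```

In this sketch the closing </p> pops <strong> from the stack, which produces the mismatch, and <p> is still on the stack when the tokens run out.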
Tokens of each type except element tokens correspond to single distinct nodes in the document. Elements are more complex. They may consist of multiple tokens, and may contain children:
An element with a combined opening/closing tag (such as <abc/>) consists of a single token.
An element with separate opening and closing tags (such as <abc>...</abc>) consists of at least the two tags, plus any children that appear between the tags.
Element opening tag tokens include any attributes that are present, because xmlformat performs no tag reformatting. Tags are preserved intact in the output, including any whitespace between attributes or within attribute values.
In addition to the type value that labels a node as a given node type, each node has content:
For all node types except elements, the content is the text of the token from which the node was created.
For element nodes, the content is the list of child nodes that appear within the element. An empty element has an empty child node list. In addition to the content, element nodes contain other information:
The literal text of the opening and closing tags. If an element is written in single-tag form (<abc/>), the closing tag is empty.
The element name that is present in the opening tag. (This is maintained separately from the opening tag so that a pattern match need not be done on the opening tag each time it's necessary to determine the element name.)
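The node layout just described might be modeled as follows. This is only a sketch under stated assumptions: the `Node` class, its field names, and `build_tree` are hypothetical, and the builder assumes well-formed input, so it does no error checking.

```python
import re
from dataclasses import dataclass

# Non-element markup prefixes and the node labels they map to.
PREFIX_LABELS = [("<!--", "comment"), ("<?", "pi"),
                 ("<!DOCTYPE", "DOCTYPE"), ("<![", "CDATA")]


@dataclass
class Node:
    type: str            # "comment", "pi", "DOCTYPE", "CDATA", "elt", or "text"
    content: object = ""  # token text, or a list of child Nodes for "elt"
    open_tag: str = ""   # elements only: literal opening tag, kept intact
    close_tag: str = ""  # elements only: empty for single-tag form <abc/>
    name: str = ""       # elements only: stored once to avoid re-matching


def build_tree(tokens):
    """Convert a token list into a list of top-level nodes."""
    root = []
    children_stack = [root]  # child list currently being filled
    open_elts = []           # element nodes awaiting their closing tag
    for tok in tokens:
        label = next((lab for pre, lab in PREFIX_LABELS if tok.startswith(pre)), None)
        if label is None and not tok.startswith("<"):
            label = "text"
        if label is not None:
            children_stack[-1].append(Node(label, tok))
        elif tok.startswith("</"):
            open_elts.pop().close_tag = tok
            children_stack.pop()
        else:
            name = re.match(r"<\s*([^\s/>]+)", tok).group(1)
            node = Node("elt", [], open_tag=tok, name=name)
            children_stack[-1].append(node)
            if not tok.endswith("/>"):
                open_elts.append(node)
                children_stack.append(node.content)
    return root


if __name__ == "__main__":
    tree = build_tree(['<a href="x"  id="1">', "hi", "<br/>", "</a>"])
    print(tree[0].name, repr(tree[0].open_tag), repr(tree[0].close_tag))
```

Note that the opening tag is stored verbatim, including the double space between its attributes, which matches the statement that tags are preserved intact.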
After constructing the node tree, xmlformat performs two operations on it:
The tree is "canonized" to normalize text nodes and to discard extraneous whitespace nodes. A whitespace node is a text node consisting of nothing but whitespace characters (space, tab, carriage return, or linefeed/newline). Decisions about which whitespace nodes are extraneous are based on the configuration options supplied to xmlformat.
The canonized tree is used to produce formatted output. xmlformat performs line-wrapping of element content, and adds indentation and line breaks. Decisions about how to apply these operations are based on the configuration options.
Here's an example input document, representing a single-row table:
<table> <row> <cell>1</cell><cell>2</cell> <cell>3</cell> </row></table>
After reading this in and constructing the tree, the canonized output looks like this:
<table><row><cell>1</cell><cell>2</cell><cell>3</cell></row></table>
The output after applying the default formatting options looks like this:
<table>
 <row>
  <cell>1</cell>
  <cell>2</cell>
  <cell>3</cell>
 </row>
</table>
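The two operations can be illustrated on the table example with a rough sketch. It makes several simplifying assumptions that the real program does not: every whitespace-only text node is treated as extraneous, every element is formatted as a block, no line-wrapping or text normalization is done, and one space of indentation per level is assumed. All function names are hypothetical.

```python
import re


def parse(doc):
    """Tiny stand-in parser: elements and text only, well-formed input assumed."""
    root = {"open": "", "close": "", "children": []}
    stack = [root]
    for tok in re.findall(r"<[^>]+>|[^<]+", doc):
        if tok.startswith("</"):
            stack.pop()["close"] = tok
        elif tok.startswith("<"):
            node = {"open": tok, "close": "", "children": []}
            stack[-1]["children"].append(node)
            if not tok.endswith("/>"):
                stack.append(node)
        else:
            stack[-1]["children"].append(tok)  # text node kept as a plain string
    return root["children"]


def canonize(nodes):
    """Discard text nodes made up only of space, tab, CR, or LF."""
    out = []
    for n in nodes:
        if isinstance(n, str):
            if n.strip(" \t\r\n"):
                out.append(n)
        else:
            out.append({**n, "children": canonize(n["children"])})
    return out


def serialize(nodes):
    """Write the tree back out with no added whitespace."""
    return "".join(
        n if isinstance(n, str) else n["open"] + serialize(n["children"]) + n["close"]
        for n in nodes
    )


def pretty(nodes, depth=0, indent=" "):
    """Return output lines, one element per line, indented by depth."""
    lines = []
    for n in nodes:
        pad = indent * depth
        if isinstance(n, str):
            lines.append(pad + n)
        elif all(isinstance(c, str) for c in n["children"]):
            # Elements holding only text (or nothing) stay on one line.
            lines.append(pad + n["open"] + "".join(n["children"]) + n["close"])
        else:
            lines.append(pad + n["open"])
            lines.extend(pretty(n["children"], depth + 1, indent))
            lines.append(pad + n["close"])
    return lines


if __name__ == "__main__":
    doc = "<table> <row> <cell>1</cell><cell>2</cell> <cell>3</cell> </row></table>"
    tree = canonize(parse(doc))
    print(serialize(tree))
    print("\n".join(pretty(tree)))
```

Keeping canonization separate from output generation mirrors the two-pass description above: the formatter never has to reason about the whitespace that happened to be in the input.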