HTML allows marking up (or describing) the structure of a human-readable web document or web user interface, while XML allows marking up the structure of all kinds of documents, data files and messages, whether they are human-readable or not. HTML can be based on XML.
XML provides a syntax for expressing structured information in the form of an XML document with elements and their attributes. The specific elements and attributes used in an XML document can come from any vocabulary, such as public standards or your own user-defined XML format. XML is used for specifying
document formats, such as XHTML5, the Scalable Vector Graphics (SVG) format or the DocBook format,
data interchange file formats, such as the Mathematical Markup Language (MathML) or the Universal Business Language (UBL),
message formats, such as the web service message format SOAP.
XML is based on Unicode, which is a platform-independent character set that includes
almost all characters from most of the world's writing systems, including those used for Hindi, Burmese and
Gaelic. Each character is assigned a unique integer code in the range between 0 and 1,114,111.
For example, the Greek letter π has the code 960, so it can be inserted in an XML document as
&#960; using the XML numeric character reference syntax.
Unicode includes legacy character sets like ASCII and ISO-8859-1 (Latin-1) as subsets.
The default encoding of an XML document is UTF-8, which uses only a single byte for ASCII characters, but two to four bytes for all other characters.
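The relationship between characters, their integer codes and their UTF-8 byte lengths can be checked with a few lines of Python. This is only an illustrative sketch: the element name formula and the sample characters are assumptions chosen for the example, not part of the original text.

```python
import xml.etree.ElementTree as ET

PI = "\u03c0"                      # Greek small letter pi
code_point = ord(PI)               # integer code assigned by Unicode
char_ref = f"&#{code_point};"      # numeric character reference for XML

# An XML parser resolves the reference back to the character itself.
# The element name "formula" is a hypothetical example.
parsed_text = ET.fromstring(f"<formula>{char_ref}</formula>").text


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


byte_lengths = {
    "A": utf8_length("A"),               # ASCII letter
    "pi": utf8_length(PI),               # U+03C0
    "euro": utf8_length("\u20ac"),       # U+20AC, euro sign
    "emoji": utf8_length("\U0001F600"),  # U+1F600, outside the basic plane
}

# Highest code point available in Unicode (1,114,111).
max_code_point = 0x10FFFF
```

The dictionary byte_lengths shows that the number of bytes grows with the code point, from one byte for ASCII up to four bytes for characters beyond U+FFFF.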
Almost all Unicode characters are legal in a well-formed XML document. Illegal characters are the control characters with code 0 through 31, except for the carriage return, line feed and tab. It is therefore risky to copy text from a non-XML source into an XML document (often, the form feed character creates a problem).
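A possible way to guard against such problems is to filter pasted text before it is placed in an XML document. The following sketch assumes the character ranges of the XML 1.0 Char production, which excludes a few further code points (surrogates, U+FFFE and U+FFFF) besides the control characters mentioned above. The function names and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_legal_xml10_char(ch: str) -> bool:
    """Check a single character against the XML 1.0 Char ranges."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_illegal_chars(text: str) -> str:
    """Drop all characters that may not appear in an XML 1.0 document."""
    return "".join(ch for ch in text if is_legal_xml10_char(ch))


# Hypothetical text copied from a word processor, containing a form feed.
pasted = "Page 1\x0cPage 2"

try:
    ET.fromstring(f"<p>{pasted}</p>")
    raw_text_parses = True
except ET.ParseError:
    raw_text_parses = False

cleaned = strip_illegal_chars(pasted)
cleaned_text = ET.fromstring(f"<p>{cleaned}</p>").text
```

Silently dropping characters is only one option; depending on the application, replacing them with a space or reporting an error may be more appropriate.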
Generally, namespaces help to avoid name conflicts. They allow reusing the same (local) name in different namespace contexts.
XML namespaces are identified with the help of a namespace URI (such as the SVG namespace URI "http://www.w3.org/2000/svg"), which is associated with a namespace prefix (such as "svg"). Such a namespace represents a collection of names, both for elements and attributes, and allows namespace-qualified names of the form prefix:name (such as "svg:circle" as a namespace-qualified name for SVG circle elements).
A default namespace is declared in the start tag of an element in the following way:
<html xmlns="http://www.w3.org/1999/xhtml">
This example shows the start tag of the HTML root element, in which the XHTML namespace is declared as the default namespace.
The following example shows a namespace declaration for the SVG namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    ...
  </head>
  <body>
    <figure>
      <figcaption>Figure 1: A blue circle</figcaption>
      <svg:svg xmlns:svg="http://www.w3.org/2000/svg">
        <svg:circle cx="100" cy="100" r="50" fill="blue"/>
      </svg:svg>
    </figure>
  </body>
</html>
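To see how a namespace-aware parser treats such a document, the following sketch parses a shortened variant of the example with Python's standard ElementTree module, which reports every name in the expanded form {namespace URI}local-name. The second document, which uses the prefix s instead of svg, is an assumption added to illustrate that the URI, not the prefix, identifies the namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <svg:svg xmlns:svg="http://www.w3.org/2000/svg">
      <svg:circle cx="100" cy="100" r="50" fill="blue"/>
    </svg:svg>
  </body>
</html>"""

root = ET.fromstring(doc)

# The default namespace applies to the unprefixed html and body elements.
root_name = root.tag
body_name = root[0].tag

# The circle element is found via its namespace URI and local name.
circle = root.find(f".//{{{SVG_NS}}}circle")
circle_name = circle.tag

# Unprefixed attributes such as fill are not in any namespace.
fill = circle.get("fill")

# Same namespace URI, different (hypothetical) prefix: same expanded name.
other = ET.fromstring(
    '<s:svg xmlns:s="http://www.w3.org/2000/svg"><s:circle r="50"/></s:svg>'
)
other_circle_name = other[0].tag
```

Both circle_name and other_circle_name resolve to the same expanded name, although the two documents use different prefixes.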
XML defines two syntactic correctness criteria. An XML document must be well-formed, and if it is based on a grammar (or schema), then it must also be valid against that grammar.
An XML document is called well-formed if it satisfies the following syntactic conditions:
There must be exactly one root element.
Each element has a start tag and an end tag; however, empty elements can be closed
in the start tag, as in <phone/> instead of <phone></phone>.
Tags must not overlap: an element that is opened inside another element must also be closed inside it.
Attribute names are unique within the scope of an element, e.g. the following code is not correct:
<attachment file="lecture2.html" file="lecture3.html"/>
An XML document is called valid against a particular grammar (such as a DTD or an XML Schema) if
it is well-formed,
and it respects the grammar.
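The well-formedness conditions can be tried out with any XML parser, since a conforming parser must reject documents that violate them. The sketch below uses Python's standard ElementTree module; the helper name is_well_formed and the element names contact, a and b are assumptions made for the example. Note that it checks well-formedness only, not validity against a grammar.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


checks = {
    # One root element, empty element closed in its start tag.
    "empty_element": is_well_formed("<contact><phone/></contact>"),
    # Same content with an explicit end tag.
    "explicit_end_tag": is_well_formed("<contact><phone></phone></contact>"),
    # Two root elements.
    "two_roots": is_well_formed("<a/><b/>"),
    # Overlapping tags.
    "overlap": is_well_formed("<a><b></a></b>"),
    # Duplicate attribute name within one element.
    "duplicate_attribute": is_well_formed(
        '<attachment file="lecture2.html" file="lecture3.html"/>'
    ),
}
```

Only the first two entries of checks are True; the other three documents each violate one of the conditions listed above.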
The World Wide Web Consortium (W3C) has developed the following important versions of HTML:
1997: HTML 4 as an SGML-based language,
2000: XHTML 1 as an XML-based clean-up of HTML 4,
2014: (X)HTML5 in cooperation (and competition) with the WHATWG, a working group supported by browser vendors.
HTML was originally designed as a structure description
language, and not as a presentation description language.
But HTML4 has a lot of purely presentational elements such as
font. XHTML has
taken HTML back to its roots, dropping presentational elements and defining a simple
and clear syntax, in support of the goals of
We adopt the symbolic equation
HTML = HTML5 = XHTML5
stating that when we say "HTML" or "HTML5", we actually mean XHTML5
because we prefer the clear syntax of XML documents over the liberal and confusing HTML4-style syntax that is also allowed by HTML5.
The following simple example shows the basic code template to be used for any HTML document:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>XHTML5 Template Example</title>
  </head>
  <body>
    <h1>XHTML5 Template Example</h1>
    <section><h1>First Section Title</h1>
      ...
    </section>
  </body>
</html>
For user-interactive web applications, the web browser needs to render a user interface. The traditional metaphor for a software application's user interface is that of a form. The special elements for data input, data output and form actions are called form controls. An HTML form is a section of a document consisting of block elements that contain controls and labels on those controls.
Users complete a form by entering text into input
fields and by selecting items from choice
controls. A completed form is submitted with the help of a submit button. When a user submits a form, it is sent to a web
server either with the HTTP GET method or with the HTTP POST method. The standard encoding
for the submission is called URL-encoded. It is
represented by the Internet media type application/x-www-form-urlencoded. In
this encoding, spaces become plus signs, and any other reserved characters become encoded as
a percent sign and hexadecimal digits, as defined in RFC 1738.
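The effect of URL-encoding can be illustrated with the urlencode function from Python's standard library, which produces this kind of name=value encoding for a set of fields. The field names and values below are hypothetical and chosen so that both a space and a reserved character (the ampersand) occur.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical name/value pairs of a submitted form.
fields = {"title": "Tom & Jerry", "year": "2015"}

# Spaces become plus signs; the ampersand becomes %26 because an
# unencoded "&" is reserved as the separator between pairs.
encoded = urlencode(fields)

# A server-side component reverses the encoding to recover the values.
decoded = parse_qs(encoded)
```

With the GET method, a string like encoded is appended to the URL after a question mark, while with the POST method it is sent in the body of the HTTP request.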
Each control has both an initial value and a current value, both of which are strings. The
initial value is specified with the control element's
value attribute, except
for the initial value of a
textarea element, which is given by its initial
contents. The control's current value is first set to the initial value. Thereafter, the
control's current value may be modified through user interaction or scripts. When a form is
submitted for processing, some controls have their name paired with their current value and
these pairs are submitted with the form.
Labels are associated with a control by including the control as a subelement of a
label element ("implicit labels"), or by giving the control an id
value and referencing this id in the
for attribute of the label
element ("explicit labels"). It seems that implicit labels are (in 2015) still not widely
supported by CSS libraries and assistive technologies. Therefore, explicit labels may be
preferable, despite the fact that they imply quite some overhead by requiring a
reference/identifier pair for every labeled HTML form field.
In the simple user interfaces of our "Getting Started" applications, we only need three types of form controls:
single line input fields created with an
<input name="..." /> element,
push buttons created with a
<button type="button">...</button> element, and
dropdown selection lists created with a
select element of the following form:
<select name="...">
  <option value="value1"> option1 </option>
  <option value="value2"> option2 </option>
  ...
</select>
An example of an HTML form with implicit labels for creating such a user interface is
<form id="Book">
  <p><label>ISBN: <input name="isbn" /></label></p>
  <p><label>Title: <input name="title" /></label></p>
  <p><label>Year: <input name="year" /></label></p>
  <p><button type="button" id="saveButton">Save</button></p>
</form>
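Because this form is written in the XML syntax, it can be read by a generic XML parser. As a rough illustration of how implicit labels tie a label text to a control, the following sketch parses the form with Python's standard ElementTree module and collects the label text for each input field. The variable names are assumptions; the namespace declaration is omitted because the fragment stands alone here.

```python
import xml.etree.ElementTree as ET

form_markup = """<form id="Book">
  <p><label>ISBN: <input name="isbn" /></label></p>
  <p><label>Title: <input name="title" /></label></p>
  <p><label>Year: <input name="year" /></label></p>
  <p><button type="button" id="saveButton">Save</button></p>
</form>"""

form = ET.fromstring(form_markup)

# With implicit labels, the control is a child element of its label,
# so the label text is the text content preceding the input element.
labelled_fields = {}
for label in form.iter("label"):
    control = label.find("input")
    if control is not None:
        labelled_fields[control.get("name")] = label.text.strip().rstrip(":")

# Push buttons are identified by their id attribute in this form.
button_ids = [button.get("id") for button in form.iter("button")]
```

With explicit labels, the same association would instead have to be resolved by matching each label's for attribute against the id of a control.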
In an HTML-form-based user interface (UI), we have a correspondence between the
different kinds of properties defined in the model classes of an app and the form controls
used for the input and output of their values. We have to distinguish between various kinds
of model class
attributes, which are typically mapped to various kinds of
input fields. This
mapping is also called data binding.
In general, an attribute of a model class can always be represented in the UI by a plain
input control (with the default setting
type="text"), no matter
which datatype has been defined as the range of the attribute in the model class. However,
in special cases, other types of
input controls (for instance,
type="date"), or other controls, may be used. For instance, if the
attribute's range is an enumeration, a
select control or, if the number of
possible choices is small enough (say, less than 8), a radio button group can be