XML is a markup language used to store hierarchical data in text files. Like JSON, it is a format for semi-structured data. XML is machine-readable, yet can also be read and written by people.
XML is made up of elements, sometimes casually referred to as tags, which can themselves contain other elements and/or text. Elements may also carry attributes.
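To illustrate the element/attribute/text structure described above, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The sample document and its names (`library`, `book`, `city`, `id`) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: a root element with one attribute
# and two child elements that each carry an attribute and some text.
doc = """<library city="Zurich">
  <book id="b1">XML in a Nutshell</book>
  <book id="b2">Learning XML</book>
</library>"""

root = ET.fromstring(doc)

print(root.tag)     # name of the root element
print(root.attrib)  # attributes of the root element, as a dict

# Iterating over an element yields its child elements.
books = [(book.get("id"), book.text) for book in root]
for book_id, title in books:
    print(book_id, title)
```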
XML is often used for data exchange between platforms, especially over the internet. It is also increasingly used for storing semi-structured data in NoSQL data stores (XML databases/document stores). Furthermore, it has the flexibility to handle document-oriented data (text with markup), which makes it very popular in the publishing industry. XML is also widely used for configuration files.
One of the main reasons why XML is so widely used is that it is standardized, with many parsers available, including open-source ones. This makes the cost of using XML lower than that of inventing a new syntax of one's own.
More information about XML's origin and goals can be found in the official W3C Recommendation.
There are two versions of XML, shown in the table below. The editions of each version are just revisions of the original documents and not changes of the standards.
The first version of XML is 1.0. XML 1.1 was introduced primarily because of the Unicode version change from 2.0 to 3.1; it specifies a set of new rules for the use and interpretation of new Unicode characters.
Element and attribute names in XML are called QNames (qualified names).
A QName is made of a namespace, a prefix, and a local name.
Only the namespace and the local name are relevant for comparing two QNames. The prefix is only a proxy to the namespace.
The namespace and prefix are optional, but the namespace is always present if the prefix is present (this is ensured at the syntactic level, so this cannot be done wrong).
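The point that the prefix is only a proxy for the namespace can be checked with a small sketch. It assumes Python's `xml.etree.ElementTree`, which represents each name as `{namespace}local-name` and discards the prefix; the namespace URIs and the `item` element are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns"  # hypothetical namespace URI

# Same namespace and local name, different prefixes.
a = ET.fromstring('<p:item xmlns:p="%s"/>' % NS)
b = ET.fromstring('<q:item xmlns:q="%s"/>' % NS)

# Same prefix and local name, different namespace.
c = ET.fromstring('<p:item xmlns:p="http://example.com/other"/>')

print(a.tag)           # expanded name, without the prefix
print(a.tag == b.tag)  # prefixes differ, names still match
print(a.tag == c.tag)  # prefixes match, namespaces differ
```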
The lexical representation of a QName is `prefix:local-name`. The namespace is bound separately using the special `xmlns:...` attributes (reminder: attributes beginning with xml are reserved in XML).
If the prefix is empty, no colon is used in the lexical representation of the QName, which only contains the `local-name`. QNames with an empty prefix either have no namespace (if no default namespace is in scope) or are in the default namespace. Note that the default namespace applies to element names only: an unprefixed attribute name is never in a namespace.
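A short sketch of how unprefixed names resolve, again assuming `xml.etree.ElementTree` and a hypothetical namespace URI. It also shows that, per the Namespaces in XML specification, a default namespace declaration affects element names but not unprefixed attribute names.

```python
import xml.etree.ElementTree as ET

# No default namespace in scope: the element name has no namespace.
plain = ET.fromstring('<item id="1"/>')

# A default namespace is in scope: the unprefixed element name is in it,
# but the unprefixed attribute name still has no namespace.
defaulted = ET.fromstring('<item xmlns="http://example.com/ns" id="1"/>')

print(plain.tag)
print(defaulted.tag)
print(defaulted.attrib)
```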
From a storage perspective, an XML document is made of entities. One of the entities is the document entity, which is the main XML document itself.
Entities can be classified like so (tentatively sorted by descending order of usage):
In many cases, an XML document consists solely of the document entity.
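As an illustration of a document that uses more than just the document entity, the sketch below declares one internal entity in the document type declaration and references it in the content. The entity name `company` and its value are made up; the example assumes that `xml.etree.ElementTree` expands internal entities declared in the internal DTD subset.

```python
import xml.etree.ElementTree as ET

# The document entity is the whole string below. It declares a second,
# internal entity named "company" (a hypothetical name) and references it.
doc = """<!DOCTYPE note [
  <!ENTITY company "Example Corp">
]>
<note>Sent by &company;.</note>"""

root = ET.fromstring(doc)

# The parser substitutes the entity's replacement text for the reference.
print(root.text)
```

Entity expansion can be abused with untrusted input, so a hardened parser is usually preferable outside of experiments like this one.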
Characters can be escaped in XML using entity references and character references, or CDATA sections.
XML pre-defines five entities:
Named entity | Replacement text |
---|---|
amp | & |
quot | " |
apos | ' |
lt | < |
gt | > |
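When producing XML, these predefined entities are what an escaping routine typically emits. A minimal sketch, assuming Python's `xml.sax.saxutils.escape` (which handles `&`, `<` and `>` by default and accepts extra mappings for the quote characters); the sample string is made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'Tom & Jerry say "1 < 2 > 0" isn\'t news'

# escape() covers &, < and > on its own; the quotes are opt-in.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# Parsing the escaped text gives back the original characters.
round_trip = ET.fromstring("<t>%s</t>" % escaped).text
print(round_trip == raw)
```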
A consuming application cannot tell whether a given character was escaped, or how.
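This can be observed by parsing the same content written with each escaping mechanism. The sketch below assumes `xml.etree.ElementTree`; the element name `t` is arbitrary.

```python
import xml.etree.ElementTree as ET

variants = [
    "<t>a &lt; b</t>",           # predefined entity reference
    "<t>a &#60; b</t>",          # decimal character reference
    "<t>a &#x3C; b</t>",         # hexadecimal character reference
    "<t><![CDATA[a < b]]></t>",  # CDATA section
]

# After parsing, every variant yields the same character data.
texts = [ET.fromstring(variant).text for variant in variants]
print(texts)
```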