XML

json+xml

{ "element-name" : [ { attribute:value,… }, content …] }

{"!--":"comment"}
{"?PITarget":"Processing Instruction"}
{"![CDATA[":"Character Data"}
{"!DOCTYPE":[{"Name":"Name","ExternalID":"ExternalID","intSubset":"intSubset"}]}

This describes a generic approach to representing XML in JSON.
The benefit of this approach is that the document order of content is preserved,
with attributes always in the first position of each element's content array.
This is important for document models.

This approach builds upon json+metadata for elements,
whose attributes are metadata about the element and its content.

XML is a document markup language.
The document content is the data.
The markup is the metadata.

Elements and their attributes are represented thus:
{ "element-name" : [ { attribute:value,… }, content …] }

An empty element with no attributes is:
{ "element-name" : [ {} ] }

An element which contains an empty text node is thus:
{ "element-name" : [ {}, "" ] }
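
The element rules above can be sketched in Python. This is a minimal illustration, not a reference implementation: it assumes the standard library's xml.dom.minidom as the parser, the function name element_to_json is hypothetical, and it handles elements and text nodes only.

```python
import json
from xml.dom import minidom


def element_to_json(node):
    """Convert a minidom element to {name: [attributes, content...]}."""
    # Attributes always occupy the first slot, even when there are none.
    attributes = dict(node.attributes.items())
    content = []
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            content.append(element_to_json(child))
        elif child.nodeType == child.TEXT_NODE:
            content.append(child.data)
        # Comments, PIs and CDATA sections are skipped in this sketch.
    return {node.tagName: [attributes] + content}


doc = minidom.parseString('<p class="intro">Hello <b>world</b>!<br/></p>')
result = element_to_json(doc.documentElement)
print(json.dumps(result))
# {"p": [{"class": "intro"}, "Hello ", {"b": [{}, "world"]}, "!", {"br": [{}]}]}
```

The mixed content of the p element keeps its document order: text, child element, text, child element.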

Comments are represented thus:
{"!--":"string content"}
(comments have no metadata)

Processing instructions are represented:

  • when the instruction is left unparsed (the default), as
    {"?PITarget":"instruction"}
  • when the instruction is parsed as attribute-value pairs, as
    {"?PITarget":{attribute:value,…}}
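
The two processing instruction forms can be sketched as follows. This is an illustration under assumptions: the function name pi_to_json and the regular expression are hypothetical, and the parsed form only makes sense for instructions whose content happens to be written as name="value" pairs (as in xml-stylesheet).

```python
import re

# Matches name="value" or name='value' pseudo-attributes (an assumed syntax).
PSEUDO_ATTR = re.compile(r"""([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def pi_to_json(target, data, parse=False):
    """Represent a processing instruction, unparsed by default."""
    key = "?" + target
    if not parse:
        return {key: data}
    pairs = {}
    for name, double_quoted, single_quoted in PSEUDO_ATTR.findall(data):
        pairs[name] = double_quoted or single_quoted
    return {key: pairs}


data = 'type="text/xsl" href="style.xsl"'
print(pi_to_json("xml-stylesheet", data))
print(pi_to_json("xml-stylesheet", data, parse=True))
```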

CDATA sections are represented:
{"![CDATA[":"character data"}
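
A sketch of how comments, unparsed processing instructions and CDATA sections could be mapped, again assuming xml.dom.minidom as the parser (which, by default, keeps CDATA sections as separate nodes). The function name misc_node_to_json is hypothetical.

```python
from xml.dom import Node, minidom


def misc_node_to_json(node):
    """Map a comment, PI or CDATA node to its single-property object."""
    if node.nodeType == Node.COMMENT_NODE:
        return {"!--": node.data}
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        return {"?" + node.target: node.data}
    if node.nodeType == Node.CDATA_SECTION_NODE:
        return {"![CDATA[": node.data}
    raise ValueError("not a comment, PI or CDATA node")


doc = minidom.parseString(
    "<root><!-- a note --><?app do something?><![CDATA[1 < 2]]></root>"
)
items = [misc_node_to_json(n) for n in doc.documentElement.childNodes]
print(items)
```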

Namespace declarations written as XML attributes (xmlns)
are simply represented as attributes.

Namespace declarations written as processing instructions
are represented as per processing instructions.

XML Character References (e.g. &#x1a2b;) may be translated into JSON escaped Unicode characters (\u1a2b).
XML Entity References may be parsed and replaced.
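
As an illustration of the character reference translation, the sketch below assumes Python's xml.dom.minidom and json modules. The parser resolves the reference while parsing, and json.dumps escapes non-ASCII characters by default.

```python
import json
from xml.dom import minidom

# The parser resolves &#x1a2b; and the predefined entity &amp; while parsing.
doc = minidom.parseString("<a>&#x1a2b; &amp; more</a>")
text = doc.documentElement.firstChild.data

escaped = json.dumps({"a": [{}, text]})                      # ASCII-only output
literal = json.dumps({"a": [{}, text]}, ensure_ascii=False)  # keeps the character

print(escaped)  # {"a": [{}, "\u1a2b & more"]}
```

Both strings decode to the same JSON value; the choice only affects the serialised bytes.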

To Be Decided

  1. Should the XML prolog be copied across just as a PI is,
    or should it be replaced by a comment about the origin,
    the method and configuration used for translating to JSON?
    e.g. whether Character and Entity References have been parsed or replaced.
  2. Should items besides elements use json+metadata format for consistency
    even though they only have string values?
  3. DOCTYPE: does anyone want further parsing?
    e.g. to specify SYSTEM or PUBLIC

Alternative Representations

The above representation was selected because it doesn’t require the use of arbitrary symbols or names to distinguish the attribute axis from the content of an element, and because it provides a representation that lists the attributes before the content, just as in the XML serialisation of the XML Information Set. Here are the alternatives that were considered for elements, the first being the one selected.

Object per element, having a single property named after the element,
whose value is an array with a leading attributes object followed by the element content.

{ "element" : [ 
  { "attribute" : value , … } ,
    content ] }

Object per element, having a single property named after the element,
whose value is an object with one property for the attributes object
and one property for the content array.

{ "element" : { 
    "@" : { "attribute" : value , … } ,
    "$" : [ content ] } }

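To make the difference from the selected form concrete, here is a sketch that converts the selected array-valued form into the form using "@" and "$" properties. The function name to_keyed_form is hypothetical, and non-element nodes are assumed to pass through unchanged.

```python
def to_keyed_form(node):
    """Convert {name: [attrs, content...]} to {name: {"@": attrs, "$": [content...]}}."""
    if not isinstance(node, dict):
        return node  # text content stays a plain string
    (name, value), = node.items()
    if name.startswith(("!", "?")):
        return node  # comments, PIs, CDATA and DOCTYPE pass through
    attributes, *content = value
    return {name: {"@": attributes, "$": [to_keyed_form(c) for c in content]}}


selected = {"p": [{"class": "intro"}, "Hello ", {"b": [{}, "world"]}]}
print(to_keyed_form(selected))
```

The "@" and "$" names are exactly the kind of arbitrary symbols that the selected form avoids.
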
Object per element, having a single property named after the element,
whose value is an object with one property per attribute, prefixed by an at symbol,
and one property for the content array.

{ "element" : { 
    "@attribute" : value , … ,
    "$" : [ content ] } }

Object per element, having a property named after the element that holds the content array,
and an attribute property that holds the attributes object.

{ "element" : [ content ] , 
  "@" : { "attribute" : value , … } }

Object per element, having a property named after the element that holds the content array,
and one property per attribute, prefixed by an at symbol.

{ "element" : [ content ] , 
"@attribute" : value , … }

Alternative Array form

[ "element-name", { attribute:value,… }, content …]
[ "!--","comment" ]
[ "?PITarget","Processing Instruction"]
[ "![CDATA[","Character Data"]
[ "!DOCTYPE",{"Name":"Name","ExternalID":"ExternalID","intSubset":"intSubset"}]

An alternative approach is to represent nodes as arrays. An element becomes an array with the element name first, the attributes object second, and content from the third position onwards. Other nodes are represented as two-item arrays, i.e. key-value pairs.
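
The relationship between the object form and the array form can be sketched as a small conversion. The function name to_array_form is hypothetical, and the DOCTYPE handling assumes the object form wraps its description in a one-item array, as in the summary at the top of this page.

```python
def to_array_form(node):
    """Convert the object form of a node into the array form."""
    if not isinstance(node, dict):
        return node  # text content stays a plain string
    (name, value), = node.items()
    if name == "!DOCTYPE" and isinstance(value, list):
        return [name, value[0]]  # assumed: unwrap the one-item array
    if name.startswith(("!", "?")):
        return [name, value]  # comment, PI or CDATA as a key-value pair
    attributes, *content = value
    return [name, attributes] + [to_array_form(c) for c in content]


tree = {"p": [{"class": "intro"}, "Hello ", {"!--": "note"}, {"b": [{}, "world"]}]}
print(to_array_form(tree))
# ['p', {'class': 'intro'}, 'Hello ', ['!--', 'note'], ['b', {}, 'world']]
```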