Object-Oriented Programming

XML

Michael L. Collard, Ph.D.

Department of Computer Science, The University of Akron

Data in an Executable Program

  • Store in data structures
  • Containers: std::vector, std::list, std::array, C arrays
  • Objects: struct, class
  • Database

External Data

  • Must store persistent data externally, e.g., files
  • Must exchange data with other programs
  • Example: One program produces data, another program converts the data to a web page
  • Need to store and exchange using a data format

Text Data Formats

  • CSV
  • YAML
  • JSON
  • XML

XML

  • General-purpose interchange data format
  • Represents structured and semi-structured data
  • Very applicable to text documents
  • Wide variety of tools and approaches

XML Technologies

  • XPath - Addressing language
  • XSLT - Transformation language
  • DTD, XML Schema, RELAX NG - Schema languages that specify structure and content
  • DOM, JDOM, XOM, SAX2, LINQ to XML - APIs for parsing XML and for building applications on top of it

Views of XML

  • “data heads”

    XML is a data format

  • “doc heads”

    XML is a document format

Simple List in XML
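
The slide's listing did not survive here, so the following is only a minimal sketch of what a simple list could look like, parsed with Python's standard xml.etree.ElementTree module. The element names list and item, and the item values, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical simple list; element names and values are illustrative only
SIMPLE_LIST = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<list>
  <item>apple</item>
  <item>banana</item>
  <item>cherry</item>
</list>"""

# Encode to bytes so the parser reads the encoding named in the declaration
root = ET.fromstring(SIMPLE_LIST.encode("utf-8"))

# Collect the text content of each <item> child of the root element
items = [item.text for item in root.findall("item")]
print(items)
```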

XML Declaration

  • First thing in an XML document
  • Optional, but always include

XML Declaration Attributes

  • version
    • Although there is an XML 1.1, most XML is 1.0. Use 1.0.
  • encoding
    • Text encoding of the document. UTF-8 is the default and the most common
    • UTF-8 is a superset of ASCII, so any ASCII text is also valid UTF-8
    • If you don’t know what encoding to use, use UTF-8
  • standalone
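    • States whether the document depends on markup declarations in an external DTD. “yes” means it does not. Use “yes”.
  • Except for very special situations, the only values you will see are version 1.0, encoding UTF-8, and standalone “yes”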
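
As a sketch of how a program can emit the declaration, Python's xml.etree.ElementTree (3.8 or later) can be asked to write one. The root element name list is hypothetical. ElementTree writes only version and encoding, not standalone.

```python
import xml.etree.ElementTree as ET

root = ET.Element("list")  # hypothetical root element

# xml_declaration=True asks ElementTree to put the declaration first
data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

# The declaration is the first line of the serialized bytes
first_line = data.split(b"\n", 1)[0]
print(first_line.decode("utf-8"))
```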

Element

  • XML structure is defined by elements
  • Tags delimit elements
  • Element start tag, e.g., <name>
  • Element end tag, e.g., </name>
  • Empty element, e.g., <name/>

Element Content

  • text - “cin” is the text content of the element name: <name>cin</name>

  • nested elements - element name is nested inside of element expr: <expr><name>cin</name></expr>

  • The element includes the start tag, the end tag, and all content in between
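
A minimal sketch of both kinds of content, using the expr and name elements from the bullets above and Python's standard ElementTree parser:

```python
import xml.etree.ElementTree as ET

# Element name is nested inside element expr; "cin" is the text of name
expr = ET.fromstring("<expr><name>cin</name></expr>")
name = expr.find("name")

print(name.text)   # text content of the element name
print(len(expr))   # number of elements nested directly inside expr

# The element is the start tag, the end tag, and everything in between
name_markup = ET.tostring(name, encoding="unicode")
print(name_markup)
```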

Nested Elements in XML

Attributes

  • Elements can have attributes, which are name-value pairs listed in the start tag, e.g., type="block"
    • name is “type”
    • value is “block”
  • Attribute names must be unique within an element, i.e., multiple attributes with the same name are not allowed
  • The value of an attribute is a string and cannot have nested elements
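
A small sketch of reading an attribute and of the uniqueness rule. The element name comment is hypothetical; only the attribute type="block" comes from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical element carrying the attribute type="block"
comment = ET.fromstring('<comment type="block">text</comment>')
attr_value = comment.get("type")   # attribute values are always strings
print(attr_value)

# Two attributes with the same name on one element are rejected by the parser
try:
    ET.fromstring('<comment type="block" type="line"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print(duplicate_accepted)
```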

Attributes in XML

Attributes vs. Nested Elements

  • attributes
    • Can only contain text
    • No nested elements
    • metadata
  • nested elements
    • Can contain text
    • Can have nested elements
    • data

Well-Formedness

  • Exactly one root element
  • Elements form a tree structure and nest properly
    • e.g., <a><b></a></b> is not well-formed
  • Every start tag has a matching end tag
  • XML processors are required to stop processing and report an error if a document is not well-formed
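
The rules above can be checked with any conforming parser. As a sketch, the hypothetical helper is_well_formed below relies on Python's ElementTree raising ParseError when it stops on a document that is not well-formed.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser stops and reports an error on the first violation
        return False
    return True

print(is_well_formed("<a><b></b></a>"))   # proper nesting, one root
print(is_well_formed("<a><b></a></b>"))   # improper nesting
print(is_well_formed("<a></a><b></b>"))   # more than one root element
print(is_well_formed("<a><b></a>"))       # missing end tag for b
```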

Escaped Characters

unescaped   escaped   notes
<           &lt;      required
&           &amp;     required
>           &gt;      typically done, but not required
'           &apos;    required only in an attribute value delimited by '
"           &quot;    required only in an attribute value delimited by "

Escaped Text

Unescaped:

Escaped:
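
As an illustration, Python's standard xml.sax.saxutils module can escape and unescape text. The sample string is hypothetical.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical text containing characters that must be escaped in XML content
unescaped = "if (a < b && c > d)"

# escape() replaces &, <, and > with their entity references
escaped = escape(unescaped)
print(escaped)

# Quotes are escaped only on request, e.g., for use in an attribute value
quoted = escape('say "hi"', {'"': "&quot;"})
print(quoted)

# unescape() reverses the replacement
round_trip = unescape(escaped)
```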

Mixed Content

  • Element has both text and nested elements
  • Additional markup for a section item
  • Good for text with included markup

Mixed Content Example
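
The slide's example is not available here. A hypothetical mixed-content element (the names item and code are assumptions) shows how Python's ElementTree exposes the text around a nested element:

```python
import xml.etree.ElementTree as ET

# Hypothetical element with both text and a nested element
item = ET.fromstring(
    "<item>Store data in a <code>std::vector</code> container</item>"
)
code = item.find("code")

print(repr(item.text))   # text before the nested element
print(repr(code.text))   # text inside the nested element
print(repr(code.tail))   # text after the nested element, kept on the child

# itertext() walks all of the text in document order
full_text = "".join(item.itertext())
print(full_text)
```

ElementTree stores the text that follows a nested element in that child's tail, so code that reads only item.text loses part of the content.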

Namespaces

  • Ability to put element names into groups and specifically refer to them
  • prefix
    • e.g., s
  • URI
    • e.g., https://mlcollard.net/Student

Namespaces (more)

  • Declaration is scoped by the element it is declared on, typically the root element
  • namespace declaration, e.g., the prefix s is shorthand for the URI https://mlcollard.net/Student: xmlns:s="https://mlcollard.net/Student"
  • default namespace declaration, e.g., unprefixed element names are shorthand for names in the URI https://mlcollard.net/Student: xmlns="https://mlcollard.net/Student"

Namespaces with Default Prefix

Namespaces with Prefix

How to View Namespaces

  • Required for many XML processors, but often misunderstood and mishandled
  • Prefix is shorthand for the URI in that context/scope, i.e., <s:student> can be read as:
    • <{https://mlcollard.net/Student}student>
    • Note: NOT valid XML

Namespace Matching

  • Element name matching is based on namespace URI, not on prefix. The following are identical:
  • Many XML processors require a non-default prefix for certain functionality, e.g., XPath in libxml2
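
A sketch of matching by URI: Python's ElementTree reports every element name in the {URI}localname form, so the same element written with different prefixes, or with the default namespace, compares equal. The prefix stu is hypothetical.

```python
import xml.etree.ElementTree as ET

URI = "https://mlcollard.net/Student"

# Same element name, three different ways of writing the namespace
with_prefix = ET.fromstring(f'<s:student xmlns:s="{URI}"/>')
other_prefix = ET.fromstring(f'<stu:student xmlns:stu="{URI}"/>')
default_ns = ET.fromstring(f'<student xmlns="{URI}"/>')

print(with_prefix.tag)
print(with_prefix.tag == other_prefix.tag == default_ns.tag)
```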

Multiple Namespaces
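
The slide's listing is not available here. As a hypothetical sketch, a document can declare more than one namespace and a query can address each by URI. The Course URI, the element names name and course, and the values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using two namespaces; the Course URI is an assumption
DOC = """<s:student xmlns:s="https://mlcollard.net/Student"
           xmlns:c="https://mlcollard.net/Course">
  <s:name>Ada</s:name>
  <c:course>OOP</c:course>
</s:student>"""
root = ET.fromstring(DOC)

# Prefixes in this map belong to the query and need not match the document
ns = {
    "stu": "https://mlcollard.net/Student",
    "crs": "https://mlcollard.net/Course",
}
name = root.find("stu:name", ns).text
course = root.find("crs:course", ns).text
print(name, course)
```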

White Space

  • White space is either significant or insignificant
  • Newlines are normalized (i.e., converted) to Unix line endings even on Windows
    • CR + LF → LF
  • White space in tags is insignificant and is just a separator, i.e., the following are equivalent:
    • XML processors will often remove extra insignificant whitespace
  • White space in element content can be significant. Assume that it is.
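
A sketch of these rules with Python's ElementTree, reusing the name element and the type="block" attribute from earlier slides as illustrative markup:

```python
import xml.etree.ElementTree as ET

# White space inside tags is only a separator, so these parse identically
compact = ET.fromstring('<name type="block">cin</name>')
spaced = ET.fromstring('<name   type = "block"\n   >cin</name   >')
same = (compact.tag, compact.attrib, compact.text) == (
    spaced.tag, spaced.attrib, spaced.text)
print(same)

# White space in element content is preserved by the parser
padded = ET.fromstring("<name> cin </name>")
print(repr(padded.text))

# CR + LF in the input is normalized to LF
lines = ET.fromstring("<text>one\r\ntwo</text>")
print(repr(lines.text))
```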

Validating for Well-Formedness

  • Any XML producer must produce well-formed XML, or it cannot be processed
  • xmllint - Tool for checking if XML is well-formed
  • Ubuntu package: libxml2-utils
  • Preinstalled on macOS
  • To check if XML is well-formed: xmllint --noout data.xml

Further Validation

  • Allowed elements, text content of elements, and nested elements are specific to each XML application (i.e., each XML format)
  • Must be defined in a grammar (schema) for that format
  • DTD
  • XML Schema
  • RELAX NG

Special XML Contents

  • XML comment
  • CDATA section
  • XML Processing Instructions
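
A sketch of how a parser treats these contents, using a hypothetical document. The processing-instruction target app is an assumption. By default, Python's ElementTree delivers CDATA as ordinary text and leaves comments and processing instructions out of the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a processing instruction, a CDATA section,
# and a comment; the target "app" is illustrative only
DOC = "<?app hint?><expr><![CDATA[a < b && c]]><!-- a comment --></expr>"
expr = ET.fromstring(DOC)

# Inside CDATA, < and & need no escaping; the parser returns plain text
text = expr.text
print(text)

# The comment is not kept as a child of expr by the default parser
print(len(expr))
```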

XML Comments

  • Comment Start: <!--
  • Comment End: -->

XML Terminology

  • qname: short for qualified name
  • URI: short for Uniform Resource Identifier. URLs are a subset of URIs

XML Parts

  • XML Declaration: version, encoding, standalone
  • Element Start Tag: qname, prefix, localname
  • Element End Tag: qname, prefix, localname
  • Characters: content
  • Attribute: qname, prefix, localname, value
  • XML Namespace: prefix, uri
  • XML Comment: content
  • CDATA: content

Names Example

  • qname: Full element name
    • E.g., src:if
  • prefix: Namespace prefix
    • E.g., src
  • localname: Element name inside the namespace, i.e., element name without the prefix
    • E.g., if
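
The relationship between the three names can be sketched with a hypothetical helper, split_qname, which is not part of any XML library:

```python
def split_qname(qname):
    """Split a qualified name into (prefix, localname).

    The prefix is the empty string when the qname has no prefix.
    """
    prefix, _, localname = qname.rpartition(":")
    return prefix, localname

print(split_qname("src:if"))
print(split_qname("if"))
```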