Object-Oriented Programming

XML

Michael L. Collard, Ph.D.

Department of Computer Science, The University of Akron

Data in an Executable Program

  • Store in memory with data structures
  • Containers: std::vector, std::list, std::array, C array
  • Objects: struct, class
  • Database

External Data

  • Must store persistent data externally, e.g., files
  • Must exchange data with other programs
  • Example: One program produces data (a producer), another program converts the data to a web page (a consumer)
  • Need to store and exchange using a data format
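
The producer/consumer example above can be sketched in Python. This is a minimal sketch under assumptions, not the course's implementation: the function names produce and consume, the element names students and student, and the sample names are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape


def produce(names):
    """Producer: serialize an in-memory list of names as XML text."""
    root = ET.Element("students")  # hypothetical root element name
    for name in names:
        ET.SubElement(root, "student").text = name
    return ET.tostring(root, encoding="unicode")


def consume(xml_text):
    """Consumer: convert the producer's XML into an HTML list."""
    root = ET.fromstring(xml_text)
    items = "".join(
        f"<li>{escape(student.text or '')}</li>"
        for student in root.iter("student")
    )
    return f"<ul>{items}</ul>"


xml_text = produce(["Ada", "Grace"])  # the data format shared by both programs
page = consume(xml_text)
print(xml_text)
print(page)
```

The two functions share nothing except the XML text, which is the point of agreeing on a data format.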

Text Data Formats

  • CSV
  • JSON
  • YAML
  • XML

XML

  • General-purpose interchange data format
  • Represents structured and semi-structured data
  • Well suited to text documents
  • Wide variety of tools and approaches

XML Technologies

  • XPath - Addressing language
  • XSLT - Transformation language
  • DTD, XML Schema, RELAX NG - Schema languages, specify structure and content
  • DOM, JDOM, XOM, SAX2, LINQ - APIs for parsing and processing XML, to build on

Views of XML

  • data viewpoint - XML is a data format
    • No mixed content
    • Lots of attributes
    • Often a memory dump of internal objects
    • Often inflexible
  • document viewpoint - XML is a document format
    • Mixed content
    • Favors elements over attributes
    • High-level view of mixed data
    • Very flexible to additional elements and attributes

Simple List in XML

XML Declaration

  • First thing in an XML document
  • Optional, but always include
  • version - Although there is an XML 1.1, most XML is 1.0.
  • encoding - Text encoding
  • UTF-8 is the default and the most common. UTF-8 is an extension of ASCII.
  • standalone - "yes" declares that the document does not depend on external markup declarations, e.g., an external DTD
  • Except for exceptional situations, these are the only values you will see
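
As a rough illustration of the declaration, the sketch below asks Python's standard xml.etree.ElementTree to write one and then parses the result back. The element name students is only an example, and the xml_declaration argument assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

root = ET.Element("students")  # example element name

# Ask the serializer to emit the XML declaration explicitly (Python 3.8+).
data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

# The declaration is the first thing in the serialized document.
first_line = data.split(b"\n", 1)[0]
print(first_line.decode("ascii"))

# The parser consumes the declaration; it is not part of the element tree.
parsed = ET.fromstring(data)
print(parsed.tag)
```

ElementTree does not write a standalone value here; version and encoding are the two parts it produces.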

Element

  • XML structure is defined by elements:
  • tags delimit elements
  • element start tag
  • element end tag
  • empty element

Element Content

  • text - "cin" is the text content of the element name, e.g., <name>cin</name>

  • nested elements - element name is nested inside of element expr, e.g., <expr><name>cin</name></expr>

  • The element includes the start tag, the end tag, and all content in between
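
A minimal sketch of how text content and nested elements look to a program, using Python's standard xml.etree.ElementTree. The expr and name elements come from the bullets above; the break element is a made-up example of an empty element.

```python
import xml.etree.ElementTree as ET

# Element expr contains the nested element name, whose text content is "cin".
expr = ET.fromstring("<expr><name>cin</name></expr>")
name = expr[0]  # first nested element
print(expr.tag, name.tag, name.text)

# An empty element has no text content and no nested elements.
empty = ET.fromstring("<break/>")  # hypothetical element name
print(empty.text, len(empty))

# A start tag immediately followed by an end tag is also empty.
also_empty = ET.fromstring("<break></break>")
print(also_empty.text, len(also_empty))
```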

Nested Elements in XML

Attributes

  • Elements can have attributes, which are name-value pairs listed in the start tag
  • E.g., in type="block" the name is "type" and the value is "block"
  • Attributes must be unique to an element, i.e., multiple attributes with the same name are not allowed
  • The value of an attribute is a string and cannot have nested elements
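
The attribute rules above can be checked with a short sketch in Python's standard xml.etree.ElementTree. The element name comment and the helper parses are assumptions for illustration; only the type="block" attribute comes from the slide.

```python
import xml.etree.ElementTree as ET

# Attributes are name-value pairs listed in the start tag.
comment = ET.fromstring('<comment type="block">/* note */</comment>')
print(comment.attrib)       # all attributes of the element
print(comment.get("type"))  # the value is always a string


def parses(text):
    """Hypothetical helper: True if the parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Two attributes with the same name on one element are rejected.
duplicate_ok = parses('<comment type="block" type="line"/>')
print(duplicate_ok)
```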

Attributes in XML

Attributes vs. Subelements

              Nested Text   Nested Elements   Purpose
Attributes    Yes           No                metadata
Subelements   Yes           Yes               data

well-formed

  • Requirement of XML
  • Exactly one root element, e.g., <students>...</students>
  • Elements form a tree structure and nest properly
  • Every element is either an empty element or has both a start tag and an end tag

well-formed Violations

  • More than one root element
  • Missing end tags
  • Not nested properly
  • XML processors are required to stop processing and report an error if a document is not well-formed
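
The violations above can be demonstrated with a small check in Python. This is a sketch, assuming the standard xml.etree.ElementTree parser as the XML processor; the function name is_well_formed and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        # The processor stops at the first well-formedness error.
        print(f"not well-formed: {error}")
        return False
    return True


samples = {
    "one root": "<students><student/></students>",
    "two roots": "<students/><students/>",
    "missing end tag": "<students><student></students>",
    "bad nesting": "<students><student></students></student>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}
print(results)
```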

Escaped Characters

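unescaped   escaped   notes
<           &lt;      required
&           &amp;     required
>           &gt;      typically done; required only in the sequence ]]> in content
'           &apos;    required only inside an attribute value delimited by '
"           &quot;    required only inside an attribute value delimited by "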

Escaped Text

  • Unescaped:

  • Escaped:
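
One way to see unescaped and escaped text side by side is the sketch below, which uses escape, unescape, and quoteattr from Python's standard xml.sax.saxutils. The sample text and the element name condition are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr, unescape

raw = "if (a < b && c > d)"  # unescaped text

# escape() replaces &, <, and > with their entity references.
escaped = escape(raw)
print(escaped)

# A parser turns the escaped form back into the original text.
element = ET.fromstring(f"<condition>{escaped}</condition>")
print(element.text)

# unescape() does the same conversion without parsing a document.
restored = unescape(escaped)

# quoteattr() escapes a value and wraps it in quotes for use as an attribute.
attribute = quoteattr("a & b")
print(attribute)
```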

Mixed Content

  • Element has both text and nested elements
  • Additional markup for a section item
  • Good for text with included markup
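
Mixed content shows up in an API as text interleaved with child elements. The sketch below uses Python's standard xml.etree.ElementTree, where the text before the first child is text and the text after a child is that child's tail. The element names item and emph are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The element item has both text and a nested element.
item = ET.fromstring("<item>Read the <emph>first</emph> chapter</item>")
emph = item[0]

print(repr(item.text))  # text before the nested element
print(repr(emph.text))  # text inside the nested element
print(repr(emph.tail))  # text after the nested element, still inside item

# itertext() walks all the text in document order.
full_text = "".join(item.itertext())
print(full_text)
```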

Namespaces

  • Ability to put element names into groups and specifically refer to them
  • prefix
    • e.g., s
  • URI
    • e.g., https://mlcollard.net/Student

Namespaces (more)

  • Declaration is scoped by the element it is declared on, typically the root element
  • namespace declaration, e.g., xmlns:s="https://mlcollard.net/Student" makes the prefix s shorthand for the URI https://mlcollard.net/Student
  • default namespace declaration (no prefix), e.g., xmlns="https://mlcollard.net/Student" puts unprefixed element names in the namespace https://mlcollard.net/Student

Namespaces with Default Prefix

Namespaces with Prefix

How to View Namespaces

  • Required for many XML processors, but often misunderstood and mishandled
  • Prefix is shorthand for the URI in that context/scope, i.e., think of <s:student> as
    • <{https://mlcollard.net/Student}student>
    • Note: NOT valid XML
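
Python's standard xml.etree.ElementTree happens to report element names in this same {URI}name form, which makes the expansion easy to see. A minimal sketch, with the document text made up for illustration:

```python
import xml.etree.ElementTree as ET

doc = '<s:student xmlns:s="https://mlcollard.net/Student">Ada</s:student>'
root = ET.fromstring(doc)

# The prefix s is replaced by the URI it stands for.
print(root.tag)

# Split the expanded name back into its URI and local name.
uri, local_name = root.tag[1:].split("}")
print(uri)
print(local_name)
```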

Namespace Matching

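  • Element name matching is based on namespace URI, not on the prefix. E.g., <s:student> with the prefix s bound to https://mlcollard.net/Student and <student> with that URI as the default namespace are identical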
  • Many XML processors (tools that work on XML) require a non-default prefix for certain functionality, e.g., XPath in libxml2
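
Both points can be sketched with Python's standard xml.etree.ElementTree in place of libxml2 (a swapped-in processor, chosen only because it needs no installation). The prefixes x and st and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

URI = "https://mlcollard.net/Student"

# The same elements written with two different prefixes and with the default namespace.
prefixed = ET.fromstring(f'<s:students xmlns:s="{URI}"><s:student/></s:students>')
renamed = ET.fromstring(f'<x:students xmlns:x="{URI}"><x:student/></x:students>')
default = ET.fromstring(f'<students xmlns="{URI}"><student/></students>')

# Matching is on the URI, so all three root elements have the same name.
tags = {prefixed.tag, renamed.tag, default.tag}
print(tags)

# A path query needs a prefix bound to the URI, even though this
# document declares the namespace without a prefix.
found = default.findall("st:student", {"st": URI})
unqualified = default.findall("student")  # no namespace, so no match
print(len(found), len(unqualified))
```

The prefix st in the query does not appear in any of the documents; only the URI it is bound to has to match.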

Multiple Namespaces

White Space

  • significant and insignificant
  • Newlines are normalized (i.e., converted) to Unix line endings even on Windows
    • CR + LF → LF
  • White space in tags is insignificant and is just a separator, i.e., the following are equivalent:
  • XML processors will often remove extra insignificant whitespace
  • White space in element content can be significant. Assume that it is.
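
The white space rules above can be observed with a short sketch in Python's standard xml.etree.ElementTree. The element names and sample text are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# White space inside a tag is only a separator.
compact = ET.fromstring('<name type="a"/>')
spaced = ET.fromstring('<name   type = "a"\n   />')
same = (compact.tag, compact.attrib) == (spaced.tag, spaced.attrib)
print(same)

# White space in element content is preserved by the parser.
padded = ET.fromstring("<name>  cin  </name>")
print(repr(padded.text))

# Windows line endings (CR + LF) in content are normalized to LF.
windows = ET.fromstring("<text>one\r\ntwo</text>")
print(repr(windows.text))
```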

Validating for well-formedness

  • Any XML producer must produce well-formed XML, or it cannot be processed
  • xmllint - Tool for checking if XML is well-formed
  • Ubuntu package: libxml2-utils
  • Preinstalled on macOS
  • To check if XML is well-formed: xmllint --noout data.xml

Further Validation

  • Allowed elements, text content of elements, and nested elements are specific to each XML application (i.e., each XML format)
  • Must be defined in an XML grammar (schema)
  • DTD
  • XML Schema
  • RELAX NG

Special XML Contents

  • XML comment
  • CDATA section
  • XML Processing Instructions

XML Comments

  • Comment Start: <!--
  • Comment End: -->

XML Terminology

  • qName - Short for qualified name
  • URI - Short for Uniform Resource Identifier. URLs are a subset of URIs

XML Interface Concepts

  • Start Document
  • XML Declaration: version, encoding, standalone
  • Element Start Tag: qName, prefix, localName
  • Element End Tag: qName, prefix, localName
  • Characters: characters
  • Attribute: qName, prefix, localName, value
  • XML Namespace: prefix, uri
  • XML Comment: value
  • CDATA: characters
  • Processing Instruction: target, data
  • End Document
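
Several of these concepts map onto callbacks in an event-based API. The sketch below uses Python's standard xml.sax as one such interface; the class name EventLog, the event labels, and the sample document are assumptions for illustration. This handler does not report the XML declaration, comments, or CDATA boundaries, and without namespace processing the element names it receives are qNames.

```python
import xml.sax


class EventLog(xml.sax.ContentHandler):
    """Record each parsing event as a tuple, in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("start document",))

    def startElement(self, name, attrs):
        # attrs holds the attributes of this start tag
        self.events.append(("start tag", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end tag", name))

    def characters(self, content):
        self.events.append(("characters", content))

    def processingInstruction(self, target, data):
        self.events.append(("processing instruction", target, data))

    def endDocument(self):
        self.events.append(("end document",))


document = b'<?xml version="1.0"?><unit><?note keep?><name type="id">cin</name></unit>'
log = EventLog()
xml.sax.parseString(document, log)
for event in log.events:
    print(event)
```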

Names Example

  • qName - Full element name
    • E.g., src:if
  • prefix - Namespace prefix
    • E.g., src
  • localName - Element name inside the namespace, i.e., element name without the prefix
    • E.g., if
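
The relationship between the three names can be written as a small function. This is a sketch in Python; the function name split_qname is hypothetical, and an element without a prefix is given an empty prefix here by assumption.

```python
def split_qname(qname):
    """Split a qualified name into (prefix, localName)."""
    prefix, colon, local_name = qname.partition(":")
    if not colon:
        # No prefix: the whole qName is the local name.
        return "", qname
    return prefix, local_name


print(split_qname("src:if"))
print(split_qname("if"))
```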