Overview
- The main points from yesterday
- Suggested solution to yesterday's exercise
- Schema languages for XML
- DTD syntax
- Linking DTDs to XML documents
- Exercise
XML: recap
- A markup language marks up the structure of documents
- Enabling computers to process the documents
- Separating structure and content from presentation
- The XML specification:
- Defines the rules and syntax of well-formed XML documents
- Describes how to create a markup language with a Document Type Definition (DTD)
- An XML-based markup language is called an application of XML
- XML documents are made up of elements and attributes
- An XML processor must reject XML documents which are not well-formed
Suggested solution to yesterday's exercise
reviews.xml
Schema
- Formally describes an XML application (i.e. a markup language)
- Which elements and attributes can be used
- Constraints on the structure and content of document instances
- Some XML processors can check whether a document conforms to a schema
- An XML document that conforms to its schema is called a valid document
- A schema can be used by XML editors, "forcing" the document writer to create a valid document
Different schema languages
- DTD (Document Type Definition): approved by W3C (part of the XML specification)
- XML Schema Language: approved by W3C (a Recommendation since May 2001)
- RELAX NG
- Schematron
XHTML—an XML application
- Extensible HyperText Markup Language
- A W3C Recommendation
- "HTML reformulated in XML"
- Element and attribute names must be lowercase (<h1>, </h1>)
- Empty elements must be closed (<br />)
- Attribute values must be quoted (<table border="0">)
- Web browsers can read and display XHTML documents
- Defined as a DTD: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd (XHTML 1.0 Strict)
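A minimal sketch of a document that follows the three rules above and references the XHTML 1.0 Strict DTD. The page content is made up for illustration; the public identifier and the namespace URI are the ones given in the XHTML 1.0 Recommendation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <!-- The Strict DTD requires a title element inside head -->
    <title>Illustrative page</title>
  </head>
  <body>
    <!-- Lowercase element names, matching start and end tags -->
    <h1>Book reviews</h1>
    <!-- The empty element br is closed with " />" -->
    <p>First line<br />second line</p>
    <!-- Attribute values are always quoted -->
    <table border="0">
      <tr><td>A single cell</td></tr>
    </table>
  </body>
</html>
```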
DTD
- A set of declarations
- Declares which elements can be used in the language
- Declares the content model of each element
- Declares which attributes can be used for the different elements, and their legal values
- Declares entities
- The order of the declarations is of no importance
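- All XML documents must be well-formed, but validity is not mandatory
- A non-validating XML processor accepts any well-formed XML document, even if it is not valid; a validating processor reports validity errors, but unlike well-formedness errors they are not fatal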
Our example document
<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book number="b1">
    <title>How to Build a Digital Library</title>
    <authors>
      <author>
        <first_name>Ian H.</first_name>
        <family_name>Witten</family_name>
      </author>
      <author>
        <first_name>David</first_name>
        <family_name>Bainbridge</family_name>
      </author>
    </authors>
    <year>2003</year>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <isbn>1-55860-790-0</isbn>
  </book>
  <book number="b2">
    <title>Understanding Digital Libraries</title>
    <edition>2</edition>
    <authors>
      <author>
        <first_name>Michael</first_name>
        <family_name>Lesk</family_name>
      </author>
    </authors>
    <year>2005</year>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <isbn>1-55860-924-5</isbn>
  </book>
</books>
A possible DTD for this document
<!ELEMENT books (book+)>
<!ELEMENT book (title, edition?, authors, year, publisher, isbn)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT edition (#PCDATA)>
<!ELEMENT authors (author+)>
<!ELEMENT author (first_name, family_name)>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT family_name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ATTLIST book number ID #REQUIRED>
Element declarations
- The general syntax:
<!ELEMENT element_name content_model>
- The most common content models are:
- Parsed character data
- Child elements
- Mixed content
- Empty elements
Element declarations: Parsed character data
- Parsed character data can include:
- Plain characters
- Entity references, like &lt;
The XML processor replaces each entity reference with its replacement text in the document tree
- Written as #PCDATA in a DTD
- Example: <!ELEMENT first_name (#PCDATA)>
- #PCDATA in an element declaration must be enclosed in parentheses
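A small sketch of parsed character data with entity references, using an internal DTD subset so the example is self-contained. The title text is made up for illustration. After parsing, the element content is the string "Markup & Meaning: why <b> is not emphasis".

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE title [
  <!-- title may contain only parsed character data -->
  <!ELEMENT title (#PCDATA)>
]>
<!-- &amp; &lt; and &gt; are predefined entities; the processor replaces them -->
<title>Markup &amp; Meaning: why &lt;b&gt; is not emphasis</title>
```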
Element declarations: Child elements
- Elements that contain no character data, only child elements
- Example: <!ELEMENT author (first_name, family_name)>
- The child elements in the XML document must appear in exactly the order specified in the DTD, so this is not valid:
<author>
<family_name>Witten</family_name>
<first_name>Ian H.</first_name>
</author>
- The number of allowed instances of a child element can be specified with a suffix
- ?: Zero or one element instance is allowed
- *: Zero or more element instances are allowed
- +: One or more element instances are allowed
- With no suffix, exactly one element instance is allowed
- Example: <!ELEMENT book (title, edition?, authors, year, publisher, isbn)>
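The sequence and suffix rules above can be tried out with the declarations from the example DTD placed in an internal subset. This sketch uses book as the root element, which is an assumption made to keep the example short.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book [
  <!ELEMENT book (title, edition?, authors, year, publisher, isbn)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT edition (#PCDATA)>
  <!ELEMENT authors (author+)>
  <!ELEMENT author (first_name, family_name)>
  <!ELEMENT first_name (#PCDATA)>
  <!ELEMENT family_name (#PCDATA)>
  <!ELEMENT year (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
  <!ELEMENT isbn (#PCDATA)>
  <!ATTLIST book number ID #REQUIRED>
]>
<book number="b2">
  <title>Understanding Digital Libraries</title>
  <!-- edition? : may be left out, but may not occur twice -->
  <edition>2</edition>
  <!-- author+ : at least one author is required -->
  <authors>
    <author>
      <first_name>Michael</first_name>
      <family_name>Lesk</family_name>
    </author>
  </authors>
  <year>2005</year>
  <publisher>Morgan Kaufmann Publishers</publisher>
  <isbn>1-55860-924-5</isbn>
</book>
```

Removing the edition element keeps the document valid. Moving year before authors, or emptying the authors element, makes it invalid while it remains well-formed.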
Element declarations: Mixed content
- Elements whose content can mix parsed character data with child elements
- Markup example:
<paragraph>But Finnis<note>2</note> states that the works of Rushdie are not postmodern. Marx uses the term '<term>the deconstructive paradigm of reality</term>' to denote not, in fact, structuralism, but poststructuralism.</paragraph>
- The DTD element declaration: <!ELEMENT paragraph (#PCDATA | note | term)*>
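A self-contained sketch of the mixed content declaration above. The declarations for note and term are assumptions, since the slide only shows the paragraph declaration. In a mixed content model #PCDATA must come first and the group must end with *, so neither the order nor the number of child elements can be constrained.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE paragraph [
  <!-- Mixed content: text freely interleaved with note and term elements -->
  <!ELEMENT paragraph (#PCDATA | note | term)*>
  <!-- Assumed declarations for the two child elements -->
  <!ELEMENT note (#PCDATA)>
  <!ELEMENT term (#PCDATA)>
]>
<paragraph>But Finnis<note>2</note> states that the works of Rushdie are not postmodern. Marx uses the term '<term>the deconstructive paradigm of reality</term>' to denote not, in fact, structuralism, but poststructuralism.</paragraph>
```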
Element declarations: Empty elements
- Empty elements have no content, but can have attributes
- Example (forced line break in XHTML): <!ELEMENT br EMPTY>
- In markup: <p>This is a<br/>line break.</p>
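A sketch of an EMPTY element in a complete document. The content model for p and the class attribute are simplified assumptions and not the actual XHTML declarations.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE p [
  <!-- Simplified: p holds text and line breaks only -->
  <!ELEMENT p (#PCDATA | br)*>
  <!-- br has no content at all -->
  <!ELEMENT br EMPTY>
  <!-- An empty element can still carry attributes -->
  <!ATTLIST br class CDATA #IMPLIED>
]>
<p>This is a<br/>line break, and this is one<br class="wide"></br>written with a start tag and an end tag.</p>
```

Both forms are valid for an EMPTY element as long as nothing, not even whitespace, appears between the start tag and the end tag. The empty-element form is the usual choice.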
Attribute declarations
- The general syntax:
<!ATTLIST element_name
attribute_name type description
...>
- The most common attribute types:
- CDATA: Character data, including whitespace
- NMTOKEN: Name token, made up of name characters only (letters, digits, ".", "-", "_", ":"), so no whitespace
- ID: Identifier that is unique within the document (like a primary key); the value must be an XML name, so it cannot start with a digit
- IDREF: Reference to an ID value in the same document (like a foreign key)
- Enumerations: a list of possible values
- The most common descriptions:
- #IMPLIED: The attribute is optional
- #REQUIRED: The attribute is required
- Literal: a default value is given as a quoted string
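A sketch that combines the attribute types and descriptions listed above in one attribute list. Only the number attribute comes from the example DTD. The attributes related, shelf, format and note, their values, and the shortened content model for book are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE books [
  <!ELEMENT books (book+)>
  <!-- Shortened content model, for illustration only -->
  <!ELEMENT book (title)>
  <!ELEMENT title (#PCDATA)>
  <!ATTLIST book
    number   ID                      #REQUIRED
    related  IDREF                   #IMPLIED
    shelf    NMTOKEN                 #IMPLIED
    format   (hardcover | paperback) "paperback"
    note     CDATA                   #IMPLIED>
]>
<books>
  <!-- format is omitted here, so the processor supplies "paperback" -->
  <book number="b1" shelf="A-12" note="free text, spaces allowed">
    <title>How to Build a Digital Library</title>
  </book>
  <!-- related must match the ID value of some element in this document -->
  <book number="b2" related="b1" format="hardcover">
    <title>Understanding Digital Libraries</title>
  </book>
</books>
```

Changing related to "b3" would make the document invalid because no element has that ID, and shelf="A 12" would be invalid because a name token cannot contain whitespace.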
Attribute declarations—some examples
- ID: <!ATTLIST book number ID #REQUIRED>
- Valid markup (start tag): <book number="b1">
- The value of the number attribute must be unique in the document
- Not valid markup: <book>
- CDATA: <!ATTLIST homepage type CDATA #IMPLIED>
- Valid markup: <homepage type="personal">
- Also valid: <homepage>
- Enumeration: <!ATTLIST homepage lang (no | en) "no">
- Valid markup: <homepage lang="en">
- Valid markup: <homepage>
- The XML processor will add the lang attribute with the default value no
- Not valid markup: <homepage lang="ge">
General entities
- You can define your own entities for content that is used many times in documents
- Works like the predefined character entities
- General syntax: <!ENTITY entity_name "replacement text">
- Example: <!ENTITY hioa "Oslo and Akershus University College of Applied Sciences">
- In markup: <message>Welcome to &hioa;</message>
- The XML processor will replace every occurrence of the entity reference with the replacement text
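The entity example above as a complete document, with the declaration placed in an internal subset. The declaration of the message element is an assumption added so that the document is also valid.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE message [
  <!ELEMENT message (#PCDATA)>
  <!-- The replacement text is always a quoted string -->
  <!ENTITY hioa "Oslo and Akershus University College of Applied Sciences">
]>
<!-- A reference is written as an ampersand, the entity name and a semicolon -->
<message>Welcome to &hioa;</message>
```

After parsing, the content of message is "Welcome to Oslo and Akershus University College of Applied Sciences".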
Advanced topics
- Parameter entities
- For repeated text in DTD declarations
- Example: <!ENTITY % block "p | ul | blockquote">
- Usage: <!ELEMENT chapter (h1, (%block;)+)>
- Result: <!ELEMENT chapter (h1, (p | ul | blockquote)+)>
- Used a lot in the XHTML DTD!
- Modularisation
- Importing modules from external sources
- Conditional includes (INCLUDE and IGNORE)
- XHTML has many modules, enabling inclusion of parts of the language (like the character entities or table definitions) in other XML applications
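A sketch of how the parameter entity and a conditional section could look in an external DTD file. The file name chapter.dtd, the lists switch and every declaration except the one for chapter are assumptions. Parameter entity references inside declarations, and conditional sections, are only allowed in the external subset, so this fragment would not work inside the square brackets of an internal subset.

```xml
<!-- chapter.dtd: hypothetical external DTD file -->

<!-- Parameter entity holding a reusable piece of a content model -->
<!ENTITY % block "p | ul | blockquote">

<!-- Switch for the conditional section below: INCLUDE or IGNORE -->
<!ENTITY % lists "INCLUDE">

<!ELEMENT chapter (h1, (%block;)+)>
<!ELEMENT h1 (#PCDATA)>
<!ELEMENT p (#PCDATA)>
<!ELEMENT blockquote (p+)>

<!-- These declarations are read only when lists is set to INCLUDE -->
<![%lists;[
<!ELEMENT ul (li+)>
<!ELEMENT li (#PCDATA)>
]]>
```

A DTD that imports this file can declare lists as "IGNORE" before the import; the first declaration of an entity wins, so the list declarations are then skipped.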
A generic XML document
1. <?xml version="1.0"?>
2. <!DOCTYPE book_reviews SYSTEM
"http://www.jbi.hio.no/book_reviews.dtd"
3. [
<!ENTITY part1 SYSTEM "part1.xml">
<!ENTITY ouc "Oslo University College">
]>
4. <book_reviews>
<!-- The actual document -->
</book_reviews>
- 1: XML declaration
- 2: Reference to the external subset
- 3: Internal subset
- 4: The root (or document) element
- 1, 2 and 3 together form the document prolog
- 2 and 3 together form the document type declaration
The document type declaration
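A sketch of linking the example document to an external DTD stored in the same directory. The file name books.dtd is hypothetical and is assumed to contain the declarations from the possible DTD shown earlier. The name that follows <!DOCTYPE must match the root element of the document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- SYSTEM is followed by a URI; a relative URI is resolved against the document location -->
<!DOCTYPE books SYSTEM "books.dtd">
<books>
  <book number="b2">
    <title>Understanding Digital Libraries</title>
    <edition>2</edition>
    <authors>
      <author>
        <first_name>Michael</first_name>
        <family_name>Lesk</family_name>
      </author>
    </authors>
    <year>2005</year>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <isbn>1-55860-924-5</isbn>
  </book>
</books>
```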
Two topics from yesterday
- Namespaces
- Namespaces are used to prevent naming conflicts when elements from two different XML languages are combined in one document
- The XML languages are defined in a schema language, like DTD
- The standalone attribute in the XML declaration
- Example: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
- no: The XML processor may have to read external files to make the XML document complete, i.e. there are general entities and/or default values for attributes defined in an external DTD
- yes: The XML document does not depend on external declarations
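A sketch that illustrates both topics in one document. The namespace URI http://example.org/books is a made-up identifier for the books language; the XHTML namespace URI is the real one. Both languages have an element called title, and the namespaces keep them apart. The document has no DTD, so it depends on no external declarations and standalone="yes" is accurate.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Default namespace: the books language. Prefix xhtml: the XHTML language. -->
<books xmlns="http://example.org/books"
       xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <book number="b1">
    <!-- title from the books language -->
    <title>How to Build a Digital Library</title>
    <xhtml:div>
      <!-- p from XHTML, kept apart from the books elements by its prefix -->
      <xhtml:p>A short description marked up with XHTML.</xhtml:p>
    </xhtml:div>
  </book>
</books>
```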
Exercise
- Create an external DTD for the book reviews XML document
- Use XML Copy Editor to create and edit the DTD and the XML document
- Save the DTD file in the same directory as the XML document
- Reference the DTD in the document prolog of the XML document
- Check that the XML document is valid according to the DTD
- Tip: Start with the declaration of the root element