Plan for the summer school
- Monday 22 August
- Introduction and well-formed documents
- Tuesday 23 August
- DTDs and valid documents
- Wednesday 24 August
- Making use of XML
Exercises every day.
Today's overview
- Historical background
- The XML specification
- Well-formed XML documents
- Exercise
The World Wide Web (WWW)
- Tim Berners-Lee, CERN, 1990
- Three important technical specifications
- HTML (HyperText Markup Language)
- HTTP (HyperText Transfer Protocol)
- URL (Uniform Resource Locator)
- Utilising three "old" technologies
- The Internet
Has its roots in the ARPAnet, created in 1969 and
financed by the US Department of Defence
- Hypertext
The idea: Vannevar Bush, As We May Think (1945)
The word: Ted Nelson, 1965
- Markup languages
Unstructured text
The deconstructive paradigm of reality and cultural
deappropriation John Hubbard Department of English, Oxford University D. Hans Hanfkopf
Department of Literature, Miskatonic University, Arkham, Mass. Cultural deappropriation
and postdialectic desituationism The primary theme of d'Erlette's1 analysis of the
capitalist paradigm of context is the genre, and some would say the paradigm, of
preconceptual class. Therefore, Bataille uses the term 'postdialectic desituationism'
to denote not theory, as Lyotard would have it, but subtheory. The characteristic
theme of the works of Rushdie is a textual totality. But Finnis2 states that the works
of Rushdie are not postmodern. Marx uses the term 'the deconstructive paradigm of reality'
to denote not, in fact, structuralism, but poststructuralism. Therefore, if postdialectic
desituationism holds, we have to choose between cultural deappropriation and neodialectic
discourse. Sontag promotes the use of postdialectic desituationism to analyse and attack
society. In a sense, a number of narratives concerning cultural deappropriation may be
revealed. The premise of capitalist capitalism holds that narrative is created by
communication.
(Text randomly generated by the Postmodernism Generator on August 2, 2007.)
Marked-up text
<article>
<header1>The deconstructive
paradigm of reality and cultural
deappropriation</header1>
<author>John
Hubbard</author>
<institution>Department of
English, Oxford University</institution>
<author>D.
Hans Hanfkopf</author>
<institution>Department of
Literature, Miskatonic University, Arkham,
Mass.</institution>
<paragraph>Cultural
deappropriation and postdialectic desituationism The primary theme of
d'Erlette's<note>1</note> analysis of
the capitalist paradigm of context is the genre, and some would say the paradigm, of
preconceptual class. Therefore, Bataille uses the term
<term>'postdialectic desituationism'</term> to
denote not theory, as Lyotard would have it, but subtheory. The characteristic
theme of the works of Rushdie is a textual
totality.</paragraph>
<paragraph>But
Finnis<note>2</note> states that the
works of Rushdie are not postmodern. Marx uses the term <term>'the
deconstructive paradigm of reality'</term> to denote not, in fact,
structuralism, but
poststructuralism.</paragraph>
<paragraph>Therefore,
if postdialectic desituationism holds, we have to choose between cultural deappropriation and neodialectic
discourse. Sontag promotes the use of postdialectic desituationism to analyse and attack
society.</paragraph>
<paragraph>In a sense,
a number of narratives concerning cultural deappropriation may be
revealed. The premise of capitalist capitalism holds that narrative is created by
communication.</paragraph>
</article>
HTML
- A markup language with support for hypertext links, used on the web
- A fixed language with a fixed syntax
- The HTML specification is maintained by W3C (The World
Wide Web Consortium)
- The latest version is HTML 4.01
- HTML 5 is still a working draft
- Web browsers can read and display HTML documents
- An application of SGML (Standard Generalized Markup Language)
SGML
- A set of rules for creating markup languages
- Document structure is marked up using tags (<example>, </example>)
- A markup language (an application of SGML) is formally described in a DTD (Document Type Definition)
- Developed by Charles F. Goldfarb and colleagues at IBM in the 1970s
- Accepted as an ISO standard for document exchange in 1986 (ISO 8879)
SGML: Advantages
- Separating structure and content from presentation
- Digital preservation
- Plain text with only character data—not proprietary binary formats
- Human readable
- Quite easy to write software applications that can read and
process the documents (in theory...)
SGML: Usage
- Technical documentation
- The graphic industry
- Too complex to be widely used
- Few SGML editors could compete with word processing
software like Microsoft Word
From SGML to XML
- The first SGML success story: HTML!
- HTML: Fixed, with limited semantics
- SGML: General, but too complex
- The solution: XML!
XML
- Extensible Markup Language
- XML 1.0 became a W3C Recommendation in 1998
- A set of rules for designing text formats to structure data
- A "lightweight" version of SGML
- The parts of SGML that are hard to implement have been removed
- Smaller parsers/processors that can fit into web browsers
- Easier to use for document creators
XML applications: Examples 1
- XHTML (The Extensible HyperText Markup Language)
- A reformulation of HTML in XML 1.0
- Example: View source of this document
(Illustration by Gerd Berget, 2005)
XML applications: Examples 2
- Web syndication
- RSS (Really Simple Syndication or RDF Site Summary)
(Example: BBC)
- Atom
(Example: Atomenabled.org)
- MARCXML
- The two competing document formats for office applications (text documents, spreadsheets and presentations)
- OpenDocument Format (ODF)
- Approved as an OASIS (Organization for the Advancement of Structured Information Standards) standard in 2005 and later by ISO/IEC in 2006 (ISO/IEC 26300)
- Office Open XML (OOXML)
- Developed by Microsoft to suit Microsoft Office documents
- Approved as an Ecma International standard in 2006 and later by ISO/IEC in 2008 (ISO/IEC 29500)
The World Wide Web Consortium (W3C)
- Established in October 1994 at MIT (Massachusetts Institute of Technology)
- Led by Tim Berners-Lee
- Mission statement: To lead the World Wide Web to its full potential
- Academic and commercial institutions are members
- E.g. IBM, Microsoft and Sun
W3C's goals
Principles
- Web for everyone
- Web accessibility
- Internationalisation
- Device independence
- Web on everything
Make web access easy on many different new devices (mobile phones, PDAs,
TVs and more)
Vision
- Web for Rich Interaction
- Web of Data and Services
Creating a web of information for human and machine processing (the semantic web)
- Web of Trust
Promote web technologies that enable accountability, security, confidence and
confidentiality
The W3C Process
- Working draft
- Candidate Recommendation
- Proposed Recommendation
- W3C Recommendation
XML-related W3C Recommendations
- XML 1.0 (Fifth Edition)
- Namespaces in XML
- XPointer
- XLink
- DOM
- XML Schema Language
- XSL
The XML specification
- URL: http://www.w3.org/TR/REC-xml/
- Two parts
- Describes the syntax rules for well-formed XML documents
- Describes the schema language DTD (Document Type Definition) to create
XML applications (markup languages)
XML concepts
<message type="welcome">Welcome to Oslo and Akershus University
College of Applied Sciences!</message>
- This XML document consists of one element called message
- Start tag: <message type="welcome">
- End tag: </message>
- Content: Welcome to Oslo and Akershus University College of Applied Sciences!
- In this case, it is also the document's root element (or document element)
- Attribute: type="welcome"
- Name: type
- Value: welcome
- The document is well-formed
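The parts named above can also be read out programmatically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the XML processor; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The one-element document from the slide, written on a single line.
doc = ('<message type="welcome">Welcome to Oslo and Akershus University '
       'College of Applied Sciences!</message>')

root = ET.fromstring(doc)     # raises ParseError if the text is not well-formed
element_name = root.tag       # name of the root element
attributes = root.attrib      # attribute names and values as a dictionary
content = root.text           # character data between start tag and end tag

print(element_name, attributes, content)
```

Because the parser returns without raising an error, the document is well-formed.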
Document types
- Text-centric documents (semi-structured)
- Data-centric documents (structured)
XML documents
- An XML document is a stream of characters that conforms to
the requirements for well-formed XML
- Not necessarily a file on a computer
- Instant and short-lived XML documents are used
when computer systems exchange data
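To illustrate that an XML document need not be a file, here is a small sketch that parses a document held only in memory, as it might arrive from another system. The status element and its code attribute are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Bytes as they might arrive over a network connection; no file is involved.
# The element and attribute names are hypothetical.
payload = b'<?xml version="1.0" encoding="UTF-8"?><status code="200">OK</status>'

stream = io.BytesIO(payload)  # an in-memory stream of bytes
tree = ET.parse(stream)       # ET.parse accepts any file-like object
root = tree.getroot()

print(root.tag, root.get("code"), root.text)
```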
XML editors
- Any plain text editor
- XML editors
Well-formed XML document: An example
<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book number="b1">
    <title>How to Build a Digital Library</title>
    <authors>
      <author>
        <first_name>Ian H.</first_name>
        <family_name>Witten</family_name>
      </author>
      <author>
        <first_name>David</first_name>
        <family_name>Bainbridge</family_name>
      </author>
    </authors>
    <year>2003</year>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <isbn>1-55860-790-0</isbn>
  </book>
  <book number="b2">
    <title>Understanding Digital Libraries</title>
    <edition>2</edition>
    <authors>
      <author>
        <first_name>Michael</first_name>
        <family_name>Lesk</family_name>
      </author>
    </authors>
    <year>2005</year>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <isbn>1-55860-924-5</isbn>
  </book>
</books>
Document tree
- XML processors typically represent a parsed XML document as a document tree in the computer's memory
- The root node is books
- Element nodes in yellow
- Attribute nodes in light green
- Text nodes (and leaf nodes) in white
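As a rough illustration of the document tree, the sketch below parses a shortened version of the books example and prints one line per element node, indented by its depth. It assumes Python's xml.etree.ElementTree; the outline function is a hypothetical helper written for this example, and ElementTree stores attributes and text on the element rather than as separate nodes.

```python
import xml.etree.ElementTree as ET

# A shortened version of the books document (one book only).
books_xml = """<books>
  <book number="b2">
    <title>Understanding Digital Libraries</title>
    <authors>
      <author>
        <first_name>Michael</first_name>
        <family_name>Lesk</family_name>
      </author>
    </authors>
  </book>
</books>"""

def outline(element, depth=0):
    """Return one line per element, indented by its depth in the tree."""
    text = (element.text or "").strip()
    attrs = "".join(f" @{name}={value}" for name, value in element.attrib.items())
    line = "  " * depth + element.tag + attrs
    if text:
        line += f': "{text}"'
    lines = [line]
    for child in element:          # iterating an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(books_xml))
print("\n".join(tree_lines))
```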
Relationships
Well-formed XML
- A well-formed XML document conforms to the syntax rules described
in the XML specification
- ... has one or more elements
- ... has exactly one root element
- The root element has no parent
- ... consists of properly nested elements
- If the start tag of one element is part of the
content of another element, then the end tag must also be
part of the content of the same element
Example 1
<message type="greeting">Happy
birthday!</message>
<message type="greeting">Hello!</message>
- This document is not well-formed because it has two top-level elements instead of exactly one root element
We can make it well-formed by adding a root element:
<messages>
<message type="greeting">Happy
birthday!</message>
<message type="greeting">Hello!</message>
</messages>
Example 2
<message type="warning"><caution>Beware
of <strong>the
dog!</caution></strong></message>
- This document is not well-formed because the elements are not properly
nested
- The start tag of the strong element is inside the caution element, but the end tag is outside
We can fix this by nesting the elements correctly:
<message type="warning"><caution>Beware
of <strong>the dog!</strong></caution></message>
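Both examples can be checked mechanically. The sketch below assumes Python's xml.etree.ElementTree as the XML processor; is_well_formed is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Example 1: two top-level elements, then the same content inside one root.
two_roots = ('<message type="greeting">Happy birthday!</message>'
             '<message type="greeting">Hello!</message>')
one_root = "<messages>" + two_roots + "</messages>"

# Example 2: overlapping elements, then properly nested elements.
overlapping = ('<message type="warning"><caution>Beware of <strong>the '
               'dog!</caution></strong></message>')
nested = ('<message type="warning"><caution>Beware of <strong>the '
          'dog!</strong></caution></message>')

for sample in (two_roots, one_root, overlapping, nested):
    print(is_well_formed(sample))
```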
XML applications versus HTML
- Element and attribute names are case sensitive in XML applications
-
<message> and
<Message> are start tags of different
element types
- An XML processor must reject documents which are not well-formed
- HTML processors (e.g. web browsers) normally accept them and try to correct faulty markup
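A short sketch of case sensitivity, assuming Python's xml.etree.ElementTree; the notes wrapper element is made up for the example. An end tag that differs from its start tag only in case is rejected, and message and Message are treated as different element types.

```python
import xml.etree.ElementTree as ET

# <message> and <Message> are distinct element types inside the same parent.
root = ET.fromstring("<notes><message>a</message><Message>b</Message></notes>")
tags = [child.tag for child in root]
lowercase_only = root.findall("message")   # matches the lower-case element only

# An end tag with different capitalisation does not close the start tag.
try:
    ET.fromstring("<message>Hi</Message>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False

print(tags, len(lowercase_only), mismatch_accepted)
```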
The XML declaration
- Tells the XML processor that the document claims to be XML
- Precedes everything else in the XML document
- Example: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
- version: The version of XML used in the document
- encoding: The character encoding used in the document. The example declares UTF-8 (8-bit Unicode Transformation Format), which is also what an XML processor assumes when no encoding is declared and the document has no byte order mark
- standalone: Tells the XML processor whether it needs to read other files (external markup declarations) to make the XML document complete. Legal values: yes and no
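To see how a processor reads the declaration, here is a minimal sketch using the expat parser from Python's standard library. The handler name and the dictionary are only illustrative; expat reports standalone as 1 for yes, 0 for no and -1 when it is omitted.

```python
import xml.parsers.expat

declaration = {}

def on_xml_declaration(version, encoding, standalone):
    # Called once by expat when it has parsed the XML declaration.
    declaration.update(version=version, encoding=encoding, standalone=standalone)

parser = xml.parsers.expat.ParserCreate()
parser.XmlDeclHandler = on_xml_declaration
parser.Parse(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><books/>',
    True,  # True marks this as the final chunk of the document
)

print(declaration)
```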
Elements
- Element names must start with a letter or an underscore (_)
- Can also contain these characters:
- Letters
- Digits
- Dots (.)
- Hyphens (-)
- Underscores (_)
- Choose descriptive and recognisable names for elements
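The naming rules above can be approximated with a regular expression. This is a deliberately simplified sketch: it covers ASCII characters only, whereas real XML names also allow many non-ASCII letters and the colon (reserved for namespace prefixes). The function name is hypothetical.

```python
import re

# Simplification: ASCII letters, digits, dots, hyphens and underscores only.
# First character: a letter or an underscore.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_simple_name(name):
    """Return True if name follows the simplified naming rules above."""
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ("first_name", "_id", "book.title-1", "2nd_author", "family name"):
    print(candidate, is_simple_name(candidate))
```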
Attributes
- Attributes are pairs of names and values
- The same naming rules as for elements
- Values must be enclosed in quotes (") or, alternatively, apostrophes (')
- Tip: Use quotes by default, and switch to apostrophes when the attribute value itself contains quotes. A quoted value may contain apostrophes:
- <book title="An 'ironic' guy">
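When attribute values are generated by a program, the choice of delimiter can be left to a library. The sketch below uses quoteattr from Python's xml.sax.saxutils, which picks quotes or apostrophes depending on the value; the sample titles are made up.

```python
from xml.sax.saxutils import quoteattr

# A value containing apostrophes is wrapped in quotes.
with_apostrophes = quoteattr("An 'ironic' guy")

# A value containing quotes is wrapped in apostrophes.
with_quotes = quoteattr('The "big" book')

tag = f"<book title={with_apostrophes}>"
print(with_apostrophes, with_quotes, tag)
```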
Namespaces
- To prevent naming conflicts when elements from two different
XML languages are combined in one document
- The syntax: <name:element xmlns:name="url">
- Example: <library:author xmlns:library="http://www.hio.no/author/">
- The URL is just a unique name, not a DTD, but it could point to a description of the namespace
- When several elements come from the same namespace, declare the prefix once on the topmost element; the declaration then applies to that element and everything inside it
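A small sketch of how a processor sees namespaced elements, assuming Python's xml.etree.ElementTree. The prefix is declared once on the topmost element; the library:books wrapper element and the local prefix lib are assumptions made for the example. ElementTree replaces the prefix with the namespace name in braces.

```python
import xml.etree.ElementTree as ET

# The prefix "library" is declared once, on the topmost element.
doc = """<library:books xmlns:library="http://www.hio.no/author/">
  <library:author>Michael Lesk</library:author>
  <library:author>Ian H. Witten</library:author>
</library:books>"""

root = ET.fromstring(doc)

# The prefix used when searching is local to this program; only the
# namespace name has to match the one declared in the document.
ns = {"lib": "http://www.hio.no/author/"}
authors = [author.text for author in root.findall("lib:author", ns)]

print(root.tag)
print(authors)
```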
Comments
- XML documents can contain comments making them more readable
for humans
- Comments are written with the same syntax as in HTML:
<!-- This is a comment -->
- A comment cannot contain two consecutive hyphens
- The end of a comment cannot be preceded by a third hyphen
- Can be placed anywhere outside other markup, that is, between tags but not inside them
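The comment rules can be tried out against a parser. This is a minimal sketch assuming Python's xml.etree.ElementTree; parses is a hypothetical helper and the note element is made up.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = parses("<note><!-- This is a comment -->text</note>")
double_hyphen = parses("<note><!-- two -- hyphens -->text</note>")
third_hyphen = parses("<note><!-- ends badly --->text</note>")
inside_tag = parses("<note <!-- not here --> >text</note>")

print(ok, double_hyphen, third_hyphen, inside_tag)
```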
Processing instructions
- Give information to a particular processing application
- Ignored by other applications
- Start with <? and end with ?>
- Can be placed anywhere outside other markup, like comments
- An example: <?xml-stylesheet href="books.css" type="text/css"?>
- This processing instruction tells web browsers that the XML document should be presented using the style sheet books.css
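An application that cares about a processing instruction can read its target and data. The sketch below assumes Python's xml.dom.minidom, which keeps processing instructions as nodes in the document tree; the empty books element stands in for a full document.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet href="books.css" type="text/css"?><books/>'
)

# The processing instruction comes before the root element,
# so it is the first child of the document node.
pi = doc.firstChild
is_pi = pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE

print(is_pi, pi.target, pi.data)
```

The target names the application the instruction is meant for; the data is passed on as plain text, and it is up to that application to interpret it.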
Whitespace
- Definition: spaces, tabs and line breaks
- Two important rules:
- Whitespace inside tags (for example between attributes) is not significant
- <author born="1925"
        dead="1964">
- Whitespace in element content is preserved
- <p>  This paragraph contains two spaces
before the first word, and two line breaks
are part of the content.</p>
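Both whitespace rules can be observed with a parser. This is a sketch assuming Python's xml.etree.ElementTree; the author element is written as an empty element here so that it forms a complete document, and the paragraph text is shortened.

```python
import xml.etree.ElementTree as ET

# Rule 1: a line break between attributes does not change the element.
one_line = ET.fromstring('<author born="1925" dead="1964"/>')
two_lines = ET.fromstring('<author born="1925"\n        dead="1964"/>')
same_attributes = one_line.attrib == two_lines.attrib

# Rule 2: spaces and line breaks in element content reach the application.
p = ET.fromstring("<p>  Two spaces first,\nthen two\nline breaks.</p>")
leading_spaces = len(p.text) - len(p.text.lstrip(" "))
line_breaks = p.text.count("\n")

print(same_attributes, leading_spaces, line_breaks)
```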
Predefined character entities
- Replace characters with special meaning in XML
- Must be used:
- &lt; means <
- &amp; means &
- Can be used:
- &gt; means >
- &quot; means "
- &apos; means '
- Unicode characters can be referenced by their numbers, for example &#169; (decimal) or &#xA9; (hexadecimal) for ©
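The sketch below shows the entities in both directions, assuming Python's xml.sax.saxutils.escape for writing and xml.etree.ElementTree for reading. The sample text and element names are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Writing: escape() replaces &, < and > with their entity references.
raw = "Witten & Bainbridge: pages < 500"
escaped = escape(raw)

# Reading: the parser turns the references back into characters.
round_trip = ET.fromstring(f"<note>{escaped}</note>").text

# All five predefined entities plus two numeric character references.
all_five = ET.fromstring(
    "<t>&lt; &amp; &gt; &quot; &apos; &#169; &#xA9;</t>"
).text

print(escaped)
print(round_trip)
print(all_five)
```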
Exercise
We will be working with digital book reviews in the exercises and the workshop. The first
task is to create an XML language for book reviews.
- Download at least three book reviews from different
newspapers with free online services. Some links:
- Create one well-formed XML document that stores all the
reviews
- Use XML Copy Editor to create and edit the XML document
- Identify similarities between the different reviews—"smooth out" the differences
- Create a semantically rich language
- Mark up all parts of the reviews
- Use logical markup rather than physical
- Verify that the document is well-formed