|
Abstract
XML stands for eXtensible Markup Language. XML was
designed to facilitate the sharing of data across different systems,
particularly systems connected via the Internet. XML is a "meta" language
because it is used for defining markup languages. XML is a subset of the
Standard Generalized Markup Language (SGML). A markup language defined using
SGML or XML has a specific vocabulary (labels for elements and attributes) and
a declared syntax (grammar defining the hierarchy and other features).
XML is the basic building block of Web services. All the Web services
technologies recommended by the WS-I Basic Profile 1.0 are built on XML and the
XML Schema Language.
XML Introduction
XML is a general-purpose markup language recommended by the W3C. XML was
originally envisioned as a language for defining new document formats for the
World Wide Web. XML was designed to facilitate the sharing of data across
different systems, particularly systems connected via the Internet. XML is a
subset of the Standard Generalized Markup Language (SGML) and thus can be
considered to be a meta-language: a language for defining markup languages.
XML and SGML
SGML is the Standard Generalized Markup Language (ISO 8879:1986), the
international standard for defining descriptions of the structure of different
types of electronic document.
SGML is very large, powerful, and complex. It has been in heavy industrial and
commercial use for nearly two decades, and there is a significant body of
expertise and software to go with it.
XML is a lightweight cut-down version of SGML which keeps enough of its
functionality to make it useful but removes all the optional features which
made SGML too complex to program for in a Web environment.
XML is a derivative of SGML and is more restrictive than SGML:
-
XML provides a dramatic improvement in the ease of writing programs that can
parse documents written in XML-derived markup languages.
-
XML greatly simplifies the task of creating custom markup languages that are
meaningful to one's own enterprise.
-
XML-derived markup languages are slightly less expressive than SGML-based
languages.
-
XML-derived markup languages are somewhat wordier than SGML-based languages.
-
XML-derived markup languages are less forgiving of syntactical variances
than SGML-based languages.
Markup Languages
A markup language is merely a set of conventions for denoting which parts of a
document should be treated differently from other parts.
Historically, that goal has been achieved for written documents by using
different styles for different parts of the document. For example, the title
can be printed or displayed centered on a page, in bold face
type, while the body of a document can be presented or displayed as a
continuous stream of text, separated from the title by one or more blank
"lines".
This technique works well when there's a human around to look at the
document, and when that human understands that a string of text in bold
letters at the top of a page should be understood to be the document title.
A machine, of course, has no such understanding. Someone must write a program
to parse the document and somehow identify which parts are the title and
which parts are the body text. Every new style requires a new program, or a new
fix to the old program. This scenario quickly becomes unmanageable, ruling out
all but the simplest operations on documents.
A markup language embeds special "tags" in text that help programs identify
the various parts of a document.
Here, for example, is a recipe marked up with a whimsical markup language:
A recipe written in the "Whimsical Markup Language":
title{Chocolate Cake}
ingredient{howmuch {3 tablespoons}what{Dark Chocolate}}
ingredient{howmuch {1 pound}what{Flour}}
ingredient{howmuch {2 cups}what{Milk}}
ingredient{howmuch {1 dozen}what{Eggs}}
instructions{Mix, beat, bake at 325 degrees for 40 minutes.}
Note that the marked-up document shown here is NOT in any official markup
language, such as LaTeX, rtf, or HTML. It is just an example of a document
whose component parts have been distinguished using a simple and obvious
syntax.
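To illustrate how explicit tags simplify the job of a program, here is a minimal Python sketch of a parser for the whimsical syntax above. The function names parse_items and parse_value are hypothetical, and the sketch assumes well-formed input in which braces never appear inside the text itself.

```python
RECIPE = """title{Chocolate Cake}
ingredient{howmuch {3 tablespoons}what{Dark Chocolate}}
ingredient{howmuch {1 pound}what{Flour}}
ingredient{howmuch {2 cups}what{Milk}}
ingredient{howmuch {1 dozen}what{Eggs}}
instructions{Mix, beat, bake at 325 degrees for 40 minutes.}"""


def parse_items(text, pos=0):
    """Parse a run of name{...} items starting at pos.

    Returns (items, new_pos), where items is a list of (name, value) pairs
    and value is either a string or a nested list of items.
    """
    items = []
    while pos < len(text):
        # Skip whitespace between items.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        # Stop at the end of input or at the brace closing the parent item.
        if pos >= len(text) or text[pos] == "}":
            break
        brace = text.index("{", pos)
        name = text[pos:brace].strip()
        value, pos = parse_value(text, brace + 1)
        items.append((name, value))
    return items, pos


def parse_value(text, pos):
    """Parse the content after '{' up to and including the matching '}'."""
    close = text.find("}", pos)
    nested = text.find("{", pos)
    if nested != -1 and nested < close:
        # An opening brace comes first, so the content is a list of items.
        children, pos = parse_items(text, pos)
        return children, pos + 1  # step over the closing brace
    return text[pos:close], close + 1


items, _ = parse_items(RECIPE)
for name, value in items:
    print(name, value)
```

Because every part of the recipe is labeled, the same short program works no matter how the recipe would be styled on a page.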
XML
A markup language is a mechanism to identify structures in a document. The XML
specification defines a standard way to add markup to documents. XML is a set
of abstract rules for building a markup language. XML is not a markup language
itself. XML was designed to describe data. XML does not predefine any tags;
you must define your own. XML uses a DTD (Document Type Definition) or an
XML Schema to describe the data. XML with a DTD or XML Schema is designed to
be self-descriptive.
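As a sketch of what "defining your own tags" can look like, the recipe from the previous section could be written in XML and read with Python's standard xml.etree.ElementTree module. The element names (recipe, title, ingredient, howmuch, what, instructions) are assumptions chosen for this illustration; XML itself does not prescribe them.

```python
import xml.etree.ElementTree as ET

# The tag names below are our own invention; no DTD or schema is used here.
recipe_xml = """<recipe>
  <title>Chocolate Cake</title>
  <ingredient><howmuch>3 tablespoons</howmuch><what>Dark Chocolate</what></ingredient>
  <ingredient><howmuch>1 pound</howmuch><what>Flour</what></ingredient>
  <ingredient><howmuch>2 cups</howmuch><what>Milk</what></ingredient>
  <ingredient><howmuch>1 dozen</howmuch><what>Eggs</what></ingredient>
  <instructions>Mix, beat, bake at 325 degrees for 40 minutes.</instructions>
</recipe>"""

root = ET.fromstring(recipe_xml)

# Look up parts of the document by tag name rather than by position or style.
title = root.findtext("title")
ingredients = [
    (item.findtext("howmuch"), item.findtext("what"))
    for item in root.findall("ingredient")
]

print(title)
for amount, name in ingredients:
    print(f"{amount} of {name}")
```

A generic XML parser handles the syntax, so the program only needs to know the vocabulary.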
Structure and Entities
XML data is represented and exchanged between software applications in units
called XML documents. An XML document is made up of declarations, elements,
attributes, text data, comments, and other components. Each of these components
will be described in more detail in other sections.
XML documents have both logical and physical structure. The logical structure
is simply the elements (and attributes) in the document and their order.
Logically, the document is composed of declarations, elements, comments,
character references, and processing instructions, all of which are indicated
in the document by explicit markup.
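The logical components listed above can be seen in a small document. The following Python sketch uses the standard xml.dom.minidom module; the document content, the processing instruction target "render", and the attribute "servings" are made up for this illustration.

```python
from xml.dom import minidom

doc_text = """<?xml version="1.0"?>
<!-- A comment at the top level -->
<?render style="plain"?>
<recipe servings="8"><title>Cr&#232;me Cake</title></recipe>"""

dom = minidom.parseString(doc_text)

# Top-level nodes in document order: a comment, a processing instruction,
# and the root element. The XML declaration is not exposed as a node.
kinds = [node.nodeName for node in dom.childNodes]

root = dom.documentElement
servings = root.getAttribute("servings")

# The character reference &#232; is resolved by the parser.
title_node = root.getElementsByTagName("title")[0]
title_text = "".join(child.data for child in title_node.childNodes)

print(kinds)
print(servings, title_text)
```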
XML documents use storage units called entities to arrange physical structures
to produce a logical structure. Entities define blocks of text for reuse in
documents or in DTDs, and include data from other storage units (such as
files). Every entity is either internal or external. An internal entity is
defined in a document's prolog (along with or within the DTD), and is not
associated with any external file or data source. An external entity is also
defined in the prolog, but depends on some external file or data source. Other
characteristics also determine an entity's type, such as whether it is parsed or
unparsed and whether it is a general entity or a parameter entity.
Each XML document has a special text entity called the document entity or
root entity. All entities referred to directly or indirectly from the
root entity are regarded as parts of the physical structure of the document.
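A minimal sketch of an internal entity, assuming Python's standard xml.etree.ElementTree module: the entity name "company" and its replacement text are invented for this example. The entity is declared in the document's prolog and the parser substitutes its text wherever the reference &company; appears.

```python
import xml.etree.ElementTree as ET

# "company" is a hypothetical internal general entity declared in the prolog.
doc_text = """<!DOCTYPE memo [
  <!ENTITY company "Example Corp">
]>
<memo>Prepared by &company; for internal use.</memo>"""

memo = ET.fromstring(doc_text)

# The reference has been replaced by the entity's text.
print(memo.text)
```

External entities, which pull in data from other files, are handled differently from parser to parser and are not covered by this sketch.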
The developers of XML introduced a distinction between "well-formed" documents
(which followed the XML syntax) and "valid" documents (whose markup followed a
particular language developed from XML). The concept of a merely
"well-formed" document greatly eased the burden on the document writer, and
may be the single most important reason for XML's acceptance. A well-formed
XML document is one from which an XML Processor can successfully build a tree
structure.
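A well-formedness check can be sketched in a few lines of Python, assuming the standard xml.etree.ElementTree module. The helper name is_well_formed is hypothetical. This parser is non-validating, so the sketch tests only whether a tree can be built, not whether the document is valid against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser can build a tree from text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested elements: a tree can be built.
print(is_well_formed("<cake><layer/></cake>"))

# <layer> is never closed, so no tree can be built.
print(is_well_formed("<cake><layer></cake>"))
```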
|