
XML Frequently Asked Questions.

Table of Contents

  1. What is XML?
  2. What is meant by markup and markup languages?
  3. How is XML related to SGML?
  4. Who needs XML?
  5. Is XML difficult?
  6. What are the rules for writing an XML document?
  7. What is the difference between a tag and an element, and what is an empty tag?
  8. What are the rules as to how a tag must be written?
  9. What are the rules as to how an attribute must be written?
  10. How do you write comments in XML?
  11. What is a CDATA section in XML?
  12. What is a Processing Instruction in XML?
  13. For More Information

  1. What is XML?

    XML is short for eXtensible Markup Language, and it is really a set of rules for writing a markup language.
    Any markup language that conforms to the rules of XML is known as an 'application' of XML.
    Here is an example of an XML document:

    <greeting>Hello XML</greeting>

    XML uses angled brackets to designate opening tags and closing tags that contain content. The tags may contain attributes and their values.

    These tags are of the form:

    <[tagname] [attribute name]="[attribute value]">

    Here is an example of an opening tag with an attribute and its value:

    <greeting manner="cordial">

    Every tag must have a closing tag of the form:

    </[tagname]>

    Here is an example of the closing tag:

    </greeting>
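
    To see how the pieces fit together, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The choice of Python and of this parser is mine, not part of XML itself; any XML parser would report the same three pieces.

```python
import xml.etree.ElementTree as ET

# The opening tag, content, and closing tag from the examples above,
# joined into one complete element.
root = ET.fromstring('<greeting manner="cordial">Hello XML</greeting>')

tag_name = root.tag             # the name shared by the opening and closing tags
manner = root.attrib["manner"]  # the attribute value, without its quotes
content = root.text             # the text between the two tags

print(tag_name, manner, content)
```

    The parser hands back the tag name, the attribute value, and the content separately, which is exactly the information the angled brackets were there to mark.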

    Unlike HTML, ALL attribute values must be quoted with either single or double quotes. The quotes must match.
    Also unlike HTML, the tags are case sensitive, i.e. <TAG/>, <tag/>, and <Tag /> are all different.

    If a tag does not have a closing tag, i.e. if it is an empty tag similar to the <IMG> tag in HTML, then it must take the special form:

    <[empty tag name]/>

    Note the forward slash just before the closing angle bracket. Here is an example of an empty tag in XML:

    <image href="mypic.jpg"/>

    Like HTML, XML can contain comments, and the syntax for comments is the same as in HTML.

    <!--This is a comment in both HTML and XML-->

    XML really describes a 'grammar' in which we can write our own markup language. It is similar to SGML, and indeed it is 100% compatible with SGML.

    HTML is a markup language written according to the rules of SGML. It is an application of SGML.


  2. What is meant by markup and markup languages?

    A markup language is the set of rules, the grammar, and syntax that tells how a language which marks up documents should be "spoken". SGML is a markup language, and HTML is the vocabulary of a particular dialect of that language, albeit a very widely spoken dialect. HTML follows the rules of SGML.

    XML is also a markup language, with a grammar that is based on, but substantially simpler than, that of SGML.

    Markup is the set of symbolic tags used to indicate that something needs to be done to the text. The <B></B> pair is markup in HTML. In XML and SGML it corresponds to the tags.

    Markup can take one of three forms: semantic, stylistic, or structural.

    Semantic markup gives information about the text it is marking up, e.g. in the element <hamlet>To be or not to be...</hamlet> the tag hamlet tells us that the words are being spoken by Hamlet. In HTML, the element <CODE>For i= 0 to ubound(chapterArray)</CODE> tells us that the enclosed text is code.

    Stylistic markup tells us about the style that should be used to display a document item.

    In HTML the element

    <I>This is italic text</I>

    tells us that the enclosed text should be displayed in italics.

    Structural markup tells us something about the structure of a document. Again in HTML, <P> tells us that the text that follows, up to the next similar tag, is a paragraph and should be treated as such.

    The XML equivalent of this could be:

    <para>the text that occurs......</para>
    <P>Is a structural markup.</P>

    The old editor's notations of "dele" and "stet", beloved of crossword fans, are structural markup.


  3. How is XML related to SGML?

    SGML

    If it wasn't for HTML, hardly anyone would have heard of SGML (Standard Generalized Markup Language), although it has been an international standard since 1986. It is really a document that lays down rules on how to describe a set of markup tags. HTML is its best-known product. It has, however, been used with great success to manipulate large bodies of documents, and relies on the fact that a document marked up according to the rules of SGML can be widely understood on a variety of platforms.

    Its great strength is that it allows the use of semantic tagging, which can accurately describe a document's content.

    Its chief drawback is its complexity, which makes it difficult for the occasional user and also makes it difficult to write SGML-compatible software.

    XML

    XML (Extensible Markup Language) is a recent language that is 100% compatible with SGML. It has been designed by the W3C as a version of SGML suitable for use over the Internet. It is still very much in the developmental stage, although the specification for the language proper is quite established. There is still much work to be done on the form of linking (XLL) and on the form of style sheet to use with it. Originally a simplified version of DSSSL called DSSSL-o or XS (extensible styling) was to be used, but both of these are horribly complicated. Currently it appears that CSS will be used for everyday declarative styling and XSL will be used when more powerful document manipulation is required.

    Most people who have been exposed to this language are wildly enthusiastic about it. It has nearly all the power of SGML with none of the difficulty.


  4. Who needs XML?

    Everyone who needs to send documents over the Internet containing information that needs to be manipulated in various ways. (You still make your cool display pages using HTML!!)

    XML allows us to markup a document with a set of tags of our own devising.

    Markup can be of three sorts:-

    Stylistic Markup:-

    Tells how the document is to be styled. The <I>, <B>, and <U> tags are all stylistic markup in HTML.

    Structural Markup:-

    Tells how the document is to be structured. The <H*>, <P>, and <DIV> tags are examples of structural markup.

    Semantic Markup:-

    Tells us something about the content of the text. <TITLE> and <CODE> are examples of semantic markup in HTML.

    HTML has proven very adept at preparing documents for display over the web, but a document marked up in HTML tells us very little about its content, and for most documents to be useful in a business situation there is a need to know about the document's content.

    As an example, if a patient's medical records were marked up in HTML, and I as a doctor wanted to find out about the patient's allergies, at present I would have to download the whole record (several K), and then do a manual search through that document.

    If, however, the patient's records were marked up in XML and one of the tags was <allergies>, I could just send a request to the server for that part of the document, and receive a few bytes of information instead of hundreds of kilobytes.
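
    As a rough sketch of what the server side of such a request might do, here is a small Python example using the standard xml.etree.ElementTree module. The record layout is entirely hypothetical: only the <allergies> tag comes from the text above, and the other element names, the sample values, and the function name extract_allergies are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record. Only <allergies> is taken from the FAQ text;
# every other name and value here is made up for the example.
RECORD = """\
<patient>
    <name>J. Doe</name>
    <allergies>penicillin, latex</allergies>
    <history>Several kilobytes of notes would normally go here.</history>
</patient>
"""

def extract_allergies(record_xml: str) -> str:
    """Return only the text of the <allergies> element, or '' if absent."""
    root = ET.fromstring(record_xml)
    element = root.find("allergies")
    return element.text if element is not None else ""

print(extract_allergies(RECORD))
```

    Because the tag says what the content is, the program can pick out that one part without any guessing about layout or wording.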

    Using the same example of patient records, what if we wanted someone to have access to some parts of the records, but not others? (Would you really want everyone at the insurance office reading the notes that your shrink may have written about you?) Then you could instruct the server to withhold certain parts of the document, i.e. in the above example anything marked up <psych.-note> or <confidential>.
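
    Withholding marked-up parts could be sketched along the following lines in Python. This is only an illustration under my own assumptions: the record, the function name redact, and the idea of doing the filtering with xml.etree.ElementTree are all invented here, and a real system would need proper access control as well.

```python
import xml.etree.ElementTree as ET

# Tags to withhold, taken from the example in the text above.
WITHHELD = {"psych.-note", "confidential"}

# Hypothetical record, invented for this sketch.
RECORD = (
    "<patient>"
    "<name>J. Doe</name>"
    "<allergies>penicillin</allergies>"
    "<psych.-note>Private session notes.</psych.-note>"
    "<billing><confidential>Internal remark.</confidential>"
    "<total>120</total></billing>"
    "</patient>"
)

def redact(record_xml: str, withheld=WITHHELD) -> str:
    """Return the record with every withheld element removed."""
    root = ET.fromstring(record_xml)
    # Visit each element and drop any child whose tag is on the list.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in withheld:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")

print(redact(RECORD))
```

    Note that the root element itself is never removed in this sketch; only elements nested inside it are filtered.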

    Thus the ability of individuals, groups of individuals, and institutions to write their own markup language will expedite information transfer and provide other benefits such as confidentiality.

    More recently it has become obvious that XML can replace proprietary binary formats in databases, and thus make the old dream of the true interchangeability of data across applications and platforms a reality. XML is also being used to write many of the new language specs. It has become the de facto language of the World Wide Web Consortium (the body that 'governs' HTML).


  5. Is XML difficult?

    No! XML was designed to be easy. The official specification is a mere 40 pages (download it from http://www.w3.org/TR/REC-xml) and is written in (almost) readable language. (They use EBNF notation to describe the keywords. Read section 6, the last section, of the specification first!!)

    Anyone with a basic understanding of HTML can be writing XML documents in no time at all.


  6. What are the rules for writing an XML document?

    XML documents come in two flavors, the valid document and the well formed document. Every valid document is well formed, but not every well formed document is valid.

    A Well Formed Document

    A well formed document must follow three very simple rules.

    1. It must contain at least one element.
    2. There must be a unique opening and closing tag, which contains the whole document. This forms the ROOT element.
    3. All the tags must be correctly nested and must match. (Note that XML is case sensitive: <tag> is not the same as <Tag>.)

    In addition all the tags and attributes must conform to the rules for writing tags, and all the values of the attributes must be quoted.

    Here are some examples of some well formed documents.

    	<greeting>Hello World!</greeting>
    

    The above example follows all the rules for a well formed document.

    	<greeting manner="cordial">Hello World!</greeting>
    

    We have given the 'greeting' element an attribute. Note how the value is quoted. Single quotes could also be used, but the quotes must match.

    <xdoc>
    	<greeting>Hello World!</greeting>
    	<greeting>Hello XML!</greeting>
    </xdoc>
    

    Note that for there to be a unique opening and closing tag we have had to add the xdoc tag.

    <xdoc>
    	<greeting>Hello World!</greeting>
    	<greeting><emphasis>Hello XML!</emphasis></greeting>
    </xdoc>
    

    Note how the 'emphasis' element is nested in (i.e. completely enclosed within) the greeting element.
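
    Nesting is easier to see as a tree. The following Python sketch (my own illustration, using the standard xml.etree.ElementTree module; the helper name outline is invented) walks the last example and records how deep each element sits.

```python
import xml.etree.ElementTree as ET

DOC = """<xdoc>
    <greeting>Hello World!</greeting>
    <greeting><emphasis>Hello XML!</emphasis></greeting>
</xdoc>"""

def outline(element, depth=0):
    """Return (depth, tag) pairs for the element and everything nested in it."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

# Print each tag name indented according to its nesting depth.
for depth, tag in outline(ET.fromstring(DOC)):
    print("  " * depth + tag)
```

    The root element xdoc sits at depth 0, both greeting elements at depth 1, and emphasis at depth 2, inside the second greeting.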

    The following examples are NOT well formed documents. See if you can figure out why. The answers are given at the end of this question.

    Bad example #1
    	<greeting>Hello World!</Greeting>
    
    Bad example #2
    	<greeting manner=cordial>Hello World!</greeting>
    
    Bad example #3
    	<greeting manner="cordial'>Hello World!</greeting>
    
    Bad example #4
    
    	<greeting>Hello World!</greeting>
    	<greeting>Hello XML!</greeting>
    
    
    Bad example #5
    <xdoc>
    	<greeting>Hello World!</greeting>
    	<emphasis><greeting>Hello XML!</emphasis></greeting>
    </xdoc>
    

    A Valid Document

    A valid document must be well formed, and it must also conform to its DTD (Document Type Definition). This is a set of rules describing how the document must be laid out. The DTD (if present) is either written or referenced in the PROLOG of the XML document.

    Answers to Bad examples
    • 1. The tags don't match. The closing greeting tag begins with an upper case G. XML is case sensitive.
    • 2. The value of the attribute 'manner', cordial, is not quoted.
    • 3. The value is now quoted, but the quotes don't match.
    • 4. There is no unique opening and closing tag for the document.
    • 5. The elements do not nest. 'emphasis' and 'greeting' overlap.
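
    You do not have to check well-formedness by eye. As a minimal sketch, assuming Python and its standard xml.etree.ElementTree parser (my choice of tool; the function name is_well_formed is invented), the five bad examples above can be fed to a parser, which rejects each one.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as a well formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# The five bad examples from this question, in order.
BAD_EXAMPLES = [
    '<greeting>Hello World!</Greeting>',
    '<greeting manner=cordial>Hello World!</greeting>',
    '<greeting manner="cordial\'>Hello World!</greeting>',
    '<greeting>Hello World!</greeting>\n<greeting>Hello XML!</greeting>',
    '<xdoc>\n<greeting>Hello World!</greeting>\n'
    '<emphasis><greeting>Hello XML!</emphasis></greeting>\n</xdoc>',
]

results = [is_well_formed(doc) for doc in BAD_EXAMPLES]
print(results)
```

    This only tests whether a document is well formed. It says nothing about validity, since this parser does not check a document against a DTD.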


  7. What is the difference between a tag and an element, and what is an empty tag?

    Tags and Elements.

    These two words are NOT inter-changeable. In XML a tag is what is written between angled brackets e.g. <atag>. This is an example of an opening tag. In XML all opening tags must have closing tags of the form </atag>. The way the <P> tag is used in HTML is illegal in XML. In XML an opening <P> tag requires a closing tag </P>.

    An element is an opening and a closing tag and what comes in between.

    <greeting>Hello XML!!</greeting >

    is an element.

    Empty tags must be in a special format, namely <emptytag/> (note where the forward slash is), or else you are allowed to write <emptytag></emptytag>. The <IMG> tag is illegal in XML. (However, try using the legal form <IMG/> in your HTML browser; if it's like mine it will probably revolt!)
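
    The two ways of writing an empty element mean the same thing to a parser. Here is a small sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_empty is my own.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<image href="mypic.jpg"/>')
long_form = ET.fromstring('<image href="mypic.jpg"></image>')

def is_empty(element) -> bool:
    """True if the element has no child elements and no text content."""
    return len(element) == 0 and not element.text

# Both spellings give an element with the same name and attributes.
same = (short_form.tag == long_form.tag
        and short_form.attrib == long_form.attrib)

print(is_empty(short_form), is_empty(long_form), same)
```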

    Use a convention. I put HTML tags in upper case, XML tags in lower case. (This convention is becoming quite widespread.)

    XML is case sensitive, i.e. <Atag>, <atag>, and <ATAG> are three different kinds of tags.


  8. What are the rules as to how a tag must be written?

    XML is case sensitive, i.e. <Atag>, <atag>, and <ATAG> are three different kinds of tags.

    A tag name must start with a letter (a-z, A-Z) or an underscore (_) and can contain letters, digits 0-9, the period (.), the underscore (_) or the hyphen (-). White space is not allowed, nor other markup.

    (Actually a tag name can also contain other Unicode characters, but this is advanced stuff that will not be covered in this FAQ.)

    The colon (:) is reserved for experimental use, and although it is legal at present it may acquire special meaning in the future, so don't use it. (For those interested, its main use is in namespaces and reserved keywords.)

    No name can begin with the sequence "xml", in any combination of upper and lower case. This sequence is reserved for use by the standardization forum.

    Your tags should have semantic meaning, otherwise why bother to use XML!!

    With these few simple rules and conventions in mind go ahead and make tags that describe your document!!
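
    The naming rules in this answer can be turned into a quick check. The Python sketch below covers only the simple ASCII subset described here (it deliberately ignores the other Unicode characters the specification allows, and it treats the colon as forbidden, following the advice above). The function name is_simple_tag_name is invented for illustration.

```python
import re

# First character: a letter or underscore.
# Remaining characters: letters, digits, period, underscore, or hyphen.
_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")

def is_simple_tag_name(name: str) -> bool:
    """Check a name against the ASCII-only rules given in this FAQ."""
    if not _SIMPLE_NAME.match(name):
        return False  # bad first character, white space, colon, and so on
    if name.lower().startswith("xml"):
        return False  # names beginning with "xml" in any case are reserved
    return True

for candidate in ["greeting", "_note", "psych.-note", "2nd", "my tag", "xmlThing", "a:b"]:
    print(candidate, is_simple_tag_name(candidate))
```

    A real parser applies the full rules from the specification, so treat this as a way of remembering the rules rather than a replacement for parsing.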


  9. What are the rules as to how an attribute must be written?

    Attributes.

    Tags can contain attributes. The example you are probably most familiar with is the <IMG> tag, e.g. <IMG ALT="smileyface" SRC="smiley.gif" VSPACE=75>

    In XML an attribute takes the following general form:

    attribute="value"

    Note there must be an equals sign and a value, and the value must be quoted, so the VSPACE attribute above would have to be VSPACE="75" to be legal in XML. Also, in HTML some tags can take an attribute without a value, such as <UL COMPACT>. This too would be illegal in XML; you must give an attribute a quoted value such as <UL COMPACT="anything">, and even <UL COMPACT=""> would do.

    Attributes have to follow these rules.

    • Attribute names follow the same rules as tag names (see above). Attribute values may contain other characters, but they cannot contain "<" or an unescaped "&".


    • As already mentioned all values must be quoted.


    • Attributes can only appear in start tags and empty element tags.


    • No attribute may appear more than once in the same start tag.


    • All attributes must be declared in the DTD if present, and their value must be of the correct type. (see below)


    • An attribute cannot contain a reference to an external entity. (See under entity later on.)
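
    Several of these attribute rules are enforced by any XML parser, so they are easy to try out. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name parses and the sample tags are my own, written as empty elements so that each is a complete document.

```python
import xml.etree.ElementTree as ET

def parses(text: str) -> bool:
    """Return True if the parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

checks = {
    "quoted value": parses('<UL COMPACT="anything"/>'),
    "empty quoted value": parses('<UL COMPACT=""/>'),
    "single quotes": parses("<IMG VSPACE='75'/>"),
    "unquoted value": parses('<IMG VSPACE=75/>'),
    "attribute without a value": parses('<UL COMPACT/>'),
    "same attribute twice": parses('<IMG ALT="a" ALT="b"/>'),
    "raw < in a value": parses('<IMG ALT="a<b"/>'),
    "escaped < in a value": parses('<IMG ALT="a&lt;b"/>'),
}

for rule, accepted in checks.items():
    print(rule, "->", "accepted" if accepted else "rejected")
```

    The rule about declaring attributes in the DTD is not checked here, because this parser does not validate.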


  10. How do you write comments in XML?

    XML Comments.

    XML comments are written the same way as HTML comments. i.e.

    <!--this is a comment-->.

    The XML processor is not required to pass comments on to the user agent, i.e. the piece of software that is converting the document into something useful. To escape blocks of text containing markup, XML provides CDATA sections instead.
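
    As one illustration of a processor that does not pass comments on, here is a short Python sketch. It assumes the standard xml.etree.ElementTree module with its default settings, which leave comments out of the tree they build; other parsers, or other settings, may keep them.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<greeting><!--this is a comment-->Hello XML</greeting>')

# Collect all the text the application gets to see.
visible_text = "".join(root.itertext())

print(visible_text)
```

    The comment was legal and the document parsed, but the text of the comment never reaches the program using the parser.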


  11. What is a CDATA section in XML?

    CDATA is short for Character DATA. CDATA sections allow us to escape blocks of text containing mark up.

    CDATA sections take the general form:

    <![CDATA[....put text containing markup here...]]>
    

    For example, suppose I wanted to print out the following line of text, as would be quite common if I was writing a book on HTML or XML:

    "The left angled bracket '<' and the ampersand '&' must be replaced by their entities &lt; and &amp; respectively".

    If I was writing this in HTML I would have to put:

    "The left angled bracket '&lt;' 
    and the ampersand '&amp;' 
    must be replaced by their 
    entities &amp;lt; and &amp;amp; respectively".
    
    
    

    By escaping the text using CDATA, I could simply write

        <![CDATA["The left angled bracket '<' 
    	and the ampersand '&' must be replaced 
    	by their entities &lt; and &amp; respectively".]]>
    
    And the text including the markup would be displayed. Obviously the CDATA escape section could not include the sequence ']]>', but it could include any other kind of markup, which would not be interpreted by the browser.
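
    To check that the two ways of writing the sentence really give the same text, here is a small sketch in Python using the standard library. The wrapping <para> element is my own addition, needed because a CDATA section has to sit inside an element; escape() is the standard xml.sax.saxutils helper that replaces "&", "<", and ">" with their entities.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SENTENCE = ("The left angled bracket '<' and the ampersand '&' must be "
            "replaced by their entities &lt; and &amp; respectively")

# Version 1: the raw sentence wrapped in a CDATA section.
with_cdata = ET.fromstring("<para><![CDATA[" + SENTENCE + "]]></para>")

# Version 2: the same sentence with every '<' and '&' escaped by hand.
with_entities = ET.fromstring("<para>" + escape(SENTENCE) + "</para>")

print(with_cdata.text == with_entities.text)
```

    Both versions hand the application the same characters; the CDATA form is simply easier to write and to read.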


  12. What is a Processing Instruction in XML?

    Processing instructions

    Processing instructions take the form

         <?this is a processing instruction?>
         
    

    Processing instructions cannot start with any form of the string "xml"; this is reserved for the XML version declaration processing instruction, which we will look at in another FAQ.

         <?xml version="1.0" encoding="UTF-8"?>
    

    They can occur anywhere in the document and contain information that the processor must pass on to the user agent. The version declaration is an example of a processing instruction.
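
    Here is a minimal sketch of a processor passing a processing instruction on to an application. It assumes Python's standard xml.dom.minidom module, which keeps processing instructions as nodes in the tree it builds. The instruction name "render" and its contents are invented for illustration and mean nothing to any real software.

```python
from xml.dom import minidom, Node

# A small document with one made-up processing instruction inside it.
DOC = (b'<?xml version="1.0" encoding="UTF-8"?>'
       b'<doc><?render mode="draft"?><para>text</para></doc>')

dom = minidom.parseString(DOC)

# Collect (target, data) for each processing instruction directly
# inside the root element.
instructions = [
    (node.target, node.data)
    for node in dom.documentElement.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

    The first word of the instruction is reported as its target and everything after it as its data; what to do with that data is up to the application.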


  13. For More Information
    • The XML Spec can be found at the W3C web site.
    • More information can be found at the author's web site.
    • Much of the material in this FAQ is quoted, with permission, from Professional Style Sheets for XML and HTML by Wrox Press.


Last updated by F.B. 3:09 PM 9/9/98



This page is maintained by bckman@ix.netcom.com. Last updated on 14 September 1998.
Copyright © 1998 by the HTML Writers Guild, Inc.