The HTML Writers Guild

Project Gutenberg
Marking up documents in XHTML III

This page walks you through marking up "The Voyage of the Beagle" by Charles Darwin. You may want to download the e-text and follow along. Here are some general tips, all of which are really common sense.

Marking up vbgle10.txt

Download the document and save it. Make sure you have your text editor and XML parser handy. Here the operation is described using EditPad and IE5 (see "Tools of the Trade").

Step 1. Save as XML

Open the text file vbgle10.txt in your text editor and save it as vbgle10.xml.

We will want to check at regular intervals that our document is well-formed. The best way to do this is to load it with an XML MIME type rather than an HTML MIME type. Error checking is much more thorough in an XML browser, and we don't want any errors! To check for well-formedness errors we just have to open the document in IE5, and the check is carried out automatically.
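If IE5 is not to hand, the same check can be scripted. The following is a minimal sketch in Python, not part of the workflow described on this page. It assumes Python's standard expat bindings, and the function name check_well_formed is purely illustrative.

```python
import xml.parsers.expat


def check_well_formed(data: bytes):
    """Return None if data is well-formed XML, else (line, column, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final (and only) chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return (err.lineno, err.offset, message)
    return None


# Hypothetical usage, assuming vbgle10.xml is in the current directory:
#     with open("vbgle10.xml", "rb") as handle:
#         print(check_well_formed(handle.read()))
```

Like IE5, the parser stops at the first error it meets, so the check has to be repeated after each fix.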

Step 2. Remove & and < characters

Search for & and < characters. Replace these with their entities, &amp; and &lt;.

The & and < characters have special meaning in XML, so we need to replace them with their entities, &amp; and &lt;. Replace the & characters first: if you replace < first, the & in each new &lt; will itself be replaced on the second pass. The easiest way to do this is with EditPad's search and replace function.
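For readers who would rather script this step than use a text editor, here is a small sketch in Python. The function name escape_markup is an assumption made for illustration; only the two replacements themselves come from the step above.

```python
def escape_markup(text: str) -> str:
    """Replace the two characters that are special in XML with their entities."""
    # & must be handled first; otherwise the & in each new "&lt;"
    # would be escaped a second time and become "&amp;lt;".
    return text.replace("&", "&amp;").replace("<", "&lt;")


print(escape_markup("Fitz Roy & Darwin sailed < 5 years"))
```

Running the function twice on the same text would escape the entities again, so it should be applied once, to the raw e-text only.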

Step 3. Initial Markup I

Look for consecutive new lines (use \n\n in Edit text) and replace them with \n</p>\n<p>\n

This will divide the document up into various sections. You may find that there are 'empty' <p></p> elements. Find these with the search and replace function, and get rid of them. Of course, the document will now start with a </p> and end with a <p>, so we need to fix this:

Move the final <p> to the beginning.

This is not yet a well-formed XML document, because there is no root element that encloses the whole document (<html> in XHTML), so we need to fix this now.
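The paragraph markup of Step 3 can also be produced in a single scripted pass. The sketch below is an alternative to the search-and-replace method, not a transcription of it: it splits the text on blank lines and wraps each non-empty block, so there are no empty <p></p> elements to delete and no stray tag to move. The function name wrap_paragraphs is hypothetical, and the text is assumed to have been escaped already as in Step 2.

```python
import re


def wrap_paragraphs(text: str) -> str:
    """Wrap every block of text separated by blank lines in <p>...</p>."""
    # Normalise Windows line endings so "\n" is the only line break.
    text = text.replace("\r\n", "\n")
    # A paragraph break is a newline, optional whitespace, and another newline.
    blocks = re.split(r"\n\s*\n", text)
    # Dropping whitespace-only blocks avoids producing empty <p></p> elements.
    paragraphs = [block.strip("\n") for block in blocks if block.strip()]
    return "\n".join(f"<p>\n{p}\n</p>" for p in paragraphs)


print(wrap_paragraphs("PREFACE\n\nI have stated in the preface\nto the first Edition\n"))
```

The result still has no root element, so it is not yet well-formed on its own.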

Step 4. Initial Markup II

Add the following to the front of the text:

 <html>
  <head>
   <title>HWG Gutenberg The Voyage of the Beagle by Darwin
  </title>
  </head>
  <body>
 

and add this to the end

  </body>
 </html>

We should now have a well-formed document (it is not yet valid XHTML, since it has no DOCTYPE declaration). Open it in IE5 as an XML file to test for well-formedness.

If there are any errors, fix them (there shouldn't be any at this stage!). IE5 will tell you the nature of the error and the line number where the error became apparent. Remember that this may not be the line that contains the error; the error may be several lines back!
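To illustrate why the reported line can lie well after the real mistake, here is a sketch in Python that adds the wrapper from this step and parses the result with the standard ElementTree module. The names HEAD, TAIL, wrap_document and first_error are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The wrapper from this step; the <body> start tag is on line 5 of the output.
HEAD = """<html>
 <head>
  <title>HWG Gutenberg The Voyage of the Beagle by Darwin</title>
 </head>
 <body>
"""
TAIL = """
 </body>
</html>
"""


def wrap_document(body: str) -> str:
    """Place the marked-up paragraphs inside the html/head/body wrapper."""
    return HEAD + body + TAIL


def first_error(document: str):
    """Return (line, column) of the first parse error, or None if there is none."""
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        return err.position
    return None


# The first <p> (line 6 of the wrapped document) is never closed, but the
# parser only notices when it reaches </body> several lines later.
broken = wrap_document("<p>\nhello\n<p>\nworld\n</p>")
print(first_error(broken))
```

The unclosed element opens near the top of the body, yet the position returned belongs to the closing tag where the mismatch finally surfaced, which is the behaviour the paragraph above warns about.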

Other tools

Many proprietary HTML editors can save a text document as HTML. We could then use HTML Tidy to convert the document to XML. However, many of these tools have difficulty handling large text files (for example, vbgle10.txt freezes Word 97), and there is often so much 'junk' in the document that needs to be pruned out that it is not worth the effort. Still, by all means go ahead and experiment using your favorite HTML editor!

Summary of steps

Step1: Open the text file vbgle10.txt in your text editor and save 
it as vbgle10.xml

step2: Search for & and < characters. Replace these with 
their entities, &amp; and &lt;

step3: Look for consecutive new lines (use \n\n in Edit text) and replace 
       them with \n</p>\n<p>\n

step4: Move the final <p> to the beginning

Step5: Add the following to the front of the text
 <html>
  <head>
   <title>HWG Gutenberg The Voyage of the Beagle by Darwin
  </title>
  </head>
  <body>
  <p>
  Project Gutenberg's Etext of The Voyage of the Beagle by Darwin
  #1 in our series by Charles Darwin

and add this to the end

End of Project Gutenberg's Etext of The Voyage of the Beagle by Darwin
  </p>
  </body>
 </html>

We should now have a well-formed document (it is not yet valid XHTML, 
since it has no DOCTYPE declaration). Open it in IE5 as an XML file to 
test for well-formedness

If there are any errors, fix them (there shouldn't be any at this stage!)

step6: Add the major division classes
  <div class="gutblurb">
  <div class="revhist">
  <div class="book">
  <div class="frontmatter">
  <div class="bookbody">
  <div class="backmatter">
  <div class="endgutblurb">


As you add each one, check the document for well-formedness in IE5
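As a hedged aid while adding the divisions of step6, the sketch below lists which of the major classes have not yet appeared in the document. It is written in Python with the standard ElementTree module; the names MAJOR_CLASSES and missing_divisions are hypothetical, and the document is assumed to have no xmlns attribute on <html> (with a namespace, the tag name passed to iter would need the namespace prefix).

```python
import xml.etree.ElementTree as ET

# The major division classes listed in step6, in the order given there.
MAJOR_CLASSES = [
    "gutblurb", "revhist", "book", "frontmatter",
    "bookbody", "backmatter", "endgutblurb",
]


def missing_divisions(document: str):
    """Return the major classes that no <div> in the document carries yet."""
    # fromstring raises ParseError if the document is not well-formed,
    # so this doubles as a well-formedness check.
    root = ET.fromstring(document)
    found = {div.get("class") for div in root.iter("div")}
    return [name for name in MAJOR_CLASSES if name not in found]
```

An empty list means every major division is present; it says nothing about whether the divisions are nested sensibly.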


stepx: A quick perusal shows that each chapter contains the word 
CHAPTER *. Use the 'find' function of your text editor to isolate 
the start of each chapter and mark it up. We suggest that you first 
change the 'p' elements to something more suitable, and then you 
add the 'div class=' elements. The last one to add is the 
class="chapter". Once this is added, proceed to the next chapter, 
making sure that you add the closing tag.

Here is what the markup looks like after we have finished marking 
up the beginning of chapter 1

  <div class="bookbody">
  <div class="chapter">
  <div class="chapnumber">
  <h2>
   CHAPTER I
  </h2>
  </div><!--end of class="chapnumber"-->


  <div class="chaptitle">
  <h3>
   ST. JAGO -- CAPE DE VERD ISLANDS
  </h3>
  </div><!--end of class="chaptitle"-->
  <div class="chapsummary">
  <p>
   Porto Praya -- Ribeira Grande -- Atmospheric Dust with
   Infusoria -- Habits of a Sea-slug and Cuttle-fish -- St.
   Paul's Rocks, non-volcanic -- Singular Incrustations --
   Insects the first Colonists of Islands -- Fernando Noronha --
   Bahia -- Burnished Rocks -- Habits of a Diodon -- Pelagic
   Confervae and Infusoria -- Causes of discoloured Sea.
  </p>
  </div><!--end of class="chapsummary"-->


Check for well-formedness by loading it into IE5 before proceeding!
Note that we have added an additional class, chapsummary. We need 
to record this.
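The 'find' pass over the chapter headings described in this step can be sketched in Python as below. The sketch assumes that each heading sits on a line of its own and consists of the word CHAPTER followed by a Roman numeral, as in the CHAPTER I example; check this against the e-text before relying on it. The names CHAPTER_LINE and find_chapters are made up for the example.

```python
import re

# Assumption: a heading line is "CHAPTER", whitespace, and a Roman numeral,
# with nothing else on the line apart from leading or trailing blanks.
CHAPTER_LINE = re.compile(r"^[ \t]*CHAPTER[ \t]+([IVXLC]+)[ \t]*$", re.MULTILINE)


def find_chapters(text: str):
    """Return a list of (line_number, numeral) for each chapter heading line."""
    results = []
    for match in CHAPTER_LINE.finditer(text):
        # Line numbers are 1-based, to match what a text editor displays.
        line_number = text.count("\n", 0, match.start(1)) + 1
        results.append((line_number, match.group(1)))
    return results
```

The list gives the places to visit in the editor; the markup itself is still best added by hand, checking each chapter as you go.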

stepx: 

We decide to add a new class 'chapsummary' to enclose the summary 
at the beginning of each chapter. We must thus include it in the 
revhist section as below.

  <div class="revhist">
  <pre>
    Initial Marker Frank Boumphrey
    email frank@hwg.org
    Date: 1/14/00
    new classes
      chapsummary: the summary of a chapter coming after the title
  </pre>
</div>

stepx: You may find it easier (I do) to use the search function to 
step through all the chapters and mark up each piece individually, 
i.e. first do all the class="chapter", then all the class="chapnumber", etc.


This page is maintained by frank@hwg.org. Last updated on 7 February 2000.
Copyright © 2000 by the HTML Writers Guild, Inc.