HTML for C++ Standards Documents

ISO/IEC JTC1 SC22 WG21 N3325 = 12-0015 - 2012-01-15

Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org

Introduction
    Issues
    Approaches
Good HTML
    HTML Standard Compliance
    Semantic Font Markup
    Semantic Block Tags
    Quoted Text
    Preformatted Text
    Examples
    Tables
    Deleted and Inserted Text
WG21 Front Matter
    Document Identification
    Headings and the Table of Contents
Representing C++ Standards Concepts in HTML
    Code and Code Blocks
    Grammar Terms and Rules
    Notes, etc.
    Representing Library Function Specifications
Showing Edits
    Quoting the Standard
    Inserted and Deleted Text
    Inserted and Deleted Paragraphs
Formatting the HTML Source
Literate Programming
Rendering Style
    Color and Contrast
    Tables
    Examples
    Extracted Code
References
Scripts
    quote_code.sh
    block_code.sh
    block_extract.sh
    extract_code.sh
    contents.sh
    dynacontents.js
    outline.sh
    outline_with_names.sh
    style.hinc
    Makefile

Introduction

At the Summer 2011 meeting, there were several problems with the readability of various presentations. Readability is another side of accessibility, the ability of a wide variety of readers to read the page under a wide variety of conditions.

This paper provides guidelines and tools for producing widely accessible WG21 papers. It is based on my experience in dealing with inaccessible web pages and my experience writing accessible web pages.

These guidelines are for the production of WG21 papers. While many of the concepts and techniques carry over into other uses, they are incomplete with respect to those other uses.

Issues

Contrast was the primary problem at the summer meeting. When contrast is low, readability is poor. Further, low contrast exaggerates focus problems.

Reliance on color is a significant problem. First, close to 10% of men are color deficient, which means they cannot see colors normally. There are several kinds of deficiency, but by far the most common is an inability to distinguish red and green. Second, many browsers support a "high contrast" mode, which generally ignores page-specified colors. Third, to save costs, WG21 papers are often printed without color. The net effect is that color differences, and particularly red versus green, are not sufficient to convey information.

Reliance on font face is a significant problem. The "high contrast" mode generally ignores page-specified fonts, so font differences are also not sufficient to convey information.

Reliance on long lines is a significant problem. Low-vision readers rely on being able to increase font size to read the text. Larger fonts mean relatively shorter lines. Smaller windows also achieve the effect of forcing shorter lines. Pages need to adapt to those shorter lines.

Reliance on external tools is not presently a problem, but could become one. Browsers behave differently. They are configured in various ways. They have different sets of extensions and plug-ins. All of this variety leads to problems when straying from plain HTML.

Approaches

The primary approach to solving these problems is to rely on text to convey information, and secondarily, to enable that text to adapt to the reader's needs. One can decorate with style and color; make it easier to read with style and color; but one must make the text itself convey all needed information.

A consequence of a reliance on text is that pages should avoid technologies that displace or obscure text. Examples include embedding text in images and using Flash.

Text more effectively adapts to readers' needs when the semantic structure of the paper is separated from the presentational choices. In other words, the HTML elements should carry the paper's meaning, and separate CSS should specify presentation. Readers can alter the applied CSS, but altering the elements is much harder.

Plain HTML is the most accessible and most reliable way to convey information. So, we should encode documents with HTML elements that best represent semantics.

Reading is more comfortable when the author respects and accepts the users' choices in browser, settings, colors and sizes.

Finally, papers should avoid reliance on problematic technologies, like Javascript, Java, and video.

Good HTML

Use clean, well-structured HTML. Doing so reduces document construction and maintenance costs, as well as making documents easier to read.

Where possible, comply with all relevant standards. We cannot control where our documents go, so we should help them travel easily.

Avoid machine generation of HTML, as the results tend to work towards a particular paper-based layout rather than provide general readability. In particular, word processors, such as Microsoft Word, produce really bad HTML.

Never put style information within the body of the document. [HTMLstyleinline] Instead, use the class attribute to give an element additional semantic information, which can then be decorated with CSS specified in the document head.

HTML Standard Compliance

Write documents with strict HTML 4.01 [HTML401] standards compliance.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>

Write documents with only the ASCII character set. It is the common base on most systems, and by design, is sufficient to represent C++ source code.

<meta http-equiv="Content-Type" content="text/html;charset=US-ASCII">

Use character entities for non-ASCII letters. [HTMLentity]

Some Character Entities
name	value	name	value	name	value
`&amp;`	"&"	`&uuml;`	"ü"	`&nbsp;`	(non-breaking space)
`&lt;`	"<"	`&Ntilde;`	"Ñ"	`&mdash;`	"—" (em dash)
`&gt;`	">"	`&oacute;`	"ó"	`&ndash;`	"–" (en dash)

Use an HTML validator. The W3C provides one at validator.w3.org.
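
Alongside validation, it can help to confirm that a source file really is pure ASCII before publishing. The sketch below counts the bytes outside the 7-bit range on standard input. The function name count_non_ascii is made up for this illustration and is not one of the paper's scripts.

```shell
# Hypothetical helper, not part of the paper's script set.
# Counts bytes above 0x7F on standard input; 0 means pure ASCII.
count_non_ascii() {
    # tr deletes every 7-bit byte; wc counts whatever is left.
    # The arithmetic expansion strips the padding some wc versions emit.
    echo $(( $(LC_ALL=C tr -d '\000-\177' | wc -c) ))
}
```

A typical use would be count_non_ascii < paper.html, where paper.html stands for the author's own source file. Any non-zero result points to characters that still need a character entity.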

Semantic Font Markup

Browsers may ignore font specifications, but they generally do not ignore the phrase markup elements [HTMLphrase] em, strong, dfn, code, samp, kbd, var, cite, and abbr. So, one should use one of them rather than the font style elements [HTMLfontstyle] tt, i, b, big, small, strike, s, and u. One should certainly not use the font element. [HTMLfontelem]

Normal emphasis should use the em element, rather than the i element.

Strong emphasis should use the strong element, rather than the b element.

The definition of a term should use the dfn element, rather than the i element.

Citations should use the cite element.

Abbreviations should use the abbr element. This element requires extra work to be effective, and so may not find much application within WG21 papers.

Text that is variable, that is intended for substitution, should use the var element, rather than the i element. Grammar symbols fall directly into this category.

C++ identifiers, keywords, punctuation, and the like should use the code element. Sample output should use the samp element. User input should use the kbd element. Such text that is variable, that is intended for substitution, should also use the var element.

Within C++ code, the characters &, <, and > must be quoted. The script quote_code.sh will convert C++ code into properly quoted HTML.
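
The order of the substitutions matters: the ampersand must be quoted first, or the ampersands introduced by &lt; and &gt; would themselves be quoted again. A minimal sketch of the quoting step alone follows. The function name quote_html is an assumption for illustration; the paper's own quote_code.sh in the Scripts section additionally wraps the result in a code element.

```shell
# Illustrative only: quote the three characters HTML treats specially.
# The ampersand rule comes first so that the "&" in "&lt;" and "&gt;"
# is not quoted a second time.
quote_html() {
    sed -e 's|&|\&amp;|g' \
        -e 's|<|\&lt;|g' \
        -e 's|>|\&gt;|g' "$@"
}
```

For example, feeding it the line std::vector<int>& v yields std::vector&lt;int&gt;&amp; v.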

Browsers often use the same representation for more than one phrase element. The common representations are in the following table.

Common Phrase Representations
representation	tags
normal	`abbr acronym`
italic	`cite dfn em i var`
bold	`b strong`
fixed-width	`code kbd samp tt`

Authors should exercise care to ensure that these overloaded elements are used in contexts where the intent is reasonably clear. Fortunately, some of these can be mixed, such as var with code.

Semantic Block Tags

Use elements to indicate document structure, not just for their effect on formatting.

When the semantics of an element are not fine enough, identify additional distinctions with class attributes. [HTMLclass]

Quoted Text

There are two types of quotes: block quotes and inline quotes. [HTMLquote]

Block quotes use the blockquote element and denote paragraph-level quotations. As such they always have a block-level element within them, such as an explicit p element.

Inline quotes use the q element, and generally enclose short quotations. Use inline quotes in place of quotation marks. Some browsers fail to add the quotation marks as specified, so this element may require some more time before it is reliable.

Preformatted Text

Use the pre element to enclose preformatted text. [HTMLpre] The pre element has definitional problems. In particular, the browser may or may not change to a fixed-width font, which means the author can neither avoid nor rely on a fixed-width font. Therefore, authors should always specify a fixed-width font immediately within the preformatted text and ensure that it is active throughout the block. That is,

<pre>
line of   wait for it   text
  followed by some indentation
</pre>

is not reliable. Instead, authors should specify

<pre>
<code>line of  ...  wait for it  ...  code
    some of which is indented</code>
</pre>

Furthermore, while

<pre><code>
line of  ...  wait for it  ...  code
    some of which is indented
</code></pre>

is cleanest, some browsers incorrectly [HTMLline] add an extra blank line at the beginning of the preformatted text.

In any event, preformatted text does not wrap lines, which makes it very difficult to read when the line width is greater than the window width. (This problem happens when either characters are large or windows are narrow.) Therefore, authors should strive to keep preformatted lines short.
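
One way to catch overlong preformatted lines before publishing is to scan the HTML source. The sketch below reports the line number and length of every source line inside a pre element that exceeds a chosen limit. The function name long_pre_lines and the limit are assumptions for illustration. It measures source characters, so markup and entities such as &lt; count for more than they render, which makes the check conservative.

```shell
# Hypothetical check: list source lines inside <pre> ... </pre> that are
# longer than the limit given as the first argument.
# Output format: "<line number>: <length>".
long_pre_lines() {
    awk -v max="$1" '
        /<pre/    { inpre = 1 }
        inpre && length($0) > max { print NR ": " length($0) }
        /<\/pre>/ { inpre = 0 }
    '
}
```

A typical use would be long_pre_lines 72 < paper.html, where both the limit and the file name are placeholders.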

Examples

Examples are a significant part of WG21 documents. Represent examples with class=example applied to p paragraphs, pre preformatted text or div document divisions. Divisions contain any number of paragraphs.

For example, the example

int main() {
   return 0;
}

is represented as

<pre class="example">
<code>int main() {
   return 0;
}</code>
</pre>

The script block_code.sh will convert C++ code into a properly quoted, preformatted example code block.

Tables

Tables [HTMLtable] may consist of a caption (caption), a head (thead), a body (tbody), and a foot (tfoot). The last three elements contain rows. The head and foot elements enable browsers to duplicate headings and footings when splitting a table across multiple pages.

<table>
<caption>Common Phrase Representations</caption>
<thead>
<tr><th>representation</th><th>tags</th></tr>
</thead>
<tbody>
<tr><td>normal</td><td><code>abbr acronym</code></td></tr>
<tr><td>italic</td><td><code>cite dfn em i var</code></td></tr>
<tr><td>bold</td><td><code>b strong</code></td></tr>
<tr><td>fixed-width</td><td><code>code kbd samp tt</code></td></tr>
</tbody>
</table>

Avoid wide tables; these tend to get truncated when printed. Test table width by deliberately making the browser window very narrow.

Deleted and Inserted Text

HTML provides direct representation of deleted and inserted text. [HTMLdelins] These should be used in preference to ad hoc mechanisms.

The HTML standard intended these elements for showing modifications to the document itself. However, that is rarely a problem with WG21 papers. Instead they need to show edits to the working draft, and this repurposing of the elements is reasonable.

For example, to achieve

This text was ~~deleted~~ inserted today.

use

This text was <del>deleted</del> <ins>inserted</ins> today.

and do not use

This text was <span class="del">deleted</span>
<span class="ins">inserted</span> today.

and certainly not

This text was <strike>deleted</strike>
<u>inserted</u> today.

and especially not

This text was <span style="color:red;">deleted</span>
<span style="color:green;">inserted</span> today.

When the deletion occurs before an insertion, readers can use the deletion to set the context for the insertion. So, when paired, the deletion should come before its insertion.

Unless spacing is critical to the changes, deletions and insertions should be spaced. However, in the presence of changing punctuation, non-spacing markup is preferable to excessive markup, particularly when readers may not notice it. For example,

This text <del>glues</del><ins>joins</ins> words.
This text <del>modifies</del><ins> changes</ins> spacing.
This text <del>highlights</del> <ins>clarifies</ins> changes.
Red<ins>, yellow</ins> and green are hard to distinguish.

yields

This text ~~glues~~joins words. This text ~~modifies~~ changes spacing. This text ~~highlights~~ clarifies changes. Red, yellow and green are hard to distinguish.

The del and ins elements are supposed to act as either block-level or inline-level elements; however, some browsers fail to render them properly as block-level elements. Therefore, authors should use these elements as inline elements only. (This workaround is most annoying for tables and lists.)

WG21 Front Matter

The front matter for WG21 documents includes document identification and possibly a table of contents or a revision history.

Document Identification

The document identification includes a title, which is specified in the title element within the head element and in the h1 element at the top of the body element.

<title>HTML for C++ Standards Documents</title>
</head>
<body>
<h1>HTML for C++ Standards Documents</h1>

Follow the title with the document identification, which is composed of "ISO/IEC JTC1 SC22 WG21", the WG21 paper number, the INCITS paper number, and the ISO date.

<p>
ISO/IEC JTC1 SC22 WG21 N3325 = 12-0015 - 2012-01-15
</p>

Follow the identification with the authors. Email addresses are optional.

<address>
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
</address>

Other optional front matter includes a table of contents and a revision history.

Headings and the Table of Contents

When headings follow a simple format, they can be easily and automatically converted into a table of contents. The format consists of a single line containing a heading element and directly within that an anchor element. The anchor provides not a reference, but a name. That name must be unique within the file. Names taken from the standard's own tagging system are often unique, but not always.

For example, the header,

30.6.6 Class template future [futures.unique_future]

is encoded on one source line as

<h3><a name="futures.unique_future">30.6.6 Class template <code>future</code> [futures.unique_future]</a></h3>

The script contents.sh generates a table of contents. The resulting file can be simply included into the HTML source.
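
Because anchor names must be unique within the file, a quick duplicate check before generating the contents can save some confusion. The following is a minimal sketch, assuming every heading follows the one-line format shown above. The function name duplicate_anchors is invented for this illustration and is not one of the paper's scripts.

```shell
# Hypothetical check, assuming each heading is on one source line in the form
#   <hN><a name="...">heading text</a></hN>
# Prints each anchor name that occurs more than once; prints nothing if all
# names are unique.
duplicate_anchors() {
    sed -n 's|^<h[1-6]><a name="\([^"]*\)">.*|\1|p' "$@" | sort | uniq -d
}
```

A typical use would be duplicate_anchors paper.html, where paper.html stands for the author's own source file.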

Alternatively, one can use Javascript within the HTML itself to dynamically generate the contents. The script dynacontents.js, courtesy of Jeffrey Yasskin, does this task. It assumes that the user has previously included

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js" type="text/javascript"> </script>

It also assumes a p element, somewhere in the document, with the id "toc", which it will fill in with the table of contents.

While the table of contents serves as an outline, a more specific command-line tool that emits the outline can be helpful during development. There are two such scripts, one that emits just the headings and one that also emits the anchor names. They are available as outline.sh and outline_with_names.sh, respectively.

Representing C++ Standards Concepts in HTML

Conventions on the use of HTML in representing concepts of the C++ standard will help in cooperative editing, sharing of helpful tools, and automatic translation into the LaTeX source of the standard itself.

Code and Code Blocks

Code should use the code element.

Non-normative code within the standard should use the var phrase element. Examples include parameter names, expository member variables, standard meta-variables, and example implementations.

The standard often marks comments with an italic font. Mark these with the em element.

Emphasized code must use the strong element, because normal emphasis is usually visually indistinct from variable text.

Avoid long lines in code blocks, as they may interfere with the readability of the document.

For example, the partial description of a standard counter might look something like the following.

template< typename T >
class counter
{
    // expository fields:
private:
    T value;

public:
    // construct and destruct:
    counter() : value( 0 ) { }
    counter(const counter& d) = default;
    ~counter() = default;

    // operations:
    void inc( T b) { value += b; }
    T get() { return value; }
}

It is encoded as follows.

<pre class="example">
<code>template&lt; typename T &gt;
class counter
{
    // <em>expository fields:</em>
<var>private:</var>
    <var>T value;</var>

public:
    // <em>construct and destruct:</em>
    counter() <var>: value( 0 ) { }</var>
    counter(const counter&amp; <var>d</var>) <strong>= default</strong>;
    ~counter() <strong>= default</strong>;

    // <em>operations:</em>
    void inc( T <var>b</var>) <var>{ value += b; }</var>
    T get() <var>{ return value; }</var>
}</code>
</pre>

Grammar Terms and Rules

The C++ grammar has the structure of a descriptive list, several terms each of which may have several definitions. We exploit that parallel structure by representing C++ grammar rules with descriptive lists.

Grammar terms are denoted by a var variable-phrase element. When a grammar term is defined, it is contained within a dt descriptive-term element, and marked by a dfn definition-phrase element. (The colon outside the dfn element makes automatic indexing easier.) Each substitution rule is denoted by a dd descriptive-definition element. The optional marker is denoted by a sub subscript element within the var element. Literal code is denoted by the code phrase element.

The grammar

declaration-seq:
    declaration declaration-seq_opt
static_assert-declaration:
    static_assert ( constant-expression , string-literal ) ;

is encoded as

<dl>
<dt><dfn>declaration-seq</dfn>:</dt>
<dd><var>declaration</var>
<var>declaration-seq<sub>opt</sub></var></dd>

<dt><dfn>static_assert-declaration</dfn>:</dt>
<dd><code>static_assert (</code>
<var>constant-expression</var> <code>,</code>
<var>string-literal</var> <code>) ;</code></dd>
</dl>

Notes, etc.

Represent notes, footnotes and examples by surrounding them with markers. These markers have the form

[Footnote: Here is some non-normative text. —end footnote]

and are encoded as

[<i>Footnote:</i>
Here is some non-normative text.
&mdash;<i>end footnote</i>]

Note that here we use the font element i because it does not really fit any of the phrase markers and because it makes searching for such uses easier.

While not part of the final standard, rationale, editor's notes and notes to the editor can also be represented this way.

Comments on the paper itself, and particularly notes on work still to be done can be marked the same way, except using the b element instead of the i element. This change enables rapid searching for unfinished parts of the document.

Representing Library Function Specifications

The library has special formatting requirements for representing functions and their attributes. Each function prototype is contained within a p class="function" element. The attribute paragraphs are all contained within a dl class="attribute" element. Each attribute is labeled with a dt element and has its body in a dd element. For example, the function definition

dynarray(size_type c);

Requires: The constructor parameter shall be greater than zero.
Effects: May or may not invoke the global operator new.

is represented as

<p class="function">
<code>dynarray(size_type <var>c</var>);</code>
</p>

<dl class="attribute">
<dt>Requires:</dt>
<dd><p>
The constructor parameter shall be greater than zero.
</p></dd>

<dt>Effects:</dt>
<dd><p>
May or may not invoke the global <code>operator new</code>.
</p></dd>
</dl>

Showing Edits

In the end, papers are effective only when they edit the working draft. This section explains how to do that.

Quoting the Standard

The first step in editing the standard is to quote the standard. For that we use the blockquote element with class="std". Each quoted portion of the standard must be preceded by a paragraph indicating where in the standard it comes from. The section may be known from context, but if not, it should be stated explicitly. So, the quote appears as

Section 1.8 [intro.object] paragraph 5 says:

Unless it is a bit-field (9.6), a most derived object shall have a non-zero size and shall occupy one or more bytes of storage. Base class subobjects may have zero size. An object of trivially copyable or standard-layout type (3.9) shall occupy contiguous bytes of storage.

and is encoded as

<p>
Section 1.8 [intro.object] paragraph 5 says:
</p>

<blockquote class="std">
<p>
Unless it is a bit-field (9.6),
a most derived object shall have a non-zero size
and shall occupy one or more bytes of storage.
Base class subobjects may have zero size.
An object of trivially copyable or standard-layout type (3.9)
shall occupy contiguous bytes of storage.
</p>
</blockquote>

Inserted and Deleted Text

One can show edits to a paragraph by combining the quoting of the standard with the delete and insert markup described above. So, an edit appears as

Edit section 1.8 [intro.object] paragraph 5 as follows.

Unless it is a bit-field (9.6), a most derived object shall have a non-zero size and shall occupy one or more bytes ~~of storage~~. Base class subobjects may have zero size. An object of trivially copyable, trivially movable or standard-layout type (3.9) shall occupy contiguous bytes of storage.

and is encoded as

<p>
Edit section 1.8 [intro.object] paragraph 5 as follows.
</p>

<blockquote class="std">
<p>
Unless it is a bit-field (9.6),
a most derived object shall have a non-zero size
and shall occupy one or more bytes <del>of storage</del>.
Base class subobjects may have zero size.
An object of trivially copyable<ins>, trivially movable</ins>
or standard-layout type (3.9)
shall occupy contiguous bytes of storage.
</p>
</blockquote>

Inserted and Deleted Paragraphs

When deleting or inserting whole paragraphs or sections, the del and ins elements need not be used, but the introductory text should clearly indicate the edit. In addition, the blockquote elements use class="stddel" or class="stdins", respectively. So, full paragraph deletions and insertions appear as

Delete paragraph 12 of 2.14.5 String literals [lex.string].

Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.

After paragraph 12 of 2.14.5 String literals [lex.string], insert a new paragraph.

All string literals are distinct; their characters never share addresses.

and are encoded as

<p>
Delete paragraph 12 of 2.14.5 String literals [lex.string].
</p>

<blockquote class="stddel">
<p>
Whether all string literals are distinct
(that is, are stored in nonoverlapping objects)
is implementation-defined.
The effect of attempting to modify a string literal is undefined.
</p>
</blockquote>

<p>
After paragraph 12 of 2.14.5 String literals [lex.string],
insert a new paragraph.
</p>

<blockquote class="stdins">
<p>
All string literals are distinct;
their characters never share addresses.
</p>
</blockquote>

Formatting the HTML Source

The format of the HTML source itself can improve its interaction with tools.

Starting each sentence on a new line improves the stability of diff, and hence of source code version control systems. The same applies to putting block-level elements on lines separate from live text.

When editing the source, separating block-level elements makes them more quickly identifiable.

More regularity in the HTML source eases writing tools that convert it to other forms, like the LaTeX of the standard itself.

Literate Programming

The C++ standard's papers are a good application of literate programming. [LPcom] [LPwiki] Particularly when a paper includes normative declarations or sample implementations, an automatic process for extracting the code from the paper itself helps ease adoption concerns.

The essential idea is to identify code to be extracted with a distinct class, e.g. "extract", and then remove everything but that within those code elements. That process is eased considerably when all code text is on lines separate from other text. Typically, this is accomplished with HTML of the form:

<pre class="extract"><code class="extract">
echo "Hello, World!"
</code></pre>

Within the code, all HTML elements should be removed, which enables links, phrase tags, and other markup within the code. Further, the critical HTML character entities, &lt;, &gt;, &amp;, and &nbsp;, must be recognized and substituted.
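
The order of the substitutions matters: &amp; should be replaced last, so that quoted text such as &amp;lt; comes out as the literal characters &lt; rather than as a less-than sign. A minimal sketch of this clean-up step on its own follows, under the assumption that no tag spans a line break. The name unquote_html is invented for this illustration; the paper's extract_code.sh in the Scripts section also selects which lines to keep.

```shell
# Illustrative only: strip markup and restore the quoted characters.
# Tags go first, then the entities, with "&amp;" last so that an escaped
# entity such as "&amp;lt;" is not decoded twice.
unquote_html() {
    sed -e 's|<[^<>]*>||g' \
        -e 's|&lt;|<|g' \
        -e 's|&gt;|>|g' \
        -e 's|&nbsp;| |g' \
        -e 's|&amp;|\&|g' "$@"
}
```

Applied to the encoded counter example, a line such as counter(const counter&amp; <var>d</var>) comes back as counter(const counter& d).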

For presentational purposes, it is also helpful to identify the pre element containing the code for extraction. Again, use the same class as above, e.g. "extract".

The script block_extract.sh will convert C++ code into a properly quoted, preformatted extraction code block.

The HTML files can contain either a single code file, as in N2648, or multiple code files, as in N2427. In the latter case, the multiple files are actually generated from a single-file shell script. The scripts in this paper follow that approach.

The script extract_code.sh will extract the code from the HTML source of this paper. Simply execute the resulting shell script to get copies of the scripts.

Rendering Style

Once the paper is well structured and independent of the presentation, we must address creating a readable presentation. We encode that presentation in a style element within the head element of the document. (Alternatively, we could create a standard location for a separately read CSS file.) The proposed style element is style.hinc in the Scripts section.

Color and Contrast

Color and contrast must meet specific technical requirements. These are embodied in the Web Content Accessibility Guidelines (WCAG). [WCAG] In particular, the intensity of the foreground and background must be sufficiently different. In addition the hue of the foreground and background should be sufficiently different. Web pages exist to test colors against the various criteria. [Snook] Further, consideration must be given to red-green color deficiency.

By far, the most common use of color in WG21 is to mark inserted and deleted text. The normal convention is to use red for deleted text and green for inserted text. However, this color combination is problematic for red-green deficient readers. Instead, we use magenta in place of red. The added blue to the color makes it visually distinct from green. The other typical problem with the colors chosen is that they are too bright to provide good contrast with the typical light background. So, these colors need to be reasonably dark, but still light enough to be distinct. The foreground colors #005100 green and #8B0040 red-magenta meet the criteria against a fairly broad range of light backgrounds. Unfortunately, once we specify a foreground color, we must also specify a background color. A white background reduces printing costs.
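
The intensity criterion can also be checked from the command line. The sketch below computes the contrast ratio as WCAG 2.0 defines it, from the relative luminance of two six-digit hex colors. The function name contrast_ratio is an assumption for illustration, and the paper does not say which WCAG level it targets; WCAG 2.0 names 4.5:1 and 7:1 as its two thresholds for normal-size text.

```shell
# Hypothetical helper: WCAG 2.0 contrast ratio of two "#RRGGBB" colors.
contrast_ratio() {
    awk -v fg="$1" -v bg="$2" '
        # Convert a hex string to a number (POSIX awk has no strtonum).
        function hex(s,    i, n) {
            s = tolower(s); n = 0
            for (i = 1; i <= length(s); i++)
                n = n * 16 + index("0123456789abcdef", substr(s, i, 1)) - 1
            return n
        }
        # Linearize one 8-bit sRGB channel, per the WCAG 2.0 definition.
        function lin(c) {
            c = c / 255
            return (c <= 0.03928) ? c / 12.92 : exp(2.4 * log((c + 0.055) / 1.055))
        }
        # Relative luminance of a color.
        function lum(rgb) {
            sub(/^#/, "", rgb)
            return 0.2126 * lin(hex(substr(rgb, 1, 2))) \
                 + 0.7152 * lin(hex(substr(rgb, 3, 2))) \
                 + 0.0722 * lin(hex(substr(rgb, 5, 2)))
        }
        BEGIN {
            a = lum(fg); b = lum(bg)
            if (a < b) { t = a; a = b; b = t }   # lighter color on top
            printf "%.2f\n", (a + 0.05) / (b + 0.05)
        }'
}
```

For instance, contrast_ratio '#005100' '#FFFFFF' and contrast_ratio '#8B0040' '#FFFFFF' give the ratios of the two foreground colors against a white background, which can then be compared with whichever threshold the author has in mind.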

However, these colors alone are not sufficient to identify inserted and deleted text. For that we must add text decoration. In particular, we follow existing convention and mark deleted text with a line struck through and inserted text with an underline. Now, even in the absence of color, deletions and insertions are distinct.

Earlier, we described the need to mark whole quoted paragraphs of the standard as deleted or inserted. We do this by changing the background for the paragraph. In particular, deleted quotes have a #FFEBFF light magenta background and inserted quotes have a #C8FFC8 light green background. Regular quoted paragraphs of the standard have a #F1F1F1 light grey background. Extracted code has a #F5F6A2 light yellow background. Finally, each of these backgrounds is surrounded by a thin, slightly darker, border. This border provides an attractive edge to the quote. More importantly, when browsers ignore color, as in high-contrast mode, they typically do not ignore borders. These thin subtle borders become very visible when the color is lost.

Tables

The default formatting of tables makes identifying table cells difficult. To address this problem and to be consistent with the formatting of many (but not all) of the standard's tables, we make several formatting choices.

Cell text is vertically aligned to the top, which makes identifying rows easier.
Cell text is horizontally aligned to the left, which makes identifying columns easier. (Authors may choose to use right alignment for numeric columns.)
Cells are given a little bit of extra spacing.
A thin border surrounds the table, but not individual cells or the caption.

Examples

For examples, we just indent a bit.

Extracted Code

For extracted code, we indent a bit and use the background color for extracted code.

References

References to WG21 papers can simply use the N-number. References to WG14 papers can simply use "WG14" and the N-number. These references should link to the appropriate documents, via HTML like the following.

This paper analyses the compatibility between the draft standards,
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3035.pdf">
N3035</a> and WG14
<a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf">
N1425</a> with respect to alignment.

All other references should be elaborated in a references section, such as this one. The purpose of the references section is to enable following references from printed documents. NOTE: when there is a reference section, it is unclear whether references should link to the reference section entry or link directly to the referent.

The remainder of this section lists references for this document.

[HTML401]: W3C, HTML 4.01 Specification, 24 December 1999, http://www.w3.org/TR/html401/.
[HTMLclass]: W3C, HTML 4.01 Specification, 24 December 1999, 7.5.2 Element identifiers: the id and class attributes, http://www.w3.org/TR/html401/struct/global.html#h-7.5.2.
[HTMLdelins]: W3C, HTML 4.01 Specification, 24 December 1999, 9.4 Marking document changes: The INS and DEL elements, http://www.w3.org/TR/html401/struct/text.html#h-9.4.
[HTMLentity]: W3C, HTML 4.01 Specification, 24 December 1999, 24 Character entity references in HTML 4, http://www.w3.org/TR/html401/sgml/entities.html.
[HTMLphrase]: W3C, HTML 4.01 Specification, 24 December 1999, 9.2.1 Phrase elements: EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE, ABBR, and ACRONYM, http://www.w3.org/TR/html401/struct/text.html#h-9.2.1.
[HTMLfontstyle]: W3C, HTML 4.01 Specification, 24 December 1999, 15.2.1 Font style elements: the TT, I, B, BIG, SMALL, STRIKE, S, and U elements, http://www.w3.org/TR/html401/present/graphics.html#h-15.2.1.
[HTMLfontelem]: W3C, HTML 4.01 Specification, 24 December 1999, 15.2.2 Font modifier elements: FONT and BASEFONT, http://www.w3.org/TR/html401/present/graphics.html#h-15.2.2.
[HTMLline]: W3C, HTML 4.01 Specification, 24 December 1999, B.3.1 Line breaks, http://www.w3.org/TR/html401/appendix/notes.html#notes-line-breaks.
[HTMLpre]: W3C, HTML 4.01 Specification, 24 December 1999, 9.3.4 Preformatted text: The PRE element, http://www.w3.org/TR/html401/struct/text.html#h-9.3.4.
[HTMLquote]: W3C, HTML 4.01 Specification, 24 December 1999, 9.2.2 Quotations: The BLOCKQUOTE and Q elements, http://www.w3.org/TR/html401/struct/text.html#h-9.2.2.
[HTMLspan]: W3C, HTML 4.01 Specification, 24 December 1999, 7.5.4 Grouping elements: the DIV and SPAN elements, http://www.w3.org/TR/html401/struct/global.html#h-7.5.4.
[HTMLstyleinline]: W3C, HTML 4.01 Specification, 24 December 1999, 14.2.2 Inline style information, http://www.w3.org/TR/html401/present/styles.html#h-14.2.2.
[HTMLtable]: W3C, HTML 4.01 Specification, 24 December 1999, 11 Tables, http://www.w3.org/TR/html401/struct/tables.html.
[LPcom]: literateprogramming.com, Literate Programming, http://www.literateprogramming.com/.
[LPwiki]: Wikipedia, Literate programming, http://en.wikipedia.org/wiki/Literate_programming.
[Snook]: snook.ca, Colour Contrast Check, 6 December 2009, http://www.snook.ca/technical/colour_contrast/colour.html.
[WCAG]: W3C, Web Content Accessibility Guidelines (WCAG) 2.0, 11 December 2008, http://www.w3.org/TR/WCAG/.

Scripts

This section contains several sh/sed scripts supporting the methods in this document. The scripts are contained within a sh script that generates those files.

quote_code.sh

This script quotes code. The result may be used within paragraphs.


cat <<"EOF" >quote_code.sh


exec sed -e '
1	i<code>
	s|&|\&amp;|g
	s|<|\&lt;|g
	s|>|\&gt;|g
$	a</code>
' "$@"

EOF
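
As a minimal sketch of the escaping step in this script, the three substitutions can be tried on a single line of code. The function name escape_html is hypothetical and is not part of the scripts in this paper. The ampersand must be replaced first; otherwise the ampersands introduced by the other two entities would themselves be escaped.

```shell
# Hypothetical helper: applies the same three substitutions as quote_code.sh,
# without adding the surrounding <code> tags.
escape_html() {
    sed -e 's|&|\&amp;|g' \
        -e 's|<|\&lt;|g' \
        -e 's|>|\&gt;|g'
}

# A line of C++ containing all three special characters.
printf '%s\n' 'if (a < b && c > d)' | escape_html
# prints: if (a &lt; b &amp;&amp; c &gt; d)
```

The escaped line can then be placed inside a code element within a paragraph.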

block_code.sh

This script creates an example block of code.


cat <<"EOF" >block_code.sh


exec sed -e '
1	i<pre class="example">
	s|&|\&amp;|g
	s|<|\&lt;|g
	s|>|\&gt;|g
1	s|^|<code>|
$	s|$|</code>|
$	a</pre>
' "$@"

EOF

block_extract.sh

This script creates an example block of code intended for extraction.


cat <<"EOF" >block_extract.sh


exec sed -e '
1	i<pre class="extract"><code class="extract">
	s|&|\&amp;|g
	s|<|\&lt;|g
	s|>|\&gt;|g
$	a</code></pre>
' "$@"

EOF

extract_code.sh

This script extracts code from an HTML source. It can serve as the inverse of the scripts above, but is intended to extract more generally annotated code. The script takes the class name as its first parameter.


cat <<"EOF" >extract_code.sh


class=$1
shift
exec sed -e '
1,/<code class="'$class'">/		d
/<\/code>/,/<code class="'$class'">/	d
/<\/code>/,$				d
					s|<[^<>]*>||g
					s|&lt;|<|g
					s|&gt;|>|g
					s|&nbsp;| |g
					s|&amp;|\&|g
' "$@"

EOF
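
The following sketch illustrates why the extraction decodes the ampersand entity last. The function names escape_html and unescape_html are hypothetical stand-ins for the substitution steps of the scripts above. A source line that already contains the text of an entity survives the round trip only if the ampersand is escaped first and decoded last.

```shell
# Hypothetical helper: the escaping step used by the block scripts.
escape_html() {
    sed -e 's|&|\&amp;|g' -e 's|<|\&lt;|g' -e 's|>|\&gt;|g'
}

# Hypothetical helper: the decoding step of extract_code.sh,
# with the ampersand entity handled last.
unescape_html() {
    sed -e 's|&lt;|<|g' -e 's|&gt;|>|g' -e 's|&nbsp;| |g' -e 's|&amp;|\&|g'
}

# The source line itself contains the characters "&lt;".
printf '%s\n' 'puts("a &lt; b");' | escape_html | unescape_html
# prints: puts("a &lt; b");
```

Had the ampersand entity been decoded first, the intermediate text would have been decoded a second time and the source would not round-trip.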

contents.sh

This script creates a table of contents. The first parameter to the script is the depth of headings to include in the contents.


cat <<"EOF" >contents.sh


usage()
{
echo "usage: $0 <depth in [2-6]> [<file>...]" 1>&2
}

if test $# -lt 1
then
	usage
	exit 1
fi

case $1 in
[2-6])
	DEPTH=$1
	shift
	;;
*)
	usage
	exit 1
	;;
esac

IN1="\&nbsp;\&nbsp;\&nbsp;\&nbsp;"
IN2="${IN1}${IN1}"
IN3="${IN2}${IN1}"
IN4="${IN3}${IN1}"

sed -e '
1			i<p>
$			a</p>
/<h[2-'${DEPTH}']>/	! d
			s|name="|href="#|
			s|</h[2-6]>|<br>|
			s|<h2>||
			s|<h3>|'${IN1}'|
			s|<h4>|'${IN2}'|
			s|<h5>|'${IN3}'|
			s|<h6>|'${IN4}'|
' "$@"

EOF
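
To show the effect of these substitutions, here is a reduced sketch limited to a depth of 3. It assumes that each heading sits on one line and wraps a named anchor, which is what the name-to-href rewrite relies on. The function name toc_lines and the sample headings are illustrative only.

```shell
# Hypothetical reduction of contents.sh: depth 3, without the <p> wrapper.
toc_lines() {
    sed -e '/<h[2-3]>/!d' \
        -e 's|name="|href="#|' \
        -e 's|</h[2-6]>|<br>|' \
        -e 's|<h2>||' \
        -e 's|<h3>|\&nbsp;\&nbsp;\&nbsp;\&nbsp;|'
}

# Two headings and one paragraph; only the headings survive.
printf '%s\n' \
    '<h2><a name="Intro">Introduction</a></h2>' \
    '<p>Some text.</p>' \
    '<h3><a name="Issues">Issues</a></h3>' | toc_lines
# prints:
# <a href="#Intro">Introduction</a><br>
# &nbsp;&nbsp;&nbsp;&nbsp;<a href="#Issues">Issues</a><br>
```

Each anchor definition becomes a link to itself, and each heading level below h2 is indented by four non-breaking spaces.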

dynacontents.js

This script creates a table of contents dynamically within the web page. Place the script within the head of the HTML. It assumes that the user has previously included

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js" type="text/javascript"> </script>

It also assumes a div element, somewhere in the document, with the id "toc", which it will fill in with the table of contents.

The browser must support JavaScript and must have network access, so that it can load jQuery, for the contents to appear.

Thanks to Jeffrey Yasskin for the code.


cat <<"EOF" >dynacontents.js


<script type="text/javascript">$(function() {
    var next_id = 0
    function find_id(node) {
        // Look down the first children of 'node' until we find one
        // with an id. If we don't find one, give 'node' an id and
        // return that.
        var cur = node[0];
        while (cur) {
            if (cur.id) return cur.id;
            if (cur.tagName == 'A' && cur.name)
                return cur.name;
            cur = cur.firstChild;
        };
        // No id.
        node.attr('id', 'gensection-' + next_id++);
        return node.attr('id');
    };

    // Put a table of contents in the #toc nav.

    // This is a list of <ol> elements, where toc[N] is the list for
    // the current sequence of <h(N+2)> tags. When a header of an
    // existing level is encountered, all higher levels are popped,
    // and an <li> is appended to the level
    var toc = [$("<ol/>")];
    $(':header').not('h1').each(function() {
        var header = $(this);
        // For each <hN> tag, add a link to the toc at the appropriate
        // level.  When toc is one element too short, start a new list
        var levels = {H2: 0, H3: 1, H4: 2, H5: 3, H6: 4};
        var level = levels[this.tagName];
        if (typeof level == 'undefined') {
            throw 'Unexpected tag: ' + this.tagName;
        }
        // Truncate to the new level.
        toc.splice(level + 1, toc.length);
        if (toc.length < level) {
            // Omit TOC entries for skipped header levels.
            return;
        }
        if (toc.length == level) {
            // Add a <ol> to the previous level's last <li> and push
            // it into the array.
            var ol = $('<ol/>')
            toc[toc.length - 1].children().last().append(ol);
            toc.push(ol);
        }
        var header_text = header.text();
        toc[toc.length - 1].append(
            $('<li/>').append($('<a href="#' + find_id(header) + '"/>')
                              .text(header_text)));
    });
    $('#toc').append(toc[0]);
})
</script>

EOF

outline.sh

This script creates an outline of the document.


cat <<"EOF" >outline.sh


SPC="[	 ]"
SPCSOPT="${SPC}*"
IN1="   "
IN2="${IN1}${IN1}"
IN3="${IN2}${IN1}"
IN4="${IN3}${IN1}"

exec sed -e "
/<h[1-6]>/	! d
		s/${SPCSOPT}<h1>//
		s/${SPCSOPT}<h2>//
		s/${SPCSOPT}<h3>/${IN1}<h3>/
		s/${SPCSOPT}<h4>/${IN2}<h4>/
		s/${SPCSOPT}<h5>/${IN3}<h5>/
		s/${SPCSOPT}<h6>/${IN4}<h6>/
		s/<[^>]*>//g
" "$@"

EOF
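
A reduced sketch of the outline transformation follows, covering only the first three heading levels. The function name outline and the sample input are illustrative. It shows that leading white space before a heading is discarded and replaced by indentation derived from the heading level.

```shell
# Hypothetical reduction of outline.sh to heading levels 1 through 3.
outline() {
    sed -e '/<h[1-6]>/!d' \
        -e 's/[[:space:]]*<h1>//' \
        -e 's/[[:space:]]*<h2>//' \
        -e 's/[[:space:]]*<h3>/   <h3>/' \
        -e 's/<[^>]*>//g'
}

printf '%s\n' \
    '<h2><a name="Good">Good HTML</a></h2>' \
    '<p>Some text.</p>' \
    '  <h3><a name="Quoted">Quoted Text</a></h3>' | outline
# prints:
# Good HTML
#    Quoted Text
```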

outline_with_names.sh

This script creates an outline of the document, including the anchor names.


cat <<"EOF" >outline_with_names.sh


SPC="[	 ]"
SPCSOPT="${SPC}*"
SPCSREQ="${SPC}${SPCSOPT}"
DQT='"'
QUOTE='\("[^"]*"\)'
IDENT='\([^	 >]*\)'
ANAME="<a${SPCSREQ}name="
ENDA="${SPCSOPT}>"
IN1="   "
IN2="${IN1}${IN1}"
IN3="${IN2}${IN1}"
IN4="${IN3}${IN1}"
LBL1='"[^"]\{0,5\}"'
LBL2='"[^"]\{6,13\}"'
LBL3='"[^"]\{14,21\}"'

exec sed -e "
/<h[1-6]>/	! d
		s|${SPCSOPT}<h1>||
		s|${SPCSOPT}<h2>||
		s|${SPCSOPT}<h3>|${IN1}<h3>|
		s|${SPCSOPT}<h4>|${IN2}<h4>|
		s|${SPCSOPT}<h5>|${IN3}<h5>|
		s|${SPCSOPT}<h6>|${IN4}<h6>|
		s|${ANAME}${QUOTE}${ENDA}|\1 |
		s|${ANAME}${IDENT}${ENDA}|"'"\1"'" |
		s|\(.*\)${QUOTE} \(.*\)$|\2	\1\3|
		s|\(${LBL1}\)|\1		|
		s|\(${LBL2}\)|\1	|
		s|<[^>]*>||g
" "$@"

EOF

style.hinc

This style element implements the style choices described above. It is intended for inclusion in WG21 papers.


cat <<"EOF" >style.hinc


<style type="text/css">

body { color: #000000; background-color: #FFFFFF; }
del { text-decoration: line-through; color: #8B0040; }
ins { text-decoration: underline; color: #005100; }

p.example { margin-left: 2em; }
pre.example { margin-left: 2em; }
div.example { margin-left: 2em; }

code.extract { background-color: #F5F6A2; }
pre.extract { margin-left: 2em; background-color: #F5F6A2;
  border: 1px solid #E1E28E; }

p.function { }
.attribute { margin-left: 2em; }
.attribute dt { float: left; font-style: italic;
  padding-right: 1ex; }
.attribute dd { margin-left: 0em; }

blockquote.std { color: #000000; background-color: #F1F1F1;
  border: 1px solid #D1D1D1;
  padding-left: 0.5em; padding-right: 0.5em; }
blockquote.stddel { text-decoration: line-through;
  color: #000000; background-color: #FFEBFF;
  border: 1px solid #ECD7EC;
  padding-left: 0.5em; padding-right: 0.5em; }

blockquote.stdins { text-decoration: underline;
  color: #000000; background-color: #C8FFC8;
  border: 1px solid #B3EBB3; padding: 0.5em; }

table { border: 1px solid black; border-spacing: 0px;
  margin-left: auto; margin-right: auto; }
th { text-align: left; vertical-align: top;
  padding-left: 0.8em; border: none; }
td { text-align: left; vertical-align: top;
  padding-left: 0.8em; border: none; }

</style>

EOF
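
The color choices in this style element can be checked against the contrast ratio defined in [WCAG]. The following is a rough sketch of that calculation, not a replacement for a tool such as [Snook]. The function names channel, luminance, and contrast are hypothetical, and the script assumes six-digit hexadecimal colors without the leading "#".

```shell
# Convert one two-digit hexadecimal channel (for example 8B) to decimal.
channel() {
    printf '%d' "0x$1"
}

# Relative luminance of an RRGGBB color, following the WCAG 2.0 definition.
luminance() {
    r=$(channel "$(printf '%s' "$1" | cut -c1-2)")
    g=$(channel "$(printf '%s' "$1" | cut -c3-4)")
    b=$(channel "$(printf '%s' "$1" | cut -c5-6)")
    awk -v r="$r" -v g="$g" -v b="$b" '
        function lin(c) {
            c = c / 255
            return (c <= 0.03928) ? c / 12.92 : exp(2.4 * log((c + 0.055) / 1.055))
        }
        BEGIN { printf "%.4f\n", 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b) }'
}

# Contrast ratio of two colors: (lighter + 0.05) / (darker + 0.05).
contrast() {
    awk -v a="$(luminance "$1")" -v b="$(luminance "$2")" 'BEGIN {
        hi = (a > b) ? a : b
        lo = (a > b) ? b : a
        printf "%.2f\n", (hi + 0.05) / (lo + 0.05)
    }'
}

contrast 000000 FFFFFF   # body text on body background; prints 21.00
contrast 8B0040 FFFFFF   # del text on body background
contrast 005100 FFFFFF   # ins text on body background
```

WCAG 2.0 asks for a ratio of at least 4.5 for normal text at its minimum level and at least 7 at its enhanced level, so the printed ratios can be compared against those thresholds.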

Makefile

This paper was put together with make, so as to avoid manual conversions on all the examples. It may be useful as a starting point for other papers.

The extensions used in the Makefile are as follows.

extension	file use	comment
.hsrc	source	an HTML source file with include directives
.hinc	source	an included HTML source file
.cc	source	a C/C++ source code file
.sh	source	a Bourne shell source code file
.html	product	a complete HTML file
.qinc	intermediate	a quoted version of a .hinc
.vinc	intermediate	a verbatim <pre> version of a .hinc
.cinc	intermediate	a contents file for an .hsrc file
.qcod	intermediate	a quoted version of a source code file
.vcod	intermediate	a verbatim version of a .qcod file
.qext	intermediate	a code extraction version of a source code file
.vext	intermediate	a verbatim version of a .qext file
.xext	intermediate	the code extracted from an HTML file


cat <<"EOF" >Makefile


default : help

help :
	@echo "make help      -- do nothing but print this message"
	@echo "make variables -- do nothing but show important build variables"
	@echo "make outline   -- show the paper outline"
	@echo "make documents -- build the HTML documents"
	@echo "make codefiles -- build the code files"
	@echo "make all       -- build the documents and code files"
	@echo "make test      -- test that extracted code files match sources"
	@echo "make clean     -- remove the documents and intermediate files"

INTERMEDIATE = *.qinc *.vinc *.cinc *.qcod *.vcod *.qext *.vext *.xext *.d

CPP = cpp -MMD -MP -w -P -C -traditional-cpp
DIFF = for f in *; do echo comparing $$f; diff ../$$f $$f; done

%.html : %.hsrc
	$(CPP) -MT $@ $< $@

%.qinc : %.hinc
	sh quote_code.sh $< > $@

%.vinc : %.hinc
	sh block_code.sh $< > $@

%.cinc : %.hsrc
	sh contents.sh 6 $< > $@

%.qcod : %.cc
	sh block_code.sh $< > $@

%.vcod : %.qcod
	sh block_code.sh $< > $@

%.qext : %.sh
	sh block_extract.sh $< > $@

%.qext : %.js
	sh block_extract.sh $< > $@

%.qext : %.hinc
	sh block_extract.sh $< > $@

%.vext : %.qext
	sh block_code.sh $< > $@

%.xext : %.html
	sh extract_code.sh extract $< > $@

Makefile.qext : Makefile
	sh block_extract.sh $< > $@

SOURCES := $(shell echo *.hsrc)
PREBUILD := $(shell sh prebuild.sh $(SOURCES))
DOCUMENTS := $(SOURCES:.hsrc=.html)
CODEFILES = htmlcppstd.xext

variables :
	@echo "SOURCES  = $(SOURCES)"
	@echo "DOCUMENTS = $(DOCUMENTS)"
	@echo "PREBUILD = $(PREBUILD)"

outline :
	sh outline_with_names.sh htmlcppstd.hsrc

$(DOCUMENTS) : $(PREBUILD)

documents : $(DOCUMENTS)

codefiles : $(CODEFILES)

all : documents codefiles

testing :
	mkdir testing

test : htmlcppstd.xext testing
	cd testing ; sh ../htmlcppstd.xext ; $(DIFF)

clean :
	rm -rf testing $(DOCUMENTS) $(INTERMEDIATE)

-include *.d

EOF


cat <<"EOF" >prebuild.sh


SPC="[	 ]"
SPCSOPT="${SPC}*"
INCLDIR="^${SPCSOPT}#${SPCSOPT}include${SPCSOPT}"

exec sed -e '
/#include/	! d
/\.hinc/	d
		s/'"${INCLDIR}"'"\(.*\)".*/\1/
' "$@"

EOF

name	value	name	value	name	value
`&amp;`	"&"	`&uuml;`	"ü"	`&nbsp;`	(non-breaking space)
`&lt;`	"<"	`&Ntilde;`	"Ñ"	`&mdash;`	"—" (em dash)
`&gt;`	">"	`&oacute;`	"ó"	`&ndash;`	"–" (en dash)
