TITLE: | Draft of 13250-6 Topic Maps -- Compact Syntax |
SOURCE: | Mr. Lars Heuer; Mr. Gabriel Hopmans; Dr. Sam Gyun Oh; Mr. Steve Pepper |
PROJECT: | WD 13250-6: Information technology - Topic Maps - Compact syntax |
PROJECT EDITOR: | Mr. Lars Heuer; Mr. Gabriel Hopmans; Dr. Sam Gyun Oh |
STATUS: | Draft for discussion |
ACTION: | For discussion at the Oslo meeting |
DATE: | 2007-02-22 |
DISTRIBUTION: | SC34 and Liaisons |
REPLY TO: |
Dr. James David Mason (ISO/IEC JTC 1/SC 34 Secretariat - Standards Council of Canada) Crane Softwrights Ltd. Box 266, Kars, ON K0A-2E0 CANADA Telephone: +1 613 489-0999 Facsimile: +1 613 489-0995 Network: jtc1sc34@scc.ca http://www.jtc1sc34.org |
1 | Scope |
2 | Normative references |
3 | Syntax description |
3.1 | About the syntax |
3.2 | Deserialization |
3.3 | Common syntactical constructs |
3.3.1 | Comments |
3.3.2 | Creating IRIs from strings |
3.3.3 | Creating IRIs from QNames |
3.3.4 | Topic References |
3.3.5 | Scope |
3.3.6 | Reifier |
3.3.7 | Type |
3.4 | Topic Map |
3.5 | Topics |
3.6 | Association |
3.7 | Identity |
3.8 | Template Invocation |
3.9 | Names |
3.10 | Variants |
3.11 | Occurrences |
3.12 | Values |
3.13 | Templates |
3.14 | Encoding Directive |
3.15 | Version Directive |
3.16 | Prefix Directive |
3.17 | Include Directive |
3.18 | Mergemap Directive |
3.19 | Atoms |
A | Syntax |
This is a working draft. Sections which contain @@@ are more in flux than the others.
The following issues must be resolved:
Reification of a topic map
Align datatypes of CTM with TMQL
Usage of sid / slo
Sometimes this document refers to "values", sometimes to "atoms". Make it consistent once we know which term we'll use.
Explain "parent context" or find a better term. A parent context is a "first class" component, such as a name, in which subcomponents such as scope and reifier may appear.
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
ISO/IEC 13250-6 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information Technology, Subcommittee SC 34, Document Description and Processing Languages.
ISO/IEC 13250 consists of the following parts, under the general title Topic Maps:
CTM (Compact Topic Maps) is a text-based notation for representing topic maps. It provides a simple, lightweight notation that complements the existing XML-based interchange syntax described in [XTM] and can be used for
manually authoring topic maps;
providing human-readable examples in documents;
serving as a common syntactic basis for TMCL and TMQL.
The principal design criteria of CTM are compactness, ease of human authoring, maximum readability, and comprehensiveness rather than completeness. CTM supports all constructs of the [TMDM], except item identifiers on constructs that are not topics.
Since CTM is not designed as interchange syntax, care should be taken when using CTM as a basis for interchanging topic maps.
This part of ISO/IEC 13250 should be read in conjunction with [TMDM], since the interpretation of the CTM syntax is defined through a mapping from the syntax to the data model defined there.
This part of ISO/IEC 13250 defines a text-based notation for representing instances of the data model defined in [TMDM]. It also defines a mapping from this notation to the data model. The syntax is defined through an EBNF grammar.
The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.
Each of the following documents has a unique identifier that is used to cite the document in the text. The unique identifier consists of the part of the reference up to the first comma.
LH: TODO: Correct citations
Unicode, The Unicode Standard, Version 5.0.0, The Unicode Consortium, Reading, Massachusetts, USA, Addison-Wesley Developer's Press, 2007, ISBN 0-321-48091-0, http://www.unicode.org/versions/Unicode5.0.0/
XML 1.0, Extensible Markup Language (XML) 1.0, W3C, Third Edition, W3C Recommendation, 04 February 2004, http://www.w3.org/TR/REC-xml/
TMDM, ISO 13250-2 Topic Maps — Data Model, ISO, 2006, Lars Marius Garshol, Graham Moore, http://www.isotopicmaps.org/sam/sam-model/
XTM, ISO 13250-3 Topic Maps — XML Syntax, ISO, 2006, http://www.isotopicmaps.org/sam/sam-xtm/
XSDT, XML Schema Part 2: Datatypes Second Edition, W3C, W3C Recommendation, 28 October 2004, http://www.w3.org/TR/xmlschema-2/
RFC3986, RFC 3986 - Uniform Resource Identifiers (URI): Generic Syntax, T. Berners-Lee, R. Fielding, L. Masinter, 2005, http://www.ietf.org/rfc/rfc3986
RFC3987, RFC3987 - Internationalized Resource Identifiers (IRIs), M. Duerst, M. Suignard, 2005, http://www.ietf.org/rfc/rfc3987.txt
The acronym CTM is often used to refer to the syntax defined in this part of ISO/IEC 13250. Its full name is Compact Topic Maps Syntax. The namespace for the CTM syntax is http://www.topicmaps.org/ctm/.
A CTM document is a text document that conforms to the CTM syntax. This clause defines the syntax of CTM documents using an EBNF grammar based on the notation described in [XML 1.0], and their semantics using prose describing the mapping from CTM documents to [TMDM]. The full EBNF can be found in Annex A.
The process of exporting a topic map from an implementation's internal representation of the data model to an instance of a Topic Maps syntax is known as serialization. The opposite process, deserialization, is the process of building an instance of an implementation's internal representation of the data model from an instance of a Topic Maps syntax.
This clause defines how instances of the CTM syntax are deserialized into instances of the data model defined in [TMDM]. Serialization is only implicitly defined: for any data model instance, an implementation should produce a CTM serialization that, when deserialized into a new data model instance, yields an instance with the same canonicalization as the original, according to [ISO 13250-4].
The input to the deserialization process is:
A CTM document.
An absolute IRI. This is the IRI from which the CTM document was retrieved, known as the document IRI. This IRI shall always be provided, as it is necessary in order to assign the item identifiers of the topic items created during deserialization. If the CTM document was not read from any particular IRI the application is responsible for providing an IRI considered suitable.
LH: I added the requirement that an IRI must be provided (like in XTM). Any reason to remove this requirement? What happens if CTM is used inside TMQL? .. and what happens if we do not have this requirement? Which locator is used to resolve identifiers against?
Deserialization is performed by processing each component of the document in document order. Components are defined in terms of text that matches a syntactic variable of the EBNF. For each component encountered the operations specified in the clause for the corresponding syntactic variable are performed.
Whenever a new information item is created, those of its properties which have set values are initialized to the empty set; all other properties are initialized to null.
Each CTM processor must be aware of the following prefixes:
http://www.w3.org/2001/XMLSchema#
This is the namespace for the XML Schema Datatypes.
http://www.topicmaps.org/ctm/
This is the namespace for this part of ISO/IEC 13250.
LH: Which clause-ordering makes sense?
Comments are fragments of the character stream which are ignored by any CTM processor. Comments are allowed wherever whitespace characters are allowed. A comment is introduced by a hash (#) and extends to the end of the current line or to the end of the text stream, whichever comes first.
To create an IRI from a string unescape the string by replacing %HH escape sequences with the characters they represent, and decode the resulting character sequence from UTF-8 to a sequence of abstract Unicode characters. The resulting string is turned into an absolute IRI by resolving it against the document IRI.
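The procedure above can be sketched in a few lines of Python. This is only an illustration built on the standard library, not part of the draft: the function name `iri_from_string` and the document IRI `http://example.org/maps/beatles.ctm` are assumptions made for the example.

```python
from urllib.parse import unquote, urljoin


def iri_from_string(raw: str, document_iri: str) -> str:
    """Unescape %HH sequences (decoded as UTF-8) and resolve against the document IRI."""
    # unquote() replaces %HH escapes and decodes the resulting bytes as UTF-8;
    # errors="strict" rejects byte sequences that are not valid UTF-8.
    unescaped = unquote(raw, encoding="utf-8", errors="strict")
    # urljoin() resolves a (possibly relative) reference against an absolute base.
    return urljoin(document_iri, unescaped)


# Hypothetical document IRI, used only for this example.
DOC_IRI = "http://example.org/maps/beatles.ctm"

print(iri_from_string("psi/caf%C3%A9", DOC_IRI))
print(iri_from_string("http://psi.example.org/John", DOC_IRI))
```

A string that is already an absolute IRI is returned unchanged by the resolution step; a relative one is interpreted relative to the document IRI.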
QNames are used to abbreviate IRIs. They are declared as follows:
[0] | qname | → | prefix : local |
[1] | prefix | → | identifier |
[2] | local | → | identifier |
A QName causes a locator to be created: the IRI to which the prefix is bound and the local part are concatenated. The result of this process is always an absolute IRI.
The prefix should have been bound to an IRI as specified in 3.16; if the prefix is unbound, an error is flagged.
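As a minimal sketch of the QName expansion described above, assuming prefix bindings are kept in a plain dictionary (the names `expand_qname` and `UnboundPrefixError` are hypothetical and do not come from the draft):

```python
class UnboundPrefixError(Exception):
    """Raised where the draft says an error is flagged for an unbound prefix."""


def expand_qname(qname: str, prefixes: dict) -> str:
    """Concatenate the IRI bound to the prefix with the local part."""
    prefix, separator, local = qname.partition(":")
    if not separator or not prefix or not local:
        raise ValueError(f"not a QName: {qname!r}")
    if prefix not in prefixes:
        raise UnboundPrefixError(f"prefix {prefix!r} is not bound")
    return prefixes[prefix] + local


# Example binding taken from the prefix directive example in this document.
bindings = {"wiki": "http://www.wikipedia.org/wiki/"}
print(expand_qname("wiki:John_Lennon", bindings))
```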
Topics are referenced by an item identifier, a subject identifier, or a subject locator.
[3] | topic-ref | → | identifier | subject-identifier | subject-locator |
[4] | identifier | → | [a-zA-Z][a-zA-Z0-9_.-]* @@@TODO: NCName (c.f. XML) |
[5] | subject-identifier | → | iri | qname |
[6] | subject-locator | → | = (iri | qname) |
During deserialization, one topic item is created for each topic-ref.
If the topic-ref is an identifier, a locator is created by concatenating the document IRI, a # character, and the value of the identifier. The locator is added to the [item identifiers] property of the topic.
If the topic-ref is specified by a subject identifier, a locator is created and added to the [subject identifiers] property of the topic.
If the topic-ref is specified by a subject locator, a locator is created and added (without the =) to the [subject locators] property of the topic.
If the created topic item is equal to another topic item (c.f. [TMDM] 5.3), the two topic items are merged according to the procedure given in [TMDM].
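The three cases above can be sketched as a function that maps a topic-ref to the TMDM property it populates and the locator it produces. This is an illustration only: it assumes that QNames have already been expanded to absolute IRIs, it does not perform the merge step, and the function name and document IRI are hypothetical.

```python
def classify_topic_ref(ref: str, document_iri: str) -> tuple:
    """Return (TMDM property name, locator) for a single topic-ref.

    Assumption: `ref` is an identifier, an absolute IRI, or "=" followed by
    an absolute IRI; QName expansion has already taken place.
    """
    if ref.startswith("="):
        # Subject locator: the locator is stored without the leading "=".
        return ("subject locators", ref[1:].strip())
    if ":" in ref:
        # Identifiers cannot contain ":", so this must be an IRI.
        return ("subject identifiers", ref)
    # Plain identifier: document IRI + "#" + identifier.
    return ("item identifiers", f"{document_iri}#{ref}")


DOC_IRI = "http://example.org/maps/beatles.ctm"  # hypothetical document IRI

print(classify_topic_ref("john-lennon", DOC_IRI))
print(classify_topic_ref("http://en.wikipedia.org/wiki/John_Lennon", DOC_IRI))
print(classify_topic_ref("=http://www.paulmccartney.com/", DOC_IRI))
```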
The scope construct is used to assign a scope to the statement represented by the parent context.
Each topic-ref element is processed according to the procedure described in 3.3.4. These topic items are gathered into a set that is assigned as the value of the [scope] property of the information item produced by the parent context.
The reifier construct is used to refer from the topic map construct on which it appears to the topic reifying that construct. The reference is a topic-ref as described in 3.3.4.
During deserialization the topic-ref is resolved into a topic item following the procedure in 3.3.4. The topic item is set as the value of the [reifier] property of the topic map construct being processed.
The type construct is used to assign a type to the topic map construct represented by its parent context. The type is always a topic, indicated by the topic-ref.
During deserialization the topic-ref produces a topic item following the procedure in 3.3.4, which is set as the value of the [type] property of the information item produced by the parent context.
The topicmap component is the equivalent of the CTM document. It acts as a container for the topic map but has no further significance. It is declared as follows:
One CTM document produces exactly one topic map instance during its deserialization.
[11] | topic | → | topic-ref { assignment } end-of-definition |
[12] | assignment | → | template-invocation | identity | name | occurrence |
[13] | end-of-definition | → | \s+ . | ^\s*$ |
An assignment is a component used to make one or more statements, such as template invocations (e.g. isa), names, occurrences, or one or more assignments about identity.
http://en.wikipedia.org/wiki/John_Lennon isa http://psi.example.org/music/guitarist - "John Lennon" |
[14] | association | → | type ( roles ) [ scope ] [ reifier ] |
[15] | roles | → | role { , role } |
[16] | role | → | type : player [ reifier ] |
[17] | player | → | topic-ref |
During deserialization an association item is created for each association and added to the [associations] property of the topic map item. The [type] property of the association is set to the topic produced by the type as described in 3.3.7.
During deserialization an association role item is created for each type / player pair. The role item is added to the [roles] property of the association item produced by the parent context.
[18] | identity | → | subject-identifiers | subject-locators |
[19] | subject-identifiers | → | sid : (iri | qname) + |
[20] | subject-locators | → | slo : (iri | qname) + |
LH: I removed the possibility to specify item identifiers. CTM is not a Topic Maps interchange syntax.
LH: Maybe we should remove the "sid" / "slo" thing. Why do we not specify multiple subject identifiers / locators with the same notation given in 3.3.4? Conflicts with template invocations? IMO we do not get into trouble if we define that templates must have been declared in the same stream and templates are only referenced by a local name, never by an IRI.
# Topic with three sids
http://en.wikipedia.org/wiki/John_Lennon
    http://de.wikipedia.../John_Lennon
    - "John Lennon"
    http://psi.example.org/John

# If it is possible to refer to templates by a URI we might get a problem here:
http://psi../    file://....    http://....
^^               ^^             ^^
Left topic       Template       Right topic

"left topic", "template" and "right topic" are interpretable as subject identifiers
[21] | template-invocation | → | template-name (argument | argument-list) |
[22] | argument | → | topic-ref | value |
[23] | argument-list | → | ( argument { , identifier : argument } ) |
If only one argument is supplied to a template, the brackets may be omitted.
Multiple arguments require brackets.
# Valid template invocations
mccartney isa person
mccartney isa(person)
mccartney plays-for(The-Beatles, instrument: piano)
mccartney has-shoesize "42"^^xs:integer
Template invocations involve at minimum one topic and one argument which is either a topic reference or a value (3.12). If more than one argument is provided, brackets should be used and the variable assignments must be specified for all arguments but the first (which is automatically bound c.f. 3.13). The variable assignment involves the variable name (without the dollar sign) and the argument.
If the template was not previously defined, an error is flagged.
If not all variables are bound, an error is flagged.
If a variable is bound to more than one argument, an error is flagged. It is not an error if one and the same argument is bound to more than one variable.
A template invocation causes the statements of the template to be added to the topic map where the variables in the statements are replaced by the specified arguments.
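The binding rules and error conditions above can be sketched as follows. This is a simplified illustration, not the draft's processing model: template bodies are stored as plain statement strings and expanded by textual substitution, whereas a real processor would substitute in the parsed structure. The assumption that the first argument is bound to `$_right` follows the draft's description of the predefined variables; the names `invoke` and `TemplateError` are hypothetical.

```python
import re

# Matches $name; the leading underscore covers the predefined $_left / $_right.
VARIABLE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_-]*)")


class TemplateError(Exception):
    """Raised wherever the draft says 'an error is flagged'."""


def invoke(templates, name, left, first_arg, named_args=()):
    """Expand a template invocation into a list of statements.

    templates  -- dict mapping template name to a list of statement strings
    left       -- the topic the invocation is attached to (bound to $_left)
    first_arg  -- the first argument (assumed to be bound to $_right)
    named_args -- sequence of (variable name without "$", argument) pairs
    """
    if name not in templates:
        raise TemplateError(f"template {name!r} is not defined")
    bindings = {"_left": left, "_right": first_arg}
    for variable, argument in named_args:
        if variable in bindings:
            raise TemplateError(f"variable ${variable} is bound more than once")
        bindings[variable] = argument
    body = templates[name]
    used = {v for statement in body for v in VARIABLE.findall(statement)}
    unbound = used - set(bindings)
    if unbound:
        raise TemplateError(f"unbound variables: {sorted(unbound)}")
    return [VARIABLE.sub(lambda m: bindings[m.group(1)], s) for s in body]


# Template body modelled on the "born" example in 3.13.
templates = {
    "born": [
        "$_left birthday: $_right",
        "born-in(person : $_left, birthplace : $in)",
    ]
}
for statement in invoke(templates, "born", "mccartney", "1942-06-12", [("in", "Liverpool")]):
    print(statement)
```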
The name construct is used to add a topic name to a topic. It is declared as follows:
The colon (:) may be omitted, but for readability this part of ISO/IEC 13250 recommends using the colon for names where an explicit type is provided.
During deserialization the name construct causes a topic name item to be created, and added to the [topic names] property of the topic item created by the parent topic context.
If the type is not specified, the [type] property of the topic name item is set to the topic item whose [subject identifiers] property contains http://psi.topicmaps.org/iso13250/model/topic-name; if no such topic item exists, one is created.
paul-mccartney - "Paul McCartney"    # Name with the default name type, colon omitted
paul-mccartney -: "Paul McCartney"   # Name with the default name type, colon provided
john-lennon
    - "John Lennon"                    # Default name
    - fullname: "John Winston Lennon"  # Name of type "fullname"
    - surname "Lennon"                 # Name of type "surname", colon omitted
The variant construct is used to add a variant name to a topic name. It is declared as follows:
During deserialization the variant construct causes a variant item to be created and added to the [variants] property of the topic name item created by the name parent context. After the scope has been processed, the topics in the [scope] property of the topic name item created by the parent name construct are added to the [scope] property of the variant name item.
%prefix tm http://psi.topicmaps.org/iso13250/model/
# Topic with name "John Lennon" which has a variant which represents a sort name
john - "John Lennon" ("lennon, john" @tm:sort)
[26] | occurrence | → | type : value [ scope ] [ reifier ] |
During deserialization the occurrence construct causes an occurrence item to be created, and added to the [occurrences] property of the topic item created by the parent topic context.
paul-mccartney birthday: 1942-02-18  # Occurrence of type "birthday" with the value "1942-02-18" and datatype "xs:date"

# Now a difficult to read (but valid) example
# Topic with identifier "paul-mccartney", occurrence of type "birthday", default name in
# scope "formal" and an occurrence of type "website"
paul-mccartney birthday: 1942-02-18 - "Sir Paul McCartney" @formal website: http://www.paulmccartney.com/
The value construct represents an information resource in the form of content contained within the CTM document.
The value construct sets the [value] property of the information item created by the parent context.
The following datatypes are supported:
Templates are containers for topic and association components. The template body consists of ordinary topics and associations with the additional possibility to replace topic references (3.3.4) and values (3.12) with variables.
Template declarations are defined as follows:
[28] | template | → | ctm:tpl template-name { topic | association } end |
[29] | template-name | → | identifier |
Wherever a topic-ref or an atom is allowed, a variable is allowed. A variable consists of a dollar sign ($) and an identifier.
$variable-name |
The variables $_left and $_right are predefined and refer to the left topic reference and the right topic reference where the template name is used in infix notation (c.f. 3.8).
The declaration of a template does not change the topic map content until it has been used within a template invocation (3.8).
# Declaration of a template
ctm:tpl born
    $_left birthday: $_right
    born-in(person : $_left, birthplace : $in)
end

# Invocation of the template
mccartney born(1942-06-12, in: Liverpool)

# The same effect would have been the following declaration:
mccartney birthday: 1942-06-12
born-in(person: mccartney, birthplace: Liverpool)
Templates make topic maps considerably more compact, especially maps that contain many repetitive facts of the same kind, such as the facts about persons in this example.
If the template name is already used by another template declaration, an error is flagged.
It is not an error to use the template name as a topic reference. Templates should be added to a "template environment" @@@TODO: Explain it in more detail!
LH: My thoughts behind the template environment: The user can still use "isa", "iko" etc. as topic identifiers (i.e. as item identifier, type etc.) without causing conflicts with the template names. The kind of usage is determined by the context.
The following templates are predefined:
This template creates a type-instance relationship between two topics (c.f. [TMDM] 7.2).
%prefix tm http://psi.topicmaps.org/iso13250/model/
ctm:tpl isa
    tm:type-instance(tm:instance: $_left, tm:type: $_right)
end
This template creates a supertype-subtype relationship between two topics (c.f. [TMDM] 7.3).
%prefix tm http://psi.topicmaps.org/iso13250/model/
ctm:tpl iko
    tm:supertype-subtype(tm:subtype: $_left, tm:supertype: $_right)
end
The encoding directive specifies the character encoding used by the document.
If the encoding directive is omitted, UTF-8 encoding is assumed. The name of the encoding should be given as a string in the form recommended by [XML 1.0].
If the encoding directive is provided, it shall occur in the first line of the document.
%encoding "Shift-JIS" |
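A minimal sketch of how a processor might honour the encoding directive when decoding a byte stream. It assumes that the directive itself is written in an ASCII-compatible encoding (so it would not detect UTF-16, for instance) and that the encoding name is one Python's codec registry recognizes; the function name `decode_ctm` is hypothetical.

```python
import re

# %encoding "name" at the very start of the first line (bytes pattern).
ENCODING_DIRECTIVE = re.compile(rb'^%encoding\s+"([^"]+)"')


def decode_ctm(data: bytes) -> str:
    """Decode a CTM byte stream, defaulting to UTF-8 when no directive is present."""
    first_line = data.split(b"\n", 1)[0]
    match = ENCODING_DIRECTIVE.match(first_line)
    encoding = match.group(1).decode("ascii") if match else "utf-8"
    # Raises LookupError for unknown encodings and UnicodeDecodeError for bad input.
    return data.decode(encoding)


sample = '%encoding "ISO-8859-1"\ncafe - "Caf\u00e9"\n'.encode("iso-8859-1")
print(decode_ctm(sample))
```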
The version directive states the version number of the CTM syntax, which is currently "1.0". It is declared as follows:
[31] | version | → | %version 1.0 |
The version directive tells the parser which version of the CTM syntax to use during deserialization. Currently the only legal version is 1.0, as defined by this part of ISO/IEC 13250.
While the version directive may be omitted, this part of ISO/IEC 13250 recommends its usage for future compatibility.
The prefix directive is used to associate an IRI with an identifier. It is declared as follows:
[32] | prefix-directive | → | %prefix identifier iri |
Typical use of a prefix is shown in the following example:
%prefix wiki http://www.wikipedia.org/wiki/
# The prefix can then be used for topics like:
wiki:John_Lennon - "John Lennon"
During deserialization the prefix component binds the identifier to the IRI.
If the identifier is already bound to an IRI, an error is flagged.
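The binding step and its error condition can be sketched as follows. This is an illustration only: the regular expression approximates the identifier production of 3.3.4, treats any run of non-whitespace characters as the IRI, and the function name `apply_prefix_directive` is hypothetical.

```python
import re

# %prefix identifier iri  (identifier approximated from production [4])
PREFIX_DIRECTIVE = re.compile(r"^%prefix\s+([a-zA-Z][a-zA-Z0-9_.-]*)\s+(\S+)\s*$")


def apply_prefix_directive(line: str, bindings: dict) -> None:
    """Bind the identifier to the IRI; rebinding an identifier is an error."""
    match = PREFIX_DIRECTIVE.match(line)
    if match is None:
        raise ValueError(f"not a prefix directive: {line!r}")
    identifier, iri = match.groups()
    if identifier in bindings:
        raise ValueError(f"prefix {identifier!r} is already bound")
    bindings[identifier] = iri


prefix_table = {}
apply_prefix_directive("%prefix wiki http://www.wikipedia.org/wiki/", prefix_table)
print(prefix_table)
```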
The include directive is used to include another CTM document into the CTM file. The other document is referenced by an IRI.
The referenced document should use CTM syntax, otherwise an error is flagged.
@@@TODO!
The mergemap directive is used to merge an external topic map into the topic map produced by deserializing the CTM topic map.
This directive is declared as follows:
The topic map to be merged can be in any syntax, which must be declared (using the notation component) if it is not CTM.
This part of ISO/IEC 13250 defines the following identifiers for Topic Maps syntaxes:
This is the namespace for CTM.
This is the namespace for XML Topic Maps.
LH: The previous paragraph has to be changed. This part of ISO/IEC 13250 does not define the identifiers!
A CTM processor should support the syntaxes mentioned above. For any other syntax a CTM processor may flag an error.
During deserialization the mergemap component causes the referenced topic map to be immediately deserialized into a data model instance. The new data model instance (B) is then merged into the current one (A) by
Adding all topic items in B's [topics] property to A's [topics] property.
Adding all association items in B's [associations] property to A's [associations] property.
Adding topics and associations to A may trigger further merges, as described in [TMDM].
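The merge of B into A can be sketched as follows. This is a simplified illustration, not the [TMDM] procedure itself: topic equality is reduced to "shares any identity locator", names and occurrences are omitted, duplicate associations are not removed, and associations are kept as opaque objects. A real implementation would also redirect association type and role-player references to the surviving topic, for which the hypothetical `resolve` helper shows one possible approach.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare topics by object identity, not by field values
class Topic:
    item_identifiers: set = field(default_factory=set)
    subject_identifiers: set = field(default_factory=set)
    subject_locators: set = field(default_factory=set)
    merged_into: Optional["Topic"] = None  # set when this topic was merged away


@dataclass(eq=False)
class TopicMap:
    topics: list = field(default_factory=list)
    associations: list = field(default_factory=list)


def resolve(topic: Topic) -> Topic:
    """Follow merged_into links to the surviving topic."""
    while topic.merged_into is not None:
        topic = topic.merged_into
    return topic


def topics_equal(a: Topic, b: Topic) -> bool:
    """Simplified equality test: any shared identity locator."""
    return bool(
        a.item_identifiers & b.item_identifiers
        or a.subject_identifiers & b.subject_identifiers
        or a.subject_locators & b.subject_locators
        or a.item_identifiers & b.subject_identifiers
        or a.subject_identifiers & b.item_identifiers
    )


def add_topic(tm: TopicMap, topic: Topic) -> Topic:
    """Add a topic, absorbing equal topics until no equal topic remains."""
    while True:
        other = next((t for t in tm.topics if topics_equal(t, topic)), None)
        if other is None:
            break
        tm.topics.remove(other)
        topic.item_identifiers |= other.item_identifiers
        topic.subject_identifiers |= other.subject_identifiers
        topic.subject_locators |= other.subject_locators
        other.merged_into = topic
    tm.topics.append(topic)
    return topic


def merge_map(a: TopicMap, b: TopicMap) -> None:
    """Merge data model instance B into A."""
    for topic in list(b.topics):
        add_topic(a, topic)
    a.associations.extend(b.associations)
```

Because absorbing one topic can make the result equal to yet another topic, `add_topic` loops until the set of topics is stable, which corresponds to the "further merges" mentioned above.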
Atoms are literal values, such as strings, dates or integers, and IRIs. This part of ISO/IEC 13250 natively recognizes the following set of data types:
LH: natively makes no sense since users can use any datatype with the string^^datatype notation
... and e.g. qname is not a datatype
I called them atoms since TMQL calls them that, but I'd prefer values or literals. I guess TMQL calls them atoms because of atomification.
[36] | atom | → |
[37] | datatype | → | ^^ (iri | qname) |
[38] | iri | → | @@@TODO: Construct that is RFC 3987 conform |
[39] | qname | → | identifier : identifier |
[40] | integer | → | [ sign ] [0-9]+ |
[41] | decimal | → |
[42] | sign | → | + | - |
[43] | date | → | [ - ] year - month - day [ timezone ] |
[44] | timezone | → |
[45] | string | → | quoted-string | triple-quoted-string |
[46] | quoted-string | → | " ('\"' | '\\' | [^"\])* " |
[47] | triple-quoted-string | → | """ .* """ |
LH: IMO we should move sign, timezone etc. into the Annex A and ignore it here.
LH: The EBNF is not valid
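A rough sketch of how a processor might classify atoms according to the productions above. It covers only integers, dates without a timezone, and quoted strings with an optional datatype qualifier; decimals, timezones and triple-quoted strings are left out because their productions are incomplete in this draft. The mapping of bare integers and dates to xsd:integer and xsd:date, the default of xsd:string for untyped strings, and the prefix name `xsd` are assumptions made for the example.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

INTEGER = re.compile(r"[+-]?[0-9]+")
DATE = re.compile(r"-?[0-9]{4,}-[0-9]{2}-[0-9]{2}")
QNAME = re.compile(r"([a-zA-Z][a-zA-Z0-9_.-]*):([a-zA-Z][a-zA-Z0-9_.-]*)")
# "..." with \" and \\ escapes, optionally followed by ^^datatype
TYPED_STRING = re.compile(r'"((?:\\.|[^"\\])*)"(?:\^\^(\S+))?')


def parse_atom(text: str, prefixes: dict) -> tuple:
    """Return (lexical value, datatype IRI) for a single atom."""
    if INTEGER.fullmatch(text):
        return text, XSD + "integer"
    if DATE.fullmatch(text):
        return text, XSD + "date"
    match = TYPED_STRING.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported atom: {text!r}")
    # Resolve the escapes \" and \\ from left to right.
    value = re.sub(r'\\(["\\])', r"\1", match.group(1))
    datatype = match.group(2)
    if datatype is None:
        return value, XSD + "string"
    qname = QNAME.fullmatch(datatype)
    if qname is None:
        return value, datatype  # treated as an IRI
    prefix, local = qname.groups()
    if prefix not in prefixes:
        raise ValueError(f"prefix {prefix!r} is not bound")
    return value, prefixes[prefix] + local


print(parse_atom("1942-02-18", {}))
print(parse_atom('"42"^^xsd:integer', {"xsd": XSD}))
```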
Any datatype can be expressed by representing the value as a string and appending the datatype qualifier (^^) and a datatype.
"42"^^xsd:integer
"--12-22"^^xsd:gMonthDay
ISO/IEC 13250:2003, Topic Maps, 2003, http://www.y12.doe.gov/sgml/sc34/document/0322_files/iso13250-2nd-ed-v2.pdf
ISO 13250-4, Topic Maps — Canonicalization, http://www.isotopicmaps.org/sam/cxtm/