TITLE: | A High-level Description of a Draft Reference Model for ISO 13250 Topic Maps |
SOURCE: | Steven R. Newcomb and Michel Biezunski |
PROJECT: | Topic Map Models |
PROJECT EDITORS: | Michel Biezunski, Martin Bryan, Steven R. Newcomb |
STATUS: | Draft High-level Description of forthcoming Editor's Draft |
ACTION: | For review and comment |
DATE: | 4 April 2002 |
SUMMARY: | |
DISTRIBUTION: | SC34 and Liaisons |
REFER TO: | |
SUPERSEDES: | SC34 N298 |
REPLY TO: | Dr. James David Mason (ISO/IEC JTC1/SC34 Chairman) Y-12 National Security Complex Information Technology Services Bldg. 9113 M.S. 8208 Oak Ridge, TN 37831-8208 U.S.A. Telephone: +1 865 574-6973 Facsimile: +1 865 574-1896 E-mail: mailto:mxm@y12.doe.gov http://www.y12.doe.gov/sgml/sc34/sc34oldhome.htm Ms. Sara Hafele, ISO/IEC JTC 1/SC 34 Secretariat American National Standards Institute 25 West 43rd Street New York, NY 10036 Tel: +1 212 642-4937 Fax: +1 212 840-2298 E-mail: shafele@ansi.org |
4 April 2002
Steven R. Newcomb and Michel Biezunski
Co-editors, ISO/IEC 13250
The purpose of ontologies (sets of knowledge-bearing assertion types), taxonomies (classes of things, ideas, etc.), and vocabularies (such as markup vocabularies) is to allow the human members of communities of interest to communicate among themselves about the things with which their communities are concerned.
However, knowledge that is represented in the terms of a specific community may have high value outside its community of origin. How can distinct bodies of knowledge, created and maintained by distinct, non-cooperating communities, be made usefully available in contexts other than their communities of origin? One answer is to merge them with other bodies of knowledge, in conformance with the Reference Model of ISO 13250 Topic Maps.
Syntax layer. The 13250 Standard provides two syntaxes for interchanging Topic Maps, one known as "HyTM" and the other as "XTM". The HyTM syntax offers both flexibility and rigor by means of HyTime architectural forms. The XTM syntax was specifically designed for XML and the Web. Since there are multiple standard syntaxes, the question naturally arises: What is the nature of the information that is being interchanged by means of these syntaxes? There must be a layer of common meanings -- an ontology -- on which both of these syntaxes are based.
Standard Application layer. The ontology on which both HyTM and XTM are based is called the "Standard Application" layer. The word "Application" is used here in an extremely broad sense; except for the Reference Model, the entire ISO 13250 Topic Maps Standard will describe a single Application called the "Standard Application". The Standard Application is a set of semantics for which a data model, the "Standard Application Model [SAM]", is also being developed. The Standard Application defines virtually all of the familiar features of the Topic Maps paradigm, including topic names, topic occurrences, and scopes. Instances of both the HyTM and XTM syntaxes say the same kinds of SAM-defined things, but in different ways, as a result of the different requirements that drove the design of each syntax. Additional SAM-centric languages are planned, including Topic Maps Query Language (TMQL) and Topic Maps Constraint Language (TMCL).
Non-ISO syntaxes, including proprietary syntaxes and community-specific syntaxes, such as NewsML, can be used to interchange SAM-conforming topic maps. In order that interchange and merging can occur reliably and predictably, the interpretation of each syntax in terms of the SAM and the Reference Model will have to be rigorously specified.
Reference Model layer. The conventions on which the Standard Application and all other Applications are based are called the "Reference Model". The Reference Model provides a robust and predictable basis for collating ("aligning" or "merging") knowledge about subjects, regardless of the diversity of the ontologies that govern the understanding of such knowledge.
From the perspective of the Reference Model, the Standard Application is an extensible ontology: it provides assertion types for such familiar Topic Maps features as topic names, topic occurrences, topic types, and scopes. Although it might seem appropriate for these features to be built into the fundamental layer of the Topic Maps standard, none of these assertion types appears in the Reference Model. Instead, they are in the Standard Application, in order to allow implementations of Applications other than the Standard Application to be based on different approaches to knowledge representation and management, without incurring the cost of supporting features of the Standard Application that may be irrelevant to their contexts of use.
The draft Reference Model is designed to preserve the maximum possible flexibility for the design of Applications, while still providing a basis for predictable automated merging of diverse topic maps. Most of the familiar features of Topic Maps are found in the Standard Application, and not in the Reference Model.
In the draft Reference Model, a topic map is seen as a set of "assertions", no more and no less. Each assertion asserts the existence of a strongly-typed relationship between some specific set of subjects of conversation. Each such subject is a "role player" in the assertion; it plays a specific role in the relationship. The ontologies of Applications may include an unbounded number of kinds of assertions ("assertion types"). The roles, the role players, the assertions themselves, and the types of the assertions, are all regarded as subjects, and any of these features of an assertion can be role players in other assertions.
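As a minimal sketch of this view (an illustration only, not something the draft Reference Model prescribes), an assertion can be modeled as an assertion type plus a set of role-to-role-player bindings, with every part treated as a subject. The class names `Subject` and `Assertion`, the assertion type labels, the "claim" and "source" roles, and the school and registry names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Subject:
    """Hypothetical stand-in for any subject of conversation."""
    label: str


@dataclass(frozen=True)
class Assertion:
    """A typed relationship in which each role is bound to one role player."""
    assertion_type: Subject
    castings: Tuple[Tuple[Subject, object], ...]  # (role, role player) pairs

    def player_of(self, role: Subject) -> Optional[object]:
        """Return the player of `role` in this assertion, or None."""
        for cast_role, player in self.castings:
            if cast_role == role:
                return player
        return None


# Illustrative subjects; every label here is an assumption.
degree_type = Subject("MD degree holder - institution")
holder_role = Subject("MD degree holder")
institution_role = Subject("institution")
george = Subject("George")
school = Subject("Example Medical School")

degree = Assertion(degree_type, ((holder_role, george), (institution_role, school)))

# The assertion is itself a subject, so it can play a role in another assertion.
claim_role = Subject("claim")
source_role = Subject("source")
sourced = Assertion(
    Subject("claim - source"),
    ((claim_role, degree), (source_role, Subject("Example registry"))),
)

# In this view, a topic map is nothing more than a set of assertions.
topic_map = {degree, sourced}
```

Because both classes are immutable and hashable, an assertion can be placed in a set or used as a role player without any special handling, which mirrors the statement that assertions are themselves subjects.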
Every topic map is a graph, and every assertion within a topic map is a subgraph within that graph. According to the draft Reference Model, graphs of topic maps consist of "nodes" (also sometimes called "vertices" or "vertexes" or "topics") and four distinct kinds of nondirectional "arcs" or "edges" that connect the nodes to one another. The Reference Model establishes a single graphic meta-structure for all assertions.
Fundamentally, the draft Reference Model consists of four arc types, here shown as differently-shaded bars. The names of the arc types are concatenations of the names of their endpoints; for example, the CR (casting-role) arc type has the endpoints C and R. Each arc type has a specific function in the representation of assertions. Rules restrict the combinations of arc endpoints that a given node can serve as, but any node can serve as the x end of any number of Cx arcs. Nodes are shown here as dots. Each node (or "topic") is the unique surrogate of exactly one subject. This diagram shows a 2-role assertion, but an assertion can have an unbounded number of roles.
The Reference Model imposes certain redundancy-elimination and subject-collation requirements on Application implementations, so that everything that is known about a given subject turns out to be attached to the one and only node that corresponds to that subject. Every node represents exactly one subject, and it is connected by arcs to everything that is known about it in the topic map. Every arc is a component of exactly one assertion. Every node in a topic map graph serves as some combination of zero or more of the eight endpoints of the four types of arcs. From this combination of endpoints, all of the assertions in which the node is involved -- as a role player and/or as an internal component of one or more assertions -- are fully and unambiguously specified.
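The graph structure described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the draft's normative model: nodes are plain strings (so one string stands for one subject), the class `TopicMapGraph` and its methods are hypothetical, and "AC" is an assumed name for the arc type that links an assertion node to its casting nodes (only CR, Cx and AP are named in this paper). The second assertion and its labels are invented for the example.

```python
from collections import Counter

# CR, Cx and AP are named in the text; "AC" is an assumed name for the fourth arc type.
ARC_TYPES = ("AP", "AC", "CR", "Cx")


class TopicMapGraph:
    """Nodes are plain strings; arcs are (arc type, first-end node, second-end node)."""

    def __init__(self):
        self.arcs = set()

    def add_arc(self, arc_type, first, second):
        if arc_type not in ARC_TYPES:
            raise ValueError(f"unknown arc type: {arc_type}")
        self.arcs.add((arc_type, first, second))

    def add_assertion(self, assertion, pattern, castings):
        """Add one assertion subgraph.

        `castings` is an iterable of (casting node, role node, role player node).
        """
        self.add_arc("AP", assertion, pattern)
        for casting, role, player in castings:
            self.add_arc("AC", assertion, casting)
            self.add_arc("CR", casting, role)
            self.add_arc("Cx", casting, player)

    def endpoints_served(self, node):
        """Count how often `node` serves as each of the eight arc endpoints."""
        served = Counter()
        for arc_type, first, second in self.arcs:
            if first == node:
                served[f"{arc_type}:{arc_type[0]}"] += 1
            if second == node:
                served[f"{arc_type}:{arc_type[1]}"] += 1
        return dict(served)


graph = TopicMapGraph()
graph.add_assertion(
    "assertion-1",
    "MD degree holder - institution",
    [
        ("casting-1a", "MD degree holder", "George"),
        ("casting-1b", "institution", "Example Medical School"),
    ],
)
graph.add_assertion(
    "assertion-2",
    "employee - employer",
    [
        ("casting-2a", "employee", "George"),
        ("casting-2b", "employer", "Example Hospital"),
    ],
)
```

In this sketch a 2-role assertion contributes seven arcs (one AP arc, plus one AC, one CR and one Cx arc per role), and the "George" node serves as the x end of two Cx arcs because it plays a role in both assertions.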
An instance of an assertion. In this diagram, the subject of each node is shown in a balloon. (That is, except for the casting nodes, here labeled "playing of role". The subject of the casting node on the left leg of the assertion is the fact that George is the player of the MD degree holder role in this particular assertion.)
The draft Reference Model defines two "paradigmatic" assertion types:
Instances of the topic-subjectIndicator assertion type -- a 2-role assertion type -- are used to declare that subjects have subject indicators. (A subject indicator, considered as a piece of information, is itself a subject -- an "addressable subject" or "subject constituter".)
Instances of the assertionPattern-role-rolePlayerConstraints assertion type -- a 3-role assertion type -- are used to declare that assertion types (which, of course, are themselves subjects) specify the roles used in their instances. Each role is a subject, and the set of constraints on players of a role, if any, is also a subject. The draft Reference Model says nothing about the nature of role player constraints, about how such constraints are asserted, or about how instances of assertions should be validated with respect to them. All such decisions are made by designers of Applications. The draft Reference Model provides the assertionPattern-role-rolePlayerConstraints assertion type only in order to illustrate, for all Applications, a uniform and universally recognizable connection between assertion type topics, role topics, and role player constraint topics, and to illustrate how all of these subjects can fully and predictably participate in the topic maps for which they have ontological significance.
Constraints on role players are asserted by making various kinds of assertions, each of which declares some constraint or set of constraints about the topic that plays the rolePlayerConstraints role (the "role player constraints topic" or "RPC topic"). One kind of assertion that can be used to state such constraints is the topic-subjectIndicator assertion type; for example, the constraints can be stated in natural language (or in any other notation) in the RPC topic's subject indicator. This method is used in the draft Reference Model itself, in order to describe the role player constraints that are applicable to instances of the two paradigmatic assertion types. However, role player constraints can be more formally expressed -- and their expressions can be used for automatic validation processing -- by means of Application-defined assertion types.
Each instance of the assertionPattern-role-rolePlayerConstraints assertion type -- a 3-role assertion type -- declares that the assertion pattern that plays the assertionPattern role has, as one of its roles, the role that plays the role role. It can also declare the constraints on the players of that role via a topic whose subject is the constraints, and that plays the rolePlayerConstraints role. The "role player constraints topic" in the diagram plays roles in one or more other assertions, whose types are not defined by the Reference Model and which are not shown in the diagram, that establish the constraint that, in instances of the assertion pattern, the only valid players of the institution role are medical schools. The subjects described in colored balloons are the subjects on which the assertionPattern-role-rolePlayerConstraints assertion type itself depends.
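To make the 3-role structure concrete, the following sketch builds instances of the assertionPattern-role-rolePlayerConstraints assertion type as plain dictionaries and reads back the roles declared for a pattern. The function names `declare_role` and `roles_of`, the dictionary layout, and the pattern name are assumptions made for illustration. The constraints topic is treated as an opaque subject, since the draft Reference Model leaves the nature and validation of constraints to Applications.

```python
ASSERTION_TYPE = "assertionPattern-role-rolePlayerConstraints"


def declare_role(pattern, role, constraints=None):
    """Build one 3-role assertion; the constraints topic is optional."""
    players = {"assertionPattern": pattern, "role": role}
    if constraints is not None:
        players["rolePlayerConstraints"] = constraints
    return {"type": ASSERTION_TYPE, "players": players}


def roles_of(pattern, assertions):
    """Map each role declared for `pattern` to its constraints topic (or None)."""
    return {
        a["players"]["role"]: a["players"].get("rolePlayerConstraints")
        for a in assertions
        if a["type"] == ASSERTION_TYPE and a["players"]["assertionPattern"] == pattern
    }


pattern = "MD degree holder - institution"  # hypothetical pattern name
declarations = [
    declare_role(pattern, "MD degree holder"),
    # The constraints topic is only named here; no validation is performed.
    declare_role(pattern, "institution", "players must be medical schools"),
]

declared = roles_of(pattern, declarations)
```

Here `declared` maps "MD degree holder" to `None` and "institution" to the constraints topic, so a consumer can discover the roles of a pattern without understanding what the constraints mean.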
Predictable merging. In the Topic Maps paradigm, information items that share a particular common meaning can all be connected by reference to any of an unlimited number of "subject indicators" and/or "published subject indicators", each of which is a piece of addressable information that independently serves as a surrogate for that particular common meaning. Thus, subject indicators serve as "binding points" whose existence can allow computers to collate information about subjects, so that everything that is known about a given subject is directly available from the perspective of that subject. It doesn't matter which communities use which names for a given subject; all communities can keep using the vocabularies they already use, even though the subjects named in each vocabulary can now be automatically connected with a wealth of materials that have been indexed by other communities who may use different vocabularies.
The Reference Model enables instances of extremely diverse knowledge assets to be merged. Even though this merging is in fact driven by semantic identity, it does not require that the machine "understand" the semantics of the subjects around which it is collating information. This is what makes the knowledge alignment paradigm of Topic Maps more scalable and less resource-intensive than approaches that require machines to behave intelligently. The discipline that the Reference Model imposes -- and especially the discipline that every node has exactly one subject, and every subject has exactly one node -- facilitates knowledge aggregation among diverse topic maps. If, for example, Topic Map A and Topic Map B are merged into Topic Map C, whenever a node in Topic Map A has the same subject as a node in Topic Map B, all of the assertions made about both nodes in both topic maps are made about a single node in the resulting merged Topic Map C. If both of the original nodes had assertions connecting them to the same subject-indicating resource (binding point), the merge can occur automatically, without human intervention.
The paradigm requires that any two topics that have the same subject must be merged into a single topic; subject indicators make such a situation easy to detect. In both Topic Map A and Topic Map B, the "George" subject has one subject indicator in common: George's birth record. Therefore, in Topic Map C, there cannot be two Georges; they are both known to be one and the same George, and everything that is asserted about George in both A and B is asserted about George in C. George's birth record is the binding point that causes the merging to occur. The subjects of the "Subject Indicator for George" nodes are addressable subjects. An addressable subject is a piece of addressable information considered as itself rather than as anything it might signify. Such subjects are direct properties (here shown as dotted lines) of their node surrogates.
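The merging rule just described can be sketched as follows, assuming a deliberately simplified representation in which each topic is a dictionary holding a set of subject indicator addresses and a set of assertions written as plain strings. The function `merge_topic_maps`, the `example.org` addresses, and the assertion texts are hypothetical; the sketch implements only indicator-based merging, not any Application-specific rule.

```python
def merge_topic_maps(*topic_maps):
    """Merge topic maps so that topics sharing a subject indicator become one topic."""
    merged = []  # invariant: no two topics in `merged` share a subject indicator
    for topic_map in topic_maps:
        for topic in topic_map:
            indicators = set(topic["indicators"])
            assertions = set(topic["assertions"])
            untouched = []
            for existing in merged:
                if existing["indicators"] & indicators:
                    # Same subject: absorb the existing topic into the new one.
                    indicators |= existing["indicators"]
                    assertions |= existing["assertions"]
                else:
                    untouched.append(existing)
            untouched.append({"indicators": indicators, "assertions": assertions})
            merged = untouched
    return merged


BIRTH_RECORD = "http://example.org/birth-records/george"  # hypothetical address

topic_map_a = [
    {"indicators": {BIRTH_RECORD}, "assertions": {"George holds an MD degree"}},
    {"indicators": {"http://example.org/schools/ems"},
     "assertions": {"Example Medical School is a medical school"}},
]
topic_map_b = [
    {"indicators": {BIRTH_RECORD, "http://example.org/staff/george"},
     "assertions": {"George works at Example Hospital"}},
]

topic_map_c = merge_topic_maps(topic_map_a, topic_map_b)
george = next(t for t in topic_map_c if BIRTH_RECORD in t["indicators"])
```

In the merged map there is a single George topic that carries the assertions from both A and B as well as both of his subject indicators, while the unrelated topic from A is carried over unchanged. No step in the function inspects what the assertions mean; only the shared address drives the merge.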
Lossless merging. Conformance to the Reference Model guarantees that users of the original topic maps will find all subjects in the merged topic map exactly where they have learned to expect to find them in the original topic maps. All of the familiar relationships and subjects are still present; the merging process neither changes nor eliminates them. The only difference is that the familiar topic map has been enriched with new subjects and new relationships. While this feature of the Reference Model may seem unremarkable and obvious, it is nevertheless unusual among information interchange paradigms. One of the implications is that, when one topic map is merged with many others, anyone who knows his way around the one topic map will be able to exploit his knowledge when using the others -- even after merging has vastly increased the richness and diversity of the available material.
Self-describing semantics. After diverse topic maps have been merged in a single topic map, a subject in that map may play many roles in many assertions. The assertions may be of different types, and the assertion types may belong to different ontologies. Users of the merged topic map may or may not instantly understand the significance of all of the assertions made about the subject. However, users of conforming systems have direct and convenient access to the ontological information -- the corresponding assertion types and roles, to whatever extent they exist in the topic map -- via the AP and CR arc types.
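As an illustration of how such ontological information might be reached, the sketch below walks from a role player to the roles it plays and the assertion patterns involved. It assumes arcs stored as (arc type, first-end node, second-end node) tuples with string nodes, and it assumes "AC" as the name of the arc type linking an assertion node to its casting nodes, a name this paper does not state. The function `describe` and all node labels other than those taken from the George example are hypothetical.

```python
def describe(player, arcs):
    """Return sorted (assertion pattern, role) pairs for every role `player` plays."""
    found = set()
    for kind, casting, x_node in arcs:
        if kind != "Cx" or x_node != player:
            continue
        # The CR arc of the casting gives the role that is being played.
        roles = [r for k, c, r in arcs if k == "CR" and c == casting]
        # The assertion that owns the casting, then its AP arc, give the pattern.
        assertions = [a for k, a, c in arcs if k == "AC" and c == casting]
        patterns = [p for k, a, p in arcs if k == "AP" and a in assertions]
        found.update((p, r) for p in patterns for r in roles)
    return sorted(found)


arcs = [
    ("AP", "assertion-1", "MD degree holder - institution"),
    ("AC", "assertion-1", "casting-1a"),
    ("CR", "casting-1a", "MD degree holder"),
    ("Cx", "casting-1a", "George"),
    ("AC", "assertion-1", "casting-1b"),
    ("CR", "casting-1b", "institution"),
    ("Cx", "casting-1b", "Example Medical School"),
    ("AP", "assertion-2", "employee - employer"),
    ("AC", "assertion-2", "casting-2a"),
    ("CR", "casting-2a", "employee"),
    ("Cx", "casting-2a", "George"),
]

george_roles = describe("George", arcs)
```

For the "George" node this yields one (assertion pattern, role) pair per assertion in which George plays a role, which is the kind of information a user of a merged topic map would consult to interpret an unfamiliar assertion.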
Ontology-independent merging. The draft Reference Model, which itself makes almost no ontological assumptions, shows how the merging of topic maps occurs even when they have no ontology, taxonomy, vocabulary or syntax in common, and even when the knowledge assets are maintained under very different editorial conventions and policies. (It is, of course, also true that additional merging may be required by Applications. The SAM's name-based merging rule is one example of such ontology-dependent merging.)
Scaling capacity. The draft Reference Model is extremely simple: four arc types and two built-in assertion types. Each of these six constructs has a limited number of implications for implementations, and none of these implications prohibits the implementation of systems that distribute knowledge among peer servers that collectively and effectively behave as a single knowledge base. On the basis of the draft Reference Model, distributed systems that can do "lazy" merging -- merging that is done on an ad hoc basis in order to meet an ephemeral user need -- can be created, even if they do not all implement the same Application.
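One way to picture "lazy" merging is sketched below: rather than merging whole topic maps in advance, a view of a single subject is assembled on demand from whatever each peer holds. This is only an illustration under simplifying assumptions. Peers are modeled as in-memory lists rather than servers, topics are dictionaries of subject indicator addresses and plain-string assertions, and the function `lazy_merge`, the addresses, and the assertion texts are all hypothetical.

```python
def lazy_merge(indicator, peers):
    """Collect, on demand, everything the peers know about one subject."""
    indicators = {indicator}
    assertions = set()
    changed = True
    while changed:  # repeat until no peer contributes anything new
        changed = False
        for peer in peers:
            for topic in peer:
                if not (topic["indicators"] & indicators):
                    continue  # this topic is not known to be about the subject
                before = (len(indicators), len(assertions))
                indicators |= topic["indicators"]
                assertions |= topic["assertions"]
                if (len(indicators), len(assertions)) != before:
                    changed = True
    return {"indicators": indicators, "assertions": assertions}


BIRTH_RECORD = "http://example.org/birth-records/george"  # hypothetical address
STAFF_PAGE = "http://example.org/staff/george"            # hypothetical address

peer_1 = [{"indicators": {STAFF_PAGE}, "assertions": {"George leads Ward 3"}}]
peer_2 = [{"indicators": {BIRTH_RECORD}, "assertions": {"George holds an MD degree"}}]
peer_3 = [
    {"indicators": {BIRTH_RECORD, STAFF_PAGE},
     "assertions": {"George works at Example Hospital"}},
    {"indicators": {"http://example.org/schools/ems"},
     "assertions": {"Example Medical School is a medical school"}},
]

view = lazy_merge(BIRTH_RECORD, [peer_1, peer_2, peer_3])
```

The loop runs to a fixed point because a topic on one peer may only become recognizable as the same subject after another peer has supplied an additional subject indicator, as happens with `peer_1` in this example. The peers themselves are left unmodified, and the unrelated topic on `peer_3` is never touched.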
Support for Application design documentation. The draft Reference Model provides sufficient means whereby questions about the processing or semantic implications of any feature of any Application can be asked and answered exhaustively and rigorously.
The devoted, attentive, thoughtful, and hardworking support of Sam Hunting and Jan Algermissen made the development of this Draft Reference Model possible. "The solution, when found, is obvious," but until that moment, it's not obvious at all. Their pioneering proof of the workability of the ideas set forth in our earlier "PMTM4" model, and the experiences they gained in doing that work, enabled them to provide essential guidance and encouragement to the work of developing the model, and they did so very generously, over a period of many months, pointing out problems and traps, suggesting ways around them, and correcting our thinking. In addition, this paper uses ideas that they contributed about how to explain the model in simple and compelling ways.
This paper was first presented at XML Europe 2002 (http://www.xmleurope.com/), an IDEAlliance (http://www.idealliance.org/) conference held in Barcelona, Spain, in May, 2002. IDEAlliance has supported the development of the Topic Maps paradigm and standards continuously since 1993.
Graphics by Victoria T. Newcomb.