TITLE: | Response to N0497 "Analysis of RM Use Cases" |
SOURCE: | Dr. Steven R. Newcomb; Mr. Patrick Durusau |
PROJECT: | WD 13250-5: Information Technology - Document Description and Processing Languages, Topic Maps - Reference Model |
PROJECT EDITOR: | Mr. Patrick Durusau; Dr. Steven R. Newcomb |
STATUS: | Informational |
ACTION: | For review and comment |
DATE: | 2004-04-09 |
DISTRIBUTION: | SC34 and Liaisons |
REFER TO: | N0497 - 2004-04-07 - Analysis of TMRM Use Cases |
REPLY TO: |
Dr. James David Mason (ISO/IEC JTC 1/SC 34 Secretariat - Standards Council of Canada) Crane Softwrights Ltd. Box 266, Kars, ON K0A-2E0 CANADA Telephone: +1 613 489-0999 Facsimile: +1 613 489-0995 Network: jtc1sc34@scc.ca http://www.jtc1sc34.org |
0 | Introduction |
The editors of the Topic Maps Reference Model thank Steve Pepper and Lars Marius Garshol for their careful analysis of N490 ("Topic Maps -- Reference Model Use Cases") that appears in N0497 ("Analysis of RM Use Cases") (sic). Their analysis illustrates the need for the Topic Maps Reference Model more compellingly than our own efforts. The Pepper/Garshol reading of the TMRM (Topic Maps Reference Model) use cases as procedural questions, rather than as examples of situations requiring declarative disclosures, highlights the insufficiency of the TMDM (Topic Maps Data Model) and its related syntax/procedure based components in addressing the central unanswered question in Topic Maps. That central unanswered question is: How does a topic map declare its approach to the problem of reflecting the "territory" of which it is a "map"? More specifically: How does a topic map declare exactly how its topics identify their subjects?
No topic maps standard has yet been adopted that provides any generalized means for specifying arbitrary bases for the recognition of subject identity. This omission has been masked by the Topic Maps community's invention, on an ad hoc basis, of whatever approaches were needed for subject identity recognition, and its use of such ad hoc approaches to inform the designs of any subject-oriented processes that are implemented in its software. (Such an ad hoc process can, for example, treat the contents of certain <resourceData> elements as somehow contributing to the recognition of the identity of a topic's subject, while ignoring the contents of other, equally eligible-appearing <resourceData> elements. This particular approach is one of those suggested by Pepper/Garshol; we discuss it in 1 below.) In the absence of clear guidance from the adopted standards, the ad hoc approach that the community took was appropriate, reasonable, necessary, wise, and helpful to the cause of Topic Maps.
But these historical facts do not make the omission of a standard way to disclose the basis for subject identity from the revision of ISO 13250 a virtue, nor do they make a virtue of allowing XTM, TMDM, TMCL, and TMQL to fail to provide for disclosing subject identity. It is in the best interests of the topic maps community and our users that these omissions be corrected in any future edition of 13250.
We treat only one of the Pepper/Garshol use case responses below. The responses to the other use cases missed the point of the Topic Maps Reference Model in the same way: the responses describe procedures whose undeniable effectiveness masks a lack of explicit underlying semantic declarations, a lack which leaves the ways in which the topic map reflects the mapped territory unstated and ambiguous.
1 | UC2: Specifying Properties of Topics |
In Use Case 2, the US Geological Survey wishes to base subject identity on geographic coordinates, including the ability to say that locations within a range are the same subject. The question raised by this use case was how to convey that basis for subject identity, in the absence of the TMRM.
In response to that question, Pepper/Garshol suggest the following:
Latitude and longitude are most appropriately modelled as internal occurrences of a topic, thus:

    <topic id="tokyo">
      <baseName>
        <baseNameString>Tokyo</baseNameString>
      </baseName>
      <occurrence>
        <instanceOf>
          <topicRef xlink:href="#latitude"/>
        </instanceOf>
        <resourceData>35 40 N</resourceData>
      </occurrence>
      <occurrence>
        <instanceOf>
          <topicRef xlink:href="#longitude"/>
        </instanceOf>
        <resourceData>139 45 E</resourceData>
      </occurrence>
    </topic>
The above suggestion does not answer the question that was being posed by the use case. Latitude and longitude can certainly be specified as internal occurrences of a topic (we withhold comment on the consistency of such a strategy with our understanding of the semantics of occurrences), but no mere specification of latitude and longitude can, by itself, establish a doctrine for understanding the subject of any topic. As an illustration, consider a very similar topic map snippet:
    <topic id="japan">
      <baseName>
        <baseNameString>Japan</baseNameString>
      </baseName>
      <occurrence>
        <instanceOf>
          <topicRef xlink:href="#population"/>
        </instanceOf>
        <resourceData>127214499</resourceData>
      </occurrence>
      <occurrence>
        <instanceOf>
          <topicRef xlink:href="#lifeExpectancy"/>
        </instanceOf>
        <resourceData>80.93</resourceData>
      </occurrence>
    </topic>
Both the above example and the Pepper/Garshol example have strongly-typed occurrences that are raw alphanumeric data. What determines the identity of the subject of the topic in either of these examples? While a plausible argument could be made that the information contained in the <resourceData> elements of the Pepper/Garshol example always specifies subject identity, the same argument is much less plausible for the second example. So how is a recipient of either topic map supposed to know when the information contained in <resourceData> elements specifies subject identity, and when it doesn't?
The answer does not lie in the syntax of either example, nor is the answer found in the TMDM, nor in any querying or constraints. The answer is prior to, and must inform the design of, any process in which subject identity is important. Subject-oriented querying, merging, validation, or transformation processes must start either with undisclosed assumptions about what constitutes subject identity, or with disclosed bases for subject identity. A process that starts with unstated assumptions may give the correct result, but only as long as the unstated assumptions hold true. Moreover, no new process can be created in the absence of knowledge of the applicable bases of subject identification. To say to oneself, "I always model longitude and latitude as internal occurrences of topics that are geographic locations" is insufficient as a basis for information interchange. Neither XTM syntax nor the proposed TMDM/TMQL/TMCL provides a means for disclosing the information required in order to interchange topic maps that will behave predictably when undergoing subject-oriented processing.
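The difference between an undisclosed assumption and a disclosed basis for subject identity can be sketched in code. The Python fragment below is an illustration only, under stated assumptions: the `IDENTITY_RULES` dictionary, the `identity_key` function, and the dictionary layout of the two topics are hypothetical and are not TMRM nomenclature or part of any Topic Maps standard. It shows a process that consults an explicit declaration of which occurrence types bear on subject identity, instead of inferring that from the syntax.

```python
# Hypothetical disclosure (an assumption for illustration only): the named
# occurrence types, taken together, identify the subject of a topic.
IDENTITY_RULES = {
    "geographic-location": ("latitude", "longitude"),
}

# Simplified stand-ins for the two XTM snippets discussed above.
tokyo = {
    "id": "tokyo",
    "occurrences": {"latitude": "35 40 N", "longitude": "139 45 E"},
}
japan = {
    "id": "japan",
    "occurrences": {"population": "127214499", "lifeExpectancy": "80.93"},
}


def identity_key(topic, rules):
    """Return (rule name, values) for the first disclosed rule that the
    topic satisfies, or None if no disclosed rule applies."""
    occurrences = topic["occurrences"]
    for rule_name, occurrence_types in rules.items():
        if all(t in occurrences for t in occurrence_types):
            values = tuple(occurrences[t] for t in occurrence_types)
            return (rule_name, values)
    # No disclosed basis: the occurrences are data about the subject,
    # not a statement of which subject it is.
    return None


tokyo_key = identity_key(tokyo, IDENTITY_RULES)
japan_key = identity_key(japan, IDENTITY_RULES)
```

Under this assumed disclosure, `tokyo_key` carries the coordinates while `japan_key` is `None`. The two snippets are syntactically alike, and only the declaration tells a recipient that they differ with respect to subject identity.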
Above, we considered a semantic variation on the Pepper/Garshol solution. Now let's consider a syntactic variation:
    <topic id="tokyo">
      <baseName>
        <baseNameString>Tokyo</baseNameString>
      </baseName>
      <baseName>
        <scope>
          <topicRef xlink:href="#geographicPointName"/>
        </scope>
        <baseNameString>35 40 N / 139 45 E</baseNameString>
      </baseName>
    </topic>
Both the above example and the original Pepper/Garshol example can be used to interchange information about Tokyo, and both can be understood in such a way as to make the geographic coordinates that they both specify the basis on which their common subject is identified. However, in the absence of declarations of their respective bases of subject identity, both examples are ambiguous. Neither of them can be reliably and predictably understood as representing exactly the same subject as that of the other topic, or, for that matter, of any other topic.
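To make the point concrete, here is a minimal Python sketch of what a pair of declarations could enable. Everything in it is an assumption made for illustration: the function names, the dictionary layout of the topics, and the tolerance value are hypothetical, and the sketch does not represent the TMRM's own mechanism. Each convention gets its own declared way of locating the identifying coordinates, and a declared rule (echoing the use case's "locations within a range") decides whether two topics share a subject.

```python
def parse_coordinate(text):
    """Convert 'DD MM H' (for example '35 40 N') to signed decimal degrees."""
    degrees, minutes, hemisphere = text.split()
    value = int(degrees) + int(minutes) / 60
    return -value if hemisphere in ("S", "W") else value


# Hypothetical disclosure for the occurrence-based convention.
def from_occurrences(topic):
    occurrences = topic["occurrences"]
    return (parse_coordinate(occurrences["latitude"]),
            parse_coordinate(occurrences["longitude"]))


# Hypothetical disclosure for the scoped-baseName convention.
def from_scoped_name(topic):
    name = topic["names"]["geographicPointName"]
    lat_text, lon_text = (part.strip() for part in name.split("/"))
    return parse_coordinate(lat_text), parse_coordinate(lon_text)


def same_subject(point_a, point_b, tolerance_degrees=0.01):
    """Assumed rule: points within the tolerance are the same subject."""
    return all(abs(a - b) <= tolerance_degrees
               for a, b in zip(point_a, point_b))


# Simplified stand-ins for the two XTM variants of the Tokyo topic.
tokyo_as_occurrences = {
    "occurrences": {"latitude": "35 40 N", "longitude": "139 45 E"},
}
tokyo_as_scoped_name = {
    "names": {"geographicPointName": "35 40 N / 139 45 E"},
}

point_a = from_occurrences(tokyo_as_occurrences)
point_b = from_scoped_name(tokyo_as_scoped_name)
topics_match = same_subject(point_a, point_b)
```

Without the two extraction functions and the comparison rule, nothing in either data structure says that the coordinates identify the subject, or that the two topics should be merged.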
Finally, let's consider the information conveyed by the examples in tabular form:
element ID | baseName | latitude | longitude | population | life expectancy |
---|---|---|---|---|---|
tokyo | Tokyo | 35 40 N | 139 45 E | | |
japan | Japan | | | 127214499 | 80.93 |
The above table is intended to represent a relational database containing the same information as the XTM snippets already shown. We presume that few, if any, would argue that it is not useful or appropriate to be able to regard such a database as a topic map. In order to view such a database as a topic map, all that is really necessary is to explicitly disclose, somewhere, somehow, exactly how it can be seen as a topic map. The questions that such a disclosure must answer include:
What are the subjects that should be regarded as being the subjects of topics?
How should the resulting topics specify their subjects?
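As a rough illustration, answers to these two questions for the table above could be written down as data and applied mechanically. The Python sketch below rests on assumptions throughout: the `DISCLOSURE` dictionary, its keys, and the `row_to_topic` function are hypothetical names invented here, not TMRM nomenclature, and the choice of latitude and longitude as the identifying columns is only one possible answer.

```python
# The table above, as rows of a relational database (None marks an empty cell).
rows = [
    {"element_id": "tokyo", "baseName": "Tokyo",
     "latitude": "35 40 N", "longitude": "139 45 E",
     "population": None, "life_expectancy": None},
    {"element_id": "japan", "baseName": "Japan",
     "latitude": None, "longitude": None,
     "population": "127214499", "life_expectancy": "80.93"},
]

# Hypothetical disclosure (an assumption for illustration only).
DISCLOSURE = {
    # Question 1: what are the subjects? Here, one subject per row.
    "subject_per": "row",
    # Question 2: how do the topics specify their subjects?
    "identity_columns": ("latitude", "longitude"),
    "name_column": "baseName",
}


def row_to_topic(row, disclosure):
    """View one row as a topic according to the disclosure."""
    identity_columns = disclosure["identity_columns"]
    name_column = disclosure["name_column"]
    values = tuple(row[c] for c in identity_columns)
    # The identity is known only if every identifying column is filled in.
    identity = values if all(v is not None for v in values) else None
    properties = {
        column: value for column, value in row.items()
        if column not in identity_columns
        and column != name_column
        and value is not None
    }
    return {"name": row[name_column], "identity": identity,
            "properties": properties}


topics = [row_to_topic(row, DISCLOSURE) for row in rows]
```

Under this particular disclosure the Tokyo row yields a topic with a declared identity, while the Japan row yields a topic whose identity is `None`: the disclosure, as written, does not say how that subject is identified, and a fuller disclosure would have to.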
The TMRM provides essential guidance in how to answer the above questions, so that ways of regarding an information resource as a topic map can be disclosed and known. Rather than enumerating and exemplifying RDF, KIF, LTM, OSL, various XML vocabularies, and other information representations that may usefully be viewed as if they were topic maps, we instead invite the reader to contemplate:
how useful it will be to interchange disclosures of arbitrary ways of viewing any information resource as a topic map,
how important it is to make an international standard that provides a nomenclature, such as the TMRM's proposed nomenclature, for making such disclosures, and
whether it makes sense to allow any other part of the ISO Topic Maps standard to fail to be explained using the nomenclature provided by that same standard.
2 | Additional Remarks |
2.1 | The TMDM works, but we need to know why |
Skillful use of a tool, such as the Topic Maps Data Model, does not necessarily imply an understanding of why a tool works. It is not possible to choose an appropriate tool, or to decide how to use it (save by chance or mimicry) unless one understands why the tool works as it does. For example, a rake is a tool that does a good job of gathering leaves, but it is less ideally suited for gathering school children for a trip to the zoo -- even though a rake can be used for such a purpose, if it is handled skillfully enough. If the children's parents object to the use of a rake, however, their concerns are unlikely to be mitigated by assurances that "it works", even if they trust that the rake will be handled by a person known to be very skillful with rakes.
In general, the characteristics of a tool that make it usable in one context (the reasons why, for example, a rake is useful for gathering leaves), are not necessarily the same characteristics that make it useful in another context (such as the reasons why a rake may be useful for gathering children). Again, it is important to know why a tool may be useful for a purpose. Even if we have a procedure for using a rake to gather children for a trip to the zoo, and the procedure demonstrably works, that doesn't mean that we know very much about rakes, or about children, either. We can't say whether the same procedure could gather children using a shovel, or what advantages a shovel would offer when gathering leaves.
2.2 | Knowing the queries that work is insufficient |
If we know some effective procedures for querying a topic map, it's quite true that we may, by analyzing the query, be able to deduce some of the ways in which the topic map reflects the mapped territory. But even if we can make such deductions by analyzing procedures that are demonstrably effective, it is far preferable to know the underlying design of the topic map explicitly. It's better not only because the maintainers of the topic map can be held to their own standard, but also because such declarative explicitness leaves open the possibility of using the same topic map for new, unforeseen purposes: it provides a principled basis for the invention of new procedures. Such explicitness is valuable and necessary even in cases where the same person is both the maintainer of a topic map and its only user; it provides a principled basis on which both kinds of tasks -- both maintenance and use -- can evolve and improve, in explicit and testable harmony with each other.
2.3 | The scope of ISO/IEC 13250 |
As the TMRM demonstrates, the question of subject identity is not a simple one, nor is the question applicable only to a particular syntax, or to a particular data model representation of a syntax. Different users have different notions of subject identity; this was recognized in the original text of ISO 13250, in the definition of subjects as: "...any things whatsoever, regardless of whether they exist or have any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever." The variety of bases for subject identity is potentially boundless. We attempted to dramatize that boundlessness in our use case document.
The Topic Maps paradigm must embody a practical, general approach to the problem of making the bases for specifying and understanding the identities of subjects known. In order for the Topic Maps standard to reflect the paradigm, the standard must establish, separate and apart from any syntax or data model, a means by which users can declare the bases on which subject identity is determined in their topic map syntaxes, their data models, and their topic maps.
2.4 | Developers need deep knowledge, but users often don't |
Most users will not need to master the complexities of the assertion model that forms part of the TMRM. Consider the complexities of regular expressions as a parallel case: every designer of a language that has regular expression capabilities has to understand the complexities of regular expression theory. Every developer who writes software based on such a language has to understand the capabilities and limits of the regular expression language that it provides. However, users need not be exposed to any of these complexities; they may only need to know that they can use "Ctrl+F" to perform string searches on their documents. Yet even though the user does not have to understand what happens when "Ctrl+F" is pressed, the user depends upon a stack of logical layers, at the bottom of which is an underlying paradigm or abstract model, and at the top of which is the software implementation that is actually being used to serve the needs of users. Those who participate in the construction of implementations, and of the logical layers that undergird them, must understand how the lower layers inform the designs of the upper ones. (And, of course, the lower layers must exist, and they must actually inform all the layers above them.)
It is not necessary for users to understand exactly how the Topic Maps standard provides topic map designers and software developers with the ability to unambiguously understand each other's approaches to the problem of mapping subject territories. But if we want the Topic Maps standard to be adopted widely, it is essential for the standard to provide that ability. Users who are interested in protecting their investments in topic maps expect their adherence to an international standard to afford them some protection. Users understand the dangers of making investments in information whose underlying model is not completely explicit, or that is only explicit in non-standard terms.
2.5 | TMRM does not constrain XTM or TMDM/TMQL/TMCL |
The TMRM allows authors of topic maps to use occurrences, resourceData, associations and other syntax/information items to mean many different things, just as diverse users of Topic Maps are already doing. The TMRM seeks to provide the basis for developing a means to allow such authors to say (in a standard way) what their various usages of the syntactic constructs and information items mean, i.e., how to recognize the identities of the subjects of their topics. The TMRM only facilitates disclosure. It does not constrain the models and mapping approaches that can be disclosed.
The TMRM is a valuable tool for creating consensus around a deep common understanding of a syntax or data model. The work of systematically disclosing, in conformance with the TMRM, how instances of a syntax or data model are expected to be viewed as topic maps, can highlight ambiguities and misunderstandings. Such a systematic effort requires consideration of the consequences of each possible approach to the mapping problem.
All geographic maps have legends that disclose how each symbol that appears in them reveals something about something in the mapped territory. Every topic map needs a legend, too: an indication of how its symbols correspond to the subjects in the mapped subject-territory. The TMRM provides a way of disclosing the legends of topic maps, but it imposes no constraints on the symbols that the legend can define, or their definitions.
3 | Conclusion |
3.1 | TMDM and TMRM do not compete |
The notion that the TMRM "competes" with the TMDM is illusory. The TMDM and TMRM are as different as apples and deoxyribonucleic acid (DNA). The proposed TMDM is a data model based on a particular syntax. By contrast, the TMRM is a proposed nomenclature by which any syntax, data model or topic map author can disclose, in a "standard way" (SC34 is a standards body), how subject identification is done when using a particular syntax or data model, or even within a particular topic map.
3.2 | A topic map is... |
The question that led to WG3's development of the TMRM arose when 13250 incorporated the XTM syntax, in addition to the original HyTM syntax. At that moment, Topic Maps began to have multiple standard syntaxes, and it became necessary to have an answer to the question: What is it that makes instances of HyTM and XTM -- and, by extension, LTM and other notations -- "Topic Maps"? The most salient characteristic that all topic map syntaxes and models share is that they are all designed to facilitate achievement of the "one location per subject" goal. This is not a coincidence. The impetus for the invention of the paradigm was the problem of merging diverse independently-maintained indexes, and the invention of the paradigm was simultaneous with the coining of its "Topic Maps" moniker.
3.3 | Predictability requires explicitness |
The achievement of the outward sign of being a "topic map" -- having one location per subject -- implicitly depends on the answer to a more difficult question: On what basis are subjects distinguished from, or found to be the same as, other subjects? It is true that a particular syntax or data model may gather pieces of information about each subject in a single syntactic construct or in-memory object that is, in some sense, dedicated to a single subject. The syntax may be contrived in such a way as to guide its human users to express their topic maps with a degree of predictability and consistency that, at least in some circumstances and for some purposes, is "good enough". But in the absence of explicit knowledge of the bases on which the identities of subjects will be consistently determined, there can be no certainty in the interchange and processing of topic maps, even if they all use the same interchange syntax and are processed in systems that use the same data model.
3.4 | Let's maximize the attractiveness of topic maps as investments |
The means whereby the identities of subjects are discriminated must be declarable separately from any syntax or data model. To insist that a topic map's doctrines of subject identification are inseparable from the representation of that topic map in some specific notation or data model is to attempt to create an obstacle for topic map owners who wish to exploit their assets outside the contexts of systems designed around that syntax or data model. Such insistence does not serve the interests of topic map owners or users. If we want the Topic Maps standard to be widely adopted, it must encourage investments in topic maps. Adopters of the standard must be rewarded by enhanced exploitability of their investments in their information assets. We can reasonably expect the rate at which the Topic Maps standard is adopted to be a function of the degree to which adoption adds value to investments in information assets.
3.5 | TMRM disclosures: catalysts for information resources |
Many interchange syntaxes, data models, and database schemas have already been designed without any knowledge of Topic Maps, but, at least implicitly, with specific and consistently-applied ideas about subject identification in mind. Many of these syntaxes, models, and schemas have been adopted by significant shares of significant markets. So the question "How can ideas about subject identification be expressed independently of any syntax or data model?" is far from being merely academic; it goes to the heart of today's largest business cases for Topic Maps.
3.6 | Please let the sound and fury signify something |
The TMRM is essential to the Topic Maps standard because it provides answers to the most essential questions about the Topic Maps paradigm. If the Topic Maps standard cannot provide convincing answers to these questions, the standard cannot be truthfully defended against the charge that Topic Maps are "full of sound and fury, signifying nothing."