Unique identifiers – synonyms and homonyms

It is widely recognized that there is a difference between things (concepts, aspects, relations, etc.) and their various denotations, such as terms or names, abbreviations and codes. In practice, database systems and data exchange messages use different terms to denote the same thing (synonyms) and the same term to denote different things (homonyms). This causes interpretation problems and hampers both the interoperability of systems and data integration. These problems can be solved by using a formalization of natural language such as Gellish, in which the ‘things’ are represented by language-independent unique identifiers (UIDs). This allows different parties to use multiple different names for the same things. This article discusses the advantages of this solution over other approaches.

Names are unique identifiers of ‘things’ only within a limited context. In a wider business context that spans the systems and databases of different parties, names are not unambiguous denotations of things. Therefore, several solutions have been proposed for the creation of unique identifiers (UIDs).

Forced unique names

To address this issue, many systems constrain their terminology by disallowing homonyms and by prescribing particular (artificial) terms that distinguish the various meanings. For example, instead of using the term ‘building’ to denote two different concepts, they may prescribe ‘building (activity)’ as distinct from ‘building (object)’. Furthermore, systems often do not allow synonyms, so that only one particular term may be used to denote a concept. Both measures address the issue only within a small, closed community: in a wider context different people will make different choices about the allowed terminology, which hampers interoperability.

Random unique numbers

A second solution is to use generators that create random UIDs from a very large domain, so that the probability of generating identical numbers is very small. This is usually considered sufficient reason to assume that the generated UIDs are universally unique. It makes it possible to use UIDs without a central authority that manages the uniqueness of the identifiers. This is the basis on which randomly generated (probably) unique identifiers are used to uniquely denote things. Examples of such systems are various versions of the Universally Unique Identifier (UUID) as standardized in ISO/IEC 11578:1996 (http://en.wikipedia.org/wiki/Universally_unique_identifier) and the Globally Unique Identifier (GUID) (http://en.wikipedia.org/wiki/Globally_unique_identifier). This solves the issue of homonyms in a general context, but it does not solve the issue of synonyms, because a random number generating system allows multiple parties to create different UUIDs for the same thing. Separate statements are then required to specify that such UUIDs are synonymous. As synonymity is one of the issues that should be solved, UUIDs are only suitable for situations where synonymity plays no role, such as unique product coding.
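The limitation can be illustrated with a minimal Python sketch based on the standard-library `uuid` module. The two parties, the term ‘edifice’ and the `same_as` set are hypothetical and serve only to show that independently generated UUIDs need an explicit synonymy statement.

```python
import uuid

# Two independent parties each mint a random (version 4) UUID
# for what is in fact the same concept.
party_a = {"building": uuid.uuid4()}
party_b = {"edifice": uuid.uuid4()}  # hypothetical second term

# The identifiers differ (with overwhelming probability), so nothing
# tells a computer that they denote the same thing.
ids_differ = party_a["building"] != party_b["edifice"]

# A separate, explicit statement is needed to record the synonymy.
same_as = {frozenset((party_a["building"], party_b["edifice"]))}


def denote_same_thing(x: uuid.UUID, y: uuid.UUID) -> bool:
    """True if x and y are identical or explicitly declared synonymous."""
    return x == y or frozenset((x, y)) in same_as
```

Without the entry in `same_as`, the two UUIDs would be treated as denoting two unrelated things.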

Namespaces

An approach for a wider community is to specify a unique identifier that consists of a combination of a ‘namespace’ and a name, where a name must always be accompanied by its namespace. A constraint on a namespace is that all names within it are unique (http://en.wikipedia.org/wiki/Namespace). Only then does a combination of namespace and name uniquely denote something. For example, the namespace ‘architecture’ may contain the term ‘building’ and the namespace ‘activity’ may also contain the term ‘building’. However, namespaces do not prevent different namespaces from using different names for the same thing without specifying that those denotations are synonymous. Nor can a computer determine whether ‘architecture building’ is the same thing as ‘activity building’, although in this example the namespace names suggest to humans that they are different. To address synonymity, explicit relations are required which specify that such different ‘unique identifiers’ denote the same thing.
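A minimal Python sketch of the namespace approach follows. The `QualifiedName` type, the ‘real estate’ namespace and the term ‘edifice’ are assumptions made for illustration; only the ‘architecture’ and ‘activity’ namespaces come from the example above.

```python
from typing import NamedTuple


class QualifiedName(NamedTuple):
    """A name that is always accompanied by its namespace."""
    namespace: str
    name: str


arch_building = QualifiedName("architecture", "building")
act_building = QualifiedName("activity", "building")
estate_edifice = QualifiedName("real estate", "edifice")  # hypothetical

# The homonym is resolved: the two qualified names are distinct identifiers.
homonyms_distinct = arch_building != act_building

# Synonyms are not resolved: an explicit relation must state that two
# qualified names denote the same thing.
declared_synonyms = {frozenset((arch_building, estate_edifice))}


def is_same_thing(a: QualifiedName, b: QualifiedName) -> bool:
    """True only if a and b are identical or explicitly declared synonymous."""
    return a == b or frozenset((a, b)) in declared_synonyms
```

The comparison only tells a computer that two identifiers differ as strings. Whether they denote the same thing still depends entirely on the explicitly declared relations.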

Gellish UIDs

In the approaches with random unique numbers and with namespaces there is no single string that uniquely represents the thing itself (the thing ‘an sich’). Gellish explicitly distinguishes between a unique representation of things and the various terms that denote those things. Each represented thing is uniquely represented in the (whole) language by its own Gellish UID (an arbitrary string that is a decimal natural number), whereas the various associated terms are only meant for human users. The terms enable users to find the intended UIDs. The fact that Gellish English is a formalized natural language implies that its terminology (dictionary) and its UIDs are managed. The ranges for UIDs include a range for unknowns (in queries), and prefixes are used for UIDs that are allocated by other parties. This means that UIDs can be allocated to represent things and that multiple names (synonyms, including translations, abbreviations, codes, etc.) can be specified to denote the represented things via their UIDs.
This implies that relations are defined as relations between UIDs and not as relations between the denoting terms!
This solves the synonyms issue (for computers), because all synonymous terms denote the same UID. It also solves the homonyms issue, because homonyms denote different UIDs.
Note that external usage of Gellish UIDs (e.g. in RDF expressions) is possible by using ‘Gellish’ as a ‘namespace’. For example, Gellish:40018 would refer to the concept ‘building’ in the language community ‘building technology’. A direct URI reference, for example to concept 730000 and to the term ‘anything’, would be http://www.formalenglish.net/dictionary#730000 or http://www.formalenglish.net/dictionary#anything, where the former is unambiguous and the latter may find homonyms, if available.
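The separation between UIDs and terms can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual Gellish dictionary format: the `Naming` record, the synonym ‘edifice’, the Dutch entry, the UID 999001 and the community names other than ‘building technology’ are invented; only the UIDs 40018 and 730000 and the URI pattern are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Naming:
    """One dictionary entry: a term that denotes a UID in a language community."""
    term: str
    language: str
    community: str
    uid: int


DICTIONARY = [
    Naming("building", "English", "building technology", 40018),  # UID from the text
    Naming("edifice", "English", "building technology", 40018),   # hypothetical synonym
    Naming("gebouw", "Dutch", "building technology", 40018),      # hypothetical translation
    Naming("building", "English", "activities", 999001),          # hypothetical homonym, invented UID
    Naming("anything", "English", "Gellish", 730000),             # UID from the text, community assumed
]


def uids_for(term: str) -> set[int]:
    """All UIDs a term may denote; more than one means the term is a homonym."""
    return {n.uid for n in DICTIONARY if n.term == term}


def terms_for(uid: int) -> set[str]:
    """All terms (synonyms, translations) that denote one UID."""
    return {n.term for n in DICTIONARY if n.uid == uid}


def uid_uri(uid: int, base: str = "http://www.formalenglish.net/dictionary") -> str:
    """Build a direct URI reference following the pattern quoted above."""
    return f"{base}#{uid}"
```

In this sketch `uids_for("building")` yields two UIDs, which exposes the homonym, while `terms_for(40018)` yields every term that denotes the one concept, so synonyms need no separate equivalence statements.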

The question remains how users can find the right UIDs, because they always denote things by names. This is solved by two mechanisms:
1. For each term, systems can display the ‘language communities’ within which the term uniquely denotes something. Those language communities specify the ‘home base’ of the term. This means that a term is the preferred term to denote the thing (UID) in its specified language community.
2. Systems can display the supertypes of concepts and the classifiers of individual things, because these are normally different for homonyms. In Gellish, each concept (kind) is defined by a relation to its supertype concept(s) and each individual thing has one or more classification relations with classifying kinds.
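The two mechanisms can be sketched as a lookup that lists, for each candidate UID, its language community and its supertype, so that a user can pick the intended meaning. The data layout is an assumption, and every UID and supertype name except 40018 is invented for illustration.

```python
# (term, language community, UID); only 40018 comes from the text.
NAMES = [
    ("building", "building technology", 40018),
    ("building", "activities", 999001),  # invented UID and community
]

# Specialization relations between UIDs: subtype UID -> supertype UID (invented).
SUPERTYPE = {40018: 999100, 999001: 999200}

# Preferred terms used to display the supertypes (invented).
PREFERRED_TERM = {999100: "physical object", 999200: "activity"}


def candidates(term: str) -> list[tuple[int, str, str]]:
    """For each UID the term may denote, return (UID, community, supertype name)."""
    return [
        (uid, community, PREFERRED_TERM[SUPERTYPE[uid]])
        for name, community, uid in NAMES
        if name == term
    ]


for uid, community, supertype in candidates("building"):
    print(f"{uid}: 'building' in '{community}', a kind of {supertype}")
```

A user who searches for ‘building’ is shown both candidates and can choose the UID whose community and supertype match the intended meaning.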

The Gellish UIDs are language independent, which means that expressions (relations) are independent of any natural language. This enables systems to present expressions in various languages and thus to provide automatic translation between languages, provided that dictionaries in those languages are available.
In a multilingual environment the language in use acts as a second kind of namespace for the terms, because the Gellish dictionary specifies for each term the language to which it belongs.
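A minimal sketch of this idea stores an expression as UIDs only and renders it with the terms of a chosen language. All UIDs except 40018, the relation phrases and the Dutch terms are assumptions made for illustration and are not taken from the Gellish dictionary.

```python
# An expression stored purely as UIDs: (left UID, relation type UID, right UID).
FACT = (40018, 999300, 999100)  # 999300 and 999100 are invented UIDs

# Terms keyed by (language, UID): the language acts as a namespace for the terms.
TERMS = {
    ("English", 40018): "building",
    ("Dutch", 40018): "gebouw",
    ("English", 999300): "is a kind of",
    ("Dutch", 999300): "is een soort",
    ("English", 999100): "physical object",
    ("Dutch", 999100): "fysiek object",
}


def render(fact: tuple[int, int, int], language: str) -> str:
    """Present a UID-based expression using the terms of one language."""
    return " ".join(TERMS[(language, uid)] for uid in fact)


print(render(FACT, "English"))
print(render(FACT, "Dutch"))
```

The stored expression never changes. Only the lookup key changes, which is why adding a dictionary for another language is enough to present existing expressions in that language.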

Gellish.net