Data types not needed

Data types are not needed

I would like to defend the statement: Data types are not needed in semantic databases, provided that the meaning of each concept is unambiguously captured in its unique identifier.

In natural languages we use just strings of characters. So why would we use data types when storing data in computers?

We can certainly infer a classification of character strings in natural languages. For example, depending on the context in which they are used, we can distinguish numeric strings (numbers), such as decimal numbers, whole numbers, rational numbers and negative numbers, as well as text strings, dates, etc. However, in the end they are all represented as just strings of characters. There are even numeric strings whose context indicates that they should not be interpreted as numbers, for example country codes: according to ISO 3166-1, the code 840 denotes the United States of America.
Apparently the same character string can have different meanings, depending on its context. Data types in computer systems appear to be a method to cope with such contextual information. However, if the differences in meaning between the concepts are already captured in their UIDs, then the code 840 and the number 840 have different UIDs, so that there is no need to use data types to distinguish those concepts.

In information technology, basic data types are created to specify different (binary) encoding systems or notations, for example integer, real and character (string) encodings. Such data types are basically syntactic concepts and not semantic concepts, because the complete meaning can be conveyed by using character strings only. In programming languages and query languages, these data types determine which operations are allowed, whereas in databases they are used, for example, to verify that only allowed values are entered. The concept ‘twelve’ and the concept ‘United States’ can each be encoded in various ways in different binary encoding systems. But how does a computer know that twelve denotes the same as 12 and the same as dozen, and that 840 denotes the same as United States, but not the same as the number 840? That does not follow from the encoding or data type (the syntax). It follows from the meaning (the semantics) of the uniquely identified concepts that are denoted by those character strings, their aliases (synonyms, abbreviations, codes, translations, etc.) and their different binary encodings (e.g. in ASCII or Unicode, or integer*8, etc.).

Semantic modeling is a methodology that takes a meaning or concept as its starting point and represents that concept by a unique identifier (UID). Furthermore, semantic modeling makes a strict distinction between a concept with one particular meaning and its various denotations. According to that methodology, the vocabulary of a (taxonomic) dictionary that is part of a language definition specifies how such a concept can be denoted. This allows a concept to be denoted in various ways, using aliases (as in natural languages), translations and possibly various coding systems. Semantic modeling expresses information as relations between concepts, not as relations between denotations or terms. In Gellish, each concept (or individual thing) is therefore represented by a language-wide unique identifier (UID), whereas the various denotations (being terms and phrases, as in natural languages) remain just character strings. Only the combination of a language, a language community and a name forms a unique denotation of a concept. Thus within one language community there are no homonyms, but across language communities, and thus within a language, homonyms are allowed.

In Gellish there are, for practical reasons, special rules for the UIDs of numbers, dates and date-time periods. The UIDs for numbers of any kind start with the prefix hash (#), and the UIDs for date-times start with the prefix dd followed by a colon (:). Numbers are by definition qualitative subtypes of the concept ‘number’. Their UIDs express a meaning as a decimal whole number, the mantissa, that gives the significant digits of the number, possibly followed by the character E and a positive or negative exponent.
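The number rule can be illustrated with a small Python sketch. It is only a reading of the rule as stated here, not an official Gellish parser: the function name number_from_uid is hypothetical, and allowing an optional sign on the mantissa is an assumption.

```python
import re
from decimal import Decimal

# Assumed shape of a number UID: '#', a whole-number mantissa (an optional
# sign is an assumption), optionally followed by 'E' and a signed exponent.
_NUMBER_UID = re.compile(r"#([+-]?\d+)(?:E([+-]?\d+))?")


def number_from_uid(uid: str) -> Decimal:
    """Return the numeric value that a number UID such as '#12' stands for."""
    match = _NUMBER_UID.fullmatch(uid)
    if match is None:
        raise ValueError(f"not a number UID: {uid!r}")
    mantissa, exponent = match.group(1), match.group(2)
    # Shift the mantissa by the power of ten given in the exponent (0 if absent).
    return Decimal(int(mantissa)).scaleb(int(exponent or 0))
```

Under these assumptions, number_from_uid("#12") yields the value twelve and number_from_uid("#15E-1") yields one and a half, while the bare string "840" is rejected because it lacks the hash prefix.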
For example, in Gellish the concept with UID #12 is a qualitative subtype (a value) of the concept ‘number’. That UID #12 is denoted in Gellish English by the synonymous character strings ‘twelve’, ‘12’ and ‘dozen’ (and by “twaalf” in Dutch), etc. The taxonomic dictionary also specifies that a concept with UID 2700347 is classified by the concept ‘country’ (UID 700011) and is denoted as “840” and also as “United States”, whereas “840” has ISO 3166-1 (and not “decimal system”) as its language community context. Note that the different denotations of the same concept (such as “12” and “twelve”) do not differ in meaning, but denote the same concept.
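A minimal Python sketch of such a vocabulary lookup may make this concrete. The UIDs and the community names “decimal system” and “ISO 3166-1” are taken from the examples above. The table layout, the function name uid_of, the community name “general” and the choice of “English” as the language for the ISO code are assumptions made for illustration only.

```python
# Vocabulary: (language, language community, term) -> UID of the concept.
# Keying on all three parts is what lets "840" be a homonym across communities.
VOCABULARY = {
    ("English", "decimal system", "12"): "#12",
    ("English", "general", "twelve"): "#12",        # "general" is hypothetical
    ("English", "general", "dozen"): "#12",
    ("Dutch", "general", "twaalf"): "#12",
    ("English", "decimal system", "840"): "#840",   # the number 840
    ("English", "ISO 3166-1", "840"): "2700347",    # the country code 840
    ("English", "general", "United States"): "2700347",
}


def uid_of(language: str, community: str, term: str) -> str:
    """Resolve a denotation to the UID of the concept it denotes."""
    try:
        return VOCABULARY[(language, community, term)]
    except KeyError:
        raise KeyError(f"unknown denotation: {(language, community, term)!r}") from None


# Synonyms resolve to the same concept; homonyms resolve to different ones.
same_concept = uid_of("English", "general", "twelve") == uid_of("Dutch", "general", "twaalf")
different_concepts = (
    uid_of("English", "decimal system", "840") != uid_of("English", "ISO 3166-1", "840")
)
```

Both same_concept and different_concepts evaluate to True, and no data type is consulted anywhere: every term is an ordinary character string.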
Date-times are individual time periods that are classified as subtypes of period in time. Their UIDs shall conform to ISO 8601 (2022) (https://en.wikipedia.org/wiki/ISO_8601) and shall be character strings that follow the Gregorian calendar date and time conventions as follows: yyyymmddhhmmss.x, where x stands for a decimal fraction of a second, or any leading part thereof.
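As an illustration, the following Python sketch checks whether a string has the shape of such a date-time UID and splits it into its parts. It assumes that the prefix dd: is written directly in front of the digits and that a leading part always ends on a complete field (year, month, day, and so on). The function name parse_datetime_uid is hypothetical, and the sketch checks only the shape, not whether, say, the month lies between 01 and 12.

```python
import re

# 'dd:' followed by yyyy, then optionally mm, dd, hh, mm, ss and a fraction.
# Each field may only appear if all fields before it are present.
_DATETIME_UID = re.compile(
    r"dd:(?P<year>\d{4})"
    r"(?:(?P<month>\d{2})"
    r"(?:(?P<day>\d{2})"
    r"(?:(?P<hour>\d{2})"
    r"(?:(?P<minute>\d{2})"
    r"(?:(?P<second>\d{2})"
    r"(?:\.(?P<fraction>\d+))?"
    r")?)?)?)?)?"
)


def parse_datetime_uid(uid: str) -> dict:
    """Split a date-time UID into the fields that are present, as strings."""
    match = _DATETIME_UID.fullmatch(uid)
    if match is None:
        raise ValueError(f"not a date-time UID: {uid!r}")
    return {name: value for name, value in match.groupdict().items() if value is not None}
```

With this reading, dd:2024 denotes a whole year and dd:20240131 a single day, so the precision of the period is carried by the UID itself.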

The above enables software to interpret synonyms and homonyms in Gellish in an unambiguous way, without the need for data types.

Data types for allowed values?

The specification of a domain of allowed values (also called a value space) is primarily a specification of the allowed concepts, independent of the encoding or notation used to denote the concepts. For example, the domain of “number of items” can be specified as being “natural number” (UID 920245). That is a sufficient specification of what is meant. Whether such numbers are denoted as 1, 2, 3 or as one, two, three, or by a mixture of them, is not relevant for the meaning and therefore should not be relevant for database content. The vocabulary of Gellish (in its dictionary) and possible encoding rules should specify the allowed denotations of numbers, and may specify the preferred denotations in a particular language community. Whether an entered character string is a valid “natural number” is then verified by comparing the entered character string with the vocabulary of the language. Similarly, the domain of “country” is by definition “country”. As countries are also represented in the language by UIDs, it is irrelevant for the meaning whether the values are denoted by numeric codes (such as “840”) or as text strings (such as “United States”). Whether an entered text string is a valid country is thus determined by comparison with the denotations in the vocabulary. This means that “840” and “United States” as well as “US” or “USA” might all be allowed.
Furthermore, collections of allowed values for specific purposes can be defined and it can be specified that a value should be one of the elements in such a collection.
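The verification described above can be sketched in Python as follows. This is a simplified illustration, not the way any Gellish software actually works: the two tables, the function name resolve_in_domain and the UIDs #3 and #840 for the numbers are hypothetical, and each concept is linked directly to its domain concept, whereas a real taxonomy would require walking a subtype hierarchy.

```python
# Term -> UIDs of all concepts that the term may denote (homonyms allowed).
DENOTATIONS = {
    "840": {"#840", "2700347"},
    "United States": {"2700347"},
    "US": {"2700347"},
    "USA": {"2700347"},
    "3": {"#3"},
    "three": {"#3"},
}

# Concept UID -> UID of the kind it belongs to (flattened for the sketch).
KIND_OF = {
    "#3": "920245",       # natural number
    "#840": "920245",     # natural number
    "2700347": "700011",  # country
}


def resolve_in_domain(term: str, domain_uid: str):
    """Return the UID that `term` denotes within the domain, or None if not allowed."""
    candidates = {
        uid for uid in DENOTATIONS.get(term, set()) if KIND_OF.get(uid) == domain_uid
    }
    if len(candidates) > 1:
        raise ValueError(f"{term!r} is ambiguous within domain {domain_uid}")
    return next(iter(candidates), None)
```

Here the domain, not a data type, resolves the homonym: “840” entered for a country yields UID 2700347, the same string entered for a number of items yields the number, and “United States” entered for a number of items is rejected. What ends up in the database is the UID, so it no longer matters which denotation was typed.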

Thus the denotation or data type is irrelevant for database content. As the denotation of concepts is specified in the vocabulary (taxonomic dictionary) and encoding rules of the formalized language that is used, the verification of entered data can be done on the basis of the specifications in the taxonomic dictionary.
Thus as long as the correct UIDs are used, there is no reason to specify data types.
