Data types not needed

I would like to defend the statement: Data types are not needed in semantic databases.

In natural languages we use just strings of characters. Or are we using data types?

Certainly we can classify character strings in natural languages, e.g. as numeric strings or numbers (decimal numbers, whole numbers, rational numbers, negative numbers), text strings, dates, etc., depending on the kinds of characters that are used and whether particular conventions are obeyed. However, in the end they are all represented as just strings of characters. There are even numeric strings for which the application context indicates that they are not numbers, e.g. when they represent codes, such as country codes, which denote countries (e.g. 840 denotes the United States of America, according to ISO 3166-1). Is there a difference in meaning that is determined by "data types"? Or is meaning merely encoded by different character strings in different contexts?

In information technology, basic data types are created to indicate different (binary) encoding systems or notations, for example integer, real and character (string) encodings. Such data types are basically syntactic concepts and not semantic concepts, because the complete meaning can be conveyed by using character strings only. In programming languages and query languages these data types determine which operations are allowed, whereas in databases they are used, for example, to verify that only allowed values are entered. The concept ‘twelve’ and the concept ‘United States’ can each be encoded in various ways in different binary encoding systems. But how does a computer know that twelve denotes the same as 12 and the same as dozen, and that 840 denotes the same as United States, but not the same as the number 840? That does not follow from the encoding or data type (the syntax). It follows from the meaning of the uniquely identified concepts (the semantics) that are denoted by those character strings, by their aliases (synonyms, abbreviations, codes, translations, etc.) and by their different binary encodings (e.g. in ASCII or Unicode, or integer*8, etc.).

Semantic information modeling is a methodology that takes a meaning as its starting point and represents that meaning by a unique identifier (e.g. a natural number or another character string). Semantic modeling therefore makes a strict distinction between a concept with one particular meaning and its various allowed denotations. According to that methodology, a vocabulary in a (taxonomic) dictionary and language definition specifies how such a meaning can be denoted. This allows a meaning to be denoted in various ways, using aliases (as in natural languages), translations and possibly various binary encoding systems. Semantic models specify information as relations between concepts; they do not specify relations between denotations or terms. In a formalized natural language, such as Formal English, the concepts (as well as individual things) are therefore represented by language-wide unique identifiers (UIDs), whereas the various denotations (terms and phrases, as in natural languages) remain just character strings. Each character string uniquely denotes a concept only within its "home base" language community context. Thus only within such a language community is a term or phrase a unique denotation for a concept, and there it is typically the preferred denotation. Within one language community there are no homonyms, but across language communities within a language homonyms are allowed.
For example, the Formal English taxonomic dictionary specifies that a concept with an arbitrarily allocated UID 920265 is a qualitative subtype (a value) of the concept ‘natural number’ (UID 920245). Furthermore, the concept with UID 920265 is denoted in English by the character strings ‘twelve’, ‘12’ and ‘dozen’ (and by "twaalf" in Dutch), etc. It is also specified that the language community ‘decimal system’ has ‘12’ as its preferred denotation. The taxonomic dictionary further specifies that a concept with UID 2700347 is classified by the concept ‘country’ (UID 700011) and is denoted as "840" and also as "United States", where "840" has ISO 3166-1 (and not "decimal system") as its language community context. Note that the different denotations for the same concept (such as "12" and "twelve") do not differ in meaning, but denote the same concept. This enables software to interpret synonyms and homonyms in the formal language unambiguously, without the need for data types.
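The dictionary entries described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how a Formal English dictionary is actually stored: the `Denotation` record, the `resolve` and `aliases` helpers, the assignment of ‘twelve’ and ‘dozen’ to an "English" community, and the UID for the number 840 are all hypothetical. Only the UIDs 920265, 920245, 2700347 and 700011 come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Denotation:
    term: str       # the character string
    community: str  # language community in which the term is unique
    uid: int        # UID of the concept that the term denotes


# Hypothetical UID for the number 840; the text does not give one.
NUMBER_840 = 999840

# Classification / subtype facts taken from the text.
KIND_OF = {
    920265: 920245,   # twelve is a value of 'natural number'
    2700347: 700011,  # United States is classified by 'country'
}

DENOTATIONS = [
    Denotation("twelve", "English", 920265),
    Denotation("dozen", "English", 920265),
    Denotation("12", "decimal system", 920265),
    Denotation("twaalf", "Dutch", 920265),
    Denotation("840", "ISO 3166-1", 2700347),
    Denotation("United States", "English", 2700347),
    Denotation("840", "decimal system", NUMBER_840),  # homonym of the country code
]

# Within one community a term denotes exactly one concept, so the pair
# (term, community) can serve as a lookup key.
INDEX = {(d.term, d.community): d.uid for d in DENOTATIONS}


def resolve(term: str, community: str) -> Optional[int]:
    """Return the UID denoted by term within the given community, if any."""
    return INDEX.get((term, community))


def aliases(uid: int) -> list:
    """Return all known denotations of a concept, sorted alphabetically."""
    return sorted({d.term for d in DENOTATIONS if d.uid == uid})
```

In this sketch `resolve("12", "decimal system")` and `resolve("twelve", "English")` return the same UID, while `resolve("840", "ISO 3166-1")` and `resolve("840", "decimal system")` return different ones. The synonyms and the homonym are told apart by the community context alone, with no data type involved.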

Data types for allowed values?

The specification of a domain of allowed values (also called a value space) is primarily a specification of the allowed concepts, independent of the encoding or notation used to denote those concepts. For example, the domain of "number of items" can be specified as being "natural number" (UID 920245). That is a sufficient specification of what is meant. Whether such numbers are denoted as 1, 2, 3 or as one, two, three, or by a mixture of them, is not relevant for the meaning and therefore should not be relevant for the content of a database. The vocabulary of the formalized language (in its dictionary) and possible encoding rules should specify the allowed denotations of numbers, and may specify the preferred denotations in a particular language community. Whether an entered character string is a valid "natural number" is then verified by comparing it with the vocabulary of the language. Similarly, the domain of "country" is by definition "country". As countries are also represented in the language by UIDs, it is irrelevant for the meaning whether the values are denoted by numeric codes (such as "840") or by text strings (such as "United States"). Whether an entered text string is a valid country is thus determined by comparison with the denotations in the vocabulary. This means that "840" and "United States", as well as "US" or "USA", might all be allowed.
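A minimal sketch of such a vocabulary-based check follows, again in Python and again under assumptions. The `VOCABULARY` and `KIND_OF` tables, the `validate` function and the UID for the number 840 are hypothetical, and "US" and "USA" are included only because the text says they might be allowed.

```python
from typing import Optional

NATURAL_NUMBER = 920245  # UID from the text
COUNTRY = 700011         # UID from the text
NUMBER_840 = 999840      # hypothetical UID; the text does not give one

# Concept UID -> UID of the kind (domain) it belongs to.
KIND_OF = {
    920265: NATURAL_NUMBER,      # twelve
    NUMBER_840: NATURAL_NUMBER,  # the number 840
    2700347: COUNTRY,            # United States
}

# Denotation -> UIDs of all concepts it can denote across communities.
VOCABULARY = {
    "12": [920265],
    "twelve": [920265],
    "dozen": [920265],
    "840": [2700347, NUMBER_840],  # homonym: country code and number
    "United States": [2700347],
    "US": [2700347],               # assumed to be an allowed alias
    "USA": [2700347],              # assumed to be an allowed alias
}


def validate(entered: str, domain_uid: int) -> Optional[int]:
    """Return the UID that the entered string denotes within the domain,
    or None if the string denotes no concept of that domain."""
    for uid in VOCABULARY.get(entered.strip(), []):
        if KIND_OF.get(uid) == domain_uid:
            return uid
    return None
```

Here `validate("840", COUNTRY)` and `validate("United States", COUNTRY)` yield the same UID, so the database stores one value regardless of the notation entered. `validate("840", NATURAL_NUMBER)` yields the UID of the number instead, and `validate("United States", NATURAL_NUMBER)` is rejected. The domain, not a data type, decides what is accepted.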

Thus the denotation or data type is irrelevant for the content of a database. As the denotation of concepts is specified in the vocabulary (taxonomic dictionary) and encoding rules of the formalized language that is used, entered data can be verified against the specifications in the taxonomic dictionary.
Thus as long as the correct UIDs are used, there is no reason to specify data types.
