Data types are not needed

I would like to defend the statement: Data types are not needed in semantic databases.

In natural languages we use just strings of characters. Or are we using data types?

Certainly we can classify character strings in natural languages, e.g. as numeric strings or numbers (decimal numbers, whole numbers, rational numbers, negative numbers), text strings, dates, etc., depending on the kinds of characters that are used and whether particular conventions are obeyed. However, in the end they are all represented as just strings of characters. There are even numeric strings for which the application context indicates that they are not numbers, e.g. when they represent codes, such as country codes, which denote countries (e.g. 840 denotes the United States of America, according to ISO 3166-1). Is there a difference in meaning that is determined by “data types”? Or is meaning simply encoded by different character strings in different contexts?

In information technology, basic data types are created to indicate different (binary) encoding systems or notations, for example integer, real and character (string) encodings. Such data types are basically syntactic concepts and not semantic concepts, because the complete meaning can be conveyed by using character strings only. In programming languages and query languages, these data types determine which operations are allowed, whereas in databases they are used e.g. to verify that only allowed values are entered. For example, the concept ‘twelve’ and the concept ‘United States’ can be encoded in various ways in different binary encoding systems. But how does a computer know that twelve denotes the same as 12 and the same as dozen, and that 840 denotes the same as United States, but not the same as the number 840? That does not follow from the encoding or data type (the syntax). It follows from the meaning of the uniquely identified concepts (the semantics) that are denoted by those character strings, their aliases (synonyms, abbreviations, codes, translations, etc.) and their different binary encodings (e.g. in ASCII or Unicode, or integer*8, etc.).
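To make the point about encodings concrete, here is a minimal Python sketch. It assumes three illustrative encodings (ASCII text, UTF-16 text and an 8-byte little-endian integer, standing in for integer*8); the variable names are hypothetical. It only shows that the same concept, twelve, yields different byte sequences depending on the encoding chosen.

```python
import struct

# The same concept (twelve) under three different binary encodings.
# The choice of encodings is an assumption made for illustration only.
as_ascii = "12".encode("ascii")        # the character string '12' in ASCII
as_utf16 = "12".encode("utf-16-le")    # the same character string in UTF-16
as_int64 = struct.pack("<q", 12)       # an 8-byte signed integer (cf. integer*8)

encodings = {"ascii": as_ascii, "utf-16-le": as_utf16, "int64": as_int64}

for name, raw in encodings.items():
    # Print each byte sequence in hexadecimal.
    print(name, raw.hex())
```

The three byte sequences differ, yet nothing in them says that they denote the same concept. That link has to come from somewhere other than the encoding.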

Semantic information modeling is a methodology that takes some meaning as its starting point and represents that meaning by a unique identifier (e.g. a natural number). According to the methodology, a vocabulary in a dictionary-taxonomy and language definition then specifies how such a meaning can be expressed or denoted in a formalized natural language (e.g. Formal English). This allows a meaning to be denoted in various ways, using aliases (as in natural languages) and different binary encoding systems. Semantic modeling therefore makes a strict distinction between a concept and its various allowed denotations. Semantic models specify relations between concepts; they do not specify relations between denotations or terminology. In a formal language the concepts (as well as individual things) are therefore represented by language-wide unique identifiers (UIDs), whereas the various denotations are represented by character strings (being terms and phrases, as in natural languages), each character string having its “home base” within its own language community context. In such a language community such a term or phrase is the preferred and unique denotation for a concept. Thus there are no homonyms within one language community.
For example, the dictionary-taxonomy specifies that a concept with some arbitrarily allocated UID 920265 is a qualitative subtype (a value) of natural number (UID 920245), whereas that UID 920265 is denoted in English e.g. by the character string ‘twelve’ or ‘12’ or ‘dozen’ (or “twaalf” in Dutch), etc. Furthermore, it is specified that the language community ‘decimal system’ will have ‘12’ as preferred denotation. The dictionary-taxonomy also specifies that a concept with UID 2700347 is classified as a country (UID 700011) and is denoted as “840” and also as “United States”, whereas “840” has as language community context ISO 3166-1 (and not “decimal system”). Note that the different denotations for the same concept (such as “12” and “twelve”) do not differ in meaning.
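The distinction between concepts and denotations could be sketched roughly as follows in Python. The UIDs are the ones quoted above; the table layout, the names `RELATIONS`, `VOCABULARY` and `resolve`, and the assignment of “United States” to the English community are assumptions made for illustration, not a description of any actual dictionary-taxonomy implementation.

```python
from typing import Optional

# Concept UIDs quoted in the text.
NATURAL_NUMBER = 920245
TWELVE = 920265
COUNTRY = 700011
UNITED_STATES = 2700347

# Relations are stated between concepts (UIDs), never between terms.
RELATIONS = {
    TWELVE: ("is a qualitative subtype of", NATURAL_NUMBER),
    UNITED_STATES: ("is classified as", COUNTRY),
}

# Denotations: (term, language community) -> concept UID.
# Keying on the pair rules out homonyms within one language community.
VOCABULARY = {
    ("twelve", "English"): TWELVE,
    ("dozen", "English"): TWELVE,
    ("12", "decimal system"): TWELVE,
    ("twaalf", "Dutch"): TWELVE,
    ("840", "ISO 3166-1"): UNITED_STATES,
    ("United States", "English"): UNITED_STATES,  # community assumed
}


def resolve(term: str, community: str) -> Optional[int]:
    """Return the UID denoted by a term in a community, or None if unknown."""
    return VOCABULARY.get((term, community))


# Different denotations resolve to the same concept.
same = resolve("12", "decimal system") == resolve("twelve", "English")
# "840" denotes a country only in the ISO 3166-1 context.
country_code = resolve("840", "ISO 3166-1")
```

In this sketch `resolve("840", "decimal system")` returns `None`, but only because the number 840 has not been given a UID here. A fuller vocabulary would map that pair to a different concept than the country.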

Data types for allowed values?

The specification of a domain of allowed values (also called a value space) is primarily a specification of the allowed concepts, independent of the encoding or notation used to denote the concepts. For example, the domain of “number of items” can be specified as being “natural number” (UID 920245). That is a sufficient specification of what is meant. Whether such numbers are denoted as 1, 2, 3 or as one, two, three, or by a mixture of them, is not relevant for the meaning and thus not for the database content. (For a user interface it may be relevant.) As the vocabulary of the formalized language (in its dictionary) specifies the allowed denotations of numbers, whether an entered character string is a valid “natural number” is verified by comparing it with the vocabulary of the language. Similarly, the domain of “country” is by definition “country”. As countries are also represented in the language by UIDs, it is irrelevant for the meaning whether the values are denoted by numeric codes (such as “840”) or as text strings (such as “United States”). Whether an entered text string is a valid country is thus determined by comparison with the denotations in the vocabulary. This means that “840” and “United States” as well as “US” or “USA” might all be allowed.
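Validation against the vocabulary, rather than against a data type, might look roughly like the following Python sketch. The UIDs come from the text; the structures `BELONGS_TO` and `DENOTES` and the function `is_valid` are hypothetical, and the vocabulary is deliberately tiny.

```python
# Concept UIDs quoted in the text.
NATURAL_NUMBER = 920245
COUNTRY = 700011
TWELVE = 920265
UNITED_STATES = 2700347

# Which domain concept each value concept belongs to.
BELONGS_TO = {TWELVE: NATURAL_NUMBER, UNITED_STATES: COUNTRY}

# Character string -> set of concepts it may denote (in any community).
# Only the concepts mentioned in the text are included in this sketch.
DENOTES = {
    "12": {TWELVE},
    "twelve": {TWELVE},
    "dozen": {TWELVE},
    "840": {UNITED_STATES},
    "United States": {UNITED_STATES},
    "US": {UNITED_STATES},
    "USA": {UNITED_STATES},
}


def is_valid(entered: str, domain_uid: int) -> bool:
    """True if the entered string denotes some concept in the given domain."""
    candidates = DENOTES.get(entered, set())
    return any(BELONGS_TO.get(uid) == domain_uid for uid in candidates)


print(is_valid("dozen", NATURAL_NUMBER))  # word form of a number
print(is_valid("840", COUNTRY))           # numeric code for a country
print(is_valid("Atlantis", COUNTRY))      # not in the vocabulary
```

The check never asks whether the entered string "looks like" an integer or a text. It only asks whether the vocabulary maps the string to a concept in the required domain.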

Thus the denotation or data type is irrelevant for the database content, although it may be relevant for a user interface. As the denotation of concepts is specified in the vocabulary of the formalized language that is used (in its dictionary-taxonomy), entered data can be verified on the basis of the specifications in the dictionary-taxonomy.
Thus, as long as the correct UIDs are used, there is no reason to specify data types.
