Dialogues, inserts and queries

Computer interpretable dialogues among various parties require that assertions, queries, answers, denials, etc. are expressed in one common language. This article describes how that can be achieved by using Gellish.

In natural languages, queries and assertions have nearly the same structure and use the same terminology, so the sentences used for assertions and questions differ only slightly. The differences, such as changes in word order and the use of question marks, can even be eliminated completely when the ‘speech act’ theory of John Searle is applied. Searle demonstrated that the same expression can be used for assertions as well as for questions or other ‘intentions’, provided that each intention is stated as an explicit indicator. For example, a statement, a question and a confirmation can be expressed by different explicit intentions, each followed by the same expression, as follows:

Statement: book B-1 has a price of 30 dollar
Question: book B-1 has a price of 30 dollar
Confirmation: book B-1 has a price of 30 dollar

The second expression is intended to be interpreted as a question that asks for a confirmation or a denial. Note that a confirmation is equivalent to the answer ‘yes’ or ‘indeed’. Typical query languages assume that questions ask something about unknowns. In natural languages unknowns are usually represented by terms such as ‘what’ or ‘who’. Thus a typical expression would be:

Question: what is the price of book B-1

In the Gellish formalized languages the phrase ‘is the price of’ is defined as an inverse expression of ‘has a price of’. Thus the latter expression can be converted automatically into the above word sequence as follows:

Question: book B-1 has a price of what
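The conversion between a phrase and its inverse can be sketched as follows. This is a minimal illustration and not Gellish software: the names INVERSE_PHRASES and to_base_form are hypothetical, and the lookup contains only the one phrase pair mentioned above.

```python
# Hypothetical lookup of inverse phrases; only the pair from the text is listed.
INVERSE_PHRASES = {
    "is the price of": "has a price of",
}


def to_base_form(expression: str) -> str:
    """Rewrite 'X <inverse phrase> Y' as 'Y <base phrase> X'.

    Expressions that contain no known inverse phrase are returned unchanged.
    """
    for inverse, base in INVERSE_PHRASES.items():
        marker = f" {inverse} "
        if marker in expression:
            left, right = expression.split(marker, 1)
            return f"{right} {base} {left}"
    return expression


print(to_base_form("what is the price of book B-1"))
```

A real implementation would presumably look up the inverse phrases in the Gellish dictionary instead of a hard-coded table.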

Queries on conventional database content are usually expressed in dedicated query languages, such as SQL and SPARQL. The ‘speech act’ theory and other conventions open up the possibility of using the same formalized language for statements (in exchanged messages or data stores) as well as for insertion commands, queries, responses (answers), etc., as is described further below.

SQL assumptions and limited semantics

Each query that is expressed in an ordinary query language such as SQL depends on the database structure and on the terminology that is used for the content. In other words, each database requires dedicated queries that depend on its specific structure and terminology. The query language itself can be adapted to any relational database structure. It puts no requirements on the tables in the database, neither on their names nor on their columns and column names, apart from the requirement that each column shall contain instances of the same kind. This freedom is required because database structures and table definitions are not standardized. As a result, different databases for similar things are composed of different tables, with different numbers of columns, different column names and different terminology for their content, even when they contain the same information about the same kinds of things. Thus INSERT as well as SELECT statements will be different for each database. This freedom also implies that these languages presuppose that authors of inserts and queries know the internal structure (syntax) of the queried database as well as the terminology used for table columns and table content. SQL and other query languages themselves do not deal with the meaning (semantics) of the columns of the tables and are independent of the terminology that is used for the content of the tables. They are generic languages that define a minimum of semantics. For example, the following simple insertion into a database table called ‘Book’ is written in SQL (adapted from http://en.wikipedia.org/wiki/SQL):

INSERT INTO Book
 (title, price, type)
 VALUES
 ('B-1', 30, NULL),
 ('B-2', 20, 'paperback');

Apparently the author of this insert knows that a table called ‘Book’ already exists and that it has at least three columns, called title, price and type. He also knows that a title is the title of a book, that a price on the same row is the net selling price (in dollars) of a single copy of a book with that title, and that a type denotes the subtype of book that classifies the book on that row. He also introduces a new free term into the vocabulary of the content, unless the term ‘paperback’ is a predefined allowed value for ‘type’.

The same content could, however, be stored in a database that applies different definitions, for example a database with a table called ‘Product’ and columns with names such as ‘name’, ‘nett price’ and ‘product type’. The INSERT would then be different, although the semantics of the content would be the same.

Data can be retrieved from such database tables with queries that are also expressed in such query languages. For example the following select from the ‘Book’ table is also expressed in SQL:

SELECT *
FROM Book
WHERE price > 100.00;

This simple query apparently assumes the same knowledge from the author as is required for the insertion.

The same question would have to be expressed differently if it were a query on the ‘Product’ database. Furthermore, the expression for inserting a statement is significantly different from the expression of a query about the same information, and both differ again from the expressions that present the results of a query.
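The dependence on the database structure can be made concrete with a small sketch that uses Python's built-in sqlite3 module. The two schemas follow the ‘Book’ and ‘Product’ examples above. The column types, the NULL type for B-1 and the threshold of 25 (chosen so that the sample rows yield a result) are assumptions made for this illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two schemas that hold the same information under different names.
conn.execute("CREATE TABLE Book (title TEXT, price REAL, type TEXT)")
conn.execute(
    'CREATE TABLE Product (name TEXT, "nett price" REAL, "product type" TEXT)'
)

rows = [("B-1", 30, None), ("B-2", 20, "paperback")]

# The same facts need a differently worded INSERT for each schema.
conn.executemany("INSERT INTO Book (title, price, type) VALUES (?, ?, ?)", rows)
conn.executemany(
    'INSERT INTO Product (name, "nett price", "product type") VALUES (?, ?, ?)',
    rows,
)

# The same question also needs a differently worded SELECT for each schema.
book_hits = conn.execute(
    "SELECT title FROM Book WHERE price > 25 ORDER BY title"
).fetchall()
product_hits = conn.execute(
    'SELECT name FROM Product WHERE "nett price" > 25 ORDER BY name'
).fetchall()

print(book_hits, product_hits)
```

Both queries return the same book, but neither statement can be reused on the other table without rewriting it.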

The query language SPARQL is a bit different, because it builds on the standard triple syntax of RDF. This standardizes the structure (syntax) of the tables. However, the content, being the classes, the literals and the predicates (called ‘properties’ in RDF), is still not standardized. Thus SPARQL is not aware of terms such as book, price and paperback.

Equivalent inserts and selects in Gellish

This is quite different when Gellish Formalized English or another Gellish formalized language is adopted. The structure of expressions as well as the vocabulary in those formalized languages are all the same, independent of database structures or intentions of the expressions. This is achieved by the rule that all formalized expressions have the same expression components (thus they can all be stored in one standard table or one binary semantic network) and all expressions use the same (extensible) taxonomic dictionary, which includes predefined concepts, such as book, title, price and paperback, as in any ordinary dictionary. Furthermore, the expressions for insertions are similar to the expressions of queries and of information that is to be stored or exchanged.

Thus Gellish expressions of insertions and queries are only determined by the semantics and are not determined by the many possible database structures, and they are independent of the variety of terminology that is used in current practices for names of entity types and names of attribute types.

This means that information that is expressed in Gellish can be inserted in any database table that is based on the Gellish Syntax or that has import and export mapping to Gellish expressions (within access and requirements constraints). Note that Gellish expressions are not just for books only! Thus the system independent expressions don’t need to be rewritten for other databases. Table 1 presents an example of an insert command for information about the prices of two books with expressions in Gellish English that are database independent.

Intention	UID of left hand object	Name of left hand object	Name of kind of relation	UID of right hand object	Name of right hand object	UoM
command	195070	insert	the following expressions into	101	ABC
statement	102	B-1	is classified as a	490023	book
statement	102	B-1	has as aspect	103	P-1 of B-1
statement	103	P-1 of B-1	is classified as a	550742	price
statement	103	P-1 of B-1	has on scale a value equal to	920366	30	$
statement	104	B-2	is classified as a	493755	paperback
statement	104	B-2	has as aspect	105	P-1 of B-2
statement	105	P-1 of B-2	is classified as a	550742	price
statement	105	P-1 of B-2	has on scale a value equal to	920376	20	$
command	193423	terminate	the execution of	195070	insert

Table 1, Insertion of product data

Note: Table 1 only shows a subset of the Gellish Expression format. In a full Gellish Expression format each line has more UIDs and contextual facts, such as the validity period, status, originator, etc. This makes it possible, for example, to add multiple prices in various currencies, each with its own validity period, if the cardinality constraints allow for that.

The body of Table 1, without the first and the last line can be copied exactly into a database table, such as ABC, because the storage table has an identical table structure.
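As a sketch of this copying step, the body of Table 1 can be loaded into a storage table with the same seven columns. The example uses Python's sqlite3 module. The column names (intention, lh_uid, and so on) are abbreviations invented for this illustration, and the empty string for a missing unit of measure is likewise an assumption.

```python
import sqlite3

# Body of Table 1 without the two command lines; "" marks an empty UoM cell.
TABLE_1_BODY = [
    ("statement", 102, "B-1", "is classified as a", 490023, "book", ""),
    ("statement", 102, "B-1", "has as aspect", 103, "P-1 of B-1", ""),
    ("statement", 103, "P-1 of B-1", "is classified as a", 550742, "price", ""),
    ("statement", 103, "P-1 of B-1", "has on scale a value equal to",
     920366, "30", "$"),
    ("statement", 104, "B-2", "is classified as a", 493755, "paperback", ""),
    ("statement", 104, "B-2", "has as aspect", 105, "P-1 of B-2", ""),
    ("statement", 105, "P-1 of B-2", "is classified as a", 550742, "price", ""),
    ("statement", 105, "P-1 of B-2", "has on scale a value equal to",
     920376, "20", "$"),
]

conn = sqlite3.connect(":memory:")
# Column names are abbreviations chosen for this sketch.
conn.execute(
    """CREATE TABLE ABC (
           intention TEXT, lh_uid INTEGER, lh_name TEXT, relation TEXT,
           rh_uid INTEGER, rh_name TEXT, uom TEXT)"""
)
# One generic INSERT suffices, whatever the expressions are about.
conn.executemany("INSERT INTO ABC VALUES (?, ?, ?, ?, ?, ?, ?)", TABLE_1_BODY)

stored = conn.execute("SELECT COUNT(*) FROM ABC").fetchone()[0]
print(stored)
```

Because every expression has the same components, the INSERT statement does not change when the content is about something other than books.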

The above query in SQL is expressed in Formal English as follows:

Intention	UID of left hand object	Name of left hand object	Name of kind of relation	UID of right hand object	Name of right hand object	UoM
command	193617	select	the following expressions from	101	ABC
question	1	?Book-1	is classified as a	490023	book
question	1	?Book-1	has as aspect	2	?Price-1
question	2	?Price-1	is classified as a	550742	price
question	2	?Price-1	has on scale a value greater than	920053	100	$
command	193423	terminate	the execution of	193617	select

Table 2, Query table ABC in the form of a product model

Comparison of Table 1 with Table 2 shows the similarity of the models, which demonstrates that the expression of information and the expression of queries can be done in the same language. Thus there is no need for a dedicated query language.

Note 1: Formally the names of the unknowns are free, although it is recommended to use terms such as ‘what’, ‘which’ and ‘who’ (possibly followed by a sequence number) or terms that start with a question mark (?), as is a SPARQL convention. The reason why the names are free is to enable searching on string commonalities (see below). This freedom is enabled by the fact that, according to the Gellish convention, all unknowns shall be represented by UIDs that are numbers in the range 1 to 99. Remember that all terms denote concepts that are represented by UIDs, although not all of those UIDs are shown in Table 1 and Table 2, to limit the width of the displayed part of the tables. Thus, in practice the expressions, the left and right hand objects and the kinds of relations all have unique identifiers (UIDs).

Note 2: A query may search for and select from more than one table at the same time (because the various tables have the same definition). Tables do not need to be JOINed, only the search results should then be presented to the user as a combined result.

The query that is expressed in Table 2 illustrates that software should take the taxonomy of concepts into account. For example, the dictionary-taxonomy specifies that the concept paperback is a subtype of book. If the software processes that information correctly, then a query on book will also find the paperbacks. This hierarchy makes it simple to modify the query so that it searches e.g. on paperbacks only or on any other subtype.
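How software might take the taxonomy into account can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Gellish implementation: the expressions of Table 1 are reduced to name triples, SUPERTYPE is a hypothetical one-entry fragment of the taxonomy, and the functions is_a and priced_above are invented for this example. A threshold of 10 is used instead of 100 so that the Table 1 data yields results.

```python
# Expressions from Table 1 as (left name, kind of relation, right name) triples.
FACTS = [
    ("B-1", "is classified as a", "book"),
    ("B-1", "has as aspect", "P-1 of B-1"),
    ("P-1 of B-1", "is classified as a", "price"),
    ("P-1 of B-1", "has on scale a value equal to", "30"),
    ("B-2", "is classified as a", "paperback"),
    ("B-2", "has as aspect", "P-1 of B-2"),
    ("P-1 of B-2", "is classified as a", "price"),
    ("P-1 of B-2", "has on scale a value equal to", "20"),
]

# Hypothetical fragment of the taxonomy: subtype -> direct supertype.
SUPERTYPE = {"paperback": "book"}


def is_a(kind: str, target: str) -> bool:
    """True if kind equals target or is a (transitive) subtype of it."""
    while kind != target:
        if kind not in SUPERTYPE:
            return False
        kind = SUPERTYPE[kind]
    return True


def priced_above(kind: str, threshold: float) -> list[str]:
    """Names of things of the given kind that have a price above threshold."""
    classified = {
        left: right for left, rel, right in FACTS if rel == "is classified as a"
    }
    value = {
        left: float(right)
        for left, rel, right in FACTS
        if rel == "has on scale a value equal to"
    }
    hits = []
    for left, rel, aspect in FACTS:
        if (
            rel == "has as aspect"
            and is_a(classified.get(left, ""), kind)
            and classified.get(aspect) == "price"
            and value.get(aspect, float("-inf")) > threshold
        ):
            hits.append(left)
    return hits


print(priced_above("book", 10))       # a query on book also finds the paperback
print(priced_above("paperback", 10))  # the same query narrowed to the subtype
```

Narrowing the search to a subtype only requires changing the kind that is passed in; the structure of the query stays the same.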

In SQL an asterisk (*) can be used to specify that ‘all’ attributes from a table should be reported. This assumes that the author knows which attributes are in the table, and when there is information about the books in other tables the query becomes more complicated. In the Gellish approach the kinds of relations that are queried can be specified more precisely. For example, the query in Table 2 uses the kind of relation <has as aspect> and thus it only asks for aspects, whereas the following line specifies that only those aspects are required for which the aspect <is classified as a> price. The query can easily be extended with additional requests for other information, such as:

question	What-1	is located in	Some location-1
question	Some location-1	is classified as a	building

Or with the very generic question:

question	What-1	is related to	A-1
question	A-1	is classified as a	anything

This last question asks for everything that is known about the books.

Further data manipulation constructs for operations on the search results, including operations on the contextual facts, are outside the scope of this article.

Search string commonalities

One of the reasons to leave the names of the variables free is to make it possible to specify character strings that only partially match the names that are searched for. The name of an unknown can be specified as P-1, whereas the intention is to search for things that have a name that starts with P-1, so that e.g. P-101 and P-1201 are included in the search result. To specify to what extent a search string shall match target strings, two additional components are available in the Gellish Expression Format: a left hand and a right hand string commonality. For example, for the above search on P-1 the commonality will be ‘case insensitive front end identical’.
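The effect of the one commonality named above can be sketched as a prefix comparison that ignores case. The function name matches and the list of sample names are hypothetical. Other string commonalities are deliberately not implemented here, because their names and meanings are not given in this article.

```python
def matches(
    search: str,
    target: str,
    commonality: str = "case insensitive front end identical",
) -> bool:
    """Check a target name against a search string under a string commonality.

    Only the commonality named in the text is implemented; any other value
    raises an error rather than guessing its meaning.
    """
    if commonality == "case insensitive front end identical":
        return target.lower().startswith(search.lower())
    raise ValueError(f"unsupported string commonality: {commonality}")


names = ["P-1", "P-101", "p-1201", "P-2", "XP-1"]
print([name for name in names if matches("P-1", name)])
```

With this commonality the search string P-1 selects P-1, P-101 and p-1201, but neither P-2 nor XP-1.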

Further details about these string commonalities are described in the book ‘Semantic Modeling in Formal English’.

SPARQL and RDF

SPARQL is a query language that is made especially for querying databases that are formatted in conformance with RDF, also called ‘triple stores’. As shown above, the semantics of questions and other expressions require more than just triples, such as units of measure and contextual facts. That is the reason why many implementations specify extensions of RDF to represent collections of triples, which are called ‘named graphs’, as is also applied in ISO 15926-11.

Extended RDF implementations of Gellish Formalized English can use SPARQL directly. However, RDF itself defines a syntax and only a minimum of semantics (it defines only a few concepts), just as SQL does. As a result, any kind of relation (‘predicate’ in RDF) and any left hand and right hand term (‘subject’ and ‘object’ in RDF) can be used in RDF expressions. Thus everybody can use his or her own ‘namespace’ and own ontology. This powerful flexibility is at the same time a weakness for interoperability, because the language in which the database, message and query contents can or shall be expressed is not standardized.

The expression of inserts and queries can be made database system independent only when an extended RDF is combined with a semantically rich language, such as Gellish. Such a combination provides a language that includes semantics as well as syntax (format).

Another question is whether the SPARQL syntax is to be preferred over the tabular Gellish Expression format syntax that is used in Table 1 and Table 2. The commonalities and differences between these two formats can be illustrated with the SPARQL example query for a ‘foaf’ (friend of a friend) database (from http://en.wikipedia.org/wiki/SPARQL):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person.
  ?person foaf:name ?name.
  ?person foaf:mbox ?email.
}

The above example shows that SPARQL also presupposes knowledge about the particular structure of the queried database. Although the structure of RDF expressions is database (data model) independent, this query depends on the structure of the foaf database and relies on an understanding of the content of the foaf ontology (http://xmlns.com/foaf/spec/), which includes a database structure (table definitions) with definitions of ‘classes’ (entity types) that have predefined ‘properties’ (attribute types). For example, the class foaf:Person is not the same as the generic concept ‘person’, because the foaf ontology defines a foaf:Person as a person that has a number of predefined ‘properties’ (attributes) with specific names. Thus a foaf:Person is defined as a particular collection of ‘properties’. For example, the foaf ontology predefines that a foaf:Person can have or has a surname, as well as e.g. publications and a currentProject, and inherits an mbox. Apparently the foaf ontology defines a very specific ‘language’ that cannot be merged with other ontologies/languages, and thus the query will only work on a foaf database and has to be rewritten for any other database.

This demonstrates why the neutral form of Gellish expressions in Table 1 and 2 has advantages.

Gellish.net