Text Query
Query Languages
Formal languages are required to define, select and transform text units.
The relational database model is the prime example. Its huge success shows how effective a sound theory can be. Think about how much work relational database management systems do every day all over the world. Think about how straightforward it is to query a large, complex body of data for a particular piece of information, how little work the programmer needs to do and how much of it is done by the system itself. The query language available has tremendous expressive power, not because it is rich and complicated, but because it is simple and general and relies on a well-thought-out model that captures the deep structure of data in general.
We should follow the example set by the relational model: it is not enough to find a general text structure; we must also provide it with query languages, for these will be the key to its practical success.
What could text query languages look like?
Text is essentially hierarchical, so a natural choice is a multi-level notation in which each level selects particular nodes. Examples of such notations are CSS selectors, which select the HTML elements that styles apply to, and XPath, which selects XML nodes.
While text selectors are handy and allow compact, simple expressions, they do not scale: expressions rapidly become unclear as they grow. The full solution is a word-based query language that allows one to build sentences, group them into parameterized functions and, in general, use all the means of a fully developed higher-order programming language. Examples of word-based query languages are XQuery with its FLWOR expressions for XML and SQL for the relational model, but both of them lack higher-order constructs and are much weaker than full programming languages.
A Text Selector Notation
Let us introduce a selector notation to query text.
The basic grammar of the selectors follows from the text structure: selectors must choose text units having a particular name, role, or type. We simply write a name with the usual prefix sign to indicate whether it is a unit name, a role name or a type name, and separate levels with a period.
To select a unit with a particular name, we write =name. If we select :type we get all instances of the given type or of any type descending from it. A tag ~role selects all units that play the given role. The role is the default selector, so role is the same as ~role.
Let us consider the example used above describing the Jones family:
= Jones ~ family {
= Ann ~ parent : woman
= John ~ parent : man
= Lena ~ child : woman
}
The selector parent returns Ann and John, :woman returns Ann and Lena, and :person returns Ann, John and Lena, since woman and man are both descendant types of person.
You can also restrict the results by unstructured content. If you have units such as:
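To make the three basic selectors concrete, here is a minimal Python sketch. It is not an implementation of the notation itself: the Unit class, the SUPERTYPE table and the select function are hypothetical names introduced only for illustration, and the type hierarchy (woman and man descending from person) is an assumption taken from the example.

```python
from dataclasses import dataclass, field

# Assumed type hierarchy: each type maps to the type it descends from.
SUPERTYPE = {"woman": "person", "man": "person"}

@dataclass
class Unit:
    name: str = ""
    role: str = ""
    type: str = ""
    children: list = field(default_factory=list)

def is_a(unit_type, wanted):
    """True if unit_type equals wanted or descends from it."""
    while unit_type:
        if unit_type == wanted:
            return True
        unit_type = SUPERTYPE.get(unit_type, "")
    return False

def walk(unit):
    """Yield a unit and all its descendants in document order."""
    yield unit
    for child in unit.children:
        yield from walk(child)

def select(root, selector):
    """Evaluate a one-clause selector: =name, ~role, :type, or a bare role."""
    if selector[0] in "=~:":
        kind, value = selector[0], selector[1:]
    else:
        kind, value = "~", selector  # the role is the default selector
    tests = {
        "=": lambda u: u.name == value,
        "~": lambda u: u.role == value,
        ":": lambda u: is_a(u.type, value),
    }
    return [u for u in walk(root) if tests[kind](u)]

jones = Unit("Jones", "family", children=[
    Unit("Ann", "parent", "woman"),
    Unit("John", "parent", "man"),
    Unit("Lena", "child", "woman"),
])

print([u.name for u in select(jones, "parent")])   # ['Ann', 'John']
print([u.name for u in select(jones, ":woman")])   # ['Ann', 'Lena']
print([u.name for u in select(jones, ":person")])  # ['Ann', 'John', 'Lena']
```

The sketch handles a single clause only; periods, quoted content and modifiers are left out on purpose.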
~birth-date 6/24/1954
then you can select them with:
person . "6/24/1954" birth-date
This would return a unit playing the role birth-date and containing the given date string. Usually you would rather get the person whose birth date matches. For this purpose, one marks the output level with the prefix #. If you select:
#person . "6/24/1954" birth-date
you get the unit =Ann.
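The difference between returning the matched unit and returning the marked output level can be sketched in Python as follows. This is only an illustration under stated assumptions: the birth-date unit is assumed to be a direct child of the person unit, the selector string is not parsed, Unit and find are hypothetical names, and the date given for John is invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    name: str = ""
    role: str = ""
    content: str = ""
    children: list = field(default_factory=list)

def find(unit, role, text, path=()):
    """Yield (ancestors, unit) for each unit with the role whose content contains text."""
    if unit.role == role and text in unit.content:
        yield path, unit
    for child in unit.children:
        yield from find(child, role, text, path + (unit,))

jones = Unit("Jones", "family", children=[
    Unit("Ann", "parent", children=[Unit(role="birth-date", content="6/24/1954")]),
    # John's date is made up purely for this illustration.
    Unit("John", "parent", children=[Unit(role="birth-date", content="3/2/1951")]),
])

matches = list(find(jones, "birth-date", "6/24/1954"))

# Without '#': the innermost matched unit is the result.
innermost = [unit for _, unit in matches]
# With '#' on the level above: the enclosing unit is the result.
enclosing = [path[-1] for path, _ in matches]

print([u.content for u in innermost])  # ['6/24/1954']
print([u.name for u in enclosing])     # ['Ann']
```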
To restrict the selection at more than one level, one writes several clauses separated by periods. For example, to get all children of the Jones family one selects:
=Jones . child
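A multi-level selector can be pictured as a chain of filters, one per level. The Python sketch below assumes that each clause after a period is matched against the direct children of the units matched so far, and that the first clause is matched against the given root units; both are assumptions for illustration, as are the names Unit, matches and select. Only =name and role clauses are handled, and splitting on periods would break quoted content that itself contains a period.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    name: str = ""
    role: str = ""
    children: list = field(default_factory=list)

def matches(unit, clause):
    """Match one clause: =name, ~role, or a bare role."""
    if clause.startswith("="):
        return unit.name == clause[1:]
    return unit.role == clause.lstrip("~")

def select(roots, selector):
    """Apply the clauses level by level, descending one level per period."""
    clauses = [c.strip() for c in selector.split(".")]
    current = [u for u in roots if matches(u, clauses[0])]
    for clause in clauses[1:]:
        current = [c for u in current for c in u.children if matches(c, clause)]
    return current

jones = Unit("Jones", "family", children=[
    Unit("Ann", "parent"),
    Unit("John", "parent"),
    Unit("Lena", "child"),
])

print([u.name for u in select([jones], "=Jones . child")])   # ['Lena']
print([u.name for u in select([jones], "=Jones . parent")])  # ['Ann', 'John']
```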
To get all first-level descendant units one uses a ? for the level.
=Jones . ?
This matches all units directly under the unit named "Jones". Using ?? instead matches the descendant units recursively, without restricting them to one level. If you want to get descendants down to the fourth level, you select ??4.
If a selector clause is prefixed with a number or a range in parentheses, the results are limited to the units at those positions.
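The three wildcard forms differ only in how deep they descend, which a short Python sketch can show. The Unit class, the descendants function and the sample tree are hypothetical and serve only as illustration: a depth limit of 1 stands for ?, no limit for ??, and a limit of n for ??n.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    name: str
    children: list = field(default_factory=list)

def descendants(unit, max_depth=None, depth=1):
    """Yield descendants in document order, down to max_depth levels (None = unlimited)."""
    for child in unit.children:
        yield child
        if max_depth is None or depth < max_depth:
            yield from descendants(child, max_depth, depth + 1)

# A made-up tree: root > A > A1 > A1a, and root > B.
root = Unit("root", [
    Unit("A", [Unit("A1", [Unit("A1a")])]),
    Unit("B"),
])

print([u.name for u in descendants(root, 1)])  # like ?   -> ['A', 'B']
print([u.name for u in descendants(root)])     # like ??  -> ['A', 'A1', 'A1a', 'B']
print([u.name for u in descendants(root, 2)])  # like ??2 -> ['A', 'A1', 'B']
```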
=Jones . (2):person
This outputs the second person in the Jones family. The following retrieves the second and the third:
=Jones . (2-3):person
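One way to read the parenthesized prefix is as a one-based position or range applied to the results of the clause. The following Python sketch follows that reading; parse_positions and pick are hypothetical helper names, and the list of names merely stands in for the units matched by :person.

```python
import re

def parse_positions(clause):
    """Split '(2-3):person' into ((2, 3), ':person'); without a prefix return (None, clause)."""
    m = re.match(r"\((\d+)(?:-(\d+))?\)(.*)", clause)
    if not m:
        return None, clause
    first = int(m.group(1))
    last = int(m.group(2)) if m.group(2) else first
    return (first, last), m.group(3)

def pick(results, positions):
    """Keep only the results at the given one-based positions."""
    if positions is None:
        return results
    first, last = positions
    return results[first - 1:last]

persons = ["Ann", "John", "Lena"]  # stand-in for the units matched by :person

positions, rest = parse_positions("(2):person")
print(rest, pick(persons, positions))   # :person ['John']

positions, rest = parse_positions("(2-3):person")
print(rest, pick(persons, positions))   # :person ['John', 'Lena']
```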
If you do not specify a sort order, the results come in the same order as the underlying text structure: first a unit and then its children, in the order in which they were entered.
With the prefix modifier - the order is reversed: first the children, from last to first, then the parent unit. If you have this text:
~books
~book Literary Machines
~book Augmenting Human Intellect
~book Software Pioneers
then with the selector -book you get:
~book Software Pioneers
~book Augmenting Human Intellect
~book Literary Machines
This reversal refers to the order of the units, not to their contents. You can sort by the binary contents of the returned units with < (ascending) and > (descending). For the text above, with <book you get:
~book Augmenting Human Intellect
~book Literary Machines
~book Software Pioneers
The prefix modifier > is a shortcut for the modifiers -<.
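The three ordering modifiers can be compared side by side in a small Python sketch. The function name order is hypothetical, the contents are compared as UTF-8 bytes (an assumption about what "binary contents" means here), and > is modeled literally as the ascending sort followed by a reversal.

```python
books = ["Literary Machines", "Augmenting Human Intellect", "Software Pioneers"]

def order(contents, modifier=""):
    """Order unit contents: '' keeps document order, '-' reverses it,
    '<' sorts ascending by bytes, '>' is treated as '-' applied after '<'."""
    if modifier == "-":
        return contents[::-1]
    if modifier == "<":
        return sorted(contents, key=lambda s: s.encode("utf-8"))
    if modifier in (">", "-<"):
        return order(order(contents, "<"), "-")
    return list(contents)

print(order(books, "-"))
# ['Software Pioneers', 'Augmenting Human Intellect', 'Literary Machines']
print(order(books, "<"))
# ['Augmenting Human Intellect', 'Literary Machines', 'Software Pioneers']
print(order(books, ">"))
# ['Software Pioneers', 'Literary Machines', 'Augmenting Human Intellect']
```

For this particular text the results of - and > differ only in the last two positions, which shows that reversing the document order and sorting in descending order are independent operations.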
This text selector notation can be extended in many ways in order to determine what results are to be retrieved.

