Computing Pages

by Francesc Hervada-Sala


Text Query

Query Languages

Formal languages are required to define, select and transform text units.

The relational database model is the great example. Its huge success shows how effective a sound theory can be. Think about how much work relational database management systems do every day all over the world. Think about how straightforward it is to query a large, complex body of data for a particular piece of information, how little work the programmer needs to do and how much of it is done by the system itself. There is a language available that has tremendous expressive power, not because it is rich and complicated, but because it is simple and general and relies on a well-thought-out model that captures the deep structure of data in general.

We should follow the example set by the relational model: we should not be content with finding a general text structure, but should also provide it with query languages. This will be the key to its practical success.

What could text query languages look like?

Text is essentially hierarchical, so a multi-level notation suggests itself, with each level selecting particular nodes. Examples of such notations are CSS style sheets, which select the HTML elements to apply styles to, and XPath, which selects XML node lists.

While text selectors are handy and allow compact, simple expressions, they do not scale: expressions rapidly become unclear as they grow. The full solution is a word-based query language that allows one to build sentences and group them into parameterized functions and that, in general, offers all the means of fully developed higher-order programming languages. Examples of such languages are FLWOR for XML and SQL for the relational model, but both of them lack higher-order constructs and are much weaker than full programming languages.

A Text Selector Notation

Let us introduce a selector notation to query text.

The basic grammar of the selectors follows from the text structure: selectors must choose text units having a particular name, role, or type. We simply write a name with the usual prefix sign, which indicates whether it is a unit name, a role name or a type name, and separate levels with a period.

To select a unit with a particular name, we write =name. The selector :type returns all instances of the given type or of any of its descendant types. The selector ~role selects all units that play the given role. The role is the default selector, so a bare role is the same as ~role.

Let us consider the example used above describing family Jones:

= Jones ~ family { 
	= Ann ~ parent : woman 
	= John ~ parent : man 
	= Lena ~ child : woman 
}

The selector parent returns Ann and John, :woman returns Ann and Lena, and :person returns Ann, John and Lena, since woman and man are descendant types of person.
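The three basic selectors can be modeled in a few lines of Python. This is only a minimal sketch, not the author's implementation: the `Unit` class, the `SUPERTYPES` table and the helper functions `by_name`, `by_role` and `by_type` are hypothetical names introduced here, and the type hierarchy (woman and man descending from person) is assumed from the results described above.

```python
from dataclasses import dataclass, field

# Assumed type hierarchy: woman and man are descendant types of person.
SUPERTYPES = {"woman": "person", "man": "person"}


@dataclass
class Unit:
    """Hypothetical text unit with a name, a role, a type and child units."""
    name: str = ""
    role: str = ""
    type: str = ""
    children: list = field(default_factory=list)


def walk(unit):
    """Yield a unit and all its descendants in document order."""
    yield unit
    for child in unit.children:
        yield from walk(child)


def is_a(type_name, wanted):
    """True if type_name equals wanted or descends from it."""
    while type_name:
        if type_name == wanted:
            return True
        type_name = SUPERTYPES.get(type_name, "")
    return False


def by_name(root, name):       # models =name
    return [u for u in walk(root) if u.name == name]


def by_role(root, role):       # models ~role (or a bare role)
    return [u for u in walk(root) if u.role == role]


def by_type(root, type_name):  # models :type
    return [u for u in walk(root) if is_a(u.type, type_name)]


jones = Unit("Jones", "family", children=[
    Unit("Ann", "parent", "woman"),
    Unit("John", "parent", "man"),
    Unit("Lena", "child", "woman"),
])

print([u.name for u in by_role(jones, "parent")])  # ['Ann', 'John']
print([u.name for u in by_type(jones, "woman")])   # ['Ann', 'Lena']
print([u.name for u in by_type(jones, "person")])  # ['Ann', 'John', 'Lena']
```

Walking the tree in document order keeps the results in the order of the underlying text, which matches the default ordering of the notation.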

You can also restrict the results by unstructured content. If you have units such as:

~birth-date 6/24/1954

then you can select them with:

person . "6/24/1954" birth-date

This would return a unit playing the role birth-date and containing the given date string. Usually you would rather want the person whose birth date matches. For this purpose one marks the output level with the prefix #. If you select:

#person . "6/24/1954" birth-date

you get the unit =Ann.
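The effect of the output-level prefix # can be sketched in Python as a two-level match that returns either the inner or the outer unit. Everything here is an illustrative assumption: the `Unit` class, the `select` function and its `output_parent` flag are hypothetical, the sample units are modeled as playing a role named person, and John's date is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical text unit with a name, a role, content and child units."""
    name: str = ""
    role: str = ""
    content: str = ""
    children: list = field(default_factory=list)


def walk(unit):
    """Yield a unit and all its descendants in document order."""
    yield unit
    for child in unit.children:
        yield from walk(child)


def select(root, parent_role, child_role, text, output_parent=False):
    """Model parent_role . "text" child_role.

    With output_parent=True the parent level is returned instead,
    which models the # prefix on the first clause.
    """
    results = []
    for parent in walk(root):
        if parent.role != parent_role:
            continue
        hits = [c for c in parent.children
                if c.role == child_role and text in c.content]
        if hits:
            results.extend([parent] if output_parent else hits)
    return results


# Sample data assumed for illustration; John's date is invented.
people = Unit(role="people", children=[
    Unit("Ann", "person", children=[Unit(role="birth-date", content="6/24/1954")]),
    Unit("John", "person", children=[Unit(role="birth-date", content="3/2/1951")]),
])

inner = select(people, "person", "birth-date", "6/24/1954")
outer = select(people, "person", "birth-date", "6/24/1954", output_parent=True)
print([u.role for u in inner])  # ['birth-date']
print([u.name for u in outer])  # ['Ann']
```

The match itself is identical in both calls; only the level that is handed back differs, which is the distinction the # prefix expresses.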

To restrict the selection on more than one level, one writes several clauses separated by periods. For example, to get all children of family Jones one selects:

=Jones . child

To get all first-level descendant units, one uses ? for the level:

=Jones . ?

This matches all units directly under the unit named “Jones”. Using ?? instead matches the descendant units recursively, without restricting them to one level. To get descendants down to the fourth level, select ??4.
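The level wildcards amount to a depth-limited tree traversal. The following Python sketch illustrates this under stated assumptions: `Unit`, `descendants` and `label` are hypothetical names, and the birth-date unit placed under Ann is added only to give the tree a second level.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical text unit with a name, a role and child units."""
    name: str = ""
    role: str = ""
    children: list = field(default_factory=list)


def descendants(unit, max_depth=None):
    """Yield descendants in document order.

    max_depth=1 models ?, max_depth=None models ??,
    and max_depth=4 models ??4.
    """
    if max_depth is not None and max_depth < 1:
        return
    for child in unit.children:
        yield child
        deeper = None if max_depth is None else max_depth - 1
        yield from descendants(child, deeper)


def label(unit):
    """Show the unit name, or its role when it has no name."""
    return unit.name or "~" + unit.role


# The nested birth-date unit is assumed for illustration.
jones = Unit("Jones", "family", children=[
    Unit("Ann", "parent", children=[Unit(role="birth-date")]),
    Unit("John", "parent"),
    Unit("Lena", "child"),
])

print([label(u) for u in descendants(jones, 1)])  # ['Ann', 'John', 'Lena']
print([label(u) for u in descendants(jones)])     # ['Ann', '~birth-date', 'John', 'Lena']
print([label(u) for u in descendants(jones, 4)])  # ['Ann', '~birth-date', 'John', 'Lena']
```

Since this sample tree is only two levels deep, the limit of four returns the same units as the unlimited traversal.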

If a selector clause is prefixed with a number in parentheses, the results are restricted to the unit at that position.

=Jones . (2):person

This outputs the second person from family Jones. The following retrieves the second and third ones:

=Jones . (2-3):person
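Read together with the two examples, the parenthesized prefix behaves like a 1-based position or an inclusive range of positions within the result list. A minimal Python sketch of that reading follows; the function name `apply_position` and the parsing of the prefix with a regular expression are assumptions made for illustration, not part of the notation's definition.

```python
import re


def apply_position(units, spec):
    """Restrict a result list by a prefix such as "(2)" or "(2-3)".

    Positions are 1-based and ranges are inclusive.
    """
    match = re.fullmatch(r"\((\d+)(?:-(\d+))?\)", spec)
    if match is None:
        raise ValueError(f"not a position prefix: {spec!r}")
    first = int(match.group(1))
    last = int(match.group(2)) if match.group(2) else first
    if first < 1:
        raise ValueError("positions start at 1")
    return units[first - 1:last]


# Result of =Jones . :person in document order.
persons = ["Ann", "John", "Lena"]

print(apply_position(persons, "(2)"))    # ['John']
print(apply_position(persons, "(2-3)"))  # ['John', 'Lena']
```

A position beyond the end of the list simply yields an empty result in this sketch; how the notation itself treats that case is not stated.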

If you do not specify a sort order, the results keep the order of the underlying text structure: first a unit and then its children, in the order in which they were entered.

The prefix modifier - reverses the order: first the children, from the last to the first, then the parent unit. If you have this text:

~books
~book Literary Machines
~book Augmenting Human Intellect
~book Software Pioneers

then with the selector -book you get:

~book Software Pioneers
~book Augmenting Human Intellect
~book Literary Machines

This refers to the order of the units, not to their contents. You can sort by the binary contents of the returned units with < (ascending) and > (descending). For the text above, <book returns:

~book Augmenting Human Intellect
~book Literary Machines
~book Software Pioneers

The prefix modifier > is a shortcut for the modifiers -<.
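The three ordering modifiers can be illustrated with a short Python sketch. The function `order` is a hypothetical name, and applying combined modifiers from right to left (so that -< sorts first and then reverses) is an assumption made here to reproduce the stated equivalence of > and -<.

```python
def order(contents, modifiers=""):
    """Apply the ordering modifiers -, < and > to a list of unit contents.

    Modifiers are applied right to left (an assumption of this sketch).
    """
    result = list(contents)
    for modifier in reversed(modifiers):
        if modifier == "-":
            result.reverse()                                # reverse unit order
        elif modifier == "<":
            result.sort(key=lambda s: s.encode("utf-8"))    # ascending by bytes
        elif modifier == ">":
            result.sort(key=lambda s: s.encode("utf-8"))    # same as -<
            result.reverse()
        else:
            raise ValueError(f"unknown modifier: {modifier!r}")
    return result


# The book units from the text above, in document order.
books = ["Literary Machines", "Augmenting Human Intellect", "Software Pioneers"]

print(order(books, "-"))
# ['Software Pioneers', 'Augmenting Human Intellect', 'Literary Machines']
print(order(books, "<"))
# ['Augmenting Human Intellect', 'Literary Machines', 'Software Pioneers']
print(order(books, ">") == order(books, "-<"))  # True
```

Sorting on the encoded bytes rather than on the strings mirrors the statement that the comparison uses the binary contents of the units.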

This text selector notation can be extended in many ways in order to determine what results are to be retrieved.
