Computing Pages

by Francesc Hervada-Sala


Universaltext Interpreter

The Universaltext Interpreter is an experimental program that I wrote in Perl to implement the text structure presented in this book. It is a tool for producing written output (e.g. a prose book to be published, a dictionary in both electronic and paper format, a website) from source information in plain text files or word processor documents. With a Perl script you can navigate and query the source text structure and generate documents that extract all or part of its contents.

This program implements a parser for the Universaltext Language (UTL) and supports alternate custom parsers. It implements text selectors for querying the text structure. It supports binding output processors at run time for the sake of generating different file formats out of the same source.

Universaltext Language

UTL is an implementation of the plain text language that is used throughout this book. It has a line-oriented format, each line defining one text unit and curly brackets enclosing text levels. Example:

^species {
    ^common-name :string
    ^scientific-name :string
}
^mammal : species

This defines a species as having a common name and a scientific name, and declares mammals to be species.

The interpreter ensures the logical integrity of the entered text. Each symbol mentioned must have been defined previously, and each role must be one of the type's children and have a coherent type. If one tries to define a text that violates the integrity rules, parsing aborts with an error message.

UTL is designed to be embedded in prose and to be unobtrusive. When entering a string you usually do not need to put it in quotation marks: the parser treats everything from the first word without a prefix (~, :, etc.) up to the end of the line as binary data.

The parser knows about the text structure and infers which parent a child definition applies to without needing explicit curly brackets. For example the following is valid UTL (indentation is syntactically irrelevant):

^elephant : mammal
~elephant 
  ~common-name savanna elephant
  ~scientific-name Loxodonta africana africana
^penguin : species
~penguin 
  ~common-name Little Blue Penguin
  ~scientific-name Eudyptula minor

UTL considers the first child to be the default role, which is automatically assumed if a line consists just of binary data. For example, if you define prose as consisting of paragraphs and headings this way:

^ Prose {
	^p : string
	^h : string
}

A paragraph is assumed if the role is not explicit. You can enter prose with these lines:

~Prose
~h First Chapter
This is the first paragraph...
This is the second paragraph...

This way a prose document is very readable, having just occasional embedded tags that define its structure.
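
The default-role rule can be modelled in a few lines of Perl. This is only a sketch and not the interpreter's actual code: the function name assign_role and the way the child roles are passed in are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the interpreter: decide which role a
# source line belongs to. $children lists the parent's child roles in
# definition order; the first one is the default role.
sub assign_role {
    my ($children, $line) = @_;
    $line =~ s/^\s+|\s+$//g;              # indentation is irrelevant
    if ($line =~ /^~(\S+)\s*(.*)$/) {     # explicit role, e.g. "~h First Chapter"
        return ($1, $2);
    }
    return ($children->[0], $line);       # no prefix: assume the default role
}

my @prose_children = ('p', 'h');          # as in: ^Prose { ^p ... ^h ... }
for my $line ('~h First Chapter', 'This is the first paragraph...') {
    my ($role, $data) = assign_role(\@prose_children, $line);
    print "$role: $data\n";
}
```

The heading line keeps its explicit role h, while the untagged line falls back to p because p is the first child of Prose.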

Alternate Parsers

UTL is a general-purpose text language appropriate for prose, that is, sparsely structured character strings. It becomes rather verbose when expressing dense structures such as a table. To embed portions of text written in another syntax in UTL, one can use alternate parsers. For example, instead of defining a table this way:

~table 
~line
	~column 1.1
	~column 1.2
~line
	~column 2.1
	~column 2.2

you could enter this:

~table [
1.1 1.2
2.1 2.2
]

The square brackets instead of curly brackets instruct the parser to call the alternate parser defined for the type table and pass to it the following two lines. To set up a parser for a type one declares the type this way:

^table {	
	~parser main::parseTable
	^line {
		^column : string
	}
}

The Perl function parseTable from the name space main is called whenever a table has to be parsed. It receives as its parameter an object representing the interpreter. This object can be used to feed text and exposes a method readline that returns the next line to be parsed.

It is quite straightforward to define alternate parsers for dense structures that one uses throughout the source files and it makes the source documents more readable and maintainable.
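
As an illustration, a parseTable function for the table example might look roughly like the following. This is a sketch under assumptions: the MockInterpreter class is a stand-in invented here so that the code runs on its own, readline is assumed to return undef once the bracketed lines are exhausted, and the table unit itself is assumed to be defined already by the main parser. Only the method names readline, def, enter and leave are taken from this chapter.

```perl
use strict;
use warnings;

# Stand-in for the interpreter object, for demonstration only. It serves
# lines from an array and records the units that get defined.
package MockInterpreter;
sub new      { my ($class, @lines) = @_; return bless { lines => [@lines], log => [] }, $class }
sub readline { my ($self) = @_; return shift @{ $self->{lines} } }   # undef when exhausted (assumption)
sub def      { my ($self, $u) = @_; push @{ $self->{log} }, "def $u->{role}" . (defined $u->{bin} ? " $u->{bin}" : "") }
sub enter    { my ($self) = @_; push @{ $self->{log} }, "enter" }
sub leave    { my ($self) = @_; push @{ $self->{log} }, "leave" }

package main;

# Alternate parser for the type table: one ~line unit per source line,
# one ~column unit per whitespace-separated cell.
sub parseTable {
    my ($ut) = @_;
    while (defined(my $row = $ut->readline)) {
        my @cells = split ' ', $row;
        next unless @cells;                # skip blank lines
        $ut->def({ role => "line" });
        $ut->enter();
        $ut->def({ role => "column", bin => $_ }) for @cells;
        $ut->leave();
    }
}

my $ut = MockInterpreter->new("1.1 1.2", "2.1 2.2");
parseTable($ut);
print "$_\n" for @{ $ut->{log} };
```

The recorded calls correspond to the verbose ~table form shown earlier: two line units, each containing two column units.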

Feeding Text

One can feed text either through UTL text files that are read and parsed by the interpreter or programmatically. With a Perl script one can generate units, for example:

my $ut = UText->new;
$ut->def({role=>"elephant"});
$ut->enter();
$ut->def({
	role=>"common-name", 
	bin=>"savanna elephant"
});
$ut->leave();

One can also feed UTL expressions to be parsed:

$ut->read(<<"END");
~elephant 
    ~common-name savanna elephant
END

It is very easy to feed UTL files that were written with a word processor instead of plain text files. A Perl script just needs to extract their contents and perform a read on them.

One can also write a Perl script that reads a regular word processor file without UTL tags and generates a text structure based upon the document's structure. For example, the headings are translated to ~h units, the paragraphs to ~p, etc. With some amount of programming one can feed any structured documents (including spreadsheets, outlines, etc.) into the interpreter.
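
A minimal sketch of such a translation follows. It assumes that the document's elements have already been extracted by some other means into a list of [kind, text] pairs; that list, the %role_for table and the function to_utl are hypothetical names introduced for this example.

```perl
use strict;
use warnings;

# Hypothetical input: the elements of a word processor document, already
# extracted as [kind, text] pairs.
my @elements = (
    [ heading   => "First Chapter" ],
    [ paragraph => "This is the first paragraph..." ],
    [ paragraph => "This is the second paragraph..." ],
);

# Map document element kinds to UTL roles.
my %role_for = ( heading => "h", paragraph => "p" );

# Build UTL source text: one "~role text" line per element.
sub to_utl {
    my (@elems) = @_;
    my $utl = "~Prose\n";
    for my $e (@elems) {
        my ($kind, $text) = @$e;
        my $role = $role_for{$kind} or next;   # ignore unknown kinds
        $text =~ s/\s+/ /g;                    # a unit occupies a single line
        $utl .= "~$role $text\n";
    }
    return $utl;
}

print to_utl(@elements);
```

The resulting string could then be handed to the interpreter with the read method shown above.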

Text Selectors

Once the text has been fed, it can be exploited. The interpreter implements selectors as described in the chapter “Text Query”.

The function getVar returns the binary contents of the first element returned by the given selector. To output the title of the current chapter, one writes:

print $ut -> getVar("title");

That outputs, for example, “First chapter”. To get the second paragraph, one writes:

print $ut -> getVar("(2)p");

The function foreach iterates over all items returned by a selector. For example, to get all known species that are elephants:

$ut -> foreach("elephant", \&doSomething);

This calls the function doSomething once for each species that is an elephant.

In UTL one can embed a tag [v <selector>] that gets expanded by the interpreter with getVar. For example, if you have this text:

^ book {
	^ title : string
	^ chapter {
		^ title : string
		^ p : string
	}
}
~book 
~title Teach It Yourself
~chapter Preface
~p Reading "[v book.title]" will improve your skills.

A call to getVar for the first paragraph will return:

Reading "Teach It Yourself" will improve your skills.

Note that the selector is book.title; if it were just title, it would return “Preface”.

A loop over a selector can be included with a tag [foreach]. For example:

[foreach/ elephant]
[v common-name] (sc. [v scientific-name]) 
[/foreach]

This expands to something like:

savanna elephant (sc. Loxodonta africana africana)

Asian elephant (sc. Elephas maximus)

A selector can be simply a role name, but it can also be a complex query expression. For example, you can get the elephant list sorted by common name:

[foreach/ #elephant.<common-name]

And you can get all species described by Linnaeus with:

[foreach/ #species."Linnaeus" described-by]

provided you have a definition such as:

^species {
	...
	^described-by :string
}

Output Processors

It is possible to define custom tags and provide Perl code for expanding them. For example, one can define tags [ul] and [li] to declare item lists. The source text could look like this:

[ul/]
[foreach/ elephant]
	[li/][v common-name][/li]
[/foreach]
[/ul]

If you want a Perl script to generate a web page, you can bind both tags ul and li to a function that returns an HTML tag this way:

$ut->set_out_binding("HTML","ul",\&tag);
$ut->set_out_binding("HTML","li",\&tag);
sub tag
{
	my ($self,$all,$op,$mod,$param,$str) = @_;
	$str = $self->out($str);	# expand tags nested in the contents
	return "<$op>$str</$op>";
}

This tells the interpreter to call the function tag to render each ul and li tag it finds. (Actually a script does not need to provide these particular bindings, since the HTML module of the interpreter already sets them by default.)

The elephant list will then expand as:

<ul>
	<li>savanna elephant</li>
	<li>Asian elephant</li>
</ul>

From the same source file your Perl script can generate not only a web site but also a LaTeX source file for a printed book. For this you would change the function bindings to something like this:

$ut->set_out_binding("tex","ul",\&ul);
$ut->set_out_binding("tex","li",\&li);
sub ul
{
	my ($self,$all,$op,$mod,$param,$str) = @_;
	$str = $self->out($str);	# expand tags nested in the contents
	return <<"END";
	\\begin{itemize}
	$str
	\\end{itemize}
END
}
sub li
{
	my ($self,$all,$op,$mod,$param,$str) = @_;
	$str = $self->out($str);	# expand tags nested in the contents
	return "\\item $str ";
}

The elephant list will then expand as:

\begin{itemize}
	\item savanna elephant
	\item Asian elephant
\end{itemize}

Strings with embedded tags [...] are parsed by the interpreter, which calls the bound function for each tag it finds. The bound function receives as parameters the interpreter object, the tag name, the tag parameters and the contents:

[tag-name/ parameters]
content
[/tag-name]

The bound function must return the expanded text. It must explicitly expand the parameters and the contents if they could contain tags to be expanded. That makes it possible to define tags such as [foreach], whose contents must not be expanded before the bound processor is called. If they were, this code:

[foreach/ chapter][v title][/foreach]

would not output the chapter titles but repeat the title of the book as many times as there are chapters.

To expand the parameters and the contents, the bound function calls the method out from the interpreter object:

$param = $self -> out($param); 
$str = $self -> out($str);
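
The interplay between out and a bound function can be tried out with a toy model. The ToyOut class below is not the interpreter's implementation. It is a simplified stand-in written for this example: it ignores output formats, passes fewer arguments to the bound function, and only handles paired tags whose parameters contain no brackets and that are not nested inside a tag of the same name. The method name bind_tag is hypothetical.

```perl
use strict;
use warnings;

# Toy model of tag dispatch, NOT the interpreter's implementation.
package ToyOut;

sub new { my ($class) = @_; return bless { bind => {} }, $class }

# Register a processor for a tag (hypothetical, no output format here).
sub bind_tag { my ($self, $tag, $code) = @_; $self->{bind}{$tag} = $code }

# Expand "[tag/ params]content[/tag]" by calling the bound function with
# the raw, unexpanded parameters and content.
sub out {
    my ($self, $text) = @_;
    my $result = "";
    while ($text =~ m{\G(.*?)\[([\w-]+)/\s*([^\]]*)\](.*?)\[/\2\]}gcs) {
        # Copy the captures before any recursive call can overwrite them.
        my ($before, $tag, $param, $body) = ($1, $2, $3, $4);
        my $code = $self->{bind}{$tag};
        $result .= $before;
        $result .= $code ? $code->($self, $tag, $param, $body) : $body;
    }
    return $result . substr($text, pos($text) // 0);   # text after the last tag
}

package main;

# A bound function expands its own contents before wrapping them.
sub tag {
    my ($self, $op, $param, $str) = @_;
    $str = $self->out($str);
    return "<$op>$str</$op>";
}

my $toy = ToyOut->new;
$toy->bind_tag($_, \&tag) for qw(ul li);
print $toy->out("[ul/][li/]savanna elephant[/li][li/]Asian elephant[/li][/ul]"), "\n";
```

Because out hands the contents of [ul] to the bound function untouched, the [li] tags are only expanded when tag itself calls out. A processor such as the one for [foreach] can use this to expand the same contents once per selected unit.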

Sample: Website generation

The combination of text selectors, predefined tags and custom tags makes it pretty easy to generate output documents. Let us see how one can generate, for example, a web site.

One can define a structure like this:

^ webpage {
	^ title : string
	^ content {
		^ p : string
		^ h1 : string
	}
}

One can populate the above structure with handmade sources:

~webpage =index {
~title Some Web Site
~content
~h1 [v title]
Welcome to [v title]!
}
~webpage =contact {
~title Contact
~content
~h1 [v title]
You can contact me at jane@example.com
}

But one can also populate the webpage units by having the interpreter query other texts. For example, you collect information about elephant species in a text structure as mentioned above, and a Perl script generates UTL that creates a web page describing each species and a table of contents listing them all.

To generate the actual HTML files one can run a script that loops over the web pages and generates a file for each of them. The script could look like this:

$ut->foreach("webpage", <<"END");
[save/ [u].html]
<html>
<head>
	<title>[v title]</title>
</head>
<body>
[foreach/ content.?]
[if/]
    :h1     <h1>[v]</h1>
    :p      <p>[v]</p>
[/if]
[/foreach]
</body>
</html>
[/save]
END

The tag [u] expands to the current unit's name; [save] saves its contents as a file with the given name; [if] is a conditional operator that expands differently according to the current unit's type. The above generates two files:

index.html
contact.html

These are the contents of the index file:

<html>
<head>
    <title>Some Web Site</title>
</head>
<body>
<h1>Some Web Site</h1>
<p>Welcome to Some Web Site!</p>
</body>
</html>

Internals

Let us now see how the Universaltext Interpreter manages the text structure. At boot time the interpreter loads the following two text units:

^unit {
	^unit
	^binary
}

The symbol unit is the root text unit that is its own parent, type and role. All other text units in the system are descendants of it. This unit has the base functionality that applies to every text unit, for example parser support.

The unit named binary is a child of unit. It holds text units that contain data to be stored and retrieved in binary form. Binary data is unstructured: you cannot query or navigate it, you can only store it and retrieve it as a whole.

The implementation of binary units is hard-coded and cannot be overridden. If you need a unit to have binary data, you must define it as having the type binary or, more commonly, the type of a descendant unit of binary. The interpreter additionally defines some common binary units, such as string and cardinal, that are available to be instantiated.

If you need a unit to contain strings, you define for example this:

^ title : string

Then you can enter binary data when defining a title:

~title Chapter One: Introduction

One can obviously build new types on top of that. If one defines a subtype of a binary type, it automatically becomes a binary type, too.

^ file {
    ^ line : string
    ^ comment : line
    ^ code : line
}
=to-do ~file {
    ~comment That must be done:
    ~code int v(); //yields the current value
}

Here both comment and code are lines, thus strings, therefore they accept binary data.

The text structure is implemented as an array of scalars:

UNITS[ID] = (RID, TID, PID)

Each unit has an internal numerical identifier (a counter). For each unit one stores the identifiers of its role, its type and its parent. At boot time the units unit and binary are loaded with the identifiers 0 and 1 respectively.

UNITS[0] = (0, 0, 0)
UNITS[1] = (1, 1, 0)
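
The following Perl sketch models this table and shows how the rule that a subtype of a binary type is itself binary can be checked by following the type identifiers. It rests on assumptions that go beyond what is stated here: the name index %ID and the functions define and is_binary are hypothetical, a defined unit is assumed to be its own role, and a unit defined without an explicit type is assumed to have the type unit.

```perl
use strict;
use warnings;

# Sketch of the unit table: each entry holds [RID, TID, PID].
my @UNITS = (
    [0, 0, 0],    # 0: unit, its own role, type and parent
    [1, 1, 0],    # 1: binary, a child of unit
);
my %ID = (unit => 0, binary => 1);    # hypothetical name index

# Define a new unit. Assumptions: a definition is its own role, and both
# type and parent default to unit.
sub define {
    my ($name, $type, $parent) = @_;
    my $id = scalar @UNITS;           # identifiers are a simple counter
    push @UNITS, [$id, $ID{$type // 'unit'}, $ID{$parent // 'unit'}];
    $ID{$name} = $id;
    return $id;
}

# A unit accepts binary data if its chain of types reaches binary.
sub is_binary {
    my ($id) = @_;
    while ($id != $UNITS[$id][1]) {   # stop at a unit that is its own type
        $id = $UNITS[$id][1];
    }
    return $id == $ID{binary};
}

define('string', 'binary');
define('file');
define('line',    'string', 'file');
define('comment', 'line',   'file');

for my $name (qw(comment file)) {
    printf "%s is %s\n", $name, is_binary($ID{$name}) ? "binary" : "structured";
}
```

The loop terminates because both boot units are their own type, so every chain of types ends at identifier 0 or 1.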

Next Steps

Experimentation with the text structure has shown its usefulness for practical purposes: it is very expressive and flexible, and one can make major semantic changes afterwards without having to update many places manually. The main limitation of the Universaltext Interpreter arises from the implementation of embedded tags. Currently, tags are parsed by the interpreter at run time and processed by special-purpose Perl code. They should instead be parsed at feed time and produce regular text units that can be queried and output by regular means. That would make the interpreter much more text-oriented and thus simpler and more powerful.

The next project could be a text engine that not only parses and transforms text as the interpreter does, but also stores it persistently. This way one could maintain all documents and media files in a text repository. A client would check files in and out, although these files would not be stored as binaries but as parsed text. What you check in is not necessarily the same as what you check out. For example, one can check in the website definition above and check out the HTML pages from the server; the text engine generates them on request according to the current contents. If you are writing a book, you can check out a single chapter or the whole book as a word processor document, update it and check it in. Or you can get it as a LaTeX source file or as a PDF. You can also check out the book's outline as a spreadsheet document and update it. The text transformations are stored in the text repository and invoked automatically when required. The text engine knows about the dependencies between text units and updates them whenever necessary.

Based on the text engine, one can develop a text workbench. A text workbench is essentially a text editor that knows about the text structure. Instead of checking out a document as a word processor or spreadsheet document and using the corresponding application to edit it, at the text workbench you can query the text repository, view the results as prose, list or tree, and change the view at will. If you edit, for example, a piece of prose that has ~h headings and ~p paragraphs, you can see the prefix signs before the units, or you can hide them and just use another font size for headings, or both, and you can switch between presentation modes interactively. You can define your own views for particular unit types or single units. This ranges from customizing appearance to programmed add-ins that provide specialized rendering and editing capabilities, for example a programmer's plug-in that supports development.
