UTL Syntax
Universal Text Language Syntax
The Universal Text Language (UTL) is a computer language for expressing Universal Text.
The following different syntaxes are defined:
- Simple UTL Syntax
- Prose UTL Syntax
- Symbolic UTL Syntax
- Script UTL Syntax
The Raw UTL format is specified separately because it is a reduced language that is grounded not on the Universal Text layer but on the basic text layer.
Last update on Sun Jun 5, 2011.
UTL Files
A UTL file is a group of lines that are parsed together and share context. It usually corresponds to a conventional operating system file, but it is not restricted to one. UTL is encoded in UTF8.
A UTL file is identified by its first line:
[UTL/<version-major>.<version-minor> <format>
UTL must be in capital letters. The version numbers must be in decimal base. No whitespace is allowed before the major version number, whitespace between the minor version number and the format is mandatory, and leading and trailing whitespace for the whole line is optional.
For example:
[UTL/2.0 Prose
The version numbers correspond to the version of the text engine; they are not specific to the UTL format.
The format name is case-sensitive and refers to a text unit defined in the text repository with type UTL-Syntax.
Configuration parameters can be set for the format. Each format defines which parameters it accepts as its children with type Format Parameter. The parameters can be passed on the first line of the UTL file, after the format name, for example:
[UTL/2.0 Prose Separator="," Encoding="UTF8"
or changed inside the file for the next lines with:
[%format Separator="," Encoding="UTF8"]
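The rules for the first line can be sketched in Python as follows. This is only an illustration: the function name parse_opening_line and the layout of the returned dictionary are assumptions, not part of UTL, and parameter values are assumed to be double-quoted as in the examples above.

```python
import re

# Hypothetical helper, not part of any UTL implementation.
OPENING_LINE = re.compile(
    r'^\s*\[UTL/(?P<major>[0-9]+)\.(?P<minor>[0-9]+)\s+(?P<format>\S+)(?P<params>.*?)\s*$'
)
PARAMETER = re.compile(r'(\w+)="([^"]*)"')


def parse_opening_line(line):
    """Return version, format and parameters, or None if the line opens no file."""
    match = OPENING_LINE.match(line)
    if match is None:
        return None
    return {
        "version": (int(match["major"]), int(match["minor"])),
        "format": match["format"],  # case-sensitive, kept as written
        "parameters": dict(PARAMETER.findall(match["params"])),
    }


print(parse_opening_line('[UTL/2.0 Prose Separator="," Encoding="UTF8"'))
```

The sketch does not handle format names that contain escaped whitespace.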
The last line of a UTL file must consist of just a closing square bracket (optionally with leading or trailing white space). Example of a UTL file:
[UTL/2.0 Symbol
[...]
]
Files can be embedded in other files:
[UTL/2.0 Prose
[...]
[UTL/2.0 Symbol
[...]
]
[...]
]
Each new file opens a new instance of the parser and the current unit is reset.
The parser closes the current instance when it comes to a standalone ]. If the file was embedded in another file, it returns to the previous parser instance. When closing the outermost file, it ignores all following lines until a new file begins, if any. Lines before the beginning of a file are also ignored. Example:
these lines ignored
[UTL/2.0 Prose
[... this is parsed ...]
]
these lines also ignored
by the parser
[UTL/2.0 Prose
[... this is parsed ...]
]
these lines ignored, too
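The opening, embedding and closing of files described above can be sketched in Python with a simple depth counter. The function name scan_files is an assumption made for illustration, and the sketch recognises an opening line only by its "[UTL/" prefix instead of checking the full first-line format.

```python
def scan_files(lines):
    """Yield (depth, line) for every line inside a UTL file; skip the rest.

    depth is 1 inside an outermost file and grows with each embedded file.
    Opening lines and standalone "]" lines are consumed, not yielded.
    """
    depth = 0  # number of parser instances currently open
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("[UTL/"):
            depth += 1          # a new file opens a new parser instance
        elif stripped == "]" and depth > 0:
            depth -= 1          # a standalone ] closes the current instance
        elif depth > 0:
            yield depth, line   # content handed to the current parser
        # lines outside any file (depth == 0) are ignored


sample = [
    "these lines ignored",
    "[UTL/2.0 Prose",
    "outer line",
    "[UTL/2.0 Symbol",
    "inner line",
    "]",
    "outer again",
    "]",
    "these lines ignored, too",
]
for depth, line in scan_files(sample):
    print(depth, line)
```

An escaped bracket such as \] is not equal to a standalone ] and is therefore passed on as content.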
Special Characters
UTL files can contain any UTF8 characters. Only three of them have a special meaning: [, ] and \.
The square brackets tell the UTL parser which syntax a segment has. They mark the beginning and end of files and the beginning and end of string segments that must be parsed with an alternate parser.
The backslash is used as an escape character. The parser then interprets the next character as a literal one without any syntactic function. For example, an escaped square bracket does not mark the end of a file:
[UTL/2.0 Prose
[... lines to be parsed ...]
\]
[... lines to be parsed ...]
]
The Prose UTL parser sees this:
[... lines to be parsed ...]
]
[... lines to be parsed ...]
One can also escape whitespace to make the parser treat several words as a single token:
[my\ new\ parser some content]
With the escapes, the parser "my new parser" is called to parse "some content". Without them, the parser "my" would be called to parse "new parser some content".
The character "\" is interpreted as an escape character only if it changes the syntax; otherwise it is ignored.
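The effect of escaped whitespace on tokenisation can be sketched in Python as follows. The function name split_tokens is an assumption made for illustration. The sketch also assumes that a backslash is always dropped and that the character after it is always taken literally, which is one possible reading of the rule above.

```python
def split_tokens(text):
    """Split text on unescaped whitespace; a backslash makes the next character literal."""
    tokens, current, in_token = [], [], False
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            current.append(next(chars, ""))  # escaped character, taken literally
            in_token = True
        elif ch.isspace():
            if in_token:                     # unescaped whitespace ends the token
                tokens.append("".join(current))
                current, in_token = [], False
        else:
            current.append(ch)
            in_token = True
    if in_token:
        tokens.append("".join(current))
    return tokens


print(split_tokens(r"my\ new\ parser some content"))
print(split_tokens("my new parser some content"))
```

In the first call the first token is the whole parser name; in the second call it is only the first word.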
The syntax [\ ... content ...] makes the content syntactically a single token, irrespective of its characters.
~p ""
this is some paragraph
[\/]
[UTL/2.0 Prose
this is not another file, but the same paragraph
]
[/\]
this is still the same paragraph
""
Raw UTL Format
This is the specification of Raw UTL, a simple file format for expressing the contents of a text repository.
Raw UTL Files
A Raw UTL file is defined as a general UTL file. Example:
[UTL/2.0 Raw
[...]
]
Raw UTL is coded in UTF8. Empty lines are ignored.
Header Line
The first non-empty line after the opening line of a Raw UTL file describes the units contained in the file:
<unit count> <check sum>
Whitespace before and after both numbers is optional; whitespace between them is mandatory. Both the unit count and the check sum are given in hexadecimal base, in either upper or lower case.
The check sum is the hash value for all the unit parent, role and type Ids excluding binary data. How the hash is computed depends on the UTL version. The check sum is optional; it can be left out by putting a - in its place.
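Reading the header line can be sketched in Python as follows. The function name parse_header is an assumption made for illustration. The check sum is only read and not verified, because the hash function depends on the UTL version. Python's int(text, 16) is slightly more lenient than the format described here, for instance it accepts a 0x prefix.

```python
def parse_header(line):
    """Parse '<unit count> <check sum>', both fields in hexadecimal."""
    fields = line.split()  # surrounding whitespace is optional
    if len(fields) != 2:
        raise ValueError("expected a unit count and a check sum")
    count = int(fields[0], 16)
    # A "-" stands for an omitted check sum.
    checksum = None if fields[1] == "-" else int(fields[1], 16)
    return count, checksum


print(parse_header("b 5b"))
print(parse_header("  B -  "))
```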
Lines
The following lines have this format:
=<unit Id> ^<parent Id> ~<role Id> :<type Id> [<binary data>]
[<binary data>]
Leading and trailing whitespace is irrelevant; whitespace between Ids and between an Id and binary data is mandatory.
The Ids are local to the repository instance. They are expressed in hexadecimal base, in either upper or lower case. Their order is irrelevant, but all four must be given, each of them once.
The binary data is expressed in this format:
<length> <check sum> <data>
The length and the check sum are expressed in hexadecimal base, in either upper or lower case. The length is the byte count of the data.
The check sum is the hash of the data; how it is computed depends on the UTL version. The check sum is optional; it can be left out by putting a - in its place.
The data is given in base 64. The length and the check sum must appear in that order on the unit definition line. The data can span more than one line; whitespace and line breaks within it are ignored. Example:
=A0 ^7B :65 ~8C 0F 2A
EAKD0494
479845==
If a binary unit does not have binary data, length (0) and check sum (or -) must be given if its type is not a binary unit itself; otherwise they are optional.
For non-binary units, length, check sum and binary data are not allowed.
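A unit definition could be read with a Python sketch like the following. The function name parse_unit and the shape of the returned values are assumptions made for illustration. The sketch tolerates whitespace between a prefix and its Id, as in the sample file below. It does not verify the check sum and does not compare the declared length with the decoded data, which a real reader would do.

```python
import base64
import re

# One prefixed hexadecimal Id; whitespace after the prefix is tolerated.
ID_FIELD = re.compile(r"\s*([=^~:])\s*([0-9A-Fa-f]+)(?=\s|$)")
NAMES = {"=": "unit", "^": "parent", "~": "role", ":": "type"}


def parse_unit(text):
    """Parse one unit definition, which may span several lines."""
    ids, pos = {}, 0
    for _ in range(4):  # four Ids in any order, each of them once
        match = ID_FIELD.match(text, pos)
        if match is None or NAMES[match[1]] in ids:
            raise ValueError("expected four distinct prefixed Ids")
        ids[NAMES[match[1]]] = int(match[2], 16)
        pos = match.end()
    rest = text[pos:].split()  # whitespace and line breaks are ignored
    if not rest:
        return ids, None       # unit without binary data
    if len(rest) < 2:
        raise ValueError("expected length and check sum")
    length = int(rest[0], 16)
    checksum = None if rest[1] == "-" else int(rest[1], 16)
    data = base64.b64decode("".join(rest[2:]))
    return ids, (length, checksum, data)


# Hypothetical unit whose data is the five bytes of "hello".
payload = base64.b64encode(b"hello").decode("ascii")
print(parse_unit(f"=A0 ^7B :65 ~8C 05 - {payload}"))
```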
Sample Raw UTL File
This is a file generated by UText/2.0:
Universal Text Engine/2.0 (devel) - Repository Dump
Host name: pc64
Repository: /home/francesc/devel/UText/v2/trunk/test-repo
Date: 28/Feb/2010:09:35:48 +0100
Unit count: 11
[UTL/2.0 Raw
b 5b
= 0 ^ 0 ~ 0 : 0
= 1 ^ 0 ~ 1 : 1
= 2 ^ 0 ~ 2 : 1
= 3 ^ 0 ~ 3 : 1
= 4 ^ 0 ~ 4 : 4
= 5 ^ 4 ~ 5 : 1
= 6 ^ 4 ~ 6 : 1
= 7 ^ 0 ~ 4 : 4
= 8 ^ 7 ~ 5 : 3
= 9 ^ 7 ~ 5 : 2
= a ^ 7 ~ 6 : 3
]
End Of Transmission.
UTL Syntax
Configurations
The syntax of UTL is not bound to particular signs; the parser can be freely configured instead. The following settings are configurable:
- The prefixes for unit, role and type, which default to =, ~ and :.
- The prefix for binary data, by default not set.
- The default word type, to be inferred if a word without prefix is found, by default "binary data".
- The brackets for opening text level, parser and comment blocks, defaulting to {}, [] and {-- --}.
- The brackets for tags embedded in ustrings, defaulting to [].
- The mark for the end of an instruction, defaulting to a line break.
- The word delimiters depending on word type, defaulting to an empty string, ' and " for binary data and defaulting to whitespace for other word types.
A UTL expression can change each of these settings at will:
- for a particular unit and its descendants, or
- for a particular type and its instances, or
- for some lines or line blocks, or
- for the rest of the file.
A UTL expression can also define configuration sets consisting of one or more of these settings. These sets can then be applied to UTL expressions.
The text engine defines two configuration sets by default, one for prose writings and one for formal language.
Simple UTL Syntax
Simple UTL is a regular format that is easy and cheap to parse, intended for interprocess communication.
Each line defines exactly one unit with this form:
~role :type =name data
The order of the three prefixed words is free, but all of them must be present. If the role, type or unit name contains a space, it must be escaped. The data is encoded by default in UTF8. This can be changed with a format parameter, for example:
[UTL/2.0 Simple Encoding="Base64"
Lines that follow one another are children of the same parent. To enter a new level, a line containing just { is mandatory. To leave a level, a line containing just } is mandatory.
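The line form and the level rules of Simple UTL can be sketched in Python as follows. The function name parse_simple, the dictionary layout and the unit names in the example are assumptions made for illustration. The sketch reads only the body of a file, without the opening line and the closing ], and does not decode the data.

```python
import re

# A prefixed word; a backslash escapes the character that follows it.
WORD = re.compile(r"\s*([~:=])((?:\\.|\S)+)")
KEYS = {"~": "role", ":": "type", "=": "name"}


def parse_simple(lines):
    """Build a tree of units from the body lines of a Simple UTL file."""
    root = {"children": []}
    stack, last = [root], root  # stack[-1] is the parent of the current level
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == "{":
            stack.append(last)  # enter: the last unit becomes the parent
        elif stripped == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced }")
            last = stack.pop()  # leave: back to the parent's level
        else:
            unit, pos = {"children": []}, 0
            for _ in range(3):  # three prefixed words, in any order
                match = WORD.match(line, pos)
                if match is None or KEYS[match[1]] in unit:
                    raise ValueError(f"bad line: {line!r}")
                unit[KEYS[match[1]]] = re.sub(r"\\(.)", r"\1", match[2])
                pos = match.end()
            unit["data"] = line[pos:].strip()
            stack[-1]["children"].append(unit)
            last = unit
    return root["children"]


body = [
    "~title :text =main Hello",
    "=ch1 ~chapter :section First chapter",
    "{",
    r"~p :text =first\ par Some text here",
    "}",
]
for unit in parse_simple(body):
    print(unit["name"], [child["name"] for child in unit["children"]])
```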
Prose UTL Syntax
Prose UTL is intended to give some structure to prose writings. The main focus is on the readability of prose paragraphs and titles, whose flow should not be obstructed.
A line consists of zero or more prefixed words and a content string and ends with a newline character. The prefixes are: ~ (role), : (type), = (name), % (transformation), ? (multiline). A line consisting of just a content string defines an unnamed unit with default role and type (by child and binary role inference).
A word is a delimited word or an implicit word. An implicit word consists of non-whitespace non-delimiter characters. A delimited word begins with a delimiter character and ends with a corresponding delimiter character. Word delimiters are " and '.
Examples:
~t This is the Title of the Book
~t :chapter This is the First Chapter
~"book part" Part I
This is a paragraph
The content string can be delimited with a pair of word delimiters; if it is not, it begins and ends with the first and last non-whitespace characters.
If the prefix ? is present, the content is a multiline string. The word prefixed with ? is the multiline end mark.
The multiline string begins at the next line and ends before the end mark, which is placed on a line of its own.
~input ? ""
first input line of two
second input line of two
""
~input ? END
first input line of three
""
third input line of three
END
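Collecting a multiline string up to its end mark can be sketched in Python as follows. The function name read_multiline is an assumption made for illustration, and the sketch assumes that the word prefixed with ? is the last word on its line.

```python
import re

# The ? prefix followed by the end mark, at the end of the line.
END_MARK = re.compile(r"\?\s*(\S+)\s*$")


def read_multiline(lines, start):
    """Return (content, index of the line after the end mark).

    lines[start] is the line that carries the ? prefix.
    """
    match = END_MARK.search(lines[start])
    if match is None:
        raise ValueError("line has no ? prefix")
    mark, body = match[1], []
    for index in range(start + 1, len(lines)):
        if lines[index].strip() == mark:  # end mark on a line of its own
            return "\n".join(body), index + 1
        body.append(lines[index])
    raise ValueError(f"end mark {mark!r} not found")


lines = [
    '~input ? ""',
    "first input line of two",
    "second input line of two",
    '""',
    "~input ? END",
    "first input line of three",
    '""',
    "third input line of three",
    "END",
]
content, following = read_multiline(lines, 0)
print(content)
print(read_multiline(lines, following)[0])
```

In the second unit the line containing only "" is part of the content, because the end mark there is END.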
A block is a delimited block or an implicit block. An implicit block is defined by parent unit inference. A delimited block is enclosed between { and }. A unit block defines the children of a unit.
~chapter First Chapter
first paragraph
second paragraph
~chapter Second Chapter
{
first paragraph
second paragraph
}
The prefix % calls a transformation identified by its name. Several transformations can be piped, provided their unit types are compatible. They are executed from right to left, with the output of each one passed as input to the next.
The transformation can be a content parser. If no parser is given, the implicit parser according to the unit's type is assumed.
~chapter ="Appendix A" %index ? ""
include index names
include index topics
""
~chapter ="Appendix B" %index include index cross-references
With the prefix % one can introduce a renderer, too.
~workname %getworkname author "Ted Nelson" . work "Literary Machines"
The renderer "getworkname" expects a parameter of type "selector". The content string is therefore cast into a selector by the selector's implicit parser and passed to the transformation, which returns a string containing the formatted name of the work.
The prefix % can invoke a transformation that generates UTL.
%template standard
The resulting unit describes zero or more children consisting of role, type and unit name and optionally binary content, which are appended as children of the current unit.
(If %template is a renderer, that is, if the returned unit has a binary type, then only one unit is appended with this content.)
Symbolic UTL
Symbolic UTL is intended to specify formal expressions, with emphasis placed on the logical structure. It is a general-purpose symbolic notation.
It is inspired by my paper "Symbolic Language" (in German), which describes a general-purpose symbolic language.
Example:
my-web { index "This is the index"; ~h1 "First things first";
~p "As you all know..." }
Script UTL
Script UTL is a lightweight word-based language intended to be used interactively at the UTL shell and in small scripts such as configuration files or batch processing commands. It has a regular syntax, <keyword> [<object>]. Keywords can be optional, in which case the arguments (objects) are recognized by position.
Sample interactive UTL script session. User input is prefixed with the prompt ">"; system answers appear without a prompt:
> verbose output
> load ~francesc/geneaweb.utl
file /home/francesc/geneaweb.utl read, 340 lines in 3 UTL blocks found
> ls :website
2 websites
family.geneaweb John 23/05/2009
work.animal-zoo Jane 12/01/2007
> export geneaweb
generating website geneaweb, 120 pages in 4 directories
uploading website geneaweb to ftp://ftp.example.com/~smith/web/genea
> export geneaweb
files are up to date
> output [foreach/ :website.webpage]"[v title]"[/foreach]
"Genealogy Web of Family Smith"
"Animal Zoo"
> foreach :website.webpage output value title
Genealogy Web of Family Smith
Animal Zoo
Sample UTL script configuration file for a web server:
hosts u-tx.net www.philosophisches-lesen.de francesc.hervada.cat
logfile /var/logs/{host}
root /cp/ for u-tx.net
alias www.u-tx.net for u-tx.net
alias philosophisches-lesen.de for www.philosophisches-lesen.de
default host u-tx.net
for francesc.hervada.cat francesc.hervada.org francesc.hervada.net
redirect /wordpress.html to http://wordpress.org
for philosophisches-lesen.de
gone werke.html

