Fundamentals of Universal Text
Abstract: There are two data entities: text units and binary data. The first contains structured data, the second unstructured data. There are three kinds of operations: transformation, rendering, and parsing. They are programmatic actions from text to text, from text to binary, and from binary to text, respectively.
Last updated on Sun Jun 5, 2011.
A single text unit is a 4-way relationship between text units:
(unit, parent, role, type)
The tuple above defines unit as having a particular parent, a particular role, and a particular type, all of which are known units (defined here or elsewhere).
- The parent-child relationship defines a hierarchy between units.
- The type-of relationship defines the properties and behavior of the unit. Each instance inherits all characteristics from its type; the type is itself an instance, so it in turn inherits from its own type, and so forth.
- The role relationship is a correspondence between a child unit and a child of the parent's type.
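As an illustration, the tuple and its relationships can be sketched in Python. The names UnitTuple, children_of and type_chain, the sample units, and the self-describing root unit are assumptions made for this sketch; they are not part of the definition above.

```python
from typing import List, NamedTuple


class UnitTuple(NamedTuple):
    """One (unit, parent, role, type) tuple; every field names a unit."""
    unit: str
    parent: str
    role: str
    type: str


# Hypothetical sample data. "root" refers to itself in every position,
# which is an assumption about how the hierarchy is anchored.
UNITS = [
    UnitTuple("root", "root", "root", "root"),
    UnitTuple("document", "root", "root", "root"),      # a type
    UnitTuple("title", "document", "root", "root"),     # child of that type
    UnitTuple("doc1", "root", "root", "document"),      # instance of document
    UnitTuple("doc1_title", "doc1", "title", "root"),   # plays the role "title"
]


def children_of(tuples: List[UnitTuple], parent_id: str) -> List[str]:
    """Parent-child relationship: units whose parent is parent_id."""
    return [t.unit for t in tuples
            if t.parent == parent_id and t.unit != parent_id]


def type_chain(tuples: List[UnitTuple], unit_id: str) -> List[str]:
    """Type-of relationship: the types a unit inherits from, nearest first."""
    by_unit = {t.unit: t for t in tuples}
    chain, seen, current = [], {unit_id}, unit_id
    # Follow type links until they loop back to a unit already visited.
    while by_unit[current].type not in seen:
        current = by_unit[current].type
        chain.append(current)
        seen.add(current)
    return chain


print(children_of(UNITS, "doc1"))   # ['doc1_title']
print(type_chain(UNITS, "doc1"))    # ['document', 'root']
```

In this sample, doc1_title has the role title, and title is a child of document, which is the type of doc1_title's parent. That is the correspondence the role relationship describes.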
Text Integrity Rules
The following integrity constraints apply to all text units:
- Every unit that participates as parent, role, or type must also participate in some tuple (possibly the same one) as unit.
- Each unit participates in exactly one tuple as unit.
- The role unit must be a child of the parent's type.
- The type of each unit must be the type of its role or a subtype of it.
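The rules can be checked mechanically. The following Python sketch is one possible reading, not a reference implementation. The names UnitTuple, is_subtype and integrity_violations are hypothetical, and "subtype" is taken to mean "reachable by following type links", which is an assumption since the text does not define it.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class UnitTuple(NamedTuple):
    unit: str
    parent: str
    role: str
    type: str


def is_subtype(by_unit: Dict[str, UnitTuple], candidate: str, ancestor: str) -> bool:
    """True if candidate is ancestor or reaches it via type links (assumed meaning)."""
    seen, current = set(), candidate
    while current not in seen and current in by_unit:
        if current == ancestor:
            return True
        seen.add(current)
        current = by_unit[current].type
    return False


def integrity_violations(tuples: List[UnitTuple]) -> List[str]:
    problems = []
    # Each unit is defined by exactly one tuple.
    for unit_id, n in Counter(t.unit for t in tuples).items():
        if n != 1:
            problems.append(f"{unit_id}: defined {n} times")
    by_unit = {t.unit: t for t in tuples}
    for t in tuples:
        # Every referenced parent, role, and type must itself be a defined unit.
        missing = [r for r in (t.parent, t.role, t.type) if r not in by_unit]
        if missing:
            problems.append(f"{t.unit}: undefined reference(s) {missing}")
            continue
        # The role must be a child of the parent's type.
        if by_unit[t.role].parent != by_unit[t.parent].type:
            problems.append(f"{t.unit}: role {t.role} is not a child of the parent's type")
        # The unit's type must be the role's type or a subtype of it.
        if not is_subtype(by_unit, t.type, by_unit[t.role].type):
            problems.append(f"{t.unit}: type {t.type} does not fit role {t.role}")
    return problems


# Hypothetical sample with a self-describing root unit (an assumption).
VALID = [
    UnitTuple("root", "root", "root", "root"),
    UnitTuple("document", "root", "root", "root"),
    UnitTuple("title", "document", "root", "root"),
    UnitTuple("doc1", "root", "root", "document"),
    UnitTuple("doc1_title", "doc1", "title", "root"),
]

print(integrity_violations(VALID))  # []
```

Adding a tuple such as ("stray", "doc1", "root", "root") to the sample produces one violation, because root is not a child of document, the type of doc1.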
Binary data consists of a sequence of bytes.
A transformation is an operation that applies to a single text unit and has a single text unit as result. A transformation is deterministic if and only if it always produces the same output text unit when applied to the same input text unit.
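To make the definition concrete, a transformation can be modelled as a function from one text unit to one text unit. The Python sketch below uses hypothetical names (UnitTuple, move_to_archive, stamp_role, looks_deterministic). Repeated application can only refute determinism on a sample; it cannot prove it.

```python
import itertools
from typing import Callable, NamedTuple


class UnitTuple(NamedTuple):
    unit: str
    parent: str
    role: str
    type: str


Transformation = Callable[[UnitTuple], UnitTuple]


def move_to_archive(u: UnitTuple) -> UnitTuple:
    """Deterministic: the result depends only on the input unit."""
    return u._replace(parent="archive")


_calls = itertools.count()


def stamp_role(u: UnitTuple) -> UnitTuple:
    """Not deterministic: the result also depends on hidden call state."""
    return u._replace(role=f"role-{next(_calls)}")


def looks_deterministic(transform: Transformation, sample: UnitTuple, runs: int = 5) -> bool:
    """Spot check: the same input must give the same output on every run."""
    first = transform(sample)
    return all(transform(sample) == first for _ in range(runs))


sample = UnitTuple("doc1", "root", "root", "document")
print(looks_deterministic(move_to_archive, sample))  # True
print(looks_deterministic(stamp_role, sample))       # False
```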
Parsing is an operation on binary data that produces text units and their binary data. The input data is called the representation (coming from a human user or a digital device); the output is called the content (intended for structured or unstructured storage in the text system).
Rendering is an operation on a text unit and its binary data that produces binary data. It is the reverse operation of parsing. The input data is the content and the output is a representation of it, suitable for a human user, a computer system or, more generally, a digital device.
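The relationship between parsing and rendering can be illustrated with a toy round trip in Python. The line format used here (unit,parent,role,type,hex-payload, one unit per line, UTF-8) is invented for this sketch, as are the names Content, parse and render; the text does not prescribe any representation.

```python
from typing import List, NamedTuple


class Content(NamedTuple):
    """A text unit together with its binary data."""
    unit: str
    parent: str
    role: str
    type: str
    data: bytes


def parse(representation: bytes) -> List[Content]:
    """Binary to text: turn a representation into content."""
    content = []
    for line in representation.decode("utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        unit, parent, role, type_, payload = line.split(",")
        content.append(Content(unit, parent, role, type_, bytes.fromhex(payload)))
    return content


def render(content: List[Content]) -> bytes:
    """Text to binary: turn content back into a representation."""
    lines = [",".join([c.unit, c.parent, c.role, c.type, c.data.hex()])
             for c in content]
    return "\n".join(lines).encode("utf-8")


representation = b"root,root,root,root,\ndoc1,root,root,root,68656c6c6f"
content = parse(representation)
print(content[1].data)                     # b'hello'
print(render(content) == representation)   # True
```

The round trip is exact here only because the sample input is already in the form that render emits. In general a renderer may produce a different representation of the same content.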