Kingdom Project
Overview
The goal of the project is to create a compelling software environment for the development and execution of applications that are adaptable to diverse delivery contexts, using formal methods to ensure reliability.
Put simply, we want to create software that doesn't suck.
Licensing
Code will be released under the GNU General Public License, version 3.
Other licenses may be available from YesLogic on request.
Focus Areas
- Regular Expressions
- Irregular Expressions
- Grammars & Parsing
- JavaScript
- Binary Data Formats
- Unicode Encoding
Regular Expressions
- Goals
-
Certified compilation of regular expressions to automata using Antimirov's partial derivatives construction.
- Motivation
-
Regular expressions are a formalism ideally suited to specification and implementation with formal methods. They are essential for text processing and form the basis of most markup schema languages. Antimirov's construction offers a new technique for compiling regular expressions to automata that has not been widely implemented, but promises improved efficiency over the traditional approach.
- References
-
Partial Derivatives of Regular Expressions and Finite Automata Constructions, V. Antimirov (CiteSeer)
Irregular Expressions
- Goals
-
Formal semantics and implementation for regular expressions with backreferences and other extensions.
- Motivation
-
Popular implementations of regular expressions typically add features such as backreferences, that are not found in the regular languages and are impossible to implement efficiently in the general case. Nevertheless, these features must be supported, and ideally would be implemented in such a way that they do not impose any performance penalty for expressions in which they are not used.
Grammars & Parsing
- Goals
-
Factoring left-recursive grammars while preserving equivalence and implementation of top-down parsers.
- Motivation
-
Many grammars can be parsed by top-down parsers, which are simple to implement and verify. However, left-recursive grammars must be factored first to avoid infinite recursion in the parser. The grammar given for JavaScript in the ECMA standard is a good example of a left-recursive grammar that is amenable to top-down parsing if factored.
JavaScript
- Goals
-
Formal specification of JavaScript semantics and implementation of a certified interpreter / JIT compiler.
- Motivation
-
JavaScript is the programming language of the modern web, powering interactive websites, online applications, games, and increasingly even the browsers themselves. Standards compliance, reliability, and security are crucial requirements for any JavaScript implementation. A certified JavaScript implementation conforming to a formal specification of the language semantics would be a strong foundation for the next generation of web user-agents.
- References
-
ECMA-262 — 3rd edition (HTML)
ECMAScript Edition 3 Errata (HTML)
EcmaScript 3.1 Working Draft, ES4 Wiki (HTML)
Binary Data Formats
- Goals
-
Design and implementation of a language for describing binary data formats.
- Motivation
-
Applications often need to process binary data formats such as OpenType fonts, PNG and JPEG images, multimedia container formats and ZIP archives. Typically these formats are supported by C libraries which offer no guarantees of correctness or reliability, and over the years there have been numerous security exploits targeting these libraries.
A data description language for describing binary data formats would offer a robust solution to these problems. The language would require dependent types, as array lengths and integer widths can depend upon earlier values in the binary data. From the format description it would be possible to programmatically derive functions for safe access to the data in a binary file, eliminating the need for tedious and error-prone manual coding.
- References
-
The Next 700 Data Description Languages, K. Fisher, Y. Mandelbaum, and D. Walker (PDF)
Unicode Encoding
- Goals
-
Formal specification of UTF-8 encoding and implementation of encoder/decoder.
- Motivation
-
UTF-8 decoding is simple, but surprisingly easy to get wrong, which has been the cause of several security exploits. An elegant formalisation of UTF-8 that could be used to verify implementations would be very useful. An extra complication is that implementations may be optimised to decode multiple characters simultaneously using SIMD instructions.
Future Ideas
Knuth / Plass paragraph breaking algorithm, a landmark in document layout. Perhaps this is a specific instance of general techniques dynamic programming and could be formalised as such?
Hexadecimal integer literals in Coq. Don't laugh, this is a serious issue!
Specialisation or partial evaluation allows greater use of abstraction in the code design without compromising runtime performance. For example, specifying multiple simple transformations that are combined into a more complex single-pass transformation. A common use case for this is text processing, which usually consists of a decoding step that transforms a byte sequence to a Unicode code point sequence, then a subsequent parsing step applied to the code point sequence. Expressing these as separate transformations improves clarity, but combining them can improve performance by removing the need to explicitly create the intermediate sequences. See Neil Mitchell's Supero optimisations for Haskell.
HTML parsing. This is a big one. How to create a formal spec when there is barely even an informal spec?
Are attribute grammars an appropriate formalisation for the "cascade" of formatting properties required by CSS and XSL-FO?
OpenType layout features combine glyph substitutions and context-sensitive rewrite rules in a way that can be quite inefficient if implemented naively. Is it worth compiling these into an automata of some kind?
Rasterisation has a strong geometrical basis, and should be amenable to formal treatment. For compositional imaging, the Haskell Pan library and Core Image MacOS X API offer promising directions.
IO with left-fold enumerators?