Copyright © 2020 The New York Times
ArchieML is a text format optimized for human writability. It was designed for users unfamiliar with existing serialization formats, and defines a forgiving, incremental parser that does not throw syntax errors.
This document is a draft of a potential specification. It has no official standing of any kind and does not represent the support or consensus of any standards organization.
This is the first candidate recommendation of the Archie Markup Language specification. It is a pre-release draft of the spec. As such, it can be implemented in parsers; however, this candidate should not be considered the definition of ArchieML 1.0.
The public is invited to provide feedback on this draft, either through the Github Issues page for archieml.org, or via email. The comment period on candidate recommendations for the ArchieML 1.0 spec will remain open for at least four weeks following the most recent candidate recommendation.
At that time, this candidate will either be revised or promoted to a more final state, eventually reaching recommendation status for the 1.0 release. That recommendation will then be considered the definition of ArchieML 1.0. Implementations will not need to adopt major changes made to the spec after that point in order to be 1.0-compliant.
Until the release of version 1.0, parsers should note the version of the candidate spec they are following.
:skip blocks, when placed inside a multi-line value, should break up the value, causing only the first line of the value to be stored.
Parsers should implement a system by which whole lines are interpreted either as a command line, or a plain text line to be read into a buffer.
"Command" lines are defined as those that begin with any of the following general patterns, with token defined as above:
token:
:token
{token}
[token]
{}
[]
*
Any non-newline whitespace may appear at either end of a token or any of the punctuation used above without affecting the parsing. Special lines should be recognized based on how they start; all text after the patterns mentioned above is valid and should not affect the line's status. For example, each of the following is a valid :token line:
:end
:end this line
:endthisline
When a command line is found, logic should be executed that immediately affects the output. For example, if a key/value pair is encountered, the value specified on that line should immediately be added to the output. A document may end at any time, and the output object should always be in a state ready to be returned by the parser.
Based on context, some lines that fit one of the above patterns may not be defined as special; for example, bullet point lines ("* anything") are only special when the parser is inside an array. In these cases (defined below), the line should be interpreted as a plain text line.
All lines that are not commands are interpreted as plain text. When they are encountered, they should not immediately affect the output in any way. Instead, every plain text line should be incrementally added to a buffer that is used during multi-line value parsing, and emptied whenever a special line is encountered.
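The split between command lines and plain text lines can be sketched with a few regular expressions. The Python below is a minimal, context-free illustration, not a reference implementation: the names TOKEN, WS, LINE_PATTERNS and classify are hypothetical, and the token pattern assumes a token is made of alphanumerics, hyphens, underscores and periods. It deliberately ignores the context rules (for example, bullet lines only being special inside an array).

```python
import re

# Assumption: a token is alphanumerics, hyphens, underscores and periods.
TOKEN = r"[A-Za-z0-9_\-.]+"
# Any whitespace except a newline.
WS = r"[^\S\n]*"

# Each pattern only looks at how the line starts; trailing text is allowed.
LINE_PATTERNS = [
    ("key_value", re.compile("^" + WS + TOKEN + WS + ":")),
    ("command_key", re.compile("^" + WS + ":" + WS + TOKEN)),
    ("object", re.compile("^" + WS + r"\{" + WS + "(?:" + TOKEN + ")?" + WS + r"\}")),
    ("array", re.compile("^" + WS + r"\[" + WS + r"[.+]*" + "(?:" + TOKEN + ")?" + WS + r"\]")),
    ("bullet", re.compile("^" + WS + r"\*")),
]


def classify(line):
    """Return the name of the first matching command pattern, or "text"."""
    for name, pattern in LINE_PATTERNS:
        if pattern.match(line):
            return name
    return "text"
```

A full parser would call something like classify on every line, then decide from its current state whether a matching line really acts as a command or falls back to plain text.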
ArchieML does not throw syntax errors. This is to make parsing of documents as simple as possible, but also to avoid requiring writers to be directly aware of the parsing mechanism. Documents are parsed line by line, and any line that does not conform to one of the special line syntaxes above is interpreted as plain text. Syntax errors that cause a command line to be treated as plain text thus do not "break" the parsing of a document, but instead generally omit that line from the output.
Many of the command lines are set up to "reset" the document to some state. This generally means that any mistakes that occur in a document remain local, and do not harm the rest of the document.
If a key is encountered twice, the latter occurrence should take precedence. Arrays that are defined on top of an existing value or array should also be replaced. Thus, no validation should be made as to whether a key is already taken. In the following example, the output should contain "second value" as the value for "key".
key: first value
key: second value
A partial exception to this is in defining object blocks: if an object already exists at the key being defined, its value should not be deleted. In the following example, both "first" and "second" should remain defined.
{scope}
first: 1
{}
{scope}
second: 2
{}
When redefining a value, it's possible to change the data structure of the output. For example, a key can change from holding a string to being an object namespace with keys underneath it. In these cases too, the latter definition should take precedence, and parsers must override the existing data structure of the output as necessary to accommodate the new data type (such as replacing a string value with an object). Here, the output should define "key" as {"subkey": "subvalue"}:
key: value
{key}
subkey: subvalue
{}
Parsers should accept several options for modifying how documents are processed.
Keys, both at the top level of the output, and within nested objects, may contain only alphanumeric characters, hyphens and underscores, in any sequence. If other characters are included, then the line should not be interpreted as a key/value pair, and should instead be treated as plain text.
Keys are defined wholly or in part within key/value lines, as well as lines defining {objects} and [arrays]. The same rules apply in all cases, and should follow the definition of a token above.
Any key may also include any number of periods (.) that function as dot-notation to define nested values. At the time this key or object is defined, the output should conform to the data structure required by that key. Any conflict resolution that needs to take place (such as replacing previous string values with objects) should occur immediately.
The following lines are all valid. Successive lines in each group will end up overwriting the previous value with an object, in order to hold the new key.
key: value
key.key: value
key.key.key: value
[array]
[array.array]
[array.array.array]
{scope}
{scope.scope}
{scope.scope.scope}
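As an illustration of dot-notation keys and the immediate conflict resolution described above, here is a small Python sketch. The function name set_path is hypothetical, and the output is assumed to be a plain dict; a real parser may organize this differently.

```python
def set_path(output, dotted_key, value):
    """Store value at a dot-notation key, replacing conflicting values."""
    parts = dotted_key.split(".")
    scope = output
    for part in parts[:-1]:
        # Anything that is not already an object (for example a string)
        # is replaced by an empty object so the nested key can be held.
        if not isinstance(scope.get(part), dict):
            scope[part] = {}
        scope = scope[part]
    # The latter occurrence always takes precedence.
    scope[parts[-1]] = value
    return output
```

Applied in order to the first group of lines above, each call overwrites the previous value with an object that holds the new key.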
Values are the characters that are stored within the output's data structure. Special lines beginning with token: or * define values. Any characters that follow the special beginning of the line should be immediately stored as a value.
Values are always stored in the output as strings. Leading and trailing whitespace should be stripped away. A value ends when the parser encounters a newline.
All plain-text lines should be stored into a buffer that is emptied whenever any command line is encountered. This is because values that span multiple lines are defined by adding an "ending anchor tag", :end, on the line following the value. Thus the parser cannot know ahead of time whether lines 2+ of a value need to be included, until the value has been defined in full.
key: This value should immediately be saved.
This will be read into a buffer.
So will this.
:end Now the buffer should be emptied into "key", adding on two additional lines to the value.
When a command line beginning with :end is encountered, the buffer should be emptied. If the previous command that was parsed was either a token: or a * line (each of which signals the start of a value), then the contents of the buffer should be appended to that value, which should already contain the first line of the value.
Because the first line of a value is always inserted into the output immediately (before the rest of the value is parsed), and surrounding whitespace around the initial values is discarded, care should be taken to preserve newlines between the first and second line of a value. A simple way to accomplish this is to send any trailing newline whitespace from the first line of a value into the buffer, so that it is included when you append the buffer to the value's first line.
As characters are added to the buffer, all whitespace should be preserved, including newlines. The end effect should be that leading whitespace on the first line, and trailing whitespace on the last line is discarded, and nothing else.
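The buffering behavior for multi-line values can be sketched as follows. This is a simplified Python illustration that handles only key/value lines, plain text lines and :end lines; the names KEY_LINE, END_LINE and parse_values are hypothetical, and the key pattern assumes keys made of alphanumerics, hyphens, underscores and periods. Escaping, arrays and the other command lines are left out.

```python
import re

KEY_LINE = re.compile(r"^[^\S\n]*([A-Za-z0-9_\-.]+)[^\S\n]*:")
# ":end" followed by anything, but not ":endskip".
END_LINE = re.compile(r"^[^\S\n]*:[^\S\n]*end(?!skip)")


def parse_values(text):
    output = {}
    buffer = ""
    current_key = None  # key of the most recent value-starting line
    for line in text.splitlines(keepends=True):
        if END_LINE.match(line):
            if current_key is not None:
                # Append the buffered lines, then drop trailing whitespace.
                output[current_key] = (output[current_key] + buffer).rstrip()
            buffer, current_key = "", None
            continue
        match = KEY_LINE.match(line)
        if match:
            rest = line[match.end():]
            # The first line is stored immediately, stripped of whitespace.
            output[match.group(1)] = rest.strip()
            # Its trailing whitespace (including the newline) seeds the
            # buffer, so the line break before line two is preserved.
            buffer = rest[len(rest.rstrip()):]
            current_key = match.group(1)
            continue
        buffer += line  # plain text: buffer it, do not touch the output
    return output
```

Note that a new key/value line simply discards whatever was buffered, which is why plain text that is never followed by :end does not reach the output.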
If you wish to include a line within a multi-line value that would normally be interpreted as a command line, you can escape the line. This is accomplished by prepending the line with a backslash (\). Its presence should prevent the line from being treated as a command, and cause it to be treated as plain text instead.
To accommodate this, leading backslashes must be removed when parsing multi-line values. All lines in a multi-line value (but not the first line, since no escaping of characters is necessary there) should be post-processed to remove these leading backslashes.
ArchieML purposefully does not treat escapes as applying to individual characters, but instead to entire lines. Backslashes are treated as escape characters only when they appear at the beginning of a line. Backslashes that appear inside a line must be preserved in the output. In short, leading backslashes should be removed only when the backslash is the first non-whitespace character of a line within a multi-line value (but not the value's first line).
Here, the value of "key" should be value\n:end (backslash removed):
key: value
\:end
:end
Parsers should always remove leading backslashes in these cases, whether or not the line would have been treated as a command without it. As with other commands, whitespace surrounding the leading backslash should not impact whether the line is escaped, and that whitespace should be preserved in the resulting value.
Leading backslashes may, in turn, be escaped with an additional leading backslash, preserving the ability to actually begin a line with one in the output.
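A possible way to post-process the buffered lines of a multi-line value is a single multi-line regular expression substitution. The Python below is a sketch under that assumption; the names LEADING_BACKSLASH and unescape_buffer are hypothetical, and the function is meant to be applied to the buffer only, never to the first line of a value.

```python
import re

# One backslash that is the first non-whitespace character of a line.
# The leading whitespace is captured so that it can be preserved.
LEADING_BACKSLASH = re.compile(r"^([^\S\n]*)\\", re.MULTILINE)


def unescape_buffer(buffer):
    """Remove one leading backslash from every line in the buffer."""
    return LEADING_BACKSLASH.sub(r"\1", buffer)
```

Only one backslash is removed per line, so a line that begins with two backslashes keeps one of them, and backslashes inside a line are left untouched.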
Object blocks are a shorthand way to avoid repeating a namespace for multiple keys. You can specify the start of an object block with a line that begins with {token}.
All key-value pairs that are defined within this block should be added to the output within the namespace defined by the object line. In other words, a key "key" within the object block {object} should be parsed the same way as a key defined by object.key.
These are equivalent:
{object}
key: value
{}
object.key: value
The namespace defined by an object block should persist until either A) a new object block or array is defined, or B) an empty object or array key ({} or []) is encountered. This allows for both explicit "closing" of a block, and implicit closing when a new object or array begins.
Keys within blocks should be treated exactly as they would be if they had been defined through dot notation. Because of this, objects can be "reopened" if an object block is defined more than once: unlike arrays, the second definition of a block does not replace the original definition. In other words, apart from the cases covered by the conflict resolution rules above, values should always be merged into the output without deleting existing values unless necessary.
As soon as an object is initially opened, an empty object at that key should be added as its value.
Nested object block notation removes the redundancies of dot notation.
Object blocks with names prepended with a period should behave exactly as other object blocks, except their declaration should not close opened objects. Instead, they should nest inside of opened object blocks. The parser should not place any restrictions on nesting depth.
These are equivalent:
{scope}
key: value
{.scope}
key: value
{.scope}
key: value
{scope}
key: value
{scope.scope}
key: value
{scope.scope.scope}
key: value
Just like top-level object blocks, {} should end the object and return the parser to the scope held before encountering the nested object. Declaration of any top level object {} [] [+] should close all nested objects and reset the parser back to the top-level scope.
{scope}
{.scope}
key: value
{}
key: value
{.scope}
key: value
{newScope}
key: value
This should produce the following structure: {"scope":{"scope":{"key":"value"},"key":"value"},"newScope":{"key":"value"}}.
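One way to model object blocks and nested object blocks is a stack of open scopes. The Python sketch below is illustrative only: OBJECT_LINE, KEY_LINE, descend and parse_objects are hypothetical names, only object lines and single-line key/value pairs are handled, and dot-notation inside key/value lines is not expanded.

```python
import re

OBJECT_LINE = re.compile(r"^[^\S\n]*\{[^\S\n]*(\.?)([A-Za-z0-9_\-.]*)[^\S\n]*\}")
KEY_LINE = re.compile(r"^[^\S\n]*([A-Za-z0-9_\-.]+)[^\S\n]*:(.*)")


def descend(scope, dotted_key):
    """Open (or reopen) the object at dotted_key and return it."""
    for part in dotted_key.split("."):
        if not isinstance(scope.get(part), dict):
            scope[part] = {}  # created as soon as the object is opened
        scope = scope[part]
    return scope


def parse_objects(text):
    output = {}
    stack = [output]  # stack[-1] is the scope that receives new keys
    for line in text.splitlines():
        match = OBJECT_LINE.match(line)
        if match:
            nested, name = match.group(1), match.group(2)
            if not name:
                # "{}" closes the current block, back to the previous scope.
                if not nested and len(stack) > 1:
                    stack.pop()
            elif nested:
                # "{.name}" nests inside the currently open object.
                stack.append(descend(stack[-1], name))
            else:
                # "{name}" closes all nested objects and starts at the top.
                stack = [output, descend(output, name)]
            continue
        match = KEY_LINE.match(line)
        if match:
            stack[-1][match.group(1)] = match.group(2).strip()
    return output
```

Because descend keeps an existing object instead of replacing it, reopening a block merges new keys into it.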
Arrays of objects can be defined to create multiple instances of an object. They begin by declaring an array line, similar to an object block above; however, arrays use square brackets instead of curly brackets.
[array]
item: 1
item: 2
item: 3
[]
When an array definition is encountered, the parser should immediately create an empty array at the given key.
When an array begins, the parser should take note of the first key that is defined. This is the element delimiter key. Every time that key is encountered again, the parser should create a new element inside the array, and that key's value, and subsequent values (until a later occurrence of the delimiter key), should be added to that new object.
Keys within the array block should be assigned to an object within that array. All rules defined above relating to keys still apply, but within the scope of a particular object inside the array. Keys containing dot-notation are relative to individual items within the array.
[array]
scope.one: 1
scope.two: 1
scope.one: 1
scope.two: 1
[]
This example would result in two items, each containing a top-level "scope" key. This is distinct from the following array, where the resulting items would contain only the top-level keys "one" and "two":
[array.scope]
one: 1
two: 1
one: 1
two: 1
[]
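The element delimiter rule for object arrays can be sketched in a few lines of Python. This is a simplified illustration rather than a definitive implementation: build_object_array and set_nested are hypothetical names, and the input is assumed to be the already-extracted key/value pairs found inside one array block, in document order.

```python
def set_nested(item, dotted_key, value):
    """Store value under a dot-notation key, relative to one array item."""
    *parents, leaf = dotted_key.split(".")
    scope = item
    for part in parents:
        if not isinstance(scope.get(part), dict):
            scope[part] = {}
        scope = scope[part]
    scope[leaf] = value


def build_object_array(pairs):
    """Group (key, value) pairs into array elements."""
    items = []
    delimiter = None
    for key, value in pairs:
        if delimiter is None:
            delimiter = key  # the first key seen is the element delimiter
        if key == delimiter:
            items.append({})  # each occurrence starts a new element
        set_nested(items[-1], key, value)
    return items
```

In this sketch the delimiter is compared against the full key as written, so in the first example above scope.one, not scope, starts each new element.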
Arrays can also be created that contain simple strings instead of objects. They begin the same way as object arrays do, and the same rules apply around naming a string array. String arrays contain values defined by lines beginning with *. All characters following an *, minus surrounding whitespace, should be stored as a new value inside the array.
[array]
* item 1
* item 2
* item 3
[]
When an array is opened, the parser should take note of which command line is defined first: either a key/value line, or an asterisk (*) line. If the former is defined first, then the array is defined as an object array. If a * line appears first, it is defined as a string array.
Object arrays should treat token: lines as commands, and * lines as plain-text. String arrays should treat token: lines as plain-text and * lines as commands.
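The choice between an object array and a string array can be sketched as a scan for the first value-starting line. In the Python below, KEY_LINE, BULLET_LINE, array_kind and string_array_items are hypothetical names, the input is assumed to be the lines found inside a single array block, and multi-line values are ignored.

```python
import re

KEY_LINE = re.compile(r"^[^\S\n]*([A-Za-z0-9_\-.]+)[^\S\n]*:")
BULLET_LINE = re.compile(r"^[^\S\n]*\*(.*)")


def array_kind(lines):
    """Return "object" or "string" based on the first value line, else None."""
    for line in lines:
        if KEY_LINE.match(line):
            return "object"
        if BULLET_LINE.match(line):
            return "string"
    return None  # no value line yet, the type is still undecided


def string_array_items(lines):
    """Collect the values of a string array; other lines are plain text."""
    items = []
    for line in lines:
        match = BULLET_LINE.match(line)
        if match:
            items.append(match.group(1).strip())
    return items
```

Once the kind is known, the parser keeps it for the rest of the block, which is what makes a later key/value line plain text inside a string array.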
Multi-line values are allowed within both object and string arrays:
[object]
key: value...
...with multiple lines.
:end
[object]
* value...
...with multiple lines
:end
Array elements can also contain nested arrays (i.e., sub-arrays). To open a nested array, add a . to the front of the array's key: [.subarray]. This signifies that "subarray" should be an array called "subarray" within the current array element, instead of closing the current array and opening a new one.
[array]
[.subarray]
key: value
This should produce the following structure: {"array": [{"subarray": [{"key": "value"}]}]}.
Nested array keys can be mixed freely with normal key/value pairs. Note that subarrays must be closed with empty brackets ([]) in order to "return" to the parent element.
[array]
key: Depth 1
[.subarray]
subkey: Depth 2
[] This returns us to the first element in "array."
anotherkey: Depth 1
This should produce a top-level "array" whose single element has three keys: "key", "subarray" and "anotherkey".
The nested array's key functions the same as regular keys for determining an array's element delimiter. If a nested array "subarray" is the first key within a parent array, then "subarray" will become the array's delimiter.
[array]
[.subarray] First element
key: value
[]
[.subarray] Second element
key: value
[]
Here, "array" will contain two elements, each with its own "subarray".
Nested arrays can be either Object or String arrays, using the same rules as top-level arrays for determining the type. Only object arrays can contain other arrays, but string arrays can exist at any depth level.
Where Object arrays build up data structures with multiple key/value pairs with no inherent ordering of keys, Freeform arrays preserve the order of every line. Every line within a Freeform array results in an individual object in the array. Sequential lines result in as many elements in the array; the values are not folded into the same object.
Each array element is an object with two attributes: "type" and "value". What is normally treated as the key on each line becomes the type.
This type of array is useful when you want to tie ArchieML input more directly to the presentation order of content. The order of content can then be directly preserved between the text input and the presentation on a page.
[+array]
title: Middlemarch
author: George Eliot
[]
{array: [
{type: "title", value: "Middlemarch"},
{type: "author", value: "George Eliot"}
]}
In this example, moving "title" below "author" changes the output, which is not the case in standard Object arrays.
Lines of text which do not contain a key are preserved in Freeform arrays, and not ignored as they are in Object arrays. These lines are given a presumed type of "text". In this sense, Freeform arrays are collections of lines, each of which may or may not be tagged with a type. Lines without a type are given one by default.
[+array]
h1: This is a header
h2: This is a sub-header
This is normal text.
[]
{array: [
{type: "h1", value: "This is a header"},
{type: "h2", value: "This is a sub-header"},
{type: "text", value: "This is normal text."}
]}
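The way a Freeform array turns every line into its own element can be sketched as follows. This Python is a simplified illustration: parse_freeform and KEY_LINE are hypothetical names, the input is assumed to be the lines between [+array] and [], blank lines are assumed to produce no element, and multi-line values and nested blocks are not handled.

```python
import re

KEY_LINE = re.compile(r"^[^\S\n]*([A-Za-z0-9_\-.]+)[^\S\n]*:(.*)")


def parse_freeform(lines):
    """Turn each line of a Freeform array body into a type/value object."""
    items = []
    for line in lines:
        match = KEY_LINE.match(line)
        if match:
            # What is normally the key becomes the element's type.
            items.append({"type": match.group(1), "value": match.group(2).strip()})
        elif line.strip():
            # Lines without a key are kept with the default type "text".
            items.append({"type": "text", "value": line.strip()})
    return items
```

Because every line appends to the list, the order of the elements always mirrors the order of the lines in the document.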
To open a Freeform array, prepend the array's key with a + character. This serves as an array modifier similar to the leading . used in nested arrays. That array should then be treated as a freeform array.
Freeform arrays can also be nested under other arrays by using both modifiers. The order of the modifiers must not matter.
[array]
[.+nestedFreeform]
Nested Freeform content.
[]
[]
NOTE: Inline comments have been deprecated. This section will be removed in the 1.0 recommendation. If currently implemented as an option, it should default to off.
Inline comments in ArchieML are modeled after a common syntax in copy editing where editor's notes are placed within square brackets. In this tradition, all text within a matching set of square brackets on a single line, including the brackets, should be ignored by parsers.
In the event that square brackets are desired in the final value, a double set of brackets should be used. Parsers should replace sets of double brackets with single sets of brackets in the output. Care should also be taken not to remove text inside single brackets that is surrounded by an additional set of brackets.
key: value [inline comment] more value.
key: value [[this will appear in single brackets]] more value.
As with other punctuation, surrounding non-newline whitespace should not affect parsing.
To avoid making assumptions about the end use of the output, whitespace on either side of inline comments is preserved, which may result in extra whitespace after comments have been removed.
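For parsers that still offer the deprecated inline comment option, the bracket handling could be sketched with two substitutions. The Python below is an assumption-laden illustration: COMMENT, DOUBLE_BRACKETS and strip_comments are hypothetical names, and the function operates on an already-extracted value.

```python
import re

# A single bracketed run on one line, not followed by another "]".
COMMENT = re.compile(r"\[[^\[\]\n\r]*\](?!\])")
# A double-bracketed run, which should survive as single brackets.
DOUBLE_BRACKETS = re.compile(r"\[\[([^\[\]\n\r]*)\]\]")


def strip_comments(value):
    """Remove [comments] and turn [[text]] into [text]."""
    value = COMMENT.sub("", value)
    return DOUBLE_BRACKETS.sub(r"[\1]", value)
```

Whitespace on either side of a removed comment is left alone, so the first example line above ends up with two spaces between "value" and "more".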
Command keys are defined as any command line matching the :token pattern. Specifically, parsers should only treat this pattern of line as a command if the token begins with any of the following sequences:
:skip
:endskip
:end
:ignore
Any text after the token should be ignored, and should not affect whether the line is interpreted as a command key. This includes cases where the token and extra characters are not separated by whitespace, such as :ignoreeverything. This is so that the intended effect of the command is not lost due to simple syntax errors.
Due to this flexibility, care should be taken so that :endskip lines are not interpreted as :end lines.
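One simple way to keep :endskip from being read as :end is to test for the longer sequence first. The Python sketch below relies on the regular expression alternation being tried left to right; COMMAND_KEY and command_key are hypothetical names, and matching is assumed to be case-sensitive.

```python
import re

# "endskip" is listed before "end" so that the longer sequence wins.
COMMAND_KEY = re.compile(r"^[^\S\n]*:[^\S\n]*(endskip|ignore|skip|end)")


def command_key(line):
    """Return the command key a line starts with, or None."""
    match = COMMAND_KEY.match(line)
    return match.group(1) if match else None
```

Text after the recognized sequence is never inspected, so :endthisline and :ignoreeverything are still recognized as :end and :ignore.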
Any unescaped command key should reset the buffer.
The specifics of :end command keys are described above in "Multi-line values."
It should be noted that :end lines which do not occur after a key/value line or array element line (*) should have no effect on the output.
When a line beginning with :skip is encountered, the parser should begin to ignore all lines of text. Plain text lines should not be added to the buffer, and all command lines should be ignored. This is with the exception of two command keys which serve to end the skip block: :endskip (which closes the block and signals to resume normal parsing) or :ignore (which stops parsing altogether).
This allows for creating blocks of text where even lines that fit the format of command lines are ignored and do not affect parsing.
:endskip should resume normal parsing. No special actions need be taken upon resuming parsing, and the buffer should be empty at this point.
As soon as a line beginning with :ignore is encountered, parsing should stop immediately, and the output should be returned. This is a safety mechanism to allow for a safe comment / scratchpad area that has no chance of ending up in, or affecting, the output.
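The interaction of :skip, :endskip and :ignore can be sketched as a small filter that decides which lines the rest of the parser ever gets to see. This Python is an illustration only: COMMAND_KEY, command_key and active_lines are hypothetical names, and :end lines are simply passed through for later handling.

```python
import re

COMMAND_KEY = re.compile(r"^[^\S\n]*:[^\S\n]*(endskip|ignore|skip|end)")


def command_key(line):
    match = COMMAND_KEY.match(line)
    return match.group(1) if match else None


def active_lines(text):
    """Return the lines that remain after skip blocks and :ignore."""
    skipping = False
    kept = []
    for line in text.splitlines():
        key = command_key(line)
        if key == "ignore":
            break  # stops parsing, even inside a skip block
        if skipping:
            if key == "endskip":
                skipping = False  # resume normal parsing
            continue  # everything else in the block is ignored
        if key == "skip":
            skipping = True
            continue
        if key == "endskip":
            continue  # stray :endskip outside a block: nothing to close
        kept.append(line)
    return kept
```

A real parser would also reset its buffer on these command keys; the filter only shows which lines survive.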