Archie Markup Language (ArchieML) 1.0

Unofficial Draft

Editor:
(The New York Times)
Authors:

Abstract

ArchieML is a text format optimized for human writability. It was designed for users unfamiliar with existing serialization formats, and defines a forgiving, incremental parser that does not throw syntax errors.

Status of This Document

This document is a draft of a potential specification. It has no official standing of any kind and does not represent the support or consensus of any standards organization.

This is the first candidate recommendation of the Archie Markup Language specification. It is a pre-release draft of the spec. As such, it can be implemented in parsers; however, this candidate should not be considered the definition of ArchieML 1.0.

The public is invited to provide feedback on this draft, either through the GitHub Issues page for archieml.org, or via email. The comment period on candidate recommendations for the ArchieML 1.0 spec will remain open for at least four weeks following the most recent candidate recommendation.

At that time, this candidate will either be revised or promoted to a more final state, eventually reaching recommendation status for the 1.0 release. That recommendation will then be considered the definition of ArchieML 1.0. Parsers will not be required to implement major changes made to the spec after that point in order to be 1.0-compliant.

Until the release of version 1.0, parsers should note the version of the candidate spec they are following.

Changelog

15 October 2015

9 May 2015

23 April 2015

  • Clarification to note that :skip blocks, when placed inside a multi-line value, should break up the value, causing only the first line of the value to be stored.
  • Change to how arrays that are defined multiple times are handled. Instead of "reopening" the original array, secondary definitions should replace previous definitions, making duplicate-key behavior of arrays match that of key/value pairs.

Goals

Definitions

1. Parsing overview

1.1 Special tokens

Parsers should implement a system by which whole lines are interpreted either as a command line, or a plain text line to be read into a buffer.

"Command" lines are defined as those that begin with any of the following general patterns, with token defined as above:

Any non-newline whitespace may appear at either end of a token or any of the punctuation used above without affecting the parsing. Special lines should be recognized based on how they start; all text after the patterns mentioned above is valid and should not affect the line's status. For example, each of the following is a valid :token line:


          :end
          :end this line
          :endthisline
        

When a command line is found, logic should be executed that immediately affects the output. For example, if a key/value pair is encountered, the value specified on that line should immediately be added to the output. A document may end at any time, and the output object should always be in a state ready to be returned by the parser.

Based on context, some lines that fit one of the above patterns may not be defined as special; for example, bullet point lines ("* anything") are only special when the parser is inside an array. In these cases (defined below), the line should be interpreted as a plain text line.

All lines that are not commands are interpreted as plain text. When they are encountered, they should not immediately affect the output in any way. Instead, every plain text line should be incrementally added to a buffer that is used during multi-line value parsing, and emptied whenever a special line is encountered.
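
The command/plain-text split described above can be sketched roughly as follows. This is a minimal illustration, not a normative definition: the regular expressions and the classify_line name are assumptions made for this example, inferred from the line patterns used elsewhere in this document (token:, {token}, [token], * and :token). Context rules, such as bullet lines only being special inside arrays, are deliberately left out.

```python
import re

# Assumed token pattern: alphanumerics, hyphens, underscores and periods.
TOKEN = r"[A-Za-z0-9_\-.]+"

# Hypothetical patterns, checked in order. Only the start of a line matters;
# any text after the matched prefix does not change the classification.
PATTERNS = [
    ("key_value", re.compile(r"^[ \t]*" + TOKEN + r"[ \t]*:")),
    ("object", re.compile(r"^[ \t]*\{[ \t]*(" + TOKEN + r")?[ \t]*\}")),
    ("array", re.compile(r"^[ \t]*\[[ \t]*([+.]*" + TOKEN + r")?[ \t]*\]")),
    ("bullet", re.compile(r"^[ \t]*\*")),
    # "endskip" is listed before "end" so the longer name wins.
    ("command_key", re.compile(r"^[ \t]*:[ \t]*(endskip|ignore|skip|end)")),
]


def classify_line(line):
    """Return the name of the first matching pattern, or "text"."""
    for name, pattern in PATTERNS:
        if pattern.match(line):
            return name
    return "text"
```

A parser built this way would act on every line that is not classified as "text" and append the rest to its buffer.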

1.2 Conflict resolution

ArchieML does not throw syntax errors. This is to make parsing of documents as simple as possible, but also to avoid requiring writers to be directly aware of the parsing mechanism. Documents are parsed line by line, and any line that does not conform to one of the special line syntaxes above is interpreted as plain text. Syntax errors that cause a command line to be treated as plain text thus do not "break" the parsing of a document, but instead generally omit that line from the output.

Many of the command lines are set up to "reset" the document to some state. This generally means that any mistakes that occur in a document remain local, and do not harm the rest of the document.

If a key is encountered twice, the latter occurrence should take precedence. Arrays that are defined on top of an existing value or array should also be replaced. Thus, no validation should be made as to whether a key is already taken. In the following example, the output should contain second value as the value for key.


          key: first value
          key: second value
        

A partial exception to this is in defining object blocks: if an object already exists at the key being defined, its existing values should not be deleted. In the following example, both first and second should remain defined.


          {scope}
          first: 1
          {}

          {scope}
          second: 2
          {}
        

When redefining a value, it's possible to change the data structure of the output. For example, a key can change from holding a string to being an object namespace with keys underneath it. In these cases too, the latter definition should take precedence, and parsers must override the existing data structure of the output as necessary to accommodate the new data type (such as replacing a string value with an object). Here, the output should define key as {"subkey": "subvalue"}.


          key: value

          {key}
          subkey: subvalue
          {}
        

1.3 Parser options

Parsers should accept several options for modifying how documents are processed.

2. Keys

Keys, both at the top level of the output, and within nested objects, may contain only alphanumeric characters, hyphens and underscores, in any sequence. If other characters are included, then the line should not be interpreted as a key/value pair, and instead treated as plain text.

Keys are defined wholly or in part within key/value lines, as well as lines defining {objects} and [arrays]. The same rules apply in all cases, and should follow the definition of a token above.

Any key may also include any number of periods (.) that function as dot-notation to define nested values. At the time this key or object is defined, the output should conform to the data structure required by that key. Any conflict resolution that needs to take place (such as replacing previous string values with objects) should occur immediately.

The following lines are all valid. Successive lines in each group will end up overwriting the previous value with an object, in order to hold the new key.


        key: value
        key.key: value
        key.key.key: value

        [array]
        [array.array]
        [array.array.array]

        {scope}
        {scope.scope}
        {scope.scope.scope}
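
One way to picture the dot-notation behavior above is the following minimal sketch. It assumes the output is held as nested Python dictionaries; the set_path name is hypothetical and chosen only for this illustration.

```python
def set_path(output, dotted_key, value):
    """Store value at dotted_key, resolving conflicts immediately."""
    parts = dotted_key.split(".")
    node = output
    for part in parts[:-1]:
        # Anything that is not already an object (for example a string
        # stored earlier) is replaced so that it can hold the new key.
        if not isinstance(node.get(part), dict):
            node[part] = {}
        node = node[part]
    # The last definition of a key always takes precedence.
    node[parts[-1]] = value
    return output
```

Applied to the first group above, each call replaces the previous string with an object holding the new key:

```python
out = {}
set_path(out, "key", "value")          # {"key": "value"}
set_path(out, "key.key", "value")      # {"key": {"key": "value"}}
set_path(out, "key.key.key", "value")  # {"key": {"key": {"key": "value"}}}
```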
      

3. Values

Values are the characters that are stored within the output's data structure. Special lines beginning with token: or * define values. Any characters that follow the special beginning of the line should be immediately stored as a value.

Values are always stored in the output as strings. Leading and trailing whitespace should be stripped away. A value ends when the parser encounters a newline.

3.1 Multi-line values

All plain-text lines should be stored into a buffer that is emptied whenever any command line is encountered. This is because values that span multiple lines are defined by including an "ending anchor tag", :end, on the line following the value. Thus the parser cannot know ahead of time whether lines 2+ of a value need to be included, until the value has been defined in full.


          key: This value should immediately be saved.
          This will be read into a buffer.
          So will this.
          :end Now the buffer should be emptied into "key", adding on two additional lines to the value.
        

When a command line beginning with :end is encountered, the buffer should be emptied. If the previous command that was parsed was either a token: or a * line (each of which signals the start of a value), then the contents of the buffer should be appended to that value, which should already contain the first line of the value.

Because the first line of a value is always inserted into the output immediately (before the rest of the value is parsed), and surrounding whitespace around the initial values is discarded, care should be taken to preserve newlines between the first and second line of a value. A simple way to accomplish this is to send any trailing newline whitespace from the first line of a value into the buffer, so that it is included when you append the buffer to the value's first line.

As characters are added to the buffer, all whitespace should be preserved, including newlines. The end effect should be that leading whitespace on the first line, and trailing whitespace on the last line is discarded, and nothing else.
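
The buffering described in this section can be sketched as below. This is a simplified illustration under stated assumptions: it handles only token: lines and :end lines, it ignores arrays, object blocks, escaping and skip blocks, and the parse_values name and the key pattern are hypothetical.

```python
import re

# Assumed key pattern: alphanumerics, hyphens, underscores and periods.
KEY_VALUE = re.compile(r"^[ \t]*([A-Za-z0-9_\-.]+)[ \t]*:(.*)$", re.DOTALL)


def parse_values(text):
    output = {}
    buffer = ""       # plain-text lines seen since the last command line
    last_key = None   # key whose value a following :end line may extend
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip()
        match = KEY_VALUE.match(line)
        if match:
            key, raw = match.groups()
            output[key] = raw.strip()  # the first line is stored immediately
            # Keep the first line's trailing newline so it survives appending.
            buffer = raw[len(raw.rstrip()):]
            last_key = key
        elif stripped.startswith(":end") and not stripped.startswith(":endskip"):
            if last_key is not None:
                output[last_key] = (output[last_key] + buffer).strip()
            buffer, last_key = "", None
        else:
            buffer += line  # whitespace and newlines are preserved as-is
    return output
```

Because the first line is written to the output before the rest of the value is seen, a document that ends before :end still yields a usable result containing only that first line.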

4. Escaping

If you wish to include a line within a multi-line value that would normally be interpreted as a command line, you can escape the line. This is accomplished by prepending the line with a backslash (\). Its presence should prevent the line from being treated as a command, and cause it to be treated as plain text instead.

To accommodate this, leading backslashes must be removed when parsing multi-line values. All lines in a multi-line value (but not the first line, since no escaping of characters is necessary there) should be post-processed to remove these leading backslashes.

ArchieML purposefully does not treat escapes as applying to individual characters, but instead to entire lines. Backslashes are treated as escape characters only when they appear at the beginning of a line. Backslashes that appear inside a line must be preserved in the output. In short, leading backslashes should be removed only when the backslash is the first non-whitespace character of a line within a multi-line value (but not the value's first line).

Here, the value of key should be value\n:end (backslash removed):


        key: value
        \:end
        :end
      

Parsers should always remove leading backslashes in these cases, whether or not the line would have been treated as a command without it. As with other commands, whitespace surrounding the leading backslash should not impact whether the line is escaped, and that whitespace should be preserved in the resulting value.

Leading backslashes may, in turn, be escaped with an additional leading backslash, preserving the ability to actually begin a line with one in the output.
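
The post-processing step can be sketched as a single substitution over the buffered lines (lines 2+ of a value). This is a minimal sketch; the unescape_lines name is hypothetical, and the function assumes it receives only the buffered portion of a value, not the value's first line.

```python
import re


def unescape_lines(buffered_text):
    """Remove one leading backslash per line, keeping surrounding whitespace."""
    # Only a backslash that is the first non-whitespace character of a line
    # is removed; backslashes inside a line are left untouched.
    return re.sub(r"^([ \t]*)\\", r"\1", buffered_text, flags=re.MULTILINE)
```

Only one backslash is removed per line, so a line written with two leading backslashes keeps one of them in the output.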

5. Object blocks

Object blocks are a shorthand way to avoid repeating a namespace for multiple keys. You can specify the start of an object block with a line that begins with {token}.

All key-value pairs that are defined within this block should be added to the output within the namespace defined by the object line. In other words, a key key within the object block {object} should be parsed the same way as a key defined by object.key.


        These are equivalent:

        {object}
        key: value
        {}

        object.key: value
      

The namespace defined by an object block should persist until either A) a new object block or array is defined, or B) an empty object or array key ({} or []) is encountered. This allows for both explicit "closing" of a block, and implicit closing when a new object or array begins.

Keys within blocks should be treated exactly as they would if they had been defined naturally through dot notation. Because of this, objects can be "reopened" if an object block is defined more than once (the second definition of a block does not replace the original definition, as it does in arrays). That is, apart from the conflicts described in the conflict resolution section above, values should always be merged into the output without deleting existing values unless necessary.

As soon as an object is initially opened, an empty object at that key should be added as its value.

5.1 Nested object blocks

Nested object block notation removes the redundancies of dot notation.

Object blocks with names prepended with a period should behave exactly as other object blocks, except their declaration should not close opened objects. Instead, they should nest inside of opened object blocks. The parser should not place any restrictions on nesting depth.


          These are equivalent:
  
          {scope}
          key: value
          {.scope}
          key: value
          {.scope}
          key: value
  
          {scope}
          key: value
          {scope.scope}
          key: value
          {scope.scope.scope}
          key: value
        

Just like top-level object blocks, {} should end the object and return the parser to the scope held before encountering the nested object. Declaring any top-level object or array ({...}, [...] or [+...]) should close all nested objects and reset the parser back to the top-level scope.


          {scope}
          {.scope}
          key: value
          {}
          key: value
          {.scope}
          key: value
  
          {newScope}
          key: value
        

This should produce the following structure: {"scope":{"scope":{"key":"value"},"key":"value"},"newScope":{"key":"value"}}.
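
The scoping rules for object blocks can be sketched with a stack of open blocks. This is an illustration under stated assumptions: it handles only object lines and key/value lines (no arrays, multi-line values or command keys), and the parse_objects and resolve names and the regular expressions are hypothetical.

```python
import re

OBJECT = re.compile(r"^[ \t]*\{[ \t]*([A-Za-z0-9_\-.]*)[ \t]*\}")
KEY_VALUE = re.compile(r"^[ \t]*([A-Za-z0-9_\-.]+)[ \t]*:(.*)$")


def resolve(output, path):
    """Walk path from the root, creating or replacing objects as needed."""
    node = output
    for part in path:
        if not isinstance(node.get(part), dict):
            node[part] = {}
        node = node[part]
    return node


def parse_objects(text):
    output = {}
    stack = []  # one list of path segments per open block
    for line in text.splitlines():
        obj = OBJECT.match(line)
        kv = KEY_VALUE.match(line)
        if obj:
            name = obj.group(1)
            if name == "":
                if stack:
                    stack.pop()  # {} closes the innermost block
            elif name.startswith("."):
                stack.append(name[1:].split("."))  # nest in the open block
            else:
                stack = [name.split(".")]  # top-level block resets the scope
            # An opened object exists in the output immediately.
            resolve(output, [p for block in stack for p in block])
        elif kv:
            scope = [p for block in stack for p in block]
            path = scope + kv.group(1).split(".")
            resolve(output, path[:-1])[path[-1]] = kv.group(2).strip()
    return output
```

Because existing objects are reused rather than replaced, reopening {.scope} in the example above merges into the object created the first time.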

6. Arrays

6.1 Object arrays

Arrays of objects can be defined to create multiple instances of an object. They begin by declaring an array line, similar to an object block above; however, arrays use square brackets instead of curly brackets.


          [array]
          item: 1
          item: 2
          item: 3
          []
        

When an array definition is encountered, the parser should immediately create an empty array at the given key.

When an array begins, the parser should take note of the first key that is defined. This is the element delimiter key. Every time that key is encountered again, the parser should create a new element inside the array, and that key's value, and subsequent values (until a later occurrence of the delimiter key), should be added to that new object.
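
The delimiter-key rule can be sketched as follows. This is a minimal illustration that assumes the key/value pairs inside the array block have already been extracted in document order; the build_object_array name is hypothetical, and dot-notation keys are not handled.

```python
def build_object_array(pairs):
    """Group (key, value) pairs into objects, splitting on the first key seen."""
    items = []
    delimiter = None
    for key, value in pairs:
        if delimiter is None:
            delimiter = key      # the first key defined becomes the delimiter
        if key == delimiter:
            items.append({})     # each later occurrence starts a new element
        items[-1][key] = value   # other keys join the current element
    return items
```

For instance, the pairs name/age/name/age would produce two objects, each with a name and an age, while three consecutive item pairs would produce three single-key objects.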

Keys within the array block should be assigned to an object within that array. All rules defined above relating to keys still apply, but within the scope of a particular object inside the array. Keys containing dot-notation are relative to individual items within the array.


          [array]
          scope.one: 1
          scope.two: 1

          scope.one: 1
          scope.two: 1
          []
        

This example would result in two items, each containing a top-level scope key. This is distinct from the following array, where the resulting items would contain only the top-level keys one and two:


          [array.scope]
          one: 1
          two: 1

          one: 1
          two: 1
          []
        

6.2 String arrays

Arrays can also be created that contain simple strings instead of objects. They begin the same way as object arrays do, and the same rules apply around naming a string array. String arrays contain values defined by lines beginning with *. All characters following an *, minus surrounding whitespace, should be stored as a new value inside the array.


          [array]
          * item 1
          * item 2
          * item 3
          []
        

When an array is opened, the parser should take note of which command line is defined first: either a key/value line, or an asterisk (*) line. If the former is defined first, then the array is defined as an object array. If a * line appears first, it is defined as a string array.

Object arrays should treat token: lines as commands, and * lines as plain-text. String arrays should treat token: lines as plain-text and * lines as commands.
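
The type decision can be sketched as below: the first command line fixes the array's type, and from then on only lines of that type are treated as commands. This is a simplified sketch; the read_array_lines name and the patterns are assumptions for this example, and lines that are not commands are simply dropped here rather than buffered.

```python
import re

KEY_VALUE = re.compile(r"^[ \t]*([A-Za-z0-9_\-.]+)[ \t]*:(.*)$")
BULLET = re.compile(r"^[ \t]*\*(.*)$")


def read_array_lines(lines):
    kind = None     # "object" or "string", fixed by the first command line
    commands = []
    for line in lines:
        kv = KEY_VALUE.match(line)
        bullet = BULLET.match(line)
        if kind is None:
            kind = "object" if kv else "string" if bullet else None
        if kind == "object" and kv:
            commands.append((kv.group(1), kv.group(2).strip()))
        elif kind == "string" and bullet:
            commands.append(bullet.group(1).strip())
        # Any other line is plain text in this array and is not a command.
    return kind, commands
```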

Multi-line values are allowed within both object and string arrays:


        [object]
        key: value...
        ...with multiple lines.
        :end

        [object]
        * value...
        ...with multiple lines
        :end
        

6.3 Nested arrays

Array elements can also contain nested arrays (i.e., sub-arrays). To open a nested array, add a . to the front of the array's key: [.subarray]. This signifies that subarray should be an array called subarray within the current array element, instead of closing the current array and opening a new one.


        [array]
        [.subarray]
        key: value
        

This should produce the following structure: {"array": [{"subarray": [{"key": "value"}]}]}.

Nested array keys can be mixed freely with normal key/value pairs. Note that subarrays must be closed with empty brackets ([]) in order to "return" to the parent element.


        [array]
        key: Depth 1
        [.subarray]
        subkey: Depth 2
        [] This returns us to the first element in "array."
        anotherkey: Depth 1
        

This should produce a top-level array containing a single element with three keys: key, subarray and anotherkey.

The nested array's key functions the same as regular keys for determining an array's element delimiter. If a nested array subarray is the first key within a parent array, then subarray will become the array's delimiter.


        [array]
        [.subarray] First element
        key: value
        []
        [.subarray] Second element
        key: value
        []
        

Here, array will contain two elements, each with its own subarray.

Nested arrays can be either Object or String arrays, using the same rules as top-level arrays for determining the type. Only object arrays can contain other arrays, but string arrays can exist at any depth level.

6.4 Freeform arrays

Where Object arrays build up data structures with multiple key/value pairs with no inherent ordering of keys, Freeform arrays preserve the order of every line. Every line within a Freeform array results in an individual object in the array. Sequential lines result in as many elements in the array; the values are not folded into the same object.

Each array element is an object with two attributes: type and value. What is normally treated as the key on each line becomes the type.

This type of array is useful when you want to tie ArchieML input more directly to the presentation order of content. The order of content can then be directly preserved between the text input and the presentation on a page.


          [+array]
          title: Middlemarch
          author: George Eliot
          []
        

          {array: [
            {type: "title", value: "Middlemarch"},
            {type: "author", value: "George Eliot"}
          ]}
        

In this example, moving title below author changes the output, which is not the case in standard Object arrays.

Lines of text which do not contain a key are preserved in Freeform arrays, and not ignored as they are in Object arrays. These lines are given a presumed type of text. In this sense, Freeform arrays are collections of lines, each of which may or may not be tagged with a type. Lines without a type are given one by default.


          [+array]
          h1: This is a header
          h2: This is a sub-header
          This is normal text.
          []
        

          {array: [
            {type: "h1", value: "This is a header"},
            {type: "h2", value: "This is a sub-header"},
            {type: "text", value: "This is normal text."}
          ]}
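
The line-by-line behavior of Freeform arrays can be sketched as follows. This is a minimal illustration operating on the lines between the opening [+array] line and the closing [] line; the parse_freeform name and the key pattern are hypothetical, multi-line values and nested arrays are not handled, and skipping blank lines is an assumption of this sketch rather than something stated above.

```python
import re

KEY_VALUE = re.compile(r"^[ \t]*([A-Za-z0-9_\-.]+)[ \t]*:(.*)$")


def parse_freeform(lines):
    items = []
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines produce no element
        match = KEY_VALUE.match(line)
        if match:
            # What is normally the key becomes the element's type.
            items.append({"type": match.group(1), "value": match.group(2).strip()})
        else:
            # Lines without a key are kept and given the type "text".
            items.append({"type": "text", "value": line.strip()})
    return items
```

Every input line yields its own element, so the order of the result follows the order of the lines.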
        

To open a Freeform array, prepend the array's key with a + character. This serves as an array modifier similar to the leading . used in nested arrays. That array should then be treated as a freeform array.

Freeform arrays can also be nested under other arrays by using both modifiers. The order of the modifiers must not matter.


          [array]
            [.+nestedFreeform]
              Nested Freeform content.
            []
          []
        

7. Inline comments (Deprecated)

NOTE: Inline comments have been deprecated. This section will be removed in the 1.0 recommendation. If currently implemented as an option, it should default to off.

Inline comments in ArchieML are modeled after a common syntax in copy editing where editor's notes are placed within square brackets. In this tradition, all text within a matching set of square brackets on a single line, including the brackets, should be ignored by parsers.

In the event that square brackets are desired in the final value, a double set of brackets should be used. Parsers should replace sets of double brackets with single sets of brackets in the output. Care should also be taken not to remove text inside single brackets that is surrounded by an additional set of brackets.


        key: value [inline comment] more value.

        key: value [[this will appear in single brackets]] more value.
      

As with other punctuation, surrounding non-newline whitespace should not affect parsing.

To avoid making assumptions about the end use of the output, whitespace on either side of inline comments is preserved, which may result in extra whitespace after comments have been removed.
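
For parsers that still offer this deprecated option, the bracket handling could be sketched as one substitution. This is illustrative only; the strip_inline_comments name and the regular expression are assumptions made for this example, and nested or unbalanced brackets beyond the double-bracket case are not considered.

```python
import re

# Double brackets are tried first so their contents are kept, not removed.
BRACKETS = re.compile(r"\[\[([^\[\]\n]*)\]\]|\[[^\[\]\n]*\]")


def strip_inline_comments(value):
    def replace(match):
        inner = match.group(1)
        # A double-bracket match becomes a single set of brackets;
        # a single-bracket match is a comment and is dropped entirely.
        return "" if inner is None else "[" + inner + "]"
    return BRACKETS.sub(replace, value)
```

Whitespace on either side of a removed comment is left in place, so the first example line above keeps two spaces between "value" and "more".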

8. Command keys

Command keys are defined as any command line matching the :token pattern. Specifically, parsers should only treat this pattern of line as a command if the token begins with any of the following sequences:

Any text after the token should be ignored, and should not affect whether the line is interpreted as a command key. This includes cases where the token and extra characters are not separated by whitespace, such as :ignoreeverything. This is so that the intended effect of the command is not lost due to simple syntax errors.

Due to this flexibility, care should be taken so that :endskip lines are not interpreted as :end lines.

Any unescaped command key should reset the buffer.

8.1 End command

The specifics of :end command keys are described above in "Multi-line values."

It should be noted that :end lines which do not occur after a key/value line or array element line (*) should have no effect on the output.

8.2 Skip blocks

When a line beginning with :skip is encountered, the parser should begin to ignore all lines of text. Plain text lines should not be added to the buffer, and all command lines should be ignored. This is with the exception of two command keys which serve to end the skip block: :endskip (which closes the block and signals to resume normal parsing) or :ignore (which stops parsing altogether).

This allows for creating blocks of text where even lines that fit the format of command lines are ignored and do not affect parsing.

:endskip should resume normal parsing. No special actions need be taken upon resuming parsing, and the buffer should be empty at this point.

8.3 Ignore

As soon as a line beginning with :ignore is encountered, parsing should stop immediately, and the output should be returned. This is a safety mechanism to allow for a safe comment / scratchpad area that has no chance of ending up in, or affecting, the output.
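
The combined effect of :skip, :endskip and :ignore can be sketched as a filter over the document's lines. This is a minimal illustration, assuming the four command key names used in this section; the visible_lines name and the pattern are hypothetical, and buffer handling is left out.

```python
import re

# "endskip" is listed before "end" so that it is not read as ":end".
COMMAND_KEY = re.compile(r"^[ \t]*:[ \t]*(endskip|ignore|skip|end)")


def visible_lines(text):
    """Yield the lines a parser would still process."""
    skipping = False
    for line in text.splitlines():
        match = COMMAND_KEY.match(line)
        command = match.group(1) if match else None
        if command == "ignore":
            return              # stop parsing, even inside a skip block
        if command == "skip":
            skipping = True
        elif command == "endskip":
            skipping = False
        elif not skipping:
            yield line          # includes :end lines outside skip blocks
```

Inside a skip block every line is dropped, including key/value and :end lines, until :endskip or :ignore is reached.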