Copyright © 2015 The New York Times
ArchieML is a text format optimized for human writability. It was designed for users unfamiliar with existing serialization formats, and defines a forgiving, incremental parser that does not throw syntax errors.
This is the first candidate recommendation of the Archie Markup Language specification. It is a pre-release draft of the spec. As such, it can be implemented in parsers; however, this candidate should not be considered the definition of ArchieML 1.0.
The public is invited to provide feedback on this draft, either through the GitHub Issues page for archieml.org, or via email. The comment period on candidate recommendations for the ArchieML 1.0 spec will remain open for at least four weeks following the most recent candidate recommendation.
At that time, this candidate will either be revised or promoted to a more final state, eventually reaching recommendation status for the 1.0 release. That recommendation will then be considered the definition of ArchieML 1.0. Major changes made to the spec after that point will not need to be implemented for a parser to be 1.0-compliant.
Until the release of version 1.0, parsers should note the version of the candidate spec they are following.
:skip blocks, when placed inside a multi-line value, should break up the value, causing only the first line of the value to be stored.
Parsers should implement a system by which whole lines are interpreted either as a command line or as a plain text line to be read into a buffer.
          "Command" lines are defined as those that begin with any of the following general patterns, with token defined as above:
        
          token:
          :token
          {token}
          [token]
          {}
          []
          *
        
          Any non-newline whitespace may appear at either end of a token or any of the punctuation used above without affecting the parsing. Special lines should be recognized based on how they start; all text after the patterns mentioned above is valid and should not affect the line's status. For example, each of the following is a valid :token line:
        
          :end
          :end this line
          :endthisline
        When a command line is found, logic should be executed that immediately affects the output. For example, if a key/value pair is encountered, the value specified on that line should immediately be added to the output. A document may end at any time, and the output object should always be in a state ready to be returned by the parser.
Based on context, some lines that fit one of the above patterns may not be defined as special; for example, bullet point lines ("* anything") are only special when the parser is inside an array. In these cases (defined below), the line should be interpreted as a plain text line.
All lines that are not commands are interpreted as plain text. When they are encountered, they should not immediately affect the output in any way. Instead, every plain text line should be incrementally added to a buffer that is used during multi-line value parsing, and emptied whenever a special line is encountered.
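The command/plain-text split described above can be sketched as a small line classifier. This is a minimal illustration, not a reference implementation: the names TOKEN, COMMAND_PATTERNS and classify are hypothetical, the token character class is an assumption based on the key rules in this spec, and context-dependent rules (such as bullet lines only being special inside arrays) are left out.

```python
import re

# Assumed token definition: letters, digits, hyphens, underscores and periods.
TOKEN = r"[A-Za-z0-9_\-.]+"

# One pattern per command-line shape. Only the start of the line is tested,
# so any trailing text is allowed.
COMMAND_PATTERNS = [
    ("key_value", re.compile(r"^[ \t]*(" + TOKEN + r")[ \t]*:")),
    ("command_key", re.compile(r"^[ \t]*:[ \t]*(" + TOKEN + r")")),
    ("object", re.compile(r"^[ \t]*\{[ \t]*(" + TOKEN + r")?[ \t]*\}")),
    ("array", re.compile(r"^[ \t]*\[[ \t]*(" + TOKEN + r")?[ \t]*\]")),
    ("bullet", re.compile(r"^[ \t]*\*")),
]


def classify(line):
    """Return the name of the first matching command shape, or "text"."""
    for name, pattern in COMMAND_PATTERNS:
        if pattern.match(line):
            return name
    return "text"
```

A parser built on this would act immediately on any result other than "text", and append "text" lines to its buffer.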
ArchieML does not throw syntax errors. This is to make parsing of documents as simple as possible, but also to avoid requiring writers to be directly aware of the parsing mechanism. Documents are parsed line by line, and any line that does not conform to one of the special line syntaxes above is interpreted as plain text. Syntax errors that cause a command line to be treated as plain text thus do not "break" the parsing of a document, but instead generally omit that line from the output.
Many of the command lines are set up to "reset" the document to some state. This generally means that any mistakes that occur in a document remain local, and do not harm the rest of the document.
        If a key is encountered twice, the latter occurrence should take precedence. Arrays that are defined on top of an existing value or array should also be replaced. Thus, no validation should be made as to whether a key is already taken. In the following example, the output should contain second value as the value for key.
        
          key: first value
          key: second value
        
          A partial exception to this is in defining object blocks: if an object already exists at the key being defined, its existing values should not be deleted. In the following example, both first and second should remain defined.
        
          {scope}
          first: 1
          {}
          {scope}
          second: 2
          {}
        
          When redefining a value, it's possible to change the data structure of the output. For example, a key can change from holding a string to being an object namespace with keys underneath it. In these cases too, the latter definition should take precedence, and parsers must override the existing data structure of the output as necessary to accommodate the new data type (such as replacing a string value with an object). Here, the output should define key as {"subkey": "subvalue"}:
        
          key: value
          {key}
          subkey: subvalue
          {}
        Keys, both at the top level of the output, and within nested objects, may contain only alphanumeric characters, hyphens and underscores, in any sequence. If other characters are included, then the line should not be interpreted as a key/value pair, and instead treated as plain text.
        Keys are defined wholly or in part within key/value lines, as well as lines defining {objects} and [arrays]. The same rules apply in all cases, and should follow the definition of a token above.
      
        Any key may also include any number of periods (.) that function as dot-notation to define nested values. At the time this key or object is defined, the output should conform to the data structure required by that key. Any conflict resolution that needs to take place (such as replacing previous string values with objects) should occur immediately.
      
The following lines are all valid. Successive lines in each group will end up overwriting the previous value with an object, in order to hold the new key.
        key: value
        key.key: value
        key.key.key: value
        [array]
        [array.array]
        [array.array.array]
        {scope}
        {scope.scope}
        {scope.scope.scope}
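The conflict-resolution and dot-notation rules can be sketched with a single helper that walks a dotted key and replaces anything in the way. The function name set_path and the use of Python dicts for the output are assumptions made for illustration only.

```python
def set_path(output, dotted_key, value):
    """Store value at dotted_key, replacing conflicting values as needed."""
    parts = dotted_key.split(".")
    scope = output
    for part in parts[:-1]:
        # A string (or any non-object) already stored at this key is
        # replaced by an empty object so the nested key can be held.
        if not isinstance(scope.get(part), dict):
            scope[part] = {}
        scope = scope[part]
    # The latter definition always wins; no check for an existing value.
    scope[parts[-1]] = value
    return output
```

Applied in order to the three key lines above, the helper first stores a string at key, then replaces it with an object, then replaces the nested string with another object.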
      
        Values are the characters that are stored within the output's data structure. Special lines beginning with token: or * define values. Any characters that follow the special beginning of the line should be immediately stored as a value.
      
Values are always stored in the output as strings. Leading and trailing whitespace should be stripped away. A value ends when the parser encounters a newline.
          All plain-text lines should be stored into a buffer that is emptied whenever any command line is encountered. This is because values that span multiple lines are defined by including an "ending anchor tag", :end, on the line following the value. Thus the parser cannot know ahead of time whether lines 2+ of a value need to be included, until the value has been defined in full.
        
          key: This value should immediately be saved.
          This will be read into a buffer.
          So will this.
          :end Now the buffer should be emptied into "key", adding on two additional lines to the value.
        
          When a command line beginning with :end is encountered, the buffer should be emptied. If the previous command that was parsed was either a token: or a * line (each of which signals the start of a value), then the contents of the buffer should be appended to that value, which should already contain the first line of the value.
        
Because the first line of a value is always inserted into the output immediately (before the rest of the value is parsed), and surrounding whitespace around the initial values is discarded, care should be taken to preserve newlines between the first and second line of a value. A simple way to accomplish this is to send any trailing newline whitespace from the first line of a value into the buffer, so that it is included when you append the buffer to the value's first line.
As characters are added to the buffer, all whitespace should be preserved, including newlines. The end effect should be that leading whitespace on the first line, and trailing whitespace on the last line is discarded, and nothing else.
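The buffering behavior for multi-line values can be sketched as follows. This is a simplified illustration under stated assumptions: it handles only key/value lines and :end, ignores escapes, arrays and other command keys, and the names KEY_VALUE, END and parse_values are hypothetical.

```python
import re

# Assumed key syntax: letters, digits, hyphens, underscores and periods.
KEY_VALUE = re.compile(r"^[ \t]*([A-Za-z0-9_\-.]+)[ \t]*:[ \t]*(.*)$", re.DOTALL)
# :end, but not :endskip.
END = re.compile(r"^[ \t]*:end(?!skip)")


def parse_values(text):
    output = {}
    buffer = ""
    last_key = None
    for line in text.splitlines(keepends=True):
        if END.match(line):
            if last_key is not None:
                # Append the buffered lines to the first line of the value,
                # discarding only the trailing whitespace of the last line.
                output[last_key] = (output[last_key] + buffer).rstrip()
            buffer = ""
            last_key = None
            continue
        match = KEY_VALUE.match(line)
        if match:
            key, rest = match.groups()
            # The first line is stored immediately, stripped of whitespace.
            output[key] = rest.strip()
            # Its trailing whitespace (including the newline) seeds the
            # buffer so the line break before line two is preserved.
            buffer = rest[len(rest.rstrip()):]
            last_key = key
        else:
            buffer += line
    return output
```

If the document ends before an :end line is reached, the output already holds the first line of each value, which matches the requirement that the output is always ready to be returned.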
        If you wish to include a line within a multi-line value that would normally be interpreted as a command line, you can escape the line. This is accomplished by prepending the line with a backslash (\). Its presence should prevent the line from being treated as a command, and instead treated as plain text.
      
To accommodate this, leading backslashes must be removed when parsing multi-line values. All lines in a multi-line value (but not the first line, since no escaping of characters is necessary there) should be post-processed to remove these leading backslashes.
        ArchieML purposefully does not treat escapes as applying to individual characters, but instead to entire lines. Backslashes are treated as escape characters only when they appear at the beginning of a line. Backslashes that appear inside of a line must be preserved in the output. To avoid as much processing as possible, leading backslashes should be removed only when the backslash is the first character of a line (but not a value's first line), and when the second character is any of the following: {, [, *, : or \.
      
        The value of key should be value\n:end (backslash removed):
      
        key: value
        \:end
        :end
      Parsers should always remove leading backslashes in these cases, whether or not the line would have been treated as a command without it. Leading backslashes may, in turn, be escaped with an additional leading backslash, preserving the ability to actually begin a line with one in the output.
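The escaping rule can be sketched as a post-processing step over the buffered lines of a value. The function names unescape_line and unescape_buffer are hypothetical; the set of characters that trigger removal is taken from the list above.

```python
# Characters that, when they follow a leading backslash, cause the
# backslash to be removed.
ESCAPABLE = "{[*:\\"


def unescape_line(line):
    """Remove a leading backslash only when it escapes a command character."""
    if len(line) >= 2 and line[0] == "\\" and line[1] in ESCAPABLE:
        return line[1:]
    return line


def unescape_buffer(buffer):
    """Apply unescape_line to every buffered line, keeping line endings."""
    return "".join(unescape_line(line) for line in buffer.splitlines(keepends=True))
```

Backslashes elsewhere in a line, or leading backslashes followed by any other character, pass through untouched.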
        Object blocks are a shorthand way to avoid repeating a namespace for multiple keys. You can specify the start of an object block with a line that begins with {token}.
      
        All key-value pairs that are defined within this block should be added to the output within the namespace defined by the object line. In other words, a key key within the object block {object} should be parsed the same way as a key defined by object.key.
      
        These are equivalent:
        {object}
        key: value
        {}
        object.key: value
      
        The namespace defined by an object block should persist until either A) a new object block or array is defined, or B) an empty object or array key ({} or []) is encountered. This allows for both explicit "closing" of a block, and implicit closing when a new object or array begins.
      
Keys within blocks should be treated exactly as they would if they had been defined naturally through dot notation. Because of this, objects can be "reopened" if an object block is defined more than once (the second definition of a block does not replace the original definition, as in arrays). This is to say, avoiding any conflicts encountered from the conflict resolution section above, values should always be merged into the output without deleting values unless necessary.
As soon as an object is initially opened, an empty object at that key should be added as its value.
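A minimal sketch of object-block scoping follows, assuming a dict-based output and handling only object lines and single-line key/value pairs. The names resolve and parse_objects are hypothetical, and arrays, multi-line values and command keys are deliberately left out.

```python
import re

# Assumed token definition: letters, digits, hyphens, underscores and periods.
TOKEN = r"[A-Za-z0-9_\-.]+"
OBJECT = re.compile(r"^[ \t]*\{[ \t]*(" + TOKEN + r")?[ \t]*\}")
KEY_VALUE = re.compile(r"^[ \t]*(" + TOKEN + r")[ \t]*:(.*)$")


def resolve(root, dotted_key):
    """Return the object at dotted_key, creating or replacing as needed."""
    scope = root
    for part in dotted_key.split("."):
        # Existing objects are kept (so blocks can be reopened);
        # anything else is replaced by an empty object.
        if not isinstance(scope.get(part), dict):
            scope[part] = {}
        scope = scope[part]
    return scope


def parse_objects(text):
    output = {}
    scope = output
    for line in text.splitlines():
        obj = OBJECT.match(line)
        pair = KEY_VALUE.match(line)
        if obj:
            name = obj.group(1)
            # {} returns to the top level; {name} opens or reopens a block.
            scope = resolve(output, name) if name else output
        elif pair:
            *parents, leaf = pair.group(1).split(".")
            target = resolve(scope, ".".join(parents)) if parents else scope
            target[leaf] = pair.group(2).strip()
    return output
```

Because resolve keeps existing objects, a second {scope} block merges into the first rather than replacing it, and an empty object appears in the output as soon as a block is opened.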
Arrays of objects can be defined to create multiple instances of an object. They begin by declaring an array line, similar to an object block above; however, arrays use square brackets instead of curly brackets.
          [array]
          item: 1
          item: 2
          item: 3
          []
        When an array definition is encountered, the parser should immediately create an empty array at the given key.
When an array begins, the parser should take note of the first key that is defined. This is the element delimiter key. Every time that key is encountered again, the parser should create a new element inside the array, and that key's value, and subsequent values (until a later occurrence of the delimiter key), should be added to that new object.
Keys within the array block should be assigned to an object within that array. All rules defined above relating to keys still apply, but within the scope of a particular object inside the array. Keys containing dot-notation are relative to individual items within the array.
          [array]
          scope.one: 1
          scope.two: 1
          scope.one: 1
          scope.two: 1
          []
        
          This example would result in two items, each containing a top-level scope key. This is distinct from the following array, where the resulting items would contain only the top-level keys one and two:
        
          [array.scope]
          one: 1
          two: 1
          one: 1
          two: 1
          []
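The delimiter-key rule can be sketched as a function that groups already-parsed key/value pairs into array items. The name build_object_array is hypothetical, and treating the full dotted key (rather than only its first segment) as the delimiter is an assumption of this sketch.

```python
def build_object_array(pairs):
    """Group (key, value) pairs into items, starting a new item at each
    occurrence of the first key seen."""
    items = []
    delimiter = None
    for key, value in pairs:
        if delimiter is None:
            delimiter = key  # the first key defined inside the array
        if key == delimiter:
            items.append({})
        # Dot notation is resolved relative to the current item.
        scope = items[-1]
        *parents, leaf = key.split(".")
        for part in parents:
            if not isinstance(scope.get(part), dict):
                scope[part] = {}
            scope = scope[part]
        scope[leaf] = value
    return items
```

Fed the pairs from the first example above, this yields two items that each hold a scope object; fed the pairs from the [array.scope] example, it yields two items holding only one and two.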
        
          Arrays can also be created that contain simple strings instead of objects. They begin the same way as object arrays do, and the same rules apply around naming a string array. String arrays contain values defined by lines beginning with *. All characters following an *, minus surrounding whitespace, should be stored as a new value inside the array.
        
          [array]
          * 1
          * 2
          * 3
          []
        
          When an array is opened, the parser should take note of which command line is defined first: either a key/value line, or an asterisk (*) line. If the former is defined first, then the array is defined as an object array. If a * line appears first, it is defined as a string array.
        
          Object arrays should treat token: lines as commands, and * lines as plain-text. String arrays should treat token: lines as plain-text and * lines as commands.
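Deciding between an object array and a string array can be sketched by scanning for the first command line inside the array. The names array_type and string_array_values are hypothetical, and the key pattern is an assumption based on the key rules in this spec.

```python
import re

KEY_VALUE = re.compile(r"^[ \t]*[A-Za-z0-9_\-.]+[ \t]*:")
BULLET = re.compile(r"^[ \t]*\*")


def array_type(lines):
    """Return "object" or "string" depending on which command line appears
    first inside the array, or None if neither has appeared yet."""
    for line in lines:
        if KEY_VALUE.match(line):
            return "object"
        if BULLET.match(line):
            return "string"
    return None


def string_array_values(lines):
    """Collect the values of * lines; all other lines are treated as plain
    text and skipped in this sketch."""
    return [BULLET.sub("", line, count=1).strip() for line in lines if BULLET.match(line)]
```

Once the type is fixed, lines of the other kind are no longer commands for that array and fall through to the plain-text buffer.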
        
Multi-line values are allowed within both object and string arrays:
        [object]
        key: value...
        ...with multiple lines.
        :end
        [object]
        * value...
        ...with multiple lines
        :end
        Inline comments in ArchieML are modeled after a common syntax in copy editing where editor's notes are placed within square brackets. In this tradition, all text within a matching set of square brackets on a single line, including the brackets, should be ignored by parsers.
In the event that square brackets are desired in the final value, a double set of brackets should be used. Parsers should replace sets of double brackets with single sets of brackets in the output. Care should also be taken not to remove text inside single brackets that is surrounded by an additional set of brackets.
        key: value [inline comment] more value.
        key: value [[this will appear in single brackets]] more value.
      As with other punctuation, surrounding non-newline whitespace should not affect parsing.
To avoid making assumptions about the end use of the output, whitespace on either side of inline comments is preserved, which may result in extra whitespace after comments have been removed.
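One possible way to implement inline comment removal is with two regular-expression passes: the first drops single-bracketed text, the second unwraps double brackets. The names SINGLE, DOUBLE and strip_inline_comments are hypothetical, and this sketch does not attempt to cover nested or unbalanced brackets.

```python
import re

# A bracketed run with no brackets inside it, not followed by another "]".
SINGLE = re.compile(r"\[[^\[\]\n\r]*\](?!\])")
# A double-bracketed run; group 1 is the text to keep.
DOUBLE = re.compile(r"\[\[([^\[\]\n\r]*)\]\]")


def strip_inline_comments(value):
    """Remove [comments] and turn [[text]] into [text]."""
    without_comments = SINGLE.sub("", value)
    return DOUBLE.sub(r"[\1]", without_comments)
```

Whitespace around a removed comment is left alone, so "value [inline comment] more value." keeps two spaces between "value" and "more".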
        Command keys are defined as any command line matching the :token pattern. Specifically, parsers should only treat this pattern of line as a command if the token begins with any of the following sequences:
      
          :skip
          :endskip
          :end
          :ignore
        
        Any text after the token should be ignored, and should not affect whether the line is interpreted as a command key. This includes cases where the token and extra characters are not separated by whitespace, such as :ignoreeverything. This is so that the intended effect of the command is not lost due to simple syntax errors.
      
        Due to this flexibility, care should be taken so that :endskip lines are not interpreted as :end lines.
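A simple way to keep :endskip from being read as :end is to test the longer name first. The sketch below assumes case-sensitive matching; the names COMMAND_KEYS and command_key are hypothetical.

```python
# Checked in order, so "endskip" is tested before its prefix "end".
COMMAND_KEYS = ("endskip", "ignore", "skip", "end")


def command_key(line):
    """Return the command key a line begins with, or None."""
    stripped = line.lstrip(" \t")
    if not stripped.startswith(":"):
        return None
    rest = stripped[1:].lstrip(" \t")
    for name in COMMAND_KEYS:
        # Prefix match only: any trailing text is ignored.
        if rest.startswith(name):
            return name
    return None
```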
      
Any unescaped command key should reset the buffer.
          The specifics of :end command keys are described above in "Multi-line values."
        
          It should be noted that :end lines which do not occur after a key/value line or array element line (*) should have no effect on the output.
          When a line beginning with :skip is encountered, the parser should begin to ignore all lines of text. Plain text lines should not be added to the buffer, and all command lines should be ignored. This is with the exception of two command keys which serve to end the skip block: :endskip (which closes the block and signals to resume normal parsing) or :ignore (which stops parsing altogether).
        
This allows for creating blocks of text where even lines that fit the format of command lines are ignored and do not affect parsing.
          :endskip should resume normal parsing. No special actions need be taken upon resuming parsing, and the buffer should be empty at this point.
        
          As soon as a line beginning with :ignore is encountered, parsing should stop immediately, and the output should be returned. This is a safety mechanism to allow for a safe comment / scratchpad area that has no chance of ending up in, or affecting, the output.
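The skip and ignore behavior can be sketched as a generator that filters lines before the rest of the parser sees them. The name active_lines is hypothetical. A full parser would also reset its buffer when these command keys are encountered, which this sketch does not model.

```python
def active_lines(lines):
    """Yield only the lines that should be parsed, honoring :skip,
    :endskip and :ignore."""
    skipping = False
    for line in lines:
        head = line.lstrip(" \t")
        if head.startswith(":ignore"):
            # Stop parsing altogether, even inside a skip block.
            return
        if head.startswith(":endskip"):
            skipping = False
        elif head.startswith(":skip"):
            skipping = True
        elif not skipping:
            yield line
```

Anything placed after an :ignore line never reaches the parser, which is what makes it safe to use as a scratchpad area.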