Lazyjson

Given my nature, I like to continuously improve my skills, and a great way to do that is to learn how existing things are built. That's why, one day, I decided I wanted to understand parsers. It has always fascinated me how something as complicated as a (programming) language can be parsed and transformed into something different, for example by a compiler or an interpreter. I didn't want to create a complete programming language; that would have taken way too much time to get anything useful out of it. So I decided to try and create a JSON parser.

I re-wrote the parser multiple times, in several languages. First I used TypeScript, simply because that's what I knew best. After writing a first, rudimentary implementation, I refactored the code a couple of times, until I decided I wanted to try out Go. So I started writing it in Go, and since I had recently learned about WebAssembly, I tried compiling it to that. After a couple of issues, I put the project aside for some time. But then I learned about another language that can be compiled to WebAssembly. On top of that, I keep hearing good things about it; supposedly, it's even going to end up inside the Linux kernel. So I chose to rewrite the parser once more, this time in Rust.

If you don't care about the details, just head over to the demo! Otherwise, read on.

Check out the live demo of Lazyjson

Also check out the code on GitHub.

Check out Lazyjson on GitHub

The parser consists of two main components: the tokenizer and the tree builder.

Tokenizer

The tokenizer is responsible for figuring out what each part of the text is. It performs just simple checks, for example: "Is this a number?", "Is this a comma?", "Is this whitespace?", and so on.

This step is called tokenization or often lexical analysis.
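For example, an input like {"foo":1} would, roughly speaking, be broken up into the following kinds of tokens (the names here are simplified; the actual token types are shown in the code below):

{        a delimiter
"foo"    a string literal
:        an operator (the JSON assignment)
1        a number literal
}        a delimiter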

I implemented each "consumer" in its own, separate function. A consumer, in my context, just takes the input and checks if it can identify the contents. For example, here is my operator consumer:

pub fn operator_consumer(inp: &mut CharQueue) -> Result<Option<Token>, TokenizationErr> {
    let c = inp.peek().ok_or(TokenizationErr::new_out_of_bounds())?;

    let tok = match c {
        ':' => Token::new_json_assignment_op(inp.idx()),
        '=' => Token::new_equal_assignment_op(inp.idx()),
        _ => return Ok(None),
    };

    inp.advance_by(1);

    Ok(Some(tok))
}

As you can see (well, at least if you know Rust a little bit), this function can return either an error or an optional token.
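All consumers share this same shape, so they can be treated uniformly. The Consumer type used in the next snippet is presumably something along these lines (just a sketch; the actual alias in lazyjson may be defined differently):

// Sketch of the shared consumer signature, not the actual lazyjson definition.
type Consumer = dyn Fn(&mut CharQueue) -> Result<Option<Token>, TokenizationErr>;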

Then, in the main tokenization function, all of the consumers are combined. It loops over every one of them: if a consumer returns a token, we can move on to the next piece of text; if it returns None, the text was not identified as consumable by this consumer, and we check it with the next one.

pub fn tokenize(inp: &str, config: &Config) -> Result<Vec<Token>, TokenizationErr> {
    if inp.is_empty() {
        return Err(TokenizationErr::new_no_inp());
    }

    let consumers: &[&Consumer] = &[
        &line_comment_consumer,
        &whitespace_consumer,
        &delimiter_consumer,
        &keyword_literal_consumer,
        &number_literal_consumer,
        &operator_consumer,
        &separator_consumer,
        &string_literal_consumer,
    ];

    let mut queue = CharQueue::new(inp);
    let mut toks = Vec::new();

    'o: while queue.has_remaining() {
        for consumer in consumers {
            let tok = consumer(&mut queue)?;

            if let Some(tok) = tok {
                // Omit unnecessary whitespace tokens
                if tok.typ == TokenType::WhitespaceLiteral {
                    continue 'o;
                }

                // Line comments are currently not supported by the treebuilder.
                // So if they are allowed, we omitted them, and otherwise throw an
                // error.
                if tok.typ == TokenType::LineComment {
                    if config.allow_line_comments {
                        continue 'o;
                    }

                    return Err(TokenizationErr::new_line_comments_not_allowed(
                        tok.from, tok.to,
                    ));
                }

                toks.push(tok);
                continue 'o;
            }
        }

        panic!("{:?} was not consumed", queue.peek());
    }

    Ok(toks)
}

Tree builder

After obtaining a list of tokens, the tree builder can check them for valid patterns, for example the token sequence [, false, ], and build a tree of nodes out of it. This example would result in the following node structure: ArrayNode -> entries -> BoolNode -> false.
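To give a rough idea of the shape of that tree, here is a heavily simplified sketch of what such node types could look like (the real lazyjson nodes carry more information, for example source positions):

// Hypothetical, simplified node types, only meant to illustrate the tree
// structure, not the actual lazyjson definitions.
enum Node {
    Array(ArrayNode),
    Bool(BoolNode),
    // numbers, strings, objects, and so on
}

struct ArrayNode {
    entries: Vec<Node>,
}

struct BoolNode {
    val: bool,
}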

The tree builder works fundamentally in the same way as the tokenizer: it combines a set of consumers and checks whether they can consume a given token composition.

pub fn number_consumer(
    inp: &mut Queue<Token>,
    _: &Rc<VarDict>,
    _: &Config,
) -> Result<Option<Node>, TreebuilderErr> {
    // Peek first so the token is only consumed if it really is a number.
    let t = match inp.peek() {
        None => return Ok(None),
        Some(t) => t,
    };

    if t.typ != TokenType::NumberLiteral {
        return Ok(None);
    }

    let i = inp.idx();
    let t = inp.next().unwrap();

    Ok(Some(NumberNode::new(i, t.val.clone()).into()))
}

/// Consumes all possible forms of "value constellations". For example simple
/// numbers (`1`), or arrays (`[1, 2]`), and so on. This consumer combines other
/// "sub-consumers" to achieve this behavior.
pub fn value_consumer(
    toks: &mut Queue<Token>,
    var_dict: &Rc<VarDict>,
    config: &Config,
) -> Result<Option<Node>, TreebuilderErr> {
    let consumers: &[&Consumer] = &[
        &array_consumer,
        &keyword_consumer,
        &variable_usage_consumer,
        &number_consumer,
        &object_consumer,
        &string_consumer,
    ];

    for consumer in consumers {
        let res = consumer(toks, var_dict, config)?;

        if res.is_some() {
            return Ok(res);
        }
    }

    Ok(None)
}

And with that, the basic functionality of a JSON parser is complete. But I wanted to add some features of my own, mostly "fixing" things I always found annoying about the JSON format.

Trailing commas

The JSON format does not permit trailing commas, which can be most annoying when moving entries around. Another argument for trailing commas I've read is that, to add an entry, you need to add the entry itself and a comma on the previous line. In source control, this shows up as a two-line change, even though only one entry was added.
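For example, adding a second entry to an object without trailing commas shows up like this in a diff, even though only one entry was actually added:

 {
-    "foo": "bar"
+    "foo": "bar",
+    "baz": "qux"
 }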

So I added a config option, as I wanted to be able to turn this feature on and off, and started checking for trailing commas:

consume_val_sep(inp)?;

// Check if the next token is an object close, if yes, we have a trailing
// separator.
if consume_obj_cls(inp, opn_i)? {
    if !config.allow_trailing_commas {
        return Err(TreebuilderErr::new_trailing_sep(inp.idx() - 2));
    }

    return Ok(Some(ObjectNode::new(opn_i, inp.idx(), entries).into()));
}
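Both this flag and the line comment flag from the tokenizer live on the Config that gets passed around. A minimal sketch of what it might contain, inferred purely from the flags that appear in this post (the real struct may hold more options):

// Sketch of the parser configuration, inferred from the flags used in the
// snippets above.
pub struct Config {
    pub allow_trailing_commas: bool,
    pub allow_line_comments: bool,
}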

I also added a custom error message for trailing separators, so when the option is disabled (no trailing commas allowed), the following will appear:

expected the next value or close (trailing separator not allowed), line: 1, char: 14

{"foo": "bar",}
             ^

Speaking of error messages, I spent quite a lot of time making them as useful as possible.

Error messages

The parser was designed to tell the user what went wrong. For me, this was important, as I often found the error messages of the JavaScript JSON parser quite useless.

Let's look at a few error messages:

Missing , inside an array

expected a `,` but received a `KeywordLiteral`, line: 1, char: 8

[false true]
       ^^^^

Missing quotes around an object key

(to be fair, this one isn't all that obvious, but still, it marks what is wrong)

expected a `StringLiteral` but received a `KeywordLiteral`, line: 1, char: 2

{key: "val"}
 ^^^

Forgot to close the object

object was not terminated, line: 1, char: 1

{"foo": "bar"
^

Next up: line comments.

Line comments

Provided the correct flag is set, the parser supports line comments. Well, the tokenizer simply ignores them. If the flag is not set, the output will be:

line comments not allowed
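As a concrete example, an input like the following (assuming //-style comments) parses fine with the flag enabled and produces the error above otherwise:

{
    // this line is simply skipped by the tokenizer
    "foo": "bar"
}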

Emitting

Given that I now have a complete tree of nodes, I implemented the opposite of parsing: emitting! The implementation is somewhat limited: the emitter is not configurable at all, but that wasn't its purpose anyway. I mainly implemented it so that one can see some sort of output instead of just "parsed successfully". Also, the next feature would be hard to demonstrate without it.
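Conceptually, emitting is just a recursive walk over the node tree. Here is a minimal sketch, reusing the simplified Node types from the tree builder section above (the real emitter works differently and also handles things like indentation):

// Hypothetical recursive emitter over the simplified Node sketch from above,
// not the actual lazyjson emitter API.
fn emit(node: &Node) -> String {
    match node {
        Node::Bool(b) => b.val.to_string(),
        Node::Array(a) => {
            let entries: Vec<String> = a.entries.iter().map(emit).collect();
            format!("[{}]", entries.join(", "))
        }
    }
}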

So let's look at maybe the biggest feature I added to JSON.

Variables

Yep, I added variables. There isn't much to say about them, except that they can be defined inside container nodes (arrays and objects), and their scope is bound to the node they are defined in. Let's jump into some examples:

This is a valid variable declaration:

{let foo = 10}

And would simply output:

{}

Actually using the variable:

{
    let foo = "bar",
    "foobar": foo
}

The output:

{
	"foobar": "bar"
}

Nested variables are supported as well:

{
    let port = 3000,
    let apiArgs = ["run", port],
    let webArgs = ["bind", port],
    "services": {
        "api": apiArgs,
        "web": webArgs
    }
}

The output:

{
	"services": {
		"api": ["run", 3000],
		"web": ["bind", 3000]
	}
}

Conclusion

This is, like all the others, a learning project, and not intended for actual use. There are some bugs, some errors, and definitely some improvements that could be made.

If you made it this far, definitely check out the demo!

Check out the live demo of Lazyjson

Check out Lazyjson on GitHub