How do formatters work?


What is a formatter?

If you’ve programmed for a bit, you’ve probably run into a formatter. Maybe you’ve used prettier, black, rustfmt, or gofmt. Maybe it’s just built into your editor.

No matter what tools you use, they all serve the same purpose—styling code consistently. This reduces merge conflicts and avoids fights with your coworker Gerald, who wants to indent with 17 spaces for some reason. Here’s how a formatter actually works under the hood!

Gerald wrote valid JSON for me to review

{ "rust"   :
"is a must"  
     ,"key"
:
null}

Uhhhh please no. I can’t deal with this today

A formatter will right these wrongs by applying a set of rules to consistently style the output

{ 
  "rust": "is a must",
  "key": null
}

How do formatters know what’s important?

Both JSON blobs are functionally equivalent—how did the formatter know what can and can’t be cut?

A character’s surroundings define its meaning

Most formatters parse the string into a sequence of meaningful chunks of text called tokens

No matter how the JSON is styled, the tokens will be the same! Note there is no token for spaces or newlines—only the characters that carry meaning survive

{ 
  "rust": "is a must",
  "key": null
}
[OpenCurlyBrace, String("rust"), Colon, String("is a must"), Comma, String("key"), Colon, Null, ClosedCurlyBrace]
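As a sketch, a tokenizer for this tiny slice of JSON could be a single loop over characters (hypothetical Rust, not JJPWRGEM’s actual code; string escapes and error handling are trimmed down):

```rust
// Tokens for a tiny subset of JSON: objects, strings, and null
#[derive(Debug, PartialEq)]
enum Token {
    OpenCurlyBrace,
    ClosedCurlyBrace,
    Colon,
    Comma,
    String(String),
    Null,
}

// Turn a JSON string into a flat list of tokens, dropping whitespace
fn tokenize(input: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        match c {
            '{' => tokens.push(Token::OpenCurlyBrace),
            '}' => tokens.push(Token::ClosedCurlyBrace),
            ':' => tokens.push(Token::Colon),
            ',' => tokens.push(Token::Comma),
            '"' => {
                // Collect characters until the closing quote (escapes omitted)
                let mut s = String::new();
                while let Some(n) = chars.next() {
                    if n == '"' {
                        break;
                    }
                    s.push(n);
                }
                tokens.push(Token::String(s));
            }
            'n' => {
                // Expect the literal `null`; consume the remaining "ull"
                for _ in 0..3 {
                    chars.next();
                }
                tokens.push(Token::Null);
            }
            c if c.is_whitespace() => {} // whitespace produces no token
            other => panic!("unexpected character: {other}"),
        }
    }
    tokens
}
```

Running `tokenize` over Gerald’s mess and over the formatted version produces the exact same token list, which is the whole point.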

Why not just loop over the string?

Long story short, modularity. It’s hard to manage token validity and output at the same time

It’s certainly possible to loop directly over the JSON, tracking what the current character represents, but this gets complex fast

Strings alone have a laundry list of rules to abide by. Don’t worry about understanding all of them—just think about tracking string validations while handling other concerns like formatting

JSON strings

  1. start with a quotation mark, have 0 to many characters, then end with a quotation mark
  2. can contain escaped quotes (\") and other characters requiring escapes, like newlines (\n)
  3. can contain Unicode escapes, which look like \uXXXX where X is a hex digit
  4. cannot have unescaped control characters, like literal newlines, inside of them
  5. must contain valid UTF-8

Now imagine handling all these rules while making sure the tokens appear in the right order
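To make that concrete, here is a sketch of just the string rules in isolation, using a hypothetical `lex_string` helper (only a couple of escapes are handled; the real list is longer):

```rust
// Lex a single JSON string from the start of `input`, applying the rules above.
// Returns the decoded contents, or an error describing the violated rule.
fn lex_string(input: &str) -> Result<String, &'static str> {
    let mut chars = input.chars();
    // Rule 1: must start with a quotation mark
    if chars.next() != Some('"') {
        return Err("strings must start with a quotation mark");
    }
    let mut out = String::new();
    while let Some(c) = chars.next() {
        match c {
            '"' => return Ok(out), // closing quote ends the string
            '\\' => match chars.next() {
                // Rule 2: escaped quotes and other escapes
                Some('"') => out.push('"'),
                Some('n') => out.push('\n'),
                // Rule 3: \uXXXX Unicode escapes
                Some('u') => {
                    let hex: String = chars.by_ref().take(4).collect();
                    let code = u32::from_str_radix(&hex, 16)
                        .map_err(|_| "invalid \\u escape")?;
                    out.push(char::from_u32(code).ok_or("invalid code point")?);
                }
                _ => return Err("unsupported escape"), // \t, \r, etc. omitted
            },
            // Rule 4: no unescaped control characters
            c if c.is_control() => return Err("control characters must be escaped"),
            c => out.push(c),
        }
    }
    Err("unterminated string")
}
```

Even this stripped-down version is a screenful. Interleaving it with output formatting is exactly the kind of tangle tokenizing avoids.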

Abstract Syntax Trees (ASTs)

Tokens tell us what each chunk of text means, but they do not encode relationships between them. It’s useful to know how deeply nested a value is when formatting

Abstract syntax trees describe the structure without worrying about syntax. Key-value relationships and nesting are represented directly in the tree instead of through lower-level syntactic tokens like Colon or Comma

[OpenCurlyBrace, String("rust"), Colon, String("is a must"), Comma, String("key"), Colon, Null, ClosedCurlyBrace]
Object([
  (String("rust"), String("is a must")),
  (String("key"), Null)
])
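In Rust, a minimal AST for this subset of JSON can be an enum, and a recursive formatter can use nesting depth to decide indentation (the names `Value` and `format_value` are illustrative, not JJPWRGEM’s actual API):

```rust
// A minimal AST: structure, not syntax. No tokens for braces or commas here.
#[derive(Debug, PartialEq)]
enum Value {
    Null,
    String(String),
    // A JSON object is an ordered list of key/value pairs
    Object(Vec<(String, Value)>),
    // Numbers, booleans, and arrays omitted for brevity
}

// Pretty-print a value; `indent` is how deeply nested we currently are
fn format_value(value: &Value, indent: usize) -> String {
    match value {
        Value::Null => "null".to_string(),
        Value::String(s) => format!("\"{s}\""),
        Value::Object(pairs) => {
            let pad = "  ".repeat(indent + 1);
            let body: Vec<String> = pairs
                .iter()
                .map(|(k, v)| format!("{pad}\"{k}\": {}", format_value(v, indent + 1)))
                .collect();
            // Commas and braces are re-synthesized from the structure
            format!("{{\n{}\n{}}}", body.join(",\n"), "  ".repeat(indent))
        }
    }
}
```

Notice how the formatter never sees a Colon or Comma token. It reintroduces that syntax itself, guided purely by structure.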

Sometimes tokens are enough

ASTs are very convenient, but they are not free to build! My formatter JJPWRGEM supports uglification, also known as minification. Uglification trades readability for compact storage. If your main consumer is a computer, storing extra spaces for human readability is wasteful

{"rust":"is a must","key":null}

We don’t need to know about nesting or any structural relationships to properly format—tokens can be emitted immediately as they are parsed. As of the time of writing, skipping the AST in this path decreases formatting time by about 44%¹ for JJPWRGEM!
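A minification pass over tokens can be a single loop that maps each token straight to output, sketched here with a hypothetical `minify` function:

```rust
// Borrowed-string tokens for the same JSON subset
enum Token<'a> {
    OpenCurlyBrace,
    ClosedCurlyBrace,
    Colon,
    Comma,
    String(&'a str),
    Null,
}

// Uglify: each token maps straight to output, no structural bookkeeping needed
fn minify(tokens: &[Token]) -> String {
    let mut out = String::new();
    for token in tokens {
        match token {
            Token::OpenCurlyBrace => out.push('{'),
            Token::ClosedCurlyBrace => out.push('}'),
            Token::Colon => out.push(':'),
            Token::Comma => out.push(','),
            Token::String(s) => {
                out.push('"');
                out.push_str(s);
                out.push('"');
            }
            Token::Null => out.push_str("null"),
        }
    }
    out
}
```

In a real streaming path the loop body runs as each token is parsed, so the output is finished the moment the input is.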

However, skipping the AST isn’t the fastest if the AST is already cached—building the tree is the expensive part. The same benchmark takes 12ms when parsing from the source string to tokens, but only 0.77ms to serialize from a cached AST².

The best part? The same function handles both cases. JJPWRGEM has a visitor trait abstracting tree traversal into token events. An OpenCurlyBrace token will emit an on_object_open event. An AST object emits the appropriate lower level events for open curly braces, keys, values, and so on. The visitor decides what to do with each event, and in this case it immediately serializes the output!

pub trait Visitor<'a> {
    fn on_object_open(&mut self); // implementation emits '{'
    fn on_object_key(&mut self, key: &'a str); // implementation emits "\"{key}\""
    fn on_object_key_val_delim(&mut self);
    fn on_object_close(&mut self);
    fn on_null(&mut self);
    fn on_string(&mut self, value: &'a str);
    fn on_item_delim(&mut self);
    // omitted for brevity
}

Whether you’re formatting tokens or an AST, the formatter sees the same events
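For example, an uglifying implementation of that trait could append the most compact possible output on every event (the `Uglifier` struct is illustrative; JJPWRGEM’s real implementation surely differs):

```rust
// The visitor trait from above, reproduced so this sketch is self-contained
pub trait Visitor<'a> {
    fn on_object_open(&mut self);
    fn on_object_key(&mut self, key: &'a str);
    fn on_object_key_val_delim(&mut self);
    fn on_object_close(&mut self);
    fn on_null(&mut self);
    fn on_string(&mut self, value: &'a str);
    fn on_item_delim(&mut self);
}

// Every event appends its most compact form; no spaces, no newlines
struct Uglifier {
    out: String,
}

impl<'a> Visitor<'a> for Uglifier {
    fn on_object_open(&mut self) { self.out.push('{'); }
    fn on_object_key(&mut self, key: &'a str) {
        self.out.push('"');
        self.out.push_str(key);
        self.out.push('"');
    }
    fn on_object_key_val_delim(&mut self) { self.out.push(':'); }
    fn on_object_close(&mut self) { self.out.push('}'); }
    fn on_null(&mut self) { self.out.push_str("null"); }
    fn on_string(&mut self, value: &'a str) {
        self.out.push('"');
        self.out.push_str(value);
        self.out.push('"');
    }
    fn on_item_delim(&mut self) { self.out.push(','); }
}
```

Whether the events come from a token stream or a tree walk, the same `Uglifier` produces the same output.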

But wait—there’s more! The visitor enables so much more than just formatting. Since parsing validates the JSON, my validation visitor is just a no-op. Other extensions are just as clean—the visitor completely defines the output. I could map old keys to new ones, calculate key frequency, or validate against a JSON schema—all without new traversal logic
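As a sketch of how clean such an extension can be, a key-frequency visitor only needs one real event handler. Here I’ve assumed a variant of the trait with default no-op methods, which is my own simplification:

```rust
use std::collections::HashMap;

// A variant of the visitor trait where every event defaults to a no-op,
// so an analysis only overrides the events it cares about
pub trait Visitor<'a> {
    fn on_object_key(&mut self, _key: &'a str) {}
    fn on_object_open(&mut self) {}
    fn on_object_key_val_delim(&mut self) {}
    fn on_object_close(&mut self) {}
    fn on_null(&mut self) {}
    fn on_string(&mut self, _value: &'a str) {}
    fn on_item_delim(&mut self) {}
}

// Counts how often each key appears, borrowing the keys from the input
struct KeyFrequency<'a> {
    counts: HashMap<&'a str, usize>,
}

impl<'a> Visitor<'a> for KeyFrequency<'a> {
    fn on_object_key(&mut self, key: &'a str) {
        *self.counts.entry(key).or_insert(0) += 1;
    }
}
```

The traversal code never changes; only the visitor does.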


A formatter is all about finding what’s meaningful and normalizing what’s not. JJPWRGEM uses both tokens for parsing speed and ASTs for structure—resulting in huge flexibility and performance gains

Now you have all the tools you need to fight back against Gerald and their newfangled Fibonacci indentation

Please consider subscribing to my RSS feed, or I will personally ask Gerald to reformat your codebase by hand

Gerald giving a very menacing hand rub as they scheme

Footnotes

  1. canada.json is a GeoJSON file with arrays of floats. Building an AST and serializing takes 21.28ms + 0.77ms = 22.05ms, while serializing tokens directly takes 12.38ms. That is 12.38 / 22.05 ≈ 0.561, meaning the token path is about 44% faster. See full benchmarks here

  2. canada.json is a GeoJSON file with arrays of floats. Formatting from a cached AST takes 0.77ms, compared to 12ms when parsing from string to tokens first. See full benchmarks here