Killing a `Cow` made my JSON formatter 42% faster

Em dashes: 12

A formatter styles code consistently, usually improving human readability. If you are unfamiliar with formatters, tokenizers, or abstract syntax trees, see my article on how formatters work

The JavaScript ecosystem has a gap—most formatters are either featureful or fast. Prettier is the standard in the JavaScript ecosystem, but it’s not the fastest. Oxfmt has made major strides—supporting parity with Prettier’s JavaScript formatting with much improved performance. One report found a migration to Oxfmt formatted their repo 6.5x faster—from 13.9 seconds to 2.1

However these gains do not apply to JSON. Oxfmt is 10–20% slower than Prettier for JSON. Oxfmt has a native JavaScript formatter, but it reasonably delegates to Prettier for file types without native implementations1. Oxfmt plans for a native JSON implementation this year!

Prettier and Oxfmt both take more than a second to format a 2MB JSON file, while my formatter, JJPWRGEM takes around 30 milliseconds

Dense floats benchmark - Execution time

Parsing and stringifying a 2.2MB JSON file with lots of lightly nested arrays

See the full benchmarks for more details

See tabular benchmarks
CommandMedian Time
jsonxf14.0ms
jjp31.0ms
bun70.0ms
node89.0ms
jq112.0ms
dprint665.0ms
prettier1149.0ms
oxfmt1218.0ms

To be fair, JJPWRGEM doesn’t have Prettier feature parity—but what if it did? It’s a great base, and I am confident there are performance gains to be found while implementing features

I started with a well scoped and impactful change—normalizing numbers like Prettier—and I ended up speeding up my formatter by 42% in the process. If you’re interested in how that was calculated, jump to the final benchmarks

Number normalization

Let’s start with 3 normalizations related to scientific notation for simplicity. They all build nicely on each other

  1. Lowercases exponent symbol (from 1E5 to 1e5)
  2. Strips + (from 1e+5 to 1e5)
  3. Removes 0 exponents (from 1e0 to 1)

Baseline

At the moment, JJPWRGEM leaves numbers untouched and represents them as a single Cow<str>. I’ll explain more about Cow as relevant

I am benchmarking against several files with varying data shapes. Here are a few samples

dense floats
{ "type": "FeatureCollection",
  "features": [{
    "type": "Feature",
    "properties": { "name": "Canada" },
    "geometry": {"type":"Polygon","coordinates":[
      [[-65.613616999999977,43.420273000000009],
       [-65.619720000000029,43.418052999999986],
       [-65.625,43.421379000000059], ...]
    ]}
  }]
}
many objects and integers
{
	"events": {
		"138586341": {
			"description": null,
			"id": 138586341,
			"name": "30th Anniversary Tour",
			"subTopicIds": [337184269, 337184283],
			"topicIds": [324846099, 107888604]
		}
	},
	"areaNames": {
		"205705993": "Arrière-scène central",
		"205705994": "1er balcon central"
	}
}
strings and deep nesting
{
  "statuses": [{
    "metadata": { "result_type": "recent", "iso_language_code": "ja" },
    "created_at": "Sun Aug 31 00:29:15 +0000 2014",
    "id": 505874924095815700,
    "text": "@aym0566x \n\n名前:前田あゆみ\n第...",
    ...
  }]
}

Deser refers to deserializing into an abstract syntax tree. The performance is represented by throughput to avoid size based discrepancies. A 1MB file will be formatter faster than a 3MB file, but MB/s is comparable for both!

benchmarkbaseline (MB/s)
deser/dense floats115.6
deser/many objects and integers385.5
deser/strings and deep nesting257.8

Iteration 0: Build a new normalized string

The simplest approach is to build new strings for improperly formatted numbers

Unfortunately, each number needs traversed twice, once to find the range and again to validate it. Even if there are no numbers in our input, it requires slow heap allocations

We need a more complex approach that takes references to chunks of the input string

Iteration 1: 2 Cows

Only 2 parts of each number matter. The mantissa to left of the e and the exponent to the right. You may recognize this as scientific notation—1e5 represents 1e51e5

Say we have 1E+5, we can store 1 and 5 and reconstruct 1e5. If we have an exponent, the e or E is lowercased. Any +s can be skipped since 1e5 and 1e+5 are equivalent

pub enum Token<'a> {
    Number {
        mantissa: Cow<'a, str>,
        exponent: Cow<'a, str>
    },
}

This approach works, but lowers performance across the board. Cow<str> comes with overhead raw str avoids, negatively affecting datasets without many numbers

benchmarkbaseline (MB/s)this (MB/s)delta
deser/dense floats115.6101.1-12.5%
deser/many objects and integers385.5289.3-25.0%
deser/strings and deep nesting257.8214.8-16.7%

What is a Cow anyways?

Cow stands for Clone On Write. It references data until mutation is needed, then duplicates it. This pays off when most values are unchanged—only data needing modification is cloned

A to_lowercase function is a great use case—if the input is already lowercase, you can return a reference to the same string! In Rust terms, the data is “borrowed”

fn to_lowercase(string: &str) -> Cow<'_, str> {
    if string.is_lowercase() {
        // reference input string
        Cow::Borrowed(string)
    } else {
        // make a new string
        Cow::Owned(string.to_lowercase())
    }
}

Diagram showing a call to to_lowercase with an already-lowercase &str returns a Cow::Borrowed pointing to the same memory, avoiding allocation

Now imagine an uppercase input. A new string is created to avoid changing the referenced string. The Cow “owns” the new data and can update the allocated capacity without reallocating, but it must clean up that data when it goes out of scope

Diagram showing a call to to_lowercase with an uppercase &str returns a Cow::Owned pointing to a new allocation

Since a Cow is potentially responsible to clean up its data, it pays a small performance penalty when it goes out of scope. This ultimately renders 20% of the program’s instructions useless! Cow’s flexibility was never used

Even worse, each Token could be a Number, so cleanup checks run on tokens like OpenCurlyBrace that carry no data

Let’s get rid of the Cows!

Iteration 2: use &str instead of Cow<str>

pub enum Token<'a> {
    Number {
        mantissa: &'a str,
        exponent: &'a str
    },
}

The compiler gets to remove all those cleanup instructions and branches, which is a huge win for every datasets

However, the “dense floats” dataset is below baseline since it has no exponents—paying the cost to check for exponents that never materialize. Let’s try splitting the tokens in 2 to avoid paying that cost

benchmarkbaseline (MB/s)this (MB/s)delta
deser/dense floats115.6111.0-3.9%
deser/many objects and integers385.6442.9+14.9%
deser/strings and deep nesting258.7341.4+32.0%

Iteration 3: Separate Mantissa and Exponent variants

Instead of 1 variant holding the full number, we’ll split it in 2 so the CPU can read 2 tokens at once, saving an instruction

pub enum Token<'a> {
    Mantissa(&'a str),
    Exponent(&'a str),
}

CPUs read memory in chunks—like reading a word at a time instead of a full sentence

A focused CPU thinking and reading one JSON Token at a time

A CPU fetches 64 bytes of RAM at a time into its cache for processing, meaning 2 &strs or Cow<str>s per token limits reads to one Token at a time

A sad CPU saying “I can only read one token at a time this is so sad. please make token 32 bytes or lower so I can read 2 at once”

Variants holding 1 &str or Cow<str> are both 24 bytes, so the 2 fit in 1 read

A happy CPU saying “I can read two tokens at a time since token is 24 bytes thank you so much, enjoy higher throughput”

“Dense floats” deserialization is now better than baseline, but “many objects and integers” has regressed from the previous iteration due to higher number parsing overhead

benchmarkprev (MB/s)this (MB/s)delta
deser/dense floats111.0122.6+10.5%
deser/many objects and integers442.9401.7-9.3%
deser/strings and deep nesting341.4356.8+4.5%

Iteration 4: Split the state machine

Parsing both the mantissa and exponent has ballooned the number state machine to 88 bytes. Just like with larger tokens, The CPU needs two reads per state change

A state machine is like a sequential checklist. To find where a number starts and ends, we can follow this (abbreviated) checklist

  1. start with a digit or minus sign
  2. then another digit or a . for a float
  3. finish

At first, it made sense to have a full number state machine, but now we care about 2 separate chunks of that number

Not only will splitting allow the CPU to fetch the full state in one read, the compiler better optimize with less possibilities to reason about

enum MantissaState { ... }
enum ExponentState  {
    MinusOrPlusOrDigit,
    AfterSign,
    Digits,
    Zero,
    End
}

Now the “many objects and integers” dataset is even faster than baseline!

Unfortunately, “dense floats” has regressed to slightly below baseline, but that’s an acceptable tradeoff since typical JSON has more mixed data

benchmarkprev (MB/s)this (MB/s)delta
deser/dense floats122.6111.3-9.2%
deser/many objects and integers401.7456.9+13.7%
deser/strings and deep nesting356.8385.1+7.9%

Final Benchmarks

Switching from Cow to str and splitting Number token and state machines both help to improve performance

The headline of “42%” improvement refers to the strings and deep nesting dataset deserialization + prettification. The slower stage will pull down the faster one, so we add their times together and covert back to MB/s

throughputcombined=11throughputdeser+1throughputprettify\text{throughput}_{\text{combined}} = \frac{1}{\frac{1}{\text{throughput}_{\text{deser}}} + \frac{1}{\text{throughput}_{\text{prettify}}}}

For the baseline case

11257.8+12091.1229.5 MB/s\frac{1}{\frac{1}{257.8} + \frac{1}{2091.1}} \approx 229.5\ \text{MB/s}

For the new implementation

11385.1+12105.0325.5 MB/s\frac{1}{\frac{1}{385.1} + \frac{1}{2105.0}} \approx 325.5\ \text{MB/s}

Resulting in a gain of around 42%

325.5229.5229.50.41841.8%.\frac{325.5 - 229.5}{229.5} \approx 0.418 \approx 41.8\%.

Deserialization into AST

Benchmarks converting a string into an AST

“dense floats” pays for the extra unused token per number, but “many objects and integers” and “strings and deep nesting” have massive improvements due to converting Cow to str

baseline (MB/s)new (MB/s)delta
dense floats115.6111.3-3.7%
many objects and integers385.5456.9+18.5%
strings and deep nesting257.8385.1+49.4%

Prettify from AST

Benchmarks serializing AST into a human readable string

“dense floats” suffered the most for a branch on each Number to check if an exponent is available to print. Other datasets are flat

baseline (MB/s)new (MB/s)delta
dense floats1424.71332.0-6.5%
many objects and integers1853.21845.3-0.4%
strings and deep nesting2091.12105.0+0.7%

Uglify from Token Stream

Converts a string into tokens and serializes into a compact string

Big improvements across the board. Avoids building a syntax tree, so serialization avoids branching with formatting like with AST to string paths

baseline (MB/s)new (MB/s)delta
dense floats198.9251.2+26.3%
many objects and integers449.8630.4+40.2%
strings and deep nesting269.9418.2+54.9%

The benchmarks prove formatters can be fast and featureful—I did more work and improved performance in most categories. I’ll see y’all again when I implement the next Prettier feature!

Please consider subscribing to my RSS feed or your programs will give your CPU depression

A capacitor giving a CPU therapy. They say “So I hear you’ve been feeling cache miss-erable recently…”

The CPU is laying down on a coach and the capacitor is wearing glasses and holding a clipboard

Footnotes

  1. I love everything Oxfmt has done so far, so this is not meant to put down Oxfmt. It’s the only Prettier compatible tooling I’ve found that doesn’t crash when formatting larger JSON files. I’ve seen some amazing architectural decisions they’ve made. Open source is a lot of work