Ruff v0.4.0: a hand-written recursive descent parser for Python

Ruff v0.4.0 is available now! Install it from PyPI, or your package manager of choice:

pip install --upgrade ruff

Ruff is an extremely fast Python linter and formatter, written in Rust. Ruff can be used to replace Black, Flake8 (plus dozens of plugins), isort, pydocstyle, pyupgrade, and more, all while executing tens or hundreds of times faster than any individual tool.

This release marks an important milestone in Ruff’s development as we switch from a generated to a hand-written recursive descent parser.

Ruff's new parser is >2x faster, which translates to a 20-40% speedup for all linting and formatting invocations.

Repository	Linter (v0.3)	Linter (v0.4)	Formatter (v0.3)	Formatter (v0.4)
`home-assistant/core`	449.9	364.1	381.9	307.8
`pytorch/pytorch`	328.7	251.8	351.1	274.9
`python/cpython`	134.6	94.4	180.2	138.3
`huggingface/transformers`	198.5	143.6	239.0	184.1

Time in milliseconds to lint and format popular repositories. Lower is better.
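
To see how the table relates to the 20-40% figure, the timings can be converted into relative changes. The snippet below is a small sketch that copies the numbers from the table and computes, for each repository, both the fraction of time saved and the speedup ratio. The names `TIMINGS`, `time_reduction`, and `speedup` are illustrative and not part of Ruff.

```python
# Timings in milliseconds, copied from the table above:
# (linter v0.3, linter v0.4, formatter v0.3, formatter v0.4)
TIMINGS = {
    "home-assistant/core": (449.9, 364.1, 381.9, 307.8),
    "pytorch/pytorch": (328.7, 251.8, 351.1, 274.9),
    "python/cpython": (134.6, 94.4, 180.2, 138.3),
    "huggingface/transformers": (198.5, 143.6, 239.0, 184.1),
}


def time_reduction(old: float, new: float) -> float:
    """Fraction of the old running time that was saved."""
    return (old - new) / old


def speedup(old: float, new: float) -> float:
    """How many times faster the new version is."""
    return old / new


for repo, (lint_old, lint_new, fmt_old, fmt_new) in TIMINGS.items():
    print(
        f"{repo}: linter {time_reduction(lint_old, lint_new):.0%} less time "
        f"({speedup(lint_old, lint_new):.2f}x), "
        f"formatter {time_reduction(fmt_old, fmt_new):.0%} less time "
        f"({speedup(fmt_old, fmt_new):.2f}x)"
    )
```

For these four repositories, the time reductions fall between roughly 19% and 30%, and the speedup ratios between roughly 1.24x and 1.43x.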

A hand-written parser also opens the door to future optimizations and improvements, especially in error recovery.

Read on for discussion of the major changes, or take a look at the changelog.

A hand-written parser

Parsers form the foundational layer of any static analysis tool, transforming raw source code into Abstract Syntax Trees (ASTs), which serve as the basis for analysis.
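
As a quick illustration of what an AST is, the sketch below uses the `ast` module from Python's standard library (CPython's own parser, not Ruff's) to parse a one-line program and inspect the resulting tree. Ruff's AST is defined in Rust and differs in its details, so this only conveys the general idea.

```python
import ast

source = "total = price + tax"

# Parse the source text into a tree of nodes.
tree = ast.parse(source)

# The module body holds one statement: an assignment.
assignment = tree.body[0]
print(type(assignment).__name__)        # Assign

# The assigned value is a binary operation with two name operands.
value = assignment.value
print(type(value).__name__)             # BinOp
print(value.left.id, type(value.op).__name__, value.right.id)  # price Add tax

# Every node records where it came from in the source text, which is
# what lets a linter attach diagnostics to exact locations.
print(value.lineno, value.col_offset)   # 1 8
```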

Ruff v0.4.0 introduces a hand-written recursive descent parser, replacing the existing generated parser.

The difference between the two lies in how they are implemented:

A generated parser is created using a tool called a parser generator (in our case, LALRPOP). Typically, a parser generator requires the grammar to be defined in a domain-specific language (DSL), which the generator then converts into executable code. In our case, rules were defined in a .lalrpop file, which LALRPOP converted into Rust code.
On the other hand, a hand-written parser involves encoding the parsing rules directly in Rust code, using functions to define the parsing logic for individual nodes.
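
To make the contrast concrete, here is a minimal sketch of the hand-written style for a toy arithmetic grammar (not Python's grammar, and written in Python rather than Rust for brevity). Each grammar rule becomes one function, and the functions call each other recursively. All names are illustrative and not taken from Ruff's code.

```python
import re
from typing import Optional

# Toy grammar:
#   expr   := term (("+" | "-") term)*
#   term   := factor (("*" | "/") factor)*
#   factor := NUMBER | "(" expr ")"
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(source: str) -> list[str]:
    """Split the source into number and operator tokens."""
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"Unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens: list[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self) -> str:
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def parse_expr(self):
        node = self.parse_term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.parse_term())
        return node

    def parse_term(self):
        node = self.parse_factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.parse_factor())
        return node

    def parse_factor(self):
        token = self.peek()
        if token is not None and token.isdigit():
            return int(self.advance())
        if token == "(":
            self.advance()
            node = self.parse_expr()
            if self.peek() != ")":
                raise SyntaxError(f"Expected ')', found {self.peek()!r}")
            self.advance()
            return node
        raise SyntaxError(f"Expected a number or '(', found {token!r}")


def parse(source: str):
    parser = Parser(tokenize(source))
    tree = parser.parse_expr()
    if parser.peek() is not None:
        raise SyntaxError(f"Unexpected token {parser.peek()!r}")
    return tree


print(parse("1 + 2 * (3 - 4)"))  # ('+', 1, ('*', 2, ('-', 3, 4)))
```

Because every rule is ordinary code, each function is free to add special cases, custom error messages, or fast paths exactly where they are needed.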

On initial release, Ruff used the Python parser from the RustPython project. As Ruff evolved, we learned that a Python interpreter and linter have different needs, and the ideal AST for those two use cases can look pretty different. Ultimately, we pulled the parser into Ruff and maintained it separately as we evolved our AST structure.

Enter Victor Hugo Gomes, a contributor to the Ruff project who opened a pull request to introduce a hand-written recursive descent parser. It was an ambitious proposal, but one that made a lot of sense for Ruff's future, given that...

We were already maintaining the parser separately from RustPython.
We had a clear understanding of the AST structure we needed.
Victor had already demonstrated that the hand-written parser would significantly outperform the generated parser.
The generated parser, ironically, had become harder to maintain. Parser generators come with limitations around the grammars that they can support, and we'd already found ourselves fighting LALRPOP to support the latest Python syntax.
A hand-written parser would give us more control over error handling and recovery, which is especially important for editor-friendly tools that need to be resilient to syntax errors.

From there, we worked closely with Victor to integrate the parser into Ruff and add support for the latest Python syntax. Once the parser was fully compliant with Ruff's own test suite, we spent a few months testing and validating its accuracy and reliability across millions of lines of real-world Python code and fuzzer-generated inputs.

Advantages

In line with our initial motivations, the introduction of a recursive descent parser brings several benefits over the generated parser.

Control and Flexibility

A hand-written parser has complete control over the parsing process, which allows for greater flexibility in handling edge cases. For example, parenthesized `with` items in Python introduce a syntactic ambiguity regarding which node the opening parenthesis "belongs" to:

# Parenthesis belongs to the `with` item
with (item): ...
#    ^^^^^^ with item

# Parenthesis belongs to the context expression which is part of the `with` item
with (item) as var: ...
#    ^^^^^^        context expression
#    ^^^^^^^^^^^^^ with item

Encoding this ambiguity in a generated parser can be challenging, while a hand-written parser gives you the flexibility you need to handle such cases.
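
One way a hand-written parser can resolve this kind of ambiguity is to look ahead past the matching closing parenthesis before deciding how to treat the opening one. The sketch below does that for the two forms shown above only, using Python's standard `tokenize` module. The function `paren_owner` is hypothetical, and this is a deliberate simplification rather than a description of how Ruff's parser handles `with` statements.

```python
import io
import tokenize


def paren_owner(source: str) -> str:
    """Decide what the parenthesis directly after `with` belongs to.

    Simplified: only distinguishes `with (...):` from `with (...) as name:`.
    """
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    if tokens[0].string != "with" or tokens[1].string != "(":
        raise ValueError("expected source to start with `with (`")

    # Walk forward to the parenthesis that closes the one at index 1.
    depth = 0
    for index, token in enumerate(tokens[1:], start=1):
        if token.string == "(":
            depth += 1
        elif token.string == ")":
            depth -= 1
            if depth == 0:
                break
    else:
        raise SyntaxError("unclosed parenthesis")

    # The token *after* the closing parenthesis settles the question.
    following = tokens[index + 1].string
    if following == "as":
        return "context expression"
    if following == ":":
        return "with item"
    raise NotImplementedError(f"form not covered by this sketch: {following!r}")


print(paren_owner("with (item): ..."))         # with item
print(paren_owner("with (item) as var: ..."))  # context expression
```

A real parser has to cover many more forms (for example, `with (a) + b: ...`), which is part of why this case is awkward to express as a fixed grammar rule.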

Performance

The hand-written parser is significantly faster. Optimizing the generated parser was difficult, since we had minimal control over the generated code and few opportunities to take advantage of domain-specific knowledge about hot and cold paths and other properties of the data. While we could optimize our hand-written lexer, the parser remained a black box.

Benchmark	LALRPOP parser	Hand-written parser	Change
`parser[large/dataset.py]`	63.6 ms	26.6 ms	×2.4
`parser[numpy/ctypeslib.py]`	10.8 ms	5 ms	×2.2
`parser[numpy/globals.py]`	964.6 µs	424.9 µs	×2.3
`parser[pydantic/types.py]`	24.4 ms	10.9 ms	×2.2
`parser[unicode/pypinyin.py]`	3.8 ms	1.7 ms	×2.2

Micro-benchmark comparison between the two parsers.

Ruff's hand-written parser is >2x faster than the generated parser, which translates to a 20-40% speedup for all linting and formatting invocations.

Error handling

With a hand-written parser, we can now provide better error messages on encountering a syntax error, as seen in the following examples:

--- a/ruff/parser/old_error_messages
+++ b/ruff/parser/new_error_messages
   |
 1 | from x import
-  |              ^ SyntaxError: Unexpected token Newline
+  |              ^ SyntaxError: Expected one or more symbol names after import
   |

   |
 1 | async while test: ...
-  |       ^ SyntaxError: Unexpected token 'while'
+  |       ^ SyntaxError: Expected 'def', 'with' or 'for' to follow 'async', found 'while'
   |

   |
 1 | a; if b: pass; b
-  |    ^ SyntaxError: Unexpected token 'if'
+  |    ^ SyntaxError: Compound statements are not allowed on the same line as simple statements
   |

   |
 1 | with (item1, item2), item3,: ...
-  |                            ^ SyntaxError: Unexpected token ':'
+  |                           ^ SyntaxError: Trailing comma not allowed
   |

   |
 1 | x = *a and b
-  |        ^ SyntaxError: Unexpected token 'and'
+  |      ^ SyntaxError: Boolean expression cannot be used here
   |
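
These more specific messages are possible because, at each point in a hand-written parser, the code knows which construct it is in the middle of parsing. The sketch below illustrates the idea for the `async` case with a hypothetical function, `parse_async_statement`, operating on a pre-split list of tokens. It mirrors the wording of the message above but is not Ruff's implementation.

```python
def parse_async_statement(tokens: list[str]) -> tuple[str, str]:
    """Parse the start of an `async` statement from a list of token strings.

    Returns a `(modifier, keyword)` pair on success.
    """
    if not tokens or tokens[0] != "async":
        raise ValueError("expected the statement to start with 'async'")

    found = tokens[1] if len(tokens) > 1 else "end of input"
    if found not in ("def", "with", "for"):
        # We know we are inside an `async` statement, so the message can say
        # what was expected instead of only naming the offending token.
        raise SyntaxError(
            f"Expected 'def', 'with' or 'for' to follow 'async', found '{found}'"
        )
    return ("async", found)


print(parse_async_statement("async def f ( ) : ...".split()))

try:
    parse_async_statement("async while test : ...".split())
except SyntaxError as error:
    print(error)
```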

Error resilience

For many of our users, Ruff is a tool that lives in the editor, and in the editor, it's common for code to contain syntax errors, if only temporarily. Imagine, for example, that you're in the midst of defining a new function. You've typed out `def func(x):` but haven't filled in the function body. While your code is not syntactically valid, you'd still like to see linting and formatting diagnostics for the rest of the file.

A hand-written parser enables us to support error recovery, and thereby build error resilience into Ruff. That is, Ruff's parser can recover from syntax errors in the source code and continue parsing despite the interruption.
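
A common way to implement this kind of recovery is to record the error, skip ahead to a point where parsing can safely resume (such as the start of the next statement), and carry on. The sketch below shows that idea on a toy language of `name = integer` lines. It illustrates the general technique under that simplifying assumption, it is not Ruff's recovery strategy, and every name in it is hypothetical.

```python
import re

ASSIGNMENT_RE = re.compile(r"([A-Za-z_]\w*)\s*=\s*(\d+)$")


def parse_with_recovery(source: str):
    """Parse `name = integer` lines, collecting errors instead of stopping."""
    statements = []
    errors = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        text = line.strip()
        if not text:
            continue
        match = ASSIGNMENT_RE.match(text)
        if match is None:
            # Record the problem, then resume at the next line, which acts
            # as the synchronization point for this toy language.
            errors.append((lineno, f"Invalid statement: {text!r}"))
            continue
        statements.append((match.group(1), int(match.group(2))))
    return statements, errors


source = """\
a = 1
2 = b
c = 3
d =
e = 5
"""

statements, errors = parse_with_recovery(source)
print(statements)  # [('a', 1), ('c', 3), ('e', 5)]
print(errors)      # [(2, "Invalid statement: '2 = b'"), (4, "Invalid statement: 'd ='")]
```

Both errors are reported, and the three valid statements around them are still parsed, which is the property that matters for an editor.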

What does this look like in an editor? Imagine you've made multiple syntax errors in your code, as shown below:

import os  # unused-import (F401)


def fibonacci(n):
    """Compute the nth number in the Fibonacci sequence."""
    x = 1  # unused-variable (F841)
    if n in (0, 1)
        #         ^ SyntaxError: Expected ':', found newline
        return n
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)


if __name__ == "__main__":
    import sys

    1 = int(sys.argv[1])
#   ^ SyntaxError: Invalid assignment target
    print(fibonacci(n))  # undefined-name (F821)

With an error-resilient parser, Ruff can continue to analyze the code even after encountering the above syntax errors, which allows the linter to report diagnostics, and even apply fixes, in a single run.

While Ruff does not yet exhibit this error-resilient behavior, the hand-written parser lays the foundation for it, which we plan to implement in future releases.

What's next

Looking ahead, we aim to further improve Ruff's parsing capabilities with the following objectives:

Complete error recovery: Ensure that the parser recovers from all syntax errors, providing developers with an uninterrupted experience in the editor.
Reporting all syntax errors: Display all syntax errors encountered during parsing, providing developers with a complete overview of the issues in their code.
Continuous analysis: Allow the linter to proceed with analysis even in the presence of syntax errors.

Ultimately, our goal is to enable a first-class editor experience for Ruff by making it both faster and, critically, resilient to syntax errors.

Thank you!

Finally, we'd like to acknowledge the RustPython project, whose Python parser Ruff used from its initial release. That parser played a significant role in Ruff's early development, and we're grateful for the opportunity to collaborate and build on their work.

We'd also like to thank Victor Hugo Gomes for initiating the transition to a hand-written parser and for all the work that went into making it Ruff-compliant, and Addison Crump for contributing the fuzzer that we used to validate the new parser.

View the full changelog on GitHub.

Dhruv Manilawala