lexer-gen

Compile-time lexer generation

License

This software is licensed under the AGPLv3. For more details, please see the LICENSE file.

Example

use lexer_gen::TokenImpl;
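
// The state type referenced by `#[lexer(state MyState)]` below.
// It must implement `Default` for `Token::lexer` to work;
// otherwise construct the lexer via `Token::lexer_with_state`.
#[derive(Debug, Default)]
struct MyState;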

#[derive(Debug, PartialEq, Eq, TokenImpl)]

// Sets the type of the Lexer's `state` field.
#[lexer(state MyState)]

// Specifies a closure that receives a `char` and must return a `bool`.
// If `true` is returned, the given character is skipped and not used for lexing.
// This also accepts a path like `path::to::my_skip_fn`.
#[lexer(skip |ch| matches!(ch, ' '|'\n'))]

// Enables tracing of the lexing process, by inserting `println!` statements
// into the generated code.
#[lexer(trace)]

// Enables debugging by printing the internal graph for matching a token.
#[lexer(debug)]
enum Token {

    // Matches the given string as-is.
    #[token("return")]
    Return,

    // Simple regexes are supported.
    #[token(regex = r"\w+")]
    Name,

    // The second argument is an optional callback.
    // Supported return types: `bool`, `T`, `Option<T>`
    //   and `Option<Result<T, Your-Token-Type::Error>>`.
    #[token("true", |_| true)]
    #[token("false", |_| false)]
    Boolean,

    // You can even store data inside a token!
    // The parameter given to the callback is of type `&mut Lexer<Your-Token-Type>`!
    // Callbacks can of course also be a path like `path::to::my_convert_fn`.
    #[token(regex = r"\d+", |lex| lex.slice().parse::<u64>().unwrap())]
    Number(u64),

    // Marks this token as the one emitted whenever the lexer hits the end
    // of the input. If no variant is marked with `#[end]`, `None` is returned instead.
    #[end]
    Eof,
}

fn main() {
    // This calls TokenImpl::lexer and constructs a new Lexer<Token>
    // with the given input. This however only works if either no state
    // is set, or the type configured implements `Default`.
    // 
    // To configure the initial state, use Token::lexer_with_state instead.
    let mut lexer = Token::lexer("return returnz true false 12");

    // Lexer<Token> implements Iterator<Item = Result<Token, Token::Error>>, so one can easily
    // iterate over all tokens. This is also memory efficient, as the lexer
    // only ever lexes one token with each call to `.next()`.
    while let Some(tok) = lexer.next() {
        let tok = tok.expect("Token");
        println!(
            " - {:?} {:?} {:?}",
            tok,

            // `.slice()` returns a slice into the source for the last
            // matched token.
            lexer.slice(),

            // Like `.slice()` but returns a `Range<usize>` instead for
            // the last matched token.
            lexer.span()
        );

        if tok == Token::Eof {
            break
        }
    }
}

Allowed regex

Allowed:

  • Sets [a-z_], including inversion of a set ([^a-z_]).
  • Basic escape characters \n\r\t\v\f
  • Some basic short-hands \d\w\s (and also their inverse \D\W\S)
  • Grouping ()
  • Repetition ?*+

Disallowed:

  • Any pattern that can match "nothing", e.g. x? or x*.
  • Mixing sets with different inversion states, e.g. [ab\D], which would expand to [ab[^0-9]]. [ab\d] (i.e. [ab[0-9]]), however, is fine.

Issues:

  • Alternations may panic with an error like Cannot insert range in a entry of read-size 2 or similar. This is caused by the order in which the alternation is written: the algorithm expects the parts to be ordered from lowest to highest read size; e.g. a regex like ab has a read size of 2, while [^a] has 1. See the sketch below.
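
A quick sketch of what the supported subset looks like in practice (the variant names are illustrative only, and no state, skip or error handling is configured):

use lexer_gen::TokenImpl;

#[derive(Debug, PartialEq, Eq, TokenImpl)]
enum Token {
    // Set followed by a short-hand with `*` repetition; the leading set is
    // required, so the pattern as a whole can never match "nothing".
    #[token(regex = r"[a-z_]\w*")]
    Ident,

    // Short-hand with `+` repetition.
    #[token(regex = r"\d+")]
    Number,

    // Alternation ordered by read size: `[^a]` reads one character, `ab`
    // reads two, so the shorter part comes first. Writing `ab|[^a]` instead
    // may trigger the panic described above.
    #[token(regex = r"[^a]|ab")]
    Other,
}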

Error handling

You can easily configure the error type used by the lexer with an additional attribute on your token type.

Note: the error type is afterwards available via <Token as lexer_gen::TokenImpl>::Error.

To configure the error you can use:

  • #[lexer(error = MyError)], uses MyError and expects it to implement lexer_gen::LexerError.
  • #[lexer(error = MyError, default)], uses MyError and expects it to implement Default.
  • #[lexer(error = MyError, ::path::to::my_error_constructor)], uses MyError and uses the free-standing function ::path::to::my_error_constructor to construct the error. It needs to have a signature of fn(&Lexer<'_, Token>) -> Token::Error (see the sketch after this list).
  • When no error type is configured, the crate defaults to ().
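
For illustration, a sketch of the constructor-function variant could look like the following (MyError and my_error_constructor are made-up names, and Lexer is assumed to be importable from the crate root as used in the examples above):

use lexer_gen::{Lexer, TokenImpl};
use std::ops::Range;

// Hypothetical error type that records where lexing failed.
#[derive(Debug, Clone, PartialEq)]
struct MyError {
    span: Range<usize>,
}

// Free-standing constructor with the required signature
// fn(&Lexer<'_, Token>) -> Token::Error.
fn my_error_constructor(lex: &Lexer<'_, Token>) -> MyError {
    MyError { span: lex.span() }
}

#[derive(Debug, PartialEq, Eq, TokenImpl)]
#[lexer(error = MyError, my_error_constructor)]
enum Token {
    #[token("return")]
    Return,
}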

Custom lexing

Note: this is a somewhat unstable API and should be used with care.

Note: custom lexing currently provides no error handling capability.

The generator allows you to insert custom lexing code between the skipping code created by #[lexer(skip ...)] and the actual lexing code of your tokens:

  • #[lexer(custom)] enables lexing via the lexer_gen::CustomLexer trait, which needs to be implemented on your token type.
  • #[lexer(custom ::path::to::my_lexing)] uses the specified path as a function to do the lexing. Allowed return types are: T, Option<T> and Option<Result<T, T::Error>>.

Example:

fn my_custom_lex(lex: &mut Lexer<'_, Token>) -> Option<Token> {
    use lexer_gen::internal::LexerInternal;
    if let Some(ch) = lex.peek_char() {
        if ch == '"' {
            // Parse a string...
            lex.bump(1);
            while let Some(ch) = lex.peek_char() {
                lex.bump(1);
                if ch == '"' {
                    return Some(Token::String);
                } else if ch == '\\' {
                    // Skip the escaped character so an escaped quote
                    // does not terminate the string.
                    lex.bump(1);
                }
            }
            // Ran out of input without finding a closing quote.
            panic!("END OF INPUT!");
        }
    }
    None
}
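
To wire this up via the path form, the function above could be referenced from the token type roughly like this (the String variant is assumed to exist for the custom code to return, and a bare function name is used as the path):

use lexer_gen::TokenImpl;

#[derive(Debug, PartialEq, Eq, TokenImpl)]
#[lexer(custom my_custom_lex)]
enum Token {
    // Produced only by `my_custom_lex` above; it has no #[token] pattern of its own.
    String,

    #[token("return")]
    Return,
}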

Attributions & Thanks

  • Thanks to the compiler-tools crate for inspiring me to create my own Rust project for generating lexers.
  • Thanks to the logos crate for the idea of organizing a tree before attempting code generation, as well as for being a general source of inspiration.