lexer-gen

Compile-time lexer generation

License

This software is licensed under the AGPLv3. For more details, please see the LICENSE file.

Example

use lexer_gen::TokenImpl;
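
// The state type referenced by `#[lexer(state MyState)]` below.
// It must implement `Default` for `Token::lexer` to work;
// otherwise construct the lexer via `Token::lexer_with_state`.
#[derive(Debug, Default)]
struct MyState;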

#[derive(Debug, PartialEq, Eq, TokenImpl)]

// Sets the type of the Lexer's `state` field.
#[lexer(state MyState)]

// Specifies a closure that receives a `char` and must return a `bool`.
// If `true` is returned, the given character is skipped and not used for lexing.
// This also accepts a path like `path::to::my_skip_fn`.
#[lexer(skip |ch| matches!(ch, ' '|'\n'))]

// Enables tracing of the lexing process, by inserting `println!` statements
// into the generated code.
#[lexer(trace)]

// Enables debugging by printing the internal graph for matching a token.
#[lexer(debug)]
enum Token {

    // Matches the given string as-is.
    #[token("return")]
    Return,

    // Simple regexes are supported.
    #[token(regex = r"\w+")]
    Name,

    // The second argument is an optional callback.
    // Supported return types: `bool`, `T`, `Option<T>`
    //   and `Option<Result<T, Your-Token-Type::Error>>`.
    #[token("true", |_| true)]
    #[token("false", |_| false)]
    Boolean,

    // You can even store data inside a token!
    // The parameter given to the callback is of type `&mut Lexer<Your-Token-Type>`!
    // Callbacks can of course also be a path like `path::to::my_convert_fn`.
    #[token(regex = r"\d+", |lex| lex.slice().parse::<u64>().unwrap())]
    Number(u64),

    // Marks this token as the one emitted whenever the lexer hits the end
    // of the input. If no variant is marked with `#[end]`, `None` is returned instead.
    #[end]
    Eof,
}

fn main() {
    // This calls TokenImpl::lexer and constructs a new Lexer<Token>
    // with the given input. This however only works if either no state
    // is set, or the type configured implements `Default`.
    // 
    // To configure the initial state, use Token::lexer_with_state instead.
    let mut lexer = Token::lexer("return returnz true false 12");

    // Lexer<Token> implements Iterator<Item = Result<Token, Token::Error>>, so one can easily
    // iterate over all tokens. This is also memory efficient, as the lexer
    // only ever lexes one token with each call to `.next()`.
    while let Some(tok) = lexer.next() {
        let tok = tok.expect("Token");
        println!(
            " - {:?} {:?} {:?}",
            tok,

            // `.slice()` returns a slice into the source for the last
            // matched token.
            lexer.slice(),

            // Like `.slice()` but returns a `Range<usize>` instead for
            // the last matched token.
            lexer.span()
        );

        if tok == Token::Eof {
            break
        }
    }
}

Allowed regex

Allowed:

  • Sets [a-z_], including inversion of a set ([^a-z_]).
  • Basic escape characters \n\r\t\v\f
  • Some basic short-hands \d\w\s (and also their inverse \D\W\S)
  • Grouping ()
  • Repetition ?*+

Disallowed:

  • Any pattern that can match "nothing", e.g. x? or x*.
  • Mixing sets with different inversion states, e.g. [ab\D], which would expand to [ab[^0-9]]. [ab\d] (i.e. [ab[0-9]]), however, is fine.

Issues:

  • Alternations may panic with an error like Cannot insert range in a entry of read-size 2 or similar. This is caused by the order in which the alternation is written: the algorithm expects the parts to be ordered from lowest to highest read size; e.g. a regex like ab has a read size of 2, while [^a] has 1. See the sketch below.
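
A quick sketch of what the supported subset looks like in practice (the variant names are illustrative only, and no state, skip or error handling is configured):

use lexer_gen::TokenImpl;

#[derive(Debug, PartialEq, Eq, TokenImpl)]
enum Token {
    // Set followed by a short-hand with `*` repetition; the leading set is
    // required, so the pattern as a whole can never match "nothing".
    #[token(regex = r"[a-z_]\w*")]
    Ident,

    // Short-hand with `+` repetition.
    #[token(regex = r"\d+")]
    Number,

    // Alternation ordered by read size: `[^a]` reads one character, `ab`
    // reads two, so the shorter part comes first. Writing `ab|[^a]` instead
    // may trigger the panic described above.
    #[token(regex = r"[^a]|ab")]
    Other,
}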

Error handling

You can easily configure the error type used by the lexer with an additional attribute on your token type.

Note: the error type is afterwards available via <Token as lexer_gen::TokenImpl>::Error.

To configure the error you can use:

  • #[lexer(error = MyError)], uses MyError and expects it to implement lexer_gen::LexerError.
  • #[lexer(error = MyError, default)], uses MyError and expects it to implement Default.
  • #[lexer(error = MyError, ::path::to::my_error_constructor)], uses MyError and uses the free-standing function ::path::to::my_error_constructor to construct the error. It needs to have a signature of fn(&Lexer<'_, Token>) -> Token::Error (see the sketch after this list).
  • When no error type is configured, the crate defaults to ().
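
For illustration, a sketch of the constructor-function variant could look like the following (MyError and my_error_constructor are made-up names, and Lexer is assumed to be importable from the crate root as used in the examples above):

use lexer_gen::{Lexer, TokenImpl};
use std::ops::Range;

// Hypothetical error type that records where lexing failed.
#[derive(Debug, Clone, PartialEq)]
struct MyError {
    span: Range<usize>,
}

// Free-standing constructor with the required signature
// fn(&Lexer<'_, Token>) -> Token::Error.
fn my_error_constructor(lex: &Lexer<'_, Token>) -> MyError {
    MyError { span: lex.span() }
}

#[derive(Debug, PartialEq, Eq, TokenImpl)]
#[lexer(error = MyError, my_error_constructor)]
enum Token {
    #[token("return")]
    Return,
}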

Custom lexing

Note: this is a somewhat unstable API and should be used with care.

Note: custom lexing currently provides no error handling capability.

The generator allows you to insert custom lexing code between the skipping code created by #[lexer(skip ...)] and the actual lexing code of your tokens:

  • #[lexer(custom)] enables lexing via the lexer_gen::CustomLexer trait, which needs to be implemented on your token type.
  • #[lexer(custom ::path::to::my_lexing)] uses the specified path as a function to do the lexing. Allowed return types are: T, Option<T> and Option<Result<T, T::Error>>.

Example:

fn my_custom_lex(lex: &mut Lexer<'_, Token>) -> Option<Token> {
    use lexer_gen::internal::LexerInternal;
    if let Some(ch) = lex.peek_char() {
        if ch == '"' {
            // Parse a string...
            lex.bump(1);
            while let Some(ch) = lex.peek_char() {
                lex.bump(1);
                if ch == '"' {
                    return Some(Token::String);
                } else if ch == '\\' {
                    // Skip the escaped character so an escaped quote
                    // does not terminate the string.
                    lex.bump(1);
                }
            }
            // Ran out of input without finding a closing quote.
            panic!("END OF INPUT!");
        }
    }
    None
}
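
To wire this up via the path form, the function above could be referenced from the token type roughly like this (the String variant is assumed to exist for the custom code to return, and a bare function name is used as the path):

use lexer_gen::TokenImpl;

#[derive(Debug, PartialEq, Eq, TokenImpl)]
#[lexer(custom my_custom_lex)]
enum Token {
    // Produced only by `my_custom_lex` above; it has no #[token] pattern of its own.
    String,

    #[token("return")]
    Return,
}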

Attributions & Thanks

  • Thanks to the compiler-tools crate for inspiring me to create my own Rust project for generating lexers.
  • Thanks to the logos crate for the idea of organizing a tree before attempting code generation, as well as for being a general source of inspiration.