lexer-gen
Compile-time lexer generation
License
This software is licensed under the AGPLv3. For more details, please see the LICENSE file.
Example
use lexer_gen::TokenImpl;

// State type referenced by `#[lexer(state MyState)]` below. Deriving `Default`
// is what allows `Token::lexer` (as opposed to `Token::lexer_with_state`) to be used.
#[derive(Debug, Default)]
struct MyState;

#[derive(Debug, PartialEq, Eq, TokenImpl)]
// Sets the type of the Lexer's `state` field.
#[lexer(state MyState)]
// Specifies a closure that receives a `char` and must return a `bool`.
// If `true` is returned, the given character is skipped and not used for lexing.
// This also accepts a path like `path::to::my_skip_fn`.
#[lexer(skip |ch| matches!(ch, ' ' | '\n'))]
// Enables tracing of the lexing process by inserting `println!` statements
// into the generated code.
#[lexer(trace)]
// Enables debugging by printing the internal graph used for matching a token.
#[lexer(debug)]
enum Token {
    // Matches the string as-is.
    #[token("return")]
    Return,
    // Simple regexes are supported.
    #[token(regex = r"\w+")]
    Name,
    // The second argument is an optional callback.
    // Supported return types: `bool`, `T`, `Option<T>`
    // and `Option<Result<T, Your-Token-Type::Error>>`.
    #[token("true", |_| true)]
    #[token("false", |_| false)]
    Boolean,
    // You can even store data inside a token!
    // The parameter given to the callback is of type `&mut Lexer<Your-Token-Type>`!
    // Callbacks can, of course, also be a path like `path::to::my_convert_fn`.
    #[token(regex = r"\d+", |lex| lex.slice().parse::<u64>().unwrap())]
    Number(u64),
    // Marks this token as the one emitted whenever the lexer hits the end
    // of the input. If not configured, `None` is returned instead.
    #[end]
    Eof,
}
fn main() {
    // This calls TokenImpl::lexer and constructs a new Lexer<Token>
    // with the given input. This however only works if either no state
    // is set, or the configured type implements `Default`.
    //
    // To configure the initial state, use Token::lexer_with_state instead.
    let mut lexer = Token::lexer("return returnz true false 12");

    // Lexer<Token> implements Iterator<Item = Result<Token, Token::Error>>, so one
    // can easily iterate over all tokens. This is also memory efficient, as the lexer
    // only ever lexes one token with each call to `.next()`.
    while let Some(tok) = lexer.next() {
        let tok = tok.expect("Token");
        println!(
            " - {:?} {:?} {:?}",
            tok,
            // `.slice()` returns a slice into the source for the last
            // matched token.
            lexer.slice(),
            // Like `.slice()`, but returns a `Range<usize>` instead for
            // the last matched token.
            lexer.span()
        );
        if tok == Token::Eof {
            break;
        }
    }
}
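Since `MyState` in the example implements `Default`, `Token::lexer` can construct the initial state on its own. Below is a minimal sketch of the `Token::lexer_with_state` alternative mentioned above; the helper function is purely illustrative and the argument order is an assumption, not a documented signature:

fn lex_with_explicit_state(input: &str) {
    // Assumption: the input comes first, followed by the initial state value.
    let mut lexer = Token::lexer_with_state(input, MyState);
    while let Some(tok) = lexer.next() {
        let tok = tok.expect("Token");
        println!("{:?} {:?}", tok, lexer.slice());
        if tok == Token::Eof {
            break;
        }
    }
}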
Allowed regex
Allowed:
- Sets `[a-z_]`, including inversion of a set.
- Basic escape characters `\n` `\r` `\t` `\v` `\f`.
- Some basic shorthands `\d` `\w` `\s` (and also their inverses `\D` `\W` `\S`).
- Grouping `()`.
- Repetition `?` `*` `+`.
Disallowed:
- Any pattern that allows "nothing", i.e. `x?`, `x*`.
- Mixing sets of different inversion states, e.g. `[ab\D]`, which would expand to `[ab[^0-9]]`. `[ab\d]` (i.e. `[ab[0-9]]`), however, is fine.
Issues:
- Alternations may panic with an error like `Cannot insert range in a entry of read-size 2` or similar. This is due to the order in which the alternation is written: the algorithm expects you to put the parts from lowest to highest read size; e.g. a regex like `ab` has a read size of 2 while `[^a]` has a read size of 1. See the sketch below for an ordering that avoids the panic.
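As a concrete illustration of the ordering rule above, a hypothetical token using those same two branches would list the shorter one first (the variant name is made up for this sketch):

// `[^a]` has a read size of 1 and therefore comes before `ab`, which has a read size of 2.
#[token(regex = r"[^a]|ab")]
NotAOrAb,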
Error handling
You can configure the error type used by the lexer with an additional attribute on your token type.
Note: the error type is afterwards available via `<Token as lexer_gen::TokenImpl>::Error`.
To configure the error you can use:
- `#[lexer(error = MyError)]` uses `MyError` and expects it to implement `lexer_gen::LexerError`.
- `#[lexer(error = MyError, default)]` uses `MyError` and expects it to implement `Default`.
- `#[lexer(error = MyError, ::path::to::my_error_constructor)]` uses `MyError` and uses the free-standing function `::path::to::my_error_constructor` to construct the error. It needs to have a signature of `fn(&Lexer<'_, Token>) -> Token::Error`.
- When no error type is configured, the crate defaults to `()`.
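As a small sketch of the `default` variant from the list above (the `MyError` type and the token definitions here are hypothetical; only the attribute form and the `Default` requirement come from the list):

use lexer_gen::TokenImpl;

// Hypothetical error type; the `default` flag lets the generator construct it via `Default`.
#[derive(Debug, Default, PartialEq)]
struct MyError;

#[derive(Debug, PartialEq, Eq, TokenImpl)]
#[lexer(error = MyError, default)]
enum Expr {
    #[token("let")]
    Let,
    #[token(regex = r"\w+")]
    Name,
}

// The error type is then reachable as `<Expr as lexer_gen::TokenImpl>::Error`,
// and iterating the lexer yields `Result<Expr, MyError>` items.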
Custom lexing
Note: this is a somewhat unstable API and should be used with care.
Note: custom lexing currently provides no error handling capability.
The generator allows you to insert custom lexing code between the skipping code created by `#[lexer(skip ...)]` and the actual lexing code of your tokens:
- `#[lexer(custom)]` enables lexing via the `lexer_gen::CustomLexer` trait, which needs to be implemented on your token type.
- `#[lexer(custom ::path::to::my_lexing)]` uses the specified path as a function to do the lexing. Allowed return types are `T`, `Option<T>` and `Option<Result<T, T::Error>>`.
Example:
fn my_custom_lex(lex: &mut Lexer<'_, Token>) -> Option<Token> {
    use lexer_gen::internal::LexerInternal;

    if let Some(ch) = lex.peek_char() {
        if ch == '"' {
            // Consume the opening quote and scan for the closing one.
            lex.bump(1);
            while let Some(ch) = lex.peek_char() {
                lex.bump(1);
                if ch == '"' {
                    return Some(Token::String);
                } else if ch == '\\' {
                    // Skip the escaped character so an escaped quote
                    // does not terminate the string.
                    lex.bump(1);
                }
            }
            // Reached the end of the input before the closing quote.
            panic!("END OF INPUT!");
        }
    }

    // No custom token matched here.
    None
}
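To wire this function in, the second attribute form from the list above would point at it. The token definition below is a hypothetical sketch that accompanies the function above; in particular, using a plain module-local path instead of a fully qualified one is an assumption:

#[derive(Debug, PartialEq, Eq, TokenImpl)]
// Assumption: a module-local path to the function above is accepted here.
#[lexer(custom my_custom_lex)]
enum Token {
    // Emitted by `my_custom_lex` for string literals.
    String,
    #[token(regex = r"\w+")]
    Name,
}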
Attributions & Thanks
- Thanks to the compiler-tools crate for inspiring me to create my own Rust project for generating lexers.
- Thanks to the logos crate for the idea of organizing a tree before attempting code generation, as well as for being a general source of inspiration.