# lexer-gen

Compile-time lexer generation in Rust.
## License

This software is licensed under the AGPLv3. For more details please see the LICENSE file.
## Example

```rust
use lexer_gen::TokenImpl;

#[derive(Debug, PartialEq, Eq, TokenImpl)]
// Sets the type of the Lexer's `state` field.
#[lexer(state MyState)]
// Specifies a closure that receives a `char` and must return a `bool`.
// If `true` is returned, the given character is skipped and not used for lexing.
// This also accepts a path like `path::to::my_skip_fn`.
#[lexer(skip |ch| matches!(ch, ' '|'\n'))]
// Enables tracing of the lexing process by inserting `println!` statements
// into the generated code.
#[lexer(trace)]
// Enables debugging by printing the internal graph for matching a token.
#[lexer(debug)]
enum Token {
    // Matches the string as-is.
    #[token("return")]
    Return,

    // Simple regexes are supported.
    #[token(regex = r"\w+")]
    Name,

    // The second argument is an optional callback.
    // Supported return types: `bool`, `T`, `Option<T>`
    // and `Option<Result<T, Your-Token-Type::Error>>`.
    #[token("true", |_| true)]
    #[token("false", |_| false)]
    Boolean,

    // You can even store data inside a token!
    // The parameter given to the callback is of type `&mut Lexer<Your-Token-Type>`!
    // Callbacks can of course also be a path like `path::to::my_convert_fn`.
    #[token(regex = r"\d+", |lex| lex.slice().parse::<u64>().unwrap())]
    Number(u64),

    // Marks this token as the one emitted whenever the lexer hits the end
    // of the input. If this is not configured, `None` is returned instead.
    #[end]
    Eof,
}

fn main() {
    // This calls TokenImpl::lexer and constructs a new Lexer<Token>
    // with the given input. This, however, only works if either no state
    // is set or the configured state type implements `Default`.
    //
    // To configure the initial state, use Token::lexer_with_state instead.
    let mut lexer = Token::lexer("return returnz true false 12");

    // Lexer<Token> implements Iterator<Item = Result<Token, Token::Error>>, so one can easily
    // iterate over all tokens. This is also memory efficient, as the lexer
    // only ever lexes one token with each call to `.next()`.
    while let Some(tok) = lexer.next() {
        let tok = tok.expect("Token");
        println!(
            " - {:?} {:?} {:?}",
            tok,
            // `.slice()` returns a slice into the source for the last
            // matched token.
            lexer.slice(),
            // Like `.slice()` but returns a `Range<usize>` instead for
            // the last matched token.
            lexer.span()
        );

        if tok == Token::Eof {
            break;
        }
    }
}
```
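The example above references a state type via `#[lexer(state MyState)]` without defining it. The sketch below shows one way such a type could look; the field name, the helper function, and the argument order of `Token::lexer_with_state` are assumptions made for illustration, not part of this README.

```rust
// Hypothetical state type for the `#[lexer(state MyState)]` attribute above.
// `Token::lexer` requires the state type to implement `Default`;
// `Token::lexer_with_state` lets you pass an explicit initial state instead.
#[derive(Debug, Default)]
struct MyState {
    // Purely illustrative field.
    paren_depth: usize,
}

fn lex_with_explicit_state(src: &str) {
    // The argument order (input, state) is an assumption.
    let mut lexer = Token::lexer_with_state(src, MyState { paren_depth: 0 });
    while let Some(tok) = lexer.next() {
        let tok = tok.expect("Token");
        println!("{:?} {:?}", tok, lexer.slice());
        if tok == Token::Eof {
            break;
        }
    }
}
```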
## Allowed regex

Allowed:

- Sets `[a-z_]`, including inversion of a set.
- Basic escape characters `\n\r\t\v\f`.
- Some basic short-hands `\d\w\s` (and also their inverses `\D\W\S`).
- Grouping `()`.
- Repetition `?*+`.

Disallowed:

- Any pattern that allows "nothing", i.e. `x?`, `x*`.
- Mixing sets of different inversion state, i.e. `[ab\D]`, which would expand to `[ab[^0-9]]`. `[ab\d]` (`[ab[0-9]]`) however is fine.

Issues:

- Alternations may panic with an error like `Cannot insert range in a entry of read-size 2` or similar. This is due to the order the alternation is written in: the algorithm expects you to put the parts from lowest to highest read size; e.g. a regex like `ab` has a read size of 2 while `[^a]` has a read size of 1. See the sketch after this list.
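The ordering rule can be sketched as follows. The alternation syntax (`|`) inside the attribute and the enum shape are assumptions for illustration; the README only states that alternation parts must be ordered from lowest to highest read size.

```rust
use lexer_gen::TokenImpl;

#[derive(Debug, PartialEq, Eq, TokenImpl)]
enum Ordered {
    // Assumed to be fine: `[^a]` has a read size of 1 and `ab` a read size of 2,
    // so the parts are listed from lowest to highest read size.
    #[token(regex = r"[^a]|ab")]
    Ok,
    // Writing the parts the other way around (`ab|[^a]`) may panic during
    // code generation with the "Cannot insert range ..." error quoted above.
    #[end]
    Eof,
}
```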
## Error handling

You can configure the error type used by the lexer with an additional attribute on your token type.

Note: the error type is afterwards available via `<Token as lexer_gen::TokenImpl>::Error`.

To configure the error you can use:

- `#[lexer(error = MyError)]`, which uses `MyError` and expects it to implement `lexer_gen::LexerError`.
- `#[lexer(error = MyError, default)]`, which uses `MyError` and expects it to implement `Default` (see the sketch after this list).
- `#[lexer(error = MyError, ::path::to::my_error_constructor)]`, which uses `MyError` and calls the free-standing function `::path::to::my_error_constructor` to construct the error. It needs to have a signature of `fn(&Lexer<'_, Token>) -> Token::Error`.
- When no error type is configured, the crate defaults to `()`.
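A minimal sketch of the `default` variant, assuming only what the list above states (the concrete error struct and token definition are made up for this example):

```rust
use lexer_gen::TokenImpl;

// Hypothetical error type; `#[lexer(error = MyError, default)]` only requires
// that it implements `Default`.
#[derive(Debug, Default, PartialEq, Eq)]
struct MyError;

#[derive(Debug, PartialEq, Eq, TokenImpl)]
#[lexer(error = MyError, default)]
enum Token {
    #[token(regex = r"\d+")]
    Number,
    #[end]
    Eof,
}

// Iterating the lexer now yields Result<Token, MyError>, since
// <Token as lexer_gen::TokenImpl>::Error is MyError.
```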
## Custom lexing

Note: this is a somewhat unstable API and should be used with care.

Note: custom lexing currently provides no error handling capability.

The generator allows you to insert custom lexing code between the skipping code created by `#[lexer(skip ...)]` and the actual lexing code of your tokens:

- `#[lexer(custom)]` enables lexing via the `lexer_gen::CustomLexer` trait, which needs to be implemented on your token type.
- `#[lexer(custom ::path::to::my_lexing)]` uses the specified path as a function to do the lexing. Allowed return types are: `T`, `Option<T>` and `Option<Result<T, T::Error>>`.
Example:

```rust
fn my_custom_lex(lex: &mut Lexer<'_, Token>) -> Option<Token> {
    use lexer_gen::internal::LexerInternal;

    if let Some(ch) = lex.peek_char() {
        if ch == '"' {
            // Parse a string...
            lex.bump(1);
            while let Some(ch) = lex.peek_char() {
                lex.bump(1);
                if ch == '"' {
                    return Some(Token::String);
                } else if ch == '\\' {
                    lex.bump(1);
                }
            }
            panic!("END OF INPUT!");
        }
    }
    None
}
```
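A hypothetical way to wire this function into a token type; the path form used in the attribute and the token definition are assumptions (the list above only shows the `::path::to::my_lexing` form):

```rust
use lexer_gen::TokenImpl;

#[derive(Debug, PartialEq, Eq, TokenImpl)]
// The path here is an assumption; substitute the actual path to your function.
#[lexer(custom crate::my_custom_lex)]
enum Token {
    // Emitted by `my_custom_lex` for quoted string literals.
    String,
    #[token(regex = r"\w+")]
    Name,
    #[end]
    Eof,
}
```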
## Attributions & Thanks

- Thanks to the compiler-tools crate for inspiring me to create my own Rust project for generating lexers.
- Thanks to the logos crate for the idea of organizing a tree before attempting code generation, as well as for being a general source of inspiration.