The main design goal of flex is that it generate high-performance
scanners. It has been optimized for dealing well with large
sets of rules. Aside from the effects of table compression on scanner
speed outlined above, there are a number of options/actions which
degrade performance. These are, from most expensive to least:

    REJECT
    pattern sets that require backtracking
    arbitrary trailing context
    `^' beginning-of-line operator
    yymore()

with the first three all being quite expensive and the last two being quite cheap.
REJECT should be avoided at all costs when performance is
important. It is a particularly expensive option.
Getting rid of backtracking is messy and often may be an enormous amount of work for a complicated scanner. In principle, one begins by using the `-b' flag to generate a `lex.backtrack' file. For example, on the input

    %%
    foo        return TOK_KEYWORD;
    foobar     return TOK_KEYWORD;

the file looks like:

    State #6 is non-accepting -
     associated rule line numbers:
           2       3
     out-transitions: [ o ]
     jam-transitions: EOF [ \001-n  p-\177 ]

    State #8 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ a ]
     jam-transitions: EOF [ \001-`  b-\177 ]

    State #9 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ r ]
     jam-transitions: EOF [ \001-q  s-\177 ]

    Compressed tables always backtrack.

The first few lines tell us that there's a scanner state in which it can make a transition on an `o' but not on any other character, and that in that state the currently scanned text does not match any rule. The state occurs when trying to match the rules found at lines 2 and 3 in the input file. If the scanner is in that state and then reads something other than an `o', it will have to backtrack to find a rule which is matched. With a bit of headscratching one can see that this must be the state it's in when it has seen `fo'. When this has happened, if anything other than another `o' is seen, the scanner will have to back up to simply match the `f' (by the default rule).
The comment regarding State #8 indicates there's a problem when `foob' has been scanned. Indeed, on any character other than an `a', the scanner will have to back up to accept `foo'. Similarly, the comment for State #9 concerns when `fooba' has been scanned and an `r' does not follow.
The final comment reminds us that there's no point going to all the trouble of removing backtracking from the rules unless we're using `-f' or `-F', since there's no performance gain doing so with compressed scanners.
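What backing up costs at run time can be illustrated with a small hand-written C model of the two-rule scanner above. It is a sketch only, not the code flex generates: the state numbering, the token code TOK_CHAR, and the functions next_state, accepts, scan_token and chars_backed_up are all invented for this illustration. The model remembers the last accepting position and, when the automaton jams, gives back every character read past it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes for this sketch (values are arbitrary). */
enum { TOK_CHAR = 1, TOK_KEYWORD = 2 };

/* Hand-built DFA for the rules "foo", "foobar" and the default
   single-character rule.  State n (1..6) means the first n characters
   of "foobar" have been read; state 7 means one other character has
   been read.  Returns the next state, or -1 when the DFA jams. */
static int next_state(int state, char c)
{
    static const char word[] = "foobar";

    if (state == 0)
        return c == 'f' ? 1 : 7;
    if (state >= 1 && state <= 5 && c == word[state])
        return state + 1;
    return -1;
}

/* Token accepted in a state, or 0 for a non-accepting state. */
static int accepts(int state)
{
    switch (state) {
    case 1: case 7: return TOK_CHAR;     /* default rule */
    case 3: case 6: return TOK_KEYWORD;  /* foo, foobar */
    default:        return 0;            /* fo, foob, fooba */
    }
}

/* Scans one token at the start of text (n > 0 characters available).
   Stores the token length in *len and adds to *backed_up the number of
   characters that were read and then given back. */
static int scan_token(const char *text, size_t n, size_t *len,
                      size_t *backed_up)
{
    int state = 0, last_token = 0;
    size_t pos = 0, last_len = 0;

    assert(n > 0);
    while (pos < n) {
        int next = next_state(state, text[pos]);
        if (next < 0)
            break;                       /* jam: no transition */
        state = next;
        pos++;
        if (accepts(state)) {            /* remember last accepting point */
            last_token = accepts(state);
            last_len = pos;
        }
    }
    *backed_up += pos - last_len;        /* characters scanned in vain */
    *len = last_len;
    return last_token;
}

/* Scans the whole input and returns the total number of characters
   that had to be given back. */
static size_t chars_backed_up(const char *input)
{
    size_t n = strlen(input), pos = 0, backed_up = 0;

    while (pos < n) {
        size_t len;
        scan_token(input + pos, n - pos, &len, &backed_up);
        pos += len;
    }
    return backed_up;
}
```

In this model, scanning `foobax' reads `fooba', jams on the `x', and gives back `ba' in order to accept `foo', so chars_backed_up("foobax") is 2, while `foobar' is scanned without giving anything back.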
The way to remove the backtracking is to add "error" rules:

    %%
    foo        return TOK_KEYWORD;
    foobar     return TOK_KEYWORD;

    fooba      |
    foob       |
    fo         {
               /* false alarm, not really a keyword */
               return TOK_ID;
               }

Eliminating backtracking among a list of keywords can also be done using a "catch-all" rule:

    %%
    foo        return TOK_KEYWORD;
    foobar     return TOK_KEYWORD;

    [a-z]+     return TOK_ID;

This is usually the best solution when appropriate.
Backtracking messages tend to cascade. With a complicated set of rules
it's not uncommon to get hundreds of messages. If one can decipher
them, though, it often takes only a dozen or so rules to eliminate the
backtracking. It is easy, however, to make a mistake and have an error
rule accidentally match a valid token. A possible future flex feature
will be to automatically add rules to eliminate backtracking.
Variable trailing context (where both the leading and trailing parts do
not have a fixed length) entails almost the same performance loss as
REJECT (i.e., substantial). So when possible a rule like:

    %%
    mouse|rat/(cat|dog)   run();

is better written:

    %%
    mouse/cat|dog         run();
    rat/cat|dog           run();

or as

    %%
    mouse|rat/cat         run();
    mouse|rat/dog         run();

Note that here the special `|' action does not provide any savings, and can even make things worse (see section Deficiencies and Bugs).
Another area where the user can increase a scanner's performance (and
one that's easier to implement) arises from the fact that the longer the
tokens matched, the faster the scanner will run. This is because with
long tokens the processing of most input characters takes place in the
(short) inner scanning loop, and does not often have to go through the
additional work of setting up the scanning environment (e.g., yytext)
for the action. Recall the scanner for C comments:

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>"*"+[^*/\n]*
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

This could be sped up by writing it as:

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*
    <comment>[^*\n]*\n      ++line_num;
    <comment>"*"+[^*/\n]*
    <comment>"*"+[^*/\n]*\n ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

Now instead of each newline requiring the processing of another action, recognizing the newlines is "distributed" over the other rules to keep the matched text as long as possible. Note that adding rules does not slow down the scanner! The speed of the scanner is independent of the number of rules or (modulo the considerations given at the beginning of this section) how complicated the rules are with regard to operators such as `*' and `|'.
A final example in speeding up a scanner: suppose you want to scan through a file containing identifiers and keywords, one per line and with no other extraneous characters, and recognize all the keywords. A natural first approach is:

    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    .|\n     /* it's not a keyword */

To eliminate the backtracking, introduce a catch-all rule:

    %%
    asm      |
    auto     |
    break    |
    ... etc ...
    volatile |
    while    /* it's a keyword */

    [a-z]+   |
    .|\n     /* it's not a keyword */

Now, if it's guaranteed that there's exactly one word per line, then we can reduce the total number of matches by a half by merging in the recognition of newlines with that of the other tokens:

    %%
    asm\n    |
    auto\n   |
    break\n  |
    ... etc ...
    volatile\n |
    while\n  /* it's a keyword */

    [a-z]+\n |
    .|\n     /* it's not a keyword */

One has to be careful here, as we have now reintroduced backtracking
into the scanner. In particular, while we know that there will never be
any characters in the input stream other than letters or newlines, flex
can't figure this out, and it will plan for possibly needing
backtracking when it has scanned a token like `auto' and then the
next character is something other than a newline or a letter.
Previously it would then just match the `auto' rule and be done,
but now it has no `auto' rule, only an `auto\n' rule. To
eliminate the possibility of backtracking, we could either duplicate all
rules but without final newlines, or, since we never expect to encounter
such an input and therefore don't care how it's classified, we can
introduce one more catch-all rule, this one without a newline:

    %%
    asm\n    |
    auto\n   |
    break\n  |
    ... etc ...
    volatile\n |
    while\n  /* it's a keyword */

    [a-z]+\n |
    [a-z]+   |
    .|\n     /* it's not a keyword */

Compiled with `-Cf', this is about as fast as one can get a flex
scanner to go for this particular problem.
A final note: flex is slow when matching NUL's, particularly
when a token contains multiple NUL's. It's best to write
rules which match short amounts of text if it's anticipated
that the text will often include NUL's.