[Shootout-list] Directions of various benchmarks

José Bollo jose.bollo@tele2.fr
Wed, 18 May 2005 16:56:12 +0200


On Wednesday 18 May 2005 16:16, skaller wrote:
> On Wed, 2005-05-18 at 23:24, José Bollo wrote:
> > So the RegExp bench has to stay.
>
> I would say that we need several graded tests of
> string processing capabilities -- one test isn't enough.
>
> The particular tests we have include the phone number finder,
> but also the word count tests need to lex text to find and
> count words.
>
> > It is good for a language to offer it.
> > But only a small subset of regexp is used in the bench.
> > I like the use of (?<...), (?!...), (?:...), \2, ...
>
> I do not even know what these are. I have only needed to use
> 'regexps' half a dozen times in my whole life: excluding
> extremely heavy use of lexer generators. Most of the things
> you're probably talking about are NOT part of any sane
> concept of a regular expression: they're hacks and kludges
> added on by people without any proper mathematics and have
> prevented the adoption of better solutions.

maybe. are you sure?

> Generally the correct technique to handle these things is to
> use a lexer to get a token stream, then parse the token stream
> and emit the result -- it is fully understood mathematically,
> but in most languages lexing and parsing requires laborious
> interfacing with tools like lex and yacc.

maybe. are you sure?

> I would argue they still represent the proper solution if
> only they were properly integrated .. and in Felix they are
> integrated, although there are still rough edges.
>
> There is, however, plenty of argument for various additional
> tests such as high speed searching of HUGE documents for
> keywords, as well as complicated transductions of strings
> using weird stuff like the above, and probably going all
> the way up to building a parser.
>
> Just so you can see, Felix has expression:
>
> regexp letter = ['A'-'Z'];
> regexp digit = ['0'-'9'];
> regexp alnum = letter | digit;
>
> regmatch s with
>
> | digit+ => "number"
> | letter alnum * => "identifier"
>
> endmatch
>
> which is used for matching, and a variant of this which is used to
> match the head of a string, and is used for lexing.
>
> This is the end .. no captures, no back references,
> no substitution, nothing else. What is here is the
> definitive total of regular expressions -- if you
> go any further you're out of the domain of regular
> languages into the domain of context free languages
> and then to use a general tool you have to have
> a GLR or Earley parser engine (Felix has GLR),
> an RTN, or some other kind of context free language
> parsing technique.
(snip)
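
For readers who do not know Felix, here is a rough sketch of the same
idea in Python, using only character classes, alternation and
repetition (no captures, no back references). The names LETTER, DIGIT,
ALNUM, RULES and classify are made up for this illustration, and I am
assuming that the first rule matching the whole string wins; the real
Felix regmatch may resolve overlaps differently.

```python
import re

# Named sub-expressions mirroring the quoted Felix definitions.
LETTER = r"[A-Z]"
DIGIT = r"[0-9]"
ALNUM = rf"(?:{LETTER}|{DIGIT})"  # (?:...) only groups, it captures nothing

# Ordered (pattern, label) pairs; assumption: the first full match wins.
RULES = [
    (re.compile(rf"{DIGIT}+"), "number"),
    (re.compile(rf"{LETTER}{ALNUM}*"), "identifier"),
]

def classify(s):
    """Return the label of the first rule matching the whole string, else None."""
    for pattern, label in RULES:
        if pattern.fullmatch(s):
            return label
    return None
```

For example, classify("123") gives "number", classify("A1B") gives
"identifier", and classify("1A") gives None because no rule matches the
whole string.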

maybe a distinction has to be made between the 'rational expressions' that you 
are talking about and 'regular expressions', which are not really rational but 
extremely useful.

regular expressions with extractions and/or substitutions are mainstream, 
imho.
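
To make the point concrete, here is a small sketch in Python of the
kind of everyday use I mean: extraction with captures, substitution
with group references, and a back reference inside the pattern. The
function names and the phone number layout are only assumptions for
the example; they are not taken from the actual benchmark.

```python
import re

# Extraction: capture the area code and the local number separately.
# The "(ddd) ddd-dddd" layout is an assumption made for this sketch.
PHONE = re.compile(r"\((\d{3})\) (\d{3}-\d{4})")

def extract_phones(text):
    """Return a list of (area_code, local_number) tuples found in text."""
    return PHONE.findall(text)

def swap_names(text):
    """Substitution: rewrite 'last, first' as 'first last' using captures."""
    return re.sub(r"(\w+), (\w+)", r"\2 \1", text)

# Back reference: \1 must repeat exactly what group 1 matched.
DOUBLED = re.compile(r"\b(\w+) \1\b")

def has_doubled_word(text):
    """Return True if some word is immediately repeated, as in 'the the'."""
    return DOUBLED.search(text) is not None
```

The last pattern is the interesting one for this discussion: a back
reference makes the pattern match a language that is not regular in
the mathematical sense, yet it is a single line in most regexp
libraries.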

josé