Pattern matching resources
GavinWraith (26) 1563 posts |
Lots of text manipulation programs naturally fall into three phases:
Phase 3 is, to a certain extent, catered for by SWI OS_SubstituteArgs, but there appear to be no native RISC OS resources for phase 1. The PCRE (Perl Compatible Regular Expression) library is rather large and not all that efficient compared to the leaner and more modern PEGs (Parser Expression Grammars), which have only become more popular since 2004. If one were to have a relocatable module providing a PEG virtual machine what would its interface look like? Perhaps an SWI with entry:
and exit:
The PEG pattern syntax would include symbols indicating that the current position in the string is to be captured. Would such a resource be of use to anybody? Is this a sensible project? |
Timo Hartong (2813) 204 posts |
More software for RISCOS is always good. With the speed of ARM processors it could be done very quickly…I must admit I don’t have an immidiate use for it. |
Steve Drain (222) 1620 posts |
There is the RegEx module which is available from Stefan Bellon. I have a StrongHelp manual, and a manual for Regex itself, on my site . I use the module with Basalt to provide keywords such as SEARCH, MATCH and REPLACE Martin Avison has a application for testing regex patterns. |
Steve Drain (222) 1620 posts |
I have just realised that I have an updated manual that I have not uploaded. This is not significantly different, but adds details about Stefan and Martin. |
GavinWraith (26) 1563 posts |
I wrote an article in Drag’nDrop Vol 2 No 2 about PEGs.
So PEGs, though around since 1970, did not gain much traction till 2004, long after the early days of RISC OS. I noticed recently that they seem to be becoming more popular, and as they look easier to code than RegExes, maybe this would be an interesting project for an ARM assembler enthusiast? There are one or two papers detailing the instructions for a simple virtual machine for parsing, with state given by an index into the text to be parsed, an index into the VM instructions (program counter), a stack for saving indices when backtracking is needed, and a boolean for success/failure. |
nemo (145) 2552 posts |
An observation. What is the best programming language to use for a project? Usually the answer is “whichever one you’re least likely to screw up in”. regexps are notoriously opaque, and we already have a number of different ones (in effect) under RISC OS – GSTrans, ReadArgs, Zap, StrongEd (I presume) – and they’re all different. If one were going to abstract that (and it’s a good idea) I would like to see it completely abstracted, so the API allows choice of syntax, rather than imposing a new one. If I have to write a regexp, I’d like to write one in the syntax I’m most familiar with. Probably. |
GavinWraith (26) 1563 posts |
I am writing up some notes on pattern matching which I will put on my website soon. The idea at the back of my head is a library of BASIC assembler routines. The patterns are Parser Expression Grammars – slightly different from regexps because alternatives are prioritized. Thus . I am trying to be as neutral as possible about concrete syntax. The only way to do this sort of thing is to have a simple virtual machine, and to optimize the code at the VM instruction level, before translating to ARM assembler. Only very few ARM instructions would figure, so I guess any old language could be used, for the compiling. However when it comes to capturing information from a pattern match, there a lot depends on the ambient language, as different languages support different datatypes (even if they share the same name :). For a really minimal type of capture I am going for a single pattern which always succeeds and as side-effect pushes the current text-pointer onto a stack. On exit from the match the stack would be exported; basically one captures where things match, rather than what , which one can do afterwards anyway by extracting a substring.Actually to implement this project requires making more decisions; notably the choice of VM. |
nemo (145) 2552 posts |
Which is great, but for efficiency one would want a Create/Use/Discard API, in addition to a DoItAllInOneGo convenience call. My suggestion was to not limit the API to one particular syntax:
|
GavinWraith (26) 1563 posts |
I think I see what you are getting at. But a pattern is like a program source; it must be compiled or interpreted to act on a piece of text. I am not too bothered about the concrete syntax – that is window dressing. But the abstract syntax is another matter. That has to be fixed before one sets out. Simpler to test out a pattern matcher as a standalone application first. Too complicated to jump in with a relocatable module until an application is sorted. |
Steve Drain (222) 1620 posts |
Which is how the RegEx module presents results, except the pointers are in group registers, not stacked. RegEx is an implementation of the Posix RegexLib. |
Steve Drain (222) 1620 posts |
Basalt using RegEx has, among other ways: pattern%=COMPILE(pattern$) SEARCH target$,pattern% COMPILE CLEAR pattern%
And Basalt has: SEARCH target$,pattern$
Of course, but my point is that as far as regexp goes this already exists, and can be used with a BASIC veneer. If you do not want Basalt you can use a library. It is another question as to whether the more general cases, including PEG, are important. |
GavinWraith (26) 1563 posts |
Probably not. But I thought the exercise of implementing a PEG VM would appeal to ARM assembler enthusiasts. |
Steve Drain (222) 1620 posts |
I read a bit about PEG and it seemed to me to be a case of swings and roundabouts in comparison to regular expressions. Nevertheless, an implementation of PEG is going to be interesting. The problem of the various regexp systems already in RO programs, and their incompatible syntax, is real for the user. Because they are integral, I am not sure that there is scope to unify them in the way proposed. It is a case of: “I would not start from here”. |
nemo (145) 2552 posts |
“If I were going there I wouldn’t start here” Well, RO has been in need of a motto… Steve, where’s your website these days? kappa.me.uk seems neglected. I wish this woeful forum had sigs. |
GavinWraith (26) 1563 posts |
Beginnings of notes here. |
nemo (145) 2552 posts |
Gavin wrote, elsewhere
Pretty sure that’s not what you meant. |