MAXMATCH Parser Combinator, With/Without Rollback!

All right... hold onto your little blue coffee cups... :coffee: because I've got something rather awesome working!!! :nut_and_bolt:

TL;DR - Show Me the Code

Here is MAXMATCH-D (default "rollback") and MAXMATCH-C (custom "rollback"), implemented for your parsing pleasure!

(Note that this also shows how you can define a COMBINATOR in your own files and call it directly. This lets it step in to fill the feature hole that PARSING-AT had to patch around...

...Also see the nice demonstrations of how BLOCK! grouping can be reorganized freely without bugs or worries. You can use them or don't, and the engine sorts it out!)

How It Works

Quick Refresher: The role of the UPARSE engine is to wire up the parser combinators (you might think of them as the "keywords"), which take some number of rules as parameters. They are used to create parsers, which just process the input and return a synthesized result and a remainder.

(For instance: BETWEEN is a combinator which takes in the input and two parsers. But once an instance of BETWEEN is "combinated" it becomes specialized with the right parser instances and becomes a parser itself. When called from the outside it appears as a parser with only the input parameter...as its combinator parameters have been fixed as the appropriate parser functions.)
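To make that concrete, here's a rough Python analogy (not actual Ren-C; literal, between, and the exception-on-failure convention are all invented stand-ins for illustration):

```python
# Rough Python analogy (invented names, not actual Ren-C): a combinator
# takes parsers as parameters; once "combinated" (specialized with those
# parsers) it is itself a parser, taking only the input and returning a
# (synthesized result, remainder) pair.

def literal(text):
    """Make a parser that matches a literal string and synthesizes it."""
    def parser(inp):
        if inp.startswith(text):
            return text, inp[len(text):]
        raise ValueError("no match")
    return parser

def between(opener, closer):
    """BETWEEN-style combinator: fix the two parser parameters, and the
    result is a plain parser of the input."""
    def parser(inp):
        _, rest = opener(inp)              # match and discard the opener
        for i in range(len(rest) + 1):     # find where the closer matches
            try:
                _, tail = closer(rest[i:])
                return rest[:i], tail      # synthesize what's in between
            except ValueError:
                continue
        raise ValueError("no closer found")
    return parser

p = between(literal("<"), literal(">"))    # "combinated": now just a parser
print(p("<abc>def"))                       # ('abc', 'def')
```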

I spoke about adding a third return result representing whatever "pending material" a combinator has accrued. For the moment let's just say this is a BLOCK! of values of various types. Combinators like COLLECT or GATHER will filter through these blocks and pull out the parts they think are relevant to them. But most combinators just want to aggregate results together and pass them through.

I've called this third result "pending". And what I've done is notice, when you write a COMBINATOR spec block, whether you explicitly mention the pending: return result or not. If you do not mention it, it is assumed you want the "default aggregating behavior".

So what is the default aggregating behavior? Well, when the UPARSE engine fills in the parser parameters to your combinator, they all start out with their own pending return result. But what the common combinator prelude can do is specialize out that parameter and wire it up as a sequential aggregator.

Hence if you don't indicate you have a pending: return result from your combinator, the parsers your combinator is specialized with will appear not to have pending return results either! They'll only return the 2 classical results: the synthesized value and the remainder of input to process. All the wiring happens behind the scenes.

But if you do say you have a pending: result, the parsers your combinator receives will all demand to return that third pending result. And you need to decide what to do with each pending result to produce your combinator's own result.
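A sketch of that wiring (Python analogy; auto_aggregate, keep_a, and the three-result convention are my assumed names, not UPARSE's actual internals):

```python
# Sketch (Python analogy, assumed names): internally every parser returns
# (value, remainder, pending).  When a combinator's spec doesn't mention
# pending:, the engine hands it child parsers wrapped like this, so the
# pendings of successful calls are silently aggregated onto one shared
# accumulator and the combinator only ever sees two classical results.

def auto_aggregate(child, accumulator):
    def wrapped(inp):
        value, rest, pending = child(inp)   # raises on failure, so...
        accumulator.extend(pending)         # ...failed calls add nothing
        return value, rest                  # pending specialized away
    return wrapped

def keep_a(inp):
    """Toy three-result parser: match "a" and pend its quoted value."""
    if inp.startswith("a"):
        return "a", inp[1:], ["'a"]
    raise ValueError("no match")

acc = []
two_result = auto_aggregate(keep_a, acc)
print(two_result("abc"))   # ('a', 'bc')
print(acc)                 # ["'a"]  -- aggregated behind the scenes
```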

Everybody Got That? :family_man_woman_girl_boy:

Executive summary that I gave @BlackATTR:

  • By default all parser invocations you call that succeed add to the result, in the order they are called. Failed parsers don't add to the result. This is good enough for a lot of things, like SOME for instance.

    • If SOME calls the one parser it takes as a parameter 5 times and it succeeds, and then a 6th time and it fails, SOME will still succeed (since it matched at least one time). But it's happy enough just giving back the aggregate of those 5 successful calls in order. So it does not mention pending: in its spec, and just gets the automatic behavior.
  • Such a default is not good enough for the BLOCK! combinator. Given uparse "ab" [keep "a" keep "a" | keep "a" keep "b"], it doesn't want ["a" "a" "b"]. Mere success of the parsers it calls is not enough: it has a higher-level idea of "alternates", and a whole alternate group must succeed to keep its result.

    • So it mentions pending: in its spec, which means the parsers it gets don't have their pending results specialized away. It builds the appropriate result.
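The two policies might be contrasted like this in Python (all names here are invented for illustration; a parser returns a (value, remainder, pending) triple or raises on failure):

```python
# Python analogy of the two pending policies (invented names).

def keep(text):
    """Toy parser: match a literal, pend it as a "quoted" item."""
    def child(inp):
        if inp.startswith(text):
            return text, inp[len(text):], [("quoted", text)]
        raise ValueError("no match")
    return child

def some(child):
    """SOME's default policy: concatenate the pendings of each
    successful call, in order; fail only if there were zero successes."""
    def parser(inp):
        value, pending, count = None, [], 0
        while True:
            try:
                value, inp, p = child(inp)
            except ValueError:
                break
            pending.extend(p)
            count += 1
        if count == 0:
            raise ValueError("no match")
        return value, inp, pending
    return parser

def alternates(*groups):
    """BLOCK!-style policy: a whole alternate group must succeed for its
    pendings to count; a failed group's pendings are thrown away."""
    def parser(inp):
        for group in groups:
            rest, pending, value = inp, [], None
            try:
                for child in group:
                    value, rest, p = child(rest)
                    pending.extend(p)
                return value, rest, pending   # whole group succeeded
            except ValueError:
                continue                      # discard group's pendings
        raise ValueError("no alternate matched")
    return parser

# [keep "a" keep "a" | keep "a" keep "b"] on "ab":
p = alternates([keep("a"), keep("a")], [keep("a"), keep("b")])
print(p("ab"))   # ('b', '', [('quoted', 'a'), ('quoted', 'b')])
```

Note how the first alternate's pended "a" never shows up in the result, because its group as a whole failed.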

Now To Build More Features On It!

The thing is that this block of "pending" results is kind of a monolith: it accrues everything, which could be a mixed bag of stuff.

What I was thinking is that any QUOTED! items are assumed to be for COLLECT. We can just say COLLECT is probably one of the more common things. Then, maybe any GROUP!s are deferred code. (e.g. code that will only run if the entire parse succeeds...or some marked phase, maybe)

Fanciful example, where the GROUP! combinator will delay any code prefaced with the <delay> tag. Let's assume that if the group was just (<delay>) it's an error, to guide you to say ('<delay>) if you actually want to evaluate to the TAG! <delay>...

>> uparse "aaabbb" [
    some "a" (print "hey A!") (<delay> print "delay A!")
    some "b" (print "hey B!") (<delay> print "delay B!")
]
hey A!
hey B!
delay A!
delay B!

>> uparse "aaaccc" [
    some "a" (print "hey A!") (<delay> print "delay A!")
    some "b" (print "hey B!") (<delay> print "delay B!")
]
hey A!
; null

I mused that a phase would just be a bracket that says "you don't have to wait to the absolute end, go ahead and run the delays you've accumulated now":

>> uparse "aaabbbccc" [
    phase [
        some "a" (print "hey A!") (<delay> print "delay A!")
        some "b" (print "hey B!") (<delay> print "delay B!")
    ]
    some "c" (print "hey C!")
]
hey A!
hey B!
delay A!
delay B!
hey C!

Anyway so this mishmash of a PENDING block could contain mixes. If it sees [(print "delay A") '(print "delay A")] it knows that the unquoted thing is something the delay mechanism pays attention to, while the QUOTED! thing is something for COLLECT.

Then maybe BLOCK! is used for EMIT, e.g. the WORD! and value for the object. This could be stuck in the stream with everything else, like emit x: integer! => [(print "delay A") '(print "delay A") [x: 1020]] ... basically one big glommed-together bunch of results that are being collected, filtered, and discarded.

The arrays that are produced give up their ownership, so when your combinator gets a pending array back you can party on it all you want. So COLLECT might look at a pending array and realize it's all QUOTED! items, and just go through and unquote them and return that as the collected array--without needing another allocation.
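A tiny Python analogy of that ownership-based reuse (the ('quoted', value) pairing is just my stand-in for QUOTED!):

```python
# Python analogy: if every pended item turns out to be "quoted" (i.e.
# meant for COLLECT), unquote in place and hand back the same array --
# no second allocation needed, since the pending array is owned now.

def collect_finalize(pending):
    if all(kind == "quoted" for kind, _ in pending):
        for i, (_, value) in enumerate(pending):
            pending[i] = value        # unquote in place
        return pending                # same array, now the collected result
    # mixed bag: pull out just the quoted items
    return [value for kind, value in pending if kind == "quoted"]

p = [("quoted", "a"), ("quoted", "b")]
out = collect_finalize(p)
print(out, out is p)   # ['a', 'b'] True
```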

(I think these will be common cases. Anyway the GLOM mechanic is what I introduced to try and make it cheap so that you can also just return BLANK! So you're not paying for parses that don't use any of this to make empty arrays at every level of every combinator... otherwise a rule like [100 "a"] would be generating 100 empty arrays for no reason.)
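The GLOM idea might be sketched like this (Python analogy; None plays the role of BLANK!, and the function name is just borrowed from the post):

```python
# Sketch of the GLOM mechanic (Python analogy): merge pending results
# without ever allocating for the empty case.  None plays the role of
# BLANK!, so a rule like [100 "a"] passes None around 100 times for free.

def glom(pending, more):
    if not more:
        return pending      # nothing new: no allocation, no mutation
    if not pending:
        return more         # take ownership of the incoming array
    pending.extend(more)    # only a real merge pays for growth
    return pending

print(glom(None, None))         # None
print(glom(None, ["'x"]))       # ["'x"]
print(glom(["'x"], ["'y"]))     # ["'x", "'y"]
```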

I Think UPARSE Is On Track to be the :goat: !

The amazing thing is just how well slinging usermode frames and functions and specializations around is working. "It's programming, Jim...but not as we know it..." -- it's the language malleability that Rebol has tried to promise but was just not feasible in usermode until Ren-C.

But of course, there's a lot of work to do here. It's slow and the errors are bad. But this experimental implementation is passing the barrage of tests--to which I've added everything from the rebol-issues database, and closed many issues and wishes that are all now solved and answered!


Lol, took me a while to get the reference.

I've got something rather awesome working!!!

That does appear to be rather awesome! I like the "do more work only if required" approach with the pending results.

The thing is that this block of "pending" results is kind of a monolith, it accrues everything which could be a mixed bag of stuff.

So I guess this is the part where an application gets specific about how the parse will be interpreted: how might the same underlying parse rules be mixed with different application needs to produce different result structures? I'm imagining a user sharing a set of parse rules with others, who then achieve different resulting structures or calculations for their own needs. Is that a layer of rules-with-emits that call into rules-to-parse-the-syntax, or something different?

I realise case studies would be of value. C-lexicals and the build process might be one. Json type response data might be another. Will keep it in mind.


So I've gone ahead with my proposed implementation (the version where GROUP! can be prefixed with (<delay> ...)) and implemented the PHASE construct.

You can see how simple the GROUP! combinator is. It looks for the <delay> tag, and if it finds it, it uses NEXT to skip past the tag, and adds the remaining group data to the pending list. PHASE just filters anything that's a group out of the pending list and runs it. There's a PHASE automatically added to the top-level of every UPARSE operation.

Because this is the understood protocol of meaning of GROUP!s in the pending list, any combinator can stick deferred code into that list--just by adding groups to it.
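A Python analogy of that protocol (the "<delay>" marker and the run() evaluator here are toy stand-ins for the real TAG! and group evaluation):

```python
# Sketch (Python analogy): deferred code travels through the pending
# list tagged as a "group"; PHASE pulls the groups out and runs them,
# leaving other pending items (e.g. quoted items for COLLECT) alone.

log = []

def run(code):
    """Toy evaluator: just record what "ran"."""
    log.append(code)

def group_combinator(code, pending):
    """If the group starts with <delay>, don't run it now: skip past the
    marker (like NEXT) and pend the rest of the group instead."""
    if code and code[0] == "<delay>":
        pending.append(("group", code[1:]))
        return None
    return run(code)

def phase(pending):
    """Run and remove every deferred group accrued so far; anything else
    in the pending list passes through untouched."""
    remaining = []
    for item in pending:
        if item[0] == "group":
            run(item[1])
        else:
            remaining.append(item)
    pending[:] = remaining

pending = []
group_combinator(["print A"], pending)              # runs immediately
group_combinator(["<delay>", "print B"], pending)   # deferred
phase(pending)                                      # now "print B" runs
print(log)        # [['print A'], ['print B']]
print(pending)    # []
```

Because any combinator can append ("group", ...) items, any combinator can inject deferred code, which mirrors the "understood protocol" point above.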

So...with that tool in hand, I went back to tackle the problem that we saw above with the hooked-word-combinator demo (I'm now calling it "TRACKPARSE" as opposed to PARSETREE...I'll save that name for something that more closely matches your results). As a refresher, the problem was:

>> trackparse "fffyyy" [foo-rule some "x" | foo-rule some "y"]
foo-rule [
] => "fff"
foo-rule [
] => "fff"

We didn't want the FOO-RULE to be contributing to the stack log when it was a member of the alternate that failed.

The way I approach it now is to push groups of deferred code that append strings to the stack log. It actually sticks one string append before all the pendings that are returned by the WORD!'s processed rule, and then one string append after all the pendings.

(By no means am I suggesting this is an ideal way to do this, but it is just a corrected version of the off-the-cuff code from before!)

It achieves the desired result!

It's probably easiest to just look at the implementation to see what's happening in the COLLECT, GATHER, and PHASE cases.

(Efficiency sidenote: I probably should have used something like @[...] blocks for GATHER, and used BLOCK!s for KEEP to complement the QUOTED!s. This would save splicing until the very end of the COLLECT, when you have a better idea of how big the total series will be.)

In any case, this strategy will obviously run out of datatypes at some point. So once the common "lightweight" values are spoken for, an ecology based around something like an OBJECT!, which uses a key like combinator: as a tag to know whether to pay attention to something, is probably the safest bet. Perhaps something lighter-weight, like an EVENT! that could put a label in the cell spot where a block index would usually be, would be of use here.

What I'm trying to do here--though--is to make this intrinsically hackable. If you want QUOTED! to mean something else in the pending list, it's not like you can't rewrite COLLECT and KEEP. Everything is supposed to be modular and comprehensible.

I realise case studies would be of value. C-lexicals and the build process might be one. Json type response data might be another. Will keep it in mind.

I definitely want to get scenarios worked through before going down the rabbit hole of optimizing all this with native code. The %examples/ directory in the parse tests can hopefully be home to some good challenges of the model. Throw hardballs at it! :baseball: :exploding_head:

And if there's any chance you can just take a little time to skim through the parse tests and see if everything "jibes" that would be great:

https://github.com/metaeducation/ren-c/tree/master/tests/parse

Anything you want to test should work (very slowly) in the web REPL. I'm trying to be good about taking basically every experiment I type down and making sure it gets incarnated as a test instead of just tried once and forgotten about...
