Parsing Giant Streams Without Consuming Tons of Memory: How?

I've mentioned that before I go through optimizing UPARSE I wanted to make it good at one thing that's been a bit of a pet wish of mine...

...and that's to be able to PARSE a giant file or network stream without needing to read it all into memory at once.

There are two levels of goal here:

  1. Big Crazy Goal: To have something like a long network stream of data (that you can't seek backwards in) and be able to have records in it parsed discretely one at a time. Even if the stream is 100GB in size, you'd only use a fixed amount of memory to do your processing.

  2. More Modest Goal: To let the PARSE be woven in with the read process, so it can start processing and reacting without waiting for all the data to be read...even if it ultimately isn't able to avoid reading the whole stream of data into a buffer.

Getting (2) to work is the easier of the two. But let me be clear: given that historical Rebol had no concept of "streams" at all, it is by no means easy!

(1) is the trickier and more interesting one to my taste, so I'll start by talking about that.

If Combinators Inform Us, Then (1) Seems Tractable :tractor:

Let's say we're trying to parse lines with just the letters A and B, and count them as we go:

p: open %giant-file.txt

count: 0

parse p [
    some [some ["A" | "B"] newline (count: count + 1)]
] then [
    print ["Your giant file had" count "lines of ABs"]
] else [
    print ["Giant file wasn't just lines of ABs"]
]

Our intuition tells us that we can do this one line at a time, throwing each line out as we go. But how might PARSE know that?

PARSE builds a SOME combinator and can see that it's the last thing in the rule block. Assuming the user doesn't capture any positions with <here> or do any SEEK, there is no way the SOME will ever backtrack to before the point where it started. Each iteration can throw away the state it would need in order to backtrack.

But right now SOME is a black box, doing whatever it wants until it finally returns a result. From PARSE's perspective, nothing on the outside differentiates it from some combinator called SOME-HALF that repeats a rule some number of times and then jumps back in time to the halfway point of the match:

>> parse "abababab" [some-half "ab", return <here>]
== "abab"

>> parse "abababababab" [some-half "ab", return <here>]
== "ababab"

Without some additional information, the system doesn't know that SOME won't make a decision like SOME-HALF would. It has to let it run until it is finished, keeping all the input around in case it jumps back.

How Can Combinators Tell PARSE About Backtrack Needs?

One way of looking at this is that the combinator itself becomes responsible for storing any memory that it requires for backtracking.

That is to say that it pulls information out of the stream...and if it wants to backtrack it pushes it back in.

>> parse "aaab" [some ["a"] "b"]    
  • SOME combinator grabs an "a" from stream, matches the "a"
  • SOME combinator grabs an "a" from stream, matches the "a"
  • SOME combinator grabs an "a" from stream, matches the "a"
  • SOME combinator grabs a "b" from stream, doesn't like it, pushes it back and ends
  • TEXT! combinator grabs a "b" from the stream, matches the "b"

If the SOME becomes responsible for pushing back any input it doesn't like, then the stream can just discard everything as it goes (in cases where it doesn't see any potential for some rule down the line to request backtrack). This means offering some kind of "push back into stream" operator that combinators can use if they need to back out.
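To make that contract concrete, here's a minimal sketch in C (rather than Ren-C) of the shape such a "push back into stream" operator could take. The Stream type and the stream_getc()/stream_ungetc() names are made up purely for illustration, not a proposed API:

    #include <stddef.h>  /* size_t */

    /* Hypothetical pushback-capable stream (names are illustrative only) */
    typedef struct {
        const char *data;  /* backing buffer, just for this sketch */
        size_t pos;
        size_t len;
    } Stream;

    int stream_getc(Stream *s) {
        return (s->pos < s->len) ? s->data[s->pos++] : -1;
    }

    void stream_ungetc(Stream *s, int c) {
        (void)c;           /* the data is still sitting in the buffer... */
        if (s->pos > 0)
            --s->pos;      /* ...so pushing back is just stepping backwards */
    }

    /* A SOME ["a"]-style matcher: consumes a run of 'a', and pushes back
       the one unit it peeked at but decided it didn't want */
    int some_a(Stream *s) {
        int count = 0;
        int c;
        while ((c = stream_getc(s)) == 'a')
            ++count;
        if (c != -1)
            stream_ungetc(s, c);  /* give the "b" back for the next rule */
        return count;             /* the SOME fails overall if this is zero */
    }

Under that promise, everything behind the current position is fair game for the stream to discard, so long as no rule further along can demand a backtrack.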

This concept of putting a character back is actually how many stream interfaces work.

In C++, unget() doesn't let you substitute anything: it just backs up over what you read. Because of that, even if the data is no longer in a buffer and you're reading from a file, the stream doesn't need to do anything but push its file offset backwards. Haskell's unRead and C++'s putback() let you push back something different from what you read...and consider that a feature. (We'll assume they do a similar optimization to unget() if you were reading from a file and what you pushed back was the same as the data in the buffer?)
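And C's own ungetc() really is that simple to use: the standard only guarantees one character of pushback, but that's exactly the "give back the unit you rejected" case. A tiny example:

    #include <stdio.h>

    /* The stdio precedent: fgetc() to look at a byte, ungetc() to put it back,
       so the next read sees that same byte again */
    int peek_byte(FILE *f) {
        int c = fgetc(f);
        if (c != EOF)
            ungetc(c, f);  /* at least one character of pushback is guaranteed */
        return c;
    }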

"Going Unit-By-Unit Sounds Laborious, and Slow...?"

It may seem laborious on the surface, but as far as I can tell this is the way streams work.

I was just working on an implementation of READ-LINE for standard input, and all the prescribed methods of reading one line at a time from a file in C go one character at a time. That sounds like a lot of I/O requests, but basically all I/O systems have buffering in them... if you ask to read a character from a file, it isn't going to your hard drive or making a network request for that one character. The runtime buffers it, and if you try to buffer it yourself you're likely just going to be adding complexity, code, and memory while making things worse.
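As an illustration of that pattern (this is not the actual READ-LINE code), a line reader in plain C really does go byte-by-byte, and it's stdio's internal buffer that keeps each fgetc() from turning into a separate system call:

    #include <stdio.h>
    #include <stdlib.h>

    /* Read one line from a stream, one character at a time; each fgetc()
       is served from stdio's buffer, not an individual I/O request */
    char *read_line(FILE *f) {
        size_t len = 0;
        size_t cap = 64;
        char *line = malloc(cap);
        if (!line) return NULL;
        int c;
        while ((c = fgetc(f)) != EOF && c != '\n') {
            if (len + 1 >= cap) {
                char *bigger = realloc(line, cap *= 2);
                if (!bigger) { free(line); return NULL; }
                line = bigger;
            }
            line[len++] = (char)c;
        }
        if (c == EOF && len == 0) { free(line); return NULL; }  /* end of input */
        line[len] = '\0';
        return line;
    }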

Unfortunately libuv streams don't have any putback() or ungetc() ability. There's no going back in time with them. :-/

And as it turns out Boost.ASIO doesn't have it either. (Which surprises me.)

This means that if we were building combinators on top of ungetc()-style logic...and wanted to go back in time in a file without holding it fully in memory...we'd have to use the raw file API ourselves, staying synced to the data that's already on disk so we could re-read it instead of keeping the full buffer contents around.
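Concretely, the "go back in time" part of that would lean on ftell() and fseek(), remembering offsets rather than holding onto data. A rough sketch of what such a layer might look like (hypothetical; it's nothing libuv provides):

    #include <stdio.h>

    /* Sketch: backtracking over a raw FILE* by remembering offsets,
       instead of keeping the already-read data in memory */
    int match_a_or_b(FILE *f) {
        long checkpoint = ftell(f);      /* the state from before the rule ran */
        int c = fgetc(f);
        if (c == 'A' || c == 'B')
            return 1;                    /* matched; no need to look back */
        fseek(f, checkpoint, SEEK_SET);  /* failed; seek back in time on disk */
        return 0;
    }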

Having to manage all of that ourselves is a bit depressing. But if there's any good news, it's that Rebol datatypes are optimized specifically for "unget". If the buffers are BLOCK!s or BINARY!s or TEXT!s, then when you NEXT them the data is still there, and you just BACK it to do an ungetc.

Plus, we'd have to have our own layer for managing this if we were going to seek back in time using HTTP Range Requests on networks.

I guess I'll just experiment and see what I can work out. :-/


An Additional Note On the Practicality Of Discarding Input

I painted a picture above of how SOME could become more participatory, helping PARSE avoid being responsible for keeping all of a stream's past input around for all time.

The rule I was examining looked like this:

parse p [
    some [some ["A" | "B"] newline (count: count + 1)]
]

What if your rule was instead:

parse p [
    some [
        some ["A" | "B"] newline
        (count: count + 1)
    ]
    |
    some-other-rule
]

The SOME may know that it's never going back past a line once it has matched it. And it may effectively communicate that through its input consumption (with the promise that it will be responsible for giving back any input it decides it does not like).

But since there's an alternate rule coming up, PARSE can't use that knowledge to let go of the data the SOME claims to be done with (or responsible for giving back). That's because some-other-rule could run if the SOME ultimately doesn't match...requiring a backtrack all the way to the state from before the SOME started.

This points out that | in a BLOCK! rule is basically a SEEK operation: it's a "seek the position before the last alternate grouping" rule. If the block rule is at a position and wants to know whether it can throw out data it has streamed before the current point, it only needs to look ahead for any | remaining in the block... and if there are any, it can't throw out the data.

But coming to the rescue is ||...which I have called the inline sequencing operator. If the BLOCK! combinator looks ahead and sees one of those, then it knows any | after it aren't relevant to whether it can throw out the past. The current rule will have to complete on the input it is given or things will fail.

All this may sound too tricky to be worth it, but this is a topic that interests me...and I have to stay interested, right? :slight_smile:


So I guess I'm ready to accept that the "minimum viable product" of UPARSE doesn't have this capability, and that it's something that other parse implementations can tackle in the future.

I've managed to find enough other things to keep me interested, and I think UPARSE has enough of a "wow" factor as it is to inspire someone (or some AI) to take a crack at a streaming version.

It's also likely that the current PARSE3 implementation will stick around and evolve into a less-featured-but-fast subset of the default combinators in UPARSE.

Having different parsing implementations for different needs isn't a bad thing. And I think it's time to get to work on nativizing UPARSE so it's not unusably slow, which means it's time to more or less freeze the spec.


If you want to see a practical implementation of streaming parser combinators, I suggest having a look at attoparsec.


Thanks! I've seen it; there's some complexity:

Parsing Huge Simulated Streams In Attoparsec | AE1020: Lazy Notebook

To get some forward motion and be able to start working on performance, I think I've made peace with the idea that it's better to build UPARSE coherently on top of the basic natives like FIND etc. and vet them on plain old series... making sure those primitives are rock solid.

As I've previously mentioned:
