Incomplete TRANSCODEs: Actually an Optimization Problem

hostilefork · August 22, 2022, 3:09pm

Ren-C has a multi-return interface for TRANSCODE. Without /NEXT, you get the whole thing:

>> transcode "abc def"
== [abc def]

With the /NEXT refinement, it will go one item at a time. But the return convention is that you receive back the remainder as the primary return result, and the transcoded value is the second:

>> [pos value]: transcode/next "abc def"
== " def"

>> pos
== " def"

>> value
== abc

Of course, with multi-return you can ask for the overall return result to be the synthesized value:

>> [pos @value]: transcode/next "abc def"
== abc

You don't even have to name things if you don't want them!

>> [_ @]: transcode/next "abc def"
== abc

And you can just use a regular SET-WORD! to get just the primary result.

>> pos: transcode/next "abc def"
== " def"

You know that you're at the end of the input when it returns NULL. This means there was no value synthesized, and you're done.

>> [pos /value]: transcode/next ""
== ~null~  ; anti

>> pos
== ~null~  ; anti

>> value
== ~null~  ; anti

Writing foolproof loops to process items is a breeze:

while [[utf8 /item]: transcode/next utf8] [
    print mold item
]

The leading slash on /item is necessary when you want to accommodate the case where transcode didn't produce any item. Because then it doesn't return a 2-parameter pack, it just returns a pure null. This is required for clean interoperability with THEN and ELSE...because nulls in packs are considered to be "something" vs. "nothing". Multi-return unpacking requires you to demonstrate consciousness when you are trying to unpack more items than you're getting, hence the slash is needed when trying to unpack a singular null into two slots.

On the plus side, if you are expecting that there must be a transcoded item, then you get a free check by eliminating the slash...it will then cause an error if the item isn't produced!

This Runs Circles Around Red and R3-Alpha

For starters: neither supports strings as input--because the scanner is built for reading UTF-8 files...and both R3-Alpha and Red unpack strings into fixed-width encodings. So if you have string input, you have to pay for a copy encoded as UTF-8 via TO BINARY!. (Ren-C's UTF-8 Everywhere wins again!)

R3-Alpha unconditionally returns a block with the last element as a remainder, whether you ask for one item via /NEXT or not:

r3-alpha>> transcode to binary! "abc def"
== [abc def #{}]

r3-alpha>> transcode/next to binary! "abc def"
== [abc #{20646566}]

r3-alpha>> transcode/next to binary! ""
== [#{}]

So if you were transcoding an entire input, you have to TAKE/LAST an always-empty binary off of the result.

But if you are using /NEXT, you have to PICK out the element from the start of the array and the remainder from the end. And you need to notice the exception of no-value-produced, where the block is length 1 instead of 2.
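To make the bookkeeping concrete, here is a minimal Python sketch of what a caller has to do with the result shapes shown above. The function names are hypothetical, and Python lists and bytes merely stand in for the R3-Alpha block and binary values; this is an illustration of the convention, not R3-Alpha code.

```python
def split_next_result(result):
    """Return (found, value, remainder) from an R3-Alpha-style /NEXT block.

    `result` models the block shown above: [value remainder] when an item
    was scanned, or just [remainder] when nothing was produced.
    """
    remainder = result[-1]          # the remainder is always the last element
    if len(result) == 1:            # only a remainder: nothing was scanned
        return (False, None, remainder)
    return (True, result[0], remainder)


def strip_full_result(result):
    """Drop the always-empty trailing remainder from a whole-input scan."""
    return result[:-1]
```

The caller cannot simply destructure two values; it has to branch on the length first, which is the awkwardness being described.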

That's awkward, but as usual... Red somehow manages to make an incompatible interface that is as much worse as it is better:

The better part is that if you don't ask for /NEXT you just get the block back, like in Ren-C:

red>> transcode to binary! "abc def"
== [abc def]

But the /NEXT interface is outright broken:

red>> transcode/next to binary! "abc def"
== [abc #{20646566}]

red>> transcode/next to binary! ""
== [[] #{}]

It might look better because you don't have to guess about which position to find the remainder in--it's always in the second slot. But it has a fatal flaw: you can't distinguish the result of scanning "[]" from the result of scanning any string with nothing but comments and whitespace.

Consider this very basic loop to scan one item at a time and print it:

red>> utf8: to binary! "abc def"

red>> while [not tail? utf8] [
     set [item utf8] transcode/next utf8
     print mold item
]
abc
def

You get two items. But what if you had something that was--say--a comment:

red>> utf8: to binary! "; I'm just a comment"

red>> while [not tail? utf8] [
     set [item utf8] transcode/next utf8
     print ["Item is:" mold item]
]
Item is: []

You get one spurious item. (They chose BLOCK! for the item, but it wouldn't matter what it was--a NONE! would be just as bad, you're just losing the distinction between empty strings and "#[none]" then.)
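The ambiguity can be modeled outside of Red. The following Python sketch uses a deliberately tiny toy scanner (whitespace-separated words, `;` comments, and the literal `[]`), which is an assumption for illustration and not Red's real lexer. All function names are hypothetical. It contrasts an "always two slots" convention with a "nothing produced means a null return" convention.

```python
def skip_blank(text):
    """Drop leading whitespace and ';' comments (toy rules only)."""
    while True:
        text = text.lstrip()
        if not text.startswith(";"):
            return text
        newline = text.find("\n")
        text = "" if newline == -1 else text[newline + 1:]


def split_token(text):
    """Split off the first whitespace-delimited token of non-blank text."""
    parts = text.split(None, 1)
    return parts[0], (parts[1] if len(parts) > 1 else "")


def red_style_next(text):
    """Always answer [item, remainder], as in the Red transcripts above."""
    text = skip_blank(text)
    if not text:
        return [[], ""]              # "nothing scanned" is spelled as []
    if text.startswith("[]"):
        return [[], text[2:]]        # a real empty block looks identical
    token, rest = split_token(text)
    return [token, rest]


def null_style_next(text):
    """Answer None when no item was produced, else (remainder, item)."""
    text = skip_blank(text)
    if not text:
        return None
    if text.startswith("[]"):
        return (text[2:], [])
    token, rest = split_token(text)
    return (rest, token)


def red_style_items(text):
    """The 'while not tail?' loop from the Red example."""
    items = []
    while text:
        item, text = red_style_next(text)
        items.append(item)
    return items


def null_style_items(text):
    """The loop driven by the null return instead."""
    items = []
    while (step := null_style_next(text)) is not None:
        text, item = step
        items.append(item)
    return items
```

With the two-slot convention a comment-only input and `[]` are indistinguishable, so the loop reports one spurious item; with the null convention the comment-only input yields no items at all.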

If I were prescribing a solution for Red I'd suggest approximating Ren-C's solution as closely as possible.

When /NEXT is used, have it take a variable to write the transcoded value into, and then return the position. If the scan turns out to have no product, return NONE. For consistency with Ren-C you might set the transcoded value to NONE as well (it doesn't matter, because a NONE return signals that the value isn't meaningful...so use UNSET! if you want.)

while [utf8: transcode/next utf8 'item] [
    print mold item
]
assert [none? item]  ; or unset, or whatever

Not as nice as the multi-returns, and you can't duck out of passing the variable if you aren't interested. But...

Ren-C Also Thrashes R3-Alpha and Red In Error Handling

Ren-C TRANSCODE has these potential behaviors:

- RETURN a BLOCK! (if plain TRANSCODE)
- RETURN a PACK of the ~[remainder value]~ (if TRANSCODE/NEXT) -or- RETURN NULL if no value was transcoded from the input (empty string, comments, just spaces, etc.)
  - Having remainder as the primary return means you can check the default result in a loop for truthiness and loop easily using WHILE or whatever.
  - Returning pure NULL when no value is transcoded means you can react to there being nothing to transcode with THEN and ELSE, etc.
- It can do a "hard FAIL"
  - This would happen if you asked something fundamentally incoherent...like asking to TRANSCODE an input that was non-UTF-8...like a GOB!, or something like that
  - Such errors are only interceptible by a special SYS.UTIL.RESCUE method--they are not supposed to be easy to gloss over and are unlikely to have meaningful mitigation. So only special sandboxing situations (like writing consoles that print out the error) are supposed to trap them.
- It can RETURN an antiform ERROR! ("raised error") if something went wrong in the transcoding process itself
  - This would be something like a syntax error, like if you asked transcode "a bc 1&x def"
  - These will be promoted to a hard FAIL if the immediate caller doesn't do something to specially process them.
  - You can casually ignore or intercept these, because you can be confident that it was a formal return result of the thing you just called--not some deeper problem like a random typo or other issue.

I won't rehash the entire "why definitional errors are foundational" post, but TRANSCODE was one of the first functions that had to be retrofitted to use them.

>> transcode "a bc 1&x def" except e -> [print ["Error:" e.id]]
Error: scan-invalid

The definitionality is extremely important! I spent a long time today because in the bootstrap shim I had a variation of transcode...parallel to this in R3-Alpha:

r3-alpha>> transcode: func [input] [
               prnit "My Transcode Wrapper"  ; oops, typo
               return transcode input
           ]

r3-alpha>> if not attempt [transcode to binary! "abc def"] [print "Bad input"]
Bad input

But the input isn't bad!!! This leads to a nightmare of trying to figure out what was going wrong. Today's particular nightmare was when tinkering with the shim implementation of TRANSCODE. A bug in the shim was leading to silently skipping work that should have been done, because the caller wanted to be tolerant of bad transcode input.

There's simply no practical way of working on code of any complexity without something like definitional failures, and experience has proven this day after day.
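The failure mode is not specific to Rebol. As a rough analogy only, this Python sketch contrasts a blanket trap (comparable to ATTEMPT) with an error that is handed back as an ordinary return value. Python has no raised-error antiforms, so a returned `ScanFailure` object is an assumed stand-in for a definitional error, and every name here is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScanFailure:
    """Stand-in for a definitional error: a formal return value."""
    token: str


def scan(text):
    """Toy scanner: any token containing '&' counts as a syntax error."""
    tokens = text.split()
    for token in tokens:
        if "&" in token:
            return ScanFailure(token)
    return tokens


def buggy_wrapper(text):
    prnit("My Transcode Wrapper")    # deliberate typo: raises NameError
    return scan(text)


def blanket_caller(text):
    """Tolerates 'bad input' by trapping everything, typos included."""
    try:
        return buggy_wrapper(text)
    except Exception:
        return "Bad input"


def definitional_caller(text):
    """Reacts only to the failure that the call formally returned."""
    result = buggy_wrapper(text)
    if isinstance(result, ScanFailure):
        return "Bad input"
    return result
```

`blanket_caller("abc def")` answers "Bad input" even though the input is fine, while `definitional_caller("abc def")` lets the NameError from the typo surface where it can be noticed and fixed.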

Getting Incomplete Results Via R3-Alpha's /ERROR

R3-Alpha offered this feature:

/error -- Do not cause errors - return error object as value in place

The intended use is that you might want the partial input of what had been successfully scanned so far. If the code went and raised an error, you could trap that error. But you wouldn't have any of the scanned items.

It would put any ERROR! as the next-to-last item in the block, with the remainder after that:

>> transcode/error to binary! "a bc 1&x def"
== [a bc make error! [
    code: 200
    type: 'Syntax
    id: 'invalid
    arg1: "pair"
    arg2: "1&x"
    arg3: none
    near: "(line 1) a bc 1&x def"
    where: [transcode]
] #{20646566}]

>> to string! #{20646566}
== " def"  ; wait...why isn't 1&x part of the "remainder"

It's clumsy to write the calling code (or to read it)...testing to see if the next-to-last item is an ERROR! and reacting to that.

(Also: What if there was some way to represent ERROR! values literally in source? This would conflate with such a block that was valid...but just incidentally had an ERROR! and then a BINARY! in the last positions.)

But the thing that had me most confused about it was the remainder. Notice above you don't get 1&x as the start of the stuff it couldn't understand.

Was it trying to implement some kind of recoverable scan? What would that even mean?

Ultimately I think this was just a leaking of an implementation detail as opposed to any reasonable attempt at a recoverable scanner. It only didn't tell you where the exact tail of the successfully scanned material was because it did not know.

The scanning position is based on token consumptions, and so if you started something like a block scan and it saw a [ then it forgets where it was before that. Then if something inside the block goes bad, it will just give you a remainder position somewhere inside that--completely forgetting about how many nesting levels it was in.

So what you were getting was a crappier implementation of scanning one by one, and remembering where you were before the last bad scan:

pos: input
error: null
block: collect [
   while [true] [
       keep [pos @]: transcode/next pos else [
           break
       ] except e -> [
           error: e
           break
       ]
   ]
]

That gives you a proper version, setting error if something happened and giving you the block intact.
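For readers who don't follow the Ren-C idioms, the same pattern can be sketched in Python. The toy `scan_next` below (whitespace-separated tokens, with any token containing `&` treated as a syntax error) is an assumption made for illustration, as are all the names; the point is only the shape of the loop, which remembers the position before each step.

```python
class ScanError(Exception):
    """Toy stand-in for a syntax error reported by the scanner."""


def scan_next(text):
    """Scan one item: None at end of input, else (remainder, item)."""
    text = text.lstrip()
    if not text:
        return None
    parts = text.split(None, 1)
    token = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    if "&" in token:
        raise ScanError(token)
    return rest, token


def scan_keeping_partial(text):
    """Collect items one at a time; stop at end of input or first error.

    Returns (items, error, pos), where pos is the input remaining at the
    point where scanning stopped.
    """
    items, error, pos = [], None, text
    while True:
        try:
            step = scan_next(pos)
        except ScanError as e:
            error = e                # pos still points before the bad item
            break
        if step is None:             # clean end of input
            break
        pos, item = step
        items.append(item)
    return items, error, pos
```

Because the position only advances after a successful step, the remainder on failure begins exactly at the item that could not be scanned, which is the information the /ERROR remainder did not reliably give.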

So Finally... We See It's An Optimization Problem

Question is if there's some way of folding this into TRANSCODE, so it's doing the looping and collecting efficiently for you. What would the interface be like that gave you back the error, and how would you know to remember to check it?

The problem is that when you return a raised definitional error from a function, that's the only thing you return. How would you return partial results (and maybe a resumption position) as well?

A /TRAP refinement could cause another variation in how the return results are given:

>> [error block]: transcode/trap "a bc"
; null

>> block
== [a bc]

>> [error block]: transcode/trap "a bc 1&x def"
== make error! [...]

>> block
== [a bc]

Having the error be first seems good, lining up with TRAP. Then the block as the second result.

That's not bad, but it would require some implementation reworking that I don't have time for. The problem is that, as the scanner is written now, it clears out the entire stack when an error gets raised, and there'd have to be some flag to tell it to persist the data stack accruals despite unwinding the level stack. It's not rocket science; it's just not important right now.

Answer For Now: Kill Off /ERROR

- The answer /ERROR has been giving back in error cases for the remainder is sketchy and conflates potential literal error scanning with a scanning error.
- You can get the behavior 100% reliably just by intercepting errors going one transcode item at a time.
  - Bear in mind that one-at-a-time is only going one top-level item at a time. If you scan a block with 1000 items in it, that's one transcode step. So we're not really talking about that many steps most of the time with regards to the scale of a file.
- This is a good opportunity to write tests of item-by-item scanning with error handling
- Red added a bunch of refinements on transcode [/next /one /prescan /scan /part /into /trace], and they didn't pick up /error themselves

Speaking of adding lots of refinements: I also want to get away in general from investments in weird C scanner code and hooks (especially if it's just an optimization).

What we should be investing in is more fluid mixture of PARSE of strings/binary with the scanner. e.g. we should have ways of knowing what line number you're at during the parse for any combinator, and just generally pushing on that. Adding TRANSCODE parameters up the wazoo isn't a winning strategy.