Rebol2 had an asymmetrical sense of conversion between BINARY! and INTEGER!:
rebol2>> to binary! 32
== #{3332} ; #{33} is ASCII "3", #{32} is ASCII "2"
rebol2>> to integer! #{3332}
== 13106 ; 0x3332 is the internal big-endian form of 13106
R3-Alpha "corrected" this unusual behavior by changing TO BINARY!:
r3-alpha>> to binary! 32
== #{0000000000000020} ; 0x20 is 32 in hexadecimal
r3-alpha>> to integer! #{0000000000000020}
== 32
r3-alpha>> to binary! to string! 32
== #{3332} ; ...if you really wanted that
The conventional wisdom seemed to be that TO BINARY! TO STRING!
was not a common desire, but was easy enough to express if you wanted it. The harder conversions for users (which were "easy" for Rebol to do) involved these internal byte representations of native C integers.
While this might seem more sensible on the surface, it was awkward for users to get the actual INTEGER! <=> BINARY! conversions they wanted...where details of signedness, byte size, and byte ordering vary considerably. 99% of the time users didn't want 8 bytes, but fewer. Taking the example of just two bytes, they might want #{FF00} to mean:
- #{FF00} <=> big-endian unsigned 65280
- #{FF00} <=> big-endian signed -256
- #{FF00} <=> little-endian 255 (signed/unsigned)
Getting these reversible transformations from a fixed 8-byte signed big-endian conversion was hard and error-prone--try it if you like! (Then as a bonus, try writing code that works in Rebol2, Red, and R3-Alpha.) The confusing acrobatics are a pain for people working at the byte level...operations I saw reinvented over and over (see e.g. rebzip.r), in non-generic and inefficient ways.
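To give a taste of it, here's the sort of thing one had to write in R3-Alpha just to get a two-byte little-endian unsigned encoding (helper names are hypothetical, leaning on the fixed 8-byte big-endian TO BINARY! shown above):

int-to-le2: func [n [integer!]] [
    ; 32 becomes #{0000000000000020}; keep the last 2 bytes, then flip them
    head reverse copy/part (skip to binary! n 6) 2
]

le2-to-int: func [bin [binary!]] [
    ; flip back to big-endian (assumes TO INTEGER! reads shorter binaries as big-endian)
    to integer! head reverse copy bin
]

>> int-to-le2 32
== #{2000}

And that's just one of the many combinations of size, signedness, and byte order a user might need.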
Are Dialects the Answer for the Parameterized "TO BINARY!"?
This seems to call for some kind of "BINIFY" or "BINARIZE" function that is dialected or parameterized. Or maybe this is really something ENCODE and DECODE could do with a nice shorthand way of passing in parameterized codecs?
Here's a rough draft proposal for how such a thing might work, using ENCODE and DECODE with a BLOCK! dialect for BE (big-endian) and LE (little-endian):
>> encode [BE + 4] 32
== #{00000020} ; big-endian, 4 byte, unsigned
>> encode [LE + 2] 32
== #{2000} ; little-endian, 2 byte, unsigned
>> encode [LE +/- 3] -2
== #{FEFFFF} ; little-endian, 3 byte, signed
>> encode [LE + 3] 16777214
== #{FEFFFF} ; little-endian, 3 byte, this time unsigned
>> encode [LE +/- 3] 16777214
** Error: 16777214 aliases a negative value with signed
encoding of only 3 bytes
Decoding would use the same dialect, but go the other way:
>> decode [LE + 3] #{FEFFFF}
== 16777214 ; reverse of the corresponding above example
The dialect puts the endianness first, with the idea that this is how all ENCODE/DECODE calls taking a BLOCK! would pick their codec.
Then the sign is in the middle because unlike with encoding, decoding can guess the size accurately from the length of the input...and you can thus omit the third parameter if you wish.
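Just to make the mechanics concrete, here's a rough sketch of how such integer codecs might be written, in R3-Alpha-flavored code (function names and error text are hypothetical, range checking is omitted, and sizes are assumed under 8 bytes so the math stays in range):

encode-int: func [spec [block!] value [integer!] /local signed size n bin] [
    signed: spec/2 <> '+   ; second slot of the dialect is + or +/-
    size: spec/3
    if all [negative? value not signed] [
        do make error! "negative value needs +/- (signed) encoding"
    ]
    n: value
    if negative? n [n: n + to integer! power 2 (8 * size)]  ; two's complement
    bin: copy #{}
    loop size [
        append bin n // 256   ; peel off the low byte each time
        n: to integer! (n - (n // 256)) / 256
    ]
    either spec/1 = 'BE [head reverse bin] [bin]   ; bytes were built low-byte-first
]

decode-int: func [spec [block!] bin [binary!] /local n] [
    if spec/1 = 'LE [bin: head reverse copy bin]   ; normalize to big-endian
    n: 0
    foreach byte bin [n: (n * 256) + byte]
    if all [spec/2 <> '+  bin/1 > 127] [   ; signed spec, and high bit is set
        n: n - to integer! power 2 (8 * length? bin)
    ]
    n
]

Worked through by hand against the examples above: encode-int [LE +/- 3] -2 peels off the bytes FE, FF, FF to give #{FEFFFF}, and decode-int [LE + 3] #{FEFFFF} accumulates back to 16777214.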
What about TO BINARY! of INTEGER! (...was Rebol2 right??)
Rebol2's "weird" decision to encode the ASCII bytes of the Base-10 representation of the integer has two interesting properties:
- the number of bytes scales with the size of the number input - This is important since Ren-C's goal is to finesse the nature of immutable and mutable INTEGER! to be "BigNum-capable"...yet still exploit integers that fit directly in cells where possible.
- not all bytes are meaningful in the representation - When you're using only bytes for the digits 0 to 9 (and a possible negative sign), that means you can delimit the arbitrary-sized number with other bytes. Maybe those are spaces, commas, #{00} bytes, or other delimiters...as sketched below.
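As a hypothetical sketch of why that matters, consider scanning #{00}-terminated arbitrary-size integers out of a binary stream...possible precisely because only digit bytes and the minus sign appear inside each number (R3-Alpha-flavored code again):

stream: #{3332002D3235360031303000}  ; "32", "-256", "100", each #{00}-terminated

numbers: copy []
parse stream [
    some [copy num to #{00} skip (append numbers to integer! to string! num)]
]
; numbers is now [32 -256 100]

Try doing that with fixed 8-byte encodings and you need out-of-band length information for every number.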
Seen in this light...maybe the only "weird" thing about it was that it wasn't reversible, and TO INTEGER! of BINARY! was using a fixed-size interpretation of the bytes. R3-Alpha "corrected" it, but maybe in light of what I'm saying here, that correction went in the wrong direction.
Unfortunately, as future-proofing goes...this canonizes Base-10...which feels a bit "arbitrary". And considering the availability of more compact representations for arbitrary-precision numbers, it might seem to "waste space". But we're dealing with human parameters in other ways, like case-insensitivity and other things that a ten-fingered species takes for granted. As standards for mathematics go, Base-10 will probably have a longer life than 8-byte big-endian values.
The other "futureproof" option would have to be some kind of length-prefixed wire format, like the "Common Binary Object Format (CBOR)" BigNum Encoding. That feels less in line with Rebol principles.
So strangely, I'm feeling like to binary! some-integer is really equivalent to as binary! to text! some-integer (as Rebol2 did), and that to integer! some-binary is the same as to integer! as text! some-binary (which Rebol2 did not).
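Expressed as hypothetical helpers, that proposed equivalence would behave like this:

my-to-binary: func [i [integer!]] [return as binary! to text! i]
my-to-integer: func [b [binary!]] [return to integer! as text! b]

>> my-to-binary 32
== #{3332}

>> my-to-integer #{3332}
== 32

Unlike Rebol2, the two directions would round-trip.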