Rebol2 had an asymmetrical sense of conversion between BINARY! and INTEGER!:
rebol2>> to binary! 32
== #{3332} ; #{33} is ASCII "3", #{32} is ASCII "2"
rebol2>> to integer! #{3332}
== 13106 ; 0x3332 is the internal big-endian form of 13106
R3-Alpha "corrected" this unusual behavior by changing TO BINARY!:
r3-alpha>> to binary! 32
== #{0000000000000020} ; 0x20 is 32 in hexadecimal
r3-alpha>> to integer! #{0000000000000020}
== 32
r3-alpha>> to binary! to string! 32
== #{3332} ; ...if you really wanted that
The conventional wisdom seemed to be that TO BINARY! TO STRING!
was not a common desire, but was easy enough to express if you wanted it. The harder conversions for users (which were "easy" for Rebol to do) involved these internal byte representations of native C integers.
While this might seem more sensible on the surface, it was awkward for users to get the actual INTEGER! <=> BINARY! conversions they wanted...where details of signedness, byte size, and byte ordering vary considerably. 99% of the time users didn't want 8 bytes, but fewer. Taking the example of just two bytes, they might want #{FF00} to mean:
- #{FF00} <=> big-endian unsigned 65280
- #{FF00} <=> big-endian signed -256
- #{FF00} <=> little-endian 255 (signed/unsigned)
Getting these reversible transformations from a fixed 8-byte signed big-endian conversion was hard and error-prone--try it if you like! (Then as a bonus, try writing code that works in Rebol2, Red, and R3-Alpha.) The confusing acrobatics are a pain for people working at the byte level...operations I saw reinvented over and over (see e.g. rebzip.r), in non-generic and inefficient ways.
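To give a taste of it, here's the sort of thing one had to write in R3-Alpha just to get a two-byte little-endian unsigned encoding (helper names are hypothetical, leaning on the fixed 8-byte big-endian TO BINARY! shown above):

int-to-le2: func [n [integer!]] [
    ; 32 becomes #{0000000000000020}; keep the last 2 bytes, then flip them
    head reverse copy/part (skip to binary! n 6) 2
]

le2-to-int: func [bin [binary!]] [
    ; flip back to big-endian (assumes TO INTEGER! reads shorter binaries as big-endian)
    to integer! head reverse copy bin
]

>> int-to-le2 32
== #{2000}

And that's just one of the many combinations of size, signedness, and byte order a user might need.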
Are Dialects the Answer for the Parameterized "TO BINARY!"?
This seems to call for some kind of "BINIFY" or "BINARIZE" function that is dialected or parameterized. Or maybe this is really something ENCODE and DECODE could do with a nice shorthand way of passing in parameterized codecs?
Here's a rough draft proposal for how such a thing might work, using ENCODE and DECODE with a BLOCK! dialect for BE (big-endian) and LE (little-endian):
>> encode [BE + 4] 32
== #{00000020} ; big-endian, 4 byte, unsigned
>> encode [LE + 2] 32
== #{2000} ; little-endian, 2 byte, unsigned
>> encode [LE +/- 3] -2
== #{FEFFFF} ; little-endian, 3 byte, signed
>> encode [LE + 3] 16777214
== #{FEFFFF} ; little-endian, 3 byte, this time unsigned
>> encode [LE +/- 3] 16777214
** Error: 16777214 aliases a negative value with signed
encoding of only 3 bytes
Decoding would use the same dialect, but go the other way:
>> decode [LE + 3] #{FEFFFF}
== 16777214 ; reverse of the corresponding above example
The dialect puts the endianness first, with the idea that this is how all ENCODE/DECODE calls taking a BLOCK! would pick their codec.
Then the sign is in the middle because unlike with encoding, decoding can guess the size accurately from the length of the input...and you can thus omit the third parameter if you wish.
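Just to make the mechanics concrete, here's a rough sketch of how such integer codecs might be written, in R3-Alpha-flavored code (function names and error text are hypothetical, range checking is omitted, and sizes are assumed under 8 bytes so the math stays in range):

encode-int: func [spec [block!] value [integer!] /local signed size n bin] [
    signed: spec/2 <> '+   ; second slot of the dialect is + or +/-
    size: spec/3
    if all [negative? value not signed] [
        do make error! "negative value needs +/- (signed) encoding"
    ]
    n: value
    if negative? n [n: n + to integer! power 2 (8 * size)]  ; two's complement
    bin: copy #{}
    loop size [
        append bin n // 256   ; peel off the low byte each time
        n: to integer! (n - (n // 256)) / 256
    ]
    either spec/1 = 'BE [head reverse bin] [bin]   ; bytes were built low-byte-first
]

decode-int: func [spec [block!] bin [binary!] /local n] [
    if spec/1 = 'LE [bin: head reverse copy bin]   ; normalize to big-endian
    n: 0
    foreach byte bin [n: (n * 256) + byte]
    if all [spec/2 <> '+  bin/1 > 127] [   ; signed spec, and high bit is set
        n: n - to integer! power 2 (8 * length? bin)
    ]
    n
]

Worked through by hand against the examples above: encode-int [LE +/- 3] -2 peels off the bytes FE, FF, FF to give #{FEFFFF}, and decode-int [LE + 3] #{FEFFFF} accumulates back to 16777214.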
What about TO BINARY! of INTEGER! (...was Rebol2 right??)
Rebol2's "weird" decision to encode the ASCII bytes of the Base-10 representation of the integer has two interesting properties:
- the number of bytes scales with the size of the number input - This is important since Ren-C's goal is to finesse the nature of immutable and mutable INTEGER! to be "BigNum-capable"...yet still exploit integers that fit directly in cells where possible.
- not all bytes are meaningful in the representation - When you're using only bytes for the digits 0 to 9 (and a possible negative sign), that means you can delimit the arbitrary-sized number with other bytes. Maybe those are spaces, commas, #{00} bytes, or other delimiters...as sketched below.
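As a hypothetical sketch of why that matters, consider scanning #{00}-terminated arbitrary-size integers out of a binary stream...possible precisely because only digit bytes and the minus sign appear inside each number (R3-Alpha-flavored code again):

stream: #{3332002D3235360031303000}  ; "32", "-256", "100", each #{00}-terminated

numbers: copy []
parse stream [
    some [copy num to #{00} skip (append numbers to integer! to string! num)]
]
; numbers is now [32 -256 100]

Try doing that with fixed 8-byte encodings and you need out-of-band length information for every number.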
Seen in this light...maybe the only "weird" thing about it was that it wasn't reversible, and TO INTEGER! of BINARY! was using a fixed-size interpretation of the bytes. R3-Alpha "corrected" it, but maybe in light of what I'm saying here, that correction went in the wrong direction.
Unfortunately, as future-proofing goes...this canonizes Base-10...which feels a bit "arbitrary". And considering the availability of more compact representations for arbitrary-precision numbers, it might seem to "waste space". But we're dealing with human parameters in other ways, like case-insensitivity and other things that a ten-fingered species takes for granted. As standards for mathematics go, Base-10 will probably have a longer life than 8-byte big-endian values.
The other "futureproof" option would have to be some kind of length-prefixed wire format, like the "Common Binary Object Format (CBOR)" BigNum Encoding. That feels less in line with Rebol principles.
So strangely, I'm feeling like to binary! some-integer is really equivalent to as binary! to text! some-integer (as Rebol2 did), and that to integer! some-binary is the same as to integer! as text! some-binary (which Rebol2 did not).
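Expressed as hypothetical helpers, that proposed equivalence would behave like this:

my-to-binary: func [i [integer!]] [return as binary! to text! i]
my-to-integer: func [b [binary!]] [return to integer! as text! b]

>> my-to-binary 32
== #{3332}

>> my-to-integer #{3332}
== 32

Unlike Rebol2, the two directions would round-trip.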