What To Do About Horrible, Grievous, Unicode

hostilefork · March 18, 2025, 11:27pm

At least on the encoding side of UTF-8, I now have a pretty palatable answer for stopping the propagation of weird bytes, if you want that...

I was fretting over how having TO BINARY! do UTF-8 encoding didn't really fit nicely with my idea of what TO should be, so I preferred to go ahead with a more explicit parameterization via ENCODE:

>> encode 'utf8 -{C😺T}-
== #{43F09F98BA54}
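As a cross-check of those bytes outside Ren-C, here is a small Python sketch. Python is used purely for illustration and is not part of the proposal; it just confirms that the cat emoji (U+1F63A) takes four bytes in UTF-8 while C and T take one each.

```python
# Cross-check the UTF-8 bytes for "C😺T" with Python's built-in codec.
text = "C😺T"
encoded = text.encode("utf-8")
hex_form = encoded.hex().upper()
print(hex_form)  # 43F09F98BA54

# Show how many bytes each codepoint contributes.
for ch in text:
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex().upper()}")
```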

It's not that much longer than to binary!, just two characters:

to binary!
encode 'utf8

But you get a lot more information. If you have personal settings, though, it gets a little bit messier. Let's say something like:

>> encode [utf8 emoji: no] -{C😺T}-
** Error: Emoji disabled in UTF-8 Codec configuration

That's a lot to type so you're liable to need to invent a name for it, which puts some burden on you. But utf8-of isn't a terrible name for you to call your specialization:

/utf8-of: specialize encode/ [codec: [utf8 emoji: no]]

This would override the default configuration. We'll assume the UTF8-OF in LIB is defined with some sane default.
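For readers who don't use Ren-C, the specialization idea can be sketched by analogy in Python with functools.partial. Everything here is a hypothetical stand-in: the encode function, its emoji option, and the codepoint ranges used to guess what counts as an emoji are assumptions for illustration, not how ENCODE or SPECIALIZE actually work.

```python
from functools import partial


def looks_like_emoji(ch):
    # Crude approximation (an assumption): a couple of common emoji blocks.
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def encode(text, codec="utf8", emoji=True):
    """Hypothetical stand-in for a parameterized ENCODE."""
    if codec != "utf8":
        raise ValueError(f"unsupported codec: {codec}")
    if not emoji and any(looks_like_emoji(ch) for ch in text):
        raise ValueError("Emoji disabled in UTF-8 codec configuration")
    return text.encode("utf-8")


# The "specialization": fix the codec settings once, under a short name.
utf8_of = partial(encode, codec="utf8", emoji=False)

print(utf8_of("CAT"))  # b'CAT'
try:
    utf8_of("C😺T")
except ValueError as err:
    print("Error:", err)
```

The general-purpose routine stays fully configurable, and the short name carries your settings so call sites don't have to repeat them.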

It's still a little bit ugly due to the hyphen, BUT, pursuant to the newfound binding-based configurability of infix OF, you can call the relevant local definition for your environment with a space instead of the hyphen! And you still don't lose UTF8 as a variable name!

>> utf8: utf8 of -{C😺T}-
== #{43F09F98BA54}

>> utf8
== #{43F09F98BA54}

And maybe the default UTF8-OF offers some refinements to swap up the settings, that OF can migrate onto the call:

>> utf8: utf8:basic of -{C😺T}-
** Error: Codepoints of ENCODE UTF-8 with Basic setting Must Be <= 65535

So things like UTF8:PRINTABLE could be options of the stock UTF8-OF, and you could just say in your file:

/utf8-of: utf8-of:printable/

Then all your `utf8 of` calls would be protected within the scope of that file. (Or, if you imported your standard configuration library, in all the files that use your personal configuration. I expect everyone to have such files... the goal here is to bend the language to you, not to bend yourself to the language.)
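A rough Python analogy of a stock routine with optional restrictions, plus a file-local override, might look like the sketch below. The basic and printable flags are hypothetical stand-ins for the :BASIC and :PRINTABLE refinements, and their exact meaning here is an assumption: "basic" rejects codepoints above U+FFFF, and "printable" leans on Python's str.isprintable(), which treats control characters such as newline as non-printable.

```python
from functools import partial


def utf8_of(text, basic=False, printable=False):
    """Hypothetical stock UTF8-OF with two optional restrictions."""
    for ch in text:
        cp = ord(ch)
        if basic and cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} is outside the Basic Multilingual Plane")
        if printable and not ch.isprintable():
            raise ValueError(f"U+{cp:04X} is not printable")
    return text.encode("utf-8")


# Keep a handle on the stock version, then rebind the short name for this
# module only, so every later call in the file picks up the setting.
stock_utf8_of = utf8_of
utf8_of = partial(stock_utf8_of, printable=True)

print(utf8_of("CAT"))  # b'CAT'
try:
    utf8_of("C\x07T")  # contains the BEL control character
except ValueError as err:
    print("Error:", err)
```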

This strategy, of having strong routines like ENCODE or CHECKSUM, which can be parameterized and specialized, ultimately being able to "hook" things like OF, is looking like it's going to be able to hit the target of the kind of flow we are seeking.

So if you want the exact bytes, you can always use AS BINARY!, which just aliases the memory as a binary blob directly. Anything else makes a copy, and while memcpy() would be fastest, a little bit of filtering isn't going to be a problem.
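The alias-versus-copy distinction can be illustrated in Python, with the caveat that Python strings cannot be aliased as bytes, so this sketch uses a bytearray as a stand-in for the underlying storage. It is an analogy only, not a description of how AS BINARY! is implemented.

```python
# A mutable byte buffer standing in for the string's underlying storage.
buf = bytearray("C😺T", "utf-8")

alias = memoryview(buf)  # shares the same memory, no bytes are copied
copy = bytes(buf)        # independent copy of the current contents

buf[0] = ord("B")        # mutate the original storage

print(bytes(alias))  # the alias sees the change
print(copy)          # the copy still starts with "C"
```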

What About the Decoding Side?

That... I dunno yet.

So there's decode 'utf8, with the corresponding options like decode [utf8 emoji: no] or whatever. More thinking is needed on how READ and LOAD and other such operations get configured.
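To make the decoding direction concrete, here is one possible shape for it as a Python analogy. The decode function and its emoji option are hypothetical, and the emoji range is a crude assumption. The only real behavior relied on is that Python's bytes.decode is strict by default and raises UnicodeDecodeError on malformed input.

```python
def decode(data, emoji=True):
    """Hypothetical stand-in for DECODE with a UTF-8 codec option."""
    # Strict by default: malformed or truncated sequences raise
    # UnicodeDecodeError rather than passing weird bytes along.
    text = data.decode("utf-8")
    # Crude emoji test (an assumption): one large block of pictographs.
    if not emoji and any(0x1F300 <= ord(ch) <= 0x1FAFF for ch in text):
        raise ValueError("Emoji disabled in UTF-8 codec configuration")
    return text


print(decode(bytes.fromhex("43F09F98BA54")))  # C😺T

try:
    decode(bytes.fromhex("43F09F98"))  # emoji sequence cut short
except UnicodeDecodeError as err:
    print("Malformed input:", err.reason)
```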

But at least I think the encoding side looks good.