I hope you agree that allowing append string #{CC}
would be bad, and that today's behavior is good:
>> string: "e"
== "e"
>> append string #{CC}
** Script Error: String aliased as BINARY! can't become invalid UTF-8
If you want arbitrary bytes, you have to keep everything as BINARY!:
>> bytes: to binary! "e"
== #{65}
>> append bytes #{CC}
== #{65CC}
But TEXT! must be valid UTF-8 on every operation:
>> to text! bytes
** Script Error: invalid UTF-8 byte sequence found during decoding
I'm pleased with all of that.
Per my writing criticizing the robustness principle, the system would mandate that input already be normalized, and it would keep things normalized at all times. That gives the process a saner foundation.
I did suggest that perhaps there could be a middle tier... where TEXT! enforces NFC as an additional constraint on top of UTF-8!, and UTF-8! does its enforcement on top of BINARY!.
Forcing you to use BINARY! for all non-NFC text would throw away the codepoint coherence you already get, which seems like a waste.
But most of the system would use TEXT! as currency in canon form.
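To make that layering concrete, here's a rough sketch in Python (purely illustrative, using the standard unicodedata module; the class names are mine and say nothing about how Ren-C would actually implement it). Each tier adds exactly one invariant on top of the tier below:

    import unicodedata

    class Binary:
        """Arbitrary bytes -- no constraint at all."""
        def __init__(self, data: bytes):
            self.data = data

    class Utf8(Binary):
        """Bytes constrained to be valid UTF-8."""
        def __init__(self, data: bytes):
            data.decode("utf-8")  # raises UnicodeDecodeError on bad sequences
            super().__init__(data)

    class Text(Utf8):
        """Valid UTF-8 additionally constrained to canonical form (NFC)."""
        def __init__(self, data: bytes):
            decoded = data.decode("utf-8")
            if unicodedata.normalize("NFC", decoded) != decoded:
                raise ValueError("TEXT! must be normalized (NFC)")
            super().__init__(data)

    Binary(b"\x65\xCC")        # fine: any bytes are allowed
    Utf8(b"\x65\xCC\x81")      # fine: valid UTF-8 (e + combining acute)
    Text(b"\xC3\xA9")          # fine: precomposed e-acute is already NFC
    try:
        Text(b"\x65\xCC\x81")  # valid UTF-8, but decomposed... rejected
    except ValueError as error:
        print(error)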
Because I see this as a direct analogy to the constraint of maintaining valid UTF-8, I don't consider it oppressive. It keeps you sane.
The current state of things is what I find oppressive:
>> single: to text! #{C3A9}
== "é"
>> double: to text! #{65CC81}
== "é"
>> length of single
== 1
>> length of double
== 2
So paralleling the "UTF-8 Everywhere Manifesto", I'd say the "NFC Everywhere Manifesto" has the potential to make people's lives better and not worse.
Giving an often-better answer for LENGTH OF is yet another argument for picking NFC.
If you were assured that all the TEXT! in the system was NFC, you wouldn't have any trouble searching for substrings, because the substrings would also be NFC.
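Again just to illustrate the mechanics (Python's standard unicodedata module, nothing Ren-C-specific), here's how canonicalizing to NFC resolves both the LENGTH OF discrepancy and the substring-search problem:

    import unicodedata

    single = "\u00E9"    # precomposed e-acute (already NFC)
    double = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (decomposed)

    print(len(single), len(double))   # 1 2 -- the same length discrepancy
    print(single == double)           # False
    print(single in double)           # False -- substring search misses it

    # Normalize at the boundaries and the anomalies vanish
    canon = unicodedata.normalize("NFC", double)
    print(len(canon))                 # 1
    print(single == canon)            # True
    print(single in canon)            # True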