Realistically Migrating Rebol to "UTF8 Everywhere"

hostilefork · March 21, 2019, 5:40pm

Coming back to this to try to get it out the door, I think "half done" was about right. It took another week or two of work, and it's going to take a bit more before it's "done": it's a bit on the slow side at the moment.

Nevertheless, I've gone ahead and committed it. UTF-8 Everywhere Lives!

There's a whole new set of interesting angles to how BINARY! and TEXT! can intermix in PARSE:

From tests/parse-tests.r in github.com/metaeducation/ren-c (commit 64c12765c):

          ; Multi-byte characters and strings present a lot of challenges.  There should
          ; be many more tests and philosophies written up of what the semantics are,
          ; especially when it comes to BINARY! and ANY-STRING! mixtures.  These tests
          ; are better than nothing...
          (
              catchar: #"🐱"
              did parse #{F09F90B1} [catchar end]
          )(
              cattext: "🐱"
              did parse #{F09F90B1} [cattext end]
          )(
              catbin: #{F09F90B1}
              did parse "🐱" [catbin end]
          )(
              catchar: #"🐱"
              did parse "🐱" [catchar end]
          )
          
(Excerpt truncated; the full file is in the repository.)
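To make the equivalence in those tests concrete, here is a minimal Python sketch (not Ren-C code) of why a text rule can match binary input and vice versa: the cat emoji is a single codepoint whose UTF-8 encoding is exactly the four bytes F0 9F 90 B1. The helper `match_prefix` is a hypothetical name for illustration only; it is not how PARSE is implemented.

```python
# Illustration only: the same four bytes viewed as text and as binary.
cat_text = "\U0001F431"                 # the cat emoji, a single codepoint
cat_bytes = bytes.fromhex("F09F90B1")   # the binary literal used in the tests

# One codepoint in the text view, four bytes in the binary view.
assert len(cat_text) == 1
assert cat_text.encode("utf-8") == cat_bytes
assert cat_bytes.decode("utf-8") == cat_text


def match_prefix(haystack: bytes, rule) -> int:
    """Hypothetical helper: match a text or binary rule against UTF-8 bytes.

    Returns the number of bytes consumed, or -1 if the rule does not match
    at the start of `haystack`.
    """
    needle = rule.encode("utf-8") if isinstance(rule, str) else rule
    return len(needle) if haystack.startswith(needle) else -1


# A text rule against binary input, and a binary rule against encoded text,
# both consume the same four bytes.
consumed_by_text_rule = match_prefix(cat_bytes, cat_text)
consumed_by_binary_rule = match_prefix(cat_text.encode("utf-8"), cat_bytes)
```

Once everything is stored as UTF-8, matching text against binary comes down to comparing bytes. The open questions are about semantics, such as what position a rule leaves you at.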

I also imported a file from the W3C to the tests, and got things started on how more purposeful tests might be written:

From tests/string/utf8.test.reb in github.com/metaeducation/ren-c (master):

          ;     import codecs
          ;     with codecs.open('utf8-plain-text.txt', encoding='utf-8') as myfile:
          ;         data = myfile.read()
          ;         print(len(data))
          (
              t: to text! read %../fixtures/utf8-plain-text.txt
              tlen: length of t
              assert [tlen = 7086]
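The length being checked there is a count of codepoints, not bytes, which is also what Python's `len()` reports for a decoded string. As a rough sketch of how such a count can be taken directly from UTF-8 data (assuming the bytes are already valid UTF-8; the function name and sample string are made up for illustration), one can count every byte that is not a continuation byte:

```python
def count_codepoints(data: bytes) -> int:
    """Count codepoints in valid UTF-8 by skipping continuation bytes.

    Continuation bytes have the bit pattern 10xxxxxx; every other byte
    starts a new codepoint.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# Sample text mixing 1-byte, 2-byte, and 4-byte codepoints.
sample_text = "na\u00efve \U0001F431"
sample_bytes = sample_text.encode("utf-8")

codepoints = count_codepoints(sample_bytes)
byte_length = len(sample_bytes)
```

For this sample the codepoint count is 7 while the byte length is 11, so the two notions of "length" diverge as soon as non-ASCII characters appear.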

This approach isn't impossible, but it hinges on having a value cell in your hand at the moment of doing the lookup. A lot of places have series nodes that aren't paired with any value, so there'd be no caching in those cases.

For the moment, the main caching is just done on the series itself. Small series don't bother with a cache; larger ones could have several. It leads to orders-of-magnitude speedups: a collection of large parses (like source analysis) drops from the scale of "a day" to the scale of "minutes".
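To illustrate the general idea of caching on the series itself, here is a Python sketch under stated assumptions. It is not Ren-C's actual implementation: the class name `Utf8Series`, the `STRIDE` spacing, and the `bookmarks` list are all hypothetical. The point is that finding the byte offset of codepoint N in UTF-8 normally means scanning from the start, and a few remembered (index, offset) pairs let the scan begin much closer to the target.

```python
class Utf8Series:
    """Hypothetical UTF-8 backed string with index-to-offset bookmarks."""

    STRIDE = 64  # assumed spacing, in codepoints, between bookmarks

    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        self.length = len(text)  # length in codepoints
        # bookmarks[k] is the byte offset of codepoint k * STRIDE.
        # Small series skip the cache entirely.
        self.bookmarks = []
        if self.length > self.STRIDE:
            index = 0
            for offset, byte in enumerate(self.data):
                if (byte & 0xC0) != 0x80:  # this byte starts a codepoint
                    if index % self.STRIDE == 0:
                        self.bookmarks.append(offset)
                    index += 1

    def offset_of(self, index: int) -> int:
        """Return the byte offset of the codepoint at `index` (0-based)."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        offset, current = 0, 0
        if self.bookmarks:
            # Start from the nearest bookmark at or before the target.
            k = index // self.STRIDE
            offset, current = self.bookmarks[k], k * self.STRIDE
        while current < index:
            offset += 1
            while (self.data[offset] & 0xC0) == 0x80:
                offset += 1  # skip continuation bytes
            current += 1
        return offset

    def char_at(self, index: int) -> str:
        """Decode and return the single codepoint at `index`."""
        start = self.offset_of(index)
        end = start + 1
        while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
            end += 1
        return self.data[start:end].decode("utf-8")


big_text = "ab\U0001F431" * 100          # 300 codepoints, 600 bytes
big = Utf8Series(big_text)
small = Utf8Series("h\u00e9llo")         # too short to get bookmarks
```

With bookmarks, a lookup walks at most `STRIDE - 1` codepoints instead of scanning from the head of the series, which is the kind of saving that matters when a large parse does indexed access over and over.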

This is where the big speedup is going to come from, but I figured it would be better to phase it in. Not only will the extra processing in parse give this a good test for a while, but people can also adapt to the first set of necessary changes before being hit with needing to rewrite any parse rules that expected to modify the iterated series without using parse keywords.