Flexible Symbol Reordering for Optimization

hostilefork · March 18, 2025, 12:58pm

In the original formulation of building the symbol table, it kept a running count of the current symbol ID, and incremented it as it went.

The order wasn't always random. For instance, one of the optimizations based on ordering was that the symbols for R3-Alpha's PARSE keywords were in a particular order, which was specified in %words.r

; Parse: - These words must not reserved above!!
parse
|	 ; must be first
; prep words:
set  
copy
some
any
opt
not
and
then
remove
insert
change
if
fail
reject
while
return
limit
??
accept
break
; match words:
skip
to
thru
quote
do
into
only
end  ; must be last

By putting these in such an order, when PARSE3 was processing a rule it could do a quick check of whether or not something was a keyword:

#define GET_CMD(n) (((n) >= SYM_OR_BAR && (n) <= SYM_END) ? (n) : 0)

And then further it divided the range based on which phase the rule was in... either "pre-match" or "match":

if (cmd = VAL_CMD(item)) {
    if (!IS_WORD(item))
        Trap1(RE_PARSE_COMMAND, item); // SET or GET not allowed

        if (cmd <= SYM_BREAK) { // optimization
            switch (cmd) {...}
        ...

Though this is a bit brittle, and hard to be sure you're adjusting all the dependencies if you change it (more on that later...)

Wanting To Find LENGTH-OF Quickly From LENGTH

I had a stroke of inspiration when I realized that modern binding made it possible that the OF function could be declared as infix with only a left hand side, and then it could effectively retrigger length of to act the same as length-of, and do that for any xxx of to run a function it dynamically looks up as xxx-of:

Squaring the circle of LENGTH? and LENGTH-OF - #8 by hostilefork

This is a breakthrough powered by modern binding. However, one of the things it needs to do is to somehow navigate from the Symbol pointer for xxx to xxx-of... and that's not exactly free. You have to build a UTF-8 buffer which has the "-of" variant, and hash it to look it up in the symbol table.

But I had an idea

Hence I reworked the code that builds the symbol table. Instead of pinning down the symbol numbers permanently as it goes, it waits to assign the numbers until all the symbols have been added. This permits reorganization.

So if it sees the symbol for LENGTH-OF, it goes through and looks to see if a symbol for LENGTH was already added. If it was, it goes back and inserts LENGTH-OF right after LENGTH. If it hasn't been added yet, then it goes ahead and adds both LENGTH and LENGTH-OF.

This way you get sequential symbol IDs, and it's trivial and fast to get to LENGTH-OF if you have LENGTH in your hand. This works for any built-in symbol, so we're getting good performance for super common operations like type of

But It Needs Exceptions...

There are some places you can't disrupt the ordering, for instance the datatype symbols are expected to be in a certain fixed order. So the reordering logic checks to make sure you're not disrupting those cases. So far there are only two exceptions that need to be made: sigil and file, which have operations sigil of and file of

>> sigil of first [$hello world]
== $

>> sigil of second [$hello world]
== ~null~  ; anti

>> data: load %something.r
== [a [b c] d]

>> second data
== [b c]

>> file of second data
== %something.r

The design point of keeping type symbols in order is too important, so SYM_SIGIL can't be followed by SYM_SIGIL_OF and SYM_FILE can't be followed by SYM_FILE_OF. But these were the only two exceptions, and handling them is trivial!

Option(SymId) id = Symbol_Id(sym);
if (id) {  // built-in symbols w/-OF variations, optimized positions
    if (id == SYM_SIGIL)
        sym_of = CANON(SIGIL_OF);  // needs exception
    else if (id == SYM_FILE)
        sym_of = CANON(FILE_OF);  // needs exception
    else
        sym_of = Canon_Symbol(cast(SymId, (unwrap id) + 1));
}
else {
    ... // slower method, produce UTF-8 for "-of" variant, hash it
}

The build process detects any cases where a disruption would be a problem and warns you so you know that you have to add an exception. For instance: let's say you want to keep the PARSE3 keyword sequential trick intact. Since there's a word there for LIMIT, it would be a problem if we added a LIMIT-OF native, and an exception would have to be added here.

It's not a huge issue and I don't expect a lot of exceptions needing to be made here. In any case, what makes this so fast is that turning a SYM_XXX into a Symbol* is trivially fast, because there's just a global array of Symbol (not even an array of Symbol*, it's the Stub structure for the symbol itself) indexed by the SYM constant.

Eliminating The Hardcoding of Symbol IDs for Ranges

This was a rework that makes it easier for other ideas down the line of how to creatively use the symbol IDs. So I made it easier to mark positions in the IDs in a way that's less likely to break.

I use TAG! to put in the markers. For instance, the PARSE3 keywords are now done as:

<MIN_SYM_PARSE3>  ; no </> means next symbol (SYM_SOME is MIN_SYM_PARSE3)
    some
    opt
    optional
    repeat
    ; ... etc ... 
<MIN_SYM_PARSE3_MATCH>
    skip
    one
    to
    thru
    ; ... etc ...
    end 
</MAX_SYM_PARSE3>  ;  </> means prior symbol (SYM_END is MAX_SYM_PARSE3)

Once all the symbols have been worked out, the tags are removed and automatically turned into #defines by %make-boot.r:

/*
 * These definitions are derived from markers added during the symbol
 * table creation process via ADD-SYM:PLACEHOLDER (and are much better
 * than hardcoding symbol IDs in source the way R3-Alpha did it!)
 */
#define MIN_SYM_TYPESETS  253  /* blank? */
#define MAX_SYM_TYPESETS  407  /* any-element? */
#define MIN_SYM_PARSE3  427  /* some */
#define MIN_SYM_PARSE3_MATCH  454  /* skip */
#define MAX_SYM_PARSE3  461  /* end */
#define MIN_SYM_NATIVE  480  /* native */
#define MAX_SYM_LIB_PREMADE  876  /* inflate */
#define MIN_SYM_ERRORS  988  /* Internal */
#define MAX_SYM_BUILTIN  1207  /* export* */

It's much less brittle, communicates what's going on, and lets you search the code for where the given trick is being used.

So from now on, coming up with similar tricks will be much easier and robust!

Using More Rebol To Make Rebol Better

As Ren-C's power grows, it makes more and more sense to implement more of the project using Rebol technique.

I'm really pleased by things like the TAG! trick for marking ranges (as I did in %types.r), and just how expressive you can get with things.

Although the bootstrap code is a total mess, it's amazing to throw Rebol concepts at simplifying it...and as string interpolation and other magic becomes commonplace, it's really shaping up.