Agreed Upon Symbol Numbers for Extensions

A concept in the R3-Alpha codebase is that there are a certain number of built-in words...which come from a file called %words.r

https://github.com/rebol/rebol/blob/master/src/boot/words.r

This is done so you can switch on a numeric code for these words, and not bother with needing to do a string comparison in C. Some words (like PARSE keywords) are strategically chosen to be in a sequential range, to make testing for them faster.
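To illustrate why a sequential range helps, here is a minimal sketch. The enum values, their order, and the names `MIN_PARSE_KEYWORD`, `MAX_PARSE_KEYWORD` and `Is_Parse_Keyword` are made up for the example; they are not the actual contents of %words.r.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical symbol IDs; the real list and ordering live in %words.r.
enum {
    SYM_0 = 0,
    SYM_APPEND,
    SYM_SET,    // first PARSE keyword in this made-up ordering
    SYM_COPY,
    SYM_SOME,
    SYM_ANY,
    SYM_OPT,    // last PARSE keyword in this made-up ordering
    SYM_INSERT
};

#define MIN_PARSE_KEYWORD SYM_SET
#define MAX_PARSE_KEYWORD SYM_OPT

static_assert(MIN_PARSE_KEYWORD <= MAX_PARSE_KEYWORD, "range must not be empty");

// Because the keywords are contiguous, membership is two integer
// comparisons instead of one comparison per keyword.
static bool Is_Parse_Keyword(uint16_t sym) {
    return sym >= MIN_PARSE_KEYWORD && sym <= MAX_PARSE_KEYWORD;
}
```

If the keywords were scattered through the list, the same test would need one comparison per keyword (or a lookup table).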

If you write an extension in C that works at the level of the internal API and want the performance of a native, you might want to refer to a word that's not in that list. You can get fairly close to that performance for a single test by caching a pointer to the canonized version of the word, and comparing against that canon pointer. But it won't be quite as fast, and since the pointer isn't a compile-time constant...C can't use it in switch statements.

To be more concrete, imagine you have some words not in %words.r like OVERLOAD, MULTIPLE, INHERITANCE. You couldn't write:

 switch (VAL_WORD_SYM(some_word)) {  // small 16-bit # can be cached in word
     case SYM_OVERLOAD: ...  // ...but these weren't in %words.r!
     case SYM_MULTIPLE: ...
     case SYM_INHERITANCE: ...
     default: ...
 }

Can't do that for those new terms. You'd have to do case-insensitive string comparisons, or something like this pseudocode:

 const Symbol* symbol_overload;
 const Symbol* symbol_multiple;
 const Symbol* symbol_inheritance;

 void On_Module_Load() {
     symbol_overload = Register_Symbol("overload");
     symbol_multiple = Register_Symbol("multiple");
     symbol_inheritance = Register_Symbol("inheritance");
 }

 void On_Module_Shutdown() {
     Unregister_Symbol(symbol_overload);
     Unregister_Symbol(symbol_multiple);
     Unregister_Symbol(symbol_inheritance);
 }

So imagine this gives you word series pointers that are guarded from GC for as long as your module is loaded. Then you could say:

 const Symbol* symbol = Cell_Word_Symbol(some_word);
 if (symbol == symbol_overload) { ... }
 else if (symbol == symbol_multiple) { ... }
 else if (symbol == symbol_inheritance) { ... }
 else { ... }

It's less elegant than the switch(), and since the numbers are runtime pointers and not fixed at compile-time, there's no way to optimize as in a switch() by repeatedly bisecting the range of values...if you have N words, you will do N comparisons.
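For anyone who wants to try the pointer-comparison approach in isolation, here is a self-contained toy version. The `Symbol` struct, the fixed-size table and `Intern_Symbol` are stand-ins invented for this sketch (the real interpreter's symbol table is case-insensitive and garbage collected; this one is neither).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Stand-in for the interpreter's symbol type (hypothetical layout).
typedef struct { const char *spelling; } Symbol;

#define MAX_SYMBOLS 16
static Symbol g_symbols[MAX_SYMBOLS];
static size_t g_num_symbols = 0;

// Toy interning: equal spellings always yield the same pointer.
// Exact-case match only; the spelling string must outlive the table.
static const Symbol* Intern_Symbol(const char *spelling) {
    assert(spelling != NULL);
    for (size_t i = 0; i < g_num_symbols; ++i) {
        if (strcmp(g_symbols[i].spelling, spelling) == 0)
            return &g_symbols[i];
    }
    assert(g_num_symbols < MAX_SYMBOLS);  // toy table has fixed capacity
    g_symbols[g_num_symbols].spelling = spelling;
    return &g_symbols[g_num_symbols++];
}

static const Symbol* symbol_overload;
static const Symbol* symbol_multiple;
static const Symbol* symbol_inheritance;

static void On_Module_Load(void) {
    symbol_overload = Intern_Symbol("overload");
    symbol_multiple = Intern_Symbol("multiple");
    symbol_inheritance = Intern_Symbol("inheritance");
}

// Up to N pointer comparisons for N words. A switch() is not possible
// because the pointers are not compile-time constants.
static int Classify(const Symbol* symbol) {
    if (symbol == symbol_overload) return 1;
    if (symbol == symbol_multiple) return 2;
    if (symbol == symbol_inheritance) return 3;
    return 0;
}
```

Each comparison is cheap, but the chain is linear in the number of words, which is the cost described above.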

Weird idea: Agree on a list of words and numbers, commit on Internet

It would be pretty heinous to make a much bigger %words.r and ship it in every executable...inflating the size of Rebol to include a dictionary.

But there's a possibility that doesn't go that far yet still gets the benefit. Make the word list and commit it somewhere on the internet where developers can look it up. Give every common word a number. Then, each extension ships with just the spellings and numbers it needs. All extensions agree to use the same numbers:

 #define SYM_OVERLOAD 15092
 #define SYM_MULTIPLE 32091
 #define SYM_INHERITANCE 63029

 void On_Module_Load() {
     Register_Symbol("overload", SYM_OVERLOAD);
     Register_Symbol("multiple", SYM_MULTIPLE);
     Register_Symbol("inheritance", SYM_INHERITANCE);
 }

 void On_Module_Shutdown() {
     Unregister_Symbol(SYM_OVERLOAD);
     Unregister_Symbol(SYM_MULTIPLE);
     Unregister_Symbol(SYM_INHERITANCE);
 }

Your switch() statements can work just fine, and you're only out of luck if you use a sequence of characters that wasn't committed to in the database. But the database can grow, so long as it grows centrally and not inconsistently. (In fact, it's probably better to do it that way, where extension authors ask for the words they want and get them approved before shipping the extension.)

The worst that can happen is you load two extensions that disagree, and it refuses to load them. It could print out the disagreeing numbers and you could consult the internet to decide who was the culprit using the wrong number.
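A rough sketch of how that disagreement check could work, assuming a simple registry keyed by both spelling and number. `Try_Register_Symbol`, the `Registration` struct and the fixed-size table are hypothetical names for this illustration, not an existing API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_REGISTERED 64

typedef struct {
    const char *spelling;
    uint16_t id;
    unsigned refcount;  // how many extensions registered this exact pair
} Registration;

static Registration g_registry[MAX_REGISTERED];
static size_t g_num_registered = 0;

// Returns false (and reports the conflict) if the spelling is already
// bound to a different number, or the number to a different spelling.
static bool Try_Register_Symbol(const char *spelling, uint16_t id) {
    assert(spelling != NULL);
    for (size_t i = 0; i < g_num_registered; ++i) {
        Registration *r = &g_registry[i];
        bool same_spelling = strcmp(r->spelling, spelling) == 0;
        bool same_id = (r->id == id);
        if (same_spelling && same_id) {
            ++r->refcount;  // a second extension agrees: fine
            return true;
        }
        if (same_spelling || same_id) {
            fprintf(stderr, "symbol conflict: \"%s\"=%u vs. \"%s\"=%u\n",
                r->spelling, (unsigned)r->id, spelling, (unsigned)id);
            return false;
        }
    }
    if (g_num_registered == MAX_REGISTERED)
        return false;  // toy table is full
    g_registry[g_num_registered++] = (Registration){ spelling, id, 1 };
    return true;
}
```

The reference count is there so that two extensions agreeing on the same pair can both unregister later without pulling the entry out from under each other; the unregister side is omitted here.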

It's a weird idea but kind of interesting--not in particular because of the performance aspect, but because of enabling the C switch()es. Since there are only 16 bits of space in the word available for the symbol trick, the numbers are an exhaustible resource. But maybe still worth doing. This really isn't difficult, outside of the administrative headache of deciding the policy on giving out numbers.

Five years later...

Since I'm reviving Extension Types, it brought me face to face with the issue of these old extensions introducing clutter, in terms of built-in symbols that you pay for whether you use them or not.

You can see how the original %words.r grew between R3-Alpha and R3-Atronix/Saphirion:

R3-Alpha: https://github.com/rebol/rebol/blob/master/src/boot/words.r

R3-Atronix: https://github.com/zsx/r3/blob/atronix/src/boot/words.r

It's messy, and building in the words grows the lookup hash table... costs you the string storage... and costs you the memory Stub corresponding to the string (though small strings fit in the Stub).

So I decided it was time to implement the idea...

A Little Different...

This is what I originally suggested:

 #define SYM_OVERLOAD 15092
 #define SYM_MULTIPLE 32091
 #define SYM_INHERITANCE 63029

 void On_Module_Load() {
     Register_Symbol("overload", SYM_OVERLOAD);
     Register_Symbol("multiple", SYM_MULTIPLE);
     Register_Symbol("inheritance", SYM_INHERITANCE);
 }

 void On_Module_Shutdown() {
     Unregister_Symbol(SYM_OVERLOAD);
     Unregister_Symbol(SYM_MULTIPLE);
     Unregister_Symbol(SYM_INHERITANCE);
 }

I've made it a bit easier by having an #include file with the SYM_XXX defined that you can use in your extension. But I call them EXT_SYM_XXX instead.

Also, what you get back is actually not a Symbol*, but instead a Value*.

"Why a Value?" you ask. Well, because it's not useless to have the value around, and because for now an API handle is the easiest way to prevent the GC from collecting the symbol if nothing else is using it. You don't want that to happen between the Register/Unregister calls, because if the symbol disappeared and came back it wouldn't have the symbol ID anymore.

If there were space inside the symbol to put a reference, we could put it there. But things are hyper-optimized, and there's no space for that, not for such a fringe feature. Even if the space were available, I'd use it for something else.

So it looks more like this:

Value* g_word_overload = nullptr;
Value* g_word_multiple = nullptr;
Value* g_word_inheritance = nullptr;

void On_Module_Load() {
    g_word_overload = Register_Symbol("overload", EXT_SYM_OVERLOAD);
    g_word_multiple = Register_Symbol("multiple", EXT_SYM_MULTIPLE);
    g_word_inheritance = Register_Symbol("inheritance", EXT_SYM_INHERITANCE);
}

void On_Module_Shutdown() {
    Unregister_Symbol(g_word_overload, EXT_SYM_OVERLOAD);
    Unregister_Symbol(g_word_multiple, EXT_SYM_MULTIPLE);
    Unregister_Symbol(g_word_inheritance, EXT_SYM_INHERITANCE);
}

(You don't technically have to pass the symbol in on unregistering, but it's a sanity check.)

And that's all it takes to be able to use the EXT_SYM_XXX in switch() statements in your extension... without bloating the core with your weird words!
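As a rough illustration of the payoff, here is what a switch() on those IDs might look like. The numeric values are just the made-up ones from the earlier example (real ones would come from the shared include file), and `Feature` and `Feature_For_Symbol_Id` are hypothetical names; how the 16-bit ID is pulled out of a word cell is left out.

```c
#include <assert.h>
#include <stdint.h>

// Illustrative values only; the real ones would come from the shared
// include file that defines the EXT_SYM_XXX constants.
#define EXT_SYM_OVERLOAD 15092
#define EXT_SYM_MULTIPLE 32091
#define EXT_SYM_INHERITANCE 63029

static_assert(EXT_SYM_INHERITANCE <= UINT16_MAX, "symbol IDs must fit in 16 bits");

typedef enum {
    FEATURE_NONE,
    FEATURE_OVERLOAD,
    FEATURE_MULTIPLE,
    FEATURE_INHERITANCE
} Feature;

// The IDs are compile-time constants, so they are legal case labels and
// the compiler is free to pick a jump table or a binary search.
static Feature Feature_For_Symbol_Id(uint16_t id) {
    switch (id) {
      case EXT_SYM_OVERLOAD: return FEATURE_OVERLOAD;
      case EXT_SYM_MULTIPLE: return FEATURE_MULTIPLE;
      case EXT_SYM_INHERITANCE: return FEATURE_INHERITANCE;
      default: return FEATURE_NONE;
    }
}
```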