Further Optimizations Of Breaking the 64-Type Barrier

hostilefork · March 16, 2025, 8:28pm

Historical Rebol/Red are limited to 64 datatypes. It's not because there isn't space in the cells to store more (e.g. 256 in a header byte) -- but because TYPESET! wanted to fit into a Cell's payload, as a 64-bit integer to sparsely convey an arbitrary collection of bits in a typeset.

Ren-C made the bold move away from typesets into type predicates... arbitrary functions that can check more than just the type. So you can take an argument and demand it be EVEN?, for instance...

I've discussed how this made it critical to be able to call functions like ANY-SERIES? very quickly. A big part of that is Intrinsics: Functions Without Frames.

What's really neat about the ability to call intrinsics super fast is that it doesn't only apply in type checking. Ordinary evaluation gets accelerated as well, e.g. you can write if any-series? value [...] and it uses the same magic that turbocharges typechecks.

(Note: All intrinsics are stylized in such a way that they can be called non-intrinsically if need be. In fact, the debug build sporadically calls intrinsics non-intrinsically to make sure they still work. This would be needed if you wanted a stepwise debugger, so that it didn't seem to eerily skip functions every time an intrinsic dodged making a frame.)

A Key Optimization: 4 or 8 PARAMETER! Bytes

The PARAMETER! spec's Array stub has 4 spare bytes on 32-bit platforms, and 8 spare bytes on 64-bit platforms. So this gave me an idea to have a numbered list of up to 255 built-in intrinsics. There'd be a one-time analysis of the argument spec to see if you used any of those 255 common type checks, and if you did then it would encode that in the bytes so you wouldn't have to actually call the function--you could just run the test the function would do.

So let's say you said: [integer! any-list?]. It would recognize these as being in the built in list (INTEGER! equivalent to the INTEGER? constraint). And you would get an optimized byte sequence like {1, 77, 0}. (0 is reserved for indicating a premature end, but if you're on a 32 bit platform and it sees 4 values with no 0 it knows they all apply... and if you're on a 64-bit platform and it sees 8 values with no 0 it knows they all apply. Only a single bit is needed to mark "incomplete optimizations" where it has to run through the typeset array.)

There's a special bit set on the cells in the array for the typeset to say that they were accounted for by the optimization. This means if you can't fully optimize the array (either because you have too many types to check for the 4 or 8 bytes, or because you used constraints that aren't in the built in list) then it still knows it can skip past the optimized constraint when it's walking the array to test the outliers.

Here is a list of the optimized bytes so far (not close to 255 yet!)

(There's a separate byte for all antiform categories instead of all being TYPE_ANTIFORM, which helps in internal code as well as accelerators here, so that chews up slots as type byte multiplied by 2. I'm sure there could be clever uses for the gaps, if you can truly guarantee a type will never be isotopic...)

integer 1
decimal 2
percent 3
pair 4
time 5
date 6
bitset 7
map 8
handle 9
opaque 10
blob 11
text 12
file 13
tag 14
url 15
email 16
rune 17
varargs 18
parameter 19
object 20
module 21
warning 22
port 23
frame 24
let 25
comma 26
tuple 27
chain 28
path 29
block 30
fence 31
group 32
word 33
metaform 34
tied 35
pinned 36
quasiform 37
quoted 38
~ 39
~ 40
~ 41
~ 42
~ 43
~ 44
~ 45
~ 46
~ 47
~ 48
~ 49
~ 50
~ 51
~ 52
~ 53
~ 54
trash 55
~ 56
~ 57
~ 58
~ 59
error 60
~ 61
action 62
~ 63
ghost 64
~ 65
~ 66
~ 67
pack 68
datatype 69
splice 70
keyword 71
any-antiform 72
any-float 73
any-string 74
any-context 75
any-sequence 76
any-list 77
any-number 78
any-scalar 79
any-inert 80
any-sequencable 81
any-series 82
any-utf8 83
any-isotopic 84
any-branch 85
any-unit 86
any-plain 87
any-fundamental 88
any-element 89

Removing Generality To Boost Performance

My first cut at implementing this tried to be very general. I had a table of up to 255 C function pointers that took a Cell as an argument, and returned a boolean. This meant basically any constraint could be optimized... like EVEN?, because it had not only the type (e.g. INTEGER!) but it had the full cell it could pick apart:

bool Optimized_Even_Checker(Cell* cell) {
    if (Cell_Type(cell) != REB_INTEGER)
        return false;
   return VAL_INT64(cell) % 2 == 0;
}

But as you see from the table above, I'd only gotten around to automating the construction of the table based on information in %types.r. So all the functions did was test the cell's type. There were three kinds of checkers:

bool Optimized_Integer_Checker(Cell* cell) {  // single datatype
    return Cell_Type(cell) == REB_INTEGER;
}

bool Optimized_Any_List_Checker(Cell* cell) {  // datatype range
    Type type = Cell_Type(cell);
    return type >= REB_BLOCK and type <= REB_VAR_GROUP;
}

bool Optimized_Any_Scalar_Checker(Cell* cell) {  // sparse typeset
    Type type = Cell_Type(cell);
    return g_sparse_memberships[type] & TYPESET_FLAG_SCALAR;
}

So what you see is that the way that the type bytes are numbered, sometimes you can check for typeset membership by just seeing if it's in a certain range. Careful arrangement of %types.r means that's possible for a lot of checks.

But if something doesn't fit that model, there's another table of sparse typeset flags put together. I've limited the number of sparse typesets to 31--and I'll explain why--but there's only 13 of them so far.

Let's Do Some Slight Transformations

Let's say that we decide that anything that can't be checked by datatype alone we're willing to call as an (often intrinsic) function. So we're willing to pass the type in and not have each function recalculate it.

ALSO: Let's say that we want to collapse datatype checking to just be a trivial case of range checking, where the start and end of the range are the same.

bool Optimized_Integer_Checker(Type type) {  // trivial range
    return type >= REB_INTEGER and type <= REB_INTEGER;
}

bool Optimized_Any_List_Checker(Type type) {  // non-trivial range
    return type >= REB_BLOCK and type <= REB_VAR_GROUP;
}

bool Optimized_Any_Scalar_Checker(Type type) {  // sparse
    return g_sparse_memberships[type] & TYPESET_FLAG_SCALAR;
}

That's interesting... but why are we calling a function at all? What if we drop the function pointers and just make the table be the information?

...BUT... what if there's a bit reserved, let's say TYPESET_FLAG_0_RANGE, which we use to indicate the table entry has two bytes of range information, start and finish. and if that bit is not set, then the table entry has a single TYPESET_FLAG_XXX for the flag you need to test in the sparse table!

INLINE bool Builtin_Typeset_Check(TypesetByte typeset_byte, Type type) {
    TypesetFlags typeset = g_typesets[typeset_byte];

    if (typeset & TYPESET_FLAG_0_RANGE) {  // trivial ranges ok (one datatype)
        Byte start = THIRD_BYTE(&typeset);
        Byte end = FOURTH_BYTE(&typeset);
        return start <= type and type <= end;  // note: type converts to Byte
    }

    return did (g_sparse_memberships[type] & typeset);  // just a typeset flag
}

You don't even have to shift a datatype to do the check. Where historical Rebol required a 64-bit shift, the flag you need to test against is just waiting for you in the array (and sparse memberships only needs a 32-bit integer).

Blazing through a list of these typechecks is very efficient.

So Why Not Use This For Generic Dispatch?

When I realized I could now hum through a table of "fancy-sounding" type constraints like ANY-SERIES? or ANY-SCALAR? and not break a sweat, it made me think this is a perfect way to deal with the methodization of generics.

All I had to do was order the type checks from specific to more general, run the checks in sorted order, and everything from molding to comparison to multiplication could be dispatched at as fine a granularity as you want.

You write in the source your implementations wherever you want, like:

#define IMPLEMENT_GENERIC(name,type) \
    Bounce g_##name##_##type(Level* level_)  // e.g. g_APPEND_Any_List

IMPLEMENT_GENERIC(LESSER_Q, Is_Integer)  // LESSER? -> LESSER_Q in C
{ ... }

IMPLEMENT_GENERIC(LESSER_Q, Any_Context)
{ ... }

Then the build process packages that all up into a table, with the constraint byte in order:

#define GENERIC_CFUNC(name,type)  G_##name##_##type

GenericTable g_generic_lesser_q[] = {
    {2, &GENERIC_CFUNC(LESSER_Q, Is_Integer)},  // => {2, &g_LESSER_Q_Is_Integer}
    {3, &GENERIC_CFUNC(LESSER_Q, Is_Decimal)},
    {6, &GENERIC_CFUNC(LESSER_Q, Is_Money)},
    {7, &GENERIC_CFUNC(LESSER_Q, Is_Time)},
    {8, &GENERIC_CFUNC(LESSER_Q, Is_Date)},
    {10, &GENERIC_CFUNC(LESSER_Q, Is_Bitset)},
    {13, &GENERIC_CFUNC(LESSER_Q, Is_Blob)},
    {26, &GENERIC_CFUNC(LESSER_Q, Is_Frame)},
    {67, &GENERIC_CFUNC(LESSER_Q, Any_Context)},
    {72, &GENERIC_CFUNC(LESSER_Q, Any_Sequence)},
    {76, &GENERIC_CFUNC(LESSER_Q, Any_List)},
    {85, &GENERIC_CFUNC(LESSER_Q, Any_Utf8)},
    {92, &GENERIC_CFUNC(LESSER_Q, Any_Element)},
    {0, nullptr}
};

Then the dispatch is as fast as heck! Remember that the compiler inlines this so it's not calling a Builtin_Typeset_Check() function each time through the loop, it's generating micro-optimized code:

#define Dispatch_Generic(name,cue,L) \
    Dispatch_Generic_Core(g_generic_##name, Cell_Type(cue), (L))

Bounce Dispatch_Generic_Core(
    GenericTable* table,
    Type type,
    Level* level
){
    for (; table->typeset_byte != 0; ++table) {
        if (Builtin_Typeset_Check(table->typeset_byte, type))
            return table->dispatcher(level);
    }

    return fail ("No dispatcher for datatype of generic");
}

Notice how this has the fallthrough to ANY-ELEMENT?, so if you pass something that's unhandled you can still do a baseline handler!

(I'm considering making it possible to do something like return PASS and have it continue bubbling down the chain, but not jumping the gun on features just yet... you can explicitly call a handler via GENERIC_CFUNC(generic, type) if you want, and that doesn't add cost to every dispatch, so I'm seeing how far it goes.)

Goodbye R3-Alphalike switch() Statements!

This trounces the old means of specifying dispatch and subclassing and overriding.

R3-Alpha had cryptic information, with a bunch of * and - and f+ stuff:

github.com/rebol/rebol

src/boot/types.r

25033f897


      
          	email       self        string      +        f*      *      *   [series string]  
          	url         self        string      +        f*      file   *   [series string]
          	tag         self        string      +        +       *      *   [series string]  
          
          	bitset      self        bitset      *        *       *      *   -
          	image       self        image       +        +       *      *   series
          	vector      self        vector      -        -       *      *   series

But at the end of the day you were usually stuck with an all-or-nothing... if you said that a GROUP! handled APPEND the same way that a BLOCK! handled append, you wound up having to say that all generics routed through ANY-LIST?. They had the same entry point for everything else.

This breaks that all wide open, with granular overriding. And blows away the 64 type limit. And with the native entry point available to generics you can do common processing before any generic runs, to enforce things like commutativity in addition and multiplication, etc...

//
//  multiply: native:generic [
//
//  "Returns the second value multiplied by the first"
//
//      return: [any-scalar?]
//      value1 [any-scalar?]
//      value2 [any-scalar?]
//  ]
//
DECLARE_NATIVE(multiply)
//
// 1. Most languages want multiplication to be commutative (exceptions like
//    matrix multiplication do exist, though that likely should be a different
//    operation and reserve MULTIPLY for element-wise multiplication).  To
//    ensure commutativity, we swap the arguments if their heart bytes are
//    not in "canon order".
//
//    (Using the HEART_BYTE as the canon order is a bit of a hack, as the
//    table can be reordered.  But we try to order the types in %types.r
//    such that more complex types come later, so that we dispatch to the
//    more complex type...e.g. multiplying a PAIR! by a DECIMAL! should
//    should dispatch to the PAIR! code.)
{
    INCLUDE_PARAMS_OF_MULTIPLY;

    Stable* v1 = ARG(value1);
    Stable* v2 = ARG(value2);

    if (HEART_BYTE(e1) < HEART_BYTE(e2)) {  // simpler type is on left [1]
        Move_Cell(SPARE, v2);
        Move_Cell(v2, v1);  // ...so move simpler type to be on the right
        Move_Cell(v1, SPARE);
    }

    return Dispatch_Generic(MULTIPLY, v1, LEVEL);
}

Lots to Do, But This Is A Step In The Right Direction

BlackATTR · March 16, 2025, 8:57pm

Wild, exciting stuff! Looking great, HF.