Rethinking the ODBC Dialect: Inline Parameters

The code the ODBC extension is based on was cloning Rebol2's model of thinking of the database services as being a PORT!. You would run queries by INSERT-ing SQL into the port, and then COPY the port to get a BLOCK! of results back. For anyone unfamiliar:

When you passed a BLOCK! as the thing to INSERT to the port, that block would have a string in the first position...this was the SQL. But if that SQL had ? characters in it (that weren't part of a string literal), then those represented substitution slots. The block then was supposed to have as many additional values in it as ?... and these Rebol values would be filled in as parameters.

This approach is fundamental to how ODBC works (the ? syntax is an ODBC standard). It means that you don't have to turn every value you have into a string to insert it. It saves you from worrying about how to escape those string values, may be more efficient (for some types), and protects against SQL Injection Attacks.

We Can Do Better, and I've Made a Start At It

The idea I had was to use the $XXX types to indicate you wanted substitutions.

So:

user-id: 123
user-name: "alice"
user-email: "alice@example.com"

odbc-execute [
    INSERT INTO users (id, name, email)
    VALUES ($user-id, $(uppercase of user-name), $user-email)
]

This produces a parameterized query for the driver:

[
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)"
    123 "ALICE" "alice@example.com"
]

Questions, Comments, Concerns?

2 Likes

Huge thumbs up. I personally did not like the ? ? ? parameter juggling throughout that dialect-- it made code difficult to read. In the 90's I liked ColdFusion's quick/legible approach, which isn't too far I think from the improvement here. (For the youngsters, ColdFusion is/was a templating system using a markup language, i.e., a dialect). At the time CF was criticized for not being a real programming language, but their approach was very easy to understand compared to CGI and quite acceptable performance-wise. It was commercially successful. Some examples:
https://helpx.adobe.com/coldfusion/developing-applications/accessing-and-using-data/updating-your-database/inserting-data.html

Graham's PatientDB project uses a fair amount of SQL, and he decided he wanted to use this dialect feature.

Most obvious problem: Not all SQL is LOAD-able

There are syntaxes for types that involve a WORD! followed by a GROUP! with no space between, like VARCHAR(255). (for instance)

We're in this unhappy situation of being able to LOAD most of SQL, but missing just enough such that you have to fall back on using strings here and there.

Parameter Substitution Doesn't Work On Table Names

I did not know this--but you can't use the ? substitution technique on table names:

https://stackoverflow.com/q/1208442/

So that puts you back in the situation of having a name in a variable that you know is supposed to be a table name, but if spliced might be interpreted as SQL code...leading to bugs or a way to do an injection attack.

To deal with this I made it so that tied blocks ($[...]) have a purpose in splicing string content into the query. So you can say SELECT * FROM $[table-name]. But if table-name turned out to be "patients; DROP TABLE patients" you could lose a table...so this needs to be used with care when the strings are coming from random places.

Strings Are Conflicted In Purpose

On the one hand, SQL needs string literals. On the other hand, we need a way to represent patterns that aren't LOAD-able. So which of these two purposes do we apply strings for?

It's annoying to have to write:

SELECT * FROM patients WHERE medication = -["ASPIRIN"]-;

But if we render "ASPIRIN" as "ASPIRIN" in the SQL statement--with the quotes--how do you get things like VARCHAR(20), without the quotes?

We could pick another string type for SQL code splices, like TAG! for instance.

    CREATE TABLE vaccines (
        id INTEGER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
        created TIMESTAMP DEFAULT current_timestamp,
        name <VARCHAR(128)>,  ; Weird, but...
        vaxdate DATE
    )

We can't use TUPLE! because table.field is a syntax that actually gets used, so we can't moosh together VARCHAR.(128) as VARCHAR(128). Maybe we could, but just in the cases of a GROUP! on the right? :frowning:

Or VARCHAR/128 could render as VARCHAR(128).

This really does make me struggle with whether to say we need the "FUSED!" proposal. But that's still not going to get us there.

No matter what, we won't get 100% SQL compatibility unless we use strings. We might be able to cook up some kind of interpolation, but that runs up against the problem of needing to be able to fully and correctly parse SQL in order to be sure that the escaping isn't done in places it shouldn't...a tougher task.

Would Like This Unified With QUERY

I've mentioned that if we're going to have some kind of SQL dialect thing, we should make it all consistent.

But here we've got a situation where we are blindly passing through strings. We don't know what's a keyword and what isn't. So we couldn't enforce a rule like saying "field names must be QUOTED!" if we wanted to.

...but at least we have COMMA!. The comma data type has really been a winner, and I can't think of anytime I've felt a regret about its addition.

Ultimately, The Good And The Bad

The GOOD is just how easy it is to make such a dialect.

The BAD is that the devil is in the details. When you can't use the dialect you're back to using strings or some method of putting strings into the code. And SQL already has a need for the behavior of TEXT! strings.

All this is just real-world examination of why the ideals of Rebol don't play out so well in a world where Rebol doesn't control everything.

Text files are too saturated to be able to represent such composite interests; a graph structure and editor are needed to really pull off this kind of language intermingling.

3 Likes

How about varchar: 128

If you see a set word in the dialect it's to be turned into a group.

I don't think it's too bad to use quotes for table names:

CREATE TABLE "dbo.MyTable"  
(  
    MyDecimalColumn DECIMAL: (5,2)  
    , MyNumericColumn NUMERIC: (10,5)
);

The SET-WORD! idea doesn't look too bad...but it does point to the slippery slope of making you rewrite your SQL that we embark upon.

The problem being that we don't want those quotes to make it into the SQL or it would complain about there being an invalid string in the spot (I think).

And we would need to come up with a way of representing strings. The only choices that make sense to me are:

select * from drugs where name = "aspirin 100mg"

select * from drugs where name = -["aspirin 100mg"]-

If we go with the first way (which I think we should), I suggested we might go with TAG! to splice in arbitrary SQL, e.g. a table name that we can't handle:

select * from <system$SpecialTableName>
1 Like

Most databases don't conform strictly to ANSI SQL, so I doubt we'd ever be able to safely LOAD non-trivial SQL in Ren-C. At least not SQL supported across the spectrum of dbms platforms (cloud/document dbs drift away considerably from the standards).

A very basic example is a database like SQL Server or PostgreSQL only permitting strings in the following format:

WHERE name = 'Brady'

While many others permit either:

WHERE name = "Brady"
WHERE name = 'Brady'

And this is just trivial SQL. Almost every platform permits custom functions and code which isn't supported on other platforms.

1 Like

So I mentioned that the syntax for this is to use a $TIED BLOCK!:

SELECT * FROM $[table-name]

But of course it's worrying, as everyone screams quite loudly "don't make it easy to paste together arbitrary strings to make SQL".

Given that we can make a lot of datatypes, I briefly wondered if we could have a separate TAINTED-TEXT! which input routines would make, and then things like $table-name would refuse to splice it. This idea does have precedent... e.g. Perl has some Taint marking stuff. But I don't know if it's wise to be worrying about going down that road at this time.

However... it did occur to me that in the Rebolverse, we do have sort of a way of marking data as either "processed or not"...which is to say, if something is LOAD'ed vs. just a TEXT!. While you can argue that a block like [this is some sql] might come from a tainted source just like "this is some sql", it seems like it could provide at least some step of conceptual security to get people to use things like WORD! or BLOCK! or INTEGER! instead of strings.

e.g. even the error message could serve as a reminder:

WARNING: $[table-name] is a TEXT!. Out of concern for your own well-being, ODBC dialect won't splice text strings directly. Use $table-name if at all possible--this will form a prepared statement and avoids SQL injection entirely. But because not all positions in ODBC will accept substitutions from prepared statements, the $[xxx] feature is there for you...yet only LOAD-ed data such as a BLOCK! or WORD! or INTEGER! etc. will be accepted. If you're unsure about the source of the string you are trying to paste, do not blindly transcode it or envelop it into a BLOCK! just for expedience to subvert this check. You have been warned."

Like I say, in a Rebol world it's not perfect, if you start thinking about asking users for BLOCK!s instead of strings and transcode stuff just as they type it in this won't do you any good. But I think it would catch a lot of casual mistakes, and point people to the $xxx solution which works in many places.


Note: Because it turns out to be really expedient at times to be able to splice strings, I've added a way to do it: $[<!> expr]

1 Like