Rethinking the ODBC Dialect: Inline Parameters

hostilefork · October 20, 2019, 8:07am

The code the ODBC extension is based on was cloning Rebol2's model of thinking of the database services as being a PORT!. You would run queries by INSERT-ing SQL into the port, and then COPY the port to get a BLOCK! of results back. For anyone unfamiliar:

When you passed a BLOCK! as the thing to INSERT to the port, that block would have a string in the first position...this was the SQL. But if that SQL had ? characters in it (that weren't part of a string literal), then those represented substitution slots. The block then was supposed to have as many additional values in it as ?... and these Rebol values would be filled in as parameters.

This approach is fundamental to how ODBC works (the ? syntax is an ODBC standard). It means that you don't have to turn every value you have into a string to insert it. It saves you from worrying about how to escape those string values, may be more efficient (for some types), and protects against SQL Injection Attacks.

We Can Do Better, and I've Made a Start At It

The idea I had was:

Perform implicit evaluation by default
Assume string components should be implicitly SPACED
Use the new @xxx types to let you put the substitution variables in the spots you want

The first two points will be familiar to anyone who has used PRINT. But the third point means substitutions would be easier to see at the point they are intended to be:

odbc-execute statement [{foo} @value1 {bar} @(expression two) {baz}]
; ...acts like...
insert statement compose [{foo ? bar ? baz} (value1) (expression two)]

The nice things the @ values bring into the mix is that since they do not reduce, you can distinguish them from basic code you're using to try and form the SQL expression from sanitized variables...while still using both.

Behold, More State-of-the-Art

Here's a little wrapper called ODBC-EXECUTE:

odbc-execute: function [
    statement [port!]
    query [text! block!]
    /parameters [block!] "Usually implicitly derived from @params in query"
][
    parameters: default [copy []]
    if block? query [
        query: spaced map-each item reduce query [  ; see note (1)
            switch type of item [
                sym-word! sym-path! [get item]
                sym-group! [reduce as group! item]
            ] then value => [  ; see note (2)
                append parameters value
                "?"
            ] else [
                quote :item  ; see note (3)
            ]
        ]
    ]
    insert statement compose [(query) ((parameters))]  ; see note (4)
]

Pretty brilliant, IMNSHO.

We want to REDUCE first, before we go hunting for @params and assuming they are parameters, since they might be arguments to functions. Think of it a bit like how print ["a" first "bc" "d"] can't a-priori know that "bc" is a string to be printed. It has to evaluate, and "bc" disappears as a function argument...in the same way, an @xxx might be consumed and not be intended as a parameter. (This may or may not be desirable...but again it's up to the dialect designer, who has freedom to decide how liberal they want to be.)
This is a great example of how to use the value vs. null distinction to avoid repeating code. The branches don't have to repeat the append parameters or the fact that they want ? to wind up in the query string, thanks to the protocol.
This shows generalized quoting being used to get out of a potential jam. The problem is, we REDUCE'd our dialect items before we went looking through them for our @params (I explained why in note 1). But then we want to use SPACED to get spacing, and it reduces by default. There should probably be a way to use DELIMIT and its variants without reducing, but no such entry point exists at the moment...so to avoid double-reduction we just quote everything so the evaluation is a no-op.
Just wanted to say I've been really a big fan of the "don't splice by default" for blocks in COMPOSE. Once you get used to the idea that values will be put in as-is unless you use (( )) it really is superior.

Not only do you see several Ren-C-specific tech points making this work, you can also specialize it to make it nicer. I picked a clunky name precisely because I don't know how the PORT! interface should work, but just do this:

sql: specialize 'odbc-execute [statement: statement]

Someday soon that should be able to be sql: (=> odbc-execute statement) as point-free specialization. But maybe it should be sql/execute and that's clearer. Who knows.

Questions, Comments, Concerns?

As a best practices question, one might wonder if it shouldn't be easier to splice unsanitized variables into the SQL than to splice sanitized ones. You might flip it so that WORD! means parameter, and SYM-WORD! means unsafe use...or get even more strict and suggest anyone wanting to put in string components has to do it with an explicit COMPOSE before ODBC-EXECUTE is called.

I feel that's a bit overkill and would frustrate those doing casual hacking on their databases. But the great thing is that if we just build this as a tool on top of a fundamental ODBC operation then people with different opinions can be more aggressive in their versions of the dialect. It's not a big deal if the dialect is easy to make. ミ★

BlackATTR · October 21, 2019, 10:23pm

Huge thumbs up. I personally did not like the ? ? ? parameter juggling throughout that dialect-- it made code difficult to read. In the 90's I liked ColdFusion's quick/legible approach, which isn't too far I think from the improvement here. (For the youngsters, ColdFusion is/was a templating system using a markup language, i.e., a dialect). At the time CF was criticized for not being a real programming language, but their approach was very easy to understand compared to CGI and quite acceptable performance-wise. It was commercially successful. Some examples:
https://helpx.adobe.com/coldfusion/developing-applications/accessing-and-using-data/updating-your-database/inserting-data.html

hostilefork · December 14, 2021, 4:23pm

Graham's PatientDB project uses a fair amount of SQL, and he decided he wanted to use this dialect feature.

It had atrophied and was using ^META for variable solutions instead of the @VAR types, so I changed that back. But then I went and changed it so that plain WORD! and GROUP! were treated as SQL syntax.

So if you write:

sql-execute [
    INSERT INTO fps (title, fname, surname)
    VALUES (@fptitle, @fpinits, @fpname)
 ]

The string you get is:

 {INSERT INTO fps (title, fname, surname)
 VALUES (?, ?, ?)}

And the parameter block passed in is the result of reduce [:fptitile :fpinits :fpname]

Most obvious problem: Not all SQL is LOAD-able

Some table names have $ signs in them (for instance). There are syntaxes for types that involve a WORD! followed by a GROUP! with no space between, like VARCHAR(255).

We're in this unhappy situation of being able to LOAD most of SQL, but missing just enough such that you have to fall back on joining together a SQL string from parts every 5th query you do.

Parameter Substitution Doesn't Work On Table Names

I did not know this--but you can't use the ? substitution technique on table names:

java - Using Prepared Statements to set Table Name - Stack Overflow

So that puts you back in the situation of having a name in a variable that you know is supposed to be a table name, but if spliced might be interpreted as SQL code...leading to bugs or a way to do an injection attack.

To deal with this I made it so that ^META values have a purpose in splicing string content into the query. So you can say SELECT * FROM ^table-name. But if table-name turned out to be "patients; DROP TABLE patients" you could lose a table...so this needs to be used with care when the strings are coming from random places.

Strings Are Conflicted In Purpose

On the one hand, SQL needs string literals. On the other hand, we need a way to represent patterns that aren't LOAD-able. So which of these two purposes do we apply strings for?

It's annoying to have to write:

SELECT * FROM patients WHERE medication = {"ASPIRIN"};

But if we render "ASPIRIN" as "ASPIRIN" in the SQL statement--with the quotes--how do you get things like VARCHAR(20), without the quotes?

We could pick another string type for SQL code splices, like TAG! for instance.

    CREATE TABLE vaccines (
        id INTEGER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
        created TIMESTAMP DEFAULT current_timestamp,
        name <VARCHAR(128)>,  ; Weird, but...
        vaxdate DATE
    )

We can't use TUPLE! because table.field is a syntax that actually gets used, so we can't moosh together VARCHAR.(128) as VARCHAR(128). Maybe we could, but just in the cases of a GROUP! on the right?

Or VARCHAR/128 could render as VARCHAR(128).

This really does make me struggle with whether to say we need the "pack! proposal". But that's still not going to get us there.

No matter what, we won't get 100% SQL compatibility unless we use strings. We might be able to cook up some kind of interpolation, but that runs up against the problem of needing to be able to fully and correctly parse SQL in order to be sure that the escaping isn't done in places it shouldn't...a tougher task.

Would Like This Unified With QUERY

I've mentioned that if we're going to have some kind of SQL dialect thing, we should make it all consistent.

But here we've got a situation where we are blindly passing through strings. We don't know what's a keyword and what isn't. So we couldn't enforce a rule like saying "field names must be QUOTED!" if we wanted to.

...but at least we have COMMA!. The comma data type has really been a winner, and I can't think of anytime I've felt a regret about its addition.

Ultimately, The Good And The Bad

The GOOD is just how easy it is to make such a dialect.

The BAD is that the devil is in the details. When you can't use the dialect you're back to using strings or some method of putting strings into the code. And SQL already has a need for the behavior of TEXT! strings.

All this is just real-world examination of why the ideals of Rebol don't play out so well in a world where Rebol doesn't control everything.

Text files are too saturated to be able to represent such composite interests; a graph structure and editor are needed to really pull off this kind of language intermingling.

gchiu · December 14, 2021, 6:07pm

How about varchar: 128

If you see a set word in the dialect it's to be turned into a group.

gchiu · December 14, 2021, 7:39pm

I don't think it's too bad to use quotes for table names:

CREATE TABLE "dbo.MyTable"  
(  
    MyDecimalColumn DECIMAL: (5,2)  
    , MyNumericColumn NUMERIC: (10,5)
);

hostilefork · December 14, 2021, 7:55pm

The SET-WORD! idea doesn't look too bad...but it does point to the slippery slope of making you rewrite your SQL that we embark upon.

The problem being that we don't want those quotes to make it into the SQL or it would complain about there being an invalid string in the spot (I think).

And we would need to come up with a way of representing strings. The only choices that make sense to me are:

select * from drugs where name = "aspirin 100mg"

select * from drugs where name = {"aspirin 100mg"}

If we go with the first way (which I think we should), I suggested we might go with TAG! to splice in arbitrary SQL, e.g. a table name that we can't handle:

select * from <system$SpecialTableName>

BlackATTR · December 14, 2021, 9:17pm

Most databases don't conform strictly to ANSI SQL, so I doubt we'd ever be able to safely LOAD non-trivial SQL in Ren-C. At least not SQL supported across the spectrum of dbms platforms (cloud/document dbs drift away considerably from the standards).

A very basic example is a database like SQL Server or PostgreSQL only permitting strings in the following format:

WHERE name = 'Brady'

While many others permit either:

WHERE name = "Brady"
WHERE name = 'Brady'

And this is just trivial SQL. Almost every platform permits custom functions and code which isn't supported on other platforms.