Why You Can't Pass a List to an InterBase SQL Param by Craig Stuntz

By: Craig Stuntz

Abstract: When writing a WHERE clause in your InterBase SQL, it would sometimes be convenient to write WHERE FOO IN (:SomeParam) and pass a list of values to SomeParam. You can't do that; here's why.

Note: This article applies to InterBase. However, many SQL database servers have a similar limitation.

About Parameterized Queries

The purpose of a parameterized query is to reduce some of the workload on the server when executing the same query multiple times. The workload reduction comes from the fact that the query need only be compiled and optimized once. Subsequent executions can use the same compiled statement and same optimization PLAN.

All queries must be prepared before they are executed by the server. Note, however, that in BDE applications, doing an explicit MyBDEDataset.Prepare causes the BDE to read and retain metadata information which would otherwise be refetched each time the query is run. This means that "explicitly" preparing a BDE dataset gives a performance improvement much greater than what is gained from the server-side prepare alone. However, many other data access components, such as InterBase Express, retain this metadata information whether or not you "explicitly" prepare a dataset, so in non-BDE applications there is no reason to do this. When I talk about preparing a statement in this article, I am referring to the server-side prepare which is done implicitly no matter how you execute the statement, not the TQuery.Prepare method.

Because of this, the server can execute the query faster on the second and all subsequent times the same prepared statement is used, because there is no need to recompile and re-optimize the statement. So if you repeatedly look up single records in a large table, it is faster to write a query such as this:


SELECT
  SOME_COLUMN
FROM
  SOME_TABLE
WHERE
  SOME_TABLE_ID = :Foo;

...than it is to literally specify the value of the "Foo" param in SQL and change the dataset's SQL property each time.

So in general it is a good practice to use parameters instead of literal values in SQL whenever you need to execute a query more than once. Parameters also ensure that you don't have to think of issues such as date formatting; date parameters are sent to the server as binary data, not strings, so there is no possibility of error if the client and server use different regional settings.

Parameters and IN

However, there is one thing you cannot do with a parameter: You can't pass a list of values. So if we change the query above to this:

SELECT
  SOME_COLUMN
FROM
  SOME_TABLE
WHERE
  SOME_TABLE_ID IN (:Foo);

...there is no way to assign '1000, 1001, 1002' to the "Foo" parameter, since it accepts only a single value, not an array or a comma-delimited string

A Deeper Examination

Now, let's look at how InterBase optimizes a couple of queries. I'll use a simple query against employee.gdb:


SELECT 
  *
FROM
  EMPLOYEE
WHERE
  EMP_NO IN (136);

InterBase generates the following PLAN when the query is prepared:

PLAN (EMPLOYEE INDEX (RDB$PRIMARY7))

A PLAN is InterBase's syntax for displaying how it has optimized the query. Depending upon the SQL and the database's metadata, there are potentially many different ways that InterBase could optimize a query. For example, for a particular table it can read records in storage order or it can use an index to go directly to particular records. The InterBase optimizer selects a particular plan for each prepared statement at the time the statement is prepared based on what you have asked the server to do and what the server knows about the indices available. Don't worry if you can't read the PLAN syntax; just take my word for it that a different PLAN means that the server will be doing someting different internally to run the query, and that not all statements working on the same table can use the same PLAN, since different statements require the server to do different things.

Here's another query which is the same, except that now there are two items in the IN conjunct:


SELECT 
  *
FROM
  EMPLOYEE
WHERE
  EMP_NO IN (136, 138);

And the PLAN is different:

PLAN (EMPLOYEE INDEX (RDB$PRIMARY7,RDB$PRIMARY7))

If I try to force InterBase to use the PLAN from query 2 with the text from query one:


SELECT 
  *
FROM
  EMPLOYEE
WHERE
  EMP_NO IN (136, 138)
PLAN 
  (EMPLOYEE INDEX (RDB$PRIMARY7));

...InterBase simply ignores the PLAN bit and uses the PLAN it feels appropriate.

This is why you can't pass a list to a param. Since it would change the optimization PLAN, there is not a single compiled statement which could execute for any arbitrary number of values in the list.

Next question: Why does IB use different PLANs for different values in the list? The answer is that, to IB, these two statements:


/* Statement 1 -- IN */
SELECT 
  *
FROM
  EMPLOYEE
WHERE
  EMP_NO IN (136, 138);

/* Statement 2 -- OR */

SELECT 
  *
FROM
  EMPLOYEE
WHERE
  ((EMP_NO = 136) OR (EMP_NO = 138));

...are precisely the same. If you look at the generated BLR, you will find no difference. IN with a list of constants is "syntactic sugar" for a ORed list of conjuncts testing for equality with the same field name. If you view the IN as simply shorthand for an ORed list of conjuncts, then it will make sense as to why a different PLAN is used -- different numbers of conjuncts mean a different query to execute.

BLR is InterBase's internal representation of how a query is run. BLR, roughly, is to InterBase what ASM is to Delphi or C++ source code. Saying that the BLR is the same means that the server will do exactly the same thing to run these two queries.

IN with a Subquery is Different

Contrast this, on the other hand, with the following:

SELECT
  *
FROM
  EMPLOYEE
WHERE
  EMP_NO IN (SELECT EMP_NO FROM EMPLOYEE_PROJECT);

This query is similar to the problem above in that there is a variable number of items in the list, but unlike the lists of constants, there is only one logical way to execute the statement: First run the subquery, then compare the results with the rows in the result set of the outer query. Here is the PLAN:

PLAN (EMPLOYEE_PROJECT INDEX (RDB$PRIMARY14)) 
PLAN (EMPLOYEE NATURAL)

What's interesting about this is that the PK is no longer being used for the EMPLOYEE table -- InterBase's optimizer has (correctly) decided that it will be faster to do a storage-order scan of the table. Since half of the rows will be returned, this is faster than fetching matches out of storage order.

Now, the big question: Since IN with a list of constants is more similar to IN with a subquery than an arbitrary list of ORed conjuncts, why does InterBase treat it more like the arbitrary list than the subquery WRT optimization?

I'm not sure. But it would make our report generation easier if it didn't.

By Craig Stuntz


Server Response from: ETNASC03