Putting Subqueries to Use

<< Click to Display Table of Contents >>

Navigation:  Queries >

Putting Subqueries to Use

Wayne Wallace            
 
An earlier "Working SQL" column (see "Subqueries to the Rescue" in the Aug-1997 issue) discussed subqueries and suggested that you could use subqueries to replace VBA code and speed up your programs. In a guest "Working SQL" column, Wayne provides a set of rules for recognizing when you can do this and how to do it.
 
Several months ago, a "Working SQL" column discussed SQL subqueries. In my opinion, that column didn't give the topic the attention it deserved. So, as a guest columnist, I'm sitting in to show you the most powerful tool in the Access arsenal. More particularly, I want to give you three rules that will let you recognize when your VBA or Access Basic code can be replaced with a simpler, faster SQL subquery.
 
In order to discuss when you can use subqueries, I'm going to have to cover at least some of the material from the earlier column. You'll find that the overlap is minimal, though, because I'm not going to provide a lesson on how subqueries work. Instead, I'm going to show you the where and how of putting subqueries into use. For my coding examples in this article, I'll stick to the scenario from the previous column on finding and updating a set of employee records.
 
In this article, I'm going to concentrate on how some of the standard constructs in programming can be replaced with a well-constructed subquery. Developers don't use subqueries often enough because, generally, they've been taught how to use code to process records. That's too bad, because the benefits of using subqueries are high. Most developers will create a recordset and loop through it to process a recordset. If they used a SQL subquery instead, they could replace all of that processing with a single command. The gains in execution time are incredible and increase with the number of records that you process. If processing 100 records with a subquery cuts your processing time by 25 percent, processing 1000 records will give you a 50 percent improvement -- or better.
 
Types of processing

Any program can be broken down into three kinds of processing: sequential, branching, and looping (see Figure 1). In a typical program that processes a recordset, you'll find all three of these constructs being used.

199804_ww1
 
To begin with, the typical recordset is processed in a loop:
 

Set rec = dbs.OpenRecordset ("Select * from MyTable")

Do Until rec.EOF

   ...processing...

   rec.MoveNext

Loop

 
 
Sequential processing is represented by the "processing" line in these examples. This is the code where the data will be manipulated.
 
Within that loop, branching will be used. Typically, this means records will be tested for some condition and decisions will be made about whether or not to update the record:
 

Set rec = dbs.OpenRecordset ("Select * from MyTable")

Do Until rec.EOF

   If rec("Field1") = strTestValue Then

       ...processing...

   End If

  rec.MoveNext

Loop

 
 
Though not a programming construct like sequence, branch, and loop, there will also be a block of code in the loop to update the table. So, a typical recordset-based processing loop contains, as a minimum, these nine lines of code:
 

Set rec = dbs.OpenRecordset ("Select * from MyTable")

Do Until rec.EOF

   If rec("Field1")= strTestValue Then

       ...processing...

       rec.Edit

       rec("Field1") = strValue

       rec.Update

   End If

   rec.MoveNext

Loop

 
 
All nine of those "standard" lines can be replaced with a single SQL statement that uses a subquery. There's only one restriction on using SQL subqueries to control the processing of your data: Subqueries, like any other query, process data in tables. In order to convert to subquery processing, you must ensure that all of your relevant data is in tables in your database. This is one of the key factors in using SQL effectively, and one of the foundations of the relational database model. In fact, in the relational model, even the data about the database (the metadata that describes the table structures) is stored in tables. In the relational theory, you should be able to manipulate the database by changing the metadata using the same commands that you use to manipulate your application data. To fully exploit the relational model, all the data that your program needs should be in your database. Even if your database hasn't gone this far, the addition of a few lookup tables should give you all the data you need to replace tests against variables in your programs with powerful subqueries.
 
"Optimizing" code

 
I'll begin with branching. Branching, or testing, is used to decide which records in a recordset are to be processed. What this really means is that the SQL statement that built the recordset retrieved too many records and code is being used to whittle the recordset down.
 
A typical example of branching code checks a value in a field to see if it matches some other. For instance:
 

   If rec("Department") = "Finance" Then

       ...processing...

   End If

 
 
Obviously, this kind of test can be handled with a Where clause in the original SQL statement that would limit the records in the recordset.
 
Sometimes, though, you find the program's code issuing another SQL statement to test for data found in another table. Here's a case where the programmer only wants to process Employee records where the Employee makes more than the average salary:
 

recMain = dbs.OpenRecordset("Select * from Employees")

Do Until recMain.EOF

   recTest = dbs.OpenRecordset("Select avg(salary) " & _

        "from Employees")

   If recMain("Salary") > recTest(0) Then

       ...processing...

   End If

   recMain.MoveNext

Loop

 
 
This obviously is inefficient code, and you're probably already figuring out how to optimize it.
 
A clever programmer will say that the calculation of the average salary should be done outside the loop and stored in a variable:
 

recMain = dbs.OpenRecordset("Select * from Employees")

recTest = dbs.OpenRecordset("Select avg(salary) " & _

        "from Employees")

curSalary = recTest(0)

Do Until recMain.EOF

   If recMain("Salary") > curSalary Then

 
 
Even with this enhancement, the recMain recordset is still retrieving records that won't be processed. So, a more clever programmer will use the results of the average salary statement to modify the statement that builds the recMain recordset. The even more optimized query looks like this:
 

recTest = dbs.OpenRecordset("Select avg(salary) " & _

        "from Employees")

recMain = dbs.OpenRecordset("Select * " & _

 "from Employees Where Salary > " & recTest(0))

 
 
There's no doubt that this new version of the program is better. The records retrieved now match the records to be processed. Further enhancements are really just icing on the cake. The next programmer looking at this code, for instance, will probably say that you should put both SQL statements in Access queries. By putting the statements in queries, the SQL statements will be compiled, parsed, and optimized at design-time, saving you processing seconds (or milliseconds, at the very least) at runtime. The recMain query will have to be parameterized to accept the salary, so the code does get a little longer:
 

recTest = dbs.QueryDefs("queAvgSalary").OpenRecordset

Set que = dbs.QueryDefs("queEmployees")

que.Parameters(0) = recTest(0)

recMain = que.OpenRecordset

 
 
Our basic nine-line recordset loop is now up to 11 lines without the application-specific processing.
 
Subquery branching

 
As I said, there's no doubt that these optimizations will result in a faster program. But subqueries offer a much simpler way to do the same thing, and they eliminate the test in the loop altogether. As soon as you find yourself coding a SQL statement inside a recordset loop in order to test a record in the main recordset, I want you to stop. Just don't do it! Instead, change the original query by adding a subquery to it.
 
This code is shorter and, in most situations and depending on the number of records involved, significantly faster than any of the optimizations shown previously:
 

recMain = dbs.OpenRecordset("Select * " & _

  " from Employees " & _

  " Where Salary > (Select Avg(Salary) from Employees)")

 
 
This one line of code eliminates the second recordset, the curSalary variable, and the parameterized query. You can even get the benefits of design-time compilation by putting this SQL statement in a query. I can sum this up in Wayne's Rule #1.
 
Rule #1: If you issue a query while processing a recordset, go back and add it to the original query that created the recordset.
 
Multiple values

 
This code will work as long as your subquery returns only one value. In my example, for instance, there will only be one average salary, so the subquery will return only one value. In real life, though, your programs will frequently need to test against a number of values.
 
In this code, for example, the programmer wants to process employees who make more money than any of the managers:
 

recMain = dbs.OpenRecordset("Select * from Employees")

recTest = dbs.OpenRecordSet("Select * " & _

 "from Employees Where Status = ""Manager""")

Do Until recMain.EOF

   recTest.MoveFirst

   Do Until recTest.MoveNext

       If recTest("Salary") > _

            recMain("Salary") Then

             Exit Loop

       End If

   Loop

   If recTest.EOF Then

       ...processing...

 
 
In this code, the programmer builds a second recordset and loops through it looking for a specific condition (a manager's salary greater than the employee's salary). As soon as one is found, the loop is terminated and the processing of the record skipped.
 
You'll also see this kind of test coded this way:
 

recMain = dbs.OpenRecordset("Select * from Employees")

Do Until recMain.EOF

   recTest = dbs.OpenRecordSet("Select * " & _

     "from Employees Where Status = 'Manager' " & _

     "and Salary > '" & recMain("Salary") & "'")

   If recTest.RecordCount = 0 Then

       ...processing...

 
 
In this version, the programmer has issued a query to see whether a manager exists who has a salary lower than the employee being processed. If some records are returned, the Employee record is processed. Again, this use of SQL will result in a faster program than the earlier, code-based testing.
 
A subquery can handle this kind of branching even faster by using the ALL keyword to link a subquery to the original query:
 

recMain = dbs.OpenRecordset("Select * " & _

" from Employees " & _

" Where Salary > ALL (Select Salary " & _

 "from Employees Where Status = ""Manager"")")

 
 
This leads to Wayne's rule #2.
 
Rule #2: If you build a loop inside of your recordset loop or issue a second query and check its RecordCount, use a subquery instead.
 
Words you don't need

 
In addition to ALL, there are three other keywords (ANY, IN, and EXISTS) that can be used to link a subquery to the main query. I must admit that I don't think I've ever used them. Here's a typical query/subquery combination that uses IN:
 

Select *

from Employees

Where Salary IN (Select Salary

from Employees Where Status = "Manager")

 
 
This query retrieves exactly the same records as the following query, which simply uses an inner join:
 

Select a.*

from Employees As A Inner Join Employees As B

On A.Salary = B.Salary

Where B.Status = "Manager"

 
 
Another way to link queries and subqueries is to use NOT EXISTS. Again, I find I don't use this either. NOT EXISTS is used when the query isn't supposed to find a match in the subquery. This query uses NOT EXISTS to find all of the employees who have salaries that don't match any manager's salary:
 

Select *

from Employees

Where Salary NOT EXISTS (Select Salary

from Employees Where Status = "Manager")

 
 
You could just as easily use an outer join to retrieve those records:
 

Select a.*

from Employees As A Left Join Employees As B

On A.Salary = B.Salary

Where B.Status = "Manager"

 And B.Salary Is Null

 
 
I'm sure that there are situations where IN, ANY, EXIST, and NOT EXISTS are useful and I simply haven't run across them yet. I'm also sure that I could contrive some obscure problem as an example of where you would need these keywords. Instead, I'll suggest you keep these words in mind if you find yourself coding a recordset loop and are having trouble building the subquery that will drive the loop.
 
Relations

 
There's one last situation that needs to be considered when converting a test to a subquery. In the code below, the programmer is retrieving the average salary for the employee's department (rather than the average salary for all employees):
 

recMain = dbs.OpenRecordset("Select * from Employees")

Do Until recMain.EOF

   recTest =

    dbs.OpenRecordset("Select avg(salary) " & _

    "from Employees " & _

    " Where Department = '" & _

    recMain("Department") & "'")

   If recMain("Salary") > recTest(0) Then

       ...processing...

   End If

   recMain.MoveNext

Loop

 
 
In this situation, you might feel that you can't use a subquery to generate the original recordset. After all, you have to retrieve the average salary for the department for each employee. Surely, you might say, you have to loop through each record to retrieve the average salary for each department. Well, you don't have to -- and you can use a subquery.
 
Before looking at the subquery solution, there's at least one way to make this code more efficient without using a subquery. The same clever programmer who, in my first example, suggested retrieving the average salary before entering the loop might suggest a similar solution for this problem. In this case, you could retrieve all of the departmental average salaries and store them in an array before entering the loop. Inside the recordset loop, you'd get the departmental average salary from the array, saving you from reading the data from the disk. Provided that you have more than one employee per department (and depending on how many employees and departments you have), you'd probably cut your runtimes significantly. The only cost is slightly more complicated code.
 
The subquery solution eliminates the more complicated code. By using a co-related subquery, you can have the subquery run for each record in the main query in order to retrieve records related to the record in the main query. To move the average department salary test into a subquery, you'd write the query like this:
 

Select A.*

From Employees As A

Where A.Salary >

(Select Avg(B.Salary)

From Employees As B

Where B.Department = A.Department)

 
 
In this query, I've given the Employees table in the main query an alias of A. Now, anywhere in the query where I'd use the table name Employees, I use A. In the subquery, I gave the Employees table an alias of B. My subquery's Where clause requires the Department field from table A to match the Department field from table B. This causes the subquery to retrieve information related to the record in the main query.
 
There's no guarantee, however, that this subquery will be more efficient than the solution suggested by the clever programmer. If the subquery is, indeed, run for every Employee record, I suspect that the subquery version of the program will be slower than the clever programmer's version. It all depends on how clever the SQL optimizer is when analyzing the code. Because both versions of the code shouldn't take more than a few minutes to write, you'd be well-advised to run some sample tests and see what the result is. I did some quick tests against Access 97 that suggest that the clever programmer's solution is faster in that environment (but only marginally).
 
Because this is a "Working SQL" column, though, I should point out that there's another SQL solution to this problem that eliminates branching. The first step in this solution is to build a table (called "Dept") to hold a list of departments and their average salaries, with Department as the primary key. At runtime, the program will delete all the records out of this table and then load it with the current data using these two SQL queries:
 

Delete *

From Dept;

 

Insert Into Dept

Select Department, Avg(Salary)

From Employees

Group by Department;

 
 
The SQL query to generate the main recordset uses a join on this new table to retrieve the relevant records:
 

Select *

From Employees Inner Join Dept

On Employees.Department = Dept.Department

Where Employees.Salary > Dept.Salary

 
 
My simple testing suggests that this solution will give you the fastest run times (though, again, only marginally).
 
Looping

 
Hopefully, I've convinced you that any time you need to do a test in a recordset loop, you could just as easily use a subquery. But what about looping? In all of my examples, I'm still building a recordset and looping through it. How can a subquery replace that loop?
 
Up front, let me be clear: SQL doesn't support looping. Instead, SQL processes all the records in a set. So, rather than loop through a recordset processing each record, you use a subquery to specify which set of records should be updated. SQL will then retrieve those records and update them -- no looping required.
 
In my previous examples, when I found the records I wanted, I did some processing with them. I could just have easily used a SQL Update or Delete to replace the code in those examples. In one of those examples, the code identified a set of employees who had salaries larger than any manager's salary. The SQL statement that retrieved the records looked like this:
 

Select *

from Employees

Where Salary > ALL (Select Salary

from Employees Where Status = "Manager")

 
 
Presumably, the company wanted to find those employees in order to fix this imbalance in the salaries. If the solution to the company's problem is to give the employees a 10 percent cut in pay, the "processing" in that recordset loop would have looked something like this:
 

rec.Edit

rec("Salary") = rec("Salary") ­ (rec("Salary") * .1)

rec.Update

 
 
Instead of those three lines of code, the Select statement that retrieved the records for processing could be replaced with an Update statement. The subquery that selects the records remains unchanged:
 

Update Employees

Set Salary = Salary ­ (Salary * .1)

Where Salary > ALL (Select Salary

from Employees Where Status = "Manager")

 
 
This query finds the same records as the Select query, but does the update without any looping at all. In most situations, this query will run faster than the version that looped through the recordset and -- to repeat myself -- the more records there are to process, the larger the difference in runtime will be.
 
If the company wanted to delete all of the employees who made more money than all of the managers, the SQL statement would be:
 

Delete *

From Employees

Where Salary > ALL (Select Salary

from Employees Where Status = "Manager")

 
 
If the company just wanted to add all of those employees to another table in order to process them later, the change is even simpler. All you must do is add an Into clause after the Select clause in the original SQL statement:
 

Select *

Into BadEmployees

From Employees

Where Salary > ALL (Select Salary

from Employees Where Status = "Manager")

 
 
So, this brings me to Wayne's third and final rule.
 
Rule #3: After upgrading your original query with a subquery, replace your Select with an Update, Delete, or Insert clause and eliminate your loop altogether.
 
Complicated processing

 
You might say that you need to build your recordset-based loop because you have some really complicated code that's used to generate the new values for your record. If that's the case, the solution is to put that code into a function and use the function to set the new value for the field.
 
For instance, I could move my salary calculation into a function called NewSalary in an Access module:
 

Function NewSalary(curOldSalary as Currency)

   NewSalary = curOldSalary ­ (curOldSalary * .1)

End Function

 
 
Now I can rewrite my Update SQL statement to use this function:
 

Update Employees

Set Salary = NewSalary(Salary)

Where Salary > ALL (Select Salary

from Employees Where Status = "Manager")

 
 
This combination of SQL and Basic won't run as fast as a pure SQL command, but it will still be faster than any recordset loop. Should Access ever allow a subquery to be used in the Set clause, you'll find that you can probably replace that Basic function with another SQL query.
 
More power

There's more to subqueries than I've touched on here. However, I hope that I've shown you that you can easily replace your Basic code with faster, more efficient subqueries. With the three rules from this article, you should be able to spot when you can use a subquery and eliminate excess code. The result: Your applications will fly!