Distinctive SQL

<< Click to Display Table of Contents >>

Navigation:  Smart Access 1996-2006 > Jul-1997 >

Distinctive SQL

Peter Vogel

Nobody’s version of SQL adheres completely to the ANSI standard, and Access’ version is no exception. In this article, Peter looks at the most obvious addition. Once he gets started, though, he ends up walking through some index minefields.

ONE theme of this column has been that SQL is a universal language. I believe that an understanding of SQL will allow you to work anywhere. I also believe that understanding SQL will allow you to write better or, at any rate, faster running programs. In this article, I’ll start with a SQL keyword that isn’t universal: DistinctRow. That will lead us to the universal, and very important keyword, Distinct. On the way, I’ll show you the impact of indexes on these types of queries. From there, I’ll side step to what might seem to be an unrelated topic, Totals queries. While I’ll discuss some important issues about using Totals queries (especially if you’re using criteria), I really want to compare the way these queries use indexes as compared to queries using Distinct.

     Once again, let me point out that I’m using a very rough and ready testing method. I can’t guarantee that you’ll get the same results that I do. You should see the same kind of relationships, though: if one query runs twice as fast as another for me, it should do the same for you. The other thing I can’t guarantee is that my test cases correspond to the data that you face in your projects. Hopefully, though, these sample datasets have some relationship to the real-world data that you’re working with.

     Prior to Access 97, the underlying SQL statement that Access created for you whenever you created a query included the DistinctRow keyword. If you dropped the tblEmployees table into a query, for instance, and displayed the FName and LName fields, the generated SQL statement looked like this:

Select DistinctRow FName, LName From tblEmployees

Outside of Access, that DistinctRow keyword is unknown to SQL. Access added it to any query that had its UniqueRecords property set to True, which was the default for queries in Access versions 1.0 through 95. With Access 97, the default value for the UniqueRecords property is now False, and thus any query you create in Access 97 won’t have the DistinctRow keyword added. (The reverse is also true, by the way: if you add the DistinctRow keyword to the SQL statement, the UniqueRecords property will be set to True.)

     What, you may ask, does this keyword do? The answer is: not much, but what it does can be very important to you. Before I go into that, though, I’ll back up and look at the more common (and more frequently used) Distinct keyword.

Distinctly yours

Unlike DistinctRow, the Distinct keyword is part of every SQL implementation. Distinct eliminates duplicate rows from your query result. Access places it in your query when the UniqueValues property is set to True (and vice versa: the UniqueValues property is set to True if the Distinct keyword is present).

     Let’s say that you want to determine the number of cities that your company ships to by pulling the city name out of the City field of the sales order table. This simple query won’t give you the answer you want:

Select City From tblSalesOrder;

This is because the cities you’ve shipped to more than once will appear in the resulting recordset more than once. Adding the Distinct keyword will eliminate those duplicates, returning only unique rows:

Select Distinct City From tblSalesOrder;

There’s no limit to the number of fields that can be used with the Distinct keyword. If, for instance, you noticed that the city of Athens appeared only once, despite the fact that you’re shipping to Athens, Georgia and Athens, Tennessee, you could add the state to the query:

Select Distinct City, State From tblSalesOrder;

Again, no matter how many orders went to Athens, Georgia, it would still appear only once on the list.

     The performance of the Distinct keyword is affected very much by the presence of an index on the fields in the Select clause. To demonstrate this, I created an Employees table with 12,000 records, where 11,999 employees have the first name of ‘Peter’ and the remaining employee has a first name of ‘Tom.’ The query returns two records: ‘Peter’ and ‘Tom’:

Select Distinct FName From tblEmployees;

Without an index, the query takes two seconds to execute on my laptop. With an index, the query executes in less than one second. This makes sense because, with an index present, Access can simply return the contents of the index to answer this query.

Indexes matter

The fields involved in the index determine how quickly Access can return an answer. For instance, the employees in my Employees table have three different last names (‘Vogel,’ ‘Smith,’ and ‘Jones’), with the employees evenly distributed among those three names. Without an index on the last name, it takes more than two seconds for this query to run:

Select Distinct FName, LName From tblEmployees;

Adding or removing indexes on the first or last name has no effect on the query’s run time. Even with both indexes present, it takes more than two seconds for the query to run. Only when an index that combines the last name and the first name is created does the query’s run time drop back to less than one second. The SQL to create that index looks like this:

Create Index TestDistinct On tblEmployees (FName, LName); 

I should tell you that I tend to get paranoid when doing these time trials. For instance, I was concerned that the queries that used the index might be producing spurious results. The results of a Distinct query are always returned in order, sorted by the fields in the Select statement. I knew that, without an index, Access must read every record in the table to find the unique values and only then sort them into ascending order. As a result, I knew that Access wouldn’t return control to my test program until it had processed the whole table. However, with the index in place, I was afraid that Access might just be reading the first record in the index and then immediately returning control to my program. Access would be able to do this because it would know that the index already had the records in order. If the indexed versions of the query weren’t returning as many records as the non-indexed version, then the results of the two trials couldn’t be compared. To make sure that the two trials did read the same number of records, I added a MoveLast to my test program to force Access to retrieve all the records. The final version of the test routine looked like this:

Sub TestDistinct(stzSQL as string)

    Dim dbs As Database

    Dim rec As Recordset

    Set dbs = CurrentDb

    Debug.Print Time

    Set rec = dbs.OpenRecordset(stzSQL)     
    rec.MoveLast
    rec.Close
    Debug.Print Time

End Sub

Now I could run my test queries by calling the TestDistinct routine, passing it the name of the query. Fortunately, when I re-ran my tests using the TestDistinct routine, the results didn’t change.

     It appears that Access’ query optimizer isn’t smart enough to recognize that the TestDistinct index that I created earlier could be used even if the order of the fields is swapped. This query also takes two seconds to run:

Select Distinct LName, FName From tblEmployees;

Here’s another example of tester’s paranoia. It occurred to me that, with the Last Name and First Name fields swapped, Access might actually be using the TestDistinct index to retrieve the records, but then taking a second or so to sort the results. To test this, I set the last name of all employees to Vogel and ran the query again. It still took two seconds. Because the recordset contains only two records ("Vogel Peter" and "Vogel Tom") it seems unlikely that much time is being spent on sorting the recordset.

     The moral of the story so far: If you’re going to use the Distinct clause, make sure you have an index built on all the fields in your Select statement, and in the same order, for fastest performance.

DistinctRow

I started by promising to discuss the DistinctRow keyword. As its name suggests, it performs a function similar to the Distinct keyword. However, the DistinctRow keyword eliminates duplicate rows from your resultset only when undisplayed values from another table would cause duplicate rows to be included—to which your reaction is probably "Huh?".

     I can demonstrate the effects of the DistinctRow keyword by joining the employee’s table to a table of job positions. The job position table lists the job for each employee. However, in this company one person can hold many jobs, so an employee may appear many times in the job position table. I then boiled my employee list down to three names: ‘Peter Vogel,’ ‘Peter Smith,’ and ‘Peter Jones.’ Peter Vogel has two jobs in the job position table, while Jones and Smith have one each. The following query displays the first name of all the employees:

Select DistinctRow tblEmployees.Fname

From tblEmployees Join tblJobPostion

On tblEmployees.FName = tblJobPosition.FName

And tblEmployees.LName = tblJobPosition.Lname;

The question of the day is "How many records are returned, considering that everybody’s first name is Peter?" The answer is three: one for each of the three employees (see Figure 1). Notice that using the Distinct keyword returns only one row, containing ‘Peter’ (see Figure 2). And, not surprisingly, removing the DistinctRow keyword causes four records to display, two for Peter Vogel (because he has two jobs) and one for each of the other two employees (see Figure 3).

     I then added the FName field from tblJobPosition to the Select clause and re-ran the query with the DistinctRow keyword present. Four rows are returned: one for each of Peter Vogel’s two jobs, and one for each of the other two employees (see Figure 4). This happens even though all of the rows are identical. Changing DistinctRow to Distinct boils the table down to one row again (see Figure 5).

     As long as you display fields from every table in a query, DistinctRow makes no difference to the query. However, if fields from only some of the tables in the query are displayed, DistinctRow eliminates duplicate rows generated by the join. So why would you want this peculiar behavior? The answer is obvious if you compare Figure 1 and Figure 3. As the presence of the blank row in Figure 1 indicates, the version of the query that used DistinctRow is updateable while the version that doesn’t use it isn’t updateable.

     This makes sense if you think it through. As Figure 3 shows, when DistinctRow is missing, you get four rows: two for Peter Vogel (because he has two jobs) and one for each of the other employees. If you were to change Peter Vogel’s first name to ‘Fred’ on one line, what would happen to the other line that is displaying Peter’s name? How could Access reconcile the two lines that reference the same underlying ‘Peter Vogel’ record? The answer is that Access can’t reconcile them and so it makes the query non-updateable.

     The DistinctRow version of the query in Figure 1, however, displays only one record for each employee in the Employee table. Each record is displayed once, so there’s no reconciliation problem and Access will let you change the values of the records in the recordset (and even let you add a new record).

Totaling up

What does all this have to do with Totals queries? Interestingly enough, my tests with Totals queries showed that indexes didn’t affect their performance. Take this query, which essentially returns the unique values of the First Name and Last Name fields:

SELECT tblSource.FName, tblSource.LName

FROM tblSource

GROUP BY tblSource.FName, tblSource.LName;

This query returns a resultset that is identical to the query that produced the Distinct Last and First Names. Without an index, the Distinct query takes about 50 percent longer than the Totals query. Adding various combinations of indexes on the two fields, as before, produced no effect on either query’s run time. When I finally added the index that combined the two fields into one, the Distinct query once again dropped to less than one second. However, the performance Totals query was unaffected by the index. Once again, I got a result that was so surprising that I ended up spending more time on this test than any other. But despite further tweaking and testing (and quadrupling my test data), I was never able to alter the performance of the Totals query by adding or deleting indexes.

     So, the moral of this story is to use the Distinct keyword if you’re boiling down a table to its unique values—but make sure that you have the right indexes in place. If you don’t have exactly the index that your Distinct query needs, you should consider using a Totals query instead.

A summary tip

As long as I’ve been discussing Totals queries, I can’t leave without mentioning a common error. More than once, I’ve seen Totals queries set up like the one in Figure 6. The user who set up this query was under the impression that the query was totaling only the records for the Finance department. This was close to what was happening, but not quite right. In fact, the query totals the records for all the departments and then discards the records for every department but the Finance department.

     Figure 7 shows what you should do to retrieve and total only the Finance department records. The final column in the query uses a "Where" option in the Total row to specify the criteria to use when selecting records to total. There’s a big difference in performance: with more than 50,000 records, the first query runs in 19 seconds. The second query runs in 15 seconds, a 20 percent improvement.

     The SQL for the two queries is significantly different, too. The first query generates a Having clause, which acts like a Where clause on totaled fields:

SELECT tblSource.Fname, tblSource.LName, tblTarget.Department, Sum(tblTarget.Budget) 

AS SumOfBudget

FROM tblSource INNER JOIN tblTarget 

ON (tblSource.LName = tblTarget.LName) 

AND (tblSource.Fname = tblTarget.Fname) GROUP BY tblSource.Fname, tblSource.LName, tblTarget.Department

HAVING (((tblTarget.Department)="Finance"));

The second query just generates a Where clause to select the records to total:

SELECT tblSource.Fname, tblSource.LName, tblTarget.Department, Sum(tblTarget.Budget) 

AS SumOfBudget

FROM tblSource INNER JOIN tblTarget 

ON (tblSource.Fname = tblTarget.Fname) 

AND (tblSource.LName = tblTarget.LName)

WHERE (((tblTarget.Department)="Finance")) GROUP BY tblSource.Fname, tblSource.LName, tblTarget.Department;

Well, at the risk of another pun, that sums it up. This Working SQL column has covered a lot of territory: Distinct, DistinctRow, indexes, and Totals queries. Now you can practice your SQL with distinction.