Access Database Performance

<< Click to Display Table of Contents >>

Navigation:  Managing Quality, Performance, Bugs and Errors >

Access Database Performance

Peter Vogel          

With a combination of abstract rules and real-world stories, editor Peter Vogel provides you with the rules that you need to create Access applications that run as fast as possible. And he throws in some advice on advancing your career.

The slowest thing that you can do is read from or write to your hard disk. This is an ironic fact of life for Access developers (or any database developer): Your goal is to access and update your database, and that very activity is what slows down your application more than anything else. Life's like that. This is reality; adjust!

Since data access is what slows you down, you shouldn't worry about all those programming tricks that are supposed to speed up your application. Writing faster code is like speeding in the city: You just get to the next red light as fast as possible. What you want to do is reduce the time spent at the red light–the time spent waiting for data access. So that's what this article is all about: reducing the time spent at your red lights.

I never believe in anything that I haven't seen for myself. So for many of these rules, I've provided a real-world story from my life that demonstrates the rule in action. If you're a longtime reader of Smart Access, you'll recognize some (or all) of these stories. There's a reason for this: I'd like to think that having made these mistakes once, I've never made them again. At any rate, that's my story and I'm sticking to it. In about half of these stories, applying these techniques made me a hero (in the rest, not applying them made me the goat).

First: Normalize your operational data

If you don't know the first three laws of normalization, learn them. As you normalize your data, you'll find that you generate extra tables. Many developers back off from normalization in order, they think, to improve their performance. They're concerned that having to join together multiple tables will slow them down. Here's a news flash: The designers of your RDBMS have made your RDBMS very good at what it does. And one of the things that your RDBMS is very, very good at is joining tables.

However, instead of splitting data in some table into two separate tables, developers create a single table that holds related data. Their goal is to be able to retrieve a single row that holds all the data they need.

In almost every case, developers who fail to take their table design to at least third normal form will be hurting rather than helping application performance. At the very least, failing to normalize data results in data being repeated on multiple rows. Whatever speed benefits are gained by failing to normalize are lost when, instead of updating a single row, your code has to update multiple rows of repeating data.

If having to update many rows instead of one row doesn't inspire you to normalize your data, there's another consideration: Updating non-normalized data is hard work. With non-normalized data, you'll have to write more code to manage your updates. More code takes longer to run than less code and–as a bonus–makes you less productive. The more code that you write, the longer it will take you to write it, and the longer it will take you to debug it, test it, and maintain it. And, of course, more code equals more opportunities to introduce bugs into your code.

When you leave data in a non-normalized state, you may speed up one specific data retrieval activity. You also slow down a potentially infinite number of update activities. There may be database applications where this is a smart trade-off but I suspect that you don't have one of them.

However, third normal form is only required for operational data–data that you're not updating, deleting, or inserting. This distinction is important for your next step also. My war story on this topic is chronicled in my editorial in the December 2004 issue of Smart Access. I see no need to embarrass myself again by repeating it.

Second: Index your data

When relational databases were first proposed, many in the "data processing" community (which is what it was called back then) observed that there was no way that sequentially reading tables would provide the kind of performance that business applications needed. Of course, they were right. But Relational Database Management Systems just make it appear that your data is stored in tables. In reality, RDBMSs store your data in some highly optimized (and highly proprietary) data structure. Furthermore, the critics also didn't consider the impact on performance of applying indexes to the table–not really the critics' fault because indexes aren't part of the relational database theory.

Indexes are key to retrieving data quickly (there was a pun there). Without an index, the database engine is forced to sequentially search your tables to find matching records, exactly the criticism applied by the critics of the relational database. Indexes, on the other hand, let you find data by searching lean data structures (containing only indexed fields) stored in very highly structured data formats. In addition, RDBMSs give priority to keeping indexes in memory, so searching an index often means searching through an in-memory data structure (you can't get much faster than that).

You should apply indexes so that your data retrieval activities will run faster.

Third: Don't index your data

The reason that all your data isn't kept in the same structures that an index is kept in is that updating indexes tends to be expensive. In order to make searching an index very fast, the structures used for indexes usually means that they are slow to update. The more fields that you apply indexes to, the more likely it is that you'll be updating fields that will be indexed and, as a result, the more indexes that you'll have to update.

You should avoid indexes because your update activities will run slower.

So what are you to do about these conflicting priorities? The first step is to decide whether you have an operational database or a reporting database. An operational database supports your ongoing business activities and sees many updates, insertions, and deletions. A reporting database is used just for, well, reporting. Since reporting databases typically are updated in a single batch operation that often consists of nothing but inserts, a reporting database is, in many ways, a database that's never updated.

Once you know what kind of database you have, you can apply these rules:

• Don't index at random. The relational database theory, which provides such excellent guidance in designing your tables, provides you with no help in deciding on what indexes to apply. You have to look at what queries will be used with your databases and apply indexes to support the realities of your database. Those realities will change over time, so you need to review the queries being used against your database. The results of that survey should not only be a list of indexes to add but a list of indexes that are no longer required.

• In an operational database, keep the number of indexes to a minimum. In a reporting database, add as many indexes as you could possibly want. For an operational database, the indexes automatically created when you define foreign and primary keys will handle most of your indexing needs. When it comes time to insert records into your database, remember that many RDBMSs support a "bulk mode" where indexes are updated only after all inserts have been made. Not only does this run much faster than indexing on each insert, but the resulting indexes are usually more efficient. In a Jet database, you can get the same result by dropping all indexes on your reporting tables, inserting your data, and then re-creating the indexes.

• Don't index small tables. In the same way that most RDBMSs try to hold an index in memory, most RDBMSs will hold small tables in memory. A small table is one that has few rows and few fields. I wouldn't, for instance, index a table of countries (200 to 250 entries but a single field in each row). I certainly wouldn't index a table of Canadian provinces and their abbreviations (only 10 rows with two fields in each row).

• Don't index fields that lack discrimination. For most RDBMSs, an index won't be used unless it limits the search to less than 20 percent of the rows in the table. For most situations, an index on a Gender column won't be used because it only divides the table into two groups. However, if your table has very few entries for "Male" in the Gender column and your database frequently sees queries for all the Males, an index on Gender would make sense. Similarly, indexes on fields that contain many Null values aren't efficiently handled by most RDBMSs.

• Don't, in an operational database, index to support infrequently used queries to support scenarios where the business can live with slow response times. I wouldn't add an index to support a batch process run at night as part of month-end processing.

• Build indexes on multiple fields only if those fields are frequently used together in the queries applied to your database. The database engines that readers of Smart Access are most likely to use (Jet and SQL Server) will automatically cross reference indexes to resolve queries. As a result, the benefits of combining fields in a single index are often not dramatically greater than indexing the two fields separately.

• Structure multi-field indexes so that the first field in the index is the field that's used without the other field. Think of a telephone book, which indexes on a Last Name, First Name basis. Because of that structure you can't look someone up by their first name. Fortunately, most telephone book searches are based on the last name so the field ordering is optimal for the phone book's users. Ensure that the first field in your database supports the most common single-field queries.

The war story

I worked on a payroll application whose average response time rose to six seconds on Fridays. Throughout the week, users entered their hours in order to be paid. On Fridays, staff were using our application to review their time for the week before they went home for the weekend (just in case they'd left out some hours). We had indexes to support managers reviewing time for the employees in their work groups on a particular day, but we had no indexes that would support the search that employees used: an employee number, a start date, and an end date. When I added that index, the average response time for the application dropped from over six seconds to under three seconds. And I was a hero, by the way.

I'm not suggesting this as a career strategy: Write something wrong and then write it correctly. But it's worked for me over the years. You just need employers/clients with short memories so they don't remember who wrote the original version of the program you just fixed. We don't charge any extra for this career advice.

Fourth: Retrieve what you need

The mantra here is simple: "Less is faster." The "less" in this mantra refers to the number of rows being retrieved. While reducing the number of fields being retrieved is never a bad idea, the real speed gains come from reducing the number of rows that you're going to retrieve.

The only exception to reducing the number of fields is when a row contains Memo or other types of Binary Large Object (BLOB) data types–don't retrieve those fields unless you know that you're actually going to use them. DAO supports a row retrieval type that lets you specify that BLOB-type fields aren't retrieved with the row but are only retrieved when the BLOB field is accessed. Use it if you have tables with Memo/BLOB fields.

However, more typically, you should worry about reducing the number of rows being retrieved. Don't, for instance, set the RecordSource for your form to a table name, then gather information from your user, and then use that information in the form's Filter property to restrict the rows displayed. It makes no sense to retrieve all the rows in a table and then show only a few of them. Instead, get your users to tell you what data they want, build the Where clause for your SQL statement, and retrieve only the rows that you need.

Closely related to this rule is reducing the number of trips to the server that you make. Next to actually reading or writing your database, the slowest thing that you can do is leave your computer. Every call that you make to your RDBMS (Jet or SQL Server) slows you down. If you haven't investigated ADO's ability to issue multiple Select statements at one time and use the NextResult method to process the results, now's the time. You can't be guaranteed that the RDBMS will perform more efficiently when you submit multiple statements at the same time (see Rule 7 when you get to it), but you've made every effort to get the best performance from the RDBMS.

The war story

Another company that I worked for had a month-end batch application that ran for four hours. Because of scheduling conflicts, the application had to be left to run unattended. If this application died in the night and had to be run during the business day, it took seven hours to complete and doubled the response times for applications using the same database. When I examined the application, I found that it:

1. Retrieved all the data from one table.

2. Retrieved all the related data from a second table based on rows in the first table.

3. Retrieved all the related data from a third table based on rows in the second table.

4. Checked a field in the first table and used that to delete that row and all related rows in the second and third tables.

5. Checked a field in the second table and used that to delete that row, its parent, and all of the related rows in the third table.

6. Checked a field in the third table and used that to delete that row, its parent, and its grandparent.

I rewrote the application to:

1. Retrieve only the rows from the first table that met the condition specified in Step 4.

2. Retrieve only the related rows from the second table that met the criteria in Step 5.

3. Retrieve only the related rows from the third table that met the criteria in Step 6.

4. Delete any rows that didn't have a result from the third table.

The new version ran in less than 45 minutes. Again, I was a hero, and this time I hadn't written the first version of the program. Mind you, when I worked the same magic on another application two months later, it was an application that I had written the first version of. I didn't see any need to point that out to my employer and, fortunately, no one seemed to notice.

Fifth: Use the right data access objects

There are just two rules for using the right data access objects:

• If you're reading and writing a Jet database with tables of less than 10,000 records, use DAO.

• For any other RDBMS (SQL Server) or Jet databases with tables with tens of thousands of rows, use ADO.

This makes sense: DAO and Jet are a team and get the best out of each other.

The war story

My war story here is less embarrassing: I was preparing a presentation on ADO that I was to give at a conference. As part of demonstrating the obvious superiority of ADO over DAO, I created some time trial applications that would show how much faster ADO was than DAO with Access. Imagine my surprise when DAO consistently outperformed ADO. Fortunately, I spotted this before I gave the presentation and modified the slides.

Sixth: Update with SQL

Issuing a SQL statement to update, delete, or insert new rows is always faster than doing the same processing by updating a field, inserting rows, or deleting rows in a recordset.

There's a trade-off to be made here between the application's speed and your productivity. For instance, I'll admit that I use ADO almost exclusively, even when DAO would be faster. The reason is that, as a consultant who isn't always working with Jet, I had to become an expert in something. I decided to standardize on ADO (and, now, ADO.NET). I realize that I'm sub-optimizing my client's applications but, on the other hand, I'm reducing their costs by reducing my billable hours.

In addition, Access' ability to bind to a data source means that you can quickly build an application that reads and writes data by letting Access update your bound data from the form. You could have your Access application run faster (at least during updates) by creating an unbound form and doing your updates through SQL statements. There are costs associated with this solution: It will take you longer to build the application, delaying the benefits of delivering the application; you'll have more code to maintain; and, of course, there will be more opportunities to insert bugs.

In addition, you need to consider the nature of your application. For single row updates, how much time is there to save? Replacing single row updates with SQL statements won't give you much of a performance boost. The real opportunity in avoiding recordset updates is when you have multiple rows to update. Rather than loop through a recordset, you're better off issuing SQL statements to accomplish the same task.

The war story

My client had an application that used only recordset processing. It took seven minutes to execute a single transaction and involved something like 100 lines of VBA code. After normalizing the tables, I replaced the routine with three queries (one of these queries called the other two queries) and three lines of code that called the query. The whole thing ran in less than seven seconds. I was a hero until someone pointed out that the original table design (which I had just thrown out) had been done by the president and CEO of the company. This was when I became a consultant.

Seventh: Don't believe everything you read

I'll begin this rule with my war story. Again, this one occurred many years ago at a conference where I was making a presentation on ADO (the same presentation with the DAO comparison, as I remember). I was sitting in the speaker room reviewing my presentation. Sitting beside me was Bill Vaughn, who knows more about SQL Server than anyone I know of.

I was reviewing the slide that described how ADO's Command object lets you prepare parameterized SQL statements in advance and then call them repeatedly in a single application, just changing the parameters with each call. Among other things, the slide talked about the performance benefits of using this feature with SQL Server. Mr. Vaughn (I can't bring myself to call him "Bill") leaned over and said something like, "That doesn't work in SQL Server, you know." I said "Really?" sounding a lot like Kermit the Frog. "Nope," he continued, "it doesn't make any difference. Won't until the next version of SQL Server." A couple of time trials showed that he was right. Since the slides were already in the audience's hands, I had to eat a certain amount of crow during the presentation.

What you need to take away from this story is that you can't believe it's true just because someone says it is. And all of these rules will be affected by the realities of the hardware and software that your application is running on. One of the wonderful things about the tools that we use is how easy it is to add or remove an index, switch from DAO to ADO, or do whatever you need to perform a test. Run some tests, see what difference it makes (if any), and do what works best for you. When you run your tests, make sure that the data in your test sample reflects the real world of your application–a database with five customers and 12 salesorders will behave very differently than a database with 1,000 customers and 10,000 salesorders. To quote that great database designer, Ronald Reagan, when speaking in Russian, "Doveryai, no proveryai–trust, but verify."

That's the best advice in this article (next to the career advice, of course).