Perspectives on LedgerSMB: On Long Queries/Stored Procedures

In general, in most languages, programmers try to keep their subroutines down to a couple screens in size. This helps, among other things, with readability and ease at debugging. Many SQL queries are significantly longer than that. I have queries which are over 100 lines long in LedgerSMB, and worked on debugging queries that are more than 200 lines long. In general, though, the maximum length of a query is not a problem if certain practices are followed. This post describes why I think SQL allows for longer procedures and what is required to maintain readibility.

Why Long Queries Aren't a Problem Per Se

In a typical programming language, there isn't an enforced structure to any subroutines. What this means is as a subroutine becomes longer, maintaining internal patterns and an understanding of state becomes more difficult. SQL queries themselves, however, are defined by their structure. For example, a standard select statement (excluding common table expressions) includes the following parts in order:

Returned column list
Starting relation
Join operations
Filters
Aggregation criteria
Aggregation filters
Postprocessing (Ordering, limits, and off-set)

The order of these elements are not interchangeable. I can't put my returned column list at the end, or my join operations at the front. Therefore in a 100 line query, chances are you know initially what is being returned first, and can work onward to see why the value is what it is. Generally also you will have an idea of likely causes before you get started. Does this look like a join projection problem? Like a misbehaving filter? Bad aggregation? Ok, we know where to look for logic. Ok, now we have a block, and an idea of what to look for. It's pretty quick to figure out which lines are likely at issue.

Because of the structure, it is pretty easy to dive into even a very long query and figure out exactly where the problems are quickly. Maintainability is not dependent on length or overall complexity.

Moreover if you have common table expressions, it is easy to jump back to the beginning, where these are defined, and reference these as needed.

A second difference is that SQL statements work with state which is typically assumed to be unchanging for purposes of the operation. With rare exceptions, the order of execution doesn't matter (order of operations however does). Consequently you don't have to read line-by-line to track state.

In essence debugging an SQL statement is very much like searching a b-tree, while debugging Perl, Python, or C is very much like traversing a singly linked list.

What Can Be a Problem

Once I spent several days helping a customer troubleshoot a long, complex query. The problem turned out to be a bad filter parameter, but we couldn't tell that right away. The reason was that the structure of the query had decayed a bit and this made maintenance difficult.

The query was not only long, but it was also difficult to understand, because it didn't conform to the above structure. The problem in that case was the use of many inline views to create a cross-tab-type report.

If you can't understand a long query, or if you don't know immediately where to look, it is hard to troubleshoot these.

Best Practices

The following recommendations are about keeping long queries well-structured. Certain features of the SQL language are generally to be avoided more as queries become longer, and other features should be used carefully.

Avoid inline views. Ok, sometimes I use inline views too on longer queries, but usually these are well-tested units in their own right, and re-used elsewhere. These should probably be moved into defined views, or CTE's if applicable.
Avoid union and union all in long queries. These complicate query maintenance in a number of ways. It's better to move these to small testable units, like a defined view based on a shorter query or a common table expression.
In long stored procedures, keep your queries front and center, and move as much logic as is reasonable into them.
Avoid implicit joins, which work by doing cross-joins and then placing the filter condition in the result. Join logic should be separate from filter logic.

What do you think? What practices do you find helpful?

Perspectives on LedgerSMB

Friday, July 20, 2012

On Long Queries/Stored Procedures

3 comments: