Tuesday, February 12, 2013

Building SOLID Databases: Interface Segregation, or Keep Stored Procedures Simple

As far as interfaces go, we are talking specifically about functional interfaces (UDF's, or "User Defined Functions", and stored procedures).  Raw interfaces like SQL queries have very different profiles of issues.  This is one principle that more or less takes care of itself in most databases due to existing concerns regarding normalization and the like.  Subtle violations are quite possible, however, but they can be avoided by careful design and the signs are relatively easy to spot.  Problems can occur in both purely relational and object-relational designs though.

Interface Segregation in Application Programming


In application programming, interface segregation is used to ensure that dissimilar interfaces are not too tightly coupled, posing maintenance problems.  The canonical example of a violation is the actual case of the Xerox print system job class which did everything from printing to collation and stapling of papers.  The problem is simply that the methods would become sufficiently entangled that small changes would require significant cascading interventions elsewhere in the system.

Interface segregation is a principle which suggests that the basic abstractions should be internally cohesive but externally less so, such that cascading changes are minimized.  It therefore goes hand in hand with the single responsibility principle but focuses on the interface level rather than the class level.

Interface Segregation in the Database


We'd like to think that interface segregation is a solution to a problem that only affects applications.  We really would.  However, many of the bad experiences people have had with stored procedures are in fact violations of this rule.

In general, my experience has been that stored procedure interfaces (or UDFs returning sets) work well when they are a single large query with some minor support logic, and become much harder to work with when they are many smaller queries chained together.  When a stored procedure or UDF tries to do to much, that's where you get problems and rather than counting screen lines, numbers of individual statements provide a better estimate of reducible complexity there.

When a stored procedure or UDF tries to do too much,  maintainability problems occur because functionality is intertwined in a way which exactly mirrors the problems found in the application layer.  This gives people bad experiences....

An Example of a Violation


In LedgerSMB 1.3 we introduced a number of stored procedures for payment management.  These each have some difficulties associated with them so what I am talking about here is not necessarily a condemnation of one interface over the other, but one specifically violates this principle by combining support for prepayments with support for paying existing invoices when the input requirements are different and these are handled dissimilarly in the underlying database.

This single payment interface, which can support either an overpayment or a payment of an invoice, is difficult to maintain because the input complexity is much larger than necessary.  In fact the function requires nearly 20 inputs vs 12 for each interface if these were split up, and about 20 statements as well, vs far fewer if these were split up (additional refactoring might be possible in that case too, to further reduce the number of statements through consolidation).  This is a piece of code I tend to dread fixing problems in.  It works well currently, but there is always a fear that I am breaking something that I am not aware of, and test cases only get you so far.  Fortunately we haven't had to change much there but all our payment interfaces are up for redesign when we get back around to them.

Danger Signs


 The most obvious danger sign that a stored procedure is trying to do too much is what I call "reducible complexity."  The ideal stored procedure in my view is a single very large query with some minor supporting logic around them.  Reducible complexity is bad, and should be reduced.  Singular large queries can be debugged much more easily than a series of shorter ones because one can take advantage of the highly structured nature of SQL.  Consolidating queries thus increases the complexity locally in the SQL statement but reduces it over the scope of the function.

For purposes of a blank rule, my thinking is that if you have more than about 5 queries in a stored procedure, it is worth taking a look, and if you have more than 10 or 20, something is definitely worth looking at.

A large number of small queries, or worse, a large number of large queries, is a sign that something is amiss.  The first step is to consolidate queries to the extent possible.    However if a large number of queries persist, then the immediate question is what can be done to break the logic into separate interfaces which may be independently used.

A second warning sign are stored procedures that require a very large number of inputs.  Large here is best seen in the context of distribution relative to other interfaces, and it is not an absolute guide.

Recommendations


I am a very large fan of using functional interfaces (stored procedures and user-defined functions) to grant access to the data, but such interfaces must be kept as simple as possible.  Simple in this case is defined by complexity that can be reduced by consolidation of logic at least as much as it is defined by complexity that can be reduced by breaking a function into several components.

No comments:

Post a Comment