Wednesday 8 September 2010

NOTE: Sort Your Sorted Data

In my previous post I spoke of using views to give you access in EG visual coding to tables with dynamic names. This works well, but there's one final observation to make if you want to avoid poor performance in some circumstances.

If you're used to using DATA step merges then you'll also be used to sorting your input data sets prior to the merge. If you're using EG visual coding then you'll have realised that you don't need to sort the data sets that you use as input to your joins. Why? Because SQL will automatically sort your input data sets for you.

Neat huh? Well, yes it is, except in cases where your input data is already properly sorted and re-sorting is a waste of resource and causes unnecessary delays in the execution of your job.

So, can we tell SQL that one or more input data sets are already sorted? Yes, we can, by using the SORTEDBY data set option.
All SAS data sets have a SORTEDBY attribute. SAS procedures such as SORT and MEANS automatically set it for their output data sets. PROC SORT and PROC SQL check this attribute before sorting any data. If it set appropriately then those PROCs will sip the sorting step.

So, if the data's in the right order, SQL won't wastefully re-sort my data, right? Wrong - in some cases it will. If you're data is sorted right, but the attribute isn't set, then SQL will re-sort your data.

Yikes! Under what circumstances could this happen? Well, one situation where it'll happen is if you access a sorted data set through a view, as I recommended in my previous post! If you take the recommendations of my previous post, you need to add the SORTEDBY data set option into the creation of the view if your table is reliably sorted:

proc sql;
  create view work.master_dot_sales_month
    as select * from master.sales_&month. (sortedby=region);
quit;


You also need to consider this approach when you're sorted data comes out of a database through SAS/ACCESS- in my experience, the SORTEDBY attribute won’t be automatically applied even though the database knows the table's sort order and SAS/ACCESS could apply a SORTEDBY data set attribute. I've certainly seen this is the case using SAS/ACCESS for ODBC to connect to Netezza.

You can read more on the subject in the SAS 9.2 Language Reference manual in the section entitled The Sort Indicator. See also Paul Kent's paper TS-553 SQL Joins -- The Long and The Short of It.