I've spent this week describing the functionality and purpose of the MCOVERAGE system option introduced in SAS V9.3. Coverage testing is an important element of any testing strategy - you need to know how much of your code your tests have actually exercised.
As its name suggests, MCOVERAGE only logs macro coverage. It's a great shame that there isn't an equivalent for DATA steps. Perhaps it will be added in due course, to DATA steps or DS2, or both.
With some planning, and judicious use of some post-processing capability to make sense of the log(s), MCOVERAGE can be an important tool in your testing arsenal.
I note that HMS Analytical Software GmbH's testing tool (SASunit) includes coverage testing through the use of MCOVERAGE. I've not used SASunit myself, and I can't speak for how complete, reliable and supported it may be, but if you're interested in learning more I suggest you read the SASUnit: General overview and recent developments paper from the 2013 PhUSE conference and take a look at SASunit's SourceForge pages.
What is your experience with using coverage testing and/or MCOVERAGE? Post a comment, I'd love to hear from you.
MCOVERAGE:
NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE), 6-Jan-2014
NOTE: Macro Coverage in Testing (MCOVERAGE), 7-Jan-2014
NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros, 8-Jan-2014
NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing, 9-Jan-2014
NOTE: Wrap-Up on Test Coverage and MCOVERAGE, 10-Jan-2014 (this article!)
SAS® and software development best practice. Hints, tips, & experience of interest to a wide range of SAS practitioners. Published by Andrew Ratcliffe's RTSL.eu, guiding clients to knowledge since 1993
Thursday, 9 January 2014
NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing
Over the last few days I've been revealing the features and benefits of the MCOVERAGE system option introduced in SAS V9.3. This system option creates a log file to show which lines of our macro(s) were executed, e.g. during our tests.
Knowing that we tested all lines of code, or knowing that we tested 66% of all lines of code is important when judging whether we have tested sufficient amounts of our code to give sufficient confidence to put the new/updated system into production. This information relates back to our testing strategy (where we specified targets for the proportion of code lines tested). It also helps us spot dead lines of code, i.e. lines of code that will not ever be executed (perhaps due to redundant logic).
Yesterday I showed code to read an mcoverage log file and create a table to show which macro lines had been executed and which had not. My code was basic and only worked for one execution of the tested macro. Quite often we need to run our code more than once to test all branches through our logic, so today I'll discuss upgrading my mcoverage processing code so that it handles multiple executions of the tested macro.
We might start by running our tested macro twice, with two different parameter values...
filename MClog "~/mcoverage2.log";
options mcoverage mcoverageloc=MClog;
%fred(param=2);
%fred(param=1); /* Take a different path through the code */
filename MClog clear;
* BUT, see my note about closing MClog at
the end of my earlier blog post;
The mcoverage log produced from these two consecutive executions looks like this:
1 1 18 FRED
2 1 1 FRED
3 17 17 FRED
2 1 1 FRED
2 2 2 FRED
2 3 3 FRED
2 4 4 FRED
2 4 4 FRED
2 4 4 FRED
2 5 5 FRED
2 6 6 FRED
2 7 7 FRED
2 8 8 FRED
2 8 8 FRED
2 9 9 FRED
2 13 13 FRED
2 18 18 FRED
1 1 18 FRED
2 1 1 FRED
3 17 17 FRED
2 1 1 FRED
2 2 2 FRED
2 3 3 FRED
2 4 4 FRED
2 4 4 FRED
2 4 4 FRED
2 5 5 FRED
2 6 6 FRED
2 7 7 FRED
2 8 8 FRED
2 8 8 FRED
2 9 9 FRED
2 13 13 FRED
2 14 14 FRED
2 15 15 FRED
2 16 16 FRED
2 16 16 FRED
2 18 18 FRED
You will recall that type 1 records mark the beginning of execution of a macro, type 3 records indicate non-compiled lines (such as blank lines), and type 2 records indicate executed lines of code.
Note how we now get two type 1 records. These each mark the start of a new execution of the %fred macro. Close inspection of the type 2 records shows different sets of line numbers for the first and second executions, reflecting different paths through the %fred macro code.
We're aiming to create an output that shows whether the lines of %fred macro code were executed in one or more tests, or not. So, given that non-executed rows of macro code don't create a record in the mcoverage log, we can process the mcoverage log quite simply by counting the number of type 2 records for each line of macro code. For simplicity, we'll count the type 3s too. The output that we get will look like this:
Recordnum | Record | Rectype | Executions | Analysis |
1 | %macro fred(param=2); | 2 | 4 | Used |
2 | * comment ; | 2 | 2 | Used |
3 | %put hello world: &param; | 2 | 2 | Used |
4 | %if 1 eq 1 %then %put TRUE; | 2 | 6 | Used |
5 | %if 1 eq 1 %then | 2 | 2 | Used |
6 | %do; | 2 | 2 | Used |
7 | %put SO TRUE; | 2 | 2 | Used |
8 | %end; | 2 | 4 | Used |
9 | %if 1 eq 0 %then | 2 | 2 | Used |
10 | %do; | . | . | NOT used! |
11 | %put FALSE; | . | . | NOT used! |
12 | %end; | . | . | NOT used! |
13 | %if &param eq 1 %then | 2 | 2 | Used |
14 | %do; | 2 | 1 | Used |
15 | %put FOUND ME; | 2 | 1 | Used |
16 | %end; | 2 | 2 | Used |
17 | | 3 | 2 | Not compiled |
18 | %mend fred; | 2 | 2 | Used |
So, we can see that executing the %fred macro with two different values for param has resulted in all but three lines of code being executed. We might choose to add additional tests in order to exercise the remaining lines, or a closer inspection might reveal that they are dead lines of code.
The code to create the above output is included at the end of this post. The sequence followed by the code is as follows:
- Read the mcoverage log file into a data set. Process the data set in order to i) remove type 1 records, and ii) count the number of rows for each line of macro code
- Read the macro source into a data set, adding a calculated column that contains a line numbering scheme that matches the scheme used by the mcoverage log. We are careful to preserve leading blanks in order to retain the code's indentation
- Join the two data sets and produce the final report. Use a monospace font for the code and be careful to preserve leading blanks for indentation
Here's the code:
/* This code will not cope reliably if the macro */
/* source does not have a line beginning with the */
/* %macro statement for the macro under inspection. */
/* This code expects a coverage log file from one */
/* macro. It cannot cope reliably with log files */
/* containing executions of more than one different */
/* macro. */
/* Multiple different macros might be required if */
/* testing a suite of macros. */
filename MClog "~/mcoverage2.log"; /* The coverage log file (MCOVERAGELOC=) */
filename MacSrc "~/fred.sas"; /* The macro source */
/* Go get the coverage file. Create macro */
/* var NUMLINES with number of lines */
/* specified in (first) type 1 record. */
data LogFile;
length macname $32;
keep Macname Start End Rectype;
infile MClog;
input Rectype start end macname $;
prevmac = compress(lag(macname));
if _n_ ne 1 and prevmac ne compress(macname) then
put "ERR" "OR: Can only process one macro";
if rectype eq 1 then
call symputx('NUMLINES',end);
if rectype ne 1 and start ne end then
put "ERR" "OR: Not capable of handling START <> END";
run;
%put NUMLINES=&numlines;
/* Count the number of log records for each line of code. */
proc summary data=LogFile nway;
where rectype ne 1;
class start rectype;
var start; /* Irrelevant choice because we only want N statistic */
output out=LogFile_summed n=Executions;
run;
/* Go get macro source and add a line number value that */
/* starts at the %macro statement (because this is how */
/* MCOVERAGE refers to lines). */
/* Restrict number of lines stored to the number we got */
/* from the coverage log file. */
/* Numlines does not include %mend, so we implicitly */
/* increment the number of lines by one and thereby */
/* retain the line containing %mend, purely for */
/* aesthetic reasons for the final report. */
data MacroSource;
length Record $132;
retain FoundStart 0
LastLine
Recordnum 0;
keep record recordnum;
infile MacSrc pad;
input record $char132.; /* Keep leading blanks */
if not FoundStart and upcase(left(record)) eq: '%MACRO' then
do;
FoundStart = 1;
LastLine = _n_ + &NumLines - 1;
end;
if FoundStart then
recordnum + 1;
if FoundStart and _n_ le LastLine then
OUTPUT;
run;
/* Bring it all together by marking each line of code */
/* with the record type from the coverage log. */
proc sql;
create table dilly as
select code.recordnum
,code.record
,log.rectype
,log.Executions
,case log.rectype
when 2 then "Used"
when 3 then "Not compiled"
when . then "NOT used!"
else "UNEXPECTED record type!!"
end
as Analysis
from MacroSource code left join LogFile_summed log
on code.recordnum eq log.start;
quit;
proc report data=dilly nowd;
define record /display style(column)={fontfamily="courier" asis=on};
run;
filename MacSrc clear;
filename MClog clear;
*** end ***;
MCOVERAGE:
NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE), 6-Jan-2014
NOTE: Macro Coverage in Testing (MCOVERAGE), 7-Jan-2014
NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros, 8-Jan-2014
NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing, 9-Jan-2014 (this article!)
NOTE: Wrap-Up on Test Coverage and MCOVERAGE, 10-Jan-2014
Wednesday, 8 January 2014
NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros
Over the last couple of days I've been uncovering the MCOVERAGE system option for coverage testing of macro code. Coverage testing shows which lines were executed by your tests (and which were not). Clearly, knowing the percentage of code lines that were executed by your test suite is an important measure of the thoroughness of your testing.
Yesterday we saw what the mcoverage log contained for a typical execution of a macro. What we would like to do is make the information more presentable. That's what we'll do today. We'll produce some code that will output the following summary (from which we can determine that 33% of our code lines weren't executed by our test).
recordnum | record | rectype | analysis |
1 | %macro fred(param=2); | 2 | Used |
2 | * comment ; | 2 | Used |
3 | %put hello world: &param; | 2 | Used |
4 | %if 1 eq 1 %then %put TRUE; | 2 | Used |
5 | %if 1 eq 1 %then | 2 | Used |
6 | %do; | 2 | Used |
7 | %put SO TRUE; | 2 | Used |
8 | %end; | 2 | Used |
9 | %if 1 eq 0 %then | 2 | Used |
10 | %do; | . | NOT used! |
11 | %put FALSE; | . | NOT used! |
12 | %end; | . | NOT used! |
13 | %if &param eq 1 %then | 2 | Used |
14 | %do; | . | NOT used! |
15 | %put FOUND ME; | . | NOT used! |
16 | %end; | . | NOT used! |
17 | | 3 | Not compiled |
18 | %mend fred; | 2 | Used |
To create this table, we need to read the mcoverage log and the macro source for %fred as follows:
- We need to process the mcoverage log by reading it into a data set and i) removing record type 1 (because it has no part to play in the above table), and ii) removing duplicated log rows for the same code line (which happens when a line of code is executed more than once).
- We need to process the macro source by reading it into a data set and adding a column to record the line number (matching the numbers used in the coverage log).
- Having read both files into separate data sets (and processed them as outlined above), we can join them and produce our report. The code to achieve this is shown at the end of this post.
The code that I've created expects a coverage log file from one execution of one macro. It cannot cope reliably with log files containing either multiple executions of the same macro or executions of more than one different macro. Is this a problem? Well, multiple executions of the same macro might be required if testing various permutations of inputs (parameters and data); and multiple different macros might be required if testing a suite of macros.
Tomorrow I'll augment the code so that it can deal with multiple executions of the same macro, e.g. testing %fred with param=2 and param=1.
Meanwhile, here's today's code...
/* This code will not cope reliably if the macro */
/* source does not have a line beginning with the */
/* %macro statement for the macro under inspection. */
/* This code expects a coverage log file from ONE */
/* execution of ONE macro. It cannot cope */
/* reliably with log files containing either */
/* multiple executions of the same macro or */
/* executions of more than one different macro. */
filename MClog "~/mcoverage1.log"; /* The coverage log file (MCOVERAGELOC=) */
filename MacSrc "~/fred.sas"; /* The macro source */
/* Go get the coverage file. Create macro */
/* var NUMLINES with number of lines */
/* specified in type 1 record. */
data LogFile;
length macname $32;
keep macname start end rectype;
infile MClog;
input rectype start end macname $;
prevmac = lag(macname);
if _n_ ne 1 and prevmac ne macname then
put "ERR" "OR: Can only process one macro";
if rectype eq 1 then
call symputx('NUMLINES',end);
if rectype ne 1 and start ne end then
put "ERR" "OR: Not capable of handling START <> END";
run;
%put NUMLINES=&numlines;
/* Remove duplicates by sorting START with NODUPKEY. */
/* Hence we have no more than one data set row per */
/* line of code. */
/* This assumes the log file did not contain different
RECTYPEs for the same start number */
/* This assumes log file does not contain differing
permutations of START and END */
proc sort data=LogFile out=LogFileProcessed NODUPKEY;
where rectype ne 1;
by start;
run;
/* Go get macro source and add a line number value that */
/* starts at the %macro statement (because this is how */
/* MCOVERAGE refers to lines). */
/* Restrict number of lines stored to the number we got */
/* from the coverage log file. */
data MacroSource;
length record $132;
retain FoundStart 0
LastLine
recordnum 0;
keep record recordnum;
infile MacSrc pad;
input record $132.;
if not FoundStart and upcase(left(record)) eq: '%MACRO' then
do;
FoundStart = 1;
LastLine = _n_ + &NumLines - 1;
end;
if FoundStart then
recordnum + 1;
if FoundStart and _n_ le LastLine then
OUTPUT;
run;
/* Bring it all together by marking each line of code */
/* with the record type from the coverage log. */
proc sql;
select code.recordnum
,code.record
,log.rectype
,case log.rectype
when 2 then "Used"
when 3 then "Not compiled"
when . then "NOT used!"
else "UNEXPECTED record type!!"
end as analysis
from MacroSource code left join LogFileProcessed log
on code.recordnum eq log.start;
quit;
filename MacSrc clear;
filename MClog clear;
As an endnote, I should explain my personal/idiosyncratic coding style:
- I want to be able to search the log and find "ERROR" only if errors have occurred. But if I code
put "ERROR: message";
then I will always find "ERROR" when I search the log (because my source code will be echoed to the log). By coding
put "ERR" "OR: message";
my code looks a little odd, but I can be sure that "ERROR" gets written to the log only if an error has occurred.
MCOVERAGE:
NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE), 6-Jan-2014
NOTE: Macro Coverage in Testing (MCOVERAGE), 7-Jan-2014
NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros, 8-Jan-2014 (this article!)
NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing, 9-Jan-2014
NOTE: Wrap-Up on Test Coverage and MCOVERAGE, 10-Jan-2014
Tuesday, 7 January 2014
NOTE: Macro Coverage in Testing (MCOVERAGE)
Yesterday I introduced the MCOVERAGE system option (introduced in V9.3) for capturing coverage of macro execution. This is useful in testing, to be sure you executed all lines of your macro. This may take more than one execution of your macro, with different input parameters and data.
I finished yesterday's post by showing the mcoverage log file created from the execution of a sample macro. I've listed all three files below. They are:
- The program that I ran
- The mcoverage log file
- The macro source for %fred (with line numbers added; the blank lines were intentional, to show how they are dealt with by MCOVERAGE)
1 1 18 FRED
2 1 1 FRED
3 17 17 FRED
2 1 1 FRED
2 2 2 FRED
2 3 3 FRED
2 4 4 FRED
2 4 4 FRED
2 4 4 FRED
2 5 5 FRED
2 6 6 FRED
2 7 7 FRED
2 8 8 FRED
2 8 8 FRED
2 9 9 FRED
2 13 13 FRED
2 18 18 FRED
The SAS 9.4 Macro Language: Reference manual tells us that the format of the coverage analysis data is a space delimited flat text file that contains three types of records. Field one of the log file contains the record type indicator. The record type indicator can be:
- 1 = indicates the beginning of the execution of a macro. Record type 1 appears once for each invocation of a macro
- 2 = indicates the lines of a macro that have executed. A single line of a macro might cause more than one record to be generated.
- 3 = indicates which lines of the macro cannot be executed because no code was generated from them. These lines might be either commentary lines or lines that cause no macro code to be generated.
So, record type 1 from our log is telling us that %fred is 18 lines long; record type 3 is telling us that line 17 has no executable elements within it (because it's blank); and the record type 2 lines are telling us which code lines were executed. By implication, lines of code that were not executed don't feature in the mcoverage log. How do we interpret all of this?
The first thing to note is that the line numbers shown in the mcoverage log are relative to the %macro statement and hence don't align with our own line numbers (I deliberately included a blank first and last line in the fred.sas file in order to demonstrate this). The type 2 records show that all lines were executed by our test except 10-12 and 14-17 (these are numbered 11-13 and 15-18 above). Given the logic and the fact that we supplied param=2 when we executed the macro (see yesterday's post), this would seem understandable/correct.
However, surely we can write a quick bit of SAS code to do the brainwork for us and show which lines were executed and which weren't. Of course we can, and I'll show an example program to do this tomorrow...
MCOVERAGE:
NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE), 6-Jan-2014
NOTE: Macro Coverage in Testing (MCOVERAGE), 7-Jan-2014 (this article!)
NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros, 8-Jan-2014
NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing, 9-Jan-2014
NOTE: Wrap-Up on Test Coverage and MCOVERAGE, 10-Jan-2014
Monday, 6 January 2014
NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE)
The SAS macro language is almost as old as SAS itself (who knows exactly?) so you'd think the need to add new functionality would have ceased - particularly with the ability to access most DATA step functions through %sysfunc. But apparently not...
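For example, %sysfunc lets open macro code call a DATA step function directly:

%put Today is %sysfunc(date(), worddate.);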
SAS V9.4 introduces a few new macro features, but not a huge number. The two that caught my eye were:
- The SYSDATASTEPPHASE automatic macro variable which offers an insight into the current running phase of the DATA step
- The READONLY option on %local and %global.
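Here's a minimal sketch of the READONLY option (the variable name is my own illustration, not from SAS documentation): once a read-only macro variable is created, any later attempt to change it is rejected with an error.

%global / readonly RELEASE=9.4;
%put &=RELEASE;
%let RELEASE=9.3; /* Fails: RELEASE is read-only */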
When 9.3 was launched, one of the new system options caught my eye: MCOVERAGE. It claimed to offer coverage analysis for macros, i.e. highlighting which macro code lines were executed and which were not (particularly useful whilst testing your macros). When I wrote of the release of 9.3 I didn't have immediate access to 9.3, the documentation offered little in the way of real-world explanation, and (I confess) I forgot to return to the topic when I got use of a copy of 9.3.
Well, I was reminded of MCOVERAGE recently and I've spent a bit of time over Christmas figuring out how it works and what it offers in real terms (what is Christmas for if it's not for indulging yourself in things you love?). If you do a lot of macro coding then you'll be interested to know that MCOVERAGE offers plenty. Read on...
Consider this piece of code:
filename MClog "~/mcoverage1.log";
options mcoverage mcoverageloc=MClog;
%fred(param=2);
filename MClog clear;
The SAS log doesn't include any extra information, but we've created a new file named mcoverage1.log in our Unix home directory (if you're on Windows, substitute "~/mcoverage1.log" with something like "C:\mcoverage1.log"). I'll describe what the %fred macro does later but, for now, let's just say it's a macro that we want to test. So, we've tested it (with param=2) and it worked fine, but have we tested all of the lines of code, or did we only execute a sub-set of the whole macro? If we look into mcoverage1.log we can find the answer. It looks like this:
1 1 18 FRED
2 1 1 FRED
3 17 17 FRED
2 1 1 FRED
2 2 2 FRED
2 3 3 FRED
2 4 4 FRED
2 4 4 FRED
2 4 4 FRED
2 5 5 FRED
2 6 6 FRED
2 7 7 FRED
2 8 8 FRED
2 8 8 FRED
2 9 9 FRED
2 13 13 FRED
2 18 18 FRED
What does this mean? I'll explain tomorrow...
But before tomorrow, I must add one further piece of information. In order to see the mcoverage log, it needs to be closed by SAS. One does this by coding
filename MClog clear;
However, I found that SAS refused to close the file because it was "in use". Even coding options nomcoverage; before closing it didn't help. In the end I resorted to running another (null) macro after setting NOMCOVERAGE. This did the trick, but if anybody can suggest how I can more easily free up the mcoverage log I'd be very interested to hear. Here's the full code that I used:
%macro null;
%mend null;
filename MClog "~/mcoverage1.log";
options mcoverage mcoverageloc=MClog;
%include "~/fred.sas";
%fred(param=2);
options nomcoverage mcoverageloc='.';
%null;
filename MClog clear;
MCOVERAGE:
NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE), 6-Jan-2014 (this article!)
NOTE: Macro Coverage in Testing (MCOVERAGE), 7-Jan-2014
NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros, 8-Jan-2014
NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing, 9-Jan-2014
NOTE: Wrap-Up on Test Coverage and MCOVERAGE, 10-Jan-2014
Monday, 15 July 2013
NOTE: SAS 9.4 is Functioning Well
The unveiling of a new version of SAS always brings big, exciting new features and capabilities, but I love checking out the detail too. New functions (and new function parameters) invariably provide interest.
SAS 9.4 brings the DS2 language; PowerPoint and HTML5 as ODS destinations; customised attributes for data sets; the ability to write to a SharePoint library; the direct ability to read zip files; clustered metadata servers; and much more. But how about these new/enhanced functions? I'm keen to find a use for each!
DOSUBL
In essence, the DOSUBL function allows macro code to be executed in the middle of a DATA step. This almost turns the order of SAS execution on its head!
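Here's a minimal sketch (the SQL query and macro variable name are my own illustration). DOSUBL submits the quoted code to a side session immediately, and macro variables created there are visible back in the calling DATA step:

data _null_;
   rc = dosubl('proc sql noprint;
                   select count(*) into :nrows from sashelp.class;
                quit;');
   if rc = 0 then
      do;
         nrows = input(symget('nrows'), 12.);
         put 'SASHELP.CLASS has ' nrows 'rows';
      end;
run;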
FCOPY
The FCOPY function makes file copying simpler. FCOPY reads a record from one fileref and writes it to a second.
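A minimal sketch (the file paths are my own illustration). Note that FCOPY works with filerefs rather than paths and, for a byte-for-byte copy of binary files, I believe both filerefs need RECFM=N:

filename in  "~/report.pdf"        recfm=n;
filename out "~/report_backup.pdf" recfm=n;
data _null_;
   rc = fcopy('in', 'out');
   if rc ne 0 then
      do;
         msg = sysmsg();
         put 'Copy failed: ' rc= msg;
      end;
run;
filename in clear;
filename out clear;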
PUTN/PUTC
These old favourites have been enhanced with the ability to override the justification of your output. You can centre, right-align, or left-align the output you create.
Friday, 12 July 2013
NOTE: SAS9.4 Documentation Now Available
It's good to see the SAS 9.4 documentation appear publicly in the last day or so. I shall be scouring it over the weekend!
SAS 9.4 undoubtedly provides new capabilities for programmers and data analysts, but it also promises simplified and broader administration capabilities - some of which benefit users as well as administrators by offering enhanced availability.
One of the most intriguing aspects of 9.4 for advanced programmers will surely be the formal introduction of DS2. I focused a number of items on DS2 earlier in the year and it holds a lot of promise for improved performance.
For administrators (and users), the high-availability metadata server is well worth investigation.
Monday, 17 June 2013
NOTE: Describe Your Table in SAS to Write the SQL Code
A week or so back I highlighted two best paper winners from this year's SAS Global Forum - Steve Overton and John Heaton - and subsequently highlighted one of John's papers. I thought I should do the same for Steve, but then I saw a blog post from Steve of such wonderful simplicity that I thought I'd highlight the blog post instead. Those familiar with the SQL DESCRIBE statement will know that it not only "describes" a table but does so by offering code to recreate the table.
In Describe Your Table in SAS to Write the SQL Code, Steve describes how to use SQL DESCRIBE to generate executable code to (re)create a table. As Steve says, this is a useful reverse-engineering technique for generating empty tables as a step in establishing your data warehouse or data mart. It's one of those things that seems to have no purpose until you need to do it!
I've seen experienced SAS coders writing reams of DATA step code to reverse-engineer a data set. SQL DESCRIBE does it for you nice and easy!
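If you've not tried it, the statement couldn't be simpler; the generated CREATE TABLE code is written to the log:

proc sql;
   describe table sashelp.class;
quit;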
Labels:
Data Modelling,
SAS,
Syntax
Thursday, 9 May 2013
NOTE: Metadata-Bound Libraries - Updates at SGF 2013 #sasgf13
Back in November last year I mentioned Metadata-Bound Libraries. This v9.3 M2 (and above) functionality allows you to force access to your data through metadata libraries, thereby enforcing your metadata security plans.
One of the nuggets of information I learned at SAS Global Forum 2013 was that v9.4 will introduce menus in SAS Management Console to ease the effort of building PROC AUTHLIB code. Plus, the process of unbinding data sets from the metadata libraries will be made easier and simpler. Currently, one has to copy the data sets to an unbound library; v9.4 will allow unbinding to be performed in-place.
In a future release, administrators will optionally be able to make encryption compulsory for all data sets and libraries; and support for AES encryption will be provided. Finally, the metadata server will be able to store the encryption key and send it (encrypted) when required. This will remove the current need to hard-code keys into batch code (and thereby remove the security weakness).
Labels:
Data Modelling,
SAS,
Security,
Syntax
Wednesday, 27 February 2013
NOTE: DS2 Final, Final Comments
I got a lot of feedback about the DS2 articles I recently wrote. I should add a few more summarising comments:
- I mentioned that DS2 doesn't currently support MERGE. Chris Hemedinger commented that the ability to use SQL in your SET statement means that you can use a SQL join on the SET statement and thereby a) achieve a MERGE, and b) benefit from "pushing" the work to a source database (if using non-SAS data).
- Jack Hamilton observed that accidentally omitting the QUIT at the end of a call to PROC DS2 will cause some less-than-obvious syntax errors for the following conventional DATA step. Beware!
- With power comes great responsibility! Whilst a DS2 program and its executing threads are constrained to one process on one machine, there are currently no limits to how many threads a user can request to use with thread programs. So, a user could take full advantage of all the cores on the machine - to the detriment of their fellow users.
DS2 takes no account of the CPUCOUNT option, and it has no association with SAS Grid controls (with which SAS administrators are able to manage and control users' use of system resources). I'm sure that some degree of control, facilitating workload balancing, will follow in time.
DS2:
NOTE: DS2. Data Step Evolved?
NOTE: DS2, Learn Something New!
NOTE: DS2, SQL Within a SET Statement
NOTE: DS2, Threaded Processing
NOTE: DS2, Final Comments
NOTE: DS2, Final, Final Comments
Tuesday, 26 February 2013
NOTE: Plan Early and Avoid a Year 2020 Problem!
My friend Stuart Pearson alerted me to a distant but fast-approaching issue for some SAS sites. Stuart observed a problem at one of his clients as they were working on forecasts for the next ten years and one of their processes was feeding in 2-digit years: 2021 was becoming 1921. This was because the default YEARCUTOFF option value is 1920. As Stuart said to me, as we near 2020 it is something that SAS users need to be fully aware of.
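A quick demonstration (the values are my own illustration). With YEARCUTOFF=1920, 2-digit years map to 1920-2019, so 21 is read as 1921; raising the cutoff moves the window:

options yearcutoff=1920; /* The default */
data _null_;
   d = input('01JAN21', date7.);
   put d date9.; /* 01JAN1921 */
run;

options yearcutoff=1950; /* 2-digit years now map to 1950-2049 */
data _null_;
   d = input('01JAN21', date7.);
   put d date9.; /* 01JAN2021 */
run;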
It's not a difficult issue to overcome. But to overcome it, you need to be aware of it.
Stuart also highlighted that SAS Data Quality handles 2 digit years differently to SAS/BASE. By default, anything greater than 04 and less than 99 will be assigned to the 20th century (19nn); for years 01 - 03, the 21st century (20nn) is assigned. You can create your own schemes and rules, but you need to be aware of the defaults.
Wednesday, 20 February 2013
NOTE: DS2, Final Comments #sasgf13
In my previous posts, I've covered many aspects of DS2 (previous posts are listed at the bottom of this post). It's time to wrap up by offering a few more final details.
Whilst DS2 provides a wide range of data types, not all types are supported by all output data structures. The BASE engine, for example, has not been updated to allow storage of anything other than numeric and character variables in SAS datasets, so an attempt to create a data set with a variable of type BIGINT will be met with a warning message:
WARNING: BASE driver, creation of a BIGINT column has been requested, but is not supported by the BASE driver. A DOUBLE PRECISION column has been created instead.
The traditional SAS numeric variable is known in this context as a double precision column!
Paired with the SAS Embedded Process, DS2 enables you to perform processing similar to SAS in completely new places, such as in-database processing in relational databases, the SAS High-Performance Analytics grid and the DataFlux Federation Server.
If you want to know more, consider attending Mark Jordan's pre-conference tutorial at this year's SAS Global Forum. In the seminar, entitled "What Will DS2 Do for You?", you will learn the basics of writing and executing DS2 code. Mark promises to show attendees how to explicitly control threading to leverage multiple processors when performing data manipulation and data modeling. He will demonstrate how DS2 improves extensibility and data abstraction in your code through the implementation of packages and methods. It's an extra-fee event ($155) but could add a powerful extra string to your SAS bow!
DS2:
NOTE: DS2. Data Step Evolved?
NOTE: DS2, Learn Something New!
NOTE: DS2, SQL Within a SET Statement
NOTE: DS2, Threaded Processing
NOTE: DS2, Final Comments
NOTE: DS2, Final, Final Comments
Labels:
Performance,
SAS,
SGF,
Syntax,
Training
Monday, 18 February 2013
NOTE: DS2, Threaded Processing
In my recent posts on DS2 (DATA step evolved), I showed the basic syntax plus packages & methods, and I showed the use of SQL within a SET statement. In today's post, I'll show the biggest raison d'être for DS2 - the ability to run your code in threads to make it finish its job more quickly.
"Big data", that's the big talking point. One of the key principles of performing speedy analytics on big data is to split the data across multiple processors and disks, to send the code to the distributed processors and disks, have the code run on each processor against its sub-set of data, and to collate the results back at the point from which the request was originally made. Thus, we're sending code to the data rather than pulling the data to the code. It's quicker to send a few dozen lines of code to many processors than it is to pull many millions of rows of data to one (big) processor.
DS2 was designed for data manipulation and data modeling applications. DS2 also enhances a SAS programmer’s repertoire with object-based tools by providing data abstraction using packages and methods. DS2 executes both within a SAS session by using PROC DS2, and within selected databases where the SAS Embedded Process is installed.
Here's a simple example. In summary, it shows how the use of eight threads reduces the turnaround time of the task from 24 seconds in a conventional DATA step to 4.4 seconds in a call to DS2 with eight threads.
16   /*****************************/
17   /* Create a chunky data set. */
18   /* Then read it:             */
19   /*  a. With one thread       */
20   /*  b. With eight threads    */
21   /*  c. Using "old" DATA step */
22   /*****************************/
23
24   options msglevel=n;
25   options cpucount=actual;
26
27   proc options option=threads;run;
SAS (r) Proprietary Software Release 9.2 TS2M3
 THREADS    Threads are available for use with features of the SAS System that support threading
28   proc options option=cpucount;run;
SAS (r) Proprietary Software Release 9.2 TS2M3
 CPUCOUNT=24    Number of processors available.
29
30   /****************************/
31   /* Create a chunky data set */
32   /****************************/
33   data work.jmaster;
34     do j = 1 to 10e6;
35       output;
36     end;
37   run;
38
39   /**************************/
40   /* Now read it three ways */
41   /**************************/
42
43   /* But first define the threaded code thread */
44   proc ds2;
45   thread r /overwrite=yes;
46     dcl double count;
47     method run();
48       set work.jmaster;
49       count+1;
50       do k=1 to 100;/* Add some gratuitous computation! */
51         x=k/count + k/count + k/count;
52       end;
53     end;
54     method term();
55       OUTPUT;
56     end;
57   endthread;
58   run;
NOTE: Execution succeeded. No rows affected.
59   quit;
60
61   /* One thread */
62   proc ds2;
63   data j1(overwrite=yes);
64     dcl thread r r_instance;
65     dcl double count;
66     method run();
67       set from r_instance threads=1;
68       total+count;
69     end;
70   enddata;
71   run;
NOTE: Execution succeeded. One row affected.
72   quit;
NOTE: PROCEDURE DS2 used (Total process time):
      real time           25.09 seconds
      cpu time            25.16 seconds
73
74   /* Eight threads */
75   proc ds2;
76   data j8(overwrite=yes);
77     dcl thread r r_instance;
78     dcl double count;
79     method run();
80       set from r_instance threads=8;
81       total+count;
82     end;
83   enddata;
84   run;
NOTE: Execution succeeded. 8 rows affected.
85   quit;
NOTE: PROCEDURE DS2 used (Total process time):
      real time           4.40 seconds
      cpu time            32.96 seconds
86
87   /* And read it in DATA step */
88   data jold;
89     set work.jmaster end=finish;
90     count+1;
91     do k=1 to 100;/* Add some gratuitous computation! */
92       x=k/count + k/count + k/count;
93     end;
94     if finish then output;
95   run;
NOTE: There were 10000000 observations read from the data set WORK.JMASTER.
NOTE: The data set WORK.JOLD has 1 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           23.98 seconds
      cpu time            23.75 seconds
The code does six things:
- Firstly, it sets the maximum number of CPUs available to the SAS task. It sets this to 24 (this box has 12 cores, each with two hyperthreads)
- Next, it creates a sample data set (to be read in the subsequent steps)
- It then uses DS2 to define a thread - like a function or a method. The thread will read a row from the sample data set, increment a count and do some arbitrary compute activity (to ensure the test exercise isn't I/O-bound, thereby preventing DS2 from showing its capabilities)
- Next, it executes DS2 again. The previously defined thread is used inside an executable piece of code (once, because threads=1). This takes 25 seconds elapsed time (and 25 seconds CPU time)
- Now we execute the same code with DS2 but we use eight threads (each processing one eighth of the input records). This takes 4.4 seconds to complete (but consuming a total of 33 seconds of CPU time across the eight threads)
- Finally, we execute a traditional DATA step to do the same task. Not surprisingly, it takes a similar amount of time to complete the task as the single-threaded DS2 code
This example suggests that great benefit can be achieved with DS2 threads. In our simple case, the code was compute-bound. If your task is more I/O-bound then the benefits may be less predictable. Jason, from the DS2 development team, recently told me:
When DS2 does I/O, it starts one "reader" thread. This reader thread requests blocks of data from a data source and puts the blocks on a queue. Then, either the DS2 "DATA" program fetches those blocks off the queue or DS2 "THREAD" programs fetch the blocks off the queue.
The key here is there is one reader thread and one or more "compute" threads. As long as the reader thread can keep up with the compute threads, we should see a speed-up in execution time. One reader thread works well for most data sources as data sources usually present one source to read from.
With a data source like SPDE, there could be multiple data sources across multiple devices. Right now DS2 does not take advantage of having multiple reader threads. However, I believe the architecture is flexible enough to allow multiple reader threads.
At this point, we have been discussing I/O on a single machine. SAS High-Performance Analytics and SAS Scoring Accelerator execute DS2 in parallel across many machines. The I/O model for these types of systems is different and enables DS2 programs to better take advantage of multiple storage devices.
So, as is usually the case when we talk of performance, a lot depends on your hardware architecture and the amount of effort you put into the tuning of your architecture and code. Nonetheless, with DS2 it seems that there are benefits aplenty to be had.
In my next post, I'll wrap up the topic with a few extra details.
DS2:
NOTE: DS2. Data Step Evolved?
NOTE: DS2, Learn Something New!
NOTE: DS2, SQL Within a SET Statement
NOTE: DS2, Threaded Processing
NOTE: DS2, Final Comments
NOTE: DS2, Final, Final Comments
Labels:
Performance,
SAS,
Syntax
Wednesday, 13 February 2013
NOTE: DS2, SQL Within a SET Statement
Earlier this week, I suggested taking a look at DS2 ("the DATA step evolved") as a learning opportunity for the New Year. I listed the advantages that DS2 offers over the traditional DATA step. One of those was the ability to use SQL within a SET statement, thereby combining the benefits of both languages. There follows a simple example that combines SQL's familiar aggregation capabilities with SAS's ability to produce more than one output data set from a single pass of the input data.
In fact, DS2 takes a number of ideas from SQL:
- Apart from using ANSI SQL data types in addition to the traditional Numeric and Character types, DS2's DATA step takes a leaf out of SQL's book when no output data set name is specified. Traditional DATA steps would create output data sets with names like WORK.DATA1; DS2 returns the result to the Output window, much like PROC SQL would do.
- Another DS2/SQL similarity is that DS2 will not automatically replace or over-write an existing table; you must specify the OVERWRITE=YES data set option.
- And, finally, PROC DS2 uses RUN and QUIT statements in a similar manner to PROC SQL. PROC DS2 uses the RUN statement to indicate that the preceding code is complete and should be executed, whilst the QUIT statement tells the procedure that its work is fully complete and it can be unloaded from memory.
17 proc copy inlib=sashelp outlib=work;
18 select prdsale;
19 run;
NOTE: Copying SASHELP.PRDSALE to WORK.PRDSALE (memtype=DATA).
NOTE: There were 1440 observations read from the data set SASHELP.PRDSALE.
NOTE: The data set WORK.PRDSALE has 1440 observations and 10 variables.
20
21 proc ds2;
22 data high(overwrite=yes) low(overwrite=yes);
23 drop count;
24 method run();
25 set {select country
26 ,year
27 ,quarter
28 ,sum(actual) as actual
29 ,sum(predict) as predict
30 from work.prdsale
31 group by country,year,quarter};
32 if actual lt 5000 then output low;
33 else output high;
34 end;
35 enddata;
36 run;
NOTE: Execution succeeded. 24 rows affected.
37 quit;
Clearly, you could do all of this with a PROC SUMMARY (or PROC MEANS) followed by a conventional DATA step, but the input/output activity is much reduced with the above technique, and use of a SQL SELECT statement within a SET statement is particularly useful for those who are less familiar with some SAS procedures.
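For comparison, here's a sketch of that traditional two-step equivalent (untested, and the intermediate table name quarterly is my own invention):
proc summary data=work.prdsale nway;
   class country year quarter;
   var actual predict;
   output out=quarterly(drop=_type_ _freq_) sum=;  /* hypothetical intermediate table */
run;

data high low;
   set quarterly;
   if actual lt 5000 then output low;
   else output high;
run;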
Returning to the DS2 log above, however, observe how its NOTE offers little information about the number of rows written to each of its output tables. Maybe greater detail will be offered in a future version (I'm using V9.2 TS2M3).
And you should be aware too that DS2 does not (currently) incorporate all the functions and statements that you'd expect in a traditional DATA step. For instance, we have SET but we don't have MERGE, UPDATE or MODIFY. We must assume that these will come in time too. However, the ability to provide some of these capabilities may, in part, conflict with DS2's ability to perform threaded execution, so some traditional DATA step capabilities may never make their way into DS2.
In the next DS2 article, I'll discuss high performance data access using DS2's threading capabilities.
DS2:
NOTE: DS2. Data Step Evolved?
NOTE: DS2, Learn Something New!
NOTE: DS2, SQL Within a SET Statement
NOTE: DS2, Threaded Processing
NOTE: DS2, Final Comments
NOTE: DS2, Final, Final Comments
Labels:
Performance,
SAS,
Syntax
Monday, 11 February 2013
NOTE: DS2, Learn Something New!
There's never a bad time to learn something new. How about DS2?
In May of last year, upon returning home from SAS Global Forum, I wrote about DS2, a new alternative to DATA step. Generally available in V9.4, PROC DS2 is currently available in SAS V9.3 as an experimental technology (and V9.2 TS2M3 too).
DS2's focus is on high performance for data manipulation and data analysis; it incorporates threading. But there's a lot more functionality available besides. DS2 gives you, the programmer, far more flexibility with your coding and gives you far greater abilities to structure your code.
The DS2 syntax includes the ability to specify SQL within a SET statement (combining the benefits of both languages), additional data types including the ANSI SQL types, programming structure elements, and user-defined methods and packages. In advance of the 9.4 launch, you can find a Getting Started and a Language Reference manual in the product documentation.
Here's a simple example of some DS2 code and its output:
17 proc ds2;
18
19 data _null_;
20 method init();
21 dcl varchar(16) str;
22 str = 'Hello World!';
23 put str;
24 end;
25 enddata;
26
27 run;
Hello World!
NOTE: Execution succeeded. No rows affected.
28 quit;
As you can see, DS2 is actually a PROC, but it takes DATA step-like code as statements within the PROC. Within the DATA step, code is arranged into methods - sub-routines if you like (similar to functions in Java, or functions created by PROC FCMP). The most basic methods are called INIT, RUN, and TERM. INIT and TERM run once (at the beginning and the end of the DATA step, as their names imply), and RUN is executed once for each input row, just as a conventional DATA step does.
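To make the method lifecycle concrete, here's a minimal sketch of my own (not from the documentation) showing all three system methods in one step:
proc ds2;
data _null_;
   method init();
      put 'INIT: runs once, before the first row';
   end;
   method run();
      set sashelp.class;           /* RUN executes once per input row */
      put 'RUN: processing ' name;
   end;
   method term();
      put 'TERM: runs once, after the last row';
   end;
enddata;
run;
quit;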
The Getting Started manual takes you through some of the basics of the syntax. Let's move on and see some of what I find most exciting about DS2.
Firstly (below), I'll take a look at structured programming with packages and methods. In subsequent articles I'll look at SQL in the SET statement and at high performance data analysis with threading.
So, let's look at packages and methods: a great means of structuring our code. Take a look at the code and log output below. Note how we create a package named pythagorus, and the package contains two methods (gethyp and getside). The methods can contain any amount of code, and the package can be stored as an external file. There are large overlaps with the abilities of the macro language here, but DS2 brings the advantage of using one coherent language, with many different types of data (not just character!).
17 proc ds2;
18
19 package pythagorus/overwrite=yes;
20 method gethyp(double a, double b)
21 returns double;
22 a_sq = a**2;
23 b_sq = b**2;
24 return sqrt(a_sq + b_sq);
25 end;
26 method getside(double hyp, double sidea)
27 returns double;
28 return sqrt(hyp**2 - sidea**2);
29 end;
30 endpackage;
31 run;
NOTE: Execution succeeded. No rows affected.
32
33 data demo(overwrite=yes);
34 method init();
35 short=3; long=4; hyp=.; output;
36 short=4; long=5; hyp=.; output;
37 short=.; long=4; hyp=5; output;
38 short=3; long=.; hyp=5; output;
39 end;
40 enddata;
41 run;
NOTE: Execution succeeded. 4 rows affected.
42
43 data results(overwrite=yes);
44 dcl package pythagorus pyth();
45 method run();
46 set demo;
47 select;
48 when (missing(hyp))
49 hyp=pyth.gethyp(short,long);
50 when (missing(short))
51 short=pyth.getside(hyp,long);
52 when (missing(long))
53 long=pyth.getside(hyp,short);
54 end;
55 end;
56 enddata;
57
58 run;
NOTE: Execution succeeded. 4 rows affected.
59 quit;
After the package and its methods are defined, we create some test data in a small DATA step, and then we generate our results in a final DATA step. Notice how we need to declare the package in our results DATA step before we use it.
The ability to create packages of methods will allow SAS code to be written in a clearer, more structured form than was previously possible. I look forward to discovering the full benefits of packages and methods as I use these new capabilities more and more.
DS2 will still perform type conversions (one of my bêtes noires of the DATA step; I wrote about it in September 2009 and in a follow-up article) but the rules are more complicated because DS2 introduces so many different types. Indeed, there's a whole chapter on it in the Language Reference.
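As a quick illustration (a sketch of my own; DS2 may write a note about the conversion to the log):
proc ds2;
data _null_;
   method init();
      dcl double d;
      dcl varchar(8) s;
      s = '42';
      d = s;   /* implicit character-to-double conversion */
      put d;
   end;
enddata;
run;
quit;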
Not quite sure where to start? Let PROC DSTRANS help you by translating a subset of your DATA step code into DS2 code. Then, if necessary, you can revise your program to take advantage of DS2 features before submitting your program using PROC DS2.
In my next DS2 post, I'll show how to use SQL statements within your SET statements.
DS2:
NOTE: DS2. Data Step Evolved?
NOTE: DS2, Learn Something New!
NOTE: DS2, SQL Within a SET Statement
NOTE: DS2, Threaded Processing
NOTE: DS2, Final Comments
NOTE: DS2, Final, Final Comments
Labels:
Performance,
SAS,
Syntax
Monday, 4 February 2013
NOTE: Unable to Write to Template Store!
Sometimes you get into a habit that is so deep-seated that you don't realise you're doing it. I was just reminded of a little trick that I regularly use when editing ODS graphics templates.
A colleague of mine was using PROC TEMPLATE and getting the following (apparently) benign error in his log:
15 proc template;
16 edit styles.default as styles.defsize2;
17 style container / font_size=2;
18 end;
ERROR: Template 'Styles.Defsize2' was unable to write to template store!
19 run;
I say "benign" because the code that subsequently used styles.defsize2 worked fine. But we have a coding guideline that says that no errors are allowed in our team's code (not even benign ones!). He was struggling to make the error go away and was promoting the argument that benign errors should be permitted by our coding guidelines.
We discussed the message and what it meant. Templates, by default, are stored in an ITEMSTOR named SASUSER.TEMPLAT. My programmer's code was reading SASUSER.TEMPLAT.STYLES.DEFAULT and attempting to create a new style named SASUSER.TEMPLAT.STYLES.DEFSIZE2. The error message was saying he couldn't do that because DEFSIZE2 already existed. Why did it already exist? Because his code had worked without an error message the very first time he ran it - but he'd forgotten this fact.
So, it wasn't necessary to run the code every time because the DEFSIZE2 style already existed. Of course, each new programmer that ran the code had to run the PROC TEMPLATE once in order to create DEFSIZE2 in their own SASUSER library.
I proposed a slightly different approach: let's create the template in the WORK library each time the code was run. With this change, the code would run consistently for everybody, every time.
Making this change is simple: we just need to change the default location for templates to be read from and written to. And the easiest way to do that is to add the WORK library to the ODS search path (adding it to the front of the search path). Specifically, we want to add an ITEMSTOR that's in the WORK library to the ODS search path. The ITEMSTOR need not exist before we write something to it - it'll be created automatically. We can use ODS PATH SHOW to demonstrate the effect:
15 ods path show;
Current ODS PATH list is:
1. SASUSER.TEMPLAT(UPDATE)
2. SASHELP.TMPLMST(READ)
16 ods path(prepend) work.templat(update);
17 ods path show;
Current ODS PATH list is:
1. WORK.TEMPLAT(UPDATE)
2. SASUSER.TEMPLAT(UPDATE)
3. SASHELP.TMPLMST(READ)
The first call to ODS PATH SHOW shows us the two default ITEMSTORs, and then we see that we've added WORK.TEMPLAT to the head of the search path.
We don't need to change any other part of our code because all elements of ODS respect this search path.
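Putting the pieces together, the reworked code might look like this (a sketch assembled from the fragments above):
ods path(prepend) work.templat(update);

proc template;
   edit styles.default as styles.defsize2;
      style container / font_size=2;
   end;
run;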
So, we updated the code, got rid of the (not so) benign error message, and didn't make any changes to our coding guidelines. An interesting and informative exercise, demonstrating not just the ODS PATH for templates but also the wisdom of not allowing any error messages in SAS logs. The programmers in Cary put those messages in for a reason!
Oh, one more thing. You can use PROC DATASETS to see ITEMSTORs in your libraries, and you can view a list of templates with PROC TEMPLATE's LIST statement (see below), but I haven't figured out a means of inspecting the content of an ITEMSTOR. Does anybody know how? I appreciate that the SAS Registry is an ITEMSTOR, and you can view that in PC SAS; but how do you view these things if they're on a server and you're using Enterprise Guide?
proc datasets lib=sasuser;
run;
proc template;
list ;
run;
Answers on a postcard please...
Thursday, 31 January 2013
NOTE: Ampersands Again
My recent post on double and triple ampersands in macros ended with some doubt. Laurie and Dave immediately took up the challenge to resolve the doubt. You can see the whole painful saga in the comments of the original post, but I've repeated Dave's summary of the conclusions here and I offer my thanks to Laurie and Dave for their tenacious work!
The macro parser performs two basic processes on each pass. If it finds a double ampersand, it converts it into a single one and then proceeds with the rest of the statement; and if it finds a string &xyz that it recognises as a macro variable, it resolves it and then proceeds with the rest of the statement. At the end of the statement, it checks to see if it has been left with any ampersands in the parsed output. If there are, it runs through the process again.
That's exactly how we understood it to work.
The detail we had missed is in what happens if it finds a string &abc that it DOESN'T recognise as a macro variable name. If on the same pass, in the head of the same text string, it also performs a successful double ampersand resolution, it does not regard the unrecognised variable as an error, but allows the string &abc to go through unchanged to the next iteration. But if there were no double ampersand resolutions, it knows that it has had its last chance at a successful resolution and so it will throw a warning - I think it's not an error but I can't check as I don't have SAS installed here.
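By way of a worked illustration of those rules (my own example, not part of Dave's summary):
%let i = 2;
%let name2 = Alice;

%* pass 1: the double ampersand becomes single, "name" passes through untouched, and i resolves to 2, leaving name2 behind an ampersand ;
%* pass 2: the remaining reference to name2 resolves to Alice ;
%put &&name&i;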
Monday, 28 January 2013
NOTE: Getting More for Less With SPDE
Would you like to read or write your data more quickly? SAS V9.1 introduced the Scalable Performance Data Engine (SPDE, "Speedy Engine") for free. Are you using it? Speedy Engine is the little cousin of Speedy Server (the Scalable Performance Data Server, a chargeable product), but its potential benefits are not so little. With a simple set of changes to one LIBNAME statement, your code could run a lot faster.
Put simply, when you read data from a traditional SAS data set, the bits and bytes are read one at a time from the disk. The same is true when you write data. However, if you use the Speedy Engine on your LIBNAME statement then your data will be written across (or read from) multiple disk drives, allowing SAS to send more data at the same time (because it can send the first ten observations of your table to disk #1 and then send the second ten observations to disk #2 whilst waiting for disk #1 to finish writing the data). The parallel writing and reading of data means that the processor needn't spend most of its time waiting for disk operations to complete.
Using Speedy Engine is simple, so long as you have multiple disk drives *and* your processor has multiple channels with which to communicate with the disk drives. Without multiple channels, the channel will become a bottleneck and you won't see the improvements you'd hoped for.
Here's a simple example LIBNAME statement:
libname sales spde "f:\alphasales\metadata"
   datapath=("g:\alphasales\datapart1"
             "h:\alphasales\datapart1")
   indexpath=("j:\alphasales\index");
Note how the LIBNAME statement largely follows the standard syntax, but we specify the SPDE engine after the "sales" libref, and we specify datapaths and an indexpath. What you need to know is that the Speedy Engine writes three different kinds of data: metadata, the index, and data. It simply needs to know where to put each kind of data. The first location specified in the LIBNAME statement (the F: drive in our example) is where the metadata will be stored. You don't need to worry about what the metadata contains, just be aware that it needs to go somewhere! For quick searching and retrieval, Speedy Engine will want to store index files somewhere, hence the indexpath. Finally, Speedy Engine will store your data in the multiple locations specified in the datapath. I've used two datapath directories in the example, but you can specify more.
If you look in these locations, after writing some data to the sales libref, you'll see files with unfamiliar suffixes to their names, i.e. not sas7bdat. This need not worry us, but it does emphasise a key point: you have to write data in Speedy Engine format before you can read it with Speedy Engine.
To get the most from Speedy Engine, make sure you have multiple processors, multiple channels, and multiple disk drives. If you don't know the detailed architecture of your SAS environment, and are not able to discover more information from your local support team, you can still use an empirical approach by trying different combinations of disk drives and see which combinations give best results. However, Speedy Engine offers features over-and-above performance improvements. For example, it offers implicit sorting (whereby you needn't pre-sort your input data when you use a BY statement), WHERE clause optimisation with OR operators, and more than 32k columns in a table(!).
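For instance, implicit sorting means a BY statement can be used against an unsorted SPDE table without a preceding PROC SORT. A sketch, assuming the sales libref from above and a hypothetical transactions table:
data sales.by_country;
   set sales.transactions;   /* hypothetical SPDE table, not pre-sorted */
   by country;               /* Speedy Engine sorts implicitly; no PROC SORT step needed */
run;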
Speedy Server (the big cousin) offers further advantages such as an Access Control List–based security model, support for parallel BY-group processing, and processing of implicit pass-through.
For more information on the Speedy Engine, take a look at the SPDE section of the SAS Support Focus Areas, including the Quick Set-Up Guide and Tips and Tricks. The NESG 2006 paper An Annotated Guide: The New 9.1, Free & Fast SPDE Data Engine by Russ Lavery and Ian Whitlock is a little dated, but it is useful and reliable nonetheless.
Sunday, 27 January 2013
NOTE: The Dreaded Double and Triple Ampersands
Aside from Chris Hemedinger's comment on SAS V9.3's PROC SQL INTO enhancements, my Macro Arrays, Straight From a Data Set article last week also caught the attention of another friend and regular correspondent - LeRoy Bessler. LeRoy emailed me to suggest that an explanation of why I used a triple ampersand in my macro code (see below) might be of value to many (especially since he is used to successfully using double ampersands in this situation).
LeRoy asked for an explanation. So here it is...
We're familiar with the single ampersand indicating that the name of a macro variable follows the ampersand, e.g. &guide instructs the SAS macro processor to substitute &guide with the value of the macro variable named guide. Thus, the value of guide gets passed to the SAS compiler.
The process of tokenisation and the processing that is done by the SAS macro processor includes a rule whereby two consecutive ampersands (a double ampersand) get resolved to a single ampersand. Not only that but the resulting single ampersand will be put to one side whilst the remainder of the token is parsed in the same manner. Once the first pass of the token is complete, it is all parsed again with the same rules. Here are some examples that illustrate what happens (I've used colour to highlight the separate clusters of characters that get processed):
%let guide = enterprise;
%let enterprise = starship;
Submitted by User | After Pass #1 | After Pass #2 | Sent to SAS Compiler
------------------|---------------|---------------|---------------------
guide             | n/a           | n/a           | guide
&guide            | enterprise    | n/a           | enterprise
&&guide           | &guide        | enterprise    | enterprise
&&&guide          | &enterprise   | starship      | starship
Let's use this information to decipher my original piece of SAS macro code:
%do i=1 %to &name0;
%put I=&i, NAME&i=&&&name&i, AGE&i=&&&age&i;
%end;
In this case, we're looping around all values from 1 to whatever number &name0 resolves to. I want the code to produce the following output in the log:
I=1, NAME1=Alfred, AGE1=14
I=2, NAME2=Alice, AGE2=13
I=3, NAME3=Barbara, AGE3=13
[snip]
In the put statement, we print "I=" followed by the value of &i, i.e. the index variable of the loop; then we print ", NAME", followed by the value of &i, followed by an equals sign. So far we've got "I=1, NAME1=". This is followed by &&&name&i. The first pass of &&&name&i resolves to a single ampersand followed by.... oh dear, there's a bug in my program!
a) There's no value for &name; my code had not previously created a macro variable called name
b) Moreover, I had not intended to use a macro variable named &name, I had wanted my code to merely resolve to name at this point
I should have used a double ampersand like LeRoy is used to. If I had specified &&name&i then the first pass would have produced &name2; and the next pass would have resolved to a student name (the array of macro variables had been loaded from sashelp.class).
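For completeness, the loop as it should have been written:
%do i=1 %to &name0;
   %put I=&i, NAME&i=&&name&i, AGE&i=&&age&i;
%end;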
So, I offer my thanks to LeRoy for politely highlighting my error, and inspiring this blog post.
But, one question remains: we understand how triple ampersands are the correct technique to be used in the first examples, but why did my code from the previous blog post work when the rules suggest it shouldn't? And my answer is: I don't know! Try it for yourself; SAS silently does what I wish, despite my code having too many ampersands.
Answers on a postcard please...
Saturday, 26 January 2013
NOTE: SQL INTO, Revisited
In Macro Arrays Straight From a Data Set last week I wrote about using SQL's INTO in order to create macro variables (and arrays of macro variables) from PROC SQL. I said that we had to use an arbitrarily large number to define the maximum size of the macro array. Well, Chris Hemedinger was quick to post a comment and point out that with SAS 9.3 there are a few improvements in the SELECT INTO range options.
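For reference, this is the sort of pre-9.3 pattern I meant, with its arbitrary upper bound (sketched from memory rather than copied from the original post):
/* pre-9.3: guess a maximum size for the macro array */
proc sql noprint;
   select distinct TYPE into :varVal1-:varVal9999
      from SASHELP.CARS;
quit;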
Not only that but Chris highlighted his Improving on a SAS Programming Pattern post from March last year in which Chris showed the new SAS 9.3 syntax where we can now specify an open-ended range for macro variable names. Apparently, SAS will allocate as many macro variables as are needed to store each value, and the values will be TRIMMED of whitespace by default (but you can control this by specifying NOTRIM before the FROM keyword).
I've reproduced Chris's example below.
/* SAS 9.3 approach */
/* Create macro vars with values & count of distinct values */
proc sql noprint;
   select distinct TYPE into :varVal1-
      from SASHELP.CARS;
quit;

%let varCount = &SQLOBS.;
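And to walk the resulting macro array, a loop along these lines (my sketch, reusing the names from Chris's example) does the job:
%macro showvals;
   %do i = 1 %to &varCount;
      %put varVal&i = &&varVal&i;
   %end;
%mend showvals;
%showvals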
Thanks for the tip Chris.