Saturday 31 December 2011

NOTE: The Hits of 2011

To join all of the others on the internet, here's an "end of year" list... Blogger tells me that the most popular posts on the NOTE: blog this year were:

1. NOTE: SAS Global Forum 2011 - Update #3, 1,069 page views
2. NOTE: SAS Global Forum 2011 - Update #1, 929 page views
3. NOTE: SAS Global Forum 2011 - Update #2, 826 page views
4. NOTE: SAS Global Forum 2011 - Back-Ups, and The Last Word, 770 page views
5. NOTE: SAS Professionals Convention 2011 - Registration is Open, 753 page views

It's interesting to see that all of the posts relate to conferences - SGF in particular. I guess that's a reflection of our shared interest in the latest SAS news. I have every intention of being at SGF 2012 in Florida, USA, between April 22 and 25. And I'll endeavour to post all the latest news as soon as I find it.

Thanks for your continued support of NOTE:. Here's hoping we all have a healthy and happy New Year.

Tuesday 27 December 2011

NOTE: Decoding the _TYPE_ Variable (Golden Oldies)

Our mini-series of hints and tips from past decades continues with 1999...

Appearing in the same issue of The Missing Semicolon (TMS) were i) an advertisement for Year 2000 Support (do you remember "Y2K"?), and ii) an article entitled "Decoding the _TYPE_ Variable (in PROC SUMMARY and PROC MEANS)" on page 5.

The issue in question was June, 1999: Volume 2, Number 2. The Missing Semicolon, published by Systems Seminar Consultants, has been informing SAS practitioners since June 1998 (just a few months after the first edition of VIEWS News was put to print and posted to its subscribers). It continues to this day and offers free subscriptions to all SAS practitioners.

Contemporary users of PROC MEANS and SUMMARY have the benefit of the CHARTYPE option, but the principles of understanding the values of the numeric _TYPE_ variable or the 1's and 0's of the character _TYPE_ are the same.
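
For anyone who hasn't met it, here's a small sketch (using sashelp.class) showing the character _TYPE_ that the CHARTYPE option produces; drop the option and _TYPE_ reverts to its numeric form:

proc summary data=sashelp.class chartype;
  class sex age;
  var height;
  output out=stats mean=MeanHeight;
run;

/* _TYPE_ is now a character pattern, one character per CLASS variable: */
/* '00' overall, '01' age only, '10' sex only, '11' sex*age             */
proc print data=stats;
  var _type_ sex age MeanHeight;
run;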

Tuesday 20 December 2011

NOTE: 180 Terrific Ideas for Using and Learning the SAS System (Golden Oldies)

Continuing my mini-series of hints and tips from previous decades, here's a paper from my first SAS Users Group International (SUGI) conference. The 18th SUGI (predecessor to today's SAS Global Forum (SGF)) was held in 1993 in New York. Helen and Ginger Carey's paper was boldly entitled "180 Terrific Ideas for Using and Learning the SAS System", and it didn't disappoint. Whilst not all of their ideas are useful today, a significant number of them are. Note in particular their ideas regarding a) work habits, b) testing and debugging, and c) programming style.

Thursday 15 December 2011

NOTE: BY Variables in Titles (Golden Oldies)

Whilst the capabilities of SAS software increase markedly from release to release, the core capabilities remain the same. And that means that many hints and tips from years gone by are still valid today. Starting today I'm running a mini-series of tips from past decades, starting with an article by myself in the first ever issue of VIEWS News (winter 1998).

VIEWS News was a printed and posted newsletter that was published quarterly by VIEWS - the UK's independent user group for SAS practitioners.

BY Variables in Titles

You can insert labels and/or values of BY variables into titles. For example, suppose you have a data set sorted by the variable TRTSEQ, with label "Group" and formatted values "Placebo", "Wonder Drug", etc.

options nobyline;

proc print;
  title "List for #byvar1 - #byval1";
  by trtseq;
  var var1-var10;
run;


For each BY group, you'll get the titles:

List for Group - Placebo
List for Group - Wonder Drug
etc.


The NOBYLINE option suppresses the normal printing of the BY lines, so you don't see "Group=Placebo." You can turn it back on with OPTIONS BYLINE.
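
Incidentally, if you have more than one BY variable you can reference them by name as well as by position, e.g.:

title "List for #byvar(trtseq) - #byval(trtseq)";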

Tuesday 13 December 2011

Testing - Peer Reviews

In my recent series on testing I concentrated on dynamic testing, i.e. executing some or all of your code to see if it works. The other type of testing is static testing. This was featured briefly in my "SAS Software Development With The V-Model" paper at this year's SAS Global Forum (SGF).

The principal form of static testing is peer review. Like all tests and reviews, you have to have something to test or review against. For testing you will have a set of requirements or a set of specifications (ref: the V-Model). For peer review, a fellow programmer will take a look at your code and offer an opinion on whether it meets the team's coding guidelines and other standards (and whether it gives the impression of meeting the unit specification). Thus, to perform an effective peer review, you need some documented standards against which the review can be set.

I wrote a NOTE: post on peer review back in 2009, so I won't repeat those thoughts here; however, I did see a good post from Tricia on the BI Notes blog back in October. Tricia used the post to run through a number of potential pitfalls for those who are new to the process. In the article, Tricia offered valuable tips on how to keep the review focused & at the right level, and how to be properly prepared.

Done properly, peer reviews offer increased code quality, reduced cost of rework, increased amounts of shared knowledge of your applications, and increased sharing of code and design good practice.

NOTE: Programming Languages Can Be Copied...

... so says Yves Bot, advocate general at the Court of Justice of the European Union (CJEU). Mr Bot's guidance was issued in response to the July 2010 High Court case between SAS and World Programming Limited (WPL). The CJEU still needs to make a full judgement before handing things back to the High Court, but WPL will doubtless be pleased that Mr Bot appears to have sided with them (along with a US court in March 2011).

However, whilst the language itself cannot be protected, the implementation of the language *is* protected. So, the code that SAS's staff in Cary (and around the world) have written in order to implement the language cannot be copied. Anybody who wishes to produce software that can interpret the SAS language must write that interpreter all by themselves. As a result, you might expect the bigger company to be able to invest more time and effort into optimising the speed and efficiency of their product.

So, before you rush out hoping to find a cheaper, better solution, be sure to do your full due diligence on all of the candidates. Check that all elements of the language are implemented; be sure to perform some performance and reliability testing; and probe the support provisions of your potential suppliers. These things are less easy to duplicate. Caveat emptor.

Thursday 8 December 2011

NOTE: Hash Tables, An Introduction

In my recent series of articles on testing I used the hash table to provide a means to write/update a table from within a DATA step without demanding changes to the DATA statement, etc. I've had some very kind feedback on the series, plus some requests to explain what the hash table does and how it works. So here goes...

The hash table was introduced to the DATA step as a means of performing fast, in-memory lookups. The fact that the hash table is a "component object" means that there's a steeper than usual learning curve for this feature. The SAS Language Reference: Concepts manual provides an introductory section on the hash table, but for many DATA step coders the hash table remains unused. A better route to getting started with hash tables is the "Getting Started with the DATA Step Hash Object" paper in the DATA Step sub-section of R&D's SAS/BASE section of the SAS Support web site (follow the link to Component Objects and Dot Notation).

In a nutshell, the hash table is a lookup table that's stored (temporarily) in memory and allows you to search for values within it and thereby get associated values returned. Let's introduce ourselves to the hash table by taking a two step approach: firstly we'll create the hash table, secondly we'll use it for lookups against each of the rows in our input table. Our DATA Step will look like this:

data result;
  set input;
  if _n_ eq 1 then
  do; /* It's first time through, so let's create hash table */

    <create the hash table>
  end;
  /* For each row in input table, do a lookup in the hash table */

  <do a lookup>
run;


Let's make ourselves some test data and assume it contains the sales from our small car sales operation last week:

data SalesThisWeek;
  length make $13 model $40;
  infile cards dsd;
  input make model;
  cards;
Jaguar,XJR 4dr
Land Rover,Discovery SE
Jaguar,XKR convertible 2dr
;
run;


We have a price list for all of the cars we sell; it's in sashelp.cars and contains variables named make, model and invoice. Frustratingly, the MODEL column contains leading blanks, so we use a quick DATA Step to get rid of them, thereby creating work.cars.

data work.cars; set sashelp.cars; model=left(model); run;

We want to load the price list into a hash table, then lookup each of our sold cars to find its invoice value. Here's the code to <create the hash table>:

  DECLARE HASH notehash (DATASET:'work.cars');
  notehash.DEFINEKEY('make','model');
  notehash.DEFINEDATA('invoice');
  notehash.DEFINEDONE();


Whoa! That code looks like no other SAS code we've ever seen!! That's because the hash table is a "component object" and the syntax for dealing with component objects differs from mainstream DATA Step syntax. It's called "dot notation". It quickly makes sense once you get over the fact that it's different.

The first line tells SAS that we want to create a new hash table. Hash tables only exist temporarily in memory for the duration of the DATA Step. We use the DECLARE statement to begin to create a new component object; the first parameter (HASH) says what kind of component object we want to create; the second parameter (notehash) is an arbitrary name that we have chosen for the hash table; within the brackets we have told SAS that we're going to use some of the columns of the work.cars table as our lookup table.

The following two lines tell SAS a bit more about how we'd like to load and use the hash table; the fourth line (with DEFINEDONE) tells SAS we've nothing more to tell it about the hash table.

When we use dot notation we type i) the name of a component object, ii) an action we want to perform on the object, and optionally iii) parameters for the action. Parts (i) and (ii) are separated by dots, and the parameters (iii) are enclosed in brackets.

When we create a hash table, we have to declare it, then we have to specify a) the key column(s), i.e. the column(s) that we'll use to find something in the hash table, and b) the data column(s), i.e. the column(s) whose values will be returned once the key values are found in the hash table. In our case, MAKE and MODEL are our key columns, and INVOICE is our data column.

After specifying our key and data columns (with the DEFINEKEY and DEFINEDATA actions) we tell SAS that we're done by performing the DEFINEDONE action on the hash table.

The dot notation is different to what we're used to, but it's not too tricky to get your head around.

Now that we've created our hash table in memory, for use during the DATA Step, all that remains is to use it. We look things up in the table by performing the FIND action on the hash table. If SAS finds the key value(s) in the hash table, it will automatically put the associated data value(s) into the data variable(s) in the DATA Step. So, in our case, we need a variable in the DATA Step named INVOICE. If we don't create that variable prior to using the hash table we'll get an error.

When we do a FIND, we're given a return code value that tells us whether SAS found the key(s) in the hash table. A return code value of zero tells us that all is well and the value was found; any other value tells us that SAS did not find the value. So, our code to <do a lookup> will look like this:

  length invoice 8;
  rc = notehash.FIND();
  if rc ne 0 then
    put "Oh dear, we sold something we can't identify";


Notice that there's no direct reference to INVOICE when we do the find. The fact that FIND will put a value into INVOICE is implicit from our preceding DEFINEDATA.

Put all together, our code looks like this:

/* Create our test data */
data SalesThisWeek;
  length make $13 model $40;
  infile cards dsd;
  input make model;
  put make $quote22. model $quote50.;
cards;
Jaguar,XJR 4dr
Land Rover,Discovery SE
Jaguar,XKR convertible 2dr
;
run;

/* Strip leading blanks from MODEL */
data work.cars; set sashelp.cars; model=left(model); run;

/* Add invoice values to sales by using lookup */
data result;
  set SalesThisWeek;
  keep make model invoice;
  if _n_ eq 1 then
  do; /* It's first time through, so let's create hash table */
    DECLARE HASH notehash (dataset:'work.cars');
    notehash.DEFINEKEY('make','model');
    notehash.DEFINEDATA('invoice');
    notehash.DEFINEDONE();
  end;
  /* For each row in input table, do a lookup in the hash table */
  length invoice 8;
  rc = notehash.FIND();
  if rc ne 0 then
    put "Oh dear, we sold something we can't identify";
run;


Once you've got the basic hang of hash tables, the two best sources of reference information are:

a) The hash table tip sheet, available from R&D's SAS/BASE section of the SAS Support web site (see the link to the tip sheet over on the right hand side of the page)

b) Component Objects: Reference in the SAS Programmer's Bookshelf

There are many ways to perform lookups in SAS. Some examples would be i) formats, ii) the KEY= parameter of the SET statement, iii) table joins. The hash table is another option which can offer advantages in many cases. Have fun...
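
For comparison, here's a sketch of the format-based alternative applied to the same example (the data set and format names are made up for illustration). A format needs a single key, so make and model are concatenated here, and the sketch assumes each make/model pair appears only once in the price list:

/* Build a character format from the price list: key is make|model, label is invoice */
data pricefmt;
  set work.cars(keep=make model invoice);
  length fmtname $8 type $1 start $64 label $16;
  fmtname = 'prc';
  type    = 'C';
  start   = catx('|', make, model);
  label   = put(invoice, 16.);
run;

proc format cntlin=pricefmt;
run;

/* Lookup via the format instead of a hash table */
data result2;
  set SalesThisWeek;
  /* ?? suppresses the note if a make/model isn't in the price list (invoice is left missing) */
  invoice = input(put(catx('|', make, model), $prc.), ?? 16.);
run;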

Tuesday 6 December 2011

NOTE: Testing Macros - Parameters Revisited

As my planned series on testing drew to a close last week, I got an email from Quentin McMullen with some very kind words about the NOTE: blog, but also some very erudite comments about my choice of parameters for my testing macros. Rather than paraphrase Quentin's comments, I decided to publish his email verbatim (with his permission). Here's the heart of Quentin's email, followed by a few brief comments from me.
Just a quick thought:

I have a similar macro to %assert_condition, but it only has one (main) parameter, &CONDITION, instead of three; &LEFT &OPERATOR &RIGHT.  So it looks like:

%macro assert_condition(condition,tag=);
 if &CONDITION then
   put "TESTING: &sysmacroname: TAG=&tag, OUTCOME=PASS";
 else
   put "TESTING: &sysmacroname: TAG=&tag, OUTCOME=FAIL";
%mend assert_condition;


So you can call it like:

%assert_condition(incount eq outcount)
or
%assert_condition (age > 0)
or
%assert_condition ( (incount=outcount) )

I tend to like the one parameter approach.

The only tricky part is if you have an equals sign in the condition, you have to put parentheses around the condition so the macro processor does not interpret the left side as a keyword parameter.  The nifty thing is that the parentheses also mask any commas,e.g.:

%assert_condition(gender IN ("M","F") )

Do you see benefits to the 3 parameter approach vs 1 parameter?
Yes, Quentin, I do very much see the benefits of your approach. Your example, using the IN operator, is particularly well chosen. Rest assured I'll be adapting the McMullen approach in future. Thanks for the comments. Keep reading NOTE:!

Thursday 1 December 2011

NOTE: Testing (Presenting the Results)

The preceding two articles in this series on testing presented a simple, generic macro for testing and recording test results. All that remains now is for us to tidy-up some loose ends.

Firstly, the macro assumes data set work.results already exists. And it also assumes that the data set contains appropriate variables named Tag and Result. We can quickly arrange that by being sure to include a call to the following macro in our testing code:

%macro assert_init(resultdata=work.results);
  data &resultdata;
    length Tag $32 Result $4;
    stop;
  run; 
%mend assert_init;

Finally, we want to present our results. We can do this easily with a simple call to PROC REPORT:

%macro assert_term(resultdata=work.results);
  title "Test Results";
  proc report data=&resultdata;
    columns tag result;
    define tag / order; 
  run; 
%mend assert_term;

Equipped thus, we can focus on our testing code, not the mechanics of collating and presenting results. For example, let's imagine we have some new code to test; the purpose of the code is to read a raw file (a.txt), create some computed columns, and write-out a SAS data set (perm.a). One of our tests is to check that the number of rows in the raw file matches the number of rows in the SAS data set. Here's our code to test this functionality:

%assert_init;
%include "code_to_be_tested.sas";
%assert_EqualRowCount(infile=a.txt,outdata=perm.a,tag=T01-1);
%assert_term;


We can make the results a tad more visual by colourising the pass/fail values:

%macro assert_term(resultdata=work.results);
  proc format;
    value $bkres 'PASS'='Lime'
                 'FAIL'='Red';
  run;

  title "Test Results";
  proc report data=&resultdata;
    columns tag result;
    define tag / order;
  define result / style(column)={background=$bkres.};
  run;
%mend assert_term;


This assumes you're using SAS Enterprise Guide. If not, you'll need to add some appropriate ODS statements around the PROC REPORT.
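
If you're running from display manager or in batch rather than EG, a minimal sketch of the wrapping might look like this (the file name and style are just examples):

ods listing close;
ods html file="test_results.html" style=sasweb;

%assert_term;

ods html close;
ods listing;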

The downside of the macros as they stand at this point is that the results data set gets recreated every time we run the code. Maybe we don't want that because we want to collate test results from a number of separate bits of test code. So, finally, we can make the creation of the results data set conditional, i.e. if it doesn't exist we'll create it; if it already exists then we'll leave it alone:

%macro assert_init(resultdata=work.results);
  %if not %sysfunc(exist(&resultdata)) %then
  %do;
    data &resultdata;
      length Tag $32 Result $4;
      stop;
    run;
  %end;
%mend assert_init;

NOTE: Advent Calendar 2011

Today is 1st December, and that can mean only one thing: time to turn to the traditional SAS Professionals advent calendar! There will be prizes every day throughout December, so make sure you visit http://www.sasprofessionals.net/ every day.

And if you're not already a member, sign-up, check-out the Expert Channel, the forums, and benefit from the ability to network with fellow SAS practitioners online! Attend next year's SAS Professionals Convention in Marlow (July 10 to 12) and learn from papers delivered by a mixture of SAS staff and customers.

Tuesday 29 November 2011

NOTE: Testing (Collating the Results)

We began this series of posts on testing with an introduction to testing disciplines, and I followed that with a description of how we could quickly create some small macros that would allow us to automate some of our testing and present the results in a consistent fashion in the SAS log. However, putting the results into a SAS data set rather than the log was the next step to improve our efforts.

We can use the tag as a unique key for the test results data set so that the data set has the most recent test result for each and every test. We can re-run individual tests and have an individual row in the results data set updated.

To give us the greatest flexibility to add more macros to our test suite, we don't want the process of writing results to a data set to interfere with activities that are external to the macro. Updating it with conventional DATA step statements, for example, would require the results data set to be named in the DATA statement. This seems a good opportunity to use the OUTPUT method for a hash table. We can load the results data set into the hash table, use the TAG as the key for the table, and add/update a row with the result before outputting the hash table as the updated results data set. Here's the code:

%macro assert_condition(left,operator,right,tag=
                       ,resultdata=work.results);
  /* Load results into hash table */
  length Tag $32 Result $4;
  declare hash hrslt(dataset:"&resultdata");
  rc = hrslt.defineKey('TAG');
  rc = hrslt.defineData('TAG','RESULT');
  rc = hrslt.defineDone();
  /* Update the hash table */
  tag = "&tag";
  if &left &operator &right then
    result="PASS";
  else
    result="FAIL";
  rc=hrslt.replace(); /* Add/update */
  /* Write back the results data set */
  rc = hrslt.output(dataset:"&resultdata");
  rc = hrslt.delete();
%mend assert_condition;


By adding the maintenance of the results data set to our basic assert macro, the functionality gets inherited by any higher-level macro (such as yesterday's %assert_EqualRowCount).

Clearly, the new macro won't work if the results data set doesn't already exist, and we'd like to present the results in a format better than a plain data set. We'll cover that in the next post.

NOTE: Testing is Like Visiting the Dentist?

In a comment in response to my recent Testing - Discipline article, Rick@SAS provided a link to an article he'd posted on the American Statistical Association community forum wherein he drew an analogy between testing and going to the dentist. I think it's a very well drawn analogy. Thanks Rick.

Thursday 24 November 2011

NOTE: Testing (With Basic Macros)

So, having stressed the importance of testing in my previous post on the subject, let me give you some hints on how I keep the test phase efficient and effective on my projects. This will be a series of hints across the next few days. Today I'll offer some tips for automating your tests, and I'll describe a simple macro that you can use to highlight test results in your log. As the series continues I'll describe how to enhance the macro and easily add a reporting system to summarise your test results in one place.

There is a degree of overlap between FUTS (mentioned yesterday) and the macros that I shall describe. I recommend FUTS, but some of my clients find its size intimidating, and others don't have authority to download anything (especially "code") from the internet.

Let's assume you have a set of test scripts, i.e. steps for the tester to take, accompanied by expected results for each step. Let's take a simplified example:

1) Run the customer load. Inspect the SAS log. Expect no error or warning messages.

2) Count the rows in the input customer file (CSV) and the warehouse customer table (SAS data set). Expect the number of rows in each to match. [I said it's a simplified example!]

We can automate the second test (the first too, but that's for another time). The benefit of automating is that we can re-run it quickly and effectively (after test failure, and for regression testing).
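
The row-count macro itself isn't reproduced in this excerpt, so here's a sketch of how such an %assert_EqualRowCount might be built (it assumes the raw file has no header row, and it borrows a log-message style similar to the assert macros discussed in this series):

%macro assert_EqualRowCount(infile=,outdata=,tag=);
  %local dsid inrows outrows rc;

  /* Count the lines in the raw file */
  data _null_;
    infile "&infile" end=eof;
    input;
    if eof then call symputx('inrows',_n_);
  run;

  /* Count the observations in the SAS data set */
  %let dsid    = %sysfunc(open(&outdata));
  %let outrows = %sysfunc(attrn(&dsid,nlobs));
  %let rc      = %sysfunc(close(&dsid));

  /* Report the outcome in the log, in a consistent format */
  data _null_;
    if &inrows eq &outrows then
      put "TESTING: &sysmacroname: TAG=&tag, OUTCOME=PASS";
    else
      put "TESTING: &sysmacroname: TAG=&tag, OUTCOME=FAIL";
  run;
%mend assert_EqualRowCount;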

Tuesday 22 November 2011

Testing - Discipline

My dad says "if a job's worth doing, it's worth doing well". I say "if a bit of code is worth writing, it's worth testing it properly". Maybe I'm stretching the old saying a little, but the principle remains true.

Software testing is a very large subject area; I'm not going to try to reproduce a text book here. I'm simply going to list some of the principles I apply to the testing phases of my projects and then show some useful macros that I have developed to aid the re-use of tests. There are many different types of test phase, each with different objectives. Some of these were briefly covered in my "SAS Software Development With The V-Model" paper at this year's SAS Global Forum (SGF).
  • To test something, you need to know what it should do, in all circumstances. This means you need to have established an agreed set of requirements and/or specifications.
  • There are a number of reasons why you might need to re-run a test - because the test failed, or for regression testing. For this reason, and for others, automated tests are preferable to manual tests.
  • Look upon your tests as an investment. Firstly, finding bugs before go-live is always "a good thing" for a number of reasons. But secondly, tests invariably need to be re-run, so the more effort you put into them the more they'll repay you when you have to re-run them. A library of re-usable tests is an asset.
  • Don't just test the "happy path" for your system. Test that the system rejects bad input and handles unexpected situations elegantly. This is called "Negative Testing". In simple terms this might mean testing with values of zero, one, two, negative, non-integer, and very large numbers.
  • Document your test strategy. This includes stating which testing method & tools will be used for each different type of system element, e.g. data entry screens, report-generation wizards, small files, big files, important reports (to be sent to regulatory authorities, for example), less important reports (for internal information only, for example).
  • Document your test plan and test cases, i.e. the individual steps (and expected results) that the tester should follow.
  • Documenting your test steps means that they can reliably be re-run if the tests have to be done again.
  • With regard to documentation, I always preach the "barely adequate" approach, i.e. do what needs to be done ("adequate") but don't go beyond ("barely"). In order to do this, you need to clearly understand the objectives of each document and the intended audience(s). Sometimes you need separate documents; sometimes you can put all of the content into one document.
So, having stressed the importance of testing, let me give you some hints on how I keep the test phase efficient and effective on my projects. Actually, I'm going to offer a series of hints over the next few days. In the first I'll offer some tips for automating your tests, and I'll describe a simple macro that you can use to highlight test results in your log.

I'll finish today by recommending a suite of SAS macros named FUTS (Framework for Unit Testing SAS programs) from Thotwave. These are available for free download after registering with the site (the download includes documentation and some examples of usage too). Developed by Greg Barnes-Nelson and colleagues, the macros are pure gold.

You can read background to the macros in the following SAS conference papers, which chart their development and use (from their original incarnation as SASUnit through to FUTS):

Automated Testing and Real-time Event Management: An Enterprise Notification System, SUGI 29, 2004

SASUnit: Automated Testing for SAS, Phuse, 2004

Drawkcab Gnimmargorp: Test-Driven Development with FUTS, SUGI 31, 2006

NOTE: SAS Global Forum 2012 - One door closes, another opens

The closing date for submission of papers for SAS Global Forum (SGF) 2012 has passed, but registration is now open. If you can get yourself to Florida, USA 22-25 April 2012, you will benefit from a vast array of papers (from SAS staff and from SAS customers), plus myriad opportunities to talk to and learn from fellow SAS practitioners.

Register before March 19th 2012 to get the best deals. Go with 3 or more colleagues and get a greater discount.

SGF is an annual event organised by the SAS Global Users Group, a non-profit organisation that is open to all SAS software users throughout the world. SAS Global Users Group is governed by an Executive Board whose membership is composed of individuals who have been selected to chair the annual conference plus three SAS Institute representatives. The organisation was formed to provide a means for SAS software users to exchange ideas, explore ways of using SAS software, and participate in activities of mutual interest. Since its first event in 1975 (known at that time as SAS Users Group International - SUGI), SGF has been the premier event of the year for SAS practitioners worldwide, offering educational and networking opportunities. This year's conference is in the safe hands of chair Andy Kuligowski and will doubtless be the best ever.

Wednesday 2 November 2011

NOTE: Every Day's a Learning Day!

[Image: Pope Gregory XIII]
The purpose of me attending SAS courses is to teach, and I'd like to think I'm successful at doing that, but not a course goes by without me learning a little tip or trick from one of the learners. Today's was most interesting...

We all know that SAS dates are stored as numbers, and the number represents the number of days since 1st January 1960 (or, number of days since my birthday plus 890 as I prefer to think of it!). Hence, 1st January 1960 is zero, and 2nd January 1960 is one.

A learner asked "what about dates before 1960, does SAS use negative numbers", to which I replied in the affirmative, but I then tried to enrich the answer by saying that you can only go back to 1548 because King Henry VIII changed the calendar at that time. Another learner politely pointed out that King Henry died in 1547, so I couldn't possibly be correct (my learners are well educated!).

Some quick research in the next break revealed some interesting information in the SAS 9.3 Language Reference: Concepts manual in the About SAS Date, Time, and Datetime Values section. Two things caught my eye:

1) SAS can perform calculations on dates ranging from A.D. 1582 to A.D. 19,900

2) SAS date values can reliably tell you what day of the week a particular day fell on as far back as September 1752, when the calendar was adjusted by dropping several days

So, (1) confirms that I got the year wrong, and it's long after King Henry's death, so I was proved utterly wrong! Further research turned-up an explanation of events in 1582. SAS Knowledge Base article 24808 by William Kreuter describes Calculating Age with Only One Line of Code but also mentions "Pope Gregory XIII proclaimed the Gregorian calendar in 1582" to properly deal with leap years. So there's the full explanation. Where I got my belief that it was 1548 and King Henry VIII I do not recall.

Item (2) was interesting for me too because I had not realised that SAS's day of the week capabilities only worked for a subset of the SAS date range.

So, every day's a learning day - teachers included!

Tuesday 1 November 2011

Avoid Magic Numbers

Mention of my university days last week reminded me of an old maxim that my professor drummed into me. Avoid magic numbers.

Have you ever had to amend a colleague's code and found a number in the middle of a calculation and wondered what the significance of the number was? Did it make you nervous about changing the code because you didn't fully understand it? Your colleague coded a "magic number".

The term "magic numbers" refers to the practice of including numbers in source code. Doing so is likely to leave very little explanation as to why the developer chose that number, and thus the program becomes difficult to confidently maintain. It is far preferable to declare numeric constants at the top of your program as macro variables. This has a number of advantages:

1) Any mis-typed reference to the numeric constant will be highlighted by the SAS macro processor or compiler as an uninitialised variable, whereas a mis-typed number can be very difficult to spot

2) The numeric constant can be defined once and then used many times throughout the program. Thus, if it needs to change, the change needs to be made just once, and there's no danger of missing one or more occurrences that also need to be changed

3) Placing the definition at the top of the code makes it very easy to add some comments around it that describe the rationale for the value, plus advice for making changes.

As you can see, there are distinct and compelling reasons to avoid magic numbers in your code. To illustrate the point, consider the following snippet of code:

data ada;
  set sashelp.buy;
  GrossPrice = amount * 1.2;
run;

The code offers no indication/explanation that it is adding Value Added Tax (VAT) to the price of each item. We could re-code things this way:

%let VATrate = 0.20; /* VAT is currently 20% - 20OCT2011 */

data ada;
  set sashelp.buy;
  GrossPrice = amount * (1+&VATrate);
run;

We have documented the meaning of the number, and used a descriptive name throughout our program. And, if we mis-type the name of the macro variable in our DATA step we'll be rewarded with an error message, whereas mis-typing 1.2 in our earlier code may have gone unnoticed.

Inevitably there are drawbacks to this approach (such as making the code more complex in some people's eyes, and slowing the compilation process by a fraction) but the benefits, in my opinion, far outweigh the drawbacks.

There are cases where it is acceptable to use numeric constants. The true and false values (1 and 0) are two such examples.

Making your code data-driven and thereby eliminating the use of numeric constants is better still. But that's a topic for another day!

NOTE: Macro Options

Last week I wrote about the availability of the IN operator in macro through the use of the MINOPERATOR system option. I mentioned that SAS v9.0/9.1 had introduced a number of useful features to the SAS macro capability. Here are some options that you might find useful:

MPRINTNEST - Augments the information written to the log by MPRINT. If the MPRINT option has also been set, MPRINTNEST will show the level of nesting of macro calls, and the names of the nested macros. Very useful. V9.3 introduced a couple of new macro functions in this area: %SYSMEXECDEPTH returns the depth of nesting from the point of call, and %SYSMEXECNAME returns the name of the macro executing at a nesting level.

MCOMPILENOTE - Issues a message to the log after a macro has been compiled. Successful completion of macro compilation is normally silent, so this forms a handy, positive confirmation.

MAUTOLOCDISPLAY - Echoes the source's location to the log each time an autocall macro is used (rather than once at compile-time). This provides a neat confirmation that SAS has picked-up the macro source from the directory you intended. In v9.3, the MAUTOCOMPLOC option writes the source location of an autocall macro to the log when the macro is compiled.
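
If you want to try them out, here's a one-statement sketch (pick whichever options are relevant to you):

options mprint mprintnest   /* show the nesting of macro calls in MPRINT output       */
        mcompilenote=all    /* write a confirmation note when each macro compiles     */
        mautolocdisplay;    /* echo the autocall source location each time it's used  */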

NOTE: Business Intelligence Notes

Tricia Aanderud (@TAanderud) commented on my post about the late Dennis Ritchie. Her observation that she frequently finds herself creating elaborate SAS programs to avoid using the dreaded command line rang true with me. Big time!

Tricia and Angela Hall (@Angela, Real BI for Real Users blog) are working on a new SAS publication entitled Building Business Intelligence Using SAS. Judging by their recent tweets it'll be published soon. The advance information I've seen makes the book look a real winner.

And another winner is their joint web site. It'd somehow passed me by until Tricia subtly introduced it in her comment. In addition to being beautifully presented, http://www.bi-notes.com/ is full of useful BI tips and tricks - tasters of what is to come in their book. I've already added it to my Google Reader subscriptions, and I've added it to NOTE:'s blogroll too. Recommended.

From a quick overview of SAS EBI to squeezing more visible columns into your Web Report Studio output, there's plenty of interest.

Tuesday 25 October 2011

NOTE: There's No In In Macro (or is there?!)

I love the SAS macro language and the power it gives you. Being able to control the creation of a SAS program, or its execution path, is immensely valuable. The SAS macro language contains the vast majority of the functional and logic capabilities of the Base SAS language, but one notable exception is the ability to use the IN operator in a %IF expression. However, you might be pleased to know there's a little-known option to allow the use of IN in macros.

SAS v9.2 introduced the MINOPERATOR system option. It defaults to NO, but if you switch it on then you gain the use of the IN operator in SAS macro language. See the code example in the SAS log below:

26 option minoperator;
27
28 %macro babbage;
29
30   %if &sysver in 9.2 9.3 %then
31     %put I have the use of MINOPERATOR! ;
32   %else
33     %put I have no MINOPERATOR option ;
34
35 %mend babbage;
36
37 %babbage
I have the use of MINOPERATOR!


The SAS macro IN operator has had a chequered history. Introduced in v9.0/9.1 (together with a lot of other nice new macro features), it disappeared in later versions of 9.1 (for instance, the copy of 9.1.3 I have in front of me does not support IN). Regardless, it seems to be back to stay now! I presume there are some possible syntax clashes with existing SAS programs and that's why it firstly disappeared and secondly reappeared as a switched-off option.

You'll notice that the syntax differs a little from Base SAS, but it's all described in the online help. In short, there are no brackets around the list of values, and the values are separated by spaces. You can use the MINDELIMITER system option to specify an alternate delimiter. You can use the hash character (#) instead of "in", but I suggest that would just make your code hard to read.
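
For example, here's a sketch of switching the delimiter to a comma (the macro and its list of values are made up for illustration):

options minoperator mindelimiter=',';

%macro check_region(region);
  %if &region in north,south,east,west %then
    %put &region is a valid region;
  %else
    %put &region is not recognised;
%mend check_region;

%check_region(east)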

If you're not already a keen user of SAS macro, find out more and see what you're missing!

NOTE: Updated Pocket Reference

When Rick Aster first saw his book Professional SAS Programmer’s Pocket Reference published can he have imagined that a sixth edition would be published nearly twenty years later? Well, it has. Suitably updated for SAS 9.3.

There's a good reason why Rick's book is still in print: people find it an invaluable tool. The book offers key information about SAS statements, options, routines, and more, all in a compact, easy-to-use, handily-sized format.

To find out more about Rick, and to see the variety of other SAS resources on offer, see Rick's Global Statements web site. For some crosswords with a SAS and technology theme (and some other R&R activities), see Rick's PROC RASTINATE page. Don't delay.

Dennis Ritchie (1941-2011)

I was saddened to hear of the death of Dennis Ritchie recently. As creator of the C programming language and co-creator of unix, there are elements of his work throughout my personal and professional life: Android is built on Linux (a flavour of unix), Apple iOS has unix underpinnings, and many of my clients run unix or linux servers for SAS software. As a Computer Science student between 1980 and 1983, C and unix were my bread and butter.

As somebody who has struggled, not yet mastered, but at least come to terms with unix, I'm amused by a quote associated with Dennis:
UNIX is basically a simple operating system, but you have to be a genius to understand the simplicity
Dennis's name and work may not be as well recognised as others with greater marketing and self-promotion skills, but rest assured that the scope and influence of his work is far greater.

Dennis, we salute you. RIP.

Tuesday 18 October 2011

NOTE: (Constructive) Feedback is Good

Sometimes it's adequate that SAS simply does what you ask. On other occasions, it's useful to know a little more detail of what SAS is doing on your behalf - maybe because it's making decisions for you, or maybe you're trying to figure-out a problem with your code.

Adding the FEEDBACK keyword to a PROC SQL statement will result in a bunch of useful information being written to the SAS log:
  • Any asterisk (for example, SELECT *) is expanded into the list of qualified columns that it represents. 
  • Any PROC SQL view is expanded into the underlying query. 
  • Macro variables are resolved. 
  • Parentheses are shown around all expressions to further indicate their order of evaluation. 
  • Comments are removed.
Here's an example log, showing the SELECT * being expanded and a macro variable being fully resolved (note how the column name is prefixed with the table name, and the sort order (ASC) is shown):


15 %let ob = age; 
16 proc sql feedback; 
17   select * 
18     from sashelp.class 
19     order by &ob 
20     ; 
NOTE: Statement transforms to: 


        select CLASS.Name, CLASS.Sex, CLASS.Age, CLASS.Height, CLASS.Weight 
          from SASHELP.CLASS 
      order by CLASS.Age asc; 


 21 quit; 

Quite apart from revealing some of what SAS is doing behind the scenes, the expansion of SELECT * into the log allows you to subsequently copy/paste the column names back into your program and then remove one or two that you don't want to keep.

Another good option for revealing information is OPTION MSGLEVEL=I. Setting this option will also result in some hitherto "secret" information being written to the SAS log. For example, SAS will tell you if it has used an index (and it will tell you which index it has used). Here's an example log (demonstrating how indexed data sets can be successfully read through a BY statement even if the data set is not physically sorted):

15 option msglevel=i; 
16 data holiday; 
17   set maps.spain2; 
18   by regname; 
19 run; 
INFO: Index REGNAME selected for BY clause processing. 
NOTE: There were 53 observations read from the data set MAPS.SPAIN2. 
NOTE: The data set WORK.HOLIDAY has 53 observations and 10 variables.

The FEEDBACK and MSGLEVEL options are both useful techniques to get more of an insight into what SAS is doing and hence what your program is doing.

NOTE: SAS Global Forum (SGF) 2012 Call For Papers

Next year's SAS Global Forum will be held in Orlando, Florida, USA from April 22 to 25. As ever, it will be your best opportunity of the year to get up-to-date with SAS news and latest releases; pick up a bucket-load of hints, tips, and experiences; meet SAS's developers and pick their brains.

Why not make your own contribution to the event? Consider submitting a paper or poster. You have until November 14 to make a proposal. If you've not done such a thing in the past, fear not, some of the section heads are very supportive of new blood and will provide invaluable help, advice and feedback. Just go to the event home page and click the Get Involved link near the top of the page.

Will I see you there?

Wednesday 12 October 2011

NOTE: More Tip Sheets

In response to the SAS/Base tip sheets I mentioned a couple of weeks back, Rick Wicklin sent me a comment to say he'd been inspired to publish a couple of tip sheets for SAS/IML. He's highlighted these in his latest posting on his The DO Loop blog. One is for standard SAS/IML coding, the other is for the IMLPlus language - available through SAS/IML Studio. Both look very smart and very useful.

Tuesday 11 October 2011

NOTE: IBM to Acquire Platform Computing

There's interesting news coming out of IBM this afternoon with their announcement that they are to acquire Platform Computing. This is of relevance to SAS customers because Platform Computing supplies key parts of some SAS solutions (such as LSF for scheduling and RTM for grid monitoring).

IBM would like to see themselves as competitors for SAS. Their disparate portfolio of products includes SPSS, Cognos and DataStage. All have been acquired (rather than developed in-house) and all provide various levels of integration with each other. IBM may be reluctant to share their soon-to-be-acquired technology with competitors such as SAS. On the other hand, SAS may have a firm, long-term supply contract with Platform.

If we assume that SAS were aware of IBM's interest in Platform, one wonders whether they considered out-bidding IBM. Or do SAS have an in-house set of solutions that they might roll-out soon?...

Clearly, nothing's going to suddenly stop working, and nothing's going to change in the near future, so this acquisition isn't a concern for SAS customers. However, it's a matter of interest for SAS customers, and I for one am intrigued to see what the longer-term outcome of this acquisition turns-out to be.

NOTE: Enterprise Guide Autoexec

I'm part-teaching 10 Tips for Organising Your SAS Enterprise Guide Projects (EG1) in Marlow this week. Thus I was all the more interested to note the latest posting in the SAS Dummy blog wherein Chris Hemedinger lists his top ten EG tips.

The tips are applicable to all levels of EG expertise. Quite apart from being reminded that we can change the background colour of each process flow (tip 8), I was most grateful for the reminder of the ability to have an automatically executed process flow in EG 4.3 (tip 9). Chris provided some background/history to this back in January. If you're using EG 4.2 you can't use the autoexec process flow to provide a project-specific autoexec, but you can use Tools -> Options -> SAS Programs to create a "user-specific" autoexec.

NOTE: Free SAS Book (ODS Graphics)

You don't have to have an Amazon Kindle device to read Kindle books, which is why the free copies of the Kindle version of SAS 9.3 ODS Graphics caught my eye and led me to acquire a copy for myself (albeit through a convoluted route because I live in the UK).

I have the kindle reader software on my PC, my Android tablet, and my Android phone, so I can get myself updated on SAS graphics almost anywhere I like. Amazon's own readers are limited to black and white viewing (except for the new Kindle Fire), whereas my PC and Android devices display the book's contents in colour - handy for a book about graphics.

This Kindle book is free until October 16. Don't hesitate, don't miss out!

Thursday 6 October 2011

NOTE: Security Metadata Tool Enhanced (Metacoda Security Plug-Ins)

Back in April at SAS Global Forum (SGF) I wrote to say how I was impressed by the Metacoda Security Plug-Ins. SAS's security capabilities are not in question; the Metacoda Security Plug-Ins are a collection of plug-ins for SAS Management Console offering a large number of new, business-focused views on your SAS security metadata, making SAS's range of security capabilities more accessible, understandable, and easier to use.

The version I saw in April worked for SAS version 9.1.3 and 9.2. Michelle from Metacoda recently dropped me a line to highlight version 2 of the plug-ins. Version 2 introduces support for SAS 9.3. I recommend you take a look and contact Metacoda for more information.

And, Metacoda's Paul Homes's blog is featured in the right-hand margin of the NOTE: web site. If your job responsibilities/interests include SAS platform administration then Paul's blog will undoubtedly be a good read for you.

BBC, on SAS

The BBC's Technology of Business site featured an interview with SAS's Chief Technology Officer (CTO) Keith Collins this week. The interview is themed on "big data" and it includes Keith's thoughts on the accelerating interest in understanding one's customers. Use of the internet as a channel for promoting and selling products and services means that even the smallest of organisations can reach large volumes of (potential) customers. All of these companies are potential beneficiaries of customer intelligence.

It's an interesting interview, and worthy of a read.

Keith finishes-off by talking of one or two mistakes he feels he's made, one of which was not realising how fast the market for tablet computers would develop. If you thought there was a lot of mobile BI stuff at SAS Global Forum this year, next year will be even bigger!

Friday 30 September 2011

NOTE: Base SAS Tip Sheets

I think I may have mentioned these in passing before, but they're certainly worth a repeat reference. The right-hand margin of the Base SAS Focus Area on the SAS Support web site offers a number of free tip sheets for download. I've been using the PERL Regular expressions sheet a lot recently. The Hash Object sheet is a popular handout when I'm teaching PRG3 - Advanced techniques and Efficiencies.

Why not earn yourself some Brownie points within your team by printing-off some of these sheets, laminating them, and handing them around the team?

Wednesday 14 September 2011

Predictive Modelling Competition Ends Soon

Back in August I mentioned that one of my clients was running a predictive modelling competition with a $10,000 prize fund. The competition closes on Friday September 30th, just a couple of weeks away, so if you intend to enter but haven't done so yet, you need to get your skates on!

I'm no predictive modeller, so I entered the competition out of sheer curiosity. At the time that I made my entry I was ranked 90th, today I'm 151st. I guess that none of the prize fund will be heading in my direction!

Tuesday 13 September 2011

NOTE: Substringing Without Notes

The SUBSTR function is much-used, but it can be frustrating when it issues an abrupt NOTE to the log because the string isn't long enough to produce a result.

If the design of your code means that you're happy to allow a null result from your sub-string function, consider using SUBSTRN. This function provides the same essential functionality as SUBSTR but will not issue a NOTE message if it returns a null result. The two DATA steps below show the contrast in results:

15 data _null_;
16   MyString = 'Shorty';
17   a = substr(MyString,1,10);
18   b = substr(MyString,15,10);
19   put a= b= ;
20 run;

NOTE: Invalid third argument to function SUBSTR at line 17 column 8.
NOTE: Invalid second argument to function SUBSTR at line 18 column 7.
a=Shorty b=
MyString=Shorty a=Shorty b= _ERROR_=1 _N_=1
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

21
22 data _null_;
23   MyString = 'Shorty';
24   an = substrN(MyString,1,10);
25   bn = substrN(MyString,15,10);
26   put an= bn=;
27 run;

an=Shorty bn=


Notice in particular how the attempt to assign a value to B where the starting position is beyond the length of the string produces _ERROR_ set to 1.

In the right situation, SUBSTRN can be very handy.

Tuesday 6 September 2011

NOTE: Charity Drive Ends Successfully

Congratulations to David Shannon and his brother Paul for completing their 1,200+ mile journey from Land's End to John O'Groats, by B-roads, in their 1974 MGB Roadster.

Even bigger congratulations to them for raising over £2,100 for their chosen charity, exceeding their £1,200 target by quite some margin. It's not too late to contribute if you would like to offer them congratulations in a material way.

The constant stream of tweets has told us that their car was totally reliable. David and Paul sound very tired, but satisfied with the range of teas and cakes they've sampled along the way!

NOTE: Better Access to SYSTASK and OS Commands in SAS 9.3

Back in June I extolled the virtues of the SYSTASK statement, but was frustrated by my inability to use it in certain circumstances. So I was very pleased to see Chris Hemedinger writing in his blog recently that the default restrictions on using SYSTASK (and its equivalents) have been eased in appropriate circumstances, such as when using EG and SAS on the same PC.

Thanks for the good news, Chris.

NOTE: Changing the Default Length for Numeric Values

Choosing appropriate lengths for the variables in your SAS data sets can have a significant effect on the size of your SAS datasets and hence a) the disk space they consume, and b) the speed with which they are written to and read from disk.

The default length for numeric values is 8 bytes; this is the maximum length too. For non-integer values you shorten the length at your peril, but if you're using integer values then you can consider shortening the length(s) and saving disk space.

The LENGTH statement's DEFAULT parameter allows you to specify a default length for newly-created numeric values. This can be useful if you are creating a number of date variables and/or binary flags.
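
As a sketch (the input data set here is hypothetical), dates and flags sit comfortably in 4 bytes:

data daily_flags;
  length default=4;       /* newly created numeric variables get a length of 4 bytes          */
  set work.transactions;  /* hypothetical input; its own variables keep their existing lengths */
  load_date = today();    /* SAS dates are integers, so 4 bytes is safe                        */
  is_active = 1;          /* binary flag, safe at 4 bytes                                      */
run;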

SAS paper TS-654 provides plenty of information on numeric precision. Crucially, the table at the bottom of the second page shows the range of integer values you can safely store in variables of reduced length.

NOTE: SAS Available on iPad! Free!!

Have you seen the SAS News app (available from the iTunes App Store)? It looks to be a neat alternative to following SAS blogs and press channels through an RSS reader. In SAS's own words: "Follow and share the latest news, social feeds, product information, customer stories and more from SAS, the leader in business analytics software and services."

Okay, I confess, I was just aiming for an eye-catching headline! I'm not an Apple customer, so I shan't be downloading it soon.

Perhaps of more interest, SAS have a second product in the iTunes App Store. If you have an iPad, iPhone or iTouch, take a look at SAS Flash Cards. This is an education/business product that is not directly related to SAS software. The free app allows you to create your own decks of flash cards for use in teaching and/or presentation situations. It's worth a look if you have an iOS product to run it on. If you gave it a try, let me know what you think...

Wednesday 31 August 2011

NOTE: BI Adoption

I spotted a nice post from Lisa Pappas in the SAS Voices blog recently. It's entitled "Boost your BI adoption rates with mobility & collaboration, part 1" and I'm looking forward to the second (and final) part.

Lisa describes how recent developments in four key areas will help promote the adoption of BI by a much wider audience. Using Roambi to illustrate her points, Lisa talks us through Mobile BI and Collaboration in part 1; part 2 will cover Interactive Data Visualisation and Search.

The first article is brief (I guess the second will be too), but it's an interesting reminder of current BI trends.

NOTE: CALCULATED in SQL

If you use SQL, either hand-cranked from the Editor, or generated via Enterprise Guide or DI Studio, you're probably familiar with creating new columns ("computed columns"). However, have you ever struggled to create a computed column from a computed column?

You can do this easily in SQL, but the syntax isn't immediately obvious. For example, calculating body mass index could be attempted with a two-stage calculation:

18 /* Assume height measured in metres, weight measured in kilos, */
19 /* then BMI = weight / height^2 */
20 proc sql;
21   create table alpha as
22     select  class.*
23            ,height * height  as heightSquared
24            ,weight / heightSquared  as bmi
25     from sashelp.class;
ERROR: The following columns were not found in the contributing tables: heightSquared.

We can create a "heightSquared" computed column, but trying to use heightSquared to create bmi fails.

The solution is simple: just add the keyword "calculated" in front of the computed column when you want to refer to it. Here's the corrected select expression:

     select  class.*
            ,height * height  as heightSquared
            ,weight / CALCULATED heightSquared  as bmi

You can use the CALCULATED keyword in all places where you can use SQL in SAS.
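
For instance, it works in a WHERE clause too (something standard SQL doesn't allow with column aliases); a quick sketch:

proc sql;
  select  name
         ,height * height  as heightSquared
    from sashelp.class
    where CALCULATED heightSquared > 4000;
quit;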

NOTE: SAS Blogs Have Moved

I'm rather late in highlighting this, but "better late than never".

The SAS blogs have moved their locations. The web pages have moved, and so have the RSS feeds. So, for instance, the blogs of Angela and Chris, who I mentioned yesterday, have moved to the following locations:

Real BI for Real Users (Angela Hall): http://blogs.sas.com/content/bi/

The SAS Dummy (Chris Hemedinger): http://blogs.sas.com/content/sasdummy/

Both blogs are highly recommended and you can always find their latest posts on the home page of NOTE:.

Tuesday 30 August 2011

NOTE: Charity Drive Starts Tomorrow

David Shannon, known far and wide in the UK SAS world (and beyond), is starting a charity drive from Land's End to John O'Groats (LEJOG) tomorrow, i.e. from the most south-westerly tip of Britain to its far north-eastern corner.

You may recall I mentioned this back in January, and also highlighted the fact that my wife had done the same trip (1,000+ miles) on a bike in ten days. David and Paul plan to do it in five days.

The drive will be fun(!) but is also in aid of Cancer Research UK. Never one to shirk a challenge, David (and his brother, Paul) will be making the journey in a 1974 MGB Roadster, avoiding motorways and A-roads.

You can follow David and Paul's adventure via their blog. If you'd like to offer them support and encouragement, you can contribute to Cancer Research UK via their JustGiving page.

Good luck chaps!

NOTE: Keyboard Shortcuts in the SAS Enhanced Editor (revisited)

My recent article on keyboard shortcuts generated a number of valuable responses.

PeterC (seconded by Chris@SAS) mentioned that Ctrl+i, available in EG 4.3, would automatically re-indent my code and thereby tidy it up. A very good recommendation, but sadly the client in question is still running SAS 9.1.2 and hence EG 4.1.

Angela@SAS said that one of her favourite keyboard shortcuts is ctrl+? (properly known as ctrl+/). It's probably my most-used shortcut too. Ctrl+/ turns the selected lines into comments; and ctrl+shift+/ removes comments from the selected lines.

NOTE: Upgrading to SAS 9.3

Paul Homes of Metacoda in Australia (suppliers of the Metacoda Security Plug-Ins for SAS) recently published a very well written blog article on SAS 9.3, the installation thereof. Amongst all of the other good information and advice from Paul, I was particularly struck by his recommendations about the install. I've quoted them below.
Whilst SAS installations/migrations are getting more straightforward, I’d still recommend getting SAS Professional Services or one of the SAS partners to help – being businesses that install, migrate and configure SAS software on a daily basis they are very efficient and have extensive support resources at their disposal. I can’t imagine how it could be cost effective for SAS customers to acquire those skills when it’s something they might only do once or twice every few years.
Paul's comments surprised me, but after I'd given them a little more thought I concluded that I was inclined to agree. SAS UK (and many other local SAS offices) has a specialised team of installers, and many partners (such as Amadeus in the UK) offer a specialised install service. So, I certainly can't disagree with Paul's assertion that SAS and their partners do installs on a daily basis and can thus do a better job than most clients.

I'd offer some further advice though. Be sure that your chosen installers:
  1. have a clear and detailed architectural plan to work from (either supplied by you, or created by your installers and handed-over to you), and
  2. provide you with a clear and detailed report on i) the steps they took during the install, ii) the key options they selected during the install, and iii) deviations from the original architectural plan (with reasons for doing so).
I've seen several clients who out-sourced an install and then found themselves with a system of unknown provenance that they were unable to replicate for new development, test and/or production environments.

The typical SAS 9 install (if there is such a thing) is a far more complex beast than was version 8. SAS 9 deployments incorporate a lot of different technologies from a number of suppliers. The benefit of this is a SAS system that can deliver its insights in a far broader and deeper fashion than ever before; the cost of this is the increased complexity. Be prepared to out-source the install, but also make sure you're able to maintain the environment yourself after your installer has packed-up and left your site.

Tuesday 23 August 2011

NOTE: Keyboard Shortcuts in the SAS Enhanced Editor

I was recently looking at a long piece of macro code where each level of indentation only used one space. It was very difficult to see where some of the %do / %end blocks began or ended. I was reminded that there's a keyboard shortcut for jumping between matching do / end blocks, so I hunted-down the complete list of shortcuts for use in the Enhanced Editor.

Sadly, I didn't find the ability to skip between matching %do / %end pairs (it only works for non-macro do / end blocks), but I was reminded of Alt+[ for moving between matching do / end pairs, and Ctrl+[ for moving between matching brackets. And, not spotted before, I found the ability to use F2 to set one or more (non-permanent) line markers and then use F2 (or Shift+F2) to jump quickly between them. That's surely very useful when you're working on two (or more) disparate areas of a large piece of code and repeatedly jumping between them.

Looking over the whole list of shortcuts, it's probably not strictly necessary to document the fact that pressing the Enter key will "insert a carriage return"! But it's probably better to have too much documentation than too little.

I use ctrl+home and ctrl+end a lot in very many Windows apps to get to the top and bottom of a file. And I use ctrl+f in many Windows apps to give me the find dialog too.

Finally, I was surprised to find shortcuts for recording and playing macros. Very useful for doing repeated editing of your code. I seem to recall this was available throughout Windows but I don't see it any more. My thanks to SAS for retaining its availability.

Monday 22 August 2011

Predictive Modelling Competition

One of my clients is running a predictive modelling competition. The competition is free to enter and has a $10,000 prize fund. Interested? Read on...

The competition is being hosted at Kaggle. It's very simple to register and to enter. The challenge is to predict when supermarket shoppers will next visit a store and how much they will spend. You're provided with a set of modelling data, and a test data set. You can enter the competition as many times as you like, there's a leaderboard that shows the score (percentage of correctly forecast visits) you need to beat, and you get immediate feedback on how well you've done.

I don't have any modelling skills but I gave it a go. I scored 8.4, placing me in 90th position. The current leader has managed to score 17.5. Can you top that?...

Wednesday 13 July 2011

NOTE: "Big" SAS 9.3 is Officially Released

SAS's web site is now making copious references to SAS 9.3, and yesterday's press release seems to be bigging-up SAS's "big data" capabilities. At a more technical and detailed level, there's a good summary of changes in the Support site, and a comprehensive What's New in SAS 9.3 publication in the SAS Documentation site.

I'm always intrigued to see what functions are introduced with a new release. Although I will inevitably forget 90% of them (it's an age thing!), they often provide a degree of interest. This time around I see lots of SOAPxxxx functions that provide interfaces to web services, a bunch of financial rate calculations, plus the following two that I hope to remember and use:

MVALID
checks the validity of a character string for use as a SAS member name.

SYSEXIST
returns an indication of the existence of an operating environment variable.
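By way of illustration, here's the sort of quick check I have in mind for SYSEXIST (untested, of course):

data _null_;
  /* SYSEXIST returns 1 if the named operating environment */
  /* variable exists, and 0 if it does not                  */
  if sysexist('HOME') then
    put 'The HOME environment variable is defined';
  else
    put 'No HOME environment variable in this session';
run;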

All I need to do now is get hold of a copy of the software.

Oh... and... I couldn't possibly make this post without highlighting the introduction of a new PROC... ladies and gentlemen, I give you... PROC GROOVY. Apparently, Groovy is an object-oriented programming language for the Java platform, but that's far too dull an explanation for a PROC which must surely return messages featuring words and phrases such as "cool", "far out", and "amazing".

Thursday 7 July 2011

NOTE: SAS-Themed Crosswords

I've just been having a bit of fun with a couple of the SAS-themed crosswords I created a long while ago (circa 2003); so long ago that I couldn't remember the answers! Great fun, and I was struck by the difficulty of the two prize-crosswords. Repeated congratulations to the respective winners.

Have a go yourself. It's a good way to challenge your knowledge of SAS and information technology; it's a great training tool. I recommend you start with one of the coffee-time crosswords before attempting either of the prize-crosswords. However, before starting any of them, you may wish to read some background information about the style of the crosswords so that you can more easily make sense of the clues.

Have fun! Tell me what you think of them (what you like, what you don't like, what you find easy/hard). I intend to produce some more crosswords soon. I'll let you know when they're published...

Wednesday 22 June 2011

NOTE: SYSTASK With An Unknown Number of Calls

In an earlier article (and the associated article on security) I extolled the virtues of SYSTASK for doing operating system activities in parallel. I gave an example that executed two gzip commands in parallel. But what would you do if you didn't know how many files you needed to zip?

Well, let's assume you have a table containing a list of files (WORK.FILES in the example below); we need to issue a SYSTASK statement for each row in the table; and then we need to issue a WAITFOR statement that refers to the names of each of the SYSTASKs so that we don't proceed any further until all of the zips are complete.

data files;
  file='Alpha.csv'; output;
  file='Beta.csv'; output;
  file='Gamma.csv'; output;
run;

%macro zippem(data=,var=);
  data _null_;
    set &data end=finish nobs=numobs;
    length stmt $256;
    stmt = cat('systask command "gzip '
              ,&var
              ,'" nowait taskname=TSK'
              ,putn(_n_,'Z5.')
              ,';'
              );
    call execute(stmt);
    if finish then
    do;
      stmt = 'waitfor _all_';
      do i = 1 to numobs;
        stmt = cat(trim(stmt),' TSK',putn(i,'Z5.'));
      end;
      stmt = cat(trim(stmt),';');
      call execute(stmt);
    end;
  run;
%mend zippem;

%zippem(data=files,var=file);


The macro produces the following log output:

NOTE: CALL EXECUTE generated line.
1 + systask command "gzip Alpha.csv" nowait taskname=TSK00001;
2 + systask command "gzip Beta.csv " nowait taskname=TSK00002;
NOTE: LOG/Output from task "TSK00001"
> gzip: Alpha.csv: No such file or directory
NOTE: End of LOG/Output from task "TSK00001"
3 + systask command "gzip Gamma.csv" nowait taskname=TSK00003;
4 + waitfor _all_ TSK00001 TSK00002 TSK00003;
NOTE: LOG/Output from task "TSK00003"
> gzip: Gamma.csv: No such file or directory
NOTE: End of LOG/Output from task "TSK00003"
NOTE: LOG/Output from task "TSK00002"
> gzip: Beta.csv: No such file or directory
NOTE: End of LOG/Output from task "TSK00002"


Ignoring the fact that my files don't exist(!), you can see that the output from each command is echoed to the log (useful). It's a simple macro, but it can speed up your jobs by a significant amount, and you can reuse the template code shown above for many purposes.

Monday 20 June 2011

NOTE: SYSTASK Is Great, If You're Allowed To Use It! (XCMD)

In my previous posting I featured the SYSTASK statement as a great means of executing operating system commands in parallel. Statements such as SYSTASK and CALL SYSTEM allow any operating system command to be executed and so they can be dangerous in the wrong hands. Paul Homes recently wrote an excellent blog post about the whole subject of issuing operating system commands from SAS and the restrictions that can be placed upon doing so. Recommended.
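If you're not sure whether your own session is allowed to issue operating system commands at all, a quick check of the XCMD system option will tell you (a one-line sketch; the option can only be set at SAS start-up, so you can't simply switch it on mid-session):

proc options option=xcmd;
run;

The log will report either XCMD or NOXCMD for your session.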

NOTE: With SYSTASK, Even Men Can Multi-Task!

I've been doing a lot of file manipulation recently (hence my observations on INFILE's FILEVAR). I've become a great fan of SYSTASK for executing operating system commands. The key element to SYSTASK's capabilities is that it can execute commands in parallel, i.e. asynchronously. So, if you have a number of large files that you want to do time-consuming tasks upon (such as compress or perform a word count), SYSTASK can do them in parallel and you'll get your results quicker (if your system has multiple processors and/or cores, and decent I/O performance).

Here's a simple (unix) example that zips two files in parallel:

systask command "gzip /user/home/andy/alpha.csv" nowait taskname=alpha;

systask command "gzip /user/home/andy/beta.csv" nowait taskname=beta;

waitfor _all_ alpha beta;

%put Both files are now zipped;


Note the NOWAIT keyword on each SYSTASK statement; this instructs SAS to continue execution rather than waiting for the command to finish. The WAITFOR statement (as its name implies) forms a synchronisation point in your code. In the example above, it will wait for "all" of the tasks named on the WAITFOR statement before allowing execution to continue beyond the WAITFOR statement.

In SAS 9.1 there's a restriction whereby you cannot use a tilde (~) or a wildcard (*). Aside from that, SYSTASK is a terrific means of speeding up your SAS code and making greater use of your computing resources.

Monday 6 June 2011

NOTE: Reading Multiple Files (with irregular names)

I was introduced to the INFILE statement's FILEVAR parameter recently. It seems it's a great way to read multiple files into a DATA step. Hitherto I had tended to use a wildcard in the fileref.

To read multiple files with similar names, you can simply put a wildcard in the FILENAME statement thus:

filename demo '~ratcliab/root*.txt';

If I have files with the following names in my home directory:

root1.txt
root2.sas
root3.txt


...then the first and third will be read by the following DATA step:

17 filename demo '~ratcliab/root*.txt';
18
19 data;
20   length string $256;
21   infile demo;
22   input string $256.;
23 run;

NOTE: The infile DEMO is:
File Name=/home/ratcliab/root1.txt,
File List=/home/ratcliab/root*.txt,
Access Permission=rw-r--r--,
File Size (bytes)=10

NOTE: The infile DEMO is:
File Name=/home/ratcliab/root3.txt,
File List=/home/ratcliab/root*.txt,
Access Permission=rw-r--r--,
File Size (bytes)=10

NOTE: 1 record was read from the infile DEMO.
The minimum record length was 9.
The maximum record length was 9.
NOTE: 1 record was read from the infile DEMO.
The minimum record length was 9.
The maximum record length was 9.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.DATA1 has 1 observations and 1 variables.


That's all great if your files have similar names. If not, ask the FILEVAR parameter to step forward...
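Here's a minimal sketch of the FILEVAR= approach (the file names below are invented for illustration): a data set supplies the name of each file to be read, and the DATA step switches its input file whenever that name changes.

/* One row per file to be read; the names are hypothetical */
data myfiles;
  length fname $200;
  fname = '/home/ratcliab/root1.txt';        output;
  fname = '/home/ratcliab/oddly_named.txt';  output;
run;

data all_lines;
  set myfiles;
  length string $256;
  /* FILEVAR= takes the physical file name from FNAME;     */
  /* DUMMY is just a placeholder name and is never used    */
  /* as a real fileref                                      */
  infile dummy filevar=fname end=done truncover;
  /* Read every record of the current file before moving   */
  /* on to the next row of MYFILES                          */
  do until (done);
    input string $256.;
    output;
  end;
run;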

NOTE: More on LENGTH Functions

I got a good amount of feedback on my recent article on LENGTH functions, including a blog comment from Rick@SAS. In addition to providing some useful detail on LENGTH functions within SAS/IML, Rick also suggested:
I think a source of confusion with character missing values is that it doesn't matter how many blanks are in a string (0, 1, 2, or 20), the strings are all equivalent and all equal to the character missing value. That's why you can say
IF x = "" THEN...
and the statement works regardless of the length of x
Good point, Rick. I still recommend using the MISSING function, not least for the clarity of purpose, but Rick's use of the character constant ("") certainly removes the uncertainty around the LENGTH function.

NOTE: Early Bird Discount Ending Soon (SAS Professionals 2011)

The 2011 SAS Professionals Convention is to be held July 12th - 14th in Marlow. Book before June 13th at the reduced rate of £100+VAT. If you plan to go (why wouldn't you?), don't miss the early bird discount.

Sadly, for the third year running, events outside of my control mean that I won't be going. Every year I start off determined to go, and then something crops up in the last couple of months to prevent it. Bah!

Wednesday 1 June 2011

NOTE: Length Functions (Something Missing?)

How many functions do you need to tell you the length of a value? At least six, apparently! SAS provides LENGTH, LENGTHC, LENGTHM, LENGTHN, KLENGTH and %LENGTH. Why?...

As we've all discovered to our cost, the basic LENGTH function accurately tells us the length of a character string (excluding trailing blanks) unless the string is completely blank, in which case LENGTH misleadingly returns the value 1. That's why I always use LENGTHN; it returns the value zero for a blank string.

I rarely use the others but, for the record, LENGTHC returns the length of a string including trailing blanks; but beware because it returns the value one when supplied with a null string as input.

The LENGTHM function is a slightly different beast because it returns the declared length of the variable rather than of its contents, i.e. it returns what was specified on (or implied for) the variable's LENGTH statement. KLENGTH is another oddity. In essence, it is the DBCS equivalent of LENGTH. And %LENGTH is the macro equivalent of LENGTHN, i.e. it returns zero for a null/blank string.

Oh, there's a %KLENGTH too. And SAS/IML has a length function too, but let's not go there!

Why might we be using length functions? One popular use is to test if a variable is missing or null. For these cases, the MISSING or NMISS functions are often the best option - not least because their names make the purpose of their usage far clearer than using a length function.

The MISSING function returns 1 if the value passed to it is missing. The value passed to it can be numeric or character. A character string is deemed to be missing if it is all blank or has zero length. Perfect! This is a far better choice than any of the length functions if you want to test a variable for a missing value.

NMISS returns the number of missing numeric values.

Finally, for completeness, I should mention CALL MISSING. You can use this routine to set character or numeric values to missing, though very few of us do.
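To make those behaviours concrete, here's a small sketch using nothing beyond the functions mentioned above (the values in the comments are what I'd expect to see in the log):

data _null_;
  length c $10;
  c = 'abc';                    /* stored as 'abc' plus trailing blanks */
  blank = ' ';                  /* a completely blank value             */

  len_c      = length(c);       /* 3 - trailing blanks ignored          */
  len_blank  = length(blank);   /* 1 - the misleading case              */
  lenn_blank = lengthn(blank);  /* 0 - what we usually wanted           */
  lenm_c     = lengthm(c);      /* 10 - the declared length of C        */
  miss_blank = missing(blank);  /* 1 - blank counts as missing          */

  put len_c= len_blank= lenn_blank= lenm_c= miss_blank=;
run;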

Wednesday 25 May 2011

NOTE: Parameter Validation - %DATATYP

In yesterday's article on coupling, I showed how the use of macro parameters can decouple your macros, making them more maintainable and reliable. Building-in some parameter validation is always a good practice too. Some simple, basic validation can sometimes be all that's needed to reveal a problem before it gets too far into your sequence of macro calls and becomes difficult to unpick and debug.

The %DATATYP autocall macro is very useful in this area. When passed a value, it will tell you whether the string is numeric or character. This is especially useful in macro-world where all values are handled as character strings. The following code snippet gives an introduction to its usage:

%macro multiplier(data=,out=,var=,mult=);
  %if %datatyp(&data) ne CHAR %then...
  %if %datatyp(&mult) ne NUMERIC %then...

The macro is smart enough to recognise 1.23E3 as numeric, i.e. 1230.
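Fleshing the snippet out into something runnable gives an idea of where the checks sit. This is a sketch only: the macro's purpose, the %PUT messages and the use of %RETURN (which needs SAS 9.2 or later) are my choices for illustration, not part of the original.

%macro multiplier(data=,out=,var=,mult=);
  /* A data set name should look like a character string */
  %if %datatyp(&data) ne CHAR %then
  %do;
    %put ERROR: DATA= does not look like a data set name: &data;
    %return;
  %end;
  /* The multiplier must be numeric */
  %if %datatyp(&mult) ne NUMERIC %then
  %do;
    %put ERROR: MULT= must be numeric, but "&mult" was supplied;
    %return;
  %end;

  data &out;
    set &data;
    &var = &var * &mult;
  run;
%mend multiplier;

/* Example call: convert SASHELP.CLASS heights from inches to centimetres */
%multiplier(data=sashelp.class, out=work.class_cm, var=height, mult=2.54);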

An alternative DATA Step approach is to use the INPUT function with the question mark (?) or double question mark (??) modifier in order to avoid messages being written to the log.

/* Check that Y (a char var) contains a valid numeric value */
if input(y,??best.) eq . then
  ... <not numeric>


Whatever approach you take, a little parameter validation is better than none and will undoubtedly repay you at some point in the future.

Tuesday 24 May 2011

NOTE: Coupling, Bad

In my recent article on %MEND I said I didn't like nested macro definitions. Some of my correspondents have suggested it's a good means of keeping macro code near to where it's called. I think that reasoning reflects a bad approach: the inner workings of the sub-macro should not be relevant to the caller. What matters is the interface and outward behaviour, so the definition of the sub-macro need not be located near to where it's used. (Plus, if it's used in more than one place it cannot possibly be located near to both.)

The design principle involved here is of "loose coupling". Design good practice suggests that objects, components and modules in your applications should make use of little or no knowledge of the internal implementation of other objects, components and modules with which they interact. Designing a loosely-coupled system provides the benefit of making it easier to make changes to one object without impacting another (if the interface and outward behaviour are not changed); this, in-turn, means your applications become more maintainable and reliable.

For example, I am loosely-coupled with my car, i.e. I have little or no understanding of how it works beyond my knowledge of its steering wheel, gear lever and pedals (its interface). I don't need any knowledge of how the engine or gearbox work in order to drive it. The advantage to me is that I can hop into almost any car and drive it just as effectively; the advantage for the car manufacturer is they can sell their cars to a wide range of people without needing to train them on how to use the specific model of car.

For macros in particular, it is very easy to write something that requires the caller to know things about the inner workings of the macro, e.g. the macro may expect certain global macro variables to be defined, or it may write its output to other global macro variables, or it may read/write to/from specific data sets. It is far better to define these things as part of the macro's parameter interface; it is then much easier to understand what the macro wants as input and what it provides as output. Consider these two macros:


No parameter interface:

%macro demo;
  data beta;
    set alpha;
    x = 2 * x;
  run;
%mend;

Called thus:

%demo;

Clear parameter interface:

%macro doubler(data=,out=,var=);
  data &out;
    set &data;
    &var = 2 * &var;
  run;
%mend doubler;

Called thus:

%doubler(data=first
        ,out=second
        ,var=profit);

The second, with its clear parameter interface, does not demand that its user knows which data set names are hard-coded inside the macro; the interface makes the macro's inputs, outputs and function clear. Thus it is easier to enhance the macro without "breaking" any code that uses it.

The example is just a simple one, but the principle has greater and greater value as your applications and their components get larger. The topic is much bigger than I can describe here, and it's difficult to find references that don't go into (non-SAS) coding examples. If you're brave, you can try Martin Fowler's classic Reducing Coupling from 2002; else take a look at Coupling and Cohesion on the C2 wiki.