Wednesday 27 February 2013

NOTE: DS2 Final, Final Comments

I got a lot of feedback about the DS2 articles I recently wrote. I should add a few more summarising comments:
  • I mentioned that DS2 doesn't currently support MERGE. Chris Hemedinger commented that the ability to use SQL in your SET statement means that you can use a SQL join on the SET statement and thereby a) achieve a MERGE, and b) benefit from "pushing" the work to a source database (if using non-SAS data).
  • Jack Hamilton observed that accidentally omitting the QUIT at the end of a call to PROC DS2 will cause some less-than-obvious syntax errors for the following conventional DATA step. Beware!
  • With power comes great responsibility! Whilst a DS2 program and its executing threads are constrained to one process on one machine, there are currently no limits to how many threads a user can request to use with thread programs. So, a user could take full advantage of all the cores on the machine - to the detriment of their fellow users.

    DS2 takes no account of the CPUCOUNT option, and it has no association with SAS Grid controls (with which, SAS administrators are able to manage and control users' use of system resources). I'm sure that some degree of control, facilitating workload balancing, will follow in time
DS2 promises much, and I'm keen to see how it evolves over the coming years.


NOTE: DS2. Data Step Evolved?
NOTE: DS2, Learn Something New!
NOTE: DS2, SQL Within a SET Statement
NOTE: DS2, Threaded Processing
NOTE: DS2, Final Comments
NOTE: DS2, Final, Final Comments

Tuesday 26 February 2013

NOTE: Plan Early and Avoid a Year 2020 Problem!

My friend Stuart Pearson alerted me to a distant but fast approaching issue for some SAS sites. Stuart observed a problem at one of his clients as they were working on forecasts for the next ten years and one of their processes was feeding-in 2 digit years. 2021 was becoming 1921. This was because the default YEARCUTOFF option value is 1920. As Stuart said to me, as we near 2020 it is something that SAS users need to be fully aware of.

It's not a difficult issue to overcome. But to overcome it, you need to be aware of it.

Stuart also highlighted that SAS Data Quality handles 2 digit years differently to SAS/BASE. By default, anything greater than 04 and less than 99 will be assigned to the 20th century (19nn); for years 01 - 03, the 21st century (20nn) is assigned. You can create your own schemes and rules, but you need to be aware of the defaults.

Monday 25 February 2013

Giving Focus to Peer Reviews

I'm a keen advocate for peer reviews. I've written about them before (here and here) but there's always more to say.

Peer reviews must always be treated as a constructive exercise. The style of questions can play a big part in the atmosphere and tone of the exercise.

There are many ways you can judge the quality of code. It's easy for developers to put too much time and effort in the wrong places, building things no one would use to begin with. So, aside from the basic questions of the code meeting design and coding guidelines, I ask myself the following questions:

a. How easy will it be to add new features?
b. How easy will it be to change existing features?
c. How easy will it be for a new team member to become productive?

I look for a good balance between the potentially contradictory questions. Using complex architecture might help with (1) and sometimes (2) but will probably hurt (3), so there is an interesting compromise to be delivered.

These questions ensure that the review takes account of the productivity of the team (and company) in addition to the regular technical factors.

Wednesday 20 February 2013

NOTE: DS2, Final Comments #sasgf13

In my previous posts, I've covered many aspects of DS2 (previous posts are listed at the bottom of this post). It's time to wrap up by offering a few more final details.
Whilst DS2 provides a wide range of data types, not all types are supported by all output data structures. The BASE engine, for example, has not been updated to allow storage of anything other than numeric and character variables in SAS datasets, so an attempt to create a data set with a variable of type BIGINT will be met with a warning message:

WARNING: BASE driver, creation of a BIGINT column has been requested, but is not supported by the BASE driver. A DOUBLE PRECISION column has been created instead.

The traditional SAS numeric variable is known in this context as a double precision column!

Paired with the SAS Embedded Process, DS2 enables you to perform processing similar to SAS in completely new places, such as in-database processing in relational databases, the SAS High-Performance Analytics grid and the DataFlux Federation Server.

If you want to know more, consider attending Mark Jordan's pre-conference tutorial at this year's SAS Global Forum. In the seminar, entitled "What Will DS2 Do for You?", you will learn the basics of writing and executing DS2 code. Mark promises to shows attendees how to explicitly control threading to leverage multiple processors when performing data manipulation and data modeling. He will demonstrate how DS2 improves extensibility and data abstraction in your code through the implementation of packages and methods. It's an extra fee event ($155) but could add a powerful extra string to your SAS bow!


NOTE: DS2. Data Step Evolved?
NOTE: DS2, Learn Something New!
NOTE: DS2, SQL Within a SET Statement
NOTE: DS2, Threaded Processing
NOTE: DS2, Final Comments
NOTE: DS2, Final, Final Comments

Monday 18 February 2013

NOTE: DS2, Threaded Processing

In my recent posts on DS2 (DATA step evolved), I showed the basic syntax plus packages & methods, and I showed the use of SQL within a SET statement. In today's post, I'll show the biggest raison d'être for DS2 - the ability to run your code in threads to make it finish its job more quickly.

"Big data", that's the big talking point. One of the key principles of performing speedy analytics on big data is to split the data across multiple processors and disks, to send the code to the distributed processors and disks, have the code run on each processor against its sub-set of data, and to collate the results back at the point from which the request was originally made. Thus, we're sending code to the data rather than pulling the data to the code. It's quicker to send a few dozen lines of code to many processors than it is to pull many millions of rows of data to one (big) processor.

DS2 was designed for data manipulation and data modeling applications. DS2 also enhances a SAS programmer’s repertoire with object-based tools by providing data abstraction using packages and methods. DS2 executes both within a SAS session by using PROC DS2, and within selected databases where the SAS Embedded Process is installed.

Here's a simple example. In summary, it shows how the use of eight threads reduces the turnaround time of the task from 24 seconds in a conventional DATA step to 4.4 seconds in a call to DS2 with eight threads.

16 /*****************************/
17 /* Create a chunky data set. */
18 /* Then read it:             */
19 /* a. With one thread        */
20 /* b. With eight threads     */
21 /* c. Using "old" DATA step  */
22 /*****************************/
24 options msglevel=n;
25 options cpucount=actual;
27 proc options option=threads;run;
SAS (r) Proprietary Software Release 9.2 TS2M3
THREADS Threads are available for use with features of the SAS System that support threading
28 proc options option=cpucount;run;
SAS (r) Proprietary Software Release 9.2 TS2M3
CPUCOUNT=24 Number of processors available.
30 /****************************/
31 /* Create a chumky data set */
32 /****************************/
33 data work.jmaster;
34   do j = 1 to 10e6;
35     output;
36   end;
37 run;
39 /**************************/
40 /* Now read it three ways */
41 /**************************/
43 /* But first define the threaded code thread */
44 proc ds2;
45   thread r /overwrite=yes;
46     dcl double count;
47     method run();
48       set work.jmaster;
49       count+1;
50       do k=1 to 100;/* Add some gratuitous computation! */
51         x=k/count + k/count + k/count;
52       end;
53     end;
54     method term();
55       OUTPUT;
56     end;
57   endthread;
58 run;
NOTE: Execution succeeded. No rows affected.59 quit;  60
61 /* One thread */
62 proc ds2;
63   data j1(overwrite=yes);
64     dcl thread r r_instance;
65     dcl double count;
66     method run();
67       set from r_instance threads=1;
68       total+count;
69     end;
70   enddata;
71 run;
NOTE: Execution succeeded. One row affected.72 quit;NOTE: PROCEDURE DS2 used (Total process time):
real time 25.09 seconds
cpu time 25.16 seconds
74 /* Eight threads */
75 proc ds2;
76   data j8(overwrite=yes);
77     dcl thread r r_instance;
78     dcl double count;
79     method run();
80       set from r_instance threads=8;
81       total+count;
82     end;
83   enddata;
84 run;
NOTE: Execution succeeded. 8 rows affected.85 quit;NOTE: PROCEDURE DS2 used (Total process time):
real time 4.40 seconds
cpu time 32.96 seconds
87 /* And read it in DATA step */
88 data jold;
89   set work.jmaster end=finish;
90   count+1;
91   do k=1 to 100;/* Add some gratuitous computation! */
92     x=k/count + k/count + k/count;
93   end;
94   if finish then output;
95 run;
NOTE: There were 10000000 observations read from the data set WORK.JMASTER.NOTE: The data set WORK.JOLD has 1 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 23.98 seconds
cpu time 23.75 seconds

The code does six things:
  • Firstly, it sets the maximum number of CPUs available to the SAS task. It sets this to 24 (this box has 12 cores, each with two hyperthreads)
  • Next, it creates a sample data set (to be read in the subsequent steps)
  • It then uses DS2 to define a thread - like a function or a method. The thread will read a row from the sample data set, increment a count and do some arbitrary compute activity (to ensure the test exercise isn't I/O-bound, thereby preventing DS2 from showing its capabilities)
  • Next, it executes DS2 again. The previously defined thread is used inside an executable piece of code (once, because threads=1). This takes 25 seconds elapsed time (and 25 seconds CPU time)
  • Now we execute the same code with DS2 but we use eight threads (each processing one eighth of the input records). This takes 4.4 seconds to complete (but consuming a total of 33 seconds of CPU time across the eight threads)
  • Finally, we execute a traditional DATA step to do the same task. Not surprisingly, it takes a similar amount of time to complete the task as the single-threaded DS2 code
So, we see how the use of eight threads reduced the turnaround time by a factor of 25/4.4=5.7. Not a factor of 8, but not bad nonetheless!

This example suggests that great benefit can be achieved with DS2 threads. In our simple case, the code was compute-bound. If your task is more I/O-bound then the benefits may be less predictable. Jason, from the DS2 development team, recently told me:
When DS2 does I/O, it starts one "reader" thread. This reader thread requests blocks of data from a data source and puts the blocks on a queue. Then, either the DS2 "DATA" program fetches those blocks off the queue or DS2 "THREAD" programs fetch the blocks off the queue.

The key here is there is one reader thread and one or more "compute" threads. As long as the reader thread can keep up with the compute threads, we should see a speed up in execution time. One reader thread works well for most data sources as data sources usually present one source to read from.

With a data source like SPDE, there could be multiple data sources across multiple devices. Right now DS2 does not take advantage of having multiple reader threads. However, I believe the architecture is flexible enough to allow mulitple reader threads.

At this point, we have been discussing I/O on a single machine. SAS High-Performance Analytics and SAS Scoring Accelerator execute DS2 in parallel across many machines. The I/O model for these types of systems is different and enables DS2 programs to better take advantage of multiple storage devices.
So, as is usually the case when we talk of performance, a lot depends on your hardware architecture and the amount of effort you put into the tuning of your architecture and code. Nonethess, with DS2 it seems that there are benefits aplenty to be had.

In my next post, I'll wrap up the topic with a few extra details.


NOTE: DS2. Data Step Evolved?
NOTE: DS2, Learn Something New!
NOTE: DS2, SQL Within a SET Statement
NOTE: DS2, Threaded Processing
NOTE: DS2, Final Comments
NOTE: DS2, Final, Final Comments

Wednesday 13 February 2013

NOTE: DS2, SQL Within a SET Statement

Earlier this week, I suggested taking a look at DS2 ("the DATA step evolved") as a learning opportunity for the New Year. I listed the advantages that DS2 offers over the traditional DATA step. One of those was the ability to use SQL within a SET statement, thereby combining the benefits of both languages. There follows a simple example that combines SQL's familiar aggregation capabilities with SAS's ability to produce more than one output data set from a single pass of the input data.

In fact, DS2 takes a number of ideas from SQL:
  • Apart from using ANSI SQL data types in addition to the traditional Numeric and Character types, DS2's DATA steps takes a leaf out of SQL's book when no output data set name is specified. Traditional data steps would create output data sets with names like WORK.DATA1; DS2 returns the result to the Output window, much like PROC SQL would do.
  • Another DS2/SQL similarity is that DS2 will not automatically replace or over-write an existing table; you must specify the OVERWRITE=YES data set option.
  • And, finally, PROC DS2 uses RUN and QUIT statement in a similar manner to PROC SQL. PROC DS2 uses the RUN statement to indicate that preceding code is complete and should be executed, whilst the QUIT statement tells the procedure that its work is fully complete and it can be unloaded from memory.
One interesting point to note is that DS2 does not respect the SASHELP library. If you reference SASHELP (on a SET statement, for example) you'll get an error message informing you that the "schema name SASHELP was not found". This is a shame because SASHELP's example data sets (such as CLASS and PRDSALE) are so useful when demonstrating and experimenting with code and ideas.

17 proc copy inlib=sashelp outlib=work;
18   select prdsale;
19 run;

NOTE: There were 1440 observations read from the data set SASHELP.PRDSALE.
NOTE: The data set WORK.PRDSALE has 1440 observations and 10 variables.

21 proc ds2;
22   data high(overwrite=yes) low(overwrite=yes);
23     drop count;
24     method run();
25       set {select  country
26                   ,year
27                   ,quarter
28                   ,sum(actual) as actual
29                   ,sum(predict) as predict
30              from work.prdsale
31              group by country,year,quarter};
32       if actual lt 5000 then output low;
33       else output high;

34     end;
35   enddata;
36   run;

NOTE: Execution succeeded. 24 rows affected.
37 quit;

Clearly, you could do all of this with a PROC SUMMARY (or PROC MEANS) followed by a conventional DATA step, but the input/output activity is much reduced with the above technique, and use of a SQL SELECT statement within a SET statement is particuarly useful for those who are less familiar with some SAS procedures.

However, observe how DS2's NOTE statement offers little information about the numbers of rows written to each of its output tables. Maybe greater information will be offered in a future version (I'm using V9.2 TS2M3).

And you should be aware too that DS2 does not (currently) incorporate all functions and statement that you'd expect in traditional DATA steps. For instance, we have SET but we don't have MERGE or UPDATE or MATCH. We must assume that these will come in-time too. However, the ability to provide some of these capabilities may, in part, conflict with DS2's ability to perform threaded execution, so some traditional DATA step capabilities may never make their way into DS2.

In the next DS2 article, I'll discuss high performance data access using DS2's threading capabilities.


NOTE: DS2. Data Step Evolved?
NOTE: DS2, Learn Something New!
NOTE: DS2, SQL Within a SET Statement
NOTE: DS2, Threaded Processing
NOTE: DS2, Final Comments
NOTE: DS2, Final, Final Comments

Monday 11 February 2013

NOTE: DS2, Learn Something New!

There's never a bad time to learn something new. How about DS2?

In May of last year, upon returning home from SAS Global Forum, I wrote about DS2, a new alternative to DATA step. Generally available in V9.4, PROC DS2 is currently available in SAS V9.3 as an experimental technology (and V9.2 TS2M3 too).

DS2's focus is on high performance for data manipulation and data analysis; it incorporates threading. But there's a lot more functionality available besides. DS2 gives you, the programmer, far more flexibility with your coding and gives you far greater abilities to structure your code.

The DS2 syntax includes the ability to specify SQL within a SET statement, combining the benefits of both languages, additional data types, ANSI SQL types, programming structure elements, and user-defined methods and packages. In advance of the 9.4 launch, you can find a Getting Started and a Language Reference manual in the product documentation.

Here's a simple example of some DS2 code and its output:

17 proc ds2;
19   data _null_;
20     method init();
21       dcl varchar(16) str;
22       str = 'Hello World!';
23       put str;
24     end;
25   enddata;
27 run;
Hello World!

NOTE: Execution succeeded. No rows affected.
28 quit;

As you can see, DS2 is actually a PROC, but it takes DATA step-like code as statements within the PROC. Within the DATA step, code is arranged into methods - sub-routines if you like (similar to functions in java, or functions created by PROC FCMP). The most basic methods are called INIT, RUN, and TERM. INIT and TERM run once (at the beginning and the end of the DATA step, as their names imply), and RUN is executed once for each input row, just as a conventional DATA step does.

The Getting Started manual takes you through some of the basics of the syntax. Let's move on and see some of what I find most exciting about DS2.

Firstly (below), I'll take a look at structured programming with packages and methods. In subsequent articles I'll look at SQL in the SET statement and at high performance data analysis with threading.

So, let's look at packages and methods; a great means of structuring our code. Take a look at the code and log output below. Note how we create a package named pythagorus, and the package contains two methods (gethyp and getside). The methods can contain any amount of code, and the package can be stored as an external file. There are large overlaps with the abilities of the macro language here, but DS2 brings the advantage of using one coherent language, with many different types of data (not just character!).

17 proc ds2;
19   package pythagorus/overwrite=yes;
20     method gethyp(double a, double b)
21                  returns double;
22       a_sq = a**2;
23       b_sq = b**2;
24       return sqrt(a_sq + b_sq);
25     end;
26     method getside(double hyp, double sidea)
27                   returns double;
28       return sqrt(hyp**2 - sidea**2);
29     end;
30   endpackage;
31   run;

NOTE: Execution succeeded. No rows affected.
33   data demo(overwrite=yes);
34     method init();
35       short=3; long=4; hyp=.; output;
36       short=4; long=5; hyp=.; output;
37       short=.; long=4; hyp=5; output;
38       short=3; long=.; hyp=5; output;
39     end;
40   enddata;
41   run;

NOTE: Execution succeeded. 4 rows affected.
43   data results(overwrite=yes);
44     dcl package pythagorus pyth();
45     method run();
46       set demo;
47       select;
48         when (missing(hyp))
49           hyp=pyth.gethyp(short,long);
50         when (missing(short))
51           short=pyth.getside(hyp,long);

52         when (missing(long))
53           long=pyth.getside(hyp,short);
54       end;
55     end;
56   enddata;
58   run;

NOTE: Execution succeeded. 4 rows affected.
59 quit;

After the package and its methods are defined, we create some test data in a small DATA step, and then we generate our results in a final DATA step. Notice how we need to declare the package in our results DATA step before we use it.

The ability to create packages of methods will allow SAS code to be written in a clearer, more structured form than was previously possible. I look forward to discovering the full benefits of packages and methods as I use there new capabilities more and more.

DS2 will still perform type conversions (one of my bête noires of the DATA step; I wrote about it in September 2009 and in a follow-up article) but the rules are more complicated because DS2 introduces so many different types. Indeed, there's a whole chapter on it in the Language Guide.

Not quite sure where to start? Let PROC DSTRANS help you by translating a subset of your DATA step code into DS2 code. Then, if necessary, you can revise your program to take advantage of DS2 features before submitting your program using PROC DS2.

In my next DS2 post, I'll show how to sue SQL statement in your SET statements.


NOTE: DS2. Data Step Evolved?
NOTE: DS2, Learn Something New!
NOTE: DS2, SQL Within a SET Statement
NOTE: DS2, Threaded Processing
NOTE: DS2, Final Comments
NOTE: DS2, Final, Final Comments

Wednesday 6 February 2013

NOTE: Booked and Ready for SAS Global Forum 2013 #sasgf13

I referred back to my Booked for SGF 2012 post to make sure I'd done all what was necessary for my 2013 trip. I'mall set and ready to go!

There was a time when I'd need to put my slides onto floppy disks, keeping one for myself and giving one or two to colleagues who were also going to SGF. In that way, I could be confident of turning-up for the conference with a presentation to present. Uploading my paper and slides to the Call for Papers web site gives me the confidence of knowing that my slides will be on the streams' PCs before I even set foot in the presentation rooms for my two papers.

I'm looking forward to going and seeing a real city rather than the ersatz location that is Walt Disney World. I'm arriving on Saturday, and staying on for a couple of days after the conference. Can you recommend a tourist site to visit? Andrew Karp has already recommended the Computer History Museum in Mountain View.

Last year I booked flight and accommodation as a "holiday" package with Virgin Holidays (including free return transfer from airport to hotel), but the Virgin hotels don't seem near the San Francisco conference centre this year, so I'm flying with Virgin Atlantic and staying in the conference hotel. I also booked:
  • Car parking for my car to be left in the UK
  • Lounges with wi-fi so I can make best use of my time whilst hanging around the airports waiting for flights
  • A new application for entry to the US through the Electronic System for Travel Authorisation (ESTA) - my previous application had last two years and had run out
  • Travel insurance
  • Conference registration
  • Did I forget anything??
Will I see you in San Francisco?

Tuesday 5 February 2013

NOTE: Quick SAS Keyword Lookup in Chrome

In a previous blog I mentioned how to tell the Google Chrome browser to search by using a different search engine (or a specific search URL). I created a quick and easy way to search SAS conference proceedings and to search Visuwords for definitions of words. But, what if you've already found an interesting page and it has a word on it that you'd like to look-up or search?

Well, that's easy to do in Chrome too. Just right-click and select Search Google For <Word>. By default, no other web site (or URL) is offered. However, by adding the Context Menu Search extension, we can add multiple alternate search engines.

  1. To do this, visit the Chrome Web Store and search for "context menu search"
  2. When you get the search results, you'll see "Context Menu Search from" at the top of the list.
  3. Click Add to Chrome and follow the instructions to add the extension
  4. Manage your extensions by clicking Extensions in the Tools menu, and then clicking Options for Context Menu Search (or, click on a highlighted word in your browser, right-click, select Context Menu Search, and then Options)
  5. On the Options screen, use the fields to add a new search. For example, to set-up Visuwords to do a graphical dictionary/thesaurus look-up, specify "Visuwords" as the display name and "" as the link
And there you have it. You can now highlight words and then search for them with specific search engines and URLs. The ability to do this for Visuwords and Wikipedia is very useful.

Monday 4 February 2013

NOTE: Unable to Write to Template Store!

Sometimes you get into a habit that is so deep-seated that you don't realise you're doing it. I was just reminded of a little trick that I regularly use when editing ODS graphics templates.

A colleague of mine was using PROC TEMPLATE and getting the following (apparently) benign error in his log:

15 proc template;
16   edit styles.default as styles.defsize2;
17     style container / font_size=2;
18   end;
ERROR: Template 'Styles.Defsize2' was unable to write to template store!
19 run;

I say "benign" because the code that subsequently used styles.defsize2 worked fine. But we have a coding guideline that says that no errors are allowed in our team's code (not even benign ones!). He was struggling to make the error go away and was promoting the argument that benign errors should be permitted by our coding guidelines.

We discussed the message and what it meant. Templates, by default, are stored in an ITEMSTOR named SASUSER.TEMPLAT. My programmer's code was reading SASUSER.TEMPLAT.STYLES.DEFAULT and attempting to create a new style named SASUSER.TEMPLAT.STYLES.DEFSIZE2. The error message was saying he couldn't do that because DEFSIZE2 already existed. Why did it already exist? Because his code had worked without an error message the very first time he ran it - but he'd forgotten this fact.

So, it wasn't necessary to run the code every time because the DEFSIZE2 style already existed. Of course, each new programmer that ran the code had to run the PROC TEMPLATE once in order to create DEFSIZE2 in their own SASUSER library.

I proposed a slightly different approach: let's create the template in the work library each time the code was run. With this change, the code would run consistently for everybody, everytime.

Making this change is simple, we just need to change the default location for templates to be read from and written to. And the easiest way to do that is to add the WORK library to the ODS search path (adding it to the front of the search path). Specifically, we want to add an ITEMSTOR that's in the WORK library to the ODS search path. The ITEMSTOR need not exist before we write something to it - it'll be created automatically. We can use ODS PATH SHOW to demonstrate the effect:

15 ods path show;
Current ODS PATH list is:

16 ods path(prepend) work.templat(update);

17 ods path show;
Current ODS PATH list is:

The first call to ODS PATH SHOW shows us the two default ITEMSTORs, and then we see that we've added WORK.TEMPLAT to the head of the search path.

We don't need to change any other part of our code because all elements of ODS respect this search path.

So, we updated the code, got rid of the (not so) benign error message, and didn't make any changes to our coding guidelines. An interesting and informative exercise, demonstrating not just the ODS PATH for templates but also the wisdom of not allowing any error messages in SAS logs. The programmers in Cary put those messages in for a reason!

Oh, one more thing. You can use PROC DATASETS to see ITEMSTORs in your libraries, and you can view a list of templates with PROC TEMPLATE's LIST statement (see below), but I haven't figured-out a means of inspecting the content of an ITEMSTOR. Does anybody know how? I appreciate that the SAS Registry is an ITEMSTOR, and you can view that in PC SAS; but how do you view these things if they're on a server and you're using Enterprise Guide?

proc datasets lib=sasuser;

proc template;
  list ;

Answers on a postcard please...