Wednesday, 13 July 2011

NOTE: "Big" SAS 9.3 is Officially Released

SAS's web site is now making copious references to SAS 9.3, and yesterday's press release seems to be bigging-up SAS's "big data" capabilities. At a more technical and detailed level, there's a good summary of changes on the Support site, and a comprehensive What's New in SAS 9.3 publication on the SAS Documentation site.

I'm always intrigued to see what functions are introduced with a new release. Although I will inevitably forget 90% of them (it's an age thing!), they often provide a degree of interest. This time around I see lots of SOAPxxxx functions that provide interfaces to web services, a bunch of financial rate calculations, plus the following two that I hope to remember and use:

MVALID
checks the validity of a character string for use as a SAS member name.

SYSEXIST
returns an indication of the existence of an operating environment variable.
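
Not having the software to hand, I can only offer a minimal sketch of how I'd expect to call the two functions, based on their documented descriptions. The member names and the HOME environment variable are my own illustrative choices, and I'm assuming MVALID takes a libref, the candidate name, and a member type, in that order.

```sas
data _null_;
  /* MVALID(libref, string, member-type): 1 = valid member name, 0 = not valid */
  /* 'sales_2011' and '2011 sales' are hypothetical names for illustration     */
  ok_name  = mvalid('work', 'sales_2011', 'data');
  bad_name = mvalid('work', '2011 sales', 'data');

  /* SYSEXIST(name): 1 = the environment variable exists, 0 = it does not */
  /* HOME is an assumption; pick a variable relevant to your own platform */
  has_home = sysexist('HOME');

  put ok_name= bad_name= has_home=;
run;
```

Both functions return a simple 1/0 flag, so they should slot neatly into IF statements (or %SYSFUNC calls) that guard code which creates tables or reads environment settings.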

All I need to do now is get hold of a copy of the software.

Oh... and... I couldn't possibly make this post without highlighting the introduction of a new PROC... ladies and gentlemen, I give you... PROC GROOVY. Apparently, Groovy is an object-oriented programming language for the Java platform, but that's far too dull an explanation for a PROC which must surely return messages featuring words and phrases such as "cool", "far out", and "amazing".

Thursday, 7 July 2011

NOTE: SAS-Themed Crosswords

I've just been having a bit of fun with a couple of the SAS-themed crosswords I created a long while ago (circa 2003); so long ago that I couldn't remember the answers! Great fun, and I was struck by the difficulty of the two prize-crosswords. Repeated congratulations to the respective winners.

Have a go yourself. It's a good way to challenge your knowledge of SAS and information technology; it's a great training tool. I recommend you start with one of the coffee-time crosswords before attempting either of the prize-crosswords. However, before starting any of them, you may wish to read some background information about the style of the crosswords so that you can more easily make sense of the clues.

Have fun! Tell me what you think of them (what you like, what you don't like, what you find easy/hard). I intend to produce some more crosswords soon. I'll let you know when they're published...

Wednesday, 22 June 2011

NOTE: SYSTASK With An Unknown Number of Calls

In an earlier article (and the associated article on security) I extolled the virtues of SYSTASK for doing operating system activities in parallel. I gave an example that executed two gzip commands in parallel. But what would you do if you didn't know how many files you needed to zip?

Well, let's assume you have a table containing a list of files (WORK.FILES in the example below). We need to issue a SYSTASK statement for each row in the table, and then a WAITFOR statement that names every one of those tasks, so that we don't proceed any further until all of the zips are complete.

data files;
  file='Alpha.csv'; output;
  file='Beta.csv'; output;
  file='Gamma.csv'; output;
run;

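%macro zippem(data=,var=);
  /* DATA= table listing the files; VAR= column holding each file name */
  data _null_;
    set &data end=finish nobs=numobs;
    length stmt $256;
    /* Build one SYSTASK statement per row, with task names TSK00001, TSK00002, ... */
    stmt = cat('systask command "gzip '
              ,&var
              ,'" nowait taskname=TSK'
              ,putn(_n_,'Z5.')
              ,';'
              );
    /* Queue the statement to run once this DATA step has finished */
    call execute(stmt);
    if finish then
    do;
      /* After the last row, build a WAITFOR that names every task */
      stmt = 'waitfor _all_';
      do i = 1 to numobs;
        stmt = cat(trim(stmt),' TSK',putn(i,'Z5.'));
      end;
      stmt = cat(trim(stmt),';');
      call execute(stmt);
    end;
  run;
%mend zippem;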

%zippem(data=files,var=file);


The macro produces the following log output:

NOTE: CALL EXECUTE generated line.
1 + systask command "gzip Alpha.csv" nowait taskname=TSK00001;
2 + systask command "gzip Beta.csv " nowait taskname=TSK00002;
NOTE: LOG/Output from task "TSK00001"
> gzip: Alpha.csv: No such file or directory
NOTE: End of LOG/Output from task "TSK00001"
3 + systask command "gzip Gamma.csv" nowait taskname=TSK00003;
4 + waitfor _all_ TSK00001 TSK00002 TSK00003;
NOTE: LOG/Output from task "TSK00003"
> gzip: Gamma.csv: No such file or directory
NOTE: End of LOG/Output from task "TSK00003"
NOTE: LOG/Output from task "TSK00002"
> gzip: Beta.csv: No such file or directory
NOTE: End of LOG/Output from task "TSK00002"


Ignoring the fact that my files don't exist(!), you can see that the output from each command is echoed to the log (useful). It's a simple macro, but it can speed up your jobs significantly. You can use the template code shown above for many purposes.

Monday, 20 June 2011

NOTE: SYSTASK Is Great, If You're Allowed To Use It! (XCMD)

In my previous posting I featured the SYSTASK statement as a great means of executing operating system commands in parallel. Statements such as SYSTASK and CALL SYSTEM allow any operating system command to be executed and so they can be dangerous in the wrong hands. Paul Homes recently wrote an excellent blog post about the whole subject of issuing operating system commands from SAS and the restrictions that can be placed upon doing so. Recommended.

NOTE: With SYSTASK, Even Men Can Multi-Task!

I've been doing a lot of file manipulation recently (hence my observations on INFILE's FILEVAR), and I've become a great fan of SYSTASK for executing operating system commands. The key to SYSTASK's capabilities is that it can execute commands asynchronously, i.e. in parallel. So, if you have a number of large files that need time-consuming work (such as compressing or word counting), SYSTASK can process them in parallel and you'll get your results quicker (provided your system has multiple processors and/or cores, and decent I/O performance).

Here's a simple (Unix) example that zips two files in parallel:

systask command "gzip /user/home/andy/alpha.csv" nowait taskname=alpha;

systask command "gzip /user/home/andy/beta.csv" nowait taskname=beta;

waitfor _all_ alpha beta;

%put Both files are now zipped;


Note the NOWAIT keyword on each SYSTASK statement; this instructs SAS to continue execution rather than waiting for the command to finish. The WAITFOR statement (as its name implies) forms a synchronisation point in your code. In the example above, it will wait for "all" of the tasks named on the WAITFOR statement before allowing execution to continue beyond the WAITFOR statement.
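
If you want to know whether each command actually worked, SYSTASK also accepts a STATUS= option that names a macro variable to receive the command's return code. Here's a minimal sketch along the same lines as the example above; the macro variable names rc_alpha and rc_beta are my own invention, and the paths are illustrative.

```sas
/* STATUS= names a macro variable that receives the command's return code. */
/* rc_alpha and rc_beta are hypothetical names chosen for this sketch.     */
systask command "gzip /user/home/andy/alpha.csv" nowait taskname=alpha status=rc_alpha;
systask command "gzip /user/home/andy/beta.csv"  nowait taskname=beta  status=rc_beta;

/* Synchronisation point: carry on only when both tasks have finished */
waitfor _all_ alpha beta;

%put NOTE: alpha finished with return code &rc_alpha;
%put NOTE: beta finished with return code &rc_beta;
```

If you only need one of the tasks to complete before carrying on, WAITFOR _ANY_ resumes execution as soon as any of the named tasks has finished.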

In SAS 9.1 there's a restriction whereby you cannot use a tilde (~) or a wildcard (*) in the command. Aside from that, SYSTASK is a terrific means of speeding up your SAS code and making greater use of your computing resources.

Monday, 6 June 2011

NOTE: Reading Multiple Files (with irregular names)

I was introduced to the INFILE statement's FILEVAR parameter recently. It seems it's a great way to read multiple files into a DATA step. Hitherto I had tended to use a wildcard in the FILENAME statement.

To read multiple files with similar names, you can simply put a wildcard in the FILENAME statement thus:

filename demo '~ratcliab/root*.txt';

If I have files with the following names in my home directory:

root1.txt
root2.sas
root3.txt


...then the first and third will be read by the following DATA step:

17 filename demo '~ratcliab/root*.txt';
18
19 data;
20   length string $256;
21   infile demo;
22   input string $256.;
23 run;

NOTE: The infile DEMO is:
File Name=/home/ratcliab/root1.txt,
File List=/home/ratcliab/root*.txt,
Access Permission=rw-r--r--,
File Size (bytes)=10

NOTE: The infile DEMO is:
File Name=/home/ratcliab/root3.txt,
File List=/home/ratcliab/root*.txt,
Access Permission=rw-r--r--,
File Size (bytes)=10

NOTE: 1 record was read from the infile DEMO.
The minimum record length was 9.
The maximum record length was 9.
NOTE: 1 record was read from the infile DEMO.
The minimum record length was 9.
The maximum record length was 9.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.DATA1 has 1 observations and 1 variables.


That's all great if your files have similar names. If not, ask the FILEVAR parameter to step forward...
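
As a rough sketch of the usual FILEVAR pattern (not necessarily the neatest way to do it), the DATA step below reads every file named in a table. The WORK.FILELIST table, the second file name, and the fname, string, and lastrec variables are all hypothetical names chosen for illustration.

```sas
/* Hypothetical list of files to read; the names need not share a pattern */
data filelist;
  length fname $256;
  input fname $;
  datalines;
/home/ratcliab/root1.txt
/home/ratcliab/notes.dat
;
run;

data alltext;
  set filelist;                    /* one row per file to be read */
  length string $256;
  /* FILEVAR= points the INFILE at whichever file FNAME currently names; */
  /* DUMMY is only a placeholder and is never assigned as a fileref      */
  infile dummy filevar=fname end=lastrec truncover;
  /* Read every record of the current file before moving to the next row */
  do while (not lastrec);
    input string $256.;
    output;
  end;
run;
```

The END= variable tells the inner loop when the current file is exhausted, at which point the SET statement fetches the next file name and the INFILE statement opens that file instead.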