Monday 8 September 2014

The Imitation Game @TIGmovie @BletchleyPark

For wholly different reasons, my daughter and I are thrilled to see a date for the premiere of The Imitation Game movie in London and the UK. She's thrilled because it features Benedict "Sherlock" Cumberbatch; I'm thrilled because it's a big screen depiction of the life and work of Alan Turing - the British pioneer of modern day computing.

Many would say that Alan Turing started the digital revolution. Although others around the world (such as the American Alonzo Church) had done some work, it was Alan Turing who envisaged and designed a machine that could be programmed to solve an infinite number of problems by being given a rule set upon which it would base its actions. In fact, it could theoretically solve any problem for which there was a solution (hence, it is basically a modern computer).

It was Alan Turing who was instrumental in the development of the Bombe at Bletchley Park and it was Alan Turing whose ground-breaking ideas about Artificial Intelligence (AI) really pushed the boundaries of mathematical thinking at that time.

When Time magazine published its list of the 100 most important people of the twentieth century in 1999, they included Alan Turing in that list and said of him:
"The fact remains that everyone who taps at a keyboard, opening a spreadsheet or a word-processing program, is working on an incarnation of a Turing machine."
Turing's sorry, shabby reward for the instrumental role he played in winning the war for Britain was to be persecuted during the Cold War because his homosexuality was viewed as a security risk, to the point that he committed suicide. His pardon last year was a small recognition of his country's past mistakes.

The two most well known of his papers are:
If you can't wait until the general release of the film, to get a fix of Bletchley or Enigma or Turing, you might like to read Robert Harris's Enigma (or watch the film version), or Jack Norman's Broken Crystal. Both are good semi-fictional reads.

Monday 1 September 2014

Graphics on Android

Last week, writing about admin and deployment enhancements in SAS v9.4, I mentioned my estimation of the proportion of SAS customers on the latest version of SAS (I confidently estimated less than 50%).

These figures are available in other contexts. For instance, Google publish figures for Android versions on a monthly basis. This is useful for developers and gives them guidance on how backwards-compatible their apps should be.

When combined with the number of different manufacturers and handsets, the proliferation of Android versions is known as "fragmentation" and is seen in some quarters as a bad thing. From my perspective as a consumer, I think choice is a good thing, but I do see how it can create support and maintenance headaches for developers.

Anyway, my reason for mentioning this, aside from the nod back to last week's article, was to draw your attention to a recently published report on Android fragmentation by Open Signal. The quality and style of the graphics in their report really caught my eye, so I thought I'd share it with you. I like the look and style of the graphics, but I also like the interactivity you get when you move your mouse over the graphics.

What do you think? Could you replicate these graphics in SAS?

Thursday 28 August 2014

NOTE: What's More in 9.4 - Admin & Deployment

We can't consider SAS version 9.4 to be "new" any more (it first shipped in July 2013), but if we had the numbers to show it, I'm sure we'd see that less than 50% of customers have upgraded, so it's worth revisiting 9.4's attractions.

The complexity, effort and cost of upgrading have grown over the years. I don't know many clients who still perform upgrades themselves; most rely on SAS Professional Services to do the heavy lifting. Whether this lack of self-sufficiency is good for the clients in the long-term is debatable. SAS themselves claim to recognise the issue and are making efforts to ease the burden of upgrading. Perhaps we'll see significant changes in this area in 9.5, but I won't hold my breath.

Anyway, to return to v9.4, I recently took a look at the What's New in SAS 9.4 book. Wow, that's a big tome! 140 pages. I've not bothered to check, but it's the biggest "What's New" that I recall. So there must be plenty of juicy new features to justify an upgrade. In fact, there are, and I'll spread them over a number of articles. To counterpoint my comments above, I'll start with a mention of the changes in the areas of deployment and administration.

Firstly, the web-based Environment Manager (EM) shows SAS's direction for a new Management Console. EM allows admin and monitoring from a web-based interface and hence does not require the installation of any client-side software.

Secondly, there's far greater and more explicit support for virtual SAS instances, either hosted on-site or off-site. This gives IT departments far greater flexibility to build and deploy multiple instances of SAS; this is a good thing if you think that multiple instances are a good thing.

Thirdly, many of those 3rd-party bits and bobs in the middle tier have been replaced by SAS Web Application Server. On the face of it, we no longer need to recruit and retain support personnel with skills in non-SAS technologies. Certainly, it's good to have just one supplier to turn to in the event of questions or problems. However, the skills and knowledge required to install and operate SAS Web Application Server are similar to those required for the bits and bobs of mid-tier used with v9.3 and v9.2, so it's not a complete "get out of jail free" card. And if you look carefully, you'll see that SAS Web Application Server is largely a rebranded version of a 3rd party toolset. Nonetheless, it's a positive step.

And finally, availability and resilience have been much improved with the ability to have more than one metadata server. I wrote about this in May last year. Alongside clustered metadata servers, we can also have clustered middle-tier servers. Simplistically, this means that if one server fails then the service can continue and will not fail.

Unplanned Sabbatical - NO MORE

So, it's been a bit quiet in NOTE:land for the last six months. I last wrote in January, and it's now August.

I started the NOTE: blog in July 2009. Since then, up to January, I have posted 459 articles. That's an average of more than 8 posts per month. I wasn't aware of these numbers until I calculated them to write this article, but maybe they go some way to explaining why I hit a point in January where I felt I had to take a break from writing. At first it was just "a few weeks" but it's turned into months.

Despite publishing nothing for the last six months, the NOTE: site has received 7,500+ hits per month, so I guess there must be something of interest in those 459 articles.

I've always enjoyed writing my posts for NOTE:, so I knew that my "sabbatical" would end sooner rather than later. And so, you can now reinstate your expectations of a steady stream of stuff about SAS software, software development practices, data warehousing, business intelligence, and analytics, plus occasional mentions of broader technical topics that are of personal interest to me, e.g. Android and Bletchley Park.

And finally, thank you to those who sent kind messages of concern regarding my sudden silence in January. I was touched by your concern.

Wednesday 22 January 2014

Estimation (Effort and Duration) #2/2

Yesterday I spoke about the art of estimation. I highlighted how valuable it is to be able to plan and estimate. I also highlighted that estimation is about getting a number that is close enough for the purpose, not about being accurate.

To get to the meat of things... here is my recipe for estimation success... It won't surprise you to see me say that the key elements of estimation are:
  1. Understand the task, i.e. what needs to be produced
  2. Comparison with similar tasks already completed
  3. Decomposition of tasks into smaller, more measurable activities
  4. Attention to detail (sufficient for our purposes)
The first (requirements) is obvious, but is very often not given enough attention to detail, resulting in an incomplete set of items to be produced. In a SAS context, this list might include technical objects such as Visual Analytics reports, stored processes, information maps, macros, DI Studio jobs, table metadata, library metadata, job schedules, security ACTs and ACEs, userIDs, data sets, views, and control files; on the business side, your list might include a user guide, training materials, a schedule of training delivery, a service model that specifies who supports which elements of your solution, a service catalog that tells users how to request support services, and a process guide that tells support teams how to fulfil support requests; and on the documentation side your list might include requirements, designs & specifications, and test cases.

Beyond identifying the products of your work, you'll need to identify what inputs are required in order to allow you to perform the task.

I'll offer further hints, tips and experience on requirements gathering in a subsequent article.

With regard to comparisons, we need to compare our planned task with similar tasks that have already been completed (and hence, we know how many people worked on them and how long they took). When doing this we need to be sure to look for differences between the tasks and take account of them by adjusting our estimate above or below the time the completed task actually took. In doing this, we're already starting to decompose the task because we're already looking for partial elements of the task that differ.

Decomposition is the real key, along with a solid approach to understanding what each of the sub-tasks does. As you decompose a unique task into more recognisable sub-tasks, you'll be able to more confidently estimate the effort/duration of the sub-tasks.

As we decompose the task into smaller tasks, we must be sure that we are clear which of the decomposed tasks is responsible for producing each of the deliverable items. We need to look out for intermediate items that are produced by one sub-task as an input to another sub-task; and pay the same attention to inputs - we must be certain that we understand the inputs and outputs of each sub-task.

I'll take a deeper look at decomposition in a subsequent article.

You're probably thinking that requirements, comparisons, and decomposition are quite obvious. So they should be! We already established that we all perform estimations every day of our lives. All I've done is highlight the things that we do subconsciously. But there is one more key element: attention to detail. We must pay attention to understanding each sub-task in our decomposition. We must be sure to understand its inputs, its outputs, and how we're going to achieve the outputs from the inputs.

Having a clear understanding of the inputs, the outputs and the process is crucial, and it can often help to do the decomposition with a colleague. Much like with pair programming, two people can challenge each other's understanding of the task in hand and, in our context, make sure that the ins, outs and process of each sub-task are jointly understood.

I hope the foregoing has helped encourage you to estimate with more confidence, using your existing everyday skills. However, we should recognise that the change in context from supermarket & driving to software development means that we need a different bank of comparisons. We may need to build that bank with experience.

To learn more about estimating, talk to your more experienced colleagues and do some estimating with them. I'm not a great fan of training courses for estimation. I believe they're too general. In my opinion, you're far better off learning from your colleagues in the context of the SAS environment and your enterprise. However, to the extent that some courses offer practical exercises, and those exercises offer experience, I can see some merit in courses.

Good luck!

Tuesday 21 January 2014

Estimation (Effort and Duration) #1/2

Estimation: The art of approximating how much time or effort a task might take. I could write a book on the subject (yes, it'd probably be a very dull book!); it's something that I've worked hard at over the years. It's a very divisive subject: some people buy into the idea whilst others persuade themselves that it can't be done. There's a third group who repeatedly try to estimate but find their estimates wildly inaccurate and seemingly worthless, and so they eventually end up in the "can't be done" camp.

Beware. When we say we can't do estimates, it doesn't stop others doing the estimation on our behalf. The end result of somebody else estimating how long it'll take us to deliver a task is a mixture of inaccuracy and false expectations placed upon us. Our own estimates of our own tasks should be better than somebody else's estimate.

My personal belief is that anybody can (with a bit of practice and experience) make decent estimates, but only if they perceive that there is a value in doing so. In this article I'll address both: value, and how to estimate.

So, let's start by understanding the value to be gained from estimation. The purpose is not to beat up the estimator if the work takes longer than estimated! All teams need to be able to plan - it allows them to balance their use of constrained resources, e.g. money and staff. No team has enough staff and money to do everything it wants at the same time. Having a plan, with estimated effort and duration for each activity, helps the team keep on top of what's happening now and what's happening next; it allows the team to start things in sufficient time to make sure they're finished by when they need to be; it allows teams to understand that they probably won't get a task done in time and hence the team needs to do something to bypass the issue.

Estimates form a key part of a plan. For me, the value of a plan comes from a) the thought processes used to create the plan, and b) the information gained from tracking actual activity against the planned activities and spotting the deviations. There's very little value, in my opinion, in simply having a plan; it's what you do with it that's important.

Estimates for effort or duration of individual tasks combine to form a plan - along with dependencies between tasks, etc.

Okay, so what's the magic set of steps to get an accurate, bullet-proof estimate?...

Well, before we get to the steps, let's be clear that an estimate is (by definition) neither accurate nor bullet-proof. I remember the introduction of estimation in maths classes in my youth. Having been taught how to accurately answer mathematical questions, I recall that many of us struggled with the concept of estimation. In hindsight, I can see that estimating required a far higher level of skill than cranking out an accurate answer to a calculation. Instead of dealing with the numbers I was given, I had to make decisions about whether to round them up or round them down, and I had to choose whether to round to the nearest integer, ten or hundred before I performed the eventual calculation.

We use estimation every day. We tot-up an estimate of the bill in our heads as we walk around the supermarket putting things into our trolley (to make sure we don't go over budget). We estimate the distance and speed of approaching vehicles before overtaking the car in front of us. So, it's a skill that we all have and we all use regularly.

When I'm doing business analysis, I most frequently find that the person who is least able to provide detail on what they do is the person whose job is being studied. It's the business analyst's responsibility to help coax the information out of them. It's a skill that the business analyst needs to possess or learn.

So, it shouldn't surprise us to find that we use estimation every day of our life yet we feel challenged to know what to do when we need to use the same skills in a different context, i.e. we do it without thinking in the supermarket and in the car, yet we struggle when asked to consciously use the same techniques in the office. And, let's face it, the result of an inaccurate estimate in the office is unlikely to be as damaging as an inaccurate estimate whilst driving, so we should have confidence in our office-based estimations - if only we could figure out how to do it!

We should have confidence in our ability to estimate, and we should recognise that the objective is to get a value (or set of values) that is close enough; we're not trying to get the exact answer; we're trying to get something that is good enough for our purposes.

Don't just tell your project manager the value that you think they want to hear. That doesn't help you, and it doesn't help the project manager either.

Don't be afraid to be pessimistic and add a fudge factor or contingency factor. If you think it'll take you a day then it'll probably take you a day and a half! I used to work with somebody who was an eternal optimist with his estimations. He wasn't trying to pull the wool over anybody's eyes; he honestly believed his estimations (JH, you know I'm talking about you!). Yet everybody in the office knew that his estimations were completely unreliable. Typically, his current piece of work would be "done by lunchtime" or "done by the end of today". We need to be realistic with our estimations, and we need to look back and compare how long the task took with our estimation. If you estimated one day and it took two, make sure you double your next estimation. If somebody questions why you think it'll "take you so long", point them to your last similar piece of work and tell them how long it took you.
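
To make that look-back concrete, here is a minimal sketch of the adjustment in a DATA step. The variable names and the figures (a one-day estimate that actually took two days, and a new raw estimate of three days) are purely illustrative assumptions, not figures from a real project.

```sas
/* Illustrative only: scale a new estimate by how far  */
/* out our previous, similar estimate turned out to be */
data _null_;
  PreviousEstimate = 1; /* days we estimated last time (hypothetical) */
  PreviousActual   = 2; /* days it actually took (hypothetical)       */
  NextRawEstimate  = 3; /* our first-guess estimate for the next task */

  Factor           = PreviousActual / PreviousEstimate;
  AdjustedEstimate = NextRawEstimate * Factor;

  put Factor= AdjustedEstimate=;
run;
```

With these figures the log shows a factor of 2 and an adjusted estimate of 6 days. In practice you might base the factor on several past tasks rather than just the last one.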

When creating and supplying an estimate it's worth thinking about three values: the most likely estimate, the smallest estimate, and the biggest estimate. For example, if I want to estimate the time it might take to develop a single DI Studio job that requires a significant degree of complexity, perhaps I can be sure it'll take at least a couple of days to develop the job, certainly no more than two weeks. Armed with those boundaries, I can more confidently estimate that it'll take one week.

If you're not confident with your estimates, try supplying upper and lower bounds alongside them, so that the recipient of your estimates can better understand your degree of confidence.
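
As a rough sketch of how the three values might be recorded and rolled up, the code below stores a smallest, most likely and biggest estimate for each sub-task and then totals each column. The task names, the figures (in days) and the data set and column names are all hypothetical. Simply adding up the bounds is a simplification that I've chosen for illustration; the total of the biggest estimates is the case where everything goes badly at once.

```sas
/* Hypothetical sub-tasks with three estimates each (in days) */
data Estimates;
  length Task $40;
  infile datalines dsd;
  input Task :$40. Smallest MostLikely Biggest;
  datalines;
Extract job,2,5,10
Load job,1,2,4
Test cases,1,3,5
;
run;

/* Roll the sub-task estimates up to an overall range */
proc sql;
  select sum(Smallest)   as Total_Smallest
        ,sum(MostLikely) as Total_MostLikely
        ,sum(Biggest)    as Total_Biggest
    from Estimates;
quit;
```

For these made-up figures the totals are 4, 10 and 19 days, which gives the recipient of the estimate a feel for both the expected duration and the degree of uncertainty.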

Tomorrow I'll get into the meat of things and offer my recipe for estimation success.

Monday 20 January 2014

Business Intelligence (BI) Evolution

I recently stumbled upon an interesting series of papers from IBM. They're entitled Breaking Away With Business Analytics and Optimisation. The informative and deep-thinking series talks about the need for data and good analytical processes, but it also highlights the need for a vision and a focus for our activities; IBM describes this as "breakthrough ideas". I interpret it as meaning the creation of competitive advantage, i.e. doing something better than the competition.

One particular paper in the series that caught my attention was Breaking away with business analytics and optimisation: New intelligence meets enterprise operations. Page 3 of this paper contains a neat new interpretation of the traditional BI evolution diagram. I've shown IBM's diagram above.

Traditionally the BI evolution diagram shows evolution from historic (static) reporting, through data exploration and forecasts, to predictive models and real-time systems, i.e. a gradual transition from "rear-view mirror" reporting to influencing the future. The IBM diagram contains more dimensions and focuses on the drive to achieve competitive advantage ("Breakaway"). Nice. This diagram certainly earns a place alongside the traditional form.

Tuesday 14 January 2014

NOTE: Thoughts on Lineage

I got quite a lot of interested feedback on the BI Lineage post I made last week. My post highlighted a most informative article from Metacoda's Paul Homes.

Paul himself commented on my post and offered an additional tip. Here's what Paul said:
"I agree it would be nice if BI developers could do their own scans without relying on unrestricted admins to do them ahead of time. This would be similar to how DI developers can do their own impact analysis for DI content in SAS Data Integration Studio. Ideally, as with DI, they could be done dynamically, without having to do a scan and have a BI Lineage custom repository to store them in.

"In the meantime, one tip I'd suggest to make it easier for the BI developers, is that BI Lineage scans can also be scheduled. An unrestricted admin can schedule a scan, at a high level in the metadata tree, to be done every night for example."

A useful tip indeed. Thanks Paul.

Monday 13 January 2014

2014, The Year of Personal Data

If 2013 was the year of wearable, personal devices then 2014 will be the year of personal data. In 2013 we saw a huge rise in popularity of wearable devices for measuring steps walked, distance travelled, pulse, calories consumed, and a lot more besides. These devices, and the smartphone, PC and cloud software that accompanied them, put us on the first few rungs of the business intelligence lifecycle - principally allowing us to do historic reporting.

I believe 2014 will see a great evolution of our use of personal data. Rather than the "rear view mirror" historic reporting that we've seen in 2013, we'll see software that predicts your future activity and offers advice and recommendations on how to positively influence your outcomes. It's not beyond the bounds of possibility for your smartphone to start prompting you to go for a walk at lunchtime in order to meet your weekly target for steps, or consume no more than 400 calories at dinner in order to avoid bursting your weekly calorie target. And those are just simplistic examples.

As an example of the lengths to which you can go to perform data mining on your personal data, I highly recommend the recent report in Wired of the astrophysicist who diagnosed himself with Crohn's disease. A fascinating story.

Friday 10 January 2014

NOTE: Wrap-Up on Test Coverage and MCOVERAGE

I've spent this week describing the functionality and purpose of the MCOVERAGE system option introduced in SAS V9.3. Coverage testing is an important consideration for your testing strategy - it's important to know how much of your code has been tested.

As its name suggests, MCOVERAGE only logs macro coverage. It's a great shame that there isn't an equivalent for DATA steps. Perhaps it will be added in due course, to DATA steps or DS2, or both.

With some planning, and judicious use of some post-processing capability to make sense of the log(s), MCOVERAGE can be an important tool in your testing arsenal.

I note that HMS Analytical Software GmbH's testing tool (SASUnit) includes coverage testing through the use of MCOVERAGE. I've not used SASUnit myself, and I can't speak for how complete, reliable and supported it may be, but if you're interested in learning more I suggest you read the SASUnit: General overview and recent developments paper from the 2013 PhUSE conference and take a look at SASUnit's SourceForge pages.

What is your experience with using coverage testing and/or MCOVERAGE? Post a comment, I'd love to hear from you.

MCOVERAGE:

NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE), 6-Jan-2014
NOTE: Macro Coverage in Testing (MCOVERAGE), 7-Jan-2014
NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros, 8-Jan-2014
NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing, 9-Jan-2014
NOTE: Wrap-Up on Test Coverage and MCOVERAGE, 10-Jan-2014 (this article!)

Thursday 9 January 2014

NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing

Over the last few days I've been revealing the features and benefits of the MCOVERAGE system option introduced in SAS V9.3. This system option creates a log file to show which lines of our macro(s) were executed, e.g. during our tests.

Knowing that we tested all lines of code, or knowing that we tested 66% of all lines of code is important when judging whether we have tested sufficient amounts of our code to give sufficient confidence to put the new/updated system into production. This information relates back to our testing strategy (where we specified targets for the proportion of code lines tested). It also helps us spot dead lines of code, i.e. lines of code that will not ever be executed (perhaps due to redundant logic).

Yesterday I showed code to read an mcoverage log file and create a table to show which macro lines had been executed and which had not. My code was basic and only worked for one execution of the tested macro. Quite often we need to run our code more than once to test all branches through our logic, so today I'll discuss upgrading my mcoverage processing code so that it handles multiple executions of the tested macro.

We might start by running our tested macro twice, with two different parameter values...

filename MClog "~/mcoverage2.log";

options mcoverage mcoverageloc=MClog;

%fred(param=2);
%fred(param=1); /* Take a different path through the code */

filename MClog clear;
* BUT, see my note about closing MClog at 
  the end of my earlier blog post;

The mcoverage log produced from these two consecutive executions looks like this:


1 1 18 FRED
2 1 1 FRED
3 17 17 FRED
2 1 1 FRED
2 2 2 FRED
2 3 3 FRED
2 4 4 FRED
2 4 4 FRED
2 4 4 FRED
2 5 5 FRED
2 6 6 FRED
2 7 7 FRED
2 8 8 FRED
2 8 8 FRED
2 9 9 FRED
2 13 13 FRED
2 18 18 FRED
1 1 18 FRED
2 1 1 FRED
3 17 17 FRED
2 1 1 FRED
2 2 2 FRED
2 3 3 FRED
2 4 4 FRED
2 4 4 FRED
2 4 4 FRED
2 5 5 FRED
2 6 6 FRED
2 7 7 FRED
2 8 8 FRED
2 8 8 FRED
2 9 9 FRED
2 13 13 FRED
2 14 14 FRED
2 15 15 FRED
2 16 16 FRED
2 16 16 FRED
2 18 18 FRED

You will recall that type 1 records mark the beginning of an execution of a macro, type 3 records indicate non-compiled lines (such as blank lines), and type 2 records indicate executed lines of code.

Note how we now get two type 1 records. These each mark the start of a new execution of the %fred macro. Close inspection of the type 2 records shows different sets of line numbers for the first and second executions, reflecting different paths through the %fred macro code.

We're aiming to create an output that shows whether the lines of %fred macro code were executed in one or more tests, or not. So, given that non-executed rows of macro code don't create a record in the mcoverage log, we can process the mcoverage log quite simply by counting the number of type 2 records for each line of macro code. For simplicity, we'll count the type 3s too. The output that we get will look like this:


Recordnum  Record                          Rectype  Executions  Analysis
        1  %macro fred(param=2);                 2           4  Used
        2    * comment ;                         2           2  Used
        3    %put hello world: &param;           2           2  Used
        4    %if 1 eq 1 %then %put TRUE;         2           6  Used
        5    %if 1 eq 1 %then                    2           2  Used
        6    %do;                                2           2  Used
        7      %put SO TRUE;                     2           2  Used
        8    %end;                               2           4  Used
        9    %if 1 eq 0 %then                    2           2  Used
       10    %do;                                .           .  NOT used!
       11      %put FALSE;                       .           .  NOT used!
       12    %end;                               .           .  NOT used!
       13    %if &param eq 1 %then               2           2  Used
       14    %do;                                2           1  Used
       15      %put FOUND ME;                    2           1  Used
       16    %end;                               2           2  Used
       17                                        3           2  Not compiled
       18  %mend fred;                           2           2  Used

So, we can see that executing the %fred macro with two different values for param has resulted in all but three lines of code being executed. We might choose to add additional tests in order to exercise the remaining lines, or a closer inspection might reveal that they are dead lines of code.

The code to create the above output is included at the end of this post. The sequence followed by the code is as follows:
  • Read the mcoverage log file into a data set. Process the data set in order to i) remove type 1 records, and ii) count the number of rows for each line of macro code
  • Read the macro source into a data set, adding a calculated column that contains a line numbering scheme that matches the scheme used by the mcoverage log. We are careful to preserve leading blanks in order to preserve indentation from the code
  • Join the two data sets and produce the final report. Use a monospace font for the code and be careful to preserve leading blanks for indentation
I'll wrap-up this series tomorrow with a summary of what we learned plus some hints and tips on additional features that could be added.

Here's the code:


/* This code will not cope reliably if the macro    */
/* source does not have a line beginning with the   */
/* %macro statement for the macro under inspection. */
/* This code expects a coverage log file from one   */
/* macro. It cannot cope reliably with log files    */
/* containing executions of more than one different */
/* macro.                                           */
/* Support for multiple different macros might be */
/* required if testing a suite of macros.         */
filename MClog "~/mcoverage2.log"; /* The coverage log file (MCOVERAGELOC=) */
filename MacSrc "~/fred.sas";      /* The macro source  */

/* Go get the coverage file. Create macro */
/* var NUMLINES with number of lines      */
/* specified in (first) type 1 record.    */
data LogFile;
  length macname $32;
  keep Macname Start End Rectype;
  infile MClog;
  input Rectype start end macname $;
  prevmac = compress(lag(macname));

  if _n_ ne 1 and prevmac ne compress(macname) then
    put "ERR" "OR: Can only process one macro";

  if rectype eq 1 then
    call symputx('NUMLINES',end);

  if rectype ne 1 and start ne end then
    put "ERR" "OR: Not capable of handling START <> END";
run;

%put NUMLINES=&numlines;

/* Count the number of log records for each line of code. */
proc summary data=LogFile nway;
  where rectype ne 1;
  class start rectype;
  var start; /* Irrelevant choice because we only want N statistic */
  output out=LogFile_summed n=Executions;
run;

/* Go get macro source and add a line number value that */
/* starts at the %macro statement (because this is how  */
/* MCOVERAGE refers to lines).                          */
/* Restrict number of lines stored to the number we got */
/* from the coverage log file.                          */
/* Numlines does not include %mend, so we implicitly    */
/* increment the number of lines by one and thereby     */
/* retain the line containing %mend, purely for         */
/* aesthetic reasons for the final report.              */
data MacroSource;
  length Record $132;
  retain FoundStart 0 
    LastLine 
    Recordnum 0;
  keep record recordnum;
  infile MacSrc pad;
  input record $char132.; /* Keep leading blanks */

  if not FoundStart and upcase(left(record)) eq: '%MACRO' then
    do;
      FoundStart = 1;
      LastLine = _n_ + &NumLines - 1;
    end;

  if FoundStart then
    recordnum + 1;

  if FoundStart and _n_ le LastLine then
    OUTPUT;
run;

/* Bring it all together by marking each line of code */
/* with the record type from the coverage log.        */
proc sql;
  create table dilly as
    select  code.recordnum
      ,code.record
      ,log.rectype
      ,log.Executions
      ,
    case log.rectype
      when 2 then "Used"
      when 3 then "Not compiled"
      when . then "NOT used!" 
      else "UNEXPECTED record type!!"
      end 
    as Analysis
      from MacroSource code left join LogFile_summed log
        on code.recordnum eq log.start;
quit;

proc report data=dilly nowd;
  define record /display style(column)={fontfamily="courier" asis=on};
run;

filename MacSrc clear;
filename MClog clear;

*** end ***;
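
As a possible extension (not part of the code above), the final table can be summarised into a single coverage figure. The sketch below assumes the dilly table created by the PROC SQL step above, with its Analysis column; the output column names are my own. Lines marked "Not compiled" are left out of the calculation because they can never be executed.

```sas
/* Summarise coverage: share of compiled lines that were */
/* executed at least once across all test executions.    */
proc sql;
  select sum(Analysis eq "Used")      as Lines_Used
        ,sum(Analysis eq "NOT used!") as Lines_Not_Used
        ,sum(Analysis eq "Used")
           / sum(Analysis in ("Used","NOT used!"))
                                      as Coverage format=percent8.1
    from dilly;
quit;
```

For the two executions of %fred shown above, this should report 14 lines used and 3 not used, i.e. coverage of roughly 82% of the compiled lines.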

MCOVERAGE:

NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE), 6-Jan-2014
NOTE: Macro Coverage in Testing (MCOVERAGE), 7-Jan-2014
NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros, 8-Jan-2014
NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing, 9-Jan-2014 (this article!)
NOTE: Wrap-Up on Test Coverage and MCOVERAGE, 10-Jan-2014

Wednesday 8 January 2014

NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros

Over the last couple of days I've been uncovering the MCOVERAGE system option for coverage of testing of macro code. Coverage testing shows which lines were executed by your tests (and which were not). Clearly, knowing the percentage of code lines that were executed by your test suite is an important measure of your testing efforts.

Yesterday we saw what the mcoverage log contained for a typical execution of a macro. What we would like to do is make the information more presentable. That's what we'll do today. We'll produce some code that will output the following summary (from which we can determine that 33% of our code lines weren't executed by our test).

recordnum  record                        rectype  analysis
1          %macro fred(param=2);         2        Used
2          * comment ;                   2        Used
3          %put hello world: &param;     2        Used
4          %if 1 eq 1 %then %put TRUE;   2        Used
5          %if 1 eq 1 %then              2        Used
6          %do;                          2        Used
7          %put SO TRUE;                 2        Used
8          %end;                         2        Used
9          %if 1 eq 0 %then              2        Used
10         %do;                          .        NOT used!
11         %put FALSE;                   .        NOT used!
12         %end;                         .        NOT used!
13         %if &param eq 1 %then         2        Used
14         %do;                          .        NOT used!
15         %put FOUND ME;                .        NOT used!
16         %end;                         .        NOT used!
17                                       3        Not compiled
18         %mend fred;                   2        Used

To create this table, we need to read the mcoverage log and the macro source for %fred as follows:

  • We need to process the mcoverage log by reading it into a data set and i) removing record type 1 (because it has no part to play in the above table), and ii) removing duplicated log rows for the same code line (which happens when a line of code is executed more than once).
  • We need to process the macro source by reading it into a data set and adding a column to record the line number (matching the numbers used in the coverage log).
  • Having read both files into separate data sets (and processed them as outlined above), we can join them and produce our report. The code to achieve this is shown at the end of this post.
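As a rough cross-check of the 33% figure quoted above, the coverage log alone is enough to count the lines that were never logged. The following is a minimal sketch rather than the code for this post: it assumes a log from one execution of one macro, and the names FIRSTLINE, LASTLINE, LOGGED and NLOGGED (plus the arbitrary 9999-line upper bound) are my own illustrative choices.

```sas
/* Sketch only: assumes ONE execution of ONE macro in the log. */
/* FIRSTLINE, LASTLINE, LOGGED, NLOGGED and the 9999 bound are */
/* illustrative assumptions, not part of the main program.     */
filename MClog "~/mcoverage1.log";

data _null_;
  length macname $32;
  retain numlines 0;
  array logged{9999} _temporary_; /* 1 = line has a type 2 or 3 record */
  infile MClog end=lastrec;
  input rectype firstline lastline macname $;

  if rectype eq 1 then
    numlines = lastline;      /* type 1 carries the macro's length */
  else
    do line = firstline to lastline;
      if logged{line} ne 1 then
        do;
          logged{line} = 1;
          nlogged + 1;        /* count each code line once only */
        end;
    end;

  if lastrec and numlines gt 0 then
    do;
      notlogged = numlines - nlogged;
      pct = notlogged / numlines;
      put "Lines in macro: " numlines;
      put "Lines never logged: " notlogged pct= percent8.1;
    end;
run;

filename MClog clear;
```

For the %fred log shown yesterday, this should report 6 of the 18 lines as never logged, which is where the 33% comes from. Note that the sketch treats type 3 (not compiled) lines as logged, in line with the table above.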

The code that I've created expects a coverage log file from one execution of one macro. It cannot cope reliably with log files containing either multiple executions of the same macro or executions of more than one different macro. Is this a problem? Well, multiple executions of the same macro might be required if testing various permutations of inputs (parameters and data); and multiple different macros might be required if testing a suite of macros.

Tomorrow I'll augment the code so that it can deal with multiple executions of the same macro, e.g. testing %fred with param=2 and param=1.

Meanwhile, here's today's code...

/* This code will not cope reliably if the macro    */
/* source does not have a line beginning with the   */
/* %macro statement for the macro under inspection. */

/* This code expects a coverage log file from ONE */
/* execution of ONE macro. It cannot cope         */
/* reliably with log files containing either      */
/* multiple executions of the same macro or       */
/* executions of more than one different macro.   */

filename MClog "~/mcoverage1.log"; /* The coverage log file (MCOVERAGELOC=) */
filename MacSrc "~/fred.sas";     /* The macro source  */

/* Go get the coverage file. Create macro */
/* var NUMLINES with number of lines      */
/* specified in type 1 record.            */
data LogFile;
  length macname $32;
  keep macname start end rectype;
  infile MClog;
  input rectype start end macname $;
  prevmac = lag(macname);

  if _n_ ne 1 and prevmac ne macname then
    put "ERR" "OR: Can only process one macro";

  if rectype eq 1 then
    call symputx('NUMLINES',end);

  if rectype ne 1 and start ne end then
    put "ERR" "OR: Not capable of handling START <> END";
run;

%put NUMLINES=&numlines;

/* Remove duplicates by sorting START with NODUPKEY. */
/* Hence we have no more than one data set row per   */
/* line of code.                                     */

  /* This assumes the log file did not contain different 
     RECTYPEs for the same start number */
  /* This assumes log file does not contain differing
     permutations of START and END */
proc sort data=LogFile out=LogFileProcessed NODUPKEY;
  where rectype ne 1;
  by start;
run;

/* Go get macro source and add a line number value that */
/* starts at the %macro statement (because this is how  */
/* MCOVERAGE refers to lines).                          */
/* Restrict number of lines stored to the number we got */
/* from the coverage log file.                          */
data MacroSource;
  length record $132;
  retain FoundStart 0 
    LastLine 
    recordnum 0;
  keep record recordnum;
  infile MacSrc pad;
  input record $132.;

  if not FoundStart and upcase(left(record)) eq: '%MACRO' then
    do;
      FoundStart = 1;
      LastLine = _n_ + &NumLines - 1;
    end;

  if FoundStart then
    recordnum + 1;

  if FoundStart and _n_ le LastLine then
    OUTPUT;
run;

/* Bring it all together by marking each line of code */
/* with the record type from the coverage log.        */
proc sql;
  select  code.recordnum
    ,code.record
    ,log.rectype
    ,case log.rectype
       when 2 then "Used"
       when 3 then "Not compiled"
       when . then "NOT used!"
       else "UNEXPECTED record type!!"
     end as analysis
  from MacroSource code left join LogFileProcessed log
    on code.recordnum eq log.start;
quit;

filename MacSrc clear;
filename MClog clear;

As an endnote, I should explain my personal/idiosyncratic coding style:
  • I want to be able to search the log and find "ERROR" only if errors have occurred. But if I code put "ERROR: message"; then I will always find "ERROR" when I search the log (because my source code will be echoed to the log). By coding put "ERR" "OR: message"; my code looks a little odd but I can be sure that "ERROR" gets written to the log only if an error has occurred.
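
To make the idiom concrete, here is a small self-contained sketch. The data values and variable names (ID, AMOUNT) are invented purely for illustration. The DATA step relies on PUT writing adjacent quoted strings with no blank between them; the macro line at the end uses %str() to split the word in a similar way, which is a common variant rather than something used in the code above.

```sas
/* Hypothetical data: ID and AMOUNT are invented for this sketch. */
data _null_;
  input id amount;

  /* The echoed source shows "ERR" "OR:", never the whole word. */
  /* Only when the condition is met does the complete keyword   */
  /* get written to the log.                                    */
  if amount lt 0 then
    put "ERR" "OR: Negative amount found for " id=;
  datalines;
1 10
2 -5
;
run;

/* A macro-language variant of the same idea (my assumption, */
/* not taken from the code above).                            */
%put %str(ERR)OR: Something went wrong in the macro;
```

Searching the resulting log for the full keyword should then find only the messages that were really issued (here, one for id=2 and one from the %put), not the source lines that produce them.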

Tuesday 7 January 2014

NOTE: Broadening Access to the BI Lineage Plug-In

Metacoda's Paul Homes recently wrote a most informative article entitled Providing User Access to the SAS BI Lineage Plug-in. As Paul says in his article, the BI Lineage plug-in can be used to do impact analysis for BI content (reports, information maps etc.) in much the same way that SAS Data Integration Studio provides impact analysis for DI content (jobs, tables, etc).

The plug-in was new with the November 2010 release of V9.2. Its results include lineage and reverse lineage, i.e. predecessor and successor objects.

Developers find this information useful in order to understand the impact of changing an information map (for example) on reports and, contrary-wise, it is useful for understanding what BI objects (such as information maps) will need to be changed in order to add a new column to a report. This information is useful to capture the full scope of a proposed change and hence to more accurately estimate the effort required.

Testers also find this information useful because it helps to give them a gauge of the amount of coverage their testing is achieving (this week's theme on NOTE:!).

Paul describes how to make lineage reports viewable by any authorised user, but he concludes that only a strictly limited set of users can create the reports, i.e. what SAS calls "unrestricted users". This is a shame because the functionality is of broad interest and value. Let's hope that SAS makes the creation of lineage reports more accessible in future. If you agree, hop over to the SASware ballot community and propose the enhancement. If you're unfamiliar with the ballot, read my overview from August 2012.

In addition, the ability to join lineage reports for BI and DI objects would provide the full provenance of data items. Now that's something I'd love to see!

NOTE: Macro Coverage in Testing (MCOVERAGE)

Yesterday I introduced the MCOVERAGE system option (introduced in V9.3) for capturing coverage of macro execution. This is useful in testing, to be sure you executed all lines of your macro. This may take more than one execution of your macro, with different input parameters and data.

I finished yesterday's post by showing the mcoverage log file created from the execution of a sample macro. I've listed all three files below. They are:
  1. The program that I ran
  2. The mcoverage log file
  3. The macro source for %fred (with line numbers added; the blank lines were intentional, to show how they are dealt with by MCOVERAGE)


filename MClog "~/mcoverage1.log";

options mcoverage mcoverageloc=MClog;

%fred(param=2);

filename MClog clear;

* BUT, see my note about closing MClog at
  the end of yesterday's blog post;

1 1 18 FRED
2 1 1 FRED
3 17 17 FRED
2 1 1 FRED
2 2 2 FRED
2 3 3 FRED
2 4 4 FRED
2 4 4 FRED
2 4 4 FRED
2 5 5 FRED
2 6 6 FRED
2 7 7 FRED
2 8 8 FRED
2 8 8 FRED
2 9 9 FRED
2 13 13 FRED
2 18 18 FRED

1.
2.  %macro fred(param=2);
3.    * comment ;
4.    %put hello world: &param;
5.    %if 1 eq 1 %then %put TRUE;
6.    %if 1 eq 1 %then 
7.    %do;
8.      %put SO TRUE;
9.    %end;
10.   %if 1 eq 0 %then 
11.   %do;
12.     %put FALSE;
13.   %end;
14.   %if &param eq 1 %then
15.   %do;
16.     %put FOUND ME;
17.   %end;
18.
19. %mend fred;
20.

The SAS 9.4 Macro Language: Reference manual tells us that the format of the coverage analysis data is a space delimited flat text file that contains three types of records. Field one of the log file contains the record type indicator. The record type indicator can be:
  • 1 = indicates the beginning of the execution of a macro. Record type 1 appears once for each invocation of a macro
  • 2 = indicates the lines of a macro that have executed. A single line of a macro might cause more than one record to be generated.
  • 3 = indicates which lines of the macro cannot be executed because no code was generated from them. These lines might be either commentary lines or lines that cause no macro code to be generated.
We can see examples of these in the listing shown above. The second and third fields contain the starting and ending record number, and the fourth field contains the name of the macro (you figured that out yourself, right?).

So, record type 1 from our log is telling us that %fred is 18 lines long; record type 3 is telling us that line 17 has no executable elements within it (because it's blank); and the record type 2 lines are telling us which code lines were executed. By implication, lines of code that were not executed don't feature in the mcoverage log. How do we interpret all of this?

The first thing to note is that the line numbers shown in the mcoverage log are relative to the %macro statement and hence don't align with our own line numbers (I deliberately included a blank first and last line in the fred.sas file in order to demonstrate this). The type 2 records show that all lines were executed by our test except 10-12 and 14-17 (these are numbered 11-13 and 15-18 above). Given the logic and the fact that we supplied param=2 when we executed the macro (see yesterday's post), this would seem understandable/correct.

However, surely we can write a quick bit of SAS code to do the brainwork for us and show which lines were executed and which weren't. Of course we can, and I'll show an example program to do this tomorrow...

MCOVERAGE:

NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE), 6-Jan-2014
NOTE: Macro Coverage in Testing (MCOVERAGE), 7-Jan-2014 (this article!)
NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros, 8-Jan-2014
NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing, 9-Jan-2014
NOTE: Wrap-Up on Test Coverage and MCOVERAGE, 10-Jan-2014

Monday 6 January 2014

NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE)

The SAS macro language is almost as old as SAS itself (who knows exactly?) so you'd think the need to add new functionality would have ceased - particularly with the ability to access most DATA step functions through %sysfunc. But apparently not...

SAS V9.4 introduces a few new macro features, but not a huge number. The two that caught my eye were:
  1. The SYSDATASTEPPHASE automatic macro variable which offers an insight into the current running phase of the DATA step
  2. The READONLY option on %local and %global.
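
As a brief illustration of the second item, the sketch below declares a read-only global macro variable and then tries to change it. The variable name BUILD and its values are hypothetical, and the "/ readonly" form is as I understand it from the 9.4 macro reference, so do check it against your own release.

```sas
/* Hypothetical example: BUILD is an invented name.       */
/* READONLY requires an initial value on the declaration. */
%global / readonly BUILD=v1;

%put NOTE: BUILD is &build;

/* This attempt to change the value should be rejected */
/* with an error, leaving the original value in place. */
%let BUILD=v2;

%put NOTE: BUILD is still &build;
```

Bear in mind that re-submitting the declaration in the same session is also likely to be rejected, because the variable already exists and is read-only.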
Not so long ago, SAS V9.3 introduced a raft of new automatic macro variables, macro functions, macro statements and macro system options.

When 9.3 was launched, one of the new system options caught my eye: MCOVERAGE. It claimed to offer coverage analysis for macros, i.e. highlighting which macro code lines were executed and which were not (particularly useful whilst testing your macros). When I wrote of the release of 9.3 I didn't have immediate access to 9.3, the documentation offered little in the way of real-world explanation, and (I confess) I forgot to return to the topic when I got use of a copy of 9.3.

Well, I was reminded of MCOVERAGE recently and I've spent a bit of time over Christmas figuring out how it works and what it offers in real terms (what is Christmas for if it's not for indulging yourself in things you love?). If you do a lot of macro coding then you'll be interested to know that MCOVERAGE offers plenty. Read on...

Consider this piece of code:

filename MClog "~/mcoverage1.log";

options mcoverage mcoverageloc=MClog;

%fred(param=2);

filename MClog clear;

The SAS log doesn't include any extra information, but we've created a new file named mcoverage1.log in our unix home directory (if you're on Windows, replace "~/mcoverage1.log" with "C:\mcoverage1.log"). I'll describe what the %fred macro does later but, for now, let's just say it's a macro that we want to test. So, we've tested it (with param=2), it worked fine, but have we tested all of the lines of code, or did we only execute a sub-set of the whole macro? If we look into mcoverage1.log we can find the answer. It looks like this:

1 1 18 FRED
2 1 1 FRED
3 17 17 FRED
2 1 1 FRED
2 2 2 FRED
2 3 3 FRED
2 4 4 FRED
2 4 4 FRED
2 4 4 FRED
2 5 5 FRED
2 6 6 FRED
2 7 7 FRED
2 8 8 FRED
2 8 8 FRED
2 9 9 FRED
2 13 13 FRED
2 18 18 FRED

What does this mean? I'll explain tomorrow...

But before tomorrow, I must add one further piece of information. In order to see the mcoverage log, it needs to be closed by SAS. One does this by coding filename MClog clear;. However, I found that SAS refused to close the file because it was "in use". Even coding options nomcoverage; before closing it didn't help. In the end I resorted to running another (null) macro after setting nomcoverage. This did the trick, but if anybody can suggest how I can more easily free-up the mcoverage log I'd be very interested to hear. Here's the full code that I used:

%macro null;%mend null;

filename MClog "~/mcoverage1.log";

options mcoverage mcoverageloc=MClog;

%include "~/fred.sas";

%fred(param=2);

options nomcoverage mcoverageloc='.';

%null;

filename MClog clear;

MCOVERAGE:

NOTE: Macros Newness in 9.4 and 9.3 (MCOVERAGE), 6-Jan-2014 (this article!)
NOTE: Macro Coverage in Testing (MCOVERAGE), 7-Jan-2014
NOTE: Making Sense of MCOVERAGE for Coverage Testing of Your Macros, 8-Jan-2014
NOTE: Expanding Our Use of MCOVERAGE for Coverage Analysis of our Macro Testing, 9-Jan-2014
NOTE: Wrap-Up on Test Coverage and MCOVERAGE, 10-Jan-2014

Friday 3 January 2014

NOTE: Metadata-Bound Libraries for Dummies

As I said in yesterday's My 2013 Top Ten article, I think that metadata-bound libraries (introduced in V9.3 maintenance 2) are one of the most significant SAS enhancements for some time. It seems SAS's Chris Hemedinger agrees. Read his Closing the "LIBNAME loophole" with metadata-bound libraries article on his SAS Dummy blog to get a very rounded view on their benefits and how to use them.

Thursday 2 January 2014

My 2013 Top Ten

Last week I published the ten blog articles that scored most hits over the last 18 months. That was "your" top ten. I thought I'd offer an alternative top ten by listing the articles that I most enjoyed writing or which I personally thought were of the greatest significance. Pure indulgence; please forgive me!

In no particular order...
  • NOTE: Ampersands Again, 31-Jan-2013. I love SAS macros and I was very happy to write a couple of articles at the beginning of the year about the use of multiple ampersands to achieve indirect references to macro variables
  • NOTE: High-Availability Metadata, 8-May-2013. V9.4's (optional) introduction of multiple metadata servers was a big step forward for SAS resilience
  • Predictive Analytics in the 17th Century, 14-May-2013. I was glad to be able to bring John Graunt's fascinating work to a wider audience
  • NOTE: The OPEN Function (reading data sets in macros), 9-Jan-2013. SAS/AF's SCL language remains a favourite of mine. Using some of the SCL functions in DATA step and macro brings back happy memories! They're useful too. This individual article was part of a short-series I wrote about the OPEN function
  • Test Cases, an Investment, 11-Dec-2013. A very recent article, but covering something that doesn't get enough coverage, i.e. re-use of tests for regression testing
  • NOTE: Enterprise Guide vs DI Studio - What's the Difference?, 4-Dec-2013. One of my favourite interview questions!
  • NOTE: More Agility at SAS, 18-Nov-2013. SAS are a successful software company, so there's always lots to learn by understanding how software development is done at SAS
  • NOTE: Interactive Metadata-Bound Libraries (MBLs), 14-Nov-2013. I think that metadata-bound libraries are one of the most significant developments in SAS for quite some time. To see a nice interactive interface as an option is just icing on the cake
  • Beware the Data Shadow, 24-Apr-2013. Be sure that your sources of data are of good quality and coming from reliable sources
  • Animated Graphics, Courtesy of V9.4 SAS/GRAPH, 11-Jun-2013. Hans Rosling's animated statistical presentation on health and income growth across the decades is legendary. In this post I provided a pointer to Rob Allison's treatise on how to do the same (almost) in SAS
Let's hope for a peaceful and prosperous New Year for us all!

Wednesday 1 January 2014

Turing Pardoned!

In December 2012 I wrote of my family's visit to Bletchley Park in Central England. Bletchley has a remarkable history and we enjoyed discovering its contribution to computer science along with that of Alan Turing.

In the article, I mentioned the tragic circumstances surrounding the end of Alan Turing's life (in brief, he was a homosexual in 1950s Britain, he was convicted of gross indecency, accepted chemical castration as punishment, and died soon after - having taken a draught of cyanide, possibly as a suicide attempt). I mentioned an official public apology from Prime Minister Gordon Brown made in 2009 and a growing campaign for a pardon for Turing.

I was pleased to see that Alan Turing was indeed granted a royal pardon on December 24th, i.e. just over a week ago. Whilst some people think that a royal pardon for someone who has been dead for nearly 60 years is meaningless, I agree with those who consider that the pardon rights a wrong and provides some kind of recognition for a man who contributed so much to his country and to computer science (virtually all of which was secret until the mid-70s).

I still await my copy of Sue Black's Saving Bletchley Park book (which nears completion) but, in the meantime, I have enjoyed reading Robert Harris's Enigma and Jack Norman's Broken Crystal - both being fictionalised accounts of activities at Bletchley Park during World War II.