Monday 26 March 2012

NOTE: A Big Data Announcement (Hadoop)

There was a time when SAS practitioners needed only a knowledge of SAS DATA step and PROC syntax, plus a smattering of good data and IT practices. Over the years, SAS/ACCESS products gave access to data in other systems without the SAS practitioner needing to know anything but the access credentials for those systems. More recently, SAS version 9 has introduced a range of SAS clients that allow users to focus on their data knowledge and analytical skills rather than their SAS coding expertise. At the same time, it has introduced (necessary) complexity to the architecture: multiple types of SAS servers, TCP/IP ports, third-party components such as web servers, and the Platform suite of tools (from LSF job scheduling to Grid management).

Into this smorgasbord of clients and architecture, SAS added some fascinating new components earlier this month. I was interested in the new features and capabilities, but I also wanted to understand whether this brought more or less complexity for the SAS user and/or those charged with SAS platform support.

So, firstly, what was the announcement? Well, on March 6th, SAS announced the introduction of Hadoop support as part of Enterprise DI Server. Hadoop, in a nutshell, is an open source Apache product that provides massively parallel access to massive volumes of data. Significant users and supporters of Hadoop include Amazon, eBay, Facebook, Google, IBM, Macy's, Twitter, and Yahoo. No matter how you look at it, this is a big announcement if you are into big data; and if you're not yet into big data, maybe you soon will be.

There are multiple technologies associated with Hadoop, and SAS seems to have covered them all. For instance,

  • SAS/ACCESS will provide seamless and transparent data access to Hadoop (via HiveQL). Users can access Hive tables as if they were native SAS data sets.
  • PROC SQL will provide the ability to execute explicit HiveQL commands in Hadoop.
  • SAS will help execute Hadoop functionality with Base SAS by enabling MapReduce programming, scripting support and the execution of HDFS commands from within the SAS environment. This will complement the capability that SAS/ACCESS provides for Hive by extending support for Pig, MapReduce and HDFS commands. [yes, I did copy the text from the SAS web site; no, I don't (yet) fully understand all of the terms!]
  • DI Studio will include Hadoop-specific transforms for extracting and transforming data.
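
To make the first three routes a little more concrete, here is a rough sketch of how they might look in code. It assumes the Hadoop engine follows the same LIBNAME and explicit pass-through conventions as the other SAS/ACCESS engines, and that HDFS commands are issued through a PROC HADOOP step. The server name, port, credentials, paths and table names are all made up for illustration, and I haven't been able to run any of it, so treat it as a sketch rather than working code.

```sas
/* Hypothetical connection details: replace with your own site's values. */
%let hiveserver = hive.example.com;

/* 1. SAS/ACCESS: assign a library so Hive tables look like SAS data sets. */
libname hdp hadoop server="&hiveserver" port=10000
        user=myuser password=mypass;

/* Read a (hypothetical) Hive table with an ordinary PROC step. */
proc print data=hdp.sales (obs=10);
run;

/* 2. PROC SQL: explicit pass-through, so the inner query is HiveQL */
/*    and is executed by Hive rather than by SAS.                   */
proc sql;
  connect to hadoop (server="&hiveserver" port=10000
                     user=myuser password=mypass);
  select * from connection to hadoop
    (select region, count(*) as orders
       from sales
      group by region);
  disconnect from hadoop;
quit;

/* 3. Base SAS: HDFS commands from within SAS.                     */
/*    cfg points at a (hypothetical) Hadoop configuration file.    */
filename cfg "/home/myuser/hadoop_config.xml";

proc hadoop options=cfg username="myuser" password="mypass" verbose;
  hdfs mkdir="/user/myuser/staging";
  hdfs copyfromlocal="/home/myuser/sales.txt"
       out="/user/myuser/staging/sales.txt";
run;
```

If the syntax does turn out like this, the LIBNAME route asks nothing new of the SAS user, whereas the pass-through and PROC HADOOP routes assume some knowledge of HiveQL and HDFS.
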

I'm slowly getting up-to-speed with this stuff myself, and I'll certainly be on the lookout for knowledge at SAS Global Forum next month. In the meantime, I found a couple of blog posts by Mark Troester most informative [1, 2].

The software release is certainly getting large amounts of positive comment from the technology media. The article in InformationWeek, which sings SAS's praises, is just one example.

And so, to return to my original question: has this announcement brought more or less complexity for the SAS user and/or those charged with SAS platform support? It seems clear that the SAS platform architect will need to understand Hadoop concepts, and that will require additional skills and knowledge. On the other hand, it sounds like SAS clients will do a great job of allowing the user to focus on their data and their analytical processes rather than learn new Hadoop-specific technical skills. On balance, I'd say that's the right compromise.