Thursday, 23 July 2015

NOTE: SAS Grid Manager for Hadoop

I've recently written about how much new functionality SAS is releasing on an almost monthly basis without much fanfare, and I've also written about how Hadoop is becoming a new "operating system" and how we should expect to see Grid and LASR running within Hadoop in due course. Well, the release of SAS v9.4 m3 earlier this month brought exactly that for Grid: SAS Grid Manager for Hadoop.

In fact, the m3 release of SAS Grid Manager brought a raft of changes that point towards a different future for grid computing with SAS.
  • SAS Grid Manager for Hadoop has been added. It brings workload management, accelerated processing, and scheduling to a Hadoop environment
  • Support has been added for using an Oozie scheduling server. This server is used in a SAS Grid Manager for Hadoop environment
  • An agent plug-in and a management module have been added to SAS Environment Manager. In short, we can now monitor and manage our Platform grids using Environment Manager instead of RTM (although some features remain unique to RTM for the moment)
So, grid computing in SAS 9.4 m3 now offers a choice between Platform Suite for SAS and Grid Manager for Hadoop. And if you choose the Platform grid, you may no longer need to install and operate RTM.

Licensing issues aside, you may choose to run one or both types of grid technology. This article focuses on Grid Manager for Hadoop. From a user's perspective, there is little or no difference between the two choices because Grid Manager for Hadoop accepts all of the existing Grid syntax and submission modes, and it supports integration with other SAS products and solutions. However, from an architectural and administrative point of view, I believe Grid Manager for Hadoop has two key advantages:
  1. If your data is in Hadoop, you don't need to extract it out of the Hadoop cluster in order to process it on the grid. A key tenet of big data is to minimise "data miles" by sending the code to the data rather than transferring terabytes or petabytes of data to the compute server
  2. SAS Platform grids require a clustered file system ("shared data"); Grid Manager for Hadoop uses a shared-nothing approach and hence a bane of my life is eliminated! I've never enjoyed a happy coexistence with a clustered file system. They have often been new/unknown technology for my clients' IT infrastructure teams, and they have often been unreliable (there may be a link between these two facts). When the clustered file system is the heart of the grid, unreliability is not a good quality
I must point out that the documentation does not state that the full syntax of Base SAS and associated products is available when run on a Hadoop-based grid. Certainly, up to this point in time, the SAS processes embedded into Hadoop have only been able to run a subset of SAS syntax, via DS2, plus high-performance (HP) procedures. Furthermore, if we think of the no-shared-data model, it would seem inefficient in the extreme to run a SAS job on one grid node and expect the Hive/HDFS data to be streamed to that one node from all of the data nodes where it resides. So, efficient use of the in-Hadoop capability necessitates the use of DS2 or HP procedures.
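
To put rough numbers on the "data miles" argument, here is a minimal Python sketch. Every figure in it (node count, data per node, code size, result size) is a hypothetical illustration, and the model deliberately ignores replication, compression and any locality optimisation. It only compares how many bytes cross the network when one node pulls all the data versus when the code is sent to each data node.

```python
# Back-of-envelope comparison of "data miles" for two processing styles.
# All sizes are hypothetical illustrations, not measurements of any SAS or Hadoop system.

GB = 1024 ** 3
KB = 1024


def bytes_moved_pull(data_per_node, local_node=None):
    """Bytes streamed over the network when one node pulls all data to itself.

    data_per_node maps node name -> bytes held on that node.
    Data already on local_node (if any) does not cross the network.
    """
    return sum(size for node, size in data_per_node.items() if node != local_node)


def bytes_moved_push(data_per_node, code_size, result_size_per_node):
    """Bytes moved when the code is sent to every data node and only
    each node's (small) partial result travels back."""
    nodes = len(data_per_node)
    return nodes * (code_size + result_size_per_node)


# Hypothetical cluster: 10 data nodes holding 200 GB each.
cluster = {f"node{i:02d}": 200 * GB for i in range(10)}

pull = bytes_moved_pull(cluster, local_node="node00")
push = bytes_moved_push(cluster, code_size=50 * KB, result_size_per_node=10 * KB)

print(f"pull data to one node: {pull / GB:,.0f} GB over the network")
print(f"send code to the data: {push / KB:,.0f} KB over the network")
```

The gap only holds when each node's partial result is small relative to its data, which is the typical shape of summarising or scoring work and the reason the in-Hadoop route matters.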

The SAS Grid Computing in SAS 9.4, Fourth Edition manual gives you all the information you need to plan, install and utilise your Grid within Hadoop in your v9.4 m3 environment. You will see that YARN is used for resource management and Oozie for scheduling. Cloudera, Hortonworks and MapR distributions of Hadoop are supported.

The manual tells us that the install process involves six steps:
  1. Install Hadoop services
  2. Enable Kerberos on the Hadoop cluster
  3. Enable SSL
  4. Update YARN parameters
  5. Set up HDFS directories
  6. Run the SAS Deployment Wizard to install and configure a SAS Grid Manager for Hadoop control server
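
If you want to track progress through these steps, a small checklist helper is one option. The Python sketch below is purely illustrative: the step names come from the list above, but the `InstallChecklist` class is a hypothetical helper of my own, and it assumes the steps are meant to be completed strictly in the order listed.

```python
# Track the six install steps and refuse to mark one complete out of order.
# The step names come from the manual's list; the tracker is a hypothetical helper.

INSTALL_STEPS = [
    "Install Hadoop services",
    "Enable Kerberos on the Hadoop cluster",
    "Enable SSL",
    "Update YARN parameters",
    "Set up HDFS directories",
    "Run the SAS Deployment Wizard",
]


class InstallChecklist:
    def __init__(self, steps):
        self.steps = list(steps)
        self.done = 0  # number of steps completed so far

    def next_step(self):
        """Return the next outstanding step, or None when all are complete."""
        return self.steps[self.done] if self.done < len(self.steps) else None

    def complete(self, step):
        """Mark a step complete; raise if it is not the next one in sequence."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"out of order: expected {expected!r}, got {step!r}")
        self.done += 1


checklist = InstallChecklist(INSTALL_STEPS)
checklist.complete("Install Hadoop services")
print("next:", checklist.next_step())
```
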
I'm sure this install won't be plain sailing because there are a lot of new technologies and components involved. Equally, there are doubtless some features of the Platform grid that are not (yet) available in the Hadoop-hosted grid. But if you are planning a big data project and you need a grid, I suggest you give due consideration to this new option.