
Wednesday, 2 September 2015

NOTE: SAS "Inside" of Hadoop

We previously looked at SAS Grid Manager for Hadoop, which brings workload management, accelerated processing, and scheduling to a Hadoop environment. It was introduced with the m3 maintenance release of SAS v9.4, which also brought support for using an Oozie scheduling server.

If you're keen to get additional SAS services running on your Hadoop cluster, potentially reducing "data miles", you'll be pleased to know that SAS has an experimental feature in v2.7 of the LASR Analytic Server that allows the server's resources to be managed by YARN. I need to stress "experimental" - this is not ready for our production systems quite yet, unless reliability and availability are low on our list of priorities.

If the experimental status doesn't put you off then you can find more details at the back of the LASR Analytic Server 2.7: Reference Guide.

YARN (Yet Another Resource Negotiator) is part of the base framework of version 2 of Apache Hadoop. It is the cluster's resource manager, allocating compute resources such as CPU and memory across the applications that run on the cluster. Configuring LASR to participate in YARN's resource sharing gives YARN a complete picture of activity on the cluster.
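For a flavour of what this looks like from the SAS side, here's a minimal sketch of starting a distributed LASR server with PROC LASR. The host, port and paths are hypothetical, and the YARN participation itself is enabled through the server-side configuration described in the Reference Guide rather than through this syntax:

    /* Minimal sketch: start a distributed LASR Analytic Server across the
       cluster. Host, port and paths are hypothetical. With the experimental
       YARN integration configured, the server's memory and CPU would be
       accounted for by YARN alongside other applications on the cluster. */
    proc lasr create port=10010 path="/tmp/lasr";
       performance host="hadoop-head.example.com"  /* head node (hypothetical) */
                   install="/opt/sas/TKGrid"       /* TKGrid install location  */
                   nodes=all;                      /* span every worker node   */
    run;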

The use of YARN with LASR is part of the increasing integration between SAS and Hadoop. I look forward to seeing it move to "general availability" status.

Thursday, 23 July 2015

NOTE: SAS Grid Manager for Hadoop

I've recently written about how much new functionality is getting released by SAS on an almost monthly basis without much fanfare, and I've also written about how Hadoop is becoming a new "operating system" and we should expect to see Grid and LASR running within Hadoop in due course. Well, the release of SAS v9.4 m3 earlier this month brought: SAS Grid Manager for Hadoop.

In fact, the m3 release of SAS Grid Manager brought a raft of changes that point towards a different future for grid computing with SAS.
  • SAS Grid Manager for Hadoop has been added. SAS Grid Manager for Hadoop brings workload management, accelerated processing, and scheduling to a Hadoop environment
  • Support has been added for using an Oozie scheduling server. This server is used in a SAS Grid Manager for Hadoop environment
  • An agent plug-in and a management module have been added to SAS Environment Manager. In short, we can now monitor and manage our Platform grids using Environment Manager instead of RTM (although some features remain unique to RTM for the moment)
So, grid computing in SAS 9.4 m3 now offers a choice between Platform Suite for SAS and Grid Manager for Hadoop. And if you choose the Platform grid, you may no longer need to install and operate RTM.

Licensing issues aside, you may choose to run one or both types of grid technology. This article focuses on Grid Manager for Hadoop. From a user's perspective, there is little or no difference between the two choices because Grid Manager for Hadoop accepts all of the existing grid syntax and submission modes (as the sketch after the list below illustrates), and integration with other SAS products and solutions is supported too. However, from an architectural and administrative point of view, I believe there are two key advantages for Grid Manager for Hadoop:
  1. If your data is in Hadoop, you don't need to extract it out of the Hadoop cluster in order to process it on the grid. A key tenet of big data is to minimise "data miles" by sending the code to the data rather than transferring terabytes or petabytes of data to the compute server
  2. SAS Platform grids require a clustered file system ("shared data"); Grid Manager for Hadoop uses a shared-nothing approach and hence a bane of my life is eliminated! I've never shared a happy coexistence with a clustered file system. They have often been new/unknown technology for my client's IT infrastructure team, and they have often been unreliable (there may be a link between these two facts). When the clustered file system is the heart of the grid, unreliability is not a good quality
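To illustrate the point about syntax compatibility, here's a minimal sketch of the familiar grid submission pattern; it is the same whether the underlying grid is Platform-based or Hadoop-based. It assumes a metadata connection is already configured and that "SASApp" is the logical grid server name in your metadata (both hypothetical here):

    /* Minimal sketch: the familiar SAS/CONNECT grid submission pattern.
       The logical server name "SASApp" is hypothetical. */
    %let rc = %sysfunc(grdsvc_enable(_all_, resource=SASApp));
    signon task1;                  /* the grid service allocates a node */
    rsubmit task1 wait=no;         /* the work runs on that grid node   */
       proc means data=sashelp.class;
       run;
    endrsubmit;
    waitfor task1;                 /* block until the task completes    */
    signoff task1;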
I must point out that the documentation does not state that the full syntax of Base SAS and associated products is available when run on a Hadoop-based grid. Certainly, up to this point in time, the SAS processes embedded into Hadoop have only been able to run a subset of SAS syntax, via DS2, plus high performance (HP) procedures. Furthermore, if we think of the no-shared-data model, it would seem inefficient in the extreme to run a SAS job on one grid node and expect the Hive/HDFS data to be streamed to that one node from all of the data nodes where it resides. So, efficient use of the in-Hadoop capability necessitates the use of DS2 or HP procedures.
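As a sketch of what that in-Hadoop pattern looks like, the DS2 step below could be pushed down into the cluster by the SAS In-Database Code Accelerator (a separately licensed component that requires the SAS Embedded Process on the data nodes). The libref, tables and logic are hypothetical:

    /* Minimal sketch: a DS2 thread program that the In-Database Code
       Accelerator can run on the Hadoop data nodes, next to the data.
       Libref, tables and the flagging rule are hypothetical. */
    libname hdp hadoop server="hive.example.com" user=sasdemo;

    proc ds2 ds2accel=yes;         /* request in-database execution */
       thread flag_th / overwrite=yes;
          dcl double flag;
          method run();
             set hdp.transactions;      /* read locally on each node */
             flag = (amount > 1000);    /* simple derived column     */
          end;
       endthread;

       data hdp.transactions_flagged (overwrite=yes);
          dcl thread flag_th t;
          method run();
             set from t;                /* results stay in Hadoop    */
          end;
       enddata;
    run;
    quit;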

The SAS Grid Computing in SAS 9.4, Fourth Edition manual gives you all the information you need to plan, install and utilise your grid within Hadoop in your v9.4 m3 environment. You will see that YARN is used for resource management, and Oozie for scheduling. The Cloudera, Hortonworks and MapR distributions of Hadoop are supported.

The manual tells us that the install process involves six steps:
  1. Install Hadoop services
  2. Enable Kerberos on the Hadoop cluster
  3. Enable SSL
  4. Update YARN parameters
  5. Set up HDFS directories (see the sketch after this list)
  6. Run the SAS Deployment Wizard to install and configure a SAS Grid Manager for Hadoop control server
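By way of illustration for step 5, HDFS directories can be created from a SAS session with PROC HADOOP; the config file, user and path below are hypothetical, and the manual lists the directories that SAS Grid Manager for Hadoop actually requires:

    /* Minimal sketch: create an HDFS directory with PROC HADOOP. The
       config file, username and path are hypothetical; consult the
       manual for the directories the grid actually needs. */
    filename cfg "/etc/hadoop/conf/combined-site.xml";

    proc hadoop options=cfg username="sasgrid" verbose;
       hdfs mkdir="/user/sasgrid/gridwork";
    run;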
I'm sure this install won't be plain sailing because there are a lot of new technologies and components involved. Equally, there are doubtless some features of the Platform grid that are not (yet) available in the Hadoop-hosted grid. But if you are planning a big data project and you need a grid, I suggest you give due consideration to this new option.

Tuesday, 7 July 2015

Hadoop is the New Black

It feels like any SAS-related project in 2015 not using Hadoop is simply not ambitious enough. The key question seems to be "how big should our Hadoop cluster be" rather than "do we need a Hadoop cluster".

Of course, I'm exaggerating; not every project needs to use Hadoop. But there is an element of new thinking required when you consider what data sources are available to your next project and what value they would add to your end goal. Internal and external data sources are easier to acquire, and volume is less and less of an issue (or, stated another way, you can realistically aim to acquire larger and larger data sources if they will add value to your enterprise).

Whilst SAS is busy moving clients from PC to web, there's a lot of work being done by SAS to move the capabilities of the SAS server inside of Hadoop. And that's to minimise "data miles" by moving the code to the data rather than vice-versa. It surely won't be long before we see SAS Grid and LASR running inside of Hadoop. It's almost like Hadoop has become a new operating system on which all of our server-side capabilities must be available.

We tend to think of Hadoop as being a central destination for data but it doesn't always start its presence in an organisation in that way. Hadoop may enter an organisation for a specific use case, but data attracts data, and so once in the door Hadoop tends to become a centre of gravity. This effect is caused in no small part by the appeal of big data being not just about the data size, but the agility it brings to an organisation.

SAS's Senior Director of the EMEA and AP Analytical Platform Centre of Excellence, Mark Torr (that's one heck of a title Mark!) recently wrote a well-founded article on the four levels of Hadoop adoption maturity based upon his experiences with many SAS customers. His experiences chime with my far more limited observations. Mark lists the four levels as:
  1. Monitoring - enterprises that don't yet see a use for Hadoop within their organisation, or are focused on other priorities
  2. Investigating - those at this level have no clear, focused use for Hadoop but they are open to the idea that it could bring value and hence they are experimenting to see where and how it can deliver benefit(s)
  3. Implementing - the first one or two Hadoop projects are the riskiest because there's little or no in-house experience, and maybe even some negative political undercurrents too. As Mark notes, the exit from Investigating into Implementing often marks the point where enterprises choose to move from the Apache distribution to a commercial distribution, such as Hortonworks, Cloudera or MapR, that offers more industrial-strength capabilities
  4. Established - at this level, Hadoop has become a strategic architectural tool for organisations and, given the relative immaturity of Hadoop, the organisations are working with their vendors to influence development towards full production-strength capabilities
Hadoop is (or will be) a journey for all of us. Many organisations are just starting to kick the tyres. Of those who are using Hadoop, most are in the early stages of this process at level 2, with a few front-runners living at level 3. Those organisations at level 3 are typically big enough to face, and invest in solutions to, the challenges that the vendors haven't yet stepped up to, such as managing provenance, data discovery and fine-grained security.

Does anybody live the dream fully yet? Arguably, yes: the internal infrastructures developed at Google and Facebook certainly provide their developers with the advantages and agility of the data lake dream. For most of us, we must be content to continue our journey...

Wednesday, 21 November 2012

NOTE: Now I see Visual Analytics

I'll confess that whilst there was a lot said about SAS Visual Analytics at this year's SAS Global Forum, I came home with some confusion over its architecture, functionality and benefits. I was fortunate to spend some quality time with the software recently and I think I've now got a good handle on it. And it's impressive.

It's comparatively early days in its life cycle; it provides value for a significant set of customers, but it will benefit an ever larger population as it evolves and gets enhanced over time.

The key benefits as I see them are i) its handling of "big data", ii) its user-friendly yet highly functional user interface, and iii) its ability to design a report once yet deliver it through a variety of channels (including desktop, web and mobile).

The big data element is delivered through in-memory techniques that are incorporated in the SAS LASR Analytic Server. In essence, this means that you need to reserve a number of servers (on commodity "blade" hardware or on database appliances from EMC Greenplum and Teradata) for the purpose of providing the in-memory capabilities. Once the data is loaded onto the LASR server and copied into memory, users can explore all data, execute analytic correlations on billions of rows of data in just minutes or seconds, and visually present results. This helps quickly identify patterns, trends and relationships in data that were not evident before. There's no need to analyse sub-sets of your data and hope that they are representative of the full set of data.
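To give a feel for how data gets into memory in the first place, here's a minimal sketch of loading a table into a running LASR server from a SAS session; the host and port are hypothetical, and in practice administrators will often use Visual Analytics' own data loading interfaces instead:

    /* Minimal sketch: load a table into a running LASR Analytic Server
       so that Visual Analytics can explore it in memory. The host and
       port are hypothetical. */
    proc lasr add data=sashelp.cars port=10010;
       performance host="lasr-head.example.com";
    run;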

The user-friendly interface is largely drag-and-drop, in a similar style to the design of Excel pivot tables. There is a wide range of output styles such as tables, graphs and charts, and these can be laid out into a report and linked together for synchronised filtering, drilling, slicing and dicing. The current release incorporates regression analysis and correlations. I anticipate that future releases will soon offer more functionality such as forecasting.

The reports that you design in Visual Analytics are simultaneously available through a number of channels, including the web and mobile on iPad and Android. This means that your dashboards and reports are available to anybody, anywhere (combined with SAS security measures that make sure nobody sees any information that they are not meant to).

All-in-all, SAS Visual Analytics is another step in taking away the friction caused by technology limitations and allowing analysts to execute their analytical processes more effectively and efficiently. Less programming, more analysis, better results.