Wednesday 8 May 2013

NOTE: High-Availability Metadata #sasgf13

One of the most notable features of v9.4 wasn't mentioned in the SAS Global Forum Technology Connection but I caught a paper by Bryan Wolfe on the subject. SAS v9.4 will remove SAS's most notable "single point of failure" - the metadata server. SAS architects and administrators will optionally be able to specify and create a cluster of metadata servers (with real-time shared data) to mitigate metadata server failure.

For those with SAS systems providing high value operational services, this enhancement could be a key deciding factor in choosing to upgrade to v9.4. Sites with less demanding applications can choose to retain a single metadata server.

Whilst SAS has hitherto offered a large degree of resilience against failure of most processes and servers (particularly with the use of Grid and EGO), the metadata server has always been a weak link. V9.4 resolves this shortcoming by introducing the ability to cluster a group of metadata servers, all of which run 24x7, communicate with each other, and can take over the work of a failed metadata server.

The coordinated cluster of metadata servers appears as a normal metadata server to SAS users. Hence, no code changes will be required if your site implements this technology. The chosen approach is intrinsically scalable.

The cluster requires three or more nodes; each is a full metadata server. One is nominally the master; the others are slaves. The system decides which node is the master at any point in time. Each metadata server must have access to a shared backup disk area.

Client connections go to slaves. Load balancing causes redirects when required. The load balancing means that read performance is the same or better when compared with v9.3 performance. To keep all metadata server instances synchronised, slaves pass write requests to the master, and the master then passes those requests asynchronously to all other slaves so that they can update their own copy of the metadata storage (in-memory and on disk).
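To make the write path above a little more concrete, here's a toy sketch in Python. It is not SAS code and says nothing about how SAS implements the feature: the `Node` class, `build_cluster` and the dictionary standing in for the metadata store are all names I've made up for illustration. The one deliberate simplification is that the master here updates the other slaves synchronously, whereas the real cluster does so asynchronously.

```python
class Node:
    """Toy stand-in for one metadata server (hypothetical, not a SAS API)."""

    def __init__(self, name):
        self.name = name
        self.store = {}     # this node's own copy of the metadata
        self.master = None  # None means "this node is the master"
        self.peers = []     # populated on the master only

    def read(self, key):
        # Reads are answered from the local copy, on any node.
        return self.store.get(key)

    def write(self, key, value):
        if self.master is not None:
            # Slave: pass the write request to the master.
            self.master.write(key, value)
            return
        # Master: apply locally, then pass the update to every slave.
        # (Synchronous here for simplicity; the post describes this
        # step as asynchronous.)
        self.store[key] = value
        for peer in self.peers:
            peer.apply(key, value)

    def apply(self, key, value):
        # Slave-side update of its own copy of the metadata.
        self.store[key] = value


def build_cluster(names):
    """Create nodes and nominate the first as master (arbitrary choice)."""
    if len(names) < 3:
        raise ValueError("a cluster needs three or more nodes")
    nodes = [Node(name) for name in names]
    master, slaves = nodes[0], nodes[1:]
    master.peers = slaves
    for slave in slaves:
        slave.master = master
    return nodes
```

A write sent to any slave ends up in every node's copy, which is why a subsequent read can be served by whichever node the load balancer picks.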

SAS clients (such as Enterprise Guide and Data Integration Studio) keep a list of all nodes. Each client is responsible for reconnection. This is transparent to users. Hence, in the event of a slave failure, the client will automatically establish communication with an alternate server. If the master fails, the remaining slaves need to negotiate with each other to "elect" a new master. As a result, there can be a more noticeable delay, although it's unlikely to exceed 10 seconds.
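The client-side behaviour described above (keep a list of nodes, move on to the next one when a connection fails) can be sketched in a few lines of Python. This is purely illustrative: `connect_with_failover` and the `try_connect` callback are hypothetical names of my own, not anything from the SAS clients, and I've assumed that a failed connection shows up as a `ConnectionError`.

```python
def connect_with_failover(nodes, try_connect):
    """Return (node, connection) for the first node that accepts a connection.

    nodes       -- the client's list of metadata server names
    try_connect -- hypothetical callable that returns a connection or
                   raises ConnectionError if the node is unreachable
    """
    errors = {}
    for node in nodes:
        try:
            return node, try_connect(node)
        except ConnectionError as exc:
            # Remember the failure and try the next node in the list.
            errors[node] = exc
    raise ConnectionError(f"no metadata server reachable; tried {list(errors)}")
```

The user never sees the individual failures; an error only surfaces if every node in the list is unreachable.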

The new functionality will be supported in v9.4 on all SAS platforms except IBM z/OS. All metadata servers must be on the same OS. The cluster license is included in SAS Integration Technologies. Unlike some of SAS's other high availability and failover solutions, no additional third-party software is required.
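For anyone planning a cluster, the constraints mentioned in this post (three or more nodes, a single OS across the cluster, no z/OS) can be collected into a small sanity check. The Python below is only a sketch: `NodeSpec` and `check_cluster` are names I've invented, and the check covers nothing beyond the three rules stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical description of one planned metadata server."""
    host: str
    os: str


def check_cluster(nodes):
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    if len(nodes) < 3:
        problems.append("need three or more nodes")
    oses = {node.os.lower() for node in nodes}
    if len(oses) > 1:
        problems.append("all nodes must run the same OS")
    if "z/os" in oses:
        problems.append("clustering is not supported on z/OS")
    return problems
```

For example, a plan of two hosts on different operating systems would come back with two problems, whereas three Linux hosts would come back with none.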

All in all, this is a very significant enhancement for those who rely on their SAS systems to reliably deliver information, knowledge and decisions.