Wednesday, 4 December 2013

A favourite interview question of mine is: compare and contrast SAS 9's stored process server and workspace server. This question is very good at revealing whether candidates actually understand some of what's going on behind the scenes of SAS 9. I mentioned this back in 2010, together with some notes on my expectations for an answer.
I was amused to see Michelle Homes post another of my favourite interview questions on the BI Notes blog recently: What’s the difference between SAS Enterprise Guide and SAS DI Studio? This question, and the ensuing conversation, establishes whether the candidate has used either or both of the tools, and it reveals how much the candidate is thinking about their environment and the tools within.
For me, there are two key differences: metadata, and primary use.
Michelle focuses on the former and gives a very good run-down of the use of metadata in Data Integration Studio (and the comparatively little use made of it in Enterprise Guide).
With regard to primary use, take a look at the visual nodes available in the two tools. The nodes in DI Studio are focused upon data extraction, transformation and loading (as you would expect), whilst the nodes in Enterprise Guide (EG) are focused upon analysing data. Sure, EG has nodes for sorting, transposing and other data-related activities (including SQL queries), but its data manipulation nodes are not as extensive as DI Studio's. In addition to sorting and transposing, DI Studio offers nodes that understand data models, e.g. an SCD loader and a surrogate key generator (I described slowly changing dimensions (SCDs) and other elements of star schema data models in a post in 2009). On the other hand, EG has lots of nodes for tabulating, graphing, charting, analysing, and modelling your data.
One final distinction I'd draw is that EG's nodes are each based around a single SAS procedure, whilst DI Studio's nodes are based around an ETL technique or requirement. You can see that DI Studio was produced for a specific purpose, whilst EG was produced as a user-friendly layer on top of the SAS language, and thereby offers a more general-purpose solution.
For the most part, I'm stating the obvious above, but the interview candidate's answer to the question provides a great deal of insight into their approach to their work, their sense of curiosity and awareness, and their technical insight.
Wednesday, 6 April 2011
NOTE: DI Studio Version 4.3
Version 4.3 of SAS Data Integration Studio (DI Studio) is to be released alongside SAS 9.3 later this year. I stopped by the Data Integration booth at SAS Global Forum to get some detail from those that should know. There are a raft of enhancements that are worthy of note, but the big one that caught my eye was job versioning and rollback. Here's what I found (and saw)...
- Increased ability to perform ELT (Extract, Load and Transform) in addition to traditional ETL - doing your transformations down in the database and thus minimising the distance your data has to travel (see the sketch after this list)
- The code import wizard in 4.2 already understands macros to the extent that it can create and display a process node representing the macro; DI Studio 4.3 will give you the option of expanding the macro so that you see its component tasks in your DI process flow. This gives you a clearer picture of what your job is doing and allows job performance to be collected at a more granular level
- Talking of job performance, DI Studio 4.3 provides more breadth and depth with regard to the performance profile of your jobs and their steps. You have access to predefined reports that can (optionally) be created in Web Report Studio
- A Type 1 SCD loader transform has been added for those who don't need the complexity of the Type 2 loader. Not sure what Type 1 and Type 2 are? See my earlier article on the subject
- Job deployment can now be done from an OS command line using a newly-supplied shell script. This script supports deployment of one or more jobs at a time, and it can be scheduled. For those of you who have invested in automating the promotion and deployment of your jobs using the programmatic interfaces to export/import that DI Studio 4.2 brought, you can now consider completing the task by automating the deployment of your promoted jobs. Apparently, job deployment hasn't been made available through the programmable API because the task of deployment requires a great deal of Java activity that SAS weren't comfortable with delivering through the APIs. Disappointing, but there you go
- And finally, for those who take configuration management seriously, we have versioning of jobs (and packages). Sadly this isn't a top-down delivery of a release management capability, but it's a step in the right direction. SAS have provided plug-ins for CVS and Subversion, and they say they'll publish the API so that you can write plug-ins for your own preferred source code management system, such as IBM Rational ClearCase. These plug-ins facilitate the process of pushing exported packages (with one or more jobs and dependent objects) into your source code management system. From within the DI Studio interface you can see and inspect versions of your packages. You can even compare objects within packages, so you can answer common Release Manager questions like "what changes did you make to the job between version 2 and version 4?". Sadly, the comparison can only be done on one object at a time.
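To illustrate the first point in the list above: ELT pushes the work into the database rather than dragging rows into SAS. Here's a minimal sketch using explicit SQL pass-through; the Oracle connection details and the STG_SALES and DW_SALES tables are entirely hypothetical, and this is my own illustration rather than code generated by DI Studio.

proc sql;
   connect to oracle as db (user=&dbuser password=&dbpass path=dwpath);

   /* This statement is passed straight to the database and runs there;
      only a return code comes back to SAS */
   execute (
      insert into dw_sales (sale_date, store_id, total_amount)
      select trunc(sale_ts), store_id, sum(amount)
        from stg_sales
       group by trunc(sale_ts), store_id
   ) by db;

   disconnect from db;
quit;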
One caveat on the versioning plug-ins: they are not available from Management Console and thus cannot be applied to the other SAS metadata objects that you might expect to export/import, so there's no support for BI objects. That's disappointing, but SAS tell me they are aware of the shortcoming and hope to plug the gap in a future release.

So, all-in-all, DI Studio 4.3 has a lot to offer alongside the introduction of SAS 9.3. The versioning leaves a lot more work still to be done with regard to release management, support for a wider range of source code management systems, and detailed reporting. Nonetheless, it's a job well done by the DI Studio team.
[See all of my posts regarding SAS Global Forum]
Labels: Data Modelling, DI, SAS, SGF
Wednesday, 2 December 2009
NOTE: Migrating to DI Studio
Users of DI Studio version 4.21 (available as part of the second release of SAS V9.2) benefit from the addition of the SAS code import wizard. In essence, you point the wizard at your code and it will convert your code to a DI job, including metadata for source and target tables. In practice, it has limitations, as you might expect of the first release of a tool with such high ambitions, but it is an impressive new capability. You can read more in the Data Integration Studio 4.21: User's Guide.
Under the covers, the code import wizard uses PROC SCAPROC. Rick Langston gave me a demo of this new PROC at this year's SGF and I was very impressed. Whilst it is used by DI Studio, you can also use it yourself, most notably for highlighting parallel processing (and, optionally, grid) opportunities in your code. SCAPROC is designed to operate in the background whilst your code is running, so you need to temporarily add a call to SCAPROC at the beginning and the end of your code. The SGF paper by Rick and Eric Thies provides a good introduction. SCAPROC was introduced with the first release of SAS V9.2.
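If you'd like to try SCAPROC yourself, the pattern is simply to wrap the code you want analysed between a record step and a write step. A minimal sketch - the output file name and the DATA step in the middle are placeholders of my own:

proc scaproc;
   record 'scaproc_analysis.txt' attr opentimes; /* start recording; ATTR and OPENTIMES add detail */
run;

/* ...the code you want SCAPROC to analyse... */
data work.teens;
   set sashelp.class;
   where age >= 13;
run;

proc scaproc;
   write; /* stop recording and write the analysis to the file */
run;

The resulting text file lists the steps and the files and variables they use - the raw material for spotting steps that could run in parallel.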
Monday, 30 November 2009
NOTE: Star Schemas and Slowly Changing Dimensions in Data Warehouses
Most data warehouses include some kind of star schema in their data model. If the dimensional data in the warehouse is likely to change over time, i.e. you have a slowly changing dimension, then you should consider this whilst designing your data model. This article provides a simple introduction to star schemas and slowly changing dimensions (SCDs). These terms are important if you are to understand some of the more powerful transforms in DI Studio.
The star schema, at its simplest, consists of a fact table surrounded by a number of associated dimension tables; the schema takes its name from the star-like appearance of this arrangement when drawn.

The fact table contains the facts, i.e. the central data of the warehouse (typically numeric), along with keys to the various dimension tables. A simple example is a data model with one fact table and three dimension tables.
The keys in the dimension tables are called “Primary keys”; they uniquely identify each row. The same keys in the fact table are known as “Foreign keys”; they do not necessarily uniquely identify each row. Often the fact table will have an extra column called the “Compound Key”. This will contain a combination of the dimension keys, making it unique to each row of the fact table.
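To make the use of these keys concrete, here's how a query typically walks a star schema with one fact table and three dimensions, joining on the foreign keys. All table and column names below are hypothetical:

proc sql;
   create table work.sales_report as
   select d.calendar_month,
          p.product_name,
          c.region,
          sum(f.sale_amount) as total_sales
   from work.sales_fact f
        inner join work.date_dim d     on f.date_key     = d.date_key
        inner join work.product_dim p  on f.product_key  = p.product_key
        inner join work.customer_dim c on f.customer_key = c.customer_key
   group by d.calendar_month, p.product_name, c.region;
quit;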
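As for slowly changing dimensions: when, say, a customer moves region, a Type 1 dimension simply overwrites the old value, whilst a Type 2 dimension preserves history by closing off the current row and adding a new one with its own surrogate key and validity dates. A minimal Type 2 sketch, again with hypothetical table and column names (DI Studio's SCD loader transform generates rather more sophisticated code than this):

proc sql;
   /* expire the current version of customer 42 */
   update work.customer_dim
      set valid_to = today()
    where customer_id = 42
      and valid_to = '31DEC9999'd; /* the open-ended high date */

   /* add the new version with a fresh surrogate key */
   insert into work.customer_dim
      set customer_key = 1001, /* next surrogate key value - hypothetical */
          customer_id  = 42,
          region       = 'North',
          valid_from   = today(),
          valid_to     = '31DEC9999'd;
quit;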
Labels: Data Modelling, Data Warehouses, DI, SAS
Monday, 5 October 2009
NOTE: The DATA Step Debugger
While the IT world moves from hand-typed code to visually-created, drag-and-drop code, beneath the covers there usually lurks machine-generated code that we can all recognise. And, while we await the nirvana of a visual-coding tool that offers all the power and flexibility of hand-cranked code (don't hold your breath!) we still need to create what SAS DI Studio calls User-Written Code Nodes. In other words, as I said in my recent DI Studio in V9.2 post, DI Studio is a great tool and it becomes greater with every new release, but we're still some years away from that day when we don't need to write any code ourselves.
In the meantime, we need to continue writing code, testing code, and debugging code. A key tool for use when debugging DATA steps is the DATA step debugger (DSD). I mentioned it in my 2001 paper "Debugging Made Easier", in which I recommended that SAS programmers be on intimate terms with the DSD! At the time, my suggestion was to read Dave Riba's paper from SUGI 25 - "How to Use the Data Step Debugger".
This week I noticed a more recent paper by Russ Lavery - "An Animated Guide: The SAS® Data Step Debugger" presented at SGF 2007. Russ offers a very accessible, easily understandable and well-illustrated guide to using the DSD. Highly recommended.
DSD helps you understand the detail of what SAS does (and when) within the DATA step - you feel like you can see inside SAS's head. It's a real education!
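Invoking the debugger is as simple as adding the DEBUG option to the DATA statement and submitting the step from an interactive session. A trivial sketch:

data work.profits / debug; /* the DEBUG option launches the DATA step debugger */
   set sashelp.shoes;
   profit = sales - returns;
run;

Once the debugger windows open, pressing ENTER executes one statement at a time, BREAK sets breakpoints, and EXAMINE shows variable values; Russ's paper walks through these commands and more.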
DSD is almost exclusively an interactive tool, but it can be used in batch; Dave Riba tells you how in his aforementioned paper. However, I've never found a satisfactory use for the batch mode in practice.
I'm conscious that this post hasn't offered any new information; it has merely pointed you to a number of existing materials produced in past years. I'm happy to do that: I'm happy to avoid repeating good information that's held elsewhere (and avoid plagiarism and claiming the information as my own!), and I'm happy to provide a channel that makes SAS practitioners aware of this excellent material. I hope you agree that this and other similarly styled posts have value. Share your thoughts with a comment...
[Yes, I know DI Studio is actually called DI Server these days, but spare me a thought - I've only just got used to calling it DI Studio rather than Warehouse Administrator]
Wednesday, 30 September 2009
NOTE: Data Integration in V9.2
Many people are not convinced by SAS Data Integration Studio (DI Studio). To the seasoned SAS programmer, DI Studio is a glossy interface that cannot produce the range of functionality that their hand-written code can achieve, and the code that it does produce is in many cases less efficient. But look at the other side of the coin: creating a job with DI Studio requires knowledge of the data but little knowledge of SAS syntax, so the skills required are more readily available. And with each new version of SAS come more transforms and other job nodes, i.e. steps in the job.
So, DI Studio may not yet be a panacea that allows data modellers to build extract, transform and load (ETL) code without SAS programming skills, but it's moving closer. SAS's new approach of releasing products independently of each other perhaps means that DI Studio can be evolved and released to customers more quickly.
Think of this: if you go far enough back in history you will come across a time when machine code programmers scoffed at the possibility of producing compilers for third-generation languages like Fortran and Cobol. They argued that a compiler could never produce code that was as well optimised as their hand-crafted machine code. But increases in machine speed combined with better compilers meant that the inefficiencies in compiled code were reduced, and the impact of those that remained was diminished by the faster machines. Sound familiar? How long will it be before the majority of SAS jobs are produced with DI Studio, as SAS produce new and better transforms and machines get faster?
I was impressed by the V9.2 release of DI Studio that I saw at SAS Global Forum (SGF) earlier this year. Apart from basic interface enhancements and a number of new transforms, I noted:
- Ability to prevent DI Studio from "tidying" your layout. Thus you can position transforms and tables in places that make best sense to you
- You can put the same table in more than one place on your layout, e.g. where it's used as input and output at different points
- The addition of textual notes that can be placed on the layout in order to provide a form of documentation (or temporary development notes)
- A usable undo capability!
- Performance monitoring that shows real-time statistics while your job is running
Visual coding, such as is offered by SAS DI Studio, is the future. When will you get on board?
What are your thoughts? Post a comment!