I wrote yesterday about the democratisation of analytics and data. I believe getting information into the hands of business users is a "good thing": accurate, timely information to influence the operational decisions they make each day. However, getting that information "at any cost" is not a "good thing".
Quite often our users get their information from "data shadow" systems - groups of spreadsheets and local, customised databases. While these systems appear to provide exactly the information that business users are asking for, they are outside the purview of the IT group, and they often spawn data silos with the usual problems of inconsistency and quality.
Business groups build data shadow systems to answer the business questions that the enterprise applications, data warehouse or reports fail to answer for them. They're filling a gap in the services they receive from their IT departments. Users may not want to get the information this way, but they don't see any alternative. Worse still, as data shadow systems evolve over time, they encompass more and more information, and increasing numbers of business users come to depend on them.
Data shadow systems give business groups what they want, but most business users do not want to spend so much time creating these systems. Nor should they. They should be spending their time gaining a better understanding of their business, not wrestling with technology.
Because dealing with technology is not what business users do best, they cobble data shadow systems together without an overarching design. Each addition gets more difficult to implement and more costly to maintain. And when data management principles and disciplines aren't followed, data consistency and integrity suffer. Data shadow systems often fulfil the business need, but they do so in a very costly manner that consumes too many resources and sacrifices data quality.
My experience suggests that every company, large or small, has at least one data shadow system. Keeping the needs of business users in mind, it is possible to replace or rework these shadow systems with solutions that dovetail with a company's overall data warehousing architecture. Replacement doesn't need to be a huge effort, either. The best data warehouse projects deliver in a strictly incremental fashion, piece-by-piece.
Rebuilding data shadow systems is the right thing to do to ensure consistent, quality information for running a business.
Whilst many data shadow systems are built with Microsoft software, plenty are built with SAS software too. Is your own system a data shadow? If so, you've already got the best set of tools in your hands for rectifying the situation. Get to it!
Thursday, 15 April 2010
NOTE: Go Green and Save Data Miles
Whilst attendees of SAS Global Forum (SGF) in Seattle, USA have been hearing about further SAS developments in in-database processing, I've been spending time back home in Blighty introducing one of my clients' junior staff to the benefits of SQL pass-through. If your data is held outside of SAS (in Oracle or Teradata, for example), you need to understand pass-through.
It's all about doing things in the best place. Typically, the channel that passes data from non-SAS data storage to SAS software is relatively slow when compared with the non-SAS data storage's ability to process data. Expressed more simply, you're better off asking Oracle to apply a where-clause and send the results to SAS than asking Oracle to send every record to SAS and letting SAS apply the filter. SAS may be able to filter data quickly, but you're better off not sending all that data down the pipe in the first place.
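To make that concrete, here's a minimal sketch; the Oracle library, credentials and table names are invented for illustration. With the where statement, the SAS/ACCESS engine can hand the filter to Oracle so that only matching rows travel down the pipe; with the subsetting if, every row is dragged into SAS first.

libname ora oracle user=scott password=XXXXXXXX path=orapath;

/* Wasteful: every row of ORDERS crosses the pipe before the subsetting IF is applied */
data work.uk_orders_slow;
  set ora.orders;
  if country = 'UK';
run;

/* Better: the where-clause can be passed to Oracle, so only UK rows are sent to SAS */
data work.uk_orders;
  set ora.orders;
  where country = 'UK';
run;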
Pass-through is a SAS technology that passes SQL instructions to the data storage server for execution. There are two types of pass-through: explicit and implicit, and implicit is the cleverer of the two. Explicit pass-through is where you write statements that specifically (explicitly) instruct SAS to pass SQL to the data server; with implicit pass-through, you rely on SAS to look at standard SAS code, deduce that some of it can be translated into SQL, and pass that SQL through to the data server. Steve Feder's NESUG 2007 paper succinctly describes the alternative techniques and their respective merits.
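By way of a hedged sketch (re-using the invented ora libref from above; connection details and table names are illustrative only), the two flavours look like this:

/* Explicit pass-through: the query in parentheses is sent verbatim to Oracle,
   and only its result set comes back to SAS */
proc sql;
  connect to oracle (user=scott password=XXXXXXXX path=orapath);
  create table work.cust_totals as
  select * from connection to oracle
    ( select customer_id, sum(sale_amount) as total_sales
      from sales
      group by customer_id );
  disconnect from oracle;
quit;

/* Implicit pass-through: ordinary PROC SQL against the SAS/ACCESS libref;
   SAS deduces that the where-clause can be translated into Oracle SQL and
   executed on the server, so only the filtered rows are returned */
proc sql;
  create table work.recent_sales as
  select customer_id, sale_amount
  from ora.sales
  where sale_date >= '01JAN2010'd;
quit;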
In-database execution of SAS merely takes pass-through a step further. Pass-through allows execution of SQL within the database; in-database execution of SAS allows far more work to be done within the database server, thereby further reducing the amount of data sent down the long, slow pipe from database to SAS software.
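As a rough illustration of what this looks like from the SAS side (the Teradata connection details are invented, and exactly which procedures can run in-database depends on your SAS release and SAS/ACCESS engine):

/* Ask eligible Base SAS procedures to generate SQL and run inside the database */
options sqlgeneration=dbms;

libname tera teradata user=dbuser password=XXXXXXXX server=terasrv;

/* With in-database processing enabled, the counting happens on the Teradata
   server and only the summary rows are returned to SAS */
proc freq data=tera.transactions;
  tables channel;
run;

/* A handy companion: sastrace writes the generated SQL to the SAS log, so you
   can confirm what was actually passed to the database */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;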
Beginning with Teradata in 2008, SAS began partnering with database vendors to develop and deliver in-database execution of SAS. At this year's SGF, SAS announced the engagement of BlueCross BlueShield of Tennessee in a pilot programme embedding SAS Analytics within the company's IBM DB2 enterprise data warehouse. SAS currently has plans to provide in-database execution of SAS software within Teradata, Netezza, IBM DB2, HP Neoview, Aster Data and Greenplum. These are exciting initiatives that will take the speed of SAS analytics to a new level.
Labels: Data Warehouses, SAS
Monday, 30 November 2009
NOTE: Star Schemas and Slowly Changing Dimensions in Data Warehouses
Most data warehouses include some kind of star schema in their data model. If the dimensional data in the warehouse is likely to change over time, i.e. you have a slowly changing dimension, then you should consider this whilst designing your data model. This article provides a simple introduction to star schemas and slowly changing dimensions (SCDs). These terms are important if you are to understand some of the more powerful transforms in DI Studio.
The star schema, at its simplest, consists of a fact table surrounded by a number of associated dimension tables; the visual representation of this arrangement gives the schema its name.
The fact table contains the facts, i.e. the central data of the warehouse (typically numeric), along with keys to the various dimension tables. A simple data model with one fact table and three dimension tables is shown below.
The keys in the dimension tables are called primary keys; each uniquely identifies a row of its table. The same keys on the fact table are known as foreign keys; they do not necessarily uniquely identify each row there. Often the fact table will also carry an extra column, a compound key, containing the combination of the dimension keys and therefore unique to each row of the fact table.
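To show how those keys hang together, here is a hedged sketch of a query over such a model; the library, table and column names are invented for the example. Each foreign key on the fact table joins to the primary key of one dimension, and the measures are summarised by attributes drawn from the dimensions:

proc sql;
  create table work.sales_summary as
  select d.calendar_month,
         p.product_group,
         c.region,
         sum(f.sale_amount) as total_sales
  from star.fact_sales as f
       inner join star.dim_date     as d on f.date_key     = d.date_key
       inner join star.dim_product  as p on f.product_key  = p.product_key
       inner join star.dim_customer as c on f.customer_key = c.customer_key
  group by d.calendar_month, p.product_group, c.region;
quit;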
Labels: Data Modelling, Data Warehouses, DI, SAS