Thursday 23 July 2009

NOTE: Integrity Constraints

Issue 1 of NOTE: featured an article on integrity constraints...

Integrity constraints were introduced with version 7 of SAS® software. In essence, they are rules that define the valid forms of your data. The rules can be defined in a number of ways and will be respected by SAS software whenever there is an attempt to update the data set.
You can create five different types of integrity constraint rules:
  • A column can contain only non-null values.
  • A column can contain only values that fall within a specific set, range, or list of values, or that duplicate a value in another column in the same row.
  • A column can contain only values that are unique.
  • A column that is a primary key can contain only values that are unique and that are not missing values.
  • A column that is a foreign key (the child) can contain only values that are present in the associated primary key (the parent) or null values. A column that is a primary key can contain only values that cannot be deleted or changed unless the same deletions or changes have been made in values of the associated foreign key. Values of a foreign key can be set to null, but values cannot be added unless they also exist in the associated primary key.
These integrity constraints are similar to those that have been available in industrial strength databases for many years. They are a useful means of ensuring the integrity of your data (including relational data) without repetitive coding in Base SAS programs (or SCL programs).

You can use PROC DATASETS, PROC SQL, or SCL functions to create, update, and remove integrity constraints. To list them, you can use PROC CONTENTS. Here is a simplistic example using PROC DATASETS:

data master;
attrib key length=$8
data length=8;
key='a0000001'; data=2; output;
key='a0000002'; data=4; output;
run;

proc datasets lib=work nolist;
modify master;
ic create ic1=primary key(key);
ic create ic2=check(where=(data between 2 and 8));
quit;

data updt;
key='a0000001'; data=999; output;
key='a0000002'; data=8; output;
run;

data master;
modify master updt;
by key;
run;

The log resulting from the final DATA step will be as follows:

data master;
modify master updt;
by key;
run;

key=a0000001 data=999 FIRST.key=1 LAST.key=1 _ERROR_=1 _IORC_=660130 _N_=1
NOTE: There were 1 observations read from the data set WORK.MASTER.
NOTE: The data set WORK.MASTER has been updated. There were 1 observations rewritten, 0 observations added and 0 observations deleted.
NOTE: There were 1 rejected updates, 0 rejected adds, and 0 rejected deletes.
NOTE: There were 2 observations read from the data set WORK.UPDT.

Why is this useful?
It is important to maintain the integrity of your data. Hitherto this has required repetitive coding of integrity rules in your suite of programs. Now, the integrity rules can be coded in one place. This makes them easy to find and easy to maintain.