What SAS does at the backend when creating a new SAS dataset from a raw data file


A typical SAS program to create a SAS dataset:

data try;

infile 'path_to the_rawfile.txt';

input var1 1-5 var2 $ 7-12 var3 14-16 var4 $ 17-22;

run;

A new SAS dataset is created from raw data with a DATA step. At the backend, SAS processes the DATA step in two phases:

  • Compilation Phase: Descriptor portion
  • Execution Phase: Data portion

The Compilation Phase begins when the DATA step is submitted. SAS reads the statements from the data statement down to the step boundary (usually the run statement) and checks them for syntax errors, which typically include, but are not limited to, misspelled keywords, missing semicolons and other incorrect punctuation. If SAS finds syntax errors, the step is not executed.

When the step reads raw data with infile and input statements, SAS creates an Input Buffer, which is a logical memory area that holds one raw record (one line of the raw data file) at a time. SAS also creates a Program Data Vector (PDV), which is a logical memory area where SAS builds the dataset one observation at a time. The PDV has a slot for every variable named in the step, much like the columns of a SAS dataset. No values have been read at this point, so the slots hold no data yet. Along with a slot for each variable, two automatic variables are created, called _N_ and _ERROR_. These exist only in the PDV during processing and are not written to the SAS dataset.

_N_ counts the iterations of the DATA step. Its initial value is 1, and it increases by 1 every time the DATA step starts a new iteration. When each iteration reads one record from the raw data file, _N_ matches the number of the record being read.

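_ERROR_ is an error flag. Its default value is 0. When an error is encountered during an iteration, for example invalid data for a numeric variable, _ERROR_ is set to 1. It is a flag rather than a counter: it does not keep increasing, and it is reset to 0 at the start of the next iteration.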
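A minimal sketch for watching both automatic variables in the log. The three data lines are made up for illustration and are read in-stream with datalines instead of from the raw file; the second line deliberately has letters in the columns where var3 expects a number.

```sas
data try;
    input var1 1-5 var2 $ 7-12 var3 14-16 var4 $ 17-22;
    /* write the automatic variables to the log on every iteration */
    put _N_= _ERROR_=;
    datalines;
12345 abcdef 123uvwxyz
67890 ghijkl abcmnopqr
24680 stuvwx 456abcdef
;
run;
```

The log should show _N_ going up by one on each iteration, and _ERROR_ switching to 1 only for the second line (together with an invalid data note) before dropping back to 0 for the third.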

By the end of the compilation phase, SAS has created the descriptor portion of the dataset: the name of the SAS dataset and all the variable names with their attributes, such as type and length.
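One way to look at the descriptor portion is proc contents. The sketch below builds a small version of try from two made-up in-stream lines (an assumption for illustration, not the original raw file) and then prints its descriptor information.

```sas
data try;
    input var1 1-5 var2 $ 7-12 var3 14-16 var4 $ 17-22;
    datalines;
12345 abcdef 123uvwxyz
24680 stuvwx 456abcdef
;
run;

/* list the dataset name, variable names, types and lengths */
proc contents data=try;
run;
```

The variable list in the output should contain var1 to var4 only; _N_ and _ERROR_ do not appear because they are never written to the dataset.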

The Execution Phase starts once the step has compiled without syntax errors. SAS now reads the raw data and writes observations to the SAS dataset. In each iteration, one record is read from the raw data file into the input buffer, the input statement copies the values from the input buffer into the PDV, and at the end of the iteration the contents of the PDV are written to the SAS dataset as one observation.

At the beginning of the execution phase, all the variables in the PDV are set to missing: '.' (period) for numeric variables and a blank for character variables. _N_ has the value 1 and _ERROR_ has the value 0. These are the values before the first record is read into the input buffer. The first iteration is over once the values in the PDV have been written to the SAS dataset as the first observation. SAS then returns to the top of the DATA step, the variables are set to missing again and _ERROR_ is reset to 0, while _N_ becomes 2 and keeps increasing with each iteration. The step stops when there are no more records to read from the raw data file.
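The reset at the top of each iteration can be seen with a small sketch. The variable flag is a hypothetical name added only for this illustration, and the two data lines are made up. flag is assigned only in the first iteration, so any value it shows later would have to be carried over from the previous iteration.

```sas
data try;
    input var1 1-5 var2 $ 7-12 var3 14-16 var4 $ 17-22;
    /* flag is assigned in the first iteration only */
    if _N_ = 1 then flag = 1;
    /* show what the PDV holds for flag in each iteration */
    put _N_= flag=;
    datalines;
12345 abcdef 123uvwxyz
24680 stuvwx 456abcdef
;
run;
```

In the log, flag should be 1 for the first iteration and missing (.) for the second, because it was set back to missing when the second iteration started.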
