Creating a SAS data set from a raw data file in fixed field format
Find below a sample raw data file. It’s a simple txt file with four columns. Data in each column starts from the same point, i.e. this is a fixed format file.
The main steps needed to convert a raw file to a SAS data file are as follows:
- Reference a SAS library
- Reference the raw data file
- Name the new SAS data set being created
- Identify the external file
- Describe the columns/data
- Execute/run the program
First thing first, upload the raw data file (in this case our emp.txt file) to SAS. Depending upon the SAS version being used, the file would be uploaded to the folder or the library.
Reference a SAS library
Reference the permanent SAS library where the final SAS dataset is to be saved using the libname statement.
libname libref ‘path’;
In this example, we are making a new SAS dataset named emp which will be stored in the permanent SAS library company. So, the libname will be:
libname company ‘/home/path/company’;
If the SAS dataset is being saved in the temporary WORK file, then this step would be skipped.
Reference the raw data file
Now we have to reference the raw data file that we uploaded in SAS by using the filename statement.
filename fileref ‘path’;
In our example we have to reference emp.txt file so that it can be read by the input statement to create a SAS dataset. The filename statement will be:
filename emp ‘/home/path/emp.txt’;
This step can be skipped as well and the file path can be given directly in the fourth step along with the infile statement.
Name the new SAS dataset being created
Next comes the data step to name the new SAS dataset being created. In our example we are creating a permanent SAS dataset named finalemp in the company library by using the data statement.
data company.finalemp;
This statement creates a SAS dataset named finalemp.
Identify the external file
The raw file from which the data is to be read is specified next using the infile statement. There are two ways this statement can be used. If the filename statement is used in the previous step, then infile just specifies the name of the raw file:
infile raw_file_name;
Which would be:
infile emp;
Raw file can also be directly identified here by giving the actual path to the file and the filename statement earlier can be skipped. The syntax would be:
infile ’file_path’;
Practically, when the raw file contains a lot of observations, it is feasible to first check if the program we wrote is bug free. Instead of creating a dataset with all the observations in it, we can just copy the first few observations, and see if they are copied in our SAS dataset without any error. This can be done by using obs option in the infile statement.
infile emp obs=2;
This statement will only copy the first two observations in the finalemp dataset. This gives us a chance to check if there is any error in the log before copying the entire data, thus saving time for huge datasets.
Read the data from the raw file
Reading the data from the raw file can be done in many ways. Below is the column input method. The emp.txt file is in fixed format, which means that the value for each variable starts from the same position. In the column input method, the variables are named and the starting and ending position of each variable is specified. Important point to note is that column input method only reads standard values. For any non-standard values (values containing special characters e.g. commas), format input is used.
Standard Numeric Input: Includes numbers with decimal points, scientific or E notation and plus or minus sign.
Non-Standard Numeric Input: Includes any values with special characters.
The input statement contains two important parts: defining the names of the variables in the order you want them in the SAS dataset and define their lengths. In our example, the input statement will be:
input FirstName $ 1-8 LastName $ 10-17 Department $ 20-28 Salary 30-35;
Where FirstName is the first variable of our SAS dataset and the first column of our raw file, LastName is the second variable, Department the third and Salary the last.
The code above would generate a permanent SAS dataset named finalemp.
Some points to note:
- For numeric variables, like Salary, give the name and the variable position.
- For character variables, like FirstName, give ‘$’ after the name of the variable. This defines the variable as a character variable.
- Input FirstName $ 1-8;
- Column input method works only with standard values. For non-standard values the data has to be formatted. (Example later in this article).
- For a single digit variable, only the starting value is given. For example, if Sex is another variable with values M or F, it will be given as follows along with the rest of the variables:
- Input Sex $ 10;
- Final dataset stores the variables in the order in which they are written in the input statement.
- Input LastName $ 10-17 FirstName $ 1-8 Department $ 20-28 Salary 30-35;
- The LastName would be the first variable in the dataset and FirstName would be the second variable and so on.
Execute the program
The last and very important step of every data and proc statement is the run statement.
run;
Final program to create the SAS database
Our final program will be:
libname company ‘/home/path/company’;
filename emp ‘/home/path/emp.txt’;
data company.finalemp;
infile emp;
input FirstName $ 1-8 LastName $ 10-17 Department $ 20-28 Salary 30-35;
run;
This executes our data step and creates our final SAS dataset that looks like this:
Using Format
Suppose the raw file contains non-standard data. In this case, we have to format the data before it can be read. Look at the raw file below:
This file contains a salary column with commas and an additional column for sex. The code to read this raw file to a SAS dataset will be:
libname company ‘/home/path/company’;
filename newemp ‘/home/path/company/newemp.txt’;
data company.finalnewemp;
infile newemp;
input FirstName $ 1-8 LastName $ 10-17 Department $ 20-28 @30 Salary comma7. Sex $ 38;
Format Salary dollar9.;
Run;
In this code we format the Salary variable by using the @ symbol. Since the Salary variable in the raw file contains a comma and starts from position number 30 and spans to 7 spaces(including the comma), the code to read it is @30 Salary comma7.. The delimiter being important. In the next step we format it to include the dollar sign and store the value in dollar, comma format e.g. $90,000.
The last variable in the dataset is Sex variable, which is a single digit variable that is at position number 38. Hence, the code to read it is Sex $ 38. The $ sign being there because Sex is a character variable.
The new SAS dataset is ready.
You must be logged in to post a comment.
+ There are no comments
Add yours