Using RegEx in SAS Informats
SAS Formats and Informats have always been very convenient to use. They make it easy to convert, format or validate data.
Recently we were investigating some data validation issues. Some parts of the data validation was done using prxmatch
but that was not too easy to maintain (i.e. when the same check had to be performed in multiple locations or for multiple data sources).
That’s when we discovered that you can use a regex with a SAS Informat… Something that was reported as new back in 2012.
Here’s a small example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
PROC FORMAT;
INVALUE doors (default=50 max=200)
"/ 2dr/i" (REGEXP) = 2
"/ 4dr/i" (REGEXP) = 4
"s/(.*)([\d])dr?(.*?)$/\2/i" (REGEXPE) = _SAME_
"/V40\s*$/i" (REGEXP) = 5
OTHER = .;
RUN;
PROC SQL;
SELECT Make, Model, input(Model, doors.) as Doors
FROM sashelp.cars
WHERE calculated doors > 4;
QUIT;
In this small example, a new informat called doors
is created. It is used to parse the Model name of the cars defined
in the SASHELP.CARS
table to extract the number of doors.
It will first look for the value 2dr
for cars with 2 doors, 4dr
for cars with 4 doors.
Of course, this is not practical if you have many combinations.
The alternative is to use the substitution to extract the part of the string that matters.
The 3rd test is looking for a digit followed by the d
and optionally an r
, if it matches, it will keep the digit.
So this 3rd version would also cover the first two tests (it is just a more generic version of the same test).
Last is a sample for the exceptions… when there is no indication about the number of doors.
Once you have to informat you are free to use it anywhere… for example in a PROC SQL.
More Info ? There is some great content on the SAS Website…