SAS Tutorial for beginners
|
Introduction
The SAS
system began as a software system for Data Analysis & statistical
work. Since then, SAS has evolved and
made its presence in diverse fields.
Today, SAS Systems analysis tools range from simple statistics to
specialized analysis for econometrics & forecasting, statistical design,
computer performance evaluation & Operation Research. SAS finds its highest application in the
field of Data Warehousing & Data Mining.
1. Origin of
SAS
SAS originally stood for
“statistical analysis system” and many of
the characteristics of SAS can be traced back to its statistical
background.
In statistical experiments, a
measuring process can be repeated at many different times. Each instance of measurement is called an observation and different qualities
that are measured are called variables. That’s the source of these two SAS terms and
the form of a SAS dataset.
Ideally, these observations are
independent, i.e. different observations do not depend on each other. The data from each observation can be
processed independently, without reference to data from other observations, and
the order in which observations are processed do not affect the
conclusion. This makes possible the
concept of observation loop for a
computer program involving a repeated process of reading one observation at a
time into memory and extracting the information needed from it. This observation loop is a central part of
the design of the SAS system.
2. Why SAS?
SAS System is an integrated
system of software products.
·
Its power,
flexibility & ease of use enables you to gain strategic control of all
your data processing needs. SAS System
has a collection of ready-to-use
programs called procedures. Combined
with other features of SAS System, it makes it possible to have a variety of
applications – from general-purpose data processing to specialized analysis in
many application areas.
·
It facilitates applications that run on more
than one computing environment. SAS
applications work the same, look the same and produce the same results
irrespective of your hardware or OS. This
is possible because SAS System has a layered structure called Multi Vendor Architecture (MVA) This
consists of a host specific component which is specifically written for each
environment and the portable component which brings it a universal ‘feel’. You can develop SAS applications on one
environment and run them in other environments without any changes.
·
It can accommodate skill level of potential
users. SAS provides flexible user
interface in the form of menu-driven or task-oriented interfaces. New users can practically develop
applications without learning the syntax of the SAS language through these
interfaces.
·
It provides an exhaustive inventory of application development tools.
3.
Characteristics of SAS
·
The SAS System has a modular design. It involves a large collection of several
programs that are coordinated by a central program called the supervisor.
·
SAS is an interpreted language but has some
characteristics of a compiler. Most SAS statements are grouped into segments
called steps, rather than being
interpreted and executed at one time, before execution.
·
SAS is called a step-structured language because it only allows one step to run at
a time, one after the other.
·
SAS has been called a very high-level language
because much of its syntax is even more abstract than most high-level
languages. The program code correlate
as close as possible to the ideas of the programmer and the result he/she seeks
to achieve.
·
SAS has it’s own storage format and SAS language
provides high-level access to files in this format. Data files that SAS system accesses this way
are called SAS Datasets. The simplified
access to SAS datasets in SAS syntax eliminates most of the work of programming
input & output in SAS programs. At
the same time SAS also provides high-level access to files of other formats
through specialized routines called format
& informat. The input and output capabilities of SAS are
still among the most powerful & flexible of any programming language.
4. Data
warehousing
“The goal
of data warehousing is to free the information that is locked up in the operational
data bases and to mix it with information from other, often external, sources
of data”
Operational systems are systems that help in running the enterprise
operations. They are the backbone of the enterprise running daily transactions
such as “inventory”, “payroll”, “accounting” and other such transactional
systems. Such systems are indispensable to an organization, as an enterprise
cannot operate without these systems. These systems are tuned for high
performance and quick response time and often need to extremely stable and
robust.
Informational systems perform the crucial functions of enabling the
planning, forecasting and other strategy related management functions. In a dynamic business environment, the
enterprise has to be geared for the future in order to sustain itself and grow
in a healthy manner. The Informational systems are knowledge-based (where as
the operational systems are data based) and they deal with analyzing data and
helping managers in arriving at decisions.
The significant difference
between the operational systems and the informational systems can be seen in
the area of the focus of the two systems.
An operational system is focussed on a single area while an
informational system has to span a breadth of different areas. This is because an operational system is
concerned with the data and transactions in a particular area while the
informational systems needs to data from different sources to facilitate
decision making. Even if there is an
all-encompassing operational system, it cannot double up as an informational
system because its main function is efficiency in operations. Data used in
analysis is typically historical data which is inactive and this data if mixed
with operational live data causes performance degradation of the operational
system. Thus informational systems have to be designed that aid the
decision-makers in performing analyzing and planning for the future. A Data
warehouse effectively performs the function of an informational system on an
enterprise level.
A Data warehouse is a “collection of integrated subject-oriented
databases designed to support the DSS function; where each unit of data is
relevant to some moment in time. The
data warehouse contains atomic data and lightly summarized data.“
Most databases are designed to
ease data entry, reduce redundancy and speed the retrieval of a single
entity. The data warehouse, on the other
hand, is designed of fast retrieval of information & answers. This means that groups of records will be
retrieved, manipulated & analyzed. It may require that data needs to be
accessed from multiple database sources – a
collection of integrated subject-oriented databases.
The data resident in a data warehouse is non-volatile data.
Data from the operational systems is triggered to go to the data warehouse when
most of the activities on these operational data has been completed. Data in
typically saved for a large period of time as the efficiency of analysis
improves with the breadth and depth of data available in the data warehouse.
The data in the data warehouse is usually at a level higher than the data at
the operational level i.e. some analysis or aggregation has already been
performed to the operational data. There are certain data items that will be in
the same level as that of the operational data (e.g. the grain of the fact in a
data warehouse might be the sales on a particular day and this might be the
same as the data in a operational system)
Basics of SAS Software
The core of the SAS system is the
Base SAS software. It gives you all
the tools you need to make your data useful & meaningful. It consists of
·
The SAS
language – programming language used to mange your data.
·
Procedures
– Pre-written computer programs that analyze & process datasets &
display results.
·
A Macro
Facility – to generate & store text strings & communicate info from
one program to another.
·
A Windowing Environment called SAS Display Manager System.
The Base SAS software provide
tools for :
·
Information Storage And Retrieval
·
Data Modification And Programming
·
Report Writing
·
Statistical Analysis
·
File Handling
The next few sections discuss the
following :
·
SAS DATA set
·
Data step
·
PROC step
·
The Parse / Execute Cycle
·
Options
·
File structure
·
SAS date and time
1. SAS DATA set
The
Data to be used must be in a form understandable to the SAS System. This form is called the SAS Data set. It consists of the Descriptor information & Data
values.
·
The Descriptor information describes the
contents of the SAS dataset to the SAS System.
·
The Data values contain the actual data to be
analyzed.
The
DATA values are arranged into a rectangular structure of rows (called
observation) and columns (called variables).
An Observation is a collection of data that usually relate to a single
object. A variable is the set of data
values that describe the characteristics of the object.
SAS recognizes missing values and has an internal
representation for them. Missing values
are values unavailable to the System.
This representation is used because SAS requires values for all
variables for every observation in the dataset.
This ensures the rectangular structure of the data values.
SAS datasets are kept in
collections called SAS data libraries. A
SAS dataset is identified in a SAS program by a two-level name that identifies
the SAS data library & the SAS dataset.
A one-level name used for a SAS dataset implies that the default WORK
library is being assumed.
All SAS programs consists of a
series of statements that, as a group, are designed to accomplish a specific
task called SAS steps. These SAS steps
fall into 2 categories –
DATA steps & PROC steps.
They are the building blocks of
all SAS programs. The SAS datasets is useful to store data between SAS steps
2. A Sample
SAS program
data
ht_wt;
input
name $
1-10 sex $ 12 age 14-15 height 17-18 weight 20-22 init_ht 24-25 init_wt
27-29;
wt_loss
= init_wt – weight;
ht_loss
= init_ht – height;
cards;
John M 35 76 172 73 175
James M 32 75 167 73 169
Mary F 30
68 165 62 154
Ruby F 32
65 158 58 163
;
run;
proc
summary data = ht_wt;
var
wt_loss ht_loss;
output
out = min_wt min(wt_loss )=;
output
out = min_ht min(ht_loss)=;
run;
proc
print data = min_wt;
run;
proc
print data = min_ht;
run;
endsas;
Note: The variables in SAS dataset ht_wt are name, sex, weight, height, init_ht, init_wt, wt_loss and ht_loss
3. The Data
Step
DATA – instructs the SAS system to create s SAS dataset. It signals
the beginning of the DATA step and gives a name to the SAS data set you are
creating.
INPUT – provides information to the System to organize data into
SAS datasets. It describes your input data, giving name to each variable and
identifying its location on the disk or tape file.
CARDS – mark the end of programming stmt & beginning of data
within the same step as the interpreter stops as soon as it gets to the CARDS
statement. Any statement that follows on
the same line will be completely ignored.
Since there is more than one data line, the data step is executed
repeatedly, creating one observation from each input line, until the end of the
input line is reached.
RUN – instructs SAS to execute the previous statements
3. The Proc
Step
In
addition to the data step, the program has two proc steps. Once your data are
accessible as a SAS dataset, you can analyze the data & write reports using
a set of utilities known as SAS Procedures. Proc’s are specialized application
programs that analyze data in a dataset for
producing univariate descriptive statistics, frequency tables, cross tabulation
tables, tabular reports, charts, plots etc.
Other procedures provide ways to manage SAS files. To most SAS users they are the main
attraction of the SAS System.
The first PROC step asks SAS to
call a procedure form its library and to execute that procedure, with the SAS
data set ht_wt as input. Proc
SUMMARY computes the minimum values of ht_loss
& wt_loss, the two variables
listed in the var statement. Two output SAS datasets are created (min_wt & min_ht) with minimum value
of variable wt_loss in one(min_wt) & minimum value of variable
ht_los in the other (min_ht) SAS dataset. The minimum values of the variables ht_loss & wt_loss are stored in the same variables itself in the
corresponding output datasets, in this case.
The output datasets contain all the variables named in the proc step and
a few extra identifier variables.
Note: The variable in SAS dataset min_ht is ht_loss & _freq_ and in SAS dataset min_wt is wt_loss & _freq_. _freq_ is an automatic variable created by
summary procedure which gives the number of observations for the current
subgroup.
The PRINT proc produces a report
in a table form, of all the data vales in the SAS dataset. This report prints the contents of the standard print file. The standard print file is the file
ordinarily used to hold text output from programs, whether it is sent to
printer, displayed on screen, or stored for later use. Print files are divided into pages and
usually have a tittle that appears on the first few lines of each page.
Another file that is
automatically created and maintained by SAS is the log file. The log file
contains the SAS supervisor’s
step-by-step account of the execution of the program, including program
statements, notes, warnings & error messages.
4. The Parse / Execute cycle
1.
When the interpreter reads lines from the program file,
it parses them into tokens & statements until a complete step is
formed.
·
It then checks the syntax of the statement
·
If it is the kind of statement that is executed
immediately, then the statement is executed.
·
If there is a syntax error it generates an error
message.
·
Otherwise, it adds information from the
statement to the step being built
·
Continues processing statements until it reaches
the end of the step
·
Then it executes the step if no errors are
reported.
2
Then it parses more lines from the program file until
it forms another complete step and the above process is repeated for the
execution of that step.
3
The above process continues until the end of program is
reached.
There is no compiling feature in
PROC steps because PROCs are already compiled programs.
Statements executed immediately
are called global statements and do
not have to be associated with a step.
The scope of the SAS data step begins with the key word DATA and ends with any one of the
following:
·
The keyword RUN
(or QUIT)
·
Another Data step beginning with the key word DATA
·
Another Proc step beginning with the key word PROC
·
End of program code
·
The keyword ENDSAS
·
CARDS or CARDS4
statement in a data step
A SAS dataset can be also be
compiled and stored separately. Compiling a data step creates a SAS file in a
SAS library, which can be run in a SAS program.
The compiled data set is not a machine language program but a parsed
code called “intermediate code”.
DATA ABC;
INFILE ABC;
INPUT
A B C D E;
RUN PGM=PROJECT.ABC;
The PGM= option specifies a SAS
file name where the compiled dataset will be stored. The dataset is not executed in this
case. A compiled data step can be
executed by using the PGM= option on a DATA statement. You can change the names of the input(SET,
MERGE or UPDATE statements) & output (DATA statement) SAS data sets in the compiled data step by using the REDIRTECT
INPUT & REDIRECT OUTPUT statements.
DATA PGM =SAS file;
REDIRECT INPUT compiled SAS dataset name = actual SAS dataset name …;
REDIRECT INPUT compiled SAS dataset name = actual SAS dataset name …;
RUN;
The input SAS dataset that are
present when the data step is compiled should have the same variables &
attributes as the SAS data set that will be used when the compiled dataset is
run, but need not contain any observations.
They can be created as :
DATA NEW;
SET
ABC;
STOP;
Where NEW is the dataset with
only the Descriptor information of dataset ABC.
5. Data in
memory
SAS keeps the values of all variables in a step in a block
of memory called the program data vector
or PDV. The size of the PDV is fixed, which limits
the number of variables a step can have to a few thousands. The PDV and
variable attributes represent a modest part of the SAS system’s use of memory. In addition it can include:
·
Pointers to all the files being used, including
the program file, the log file, the standard print file, input files, output
files and any libraries being used in the step.
·
Buffers that contain data read from or written
to each file.
·
The SAS supervisor
·
The current step and any procs, functions, CALL
routines, informats and formats being used in the step
·
File names associated with filerefs or librefs in a
FILENAME or LIBNAME statement
·
System options
·
Array definitions
·
Titles
·
Macro variable names & values
6. The
Observation Loop
The amount of input data read by
one repetition of the observation loop is an observation.
During
compilation, SAS creates three things:
1.
Buffer :
Area of memory into which each record of
raw data is
·
read when an Input statement executes, or
·
written when a Put statement executes.
2.
Program Data
Vector : Area of memory where the SAS data set is built, one observation at
a time. Values are assigned to the variables in the program vector during
execution. From here the values are written to the output SAS dataset as a single observation.
3.
Descriptor
information : Information the SAS
System creates & maintains about each SAS dataset like data set attributes
& variable attributes. The variable
attributes include :
1
Name of the variable
2
Data type
3
Length
4
Label
5
RETAIN or reinitialize
6
DROP or KEEP
7
Initial value
assigned, if any
8
Format or Informat , if any
9
Position of the variable in the dataset
During
Execution,
·
By default SAS data step executes once for each
observation being created. Each time the
DATA statement executes, a new iteration of the data step begins. The automatic
variable _N_ is set to the next
value.
·
Statements that read data (INPUT, SET, MERGE,
and UPDATE) are executable. They may appear
anywhere in the data step, and do not have to be placed right after the data
step.
·
A raw data reading statement like Input causes a record of data to be
read into the input buffer and then into the Program data vector. Records coming from another SAS dataset are
read directly into the Program Data Vector.
·
Subsequent SAS statements are executed for the
current record.
·
When the scope of the Data step is detected, the
following occur automatically :
At the bottom of the observation loop:
1.
Output – an
observation is written into the new SAS dataset
2.
Return – The
system returns to the top of the Data set in preparation for another iteration.
At the top of the observation loop:
3.
The observation count _N_ increases by 1
4.
Values of variables created with Input or assignment
statements are reset to missing in Program Data Vector.
5.
Resets the current input file to CARDS and the current output file to LOG.
·
The next iteration ensues and the process is
repeated.
·
The data step terminates when the end-of-file
condition is encountered for an INPUT, SET, MERGE or UPDATE in a SAS dataset or
raw data file.
The SAS Supervisor
does not set variable values to missing at the top of the DATA step for:
1.
Variables comes from
a SAS dataset (read with SET, MERGE, or UPDATE statements) are retained and
not reset to missing as the program passes through the DATA statement.
2.
Variables listed in a RETAIN statement
3. Variables
used on the left-side of a sum statement e.g. variable I is not reset in I +
5; These are initialized to 0.
4.
Variables used in I/O
statement option for INFILE, SET, MERGE, UPDATE or FILE. These are initialized to 0 or data-dependent
values.
5.
Variables that are array elements and the array uses
temporary variables.
Normally, the SET,
MERGE, UPDATE and INPUT statements stop the data step when the end of input
data is detected. But this does not happen for the following cases:
·
The INPUT statement has trailing @@
·
The POINT= option is used with SET statement for
direct access
·
EOF= option is used with INFILE statement
·
SET, MERGE, UPDATE or INPUT statement executes
conditionally
·
RECFM=U option is specified with INFILE option
The SAS interpreter does not create an
observation loop for steps that do not have INPUT, SET, MERGE, UPDATE or
DISPLAY.
8. File Structure
1.
Engine: Engines are a set of
internal instructions that the SAS System uses to read from and write to files.
Every SAS data set and Data Library is accessed through an engine. Engines open files, direct input/output operations
and gather descriptive information about files and their contents. The engine
uses this information to organize data into correct logical form – SAS data
sets.
2. SAS
data Library: It is the logical structure of files accessed by an
Engine for processing by the SAS System.
SAS Libraries contain SAS data files. They are of two types : Permanent
& temporary
3.
Permanent Libraries: they reside on
the external storage medium and are not deleted when the SAS session
terminates. SAS files in permanent
libraries are specified by a two level qualifier where the first qualifier
stands for the libref and the second
qualifier for the name of the file.
Permanent libraries are stored till they are specifically deleted.
4.
Temporary Libraries: the are
available only for the current session or job run and are deleted at the end of
the session or job run. SAS provides
these libraries for files created during the session but are not required after
the termination of the session. The first qualifier need not be specified in
this case as it defaults to the temporary WORK
library.
5.
SAS data Set: It is the logical
structure into which Engine fits data for processing by the SAS System. They are of two types : SAS data files &
SAS data views.
6. SAS
data file: Contains both the data values and the descriptor information
7.
SAS data View: Obtains the
descriptor information or data values or both from other files. Only the information necessary to derive the
descriptor information or data values is stored in the file.
9. SAS Date
& Time
SAS
System processes calendar date values by converting dates to integer
representing the number of days between January 1st, 1960 and a specified date. Valid SAS dates can be positive or negative
numbers and range from 1582 A.D. to 20,000 A.D.
SAS System
processes time by converting it to integer representing number of seconds since
midnight of the current
day. SAS time values are independent of
the date.
SAS System
processes datetime by converting it to integer representing number of seconds
since midnight of January
1st, 1960 and a specified datetime.
SAS System reads & displays
date, time & datetime values through formats & informats. Informats are used to read fields according
to a specified width and form while a Format writes or displays values according
to a specified width and form.
YEARCUTOFF System Option : This option specifies the first year of
a 100 year span used by Informats & functions. Based on this, the century values of dates
are determined by SAS system.
Rules of SAS Language
The smallest meaningful unit of a
program is a token. The SAS language has
three types of tokens – words, symbols & constants. Tokens are grouped into statements. SAS statements are named for the words they
start with, except for a few statements that do not begin with a keyword – like
assignment & sum statement.
·
A statement end with ‘ ; ‘
·
uppercase or lowercase allowed as SAS convert
all to uppercase before compiling.
·
Statements can begin in any column
·
There is no special continuation characters and
single statements can flow over to next line.
·
Names in SAS can only be upto 8 characters long.
·
No blanks are allowed in SAS names.
·
SAS Data set names can only have alphabets,
numbers or ‘_’ in them and cannot start with a number
1. Keywords
A Keyword is a word that has particular
meaning in SAS syntax. Words that are
used as SAS keywords can also be used as names. So, a keyword is identified by
it’s location. Some keywords that begin statements are :
STATEMENT ACTION WHERE
ATTRIB, FORMAT, INFORMAT, LABEL, LENGTH
|
set variable attributes
|
data & proc steps
|
BY, WHERE
|
Input
|
data & proc steps
|
ABORT, DELETE, DO, END, ELSE, GOTO, IF, LINK,
LOSTCARD, OTHERWISE, RETAIN, RETURN, SELECT, STOP, WHEN
|
control flow
|
data step
|
INFILE, INPUT, MERGE, SET, UPDATE
|
input
|
data step
|
DATA, DROP, ERROR, FILE, KEEP, LIST, PUT, OUTPUT,
RENAME
|
output
|
data step
|
ARRAY, CALL, DISPLAY, WINDOW
|
Miscellaneous
|
data step
|
CLASS,
|
run proc
|
Proc step
|
FILENAME, FOOTNOTE, LIBNAME, MISSING, OPTIONS, TITLE
|
Program parameters
|
Anywhere
|
DM, SKIP, PAGE, X
|
immediate action
|
Anywhere
|
CARDS, CARDS4, LINES, LINES4, QUIT, RUN
|
mark end of step
|
between steps
|
Some other common keywords are :
FROUPFORAMT
|
LT
|
GE
|
CHARACTER
|
GT
|
IN
|
EQ
|
CANCEL
|
AND
|
DEFAULT
|
DESCENDING
|
LE
|
LIKE
|
BETWEEN
|
MAX
|
MIN
|
NE
|
NOBS
|
SAME
|
NOTIN
|
NOTSORTED
|
OR
|
OF
|
NUMERIC
|
OUT
|
NOT
|
TO
|
UNTIL
|
WHILE
|
_PAGE_.
|
Note: Generally, the different classes
of objects in SAS programs have different name
spaces. That means that you can use
the same name for different objects as long as they are not the same kind of
object. E.g. you could have a SAS
variable TIME in a SAS dataset TIME.
Since variables and datasets share the same name space, you cannot use
the same name for a variable & an array in the same step. Also, as only one
step runs at a time, you can use the same names for variables, array &
statement labels in different steps.
2. Automatic Variables
These
variables are automatically created by the SAS System in various
circumstances. They exist within the
Program Data Vector but are not output to the data set being created. Some common automatic variables found in data
steps are :
1.
_N_ :
Denotes the iteration of the Data step. Always created within a Data step with
an initial value of 1. It’s value gets
incremented automatically each time the Data step executes the Data step and
begins a new iteration of the Data set.
2.
_Error_ :
Initially set to 0. It’s value gets
incremented if SAS encounters an error within the Data step.
3.
_I_ : If the
index variable is omitted from the array definition, then SAS assigns the
automatic index variable _I_ by which elements in the array can be accessed.
4.
First .<var_name>
: Temporary variable created by SAS to identify the beginning of a BY
group. When an observation is the first
in a BY group, the value of the First.<var_name> is set to 1 where
<var_name> is the BY variable. For
all the other observations in the BY group, it’s value is set to 0.
5.
Last .<var_name>
: Temporary variable created by SAS to identify the end of a BY group. When an observation is the last in a BY
group, the value of the Last.<var_name> is set to 1 where
<var_name> is the BY variable. For
all the other observations in the BY group, it’s value is set to 0.
6.
_ALL_ :
Depending on context _ALL_ may mean all
the datasets or all the variables that are available.
7.
_DATA_ : Using
_DATA_ asks the SAS interpreter
to name the new dataset as a subsequent name from the series: DATA1, DATA2,
DATA3, DATA4 etc.
8.
_NULL_ :
specifies that no SAS dataset is to be created
9.
_LAST_ : specifies the most recently created
dataset.
3. Variable
Attributes
SAS variables are of two data types: numeric and character.
In addition to their type, SAS variables have
these attributes: length, informat, format, and label
The attributes of the variables are stored in the descriptor information of the SAS dataset.
·
Length : The length attribute of a
variable is the number of bytes used to store each of its values in a SAS data
set. The default length is 8. The Length
statement determines the length of a numeric
variable only in the data set being created while for character variables it determines the length of the variable both
in the Program Data Vector and the dataset being created.
·
Informat : A variables informat is
the pattern that SAS uses to read data values into the variable. The default
informat is w. for numeric variables, $w. for character variables.
·
Format : A variables format is the
pattern SAS uses to write each value of a variable. The default format is
BEST12. for numeric variables, $w. for character variables
·
Label : The label attribute of a variable is a descriptive label of up
to 40 characters that can be printed by
certain procedures instead of the variable name.
4. Variable
lists
Variable
lists normally consist of variable names separated by spaces.
·
One form of abbreviated variable list uses a
hyphen to indicate variable names with a range of numeric suffixes A1-A12 is
same as A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12.
·
A double hyphen indicates a different kind of
range of variables. It represents the
order in which they appear in memory. If
a data step has variables AA BB CC SUM TOTAL defined in that order, then
BB--TOTAL represents the variables BB CC SUM TOTAL. This kind of variable list
is called a named range.
·
There are three special variable lists :
_ALL_
represents list of all variables available, including automatic variables.
_CHARACTER_
or _CHAR_ represents list of all character variables available alone
_NUMERIC_
represents list of all numeric variables available alone
·
A colon is sometimes used to indicate alphabetic
range of variable names. As AQ: would be
used to identify all variables whose
names begin with the letters AQ.
A
named range can be combined with a special variable list by putting the keyword
NUMERIC or CHARACTER between the two hyphens in named range of variables as
BB-NUMERIC-TOTAL includes only numeric variables located from BB to TOTAL in
memory.
5. The Numeric Data Type
The
SAS programming language has only two data types: numeric and character.
This
differs from most high-level languages, which have many different numeric data
types. By this, the SAS System saves
programmers from having to be concerned with the problems of converting between
data types. This data type corresponds to the double precision or 8-byte real
type of other languages. Besides, numbers, the numeric data is used for dates,
times and logical values.
All
numeric variables use 8 bytes of memory, but their length attribute determines
the amount of storage that will be used or the variable if it is stored in a
SAS dataset. Lengths shorter than 8 can
be used to save storage space and I/O time but values might not be precise due
to truncation. The shortest length
allowed is dependent on operating system and can be 2 or 3.
6. Options
SAS software uses 3 types of options
·
System
options : This options will be in effect for all DATA and PROC steps in a
SAS job or session unless they are re-specified in another OPTIONS
statement. They are instructions that
affect the entire SAS session and controls the way it performs operations.
·
SAS data
set options : Which are specified in parentheses following a SAS data set’s
name and affect only that data set. SAS
applies data set options specified with input data sets before it evaluates
program statements or applies data set options applied on output data sets.
·
Statement
options : which are specified
only in a given SAS statement or statements and affect only that statement or
step.
SAS
Programming concepts
The programming concepts
discusses the following :
·
The Data step
·
Input styles
·
Data step statements
·
Operators in SAS
·
Combining datasets
·
Arrays
I. The Data Step
Before you can use SAS software
to prepare your data for analysis or use a SAS procedure to analyze your data,
you must first get them into a SAS data set. Once your data are in a SAS data
set, you can combine the data set with other SAS data sets in many different
ways to and use any of the SAS procedures.
You can use the DATA step for
these purposes;
·
Retrieval – Getting your input data into SAS
data set.
·
Editing – Checking for errors in your data and
correcting them, computing new variables
·
Printing reports according to your
specifications and writing disk or tape files.
·
Producing new SAS data sets from existing ones
by subsetting, merging, and updating the old data sets.
The DATA step can include
statements asking SAS to create one or more new SAS data sets and programming
statements that perform the manipulations necessary to build the data sets.
Data analysis, file management, and information retrieval are all handled in
DATA steps.
Types of DATA
steps
A DATA step is a group of SAS
statements that begins with a DATA statement and usually includes all the
statements in one of these four groups :
1. Data from an
external file
DATA statement;
INFILE
statement;
INPUT statement;
other
SAS statements used in the DATA step
Run;
2. Data in job
stream
DATA statement;
INPUT Statement;
other
SAS Statements used on the DATA step
CARDS Statement;
data
lines
;
3. Data in
existing SAS data set
DATA Statement ;
SET|MERGE|UPDATE Statement;
other
SAS Statements used in the DATA step
Run;
4. Writing
reports
DATA _NULL_;
INPUT and CARDS|INFILE Statement;
or
SET|MERGE|UPDATE Statement;
FILE Statement;
PUT Statement;
other
SAS Statements used in the DATA step
Run;
The FILE statement tells SAS
where to print the report or write the file.
PUT Statements write the lines of
the report or file.
II. Working with SAS datasets
The SAS language supports both
sequential and direct access input from SAS datasets.
·
Sequential input is provided by the SET, MERGE
and UPDATE statement in the data step and the DATA= option and some other
options in the proc step.
·
Direct access is done by the SET statement with
the POINT= option.
SAS
data steps create output SAS datasets
using the DATA and OUTPUT statements.
PROC
steps create output SAS datasets
with the OUT= option on the PROC or OUTPUT statement.
All
output SAS datasets are new files. If a
SAS dataset exists with the same name, the old dataset is deleted after the
successful completion of the step and dataset with same name is created.
The
SAS step first processes the descriptor information of the SAS dataset and then
inputs or outputs one observation at a time.
After writing the last observation, most Engines also add more
information to the descriptor information.
Dataset
Options
Dataset
options input or output processes related to SAS datasets. On Output, they control the data stored in
SAS dataset. On Input, they affect the
way the SAS dataset appears to the step, but they do not actually change the
stored data in the input file.
Both for Input
& Output datasets
DROP= variable: list of variables not to be kept in the dataset
KEEP= variable: list of variables to be kept in the dataset
LABEL=: creates label for SAS datasets
RENAME= old = new: changes the name of
a variable
For Input
datasets
FIRSTOBS=: causes the processing to begin from a specified observation
OBS=: causes the processing to end with the specified observation.
WHERE=: selects observations that meet the specified condition.
CNTLLEV= MEM: specifies whether to lock
the entire dataset
CNTLLEV= REC: specifies whether to lock
only one observation at a time
For Output datasets
COMPRESS=: specifies whether the SAS dataset is compressed or not
REUSE=: specifies whether space can be reused in compressed SAS datasets
INDEX=: creates indexes
REPLACE=: specifies whether to allow
existing SAS datasets to be replaced by new SAS datasets with same name.
III. Input
/ Output Styles
1. List Input
Data
values are required to be separated by at least one blank (the default
delimiter) or the delimiter (if specified).
They are not required to be aligned in columns. SAS requires you to name the variable and
the data type alone (default is numeric, so give $ for character values). Formats or Informats will convert it to
Formatted Input style.
data newfile;
input
name $
age
sex $;
run;
Features :
1.
max (default) length of 8 will be applied to character
values unless specifically overridden by
a LENGTH statement.
2.
Character values cannot contain embedded blanks
3.
Missing values must be specified by “." only.
4.
Fields can only be read in order
5.
Data must be standard numeric or character format.
2. Column Input
Data values are required to be
aligned in columns. SAS requires you to
name the variable, the data type (default is numeric, $ for character values)
and the columns within which the data values are to be located for each record.
data newfile;
input
name $ 1-8
age 12-14
sex $ 16;
run;
Features :
1.
Character values can be from 1 - 200 characters long.
2.
Data values must be within the same columns on all the
input lines
3.
Character values cannot contain embedded blanks
4.
Missing values need not be specified by “.".
5.
Fields can be read in any order regardless of their
position in the records.
6.
Data must be standard numeric or character format.
7.
Leading & trailing blank within the fields are
ignored.
3. Formatted
Input
Formatted
input is used with pointer controls which controls the position of the input
pointer in the input buffer when reading data..
It differs from the List Input in that it enables you to read
non-standard data for which SAS requires additional instructions.
data newfile;
input name $char 8.
+4 age 2.
+1 salary comma5.;
run;
Features :
1.
Character values can be from 1 - 200 characters long.
2.
Character values can contain embedded blanks
3.
Missing values need not be specified by “.".
4.
Fields can be read in any order regardless of their
position in the records.
5.
Can read data stored in non-standard form.
4. Named Input
Named input is used when data lines contain variable names
followed by an equal sign and a value for the variable. Once the INPUT statement starts reading
named inputs, the System expects all remaining values in the input line to be
of the same form. The variables in the
INPUT statement do not have to be in the same order in which they occur in the
data records.
data newfile;
informat phone 6.;
input @1 date
julian5.
name=
age=
salary=;
cards;
99001 age=23 salary=1000 name=Mary
phone=232345
97234 phone=242334 salary=1000
name=Martin age=21
91210 age=24 salary=2000 phone= 223198
name=John
99001 name=Maggie phone=238971 age=24
salary=1600
;
Features :
1.
Cannot switch to another input style for a particular
input line once you start reading it with named input.
2.
If any of the values are not in named input form then
the System handles them as invalid data.
3.
If a variable that appears on the named input lines
appear in any other statement, the value is automatically read from the input,
whether or not it is explicitly specified in the input statement.
4.
Cannot read data stored in non-standard form.
All these input styles have a corresponding output style as
well.
Informat: specifies the format of the values for variables being
read. SAS system interprets the format
and convert it to an internal format
that it understands. Is supplied with the INPUT
statement or as data set option of input data set if required.
Format: specifies the format of
the values for output variables.
The SAS system converts the value to a format that is required to appear
in the output. Is supplied along with the PUT
statement or other output procedures.
IV. Working with External files
A
Text file is a sequence of records. SAS
supports sequential input & output for text files, even within the same
data step. Data positions are usually
stated in terms of columns which represent the distance from the beginning of
the record. SAS input / output syntax
for text files is more powerful and flexible than any of the classic high level
languages.
1. INPUT
operation
The syntax for
input from text files involves 2 statements:
·
INFILE – provides general identification
information about the input file
·
INPUT – controls the way the input data in
interpreted and assigned to variables
A blank INPUT
statement (with no arguments) is called a NULL INPUT statement. It can have several uses:
1) To bring input data line into the input
buffer without creating any SAS variables.
This data line can be copied as such to output file
2) Or release an input line held by a trailing @
or double trailing @.
INFILE
Because
the INFILE statement identifies the file to be read, it must execute before the
INPUT statement that reads the data lines.
As
it is executable, you can use it in conditional processing (in an IF-THEN
statement, for example).
You
can read from several external files within one DATA step. To read from multiple input files in a single
iteration of the DATA step, you can use multiple INFILE and INPUT
statements. To read from one file, then
close it and open another, you can use the FILEVAR= option. (FILEVAR= enables
you to dynamically change the current input file within your SAS job.
When
you use more than one INFILE statement for the same fileref and you use options
in each INFILE statement, the effect is additive. That is, the options specified in each INFILE
statement are added to the options specified in any previous INFILE statements
for that file.
You
can use the INFILE statement in combination with the FILE statement to update
records in an external file. To do so,
follow these steps:
1. Specify the INFILE statement before
the FILE statement.
2. Specify the same fileref or
physical filename in each statement.
3. Use options that are common to both
the INFILE and FILE statements in the INFILE statement instead of the FILE
statement. Any such options used in the
FILE statement are ignored.
To
update individual fields within a record instead of the entire record, use the
SHAREBUFFERS option.
The
INFILE statement is an executable statement.
It sets the current input file, which the INPUT statement then
reads. It must be executed before the
INPUT statement to which it refers. The
current input file is changed to CARDS at the top of the observation loop. So the INFILE has to be executed in every
repetition of the observation loop that executes INPUT statement.
Options
LRECL=: The number of characters in a
record
PAD: Pads input fields shorter than the
LRECL value with trailing blanks. Default is NOPAD
LINESIZE= / LS=: Limits the number of
characters in a record available to the INPUT statement. It prevents the INPUT statement from reading
past a certain column.
FIRSTOBS=: The number of the first
observation to be read from the input file.
Use this option to skip records at the beginning of the file.
OBS=: The number of the last record to
be read from the input file. Use this
option to skip records at the end of an input file.
N=: Specifies the number of lines
available to the input pointer.
EOF= label: The INPUT statement
branches to the statement label indicated if it attempts to read past the end
of a file.
END= variable: designates a numeric
variable that the INPUT statement sets to 1 when it reads the last record in
the file.
UNBUFFERED / UNBUF: Tells the SAS
supervisor not to look ahead at the next record when reading a record. The END= variable cannot be used to indicate
the last line in the input file.
COLUMN / COL = variable: designates a numeric
variable that the INPUT statement sets to the column pointer location
LINE= variable: designates a numeric
variable that the INPUT statement sets to the line pointer location.
LENGTH= variable: designates a numeric
variable that contains the length of the input line. The variable then can be sued with $VARYING informat to read varying length
records. Changing the value of the
variable between the INPUT & the PUT statement can change the length of the
_INFILE_ string.
START: designates the numeric variable
that identifies the starting character to be used I the _INFILE_ string. A value can
be assigned before the PUT statement to change the extend of the _INIFLE_ string.
DELIMITER= / DLM=: Delimiter used in
list input. The default is ‘ ‘.
FILEVAR= variable: Changing the value
of the character variable causes the INFILE statement to close the input file
and to open the file whose physical name is the value of the variable.
SHAREBUFFERS / SHAREBUFS: Use this
option for text files being edited to use the same buffer for input &
output. This means that any positions
skipped over by the PUT statement will stay the way they were before. Otherwise, the positions in an output record
that the PUT statement does not write to are filled with blanks. This option makes it possible to change some
fields in a file without processing other fields.
DSD: This option changes the way
delimiters are treated when using list input and enables you to read delimiters
as characters within quoted strings. When the DSD option is in effect, the
delimiter is assumed to be a comma.
If the data contain another delimiter, you must specify it with the DELIMITER= option. To read a value as missing between two
consecutive delimiters, use the DSD option. By default, consecutive delimiters
are treated as a unit. When you use the DSD option, consecutive delimiters are
treated separately; therefore, a value that is missing between consecutive
delimiters is read as a missing value.
The
DSD option also enables list input to read a character value that contains a
delimiter within a quoted string. For example, if data are separated with
commas, using the DSD option enables you to place the character string in
quotes and read a comma as a valid character. The quotes are not stored as part
of the character value.
The
following determines what the INPUT statement does when it gets to the end of a
record, before it finds all values for all the variables in the record.
FLOWOVER: The remaining variables are
read from the first column of the next record.
This is the default action.
TRUNCOVER: Using the TRUNCOVER option
enables you to read variable-length records when some records are shorter than
expected by the INPUT statement.
MISSOVER: The remaining variables are
assigned missing values.
STOPOVER: The step stops running as an
error condition is created.
The _INFILE_ string refers to the last
record read from the current input file.
The default INPUT
file is CARDS
2. OUTPUT
operation
The
syntax for input from text files involves 2 statements:
·
FILE – provides general identification
information about the output file
·
PUT – controls the way the input data in
interpreted and assigned to variables
FILE
The
FILE statement is an executable statement.
It sets the current output file, which the PUT statement then
writes. It must be executed before the
PUT statement to which it refers. The
current output file is changed to LOG at the top of the observation loop. So the FILE has to be executed in every
repetition of the observation loop that executes PUT statement.
As
it is executable, you can use it in conditional processing (in an IF-THEN
statement, for example).
You
can write to several external files within one DATA step. To write to multiple output files in a single
iteration of the DATA step, you can use multiple FILE and PUT statements.
Options
Many of the options find a similar
one for INFILE statement.
LRECL=: The number of characters in a
record
PAD: Pads input fields shorter than the
LRECL value with trailing blanks. Default is NOPAD for variable length records
and PAD for fixed length records.
LINESIZE= / LS=: Limits the number of
characters that can be written to a record by the PUT statement.
OLD: This option makes the step writes
output records at the beginning of the file, replacing any previous contents of
the file.
MOD: This option makes the step writes
output records at the end of the file, adding records to the previous contents
of the file.
PRINT or NOPRINT: Tells whether a file is a print file or a non-print file.
NOTITLES: tells the supervisor not to put the current titles, defined
in a TITLE statement, at the top of each page of a PRINT file.
PAGESIZE / PS : determines the number
of lines per page of output.
FIRSTOBS=: The number of the first
observation to be written to the output file.
Use this option to skip records at the beginning of the file.
OBS=: The number of the last record to
be written to the output file. Use this
option to skip records at the end of an output file.
N=: Specifies the number of lines
available to the output pointer.
HEADER= label: When the PUT statement
writes to the end of a PAGE, it branches out to the HEADER= statement label to execute a group of statements there
until a RETURN statement is reached.
COLUMN / COL = variable: designates a numeric
variable that the PUT statement sets to the column pointer location
LINE= variable: designates a numeric
variable that the PUT statement sets to the line pointer location.
LINESLEFT= / LL= variable: designates a
numeric variable that tells the number of lines remaining on the current page,
including the current line pointer.
DELIMITER= / DLM=: Delimiter used in
list output. The default is ‘ ‘.
FILEVAR= variable: Changing the value
of the character variable causes the FILE statement to close the output file
and to open the file whose physical name is the value of the variable, for
output.
SHAREBUFFERS / SHAREBUFS: Use this
option for text files being edited to use the same buffer for input &
output. This means that any positions
skipped over by the PUT statement will stay the way they were before. Otherwise, the positions in an output record
that the PUT statement does not write to are filled with blanks. This option makes it possible to change some
fields in a file without processing other fields.
When
a PUT statement attempts to write beyond the maximum allowed line length (as
specified by LINESIZE= option in FILE statement), the following options on the
FILE statement can cause varying results
FLOWOVER: The current output line is
written to the file and the data item that exceeds the current line length is
written to a new line.
DROPOVER: The option discards data
items that exceed the output line length as specified by the LINESIZE= option in the FILE statement and the column pointer
remains positioned after the last value written in the current line.
STOPOVER: stops processing the data
step immediately if a PUT statement attempts to write a data item that exceeds
the current line length. The System
writes the portion of the line built before the error occurred and issues an
error message.
The default OUTPUT
file is LOG.
Pointer
Controls & Line-Hold Specifiers
As the SAS System reads values from
data records in the input buffer, it keeps track of its position with a
pointer. Pointer controls are provided
on the Input statement so that you can reset the position of the pointer to
read data values in records at certain positions. Line-hold specifiers allow you to hold a
data record in the input buffer to be processed by another INPUT statement.
@ - Column pointer control that moves
the pointer to column n. Any decimal portion of variable values is
truncated and only integer values are used.
If 0 the pointer moves to column 1.
+ - Moves the pointer n columns. Any decimal portion of
variable values is truncated and only integer values are used. If 0 the pointer moves to column 1.
# - Moves the pointer to line n.
Any decimal portion of variable values is truncated and only integer
values are used.
/ - Advances the pointer to column 1 of
the next line.
: - Is a character comparison operator
that modifies existing comparison operators compare all values that start with a
given character. It changes the nature
of the comparison from an exact match to a “begins with” match. It has no effect when it is used between two
variables, only comparison with a string constant.
e.g. if upcase (charvar) =: “SMIT”;
when used to
compare 2 text strings, the longest string will be truncated to the length of
the shortest one for the purpose of evaluation. E.g.
if ‘ABC’ =:
‘ABCD’; or
if ‘ABC’ =: ‘AB’; will both evaluate to TRUE.
If charvar >:
‘A’; will return records where
variable charvar values begin with B.
If charvar in:
(‘SM’,’Will’,’aa’); will return
values that start by any of these character set, irrespective of their total
length.
Trailing @ - To allow the next INPUT
statement in the same DATA step to read from the same record. Prevents the next INPUT statement from
automatically releasing the current Input record and reading the next one into
the input buffer. Between INPUT
statements the pointer position remains the same.
Trailing @@ - To allow a record to be held
for the next INPUT statement, even across iterations of the DATA step. Here each input line contains values for
several observations. An input line held
by the system is released immediately if the pointer moves past the end of the
line, if a NULL INPUT statement executes
V. DATA
step Statements
The
SAS Statements that can appear in a DATA step fall into several categories:
File handling Statements, action
Statements, Control Statements, and information Statements; Each Statement is
either executable, positional, or declarative.
Executable statements (denoted by
X) are programming Statements that cause some action.
Positional Statements (P) cause
no action at execution, but their position in DATA step is important.
Declarative Statements (D) supply
additional information to SAS.
1. File handling
Statements
·
CARDS – precedes card data or lines entered at
terminal - data that are part of the job stream (P)
·
CARDS4 – precedes in-stream data lines
containing semicolons (P)
·
DATA – tells SAS to begin a DATA step and to
start building a SAS data set (P)
·
FILE – identifies the data file where lines are to be written
by the DATA step (X)
·
INFILE – The INFILE statement gives the fileref
of the control Statement (FILENAME statement). The fileref is a logical name to
the physical file. The fileref identifies the external file containing raw data
to be read. When the INFLE statement is executed the external file is opened.
(X)
·
INPUT – describes the records on the external input file.(X)
·
MERGE – combines
observations from two or more SAS data sets into a new data set. (X)
·
PUT – describes the format of the lines to be
written by SAS.(X)
·
SET – reads
observations from one or more existing SAS data sets. (X)
·
UPDATE – applies transactions to a master file.
Both transaction and master file are SAS data sets (X)
2. Action
Statements
·
ABORT – stops the current DATA step or the job,
depending on the mode of executing.
·
Assignment – creates and modifies variables.
·
CALL – invokes or calls a routine.
·
DELETE – excludes observations from the data set
being created.
·
ERROR – writes messages on the SAS log.
·
LIST – lists the current input lines to the LOG.
·
LOSTCARD – corrects for lost data lines when an observation has
an incorrect number of data lines.
·
MISSING – declares that certain values in the
input data represent special missing values for
numeric data
fields.
·
OUTPUT – creates new observations.
·
STOP – stops creating the current data set.
·
subsetting IF – selects observations for the
data set being created.
·
SUM – accumulates total
3. Control
Statements
·
DO – sets up a group of statements to be
executed as one statement.
·
iterative DO
·
DO UNTIL
·
DO WHILE
·
END – signals the end of a DO or SELECT group.
·
GO TO – causes SAS to jump to a labeled statement
in the step and continue execution at that
point.
·
IF-THEN/ELSE – conditionally executes a SAS
statement
·
LINK-RETURN – causes SAS to jump to a labeled
statement in the step and execute statements until it encounters a RETURN Statement.
·
RETURN – when not combined with a LINK
statement, causes SAS to return to the beginning of the DATA step to begin
execution. When combined with a LINK statement,
returns to the statement immediately following the most recently executed
LINK.
·
SELECT – conditionally executes one of several
SAS Statements.
4. Information
Statements.
·
ARRAY – defines a set of variables to be
processed the same way. (D)
·
ATTRIB – specifies a format, informat, label and
length for a variable.(D)
·
BY – specifies that the data set is to be
processed in groups defined by the BY
variables.(D)
·
DROP – identifies variables to be excluded from
a data set or analysis.
·
FORMAT – specifies formats for printing variable
values.
·
INFORMAT – specifies informats for storing
variable values.
·
KEEP – identifies variables to be included in a
data set or analysis.
·
LABEL – associates descriptive labels with
variable names.
·
RENAME – changes the name of the variables in a
data set.
·
RETAIN – identifies variables whose values are
not to be set to missing each time the DATA step
is executed and can give variables an initial value for the first iteration
(otherwise for first iteration it would have been missing). Sum statements variables are retained by
default.
VI. Operators in SAS
Priority
|
Symbol
|
Mnemonic Equivalent
|
Definition
|
Example
|
|
|
|
|
|
Group I
|
**
|
|
exponetiation
|
y=a**2;
|
|
+
|
|
positive prefix
|
y=+(a*b);
|
|
-
|
|
negative prefix
|
z=-(a*b);
|
|
^ or ~
|
NOT
|
logical NOT
|
if not z then put x;
|
|
> <
|
MIN
|
minimum
|
x=a><b;
|
|
<>
|
MAX
|
maximum
|
x=a<>b;
|
|
|
|
|
|
Group II
|
*
|
|
multiplication
|
c=a*b;
|
|
/
|
|
division
|
f=g/h;
|
|
|
|
|
|
Group III
|
+
|
|
addition
|
f=g+h;
|
|
-
|
|
subtraction
|
f=g-h;
|
|
|
|
|
|
Group IV
|
||
|
|
concatenate character values
|
name = 'J. ' ||
'Smith';
|
|
|
|
|
|
Group V
|
<
|
LT
|
less than
|
if y < z then put
x=;
|
|
<=
|
LE
|
less than or equal to
|
if y le z then put
x=;
|
|
=
|
EQ
|
equal to
|
if y eq (a+b) then
output;
|
|
^=
|
NE
|
not equal to
|
if x ne z then
output;
|
|
>
|
GT
|
greater than
|
if z gt a then
output;
|
|
>=
|
GE
|
greater than or equal to
|
if z ge a then
output;
|
|
|
IN
|
equal to one of a list
|
if sex in ('m','f')
then result='correct';
|
|
|
|
|
|
Group VI
|
&
|
AND
|
logical AND
|
if a=b and c=d then
x=1;
|
|
|
|
OR
|
logical OR
|
if a=b or c=d then
x=1;
|
VII. Combining Data Sets
Introduction
The
SAS System provides a means for processing observations that are ordered or grouped according to the
values of one or more variables read from existing SAS data sets. The SAS System detects the pattern by
tracking the values of the temporary variables FIRST.variable & LAST.variable.
The SAS System expects observations
to be ordered or grouped by the value of the variables specified in the BY
statement. The observations can be
ordered by sorting or indexing the dataset.
The most frequent use of BY
group processing in the Data step is to combine two or more SAS data sets. When
processing SET, MERGE & UPDATE statements, the SAS System reads one
observation at a time into the program data vector according to the values of
the BY variable. After processing all
the observations from one BY group, it expects the next observation to be from
the next BY group. The NOTSORTED option
in the BY statement is used when the data is not in alphabetical or numeric
order but are arranged in groups according to the values of the BY variable.
Note: Using 2 files in SET statement is equivalent
to using proc APPEND if the output dataset also appears as the first dataset on
the SET statement. E.g.
Data
in2;
Set in2
in3;
Is the
same as,
Proc
append base=in2 new=in3;
Run;
Using
more than 1 SET statement is equivalent to MERGE operation. E.g.
Data
in1;
Set in2;
Set in3;
Is the
same as,
Data
in1;
Merge
in2 in3;
1 Concatenating
data sets
Data myfile;
Set ourfile1 ourfile2….;
Concatenation is combining two or
more datasets one after the other into a single dataset. The number of observations in the new
dataset is the sum of observations of the old data sets and the order is all
observations from the first followed by all from the second. If input data set contains different
variables, observations form one data sets have missing values for variables
defined only in the other data set.
Ourfile1
OBS
|
COMMON
|
ANIMAL
|
1
|
a
|
ant
|
2
|
b
|
bird
|
3
|
c
|
cat
|
4
|
d
|
dog
|
5
|
e
|
eagle
|
6
|
f
|
frog
|
Ourfile2
OBS
|
COMMON
|
PLANT
|
1
|
a
|
apple
|
2
|
b
|
banana
|
3
|
c
|
coconut
|
4
|
d
|
dewberry
|
5
|
e
|
eggplant
|
6
|
f
|
fig
|
myfile
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
a
|
ant
|
|
2
|
b
|
bird
|
|
3
|
c
|
cat
|
|
4
|
d
|
dog
|
|
5
|
e
|
eagle
|
|
6
|
f
|
frog
|
|
7
|
a
|
|
apple
|
8
|
b
|
|
banana
|
9
|
c
|
|
coconut
|
10
|
d
|
|
dewberry
|
11
|
e
|
|
eggplant
|
12
|
f
|
|
fig
|
2. Interleaving
Data myfile;
Set ourfile1 ourfile2 ourfile3…..;
By var1 var2 var3…..;
The sum of observations in the
new data set is the total of observations of the old data sets. The observations in the new data sets is
arranged by the value of the BY variable and within each BY group by the order in
the old data sets.
In the example, the data set
ourfile1 & ourfile2 are SET by the variable COMMON
Myfile
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
a
|
Ant
|
|
2
|
a
|
|
apple
|
3
|
b
|
Bird
|
|
4
|
b
|
|
banana
|
5
|
c
|
Cat
|
|
6
|
c
|
|
coconut
|
7
|
d
|
Dog
|
|
8
|
d
|
|
dewberry
|
9
|
e
|
Eagle
|
|
10
|
e
|
|
eggplant
|
11
|
f
|
frog
|
|
12
|
f
|
|
fig
|
3. One to One
Reading
Data myfile;
Set ourfile1;
Set ourfile2;
The
new data set contains all the variables
from all the input data sets. The number
of observations in the new data set in the number of observations in the
smallest original data set. If the data
set contains common variables, the values read from the last data set replace
those read from earlier ones.
Myfile
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
a
|
ant
|
Apple
|
2
|
b
|
bird
|
Banana
|
3
|
c
|
cat
|
Coconut
|
4
|
d
|
dog
|
Dewberry
|
5
|
e
|
eagle
|
Eggplant
|
6
|
f
|
frog
|
Fig
|
4. One to One
Merging
Data myfile;
Merge ourfile1 ourfile2;
SAS
system combines the first observation from all the data sets in the MERGE
statement into the first observation in the new data sat. Similarly all nth observations from all data
sets are merged to from nth observation in new data set. The number of observations in the new data
set is equal to the number of observations in the largest data set.
Myfile
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
a
|
ant
|
apple
|
2
|
b
|
bird
|
banana
|
3
|
C
|
cat
|
coconut
|
4
|
D
|
dog
|
dewberry
|
5
|
E
|
eagle
|
eggplant
|
6
|
F
|
frog
|
fig
|
the result is similar the result
obtained by One to One Reading when the number of observations in the merging
data sets are equal. When the number of
observations are unequal, the SAS System stops processing before all
observations are read from all data sets with One to One Reading .
5. Match-Merging
Data myfile;
Merge ourfile1 ourfile2….;
By var1 var2…..;
Match-Merging
combines observations from two or more data sets into a single observation in
the new dataset according to the values of the common variable. Observations from the different datasets with
the same BY variable values are combined. If there are several observations with the
same BY variable values, they are matched in a manner similar to the one-to-one
merging process. But, in this case, if
one SAS dataset has fewer observations in a BY group than the other, the values
of the last observation in the BY group are used to form the rest of the
observations in the BY group, instead of missing values. Missing values are
used when one SAS dataset has no observations in a BY group.
When there is a conflict in the
value of the variable, other than a BY variable, between different input SAS
datasets in a MERGE statement, the value from the SAS dataset named later in
the MERGE statement is used.
The number of observations in the
new data set is equal to the total of the largest number of observations in
each BY group from among all input data set.
Each observation in new data set contains all the variables from all
data sets.
Duplicate values of BY variable
Ourfile1
OBS
|
COMMON
|
ANIMAL
|
1
|
a
|
ant
|
2
|
a
|
ape
|
3
|
b
|
bird
|
4
|
c
|
cat
|
5
|
d
|
dog
|
6
|
e
|
eagle
|
Ourfile2
OBS
|
COMMON
|
PLANT
|
1
|
a
|
apple
|
2
|
b
|
banana
|
3
|
c
|
coconut
|
4
|
c
|
celery
|
5
|
d
|
dewberry
|
6
|
e
|
eggplant
|
Myfile
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
a
|
ant
|
apple
|
2
|
a
|
ape
|
apple
|
3
|
b
|
bird
|
banana
|
4
|
c
|
cat
|
coconut
|
5
|
c
|
cat
|
celery
|
6
|
d
|
dog
|
dewberry
|
7
|
e
|
eagle
|
eggplant
|
Nonmatching observations
Ourfile1
OBS
|
COMMON
|
ANIMAL
|
1
|
a
|
ant
|
2
|
c
|
cat
|
3
|
d
|
dog
|
4
|
e
|
eagle
|
Ourfile2
OBS
|
COMMON
|
PLANT
|
1
|
a
|
apple
|
2
|
b
|
banana
|
3
|
c
|
Coconut
|
4
|
e
|
Eggplant
|
5
|
f
|
Fig
|
Myfile
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
a
|
ant
|
apple
|
2
|
b
|
.
|
Banana
|
3
|
c
|
cat
|
Coconut
|
4
|
d
|
dog
|
.
|
5
|
e
|
eagle
|
Eggplant
|
6
|
f
|
.
|
fig
|
6. Updating Data sets
Data myfile;
Update master trans;
By common;
The update statement uses
observations from the transaction data set to change values of corresponding
observations from master data set. All
observations in the find relation with observations in the master data set by
values of the BY variable.
The values of the BY variable or
combination of BY variables must be unique for each observation in the master
data set. The BY variables do not get
updated. The number of observations in
the new data set is the sum of the observations in the master data set and the
number of unmatched observations in the transaction data set.
Master
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
a
|
ant
|
apple
|
2
|
b
|
bird
|
banana
|
3
|
c
|
cat
|
coconut
|
4
|
d
|
dog
|
dewberry
|
5
|
e
|
eagle
|
eggplant
|
6
|
f
|
frog
|
fig
|
Trans
OBS
|
COMMON
|
PLANT
|
1
|
a
|
apricot
|
2
|
b
|
barley
|
3
|
c
|
cactus
|
4
|
d
|
date
|
5
|
e
|
eucalyptus
|
6
|
f
|
fennel
|
Myfile
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
a
|
ant
|
apricot
|
2
|
b
|
bird
|
barley
|
3
|
c
|
cat
|
cactus
|
4
|
d
|
dog
|
date
|
5
|
e
|
eagle
|
eucalyptus
|
6
|
f
|
frog
|
fennel
|
Duplicate values of BY variable
Master
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
a
|
ant
|
apple
|
2
|
b
|
bird
|
banana
|
3
|
b
|
bird
|
banana
|
4
|
c
|
cat
|
coconut
|
5
|
d
|
dog
|
dewberry
|
6
|
e
|
eagle
|
eggplant
|
7
|
f
|
frog
|
fig
|
Trans
OBS
|
COMMON
|
PLANT
|
1
|
a
|
apricot
|
2
|
b
|
barley
|
3
|
c
|
cactus
|
4
|
d
|
date
|
5
|
d
|
dill
|
5
|
e
|
eucalyptus
|
6
|
f
|
fennel
|
Myfile
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
A
|
ant
|
apricot
|
2
|
B
|
bird
|
barley
|
3
|
b
|
bird
|
banana
|
4
|
c
|
cat
|
cactus
|
5
|
d
|
dog
|
dill
|
6
|
e
|
eagle
|
eucalyptus
|
7
|
f
|
frog
|
fennel
|
If the master data set contains
two observations with the same value of the BY variable, the first observation
is updated and the second is ignored.
Warning message are also generated as the BY variables are to be
unique. If the transaction data set
contains duplicate values of the BY variable, SAS applies both transactions to
the observation and the last value copied into the program vector is written
into the new data set.
Non-matched observations
Master
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
1
|
A
|
Ant
|
.
|
2
|
C
|
Cat
|
Coconut
|
3
|
D
|
Dog
|
Dewberry
|
4
|
E
|
Eagle
|
Eggplant
|
5
|
f
|
Frog
|
Fig
|
Trans
OBS
|
COMMON
|
PLANT
|
MINERAL
|
1
|
a
|
Apricot
|
Amethyst
|
2
|
b
|
Barley
|
beryl
|
3
|
c
|
Cactus
|
.
|
4
|
e
|
.
|
.
|
5
|
f
|
Fennel
|
.
|
6
|
g
|
Grape
|
garnet
|
Myfile
OBS
|
COMMON
|
ANIMAL
|
PLANT
|
MINERAL
|
1
|
a
|
Ant
|
Apricot
|
Amethyst
|
2
|
b
|
.
|
Barley
|
Beryl
|
3
|
c
|
Cat
|
Cactus
|
.
|
4
|
d
|
Dog
|
Dewberry.
|
|
5
|
e
|
Eagle
|
Eggplant
|
.
|
6
|
f
|
Frog
|
Fennel
|
.
|
7
|
g
|
.
|
Grape
|
Garnet
|
Only non-missing values from the
transaction dataset is used when updating values in the master dataset. If necessary, the special missing value ._ can be used to update a value to
missing. Any observation from the transaction
data set that does not correspond to the master data set is written to the
program data vector and becomes the basis for an observation in the new data
set.
There can be more than one input
file involved in the creation of an observation by combinations of SET and
INPUT statements. However, the step
stops after one of the input streams
(they operate completely independent of each other) reaches the end of
data. This means that data at the end of
the other stream is never reached. In
these cases, Input files can be coordinated by using the END= option on the
INPUT, SET, or MERGE statements and using programming statements to identify
the end of an input stream. The
following code does a one-to-one merge
of a SAS dataset with a sequential file to create a new SAS dataset.
SEQUENTIAL FILE FLATFILE
NAME AGE BIRTHDAY
MOLLY 24 23JAN69
SUSAN 22 23FEB72
ARCHIE 22 01OCT59
BILLY 26 29JUL79
JOHN 22 04NOV75
SARAH 27 11MAY79
SASFILE STRUCTURE
OBS SEQNO
1
1
2
2
3
3
SAS CODE:
DATA NEW;
SEQNO =_N_;
INFILE FLATFILE END= LAST1;
IF NOT LAST1 THEN
INPUT @1
NAME $10. @12 AGE 2. @15 BIRTHDAY
$5.;
IF NOT LAST2 THEN
SET SASFILE (KEEP=SEQNO) END=LAST2;
IF LAST1 AND LAST2 THEN STOP;
RUN;
OBS SEQNO
NAME AGE BIRTHDAY
1
1 MOLLY
24 23JAN69
2
2 SUSAN
22 23FEB72
3
3 ARCHIE
22 01OCT59
4
. BILLY
26 29JUL79
5
. JOHN 22 04NOV75
6
. SARAH 27 11MAY79
VIII. ARRAYS
Array arrayname [ number-of-elements
] list-of-variables
e.g.
array books
[4] ref usage intro glossary;
subscript values can be passed as
number, variable with numeric value or an expression.
Here books[2] and usage are equivalent.
e.g.
Data myfile (drop=count);
Input ref usage intro glossary;
Array books[4] ref usage intro
glossary;
do count = 1 to 4;
if
books[count] =. then books[count] = 0;
end;
cards;
45 46 112 23
65 53 123 .
56
. 154 32
. 23
134 45
;
proc print data=myfile;
title ‘Data set produced with array
processing’;
run;
The Array statement defining the
array must appear in a data step before any reference to that array. An array definition is only in effect for the
duration of the data step. To use in
several data step, you must redefine the array in each data step.
e.g.
%let list ref usage intro glossary;
data one;
array books[4] &list;
-- more SAS statements --
run;
data two;
array journal[4] &list;
-- more SAS statements --
run;
To
refer to all variables in an array use the special array subscript asterisks (*)
IX. FUNCTIONS
ABS
Returns
a non negative number equal in magnitude to that of the argument.
e.g. x = abs(-2.4);
BYTE
Returns
the nth character on the ASCII or EBCIDIC collating sequence.
e.g. x = byte(80); put x; <== will
return & in EBCIDIC
CEIL
Smallest
integer that is greater than or equal to the argument.
e.g. x= ceil(2.1) <== x
will return 3.
COMPRESS
Removes
specified character(s) from character expressions. If second argument is not specified then
blanks, if any, will be removed
e.g. x = ‘A.B (C=D);’;
y
= compress (x ,’.();’);
put
y ; <== will
give you AB C=D
DATE
Returns
current date as a SAS date value
DATEJUL
Converts a Julian
date to a SAS date value.
e.g. x= datejul(01001);
<== will
return value in SAS date of 14976
DATEPART
Extracts
the date from a SAS datetime value.
e.g. datetm = ‘01feb98:8:45’dt ; <== way to specify constant datetime
values.
thedate
= datepart (datetm);
put
thedate wordate.; <== prints the date alone in wordate fromat.
DATETIME
Returns
the current datetime value.
DAY
Returns the date of the month
from a SAS date value.
e.g. now = ’04May98’d; <== way to specify constant values for dates.
d
= day (now); <== d
will hold the value 4.
DHMS
Returns
a SAS datetime from date, hour, minute and second.
Syntax : DHMS (date, hour, minute, second)
e.g. dt = dhms(‘01jan97 ’d,22,45,20);
DIM
Returns
the number of elements in an array
Syntax : DIM <n>(arrayname)|
(arrayname,bound-n)
Bound-n – Specifies the dimension in a multi-dimension array for
which you want to know the number of elements
e.g. dim(array2 , 3) returns the
number of elements in the second dimension of a multi-dimension array array2.
EXP
Raises the ‘e’
to the power supplied by the argument.
FLOOR
Largest integer
that is less than or equal to the argument.
e.g. x=
floor (1.6) <==
returns 2.
HMS
Returns
a SAS time value from hours, minutes & seconds values.
e.g.
sastime
= hms(14,45,20);
HOUR
Returns hour from a SAS time or
datetime value
INDEX
Searches
the source for the character string specified by the excerpt and returns its position. Returns
0 if the excerpt is not found.
Syntax : index (source, excerpt)
INDEXC
Searches
the source for the characters specified by the excerpt and returns its position. Returns
0 if the excerpt is not found.
Syntax : index (source, excerpt)
INPUT
The function
allows you to read argument using any informat specified by the second
argument. The informat specified determines whether the result is numeric or
character. Used to convert character values to numeric values.
e.g. fmtsale =
input(sale, comma8.); <== will
return $10,000 from values like 10000
INTNX
Syntax : INTNX(interval, from, number)
advances a date, time, or datetime value by
a given interval.
This function generates a SAS
date, time, or datetime value that is a given number of time intervals from a
starting value(from). The interval must be a character constant or variable
whose value is one of those listed below.
DATE interval
|
DATETIME interval
|
TIME interval
|
DAY
|
DTDAY
|
HOUR
|
WEEK
|
DTWEEK
|
MINUTE
|
MONTH
|
DTMONTH
|
SECOND
|
QTR
|
DTQTR
|
|
YEAR
|
DTYEAR
|
|
e.g. yr = intnx(‘year’,’05jan89’d); <== will return value 89.
INTCK
Gives
the number of intervals in a given time span
Syntax : intck(interval, from, to)
e.g.
qart =
intck(‘qtr’,’10jan98’d,’01may99’d); <==
will return value 5
JULDATE
Returns the Julian date from a
SAS date value
e.g.
sdate = ’01feb99’d;
jul = juldate(sdate); <==
will return value 99032
LEFT
Left aligns a SAS character
expression.
e.g.
sd = ’ feb day’;
p = left(sd); <==
will return value with trailing blanks
LENGTH
Returns the length of an argument
as the right-most non-blanks character in the argument. If the value is missing the length returns a
value of 1. If the argument is an uninitialized numeric
variable, it returns a value of 12.
e.g.
p = lenght(‘sdqwert ’);
<==
will return value of 7 for variable p
MAX
Returns largest of the
non-missing arguments.
e.g.
p = max(‘sdqwert’); <==
will return value of 7 for variable p
LENGTH
Returns the length of an
argument. If the value is missing the
length returns a value of 1.
e.g.
p = left(‘sdqwert’); <==
will return value of 7 for variable p
PUT
Specifies an output format for a
value. The result of a put function is always a character string. This is
useful for converting a numeric value to a character value.
SUBSTR
Syntax : substr(argument,position<,n>)
If used on the right side of an
assignment statement, it returns a portion of an expression you specify as
argument.
e.g.
part = substr(var1,3,6); <== will return character
string of length 6 from the 3rd position.
partend = substr(var1,3); <== will
return character string from the 3rd position till end.
If used on the left side of an
assignment statement, it places the value of the expression on the right side
of the assignment statement into the argument of SUBSTR, replacing n characters starting with the character
you specify in position.
e.g.
var= ‘CATNAP’;
substr(var,1,3) = ‘KID’; <== will return value of ‘KIDNAP’ into
variable var.
Put var;
SYSPARM
The SYSPARM function lets you
access a character string specified with the SYSPARM = system option in the job
control for your job or in an OPTIONS statement.
Procedures
Append
The APPEND proc adds the
observations from one dataset to the end of another.
Proc
append base = myfile new =
newfile (where ( x=2 ) ) force;
Options
BASE : names the dataset to which
observations are added. If not found a new data set with the name specified is
created
DATA : names the SAS dataset
with observations which is to be added to the end of the BASE= data set.
WHERE : limits the observations
selected from DATA= dataset that are to be appended to BASE= dataset.
FORCE : forces PROC APPEND to
concatenate datasets when the DATA= dataset contains variables that are either
·
not in the BASE= dataset or
·
do not have the same type as the variables in
the BASE= dataset or
·
are longer than the variables in the BASE=
dataset.
The advantage over SET operation
is that it bypasses processing
of data in BASE= dataset and adds new observations directly to the end of the
BASE= dataset.
Compare
This proc compares the contents
of two datasets or compare the values of different variables within a single
dataset to produces a variety of reports.
Proc compare
base = myfile (where (state = ‘NC’))
compare =yourfile;
var student birth state major;
with student1 birth5 state3;
out
= newfile;
·
matching variables : variables with the same
name or those explicitly paired by VAR and WITH statements.
·
Matching observations : observations that have the same values for
all variables that occur in the same position in the datasets.
PROC COMPARE compares the
following in order:
1.
Data set attributes
2.
Variables
3.
Attributes of variables
4.
Observations
5.
Values in pairs of observations that match
It produces
1.
An output dataset
2.
Printed reports
3.
A numeric return code stored in the automatic macro
variable &SYSINFO
Contents
This Proc provides information
about a SAS data library or individual files in a SAS data library.
libname person ‘ library-dataset ’;
Proc contents
data=person . _all_
memtype = view
nods position;
out = newfile;
run;
run;
DATA : specifies the SAS dataset
or library whose information is to be determined. The information of all the files in the
library having type as specified by MEMTYPE= option is gained by using the
keyword _all_
MEMTYPE : Specifies one or more
types of members in the SAS data library.
It can take the following values:
ACCESS – access files created
using SAS/ACCESS software
ALL – all member types
CATALOG – catalogs
DATA – SAS datasets
PROGRAM – stored compiled SAS
programs
VIEW – views created using SQL
procedures
NODS : Only the SAS data library is printed if this
option is used.
OUT : gives the name of the
output SAS dataset. The output dataset
contains information similar to that given in the variable description section
in the printed output
POSITION: The default order of
listing variables names in the SAS dataset is alphabetical. This option prints a second list of variables
names in the order of their position in the dataset.
Datasets
This Proc is used to list, copy,
rename and delete SAS files and to manage indexes and append SAS datasets in a
data library. It also provides all the
capabilities of the APPEND, CONTENTS, and COPY procedures.
Some of the differences in DATASETS
compared with other procs:
1.
The input library
is specified in the LIBRARY= option
2.
Statements are executed in the order of writing.
3.
Groups of statements can execute without a RUN
statement.
4.
There is a dependence of some statements on other
statements. E.g. the SELECT & EXCLUDE statements can only be executed
immediately after a COPY statement, FORMAT, INFORMAT, LABEL, RENAME, INDEX
CREATE, INDEX DELETE can only be used
after a MODIFY statement
5.
The DATASET procedure remains active until you type one
of the following :
·
QUIT
·
RUN CANCEL
·
A new PROC or DATA statement
If a syntax error is encountered
the RUN group containing the error is not executed.
libname drink ‘dataset1 ’;
libname eat ‘dataset 2 ’;
Proc datasets
library = eat memtype = data;
Copy out = drink move;
Select custard icecream;
Run;
Delete sandwich;
Run;
Change apple = apricot;
Run;
Modify hamburger;
Rename price =
rate;
quit;
CHANGE : Use CHANGE <old_name> = <new_name> statement to rename one
or more members
COPY : Use COPY
OUT = <from_lib> to copy members from one library to another. The MOVE option is used to delete from the input library
after copying
DELETE : Use DELETE statement to
specify members to be deleted from SAS library.
MODIFY : USE MODIFY statement to
change the attributes of the specifies datasets. Only one dataset name is allowed per modify
statement.
Formats
This proc is used to create your
own formats & informats. Options with FORMAT procedure can be used to print
the contents of a format library, create a control dataset for writing other
informats and formats or read a control dataset to create informats &
formats. Formats & informats gives
the SAS system information about data that is to be read or written. For example,
1.
Data type
2.
How many bytes it occupies
3.
Decimal placement for numbers
4.
How to handle leading trailing or embedded blanks or
zeros, etc.
A word immediately followed by a
period indicates a format or informat name.
They can have an optional width specification before the period. Numeric formats & informats can also have
an optional decimal specification after the period.
User defined formats convert a value to a different form for output E.g.
1.
Convert number to a character string – 1 to YES
2.
Convert from one character string to another character
string - YES to Y
3.
Specify a template to format the way a numeric value is
printed – print in the format of a telephone number.
User defined informats convert character
input values into a different form
1.
Convert a character number to a character string - 1 to YES
2.
Convert a character string to number - YES as 1(numeric)
3.
Convert a character string to another character string
- YES as Y
There are
two types of formats :
Value format : converts output values to a different
form. E.g.
Proc
Format;
value
SEX 1
= “Male” 2 =
“Female” ;
invalue $FRENCH ‘OUI’ = ‘YES’ ‘NON’ = ‘NO’;
In the first case, numeric variable values converted to
character form for output when the numeric
format SEX is used. Format stored as
number but formatted as character values. In the second case, values of OUI
& NON are stored as YES and NO by the SAS system when the character informat $FRENCH is used.
Picture
format : specify template for printing numbers giving specifics like leading
zeros, decimal & comma punctuation, fill characters, prefix & negative
number representation. Only applicable
to numeric values.
Proc Format;
picture PHONENUM
other = ‘000 / 000 - 000’;
picture FAX other = ‘0999 ) 999 - 999’ (
prefix = ‘(‘ );
Summary or Means
The summary procedure computes
descriptive statistics on numeric variables in a SAS dataset and outputs the
results to a new SAS dataset. The
difference between Means & Summary is that Summary does not produce any
printed output on its own. The summary output data set is typically printed
with PROC PRINT or is input to a DATA step that extracts the desired
information.
This procedure creates a SAS dataset containing
summary statistics or descriptive statistics on numeric variables. Each observation in the new dataset contains
the statistics for a different subgroup of the observations in the input
dataset representing all possible combinations of the levels of variables
specified in the CLASS statement.
PROC SUMMARY data = ht_wt;
var wt_loss ht_loss;
by sl_no;
class group;
id name;
output out = min_wt min(wt_loss )=;
output out = min_ht min(ht_loss)=;
run;
BY : A separate analysis on observations in the
group specified by the BY variables is obtained. The SAS dataset should be
sorted by the BY variables if the NOTSORTED option is not used.
CLASS : This specifies the
variables used to form sub-groups. The
level of interaction between the variables specified is obtained by this
statement. If a variable is taken into
account for a certain sub-group then it’s is assigned a binary value 1, else it
is 0. The decimal equivalent of this
binary numbers for a sub-group is the _TYPE_
value for that sub-group. The output
produces statistic info like
·
number of observations,
·
Mean ,
·
Std Deviation,
·
minimum &
·
maximum value
for that
sub-group for different values of _TYPE_.
If OUT= option is used, then this forms a single record in the output
dataset.
VAR: The variables in the dataset
for which statistics have to be calculated.
If this statement is not used then all variables except those in the BY,
CLASS, FREQ , ID and WEIGHT statements are analyzed.
ID : If additional variables from
the input dataset are to be included in the output dataset, then they can be
given with the ID statement.
e.g. The following code produces the result as :
Input file IN1:
code age date ind bit
1000 23 99123
Y
1
1001 . 99123
N 1
1003 32 99123
Y 0
1003 22 00123
Y
0
. 25 00123 Y 0
Code:
DATA IN2;
SET IN1;
PROC SUMMARY;
BY DATE NOTSORTED;
CLASS IND
BIT;
VAR AGE CODE;
OUTPUT OUT=IN3;
run;
will give :
DATE IND BIT
_TYPE_ _FREQ_ _STAT_
AGE CODE
99123 0 3 N 2.0000
3.00
99123 0 3 MIN
23.0000 1000.00
99123 0 3 MAX
32.0000 1003.00
99123 0 3 MEAN
27.5000 1001.33
99123 0 3 STD
6.3640 1.53
99123 0
1 1 N
1.0000 1.00
99123 0
1 1 MIN
32.0000 1003.00
99123 0 1 1 MAX
32.0000 1003.00
99123 0
1 1 MEAN
32.0000 1003.00
99123 0
1 1 STD
. .
99123 1
1 2 N
1.0000
2.00
99123 1
1 2 MIN
23.0000 1000.00
99123 1
1 2 MAX
23.0000 1001.00
99123 1
1 2 MEAN
23.0000 1000.50
99123 1
1 2 STD
. 0.71
99123 N
2 1 N
0.0000 1.00
99123 N
2 1 MIN
. 1001.00
99123 N 2 1 MAX
. 1001.00
99123 N
2 1 MEAN
. 1001.00
99123 N
2 1 STD
. .
99123 Y 2 2
N
2.0000 2.00
99123 Y 2 2 MIN
23.0000 1000.00
99123 Y 2 2 MAX
32.0000 1003.00
99123 Y 2 2 MEAN
27.5000 1001.50
99123 Y 2 2 STD
6.3640 2.12
99123 N
1 3 1 N
0.0000
1.00
99123 N
1 3 1 MIN
. 1001.00
99123 N
1 3 1 MAX
. 1001.00
99123 N
1 3 1 MEAN
. 1001.00
99123 N
1 3 1
STD
. .
99123 Y 0
3 1
N
1.0000
1.00
99123 Y 0
3 1 MIN 32.0000
1003.00
99123 Y 0
3 1 MAX
32.0000 1003.00
99123 Y 0
3 1 MEAN 32.0000
1003.00
99123 Y 0
3 1 STD
. .
99123 Y
1 3 1 N
1.0000
1.00
99123 Y
1 3 1 MIN
23.0000 1000.00
99123 Y
1 3 1 MAX
23.0000 1000.00
99123 Y
1 3 1 MEAN
23.0000 1000.00
99123 Y
1 3 1 STD
. .
123 0
2 N 2.0000
1.00
123 0 2 MIN
22.0000 1003.00
123 0 2 MAX
25.0000 1003.00
123 0 2 MEAN
23.5000 1003.00
123 0 2 STD
2.1213 .
123 0
1 2 N
2.0000
1.00
123 0
1 2 MIN
22.0000 1003.00
123 0
1 2 MAX
25.0000 1003.00
123 0
1 2 MEAN
23.5000 1003.00
123 0
1 2 STD
2.1213 .
123 Y 2 2 N
2.0000
1.00
123 Y 2 2 MIN
22.0000 1003.00
123 Y 2 2 MAX
25.0000 1003.00
123 Y 2 2 MEAN
23.5000 1003.00
123 Y 2 2 STD
2.1213 .
123 Y 0
3 2 N
2.0000
1.00
123 Y 0
3 2 MIN
22.0000
1003
123 Y 0
3 2 MAX
25.0000 1003
123 Y 0
3 2 MEAN
23.5000 1003
123 Y 0
3 2 STD
2.1213 .
Print
This Proc prints the observations in the dataset using some
or all of the variables. They can also
print totals and sub-totals for numeric variables.
Proc print data = person split = ‘*’
label n ;
label name=‘Associates*in the team’;
label name=‘Associates*in the team’;
var
name age team weight;
id
name;
by name;
by name;
pageby weight;
sumby
weight;
sum
age;
title ’ player details ’;
run;
title ’ player details ’;
run;
LABEL: variable labels are used
as column headings instead of variable names, if this option is used.
N : prints the number of
observations in the dataset at the end of printed output.
BY : the Procedure prints a
separate analysis for each variable in the BY group. It is required that the
dataset be sorted by BY variable.
PAGEBY : begins printing on a new page when the value of the
specified BY variable changes.
SUM : specifies the variables whose values is to be totaled.
VAR : names the
variables to be printed.
SQL
The SQL procedure implements the
Structured Query Language ( SQL) for SAS version 6 onwards.
Sample code
PROC SQL;
CONNECT TO DB2 (SSID=&SYS);
CREATE VIEW TEMP AS
SELECT * FROM CONNECTION TO DB2
(
SELECT T.ISCT_DT,
T.ISCT_CRED_USER_NBR,
T.ISCT_CRED_STR_ID,
T.ISCT_NBR,
T.ISCT_TOTL_AMT,
T.ISCT_CHRG_TYP_CD,
T.SLTKT_PUR_DT,
T.SLTKT_PUR_NBR,
/* more variables …… */
T.ISCT_CHRG_USER_NBR,
T.ISCT_CHRG_STR_ID,
0,
0,
' ',
' ',
' ',
' ',
0,
CURRENT DATE,
T.ISCT_CHRG_STR_ID,
' '
FROM
SI&OWN..STORE_TRSF T,
SI&OWN..VSTORE S,
SI&OWN..VSTORE B,
ST&OWN..STR_DLY_RPT D
WHERE
T.DRPT_ACPT_DT = &ED
AND T.ISCT_CHRG_TYP_CD = 'WTYRPR'
AND T.ISCT_CRED_USER_NBR =
'01'
AND T.ISCT_CRED_USER_NBR =
S.USER_NBR
AND T.ISCT_CRED_STR_ID = S.STR_ID
AND T.ISCT_CHRG_USER_NBR IN
('01','22')
AND T.DRPT_SALE_DT
= D.DRPT_SALE_DT
UNION ALL
SELECT T.ISCT_DT,
T.ISCT_CRED_USER_NBR,
T.ISCT_CRED_STR_ID,
/* more variables …… */
FROM
SI&OWN..STORE_TRSF T,
SI&OWN..TRSF_RFND_LN R,
SI&OWN..VSTORE S,
SI&OWN..VSTORE B,
ST&OWN..STR_DLY_RPT D,
MI&OWN..SKU_VRSN V,
ORDER BY 22, 39, 2, 3, 18, 43
);
DISCONNECT FROM DB2;
%PUT &SQLXRC;
%PUT &SQLXMSG;
QUIT;
DATA TEMP2;
SET TEMP (RENAME=(
ISCT_DT=ISCTDT ISCT_CRE=CREDUSR ISCT_CR0=CREDSTR
ISCT_NBR=ISCTNBR ISCT_TOT=ISCTAMT
ISCT_CHR=CHRGTYPE
SLTKT_PU=ORGTKTDT
SLTKT_P1=ORIGTKT SLTKT_PY=PAYTYPE ISCT_CH2=CHGTOUSR
ISCT_CH3=CHGTOSTR JNL_ACCT=JACCTNBR ICRT_NBR=ICRTNBR
SRVC_TKT=SERVINV SRVC_CMP=WARREPDT SLMKR_IN=SLMKINTL
DRPT_ACP=DRACPTDT DRPT_SAL=DRDATE ISCT_RFN=RFNREASN
ISCT_CUS=CUSTNAME ISCT_CMN=FFCOMENT EXPRESSN=SKUNBR
/*rename all the variables from input
dataset to a 8 character length name before reading in*/
EXPRES10=SKUSMDES EXPRES11=VCHRAMT EXPRES12=SLTKTDT
EXPRES13=SLTKTNBR EXPRES14=SLTKTTM EXPRES15=ACCTNBR
EXPRES16=VCHRDESC ACCTG_DI=ACCTDIV ACCTG_RE=ACCTREG
ACCTG_17=ACCTDIST ACCTG_GR=ACCTGRP STR_TELE=STAREACD
STR_TE18=STEXCHNO STR_TE19=STSTANO ISCT_C20=CHRTOUSR
ISCT_C21=CHRTOSTR EXPRES22=CLASSCD DRPT_MAN=MANFLAG
)) END=LAST;
IF _N_ = 1
THEN DO;
%PUT &SQLXRC;
%PUT &SQLXMSG;
END;
RUN;
QUIT;
%PUT &SQLXRC;
%PUT &SQLXMSG;
Another Example:
PROC SQL;
TITLE "JOIN
HOUSEHOLDS";
CREATE TABLE
WORK2.CBLHHLD AS
SELECT * FROM HHLD1
AS BASEDS INNER JOIN WORK1.HHVIEW AS APNDS
ON
BASEDS.HHLDID=APNDS.HHLDID
ORDER BY HHLDID;
This code is equivalent to the SAS MERGE step as:
DATA WORK2.CBLHHLD;
MERGE HHLD1(IN=BASEDS) WORK1.HHVIEW (IN=APNDS);
BY HHLDID;
IF BASEDS & APNDS THEN OUTPUT WORK2.CBLHHLD;
RUN;
Note:
Through PROC SQL it is not possible to get merged observations for records
where the BY variable (WHERE clause) has values only in one dataset (or TABLE)
as WHERE condition forces it to retrieve records where values coincide from the
two tables. But through MERGE, missing
values of BY variable are merged together to form observations in the output. So in order to simulate the SQL statement,
the merge criteria should be for common values of both indicator variables.
SAS Macro Language
SAS Macro language is a language
in its own rights. It is not part of the
proper SAS language but can act on the SAS language.
The simplest macrolanguage object
is the macrovariable. It is a set of characters that are
identified by name. Macroexpressions
include macrovariables, constant text, macro functions and macro operators. Macro
statements act on macro languages in certain ways. Finally, the Macro is a stored macrolanguage object.
Macro Variables
Macro
variables (or symbolic variables) belong to the SAS macro language and are
different from Data step variables. You
can define and use macro variables anywhere in a SAS program, except in data
lines. Also, the value of a dataset
variable depends on the observation being processed. Macro variable, on the other hand, contains
one value that remains constant until explicitly changed. A Macro variable is independent of the SAS
Data set.
%let dsn= Newdata;
The value of a macro variable is simply a string of
characters. The macro processor does not
make a distinction between character and numeric values. To refer to the macro variable value the
pattern &name (called macro variable reference) is used.
TITLE “Display of dataset &dsn”;
The macro processor resolves references in double quotes but
not in single quotes.
It is also possible to create macro variable values that
contain sections of SAS program as,
%let plot= %str(
proc print data = Newdata;
run;
);
&plot
The statements have to be enclosed in %STR() function so
that semicolons within the value are part of the text and not the end of the
%let statement.
A null value assigned to a macro variable has a length of
0. So, the macro processor replaces a
reference to a null value with 0 characters.
The simplest way to display macro variables is to use %PUT
statement as,
%put
!!!&dsn!!!;
will be resolved as,
%put
!!!Newdata!!!;
Macros
A macro is stored text identified by name. The %MACRO statement must begin every macro
and must contain a name for the macro.
The %MEND statement closes every macro.
The macro name can also appear after %MEND for clarity. To invoke a macro, place a % in front its
name, as
%macro dsn(lvar,fvar);
proc print data =&lvar..&fvar ;
run;
%mend dsn;
%dsn (work.sasdsn)
This pattern is called a macro invocation or a macro call. A macro is an entity in a utility catalog in
a library, usually WORK. SAS System does
not support copying or renaming macros.
A macro variable defined in parenthesis in a %MACRO statement is a macro
parameter. The macro processor matches
the first value to the first macro variable name, the second to the second,
etc.
Some other SAS Products
SAS/CONNECT software is a cooperative processing product that
through connections between remote SAS sessions, provides the ability to
transfer data among operating platforms supported by the SAS System.
It also provides remote
submission capabilities that allow users to submit SAS code to another host for
processing.
SAS/AF software is an interactive facility used to create
user-friendly windowing applications.
The software can be used to develop computer-based training courses and
on-line help systems. Ready-to-use
procedures construct menu screens from fill-in-the-blank panels, and transport
screens between operating environments.
The ACCESS procedure allows you to create access descriptors and view
descriptors that can be used to operate on data in a DBMS (like DB2) table or
view through the SAS system procedures.
The ACCESS procedure creates SAS files of type ACCESS and VIEW that can
be used in SAS programs. These files describe a DB2 table or view to the SAS
System so that the data contained in the table or view may be read, analyzed,
used, and updated by the SAS system.
There are two basic types of files created by the ACCESS procedure:
ACCESS and VIEW. An ACCESS file is a file of descriptive information relating
to a DB2 table. The ACCESS file describes the data in the table to the SAS
System. VIEW files can identify a
subset of the data described by
the ACCESS file. The data specified in a VIEW file can be used by the SAS
System in much the same way as a SAS data set is used. You can also create
ACCESS files for DB2 views, thus allowing you to read data from several DB2
tables at once. However, you cannot modify data read through a DB2 view.
You can also access data from
many relational DBMSs using the SQL Procedure Pass-Through Facility. The SQL procedure is a base SAS procedure
that works with SAS/ACCESS software to send and receive data directly between a
DBMS and the SAS System. You can also
store your Pass-Through code in a PROC SQL view for later use. The SQL
Procedure Pass-Through facility consists of three statements and one component:
CONNECT statement establishes a connection with the DBMS; EXECUTE statement
sends dynamic, non-query SQL statements to the DBMS; DISCONNECT statement
terminates the connection with the DBMS. You use the CONNECTION TO component in
the FROM clause of a PROC SQL SELECT statement to retrieve data directly from a
DBMS. You can use the Pass-Through
Facility statements in a PROC SQL query or you can store them in a PROC SQL
view. When you create the view, any
options that you specify in the corresponding CONNECT statement are also
stored. Thus, when the PROC SQL view is
used in a SAS program, the SAS System can establish the appropriate connection
to the DBMS.
SAS/ETS software is an econometric and time series analysis tool
for forecasting, planning and financial modeling. There are procedures for time series
analysis, linear and non-linear systems simulation,
loan amortization, depreciation
and row-and-column financial reporting.
SAS/ETS software is a component of the SAS System.
SAS/FSP software, a component of the SAS System, includes
procedures for full-screen interactive data entry, editing, query and letter
writing. With SAS/FSP software, users
can generate personalized form
letters and reports, customize
data presentation screens, create data entry applications that include cross
validation of field values and other data manipulation.
SAS/GRAPH software offers device-intelligent color graphics for
producing charts, maps and plots in a variety of patterns. Users can customize graphs with the software,
and present multiple graphs on a
page. SAS/GRAPH software is a component of the SAS
System, an applications system for data access, management, analysis, and
presentation.
SAS/OR software - an operations research, project management and
decision support tool - handles general assignment problems; performs critical
path analysis and linear programming; and determines minimum cost flow, maximum
flow, and shortest path through a network.
SAS/OR software is a component of the SAS System.
SAS/QC software, a component of the SAS System, offers a variety of
specialized tools for statistical quality improvement applications. Included are procedures for generating
Shewhart, cumulative sum and
moving average control charts,
for performing process capability analysis, and for experimental design.
SAS/STAT software, a comprehensive statistical analysis tool,
offers a wide range of capabilities including analysis of variance, regression,
categorical analysis, multivariate analysis, survival analysis, psychometric
analysis, cluster analysis and nonparametric analysis. SAS/STAT software is a component of the SAS
System.
SAS/SHARE software provides concurrent update access to SAS files.
"Concurrent update access" means that two or more users simultaneously
update the same SAS file or SAS data library. SAS/SHARE software is a component
of the SAS System.
Comments
Post a Comment