Welcome to part 4, the final section of the tutorial on the SAS system for Windows.
In this video we will cover some basic procedures and how to capture
results of a data analysis.
I'll introduce you to a couple of the base
procedures that are helpful for managing and checking an analysis.
The first procedure is the Print procedure. This is a technique to cross-check the data
before analysis. Through this procedure you can obtain a listing of an electronic dataset
to confirm everything has been interpreted correctly and that the calculation of
new variables is correct. To list the data set, use the Proc Print command.
I'll include a Run statement.
What this command will do is provide a
listing of the most recently created dataset. Another useful procedure is the
Sort procedure. Through this procedure we can sort the dataset using one or more variables.
This is helpful when calculations are being performed or when
extreme values are being searched for. If you sort the data file by the variable
under question the missing and smallest values will appear at the top of the
list and the largest will appear at the bottom. That is quite helpful when the
data set runs to hundreds or thousands of lines.
I'll include the Sort procedure
before the Print procedure. In this example I have invoked a sort
and I am sorting the data set by the variable WT100. I have included a Run
statement to ensure that the sort is finished before the print.
Notice that I can place these statements all in one line so long as I am separating them
with semicolons.
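As a sketch (the sort variable is the WT100 variable from this example's dataset), the one-line and line-by-line forms are equivalent:

```sas
/* One line: statements separated by semicolons */
proc sort; by WT100; run; proc print; run;

/* The same statements, one per line (easier to edit) */
proc sort;
  by WT100;
run;
proc print;
run;
```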
Normally we would list them line by line, because it is a lot easier to edit when
checking for typing errors if the statements are on individual lines.
By default SAS will use the last dataset created when a procedure is specified.
Sometimes you may want to have it do the procedure with another dataset
other than the last one created.
This is where the naming of data sets comes into play.
In each procedure statement in the
SAS system you can point to a specific data set for the program to perform the analysis with.
To do this you include a data = option in the procedure line.
I did not have to do it for this example but I will illustrate how this
specification works. In the Sort procedure I will include the
data = Second in the command line,
and I will do the same in the Print procedure.
Now both the Sort and the Print procedures will be performed with the
Second dataset.
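A sketch of what those statements look like with the data= option pointing both procedures at the Second dataset:

```sas
/* Both procedures now operate on the Second dataset,
   not on the most recently created one */
proc sort data=Second;
  by WT100;
run;
proc print data=Second;
run;
```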
To complete a sequence of tasks in the editor window make sure
the last line has a Run statement with a semicolon.
To submit the series of
statements to the SAS program, you can either use the
Run-Submit option in the top ribbon, or click the running-person symbol on the toolbar below it.
SAS has now performed the series of steps and has presented the
results in the Results window. However, do not look at the results until you have
first checked the Log window to confirm that there are no problems in the analysis.
This is one limitation of the SAS program: it automatically takes
you straight to the results, before you have had a chance to review the log.
So I will now switch to the Log window and scroll to the top.
The log reports what SAS did with each statement. You will see it identified an
error in the first line and made a correction. It assumed the error was a
misspelling and gave a green warning.
If this change was fine
then you can proceed. Sometimes the self correction is not what you want.
The next section below was the creation of the first dataset. It found 20
observations and 4 variables were created. That should correspond to the
expected number of observations in the experiment. In this study there were 10
replicates of two species for a total of 20 so that is correct.
The rest of the
statements do not appear to have any errors. If you find red lines in the Log
window this indicates problems that SAS could not self-correct. You will need to
correct these errors before proceeding with interpretation of the analysis.
Since this log is acceptable we can go back to the Results window.
Here is the data listing with all 20 observations sorted by the
100 seed weight. We can also see that the values conform to our original data and
that the calculations of the new variables appear correct. The Log and Results
windows are cumulative windows. New results will be appended to whatever is
already there. If there are problems with an analysis use the Log window to
identify where you need to make the correction to your command lines.
For example, I will go back to the Editor window and I'll make my correction and resubmit the statements.
The error I had was in the first line, where I had misspelled the
command Title by leaving off the final e. Now that I've made that correction
I will now resubmit the statements. The results for this second submission have
been appended to the results from the first submission. Normally you would not
want to do that, especially if there are errors in the first submission. You don't
want to have the results file containing erroneous information. So to clear the
Results view, select the window and, using the ribbon options, select Clear-all,
which clears whichever window was active. Then go to the Log window
and do the same.
And now we can return back to the Editor window. By doing these steps we've now
cleared the Results window and also cleared the Log window so now if we
resubmit statements we will only have the results of the correct analysis and
we'll only be dealing with the Log report of that last analysis that we have conducted.
There are also two other base procedures that can help you with
data analysis. The first is the SGscatter procedure. It does not generate
publication-ready graphs, but it can be a quick way to visualize the distribution
patterns in the data. This is helpful in confirming trends or more critically
absence of trends in error distributions in an analysis. To generate a
scatterplot use the Proc SGscatter command. The SGscatter procedure lets
you do a scatter plot of any variable against any other variable. One of the
variables of interest in this dataset is the hundred seed weight, so the
WT100 variable is an obvious one to consider. We do have two
classification variables, one being the Species and the other the Replicate value,
and those are variables against which we could look at the distribution
patterns of the hundred seed weight. So what I will do is generate a
scatterplot of the hundred seed weights against those two variables.
In the Plot statement of this procedure, the variable on the left of
the asterisk is placed on the y-axis, or vertical
axis, of the graph. On the right-hand side you list one or more
variables against which plots will be created; by placing them in brackets we can
request multiple plots at once. With this statement the 100-seed weights
will be plotted against each replicate, with another plot of 100-seed
weights by Species. I'll now submit these commands.
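Those plot requests can be sketched as follows (Rep is the replicate variable in this dataset):

```sas
/* The y-variable goes left of the asterisk; the bracketed list
   on the right requests one plot per listed variable */
proc sgscatter;
  plot WT100*(Rep Species);
run;
```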
So now we have two scatter plots created.
One on the left is the hundred seed weight against replicate
and the one on the right is the hundred seed weight against species.
Through these plots you can start seeing some of the trends in the data. For example,
the sunflower values in terms of hundred seed weights are much less than the corn values.
But in a variance analysis we are much more interested in looking at the
residuals and making sure that they are randomly distributed, as
well as normally distributed. So instead of plotting the observed values, which
are the hundred seed weights, we would rather plot the residuals
of the analysis. So I will now switch back to the analysis and add the
statements to do a variance analysis of this study. In this particular experiment,
it is a completely random design with Species as a classification variable.
So I'll now do a
GLIMMIX analysis and divide the variation into that classification group.
The statements I've added will give us a variance analysis of the 100 seed
weights using the Proc GLIMMIX program. Species was a classification variable.
I've requested the means for the Species, and I've also generated an output dataset.
I've called that dataset Third, and this will have the individual
experimental unit values, as well as columns for the predicted values,
the residuals, and have also requested the studentized residuals. These
variables will be called Predicted, Residual, and Sresid in the dataset.
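The statements just described can be sketched as below; the output keywords shown are the standard PROC GLIMMIX ones, so treat the exact form as an approximation of what appears in the video:

```sas
proc glimmix;
  class Species;                      /* classification variable  */
  model WT100 = Species;              /* completely random design */
  lsmeans Species;                    /* means for each species   */
  output out=Third pred=Predicted    /* predicted values         */
         resid=Residual              /* raw residuals            */
         student=Sresid;             /* studentized residuals    */
run;
```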
I've now changed the SGscatter request so that it plots the
studentized residuals against Replication and Species.
I'll now submit these statements.
So now we have scatter plots of the studentized residuals for
each of the replicates, as well as each of the two species.
With the studentized residuals
we can now start seeing trends and patterns. There is one particular
value in the sunflower that seems to be very far apart in terms of its
distribution pattern relative to the rest of the observations.
This particular value
is a putative outlier; on examination of the dataset itself it turned out to be
an error in recording.
So these plots let us look at how randomly the residuals are distributed.
If you look at the sunflower and corn distributions, the
corn distribution is fairly wide, with generally a reasonable spread.
With a variance analysis one of the core assumptions that need to be met is
random distribution of residuals-- or random distribution of errors. The other
type of distribution pattern we also want to see is a normal distribution of
the errors or the residuals.
So I'll now switch back to the statements.
For this example I will add another procedure called
Proc Univariate and through this procedure I'll be able to generate a
graph of the distribution, with the expected normal distribution superimposed,
as well as perform a statistical test of normality.
I'll first invoke the Proc Univariate procedure.
And I've indicated a normal option as part of the procedure call.
What that does is generate statistical tests of normality.
I've also indicated what variable to do the test on, and that is the
variable: studentized residual.
In this procedure the Histogram statement generates a frequency
distribution of the variable indicated, which is the studentized residual, Sresid.
The normal and kernel options to that statement superimpose on the frequency
distribution the normal probability curve, as well as a curve that is based
on the data distribution itself. We can generate one of these graphs for all the
data pooled together, or we can subdivide it by a particular classification.
We can add another statement, BY Species, to this procedure.
So long as your dataset is sorted by Species, you can add BY Species,
or whatever variable or variables you are interested in, to get graphs
divided by that subdivision of your dataset. By placing all of these
statements together we can now generate a visual diagram of the distribution as
well as perform a statistical test of normality for each one of the Species
in terms of the studentized residuals in the analysis. In this particular
example, though, we are using the Third dataset to generate our graphs. The Third
dataset has not been sorted by Species; only the Second dataset was. So we'll
have to add a Sort procedure before this one to ensure that
it's sorted by Species.
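Putting that together, a sketch of the sort plus the Univariate request:

```sas
/* BY processing requires the dataset to be sorted first */
proc sort data=Third;
  by Species;
run;

proc univariate data=Third normal;   /* normal = tests of normality */
  by Species;                        /* one analysis per species    */
  var Sresid;
  histogram Sresid / normal kernel;  /* superimpose both curves     */
run;
```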
I'll now submit these statements.
I'll scroll back through the results to get
the distribution graph for corn. The frequency diagram is illustrated with
the bars. The blue line is the expected distribution based on normal probability,
and the red line is the kernel which is the distribution based on the data itself.
As you can see those two lines reasonably follow each other. Yes there
are a few little squiggles here and there, but visually the observed and
predicted distributions do not deviate markedly from each other.
But a better way of assessing whether that distribution differs from a normal distribution
is to apply a statistical test. So I will scroll back just a little bit further in the results,
to the summary of the statistical tests of normality.
In this procedure, there are four tests that are automatically generated.
The Shapiro-Wilk test is the first one listed
and it is generally applicable to a wide range of situations when you're dealing with a Gaussian-type variable.
In this particular test the W statistic is 0.93.
More importantly, we look at the P-value, which is 0.45. Using a typical
Type I error rate of 5%, that P-value is well above the threshold,
and we would not reject the null hypothesis that the distribution follows a normal distribution.
So as you go back and look at this distribution again,
what we are seeing visually corresponds to the
statistical test, indicating that the distribution pattern does follow a
normal distribution.
Now let's consider the sunflower.
Sunflower has a very different distribution.
You will see that there are two peaks visually.
The red line is
the distribution pattern using the kernel option; the blue line is what the
distribution would be expected to be if it were normal, based on
the variances inherent in that dataset.
So obviously, visually, there seem to be two peaks, and that would not be a
normal distribution. So as we go back and consider the test of normality,
again looking at the Shapiro-Wilk test the statistic is 0.54 and the P-value to that
statistic is less than 0.0001 so we would reject the null
hypothesis and declare that this distribution deviates from normality.
So once again the statistical test and our visual assessment of the
distribution correspond: the sunflower values in this dataset
are not following a normal distribution. The reason for
this, in this particular example, is because of that one observation--and that
observation was an outlier--because there was an error in recording the value.
If that data point is taken out then the distributions actually do follow a
normal distribution.
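The video does not show the removal step itself, but one way to sketch it is a data step that drops the flagged record; the cutoff of 3 for a studentized residual is a common rule of thumb, not a value from the video:

```sas
/* Hypothetical cleanup: drop putative outliers before re-analysis */
data Clean;
  set Third;
  if abs(Sresid) > 3 then delete;
run;
```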
I'll now illustrate how to capture
information for use in other programs.
Images in the results area can be
captured by hovering the mouse over the image, doing a right-click,
and saving the picture as a PNG or as a bitmap file,
which you can then use in other programs.
If you're dealing with tables (I'll just go up to the test for normality),
hover over the table, again do a right-click,
and you can export the table to Microsoft Excel. That way
you can very quickly capture a particular table out of the results area.
If instead you right-click outside the tables in the results area and export to Microsoft Excel,
you will get all possible tables that can be generated as an Excel spreadsheet.
This program will then let you go through that entire
result area and indicate which particular tables you want to export.
You can do a whole series at once.
You can also save the results using
the File Save-as option and export them as a web archive or in HTML format.
You could also print them, including printing to a PDF document.
So those are some different ways of taking the results and saving them in different formats
that you can then import into other applications.
SAS also has an export method for the electronic datasets themselves.
To export these go to
the ribbon and select File Export Data,
select the data set you want to export. In this case we have three electronic
data sets: one I've called First, one I've called Second, and one I've called Third.
Let's use the Third data set,
since that is the one that also has
the residuals and predicted values.
Click on Next and this lets you then define
what type of format and what type of program you are going to move this
electronic data set into.
The SAS system that's installed here is a 64-bit application,
and so if the Microsoft Office system on your
computer is also 64-bit, you can export to a Microsoft Excel workbook.
However, if your
Microsoft Office system is 32-bit, you cannot use that option.
You have to change the export format,
and for 32-bit systems one option that works
fairly reliably is the one called Microsoft Excel 5 or 95 workbook.
That seems to work with a whole series of the 32-bit systems.
Click on next and
then you get to choose where you're going to save the file
and indicate what file name
you want to use. I'll use the Browse button: in my
case I'm going to send it to the desktop, and I'm going to call this file Example.
Now I'll click on Finish, and that file has been generated.
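The same export can also be scripted rather than done through the wizard. This sketch assumes the SAS/ACCESS Interface to PC Files is licensed, and the path shown is illustrative, not from the video:

```sas
/* Scripted alternative to the File > Export Data wizard */
proc export data=Third
  outfile='C:\Users\me\Desktop\Example.xlsx'  /* illustrative path */
  dbms=xlsx replace;   /* dbms=excel5 targets the older 32-bit-era format */
run;
```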
I'll go to my desktop
and open up that file.
So this is the Excel file of the Third dataset.
You will see that it has the Rep and Species and WT100 variables that were in the
original data set we submitted to the SAS program.
It also has the two additional
variables that were calculated, the weight in milligrams and the natural log of
the 100 seed weight, as well as the predicted values, the residuals
from the variance analysis, and the studentized residuals.
And so you can use these in other applications.
I'll now switch back to the SAS editor window.
So now we have a set of statements that includes all of the data, all of the
calculations, all of the variance analysis, and all the procedures that we applied.
We can save these SAS statements with File, Save-as, giving
the file a slightly different name.
That way we can recall those statements at a future time for
further analysis of this study,
or for changing particular applications of the data.
Perhaps there is another run of this study we can add in
for further analysis, or we can recall all of these statements
for some other application.
An advantage of saving the statements along with the data
(here are the statements, and we've got the data along with them)
is that in future scenarios you don't have to go hunting for the
associated data files. The data and the code are all together. These are just
simple text files and they don't take up very much room.
A couple of other additional points about the SAS system.
When reporting on an analysis you always need
to convey what program and what version were used. To determine which version you
are using, select Help, and then About SAS, from the ribbon.
The window indicates
that this is SAS version 9.4, and that is what you would include in
the methods section of a paper or a thesis.
Also, the package includes a Help
reference to all procedures, statements, and options available.
Again, to access these,
under Help, go to SAS Help and Documentation, open up SAS Products, and
SAS Procedures Options gives you a summary of all procedures sorted by name.
You can now go to the particular procedure you would like help with.
So I will open up the GLIMMIX procedure.
The help files for procedures are all
organized the same way.
The system provides an Overview, Getting started,
the Syntax, Details, Example datasets, as well as all References related to the
procedure. Syntax gives links to all the possible
statements and all the possible options that are available to you.
Some of you may be doing a repeated-measures analysis
and you would have to indicate
what repeated measure covariance structure you would want to specify.
In GLIMMIX that is specified using the Random statement
so by following the hyperlink
for the Random statement you can then access what are the whole series of
options that are available to that statement.
So the table in this
particular procedure, Table 44.17, gives the entire list of options that are
available for that particular statement. This involves the covariance structure
itself, how smoothing is applied, and what statistical output you
are requesting from that procedure.
The covariance structure would be
specified using the Type = option. Following that hyperlink you can then
obtain the entire list of all covariance structures.
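As a hypothetical sketch (the variable names here are invented for illustration, not taken from the video), a repeated-measures covariance structure in GLIMMIX might be requested like this:

```sas
/* Hypothetical: first-order autoregressive residual covariance
   over time, with one subject per experimental unit */
proc glimmix;
  class Subject Time Treatment;
  model Y = Treatment Time Treatment*Time;
  random Time / subject=Subject type=ar(1) residual;
run;
```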
You will find hyperlinks to the original papers or published texts which explain
the applications and the limitations of the various options,
and so you can use those
to identify which is the most appropriate option for the type of data
or situation that you're encountering in your analysis.
This concludes the
tutorial on getting started with the SAS system. For those of you who are using
the University edition there is one additional introduction tutorial which
covers some of the unique features of that particular package.