Draft Lecture Notes for Methods of Applied
Statistics (STA442H/1008H)
Jerry Brunner
April 30, 2002
Chapter 1
Introduction
This course is about using statistical methods to draw conclusions from real
data. It is deliberately non-mathematical, relying on translations of
statistical theory into English. For the most part, formulas are avoided. This
involves some loss of precision, but it also makes the course accessible to
students from non-statistical disciplines (particularly graduate students and
advanced undergraduates on their way to graduate school) who need to use
statistics in their research. Even for students with strong training in
theoretical statistics, the use of plain English can help reveal the
connections between theory and applications, while also suggesting a useful
way to communicate with non-statisticians.
We will avoid mathematics, but we will not avoid computers. Learning to
apply statistical methods to real data involves actually doing it, and the use
of software is not optional. Furthermore, we will not employ “user-friendly”
menu-driven statistical programs. Why?
• It’s just too easy to poke around in the menus trying different things,
  produce some results that seem reasonable, and then two weeks later
  be unable to say exactly what one did.
• Real data sets tend to be large and complex, and most statistical
  analyses involve a sizeable number of operations. If you discover a tiny
  mistake after you produce your results, you don’t want to go back and
  repeat two hours of menu selections and mouse clicks, with one tiny
  variation.
• If you need to analyze a data set that is similar to one you have analyzed
  in the past, it’s a lot easier to edit a program than to remember a
  collection of menu selections from last year.
Don’t worry! The word “program” does not mean we are going to write
programs in some true programming language like C or Java. We’ll use
statistical software in which most of the actual statistical procedures have
already been written by experts; usually, all we have to do is invoke them by
using high-level commands.
The statistical packages we will use in this course are SAS and S. These
packages are command-oriented rather than menu-oriented, and are very
powerful. They are industrial strength tools, and will be illustrated in an
industrial strength environment, unix. This is mostly for local convenience.
There are Windows versions of both SAS and S that work just as well as the
unix versions, except for very big jobs.
Applied Statistics really refers to two related enterprises. The first might
be more accurately termed “Applications of Statistics,” and consists of the
appropriate application of standard general techniques. The second enter-
prise is the development of specialized techniques that are designed specifi-
cally for the data at hand. The difference is like buying your clothes from
Walmart versus sewing them yourself (or going to a tailor). In this course,
we will do both. We’ll maintain the non-mathematical nature of the course
in the second half by substituting computing power and random number
generation for statistical theory.
1.1 Vocabulary of data analysis
We start with a data file. Think of it as a rectangular array of numbers,
with the rows representing cases (units of analysis, observations, subjects,
replicates) and the columns representing variables (pieces of information
available for each case).
• A physical data file might have several lines of data per case, but you
  can imagine them listed on a single long line.
• Data that are not available for a particular case (for example because
  a subject fails to answer a question, or because a piece of measuring
  equipment breaks down) will be represented by missing value codes.
  Missing value codes allow observations with missing information to be
  automatically excluded from a computation.
• Variables can be quantitative (representing amount of something) or
  categorical. In the latter case the “numbers” are codes representing
  category membership. Categories may be ordered (small vs. medium
  vs. large) or unordered (green vs. blue vs. yellow). When a
  quantitative variable reflects measurement on a scale capable of very fine
  gradation, it is sometimes described as continuous. Some statistical
  texts use the term qualitative to mean categorical. When an
  anthropologist uses the word “qualitative,” however, it usually means
  “non-quantitative.”
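The vocabulary above can be made concrete with a small sketch. The course itself uses SAS and S; Python is used here only as a neutral illustration. The data values, the variable names, and the missing value code -9 are all hypothetical, chosen just to show rows as cases, columns as variables, and missing values being excluded from a computation.

```python
# Hypothetical data file: each row is a case, each column is a variable.
# Column 0: age (quantitative)
# Column 1: size (categorical, ordered; codes 1=small, 2=medium, 3=large)
MISSING = -9  # assumed missing value code; real data sets define their own

data = [
    [23, 1],
    [35, MISSING],
    [MISSING, 3],
    [41, 2],
]

def column(rows, j):
    """Return variable j (one column) across all cases."""
    return [row[j] for row in rows]

def mean_excluding_missing(values, missing=MISSING):
    """Mean of the non-missing values; None if every value is missing."""
    present = [v for v in values if v != missing]
    return sum(present) / len(present) if present else None

ages = column(data, 0)
mean_age = mean_excluding_missing(ages)
print(mean_age)  # (23 + 35 + 41) / 3 = 33.0
```

Note that forgetting to exclude the code would silently treat -9 as a real age, which is one reason statistical packages handle missing value codes automatically.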
Another very important way to classify variables is:
Independent Variable (IV): Predictor = $X$ (actually $X_i$, $i = 1, \ldots, n$)
Dependent Variable (DV): Predicted = $Y$ (actually $Y_i$, $i = 1, \ldots, n$)
Example: $X$ = weight of car in kilograms, $Y$ = fuel efficiency in litres per
kilometer
Sample Question 1.1.1
Why isn’t it the other way around?
Answer to Sample Question 1.1.1
Since weight of a car is a factor that probably influences fuel efficiency,
it’s more natural to think of predicting fuel efficiency from weight.
The general principle is that if it’s more natural to think of predicting
A from B, then A is the dependent variable and B is the independent variable.
This will usually be the case when B is thought to cause or influence A.
Sometimes it can go either way or it’s not clear. Usually it’s easy to decide.
Sample Question 1.1.2
Is it possible for a variable to be both quantitative
and categorical? Answer Yes or No, and either give an example or explain
why not.
Answer to Sample Question 1.1.2
Yes. For example, the number of cars
owned by a person or family.
In some fields, you may hear about nominal, ordinal, interval and ratio
variables, or variables measured using “scales of measurement” with
those names. Ratio means the scale of measurement has a true zero point,
so that a value of 4 represents twice as much as 2. An interval scale means
that the difference (interval) between 3 and 4 means the same thing as the
difference between 9 and 10, but zero does not necessarily mean absence of
the thing being measured. The usual examples are shoe size and ring size.
In ordinal measurement, all you can tell is that 6 is less than 7, not by
how much. Measurement on a nominal scale consists of the assignment of
unordered categories. For example, citizenship is measured on a nominal
scale.
It is usually claimed that one should calculate means (and therefore,
for example, do multiple regression) only with interval and ratio data; it’s
usually acknowledged that people do it all the time with ordinal data, but
they really shouldn’t. And it is obviously crazy to calculate a mean on
numbers representing unordered categories. Or is it?
Sample Question 1.1.3
Give an example in which it’s meaningful to cal-
culate the mean of a variable measured on a nominal scale.
Answer to Sample Question 1.1.3
Code males as zero and females as
one. The mean is the proportion of females.
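A quick arithmetic check of this answer, as a sketch in Python (used here only for illustration; the five cases below are hypothetical, not course data):

```python
# Hypothetical sample of five cases; sex is a nominal variable,
# coded 0 = male and 1 = female as in the answer above.
sex = [0, 1, 1, 0, 1]

mean_sex = sum(sex) / len(sex)               # ordinary arithmetic mean of the codes
proportion_female = sex.count(1) / len(sex)  # direct count of females over n

print(mean_sex, proportion_female)  # both are 3/5 = 0.6
```

The two quantities agree because summing 0/1 codes is the same as counting the ones.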
It’s not obvious, but actually all this talk about what you should and
shouldn’t do with data measured on these scales does not have anything to
do with statistical assumptions. That is, it’s not about the mathematical
details of any statistical model. Rather, it’s a set of guidelines for what
statistical model one ought to adopt. Are the guidelines reasonable? It’s
better to postpone further discussion until after we have seen some details
of multiple regression.
1.2 Statistical significance
We will often pretend that our data represent a random sample from some
population. We will carry out formal procedures for making inferences
about this (usually fictitious) population, and then use them as a basis for
drawing conclusions about the data.
Why do we do all this pretending? As a formal way of filtering out
things that happen just by coincidence. The human brain is organized to
find meaning in what it perceives, and it will find apparent meaning even in
a sequence of random numbers. The main purpose of testing for statistical