Draft Lecture Notes for Methods of Applied

Statistics (STA442H/1008H)

Jerry Brunner

April 30, 2002

Chapter 1

Introduction

This course is about using statistical methods to draw conclusions from real

data. It is deliberately non-mathematical, relying on translations of sta-

tistical theory into English. For the most part, formulas are avoided. This

involves some loss of precision, but it also makes the course accessible to students

from non-statistical disciplines (particularly graduate students and advanced

undergraduates on their way to graduate school) who need to use statistics

in their research. Even for students with strong training in theoretical statis-

tics, the use of plain English can help reveal the connections between theory

and applications, while also suggesting a useful way to communicate with

non-statisticians.

We will avoid mathematics, but we will not avoid computers. Learning to

apply statistical methods to real data involves actually doing it, and the use

of software is not optional. Furthermore, we will not employ “user-friendly”

menu-driven statistical programs. Why?

• It’s just too easy to poke around in the menus trying different things,

produce some results that seem reasonable, and then two weeks later

be unable to say exactly what one did.

• Real data sets tend to be large and complex, and most statistical anal-

yses involve a sizeable number of operations. If you discover a tiny

mistake after you produce your results, you don’t want to go back and

repeat two hours of menu selections and mouse clicks, with one tiny

variation.

• If you need to analyze a data set that is similar to one you have analyzed

in the past, it’s a lot easier to edit a program than to remember a

collection of menu selections from last year.

Don’t worry! The word “program” does not mean we are going to write

programs in some true programming language like C or Java. We’ll use

statistical software in which most of the actual statistical procedures have

already been written by experts; usually, all we have to do is invoke them by

using high-level commands.

The statistical packages we will use in this course are SAS and S. These

packages are command-oriented rather than menu-oriented, and are very

powerful. They are industrial strength tools, and will be illustrated in an

industrial strength environment: unix. This is mostly for local convenience.

There are Windows versions of both SAS and S that work just as well as the

unix versions, except for very big jobs.

Applied Statistics really refers to two related enterprises. The first might

be more accurately termed “Applications of Statistics,” and consists of the

appropriate application of standard general techniques. The second enter-

prise is the development of specialized techniques that are designed specifi-

cally for the data at hand. The difference is like buying your clothes from

Walmart versus sewing them yourself (or going to a tailor). In this course,

we will do both. We’ll maintain the non-mathematical nature of the course

in the second half by substituting computing power and random number

generation for statistical theory.
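
As a rough illustration of letting random number generation stand in for a formula, the sketch below estimates a simple probability by simulation and compares it with the exact binomial answer. Python is used only for illustration (the course software is SAS and S), and the function name, the coin-tossing setup, the number of trials and the seed are all assumptions made for this example.

```python
import random
from math import comb


def simulate_tail_probability(n_tosses=10, threshold=8, n_trials=100_000, seed=442):
    """Estimate P(at least `threshold` heads in `n_tosses` fair tosses) by simulation."""
    rng = random.Random(seed)  # fixed seed (arbitrary choice) so the run is repeatable
    hits = 0
    for _ in range(n_trials):
        # Count heads in one simulated set of tosses.
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        if heads >= threshold:
            hits += 1
    return hits / n_trials


# Exact answer from the binomial formula, shown only for comparison: 56/1024.
exact = sum(comb(10, k) for k in range(8, 11)) / 2**10
estimate = simulate_tail_probability()
print(f"simulated: {estimate:.4f}   exact: {exact:.4f}")
```

With enough simulated trials the estimate should land close to the exact value, which is the sense in which computing power can substitute for theory.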

1.1 Vocabulary of data analysis

We start with a data file. Think of it as a rectangular array of numbers,

with the rows representing cases (units of analysis, observations, subjects,

replicates) and the columns representing variables (pieces of information

available for each case).

• A physical data file might have several lines of data per case, but you

can imagine them listed on a single long line.

• Data that are not available for a particular case (for example because

a subject fails to answer a question, or because a piece of measuring

equipment breaks down) will be represented by missing value codes.

Missing value codes allow observations with missing information to be

automatically excluded from a computation.

• Variables can be quantitative (representing amount of something) or

categorical. In the latter case the “numbers” are codes representing

category membership. Categories may be ordered (small vs. medium

vs. large) or unordered (green vs. blue vs. yellow). When a quan-

titative variable reflects measurement on a scale capable of very fine

gradation, it is sometimes described as continuous. Some statisti-

cal texts use the term qualitative to mean categorical. When an

anthropologist uses the word “qualitative,” however, it usually means

“non-quantitative.”
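
The vocabulary above can be sketched in a few lines of code. This is a minimal illustration in Python, not the SAS or S syntax used in the course; the variable names, the data values and the choice of -9 as a missing value code are all hypothetical.

```python
# Hypothetical data file: each row is a case, each column is a variable.
# -9 is an assumed missing value code. The size codes 1/2/3 stand for
# small/medium/large, an ordered categorical variable.
MISSING = -9

variables = ["id", "weight_kg", "size_code"]
data_file = [
    [1, 1200, 1],
    [2, 1850, 3],
    [3, MISSING, 2],  # weight was not recorded for this case
    [4, 1500, 2],
]


def mean_excluding_missing(rows, column, missing_code=MISSING):
    """Return (mean, number of cases used), skipping cases with the missing code."""
    values = [row[column] for row in rows if row[column] != missing_code]
    if not values:
        raise ValueError("no non-missing values for this variable")
    return sum(values) / len(values), len(values)


mean_weight, n_used = mean_excluding_missing(data_file, variables.index("weight_kg"))
print(f"mean weight = {mean_weight:.1f} kg from {n_used} of {len(data_file)} cases")
```

Without the check for the missing value code, the -9 would be averaged in as if it were a real weight, which is why statistical packages treat such codes specially.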

Another very important way to classify variables is

Independent Variable (IV): Predictor = $X$ (actually $X_i$, $i = 1, \ldots, n$)

Dependent Variable (DV): Predicted = $Y$ (actually $Y_i$, $i = 1, \ldots, n$)

Example: X = weight of car in kilograms, Y = fuel efficiency in litres per

kilometer

Sample Question 1.1.1 Why isn’t it the other way around?

Answer to Sample Question 1.1.1 Since weight of a car is a factor that

probably influences fuel efficiency, it’s more natural to think of predicting fuel

efficiency from weight.

The general principle is that if it’s more natural to think of predicting A

from B, then A is the dependent variable and B is the independent variable.

This will usually be the case when B is thought to cause or influence A.

Sometimes it can go either way or it’s not clear. Usually it’s easy to decide.

Sample Question 1.1.2 Is it possible for a variable to be both quantitative

and categorical? Answer Yes or No, and either give an example or explain

why not.

Answer to Sample Question 1.1.2 Yes. For example, the number of cars

owned by a person or family.

In some fields, you may hear about nominal, ordinal, interval and

ratio variables, or variables measured using “scales of measurement” with

those names. Ratio means the scale of measurement has a true zero point,

so that a value of 4 represents twice as much as 2. An interval scale means

that the difference (interval) between 3 and 4 means the same thing as the

difference between 9 and 10, but zero does not necessarily mean absence of

the thing being measured. The usual examples are shoe size and ring size.

In ordinal measurement, all you can tell is that 6 is less than 7, not how

much less. Measurement on a nominal scale consists of the assignment of

unordered categories. For example, citizenship is measured on a nominal

scale.
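
The distinction between interval and ratio scales can be checked numerically. In the sketch below (Python for illustration only; the scores and the offset of 5 are hypothetical), moving the zero point of a scale leaves differences unchanged but alters ratios, which is why "twice as much" needs a true zero.

```python
def shift_zero(values, offset):
    """Re-express the same measurements on a scale whose zero point is moved."""
    return [v + offset for v in values]


original = [2, 4, 9, 10]           # hypothetical scores
shifted = shift_zero(original, 5)  # same measurements, different arbitrary zero

# Differences (intervals) are unchanged by the shift ...
diff_original = original[3] - original[2]
diff_shifted = shifted[3] - shifted[2]

# ... but ratios are not, so they are meaningful only with a true zero point.
ratio_original = original[1] / original[0]
ratio_shifted = shifted[1] / shifted[0]

print("differences:", diff_original, diff_shifted)
print("ratios:", ratio_original, ratio_shifted)
```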

It is usually claimed that one should calculate means (and therefore,

for example, do multiple regression) only with interval and ratio data; it’s

usually acknowledged that people do it all the time with ordinal data, but

they really shouldn’t. And it is obviously crazy to calculate a mean on

numbers representing unordered categories. Or is it?

Sample Question 1.1.3 Give an example in which it’s meaningful to cal-

culate the mean of a variable measured on a nominal scale.

Answer to Sample Question 1.1.3 Code males as zero and females as

one. The mean is the proportion of females.
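
A quick check of this answer, as a minimal sketch in Python (not the course software) with a made-up list of codes:

```python
# Hypothetical sample, coded 0 = male and 1 = female.
sex_codes = [1, 0, 0, 1, 1, 0, 1, 1]

# The mean of a 0/1 variable ...
mean_code = sum(sex_codes) / len(sex_codes)

# ... is the same number as the proportion of cases coded 1.
proportion_female = sex_codes.count(1) / len(sex_codes)

print(mean_code, proportion_female)
```

The two quantities agree because summing the codes simply counts the cases coded one.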

It’s not obvious, but actually all this talk about what you should and

shouldn’t do with data measured on these scales does not have anything to

do with statistical assumptions. That is, it’s not about the mathematical

details of any statistical model. Rather, it’s a set of guidelines for what

statistical model one ought to adopt. Are the guidelines reasonable? It’s

better to postpone further discussion until after we have seen some details

of multiple regression.

1.2 Statistical significance

We will often pretend that our data represent a random sample from some

population. We will carry out formal procedures for making inferences

about this (usually fictitious) population, and then use them as a basis for

drawing conclusions about the data.

Why do we do all this pretending? As a formal way of filtering out

things that happen just by coincidence. The human brain is organized to

find meaning in what it perceives, and it will find apparent meaning even in

a sequence of random numbers. The main purpose of testing for statistical
