David B. Howell
In “The Statistics Sampler” (Bibliography No. 5), I developed a sequence of activities and lessons to introduce some basic ideas of sampling. The major mathematical underpinning for the lessons was the Central Limit Theorem. Except for counting, tallying, ratio, percent, graphing, and mean, however, NO formal mathematical skills or concepts were required of students. Intuitive ideas of random sampling, of the effects of changing sample size, and of confidence in predictions were the main outcomes expected of students.
In this Unit, I am extending the topic of sampling through activities which make more specific the concepts of, and relationships among, confidence limits, error tolerance, and sample size. Two related formulas are given so that the student, using a calculator, can apply the ideas, developed informally through the activities, to similar situations. It is assumed that the student has the prerequisite arithmetic skills and the base experience in sampling developed in “The Statistics Sampler.” Also required is the ability to substitute values into a formula and to evaluate the result using a calculator. As with “The Statistics Sampler,” I would expect the Unit to lie within the grasp and capability of most regular education students from Grade 7 through Grade 12.
Here is the kind of problem I want to solve:
1. Before I buy an ad for my new shoelaces on MTV, I want to know the percent of high school students who watch MTV more than one hour per week. Obviously I can’t poll every high school student. In order, then, to predict the percent on the basis of a sample, I first need to know
(a) How many high school students do I need in my sample?
(b) If I want my answer accurate to plus or minus 10%, how will that affect the sample size? How will that affect the confidence level?
(c) I want to be very confident about my answer. I can’t afford the expense of being 100% confident, though. Maybe 95% “sure” is good enough. How will that affect sample size? How will that affect accuracy?
Here are two more similar problems:
2. Sandra “Dunk-em” Smith is trying to decide whether to run for Student Government president in her high school. It would help her decide if she knew approximately what percent of the over 2000 students in her school know who she is. Three of her friends are willing to take a poll of a random sample of students. To be 95% sure that they have predicted an answer within plus or minus 5%, how many students should they poll?
3. Willie B. Ready works at Awesome Auto Parts. He has noticed that some of the air filters from a particular supplier are ripped. He thinks there may be as many as 2 defective filters out of every 10. He wants to be quite sure about that, however, before he goes to the trouble of changing suppliers. He decides to take a random sample of the air filters received next month to get an approximation to how many are defective. He wants to be 95% confident the percent of defective filters he predicts is accurate to within plus or minus 3%. How many does he need to sample?
The objective of this Unit is for students to be able to answer questions like the three just posed.
The formal mathematics is not really terribly difficult. We’re trying to predict P, the percent or proportion of a large population which has a certain characteristic. In the three problems above, the characteristics are (1) watches MTV more than one hour per week, (2) has heard of “Dunk-em” Smith, and (3) is a defective air filter. We will make the prediction of P based on the percent, p, of “successes” (a “success” is a member of the population having the characteristic) in the random sample. In other words, we’ll get
$P \in \{\, p \pm E \,\}$;
that is, we will predict P as being within the range of p plus or minus an error tolerance, E. The prediction is one in which we will have some specific degree of confidence. The degree of confidence will depend on E, how wide an error tolerance we will accept, and the size of our sample. The larger the sample, the higher our degree of confidence in the prediction. The wider the error tolerance, the higher our degree of confidence.
E, then, is related both to the degree of confidence and to the sample size. It is also related to the percent or proportion of successes in our sample.
$E = z\sqrt{pq/n}$
z = standard normal coefficient at the stated confidence level
p = proportion of successes in the sample of size n
q = 1 - p (the proportion of failures in the same sample)
n = number of members in the sample
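As a rough illustration of how the formula can be put to work on the problems posed earlier, the sketch below evaluates E for a given sample and also rearranges the formula to solve for n. The rearranged form n = z²pq/E² follows algebraically from the formula for E. The function names, the table value z = 1.96 for 95% confidence, and the choice of p = 0.5 when no guess for p is available (the value that makes pq largest, and so gives the most cautious sample size) are assumptions made for this example, not part of the text above.

```python
import math

Z_95 = 1.96  # standard normal coefficient for 95% confidence (usual table value)

def error_tolerance(p, n, z=Z_95):
    """E = z * sqrt(p*q/n), with q = 1 - p."""
    q = 1 - p
    return z * math.sqrt(p * q / n)

def sample_size(p, E, z=Z_95):
    """Smallest whole n whose error tolerance is at most E (n = z^2 * p*q / E^2, rounded up)."""
    q = 1 - p
    return math.ceil(z ** 2 * p * q / E ** 2)

# Problem 3: Willie guesses 2 defective filters in 10 (p = 0.2), wants +/-3% at 95%.
print(sample_size(0.2, 0.03))   # 683

# Problem 2: no guess for p is given, so p = 0.5 is assumed here; +/-5% at 95%.
print(sample_size(0.5, 0.05))   # 385

# Going the other way: a sample of 100 with p = 0.5 gives E of about 0.098 (roughly +/-10%).
print(error_tolerance(0.5, 100))
```

Rounding up rather than to the nearest whole number keeps the error tolerance at or below the target.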
The more general statistical formula is
(figure available in print form)
where µ = the population mean
$\bar{X}$ = the sample mean
$z_\alpha$ = the standard normal coefficient corresponding to the confidence level $\alpha$
$\sigma$ (or $s$) = the standard deviation of the population, if known (or the standard deviation of the sample)
n = the size of the sample
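The print figure is not reproduced here. Judging from the symbol definitions, it is presumably the standard large-sample confidence interval for a mean. The block below shows that standard form, offered as an assumption rather than a copy of the original figure, together with the step that connects it to the proportion formula: when each member of the population is scored 1 for a success and 0 for a failure, the mean is the proportion and the standard deviation is the square root of pq.

```latex
% Standard large-sample interval for a population mean (assumed form of the print figure)
\mu = \bar{X} \pm z_{\alpha}\,\frac{\sigma}{\sqrt{n}}
\qquad \text{(with } s \text{ in place of } \sigma \text{ when only the sample is known)}

% Proportion case: 0/1 data have mean p and standard deviation \sqrt{pq}
P = p \pm z\,\frac{\sqrt{pq}}{\sqrt{n}} = p \pm z\sqrt{\frac{pq}{n}},
\qquad \text{so } E = z\sqrt{\frac{pq}{n}}
```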
Several references in the Bibliography, especially Devore and Peck (2), Edwards (3), and Mason (7), provide the details connecting the general case to the population proportions of binomial distributions.
Teachers should be aware of several issues I am ignoring here. The issues seem to me to cloud the fundamental concepts we want the students to grasp. Those fundamental concepts are: (1) We can predict the population parameters from sample values. (2) The size of the sample, the error tolerance, and the degree of confidence are interrelated. (3) Smaller error tolerances either lower the degree of confidence or require a larger sample. Higher degrees of confidence require either larger error tolerances or larger samples. (4) We can, for a given predicting need, specify two of the values we have been discussing and calculate the third value. There are simple formulas we can use.
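Concept (4) can be sketched in a few lines: fix any two of error tolerance, confidence, and sample size, and solve E = z√(pq/n) for the third. This is only an illustrative sketch. The function names are made up for the example, and it uses Python's `statistics.NormalDist` to look up z instead of a printed table, which gives slightly more decimal places than the usual 1.96.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def z_for(confidence):
    """Two-sided standard normal coefficient, e.g. about 1.96 for 0.95."""
    return STD_NORMAL.inv_cdf((1 + confidence) / 2)

def solve_error(p, n, confidence):
    """Given sample size and confidence, find the error tolerance E."""
    return z_for(confidence) * math.sqrt(p * (1 - p) / n)

def solve_sample_size(p, E, confidence):
    """Given error tolerance and confidence, find the sample size n (rounded up)."""
    return math.ceil(z_for(confidence) ** 2 * p * (1 - p) / E ** 2)

def solve_confidence(p, n, E):
    """Given sample size and error tolerance, find the degree of confidence."""
    z = E / math.sqrt(p * (1 - p) / n)
    return 2 * STD_NORMAL.cdf(z) - 1

# Fixed sample of 100 with p = 0.5: narrowing E lowers the confidence.
for E in (0.10, 0.05):
    print(E, round(solve_confidence(0.5, 100, E), 3))

# Fixed E = 0.05 with p = 0.5: raising the confidence raises the sample size.
for confidence in (0.90, 0.95, 0.99):
    print(confidence, solve_sample_size(0.5, 0.05, confidence))
```

The two loops mirror concept (3): with the sample held fixed, a tighter tolerance costs confidence, and with the tolerance held fixed, more confidence costs a larger sample.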
“...issues I am ignoring...” Yes. (1) Does the form of the population distribution matter? Maybe. The Central Limit Theorem, however, guarantees us that regardless of the distribution of the original population, the distribution of sample means (which is what we’re basing prediction on) tends to the normal distribution as the sample size increases. Therefore the binomially distributed cases we are concerned with in this Unit can be treated as normal distributions. (2) Suppose the population is relatively small? Or the sample size is small? Or the ratio of sample size to population is relatively large? Or we do or do not know the population variance? Then we would need to worry about correction factors, t-tests and other hypothesis-testing tools, and whether the standard deviation is defined in terms of n or n-1. The application problems posed and attacked in this Unit do not get into such detailed complications. If your class, or a student, wants to tackle such situations after the points in this Unit are understood, then by all means explore the potential problems and solutions. All of the references in the Bibliography agree that
if both $np \ge 5$ and $nq \ge 5$
then the simple procedures presented here are appropriate.
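The rule of thumb is easy to check before relying on the formulas. The following minimal sketch assumes that p is either the proportion seen in the sample or an advance guess; the function name and the sample values are illustrative only.

```python
def normal_approximation_ok(n, p, threshold=5):
    """True when both n*p and n*q (q = 1 - p) reach the threshold."""
    q = 1 - p
    return n * p >= threshold and n * q >= threshold

# 683 filters with about 2 in 10 defective: np = 136.6 and nq = 546.4.
print(normal_approximation_ok(683, 0.2))   # True

# Only 20 items with p = 0.1: np = 2, which is too few expected successes.
print(normal_approximation_ok(20, 0.1))    # False
```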
Each of the following Lessons should take from two to five class periods, depending on the task efficiency in sampling, the sophistication of discussion, and the skill levels for charting, graphing, finding percent, rounding off, combining smaller samples into larger ones, etc. The Lesson outlines deal with content, not management or individual differences or testing. All of the Lessons are described in a format similar to that of “The Statistics Sampler”:
A. Objective
B. The experimental question: a question which involves statistical sampling
C. Issues and some possible resolutions: a mini-lecture, a series of questions which should arise, a conversation between teacher and class. There is occasionally a direct comment to the teacher in brackets, or a lesson continued for illustration with my data. This section really defines the activity.
D. Observations and discussion to Objective: more questions or mini-lecture or summary or dialogue relating specifically to the stated Objective or bridging to the next activity.
The lesson numbering continues from “The Statistics Sampler.”