Statistics Index

Statistics Samples and variables

Contents: Introduction; Symbols; Samples and Populations; Variables; Sample median; Sample mean, variance and standard deviation; Distribution mean, variance and standard deviation; Sample relative frequency



Introduction

Experiments usually yield a sequence of results, observations, or values, which are generally numbers.  The list of these values is called a sample.  When information is needed about a large (possibly infinite) quantity of data (the population), a relatively small sample drawn from that population allows the whole population to be analysed more conveniently.  To obtain useful results, the sample must be analysed using basic statistical methods.  The most important results of this analysis are the average value and the spread.

The notes below provide some information on the calculations used for obtaining the statistics of the sample.




Symbols

In the notes below the probability distributions relate to the probability of successes.  In practice they could equally relate to failures, or to outcomes with a desired value (e.g. dice throw = 6).

μ = population mean
σ² = population variance
σ = population standard deviation
f(x) = probability density function
F(x) = probability distribution function: probability of at most x successes
m = sample median
n = sample size
x_i = discrete value of random sample
x_m = arithmetic mean of sample
n_i = absolute frequency of x_i
s_x² = variance of sample
s_x = standard deviation of sample
fs_i = relative frequency of x_i = n_i / n



Samples and Populations

Consider a large quantity of items which have been produced by a manufacturing process.  It may be too expensive to inspect all of the items, and therefore only a sample of the items is inspected.  Conclusions can then be drawn with respect to all of the items produced (the population).  If the sample is 100 items from a population of 10 000 items and 5 of the sample are defective, then it is reasonable to assume that about 5% of the population (500 items) are defective.  This inference is clearly very approximate and depends on the randomness of the sample selection.

There are several good reasons that we use samples to study populations; chief among them are feasibility and cost.  For instance, in a nationwide political survey of the population of all voters in the United Kingdom, it would be difficult, if not impossible, to poll every voter.  It would also be quite expensive.  Statistical theory shows that a survey of about 1,000 carefully selected voters can suffice to represent the opinions of the millions of people in the population of voters.

Random sampling is a way to reduce bias in sample selection.  For example, to pick a random sample of 100 people out of a population of 1,000, you might put all 1,000 names in a hat, then draw 100 of them.  Every member of the population then has an equal chance of being selected.
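The hat-drawing idea can be sketched in Python with the standard `random.sample` function, which draws without replacement so that every member is equally likely to be chosen.  This is only an illustration under stated assumptions: the population below is hypothetical (10 000 items, of which 500, i.e. 5%, are marked defective), and the seed is fixed purely so that the run is repeatable.

```python
import random

# Hypothetical population: 10 000 items, 500 of them defective (True).
population = [True] * 500 + [False] * 9500

rng = random.Random(1)                  # fixed seed, only for repeatability
sample = rng.sample(population, 100)    # simple random sample, no replacement

defective_in_sample = sum(sample)       # True counts as 1
estimated_fraction = defective_in_sample / len(sample)
estimated_defective = estimated_fraction * len(population)

print(f"defective items in sample: {defective_in_sample}")
print(f"estimated defective items in population: {estimated_defective:.0f}")
```

Running the sketch with different seeds shows how much the estimate varies from one sample of 100 to the next, which is why the inference from a single sample is only approximate.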


Variables

There are two types of variables: discrete and continuous.

A discrete random variable can only take distinct, separate values, e.g. 0, 1, 2, 3 etc.  Typical discrete variables include the number of children in a family, the attendance at a theatre, the number of patients in a hospital ward, and the number of defective bulbs in a box of ten.

A continuous random variable can take any value within a range, and so has an infinite number of possible values.  Continuous random variables are usually measurements, e.g. height, weight, the amount of sugar in an orange, or the time required to run a mile.


Sample Median

The sample median m is the middle value (in the case of an odd-sized sample), or average of the two middle values (in the case of an even-sized sample), when the values in a sample are arranged in ascending order.

Example: Consider the sample values 1,3,5,8,9. The sample median m is 5
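As a minimal sketch, the definition above can be written as a short Python function.  The function name `sample_median` and the even-sized sample (the example values with an extra value 12 appended) are assumptions made for illustration only.

```python
def sample_median(values):
    """Middle value of the sorted sample, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd-sized sample: single middle value
        return ordered[mid]
    # even-sized sample: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [1, 3, 5, 8, 9]
even_sample = [1, 3, 5, 8, 9, 12]       # hypothetical extra value 12

print(sample_median(odd_sample))        # 5
print(sample_median(even_sample))       # 6.5
```

The standard library function `statistics.median` gives the same results and would normally be used in practice.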


Sample Mean, Variance & Standard Deviation

Sample Mean

The arithmetic mean x_m of a sample of n elements x_1, x_2, ..., x_n is defined by the equation

$$ x_m = \frac{1}{n} \sum_{i=1}^{n} x_i $$

The arithmetic mean is very useful but does not give a clear picture of the spread of the variable values around the mean.  Consider two groups of seven numbers (n = 7):

1.2, 2.4, 3.2, 4.1, 3.3, 2.3, 1.5     x_m = 2.571
2.2, 2.4, 2.7, 3.0, 2.8, 2.6, 2.3     x_m = 2.571


The arithmetic mean is the same for both samples, but the second sample is much more tightly grouped around the average.  The deviation of a value is defined as the difference between the value and the arithmetic mean, (x_i - x_m).
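The two groups of seven numbers can be checked with a few lines of Python.  This is a sketch only; the names `group_a`, `group_b` and `arithmetic_mean` are chosen here for illustration.  It prints the mean of each group, the largest absolute deviation, and the sum of the deviations, which is zero apart from floating-point rounding.

```python
group_a = [1.2, 2.4, 3.2, 4.1, 3.3, 2.3, 1.5]
group_b = [2.2, 2.4, 2.7, 3.0, 2.8, 2.6, 2.3]

def arithmetic_mean(values):
    """Sum of the values divided by the sample size n."""
    return sum(values) / len(values)

for name, group in (("A", group_a), ("B", group_b)):
    x_m = arithmetic_mean(group)
    deviations = [x - x_m for x in group]       # (x_i - x_m) for each value
    largest = max(abs(d) for d in deviations)
    print(f"group {name}: x_m = {x_m:.3f}, "
          f"largest |deviation| = {largest:.3f}, "
          f"sum of deviations = {sum(deviations):.1e}")
```

Both groups give the same mean, but the largest deviation in the second group is much smaller, which is the difference in spread that the mean alone does not show.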

Sample Variance

It would be useful to know the average of the deviations, that is Σ (x_i - x_m) / n.
However, the sum of the deviations is always zero, so the deviations are squared to provide a useful value.
The variance s_x² is defined as

$$ s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - x_m)^2 $$

Note:
The divisor (n - 1) is used in this definition.  Other definitions use n.  For large samples (and populations) the use of n gives similar results to the use of (n - 1).  For small samples, (n - 1) gives a better estimate of the population variance, because dividing by n tends to underestimate it.

Sample Standard Deviation

The sample standard deviation is defined as the square root of the sample variance.

$$ s_x = \sqrt{s_x^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - x_m)^2} $$
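The following sketch computes the variance and standard deviation of the two groups of seven numbers given earlier, using both the (n - 1) divisor and the n divisor mentioned in the note.  The function name `sample_variance` and its `divisor_offset` parameter are assumptions made for this illustration; the standard library functions `statistics.variance` (divisor n - 1) and `statistics.pvariance` (divisor n) do the same job.

```python
import math

group_a = [1.2, 2.4, 3.2, 4.1, 3.3, 2.3, 1.5]
group_b = [2.2, 2.4, 2.7, 3.0, 2.8, 2.6, 2.3]

def sample_variance(values, divisor_offset=1):
    """Sum of squared deviations divided by (n - divisor_offset).

    divisor_offset=1 gives the (n - 1) definition, divisor_offset=0 the n definition.
    """
    n = len(values)
    x_m = sum(values) / n
    return sum((x - x_m) ** 2 for x in values) / (n - divisor_offset)

for name, group in (("A", group_a), ("B", group_b)):
    var_n1 = sample_variance(group)                      # divisor n - 1
    var_n = sample_variance(group, divisor_offset=0)     # divisor n
    print(f"group {name}: s_x^2 = {var_n1:.4f} (n-1), {var_n:.4f} (n), "
          f"s_x = {math.sqrt(var_n1):.4f}")
```

With the (n - 1) divisor the standard deviations come out at roughly 1.03 for the first group and 0.29 for the second, which puts a number on how much more tightly the second group is clustered around the same mean.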


Distribution Mean, Variance & Standard Deviation

The above equations apply specifically to samples for which the various outcomes are known and recorded.  When probability values are being evaluated for whole populations, or for an infinite number of random events, the mean is identified by the symbol μ and the variance by the symbol σ².

Details on the calculation of these values are provided in the relevant pages (ref. Discrete Distributions and Normal Distribution).


Relative frequency

In the above notes on the mean, variance and standard deviation, each value x_i in a sample of size n is considered to be a separate value, and each has a probability of 1/n of occurring.  In practice, however, a sample generally contains a number (n_i) of occurrences of x_i which are the same (discrete values) or fall within the same local range (continuous variables).  For each value x_i this number n_i is called the absolute frequency.

The relative frequency of each value x_i is identified as fs_i and is equal to the number of occurrences of x_i divided by n, that is fs_i = n_i / n.



Example: The following example illustrates the relative frequency function.

In testing the breaking strength of a thread, the following 100 load values are recorded.

Breaking     Number   Relative    Cumulative   Cumulative
force (N)             frequency   frequency    relative
                                               frequency
200            2      0.02          2          0.02
210            0      0.00          2          0.02
220            4      0.04          6          0.06
230            6      0.06         12          0.12
240           11      0.11         23          0.23
250           14      0.14         37          0.37
260           16      0.16         53          0.53
270           15      0.15         68          0.68
280            8      0.08         76          0.76
290           10      0.10         86          0.86
300            8      0.08         94          0.94
310            2      0.02         96          0.96
320            3      0.03         99          0.99
330            0      0.00         99          0.99
340            1      0.01        100          1.00

Considering the table of sample test results above:

n = 100
If a certain numerical value does not occur, e.g. force = 330 N, then its relative frequency is 0.
If all values are the same, then the absolute frequency n_i = n and the relative frequency (n/n) = 1.

The sample relative frequency is therefore never less than zero and never greater than 1.
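The frequency columns of the table can be reproduced from the recorded counts alone.  The sketch below assumes the counts as read from the table above (breaking forces 200 N to 340 N in steps of 10 N); the variable names are chosen for illustration.

```python
from itertools import accumulate

forces = list(range(200, 350, 10))      # 200, 210, ..., 340 N
counts = [2, 0, 4, 6, 11, 14, 16, 15, 8, 10, 8, 2, 3, 0, 1]   # absolute frequencies n_i

n = sum(counts)                                  # sample size
rel_freq = [n_i / n for n_i in counts]           # relative frequencies n_i / n
cum_count = list(accumulate(counts))             # cumulative frequency
cum_rel_freq = [c / n for c in cum_count]        # cumulative relative frequency

for row in zip(forces, counts, rel_freq, cum_count, cum_rel_freq):
    print("{:>5} {:>4} {:>6.2f} {:>5} {:>6.2f}".format(*row))
```

Every relative frequency lies between 0 and 1, and the last cumulative relative frequency is 1 because all n values have then been counted.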

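The relative frequency function f_s(x) takes, at each sample value x = x_i, the corresponding relative frequency fs_i, and is zero for any value that does not occur in the sample.  Therefore

$$ f_s(x) = \begin{cases} f_{s_i} & \text{if } x = x_i \quad (i = 1, \dots, k) \\ 0 & \text{otherwise} \end{cases} $$

A sample of size n can include k numerically different values (k ≤ n).  The sum of all relative frequencies is 1, that is

$$ \sum_{i=1}^{k} f_{s_i} = 1 $$

The sample mean is obtained from the relative frequencies as shown below

$$ x_m = \sum_{i=1}^{k} x_i f_{s_i} $$

The sample distribution function F_s(x) is equal to the sum of all relative frequencies of the values not exceeding x

$$ F_s(x) = \sum_{x_i \le x} f_{s_i} $$

The sample variance can be expressed in terms of the relative frequencies as follows

$$ s_x^2 = \frac{n}{n-1} \sum_{i=1}^{k} (x_i - x_m)^2 f_{s_i} $$

This can be simplified, by expanding the square and using the two sums above, to

$$ s_x^2 = \frac{n}{n-1} \left[ \sum_{i=1}^{k} x_i^2 f_{s_i} - x_m^2 \right] $$

For large samples, where n/(n - 1) is close to 1, this can be further simplified to

$$ s_x^2 \approx \sum_{i=1}^{k} x_i^2 f_{s_i} - x_m^2 $$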
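As a hedged numerical check, the thread-strength data can be used to evaluate the mean, the variance and the distribution function from the relative frequencies.  The sketch assumes the usual forms x_m = Σ x_i fs_i and s_x² = n/(n - 1) Σ (x_i - x_m)² fs_i, together with the shorter forms obtained by expanding the square; the names `f_s`, `F_s`, `var_def`, `var_short` and `var_large_n` are chosen here for illustration.

```python
import statistics

forces = list(range(200, 350, 10))      # breaking forces x_i in N
counts = [2, 0, 4, 6, 11, 14, 16, 15, 8, 10, 8, 2, 3, 0, 1]   # absolute frequencies n_i

n = sum(counts)
f_s = [n_i / n for n_i in counts]       # relative frequencies fs_i

# Sample mean from the relative frequencies.
x_m = sum(x * f for x, f in zip(forces, f_s))

# Variance: definition, simplified form, and large-sample approximation.
mean_square = sum(x * x * f for x, f in zip(forces, f_s))
var_def = n / (n - 1) * sum((x - x_m) ** 2 * f for x, f in zip(forces, f_s))
var_short = n / (n - 1) * (mean_square - x_m ** 2)
var_large_n = mean_square - x_m ** 2

def F_s(x):
    """Sample distribution function: sum of relative frequencies of values <= x."""
    return sum(f for x_i, f in zip(forces, f_s) if x_i <= x)

# Cross-check against the raw sample, with each value repeated n_i times.
expanded = [x for x, n_i in zip(forces, counts) for _ in range(n_i)]

print(f"x_m = {x_m:.1f} N")
print(f"s_x^2 = {var_def:.2f} (definition), {var_short:.2f} (simplified), "
      f"{var_large_n:.2f} (large-sample form)")
print(f"statistics.variance on raw sample = {statistics.variance(expanded):.2f}")
print(f"F_s(250) = {F_s(250):.2f}")
```

For this sample the mean is about 264.7 N and the variance about 720 N²; the large-sample form differs from the other two by the factor n/(n - 1) = 100/99, roughly 1%.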


