The normal distribution, or bell curve (named for its shape), is one particular probability distribution of a random variable. Those are two complex terms in themselves, and are basic to understanding statistics. When a measurement is the combined result of many small, independent random effects, its values tend to cluster symmetrically around a mean value--this is the idea behind the central limit theorem--and a number of numerical tests can be performed on it.
In a bell curve, the vertical axis represents the relative likelihood of each value. Strictly speaking, it is a probability density rather than a probability: the total area under the curve is 1 (certainty), and the probability of landing in any range of values is the area under the curve over that range. For the standard bell curve (mean 0, standard deviation 1), the peak height is about 0.4. The horizontal axis is the data itself, ranging from a low to a high value. The top of the bell curve is reached at the center, at the mean value. The bell curve is a graphic display of the likelihoods that given events will occur: the probability distribution.
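To see where the figure of roughly 0.4 comes from, here is a minimal sketch that evaluates the standard formula for the height of the bell curve. The function name `normal_pdf` and its parameter names are my own choices for illustration, and it assumes a mean of 0 and a standard deviation of 1 unless told otherwise.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the bell curve at x, for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The curve is highest at the mean.
peak = normal_pdf(0.0)
print(round(peak, 4))  # 0.3989
```

The curve is symmetric, so `normal_pdf(-1.0)` and `normal_pdf(1.0)` give the same height, and both are lower than the peak.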
Data is random, or randomly distributed, when it varies without any set pattern. The variation might be discrete, with only a limited number of outcomes, like a coin flip, die roll or Y/N question. Or it might be continuous, varying smoothly, like a series of temperature measurements.
Statistical theory depends on this variation being random: the result of countless influences too small to measure, because reality is too complex for us to monitor or understand fully.
Randomly distributed data often follows the "normal" distribution pattern of the bell curve, at least approximately. Within a data set, there are three basic measures of the center: the mean, median and mode. The mean is the mathematical average. The median is the number in the middle when all values in the set are sorted; and the mode is the most common value. Consider this small set of numbers: [46, 46, 48, 51, 58]. The mean is 49.8, the median is 48, and the mode is 46. In a perfect normal distribution, the mean, median and mode are the same.
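If you would like to check those three numbers yourself, here is a short sketch using Python's built-in `statistics` module. The variable names are just for illustration.

```python
import statistics

data = [46, 46, 48, 51, 58]

mean = statistics.mean(data)      # sum of the values divided by how many there are
median = statistics.median(data)  # middle value once the data is sorted
mode = statistics.mode(data)      # most common value

print(mean, median, mode)  # 49.8 48 46
```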
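An important measure of the spread of data--how widely distributed it is--is the standard deviation (aka "sigma"). It is calculated by taking the difference of each data point from the mean, squaring each of the differences, taking the average of the squares, and then taking the square root of that average. To return to the above example, the squared differences average out to 20.16, so the standard deviation is about 4.49. (This is the "population" version, which divides by the number of data points; the "sample" version divides by one less than that and gives a slightly larger answer.)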
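The recipe above can be followed step by step in a few lines. This is a sketch of the population version (dividing by the number of data points), with variable names of my own choosing. The last two lines compare it against the `statistics` module's own functions.

```python
import math
import statistics

data = [46, 46, 48, 51, 58]

mean = sum(data) / len(data)               # 49.8
differences = [x - mean for x in data]     # distance of each point from the mean
squares = [d ** 2 for d in differences]    # square each difference
variance = sum(squares) / len(data)        # average of the squares
sigma = math.sqrt(variance)                # square root of that average

print(round(sigma, 2))                     # 4.49

# Cross-check with the standard library.
print(round(statistics.pstdev(data), 2))   # 4.49 (population: divides by n)
print(round(statistics.stdev(data), 2))    # 5.02 (sample: divides by n - 1)
```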
For normally distributed data, about 68.3% of all values fall within one standard deviation (one sigma) of the mean value, and about 95.4% fall within two standard deviations (two sigmas). When data is not normally distributed, whether due to a bias or some other condition, these rules of thumb can break down. If a plot of the probability distribution has a longer tail on one side than the other, it is said to be skewed: positively skewed when the long tail stretches toward the high end of the data, and negatively skewed when it stretches toward the low end.
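The one-sigma and two-sigma percentages are rounded versions of exact values that come from the normal distribution itself. Here is a minimal sketch that computes them with the error function, `math.erf`, from Python's standard library. The helper name `fraction_within` is hypothetical, chosen just for this example.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

print(round(fraction_within(1) * 100, 2))  # 68.27
print(round(fraction_within(2) * 100, 2))  # 95.45
```

The result does not depend on the particular mean or standard deviation, only on how many sigmas you go out from the center.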
[Figure: skewed distributions.]
Tomorrow: descriptive statistics. (Trust me, there’s a reason I’m teaching you this. We'll be dealing with data for most of the year, and statistical terms will come up frequently.)
Be well!