Basic statistics

VARIABLES

Qualitative vs Quantitative variables

Qualitative variables are categorical items, whereas quantitative variables have a numerical value associated.
Qualitative example: Blood group
Quantitative example: Temperature

 

Discrete vs Continuous variables

Quantitative variables can be split into discrete variables, which are counted and take exact, separate values, and continuous variables, which can take any value within a range and so can only be recorded to a chosen level of precision.
Discrete example: Number of children in a family
Continuous example: A length measured to the nearest cm

 

Class intervals

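Class intervals can be used to categorise continuous quantitative variables, e.g. lengths grouped into 10cm intervals. The intervals should not overlap, so decide which interval a boundary value belongs to, e.g. 0 to under 10cm, 10 to under 20cm, etc. This effectively turns the variable into an ordered categorical one, making it easier to summarise in your reports.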
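
A minimal Python sketch of grouping lengths into 10cm class intervals. The function name and the sample lengths are made up for illustration, and it assumes a value that falls exactly on a boundary (such as 10cm) belongs to the higher interval.

```python
def class_interval(length_cm, width=10):
    """Return the label of the interval [lower, lower + width) containing length_cm."""
    lower = int(length_cm // width) * width
    return f"{lower}-{lower + width}cm"


# Made-up lengths in cm, purely for illustration.
lengths = [3.2, 9.9, 10.0, 14.5, 27.1]
labels = [class_interval(x) for x in lengths]
print(labels)  # ['0-10cm', '0-10cm', '10-20cm', '10-20cm', '20-30cm']
```

Each label can then be counted like any other category.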

 

Primary data vs secondary data

Primary data is directly collected for the experiment, whereas secondary data comes from an external source.

 

Scaling

Qualitative variables can be summarised by scales as well as class intervals.

Nominal scales are unordered scales where the category names follow no logical order e.g. gender:
– Male
– Female

Ordinal scales are scales with a logical order e.g. survey responses of:
– Strongly Agree
– Agree
– Neutral
– Disagree
– Strongly Disagree

 

Univariate and bivariate data

Univariate data records a single variable for each individual or item. Bivariate data records two variables for each individual or item in the same population, usually to see whether they are related. The relationship between the two variables can be shown effectively using a scatter plot.

Univariate example: Attendance figures for a football team
Bivariate example: Attendance figures for a football team compared with matchday beer sale figures

 

Five-number summary

The five-number summary of a variable consists of its minimum value, the first quartile (Q1), the median (Q2), the third quartile (Q3) and its maximum value.
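
As a rough sketch, the five numbers can be computed with Python's standard statistics module. The function name and data are made up for illustration. Different tools use slightly different rules for quartiles, so the "inclusive" method chosen here is an assumption and other software may give slightly different Q1 and Q3 values.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" treats the data as the whole population of interest;
    # this is one of several quartile conventions.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data
print(five_number_summary(data))  # (1, 3.0, 5.0, 7.0, 9)
```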

 

Correlation

Correlation is a way of measuring the relationship between two variables. You can use the CORREL function in Excel to return the correlation coefficient between two variables.

A value of +1 indicates a perfect positive correlation, meaning an increase in one variable is associated with an increase in the other. A value of -1 is a perfect negative correlation, where an increase in one variable is associated with a decrease in the other, and a value near 0 indicates little or no linear relationship. However, it is important to remember that correlation does not necessarily imply causation.
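
For readers working outside Excel, here is a small Python sketch of the Pearson correlation coefficient, which is the quantity CORREL returns. The function name and the attendance and beer sales figures are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of deviations from each mean.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(spread_x * spread_y)


# Hypothetical figures: beer sales rise in step with attendance.
attendance = [100, 200, 300, 400]
beer_sales = [20, 40, 60, 80]
print(pearson_r(attendance, beer_sales))  # 1.0, a perfect positive correlation
```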

 

 

SAMPLING

Sample statistics

In an ideal world, you have data for the full population and can work with the overall distribution. However, this is rarely the case. Usually you only have a sample of the data to work from, so you use sample statistics such as the sample mean and sample standard deviation to estimate the parameters of the full population. Provided the sample is chosen at random, a larger sample generally gives more precise estimates.

A useful way to understand this is to imagine taking many random samples of the same size and recording the mean of each one. The distribution of those sample means is called the sampling distribution of the mean. When each sample is large enough, this sampling distribution is approximately normal whatever the shape of the overall population distribution, because of the central limit theorem.
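
The behaviour of sample means can be tried out with a small simulation. This is only a sketch: the population is randomly generated and deliberately skewed, and the function name, sample size and number of samples are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_means(population, sample_size, n_samples):
    """Draw n_samples random samples (with replacement) and return their means."""
    return [
        statistics.mean(random.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


# A made-up, strongly right-skewed population.
population = [random.expovariate(1.0) for _ in range(10_000)]
means = sample_means(population, sample_size=50, n_samples=1_000)

# The sample means cluster around the population mean...
print(statistics.mean(population), statistics.mean(means))
# ...and vary far less than the individual values do.
print(statistics.stdev(population), statistics.stdev(means))
```

Plotting a histogram of means should show a roughly bell-shaped curve even though the population itself is skewed.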

 

Cluster sampling

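A sampling technique which splits the population into groups, or clusters, often ones that already exist such as schools, towns or branches. Every individual in the population must belong to exactly one cluster, but the size of each cluster can vary. A random selection of whole clusters is then made, and either everyone in the chosen clusters or a random sample from within them is surveyed.

Example: A health questionnaire that randomly selects a handful of gyms and surveys the members of those gyms. Splitting respondents by how frequently they go to the gym (Regularly, Occasionally or Never) and taking a random sample from every group would instead be stratified sampling.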

 

Quota sampling

A technique frequently used for market research, in which the sample is built up, without random selection, until set quotas for particular groups have been filled.

Example: Ensuring 20 men and 20 women make up the sample of 40.

 

Convenience sampling

Also referred to as opportunity sampling, a sample made up of the easiest people to reach. This method risks failing to produce a truly representative sample of the population.

Example: A company polling customers who are already subscribed to its mailing list.

 

Population

The full collection of individuals or items under consideration in your statistical study. A finite population can be physically listed, e.g. a list of the books in a library. A hypothetical population cannot be listed because it does not fully exist yet, e.g. all the items a machine could produce in future, so conclusions about it are inferred from the items observed so far.

 

Parameter

A numerical summary of a population, such as the population mean. It is usually unknown and is estimated from a sample statistic.

 

 

DISTRIBUTIONS

Frequency distribution

A table listing all categories (or classes) and their frequencies.

 

Relative frequency

The frequency of a class divided by the total number of observations in the sample, expressed as a proportion or a percentage.

 

Relative frequency distribution

A table listing all classes and their relative frequencies, the total of which will equal 1 (100%).
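
A short Python sketch of building a frequency distribution and a relative frequency distribution. The blood group observations and the function name are made up for illustration.

```python
from collections import Counter


def relative_frequencies(observations):
    """Map each class to its frequency divided by the total number of observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Made-up sample of blood groups.
blood_groups = ["O", "A", "O", "B", "A", "O", "AB", "O"]

frequency = Counter(blood_groups)
relative = relative_frequencies(blood_groups)

print(dict(frequency))         # {'O': 4, 'A': 2, 'B': 1, 'AB': 1}
print(relative)                # {'O': 0.5, 'A': 0.25, 'B': 0.125, 'AB': 0.125}
print(sum(relative.values()))  # 1.0
```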

 

Sample distribution

The frequency distribution obtained when data is only available for a sample of the full population. The larger the random sample, the more closely it tends to resemble the frequency distribution of the population.

 

Normal distribution

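A distribution described by a symmetric, bell-shaped density curve, which is fully defined by its mean and standard deviation.

Example: The calories in a meal may be normally distributed with a mean of 200 calories and a standard deviation of 5 calories. In a normal distribution about 68% of observations fall within one standard deviation of the mean (195 to 205 calories here), about 95% within two and about 99.7% within three.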
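
To put numbers on the calorie example, Python's standard statistics module includes a NormalDist class. This sketch assumes the calories really are normally distributed with mean 200 and standard deviation 5; the helper function name is made up.

```python
from statistics import NormalDist

# Assumed model for the example: mean 200 calories, standard deviation 5.
calories = NormalDist(mu=200, sigma=5)


def share_within(dist, k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


print(share_within(calories, 1))  # about 0.683 (195 to 205 calories)
print(share_within(calories, 2))  # about 0.954 (190 to 210 calories)
print(share_within(calories, 3))  # about 0.997 (185 to 215 calories)
```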

 

T distributions

A type of probability distribution that resembles the standard normal distribution but has heavier tails, controlled by a parameter known as “degrees of freedom”. The fewer the degrees of freedom, the heavier the tails. As the degrees of freedom increase, the t distribution gets closer to the standard normal curve (mean 0, standard deviation 1).

 

Poisson distributions

A probability distribution which gives the probability of a given number of events occurring in a fixed interval of time or space, assuming the events happen independently and at a known average rate.

Example: The probabilities of the likely number of goals in a football match, based on averages taken from recent results.
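
The football example can be sketched in Python using the Poisson probability formula, P(k) = e^(-m) × m^k / k!, where m is the average number of events. The average of 2.5 goals per match is a made-up figure, not taken from real results, and the function name is hypothetical.

```python
import math


def poisson_pmf(k, mean_rate):
    """Probability of exactly k events when the average number of events is mean_rate."""
    return math.exp(-mean_rate) * mean_rate ** k / math.factorial(k)


average_goals = 2.5  # hypothetical average goals per match

for goals in range(6):
    print(goals, round(poisson_pmf(goals, average_goals), 3))
```

The probabilities for every possible number of goals (0, 1, 2, and so on) add up to 1.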

 

Skewness

A measurement of the asymmetry of the probability distribution of a random variable.

A distribution is skewed if one of its tails is longer than the other.

A positive skew is displayed if the majority of values are on the left of the distribution with a long tail on the right. This is also known as a right-skewed distribution because the outlier values are on the right. For example, the amount of rainfall per day is positively skewed because lots of days have no rainfall at all but some outlier days have large amounts.

A negative skew occurs if the longer tail is on the left, with values mainly at the higher end of the scale. A normal distribution has no skew at all, with skewness = 0.

When the skewness is low, the mean and median will not be very far apart. As a rule of thumb when measuring central tendency, a skewness above 1 or below -1 suggests the data is too skewed for the mean to be the best measurement, and the median is a better indicator of the typical value.

The SKEW function can be used to measure skewness in Excel.
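
Outside Excel, skewness can be sketched in a few lines of Python. This version uses the adjusted sample formula, n / ((n - 1)(n - 2)) × the sum of ((x - mean) / s)^3, which to my understanding is the one SKEW uses; treat that match as an assumption. The function name and rainfall figures are made up.

```python
import statistics


def sample_skewness(values):
    """Adjusted sample skewness; needs at least three values that are not all equal."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Made-up daily rainfall in mm: mostly dry days with one very wet outlier.
rainfall = [0, 0, 0, 0, 1, 2, 15]
print(sample_skewness(rainfall))         # positive, i.e. right-skewed
print(sample_skewness([1, 2, 3, 4, 5]))  # 0.0 for perfectly symmetric data
```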

 

Kurtosis

A measure of how heavy the tails of a distribution are compared with a normal distribution, i.e. the “tailedness” of the data. It is often described as the sharpness of the peak of a frequency distribution curve, but it is driven mainly by how many extreme values sit in the tails. There are different conventions for measuring kurtosis, but the purpose is to understand whether the distribution produces more or fewer extreme values than a normal distribution would.

 

 

THEORIES

Central limit theorem

A theorem in probability theory stating that, under certain conditions, the mean of a large number of independent observations will be approximately normally distributed, whatever the distribution of the individual observations.

 

Spearman’s Rank correlation coefficient

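A measure of correlation between two variables based on the rank order of their values rather than the values themselves. Results of either 1 or -1 demonstrate a perfect increasing or decreasing relationship between the ranks, and the nearer the value is to 0 the weaker the correlation.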
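
A minimal Python sketch of the textbook formula, rho = 1 - 6 × (sum of squared rank differences) / (n(n² - 1)). This shortcut assumes there are no tied values in either variable; the function names and data are made up for illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rank correlation coefficient using the no-ties shortcut formula."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Made-up data: y always rises with x, but not in a straight line.
print(spearman_rho([1, 2, 3, 4], [10, 20, 25, 400]))  # 1.0
print(spearman_rho([1, 2, 3, 4], [400, 25, 20, 10]))  # -1.0
```

Because only the ranks are used, the coefficient is 1.0 here even though the relationship is far from linear.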

 

Bayesianism

An approach to statistical inference in which probabilities are treated as “degrees of belief”, otherwise known as Bayesian probabilities. An initial belief about the probability of each outcome is updated as data becomes available.

 

Capture / Recapture method

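A method of estimating population sizes by capturing a sample, marking and releasing it, and later capturing a second sample. The proportion of marked individuals in the second sample is assumed to match the proportion of the whole population that was marked the first time, which gives an estimate of the total.

Example: Estimating the number of fish in a lake (the method is most often used for populations of a particular species).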
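
In its usual form the method marks and releases a first catch, then counts how many marked individuals turn up in a second catch. The simplest estimate, known as the Lincoln-Petersen estimator, is (number marked in first catch × size of second catch) / number of marked individuals recaptured. The sketch below uses made-up fish counts and assumes the population does not change between the two catches and that marked fish mix back in evenly.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Estimate population size from a single mark and recapture study."""
    if recaptured == 0:
        raise ValueError("no marked individuals recaptured, so no estimate is possible")
    return marked_first * caught_second / recaptured


# Hypothetical lake: 100 fish marked, then 80 caught later, 20 of them marked.
print(lincoln_petersen(100, 80, 20))  # 400.0
```

Here a quarter of the second catch was marked, so the 100 marked fish are estimated to be a quarter of the whole population.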

 

 

NOTATION

Pr(A)

This represents the probability of ‘A’ happening e.g. on the toss of a coin Pr(Heads)=0.5.

 

X-bar

X-bar (x̄) is the symbol used to represent the sample mean, which is used as an estimate of the population mean.
