## VARIABLES

### Qualitative vs Quantitative variables

Qualitative variables are categorical items, whereas quantitative variables have a numerical value associated.

Qualitative example: Blood group

Quantitative example: Temperature

### Discrete vs Continuous variables

Quantitative variables can be split into discrete variables, which are counted to an exact figure, and continuous variables, which can take any value within a range and so can only be measured to a chosen level of precision.

Discrete example: Number of children in a family

Continuous example: A length measured to the nearest cm

### Class intervals

Class intervals can be used to categorise continuous quantitative variables, e.g. lengths grouped at 10cm intervals: 0 to under 10cm, 10 to under 20cm, etc. The boundaries should not overlap, so that each value falls into exactly one class. Grouping effectively turns the variable into an ordered categorical one, making it easier to summarise in your reports.
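As a rough sketch of how values might be assigned to class intervals, the snippet below labels each length with a 10cm class. The sample lengths and the `class_interval` helper are made up for illustration, and the sketch assumes each class includes its lower boundary but not its upper one.

```python
# Hypothetical lengths in cm, invented purely for illustration.
lengths_cm = [3.2, 9.9, 10.0, 14.5, 27.1, 19.99, 0.0]


def class_interval(value, width=10):
    """Return the label of the class [lower, lower + width) containing value."""
    lower = int(value // width) * width
    return f"{lower}-{lower + width}cm"


# One class label per observation, in the original order.
labels = [class_interval(x) for x in lengths_cm]
print(labels)
```

Treating the upper boundary as exclusive means a length of exactly 10cm lands in one class only.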

### Scaling

Qualitative variables can be summarised by scales as well as class intervals. **Nominal scales** are unordered scales where the category names follow no logical order e.g. gender:

– Male

– Female

**Ordinal scales** are scales with a logical order e.g. survey responses of:

– Strongly Agree

– Agree

– Neutral

– Disagree

– Strongly Disagree

### Univariate and bivariate data

Univariate data consists of observations of a single variable. Bivariate data consists of two variables measured on the same individuals or units in a population, where one may depend on the other. The relationship between the two variables can be examined effectively using a scatter plot.

Univariate example: Attendance figures for a football team

Bivariate example: Attendance figures for a football team compared with matchday beer sale figures

### Five-number summary

The five-number summary of a variable consists of its minimum value, the first quartile (Q1), the median (Q2), the third quartile (Q3) and its maximum value.
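A minimal sketch of a five-number summary using Python's standard `statistics` module is shown below. The data values and the `five_number_summary` helper are illustrative assumptions. Quartile conventions differ between tools, so other software may give slightly different Q1 and Q3 values for the same data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" treats the data as the whole population of interest;
    # other quartile conventions exist and can give slightly different results.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values
summary = five_number_summary(data)
print(summary)
```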

## SAMPLING TECHNIQUES

### Cluster sampling

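A sampling technique which splits the population into naturally occurring groups, or clusters, such as schools, households or postcodes. Every individual in the population must belong to exactly one cluster, but the size of each cluster can vary. A random sample of clusters is then selected, and everyone (or a random sample) within the chosen clusters is surveyed. Taking a random sample from every group is instead known as stratified sampling.

Example: A health questionnaire that randomly selects a number of gyms and surveys the members of each selected gym. Splitting respondents by how frequently they go to the gym (Regularly, Occasionally or Never) and sampling from each group would be stratified sampling.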

### Quota sampling

A technique frequently used for market research, in which the sample is built up non-randomly until a specific quota for each category has been met.

Example: Ensuring 20 men and 20 women make up the sample of 40.

### Convenience sampling

Also referred to as opportunity sampling, this is a sample made up of the easiest people to reach.

Example: A company polling customers who are already subscribed to its mailing list.

## DISTRIBUTIONS

### Frequency distribution

A table listing all categories (or classes) and their frequencies.

### Relative frequency

The frequency of a class divided by the total frequency of the sample, expressed as a proportion or percentage.

### Relative frequency distribution

A table listing all classes and their relative frequencies, the total of which will equal 1 (100%).
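The sketch below builds a frequency distribution and a relative frequency distribution from a small list of blood groups. The observations are invented for illustration.

```python
from collections import Counter

# Hypothetical sample of blood groups, made up for this example.
blood_groups = ["O", "A", "O", "B", "A", "O", "AB", "O"]

# Frequency distribution: each class and how often it occurs.
frequency = Counter(blood_groups)

# Relative frequency: class frequency divided by the total frequency.
total = sum(frequency.values())
relative_frequency = {group: count / total for group, count in frequency.items()}

for group, share in relative_frequency.items():
    print(f"{group}: {frequency[group]} ({share:.1%})")
```

The relative frequencies sum to 1, which is a quick sanity check on the table.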

### Sample distribution

The frequency distribution obtained when data is only available for a sample of the full population. The larger the sample, the more closely it is likely to resemble the distribution of the whole population.

### Normal distribution

A variable graphically described with a bell-shaped density curve.

Example: The calories in a meal may be normally distributed with a mean of 200 calories and a standard deviation of 5 calories. In that case around 68% of meals would contain between 195 and 205 calories, around 95% between 190 and 210, and around 99.7% between 185 and 215.
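To illustrate, the following sketch uses `statistics.NormalDist` from the Python standard library to work out what share of a normal distribution with mean 200 and standard deviation 5 lies within one, two and three standard deviations of the mean. The variable names are illustrative.

```python
from statistics import NormalDist

# Normal distribution for the meal example: mean 200 calories, SD 5 calories.
calories = NormalDist(mu=200, sigma=5)

# Share of the distribution within 1, 2 and 3 standard deviations of the mean.
within_1_sd = calories.cdf(205) - calories.cdf(195)
within_2_sd = calories.cdf(210) - calories.cdf(190)
within_3_sd = calories.cdf(215) - calories.cdf(185)

print(f"{within_1_sd:.1%}, {within_2_sd:.1%}, {within_3_sd:.1%}")
```

For any normal distribution these shares come out at roughly 68%, 95% and 99.7%.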

### T distributions

A type of probability distribution that resembles the standard normal distribution but has heavier tails, with its shape controlled by a parameter known as “degrees of freedom”. The fewer the degrees of freedom, the heavier the tails; as the degrees of freedom increase, the distribution approaches the normal curve with a mean of 0 and a standard deviation of 1.

### Poisson distributions

A probability distribution which gives the probability of a given number of events occurring in a fixed timeframe, assuming the events happen independently and at a constant average rate.

Example: The probabilities of the likely number of goals in a football match, based on averages taken from recent results.
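As a sketch of the football example, the code below applies the Poisson formula P(k) = e^(-λ) λ^k / k! to an assumed average of 2.5 goals per match. Both the average and the `poisson_pmf` helper name are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average number of events is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


average_goals = 2.5  # hypothetical average taken from recent results

# Probability of 0, 1, 2, ... 5 goals in a match.
goal_probabilities = {k: poisson_pmf(k, average_goals) for k in range(6)}

for goals, probability in goal_probabilities.items():
    print(f"{goals} goals: {probability:.1%}")
```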

## MEASUREMENTS

### Mode

The most common value of a variable, based on its frequency. It can be found for either qualitative or quantitative data.

### Mean

The average value of a quantitative variable, calculated as the sum of the observed values divided by the number of observations.

### Median

The central value of a variable of quantitative data. Using the median instead of the mean lessens the impact of outliers.

Example: The median UK salary in 2017 was around £22,000, whereas the mean was closer to £26,500 because it was more heavily influenced by the large outliers among the top earners.
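The effect of an outlier on the mean and median can be seen in a small sketch. The salaries below are invented and are not the UK figures quoted above.

```python
import statistics

# Hypothetical salaries with one very large outlier.
salaries = [18_000, 20_000, 22_000, 24_000, 26_000, 250_000]

mean_salary = statistics.mean(salaries)      # pulled upwards by the outlier
median_salary = statistics.median(salaries)  # middle of the sorted values

print(mean_salary, median_salary)
```

Here the single top earner drags the mean well above what most people in the list earn, while the median stays close to the typical salary.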

### Outlier

A unit that falls far from the rest of the data, which can have a misleading impact on the mean.

### Leading indicators

An indicator that may signal a future event.

Example: A creche getting attached to a restaurant could lead to a higher reported accident rate for the restaurant.

### Lagging indicators

An indicator that follows an event.

Example: Reporting the recent performance of a company’s share price to predict what might happen to it in the future.

### Moving average

An average based on a specific time period which generates a trend-following (or lagging) indicator because it is based on the past.

Example: An opinion poll average based on the previous 10 days. Each day the oldest day drops off and the newest day is added.
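A simple moving average can be sketched as below. The `moving_average` helper and the poll figures are illustrative assumptions, and a 3-day window is used instead of 10 days to keep the example short.

```python
from collections import deque


def moving_average(values, window):
    """Return the average of each run of `window` consecutive values."""
    recent = deque(maxlen=window)  # the oldest value drops off automatically
    averages = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            averages.append(sum(recent) / window)
    return averages


daily_poll = [40, 42, 41, 43, 45, 44]  # hypothetical daily poll results (%)
trend = moving_average(daily_poll, window=3)
print(trend)
```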

### Range

The difference between the maximum and minimum values of a quantitative variable in a data set.

### Percentile

The observed values of a variable divided into hundredths. The first percentile (P1) divides the bottom 1% of values from the rest of the data set, the second percentile (P2) the bottom 2%, etc. The median is the 50th percentile (P50).

### Decile

The observed values of a variable divided into tenths. The second decile is the 20th percentile, represented as either D2 or P20.

### Quartile

The most common type of percentile used, dividing the observed values into quarters. There are three quartiles: Q1 divides at 25%, Q2 at 50% (the median) and Q3 at 75%.

### Interquartile range

The difference between the first and third quartiles of a variable (Q3 – Q1). This is the preferred measure of variation where there is a skewed distribution, in order to disregard outliers.
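The sketch below compares the range and the interquartile range for a small made-up data set containing one outlier. It uses `statistics.quantiles` with the "inclusive" method, which is only one of several quartile conventions.

```python
import statistics

# Hypothetical observations; 100 is a deliberate outlier.
data = [2, 4, 4, 5, 6, 7, 8, 9, 100]

# Quartiles split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

value_range = max(data) - min(data)  # heavily affected by the outlier
iqr = q3 - q1                        # unaffected by the outlier

print(value_range, iqr)
```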

### Standard deviation

The most frequently used measure of variability, showing how tightly the observed values cluster around the mean. It lets you describe how many of your results are within 1 standard deviation of the mean, how many within 2 standard deviations, etc. Standard deviation can easily be calculated within Excel.
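As an alternative to a spreadsheet, the following sketch calculates a sample standard deviation in Python and counts the share of values within 1 and 2 standard deviations of the mean. The data and the `share_within` helper are hypothetical.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations

mean = statistics.mean(values)
sample_sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)


def share_within(values, mean, sd, k):
    """Share of values lying within k standard deviations of the mean."""
    inside = [v for v in values if abs(v - mean) <= k * sd]
    return len(inside) / len(values)


within_1 = share_within(values, mean, sample_sd, 1)
within_2 = share_within(values, mean, sample_sd, 2)
print(round(sample_sd, 3), within_1, within_2)
```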

### Standard error

Calculated as se = s / √n (where se is the standard error of the mean, s is the sample’s standard deviation and n is the number of observations). It measures how far a sample mean is likely to be from the true population mean. If the sample means are normally distributed, you would expect about 68% of them to fall within one standard error of the population mean, 95% within two standard errors, and 99.7% within three standard errors.

Example: If you repeatedly averaged the numbers drawn by a fair National Lottery machine, you’d expect roughly 68% of those averages to be within 1 standard error of the true mean.
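A minimal sketch of the standard error of the mean, using the usual definition of the sample standard deviation divided by the square root of the number of observations. The sample values and the `standard_error` helper name are assumptions for illustration.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations
se = standard_error(sample)
print(round(se, 4))
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.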

### Probability

The proportion of times a particular outcome would occur in a long run of repeated observations.

### Point estimate

A single number calculated from the data set, that is the best single guess for an unknown parameter.

### Interval estimate

A range of numbers around the point estimate, within which an unknown parameter is expected to fall.

### Confidence interval

An interval estimate with an attached confidence level, such as 95%. If the sampling were repeated many times, about 95% of the intervals calculated this way would contain the unknown parameter.
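One common way to build a confidence interval for a mean is the point estimate plus or minus a multiple of the standard error. The sketch below assumes a normal approximation; for a small sample like this one a t distribution would usually be preferred. The data and the `confidence_interval` helper are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    # z is about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * se, mean + z * se


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations
lower, upper = confidence_interval(sample)
print(round(lower, 2), round(upper, 2))
```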

### Triangulation

Combining information from multiple sources to help arrive at the most accurate conclusion possible, often by testing the same hypothesis using numerous different methods.

## HYPOTHESES

### Hypothesis

A prediction or statement about a characteristic of a variable, which can be tested to provide evidence for or against.

### Significance test

A method of statistically testing a hypothesis by comparing data against values predicted by the hypothesis. The significance test considers two hypotheses, the null and the alternative, effectively testing whether the data provides significant evidence against the null hypothesis or whether the results could simply be due to chance.

### P-value

The probability, assuming the null hypothesis is true, of observing results at least as extreme as those in the sample. Prior to the test being performed an agreed threshold for the value should be chosen (known as the significance level), which will usually be somewhere between 1% and 5%. The P-value can then be measured against the significance level to provide evidence for or against the null hypothesis. As a general rule a P-value of 5% (0.05) or less is deemed to be statistically significant.
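As an illustration, the sketch below calculates a one-sided P-value for a hypothetical coin-flipping experiment, where the null hypothesis is that the coin is fair. The scenario and the `one_sided_p_value` helper are assumptions, not part of any particular library.

```python
from math import comb


def one_sided_p_value(heads, flips, p_null=0.5):
    """Probability of seeing at least `heads` heads if the null hypothesis is true."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


significance_level = 0.05  # threshold agreed before running the test

# Hypothetical result: 9 heads from 10 flips.
p_value = one_sided_p_value(heads=9, flips=10)
is_significant = p_value <= significance_level
print(round(p_value, 4), is_significant)
```

A result this extreme would be unlikely with a fair coin, so it counts as evidence against the null hypothesis at the 5% level.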

### Null hypothesis

The hypothesis that is directly tested during a significance test, usually stating that there is no effect or no difference.

### Alternative hypothesis

Contradicts the null hypothesis. The alternative hypothesis is supported if the significance test provides enough evidence to reject the null hypothesis.

### Control group

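The group in an experiment who do not receive the treatment being tested, providing a baseline for comparison. Ideally participants are randomly assigned to either the control group or the treatment group.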

### Treatment group

The group in an experiment who receive the treatment or condition being tested. Their results can then be compared against the control group to provide evidence for or against the hypothesis.

## ANALYSIS

### Descriptive statistics

A branch of statistics which consists of methods for organising and summarising information. Descriptive statistics uses the collected data and provides factual findings from it.

### Inferential statistics

Inferential statistics consists of methods for drawing conclusions and measuring their reliability based on information gathered from a sample of a population. This can be used for forecasting future events based on previous information gathered.

### Inference

Making predictions and generalisations about a population based on data collected from a sample.

### Regression analysis

A statistical process for estimating the relationships among variables, based on the data you have available to you. Excel has an in-built tool for regression analysis.

### R Square

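A result of a regression analysis showing the proportion of the variance in one variable that is explained by the other variable(s). A value of 1 would demonstrate a perfect fit between the variables, and a value of 0 no linear relationship.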

Example: An R square value of 0.96 between price and sales would indicate that 96% of the variance in sales is explained by the price.
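For a simple regression with one explanatory variable, R square can be sketched as 1 minus the ratio of the unexplained variation to the total variation. The price and sales figures and the `r_squared` helper below are invented for illustration.

```python
def r_squared(x, y):
    """R square for a least-squares straight line fitted to y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope and intercept of the least-squares line of best fit.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the line versus total variation in y.
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_residual / ss_total


price = [1, 2, 3, 4, 5]   # hypothetical prices
sales = [10, 8, 7, 4, 3]  # hypothetical sales at each price
r2 = r_squared(price, sales)
print(round(r2, 3))
```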

### ANOVA

Analysis of Variance, a hypothesis testing technique allowing you to determine whether the differences between samples are simply due to random sampling errors, or whether there are systematic treatment effects that cause the mean in one group to differ from the mean in another.

### General linear models

A framework covering ANOVA and regression procedures, used to test hypotheses in statistical experiments by modelling an outcome as a linear combination of known quantities plus random error.

### Time series analysis

General term for trending findings over a time period, usually in a graphical format, which can be used to make predicted forecasts for the future.

### Survival Analysis

A statistical method of estimating the expected duration of time until an event occurs.

Example: A human being’s expected lifespan being assessed using inputs such as age, gender, location and wealth.

### R Programming

R is a programming language and software environment widely used for statistical analysis, testing and modelling.

## GRAPHS & CHARTS

### Histogram

Similar in look to a bar chart except the bars touch each other. Histograms are formed from grouped data to display frequencies or relative frequencies (percentages) for each class in a sample.

### Scattergrams

A method of displaying the relationship between two variables, with each observation plotted as a point. A line of best fit can be added to show the overall trend and how far each observation deviates from it.

### Frequency polygon

Line chart plotted at the mid-point of each class, with the classes grouped e.g. into 0-10, 11-20, etc.

### Venn diagram

Presented as two or more circles overlapping each other to demonstrate relationships between variables.

Example: Animals with two legs and animals who can fly. Some would show in one group or the other and some would overlap into both groups.

### Tree diagram

Uses branches to show every possible outcome of a sequence of events or choices, often labelled with the probability of each branch.

Example: The first branch could be Europe, the second branches splitting out Germany, France and Spain and then the third branches split out the various cities in those countries.

### Box plots

A one-dimensional graph based on the numerical data from the five-number summary.

## THEORIES

### Central limit theorem

A theorem in probability theory stating that, under certain conditions, the mean of a large number of independent observations will be approximately normally distributed, regardless of the distribution of the observations themselves.
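A small simulation can make this concrete. The sketch below draws many samples from a uniform distribution, which is not bell-shaped, and looks at the spread of the sample means. The sample size, number of samples and seed are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
sample_size = 50
number_of_samples = 2000

# Mean of each sample of uniform random numbers between 0 and 1.
sample_means = [
    statistics.fmean([rng.random() for _ in range(sample_size)])
    for _ in range(number_of_samples)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)

# Theory predicts a mean of 0.5 and a spread of sqrt(1/12) / sqrt(sample_size).
expected_spread = (1 / 12) ** 0.5 / sample_size**0.5
print(round(mean_of_means, 3), round(spread_of_means, 4), round(expected_spread, 4))
```

Plotting `sample_means` as a histogram would show the familiar bell shape, even though the individual observations are uniformly distributed.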

### Spearman’s Rank correlation coefficient

A formula for calculating the correlation between the rankings of two variables. Results of either 1 or -1 demonstrate a perfect correlation (positive or negative), and the nearer the value is to 0 the weaker the correlation.
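A minimal sketch of the calculation is shown below, using the shortcut formula 1 - 6Σd² / (n(n² - 1)), where d is the difference between the two ranks of each observation. It assumes there are no tied values, and the helper names and data are hypothetical.

```python
def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation coefficient for data without ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


hours = [1, 2, 3, 4, 5]        # made-up values
scores = [1, 4, 9, 16, 25]     # rises with hours, but not in a straight line
rho = spearman_rho(hours, scores)
print(rho)
```

Because only the rankings are compared, any relationship that consistently rises (or consistently falls) gives a perfect result, even if it is not a straight line.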

### Bayesianism

An approach to statistical inference using calculations that result in “degrees of belief”, otherwise known as Bayesian probabilities. The probabilities of various outcomes are calculated by updating prior beliefs with inferences made from the available data.

### Capture / Recapture method

A method of estimating population sizes by capturing and marking a sample of individuals, releasing them, and later capturing a second sample. The proportion of marked individuals in the second sample is then used to make assumptions about the size of the whole population.

Example: Estimating the number of fish in a lake (the method is often used for populations of a particular species).
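One common form of this calculation is the Lincoln-Petersen estimator: the population is estimated as (number marked in the first catch × size of the second catch) / number of marked individuals found in the second catch. The sketch below uses invented fish counts, and the function name is hypothetical.

```python
def lincoln_petersen(marked_first, caught_second, marked_in_second):
    """Estimate population size from a capture and recapture exercise."""
    if marked_in_second == 0:
        raise ValueError("No marked individuals recaptured; cannot estimate.")
    return marked_first * caught_second / marked_in_second


# Hypothetical lake: 100 fish marked and released, then 80 caught later,
# of which 20 carried a mark.
estimated_fish = lincoln_petersen(marked_first=100, caught_second=80, marked_in_second=20)
print(estimated_fish)
```

The estimate relies on assumptions such as marked fish mixing evenly back into the population and no fish entering or leaving the lake between the two catches.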