# Basic statistics

## VARIABLES

### Qualitative vs Quantitative variables

Qualitative variables describe categories, whereas quantitative variables take numerical values that can be counted or measured.
Qualitative example: Blood group
Quantitative example: Temperature

### Discrete vs Continuous variables

Quantitative variables can be split into discrete variables, which are counted and take exact, separate values, and continuous variables, which can take any value within a range and so can only be recorded to a chosen level of precision.
Discrete example: Number of children in a family
Continuous example: A length measured to the nearest cm

### Random variables

A variable whose numerical value is determined by the outcome of a chance experiment. Discrete random variables can only take a countable set of separate values, such as the result of rolling a die.

### Class intervals

Class intervals can be used to categorise continuous quantitative variables, e.g. lengths grouped into 10cm intervals: 0 to under 10cm, 10 to under 20cm, etc. This effectively turns them into an ordered qualitative variable, making them easier to summarise in your reports.
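
As a minimal sketch of grouping by class interval, the Python below assigns each length to a 10cm interval and counts how many fall in each. The `class_interval` helper, the choice to include the lower bound and exclude the upper bound, and the sample lengths are all assumptions made for illustration.

```python
from collections import Counter


def class_interval(length_cm, width=10):
    """Return the label of the class interval that contains length_cm."""
    lower = int(length_cm // width) * width  # lower bound is included
    return f"{lower}-{lower + width}cm"      # upper bound is excluded


# Hypothetical lengths in cm, made up for illustration.
lengths = [3.2, 9.9, 10.0, 14.5, 27.1, 21.8, 10.4]

# Count how many lengths fall into each class interval.
interval_counts = Counter(class_interval(x) for x in lengths)
print(interval_counts)
```

Deciding up front which interval a boundary value such as 10cm belongs to avoids counting it twice.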

### Primary data vs secondary data

Primary data is directly collected for the experiment, whereas secondary data comes from an external source.

### Scaling

Qualitative variables can be summarised by scales as well as class intervals.

Nominal scales are unordered scales where the category names follow no logical order e.g. gender:
– Male
– Female

Ordinal scales are scales with a logical order e.g. survey responses of:
– Strongly Agree
– Agree
– Neutral
– Disagree
– Strongly Disagree

### Univariate and bivariate data

Univariate data records a single variable for each individual. Bivariate data records two variables for each individual in the same population, which may or may not be related. The relationship between the two variables can be explored effectively using a scatter plot.

Univariate example: Attendance figures for a football team
Bivariate example: Attendance figures for a football team compared with matchday beer sale figures

### Five-number summary

The five-number summary of a variable consists of its minimum value, the first quartile (Q1), the median (Q2), the third quartile (Q3) and its maximum value.
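
A five-number summary can be sketched in Python with the standard `statistics` module. The `five_number_summary` helper and the data are illustrative assumptions. Several conventions exist for calculating quartiles, so other tools may give slightly different Q1 and Q3 values than the "inclusive" method chosen here.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # n=4 splits the data into quarters and returns the three cut points.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


# Hypothetical data, made up for illustration.
summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```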

### Correlation

Correlation is a way of measuring the strength and direction of the linear relationship between two variables. You can use the CORREL function in Excel to return the correlation coefficient between two variables.

A value of +1 indicates a perfect positive correlation, meaning an increase in one variable is associated with an increase in the other. A value of -1 indicates a perfect negative correlation, where an increase in one variable is associated with a decrease in the other, and a value close to 0 indicates little or no linear relationship. However, it is important to remember that correlation does not necessarily imply causation.
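
Outside Excel, the same kind of coefficient can be sketched in a few lines of Python. This is a hand-written Pearson correlation rather than any particular library's implementation. The function name and the figures are assumptions made for illustration, and the sketch assumes neither variable is constant.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How the two variables vary together, and how much each varies alone.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical figures, made up for illustration.
attendance = [100, 200, 300, 400]
beer_sales = [15, 25, 35, 45]        # rises in step with attendance
empty_seats = [400, 300, 200, 100]   # falls in step with attendance

r_positive = pearson_correlation(attendance, beer_sales)
r_negative = pearson_correlation(attendance, empty_seats)
print(r_positive, r_negative)
```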

## SAMPLING

### Sample statistics

In an ideal world, you have data for the full population and can work with the overall distribution. However, this is rarely the case. Usually you only have a sample of the data to work from, so you use sample statistics such as the mean and standard deviation to approximate the parameters of the full population. The larger the sample, the more precise your estimates are likely to be, provided the sample is representative.

A common way to picture this is to take many random samples of the same size and record each sample’s mean; together these means form a sampling distribution. Provided each sample is large enough, the sampling distribution of the mean is approximately normal regardless of the shape of the overall population distribution, because of the central limit theorem.
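
The simulation below is a rough sketch of a sampling distribution, not a proof. It repeatedly samples from a strongly right-skewed population (an exponential distribution with mean 1, chosen purely as an assumption for illustration) and records each sample mean. The sample size, number of samples and seed are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 50     # observations in each sample
NUM_SAMPLES = 2000   # number of samples drawn

# Draw many samples from a skewed population and keep each sample mean.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

# The sample means cluster around the population mean (1.0) with a spread
# close to the population standard deviation divided by sqrt(SAMPLE_SIZE).
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
print(mean_of_means, spread_of_means)
```

Plotting `sample_means` as a histogram should show a roughly bell-shaped curve even though the underlying population is heavily skewed.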

### Cluster sampling

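A sampling technique which splits the population into groups, or clusters, that usually occur naturally, such as schools, streets or gyms. Every individual in the population must belong to exactly one cluster, but the size of each cluster can vary. A random selection of whole clusters is then made, and all (or a random sample) of the individuals in the chosen clusters are surveyed.

Example: A health questionnaire given to every member of a handful of gyms chosen at random from all the gyms in a city. Splitting people by how frequently they go to the gym (Regularly, Occasionally or Never) and sampling from every group is a different technique, known as stratified sampling.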

### Quota sampling

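A technique frequently used for market research, in which the sample is built up until a specific quota is met for each group of interest. Individuals within each group are not necessarily chosen at random.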

Example: Ensuring 20 men and 20 women make up the sample of 40.

### Convenience sampling

Also referred to as opportunity sampling, this is a sample made up of the easiest people to reach. This method risks failing to produce a truly representative sample of the population.

Example: A company polling customers who are already subscribed to its mailing list.

### Population

The full collection of individuals or items under consideration in your statistical study. A finite population can be physically listed, e.g. a list of the books in a library. A hypothetical population is an assumed future population based on inferences made from the existing population.

### Parameter

A numerical summary of a population, such as the population mean. Its true value is usually unknown and is estimated from a sample statistic.

## PROBABILITY

### Probability

A measure of how likely an event is to occur, expressed as a number from 0 (impossible) to 1 (certain).

### Conditional probability

The probability of an event given that another event has already occurred, written Pr(A | B). It can be calculated as Pr(A and B) / Pr(B).

### Independent vs Dependent events

In an independent event, the probability is not affected by any previous events, e.g. rolling a die. When the probability of an event is influenced by prior events, this is known as a dependent event, e.g. the probability of choosing a face card from a deck of cards changes based on the cards already drawn, provided they are not returned to the deck.
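
The difference can be sketched with exact fractions in Python. The example assumes a standard 52-card deck with 12 face cards (jack, queen and king in each suit), cards drawn without being returned to the deck, and a fair six-sided die. The variable names are illustrative.

```python
from fractions import Fraction

DECK_SIZE = 52
FACE_CARDS = 12  # jack, queen and king in each of the four suits

# Dependent events: the second draw depends on what the first draw removed.
p_first_face = Fraction(FACE_CARDS, DECK_SIZE)
p_second_face_given_face = Fraction(FACE_CARDS - 1, DECK_SIZE - 1)
p_second_face_given_not_face = Fraction(FACE_CARDS, DECK_SIZE - 1)
p_both_face = p_first_face * p_second_face_given_face

# Independent events: one roll of a fair die does not affect the next.
p_six = Fraction(1, 6)
p_two_sixes = p_six * p_six

print(p_first_face, p_second_face_given_face, p_second_face_given_not_face)
print(p_both_face, p_two_sixes)
```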

### Mutually inclusive vs mutually exclusive events

Mutually exclusive events cannot occur at the same time, e.g. a number can be either odd or even, never both. Mutually inclusive events can occur at the same time, for example a number can be both even and less than 10. The probability of either event occurring therefore needs to avoid counting the overlap twice: Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B). Venn diagrams are useful visuals to express this.
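
As an illustration, the sketch below picks a whole number from 1 to 20 at random (a range assumed purely for the example) and uses Python sets in place of a Venn diagram. The helper name `pr` is illustrative.

```python
from fractions import Fraction

numbers = range(1, 21)  # assumed sample space: whole numbers 1 to 20
even = {n for n in numbers if n % 2 == 0}
odd = {n for n in numbers if n % 2 == 1}
under_ten = {n for n in numbers if n < 10}


def pr(event):
    """Probability of an event when every number is equally likely."""
    return Fraction(len(event), len(numbers))


# Mutually inclusive: subtract the overlap so it is not counted twice.
p_even_or_under_ten = pr(even) + pr(under_ten) - pr(even & under_ten)

# Mutually exclusive: there is no overlap, so the probabilities simply add.
p_even_or_odd = pr(even) + pr(odd)

print(p_even_or_under_ten, p_even_or_odd)
```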

## DISTRIBUTIONS

### Frequency distribution

A table listing all categories (or classes) and their frequencies.

### Relative frequency

The frequency of a class expressed as a proportion (or percentage) of the total frequency of the sample.

### Relative frequency distribution

A table listing all classes and their relative frequencies, the total of which will equal 1 (100%).
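
A frequency distribution and its relative frequencies can be sketched in Python as follows. The blood group observations are made up for illustration.

```python
from collections import Counter

# Hypothetical sample of blood groups, made up for illustration.
blood_groups = ["O", "A", "O", "B", "A", "O", "AB", "O", "A", "O"]

# Frequency distribution: each class and how often it occurs.
frequency = Counter(blood_groups)

# Relative frequency: each class frequency divided by the total frequency.
total = sum(frequency.values())
relative_frequency = {group: count / total for group, count in frequency.items()}

for group, count in frequency.most_common():
    print(group, count, relative_frequency[group])
```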

### Sample distribution

The frequency distribution obtained when data is only available for a sample of the full population. The larger the sample, the more closely it is likely to resemble the frequency distribution of the population.

### Standard distribution types

Standard types of distribution include the normal, binomial and exponential distributions. Binomial distributions deal with discrete data, whereas normal and exponential distributions deal with continuous data.

### Normal distribution

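A variable is normally distributed if its distribution can be described by a symmetric, bell-shaped density curve centred on the mean.

Example: The calorie content of a meal may be normally distributed with a mean of 200 calories and a standard deviation of 5 calories. In a normal distribution roughly 68% of observations fall within one standard deviation of the mean (195 to 205 calories here), about 95% within two and about 99.7% within three.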
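
Taking the meal example as a normal distribution with a mean of 200 calories and a standard deviation of 5, the share of meals within a given number of standard deviations of the mean can be sketched with Python's standard `statistics.NormalDist`. The helper name `share_within` is an assumption for illustration.

```python
from statistics import NormalDist

MEAN_CALORIES = 200
SD_CALORIES = 5

calories = NormalDist(mu=MEAN_CALORIES, sigma=SD_CALORIES)


def share_within(num_sds):
    """Proportion of meals within num_sds standard deviations of the mean."""
    upper = MEAN_CALORIES + num_sds * SD_CALORIES
    lower = MEAN_CALORIES - num_sds * SD_CALORIES
    return calories.cdf(upper) - calories.cdf(lower)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

For a normal distribution these shares come out at roughly 68%, 95% and 99.7%.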

### Binomial distributions

Binomial experiments consist of a fixed number of independent trials, each with only two possible outcomes (success or failure) and the same probability of success, for example repeated flips of a coin. A binomial distribution is therefore the probability distribution of the number of successful trials in the experiment.
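
For the coin example, the probability of a given number of heads can be sketched with the binomial formula. The function name, the choice of 10 flips and the assumption of a fair coin are illustrative.

```python
from math import comb


def binomial_pmf(successes, trials, p_success):
    """Probability of exactly `successes` successes in `trials` independent trials."""
    ways = comb(trials, successes)  # number of ways to place the successes
    return ways * p_success ** successes * (1 - p_success) ** (trials - successes)


# Probability of exactly 5 heads in 10 flips of a fair coin.
p_five_heads = binomial_pmf(5, 10, 0.5)
print(p_five_heads)

# The probabilities of every possible number of heads add up to 1.
total_probability = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(total_probability)
```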

### Exponential distributions

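Exponential distributions deal with continuous, non-negative data and are typically used to model the waiting time between events that occur independently at a constant average rate, e.g. the number of minutes between customers arriving at a shop.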

### T distributions

A type of probability distribution that resembles the standard normal distribution (mean 0, standard deviation 1) but has heavier tails, controlled by an additional parameter known as “degrees of freedom”. The more degrees of freedom, the closer the T distribution is to the normal curve.

The degrees of freedom are roughly equal to the number of observations in the test (for a one-sample test, the number of observations minus one). The more degrees of freedom, the more confident we can be that the results resemble the true full population distribution.

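A T statistic is the ratio of the difference between an estimate (such as a sample mean or a regression coefficient) and its hypothesised value to the standard error of that estimate. It can be evaluated against the T distribution appropriate for the size of the data sample.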

With a large enough T statistic we can reject the null hypothesis at some level of statistical significance. The fewer the degrees of freedom and therefore the fatter the tails of the relevant T distribution, the higher the T statistic will need to be in order for us to reject the null hypothesis.
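
As a sketch of the simplest case, the code below calculates a one-sample T statistic comparing a sample mean with a hypothesised population mean. The function name and the measurements are assumptions for illustration. Looking up the critical value for the resulting degrees of freedom would need a T table or a statistics library, which is not shown here.

```python
import math
import statistics


def one_sample_t(sample, hypothesised_mean):
    """Return the T statistic and degrees of freedom for a one-sample test."""
    n = len(sample)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t_stat = (statistics.mean(sample) - hypothesised_mean) / standard_error
    return t_stat, n - 1


# Hypothetical measurements, tested against a hypothesised mean of 5.0.
t_stat, degrees_of_freedom = one_sample_t([5.1, 4.9, 5.3, 5.2, 5.0], 5.0)
print(t_stat, degrees_of_freedom)
```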

### Poisson distributions

A probability distribution which gives the probability of a given number of events occurring in a fixed timeframe, assuming the events happen independently at a known average rate.

Example: The probabilities of the likely number of goals in a football match, based on averages taken from recent results.
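
The football example can be sketched with the Poisson formula. The average of 2.5 goals per match is a made-up figure, not taken from real results, and the function name is illustrative.

```python
import math


def poisson_pmf(count, mean_rate):
    """Probability of exactly `count` events when `mean_rate` are expected."""
    return math.exp(-mean_rate) * mean_rate ** count / math.factorial(count)


AVERAGE_GOALS = 2.5  # hypothetical average goals per match

# Probability of 0, 1, 2, ... 5 goals in a match.
goal_probabilities = {k: poisson_pmf(k, AVERAGE_GOALS) for k in range(6)}

for goals, probability in goal_probabilities.items():
    print(goals, round(probability, 3))
```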

### Skewness

A measurement of the asymmetry of the probability distribution of a random variable.

A distribution is skewed if one of its tails is longer than the other.

A positive skew is displayed if the majority of values are on the left of the distribution with a long tail on the right, also known as a right-skewed distribution because the outlier values are on the right. An example is the amount of rainfall per day: lots of days have no rainfall at all, but some outlier days have large amounts.

A negative skew occurs if the longer tail is on the left, with values mainly at the higher end of the scale. A symmetric distribution such as the normal distribution has no skew at all, with skewness = 0.

When the skewness is low, the mean and median will not be very far apart. As a rule of thumb when measuring central tendency, a skewness above 1 or below -1 suggests the data is too skewed for the mean to be the best measurement, and the median is a better indicator of a typical value.

The SKEW function can be used to measure skewness in Excel.
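
For comparison, a sample skewness can be sketched in Python using the adjusted formula that Excel documents for SKEW: n / ((n - 1)(n - 2)) multiplied by the sum of the cubed standardised deviations. The function name and the rainfall figures are assumptions for illustration.

```python
import statistics


def sample_skewness(values):
    """Adjusted sample skewness; needs at least three values that are not all equal."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three values")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    cubed_deviations = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_deviations


# Hypothetical daily rainfall in mm: mostly dry days and one very wet day.
rainfall = [0, 0, 0, 1, 10]
rainfall_skew = sample_skewness(rainfall)    # positive (right-skewed)
symmetric_skew = sample_skewness([1, 2, 3, 4, 5])  # approximately zero
print(rainfall_skew, symmetric_skew)
```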

### Kurtosis

A measure of the “tailedness” of the probability distribution of a random variable, i.e. how much of the data sits in the tails compared with a normal distribution. It is often described informally as the sharpness of the peak of a frequency distribution curve, contrasting tall and narrow distributions with short and flat ones, although it is the weight of the tails that the measurement actually captures. There are different conventions for measuring kurtosis: a normal distribution has a kurtosis of 3, or an excess kurtosis of 0.

## EXPERIMENT BIAS

### Types of bias

There are numerous types of bias which can inadvertently influence the results of a statistical test:

– Cognitive bias: Biases which may stem from emotional or moral motivations or from social influences which deviate from rationality in judgement.

– Confirmation bias: The tendency to search for and interpret information in a way which confirms your pre-existing beliefs or hypotheses.

– Observer bias: When a researcher subconsciously influences participants or results to match their expectations.

– Recall bias: When the respondent hasn’t remembered things correctly.

– Recency bias: When an event that has happened most recently is disproportionately over-represented in the results.

– Selection bias: Accidentally working with a subset of your audience when you believe you have a representative sample.

– Sponsorship bias: The tendency of a scientific study to support the interests of the people or organisations funding the research.

– Survivorship bias: The error of concentrating on observations which made it past some selection process while overlooking those that didn’t, usually because they are no longer visible.

## THEORIES

### Central limit theorem

A theorem in probability theory stating that, under certain conditions, the mean of a large number of independent observations will be approximately normally distributed, whatever the shape of the distribution they are drawn from.

### Spearman’s Rank correlation coefficient

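A measure of correlation between two variables based on the ranks of their values rather than the values themselves, so it captures any consistently increasing or decreasing relationship, not only a straight-line one. Results of 1 or -1 demonstrate perfect correlation, and the nearer the value is to 0 the weaker the correlation.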
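
A minimal sketch of the calculation is shown below, using the common shortcut formula 1 - 6Σd² / (n(n² - 1)), where d is the difference between the two ranks of each observation. The shortcut is only valid when there are no tied values, so the sketch rejects ties. The function names and data are illustrative assumptions.

```python
def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rank(xs, ys):
    """Spearman's rank correlation coefficient for data without ties."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    if len(set(xs)) != n or len(set(ys)) != n:
        raise ValueError("this shortcut formula assumes no tied values")
    # Sum of squared differences between the paired ranks.
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# A curved but always-increasing relationship still has a coefficient of 1.
x = [1, 2, 3, 4, 5]
y_cubed = [1, 8, 27, 64, 125]
rho_increasing = spearman_rank(x, y_cubed)
rho_decreasing = spearman_rank(x, list(reversed(y_cubed)))
print(rho_increasing, rho_decreasing)
```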

### Bayesianism

An approach to statistical inference in which probabilities are interpreted as “degrees of belief”, otherwise known as Bayesian probabilities. A prior belief about the probability of various outcomes is updated using Bayes’ theorem as data becomes available.

### Capture / Recapture method

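A method of estimating the size of a population by capturing a sample, marking and releasing it, and later capturing a second sample. The proportion of marked individuals in the second sample is assumed to match the proportion of the whole population that was marked, which gives an estimate of the total.

Example: Estimating the number of fish in a lake (the method is most often used for populations of a particular species).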
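
One common form of this estimate is the Lincoln-Petersen estimator: mark and release a first catch, take a second catch later, and scale up by the share of the second catch that is marked. The sketch below uses made-up numbers. It assumes the population does not change between catches and that marked fish mix back in evenly. The function and parameter names are illustrative.

```python
def lincoln_petersen(marked_first, caught_second, marked_in_second):
    """Estimate population size from a mark and recapture experiment."""
    if marked_in_second == 0:
        raise ValueError("no marked individuals recaptured; cannot estimate")
    # marked_in_second / caught_second estimates marked_first / population.
    return marked_first * caught_second / marked_in_second


# Hypothetical lake: 100 fish marked, then 80 caught of which 20 are marked.
estimated_fish = lincoln_petersen(100, 80, 20)
print(estimated_fish)
```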

## NOTATION

### Pr(A)

This represents the probability of ‘A’ happening, e.g. on the toss of a fair coin, Pr(Heads) = 0.5.

### X-bar

X-bar (x̄) is the symbol used to represent the sample mean, which is used as an estimate of the population mean.