Qualitative vs Quantitative variables
Qualitative variables are categorical items, whereas quantitative variables have a numerical value associated.
Qualitative example: Blood group
Quantitative example: Temperature
Discrete vs Continuous variables
Quantitative variables can be split into discrete variables, which are counted to an exact figure, and continuous variables, which can take any value within a range and so are recorded to a chosen level of precision.
Discrete example: Number of children in a family
Continuous example: A length measured to the nearest cm
Random variables
A variable that represents a numerical value within a chance experiment. Discrete random variables can assume only a countable number of values at isolated points, whereas continuous random variables can assume any value within an interval.
Class intervals can be used to categorise continuous quantitative variables e.g. lengths grouped at 10cm intervals: 0 to under 10cm, 10 to under 20cm, etc. This effectively turns them into an ordered qualitative variable, making them easier to summarise in reports.
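As a rough illustration of class intervals, the sketch below groups some made-up lengths into 10cm classes. The function name, the sample lengths and the choice of half-open intervals (a length of exactly 10cm falls in the 10-20cm class) are assumptions made for the example.

```python
from collections import Counter


def class_interval(length_cm, width=10):
    """Label the half-open interval [lower, lower + width) that contains length_cm."""
    lower = int(length_cm // width) * width
    return f"{lower}-{lower + width}cm"


# Hypothetical measurements, in cm
lengths = [3.2, 14.8, 10.0, 27.5, 9.9]

# Frequency of each class interval
counts = Counter(class_interval(x) for x in lengths)
for label, count in sorted(counts.items(), key=lambda item: int(item[0].split("-")[0])):
    print(label, count)
```

Treating the upper bound as exclusive avoids a value sitting in two classes at once.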
Primary data vs secondary data
Primary data is directly collected for the experiment, whereas secondary data comes from an external source.
Qualitative variables can be summarised by scales as well as class intervals.
Nominal scales are unordered scales where the category names follow no logical order e.g. gender:
Ordinal scales are scales with a logical order e.g. survey responses of:
- Strongly Agree
- Strongly Disagree
Univariate and bivariate data
Univariate data involves a single variable. Bivariate data involves two variables measured on the same individuals or items in a population. The goal of examining bivariate data is usually to show a relationship or association between the two variables, which can be visualised effectively using a scatter plot.
Univariate example: Attendance figures for a football team
Bivariate example: Attendance figures for a football team compared with matchday beer sale figures
The five-number summary of a variable consists of its minimum value, the first quartile (Q1), the median (Q2), the third quartile (Q3) and its maximum value.
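A minimal sketch of a five-number summary in Python, assuming the "inclusive" quartile method from the standard library's statistics module. Quartile conventions differ between tools, so other software may return slightly different Q1 and Q3 values for the same data. The function name and data are made up for the example.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # quantiles(..., n=4) returns the three cut points Q1, Q2 and Q3
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```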
Correlation is a way of measuring the strength and direction of the relationship between two variables. You can use the CORREL function in Excel to return the (Pearson) correlation coefficient between two variables.
A value of +1 indicates a perfect positive correlation, meaning an increase in one variable is associated with an increase in the other. A value of -1 would be a perfect negative correlation, where an increase in one variable is associated with a decrease in the other. A value near 0 indicates little or no linear relationship.
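For readers working outside Excel, the Pearson correlation coefficient can be sketched by hand as below. The function name and the attendance and beer sales figures are invented for illustration only.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical matchday figures
attendance = [18000, 21000, 19500, 25000, 23000]
beer_sales = [5200, 6100, 5600, 7400, 6900]
r = pearson_correlation(attendance, beer_sales)
print(round(r, 3))
```

A result close to +1 here would only indicate that the two made-up series rise together, not that one causes the other.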
It is important to remember that correlation is not necessarily causation. Covariance, like correlation, only measures how two variables vary together; it does not show that one has a direct impact on the other. It could be that x is causing y in the anticipated way. However, it could also be the case that reverse causation is taking place (y is causing x), or there could be a third variable (z) causing both x and y.
Sample
A subset of observations from the population, studied or analysed in order to learn about the population.
Random sample
A sample in which all observations within the population have an equal chance of being included, the selection is not influenced by bias, and each selection is independent of the other selections in the sample.
Sampling error occurs when a random sample is used to make inferences about a population because information from the full population is not available. Usually the larger the sample the more representative of the population it is, provided an appropriate sampling technique has been used.
In an ideal world, you have data for the full population and can work with the overall distribution. However, this is rarely the case. Usually you only have a sample of the data to work from, so you use sample statistics such as the mean and standard deviation to estimate the parameters of the full population. The larger the sample, the more accurate your estimates tend to be.
A sampling distribution can be built by taking multiple random samples of the same size and recording the mean of each. With a large enough sample size, the sampling distribution of the mean takes on an approximately normal shape regardless of the overall population distribution, because of the central limit theorem.
A sampling technique which splits the population into specific categories, or clusters. Every individual in the sample must be assigned to one of the clusters, but the population of each cluster can vary. A random sample of each cluster is then selected.
Example: A health questionnaire splitting the sample by how frequently they go to the gym – Regularly, Occasionally or Never.
Quota sampling
A technique frequently used for market research, in which the sample is built up to meet specific quotas.
Example: Ensuring 20 men and 20 women make up the sample of 40.
Convenience sampling
Also referred to as opportunity sampling, this is a sample made up of the easiest people to reach. This method risks failing to produce a truly representative sample of the population.
Example: A company polling customers who are already subscribed on their mailing list.
Population
The full collection of individuals or items under consideration in your statistical study. A finite population can be physically listed, e.g. a list of the books in a library. A hypothetical population is an assumed future population based on inferences made from the existing population.
Parameter
A numerical summary of a population, usually unknown.
Probability
The extent to which an event is likely to occur.
Conditional probability
Where the probability of an event depends on whether another event has already occurred.
Independent vs Dependent events
In an independent event, the probability is not affected by any previous events, e.g. rolling a die. When the probability of an event is influenced by prior events, this is known as a dependent event, e.g. the probability of choosing a face card from a deck of cards changes based on the cards already drawn (when cards are not replaced).
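The difference can be sketched with exact fractions. The example below assumes a fair six-sided die and a standard 52-card deck with 12 face cards, drawn without replacement; the variable names are made up for illustration.

```python
from fractions import Fraction

# Independent events: the second roll of a die is unaffected by the first
p_six = Fraction(1, 6)
p_two_sixes = p_six * p_six

# Dependent events: drawing face cards without replacement
FACE_CARDS = 12
DECK_SIZE = 52
p_face_first = Fraction(FACE_CARDS, DECK_SIZE)
# One face card has already left the deck, so both counts drop by one
p_face_second_given_first = Fraction(FACE_CARDS - 1, DECK_SIZE - 1)
p_two_faces = p_face_first * p_face_second_given_first

print(p_two_sixes, p_face_first, p_face_second_given_first, p_two_faces)
```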
Mutually inclusive vs mutually exclusive events
Mutually exclusive events cannot occur at the same time e.g. a number can be either odd or even, never both. Mutually inclusive events can occur at the same time. For example, a number can be both even and less than 10. The probability of either event occurring therefore needs to avoid counting the overlap twice: add the two probabilities and subtract the probability of both occurring together. Venn diagrams are useful visuals to express this.
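A small sketch of the calculation for mutually inclusive events, assuming a whole number is picked at random from 1 to 20 (this range is an assumption added for the example).

```python
from fractions import Fraction

numbers = range(1, 21)  # assumed sample space: whole numbers 1 to 20
even = {n for n in numbers if n % 2 == 0}
below_ten = {n for n in numbers if n < 10}


def probability(event):
    """Probability of an event when every number is equally likely."""
    return Fraction(len(event), len(numbers))


# P(A or B) = P(A) + P(B) - P(A and B); the overlap is subtracted once
p_either = probability(even) + probability(below_ten) - probability(even & below_ten)
print(p_either)
```

The same answer comes from counting the union of the two sets directly, which is what the overlapping region of a Venn diagram shows.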
Frequency distribution
A table listing all categories (or classes) and their frequencies.
Relative frequency
The frequency of a class as a proportion (or percentage) of the overall frequency of the sample.
Relative frequency distribution
A table listing all classes and their relative frequencies, the total of which will equal 1 (100%).
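As an illustration, a frequency distribution and its relative frequencies can be built with a counter. The blood group observations below are made up for the example.

```python
from collections import Counter

# Hypothetical sample of blood groups
blood_groups = ["O", "A", "O", "B", "A", "O", "AB", "O", "A", "O"]

# Frequency distribution: class -> count
frequencies = Counter(blood_groups)
total = sum(frequencies.values())

# Relative frequency distribution: class -> proportion of the sample
relative_frequencies = {group: count / total for group, count in frequencies.items()}

for group, proportion in relative_frequencies.items():
    print(group, frequencies[group], proportion)
```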
The frequency distribution obtained where data is only available for a sample of the full population. The larger the sample, the more closely it is likely to resemble the population frequency distribution.
Standard distribution types
Standard types of distribution are the normal distribution, binomial distributions and exponential distributions. Binomial distributions deal with discrete data, whereas normal and exponential distributions deal with continuous data.
Normal distribution
A variable graphically described with a symmetric, bell-shaped density curve. As the size of a sample drawn from a normally distributed variable increases, the sample distribution will look more ‘normal’, as outliers have less influence on the overall shape.
For example, the calorie content of a meal may be normally distributed with a mean of 200 calories and a standard deviation of 5 calories. With a small dataset of just three or four of these meals, outliers (on either side) are more likely to skew the distribution. As the sample size grows, the bell shape becomes smoother and more symmetric.
Binomial experiments involve only two possible outcomes per trial, and a fixed number of independent trials, each with the same probability of success. For example, flipping a coin several times. A binomial distribution is therefore the probability distribution of the number of successful trials in the experiment.
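The binomial probability of exactly k successes in n trials can be sketched as follows. The function name is an assumption, and the coin is assumed to be fair.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 5 heads in 10 flips of a fair coin
p_five_heads = binomial_pmf(5, 10, 0.5)
print(p_five_heads)
```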
Exponential distributions deal with continuous data on a scale, typically the time between events, e.g. you may measure travel times between places on a scale of minutes.
T distribution
A type of probability distribution that resembles the normal distribution but has an additional parameter known as “degrees of freedom”. With few degrees of freedom its tails are fatter than those of the normal curve; as the degrees of freedom increase it gets closer to the standard normal distribution, which has a mean of 0 and a standard deviation of 1.
The degrees of freedom are roughly equal to the number of observations in the test (more precisely, the number of observations minus the number of parameters estimated). The more degrees of freedom, the more confident we can be that the results resemble the true full population distribution.
A T statistic is the ratio of the observed coefficient (less its hypothesised value, which is often 0) to its standard error, which can be evaluated against the T distribution appropriate for the size of the data sample.
With a large enough T statistic we can reject the null hypothesis at some level of statistical significance. The fewer the degrees of freedom and therefore the fatter the tails of the relevant T distribution, the higher the T statistic will need to be in order for us to reject the null hypothesis.
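As a simple illustration, the sketch below computes a T statistic for the one-sample case, where the "coefficient" is a sample mean compared against a hypothesised population mean. This is only one of several settings in which T statistics appear; the function name and data are invented for the example.

```python
import math
import statistics


def one_sample_t(values, hypothesised_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t test."""
    n = len(values)
    # Standard error of the mean uses the sample standard deviation (n - 1 divisor)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    t = (statistics.mean(values) - hypothesised_mean) / standard_error
    return t, n - 1


t_stat, dof = one_sample_t([5.1, 4.9, 5.3, 5.2, 5.0], hypothesised_mean=5.0)
print(round(t_stat, 3), dof)
```

The resulting value would then be compared with a critical value from the T distribution with the stated degrees of freedom.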
Poisson distribution
A probability distribution which gives the probability of a given number of events occurring in a fixed timeframe, based on the average rate at which they occur.
Example: The probabilities of the likely number of goals in a football match, based on averages taken from recent results.
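A minimal sketch of the Poisson probabilities for goals in a match, assuming an average of 2.5 goals per match (a made-up rate, not taken from real results).

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k events when the average number of events is `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)


AVERAGE_GOALS = 2.5  # hypothetical average goals per match

for goals in range(6):
    print(goals, round(poisson_pmf(goals, AVERAGE_GOALS), 3))
```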
Skewness
A measurement of the asymmetry of the probability distribution of a random variable.
A distribution is skewed if one of its tails is longer than the other.
A positive skew is displayed if the majority of values are on the left of the distribution with a long tail on the right. This is also known as a right-skewed distribution because the outlier values are on the right. An example is the amount of rainfall per day: lots of days have no rainfall at all, but some outlier days have large amounts of rainfall.
A negative skew occurs if the longer tail is on the left, with values mainly at the higher end of the scale. A normal distribution has no skew at all, with skewness = 0.
When the skewness is low the mean and median will not be very far apart. When measuring central tendency, any skew above 1 or under -1 suggests the data is too skewed for the mean to be the best measurement and instead the median is a better indicator of typical value.
The SKEW function can be used to measure skewness in Excel.
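Outside Excel, sample skewness can be sketched by hand. The version below uses the adjusted sample formula, n / ((n - 1)(n - 2)) multiplied by the sum of cubed standardised deviations, which is the formula Microsoft documents for SKEW. The function name and rainfall figures are made up for the example.

```python
import statistics


def sample_skewness(values):
    """Adjusted sample skewness (requires at least 3 values that are not all equal)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    cubed = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Hypothetical daily rainfall in mm: mostly dry days with one very wet day
rainfall = [0, 0, 0, 0, 1, 2, 15]
print(round(sample_skewness(rainfall), 2))
```

A clearly positive result for the rainfall data reflects the long right tail described above.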
Kurtosis
The sharpness of the peak of a frequency distribution curve. Kurtosis helps describe the shape of a probability distribution of a random variable, measuring the “tailedness” of the data. There are different interpretations of how to measure kurtosis from a population, but the purpose is to understand whether the distribution is tall and narrow or short and flat.
When a data set is clustered around two different modes, it is described as being bimodal.
Types of bias
A bias is a systematic error in sampling. There are numerous types of bias which can inadvertently influence the results of a statistical test:
- Cognitive bias: Biases which may stem from emotional or moral motivations or from social influences which deviate from rationality in judgement
- Confirmation bias: The tendency to search for and interpret information in a way which confirms your pre-existing beliefs or hypotheses
- Observer bias: When a researcher subconsciously influences participants or results to match their expectations
- Recall bias: When the respondent hasn’t remembered things correctly
- Recency bias: When an event that has happened most recently is disproportionately over-represented in the results
- Selection bias: Accidentally working with a subset of your audience when you believe you have a representative sample
- Sponsorship bias: The tendency of a scientific study to support the interests of the people or organisations funding the research
- Survivorship bias: The error of concentrating on observations which made it past some selection process and overlooking those that didn’t, typically because they are no longer visible
Central limit theorem
A theorem in probability theory stating that, under certain conditions, the mean of a large number of independent observations will be approximately normally distributed.
If the number of observations is large enough, even where the population distribution is not normally distributed, the sampling distribution of the mean will take an approximately normal distribution.
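The theorem can be explored with a small simulation. The sketch below assumes an exponential population with mean 1 (a strongly skewed distribution), draws 2,000 samples of 50 observations each and records every sample mean. The function name, population, sample size and seed are all arbitrary choices for the example.

```python
import random
import statistics


def simulate_sample_means(draw, sample_size, n_samples, seed=0):
    """Return the mean of each of n_samples random samples of size sample_size."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


# Skewed population: exponential with mean 1 and standard deviation 1
means = simulate_sample_means(lambda rng: rng.expovariate(1.0), sample_size=50, n_samples=2000)

centre = statistics.fmean(means)  # expected to be close to the population mean, 1
spread = statistics.stdev(means)  # expected to be close to 1 / sqrt(50)
print(round(centre, 3), round(spread, 3))
```

Plotting a histogram of the recorded means should show a roughly bell-shaped curve, even though the underlying population is far from normal.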
Spearman’s Rank correlation coefficient
A formula for calculating correlation between two variables based on the rank order of their values rather than the values themselves. Results of either 1 or -1 demonstrate perfect correlation between the ranks, and the nearer the value to 0 the less correlation found.
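A minimal sketch of the calculation, assuming there are no tied values in either variable, in which case the coefficient is 1 - 6 * sum(d^2) / (n(n^2 - 1)), where d is the difference between the two ranks of each observation. The function names are made up for the example.

```python
def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rank correlation coefficient for untied data."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


# y rises whenever x rises, so the ranks agree perfectly even though
# the relationship is not a straight line
print(spearman_rho([10, 20, 30, 40, 50], [1, 4, 9, 16, 25]))
```

When ties are present, a common approach is to give tied values their average rank and compute the Pearson correlation of the ranks instead.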
Bayesian inference
An approach to statistical inference using calculations that result in “degrees of belief”, otherwise known as Bayesian probabilities. The probabilities of various outcomes are calculated based on inferences made from the available data, and are updated as more data becomes available.
Capture / Recapture method
A method of estimating population sizes where a full count is impractical. A first sample is captured, marked and released. A second sample is then captured, and the proportion of marked individuals in it is used to estimate the size of the whole population.
Example: Estimating the number of fish in a lake (the method is often used for populations of a particular species).
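One common way to turn capture and recapture counts into an estimate is the Lincoln-Petersen estimator: if M individuals are marked and released, and a later catch of C individuals contains R marked ones, the population is estimated as M × C / R. The sketch below assumes this estimator; the function name and the fish counts are invented.

```python
def lincoln_petersen(marked_first, caught_second, marked_in_second):
    """Estimate population size from a capture / recapture study."""
    if marked_in_second == 0:
        raise ValueError("no marked individuals recaptured; cannot estimate")
    # The marked share of the second catch is assumed to mirror the whole population
    return marked_first * caught_second / marked_in_second


# Hypothetical lake: 100 fish marked, then 80 caught later, 20 of them marked
estimate = lincoln_petersen(marked_first=100, caught_second=80, marked_in_second=20)
print(estimate)
```

The estimate relies on the marked individuals mixing back into the population and on the population not changing between the two catches.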
Pr(A)
This represents the probability of ‘A’ happening e.g. on the toss of a coin Pr(Heads)=0.5.
The x-bar refers to the symbol (or expression) used to represent the sample mean, which is used as an estimate of the population mean. An x with a horizontal line above it represents the mean of x, a y with a horizontal line above it represents the mean of y, etc.