A prediction or statement about a characteristic of a variable, which can be tested to provide evidence for or against.
A method of statistically testing a hypothesis by comparing data against values predicted by the hypothesis. The significance test considers two hypotheses, the null and the alternative, effectively testing whether or not there is significant reason to support the hypothesis or whether the results are random.
When we can’t prove the alternative hypothesis is significant, we don’t “accept” the null hypothesis but “fail to reject” the null hypothesis. This implies the data is not sufficiently persuasive for us to choose the alternative hypothesis over the null hypothesis but it doesn’t necessarily prove the null hypothesis either.
There are numerous types of significance test, and it is important to choose the correct test for each scenario. For example, a sign test is applicable if the hypothesis refers to an assumed median, whereas a z-test is relevant for a hypothesis concerning a mean. With a large enough sample size, a z-test can be used to draw conclusions about an overall population mean based on the sample.
The null hypothesis assumes randomness and is directly tested during a significance test. As the null hypothesis indicates no significance, you are usually trying to disprove the null hypothesis in your statistical tests.
Contradicting the null hypothesis, the alternative hypothesis is supported if the significance test indicates the null hypothesis to be incorrect.
A random variable used in hypothesis testing to determine whether to reject the null hypothesis. To calculate the test statistic, if the hypothesis was testing an assumed median we could use a sign test to note how many observations are above and below that assumed median. The test statistic is the smaller of the number of values lying below or the number lying above the assumed median.
Critical region and critical values
The critical region can be established using a probability distribution of all potential outcomes. For example, the critical region at the 5% significance level consists of the most extreme outcomes whose combined probabilities come to 0.05 or just less than that.
The critical value marks the boundary of the critical region, so it determines whether or not an outcome falls inside it. Critical values can be calculated or, for random samples, looked up in tables showing the statistic for each sample size.
Sign tests effectively apply a + or – value to each observation in a sample to denote whether they are above or below the assumed median. Any values which tie with the assumed median are discarded from the calculation. When we find the value of the test statistic is less than or equal to the critical value, we reject the null hypothesis. If it is above the critical value, we fail to reject the null hypothesis.
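The sign test described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `sign_test_p_value` and the sample data are made up, and a two-sided p-value from the binomial distribution is used in place of a lookup table of critical values.

```python
from math import comb

def sign_test_p_value(observations, assumed_median):
    """Return the sign test statistic and a two-sided p-value."""
    # Discard ties, then count observations on each side of the assumed median.
    above = sum(1 for x in observations if x > assumed_median)
    below = sum(1 for x in observations if x < assumed_median)
    n = above + below
    # The test statistic is the smaller of the two counts.
    statistic = min(above, below)
    # Under the null hypothesis each sign is + or - with probability 0.5.
    tail = sum(comb(n, k) for k in range(statistic + 1)) / 2 ** n
    return statistic, min(1.0, 2 * tail)

sample = [12, 15, 11, 18, 20, 17, 16, 19, 14, 21]  # made-up data
statistic, p_value = sign_test_p_value(sample, assumed_median=13)
print(statistic, round(p_value, 4))  # 2 0.1094
```

With 8 observations above and 2 below the assumed median, the p-value is above 0.05, so at the 5% level we would fail to reject the null hypothesis.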
T-Tests and Z-Tests
These are two methods of significance testing used to determine whether there is a significant difference between a sample mean and a hypothesised mean, or between the means of two groups.
If we already know the mean and standard deviation of the full population, we can conduct a z-test. Typically we only have a sample of the population and therefore need to carry out a t-test instead. For a z-test, it is generally accepted that a sample size of at least 30 is required for the test to be applicable, whereas t-tests can be used for smaller samples too.
The Z.TEST function in Excel performs a one-sample z-test: =Z.TEST(array, x, [sigma]), where array is the sample range, x is the hypothesised mean and sigma is the optional population standard deviation. If sigma is omitted, the sample standard deviation is used instead. The function returns a one-tailed p-value, indicating the probability of randomly observing a sample mean greater than the one you have if the population mean really were x.
The t-test is based on the t-distribution whereas a z-test relies on the assumption of a normal distribution. Excel has in-built tools for the various types of t-test, including two-sample assuming equal variances and two-sample assuming unequal variances.
There are a variety of different types of t-test and z-test, the most basic being a one sample one tailed test. A one sample test only looks at a single sample of the data to compare against the hypothesised mean, but you could use multiple samples. A one tailed test can only provide a p-value for one direction, testing either whether the sample mean is significantly higher or whether it is significantly lower than the hypothesised mean, whereas a two tailed test checks both ways. A two-sample test can also compare locations of two populations, usually with the hypothesis that the two populations are equal.
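As a rough sketch of a one-sample z-test, the Python below computes the z statistic together with a one-tailed and a two-tailed p-value. The function name `z_test` and the summary figures are illustrative assumptions, and the population standard deviation is assumed to be known.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, n, hypothesised_mean, population_sd):
    """Return the z statistic with upper-tail and two-tailed p-values."""
    standard_error = population_sd / sqrt(n)
    z = (sample_mean - hypothesised_mean) / standard_error
    upper_tail = 1 - NormalDist().cdf(z)             # P(Z >= z)
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))  # both directions
    return z, upper_tail, two_tailed

# Made-up summary figures: 36 observations averaging 102, tested against 100.
z, one_tailed_p, two_tailed_p = z_test(102, 36, 100, 6)
print(round(z, 2), round(one_tailed_p, 4), round(two_tailed_p, 4))
```

Here z comes out at 2.0, and the two-tailed p-value is exactly double the one-tailed value because the normal distribution is symmetric.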
One-tailed vs Two-tailed tests
If you are only specifically looking for one alternative hypothesis, then a one-tailed significance test is sufficient. However, sometimes you have two possible alternative hypotheses. For example, you might be testing how paying for extra golf sessions affects a golfer’s performance. The null hypothesis is that the sessions make no difference; an alternative hypothesis is that they improve performance, but there could also be another alternative hypothesis that they worsen performance. It is important to make the correct decision on whether your significance test needs to be one or two tailed.
We usually want a two tailed test looking for significance regardless of the direction and whether it matches our intuition. For example, if measuring whether a new initiative has increased productivity we shouldn’t rule out that it may have had the opposite effect.
The p-value is a function of the observed sample results used for testing a statistical hypothesis. Prior to the test being performed an agreed threshold for the probability value (p-value) should be chosen, which is known as the significance level and will usually be somewhere between 1 and 5%. The results can be measured against the significance level to provide evidence for or against the null hypothesis.
The p-value is the probability of getting a result at least as extreme as the one observed if the null hypothesis is true and the results were actually due to random chance. So what we’re actually testing is: if the null was true, what chance is there you would get this result?
As a general rule, a p-value of 0.05 (5%) or less is deemed to be statistically significant evidence against the null hypothesis:
p > 0.10 = Little evidence against the null hypothesis
p <= 0.10 and > 0.05 = Weak evidence against the null hypothesis
p <= 0.05 and > 0.01 = Moderate evidence against the null hypothesis
p <= 0.01 and > 0.001 = Strong evidence against the null hypothesis
p <= 0.001 = Very strong evidence against the null hypothesis
Whilst a p-value is useful in establishing significance, it is often more informative to calculate a confidence interval to understand the precision of the result. The p-value may tell you to reject the null hypothesis; a confidence interval calculated at a chosen level (often 95%) can lead to the same decision but also offers the additional information of a range of values the true result probably lies between.
P-hacking, also known as data dredging, is the process of finding patterns in data which can enable a test to be deemed statistically significant when in reality there is no significant underlying effect.
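To illustrate, a confidence interval for a mean can be sketched as follows. This is a simplified example: the function name and data are made up, and it uses a normal approximation, which is only reasonable for larger samples (a t-distribution would usually be used for small ones).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * stdev(sample) / sqrt(len(sample))
    centre = mean(sample)
    return centre - margin, centre + margin

sample = [10, 12, 9, 11, 13, 10, 12, 11] * 5  # made-up data, 40 observations
low, high = mean_confidence_interval(sample)
print(round(low, 2), round(high, 2))
```

If a hypothesised mean falls outside the interval, that corresponds to rejecting the null hypothesis at the matching significance level, while the width of the interval shows how precise the estimate is.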
Type I and Type II errors
A Type I error is the rejection of a true null hypothesis, therefore finding an incorrect significance with a false positive.
A Type II error is the failure to reject a false null hypothesis, therefore failing to detect a true effect with a false negative.
When you set a rigorous significance level, such as 0.01, you stand more risk of a Type II error, whereas with a level that’s too relaxed, such as 0.1, you risk a Type I error. This is why choosing an appropriate significance level is so critical. If, for example, you set a relaxed 0.1 threshold in court, then around 1 in every 10 innocent defendants would be found guilty.
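A small simulation can show what a significance level means in practice: when the null hypothesis is true, the share of experiments that wrongly reject it is roughly equal to the threshold. The sketch below is illustrative only; the function name, sample sizes and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def type_i_error_rate(alpha, experiments=2000, n=30, seed=1):
    """Share of experiments that reject a null hypothesis which is actually true."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(experiments):
        # The null is true here: data really come from a mean of 0, sd of 1.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = mean(sample) * sqrt(n)
        if abs(z) > critical_z:
            rejections += 1
    return rejections / experiments

print(type_i_error_rate(0.01), type_i_error_rate(0.10))
```

The stricter threshold produces far fewer false positives, at the cost of making real effects harder to detect.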
The Chi-Squared statistic is a means of testing for independence with categorical data, rather than numeric. E.g. to test whether eye colour and gender illustrate a significant difference from independence (in other words, whether or not there is a relationship between the two variables).
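CHISQ.TEST is the Excel function for carrying out this test; it returns the p-value for the test of independence given the observed and expected counts. CHISQ.INV and CHISQ.INV.RT return inverse values of the chi-squared distribution, which can be used to look up critical values.
Statistical power measures the probability that a test will correctly reject a false null hypothesis and is a very important calculation to support the validity of your research project.
Power is highly dependent on your sample size but another element of a power calculation is the effect size, which can be measured using the Cohen’s d statistic. The d value splits the effect being measured into categories: by convention, a d of around 0.2 is regarded as a small effect, 0.5 as medium and 0.8 as large.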
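For illustration, the Chi-Squared statistic for a table of counts can be calculated by hand as below. The function name and the counts are made up; the 3.841 figure is the standard 5% critical value for one degree of freedom.

```python
def chi_squared_statistic(table):
    """Chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Made-up counts: rows are two genders, columns are two eye colours.
observed_counts = [[20, 30], [30, 20]]
statistic, dof = chi_squared_statistic(observed_counts)
print(statistic, dof)  # 4.0 1
```

Since 4.0 exceeds the critical value of 3.841, these made-up counts would count as significant evidence against independence at the 5% level.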
The sample size, effect size and power make up a statistical power calculation. A power calculator can take two of those measurements to calculate the third which is useful in planning your study e.g. to calculate the sample size you require to achieve acceptable power. 80% (0.8) is the standard guideline for acceptable power.
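As a rough sketch of such a calculation, the Python below computes Cohen’s d for two groups and then the sample size per group needed to reach a target power. The function names and data are illustrative, the sample size formula is a normal approximation for a two-sided two-sample comparison (dedicated power calculators based on the t-distribution give slightly larger answers), and a 5% significance level is assumed.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, variance

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate observations needed in each group (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

d = cohens_d([2, 4, 6, 8], [1, 3, 5, 7])  # made-up data
print(round(d, 3))
print(sample_size_per_group(0.5))  # 63
```

Smaller effects need much larger samples to reach the same power, which is why the effect size matters so much when planning a study.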
The group in an experiment who are randomly selected from the population.
The group in an experiment who are specifically selected from the population based on the particular characteristics which are being tested. They can then be compared against the control group to confirm or reject the hypothesis.
A branch of statistics which consists of methods for organising and summarising information. Descriptive statistics uses the collected data and provides factual findings from it.
Inferential statistics consists of methods for drawing conclusions and measuring their reliability based on information gathered from statistical relationships between fields in a sample of a population. This can be used for forecasting future events.
Making predictions and generalisations represented by data collections.
Machine learning is the process of computers learning to solve problems from data by themselves rather than following explicitly programmed rules, often at a scale that humans are unable to handle.
Based on Artificial Intelligence but with an emphasis on big data and large scale applications, machine learning can train a computer algorithm to use statistics from data to learn and make forecasts and predictions about the future.
In supervised machine learning, algorithms use the numeric features of known values to detect patterns and trends on the variable we are trying to predict.
Unsupervised machine learning doesn’t focus on a particular known variable, instead looking at similarities among all variables on the observations. Once the model is trained, new observations are assigned to their relevant cluster based on their characteristics.
There are numerous different techniques you can use to model your data; which you choose depends on the problem you want to solve:
- Classification may be used to solve a Yes/No question
- Regression can be used to predict a numerical value
- Clustering can group observations into similar looking groups
Sometimes referred to as market segmentation, this is the process of breaking down a population into groups, or samples, of similar characteristics with an identifiable difference. This can be as simple as splitting your samples between men and women, or could be based on any other attribute about the population. Often a core demographic or value is used for the groupings.
Clustering is an ‘unsupervised’ analysis which categorises your observations into groups, or ‘clusters’. There are numerous variations but in each case there is some form of distance measurement to determine how close or far apart observations are within each cluster.
k-Means clustering is probably the most common partitioning method for segmenting data. It requires the analyst to specify ‘k’, which is the number of distinct clusters you will be segmenting into.
This method begins by adding the ‘k’ number of clusters with evenly split markers, known as cluster centroids. It then assigns each observation to whichever cluster centroid it is nearest to.
Usually this initial clustering won’t have the observations very evenly split around the cluster’s centroid, so the algorithm will take the mean value of all observations in the cluster and re-position the centroid based on that. It will then look again to see whether any observations need moving into a different cluster. The algorithm continues doing this and re-positioning the centroids until it finds the best fit.
Best fit is defined as when the average distance from each observation to its cluster centre is at its smallest.
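The assign-and-reposition loop described above can be sketched as follows. This is a minimal illustration with made-up points; the function name is hypothetical and the starting centroids are passed in by hand rather than chosen automatically.

```python
from math import dist

def k_means(points, centroids, max_iterations=100):
    """Repeatedly assign points to the nearest centroid, then re-centre."""
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if updated == centroids:  # nothing moved, so the clustering is stable
            break
        centroids = updated
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # made-up data
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

The loop stops as soon as a pass leaves every centroid where it was, which means no observation would change cluster on the next pass.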
You can draw lines of demarcation between clusters with a Voronoi diagram, displaying the area of each cluster and depending which side of the line an observation sits it will be assigned to the relevant cluster.
k-means is sometimes confused with the k-nearest neighbour (KNN) algorithm, which is a separate, supervised method used for classification rather than clustering. Given a known set of labelled cases, KNN looks at the ‘k’ number of points nearest to the values of the new case, i.e. the ‘nearest neighbours’, and assigns the new case to the class most common among them.
Example: You may run an unsupervised machine learning exercise using k-means clustering on a customer base, comparing the susceptibility to marketing campaigns from the “brand loyalist” cluster of customers against your “value conscious” cluster.
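A minimal sketch of the nearest neighbour idea is shown below, where a new case takes the majority label of the ‘k’ known cases closest to it. The function name, labels and points are made up for illustration.

```python
from collections import Counter
from math import dist

def knn_classify(labelled_cases, new_case, k=3):
    """Label new_case by majority vote among its k nearest labelled cases."""
    nearest = sorted(labelled_cases,
                     key=lambda case: dist(case[0], new_case))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

known_cases = [  # made-up (coordinates, label) pairs
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]
print(knn_classify(known_cases, (2, 2)))  # A
print(knn_classify(known_cases, (7, 7)))  # B
```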
Hierarchical clustering builds multiple levels of clusters, creating a hierarchy with a cluster tree.
When you have specific definitions to group your data by, predictive modelling can be a useful alternative to clustering. Variables found to be statistically significant predictors of another variable can be used to define segmentations for your analysis.
Statistical learning places more emphasis on mathematics and statistical models with their various interpretations and precisions.
A form of supervised learning that estimates the relationships among variables, based on the data you have available to you.
Excel has an in-built tool for Regression analysis.
- Simple regression involves one independent variable (which is controlled within the experiment) and one dependent variable, which is being predicted based on the values of the independent variable
- Multiple regression analysis uses more than one independent variable, with techniques like the method of least squares used to fit the model and significance tests used to determine whether the independent variables are making a significant contribution to the model
- Linear regression analysis describes where a relationship between variables can be approximated by a straight line
- Simple linear regression is where the relationship between one independent variable and another dependent variable can be approximated by a straight line, with slope and intercept
- Multiple linear regression analyses two or more independent variables to create straight line models or equations with the dependent variable
- Curvilinear relationships are the results of regression analysis experiments where the variables do not have a linear relationship
Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) measure the residuals (the differences between predicted and actual values) in the units of the label being analysed.
Relative Absolute Error (RAE) and Relative Squared Error (RSE) are relative measures of error. The closer their values are to zero, the more accurate the model is predicting.
R Squared (also referred to as the Coefficient of Determination) is a measure produced by a regression analysis showing the proportion of the variance in the dependent variable that the model explains. A value of 1 would demonstrate a perfect fit between the model and the data.
Example: An R Squared value of 0.96 between price and sales would indicate that 96% of the variance in sales is explained by the price.
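The error measures and R Squared above can be calculated directly from actual and predicted values, as in this sketch. The function name and the figures are made up; RAE and RSE are taken relative to a model that always predicts the mean of the actual values, which is the usual definition.

```python
from math import sqrt
from statistics import mean

def regression_metrics(actual, predicted):
    """Return MAE, RMSE, RAE, RSE and R Squared for a set of predictions."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    actual_mean = mean(actual)
    sum_abs_residuals = sum(abs(r) for r in residuals)
    sum_squared_residuals = sum(r ** 2 for r in residuals)
    sum_abs_total = sum(abs(a - actual_mean) for a in actual)
    sum_squared_total = sum((a - actual_mean) ** 2 for a in actual)
    rse = sum_squared_residuals / sum_squared_total
    return {
        "MAE": sum_abs_residuals / len(residuals),
        "RMSE": sqrt(sum_squared_residuals / len(residuals)),
        "RAE": sum_abs_residuals / sum_abs_total,
        "RSE": rse,
        "R2": 1 - rse,
    }

metrics = regression_metrics(actual=[3, 5, 7, 9], predicted=[2, 5, 8, 9])
print(metrics)
```

For these made-up values R Squared is 0.9, meaning 90% of the variance in the actual values is explained by the predictions.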
Line of best fit
The process of finding a line of best fit on a chart or plot which represents the relationship is known as regression.
Residuals are the gaps between the data points and the line. Data points above the line have positive residual values and below the line have negative residual values. The residual value = Data – Fit.
The least squares regression line is calculated to minimise the sum of the squared residual values. It doesn’t matter if there are more positive data points (above the line) or negative (below), just that the overall sum of the squared residuals is the lowest possible value. Often referred to simply as ‘the regression line’, it always goes through the point where the x axis mean meets the y axis mean.
Regression lines are only useful where a straight line is appropriate to represent the data i.e. where the residuals are not too high. Otherwise, a curved line is more appropriate. You can use a residual plot to help decide if your line fits the data well: if the residual plot has no pattern, with data points scattered above and below the line, then the line fits well. If a regression line does provide a reasonable model for a set of data, it can be used for predicting and forecasting future outcomes.
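As an illustration, the slope and intercept of a least squares line (the line minimising the sum of squared residuals) can be computed with the standard closed-form formulas. The function name and data points here are made up.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Slope and intercept of the line minimising the squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept

xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
slope, intercept = least_squares_line(xs, ys)

# Residual = Data - Fit for each point.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(round(slope, 2), round(intercept, 2))  # 0.6 2.2
```

Because the intercept is defined from the two means, the fitted line passes through the point (mean of x, mean of y), and the residuals sum to zero.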
A system, sometimes used to support a regression experiment, which orders the observations in your range by value and scales them zero to one.
Classification is a data mining technique for solving Yes / No questions. Whereas regression analysis predicts a numeric value, classification helps us predict which class (or category) our data observations belong to. A decision tree is one common classification method and is a particularly useful technique for breaking down very large datasets for analysing and making predictions.
In its simplest form, you could use classification to break down your results into just two categories: true or false. You won’t necessarily come up with 1 (for true) or 0 (for false) but a value somewhere in the middle, with a threshold set to define which are classified true and which are classified false.
In classification experiments, you have a training set of labelled observations which you feed the machine and a test set of observations which you use only for evaluation. You can also withhold some of the training data as a validation set to tune the model before the final test.
Handwriting recognition is one example where classification could be used. Your training set will have labelled images stating which letters the images refer to, which your machine will learn from so that it can classify the images in the test set accurately.
The test data cases can then be divided into groups:
- Cases where the model predicts a 1 which are actually true are “true positives”
- Cases where the model predicts a 0 which are actually false are “true negatives”
- Cases where the model predicts a 1 which are actually false are “false positives”, Type I errors
- Cases where the model predicts a 0 which are actually true are “false negatives”, Type II errors
Based on the test results you may choose to move the threshold to change how the predicted values are classified.
The accuracy and significance of the results of a classification model can be measured in numerous different ways. Some examples are below, where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative:
- Accuracy = (TP + TN) / (TP + FP + TN + FN)
- Precision = TP / (TP + FP) which returns the fraction of cases classified as positives that are actually true and not false positives
- Recall / True Positive Rate = TP / (TP + FN) which provides the fraction of positive cases correctly identified
- False Positive Rate = FP / (FP + TN) comparing false positives against the actual number of negatives
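The formulas above can be applied to a set of scored predictions as in this sketch. The scores, labels, threshold and function names are made up for illustration.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Count TP, TN, FP and FN after classifying scores at a threshold."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1  # false positive
        else:
            fn += 1  # false negative
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # made-up model outputs
labels = [1, 1, 0, 1, 0, 0]              # made-up true classes
counts = confusion_counts(scores, labels)
print(counts)  # (2, 2, 1, 1)
print(classification_metrics(*counts))
```

Raising or lowering the threshold changes the counts, trading false positives against false negatives.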
You can plot the TPR (True Positive Rate) and FPR (False Positive Rate) based on any threshold by charting an ROC curve (receiver operating characteristic curve) to show the performance of a classification model across all thresholds. This also allows you to view the AUC (area under the curve) on that plot to understand the accuracy of the predictions from the classification model. The larger the AUC the better the model predicts. Simple guessing in an experiment with two categories would have an AUC of 0.5.
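One way to compute the AUC without drawing the curve relies on a known equivalence: the area under the ROC curve equals the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case. The sketch below uses that shortcut with made-up scores; the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # made-up model outputs
labels = [1, 1, 0, 1, 0, 0]              # made-up true classes
auc = roc_auc(scores, labels)
print(round(auc, 3))  # 0.889
```

A model that ranks every positive above every negative scores 1.0, while scores that carry no information average out at 0.5.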
Random forests are a learning method which consists of numerous decision trees and can be used with classification and regression analyses. An extension to the standard decision tree, a random forest can create numerous decision trees for a classification model and output the mode, or for regression the mean prediction of each decision tree would be used.
A matrix which displays the number of true positives, true negatives, false positives and false negatives in your testing, to help measure the effectiveness of your classification model.
A term to describe where a model has been iterated several times to the effect that it is performing more accurately with the data it was trained and tuned on than it would with any new data.
Regularisation, also referred to as ‘shrinkage’, is the process of adding information to your model, typically a penalty on its complexity, as a technique for avoiding overfitting.
Cross validation is a process for evaluating a machine learning algorithm which is also a technique to prevent overfitting. Nested cross validation is a method for tuning the parameters of an algorithm.
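A minimal sketch of k-fold cross validation is shown below: the observations are divided into k folds and each fold takes a turn as the test set while the rest are used for training. The function name is hypothetical and the sketch assumes the observations are already in random order.

```python
def k_fold_splits(n_observations, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_observations))
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

for train, test in k_fold_splits(6, 3):
    print("train:", train, "test:", test)
```

Every observation is used for testing exactly once, so the averaged score is less dependent on one lucky or unlucky split.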
The process of categorising opinions expressed within a piece of text, usually to determine whether the comment was positive, negative or neutral. This type of analysis is prominent when analysing customer feedback on social media platforms.
Analysis of Variance, a hypothesis testing technique allowing you to determine whether the differences between samples are simply due to random sampling errors, or whether there are systematic treatment effects that cause the mean in one group to differ from the mean in another. Excel has a built-in analysis tool for ANOVA.
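For illustration, the F statistic for a one-way ANOVA compares the variation between group means with the variation within groups. The function name and data below are made up, and the sketch stops at the F statistic rather than looking up its p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic: between-group variance over within-group variance."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up samples
f_statistic = one_way_anova_f(groups)
print(round(f_statistic, 2))  # 13.0
```

A large F value suggests the differences between group means are bigger than random sampling error alone would produce.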
General linear models
An ANOVA procedure used to test hypotheses in statistical experiments, factoring the results of known quantities along with predicted values.
Supervised vs Unsupervised learning
In supervised learning, the algorithm is given the dependent variable and it looks at all the independent variables in order to make a prediction about the dependent variable. E.g. classification, regression. When training a supervised learning model, it is standard practice to split the data into a training dataset and a test dataset, so that the test data can be used to validate the trained model.
Unsupervised learning differs because the algorithm is only given the independent variables and without being given any direction, it returns results about relationships between any of the variables. E.g. clustering, segmentation.
If you want to segregate your customers who pay on card vs your customers who pay with cash, that’s supervised machine learning because you’re telling the machine what to look for. If alternatively you want to give the computer all your data and measures and ask it to segment the customer base by highlighting any interesting patterns, that’s unsupervised machine learning.
Information leakage can occur when you fail to split the data before training a supervised machine learning algorithm. The model appears to be getting more accurate but it’s getting more accurate at learning the training data rather than the actual population data.
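A minimal sketch of a random train/test split carried out before any training is shown below. The function name, the 25% test fraction and the seed are arbitrary illustrative choices.

```python
import random

def train_test_split(observations, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data, then hold back a fraction for testing."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(range(20))
print(len(train), len(test))  # 15 5
```

The model is then fitted on the training portion only, and the test portion is kept aside purely for evaluation.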
A method of splitting data for a supervised learning algorithm using a random sampling method which avoids biases.
Extremely large datasets which are too complex to analyse for insight without the use of sophisticated statistical software.
Time series analysis
General term for trending findings over a time period, usually in a graphical format, which can be used to make predicted forecasts for the future.
A statistical method of estimating the expected duration of time until an event occurs.
Example: A human being’s expected lifespan being assessed using inputs such as age, gender, location and wealth.
Monte Carlo simulation
A mathematical method of using random draws to perform calculations for complex problems. Using Random Number Generation to simulate the results of an outcome over and over again a vast number of times, the overall result helps you calculate the probability of an outcome occurring.
The ‘replicate’ function in R allows you to stipulate how many times you wish to re-run the experiment.
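The same repeat-many-times idea can be sketched in Python. This example is illustrative only: it estimates the probability that two dice total at least 10, a case simple enough that the exact answer (6 out of 36) is known for comparison. The function names, number of runs and seed are arbitrary.

```python
import random

def estimate_probability(trial, runs=100_000, seed=42):
    """Run a random trial many times and return the share of successes."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(runs) if trial(rng))
    return successes / runs

def two_dice_at_least_ten(rng):
    return rng.randint(1, 6) + rng.randint(1, 6) >= 10

estimate = estimate_probability(two_dice_at_least_ten)
print(round(estimate, 3))  # should land close to the exact value of 6/36
```

Increasing the number of runs narrows the gap between the simulated estimate and the true probability.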