A prediction or statement about a characteristic of a variable, which can be tested to provide evidence for or against it.
A method of statistically testing a hypothesis by comparing data against values predicted by the hypothesis. The significance test considers two hypotheses, the null and the alternative, effectively testing whether or not there is significant reason to support the hypothesis or whether the results are random.
When we can’t prove the alternative hypothesis is significant, we don’t “accept” the null hypothesis but “fail to reject” the null hypothesis. This implies the data is not sufficiently persuasive for us to choose the alternative hypothesis over the null hypothesis but it doesn’t necessarily prove the null hypothesis either.
The null hypothesis assumes randomness and is directly tested during a significance test. As the null hypothesis indicates no significance, you are usually trying to disprove the null hypothesis in your statistical tests.
Contradicting the null hypothesis, the alternative hypothesis is supported if the significance test indicates that the null hypothesis is incorrect.
T-Tests and Z-Tests
These are two methods of significance testing used to determine whether there is a significant difference between means, for example between a sample mean and a hypothesised population mean, or between the means of two groups.
If we already know the mean and standard deviation of the full population, we can conduct a Z test. Typically we only have a sample of the population and therefore need to carry out a T-test instead.
The Z.TEST function in Excel performs a z-test: =Z.TEST(array, x, [sigma]), where array is the sample range, x is the hypothesised population mean and sigma is the (optional) known population standard deviation; if sigma is omitted, the sample standard deviation is used instead. The function returns a p-value, indicating the probability of randomly observing a sample mean at least as far from the population mean as the one you have.
The T-test is based on the T-distribution whereas a Z test relies on the assumption of a normal distribution. Excel has in-built tools for the various types of T-test, including two-sample assuming equal variances and two-sample assuming unequal variances.
There are a variety of different types of T and Z test, the most basic being a one-sample, one-tailed test. A one-sample test compares a single sample of the data against a hypothesised population mean, but other variants compare multiple samples. A one-tailed test provides a p-value in one direction only, testing either whether the sample mean is significantly higher or whether it is significantly lower than the hypothesised mean, whereas a two-tailed test checks both ways.
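The document's worked examples use Excel, but the same one-sample test can be sketched in Python with SciPy. The sales figures below are purely hypothetical, and the one-tailed adjustment shown is the standard halving of the two-tailed p-value.

```python
from scipy import stats

# Hypothetical sample: weekly sales figures for one store
sample = [102, 98, 110, 105, 95, 101, 108, 99, 104, 107]

# One-sample, two-tailed t-test against a hypothesised population mean of 100
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

# For a one-tailed test (is the mean significantly *greater* than 100?),
# halve the two-tailed p-value when the t statistic is positive
one_tailed_p = p_value / 2 if t_stat > 0 else 1 - p_value / 2

print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.3f}")
```

The sample mean here is above 100, so the one-tailed p-value is half the two-tailed one.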
One-tailed vs Two-tailed tests
If you are only specifically looking for one alternative hypothesis, then a one-tailed significance test is sufficient. However, sometimes you have two possible alternative hypotheses. For example, you might be testing how paying for extra golf sessions affects a golfer’s performance. The null hypothesis is that the sessions make no difference; one alternative hypothesis is that they improve performance, but there could also be another alternative hypothesis that they worsen performance. It is important to make the correct decision about whether your significance test needs to be one- or two-tailed.
We usually want a two-tailed test, looking for significance regardless of the direction and of whether it matches our intuition. For example, if measuring whether a new initiative has increased productivity, we shouldn’t rule out that it may have had the opposite effect.
The p-value is a function of the observed sample results used for testing a statistical hypothesis. Prior to the test being performed, an agreed threshold for the probability value (p-value) should be chosen; this is known as the significance level and will usually be somewhere between 1% and 5%. The results can be measured against the significance level to provide evidence for or against the null hypothesis.
The p-value returns the probability of getting a result at least as extreme as the one observed if the null hypothesis is true and the results were actually due to random chance. So what we’re actually testing is: if the null were true, what chance is there that you would get this result?
As a general rule a p-value of 0.05 (5%) or less is deemed to be statistically significant.
Whilst a p-value is useful in establishing significance, it is often more informative to calculate a confidence interval to understand the precision of the result. The p-value may tell you to reject the null hypothesis; a 95% confidence interval can support the same conclusion while also offering the additional information of a range of values the true result probably lies between.
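As a sketch of the extra information a confidence interval provides, the snippet below computes a 95% interval for a mean from a small, hypothetical sample, using the t-distribution since the population standard deviation is unknown.

```python
import math
from scipy import stats

# Hypothetical sample of measurements
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
n = len(sample)
mean = sum(sample) / n

# Sample standard deviation (n - 1 denominator) and standard error
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
se = sd / math.sqrt(n)

# 95% confidence interval using the t-distribution (n - 1 degrees of freedom)
t_crit = stats.t.ppf(0.975, df=n - 1)
lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval gives a range of plausible values for the true mean, rather than a single reject/fail-to-reject decision.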
P-hacking, also known as data dredging, is the process of finding patterns in data which enable a test to be deemed statistically significant when in reality there is no significant underlying effect.
Type I and Type II errors
A Type I error is the rejection of a true null hypothesis, therefore finding an incorrect significance with a false positive.
A Type II error is the failure to reject a false null hypothesis, therefore failing to define a true significance with a false negative.
When you set a rigorous threshold for your p-value, such as 0.01, you stand more risk of a Type II error; with a threshold that’s too relaxed (a high significance level) you risk a Type I error. This is why choosing an appropriate significance level is so critical. If, for example, you set a relaxed 0.1 threshold in court, then roughly 1 in every 10 genuinely innocent defendants would be found guilty.
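The link between the significance level and the Type I error rate can be illustrated with a small simulation (a sketch, not part of the original text): repeatedly sample from a population where the null hypothesis really is true, and count how often a t-test wrongly rejects it.

```python
import random
from scipy import stats

random.seed(42)
alpha = 0.05
trials = 2000
false_positives = 0

# Repeatedly sample from a population where the null hypothesis is TRUE
# (the mean really is 0) and count how often we wrongly reject it
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(30)]
    _, p = stats.ttest_1samp(sample, popmean=0)
    if p < alpha:
        false_positives += 1  # a Type I error

type_i_rate = false_positives / trials
print(f"Observed Type I error rate: {type_i_rate:.3f} (expected ~{alpha})")
```

The observed rate lands close to the chosen significance level, which is exactly what a Type I error rate of alpha means.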
The Chi-Squared statistic is a means of testing for independence with categorical data, rather than numeric. E.g. to test whether eye colour and gender illustrate a significant difference from independence (in other words, whether or not there is a relationship between the two variables).
CHISQ.TEST is the Excel function to perform this test, returning the p-value (CHISQ.INV, by contrast, returns the inverse of the chi-squared distribution).
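The eye colour and gender example can be sketched in Python with SciPy's chi-squared test of independence; the counts in the contingency table below are hypothetical.

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: eye colour (rows) by gender (columns)
#            male  female
observed = [[40, 35],   # brown
            [25, 30],   # blue
            [15, 20]]   # green

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
# A small p-value would suggest eye colour and gender are not independent
```

The degrees of freedom are (rows − 1) × (columns − 1) = 2 for this table.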
Statistical power is the probability that a test will correctly reject a false null hypothesis, and is a very important calculation to support the validity of your research project.
Power is highly dependent on your sample size, but another element of a power calculation is the effect size, which can be measured using the Cohen’s d statistic. The d value splits the effect being measured into categories: by convention, around 0.2 is considered a small effect, 0.5 a medium effect and 0.8 a large effect.
The sample size, effect size and power make up a statistical power calculation. A power calculator can take two of those measurements to calculate the third which is useful in planning your study e.g. to calculate the sample size you require to achieve acceptable power. 80% (0.8) is the standard guideline for acceptable power.
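As a sketch of how two of the three quantities determine the third, the function below estimates the sample size per group for a two-sample, two-tailed test using a normal approximation (the exact t-based answer is slightly larger); the 0.5 effect size and 0.8 power follow the conventions mentioned above.

```python
import math
from scipy.stats import norm

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sample, two-tailed test
    (normal approximation; a sketch, not an exact power calculator)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for two-tailed alpha
    z_beta = norm.ppf(power)           # critical value for the power target
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Cohen's d = 0.5 ("medium" effect), 5% significance, 80% power
n = sample_size_per_group(0.5)
print(f"~{n} observations per group needed")
```

Halving the effect size roughly quadruples the required sample size, since the effect size appears squared in the denominator.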
The group in an experiment which does not receive the treatment being tested, providing a baseline for comparison. Members are usually randomly selected from the population.
The group in an experiment which receives the treatment being tested, selected based on the particular characteristics which are being tested. They can then be compared against the control group to confirm or reject the hypothesis.
A branch of statistics which consists of methods for organising and summarising information. Descriptive statistics use the collected data and provide factual findings from it.
Inferential statistics consists of methods for drawing conclusions and measuring their reliability based on information gathered from statistical relationships between fields in a sample of a population. This can be used for forecasting future events.
Making predictions and generalisations based on data collections.
Machine learning is the process of computers solving problems by themselves, usually problems too complex for humans to solve directly.
Based on Artificial Intelligence but with an emphasis on big data and large-scale applications, machine learning can train a computer algorithm to use statistics from data to learn and make forecasts and predictions about the future.
In supervised machine learning, algorithms use the features of observations with known labels to detect patterns and trends in the variable we are trying to predict.
Unsupervised machine learning doesn’t focus on a particular known variable, instead looking at similarities among the observations across all variables. Once the model is trained, new observations are assigned to their relevant cluster based on their characteristics.
There are numerous different techniques you can use to model your data; which you choose depends on the problem you’re wanting to solve:
- Classification may be used to solve a Yes/No question
- Regression can be used to predict a numerical value
- Clustering can group observations into similar looking groups
Sometimes referred to as market segmentation, this is the process of breaking down a population into groups, or samples, of similar characteristics with an identifiable difference. This can be as simple as splitting your samples between men and women, or could be based on any other attribute about the population. Often a core demographic or value is used for the groupings.
Clustering is an ‘unsupervised’ analysis which categorises your observations into groups, or ‘clusters’. There are numerous variations but in each case there is some form of distance measurement to determine how close or far apart observations are within each cluster.
k-Means clustering is probably the most common partitioning method for segmenting data. It requires the analyst to specify ‘k’, which is the number of distinct clusters you will be segmenting into.
This method begins by placing ‘k’ markers, known as cluster centroids, often at random positions or on randomly chosen observations. It then assigns each observation to whichever cluster centroid it is nearest to.
Usually this initial clustering won’t have the observations very evenly split around the cluster’s centroid, so the algorithm will take the mean value of all observations in the cluster and re-position the centroid based on that. It will then look again to see whether any observations need moving into a different cluster. The algorithm continues doing this and re-positioning the centroids until it finds the best fit.
Best fit is defined as when the average distance from each observation to its cluster centre is at its smallest.
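The assign-then-re-centre loop described above can be sketched in a few lines of Python. This is a deliberately minimal one-dimensional version with made-up data, not a production implementation (libraries such as scikit-learn provide a full KMeans).

```python
import random

def kmeans_1d(points, k, iterations=20, seed=0):
    """Minimal 1-D k-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random observations
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centroid wins
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):  # update step: re-centre
            if cluster:
                centroids[i] = sum(cluster) / len(cluster)
    return sorted(centroids)

# Two obvious groups, around 1 and around 10
data = [0.9, 1.1, 1.0, 9.8, 10.2, 10.0]
centres = kmeans_1d(data, k=2)
print(centres)
```

On this toy data the centroids settle on the means of the two visible groups, around 1.0 and 10.0.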
You can draw lines of demarcation between clusters with a Voronoi diagram, displaying the area of each cluster; whichever side of a line an observation sits on determines which cluster it is assigned to.
A related but distinct algorithm is k-nearest neighbours (KNN), a supervised method. Given a known set of labelled cases, KNN classifies a new case based on the ‘k’ points whose values are nearest to it, i.e. its ‘nearest neighbours’.
Example: You may run an unsupervised machine learning exercise using k-means clustering on a customer base, then compare the susceptibility to marketing campaigns of the “brand loyalist” cluster of customers against your “value conscious” cluster.
Hierarchical clustering builds multiple levels of clusters, creating a hierarchy with a cluster tree.
When you have specific definitions to group your data by, predictive modelling can be a useful alternative to clustering. Variables found to be statistically significant predictors of another variable can be used to define segmentations for your analysis.
Statistical learning places more emphasis on mathematics and statistical models, with their various interpretations and precisions.
A form of supervised learning that estimates the relationships among variables, based on the data you have available to you.
Excel has an in-built tool for Regression analysis.
- Simple regression involves one independent variable (which is controlled within the experiment) and one dependent variable, which is being predicted based on the values of the independent variable
- Multiple regression analysis uses more than one independent variable; the model is fitted using techniques like the method of least squares, and you can then test whether each independent variable makes a significant contribution to the model
- Linear regression analysis describes where a relationship between variables can be approximated by a straight line
- Simple linear regression is where the relationship between one independent variable and another dependent variable can be approximated by a straight line, with slope and intercept
- Multiple linear regression analyses two or more independent variables to create straight line models or equations with the dependent variable
- Curvilinear relationships result from regression analysis experiments where the variables do not have a linear relationship
Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) measure the residuals (the variance between predicted and actual values) in the units of the label being analysed.
Relative Absolute Error (RAE) and Relative Squared Error (RSE) are relative measures of error. The closer their values are to zero, the more accurately the model predicts.
R Square (also referred to as the Coefficient of Determination) measures the proportion of variance in the dependent variable that the regression explains. A value of 1 would demonstrate a perfect fit between the variables.
Example: An R Square value of 0.96 between price and sales would indicate that 96% of the variance in sales is explained by the price.
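The price/sales example can be sketched with the least-squares formulas directly in Python; the figures below are hypothetical and chosen to be nearly linear, so the R Square value comes out very high.

```python
# Simple linear regression (least squares) on hypothetical price/sales data
prices = [10, 12, 14, 16, 18, 20]
sales = [200, 185, 172, 158, 145, 130]

n = len(prices)
mean_x = sum(prices) / n
mean_y = sum(sales) / n

# Slope and intercept from the least-squares formulas
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(prices, sales))
         / sum((x - mean_x) ** 2 for x in prices))
intercept = mean_y - slope * mean_x

# R squared: proportion of variance in sales explained by price
predicted = [intercept + slope * x for x in prices]
ss_res = sum((y - p) ** 2 for y, p in zip(sales, predicted))
ss_tot = sum((y - mean_y) ** 2 for y in sales)
r_squared = 1 - ss_res / ss_tot
print(f"sales = {intercept:.1f} + {slope:.2f} * price, R2 = {r_squared:.3f}")
```

The negative slope reflects sales falling as price rises, and R Square close to 1 means the line explains almost all of the variance.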
A method, sometimes used to support a regression experiment, which orders the observations in your range by value and scales them from zero to one.
Classification is a data mining technique for solving Yes / No questions. Whereas regression analysis predicts a numeric value, classification helps us predict which class (or category) our data observations belong to. A decision tree is one common classification technique, and classification is particularly useful for breaking down very large datasets for analysis and prediction.
In its simplest form, you could use classification to break down your results into just two categories: true or false. You won’t necessarily come up with 1 (for true) or 0 (for false) but a value somewhere in the middle, with a threshold set to define which are classified true and which are classified false.
In classification experiments, you have a training set of labelled observations which we feed to the machine, and a test set of observations which we use only for evaluation. We can also withhold some of the training data to use to validate our model.
Handwriting recognition is one example where classification could be used. Your training set will have labelled images stating which letters the images refer to which your machine will use to try to learn and evaluate the test set accurately.
The test data cases can then be divided into groups:
- Cases where the model predicts a 1 which were actually true are “true positives”
- Cases where the model predicts a 0 which are actually false are “true negatives”
- Cases where the model predicts a 1 which are actually false are “false positives”, Type I errors
- Cases where the model predicts a 0 which are actually true are “false negatives”, Type II errors
Based on the test results you may choose to move the threshold to change how the predicted values are classified.
The accuracy and significance of the results of a classification model can be measured in numerous different ways. Some examples are below, where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative:
- Accuracy = (TP + TN) / (TP + FP + TN + FN)
- Precision = TP / (TP + FP) which returns the fraction of cases classified as positives that are actually true and not false positives
- Recall / True Positive Rate = TP / (TP + FN) which provides the fraction of positive cases correctly identified
- False Positive Rate = FP / (FP + TN) comparing false positives against the actual number of negatives
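The four formulas above are simple ratios over the confusion-matrix counts, as this small Python sketch shows (the counts are hypothetical).

```python
# Hypothetical confusion-matrix counts from a classification test set
tp, tn, fp, fn = 80, 90, 10, 20

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # true positive rate
fpr = fp / (fp + tn)             # false positive rate

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, FPR={fpr:.2f}")
```

Note that precision and recall penalise different mistakes: precision falls with false positives, recall with false negatives, which is why both are reported.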
You can plot the TPR (True Positive Rate) and FPR (False Positive Rate) based on any threshold by charting an ROC curve (receiver operating characteristic curve) to show the performance of a classification model across all thresholds. This also allows you to view the AUC (area under the curve) on that plot to understand the accuracy of the predictions from the classification model. The larger the AUC the better the model predicts. Simple guessing in an experiment with two categories would have an AUC of 0.5.
A matrix which displays the number of true positives, true negatives, false positives and false negatives in your testing, to help measure the effectiveness of your classification model.
A term to describe where a model has been iterated so many times that it performs more accurately on the training data than it would on any new data.
Regularisation, also referred to as ‘shrinkage’, is the process of adding a penalty for complexity into your model as a technique for avoiding overfitting in your machine learning model.
Cross validation is a process for evaluating a machine learning algorithm which is also a technique to prevent overfitting. Nested cross validation is a method for tuning the parameters of an algorithm.
The process of categorising opinions expressed within a piece of text, usually to determine whether the comment was positive, negative or neutral. This type of analysis is prominent when analysing customer feedback on social media platforms.
Analysis of Variance, a hypothesis testing technique allowing you to determine whether the differences between samples are simply due to random sampling errors, or whether there are systematic treatment effects that cause the mean in one group to differ from the mean in another. Excel has a built-in analysis tool for ANOVA.
General linear models
A family of statistical models, of which ANOVA is one example, used to test hypotheses in statistical experiments by relating a response variable to one or more known quantities and predictors.
Supervised vs Unsupervised learning
In supervised learning, the algorithm is given the dependent variable and it looks at all the independent variables in order to make a prediction about the dependent variable. E.g. classification, regression. When training a supervised learning model, it is standard practice to split the data into a training dataset and a test dataset, so that the test data can be used to validate the trained model.
Unsupervised learning differs because the algorithm is only given the independent variables and without being given any direction, it returns results about relationships between any of the variables. E.g. clustering, segmentation.
If you want to segregate your customers who pay on card vs your customers who pay with cash, that’s supervised machine learning because you’re telling the machine what to look for. If alternatively you want to give the computer all your data and measures and ask it to segment the customer base by highlighting any interesting patterns, that’s unsupervised machine learning.
Information leakage can occur when you fail to split the data before training a supervised machine learning algorithm. The model appears to be getting more accurate, but it is getting better at learning the training data rather than the actual population data.
A method of splitting data for a supervised learning algorithm using a random sampling method which avoids biases.
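As a minimal sketch of such a split (shuffling first so the split is unbiased), the hypothetical helper below holds out a quarter of the observations for testing; real projects would typically use a library routine such as scikit-learn's train_test_split.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Randomly shuffle the observations, then hold out a fraction
    for testing (a minimal sketch of a random, unbiased split)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))  # stand-in for 100 labelled observations
train, test = train_test_split(data)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing models.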
Extremely large datasets which are too complex to analyse for insight without the use of sophisticated statistical software.
Time series analysis
General term for trending findings over a time period, usually in a graphical format, which can be used to make forecasts for the future.
A statistical method of estimating the expected duration of time until an event occurs.
Example: A human being’s expected lifespan being assessed using inputs such as age, gender, location and wealth.
Monte Carlo simulation
A mathematical method of using random draws to perform calculations for complex problems. By using random number generation to simulate the outcome of a process over and over again, a vast number of times, the overall results help you estimate the probability of an outcome occurring.
The ‘replicate’ function in R allows you to stipulate how many times you wish to re-run the experiment.
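The text mentions R's replicate function; the same idea can be sketched in Python. Here a simulation with a known answer, the probability that two dice sum to 7 (exactly 1/6), shows the estimate converging on the true value.

```python
import random

# Estimate the probability that two dice sum to 7 (true value: 1/6 ~ 0.1667)
rng = random.Random(1)
trials = 100_000
hits = sum(1 for _ in range(trials)
           if rng.randint(1, 6) + rng.randint(1, 6) == 7)
estimate = hits / trials
print(f"Estimated P(sum = 7) = {estimate:.4f}")
```

With 100,000 trials the estimate typically lands within a few thousandths of 1/6; more trials shrink the error further.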