A prediction or statement about a characteristic of a variable, which can be tested to provide evidence for or against.
A method of statistically testing a hypothesis by comparing data against the values the hypothesis predicts. A significance test weighs two hypotheses, the null and the alternative, to determine whether there is significant evidence to support the alternative or whether the results could simply be down to chance.
When we can’t prove the alternative hypothesis is significant, we don’t “accept” the null hypothesis but “fail to reject” the null hypothesis. This implies the data is not sufficiently persuasive for us to choose the alternative hypothesis over the null hypothesis but it doesn’t necessarily prove the null hypothesis either.
The hypothesis that is directly tested during a significance test. As the null hypothesis indicates no significance, you are usually trying to disprove the null hypothesis in your statistical tests.
The alternative hypothesis contradicts the null hypothesis; it is supported if the significance test indicates the null hypothesis is incorrect.
T-Tests and Z-Tests
These are two methods of significance testing used to determine whether a difference between means (for example, between a sample mean and a hypothesised population mean) is significant.
If we already know the mean and standard deviation of the full population, we can conduct a Z test. Typically we only have a sample of the population and therefore need to carry out a T-test instead.
The Z.TEST function in Excel performs the calculation either way: =Z.TEST(array, x, [sigma]). If you supply sigma (the known population standard deviation) it is used; otherwise the sample standard deviation is used. The function returns a one-tailed p-value, indicating the probability of randomly observing a sample mean at least as far above x as the one you have.
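What Z.TEST computes can be sketched in pure Python. This is illustrative only, and the sample numbers below are made up:

```python
import math
from statistics import mean, stdev

def z_test_p(sample, mu0, sigma=None):
    """One-tailed p-value, mirroring Excel's Z.TEST: the probability of
    observing a sample mean at least this far above mu0 by chance.
    If sigma is omitted, the sample standard deviation is used."""
    s = sigma if sigma is not None else stdev(sample)
    z = (mean(sample) - mu0) / (s / math.sqrt(len(sample)))
    # Upper-tail probability from the standard normal CDF
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical sample: is the mean significantly above 100?
sample = [102, 105, 98, 107, 103, 101, 99, 106]
p = z_test_p(sample, 100)
```

With these made-up numbers the p-value comes out at roughly 0.01, below the usual 0.05 threshold.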
The T-test is based on the T-distribution whereas a Z test relies on the assumption of a normal distribution.
There are a variety of different types of T and Z test, the most basic being a 1-sample 1-tailed test. A 1-sample test only looks at a single sample of the data to compare against the hypothesised population mean, but you could use multiple samples. A 1-tailed test can only provide a p-value testing in one direction, either whether the sample mean is significantly higher or whether it is significantly lower than the hypothesised mean, whereas a 2-tailed test checks both ways.
One-tailed vs Two-tailed tests
If you are only specifically looking for one alternative hypothesis, then a one-tailed significance test is sufficient. However, sometimes you have two possible alternative hypotheses. For example, you might be testing how paying for extra golf sessions affects a golfer’s performance. The null hypothesis is that the sessions make no difference; one alternative hypothesis is that they improve performance, but there could also be another alternative hypothesis that they worsen performance. It is important to make the correct decision on whether your significance test needs to be one or two tailed.
A function of the observed sample results used for testing a statistical hypothesis. Prior to the test being performed an agreed threshold for the probability value (p-value) should be chosen, which is known as the significance level. It will usually be somewhere between 1 and 5%. The results can be measured against the significance level to provide evidence for or against the null hypothesis.
As a general rule a p-value of 0.05 (5%) or less is deemed to be statistically significant.
P-hacking, also known as data dredging, is the process of finding patterns in data which can enable a test to be deemed statistically significant when in reality there is no significant underlying effect.
Type I and Type II errors
A Type I error is the rejection of a true null hypothesis, therefore finding an incorrect significance with a false positive.
A Type II error is the failure to reject a false null hypothesis, therefore failing to define a true significance with a false negative.
When you set a rigorous threshold for your p-value, such as 0.01, you stand more risk of a Type II error; with a threshold that’s too relaxed, you risk a Type I error. This is why choosing an appropriate significance level is so critical. If, for example, you set a relaxed 0.1 threshold in court, then roughly 1 in every 10 innocent defendants would be found guilty.
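The link between the significance level and the Type I error rate can be checked by simulation. A rough pure-Python sketch: run many experiments where the null hypothesis is genuinely true, and count how often a two-tailed test at the 5% level wrongly rejects it:

```python
import random
random.seed(42)

# Under a true null hypothesis (no real effect), a 5% significance level
# should produce a Type I error (a false positive) about 5% of the time.
trials = 10_000
false_positives = 0
for _ in range(trials):
    # Sample mean of 30 draws from N(0, 1); the null says the true mean is 0
    m = sum(random.gauss(0, 1) for _ in range(30)) / 30
    z = m * 30 ** 0.5              # z statistic with known sigma = 1
    if abs(z) > 1.96:              # two-tailed test at the 5% level
        false_positives += 1

rate = false_positives / trials
```

The observed rate lands very close to 0.05, which is exactly the risk you signed up for when choosing that threshold.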
The Chi-Squared statistic is a means of testing for independence with categorical data, rather than numeric. E.g. to test whether eye colour and gender depart significantly from independence (in other words, whether or not there is a relationship between the two variables).
CHISQ.TEST is the Excel function to perform this test, returning the p-value directly. (CHISQ.INV returns the inverse of the chi-squared distribution, which is used for finding critical values rather than running the test itself.)
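The calculation behind the test can be sketched in pure Python. The 2x2 contingency table below is entirely hypothetical:

```python
import math

# Hypothetical 2x2 contingency table: rows = gender, columns = eye colour
observed = [[30, 20],   # male:   blue, brown
            [20, 30]]   # female: blue, brown

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
total = sum(row_totals)

# Chi-squared statistic: sum of (observed - expected)^2 / expected,
# where the expected counts assume the two variables are independent
chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / total) ** 2
           / (row_totals[i] * col_totals[j] / total)
           for i in range(2) for j in range(2))

# A 2x2 table has 1 degree of freedom, and a chi-squared variable with
# 1 df is the square of a standard normal, so the p-value follows from erf
p_value = 1 - math.erf(math.sqrt(chi2 / 2))
```

For these made-up counts the statistic is 4.0 and the p-value just under 0.05, so the table would (narrowly) be judged a significant departure from independence.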
The group in an experiment which does not receive the treatment being tested, providing a baseline for comparison. Its members are randomly selected from the population.
The group in an experiment which receives the treatment being tested, selected based on the particular characteristics under study. It can then be compared against the control group to support or reject the hypothesis.
A branch of statistics which consists of methods for organising and summarising information. Descriptive statistics uses the collected data and provides factual findings from it.
Inferential statistics consists of methods for drawing conclusions and measuring their reliability based on information gathered from statistical relationships between fields in a sample of a population. This can be used for forecasting future events.
Making predictions and generalisations about a population based on the data collected from it.
KDD stands for Knowledge Discovery in Databases, which covers the creation of knowledge from structured and unstructured sources, regularly involving machine learning.
Based on Artificial Intelligence and with an emphasis on big data and large scale applications, machine learning is the process of training a computer algorithm to learn from the statistics of the data provided and make forecasts and predictions about the future.
In supervised machine learning, algorithms use the features of observations with known label values to detect patterns and trends in the variable we are trying to predict.
Unsupervised machine learning doesn’t focus on a particular known variable, instead looking at similarities among all variables on the observations. Once the model is trained, new observations are assigned to their relevant cluster based on their characteristics.
Sometimes referred to as market segmentation, this is the process of breaking down a population into groups, or samples, of similar characteristics with an identifiable difference. This can be as simple as splitting your samples between men and women, or could be based on any other attribute about the population. Often a core demographic or value is used for the groupings.
Clustering is an ‘unsupervised’ analysis which categorises your observations into groups, or ‘clusters’. There are numerous variations but in each case there is some form of distance measurement to determine how close or far apart observations are within each cluster.
k-Means clustering is probably the most common partitioning method for segmenting data. It requires the analyst to specify ‘k’, which is the number of distinct clusters you will be segmenting into.
This method begins by placing ‘k’ cluster markers, known as cluster centroids, across the data (often at random or evenly spaced positions). It then assigns each observation to whichever cluster centroid it is nearest to.
Usually these initial centroids won’t sit at the centre of their observations, so the algorithm takes the mean value of all observations in each cluster and re-positions the centroid there. It then looks again to see whether any observations need moving into a different cluster. The algorithm continues re-assigning observations and re-positioning the centroids until it finds the best fit.
Best fit is defined as when the average distance from each observation to its cluster centre is at its smallest.
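The assign-then-re-position loop described above can be sketched as a minimal 1-dimensional k-means in pure Python (the data points are made up, and real implementations add smarter initialisation and convergence checks):

```python
import random
from statistics import mean

def k_means_1d(points, k, iterations=20, seed=0):
    """Minimal 1-D k-means sketch: pick k starting centroids, assign each
    point to its nearest centroid, move each centroid to the mean of its
    cluster, and repeat."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groupings, around 1-3 and around 10-12
centroids = k_means_1d([1, 2, 3, 10, 11, 12], k=2)
```

On this toy data the centroids settle at 2 and 11, the means of the two natural groups.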
You can draw lines of demarcation between clusters with a Voronoi diagram, which displays the area covered by each cluster; whichever side of a line an observation sits on determines which cluster it is assigned to.
Example: You may run an unsupervised machine learning exercise using k-means clustering on a customer base, then compare the susceptibility to marketing campaigns of the “brand loyalist” cluster of customers against your “value conscious” cluster.
Hierarchical clustering builds multiple levels of clusters, creating a hierarchy with a cluster tree.
When you have specific definitions to group your data by, predictive modelling can be a useful alternative to clustering. Variables found to be statistically significant predictors of another variable can be used to define segmentations for your analysis.
Statistical learning places more emphasis on mathematics and statistical models, and on their interpretation and precision.
A form of supervised learning that estimates the relationships among variables, based on the data you have available to you.
Excel has an in-built tool for regression analysis (in the Data Analysis ToolPak).
Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) measure the residuals (the differences between predicted and actual values) in the units of the label being analysed.
Relative Absolute Error (RAE) and Relative Squared Error (RSE) are relative measures of error. The closer their values are to zero, the more accurately the model is predicting.
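MAE and RMSE are straightforward to compute by hand. A pure-Python sketch with made-up predicted and actual values:

```python
import math

# Hypothetical predicted vs actual values for a numeric label
actual    = [10.0, 12.0, 15.0, 11.0]
predicted = [11.0, 11.0, 14.0, 13.0]

errors = [p - a for p, a in zip(predicted, actual)]

# MAE: the average size of the residuals, in the label's own units
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: similar, but squaring penalises large errors more heavily
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
```

Note that RMSE (about 1.32 here) is at least as large as MAE (1.25 here), with the gap growing as the errors become more uneven.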
R Square (also referred to as the Coefficient of Determination) measures how well a regression model fits the data. A value of 1 would demonstrate a perfect fit, with all of the variance in the dependent variable explained.
Example: An R Square value of 0.96 between price and sales would indicate that 96% of the variance in sales is explained by the price.
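The full calculation, from fitting a least-squares line to computing R Square, can be sketched in pure Python. The price/sales figures below are invented for illustration:

```python
# Hypothetical price (x) and sales (y) data, deliberately close to linear
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
slope = (sum(xi * yi for xi, yi in zip(x, y)) - n * mean_x * mean_y) / \
        (sum(xi * xi for xi in x) - n * mean_x ** 2)
intercept = mean_y - slope * mean_x

# R Square: 1 minus (unexplained variance / total variance)
predictions = [slope * xi + intercept for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
```

Because the made-up data sits almost exactly on a straight line, R Square comes out above 0.99: nearly all the variance in y is explained by x.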
Whereas regression analysis predicts a numeric value, classification is a data mining technique which helps us predict which class (or category) our data observations belong to. A decision tree is one of the most common classification algorithms. Classification is particularly useful when breaking down very large datasets before analysing and making predictions.
In its simplest form, you could use classification to break down your results into just two categories: true or false. The model won’t necessarily output exactly 1 (for true) or 0 (for false) but a value somewhere in between, with a threshold set to define which results are classified true and which false.
We can withhold some of the data to use as test data to validate our model. The test data cases can then be divided into groups:
Cases where the model predicts a 1 which are actually true are “true positives”.
Cases where the model predicts a 0 which are actually false are “true negatives”.
Cases where the model predicts a 1 which are actually false are “false positives”.
Cases where the model predicts a 0 which are actually true are “false negatives”.
Based on the test results you may choose to move the threshold to change how the predicted values are classified.
A matrix which displays the number of true positives, true negatives, false positives and false negatives in your testing, to help measure the effectiveness of your classification model.
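Tallying the four cells is simple to sketch in pure Python. The probabilities and labels below are hypothetical test-set results:

```python
# Hypothetical predicted probabilities and true labels from a test set
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actuals       = [1,   1,   0,   1,   0,   1,   0,   0]

# Apply the classification threshold to turn probabilities into 1s and 0s
threshold = 0.5
predictions = [1 if p >= threshold else 0 for p in probabilities]

# Tally the four cells of the confusion matrix
tp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 1)
tn = sum(1 for p, a in zip(predictions, actuals) if p == 0 and a == 0)
fp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 0)
fn = sum(1 for p, a in zip(predictions, actuals) if p == 0 and a == 1)
```

Re-running the tally with a different threshold shows the trade-off directly: raising it converts false positives into false negatives, and lowering it does the opposite.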
A term to describe where a model has been iterated so many times that it performs more accurately on the training data than it would on any new data.
The process of categorising opinions expressed within a piece of text, usually to determine whether the comment was positive, negative or neutral. This type of analysis is prominent when analysing customer feedback on social media platforms.
Analysis of Variance, a hypothesis testing technique allowing you to determine whether the differences between samples are simply due to random sampling errors, or whether there are systematic treatment effects that cause the mean in one group to differ from the mean in another.
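The heart of ANOVA, comparing variation between group means against the random scatter within groups, can be sketched in pure Python. The three treatment groups below are invented:

```python
from statistics import mean

# Hypothetical results for three treatment groups
groups = [[4, 5, 6], [7, 8, 9], [4, 6, 5]]

grand = mean(v for g in groups for v in g)
k = len(groups)                     # number of groups
n = sum(len(g) for g in groups)     # total observations

# Between-group variation: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
# Within-group variation: random scatter inside each group
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

# F statistic: ratio of the two, each scaled by its degrees of freedom
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
```

Here F = 9.0, well above the roughly 5.14 critical value for (2, 6) degrees of freedom at the 5% level, so these made-up group means would be judged to differ by more than random sampling error.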
General linear models
A family of models, extending ANOVA and regression, used to test hypotheses in statistical experiments by expressing an outcome as a linear combination of known quantities and predictor variables.
Supervised vs Unsupervised learning
In supervised learning, the algorithm is given the dependent variable and it looks at all the independent variables in order to make a prediction about the dependent variable. E.g. classification, regression. When training a supervised learning model, it is standard practice to split the data into a training dataset and a test dataset, so that the test data can be used to validate the trained model.
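The train/test split mentioned above is simple to sketch in pure Python (the labelled observations here are placeholders, and the 70/30 ratio is just a common choice):

```python
import random
random.seed(7)

# Hypothetical labelled observations: (features, label)
data = [([i], i % 2) for i in range(10)]

# Shuffle first so the split isn't biased by the data's original order,
# then hold back 30% as a test set to validate the trained model
random.shuffle(data)
split = int(len(data) * 0.7)
train, test = data[:split], data[split:]
```

The model is fitted on `train` only; `test` is kept unseen so its error estimate reflects how the model would perform on genuinely new data.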
Unsupervised learning differs because the algorithm is only given the independent variables and without being given any direction, it returns results about relationships between any of the variables. E.g. clustering, segmentation.
If you want to segregate your customers who pay on card vs your customers who pay with cash, that’s supervised machine learning because you’re telling the machine what to look for. If alternatively you want to give the computer all your data and measures and ask it to segment the customer base by highlighting any interesting patterns, that’s unsupervised machine learning.
Extremely large datasets which are too complex to analyse for insight without the use of sophisticated statistical software.
Time series analysis
General term for trending findings over a time period, usually in a graphical format, which can be used to make forecasts for the future.
A statistical method of estimating the expected duration of time until an event occurs.
Example: A human being’s expected lifespan being estimated using inputs such as age, gender, location and wealth.
Monte Carlo simulation
A mathematical method of using repeated random draws to perform calculations for complex problems. By using random number generation to simulate the results of an outcome over and over again, a vast number of times, the overall result helps you estimate the probability of an outcome occurring.
The ‘replicate’ function in R allows you to stipulate how many times you wish to re-run the experiment.
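The same idea can be sketched in Python, playing the role R’s replicate fills. Here the experiment is a classic with a known answer, so the simulation can be checked: the probability of rolling at least one six in four throws of a die is 1 - (5/6)^4, about 0.5177:

```python
import random
random.seed(1)

def at_least_one_six():
    """One experiment: roll a die four times; did a six appear?"""
    return any(random.randint(1, 6) == 6 for _ in range(4))

# Repeat the experiment a vast number of times and take the
# proportion of successes as the probability estimate
trials = 100_000
estimate = sum(at_least_one_six() for _ in range(trials)) / trials
```

With 100,000 trials the estimate lands within a few tenths of a percent of the exact value; more trials tighten it further.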