Applied statistics is used to solve practical problems with data. For each type of experiment in applied statistics, specific steps must be followed to collect, analyse and interpret the data.

Though applied statistics and data science can overlap at times, it is usually the case that applied statistics is used by data scientists within their machine learning algorithms to assist forecasting and prediction.


When your data does not fit a normal distribution, nonparametric statistical methods are required. Many nonparametric tests use ordinal rather than numerical data, with the values ranked and sorted in order. For example, nonparametric statistics could analyse a Likert scale ranking survey responses: Strongly Agree, Agree, Neither agree nor disagree, Disagree, Strongly Disagree.
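As a minimal sketch, Likert responses like those above can be encoded as ordinal ranks before a rank-based test is applied (the 1-5 scoring below is one common convention, assumed here for illustration, not a fixed standard):

```python
# Encode Likert responses as ordinal ranks for a rank-based test.
# The 1-5 scoring is one common convention, not a fixed standard.
LIKERT_SCALE = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neither agree nor disagree": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

def encode_responses(responses):
    """Map survey response labels to their ordinal ranks."""
    return [LIKERT_SCALE[r] for r in responses]

responses = ["Agree", "Strongly Agree", "Disagree", "Agree"]
print(encode_responses(responses))  # [4, 5, 2, 4]
```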

Nonparametric statistical tests include:


Natural Language Processing is a branch of artificial intelligence that uses computer algorithms to understand human language.

NLP aims to read, understand and learn from human languages and to provide insights about the data.


A one-sample z-test assesses the sample mean of a variable against a population mean, whilst a two-sample z-test compares the mean from two different groups. The differences are compared with the estimated standard error to conclude whether there is evidence that the population means differ.

It is generally accepted that a sample size of at least 30 is required for a z-test to be applicable because, by the Central Limit Theorem, we know the sample mean will be normally distributed even if the population is not. A test statistic is calculated and compared against the critical values at the relevant significance level, to conclude whether the evidence supports rejecting the null hypothesis. Z-tests have a single critical value for each probability value regardless of the sample size, for example ±1.96 for a two-tailed test at the 5% level.

Z-scores measure how many standard deviations an observation is away from the mean. They can either be included as part of a z-test or used separately, and they allow us to compare datasets measured in different units. A z-score of 0 indicates a data point identical to the mean. Observations with z-scores above 2 or below -2 are often described as outliers, because roughly 95% of values fall within two standard deviations.

The STANDARDIZE function in Excel calculates z-scores.
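A Python sketch of the same calculation (the sample numbers below are illustrative, not from the article's data):

```python
# Python equivalent of Excel's STANDARDIZE(x, mean, standard_dev):
# the z-score measures how many standard deviations x lies from the mean.
def standardize(x, mean, std_dev):
    return (x - mean) / std_dev

# An observation of 65 against a mean of 50 and a standard deviation
# of 10 (illustrative numbers):
print(standardize(65, 50, 10))  # 1.5
```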

We are going to perform a one-sample z-test, firstly breaking down the calculations and then using the Excel function to fast-track getting a p-value. We are investigating how the number of goals scored per game in this Premier League season compares to the overall average for Premier League seasons. Our measurement is the number of goals scored per game; in the Premier League era the average is 2.653, but this season the average has been 2.722. Our hypothesis tests whether this difference is significant or down to random fluctuation, working to a 5% significance level.

For a one-sample z-test at 5% significance our critical value is 1.96. If the z-score is less than -1.96 or more than 1.96 we can reject the null hypothesis.

We need our sample mean (x-bar) which is 2.722 goals per game this season and ‘s’ our sample standard deviation (which is 1.514). Our number of observations, ‘n’, is 288 which is the number of matches played so far this season. ‘A’ is our hypothesised mean, 2.653, which is the average goals per game over the course of 10,506 matches played in the Premier League era before this season. With these inputs we can calculate our test statistic:
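The test statistic, z = (x̄ − A) / (s/√n), can be sketched in Python using the summary figures above:

```python
import math

# One-sample z test statistic: z = (x_bar - mu) / (s / sqrt(n)),
# using the summary figures quoted above.
x_bar = 2.722   # sample mean: goals per game this season
mu = 2.653      # hypothesised mean: Premier League era average
s = 1.514       # sample standard deviation
n = 288         # matches played so far this season

z = (x_bar - mu) / (s / math.sqrt(n))
print(round(z, 3))  # 0.773
```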

The test statistic of 0.773 (to three decimal places) is between the critical values of -1.96 and 1.96 so we cannot reject the null hypothesis. In this instance there is no evidence to support the number of goals scored this season differs significantly to an average Premier League season.

Although it’s useful to know the methodology, the Z.TEST function in Excel is a quicker way to get your result. You don’t need to find the z-score or critical value to use the function, as it goes straight to the p-value.

The function to use is **Z.TEST(sample_range, mean, [stdev])**. You should supply the population standard deviation if known; alternatively, and in this example, the sample standard deviation is used. We enter the formula **=Z.TEST(A1:A288, 2.653, 1.514)** to give us the p-value for the 288 observations against our hypothesised mean.

The p-value is 0.219. This tells us there is around a 22% probability of a random sample mean (in this case this season's goals per game) being at least as high as this one, if the true mean were the hypothesised 2.653. This is considerably above the 5% significance threshold set by our hypothesis.
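The same p-value can be sketched from the summary figures (Z.TEST reads the raw cell array, so its 0.219 differs marginally from this rounded-input version):

```python
import math
from statistics import NormalDist

# One-tailed p-value for the one-sample z-test, from the rounded
# summary figures. Excel's Z.TEST works from the raw array, so its
# 0.219 differs very slightly from this value.
z = (2.722 - 2.653) / (1.514 / math.sqrt(288))
p_one_tailed = 1 - NormalDist().cdf(z)
print(round(p_one_tailed, 2))  # 0.22
```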

Our two-sample z-test is going to test the hypothesis that home advantage is a factor in the current Premier League season. Again we will use goals data, comparing the means of home goals and away goals for the 288 matches.

H will represent home and A will represent away. We have 288 matches so n = 288 for both samples and we will also need the sample means (H is 1.507, A is 1.215) and sample standard deviations (H is 1.192, A is 1.208). The first calculation is the Estimated Standard Error (ESE):

Next we calculate z:
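The whole two-sample calculation can be sketched in Python from the summary figures above (the rounded means give z ≈ 2.92; the article's 2.917 comes from the unrounded data):

```python
import math
from statistics import NormalDist

# Two-sample z-test: home (H) vs away (A) goals per match.
n = 288
mean_h, mean_a = 1.507, 1.215   # sample means
sd_h, sd_a = 1.192, 1.208       # sample standard deviations

# Estimated Standard Error: sqrt(s_H^2/n_H + s_A^2/n_A)
ese = math.sqrt(sd_h**2 / n + sd_a**2 / n)
z = (mean_h - mean_a) / ese
p_two_tailed = 2 * (1 - NormalDist().cdf(z))

print(round(z, 2))             # 2.92
print(round(p_two_tailed, 4))  # 0.0035
```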

The critical values for the 5% significance level are -1.96 and 1.96, so our test statistic of 2.917 means we have evidence to reject the null hypothesis. In fact we can work to a 1% significance level and the test statistic still falls inside the critical region (outside the range -2.58 to 2.58), so we can reject the null hypothesis at the 1% significance level too. There is substantial evidence that goals scored by home teams exceed goals scored by away teams.

There is also an in-built Data Analysis tool in Excel for the two-sample z-test. Ensure you have enabled the Analysis ToolPak in Excel and from the Data tab select Data Analysis in the ribbon and select **z-Test: Two Sample for Means**.

Beforehand, our data has been set up this way. Note the formulas used for the means and variances of each variable:

In the settings, select the cell ranges for the two variables (in this instance I have ticked to specify that Labels are included in my ranges). The hypothesised mean difference is 0, which describes the null hypothesis that the two variables do not differ. The variance for each variable needs to be manually entered, taken by using the VAR function as shown above. An Alpha of 0.05 specifies the 5% significance level and finally I’ve set the output to be added to a new worksheet:

Our results are returned to the new worksheet. The z-score matches the 2.917 we manually calculated and the z Critical for two-tail matches the 1.96 critical value for the 5% significance level. The key output is the p-value for two-tail of 0.0035 (effectively 0.35%) which demonstrates how we could also reject the null hypothesis at 1% significance, or even 0.5%. Note that p-values and critical values for a one-tailed equivalent z-test are also given:


T-tests can be used with any sample size, and the standard deviation of the population does not need to be known. Although the t-test relies on the assumption of a normal distribution, its probability values are based on the t-distribution. The test is appropriate when either the population is normal or the sample is large enough that the Central Limit Theorem applies. A one-sample t-test compares the mean of one group against a hypothesised mean value, whereas a two-sample test compares the means of two groups.

Two-sample tests differ depending on the variances: before choosing which type of t-test to use, the variances should be calculated to determine whether they are equal or unequal. If the variances are unknown, you need to assume them equal or unequal before deciding which procedure to follow. You might assume unequal variances if the two sample sizes greatly differ. There are also paired two-sample t-tests, which compare the means of two samples that can be directly related to each other (e.g. a weight loss experiment with the same participants' before and after samples). Excel has tools for each of these:


To conclude whether a t-test supports or rejects the null hypothesis, a t-value test statistic is calculated and compared against the critical values at the relevant significance level. Unlike a z-test, t-tests have different critical values for each sample size (via the degrees of freedom).

For a two-sample t-test, you can compare the sample variances to determine whether or not you should assume equal variances. First square the standard deviation of each sample to get its sample variance. Next, divide the larger sample variance by the smaller. If the result is less than 3 we can assume a 'common population variance'; otherwise we should use the two-sample t-test assuming unequal variances.

An alternative is to use the F-distribution to compare variances, using the F-statistic of the two samples. The F.TEST function does this in Excel; if the result is greater than alpha, we can assume a common variance between them.

The pooled estimate for common variance can also be calculated. It gives us a common variance for both samples; the result will lie between the two variances but may be weighted closer to the sample with more observations.
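Both the variance-ratio check and the pooled estimate can be sketched as follows (the sample figures are illustrative, not from the article's data):

```python
def assume_equal_variances(s1, s2):
    """Rule of thumb from above: a ratio of larger to smaller sample
    variance below 3 suggests a common population variance."""
    v1, v2 = s1**2, s2**2
    return max(v1, v2) / min(v1, v2) < 3

def pooled_variance(s1, n1, s2, n2):
    """Pooled estimate of the common variance, weighted by the
    degrees of freedom of each sample."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

# Illustrative figures (not from the article's data):
s1, n1 = 1.2, 30
s2, n2 = 1.5, 20
print(assume_equal_variances(s1, s2))            # True
print(round(pooled_variance(s1, n1, s2, n2), 3)) # 1.761
```

Note how the pooled value sits between the two sample variances (1.44 and 2.25) but closer to the larger sample's.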

We will compare the heights of our sample of 18 male athletes against the average height of males in the UK, which is 175.3cm.

Our sample mean is 178.2cm:

In most cases it is best to use a two-tailed test. In this example you may have the preconceived idea that heights will be above average but you should not rule out the possibility they could be below average.

The degrees of freedom for a one-sample test is simply n-1, 17 in this case. Next, we calculate the critical value using the Excel function **T.INV(probability, deg_freedom)**; for a two-tailed test at 1% significance the formula is *=T.INV(0.995,17)* (equivalently *=T.INV.2T(0.01,17)*). The critical value is **2.898**, so the critical range is between -2.898 and 2.898.

We can reject the null hypothesis at 1% significance if the difference between the sample and hypothesised means (178.2 – 175.3 = 2.93, using the unrounded figures) is greater than or equal to the critical value multiplied by the sample standard deviation, divided by the square root of n:

The critical value of 2.898 multiplied by 0.85 = 2.46. As 2.93 >= 2.46 we can reject the null hypothesis at 1% significance, providing strong evidence against the null hypothesis.

We can also use these calculations for the t-value (or t-statistic), which is simply the 2.93 *(sample mean – hypothesised mean)* divided by the 0.85 (*s/sqrt(n)*), giving a t-statistic of **3.45**.

The t-statistic is then used to decide whether we support or reject the null hypothesis. Using the T.DIST Excel function with the degrees of freedom (17) and the t-stat (3.45), we can calculate the probability that the T random variable is >= the t-stat: *=1-T.DIST(3.45,17,TRUE)*, which equals 0.001529. As this is a two-tailed test we double that to 0.003058, which is still well below our 0.01 significance level and further confirmation that we reject the null hypothesis.
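The t-statistic from this worked example can be reproduced in Python (the p-value itself needs the t-distribution, e.g. Excel's T.DIST, which the Python standard library does not provide):

```python
import math

# One-sample t-statistic: t = (x_bar - mu) / (s / sqrt(n)).
# The text quotes the mean difference as 2.93 (the 178.2 sample mean
# is itself rounded), so we use the difference directly.
diff = 2.93   # sample mean minus hypothesised mean (cm)
s = 3.6       # sample standard deviation
n = 18        # number of athletes

t_stat = diff / (s / math.sqrt(n))
print(round(t_stat, 2))  # 3.45
```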

To work out the confidence interval for a population mean, the lower limit is **point estimate – (critical value × ESE)** and the upper limit is **point estimate + (critical value × ESE)**, where ESE is the estimated standard error of the point estimate.

For a t-test, the CONFIDENCE.T Excel function will quickly perform this calculation. The function has three required arguments: Alpha, the standard deviation and the sample size. Subtract the result from the sample mean for the lower limit, or add it to the sample mean for the upper. Referring to our one-sample t-test worked example:

Lower limit: *=178.2 – CONFIDENCE.T(0.01, 3.6, 18)*

Upper limit: *=178.2 + CONFIDENCE.T(0.01, 3.6, 18)*

99% confidence interval: **(175.74, 180.66)**
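A Python sketch of the same interval; the 2.898 critical value is the two-tailed 1% t value for 17 degrees of freedom (from Excel's T.INV.2T(0.01,17)), hard-coded here because the standard library has no t-distribution:

```python
import math

# 99% confidence interval for the mean height example.
mean, s, n = 178.2, 3.6, 18
t_crit = 2.898  # two-tailed 1% critical value, 17 df (from T.INV.2T)

margin = t_crit * s / math.sqrt(n)
lower, upper = mean - margin, mean + margin
print(round(lower, 2), round(upper, 2))  # 175.74 180.66
```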

This time we split the athletes into two groups based on age, with teenage athletes compared against those aged 20+.

Comparing the two samples, they both contain 9 observations but the Under 20s have a taller average height. As the skewness of each sample is between -1 and 1 we can be confident the distributions are approximately normal; the closer the kurtosis is to zero, the more normal the distribution is likely to be too.

Next we need to check the variances to see whether we should assume them equal or unequal. Using the F.TEST() function we can test the probability that the variances are not significantly different: *=F.TEST(A2:A10,B2:B10)*. This returns 0.61; as this is well above Alpha for this experiment (0.05), we can assume equal variances. Had the value been below Alpha, we might use the VAR() function on each array to confirm that the unequal-variances test is needed instead.

Now it’s time to use the Analysis ToolPak to choose the relevant test, in this case **t-Test: Two-Sample Assuming Equal Variances**.

Enter the cell ranges for Variables 1 and 2; in this case we have ticked that row labels are included in those ranges. Our hypothesised mean difference is zero, matching the null hypothesis that there is no difference between the heights of the two groups, and we are using an Alpha of 0.05, so we are working to a 5% significance level.

The output includes all the key statistics from the test, including the variances, degrees of freedom and the t-statistic, but the most important for assessing your hypothesis is the p-value for the relevant tail. In this case the p-value for a two-tailed test is **0.187897**.

This p-value essentially means there is around a 19% chance of seeing a discrepancy like this between the two age groups by random chance. As this exceeds our Alpha of 0.05 (5%), we do not reject the null hypothesis in this instance.

If you don’t require the additional detail from the Analysis ToolPak option, you can fast-track much of this process by going straight to the T.TEST() function. You simply input the two arrays, state whether the test is one or two tailed and state which of the three t-tests you’re choosing:

The T.TEST function would take you straight to the 0.187897 p-value for the two-tail test in this example.

The matched pairs t-test compares the means of two samples which are normally distributed and whose observations can be paired naturally. The differences between the pairs are calculated for each observation, and a one-sample t-test is then applied to those differences to test for any significant difference between the means.
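A minimal sketch of the matched-pairs calculation, using hypothetical before/after weights invented for illustration (converting the t-statistic to a p-value would again need the t-distribution, e.g. Excel's T.DIST):

```python
import math
from statistics import mean, stdev

# Matched pairs: take the difference for each participant, then apply
# a one-sample t-test to the differences (hypothesised mean of 0).
before = [82.1, 90.4, 77.8, 85.0, 95.2]  # hypothetical weights (kg)
after  = [80.3, 88.9, 77.5, 82.6, 93.0]

diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
print(round(t_stat, 2))  # 4.44
```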


A confidence interval describes a range of values that are likely to include the true value for a population. The upper and lower confidence limits are the two numbers that make up the range of the interval. Confidence intervals do not provide certain answers, they are an estimate based on a sample.

Commonly either a 95% or 99% confidence interval is used in experiments, which equates to either a 5% or 1% significance level. The interval describes all values for which we cannot reject the null hypothesis at the given significance level. If you repeatedly took samples and calculated a 95% confidence interval from each, about 95% of those intervals would contain the true population value.

Here is an example of how to find a confidence interval for a population mean, which is relevant to both z-tests and t-tests. In this example:

n (number of observations) = 70

The sample mean (x-bar) = 48

The sample standard deviation = 15

We want to find a 99% confidence interval

The critical value for a z-test at the 99% confidence level = 2.58
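Putting these figures together, the 99% interval works out as:

```python
import math

# 99% confidence interval for the population mean:
# x_bar +/- critical_value * s / sqrt(n)
n, x_bar, s = 70, 48, 15
z_crit = 2.58  # 99% critical value quoted above

margin = z_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 2), round(upper, 2))  # 43.37 52.63
```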

Note that the calculations are the same for a 95% confidence interval, apart from the critical value being 1.96 rather than 2.58. For 95% confidence, the interval is (44.49, 51.51).

There are Excel functions to speed up your confidence interval calculations: CONFIDENCE.NORM returns the confidence interval for a population mean based on the normal distribution, and CONFIDENCE.T is based on the t-distribution.

Taking a random sample and applying a 95% confidence interval, the interval will give all values for the population mean that would not be rejected at the 5% significance level. Therefore on about 95% of occasions the true population mean is within the interval; however, 5% of the possible random samples you could select will provide a confidence interval which does not include the population mean.

A single number, calculated from the data set, that is the best single guess for an unknown parameter. Point estimates can be deduced from a confidence interval by taking the midpoint of the interval.
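For instance, the midpoint of the 95% interval from the example above recovers the sample mean:

```python
# Point estimate recovered as the midpoint of a confidence interval.
# Using the 95% interval (44.49, 51.51) from the example above:
lower, upper = 44.49, 51.51
point_estimate = (lower + upper) / 2
print(round(point_estimate, 2))  # 48.0
```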

A range of numbers around the point estimate, within which an unknown parameter is expected to fall. The confidence interval is a widely used type of interval estimate.


General term for trending findings over a time period, usually in a graphical format, which can be used to make forecasts for the future.

A statistical method of estimating the expected duration of time until an event occurs.

**Example:** A human being's expected lifespan being assessed using inputs such as age, gender, location and wealth.

A method of estimating observation sizes by targeting several small specific areas to take a count and then making assumptions about the rest of the population based on the findings.

**Example:** For measuring the number of fish in a lake (often populations of a type of species).


Similar in look to a bar graph except the bars are connected to each other, histograms are formed from grouped data to display frequencies or relative frequencies (percentages) for each class in a sample.

A method of displaying the correlation between two or more variables, often including a line of best fit to demonstrate how far each observation deviates from the overall trend.

Line chart plotted at the mid-point of each class, with the classes grouped e.g. into 0-10, 11-20, etc.

A one-dimensional graph based on the numerical data from the five-number summary.

A visualisation organising numerical data into categories based on place value. These contain more detail than a standard histogram. The stem is the left hand column containing the digits in the largest place and the leaf on the right hand column contains the digits in the smallest place.

Presented as two or more circles overlapping each other to demonstrate relationships between variables.

**Example:** Animals with two legs and animals who can fly. Some would show in one group or the other and some would overlap into both groups.

A branching diagram which lists all possible outcomes of an event.

**Example:** The first branch could be Europe, the second branches splitting out Germany, France and Spain and then the third branches split out the various cities in those countries.


A language commonly used in web design, JavaScript integrates closely with HTML (and, despite the name, is a distinct language from Java).

JavaScript packages include numerous libraries for charts, graphs and other data visualisations.

The syntax used by JavaScript bears resemblance to the syntax for the C# programming language.
