A positive skewness indicates a ‘right-skewed’ distribution, where the majority of the values are on the left of the distribution and the outlier values are on the right. A negative skewness indicates the opposite. A skewness of zero describes a perfectly symmetrical distribution, such as the normal distribution. As a general rule, the mean (average) is the best measurement of a typical value unless you have a positive skew above 1 or a negative skew below -1; in those cases the median provides a more accurate measurement.

The SKEW function can be used to measure skewness in Excel as displayed in this example.

You can see this dataset has a positive skew of 2.2123. In this example, the median value gives a more realistic indication of a typical salary than the mean. The data is skewed by a few outlying large salaries.

To detect individual outliers, you can flag any observation more than 1.5 times the IQR (interquartile range) beyond the quartiles. Here’s an example of how to do this using Excel formulas.

Note that we have added formulas for each of the first three quartiles. Quartile 2 is the same as the median. The interquartile range (IQR) is the difference between Quartiles 1 and 3. To be identified as an outlier, a value needs to be either more than 1.5 times the IQR below Q1, or more than 1.5 times the IQR above Q3. The result here finds two observations above the maximum threshold, so we would define those as outliers.
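The same quartile and IQR logic can be sketched outside Excel too. Below is a minimal Python illustration using made-up salary figures (not the dataset from the example above); the `"inclusive"` quantile method is used because it mirrors Excel’s QUARTILE.INC behaviour.

```python
import statistics

# Made-up salary data containing two large outliers
salaries = [21000, 22500, 23000, 24000, 25500, 26000, 27500, 95000, 120000]

# method="inclusive" mirrors Excel's QUARTILE.INC
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

# The 1.5 x IQR rule described above
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in salaries if x < lower_limit or x > upper_limit]
print(outliers)  # [95000, 120000]
```

Running it flags the two large salaries, matching the 1.5 × IQR rule described above.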

A drag-and-drop tool with a graphical user interface for building, testing and deploying predictive analytics solutions on your data.

A solution for creating accurate predictive and descriptive data models using data mining and statistical techniques such as linear regression, clustering and classification (decision trees).

Jupyter Notebooks are used to explore datasets through an interactive browser-based environment in which you can add notes and run code to manipulate and visualise data. They support languages regularly used by data scientists such as R and Python.

R is an open source programming language and software environment widely used for statistical analysis, testing and modelling.

Python is another programming language used for detailed statistical analysis, testing and modelling.

Similar in look to a bar chart except the bars are connected to each other, histograms are formed from grouped data to display frequencies or relative frequencies (percentages) for each class in a sample.

A method of displaying the correlation between two or more variables, often including a line of best fit to demonstrate how far each observation deviates from the overall trend.

Line chart plotted at the mid-point of each class, with the classes grouped e.g. into 0-10, 11-20, etc.

Presented as two or more circles overlapping each other to demonstrate relationships between variables.

Example: Animals with two legs and animals who can fly. Some would show in one group or the other and some would overlap into both groups.

Uses probability to demonstrate outcomes based on more than one input.

Example: The first branch could be Europe, the second branches splitting out Germany, France and Spain and then the third branches split out the various cities in those countries.

A one-dimensional graph based on the numerical data from the five-number summary.


A prediction or statement about a characteristic of a variable, which can be tested to provide evidence for or against.

A method of statistically testing a hypothesis by comparing data against values predicted by the hypothesis. The significance test considers two hypotheses, the null and the alternative, effectively testing whether or not there is significant reason to support the hypothesis or whether the results are random.

When we can’t prove the alternative hypothesis is significant, we don’t “accept” the null hypothesis but “fail to reject” the null hypothesis. This implies the data is not sufficiently persuasive for us to choose the alternative hypothesis over the null hypothesis but it doesn’t necessarily prove the null hypothesis either.

The hypothesis that is directly tested during a significance test. As the null hypothesis indicates no significance, you are usually trying to disprove the null hypothesis in your statistical tests.

Contradicting the null hypothesis, the alternative hypothesis is supported if the significance test indicates the null hypothesis to be incorrect.

These are two methods of significance testing used to determine whether there is a significant difference between the means of two related groups.

If we already know the mean and standard deviation of the full population, we can conduct a Z test. Typically we only have a sample of the population and therefore need to carry out a T-test instead.

The Z.TEST function in Excel performs the appropriate test depending on whether or not you supply the standard deviation. =Z.TEST(sample_range, mean[,stdev]). The function returns a p-value, indicating the probability of randomly observing a sample mean at least as far from the population mean as the one you have.

The T-test is based on the T-distribution whereas a Z test relies on the assumption of a normal distribution.

There are a variety of different types of T and Z tests, the most basic being a 1-sample, 1-tailed test. A 1-sample test only looks at a single sample of the data to compare against the hypothesised mean, but you could use multiple samples. A 1-tailed test can only provide a p-value testing whether the sample mean is significantly higher, or testing whether it is significantly lower, than the hypothesised mean; a 2-tailed test checks both ways.
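To make the mechanics concrete, here is a rough Python sketch of a 1-sample, 2-tailed Z test using invented data and a hypothesised population mean and standard deviation; it returns a p-value in the same spirit as the test described above.

```python
import math
import statistics

def z_test_p_value(sample, pop_mean, pop_sd, tails=2):
    """Sketch of a 1-sample Z test, assuming a known population
    standard deviation; returns the p-value."""
    se = pop_sd / math.sqrt(len(sample))          # standard error of the mean
    z = (statistics.mean(sample) - pop_mean) / se
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # normal CDF via erf
    one_tailed = 1 - cdf if z > 0 else cdf
    return tails * one_tailed

# Invented sample against a hypothesised population mean of 50, sd of 5
p = z_test_p_value([52, 55, 49, 58, 60, 54, 57, 53], pop_mean=50, pop_sd=5)
print(p)  # below 0.05, so significant at the 5% level
```

Setting `tails=1` would give the 1-tailed version, which is half this value when the sample mean sits above the hypothesised mean.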

A function of the observed sample results used for testing a statistical hypothesis. Prior to the test being performed an agreed threshold for the probability value (p-value) should be chosen, which is known as the significance level. It will usually be somewhere between 1 and 5%. The results can be measured against the significance level to provide evidence for or against the null hypothesis.

As a general rule a p-value of 0.05 (5%) or less is deemed to be statistically significant.

The Chi-Squared statistic is a means of testing for independence with categorical data, rather than numeric. E.g. to test whether eye colour and gender show a significant departure from independence (in other words, whether or not there is a relationship between the two variables).

CHISQ.TEST is the Excel function to perform this test, returning a p-value.
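If you want to see what the statistic itself involves, here is a small pure-Python sketch computing the Chi-Squared statistic for a hypothetical eye colour by gender table. The counts and the 3.841 critical value (5% significance with 1 degree of freedom) are standard illustrative figures, not from the text.

```python
def chi_squared_stat(observed):
    """Chi-squared statistic for a contingency table given as rows:
    sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat

# Invented eye colour (columns) by gender (rows) counts
table = [[30, 20],
         [20, 30]]
stat = chi_squared_stat(table)
print(stat)  # 4.0
```

Here the statistic of 4.0 exceeds the 3.841 critical value, so under this made-up table we would reject independence at the 5% level.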

The group in an experiment who are randomly selected from the population and do not receive the treatment, providing a baseline for comparison.

The group in an experiment who are specifically selected from the population based on the particular characteristics which are being tested. They can then be compared against the control group to confirm or reject the hypothesis.

A branch of statistics which consists of methods for organising and summarising information. Descriptive statistics uses the collected data and provides factual findings from it.

Inferential statistics consists of methods for drawing conclusions and measuring their reliability based on information gathered from statistical relationships between fields in a sample of a population. This can be used for forecasting future events.

Making predictions and generalisations represented by data collections.

Based on Artificial Intelligence and with an emphasis on big data and large-scale applications, machine learning is the process of training a computer algorithm to use statistics from the data provided to learn and make forecasts and predictions about the future.

In supervised machine learning, algorithms use the numeric features of known values to detect patterns and trends on the variable we are trying to predict.

Unsupervised machine learning doesn’t focus on a particular known variable, instead looking at similarities among all variables on the observations. Once the model is trained, new observations are assigned to their relevant cluster based on their characteristics.

Sometimes referred to as market segmentation, this is the process of breaking down a population into groups, or samples, of similar characteristics with an identifiable difference. This can be as simple as splitting your samples between men and women, or could be based on any other attribute about the population. Often a core demographic or value is used for the groupings.

Clustering is an ‘unsupervised’ analysis which categorises your observations into groups, or ‘clusters’. There are numerous variations but in each case there is some form of distance measurement to determine how close or far apart observations are within each cluster.

k-Means clustering is probably the most common partitioning method for segmenting your data and requires the analyst to specify a specific number of k distinct clusters.

This method assigns each observation to one of the clusters and then calculates the distance between each point and the mean value (centroid) of all the observations in the cluster.

Example: You may run a k-means clustering exercise on a customer base, comparing the susceptibility to marketing campaigns from the “brand loyalist” cluster of customers against your “value conscious” cluster.
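As a sketch of the mechanics only, here is a deliberately simple one-dimensional k-means with k = 2 and invented customer spend figures; real tools work in many dimensions with smarter initialisation, but the assign-then-recalculate loop around the centroids is the same idea.

```python
import statistics

def k_means_1d(values, centroids, iterations=10):
    """Toy 1-D k-means: assign each value to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recalculate each centroid; keep it in place if its cluster is empty
        centroids = [statistics.mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Invented annual spend: a "value conscious" and a "brand loyalist" group
spend = [100, 120, 110, 900, 950, 980]
centroids, clusters = k_means_1d(spend, centroids=[100, 1000])
print(centroids)
```

The two centroids settle on the means of the two obvious groups, segmenting the customers without being told which group each belongs to.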

Hierarchical clustering builds multiple levels of clusters, creating a hierarchy with a cluster tree.

When you have specific definitions to group your data by, predictive modelling can be a useful alternative to clustering. Variables found to be statistically significant predictors of another variable can be used to define segmentations for your analysis.

Statistical learning places greater emphasis on mathematics and statistical models, along with their interpretation and precision.

A form of supervised learning that estimates the relationships among variables, based on the data you have available to you.

Excel has an in-built tool for Regression analysis.

**Mean Absolute Error (MAE)** and **Root Mean Square Error (RMSE)** measure the residuals (the variance between predicted and actual values) in the units of the label being analysed.

**Relative Absolute Error (RAE)** and **Relative Squared Error (RSE)** are relative measures of error. The closer their values are to zero, the more accurately the model is predicting.

**R Square** (also referred to as the Coefficient of Determination) is the result of a regression analysis. A value of 1 would demonstrate a perfect correlation between the variables.

Example: An R Square value of 0.96 between price and sales would indicate that 96% of the variance in sales is explained by the price.
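These measures are straightforward to compute by hand. Below is a small Python sketch with invented actual and predicted values showing MAE, RMSE and R Square calculated from the residuals.

```python
import math

# Invented actual values and model predictions
actual    = [10, 12, 14, 16, 18]
predicted = [11, 11.5, 14.5, 15, 18.5]

n = len(actual)
residuals = [p - a for p, a in zip(predicted, actual)]

mae  = sum(abs(e) for e in residuals) / n            # Mean Absolute Error
rmse = math.sqrt(sum(e * e for e in residuals) / n)  # Root Mean Square Error

# R Square: 1 minus residual variation over total variation
mean_actual = sum(actual) / n
ss_res = sum(e * e for e in residuals)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot
print(mae, rmse, r_squared)
```

MAE and RMSE come out in the same units as the label, while R Square is the unitless proportion of variance explained.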

Whereas regression analysis predicts a numeric value, classification is a data mining technique which helps us predict which class (or category) our data observations belong to. A decision tree is one common classification technique. Classification is particularly useful when breaking down very large datasets before analysing and making predictions.

In its simplest form, you could use classification to break down your results into just two categories: true or false. You won’t necessarily come up with 1 (for true) or 0 (for false) but a value somewhere in the middle, with a threshold set to define which are classified true and which are classified false.

We can withhold some of the test data to use to validate our model. The test data cases can then be divided into groups:

Cases where the model predicts a 1 which were actually true are “true positives”.

Cases where the model predicts a 0 which are actually false are “true negatives”.

Cases where the model predicts a 1 which are actually false are “false positives”.

Cases where the model predicts a 0 which are actually true are “false negatives”.

Based on the test results you may choose to move the threshold to change how the predicted values are classified.

A matrix which displays the number of true positives, true negatives, false positives and false negatives in your testing, to help measure the effectiveness of your classification model.
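Here is a sketch of how those four counts fall out of predicted scores and a threshold, using invented labels and scores.

```python
def confusion_counts(actual, scores, threshold=0.5):
    """Count TP/TN/FP/FN given actual labels (1/0) and predicted scores."""
    tp = tn = fp = fn = 0
    for a, score in zip(actual, scores):
        p = 1 if score >= threshold else 0
        if p == 1 and a == 1:
            tp += 1          # predicted 1, actually true
        elif p == 0 and a == 0:
            tn += 1          # predicted 0, actually false
        elif p == 1 and a == 0:
            fp += 1          # predicted 1, actually false
        else:
            fn += 1          # predicted 0, actually true

    return tp, tn, fp, fn

# Invented test labels and model scores
actual = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]
tp, tn, fp, fn = confusion_counts(actual, scores)
print(tp, tn, fp, fn)
```

Re-running with a different `threshold` shows how moving the cut-off trades false positives against false negatives, as described above.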

A term to describe where a model has been iterated several times to the effect that it is performing more accurately with the test data than it would with any new data.

Analysis of Variance, a hypothesis testing technique allowing you to determine whether the differences between samples are simply due to random sampling errors, or whether there are systematic treatment effects that cause the mean in one group to differ from the mean in another.

An ANOVA procedure used to test hypotheses in statistical experiments, factoring the results of known quantities along with predicted values.

In supervised learning, the algorithm is given the dependent variable and it looks at all the independent variables in order to make a prediction about the dependent variable. E.g. classification, regression. When training a supervised learning model, it is standard practice to split the data into a training dataset and a test dataset, so that the test data can be used to validate the trained model.

Unsupervised learning differs because the algorithm is only given the independent variables and without being given any direction, it returns results about relationships between any of the variables. E.g. clustering, segmentation.

Extremely large datasets which are too complex to analyse for insight without the use of sophisticated statistical software.

General term for trending findings over a time period, usually in a graphical format, which can be used to make predicted forecasts for the future.

A statistical method of estimating the expected duration of time until an event occurs.

Example: A human being’s expected lifespan being assessed using inputs such as age, gender, location and wealth.

A mathematical method of using random draws to perform calculations for complex problems. By using random number generation to simulate an outcome over and over again a vast number of times, the overall result helps you calculate the probability of that outcome occurring.

The ‘replicate’ function in R allows you to stipulate how many times you wish to re-run the experiment.
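As an illustration of the idea, this Python sketch (an alternative to R’s replicate) estimates the probability of rolling at least one six in four dice throws by simulating the experiment many times; the analytic answer is 1 − (5/6)⁴ ≈ 0.5177, and the example numbers here are purely illustrative.

```python
import random

def at_least_one_six():
    """One simulated experiment: roll four dice, check for any six."""
    return any(random.randint(1, 6) == 6 for _ in range(4))

random.seed(1)  # fixed seed so the sketch is reproducible
trials = 100_000
hits = sum(at_least_one_six() for _ in range(trials))
probability = hits / trials
print(probability)  # close to the analytic 1 - (5/6)**4 ≈ 0.5177
```

The more trials you run, the closer the simulated proportion gets to the true probability.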


The most common value for a variable based on its frequency; it can be calculated from either qualitative or quantitative data.

The average value based on a variable of quantitative data.

The central value of a variable of quantitative data. Using the median instead of the mean lessens the impact of outliers.

Example: The median UK salary in 2017 was around £22,000 whereas the mean was closer to £26,500, more heavily influenced by some of the large outliers from the top earners.

A value that falls far from the rest of the data, which can have a misleading impact on the mean. Outliers can be identified in a couple of different ways: simply as any observation more than 1.5 times the IQR (interquartile range) beyond the quartiles, or alternatively as any observation more than two standard deviations from the mean.

An indicator that may signal a future event.

Example: A creche getting attached to a restaurant could lead to a higher reported accident rate for the restaurant.

An indicator that follows an event.

Example: Reporting the recent performance of a company’s share price to predict what might happen to it in the future.

An average based on a specific time period which generates a trend-following (or lagging) indicator because it is based on the past.

Example: Opinion poll averages based on the previous 10 days, with the oldest day dropping off and the new day added each day.

The difference between the maximum and minimum values of a quantitative variable in a data set.

The observed values of a variable divided into hundredths. The first percentile (P1) divides the bottom 1% of values from the rest of the data set, the second percentile (P2) the bottom 2%, etc. The median is the 50th percentile (P50).

The observed values of a variable divided into tenths. The second decile is the 20th percentile, represented as either D2 or P20.

The most common type of percentile used, dividing the observed values into quarters. There are three quartiles: Q1 divides at 25%, Q2 at 50% (the median) and Q3 at 75%.

The difference between the first and third quartiles of a variable (Q3 – Q1). This is the preferred measure of variation where there is a skewed distribution, in order to disregard outliers.

The dispersion of the data around the mean. Variance is based on the sum of the squared differences (deviations) between each observation and the mean. We square each deviation to keep the values positive; if we didn’t, the deviations would always sum to zero.

In Excel, VAR.P is the function to use if you have the full population available or VAR.S to estimate the variance if you just have a sample.

The most frequently used measure of variability, showing how tightly the observed values cluster around the mean. The standard deviation is the square root of the variance, making it a much easier measurement to interpret.

Once you have the standard deviation of your sample, you can measure how many observations are within one standard deviation of the mean, how many within two standard deviations, etc. Standard deviation can easily be calculated within Excel.

In a perfect normal distribution, you would expect 68% of observations to fall within one standard deviation of the mean, 95% within two standard deviations and 99.7% within three standard deviations. For example, with 49 balls in the National Lottery the mean is 25; assuming the results have a normal distribution, you’d expect 68% of balls drawn to be within one standard deviation of 25.

The standard error indicates how close the sample mean is likely to be to the true population mean, giving us an idea of how reliable our results will be. It is calculated as se = s / √n (where se is the standard error, s is the sample’s standard deviation and n is the total number of observations).
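The se = s / √n calculation is a one-liner; here it is in Python with made-up measurements.

```python
import math
import statistics

# Invented sample measurements
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 12.0]

s = statistics.stdev(sample)   # sample standard deviation
n = len(sample)
se = s / math.sqrt(n)          # standard error of the mean
print(se)
```

Doubling the sample size would shrink the standard error by a factor of √2, which is why larger samples give more reliable estimates of the population mean.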

The proportion of times a particular outcome would occur in a long run of repeated observations.

A single number calculated from the data set, that is the best single guess for an unknown parameter.

A range of numbers around the point estimate, within which an unknown parameter is expected to fall.

A calculation which allows you to provide a % confidence of the probability of a parameter falling within particular values, based on known values for related variables.

Combining information from multiple sources to help arrive at the most accurate conclusion possible, often by testing the same hypothesis using numerous different methods.

Decimal places allow you to specify how many decimals you wish to round a continuous quantitative variable to. As standard, you will round to the nearest decimal place, either up or down, although Excel has a variety of rounding functions you can use, including fixed round-ups and round-downs.

Example: 2.46874 to 3 decimal places would be 2.469.

You can specify a number to a specific amount of significant figures to ensure you are reaching your required degree of accuracy. Unlike decimal places, this will include whole numbers before the decimal.

Example: 2.46874 to 3 significant figures would be 2.47, or to two significant figures it would simply be 2.5.

There is no set formula in Excel for setting significant figures but it can be achieved using a combination of the ROUND, INT, LOG10 and ABS functions.
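The same ROUND/LOG10 idea translates directly into a short helper function; this Python sketch mirrors an Excel formula along the lines of =ROUND(A1, figures - 1 - INT(LOG10(ABS(A1)))), using the worked example from the text.

```python
import math

def sig_figs(x, figures):
    """Round x to the given number of significant figures by shifting
    the rounding position according to the number's magnitude."""
    if x == 0:
        return 0.0
    magnitude = math.floor(math.log10(abs(x)))
    return round(x, figures - 1 - magnitude)

print(sig_figs(2.46874, 3))  # 2.47
print(sig_figs(2.46874, 2))  # 2.5
```

Unlike fixed decimal places, the function adapts to the size of the number: 2468.74 to three significant figures would round at the tens, giving 2470.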

Fractions can be converted to their simplest form by reducing both the top and bottom figures to the lowest possible whole numbers. To do this you need to find the greatest common factor of the two numbers involved.
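A quick Python illustration of reducing a fraction with the greatest common factor, using 36/48 as an arbitrary example:

```python
import math
from fractions import Fraction

# Reduce 36/48 using the greatest common factor
num, den = 36, 48
gcf = math.gcd(num, den)            # greatest common factor of 36 and 48
simplest = (num // gcf, den // gcf)
print(simplest)  # (3, 4)

# The standard library reaches the same answer automatically
print(Fraction(36, 48))  # 3/4
```

Dividing top and bottom by the greatest common factor (here 12) guarantees the result is in its simplest form in one step.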

Qualitative variables are categorical items, whereas quantitative variables have a numerical value associated.

Qualitative example: Blood group

Quantitative example: Temperature

Quantitative variables can be split into discrete (counted to an exact figure) or continuous, which can’t be measured precisely so need to be rounded.

Discrete example: No of children in a family

Continuous example: A length measured to the nearest cm

Class intervals can be used to categorise continuous quantitative variables e.g. lengths grouped at 10cm intervals 0-10cm, 10-20cm, etc. This effectively turns them into a qualitative variable, making them easier to summarise on your reports.

Primary data is directly collected for the experiment, whereas secondary data comes from an external source.

Qualitative variables can be summarised by scales as well as class intervals.

**Nominal scales** are unordered scales where the category names follow no logical order e.g. gender:

– Male

– Female

**Ordinal scales** are scales with a logical order e.g. survey responses of:

– Strongly Agree

– Agree

– Neutral

– Disagree

– Strongly Disagree

Univariate data comes from one source only. Bivariate data is two different dependent variables from the same population. The relationship between bivariates can be tracked effectively using scatter plot charts.

Univariate example: Attendance figures for a football team

Bivariate example: Attendance figures for a football team compared with matchday beer sale figures

The five-number summary of a variable consists of its minimum value, the first quartile (Q1), Q2, Q3 and its maximum value.

Correlation is a way of measuring the relationship between two variables. You can use the CORREL function in Excel to return the correlation coefficient between two variables.

A value of +1 indicates a perfect positive correlation meaning an increase in one is associated with an increase in the other. -1 would be a perfect negative correlation, where an increase in one field is associated with a decrease in the other. However it is important to remember correlation is not necessarily causation.
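The coefficient CORREL returns is the Pearson formula; here is a pure-Python version using invented figures for the football attendance and beer sales example mentioned above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient, the formula behind Excel's CORREL."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented attendance and matchday beer sales figures
attendance = [30000, 32000, 28000, 35000, 31000]
beer_sales = [9000, 9500, 8600, 10400, 9200]
r = pearson(attendance, beer_sales)
print(r)  # strong positive correlation, close to +1
```

A value this close to +1 suggests the two move together, but as noted above, correlation is not necessarily causation.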

In an ideal world, you have data for the full population and can work with the overall distribution however this is rarely the case. Usually you only have a sample of the data to work from, so you use sample statistics such as the mean and standard deviation to approximate the parameters of the full population. The larger the sample, the more accurate your conclusions are going to be.

A common sampling method is to take multiple random samples, with each sample having its own sample mean that you record to form a sampling distribution. With enough samples the sampling distribution takes on a normal shape regardless of the overall population distribution because of the central limit theorem.
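You can watch the central limit theorem at work with a short simulation: draw many random samples from a deliberately skewed, made-up population, and the recorded sample means cluster tightly around the population mean.

```python
import random
import statistics

random.seed(2)  # reproducible sketch

# A deliberately skewed, made-up population (exponential waiting times)
population = [random.expovariate(1.0) for _ in range(10_000)]

# Draw many random samples, recording each sample mean
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(2_000)]

pop_mean = statistics.mean(population)
mean_of_means = statistics.mean(sample_means)
print(pop_mean, mean_of_means)  # the two agree closely
```

Plotting `sample_means` as a histogram would show the familiar bell shape, even though the population itself is heavily skewed.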

A sampling technique which splits the population into specific categories, or clusters. Every individual in the sample must be assigned to one of the clusters, but the population of each cluster can vary. A random sample of each cluster is then selected.

Example: A health questionnaire splitting the sample by how frequently they go to the gym – Regularly, Occasionally or Never.

A technique frequently used for market research, in which specific quotas are used to build the sample.

Example: Ensuring 20 men and 20 women make up the sample of 40.

Also referred to as opportunity sampling, a sample made up of the easiest people to reach. This method risks failing to produce a truly representative sample of the population.

Example: A company polling customers who are already subscribed on their mailing list.

The full collection of individuals or items under consideration in your statistical study. A finite population can be physically listed, e.g. a list of the books in a library. A hypothetical population is an assumed future population based on inferences made from the existing population.

An unknown numerical summary of a population.

A table listing all categories (or classes) and their frequencies.

The percentage of the frequency of a class against the overall frequency of the sample.

A table listing all classes and their relative frequencies, the total of which will equal 1 (100%).

The frequency distribution where data is only available for a sample of the full population; the larger the sample, the more closely it is likely to match the population’s frequency distribution.

A variable graphically described with a bell-shaped density curve.

Example: A meal’s calorie content may be normally distributed with a mean of 200 calories and a standard deviation of 5 calories, so the vast majority of meals would fall within a few standard deviations of 200.

A type of probability distribution that resembles the normal distribution but differs slightly due to its additional parameter, known as “degrees of freedom”. The higher the degrees of freedom, the more closely the distribution resembles the normal curve.

A probability distribution which gives the probability of a given number of events occurring in a fixed timeframe.

Example: The probabilities of the likely number of goals in a football match, based on averages taken from recent results.

A measurement of the symmetry of the probability distribution of a random variable.

A distribution is skewed if one end of its tail is longer than the other.

A positive skew is displayed if the majority of values are on the left of the distribution with a long tail on the right, also known as a right-skewed distribution because the outlier values are on the right. For example amount of rainfall per day because lots of days are without any rainfall at all but some outlier days have large amounts of rainfall.

A negative skew occurs if the longer end is on the right, with values mainly at the higher end of the scale. A normal distribution has no skew at all, with skewness = 0.

When the skewness is low the mean and median will not be very far apart. When measuring central tendency, any skew above 1 or under -1 suggests the data is too skewed for the mean to be the best measurement and instead the median is a better indicator of typical value.

The SKEW function can be used to measure skewness in Excel.

The sharpness of the peak of a frequency distribution curve. Kurtosis helps describe the shape of a probability distribution of a random variable, measuring the “tailedness” of the data. There are different interpretations of how to measure kurtosis from a population but the purpose is to understand whether the distribution is tall and narrow or short and flat.

A theorem of probability stating that, under certain conditions, the mean of a large number of observations will be approximately normally distributed.

A formula for calculating correlation between two variables where the results of either 1 or -1 demonstrate perfect correlation, and the nearer the value to 0 the less correlation found.

An approach to statistical inference using calculations that result in “degrees of belief”, otherwise known as Bayesian probabilities. The probability of various outcomes are calculated based on inferences made from the available data.

A method of estimating observation sizes by targeting several small specific areas to take a count and then making assumptions about the rest of the population based on the findings.

Example: For measuring the number of fish in a lake (often populations of a type of species).

This represents the probability of ‘A’ happening e.g. on the toss of a coin Pr(Heads)=0.5.

The x-bar refers to the symbol (or expression) used to represent the sample mean, which is used as an estimation of the population mean.


In this example we have finishing positions for Leicester City each season since the Second World War.

Highlight your dataset, go to Insert – Charts – See All Charts. Then in the All Charts tab, Histogram is one of your options.

By default you will get some automatically set bins. These are the preset ranges for each of the bars. In this case they are not particularly useful: the first bar counts the number of seasons Leicester finished between 1st and 9.8th, but we only want to deal in whole numbers.

Therefore we need to right-click on the axis and go to Format Axis. Here you can set the Bin width to a whole number and either define the number of bins or, as in this example, simply define that 45 is the maximum number we’re looking for and 1 is the minimum.

This reformatted version is more informative. You can see a relatively normal distribution and that a standard finishing position for Leicester would be around the 17th to 25th mark.

Note: This Histogram chart feature is not available in earlier versions of Excel. If you’re using 2013 or earlier, you may have to create your own histogram by adding a bar chart and then changing the gap width to zero.


To ensure one connection is refreshed at a time you should make your VB loop through each connection one by one and in the order you want them to. You can name each individual pivot cache, or even each pivot table, but then you have to keep the code maintained when there are changes. The following module is reusable because it will first look for **every** connection in the file and refresh them individually before looking for **every** pivot table and refreshing those individually too.

    Sub RefreshAll()
        Dim wc As WorkbookConnection
        Dim pc As PivotCache

        For Each wc In ThisWorkbook.Connections
            wc.ODBCConnection.BackgroundQuery = False
            wc.Refresh
        Next wc

        For Each pc In ThisWorkbook.PivotCaches
            pc.Refresh
        Next pc
    End Sub

You can do this in VBA without having to specify each individual value in the code. In this example we have a list of existing values along with the new value we want to replace them with.

Select your data and hit Ctrl + T to turn it into an Excel table. Name the table **TableFindReplace**.

We can then apply this code which will check one-by-one for all instances of the values in the Existing column to replace them with the value in the New column. E.g. all instances of Man C to Manchester City, all instances of Man U to Manchester United, etc.

NB: You will need to adjust “SheetName” with your worksheet name and “TableData” with the table name containing your data.

    Sub MultiFindReplace()
        Dim FindReplaceList As Range
        Dim Results As Range
        Dim cell As Range

        Set FindReplaceList = Sheets("SheetName").ListObjects("TableFindReplace").DataBodyRange
        Set Results = Sheets("SheetName").ListObjects("TableData").DataBodyRange

        Application.Goto Reference:="TableData"

        For Each cell In FindReplaceList.Columns(1).Cells
            Selection.Replace What:=cell.Value, Replacement:=cell.Offset(0, 1).Value
        Next cell
    End Sub

We will look at these functions individually to begin with, but the real value comes from combining the two of them. They can serve as a useful alternative to VLOOKUP because they can look up in either direction – VLOOKUP is restricted to looking up left to right. If you have a large file looking up thousands of cells, it can also be useful to use INDEX/MATCH rather than VLOOKUP because the calculations are faster.

**INDEX function**

The INDEX function brings back the value within a cell, based on the specified row and column references.

This example shows how you first select your data range (A2 to C29), then which row and column from that range you want to return. The 7th row is Pall Mall and the third column is its colour – Pink.

You can also refer to a cell to get your row or column values. In this case referring to F2 for the column number 2, which returns Pall Mall’s price of 140.

**MATCH function**

The MATCH function searches for a chosen value within a range of cells, returning the position of it in the range.

In this example the formula is looking for a match of the text “Pall Mall” within the range A2 to A29 and the result shows the 7th value in that range contains this text.

If there are multiple cells which match the value, the formula will just return the first instance, as in this example bringing back the first row in the range that contains “Light blue”. The zero after the final comma denotes that we need an exact match here.

**Using INDEX and MATCH together**

Now we’re going to look at some of the circumstances where using these two functions together may be preferable to using a standard VLOOKUP formula. As mentioned earlier, VLOOKUP is limited to only searching from left to right, and INDEX and MATCH are also faster functions to calculate than VLOOKUP.

Here the two functions are being used to lookup the value in F2 (Whitechapel Road) and return the price. In this example, there is no real benefit over using *=VLOOKUP(F2,A2:C29,2,0)* and they both achieve the same result.

However in this example, INDEX/MATCH has the added flexibility of looking up right to left and performing a “backwards VLOOKUP”. In this case looking for the first instance of “Orange” in the Colour column and returning the Property.

INDEX and MATCH can also combine to help us pick out a value from a table or matrix to perform a two-way lookup. Here we are matching the value in J1 (Euston Road) with our listing in column A and our month in J2 with our month name headers to find that Euston Road was purchased 7 times in April. The result of the formula will change depending on the highlighted values in J1 and J2.


To expand this limit, go to the Data tab and click on Connections. In the dialogue box, click on ‘ThisWorkbookDataModel’ and go to Properties. In the Usage tab, change ‘Maximum number of records to retrieve’ to a number of your choice (up to the Excel limit of 1,048,576).

There is a known issue with this that sometimes occurs in Excel 2013, where the option to change the maximum number is greyed out. If you can access the workbook using either Excel 2010 or 2016 you can work around this by changing the Connection Properties in those versions of Excel; when you re-open the workbook in 2013 the new limit is retained.
