An NPS result falls anywhere between -100 and 100, so it can be a useful mechanism for converting customer feedback scored on a 1-10 basis into a single headline score, whilst at the same time eliminating ‘passive’ feedback where the customer is neither overly positive nor negative in their response.

Responses of 9 or 10 are ‘promoters’, considered to represent customers who will happily endorse your products. Responses between 0 and 6 are ‘detractors’ who are unlikely to endorse you, and finally responses of 7 or 8 are classed as ‘passives’, representing customers who don’t appear to have strong views in either direction.

Here is some dummy data from a customer satisfaction survey scored on a 1-10 scale, where 1 is the least satisfied and 10 the most satisfied.

In the example dataset there are 10 promoters, 3 passives and 3 detractors. A formula to calculate NPS on the values in column C is *=100*(COUNTIF(C:C,">=9")-COUNTIF(C:C,"<=6"))/COUNT(C:C)*, returning an NPS of 43.75. An alternative method of calculating NPS, which is perhaps simpler to remember, is the % of promoters minus the % of detractors. In this case 10 of 16 were promoters (62.5%) and 3 of 16 were detractors (18.75%), so 62.5% – 18.75% = 43.75.
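The same promoter/detractor arithmetic can be sketched in Python. The scores below are a hypothetical sample constructed to match the worked example (10 promoters, 3 passives, 3 detractors out of 16):

```python
def nps(scores):
    """Net Promoter Score: % of promoters (9-10) minus % of detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# Hypothetical sample: 10 promoters, 3 passives and 3 detractors
sample = [9] * 10 + [7, 8, 7] + [3, 5, 6]
print(nps(sample))  # → 43.75
```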

A simple example could be a company’s global sales, using a geographical hierarchy from continent, to country, to city with this dataset:

Select your data, go to the Insert menu and select Charts – See All Charts – Sunburst.

Your sunburst chart helps you to identify where the bulk of your sales are coming from and visually compare your continents, countries and cities against each other.


A positive skewness indicates a ‘right-skewed’ distribution, where the majority of the values sit on the left of the distribution and the outlier values are on the right. A negative skewness indicates the opposite. A skewness of zero describes a perfectly symmetrical distribution, such as the normal distribution. As a general rule, the mean (average) is the best measurement for a typical value unless you have a positive skew above 1 or a negative skew below -1; in those cases the median provides a more accurate measurement.

The SKEW function can be used to measure skewness in Excel as displayed in this example.

You can see this dataset has a positive skew of 2.2123. In this example, the median gives a more realistic picture of a typical salary than the mean, because the data is skewed by a few outliers with large salaries.
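As a minimal sketch of the same measurement outside Excel, the function below implements the sample skewness formula that Excel's SKEW function uses; the salary figures are hypothetical:

```python
import statistics

def skew(xs):
    """Sample skewness, following Excel's SKEW formula:
    n / ((n-1)(n-2)) * sum(((x - mean) / stdev) ** 3)."""
    n = len(xs)
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)  # sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)

# Hypothetical salaries (in £000s), dragged rightwards by two outliers
salaries = [20, 22, 23, 25, 26, 28, 30, 90, 120]
print(skew(salaries) > 1)  # → True: strongly right-skewed, so prefer the median
```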

To detect individual outliers, you can flag anything more than 1.5 times the IQR (interquartile range) beyond the quartiles. Here’s an example of how to do this using Excel formulas.

Note that we have added formulas for each of the first three quartiles. Quartile 2 is the same as the median. The interquartile range (IQR) is the difference between Quartiles 1 and 3. To count as an outlier, an observation needs to be either more than 1.5 times the IQR below Q1, or more than 1.5 times the IQR above Q3. The result here finds two observations above the maximum threshold, so we would define those as outliers.
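A Python sketch of the same rule using the standard library; `method="inclusive"` mirrors Excel's QUARTILE.INC behaviour, and the data is made up for illustration:

```python
import statistics

def iqr_outliers(xs):
    """Return values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lower or x > upper]

# Hypothetical observations with two extreme values
data = [10, 12, 12, 13, 14, 15, 15, 16, 48, 55]
print(iqr_outliers(data))  # → [48, 55]
```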

The technology used for collecting, storing, processing, transforming and analysing raw data in order to make it useful for gaining insights.

There are numerous programs for creating data visualisations and dashboard reports, some of the most popular being:

– Microsoft Power BI

– Qlik

– Tableau

– Spotfire

Other products which specialise in infographics, animations and other visualisations include:

– Google Sites

– Illustrator

– Unity

A drag-and-drop tool with a graphical user interface for building, testing and deploying predictive analytics solutions on your data.

A solution for creating accurate predictive and descriptive data models using data mining and statistical techniques such as linear regression, clustering and classification (decision trees).

Jupyter Notebooks are used to explore datasets through an interactive browser-based environment in which you can add notes and run code to manipulate and visualise data. They support languages regularly used by data scientists such as R and Python.

Object-oriented programming is based on the concept of structured data, organised into objects with fields (attributes), together with operations (methods) that can be applied to that structure. Procedural, or imperative, programming focuses on explicit sequences of instructions to run a task.

R is an open source programming language and software environment widely used for statistical analysis, testing and modelling.

For data visualisations using R, a popular package is ggplot2.

Python is another open source language used for detailed statistical analysis, testing and modelling. It is considered object-oriented and is often used for building reusable code patterns.

Popular Python packages for data science include:

- NumPy (Numeric Python, for performing calculations over entire arrays)
- Matplotlib (for data visualisations)
- SciPy (for scientific and technical computing)
- Scikit-Learn (for machine learning)
- Pandas (for data manipulation and analysis using data frames)

A language commonly used in web development; Java is used by numerous data visualisation packages.

Similar in look to a bar chart except the bars are connected to each other, histograms are formed from grouped data to display frequencies or relative frequencies (percentages) for each class in a sample.

A method of displaying the correlation between two or more variables, including a line of best fit to demonstrate how far each observation deviates from the overall trend.

Line chart plotted at the mid-point of each class, with the classes grouped e.g. into 0-10, 11-20, etc.

Presented as two or more circles overlapping each other to demonstrate relationships between variables.

**Example:** Animals with two legs and animals who can fly. Some would show in one group or the other and some would overlap into both groups.

A branching diagram which lists all possible outcomes of an event.

**Example:** The first branch could be Europe, the second branches splitting out Germany, France and Spain and then the third branches split out the various cities in those countries.

A one-dimensional graph based on the numerical data from the five-number summary.

A visualisation organising numerical data into categories based on place value. These contain more detail than a standard histogram. The stem is the left-hand column, containing the digits in the largest place, and the leaf in the right-hand column contains the digits in the smallest place.


A prediction or statement about a characteristic of a variable, which can be tested to provide evidence for or against.

A method of statistically testing a hypothesis by comparing data against values predicted by the hypothesis. The significance test considers two hypotheses, the null and the alternative, effectively testing whether or not there is significant reason to support the hypothesis or whether the results are random.

When we can’t prove the alternative hypothesis is significant, we don’t “accept” the null hypothesis but “fail to reject” the null hypothesis. This implies the data is not sufficiently persuasive for us to choose the alternative hypothesis over the null hypothesis but it doesn’t necessarily prove the null hypothesis either.

The hypothesis that is directly tested during a significance test. As the null hypothesis indicates no significance, you are usually trying to disprove the null hypothesis in your statistical tests.

Contradicting the null hypothesis, the alternative hypothesis is supported if the significance test indicates the null hypothesis to be incorrect.

These are two methods of significance testing used to determine whether there is a significant difference between the means of two related groups.

If we already know the mean and standard deviation of the full population, we can conduct a Z test. Typically we only have a sample of the population and therefore need to carry out a T-test instead.

The Z.TEST function in Excel performs a z-test, using the sample standard deviation if you don’t supply the population one: =Z.TEST(array, x, [sigma]). The function returns a p-value, indicating the probability of randomly observing a sample mean at least as far from the hypothesised population mean (x) as the one you have.

The T-test is based on the T-distribution whereas a Z test relies on the assumption of a normal distribution.

There are a variety of different types of T and Z test, the most basic being a 1-sample, 1-tailed test. A 1-sample test compares a single sample of the data against the hypothesised population mean, but other variants compare multiple samples. A 1-tailed test can only provide a p-value in one direction, testing either whether the sample mean is significantly higher or whether it is significantly lower than the hypothesised mean, whereas a 2-tailed test checks both ways.

If you are only specifically looking for one alternative hypothesis, then a one-tailed significance test is sufficient. However, sometimes you have two possible alternative hypotheses. For example, you might be testing how paying for extra golf sessions affects a golfer’s performance. The null hypothesis is that the sessions make no difference; one alternative hypothesis is that they improve performance, but there could also be another alternative hypothesis that they worsen performance. It is important to make the correct decision as to whether your significance test needs to be one or two tailed.

A function of the observed sample results used for testing a statistical hypothesis. Prior to the test being performed, an agreed threshold for the probability value (p-value) should be chosen, known as the significance level. It will usually be somewhere between 1% and 5%. The results can be measured against the significance level to provide evidence for or against the null hypothesis.

As a general rule a p-value of 0.05 (5%) or less is deemed to be statistically significant.

P-hacking, also known as data dredging, is the process of finding patterns in data which enable a test to be deemed statistically significant when in reality there is no significant underlying effect.

A Type I error is the rejection of a true null hypothesis, therefore finding an incorrect significance with a false positive.

A Type II error is the failure to reject a false null hypothesis, therefore failing to define a true significance with a false negative.

When you set a rigorous threshold for your p-value, such as 0.01, you run a greater risk of a Type II error; with a threshold that’s too relaxed you risk a Type I error instead. This is why choosing an appropriate significance level is so critical. If, for example, you set a relaxed 0.1 threshold in court then, loosely speaking, you would be accepting up to a 10% chance of convicting an innocent defendant.

The Chi-Squared statistic is a means of testing for independence with categorical data, rather than numeric. E.g. to test whether eye colour and gender illustrate a significant difference from independence (in other words, whether or not there is a relationship between the two variables).

CHISQ.TEST is the Excel function to perform a test using this statistic.
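As a rough illustration of the statistic itself (not of the Excel function), the sketch below computes chi-squared by hand for a hypothetical 2x2 table of gender against eye colour:

```python
def chi_squared(observed):
    """Chi-squared statistic for a contingency table (a list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = gender, columns = eye colour
table = [[30, 20],
         [20, 30]]
print(chi_squared(table))  # → 4.0
```

With 1 degree of freedom the 5% critical value is about 3.841, so a statistic of 4.0 would suggest the two variables are not independent.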

The group in an experiment who are randomly selected from the population.

The group in an experiment who are specifically selected from the population based on the particular characteristics which are being tested. They can then be compared against the control group to confirm or reject the hypothesis.

A branch of statistics which consists of methods for organising and summarising information. Descriptive statistics use the collected data and provide factual findings from it.

Inferential statistics consists of methods for drawing conclusions and measuring their reliability based on information gathered from statistical relationships between fields in a sample of a population. This can be used for forecasting future events.

Making predictions and generalisations represented by data collections.

KDD stands for Knowledge Discovery in Databases, which covers the creation of knowledge from structured and unstructured sources, regularly involving machine learning.

Based on Artificial Intelligence and with an emphasis on big data and large-scale applications, machine learning is the process of training a computer algorithm to use statistics from the data provided to learn and make forecasts and predictions about the future.

In supervised machine learning, algorithms use the numeric features of known values to detect patterns and trends on the variable we are trying to predict.

Unsupervised machine learning doesn’t focus on a particular known variable, instead looking at similarities among all variables across the observations. Once the model is trained, new observations are assigned to their relevant cluster based on their characteristics.

Sometimes referred to as market segmentation, this is the process of breaking down a population into groups, or samples, of similar characteristics with an identifiable difference. This can be as simple as splitting your samples between men and women, or could be based on any other attribute about the population. Often a core demographic or value is used for the groupings.

Clustering is an ‘unsupervised’ analysis which categorises your observations into groups, or ‘clusters’. There are numerous variations but in each case there is some form of distance measurement to determine how close or far apart observations are within each cluster.

k-Means clustering is probably the most common partitioning method for segmenting data. It requires the analyst to specify ‘k’, the number of distinct clusters the data will be segmented into.

This method begins by placing ‘k’ markers, known as cluster centroids, spread across the data. It then assigns each observation to whichever cluster centroid it is nearest to.

Usually this initial clustering won’t have the observations very evenly split around the cluster’s centroid, so the algorithm will take the mean value of all observations in the cluster and re-position the centroid based on that. It will then look again to see whether any observations need moving into a different cluster. The algorithm continues doing this and re-positioning the centroids until it finds the best fit.

Best fit is defined as when the average distance from each observation to its cluster centre is at its smallest.
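The assign-then-re-centre loop described above can be sketched in one dimension with plain Python; the spend values and starting centroids are hypothetical:

```python
import statistics

def kmeans_1d(xs, centroids, iters=20):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster, and repeat."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in xs:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        centroids = [statistics.fmean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical customer spend values forming two natural groups
points = [1, 2, 3, 20, 21, 22]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 10.0])
print(centroids)  # → [2.0, 21.0]
```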

You can draw lines of demarcation between clusters with a Voronoi diagram, displaying the area of each cluster; depending on which side of a line an observation sits, it will be assigned to the relevant cluster.

**Example:** You may run an unsupervised machine learning exercise using k-means clustering on a customer base, then compare the susceptibility to marketing campaigns of the “brand loyalist” cluster of customers against your “value conscious” cluster.

Hierarchical clustering builds multiple levels of clusters, creating a hierarchy with a cluster tree.

When you have specific definitions to group your data by, predictive modelling can be a useful alternative to clustering. Variables found to be statistically significant predictors of another variable can be used to define segmentations for your analysis.

Statistical learning places more emphasis on mathematics and statistical models, with their various interpretations and precisions.

A form of supervised learning that estimates the relationships among variables, based on the data you have available to you.

Excel has an in-built tool for Regression analysis.

**Mean Absolute Error (MAE)** and **Root Mean Square Error (RMSE)** measure the residuals (the variance between predicted and actual values) in the units of the label being analysed.

**Relative Absolute Error (RAE)** and **Relative Squared Error (RSE)** are relative measures of error. The closer their values are to zero, the more accurate the model is predicting.

**R Square** (also referred to as the Coefficient of Determination) measures how much of the variance in the dependent variable is explained by a regression model. A value of 1 would demonstrate a perfect correlation between the variables.

**Example:** An R Square value of 0.96 between price and sales would indicate that 96% of the variance in sales is explained by the price.
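These error measures can be computed by hand. A minimal sketch, with hypothetical actual and predicted values:

```python
import math

def regression_metrics(actual, predicted):
    """MAE, RMSE and R Square for a set of predictions."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r ** 2 for r in residuals) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical actual vs predicted sales
mae, rmse, r2 = regression_metrics([10, 20, 30, 40], [12, 18, 33, 41])
print(mae, round(rmse, 3), round(r2, 3))  # → 2.0 2.121 0.964
```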

Whereas regression analysis predicts a numeric value, classification is a data mining technique which helps us predict which class (or category) our data observations belong to; a decision tree is one of the most common classification algorithms. Classification is particularly useful when breaking down very large datasets before analysing and making predictions.

In its simplest form, you could use classification to break down your results into just two categories: true or false. You won’t necessarily come up with 1 (for true) or 0 (for false) but a value somewhere in the middle, with a threshold set to define which are classified true and which are classified false.

We can withhold some of the data to use as test data to validate our model. The test cases can then be divided into groups:

Cases where the model predicts a 1 which were actually true are “true positives”.

Cases where the model predicts a 0 which are actually false are “true negatives”.

Cases where the model predicts a 1 which are actually false are “false positives”.

Cases where the model predicts a 0 which are actually true are “false negatives”.

Based on the test results you may choose to move the threshold to change how the predicted values are classified.

A matrix which displays the number of true positives, true negatives, false positives and false negatives in your testing, to help measure the effectiveness of your classification model.
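The four counts can be tallied directly; the labels below are hypothetical test data:

```python
def confusion_matrix(actual, predicted):
    """Tally true/false positives and negatives for binary (0/1) labels."""
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(1 for a, p in pairs if a == 1 and p == 1),
        "TN": sum(1 for a, p in pairs if a == 0 and p == 0),
        "FP": sum(1 for a, p in pairs if a == 0 and p == 1),
        "FN": sum(1 for a, p in pairs if a == 1 and p == 0),
    }

# Hypothetical actual labels vs model predictions
cm = confusion_matrix([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(cm)  # → {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```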

A term to describe where a model has been iterated so many times that it performs more accurately on the data it was trained with than it will on any new data.

The process of categorising opinions expressed within a piece of text, usually to determine whether the comment was positive, negative or neutral. This type of analysis is prominent when analysing customer feedback on social media platforms.

ANOVA

Analysis of Variance, a hypothesis testing technique allowing you to determine whether the differences between samples are simply due to random sampling errors, or whether there are systematic treatment effects that cause the mean in one group to differ from the mean in another.

An ANOVA procedure used to test hypotheses in statistical experiments, factoring the results of known quantities along with predicted values.

In supervised learning, the algorithm is given the dependent variable and it looks at all the independent variables in order to make a prediction about the dependent variable. E.g. classification, regression. When training a supervised learning model, it is standard practice to split the data into a training dataset and a test dataset, so that the test data can be used to validate the trained model.

Unsupervised learning differs because the algorithm is only given the independent variables and without being given any direction, it returns results about relationships between any of the variables. E.g. clustering, segmentation.

If you want to segregate your customers who pay on card vs your customers who pay with cash, that’s supervised machine learning because you’re telling the machine what to look for. If alternatively you want to give the computer all your data and measures and ask it to segment the customer base by highlighting any interesting patterns, that’s unsupervised machine learning.

Extremely large datasets which are too complex to analyse for insight without the use of sophisticated statistical software.

General term for trending findings over a time period, usually in a graphical format, which can be used to make predicted forecasts for the future.

A statistical method of estimating the expected duration of time until an event occurs.

**Example:** A human being’s expected lifespan being estimated using inputs such as age, gender, location and wealth.

A mathematical method of using random draws to perform calculations for complex problems. By using random number generation to simulate the results of an outcome over and over again a vast number of times, the overall result helps you estimate the probability of that outcome occurring.

The ‘replicate’ function in R allows you to stipulate how many times you wish to re-run the experiment.
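A comparable sketch in Python, simulating two dice to estimate the probability of rolling a total of 7 (the true value is 6/36 ≈ 0.167); the seed is fixed only to make the run repeatable:

```python
import random

def estimate_seven(trials=100_000, seed=42):
    """Monte Carlo estimate of P(two dice sum to 7)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials)
               if rng.randint(1, 6) + rng.randint(1, 6) == 7)
    return hits / trials

print(estimate_seven())  # close to 1/6
```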

The most common value for a variable based on its frequency; it can be calculated from either qualitative or quantitative data. The MODE function can be used in Excel to return this.

The average value based on a variable of quantitative data.

The central value of a variable of quantitative data. Using the median instead of the mean lessens the impact of outliers.

**Example:** The median UK salary in 2017 was around £22,000 whereas the mean was closer to £26,500, more heavily influenced by some of the large outliers from the top earners.

A unit that falls far from the rest of the data, which can have a misleading impact on the mean. Outliers can be measured in a couple of different ways, simply as any observations outside of 1.5 multiplied by the IQR (interquartile range) or alternatively outliers can be deemed as any observations outside of two standard deviations from the mean.

An indicator that may signal a future event.

**Example**: A creche getting attached to a restaurant could lead to a higher reported accident rate for the restaurant.

An indicator that follows an event.

**Example:** Reporting the recent performance of a company’s share price to predict what might happen to it in the future.

An average based on a specific time period which generates a trend-following (or lagging) indicator because it is based on the past.

**Example:** An opinion poll average based on the previous 10 days, where each new day is added and the oldest day drops off.
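A simple moving average can be sketched as follows, using hypothetical daily poll figures and a 3-day window:

```python
def moving_average(values, window):
    """Mean of each consecutive window of values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical daily poll figures
print(moving_average([40, 42, 44, 43, 45, 47], 3))  # → [42.0, 43.0, 44.0, 45.0]
```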

The difference between the maximum and minimum values of a quantitative variable in a data set.

The observed values of a variable divided into hundredths. The first percentile (P1) divides the bottom 1% of values from the rest of the data set, the second percentile (P2) the bottom 2%, etc. The median is the 50th percentile (P50).

The observed values of a variable divided into tenths. The second decile is the 20th percentile, represented as either D2 or P20.

The most common type of percentile used, dividing the observed values into quarters. There are three quartiles: Q1 divides at 25%, Q2 at 50% (the median) and Q3 at 75%.

The difference between the first and third quartiles of a variable (Q3 – Q1). This is the preferred measure of variation where there is a skewed distribution, in order to disregard outliers.

The dispersion of the data from the mean. Variance averages the squared differences (deviations) between each observation and the mean. We square each deviation to keep it positive; if we didn’t, the deviations would always sum to zero.

In Excel, VAR.P is the function to use if you have the full population available or VAR.S to estimate the variance if you just have a sample.

The most frequently used measure of variability, showing how tightly the observed values cluster around the mean. The standard deviation is the square root of the variance, making it a much easier measurement to interpret.

Once you have the standard deviation of your sample, you can measure how many observations are within one standard deviation of the mean, how many within two standard deviations, etc. Standard deviation can easily be calculated within Excel.
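Python’s statistics module draws the same population/sample distinction as Excel’s VAR.P/VAR.S and STDEV.P/STDEV.S; the data below is hypothetical:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations; mean is 5

# Treating the data as the full population (Excel VAR.P / STDEV.P)
print(statistics.pvariance(data), statistics.pstdev(data))  # → 4 2.0

# Treating it as a sample of a wider population (Excel VAR.S / STDEV.S)
print(statistics.variance(data), statistics.stdev(data))
```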

In a perfect normal distribution, you would expect 68% of observations to fall within one standard deviation of the mean, 95% within two standard deviations and 99.7% within three standard deviations. For example, with 49 balls in the National Lottery the mean is 25; assuming the results had a normal distribution, you’d expect 68% of balls drawn to be within one standard deviation of 25.

The standard error indicates how close the sample mean is to the true population mean, giving us an idea of how reliable our results will be. It is calculated as se = s / √n (where se is the standard error, s is the sample’s standard deviation and n is the total number of observations).

The proportion of times a particular outcome would occur in a long run of repeated observations.

A single number calculated from the data set, that is the best single guess for an unknown parameter.

A range of numbers around the point estimate, within which an unknown parameter is expected to fall.

A calculation which allows you to provide a % confidence of the probability of a parameter falling within particular values, based on known values for related variables.

Combining information from multiple sources to help arrive at the most accurate conclusion possible, often by testing the same hypothesis using numerous different methods.

Decimal places allow you to specify how many decimals you wish to round a continuous quantitative variable to. As a standard, you will round to the nearest decimal place, either up or down, although Excel has a variety of rounding functions you can use including fixed round-ups and round-downs.

**Example:** 2.46874 to 3 decimal places would be 2.469.

You can specify a number to a specific amount of significant figures to ensure you are reaching your required degree of accuracy. Unlike decimal places, this will include whole numbers before the decimal.

**Example:** 2.46874 to 3 significant figures would be 2.47, or to two significant figures it would simply be 2.5.

There is no set formula in Excel for setting significant figures but it can be achieved using a combination of the ROUND, INT, LOG10 and ABS functions.
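One possible sketch of the same log-based approach in Python (the function name is my own):

```python
import math

def sig_figs(x, figures):
    """Round x to a given number of significant figures."""
    if x == 0:
        return 0.0
    # Work out how many decimal places keep the requested significant figures
    digits = figures - int(math.floor(math.log10(abs(x)))) - 1
    return round(x, digits)

print(sig_figs(2.46874, 3))  # → 2.47
print(sig_figs(2.46874, 2))  # → 2.5
```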

Fractions can be converted to their simplest form by reducing both the top and bottom figures to the lowest possible whole numbers. To do this, divide both numbers by their greatest common factor.
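In Python, math.gcd gives the greatest common factor directly:

```python
import math

def simplify(numerator, denominator):
    """Reduce a fraction by dividing out the greatest common factor."""
    gcf = math.gcd(numerator, denominator)
    return numerator // gcf, denominator // gcf

print(simplify(8, 12))  # → (2, 3)
```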

Parallel lines go in the same direction but do not intersect. E.g. two sides of a piece of paper.

Perpendicular lines are lines that intersect at a right angle. E.g. two lines crossing to form an ‘x’: if any one of the angles is 90° then all four of the angles will be 90°. If the lines intersect but none of the angles are 90° then they are neither parallel nor perpendicular.

Lines of symmetry perfectly divide a shape into multiple identical parts. E.g. dividing a square into four equally sized quadrants.

Right angles are formed when two lines meet at exactly 90°, for example a horizontal line met by a vertical line.

Acute angles are any angles smaller than a right angle and therefore less than 90°.

Obtuse angles are any angles larger than a right angle, measuring above 90° but below 180°.

Triangles can be defined by the length of their sides or by their angle types.

Scalene triangles are where none of the sides are equal. Isosceles triangles are where at least two of the sides have equal length. Equilateral triangles have all three sides of equal length.

All triangles have three angles adding to 180°.

Acute triangles are where all three angles are less than 90°. Right triangles have one angle which is exactly 90°. Obtuse triangles have one angle which is greater than 90°.

Polygons are any closed shape with at least three straight sides and angles. A ‘regular polygon’ is equilateral (all sides have the same length) and also has all angles equal in measure.

Quadrilaterals are polygons with four sides and four corners. Therefore all squares and rectangles are quadrilaterals. Parallelograms are quadrilaterals with two pairs of parallel sides.

The perimeter is the length of an outline of a shape, calculated by adding together the length of each side.

The area of a rectangle is measured in square units, which could be square inches, square feet, square metres, etc. The calculation is simply length multiplied by width, represented by A = L * W.

The area of a triangle can be calculated using an adapted version of the formula for the area of a rectangle. If you copy a triangle, flip it 180° and place it next to the original, you end up with a parallelogram with twice the triangle’s area. You therefore need to halve the calculation, because you only want the area of the triangle. The formula is half the base multiplied by the height: A = 0.5 * B * H.

To calculate the area of a circle you first need to be able to understand and define the following:

**Radius** – The distance from the centre of a circle to the edge

**Diameter** – The full width of a circle

**Circumference** – The distance around the circle

**Pi** – The ratio of a circle’s circumference to its diameter, approximately 3.14159265

The area of a circle is calculated as pi multiplied by the square of the radius: A = π * r^2.

Pythagoras’ theorem can be used when we know two sides of a right triangle in order to calculate the third side. Pythagoras discovered that in a right triangle, the square of the hypotenuse (the side opposite the right angle, which is always the longest side) is equal to the sum of the squares of the other two sides: a^2 + b^2 = c^2.
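For example, with shorter sides of 3 and 4 (made-up values):

```python
import math

a, b = 3, 4  # the two shorter sides of a right triangle
c = math.sqrt(a ** 2 + b ** 2)  # hypotenuse from a^2 + b^2 = c^2
print(c)  # → 5.0
```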

Trigonometry is the study of triangles. Three trigonometric functions that are specific to right-angled triangles are sine, cosine and tangent, often shortened to sin, cos and tan.

Firstly it is important to understand the terms used to describe the three sides of a right triangle:

**Hypotenuse** – The side opposite the right angle, always the longest side

**Opposite** – The side opposite the angle you’re working with

**Adjacent** – The side next to the angle you’re working with

**Sin** = Opposite / Hypotenuse

**Cos** = Adjacent / Hypotenuse

**Tan** = Opposite / Adjacent

The acronym SOHCAHTOA is helpful to remember which sides go with each function.

These functions can be used to relate the angles in the triangle to the side lengths.
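For example, for a right triangle with a 30° angle and a hypotenuse of 10 (made-up values), rearranging Sin = Opposite / Hypotenuse gives the opposite side:

```python
import math

angle = math.radians(30)                 # trig functions take radians
hypotenuse = 10
opposite = hypotenuse * math.sin(angle)  # Sin = Opposite / Hypotenuse
adjacent = hypotenuse * math.cos(angle)  # Cos = Adjacent / Hypotenuse
print(round(opposite, 3), round(adjacent, 3))  # → 5.0 8.66
```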

Algebra deals with letters and symbols and rules to manipulate those symbols used to represent numbers and quantities within formulae and equations.

Algebraic expressions commonly include a mixture of letters, numbers and other symbols. E.g. x + 5 = y. To represent multiplication you don’t need to use a multiplication symbol, 5 multiplied by x can simply be represented by 5x.

The Distributive Property is a mathematical law which means we can distribute an operation across the terms inside parentheses in the same equation. E.g. a(b + c) = ab + ac.

The order of operations is used in both mathematics and computer programming to define which procedures are performed first in a complex formula. PEMDAS is an acronym to help remember the order:

**Parentheses**

**Exponents**

**Multiplication and Division** (from left to right)

**Addition and Subtraction** (from left to right)

Linear equations are equations between two variables that give a straight line when plotted on a chart. While we can’t determine definite values of either variable, we can solve the equation for how to calculate one of the variables with respect to the other.

Example: 2y-4x=2

Firstly isolate the y term: 2y=2+4x

Divide both sides by two: y=1+2x

Now we know the value of y for any given value of x. All the possible values of x and y could be plotted on a chart to visualise the linear growth of the equation.

Equations that can be rearranged into the form y = mx + c will produce what is known as a straight line graph.

**Example:** x + y = 3 can be rearranged into y = 3 – x

Quadratic equations in algebra follow the form ax^2 + bx + c = 0 where x represents an unknown variable but a, b and c have known values. The value of a must not be zero for the equation to be quadratic, otherwise it is linear.

Polynomials can include many different terms: numeric values, variables (like x or y) and exponents (like squaring numbers). They can be combined by addition, subtraction and multiplication, but division by a variable (a negative exponent) is not allowed. E.g. 5xy^2 + 3x + 4y is a polynomial but 3xy^-2 is not.

An exponent is a quantity representing the power to which a given value is to be raised. For example 2^4 is equivalent to 2*2*2*2. Logarithms explain how many times a base value needs to be multiplied by itself to achieve another value. Using the same example of 2^4 = 16, this could also be represented as the base-2 log of 16 equalling 4: log2(16) = 4.

A linear scale can be visualised like a ruler or a tape measure, where the distance between consecutive numbers is always the same; this applies to negative numbers on a linear scale too. A logarithmic scale, however, is based on multiplication rather than addition: each step along the scale multiplies the previous value by a fixed factor. E.g. a linear scale may add 10 to each value (0, 10, 20, 30, 40, 50) whereas a logarithmic scale with a factor of 10 would multiply each value by 10 (1, 10, 100, 1000, 10000, 100000).

Linear growth is when a constant amount is fixed as the increase between each value e.g. 2, 5, 8, 11 which indicates a linear relationship. Exponential growth multiplies between values e.g. 1, 3, 9, 27 multiplying each by 3 and indicating an exponential relationship.
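The two growth patterns can be generated with a couple of loops, using the sequences from the text:

```python
# Linear growth: add a fixed amount (here 3) at each step.
linear = [2]
for _ in range(3):
    linear.append(linear[-1] + 3)            # 2, 5, 8, 11

# Exponential growth: multiply by a fixed factor (here 3) at each step.
exponential = [1]
for _ in range(3):
    exponential.append(exponential[-1] * 3)  # 1, 3, 9, 27

print(linear, exponential)
```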

Vectors are quantities which explain the position of a point in space relative to another, defined by directional co-ordinates.

A matrix is an array of numbers that can be arranged into a tabular format with rows and columns. You can add or subtract matrices of the same size and structure and you can also transpose matrices to switch the row and column orientation.
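Both operations can be sketched in plain Python with lists of lists (the function names are just for illustration):

```python
def transpose(matrix):
    """Switch the row and column orientation of a matrix."""
    return [list(row) for row in zip(*matrix)]

def add_matrices(a, b):
    """Element-wise addition; both matrices must have the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

m = [[1, 2, 3],
     [4, 5, 6]]
print(transpose(m))        # [[1, 4], [2, 5], [3, 6]]
print(add_matrices(m, m))  # [[2, 4, 6], [8, 10, 12]]
```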

Qualitative variables are categorical items, whereas quantitative variables have a numerical value associated.

**Qualitative example:** Blood group

**Quantitative example:** Temperature

Quantitative variables can be split into discrete (counted to an exact figure) or continuous, which can’t be measured precisely so need to be rounded.

**Discrete example:** Number of children in a family

**Continuous example:** A length measured to the nearest cm

A random variable is a variable that takes on numerical values as the outcome of a chance experiment. Discrete random variables can only take a specific set of numerical values.

Class intervals can be used to categorise continuous quantitative variables e.g. lengths grouped at 10cm intervals 0-10cm, 10-20cm, etc. This effectively turns them into a qualitative variable, making them easier to summarise on your reports.
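Assigning a measurement to its class interval is a simple floor calculation. A sketch in Python (the function name and the 10cm width are just for illustration):

```python
def class_interval(length_cm, width=10):
    """Label a continuous length with its class interval, e.g. '10-20cm'."""
    lower = (length_cm // width) * width   # floor to the interval boundary
    return f"{lower}-{lower + width}cm"

lengths = [3, 12, 19, 27]
print([class_interval(x) for x in lengths])
# ['0-10cm', '10-20cm', '10-20cm', '20-30cm']
```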

Primary data is directly collected for the experiment, whereas secondary data comes from an external source.

Qualitative variables can be summarised by scales as well as class intervals.

**Nominal scales** are unordered scales where the category names follow no logical order e.g. gender:

– Male

– Female

**Ordinal scales** are scales with a logical order e.g. survey responses of:

– Strongly Agree

– Agree

– Neutral

– Disagree

– Strongly Disagree

Univariate data measures a single variable. Bivariate data measures two variables for the same population, allowing you to look for a relationship between them. The relationship between bivariate data can be explored effectively using scatter plot charts.

**Univariate example:** Attendance figures for a football team

**Bivariate example:** Attendance figures for a football team compared with matchday beer sale figures

The five-number summary of a variable consists of its minimum value, the first quartile (Q1), the median (Q2), the third quartile (Q3) and its maximum value.
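Python's statistics module (3.8+) can produce the quartiles directly; note that statistics.quantiles defaults to the 'exclusive' interpolation method, so values may differ slightly from Excel's QUARTILE.INC:

```python
import statistics

data = [1, 3, 4, 7, 8, 9, 12, 13, 18]
q1, q2, q3 = statistics.quantiles(data, n=4)   # the three quartile cut points
five_number = [min(data), q1, q2, q3, max(data)]
print(five_number)   # [min, Q1, median, Q3, max]
```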

Correlation is a way of measuring the relationship between two variables. You can use the CORREL function in Excel to return the correlation coefficient between two variables.

A value of +1 indicates a perfect positive correlation meaning an increase in one is associated with an increase in the other. -1 would be a perfect negative correlation, where an increase in one field is associated with a decrease in the other. However it is important to remember correlation is not necessarily causation.
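The coefficient CORREL returns is Pearson's r; a minimal sketch of the underlying calculation (the function name and data are invented for illustration):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # ≈ 1.0, perfect positive
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # ≈ -1.0, perfect negative
```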

In an ideal world, you have data for the full population and can work with the overall distribution however this is rarely the case. Usually you only have a sample of the data to work from, so you use sample statistics such as the mean and standard deviation to approximate the parameters of the full population. The larger the sample, the more accurate your conclusions are going to be.

A common sampling method is to take multiple random samples, with each sample having its own sample mean that you record to form a sampling distribution. With enough samples the sampling distribution takes on a normal shape regardless of the overall population distribution because of the central limit theorem.
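A quick simulation illustrates the effect: even with a heavily skewed population, the means of repeated random samples centre on the population mean (a sketch; the seed and sizes are arbitrary):

```python
import random
import statistics

random.seed(42)
# A heavily right-skewed population (exponential-ish values).
population = [random.expovariate(1.0) for _ in range(10_000)]

# Take many random samples and record each sample's mean.
sample_means = [
    statistics.mean(random.sample(population, 50)) for _ in range(1_000)
]

# The sample means cluster tightly around the population mean even though
# the population itself is skewed: the central limit theorem at work.
print(round(statistics.mean(population), 3))
print(round(statistics.mean(sample_means), 3))
```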

A sampling technique which splits the population into specific categories, or clusters. Every individual in the sample must be assigned to one of the clusters, but the population of each cluster can vary. A random sample of each cluster is then selected.

**Example:** A health questionnaire splitting the sample by how frequently they go to the gym – Regularly, Occasionally or Never.

A technique frequently used for market research, in which specific quotas are set for the make-up of the sample.

**Example:** Ensuring 20 men and 20 women make up the sample of 40.

Also referred to as opportunity sampling, a sample made up of the easiest people to reach. This method risks failing to produce a truly representative sample of the population.

**Example:** A company polling customers who are already subscribed on their mailing list.

The full collection of individuals or items under consideration in your statistical study. A finite population can be physically listed, e.g. a list of the books in a library. A hypothetical population is an assumed future population based on inferences made from the existing population.

An unknown numerical summary of a population.

The extent to which an event is likely to occur.

Where the probability of an event depends on the probability of another event beforehand.

In an independent event, the probability is not affected by any previous events. E.g. rolling a die. When the probability of an event is influenced by prior events this is known as a dependent event. E.g. the probability of choosing a face card from a deck of cards changes based on the cards already drawn.

Mutually exclusive events cannot occur at the same time e.g. a number can be either odd or even, never both. Mutually inclusive events can occur at the same time. For example a number can be both even and less than 10. The probability calculation therefore needs to take into account both possibilities. Venn diagrams are useful visuals to express this.
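For mutually inclusive events the addition rule subtracts the overlap: P(A or B) = P(A) + P(B) - P(A and B). A sketch using the even/under-10 example on the numbers 1 to 20:

```python
outcomes = range(1, 21)
even = {n for n in outcomes if n % 2 == 0}       # event A: 10 outcomes
under_10 = {n for n in outcomes if n < 10}       # event B: 9 outcomes

# Counting the overlap (2, 4, 6, 8) only once avoids double counting.
p_either = (len(even) + len(under_10) - len(even & under_10)) / 20
print(p_either)                                  # 0.75

# Equivalent to taking the probability of the union directly.
assert p_either == len(even | under_10) / 20
```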

A table listing all categories (or classes) and their frequencies.

The frequency of a class expressed as a proportion of the total frequency of the sample.

A table listing all classes and their relative frequencies, the total of which will equal 1 (100%).
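A relative frequency table is straightforward to build with collections.Counter (the survey responses below are invented for illustration):

```python
from collections import Counter

responses = ["Agree", "Agree", "Neutral", "Disagree", "Agree", "Neutral"]
freq = Counter(responses)                    # frequency of each class
total = len(responses)
rel_freq = {category: count / total for category, count in freq.items()}
print(rel_freq)

# Relative frequencies always sum to 1, i.e. 100% (allowing for
# floating-point rounding).
print(sum(rel_freq.values()))
```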

The frequency distribution where data is only available for a sample of the full population; the larger the sample, the more closely it is likely to match the population's frequency distribution.

Standard types of distribution are the normal distribution, binomial distributions and exponential distributions. Binomial distributions deal with discrete data whereas normal and exponential distributions deal with continuous data.

A variable graphically described with a bell-shaped density curve.

**Example:** A meal's calorie content may be normally distributed around a mean of 200 calories, with a standard deviation of 5 calories. In a normal distribution roughly 68% of observations fall within one standard deviation of the mean.

Binomial experiments involve only two choices and their distributions involve a discrete number of trials of these two outcomes. For example, the flipping of a coin. Therefore a binomial distribution is a probability distribution of the successful trials in the experiment.
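The probability of exactly k successes in n trials follows from the binomial formula C(n, k) · p^k · (1-p)^(n-k). A sketch using math.comb (Python 3.8+; the function name is just for illustration):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 5 heads in 10 fair coin flips:
# C(10, 5) * 0.5^10 = 252 / 1024
print(round(binomial_pmf(5, 10, 0.5), 4))   # 0.2461
```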

Exponential distributions deal with continuous data on a scale e.g. you may measure travel times between places by a scale of minutes.

A type of probability distribution that resembles the normal distribution but has an additional parameter known as “degrees of freedom”. The fewer the degrees of freedom, the heavier the tails; as the degrees of freedom increase, the T distribution approaches the standard normal curve (mean 0, standard deviation 1).

The degrees of freedom are roughly equal to the number of observations in the test (typically the sample size minus one). The more degrees of freedom, the more confident we can be that the results resemble the true full population distribution.

A T statistic is the ratio of the observed coefficient to the standard error, which can be evaluated against the T distribution appropriate for the size of the data sample.

With a large enough T statistic we can reject the null hypothesis at some level of statistical significance. The fewer the degrees of freedom and therefore the fatter the tails of the relevant T distribution, the higher the T statistic will need to be in order for us to reject the null hypothesis.

A probability distribution which gives the probability of a given number of events occurring in a fixed timeframe, based on a known average rate.

**Example:** The probabilities of the likely number of goals in a football match, based on averages taken from recent results.
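The distribution described here matches the Poisson distribution, whose probability mass function is λ^k · e^(−λ) / k!. A sketch for a team averaging 1.5 goals per match (the average is invented; the function name is just for illustration):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam ** k * exp(-lam) / factorial(k)

# If a team averages 1.5 goals per match, the probability of
# exactly 2 goals in the next match:
print(round(poisson_pmf(2, 1.5), 4))
```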

A measurement of the symmetry of the probability distribution of a random variable.

A distribution is skewed if one end of its tail is longer than the other.

A positive skew is displayed if the majority of values are on the left of the distribution with a long tail on the right, also known as a right-skewed distribution because the outlier values are on the right. An example is the amount of rainfall per day: most days have no rainfall at all, but a few outlier days have large amounts.

A negative skew occurs if the longer end is on the right, with values mainly at the higher end of the scale. A normal distribution has no skew at all, with skewness = 0.

When the skewness is low the mean and median will not be very far apart. When measuring central tendency, any skew above 1 or under -1 suggests the data is too skewed for the mean to be the best measurement and instead the median is a better indicator of typical value.

The SKEW function can be used to measure skewness in Excel.
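Excel's SKEW uses the adjusted Fisher-Pearson formula, which can be sketched in Python (the function name and rainfall figures are invented for illustration):

```python
import statistics

def sample_skewness(data):
    """Sample skewness via the adjusted Fisher-Pearson formula,
    the same formula Excel's SKEW function documents."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)   # sample standard deviation
    return (n / ((n - 1) * (n - 2))) * sum(((x - mean) / s) ** 3 for x in data)

# Right-skewed data: most values low, a few large outliers on the right.
rainfall = [0, 0, 0, 1, 1, 2, 2, 3, 10, 15]
print(round(sample_skewness(rainfall), 2))   # positive, i.e. right-skewed
```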

The sharpness of the peak of a frequency distribution curve. Kurtosis helps describe the shape of a probability distribution of a random variable, measuring the “tailedness” of the data. There are different interpretations of how to measure kurtosis from a population but the purpose is to understand whether the distribution is tall and narrow or short and flat.

There are numerous types of bias which can inadvertently influence the results of a statistical test:

– Cognitive bias: Biases which may stem from emotional or moral motivations or from social influences which deviate from rationality in judgement.

– Confirmation bias: The tendency to search for and interpret information in a way which confirms your pre-existing beliefs or hypotheses.

– Observer bias: When a researcher subconsciously influences participants or results to match their expectations.

– Recall bias: When the respondent hasn’t remembered things correctly.

– Recency bias: When an event that has happened most recently is disproportionately over-represented in the results.

– Selection bias: Accidentally working with a subset of your audience when you believe you have a representative sample.

– Sponsorship bias: The tendency of a scientific study to support the interests of the people or organisations funding the research.

– Survivorship bias: The error of concentrating an experiment on observations which made it past some selection process and overlooking those that didn’t, because they are no longer visible.

A theorem in probability stating that, under certain conditions, the mean of a large number of observations will be approximately normally distributed.

A formula for calculating correlation between two variables where a result of either 1 or -1 demonstrates perfect correlation, and the nearer the value to 0 the weaker the correlation.

An approach to statistical inference using calculations that result in “degrees of belief”, otherwise known as Bayesian probabilities. The probability of various outcomes are calculated based on inferences made from the available data.

A method of estimating observation sizes by targeting several small specific areas to take a count and then making assumptions about the rest of the population based on the findings.

**Example:** For measuring the number of fish in a lake (often populations of a type of species).

Pr(A) represents the probability of event ‘A’ happening e.g. on the toss of a coin Pr(Heads)=0.5.

X-bar (x̄) is the symbol used to represent the sample mean, which is used as an estimate of the population mean.


In this example we have finishing positions for Leicester City each season since the Second World War.

Highlight your dataset, go to Insert – Charts – See All Charts. Then in the All Charts tab, Histogram is one of your options.

By default you will get some automatically set bins. These are the preset ranges for each of the bars. In this case they are not particularly useful: the first bar counts the number of seasons Leicester finished between 1st and 9.8th, but we only want to deal in whole numbers.

Therefore we need to right-click on the axis and go to Format Axis. Here you can set the Bin width to a whole number and either define the number of bins, or, as in this example, simply define that 45 is the maximum number we’re looking for and 1 is the minimum.

This reformatted version is more informative. You can see a relatively normal distribution and that a standard finishing position for Leicester would be around the 17th to 25th mark.

Note: This Histogram chart feature is not available in earlier versions of Excel. If you’re using 2013 or earlier, you may have to create your own histogram by adding a bar chart and then changing the gap width to zero.


To ensure one connection is refreshed at a time you should make your VBA code loop through each connection, one by one and in the order you want. You can name each individual pivot cache, or even each pivot table, but then you have to maintain the code whenever there are changes. The following module is reusable because it first looks for **every** connection in the file and refreshes them individually, before looking for **every** pivot table and refreshing those individually too.

```vba
Sub RefreshAll()
    Dim wc As WorkbookConnection
    Dim pc As PivotCache

    ' Refresh each connection in turn, waiting for each query to finish
    For Each wc In ThisWorkbook.Connections
        ' Disabling background refresh forces the query to complete before
        ' moving on (use wc.OLEDBConnection instead for OLE DB connections)
        wc.ODBCConnection.BackgroundQuery = False
        wc.Refresh
    Next wc

    ' Then refresh every pivot cache, which refreshes its pivot tables
    For Each pc In ThisWorkbook.PivotCaches
        pc.Refresh
    Next pc
End Sub
```

You can do this in VBA without having to specify each individual value in the code. In this example we have a list of existing values along with the new value we want to replace them with.

Select your data and hit Ctrl + T to turn it into an Excel table. Name the table **TableFindReplace**.

We can then apply this code which will check one-by-one for all instances of the values in the Existing column to replace them with the value in the New column. E.g. all instances of Man C to Manchester City, all instances of Man U to Manchester United, etc.

NB: You will need to adjust “SheetName” with your worksheet name and “TableData” with the table name containing your data.

```vba
Sub MultiFindReplace()
    Dim FindReplaceList As Range
    Dim Results As Range
    Dim cell As Range

    ' The two-column lookup table: Existing values and their New replacements
    Set FindReplaceList = Sheets("SheetName").ListObjects("TableFindReplace").DataBodyRange
    ' The data range in which the replacements will be made
    Set Results = Sheets("SheetName").ListObjects("TableData").DataBodyRange

    ' For each value in the Existing column, replace it throughout the data
    ' with the value in the New column (one cell to the right)
    For Each cell In FindReplaceList.Columns(1).Cells
        Results.Replace What:=cell.Value, Replacement:=cell.Offset(0, 1).Value
    Next cell
End Sub
```