DATA MINING
Data mining
The technologies and techniques used to collect, store, process, transform and analyse raw data so that it becomes useful for gaining insights.
Knowledge Discovery
KDD stands for Knowledge Discovery in Databases, a formalised process for creating knowledge from structured and unstructured sources.
There are five steps:
- Selection
- Preprocessing
- Transformation
- Data Mining
- Interpretation / Evaluation
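As a rough illustration, the five steps can be sketched as a chain of small functions. The records, field names and the choice of "mean reading per city" as the mining step are all hypothetical, chosen only to make each stage visible. A real project would use far richer techniques at every stage.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing reading.
RAW = [
    {"id": 1, "city": "Leeds", "temp_c": 12.0},
    {"id": 2, "city": "Leeds", "temp_c": None},
    {"id": 3, "city": "York", "temp_c": 15.0},
    {"id": 4, "city": "York", "temp_c": 17.0},
    {"id": 5, "city": "Leeds", "temp_c": 14.0},
]


def select(records):
    """Selection: keep only the fields relevant to the question."""
    return [(r["city"], r["temp_c"]) for r in records]


def preprocess(rows):
    """Preprocessing: drop rows with missing values."""
    return [(city, temp) for city, temp in rows if temp is not None]


def transform(rows):
    """Transformation: group the readings by city."""
    grouped = {}
    for city, temp in rows:
        grouped.setdefault(city, []).append(temp)
    return grouped


def mine(grouped):
    """Data mining: here, simply the mean reading per city."""
    return {city: mean(temps) for city, temps in grouped.items()}


def interpret(pattern):
    """Interpretation / evaluation: report the city with the highest mean."""
    return max(pattern, key=pattern.get)


pattern = mine(transform(preprocess(select(RAW))))
warmest = interpret(pattern)
print(pattern, warmest)
```

Keeping each step as a separate function mirrors the way the process separates concerns: any one stage can be replaced without touching the others.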
CRISP-DM (CRoss-Industry Standard Process for Data Mining) is another framework that formalises the knowledge discovery process, this time with six steps:
- Business Understanding (Identify project objectives)
- Data Understanding (Collect and review data)
- Data Preparation (Select and cleanse data)
- Modelling (Manipulate data and draw conclusions)
- Evaluation (Evaluate model and conclusions)
- Deployment (Apply conclusions to business)
VISUALISATIONS
Histogram
Similar in appearance to a vertical bar chart, except that adjacent bars touch each other because the horizontal axis is a continuous scale. Histograms are built from grouped data and display the frequency or relative frequency (percentage) of each class in a sample.
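The grouping behind a histogram can be sketched in a few lines of Python. The scores and the class width of 10 are hypothetical, and the classes are assumed to run 0-9, 10-19 and so on. The text chart is printed sideways purely for convenience.

```python
from collections import Counter

# Hypothetical sample of scores.
scores = [3, 7, 12, 15, 18, 21, 22, 25, 29, 34, 38, 41]
CLASS_WIDTH = 10  # assumed width of each class


def class_frequencies(values, width):
    """Count how many values fall in each class, keyed by the class's lower limit."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


freq = class_frequencies(scores, CLASS_WIDTH)
rel_freq = {start: count / len(scores) for start, count in freq.items()}

# One row per class: label, a bar of '#' marks, and the relative frequency.
for start, count in freq.items():
    label = f"{start:>2}-{start + CLASS_WIDTH - 1:<2}"
    print(f"{label} | {'#' * count} ({rel_freq[start]:.0%})")
```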
Scattergrams
A method of displaying the relationship (correlation) between two variables by plotting each observation as a point. A line of best fit is often added to show the overall trend and how far each observation deviates from it.
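A line of best fit is commonly obtained by ordinary least squares. The sketch below assumes that method and uses hypothetical data points. It computes the slope and intercept, the correlation coefficient, and each observation's vertical distance from the fitted line (its residual).

```python
from math import sqrt
from statistics import mean

# Hypothetical paired observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def best_fit(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


slope, intercept, r = best_fit(xs, ys)

# Residual: how far each observation sits above or below the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept, r, residuals)
```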
Frequency polygon
A line chart plotted at the mid-point of each class of grouped data, e.g. classes of 0-10, 11-20, etc.
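The points that the line passes through can be worked out directly from the class limits. This is a minimal sketch with hypothetical frequencies. The mid-point is taken as the average of the lower and upper class limits, which is one common convention.

```python
# Hypothetical grouped data: (lower limit, upper limit, frequency).
classes = [(0, 10, 4), (11, 20, 9), (21, 30, 6), (31, 40, 2)]


def polygon_points(classes):
    """Return one (mid-point, frequency) pair per class."""
    return [((lower + upper) / 2, freq) for lower, upper, freq in classes]


points = polygon_points(classes)
print(points)
```

Joining these points in order with straight lines gives the frequency polygon.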
Venn diagram
Presented as two or more overlapping circles, each representing a group (set), to demonstrate the relationships between the groups.
Example: Animals with two legs and animals that can fly. Some animals would appear in only one group, and some would fall in the overlap between both groups.
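The regions of a two-circle Venn diagram correspond to set operations, which Python's built-in set type supports directly. The animals listed are illustrative choices only.

```python
# Illustrative membership of the two groups.
two_legs = {"human", "ostrich", "penguin", "sparrow", "eagle"}
can_fly = {"sparrow", "eagle", "bee", "dragonfly"}

both = two_legs & can_fly           # the overlapping region
only_two_legs = two_legs - can_fly  # left circle, outside the overlap
only_fly = can_fly - two_legs       # right circle, outside the overlap
either = two_legs | can_fly         # everything inside either circle

print(sorted(both), sorted(only_two_legs), sorted(only_fly))
```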
Tree diagram
A branching diagram which lists all possible outcomes of an event.
Example: The first branch could be Europe, the second level of branches could split out Germany, France and Spain, and the third level could split out the various cities in those countries.
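A tree like the one in the example can be represented as nested dictionaries, and every route from the root to a leaf can then be listed. The cities chosen here are illustrative, and the nested-dictionary representation is just one convenient assumption.

```python
# Each key is a branch label; its value holds the branches beneath it.
tree = {
    "Europe": {
        "Germany": {"Berlin": {}, "Munich": {}},
        "France": {"Paris": {}, "Lyon": {}},
        "Spain": {"Madrid": {}, "Barcelona": {}},
    }
}


def paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of branch labels."""
    for label, children in node.items():
        path = prefix + (label,)
        if children:
            yield from paths(children, path)
        else:
            yield path


all_paths = list(paths(tree))
for path in all_paths:
    print(" > ".join(path))
```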
Box plots
A one-dimensional graph of numerical data based on the five-number summary: minimum, lower quartile, median, upper quartile and maximum.
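The five-number summary consists of the minimum, lower quartile, median, upper quartile and maximum. The sketch below computes it with Python's standard `statistics` module on a hypothetical sample. Several quartile conventions exist, and the "inclusive" method used here is only one of them, so other tools may report slightly different quartiles for the same data.

```python
from statistics import quantiles

# Hypothetical sample.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12]


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


summary = five_number_summary(data)
iqr = summary[3] - summary[1]  # interquartile range: the length of the box
print(summary, iqr)
```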
Stem and leaf plots
A visualisation that organises numerical data into categories based on place value. Because every individual value is retained, it contains more detail than a standard histogram. The stem, in the left-hand column, contains the digits in the largest place, and the leaf, in the right-hand column, contains the digit in the smallest place.
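For two-digit whole numbers, the split into stem and leaf is simply a division by ten. This is a minimal sketch with hypothetical values. It assumes non-negative integers with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

# Hypothetical sample of two-digit values.
values = [24, 12, 37, 15, 21, 45, 30, 24, 38]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (units digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(values)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```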
SOFTWARE
Data visualisation software
There are numerous programs for creating data visualisations and dashboard reports, some of the most popular being:
- Microsoft Power BI
- Qlik
- Tableau
- Spotfire
Other products which specialise in infographics, animations and other visualisations include:
- Google Sites
- Illustrator
- Unity
Microsoft Azure Machine Learning Studio
A drag-and-drop tool with a graphical user interface for building, testing and deploying predictive analytics solutions on your data.
SAS Enterprise Miner
A solution for creating accurate predictive and descriptive data models using data mining and statistical techniques such as linear regression, clustering and classification (decision trees).
Jupyter Notebooks
Jupyter Notebooks are used to explore datasets through an interactive browser-based environment in which you can add notes and run code to manipulate and visualise data. They support languages regularly used by data scientists such as R and Python.
PROGRAMMING LANGUAGES
Object-oriented vs Procedural programming
Object-oriented programming is based on the concept of objects: structured data, organised in fields (attributes), bundled together with the operations (methods) that can be applied to it. Procedural programming, a form of imperative programming, focuses on explicit sequences of instructions to carry out a task.
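The contrast can be sketched by writing the same small task, averaging a list of readings, in both styles. The function, class and variable names are hypothetical and chosen only for illustration.

```python
# Procedural style: an explicit sequence of steps applied to plain data.
def mean_procedural(values):
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


# Object-oriented style: the data and its operations live together in one object.
class Sample:
    def __init__(self, values):
        self.values = list(values)  # the object keeps its own copy of the data

    def add(self, value):
        self.values.append(value)

    def mean(self):
        return sum(self.values) / len(self.values)


readings = [2.0, 4.0, 9.0]

procedural_result = mean_procedural(readings)

sample = Sample(readings)
object_result = sample.mean()
sample.add(1.0)  # the object manages changes to its own state
updated_result = sample.mean()

print(procedural_result, object_result, updated_result)
```

In the procedural version the data is passed in on each call, while in the object-oriented version the `Sample` object carries its own state between calls.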
R
R is an open source programming language and software environment widely used for statistical analysis, testing and modelling.
RStudio is a popular integrated development environment (IDE) for R, providing a graphical user interface that makes statistical analysis and modelling more convenient. For data visualisations using R, a popular package is ggplot2.
Python
Python is another open-source language used for detailed statistical analysis, testing and modelling. It supports object-oriented programming, among other styles, and is often used for building reusable code.
Popular Python packages for data science include:
- NumPy (Numerical Python, for performing calculations over entire arrays)
- Matplotlib (for data visualisations)
- SciPy (for scientific and technical computing)
- Scikit-Learn (for machine learning features like regression and clustering; it works with NumPy arrays and provides numerous functions for evaluating classification, clustering and regression models)
- Pandas (for data manipulation and analysis using data frames)
JavaScript
A language commonly used in web development. JavaScript is used by numerous data visualisation packages.
DATA SCIENCE TERMINOLOGY
Full stack
The term ‘full stack’ in data science generally refers to someone with ability and experience in all areas of data science:
- Machine Learning
- Big Data
- ETL (Extract, Transform, Load)
- Analysis techniques (Regression, Classification, etc.)
- Statistics
- Programming (usually Python or R)
- Data modelling
- Data visualisation and presentation