Supervised vs Unsupervised learning
In supervised learning, the algorithm is given the dependent variable (the label or target) together with the independent variables (the features), and it learns to predict the former from the latter. Examples: classification and regression. When training a supervised learning model, it is standard practice to split the data into a training dataset and a test dataset, so that the held-out test data can be used to evaluate the trained model on examples it has not seen.
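As a minimal sketch of that workflow, the Python below fits a straight line to a few training rows and then scores it on rows held back for testing. The data, the function names (fit_line, mean_absolute_error) and the 4/2 split are illustrative assumptions, not taken from any particular library.

```python
def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept to (x, y) rows."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    variance = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, rows):
    """Average absolute gap between predictions and the known targets."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in rows) / len(rows)


# Made-up data: x is the independent variable, y the dependent variable.
train_rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
test_rows = [(5.0, 11.0), (6.0, 13.0)]

model = fit_line(train_rows)                        # learn from training rows only
test_error = mean_absolute_error(model, test_rows)  # score on unseen rows
print(model, test_error)
```

The test rows never reach fit_line, so test_error measures how well the fitted line generalises rather than how well it reproduces what it was shown.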
Unsupervised learning differs because the algorithm is given only the independent variables. With no target to predict and no direction about what to look for, it searches for structure in the data, such as groups of similar records or relationships between variables. Examples: clustering and segmentation.
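To make the clustering case concrete, here is a rough sketch of k-means, one common clustering algorithm, written in plain Python. The customer figures (visits per month, average spend) are invented for illustration, and seeding the centroids with two hand-picked points is a simplifying assumption; practical implementations usually choose starting centroids more carefully.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Group points around the given starting centroids; no labels are supplied."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignments, centroids


# Hypothetical customers: (visits per month, average spend).
customers = [
    (1.0, 20.0), (2.0, 25.0), (1.5, 22.0),
    (9.0, 200.0), (10.0, 210.0), (8.5, 190.0),
]

labels, centres = k_means(customers, centroids=[customers[0], customers[3]])
print(labels)  # cluster index for each customer
```

Nothing in the input says which customers belong together; the grouping comes only from the distances between the rows.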
If you want to separate your customers who pay by card from your customers who pay with cash, that’s supervised machine learning, because you’re telling the machine what to look for. If instead you give the computer all your data and measures and ask it to segment the customer base by highlighting any interesting patterns, that’s unsupervised machine learning.
Information leakage can occur when you fail to split the data before training a supervised machine learning algorithm, so that the data used to evaluate the model has already been seen during training. The model appears to be getting more accurate, but it is getting better at reproducing the training data rather than at predicting for the actual population.
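The effect can be illustrated with a model that simply memorises its training rows. The sketch below uses a one-nearest-neighbour classifier on a tiny, hand-made and deliberately noisy dataset (a hypothetical "basket value" feature with card/cash labels); the numbers are assumptions chosen to make the gap visible, not real results.

```python
def predict_1nn(train, x):
    """Return the label of the training row whose feature is closest to x."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]


def accuracy(train, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict_1nn(train, x) == label for x, label in rows)
    return hits / len(rows)


# Hypothetical rows: (basket value, payment method).
train = [
    (1.0, "card"), (2.0, "cash"), (3.0, "card"),
    (8.0, "cash"), (9.0, "cash"), (10.0, "card"),
]
held_out = [(1.4, "card"), (2.6, "cash"), (8.6, "card"), (9.8, "card")]

# Scoring on the rows the model was trained on: every row finds itself.
train_accuracy = accuracy(train, train)
# Scoring on rows the model has never seen gives a more honest figure.
held_out_accuracy = accuracy(train, held_out)
print(train_accuracy, held_out_accuracy)
```

Here the score on the training rows is perfect only because each row is its own nearest neighbour, which says nothing about how the model handles new customers.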
One way to split the data for a supervised learning algorithm is random sampling, which avoids the bias that a non-random split can introduce, for example when the rows are stored in a sorted order.
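A random split can be sketched in a few lines of standard-library Python. The function name train_test_split, the default 20% test fraction and the seed argument are assumptions made for this example; libraries such as scikit-learn provide their own, more complete helpers.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then cut it into (train, test) lists."""
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten data rows, stored in sorted order
train_rows, test_rows = train_test_split(rows, test_fraction=0.3, seed=42)
print(len(train_rows), len(test_rows))
```

Because the rows are shuffled before the cut, the test set is not simply the first or last block of the file, and every row ends up in exactly one of the two sets.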