Data Mining
The purpose of data mining
Generally speaking, data mining is a process of discovering patterns or relationships in a set of data.
Data mining tasks include simplifying data by reducing noise or dimensionality, and by finding important variables and combinations of variables. Common techniques include clustering, classification, association analysis, anomaly detection, forecasting and score prediction. Two more specialised techniques are sequence mining and text mining.
In this article, we will go through each of the techniques to explain their purposes and algorithms used to perform these tasks.
Approaches
When it comes to algorithms, there are two broad approaches: classical statistics and machine learning.
Classical statistical methods are typically quite transparent: they use familiar statistics that can, in principle, be calculated by hand. Machine learning methods, by contrast, are more complex and opaque, and often require substantial computing power.
Most common algorithms in each approach
Classical Statistics Methods
- Linear Regression
- K-Means Clustering
- K-Nearest Neighbours
- Hierarchical Clustering
Machine Learning Methods
- Hidden Markov Models
- Support Vector Regression
- Random Forests
- LASSO Regression
- Other — Apriori algorithm, word counts, etc.
Techniques
Data Reduction
Sometimes we find ourselves in situations where data reduction is relevant: practical constraints such as time, storage or memory; the need to reduce noise and focus on patterns; or simply because a smaller dataset is easier to interpret. With data reduction we can avoid overfitting, reduce multicollinearity and increase the degrees of freedom.
Algorithms for data reduction:
- Linear — the most common technique for this purpose is PCA (a minimal sketch follows this list)
- Non-linear — Kernel PCA, Isomap, Locally Linear Embedding, Maximum Variance Unfolding
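As a quick illustration of the linear option, here is a minimal PCA sketch using scikit-learn; the random data and the choice of two components are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # toy data: 200 observations, 10 variables

pca = PCA(n_components=2)             # keep the two directions with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # share of variance retained by each component
```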
Clustering
Clustering is the process of gathering data points into groups based on similar characteristics. Remember that the groups are groups of convenience as the criteria for clustering them might differ for various business needs or situations.
Depending on the situation, the methods used for clustering include the following (a sketch of the centroid-based option follows the list):
- distance between points (only for convex clusters or small dataset)
- distance from the centroid (must choose k — number of groups, centroids, convex clusters with similar size)
- density (non-convex, different size, can ignore outliers, hard to interpret)
- distribution models (prone to overfitting, good for correlations, convex)
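To make the centroid-based option concrete, here is a minimal k-means sketch with scikit-learn; the synthetic blobs and the choice of k = 3 are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data: 300 points drawn around three centres
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # k must be chosen up front
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster assignment for the first ten points
print(kmeans.cluster_centers_)  # the three centroids
```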
Classification
Quite similar to clustering, classification tries to put things into the right bucket. Since classification can be performed on structured as well as unstructured data, choosing an algorithm can get a bit confusing.
An easy way to solve this problem is to ask yourself who your audience or user is. If you need to explain to a person how things work, go with transparent methods like Decision Trees or Naive Bayes. If, instead, you have a black-box setting where the algorithm directly controls the outcome and accuracy is paramount, then opaque methods like Support Vector Machines and Artificial Neural Networks are acceptable.
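As an example of the transparent route, here is a minimal decision-tree sketch with scikit-learn; the iris data, the depth limit and the train/test split are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))  # accuracy on held-out data
print(export_text(tree))           # the decision rules, readable by a person
```

The printed rules are exactly what makes this kind of model easy to explain to a non-technical audience.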
Anomaly Detection
Anomalies are defined as unusual events: things that are not supposed to happen. They usually signify a problem such as a broken system, fraud or disease.
These outliers can be spotted on a single variable (univariate), on a combination of scores (bivariate, multivariate), or even in categorical data, where a category contains less than 10% of the observations.
Outliers can affect your analysis in several ways: distorted statistics, distorted relationships, misleading results and failure to generalise.
What to do with outliers:
1. Remove — only if they are few in number, and you need to explain why
2. Transform — apply a logarithm or square transform to make the distribution more symmetrical
3. Robust — use measures that are not strongly influenced by outliers (a z-score sketch of all three options follows this list)
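Here is a minimal sketch of the three options above, using a simple z-score rule; the toy data and the cut-off of 2 are assumptions for illustration only.

```python
import numpy as np

x = np.array([2.1, 2.3, 1.9, 2.0, 2.2, 9.5])  # one obvious outlier
z = (x - x.mean()) / x.std()                  # standardised scores

# 1. Remove: drop points whose |z| exceeds the chosen cut-off
print(x[np.abs(z) < 2])

# 2. Transform: a log transform pulls in the long right tail
print(np.log(x).round(2))

# 3. Robust: the median is barely moved by the outlier, unlike the mean
print(x.mean(), np.median(x))
```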
Association Analysis
Association analysis discovers relationships among items in a dataset: it learns which items occur together in order to find rules or patterns. Algorithms for this type of analysis include Apriori, Eclat, FP-growth, RElim, SaM and JIM.
Approaches
Finding item sets — combinations of items that appear together; calculate the “support” for each combination of items.
supp(X) = freq(X) / T
freq(X) — the number of times X appears in the data. T — the total number of transactions. The support of X is the proportion of transactions T that contain the item set X.
Rule generation — if/then statements; calculate the “confidence”, or conditional probability: if X occurs, then what is the (conditional) probability of Y?
conf(X ⇒ Y) = supp(X ∪ Y) / supp(X)
Other approaches — collective strength, conviction, leverage, lift
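Here is a toy computation of the support and confidence formulas above, written directly from their definitions; the transactions are made up for illustration.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
T = len(transactions)

def supp(itemset):
    """Proportion of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / T

def conf(X, Y):
    """Conditional probability of Y given X: supp(X ∪ Y) / supp(X)."""
    return supp(X | Y) / supp(X)

print(supp({"bread", "milk"}))    # 2 of 4 transactions -> 0.5
print(conf({"bread"}, {"milk"}))  # 0.5 / 0.75 ≈ 0.67
```

A rule X ⇒ Y is usually reported only when both its support and its confidence exceed thresholds chosen for the problem at hand.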
Algorithms
1. Apriori (a rough sketch follows after this list)
- calculate support for single-item item sets
- if supp < minsupp, then remove the item set
- then expand to two-item item sets, and so on
- challenges — breadth-first search (BFS), many passes through the data, computation grows exponentially
2. Eclat
- calculate support for single-item item sets
- if supp > minsupp, then add item
- if supp < minsupp, then move to the next single item
- challenges — depth-first search (DFS), fast but difficult to manipulate
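As a rough sketch of the level-wise idea behind Apriori, the snippet below counts support one item-set size at a time; the transactions and minsupp are assumptions, and a real implementation would generate and prune candidates far more carefully than this brute-force version.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]
minsupp = 0.5

def supp(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = {i for t in transactions for i in t}

# level 1: single-item item sets that survive the support threshold
frequent = [frozenset([i]) for i in items if supp({i}) >= minsupp]

# expand level by level until no item set survives
k = 2
while True:
    frequent_k = [frozenset(c) for c in combinations(items, k) if supp(set(c)) >= minsupp]
    if not frequent_k:
        break
    frequent.extend(frequent_k)
    k += 1

print(frequent)
```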
Regression Analysis
- using many variables to predict one
- classical methods — based on means and standard deviations; entry schemes include simultaneous entry (everything in together), blocked entry (group by group) and stepwise entry (the most highly correlated variable first, then the next, and so forth); prone to overfitting; nonlinear variants also exist
- modern methods — choose between predictors using measures such as correlation and distance; LASSO, LARS (Least Angle Regression), RFE (Recursive Feature Elimination), SVR (Support Vector Regression) (a minimal LASSO sketch follows this list)
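To show how one of the modern methods chooses between predictors, here is a minimal LASSO sketch with scikit-learn; the diabetes dataset and the regularisation strength alpha = 0.1 are illustrative assumptions.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# LASSO shrinks some coefficients all the way to zero;
# the surviving predictors are the ones it has "chosen"
print(lasso.coef_)
```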
Sequence Mining
Sequence mining looks for chains of events that repeat: temporal (ordinal) associations where the order of events matters. This technique is used in genetic sequencing, recommendation engines, marketing offers and behavioural “state switching”.
Algorithms
1. GSP: Generalized Sequential Patterns; similar to Apriori association analysis but observes order; uses a sliding window for simultaneity (a toy counting sketch follows this list)
2. SPADE: Sequential Pattern Discovery using Equivalence classes; faster than GSP; uses fewer database scans by intersecting ID-lists
3. PrefixSpan: prefix-projected sequential pattern mining; grows patterns in sub-databases; fast but memory intensive
4. HMM: Hidden Markov Models; finds qualitatively distinct patterns of behaviour; easy to test specific hypotheses
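As a toy illustration of the counting idea behind GSP-style sequence mining, the snippet below asks how often one event occurs before another within the same sequence; the event sequences and the support threshold are assumptions for illustration only.

```python
from collections import Counter

sequences = [
    ["login", "browse", "add_to_cart", "checkout"],
    ["login", "browse", "logout"],
    ["login", "add_to_cart", "checkout"],
]

pair_support = Counter()
for seq in sequences:
    seen = set()
    for i, first in enumerate(seq):
        for later in seq[i + 1:]:
            seen.add((first, later))  # order matters: `first` happened before `later`
    pair_support.update(seen)         # count each ordered pair once per sequence

# ordered pairs that appear in at least two of the three sequences
print([pair for pair, count in pair_support.items() if count >= 2])
```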
Text Mining
Text mining is typically performed on unstructured data and is commonly used for sentiment analysis.
Algorithms
- focus on meaning — identify parts of speech, identify sentiment, and use the meaning of words to analyse the text. Algorithms focusing on meaning: NLP techniques, HMM, Latent Dirichlet Allocation (LDA)
- bag of words — treat words as tokens of distinct categories without understanding their meaning; represent them by binary presence, frequency or weighted counts. Algorithms using this method include Naive Bayes, Neural Networks, SVM, etc. (a minimal sketch follows)
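To make the bag-of-words route concrete, here is a minimal sentiment sketch combining token counts with Naive Bayes in scikit-learn; the tiny labelled corpus is made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "great product, loved it",
    "terrible, would not buy again",
    "really happy with this",
    "awful experience, very disappointed",
]
labels = ["positive", "negative", "positive", "negative"]

# CountVectorizer turns each text into raw token counts (no meaning, just words),
# and MultinomialNB classifies on those counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["loved it, really great", "terrible, very disappointed"]))
```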