Data Mining

Ly Nguyenova
5 min read · Jul 17, 2020


The purpose of data mining

Generally speaking, data mining is a process of discovering patterns or relationships in a set of data.

Data mining tasks include simplifying data by reducing noise or dimensionality and by finding important variables and combinations of variables. Among the techniques used in data mining are clustering, classification, association analysis, anomaly detection, forecasting and score prediction. Two techniques usually treated as special cases are sequence mining and text mining.

In this article, we will go through each of the techniques to explain their purposes and algorithms used to perform these tasks.

Approaches

There are two broad approaches when it comes to algorithms: classical statistics and machine learning.

Classical statistical methods are typically quite transparent: they use familiar statistics that can, in principle, be calculated by hand. Machine learning methods, by contrast, are more complex and opaque, and often require substantial computing power.

Most common algorithms in each approach

Classical Statistics Methods

  1. Linear Regression
  2. K-Means Clustering
  3. K-Nearest Neighbours
  4. Hierarchical Clustering

Machine Learning Methods

  1. Hidden Markov Models
  2. Support Vector Regression
  3. Random Forests
  4. LASSO Regression
  5. Other — Apriori algorithm, word counts, etc.

Techniques

Data Reduction

Sometimes we find ourselves in situations where data reduction is relevant. The reason can be practical constraints such as time, storage or memory; the need to reduce noise and focus on patterns; or simply that a smaller dataset is easier to interpret. With data reduction we can avoid overfitting, reduce multicollinearity and increase the degrees of freedom.

Algorithms for data reduction:

  1. Linear — the most common technique for this purpose is PCA
  2. Non-linear — Kernel PCA, Isomap, Locally Linear Embedding, Maximum Variance Unfolding
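
As a quick illustration, here is a minimal PCA sketch using scikit-learn; the toy data, the redundancy built into it and the 95% variance target are all made up for the example:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: 100 observations, 10 variables, the last 5 nearly duplicating the first 5
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(100, 5))

# Standardise first, since PCA is sensitive to the scale of each variable
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # fewer columns than the original 10
print(pca.explained_variance_ratio_)   # variance explained by each component
```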

Clustering

Clustering is the process of gathering data points into groups based on similar characteristics. Remember that the groups are groups of convenience: the criteria for forming them can differ across business needs and situations.

Depending on the situation, clustering methods can be based on (a sketch comparing two of them follows the list):

  • distance between points (only for convex clusters or small dataset)
  • distance from the centroid (must choose k — number of groups, centroids, convex clusters with similar size)
  • density (non-convex, different size, can ignore outliers, hard to interpret)
  • distribution models (prone to overfitting, good for correlations, convex)
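
To make the trade-offs concrete, here is a minimal sketch comparing a centroid-based method (k-means, where k must be chosen up front) with a density-based one (DBSCAN, which can mark outliers); the toy blobs and the eps / min_samples values are arbitrary choices for the example:

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

# Toy data with three convex clusters (made up for the example)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Centroid-based: we must choose k up front
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
centroid_labels = kmeans.fit_predict(X)

# Density-based: no k, handles non-convex shapes, can mark outliers as -1
dbscan = DBSCAN(eps=0.8, min_samples=5)
density_labels = dbscan.fit_predict(X)

print(set(centroid_labels))   # {0, 1, 2}
print(set(density_labels))    # clusters found by density; -1 means noise/outlier
```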


Classification

Quite similar to clustering, classification tries to put things into the right bucket. Since classification can be performed on structured as well as unstructured data, choosing an algorithm can get a bit confusing.

An easy way to decide is to ask yourself who your audience or user is. If you need to explain to a person how the model works, go with transparent methods like Decision Trees or Naive Bayes. If, on the other hand, the model runs as a black box where the algorithm directly controls the outcome and accuracy is paramount, then opaque methods like Support Vector Machines and Artificial Neural Networks are acceptable.
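
A small sketch of that trade-off, using scikit-learn on the Iris data; the shallow tree depth and the RBF kernel are arbitrary choices for the example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Transparent: a shallow decision tree whose rules can be shown to a stakeholder
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree))                      # human-readable if/then rules
print("tree accuracy:", tree.score(X_test, y_test))

# Opaque: an SVM that may score higher but is much harder to explain
svm = SVC(kernel="rbf").fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))
```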

Anomaly Detection

Anomalies are defined as unusual events, things that are not supposed to happen. They usually signify a problem such as a broken system, fraud or disease.

These outliers can be spotted in a single variable (univariate), in a combination of scores (bivariate, multivariate), or even in categorical data, where a category holding less than 10% of the observations can be considered anomalous.

Outliers can affect your analysis in several ways: distorted statistics, distorted relationships, misleading results and a failure to generalise.

What to do with outliers:

  1. Remove — only when there are few of them, and you should explain why
  2. Transform — apply a logarithm or square transformation to make the distribution more symmetrical
  3. Robust measures — use statistics that are not strongly influenced by outliers
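
A minimal sketch of the "robust" option, using NumPy; the sample values and the 1.5 × IQR rule of thumb are just illustrative choices:

```python
import numpy as np

# Toy univariate sample with one obvious outlier (made up for the example)
x = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 10.4, 9.7, 35.0])

# The outlier distorts the classical statistics ...
print("mean:", x.mean(), "std:", x.std())   # both pulled towards 35.0

# ... while robust measures are barely affected
print("median:", np.median(x))

# A common rule of thumb: flag points outside 1.5 * IQR of the quartiles
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print("flagged:", outliers)                 # [35.0]
```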

Association Analysis

Association analysis works on the principle of discovering relationships among items in a dataset. It learns how often an item occurs in association with another item in order to find rules or patterns. Algorithms for this type of analysis include Apriori, Eclat, FP-growth, RElim, SaM and JIM.

Approaches

Finding item sets — combinations of items that appear together; calculate the “support” for each combination.

supp(X) = freq(X) / T

freq(X) is the number of times itemset X appears in the data and T is the total number of transactions, so supp(X) is the proportion of transactions that contain itemset X.

Rule generation — if/then statements; calculate the “confidence”, i.e. the conditional probability: if X occurs, what is the probability of Y?

conf(X ⇒ Y) = supp(X ∪ Y) / supp(X)

Other approaches — collective strength, conviction, leverage, lift
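
The two basic measures are easy to compute by hand. Below is a minimal sketch in plain Python; the grocery-style transactions are made up for the example:

```python
# Toy transaction data (made up for the example)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]
T = len(transactions)

def supp(itemset):
    """Proportion of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / T

def conf(X, Y):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    return supp(set(X) | set(Y)) / supp(X)

print(supp({"bread", "milk"}))      # 3/5 = 0.6
print(conf({"bread"}, {"milk"}))    # 0.6 / 0.8 = 0.75
```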

Algorithms

  1. Apriori
  • calculate support for single-item item sets
  • if supp < minsupp, then remove item set
  • then expand to two-item item sets, etc.
  • challenges — breadth-first search (BFS), many passes through the data, computation grows exponentially

  2. Eclat
  • calculate support for single-item item sets
  • if supp > minsupp, then add item
  • if supp < minsupp, then move to the next single item
  • challenges — depth-first search (DFS), fast but difficult to manipulate
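
For a ready-made Apriori implementation, the sketch below assumes the third-party mlxtend library is installed; the transactions and the min_support / min_threshold values are made up for the example:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Same kind of toy transactions as above (made up for the example)
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk", "butter"],
    ["bread", "milk"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Breadth-first Apriori: grow itemsets level by level, pruning below min_support
frequent = apriori(onehot, min_support=0.4, use_colnames=True)

# Turn frequent itemsets into if/then rules filtered by confidence
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```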

Regression Analysis

  • Uses many variables to predict one.
  • Classical methods — based on means and standard deviations; predictors can enter simultaneously (everything at once), in blocks (group by group) or stepwise (highest-correlated variable first, then the next, and so forth); prone to overfitting; nonlinear variants also exist.
  • Modern methods — based on correlation, distance and choosing between predictors; examples include LASSO methods, LARS, RFE and SVR (a LASSO sketch follows the list).
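
As referenced in the list, here is a minimal LASSO sketch using scikit-learn's cross-validated LassoCV; the synthetic regression data and the number of informative predictors are made up for the example:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 20 candidate predictors, only 5 actually informative (made up)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# LASSO with a cross-validated penalty; standardising keeps the penalty fair
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print("chosen alpha:", lasso.alpha_)
# Coefficients shrunk exactly to zero are the predictors LASSO dropped
print("kept predictors:", (lasso.coef_ != 0).sum(), "of", X.shape[1])
```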

Sequence Mining

Sequence mining looks for chains that repeat: temporal (ordinal) associations where the order of events matters. This technique is used in genetic sequencing, recommendation engines, marketing offers and behavioural “state switching”.

Algorithms

  1. GSP: Generalized Sequential Patterns; similar to Apriori in association analysis but takes order into account; uses a sliding window for simultaneity
  2. SPADE: Sequential Pattern Discovery using Equivalence classes; faster than GSP; needs fewer database scans because it intersects ID-lists
  3. PrefixSpan: prefix-projected sequential pattern mining; grows patterns within projected sub-databases; fast but memory intensive
  4. HMM: Hidden Markov Models; captures qualitatively distinct patterns of behaviour; makes it easy to test specific hypotheses
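
To show why order matters, here is a minimal, hand-rolled sketch of sequence support (a simplified idea, not one of the algorithms above); the clickstream sessions are made up for the example:

```python
# Toy clickstream sessions; the order of events matters (made up for the example)
sequences = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["search", "product", "home"],
    ["home", "search", "cart"],
]

def sequence_support(pattern, sequences):
    """Fraction of sequences that contain the pattern's items in the given order."""
    def contains(seq, pattern):
        pos = 0
        for item in seq:
            if item == pattern[pos]:
                pos += 1
                if pos == len(pattern):
                    return True
        return False
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

# Unlike plain association analysis, ("product", "cart") and ("cart", "product") differ
print(sequence_support(("product", "cart"), sequences))   # 0.5
print(sequence_support(("cart", "product"), sequences))   # 0.0
```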

Text Mining

Text mining is performed on unstructured data and is commonly used for tasks such as sentiment analysis.

Algorithms

  • focus on meaning — identify parts of speech, identify sentiment, and use the meaning of words to analyse the text. Algorithms focusing on meaning: NLP techniques, HMM, Latent Dirichlet Allocation (LDA)
  • bag of words — tokens are treated as distinct categories without understanding their meaning, using binary presence or frequency weights. Algorithms using this representation: Naive Bayes, Neural Networks, SVM, etc. (a small sketch follows)
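
As mentioned in the second bullet, a bag-of-words pipeline can be sketched in a few lines with scikit-learn; the tiny labelled corpus below is made up for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labelled corpus (made up for the example): 1 = positive, 0 = negative
texts = [
    "great product, works perfectly",
    "absolutely love it, highly recommend",
    "terrible quality, broke after a day",
    "waste of money, very disappointed",
]
labels = [1, 1, 0, 0]

# Bag of words: token counts only, no understanding of meaning or word order
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["love the quality, works great"]))        # likely [1]
print(model.predict(["broke immediately, very disappointed"]))  # likely [0]
```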
