Data Mining is an analytic process designed to explore data (usually large amounts of data – typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction – and predictive data mining is the most common type of data mining and one that has the most direct business applications.
The process of data mining consists of three stages:
(1) The initial exploration,
(2) Model building or pattern identification with validation/verification, and
(3) Deployment (i.e., the application of the model to new data in order to generate predictions).
This stage usually starts with data preparation which may involve cleaning data, data transformations, selecting subsets of records and – in case of data sets with large numbers of variables (“fields”) – performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anywhere between a simple choice of straightforward predictors for a regression model, to elaborate exploratory analyses using a wide variety of graphical and statistical methods in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.
Model building and validation:
This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal – many of which are based on so-called “competitive evaluation of models,” that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques – which are often considered the core of predictive data mining – include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
That final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
The concept of Data Mining is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business Data Mining, but Data Mining is still based on the conceptual principles of statistics including the traditional Exploratory Data Analysis (EDA) and modeling and it shares with them both some components of its general approaches and specific techniques.
However, an important general difference in the focus and purpose between Data Mining and the traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented towards applications than the basic nature of the underlying phenomena. In other words, Data Mining is relatively less concerned with identifying the specific relations between the involved variables. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables are not the main goal of Data Mining. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, Data Mining accepts among others a “black box” approach to data exploration or knowledge discovery and uses not only the traditional Exploratory Data Analysis (EDA) techniques, but also such techniques as Neural Networks which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based.
Data Mining in Business
Through the use of automated statistical analysis (or “data mining”) techniques, businesses are discovering new trends and patterns of behavior that previously went unnoticed. Once they’ve uncovered this vital intelligence, it can be used in a predictive manner for a variety of applications.
The first step toward building a productive data mining program is to gather data. Most businesses already perform these data gathering tasks to some extent — the key here is to locate the data critical to your business, refine it and prepare it for the data mining process. If you’re currently tracking customer data in a modern DBMS, chances are you’re almost done.
Selecting an Algorithm
At this point, take a moment to pat yourself on the back. You have a data warehouse! The next step is to choose one or more data mining algorithms to apply to your problem. If you’re just starting out, it’s probably a good idea to experiment with several techniques to give yourself a feel for how they work. Your choice of algorithm will depend upon the data you’ve gathered, the problem you’re trying to solve and the computing tools you have available to you. Let’s take a brief look at two of the more popular algorithms.
Regression is the oldest and most well-known statistical technique that the data mining community utilizes. Basically, regression takes a numerical dataset and develops a mathematical formula that fits the data. When you’re ready to use the results to predict future behavior, you simply take your new data, plug it into the developed formula and you’ve got a prediction! The major limitation of this technique is that it only works well with continuous quantitative data (like weight, speed or age). If you’re working with categorical data where order is not significant (like color, name or gender) you’re better off choosing another technique.
Working with categorical data or a mixture of continuous numeric and categorical data? Classification analysis might suit your needs well. This technique is capable of processing a wider variety of data than regression and is growing in popularity. You’ll also find output that is much easier to interpret. Instead of the complicated mathematical formula given by the regression technique you’ll receive a decision tree that requires a series of binary decisions. One popular classification algorithm is the k-means clustering algorithm.
In the next Ill discuss about some crucial concepts of data mining.