Sunday, February 24, 2008

DATA MINING
Data mining is the process of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations and financial analysts, but it is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases".
Weakness of Data Mining
Data mining relies on real-world data. Such data is extremely vulnerable to collinearity precisely because data from the real world may have unknown interrelations. An unavoidable weakness of data mining is that the critical data that might explain the relationships may never be observed. Alternative, experiment-based approaches such as Choice Modelling may be used for human-generated data. In those approaches, inherent correlations are either controlled for or removed altogether through the construction of an experimental design.
Examples
Data Mining is most frequently used for Customer Relationship Management applications. Common goals are to predict which people are most likely to:
a) Be Acquired
b) Be Cross-Sold or Up-Sold
c) Leave (Churn)
d) Be Retained, Saved, or Won back

These applications can contribute significantly to the bottom line. Rather than contacting every prospect or customer through a call center or by mail, a business contacts only the prospects predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize across campaigns, predicting which channel and which offer an individual is most likely to respond to across all potential offers. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will show the greatest increase in response if given an offer.
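As a rough illustration of the uplift idea, the sketch below compares response rates between customers who received an offer and customers who did not, segment by segment. The function name `uplift_by_segment`, the segment labels, and the campaign records are all hypothetical, and a real uplift model would normally work per individual and check statistical significance; this only shows the basic subtraction.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment from (segment, got_offer, responded) tuples.

    Uplift here is simply: response rate with an offer minus response
    rate without one. Segments missing either group are skipped.
    """
    # counts[segment][got_offer] = [number of responders, number of customers]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, got_offer, responded in records:
        cell = counts[segment][bool(got_offer)]
        cell[0] += int(responded)
        cell[1] += 1

    uplift = {}
    for segment, groups in counts.items():
        treated_resp, treated_total = groups[True]
        control_resp, control_total = groups[False]
        if treated_total == 0 or control_total == 0:
            continue  # cannot compare without both groups
        uplift[segment] = (treated_resp / treated_total
                           - control_resp / control_total)
    return uplift


# Hypothetical campaign history (made-up numbers for illustration only).
records = (
    [("loyal", True, True)] * 8 + [("loyal", True, False)] * 2
    + [("loyal", False, True)] * 7 + [("loyal", False, False)] * 3
    + [("lapsed", True, True)] * 5 + [("lapsed", True, False)] * 5
    + [("lapsed", False, True)] * 1 + [("lapsed", False, False)] * 9
)

uplift = uplift_by_segment(records)
best_segment = max(uplift, key=uplift.get)
print(best_segment, round(uplift[best_segment], 2))
```

In this made-up data the "loyal" segment responds often even without an offer, so the offer adds little there; the "lapsed" segment responds less overall but gains the most from being contacted.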
Businesses employing data mining may quickly see a return on investment, but they also recognize that the number of predictive models can quickly become very large. Rather than one model to predict which customers will churn, a business could build a separate model for each region and customer type. Then, instead of sending an offer to everyone who is likely to churn, it may want to send offers only to customers who are likely to take the offer. Finally, it may also want to determine which customers are going to be profitable over a window of time and send offers only to those. To maintain this quantity of models, businesses need to 1) manage model versions and 2) move to "Automated Data Mining."
Another example of data mining, often called Market Basket Analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favour silk shirts over cotton ones. Although such relationships may be difficult to explain, taking advantage of them is easier. This example deals with association rules within transaction-based data. Not all data are transaction based, and logical or inexact rules may also be present within a database. In a manufacturing application, an inexact rule may state that 73% of products that have a specific defect or problem will develop a secondary problem within the next 6 months.
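An association rule such as "customers who buy a silk shirt also buy a tie" is usually scored by its support (how often both items appear together) and its confidence (how often the second item appears when the first does). An inexact rule like the 73% figure above is a confidence value. The sketch below computes both measures; the function name `rule_stats` and the baskets are invented for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of all transactions containing both item sets
    confidence = share of antecedent transactions that also contain
                 the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    both = antecedent | consequent

    total = len(transactions)
    with_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    with_both = sum(1 for t in transactions if both <= set(t))

    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical clothing-store baskets (made-up data).
baskets = [
    {"silk shirt", "tie"},
    {"silk shirt", "tie", "belt"},
    {"cotton shirt", "belt"},
    {"silk shirt", "belt"},
    {"cotton shirt", "tie"},
]

support, confidence = rule_stats(baskets, {"silk shirt"}, {"tie"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Here the rule holds in two of five baskets, and in two of the three baskets that contain a silk shirt. A full market basket system would search over many candidate rules rather than scoring one chosen by hand.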

Data Mining in Business
Through the use of automated statistical analysis (or "data mining") techniques, businesses are discovering new trends and patterns of behavior that previously went unnoticed. Once uncovered, this intelligence can be used in a predictive manner for a variety of applications.

Gathering Data
The first step toward building a productive data mining program is, of course, to gather data! Most businesses already perform these data gathering tasks to some extent -- the key here is to locate the data critical to your business, refine it and prepare it for the data mining process. If you're currently tracking customer data in a modern DBMS, chances are you're almost done. Take a look at the article Mining Customer Data from DB2 Magazine for a great feature on preparing your data for the mining process.

Selecting an Algorithm
At this point, take a moment to pat yourself on the back. You have a data warehouse! The next step is to choose one or more data mining algorithms to apply to your problem. If you're just starting out, it's probably a good idea to experiment with several techniques to give yourself a feel for how they work. Your choice of algorithm will depend upon the data you've gathered, the problem you're trying to solve and the computing tools you have available to you. Let's take a brief look at two of the more popular algorithms.

Regression
Regression is the oldest and best-known statistical technique used by the data mining community. Basically, regression takes a numerical dataset and develops a mathematical formula that fits the data. When you're ready to use the results to predict future behavior, you simply take your new data, plug it into the developed formula and you've got a prediction! The major limitation of this technique is that it only works well with continuous quantitative data (like weight, speed or age). If you're working with categorical data where order is not significant (like color, name or gender), you're better off choosing another technique.
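To make the "fit a formula, then plug in new data" idea concrete, here is a minimal sketch of simple linear regression with one predictor, fitted by ordinary least squares. The names `fit_line` and `predict` and the age and spending figures are assumptions for illustration; real projects would typically rely on a statistics package instead.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Plug a new x value into the fitted formula."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: customer age versus monthly spend (made-up numbers).
ages = [20, 30, 40, 50]
spend = [110, 160, 210, 260]

model = fit_line(ages, spend)
print(model, predict(model, 35))
```

The fitted pair of numbers is the "developed formula" mentioned above. Note that both inputs are continuous quantities, which is the case where this technique works well.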

Classification
Working with categorical data or a mixture of continuous numeric and categorical data? Classification analysis might suit your needs well. This technique is capable of processing a wider variety of data than regression and is growing in popularity. You'll also find output that is much easier to interpret. Instead of the complicated mathematical formula given by the regression technique, you'll receive a decision tree that requires a series of binary decisions. Take a look at the Classification Trees chapter from the Electronic Statistics Textbook for in-depth coverage of this technique.
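The sketch below shows the smallest possible decision tree: a single binary decision (sometimes called a decision stump) chosen from mixed numeric and categorical fields by counting misclassifications. A real classification tree repeats this step recursively and usually uses a measure such as Gini impurity or entropy; the function names, customer fields, and labels here are all hypothetical.

```python
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def passes(row, feature, value):
    """Binary test: 'x <= value' for numbers, 'x == value' otherwise."""
    x = row[feature]
    if isinstance(value, (int, float)):
        return x <= value
    return x == value


def fit_stump(rows, labels):
    """Pick the single binary test with the fewest misclassifications.

    Returns (feature, value, label_if_yes, label_if_no).
    """
    candidates = {(f, v) for row in rows for f, v in row.items()}
    best = None
    for feature, value in sorted(candidates, key=repr):  # stable order
        yes = [l for r, l in zip(rows, labels) if passes(r, feature, value)]
        no = [l for r, l in zip(rows, labels) if not passes(r, feature, value)]
        if not yes or not no:
            continue  # the test does not split the data at all
        yes_label, no_label = majority(yes), majority(no)
        errors = (sum(l != yes_label for l in yes)
                  + sum(l != no_label for l in no))
        if best is None or errors < best[0]:
            best = (errors, feature, value, yes_label, no_label)
    if best is None:
        raise ValueError("no test separates the rows")
    return best[1:]


def classify(stump, row):
    feature, value, yes_label, no_label = stump
    return yes_label if passes(row, feature, value) else no_label


# Hypothetical customers with one numeric and one categorical field.
rows = [
    {"age": 25, "plan": "basic"},
    {"age": 32, "plan": "basic"},
    {"age": 47, "plan": "premium"},
    {"age": 51, "plan": "basic"},
    {"age": 38, "plan": "premium"},
]
labels = ["churn", "churn", "stay", "stay", "stay"]

stump = fit_stump(rows, labels)
print(stump, classify(stump, {"age": 29, "plan": "premium"}))
```

The result reads as a plain rule ("if age is at most some threshold, predict churn, otherwise predict stay"), which is the interpretability advantage over a fitted formula.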

Other Techniques
Regression and classification are two of the more popular data mining techniques, but they only form the tip of the iceberg. For a detailed look at other data mining algorithms, look at this feature on Data Mining Techniques or the SPSS Data Mining page.

Data Mining Products
Data mining products are taking the industry by storm. The major database vendors have already taken steps to ensure that their platforms incorporate data mining techniques. Oracle's Data Mining Suite (Darwin) implements classification and regression trees, neural networks, k-nearest neighbors, regression analysis and clustering algorithms. Microsoft's SQL Server also offers data mining functionality through the use of classification trees and clustering algorithms. If you're already working in a statistics environment, you're probably familiar with the data mining algorithm implementations offered by the advanced statistical packages SPSS, SAS, and S-Plus.