STATISTICA Models for Data Mining, Predictive Data Mining, Stacking, Text Mining, Data Warehousing, and On-Line Analytic Processing (OLAP)
MODELS FOR DATA MINING
In the business environment, complex data mining projects
may require the coordinated efforts of various experts, stakeholders, or
departments throughout an entire organization. In the data mining literature,
various "general frameworks" have been proposed to serve as
blueprints for how to organize the process of gathering data, analyzing data,
disseminating results, implementing results, and monitoring improvements.
One such model, CRISP-DM (the Cross-Industry
Standard Process for Data Mining), was proposed in the mid-1990s by a European
consortium of companies to serve as a non-proprietary standard process model
for data mining.
This general approach postulates the following (perhaps not
particularly controversial) general sequence of steps for data mining projects:
business understanding, data understanding, data preparation, modeling,
evaluation, and deployment.
Another approach, the Six Sigma methodology, is a well-structured,
data-driven approach to eliminating defects, waste, or quality-control
problems of all kinds in manufacturing, service delivery, management, and other
business activities.
This model has recently become very popular (due to its
successful implementations) in various American industries, and it appears to
be gaining favor worldwide. It postulates a sequence of so-called DMAIC steps
(Define, Measure, Analyze, Improve, Control) that grew out of the
manufacturing, quality-improvement, and process-control traditions and is
particularly well suited to production environments
(including the "production of services," i.e., service industries).
Another framework of this kind (actually somewhat similar
to Six Sigma) is the approach proposed by the SAS Institute, called SEMMA
(Sample, Explore, Modify, Model, Assess), which focuses more on the technical
activities typically involved in a data mining project.
All of these models are concerned with the process of how
to integrate data mining methodology into an organization, how to "convert data into information," how
to involve important stakeholders, and how to disseminate the information in a
form that can easily be converted by stakeholders into resources for strategic
decision making.
Some software tools for data mining are specifically
designed and documented to fit into one of these specific frameworks. The
general underlying philosophy of StatSoft's STATISTICA Data Miner is to provide
a flexible data mining workbench that can be integrated into any organization,
industry, or organizational culture, regardless of the general data mining
process-model that the organization chooses to adopt.
For example, STATISTICA Data Miner can include the complete
set of specific tools necessary for ongoing, company-wide Six Sigma quality control
efforts, and users can take advantage of its (optional) DMAIC-centric
user interface for industrial data mining tools. It can equally well be
integrated into ongoing marketing research or CRM (Customer Relationship
Management) projects that follow either the CRISP-DM or SEMMA approach; it fits
both of them well without favoring either one.
Also, STATISTICA Data Miner offers all the advantages of a
general, data-mining-oriented "development kit" that
includes easy-to-use tools for incorporating into your projects not only such
components as custom database gateway solutions, prompted interactive queries,
and proprietary algorithms, but also systems of access privileges, workgroup
management, and other collaborative-work tools that allow you to design
large-scale, enterprise-wide systems (e.g., following the CRISP-DM model, the
SEMMA model, or a combination of both) that involve your entire organization.
PREDICTIVE DATA MINING
(Image source: blog.bosch-si.com)
The term Predictive Data Mining usually refers to
data mining projects whose goal is to identify a statistical or neural
network model, or a set of models, that can be used to predict some response of
interest.
For example, a credit card company may want to engage in
predictive data mining to derive a (trained) model or set of models (e.g.,
neural networks, a meta-learner) that can quickly identify transactions that
have a high probability of being fraudulent. Other types of data mining
projects may be more exploratory in nature (e.g., identifying clusters or
segments of customers), in which case drill-down descriptive and exploratory
methods would be applied. Data reduction is another possible objective of data
mining (e.g., aggregating or amalgamating the information in very large data
sets into useful and manageable chunks).
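As an illustration, here is a minimal sketch of such a predictive modeling workflow. It assumes a Python environment with scikit-learn and uses synthetic data in place of real transactions; none of this is STATISTICA's own API.

```python
# A minimal sketch of predictive data mining: train a neural network
# to score the probability that a transaction is fraudulent.
# Assumes scikit-learn; the data here are synthetic stand-ins.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Synthetic "transactions": 20 numeric features, ~3% fraudulent.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# The trained model predicts the response of interest (fraud).
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                      random_state=0).fit(X_train, y_train)
fraud_probability = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, model.predict(X_test)))
```

The predicted probabilities can then be thresholded to flag high-risk transactions for review.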
STACKING (Stacked Generalization)
The concept of stacking (Stacked Generalization) applies in
the area of predictive data mining to combining the predictions from multiple
models. It is particularly useful when the types of models included in the
project are very different.
(Image source: iq.opengenus.org)
Suppose your data mining project includes tree classifiers,
such as C&RT or CHAID, linear discriminant analysis (e.g., see GDA), and
neural networks. Each computes predicted classifications for a cross-validation
sample, from which overall goodness-of-fit statistics (e.g., misclassification
rates) can be computed. Experience has shown that combining the predictions
from multiple methods often yields more accurate predictions than can be
derived from any one method (e.g., see Witten and Frank, 2000). In stacking,
the predictions from different classifiers are used as input into a
meta-learner, which attempts to combine the predictions to create a final best
predicted classification.
So, for example, the predicted classifications from the
tree classifiers, linear model, and the neural network classifier(s) can be
used as input variables into a neural network meta-classifier, which will
attempt to "learn" from the
data how to combine the predictions from the different models to yield maximum
classification accuracy.
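A minimal sketch of that setup, assuming scikit-learn (whose StackingClassifier feeds each base model's cross-validated predictions to a final meta-learner; the dataset and settings below are illustrative):

```python
# Stacking: a tree, linear discriminant analysis, and a neural network
# feed their out-of-fold predictions to a neural-network meta-learner.
# Assumes scikit-learn; dataset and hyperparameters are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

base_learners = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("lda", LinearDiscriminantAnalysis()),
    ("mlp", make_pipeline(StandardScaler(),
                          MLPClassifier(max_iter=1000, random_state=0))),
]

# The meta-learner "learns" how to combine the base predictions;
# cv=5 generates the out-of-fold predictions it trains on.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=MLPClassifier(max_iter=1000,
                                                         random_state=0),
                           cv=5)
print(cross_val_score(stack, X, y, cv=5).mean())
```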
Other methods for combining the predictions from multiple
models or methods (e.g., from multiple data sets used for learning) are Boosting
and Bagging (Voting), sketched below.
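A minimal sketch of those two alternatives, again assuming scikit-learn (the base estimator and ensemble sizes are illustrative):

```python
# Bagging and boosting as alternatives to stacking.
# Assumes scikit-learn; both train with .fit(X, y) like any classifier.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging (voting): trees trained on bootstrap resamples vote on the class.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           random_state=0)

# Boosting: weak learners are fit sequentially, each reweighting the
# cases that the previous ones misclassified.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
```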
TEXT MINING
While Data Mining is typically concerned with the detection
of patterns in numeric data, very often important (e.g., business-critical)
information is stored in the form of text. Unlike numeric data, text is often
amorphous and difficult to deal with.
(Image source: data.flair.training)
Text mining generally consists of analyzing
(multiple) text documents by extracting key phrases, concepts, etc., and of
preparing the text processed in that manner for further analyses with
numeric data mining techniques (e.g., to determine co-occurrences of concepts,
key phrases, names, addresses, product names, etc.).
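A minimal sketch of that conversion from text to numeric data, assuming scikit-learn (the sample documents are invented for illustration):

```python
# Text mining: reduce documents to term counts, then derive a simple
# term-by-term co-occurrence matrix for further numeric analysis.
# Assumes scikit-learn; the documents are invented examples.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "credit card fraud detected in online transactions",
    "customer complaints about credit card fees",
    "online transactions grew in the fourth quarter",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)   # rows: documents, columns: terms

# Terms appearing in the same documents get nonzero co-occurrence counts.
cooccurrence = (X.T @ X).toarray()
print(vectorizer.get_feature_names_out())
print(cooccurrence)
```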
DATA WAREHOUSING
StatSoft defines data warehousing as a process of
organizing the storage of large, multivariate data sets in a way that
facilitates the retrieval of information for analytic purposes.
The most efficient data warehousing architecture will be
capable of incorporating or at least referencing all data available in the
relevant enterprise-wide information management systems, using designated
technology suitable for corporate database management (e.g., Oracle, Sybase,
MS SQL Server).
(Image source: halobi.com)
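As a minimal sketch of retrieving warehoused data for analytic purposes, here is an example assuming a Python environment with pandas and an in-memory SQLite database standing in for such a corporate DBMS; the table and column names are hypothetical:

```python
# Pull an analytic summary from a (toy) warehouse into a data frame.
# Assumes pandas; SQLite stands in for Oracle/Sybase/MS SQL Server,
# and the schema below is hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('North', 'A', 120.0), ('North', 'B', 80.0), ('South', 'A', 95.0);
""")

summary = pd.read_sql(
    "SELECT region, SUM(revenue) AS total FROM sales GROUP BY region", conn)
print(summary)
```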
Also, a flexible, high-performance (see the IDP
technology), open-architecture approach to data warehousing, one that
integrates with existing corporate systems and allows users to organize, and
efficiently reference for analytic purposes, enterprise repositories of data
of practically any complexity, is offered in StatSoft enterprise systems such
as STATISTICA Enterprise and STATISTICA Enterprise/QC, which can also work in
conjunction with STATISTICA Data Miner and STATISTICA Enterprise Server.
ON-LINE ANALYTIC PROCESSING (OLAP)
The term On-Line Analytic Processing - OLAP (or Fast
Analysis of Shared Multidimensional Information - FASMI) refers to technology that allows users of multidimensional databases
to generate on-line descriptive or comparative summaries ("views")
of data and other analytic queries.
Note that despite its name, analyses referred to as OLAP do
not need to be performed truly "on-line" (or in real-time); the term
applies to analyses of multidimensional databases (that may, obviously, contain
dynamically updated information) through efficient "multidimensional"
queries that reference various types of data. OLAP facilities can be integrated
into corporate (enterprise-wide) database systems and they allow analysts and
managers to monitor the performance of the business (e.g., various
aspects of the manufacturing process, or the numbers and types of completed
transactions at different locations) or the market.
(Image source: element61.be)
The final result of OLAP techniques can be very simple
(e.g., frequency tables, descriptive statistics, simple cross-tabulations) or
more complex (e.g., they may involve seasonal adjustments, removal of outliers,
and other forms of cleaning the data). Although Data Mining techniques can
operate on any kind of unprocessed or even unstructured information, they can
also be applied to the data views and summaries generated by OLAP to provide
more in-depth and often more multidimensional knowledge. In this sense, Data
Mining techniques can be considered either a different analytic
approach (serving purposes different from OLAP's) or an analytic extension of
OLAP.
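A minimal sketch of such an OLAP-style "view," assuming pandas (the locations, months, and amounts are invented for illustration):

```python
# An OLAP-style multidimensional summary: completed transactions
# cross-tabulated by location and month. Assumes pandas; data invented.
import pandas as pd

transactions = pd.DataFrame({
    "location": ["NY", "NY", "LA", "LA", "NY", "LA"],
    "month":    ["Jan", "Feb", "Jan", "Feb", "Jan", "Jan"],
    "amount":   [100, 150, 90, 120, 80, 60],
})

# Rows: location; columns: month; cells: count and sum of amounts.
view = transactions.pivot_table(index="location", columns="month",
                                values="amount", aggfunc=["count", "sum"])
print(view)
```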
Reference source: documentation.statsoft.com