Summary: This paper introduces Data Mining and
Data Warehousing, traces their evolution, and
surveys the software in use today alongside
the software that was in use ten years ago.
INTRODUCTION: Data Mining and Data Warehousing
are comparatively new terms, but the technology
behind them is not. Data Mining is the process of
digging out, or gathering, information from various
databases. The data includes point-of-sale
transactions, credit card purchases and online
forms, which are just a few of the many sources
that large companies dig through to find out more
about their clients. The information is used to
learn how the majority of clients shop, what
irritates them, or simply how the company can
make the client's life happier. Because gathering
all this information is necessary in order to
increase sales and build a better relationship
with clients, and because storage devices keep
becoming cheaper, the idea of warehousing data
came into being. Warehousing means that the data
is collected in a central place, where it is
analyzed and sorted according to the company's
requirements.
Data mining is the search for relationships and
global patterns that exist in large databases
but are 'hidden' among the vast amount of data,
such as a relationship between patient data and
their medical diagnosis. These relationships represent
valuable knowledge about the database and the
objects in the database and, if the database is
a faithful mirror, of the real world registered
by the database (Holsheimer & Siebes, 1994).
Data mining, or knowledge discovery, is a way of
sifting through millions and millions of records
to help the people who make decisions better
understand the needs of the customer. Although
this technology is still in its infancy, many
industries already use it, including retail,
finance, health care, transportation and
aerospace. By using complex mathematical,
statistical and pattern recognition techniques,
they obtain information that about 10 years ago
would have seemed an enormous job requiring
months of processing. Today these figures are
processed at high speed and with great precision.
The figures and analyses help an analyst recognize
relationships, trends, exceptions and anomalies
that might otherwise be missed while analyzing
data.
To understand how much data is involved, consider
storage capacity. There are trillions of
point-of-sale transactions, credit card purchases
and pictures (just some of the types of data that
data mining applications pick up), all stored in
large databases whose size is measured in bytes.
The byte is the unit in which storage capacity is
measured. Eight bits make one byte, 1024 bytes
make one kilobyte, and so on. Today the size of
databases is measured in gigabytes and terabytes.
One gigabyte is equal to 1,073,741,824 bytes, and
one terabyte is 1024 gigabytes, roughly the
equivalent of 2 million books. That is a lot of
data, but it is the amount of data handled by
companies such as Wal-Mart. The data is collected
by various methods and stored in a central
database that runs on extremely powerful machines
maintained by the company itself. The place where
all this data is stored is known as a Data
Warehouse. The data is accumulated in one place,
then sorted and arranged so that users can find
the information they want.
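The unit arithmetic above can be checked with a
short sketch. The function names and the list of
unit labels are chosen here for illustration only;
the sketch assumes the binary convention used in
the text, where each unit is 1024 times the
previous one.

```python
# Binary storage units: each step up is a factor of 1024.
UNITS = ["bytes", "KB", "MB", "GB", "TB"]


def to_bytes(value, unit):
    """Convert a size expressed in one of UNITS to a number of bytes."""
    return value * 1024 ** UNITS.index(unit)


def human_readable(n_bytes):
    """Express a byte count in the largest unit that keeps the value below 1024."""
    size = float(n_bytes)
    for unit in UNITS:
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


print(to_bytes(1, "KB"))            # 1024
print(to_bytes(1, "GB"))            # 1073741824
print(human_readable(1073741824))   # 1.0 GB
```

The gigabyte figure printed by the sketch matches
the 1,073,741,824 bytes quoted above.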
The million-dollar question that needs to be
answered is: what is data? Data are the facts and
figures that are collected by various means and
from various sources. Organizations collect huge
amounts of data on a daily basis, in different
formats and in different databases. Some of the
types of data that this software collects are
given below:
1. Operational data, such as sales, cost, inventory,
payroll and accounting
2. Non-operational data, such as industry sales,
forecast data and macroeconomic data
3. Meta data: - data about the
data itself, such as logical database design or
data dictionary definitions.
These are then collected and stored in Data
Warehouses, where they are analyzed for different
purposes.
Today Data Mining can help businesses in many
ways. It is used to discover patterns and
relationships in the data that help make better
business decisions. Data mining can also help
spot sales trends, shape better marketing
strategies and indicate which customers are loyal
and which are not. Specific uses of data mining
include:
1. Market Segmentation: - Identify
the customers who buy the same products from
a particular company. For example, a restaurant
could track when a customer visits and what they
typically like to order. This information can be
used to increase sales through daily specials.
2. Customer churn: - Predict
which customers are likely to leave your products
and go to a competitor. An example could be the
two famous beverage brands Pepsi® and Coca-Cola®:
a marketing team could use data mining software
to help identify when customers turn away and
for what reason.
3. Fraud detection: - Identify
which transactions are most likely to be fraudulent.
This can also help reduce the amount of credit
card theft and identity theft that happens
today.
4. Direct Marketing: - Identify
which prospects should be included in a mailing
list so that a better response is obtained.
5. Interactive marketing: - Predict
what each individual accessing a website is most
likely interested in seeing.
6. Market basket analysis: - Understand
what products or services are commonly purchased
together, e.g. beer and diapers.
7. Trend analysis: - Reveal the
difference between a typical customer this month
and last.
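As a rough illustration of the idea of finding
products that are commonly purchased together, the
sketch below counts how often each pair of products
appears in the same basket. The transactions are
invented for this example, and the function name
pair_support is an assumption rather than part of
any particular product; commercial tools use far
more elaborate algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented purely for illustration.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk", "beer"},
]


def pair_support(baskets):
    """Return, for each product pair, the fraction of baskets containing both."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair a single canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(transactions)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
```

In this made-up data, beer and diapers appear
together in three of the five baskets, more often
than any other pair.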
Data mining is primarily used today for extracting
knowledge about customers. Analysts need this
information to further improve relations with the
client. Businesses need to know how clients
respond to certain products and when a new
product should be brought to market.
Having this kind of information can bring
tremendous success and can save an organization
from an enormous loss. For example, a video store
can use the information produced by Data Mining
software to recommend a movie to a customer. One
of the biggest users of Data Mining is Wal-Mart.
It collects point-of-sale transaction data from
around 2,900 stores in 6 countries and transmits
the data to its massive Teradata warehouse.
This is a very powerful setup: 20 million
point-of-sale transactions are uploaded to an AT&T
massively parallel system with 483 processors
running a centralized database.
The National Basketball Association, popularly
known as the NBA®, is exploring the use of Data
Mining software to show coaches and players which
strategies were put into play. The software digs
up information from stored records and images of
the game. For example, an analysis of the
play-by-play sheet of the game between the New
York Knicks and the Cleveland Cavaliers on
January 6, 1995 reveals that when Mark Price
played the Guard position, John Williams attempted
four jump shots and made each one. Not only does
the application find this pattern, it also
explains that the pattern differs considerably
from the Cavaliers' average shooting percentage
of 49.30% for that game.
The coach can then use the universal clock used
by the NBA® to look up the corresponding images,
instead of going through numerous hours of video
footage.
In today's highly competitive market, companies
need to turn this information into insights about
customers that can guide their marketing,
investment and management strategies. The data
produced by these applications not only helps
develop strategies but can also tell how long a
person spoke on the phone, or whether a line is
connected to a fax machine. This information can
advise telemarketing teams on which phones are
used residentially and which are used for
business. A common line of reasoning is that a
phone that is busy at night is probably being
used by teenagers, and so is naturally a home
phone. If the phone line remains busy during the
day, when teenagers are at school or college, the
person using it is probably a businessperson
using the phone for business. However, the use
of these technologies is not restricted to
telemarketers and marketing agencies; they are
also used by governmental agencies. With Americans
coming under attack abroad, homeland security has
been increased in light of evidence that terrorist
activities are being planned on American soil.
Phone lines are being tapped, and recordings of
phone messages are sent to a database that scans
them for keywords; some of the keywords could be
“Allah”, “bomb” and “murder”, among many others.
The recordings are saved with details such as
where the call originated and to which country
it was made, along with a great deal of other
information. Recognizing words from different
languages is a lot of work and requires a very
powerful machine.
The roots of data mining can be traced along three
lines. Perhaps the longest of them is classical
statistics. Without statistics there would be no
Data Mining; statistics is the basis of any
technology that provides data mining features.
Classical statistics embraces concepts such as
regression analysis, standard distribution, standard
deviation, standard variance, discriminant analysis,
cluster analysis and confidence intervals. These
are the tools that do the analytical work for
the data mining application. Without them there
would be no way to decide which data is relevant;
a data mining application is, at its core,
statistical software. Statistics therefore plays
a very important role.
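Two of the classical concepts named above, standard
deviation and regression analysis, can be sketched
in a few lines. The figures for advertising spend
and unit sales are invented for this example, and
the helper fit_line is a hypothetical name; it is
a minimal least-squares fit, not the method of any
particular data mining product.

```python
import statistics

# Hypothetical monthly figures, invented for illustration.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
unit_sales = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ad_spend, unit_sales)
spread = statistics.stdev(unit_sales)  # sample standard deviation

print(f"sales = {slope:.2f} * spend + {intercept:.2f}")
print(f"standard deviation of sales: {spread:.2f}")
```

With a fitted line in hand, an analyst can estimate
sales for a spend level that has not been tried
yet, which is the kind of forward-looking question
data mining aims to answer.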
The second line is Artificial Intelligence (AI).
AI attempts to make the computer process
information in a way that resembles human
thinking, and the results are analyzed by this
intelligence. This approach was not practical
until the early 1980s; before then, no computer
was fast enough to deliver the results that were
required. During the 1980s, computers started to
become available at reasonable prices, but AI
first found its way to scientific and governmental
agencies that could afford the high prices of
the huge, powerful machines it required. Such
machines were out of the reach of common users.
Some high-end commercial products, such as the
query optimization modules for Relational Database
Management Systems (RDBMS), used AI concepts.
Data mining is a natural development that took
shape as the usage of databases increased. It is
now ready for business use because of three
elements that have developed significantly in
recent years. These elements are the key to the
application's success. The elements
are:
1. Massive Data collection and storage
2. Powerful multiprocessor computers
3. Data mining algorithms
In the evolution from business data to business
information, each new step has built upon the
previous ones. For example, dynamic data access
is critical for drill-through in data navigation
applications, and the ability to store large databases
is critical to data mining. From the user's point
of view, the four steps listed in Table 1 were
revolutionary because they allowed new business
questions to be answered accurately and quickly.
The table below summarizes the stages in the
evolution of data mining.
Evolutionary Step: Data Collection (1960s)
Business Question: "What was my average total
revenue over the last five years?"
Enabling Technologies: Computers, tapes, disks
Product Providers: IBM, CDC
Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
Business Question: "What were unit sales in New
England last March?"
Enabling Technologies: Relational databases (RDBMS),
Structured Query Language (SQL), ODBC
Product Providers: Oracle, Sybase, Informix, IBM,
Microsoft
Characteristics: Retrospective, dynamic data
delivery at record level

Evolutionary Step: Data Navigation (1990s)
Business Question: "What were unit sales in New
England last March? Drill down to Boston."
Enabling Technologies: On-line analytic processing
(OLAP), multidimensional databases, data warehouses
Product Providers: Pilot, IRI, Arbor, Redbrick,
Evolutionary Technologies
Characteristics: Retrospective, dynamic data
delivery at multiple levels

Evolutionary Step: Data Mining (2000)
Business Question: "What's likely to happen to
Boston unit sales next month? Why?"
Enabling Technologies: Advanced algorithms,
multiprocessor computers, massive databases
Product Providers: Lockheed, IBM, SGI, numerous
startups (nascent industry)
Characteristics: Prospective, proactive information
delivery

Table 1. Steps in the evolution of data mining.
(An overview of Data Mining….., 2004)
Data mining is likely to be a profitable
investment for any business in the future.
The need for it will be felt most where companies
need to improve customer relations and provide
services of prime value. Companies will be able
to tell exactly what a client desires and when
s/he desires it, which is becoming very important
in the current era. People are very conscious
of what they buy. Tastes and traditions also play
an important part. People's tastes are no
longer local but increasingly global. Learning
about other cultures and bringing a taste of them
back home is becoming the theme of the day. With
this in mind, digging out information through
data mining should prove very fruitful. In
the future many more companies will have data
warehouses, which will make the market more
competitive than ever. The work of analysts
will become easier, and the time needed to sift
through millions of records will be brought down
considerably.
Many software packages are available on the
market and are in use by many companies.
Some that have been on the market for the past
ten years or so are listed below, each with a
brief description:
ALICE d'Isoft: - This software is designed
for non-technical users. It has an easy-to-use
interface and produces results through an
interactive decision tree. It creates its results
by running queries, and the output is generated
in the form of reports or charts. The software
shows the user the knowledge that lies hidden in
the database. It also makes predictions using
data that lies within the database (Isoft…,
2004).
AVS/Express Visualization Edition: - This is
advanced software used by technical and
experienced professionals. It provides
state-of-the-art technology for advanced
graphics, imaging, data visualization and presentation.
The software makes it easy for users to quickly
and interactively visualize their data (AVS,
2004).
Cognos: - The company was founded
in 1969 and has its corporate headquarters
in Ottawa, Canada, with a sales headquarters
in Burlington, Massachusetts, U.S.A. Its software
allows users to extract critical data from corporate
data assets through analysis, reporting and forecasting.
Cognos products fall into two basic categories:
business intelligence tools and 4GL tools. Though
the two categories are aimed at different user
markets, they share a common purpose: to streamline
business processes and increase productivity.
Knowledge Discovery One, Incorporated: - Knowledge
Discovery One, Inc. (KD1) was founded in January
1996 to build complete, easy-to-use applications
that allow retailers to better understand and
predict their customers' buying habits. Employing
knowledge discovery and data-mining techniques,
KD1's Retail Discovery Suite allows the retailer
to operate a more profitable organization by providing
a detailed understanding of their advertising,
merchandising, assortment, inventory, promotions,
and vendor performance issues (KD1, 2004).
dbProbe: - dbProbe is a business intelligence (OLAP
and reporting) tool that combines powerful data
analysis and scalability to thousands of users,
with simple deployment for administrators (no
client software to install). Users can drill down,
slice-and-dice, graph, filter, create, and share
reports and more. Data sources include MS OLE
DB for OLAP, Informix Metacube, and others.
Exchange Applications: - Exchange Applications,
a Boston-based software company, provides database-marketing
software and services that enable companies to
optimize the value in all their customer relationships
across the enterprise. ValEX, an integrated suite
of applications, enables companies to analyze
and understand how to best allocate marketing
dollars across customers and customer segments.
ValEX automates and accelerates a company’s
complete marketing process—data mining and
customer segmentation, business and marketing
planning, campaign execution, response tracking
and attribution, and campaign evaluation and refinement.
ValEX software enables non-technical marketing
professionals to perform marketing management
and campaign functions (exapps.com….., 2004).
KnowledgeSEEKER: - At the heart of what makes “KnowledgeSEEKER”
such a powerful and easy-to-use tool is its Decision
Tree Induction process, which in simplified terms
acts as an automated query generator. This Decision
Tree Induction process has the mathematical power
and crunch power to construct and run the queries
required. This process shows the combined dependencies
between multiple predictors and the analysis results
are presented in a highly intuitive colored classification
tree. Decision Trees allow for effective data
visualization and are extraordinarily easy to
understand and manipulate. “KnowledgeSEEKER”
findings can also be translated into a knowledge
base of rules or a set of executable programming
statements (KnowledgeSEEKER, 2004).
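To give a feel for how a decision tree induction
step can act as an automated query generator, the
sketch below tries each candidate predictor, groups
the records by its values (much as a GROUP BY query
would), and picks the predictor whose groups are
purest. The records, the field names and the use
of Gini impurity are assumptions made for this
example; this is not KnowledgeSEEKER's actual
algorithm.

```python
from collections import Counter

# Hypothetical customer records, invented for illustration.
records = [
    {"region": "north", "member": "yes", "bought": True},
    {"region": "north", "member": "no", "bought": False},
    {"region": "south", "member": "yes", "bought": True},
    {"region": "south", "member": "no", "bought": False},
    {"region": "north", "member": "yes", "bought": True},
    {"region": "south", "member": "no", "bought": True},
]


def gini(rows, target):
    """Gini impurity of the target values in rows (0.0 means perfectly pure)."""
    if not rows:
        return 0.0
    counts = Counter(row[target] for row in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())


def split_impurity(rows, predictor, target):
    """Size-weighted impurity of the groups formed by splitting on predictor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[predictor], []).append(row)
    return sum(len(g) / len(rows) * gini(g, target) for g in groups.values())


def best_predictor(rows, predictors, target):
    """Pick the predictor whose split leaves the least impurity."""
    return min(predictors, key=lambda p: split_impurity(rows, p, target))


choice = best_predictor(records, ["region", "member"], "bought")
print(choice)  # member
```

A full tree would repeat the same step inside each
resulting group, which is how the dependencies
between multiple predictors come to be shown
together.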
Open Visualization Data Explorer: - Open Visualization
Data Explorer is a full visualization environment
that gives users the ability to apply advanced
visualization and analysis techniques to their
data. These techniques can be applied to help
users gain new insights into data from applications
in a wide variety of fields including science,
engineering, medicine and business. Data Explorer
provides a full set of tools for manipulating,
transforming, processing, realizing, rendering
and animating data and allows for visualization
and analysis methods based on points, lines, areas,
volumes, images or geometric primitives in any
combination. Data Explorer is discipline-independent
and easily adapts to new applications and data.
The integrated object-oriented graphical user
interface is intuitive to learn and easy to use
(IBM Research…., 2004).
MARS: - MARS is a multivariate non-parametric regression
procedure introduced in 1991 by Stanford statistician
and physicist, Jerome Friedman. Salford Systems'
MARS, based on the original code, has been substantially
enhanced with new features and capabilities in
exclusive collaboration with Dr. Friedman. Vaunted
as the next frontier in data mining, MARS eliminates
the time consuming, trial-and-error process of
building accurate predictive models. MARS excels
at automatically finding optimal variable transformations
and interactions, the complex data structure that
often hides in high-dimensional data. This new-generation
approach to regression modeling effectively uncovers
business-critical data patterns and relationships
that are difficult, if not impossible, for other
approaches to uncover (Salford Systems, 2004).