Wednesday, October 3, 2007

BI Primer

Recently, i am looking at the Oracle Business Intelligence (BI) solution and preparing a technical sharing session.

When I dive into the BI, i found it is a damn big topic and a lot of acronym and technical term need to be clear before get to understand it.

Information here is compiled from multiple source available from the internet to clear those prerequisite.

What is BI ?

Business intelligence (BI) is a business management term, which refers to applications and technologies that are used to gather, provide access to, and analyze data and information about company operations.

BI systems can help companies have a more comprehensive knowledge of the factors affecting their business, such as metrics on sales, production, internal operations, and they can help companies to make better business decisions.

BI software and applications includes a range of tools. Some BI applications are used to analyze performance, projects, or internal operations. Other BI applications are used to store and analyze data, such as Data mining (DM), Data Farming, and Data warehouses; Decision Support Systems (DSS) and Forecasting. Other BI applications are used to analyze or manage the "human" side of businesses, such as Customer Relationship Management (CRM) and Marketing tools and Human Resources applications.

BI often uses key performance indicators (KPIs) to assess the present state of business and to prescribe a course of action. The KPI methodology was further expanded with the Chief Performance Officer methodology which incorporated KPIs and root cause analysis into a single methodology.


What is OLTP ?

OLTP stand for Online Transaction Processing. This is a standard, normalized database structure. OLTP is designed for transactions, which means that inserts, updates, and deletes must be fast. Imagine a call center that takes orders. Call takers are continually taking calls and entering orders that may contain numerous items. Each order and each item must be inserted into a database. Since the performance of the database is critical, we want to maximize the speed of inserts (and updates and deletes). To maximize performance, we typically try to hold as few records in the database as possible. OLTP is typically for data entry and retrieval transaction processing.


What us OLAP ?

Online Analytical Processing is an approach to quickly providing answers to analytical queries that are multidimensional in nature. OLAP is part of the broader category BI, which also includes Extract transform load (ETL), relational reporting and data mining. The typical applications of OLAP are in business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas. The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).

The output of an OLAP query is typically displayed in a matrix (or pivot) format. The dimensions form the row and column of the matrix; the measures, the values.

What is an OLAP cube ?

The term "cube" comes from the geometric object that has three dimensions. OLAP cubes can have many more dimensions than 3, but the term continues to apply.

In database theory, an OLAP cube is an abstract representation of a projection of an RDBMS relation. It is an arrangement of data in arrays (alternatively called cubes) to allow fast analysis. The arrangement of data into cubes avoids a limitation of relational databases which are not well suited for near instantaneous analysis of large amounts of data. OLAP cubes can be thought of as extensions to the two-dimensional array of a spreadsheet. Because there can be more than three dimensions in an OLAP system the term hypercube is sometimes used. The commercial OLAP products have different methods of creating the cubes and hypercubes and of linking cubes and hypercubes.

What is Measure, Dimension & Hierarchy?

While most databases designed for OLTP such as those used in claims processing are designed for efficiency in data storage, OLAP cubes are designed for efficiency in data retrieval.

Regular relational databases treat all data into the database similarly, however OLAP cubes categorize data into "dimensions" and "measures".

Measures represent items that are counted, summarized or aggregated, such as costs or units of service.

A company might wish to analyse some financial data by product, by time-period, by city, by type of revenue and cost, and by comparing actual data with a budget. These additional methods of analysing the data are known as dimensions.

Dimensions are variables by which measures are summarized, such as hospitals, physicians, or dates of service. This organization of data greatly facilitates the ability to formulate data requests based on real-life situations.

Each of the elements of a dimension could be summarised using a hierarchy. For example May 2005 could be summarised into Second Quarter 2005 which in turn would be summarised in the Year 2005. Similarly the cities could be summarised into regions, countries and then global regions; products could be summarised into larger categories; and cost headings could be grouped into types of expenditure. Conversely the analyst could start at a highly summarised level such as the total difference between the actual results and the budget and drill down into the cube to discover which locations, products and periods had produced this difference.

In addition, many of queries that could be posed to the data are "pre-aggregated" in the database such that the answers have already been precalculated and can be reported without delay.

What is Oracle OLAP Cube?

Oracle OLAP Cube requires a specific set of materialized views to map to a star schema. A materialized view is a replica of a target master from a single point in time. The master can be either a master table at a master site or a master materialized view at a materialized view site. ( more about 1 | 2).

Why are OLAP cubes important ?

Before OLAP technology was well developed, data had to be extracted from databases using "queries".

This meant that the analyst had to structure a request to the database for the information desired, and then submitted this query to the database server. That server would processing query and return the results.

Depending on the size of the database and the data requested, this query could take minutes or hours to complete.

In this sense, the "online" aspect of this type of reporting is questionable. OLAP cubes are fundamentally different in that they "pre-aggregate" the data used to answer many of queries that are anticipated.


What is star schema?

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema, consisting of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The "facts" that the data warehouse helps analyze are classified along different "dimensions": the fact tables hold the main data, while the usually smaller dimension tables describe each value of a dimension and can be joined to fact tables as needed.

Dimension tables have a simple primary key, while fact tables have a compound primary key consisting of the aggregate of relevant dimension keys.

Facts and Dimensions

When we talk about the way we want to look at data, we usually want to see some sort of aggregated data. These data are called measures. These measures are numeric values that are measurable and additive. For example, our sales dollars are a perfect measure.
Just tracking measures isn't enough, however. We need to look at our measures using those "by conditions". These "by conditions" are called dimensions.

Therefore, in designing a star schema, our first order of business is usually to determine what we want to see (our measures) and how we want to see it (our dimensions).

Building the Fact Table

The Fact Table holds our measures, or facts. The measures are numeric and additive across some or all of the dimensions. For example, sales are numeric and we can look at total sales for a product, or category, and we can look at total sales by any time period. The sales figures are valid no matter how we slice the data.

While the dimension tables are short and fat, the fact tables are generally long and skinny. They are long because they can hold the number of records represented by the product of the counts in all the dimension tables.

What is data mining?

Data mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases".

Data mining involves sorting through large amounts of data and picking out relevant information. It is usually used by Business intelligence organizations, and financial analysts, but is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods.

What is oracle OLAP Solution?

Oracle OLAP solution has Oracle DB OLAP Option and Hyperion Essbase.

Oracle Database OLAP is optimal for accelerating query performance and
adding analytic content for SQL-based BI Tools on Oracle Database, whereas oracle’s Hyperion Essbase is ideal for performance management applications to be the de-facto world leader in planning, budgeting applications etc.

Both the OLAP Option and Essbase share certain core capabilities which are fast query performance, fast incremental update of data sets, rich calculation capabilities and dimensional calculation and query models.

Oracle OLAP—an option to Oracle Database 11g Enterprise Edition—is a calculation engine that supports the entire spectrum of advanced analytical applications. It extends the analytic capabilities of the Oracle database by providing new multidimensional datatypes, a multidimensional calculation engine, and a framework to build OLAP applications using SQL, PL/SQL, and Java. (Know more about oracle OLAP option)


Database vs. Data Warehouse

The primary difference betwen you application database and a data warehouse is that while the former is designed (and optimized) to record , the latter has to be designed (and optimized) to respond to analysis questions that are critical for your business. Application databases are OLTP (On-Line Transaction Processing) systems where every transation has to be recorded, and super-fast at that.This system is write-optimized, and you shouldn?¡¥t crib if your analysis query (read operation) takes a lot of time on such a system.

A Data Warehouse (DW) on the other end, is a database (yes, you are right, it?¡¥s a database) that is designed for facilitating querying and analysis. Often designed as OLAP (On-Line Analytical Processing) systems, these databases contain read-only data that can be queried and analysed far more efficiently as compared to your regular OLTP application databases. In this sense an OLAP system is designed to be read-optimized.

Creation of a DW leads to a direct increase in quality of analyses as the table structures are simpler (you keep only the needed information in simpler tables), standardized (well documented table structures), and denormalized (to reduce the linkages between tables and the corresponding complexity of queries). A DW drastically reduces the ??cost-per-analysis?¡¥ and thus permits more analysis per FTE. Having a well-designed DW is the foundation successful BI/Analytics initiatives are built upon (original post).

Data Warehouse vs Data Mart

Data Warehouses and Data Marts are conceptually different in scope. However, they are built using the exact same methods and procedures.

A Data Mart is a specific, subject oriented, repository of data designed to answer specific questions for a specific set of users. So an organization could have multiple data marts serving the needs of marketing, sales, operations, collections, etc. A data mart usually is organized as one dimensional model as a star-schema (OLAP cube) made of a fact table and multiple dimension tables.

In contrast, a Data Warehouse (DW) is a single organizational repository of enterprise wide data across many or all subject areas. The Data Warehouse is the authoritative repository of all the fact and dimension data (that is also available in the data marts) at an atomic level.

A data warehouse (or mart) is way of storing data for later retrieval. This retrieval is almost always used to support decision-making in the organization. That is why many data warehouses are considered to be DSS (Decision-Support Systems). oth a data warehouse and a data mart are storage mechanisms for read-only, historical, aggregated data.

As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages:

* Offline Operational Databases ?a Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance.
* Offline Data Warehouse ?a Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure
* Real Time Data Warehouse ?a Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)
* Integrated Data Warehouse ?a Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

The term data warehouse architecture is primarily used today to describe the overall structure of a Business Intelligence system. Other historical terms include decision support systems (DSS), management information systems (MIS), and others (original post).

List of reference i used to compile this primer:

Introduction to OLAP
http://www.dwreview.com/OLAP/Introduction_OLAP.html

Data Warehousing Overview
http://www.dwreview.com/DW_Overview.html

Designing OLAP Solutions
http://www.dwreview.com/OLAP/OLAP_Comparison.html

OLAP architectures
http://www.olapreport.com/Architectures.htm

Designing the Star Schema Database By Craig Utley
http://www.ciobriefings.com/whitepapers/StarSchema.asp

Open Source Analytics / Open source BI by Nishith
http://opensourceanalytics.com/

Discussion of designing OLAP Cubes
http://databaseanswers.org/designing_olap_cubes.htm

6 comments:

Will Dwinnell said...

Data mining involves sorting through large amounts of data and picking out relevant information.

It's worth noting that data mining typically involves the construction of predictive mathematical models, and is not a simple querying process.


-Will Dwinnell
Data Mining in MATLAB

My Tech Blog said...

Thanks for sharing.

Anonymous said...

Nice dispatch and this enter helped me alot in my college assignement. Gratefulness you as your information.

Anonymous said...

Correctly your article helped me terribly much in my college assignment. Hats high to you post, choice look progressive in behalf of more interrelated articles soon as its one of my favourite topic to read.

Mutant said...

Excellent summary and writeup. Good Job!

Unknown said...

Thank you for this, great summary and exactly what I was looking for!

´