Archive for the Category ◊ Data Warehouse ◊

Author:
Tuesday, May 25th, 2010

Data warehouse architecture is primarily based on the business processes of a business enterprise taking into consideration the data consolidation across the business enterprise with adequate security, data modeling and organization, extent of query requirements, meta data management and application, warehouse staging area planning for optimum bandwidth utilization and full technology implementation.

The Data Warehouse Architecture includes many facets. Some of these are listed as follows:

Process Architecture

Describes the number of stages and how data is processed to convert raw / transactional data into information for end user usage.

The data staging process includes three main areas of concerns or sub- processes for planning data warehouse architecture namely €œExtract€, €œTransform€ and €œLoad€.

These interrelated sub-processes are sometimes referred to as an €œETL€ process.

1) Extract- Since data for the data warehouse can come from different sources and may be of different types, the plan to extract the data along with appropriate compression and encryption techniques is an important requirement for consideration.

2) Transform- Transformation of data with appropriate conversion, aggregation and cleaning besides de-normalization and surrogate key management is also an important process to be planned for building a data warehouse.

3) Load- Steps to be considered to load data with optimization by considering the multiple areas where the data is targeted to be loaded and retrieved is also an important part of the data warehouse architecture plan.

Data Model Architecture

In Data Model Architecture (also known as Dimensional Data Model), there are 3 main data modeling styles for enterprise warehouses:

  1. 3rd Normal Form – Top Down Architecture, Top Down Implementation
  2. Federated Star Schemas – Bottom Up Architecture, Bottom Up Implementation
  3. Data Vault – Top Down Architecture, Bottom Up Implementation

Technology Architecture

Scalability and flexibility is required in all facets. The extent of these features is largely depending upon organizational size, business requirements, nature of business etc.

Technology or Technical architecture primary evolved from derivations from the process architecture, meta data management requirements based on business rules and security levels implementations and technology tool specific evaluation.

Besides these, the Technology architecture also looks into the various technology implementation standards in database management, database connectivity protocols (ODBC, JDBC, OLE DB etc), Middleware (based on ORB, RMI, COM/DOM etc.), Network protocols (DNS, LDAP etc) and other related technologies.

Information Architecture

Information Architecture is the process of translating the information from one form to another in a step by step sequence so as to manage the storage, retrieval, modification and deletion of the data in the data warehouse.

Resource Architecture

Resource architecture is related to software architecture in that many resources come from software resources. Resources are important because they help determine performance. Workload is the other part of the equation. If you have enough resources to complete the workload in the right amount of time, then performance will be high. If there are not enough resources for the workload, then performance will be low.

Various Architectures

Please notice that with the different architectures there is one that stands out: Data Model Architecture. What is happening in the integration industry at large is: the ability to integrate information across the enterprise is becoming dependent on the quality of the data model architecture below.

The ability to be compliant, consistent and repeatable depends on how the data model is built under the covers.

There are 3 main data modeling styles for enterprise warehouses:

  1. 3rd Normal Form – Top Down Architecture, Top Down Implementation
  2. Federated Star Schemas – Bottom Up Architecture, Bottom Up Implementation
  3. Data Vault – Top Down Architecture, Bottom Up Implementation

The point to Data Warehousing Architecture is it is not JUST a data warehouse anymore. It is now a full-scale data integration platform, including right-time (real-time) data, and batch or strategic data sets in a single, auditable (and integrated) data store.

Author:
Wednesday, April 28th, 2010

Overview of Data Warehousing with Materialized Views

An enterprise data warehouse contains historical detailed data about the organization. Typically, data flows from one or more online transaction processing (OLTP) databases into the data warehouse on a monthly, weekly, or daily basis. The data is usually processed in a staging file before being added to the data warehouse. Data warehouses typically range in size from tens of gigabytes to a few terabytes, usually with the vast majority of the data stored in a few very large fact tables.

One of the techniques employed in data warehouses to improve performance is the creation of summaries, or aggregates. They are a special kind of aggregate view which improves query execution times by recalculating expensive joins and aggregation operations prior to execution, and storing the results in a table in the database. For example, a table may be created which would contain the sum of sales by region and by product.

Today, organizations using summaries spend a significant amount of time manually creating summaries, identifying which ones to create, indexing the summaries, updating them, and advising their users on which ones to use. The introduction of summary management in the Oracle server changes the workload of the DBA dramatically and means the end-user no longer has to be aware of which summaries have been defined. The DBA creates one or more materialized views, which are the equivalent of a summary. The end-user queries the tables and views in the database and the query rewrite mechanism in the Oracle server automatically rewrites the SQL query to use the summary tables. This results in a significant improvement in response time for returning results from the query and eliminates the need for the end-user or database application to be aware of the summaries that exist within the data warehouse.

Although summaries are usually accessed indirectly via the query rewrite mechanism, an end-user or database application can construct queries which directly access the summaries. However, serious consideration should be given to whether users should be allowed to do this because, once the summaries are directly referenced in queries, the DBA will not be free to drop and create summaries without affecting applications.

The summaries or aggregates that are referred to in this book and in literature on data warehousing are created in Oracle using a schema object called a materialized view. Materialized views can be used to perform a number of roles, such as improving query performance or providing replicated data, as described below.

In data warehouses, materialized views can be used to recompute and store aggregated data such as sum of sales. Materialized views in these environments are typically referred to as summaries since they store summarized data. They can also be used to recompute joins with or without aggregations. So a materialized view is used to eliminate overhead associated with expensive joins or aggregations for a large or important class of queries.

Schema Design Guidelines for Materialized Views

Guideline 1: Your dimensions should either be denormalized (each dimension contained in one table) or the joins between tables in a normalized or partially normalized dimension should guarantee that each child-side row joins with one and only one parent-side row.
If desired, this condition can be enforced by adding FOREIGN KEY and NOT NULL constraints on the child-side join key(s) and PRIMARY KEY constraints on the parent-side join key(s). If your materialized view contains only a single detail table, or if it performs no aggregation, a preferred alternative would be to use outer joins in place of inner joins. In this case, the Oracle optimizer can guarantee the integrity of the result without enforced referential integrity constraints.
Guideline 2: If dimensions are denormalized or partially denormalized, hierarchical integrity must be maintained between the key columns of the dimension table. Each child key value must uniquely identify its parent key value, even if the dimension table is denormalized.
Guideline 3: Fact tables and dimension tables should similarly guarantee that each fact table row joins with one and only one dimension table row. This condition must be declared, and optionally enforced, by adding FOREIGN KEY and NOT NULL constraints on the fact key column(s) and PRIMARY KEY constraints on the dimension key column(s), or by using outer joins as described in Guideline 1. In a data warehouse, constraints are typically enabled with the NOVALIDATE and RELY options to avoid constraint enforcement performance overhead.
Guideline 4: Incremental loads of your detail data should be done using the SQL*Loader direct-path option, or any bulk loader utility that uses Oracle’s direct path interface (including INSERT AS SELECT with the APPEND or PARALLEL hints). If the materialized view contains more than one table and performs aggregation, or if materialized view logs are not defined, then performing any other type of DML to your data will necessitate a complete refresh.
Guideline 5: Horizontally partition your tables by a monotonically increasing time column if possible (preferably of type DATE). For each table, create a bitmap index for each key column, and create one local index that includes all the key columns. Stripe each horizontal partition across several storage devices for maximum performance.
Guideline 6: After each load and before refreshing your materialized view, use the VALIDATE_DIMENSION procedure of the DBMS_OLAP package to incrementally verify dimensional integrity.
Guideline 7: Horizontally partition and index the materialized view like the fact tables. Include a local concatenated index on all the materialized view keys.
Author:
Wednesday, March 24th, 2010

Concept of Data Modeling

The goals of this article are to overview fundamental data modeling skills that all developers should have, skills that can be applied on both traditional projects that take a serial approach to agile projects that take an evolutionary approach. Every software or IT professional should have some conceptual idea about Data modeling. They don’t need to be experts at data modeling, but they should be prepared to be involved in the creation of such a model, be able to read an existing data model, understand when and when not to create a data model, and appreciate fundamental data design techniques. Through this article I am introducing these skills. The primary audience for this article is application developers who need to gain an understanding of some of the critical activities performed by an Agile DBA. This understanding should lead to an appreciation of what Agile DBAs do and why they do them and it should help to bridge the communication gap between these two roles.

What is Data Modeling?

Data modeling is the act of exploring data-oriented structures. Like other modeling artifacts data models can be used for a variety of purposes, from high-level conceptual models to physical data models. From the point of view of an object-oriented developer data modeling is conceptually similar to class modeling. With data modeling you identify entity types whereas with class modeling you identify classes. Data attributes are assigned to entity types just as you would assign attributes and operations to classes. There are associations between entities, similar to the associations between classes €€œ relationships, inheritance, composition, and aggregation are all applicable concepts in data modeling.
Traditional data modeling is different from class modeling because it focuses solely on data €€œ class models allow you to explore both the behavior and data aspects of your domain, with a data model you can only explore data issues. Because of this focus data modelers have a tendency to be much better at getting the data €œright€ than object modelers. However, some people will model database methods (stored procedures, stored functions, and triggers) when they are physical data modeling. It depends on the situation of course, but I personally think that this is a good idea and promote the concept in my UML data modeling profile.
Although the focus of this article is data modeling, there are often alternatives to data-oriented artifacts (never forget Agile Modeling’s Multiple Models principle). For example, when it comes to conceptual modeling ORM diagrams aren’t your only option €€œ In addition to LDMs it is quite common for people to create UML class diagrams and even Class Responsibility Collaborator (CRC) cards instead. In fact, my experience is that CRC cards are superior to ORM diagrams because it is very easy to get project stakeholders actively involved in the creation of the model. Instead of a traditional, analyst-led drawing session you can instead facilitate stakeholders through the creation of CRC cards.

Type of Data Models

There are three types of data models practiced in the industry:
€¢ Conceptual data models. These models, sometimes called domain models, are typically used to explore domain concepts with project stakeholders. On Agile teams high-level conceptual models are often created as part of your initial requirements envisioning efforts as they are used to explore the high-level static business structures and concepts. On traditional teams conceptual data models are often created as the precursor to LDMs or as alternatives to LDMs.
€¢ Logical data models (LDMs). LDMs are used to explore the domain concepts, and their relationships, of your problem domain. This could be done for the scope of a single project or for your entire enterprise. LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities. LDMs are rarely used on Agile projects although often are on traditional projects (where they rarely seem to add much value in practice).
€¢ Physical data models (PDMs). PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional projects and as a result the focus of this article is on physical modeling.

How to Model Data

It is critical for an application developer to have a grasp of the fundamentals of data modeling so they can not only read data models but also work effectively with Agile DBAs who are responsible for the data-oriented aspects of your project. Your goal reading this section is not to learn how to become a data modeler, instead it is simply to gain an appreciation of what is involved.
The following tasks are performed in an iterative manner:
€¢ Identify entity types
€¢ Identify attributes
€¢ Apply naming conventions
€¢ Identify relationships
€¢ Apply data model patterns
€¢ Assign keys
€¢ Normalize to reduce data redundancy
€¢ Denormalize to improve performance

We will see description of each task and more about data modeling in next blog.

Category: Data Warehouse | Tags: ,  | Leave a Comment
Author:
Friday, January 29th, 2010

Introduction to Metadata

Metadata or meta data or meta-data also sometimes called metainformation is €œdata about data.€ Metadata is an emerging practice in the field of librarianship, information science, information technology and GIS. It can be applied to a vast array of objects including both physical and electronic items such as raw data, books, CDs, DVDs, images, maps, database tables, and web pages. Since the emergence of the Dublin Core metadata set and the internet, use of metadata has experienced a considerable growth in popularity as businesses and other organizations seek to organize rapidly growing volumes of data and information.

Importance of metadata in data warehouse

Good metadata is essential to the effective operation of a data warehouse and it is used in data acquisition/collection, data transformation, and data access. Acquisition metadata maps the translation of information from the operational system to the analytical system. This includes an extract history describing data origins, updates, algorithms used to summarize data, and frequency of extractions from operational systems. Transformation metadata includes a history of data transformations, changes in names, and other physical characteristics. Access metadata provides navigation and graphical user interfaces that allow non-technical business users to interact intuitively with the contents of the warehouse. And on top of these three types of metadata, a warehouse needs basic operational metadata, such as procedures on how a data warehouse is used and accessed, procedures on monitoring the growth of the data warehouse relative to the available storage space, and authorizations on who is responsible for and who has access to the data in the data warehouse and data in the operational system.

Significance in data warehouse

Metadata is your control panel to the data warehouse.  It is data that describes the data warehousing and business intelligence system:

  • Reports
  • Cubes
  • Tables (Records, Segments, Entities, etc.)
  • Columns (Fields, Attributes, Data Elements, etc.)
  • Keys
  • Indexes

Metadata is often used to control the handling of data and describes:

  • Rules
  • Transformations
  • Aggregations
  • Mappings

The power of metadata is that enables data warehousing personnel to develop and control the system without writing code in languages such as: Java, C# or Visual Basic.  This saves time and money both in the initial set up and on going management.
Data Warehouse Metadata
Data warehousing has specific metadata requirements.  Metadata that describes tables typically includes:

  • Physical Name
  • Logical Name
  • Type: Fact, Dimension, Bridge
  • Role: Legacy, OLTP, Stage,
  • DBMS: DB2, Informix, MS SQL Server, Oracle, Sybase
  • Location
  • Definition
  • Notes

Metadata describes columns within tables:

  • Physical Name
  • Logical Name
  • Order in Table
  • Datatype
  • Length
  • Decimal Positions
  • Nullable/Required
  • Default Value
  • Edit Rules
  • Definition
  • Notes

How can Data Warehousing Metadata be managed?
Data warehousing and business intelligence metadata is best managed through a combination of people, process and tools.

The people side requires that people be trained in the importance and use of metadata.  They need to understand how and when to use tools as well as the benefits to be gained through metadata.

The process side incorporates metadata management into the data warehousing and business intelligence life cycle.  As the life cycle progresses metadata is entered into the appropriate tool and stored in a metadata repository for further use.

Metadata can be managed through individual tools:

  • Metadata manager / repository
  • Metadata extract tools
  • Data modeling
  • ETL
  • BI Reporting

Metadata Manager / Repository
Metadata can be managed through a shared repository that combines information from multiple sources.

The metadata manager can be purchased as a software package or built as “home grown” system.  Many organizations start with a spreadsheet containing data definitions and then grow to a more sophisticated approach.
Extracting Metadata from Input Sources
Metadata can be obtained through a manual process of keying in metadata or through automated processes. Scanners can extract metadata from text such as SQL DDL or COBOL programs. Other tools can directly access metadata through SQL catalogs and other metadata sources.

Picking the appropriate metadata extract tools is a key part of metadata management.

Many data modeling tools include a metadata extract capability – otherwise known as “reverse engineering”.  Through this tool, database information about tables and columns can be extracted.  The information can then be exported from the data modeling tool to the metadata manager.

Author:
Tuesday, October 27th, 2009

 

DATA MINING

 

Data Mining is an analytic process designed to explore data (usually large amounts of data – typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction – and predictive data mining is the most common type of data mining and one that has the most direct business applications.

The process of data mining consists of three stages:

(1) The initial exploration,

(2) Model building or pattern identification with validation/verification, and

(3) Deployment (i.e., the application of the model to new data in order to generate predictions).

(1) Exploration:

This stage usually starts with data preparation which may involve cleaning data, data transformations, selecting subsets of records and – in case of data sets with large numbers of variables (“fields”) – performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anywhere between a simple choice of straightforward predictors for a regression model, to elaborate exploratory analyses using a wide variety of graphical and statistical methods  in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.

Model building and validation:

This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal – many of which are based on so-called “competitive evaluation of models,” that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques – which are often considered the core of predictive data mining – include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.

Deployment:

That final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.

The concept of Data Mining is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business Data Mining, but Data Mining is still based on the conceptual principles of statistics including the traditional Exploratory Data Analysis (EDA) and modeling and it shares with them both some components of its general approaches and specific techniques.

However, an important general difference in the focus and purpose between Data Mining and the traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented towards applications than the basic nature of the underlying phenomena. In other words, Data Mining is relatively less concerned with identifying the specific relations between the involved variables. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables are not the main goal of Data Mining. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, Data Mining accepts among others a “black box” approach to data exploration or knowledge discovery and uses not only the traditional Exploratory Data Analysis (EDA) techniques, but also such techniques as Neural Networks which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based.

 

Data Mining in Business

Through the use of automated statistical analysis (or “data mining”) techniques, businesses are discovering new trends and patterns of behavior that previously went unnoticed. Once they’ve uncovered this vital intelligence, it can be used in a predictive manner for a variety of applications.

 

Gathering Data

The first step toward building a productive data mining program is to gather data. Most businesses already perform these data gathering tasks to some extent — the key here is to locate the data critical to your business, refine it and prepare it for the data mining process. If you’re currently tracking customer data in a modern DBMS, chances are you’re almost done.

 

Selecting an Algorithm

At this point, take a moment to pat yourself on the back. You have a data warehouse! The next step is to choose one or more data mining algorithms to apply to your problem. If you’re just starting out, it’s probably a good idea to experiment with several techniques to give yourself a feel for how they work. Your choice of algorithm will depend upon the data you’ve gathered, the problem you’re trying to solve and the computing tools you have available to you. Let’s take a brief look at two of the more popular algorithms.

1) Regression

Regression is the oldest and most well-known statistical technique that the data mining community utilizes. Basically, regression takes a numerical dataset and develops a mathematical formula that fits the data. When you’re ready to use the results to predict future behavior, you simply take your new data, plug it into the developed formula and you’ve got a prediction! The major limitation of this technique is that it only works well with continuous quantitative data (like weight, speed or age). If you’re working with categorical data where order is not significant (like color, name or gender) you’re better off choosing another technique.

2) Classification

Working with categorical data or a mixture of continuous numeric and categorical data? Classification analysis might suit your needs well. This technique is capable of processing a wider variety of data than regression and is growing in popularity. You’ll also find output that is much easier to interpret. Instead of the complicated mathematical formula given by the regression technique you’ll receive a decision tree that requires a series of binary decisions. One popular classification algorithm is the k-means clustering algorithm.

In the next Ill discuss about some crucial concepts of data mining.

Author:
Tuesday, September 01st, 2009

Data Profiling and Automated Cleansing Using Oracle Warehouse Builder

Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:

  1. Find out whether existing data can easily be used for other purposes
  2. Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category
  3. Give metrics on data quality, including whether the data conforms to particular standards or patterns
  4. Assess the risk involved in integrating data for new applications, including the challenges of joins
  5. Assess whether metadata accurately describes the actual values in the source database
  6. Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
  7. Have an enterprise view of all data, for uses such as Master Data Management where key data is needed, or Data governance for improving data quality.

Most organizations build a data warehouse to provide an integrated, reliable, and consistent €œsingle version of the truth.€ Data is usually sourced from a number of systems and has to be extracted, cleansed, and integrated before being made available for users to query.

The quality of the data loaded into the data warehouse is often variable, however, and for that reason, historically the process of profiling your source data has been a time-consuming, manual process that has required either lots of experience with SQL*Plus or the purchase of an expensive third-party tool.

Oracle Warehouse Builder has the built in ability to profile data and no knowledge of SQL*Plus is required. Furthermore, the data profiles through Oracle Warehouse Builder can be used to generate automatic corrections to the data.

Data Profiling and Correcting Within Oracle Warehouse Builder

Data within your data warehouse can only be turned into actionable information when you are confident of its reliability. When you bring data into your data warehouse, you need to first understand the structure and the meaning of your data, and then assess the quality and the extent to which you may need to cleanse and transform it. Once you know what actions you need to take, you then need to make the required corrections to the data, and put in place a means to detect and correct any more errors that might occur in future loads. To do this, Oracle Warehouse Builder includes three new features that make this process simple and straightforward:

  1. Graphical Data Profiler€€Enables you to understand the structure, semantics, content, anomalies, and outliers present in your data, and derive data rules that will later be used within your data warehouse
  2. Correction Wizard€€Takes your data rules and applies them to your data, automatically generating correction mappings to cleanse and transform your data
  3. Data Auditor€€Takes your data rules and monitors the quality of subsequent data loads

Apart from removing the need for complex SQL*Plus scripts or third-party tools, doing your data profiling and corrections within Oracle Warehouse Builder has several advantages. The metadata that you generate about your data quality is stored alongside the other metadata in your design repository. Also, the mappings used to correct your data are regular Oracle Warehouse Builder mappings and can be monitored and managed with all of your other ETL (extract, transform, and load) processes. Doing your data cleansing and profiling within Oracle Warehouse Builder means that you only have to learn a single tool, and in addition, by integrating this process with your other ETL work, you ensure that data quality and data cleansing becomes an integral part of your data warehouse build process, and not just an afterthought.

Category: Data Warehouse | Tags: ,  | 2 Comments
Author:
Tuesday, July 14th, 2009

Why Warehouse security?

Warehouse is a huge collection of data. So unlike other databases, its security is more important from both internal and external threats. In general, when analysts and consultants take about data security the emphasis, or perceived emphasis, seems to be on the operational systems especially those, which are available to customers and partners. Everybody who shops online knows they should always look for a secure connection icon or some indication that their personal and credit card details will be securely managed when they submit purchase.

Warehouse is the integrated, time variant, nonvolatile collection of data which is used for decision making processes by management. The normal ETL process uses 3 tier architecture to load data in the warehouse. It takes data from OLTP systems (entered by users), after cleaning and passing through transformation logic is loaded into a central repository €€œ a data warehouse. 

The fact is the data in a data warehouse is even more valuable to a hacker since it has all been nicely cleaned and validated. Most importantly it is all in one place. Therefore, as more and more data is pushed into the data warehouse and organizations open up this information to more and more people, the chances of someone extracting (illegally) data must also increase. Also in the past data warehouse of many well known companies (like Telco few weeks ago) were hacked.

In reality most data warehouse probably have multiple data entry and exit points when you take into consideration factors such as ETL operations to move data, dashboards, reports that are sent to mobiles, adhoc query tools, email alerts etc. The ways in which data can be hacked are numerous but, with careful planning, it is possible to secure a data warehouse.

How to secure a data warehouse?

As the subject of data security is such a broad subject, in this blog post we will look at how Oracle can help you encrypt network traffic going to and from the data warehouse.

There are four basic areas to consider:

  • Access Control €€œ Oracle database vault and label security
  • Data Protection €€œ Oracle advanced security and backup access
  • Monitoring €€œ Oracle audit vault and EM configuration pack
  • User Management €€œ Identity management and user security

Oracle Database provides features to support the requirements of each of above four areas. For example, for implementing access control Oracle provides Label Security, Virtual Private Databases and Database Vault.

But before we consider locking down the data within the database we need to consider how data is moved, or transmitted around, the organization. Encryption of network data provides data privacy so that unauthorized parties are not able to view plaintext data as it passes over the network. This is becoming increasingly important as the data warehouse is linked directly into the operational systems to provide additional intelligence. While the value of the process that is moving data around is limited, i.e. these are not monetary transactions that can be high jacked and used to move money to another location, each process could still be transmitting valuable data such as security numbers, date of birth, zip code, credit balance, account numbers etc. that could be used outside of the organization for other illegal purposes. While there are many tools to make sure only authorized people access specific types of data, many companies freely broadcast this type of data around their networks with little or no encryption to protect it from hackers using sniffers. With so many processes, both internal and external, accessing the data warehouse the need to encrypt the network traffic carrying data to and from the data warehouse is becoming increasingly important. Every organization is moving towards active data warehousing. According to me in reality they need to move to a secure active data warehouse.

Secure transmission of data with Oracle

Oracle Advanced Security includes support for configuring native Oracle Net Services data encryption and integrity. Oracle Advanced Security provides the Advanced Encryption Standard (AES), DES, 3DES, and RC4 symmetric cryptosystems for protecting the confidentiality of Oracle Net Services traffic. This means that all Net Services traffic including ETL jobs generated by Warehouse Builder, manually coded PL/SQL ETL jobs, BI reports, dash board widgets etc are encrypted into unintelligible cipher text based on a key, in such a way that it is very hard (computationally infeasible) to convert cipher text back into its corresponding plaintext without knowledge of the correct key.

To check if the advanced is installed in your system is quite simple. Just run the program $ORACLE_HOME/bin/adapters. This will generate a list of all the supported connection and security drivers.

Data network encryption setup

Normally, the network or security administrator who sets up the encryption and integrity configuration parameters manages the configuration process. Encryption and integrity parameters are defined by modifying the sqlnet.ora file on the clients and the servers on the network. It is possible to configure any or all of the available Oracle Advanced Security encryption algorithms, and either or both of the available integrity algorithms. However, only one encryption algorithm and one integrity algorithm are used for each connect session.

A network connection can support more than one encryption algorithm and more than one integrity algorithm. When a connection is made, the server selects which algorithm to use based on the list of algorithms specified in the sqlnet.ora files. The server searches for a match between the algorithms listed on both the client and the server. It then picks the first algorithm in its own list that also appears in the client list. If one side of the connection does not specify an algorithm list, all the algorithms installed on that side are acceptable.

First run the Oracle Net Manager utility. This is located in the $ORACLE_HOME/bin directory. This will automate the setting of the required parameters in the various configuration files.

Once Net Manager is started, select the €˜Profile’ and then use the pull down menu (€œNaming€ is selected by default) to select the €œOracle Advanced Security€.

Once Oracle Advanced Security is correctly installed, it should be possible to access encryption screen. 

Selecting the integration tab gives the list of available methods. Choice depends on the level of encryptions required and in some cases geographical location as there are limitations imposed by many governments.

 For data encryptions, there are five choices:

1) Advanced Encryption Standard (AES):

This is the new Federal Information Processing Standard (FIPS) encryption      algorithm. It can be used by all businesses to protect sensitive data over a network.

2) Data Encryption Standard (DES):

DES has been a U.S. government standard for many years and is sometimes mandated in the financial services industry. Because it has been a standard for so long, DES is deployed throughout the world for use in a wide variety of applications.

3) Triple DES Support:

3DES encrypts message data with three passes of the DES algorithm and provides a high degree of message security. However, there is a performance penalty. 3DES typically takes three times as long to encrypt a data block when compared to the standard DES algorithm. This will obviously be of concern to data warehouse DBAs where data is moved across the network loaded in real/near-real time.

4) DES40:

The DES40 was designed to provide DES-based encryption to customers outside the U.S. and Canada at a time when the U.S. export laws were more restrictive. Now, in Oracle Advanced Security 11g Release 1 (11.1), DES40, DES, and 3DES are all available for export. DES40 is still supported to provide backward-compatibility for international customers.

5) RSA RC4:

The RC4 algorithm, developed by RSA Data Security Inc., has become the international standard for high-speed data encryption. RC4 is a variable key-length stream cipher that operates at several times the speed of DES, making it possible to encrypt large, bulk data transfers with minimal performance consequences. This is probably the ideal choice for data warehousing since it has minimal performance implications.

Each of these methods comes in a variety of flavors offering differing levels of key length. Therefore, in the list of available methods various values such as AES256, AES128, 3DES168 etc can be seen.

Once the method is selected, next step is to determine how encryption will work. There are four possible methods:

€¢ Rejected
€¢ Accepted
€¢ Requested
€¢ Required

The last step is to set the seed. Seed is symmetric key cipher operating on fixed length group of bits with an unvarying transformation. Some of the encryption methods require a seed that must be 10 to 70 characters in length; however, it changed at any time.

Once the setting is done save them to write these changes to the configuration files.

The SQLNET.ORA file on the server should now look like this:

SQLNET.AUTHENTICATION_SERVICES= (BEQ, TCPS)

SQLNET.ENCRYPTION_CLIENT = required
SQLNET.ENCRYPTION_SERVER = required
SQLNET.CRYPTO_SEED = ’04402314687768′
SQLNET.ENCRYPTION_TYPES_CLIENT= (RC4_256)
SQLNET.ENCRYPTION_TYPES_SERVER= (RC4_256)

Next amend the SQLNet.ora file on the client side so it looks like this (this can be done either by copying over the SQLNet.ora file from the server and taking out the SERVER related entries or you use the NetManager utility that is installed as part of the database client installation process):

SQLNET.ENCRYPTION_CLIENT= required
SQLNET.AUTHENTICATION_TYPES_CLIENT= (RC4_256)
SQLNET.CRYPTO_SEED= ’04402314687768′

NAMES.DIRECTORY_PATH= (TNSNAMES)
TRACE_LEVEL_CLIENT=10

The TRACE_LEVEL_CLIENT setting will allow us to check if the data is being encrypted.

To check if the working is expected, take a line sniffer and run a query from SQLPlus before you enable encryption, then enable encryption and the packets of data should now be scrambled. Any client that uses SQLNet to access the database will use encrypts to all data transmissions. If a client tries to us an incorrect SQLNet configuration then the connection will be refused and a TNS-12660 error will be returned to the client.