Archive for the Category: Databases

Thursday, March 10th, 2016

Many times we find ourselves in a situation where we need to compare or manipulate values in a query based on different rows in the dataset, for example processing the current record by referring to the record just before or just after it.

In such situations we immediately think of using a cursor. But as we know, using a cursor is not a wise decision, as it has a performance impact.

Therefore, let's look at a very powerful feature introduced in SQL Server 2012: the new analytic functions LEAD() and LAG(). These functions access data from a subsequent row (LEAD) or a previous row (LAG) in the same result set without the use of a self-join.


This can be achieved with the following simple query:

SELECT s.SalesOrderID,s.SalesOrderDetailID,s.OrderQty,
LEAD(SalesOrderDetailID) OVER (ORDER BY SalesOrderDetailID) LeadValue,
LAG(SalesOrderDetailID) OVER (ORDER BY SalesOrderDetailID) LagValue
FROM Sales.SalesOrderDetail s

With the above query we have access only to the row immediately before or immediately after the current one. What if you had to access records that are N rows away from the current record?

SELECT s.SalesOrderID,s.SalesOrderDetailID,s.OrderQty,
LEAD(SalesOrderDetailID,2) OVER (ORDER BY SalesOrderDetailID) LeadValue,
LAG(SalesOrderDetailID,2) OVER (ORDER BY SalesOrderDetailID) LagValue
FROM Sales.SalesOrderDetail s 


If SQL Server is processing the current record and it happens to be the first row, and we are using the LAG function, then there are no records before the first row, so the LAG function returns NULL. If we want a default value in such a situation, we can pass a third argument to replace NULL with that value: LAG(SalesOrderDetailID, 2, 0).
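For example, a minimal sketch against the same AdventureWorks table, where the first two rows get 0 instead of NULL:

SELECT s.SalesOrderID,s.SalesOrderDetailID,s.OrderQty,
LAG(SalesOrderDetailID,2,0) OVER (ORDER BY SalesOrderDetailID) LagValue
FROM Sales.SalesOrderDetail s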

The real power of the LEAD and LAG functions shows when we are dealing with records that need to be grouped before some action is performed on them.


This can be achieved by using PARTITION BY in combination with LEAD or LAG:

SELECT s.SalesOrderID,s.SalesOrderDetailID,s.OrderQty,
LEAD(SalesOrderDetailID) OVER (PARTITION BY SalesOrderID
ORDER BY SalesOrderDetailID) LeadValue,
LAG(SalesOrderDetailID) OVER (PARTITION BY SalesOrderID
ORDER BY SalesOrderDetailID) LagValue
FROM Sales.SalesOrderDetail s

Tuesday, February 02nd, 2016

We are aware of quite a few advantages of using inheritance in C#, Java, and other languages.

For example, say we have a base class Asset, and derived classes for all the assets belonging to an organization, e.g. buildings, rooms, desks, computers, phones, etc. All of these can be managed through just the Asset base class.

What if we could achieve the same with database objects (tables)? If we had a mechanism to define a base table from which all other tables could derive, it would add a lot of flexibility and sense to the overall design, and save a lot of code complexity in handling different scenarios.

Say there are people in the organization who are allowed to use some of the assets assigned to them. In a typical scenario we would have one table each for buildings, rooms, desks, computers, and phones, and one table for storing people. Now, if we have to track the allocation of assets to a person, we would need five more mapping tables (e.g. person to buildings, person to rooms, person to computers, etc.).

If we have a single table to track all these assets, then we need just one mapping table, i.e. "Person to Asset". This can be achieved by using table inheritance concepts in the database.


There are mainly 3 types of inheritance:

Table per hierarchy (TPH)

Here all the data is stored in a single table with the union of all the columns of all the derived classes and the base class.



Here one row in the table is used per entity. E.g., if we are storing a Room, then the Id, Name, Capacity, and Type columns are populated with data, while the 'IsServer' column, which does not belong to this entity, is set to NULL. The Discriminator column is used to identify the type of the entity, i.e. it is populated with B for buildings, R for rooms, D for desks, and C for computers.
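As an illustrative sketch, a TPH table for this example might look as follows (the column names come from the description above; the data types are assumptions):

CREATE TABLE Assets (
    Id            INT IDENTITY(1,1) PRIMARY KEY,
    Name          VARCHAR(100) NOT NULL,
    Capacity      INT NULL,          -- only meaningful for buildings/rooms
    Type          VARCHAR(50) NULL,
    IsServer      BIT NULL,          -- only meaningful for computers
    Discriminator CHAR(1) NOT NULL   -- 'B', 'R', 'D' or 'C'
);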


Advantages:

  • Flat structure, faster reads/writes
  • Simpler queries in case you are using ORM tools for data access

Disadvantages:

  • All the entity columns must be nullable in the table; NOT NULL logic has to be handled at the ORM layer
  • Space is still allocated for columns that do not participate in the entity being stored
  • The table is not in normalized form

Table per type (TPT)

Here the data is stored in one table per class, so the data for each class resides in its own table. The relation with the base class is maintained by primary/foreign keys.


Each entity here is a new row in the Assets table with an auto-generated primary key; a corresponding entry is made in one of the child tables, depending on the type of the child, with the same primary/foreign key that was auto-generated in the Assets table.
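A minimal sketch of the TPT layout, with the same caveat that names and types are illustrative:

CREATE TABLE Assets (
    Id   INT IDENTITY(1,1) PRIMARY KEY,
    Name VARCHAR(100) NOT NULL
);

-- Each derived type stores only its own columns and shares the base key
CREATE TABLE Rooms (
    Id       INT PRIMARY KEY REFERENCES Assets(Id),
    Capacity INT NOT NULL
);

CREATE TABLE Computers (
    Id       INT PRIMARY KEY REFERENCES Assets(Id),
    IsServer BIT NOT NULL
);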


Advantages:

  • A nullable constraint can be added in the DB
  • No wastage of space, as each entity resides in its own type table
  • Tables are in normalized form

Disadvantages:

  • Due to normalization, more join statements are required, hence lower performance


Table per concrete class (TPC)

Here there is no table for the base type; every type has its own table with all the columns of the base class as well as its own properties.
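A minimal sketch, again with illustrative names and types; note there is no Assets table, and each concrete table repeats the base columns:

CREATE TABLE Rooms (
    Id       INT IDENTITY(1,1) PRIMARY KEY,
    Name     VARCHAR(100) NOT NULL,  -- base-class column repeated
    Capacity INT NOT NULL
);

CREATE TABLE Computers (
    Id       INT IDENTITY(1,1) PRIMARY KEY,
    Name     VARCHAR(100) NOT NULL,  -- base-class column repeated
    IsServer BIT NOT NULL
);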




Advantages:

  • A nullable constraint can be added in the DB
  • No wastage of space, as each entity resides in its own type table
  • Flat structure, faster reads/writes
  • Simpler queries in case you are using ORM tools for data access
  • Tables are in normalized form

Disadvantages:

  • If we want to define a relation with respect to the base class, it is not possible with a single mapping table; we need four mapping tables to map it to another entity

These concepts are supported out of the box by Entity Framework.


Wednesday, July 08th, 2015

Since the inception of SSIS, deployment was always tricky, and many challenges came up around deploying SSIS packages so that their execution could be automated/scheduled. These challenges pushed Microsoft forward, as we have seen the move from solution deployment in SSIS 2005/2008 to project deployment in SSIS 2012. It is needless to say that configurations have played a major role in taking SSIS to the great heights of what it is today: one of the most recognized ETL tools.

But personally I like SSIS 2012, because it has solved a lot of the problems which I faced in my previous projects. So let me define the problem statement for you.

“How can I deploy a list of more than 70 packages in one shot, with the least number of changes in parameters?”

SSIS 2005/2008 helped in identifying this problem, and after quite a workaround it was found that by changing the configurations in different packages, we can bind the packages with common parameters. The problem with that solution was that we could not deploy the projects separately: all the projects had to be deployed along with the solution, with the common configurations. With SSIS 2012, however, opportunity knocked with the separation of solution deployment and package deployment. The uniqueness of the project deployment model is that we can define different configurations and parameters for the respective projects, which allows a developer to design each project based on the defined requirements for an ETL.

This deployment model serves the purpose of distributing the projects at various levels. For example, if a developer wants to load the Stage database from a source system, we can define various layers like Logging (at DB level), Auditing (at DB level), Extraction, Transformation, Business Logic Implementation, Cleansing, and Load. All of the above layers can then be executed in one shot by making a parent project consisting of a master package calling the master packages of the respective layers.

The benefit of the above approach is that it is object and subject oriented. Also, as per the specific needs of each layer, the developer can define various parameters to handle the load in a particular manner.

Now the solution can be made simple by using the metadata of the packages. How? Let's find out below.

1)      Define a parent package for each project and call it <Project Name>_Master.dtsx

2)      Create a metadata based table with the following attributes.

CREATE TABLE [dbo].[PackageMetadata](
   [PackageID] [int] IDENTITY(1,1) NOT NULL,
   [PackageName] [varchar](100) NULL,
   [PackageArea] [varchar](100) NULL,
   [PackageDomainName] [varchar](100) NULL,
   [IsActive] [bit] NULL,
   [CreatedDate] [smalldatetime] NULL,
   [ModifiedDate] [smalldatetime] NULL,
   [SortKey] [smallint] NULL)

The above table is going to hold the metadata for the packages, including the package name, the area the package belongs to (for example, the project name), the domain of the package (for example, Stage), whether the package is active or not, the timestamps, and the sort key. The SortKey is one of the important attributes of the table, as it defines the flow of the data load.

Note that this is a custom-made table; we could also use the SSIS catalog based tables for the same job, but the advantage of a custom table is that we can alter and reuse the data (or metadata) based on the needs of the project/solution.

3)      Insert the relevant data in the metadata table and define the SortKeys. For example, the SortKey can be defined as 1 for all the master packages (parent packages) and as 2 for all the dependent packages (child packages), as in the sample inserts below.
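For illustration, the seed data might look like this (the package and project names are hypothetical):

INSERT INTO [dbo].[PackageMetadata]
    (PackageName, PackageArea, PackageDomainName, IsActive, CreatedDate, SortKey)
VALUES
    ('Stage_Master.dtsx',   'StageProject', 'Stage', 1, GETDATE(), 1),
    ('Load_Customers.dtsx', 'StageProject', 'Stage', 1, GETDATE(), 2),
    ('Load_Orders.dtsx',    'StageProject', 'Stage', 1, GETDATE(), 2);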

4)      While implementing the master package, define a data flow task with an object as the source. The source should be connected to the metadata table, with relevant filters on the area and the domain and a sort on the key, along the lines of the query below.
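The query behind that source could be along these lines (the filter values are assumptions taken from the example above):

SELECT PackageName
FROM [dbo].[PackageMetadata]
WHERE IsActive = 1
  AND PackageArea = 'StageProject'
  AND PackageDomainName = 'Stage'
ORDER BY SortKey;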

5)      Introduce a loop container in the package and loop over the packages returned by the above query from the data flow task.

 Parent Child Pattern in SSIS

Now we are all set: one parent will trigger the child packages and the load will be performed successfully. The benefit of this strategy is that it is easy to maintain and can be expanded to other packages with just one record insertion. Along with that, if logging is enabled, load performance and error checks can easily be performed.

Thursday, April 23rd, 2015

Phonetic algorithm:

A phonetic algorithm is an algorithm which searches for words with a similar sound. In other words, it matches words with the same or similar pronunciation. When we search for names in a database, it is difficult to find the same name under different spellings, for example "John – Jhon", "Thomas – Tomas", "Steven – Stephen", "Rakesh – Rakhesh". This problem can be solved using a phonetic algorithm.


Some of the phonetic algorithms are as below:

Soundex: Creates a four-character string: three numbers preceded by a letter.
Daitch-Mokotoff Soundex: An extended version of the above algorithm; creates six numbers as the output code.
Beider-Morse Phonetic Matching: Unlike the above two algorithms, it does not base the "sound" test on spelling, but on the linguistic properties of the language.
Kölner Phonetik: Focused on German words, with the same Soundex logic.

In this article, I will talk about the Soundex algorithm, which is the most popular of the phonetic algorithms.


This algorithm was developed by Robert C. Russell and Margaret K. Odell and was patented in 1918. It breaks a name down to a four-character code (a single letter followed by three numbers).
Let us try to create the code for the word "backscratcher" with this algorithm.


The algorithm is as follows:

1. Keep first letter of the word as is.
                backscratcher -> B

2. Ignore all the vowels (A, E, I, O, U) and the letters H, W, Y. Replace these characters with the number '0'.
                backscratcher -> B0CKSCR0TC00R

3. Replace the following letters with the corresponding number (as mentioned below)
                1 – B, F, P, V
                2 – C, G, J, K, Q, S, X, Z
                3 – D, T
                4 – L
                5 – M, N
                6 – R

                B0CKSCR0TC00R -> B02226032006

4. Remove consecutively repeated numbers
                B02226032006 -> B02603206

5. Remove all zeros and truncate to four characters. If the resulting code is shorter than four characters, pad it with zeros.
                B02603206 -> B26326 -> B263

Below are some sample outputs when we search for names with different spellings:

Names            Code
John – Jhon      J500
Marie – Mary     M600
Raghu – Ragu     R200
Prabu – Prabhu   P610
Marc – Mark      M620


Many databases like Oracle, SQL Server, MySQL, etc. have implemented this algorithm as a built-in function.

Syntax for SOUNDEX in SQL Server:

SOUNDEX ( character_expression )

Query to search for people with a name pronounced like 'Stephen':

SELECT name FROM PeopleTable WHERE SOUNDEX(name) = SOUNDEX('Stephen')
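As a quick sanity check, variants of the same name collapse to the same code:

SELECT SOUNDEX('Stephen') AS Code1, SOUNDEX('Steven') AS Code2  -- both return S315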

Thursday, February 26th, 2015

Need for Database Unit Testing

  • Database Unit Testing ensures iterative changes don't break existing functionality
  • Generates automated tests for programmable objects (functions, stored procedures, triggers, etc.)
  • Develop a group of tests to run with every database build
  • SQL Server unit tests establish a baseline state for your database and then help to verify any subsequent changes that you make to database objects

How to achieve?


  • Need to have the latest SQL Server Data Tools integrated with Visual Studio
  • The target platform/database needs to be set appropriately, in a compatible mode (VS2012 sets SQL Server 2014 as the default, so change the target server version accordingly)
  • The NuGet package manager is a must when dealing with a remote data server
  • Make sure the MSBuild engine is present (the above two items are taken care of if we have the VS2012 Ultimate version)
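To give a flavour of what such a test verifies, here is a minimal hand-written sketch; SSDT generates richer test classes, and dbo.GetAssetCount is a hypothetical scalar function used only for illustration:

-- Act: call the object under test
DECLARE @actual INT = (SELECT dbo.GetAssetCount());

-- Assert: fail the test if the baseline expectation is not met
IF @actual <> 5
    RAISERROR('Expected 5 assets, got %d', 16, 1, @actual);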

What next after Unit testing?

  • Integration with daily builds (CI)
  • Running automated DB unit tests along with application unit tests, in parallel or serially
  • Getting code coverage and code analysis metrics
Monday, December 01st, 2014

Data Warehousing is a big world to visit, but fortunately it's a flat world where we have only two ends: one end where the data is available in the source as raw, unstructured data, and the other end where the data is presentable and structured, with history available for analysis.

In this article, I am going to summarize the very first manager (the Load Manager), which is required to do the basic operations, but mostly to load the data into the databases from different types of data sources.


- This component performs the operations required for the extract and load process.
- The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.

The load manager performs the following functions:
- Extract the data from the source system.
- Fast-load the extracted data into a temporary data store.
- Perform simple transformations into a structure similar to the one in the data warehouse.

The picture below summarizes the basic functions of the load manager.


The data is extracted from the operational databases or from external information providers. Gateways are the application programs used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.
Complex transformations affect the speed of data processing, so it is more effective to load the data into the relational database prior to applying complex transformations and checks.
Gateway technology often proves unsuitable, since gateways tend not to be performant when large data volumes are involved. That said, some gateways can still perform fast bulk loads, provided the hardware supports all the required operations.

While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following steps:
- Strip out all the columns that are not required within the warehouse.
- Convert all the values to the required data types, etc.
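As a sketch of such a simple transformation during load (the staging and warehouse table names are assumed):

-- Keep only the required columns and convert values to the target data types
INSERT INTO warehouse.SalesTransaction (TransactionId, SaleDate, Amount)
SELECT TransactionId,
       CAST(SaleDateText AS DATE),
       CAST(AmountText AS DECIMAL(10, 2))
FROM staging.EposSales;   -- columns not required are simply not selected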

For more details and references, please refer to the book Data Warehousing in the Real World.

Friday, November 28th, 2014

Problem Statement - I want to send a flash report by email to my customers (data-driven email notifications)

Definition of a Flash Report – A report which summarizes the business with the help of a few key facts and figures

Requirements – MS SQL Server 2008/2012 (with Database Mail, i.e. SMTP, configured), HTML support

Solution -

  1. First, generate a SQL query to get the data which is supposed to be displayed in the flash report
  2. Use the below query template and replace the information as per your needs

DECLARE @Body varchar(max)
DECLARE @TableHead varchar(max)
DECLARE @TableTail varchar(max)
DECLARE @mailitem_id int
DECLARE @statusMsg varchar(max)
DECLARE @Error varchar(max)
DECLARE @Note varchar(max)

SET NOCOUNT ON;
SET @mailitem_id = NULL
SET @statusMsg = NULL
SET @Error = NULL
SET @Note = NULL
SET @TableTail = '</table></body></html>';

-- HTML layout --
SET @TableHead = '<html><head>' +
'<H1 style="color: #000000">HEADER OF TABLE</H1>' +
'<style>' +
'td {border: solid black 1px;padding-left:5px;padding-right:5px;padding-top:1px;padding-bottom:1px;font-size:9pt;color:Black;} ' +
'</style>' +
'</head>' +
'<body><table cellpadding=0 cellspacing=0 border=0>' +
'<tr bgcolor=#F6AC5D>' +
'<td align=center><b>Name of column</b></td>' +
'<td align=center><b>Name of column</b></td>' +
'<td align=center><b>Name of column</b></td>' +
'<td align=center><b>Name of column</b></td>' +
'<td align=center><b>Name of column</b></td></tr>';

-- Select information for the report --
SELECT @Body = (SELECT
Column AS [TD],
Column AS [TD],
Column AS [TD],
Column AS [TD],
Column AS [TD]
FROM [DB].[dbo].[Table]
WHERE -condition-
-- (whatever else you want to do …)
FOR XML RAW('tr'), ELEMENTS)

-- Replace the entity codes and row numbers --
SET @Body = Replace(@Body, '_x0020_', space(1))
SET @Body = Replace(@Body, '_x003D_', '=')
SET @Body = Replace(@Body, '<tr><TRRow>1</TRRow>', '<tr bgcolor=#C6CFFF>')
SET @Body = Replace(@Body, '<TRRow>0</TRRow>', '')

SET @Body = @TableHead + @Body + @TableTail

-- Return output --
SELECT @Body

EXEC msdb.dbo.sp_send_dbmail
@profile_name = 'email',          -- the Database Mail profile to send from
@recipients = 'user@example.com', -- placeholder; parameterized in step 3 below
@mailitem_id = @mailitem_id out,
@subject = 'subject Email',
@body = @Body,
@body_format = 'HTML';

3. Create a stored procedure with the parameters to be passed (for example: recipients, subject, etc.), along the lines of the sketch below.
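A hedged sketch of such a wrapper (the procedure and parameter names are illustrative):

CREATE PROCEDURE dbo.usp_SendFlashReport
    @recipients VARCHAR(MAX),
    @subject    VARCHAR(200)
AS
BEGIN
    DECLARE @Body VARCHAR(MAX);
    -- ... build @Body exactly as in the template above ...
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'email',
        @recipients   = @recipients,
        @subject      = @subject,
        @body         = @Body,
        @body_format  = 'HTML';
END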

4. Schedule the Stored Procedure on the SQL Server Agent

Result – The scheduled procedure will email the report to the recipients' mailboxes, providing updated information with every run.

Alternative Use – This concept can be used to provide regular notifications based on the success/failure of processes/scripts in a project, helping you act faster.

Possible Additions – The script above can be altered to use CSS, producing a more professional-looking report if required.

Friday, October 31st, 2014

Generally we come across scenarios wherein queries take a long time to execute. Most of the time the reason is that some indexes are missing. The below query helps to find the list of missing indexes.

The query returns the following fields:

  1. [Cost]: Determined based on user seeks and user scans on a table.
  2. [Database Name]: Name of the database where the table is present.
  3. [equality_columns]: Columns used in equality predicates, e.g. "SELECT * FROM student WHERE usn = 1".
  4. [inequality_columns]: Columns used in inequality predicates, e.g. "SELECT * FROM student WHERE usn > 10".
  5. [included_columns]: Columns that could be included while creating the index.



SELECT
MID.[statement] AS [Table Name]
, D.[name] AS [Database Name]
, MID.equality_columns
, MID.inequality_columns
, MID.included_columns
, ((MIGS.user_seeks + MIGS.user_scans) * MIGS.avg_total_user_cost * MIGS.avg_user_impact) AS [Cost]
FROM sys.dm_db_missing_index_groups MIG
INNER JOIN sys.dm_db_missing_index_details MID
ON MID.index_handle = MIG.index_handle
INNER JOIN sys.databases D
ON D.database_id = MID.database_id
INNER JOIN sys.dm_db_missing_index_group_stats MIGS
ON MIGS.group_handle = MIG.index_group_handle
ORDER BY [Cost] DESC


Wednesday, October 29th, 2014

Data compression is a technique used to reduce physical file sizes in SQL Server, which can greatly enhance the performance of many database applications. Data compression reduces disk I/O, thereby helping to enhance performance. Data compression is only available in the Enterprise edition (SQL Server 2008 & SQL Server 2008 R2) or higher.

When data compression is turned on for a table, the data is compressed and stored in a compressed format. Whenever data is read, it needs to be uncompressed. In other words, every time data passes to or from the storage engine, it has to be compressed or uncompressed. Even though this is an overhead on the CPU, the amount of disk I/O saved by compression increases the overall performance of SQL Server. It's a good practice to enable table compression only when you notice that the size of a table is huge.




Advantages:

  1. Reduces physical disk space
  2. Improves performance by reducing the number of logical reads on a table

Disadvantages:

  1. Compression adds additional CPU cycles. If you have CPU constraints, it's better to turn off table compression


There are two types of compression:


Row-level Data Compression: Row-level compression converts fixed-length data types into variable-length data types, freeing up empty space and thereby saving additional space. Since the data is compressed, more rows can fit into a single 8 KB page.


Page-level Data Compression: Page-level compression offers increased data compression over row-level compression; page-level compression is equal to row-level compression plus prefix and dictionary compression.


Steps you need to follow to enable data compression:


1. Right-click the table which needs to be compressed.

2. Go to Storage -> Manage Compression.

3. Select a compression type.

4. Click on Calculate to compare the amount of space saved when you opt for row-level or page-level compression.

5. The Data Compression wizard will prompt you to save a query which can be used to compress the selected table; the generated script is typically of the shape shown below.
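A hedged example (the table name is assumed); sp_estimate_data_compression_savings can preview the effect before rebuilding:

-- Preview the estimated savings for page compression
EXEC sp_estimate_data_compression_savings 'dbo', 'SalesOrderDetail', NULL, NULL, 'PAGE';

-- Enable page-level compression on the table
ALTER TABLE [dbo].[SalesOrderDetail] REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);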

Thursday, October 09th, 2014

Oracle uses the concept of the ROWID to uniquely identify each row in each table in the database. It is exposed as the pseudo-column ROWID.

Oracle8 introduced the concept of OBJECTS. Objects have identifiers which are added to the ROWID, giving an EXTENDED ROWID format which is 10 bytes long.

ROWID is a VARCHAR2 representation of a base 64 number.

Note: base 64 starts at B, not A; A is used to designate zero.

The ROWID is displayed as: OOOOOO.FFF.BBBBBB.SSS

    OOOOOO This is the Data Object Number
    FFF This is the Relative File Number
    BBBBBB This is the Block Number
    SSS This is the Slot Number
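You can decompose the extended ROWID yourself with Oracle's DBMS_ROWID package; a small sketch, assuming a table named emp:

SELECT ROWID,
       DBMS_ROWID.ROWID_OBJECT(ROWID)       AS object_no,  -- OOOOOO
       DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) AS file_no,    -- FFF
       DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) AS block_no,   -- BBBBBB
       DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)   AS row_no      -- SSS
FROM emp
WHERE ROWNUM <= 3;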

Multiple files can have the same relative file number because Oracle assigns the number based on the tablespace.

This means that these numbers are only unique per tablespace, which indicates that we cannot derive an absolute address directly from the new ROWID.

This means that the ROWID addressing scheme is tablespace relative.

The ROWID contains the data object number, which increments when an object's version changes.
An object's version changes whenever a table is truncated or a partition is moved.
The data object number is not related to the Object ID.
