Home > SQL Server Tips > Database Administration > Hardware clustering vs. replication: A case study
SQL Server Tips:
EMAIL THIS
 TIPS & NEWSLETTERS TOPICS 

DATABASE ADMINISTRATION

Hardware clustering vs. replication: A case study


Mich Talebzadeh, Contributor
05.04.2004
Rating: --- (out of 5)


Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us   


The following was based upon a real-life experience in a Tier 1 bank. We used Sybase tools like ASE and Replication Server but this case is very applicable to other engines such as Oracle with Data Guard, etc.

When I was working last year for a large bank, we had a disastrous experience that was saved by replication. Now I'm convinced that any hardware-based solution such as clustering is nowhere near as good as a software-based solution such as replication. Why bother spending large sums of money on hardware clustering when you can achieve the same results with a bit of ingenuity and thinking? The notes below will make my points a bit clearer.

I have always been fascinated by Peter Thawley's article entitled "High Availability in the Internet Age." This article was published in the International Sybase User Group (ISUG) journal in Q3 1999. The author brought up the downtime that eBay suffered as a result of a hardware fault that rendered their Oracle database corrupt. It took 24 hours for the Oracle database to be loaded back and in the meantime, the business lost a lot of income. Regardless of what caused the downtime, one would have expected the architect and DBAs to account for such failures. Although failures due to hardware problems are relatively rare due to redundancy and mirroring, they can and do happen.

The bank I worked for had a trading system that had been running for two years. It had relatively large databases on the order of 70GB, using Solaris 2.7 and Sybase 11.9.2 (the Stone Age by today's standards). They had been struggling to create a proper business contingency (BC) site for this database. The criteria were that a trade had to hit the database and the trader screen displayed an acknowledgement message within 3 seconds of trade entry. The system had to be capable of handling 60,000 trades input per hour with replication on, and the trade delivery from the production to BC database had to be designed in such a way th


Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us   


RELATED CONTENT
SQL Server High Availability, Scalability and Reliability
Top load balancing methods for SQL Server
Maintaining high availability of SQL Server virtual machines
Creating fault-tolerant SQL Server installations
Scaling up vs. scaling out with SQL Server 2008
How to configure storage in SQL Server database with more writes than reads
SQL Server database replication tutorial
Licensing a standby server for SQL Server replication
Get SQL Server log shipping functionality without Enterprise Edition
Monitor database mirroring and replication after a SQL Server upgrade
Upgrade live applications to SQL Server 2005 for high availability
SQL Server High Availability, Scalability and Reliability Research

SQL Server Replication
Top load balancing methods for SQL Server
New replication features in SQL Server 2008 and what they mean to you
SQL Server database replication tutorial
Licensing a standby server for SQL Server replication
Upgrade live applications to SQL Server 2005 for high availability
Tool to synchronize two SQL Server databases
Simplify SQL Server replication
Replication techniques in SQL Server
Podcast: SQL Server high availability options
Managing identity columns with replication in SQL Server

Database Administration
Top load balancing methods for SQL Server
Performance implications of transaction log autogrowth in SQL Server
The keys to database backup protection for SQL Server
Understanding transparent data encryption in SQL Server 2008
Working with sparse columns in SQL Server 2008
Determining the source of full transaction logs in SQL Server
Implementing SQL Server 2008 FILESTREAM functionality
Improving SQL Server full-text search performance
Using the OPENROWSET function in SQL Server
New replication features in SQL Server 2008 and what they mean to you

RELATED GLOSSARY TERMS
Terms from Whatis.com − the technology online dictionary
replication  (SearchSQLServer.com)

RELATED RESOURCES
2020software.com, trial software downloads for accounting software, ERP software, CRM software and business software systems
Search Bitpipe.com for the latest white papers and business webcasts
Whatis.com, the online computer dictionary


at it minimized or eliminated any trade losses. The system had to be capable of supporting 300 traders online.

Within two weeks of being on site we managed to create a BC system using Sybase replication, capable of handling the above criteria. By mid-November, we put the system into production and the client did some initial tests over a week-end that established that the replication was working in a timely manner and the trades were getting to the BC site. What is interesting is the rest of the story.

In the afternoon on a fateful business day, disaster struck in the form of a hardware problem on the data server. The Sybase error log started sending the following out:

ASE was running on an E4500 server with two Sun A5200 arrays for data storage. The problem was that a full array was taken out of action as a result of a single disk failure! You may argue the reasons for this. However, it appears that this is an inherent design fault of this type of array. (On three specific disk slots on A5200 you need to have disks. These disks are used to re-issue the "fcal signal" and if one of these fails and fcal signal does not get around, the array can fail.) As we had two arrays, one would have expected that the mirror array would have saved us. Unfortunately, that didn't happen in this case.

The problem was that although one of the storage arrays (out of two) was taken out of action, a Sybase device was residing on a volume that had both the underlying damaged disk and its mirror on the same array. This in turn was traced to Veritas that a few months before had a hot swap disk in the same array as the originally affected disk. As a result of human error, this problem was not detected or it was just ignored. So when the single array failed in the middle of writing data, a hole appeared in the trading database.

The point that I am trying to make is about the speed with which we managed to recover. We checked the BC database and the last trade entry there was recorded at two minutes before the crash. The counters showed that particular trade entry was indeed the last trade gone into the trading system. In this scenario all we did was to swap the interfaces files and asked the users to start their applications again. In effect we carried out a full test of the BC database and it tuned out to be fine, so that was the sign off!

The fundamental question is whether or not hardware redundancy in the form of hardware-level disk mirroring can ever provide complete protection for databases. Many organizations tend to use a form of local area (traditional disk mirroring) or wide area (Storage Area Networks) hardware mirroring to create redundancy and resiliency. For example, EMC's SRDF is an example of an extended-distance replication solution at the hardware level. As an alternative, software replication can be deployed. I tend to argue in favor of software solutions for the reasons explained below.

As a practitioner, I have always believed that software high availability (HA) solutions in the form of data replication (though somehow asynchronous) do have an edge on hardware-based disk-to-disk mirroring. My arguments here are based on the axioms that although databases are built on file systems or raw partitions, they fundamentally behave in a different way compared to pure file systems. Indeed it is a simplistic view to assume that disk-to-disk, bit-by-bit mirroring will be sufficient protection for your data.

HA disk-to-disk mirroring in its most common form is designed to do a bit-by-bit copy of data from the primary disk to the mirror. As long as the distance between two disk locations is not great and a fast connection such as Gigabit exists between these two sites, the writes will be almost simultaneous. At a first glance such solutions may look ideal; they can be almost off-the-shelf. Apply the solution using local or remote clustering and we have high availability of data.

In reality, databases are more complicated than a simple matter of bit-by-bit copying. A database engine performs a lot of tasks through the background processes in a process-based RDBMS like Oracle or through internal processes in a thread-based model like Sybase. These processes routinely perform tasks such as checkpointing and flushing of transaction logs and data that are necessary for maintenance of the database but in reality have no bearing to the availability of data. In other words, most of this form of data need not be copied or replicated. In comparison, in HA designs based on data replication, this redundant data is never replicated because it is not deemed necessary for the BC database. This brings us to the point of how software replication is different from hardware mirroring.

Replication in its most common form moves transactions (inserts, updates and deletes) at the table level from a source database to one or more destination databases. Compared to hardware mirroring, which is basically all or nothing, replication can

This points out another benefit of using replication technology for availability. Often customers use disk mirroring and disk snapshot technologies to maintain a hot database, often at a relatively high cost. These technologies do a good job with physical problems like a bad block on a disk. However, if there is data corruption caused by a software error in the I/O device driver or DBMS layer, that corruption will simply be propagated to the remote device! Since Sybase Replication Server operates at the logical level by turning transactions back into SQL operations against the replicate server(s), this risk is entirely avoided.


Rate this Tip
To rate tips, you must be a member of SearchSQLServer.com.
Register now to start rating these tips. Log in if you are already a member.




DISCLAIMER: Our Tips Exchange is a forum for you to share technical advice and expertise with your peers and to learn from other enterprise IT professionals. TechTarget provides the infrastructure to facilitate this sharing of information. However, we cannot guarantee the accuracy or validity of the material submitted. You agree that your use of the Ask The Expert services and your reliance on any questions, answers, information or other materials received through this Web site is at your own risk.



SQL Server Development - .NET, C#, T-SQL, Visual Basic
HomeNewsTopicsITKnowledge ExchangeTipsAsk the ExpertsMultimediaWhite PapersIT Downloads
About Us  |  Contact Us  |  For Advertisers  |  For Business Partners  |  Site Index  |  RSS
SEARCH 
TechTarget provides technology professionals with the information they need to perform their jobs - from developing strategy, to making cost-effective purchase decisions and managing their organizations' technology projects - with its network of technology-specific websites, events and online magazines.

TechTarget Corporate Web Site  |  Media Kits  |  Site Map




All Rights Reserved, Copyright 2005 - 2009, TechTarget | Read our Privacy Policy
  TechTarget - The IT Media ROI Experts