
Log shipping: Four tips to maintain RTO and RPO with SQL Server

It's not uncommon for DBAs to be lulled into a false sense of security by log shipping, without realizing the negative effects it could have on disaster recovery.

As you most likely know, log shipping is one of several high-availability features built into Microsoft SQL Server.

It allows you to automate transaction log backups, copy the backup files to a standby server and restore them there.

The main idea behind log shipping is to have a standby server available with a fairly recent copy of one or more databases. While it is very easy to run through the wizard and set it up, log shipping still requires a fair amount of planning and consideration to implement a successful disaster recovery plan.

More advice for log shipping with SQL Server

There are two acronyms commonly used when talking about disaster recovery – RPO and RTO. RPO stands for recovery point objective, which is basically the amount of data you can reasonably afford to lose. RPO is measured in time, typically minutes. The other term, RTO, stands for recovery time objective. The question here is: how long can the database be down before the downtime becomes unacceptable?

For some companies, log shipping provides a false sense of security. This is because once a DBA sets up log shipping to occur every 10 minutes to satisfy a 10-minute RPO requirement (for example), everyone assumes they cannot lose more than 10 minutes of data, while in reality, they still could.

There are several major factors that can affect SQL Server's ability to back up, ship and restore within a defined time interval, negatively affecting both RPO and RTO:


  1. Index defragmentation. Rebuilding of indexes should be a part of your regular maintenance tasks. Since this is a logged activity, however, the log backups taken while the re-indexing job is running can become huge.

    On large databases, it is not uncommon to generate more than 100 GB worth of log backups within a single hour. These backup files then need to be copied and restored. When restoring a log backup, all of its transactions must be replayed on the standby server. Since it is generally slower to replay transactions than to back them up, restoring a file takes much longer than creating the backup did.

    If you are re-indexing large databases, you may run into a scenario where the standby server starts running behind with restoring log backups. I saw one case where a large database generated about 200 GB of backup files within 90 minutes and the standby server couldn't keep up restoring them, so it started falling behind by as much as 130 minutes. This had a negative effect on RTO because if the primary server died towards the end of the re-indexing job, it would mean waiting almost two hours for the standby server to reapply all the transactions before the database could be brought online. If the staff decided not to wait that long, they couldn't restore all the backups and their RPO would not be met. Why? Because they would lose more than the 10 minutes that had been defined as acceptable RPO.

    For this reason, I highly recommend watching the log shipping monitor and keeping an eye on the delta for the restore job during the re-indexing process. If you notice that the standby is falling behind with the restore job, try to spread your re-indexing across multiple days, or use the WAITFOR DELAY command to pause the re-indexing job for a few seconds between rebuilding indexes to allow log shipping to catch up.
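    As a rough illustration, the pause between rebuilds might look like the sketch below. The table names are placeholders, and ALTER INDEX assumes SQL Server 2005 or later (on SQL Server 2000 you would use DBCC DBREINDEX instead):

```sql
-- Sketch: rebuild indexes one table at a time, pausing between rebuilds
-- so the log shipping restore job can catch up.
-- dbo.Orders and dbo.OrderDetails are placeholder table names.
ALTER INDEX ALL ON dbo.Orders REBUILD;
WAITFOR DELAY '00:00:30';   -- pause 30 seconds before the next rebuild

ALTER INDEX ALL ON dbo.OrderDetails REBUILD;
WAITFOR DELAY '00:00:30';
```

    Tune the delay to however long the standby typically needs to restore the log generated by one rebuild.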

  2. Network speed. If your network speed is relatively slow, or if it slows down when you are copying large files, the copy job might fall behind and not be able to keep up. One thing you can do is set a low threshold on log shipping notifications and configure SQL Server to notify you if log shipping falls behind by more than 30 minutes. Then, if you don't receive notifications that your copy job is falling behind, you'll know the network speed is likely capable of handling the data transfer.
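    If you want to check the delta yourself rather than wait for a notification, a query along these lines against the log shipping monitor tables on the standby server can show how far the restore has fallen behind (the table and column names assume the SQL Server 2005+ msdb monitor tables):

```sql
-- Sketch, run on the standby server: how far behind is each
-- log-shipped database? Assumes the msdb monitor tables.
SELECT secondary_database,
       last_copied_date,
       last_restored_date,
       DATEDIFF(MINUTE, last_restored_date, GETDATE()) AS minutes_behind
FROM   msdb.dbo.log_shipping_monitor_secondary;
```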
  3. Overlapping of backup and copy jobs. When you define log shipping, you configure time intervals for three different jobs – backup, copy and restore. If the backup and the copy job kick off at the same time, the copy job will likely not pick up the file that was just backed up. This is because while it can take several seconds or minutes to finish a log backup, the copy job just grabs all the files that are available in the backup folder and copies them to the destination server. This means it might take two time intervals instead of one before a backup file ends up on the standby server.

    For example, let's say you defined your RPO to be 10 minutes. So in order not to lose more than 10 minutes worth of data, you defined your log shipping jobs to run every 10 minutes. At 2:30, when the copy and the backup jobs kick off, the copy job will not pick up the 2:30 backup file because it will not be available at 2:30 on the nose. Instead, the copy job will copy the 2:20 backup file, the one that was not ready 10 minutes earlier for the same reason. Your restore job will then restore the 2:20 file on the other end and your standby server will be current as of 2:20. The 2:30 backup file would get picked up at 2:40, but what if your primary server dies at 2:39 and you need to activate your standby server? Now all of a sudden you realize that you lost not 10, but 19 minutes of data – almost doubling your acceptable RPO.

    So what can you do? The easiest way to remedy this is to configure the copy job to run one or two minutes behind the backup job (at 2:22 or 2:32, for example). There is also no reason not to schedule the copy job to run every minute, since it executes on the standby server so there is no overhead on the primary, aside from reading backup files. The same recommendation goes for the restore job. This way, as soon as a file gets backed up or copied, the next job in the queue will pick it up and keep the standby server current.
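    One way to set up that offset is with a SQL Server Agent schedule that runs every 10 minutes starting two minutes past the interval, then attach it to the copy job. The sketch below uses the real msdb scheduling procedures, but the schedule name is made up and the exact offset is up to you:

```sql
-- Sketch: an Agent schedule that fires every 10 minutes, starting at
-- 00:02:00, so the copy job trails the backup job by two minutes.
EXEC msdb.dbo.sp_add_schedule
     @schedule_name        = N'LSCopy_every_10min_offset_2',  -- placeholder name
     @freq_type            = 4,        -- daily
     @freq_interval        = 1,
     @freq_subday_type     = 4,        -- repeat every N minutes
     @freq_subday_interval = 10,
     @active_start_time    = 000200;   -- 00:02:00

-- Then attach the schedule to your log shipping copy job:
-- EXEC msdb.dbo.sp_attach_schedule @job_name = N'<copy job>',
--      @schedule_name = N'LSCopy_every_10min_offset_2';
```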

  4. Full backups blocking log backups. This issue only applies to SQL Server 2000, but there are still enough of those in production to justify pointing it out. Prior to SQL Server 2005, log backups could not run while full backups were running. This means that when you execute the BACKUP LOG command while a full backup is running, SQL Server 2000 won't give you an error, but the log backup will not start until the full backup is done.

    During this time, your potential to lose more data than you are willing to accept is greatly increased. If your full backup takes too long to execute when compared to your acceptable RPO, you should do the following:

    • If possible, back up to a local file since it will likely finish faster than if you were backing up to a network drive.
    • Schedule the backup for the time when the database activity is at its lowest. This way you can minimize the number of transactions that would potentially be lost should the primary server fail during this period.
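A local full backup might look like the following sketch (the database name and file path are placeholders; adjust them for your environment):

```sql
-- Sketch: full backup to a fast local drive rather than a network share,
-- to shorten the window during which SQL Server 2000 blocks log backups.
BACKUP DATABASE MyDatabase                    -- placeholder database name
TO DISK = N'D:\Backups\MyDatabase_full.bak'   -- placeholder local path
WITH INIT, STATS = 10;                        -- progress message every 10%
```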

One final word of advice. Be sure to configure notifications with your log shipping and then -- aside from looking at the log shipping monitor reports -- watch for patterns in the notification emails. If you get an error once in a while because the copy job was interrupted by a network glitch, you don't have to worry too much. But if you are consistently being notified that your log shipping is out of sync as described in the examples above, you might need to take corrective action to get your disaster recovery plan back on track.


Roman Rehak is a senior database architect in Colchester, Vt. He specializes in SQL Server development, database performance tuning, ADO.NET and writing database tools. Roman also serves as the technical chair for the SQL Server track at the annual DevTeach conferences in Canada and is the president of the Vermont SQL Server User Group.

