10 Steps to Troubleshooting NetBackup Failures



  1. Typically, backup failures are indicated by the Control-m wrapper script exiting with a non-zero status code. Check the "Sysout" for more details on why the job failed.
    If the issue has not produced a ticket, create one manually.
    1. Based on the output of the wrapper script, you will take different actions to resolve the issue.
    2. If the jobs are not running, you'll need to validate that the Control-m jobs are set up properly.
    3. If the job is running long, check the NetBackup Java Admin Console Activity Monitor for the job in question. Check the Detailed Status within the job details for additional information.
  2. Verify the error codes within the NetBackup Java Admin Console.
    1. Also check whether this is a recurring issue.
    2. Check the Detailed Status Log for details on what may have caused the job to fail.
  3. For recurring issues, follow up with the individual who originally took responsibility for following up on the resolution to the problem.
    1. Check to see what steps have already been taken (or what other tickets have been opened) to resolve the issue, and relate applicable tickets to the new ticket.
    2. The on-call admin takes responsibility for the issue, and is responsible for seeing the issue through until it is resolved
      (or for coordinating with someone else to have them take responsibility for the issue through resolution).
  4. Using the links provided in the Sysout, check to see what error codes were produced. Use the NBU error codes troubleshooting info for additional steps related to the specific error code,
    and the NetBackup Troubleshooting Guide for more details.
    1. This will help you to resolve the issue, or find additional information to determine what needs to be done to fix the issue.
  5. Using the links provided in the Sysout, check the Jobs Problems Report for additional details on what may have caused the failure.
  6. Follow suggested troubleshooting steps to try and resolve the problem.
    1. Use the information gathered in the previous steps to help you determine the cause.
  7. If you are able to resolve the problem, follow the instructions in the Control-m wrapper script Sysout on restarting the backup via Control-m.
  8. If you are unable to resolve the issue, send the ticket back to the NOC.
    1. The NOC should update the Control-m job and reassign the ticket to the group you specified.
      This group should be the one most capable of further troubleshooting or resolving the issue.
  9. Continue to work with the group the ticket was reassigned to, to ensure they are following up on the issue.
  10. Provide additional information as needed, and continue to follow up until the issue is resolved.
    1. If necessary, open a case with a Support Vendor to get help with troubleshooting.
      They will most likely want you to gather some logs for them with the verbosity set to "5";
      generally this means you have to change the verbose setting, and then reproduce the error.
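Step 1 depends on the wrapper script's exit status. The actual Control-m wrapper is not shown on this page, so the Python sketch below is only an illustration of the idea: it runs a command and turns its exit status into a one-line summary of the kind that might appear in the Sysout. The function name `run_and_report` and the message wording are hypothetical.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (exit_status, summary_line).

    Hypothetical helper; the real wrapper script is not documented here.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = result.returncode
    if status == 0:
        summary = "backup command succeeded"
    else:
        summary = f"backup command failed with status {status}; check the Sysout"
    return status, summary


# Stand-in for a real backup command: a child process that exits with 13.
status, summary = run_and_report([sys.executable, "-c", "raise SystemExit(13)"])
print(summary)
```

A scheduler such as Control-m would then treat any non-zero status returned by the wrapper as a failed job.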

Common Error Codes

1
  • Partial Success, most often due to files being locked/in-use during backups. Missed files are logged, but usually no further action is taken.
Occasionally this error occurs with Database backups, specifically with Exchange, and usually due to a misconfiguration or a specific Exchange database being off-line.
Because this error is so common with Windows servers, the Control-m wrapper script for NetBackup is set up not to report it as a failure for basic Windows clients and a few other specific Backup Policies. For all other types of backups it is reported as an error, resulting in a ticket.
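The status 1 rule described above can be sketched roughly as follows. This is not the wrapper's actual code: the function name, the use of the policy type string "MS-Windows" to stand for a basic Windows client, and the `tolerant_policies` allow-list are all assumptions made for illustration.

```python
def status_is_failure(status, policy_type, policy_name, tolerant_policies=frozenset()):
    """Decide whether a backup exit status should be reported as a failure.

    Assumptions: a "basic Windows client" is modelled by policy_type, and
    tolerant_policies is a hypothetical set of other policy names for which
    a partial success (status 1) is acceptable.
    """
    if status == 0:
        return False
    if status == 1:
        partial_ok = policy_type == "MS-Windows" or policy_name in tolerant_policies
        return not partial_ok
    # Every other non-zero status is treated as a failure.
    return True
```

With this sketch, `status_is_failure(1, "MS-Windows", "win_std")` returns False, while the same status on an Exchange policy returns True and would result in a ticket.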
13, 14 and overloaded server issues.
  • Often happens when servers are rebooted or under maintenance during backup windows. Low free filesystem space can also cause these problems, in which case space must be freed up. Other causes include problems with the affected server itself, such as heavy load or corrupted files that the filesystem/OS has difficulty reading. Tickets usually get sent to the system owner or Infrastructure to resolve the issues with the server.
A reboot of the affected server sometimes helps. Also see error code 156 for other possible fixes.
24
  • Often happens when servers are rebooted or under maintenance during backup windows. Can also be caused by a wide variety of problems with the affected server, the network, or anything in between. Tickets usually get sent to the system owner or Platform team to resolve the underlying issue(s) with the server.
Also see error 13 and 14 above for other possible causes.
57
  • Usually caused by the NetBackup client service not being installed or not running. Restart the NetBackup Client service on the affected system. Often assistance is needed from the Platform Team or system owner to correct.
  • Less common but still possible are DNS or network related misconfigurations.
58
  • Typically caused by a server being down, often because it has been decommissioned. Often goes back to the system owner or to Platform team to address.
59
  • Typically a misconfigured NetBackup client with an outdated server list. If we are able to log in we can update it, but otherwise it goes back to the system owner or to Platform Team for assistance to correct the configuration issue.
  • Less common but still possible are DNS related misconfigurations, or other network related problems.
71
  • The file list within the Backup Policy is likely incorrect, or a wild-card is used to generate the file list, and a file or directory was changed mid-backup. The Backup Policy configuration is typically verified, and updated as needed.
83 thru 87
  • Tape or media error when writing data. Investigate possible bad tape, tape drive, or disk target. Often caused by drives needing to be cleaned, or some other problem related to the drive, tape, or disk. Backups usually work on a second try, but repeated failures on the same tape, drive, or media server are an indication of a problem with the specific device.
If the problem is narrowed down to a specific device, or feature, the best course is usually to open a case with the appropriate support vendor to further investigate problems with that device.
96
  • Out of available media. Have Operators or COEs load more "Scratch" tapes. Once tapes are loaded, the backup is re-run.
156
For more advanced "Snapshot" backup configurations, more in-depth troubleshooting would be needed, but we do not currently have any such configurations within our environment.
191, and 231
  • These are typically seen with Storage Lifecycle Policies, and can occur with optimized duplications when WAN connectivity to the remote replication site is down, or when tape media issues occur.
Check for any known network issues, particularly with the WAN to the DR site. Also check for any reported media errors or downed drives.
  • Other, more complex issues and misconfigurations can also cause these errors. Ensure that everything is properly configured within a new SLP, and that you have thoroughly tested it.
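For quick triage, the notes above can be condensed into a small lookup table. The sketch below is only a convenience summary of this page; the names `STATUS_HINTS` and `hint_for` are hypothetical, the wording is abbreviated, and status 156 is left out because its entry above is incomplete.

```python
# Abbreviated first-look hints taken from the error code notes on this page.
STATUS_HINTS = {
    1: "Partial success; usually locked or in-use files. Check Exchange databases if applicable.",
    13: "Server rebooted, under maintenance, low on disk space, or overloaded.",
    14: "Server rebooted, under maintenance, low on disk space, or overloaded.",
    24: "Server or network problem; also see the notes for 13 and 14.",
    57: "NetBackup client service not installed or not running; less often DNS or network.",
    58: "Server down, often decommissioned.",
    59: "Client server list outdated; less often DNS or network.",
    71: "File list in the Backup Policy is likely incorrect.",
    96: "Out of available media; have scratch tapes loaded and re-run.",
    191: "SLP duplication problem; check WAN to the DR site, media errors, downed drives.",
    231: "SLP duplication problem; check WAN to the DR site, media errors, downed drives.",
}

# Statuses 83 thru 87 share one note.
for code in range(83, 88):
    STATUS_HINTS[code] = "Tape or media error when writing; check tape, drive, or disk target."


def hint_for(status):
    """Return the local note for a status code, or a pointer to the guides."""
    return STATUS_HINTS.get(
        status,
        "No local note; see the NBU error codes info and the NetBackup Troubleshooting Guide.",
    )
```

For example, `hint_for(85)` returns the shared tape or media note, and an unlisted status falls back to the pointer to the guides.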

Common Issues

DNS
  • By far our most common problem with NetBackup in complicated environments with multiple domains or "forests" has to do with DNS.
    NetBackup is very sensitive to poorly implemented DNS environments and to DNS issues in general, especially if you run on UNIX and rely on WINS.
    • A lack of standards around DNS must be addressed.
      • Using FQDNs and ensuring that all backup interfaces are put into the same domain as the primary domain of the "client" can be a big help in these situations.
        This helps ensure you don't end up with duplicate DNS entries in two different domains, and that you don't have to rely on WINS when the domain is not listed in the search order.
    • The next biggest issue is that reverse DNS often does not work, or is outdated or incorrect. There are settings within NetBackup to change how much NetBackup relies on DNS, but these carry potential security risks. Fixing reverse DNS lookup issues should be treated as a best practice.


  • To test for DNS issues, use "nslookup" or "dig" from the NetBackup servers as well as from the NetBackup clients, for both forward and reverse DNS lookups.
    Do the lookups not only for the client, but also for the NetBackup servers, from both the client and the server.
  • Also watch out for intermittent DNS issues, as these are common as well. You'll see something work one time, but if you do a rapid-fire "nslookup" you may notice that one or two of the answers are wrong.
    Using "dig" or "nslookup" against each specific DNS server in turn can also help identify whether the issue is with a particular DNS server.
  • DNS issues can cause a large variety of errors, usually pointing to some kind of network communications issue, or even access denied errors.
  • If you suspect NetBackup is caching erroneous DNS information, you can use the following command on the affected system to clear this cache:
INSTALL_PATH\NetBackup\bin\bpclntcmd -clear_host_cache
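The forward/reverse and rapid-fire checks described above can also be scripted. The following Python sketch uses only the standard `socket` module and therefore the system resolver, so it cannot target a specific DNS server the way "dig" or "nslookup" can, and local resolver caching may hide intermittent problems. The function names are hypothetical, and the resolver functions are passed in as parameters so the logic can be exercised without live DNS.

```python
import socket
from collections import Counter


def forward_reverse_check(hostname, forward=socket.gethostbyname_ex,
                          reverse=socket.gethostbyaddr):
    """Return a list of problems found for hostname (empty list = consistent)."""
    problems = []
    try:
        _name, _aliases, addresses = forward(hostname)
    except OSError as exc:
        return [f"forward lookup of {hostname} failed: {exc}"]
    wanted = hostname.lower().rstrip(".")
    for address in addresses:
        try:
            rname, raliases, _addrs = reverse(address)
        except OSError as exc:
            problems.append(f"reverse lookup of {address} failed: {exc}")
            continue
        # Compare case-insensitively and ignore a trailing dot.
        names = {n.lower().rstrip(".") for n in [rname, *raliases]}
        if wanted not in names:
            problems.append(f"{address} resolves back to {rname}, not {hostname}")
    return problems


def repeated_lookup(hostname, attempts=20, forward=socket.gethostbyname_ex):
    """Resolve hostname several times and count each distinct answer set."""
    answers = Counter()
    for _ in range(attempts):
        try:
            answers[tuple(sorted(forward(hostname)[2]))] += 1
        except OSError:
            answers[("lookup failed",)] += 1
    return answers
```

Run `forward_reverse_check` for the client name and for each NetBackup server name, from both the client and the server. If `repeated_lookup` returns more than one distinct answer set, that suggests the intermittent behaviour described above.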
Restores
Client Related Issues
  • We often see problems when clients are reconfigured, rebooted, or decommissioned without sufficient notification to the NetBackup team.
    If a server is not operational when backups run, backups will fail.
  • If the Server list configured within the NetBackup Client Software is not correct you will usually get an error 59, but other errors have also been seen when this was the issue.
  • If dealing with a backup network, be sure the client has been set up properly and meets Backup Network Client Configuration Requirements.
  • Be sure to be familiar with, and follow, the above listed steps for troubleshooting backup failures.
Media Errors
  • Improper or lackadaisical Media Management can also lead to failures.
    • It is the responsibility of the NOC staff to manage media for us, and to work with the local COEs and 3rd party vendors to ensure we have plenty of tapes in the library.
      However, it is ultimately the NetBackup team's responsibility to ensure the environment is working well as a whole.
      Thus the NetBackup Admin On-call must keep an eye on this and follow the Expiring Tapes Early when out of Scratch Tapes procedure only if absolutely necessary.
    • Since tapes are mechanical they do tend to fail from time to time, and must be properly cared for and tracked to prevent and identify issues.
      For more information see Frozen Tapes Procedures.
    • Also, if we get to a point where we can rely on disk systems to replicate data off-site, and only use tape for longer retention Monthly backups, then we will see a large reduction in tape errors, and thus fewer media related failures.
    • NetBackup is usually good about detecting media related errors and reporting them as such.
      The real challenge in this area is determining if it was caused by the tape or the drive. There are many tools and reports in NetBackup and on our Robotic Libraries that can be used to make these determinations.

Drive Cleaning

  • Drive errors are often caused by the drives needing to be cleaned. Cleaning the drive will correct this problem; however, we should have automated cleaning set up on all of our tape libraries.
    • Check the automated cleaning process to make sure it's working and that cleaning tapes have not used up their entire cleaning count (usually 50).
    • Cleaning can be done via the tape library, in which case you'll need to log into the library to check.
    • Cleaning can be done via NetBackup - if the tape drives/library provide the proper SCSI triggers to notify NetBackup of the needed cleaning - in which case you'll need to check within NetBackup.
Downed Drives
  • NetBackup will down a drive after it has reached a pre-defined (and configurable) threshold of so many errors within a certain period of time.
  • For Linux, if the server was recently rebooted, the device paths were most likely re-done and are no longer correct.
  • On Solaris we use the "SG drivers" provided with NetBackup. If they are not set up correctly you may have problems.
  • You may also need to validate that TLU & VLT Masking is setup correctly.
  • You can also check for errors generated by the library or VTL to determine if the cause is hardware related.
  • Be patient with physical tape drives; expecting them to respond too quickly and being too reactive can make small issues seem bigger than they really are.
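To make the "so many errors within a certain period of time" idea concrete, here is a minimal sliding-window sketch. It is an illustration of the general technique only and not NetBackup's internal logic; the class name, the method name, and the threshold and window values used in the example are all hypothetical.

```python
from collections import deque


class DriveErrorWindow:
    """Track error timestamps for one drive within a sliding time window."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._errors = deque()

    def record_error(self, timestamp):
        """Record an error; return True if the threshold has been reached."""
        self._errors.append(timestamp)
        # Drop errors that have aged out of the window.
        while self._errors and timestamp - self._errors[0] > self.window_seconds:
            self._errors.popleft()
        return len(self._errors) >= self.max_errors


# Example: 3 errors within one hour (made-up values) reach the threshold.
window = DriveErrorWindow(max_errors=3, window_seconds=3600)
for t in (0, 100, 200):
    reached = window.record_error(t)
```

Errors spread out over a longer period never accumulate, which is why an occasional error does not down a drive but a burst of errors does.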
Heavy Server Loads
  • Backups will put load on a system, and if the system is already overutilized, it will cause backups to fail (often with a status 13 or 14).
    • VMs are especially susceptible to these kinds of issues due to the heavy I/O loads backups place on them, and the limited I/O capabilities of VMs.
      • One solution to this that we are investigating is to use the "Consolidated Backup" features of NetBackup for Hyper-V and VMware.
    • Changing Weekly and Monthly Full backups into Synthetic Full backups can help greatly reduce loads.
      • This would greatly reduce the overhead put on the NetBackup "clients", but will require the use of Disk to keep at least one month's worth of backups on-site to enable these features.
        The Monthly backups would then become the only ones sent to tape, but this requires us to rely on replicating the disk images off-site instead of sending tapes off-site.
        Still this solution would not work in all situations - particularly with all types of database backups.
    • Backups can also slow down critical systems during their "Month End Cycles", so if someone comes to you saying "Hey, my server is running slow", follow the instructions on Canceling NBU Jobs On Overutilized Clients.
      This is also one reason why we use Control-m for our scheduling, so that our jobs can be dependent on Month end processes finishing first before backups start.
NetBackup Servers
  • Usually when we do have NetBackup Server issues, it is shortly after an upgrade (or some other change) where we have run into a new bug (or feature) in the new version.
  • A heavily loaded NetBackup server can also experience problems with running out of memory or processes failing due to not being able to handle all the requests.
    This is why it's important for us to make sure our resources are not over-utilized, and that our servers and job scheduling are well balanced.
NetBackup Appliances
  • Monitor Media servers and Appliances to ensure they are up, and all daemons and services are running.
Oracle
Hyper-V
VMware Permissions

http://www.symantec.com/docs/TECH130493 http://kb.vmware.com/kb/2063054


