Troubleshooting File System Problems
This article presents a systematic approach to troubleshooting file system problems on servers running Windows Server 2003. Various tools for troubleshooting disk problems are examined and best practices for using them are explained.
A corrupt or damaged file system can cause anything from data loss to an unbootable system. Smart IT pros will therefore take steps to maintain their servers' file systems and will know how to systematically troubleshoot disks when things go wrong. This article covers preventive disk maintenance and offers tips for using various tools to maintain and troubleshoot file systems on Windows servers.
Seven Golden Rules for Disk Maintenance
Let's begin with a proactive approach to file system maintenance. What steps should an administrator take to help prevent file system problems from happening in the first place? Here are my seven golden rules on the subject, in no particular order:
1. Upgrade your servers to Windows Server 2003. There's real value in doing this as far as disk maintenance is concerned, for example:
• The chkdsk command in Windows Server 2003 runs a lot faster than the Windows 2000 version of this utility, plus it can fix things like a corrupt Master File Table (MFT) that the previous version of the utility would choke on.
• Powerful new command-line tools like DiskPart.exe, Fsutil.exe and Defrag.exe give you more flexibility for managing disks from the command-line instead of the GUI. These tools can be scripted to automate common disk management tasks you need to perform on a regular basis.
• The new Automated System Recovery (ASR) feature greatly simplifies the task of restoring your system/boot volume in the event of catastrophic disk failure.
2. Use hardware redundancy. RAID 1 disk mirroring lets you recover from catastrophic system volume failure with zero downtime, while RAID 5 is a great way of protecting your data volumes. Windows servers include support for built-in software RAID but you'll get better performance and true hot-swap redundancy by investing more money and buying a hardware RAID controller for your system instead. Don't forget though, keep a few spare drives handy so you can swap them during an emergency—redundancy is useless if you don't have the redundant hardware around to use it. Note that if you do choose to go with the software RAID provided by Windows, mirroring your boot and system volumes requires that these volumes be one and the same i.e. one volume is both your boot volume (contains operating system files) and your system volume (contains hardware-specific boot files).
3. Use a good antivirus program. Viruses can be nasty, and one of the things they can do when they infect a machine is to corrupt the Master Boot Record (MBR) and other critical portions of your hard drives. Not only should you have AV installed on your servers, you should also avoid risky behaviors such as running scripts from untrusted sources, browsing the web, and so on. These are just the kinds of behavior that can lead to infecting your system, so avoid doing things like this on your production servers.
4. Defragment your file systems on a regular basis. This is especially important on servers on which a high number of transactional operations occur as the file systems can quickly become fragmented, dragging down the performance of applications running on your server. To perform a successful defrag you should really have at least 15% free space left on your disk, so make sure you don't let critical system or data disks fill up too much or they'll be harder to maintain. The new command-line Defrag.exe tool of Windows Server 2003 is useful here since you can schedule regular running of this tool during off-hours using the Schtasks.exe command instead of having to defrag manually or buy a third-party defrag tool.
5. Run chkdsk /r on a regular basis. This command finds bad sectors on your disk and tries to recover the readable data from them and move it elsewhere. You can run this command either from a command-prompt window or from the Recovery Console if you can't boot your system normally. Remember that when you try to run chkdsk.exe on your system or boot volume, Windows configures autochk.exe (the boot version of chkdsk.exe) to run at your next reboot. This means you'll need to schedule downtime for your server when you perform this kind of maintenance so that autochk.exe can run.
6. Check your event logs regularly for any disk-related events. Windows sometimes determines on its own when a disk is "dirty" i.e. there are file system errors present on it. In that case, Windows automatically schedules autochk.exe to run at the next reboot, but it also writes an event to the Application log using either the source name "Chkdsk" or "Winlogon". So filter your Application log to view these kinds of events on a regular basis or collect them using Microsoft Operations Manager (MOM) or whatever other systems management tool you use on your network.
7. Back up all your volumes regularly. As a last recourse in the event of a disaster, having working backups of both your system/boot volume and data volumes is critical. ASR in Windows Server 2003 makes backing up the boot/system volume easier, while backing up your data volumes can be done using the Windows Backup (ntbackup.exe) tool or any other backup tool such as one from a third-party vendor. Whatever way you choose to back up your system, do it regularly and verify your backups to ensure you can recover your system using them.
I should add an eighth and final rule:
8. (the Platinum rule) If your disk starts to make funny sounds, don't ignore them; do something. Disk failure is often preceded by funny sounds emanating from your computer. These clicking, scraping, screeching, or other types of sounds mean trouble, so when you hear them it's time to make sure you've got a recent backup and a spare disk handy just in case. And it's also time to check your event logs, run chkdsk /r, and use other maintenance and troubleshooting tools to check the health of your disks. Don't ignore these funny sounds!
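Rule 4 lends itself to a little automation. Here is a minimal Python sketch, not a definitive implementation, that checks the 15% free-space guideline and builds (without running) a Schtasks.exe command line for a weekly off-hours defrag. The task name, the Sunday 02:00 schedule, and the helper names are my own assumptions for illustration, and the time format accepted by /st can vary between Windows versions, so confirm the switches with schtasks /create /? on your own server first.

```python
import shutil
import subprocess

MIN_FREE_FRACTION = 0.15  # the 15% free-space guideline from rule 4


def has_defrag_headroom(total_bytes: int, free_bytes: int,
                        min_free: float = MIN_FREE_FRACTION) -> bool:
    """Return True when the free fraction of the volume meets the guideline."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes >= min_free


def weekly_defrag_task(volume: str, day: str = "SUN", start: str = "02:00") -> list:
    """Build (but do not run) a schtasks command for a weekly defrag.

    The task name and default schedule are hypothetical; adjust them to
    match your own maintenance window.
    """
    letter = volume.strip().rstrip(":").upper()
    return [
        "schtasks", "/create",
        "/tn", f"WeeklyDefrag_{letter}",   # assumed task name
        "/tr", f"defrag.exe {letter}:",    # program the task will run
        "/sc", "weekly",
        "/d", day,
        "/st", start,                      # time format: verify on your system
    ]


if __name__ == "__main__":
    usage = shutil.disk_usage(".")  # volume holding the current directory
    print("Enough free space to defrag:",
          has_defrag_headroom(usage.total, usage.free))
    # Show the command as a single line you could paste into a prompt.
    print(subprocess.list2cmdline(weekly_defrag_task("C:")))
```

The sketch only prints the command, so you can review it before creating the task on a production server.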
Tips for Troubleshooting
While a proactive approach to maintaining disks and their file systems is important, it's also inevitable that disasters will occur and you'll need to react to them appropriately. Here are some tips for using one of the key disk and file system maintenance tools included with Windows Server 2003, namely Chkdsk.exe:
• Make sure you know you have a good recent backup before you run chkdsk.exe.
• Never interrupt Chkdsk.exe while it's doing its job.
• Make sure you have enough time during your maintenance downtime window to run Chkdsk.exe—on very large volumes this command can take a long time to finish its work. To speed up the operation of Chkdsk.exe on very large volumes, you can run it in a "light" form by specifying chkdsk drive_letter /f /c /i before you try running the slower chkdsk /r.
• Chkdsk.exe can't repair the boot/system volume while Windows is running, and it can't repair a data volume that has open file handles, because in both situations it is unable to lock the volume for its exclusive use. In these cases, Chkdsk.exe can instead be scheduled to run at the next system restart.
• If you think your volume may be dirty but you don't want Autochk.exe to run when the server reboots (for instance, if your server is heavily used and you can't afford the downtime while Autochk.exe runs), you can use the Chkntfs.exe command first to determine whether the volume is dirty and second to find out whether Autochk.exe is currently scheduled to run at the next restart. If the volume is dirty and Autochk.exe is scheduled to run at the next restart, you can postpone the check by excluding the volume with chkntfs /x drive_letter; running chkntfs /d later restores the default behavior of checking dirty volumes at boot. Note however that doing this is risky: if your volume is dirty you should deal with it as soon as possible and not procrastinate.
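To make the order of operations in these tips concrete, here is a small Python sketch that lays out the commands for one maintenance window: query the volume's state with Chkntfs.exe, run the quicker chkdsk /f /c /i pass, then run the slower chkdsk /r. It only builds and prints the command lines; the function names and the three-step plan are my own illustration, so treat it as a starting point rather than a finished tool.

```python
def normalize_volume(volume: str) -> str:
    """Turn 'd', 'D' or 'D:' into 'D:' and reject anything else."""
    letter = volume.strip().rstrip(":")
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError(f"not a drive letter: {volume!r}")
    return letter.upper() + ":"


def maintenance_plan(volume: str) -> list:
    """Return (description, command) pairs in the order the tips suggest."""
    vol = normalize_volume(volume)
    return [
        ("Report whether the volume is dirty", ["chkntfs", vol]),
        ("Quicker pass that skips some checks", ["chkdsk", vol, "/f", "/c", "/i"]),
        ("Full pass that also scans for bad sectors", ["chkdsk", vol, "/r"]),
    ]


if __name__ == "__main__":
    # Print the plan for a hypothetical data volume D:.
    for description, command in maintenance_plan("d"):
        print(f"{description}: {' '.join(command)}")
```

Remember the first tip before running any of these for real: make sure you have a good recent backup.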
Conclusion
Proper disk maintenance requires both proactive actions and knowledge of how to properly use file system troubleshooting tools. Make sure you become familiar with the tools included in Windows Server 2003, and be sure to follow the seven (or eight) rules outlined in this article so you can keep your disks humming (but not screeching) along.
Nirakar's
Thursday, April 5, 2007
Wednesday, April 4, 2007
The Importance of Backup Systems
Hope for the Best ... But Prepare for the Worst
Even though most of us know that we need to do regular backups, the fact is that many of us don't. In part 1 of a two-part series, we review why it's important to perform these backups on a regular basis.
Earlier this week a client contacted me with a rather severe problem. When I arrived on the scene, I discovered that the problem was far worse than I had originally thought. Originally, I thought that the server's hard drive had crashed and would need to be replaced. While this is without question a serious problem, I knew that the server was equipped with a RAID system that replicated the data across multiple hard drives.
A RAID system provides redundancy for your data. So in the event that one of the hard drives fails, as was the case here, all you need to do is replace the crashed hard drive with a new one and let the RAID array rebuild the data onto the new drive.
Unfortunately, the problem was even more severe than I had originally feared. It turned out that the entire RAID array was damaged. This meant that all of the hard drives that made up the array needed to be replaced and the data had to be restored from backups before the server could be brought back online.
This is where the nightmare begins. The client had a problem with their tape backup drive about two months earlier and, as a result, did not have any current backups of the data. This meant that once the RAID array was back online and the tape drive was functional, I would have to find the last complete backup they had (which in this case was Feb 7), perform a restore, and then visit each PC to get any newer data copied back to the server.
This meant that a full restore could take weeks. Even then, some missing data would never be recovered.
When they first informed me of their tape problems, I tried to impress upon them the seriousness of the situation and how important it was that the tape drive be repaired or replaced as quickly as possible. They failed to heed the warning. Now they're paying the price.
Don't Let It Happen to You
You don't have to suffer the same outcome. Despite the fact that my client was negligent in getting the tape drive repaired, this problem was not unique to them. In fact, this problem has affected many of us, particularly those users who spend a lot of their time working on the road or from a home office.
And this problem certainly isn't exclusive to non-technical people. Even some of the most experienced techies I know have often fallen into this trap. As a matter of fact, an associate of mine just recently had the hard drive in his laptop crash. He didn't have a current backup and, as a result, lost six months' worth of work!
The point of all this is that even though most of us know that we need to do regular backups, the fact is that many, if not most of us, don't.
So let's take a moment to review why it's important to perform these backups on a regular basis. Here are six of the most common reasons:
1. The Human Eraser – Have you ever reformatted a hard disk when you meant to format a floppy? Have you ever typed "Y" when you meant "N" and then it was too late? Have you ever overwritten a file by mistake? How about installing software you later found you really did not want? Today's computers can do a lot of damage in a very short period of time.
The fastest erasers known consist of a fast computer combined with an unprepared or tired brain. Backup systems can save you hours, days, or months of trying to reconstruct your valuable data. Before you make any important system change, such as adding hardware or software, remember to perform a backup.
2. Hard disk failure – MTBF (mean time between failures) figures have improved dramatically in the past several years for all peripherals. But so has data capacity, and with it the amount you could lose if your disk fails.
The problem is you never know when a failure will occur. And, according to Murphy's Law, the loss will occur at the worst possible time. Backup systems give you immediate and automatic protection from unpredictable disk failures.
3. Virus protection and spyware protection – Some unscrupulous individuals continue to write viruses that innocently hide in shareware programs and all throughout the Internet. These programs have the capability to copy themselves and load into your system along with the software you think you are getting.
Once loaded, they proceed to wreak havoc with your system, causing errors, lockups and loss of data. A reliable backup system can restore data lost through virus infection when used in conjunction with good virus detection software and an earlier backup.
4. Free up disk space – While we can't stop the steady growth of application software and related data, a backup system lets you do something about it by offloading some of the less-used files from your hard disk to a secondary storage medium like tape or DVDs. Removing those inactive files can open up your hard disk for new programs or growing data files.
Inexpensive DVDs or tape cartridges are a sure way to archive your programs and data while still keeping them accessible when you do need them. Archiving could even let you put off buying a larger disk.
5. Events beyond your control – Both natural and manmade disasters inject a disconcerting variability into any application that requires large amounts of data storage. These include fire, floods, lightning and outright theft.
After such an occurrence, how will your business survive? Many don't, according to statistics. Regenerating vital billing or customer information would be very difficult from paper records, if not impossible. Backup systems protect your data against such calamity.
Besides doing daily backups, plan to do an extra backup weekly. Then store the backup either in a fireproof safe or at an offsite location. If your system goes, your data stays — and that may mean the difference between business as usual and bankruptcy.
6. Large file transfers – Transferring large volumes of data can be time consuming. Tape backup drives in particular have the capacity for very high data transfer rates, making them ideal for moving large quantities of data between systems. Tapes are also compact, inexpensive and have a long shelf life.
So your data will be archived and accessible for years to come. And with a tape backup system you can conveniently send a tape cartridge across the country, through the mail, or across the office in your shirt pocket.
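None of these reasons help if the backups quietly stop, which is what happened to my client. As a rough illustration, here is a short Python sketch that flags when the newest backup is older than it should be. The one-day threshold mirrors the daily backups recommended above; the function names and the sample dates are assumptions made up for the example, and how you collect the backup timestamps depends entirely on your backup tool.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def newest_backup_age(backup_times: Iterable[datetime],
                      now: datetime) -> Optional[timedelta]:
    """Return the age of the most recent backup, or None if there are none."""
    times = list(backup_times)
    if not times:
        return None
    return now - max(times)


def backup_is_stale(backup_times: Iterable[datetime], now: datetime,
                    max_age: timedelta = timedelta(days=1)) -> bool:
    """Treat a missing backup, or one older than max_age, as stale."""
    age = newest_backup_age(backup_times, now)
    return age is None or age > max_age


if __name__ == "__main__":
    # Illustrative dates only: a last good backup roughly two months old.
    now = datetime(2007, 4, 4)
    backups = [datetime(2007, 1, 31), datetime(2007, 2, 7)]
    print("Stale:", backup_is_stale(backups, now))
    print("Age in days:", newest_backup_age(backups, now).days)
```

A check like this is only a reminder; you still need to verify that a restore from the backup actually works.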
Now that we have reacquainted ourselves with the reasons why backing up our systems is so important, we need to figure out how best to go about doing it so that it happens consistently and reliably. In our next installment, we'll discuss some of the different backup methods and take a look at some of the different backup media now available. Till next time.
