Prasanna Rambhatla

Fault management is about finding things that go wrong, such as failed or damaged components and broken cables. It concerns things that should be working correctly but are not. It does not cover misconfiguration or user error.

The best way to tackle fault management is therefore monitoring.

Once a piece of hardware, a cable or a network process is running properly, it will normally stop only because something fails (a part, for example), provided none of its environmental factors change.

When a problem occurs, the task of fault management is to detect, isolate, and repair malfunctions in the network and its associated systems.

The first step, detection, can be thought of as an online process that gives an indication of a malfunction. Real-time detection mechanisms are usually implemented within the network protocols and devices. These can raise alarms either directly or via monitoring software.
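As a rough illustration of detection via monitoring software, here is a minimal sketch in Python that polls services over TCP and reports the ones that do not answer. The function names, the target tuple format and the timeout are assumptions made for this example; real detection is often built into the protocols and devices themselves, as noted above.

```python
import socket


def service_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat all of these as "down".
        return False


def detect_faults(targets, timeout: float = 2.0):
    """Return the names of the (name, host, port) targets that did not answer."""
    return [
        name
        for name, host, port in targets
        if not service_is_up(host, port, timeout)
    ]
```

A monitor would call detect_faults on a schedule and raise an alarm for every name it returns. A failed connection only says that something is wrong, not what, which is why the next step is needed.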

The second step consists of fault localization and identification. Fault localization is typically achieved through algorithms (procedures) that compute a possible set of faults, while fault identification is done by testing the suspected faulty component(s) and comparing them with known working equipment.

The last step, repair, is achieved by taking corrective action. This may require replacing equipment, changing the system configuration, or removing software bugs.
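To make "compute a possible set of faults" concrete, here is one simple sketch in Python. It assumes a single fault and assumes you know which components each probe passes through: a suspect must lie on every failing path and on no working path. The function name, the data layout and the component names are invented for this example, and real localization algorithms are usually more elaborate.

```python
def candidate_faults(paths, failed, working):
    """Return the components that could explain the observed failures.

    paths:   dict mapping a probe name to the set of components it traverses
    failed:  probe names that did not get through
    working: probe names that did get through
    Assumes exactly one component is at fault (a simplification).
    """
    suspects = None
    for probe in failed:
        components = set(paths[probe])
        # A single fault must sit on every failing path.
        suspects = components if suspects is None else suspects & components
    if suspects is None:
        return set()  # nothing failed, so there is nothing to suspect
    for probe in working:
        # Anything on a working path is known to be good.
        suspects -= set(paths[probe])
    return suspects


# Hypothetical topology: three hosts behind one switch.
paths = {
    "hostA": {"switch1", "cable1"},
    "hostB": {"switch1", "cable2"},
    "hostC": {"switch1", "switch2", "cable3"},
}
print(candidate_faults(paths, failed=["hostB"], working=["hostA", "hostC"]))
# prints {'cable2'}
```

The result is only a set of hypotheses. Identification then confirms it, for instance by swapping the suspect cable for one known to work.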

Example: AppleTalk
On an AppleTalk network you can watch the sockets of your major network applications, such as AppleShare. If one of these applications crashes, your monitoring software registers the fault. It can then inform you by beeping, paging you or even emailing other staff.

Fault Management Systems
Fault management systems are used to monitor and record information about the software and hardware on your network. Each time a fault occurs it should be logged, so that the "fault history" of each component of the system or network is recorded.

Typical information that should be recorded is:

Time/date the fault occurred.
How was the fault detected?
What was the duration of the fault (how long until service was restored)?
How was service restored (repair, replacement, etc.)?
What was the cost (time, labour, other) of restoring service?
Time/date service was restored.
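
The items above map naturally onto a small record type. The sketch below is one possible layout in Python. The field names are assumptions, and the component field is added so that each record can be tied to the item whose fault history it belongs to. The duration is derived from the two timestamps rather than stored separately.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FaultRecord:
    component: str         # which item failed (assumed field)
    occurred_at: datetime  # time/date the fault occurred
    detected_by: str       # how the fault was detected
    restored_at: datetime  # time/date service was restored
    restored_by: str       # how service was restored: repair, replacement, ...
    cost: float            # cost of restoring service, in whatever unit you track

    @property
    def duration(self) -> timedelta:
        """How long service was down."""
        return self.restored_at - self.occurred_at


# Hypothetical entry for illustration only.
record = FaultRecord(
    component="file-server-1",
    occurred_at=datetime(2005, 5, 6, 9, 0),
    detected_by="monitoring alarm",
    restored_at=datetime(2005, 5, 6, 11, 30),
    restored_by="replaced power supply",
    cost=120.0,
)
print(record.duration)  # prints 2:30:00
```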

By keeping a record of at least the information above, you can easily determine how often a particular item breaks down, when an item is no longer worth repairing, and which components are the most and least reliable.
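
As a sketch of how such a history answers those questions, the Python below totals faults, downtime and repair cost per component, then flags items whose accumulated repair cost exceeds a replacement price. The tuple layout, the function names and the "repairs cost more than a replacement" rule are assumptions made for this example. Use whatever threshold suits your site.

```python
from collections import defaultdict


def summarize(faults):
    """Total fault count, downtime and cost per component.

    faults: iterable of (component, downtime_hours, cost) tuples.
    """
    summary = defaultdict(lambda: {"faults": 0, "downtime_hours": 0.0, "cost": 0.0})
    for component, downtime_hours, cost in faults:
        entry = summary[component]
        entry["faults"] += 1
        entry["downtime_hours"] += downtime_hours
        entry["cost"] += cost
    return dict(summary)


def not_worth_repairing(summary, replacement_cost):
    """Names of components whose repair costs so far exceed a replacement.

    Components with no known replacement price are never flagged.
    """
    return sorted(
        name
        for name, entry in summary.items()
        if entry["cost"] > replacement_cost.get(name, float("inf"))
    )


# Hypothetical history for illustration only.
history = [
    ("printer-2", 1.5, 40.0),
    ("printer-2", 2.0, 60.0),
    ("hub-1", 0.5, 10.0),
]
totals = summarize(history)
print(totals["printer-2"]["faults"])  # prints 2
print(not_worth_repairing(totals, {"printer-2": 80.0, "hub-1": 50.0}))
# prints ['printer-2']
```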

--------------------------------------------------------------------------------

Friday, May 06, 2005
