Equipment Uptime Systems

Complete System

Troubleshooting Framework System

A complete structured diagnostic process for maintenance teams — so every fault gets properly identified, every fix gets confirmed, and every diagnosis builds team knowledge that outlasts the technician who made it.

What's Included

The five-phase diagnostic process — explained and applied
Decision trees for 10 common fault classes
Failure mode documentation templates
Worked examples: electrical, mechanical, and control system faults
30-day team implementation guide

Product: Complete System

Published: 2026 Edition

Publisher: Equipment Uptime Systems

01 Why Most Troubleshooting Fails — and What to Do Instead

The speed vs. understanding trap

What informal diagnosis actually costs

The case for structured process

02 The Five-Phase Diagnostic Process

Phase 1: Define the problem precisely

Phase 2: Gather baseline evidence

Phase 3: Form and rank hypotheses

Phase 4: Test systematically

Phase 5: Confirm and document

03 Decision Trees — 10 Common Fault Classes

Motor / drive faults · Sensor / input faults · Power supply faults

Mechanical wear · Alignment / vibration · Pneumatic / hydraulic faults

Thermal faults · PLC / control logic faults · Intermittent faults · Communication faults

04 Worked Examples

Electrical example: VFD overcurrent fault — recurring, no obvious cause

Mechanical example: bearing failure — earlier than expected

Control system example: intermittent position fault on servo axis

05 Failure Mode Documentation Templates

Asset fault history card

Root cause summary template

Team knowledge base entry format

06 30-Day Team Implementation Guide

Week 1: Foundation · Week 2: First applications · Week 3: Review · Week 4: Embed

Section 1

Why Most Troubleshooting Fails — and What to Do Instead

The majority of maintenance teams are technically skilled but diagnostically inconsistent. The problem is not ability — it is process. Understanding why informal diagnosis fails, and what it costs, is the foundation for everything that follows in this system.

Ask any maintenance manager to describe their team's biggest operational frustration and the answer is almost always some version of the same thing: faults that keep coming back. A machine goes down, the team responds, something gets replaced or adjusted, the machine runs — and then the same fault returns days or weeks later. The cycle repeats. Parts accumulate on the shelf. The team starts to develop a narrative about certain machines being "problem equipment" when in reality the equipment is fine and the diagnosis has been wrong every time.

This is not a technician competence problem. The teams caught in this cycle are often experienced, capable, and committed to good work. The problem is structural. The diagnostic process they are using was never designed to identify root causes reliably. It was designed to get the machine running again quickly, which is a different objective entirely — and one that actively works against root cause resolution when applied under production pressure.

The Speed vs. Understanding Trap

Production pressure creates a specific incentive structure for maintenance teams. When a machine goes down, every minute of downtime has a cost — real or perceived — and the fastest path to resuming production is to replace the most likely suspect. An experienced technician who has seen a similar symptom before can often get a machine running in 30–45 minutes using this approach. A structured diagnostic process that identifies root cause might take two to three hours for a complex fault.

In the short term, the speed approach appears to win. In the medium term, it loses badly. The same faults return on a predictable cycle. Each recurrence generates its own downtime, its own parts cost, its own disruption to schedule. The team's time gets consumed by repeat visits to the same machines. Junior technicians never develop genuine diagnostic skill because they are always watching a senior technician guess quickly rather than reason systematically.

The Repeat Failure Cost Calculation

A fault that causes 3 hours of downtime per event and recurs six times per year costs 18 hours of annual downtime plus six sets of parts. A one-time root cause investigation taking 4 hours eliminates that cost entirely. The payback on structured diagnosis is typically measured in weeks, not years.
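The arithmetic behind this comparison can be sketched in a few lines. This is an illustrative calculation only: the function names are hypothetical, and it counts downtime hours alone, leaving parts cost out.

```python
def annual_repeat_downtime(hours_per_event: float, events_per_year: int) -> float:
    """Total downtime per year caused by one recurring fault."""
    return hours_per_event * events_per_year


def breakeven_events(investigation_hours: float, hours_per_event: float) -> float:
    """Number of avoided recurrences needed to repay the investigation time."""
    return investigation_hours / hours_per_event


# Figures from the example above: 3 h per event, 6 events per year, 4 h investigation.
downtime = annual_repeat_downtime(hours_per_event=3, events_per_year=6)
payback = breakeven_events(investigation_hours=4, hours_per_event=3)
print(f"Annual repeat downtime: {downtime} h")
print(f"Investigation repaid after {payback:.1f} avoided events")
```

With six events per year, fewer than two avoided recurrences repay the investigation, which is why the payback falls within a few months at most.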

What Informal Diagnosis Actually Costs

The costs of unstructured troubleshooting are distributed and easy to overlook in any single event. The real picture only becomes visible at the annual level:

Repeated downtime on the same fault — The most direct cost. A fault that recurs four times at three hours each is 12 hours of lost production, not 3.
Incorrect parts replacement — Replacing probable causes that turn out not to be the root cause. These parts get consumed, sometimes discarded, and the real cause goes unaddressed.
Non-transferable knowledge — Every diagnostic insight that stays in one technician's head is lost when they leave or are unavailable. The next person starts from zero.
Technician development stagnation — Junior technicians who only see guessing-based diagnosis never build systematic thinking. Their ceiling as diagnosticians is set by what they observe.
Escalating complexity — Faults that go unresolved often develop secondary effects. A root cause that was addressable in one hour at first occurrence can become a major failure requiring days of work after months of partial fixes.

The Case for Structured Process

Structured diagnostic process does not mean slow. It means deliberate. A technician who has internalized the five-phase framework described in Section 2 applies it quickly — often faster than a technician who is improvising, because they are not backtracking and retesting things they have already ruled out. The framework provides a sequence that eliminates wasted effort, not one that adds it.

Three things change when a team adopts structured diagnostic process:

Root cause identification improves. The process is specifically designed to distinguish between symptoms, contributing factors, and root causes — a distinction that informal diagnosis routinely blurs.
Knowledge becomes transferable. Documentation requirements built into the process ensure that every diagnosis produces a record that the next technician can use.
Team capability grows over time. Each structured diagnostic event is also a learning event. Junior technicians who work through the process on real faults develop genuine diagnostic skill, not just pattern-matching for common failures.

What This System Provides

The five-phase process in Section 2 is the core of this system. The decision trees in Section 3 apply that process to ten specific fault classes with branching logic you can follow in the field. The worked examples in Section 4 show the process applied to real-world faults from start to resolution. The documentation templates in Section 5 capture the output. The implementation guide in Section 6 shows how to roll it out with a real team in 30 days.

Section 2

The Five-Phase Diagnostic Process

Each phase has a specific objective and a defined output. The sequence prevents the most common diagnostic errors: jumping to conclusions, missing evidence, and confusing correlation with cause. Apply it consistently — especially under pressure, when skipping steps is most tempting.

Phase 1 — Define the Problem Precisely

Most troubleshooting starts too fast. A technician arrives, hears a brief description of the symptom, and begins testing before the problem definition is complete. This creates an invisible constraint on everything that follows, because the technician is working from an incomplete and unverified understanding of what they are solving.

Output of Phase 1: A precise written problem statement answering six questions before any testing begins.

The Six Problem Definition Questions

What is the exact symptom? Not "machine stopped" — "Drive fault F7 on Conveyor 3B at 14:22 under 80% load."
When did it first occur? Date, time, time since last maintenance, time since last similar event.
Under what conditions? Load, speed, temperature, position in cycle, environmental conditions.
What changed recently? New components, software updates, process changes, recent maintenance work.
What is the impact? Full stop, degraded output, quality issue, safety concern, intermittent.
What has already been tried? Previous repairs, replacements, adjustments — and their results.
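One way to enforce the "six answers before any testing" rule is a simple record that reports which questions are still blank. The sketch below is a hypothetical structure, not a prescribed template; the field names are assumptions chosen to mirror the six questions.

```python
from dataclasses import dataclass, fields


@dataclass
class ProblemStatement:
    # One field per Phase 1 question; names are illustrative.
    exact_symptom: str = ""
    first_occurrence: str = ""
    conditions: str = ""
    recent_changes: str = ""
    impact: str = ""
    already_tried: str = ""

    def missing(self) -> list[str]:
        """Return the names of questions that have no answer yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


statement = ProblemStatement(
    exact_symptom="Drive fault F7 on Conveyor 3B at 14:22 under 80% load",
    impact="Full stop",
)
print("Unanswered before testing may begin:", statement.missing())
```

An empty list from `missing()` is the signal that Phase 1 is complete and Phase 2 can start.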

The "what changed recently" question deserves particular emphasis. A significant proportion of equipment faults — estimated at 40–60% in most facilities — are directly related to a recent change: a replacement part, a parameter adjustment, a process change, a cleaning procedure that disturbed a connection. Identifying this link early eliminates an enormous amount of unnecessary diagnostic work.

Common Phase 1 Mistakes

Accepting the operator's interpretation rather than the raw symptom ("the motor is bad" vs. "the conveyor stopped with a drive fault")
Not asking about recent changes because the question feels accusatory
Skipping previous repair history because the CMMS is slow to search
Describing the impact rather than the fault ("we lost two hours" tells you nothing about what failed)

Phase 2 — Gather Baseline Evidence

Before forming any hypothesis, collect objective data across all relevant evidence categories. Evidence gathering and hypothesis formation must be kept separate. Mixing them produces confirmation bias — you find evidence that supports your initial guess and stop looking.

Output of Phase 2: A complete set of objective measurements and observations recorded before any hypothesis has been formed.

Category	Key Checks	Tools
Electrical	Supply voltage (all phases), current draw, insulation resistance, ground continuity, fuse/breaker states, connector condition	Multimeter, clamp meter, insulation tester
Mechanical	Temperature (motor, bearing, gearbox), vibration, noise, visible wear, alignment, lubrication, coupling condition	IR thermometer, vibration pen, visual inspection
Control system	Active fault codes, event log timestamps, I/O states, parameter values, sensor signal levels, last program modification date	HMI, PLC software, documentation
Process	Flow rates, pressures, temperatures, quality data, cycle time trends	Process instrumentation, SCADA historian
Environmental	Ambient temperature, humidity, contamination, cleaning schedule, recent weather	Observation, facility logs

Evidence Quality Standard

"Voltage seemed fine" is not useful evidence. "Measured 231V L1-N, 229V L2-N, 233V L3-N at drive input terminals under no-load" is. Specific, measured, timestamped evidence narrows hypotheses and provides comparison points when the fault recurs. Vague observations produce vague diagnoses.

Phase 3 — Form and Rank Hypotheses

Generate a complete list of possible causes before filtering. The goal at hypothesis generation is breadth — write down everything that could plausibly produce the observed symptoms. Ranking comes immediately after.

Output of Phase 3: A ranked list of hypotheses ordered by evidence alignment, test accessibility, and failure probability.

Three Angles for Hypothesis Generation

Component angle: Which components in the affected subsystem could produce this symptom?
Mechanism angle: What failure mechanisms (wear, contamination, thermal, electrical overstress, software error) are consistent with the evidence?
History angle: What have been the confirmed causes of similar symptoms on this machine or machine type previously?

Ranking Criteria

Evidence alignment: How strongly does the gathered evidence support this hypothesis?
Test accessibility: Can this hypothesis be confirmed or ruled out quickly with available tools? High-evidence, quick-test hypotheses rank first.
Failure probability: Given the machine's age, history, and operating conditions, how probable is this failure mode?
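The three ranking criteria can be turned into a rough score to make the ordering explicit. The sketch below assumes a 1 to 5 scale per criterion and equal weighting; both are assumptions for illustration, and the example scores are invented, not measured.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    evidence_alignment: int   # 1 (weak) to 5 (strong)
    test_accessibility: int   # 1 (slow, invasive) to 5 (quick, tools on hand)
    failure_probability: int  # 1 (unlikely) to 5 (likely)

    def score(self) -> int:
        # Equal weighting is an assumption; adjust to suit the team.
        return self.evidence_alignment + self.test_accessibility + self.failure_probability


def rank(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Highest total score first; ties keep their original order."""
    return sorted(hypotheses, key=lambda h: h.score(), reverse=True)


candidates = [
    Hypothesis("Motor winding deterioration", 1, 3, 2),
    Hypothesis("Belt over-tension with cold ambient", 5, 5, 3),
    Hypothesis("Acceleration ramp too short", 2, 4, 2),
]
for position, h in enumerate(rank(candidates), start=1):
    print(f"Rank {position}: {h.description} (score {h.score()})")
```

The score is a tie-breaking aid, not a substitute for judgment: a hypothesis with strong evidence but a slow test may still deserve early attention.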

The Most Common Ranking Error

Technicians consistently over-rank the hypothesis they have seen confirmed before on similar equipment. Experience-based pattern matching is valuable — but it fails when the same equipment has a different contributing factor this time. Use history as one input to ranking, not the only input.

Phase 4 — Test Systematically

Test one hypothesis at a time, starting with Rank 1. Changing or testing several things at once, the most common field shortcut, makes it impossible to determine which action revealed the cause, or whether the machine recovered for an unrelated reason. Single-variable testing produces reliable conclusions.

Output of Phase 4: A documented test log with one entry per test.

The Test-Confirm-Document Sequence

For each test, in this order:

State the hypothesis being tested
State the expected result if the hypothesis is correct
State the expected result if the hypothesis is wrong
Run the test and record the actual result
Conclude: confirmed, ruled out, or inconclusive (needs additional test)
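The five steps above map naturally onto one log entry per test. The following is a minimal sketch of such an entry, assuming a hypothetical `LogEntry` record; the example values are taken from the belt tension test in Worked Example 1.

```python
from dataclasses import dataclass

# The three conclusions allowed by the sequence above.
OUTCOMES = {"confirmed", "ruled out", "inconclusive"}


@dataclass
class LogEntry:
    hypothesis: str
    expected_if_correct: str
    expected_if_wrong: str
    actual_result: str = ""
    conclusion: str = ""

    def close(self, actual_result: str, conclusion: str) -> None:
        """Record the result; reject conclusions outside the three allowed outcomes."""
        if conclusion not in OUTCOMES:
            raise ValueError(f"conclusion must be one of {sorted(OUTCOMES)}")
        self.actual_result = actual_result
        self.conclusion = conclusion


entry = LogEntry(
    hypothesis="Belt is over-tensioned",
    expected_if_correct="Midspan deflection below 15 mm under 10 kg",
    expected_if_wrong="Midspan deflection within 15 to 20 mm",
)
entry.close(actual_result="8 mm deflection", conclusion="confirmed")
print(entry)
```

Writing both expected results before the test is what makes the entry useful: the conclusion follows from the comparison rather than from hindsight.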

A ruled-out hypothesis is not wasted effort — it is useful information. Technicians who treat ruled-out hypotheses as failures will skip documentation and potentially repeat the same test at the next occurrence.

When Testing Gets Stuck

If you have worked through all ranked hypotheses without confirmation, one of three things happened: (1) the root cause was not on your hypothesis list, (2) evidence gathering was incomplete, or (3) the fault is intermittent and currently not present. In each case, return to an earlier phase — never start replacing untested components.

Phase 5 — Confirm and Document

A completed repair still requires functional confirmation and documentation before the work order closes. Both are frequently skipped under production pressure. Both matter.

Output of Phase 5: A verified resolution and a permanent record that becomes part of the asset's diagnostic knowledge base.

Functional Confirmation Standard

Machine operating at normal load for 15–30 minutes with all monitored parameters within specification and no recurrence of the fault condition. A machine that starts but has not been confirmed under load may still have the root cause present — particularly for thermal faults, intermittent electrical faults, and load-dependent mechanical issues.

Required Documentation Fields

Root cause — the specific component, condition, or mechanism that produced the fault
Confirming evidence — the test result or observation that confirmed root cause
Corrective action — exact repair, including part numbers, settings, and torque values where applicable
Contributing conditions — what allowed this fault to develop (age, contamination, overload, design limitation)
Preventive recommendation — what PM task, inspection interval, or design change would prevent recurrence

Why Documentation Compounds in Value

The first fault record for an asset is useful. The fifth reveals patterns. The twentieth shows failure modes, wear rates, and design weaknesses that no single event could identify. Teams that document consistently for 12 months routinely discover that 20% of their assets generate 80% of their diagnostic events — and that most of those events share the same three or four root causes. That insight is only visible in the record.
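Once records exist, the concentration of events on a few assets can be checked with a simple count. The sketch below is one possible approach, assuming each fault record carries an asset identifier; the asset names and event counts are hypothetical.

```python
from collections import Counter


def top_contributors(asset_per_event: list[str], share: float = 0.8) -> list[str]:
    """Smallest set of assets, most events first, covering `share` of all events."""
    counts = Counter(asset_per_event)
    total = sum(counts.values())
    covered, selected = 0, []
    for asset, n in counts.most_common():
        if covered >= share * total:
            break
        selected.append(asset)
        covered += n
    return selected


# Hypothetical year of fault records: one asset ID per diagnostic event.
events = ["CONV-3B"] * 5 + ["PRESS-1"] * 3 + ["PUMP-7", "FAN-2"]
print(top_contributors(events))
```

Running the same count on root cause instead of asset ID shows whether those events share a small number of causes.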

Section 3

Decision Trees — 10 Common Fault Classes

Each decision tree applies the five-phase framework to a specific fault class, giving you a structured branching path from symptom to confirmed cause. Follow each step in sequence. Do not skip steps because the answer seems obvious — the value is in the process, not just the outcome.

Decision Tree 1 of 10

Motor / VFD Drive Fault

1. Record the exact fault code and timestamp from the drive.

Look up the fault code in the drive manual. Distinguish between protection faults (overcurrent, overtemperature, overvoltage) and communication/control faults. Protection faults indicate a real electrical or mechanical condition. Communication faults indicate a wiring or signal issue.

2. Is this an overcurrent (OC) or overload (OL) fault?

YES → Measure motor current at last known running condition (from drive display or clamp meter). Compare to motor nameplate FLA. If current > 110% FLA: check for mechanical overload — binding, jamming, increased process load. If current normal but fault persists: check drive current sensor calibration, motor insulation resistance, and cable integrity.
NO → Proceed to step 3.

3. Is this an overtemperature fault (motor or drive)?

YES → Measure motor surface temperature and drive heatsink temperature with IR thermometer. Check: cooling fan operation, ventilation blockages, ambient temperature vs. drive rating, duty cycle vs. motor rating. Overtemperature faults that clear on cooling and return on load indicate genuine thermal overload — address the load or cooling, not the drive.
NO → Proceed to step 4.

4. Is this an earth fault, phase loss, or supply voltage fault?

YES → Measure all three supply phases at the drive input. Check for phase imbalance >2%. Measure insulation resistance of motor cable: L1, L2, L3 to earth — should exceed 1MΩ (new cable >100MΩ). Phase loss: check upstream fusing and supply cabling. Earth fault: locate insulation breakdown in cable, motor windings, or terminal connections.
NO → Check drive parameter settings against application requirements. Compare acceleration/deceleration ramps, current limits, and motor data entries to nameplate. Log multiple fault occurrences with timestamps to identify pattern.

Confirm: Run motor at full load for 15 min. Verify no fault recurrence and current within 5% of baseline.
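The numeric checks in this tree (current against nameplate FLA, and supply phase imbalance) can be sketched as below. The imbalance formula used here, maximum deviation from the average divided by the average, is an assumption; the tree does not state which definition it intends. Function names are hypothetical, and the example values come from Worked Example 1.

```python
def percent_of_fla(measured_amps: float, nameplate_fla: float) -> float:
    """Measured motor current as a percentage of nameplate full-load amps."""
    return 100 * measured_amps / nameplate_fla


def phase_imbalance_percent(l1: float, l2: float, l3: float) -> float:
    """Maximum deviation from the mean voltage, as a percentage of the mean (assumed definition)."""
    mean = (l1 + l2 + l3) / 3
    return 100 * max(abs(v - mean) for v in (l1, l2, l3)) / mean


def overcurrent_branch(measured_amps: float, nameplate_fla: float) -> str:
    """Apply the 110% FLA threshold from the overcurrent step."""
    if percent_of_fla(measured_amps, nameplate_fla) > 110:
        return "check for mechanical overload: binding, jamming, increased process load"
    return "check drive current sensor calibration, insulation resistance, cable integrity"


print(f"{percent_of_fla(28, 27):.0f}% of FLA")
print(f"{phase_imbalance_percent(399, 401, 398):.2f}% phase imbalance")
print(overcurrent_branch(28, 27))
```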

Decision Tree 2 of 10

Sensor / Digital Input Fault

1. Check the input status in the PLC/HMI with the machine at rest and the target object correctly positioned.

The input should show its expected state. If it does not, the fault is present now and testable. If it does show the correct state, the fault is intermittent — proceed with systematic checks regardless.

2. Measure supply voltage at the sensor (at the sensor terminal, not at the panel).

Voltage correct (per sensor spec) → Proceed to step 3.
Voltage missing or low → Trace supply cable from panel to sensor. Check: wiring continuity, connector seating, cable damage, fuse at supply point. Replace as required.

3. Measure the sensor output signal with the target in and out of range.

A healthy sensor should switch cleanly between its two states. Measure the output voltage in both states against specification. A sensor that produces intermediate voltages, chatters, or responds slowly is failing. Physically move the target through the detection range while monitoring the signal — partial response narrows the fault to sensing element or gap setting.

4. Does the signal reach the PLC input card correctly?

Verify signal continuity from sensor output terminal to PLC input terminal. A sensor that tests correctly in isolation but produces wrong PLC state has a wiring problem between sensor and card, or a faulty input card channel. Verify with a known-good signal on the same card channel.

Confirm: Cycle machine through full sequence. Verify all sensor states transition correctly at each position. Monitor for 10 complete cycles minimum.

Decision Tree 3 of 10

Power Supply / Control Voltage Fault

1. Measure the supply output voltage under load (machine powered, not isolated).

Compare to the rated output. Tolerance for most DC supplies is ±5%. A supply reading correctly at no load but collapsing under load indicates an overloaded or failing supply — do not test power supplies in isolation.

2. Is the supply output low, absent, or intermittent?

Low or absent → Check input voltage to the supply (mains or upstream DC bus). If input correct and output wrong: supply is failing — replace. If input also wrong: trace upstream.
Intermittent → Measure ripple on output with oscilloscope or AC setting on multimeter. High ripple (AC >1% of DC output) indicates capacitor degradation. Log the conditions under which the fault occurs — temperature-dependent intermittent supply faults indicate thermal cycling of failing components.

3. Is the supply output correct but a downstream circuit is faulted?

Measure voltage at the load (PLC, I/O module, relay coil) — not just at the supply output. Voltage drop in wiring indicates high-resistance connection or undersized cable. Check all terminals in the supply circuit for tightness and corrosion. A terminal that looks clean may still have 0.5–1.0Ω of resistance — enough to drop voltage significantly under load.

Confirm: Monitor supply voltage under full machine load for 30 minutes. Verify output remains within ±2% of rated value throughout.
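The effect of a high-resistance terminal follows directly from Ohm's law, and the tolerance and ripple checks are simple percentages. The sketch below works through them; the 24 V rail and 2 A load are assumed values for illustration, and the function names are hypothetical.

```python
def terminal_voltage_drop(load_amps: float, contact_ohms: float) -> float:
    """Voltage lost across a resistive connection: V = I x R."""
    return load_amps * contact_ohms


def within_tolerance(measured: float, rated: float, tolerance_pct: float) -> bool:
    """True if the measured voltage is within +/- tolerance_pct of the rated value."""
    return abs(measured - rated) <= rated * tolerance_pct / 100


def ripple_percent(ac_volts: float, dc_volts: float) -> float:
    """AC ripple as a percentage of the DC output."""
    return 100 * ac_volts / dc_volts


# Assumed example: 24 V supply, 2 A load, one terminal with 1.0 ohm of contact resistance.
rated = 24.0
drop = terminal_voltage_drop(load_amps=2.0, contact_ohms=1.0)
at_load = rated - drop
print(f"Drop {drop} V, voltage at load {at_load} V")
print("Within 5% at the load:", within_tolerance(at_load, rated, 5))
```

In this assumed case the supply output is correct, yet the load sees 22 V, which is outside the ±5% band. That is why the measurement must be taken at the load.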

Decision Tree 4 of 10

Mechanical Wear / Bearing Failure

1. Characterize the symptom: noise type, vibration location, temperature location, performance degradation.

Bearing failure typically presents with progressive noise (rumble → rattle → squeal), elevated temperature at the bearing housing, and increased vibration at the running frequency. Early-stage bearing faults are often detectable by feel (vibration in housing) and sound before they produce performance degradation or drive faults.

2. Measure bearing housing temperature and compare to baseline or similar equipment.

Normal bearing temperature: ambient + 20–40°C. Temperatures exceeding ambient + 60°C indicate lubrication failure, overloading, or misalignment. High temperature alone does not confirm bearing failure — also verify correct lubricant type and quantity, and check for shaft misalignment which generates heat through friction.

3. Was the bearing replaced recently (within the last 6 months)?

YES → Early failure of a replaced bearing almost always indicates installation error or a root cause that was not addressed: contamination ingress, misalignment, overloading, incorrect bearing specification. Do not simply replace again — investigate the contributing cause first.
NO → Compare bearing age to expected L10 life for the application. If running beyond L10 life, schedule replacement. If failing early, investigate lubrication history, load history, and alignment records.

Confirm: After replacement, verify alignment to within coupling manufacturer's specification. Re-check temperature after 2 hours running. Document bearing type, installation date, and lubrication schedule.
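The temperature thresholds in this tree can be expressed as a rise above ambient. The sketch below is illustrative: the tree gives a normal band (ambient + 20–40°C) and an alarm level (above ambient + 60°C), so the handling of the band between 40 and 60°C, and of rises below 20°C, is an assumption.

```python
def bearing_temperature_status(housing_c: float, ambient_c: float) -> str:
    """Classify bearing housing temperature by its rise above ambient."""
    rise = housing_c - ambient_c
    if rise > 60:
        return "investigate: lubrication failure, overloading, or misalignment"
    if rise > 40:
        # Band not defined in the tree; treated here as a watch condition (assumption).
        return "elevated: monitor and compare to baseline"
    return "normal"


print(bearing_temperature_status(housing_c=55, ambient_c=25))
print(bearing_temperature_status(housing_c=90, ambient_c=25))
```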

Decision Tree 5 of 10

Alignment / Vibration Fault

1. Identify when vibration started and whether it was sudden or progressive.

Sudden onset vibration after maintenance almost always indicates a procedural issue: incorrect reassembly, incorrect torque, component omission, or alignment not re-checked. Progressive vibration developing over weeks or months indicates wear, looseness, or a developing imbalance.

2. Does vibration frequency correlate with running speed?

YES (1× running speed) → Indicates mass imbalance or shaft bow. Check for debris attached to rotating element, verify shaft straightness.
YES (2× running speed) → Indicates misalignment (angular or parallel). Perform precision alignment check.
YES (bearing frequency) → Bearing defect. Proceed as in Decision Tree 4 (Mechanical Wear / Bearing Failure).
NO correlation → Check for structural resonance, loose fasteners, or external vibration source.

3. Check baseplate fasteners and mounting for looseness.

Loose mounting amplifies all other vibration sources and generates its own. Check every fastener in the mounting train — motor feet, gearbox mount, driven machine mount, baseplate anchors. Soft foot (motor frame not sitting flat on baseplate) produces vibration that alignment correction cannot fix.

Confirm: After corrective action, measure vibration velocity at all bearing housings. Target: below 2.8 mm/s RMS for normal industrial machinery. Document baseline for future comparison.
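The frequency branch of this tree amounts to comparing the dominant vibration frequency with multiples of running speed. The sketch below assumes a 5% matching tolerance and assumes the bearing defect frequencies are known from the bearing data; neither is specified in the tree, and the function name is hypothetical.

```python
def classify_vibration(peak_hz, running_hz, bearing_hz=(), tolerance=0.05):
    """Match the dominant vibration frequency against the branches of the tree."""
    def near(target):
        return abs(peak_hz - target) <= tolerance * target

    if near(running_hz):
        return "1x running speed: mass imbalance or shaft bow"
    if near(2 * running_hz):
        return "2x running speed: misalignment"
    if any(near(f) for f in bearing_hz):
        return "bearing frequency: bearing defect"
    return "no correlation: resonance, loose fasteners, or external source"


# Assumed example: motor at 1500 rpm, so running frequency is 1500 / 60 = 25 Hz.
running = 1500 / 60
print(classify_vibration(peak_hz=50.0, running_hz=running))
```

A real spectrum usually contains several peaks, so this is a first-pass sort, not a replacement for a full vibration analysis.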

Decision Tree 6 of 10

Pneumatic / Hydraulic Fault

1. Characterize the fault: insufficient force/speed, failure to extend/retract, drift under load, or leak.

Insufficient force or speed: pressure or flow problem. Failure to move: valve, solenoid, or mechanical binding. Drift under load: internal leakage past seals or valve spool. External leak: seal, fitting, or hose failure.

2. Measure system pressure at the actuator under load (not just at the supply).

Correct supply pressure does not confirm correct actuator pressure. Pressure drop between supply and actuator indicates flow restriction: blocked filter, pinched hose, partially closed valve, or undersized circuit. Compare pressure readings at multiple points to locate the restriction zone.

3. Verify valve solenoid energization and spool movement.

Measure voltage at the solenoid coil during a commanded movement. Correct voltage but no movement: coil or spool fault. No voltage: electrical issue upstream — check PLC output, wiring, fusing. A valve that operates correctly when manually actuated but not on solenoid command is an electrical problem, not a valve problem.

Confirm: Cycle actuator through full stroke 10 times under normal process load. Verify consistent force, speed, and position. Check for leaks at all connections after cycling.

Decision Tree 7 of 10

Thermal / Overtemperature Fault

1. Identify what is overheating: motor, drive, control panel, gearbox, hydraulic fluid, or process component.

Use an IR thermometer or thermal camera to map temperature distribution. Hot spots within a motor or drive indicate localized failure (winding short, heatsink blockage). Uniform overtemperature indicates an ambient or cooling system issue.

2. Is the ambient temperature elevated, or has the cooling arrangement changed?

YES → Calculate whether the component is operating within its rated temperature range given the new ambient. If not: cooling upgrade, derating, or process change required — not a component fault.
NO → Proceed to step 3.

3. Verify cooling system operation: fans, filters, heat exchangers, fluid cooling circuits.

Check: cooling fan rotation and airflow direction, filter condition (blocked filters are the single most common cause of control panel overtemperature), heat exchanger fouling, coolant fluid level and condition. A drive or motor that ran cool for years and now overheats usually has a cooling system fault, not a component fault.

Confirm: Monitor temperatures at operating load for 2 hours after corrective action. Verify all temperatures stable within rated limits. Establish cleaning or replacement schedule for cooling components.

Decision Tree 8 of 10

PLC / Control Logic Fault

1. Connect to the PLC and check the diagnostic buffer (event log, fault queue, or error register).

Most PLCs log every fault with a timestamp. Read the log before touching anything. Look for: I/O module faults, communication errors, program errors, power supply events, and watchdog trips. The log sequence shows whether the control issue caused a mechanical fault, or a mechanical fault triggered a control response.

2. Was the program modified recently?

YES → Compare the current program to the last known-good version. Check: modified rung logic, changed timer/counter presets, altered analog scaling, sequence step changes. A fault that appeared after a program change and was not present before is almost certainly caused by the program change.
NO → Proceed to step 3.

3. Monitor the I/O states in real time during a fault attempt.

Force the machine to the condition just before the fault occurs and watch all relevant inputs and outputs in the PLC monitor. A logic fault becomes visible as an output that should energize but does not, or an input that reads incorrectly. This technique converts a difficult "machine doesn't run" problem into a specific "this rung condition is never true" problem.

Confirm: Cycle machine through full auto sequence 5 times without fault intervention. Verify all I/O transitions correctly at each step. Back up the verified program before closing the job.

Decision Tree 9 of 10

Intermittent Fault (No Fault Present on Arrival)

1. Document the fault pattern: time of day, operating conditions, frequency, duration, self-clear behavior.

Intermittent faults are condition-dependent. The pattern reveals the condition. Faults that occur only during the first hour of operation suggest thermal effects (cold components, moisture). Faults correlated with a specific product or load suggest a mechanical condition. Faults that appear random, with no pattern, are often electrical: loose connections, failing capacitors, or marginal signal levels.

2. Inspect all electrical connections in the affected subsystem.

Loose or high-resistance connections are the single most common cause of intermittent electrical faults. Inspect every terminal in the signal path: tug-test each wire, check torque on screw terminals, inspect crimp terminals for pull-out, look for wire insulation damaged by sharp edges or heat. A connection that looks good may still be intermittent under vibration or thermal cycling.

3. Set up monitoring to capture the fault when it occurs.

If the fault cannot be reproduced on demand, it must be captured. Options: enable PLC extended event logging, set up data logging on suspect signals, configure drive fault capture at maximum resolution, or assign a technician to observe during the window when faults typically occur. The goal is objective evidence from the moment of fault — not a post-event investigation of equipment that has already self-cleared.

Confirm: After any repair to an intermittent fault, monitor for a minimum of 5× the typical recurrence interval before declaring resolved. An intermittent fault that "hasn't come back yet" is not confirmed resolved.
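The 5× rule needs a figure for the typical recurrence interval. One way to get it, sketched below, is to take the mean gap between logged fault timestamps; using the mean as "typical" is an assumption, and the dates shown are hypothetical.

```python
from datetime import datetime, timedelta


def monitoring_window(fault_times: list[datetime], multiple: int = 5) -> timedelta:
    """Minimum observation period: `multiple` times the mean gap between faults."""
    if len(fault_times) < 2:
        raise ValueError("need at least two fault events to estimate an interval")
    ordered = sorted(fault_times)
    mean_interval = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    return mean_interval * multiple


# Hypothetical fault log: four events, two weeks apart.
faults = [
    datetime(2026, 1, 5, 6, 10),
    datetime(2026, 1, 19, 6, 10),
    datetime(2026, 2, 2, 6, 10),
    datetime(2026, 2, 16, 6, 10),
]
print("Monitor for at least:", monitoring_window(faults))
```

For a fault that recurs every two weeks on average, the repair is not confirmed until about ten weeks have passed without an event.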

Decision Tree 10 of 10

Network / Communication Fault

1. Identify which devices lost communication and which retained it.

A partial outage (some devices lost, others retained) indicates a network topology problem: a segment fault, switch failure, or node failure downstream of where the split occurred. A total outage (all devices lost) indicates a master/scanner failure, network-wide power issue, or master configuration problem.

2. Check physical layer first: cables, connectors, termination resistors.

The majority of fieldbus communication faults are physical layer issues, not protocol issues. Check: cable condition and routing (away from power cables), connector seating and shield termination, termination resistors present and correct value at both ends of the segment, no T-junction stubs exceeding specification length. A bus that worked for years and suddenly faults often has a single cable or connector that has degraded to the point of failure.

3. Check for duplicate node addresses or configuration changes.

A newly added or recently replaced device with a duplicate address will cause network-wide communication disruption on many protocols. Verify all node addresses are unique. If a device was recently replaced, confirm the replacement has the same address configuration as the original. Check that IP addresses, node IDs, or station numbers match the network design documentation.

Confirm: Monitor network diagnostic counters (error count, retry count, bus load) for 30 minutes under normal operation. An error count that does not increase during that period confirms resolution. Document network topology with node addresses for future reference.
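The confirmation check on the error counter can be sketched as a comparison of successive readings. How the readings are obtained depends on the network and tooling, so the sample values below are hypothetical. Treating any change, including a counter reset, as "not stable" is an assumption.

```python
def error_count_stable(samples: list[int]) -> bool:
    """True if the error counter held the same value across all samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    return all(later == earlier for earlier, later in zip(samples, samples[1:]))


# Hypothetical readings taken every 5 minutes over the 30-minute window.
after_repair = [112, 112, 112, 112, 112, 112, 112]
still_faulty = [112, 112, 113, 113, 115, 115, 115]
print("Resolved:", error_count_stable(after_repair))
print("Resolved:", error_count_stable(still_faulty))
```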

Section 4

Worked Examples

These three examples show the five-phase framework applied to real-world fault types. Each follows the same structure: problem definition, evidence, hypotheses, test sequence, and confirmed resolution. Read them as process demonstrations, not just case studies — the thinking at each phase is what transfers to your own faults.

Worked Example 1 — Electrical Fault

VFD Overcurrent Fault — Recurring, No Obvious Cause

Phase 1 — Problem Definition

Symptom: Drive OC1 (overcurrent during acceleration) fault on a 15kW conveyor drive. Fault has occurred 4 times in the past 6 weeks. Clears on reset. Machine runs normally between events. No operator-reported changes.

Conditions: All four events occurred during the first 30 minutes of the morning shift. No faults during afternoon or evening shift.

Recent changes: Maintenance team notes that the conveyor belt was replaced 8 weeks ago — approximately 2 weeks before the first fault occurrence.

Previous action: Drive parameter check performed after second event — no issues found. No other action taken.

Phase 2 — Evidence Gathering

Drive fault log: all four OC1 events at 06:08–06:31. No faults outside morning startup window.
Motor current during fault (from drive display): 28A. Motor nameplate FLA: 27A. Ratio: 104% — borderline overload.
Motor insulation resistance: 480MΩ — healthy.
Supply voltage: 399V, 401V, 398V L1/L2/L3 — balanced, within spec.
Conveyor belt visual: new belt installed 8 weeks ago. Belt tension appears high — belt deflects less than 10mm under 10kg load at midspan (spec: 15–20mm).
Ambient temperature at 06:00: 4°C (facility not heated overnight).

Phase 3 — Hypotheses (Ranked)

Belt over-tension + cold ambient: Cold belt is stiffer and higher tension creates higher startup load. As belt warms through first 30 minutes, load normalizes. Evidence: timing pattern (morning only), cold ambient, high belt tension, borderline current. Quick test: check belt tension. Rank 1.
Drive acceleration ramp too short: Motor asked to accelerate too quickly, pulling high current. Evidence: OC1 (acceleration overcurrent). Counter-evidence: the fault started only recently and the ramp settings are unchanged. Rank 2.
Motor deterioration: Winding or mechanical fault increasing current. Evidence against: insulation resistance healthy, current only marginally over FLA. Rank 3.

Phase 4 — Testing

Test 1: Measure belt tension using deflection method. Result: 8mm deflection under 10kg at midspan. Spec: 15–20mm. Deflection is roughly half the specified minimum, confirming the belt is substantially over-tensioned.

Test 2: Adjust belt tension to 17mm deflection (midpoint of spec). Result: Belt correctly tensioned.

Test 3: Cold-start machine at 06:00 ambient on the following three mornings, monitoring drive current. Result: Peak startup current 23A, 22A, 24A across three days. No OC1 fault. Current reduced from 28A to 22–24A range.
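The arithmetic behind these three tests can be sketched in a few lines. This is a minimal illustration using the figures from this example; the function names are hypothetical, and the deflection check only classifies a reading against the spec range (it does not convert deflection into a tension value).

```python
def deflection_status(measured_mm: float, spec_min_mm: float, spec_max_mm: float) -> str:
    """Classify belt tension from midspan deflection; less deflection means more tension."""
    if measured_mm < spec_min_mm:
        return "over-tensioned"
    if measured_mm > spec_max_mm:
        return "under-tensioned"
    return "in spec"


def percent_of_fla(measured_amps: float, nameplate_fla: float) -> float:
    """Return measured current as a percentage of nameplate full-load amps."""
    if nameplate_fla <= 0:
        raise ValueError("nameplate_fla must be positive")
    return 100.0 * measured_amps / nameplate_fla


# Figures taken from this worked example
SPEC_MIN_MM, SPEC_MAX_MM = 15.0, 20.0
FLA_AMPS = 27.0

before_status = deflection_status(8.0, SPEC_MIN_MM, SPEC_MAX_MM)
after_status = deflection_status(17.0, SPEC_MIN_MM, SPEC_MAX_MM)
before_pct = percent_of_fla(28.0, FLA_AMPS)
after_pcts = [percent_of_fla(amps, FLA_AMPS) for amps in (23.0, 22.0, 24.0)]
```

With these inputs the pre-fix startup current is above 100% of FLA and all three post-fix readings are below it, which is the comparison that matters for confirming the fix.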

Phase 5 — Confirmed Resolution

Root cause: Replacement belt installed with excessive tension, causing elevated startup load. Combined with cold ambient temperatures (stiff belt, thicker lubricant), startup current exceeded OC1 threshold during acceleration.

Contributing condition: Belt tension not checked after installation. No documented tension specification for this conveyor in the maintenance records.

Corrective action: Belt re-tensioned to specification. Belt tension specification added to PM task card. Technician who installed belt briefed on tension measurement procedure.

Preventive recommendation: Add belt tension check to post-installation sign-off and to quarterly PM inspection for all conveyor assets.

Worked Example 2 — Mechanical Fault

Bearing Failure — Earlier Than Expected

Phase 1 — Problem Definition

Symptom: Increasing noise from pump motor DE bearing on Pump 4. Noise described as "grinding, intermittent." Motor runs but vibration increasing over past 3 weeks.

History: Bearing replaced 7 months ago after previous failure. Previous bearing had run for approximately 14 months. Expected bearing life at this duty: 24–36 months (L10 calculation).

Recent changes: Pump impeller replaced 6 months ago (1 month after last bearing replacement). Seal replaced at same time as impeller.

Impact: Machine running but vibration trending toward shutdown threshold.

Phase 2 — Evidence Gathering

DE bearing housing temperature: 74°C. NDE bearing: 52°C. Ambient: 28°C. DE delta: 46°C — elevated (normal ≤40°C).
Vibration at DE bearing: 6.2 mm/s RMS. NDE: 2.1 mm/s. Baseline reading 3 months ago: DE 2.4 mm/s. Significant increase.
Lubrication records: last greased 4 months ago. Interval: quarterly. Lubricant: Grease Type A (lithium complex).
Impeller condition: visual inspection through inspection port shows no visible damage or imbalance.
Shaft alignment: not checked at impeller replacement. No alignment record exists for this pump.
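The temperature and vibration figures above reduce to two simple comparisons: rise over ambient against the normal limit, and current vibration against baseline. A minimal sketch, using the readings from this example and hypothetical function names:

```python
def rise_over_ambient(housing_c: float, ambient_c: float) -> float:
    """Return bearing housing temperature rise above ambient, in degrees C."""
    return housing_c - ambient_c


def vibration_ratio(current_mm_s: float, baseline_mm_s: float) -> float:
    """Return current vibration as a multiple of the baseline reading."""
    if baseline_mm_s <= 0:
        raise ValueError("baseline_mm_s must be positive")
    return current_mm_s / baseline_mm_s


# Readings from this worked example
AMBIENT_C = 28.0
NORMAL_RISE_LIMIT_C = 40.0  # "normal <= 40 C" as stated in the evidence list

de_rise = rise_over_ambient(74.0, AMBIENT_C)
nde_rise = rise_over_ambient(52.0, AMBIENT_C)
de_elevated = de_rise > NORMAL_RISE_LIMIT_C
de_vibration_ratio = vibration_ratio(6.2, 2.4)
```

Recording the rise over ambient rather than the raw housing temperature keeps readings comparable between a cold morning and a hot afternoon.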

Phase 3 — Hypotheses (Ranked)

Shaft misalignment introduced at impeller replacement: No alignment check performed after impeller/seal work. A 2× running speed vibration signature would support this. Misalignment generates heat and accelerates bearing wear. Evidence: timing (failure shortly after impeller work), no alignment record, elevated temperature and vibration. Rank 1.
Lubrication failure: Wrong lubricant, insufficient quantity, contamination. Evidence: temperature elevated, last greasing 4 months ago against a quarterly interval (one month overdue), and no quantity or condition check in records. Rank 2.
Bearing specification error: Incorrect bearing installed at last replacement. Less likely given 7 months of acceptable operation. Rank 3.

Phase 4 — Testing

Test 1: Measure vibration frequency spectrum. Result: Strong 2× running speed component (97 Hz at 2900 RPM). A dominant 2× component is a classic shaft misalignment signature. Rank 1 hypothesis strongly supported.

Test 2: Bearing replaced (already at point of failure) and shaft alignment checked with dial indicators. Result: Angular misalignment 0.22° (spec: ≤0.05°). Parallel misalignment 0.31mm (spec: ≤0.10mm). Significant misalignment confirmed.

Test 3: Alignment corrected, machine restarted. Vibration at DE bearing: 1.9 mm/s. Temperature after 2 hours: 47°C (delta 19°C — within normal range).
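Two calculations sit behind these tests: converting shaft speed into the frequency of a given harmonic, and checking alignment readings against tolerance. A minimal sketch with hypothetical function names, using the spec limits from Test 2 and the corrected alignment values recorded in Phase 5 of this example:

```python
def harmonic_hz(shaft_rpm: float, order: int = 1) -> float:
    """Return the frequency in Hz of the given multiple of running speed."""
    return order * shaft_rpm / 60.0


def within_alignment_spec(angular_deg: float, parallel_mm: float,
                          max_angular_deg: float = 0.05,
                          max_parallel_mm: float = 0.10) -> bool:
    """Return True only if both angular and parallel readings are within tolerance."""
    return angular_deg <= max_angular_deg and parallel_mm <= max_parallel_mm


two_x_hz = harmonic_hz(2900, order=2)        # the 2x running speed component
as_found = within_alignment_spec(0.22, 0.31)  # readings before correction
as_left = within_alignment_spec(0.02, 0.06)   # readings after correction
```

Knowing the expected 1× and 2× frequencies before opening the spectrum makes it easier to tell a running-speed harmonic from an unrelated peak.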

Phase 5 — Confirmed Resolution

Root cause: Shaft misalignment introduced when impeller and seal were replaced 6 months ago. Alignment was not checked after reassembly. Misalignment generated excessive radial load on DE bearing, causing accelerated wear and premature failure at 7 months (vs. expected 24–36 months).

Corrective action: Bearing replaced, shaft aligned to specification (angular 0.02°, parallel 0.06mm). Alignment documentation created for this pump.

Preventive recommendation: Add mandatory alignment check to post-maintenance sign-off for any job involving shaft, coupling, impeller, or seal replacement on rotating equipment. Establish alignment records file for all pump assets.

Worked Example 3 — Control System Fault

Intermittent Position Fault on Servo Axis

Phase 1 — Problem Definition

Symptom: Servo axis 3 (X-axis, pick-and-place cell) generates "position deviation exceeded" fault approximately 3–5 times per shift. Fault clears on reset. Machine runs normally for 30–120 minutes between events. Fault has occurred for 3 weeks.

Conditions: No obvious pattern to timing within shift. Operators report it seems worse "when the machine is running fast" but cannot confirm.

Recent changes: None reported. Cell has been running this job for 8 months.

Previous action: Servo drive replaced 2 weeks ago (did not resolve fault). Parts cost: $1,840.

Phase 2 — Evidence Gathering

Drive fault log: all position deviation faults occur during axis deceleration phase of the fast-traverse move — not during acceleration or constant velocity.
Encoder signal quality: checked via drive diagnostic — signal quality indicator shows occasional "noise" events coincident with fault timestamps.
Encoder cable routing: cable runs parallel to servo motor power cable for approximately 800mm in the cable chain. Separation: approximately 30mm.
Cable chain inspection: encoder cable shows evidence of wear on outer jacket at a bend in the chain — insulation slightly scuffed but not breached.
Earth bonding: encoder cable screen connected at drive end only. Motor power cable screen connected at both ends.
Drive bus voltage during fault: no anomalies noted in log.

Phase 3 — Hypotheses (Ranked)

Encoder cable noise injection from proximity to power cable: During deceleration, motor generates regenerative voltage spikes on power cable. These induce noise into encoder cable running parallel. Encoder signal corruption causes drive to read incorrect position, triggering deviation fault. Evidence: fault timing (deceleration), encoder noise events in log, cable routing. Quick test: reroute or add separation. Rank 1.
Encoder cable damage at chain bend: Cable jacket wear may be causing intermittent signal degradation under flexion. Evidence: physical wear found on inspection. Would explain intermittent nature. Rank 2 (may be secondary to or concurrent with Rank 1).
Mechanical issue on axis: Actual mechanical position deviation (not signal error). Against: fault only during deceleration of fast-traverse, not under normal positioning. Rank 3.

Phase 4 — Testing

Test 1: Reroute encoder cable away from power cable in cable chain (minimum 100mm separation maintained along full run). Replace worn section of encoder cable. Result: No position deviation fault in next 8-hour shift. Monitoring continued.

Test 2: Monitor for five full shifts. Result: Zero faults over 5 shifts (previous rate: 3–5 faults per shift). Fault resolved.

Test 3: Verify encoder screen bonding and reconnect the screen at the encoder end as well. A screen bonded at one end only is less effective against high-frequency drive noise; confirm the bonding scheme against the drive manufacturer's installation guidance. Result: Drive encoder diagnostic shows no noise events post-correction.
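One way to judge whether the five-shift monitoring period in Test 2 was long enough is to ask how likely a fault-free run would be if nothing had actually changed. The sketch below assumes faults arrive independently at a constant average rate (a Poisson model), which is an assumption added here and may not hold for faults that cluster; it uses the low end of the previous 3–5 faults per shift.

```python
import math


def prob_zero_faults(rate_per_shift: float, shifts: int) -> float:
    """Probability of no faults over `shifts` shifts if the old fault rate still applied.

    Assumes a Poisson process: independent faults at a constant average rate.
    """
    if rate_per_shift < 0 or shifts < 0:
        raise ValueError("rate_per_shift and shifts must be non-negative")
    return math.exp(-rate_per_shift * shifts)


p_one_shift = prob_zero_faults(3.0, 1)
p_five_shifts = prob_zero_faults(3.0, 5)
```

Under this assumption a single clean shift still leaves roughly a 5% chance of coincidence, while five clean shifts makes coincidence negligible. For rarer intermittent faults the same calculation shows why the monitoring period needs to be much longer.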

Phase 5 — Confirmed Resolution

Root cause: Encoder signal cable routed in excessive proximity to motor power cable in cable chain. Regenerative voltage transients during deceleration induced noise onto encoder signal, causing sporadic position reading errors and triggering deviation fault. Cable wear at chain bend was a contributing factor reducing noise immunity.

Note on the previous drive replacement: Replacing the drive did not address the cable routing issue and therefore had no effect on the fault. This is a clear example of parts replacement without root cause identification — $1,840 spent with zero impact on the fault.

Preventive recommendation: Add encoder cable routing inspection (minimum separation from power cables) to commissioning checklist and periodic PM for all servo axes. Document cable routing specifications for each axis in machine documentation.

Section 5

Failure Mode Documentation Templates

These three templates create the documentation infrastructure that turns individual diagnostic events into organizational knowledge. Use them consistently and your team will accumulate a diagnostic database that becomes more valuable with every entry.

Template 1 — Asset Fault History Card

One card per asset. Maintained as a running record of every fault event. Filed with or linked to the asset's maintenance records. The fault history card answers the first question any technician should ask when arriving at a faulted machine: "Has this happened before?"

Asset Fault History Card — Field Layout

Asset ID: ______________ Asset Name: ______________ Location: ______________

Asset Description: ______________ Commissioned: ______________ Card Started: ______________

Date	W/O #	Symptom (brief)	Root Cause	Action Taken	Parts Used	Downtime (hrs)	Tech

Running totals: Total fault events this year: _____ Total downtime hours this year: _____ Most frequent root cause: _____________________
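For teams that keep the card electronically, the row layout and running totals above map onto a small record type. This is a minimal sketch, not a prescribed format; the `FaultEvent` type and `running_totals` function are hypothetical names, with fields mirroring the card columns.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FaultEvent:
    """One row of the asset fault history card."""
    event_date: date
    work_order: str
    symptom: str
    root_cause: str
    action_taken: str
    parts_used: str
    downtime_hours: float
    technician: str


def running_totals(events: list[FaultEvent], year: int) -> dict:
    """Compute the card's running totals for one calendar year."""
    in_year = [e for e in events if e.event_date.year == year]
    causes = Counter(e.root_cause for e in in_year)
    most_frequent = causes.most_common(1)[0][0] if causes else None
    return {
        "fault_events": len(in_year),
        "downtime_hours": sum(e.downtime_hours for e in in_year),
        "most_frequent_root_cause": most_frequent,
    }
```

The most-frequent-cause total only works if root causes are written consistently, so agreeing on short standard wording for common causes is worth the effort.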

Template 2 — Root Cause Summary

Completed once per significant fault event — any fault requiring more than 2 hours to resolve, any recurring fault, or any fault with safety or quality implications. This is the detailed record that supports future diagnostic work on this asset.

Root Cause Summary — Field Layout

Work Order #: _______ Asset ID: _______ Date: _______ Technician: _______________________

Fault Description (exact symptom, fault code, conditions):

_____________________________________________________________________________

Evidence Summary (key measurements and observations):

_____________________________________________________________________________

Root Cause (specific component, condition, or mechanism):

_____________________________________________________________________________

Confirming Evidence (what confirmed this was the root cause):

_____________________________________________________________________________

Corrective Action (exact repair, part numbers, settings):

_____________________________________________________________________________

Contributing Conditions (what allowed this fault to develop):

_____________________________________________________________________________

Preventive Recommendation:

_____________________________________________________________________________

Fault Classification: ☐ Electrical ☐ Mechanical — wear ☐ Mechanical — alignment ☐ Control — sensor ☐ Control — drive ☐ Control — logic ☐ Pneumatic/Hydraulic ☐ Thermal ☐ Environmental

Total diagnostic time: _____ hrs Total downtime: _____ hrs Confirmed running by: _________________
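The trigger for completing this template (more than 2 hours to resolve, a recurring fault, or safety or quality implications) can be written as a single rule, which is useful if a work order system is set up to prompt for it. A minimal sketch with a hypothetical function name:

```python
def needs_root_cause_summary(hours_to_resolve: float,
                             recurring: bool,
                             safety_or_quality_impact: bool) -> bool:
    """Return True if a fault event meets any trigger for a Root Cause Summary."""
    return hours_to_resolve > 2 or recurring or safety_or_quality_impact
```

Any one condition is sufficient, so a ten-minute fix on a recurring fault still requires the summary.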

Template 3 — Team Knowledge Base Entry

A condensed, searchable record designed for quick reference during future diagnostic events. One entry per confirmed root cause finding. Filed by asset, fault classification, or both. When a technician arrives at a faulted machine and asks "has anyone seen this before?", this is what they should be able to find in under two minutes.

Knowledge Base Entry — Field Layout

Entry ID: _______ Asset / Asset Type: _______________________ Date Added: _______

One-Line Summary (for search):

Example: "VFD OC fault on conveyor startup — caused by over-tensioned replacement belt + cold ambient"

_____________________________________________________________________________

Symptom to look for: _____________________________________________

Key evidence that points to this cause: _____________________________________

Root cause: _____________________________________________________________

Fix: __________________________________________________________________

Confirmation test: ______________________________________________________

Watch out for: _________________________________________________________

Source W/O #: _______ Added by: _______________________
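If entries are kept in a shared file or simple database, the two-minute lookup described above amounts to a keyword search across a few text fields. This is a minimal sketch under that assumption; `KnowledgeEntry` and `search_entries` are hypothetical names, and only a subset of the template fields is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeEntry:
    """Searchable subset of the knowledge base entry fields."""
    entry_id: str
    asset: str
    summary: str
    symptom: str
    root_cause: str
    fix: str


def search_entries(entries: list[KnowledgeEntry], *keywords: str) -> list[KnowledgeEntry]:
    """Return entries whose text contains every keyword, ignoring case."""
    wanted = [k.lower() for k in keywords]
    results = []
    for entry in entries:
        text = " ".join((entry.asset, entry.summary, entry.symptom,
                         entry.root_cause, entry.fix)).lower()
        if all(k in text for k in wanted):
            results.append(entry)
    return results
```

Because the search is plain substring matching, the one-line summary does most of the work: an entry is only findable if it contains the words a technician would actually type.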

How to Build a Useful Knowledge Base — Three Rules

File it within 24 hours. Entries written days later are incomplete. The details that matter — exact measurements, specific observations, the sequence of tests — fade quickly. Same-day or next-day documentation is the standard.
Write for the technician who wasn't there. Entries that say "fixed the drive" are useless. Entries that say "OC1 fault on AC10 cleared by re-tensioning belt to 17mm deflection after finding 8mm — check belt tension before any drive work on this conveyor" are useful for years.
Review quarterly for patterns. Four entries about different faults on the same asset in one quarter is a signal that the asset needs a deeper look — not just more diagnostic work. The knowledge base reveals what individual work orders cannot show.
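The quarterly review in the third rule can be sketched as a count of entries per asset within a quarter. This is a minimal sketch; the function names are hypothetical, and the default threshold of four simply follows the example given in the rule rather than any fixed standard.

```python
from collections import Counter
from datetime import date


def quarter_of(d: date) -> int:
    """Return the calendar quarter (1 to 4) for a date."""
    return (d.month - 1) // 3 + 1


def assets_to_review(entries: list[tuple[str, date]], year: int, quarter: int,
                     threshold: int = 4) -> list[str]:
    """Return assets with at least `threshold` entries in the given quarter.

    `entries` is a list of (asset_id, date_added) pairs.
    """
    counts = Counter(asset for asset, added in entries
                     if added.year == year and quarter_of(added) == quarter)
    return sorted(asset for asset, n in counts.items() if n >= threshold)
```

The output is a shortlist for discussion, not a verdict: an asset that appears on it needs a closer look at whether its entries share a common underlying condition.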

Section 6

30-Day Team Implementation Guide

This guide is for the manager or lead technician rolling out structured diagnostic process with a real team. It is designed to produce measurable results within 30 days — not by overhauling everything at once, but by introducing two or three specific changes per week and letting them embed before adding more.

Before You Start — Realistic Expectations

Teams do not adopt new processes because they are told to. They adopt them because the new process makes their work easier, produces better outcomes, and gets recognized. This implementation guide is built around making structured diagnosis genuinely useful to the technicians doing the work — not just adding paperwork for management reporting. If your technicians do not see the value within 30 days, something in the rollout needs adjusting, not the process itself.

Week 1 — Foundation

Day 1–2: Brief the Team

Present the five-phase framework in a 30–45 minute session. Do not lecture — use a recent recurring fault from your own facility as the example. Walk through how that fault would have been handled differently using the structured process. The goal is to connect the framework to problems the team actually recognizes, not to abstract process theory.

Address the "we don't have time" objection directly: show the math on the most expensive recurring fault from the past year. Multiply downtime hours per event by cost per hour, then by the number of recurrences. Compare that total to the estimated cost of the time a proper root cause investigation would have taken. The numbers usually make the case without further argument.
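The comparison described above is simple enough to put on a whiteboard, and can be sketched as follows. All figures in this example are hypothetical placeholders for illustration; substitute your own downtime cost, labor rate, and fault history.

```python
def recurring_fault_cost(downtime_hours_per_event: float,
                         cost_per_downtime_hour: float,
                         recurrence_count: int) -> float:
    """Total downtime cost of a fault across all of its occurrences."""
    return downtime_hours_per_event * cost_per_downtime_hour * recurrence_count


def investigation_cost(investigation_hours: float, labor_cost_per_hour: float) -> float:
    """Labor cost of one root cause investigation."""
    return investigation_hours * labor_cost_per_hour


# Hypothetical figures: 3 hours lost per event, 2000 per downtime hour, 6 events in the year
fault_cost = recurring_fault_cost(3.0, 2000.0, 6)
# Hypothetical figures: 8 hours of investigation at a labor rate of 75 per hour
rca_cost = investigation_cost(8.0, 75.0)
worth_investigating = fault_cost > rca_cost
```

This sketch counts only technician labor for the investigation. If the investigation itself requires the machine to be down, add that downtime to the investigation side before comparing.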

Day 3–5: Introduce Two Requirements Only

Do not roll out the full process at once. Start with exactly two requirements:

For any fault that has occurred before: complete the six-question problem definition (Phase 1) before beginning work. This takes five minutes. It produces the documentation that makes the next occurrence faster to resolve.
Before closing any work order: record root cause and corrective action in the work order. Not just "replaced sensor" — the root cause of why the sensor failed.

These two requirements alone will produce visible improvement within 30 days. They are also easy to verify and easy to coach — you can check any work order and immediately see whether the requirements were met.

Week 2 — First Applications

Day 8–10: First Joint Diagnostic Session

Select a recurring fault — ideally one that has come back at least three times in the past year. Work through the full five-phase framework on that fault, with a senior technician and at least one junior technician working together. The senior technician narrates their reasoning at each phase: "I'm checking the fault log before anything else because I want to see what conditions the fault occurs under before I start guessing" — making the thinking explicit.

Document the session fully using the diagnostic worksheet (included in the toolkit download). The completed worksheet becomes the first entry in the knowledge base for that asset.

Day 11–14: Apply to All New Work Orders on Recurring Faults

Extend the two Week 1 requirements to all technicians on all recurring fault work orders. Check compliance during daily walk-rounds — not in a punitive way, but by asking technicians to talk through their Phase 1 problem definition before they start work. This both reinforces the habit and gives you the opportunity to add context they may have missed.

Week 3 — Review and Adjust

Day 15–17: Review First Two Weeks

Pull all work orders from the past two weeks and review against the two requirements. Count: how many recurring fault work orders had a complete problem definition? How many had root cause documented at closure? The numbers tell you where the adoption gaps are.
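If work orders can be exported from the maintenance system, the two counts above can be produced in a few lines rather than by hand. This is a minimal sketch assuming each exported work order carries three yes/no flags; the `WorkOrder` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkOrder:
    """Hypothetical export row with the flags needed for the two-week review."""
    number: str
    recurring_fault: bool
    problem_definition_complete: bool
    root_cause_documented: bool


def compliance_summary(orders: list[WorkOrder]) -> dict:
    """Count how many recurring fault work orders met each Week 1 requirement."""
    recurring = [o for o in orders if o.recurring_fault]
    return {
        "recurring_total": len(recurring),
        "with_problem_definition": sum(o.problem_definition_complete for o in recurring),
        "with_root_cause": sum(o.root_cause_documented for o in recurring),
    }
```

Reporting the counts alongside the total (for example, 5 of 8) is more useful in a team review than a bare percentage, because the sample in two weeks is small.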

Common patterns and their remedies:

Problem definition missing: Usually because nothing prompted the technician to complete it before starting work. Add a verbal check-in before the technician leaves the planning area for any recurring fault job.
Root cause documented as symptoms: "Motor overheated" is a symptom, not a root cause. Coach technicians to ask "why?" one more time: "Motor overheated because cooling fan had failed — root cause is fan bearing failure due to contamination ingress at shaft seal."
Good documentation on complex jobs, absent on quick jobs: The quick jobs are often the most important to document — they are usually the recurring ones that will be back in two weeks.

Day 18–21: Introduce the Knowledge Base

Take the best-documented work orders from the first two weeks and convert them into knowledge base entries (Template 3, Section 5). Do this as a team exercise if possible — it reinforces what good documentation looks like and makes the value of the knowledge base visible immediately.

File the knowledge base entries where technicians can find them quickly: a physical binder in the workshop, a shared drive folder, or a pinboard near the machine. The format matters less than the accessibility. A knowledge base that requires three clicks and a password to access will not be used.

Week 4 — Embed

Day 22–25: Extend to All Fault Work Orders

Extend the two requirements from recurring faults only to all fault work orders. The habit should be established enough by now that this is an incremental change rather than a major additional burden. The root cause documentation requirement in particular should be automatic for any technician who has been consistently applying it to recurring faults for three weeks.

Day 26–28: Second Joint Diagnostic Session

Run a second full-framework diagnostic session, this time with a different senior technician leading and a different junior technician observing. The goal is to demonstrate that the process is transferable and not dependent on one person's implementation style. Different technicians will apply the phases with different emphasis and different pacing — that variation is fine, as long as the sequence is maintained.

Day 29–30: 30-Day Review

Compare the 30-day window to the 30 days immediately before it. Metrics to check:

Recurring fault recurrence rate — have previously recurring faults recurred less?
Work order closure quality — are root causes documented on a higher proportion of work orders?
Mean time to resolve — is resolution time increasing (expected, while team builds diagnostic habits) or already decreasing on fault types with knowledge base entries?
Parts expenditure — too early for a clear trend, but track it from this baseline.
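The first three metrics above can be computed the same way for both 30-day windows and then compared. This is a minimal sketch assuming closed work orders can be exported with a resolution time and two flags; the `ClosedOrder` type and the function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ClosedOrder:
    """Hypothetical closed work order record for the 30-day review."""
    resolve_hours: float
    root_cause_documented: bool
    repeat_of_known_fault: bool


def period_metrics(orders: list[ClosedOrder]) -> dict:
    """Summarize one 30-day window of closed fault work orders."""
    if not orders:
        raise ValueError("no work orders in this period")
    return {
        "mean_hours_to_resolve": mean(o.resolve_hours for o in orders),
        "root_cause_documented_share": sum(o.root_cause_documented for o in orders) / len(orders),
        "repeat_faults": sum(o.repeat_of_known_fault for o in orders),
    }


def change(previous: dict, current: dict) -> dict:
    """Return current minus previous for each metric."""
    return {key: current[key] - previous[key] for key in previous}
```

A rise in mean hours to resolve alongside a rise in documented root causes is the expected early pattern, so read the metrics together rather than one at a time.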

Present the 30-day numbers to the team. Even small improvements are worth naming — they make the process feel like it is working, which sustains adoption.

Beyond 30 Days

At 30 days, the five-phase framework should be the default approach for recurring and complex faults — applied habitually by all technicians who have been through the first-month program. The knowledge base should have at least 10–15 entries. The next milestones:

60 days: Review the knowledge base for the first patterns. Which assets appear most frequently? Which fault classifications dominate? Use this data to prioritize PM improvements and design reviews.
90 days: Conduct a formal recurring fault analysis — identify the five assets generating the most diagnostic events and the five fault root causes that appear most often. These become the priority targets for engineering intervention, PM redesign, or replacement planning.
6 months: Run a before/after comparison on total reactive maintenance hours, total parts expenditure on the highest-recurring fault classes, and mean time to resolve on the fault types that now have knowledge base entries. These numbers make the business case for continued investment and for extending the process to additional teams or sites.

What Sustained Implementation Looks Like

The 12-Month Picture

Teams that implement structured diagnostic process consistently for 12 months typically report three changes they did not expect when they started:

Junior technicians develop faster. Working through the framework on real faults — narrated by a senior technician — produces diagnostic skill in 6–12 months that would otherwise take years of trial and error.
The knowledge base becomes a competitive asset. When experienced technicians leave (and they do), their diagnostic knowledge does not leave with them. It is in the records. The new technician who arrives has access to years of documented fault history from day one.
Management conversations change. When every fault event produces a documented root cause and a preventive recommendation, the data exists to make the case for equipment upgrades, design changes, and PM investment — not as an opinion, but as a pattern visible in the record.
