The Hardware Troubleshooting Grimoire: A Spellbook for Tech Sorcery
Hardware trouble can be difficult to troubleshoot. Components interact in ways that can mislead you: a memory issue may look like a GPU issue, a motherboard issue may look like a CPU issue, and so forth. At times it can feel like voodoo. Hence the idea to call this a "grimoire": a spellbook with incantations to help you remove the curse from the hardware ;)
I will update this and add more over time. Feel free to send me suggestions if you have any; the idea is to make this as complete as possible and keep it reasonably up to date.
This guide is meant for both everyday users and people diagnosing servers in hyperscale datacenters, because the two have a ton of overlap. There may just be more to it in one case or the other, so I'll try to keep that separation clear.
To understand what's actually going on, it's important to know which tools you can use to gather information about the state of the system, so you can spot potential issues and rule things out one by one. This document contains several sections: Tools, Data sources, and Methods (and perhaps more in the future).
Tools
Sometimes you need a tool to figure out the issue.
Memory Testing Tools
Memory errors can cause a wide range of issues, from random crashes and data corruption to system instability, and are an important start for any hardware problem investigation, as they can affect other things too.
Operating System Agnostic Tools
- Memtest86/Memtest86+:
- These are the gold standard for memory testing. They boot from a USB drive or CD and thoroughly test your RAM outside of the operating system.
- Download memtest86+: https://www.memtest.org/
- Key Features:
- Extensive testing algorithms to catch a wide range of errors.
- Bootable environment, so it tests RAM independently of the OS.
- Easy to use, even for beginners.
- PassMark MemTest86 Pro:
- The paid edition of PassMark's MemTest86 (a free edition is also available), with advanced features and a user-friendly interface.
- Download MemTest86 (Free and Pro): https://www.memtest86.com/download.htm
- Key Features:
- Comprehensive testing algorithms.
- Detailed error reporting.
- Support for newer hardware and technologies.
Windows-Specific Tools
- Windows Memory Diagnostic:
- This built-in tool is a good starting point for basic memory testing.
- How to Use:
- Type "Windows Memory Diagnostic" in the Start menu search bar.
- Click "Restart now and check for problems (recommended)."
- The tool will run automatically on reboot and display the results.
Linux-Specific Tools
- Memtester:
- A command-line memory testing tool available in most Linux distributions.
- How to Use:
- Install the `memtester` package using your package manager (e.g., `sudo apt install memtester`; use dnf, yay, equo, etc. on other distributions).
- Run `memtester` with the desired memory size to test (e.g., `sudo memtester 8G` to test 8 GB of RAM).
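As a rough sketch of the steps above: `memtester` takes the amount of memory to test and, optionally, a number of loops (without a loop count it keeps running until you stop it). The wrapper below just makes that explicit. `run_memtester` and the `MEMTESTER_CMD` override are names made up for this example, not part of memtester itself. Keep in mind that memtester can only test memory the running OS is willing to hand over, so it complements a bootable tester rather than replacing one.

```shell
# Hypothetical wrapper around memtester; the function name and the
# MEMTESTER_CMD override are assumptions made for this sketch.
# Usage: run_memtester [SIZE] [LOOPS]   e.g. run_memtester 8G 3
run_memtester() {
    local size="${1:-1G}" loops="${2:-1}"
    # memtester exits non-zero when a test fails, so the caller can check $?.
    # MEMTESTER_CMD allows a dry run, e.g. MEMTESTER_CMD=echo run_memtester 8G 3
    ${MEMTESTER_CMD:-sudo memtester} "$size" "$loops"
}
```

For example, `run_memtester 8G 3 && echo "no errors found"` would do three passes over 8 GB and only print the message if every pass came back clean.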
macOS-Specific Tools
- Apple Diagnostics:
- While primarily a general hardware diagnostic tool, Apple Diagnostics also includes some basic memory testing.
- How to Use:
- Shut down your Mac.
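- On an Intel Mac, press and hold the D key while turning it on. On Apple silicon, hold the power button until the startup options appear, then press Command-D.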
- Follow the on-screen instructions to run the diagnostics.
Important Note:
- Multiple Passes: For thorough testing, it's recommended to run memory tests for multiple passes (e.g., overnight). This increases the chance of catching intermittent errors that might not appear in shorter tests. Memory is literally just chips filled with terrible little capacitors. Heat, power instability, electric fields in the air, and lots of other things can affect it. Testing over at least a 24-hour period exposes the memory to most of the usual range of conditions it goes through.
Hardware Monitoring Tools
This next section was written by Gemini, after giving it the right information (it thought hwmonitor was crossplatform and cpu-z was a mobile-only thing and some other issues haha, but this info below is verified!)
Keeping tabs on your hardware's vital signs (temperatures, voltages, fan speeds, and more) is crucial for identifying potential problems before they cause damage or system failures. Here are some powerful tools to help you monitor your hardware's health:
Operating System Agnostic Tools
- Open Hardware Monitor:
- An open-source tool for monitoring temperatures, voltages, fan speeds, and load levels for CPU, GPU, motherboard, hard drives, and more. It targets Windows and can also run on x86 Linux via Mono; macOS is not officially supported.
- Download: https://openhardwaremonitor.org/
Linux-Specific Tools
- lm-sensors:
- This command-line tool provides information from various hardware sensors, including temperatures, voltages, and fan speeds. It's widely available on Linux distributions.
- Installation: Typically available through your package manager (e.g., `sudo apt install lm-sensors` on Debian/Ubuntu).
- Usage: Run `sudo sensors-detect` to automatically configure sensors, and then `sensors` to view the data.
- psensor:
- A graphical frontend for lm-sensors, offering a more user-friendly interface for monitoring hardware sensors.
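To make the `sensors` output from lm-sensors a bit easier to act on, here is a minimal sketch that prints only the temperature readings at or above a threshold. It assumes the default human-readable output format (lines such as `Core 0:  +45.0°C  (high = +80.0°C, ...)`) and Celsius units; `hot_sensors` and the default limit of 80 are choices made for this example, not anything lm-sensors defines.

```shell
# Hypothetical helper: print temperature readings at or above a limit.
# Reads `sensors` output on stdin, so it also works on saved output.
# Usage: sensors | hot_sensors [LIMIT_IN_C]
hot_sensors() {
    local limit="${1:-80}"
    awk -v limit="$limit" -F': *' '
        # Reading lines look like "Core 0:   +45.0<deg>C  (high = ...)".
        # Voltages ("+1.20 V") and fans ("1200 RPM") do not match.
        $2 ~ /^[+-][0-9]+\.[0-9]+[^ ]*C/ {
            temp = $2
            sub(/[^0-9.+-].*/, "", temp)   # keep only the leading number
            if (temp + 0 >= limit + 0) print $1 ": " temp
        }'
}
```

Something like `sensors | hot_sensors 85` then gives a short list of hot spots instead of the full dump. What counts as "too hot" depends on the part, so treat the threshold as a starting point.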
Windows-Specific Tools
- HWMonitor:
- Primarily designed for Windows, but a limited version is available for Linux. The Windows version offers a wider range of features and a more polished interface.
- Download: https://www.cpuid.com/softwares/hwmonitor.html (Note: Choose the appropriate version for your operating system.)
- AIDA64 Extreme:
- A comprehensive hardware information and monitoring tool, offering detailed insights into your system's components. While primarily designed for Windows, it also has Android and iOS versions.
- Download: https://www.aida64.com/downloads
- CPU-Z:
- Another popular tool for displaying detailed information about your CPU, motherboard, RAM, and other hardware components. Primarily designed for Windows, but also has Android and iOS versions.
- Download: https://www.cpuid.com/softwares/cpu-z.html
- SpeedFan:
- A classic tool for monitoring temperatures and controlling fan speeds. It can be a bit complex for beginners, but offers powerful customization options.
macOS-Specific Tools
- iStat Menus:
- A popular app that provides a wealth of information in your menu bar, including CPU, GPU, memory, and disk usage, as well as temperatures and fan speeds.
- Download: https://bjango.com/mac/istatmenus/
- TG Pro:
- Another comprehensive hardware monitoring app for macOS, offering in-depth temperature monitoring, fan control, and diagnostics.
- Download: https://www.tunabellysoftware.com/tgpro/
Mobile Tools (Android/iOS)
- AIDA64:
- The mobile version of the popular Windows tool. Provides detailed information about your device's hardware, including temperatures, battery health, and sensor data.
- Download: Available on the App Store (iOS) and Google Play Store (Android).
- CPU-Z:
- The mobile version of the popular Windows tool. Displays information about your device's processor, battery, sensors, and more.
- Download: Available on the App Store (iOS) and Google Play Store (Android).
Data sources
Logs and other things to get information
Linux kernel
`dmesg` and `/var/log/messages` (or `/var/log/syslog` on Debian/Ubuntu) are invaluable on both desktops and servers, because if the kernel or any of its modules notices anything wonky with the hardware, it'll be in there.
Look for things like "Machine Check Exception" and crashes of GPU-related modules. The log may seem intimidating at first, but it's very doable to scroll through the whole thing in a minute or so. Do take the time to do that rather than just grepping for keywords, because otherwise you might miss important things.
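If you want a little help while still reading the whole log, one option is to mark suspicious lines instead of filtering everything else out. The sketch below prints every line and prefixes likely hardware-related ones with `>>>`. The pattern list is only my guess at useful keywords (machine checks, NVIDIA Xid messages, GPU hangs) and is by no means complete; `flag_hw_lines` is a made-up name.

```shell
# Hypothetical helper: print the full kernel log, marking lines that
# match a few hardware-related keywords. The keyword list is an
# assumption for illustration; extend it for your own hardware.
# Usage: sudo dmesg | flag_hw_lines | less
flag_hw_lines() {
    awk '
        {
            line = tolower($0)
            if (line ~ /machine check|mce:|hardware error|nvrm: xid|gpu hang/)
                print ">>> " $0
            else
                print "    " $0
        }'
}
```

Because nothing is dropped, you still scroll past everything else the kernel had to say, which is the point of reading rather than grepping.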
systemd journald
With `journalctl -xb -1` you can request the logs from the previous boot. Use `-2` for the boot before that, `-3` for the one before that, and so forth. This can be useful after hard crashes to try to figure out what was going on.
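A small sketch of how that can be narrowed down after a crash: the helper below asks journald for kernel messages of priority `err` or worse from a given number of boots ago. The flags (`-k`, `-b`, `-p`, `--no-pager`) are standard journalctl options; the function name and the `JOURNALCTL` override are assumptions for this example. Note that older boots are only available if the journal is stored persistently (for instance when `/var/log/journal` exists), and `journalctl --list-boots` shows which boots journald knows about.

```shell
# Hypothetical helper; the name and the JOURNALCTL override are assumptions.
# Usage: prev_boot_errors [BOOTS_AGO]   (defaults to 1, the previous boot)
prev_boot_errors() {
    local boots_ago="${1:-1}"
    # -k: kernel messages only, -b -N: N boots ago,
    # -p err: priority "err" and more severe
    ${JOURNALCTL:-journalctl} -k -b "-$boots_ago" -p err --no-pager
}
```

This is a first look, not a replacement for reading the full log of the crashed boot; a hard lockup often leaves nothing at error priority at all.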
Methods
To figure things out about a machine, sometimes you need structured methods to understand what's going on.
Long tests
Minimal Configuration (minconfig)
If a machine is not booting at all, is acting too erratically or strangely, or fails to get to the point where you can actually do things, you can try removing parts until you're at a bare minimum state.
This may be less relevant for desktops with, e.g., a single stick of RAM, SSD, CPU, and GPU, but it's very useful with servers that aren't booting, where you may have over 32 sticks of RAM, 2+ CPUs, a whole array of disks, multiple storage HBA/SAS cards, NICs, and so on. Bring it back to one of each and see if that boots. If it does, then at least you know the base components are working; if not, you know to try swapping out one of those.
But of course even if you have a desktop with multiple parts, you can try to bring it back to a minimal state by disconnecting anything unnecessary, to see if it will at least boot.
Dancing with the hardware
Sometimes, especially with more complex hardware like servers, it's hard to tell where a problem is originating. For instance, your fancy GPU or HBA or NIC runs at x8 link width instead of x16. Is it the CPU? Is it dirty pins? Is it a broken pin or trace in the motherboard? Is it the GPU/HBA/NIC itself? Maybe an interposer or extension cable carrying the signal? And this doesn't just go for PCIe, of course: networking equipment can behave similarly, as can USB/SATA not functioning (USB/SATA is often in the CPU these days!). There are always several places where it could be going wrong.
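On Linux, one way to spot the x8-instead-of-x16 case is to compare what each device says it can do (`LnkCap`) with what was actually negotiated (`LnkSta`) in `lspci -vv` output. The sketch below assumes the usual pciutils layout, where each device starts on an unindented line and the link lines contain `Width xN`; `pcie_width_mismatch` is a made-up name. A mismatch is not automatically a fault: a x16 card in a slot that is only wired for x8 will show up here too.

```shell
# Hypothetical helper: list PCIe devices whose negotiated link width
# (LnkSta) differs from the width they are capable of (LnkCap).
# Usage: sudo lspci -vv | pcie_width_mismatch
pcie_width_mismatch() {
    awk '
        function link_width(s) {
            # Pull "x16" out of "... Width x16, ..."; empty if absent.
            if (match(s, /Width x[0-9]+/))
                return substr(s, RSTART + 6, RLENGTH - 6)
            return ""
        }
        /^[0-9a-f]/ { dev = $0; cap = "" }   # unindented line: new device
        /LnkCap:/   { cap = link_width($0) }
        /LnkSta:/   {
            sta = link_width($0)
            if (cap != "" && sta != "" && sta != cap)
                print dev ": capable " cap ", running " sta
        }'
}
```

Run it with root privileges, since `lspci` hides the capability details from regular users. Checking again after reseating or moving the card shows whether the width changed.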
To narrow down which component is causing it, you can "dance" components between places, e.g. swap the two CPUs around so CPU1 is now in CPU0's socket and vice versa, or move the DIMMs/GPU/HBA/NIC/disks to different slots or ports. If the problem follows the part, the part is suspect; if it stays where it was, suspect the slot, the motherboard, or (on multi-socket systems) the CPU behind it. Dancing or reseating can sometimes also fix issues with pin connectivity and is thus always worth trying.
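The reasoning behind a single swap can be written down as a tiny decision table. This is only an illustration of the logic, with made-up names: the suspect part starts in slot A (where the fault was seen) and is moved to slot B, and you then note where the fault shows up.

```shell
# Hypothetical helper: interpret the result of moving a suspect part
# from slot A (where the fault was seen) to slot B.
# $1 = where the fault appears after the swap: A, B, or none
interpret_swap() {
    case "$1" in
        B)    echo "fault followed the part: suspect the part itself" ;;
        A)    echo "fault stayed with the slot: suspect the slot, board, or CPU behind it" ;;
        none) echo "fault disappeared: likely a seating or contact problem, keep monitoring" ;;
        *)    echo "usage: interpret_swap A|B|none" >&2; return 1 ;;
    esac
}
```

Real machines are messier than this (two marginal parts can mask each other), so repeat the dance with another part if the answer does not hold up.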