According to a report by Phoronix, Linux kernel version 6.1 is introducing a new logging system for identifying bad CPUs and their associated cores within a server. The logging system can record exactly which core, CPU, and socket was involved when a fault occurred.
This isn’t a fully automated system, and it’s only for logging; it won’t stress the CPU to check for faults. As a result, Rik van Riel, who authored the CPU logging system for 6.1, says system admins will want to run commonly used kernel code that is known to trigger faults on a suspect system, with the logger enabled, to see which cores are bad.
The logger isn’t perfect, since a kernel task might get rescheduled to another CPU or core before the message is printed, but he finds this strategy good enough to find bad CPUs or cores. Oftentimes, CPU faults can be “oddly specific,” where only particular programs or pieces of code will crash on the affected core.
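To make that workflow concrete, here is a minimal Python sketch of how an admin might tally logged faults by location to spot a core that keeps turning up. It assumes each relevant kernel log line contains a fragment shaped like "likely on CPU 3 (core 3, socket 0)"; that wording, the `FAULT_RE` pattern, and the `tally_faults` helper are illustrative assumptions rather than anything confirmed by the report, so check them against real log output before relying on them.

```python
import re
from collections import Counter

# ASSUMPTION: the kernel log line carries a fragment of this shape.
# Adjust the pattern to match what your kernel actually prints.
FAULT_RE = re.compile(r"likely on CPU (\d+) \(core (\d+), socket (\d+)\)")


def tally_faults(lines):
    """Count logged faults per (socket, core, cpu) location."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            cpu, core, socket = (int(g) for g in match.groups())
            counts[(socket, core, cpu)] += 1
    return counts


# Hypothetical sample lines, made up for illustration only.
sample_log = [
    "app[101]: segfault at 0 ip 0 sp 0 error 4 likely on CPU 3 (core 3, socket 0)",
    "usb 1-1: new high-speed USB device",
    "app[102]: segfault at 0 ip 0 sp 0 error 4 likely on CPU 3 (core 3, socket 0)",
    "app[103]: segfault at 0 ip 0 sp 0 error 4 likely on CPU 17 (core 1, socket 1)",
]

# Print locations from most to least frequently implicated.
for (socket, core, cpu), n in tally_faults(sample_log).most_common():
    print(f"socket {socket}, core {core}, CPU {cpu}: {n} fault(s)")
```

Because a task can be rescheduled before the message is printed, a single entry proves little; it is the same location showing up repeatedly across many crashes that points to bad silicon.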
This feature isn’t really designed for consumers; it is aimed primarily at system admins running fleets of Linux-based servers. For these admins, the new tool can be really useful for hunting down mysterious hardware faults on systems that pass full-blown CPU stress tests such as Prime95 or AIDA64 without issue.
Error checkers such as this one, as well as Intel’s new In-Field Scan technology, are continuing to grow in popularity in the server industry. As transistors continue to shrink on bleeding-edge nodes, there are higher chances of errors occurring within the silicon (commonly known as soft errors).
As we get closer and closer to the physical limits of transistor sizes (such as 1 nm or smaller), CPUs should theoretically become more and more susceptible to errors, particularly those induced by cosmic radiation. As a result, CPU error checking will become increasingly important as time goes on and transistor density continues to improve.