Richard J. Rinehart

November 28th, 2011

Case Study of a Microcontroller-Based Power Supply Design Failure

By Richard J. Rinehart, Senior Electrical Engineer.

With the advent of more powerful microprocessors and microcontrollers at steadily decreasing prices in recent decades, the use of logic devices has blossomed. The feasibility of digitally controlled regulators and “smart” power supplies (which can monitor and report their own performance and health) has been amply demonstrated. In higher-end applications, such supplies have become the norm, displacing less intelligent power circuitry.

Quite often, the controller of choice for such applications is based on Harvard architecture. The Harvard architecture is most often contrasted with Von Neumann architecture. In a nutshell, Harvard architecture devices keep code and data in separate memory structures, while Von Neumann architecture devices keep code and data in common memory structure(s). The Von Neumann architecture is arguably the most familiar style in use at this time, being represented by most commercial computing equipment such as Windows-based PCs. Harvard architecture is better suited to controller functions where there is little need to modify the code frequently. By separating code and data, the “day-to-day” functions of the controller are protected from accidental corruption by errant software routines. In practice, the rigorous application of the Harvard design philosophy leads to certain difficulties in software development. For that reason, a hybrid Harvard architecture, which has a restricted ability to write data into code memory, has found the widest acceptance in today’s market.
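The “restricted ability to write data into code memory” can be pictured with a small model. The following is only a sketch under stated assumptions: the structure `hybrid_harvard`, the functions `data_store`, `code_unlock` and `code_store`, and the key value `UNLOCK_KEY` are all hypothetical names invented for illustration and do not describe the controller used in this case study. Data stores always succeed, while a store into code memory succeeds only if an unlock key was presented immediately beforehand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CODE_WORDS 8u       /* hypothetical size of code memory  */
#define DATA_BYTES 8u       /* hypothetical size of data memory  */
#define UNLOCK_KEY 0x55AAu  /* hypothetical unlock key (assumed) */

/* Code and data live in separate arrays, as in a Harvard device. */
typedef struct {
    uint16_t code[CODE_WORDS];
    uint8_t  data[DATA_BYTES];
    bool     code_unlocked;  /* true only after a correct key was presented */
} hybrid_harvard;

/* Ordinary data stores are unrestricted and can never reach code memory. */
void data_store(hybrid_harvard *m, size_t addr, uint8_t value)
{
    assert(addr < DATA_BYTES);
    m->data[addr] = value;
}

/* Present a key; only the correct key arms the next code store. */
void code_unlock(hybrid_harvard *m, uint16_t key)
{
    m->code_unlocked = (key == UNLOCK_KEY);
}

/* Store into code memory. The unlock is one-shot: it is consumed whether
 * or not the store is allowed. Returns true if the word was written. */
bool code_store(hybrid_harvard *m, size_t addr, uint16_t word)
{
    assert(addr < CODE_WORDS);
    bool allowed = m->code_unlocked;
    m->code_unlocked = false;
    if (allowed) {
        m->code[addr] = word;
    }
    return allowed;
}
```

In this model an errant routine that calls `code_store` without first calling `code_unlock` with the right key leaves code memory untouched, which is the protection the hybrid approach is meant to preserve.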

One such supply under development in recent years contained two digital controllers: a Harvard architecture device that provided self-monitoring and interface communications, and a DSP-style workhorse controller for the internal supervisory regulation and performance of the supply. The supply was a single printed circuit board with three different voltage outputs, intended for use in close physical proximity to the CPU of a server. As the working voltages of CPUs come down to reduce the power dissipation of ever faster switching, their current requirements go up, and the PC board conductors needed to carry large currents from a remote supply become prohibitive. The adopted design approach is to deliver higher voltage and lower current to a high-efficiency conversion/regulation point near the CPU.

While the design of the supply was in the pre-production phase, it was put into production under a pilot process to prove the design and manufacturability of the supply prior to initiating volume production in an offshore facility. The final touches were being applied to the design of the hardware, the software, the process and the test equipment. At this time, non-volatile data memory location 0 was spuriously and infrequently changing from its default value of FFh to 00h. The non-volatile memory storage was not critical to the basic functions of the supply. The customer had slated it for some future, undefined use that might never even be implemented.

At the first occurrence, collective opinion among the design and development team was to dismiss it as a random fluke. When the same error repeated 3 or 4 times in different modules over the course of a week, the error began to receive serious consideration from most, but not all, of the development team. The lead design engineer was opposed to chasing after a perceived minor hiccup with a vestigial feature of the supply.

Intermittent failures are sometimes the most troubling of all failures. In mission-critical applications such as medical, nuclear, airborne or space-borne equipment, such failures must be positively proven to be understood and fixed before manufacture of the item proceeds. Since intermittent problems are usually difficult to catch “red-handed”, i.e., to observe at the moment of occurrence, it is often tempting to sweep the problem under the rug.

In spite of conflicting opinion on how to proceed, management granted a “grudged consent” to proceed with pinpointing the failure. Over the course of several weeks of repeated testing and data analysis, no progress was made. Operating the supply for many hours under close monitoring would not reproduce the failure.

The various contributing factors to the error were taken into consideration. During teleconferences, the software development group located on the East Coast was asked to review their code, particularly any recent revisions. The hardware design and manufacturing groups, located on the West Coast, were asked to review their functions for contributing factors. The controller vendor was also asked to suggest insights into the situation.

The software development group observed that three pieces of code had been implemented shortly before the problem began to appear: the routine by which the controller was able to update its own code, the routine that wrote to non-volatile memory storage, and the DSP code updates.

The hardware design group had its history established in analog design principles. These kinds of problems previously had been attributed to design tolerances, internal part manufacturing flaws and similar analog considerations.

Manufacturing processes were reviewed, but no contributors could be found within that domain that would cause the error to occur.

During a teleconference with the controller vendor, one of the applications engineers mentioned that a similar problem had occurred in a similar design for other customers. The cause of the error in that instance had been attributed to random execution of code during the power down of the circuit. As the voltage supplied to the controller rolled off during power down, the controller continued to execute many processor cycles under questionable circuit conditions. This eventually proved to be the correct direction to pursue, but it was loudly protested by the lead hardware designer as a ridiculous course of action and a waste of time!

Under protest and close scrutiny, software development was requested to supply a temporary code revision to the routine that performed writing to the non-volatile memory. Rather than writing a default value of 00h, the routine was to write A5h. The revised code required several days to be received from the group located on the other side of the country. The temporary revised code could only be placed in the manufacturing process with careful supervision of the modified units, lest some of the test units escape into customers’ hands with the unproven, unqualified code inside.

These items were done, and lo and behold, after a few days of pre-production manufacturing, 3 of the test units displayed a value of A5h in non-volatile memory location 0! The final explanation was that, during power down, the CPU would occasionally vector erroneously to the code at interrupt location 0, which happened to be the routine that wrote values into the non-volatile memory.
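The reasoning behind the A5h experiment can be sketched in a few lines of C. A value of 00h could plausibly come from many sources, whereas a distinctive pattern such as A5h is unlikely to appear unless the write routine itself ran. This is a minimal simulation, not the actual firmware: the array `nvm`, its size, and the functions `nvm_erase_all`, `nvm_raw_write`, `nvm_write_default` and `nvm_classify` are hypothetical names introduced for illustration; only the values FFh, 00h and A5h come from the account above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NVM_SIZE     16u    /* hypothetical size of the simulated NVM   */
#define NVM_ERASED   0xFFu  /* default value reported in the case study */
#define NVM_SENTINEL 0xA5u  /* temporary diagnostic value that was used */

typedef enum {
    NVM_UNTOUCHED,           /* still holds the default value        */
    NVM_WRITTEN_BY_ROUTINE,  /* holds the sentinel: the routine ran  */
    NVM_CORRUPTED_ELSEWHERE  /* holds anything else: look elsewhere  */
} nvm_verdict;

static uint8_t nvm[NVM_SIZE];  /* stand-in for the real non-volatile memory */

/* Return every simulated location to its default value. */
void nvm_erase_all(void)
{
    for (size_t i = 0; i < NVM_SIZE; i++) {
        nvm[i] = NVM_ERASED;
    }
}

/* Low-level write; in this model it is the only way to change the NVM. */
void nvm_raw_write(size_t addr, uint8_t value)
{
    assert(addr < NVM_SIZE);
    nvm[addr] = value;
}

/* Stand-in for the firmware's write routine after the temporary revision:
 * it writes the sentinel instead of 00h. */
void nvm_write_default(size_t addr)
{
    nvm_raw_write(addr, NVM_SENTINEL);
}

/* Interpret what is found at a location after a test run. */
nvm_verdict nvm_classify(size_t addr)
{
    assert(addr < NVM_SIZE);
    if (nvm[addr] == NVM_ERASED) {
        return NVM_UNTOUCHED;
    }
    if (nvm[addr] == NVM_SENTINEL) {
        return NVM_WRITTEN_BY_ROUTINE;
    }
    return NVM_CORRUPTED_ELSEWHERE;
}
```

Finding the sentinel at location 0 corresponds to the `NVM_WRITTEN_BY_ROUTINE` verdict: the corruption came from the write routine executing when it should not have, rather than from some unrelated disturbance of the memory.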

Epilogue

The workaround for the immediate difficulty was a software revision that moved the affected write from the first non-volatile memory location to the very top of memory, which was unlikely ever to be used. A permanent solution was never determined, as the company went out of business not long after this incident, due to an unrelated manufacturing error that resulted in a multi-million dollar product recall.

All microprocessors and microcontrollers familiar to this author have a variety of input pins that can enable or disable the CPU. The pins are usually labeled as “CPU DISABLE”, “POWER GOOD” or similar. In breadboard kits supplied by vendors or educational circuits designed for introductory computer engineering courses, these pins are usually tied to whichever voltage rail will make them irrelevant to the CPU operation.

Many integrated circuit vendors market a small, inexpensive chip commonly referred to as a supervisory circuit, or brownout protector. Such an IC monitors operational parameters of the controller and its circuitry, most notably the supply voltage, and can hold the controller disabled while those parameters are out of range. Although it is a fairly small design consideration, failing to pay attention to it can have disastrous consequences. The failure that was observed and very nearly ignored in this case study was only “the tip of the iceberg”. It is unknown what other functions the controller might have randomly executed during power down.
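The effect of a supervisory circuit on the power-down behavior described in this case study can be illustrated with a simple model. This is a sketch under loud assumptions: the thresholds `TRIP_MV` and `RELEASE_MV`, the linear voltage ramp, and the names `supervisor`, `supervisor_update` and `cycles_below_trip` are all hypothetical, and a real controller stops executing at some voltage on its own. The model only counts how many simulated processor cycles run while the supply is below the trip threshold, with and without a supervisor holding the controller in reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds in millivolts; real values come from datasheets. */
#define TRIP_MV    2900u  /* reset is asserted below this voltage          */
#define RELEASE_MV 3000u  /* reset is released at or above this voltage    */

typedef struct {
    bool in_reset;  /* true while the supervisor holds the CPU disabled */
} supervisor;

/* Feed one supply sample to the supervisor. The gap between TRIP_MV and
 * RELEASE_MV gives hysteresis, so the output does not chatter near the
 * threshold. Returns true while reset is asserted. */
bool supervisor_update(supervisor *s, uint32_t vcc_mv)
{
    if (vcc_mv < TRIP_MV) {
        s->in_reset = true;
    } else if (vcc_mv >= RELEASE_MV) {
        s->in_reset = false;
    }
    return s->in_reset;
}

/* Simulate a linear ramp-down from start_mv toward 0 in step_mv decrements,
 * one CPU cycle per sample, and count the cycles that execute while the
 * supply is below TRIP_MV. */
uint32_t cycles_below_trip(uint32_t start_mv, uint32_t step_mv,
                           bool use_supervisor)
{
    supervisor s = { .in_reset = false };
    uint32_t unsafe_cycles = 0;
    uint32_t v = start_mv;

    assert(step_mv > 0);
    for (;;) {
        bool held = use_supervisor && supervisor_update(&s, v);
        if (!held && v < TRIP_MV) {
            unsafe_cycles++;  /* CPU ran under questionable conditions */
        }
        if (v < step_mv) {
            break;            /* the next step would go below zero */
        }
        v -= step_mv;
    }
    return unsafe_cycles;
}
```

With a 3300 mV start and 100 mV steps, `cycles_below_trip(3300, 100, false)` counts 29 cycles executed below the trip point, while `cycles_below_trip(3300, 100, true)` counts none, because the supervisor asserts reset at the first sample below `TRIP_MV`.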

A significant factor in this story concerns the subject of expertise. The lead hardware design engineer was extremely competent at the analog factors in designing power supplies, but lacked equivalent prowess in digital design. Pride and ego got in the way of listening to others and learning from them when the subject matter was outside of his area of expertise.
