Updated: June 24, 2018
Presumably, if a watchdog times out, there is a flaw in the latest firmware being executed. Thus, there is every chance the problem will arise again. Indeed, the ESP will probably fall into an endless loop, with the same watchdog timing out over and over again. In the same vein, what if the problem is an exception, say an unforeseen division by 0 that causes an endless sequence of restarts without ever causing any of the watchdogs to bite?
In other words, we have to return to the subject of recovery briefly mentioned in the first post of this series. In this post I will show how to break out of an endless loop of resets by automatically reloading a "known good" version of the firmware over the air. At the same time, I will incorporate even more ideas from Niall Murphy and Jack Ganssle about a better watchdog into the loop watchdog introduced in the previous post.
Table of Contents
- Possible Remedies
- Non-Volatile Memory
- The restart Header File
- An Improved Loop Watchdog
- An Improved Recovery
- Conclusion
Here is the scenario that I dread. After performing an over the air (OTA) upgrade of the firmware on my garage door monitor, the system breaks down. It will no longer automatically close the door when I forget to do it. Worse, it is impossible to reload a previous version of the firmware because neither the web server nor the MQTT service is working. As mentioned above, that is what would happen if the ESP8266 were locked in an endless loop of restarts.
Since OTA upgrades are implemented, it is tempting to think that there are probably two versions of the firmware stored in the ESP flash memory: the current malfunctioning version and the previous version. So why not switch back to the previous version and from there upgrade to a corrected new version? It would be an elegant solution if the presence of a "good enough" older version of the firmware could be guaranteed. It could very well be that this is not the case, so at best this strategy would be a first line of defence only. In fact, that is not how it works at all. In the end, an OTA uploaded upgrade is copied over the previous sketch, which will no longer be resident in the flash memory (see the Updater class).
A "hands off" solution to recover from an endless loop of restarts is needed for an embedded ESP8266 based device. In a previous post entitled Over the Air Sonoff Flashing I discussed how my home automation system was hosted by a Raspberry Pi that remains on at all times. That system also runs a little web server from which upgrades to various IoT devices can be downloaded over the radio. Why not leave a working version of the garage monitor firmware on that web server and automatically use the over the air update capabilities of the ESP8266 Arduino Core at startup to load that version if needed? Actually, a previous version of the monitor need not be used. All that would be needed is a minimal program capable of OTA uploading to then get a working version of the firmware back onto the garage door monitor.
There are three complications that I can think of. The first is that when the loop watchdog (lwdt) times out, it restarts the device using the ESP.restart() method. Of course, the standard method to find out why the ESP was rebooted does not know about the lwdt and it reports that a software or system reset occurred. So we have to find a way to differentiate software restarts from lwdt timeouts and from user invoked restarts. This distinction would not be necessary if the EspClass methods restart() and reset() were never used. However, they are used in the sketches to which I want to add this recovery method.
The second complication is that a watchdog may well have timed out because of a transient spike in the power line or some cosmic ray causing a bit to flip in RAM, without affecting the flashed firmware itself. It would be foolish to replace the firmware on a single occurrence of a problem. So we need to keep track of the previous reason for a restart of the ESP as well as the consecutive count of that reason.
Finally, RAM cannot be used for storing this information because it will not be persistent across system restarts. Each time the ESP is restarted, no matter the reason, all RAM variables will be restored to their initial programmed values. Non-volatile memory has to be used to store the information needed.
Like many other microcontrollers, the ESP8266 has non-volatile memory, or rather its embedded memory controller can access up to 16 megabytes of external serial flash memory using the Serial Peripheral Interface Bus (SPI). On the Sonoff WiFi switch, there is 1 megabyte of flash memory; on a Wemos D1 mini there are 4 megabytes. This is where the firmware is stored. But some of that memory can also be set aside to store data which can be accessed with the EEPROM library.
Since my sketches already use this library to save configuration data, it is not too difficult to add the information needed for recovery purposes. However, this is probably not the best idea because of a limitation of flash memory. Most flash memory can perform a limited number of write operations: from a hundred thousand to one million. That may appear to be a non-binding limit on the number of changes to the configuration data. However, the sketch will write data to flash memory each time the ESP8266 restarts, so it could become a problem. Remember that restarts caused by an exception will typically become loops that pile up a significant number of flash memory writes before the problem is noticed and corrected.
Luckily, the ESP8266 has a small amount of memory incorporated in its real-time clock (RTC) that behaves as non-volatile memory across restarts. This is static random access memory (SRAM) just like the working RAM, but it remains powered even when the chip is put into deep sleep, so its content is lost only when power is removed. The downside is that there are only 512 bytes of RTC memory available to us.
In this post, I will show how to use RTC memory. However, the sketch which can be downloaded can use either RTC or EEPROM memory. This is controlled by a preprocessor directive.
This is the content of the header file restart.h.
The enumerated type restartReasont_t extends the rst_reason found in user_interface.h to include the new loop watchdog restart. There is probably a better way of doing this in C++ but I am not familiar with that language.
As the comment says, before using the other functions, restartBegin must be invoked with the address in RTC memory where the restart information will be saved. The parameter can be any "bucket address" from 0 to 126. RTC memory is divided into 4-byte buckets. Because the restart data occupies two buckets, it would overflow if stored at bucket 127. Hence, if an addr greater than 126 is specified, the function returns false.
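The bucket arithmetic can be sketched as follows. This is only an illustration that compiles with a desktop compiler, not the code found in the downloadable sketch: the constant names and the two global variables are assumptions. On the device, the saved address could later be handed to the ESP.rtcUserMemoryRead() and ESP.rtcUserMemoryWrite() methods of the ESP8266 Arduino core, which also count in 4-byte blocks.

```cpp
#include <stdint.h>

// 512 bytes of RTC user memory seen as 128 buckets of 4 bytes each.
const uint32_t RTC_BUCKETS = 128;
const uint32_t RESTART_BUCKETS = 2;   // the restart data takes 8 bytes

// Hypothetical module state, not taken from the original sketch.
uint32_t restartAddr = 0;
bool restartReady = false;

// Returns false when a two-bucket record would not fit at addr,
// that is when addr is greater than 126.
bool restartBegin(uint32_t addr) {
  if (addr > RTC_BUCKETS - RESTART_BUCKETS) {
    restartReady = false;
    return false;
  }
  restartAddr = addr;
  restartReady = true;
  return true;
}
```

With these assumptions, restartBegin(126) succeeds because buckets 126 and 127 hold the record, while restartBegin(127) fails.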
The loop watchdog does not restart the ESP directly. It must call the lwdtRestart function which stores information in RTC memory before calling ESP.restart(). That is how it will be possible to discriminate restarts caused by the loop watchdog.
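One way such a discrimination could work is sketched below. It is a desktop simulation under loud assumptions: the array stands in for RTC memory, a flag stands in for ESP.restart(), and the marker value, the classifyRestart helper and the extra addr parameter are all hypothetical. The numeric reason codes 0 to 6 follow rst_reason in user_interface.h, where a software restart is 4; the value 7 for the loop watchdog is an assumption.

```cpp
#include <stdint.h>

const uint32_t REASON_SOFT_RESTART_CODE = 4;  // same value as in rst_reason
const uint32_t RR_LOOP_WATCHDOG = 7;          // hypothetical added reason
const uint32_t LWDT_MAGIC = 0x4C574454;       // hypothetical marker, ASCII "LWDT"

uint32_t rtcMem[128] = {0};     // simulated RTC user memory (4-byte buckets)
bool restartRequested = false;  // stand-in for ESP.restart()

// Leave a marker and the location in RTC memory, then ask for a restart.
void lwdtRestart(uint32_t addr, uint32_t where) {
  rtcMem[addr] = LWDT_MAGIC;
  rtcMem[addr + 1] = where;
  restartRequested = true;      // on the device: ESP.restart();
}

// After the restart: a software restart with the marker present is
// attributed to the loop watchdog, anything else is reported unchanged.
uint32_t classifyRestart(uint32_t addr, uint32_t systemReason) {
  bool marked = (rtcMem[addr] == LWDT_MAGIC);
  rtcMem[addr] = 0;             // consume the marker so it is seen only once
  if (systemReason == REASON_SOFT_RESTART_CODE && marked) {
    return RR_LOOP_WATCHDOG;
  }
  return systemReason;
}
```

A user invoked ESP.restart() leaves no marker, so it is still reported as an ordinary software restart.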
The comment for the last function is self-explanatory, I hope. Note that this function should only be called once. There is no real mechanism to enforce that rule except that it will systematically return RR_POWER_ON with a count of -1 after the first call.
I decided to treat different exceptions as different reasons for restarting the ESP. So if an exception 3 follows an exception 0, the value of count returned with the second exception is 1.
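The counting rule can be illustrated with a small sketch. The record layout, the noteRestart and sameReason helpers and the RR_EXCEPTION name are assumptions made for the illustration; only the rule itself (a different exception number counts as a different reason, so the count starts again at 1) comes from the text above.

```cpp
#include <stdint.h>

const uint16_t RR_EXCEPTION = 2;  // hypothetical name; 2 is the exception code in rst_reason

// Hypothetical 8-byte record, small enough for two RTC buckets.
struct RestartRecord {
  uint16_t reason;  // reason for the previous restart
  int16_t count;    // consecutive restarts for that reason
  uint32_t data;    // exception number when reason is RR_EXCEPTION
};

// Two restarts have the same reason if the codes match and, for
// exceptions, if the exception numbers match as well.
bool sameReason(const RestartRecord &rec, uint16_t reason, uint32_t data) {
  if (rec.reason != reason) return false;
  return reason != RR_EXCEPTION || rec.data == data;
}

// Updates the record for a new restart and returns the consecutive count.
int noteRestart(RestartRecord &rec, uint16_t reason, uint32_t data) {
  if (sameReason(rec, reason, data)) {
    rec.count++;
  } else {
    rec.reason = reason;
    rec.data = data;
    rec.count = 1;
  }
  return rec.count;
}
```

Starting from an empty record, two restarts caused by exception 0 give counts of 1 and 2, and a following exception 3 gives a count of 1 again.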
The details of the implementation are in the restart.ino file. I will not discuss these here. The next section looks at how all this is used. It also explains what is meant by "where" a loop watchdog bite occurs.
I tend to create very short modular Arduino program loops. Here is an example:
All the work is done in four "modules" (xxxxModule()). As before, the program loop starts by feeding the loop watchdog. The last task is setting the value of the lwdWhere variable to LOOP_START. This is part of the improvements brought to the loop watchdog.
Each module begins by setting lwdWhere to a unique value identifying the start of the module:
void inputModule() {
  lwdWhere = INPUT_MODULE;
  // the remainder of the module's work goes here
}
That way the loop watchdog can report which module was being executed if it bites. This module identifier is what the loop watchdog passes on to the lwdtRestart function discussed above.
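The loop structure described in this section might look like the following sketch, which compiles on a desktop machine. Except for inputModule, INPUT_MODULE, LOOP_START, lwdWhere and lwdtFeed, every name is an assumption, as are the numeric values of the identifiers, and lwdtFeed is reduced to a counter standing in for the real function.

```cpp
#include <stdint.h>

// Hypothetical identifier values; only the names LOOP_START and
// INPUT_MODULE appear in the text.
const uint32_t LOOP_START = 0xAB000000;
const uint32_t INPUT_MODULE = 0xAB000001;
const uint32_t LOGIC_MODULE = 0xAB000002;
const uint32_t OUTPUT_MODULE = 0xAB000003;
const uint32_t NETWORK_MODULE = 0xAB000004;

uint32_t lwdWhere = LOOP_START;
int feeds = 0;

// Stand-in: the real function refreshes the watchdog timer values.
void lwdtFeed() { feeds++; }

// Each module first records where the program is.
void inputModule()   { lwdWhere = INPUT_MODULE;   /* read the inputs */ }
void logicModule()   { lwdWhere = LOGIC_MODULE;   /* decide what to do */ }
void outputModule()  { lwdWhere = OUTPUT_MODULE;  /* drive the outputs */ }
void networkModule() { lwdWhere = NETWORK_MODULE; /* handle the network */ }

void loop() {
  lwdtFeed();             // first task: feed the loop watchdog
  inputModule();
  logicModule();
  outputModule();
  networkModule();
  lwdWhere = LOOP_START;  // last task: back to the start marker
}
```

If a module never returns, lwdWhere still holds that module's identifier when the watchdog bites.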
There are a couple of special values that are not associated with an individual module. Recall that two values are constantly fed to the watchdog. The difference between the two values is monitored by the watchdog and if it changes then the watchdog bites because presumably the firmware has gone rogue and is overwriting memory. In that case the loop watchdog will set the "location" of the timeout to LWD_OVERWRITTEN.
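The check on the two values can be sketched as below. This is a desktop simulation in which the current time is passed as a parameter instead of being read with millis(); the timeout value, the LwdState type and the lwdtCheck name are assumptions. On the device the check would run in a timer callback and an overwrite would be reported with the LWD_OVERWRITTEN location.

```cpp
#include <stdint.h>

const uint32_t LWD_TIMEOUT = 12000;  // hypothetical timeout in milliseconds

// The two values fed to the watchdog. Their difference must always
// remain equal to LWD_TIMEOUT.
uint32_t lwdTime = 0;
uint32_t lwdTimeout = LWD_TIMEOUT;

enum class LwdState { Ok, TimedOut, Overwritten };

// Feeding records the current time and the matching deadline.
void lwdtFeed(uint32_t now) {
  lwdTime = now;
  lwdTimeout = now + LWD_TIMEOUT;
}

// Periodic check. Unsigned subtraction keeps the tests valid even
// when the millisecond counter wraps around.
LwdState lwdtCheck(uint32_t now) {
  if (lwdTimeout - lwdTime != LWD_TIMEOUT) return LwdState::Overwritten;
  if (now - lwdTime > LWD_TIMEOUT) return LwdState::TimedOut;
  return LwdState::Ok;
}
```

A single flipped bit in either variable changes the difference, so the overwrite is detected at the next check even if the loop is still being fed on time.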
The LOOP_START value actually identifies the behind-the-scenes code performed at the top of the program loop. The function feeding the watchdog checks that lwdtWhere is still equal to LOOP_START. If not, then the behind-the-scenes code modified the content of lwdtWhere or, somehow, there was a short circuit so that the complete sequence of modules was not executed before the program loop restarted. This will cause the watchdog to bite and the situation is signalled, albeit in a rather arcane way.
Notice how the top 16 bits of all module identifiers are 0xAB00. When lwdtFeed restarts the ESP, it replaces the top 16 bits with the value 0xBA00.
lwdtWhere might be overwritten and not lwdtRestart. Of course, the lwdtRestart code could be clobbered, but then all bets are off as discussed in the conclusion.
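The bit manipulation involved can be sketched as follows. The identifier values, the markFeedRestart helper and the simplified lwdtRestart (which only records its argument instead of saving it in RTC memory and restarting) are assumptions made so that the example compiles and runs on a desktop machine.

```cpp
#include <stdint.h>

const uint32_t LOOP_START = 0xAB000000;    // hypothetical values
const uint32_t INPUT_MODULE = 0xAB000001;

uint32_t lwdtWhere = LOOP_START;
uint32_t lastRestartWhere = 0;  // what the stand-in lwdtRestart received
bool restarted = false;

// Stand-in for the real function, which stores the location in RTC
// memory and then calls ESP.restart().
void lwdtRestart(uint32_t where) {
  lastRestartWhere = where;
  restarted = true;
}

// Keep the low 16 bits and replace the top 16 bits with 0xBA00.
uint32_t markFeedRestart(uint32_t where) {
  return (where & 0x0000FFFFu) | 0xBA000000u;
}

// Feeding is only legitimate at the top of the loop.
void lwdtFeed() {
  if (lwdtWhere != LOOP_START) {
    lwdtRestart(markFeedRestart(lwdtWhere));
    return;
  }
  // the watchdog timer values would be refreshed here
}
```

With these values, a feed that finds lwdtWhere equal to 0xAB000001 reports 0xBA000001, which distinguishes a restart decided in lwdtFeed from a timeout in the same module.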
References:
Murphy, Niall (2000), Watchdog Timers.
Ganssle, Jack (2016), Great Watchdog Timers for Embedded Systems.
Recovery is done in the setup() function of the sketch. Here is a stripped-down version of what the code could be.
This is not too complicated. There is some initial housekeeping including opening the serial port and making sure that we are not trapped by the restart-after-flashing bug (that was covered in the first post on this subject). Then the restart module is initialized and the reason for the latest restart is obtained. There is a potential need for automatic updating of the firmware if the restart was caused by a watchdog timeout or an exception. Updating will be done if the number of consecutive restarts for that reason is greater than AUTO_UPDATE_COUNT and if the latter is greater than 0.
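The decision rule just described can be isolated in a small function, sketched here so that it can be tried on a desktop machine. Apart from RR_POWER_ON and AUTO_UPDATE_COUNT, the names are assumptions, as is the threshold of 3; the values 0 to 6 mirror rst_reason in user_interface.h and 7 is a hypothetical code for the loop watchdog. The actual HTTP update is left out.

```cpp
#include <stdint.h>

enum RestartReason : uint8_t {
  RR_POWER_ON = 0,      // name used in the text, value assumed
  RR_HARDWARE_WDT = 1,  // the remaining names are hypothetical
  RR_EXCEPTION = 2,
  RR_SOFTWARE_WDT = 3,
  RR_SOFT_RESTART = 4,
  RR_DEEP_SLEEP = 5,
  RR_EXTERNAL = 6,
  RR_LOOP_WDT = 7
};

const int AUTO_UPDATE_COUNT = 3;  // hypothetical value; 0 disables auto update

// True for restarts caused by a watchdog timeout or an exception.
bool isFault(RestartReason reason) {
  return reason == RR_HARDWARE_WDT || reason == RR_SOFTWARE_WDT ||
         reason == RR_LOOP_WDT || reason == RR_EXCEPTION;
}

// Reload the known good firmware only after more than 'threshold'
// consecutive faults of the same kind, and never if threshold is 0.
bool needsAutoUpdate(RestartReason reason, int count,
                     int threshold = AUTO_UPDATE_COUNT) {
  return threshold > 0 && isFault(reason) && count > threshold;
}
```

In setup(), a true result would be the signal to fetch the known good firmware from the web server; a single transient fault never reaches the threshold.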
In the case of a rogue program thrashing RAM and flash memory because of a catastrophic programming error or because of particularly disruptive cosmic rays, I am not convinced that the technique will save the day. Would it not be amazing that it spared all the code presented here and the ESP WiFi code and the HTTP update code and so on?
On the other hand, I do think that the technique will be useful as protection from self-inflicted wounds. I am an optimist and likely to do OTA uploads of new, probably buggy, versions of the firmware to embedded devices. The errors introduced at such times will probably not be immense, and reloading a known good version may very well work as expected.
You can download the complete example (***). It is a more sophisticated "blinky" that serves as a test bed with a bunch of defines at the beginning to create watchdog timeouts or exceptions in a particular module.
restart.ino. It did not report the name of the "module" in which the loop watchdog was biting. sizeof(RESTART) returned 4 bytes, which I assume is the size of a pointer, and not 8 bytes, which is the size of the RESTART structure. In Pascal, my preferred programming language, the size of the record would have been returned. That is an understandable error on my part but, in all humility, it was inexcusable to not have noticed the problem in the first place. Maybe I had done all the tests with sizeof(restart) and then when it was time to create the archive, I just did a search and replace because I thought sizeof(RESTART) was "better".
In fact, sizeof returns the same value when its argument is the name of the struct or an instance of the struct. There was a problem because the restart structure was not aligned on a 32 bit boundary. It appears that reading and writing to the RTC memory can only be done when the destination or source address is aligned properly.
In that example, there is a configuration module using the EEPROM library to save data that presumably could be changed at runtime. It shows how the restart module can play nice with the EEPROM configuration module even when the former is also implemented using EEPROM memory.
setup() reports in much more detail the cause of the system restart. This could be useful for diagnostic purposes if there were a way to get at it remotely. It will be useful during the development stage if the Arduino serial window is open.
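The two points about sizeof and alignment can be checked at compile time. In this sketch the fields of the RESTART structure are assumptions (only the names RESTART and restart and the 8-byte size come from the text), and alignas(4) is one possible way, not necessarily the one used in the download, of forcing the structure onto a 32 bit boundary.

```cpp
#include <stdint.h>

// Hypothetical layout; alignas(4) requests a 32 bit boundary.
struct alignas(4) RESTART {
  uint16_t reason;
  int16_t count;
  uint32_t data;
};

RESTART restart = {0, 0, 0};

// sizeof gives the same result for the type and for an instance.
static_assert(sizeof(RESTART) == 8, "record must fill two 4-byte buckets");
static_assert(sizeof(RESTART) == sizeof(restart), "type and instance agree");
static_assert(alignof(RESTART) >= 4, "record must be word aligned");

// Run-time check for an arbitrary buffer address.
bool isWordAligned(const void *p) {
  return reinterpret_cast<uintptr_t>(p) % 4 == 0;
}
```

If the structure is ever changed so that it no longer fits in two buckets, the first assertion stops the build instead of letting the record overflow at run time.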
As I said, I am not familiar with C and C++. I taught myself procedural programming in Pascal and then object oriented programming with a then new language: Java. With the advent of Delphi 2, I returned to Pascal which I have used almost exclusively since then. All that to say that while I recognize that what I called the "restart module" should be redone as a class, I will not do it any time soon.
Clearly then, the example does not contain well-formed C++ code. If you find particularly egregious bits, you can inform me by clicking on my name at the bottom of the page. A big thank you ahead of time to all who send in corrections and suggestions or useful criticism.