Diagnosing Errors

Here we list some of the tools Mynewt has available. This is by no means comprehensive. It does give an overview of some facilities more commonly used.

To create output for this demo, I’m including the ‘test/crash_test’ package while building the ‘slinky’ app.

Crash with Console

When system restarts due to an error, we always attempt to do it in a way which allows us to dump the current system state. Calls to assert(), error interrupts (regular faults, memory faults) both take this path.

The basic facility is the printout of registers to console.

Here is an example of what it looks like with Cortex-M4:

041088 Unhandled interrupt (3), exception sp 0x20001d28
041088  r0:0x00000000  r1:0x00017b61  r2:0x00000000  r3:0x0000002a
041088  r4:0x20001dc8  r5:0x00000000  r6:0x200034e4  r7:0x20003464
041088  r8:0x2000349c  r9:0x20001f08 r10:0x00017bd4 r11:0x00000000
041088 r12:0x00000000  lr:0x00014e29  pc:0x00014e58 psr:0x61000000
041088 ICSR:0x00421803 HFSR:0x40000000 CFSR:0x02000000
041088 BFAR:0xe000ed38 MMFAR:0xe000ed34

Output includes the values of registers. The most interesting pieces would be $pc (program counter), and $lr (link register), which show the instruction where the fault happened, and the return address from that function.

You would then take these values, and match them against the image you’re running on the target.

For example:

[marko@IsMyLaptop:~/src2/incubator-mynewt-blinky]$ arm-elf-linux-gdb bin/targets/slinky_nrf52/app/apps/slinky/slinky.elf
GNU gdb (GDB) 7.8.1
  ...
Reading symbols from bin/targets/slinky_nrf52/app/apps/slinky/slinky.elf...done.
(gdb) list *0x00014e58
0x14e58 is in crash_device (repos/apache-mynewt-core/test/crash_test/src/crash_test.c:47).
42      if (!strcmp(how, "div0")) {
43
44          val1 = 42;
45          val2 = 0;
46
47          val3 = val1 / val2;
48          console_printf("42/0 = %d\n", val3);
49      } else if (!strcmp(how, "jump0")) {
50          ((void (*)(void))0)();
51      } else if (!strcmp(how, "ref0")) {

You can see that the system crashed due to divide-by-zero.

Crash with Verbose Location

We often call assert(), testing specific conditions which should be met for program execution to continue. If the condition fails, we reset with a message listing the instruction where assert() was called from.

829008 compat> Assert @ 0x14e9f
831203 Unhandled interrupt (2), exception sp 0x20001d20

This address (here 0x14e9f) should then be used to find the line in the program where it happened. Note that this address is specific to your binary. So you have to do the search to locate the specific file/line using your .elf file.

You can also enable more verbose report by setting syscfg variable OS_CRASH_FILE_LINE to 1. The call to assert() from above would then look like so:

000230 compat> Assert @ 0x1503b - repos/apache-mynewt-core/test/crash_test/src/crash_test.c:54
001462 Unhandled interrupt (2), exception sp 0x20001d20

Note that this will increase the program text size, as we now need to store the names for the files which call assert().

Crash with Console with Stacktrace

Quite often you need a deeper view of the call chain to figure out how the program got to place where it crashed. Especially if the routine can be reached via multiple call chain routes.

You can enable a dump of possible call chain candidates by setting syscfg variable OS_CRASH_STACKTRACE to 1.

When crash happens, the dumping routine walks backwards in the stack, and prints any address which falls within text region. All these values are candidates of being part of the call chain. However, you need to filter it yourself to figure out which values are actually part of it, and which addresses just happen to be in the stack at that time.

Here is a sample output:

004067 Unhandled interrupt (3), exception sp 0x20001d28
004067  r0:0x00000000  r1:0x00017c55  r2:0x00000000  r3:0x0000002a
004067  r4:0x20001dc8  r5:0x00000000  r6:0x200034e4  r7:0x20003464
004067  r8:0x2000349c  r9:0x20001f08 r10:0x00017cc8 r11:0x00000000
004067 r12:0x00000000  lr:0x00014ef9  pc:0x00014f28 psr:0x61000000
004067 ICSR:0x00421803 HFSR:0x40000000 CFSR:0x02000000
004067 BFAR:0xe000ed38 MMFAR:0xe000ed34
004067 task:main
004067  0x20001d64: 0x0001501b
004067  0x20001d68: 0x00017338
004067  0x20001dd0: 0x00017cc8
004067  0x20001dd4: 0x00015c2b
004067  0x20001dd8: 0x00015c1d
004067  0x20001de4: 0x0001161d
004067  0x20001df4: 0x00011841
004067  0x20001e04: 0x000117c9
004067  0x20001e0c: 0x00013d8f
004067  0x20001e20: 0x00014375
004067  0x20001e4c: 0x00013dfd
004067  0x20001e54: 0x00013e09
004067  0x20001e58: 0x00013e01
004067  0x20001e5c: 0x0000925d
004067  0x20001e64: 0x00008933
004067  0x20001e7c: 0x00008cc3
004067  0x20001e98: 0x000169e0
004067  0x20001e9c: 0x00008cb9
004067  0x20001ea0: 0x000088a9

You can see that it contains the usual dump of registers in the beginning, followed by addresses to stack, and the value at that location. Note that we skip the area of memory within stack which contains the stashed register values. This means that $pc and $lr values will not show up in the array of addresses. Depending on your CPU architecture, different stuff gets pushed to stack. Given where Mynewt is ported to, there’s always function return address pushed there.

You would then take these values, and see if they seem legit. Here we’ll show what to do with the output above.

[marko@IsMyLaptop:~/src2/incubator-mynewt-blinky]$ arm-elf-linux-gdb bin/targets/slinky_nrf52/app/apps/slinky/slinky.elf
GNU gdb (GDB) 7.8.1
Copyright (C) 2014 Free Software Foundation, Inc.
...
No symbol table is loaded.  Use the "file" command.
Reading symbols from bin/targets/slinky_nrf52/app/apps/slinky/slinky.elf...done.
(gdb) list *0x0001501b
0x1501b is in crash_test_nmgr_write (repos/apache-mynewt-core/test/crash_test/src/crash_nmgr.c:68).
63      if (rc != 0) {
64          return MGMT_ERR_EINVAL;
65      }
66
67      rc = crash_device(tmp_str);  <--- That is likely part of it
68      if (rc != 0) {
69          return MGMT_ERR_EINVAL;
70      }
71
72      rc = mgmt_cbuf_setoerr(cb, 0);
(gdb) list *0x00017338  <--- not relevant
(gdb) list *0x00017cc8  <--- neither is this
(gdb) list *0x00015c2b
0x15c2b is in cbor_mbuf_writer (repos/apache-mynewt-core/encoding/tinycbor/src/cbor_mbuf_writer.c:32).
27  {
28      int rc;
29      struct cbor_mbuf_writer *cb = (struct cbor_mbuf_writer *) arg;
30
31      rc = os_mbuf_append(cb->m, data, len);  <-- Does not seem likely. This probably ended here due to an earlier work that this task was doing.
32      if (rc) {
33          return CborErrorOutOfMemory;
34      }
35      cb->enc.bytes_written += len;
36      return CborNoError;
(gdb) list *0x00015c1d
0x15c1d is in cbor_mbuf_writer (repos/apache-mynewt-core/encoding/tinycbor/src/cbor_mbuf_writer.c:27).
22  #include <tinycbor/cbor.h>
23  #include <tinycbor/cbor_mbuf_writer.h>
24
25  int
26  cbor_mbuf_writer(struct cbor_encoder_writer *arg, const char *data, int len)
27  {
28      int rc;      <---- Nope. Not relevant.
29      struct cbor_mbuf_writer *cb = (struct cbor_mbuf_writer *) arg;
30
31      rc = os_mbuf_append(cb->m, data, len);
(gdb) list *0x0001161d
0x1161d is in create_container (repos/apache-mynewt-core/encoding/tinycbor/src/cborencoder.c:244).
239     memcpy(where, &v, sizeof(v));
240 }
241
242 static inline CborError append_to_buffer(CborEncoder *encoder, const void *data, size_t len)
243 {
244     return encoder->writer->write(encoder->writer, data, len);  <--- Hmmm, unlikely
245 }
246
247 static inline CborError append_byte_to_buffer(CborEncoder *encoder, uint8_t byte)
248 {
(gdb) list *0x00011841
0x11841 is in preparse_value (repos/apache-mynewt-core/encoding/tinycbor/src/cborparser.c:182).
177     /* are we at the end? */
178     if (it->offset == parser->end)
179         return CborErrorUnexpectedEOF;
180
181     uint8_t descriptor = parser->d->get8(parser->d, it->offset); <--- Probably not relevant.
182     uint8_t type = descriptor & MajorTypeMask;
183     it->type = type;
184     it->flags = 0;
185     it->extra = (descriptor &= SmallValueMask);
186
(gdb) list *0x000117c9
0x117c9 is in cbor_encoder_create_map (repos/apache-mynewt-core/encoding/tinycbor/src/cborencoder.c:521).
516  */
517 CborError cbor_encoder_create_map(CborEncoder *encoder, CborEncoder *mapEncoder, size_t length)
518 {
519     if (length != CborIndefiniteLength && length > SIZE_MAX / 2)
520         return CborErrorDataTooLarge;
521     return create_container(encoder, mapEncoder, length, MapType << MajorTypeShift);     <-- I don't think this is relevant either.
522 }
523
524 /**
525  * Creates a indefinite-length byte string in the CBOR stream provided by
(gdb) list *0x00013d8f
0x13d8f is in nmgr_handle_req (repos/apache-mynewt-core/mgmt/newtmgr/src/newtmgr.c:261).
256         } else {
257             rc = MGMT_ERR_ENOENT;
258         }
259     } else if (hdr.nh_op == NMGR_OP_WRITE) {
260         if (handler->mh_write) {
261             rc = handler->mh_write(&nmgr_task_cbuf.n_b); <-- This is part of it.
262         } else {
263             rc = MGMT_ERR_ENOENT;
264         }
265     } else {

I was sending a newtmgr command which crashed the system. As you can see from this this example, little less than 50% of the addresses were part of the call chain. So read this output with care.

Coredump

Coredump contains a full dump of system memory, and CPU registers. You can inspect the system state, including stack of the failing task, or any of the global state.

You can enable coredumps by setting syscfg variable OS_COREDUMP to 1. When crash happens, dumper will write the contents to flash_map region COREDUMP_FLASH_AREA. After the crash, you can use imgmgr to download the coredump for offline analysis. To enable download/erase commands, you need to set syscfg variable IMGMGR_COREDUMP to 1.

Coredump package does not overwrite a previous coredump, if it exists in the flash. To get a new one, you first need to erase the area.

You can either use a dedicated area of flash, or use image slot 1 as the target. If image slot1 and coredump areas are co-located, coredump will overwrite the image, unless image upgrade is in progress (slot1 marked as pending, or slot0 is not confirmed).

Fetching Coredump with newtmgr

newtmgr coredump management commands are handled by imgmgr. You can query whether corefile is present, download one if it exists, and then erase the area afterwards.

[marko@IsMyLaptop:~/src2/incubator-mynewt-blinky]$ newtmgr -c ttys005 image corelist
Corefile present
[marko@IsMyLaptop:~/src2/incubator-mynewt-blinky]$ newtmgr -c ttys005 image coredownload -e core.elf
0
512
1024
1536
 ...
65676
Done writing core file to core.elf; hash=7e20dcdd136c6796fcb2a51e7384e90800d2ec045f2ee088af32529e929e2130
[marko@IsMyLaptop:~/src2/incubator-mynewt-blinky]$ newtmgr -c ttys005 image coreerase
Done

I specified option ‘-e’ to coredownload command. This converts the internal data representation to ELF file; a format that gdb understands.

Coredump Analysis Offline

Not all GDB configurations support corefiles. Specifically for the architecture we’re showing here, ‘arm-none-eabi-gdb’ does not have it built in. However, ‘arm-elf-linux-gdb’ does.

Here’s how I built it with MacOS:

tar xvzf gdb-7.8.1.tar.gz
cd gdb-7.8.1
./configure --target=arm-elf-linux --without-lzma --without-guile --without-libunwind-ia64 --with-zlib
make
gdb/gdb -v

And this is how I built it for Linux:

sudo apt-get install zlibc zlib1g zlib1g-dev libexpat-dev libncurses5-dev liblzma-dev
tar xvzf gdb-7.8.1.tar.gz
cd gdb-7.8.1
./configure --target=arm-elf-linux --with-lzma --with-expat --without-libunwind-ia64 --with-zlib --without-babeltrace
make
gdb/gdb -v

Now that I have a suitable version of gdb at hand, I can use it to analyze the corefile.

[marko@IsMyLaptop:~/src2/incubator-mynewt-blinky]$ arm-elf-linux-gdb bin/targets/slinky_nrf52/app/apps/slinky/slinky.elf core.elf
GNU gdb (GDB) 7.8.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-apple-darwin14.5.0 --target=arm-elf-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
No symbol table is loaded.  Use the "file" command.
Reading symbols from bin/targets/slinky_nrf52/app/apps/slinky/slinky.elf...done.
[New process 1]
#0  0x00014f28 in crash_device (
     how=how@entry=0x20001dc8 <os_main_stack+3904> "div0")
     at repos/apache-mynewt-core/test/crash_test/src/crash_test.c:47
47          val3 = val1 / val2;
(gdb) bt
#0  0x00014f28 in crash_device (
    how=how@entry=0x20001dc8 <os_main_stack+3904> "div0")
    at repos/apache-mynewt-core/test/crash_test/src/crash_test.c:47
#1  0x0001501a in crash_test_nmgr_write (cb=0x20003464 <nmgr_task_cbuf>)
    at repos/apache-mynewt-core/test/crash_test/src/crash_nmgr.c:67
#2  0x00013d8e in nmgr_handle_req (
    nt=nt@entry=0x200034e4 <nmgr_shell_transport>,
    req=0x20002014 <os_msys_init_1_data+292>)
    at repos/apache-mynewt-core/mgmt/newtmgr/src/newtmgr.c:261
#3  0x00013dfc in nmgr_process (nt=0x200034e4 <nmgr_shell_transport>)
    at repos/apache-mynewt-core/mgmt/newtmgr/src/newtmgr.c:325
#4  0x00013e08 in nmgr_event_data_in (ev=<optimized out>)
    at repos/apache-mynewt-core/mgmt/newtmgr/src/newtmgr.c:332
#5  0x0000925c in os_eventq_run (evq=<optimized out>)
    at repos/apache-mynewt-core/kernel/os/src/os_eventq.c:162
#6  0x00008932 in main (argc=<optimized out>, argv=<optimized out>)
    at repos/apache-mynewt-core/apps/slinky/src/main.c:289
(gdb) p g_current_task
$1 = (struct os_task *) 0x20001e88 <os_main_task>

Setting up coredumps, setting up a mechanism to download them, managing them on the device and building gdb is certainly more work, but you’ll have much more data available when you start analyzing problems. I don’t think I can emphasize too much how valuable having all this data is, when you’re stuck with a difficult-to-solve crash.

Task Stack Use

One of the common problems is stack overflow within a task. Mynewt OS fills the stack of a task with a pattern at the time task gets created. If you’re suspecting that stack might be blown, check the start of the stack (and memory right before it).

There are also APIs to check the stack use. Task statistics, accessible with console CLI or newtmgr, attempt to report the stack use by the tasks. They do it by inspecting the presence of this pattern, so it is possible to fool them also.

Here is what that data looks like at console. ‘stksz’ is the amount of stack, and ‘stkuse’ is how much of it we think the task used it.

7109237 compat> tasks
7109387 Tasks:
7109387     task pri tid  runtime      csw    stksz   stkuse   lcheck   ncheck flg
7109390     idle 255   0  7109375    72810       64       26        0        0
7109392     main 127   1       15     6645     1024      388        0        0
7109394    task1   8   2        0    55543      192      116        0        0
7109396    task2   9   3        0    55543       64       28        0        0

There are some helper macros for this when using gdb also. You’ll need to eyeball the stack manually, but at least it will tell you where to find the stacks.

[marko@IsMyLaptop:~/src2/incubator-mynewt-blinky]$ newt debug slinky_nrf52
Debugging bin/targets/slinky_nrf52/app/apps/slinky/slinky.elf
 ...
Reading symbols from bin/targets/slinky_nrf52/app/apps/slinky/slinky.elf...done.
os_tick_idle (ticks=128)
     at repos/apache-mynewt-core/hw/mcu/nordic/nrf52xxx/src/hal_os_tick.c:164
164     if (ticks > 0) {
(gdb) os_tasks
Undefined command: "os_tasks".  Try "help".
(gdb) source repos/apache-mynewt-core/compiler/gdbmacros/os.gdb
(gdb) os_tasks
 prio state      stack  stksz       task name
* 255   0x1 0x20000e88     64 0x200035fc idle
  127   0x2 0x20001e88   1024 0x20001e88 main
    8   0x2 0x20003ab0    192 0x20000cbc task1
    9   0x2 0x20003bc0     64 0x20000d0c task2

Stack Check during Context Switch

You can ask OS to check the top of the stack on every context switch. It’ll walk over X number of os_stack_t elements, checking if our pattern is still there for the task which is about to be switched out.

You can enable this by setting syscfg variable OS_CTX_SW_STACK_CHECK to 1. Another syscfg variable, OS_CTX_SW_STACK_GUARD, controls how far into the stack we peek.

Memory Corruption

The recommended way of allocating blocks of memory is by using the OS memory pools. As with all dynamically allocated memory, there can issues with using after it gets freed, or it does not get freed at all, code tries to free the same element multiple times, or you end up overwriting the data within the next element.

There are syscfg variables you can set to get OS to do some of sanity checks for you.

Sanity Check on Free

By setting syscfg variable OS_MEMPOOL_CHECK to 1, OS will act when an element is freed. It will walk through the list of already freed entries, and check that the entry being freed does not already appear in the free list. It will also check that the pool entry the code is trying to free is within the memory range belonging to that pool.

Poisoning Pool

If you set syscfg variable OS_MEMPOOL_POISON, the OS fills pool entry memory with a data pattern whenever that entry is being freed. It then checks that this pattern has not been disturbed when that pool entry is allocated again. This would help to detect the scenarios when code is still using a freed pool entry. Of course, the detection will only be triggered if the code tries to modify the data belonging tot that entry.

Pool Guard Areas

Normally OS takes the memory region allocated for pool use and splits it into chunks which are exactly the size of the pool entry (taking into account the alignment restriction). By setting syscfg variable OS_MEMPOOL_GUARD to 1, it creates a ‘guard area’ between pool elements, and writes a specific data pattern to it. This is then checked whenever the pool entry is freed, or allocated. If code writes past the data area belonging to pool entry, the pattern would likely be disturbed, and would cause an assert.

Exhausting Memory

Memory pool usage can be monitored through the mempool API. The data collected there can be seen via the console CLI and/or using newtmgr.

Here is that data as seen from console:

7375292 compat> mpool
7375581 Mempools:
7375581                             name blksz  cnt free  min
7375584                           msys_1   292   12   12    8
7375585              modlog_mapping_pool    12   16   12   12

We list the memory pools OS knows about. ‘cnt’ tells how many elements the pool has in total. ‘free’ tells how many elements are currently in the free pool, and ‘min’ tells what’s the low point for the number of elements within the pool.

What you should check is the ‘min’ number. If it gets to 0, it means that the pool has been exhausted at one point. Depending on what the pool is used for, you might want to take action.

There’s some gdb helper macros to inspect the data belonging to the default networking memory pool, msys_1.

(gdb) source repos/apache-mynewt-core/compiler/gdbmacros/mbuf.gdb
(gdb) mn_msys1_print
Mbuf addr: 0x20001ef0
Mbuf header: $1 = {om_data = 0x20002380 <os_msys_init_1_data+1168> "\244$",
  om_flags = 0 '\000', om_pkthdr_len = 8 '\b', om_len = 14,
  om_omp = 0x20002ca0 <os_msys_init_1_mbuf_pool>, om_next = {sle_next = 0x0},
  om_databuf = 0x20001f00 <os_msys_init_1_data+16> "\016"}
Packet header: $2 = {omp_len = 14, omp_flags = 0, omp_next = {stqe_next = 0x0}}
  ....
---Type <return> to continue, or q <return> to quit---q
Quit
(gdb) help mn_msys1_print
usage: mn_msys1_print

Prints all mbufs in the first msys pool.  Both allocated and unallocated mbufs
are printed.
(gdb) mn_msys1_free_print
Mbuf addr: 0x20002138
 ....
Mbuf addr: 0x20002b7c

Crash in Log

System restarts can also be recorded in reboot log. Assuming reboot log has been set up, it’ll write a record either when system is asked to reset (managed reset), or when system is coming back after an unexpected reset.

To facilitate the latter, you should include a call from main() to reboot package, reporting what HAL is telling you.

int
main(int argc, char **argv)
{

    sysinit();

    /* .... */
    reboot_start(hal_reset_cause());

This will then show if the system was restarted due to POR, brownout, or hardware watchdog (the cause determination is MCU specific, so YMMV). As you remember, exit via assert() and unhandled interrupt is going via the dumper. So these kind of reboots will show up as ‘SOFT’ reset.

Restarts due to hardware watchdog will show up here, however. You can also add a printout to beginning of main(), outputting what hal_reset_cause_str() returns. That will generate a report on the console.