Wednesday, July 13, 2016

USB Host MSC test -hung 'zombie' process

Validating software is the process of trying to get reasonable coverage of the functionality in reasonable situations.
A 2nd test at trying to push the USB Host MSC has produced an interesting result -

A new version of ps also supplies a deeper view of the tasks and how much the stack has been used.
The end result was a series of hung processes - to be poetic a zombie undead process!  
My take is that somewhere deep in the system (far deeper than I can go) occasionally the process isn't exiting correctly.

The good news for me is that the USB Host MSC has not failed in a bunch of system failures, zombie processes,  Hard Faults, and that is what I need from the system.
Trying to be a good netizen (is that a nutter!) I've raised an issue against it in case anybody else is playing in this area.bitbucket.org/nuttx/nuttx/issues/11/hung-process 

'ps' when called uses this as per this update.  
https://bitbucket.org/nuttx/apps/commits/dd7c4c94935889fee77d79a188b62f11aa3fb0c6

To enable the stack (and heap) to be monitored with a watermark (the functionality changed about this time - but the Stack Coloration needs to be set now for its usage in ps, whenever its done)  
From make menuconfigs  
-->Build Setup-->Debug Options
    [*] Stack coloration
    [*] Heap coloration
Specifically Clicking on "Stack coloration" and then pressing 'h'
"   Enable stack coloration to initialize the stack memory to the value
   of STACK_COLOR and enable the stack checking APIs that can be used  
   to monitor the level of stack usage."

So pullin the latest changes, making a new build
$mknx.sh
$stwr

nsh> mount -t vfat /dev/sda /usb0
nsh>

Taking the one test script of the previous blog and using vi (cheat sheet for vi) I did the following
shuntils----  
 date >> /usb0/t/l_free  
 date >> /usb0/t/l_date
 hello >> /usb0/t/l_free
 free >> /usb0/t/l_free
 ps   >> /usb0/t/l_free
------
shspawn-----
set FSTOP /usb0/t/fstop
until [ -f $FSTOP ]
do
 sh /usb0/t/shuntils &
 sleep 1
done
--------

nsh> sh /usb0/t/shspawn &
# However after about 15 minutes it had failures. (see later) and seems to have left some zombie processes.
#The pid in this code is a signed integer, and wraps at 16xxxx

nsh>
#however also failed on 
nsh> umount usb0
nsh: umount: umount failed: 16 "Device or resource busy"

It turns out this "Device or resource busy" also means files open on /usb0 so I'm guessing is likely to be derived from the hung processes - and while an indicator not the root cause.


nsh> uname -a
NuttX 17.6 nxd47wk_wkd47tst1_20160713_1444 arm stm32f4discovery

# off its starts testing with some manual 'ps'.


sh [121:100]   #new spawn with PID
sh [124:100]
sh [127:100]
ps
  PID PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
    0   0 FIFO     Kthread --- Ready              00000000      0      0   0.0%  Idle Task
    1 224 FIFO     Kthread --- Waiting  Signal    00000000   2028    292  14.3%  hpwork
    2 100 FIFO     Task    --- Waiting  Semaphore 00000000   1004    508  50.5%  usbhost
    3 100 FIFO     Task    --- Running            00000000   4076    852  20.9%  init
    4 100 RR       pthread --- Waiting  Signal    00000000   2044    788  38.5%  <pthread> 0x10002f80
nsh> sh [135:100]
sh [138:100]
sh [141:100]
sh [149:100]
sh [152:100]
ps
  PID PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
    0   0 FIFO     Kthread --- Ready              00000000      0      0   0.0%  Idle Task
    1 224 FIFO     Kthread --- Waiting  Signal    00000000   2028    292  14.3%  hpwork
    2 100 FIFO     Task    --- Waiting  Semaphore 00000000   1004    508  50.5%  usbhost
    3 100 FIFO     Task    --- Running            00000000   4076    852  20.9%  init
    4 100 RR       pthread --- Waiting  Signal    00000000   2044    788  38.5%  <pthread> 0x10002f80
nsh> sh [155:100]
sh [158:100]
sh [166:100]
ps sh [169:100]

  PID PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
    0   0 FIFO     Kthread --- Ready              00000000      0      0   0.0%  Idle Task
    1 224 FIFO     Kthread --- Waiting  Signal    00000000   2028    292  14.3%  hpwork
    2 100 FIFO     Task    --- Waiting  Semaphore 00000000   1004    508  50.5%  usbhost
    3 100 FIFO     Task    --- Running            00000000   4076    852  20.9%  init
    4 100 RR       pthread --- Waiting  Signal    00000000   2044    788  38.5%  <pthread> 0x10002f80
  169 100 RR       pthread --- Ready              00000000   2044    812  39.7%  <pthread> 0x10004420
#NOTE: A spawned file snapshoted while running

nsh> sh [172:100]
sh [175:100]
sh [183:100]
sh [186:100]


# Hey Hey off for lunch while the system works 
#start is- Jul 13 14:54:49 2016
#sadly on returning  find that it failed again (with more of a log) 20minutes later at  Jul 13 15:14:33 2016
# and the log shows  ps: open failed: 24 "Too many open files" from  nuttx/include/errno.h:

Jul 13 15:14:32 2016
Hello, World!!
             total       used       free    largest
Mem:        188096      31392     156704     120368
  PID PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
    0   0 FIFO     Kthread --- Ready              00000000      0      0   0.0%  Idle Task
    1 224 FIFO     Kthread --- Waiting  Signal    00000000   2028    292  14.3%  hpwork
    2 100 FIFO     Task    --- Waiting  Semaphore 00000000   1004    508  50.5%  usbhost
    3 100 FIFO     Task    --- Waiting  Semaphore 00000000   4076    852  20.9%  init
    4 100 RR       pthread --- Waiting  Signal    00000000   2044    788  38.5%  <pthread> 0x10002f80
 5065 100 RR       pthread --- Running            00000000   2044   1060  51.8%  <pthread> 0x10005310
  859 100 RR       pthread --- Waiting  Semaphore 00000000   2044    900  44.0%  <pthread> 0x10004420
Jul 13 15:14:33 2016
Hello, World!!
<<<<missing ps>>>>>
Jul 13 15:14:34 2016
Hello, World!!
<<<<<strange ps>>>>
             total       used       free    largest
Mem:        188096      36336     151760     120368
  PID PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
nsh: ps: open failed: 24 "Too many open files"
    0                                                      nsh: ps: open failed: 24
     0      0   0.0%  nsh: ps: open failed: 24
Jul 13 15:14:35 2016
Hello, World!!
<<<<<strange ps>>>>
             total       used       free    largest
Mem:        188096      36336     151760     120368
  PID PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
nsh: ps: open failed: 24
    0                                                      nsh: ps: open failed: 24
     0      0   0.0%  nsh: ps: open failed: 24

# so now stop the script from running by 

nsh> echo stop > /usb0/t/fstop   #pwd is usb/tst1

#Trying 'ps' there are a few strang processes still present, that don't look like they died.
nsh> ps
  PID PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
    0   0 FIFO     Kthread --- Ready              00000000      0      0   0.0%  Idle Task
    1 224 FIFO     Kthread --- Waiting  Signal    00000000   2028    292  14.3%  hpwork
    2 100 FIFO     Task    --- Waiting  Semaphore 00000000   1004    508  50.5%  usbhost
    3 100 FIFO     Task    --- Running            00000000   4076    972  23.8%  init
  859 100 RR       pthread --- Waiting  Semaphore 00000000   2044    900  44.0%  <pthread> 0x10004420
 5069 100 RR       pthread --- Waiting  Semaphore 00000000   2044    812  39.7%  <pthread> 0x10005310
 7614 100 RR       pthread --- Waiting  Semaphore 00000000   2044    812  39.7%  <pthread> 0x10005f80

So the ps that failed while the script was running is that it has exceeded the CONFIG_NFILE_DESCRIPTORS  allocation of files at that point in time which is 8.
Of course the release investigation is why it isn't cleaning up, which is probably related to not dispatching the process correctly - leaving the zombie process.

include/errno.h: 24 -> EMFILE

fs/vfs/fs_open.c:
191 /* Associate the inode with a file structure */
192
193 fd = files_allocate(inode, oflags, 0, 0);
194 if (fd < 0)
195 {
196 ret = EMFILE;
197 goto errout_with_inode;
198 }

The suggestion from Greg Nuttx was "you need to set CONFIG_NFILE_DESCRIPTORS larger."
However trying builds  for 10, 14, 32 resulted in it running for a few seconds then a "Hard Fault Exception" - won't go there. This seems like another Nuttx brittleness.

The good news is that the file system is good and removable for inspection on Win10.
nsh> ls -ls usb0/tst1
/usb0/tst1:
 -rw-rw-rw-     172 shuntil
 -rw-rw-rw- 1123855 l_free
 -rw-rw-rw-     136 shuntils
 -rw-rw-rw-      79 shspawn
 -rw-rw-rw-   32823 l_date
 -rw-rw-rw-       6 fstop

So I'm learning that having two destination files l_date & l_free seems to cause a total increment in pid of 3. Possibly one for the new process "shuntils"  and one for each file - but that's a guess.
Rerunning this test with an extra date to verify time taken
/usb0/t/shuntils----
 date  >> /usb0/t/l_date
 hello >> /usb0/t/l_free
 date  >> /usb0/t/l_free
 free  >> /usb0/t/l_free
 ps    >> /usb0/t/l_free
 date  >> /usb0/t/l_free


------
/usb0/t/shspawn-----
set FSTOP /usb0/tst1/fstop
until [ -f $FSTOP ]
do
 sh /usb0/t//shuntils &
 sleep 1
done
--------

nsh> mount -t vfat /dev/sda /usb0

nsh> sh /usb0/t/shspawn &
For a  2sec interval - this is a lot slower - I ran it overnight and the pid wrapped at least twice

sh [32764:100] 
sh [32767:100] 
sh [7:100] 
sh [10:100] 

For some reason three was one zombie process pid=283, and it happened in the first pid 0-32767

sh [28060:100]
sh [28063:100]
ps
sh [28071:100]

  PID PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
    0   0 FIFO     Kthread --- Ready              00000000      0      0   0.0%  Idle Task
    1 224 FIFO     Kthread --- Waiting  Signal    00000000   2028    292  14.3%  hpwork
    2 100 FIFO     Task    --- Waiting  Semaphore 00000000   1004    508  50.5%  usbhost
    3 100 FIFO     Task    --- Running            00000000   4076    868  21.2%  init
    5 100 RR       pthread --- Waiting  Signal    00000000   2044    788  38.5%  <pthread> 0x10002f80
28071 100 RR       pthread --- Ready              00000000   2044   1100  53.8%  <pthread> 0x10005330
  283 100 RR       pthread --- Waiting  Semaphore 00000000   2044    812  39.7%  <pthread> 0x10004420
nsh> sh [28074:100]
sh [28078:100]
sh [28086:100]

So stopping
echo stop >/usb0/t/fstop
nsh> cd
nsh> umount usb0
nsh: umount: umount failed: 16

# Failed again - probably due to pid=283 - but no way of listing it.

# Checking ps, the stack useage is very pleasing, and the stack sizes could be reduced if needed - but better to have them to large at this stage.
nsh> ps
  PID PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
    0   0 FIFO     Kthread --- Ready              00000000      0      0   0.0%  Idle Task
    1 224 FIFO     Kthread --- Waiting  Signal    00000000   2028    292  14.3%  hpwork
    2 100 FIFO     Task    --- Waiting  Semaphore 00000000   1004    508  50.5%  usbhost
    3 100 FIFO     Task    --- Running            00000000   4076    972  23.8%  init
  283 100 RR       pthread --- Waiting  Semaphore 00000000   2044    812  39.7%  <pthread> 0x10004420





No comments:

Post a Comment