[OE-core] Never ending stream of bitbake exceptions when the builder runs out of disk space

Tue Jun 27 08:08:52 UTC 2017

On Thu, 2017-06-15 at 08:48 +0200, Martin Jansa wrote:
> This issue has existed for a very long time.
> 
> 
> I know that when the builder runs out of disk space there are multiple
> things which might go wrong (I've seen bad archives on premirrors, bad
> sstate archives caused by this), so this issue isn't the main problem,
> but it would still be nice to fail faster.
> 
> 
> In the last build, which ran for some 9 hours, it was building for
> maybe 2 hours before it ran out of disk space, and this morning there
> is a 50MB log of just bitbake output stored on the jenkins master,
> repeating the following message very quickly:
> 
> 
> # grep -c "Errno 28" consoleText.txt 
> 42986
> 
> 
> ERROR: Running command [['world'], 'build']
> Traceback (most recent call last):
>   File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line
> 211, in fire(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>,
> d=<bb.data_smart.DataSmart object at 0x7fd00330b198>):
>      
>     >    fire_class_handlers(event, d)
>          if worker_fire:
>   File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line
> 134, in fire_class_handlers(event=<bb.event.HeartbeatEvent object at
> 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at
> 0x7fd00330b198>):
>                          continue
>     >            execute_handler(name, handler, event, d)
>      
>   File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line
> 106, in execute_handler(name='runqueue_stats', handler=<function
> runqueue_stats at 0x7fd0020c6158>, event=<bb.event.HeartbeatEvent
> object at 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at
> 0x7fd00330b198>):
>          try:
>     >        ret = handler(event)
>          except (bb.parse.SkipRecipe, bb.BBHandledException):
>   File
> "/home/jenkins/oe/world/shr-core/openembedded-core/meta/classes/buildstats.bbclass", line 212, in runqueue_stats(e=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>):
>              done = isinstance(e, bb.event.BuildCompleted)
>     >        system_stats.sample(e, force=done)
>              if done:
>   File
> "/home/jenkins/oe/world/shr-core/openembedded-core/meta/lib/buildstats.py", line 148, in SystemStats.sample(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, force=False):
>                                       data +
>     >                                 b'\n')
>                  self.last_proc = now
> OSError: [Errno 28] No space left on device
> 
> 
> It would be better to exit completely when something as bad as Errno
> 28 happens.

Do you have BB_DISKMON_DIRS active? Probably yes.

The reason why it did not trigger here might be that the build ran out
of disk space so quickly that the disk monitoring had no chance to
detect the problem before system stat sampling itself started failing
with the error above.

System stat sampling and disk monitoring hook into the same event, so
my theory is that once the system stat sampling fails, the disk
monitoring code no longer runs.
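
To make the theory concrete, here is a toy model. It is not BitBake's actual dispatch code, and `fire`, `sample_stats`, and `check_disk` are hypothetical names. It only assumes that handlers for one event run in sequence and that an exception from one of them propagates out of the dispatcher:

```python
import errno

calls = []  # records which handlers actually ran


def sample_stats(event):
    # Hypothetical stand-in for system stat sampling: the write fails
    # because the disk is full.
    calls.append("sample_stats")
    raise OSError(errno.ENOSPC, "No space left on device")


def check_disk(event):
    # Hypothetical stand-in for disk monitoring, registered after the sampler.
    calls.append("check_disk")


def fire(event, handlers):
    # Naive dispatcher: the first exception aborts the remaining handlers.
    for handler in handlers:
        handler(event)


try:
    fire("heartbeat", [sample_stats, check_disk])
except OSError as exc:
    print("handler failed:", exc)

print(calls)  # ['sample_stats'] -- check_disk never ran
```

If the real event code behaves like this for the heartbeat event, every heartbeat would raise the same error again and the disk check would never get its turn, which would match the repeated tracebacks.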

I'm not sure what exactly the right fix is: detect an uncaught OSError
such as Errno 28 in the bitbake event loop and abort the build, and/or
catch the error in buildstats.py and ignore it so that the normal disk
monitoring can still run?

I know how to do the latter, but not the former.
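
A minimal sketch of the latter approach, assuming the sampling code can wrap its write in a helper. `safe_sample` and `full_disk_write` are hypothetical names for illustration and this is not a patch against buildstats.py:

```python
import errno


def safe_sample(write, data):
    """Write one sample line; return False instead of raising on a full disk."""
    try:
        write(data + b"\n")
    except OSError as exc:
        if exc.errno != errno.ENOSPC:
            # Any other I/O error is unexpected, so let it propagate.
            raise
        # Disk full: drop this sample so later handlers still get to run.
        return False
    return True


def full_disk_write(data):
    # Simulates a write hitting "No space left on device".
    raise OSError(errno.ENOSPC, "No space left on device")


written = []
print(safe_sample(written.append, b"cpu 1 2 3"))   # True
print(safe_sample(full_disk_write, b"cpu 1 2 3"))  # False
```

Only ENOSPC is swallowed here; the idea is to lose a sample quietly and leave the decision to stop the build to the disk monitoring.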

-- 
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although
I am an employee of Intel, the statements I make here in no way
represent Intel's position on the issue, nor am I authorized to speak
on behalf of Intel on this matter.