[OE-core] Never ending stream of bitbake exceptions when the builder runs out of disk space

Martin Jansa martin.jansa at gmail.com
Tue Jun 27 08:12:21 UTC 2017


Is BB_DISKMON_DIRS enabled by default?

A quick grep shows it only in the local.conf.sample* files:
meta/conf/local.conf.sample:BB_DISKMON_DIRS = "\
meta/conf/local.conf.sample.extended:# inode is running low, it is enabled when BB_DISKMON_DIRS is set.
meta/conf/local.conf.sample.extended:#BB_DISKMON_DIRS = "STOPTASKS,${TMPDIR},1G,100K WARN,${SSTATE_DIR},1G,100K"

and my Jenkins builds are very close to the default oe-core nodistro
config, so I don't think I have it enabled.

Maybe I should enable it, or maybe it should be enabled by default if we
cannot fix this exception stream.
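
For reference, enabling it in local.conf would look something like this
(the format of each entry is ACTION,dir,minimum-free-space,minimum-free-inodes;
the thresholds below are just the sample file's suggestions, not values
I have tested):

BB_DISKMON_DIRS = "\
    STOPTASKS,${TMPDIR},1G,100K \
    STOPTASKS,${DL_DIR},1G,100K \
    STOPTASKS,${SSTATE_DIR},1G,100K \
    ABORT,${TMPDIR},100M,1K \
    ABORT,${DL_DIR},100M,1K \
    ABORT,${SSTATE_DIR},100M,1K"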

Thanks

On Tue, Jun 27, 2017 at 10:08 AM, Patrick Ohly <patrick.ohly at intel.com>
wrote:

> On Thu, 2017-06-15 at 08:48 +0200, Martin Jansa wrote:
> > This issue exists for very long time.
> >
> >
> > I know that when the builder runs out of disk space there are multiple
> > things which can go wrong (I've seen bad archives on premirrors and bad
> > sstate archives caused by this), so this issue isn't the main problem,
> > but it would still be nice to fail faster.
> >
> >
> > In the last build, which ran for some 9 hours, it was first building
> > for maybe 2 hours before it ran out of disk space, and this morning
> > there is a 50MB log just from the bitbake output stored on the
> > Jenkins master, repeating the following message very quickly:
> >
> >
> > # grep -c "Errno 28" consoleText.txt
> > 42986
> >
> >
> > ERROR: Running command [['world'], 'build']
> > Traceback (most recent call last):
> >   File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line 211, in fire(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at 0x7fd00330b198>):
> >     >    fire_class_handlers(event, d)
> >          if worker_fire:
> >   File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line 134, in fire_class_handlers(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at 0x7fd00330b198>):
> >                          continue
> >     >            execute_handler(name, handler, event, d)
> >   File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line 106, in execute_handler(name='runqueue_stats', handler=<function runqueue_stats at 0x7fd0020c6158>, event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at 0x7fd00330b198>):
> >          try:
> >     >        ret = handler(event)
> >          except (bb.parse.SkipRecipe, bb.BBHandledException):
> >   File "/home/jenkins/oe/world/shr-core/openembedded-core/meta/classes/buildstats.bbclass", line 212, in runqueue_stats(e=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>):
> >              done = isinstance(e, bb.event.BuildCompleted)
> >     >        system_stats.sample(e, force=done)
> >              if done:
> >   File "/home/jenkins/oe/world/shr-core/openembedded-core/meta/lib/buildstats.py", line 148, in SystemStats.sample(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, force=False):
> >                                       data +
> >     >                                 b'\n')
> >                  self.last_proc = now
> > OSError: [Errno 28] No space left on device
> >
> >
> > It would be better to exit completely when something as bad as Errno
> > 28 happens.
>
> Do you have BB_DISKMON_DIRS active? Probably yes.
>
> The reason why it did not trigger here might be that the build ran out
> of disk space so quickly that the disk monitoring had no chance to
> detect the problem before system stat sampling itself started failing
> with the error above.
>
> System stat sampling and disk monitoring are hooking into the same
> event, so my theory is that once the system stat sampling fails, disk
> monitoring code no longer runs.
>
> I'm not sure what exactly the right fix is: detect an uncaught OSError
> such as errno 28 in the bitbake event loop and abort the build, and/or
> catch the error in buildstats.py and ignore it so that the normal disk
> monitoring can still happen?
>
> I know how to do the latter, but not the former.
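>
> A minimal sketch of the latter, assuming we wrap the body of
> SystemStats.sample() in meta/lib/buildstats.py (the _do_sample helper
> below is made up for illustration, not the real code):
>
>     import errno
>
>     class SystemStats:
>         # ... existing __init__ and sampling code from buildstats.py ...
>
>         def sample(self, event, force=False):
>             try:
>                 # hypothetical helper standing in for the existing
>                 # /proc sampling and buildstats writing in sample()
>                 self._do_sample(event, force)
>             except OSError as err:
>                 if err.errno != errno.ENOSPC:
>                     raise
>                 # Swallow "No space left on device" so the
>                 # HeartbeatEvent handlers keep firing and the
>                 # BB_DISKMON_DIRS monitoring still gets a chance
>                 # to stop the build cleanly.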
>
> --
> Best Regards, Patrick Ohly
>
> The content of this message is my personal opinion only and although
> I am an employee of Intel, the statements I make here in no way
> represent Intel's position on the issue, nor am I authorized to speak
> on behalf of Intel on this matter.
>