[OE-core] [PATCH 0/2] Avoid build failures due to setscene errors

Wed Aug 30 09:52:48 UTC 2017

> -----Original Message-----
> From: Richard Purdie [mailto:richard.purdie at linuxfoundation.org]
> Sent: 30 August 2017 10:03
> To: Peter Kjellerstedt <peter.kjellerstedt at axis.com>; Andre McCurdy
> <armccurdy at gmail.com>
> Cc: OE Core mailing list <openembedded-core at lists.openembedded.org>
> Subject: Re: [OE-core] [PATCH 0/2] Avoid build failures due to setscene
> errors
> 
> On Wed, 2017-08-30 at 06:44 +0000, Peter Kjellerstedt wrote:
> > > I have left this code as an error deliberately as this kind of
> > > thing should not happen and if it does, there is really something
> > > wrong which you need to figure out. It means that at one point
> > > bitbake thinks the sstate is present and valid, then later it
> > > isn't.
> >
> > True, but since checking whether an sstate file exists and
> > retrieving it are not a single atomic operation, there are always
> > problems that can occur. Some may be fixable, some may not. However,
> > using a build failure to detect these kinds of problems is a bit harsh
> > on the developers, who see their builds complete only to get an
> > error for something that is not their fault. We have better ways to
> > detect these kinds of problems, e.g., through log monitoring, without
> > having to cause unnecessary grief amongst the developers.
> 
> Files are randomly disappearing from your sstate source. So far you've
> been lucky and these are not causing corruption, but they could.

Somehow I fail to see how missing sstate cache files can cause 
corruption. If they are missing, the real task is run and all is well.
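
To illustrate the fallback, here is a minimal Python sketch. It is not
bitbake's code: fetch_sstate(), obtain_output() and the run_real_task
callback are hypothetical names made up for this example. Instead of
checking for the file first and fetching it later, it simply attempts
the fetch and treats any failure as a cache miss:

```python
import shutil


def fetch_sstate(mirror_path, local_path):
    """Try to copy an sstate archive from the mirror (hypothetical helper).

    Returns True on success and False if the file is missing or cannot
    be read. There is no separate existence check, so there is no window
    between "it exists" and "copy it".
    """
    try:
        shutil.copyfile(mirror_path, local_path)
        return True
    except OSError:
        # Covers FileNotFoundError as well as transient I/O errors.
        return False


def obtain_output(mirror_path, local_path, run_real_task):
    """Use the cached archive if it can be fetched, else run the real task."""
    if fetch_sstate(mirror_path, local_path):
        return "setscene"
    run_real_task(local_path)
    return "real task"
```

A caller would pass whatever function produces the output from scratch
as run_real_task; the return value only records which path was taken,
so that it can be logged and monitored.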

Also, I do not actually know if the files disappear permanently or 
temporarily, because at the time when I look at the global sstate cache 
the files are there, newly created because the build continued and let 
the real task run. My guess though is that the files only temporarily 
disappeared due to some network glitch, but currently I cannot verify it.

Regardless of whether my proposed changes are accepted or not, if you 
want to keep the default behavior that a failed setscene task will 
eventually cause the build to fail, then we should change it to fail 
immediately instead. Continuing the build when you know it will fail 
makes no sense at all.

> Please figure out and fix your sstate infrastructure, not hack the code
> to avoid the errors.

As Martin Jansa mentioned in another response, the problem may be due 
to NFS or general network disturbances. And I see no way to protect 
ourselves from them. And apparently we are not alone in seeing these 
kinds of transient errors.

> I do appreciate it's painful, we did once see this issue on the
> autobuilder. There was a real error in the sstate cleanup scripts and
> we fixed that but it took some work to find it.

Are your sstate cache cleanup scripts available somewhere? Because 
obviously it is not trivial to get it right, and since keeping the 
sstate cache clean is something that I expect many would like to do, 
having a common script for this seems like a good thing.

Otherwise I can contribute our script. If nothing else it would 
probably be good to have it reviewed by someone who is an expert on 
the sstate cache. It currently features:

* configurable retention period (default is 10 days)
* removes related .tgz and .tgz.siginfo files as one
* can remove stale symbolic links (typically wanted for a local sstate 
  cache which has links into a global sstate cache which have seen the 
  actual files being cleaned away)
* dry run mode
* quiet mode (only prints a summary stating how much was cleaned up and 
  the current size of the sstate cache; very nice for running it as a 
  cronjob)
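
To make the feature list concrete, here is a rough Python sketch of how
such a cleanup could work. It is not our actual script: all function
names and command line options are invented for this example, and using
the modification time to decide what is old is an assumption of the
sketch. The quiet-mode summary here omits the current cache size.

```python
import argparse
import os
import time

DAY = 86400  # seconds


def find_expired(cache_dir, retention_days=10, now=None):
    """Return (expired_files, stale_links) found under cache_dir.

    foo.tgz and foo.tgz.siginfo are treated as one unit: they expire
    only if the newest of them is older than the retention period.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * DAY
    groups = {}       # archive path -> [(member path, mtime), ...]
    stale_links = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                if not os.path.exists(path):  # link target is gone
                    stale_links.append(path)
                continue                      # valid links are left alone
            if name.endswith(".tgz.siginfo"):
                key = path[:-len(".siginfo")]
            elif name.endswith(".tgz"):
                key = path
            else:
                continue
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue                      # vanished while scanning
            groups.setdefault(key, []).append((path, mtime))
    expired = []
    for members in groups.values():
        if max(mtime for _path, mtime in members) < cutoff:
            expired.extend(path for path, _mtime in members)
    return sorted(expired), sorted(stale_links)


def clean(cache_dir, retention_days=10, remove_links=False,
          dry_run=False, report=None):
    """Remove expired files; return (file_count, byte_count)."""
    expired, stale_links = find_expired(cache_dir, retention_days)
    targets = expired + (stale_links if remove_links else [])
    count = size = 0
    for path in targets:
        try:
            nbytes = os.lstat(path).st_size
            if not dry_run:
                os.remove(path)
        except OSError:
            continue  # already gone or not removable; skip it
        if report:
            report(path)
        count += 1
        size += nbytes
    return count, size


def main(argv=None):
    parser = argparse.ArgumentParser(description="Prune an sstate cache.")
    parser.add_argument("cache_dir")
    parser.add_argument("--days", type=int, default=10)
    parser.add_argument("--remove-stale-links", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--quiet", action="store_true")
    args = parser.parse_args(argv)
    count, size = clean(args.cache_dir, args.days, args.remove_stale_links,
                        args.dry_run, None if args.quiet else print)
    verb = "Would remove" if args.dry_run else "Removed"
    print(f"{verb} {count} files ({size} bytes)")
    return 0
```

Calling main(["/path/to/sstate-cache", "--dry-run"]) lists what would be
removed without touching anything. Grouping the .tgz and .tgz.siginfo
files before comparing ages is what keeps a pair from being split when
only one of the two files is old.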

> Also, with changes like this you can end up in a state where sstate can
> completely stop working and the only way you'd tell is by increased
> build time.

As I mentioned, we have monitoring of our builds in place and would 
definitely notice if the global sstate cache is not used as expected.

> > > I'm not convinced patching out the errors is the right solution
> > > here...
> > 
> > How about I make it conditional by adding an IGNORE_SETSCENE_ERRORS?
> > That way it can default to "0", but we can set it to "1" to
> > prioritize the production builds.
> 
> I'm still not convinced, sorry.
> 
> [The reason being complexity. I don't like having multiple ways of
> doing things if we can help it, particularly when one of them is a
> workaround for a problem elsewhere. One of the codepaths in a case like
> this is unlikely to get well tested.]

Well, as long as the conditional path is clearly marked as "only 
enable this if you know what you are doing", I do not see a problem 
with that path receiving less or no testing by you. It should get 
enough testing by those of us who rely on it.
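
To show how small the conditional path could be, here is a rough Python
sketch of the decision. It is neither the actual patch nor bitbake's
code: build_exit_code() and its arguments are invented for this
example, and only the IGNORE_SETSCENE_ERRORS name and its "0"/"1"
values come from the suggestion quoted above.

```python
def build_exit_code(failed_real_tasks, failed_setscene_tasks,
                    ignore_setscene_errors="0"):
    """Return 0 for a successful build and 1 for a failed one.

    failed_setscene_tasks is assumed to list setscene tasks that failed
    but whose real task then ran and succeeded.
    """
    if failed_real_tasks:
        return 1  # a real task failure is always fatal
    if failed_setscene_tasks and ignore_setscene_errors != "1":
        return 1  # default: a setscene failure fails the build
    return 0      # opted in: setscene failures are not fatal


# Default behavior is unchanged: a setscene failure fails the build.
assert build_exit_code([], ["do_populate_sysroot_setscene"]) == 1
# With the opt-in, the same build is reported as successful.
assert build_exit_code([], ["do_populate_sysroot_setscene"], "1") == 0
```

Anything other than "1" keeps the current behavior, so the extra code
path is only reached by those who explicitly ask for it.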

The problem for me in this kind of situation is that we do not want 
to make changes to anything inside the Poky repository (which would 
effectively fork it), because down that route lies madness. So instead 
we rely on making all adaptations in our own layers. Making changes to 
recipes is easy as we can use .bbappends in our layers. Making changes 
to classes or configuration files works by copying them to our layers 
and changing them there, even though I personally hate it because it 
causes extra maintenance for me: in preparation for updating to the 
next Poky release, I often need to build with a newer version of Poky 
than our layers are currently adapted for. However, making changes to 
anything inside bitbake is nearly impossible. The same goes for changes 
to anything in meta/lib/oe. Thus we rely on being able to find a way 
to get these kinds of changes integrated upstream.

> Cheers,
> 
> Richard

And in case any of the above sounds as if I am trying to force a 
feature down your throat that you do not like, then I beg for 
forgiveness. We really do appreciate your expertise and dedication 
to the OE community, and I hope we can work this out into something 
that you can accept and that we can use.

//Peter