[OE-core] Cache unihash ... doesn't match BB_UNIHASH ...

Alex Kiernan alex.kiernan at gmail.com
Mon Feb 10 14:22:15 UTC 2020


On Sun, Feb 9, 2020 at 4:25 PM Alex Kiernan <alex.kiernan at gmail.com> wrote:
>
> On Sun, Feb 9, 2020 at 7:27 AM Alex Kiernan <alex.kiernan at gmail.com> wrote:
> >
> > On Sun, Feb 9, 2020 at 12:23 AM chris.laplante at agilent.com
> > <chris.laplante at agilent.com> wrote:
> > >
> > > Hi Richard,
> > >
> > > > > > Anecdotally, we are running Zeus for nightly builds with three
> > > > > > multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake:
> > > > > > fixup" patches and haven't seen any of the BB_UNIHASH errors since.
> > > > > > Granted it's only been a week. But before that, hash equiv +
> > > > > > multiconfig was unusable due to the BB_UNIHASH errors.
> > > > >
> > > > > That is a really helpful data point, thanks. I should probably clean up
> > > > > those bitbake patches and get them merged then, I couldn't decide if
> > > > > they were right or not...
> > > > >
> > > >
> > > > I just picked all your pending changes out of master-next into our
> > > > local patch queue - will let you know how it looks when it's finished
> > > > cooking!
> > >
> > > There are two small issues I have observed.
> > >
> > > One is that occasionally I get a lot of non-deterministic metadata errors when BB_CACHE_POLICY = "cache", multiconfig, and hash equiv are all enabled. The errors are all on recipes for which SRCREV = "${AUTOREV}". It doesn't always happen, but it did just now when I rebased our "zeus-modified" branch onto the upstream "zeus" branch to pick up the changes starting with 7dc72fde6edeb5d6ac6b3832530998afeea67cbc.
> > >
> > > Two is that sometimes the "Initializing tasks" stage appears stuck at 44% for a couple of minutes. I traced it down to this code in runqueue.py (line 1168 on zeus):
> > >
> > >         # Iterate over the task list and call into the siggen code
> > >         dealtwith = set()
> > >         todeal = set(self.runtaskentries)
> > >         while len(todeal) > 0:
> > >             for tid in todeal.copy():
> > >                 if len(self.runtaskentries[tid].depends - dealtwith) == 0:
> > >                     dealtwith.add(tid)
> > >                     todeal.remove(tid)
> > >                     self.prepare_task_hash(tid)
> > >
> > > When I instrument the loop to print the size of "todeal", I see it decrease very slowly, sometimes only by a couple of entries at a time. I'm guessing this is because prepare_task_hash is contacting the hash equiv server serially, once per task. I'm over my work VPN, which makes things extra slow. Is there an opportunity for batching here?
> > >
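Purely as an illustration of the idea (none of the names below exist in
bitbake, this is just a self-contained toy sketch), batching would mean
issuing one request per wave of ready tasks rather than one round trip
per task:

    # Toy sketch only - none of these names exist in bitbake. It shows
    # the shape of "one request per wave of ready tasks" versus
    # "one round trip per task".

    def query_server_batched(taskhashes):
        # Stand-in for a single round trip that resolves many task hashes
        # at once; a real client would send one message and read one reply.
        return {th: "unihash-for-" + th for th in taskhashes}

    def resolve_in_waves(runtaskentries):
        dealtwith, todeal, unihashes = set(), set(runtaskentries), {}
        while todeal:
            # Tasks whose dependencies have all been dealt with already.
            ready = {tid for tid in todeal
                     if not (runtaskentries[tid]["depends"] - dealtwith)}
            # One batched query per wave, instead of a round trip hidden
            # inside prepare_task_hash() for every single task.
            hashes = [runtaskentries[tid]["taskhash"] for tid in ready]
            unihashes.update(query_server_batched(hashes))
            dealtwith |= ready
            todeal -= ready
        return unihashes

    tasks = {
        "quilt-native:do_fetch":   {"depends": set(), "taskhash": "aaaa"},
        "quilt-native:do_compile": {"depends": {"quilt-native:do_fetch"},
                                    "taskhash": "bbbb"},
    }
    print(resolve_in_waves(tasks))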
> >
> > I've a new failure:
> >
> > 00:20:59.829  Traceback (most recent call last):
> > 00:20:59.829    File
> > "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/server/process.py",
> > line 278, in ProcessServer.idle_commands(delay=0.1,
> > fds=[<socket.socket fd=6, family=AddressFamily.AF_UNIX,
> > type=SocketKind.SOCK_STREAM, proto=0, laddr=bitbake.sock>,
> > <socket.socket fd=18, family=AddressFamily.AF_UNIX,
> > type=SocketKind.SOCK_STREAM, proto=0, laddr=bitbake.sock>,
> > <bb.server.process.ConnectionReader object at 0x7f831b7adb70>]):
> > 00:20:59.829                   try:
> > 00:20:59.829      >                retval = function(self, data, False)
> > 00:20:59.829                       if retval is False:
> > 00:20:59.829    File
> > "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/cooker.py",
> > line 1434, in buildTargetsIdle(server=<ProcessServer(ProcessServer-1,
> > started)>, rq=<bb.runqueue.RunQueue object at 0x7f82f5112f98>,
> > abort=False):
> > 00:20:59.829                   try:
> > 00:20:59.829      >                retval = rq.execute_runqueue()
> > 00:20:59.829                   except runqueue.TaskFailure as exc:
> > 00:20:59.829    File
> > "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
> > line 1522, in RunQueue.execute_runqueue():
> > 00:20:59.829               try:
> > 00:20:59.829      >            return self._execute_runqueue()
> > 00:20:59.829               except bb.runqueue.TaskFailure:
> > 00:20:59.829    File
> > "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
> > line 1488, in RunQueue._execute_runqueue():
> > 00:20:59.829               if self.state is runQueueRunning:
> > 00:20:59.829      >            retval = self.rqexe.execute()
> > 00:20:59.829
> > 00:20:59.829    File
> > "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py",
> > line 1997, in RunQueueExecute.execute():
> > 00:20:59.829                               else:
> > 00:20:59.829      >                self.sqdata.outrightfail.remove(nexttask)
> > 00:20:59.829                           if nexttask in self.sqdata.outrightfail:
> >
> > Just testing locally with:
> >
> > diff --git a/bitbake/lib/bb/runqueue.py b/bitbake/lib/bb/runqueue.py
> > index 71108eeed752..a94a9bb27ae2 100644
> > --- a/bitbake/lib/bb/runqueue.py
> > +++ b/bitbake/lib/bb/runqueue.py
> > @@ -1994,7 +1994,7 @@ class RunQueueExecute:
> >                              self.sq_task_failoutright(nexttask)
> >                              return True
> >                          else:
> > -                            self.sqdata.outrightfail.remove(nexttask)
> > +                            self.sqdata.outrightfail.discard(nexttask)
> >                      if nexttask in self.sqdata.outrightfail:
> >                          logger.debug(2, 'No package found, so skipping setscene task %s', nexttask)
> >                          self.sq_task_failoutright(nexttask)
> >
>
> That change has got me a clean build completing end to end, and a
> rebuild is then successfully using the sstate-cache.
>

With this change on top of master I've had 7 green builds one after
another, which is better than we've managed in a week.
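
For anyone hitting the same traceback: the only functional difference is
that set.remove() raises KeyError when the element isn't present, while
set.discard() is a no-op - plain Python, nothing bitbake-specific:

    # The only difference between the two calls is what happens when the
    # element is missing from the set.
    outrightfail = {"task-a"}
    outrightfail.discard("task-b")      # no-op when "task-b" isn't in the set
    try:
        outrightfail.remove("task-b")   # raises KeyError when it's missing
    except KeyError:
        print("remove() raises, which is what the traceback above shows")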

> But something is upsetting sstate I'm serving back from the jenkins
> box to a local build, as I'm getting different hashes for the same
> sstate:
>
> akiernan at akiernan-virtual-machine:~/nanohub/build$ find sstate-cache
> -name '*quilt-native*populate_sysroot*' -ls
>   2240468     40 -rw-rw-r--   1 akiernan akiernan    39406 Feb  9
> 13:53 sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz.siginfo
>   2241106     56 -rw-rw-r--   1 akiernan akiernan    53302 Feb  9
> 13:53 sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz
>   2634859     40 -rw-rw-r--   1 akiernan akiernan    39387 Feb  9
> 16:16 sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz.siginfo
>   2634858     52 -rw-rw-r--   1 akiernan akiernan    52543 Feb  9
> 16:16 sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz
> akiernan at akiernan-virtual-machine:~/nanohub/build$ bitbake-diffsigs
> sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz.siginfo
> sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz.siginfo
> NOTE: Starting bitbake server...
> akiernan at akiernan-virtual-machine:~/nanohub/build$
>
> Running dumpsig and diffing them manually I'm none the wiser - other
> than the variables being in a different order in the two sstate
> files, they're identical.
>
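
In case it helps anyone else chasing the same thing, here's a rough
sketch (a hypothetical helper, assuming bitbake-dumpsig is on PATH as it
is on zeus) that takes the ordering out of the comparison by sorting the
dumped output before diffing:

    # Rough sketch: dump both siginfo files with bitbake-dumpsig, sort the
    # lines so ordering differences disappear, then diff whatever is left.
    import difflib
    import subprocess
    import sys

    def dump_sorted(siginfo):
        out = subprocess.run(["bitbake-dumpsig", siginfo],
                             capture_output=True, text=True, check=True)
        return sorted(out.stdout.splitlines())

    a, b = sys.argv[1], sys.argv[2]
    for line in difflib.unified_diff(dump_sorted(a), dump_sorted(b),
                                     fromfile=a, tofile=b, lineterm=""):
        print(line)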

Whatever my problem is here, it's now vanished again :|

-- 
Alex Kiernan

