[OE-core] Cache unihash ... doesn't match BB_UNIHASH ...

Alex Kiernan alex.kiernan@gmail.com
Sun Feb 9 16:25:11 UTC 2020


On Sun, Feb 9, 2020 at 7:27 AM Alex Kiernan <alex.kiernan@gmail.com> wrote:
>
> On Sun, Feb 9, 2020 at 12:23 AM chris.laplante@agilent.com
> <chris.laplante@agilent.com> wrote:
> >
> > Hi Richard,
> >
> > > > > Anecdotally, we are running Zeus for nightly builds with three
> > > > > multiconfigs. I cherry-picked your "bitbake: fix2" and "bitbake:
> > > > > fixup" patches and haven't seen any of the BB_UNIHASH errors since.
> > > > > Granted it's only been a week. But before that, hash equiv +
> > > > > multiconfig was unusable due to the BB_UNIHASH errors.
> > > >
> > > > That is a really helpful data point, thanks. I should probably clean up
> > > > those bitbake patches and get them merged then, I couldn't decide if
> > > > they were right or not...
> > > >
> > >
> > > I just picked all your pending changes out of master-next into our
> > > local patch queue - will let you know how it looks when it's finished
> > > cooking!
> >
> > There are two small issues I have observed.
> >
> > One is that I occasionally get a lot of non-deterministic metadata errors when BB_CACHE_POLICY = "cache", multiconfig, and hash equivalence are all enabled. The errors are all on recipes for which SRCREV = "${AUTOREV}". It doesn't always happen, but it did just now when I rebased our "zeus-modified" branch onto the upstream "zeus" branch to pick up the changes starting with 7dc72fde6edeb5d6ac6b3832530998afeea67cbc.
> >
> > Two is that sometimes the "Initializing tasks" stage appears stuck at 44% for a couple of minutes. I traced it down to this code in runqueue.py (line 1168 on zeus):
> >
> >         # Iterate over the task list and call into the siggen code
> >         dealtwith = set()
> >         todeal = set(self.runtaskentries)
> >         while len(todeal) > 0:
> >             for tid in todeal.copy():
> >                 if len(self.runtaskentries[tid].depends - dealtwith) == 0:
> >                     dealtwith.add(tid)
> >                     todeal.remove(tid)
> >                     self.prepare_task_hash(tid)
> >
> > When I instrument the loop to print out the size of "todeal", I see it decrease very slowly, sometimes only by a couple of entries at a time. I'm guessing this is because prepare_task_hash is contacting the hash equivalence server serially here. I'm on my work VPN, which makes things extra slow. Is there an opportunity for batching here?
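> >
> > For reference, the instrumentation was nothing fancier than logging the size of the set before each pass, roughly like this (a sketch against the zeus excerpt above; the bb.note() call and message text are just illustrative):
> >
> >         # Iterate over the task list and call into the siggen code
> >         dealtwith = set()
> >         todeal = set(self.runtaskentries)
> >         while len(todeal) > 0:
> >             # Sketch: show how quickly the backlog drains; over a slow VPN
> >             # this count only drops by a few entries per pass.
> >             bb.note("prepare_task_hash: %d tasks left to deal with" % len(todeal))
> >             for tid in todeal.copy():
> >                 if len(self.runtaskentries[tid].depends - dealtwith) == 0:
> >                     dealtwith.add(tid)
> >                     todeal.remove(tid)
> >                     # Each call can mean a round trip to the hash equivalence
> >                     # server, which is what makes this stage serial and slow.
> >                     self.prepare_task_hash(tid)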
> >
>
> I've a new failure:
>
> 00:20:59.829  Traceback (most recent call last):
> 00:20:59.829    File "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/server/process.py", line 278, in ProcessServer.idle_commands(delay=0.1, fds=[<socket.socket fd=6, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, laddr=bitbake.sock>, <socket.socket fd=18, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, laddr=bitbake.sock>, <bb.server.process.ConnectionReader object at 0x7f831b7adb70>]):
> 00:20:59.829                   try:
> 00:20:59.829      >                retval = function(self, data, False)
> 00:20:59.829                       if retval is False:
> 00:20:59.829    File "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/cooker.py", line 1434, in buildTargetsIdle(server=<ProcessServer(ProcessServer-1, started)>, rq=<bb.runqueue.RunQueue object at 0x7f82f5112f98>, abort=False):
> 00:20:59.829                   try:
> 00:20:59.829      >                retval = rq.execute_runqueue()
> 00:20:59.829                   except runqueue.TaskFailure as exc:
> 00:20:59.829    File "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py", line 1522, in RunQueue.execute_runqueue():
> 00:20:59.829               try:
> 00:20:59.829      >            return self._execute_runqueue()
> 00:20:59.829               except bb.runqueue.TaskFailure:
> 00:20:59.829    File "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py", line 1488, in RunQueue._execute_runqueue():
> 00:20:59.829               if self.state is runQueueRunning:
> 00:20:59.829      >            retval = self.rqexe.execute()
> 00:20:59.829
> 00:20:59.829    File "/var/lib/jenkins/workspace/nanohub_master/poky/bitbake/lib/bb/runqueue.py", line 1997, in RunQueueExecute.execute():
> 00:20:59.829                               else:
> 00:20:59.829      >                            self.sqdata.outrightfail.remove(nexttask)
> 00:20:59.829                           if nexttask in self.sqdata.outrightfail:
>
> Just testing locally with:
>
> diff --git a/bitbake/lib/bb/runqueue.py b/bitbake/lib/bb/runqueue.py
> index 71108eeed752..a94a9bb27ae2 100644
> --- a/bitbake/lib/bb/runqueue.py
> +++ b/bitbake/lib/bb/runqueue.py
> @@ -1994,7 +1994,7 @@ class RunQueueExecute:
>                              self.sq_task_failoutright(nexttask)
>                              return True
>                          else:
> -                            self.sqdata.outrightfail.remove(nexttask)
> +                            self.sqdata.outrightfail.discard(nexttask)
>                      if nexttask in self.sqdata.outrightfail:
>                          logger.debug(2, 'No package found, so
> skipping setscene task %s', nexttask)
>                          self.sq_task_failoutright(nexttask)
>

That change has got me a clean build end to end, and a rebuild is then
successfully using the sstate-cache.
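
For anyone hitting the same traceback: the reason the one-character
change matters is that set.remove() raises KeyError when the element
isn't present, while set.discard() silently does nothing. A minimal
illustration in plain Python (the task ids below are made up, they
aren't from the build above):

    # remove() vs discard() on a Python set
    outrightfail = {"quilt-native:do_populate_sysroot_setscene"}  # hypothetical task id

    outrightfail.discard("other-recipe:do_fetch_setscene")    # absent: silently ignored
    try:
        outrightfail.remove("other-recipe:do_fetch_setscene")  # absent: raises
    except KeyError as exc:
        print("remove() raised KeyError:", exc)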

But something is upsetting the sstate I'm serving back from the Jenkins
box to a local build, as I'm getting different hashes for the same
sstate object:

akiernan@akiernan-virtual-machine:~/nanohub/build$ find sstate-cache -name '*quilt-native*populate_sysroot*' -ls
  2240468     40 -rw-rw-r--   1 akiernan akiernan    39406 Feb  9 13:53 sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz.siginfo
  2241106     56 -rw-rw-r--   1 akiernan akiernan    53302 Feb  9 13:53 sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz
  2634859     40 -rw-rw-r--   1 akiernan akiernan    39387 Feb  9 16:16 sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz.siginfo
  2634858     52 -rw-rw-r--   1 akiernan akiernan    52543 Feb  9 16:16 sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz
akiernan@akiernan-virtual-machine:~/nanohub/build$ bitbake-diffsigs \
    sstate-cache/universal/ff/29/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:ff29b95eb35bba9a4c2e0857372991e6f08c0e9fcb72f76bc2dfbad5d12cade1_populate_sysroot.tgz.siginfo \
    sstate-cache/universal/83/30/sstate:quilt-native:x86_64-linux:0.66:r0:x86_64:3:83309dcd3c0c7e2ab03ed24b2a5b8d6bf9e35e7b4c8c27373fd68513c8c2b29e_populate_sysroot.tgz.siginfo
NOTE: Starting bitbake server...
akiernan@akiernan-virtual-machine:~/nanohub/build$

Running bitbake-dumpsig on both and diffing the output manually, I'm
none the wiser: other than the variables being in a different order in
the two siginfo files, they're identical.
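
If anyone wants to repeat the comparison, something along these lines
takes the ordering out of the picture (a rough sketch, not what's
actually in my tree; the script name is made up and it only assumes
bitbake-dumpsig is on PATH):

    #!/usr/bin/env python3
    # diffsig-sorted.py (hypothetical name): diff two siginfo files while
    # ignoring the order the variables happen to be written in.
    # Usage: diffsig-sorted.py <a.siginfo> <b.siginfo>
    import difflib
    import subprocess
    import sys

    def dumpsig_lines(siginfo):
        """Run bitbake-dumpsig on a siginfo file and return its output sorted."""
        out = subprocess.run(["bitbake-dumpsig", siginfo],
                             capture_output=True, text=True, check=True).stdout
        return sorted(out.splitlines())

    a, b = sys.argv[1], sys.argv[2]
    diff = difflib.unified_diff(dumpsig_lines(a), dumpsig_lines(b),
                                fromfile=a, tofile=b, lineterm="")
    print("\n".join(diff) or "no differences once ordering is ignored")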

-- 
Alex Kiernan

