[OE-core] SetScene tasks hang forever?

Rich Pixley rich.pixley at palm.com
Sun May 6 17:36:44 UTC 2012


On 5/2/12 16:06 , Richard Purdie wrote:
> On Wed, 2012-05-02 at 14:48 -0500, Mark Hatle wrote:
>> On 5/2/12 2:45 PM, Rich Pixley wrote:
>>> On 5/2/12 12:40 , Mark Hatle wrote:
>>>> On 5/2/12 2:16 PM, Rich Pixley wrote:
>>>>> On 5/2/12 11:40 , Mark Hatle wrote:
>>>>>> On 5/2/12 1:21 PM, Rich Pixley wrote:
>>>>>>> I'm seeing a lot of builds apparently hanging forever (the ones that
>>>>>>> work finish within seconds; the ones that hang sit for at least tens
>>>>>>> of minutes), with:
>>>>>>>
>>>>>>> rich@dolphin>       nice tail -f Log
>>>>>>> MACHINE           = "qemux86"
>>>>>>> DISTRO            = ""
>>>>>>> DISTRO_VERSION    = "oe-core.0"
>>>>>>> TUNE_FEATURES     = "m32 i586"
>>>>>>> TARGET_FPU        = ""
>>>>>>> meta              = "master:35b5fb2dd2131d4c7dc6635c14c6e08ea6926457"
>>>>>>>
>>>>>>> NOTE: Resolving any missing task queue dependencies
>>>>>>> NOTE: Preparing runqueue
>>>>>>> NOTE: Executing SetScene Tasks
>>>>>>>
>>>>>>> If I run top, I see one processor pinned at 98 - 99% utilization running
>>>>>>> python, but no other clues.
>>>>>>>
>>>>>>> Can anyone point me to doc, explain what's going on here, or point me in
>>>>>>> the right direction to debug this?
>>>>>> The only time I've seen "hang-like" behavior, the system had actually
>>>>>> opened a devshell and was awaiting input.  But based on your log, it
>>>>>> doesn't look like that is the case.
>>>>>>
>>>>>> Run bitbake with the -DDD option; you will get considerably more debug
>>>>>> information, and it might help point out what it thinks it is doing.
>>>>> NOTE: Executing SetScene Tasks
>>>>> DEBUG: Stamp for underlying task
>>>>> 12(virtual:native:/home/rich/projects/webos/openembedded-core/meta/recipes-devtools/opkg/opkg_svn.bb,
>>>>> do_populate_sysroot) is current, so skipping setscene variant
>>>>> DEBUG: Stamp for underlying task
>>>>> 16(virtual:native:/home/rich/projects/webos/openembedded-core/meta/recipes-devtools/opkg-utils/opkg-utils_git.bb,
>>>>> do_populate_sysroot) is current, so skipping setscene variant
>>>>> DEBUG: Stamp for underlying task
>>>>> 20(virtual:native:/home/rich/projects/webos/openembedded-core/meta/recipes-devtools/makedevs/makedevs_1.0.0.bb,
>>>>> do_populate_sysroot) is current, so skipping setscene variant
>>>>> DEBUG: Stamp for underlying task
>>>>> 24(/home/rich/projects/webos/openembedded-core/meta/recipes-core/eglibc/ldconfig-native_2.12.1.bb,
>>>>> do_populate_sysroot) is current, so skipping setscene variant
>>>>> DEBUG: Stamp for underlying task
>>>>> 32(virtual:native:/home/rich/projects/webos/openembedded-core/meta/recipes-devtools/genext2fs/genext2fs_1.4.1.bb,
>>>>> do_populate_sysroot) is current, so skipping setscene variant
>>>>> DEBUG: Stamp for underlying task
>>>>> 36(virtual:native:/home/rich/projects/webos/openembedded-core/meta/recipes-devtools/e2fsprogs/e2fsprogs_1.42.1.bb,
>>>>> do_populate_sysroot) is current, so skipping setscene variant
>>>>> DEBUG: Stamp for underlying task
>>>>> 40(virtual:native:/home/rich/projects/webos/openembedded-core/meta/recipes-devtools/qemu/qemu_0.15.1.bb,
>>>>> do_populate_sysroot) is current, so skipping setscene variant
>>>>> DEBUG: Stamp for underlying task
>>>>> 44(/home/rich/projects/webos/openembedded-core/meta/recipes-devtools/qemu/qemu-helper-native_1.0.bb,
>>>>> do_populate_sysroot) is current, so skipping setscene variant
>>>>>
>>>>> And then the spinning hang.
>>>> Sorry, I don't know how to continue debugging what might be wrong.  The only
>>>> other thing I can suggest is check that your filesystem is "real", not a
>>>> netapp/nfs/network emulated filesystem....
>>>>
>>>> And if you were continuing a previous build, start a new build directory and
>>>> retry it.
>>> Local file system.  I'm building a second time expecting a null build
>>> pass.  I was able to get a null build pass in the same directory yesterday.
>>>
>>> Removing my build directory and starting over has been working, but
>>> costs me a few hours each time, and this happens frequently enough that
>>> I get no other work done.  :(.
>> Ya, that is certainly not acceptable.  If you could file a bug at
>> bugzilla.yoctoproject.org, someone might be able to help you diagnose
>> this further and hopefully figure out a fix.
> What would really help is a way to reproduce this...
>
> Does it reproduce with a certain set of metadata/sstate perhaps?
>
> What is odd about the above logs is that it appears bitbake never
> executes any task. It's possible something crashed somewhere and part
> of the system died without bitbake realising it. Or it could be some
> kind of circular dependency loop where X needs Y to build and Y needs
> X, so nothing happens; we're supposed to spot that and error out if it
> happens.
>
> Does strace give an idea of which bits of bitbake are alive/looping? I'd
> probably resort to a few print()/bb.error() calls in the code at this
> point to find out what is alive, what is dead and where it's looping...
I have more info now.

What I suspected was looping (since it ran longer than the ~1hr I was 
willing to wait) isn't actually looping.  Given enough time, the builds 
do complete, and I see comparable results on 5 different servers (all 
ubuntu-12.04 amd64, all on btrfs).

My initial, full builds of core-image-minimal do build, and they build 
in ~60min (~30min if I hand-seed the downloads directory).  I'm using 
no mirrors other than the defaults.  A second build in an already-built 
directory, which I expect to do nothing, takes anywhere from 7 to 
10.5hrs to complete and successfully do nothing, depending on the server.

During this time, top shows a single CPU pinned at 98 - 100% 
utilization, and strace shows literally millions of access and stat 
calls on stamp files, mkdir calls on the stamps directory, etc.  
Statistical analysis of just the do_fetch access calls shows a 
distribution that seems to mimic the topological tree: quilt-native is 
accessed most often, and components higher up the tree get fewer stats.
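Per-file counts like that can be pulled out of an strace log with a short
script along these lines (a hypothetical sketch, not what I actually ran;
it assumes a log captured with something like
`strace -f -e trace=access,stat -o bitbake.strace bitbake core-image-minimal`,
and the `stamps` substring filter is my own guess at where the stamp files
live):

```python
import re
from collections import Counter

def count_stamp_calls(lines):
    """Count access()/stat() calls per stamp file in strace output.

    Hypothetical helper: the 'stamps' substring filter is an assumption
    about the stamp directory layout, not anything strace provides.
    """
    pat = re.compile(r'(?:access|stat)\("([^"]*stamps[^"]*)"')
    counts = Counter()
    for line in lines:
        m = pat.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

`counts.most_common()` then gives the distribution sorted by call count.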

Oh, and the setscene stamps are all nonexistent.  I presume that's expected.

First, I can't imagine why there would need to be more than one mkdir on 
the stamps directory within a single instantiation of bitbake.  I can 
imagine that it was easier to attempt the mkdir than to check first, but 
once the directory has been created (or checked), there's no need to do 
it another million times, is there?

Second, I can't imagine why there would need to be all the redundant 
stamp checking.  That info is cached internally, isn't it?

And third, the fact that it seems to check the entire subtree what 
appears to be multiple times at every node suggests to me that the 
checking algorithm is broken.  Back of the envelope: perhaps 300 
components at maybe 10 tasks per component ≈ 3e3 tasks.  Figure a 
geometric explosion of checks for an inefficient algorithm and we're up 
to around 1e7 checks.  I haven't counted an entire run, but based on 
the time it takes, I'd say I'm seeing one, maybe two orders of 
magnitude more checks than that.  I've seen a few million node 
traversals in about 15min, and a node traversal appears to involve 
several access calls and at least one stat.
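The back-of-envelope arithmetic works out as follows (a sketch only; the
300-component and 10-task figures are rough guesses, not measured counts,
and "geometric explosion" is modeled here as a simple quadratic re-walk):

```python
components = 300                  # rough guess at recipes in the build
tasks_per_component = 10          # rough guess at tasks per recipe
tasks = components * tasks_per_component   # ~3e3 tasks in the runqueue

# If an inefficient checker re-walks every task's dependency subtree
# once per task, the number of stamp checks grows roughly quadratically:
quadratic_checks = tasks ** 2              # ~9e6 checks
```

Roughly 1e7 checks, which is consistent with the millions of access/stat
calls strace shows.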

I'm not familiar with the current bitbake internals, so my next thought 
would be to replace the calls to access, stat, and mkdir on the stamp 
files with caching, counting wrappers.  Build a dictionary keyed on each 
file: if a file is new, make the kernel call and cache the result in the 
dictionary; if it's already in the dictionary, increment a counter for 
it and return the cached value.  This should a) improve the speed of the 
current algorithm, b) improve the speed of the eventual replacement 
algorithm, and c) give us some useful statistical data in the meantime.
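A minimal sketch of that wrapper, assuming stamp state doesn't change
behind bitbake's back while it runs (the class name and structure here
are mine, not bitbake's):

```python
import os
from collections import Counter

class StampCache:
    """Memoize os.access() results on stamp files and count repeat calls.

    Hypothetical sketch: caches the first kernel result per path, then
    counts how many subsequent calls the cache absorbed.
    """
    def __init__(self):
        self._exists = {}            # path -> cached os.access() result
        self.saved_calls = Counter() # path -> number of calls served from cache

    def access(self, path):
        if path in self._exists:
            self.saved_calls[path] += 1        # repeat call: serve from cache
            return self._exists[path]
        result = os.access(path, os.F_OK)      # first call: hit the kernel
        self._exists[path] = result
        return result
```

The same pattern would apply to stat and mkdir, and dumping `saved_calls`
at exit gives the per-file redundancy statistics.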

I'm also going to try reformatting one of the systems to compare how 
long a build on ext4 takes.

Any other ideas?

--rich





