[OE-core] Need arm64/qemu help
Victor Kamensky
kamensky at cisco.com
Sun Mar 11 00:11:12 UTC 2018
Hi Richard, Ian,
Any progress on the issue? If not, I am adding a few Linaro folks
who work on aarch64 qemu. Maybe they can give some insight.
I was able to reproduce the hang on my system and
look at it under gdb. It seems that some strange aarch64
peculiarity might be in play. Details inline; the root cause is still
not clear.
On Sat, 3 Mar 2018, Ian Arkver wrote:
> On 03/03/18 10:51, Ian Arkver wrote:
>> On 03/03/18 09:00, Richard Purdie wrote:
>>> Hi,
>>>
>>> I need some help with a problem we keep seeing:
>>>
>>> https://autobuilder.yocto.io/builders/nightly-arm64/builds/798
>>>
>>> Basically, now and again, for reasons we don't understand, all the
>>> sanity tests fail for qemuarm64.
>>>
>>> I've poked at this a bit and if I go in onto the failed machine and run
>>> this again, they work, using the same image, kernel and qemu binaries.
>>> We've seen this on two different autobuilder infrastructures on varying
>>> host OSes. They always seem to fail all three at once.
>>>
>>> Whilst this was a mut build, I saw this repeat three builds in a row on
>>> the new autobuilder we're setting up with master.
>>>
>>> The kernels always seem to hang somewhere around the:
>>>
>>> | [ 0.766079] raid6: int64x1 xor() 302 MB/s
>>> | [ 0.844597] raid6: int64x2 gen() 675 MB/s
>>
>> I believe this is related to btrfs and comes from having btrfs compiled
>> in to the kernel. You could maybe side-step the problem (and hence leave
>> it lurking) by changing btrfs to a module.
>
> Actually, this comes from a library (lib/raid6), and in 4.14.y's arm64
> defconfig BTRFS is already a module, so please disregard my hack suggestion.
Indeed, in my case, when I run qemu with the remote gdbserver enabled
and the kernel hangs during boot, pressing Ctrl-C to drop into gdb
shows the following backtrace:
(gdb) bt
#0 vectors ()
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/arch/arm64/kernel/entry.S:376
#1 0xffffff80089a2ff4 in raid6_choose_gen (disks=<optimized out>, dptrs=<optimized out>)
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c:190
#2 raid6_select_algo ()
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c:253
#3 0xffffff8008083b8c in do_one_initcall (fn=0xffffff80089a2e64 <raid6_select_algo>)
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:832
#4 0xffffff8008970e80 in do_initcall_level (level=<optimized out>)
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:898
#5 do_initcalls () at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:906
#6 do_basic_setup () at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:924
#7 kernel_init_freeable ()
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:1073
#8 0xffffff80087a2e00 in kernel_init (unused=<optimized out>)
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:999
#9 0xffffff80080850ec in ret_from_fork ()
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/arch/arm64/kernel/entry.S:994
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) x /10i $pc - 12
0xffffff8008082274 <vectors+628>: nop
0xffffff8008082278 <vectors+632>: nop
0xffffff800808227c <vectors+636>: nop
=> 0xffffff8008082280 <vectors+640>: sub sp, sp, #0x140
0xffffff8008082284 <vectors+644>: add sp, sp, x0
0xffffff8008082288 <vectors+648>: sub x0, sp, x0
0xffffff800808228c <vectors+652>: tbnz w0, #14, 0xffffff800808229c <vectors+668>
0xffffff8008082290 <vectors+656>: sub x0, sp, x0
0xffffff8008082294 <vectors+660>: sub sp, sp, x0
0xffffff8008082298 <vectors+664>: b 0xffffff8008082fc0 <el1_irq>
(gdb) f 1
#1 0xffffff80089a2ff4 in raid6_choose_gen (disks=<optimized out>, dptrs=<optimized out>)
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c:190
190 preempt_disable();
(gdb) x /12i $pc - 12
0xffffff80089a2fe8 <raid6_select_algo+388>: cbz x0, 0xffffff80089a3098 <raid6_select_algo+564>
0xffffff80089a2fec <raid6_select_algo+392>: mov w0, #0x1 // #1
0xffffff80089a2ff0 <raid6_select_algo+396>: bl 0xffffff80080cc498 <preempt_count_add>
=> 0xffffff80089a2ff4 <raid6_select_algo+400>: ldr x0, [x23, #2688]
0xffffff80089a2ff8 <raid6_select_algo+404>: ldr x5, [x23, #2688]
0xffffff80089a2ffc <raid6_select_algo+408>: cmp x0, x5
0xffffff80089a3000 <raid6_select_algo+412>: b.ne 0xffffff80089a300c <raid6_select_algo+424> // b.any
0xffffff80089a3004 <raid6_select_algo+416>: yield
0xffffff80089a3008 <raid6_select_algo+420>: b 0xffffff80089a2ff8 <raid6_select_algo+404>
0xffffff80089a300c <raid6_select_algo+424>: mov x25, #0x0 // #0
0xffffff80089a3010 <raid6_select_algo+428>: ldr x0, [x23, #2688]
0xffffff80089a3014 <raid6_select_algo+432>: mov x4, x27
(gdb) b *0xffffff80089a2ff4
Breakpoint 8 at 0xffffff80089a2ff4: file /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c, line 191.
This corresponds to the following code in lib/raid6/algos.c:
190 preempt_disable();
191 j0 = jiffies;
192 while ((j1 = jiffies) == j0)
193 cpu_relax();
194 while (time_before(jiffies,
195 j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
196 (*algo)->xor_syndrome(disks, start, stop,
197 PAGE_SIZE, *dptrs);
198 perf++;
199 }
200 preempt_enable();
If, for experiment's sake, I disable the loop that waits for a
jiffies transition, i.e. with something like this:
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 4769947..e0199fc 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -166,8 +166,12 @@ static inline const struct raid6_calls *raid6_choose_gen(
preempt_disable();
j0 = jiffies;
+#if 0
while ((j1 = jiffies) == j0)
cpu_relax();
+#else
+ j1 = jiffies;
+#endif /* 0 */
while (time_before(jiffies,
j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
(*algo)->gen_syndrome(disks, PAGE_SIZE, *dptrs);
@@ -189,8 +193,12 @@ static inline const struct raid6_calls *raid6_choose_gen(
preempt_disable();
j0 = jiffies;
+#if 0
while ((j1 = jiffies) == j0)
cpu_relax();
+#else
+ j1 = jiffies;
+#endif /* 0 */
while (time_before(jiffies,
j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
(*algo)->xor_syndrome(disks, start, stop,
Image boots fine after that.
I.e. it looks like some strange effect in aarch64 qemu where jiffies
does not advance and the code gets stuck.
Another observation: if I put a breakpoint, for example
in do_timer, it actually hits the breakpoint, i.e. the timer interrupt
does happen in that case, and strangely the raid6_choose_gen sequence
then progresses, i.e. debugger breakpoints make this case unstuck.
In fact, pressing Ctrl-C to interrupt the target several times,
each followed by continue in gdb, eventually lets the code get out
of raid6_choose_gen.
Also, whenever I press Ctrl-C in gdb to stop the target in the
stalled case, it always drops with $pc at the first instruction of
el1_irq; I never saw the hung code interrupted at a different $pc.
Does that mean qemu hung on the first instruction of the el1_irq
handler? Note that once I do stepi after that, it is able to proceed.
If I continue stepping, it eventually gets to arch_timer_handler_virt
and do_timer.
More details for the Linaro qemu aarch64 folks:
The situation happens on latest openembedded-core, for the qemuarm64 MACHINE.
It does not always happen, i.e. sometimes it works.
The qemu version is 2.11.1 and it is invoked like this (through the
regular oe runqemu helper utility):
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work/x86_64-linux/qemu-helper-native/1.0-r1/recipe-sysroot-native/usr/bin/qemu-system-aarch64 -device virtio-net-device,netdev=net0,mac=52:54:00:12:34:02 -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -drive id=disk0,file=/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/deploy/images/qemuarm64/core-image-minimal-qemuarm64-20180305025002.rootfs.ext4,if=none,format=raw -device virtio-blk-device,drive=disk0 -show-cursor -device virtio-rng-pci -monitor null -machine virt -cpu cortex-a57 -m 512 -serial mon:vc -serial null -kernel /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/deploy/images/qemuarm64/Image -append root=/dev/vda rw highres=off mem=512M ip=192.168.7.2::192.168.7.1:255.255.255.0 console=ttyAMA0,38400
My host system is ubuntu-16.04.
Please let me know if you need additional info and/or want me to
enable additional debug/trace options.
Thanks,
Victor
>> Regards,
>> Ian
>>
>>> raid timing measurements.
>>>
>>> In the past we've dived in and handled these kinds of things but I've
>>> run out of people to lean on and I need help from the wider community.
>>>
>>> Can anyone help look into and fix this?
>>>
>>> This is serious as if nobody cares, I'll have to simply stop boot
>>> testing qemuarm64.
>>>
>>> Not sure if there is an open bug yet either :/.
>>>
>>> Cheers,
>>>
>>> Richard
>>>
> --
> _______________________________________________
> Openembedded-core mailing list
> Openembedded-core at lists.openembedded.org
> http://lists.openembedded.org/mailman/listinfo/openembedded-core
>