[OE-core] Need arm64/qemu help
Victor Kamensky
kamensky at cisco.com
Sun Mar 11 00:11:12 UTC 2018
Hi Richard, Ian,
Any progress on the issue? If not, I am adding a few Linaro folks
who work on aarch64 qemu. Maybe they can give some insight.
I was able to reproduce the hang on my system and
look at it under gdb. It seems that some strange aarch64
peculiarity might be in play. Details inline; the root cause is still
not clear.
On Sat, 3 Mar 2018, Ian Arkver wrote:
> On 03/03/18 10:51, Ian Arkver wrote:
>> On 03/03/18 09:00, Richard Purdie wrote:
>>> Hi,
>>>
>>> I need some help with a problem we keep seeing:
>>>
>>> https://autobuilder.yocto.io/builders/nightly-arm64/builds/798
>>>
>>> Basically, now and again, for reasons we don't understand, all the
>>> sanity tests fail for qemuarm64.
>>>
>>> I've poked at this a bit and if I go in onto the failed machine and run
>>> this again, they work, using the same image, kernel and qemu binaries.
>>> We've seen this on two different autobuilder infrastructures on varying
>>> host OSes. They always seem to fail all three at once.
>>>
>>> Whilst this was a mut build, I saw this repeat three builds in a row on
>>> the new autobuilder we're setting up with master.
>>>
>>> The kernels always seem to hang somewhere around the:
>>>
>>> | [ 0.766079] raid6: int64x1 xor() 302 MB/s
>>> | [ 0.844597] raid6: int64x2 gen() 675 MB/s
>>
>> I believe this is related to btrfs and comes from having btrfs compiled
>> in to the kernel. You could maybe side-step the problem (and hence leave
>> it lurking) by changing btrfs to a module.
>
> Actually, this comes from a library (lib/raid6), and in 4.14.y's arm64
> defconfig BTRFS is already a module, so please disregard my hack suggestion.
Indeed, in my case, when I run qemu with the remote gdbserver enabled
and the kernel hangs during boot, pressing Ctrl-C to drop into gdb
shows the following backtrace:
(gdb) bt
#0 vectors ()
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/arch/arm64/kernel/entry.S:376
#1 0xffffff80089a2ff4 in raid6_choose_gen (disks=<optimized out>, dptrs=<optimized out>)
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c:190
#2 raid6_select_algo ()
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c:253
#3 0xffffff8008083b8c in do_one_initcall (fn=0xffffff80089a2e64 <raid6_select_algo>)
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:832
#4 0xffffff8008970e80 in do_initcall_level (level=<optimized out>)
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:898
#5 do_initcalls () at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:906
#6 do_basic_setup () at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:924
#7 kernel_init_freeable ()
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:1073
#8 0xffffff80087a2e00 in kernel_init (unused=<optimized out>)
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:999
#9 0xffffff80080850ec in ret_from_fork ()
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/arch/arm64/kernel/entry.S:994
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) x /10i $pc - 12
0xffffff8008082274 <vectors+628>: nop
0xffffff8008082278 <vectors+632>: nop
0xffffff800808227c <vectors+636>: nop
=> 0xffffff8008082280 <vectors+640>: sub sp, sp, #0x140
0xffffff8008082284 <vectors+644>: add sp, sp, x0
0xffffff8008082288 <vectors+648>: sub x0, sp, x0
0xffffff800808228c <vectors+652>: tbnz w0, #14, 0xffffff800808229c <vectors+668>
0xffffff8008082290 <vectors+656>: sub x0, sp, x0
0xffffff8008082294 <vectors+660>: sub sp, sp, x0
0xffffff8008082298 <vectors+664>: b 0xffffff8008082fc0 <el1_irq>
(gdb) f 1
#1 0xffffff80089a2ff4 in raid6_choose_gen (disks=<optimized out>, dptrs=<optimized out>)
at /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c:190
190 preempt_disable();
(gdb) x /12i $pc - 12
0xffffff80089a2fe8 <raid6_select_algo+388>: cbz x0, 0xffffff80089a3098 <raid6_select_algo+564>
0xffffff80089a2fec <raid6_select_algo+392>: mov w0, #0x1 // #1
0xffffff80089a2ff0 <raid6_select_algo+396>: bl 0xffffff80080cc498 <preempt_count_add>
=> 0xffffff80089a2ff4 <raid6_select_algo+400>: ldr x0, [x23, #2688]
0xffffff80089a2ff8 <raid6_select_algo+404>: ldr x5, [x23, #2688]
0xffffff80089a2ffc <raid6_select_algo+408>: cmp x0, x5
0xffffff80089a3000 <raid6_select_algo+412>: b.ne 0xffffff80089a300c <raid6_select_algo+424> // b.any
0xffffff80089a3004 <raid6_select_algo+416>: yield
0xffffff80089a3008 <raid6_select_algo+420>: b 0xffffff80089a2ff8 <raid6_select_algo+404>
0xffffff80089a300c <raid6_select_algo+424>: mov x25, #0x0 // #0
0xffffff80089a3010 <raid6_select_algo+428>: ldr x0, [x23, #2688]
0xffffff80089a3014 <raid6_select_algo+432>: mov x4, x27
(gdb) b *0xffffff80089a2ff4
Breakpoint 8 at 0xffffff80089a2ff4: file /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c, line 191.
This corresponds to the following code in lib/raid6/algos.c:
190 preempt_disable();
191 j0 = jiffies;
192 while ((j1 = jiffies) == j0)
193 cpu_relax();
194 while (time_before(jiffies,
195 j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
196 (*algo)->xor_syndrome(disks, start, stop,
197 PAGE_SIZE, *dptrs);
198 perf++;
199 }
200 preempt_enable();
If, for experiment's sake, I disable the loop that waits for a
jiffies transition, i.e. with something like this:
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 4769947..e0199fc 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -166,8 +166,12 @@ static inline const struct raid6_calls *raid6_choose_gen(
preempt_disable();
j0 = jiffies;
+#if 0
while ((j1 = jiffies) == j0)
cpu_relax();
+#else
+ j1 = jiffies;
+#endif /* 0 */
while (time_before(jiffies,
j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
(*algo)->gen_syndrome(disks, PAGE_SIZE, *dptrs);
@@ -189,8 +193,12 @@ static inline const struct raid6_calls *raid6_choose_gen(
preempt_disable();
j0 = jiffies;
+#if 0
while ((j1 = jiffies) == j0)
cpu_relax();
+#else
+ j1 = jiffies;
+#endif /* 0 */
while (time_before(jiffies,
j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
(*algo)->xor_syndrome(disks, start, stop,
Image boots fine after that.
I.e. it looks like some strange effect in aarch64 qemu where jiffies
does not advance and the code gets stuck.
Another observation: if I put a breakpoint, for example
in do_timer, it actually hits the breakpoint, i.e. the timer interrupt
does happen in that case, and strangely the raid6_choose_gen sequence
then progresses, i.e. debugger breakpoints make this case unstuck.
In fact, pressing Ctrl-C to interrupt the target several times,
each followed by continue in gdb, eventually lets the code get out
of raid6_choose_gen.
Also, whenever I press Ctrl-C in gdb to stop the target in the
stalled case, it always drops with $pc at the first instruction of
el1_irq; I never saw the hung code interrupted at a different $pc.
Does that mean qemu hung on the first instruction of the el1_irq
handler? Note that once I do stepi after that, it is able to proceed.
If I continue stepping, it eventually gets to arch_timer_handler_virt
and do_timer.
More details for the Linaro qemu aarch64 folks:
The situation happens on latest openembedded-core, for the qemuarm64 MACHINE.
It does not always happen, i.e. sometimes it works.
The qemu version is 2.11.1 and it is invoked like this (through the
regular oe runqemu helper utility):
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work/x86_64-linux/qemu-helper-native/1.0-r1/recipe-sysroot-native/usr/bin/qemu-system-aarch64 -device virtio-net-device,netdev=net0,mac=52:54:00:12:34:02 -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -drive id=disk0,file=/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/deploy/images/qemuarm64/core-image-minimal-qemuarm64-20180305025002.rootfs.ext4,if=none,format=raw -device virtio-blk-device,drive=disk0 -show-cursor -device virtio-rng-pci -monitor null -machine virt -cpu cortex-a57 -m 512 -serial mon:vc -serial null -kernel /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/deploy/images/qemuarm64/Image -append root=/dev/vda rw highres=off mem=512M ip=192.168.7.2::192.168.7.1:255.255.255.0 console=ttyAMA0,38400
My host system is ubuntu-16.04.
Please let me know if you need additional info and/or want me to
enable additional debug/trace options.
Thanks,
Victor
>> Regards,
>> Ian
>>
>>> raid timing measurements.
>>>
>>> In the past we've dived in and handled these kinds of things but I've
>>> run out of people to lean on and I need help from the wider community.
>>>
>>> Can anyone help look into and fix this?
>>>
>>> This is serious as if nobody cares, I'll have to simply stop boot
>>> testing qemuarm64.
>>>
>>> Not sure if there is an open bug yet either :/.
>>>
>>> Cheers,
>>>
>>> Richard
>>>
> --
> _______________________________________________
> Openembedded-core mailing list
> Openembedded-core at lists.openembedded.org
> http://lists.openembedded.org/mailman/listinfo/openembedded-core
>