[OE-core] Need arm64/qemu help

Victor Kamensky kamensky at cisco.com
Mon Mar 12 02:25:08 UTC 2018



On Sun, 11 Mar 2018, Peter Maydell wrote:

> On 11 March 2018 at 00:11, Victor Kamensky <kamensky at cisco.com> wrote:
>> Hi Richard, Ian,
>>
>> Any progress on the issue? If not, I am adding a few Linaro guys
>> who work on aarch64 qemu. Maybe they can give some insight.
>
> No immediate answers, but we might be able to have a look
> if you can provide a repro case (image, commandline, etc)
> that doesn't require us to know anything about OE and your
> build/test infra to look at.

Peter, thank you! I appreciate your attention and response to
this. That is a fair ask; I should have tried to narrow the test
case down before punting it to you guys.

> (QEMU's currently just about
> to head into codefreeze for our next release, so I'm a bit
> busy for the next week or so. Alex, do you have time to
> take a look at this?)
>
> Does this repro with the current head-of-git QEMU?

I've tried head-of-git QEMU (Mar 9) on my ubuntu-16.04 machine
with the same target Image and rootfs, and I could not reproduce
the issue.

I've started to play around more, trying to reduce the test
case. In my setup with OE's qemu 2.11.1, if I just pass
'-serial stdio' or '-nographic' instead of '-serial mon:vc',
with everything else the same, the image boots fine.

So I started to suspect that even though the problem manifests
itself as a functional failure of qemu, the underlying issue
could be some nasty memory corruption of qemu's operational
data. And since qemu pulls in a bunch of dependent libraries,
the problem might not even be in qemu itself.

I realized that, in order to decouple itself from the
underlying host, OE builds a lot of its own "native"
libraries, and OE's qemu uses them. So I tried to build
head-of-git QEMU against all the native libraries that OE
builds - and now that combination hangs in the same way.

Also I noticed that OE's qemu is built with SDL (v1.2),
and libsdl is the one responsible for '-serial mon:vc'
handling. And I noticed the following statements in the
default OE conf/local.conf:

#
# Qemu configuration
#
# By default qemu will build with a builtin VNC server where graphical
# output can be seen. The two lines below enable the SDL backend too. By
# default libsdl-native will be built, if you want to use your host's
# libSDL instead of the minimal libsdl built by libsdl-native then
# uncomment the ASSUME_PROVIDED line below.
PACKAGECONFIG_append_pn-qemu-native = " sdl"
PACKAGECONFIG_append_pn-nativesdk-qemu = " sdl"
#ASSUME_PROVIDED += "libsdl-native"

I tried building against my host's libSDL by uncommenting the
above line. The build actually failed, because my host's libSDL
was not happy with the native ncurses libraries, so I ended up
adding this as well:

ASSUME_PROVIDED += "ncurses-native"

After that I had to rebuild qemu-native and qemu-helper-native.
With the resulting qemu and the same target files, the image
boots OK.
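So, to summarize, the full set of local.conf changes for the
working build (host libSDL instead of OE's libsdl-native) was:

```
PACKAGECONFIG_append_pn-qemu-native = " sdl"
PACKAGECONFIG_append_pn-nativesdk-qemu = " sdl"
ASSUME_PROVIDED += "libsdl-native"
ASSUME_PROVIDED += "ncurses-native"
```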

With a nasty corruption problem like this it is always hard to
say for sure; it may be just timing changes. But now it somewhat
points to some issue in the OE libsdl version. And still it is
fairly bizarre: the libsdl in OE (1.2.15) is the same version
that I have on my ubuntu machine, and there are no additional
patches for it in OE, although the configure options might be
quite different.

Thanks,
Victor

>> If, for experiment's sake, I disable the loop that tries to find
>> the jiffies transition, i.e. have something like this:
>>
>> diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
>> index 4769947..e0199fc 100644
>> --- a/lib/raid6/algos.c
>> +++ b/lib/raid6/algos.c
>> @@ -166,8 +166,12 @@ static inline const struct raid6_calls *raid6_choose_gen(
>>
>>                         preempt_disable();
>>                         j0 = jiffies;
>> +#if 0
>>                         while ((j1 = jiffies) == j0)
>>                                 cpu_relax();
>> +#else
>> +                        j1 = jiffies;
>> +#endif /* 0 */
>>                         while (time_before(jiffies,
>>                                            j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
>>                                 (*algo)->gen_syndrome(disks, PAGE_SIZE, *dptrs);
>> @@ -189,8 +193,12 @@ static inline const struct raid6_calls *raid6_choose_gen(
>>
>>                         preempt_disable();
>>                         j0 = jiffies;
>> +#if 0
>>                         while ((j1 = jiffies) == j0)
>>                                 cpu_relax();
>> +#else
>> +                        j1 = jiffies;
>> +#endif /* 0 */
>>                         while (time_before(jiffies,
>>                                            j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
>>                                 (*algo)->xor_syndrome(disks, start, stop,
>>
>> Image boots fine after that.
>>
>> I.e. it looks like some strange effect in aarch64 qemu where
>> jiffies do not seem to progress and the code gets stuck.
>
>> Another observation is that if I put a breakpoint, for example,
>> in do_timer, it actually hits the breakpoint, i.e. the timer
>> interrupt does happen in this case, and strangely the
>> raid6_choose_gen sequence does progress, i.e. debugger
>> breakpoints make this case unstuck. Actually, several presses
>> of Ctrl-C to interrupt the target, followed by continue in gdb,
>> eventually let the code get out of raid6_choose_gen.
>>
>> Also, whenever I press Ctrl-C in gdb to stop the target in the
>> stalled case, it always drops with $pc at the first instruction
>> of el1_irq; I never saw the hung code interrupted at any other
>> $pc. Does that mean qemu hung on the first instruction of the
>> el1_irq handler? Note that once I do stepi after that, it is
>> able to proceed. If I continue stepping it eventually gets to
>> arch_timer_handler_virt and do_timer.
>
> This is definitely rather weird and suggestive of a QEMU bug...
>
>> For Linaro qemu aarch64 guys more details:
>>
>> The situation happens on latest openembedded-core, for the
>> qemuarm64 MACHINE. It does not always happen, i.e. sometimes it
>> works.
>>
>> Qemu version is 2.11.1 and it is invoked like this (through the
>> regular OE runqemu helper utility):
>>
>> /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work/x86_64-linux/qemu-helper-native/1.0-r1/recipe-sysroot-native/usr/bin/qemu-system-aarch64
>> -device virtio-net-device,netdev=net0,mac=52:54:00:12:34:02 -netdev
>> tap,id=net0,ifname=tap0,script=no,downscript=no -drive
>> id=disk0,file=/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/deploy/images/qemuarm64/core-image-minimal-qemuarm64-20180305025002.rootfs.ext4,if=none,format=raw
>> -device virtio-blk-device,drive=disk0 -show-cursor -device virtio-rng-pci
>> -monitor null -machine virt -cpu cortex-a57 -m 512 -serial mon:vc -serial
>> null -kernel
>> /wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/deploy/images/qemuarm64/Image
>> -append root=/dev/vda rw highres=off  mem=512M
>> ip=192.168.7.2::192.168.7.1:255.255.255.0 console=ttyAMA0,38400
>
> Well, you're not running an SMP config, which rules a few
> things out at least.
>
> thanks
> -- PMM
>
