[OE-core] Mis-generation of shell script (run.do_install)?

richard.purdie at linuxfoundation.org
Mon Dec 17 21:24:52 UTC 2018


On Mon, 2018-12-17 at 12:21 -0800, Andre McCurdy wrote:
> On Mon, Dec 17, 2018 at 6:44 AM <richard.purdie at linuxfoundation.org>
> wrote:
> > On Sat, 2018-12-15 at 20:19 -0500, Jason Andryuk wrote:
> > > As far as I can tell, pysh is working properly - it's just the
> > > bb_codeparser.dat which is returning the incorrect shellCacheLine
> > > entry.  It seems like I have an md5 collision between a pyro
> > > core2-64 binutils do_install and core2-32 python-async
> > > distutils_do_install in the shellCacheLine.  python-async's entry
> > > got in first, so that's why binutils run.do_install doesn't include
> > > autotools_do_install - the shellCacheLine `execs` entry doesn't
> > > include it.  Or somehow the `bb_codeparser.dat` file was corrupted
> > > to have an incorrect `execs` for the binutils do_install hash.
> > 
> > That is rather worrying. Looking at the known issues with md5, I can
> > see how this could happen though.
> 
> How do you see this could happen? By random bad luck?
> 
> Despite md5 now being susceptible to targeted attacks, the chances of
> accidentally hitting a collision between two 128-bit hashes are as
> unlikely as they have always been.
> 
>   http://big.info/2013/04/md5-hash-collision-probability-using.html
> 
> "It is not that easy to get hash collisions when using MD5 algorithm.
> Even after you have generated 26 trillion hash values, the probability
> of the next generated hash value being the same as one of those 26
> trillion previously generated hash values is 1/1 trillion (1 out of 1
> trillion)."
> 
> It seems much more likely that there's a bug somewhere in the way the
> hashes are used. Unless we understand that, switching to a longer hash
> might not solve anything.
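
As a side note, the one-in-a-trillion figure above is roughly the
birthday bound for truly random 128-bit digests. A back-of-the-envelope
check (purely illustrative, not BitBake code):

    # Birthday bound: the probability of any collision among n random
    # 128-bit digests is roughly n^2 / 2^129.
    n = 26e12                  # the "26 trillion" figure quoted above
    print(n * n / 2.0 ** 129)  # ~1e-12, i.e. about one in a trillion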

The md5 collision generators have demonstrated that it is possible to
construct colliding inputs made up of a block of contiguous fixed data
and a block of arbitrary data, in ratios of up to roughly 75% fixed to
25% arbitrary.

That pattern matches our function templating mechanism almost exactly:
two functions may be nearly identical except for a name or some other
small fragment.
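
To make that concrete, here is a hypothetical illustration of the shape
of the data we hash: two templated function bodies that share almost all
of their text and differ only in a small expanded fragment (the bodies
and names below are made up, not real recipe functions):

    import hashlib

    # Hypothetical templated function bodies: a large shared block of
    # text with only a small varying fragment.
    template = """do_install() {
        install -d ${D}${bindir}
        install -m 0755 %s ${D}${bindir}/%s
    }
    """

    body_a = template % ("foo", "foo")
    body_b = template % ("bar", "bar")

    # A codeparser-style cache keys its entries on a digest of the
    # function text, so a collision would hand one body the cached
    # `execs` result computed for the other.
    print(hashlib.md5(body_a.encode("utf-8")).hexdigest())
    print(hashlib.md5(body_b.encode("utf-8")).hexdigest())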

The chance of two random hashes colliding is less interesting than the
chance of two very similar but subtly different pieces of code getting
the same hash. I don't have a mathematical proof, but looking at the way
collisions are generated, I suspect our data is susceptible, and the
fact that it can be done at all with such large blocks is concerning.

I would love to have definitive proof. I'd be really interested if
Jason still has the "bad" checksum and one of the inputs which matches
it, as I'd probably see whether we could brute-force the other. I've
read enough to lose faith in our current code, though.
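
If that data does turn up, the first sanity check is simple (a sketch,
assuming the cache stores hex md5 digests of the function text):

    import hashlib

    def matches(bad_md5_hex, candidate_text):
        # Recompute the digest of a candidate function body and compare
        # it against the suspect checksum recorded in the cache.
        return hashlib.md5(candidate_text.encode("utf-8")).hexdigest() == bad_md5_hex

If the binutils and python-async bodies really do produce the same
digest, we have our collision; if not, the bug is elsewhere in the
cache handling.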

There is also the human factor. What I don't want is people being put
off the project because they deem it "insecure". I already get raised
eyebrows at the use of md5. It's probably time to switch and be done
with that perception anyway, particularly now that questions are being
asked, valid or not, since the performance hit, whilst noticeable in a
profile, is not earth-shattering.
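
The change being discussed is essentially just swapping the digest used
for the cache key, along the lines of (illustrative only, not the actual
BitBake code):

    import hashlib

    def cache_key(function_body):
        # Key the cache on a SHA-256 digest of the function text instead
        # of MD5; the cost is a somewhat slower hash and longer keys.
        return hashlib.sha256(function_body.encode("utf-8")).hexdigest()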

Finally, by all means please do audit the code paths and see if there is
another explanation. Our hash use is fairly simple, but it's possible
there is some other logic error, and if there is, we should fix it.

Cheers,

Richard



