[bitbake-devel] [RFC][WIP][PATCHv1] lib/bb/checksum.py: Speed-up checksum gen when directory is git

Ola x Nilsson ola.x.nilsson at axis.com
Fri Oct 11 12:36:06 UTC 2019


On Wed, Oct 09 2019, richard.purdie at linuxfoundation.org wrote:

> On Wed, 2019-10-09 at 11:02 +0200, Nicolas Dechesne wrote:
>> On Wed, Oct 9, 2019 at 1:15 AM <richard.purdie at linuxfoundation.org>
>> wrote:
>> > On Tue, 2019-10-08 at 13:45 -0500, Aníbal Limón wrote:
>> > > In some cases people/organizations are using a SRC_URI with
>> > > file:///PATH_TO_DIR that contains a git repository for different
>> > > reasons, it is useful when want to do an internal build without
>> > > clone sources from outside.
>> > > 
>> > > This could consume a lot of CPU time because the current taskhash
>> > > generation mechanism didn't identify that the folder is a VCS
>> > > (git, svn, cvs) and makes the cksum for every file including the
>> > > .git repository in this case.
>> > > 
>> > > There are different ways to improve the situation,
>> > > 
>> > > * Add protocol=gitscm in file:// SRC_URI but the taskhash is
>> > >   calculated before the fetcher identifies the protocol, will
>> > > require
>> > >   some changes in bitbake codebase.
>> > > * This patch: When directory is a git repository (contains .git)
>> > >   use HEAD rev + git diff to calculate checksum instead of do it
>> > >   in every file, that is hackish because make some assumptions
>> > > about
>> > >   .git directory contents.
>> > > * Variant of this patch: Make a list of VCS directories (.git,
>> > > .svn,
>> > >   .cvs) and take out for cksum calculations, same as before
>> > > making
>> > >   assumptions about the . folders content.
>> > 
>> > This is an interesting one.
>> 
>> Are you referring to the last bullet here? I suspect it's the second
>> one.
>
> Sorry I wasn't clear, I was meaning in general.
>
>> 
>> Also to give a bit more background to everyone, as it might not be
>> obvious. I've seen the same pattern used several times , especially
>> in
>> large/corporate deployment of OE/YP. the whole build workspace is
>> built as:
>> 
>> > - sources
>> > ---- kernel
>> > ---- component_A
>> > ---- component_B
>> > - layers
>> > ---- poky
>> > ---- meta-mycompany
>> > -------- recipes for kernel, component_A, ...
>> 
>> The whole workspace is managed with a repo manifest, and the recipes
>> are written to use source code from the 'sources' local folder.
>> 
>> I am not trying to argue whether this is a good practice or not ;-)
>> but from the perspectives of the folks I've talked to , there are a
>> couple of critical advantages of doing something like that:
>> * it looks like Android development workflow ;-)
>> * it relates to the company license/legal process and review. e.g.
>> all
>> the software that gets out of the company is managed by a single repo
>> manifest xml file
>> * it solves "nicely" the problem of being able to iteratively develop
>> using bitbake natively. e.g. "bibtake myimage" always work, and uses
>> local changes from 'sources'
>> 
>> So overall, i am being convinced that this is a valid use case for OE
>> end users. I don't think we can use the git:// fetcher as we need the
>> snapshot of the current 'sources' (with local changes), and using the
>> file:// fetcher has important performance impacts:
>> * checksum for 'each' file (which can be large, especially for
>> kernel)
>> * un-expected rebuild when running repo sync, if any new git objects
>> are put in .git (even when no changes are made to the local worktree
>> of the git project).
>> 
>> > File checksums are added to the hashes "late" so that we don't have
>> > to
>> > reparse entire recipes when files change. We do need a mechanism to
>> > know when we need to reparse the checksum. I think this means you
>> > can
>> > skip the checksum calculation for each file but you do still end up
>> > having to stat all files in the tree separately for bitbake's
>> > tracking
>> > and for git. We also have to notice when new files are added.
>> > 
>> > As such I'm not convinced this patch will work correctly (e.g.
>> > would it
>> > notice if I copy in a new file to the directory untracked by git).
>> 
>> At least I confirm that with the file:// fetcher everything works
>> fine, when modifying files. I don't think I have tried adding new
>> files. But I will try that.
>
> I'd like to check that bitbake's hashes are changing correctly in the
> different modification cases.
>
> I did also wondering about this kind of trick, borrowed from stack
> overflow:
>
> if [ ! -e  .git/allfilesindex ]; then
>     cp .git/index .git/allfilesindex
> fi
> GIT_INDEX_FILE=.git/allfilesindex git add -u; git write-tree
>
> to get a hash which represents the state of the tree. Using git add -A
> might track untracked files too.

This is what externalsrc.bbclass does to detect changes.

/Ola

>> Are you trying to say that to fix this properly we might need another
>> Fetcher , something in between file:// and git://, e.g. localgit://?
>> Would that make this problem easier to solve?
>
> I'm not sure about that. I'm mainly worried that we have reports that
> file:// doesn't work correctly today before we add this kind of
> complexity on top. Hence my comments about needing better tests in this
> area, with and without git involved.
>
> I'm a little bit too focused on the release to be able to think clearly
> about this right now which doesn't help.
>
> Cheers,
>
> Richard
>
>> > A first step may be to add some further tests to bitbake-selftest
>> > to
>> > better cover this area...
>> > 
>> > Cheers,
>> > 
>> > Richard
>> > 
>> > 
>> > 
>> > 
>> > 


-- 
Ola x Nilsson


More information about the bitbake-devel mailing list