[bitbake-devel] [RFC][WIP][PATCHv1] lib/bb/checksum.py: Speed-up checksum gen when directory is git

richard.purdie at linuxfoundation.org richard.purdie at linuxfoundation.org
Wed Oct 9 11:53:32 UTC 2019


On Wed, 2019-10-09 at 11:02 +0200, Nicolas Dechesne wrote:
> On Wed, Oct 9, 2019 at 1:15 AM <richard.purdie at linuxfoundation.org>
> wrote:
> > On Tue, 2019-10-08 at 13:45 -0500, Aníbal Limón wrote:
> > > In some cases people/organizations are using a SRC_URI with
> > > file:///PATH_TO_DIR that contains a git repository for different
> > > reasons, it is useful when want to do an internal build without
> > > clone sources from outside.
> > > 
> > > This could consume a lot of CPU time because the current taskhash
> > > generation mechanism didn't identify that the folder is a VCS
> > > (git, svn, cvs) and makes the cksum for every file including the
> > > .git repository in this case.
> > > 
> > > There are different ways to improve the situation,
> > > 
> > > * Add protocol=gitscm in file:// SRC_URI but the taskhash is
> > >   calculated before the fetcher identifies the protocol, will
> > > require
> > >   some changes in bitbake codebase.
> > > * This patch: When directory is a git repository (contains .git)
> > >   use HEAD rev + git diff to calculate checksum instead of do it
> > >   in every file, that is hackish because make some assumptions
> > > about
> > >   .git directory contents.
> > > * Variant of this patch: Make a list of VCS directories (.git,
> > > .svn,
> > >   .cvs) and take out for cksum calculations, same as before
> > > making
> > >   assumptions about the . folders content.
> > 
> > This is an interesting one.
> 
> Are you referring to the last bullet here? I suspect it's the second
> one.

Sorry I wasn't clear, I was meaning in general.

> 
> Also to give a bit more background to everyone, as it might not be
> obvious. I've seen the same pattern used several times , especially
> in
> large/corporate deployment of OE/YP. the whole build workspace is
> built as:
> 
> > - sources
> > ---- kernel
> > ---- component_A
> > ---- component_B
> > - layers
> > ---- poky
> > ---- meta-mycompany
> > -------- recipes for kernel, component_A, ...
> 
> The whole workspace is managed with a repo manifest, and the recipes
> are written to use source code from the 'sources' local folder.
> 
> I am not trying to argue whether this is a good practice or not ;-)
> but from the perspectives of the folks I've talked to , there are a
> couple of critical advantages of doing something like that:
> * it looks like Android development workflow ;-)
> * it relates to the company license/legal process and review. e.g.
> all
> the software that gets out of the company is managed by a single repo
> manifest xml file
> * it solves "nicely" the problem of being able to iteratively develop
> using bitbake natively. e.g. "bibtake myimage" always work, and uses
> local changes from 'sources'
> 
> So overall, i am being convinced that this is a valid use case for OE
> end users. I don't think we can use the git:// fetcher as we need the
> snapshot of the current 'sources' (with local changes), and using the
> file:// fetcher has important performance impacts:
> * checksum for 'each' file (which can be large, especially for
> kernel)
> * un-expected rebuild when running repo sync, if any new git objects
> are put in .git (even when no changes are made to the local worktree
> of the git project).
> 
> > File checksums are added to the hashes "late" so that we don't have
> > to
> > reparse entire recipes when files change. We do need a mechanism to
> > know when we need to reparse the checksum. I think this means you
> > can
> > skip the checksum calculation for each file but you do still end up
> > having to stat all files in the tree separately for bitbake's
> > tracking
> > and for git. We also have to notice when new files are added.
> > 
> > As such I'm not convinced this patch will work correctly (e.g.
> > would it
> > notice if I copy in a new file to the directory untracked by git).
> 
> At least I confirm that with the file:// fetcher everything works
> fine, when modifying files. I don't think I have tried adding new
> files. But I will try that.

I'd like to check that bitbake's hashes are changing correctly in the
different modification cases.

I did also wondering about this kind of trick, borrowed from stack
overflow:

if [ ! -e  .git/allfilesindex ]; then
    cp .git/index .git/allfilesindex
fi
GIT_INDEX_FILE=.git/allfilesindex git add -u; git write-tree

to get a hash which represents the state of the tree. Using git add -A
might track untracked files too.

> Are you trying to say that to fix this properly we might need another
> Fetcher , something in between file:// and git://, e.g. localgit://?
> Would that make this problem easier to solve?

I'm not sure about that. I'm mainly worried that we have reports that
file:// doesn't work correctly today before we add this kind of
complexity on top. Hence my comments about needing better tests in this
area, with and without git involved.

I'm a little bit too focused on the release to be able to think clearly
about this right now which doesn't help.

Cheers,

Richard

> > A first step may be to add some further tests to bitbake-selftest
> > to
> > better cover this area...
> > 
> > Cheers,
> > 
> > Richard
> > 
> > 
> > 
> > 
> > 



More information about the bitbake-devel mailing list