[Reproducible-builds] Patch V2 for build nodes pools

Mon Dec 21 18:00:33 UTC 2015

Hi Vagrant,

On Monday, 21 December 2015, Vagrant Cascadian wrote:
> > put some 4cores in one pool, and 2cores in another?
> It could be done any number of ways, I merely added it to show how it
> would work with the code I was proposing.

well, I don't think we should develop an example here, but rather actual 
working code :) Thus we need to settle on one way.

Thus it occurred to me what the main differentiating attribute is: whether it's 
running in the future or not. Additionally I would try to also differentiate other 
attributes along the same line (this is bound to fail, but let's keep the 
impact of that failure small):

- have two pools, "future machines" and "today's machines"
- distribute CPU power roughly evenly among them
- _ideally_ by having all 2-core machines be in one pool (or the other)
- then also, have machines in one pool run the normal jessie kernel and have 
machines in the other pool run the bpo kernel or newer.

This should achieve the best possible variation while utilising maximum build 
power.

> I was hoping the pool code
> could be ready enough to use with the new nodes that should be coming by
> the end of the year, and they'd be reasonable first tests...

I'd rather not do that for two reasons:

- we want to use the new nodes fast, so let's use them. the existing scheduling 
is really not that bad.
- also, developing new stuff tends to break things. if we throw more resources at 
this during development, the breakage will be bigger, and so will the fallout.

Thus, let's stay with the plan to develop this with one builder job first.

> >> - Split load estimating into its own script, and add support for
> >> available memory.
> > 
> > I'd still suggest to measure the load constantly by a job outside the
> > build script… (then it's also easy to read "not updated node load since
> > $time" as "node is too busy to be scheduled on"…)
> > 
> >> - Call timeout so that the ssh processes don't take too long to
> >> complete.
> > 
> > see above, don't ssh from the build script please.
> 
> Implementing that outside of the build script would make this much more
> complicated...

Implementing this inside the build script makes the build script more error 
prone. As I understand your proposal, you want to check 15 nodes via ssh at 
each build, that means 15 times watching an ssh connection, waiting until it 
ends… and I don't even want to think about doing this in parallel in shell. 
This is bound to break, IMO.

> The second build needs to check for load when it is about to be run, as
> it doesn't make sense to check when build_rebuild is run (unless you run
> both build1 and build2 in parallel... but that's a whole different
> proposal), as the load of the machines is likely to change between the
> first and second build.

Then run the job to check the load every 5min. Or every 2min. But provide 
something the build script can *consume* in *no time*. We currently have 15 
armhf builder jobs, the new hardware should make this 40 or so jobs in one or 
two months. And each job should crawl the nodes? NO.
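
The split described above could be sketched roughly as follows. This is only a
minimal sketch under assumptions, not the code from the proposed patch: the
cache directory LOAD_DIR, the MAX_AGE threshold and the update_node_loads
function are hypothetical names, and reading /proc/loadavg is just one possible
load measure. Only select_least_loaded_node is a name taken from the patch, and
its body here is made up. The point is that the periodic job does all the ssh
work, while the build script only reads small local files and treats stale
entries as "node too busy".

```shell
# Hypothetical cache: one file per node, holding a single load value.
LOAD_DIR="${LOAD_DIR:-/var/tmp/node-load}"
MAX_AGE="${MAX_AGE:-300}"	# seconds; older entries count as "too busy"

# Run periodically by a separate job (e.g. every 2-5 minutes), NOT by the
# build script. Arguments: nodes in host:port form.
update_node_loads() {
	local node host port load
	mkdir -p "$LOAD_DIR"
	for node in "$@" ; do
		host=${node%%:*}
		port=${node##*:}
		# timeout(1) bounds each ssh call. On failure the old file is
		# left alone and simply becomes stale.
		if load=$(timeout 30 ssh -o BatchMode=yes -p "$port" "$host" \
			"cut -d ' ' -f1 /proc/loadavg" < /dev/null 2>/dev/null) \
			&& [ -n "$load" ] ; then
			echo "$load" > "$LOAD_DIR/$node.tmp"
			mv "$LOAD_DIR/$node.tmp" "$LOAD_DIR/$node"	# atomic update
		fi
	done
}

# Called from the build script: local file reads only, no network access.
# Prints the least loaded node with fresh data, returns 1 if there is none.
select_least_loaded_node() {
	local node file now mtime load best="" best_load=""
	now=$(date +%s)
	for node in "$@" ; do
		file="$LOAD_DIR/$node"
		[ -f "$file" ] || continue
		mtime=$(stat -c %Y "$file")
		# not updated recently: read as "node is too busy"
		[ $(( now - mtime )) -le "$MAX_AGE" ] || continue
		load=$(cat "$file")
		if [ -z "$best" ] || awk -v a="$load" -v b="$best_load" \
			'BEGIN { exit !(a < b) }' ; then
			best=$node
			best_load=$load
		fi
	done
	[ -n "$best" ] && echo "$best"
}
```

With something like this, 40 builder jobs would cost 40 cheap file reads per
build, and the number of ssh connections depends only on how often the one
collector job runs.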

> I'm not sure how to do all that outside the build script and keep the
> code reasonably simple.

Your proposal ain't simple. Or rather: the environment it needs to fit into isn't.

> What's the primary concern with ssh from within the build script? Taking
> too long to get a response?

see above. 

> >> diff --git a/bin/reproducible_build.sh b/bin/reproducible_build.sh
> > 
> > I'll only comment on the most "pressing" issues now.
> > 
> >>  build_rebuild() {
> >>  
> >>  	FTBFS=1
> >>  	mkdir b1 b2
> >> 
> >> +	local selected_node
> >> +	selected_node=$(select_least_loaded_node $NODE1_POOL)
> > 
> > please make this somehow conditional so that this code path is not used
> > for "normal operation" (=without this new pooling), so we can test this
> > easily on one builder job, but not on all.
> 
> It basically is conditional in that the select_least_loaded_node
> function simply returns the node if only one argument is passed.

Please make it "conditional", not "basically conditional". The code is run 
to build 5-10k packages on amd64 too, and I neither want to fork it nor risk 
destabilisation.

> > …reproducible_build.sh should probably be called with
> > "experimental-pooling" as first param, which is then shifted away…
> That shouldn't be too hard, sure.

:-)

> Could alternately use something like:
> 
>    - '16': { my_node1: 'pool,wbd0-armhf-rb:2223,wbq0-armhf-rb:2225',
>              my_node2: 'pool,bpi0-armhf-rb:2222,odxu4-armhf-rb:2229' }

no. rather: "pool" as $1 for the script, not as part of $2+$3.
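
A rough sketch of what "pool" as $1 might look like, under assumptions: the
POOLING flag and the parse_args and pick_node helpers are hypothetical names,
the comma-separated node lists are assumed from the example quoted above, and
select_least_loaded_node is replaced by a trivial stand-in so the sketch runs
on its own. Without the keyword, none of the pooling code is reached.

```shell
# Stand-in for the function from the proposed patch; here it just prints
# its first argument so this sketch is self-contained.
select_least_loaded_node() {
	echo "$1"
}

POOLING=false
NODE1_POOL=""
NODE2_POOL=""

# Consume an optional leading "pool" keyword, then read the two node
# arguments exactly as in normal operation.
parse_args() {
	POOLING=false
	if [ "$1" = "pool" ] ; then
		POOLING=true
		shift
	fi
	NODE1_POOL=$1
	NODE2_POOL=$2
}

# Print the node to build on. Without pooling this is exactly the node
# that was passed in.
pick_node() {
	if $POOLING ; then
		# assumed format: comma-separated list of host:port entries
		select_least_loaded_node $(echo "$1" | tr ',' ' ')
	else
		echo "$1"
	fi
}
```

A pooled job would then call the script as "pool node1a,node1b node2a,node2b",
while all existing jobs keep calling it with their two nodes as before.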

> Maybe this should be written in two stages, first implementing a simpler
> patch just providing failover, and then adding the load checks later.

small patches are easier to take, yes.
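
For illustration, a failover-only first stage could be as small as the sketch
below. Everything in it is an assumption rather than part of the patch:
STATUS_DIR, the ".up" marker files, node_is_up and select_first_available_node
are hypothetical names, and how the markers get written (for instance by a job
outside the build script) is left open.

```shell
# Hypothetical status directory, maintained outside the build script:
# a file named "<host:port>.up" exists while that node is usable.
STATUS_DIR="${STATUS_DIR:-/var/tmp/node-status}"

# Hypothetical check: a local file test only, no network access.
node_is_up() {
	[ -e "$STATUS_DIR/$1.up" ]
}

# Print the first node from the argument list that is marked as up.
# Returns 1 if none is, so the caller can abort or retry later.
select_first_available_node() {
	local node
	for node in "$@" ; do
		if node_is_up "$node" ; then
			echo "$node"
			return 0
		fi
	done
	return 1
}
```

Load checks could later replace the "first one that is up" rule without
changing how the function is called.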

cheers,
	Holger