[Reproducible-builds] Preliminary review of dpkg-genbuildinfo

Guillem Jover guillem at debian.org
Sun Feb 1 02:48:52 UTC 2015


Hi!

Looking at <git://anonscm.debian.org/reproducible/dpkg.git>.

Have you seen any actual problem to warrant the «Ensure stable order
of Checksums-* fields» commit? In principle the output order is
preserved from the input one.

And here's a quickish review of the dpkg-genbuildinfo changes, taking
into account that I'm looking at this as a general tool for recording
the build environment, not specific for just reproducible builds. Some
general comments first:

 * It would have been nice to get the .buildinfo stuff up for review
   in the debian-dpkg list too (besides the ftp-masters bug 763822).

 * I'm still somewhat unconvinced that having byte-for-byte identical
   container binary .deb packages is the ideal minimal reproducible
   unit. This will completely disallow digital signatures embedded in
   the binary packages or per-executable signatures in their xattr
   metadata for example, and that seems to me was completely ignored
   last time around. I'd be very unhappy if at some point reproducible
   builds were enforced and that'd mean we've painted ourselves into a
   corner with other potential additions to the .deb that might not be
   reproducible by design.

 * Somewhat related to the above point and this new file. Timestamps
   in some places (mainly on the actual file metadata, for example in
   ar and tar containers) are actually very useful, because as long as
   we don't have a stored recording of the build environment, it's the
   next "best" approximation to that. Getting rid of those prematurely
   makes us miss very valuable information for debugging and similar.
   And yes, even replacing the current time with the changelog time is
   missing information. This has also been dismissed previously (with,
   I'd say, even a bit of contempt!).

 * I'm not entirely sure if this really makes sense as a different
   file, but at least given that it's controlled by dpkg-buildpackage
   we can always fold it into dpkg-genchanges if we deem that's a
   better course of action. So it seems fine this way for now.

 * Some of the information in this file trigger my privacy alarm bells,
   for example the Build-Path field.


> diff --git a/debian/usertags b/debian/usertags
> index 0fc26f2..0419488 100644
> --- a/debian/usertags
> +++ b/debian/usertags
> @@ -65,6 +65,7 @@ dpkg-checkbuilddeps	[DPKG-CHECKBUILDDEPS]
>  dpkg-deb		[DPKG-DEB]
>  dpkg-distaddfile	[DPKG-DISTADDFILE]
>  dpkg-divert		[DPKG-DIVERT]
> +dpkg-genbuildinfo	[DPKG-GENBUILDINFO]

The pseudo-tag is only needed for old tools for legacy reasons. I
should just do a round of «bts retitle» at some point.

>  dpkg-genchanges		[DPKG-GENCHANGES]
>  dpkg-gencontrol		[DPKG-GENCONTROL]
>  dpkg-gensymbols		[DPKG-GENCSYMBOLS]
> diff --git a/man/dpkg-genbuildinfo.1 b/man/dpkg-genbuildinfo.1
> new file mode 100644
> index 0000000..626bf66
> --- /dev/null
> +++ b/man/dpkg-genbuildinfo.1
> @@ -0,0 +1,89 @@
> +.\" dpkg manual page - dpkg-genbuildinfo(1)
> +.\"
> +.\" Copyright © 2015 Jérémy Bobbio <lunar at debian.org>

This probably needs to be completed with the © from the sources it's
borrowing from.


The changes to man/po/po4a.conf are missing.

> diff --git a/scripts/Dpkg/Control/FieldsCore.pm b/scripts/Dpkg/Control/FieldsCore.pm
> index 8a5695b..75de185 100644
> --- a/scripts/Dpkg/Control/FieldsCore.pm
> +++ b/scripts/Dpkg/Control/FieldsCore.pm
> @@ -361,6 +373,11 @@ our %FIELD_ORDER = (
>          Vcs-Svn Testsuite), &field_list_src_dep(), qw(Package-List),
>          @checksum_fields, qw(Files)
>      ],
> +    CTRL_FILE_BUILDINFO() => [
> +        qw(Format Build-Architecture Source Binary Architecture Version
> +        Build-Environment Build-Path),
> +        @checksum_fields,
> +    ],

Probably pack all «Build-» fields together at the end after the checksum
ones, and leave Build-Environment as the last of those, as it's the one
taking the most vertical space.

>      CTRL_FILE_CHANGES() => [
>          qw(Format Date Source Binary Binary-Only Built-For-Profiles Architecture
>          Version Distribution Urgency Maintainer Changed-By Description
> diff --git a/scripts/dpkg-buildpackage.pl b/scripts/dpkg-buildpackage.pl
> index a3cca87..018f37d 100755
> --- a/scripts/dpkg-buildpackage.pl
> +++ b/scripts/dpkg-buildpackage.pl
> @@ -158,6 +158,7 @@ my $since;
>  my $maint;
>  my $changedby;
>  my $desc;
> +my @buildinfo_opts;
>  my @changes_opts;
>  my @hook_names = qw(
>      init preclean source build binary changes postclean check sign done

The buildinfo hook name is missing in the @hook_names array.

> @@ -567,6 +570,25 @@ if ($include & BUILD_BINARY) {
>      withecho(@debian_rules, $buildtarget);
>      run_hook('binary', 1);
>      withecho(@rootcommand, @debian_rules, $binarytarget);
> +
> +    if (-e "../$pv.dsc") {

Why is the buildinfo file tied to whether there's or not a source
package? I'd say it might make sense to not generate a buildinfo file
only if we are not building binary packages. To check for that use the
BUILD_ constants.

> +        run_hook('buildinfo', 1);

The hook should not be conditional to the code being executed. The
second argument should represent that.

> +        push @buildinfo_opts, "--admindir=$admindir" if $admindir;
> +
> +        my $buildinfo_path = "../$pva.buildinfo";
> +        my $buildinfo = Dpkg::Control->new(type => CTRL_FILE_BUILDINFO);
> +
> +        print { *STDERR } " dpkg-genbuildinfo @buildinfo_opts >$buildinfo_path\n";
> +
> +        open my $buildinfo_fh, '-|', 'dpkg-genbuildinfo', @buildinfo_opts
> +            or subprocerr('dpkg-genbuildinfo');
> +        $buildinfo->parse($buildinfo_fh, _g('parse buildinfo file'));
> +        $buildinfo->save($buildinfo_path);
> +        close $buildinfo_fh or subprocerr(_g('dpkg-genbuildinfo'));

There's no need for all this, just call dpkg-genbuildinfo redirecting
its output to the destination file. The $changes code where this is
originating from is accessed later on, so that's why it's being parsed.

> +    } else {
> +        warning(_g('no .dsc file, skipping .buildinfo generation'));
> +    }
>  }

> diff --git a/scripts/dpkg-genbuildinfo.pl b/scripts/dpkg-genbuildinfo.pl
> new file mode 100755
> index 0000000..d7784ee
> --- /dev/null
> +++ b/scripts/dpkg-genbuildinfo.pl
> @@ -0,0 +1,263 @@
> +#!/usr/bin/perl
> +#
> +# dpkg-genbuildinfo
> +#
> +# Copyright © 2003-2013 Yann Dirson <dirson at debian.org>
> +# Copyright © 2014 Niko Tyni <ntyni at debian.org>
> +# Copyright © 2014-2015 Jérémy Bobbio <lunar at debian.org>

This also probably needs to be completed with the © from the other
sources it's borrowing from.

> +my $quiet = 0;

There's nothing to be quiet about here.

> +my $buildinfo_format = '1.9';

Why is this format 1.9?

> +my $checksums = Dpkg::Checksums->new();
> +my %archadded;
> +my @archvalues;
> +
> +# There's almost the same function in dpkg-checkbuilddeps,
> +# they probably should be factored out.
> +sub parse_status {

I've a local commit switching the parse_status() in dpkg-checkbuilddeps
to use Dpkg::Index, unfortunately due to the current implementation
it incurs a significant speed regression, but I'm reworking the
Dpkg::Control parser to speed it up, so this code will be able to
switch to something similar too.

> +    my $status = shift;
> +
> +    my $facts = Dpkg::Deps::KnownFacts->new();
> +    my %depends;
> +    local $/ = '';
> +    open(my $status_fh, '<', $status)
> +        or syserr(_g('cannot open %s'), $status);
> +    while (<$status_fh>) {
> +        next unless /^Status: .*ok installed$/m;
> +
> +        my ($package) = /^Package: (.*)$/m;
> +        my ($version) = /^Version: (.*)$/m;
> +        my ($arch) = /^Architecture: (.*)$/m;
> +        my ($multiarch) = /^Multi-Arch: (.*)$/m;
> +        $facts->add_installed_package($package, $version, $arch,
> +                                      $multiarch);
> +
> +        if (/^Provides: (.*)$/m) {
> +            my $provides = deps_parse($1, reduce_arch => 1, union => 1);
> +            next if not defined $provides;
> +            foreach (grep { $_->isa('Dpkg::Deps::Simple') }
> +                                 $provides->get_deps())
> +            {
> +                $facts->add_provided_package($_->{package},
> +                                    $_->{relation}, $_->{version},
> +                                    $package);
> +            }
> +        }
> +
> +        if (/^(?:Pre-)?Depends: (.*)$/m) {
> +            foreach (split(/,\s*/, $1)) {
> +                push @{$depends{$package}}, $_;

This will merge all dependencies from all Multi-Arch:same instances.
But in any case why record the dependencies of the interesting packages?

> +            }
> +        }
> +    }
> +    close $status_fh;
> +
> +    return ($facts, \%depends);
> +}

> +# Retrieve info from the current changelog entry
> +my %options = (file => $changelogfile);
> +$options{changelogformat} = $changelogformat if $changelogformat;
> +my $changelog = changelog_parse(%options);
> +# Other initializations
> +my $control = Dpkg::Control::Info->new($controlfile);
> +my $fields = Dpkg::Control->new(type => CTRL_FILE_BUILDINFO);
> +
> +my $dist = Dpkg::Dist::Files->new();
> +
> +# include .dsc
> +my $spackage = $changelog->{'Source'};
> +(my $sversion = $changelog->{'Version'}) =~ s/^\d+://;

This will grab the binary version, not the source version, this will
break on binNMUs.

We probably need to record both source and binary versions in the
buildinfo file. And in case they differ the changelog entry, which
tends to differ per arch.

> +my $dsc = "${spackage}_${sversion}.dsc";
> +my $dsc_pathname = "$uploadfilesdir/$dsc";
> +my $dsc_fields = Dpkg::Control->new(type => CTRL_PKG_SRC);
> +$checksums->add_from_file($dsc_pathname, key => $dsc);
> +
> +my $dist_count = 0;
> +
> +$dist_count = $dist->load($fileslistfile) if -e $fileslistfile;
> +
> +error(_g('binary build with no binary artifacts found; cannot distribute'))
> +    if $dist_count == 0;
> +
> +foreach my $file (sort $dist->get_files()) {

Why the sort, the function is supposed to preserve the same insertion
order.

> +}

> +my @status = parse_status("$admindir/status");
> +my $facts = shift @status;
> +my %depends=%{shift @status};
> +my $deps;
> +my %env_pkgs;
> +
> +# parse essential list
> +open (RAWFILE, '/usr/share/build-essential/essential-packages-list')
> +    or error("cannot read /usr/share/build-essential/essential-packages-list: $!\n");
> +# skip header
> +while (my $line = <RAWFILE>) {
> +    chomp $line;
> +    last if $line eq '';
> +}
> +while (my $line = <RAWFILE>) {
> +    chomp $line;
> +    $env_pkgs{$line} = 1;
> +}
> +close RAWFILE;

Hmm, please just record installed packages that have the Essential:yes
field. I don't want the code to rely on such externally defined files.

(As an aside, in the dpkg codebase please use scalar variables for
filehandles instead of barewords, use syserr when $! is concerned,
mark strings for translation, and pass filenames as format string
arguments.)

> +append_deps(\%env_pkgs, 'build-essential',
> +                        $control->{source}->{'Build-Depends-Indep'},
> +                        $control->{source}->{'Build-Depends'});

Missing Build-Depends-Arch field. But those fields should not be
recorded if the build does not concern them. And this uses an
undocumented access to the $control object.

> +while (my ($pkg, $seen) = each(%env_pkgs)) {
> +    next if $seen;
> +    $env_pkgs{$pkg} = 1; # mark as seen
> +    next unless defined $facts->{pkg}->{$pkg}->[0];

The Dpkg::Deps::KnownFacts object is not supposed to be accessed
directly, it's only for internal use by other Dpkg::Deps objects.

> +    append_deps(\%env_pkgs, @{$depends{$pkg}});
> +    keys %env_pkgs; # reset iterator
> +}
> +
> +my $environment = Dpkg::Deps::AND->new();
> +foreach my $pkg (sort keys %env_pkgs) {
> +    foreach my $installed_pkg (@{$facts->{pkg}->{$pkg}}) {
> +        my $version = $installed_pkg->{version};
> +        my $architecture = $installed_pkg->{architecture};
> +        my $pkg_name = $pkg;

Useless variable?

> +        if (defined $architecture &&
> +            $architecture ne 'all' && $architecture ne $build_arch) {

This should probably be host_arch.

> +            $pkg_name = "$pkg_name:$architecture";
> +        }
> +        my $dep = Dpkg::Deps::Simple->new("$pkg_name (= $version)");
> +        $environment->add($dep);
> +    }
> +}
> +$environment = "\n" . $environment->output();
> +$environment =~ s/, /,\n/g;

This is changing the type of $environment on assignment which seems
rather confusing. And it seem to be doing lots of parsing and dumping
for something that's simply string concatenations. Just do the
equivalent of:

  $var .= "$pkgname (= $pkgver),\n";

or probably even better, just use an array, and join them with ",\n".

> +$fields->{'Build-Environment'} = $environment;
> +
> +$fields->output(\*STDOUT);

I get the impression the code tracking the used installed packages is
made much more difficult than it needs to be.

> diff --git a/scripts/dpkg-genchanges.pl b/scripts/dpkg-genchanges.pl
> index 3303d5c..4daf530 100755
> --- a/scripts/dpkg-genchanges.pl
> +++ b/scripts/dpkg-genchanges.pl
> @@ -342,6 +341,10 @@ if ($include & BUILD_BINARY) {
>                  if $file->{arch} and not $archadded{$file->{arch}}++;
>          }
>      }
> +
> +    my $arch = $include == BUILD_ARCH_INDEP ? 'all' : $host_arch;
> +    my $buildinfo = "${spackage}_${sversion}_${arch}.buildinfo";

The problem here is that sversion is the actual source version, which
means on binNMUs the file will not exist.

> +    $dist->add_file($buildinfo, $sec, $pri);

And I don't feel very comfortable assuming the .buildinfo name here
when it is not being generated in this same script, and when the
script generating the output is not even defining the name itself.

Thanks,
Guillem



More information about the Reproducible-builds mailing list