I’ve been looking into how easy it is to confirm that a binary package corresponds to a source package. It turns out that it is not easy at all. So I’ve written down my findings in this blog entry.
I think that reproducible builds are of fundamental importance to the free software community and to the wider community; the trustworthiness of binaries built from source code is a rather neglected topic. We know about tivoization and the reality that code can be open yet unchangeable. What is not sufficiently appreciated is that parties can, quite unchecked, distribute binaries that do not correspond to the alleged source code.
Trust is good, but especially in a post-Snowden world, control is better. Can a person rely on binaries or should we all compile from source? I hope to raise awareness about the need for a reproducible way to create binaries from source code.
Free software means users have the four essential freedoms. Freedom 1 is the freedom to study how the program works and change it so it does your computing as you wish. It also means that the program does not do what you do not want it to do. Instead of having to trust the supplier of the software, you can check that the software works as advertised and does not contain e.g. spyware. Access to the source code is a precondition for this freedom.
Many software packages are distributed in binary form and come with a license that makes the right to the source code explicit. For example the GNU GPL v2.0 says:
For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable.
A license that promises access to the source code is one thing, but an interesting question is: is the published source code the same source code that was used to create the executable? The straightforward way to find this out is to compile the code and check that the result is the same. Unfortunately, the result of compiling the source code depends on many things besides the source code and build scripts such as which compiler was used. No free software license requires that this information is made available and so it would seem that it is a challenge to confirm if the given source code corresponds to the executable.
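The check itself, compiling and seeing whether the result is the same, comes down to comparing two files bit for bit. A minimal sketch in Python, assuming both files are available locally; the function names `sha256_of` and `same_build` are made up for this illustration:

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large packages do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_build(published_path, rebuilt_path):
    """True only if the two files are bit for bit identical."""
    return sha256_of(published_path) == sha256_of(rebuilt_path)
```

Calling `same_build("published.deb", "rebuilt.deb")` with two hypothetical file names gives a yes or no answer. It does not say where the files differ, which is the harder part.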
Collecting software packages that form a working operating system is one of the services of a distribution. Another service that most provide is compiling that software into executables and shipping those in convenient packages. Most distributions ship two types of packages: source packages and binary packages. A distribution is a complete system that includes all the tools to compile source code. Those tools go beyond the tools that are used in the build scripts from the upstream developer. Distributions contain tools to create binary packages from source packages. Does this mean that it is less of a challenge to confirm if the source code corresponds to the executable?
I have built a binary package from a source package for a number of distributions (Debian, Fedora, and OpenSUSE) and compared the self-built binary package with the one published by the distribution. All tests were run on fresh, minimal installs of the latest version of each distribution using the tools that are recommended by the distributions. To keep the complexity low, one simple package was chosen: tar. Will the self-built package be exactly the same, totally different or only slightly different?
Debian was installed from a downloaded netinstall image, debian-7.0.0-amd64-netinst.iso, on a VirtualBox machine. The version of tar that comes with Debian is 1.26+dfsg-0.1. According to the instructions, compiling the tar package from source is as simple as running:
apt-get -b source tar
which downloads the source files and compiles them. This results in a file, tar_1.26+dfsg-0.1_amd64.deb. The name of the file is the same as the name of the binary package published by Debian, but the size of the file is different from the size of the published package: 984376 vs 984768. Running the command again in a different empty directory gives yet another size for the deb file. The command apt-get -b source tar is clearly not deterministic.
To investigate what the differences between the packages are, they are unpacked:
ar vx tar_1.26+dfsg-0.1_amd64.deb
tar xf control.tar.gz
tar xf data.tar.gz
and the files in the self-built and the published package are compared. The files /bin/tar, /usr/sbin/rmt-tar and /usr/share/man/man1/tar.1.gz differ. The manual file is the easiest to investigate. It turns out that it has a header with the date and time at which it was created:
generated by script on Fri May 24 15:52:20 2013
The two /bin/tar files differ in 20 consecutive bytes. The two /usr/sbin/rmt-tar files also differ in only 20 consecutive bytes. This can be investigated with readelf -a bin/tar: the difference between the two executables is in the “.note.gnu.build-id” ELF note section. This section can be set with the --build-id argument of ld, which defaults to taking the sha1 sum of the linked object files. A SHA-1 digest is 20 bytes long, which matches the size of the difference. The build id is derived from the object files. In the Debian build, the object files are created with debug information which is later removed from the executable by stripping. The debug information contains the build path and it is this build path which is the reason for the different build id. If tar is compiled repeatedly in the same directory, the binary will be identical. A tar executable compiled in a different directory will have a different build id. The build id could be left out with ld --build-id=none.
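Finding out that two executables differ in a single run of 20 bytes is a matter of walking both files in parallel. A rough sketch in Python, assuming the two files have the same length (as the two Debian executables did); `differing_ranges` is a name made up for this illustration:

```python
def differing_ranges(first, second):
    """Return (offset, length) for every run of consecutive differing bytes.

    Assumes both inputs are bytes objects of equal length.
    """
    if len(first) != len(second):
        raise ValueError("inputs must have the same length")
    ranges = []
    start = None
    for offset, (left, right) in enumerate(zip(first, second)):
        if left != right:
            if start is None:
                start = offset  # a new run of differences begins here
        elif start is not None:
            ranges.append((start, offset - start))
            start = None
    if start is not None:
        # The last run extends to the end of the data.
        ranges.append((start, len(first) - start))
    return ranges
```

Two unrelated SHA-1 values can happen to share a byte at the same position, so a changed build id may also show up as two or more shorter runs that lie close together.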
Apart from these two differences, there is another common difference from the published binary package. A deb archive is an ar archive that contains two tar archives: control.tar.gz and data.tar.gz. The ar archive and the two tar archives contain timestamps. This can be seen with ar tv tar_1.26+dfsg-0.1_amd64.deb and tar tvf control.tar.gz. If a build should be repeatable, the time that is stored should be a time that is taken from the provided files and not from the computer clock. The timestamps, user and group and file mode information can be left out of archives. ar can be run in a deterministic mode: ar qD archive-file file... and tar takes arguments for explicitly setting this information (--mtime, --owner, --group, and --mode).
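The same idea can be sketched with Python's tarfile and gzip modules instead of the tar command line: every piece of metadata is set explicitly and nothing is taken from the clock or the current user. This is only an illustration, not how dpkg builds its archives; the function name `deterministic_tar_gz` and the choice of root as owner and 0644 as mode are assumptions made for the example.

```python
import gzip
import io
import tarfile


def deterministic_tar_gz(files, mtime=0):
    """Pack `files` (a dict of archive path -> bytes) into a .tar.gz.

    All metadata is fixed, so the same input always gives the same bytes.
    """
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.GNU_FORMAT) as archive:
        for name in sorted(files):  # fixed member order
            data = files[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime  # like tar --mtime
            info.uid = info.gid = 0  # like tar --owner and --group
            info.uname = info.gname = "root"
            info.mode = 0o644  # like tar --mode
            archive.addfile(info, io.BytesIO(data))
    compressed = io.BytesIO()
    # The gzip header has its own timestamp field, so that is fixed as well.
    with gzip.GzipFile(fileobj=compressed, mode="wb", mtime=mtime) as stream:
        stream.write(raw.getvalue())
    return compressed.getvalue()
```

In a real package the value passed as `mtime` would be taken from the source package, for example the date of the latest changelog entry, so that it is the same no matter when the build runs.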
The binary package that was built from a Debian source package was not identical to the published binary package, but the differences are limited to timestamps and the build id in the executables. Unless the function of the executable relies on this build-id, the self-built tar executable functions in the same way as the published version.
Fedora 18 was installed from a net install. The option ‘minimal system’ was chosen as the software selection option. This creates a system with 236 packages that take up 625MB. The tar binary and source RPMs were downloaded from the fedora repository and built with:
mock rebuild tar-1.26-9.fc18.src.rpm
and unpacked with
rpm2cpio tar-1.26-9.fc18.x86_64.rpm | cpio -idmv
The files that differ are /bin/tar and four info files with paths like /usr/share/info/tar.info.gz. Interestingly, the files /usr/share/man/man1/tar.1.gz are the same in published and compiled packages. This is because the man file is taken from the source package: Fedora has modified the man page and ships the generated version in the source rpm. The man pages for tar and gtar are the same file.
The info files give a large diff. This is due to the presence of a timestamp and a lot of generated cross-references. The executables are also very different. The self-compiled tar is 8 bytes larger. The build id is different and there are differences scattered throughout the file. Many of these are just single bytes and are probably different offsets to functions. This idea is consistent with the difference in the output of readelf -a tar. All the function names are there in the same order, but many numbers are different.
Just like ar and tar files, rpm files contain timestamps, which can be seen with rpm -qvlp tar-1.26-9.fc18.x86_64.rpm. The compiled files have the time of the build as their timestamp.
The Fedora package showed more differences from the published package than the Debian package did and, unlike the Debian case, not all of the differences can be explained. The executable built from the published sources is so different from the published executable that it is not easy to know if it will function the same way.
A minimal system with OpenSUSE 12.3 was set up from a network installation iso. In OpenSUSE it is also easy to create a binary package from a source package:
rpmbuild --rebuild tar-1.26-14.1.1.src.rpm
Only two files differed: /bin/tar and /usr/share/man/man1/tar.1.gz. The man files differed, as in the deb file, due to their timestamp. The tar binary contained a surprise: the self-built file was 5 times as large as the published version. The debug information was not stripped. Stripping the file completely reduced the difference in size to 48 bytes. The build id was different and the published version contained a .gnu_debuglink entry whereas the self-built file contained a .comment section. Apart from the header and the last 2k bytes the files were identical.
A cherished characteristic of computers is their deterministic behaviour: software gives the same result for the same input. This makes it possible, in theory, to build binary packages from source packages that are bit for bit identical to the published binary packages. In practice, however, building a binary package results in a different file each time. This is mostly due to timestamps stored in the builds. In packages built on OpenSUSE and Fedora, differences are seen that are harder to explain. They may be due to any number of differences in the build environment. If these can be eliminated, the builds will be more predictable. For that to be possible, binary packages would need to contain a description of the environment in which they were built.
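What such a description could contain is an open question. As a purely hypothetical sketch, the Python below records a few of the things that turned up in the tests above: the build path, the versions of the tools and some environment variables. The field names, the list of tools and the JSON layout are invented for this example and do not correspond to any format used by a distribution.

```python
import json
import os
import platform
import subprocess


def tool_version(tool):
    """Return the first line of `tool --version`, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [tool, "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None  # the tool is not installed or not executable
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


def describe_build_environment(tools=("gcc", "ld"), env_keys=("LANG", "LC_ALL", "TZ")):
    """Collect a hypothetical description of the build environment."""
    return {
        "build_path": os.getcwd(),  # the build path ended up in the debug information
        "machine": platform.machine(),
        "system": platform.system(),
        "tools": {tool: tool_version(tool) for tool in tools},
        "environment": {key: os.environ.get(key) for key in env_keys},
    }


def to_manifest(description):
    """Serialise the description with sorted keys so the output is stable."""
    return json.dumps(description, indent=2, sort_keys=True)
```

Someone who wants to rebuild the package could compare such a manifest with their own environment before starting the build.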
Compiling software is resource intensive and it is valuable to have someone compile software for you. Unless it is possible to verify that compiled software corresponds to the source code it claims to correspond to, one has to trust the service that compiles the software. Based on a test with a simple package, tar, there is hope that with relatively minor changes to the build tools it is possible to make bit perfect builds.
Comments
openSUSE
One advantage the Open Build Service has is that it documents the build environment and allows you to recreate it relatively precisely (not exactly though... it mostly lacks version control, not an issue here). OBS handles building the debuginfo packages so I'm guessing this is why rpmbuild didn't strip the binaries.
So you could use osc instead of rpmbuild.
By eean at Thu, 06/20/2013 - 15:14
openSUSE
On slashdot, an anonymous coward also noted that osc is the tool of choice and even claimed to have gotten identical builds. I have not yet verified that, but it sounds great. The mentioned package, build-compare, has some scripts that are meant to compare packages whilst ignoring variable parts of the build.
By Jos van den Oever at Thu, 06/20/2013 - 23:43
yeap
It's true that osc (the Open Build Service command line client, also available for most other Linux distributions) is the most-used and standard build tool for RPMs on openSUSE. It's not mandatory or anything, of course... It builds using a chroot, I believe, so it does indeed probably lead to very close or identical results. It is of course limited to building for the architecture of the system it is on; OBS does not have that problem, as it uses clean VMs each time it builds. That makes builds even more reliable and easy to perfectly reproduce.
By jospoortvliet at Thu, 06/27/2013 - 08:45