Luj 2025-02-11 10:16:26 +01:00
commit b03965b764
Signed by: luj
GPG key ID: 6FC74C847011FD83
52 changed files with 3576 additions and 0 deletions

index.qmd Normal file

@ -0,0 +1,593 @@
---
title: Reproducibility in functional package management
date: 2025-2-11
date-format: long
lightbox: true
logo: telecom.png
margin-top: "0px"
author:
  - name:
      given: Julien
      family: Malka
    url: https://luj.fr
    email: julien.malka@telecom-paris.fr
    orcid: 0009-0008-9845-6300
    roles:
      - conceptualization
      - investigation
      - writing original draft
    affiliations:
      - id: telecom
        name: Télécom Paris, Institut Polytechnique de Paris
fig-align: center
code-overflow: wrap
code-line-numbers: false
css: styles.css
format:
  metropolis-beamer-revealjs:
    theme: slide.scss
    toc: false
    toc-title: Plan
    toc-depth: 2
    slide-level: 2
    slide-number: true
---
## Research topics
**Main topics:** Cybersecurity & Software engineering
*How can one trust that the software installed on one's system is not malicious?*
- What if we assume that the software is **open source**?
## Software supply chain
::: {.r-fit-text}
**Definition:** All the **components**, **tools** and **processes** used to **produce**, **compile** and **distribute** software.
- A growing number of attacks target the software supply chain rather than the software itself, for example *SolarWinds* (2020).
- Regulators are pushing for security standards covering the software supply chain (*US Executive Order on Improving the Nation's Cybersecurity*, *EU Cyber Resilience Act*).
:::
::: {#fig-eval-build}
![](./software_supply_chain.png){.r-stretch}
Software supply chain overview ([slsa.dev](https://slsa.dev))
:::
## Main PhD research question
How can we increase trust in the open source software supply chain with **functional package managers** and **reproducible builds**?
## Functional package managers {auto-animate="true"}
A new software deployment model (of which **Nix** was the first example).
```nix
{ stdenv, lib, fetchFromGitHub }:
```
## Functional package managers {auto-animate="true"}
A new software deployment model (of which **Nix** was the first example).
```nix
{ stdenv, lib, fetchFromGitHub }:
stdenv.mkDerivation rec {
  version = "1.3.7";
  pname = "htpdate";
  src = fetchFromGitHub {
    owner = "twekkel";
    repo = pname;
    rev = "v${version}";
    sha256 = "sha256-X7r95Uc4oGB0eVum5D7pC4tebZIyyz73g6Q/D0cjuFM=";
  };
```
## Functional package managers {auto-animate="true"}
A new software deployment model (of which **Nix** was the first example).
```nix
{ stdenv, lib, fetchFromGitHub }:
stdenv.mkDerivation rec {
  version = "1.3.7";
  pname = "htpdate";
  src = fetchFromGitHub {
    owner = "twekkel";
    repo = pname;
    rev = "v${version}";
    sha256 = "sha256-X7r95Uc4oGB0eVum5D7pC4tebZIyyz73g6Q/D0cjuFM=";
  };
  makeFlags = [
    "prefix=$(out)"
  ];
```
## Functional package managers {auto-animate="true"}
::: {.r-fit-text}
A new software deployment model (of which **Nix** was the first example).
```nix
{ stdenv, lib, fetchFromGitHub }:
stdenv.mkDerivation rec {
  version = "1.3.7";
  pname = "htpdate";
  src = fetchFromGitHub {
    owner = "twekkel";
    repo = pname;
    rev = "v${version}";
    sha256 = "sha256-X7r95Uc4oGB0eVum5D7pC4tebZIyyz73g6Q/D0cjuFM=";
  };
  makeFlags = [
    "prefix=$(out)"
  ];
  meta = with lib; {
    description = "Utility to fetch time and set the system clock over HTTP";
    platforms = platforms.linux;
    license = licenses.gpl2Plus;
    maintainers = with maintainers; [ julienmalka ];
  };
}
```
:::
## Evaluation->Build pipeline
::: {#fig-eval-build}
![](./eval-build.png){.r-stretch}
Eval-build pipeline
:::
## Functional package managers for SSC security
Functional package managers also have properties that are valuable for software supply chain security:
- Builds from source;
- Sandboxed compilation.
## Functional package managers for SSC security
- Installed packages form a static graph structure (a Merkle tree) that can be analysed to find known vulnerabilities in the dependencies.
::: {#fig-eval-build}
![](./graph.png){.r-stretch}
Example of a package dependency graph
:::
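A minimal sketch of why such a graph lends itself to analysis, assuming a deliberately simplified identifier scheme. The `mkNode` helper and the package strings are hypothetical, and this is not how Nix actually computes store paths. Each node's identifier hashes its own recipe together with the identifiers of its dependencies, so a change deep in the graph changes the identifier of everything that depends on it.

```nix
let
  # Hypothetical helper: a node's identifier is the hash of its recipe
  # and of the identifiers of all its dependencies.
  mkNode = recipe: deps:
    builtins.hashString "sha256"
      (recipe + ":" + builtins.concatStringsSep "," deps);

  # Two illustrative versions of the same dependency.
  libcOld = mkNode "libc-old" [ ];
  libcNew = mkNode "libc-new" [ ];

  # The same application recipe, built against a given libc.
  appWith = libc: mkNode "htpdate-1.3.7" [ libc ];
in
{
  # Same recipe and same dependencies give the same identifier.
  sameInputsSameId = appWith libcOld == appWith libcOld;
  # Changing a dependency changes the identifier of the dependent package.
  depChangePropagates = appWith libcOld != appWith libcNew;
}
```

Evaluating this expression with `nix-instantiate --eval --strict` yields `true` for both attributes.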
## Binary distribution
- Compiling every package a user wants to install on their own machine is not always practical, hence the need for binary caches;
- **But** binary caches make us lose some of the interesting security properties of functional package managers.
## Reproducible builds {.lol}
![](rb.png){height='4em' fig-align="center"}
A build is **reproducible** if given the same source code, build environment and build instructions, any party can recreate bit-by-bit identical copies of all specified artifacts.
## Why is build reproducibility important?
::: {#fig-rb}
![](rb-verif.png){.r-stretch fig-align="center"}
Leveraging reproducible builds to increase trust in distributed artifacts.
:::
## Research questions {auto-animate="true"}
**How reproducible is software in the functional package management model?**
- Is Nix evaluation reproducible? Can we reproduce *build environments* of Nix packages?
- Does functional package management enable **bitwise build reproducibility**?
## Reproducibility of build environments {auto-animate="true"}
- Is Nix evaluation reproducible? Can we reproduce *build environments* of Nix packages?
:::: {.columns}
::: {.column width="50%"}
![](./icse.jpeg){height='14em'}
:::
::: {.column width="50%"}
**"Reproducibility of Build Environments through Space and Time"**, ICSE 2024 (New Ideas and Emerging Results track), *J. Malka, S. Zacchiroli, T. Zimmermann*.
:::
::::
## Reproducibility of build environments
We say that two build environments are identical if they contain the **exact same set of executables, up to their specific versions**.
![](./env.png)
## Reproducibility in Space
::: {#fig-rb}
![](./space.png)
Reproducibility of build environments in Space
:::
## Reproducibility in Time
::: {#fig-rb}
![](./time.png){.r-stretch}
Reproducibility of build environments in Time
:::
## Research questions
::: {.incremental}
- **RQ1:** Are space and time reproducibility of build environments achievable with Nix?
- **RQ2:** Do they allow rebuilding past software versions?
:::
## Experimental protocol
::: {.incremental}
1) Sample 200 revisions of `nixpkgs`, the Nix package repository, spanning 2017 to 2023;
2) For each sampled revision, perform the **evaluation** of each package and compare with the historical truth (historical CI results);
3) For the *oldest revision* of our samples, perform the **build** of each package and compare with the historical truth.
:::
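Step 2 can be sketched as follows, assuming that comparing derivation paths is an acceptable proxy for comparing build environments. The file name `pin.nix`, the `rev` and `attr` arguments and the default attribute `hello` are illustrative choices, not part of the protocol above.

```nix
# pin.nix (hypothetical file name)
# rev:  commit hash of the nixpkgs revision to evaluate (supplied by the caller)
# attr: name of the package attribute to evaluate (illustrative default)
{ rev, attr ? "hello" }:
let
  # Fetch the package set exactly as it was at the given revision.
  nixpkgs = builtins.fetchTarball
    "https://github.com/NixOS/nixpkgs/archive/${rev}.tar.gz";
  pkgs = import nixpkgs { };
in
# Evaluation only: returns the path of the derivation, without building it.
pkgs.${attr}.drvPath
```

Called as `nix-instantiate --eval pin.nix --argstr rev <commit>`, this prints the derivation path computed today for that historical revision, which can then be compared with the path recorded by CI at the time.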
## Results {auto-animate="true"}
**RQ1:** *Reproducibility of build environments*
- We were able to **reproduce the build environment of 99.99% of the packages** we tested;
- The discrepancies we found were due to the (unfortunate) use of some of Nix's impure builtins.
## Results {auto-animate="true"}
**RQ1:** *Reproducibility of build environments*
- We were able to **reproduce the build environment of 99.99% of the packages** we tested;
- The discrepancies we found were due to the (unfortunate) use of some of Nix's impure builtins.
**RQ2:** *Rebuilding past software versions*
- We were able to **successfully build 14233 out of the 14242 (99.94%) packages that were built successfully by CI in 2017**;
- The discrepancies we found were due to leaks in the Nix build sandbox, which we wish to investigate further.
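The RQ2 percentage follows directly from the two counts above. Spelled out:

$$
\frac{14233}{14242} \approx 99.94\%, \qquad 14242 - 14233 = 9 \text{ packages that could not be rebuilt}
$$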
## Research questions {auto-animate="true"}
**How reproducible is software in the functional package management model?**
- Is Nix evaluation reproducible? Can we reproduce *build environments* of Nix packages?
- Does functional package management enable **bitwise build reproducibility**?
## Reproducibility of build environments {auto-animate="true"}
- Does functional package management enable **bitwise build reproducibility**?
:::: {.columns}
::: {.column width="50%"}
![](./msr.jpeg){height='14em'}
:::
::: {.column width="50%"}
**"Does Functional Package Management Enable Reproducible Builds at Scale? Yes."**, MSR 2025, *J. Malka, S. Zacchiroli, T. Zimmermann*.
:::
::::
## Nix **does not** guarantee reproducible builds!
```nix
let
  pkgs = import <nixpkgs> { };
in
pkgs.runCommand "random" { } ''
  echo $RANDOM > $out
''
```
```{mermaid}
flowchart TD
A[nix-build]
A --run 1--> B[12505]
A --run 2--> C[29217]
```
&rarr; Will produce an artifact with a different number at each run!
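For contrast, a sketch of a deterministic counterpart, under the assumption that nothing else in the build sandbox varies between runs. The name `not-random` and the constant are arbitrary.

```nix
let
  pkgs = import <nixpkgs> { };
in
pkgs.runCommand "not-random" { } ''
  # No entropy source here: the output depends only on the build inputs.
  echo 42 > $out
''
```

Rebuilding this derivation with `nix-build --check` should report no difference between the two builds, whereas the `$RANDOM` version above is expected to fail that check.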
## So how reproducible are the packages of the Nix distribution?
::: {#fig-monitoring}
![](monitoring.png)
[https://reproducible.nixos.org](https://reproducible.nixos.org)
:::
## So how reproducible are the packages of the Nix distribution?
::: {#fig-diffoscope}
![](diffoscope.png)
Example of diffoscope output.
:::
**Problem:**
- It only monitors a small subset of `nixpkgs` (~1300 packages for the GNOME image runtime closure)
## Research questions
- **RQ1:** What is the evolution of bitwise reproducible packages in `nixpkgs` between 2017 and 2023?
- **RQ2:** What are the unreproducible packages?
- **RQ3:** Why are packages unreproducible?
- **RQ4:** How are unreproducibilities fixed?
## Research methodology
::: {#fig-methodology}
![](pipelinev2.png){fig-align="center"}
Pipeline summarizing our research methodology.
:::
## A few figures
::::{.columns}
::: {.column width="60%"}
::: {#fig-diffoscope}
![](./ecosystems.png)
Evolution of the size of the nine most popular software ecosystems in `nixpkgs`.
:::
:::
::: {.column width="40%"}
- 709 816 packages built;
- 14 296 total build hours;
- 548 390 tracked by name, corresponding to <span style="white-space: nowrap;">59 103</span> unique packages associated with a specific software ecosystem.
:::
::::
## RQ1: Evolution of bitwise reproducible packages
::: {#fig-overall}
![](reproducibility-overall-relative.png){.r-stretch fig-align="center"}
Proportion of reproducible, rebuildable and non-rebuildable packages over time.
:::
## RQ1: Evolution of bitwise reproducible packages
::: {#fig-overall2}
![](reproducibility-overall-absolute.png){.r-stretch fig-align="center"}
Absolute numbers of reproducible, rebuildable and non-rebuildable packages over time.
:::
## RQ1: Evolution of bitwise reproducible packages
::: {#fig-overall2-reg}
![](reproducibility-overall-absolute-reg.png){.r-stretch fig-align="center"}
Reproducibility regression around June 2020.
:::
## RQ2: What are the unreproducible packages?
::: {#fig-diff}
![](./reproducibility-ecosystems.png){.r-stretch fig-align="center"}
Proportion of reproducible packages belonging to the three most popular ecosystems and the base namespace of nixpkgs.
:::
## RQ3: Why are packages unreproducible?
::: {#fig-diff}
![](evolution-heuristics.png){.r-stretch fig-align="center"}
Evolution of the number of packages that are matched by each of our heuristics, over time.
:::
## RQ4: How are unreproducibilities fixed?
- Sampled 100 fixes from our dataset of reproducibility fixes (obtained by bisection of the `nixpkgs` repository):
&rarr; **In 93 instances, "reproducibility" was not mentioned in the pull request or commit message.**
&rarr; **In 75 cases the fix was merely a package update.**
<br>
- Studied the 15 most impactful fixes (from 3052 to 27 packages fixed):
&rarr; **In 8/15 instances, the reproducibility issue being fixed is documented.**
## Conclusion
- Bitwise reproducibility in `nixpkgs` as of 2023: 91%;
- This justifies investing resources/conducting research on distributed cache solutions relying on build reproducibility.
## Thank you for your attention!
<h3><ins>My socials:</ins></h3>
<div style="margin-bottom: 40px;"></div>
{{< bi mastodon >}} luj@chaos.social
{{< bi envelope >}} julien.malka@telecom-paris.fr
:::: {.columns}
::: {.column width="50%"}
![](qr-icse.png){height='11em'}
:::
::: {.column width="50%"}
![](qr-msr.png){height='11em'}
:::
::::
## RQ1: Evolution of bitwise reproducible packages
::: {#fig-overall2-reg}
![](./sankey-average.png){.r-stretch fig-align="center"}
Sankey graph of the average flow of packages between two revisions, excluding the revision from June 2020, considered as an outlier.
:::