Scientific Workflows on the Eve of the Butlerian Jihad

A practical manifesto for scientific software in the era of vibecoding: let’s write beautiful code while we still can
Author

W. H. W. Thompson

Published

February 26, 2026

Why This Post Exists

In 1959, Richard Feynman gave a talk called “There’s Plenty of Room at the Bottom” [1]. It was incredibly prescient. Six years later, Gordon Moore observed that transistor density was doubling roughly every year, a pace he later revised to every two years [2], and for half a century that observation held. Hardware kept getting faster, and if your code was slow you could just wait for the next generation of chips.

That era is over. Leiserson et al. [3] documented this shift in their aptly titled paper “There’s Plenty of Room at the Top.” Hardware gains are flattening. The performance improvements now come from software: better algorithms, better compilers, better use of parallelism. Python, for all its virtues, is slow: in one of Leiserson et al.’s benchmarks, a naive Python implementation of matrix multiplication ran roughly 60,000 times slower than a fully optimized, parallel C implementation. Julia [4] made a serious attempt at solving the two-language problem, where you prototype in Python and rewrite in C for performance. But in practice, the bottleneck for most computational researchers is not the core algorithm. It is the tooling. The plumbing. The workflow that surrounds the science.

But good scientific software development practices are not taught [5]. Research codebases are full of fragile Jupyter notebooks that require a specific, undocumented run order to produce correct results. They are full of poorly documented pipelines of CSV and JSON files that carry no record of their own provenance. This is particularly true when it comes to high-performance computing. Many working computational researchers could benefit from HPC platforms, and most universities provide a cluster, but the barrier to entry is high due to arcane tooling. Software like SLURM is hostile to newcomers, especially those not already comfortable running commands on Unix systems. I have seen codebases collapse beneath their own weight. I have seen projects scaled down because of computational complexity that could have been avoided.

The “world if” meme is funny but the underlying point is serious. Enormous amounts of scientific compute have been consumed not on the science but on the scaffolding. Re-running lost experiments, porting code that was never meant to be portable, debugging race conditions in ad-hoc parallel scripts. Without SLURM we would have cured cancer by now. (Probably not. But the sentiment captures real frustration.)

The fix is not to heroically tolerate the friction. It is to reduce it structurally.

The good news is the calculus has changed. There has long been a trade-off between investing time in learning a tool that will improve your workflow and the opportunity cost of that learning curve. LLMs have shifted this exploration-exploitation trade-off dramatically. Where learning a new framework or workflow once required weeks or months to become proficient, much of the scaffolding work can now be offloaded to an LLM. This allows scientists to adopt substantially better software development practices without a huge upfront time investment. Those practices, in turn, make it easier to iterate and more reliable to quickly spin up new projects.

In Dune, the Butlerian Jihad is the moment humanity outlaws thinking machines after they have caused catastrophic harm. We are not there yet. But we are, I think, on its eve. LLMs are currently good enough to dramatically lower the cost of adopting better practices: scaffolding your pipelines, generating your schemas, writing your tests. They are not yet good enough to make those practices irrelevant. The scientist who builds a clean, reproducible pipeline now gets the full compounding benefit of LLM assistance without sacrificing legibility or control. That window may not stay open. Write beautiful code while you still can.

I originally made this as a talk for students in my department, and I included a companion git repository with concrete examples to show how all of these tools are used in practice. In this post, I have outlined each tool I use in my workflow and linked to the relevant places in the codebase where each one is used.


The Trade-Off

There is, in my experience, a damned-if-you-do, damned-if-you-don’t calculus when it comes to engineering research code. If you spend time early on building a system that is modular and flexible, that accommodates future features, inevitably none of that modularity, none of that flexibility, will actually be used in your project. Inevitably, the things you need to change are the things you did not think you would need to change. On the other hand, you can write low-quality disposable code and iterate quickly, until the codebase becomes unmaintainable. It does not feel like there is a good way out.

  • If you take the time to build software the right way, you worry you will never reuse it.
  • If you write sloppy code, it will eventually bite you.
  • Damned if you do. Damned if you don’t.

Most computational researchers have a similar experience. Most of the code they write is not the core algorithm, the part that really demands investment and thought. Most of it is boilerplate: scripts to perform I/O, save results, specify sweeps, load metadata, launch jobs, collect and analyze outputs. In an ideal world, all of this boilerplate could be automated away. Tools exist to automate much of this, but they are not taught or widely used in many fields.

What I Imagined

The Reality

Is It Worth The Time? (XKCD 1205)

If a weekly task costs you an hour and automating it takes a full week of work, the automation pays for itself in under a year. If a daily task costs an hour and automating it takes a day, you break even inside two weeks. Most scientific pipelines fall firmly in the “worth automating” zone.
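The break-even arithmetic is simple enough to sketch directly. A minimal illustration (the function name and numbers are mine, chosen to match the examples above, not taken from the XKCD table):

```python
def break_even_runs(build_hours: float, hours_saved_per_run: float) -> float:
    """How many runs before time invested in automation is paid back."""
    return build_hours / hours_saved_per_run

# A daily task that costs 1 hour, automated in one 8-hour day:
print(break_even_runs(build_hours=8, hours_saved_per_run=1.0))   # pays off in 8 runs

# A weekly task that costs 1 hour, automated in a 40-hour week:
print(break_even_runs(build_hours=40, hours_saved_per_run=1.0))  # pays off in 40 runs
```

Anything run hundreds of times over a project's life, which describes most sweeps and pipelines, clears this bar easily.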

The stakes are not small. The pace of scientific research is measurably slowed by poor workflows. If every computational researcher had an efficient, reproducible workflow, science would move faster and the scope of projects could be much larger.


The Workflow

This is the key insight: good development practices (type annotations, modular code, well-defined schemas) make LLMs dramatically better at writing reliable code for your project. And LLMs lower the barrier to adopting those practices in the first place. It is a flywheel. The more structure you give an LLM to work with, the better its output. The better its output, the more time you save, and the more you can invest in structure.

Type annotations are a simple example. They are cheap to write, but they let an LLM generate substantially more reliable code, specifically functions that respect your actual data structures instead of guessing at them. Better output means less time debugging generated code, which means more time adding structure, which compounds.

Most scientific projects follow a very similar workflow. In this post, I will assume that your science follows something like the following.

The Process

  • You have data or simulation parameters.
  • You have a model: ABM, statistical, or ML.
  • You want to run sweeps over parameter space with fixed logic.
  • You want to collect and visualize results without manual copy-paste.
  • You want to turn plots into a publication or shareable artifact.

The Goal

Automate the plumbing so you can focus on the science. Every hour spent hand-managing results is an hour not spent thinking about what those results mean.

To make these ideas concrete, we’ll use a simple NLP pipeline as a reference throughout: fine-tune a BERT model [6] on AG News (4 classes of news headlines), sweep over hyperparameters, and collect results automatically. The full working code is in the companion repo. The steps are:

  1. Fine-tune BERT on AG News (4 classes: World, Sports, Business, Sci/Tech).
  2. Train a lightweight adapter/MLP on extracted embeddings.
  3. Run parameter sweeps over adapter dimension and learning rate.
  4. Compare results and publish figures automatically.

Sample from the AG News dataset: 4 categories of news article headlines.

Each step in this pipeline maps to a tool. The stack is not the only valid one, but it is coherent: every piece talks to every other piece, and the whole thing can be run with a single command. The table below is opinionated but the reasoning is practical. Each tool was chosen because it reduces a specific class of friction without adding much complexity in return.

Problem                        Tooling Choice
Reproducible environment       uv + lockfiles
Data validation                Pydantic schemas
Hardware acceleration          PyTorch / JAX
Code reliability               pytest + CI
Workflow automation            Snakemake
Experiment tracking            Weights & Biases
Publication and dashboards     Quarto

The Foundation

The tooling splits into two layers. The Foundation covers what your code is built on: how you structure logic, manage environments, validate data, and run computation. The Plumbing, in the next section, is what connects the pieces: the glue that turns individual scripts into a coordinated, reproducible pipeline. One prerequisite sits under both: if you are not already using git, start there before anything else. None of what follows works well without version control.

1. Keep Core Logic in Python Source

One of the simplest improvements you can make to your workflow, if you are not already doing it, is to keep all of the core logic of your codebase in source files, separate from plotting and analysis. Ideally you would not use Jupyter notebooks at all, at least not for anything serious; but if you need to be weaned off them gradually, keep the important analytical code in a separate directory and import it into your notebooks and scripts.

# src/core/model.py — importable, testable, versionable
def simulate_trajectories(
    n_agents: int,
    time_steps: int,
    params: dict[str, float],
) -> list:
    """Simulates N trajectories over T steps."""
    ...

Another easy change: type annotations. Python is a dynamically typed language, but recent releases added optional annotations describing the types of function parameters and return values. These are incredibly handy. They are not enforced at runtime by default, but static checkers such as mypy or pyright can be configured to report violations as warnings or errors, and runtime validators (like Pydantic, below) can enforce them at program boundaries. There is very little reason not to use them, and they make your code much easier to debug.
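A small illustrative example (the function is hypothetical, not from the companion repo): with the signature annotated, a checker like mypy has enough information to flag a bad call before anything runs.

```python
def mean_accuracy(scores: list[float]) -> float:
    """Average of a list of accuracy scores."""
    return sum(scores) / len(scores)

print(mean_accuracy([0.91, 0.88, 0.95]))  # fine

# mean_accuracy("0.91,0.88")
# mypy would reject this call: argument has incompatible type "str",
# expected "list[float]" -- caught at check time, not at 2am in a SLURM log.
```

The annotations cost one line of typing and document the function for both humans and LLMs.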

In the BERT example, all training logic lives in src/core/trainer.py and the adapter architecture in src/core/model.py, both importable, testable, and completely separate from sweep configuration or visualization notebooks.

2. Reproducible Environments with uv

Let’s face it: Python was not designed to be a good language for scientific computing, and it was not designed to be portable. We have all gone through the agony of trying to get someone else’s code, maybe from a collaborator, maybe from GitHub, to run locally on our machine. pip records requirements in a requirements.txt file and conda in an environment.yml, but neither reliably reproduces every aspect of the environment. uv is a modern package manager that wraps the familiar pip syntax, so you do not have to learn new commands, but works much better. It is much faster, and it writes an exact lockfile of everything in your environment: pinned versions for your dependencies and their dependencies in turn. It also hard-links packages from a global cache, which prevents you from having 10 copies of PyTorch for different projects on your machine.

# pyproject.toml
[project]
name = "scientific_dev"
requires-python = ">=3.12"
dependencies = [
    "jax>=0.9.0",
    "snakemake>=8.4.0",
    "pydantic>=2.12.5",
]

The BERT example’s full environment, with specific versions of PyTorch, Transformers, and Pydantic pinned, reproduces exactly with uv sync. No more “it worked on my machine.”

3. Hardware Acceleration with JAX / PyTorch

The rule of thumb: if your computation can be expressed as vectorized operations over arrays, it will run faster on a GPU. The jump in throughput for embarrassingly parallel workloads (parameter sweeps, ensemble simulations, batch inference) is typically one to two orders of magnitude.

JAX is particularly well-suited to scientific computing: it supports automatic differentiation through arbitrary programs (including ODE solvers and agent-based models), and vmap turns a function over one data point into a function over a batch with no manual batching logic:

import jax.numpy as jnp
from jax import vmap

def compute_score(x):
    return jnp.sum(jnp.sin(x) ** 2)

large_input_matrix = jnp.ones((1000, 128))  # 1000 samples, 128 features each

# Run over all rows in parallel on GPU
batch_score = vmap(compute_score)
scores = batch_score(large_input_matrix)

With JAX you can often just swap np for jnp and get immediate hardware acceleration on the same code. PyTorch is the default for neural networks. For custom simulations or differentiable physics, JAX is usually the better fit.
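A minimal sketch of that drop-in swap, assuming JAX is installed (the `normalize` function is mine, for illustration): the same array code runs under both libraries.

```python
import numpy as np
import jax.numpy as jnp

def normalize(x):
    """Zero-mean, unit-variance rescaling; agnostic to the array library."""
    return (x - x.mean()) / x.std()

a_np = normalize(np.arange(6.0))    # NumPy, on CPU
a_jax = normalize(jnp.arange(6.0))  # JAX, dispatched to GPU/TPU if one is available
```

Code written this way can be developed and tested on a laptop with NumPy, then accelerated on the cluster by changing a single import.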

4. Validation with Pydantic

Python’s dynamic typing is a liability in large codebases. A wrong-type value passes silently through function calls until it causes a failure three layers down the stack, by which point the original error is hard to trace. Pydantic solves this by letting you define data classes with typed fields and custom validators that run at instantiation time. You get an error immediately, at the boundary where bad data entered, not somewhere downstream.

This has a secondary advantage for LLMs. The clearer you are about the structure of your code and the data types, the easier it is for LLMs to write tests, generate configurations, and produce new code while ensuring that it remains reliable and faithful to your codebase.

# src/core/schema.py
from pydantic import BaseModel, Field, field_validator

class NLPModelConfig(BaseModel):
    model_name: str = Field(default="distilbert-base-uncased")
    adapter_dim: int = Field(default=256)

    @field_validator("adapter_dim")
    @classmethod
    def validate_adapter_dim(cls, v: int) -> int:
        if v <= 0:
            raise ValueError("adapter_dim must be positive")
        return v

View Schema →

In the BERT example, NLPModelConfig is what Snakemake passes to each training run: a schema that validates every sweep configuration before it touches the model.
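Used at the boundary, the schema fails fast on bad configs. A self-contained sketch (this redefines a minimal copy of the schema so the snippet runs on its own; the sweep entry is illustrative):

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class NLPModelConfig(BaseModel):  # minimal copy of the schema above
    model_name: str = Field(default="distilbert-base-uncased")
    adapter_dim: int = Field(default=256)

    @field_validator("adapter_dim")
    @classmethod
    def validate_adapter_dim(cls, v: int) -> int:
        if v <= 0:
            raise ValueError("adapter_dim must be positive")
        return v

# A valid sweep entry parses cleanly:
cfg = NLPModelConfig(adapter_dim=128)

# A bad one fails here, at the boundary, not three layers down the stack:
try:
    NLPModelConfig(adapter_dim=-1)
except ValidationError as e:
    print(e)
```

The error message names the offending field and value, which is exactly the context you lose when a bad number propagates silently into training.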


The Plumbing

1. Command Center with just

Often you end up running the same commands over and over. One way to deal with this is to predefine them in a makefile or, better yet, a justfile. You can describe your workflows, rsync commands, and Quarto preview invocations, however complicated they are, and save each one under a single easy-to-remember just command. Again, this is very good for LLMs. Define common operations as just recipes and instruct the LLM to run those. This limits the amount of ad-hoc code and scripts LLMs generate and ensures they use your pipelines and configurations whenever possible.

# Default config
config := "configs/nlp_baseline.yaml"

# Run full pipeline
all p="local" c=config:
    uv run snakemake --workflow-profile ./workflow/profiles/{{p}} --configfile {{c}}

# Run training only
train p="local" c=config:
    uv run snakemake train_all --workflow-profile ./workflow/profiles/{{p}} --configfile {{c}}

# Run evaluation only
test p="local" c=config:
    uv run snakemake test_all --workflow-profile ./workflow/profiles/{{p}} --configfile {{c}}

# Sync results from cluster
sync:
    ./tools/sync_vacc.sh

# Regenerate figures from synced data
plot:
    uv run snakemake plots/fig1_roc_sweep.png -j1

# Unlock snakemake after a crash
unlock:
    uv run snakemake --unlock

# Preview Quarto site locally
preview:
    cd notebooks/quarto && quarto preview . --no-browser

View Justfile →

In the BERT example, just all runs the entire pipeline from data through figures with a single command.

2. Orchestration with Snakemake

One of the most important tools in this post is Snakemake [7]. Snakemake is a workflow orchestration tool that automatically determines what needs to be run based on what already exists. Snakemake rules work backwards from how you might expect: you declare the output files a rule produces, using wildcard patterns, and then the recipe to generate them. Rules chain together: Snakemake finds the rule that generates each dependency of a given output and builds a directed acyclic graph of every job needed. If something has already been run, Snakemake remembers and does not rerun it. If a source file changes, only the outputs that depend on that file are regenerated.

It also integrates natively with SLURM, so no more directories full of batch scripts. It can automatically restart jobs with elevated memory and time requirements if you keep hitting timeout errors or getting jobs canceled. It is also a huge boon for reproducibility. You can specify all of the code in a workflow needed to generate one of the figures in your paper and then simply run a single command. Anyone who downloads your code from GitHub can run the same Snakemake command and the entire workflow will execute and generate the figure.

In the BERT example, step 3 (sweeping over adapter dimensions and learning rates) is a single Snakemake rule that spawns one GPU job per configuration. No bash loops, no manual job tracking, no accidental reruns.

Snakemake infers the full dependency graph from your rules and only re-runs what’s out of date. This DAG was generated automatically from the BERT pipeline rules.
# workflow/rules/train.smk
rule train_bert_classifier:
    output:
        weights = "results/{cfg_name}/{params_path}/model.pt",
        stats   = "results/{cfg_name}/{params_path}/training_stats.parquet"
    resources:
        gpu = 1
    shell:
        "PYTHONPATH=src python scripts/run_training.py "
        "--config configs/{wildcards.cfg_name}.yaml "
        "--output {output.weights}"

View Full Rule →

3. Testing with pytest and CI

I cannot tell you how many times I have made a tiny change to one part of my code and something else totally unrelated breaks. What happened? I barely changed anything and now nothing works. One of the ways around this is by writing unit tests. By setting up a workflow that runs these unit tests every time you push your code, you can validate the inputs and outputs of every function and you know immediately when something breaks.

# tests/test_nlp_logic.py
import pytest

from core.schema import NLPModelConfig

def test_validation():
    with pytest.raises(ValueError):
        NLPModelConfig(adapter_dim=-1)

# .github/workflows/ci.yml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv run pytest tests/

View CI Workflow →

In the BERT example, the CI workflow runs in under two minutes on every push and has caught several silent breakages that would otherwise have reached the paper.

4. Experiment Tracking with Weights & Biases

If you have long-running jobs or you want to summarize results at a glance without having to load a bunch of CSV files into a Jupyter notebook, you can use Weights and Biases. You set up a logger in your code and Weights and Biases logs metrics in real time and produces plots. This means you can check on your SLURM jobs as they go, check on losses over training, or whatever else you might want. In the BERT example, WandB makes the sweep visible in real time: loss curves for each adapter dimension, accessible from any browser while the jobs are still running on the cluster.

# src/core/trainer.py
def init_logger(self):
    wandb.init(
        project="neural-adapter-research",
        config=self.config.model_dump(),
        name=self.config.full_run_name,
    )

wandb.log({
    "train/step_loss": loss.item(),
    "train/learning_rate": self.optimizer.param_groups[0]["lr"],
    "train/global_step": epoch * len(self.train_loader) + step,
})

View Trainer →

5. Storage and Querying with Parquet + DuckDB

One huge issue is dealing with output files once they have been generated. Often you are wrangling piles of CSV or JSON files and parsing metadata out of file names. You could build a heavy-duty database, but that means committing to a schema up front, and accessing the data without the database becomes difficult. Databases are essential at very large scale, but one very powerful alternative is Parquet plus DuckDB. Parquet files are compressed, columnar binary files, and DuckDB can query a whole directory of them in place, like SQL over your filesystem. You write SQL queries that run directly against the Parquet files and return results. This covers a huge number of scientific computing data workflows.

SELECT s.*, m."test/auroc"
FROM read_parquet('results/**/training_stats.parquet') s
LEFT JOIN read_parquet('results/**/test_metrics.parquet') m
  ON s.full_run_name = m.full_run_name

In the BERT example, all training stats and test metrics write to Parquet. The final analysis is a single DuckDB query that joins across every run in the sweep.

View Query Example →


Why Quarto Matters

You are reading this in Quarto right now.

Quarto is a document system that compiles .qmd files (containing Markdown, LaTeX, Python, Julia, and Observable JavaScript) into HTML, PDF, slides, or interactive dashboards from a single source. It replaces LaTeX for papers, Keynote for presentations, and Jupyter for exploratory notebooks, all in one format. The same document that contains your analysis code, figures, and equations also generates your talk slides and your paper draft. You write it once.

The reason it belongs in this stack is not just convenience. It closes the loop between computation and communication. Every figure in this post is generated by code that runs when you render the document. Every interactive visualization is live JavaScript compiled from the same .qmd source. If the underlying data changes, you re-render and everything updates. There is no manual copy-paste step between “running the analysis” and “writing the paper.” They are the same step.

It also supports Observable JS natively, which is what makes the Vicsek simulation in the next section possible. For scientific communication specifically, the ability to embed live, interactive models alongside prose and citations, with no external server, is genuinely new.

## Theoretical Loss
$$
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \text{CrossEntropy}(f_\theta(x_i), y_i)
$$

```python
import seaborn as sns
sns.lineplot(data=df, x='epoch', y='loss')
```

A Live Toy Model

The simulation below is a Vicsek model [8]: agents moving in a plane that align their velocity with their neighbors. It runs entirely in the browser, no server, no backend. It is Observable JavaScript, rendered by Quarto, embedded in this post via quarto render. The physics is incidental to the point: a live, interactive, publishable artifact from the same document that contains the prose above and the citations below.

Turn the noise down and watch the agents self-organize into coherent flocks. Turn it back up and the order dissolves. This is the kind of model that is miserable to iterate on in a fragile notebook: output scattered across files, parameters buried in cells, no clear provenance. With the stack above, it is a Snakemake rule, a Pydantic config, and a Quarto document.
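For reference, the Vicsek update rule is compact enough to sketch in NumPy (the in-browser version is Observable JS; parameters here are illustrative): each agent takes the mean heading of its neighbors within a radius, perturbed by noise, and steps forward at constant speed.

```python
import numpy as np

def vicsek_step(pos, theta, L=10.0, r=1.0, v0=0.3, eta=0.5, rng=None):
    """One Vicsek update on an L x L periodic box.

    pos: (N, 2) positions; theta: (N,) headings in radians.
    eta scales uniform angular noise; eta=0 is deterministic alignment.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Pairwise displacement with periodic (wrap-around) boundaries
    d = pos[:, None, :] - pos[None, :, :]
    d -= L * np.round(d / L)
    neighbors = ((d ** 2).sum(-1) <= r ** 2).astype(float)  # includes self
    # Average neighbor heading via vector (sin/cos) averaging
    mean_sin = neighbors @ np.sin(theta)
    mean_cos = neighbors @ np.cos(theta)
    theta = np.arctan2(mean_sin, mean_cos) + eta * (rng.random(len(theta)) - 0.5)
    pos = (pos + v0 * np.stack([np.cos(theta), np.sin(theta)], axis=1)) % L
    return pos, theta

rng = np.random.default_rng(1)
pos = rng.random((200, 2)) * 10.0
theta = rng.random(200) * 2 * np.pi
for _ in range(50):
    pos, theta = vicsek_step(pos, theta, eta=0.1, rng=rng)
# Order parameter: 1 = fully aligned flock, 0 = fully disordered
order = float(np.hypot(np.cos(theta).mean(), np.sin(theta).mean()))
```

Lowering `eta` drives `order` toward 1, which is exactly the flocking transition you can watch in the interactive version.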


What I’d Tell a Student

The most consistent mistakes I’ve seen in computational research come down to four patterns. Fix these and most other problems get easier:

  1. Build one reproducible pipeline end-to-end before optimizing anything. You cannot automate chaos. Get one result to reproduce correctly, then scale.
  2. If you cannot reproduce a result from six months ago, you effectively never had it. Reproducibility is not a bureaucratic requirement. It is the minimum unit of scientific knowledge.
  3. Let CI catch failures early rather than during the final writeup. A failing test at 2pm on a Tuesday is a minor annoyance. The same failure the night before submission is a crisis.
  4. Use LLMs as force multipliers, not as replacements for scientific judgment. They are excellent at boilerplate, configuration, and lookup. They cannot tell you whether your experimental design is valid.

Closing

None of these tools are difficult to adopt. That is the point. The calculus has shifted. What once required weeks of arcane configuration now takes an afternoon with LLM assistance. What once required a dedicated research software engineer can now be set up by a graduate student on a Tuesday. The window where investing in your workflow has high returns and low costs has never been wider.

Whether that window stays open is a different question. As LLMs get better at generating code, the pressure to write clean, human-authored pipelines may diminish. The vibecoding era, if it fully arrives, will not particularly reward legibility or structure. This post is an argument for building good habits now, while they still compound, while the tools are cheap to learn, and while the science is still yours to control.

The companion repository has working examples of everything in this post. Clone it, break it, make it your own. Do it before the machines do it for you.


Acknowledgments

I have been lucky enough to have some fantastic mentors who taught me the basics of good scientific software development. Shout out specifically to Nicholas Landry, who picked up excellent software development practices from his work on the open-source XGI package. He has been incredibly patient, spending so much time walking me through the basics of making pull requests, using linters, writing maintainable code, and more. Thanks also to Phil Chodrow and Daniel Kaiser, who taught me a lot about the virtues of Quarto, Snakemake, and the rest.

References

[1]
R. P. Feynman, “There’s plenty of room at the bottom,” Engineering and Science, vol. 23, no. 5, pp. 22–36, 1960.
[2]
G. E. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, no. 8, pp. 114–117, 1965.
[3]
C. E. Leiserson et al., “There’s plenty of room at the top: What will drive computer performance after Moore’s law?” Science, vol. 368, no. 6495, p. eaam9744, 2020, doi: 10.1126/science.aam9744.
[4]
J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, “Julia: A fresh approach to numerical computing,” SIAM Review, vol. 59, no. 1, pp. 65–98, 2017, doi: 10.1137/141000671.
[5]
G. Wilson, J. Bryan, K. Cranston, J. Kitzes, L. Nederbragt, and T. K. Teal, “Good enough practices in scientific computing,” PLOS Computational Biology, vol. 13, no. 6, p. e1005510, 2017, doi: 10.1371/journal.pcbi.1005510.
[6]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[7]
F. Mölder et al., “Sustainable data analysis with Snakemake,” F1000Research, vol. 10, 2021, doi: 10.12688/f1000research.29032.2.
[8]
T. Vicsek, A. Czirók, E. Ben-Jacob, I. Cohen, and O. Shochet, “Novel type of phase transition in a system of self-driven particles,” Physical Review Letters, vol. 75, no. 6, pp. 1226–1229, 1995, doi: 10.1103/PhysRevLett.75.1226.