
A field guide to sandboxes for AI

Dunes in Oman. Photo by Christian Weiss.


Every AI agent eventually asks for the same thing:

Let me run a program.

Sometimes it’s a harmless pytest. Sometimes it’s pip install sketchy-package && python run.py. Either way, the moment you let an agent execute code, you’re running untrusted bytes on a machine you care about.

Years ago I first learned this lesson doing basic malware analysis. The mental model was blunt but effective: run hostile code in something you can delete. If it breaks out, you nuke it.

AI agents recreate the same problem, except now the “malware sample” is often:

  • code generated by a model,
  • code pasted in by a user,
  • or a dependency chain your agent pulled in because it “looked right.”

In all cases, the code becomes a kernel client. It gets whatever the kernel and policy allow: filesystem reads, network access, CPU time, memory, process creation, and sometimes GPUs.

And “escape” isn’t the only failure mode. Even without a kernel exploit, untrusted code can:

  • exfiltrate secrets (SSH keys, cloud creds, API tokens),
  • phone home with private repo code,
  • pivot into internal networks,
  • or just burn your money (crypto mining, fork bombs, runaway builds).

The request is simple: run this program, don’t let it become a machine-ownership problem.

This isn’t niche anymore. Vercel, Cloudflare, and Google have all shipped sandboxed execution products in the last year. But the underlying technology choices are still misunderstood, which leads to “sandboxes” that are either weaker than expected or more expensive than necessary.

Part of the confusion is that AI execution comes in multiple shapes:

  • Remote devbox / coding agent: long-lived workspace, shell access, package managers, sometimes GPU.
  • Stateless code interpreter: run a snippet, return output, discard state.
  • Tool calling: run small components (e.g., “read this file”, “call this API”) with explicit capabilities.
  • RL environments: lots of parallel runs, fast reset, sometimes snapshot/restore.

Different shapes want different lifecycles and different boundaries.

There’s also a hardware story hiding here. In 2010, “just use a VM” usually meant seconds of boot time and enough overhead to kill density. Containers won because they were cheap.

In 2026, you can boot a microVM fast enough to feel container-like, and you can snapshot/restore them to make “reset” almost free. The trade space changed, but our vocabulary didn’t.

The other part of the confusion is the word sandbox itself. In practice, people use it to mean at least four different boundaries:

A container shares the host kernel. Every syscall still lands in the same kernel that runs everything else. A kernel bug in any allowed syscall path is a host bug.

A microVM runs a guest kernel behind hardware virtualization. The workload talks to its kernel. The host kernel mostly sees KVM ioctls and virtio device I/O, not the full Linux syscall ABI.

gVisor interposes a userspace kernel. Application syscalls are handled by the Sentry rather than going straight to the host kernel; the Sentry itself uses a small allowlist of host syscalls.

WebAssembly / isolates constrain code inside a runtime. There is no ambient filesystem or network access; the guest only gets host capabilities you explicitly provide.

These aren’t interchangeable. They have different startup costs, compatibility stories, and failure modes. Pick the wrong one and you’ll either ship a “sandbox” that leaks, or a sandbox that can’t run the software you need.

Diagram comparing syscall paths across four sandbox boundaries: containers, gVisor, microVMs, and WebAssembly. Shows how each type mediates access to the host kernel with decreasing direct syscall ABI exposure from left to right.

How each boundary mediates access to the host kernel. Moving left → right, direct syscall ABI exposure shrinks.

The diagram is the whole game. Containers expose the host kernel’s syscall ABI (filtered by policy, but still the same kernel). gVisor interposes a userspace kernel. MicroVMs insert a guest kernel behind hardware virtualization. Wasm modules don’t get a syscall ABI at all — they get explicit host functions.


But you don’t get something for nothing. Every “stronger” boundary adds a different trusted component: a userspace kernel (gVisor), a guest kernel + VMM (microVMs), or a runtime + embedder (Wasm). Stronger doesn’t mean simpler. It means you’re betting on different code.

If this still feels weird, it’s because the industry trained us to treat containers as the default answer.

It’s the mid-2010s. Docker is exploding. Kubernetes makes containers the unit of scheduling. VMs are heavier, slower to boot, and harder to operate. Containers solve a real problem: start fast, pack densely, ship the same artifact everywhere.

For trusted workloads, that bet is often fine: you’re already trusting the host kernel.

AI agents change the threat model: now you are executing arbitrary code, often generated by a model, sometimes supplied by a user, inside your infrastructure.

That doesn’t mean “never use containers.” It means you need a clearer decision procedure than “whatever we already use.”

In the rest of this post, I’ll give you a simple mental model for evaluating sandboxes, then walk through the boundaries that show up in real AI execution systems: containers, gVisor, microVMs, and runtime sandboxes.


The three-question model

Here’s the mental model I use. Sandboxing is three separate decisions that people blur together, and keeping them distinct prevents a lot of bad calls.

Boundary is where isolation is enforced. It defines what sits on each side of the line.

  • Container boundary: processes in separate namespaces, still one host kernel.
  • gVisor boundary: workload syscalls are serviced by a userspace kernel (Sentry) first.
  • MicroVM boundary: syscalls go to a guest kernel; the host kernel sees hypervisor/VMM activity, not the guest syscall ABI.
  • Runtime boundary: guest code has no syscall ABI; it can only call explicit host APIs.

The boundary is the line you bet the attacker cannot cross.

Policy is what the code can touch inside the boundary:

  • filesystem paths (read/write/exec),
  • network destinations and protocols,
  • process creation/signals,
  • device access (including GPUs),
  • time/memory/CPU/disk quotas,
  • and the interface surface itself (syscalls, ioctls, imports).

A tight policy in a weak boundary is still a weak sandbox. A strong boundary with a permissive policy is a missed opportunity.

Lifecycle is what persists between runs. This matters a lot for agents and RL:

  • Fresh run: nothing persists. Great for hostile code. Bad for “agent workspace” UX.
  • Workspace: long-lived filesystem/session. Great for agents; dangerous if secrets leak or persistence is abused.
  • Snapshot/restore: fast reset by checkpointing VM or runtime state. Great for RL rollouts and “pre-warmed” agents.

Lifecycle also changes your operational choices. If you snapshot, you need a snapshotable boundary (microVMs, some runtimes). If you need a workspace, you need durable storage and a policy story for secrets.

The three questions

When evaluating any sandbox, ask:

  1. What is shared between this code and the host?
  2. What can the code touch (files, network, devices, syscalls)?
  3. What survives between runs?

If you can answer those, you understand your sandbox. If you can’t, you’re guessing.

A quick example makes the separation clearer:

  • Multi-tenant coding agent
    • Boundary: microVM (guest kernel)
    • Policy: allow workspace FS, deny host mounts, outbound allowlist, no raw devices
    • Lifecycle: snapshot base image, clone per session, destroy on close

Same product idea, different constraints:

  • Tool calling (e.g., “format this code”)
    • Boundary: Wasm component
    • Policy: preopen one directory, no network by default
    • Lifecycle: fresh per call

Vocabulary

Sandbox: boundary + policy + lifecycle.

Container: a packaging format plus process isolation built on kernel features. One kernel, many isolated views.

Virtual machine: a guest OS kernel running on virtual hardware.

MicroVM: a minimal VM optimized for fast boot and small footprint.

Runtime sandbox: isolation enforced by a runtime (Wasm, V8 isolates) rather than the OS.


Linux building blocks

Before comparing sandbox types, it helps to be explicit about what we’re sandboxing against. A process is a kernel client.

Diagram showing the anatomy of a Linux syscall: a userspace process makes a syscall request, the kernel validates and executes it, then returns the result to userspace.

It runs in userspace. When it needs something real — read a file, create a socket, allocate memory, spawn a process — it makes a syscall. That syscall enters the kernel. Kernel code runs with full privileges.

If there is a bug in any reachable syscall path, filesystem code, networking code, or ioctl handler, you can get local privilege escalation. That’s the root of the “container escape” story.

If you want a mental picture: the process is unprivileged code repeatedly asking the kernel to do work on its behalf. You can restrict which questions it’s allowed to ask. You cannot make the kernel stop being privileged.

Linux containers are a policy sandwich built from four primitives:

Namespaces

Namespaces give a process an isolated view of certain resources by providing separate instances of kernel subsystems:

  • PID namespace: isolated process tree / PID numbering
  • Mount namespace: isolated mount table / filesystem view
  • Network namespace: isolated network stack (interfaces, routes, netfilter state)
  • IPC/UTS namespaces: System V IPC isolation and hostname isolation
  • User namespace: UID/GID mappings and capability scoping

User namespaces are worth calling out. They let you map “root inside the container” to an unprivileged UID on the host. This changes the meaning of privilege. Rootless containers get a real security win here, because an accidental “root in the container” does not automatically mean “root on the host.”

But it’s still the same kernel boundary. A kernel bug is a kernel bug.
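
To make this concrete, here's a minimal sketch using unshare(1) from util-linux, driven from Python. It assumes a kernel that allows unprivileged user namespaces; the flags are the interesting part.

import subprocess

# Create fresh namespaces and run a shell inside them. The user namespace comes
# first, which is what lets an unprivileged user do this at all.
subprocess.run([
    "unshare",
    "--user", "--map-root-user",  # new user namespace; our UID maps to "root" inside
    "--fork", "--pid",            # new PID namespace; the command runs as PID 1
    "--mount-proc",               # private mount namespace with a fresh /proc
    "--net",                      # empty network namespace (just a downed loopback)
    "sh", "-c", "id; ps ax",
], check=True)

Inside, id reports uid 0 and ps shows a two-process tree; on the host, it's still your unprivileged UID.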

Capabilities

Linux breaks “root” into capabilities (fine-grained privileges). Containers typically start with a reduced capability set, but they often still include enough power to hurt you if you hand out the wrong ones.

The infamous one is CAP_SYS_ADMIN. It gates a huge, loosely-related collection of privileged operations, and in practice it often unlocks dangerous kernel interfaces. That’s why you’ll hear people say “SYS_ADMIN is the new root.”

In real sandboxes, treat capability grants as part of your attack surface budget. The easiest win is often just removing capabilities you don’t need.

Cgroups

Cgroups (control groups) limit and account for resources:

  • CPU quota/shares and CPU affinity sets
  • memory limits
  • I/O bandwidth / IOPS throttling
  • max process count (mitigates fork bombs)

Cgroups are primarily about preventing resource exhaustion. They don’t materially reduce kernel attack surface.
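
A minimal sketch of the cgroup v2 file interface, assuming the unified hierarchy is mounted at /sys/fs/cgroup and you have permission (or delegation) to create a child group; the group name is made up:

import os

CG = "/sys/fs/cgroup/sandbox-demo"   # hypothetical child group
os.makedirs(CG, exist_ok=True)

def write(name, value):
    # cgroup v2 controls are plain files; writing a value applies the limit.
    with open(os.path.join(CG, name), "w") as f:
        f.write(value)

write("memory.max", "512M")              # hard memory ceiling
write("pids.max", "128")                 # cap process count (fork-bomb mitigation)
write("cpu.max", "50000 100000")         # 50ms of CPU per 100ms period (half a core)
write("cgroup.procs", str(os.getpid()))  # move the current process into the group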

Seccomp

Seccomp is syscall filtering. A process installs a BPF program that runs on syscall entry; it can inspect the syscall number (and, in many profiles, arguments) and decide what happens: allow, deny, log, trap, kill, or notify a supervisor.

A tight seccomp profile blocks syscalls that expand kernel attack surface or enable escalation (ptrace, mount, kexec_load, bpf, perf_event_open, userfaultfd, etc). It also tends to block legacy interfaces that are hard to sandbox safely.

As a toy example, a “deny dangerous syscalls” seccomp rule often looks like:

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    { "names": ["bpf", "perf_event_open", "kexec_load"], "action": "SCMP_ACT_ERRNO" }
  ]
}

In real sandboxes you also filter arguments (clone3 flags, ioctl request numbers) and you run an allowlist rather than a denylist.
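
With the libseccomp Python bindings, an allowlist with argument filtering looks roughly like this. The syscall list is far too short for a real program; it's only meant to show the shape:

import errno
import seccomp  # libseccomp Python bindings (python3-seccomp)

# Default action: unknown syscalls fail with EPERM rather than killing the
# process, which is easier to live with while developing the profile.
f = seccomp.SyscallFilter(seccomp.ERRNO(errno.EPERM))

for name in ["read", "write", "exit_group", "brk", "mmap", "munmap", "futex"]:
    f.add_rule(seccomp.ALLOW, name)

# Argument filtering: allow socket() only for AF_UNIX (domain == 1).
f.add_rule(seccomp.ALLOW, "socket", seccomp.Arg(0, seccomp.EQ, 1))

f.load()  # applies to this process and is inherited by children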

There’s one more seccomp feature worth knowing because it shows up in sandboxes: seccomp user notifications (SECCOMP_RET_USER_NOTIF). Instead of simply allowing or denying, the kernel can pause a syscall and send it to a supervisor process for a decision. That lets you build “brokered syscalls” (e.g., only allow open() if the path matches a policy, or proxy network connects through a policy engine).

This is powerful, but it’s not free: brokered syscalls add latency and complexity, and your broker becomes part of the trusted computing base.

But the bottom line stays the same: the syscalls you do allow still execute in the host kernel.

How containers combine these

A “container” is just a regular process configured with a set of kernel policies plus a root filesystem:

  1. Namespaces scope/virtualize resources.
  2. Capabilities are reduced.
  3. Cgroups cap resource usage.
  4. Seccomp filters syscalls on entry.
  5. A root filesystem provides the container’s view of / (often layered via overlayfs).
  6. AppArmor/SELinux may apply additional policy.
Diagram showing Linux sandboxing building blocks: syscalls enter the host kernel, seccomp can block them before dispatch, namespaces scope what resources syscalls operate on, and cgroups enforce resource quotas.

Conceptually: syscalls enter the host kernel. Seccomp can block them before dispatch. Namespaces scope resources. Cgroups enforce quotas. But it’s still the same host kernel.

This is policy-based restriction within a shared kernel boundary. You reduce what the process can see (namespaces), cap what it can consume (cgroups), and restrict which syscalls it can invoke (seccomp). You do not insert a stronger isolation boundary.
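
Stacked together, the policy ends up as a pile of flags on the container launch. A hardened-but-still-shared-kernel sketch via the Docker CLI (image name and limits are placeholders):

import subprocess

subprocess.run([
    "docker", "run", "--rm",
    "--user", "10001:10001",                   # never root inside the container
    "--cap-drop", "ALL",                       # no capabilities
    "--security-opt", "no-new-privileges",     # block setuid/sgid escalation
    "--security-opt", "seccomp=profile.json",  # custom seccomp profile (ideally an allowlist)
    "--read-only", "--tmpfs", "/tmp",          # immutable rootfs, scratch space only
    "--network", "none",                       # no network unless policy explicitly adds it
    "--pids-limit", "128",                     # cgroup pids ceiling
    "--memory", "512m", "--cpus", "1",         # cgroup memory/CPU quotas
    "untrusted-job:latest",
], check=True)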

So why do people treat containers as a security boundary? Because “isolated” gets conflated with “secure,” and because containers are operationally convenient.

But if you’re choosing a sandbox, the boundary matters more than convenience.


Where containers fail

I want to be direct: containers are not a sufficient security boundary for hostile code. They can be hardened, and that matters. But they still share the host kernel.

The failure modes I see most often are misconfiguration and kernel/runtime bugs — plus a third one that shows up in AI systems: policy leakage.

Misconfiguration escapes

Many container escapes are self-inflicted. The runtime offers ways to weaken isolation, and people use them.

--privileged removes most guardrails. It effectively turns the container into “root on the host with some extra steps.” If your sandbox needs privileged mode, you don’t have a sandbox.

The Docker socket (/var/run/docker.sock) is another classic. Mount it and you can ask the host Docker daemon to create a new privileged container, mount the host filesystem, and so on. In practice, access to the Docker socket is access to host root.

Sensitive mounts and broad capabilities are the rest of the usual list:

  • writable /sys or /proc/sys
  • host paths bind-mounted writable
  • adding broad capabilities (especially CAP_SYS_ADMIN)
  • joining host namespaces (--pid=host, --net=host)
  • device passthrough that exposes raw kernel interfaces

These are tractable problems: you can audit configs and ban obvious foot-guns.

Kernel and runtime bugs

A properly configured container still shares the host kernel. If the kernel has a bug reachable via an allowed syscall, filesystem path, network stack behavior, or ioctl, code inside the container can trigger it.

Examples of container-relevant local privilege escalations:

  • Dirty COW (CVE-2016-5195): copy-on-write race in the memory subsystem.
  • Dirty Pipe (CVE-2022-0847): pipe handling bug enabling overwriting data in read-only mappings.
  • fs_context overflow (CVE-2022-0185): filesystem context parsing bug exploited in container contexts.

Seccomp reduces exposure by blocking syscalls, but the syscalls you allow are still kernel code. Docker’s default seccomp profile is a compatibility-biased allowlist: it blocks some high-risk calls but still permits hundreds.

And it’s not only the kernel. A container runtime bug can be enough (for example, the runc binary-overwrite escape, CVE-2019-5736).

Policy leakage (the AI-specific one)

A lot of “agent sandbox” failures aren’t kernel escapes. They’re policy failures.

If your sandbox can read the repo and has outbound network access, the agent can leak the repo. If it can read ~/.aws or mount host volumes, it can leak credentials. If it can reach internal services, it can become a lateral-movement tool.

This is why sandbox design for agents is often more about explicit capability design than about “strongest boundary available.” Boundary matters, but policy is how you control the blast radius when the model does something dumb or a malicious prompt steers it.

Two practical notes:

  • Rootless/user namespaces help. They reduce the damage from accidental privilege. They don’t make kernel bugs go away.
  • Multi-tenant changes everything. If you run code from different trust domains on the same kernel, you should assume someone will try to hit kernel bugs and side channels. “But it’s only build code” stops being a comfort.

One more container-specific gotcha: ioctl is a huge surface area. Even if you block dangerous syscalls, many real kernel interfaces live behind ioctl() on file descriptors (filesystems, devices, networking). If you pass through devices (especially GPUs), you’re exposing large driver code paths to untrusted input.

This is why many “AI sandboxes” that look like containers end up quietly adding one of the stronger boundaries underneath: either gVisor to reduce host syscalls, or microVMs to put a guest kernel in front of the host.

None of this means containers are “bad.” It means containers are a great tool when all code inside the container is in the same trust domain as the host. If you’re running your own services, that’s often true.

The moment you accept code from outside your trust boundary (users, agents, plugins), treat “shared kernel” as a conscious risk decision — not as the default.

Hardening options

Hardening tightens policy. It doesn’t change the boundary.

Hardening measure | What it does | What it doesn’t do
Custom seccomp | blocks more syscalls/args than defaults | doesn’t protect against bugs in allowed kernel paths
AppArmor/SELinux | constrains filesystem/procfs and sensitive ops | doesn’t fix kernel bugs; it only reduces reachable paths
Drop capabilities | removes privileged interfaces (avoid SYS_ADMIN) | doesn’t change shared-kernel boundary
Read-only rootfs | prevents writes to container root | doesn’t prevent in-memory/kernel exploitation
User namespaces | maps container root to unprivileged host UID | kernel bugs may still allow escalation

At the limit, you can harden a container until it’s nearly unusable. You still haven’t changed the boundary. The few syscalls you allow are still privileged kernel code. The kernel is still shared.

If a kernel exploit matters, you need a different boundary:

  • gVisor: syscall interposition
  • microVMs: guest kernel behind virtualization
  • Wasm/isolates: no syscall ABI at all

Stronger boundaries

Containers share a fundamental constraint: the workload’s syscalls go to the host kernel. To actually change the boundary, you have two main approaches:

  1. Syscall interposition: intercept syscalls before they reach the host kernel, reimplement enough of Linux in userspace
  2. Hardware virtualization: run a guest kernel behind a VMM/hypervisor, reduce host exposure to VM exits and device I/O

gVisor

gVisor is an “application kernel” that intercepts syscalls (and some faults) from a container and handles them in a userspace kernel called the Sentry.

If containers are “processes in namespaces,” gVisor is “processes in namespaces, but their syscalls don’t go straight to the host kernel.”

A few implementation details matter when you’re choosing it:

  • gVisor integrates as an OCI runtime (runsc), so it drops into Docker/Kubernetes.
  • The Sentry implements kernel logic in Go: syscalls, signals, parts of /proc, a network stack, etc.
  • Interception is done by a gVisor platform. Today the default is systrap; older deployments used ptrace, and there is an optional KVM platform.

Two subsystems are worth understanding because they dominate performance and security behavior:

Filesystem mediation (Gofer / lisafs). gVisor commonly splits responsibilities: the Sentry enforces syscall semantics, but filesystem access may be mediated by a separate component that does host filesystem operations and serves them to the Sentry over a protocol (historically “Gofer”; newer work includes lisafs). This is one way gVisor keeps the Sentry’s host interface small and auditable.

Networking (netstack vs host). gVisor can use its own userspace network stack (“netstack”) and avoid interacting with the host network stack in the same way a container would. There are also modes that integrate more directly with host networking depending on deployment constraints.

The security point is that the workload no longer chooses which host syscalls to call. The Sentry does, and the Sentry itself can be constrained to a small allowlist of host syscalls. gVisor has published syscall allowlist figures: 53 host syscalls without networking, plus 15 more with networking (68 total) in one configuration, enforced with seccomp on the Sentry process.

That’s a very different interface than “whatever syscalls your container workload can make.”
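
Adopting it is mostly a runtime swap rather than a rewrite. Assuming runsc is installed and registered with Docker as a runtime named runsc, a sketch looks like:

import subprocess

# Identical to a normal docker run, except syscalls are now serviced by the
# Sentry instead of going straight to the host kernel.
subprocess.run([
    "docker", "run", "--rm",
    "--runtime", "runsc",
    "--network", "none",
    "python:3.12-slim",
    "python", "-c", "print('hello from inside gVisor')",
], check=True)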

The tradeoffs are predictable:

  • Compatibility: not every syscall and kernel behavior is identical. There’s a syscall compatibility table, and you will eventually hit it.
  • Overhead: syscall interposition isn’t free. Syscall-heavy workloads pay more. Filesystem-heavy workloads pay for mediation and extra copies/IPC.
  • Debuggability: failure modes include ENOSYS for unimplemented calls, or subtle semantic mismatches.

My take: gVisor fits best when you can tolerate “Linux, but with a compatibility matrix,” and you want a materially smaller host-kernel interface than a standard container.

Diagram showing gVisor architecture: the Sentry userspace kernel intercepts syscalls from the sandboxed application, handling most syscalls itself and making only a reduced set of host syscalls.

MicroVMs

The alternative to syscall interposition is hardware isolation. Run a guest kernel behind hardware virtualization (KVM on Linux, Hypervisor.framework on macOS). The host kernel sees VM exits and virtio device I/O rather than individual workload syscalls.

This is why microVMs are the default answer for “run arbitrary Linux code for strangers.” You get full Linux semantics without reimplementing the syscall ABI.

What the host kernel actually sees

A microVM still uses the host kernel, but the interface changes shape:

  • The VMM makes /dev/kvm ioctls to create vCPUs, map guest memory, and run the VM.
  • The guest interacts with virtual devices (virtio-net, virtio-blk, vsock). Those devices are implemented by the VMM (or by backends like vhost-user).
  • Execution triggers VM exits (traps to the host) on privileged events, device I/O, interrupts, etc.

The host kernel still mediates access, but through a narrower, more structured interface than the full Linux syscall ABI.

What microVMs don’t solve by themselves

A guest kernel boundary doesn’t automatically mean “safe.”

You still need to decide policy:

  • Does the guest have outbound network access?
  • Does it mount secrets or credentials?
  • Does it have access to internal services?
  • Does it share any filesystem state between runs?

A microVM is a strong boundary, but you still need a strong policy to avoid turning it into a high-powered data exfiltration box.

At scale, microVM-based sandboxes converge on a small set of lifecycle patterns:

  • Ephemeral session VM: boot → run commands → destroy. Simple and dependable.
  • Snapshot cloning: boot a “golden” VM once (language runtimes, package cache) → snapshot → clone per session. Fast cold start and fast reset.
  • Fork-and-exec style: keep a pool of paused/suspended VMs, resume on demand. Operationally trickier but can reduce tail latency.

State injection is also a design choice, not a given:

  • Block device images (ext4 inside a virtio-blk): simple, portable, snapshot-friendly.
  • virtio-fs / 9p-like shares: share a host directory into the guest (useful for “workspace mirrors,” but it reintroduces host FS as part of the policy surface).
  • Network fetch: pull code into the guest from an object store/Git remote. Keeps host FS out of the guest boundary, but requires network policy.

And you still need a network story. Common patterns:

  • NAT + egress allowlist: most common for SaaS agents.
  • No direct internet: force all traffic through a proxy that enforces policy and logs.
  • Dedicated VPC/subnet: isolate “untrusted execution” away from internal services.

Finally, if you care about protecting the guest from the host (e.g., “users don’t trust the operator”), look at confidential computing (SEV-SNP/TDX) and projects like Confidential Containers. That’s a different threat model, but it’s increasingly relevant for hosted agent execution.

What is a VMM?

On Linux, “VM” is two layers:

  • KVM (in kernel): turns Linux into a hypervisor and exposes virtualization primitives via /dev/kvm ioctls.
  • VMM (userspace): allocates guest memory, configures vCPUs, and provides the virtual devices the guest uses.

QEMU is the classic general-purpose VMM: tons of devices, tons of legacy paths, tons of code. That flexibility is useful — and it’s also an attack surface and an ops cost.

MicroVM VMMs cut that down on purpose: fewer devices, fewer emulation paths, smaller footprint, faster boot.

The device model is the new interface

Moving from “container syscalls” to “microVM” shifts attack surface rather than eliminating it.

Instead of worrying about every syscall handler in the host kernel, you worry about:

  • KVM ioctl handling in the host kernel,
  • the VMM process (its parsing of device config, its event loops),
  • and the virtual device implementations (virtio-net, virtio-blk, virtio-fs, vsock).

This is why microVM VMMs are aggressive about minimal devices. Every device you add is more parsing, more state machines, more edge cases.

It also changes your patching responsibility. With microVMs, you now have two kernels to keep healthy:

  • the host kernel (KVM and related subsystems),
  • and the guest kernel (what your untrusted code actually attacks first).

The guest kernel can be slimmer than a general distro kernel: disable modules you don’t need, remove filesystems you don’t mount, avoid exotic drivers. This doesn’t replace patching, but it shrinks reachable code.

Virtio, but which flavor?

Virtio devices can be exposed via different transports. Firecracker uses virtio-mmio (simple, minimal) while other VMMs commonly use virtio-pci (more “normal VM” shaped, broader device ecosystem). You usually don’t care until you do: some guest tooling assumes PCI, some performance features assume a certain stack, and passthrough work tends to be PCI-centric.

“MicroVM” describes a design stance: remove legacy devices and keep the device model small.

Side-by-side comparison: container syscalls go to the shared host kernel; microVM syscalls go to a guest kernel, while the host sees KVM ioctls and virtio device I/O.

Firecracker

Firecracker is AWS’s minimalist VMM for multi-tenant serverless (Lambda, Fargate). It’s purpose-built for running lots of small VMs with tight host confinement.

The architecture is intentionally boring:

  • one Firecracker process per microVM,
  • minimal virtio device model (net, block, vsock, console),
  • and a “jailer” that sets up isolation for the VMM process before the guest ever runs.

Internally you can think of three thread types:

  • an API thread (control plane),
  • a VMM thread (device model and I/O),
  • vCPU threads (running the KVM loop).

The security story is defense in depth around the VMM:

  1. The jailer sets up chroot + namespaces + cgroups, drops privileges, then execs the VMM.
  2. A tight seccomp profile constrains the VMM. The Firecracker NSDI paper describes a whitelist of 24 syscalls (with argument filtering) and 30 ioctls.

A Firecracker microVM is configured with a small API surface (REST API or socket). Typical “remote devbox” products build higher-level lifecycle around that API: boot, snapshot, restore, pause, resume, collect logs.
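
To make “small API surface” concrete, here's a sketch that boots a microVM over the Unix-socket REST API. Paths and values are placeholders; the endpoints (machine-config, boot-source, drives, actions) are the real ones.

import http.client, json, socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over Firecracker's Unix-domain API socket (--api-sock)."""
    def __init__(self, path):
        super().__init__("localhost")
        self.unix_path = path
    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.unix_path)

def api(conn, method, path, body):
    conn.request(method, path, json.dumps(body), {"Content-Type": "application/json"})
    resp = conn.getresponse()
    resp.read()
    assert resp.status < 300, (resp.status, path)

conn = UnixHTTPConnection("/tmp/firecracker.sock")
api(conn, "PUT", "/machine-config", {"vcpu_count": 2, "mem_size_mib": 512})
api(conn, "PUT", "/boot-source", {
    "kernel_image_path": "vmlinux",                   # guest kernel, booted directly
    "boot_args": "console=ttyS0 reboot=k panic=1",
})
api(conn, "PUT", "/drives/rootfs", {
    "drive_id": "rootfs",
    "path_on_host": "rootfs.ext4",                    # block device = one state-injection choice
    "is_root_device": True,
    "is_read_only": False,
})
api(conn, "PUT", "/actions", {"action_type": "InstanceStart"})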

Example: the VMM doesn’t “mount your repo.” You decide how to inject state: attach a block device, mount a virtio-fs share, or pull code via network inside the guest. Those choices are policy choices.

Snapshot/restore is a practical win for agents and RL:

  • Agents: you can pre-warm a base image (language runtimes, package cache) and clone quickly.
  • RL: you can reset to a known state without replaying a long initialization sequence.

Firecracker supports snapshots, but snapshot performance and correctness are workload-sensitive (memory size, device state, entropy sources). Treat it as “works well for many cases,” not “free.”
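
As a sketch of what that looks like against the same API (reusing the api() helper above; field names have shifted a little across Firecracker releases, so treat the bodies as illustrative):

api(conn, "PATCH", "/vm", {"state": "Paused"})        # guest must be paused to snapshot
api(conn, "PUT", "/snapshot/create", {
    "snapshot_type": "Full",
    "snapshot_path": "vm.snap",                       # vCPU/device state
    "mem_file_path": "vm.mem",                        # guest memory
})
api(conn, "PATCH", "/vm", {"state": "Resumed"})

# A clone is a fresh Firecracker process that loads the snapshot instead of
# booting: PUT /snapshot/load with the snapshot and memory files, then resume.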

A few pragmatic limitations to keep in mind:

  • Firecracker focuses on modern Linux guests. It intentionally avoids a lot of “PC compatibility” hardware.
  • The device set is minimal by design. If your workload depends on obscure devices or kernel modules, you’ll either adapt the guest or choose a different VMM.
  • Debugging looks more like “debug a tiny VM” than “debug a container.” You’ll want good serial console logs and metrics from the VMM.

And a nuance that matters: “125ms boot” is usually the VMM and guest boot path. If you’re launching Firecracker via a heavier control plane (containerd, network setup, storage orchestration), end-to-end cold start can be higher. Measure in your stack.

Firecracker architecture diagram: single-process VMM with virtio devices; a jailer sets up chroot/namespaces/privilege dropping and applies seccomp-bpf.

cloud-hypervisor

cloud-hypervisor is also a Rust VMM, built from the same rust-vmm ecosystem, but it targets a broader class of “modern cloud VM” use cases.

The delta vs Firecracker is features and device model:

  • PCI-based virtio devices (no virtio-mmio)
  • CPU/memory/device hotplug
  • optional vhost-user backends (move device backends out-of-process)
  • VFIO passthrough (including GPUs, with the usual IOMMU/VFIO constraints)
  • Windows guest support

If you need “VM boundary and GPU passthrough,” this is a common landing spot. The project advertises boot-to-userspace under 100ms with direct kernel boot.
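
A direct-kernel-boot launch looks roughly like this; flag values are placeholders and the exact options vary a little by release (check cloud-hypervisor --help for your version):

import subprocess

subprocess.run([
    "cloud-hypervisor",
    "--kernel", "vmlinux",                            # uncompressed guest kernel
    "--cmdline", "console=ttyS0 root=/dev/vda rw",
    "--disk", "path=rootfs.ext4",
    "--cpus", "boot=2",
    "--memory", "size=512M",
], check=True)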

A caution on GPU passthrough: VFIO gives the guest much more direct access to hardware. That can be necessary and still safe, but it changes the failure modes. You now care about device firmware, IOMMU isolation, and hypervisor configuration in ways you don’t in “CPU-only microVM” designs. Many systems choose a hybrid: CPU-only microVMs for general code execution and a separate GPU service boundary for model execution.

libkrun

libkrun takes a different approach: it’s a minimal VMM embedded as a library with a C API. It uses KVM on Linux and Hypervisor.framework on macOS/ARM64.

This shows up in “run containers inside lightweight VMs on my laptop” tooling, and it’s relevant because local agents are increasingly a first-class workflow. libkrun also has an interesting macOS angle: virtio-gpu + Mesa Venus to forward Vulkan calls on Apple Silicon (Vulkan→MoltenVK→Metal). In many setups, this is how some container stacks get GPU acceleration on macOS without “full fat” VM managers.

The catch is that the trust boundary can be different. With an embedded VMM, you need to treat the VMM process itself as part of your TCB and sandbox it like any other privileged helper.

MicroVM options at a glance

VMM | Best at | Not great at
Firecracker | multi-tenant density, tight host confinement, snapshots | general VM features, GPU passthrough
cloud-hypervisor | broader VM feature set; VFIO/hotplug; Windows support | smallest possible surface area
libkrun | lightweight VMs on dev machines (especially macOS/ARM64) | large-scale multi-tenant control planes

gVisor vs microVMs

If you’re choosing between “syscall interposition” and “hardware virtualization,” the practical tradeoffs are:

  • Compatibility: microVMs are full Linux; gVisor is “mostly Linux” and you need to validate.
  • Overhead: gVisor avoids running a guest kernel; microVMs pay for a guest kernel and VMM (but can still be fast).
  • Attack surface: gVisor bets on a userspace kernel implementation and a small host syscall allowlist; microVMs bet on KVM + VMM device surface.

There isn’t a universal winner. The threat model and workload profile usually decide.


Kata Containers

Kata solves a common operational constraint: “I want to keep my container workflow, but I need VM-grade isolation.”

Kata Containers is an OCI-compatible runtime that runs containers inside a lightweight sandbox VM (often at the pod boundary in Kubernetes: multiple containers in a pod share one VM).

The architecture in one paragraph:

  • containerd/CRI creates a pod sandbox using a Kata runtime shim,
  • the shim launches a lightweight VM using a configured hypervisor backend (QEMU, Firecracker, cloud-hypervisor, etc),
  • a kata-agent inside the guest launches and manages the container processes,
  • the rootfs is shared into the guest (commonly via virtio-fs),
  • networking and vsock are provided via virtio devices.

The appeal is container ergonomics with a guest-kernel boundary. The cost is overhead: each sandbox VM carries a guest kernel plus VMM/device mediation, and boot adds latency.

Kata makes sense when you’re running mixed-trust workloads on the same Kubernetes cluster and you want a VM boundary without rewriting your platform.

Operationally, Kata is usually introduced via Kubernetes RuntimeClass: some pods use the default runc container runtime; untrusted pods use the Kata runtime class. That lets you mix trusted and untrusted workloads without standing up a separate cluster.
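
With the Kubernetes Python client, the only pod-level change is the runtime class. This assumes a RuntimeClass named kata already exists on the cluster and a namespace called sandbox:

from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="untrusted-build"),
    spec=client.V1PodSpec(
        runtime_class_name="kata",        # this pod gets a VM-backed sandbox
        restart_policy="Never",
        containers=[client.V1Container(
            name="job",
            image="python:3.12-slim",
            command=["python", "-c", "print('hello from a VM-backed pod')"],
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="sandbox", body=pod)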

If you’re exploring confidential computing, Kata is also one of the common integration points for Confidential Containers (hardware-backed isolation like SEV-SNP/TDX). That’s not required for “sandbox hostile code,” but it is relevant if you need tenant isolation and stronger guarantees against the infrastructure operator.


Runtime sandboxes

Now for something genuinely different. Containers, gVisor, and microVMs all run “code as a process,” so the guest sees some syscall ABI (host kernel, userspace kernel, or guest kernel).

Runtime sandboxes flip that model. The boundary lives inside the runtime itself. The sandboxed code never gets the host’s syscall ABI. It only gets whatever the runtime (and embedder) explicitly provides.

Runtime sandbox boundary: the sandboxed code has no ambient OS access. Filesystem, network, clocks, and other effects are mediated by explicit host APIs (imports) and can be denied by default.

WebAssembly

WebAssembly is a bytecode sandbox with a clean capability story. Modules can’t touch the outside world unless the host exposes imports.

That’s a fundamentally different default from “here’s a POSIX process, please behave.”

Wasm runtimes enforce:

  • memory isolation: modules operate within a linear memory; out-of-bounds reads/writes trap.
  • constrained control flow: no arbitrary jumps to raw addresses.
  • no ambient OS access: host calls are explicit imports.

WASI (WebAssembly System Interface) extends this with a capability-oriented API. The signature move is preopened directories: instead of letting the module open arbitrary paths, you hand it a directory handle that represents “this subtree,” and it can only resolve relative paths inside it.

Here’s what that feels like in practice (pseudocode-ish):

// Give the module a view of ./workspace, not the host filesystem.
wasi.preopenDir("./workspace", "/work");

// Allow outbound HTTPS to a specific host.
wasi.allowNet(["api.github.com:443"]);

If you never preopen ~/.ssh, the module can’t read it. There’s no “oops, I forgot to check a path prefix” bug in your app that accidentally grants access, because the capability was never granted.

Performance is often excellent because there’s no guest OS to boot. Instance/module instantiation can be microseconds to milliseconds depending on runtime and compilation mode (AOT vs JIT). That’s why edge platforms favor Wasm and isolates: high density, low cold start.

One detail that matters in real sandboxes: resource accounting. A runtime sandbox won’t magically stop infinite loops or exponential algorithms. You still need CPU/time limits. Many runtimes provide “fuel” or instruction metering (e.g., Wasmtime fuel) so you can preempt runaway execution deterministically.
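
A sketch with the wasmtime Python package, running a deliberately infinite loop under a fuel budget. The fuel API has moved between releases (older versions use store.add_fuel), so take the method names as approximate:

from wasmtime import Config, Engine, Store, Module, Instance, Trap

config = Config()
config.consume_fuel = True                 # every executed instruction costs fuel
engine = Engine(config)
store = Store(engine)
store.set_fuel(10_000_000)                 # the budget for this run

# An infinite loop, written in WAT for the example.
module = Module(engine, '(module (func (export "spin") (loop (br 0))))')
instance = Instance(store, module, [])

try:
    instance.exports(store)["spin"](store)
except Trap:
    print("out of fuel: runaway guest preempted deterministically")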

The limitations often show up as “I need one more host function”:

  • The real security boundary is your import surface. If you expose a powerful runCommand() host call, you reinvented a shell.
  • Keep imports narrow and typed. Prefer structured operations (“read file X from preopened dir”) over generic ones (“open arbitrary path”).

One reason Wasm keeps showing up in agent tooling is the component model direction: instead of “here’s a module, call exported functions,” you get typed interfaces and better composition. That pushes you toward smaller tools with explicit inputs/outputs — which is exactly what you want for capability-scoped AI tools.

The flip side is that your host still has to be careful. The easiest way to ruin a clean Wasm sandbox is to expose one overly-powerful host function. Capability systems fail by accidental ambient authority.

Practical constraints to consider:

  • Threads exist, but support varies by runtime and platform.
  • WASI networking has been in flux (Preview 1 vs Preview 2; ecosystem catching up).
  • Dynamic languages usually require interpreters (Pyodide, etc).
  • Anything that expects “normal Linux” (shells, package managers, arbitrary binaries) doesn’t port cleanly.

Major runtimes, if you’re evaluating:

  • Wasmtime (Bytecode Alliance): security-focused, close to WASI/component-model work.
  • Wasmer: runtime plus tooling ecosystem, with WASIX for more POSIX-like APIs.
  • WasmEdge: edge/cloud-native focus.

V8 isolates (and deny-by-default runtimes)

V8 isolates are isolated instances of the V8 JavaScript engine within a process. Platforms like Cloudflare Workers use isolates to run many tenants at high density, with low startup overhead.

The isolation boundary here is the runtime: separate heaps and globals per isolate, with controlled ways to share or communicate.

Production systems still layer OS-level defenses for defense in depth (namespaces/seccomp around the runtime process, broker processes for I/O, mitigations for timing side channels). The runtime boundary is powerful, but it’s not a replacement for OS isolation if your threat model includes engine escapes.

Deno’s permission model is the same pattern exposed to developers: V8 plus a deny-by-default capability model (--allow-read, --allow-net, --allow-run).
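
Running an agent-generated script under that model is just a matter of which flags you pass (tool.ts is a hypothetical script; anything it needs beyond these grants fails):

import subprocess

subprocess.run(
    [
        "deno", "run",
        "--allow-read=./workspace",         # filesystem: this subtree only
        "--allow-net=api.github.com:443",   # network: one host, one port
        # no --allow-write, --allow-run, --allow-env
        "tool.ts",
    ],
    check=True,
    timeout=30,  # lifecycle: kill runaway runs
)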

The limitation is scope: isolates are for JS/TS (and embedded Wasm). If your sandbox needs arbitrary ELF binaries, this isn’t the tool.

Who uses runtime sandboxes for AI?

The pattern that keeps showing up is Wasm for tool isolation, not “Wasm as a whole dev environment”:

  • Microsoft Wassette: runs Wasm Components via MCP with a deny-by-default permission model.
  • NVIDIA: describes using Pyodide (CPython-in-Wasm) to run LLM-generated Python client-side inside the browser sandbox.
  • Extism: a Wasm plugin framework that’s basically “capability-scoped untrusted code execution.”

When runtime sandboxes are enough

Runtime sandboxes fit a specific profile:

Constraint | Runtime sandbox strength
Stateless execution | excellent
Cold start / density | excellent
Full Linux compatibility | no (explicit host APIs only)
Language flexibility | limited (Wasm languages / JS)
GPU access | only via host-provided APIs (e.g., WebGPU), not raw devices

If the product needs a full userspace and arbitrary binaries, you’ll end up back at microVMs or gVisor. If the product is “run small tools safely,” runtime sandboxes can be the cleanest option.


Choosing a sandbox

Most selection mistakes come from skipping the boundary question and jumping straight to implementation details.

Here’s how I actually decide:

  1. Threat model: is this code trusted, semi-trusted, or hostile? Does a kernel exploit matter?
  2. Compatibility: do you need full Linux semantics, or can you live inside a capability API?
  3. Lifecycle: do you need fast reset/snapshots, or long-lived workspaces?
  4. Operations: can you run KVM and manage guest kernels, or are you constrained to containers?

A decision table:

Workload | Threat model | Compatibility needs | Recommended boundary
AI coding agent (multi-tenant SaaS) | hostile (user-submitted code) | full Linux, shell, package managers | microVM (Firecracker / cloud-hypervisor)
AI coding agent (single-tenant / self-hosted) | semi-trusted | full Linux | hardened container or gVisor
RL rollouts (parallel, lots of resets) | mostly trusted code, needs isolation per run | fast reset, snapshot/restore | microVM with snapshot support
Code interpreter (stateless snippets) | hostile | scoped capabilities, no shell | gVisor or runtime sandbox (if language fits)
Tool calling / plugins | mixed | explicit capability surface | Wasm / isolates

Two common mistakes I see:

Mistake 1: “Our code is trusted.” In agent systems, the code you run is shaped by untrusted input (prompt injection, dependency confusion, supply chain). Treat “semi-trusted” as the default unless you have a strong reason not to.

Mistake 2: “We’ll just block the network.” Network restrictions matter, but they’re not a boundary. If you’re running hostile code on a shared kernel, the “no network” sandbox can still become a kernel exploit sandbox.

Before picking a boundary, I also write down a minimum viable policy. If you can’t enforce these, you don’t have a sandbox yet:

  • Default-deny outbound network, then allowlist. (Or route everything through a policy proxy.)
  • No long-lived credentials in the sandbox. Use short-lived scoped tokens.
  • Workspace-only filesystem access. No host mounts besides what you explicitly intend.
  • Resource limits: CPU, memory, disk, timeouts, and PIDs.
  • Observability: log process tree, network egress, and failures. Sandboxes without telemetry become incident-response theater.

Concrete defaults, if you’re starting from scratch:

  • Multi-tenant AI agent execution: microVMs. Firecracker for density and a tight VMM surface. cloud-hypervisor if you need VFIO/hotplug/GPU.
  • “I already run Kubernetes”: gVisor is a good middle ground if compatibility is acceptable.
  • Trusted/internal automation: hardened containers are usually fine.
  • Capability-scoped tools: Wasm or isolates.

A quick rule of thumb that holds up surprisingly well:

  • If you need a shell + package managers and you don’t fully trust the code: start at microVM.
  • If you can live inside a compatibility matrix to save overhead: consider gVisor.
  • If you can model the task as capability-scoped operations: prefer Wasm/isolate.

Then validate with measurement: cold start, steady-state throughput, and operational complexity. MicroVMs can be cheap at scale, but only if your orchestration is built for it.

The point isn’t to crown a winner. It’s to know what would have to fail for an escape to happen, and to pick the boundary that matches that reality.


Appendix: Local OS sandboxes

Everything above assumes you’re running workloads on a server. Local agents (Claude Code, Codex CLI, etc.) are a different problem:

The agent runs on your laptop. It can see your filesystem. In this case, the failure mode is often not “kernel 0-day.” It’s “prompt injection tricks the agent into reading ~/.ssh or deleting your home directory.”

Each major OS has a mechanism for “lightweight, per-process” sandboxing. These are still policy enforcement within a shared kernel boundary, but they’re designed to be user-invocable and fast.

Diagram showing kernel-enforced sandbox policy: each file or network operation is checked against a profile; disallowed operations fail (commonly EPERM).

macOS Seatbelt (App Sandbox)

Seatbelt is macOS’s kernel-enforced sandbox policy system (SBPL profiles). Tools can generate a profile that allows the workspace and temp directories, and denies sensitive paths like ~/.ssh. When the agent tries anyway, the kernel returns EPERM. The shell can’t “escape” a kernel check.

A note on tooling: sandbox-exec has been deprecated for years, but the underlying mechanism is still used across Apple’s platforms. Modern tools tend to apply sandboxing via APIs/entitlements rather than via a CLI wrapper.

A tiny SBPL example looks like:

(version 1)
(deny default)
(allow file-read* (subpath "/Users/alice/dev/myrepo"))
(allow network-outbound (remote tcp "api.github.com:443"))

Linux Landlock (+ seccomp)

Landlock is an unprivileged Linux Security Module designed for self-sandboxing: a process can restrict its own filesystem access (and, on newer kernels, some TCP operations), and the restriction is irreversible and inherited by children.

In most setups, Landlock pairs well with seccomp: Landlock controls filesystem paths; seccomp blocks high-risk syscalls (ptrace, mount, etc).

At the API level, Landlock is a few syscalls: create a ruleset → add path rules → enforce. A minimal flow looks like:

/* No glibc wrappers yet: these are thin syscall(2) wrappers around the Landlock syscalls. */
int ruleset_fd = landlock_create_ruleset(&attr, sizeof(attr), 0);    /* attr: which access rights are handled */
landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &rule, 0); /* rule: an open dir fd + allowed access mask */
landlock_restrict_self(ruleset_fd, 0);                               /* enforce; irreversible for this process */

The important behavior is that the restriction is one-way (can’t be disabled) and inherited by children. That makes it suitable for “run this tool, but don’t let it touch anything else.”

If you want a practical tool to start with, landrun wraps Landlock into a CLI. It requires Linux 5.13+ for filesystem rules and 6.7+ for TCP bind/connect restrictions.

Windows AppContainer

AppContainer is Windows’ capability-based process isolation (SIDs/tokens). It’s widely used for browser renderer isolation. AppContainer processes run with a restricted token and only the capabilities you grant (network, filesystem locations, etc).

For coding agents it’s less common today mostly because setup is Win32-API heavy compared to “write a profile / apply a ruleset.” But if you need strong OS-native isolation on Windows without spinning up VMs, it’s the primitive to learn.

Comparison

Aspect | macOS Seatbelt | Linux Landlock + seccomp | Windows AppContainer
Privilege required | none | none | none (but setup is explicit)
Filesystem control | profile rules | path rulesets | capability grants
Network control | profile + proxy patterns | namespaces/seccomp (or limited Landlock) | capability grants
What it doesn’t solve | kernel vulnerabilities | kernel vulnerabilities | kernel vulnerabilities

One operational pitfall across all local sandboxes: you have to allow enough for the program to function. Dynamic linkers, language runtimes, certificate stores, and temp directories are all “real” dependencies. A deny-by-default sandbox that forgets /usr/lib (or the Windows equivalent) will fail in confusing ways.

Treat profiles as code: version them, test them, and expect them to evolve as your agent’s needs change.

I personally run my coding agents only with a sandbox enabled, and I'd advise you to do the same.


AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems


True Stories from the (Data) Battlefield – Part 1: Communicating About Data


The data professions (data science, analysis, engineering, etc.) are highly technical fields, and much online discussion (in particular, on this blog!), conference presentations and classes focus on technical aspects of data work or on the results of data analyses. These discussions are necessary for teaching important aspects of the data trade, but they often ignore the fact that data and analytics work takes place in an interpersonal and organizational context.

In this blog post, the first of a series, we’ll present a selection of non-technical but common issues that data professionals face in organizations. We will offer a diagnosis of “why” such problems arise and present some potential solutions.

We aim to contribute to the online data discourse by broadening the discussion to non-technical but important challenges to getting data work done, and to normalize the idea that, despite the decade-long hype around data, the day-to-day work is filled with common interpersonal and organizational challenges.

This blog post is based on a talk given by myself and Cesar Legendre at the 2024 meeting of the Royal Statistical Society of Belgium in Brussels.

In this first blog post, we will focus on problems that can occur when communicating to organizational stakeholders (colleagues, bosses, top management) about data topics.

Below, we present a series of short case studies illustrating a problem or challenge we have seen first-hand. We describe what happened, the reaction from organizational stakeholders or clients, a diagnosis of the underlying issue, and offer potential solutions for solving the problem or avoiding it in the first place.

Case 1: The Graph Was Too Complicated

What Happened

  • In a meeting among the data science team and top management, a data scientist showed a graph that was too complicated. We have all seen this type of graph: there are too many data points represented, the axes are unlabeled or have unclear labels, and there is no takeaway message.

graph too complicated

The Reaction

  • This did not go over well. You could feel the energy leaving the room: the executives started to lose interest and disconnect. Non-data stakeholders started checking their phones… If the goal was to communicate the takeaways from a data analysis to decision makers, it failed spectacularly.

The Diagnosis

  • There was no match between the goals of the stakeholders (e.g. the management members in the room, who were the invitees to the discussion) and the data scientists presenting to them. The role of the executive in these types of discussions is to take in the essence of the situation at a high level, and make a decision, provide guidance or direction to the project. An inscrutable graph with no clear conclusion does not allow the executive to accomplish any of these tasks, and so they might rightly think that their time is being wasted in such a discussion.

The Solution

  • The solution here can be very simple: make simple, easy-to-understand graphs with clear takeaways (bonus points for including the conclusion in writing on the slide itself). This is harder than it sounds for many data professionals, because we are trained to see beauty in the whole story, and we often want to present the nuances that exist, particularly if they posed challenges to us in the analysis or cleaning of the data.

  • However, especially in interaction with management stakeholders, you don’t need to show everything! We have found that a good strategy is to focus on the message or conclusion you want to communicate, and to show the data that supports it. This isn’t peer-review – you have explicitly been hired by the organization to analyze and synthesize, and present the conclusions that you feel are justified. In a healthy organization (more on that below), you are empowered as an expert to make these choices.

Case 2: The Graph Was Too Simple

If complicated graphs can be the source of communication issues, then simple graphs should be the solution, right? Unfortunately, this is not always the case.

What Happened

  • In a meeting among the data science team and top management, a data scientist presented a graph that showed a simple comparison with a clear managerial / decision implication.

graph too simple

The Reaction

  • Rather than being rewarded for showing a clear and compelling data visualization that had decision implications, the executives’ response was: “This isn’t good enough.” One executive in particular insisted that they needed drill downs, for example per region, per store within region, product category within store, etc.

  • The marching orders from this meeting were to produce the additional visualizations and send them to management. However, there was no plan for following up on this additional work or any commitment for any resulting actions. The data team left the meeting with the feeling that this was just the beginning of a long and complicated cycle in which an endless series of graphs would be made, but that nothing would ever be done with them.

The Diagnosis

Why does this happen? In our opinion, there are at least two diagnoses.

  • The first explanation assumes good intentions on the part of the management stakeholder. The simple truth is that not everyone is equally comfortable with data or with making data-driven decisions. Among some stakeholders there is a feeling that, if they had all of the information, they would be able to completely understand the situation and decide with confidence. A related belief is that more information is better (this is why organizations build an often overwhelming number of dashboards, showing splits of the data according to a never-ending series of categories). Furthermore, some decision makers want to make many local decisions – e.g. to manage each country, region, store, or individual product in its own way, though this can quickly become unmanageable.

  • The second explanation does not assume good intentions on the part of the organizational stakeholder. To executives who are used to simply going with their intuition or gut feeling, being challenged to integrate data into the decision-making process can be a somewhat threatening proposition. One strategy to simply ignore data completely is to leave the analysis unfinished by sending the data team down a never-ending series of rabbit holes while continuing to make decisions as usual.

The Solution

The solution here depends on the diagnosis of the problem.

  • If one assumes that the issue is a desire for mastery and a discomfort with numbers, a helpful strategy is to re-center the discussion on the high-level topic at hand. In the context of the graph shown above, one approach would be to say something along the lines of: “What are we trying to do here? We conducted an experiment to see whether using different types of promotions would impact our overall sales. We saw that customers who received Promotion A were 1.5 times more likely to buy the product than customers who received Promotion B. Our testing sample was representative of our customer base, and so the analysis suggests that we expand the use of Promotion A to all of our customers.”

  • If one assumes that the issue is a desire to ignore data completely and continue on with business as usual, there’s not much that you as a data professional can do to change the managerial culture and address the underlying issue. A single occurrence of such behavior could just be a fluke, but if you keep finding yourself in this position, we would encourage you to reflect on whether you are in the right place in your organization, or in the right organization at all. If all you do is chase rabbits, you will wind up exhausted, never finish anything, and your work will have no impact.

Case 3: When A Graph Leads to Stakeholder Tunnel Vision

What Happened

  • In a meeting among the data science team and management, a data scientist presented a simple graph showing some patterns in the data. (The graph here is a correlation matrix of the mtcars dataset, included for illustrative purposes).

graph leads to tunnel vision

The Reaction

  • For reasons that were unclear to the data team, a manager exhibited an odd focus on a single data point and asked a great many questions about it, derailing the meeting and the higher-level discussion in the process.

  • An example of what this looked like, applied to the correlational graph above:

    • “Ah so the bubble with Weight and Weight is a big blue circle. Well, this Weight score is very important because this is one of our key metrics. So why is the dot so blue? And why is it so big? And what does it mean that the Weight and Weight has one of the biggest and bluest bubbles, compared to the other ones?” (Note to readers: the big blue circle indicates the correlation of the Weight variable with itself; the correlation is by definition 1, which explains why the bubble is big (it’s the largest possible correlation) and blue (the correlation is positive). There is nothing substantively interesting about this result.)

The Diagnosis

  • As in the case of the simple graph above, the problem might stem from discomfort on the part of a stakeholder who is not at ease with understanding data and graphs. Although this feels unnatural to many data professionals, the simple truth is that data makes many people uncomfortable. It is possible that this extreme focus on a single detail in a larger analysis reflects a desire for mastery over a topic that feels scary and uncontrollable.

  • Another potential explanation concerns the performative function of meetings in the modern workplace. In an organizational setting, meetings are not simply a neutral environment where information is given and received. Especially in larger companies, and in meetings with large numbers of participants, simply making a point or digging into a detail is a strategy some people use to draw attention to themselves or show that they are making a contribution to the discussion. (Even if the contribution is negligible or counter-productive!)

The Solution

  • As described above, the recommended solution here is to take a step back and recenter the discussion. In the context of our correlation analysis above, we might step up and say – “What are we trying to do here? We are showing a correlation matrix of the most important variables, and the correlations in this plot show that larger and more powerful cars get fewer miles per gallon (they are less fuel-efficient). In our next slide, we show the results of a regression analysis that identifies the key drivers of our outcome metric.” Then move on to the next slide and continue with the story you are telling in your presentation.

Case 4: When the Data Reveal an Uncomfortable Truth

What Happened

  • In a meeting with top management, a data scientist showed a simple chart of a metric by location and made a straightforward observation about the pattern in the data – “ah, our metric is lower in Paris and Berlin vs. London.” The observation was clear from the graph, and the data scientist thought this was basic information that management should be aware of.

[Image: uncomfortable truth]

The Reaction

  • The reaction on the part of the senior stakeholder, however, was explosive. There was a fair amount of yelling and shouting, not particularly related to the topic at hand. It was as if a toddler in a suit was having a meltdown. The meeting ended quickly, and the executive made it clear he did not want the project to continue or to see the data scientists again.

The Diagnosis

  • The executive was aware of the situation that the data visualization revealed. However, the facts that the data made clear were problematic for the executive’s scope of responsibility. Truly addressing the issue would have been difficult if not impossible given the organizational structure and geographic footprint. Rather than acknowledge the problem and work to solve it, the most expedient solution was to “shoot the messenger.”

The Solution

  • There’s not a lot that a data scientist, or even a data science leader, can do in these situations. Nevertheless, shouting in meetings is completely unacceptable. Our recommendation is to simply look for another job. There is plenty of work right now for data professionals, and you owe it to yourself to seek out a more professional organization.

Summary and Conclusion

The goal of this blog post was to describe some common non-technical problems that are encountered by data professionals while working in organizational settings. We focused here on communicating about data, and some of the many things that can go wrong when we talk about data or present the results of a data project to colleagues.

Communication in any organization can be difficult, and communication about data particularly so. Data topics can be complex, and such discussions can make people uncomfortable, or reveal facts that some stakeholders would prefer remain hidden. In many organizations, employees do not have much training or experience in data, and this often understandably leads to confusion as data projects start to be rolled out.

Our goal here was not simply to make a list of potentially difficult situations, but to give you some things to think about and some solutions to try when you run into similar problems. In our view, it’s worth talking about what can and does go wrong in data projects, because we believe this work is important. It is our hope that the more we as a field talk about problems like these, the greater the awareness becomes (among data practitioners and their stakeholders alike!) and the better the whole field works. There is a tremendous opportunity to do better, so let’s take it!

Coming Up Next

Our next blog post will focus on interpersonal issues that can pose problems in data projects. Stay tuned!

Post-Script: R Code Used to Produce the Graphs Shown Above

Benchmark Studies

1 Share

This is a live blog of the final lecture of the 2025 edition of my graduate machine learning class “Patterns, Predictions, and Actions.” A full Table of Contents is here. I tried to summarize my semester reflections in class on Thursday, but found my thoughts haven’t quite settled yet. I’m hoping a week of posting will help me sort through it.

My GSIs (one in particular) gave me a lot of guff about how the homework was too easy this semester. My response is that the theoretical foundations of the course are still too weak for me to pose hard math questions in good faith. Yes, there are difficult mathematical problems in finding the optimal constants for certain learning-theoretic bounds. But finding constants for theories that give bad advice in practice is a waste of everyone’s time. In machine learning, we learn more from social analysis than functional analysis, and it’s hard to write problem sets on sociology.

The theory in machine learning is frustratingly less mathematical than that of other fields of information engineering. For example, consider mathematical optimization, which I taught last fall. There, you begin with a mathematical modeling language. If you believe that you have a problem that can be written in this language, optimization provides algorithms to find the solution. If you can’t write your problem in that language, you’ll need to try something else.

You might think machine learning works that way too. If we believe we have a population in which a certain variable is well approximated by a parametric function of the other variables, then sure, we can look into algorithms to estimate that function via random sampling from the population. However, we rarely ever believe in this parametric assumption in applied machine learning. This is where we become untethered from the mathematics. In almost every problem that people care about, we simply don’t know the functional relationship connecting these quantities. Because if we did know it, we wouldn’t be using machine learning in the first place.

I like to illustrate this with two extreme cases:

  • I believe that there is a computer program that can tell whether a string of bits has an even or odd number of ones. I can write this program in a single line of Python. I don’t use machine learning.

  • I believe there is a computer program that can identify animals in images. I have no idea how to write that computer program. I turn to machine learning.
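
To make the first bullet concrete, the parity check really is a one-liner (a minimal sketch, not from the original post):

```python
def has_odd_parity(bits: str) -> bool:
    # Count the ones; the remainder mod 2 is the parity of the string.
    return bits.count("1") % 2 == 1

print(has_odd_parity("10110"))  # True  (three ones)
print(has_odd_parity("1001"))   # False (two ones)
```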

In machine learning, the price of admission is a belief that pattern recognition is possible and the conviction that it’s too hard to write a function describing these patterns from first principles. Generalization is an axiom, not a theorem. It would be nice if we could say one algorithm is better than another at finding these prediction functions, but we don’t have theory for that. Instead, we have to look to engineering “best practice” for what to do next.

Am I saying that every machine learning course has to have a few weeks on science studies? Yes I am.

Of course, this is frustrating as all hell to computer science students who are taught that you just type incantations into Jupyter notebooks and conjure up pure logical positivism. Machine learning has a recipe book, but, unlike when I’m teaching an undergraduate course in algorithms, I can’t justify much of it at all.

I know that I can approximate arbitrary nonlinear patterns as a composition of simple, componentwise nonlinear functions and linear maps (what we call neural networks). I know that I can arrange data streams into arrays that respect certain local consistency properties, thinking of text as sequences of tokens and images as spatial arrays of patches. I can compose these basic primitives to construct a bunch of different candidate prediction functions. I can use numerical search to find the linear maps in the resulting functional expression.

This set of weakly specified building blocks begets a zoo of methods. But we can only tell stories for why we’d prefer one over another. Random Forests, CNNs, RNNs, and Transformers are all potentially useful to get your number to go up, but they aren’t fundamental in any sense. There are some architectures that people get excited about for a while, and they yell about how awesome they are. Then some new architecture becomes exciting, and they yell about that. People build the next cool thing on the last cool thing. They tend not to emphasize how data is fundamental, and how it’s essential to set up your ETL pipeline in precisely the right way. But those pesky details are easy enough to find if you look through the various open-source machine learning repositories. And so machine learning continues, one pull request at a time.

This organic process is fine! But I don’t think you can explain anything about it with large deviation inequalities or functional analysis. How can I know which method is best? I check my answer on some leaderboard.

I’ve been trying to figure out how best to teach this in context. Our machine learning practices make it impossible to disentangle technical innovation from technical debt. I don’t want to prove theorems about some widget that is currently widely deployed because maybe it won’t be used next week. Some components are vestigial structures left behind after a frenzied series of paper deadlines and investor calls. Which structures? I can’t tell you.

On the other hand, machine learning has some shockingly robust practices that other fields should emulate. The train-test paradigm is fascinating. How can I know which method is best? I check my answer on some leaderboard. Despite statistical arguments declaring it fundamentally flawed, the culture of competitive testing on benchmarks has driven and still drives the engine of what the field defines as progress. We still can’t explain much about why it works as well as it does. We don’t have compelling theories for why benchmarks don’t go stale as they saturate, or why we see the patterns we see in empirical performance on these benchmarks. The social patterns here are fascinating, and they should be taught more explicitly in machine learning courses.

Although some argue that we need to move beyond the benchmarking paradigm, I would counter that the benchmarking paradigm defines the field. Believe that pattern recognition is possible. Specify your metric at the population level. Gather two samples representative of this population and use one for play and one for benchmarking, trying to maximize your metric. Once you get bored with the benchmark, make a new one. That’s machine learning in a nutshell. In practice, machine learning sociology is all we need.

ChatGPT Is Terraforming Earth

1 Share

Your ChatGPT subscription is subsidising the terraforming of Earth.

Not metaphorically. China’s solar overbuild is crashing electricity prices toward zero. AI demand absorbs the surplus. When power is essentially free, we manufacture fuel from thin air instead of drilling for it.

Solar has become a deflationary technology, following the same exponential cost curves that made computing ubiquitous. In 2024, China installed more solar capacity in a single year than the United States has installed in its entire history.[1] The world added roughly 600 GW last year, equivalent to installing Australia’s entire grid every few months.

None of this was planned. More than $130 billion poured into China’s solar industry, driven by what industry leaders have called “blind investments and disorderly financial lending.”[2] China now holds over 80% of global manufacturing capacity across the entire supply chain, with utilisation rates hovering around 50-60%. The industry is producing far more than the market can absorb. Solar hardware costs have collapsed to near $0.10 per watt and continue falling. We have left the era of energy scarcity and entered the era of manufactured energy.

Into this glut arrives the hungriest electricity buyer in history.

Grid connections take seven years while AI scaling laws operate on months. Tech giants are building dedicated solar installations behind the meter, bypassing the glacial pace of grid development. AI data centre demand is projected to double by 2030, potentially reaching a Japan-sized energy footprint. But unlike households or hospitals, AI loads are flexible. Workloads can shift across hours or days. Compute can relocate to wherever energy is cheapest. Training runs are price-sensitive and will chase arbitrage. This makes AI the perfect absorber of surplus daytime solar. Solar is becoming the world’s cheapest input precisely as AI becomes the world’s hungriest buyer.

To run a data centre 24/7 on solar power, you must overbuild capacity by four to five times.[3] For roughly eight hours each day during peak sun, these facilities generate gigawatts of excess electricity that the batteries cannot hold and the servers do not need.
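
A rough back-of-the-envelope calculation shows where the four-to-five-times figure comes from. The capacity factor and battery efficiency below are generic assumptions of mine, not numbers from the post:

```python
# Illustrative energy balance for a solar-plus-battery data centre (assumed figures).
load_mw = 100                # constant compute load to be served 24/7
capacity_factor = 0.25       # high-sun region: panels average ~25% of nameplate over a day
battery_round_trip = 0.90    # fraction of stored energy that comes back out at night
direct_sun_hours = 8         # hours per day served straight from the panels

# Energy needed per day: 8 hours straight from the panels, 16 hours via batteries
# (inflated by round-trip losses).
direct_mwh = load_mw * direct_sun_hours
stored_mwh = load_mw * (24 - direct_sun_hours) / battery_round_trip
daily_generation_needed = direct_mwh + stored_mwh

# Nameplate capacity needed to generate that much over a day at the assumed capacity factor.
nameplate_mw = daily_generation_needed / (capacity_factor * 24)
print(f"~{nameplate_mw:.0f} MW of panels for a {load_mw} MW load "
      f"(about {nameplate_mw / load_mw:.1f}x overbuild)")
```

With these assumptions the answer lands around 4.3x; push the capacity factor down toward 20%, or add winter margin, and the ratio climbs past five.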

Turning air and water into natural gas is thermodynamically possible but economically absurd at normal electricity prices. Capture CO2 from the atmosphere, electrolyse water into hydrogen, combine them to produce methane. The chemistry has been understood for over a century, but it requires so much electricity that it only makes sense when power costs around $10 per megawatt hour.
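
The rough energy arithmetic makes that threshold less mysterious. The figures below are textbook-style round numbers of my own, not numbers from the post or from Terraform:

```python
# Sabatier route: CO2 + 4 H2 -> CH4 + 2 H2O. All inputs are assumed round numbers.
electrolysis_kwh_per_kg_h2 = 52    # electricity to make a kg of hydrogen (assumed)
dac_kwh_per_kg_co2 = 2.0           # electricity for direct air capture, per kg CO2 (assumed)
h2_per_kg_ch4 = 0.50               # ~0.5 kg H2 per kg CH4, from the stoichiometry
co2_per_kg_ch4 = 2.75              # ~2.75 kg CO2 per kg CH4, from the stoichiometry

kwh_per_kg_ch4 = (h2_per_kg_ch4 * electrolysis_kwh_per_kg_h2
                  + co2_per_kg_ch4 * dac_kwh_per_kg_co2)

for usd_per_mwh in (50, 10):
    electricity_cost = kwh_per_kg_ch4 * usd_per_mwh / 1000
    print(f"${usd_per_mwh}/MWh -> ~{kwh_per_kg_ch4:.0f} kWh and "
          f"${electricity_cost:.2f} of electricity per kg of methane")
```

A kilogram of methane carries only about 14 kWh of chemical energy, so at ordinary grid prices the electricity input alone costs several times what the gas is worth; somewhere around $10/MWh the input cost drops into roughly the range of conventional gas prices, which is the regime the synthesis thesis depends on.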

That “waste” energy from AI solar farms is exactly that essentially free electricity.

Terraform Industries has built container-sized units that do precisely this.[4] Instead of curtailing that midday spike, we feed it into machines that convert sunlight and air into pipeline-grade natural gas. Input: sunlight plus air. Output: methane. Carbon becomes a circular industrial input rather than a pollutant extracted from underground. The synthesis plants soak up whatever the batteries cannot store.

Once electricity becomes essentially free during daylight hours, we stop drilling for fossil fuels and start manufacturing them from air. Desalination becomes trivial. Synthetic ammonia production gets cheap, shifting the cost structure of global food. Plastics become carbon-neutral.

Energy independence stops being about who controls oil reserves and starts being about who has land and sunlight. The Middle East of the future is Nevada, Australia, and Namibia.

This reframes the entire conversation about AI power usage. The real question is whether we can generate electricity cleanly. The answer is becoming yes, in such abundance that we will need to invent new uses for the surplus.

For most leaders, this feels distant. Most of us are not building hyperscale data centres. But the second-order effects matter.

Watch energy costs in cloud pricing. As electricity becomes the dominant cost in AI inference, providers will increasingly price by consumption. The ones who figure out solar arbitrage will offer cheaper inference. This will show up in your bills before you understand why.

Watch local AI. When running models locally becomes cheap enough, the economics of cloud versus on-premise shift. The line is moving. Keep an eye on where it lands.

Prepare for abundance thinking. The interesting question is shifting from “how do we afford AI?” to “what becomes possible when AI is essentially free?” LLM costs per token have dropped 100x in two years.[5] Planning for scarcity when abundance is arriving is the wrong bet. The companies that win will be the ones who figure out what to build when the constraint disappears.

China’s industrial policy is creating the oversupply. AI is arriving to absorb it. Synthesis technology makes the excess useful. The path to addressing climate runs through extreme abundance, and your AI usage is part of what makes it viable.

  1. China added 329 GW of solar capacity in 2024, accounting for 55% of global installations. The world added a record 597 GW total, bringing global capacity to 2.2 TW. Sources: SolarPower Europe Global Market Outlook 2025-2029, BNEF Global PV Market Outlook

  2. Gao Jifan, chairman and CEO of Trina Solar, identified “blind investments, excessive local government efforts, and disorderly financial lending” as key drivers of overcapacity. China now accounts for over 80% of global polysilicon, wafer, cell, and module manufacturing capacity. BNEF expects overcapacity to persist through at least 2027. Sources: Wood Mackenzie analysis of China’s 2024 PV manufacturing guidelines, BNEF Energy Transition Supply Chains 2025

  3. To guarantee 24/7 power from solar requires overbuilding capacity significantly to charge batteries for nighttime operation. The exact ratio depends on location and storage costs, but 4-5x is a reasonable planning assumption for most high-sun regions. See Casey Handmer’s analysis in We’re Going to Need a Lot of Solar Panels

  4. Terraform Industries builds modular synthetic fuel plants optimised for intermittent power sources. In April 2024 they delivered carbon-neutral pipeline-grade synthetic natural gas to utility partners at 99.4% methane purity. Their thesis is that solar electricity below $10/MWh makes atmospheric carbon capture and fuel synthesis economically competitive with extraction. See Casey Handmer’s Radical Energy Abundance for the detailed analysis. 

  5. GPT-4 cost $36 per million tokens at launch in March 2023. GPT-4o mini now costs $0.15 per million input tokens. The Stanford AI Index 2025 reports inference costs for GPT-3.5-class models fell 280-fold between 2020 and 2024. See DeepLearning.AI’s analysis of falling token prices

Quoting Obie Fernandez

1 Share

If the part of programming you enjoy most is the physical act of writing code, then agents will feel beside the point. You’re already where you want to be, even just with some Copilot or Cursor-style intelligent code auto completion, which makes you faster while still leaving you fully in the driver’s seat about the code that gets written.

But if the part you care about is the decision-making around the code, agents feel like they clear space. They take care of the mechanical expression and leave you with judgment, tradeoffs, and intent. Because truly, for someone at my experience level, that is my core value offering anyway. When I spend time actually typing code these days with my own fingers, it feels like a waste of my time.

Obie Fernandez, What happens when the coding becomes the least interesting part of the work

Tags: careers, ai-assisted-programming, generative-ai, ai, llms
