Cloud Compute

SciAgent runs scientific simulations on cloud clusters via SkyPilot, with a local Docker fallback for small jobs. The same tool surface (compute_run, compute_exec, compute_cluster, materialize, materialize_workspace) covers both backends — the agent picks the right one based on the job’s resource needs.

This page covers the user-facing surface. For internals, see Architecture → SkyPilot integration.

SkyPilot vs. local: when each is used

The compute router (src/sciagent/compute/router.py) selects a backend per job:

| Backend | Used when |
| --- | --- |
| SkyPilot | GPU requested, > 16 GB memory, > 8 CPUs, or backend="skypilot" explicitly |
| Local Docker | Small jobs that fit on the developer’s machine |

Both code paths produce the same JobResult shape, so downstream agents and analysis don’t branch on backend.
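The routing rule can be sketched as a small pure function. This is an illustration only, not the contents of router.py: the function name pick_backend, its defaults, and the constant names are hypothetical, and only the thresholds come from the table above.

```python
from typing import Optional

# Thresholds from the table above; everything else here is a hypothetical sketch.
MAX_LOCAL_MEMORY_GB = 16
MAX_LOCAL_CPUS = 8


def pick_backend(cpus: int = 1, memory_gb: float = 1.0, gpus: int = 0,
                 backend: Optional[str] = None) -> str:
    """Return "skypilot" or "local" following the documented routing rules."""
    if backend == "skypilot":
        return "skypilot"  # an explicit request always goes to the cloud
    if gpus > 0 or memory_gb > MAX_LOCAL_MEMORY_GB or cpus > MAX_LOCAL_CPUS:
        return "skypilot"  # any single resource over the limit is enough
    return "local"  # small jobs stay on the developer's machine
```

Under these assumptions, a job with cpus=4 and memory_gb=32 routes to SkyPilot on memory alone, while cpus=8 and memory_gb=16 stays local because both limits are strict.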

Two execution modes

compute_run supports two modes:

  • Managed jobs (mode="job", default) — Sky launches a transient cluster, runs the command, tears the cluster down on completion. One-shot. Use for “run this and I’ll come back when it’s done.”
  • Cluster mode (mode="cluster") — Sky launches a persistent cluster you can iterate against (compute_exec for follow-up commands, compute_cluster(action="refresh_mounts") to point it at new inputs). Use for case-file reproductions, multi-step pipelines, anything you’ll probe interactively before the real run.

For iterative scientific workflows, cluster mode is usually faster end-to-end: a stopped cluster restarts in seconds, while a fresh launch takes minutes.

Per-job wall-clock budget

Every cloud job has a timeout_sec budget, which defaults to 1 hour (3600 s). The reaper kills clusters whose runtime exceeds this budget, so a runaway job doesn’t keep billing indefinitely.

Set the budget per call:

compute_run(service="openfoam", command="bash Allrun", timeout_sec=7200)  # 2-hour cap
compute_run(service="openfoam", command="bash Allrun", timeout_sec=0)     # 0 disables the timeout wrapper

Or set the default once via CloudConfig.default_timeout_sec (see Tunables).
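The budget check itself is simple enough to sketch. The function below is a hypothetical illustration of the comparison the reaper would make, not its actual code; it assumes that timeout_sec=0 means "no cap", in line with the second call above.

```python
def over_budget(elapsed_sec: float, timeout_sec: int = 3600) -> bool:
    """Return True when a job has run past its wall-clock budget.

    Assumption: timeout_sec=0 disables the check entirely.
    """
    if timeout_sec == 0:
        return False  # budget disabled, never reap on time alone
    return elapsed_sec > timeout_sec  # "exceeds" is read as strictly greater
```

With the default budget, a job that has run for 3601 seconds would be reaped, and one at exactly 3600 seconds would not.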

Cluster lifecycle: stop, not down

Two ways a cluster can leave the active set:

  • stop — preserves the cluster’s attached disk and identity. A subsequent start restarts it in seconds with the same name. Cheap. This is the default end-of-task action.
  • down — destroys the cluster. The S3-backed workspace bucket survives, but on-cluster scratch is gone. Use only for explicit cleanup or quota-driven cleanup.

The agent’s prompt enforces this — at the end of a compute task it stops, not downs. If you see down happening implicitly, that’s a bug.

The session workspace

Every cluster job auto-mounts a per-session durable bucket at /workspace/:

<cloud>://sciagent-workspace-<session_id>/

Where <cloud> is whichever provider the job runs on (s3, gs, az, r2, oci). The bucket survives cluster teardown — outputs persist beyond the cluster that produced them. This is the “data tier” that compute → analyze → verify all share.
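As a quick illustration of the naming scheme, the helper below builds the bucket URI from a provider scheme and a session ID. The function name workspace_uri and the validation step are hypothetical; only the URI pattern and the list of schemes come from the text above.

```python
# Provider schemes listed above.
SUPPORTED_CLOUDS = ("s3", "gs", "az", "r2", "oci")


def workspace_uri(cloud: str, session_id: str) -> str:
    """Build the per-session workspace bucket URI for a given provider scheme."""
    if cloud not in SUPPORTED_CLOUDS:
        raise ValueError(f"unsupported cloud scheme: {cloud!r}")
    return f"{cloud}://sciagent-workspace-{session_id}/"
```

For example, workspace_uri("s3", "abc123") yields s3://sciagent-workspace-abc123/, the same shape as the URIs used in the materialize examples.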

Two data tiers, two tools:

  • materialize(uri=...) — pull a specific URI or a job’s outputs to local. Cloud-agnostic.
  • materialize_workspace(subpath=..., dest=...) — pull (a slice of) the session workspace bucket to local. Pairs with the auto-mount; what you write to /workspace/ in the cluster is reachable via this tool.

materialize_workspace(list_only=True, subpath="<run-id>/") is the cheap “what’s in the bucket?” probe.

Tools

compute_run

Launch a containerized compute job.

compute_run(
    service="openfoam",
    command="bash Allrun",
    mode="cluster",
    backend="skypilot",
    cluster_name="cfd-run-1",
    cpus=4,
    memory_gb=32,
    gpus=0,
    timeout_sec=3600,
)

Returns: {job_id, cluster_name, cluster_job_id, backend, ...}. Background by default — poll with bg_status / task_get or block with bg_wait / task_wait.

Service registry — pass service="<name>" to use a registered scientific image (see src/sciagent/services/registry.yaml). Or pass a raw image="..." for an arbitrary container.

Commit gate — when the optimizer’s estimated total exceeds $5.00, the tool prompts the user with a Sky-optimizer menu before launching. Tool-layer gate; the LLM cannot bypass it. Override via CloudConfig(commit_threshold_usd=...), env SCIAGENT_COMPUTE_COMMIT_THRESHOLD_USD, or ~/.sciagent/config.yaml (compute.commit_threshold_usd).
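The gate's decision reduces to one comparison. The sketch below is hypothetical (the function and constant names are not from the codebase) and assumes "exceeds" means strictly greater than the threshold.

```python
DEFAULT_COMMIT_THRESHOLD_USD = 5.00  # documented default


def needs_commit_prompt(estimated_total_usd: float,
                        threshold_usd: float = DEFAULT_COMMIT_THRESHOLD_USD) -> bool:
    """Return True when the estimated cost should trigger the user prompt."""
    return estimated_total_usd > threshold_usd
```

Under this reading, an estimate of exactly $5.00 launches without a prompt, and raising the threshold to 10.0 lets a $7.50 job through as well.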

compute_exec

Run a follow-up command on a warm cluster.

compute_exec(cluster_name="cfd-run-1", command="postProcess -func writeCellVolumes")

Cluster-mode equivalent of “I want to run another step against the same data without spinning up a new cluster.” Returns a cluster_job_id you can wait on with compute_cluster(action="wait_for_job").

compute_cluster

Cluster lifecycle, action-dispatched:

| Action | Effect |
| --- | --- |
| status | Sky cluster status (UP/STOPPED/INIT/AUTOSTOPPING/PENDING) + sciagent local manifest |
| wait_until_up | Block within one LLM turn until the cluster reaches UP (default 300 s) |
| wait_for_job | Block until a cluster_job_id reaches terminal state (cluster-mode bg_wait) |
| logs | Tail of a cluster-mode job’s stdout; on-disk cache fallback for post-autostop forensics |
| stop | Preserve the cluster (non-destructive). Default end-of-task action. |
| start | Restart a stopped cluster reusing its disk |
| down | Destroy the cluster; explicit cleanup only |
| autostop | Update idle threshold / wait_for / hook |
| refresh_mounts | Re-sync file_mounts via sky launch --no-setup (Sky’s canonical “point a warm cluster at new input data”) |

compute_cluster(action="status", cluster_name="cfd-run-1")
compute_cluster(action="stop", cluster_name="cfd-run-1")

materialize

Pull outputs from cloud storage to local.

materialize(uri="s3://sciagent-workspace-abc123/run-001/fields/", dest="./_outputs/fields/")
materialize(uri="s3://sciagent-workspace-abc123/run-001/", list_only=True)

Cloud-agnostic — works with s3://, gs://, az://, r2://, oci://.

materialize_workspace

Pull the session workspace bucket (or a slice) to local.

materialize_workspace(subpath="run-001/derived/", dest="./_outputs/derived/")
materialize_workspace(list_only=True)

Pairs with the cluster auto-mount: anything written to /workspace/<path> from a cluster job is reachable via materialize_workspace(subpath="<path>") from the agent’s local environment.
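The path mapping can be sketched as follows. This helper is hypothetical and only illustrates how an on-cluster path under /workspace/ corresponds to a subpath argument; whether the real tool requires a trailing slash on the subpath is not specified here, and this sketch drops it.

```python
from pathlib import PurePosixPath

MOUNT_POINT = PurePosixPath("/workspace")  # auto-mount location on the cluster


def workspace_subpath(cluster_path: str) -> str:
    """Translate an on-cluster path under /workspace/ into a bucket subpath.

    Raises ValueError when the path is not under the mount point.
    """
    relative = PurePosixPath(cluster_path).relative_to(MOUNT_POINT)
    return str(relative)
```

A file written to /workspace/run-001/derived/ on the cluster would then be requested with a subpath of run-001/derived.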

Configuration

Install with cloud extras

pip install '.[cloud]'        # AWS
pip install '.[cloud-gcp]'    # GCP
pip install '.[cloud-azure]'  # Azure
pip install '.[cloud-all]'    # All three

The base install ships without SkyPilot — the cloud* extras pull it in.

Cloud credentials

SciAgent inherits whatever credentials SkyPilot can find. Set up your provider once with the SkyPilot-supported flow (aws configure, gcloud auth application-default login, az login), then run sky check to confirm the credentials are picked up.

Tunables

Cloud-side knobs are bundled in the CloudConfig dataclass (kept separate from AgentConfig, which carries agent-loop concerns like model and tokens):

from sciagent import AgentConfig, AgentLoop
from sciagent.compute import CloudConfig

agent = AgentLoop(
    config=AgentConfig(),
    cloud_config=CloudConfig(
        commit_threshold_usd=10.0,           # raise the cost gate to $10
        workspace_store="s3",                # pin workspace bucket to AWS
        default_autostop_minutes=10,         # cluster autostop after 10 min idle
        default_timeout_sec=7200,            # raise per-job budget to 2 hours
    ),
)

| Knob | Env var | YAML key | Default |
| --- | --- | --- | --- |
| commit_threshold_usd | SCIAGENT_COMPUTE_COMMIT_THRESHOLD_USD | compute.commit_threshold_usd | 5.0 |
| workspace_store | SCIAGENT_WORKSPACE_STORE | | auto-detect |
| default_autostop_minutes | | | provider default |
| default_timeout_sec | | | 3600 (1 hour) |
| subagent_warm_resume_seconds | SCIAGENT_SUBAGENT_WARM_RESUME_SECONDS | subagent.warm_resume_seconds | |

Precedence per knob: env > CloudConfig field > yaml > built-in default.
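The precedence chain can be sketched as a short lookup. This is a hypothetical helper, not SciAgent's loader: the name resolve_knob, the use of None for "not set", and returning the raw environment string without type conversion are all assumptions made for illustration.

```python
import os
from typing import Any, Mapping, Optional


def resolve_knob(env_var: Optional[str], field_value: Any, yaml_value: Any,
                 default: Any, environ: Mapping[str, str] = os.environ) -> Any:
    """Resolve one knob: env > CloudConfig field > yaml > built-in default.

    None stands for "not set" at the CloudConfig and YAML layers.
    """
    if env_var and env_var in environ:
        return environ[env_var]  # raw string; the caller converts the type
    if field_value is not None:
        return field_value
    if yaml_value is not None:
        return yaml_value
    return default
```

Note the consequence of this ordering: an environment variable overrides a value passed explicitly in CloudConfig, so a stale export in the shell can silently win over code.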

Per-cluster overrides via compute_cluster(action="autostop", idle_minutes=N) win over default_autostop_minutes.

See API Reference → CloudConfig for the full Python surface.

End-to-end example

The Datacenter CFD case study is the canonical end-to-end use of cloud compute: spin up a SkyPilot cluster, run OpenFOAM, materialize the result fields locally, KDE-analyze, stop the cluster. Read it for the full pattern.

See also