Autonomous Pipelines (jqmc_workflow)#
Overview#
The jqmc_workflow package provides an autonomous pipeline engine for jQMC calculations. Users define a pipeline — a directed acyclic graph (DAG) of workflow steps — and the engine takes care of:
- Input generation — TOML input files are created automatically from explicit parameter values (with sensible defaults).
- Data transfer — Files are uploaded to (and downloaded from) remote supercomputers via SSH/SFTP using Paramiko.
- Job submission, monitoring, and collection — Jobs are submitted through the site’s scheduler (PBS / Slurm / local bash), polled until completion, and output files are fetched back.
- Dependency resolution — The DAG-based Launcher identifies which workflow steps are ready (all dependencies satisfied) and executes them in parallel using Python asyncio tasks.
- Target error-bar estimation — When a target_error (Ha) is specified, a small pilot run is executed first; the statistical error is used to estimate the number of production steps required, together with the estimated wall time.
A single Python script can therefore express a full QMC pipeline — from
wavefunction preparation (TREXIO → hamiltonian_data.h5) through VMC
optimization, MCMC production sampling, and LRDMC extrapolation — and run it
end-to-end with automatic restarts across interruptions.
Architecture#
run_pipeline.py
│
▼
┌────────┐
│Launcher│ DAG executor (asyncio)
└──┬─────┘
│ creates asyncio.Task per ready node
│
├──► Container("vmc")
│ └─► VMC_Workflow.async_launch()
│
├──► Container("mcmc-prod") ← runs in parallel
│ └─► MCMC_Workflow.async_launch()
│
└──► Container("lrdmc-ext") ← runs in parallel
└─► LRDMC_Ext_Workflow.async_launch()
├─► LRDMC_Workflow (alat=0.50) ┐
├─► LRDMC_Workflow (alat=0.40) ├ parallel
└─► LRDMC_Workflow (alat=0.25) ┘
Key components#
| Class | Role |
|---|---|
| Launcher | Executes a DAG of Container nodes in parallel using asyncio. |
| Container | Wraps any workflow together with its label, working directory, and input files. |
| FileFrom / ValueFrom | Declare inter-workflow dependencies (files or computed values). |
| WF_Workflow | Converts a TREXIO file to hamiltonian_data.h5. |
| VMC_Workflow | Jastrow / orbital optimization (VMC). |
| MCMC_Workflow | Production energy sampling (MCMC). |
| LRDMC_Workflow | Lattice-regularized diffusion Monte Carlo for a single \(a\) value. |
| LRDMC_Ext_Workflow | Runs multiple LRDMC_Workflow instances at different \(a\) values and extrapolates to \(a \to 0\). |
Target error-bar estimation#
When target_error is set, each workflow (MCMC and LRDMC) operates in
a pilot + production cycle. A separate queue may be used for the pilot
via pilot_queue_label (defaults to queue_label).
MCMC / VMC#
- Pilot — A short run of pilot_steps steps.
- Production — Step count estimated from the pilot error bar.
LRDMC#
LRDMC has an additional calibration stage to automatically determine
num_mcmc_per_measurement (GFMC projections per measurement) from
a target_survived_walkers_ratio (default 0.97):
1. Calibration (_pilot_a/_pilot1 – _pilot_a/_pilot3, parallel) — Three short LRDMC runs with num_mcmc_per_measurement = Ne × k × (0.3/alat)² (k = 2, 4, 6; \(N_e\) is the total electron count). A quadratic is fit to the observed survived-walkers ratio and the optimal num_mcmc_per_measurement is determined.
2. Error-bar pilot (_pilot_b) — A run with the calibrated num_mcmc_per_measurement; its error bar estimates the production step count.
3. Production (_1, _2, …) — Starts from scratch and accumulates statistics until target_error is achieved.
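The calibration arithmetic above can be sketched as follows. This is an illustration only, not the engine’s code: the electron count, the example survived-walkers ratios, and both helper names are hypothetical; only the candidate formula and the quadratic-fit idea come from the text.

```python
def candidate_nmpm(n_elec: int, k: int, alat: float) -> int:
    # num_mcmc_per_measurement = Ne * k * (0.3 / alat)^2 for k = 2, 4, 6
    return round(n_elec * k * (0.3 / alat) ** 2)

def calibrate_nmpm(points, target_ratio: float) -> float:
    # Fit a quadratic (Lagrange form) through the three (nmpm, ratio)
    # calibration points and solve ratio(nmpm) = target_ratio by bisection.
    (x1, y1), (x2, y2), (x3, y3) = points

    def ratio(x):
        return (y1 * (x - x2) * (x - x3) / ((x1 - x2) * (x1 - x3))
                + y2 * (x - x1) * (x - x3) / ((x2 - x1) * (x2 - x3))
                + y3 * (x - x1) * (x - x2) / ((x3 - x1) * (x3 - x2)))

    lo, hi = x1, x3  # the ratio grows with nmpm over this range
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ratio(mid) < target_ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical calibration for alat = 0.25 with 8 electrons:
cands = [candidate_nmpm(8, k, 0.25) for k in (2, 4, 6)]   # [23, 46, 69]
ratios = [0.940, 0.965, 0.978]                            # made-up pilot results
optimal = calibrate_nmpm(list(zip(cands, ratios)), 0.97)  # ~52.7 -> nmpm = 53
```

With these made-up ratios the fitted quadratic crosses the 0.97 target between the second and third candidates, so the calibrated value lands between them.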
If num_mcmc_per_measurement is given explicitly, the calibration
stage is skipped and only the error-bar pilot is executed.
For LRDMC_Ext_Workflow (multi-alat extrapolation), every alat value
independently runs its own calibration, error-bar pilot, and production
in parallel. There is no inter-alat interaction until the final
extrapolation step.
Step estimation formula#
The required number of effective production steps is estimated from the pilot error bar \(\sigma_{\text{pilot}}\) and the target error \(\sigma_{\text{target}}\) via

\[
N_{\text{eff,prod}} = N_{\text{eff,pilot}} \times \left(\frac{\sigma_{\text{pilot}}}{\sigma_{\text{target}}}\right)^{2} \times \frac{W_{\text{pilot}}}{W_{\text{prod}}},
\]
where \(N_{\text{eff}} = N_{\text{total}} - N_{\text{warmup}}\) is the
number of steps that actually contribute to statistics (warmup steps
are discarded during post-processing via -w), and
\(W = \text{walkers\_per\_MPI} \times \text{num\_MPI}\) is the
total number of walkers. The ratio
\(W_{\text{pilot}} / W_{\text{prod}}\) (the walker ratio) accounts
for the pilot queue using fewer (or more) MPI processes than the
production queue. The number of MPI processes is read from
queue_data.toml (num_cores, or mpi_per_node × nodes as
fallback).
The total production steps are then \(N_{\text{prod}} = N_{\text{eff,prod}} + N_{\text{warmup}}\).
The estimated production wall time is based on the net computation
time parsed from the pilot output (Net GFMC time for LRDMC,
Net total time for MCMC/VMC). Only this net portion
scales with the step count; overhead (JIT compilation, file I/O,
queue wait) is treated as a constant:
-- LRDMC Step Estimation Summary (a=0.25) --------
pilot steps = 100
warmup steps = 10
pilot error = 0.00393172 Ha
target error = 0.001 Ha
nmpm = 32
pilot MPI procs = 48
prod. MPI procs = 480
walker ratio = 0.1
estimated steps = 155
pilot wall time = 5m 12s
pilot net time = 2m 42s
est. prod. time = 6m 43s
--------------------------------------------------
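Under the definitions above, the estimate can be reproduced as a short sketch. Note the bare formula yields 150 steps for the summary’s inputs, whereas the engine reports 155, so the engine presumably applies an additional rounding or safety margin; the function names below are hypothetical.

```python
import math

def estimate_production_steps(pilot_steps, warmup, pilot_error,
                              target_error, walker_ratio):
    # N_eff,prod = N_eff,pilot * (sigma_pilot / sigma_target)^2 * (W_pilot / W_prod)
    n_eff_pilot = pilot_steps - warmup
    n_eff_prod = math.ceil(
        n_eff_pilot * (pilot_error / target_error) ** 2 * walker_ratio)
    return n_eff_prod + warmup  # N_prod = N_eff,prod + N_warmup

def estimate_wall_time(pilot_wall_s, pilot_net_s, pilot_steps, prod_steps):
    # Only the net computation time scales with steps; JIT compilation,
    # file I/O, and queue wait are treated as constant overhead.
    overhead = pilot_wall_s - pilot_net_s
    return overhead + pilot_net_s * prod_steps / pilot_steps

# Inputs from the summary above: 48/480 MPI procs -> walker ratio 0.1
steps = estimate_production_steps(100, 10, 0.00393172, 0.001, 48 / 480)
secs = estimate_wall_time(312, 162, 100, steps)  # pilot wall 5m12s, net 2m42s
```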
The first production run starts from scratch (no restart from the
pilot checkpoint, which lives in the _pilot_b/ subdirectory).
After each production run, the error bar is re-evaluated.
If it exceeds the target, additional steps are estimated and a
continuation run is launched automatically, up to max_continuation
times.
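The production/continuation cycle can be sketched as a small loop. This is a schematic only; `run_steps` stands in for the engine’s actual run-submit-measure step and is hypothetical.

```python
def production_loop(run_steps, target_error, max_continuation):
    # run_steps() performs one production (or continuation) run and
    # returns the re-evaluated error bar (hypothetical stand-in).
    error = run_steps()            # first production run, from scratch
    continuations = 0
    while error > target_error and continuations < max_continuation:
        error = run_steps()        # automatic continuation with extra steps
        continuations += 1
    return error, continuations

# Demo with a fake run whose error bar shrinks on each call:
errs = iter([0.004, 0.002, 0.0009])
final_error, n_cont = production_loop(
    lambda: next(errs), target_error=0.001, max_continuation=5)
```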
Restart behavior#
Every job is recorded in workflow_state.toml with a lifecycle:
submitted → completed → fetched
On restart, the engine checks each job’s status:
- fetched — Input generation and submission are both skipped.
- submitted / completed — Input is not regenerated; the job is resumed (polled or fetched).
- No record — A fresh input file is generated and the job is submitted.
This means a pipeline can be interrupted at any point (Ctrl-C, node failure, wall-time limit) and simply re-run; it will pick up exactly where it left off.
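The decision table above can be expressed as a tiny sketch (illustration only; the function name is hypothetical, not the engine’s actual code):

```python
def restart_action(status):
    # Map a job's recorded status in workflow_state.toml to the
    # action taken on restart, per the rules above.
    if status == "fetched":
        return "skip"                  # input generation and submission skipped
    if status in ("submitted", "completed"):
        return "resume"                # don't regenerate input; poll or fetch
    return "generate_and_submit"       # no record: fresh input, fresh submission
```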
Configuration#
Configuration files are managed in the following directory hierarchy. The engine looks for a project-local override first, then falls back to the user-global directory:
- ./jqmc_setting_local/ — project-local override (if it exists in CWD)
- ~/.jqmc_setting/ — user-global default
On the very first run, if neither directory exists, the template shipped
with the package is copied to ~/.jqmc_setting/ and the user is asked
to edit it.
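The lookup order amounts to a one-line decision (a minimal sketch, assuming the directory names above; the helper itself is hypothetical):

```python
from pathlib import Path

def settings_dir(cwd: Path, home: Path) -> Path:
    # Project-local override wins; otherwise fall back to the
    # user-global default directory.
    local = cwd / "jqmc_setting_local"
    return local if local.is_dir() else home / ".jqmc_setting"
```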
Directory structure#
~/.jqmc_setting/
├── machine_data.yaml # Server machine definitions
├── localhost/ # Settings for localhost
│ ├── queue_data.toml
│ └── submit_mpi.sh # (name is user-defined)
├── my-cluster/ # Settings for a remote cluster (nickname)
│ ├── queue_data.toml
│ └── submit_mpi.sh
└── ...
machine_data.yaml#
A YAML file that defines each compute machine. The top-level keys are
nicknames (arbitrary labels); they do not need to match the SSH host.
Remote machines use the ssh_host field to specify the SSH connection
target.
Example#
localhost:
  machine_type: local
  queuing: false
  workspace_root: /home/user/jqmc_work

my-cluster:                      # nickname (freely chosen)
  ssh_host: pbs-cluster          # Host alias in ~/.ssh/config
  machine_type: remote
  queuing: true
  workspace_root: /home/user/jqmc_work
  jobsubmit: /opt/pbs/bin/qsub
  jobcheck: /opt/pbs/bin/qstat
  jobdel: /opt/pbs/bin/qdel
  jobnum_index: 0
Parameters#
| Key | Type | Required | Description |
|---|---|---|---|
| ssh_host | String | When machine_type: remote | SSH connection target. Typically a Host alias in ~/.ssh/config. |
| machine_type | String | Yes | local or remote. |
| queuing | Boolean | Yes | Whether jobs go through a batch scheduler (true) or run directly (false). |
| workspace_root | String (path) | Yes | Root directory for file management. Upload/download paths are resolved relative to this directory. Must be set on both localhost and the remote server. |
| jobsubmit | String (command) | When queuing: true | Command to submit a job (e.g. /opt/pbs/bin/qsub). |
| jobcheck | String (command) | When queuing: true | Command to check job status (e.g. /opt/pbs/bin/qstat). |
| jobdel | String (command) | When queuing: true | Command to cancel a job (e.g. /opt/pbs/bin/qdel). |
| jobnum_index | Integer | When queuing: true | 0-based column index of the job ID in the scheduler command output. |
| | String | No | IP address or hostname of the remote machine. Usually the SSH alias in ~/.ssh/config is sufficient. |
SSH connection#
For machine_type: remote, the ssh_host field specifies the SSH
connection target. This is typically a Host alias defined in
~/.ssh/config. The engine reads ~/.ssh/config via Paramiko to
resolve HostName, User, IdentityFile, Port, and ProxyJump
(multi-hop) settings.
Your SSH key must be accessible without a passphrase prompt (use
ssh-agent or a passphrase-less key).
Note:
CanonicalizeHostname in ~/.ssh/config may trigger a Paramiko bug. The engine automatically works around this by removing the directive in-memory before connecting.
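The Host-alias lookup works roughly as below. This toy parser is for illustration only: the engine itself relies on Paramiko for full ~/.ssh/config resolution (including IdentityFile and multi-hop ProxyJump), and the example host, user, and function name are all made up.

```python
def resolve_ssh_alias(config_text: str, alias: str) -> dict:
    # Minimal ssh_config lookup: find the Host block that matches the
    # alias and collect its first-seen settings (lower-cased keys).
    settings, current_hosts = {}, []
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        parts = line.split(None, 1)
        key = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "host":
            current_hosts = value.split()
        elif alias in current_hosts and key not in settings:
            settings[key] = value
    return settings

config = """
Host pbs-cluster
    HostName login.example.org
    User alice
    Port 22
"""
resolved = resolve_ssh_alias(config, "pbs-cluster")
```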
queue_data.toml#
Defines batch queue presets for each machine. Written in TOML format with queue labels as table keys.
Example#
[default]
submit_template = 'submit_mpi.sh'
max_job_submit = 10
queue = 'batch'
num_cores = 120
omp_num_threads = 1
nodes = 1
mpi_per_node = 120
max_time = '24:00:00'
[large]
submit_template = 'submit_mpi.sh'
max_job_submit = 10
queue = 'batch'
num_cores = 480
omp_num_threads = 1
nodes = 4
mpi_per_node = 120
max_time = '24:00:00'
Parameters#
| Key | Type | Required | Description |
|---|---|---|---|
| submit_template | String | Yes | Name of the job submission script template file placed in the same directory (e.g. submit_mpi.sh). |
| max_job_submit | Integer | Yes | Maximum number of concurrently submitted jobs for this queue. |
| custom keys | Any | No | Any additional key-value pairs. These are substituted into job script templates as _KEY_ placeholders. |
TOML format notes#
- Strings must be quoted: queue = "small".
- Booleans: lowercase true / false only.
- Time values (e.g. max_time) must be quoted to avoid TOML local-time parsing: max_time = "24:00:00".
Job script templates#
Shell scripts placed in the per-machine directory, referenced by the
submit_template key in queue_data.toml. Template variables
(written as _KEY_ with underscores) are replaced at submission time.
You can name the file anything you like (e.g. submit_gpu.sh).
Predefined variables#
| Variable | Description |
|---|---|
| _INPUT_ | Path to the input file |
| _OUTPUT_ | Path to the output file |
| _JOBNAME_ | Job name |
Custom variables#
All keys from queue_data.toml are available in upper-case with
surrounding underscores. For example, num_cores = 48 becomes
_NUM_CORES_ in the template.
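The substitution rule can be sketched in a few lines (an illustration, assuming simple string replacement; the helper name is hypothetical):

```python
def render_template(template: str, params: dict) -> str:
    # Each key (from queue_data.toml or the predefined variables)
    # becomes an _UPPER_CASE_ placeholder with surrounding underscores.
    for key, value in params.items():
        template = template.replace(f"_{key.upper()}_", str(value))
    return template

line = render_template("mpirun -np _NUM_CORES_ jqmc _INPUT_",
                       {"num_cores": 48, "input": "mcmc.toml"})
# line == "mpirun -np 48 jqmc mcmc.toml"
```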
Example (submit_mpi.sh for PBS)#
#!/bin/sh
#PBS -N _JOBNAME_
#PBS -q _QUEUE_
#PBS -l nodes=_NODES_:ppn=_MPI_PER_NODE_
#PBS -l walltime=_MAX_TIME_
export OMP_NUM_THREADS=_OMP_NUM_THREADS_
INPUT=_INPUT_
OUTPUT=_OUTPUT_
mpirun -np _NUM_CORES_ jqmc ${INPUT} > ${OUTPUT} 2>&1
Example (submit_serial.sh)#
#!/bin/sh
#PBS -N _JOBNAME_
#PBS -q _QUEUE_
#PBS -l nodes=_NODES_
#PBS -l walltime=_MAX_TIME_
export OMP_NUM_THREADS=_OMP_NUM_THREADS_
INPUT=_INPUT_
OUTPUT=_OUTPUT_
jqmc ${INPUT} > ${OUTPUT} 2>&1
Pipeline example#
A minimal pipeline script that runs VMC → MCMC + LRDMC extrapolation:
from jqmc_workflow import (
Container,
FileFrom,
Launcher,
LRDMC_Ext_Workflow,
MCMC_Workflow,
ValueFrom,
VMC_Workflow,
)
server = "pbs-cluster"
h5 = "hamiltonian_data.h5"
vmc = Container(
label="vmc",
dirname="vmc",
input_files=[h5],
workflow=VMC_Workflow(
server_machine_name=server,
hamiltonian_file=h5,
queue_label="default",
pilot_queue_label="small",
jobname="vmc",
target_error=0.001,
),
)
mcmc = Container(
label="mcmc-prod",
dirname="mcmc_prod",
input_files=[
h5,
FileFrom("vmc", "hamiltonian_data_opt_step_*.h5"),
],
workflow=MCMC_Workflow(
server_machine_name=server,
hamiltonian_file=h5,
queue_label="default",
pilot_queue_label="small",
jobname="mcmc",
target_error=0.001,
),
)
lrdmc = Container(
label="lrdmc-ext",
dirname="lrdmc_ext",
input_files=[
h5,
FileFrom("vmc", "hamiltonian_data_opt_step_*.h5"),
],
workflow=LRDMC_Ext_Workflow(
server_machine_name=server,
alat_list=[0.10, 0.20, 0.25, 0.30],
hamiltonian_file=h5,
queue_label="default",
pilot_queue_label="small",
jobname_prefix="lrdmc",
target_survived_walkers_ratio=0.97,
E_scf=ValueFrom("mcmc-prod", "energy"),
target_error=0.001,
),
)
pipeline = Launcher(workflows=[vmc, mcmc, lrdmc])
pipeline.launch()
In this example, mcmc-prod and lrdmc-ext depend on vmc (via
FileFrom). Additionally, lrdmc-ext depends on mcmc-prod (via
ValueFrom for E_scf), so the DAG becomes
VMC → MCMC → LRDMC-ext. The target_survived_walkers_ratio
triggers automatic calibration of num_mcmc_per_measurement
independently at each lattice spacing. All alat values run their
calibration, error-bar pilot, and production phases in parallel.
Job Manager CLI (jqmc-jobmanager)#
The jqmc-jobmanager command-line tool monitors and manages running
pipelines. It recursively discovers workflow_state.toml files under
the current directory and displays a summary tree.
Commands#
| Command | Description |
|---|---|
| show | Print the workflow tree. Add --id to print detailed information for a single job. |
| check | Print the tree and query the scheduler on the remote machine for live job status. Use -s to specify the server machine. |
| del | Cancel a queued/running job and update the workflow’s recorded status. |
Common options#
| Option | Description |
|---|---|
| --id | Numeric job ID shown in the tree (0-based). |
| -s | Server machine name (e.g. -s pbs-cluster). |
Usage examples#
Show the full workflow tree:
jqmc-jobmanager show
Show tree and detailed info for job 2:
jqmc-jobmanager show --id 2
Check live queue status for job 4 on pbs-cluster:
jqmc-jobmanager check --id 4 -s pbs-cluster
Cancel job 4 on pbs-cluster:
jqmc-jobmanager del --id 4 -s pbs-cluster
Example output#
==============================================================================
Workflow Job Tree
Root: /home/user/project
==============================================================================
ID Status Label Type Server Job#
--------------------------------------------------------------------------
0 completed wf WF_Workflow ? -
dir: 00_wf
1 completed vmc VMC_Workflow pbs-cluster 12345
dir: 01_vmc
energy: -17.17 +- 0.005 Ha
2 running mcmc-prod MCMC_Workflow pbs-cluster 12367
dir: 02_mcmc
3 running lrdmc-ext LRDMC_Ext_Workflow ? -
dir: 03_lrdmc_ext
For the full API reference, see API reference for the pipeline (jqmc_workflow).