Autonomous Pipelines (jqmc_workflow)#
Overview#
The jqmc_workflow package provides an autonomous pipeline engine for jQMC calculations. Users define a pipeline — a directed acyclic graph (DAG) of workflow steps — and the engine takes care of:
Input generation — TOML input files are created automatically from explicit parameter values (with sensible defaults).
Data transfer — Files are uploaded to (and downloaded from) remote supercomputers via SSH/SFTP using Paramiko.
Job submission, monitoring, and collection — Jobs are submitted through the site's scheduler (PBS / Slurm / local bash), polled until completion, and output files are fetched back.
Dependency resolution — The DAG-based Launcher identifies which workflow steps are ready (all dependencies satisfied) and executes them in parallel using Python asyncio tasks.
Target error-bar estimation — When a target_error (Ha) is specified, a small pilot run is executed first; the statistical error is used to estimate the number of production steps required, together with the estimated wall time.
A single Python script can therefore express a full QMC pipeline — from
wavefunction preparation (TREXIO → hamiltonian_data.h5) through VMC
optimization, MCMC production sampling, and LRDMC extrapolation — and run it
end-to-end with automatic restarts across interruptions.
Architecture#
run_pipeline.py
│
▼
┌────────┐
│Launcher│ DAG executor (asyncio)
└──┬─────┘
│ creates asyncio.Task per ready node
│
├──► Container("vmc")
│ └─► VMC_Workflow.configure() → .run()
│
├──► Container("mcmc-prod") ← runs in parallel
│ └─► MCMC_Workflow.configure() → .run()
│
└──► Container("lrdmc-ext") ← runs in parallel
└─► LRDMC_Ext_Workflow.configure() → .run()
├─► LRDMC_Workflow (alat=0.50) ┐
├─► LRDMC_Workflow (alat=0.40) ├ parallel
└─► LRDMC_Workflow (alat=0.25) ┘
Key components#
| Class / Module | Role |
|---|---|
| Launcher | Executes a DAG of Container nodes as parallel asyncio tasks. |
| Container | Wraps a workflow as a DAG node with a label, dirname, and input_files declarations. |
| FileFrom / ValueFrom | Declare inter-workflow dependencies (files or computed values). |
| WF_Workflow | Converts a TREXIO file to hamiltonian_data.h5. |
| VMC_Workflow | Jastrow / orbital optimization. |
| MCMC_Workflow | Production energy sampling. |
| LRDMC_Workflow | Lattice-regularized diffusion Monte Carlo for a single \(a\) value. |
| LRDMC_Ext_Workflow | Runs multiple LRDMC_Workflow instances (one per lattice spacing) and extrapolates to the \(a^2 \to 0\) limit. |
| ScientificPhase | Enum defining the scientific phases of a workflow session (INIT → SCF → WF_BUILD → VMC → MCMC → LRDMC → COMPLETED). See Phase management. |
| WorkflowStatus, JobStatus | Enums for workflow-level and per-job status values. See Status enums. |
Inter-workflow data passing (FileFrom / ValueFrom)#
FileFrom and ValueFrom declare inter-workflow dependencies inside a
Container definition. At launch time, the Launcher resolves them
to actual paths or values.
FileFrom(label, filename)#
Pass a file produced by an upstream workflow. filename can be:
A static string — when the exact name is known at definition time:
FileFrom("vmc", "hamiltonian_data_opt_step_9.h5")
A ValueFrom object — when the name is determined at runtime (e.g. VMC early convergence produces a step number that cannot be predicted):
FileFrom("vmc", ValueFrom("vmc", "optimized_hamiltonian"))
Here the ValueFrom is resolved first, yielding the actual basename (e.g. "hamiltonian_data_opt_step_91.h5"), which is then used to locate the file in the upstream directory.
Important: file renaming. When a dynamic FileFrom resolves to a name that differs from what the downstream workflow expects (e.g. hamiltonian_data_opt_step_91.h5 vs. hamiltonian_data.h5), use rename_input_files to map it to the expected name. Entries of None keep the original name. A pre-launch validation check will raise FileNotFoundError before any job is submitted if required files are missing.
Container(
    label="mcmc",
    dirname="02_mcmc",
    input_files=[
        FileFrom("vmc", ValueFrom("vmc", "optimized_hamiltonian")),
    ],
    rename_input_files=["hamiltonian_data.h5"],
    workflow=MCMC_Workflow(...),
)
With multiple input files, use None to skip renaming for specific entries:
input_files=[h5, FileFrom("vmc", ValueFrom("vmc", "optimized_hamiltonian"))],
rename_input_files=[None, "hamiltonian_data.h5"],
ValueFrom(label, key)#
Pass a scalar value from an upstream workflow’s output_values
dict. The available keys depend on the workflow class — see the
table below.
Available output_values keys#
Each workflow populates output_values on completion. These keys
can be referenced by downstream workflows via
ValueFrom("label", "key").
VMC_Workflow#
| Key | Type | Description |
|---|---|---|
| optimized_hamiltonian |  | Basename of the last optimised Hamiltonian file (e.g. hamiltonian_data_opt_step_91.h5). |
|  |  | Basename of the restart checkpoint file. |
|  |  | Estimated MCMC steps per optimisation step (automatic mode). |
|  |  | Same as above, in fixed-step mode. |
|  |  | Energy from the last optimisation step (Ha). |
|  |  | Statistical error on the energy of the last optimisation step (Ha). |
|  |  | Average S/N over the trailing window (force convergence only). |
|  |  | S/N of the last optimisation step. |
|  |  | Slope of energy vs. step (energy-slope check only). |
|  |  | Standard deviation of the energy slope. |
MCMC_Workflow#
| Key | Type | Description |
|---|---|---|
| energy |  | VMC energy (Ha). |
|  |  | Statistical error on energy (Ha). |
|  |  | Basename of the restart checkpoint file. |
|  |  | Atomic forces (only when forces are computed). |
|  |  | Estimated total measurement steps (automatic mode). |
|  |  | Same as above, in fixed-step mode. |
LRDMC_Workflow#
| Key | Type | Description |
|---|---|---|
|  |  | DMC energy (Ha). |
|  |  | Statistical error on the DMC energy (Ha). |
|  |  | Lattice spacing used for this run. |
|  |  | Basename of the restart checkpoint file. |
|  |  | Atomic forces (only when forces are computed). |
|  |  | Estimated total measurement steps. |
|  |  | GFMC projections per measurement (GFMC_n mode only). |
|  |  | Imaginary-time projection step (GFMC_t mode only). |
LRDMC_Ext_Workflow#
| Key | Type | Description |
|---|---|---|
|  |  | Continuum-limit (\(a^2 \to 0\)) extrapolated energy (Ha). |
|  |  | Statistical error on the extrapolated energy (Ha). |
|  |  | Per-alat energy/error results. |
|  |  | Error messages for alat runs that failed. |
|  |  | Top-level error message (only on failure). |
Target error-bar estimation#
When target_error is set, each workflow (MCMC and LRDMC) operates in
a pilot + production cycle. A separate queue may be used for the pilot
via pilot_queue_label (defaults to queue_label).
MCMC / VMC#
Pilot — A short run of pilot_steps steps.
Production — Step count estimated from the pilot error bar.
LRDMC#
LRDMC has an additional calibration stage to automatically determine
num_projection_per_measurement (GFMC projections per measurement) from
a target_survived_walkers_ratio (default 0.97):
Calibration (_pilot_a/_pilot1 – _pilot_a/_pilot3, parallel) — Three short LRDMC runs with num_projection_per_measurement = Ne × k × (0.3/alat)² (k = 2, 4, 6; \(N_e\) is the total electron count). A quadratic is fit to the observed survived-walkers ratio and the optimal num_projection_per_measurement is determined.
Error-bar pilot (_pilot_b) — A run with the calibrated num_projection_per_measurement; its error bar estimates the production step count.
Production (_1, _2, …) — Start from scratch, accumulate statistics until target_error is achieved.
If num_projection_per_measurement is given explicitly, the calibration
stage is skipped and only the error-bar pilot is executed.
For LRDMC_Ext_Workflow (multi-alat extrapolation), every alat value
independently runs its own calibration, error-bar pilot, and production
in parallel. There is no inter-alat interaction until the final
extrapolation step.
AO basis optimization#
VMC_Workflow supports optimizing the atomic-orbital (AO) Gaussian
basis parameters alongside Jastrow / orbital coefficients. Four
boolean flags control which basis parameters are included in the
optimization:
| Parameter | Default | Description |
|---|---|---|
| opt_J3_basis_exp | None | Optimize J3 (three-body Jastrow) AO Gaussian exponents |
| opt_J3_basis_coeff | None | Optimize J3 AO contraction coefficients |
| opt_lambda_basis_exp | None | Optimize Geminal AO Gaussian exponents |
| opt_lambda_basis_coeff | None | Optimize Geminal AO contraction coefficients |
opt_lambda_basis_exp and opt_lambda_basis_coeff cannot be combined
with opt_with_projected_MOs (changing Geminal AO exponents/coefficients
invalidates the overlap matrix used by the MO projection).
opt_J3_basis_exp and opt_J3_basis_coeff can be used together
with opt_with_projected_MOs.
Primitives belonging to the same shell (same atom, same \(l\), same initial radial parameters) are constrained to share identical values throughout optimization. See Variational Monte Carlo (VMC) for details on the shell-sharing constraint mechanism.
When set, the corresponding parameters are passed through to the
jqmc input TOML via resolve_with_defaults(). When left as None
(the default), the jqmc binary applies its own defaults (false).
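As a sketch (assuming the keyword arguments are accepted exactly as named above), enabling only the J3 basis optimization looks like:
from jqmc_workflow import VMC_Workflow

# Optimize the three-body Jastrow AO basis in addition to the usual parameters.
# The Geminal-basis flags are left at their defaults so that MO projection
# (opt_with_projected_MOs) remains compatible.
vmc_wf = VMC_Workflow(
    server_machine_name="pbs-cluster",
    hamiltonian_file="hamiltonian_data.h5",
    queue_label="default",
    jobname="vmc",
    target_error=0.001,
    opt_J3_basis_exp=True,
    opt_J3_basis_coeff=True,
)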
VMC convergence checks#
After all production runs complete, the VMC workflow checks whether the optimization has converged. Two independent criteria are available; when both are active, both must pass for convergence.
Signal-to-noise (S/N) check#
Enabled when target_snr is set (not None). The workflow
averages the signal-to-noise ratio (S/N = max(|f| / |std f|)) over
the last snr_avg_window optimization steps (default 5). If there
are fewer steps than the window size, all available values are used.
The convergence criterion is:
\[
\frac{1}{W} \sum_{i=1}^{W} \left(\mathrm{S/N}\right)_i \;\geq\; \text{target\_snr},
\]
where \(W\) is snr_avg_window.
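The windowed average can be pictured with a short standalone sketch (an illustration, not the engine's own code):
import numpy as np

def snr_converged(snr_history, target_snr, snr_avg_window=5):
    """Average the per-step S/N over the trailing window (or over all steps
    if fewer are available) and compare against target_snr."""
    tail = np.asarray(snr_history[-snr_avg_window:], dtype=float)
    return tail.mean() >= target_snr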
Energy-slope check#
Enabled when energy_slope_sigma_threshold is set (not None).
A weighted linear regression is fitted to the last
energy_slope_window_size optimisation steps (default 5):
\[
E_i \;\approx\; a + b\, i .
\]
The optimisation is considered converged (plateau) when the slope \(b\) is not significantly negative:
\[
b \;\geq\; -\,\sigma_b \times \text{threshold},
\]
where \(\sigma_b\) is the standard error of the fitted slope and the threshold is energy_slope_sigma_threshold. If instead \(b < -\sigma_b \times \text{threshold}\), the energy is still decreasing and optimisation has not yet plateaued.
Combined verdict#
| target_snr | energy_slope_sigma_threshold | Behaviour |
|---|---|---|
| None | None | No convergence check; always succeeds |
| set | None | S/N check only |
| None | set | Energy-slope check only |
| set | set | Both must pass |
All numerical values (averaged S/N, slope, slope std) are recorded
in output_values for downstream inspection.
In fixed-step mode (num_mcmc_steps is set), the convergence checks
are not performed.
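In automatic mode, a workflow with both criteria active can be sketched as follows (keyword names as introduced above; the numerical values are illustrative, not recommendations):
from jqmc_workflow import VMC_Workflow

# Both checks are active, so the optimization loop exits early only when the
# windowed S/N and the energy-slope test both pass.
vmc_wf = VMC_Workflow(
    server_machine_name="pbs-cluster",
    hamiltonian_file="hamiltonian_data.h5",
    queue_label="default",
    jobname="vmc",
    target_error=0.001,
    target_snr=10.0,
    energy_slope_sigma_threshold=2.0,
)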
Early exit on convergence#
When convergence criteria are set, the VMC production loop exits
early as soon as all active checks pass, instead of always running
the full max_continuation steps. This avoids wasting compute
time once the optimization has plateaued.
On re-runs with changed criteria (e.g. lowering target_snr),
convergence is re-evaluated from the fetched results before
submitting new jobs. If the existing results already satisfy the
new criteria, the workflow completes immediately without launching
additional runs.
When no convergence criterion is set (target_snr = None and
energy_slope_sigma_threshold = None), all max_continuation
steps run unconditionally. If convergence is not achieved
after exhausting all production steps, the workflow returns
FAILED.
Step estimation formula#
The required number of production steps is estimated via
\[
N_{\text{eff,prod}} \;\approx\; N_{\text{eff,pilot}} \times
\left(\frac{\varepsilon_{\text{pilot}}}{\varepsilon_{\text{target}}}\right)^{2} \times
\frac{W_{\text{pilot}}}{W_{\text{prod}}},
\]
where \(N_{\text{eff}} = N_{\text{total}} - N_{\text{warmup}}\) is the
number of steps that actually contribute to statistics (warmup steps
are discarded during post-processing via -w), and
\(W = \text{walkers\_per\_MPI} \times \text{num\_MPI}\) is the
total number of walkers. The ratio
\(W_{\text{pilot}} / W_{\text{prod}}\) (the walker ratio) accounts
for the pilot queue using fewer (or more) MPI processes than the
production queue. The number of MPI processes is read from
queue_data.toml (num_cores, or mpi_per_node × nodes as
fallback).
The total production steps are then \(N_{\text{prod}} = N_{\text{eff,prod}} + N_{\text{warmup}}\).
The estimated production wall time is based on the net computation
time parsed from the pilot output (Net GFMC time for LRDMC,
Net total time for MCMC/VMC). Only this net portion
scales with the step count; overhead (JIT compilation, file I/O,
queue wait) is treated as a constant:
\[
T_{\text{prod}} \;\approx\; T_{\text{net,pilot}} \times \frac{N_{\text{prod}}}{N_{\text{pilot}}}
\;+\; \bigl(T_{\text{wall,pilot}} - T_{\text{net,pilot}}\bigr).
\]
-- LRDMC Step Estimation Summary (a=0.25) --------
pilot steps = 100
warmup steps = 10
pilot error = 0.00393172 Ha
target error = 0.001 Ha
nmpm = 32
pilot MPI procs = 48
prod. MPI procs = 480
walker ratio = 0.1
estimated steps = 155
pilot wall time = 5m 12s
pilot net time = 2m 42s
est. prod. time = 6m 43s
--------------------------------------------------
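A back-of-the-envelope version of this estimate is sketched below (standalone illustration; the engine may apply additional rounding or safety margins, and walkers per MPI process are assumed equal in the pilot and production queues):
import math

def estimate_production_steps(pilot_steps, warmup_steps, pilot_error,
                              target_error, pilot_mpi, prod_mpi):
    """1/sqrt(N) scaling: effective pilot steps, squared error ratio,
    and the pilot/production walker ratio."""
    n_eff_pilot = pilot_steps - warmup_steps
    walker_ratio = pilot_mpi / prod_mpi
    n_eff_prod = n_eff_pilot * (pilot_error / target_error) ** 2 * walker_ratio
    return math.ceil(n_eff_prod) + warmup_steps

# e.g. estimate_production_steps(100, 10, 0.0039, 0.001, 48, 480)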
The first production run starts from scratch (no restart from the
pilot checkpoint, which lives in the _pilot_b/ subdirectory).
After each production run, the error bar is re-evaluated.
If it exceeds the target, additional steps are estimated and a
continuation run is launched automatically, up to max_continuation
times.
Target error not met#
If max_continuation runs complete but the statistical error still
exceeds target_error, the workflow logs a warning and returns
COMPLETED (not FAILED). The calculation itself succeeded; only
the error-bar criterion was not fully met. This applies to both
MCMC_Workflow and LRDMC_Workflow.
Continuation with tighter target error#
When re-running a pipeline with a stricter target_error (e.g.
lowering from 1e-3 to 5e-4), the workflow detects that the
cached error from previous runs exceeds the new target and
automatically re-estimates additional steps from the accumulated
data. Continuation runs are launched from where the previous
execution left off, up to max_continuation.
Pre-launch validation#
Before any job is submitted, the engine verifies that all required
files are present in the project directory. Every resolved entry in
input_files (after renaming) must exist. Workflow-internal files
(e.g. hamiltonian_file) are not checked because some workflows
(e.g. WF_Workflow) produce them rather than consume them.
If any file is missing, a FileNotFoundError is raised immediately
with a message listing the missing files:
FileNotFoundError: [mcmc-N2-0.80] Required file(s) missing in '05_mcmc/'
before workflow launch: ['hamiltonian_data.h5'].
Check that input_files and rename_input_files are configured correctly.
This catches misconfigured rename_input_files (e.g. a dynamic
FileFrom + ValueFrom that produces hamiltonian_data_opt_step_91.h5
but the workflow expects hamiltonian_data.h5) before wasting
compute resources.
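The check itself amounts to a simple existence test, sketched below (not the engine's code; the error message mirrors the one shown above):
from pathlib import Path

def validate_input_files(workdir, resolved_names, label):
    """Raise before submission if any resolved (and renamed) input file is absent."""
    missing = [n for n in resolved_names if not (Path(workdir) / n).exists()]
    if missing:
        raise FileNotFoundError(
            f"[{label}] Required file(s) missing in '{workdir}/' before workflow "
            f"launch: {missing}. Check that input_files and rename_input_files "
            "are configured correctly."
        )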
Additionally, when the project directory already exists from a previous interrupted run, any input files that are missing (but available from the source) are automatically copied in. Existing files are not overwritten.
Post-completion cleanup (cleanup_patterns)#
When large checkpoint files (e.g. restart.h5, hamiltonian_opt*.h5)
are no longer needed after a successful run, the cleanup_patterns
parameter can be used to automatically delete them.
MCMC_Workflow(
server_machine_name="cluster",
target_error=0.001,
cleanup_patterns=["restart.h5", "hamiltonian_opt*.h5"],
)
Behaviour:
cleanup_patterns accepts a list of glob patterns (e.g. ["restart.h5", "hamiltonian_opt*.h5"]).
Matching is recursive — patterns are applied to the workflow directory and all subdirectories (e.g. _pilot/, _pilot_a/, _pilot_b/).
Local files matching the patterns are always deleted.
Remote files are deleted only when the workflow targets a remote machine (server_machine_name is not "localhost").
Cleanup runs after CompletionStatus.OK is confirmed — it never touches files while the workflow might still need them for continuation.
Cleanup failures are logged as warnings and never cause a completed workflow to fail.
Default is an empty list (no cleanup), preserving backward compatibility.
This is especially useful for massively-parallel MCMC runs where
restart.h5 can grow to tens of gigabytes.
All workflow classes (VMC_Workflow, MCMC_Workflow,
LRDMC_Workflow, LRDMC_Ext_Workflow) support this parameter.
For LRDMC_Ext_Workflow, the patterns are passed through to each
child LRDMC_Workflow.
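On the local side, the recursive matching amounts to a glob pass over the workflow directory, roughly like this sketch (remote deletion and error handling omitted):
from pathlib import Path

def cleanup_local(workdir, cleanup_patterns):
    """Delete files matching each glob pattern in the workflow directory and
    all of its subdirectories (e.g. _pilot_a/, _pilot_b/)."""
    for pattern in cleanup_patterns:
        for path in Path(workdir).rglob(pattern):
            if path.is_file():
                path.unlink()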
Restart behavior#
Every job is recorded in workflow_state.toml with a lifecycle:
submitted → completed → fetched
On restart, the engine checks each job’s status:
fetched — Input generation and submission are both skipped.
submitted / completed — Input is not regenerated; the job is resumed (polled or fetched).
No record — A fresh input file is generated and the job is submitted.
This means a pipeline can be interrupted at any point (Ctrl-C, node failure, wall-time limit) and simply re-run; it will pick up exactly where it left off.
When a job leaves the scheduler queue, the engine automatically collects
job accounting data (if jobacct is configured in machine_data.yaml)
before fetching output files. The accounting command and output file
path are stored in the corresponding [[jobs]] record. See
Job accounting.
Phase management#
A workflow session progresses through a sequence of scientific phases
defined by the ScientificPhase enum (module _phase):
INIT → SCF → WF_BUILD → VMC_PILOT → VMC → MCMC_PILOT → MCMC
↓
COMPLETED ← LRDMC_FIT ← LRDMC ← LRDMC_PILOT
Not every pipeline uses every phase (e.g. a VMC-only pipeline skips
LRDMC phases). The allowed transitions are defined in
PHASE_TRANSITIONS — for example, from VMC you may advance to
MCMC_PILOT, MCMC, LRDMC_PILOT, LRDMC, or COMPLETED.
Each phase has a list of allowed actions (PHASE_ALLOWED_ACTIONS)
that further depends on the current WorkflowStatus:
When status == RUNNING, configuration actions (configure_*) are filtered out.
When status == FAILED, only recovery actions (recover_*) and rollback_phase are available.
A set of always-allowed actions (advance_phase, rollback_phase,
close_session, register_artifact, mark_unhealthy) is appended
regardless of phase/status.
The require_action() function enforces these rules at the boundary
between an MCP tool call and a workflow method — if the action is not
permitted, a ValueError is raised immediately.
Status enums#
Workflow status and job status are represented by str-based enums
(WorkflowStatus, JobStatus) so they remain human-readable in
workflow_state.toml and can be compared directly with strings.
WorkflowStatus values:
| Value | Meaning |
|---|---|
|  | Not yet started |
|  | Input files being transferred |
|  | Job submitted to scheduler |
| RUNNING | Execution in progress |
| COMPLETED | Finished successfully |
| FAILED | Terminated with an error |
|  | Manually cancelled |
JobStatus values (per-job, stored in [[jobs]] of
workflow_state.toml):
| Value | Meaning |
|---|---|
| submitted | Job submitted |
| completed | Job finished (output not yet fetched) |
| fetched | Output files retrieved |
|  | Job failed |
Each [[jobs]] record contains:
| Field | Description |
|---|---|
| input_file | Basename of the generated TOML input file |
| output_file | Basename of the stdout capture file |
| job_id | Scheduler job ID |
| server_machine | Machine name |
| status | One of the JobStatus values |
|  | ISO 8601 timestamp |
| step | Step index (0 = pilot, 1, 2, … = production) |
| run_id | Short hex identifier for the job |
|  | ISO 8601 timestamp (set on completion) |
|  | ISO 8601 timestamp (set on fetch) |
| job_stdout | Scheduler stdout path (queuing systems only) |
| job_stderr | Scheduler stderr path (queuing systems only) |
| job_acct_command | Accounting command executed (queuing systems only) |
| job_acct_file | Path to raw accounting output file (queuing systems only) |
Artifact registry#
The engine records file lineage in the [[artifacts]] array of
workflow_state.toml. Each entry tracks:
| Field | Description |
|---|---|
|  | Basename of the artifact file |
|  | Input file that produced this artifact |
|  | ISO 8601 timestamp |
|  |  |
|  | Optional list of |
Use register_artifact() to add an entry, get_artifact_lineage() to
look up a single file, and get_artifact_registry() to list all
artifacts.
Input staleness detection#
When a Container completes successfully, it records the SHA-256
content hash of each input file in the [input_fingerprints]
section of workflow_state.toml. On subsequent runs, if the container
is already completed or running, the engine compares the current
input files against the recorded fingerprints.
If any input has changed (e.g. an upstream VMC produced a new optimised wavefunction), the engine logs a warning:
WARNING: Inputs have changed but previous results are still present.
Delete 'mcmc_prod/' to re-run with the updated inputs.
The container is not automatically re-run — the user must manually delete the stale directory. This conservative approach avoids the risk of mixing old and new job data on the remote server.
Staleness is tracked per-file by SHA-256 content hashing. If an upstream workflow re-runs but produces a byte-identical output file, no warning is triggered.
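The per-file fingerprint is an ordinary SHA-256 content hash, which can be reproduced with the standard library (sketch; the key layout inside [input_fingerprints] is not shown):
import hashlib

def file_fingerprint(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file's contents, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()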
Error recording#
When a workflow fails, the [error] section of workflow_state.toml
is populated via set_error():
[error]
message = "Job killed after 86400s"
exception_type = "RuntimeError"
traceback = "..."
The engine records raw data only — failure classification and recovery strategy are responsibilities of external tooling (e.g. an MCP agent).
Job accounting#
For queuing systems (PBS, Slurm, Fujitsu TCS, etc.), the engine can
collect scheduler accounting data after a job leaves the queue. This
is configured via the optional jobacct field in machine_data.yaml
(see machine_data.yaml parameters).
The engine executes {jobacct} {job_id} and writes the raw output to
a separate file job_accounting_{job_id}.txt. The accounting command
and file path are stored per-job in the [[jobs]] record:
[[jobs]]
input_file = "input_vmc-H2-0.74_1_aebf13bd.toml"
output_file = "output_vmc-H2-0.74_1_aebf13bd.out"
job_id = "12345"
server_machine = "my-cluster"
status = "fetched"
step = 1
run_id = "aebf13bd"
job_stdout = "job_vmc-H2-0.74.o"
job_stderr = "job_vmc-H2-0.74.e"
job_acct_command = "sacct -j 12345 --format=State,ExitCode,MaxRSS,Elapsed -P"
job_acct_file = "job_accounting_12345.txt"
No parsing or interpretation is performed — that responsibility belongs
to external tooling. If jobacct is not configured, the
job_acct_command and job_acct_file fields are simply absent.
Session state queries#
The Launcher provides three methods for programmatic introspection
(useful for MCP adapters and monitoring tools):
| Method | Returns |
|---|---|
|  | Dict with per-workflow summaries, dependency graph, and progress counters (completed / failed / running / pending / total). |
|  | The first workflow with status |
|  | Flat, chronologically-sorted list of all jobs across all workflows. |
Machine catalog#
Two helper functions are available for machine discovery:
| Function | Description |
|---|---|
|  | Returns a list of dicts summarising all machines defined in machine_data.yaml. |
|  | Tests connectivity to the named machine (SSH for remote, always reachable for local). |
Configuration#
Configuration files are managed in the following directory hierarchy. The engine looks for a project-local override first, then falls back to the user-global directory:
./jqmc_setting_local/ — project-local override (if it exists in CWD)
~/.jqmc_setting/ — user-global default
On the very first run, if neither directory exists, the template shipped
with the package is copied to ~/.jqmc_setting/ and the user is asked
to edit it.
Directory structure#
~/.jqmc_setting/
├── machine_data.yaml # Server machine definitions
├── localhost/ # Settings for localhost
│ ├── queue_data.toml
│ └── submit_mpi.sh # (name is user-defined)
├── my-cluster/ # Settings for a remote cluster (nickname)
│ ├── queue_data.toml
│ └── submit_mpi.sh
└── ...
machine_data.yaml#
A YAML file that defines each compute machine. The top-level keys are
nicknames (arbitrary labels); they do not need to match the SSH host.
Remote machines use the ssh_host field to specify the SSH connection
target.
Example#
localhost:
machine_type: local
queuing: false
workspace_root: /home/user/jqmc_work
my-cluster: # nickname (freely chosen)
ssh_host: pbs-cluster # Host alias in ~/.ssh/config
machine_type: remote
queuing: true
workspace_root: /home/user/jqmc_work
jobsubmit: /opt/pbs/bin/qsub
jobcheck: /opt/pbs/bin/qstat
jobdel: /opt/pbs/bin/qdel
jobnum_index: 0
jobacct: /opt/pbs/bin/qstat -xf
my-slurm: # Slurm example
ssh_host: slurm-cluster
machine_type: remote
queuing: true
workspace_root: /home/user/jqmc_work
jobsubmit: sbatch
jobcheck: squeue --noheader
jobdel: scancel
jobnum_index: 3
jobacct: sacct -j --format=State,ExitCode,MaxRSS,Elapsed,Timelimit -P --noheader
my-fujitsu: # Fujitsu TCS example
ssh_host: fujitsu
machine_type: remote
queuing: true
workspace_root: /home/user/jqmc_work
jobsubmit: /usr/local/bin/pjsub
jobcheck: /usr/local/bin/pjstat
jobdel: /usr/local/bin/pjdel
jobnum_index: 5
jobacct: pjstat -H -s --choose jid,st,ec,elp,pc --data
Parameters#
| Key | Type | Required | Description |
|---|---|---|---|
| ssh_host | String | When machine_type: remote | SSH connection target. Typically a Host alias in ~/.ssh/config. |
| machine_type |  | Yes | local or remote. |
| queuing | Boolean | Yes | Whether jobs go through a batch scheduler. |
| workspace_root | String (path) | Yes | Root directory for file management. Upload/download paths are resolved relative to this directory. Must be set on both localhost and the remote server. |
| jobsubmit | String (command) | When queuing: true | Command to submit a job (e.g. qsub, sbatch, pjsub). |
| jobcheck | String (command) | When queuing: true | Command to check job status (e.g. qstat, squeue, pjstat). |
| jobdel | String (command) | When queuing: true | Command to cancel a job (e.g. qdel, scancel, pjdel). |
| jobnum_index | Integer | When queuing: true | 0-based column index of the job ID in the output of jobsubmit. |
| jobacct | String (command) | No | Scheduler accounting command with flags. The engine executes {jobacct} {job_id}. |
|  | String | No | IP address or hostname of the remote machine. Usually the SSH alias in ~/.ssh/config. |
SSH connection#
For machine_type: remote, the ssh_host field specifies the SSH
connection target. This is typically a Host alias defined in
~/.ssh/config. The engine reads ~/.ssh/config via Paramiko to
resolve HostName, User, IdentityFile, Port, and ProxyJump
(multi-hop) settings.
Your SSH key must be accessible without a passphrase prompt (use
ssh-agent or a passphrase-less key).
Note: CanonicalizeHostname in ~/.ssh/config may trigger a Paramiko bug. The engine automatically works around this by removing the directive in-memory before connecting.
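For reference, the alias resolution the engine relies on can be reproduced with Paramiko's public SSHConfig API (a minimal sketch; the engine's actual connection code is more involved):
from pathlib import Path
import paramiko

# Resolve an SSH alias the same way `ssh` reads ~/.ssh/config.
config = paramiko.SSHConfig.from_path(str(Path.home() / ".ssh" / "config"))
entry = config.lookup("pbs-cluster")  # value of ssh_host in machine_data.yaml
print(entry.get("hostname"), entry.get("user"), entry.get("port"),
      entry.get("identityfile"), entry.get("proxyjump"))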
queue_data.toml#
Defines batch queue presets for each machine. Written in TOML format with queue labels as table keys.
Example#
[default]
submit_template = 'submit_mpi.sh'
max_job_submit = 10
queue = 'batch'
num_cores = 120
omp_num_threads = 1
nodes = 1
mpi_per_node = 120
max_time = '24:00:00'
[large]
submit_template = 'submit_mpi.sh'
max_job_submit = 10
queue = 'batch'
num_cores = 480
omp_num_threads = 1
nodes = 4
mpi_per_node = 120
max_time = '24:00:00'
Parameters#
| Key | Type | Required | Description |
|---|---|---|---|
| submit_template | String | Yes | Name of the job submission script template file placed in the same directory (e.g. submit_mpi.sh). |
| max_job_submit | Integer | Yes | Maximum number of concurrently submitted jobs for this queue. |
| custom keys | Any | No | Any additional key-value pairs. These are substituted into job script templates as _KEY_ placeholders. |
TOML format notes#
Strings must be quoted: queue = "small".
Booleans: lowercase true / false only.
Time values (e.g. max_time) must be quoted to avoid TOML local-time parsing: max_time = "24:00:00".
Job script templates#
Shell scripts placed in the per-machine directory, referenced by the
submit_template key in queue_data.toml. Template variables
(written as _KEY_ with underscores) are replaced at submission time.
You can name the file anything you like (e.g. submit_gpu.sh).
Predefined variables#
| Variable | Description | Default |
|---|---|---|
| _INPUT_ | Path to the jqmc input TOML file | (set by workflow) |
| _OUTPUT_ | Path to the jqmc stdout+stderr capture file |  |
| _JOBNAME_ | Job name |  |
| _JOB_STDOUT_ | Path for the scheduler stdout file (e.g. PBS -o) |  |
| _JOB_STDERR_ | Path for the scheduler stderr file (e.g. PBS -e) |  |
_JOB_STDOUT_ and _JOB_STDERR_ allow the engine to track where the
scheduler writes its output, which is useful for failure diagnosis.
If these placeholders are not present in the template the scheduler’s
default naming convention is used (backward-compatible).
The file paths are recorded per-job in the [[jobs]] records of
workflow_state.toml (as job_stdout and job_stderr).
Note:
job_stdout and job_stderr are treated as optional during fetch. If the files do not exist on the remote server (e.g. the job script template does not include #SBATCH --output / #SBATCH --error directives), the engine logs a warning and continues instead of raising an error. All other output files remain mandatory.
Custom variables#
All keys from queue_data.toml are available in upper-case with
surrounding underscores. For example, num_cores = 48 becomes
_NUM_CORES_ in the template.
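Conceptually, the substitution is plain text replacement, as in this sketch (the engine's own renderer may differ in details):
def render_template(template_text, values):
    """Replace each _KEY_ placeholder with the corresponding value,
    upper-casing the key and wrapping it in underscores."""
    for key, value in values.items():
        template_text = template_text.replace(f"_{key.upper()}_", str(value))
    return template_text

# e.g. num_cores = 48 in queue_data.toml becomes _NUM_CORES_ in the template:
script = render_template("mpirun -np _NUM_CORES_ jqmc _INPUT_",
                         {"num_cores": 48, "input": "input.toml"})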
Example (submit_mpi.sh for PBS)#
#!/bin/sh
#PBS -N _JOBNAME_
#PBS -q _QUEUE_
#PBS -l nodes=_NODES_:ppn=_MPI_PER_NODE_
#PBS -l walltime=_MAX_TIME_
#PBS -o _JOB_STDOUT_
#PBS -e _JOB_STDERR_
export OMP_NUM_THREADS=_OMP_NUM_THREADS_
INPUT=_INPUT_
OUTPUT=_OUTPUT_
cd ${PBS_O_WORKDIR}
mpirun -np _NUM_CORES_ jqmc ${INPUT} > ${OUTPUT} 2>&1
Example (submit_mpi.sh for Slurm)#
#!/bin/bash
#SBATCH --job-name=_JOBNAME_
#SBATCH --partition=_QUEUE_
#SBATCH --nodes=_NODES_
#SBATCH --ntasks=_NUM_CORES_
#SBATCH --time=_MAX_TIME_
#SBATCH --output=_JOB_STDOUT_
#SBATCH --error=_JOB_STDERR_
export OMP_NUM_THREADS=_OMP_NUM_THREADS_
INPUT=_INPUT_
OUTPUT=_OUTPUT_
srun jqmc ${INPUT} > ${OUTPUT} 2>&1
Example (submit_serial.sh)#
#!/bin/sh
#PBS -N _JOBNAME_
#PBS -q _QUEUE_
#PBS -l nodes=_NODES_
#PBS -l walltime=_MAX_TIME_
#PBS -o _JOB_STDOUT_
#PBS -e _JOB_STDERR_
export OMP_NUM_THREADS=_OMP_NUM_THREADS_
INPUT=_INPUT_
OUTPUT=_OUTPUT_
cd ${PBS_O_WORKDIR}
jqmc ${INPUT} > ${OUTPUT} 2>&1
Pipeline example#
A minimal pipeline script that runs VMC → MCMC + LRDMC extrapolation:
from jqmc_workflow import (
Container,
FileFrom,
Launcher,
LRDMC_Ext_Workflow,
MCMC_Workflow,
ValueFrom,
VMC_Workflow,
)
server = "pbs-cluster"
h5 = "hamiltonian_data.h5"
vmc = Container(
label="vmc",
dirname="vmc",
input_files=[h5],
workflow=VMC_Workflow(
server_machine_name=server,
hamiltonian_file=h5,
queue_label="default",
pilot_queue_label="small",
jobname="vmc",
target_error=0.001,
),
)
mcmc = Container(
label="mcmc-prod",
dirname="mcmc_prod",
input_files=[
h5,
FileFrom("vmc", ValueFrom("vmc", "optimized_hamiltonian")),
],
rename_input_files=[None, "hamiltonian_data.h5"],
workflow=MCMC_Workflow(
server_machine_name=server,
hamiltonian_file=h5,
queue_label="default",
pilot_queue_label="small",
jobname="mcmc",
target_error=0.001,
cleanup_patterns=["restart.h5"], # delete large checkpoint on success
),
)
lrdmc = Container(
label="lrdmc-ext",
dirname="lrdmc_ext",
input_files=[
h5,
FileFrom("vmc", ValueFrom("vmc", "optimized_hamiltonian")),
],
workflow=LRDMC_Ext_Workflow(
server_machine_name=server,
alat_list=[0.10, 0.20, 0.25, 0.30],
hamiltonian_file=h5,
queue_label="default",
pilot_queue_label="small",
jobname_prefix="lrdmc",
target_survived_walkers_ratio=0.97,
E_scf=ValueFrom("mcmc-prod", "energy"),
target_error=0.001,
),
)
pipeline = Launcher(workflows=[vmc, mcmc, lrdmc])
pipeline.launch()
In this example, mcmc-prod and lrdmc-ext depend on vmc (via
FileFrom). The optimised Hamiltonian filename is resolved
dynamically through ValueFrom("vmc", "optimized_hamiltonian"),
so the pipeline works correctly even when VMC converges early
(e.g. step 91 instead of 150). Additionally, lrdmc-ext depends on
mcmc-prod (via ValueFrom for E_scf), so the DAG becomes
VMC → MCMC → LRDMC-ext. The target_survived_walkers_ratio
triggers automatic calibration of num_projection_per_measurement
independently at each lattice spacing. All alat values run their
calibration, error-bar pilot, and production phases in parallel.
Job Manager CLI (jqmc-jobmanager)#
The jqmc-jobmanager command-line tool monitors and manages running
pipelines. It recursively discovers workflow_state.toml files under
the current directory and displays a summary tree.
Commands#
| Command | Description |
|---|---|
| show | Print the workflow tree. Add --id to show detailed info for a single job. |
| check | Print the tree and query the scheduler on the remote machine for live job status. Use --id and -s to select the job and machine. |
| del | Cancel a queued/running job and mark the workflow as cancelled. |
Common options#
| Option | Description |
|---|---|
| --id | Numeric job ID shown in the tree (0-based). |
| -s | Server machine name; a default is used when omitted. |
|  |  |
Usage examples#
Show the full workflow tree:
jqmc-jobmanager show
Show tree and detailed info for job 2:
jqmc-jobmanager show --id 2
Check live queue status for job 4 on pbs-cluster:
jqmc-jobmanager check --id 4 -s pbs-cluster
Cancel job 4 on pbs-cluster:
jqmc-jobmanager del --id 4 -s pbs-cluster
Example output#
==============================================================================
Workflow Job Tree
Root: /home/user/project
==============================================================================
ID Status Label Type Server Job#
--------------------------------------------------------------------------
0 completed wf WF_Workflow ? -
dir: 00_wf
1 completed vmc VMC_Workflow pbs-cluster 12345
dir: 01_vmc
energy: -17.17 +- 0.005 Ha
2 running mcmc-prod MCMC_Workflow pbs-cluster 12367
dir: 02_mcmc
3 running lrdmc-ext LRDMC_Ext_Workflow ? -
dir: 03_lrdmc_ext
For the full API reference, see API reference for the pipeline (jqmc_workflow).