Hardware Acceleration
PyTorch-BSF is built on top of PyTorch Lightning, which means it provides seamless support for hardware acceleration (GPUs, TPUs, etc.) and distributed training (multi-GPU, multi-node) out of the box.
This page explains how to leverage these features to speed up your Bézier simplex fitting tasks.
CLI / MLflow
When using the command-line interface or MLflow, you can control hardware usage via several flags. These flags are passed directly to the underlying PyTorch Lightning Trainer.
Accelerator and Devices
The --accelerator and --devices flags are the primary way to specify your hardware.
--accelerator: Choose the hardware backend. Supported values includecpu,gpu,tpu,hpu,mps(for Apple Silicon), orauto.--devices: Specify which devices to use. You can provide an integer for the number of devices (e.g.,1,2), a list of device IDs (e.g.,0,1), or-1/autoto use all available devices.
Example: Use all available GPUs .. code-block:: bash
python -m torch_bsf –params data.csv –values results.csv –accelerator gpu –devices -1
Precision
You can reduce memory usage and increase training speed by using lower-precision floating-point arithmetic. This is particularly effective on modern GPUs (e.g., NVIDIA A100, RTX 30/40 series).
--precision: Supports16-mixed,bf16-mixed,32-true(default), or64-true.
Example: Mixed precision training .. code-block:: bash
python -m torch_bsf –params data.csv –values results.csv –accelerator gpu –precision 16-mixed
Multi-Node Training
To scale training across multiple machines, use the --num_nodes flag.
--num_nodes: The number of machines (nodes) to use.
Example: Training on 4 nodes with 4 GPUs each .. code-block:: bash
# Run this on each node (usually handled by a cluster manager like SLURM) python -m torch_bsf –params data.csv –values results.csv –accelerator gpu –devices 4 –num_nodes 4
Python API
The Python API provides two ways to handle hardware acceleration: by moving data to the device manually or by passing trainer configuration to fit().
Manual Device Management
If you move your training tensors to a specific device before calling fit(), PyTorch-BSF will attempt to run on that device.
import torch
import torch_bsf
# Assume ts and xs are your parameter and value tensors
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move tensors to GPU
ts = ts.to(device)
xs = xs.to(device)
# Fit the model (it will detect the device from the tensors)
bs = torch_bsf.fit(params=ts, values=xs, degree=3)
Passing Trainer Arguments
For more granular control, you can pass any PyTorch Lightning Trainer arguments directly as keyword arguments to the fit() function.
import torch_bsf
# Use multi-GPU and mixed precision via API
bs = torch_bsf.fit(
params=ts,
values=xs,
degree=3,
accelerator="gpu",
devices=2,
precision="16-mixed"
)
Detailed Use Cases
Large-Scale Pareto Front Approximation
When dealing with thousands of Pareto optimal points and high-degree Bézier simplices (e.g., degree 10+), the number of control points increases significantly. In such cases:
Use GPUs: Fitting is highly parallelizable.
Vectorized Forward: PyTorch-BSF uses a highly efficient, fully vectorized forward pass. This means that instead of looping through each control point, it uses matrix operations, which significantly speeds up computation on GPUs.
Use Mixed Precision: Set
precision="16-mixed"to fit larger models into GPU memory.Increase Batch Size: Use
--batch_sizeto optimize GPU throughput.
Distributed Training on Clusters
If your dataset is massive or you are performing an extensive search over hyperparameters, you can use multi-node training. PyTorch-BSF supports all distributed strategies provided by Lightning (DDP, FSDP, etc.).
Tip
For most cases, strategy="auto" is sufficient. If you encounter issues on specialized clusters, you might need to specify strategy="ddp".
See Also
For more advanced configuration details, please refer to the official documentation: