Hardware Acceleration

PyTorch-BSF is built on top of PyTorch Lightning, so it supports hardware acceleration (GPUs, TPUs, etc.) and distributed training (multi-GPU, multi-node) out of the box.

This page explains how to leverage these features to speed up your Bézier simplex fitting tasks.

CLI / MLflow

When using the command-line interface or MLflow, you can control hardware usage via several flags. These flags are passed directly to the underlying PyTorch Lightning Trainer.

Accelerator and Devices

The --accelerator and --devices flags are the primary way to specify your hardware.

--accelerator: Choose the hardware backend. Supported values include cpu, gpu, tpu, hpu, mps (for Apple Silicon), or auto.
--devices: Specify which devices to use. You can provide an integer for the number of devices (e.g., 1, 2), a list of device IDs (e.g., 0,1), or -1 / auto to use all available devices.

Example: Use all available GPUs

.. code-block:: bash

   python -m torch_bsf --params data.csv --values results.csv --accelerator gpu --devices -1
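Before choosing a value for --accelerator, it can help to check which backends PyTorch can actually see on the machine. The following is a minimal sketch and not part of PyTorch-BSF: the helper name detect_accelerator is hypothetical, and it only probes CUDA GPUs and Apple MPS through PyTorch's standard availability checks.

```python
def detect_accelerator() -> str:
    """Return a value usable for --accelerator: "gpu", "mps", or "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed, no accelerator can be probed.
        return "cpu"
    if torch.cuda.is_available():
        return "gpu"
    # torch.backends.mps is missing in older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

Printing detect_accelerator() on the target machine shows which of the three values applies there. Passing --accelerator auto leaves the same decision to Lightning.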

Precision

You can reduce memory usage and increase training speed by using lower-precision floating-point arithmetic. This is particularly effective on modern GPUs (e.g., NVIDIA A100, RTX 30/40 series).

--precision: Supports 16-mixed, bf16-mixed, 32-true (default), or 64-true.

Example: Mixed precision training

.. code-block:: bash

   python -m torch_bsf --params data.csv --values results.csv --accelerator gpu --precision 16-mixed
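To get a feel for the memory side of this trade-off, the sketch below computes the storage needed for a dense tensor at different floating-point widths. It is illustrative arithmetic only: the helper tensor_megabytes is hypothetical, and in mixed-precision modes the model weights typically remain in 32-bit while intermediate results use the 16-bit format, so real savings depend on the workload.

```python
# Storage size of one element for common floating-point formats.
BYTES_PER_ELEMENT = {"float16": 2, "bfloat16": 2, "float32": 4, "float64": 8}


def tensor_megabytes(n_rows: int, n_cols: int, dtype: str) -> float:
    """Size in megabytes (10**6 bytes) of a dense n_rows x n_cols tensor."""
    return n_rows * n_cols * BYTES_PER_ELEMENT[dtype] / 1e6


# Hypothetical dataset: one million points with three objective values each.
for dtype in ("float16", "float32", "float64"):
    print(dtype, tensor_megabytes(1_000_000, 3, dtype), "MB")
```

Halving the element width halves the storage for that tensor, which is why 16-bit formats leave more room for larger batches.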

Multi-Node Training

To scale training across multiple machines, use the --num_nodes flag.

--num_nodes: The number of machines (nodes) to use.

Example: Training on 4 nodes with 4 GPUs each

.. code-block:: bash

   # Run this on each node (usually handled by a cluster manager like SLURM)
   python -m torch_bsf --params data.csv --values results.csv --accelerator gpu --devices 4 --num_nodes 4
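The relationship between --num_nodes and --devices can be sketched as simple arithmetic. The helpers below are hypothetical and only illustrate the usual convention in distributed PyTorch launchers, where the total number of processes is the product of the two values and each process gets a unique global rank.

```python
def world_size(num_nodes: int, devices_per_node: int) -> int:
    """Total number of training processes across the whole cluster."""
    return num_nodes * devices_per_node


def global_rank(node_rank: int, local_rank: int, devices_per_node: int) -> int:
    """Unique process index, given the node index and the device index on that node."""
    return node_rank * devices_per_node + local_rank


# The example above: 4 nodes with 4 GPUs each.
print(world_size(4, 4))      # total processes
print(global_rank(3, 3, 4))  # rank of the last GPU on the last node
```

With 4 nodes and 4 devices per node there are 16 processes in total, with global ranks 0 through 15.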

Python API

The Python API provides two ways to handle hardware acceleration: by moving data to the device manually or by passing trainer configuration to fit().

Manual Device Management

If you move your training tensors to a specific device before calling fit(), PyTorch-BSF will attempt to run on that device.

.. code-block:: python

   import torch
   import torch_bsf

   # Assume ts and xs are your parameter and value tensors
   device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

   # Move tensors to GPU
   ts = ts.to(device)
   xs = xs.to(device)

   # Fit the model (it will detect the device from the tensors)
   bs = torch_bsf.fit(params=ts, values=xs, degree=3)

Passing Trainer Arguments

For more granular control, you can pass any PyTorch Lightning Trainer arguments directly as keyword arguments to the fit() function.

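.. code-block:: python

   import torch_bsf

   # Use multi-GPU and mixed precision via API
   # (ts and xs are the parameter and value tensors, as in the previous example)
   bs = torch_bsf.fit(
       params=ts,
       values=xs,
       degree=3,
       accelerator="gpu",
       devices=2,
       precision="16-mixed",
   )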
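Because these options are ordinary keyword arguments, they can be assembled in a dictionary and unpacked into fit(). The sketch below shows one possible fallback policy. The helper build_trainer_kwargs and its policy (fall back to a single CPU device with 32-bit precision when no GPU is present) are assumptions for illustration, not something PyTorch-BSF provides.

```python
from typing import Any, Dict


def build_trainer_kwargs(requested_devices: int, available_gpus: int) -> Dict[str, Any]:
    """Choose Trainer keyword arguments that match the hardware actually present."""
    if available_gpus <= 0:
        # No GPU: run on one CPU device in full 32-bit precision.
        return {"accelerator": "cpu", "devices": 1, "precision": "32-true"}
    return {
        "accelerator": "gpu",
        # Never request more GPUs than the machine has.
        "devices": min(requested_devices, available_gpus),
        "precision": "16-mixed",
    }


# Usage (requires torch and torch_bsf; ts and xs are the tensors from the text):
#   kwargs = build_trainer_kwargs(2, torch.cuda.device_count())
#   bs = torch_bsf.fit(params=ts, values=xs, degree=3, **kwargs)
```

Keeping the hardware probe outside the helper makes the policy easy to test on a machine without GPUs.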

Detailed Use Cases

Large-Scale Pareto Front Approximation

When dealing with thousands of Pareto optimal points and high-degree Bézier simplices (e.g., degree 10+), the number of control points increases significantly. In such cases:

Use GPUs: Fitting is highly parallelizable.
Vectorized Forward: PyTorch-BSF uses a highly efficient, fully vectorized forward pass. This means that instead of looping through each control point, it uses matrix operations, which significantly speeds up computation on GPUs.
Use Mixed Precision: Set precision="16-mixed" to fit larger models into GPU memory.
Increase Batch Size: Use --batch_size to optimize GPU throughput.
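To make the growth in control points and the idea of a vectorized forward pass concrete, here is a minimal NumPy sketch. It is not PyTorch-BSF's implementation: the function names are hypothetical, and it assumes the standard Bézier simplex definition, in which a degree-D simplex with M parameters has C(D+M-1, M-1) control points weighted by multinomial Bernstein terms.

```python
from itertools import product
from math import comb, factorial

import numpy as np


def multi_indices(n_params: int, degree: int) -> np.ndarray:
    """All non-negative integer vectors of length n_params that sum to degree."""
    rows = [d for d in product(range(degree + 1), repeat=n_params) if sum(d) == degree]
    return np.array(rows, dtype=np.int64)


def multinomial_coefficients(indices: np.ndarray) -> np.ndarray:
    """degree! / (d_1! * ... * d_M!) for every multi-index row."""
    degree = int(indices[0].sum())
    coefs = []
    for row in indices:
        denom = 1
        for k in row:
            denom *= factorial(int(k))
        coefs.append(factorial(degree) // denom)
    return np.array(coefs, dtype=np.float64)


def bezier_simplex(ts: np.ndarray, control_points: np.ndarray, degree: int) -> np.ndarray:
    """Evaluate N points at once: ts is (N, M), control_points is (K, L), result is (N, L)."""
    idx = multi_indices(ts.shape[1], degree)                         # (K, M)
    coef = multinomial_coefficients(idx)                             # (K,)
    # Broadcasting computes every monomial t^d without a Python loop over control points.
    monomials = np.prod(ts[:, None, :] ** idx[None, :, :], axis=-1)  # (N, K)
    return (coef * monomials) @ control_points                       # (N, L)


# Control-point counts for degree 10 with 3 and 5 parameters.
print(comb(10 + 3 - 1, 3 - 1), comb(10 + 5 - 1, 5 - 1))
```

The final matrix product is the part that maps well onto a GPU: all N inputs and all K control points are handled in a single operation. The index enumeration here is brute force and meant for small cases only.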

Distributed Training on Clusters

If your dataset is massive or you are performing an extensive search over hyperparameters, you can use multi-node training. PyTorch-BSF supports all distributed strategies provided by Lightning (DDP, FSDP, etc.).

Tip

For most cases, strategy="auto" is sufficient. If you encounter issues on specialized clusters, you might need to specify strategy="ddp".
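As a rough picture of what data-parallel strategies such as DDP do with a large dataset, the sketch below splits sample indices across processes. It follows the padding-and-striding scheme of PyTorch's DistributedSampler with shuffling disabled. The helper shard_indices is hypothetical, and whether PyTorch-BSF's data loading follows exactly this path is an assumption.

```python
import math
from typing import List


def shard_indices(n_samples: int, world_size: int, rank: int) -> List[int]:
    """Indices of the samples that the process with the given rank would see."""
    per_rank = math.ceil(n_samples / world_size)
    total = per_rank * world_size
    # Pad by wrapping around so that every rank receives the same number of samples.
    padded = [i % n_samples for i in range(total)]
    # Each rank takes every world_size-th index, starting at its own rank.
    return padded[rank:total:world_size]


# Hypothetical setup: 10 samples split across 4 processes.
for rank in range(4):
    print(rank, shard_indices(10, 4, rank))
```

Each process works on about 1/world_size of the data per epoch, which is where the speed-up on massive datasets comes from.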