Hardware Acceleration
=====================
PyTorch-BSF is built on top of PyTorch Lightning, which means it provides seamless support for hardware acceleration (GPUs, TPUs, etc.) and distributed training (multi-GPU, multi-node) out of the box.
This page explains how to leverage these features to speed up your Bézier simplex fitting tasks.
CLI / MLflow
------------
When using the command-line interface or MLflow, you can control hardware usage via several flags. These flags are passed directly to the underlying PyTorch Lightning Trainer.
Accelerator and Devices
^^^^^^^^^^^^^^^^^^^^^^^
The ``--accelerator`` and ``--devices`` flags are the primary way to specify your hardware.

* ``--accelerator``: Choose the hardware backend. Supported values include ``cpu``, ``gpu``, ``tpu``, ``hpu``, ``mps`` (for Apple Silicon), or ``auto``.
* ``--devices``: Specify which devices to use. You can provide an integer for the number of devices (e.g., ``1``, ``2``), a list of device IDs (e.g., ``0,1``), or ``-1`` / ``auto`` to use all available devices.
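
The three accepted forms of ``--devices`` can be sketched in plain Python. This is a simplified illustration of the flag's semantics, not Lightning's actual parser; the ``parse_devices`` helper and the assumed machine with 4 GPUs are hypothetical:

.. code-block:: python

   # Simplified sketch of how Lightning-style --devices values map to
   # concrete device indices. NOT the real parser, just an illustration.
   def parse_devices(value: str, available: int) -> list[int]:
       """Map a --devices flag value to a list of device indices."""
       if value in ("-1", "auto"):
           # Use every available device.
           return list(range(available))
       if "," in value:
           # A comma-separated list of explicit device IDs, e.g. "0,1".
           return [int(v) for v in value.split(",")]
       # A plain integer means "use this many devices", e.g. "2" -> [0, 1].
       return list(range(int(value)))

   # Assuming a machine with 4 GPUs:
   print(parse_devices("-1", available=4))   # [0, 1, 2, 3]
   print(parse_devices("0,1", available=4))  # [0, 1]
   print(parse_devices("2", available=4))    # [0, 1]
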

Example: Use all available GPUs

.. code-block:: bash

   python -m torch_bsf --params data.csv --values results.csv --accelerator gpu --devices -1

Precision
^^^^^^^^^
You can reduce memory usage and increase training speed by using lower-precision floating-point arithmetic. This is particularly effective on modern GPUs (e.g., NVIDIA A100, RTX 30/40 series).

* ``--precision``: Supports ``16-mixed``, ``bf16-mixed``, ``32-true`` (default), or ``64-true``.

Example: Mixed precision training

.. code-block:: bash

   python -m torch_bsf --params data.csv --values results.csv --accelerator gpu --precision 16-mixed

Multi-node Training
^^^^^^^^^^^^^^^^^^^
To scale training across multiple machines, use the ``--num_nodes`` flag.

* ``--num_nodes``: The number of machines (nodes) to use.

Example: Training on 4 nodes with 4 GPUs each

.. code-block:: bash

   # Run this on each node (usually handled by a cluster manager like SLURM)
   python -m torch_bsf --params data.csv --values results.csv --accelerator gpu --devices 4 --num_nodes 4

Python API
----------
The Python API provides two ways to handle hardware acceleration: by moving data to the device manually or by passing trainer configuration to ``fit()``.
Manual Device Management
^^^^^^^^^^^^^^^^^^^^^^^^
If you move your training tensors to a specific device before calling ``fit()``, PyTorch-BSF will attempt to run on that device.

.. code-block:: python

   import torch

   import torch_bsf

   # Assume ts and xs are your parameter and value tensors
   device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

   # Move tensors to the GPU (if available)
   ts = ts.to(device)
   xs = xs.to(device)

   # Fit the model (it will detect the device from the tensors)
   bs = torch_bsf.fit(params=ts, values=xs, degree=3)

Passing Trainer Arguments
^^^^^^^^^^^^^^^^^^^^^^^^^
For more granular control, you can pass any PyTorch Lightning ``Trainer`` argument directly as a keyword argument to the ``fit()`` function.

.. code-block:: python

   import torch_bsf

   # Use multi-GPU and mixed precision via the API
   bs = torch_bsf.fit(
       params=ts,
       values=xs,
       degree=3,
       accelerator="gpu",
       devices=2,
       precision="16-mixed",
   )

Detailed Use Cases
------------------
Large-scale Pareto Front Approximation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When dealing with thousands of Pareto optimal points and high-degree Bézier simplices (e.g., degree 10+), the number of control points increases significantly. In such cases:

1. **Use GPUs**: Fitting is highly parallelizable.
2. **Use Mixed Precision**: Set ``precision="16-mixed"`` to fit larger models into GPU memory.
3. **Increase Batch Size**: Use ``--batch_size`` to optimize GPU throughput.
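
To see why high degrees are demanding: a Bézier simplex of degree ``d`` over ``m`` objectives has one control point per multi-index of total degree ``d`` in ``m`` variables, i.e. ``C(d + m - 1, m - 1)`` of them by the stars-and-bars count. A quick check with the standard library (the helper name is ours, not part of the PyTorch-BSF API):

.. code-block:: python

   from math import comb

   def num_control_points(degree: int, num_objectives: int) -> int:
       """Control points of a Bezier simplex: multi-indices of the given
       total degree over num_objectives variables (stars and bars)."""
       return comb(degree + num_objectives - 1, num_objectives - 1)

   # Degree 3 vs. degree 10 for a 3-objective problem:
   print(num_control_points(3, 3))   # 10
   print(num_control_points(10, 3))  # 66

   # With 5 objectives the growth is much steeper:
   print(num_control_points(10, 5))  # 1001
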
Distributed Training on Clusters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If your dataset is massive or you are performing an extensive search over hyperparameters, you can use multi-node training. PyTorch-BSF supports all distributed strategies provided by Lightning (DDP, FSDP, etc.).

.. tip::

   For most cases, ``strategy="auto"`` is sufficient. If you encounter issues on specialized clusters, you may need to specify ``strategy="ddp"`` explicitly.

See Also
--------
For more advanced configuration details, please refer to the official PyTorch Lightning documentation:

* PyTorch Lightning accelerator documentation
* PyTorch Lightning multi-GPU training guide
* Mixed precision training in Lightning