Engines

AMP

AMPEngine

class catalyst.engines.amp.AMPEngine(device: str = 'cuda')[source]

Bases: catalyst.engines.torch.DeviceEngine

PyTorch AMP single training device engine.

Parameters

device – device to use, default is “cuda”.
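
For orientation, the mechanics this engine automates inside the runner's training loop follow the native torch.cuda.amp pattern sketched below (autocast for the forward pass, GradScaler for the backward pass). This is an illustrative sketch, not the engine's actual implementation.

import torch

model = torch.nn.Linear(10, 2).to("cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

features = torch.randn(8, 10, device="cuda")
targets = torch.randint(0, 2, (8,), device="cuda")

optimizer.zero_grad()
# the forward pass runs in mixed precision
with torch.cuda.amp.autocast():
    loss = torch.nn.functional.cross_entropy(model(features), targets)
# scale the loss to avoid FP16 gradient underflow, then step and rescale
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()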

Examples:

from catalyst import dl

class MyRunner(dl.IRunner):
    # ...
    def get_engine(self):
        return dl.AMPEngine("cuda:1")
    # ...

The same engine in the Config API (YAML):

args:
    logs: ...

model:
    _target_: ...
    ...

engine:
    _target_: AMPEngine
    device: cuda:1

stages:
    ...

DataParallelAMPEngine

class catalyst.engines.amp.DataParallelAMPEngine[source]

Bases: catalyst.engines.amp.AMPEngine

AMP multi-GPU training device engine.

Examples:

from catalyst import dl

class MyRunner(dl.IRunner):
    # ...
    def get_engine(self):
        return dl.DataParallelAMPEngine()
    # ...

The same engine in the Config API (YAML):

args:
    logs: ...

model:
    _target_: ...
    ...

engine:
    _target_: DataParallelAMPEngine

stages:
    ...

DistributedDataParallelAMPEngine

class catalyst.engines.amp.DistributedDataParallelAMPEngine(address: str = 'localhost', port: str = '12345', backend: str = 'nccl', world_size: int = None)[source]

Bases: catalyst.engines.torch.DistributedDataParallelEngine

Distributed AMP multi-GPU training device engine.

Parameters
  • address – process address to use (required for PyTorch backend), default is “localhost”.

  • port – process port to listen on (required for PyTorch backend), default is “12345”.

  • backend – multiprocessing backend to use, default is “nccl”.

  • world_size – number of processes.
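
For orientation, each spawned worker combines a DistributedDataParallel-wrapped model with the torch.cuda.amp pattern shown above for AMPEngine. The sketch below assumes the process group has already been initialized from the address/port/backend/world_size parameters (see DistributedDataParallelEngine below); the rank value is illustrative, and this is not the engine's actual implementation.

import torch
from torch.nn.parallel import DistributedDataParallel

# inside one worker, after torch.distributed.init_process_group(...)
rank = 0  # illustrative: the device index assigned to this worker
model = DistributedDataParallel(
    torch.nn.Linear(10, 2).to(rank), device_ids=[rank]
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = model(torch.randn(8, 10, device=rank)).sum()
scaler.scale(loss).backward()  # DDP all-reduces gradients during backward
scaler.step(optimizer)
scaler.update()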

Examples:

from catalyst import dl

class MyRunner(dl.IRunner):
    # ...
    def get_engine(self):
        return dl.DistributedDataParallelAMPEngine(port=12345)
    # ...

The same engine in the Config API (YAML):

args:
    logs: ...

model:
    _target_: ...
    ...

engine:
    _target_: DistributedDataParallelAMPEngine
    port: 12345

stages:
    ...

Apex

APEXEngine

class catalyst.engines.apex.APEXEngine(device: str = 'cuda', opt_level: str = 'O1', keep_batchnorm_fp32: bool = None, loss_scale: Union[float, str] = None)[source]

Bases: catalyst.engines.torch.DeviceEngine

Apex single training device engine.

Parameters
  • device – device to use, default is “cuda”.

  • opt_level

    optimization level, should be one of "O0", "O1", "O2" or "O3".

    • "O0" - no-op training

    • "O1" - mixed precision (FP16) training (default)

    • "O2" - “almost” mixed precision training

    • "O3" - another implementation of mixed precision training

    Details about levels can be found here: https://nvidia.github.io/apex/amp.html#opt-levels

  • keep_batchnorm_fp32 – To enhance precision and enable CUDNN batchnorm (which improves performance), it’s often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.

  • loss_scale – If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string “dynamic”, adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.
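
For orientation, opt_level, keep_batchnorm_fp32 and loss_scale are the standard Apex AMP options and conceptually end up in an apex.amp.initialize call such as the one sketched below. This illustrates the Apex API itself, not the engine's actual code.

import torch
from apex import amp

model = torch.nn.Linear(10, 2).to("cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# patch the model and optimizer for mixed-precision training
model, optimizer = amp.initialize(
    model,
    optimizer,
    opt_level="O1",
    keep_batchnorm_fp32=None,  # let Apex pick a default for the opt_level
    loss_scale="dynamic",      # or a fixed float for a static loss scale
)

loss = model(torch.randn(8, 10, device="cuda")).sum()
# Apex scales the loss before backward to avoid FP16 underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()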

Examples:

from catalyst import dl

class MyRunner(dl.IRunner):
    # ...
    def get_engine(self):
        return dl.APEXEngine(opt_level="O1", keep_batchnorm_fp32=False)
    # ...

The same engine in the Config API (YAML):

args:
    logs: ...

model:
    _target_: ...
    ...

engine:
    _target_: APEXEngine
    opt_level: O1
    keep_batchnorm_fp32: false

stages:
    ...

DataParallelApexEngine

class catalyst.engines.apex.DataParallelApexEngine(opt_level: str = 'O1')[source]

Bases: catalyst.engines.apex.APEXEngine

Apex multi-GPU training device engine.

Parameters

opt_level

optimization level, should be one of "O0", "O1", "O2" or "O3".

  • "O0" - no-op training

  • "O1" - mixed precision (FP16) training (default)

  • "O2" - “almost” mixed precision training

  • "O3" - another implementation of mixed precision training

Details about levels can be found here: https://nvidia.github.io/apex/amp.html#opt-levels

Examples:

from catalyst import dl

class MyRunner(dl.IRunner):
    # ...
    def get_engine(self):
        return dl.DataParallelApexEngine(opt_level="O1")
    # ...

The same engine in the Config API (YAML):

args:
    logs: ...

model:
    _target_: ...
    ...

engine:
    _target_: DataParallelApexEngine
    opt_level: O1

stages:
    ...

DistributedDataParallelApexEngine

class catalyst.engines.apex.DistributedDataParallelApexEngine(address: str = 'localhost', port: str = '12345', backend: str = 'nccl', world_size: int = None, opt_level: str = 'O1', keep_batchnorm_fp32: bool = None, loss_scale: Union[float, str] = None, delay_all_reduce: bool = True)[source]

Bases: catalyst.engines.torch.DistributedDataParallelEngine

Distributed Apex multi-GPU training device engine.

Parameters
  • address – process address to use (required for PyTorch backend), default is “localhost”.

  • port – process port to listen on (required for PyTorch backend), default is “12345”.

  • backend – multiprocessing backend to use, default is “nccl”.

  • world_size – number of processes.

  • opt_level

    optimization level, should be one of "O0", "O1", "O2" or "O3".

    • "O0" - no-op training

    • "O1" - mixed precision (FP16) training (default)

    • "O2" - “almost” mixed precision training

    • "O3" - another implementation of mixed precision training

    Details about levels can be found here: https://nvidia.github.io/apex/amp.html#opt-levels

  • keep_batchnorm_fp32 – To enhance precision and enable CUDNN batchnorm (which improves performance), it’s often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.

  • loss_scale – If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string “dynamic”, adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.

  • delay_all_reduce (bool) – boolean flag to delay the gradient all-reduce until the end of the backward pass, default is True.
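
For orientation, delay_all_reduce maps onto Apex's own DistributedDataParallel wrapper, where the keyword is spelled delay_allreduce. The sketch below shows the Apex API (after amp.initialize, as in the APEXEngine sketch above) and assumes the process group is already initialized; it is not the engine's actual code.

import torch
from apex import amp
from apex.parallel import DistributedDataParallel as ApexDDP

# assumes torch.distributed.init_process_group(...) was already called
model = torch.nn.Linear(10, 2).to("cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# delay_allreduce=True postpones the gradient all-reduce
# to the end of the backward pass instead of overlapping it
model = ApexDDP(model, delay_allreduce=True)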

Examples:

from catalyst import dl

class MyRunner(dl.IRunner):
    # ...
    def get_engine(self):
        return dl.DistributedDataParallelApexEngine(
            port=12345,
            opt_level="O1"
        )
    # ...

The same engine in the Config API (YAML):

args:
    logs: ...

model:
    _target_: ...
    ...

engine:
    _target_: DistributedDataParallelApexEngine
    port: 12345
    opt_level: O1

stages:
    ...

Torch

DeviceEngine

class catalyst.engines.torch.DeviceEngine(device: str = None)[source]

Bases: catalyst.core.engine.IEngine

Single training device engine.

Parameters

device – device to use, default is “cpu”.
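
Because the device is a plain PyTorch device string, the engine can also be chosen at runtime. A small sketch (only the documented constructor and torch.cuda.is_available are used; the fallback logic is illustrative):

import torch
from catalyst import dl

class MyRunner(dl.IRunner):
    # ...
    def get_engine(self):
        # fall back to CPU when no GPU is available
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        return dl.DeviceEngine(device)
    # ...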

Examples:

from catalyst import dl

class MyRunner(dl.IRunner):
    # ...
    def get_engine(self):
        return dl.DeviceEngine("cuda:1")
    # ...

The same engine in the Config API (YAML):

args:
    logs: ...

model:
    _target_: ...
    ...

engine:
    _target_: DeviceEngine
    device: cuda:1

stages:
    ...

DataParallelEngine

class catalyst.engines.torch.DataParallelEngine[source]

Bases: catalyst.engines.torch.DeviceEngine

Multi-GPU training device engine.
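
Conceptually, this engine corresponds to the plain torch.nn.DataParallel pattern sketched below, with the model replicated across all visible GPUs and each batch split along its first dimension. This is a sketch of the underlying mechanics, not the engine's implementation.

import torch

model = torch.nn.Linear(10, 2)
# replicate the model on every visible GPU; inputs are scattered along
# dim 0 and the outputs are gathered back onto the default device
dp_model = torch.nn.DataParallel(model).to("cuda")
outputs = dp_model(torch.randn(8, 10, device="cuda"))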

Examples:

from catalyst import dl

class MyRunner(dl.IRunner):
    # ...
    def get_engine(self):
        return dl.DataParallelEngine()
    # ...

The same engine in the Config API (YAML):

args:
    logs: ...

model:
    _target_: ...
    ...

engine:
    _target_: DataParallelEngine

stages:
    ...

DistributedDataParallelEngine

class catalyst.engines.torch.DistributedDataParallelEngine(address: str = 'localhost', port: str = '12345', backend: str = 'nccl', world_size: int = None)[source]

Bases: catalyst.engines.torch.DeviceEngine

Distributed multi-GPU training device engine.

Parameters
  • address – process address to use (required for PyTorch backend), default is “localhost”.

  • port – process port to listen on (required for PyTorch backend), default is “12345”.

  • backend – multiprocessing backend to use, default is “nccl”.

  • world_size – number of processes.
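
For orientation, address, port, backend and world_size correspond to the standard torch.distributed bootstrap performed in every worker process, roughly as sketched below. The worker function and rank handling are illustrative, not the engine's actual code.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def worker(rank: int, world_size: int):
    # the engine's address/port become the usual rendezvous variables
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)
    # gradients are synchronized across workers during backward
    model = DistributedDataParallel(
        torch.nn.Linear(10, 2).to(rank), device_ids=[rank]
    )

    dist.destroy_process_group()

# one process per GPU, e.g.:
# torch.multiprocessing.spawn(
#     worker,
#     args=(torch.cuda.device_count(),),
#     nprocs=torch.cuda.device_count(),
# )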

Examples:

from catalyst import dl

class MyRunner(dl.IRunner):
    # ...
    def get_engine(self):
        return dl.DistributedDataParallelEngine(port=12345)
    # ...

The same engine in the Config API (YAML):

args:
    logs: ...

model:
    _target_: ...
    ...

engine:
    _target_: DistributedDataParallelEngine
    port: 12345

stages:
    ...