Engines¶
AMP¶
AMPEngine¶
class catalyst.engines.amp.AMPEngine(device: str = 'cuda')[source]¶

    Bases: catalyst.engines.torch.DeviceEngine

    PyTorch AMP single training device engine.

    Parameters
        device – device to use, default is "cuda".
Examples:
    from catalyst import dl

    class MyRunner(dl.IRunner):
        # ...
        def get_engine(self):
            return dl.AMPEngine("cuda:1")
        # ...

    args:
        logs: ...

    model:
        _target_: ...
        ...

    engine:
        _target_: AMPEngine
        device: cuda:1

    stages:
        ...
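Under the hood, PyTorch AMP wraps the forward pass in torch.cuda.amp.autocast and scales the loss with torch.cuda.amp.GradScaler; AMPEngine automates this inside the runner loop. The snippet below is only a minimal plain-PyTorch sketch of that mechanism, with an illustrative model, optimizer and batch that are not part of the Catalyst API:

    import torch
    from torch import nn
    from torch.cuda.amp import GradScaler, autocast

    # illustrative placeholders, not Catalyst objects
    model = nn.Linear(10, 2).cuda()
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    scaler = GradScaler()

    x = torch.randn(32, 10, device="cuda")
    y = torch.randint(0, 2, (32,), device="cuda")

    optimizer.zero_grad()
    with autocast():                   # run the forward pass in mixed precision
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()      # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)             # unscale gradients, skip the step on inf/nan
    scaler.update()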
DataParallelAMPEngine¶
class catalyst.engines.amp.DataParallelAMPEngine[source]¶

    Bases: catalyst.engines.amp.AMPEngine

    AMP multi-GPU training device engine.
Examples:
    from catalyst import dl

    class MyRunner(dl.IRunner):
        # ...
        def get_engine(self):
            return dl.DataParallelAMPEngine()
        # ...

    args:
        logs: ...

    model:
        _target_: ...
        ...

    engine:
        _target_: DataParallelAMPEngine

    stages:
        ...
DistributedDataParallelAMPEngine¶
class catalyst.engines.amp.DistributedDataParallelAMPEngine(address: str = 'localhost', port: str = '12345', backend: str = 'nccl', world_size: int = None)[source]¶

    Bases: catalyst.engines.torch.DistributedDataParallelEngine

    Distributed AMP multi-GPU training device engine.

    Parameters
        address – process address to use (required for the PyTorch backend), default is "localhost".
        port – process port to listen on (required for the PyTorch backend), default is "12345".
        backend – multiprocessing backend to use, default is "nccl".
        world_size – number of processes.
Examples:
    from catalyst import dl

    class MyRunner(dl.IRunner):
        # ...
        def get_engine(self):
            return dl.DistributedDataParallelAMPEngine(port=12345)
        # ...

    args:
        logs: ...

    model:
        _target_: ...
        ...

    engine:
        _target_: DistributedDataParallelAMPEngine
        port: 12345

    stages:
        ...
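Inside each spawned worker, the model ends up wrapped in torch.nn.parallel.DistributedDataParallel while the forward and backward passes still go through autocast and GradScaler. The function below is a rough per-worker sketch of that combination, assuming the standard PyTorch APIs rather than the engine's actual internals; `rank` is a placeholder for the process's GPU index, and the process group is assumed to be initialized already (see DistributedDataParallelEngine below):

    import torch
    from torch import nn
    from torch.cuda.amp import GradScaler, autocast
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train_step(model: nn.Module, rank: int) -> None:
        # placeholder worker body; data and optimizer are illustrative
        model = DDP(model.to(rank), device_ids=[rank])
        optimizer = torch.optim.Adam(model.parameters())
        criterion = nn.CrossEntropyLoss()
        scaler = GradScaler()

        x = torch.randn(32, 10, device=rank)
        y = torch.randint(0, 2, (32,), device=rank)

        optimizer.zero_grad()
        with autocast():                   # mixed-precision forward pass
            loss = criterion(model(x), y)
        scaler.scale(loss).backward()      # DDP all-reduces gradients during backward
        scaler.step(optimizer)
        scaler.update()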
Apex¶
APEXEngine¶
class catalyst.engines.apex.APEXEngine(device: str = 'cuda', opt_level: str = 'O1', keep_batchnorm_fp32: bool = None, loss_scale: Union[float, str] = None)[source]¶

    Bases: catalyst.engines.torch.DeviceEngine

    Apex single training device engine.

    Parameters
        device – device to use, default is "cuda".
        opt_level – optimization level, should be one of "O0", "O1", "O2" or "O3":
            "O0" – no-op training
            "O1" – mixed precision (FP16) training (default)
            "O2" – "almost" mixed precision training
            "O3" – another implementation of mixed precision training
            Details about the levels can be found here: https://nvidia.github.io/apex/amp.html#opt-levels
        keep_batchnorm_fp32 – to enhance precision and enable cuDNN batchnorm (which improves performance), it is often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
        loss_scale – if loss_scale is a float value, use this value as the static (fixed) loss scale; if loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.
Examples:
    from catalyst import dl

    class MyRunner(dl.IRunner):
        # ...
        def get_engine(self):
            return dl.APEXEngine(opt_level="O1", keep_batchnorm_fp32=False)
        # ...

    args:
        logs: ...

    model:
        _target_: ...
        ...

    engine:
        _target_: APEXEngine
        opt_level: O1
        keep_batchnorm_fp32: false

    stages:
        ...
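The opt_level, keep_batchnorm_fp32 and loss_scale arguments mirror the options of apex.amp.initialize. The snippet below is a minimal plain-Apex sketch of how those options are used in practice; it is an illustration based on the Apex API, not on Catalyst internals, and the model, optimizer and batch are placeholders:

    import torch
    from torch import nn
    from apex import amp

    # illustrative placeholders, not Catalyst objects
    model = nn.Linear(10, 2).cuda()
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()

    model, optimizer = amp.initialize(
        model,
        optimizer,
        opt_level="O2",            # "almost" mixed precision training
        keep_batchnorm_fp32=True,  # keep batchnorm weights in FP32
        loss_scale="dynamic",      # or a float for a static loss scale
    )

    x = torch.randn(32, 10, device="cuda")
    y = torch.randint(0, 2, (32,), device="cuda")

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    with amp.scale_loss(loss, optimizer) as scaled_loss:  # Apex handles the loss scaling
        scaled_loss.backward()
    optimizer.step()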
DataParallelApexEngine¶
class catalyst.engines.apex.DataParallelApexEngine(opt_level: str = 'O1')[source]¶

    Bases: catalyst.engines.apex.APEXEngine

    Apex multi-GPU training device engine.

    Parameters
        opt_level – optimization level, should be one of "O0", "O1", "O2" or "O3":
            "O0" – no-op training
            "O1" – mixed precision (FP16) training (default)
            "O2" – "almost" mixed precision training
            "O3" – another implementation of mixed precision training
            Details about the levels can be found here: https://nvidia.github.io/apex/amp.html#opt-levels
Examples:
    from catalyst import dl

    class MyRunner(dl.IRunner):
        # ...
        def get_engine(self):
            return dl.DataParallelApexEngine(opt_level="O1")
        # ...

    args:
        logs: ...

    model:
        _target_: ...
        ...

    engine:
        _target_: DataParallelApexEngine
        opt_level: O1

    stages:
        ...
DistributedDataParallelApexEngine¶
class catalyst.engines.apex.DistributedDataParallelApexEngine(address: str = 'localhost', port: str = '12345', backend: str = 'nccl', world_size: int = None, opt_level: str = 'O1', keep_batchnorm_fp32: bool = None, loss_scale: Union[float, str] = None, delay_all_reduce: bool = True)[source]¶

    Bases: catalyst.engines.torch.DistributedDataParallelEngine

    Distributed Apex multi-GPU training device engine.

    Parameters
        address – process address to use (required for the PyTorch backend), default is "localhost".
        port – process port to listen on (required for the PyTorch backend), default is "12345".
        backend – multiprocessing backend to use, default is "nccl".
        world_size – number of processes.
        opt_level – optimization level, should be one of "O0", "O1", "O2" or "O3":
            "O0" – no-op training
            "O1" – mixed precision (FP16) training (default)
            "O2" – "almost" mixed precision training
            "O3" – another implementation of mixed precision training
            Details about the levels can be found here: https://nvidia.github.io/apex/amp.html#opt-levels
        keep_batchnorm_fp32 – to enhance precision and enable cuDNN batchnorm (which improves performance), it is often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
        loss_scale – if loss_scale is a float value, use this value as the static (fixed) loss scale; if loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.
        delay_all_reduce (bool) – boolean flag to delay the gradient all-reduce until the end of the backward pass, default is True.
Examples:
    from catalyst import dl

    class MyRunner(dl.IRunner):
        # ...
        def get_engine(self):
            return dl.DistributedDataParallelApexEngine(
                port=12345,
                opt_level="O1",
            )
        # ...

    args:
        logs: ...

    model:
        _target_: ...
        ...

    engine:
        _target_: DistributedDataParallelApexEngine
        port: 12345
        opt_level: O1

    stages:
        ...
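The extra delay_all_reduce flag matches the delay_allreduce option of apex.parallel.DistributedDataParallel, which postpones all gradient communication to the end of the backward pass. A short sketch of that Apex call, shown only to illustrate the flag (an assumption about how the engine forwards it, not Catalyst source code):

    from apex.parallel import DistributedDataParallel as ApexDDP

    # `model` is assumed to be already prepared with apex.amp.initialize
    # inside an initialized distributed worker
    model = ApexDDP(model, delay_allreduce=True)  # all-reduce gradients only after backward finishes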
Torch¶
DeviceEngine¶
class catalyst.engines.torch.DeviceEngine(device: str = None)[source]¶

    Bases: catalyst.core.engine.IEngine

    Single training device engine.

    Parameters
        device – device to use, default is "cpu".
Examples:
    from catalyst import dl

    class MyRunner(dl.IRunner):
        # ...
        def get_engine(self):
            return dl.DeviceEngine("cuda:1")
        # ...

    args:
        logs: ...

    model:
        _target_: ...
        ...

    engine:
        _target_: DeviceEngine
        device: cuda:1

    stages:
        ...
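Besides overriding IRunner.get_engine, an engine instance can usually be passed straight to the runner. The example below is a rough end-to-end sketch that assumes the Catalyst 21.x dl.SupervisedRunner.train signature with an engine argument; check the API of your installed version before relying on it:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from catalyst import dl

    # toy model and data, just to make the sketch self-contained
    model = nn.Linear(10, 2)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    loaders = {
        "train": DataLoader(
            TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
            batch_size=16,
        ),
    }

    runner = dl.SupervisedRunner()
    runner.train(
        model=model,
        criterion=criterion,
        optimizer=optimizer,
        loaders=loaders,
        engine=dl.DeviceEngine("cpu"),  # assumed `engine` argument; swap for "cuda:0" or another engine
        num_epochs=1,
    )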
DataParallelEngine¶
class catalyst.engines.torch.DataParallelEngine[source]¶

    Bases: catalyst.engines.torch.DeviceEngine

    Multi-GPU training device engine.
Examples:
    from catalyst import dl

    class MyRunner(dl.IRunner):
        # ...
        def get_engine(self):
            return dl.DataParallelEngine()
        # ...

    args:
        logs: ...

    model:
        _target_: ...
        ...

    engine:
        _target_: DataParallelEngine

    stages:
        ...
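DataParallelEngine keeps a single process and splits each batch across all visible GPUs, in the spirit of torch.nn.DataParallel. A minimal sketch of that plain-PyTorch wrap, with an illustrative model (an assumption about what the engine does internally):

    import torch
    from torch import nn

    model = nn.Linear(10, 2).cuda()
    model = nn.DataParallel(model)          # replicate the module across all visible GPUs
    x = torch.randn(64, 10, device="cuda")
    logits = model(x)                       # the batch is scattered across the replicas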
DistributedDataParallelEngine¶
class catalyst.engines.torch.DistributedDataParallelEngine(address: str = 'localhost', port: str = '12345', backend: str = 'nccl', world_size: int = None)[source]¶

    Bases: catalyst.engines.torch.DeviceEngine

    Distributed multi-GPU training device engine.

    Parameters
        address – process address to use (required for the PyTorch backend), default is "localhost".
        port – process port to listen on (required for the PyTorch backend), default is "12345".
        backend – multiprocessing backend to use, default is "nccl".
        world_size – number of processes.
Examples:
    from catalyst import dl

    class MyRunner(dl.IRunner):
        # ...
        def get_engine(self):
            return dl.DistributedDataParallelEngine(port=12345)
        # ...

    args:
        logs: ...

    model:
        _target_: ...
        ...

    engine:
        _target_: DistributedDataParallelEngine
        port: 12345

    stages:
        ...
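The address, port, backend and world_size parameters map onto the standard torch.distributed process-group setup. The helper below is a hypothetical per-process sketch of that mapping; Catalyst spawns the workers and performs the equivalent initialization for you, so this is an assumption about internals rather than the engine's actual code:

    import os
    import torch.distributed as dist

    def init_worker(rank: int, world_size: int,
                    address: str = "localhost", port: str = "12345",
                    backend: str = "nccl") -> None:
        os.environ["MASTER_ADDR"] = address  # `address` parameter
        os.environ["MASTER_PORT"] = port     # `port` parameter
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)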