Data

The data subpackage provides data preprocessors and dataloader abstractions.

Scripts

You can run the scripts by typing catalyst-data in your terminal. For example:

$ catalyst-data tag2label --help

Catalyst-data scripts.

Examples

1. process-images reads raw data and outputs preprocessed resized images

$ catalyst-data process-images \
    --in-dir /path/to/raw/data/ \
    --out-dir=./data/dataset \
    --num-workers=6 \
    --max-size=224 \
    --extension=png \
    --clear-exif \
    --grayscale \
    --expand-dims

2. tag2label converts a dataset into a JSON labeling like {"class_id": class_column_from_dataset}

$ catalyst-data tag2label \
    --in-dir=./data/dataset \
    --out-dataset=./data/dataset_raw.csv \
    --out-labeling=./data/tag2cls.json

3. check-images checks that the images in your data are not broken and writes a flag: true if the image opened without an error, false otherwise

$ catalyst-data check-images \
    --in-csv=./data/dataset_raw.csv \
    --img-rootpath=./data/dataset \
    --img-col="tag" \
    --out-csv=./data/dataset_checked.csv \
    --n-cpu=4

4. split-dataframe splits your dataset into train/valid folds

$ catalyst-data split-dataframe \
    --in-csv=./data/dataset_raw.csv \
    --tag2class=./data/tag2cls.json \
    --tag-column=tag \
    --class-column=class \
    --n-folds=5 \
    --train-folds=0,1,2,3 \
    --out-csv=./data/dataset.csv

5. image2embedding embeds images from your csv or image directory with the specified neural net architecture

$ catalyst-data image2embedding \
    --in-csv=./data/input.csv \
    --img-col="filename" \
    --img-size=64 \
    --out-npy=./embeddings.npy \
    --arch=resnet34 \
    --pooling=GlobalMaxPool2d \
    --batch-size=8 \
    --num-workers=16 \
    --verbose

Augmentors

Augmentor

class catalyst.data.augmentor.Augmentor(dict_key: str, augment_fn: Callable, input_key: str = None, output_key: str = None, **kwargs)[source]

Augmentation abstraction to use with data dictionaries.

__init__(dict_key: str, augment_fn: Callable, input_key: str = None, output_key: str = None, **kwargs)[source]

Augmentation abstraction to use with data dictionaries.

Parameters
  • dict_key – key to transform

  • augment_fn – augmentation function to use

  • input_key – augment_fn input key

  • output_key – augment_fn output key

  • **kwargs – default kwargs for the augmentation function
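As a rough illustration of how these parameters could fit together, the sketch below applies an augmentation function to one key of a data dictionary. SimpleAugmentor and scale are hypothetical names, and the handling of input_key and output_key is an assumption made for this example, not Catalyst's actual implementation.

```python
from typing import Callable, Dict, Optional


class SimpleAugmentor:
    """Hypothetical stand-in: applies augment_fn to one key of a data dict."""

    def __init__(
        self,
        dict_key: str,
        augment_fn: Callable,
        input_key: Optional[str] = None,
        output_key: Optional[str] = None,
        **kwargs,
    ):
        self.dict_key = dict_key
        self.augment_fn = augment_fn
        self.input_key = input_key
        self.output_key = output_key
        self.kwargs = kwargs  # default kwargs forwarded to augment_fn

    def __call__(self, sample: Dict) -> Dict:
        value = sample[self.dict_key]
        if self.input_key is None:
            # assumption: without input_key the value is passed positionally
            result = self.augment_fn(value, **self.kwargs)
        else:
            # assumption: with input_key the value is passed as a keyword
            result = self.augment_fn(**{self.input_key: value}, **self.kwargs)
        if self.output_key is not None:
            # augment_fn returned a dict; pick the requested entry
            result = result[self.output_key]
        return {**sample, self.dict_key: result}


def scale(image, factor=1.0):
    # toy "augmentation": multiply every pixel by a factor
    return {"image": [pixel * factor for pixel in image]}


augmentor = SimpleAugmentor(
    "features", scale, input_key="image", output_key="image", factor=2.0
)
augmented = augmentor({"features": [1, 2, 3], "targets": 0})
```

Only the "features" entry is rewritten; every other key of the sample is passed through untouched.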

AugmentorCompose

class catalyst.data.augmentor.AugmentorCompose(key2augment_fn: Dict[str, Callable])[source]

Compose augmentors.

__init__(key2augment_fn: Dict[str, Callable])[source]
Parameters

key2augment_fn (Dict[str, Callable]) – mapping from input key to augmentation function to apply

AugmentorKeys

class catalyst.data.augmentor.AugmentorKeys(dict2fn_dict: Union[Dict[str, str], List[str]], augment_fn: Callable)[source]

Augmentation abstraction to match input and augmentations keys.

__init__(dict2fn_dict: Union[Dict[str, str], List[str]], augment_fn: Callable)[source]
Parameters
  • dict2fn_dict (Dict[str, str]) – keys matching dict {input_key: augment_fn_key}. For example: {"image": "image", "mask": "mask"}

  • augment_fn – augmentation function

Collate Functions

FilteringCollateFn

class catalyst.data.collate_fn.FilteringCollateFn(*keys)[source]

Callable object that does the job of a collate_fn like default_collate, but does not cast batch items with the specified keys to torch.Tensor.

It only collects them into a list. Supports only key-value format batches.

__call__(batch)[source]
Parameters

batch – current batch

Returns

batch values filtered by keys

__init__(*keys)[source]
Parameters

keys – Keys for values that will not be converted to tensor and stacked
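The behaviour described above can be sketched without torch. In this illustration filtering_collate is a hypothetical function, and the collate argument stands in for default_collate (tuple is used only so the example runs with the standard library); it is not the class's real code.

```python
from typing import Any, Callable, Dict, List, Sequence


def filtering_collate(
    batch: List[Dict[str, Any]],
    keys: Sequence[str],
    collate: Callable[[List[Any]], Any] = tuple,
) -> Dict[str, Any]:
    """Collate a key-value batch, leaving `keys` as plain Python lists."""
    result = {}
    for key in batch[0]:
        values = [sample[key] for sample in batch]
        # filtered keys stay a list; the rest go through the collate stand-in
        result[key] = values if key in keys else collate(values)
    return result


batch = [
    {"features": 0.1, "path": "a.jpg"},
    {"features": 0.2, "path": "b.jpg"},
]
collated = filtering_collate(batch, keys=("path",))
```

This is handy for fields such as file paths or raw strings that cannot be stacked into a tensor.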

Dataset

PyTorch Extensions

DatasetFromSampler

class catalyst.data.dataset.torch.DatasetFromSampler(sampler: torch.utils.data.sampler.Sampler)[source]

Bases: torch.utils.data.dataset.Dataset

Dataset that creates indices from a Sampler.

Parameters

sampler – PyTorch sampler

__getitem__(index: int)[source]

Gets element of the dataset.

Parameters

index – index of the element in the dataset

Returns

Single element by index

__init__(sampler: torch.utils.data.sampler.Sampler)[source]

Initialisation for DatasetFromSampler.

__len__() → int[source]
Returns

length of the dataset

Return type

int

ListDataset

class catalyst.data.dataset.torch.ListDataset(list_data: List[Dict], open_fn: Callable, dict_transform: Optional[Callable] = None)[source]

Bases: torch.utils.data.dataset.Dataset

General purpose dataset class with several data sources list_data.

__getitem__(index: int) → Any[source]

Gets element of the dataset.

Parameters

index – index of the element in the dataset

Returns

Single element by index

__init__(list_data: List[Dict], open_fn: Callable, dict_transform: Optional[Callable] = None)[source]
Parameters
  • list_data – list of dicts that store your data annotations (for example paths to images, labels, bboxes, etc.)

  • open_fn – function that can open your annotations dict and transform it into the data needed by your network (for example open an image by path, or tokenize a read string)

  • dict_transform – transforms to use on dict (for example normalize image, add blur, crop/resize/etc)
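To make the roles of the three parameters concrete, here is a minimal pure-Python sketch. SimpleListDataset, open_fn and dict_transform below are hypothetical illustrations of the described contract, not the library's source.

```python
from typing import Any, Callable, Dict, List, Optional


class SimpleListDataset:
    """Hypothetical stand-in that mirrors the three constructor parameters."""

    def __init__(
        self,
        list_data: List[Dict],
        open_fn: Callable[[Dict], Dict],
        dict_transform: Optional[Callable[[Dict], Dict]] = None,
    ):
        self.data = list_data
        self.open_fn = open_fn
        # fall back to the identity when no transform is given
        self.dict_transform = dict_transform or (lambda sample: sample)

    def __getitem__(self, index: int) -> Any:
        annotation = self.data[index]      # raw annotation dict
        sample = self.open_fn(annotation)  # annotation -> model-ready data
        return self.dict_transform(sample)

    def __len__(self) -> int:
        return len(self.data)


annotations = [
    {"text": "hello world", "label": 0},
    {"text": "good bye", "label": 1},
]


def open_fn(annotation: Dict) -> Dict:
    # toy "reader": whitespace tokenization
    return {"tokens": annotation["text"].split(), "targets": annotation["label"]}


def dict_transform(sample: Dict) -> Dict:
    # toy transform: upper-case every token
    return {**sample, "tokens": [token.upper() for token in sample["tokens"]]}


dataset = SimpleListDataset(annotations, open_fn, dict_transform)
first = dataset[0]
```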

__len__() → int[source]
Returns

length of the dataset

Return type

int

MergeDataset

class catalyst.data.dataset.torch.MergeDataset(*datasets: torch.utils.data.dataset.Dataset, dict_transform: Optional[Callable] = None)[source]

Bases: torch.utils.data.dataset.Dataset

Abstraction to merge several datasets into one dataset.

__getitem__(index: int) → Any[source]

Get item from all datasets.

Parameters

index – index to value from all datasets

Returns

list of values from every dataset

Return type

list

__init__(*datasets: torch.utils.data.dataset.Dataset, dict_transform: Optional[Callable] = None)[source]
Parameters
  • datasets – any number of datasets to merge

  • dict_transform – transforms common for all datasets (for example normalize image, add blur, crop/resize/etc)

__len__() → int[source]
Returns

length of the dataset

Return type

int

NumpyDataset

class catalyst.data.dataset.torch.NumpyDataset(numpy_data: numpy.ndarray, numpy_key: str = 'features', dict_transform: Optional[Callable] = None)[source]

Bases: torch.utils.data.dataset.Dataset

General purpose dataset class to use with numpy_data.

__getitem__(index: int) → Any[source]

Gets element of the dataset.

Parameters

index – index of the element in the dataset

Returns

Single element by index

__init__(numpy_data: numpy.ndarray, numpy_key: str = 'features', dict_transform: Optional[Callable] = None)[source]

General purpose dataset class to use with numpy_data.

Parameters
  • numpy_data – numpy data (for example embeddings, features, etc.)

  • numpy_key – key to use for output dictionary

  • dict_transform – transforms to use on dict. (for example normalize vector, etc)

__len__() → int[source]
Returns

length of the dataset

Return type

int

PathsDataset

class catalyst.data.dataset.torch.PathsDataset(filenames: List[Union[str, pathlib.Path]], open_fn: Callable[[dict], dict], label_fn: Callable[[Union[str, pathlib.Path]], Any], features_key: str = 'features', target_key: str = 'targets', **list_dataset_params)[source]

Bases: catalyst.data.dataset.torch.ListDataset

Dataset that derives features and targets from samples' filesystem paths.

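Examples

>>> from pathlib import Path
>>> from catalyst.data.dataset.torch import PathsDataset
>>> # the label is the part of the file name before the first underscore
>>> label_fn = lambda path: Path(path).name.split("_")[0]
>>> dataset = PathsDataset(
...     filenames=list(Path("/path/to/images/").glob("*.jpg")),
...     label_fn=label_fn,
...     open_fn=open_fn,  # user-defined, see the parameters below
... )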
__init__(filenames: List[Union[str, pathlib.Path]], open_fn: Callable[[dict], dict], label_fn: Callable[[Union[str, pathlib.Path]], Any], features_key: str = 'features', target_key: str = 'targets', **list_dataset_params)[source]
Parameters
  • filenames – list of file paths that store information about your dataset samples; it could be images, texts or any other files in general.

  • open_fn – function that can open your annotations dict and transform it into the data needed by your network (for example open an image by path, or tokenize a read string)

  • label_fn – function that can extract the target value from a sample path (for example, your sample could be an image file like /path/to/your/image_1.png where the target is encoded as a part of the file path)

  • features_key – key to use to store sample features

  • target_key – key to use to store target label

  • list_dataset_params – base class initialization parameters.

Metric Learning Datasets

MetricLearningTrainDataset

class catalyst.data.dataset.metric_learning.MetricLearningTrainDataset[source]

Bases: torch.utils.data.dataset.Dataset, abc.ABC

Base class for datasets adapted for metric learning train stage.

abstract get_labels() → List[int][source]

A dataset for metric learning must provide the label of each sample; positive and negative pairs are formed from these labels during training.

Raises

NotImplementedError – You should implement it

QueryGalleryDataset

class catalyst.data.dataset.metric_learning.QueryGalleryDataset[source]

Bases: torch.utils.data.dataset.Dataset, abc.ABC

QueryGalleryDataset for CMCScoreCallback

abstract __getitem__(item) → Dict[str, torch.Tensor][source]

A dataset for the query/gallery split should return a dict with feature, targets and is_query keys. The value by the is_query key should be boolean and indicate whether the current object is in the query or in the gallery.

Raises

NotImplementedError – You should implement it

abstract property gallery_size

Query/Gallery dataset should have property gallery size.

Returns

gallery size

Raises

NotImplementedError – You should implement it

abstract property query_size

Query/Gallery dataset should have property query size.

Returns

query size

Raises

NotImplementedError – You should implement it

In-batch Samplers

IInbatchTripletSampler

class catalyst.data.IInbatchTripletSampler[source]

An abstraction of an in-batch triplet sampler.

abstract sample(features: torch.Tensor, labels: Union[List[int], torch.Tensor]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

This method includes the logic of sampling/selecting triplets.

Parameters
  • features – tensor of features

  • labels – labels of the samples in the batch, list or Tensor of shape (batch_size,)

Returns

the batch of triplets

Raises

NotImplementedError – you should implement it

InBatchTripletsSampler

class catalyst.data.InBatchTripletsSampler[source]

Base class for triplet samplers. We expect that the child instances of this class will be used to form triplets inside the batches. (Note: it is assumed that the set of output features is a subset of the samples' features inside the batch.) The batches must contain at least 2 samples for each class and at least 2 different classes; such behaviour can be guaranteed by using catalyst.data.sampler.BalanceBatchSampler.

However, you are free to use it in any other way.

sample(features: torch.Tensor, labels: Union[List[int], torch.Tensor]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]
Parameters
  • features – has the shape of [batch_size, feature_size]

  • labels – labels of the samples in the batch

Returns

the batch of triplets in the following order: (anchor, positive, negative)

AllTripletsSampler

class catalyst.data.AllTripletsSampler(max_output_triplets: int = 9223372036854775807)[source]

This sampler selects all the possible triplets for the given labels.

__init__(max_output_triplets: int = 9223372036854775807)[source]
Parameters

max_output_triplets – with the strategy of choosing all the triplets, their number in the batch can be very large; because of that, only a random subset of them is sampled, and its size is determined by this parameter.
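The enumeration described above can be sketched on label indices alone. all_triplets is a hypothetical helper; treating (anchor, positive) as an ordered pair and using a seeded random subset for the cap are assumptions of this sketch rather than documented behaviour.

```python
import random
from typing import List, Sequence, Tuple


def all_triplets(
    labels: Sequence[int],
    max_output_triplets: int = 2**63 - 1,
    seed: int = 0,
) -> List[Tuple[int, int, int]]:
    """Return index triplets (anchor, positive, negative) for the batch."""
    n = len(labels)
    triplets = [
        (anchor, pos, neg)
        for anchor in range(n)
        for pos in range(n)
        for neg in range(n)
        # positive: another sample of the anchor's class
        # negative: any sample of a different class
        if anchor != pos
        and labels[anchor] == labels[pos]
        and labels[anchor] != labels[neg]
    ]
    if len(triplets) > max_output_triplets:
        # keep only a random part of the triplets
        triplets = random.Random(seed).sample(triplets, max_output_triplets)
    return triplets


labels = [0, 0, 1, 1]
triplets = all_triplets(labels)
capped = all_triplets(labels, max_output_triplets=3)
```

The number of triplets grows roughly cubically with the batch size, which is why a cap is useful in practice.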

HardTripletsSampler

class catalyst.data.HardTripletsSampler(norm_required: bool = False)[source]

This sampler selects the hardest triplets based on distances between features: the hardest positive sample has the maximal distance to the anchor sample, the hardest negative sample has the minimal distance to the anchor sample.

Note that a typical triplet loss chart is as follows:

1. Falling: loss decreases to a value equal to the margin.
2. Long plateau: the loss oscillates near the margin.
3. Falling: loss decreases to zero.

__init__(norm_required: bool = False)[source]
Parameters

norm_required – set True if features normalisation is needed
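The selection rule (farthest positive, closest negative) can be sketched in plain Python as follows. hardest_triplets is a hypothetical helper that works on lists with Euclidean distance and returns indices; the choice of distance and the omission of the optional normalisation are assumptions of this sketch.

```python
import math
from typing import List, Sequence, Tuple


def hardest_triplets(
    features: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[Tuple[int, int, int]]:
    """For every anchor pick the farthest positive and the closest negative."""
    triplets = []
    for anchor, anchor_label in enumerate(labels):
        positives = [
            i for i, label in enumerate(labels)
            if label == anchor_label and i != anchor
        ]
        negatives = [i for i, label in enumerate(labels) if label != anchor_label]

        def distance(i: int) -> float:
            # Euclidean distance from the current anchor to sample i
            return math.dist(features[anchor], features[i])

        hardest_positive = max(positives, key=distance)
        hardest_negative = min(negatives, key=distance)
        triplets.append((anchor, hardest_positive, hardest_negative))
    return triplets


features = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 0.5], [0.0, 5.0]]
labels = [0, 0, 0, 1, 1]
triplets = hardest_triplets(features, labels)
```

Each class needs at least two samples and the batch needs at least two classes, otherwise the positive or negative candidate list is empty.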

HardClusterSampler

class catalyst.data.HardClusterSampler[source]

This sampler selects the hardest triplets based on distance to mean vectors: the anchor is the mean vector of the features of the i-th class in the batch, the hardest positive sample is the sample of the anchor's class that is most distant from the anchor, and the hardest negative sample is the closest mean vector of another class.

The batch must contain k samples for p classes in it (k > 1, p > 1).

sample(features: torch.Tensor, labels: Union[List[int], torch.Tensor]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

This method samples the hardest triplets in the batch.

Parameters
  • features – tensor of shape (batch_size, embed_dim) that contains k samples for each of p classes

  • labels – labels of the batch, list or tensor of size (batch_size,)

Returns

p triplets of (mean_vector, positive, negative_mean_vector)
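A small pure-Python sketch of this cluster-based selection follows. hard_cluster_triplets is a hypothetical helper using Euclidean distance on lists; it illustrates the description above and is not the sampler's tensor implementation.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def hard_cluster_triplets(
    features: Sequence[Vector], labels: Sequence[int]
) -> List[Tuple[Vector, Vector, Vector]]:
    """Return one (mean_vector, positive, negative_mean_vector) per class."""
    by_class: Dict[int, List[Vector]] = defaultdict(list)
    for feature, label in zip(features, labels):
        by_class[label].append(feature)
    # per-class mean vector, computed column by column
    means = {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in by_class.items()
    }
    triplets = []
    for label, vectors in by_class.items():
        anchor = means[label]
        # hardest positive: the class sample farthest from its own mean
        positive = max(vectors, key=lambda vector: math.dist(anchor, vector))
        # hardest negative: the closest mean vector of another class
        negative = min(
            (mean for other, mean in means.items() if other != label),
            key=lambda mean: math.dist(anchor, mean),
        )
        triplets.append((anchor, positive, negative))
    return triplets


features = [
    [0.0, 0.0], [1.0, 0.0], [5.0, 0.0],     # class 0, mean (2, 0)
    [10.0, 0.0], [11.0, 0.0], [15.0, 0.0],  # class 1, mean (12, 0)
]
labels = [0, 0, 0, 1, 1, 1]
triplets = hard_cluster_triplets(features, labels)
```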

Loader

BatchLimitLoaderWrapper

class catalyst.data.loader.BatchLimitLoaderWrapper(loader: torch.utils.data.dataloader.DataLoader, num_batches: Union[int, float])[source]

Loader wrapper. Limits number of batches used per each iteration.

For example, if you have some loader and want to use only the first 5 batches:

import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.data.loader import BatchLimitLoaderWrapper

num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loader = BatchLimitLoaderWrapper(loader, num_batches=5)

or if you would like to use only some portion of the DataLoader (we use 30% in the example below):

import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.data.loader import BatchLimitLoaderWrapper

num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loader = BatchLimitLoaderWrapper(loader, num_batches=0.3)

Note

Generally speaking, this wrapper could be used with any iterator-like object. No DataLoader-specific code is used.

__getattr__(key)[source]

Gets attribute by key. First, looks at the origin for the appropriate key. If none is found, looks at the wrapper's attributes. If nothing is found, raises NotImplementedError.

Parameters

key – attribute key

Returns

attribute value

Raises

NotImplementedError – if could not find attribute in origin or wrapper

__init__(loader: torch.utils.data.dataloader.DataLoader, num_batches: Union[int, float])[source]

Loader wrapper. Limits number of batches used per each iteration.

Parameters
  • loader – torch dataloader.

  • num_batches (Union[int, float]) – number of batches to use (int), or portion of iterator (float, should be in [0, 1] range)
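The int/float convention for num_batches can be illustrated with a short standard-library sketch. resolve_num_batches and limit_batches are hypothetical helpers, and truncating the float case with int() is an assumption of this example rather than confirmed wrapper behaviour.

```python
from itertools import islice
from typing import Iterator, Sequence, Union


def resolve_num_batches(loader_len: int, num_batches: Union[int, float]) -> int:
    """Turn an absolute count (int) or a portion (float) into a batch count."""
    if isinstance(num_batches, float):
        # a float in [0, 1] is read as a portion of the loader length
        return int(loader_len * num_batches)
    return num_batches


def limit_batches(loader: Sequence, num_batches: Union[int, float]) -> Iterator:
    """Yield at most the resolved number of batches from the loader."""
    return islice(loader, resolve_num_batches(len(loader), num_batches))


batches = list(range(8))  # stand-in for a loader with 8 batches
first_five = list(limit_batches(batches, 5))
first_quarter = list(limit_batches(batches, 0.25))
```

Because only len() and iteration are used, the same idea applies to any sized iterable, in line with the note above.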

__iter__()[source]

Iterator.

Returns

iterator object

__len__() → int[source]

Returns length of the wrapper loader.

Returns

length of the wrapper loader

Return type

int

__next__()[source]

Next batch.

Returns

next batch


Readers

Readers are the abstraction for your dataset. They can open an element from the dataset and transform it into the data needed by your network, for example open an image by path, or read a string and tokenize it.

ReaderSpec

class catalyst.data.reader.ReaderSpec(input_key: str, output_key: str)[source]

Reader abstraction for all Readers.

Applies a function to an element of your data. For example to a row from csv, or to an image, etc.

All inherited classes have to implement __call__.

__call__(element)[source]

Reads a row from your annotations dict and transforms it into the data needed by your network, for example opens an image by path, or reads a string and tokenizes it.

Parameters

element – element in your dataset

Returns

Data object used for your neural network

__init__(input_key: str, output_key: str)[source]
Parameters
  • input_key – input key to use from annotation dict

  • output_key – output key to use to store the result, default: input_key

ScalarReader

class catalyst.data.reader.ScalarReader(input_key: str, output_key: Optional[str] = None, dtype: Type = <class 'numpy.float32'>, default_value: float = None, one_hot_classes: int = None, smoothing: float = None)[source]

Numeric data reader abstraction. Reads a single float, int, str or other value from data.

__call__(element)[source]

Reads a row from your annotations dict and transforms it into a single value.

Parameters

element – element in your dataset

Returns

Scalar value

Return type

dtype

__init__(input_key: str, output_key: Optional[str] = None, dtype: Type = <class 'numpy.float32'>, default_value: float = None, one_hot_classes: int = None, smoothing: float = None)[source]
Parameters
  • input_key – input key to use from annotation dict

  • output_key – output key to use to store the result, default: input_key

  • dtype – datatype of scalar values to use

  • default_value – default value to use if something goes wrong

  • one_hot_classes – number of one-hot classes

  • smoothing (float, optional) – if specified, applies label smoothing to the one-hot classes
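To show what one_hot_classes and smoothing mean together, here is a sketch of one common label-smoothing convention: the true class gets 1 - smoothing and the remaining mass is spread evenly over the other classes. The function one_hot and this exact formula are assumptions for illustration; the reader's own formula may differ.

```python
from typing import List, Optional


def one_hot(
    label: int, num_classes: int, smoothing: Optional[float] = None
) -> List[float]:
    """Encode a class index as a (possibly smoothed) one-hot vector."""
    if smoothing is None:
        on_value, off_value = 1.0, 0.0
    else:
        # assumed convention: spread `smoothing` over the other classes
        on_value = 1.0 - smoothing
        off_value = smoothing / (num_classes - 1)
    return [on_value if index == label else off_value for index in range(num_classes)]


hard = one_hot(1, num_classes=3)
smoothed = one_hot(1, num_classes=3, smoothing=0.2)
```

With this convention the vector still sums to one, so it can be used directly as a soft target.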

LambdaReader

class catalyst.data.reader.LambdaReader(input_key: str, output_key: Optional[str] = None, lambda_fn: Optional[Callable] = None, **kwargs)[source]

Reader abstraction with a lambda encoder. Can read an element from the dataset and apply the lambda_fn function to it.

__call__(element)[source]

Reads a row from your annotations dict and applies the lambda_fn function.

Parameters

element – element in your dataset.

Returns

Value after applying lambda_fn function

__init__(input_key: str, output_key: Optional[str] = None, lambda_fn: Optional[Callable] = None, **kwargs)[source]
Parameters
  • input_key – input key to use from annotation dict

  • output_key – output key to use to store the result

  • lambda_fn – encode function to use to prepare your data (for example convert chars/words/tokens to indices, etc)

  • kwargs – kwargs for encode function

ReaderCompose

class catalyst.data.reader.ReaderCompose(readers: List[catalyst.data.reader.ReaderSpec], mixins: list = None)[source]

Abstraction to compose several readers into one open function.

__call__(element)[source]

Reads a row from your annotations dict and applies all readers and mixins

Parameters

element – element in your dataset.

Returns

Value after applying all readers and mixins

__init__(readers: List[catalyst.data.reader.ReaderSpec], mixins: list = None)[source]
Parameters
  • readers – list of readers to compose

  • mixins – list of mixins to use
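The composition can be pictured as merging the dictionaries produced by each reader and then letting mixins extend the merged result. In the sketch below make_reader and compose are hypothetical helpers, and the assumption that mixins receive the readers' combined output is made for illustration only.

```python
from typing import Any, Callable, Dict, Sequence


def make_reader(input_key: str, output_key: str, fn: Callable[[Any], Any]):
    """Hypothetical reader: reads one key, applies fn, stores under output_key."""

    def reader(element: Dict) -> Dict:
        return {output_key: fn(element[input_key])}

    return reader


def compose(readers: Sequence[Callable], mixins: Sequence[Callable] = ()):
    """Build a single open function out of several readers and mixins."""

    def open_fn(element: Dict) -> Dict:
        result: Dict = {}
        for reader in readers:
            result.update(reader(element))  # readers see the raw element
        for mixin in mixins:
            result.update(mixin(result))    # mixins see the readers' output
        return result

    return open_fn


open_fn = compose(
    readers=[
        make_reader("text", "tokens", str.split),
        make_reader("label", "targets", int),
    ],
    mixins=[lambda sample: {"length": len(sample["tokens"])}],
)
sample = open_fn({"text": "a b c", "label": "2"})
```

The resulting open_fn has the shape expected by the open_fn parameter of the dataset classes documented above.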

Samplers

BalanceBatchSampler

class catalyst.data.sampler.BalanceBatchSampler(labels: Union[List[int], numpy.ndarray], p: int, k: int)[source]

This kind of sampler can be used for both metric learning and classification tasks.

Sampler with the given strategy for a dataset with C unique classes:

- Selection of P out of C classes for the 1st batch
- Selection of K instances for each class for the 1st batch
- Selection of P out of the C - P remaining classes for the 2nd batch
- Selection of K instances for each class for the 2nd batch
- …

The epoch ends when there are no classes left. So, the batch size is P * K except for the last one.

Thus, in each epoch, all the classes will be selected once, but this does not mean that all the instances will be selected during the epoch.

One of the purposes of this sampler is to be used for forming triplets and pos/neg pairs inside the batch. To guarantee the existence of these pairs in the batch, P and K should be > 1. (1)

Behavior in corner cases:

- If a class does not contain K instances, a choice will be made with repetition.
- If C % P == 1, then one of the classes should be dropped; otherwise statement (1) will not be met.

This type of sampling can be found in the classical paper of Person Re-Id, where P equals 32 and K equals 4: In Defense of the Triplet Loss for Person Re-Identification.

Parameters
  • labels – list of class labels for each element in the dataset

  • p – number of classes in a batch, should be > 1

  • k – number of instances of each class in a batch, should be > 1
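The P x K strategy described above can be sketched with the standard library. balanced_batches is a hypothetical helper; which class is dropped in the C % P == 1 case and the seeded random generator are assumptions of this sketch, not the sampler's code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def balanced_batches(
    labels: Sequence[int], p: int, k: int, seed: int = 0
) -> List[List[int]]:
    """Split classes into groups of P and draw K indices per class."""
    rng = random.Random(seed)
    indices_by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    classes = list(indices_by_class)
    rng.shuffle(classes)
    if len(classes) % p == 1:
        classes.pop()  # drop one class so no batch is left with a single class

    batches = []
    for start in range(0, len(classes), p):
        batch: List[int] = []
        for label in classes[start:start + p]:
            pool = indices_by_class[label]
            if len(pool) >= k:
                batch.extend(rng.sample(pool, k))     # without repetition
            else:
                batch.extend(rng.choices(pool, k=k))  # with repetition
        batches.append(batch)
    return batches


# 5 classes; class 3 has a single instance
labels = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 1 + [4] * 4
batches = balanced_batches(labels, p=2, k=2)
```

With 5 classes and P = 2 one class is dropped, leaving two batches of P * K = 4 indices each.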

__init__(labels: Union[List[int], numpy.ndarray], p: int, k: int)[source]

Sampler initialisation.

__iter__() → Iterator[int][source]
Returns

indices for sampling dataset elements during an epoch

__len__() → int[source]
Returns

number of samples in an epoch

property batch_size

Returns: this value should be used in DataLoader as batch size

property batches_in_epoch

Returns: number of batches in an epoch

BalanceClassSampler

class catalyst.data.sampler.BalanceClassSampler(labels: List[int], mode: Union[str, int] = 'downsampling')[source]

Allows you to create a stratified sample on unbalanced classes.

Parameters
  • labels – list of class labels for each element in the dataset

  • mode – Strategy to balance classes. Must be one of [downsampling, upsampling]
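The two strategies can be sketched as follows: downsampling draws as many indices per class as the smallest class has, upsampling as many as the largest. balanced_indices is a hypothetical helper, and reading an integer mode as an explicit per-class count is an assumption suggested only by the Union[str, int] annotation.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Union


def balanced_indices(
    labels: Sequence[int], mode: Union[str, int] = "downsampling", seed: int = 0
) -> List[int]:
    """Draw the same number of indices from every class."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    sizes = [len(pool) for pool in by_class.values()]
    if isinstance(mode, int):
        per_class = mode  # assumption: explicit number of samples per class
    elif mode == "downsampling":
        per_class = min(sizes)
    elif mode == "upsampling":
        per_class = max(sizes)
    else:
        raise ValueError(f"unknown mode: {mode}")

    indices: List[int] = []
    for pool in by_class.values():
        if per_class <= len(pool):
            indices.extend(rng.sample(pool, per_class))     # no repetition
        else:
            indices.extend(rng.choices(pool, k=per_class))  # with repetition
    rng.shuffle(indices)
    return indices


labels = [0, 0, 0, 0, 0, 1, 1]
down = balanced_indices(labels, "downsampling")
up = balanced_indices(labels, "upsampling")
```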

__init__(labels: List[int], mode: Union[str, int] = 'downsampling')[source]

Sampler initialisation.

__iter__() → Iterator[int][source]
Yields

indices of stratified sample

__len__() → int[source]
Returns

length of result sample

DistributedSamplerWrapper

class catalyst.data.sampler.DistributedSamplerWrapper(sampler, num_replicas: Optional[int] = None, rank: Optional[int] = None, shuffle: bool = True)[source]

Wrapper over Sampler for distributed training. Allows you to use any sampler in distributed mode.

It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In such case, each process can pass a DistributedSamplerWrapper instance as a DataLoader sampler, and load a subset of subsampled data of the original dataset that is exclusive to it.

Note

Sampler is assumed to be of constant size.

__init__(sampler, num_replicas: Optional[int] = None, rank: Optional[int] = None, shuffle: bool = True)[source]
Parameters
  • sampler – Sampler used for subsampling

  • num_replicas (int, optional) – Number of processes participating in distributed training

  • rank (int, optional) – Rank of the current process within num_replicas

  • shuffle (bool, optional) – If true (default), sampler will shuffle the indices
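The idea of giving each process an exclusive share of the wrapped sampler's output can be sketched like this. shard_indices is a hypothetical helper that pads the index list and takes every num_replicas-th position, mirroring the usual distributed-sampler scheme; shuffling is left out, and the exact padding rule is an assumption.

```python
from typing import List, Sequence


def shard_indices(sampled: Sequence[int], num_replicas: int, rank: int) -> List[int]:
    """Return the part of the sampled indices that belongs to one process."""
    sampled = list(sampled)
    # pad so that every process receives the same number of indices
    total = -(-len(sampled) // num_replicas) * num_replicas
    padded = sampled + sampled[: total - len(sampled)]
    return padded[rank::num_replicas]


sampled = [7, 3, 9, 1, 5]  # indices produced by the wrapped sampler
shards = [shard_indices(sampled, num_replicas=2, rank=rank) for rank in range(2)]
```

The equal share per process is one reason the wrapped sampler is assumed to be of constant size.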

__iter__()[source]

@TODO: Docs. Contribution is welcome.

DynamicLenBatchSampler

class catalyst.data.sampler.DynamicLenBatchSampler(sampler, batch_size, drop_last)[source]

A dynamic batch length data sampler. Should be used with catalyst.utils.trim_tensors.

Adapted from Dynamic minibatch trimming to improve BERT training speed.

Parameters
  • sampler – Base sampler.

  • batch_size – Size of minibatch.

  • drop_last – If True, the sampler will drop the last batch if its size would be less than batch_size.

Usage example:

>>> from torch.utils import data
>>> from catalyst.data import DynamicLenBatchSampler
>>> from catalyst import utils
>>> # input_ids, input_mask, segment_ids, labels and device
>>> # are assumed to be defined by the user
>>> dataset = data.TensorDataset(
...     input_ids, input_mask, segment_ids, labels
... )
>>> sampler_ = data.RandomSampler(dataset)
>>> sampler = DynamicLenBatchSampler(
...     sampler_, batch_size=16, drop_last=False
... )
>>> loader = data.DataLoader(dataset, batch_sampler=sampler)
>>> for batch in loader:
...     tensors = utils.trim_tensors(batch)
...     b_input_ids, b_input_mask, b_segment_ids, b_labels = tuple(
...         t.to(device) for t in tensors
...     )
__iter__()[source]

Iteration over BatchSampler.

MiniEpochSampler

class catalyst.data.sampler.MiniEpochSampler(data_len: int, mini_epoch_len: int, drop_last: bool = False, shuffle: str = None)[source]

Sampler that iterates over the dataset in mini epochs of mini_epoch_len samples each.

Parameters
  • data_len – Size of the dataset

  • mini_epoch_len – Number of samples from the dataset used in one mini epoch.

  • drop_last – If True, the sampler will drop the last batches if their size would be less than batches_per_epoch

  • shuffle – one of "per_mini_epoch", "per_epoch", or None. The sampler will shuffle indices:
      "per_mini_epoch" – every mini epoch (every __iter__ call)
      "per_epoch" – every real epoch
      None – don't shuffle

Example

>>> MiniEpochSampler(len(dataset), mini_epoch_len=100)
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100, drop_last=True)
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100,
...     shuffle="per_epoch")
__init__(data_len: int, mini_epoch_len: int, drop_last: bool = False, shuffle: str = None)[source]

Sampler initialisation.

__iter__() → Iterator[int][source]

Iterate over sampler.

Returns

python iterator

__len__() → int[source]
Returns

length of the mini-epoch

Return type

int

shuffle() → None[source]

Shuffle sampler indices.

Computer Vision Extensions

Dataset

ImageFolderDataset

class catalyst.data.cv.dataset.ImageFolderDataset(rootpath: str, target_key: str = 'targets', dir2class: Optional[Mapping[str, int]] = None, dict_transform: Optional[Callable[[Dict], Dict]] = None)[source]

Bases: catalyst.data.dataset.torch.PathsDataset

Dataset class that derives targets from samples' filesystem paths. The dataset structure should be the following:

rootpath/
|-- class1/  # folder of N images
|   |-- image11
|   |-- image12
|   ...
|   `-- image1N
...
`-- classM/  # folder of K images
    |-- imageM1
    |-- imageM2
    ...
    `-- imageMK
__init__(rootpath: str, target_key: str = 'targets', dir2class: Optional[Mapping[str, int]] = None, dict_transform: Optional[Callable[[Dict], Dict]] = None) → None[source]

Constructor method for the ImageFolderDataset class.

Parameters
  • rootpath – root directory of dataset

  • target_key – key to use to store target label

  • dir2class (Mapping[str, int], optional) – mapping from folder name to class index

  • dict_transform (Callable[[Dict], Dict], optional) – transforms to use on dict

Mixins

BlurMixin

class catalyst.data.cv.mixins.blur.BlurMixin(input_key: str = 'image', output_key: str = 'blur_factor', blur_min: int = 3, blur_max: int = 9, blur: List[str] = None)[source]

Bases: object

Calculates blur factor for augmented image.

__init__(input_key: str = 'image', output_key: str = 'blur_factor', blur_min: int = 3, blur_max: int = 9, blur: List[str] = None)[source]
Parameters
  • input_key – input key to use from annotation dict

  • output_key – output key to use to store the result

FlareMixin

class catalyst.data.cv.mixins.flare.FlareMixin(input_key: str = 'image', output_key: str = 'flare_factor', sunflare_params: Dict = None)[source]

Bases: object

Calculates flare factor for augmented image.

__init__(input_key: str = 'image', output_key: str = 'flare_factor', sunflare_params: Dict = None)[source]
Parameters
  • input_key – input key to use from annotation dict

  • output_key – output key to use to store the result

  • sunflare_params – params to init albumentations.RandomSunFlare

RotateMixin

class catalyst.data.cv.mixins.rotate.RotateMixin(input_key: str = 'image', output_key: str = 'rotation_factor', targets_key: str = None, rotate_probability: float = 1.0, hflip_probability: float = 0.5, one_hot_classes: int = None)[source]

Bases: object

Calculates rotation factor for augmented image.

__init__(input_key: str = 'image', output_key: str = 'rotation_factor', targets_key: str = None, rotate_probability: float = 1.0, hflip_probability: float = 0.5, one_hot_classes: int = None)[source]
Parameters
  • input_key – input key to use from annotation dict

  • output_key – output key to use to store the result

Readers

ImageReader

class catalyst.data.cv.reader.ImageReader(input_key: str, output_key: Optional[str] = None, rootpath: Optional[str] = None, grayscale: bool = False)[source]

Bases: catalyst.data.reader.ReaderSpec

Image reader abstraction. Reads images from a csv dataset.

__init__(input_key: str, output_key: Optional[str] = None, rootpath: Optional[str] = None, grayscale: bool = False)[source]
Parameters
  • input_key – key to use from annotation dict

  • output_key – key to use to store the result, default: input_key

  • rootpath – path to the images dataset root directory (so you can use relative paths in annotations)

  • grayscale – flag if you need to work only with grayscale images

MaskReader

class catalyst.data.cv.reader.MaskReader(input_key: str, output_key: Optional[str] = None, rootpath: Optional[str] = None, clip_range: Tuple[Union[int, float], Union[int, float]] = (0, 1))[source]

Bases: catalyst.data.reader.ReaderSpec

Mask reader abstraction. Reads masks from a csv dataset.

__init__(input_key: str, output_key: Optional[str] = None, rootpath: Optional[str] = None, clip_range: Tuple[Union[int, float], Union[int, float]] = (0, 1))[source]
Parameters
  • input_key – key to use from annotation dict

  • output_key – key to use to store the result, default: input_key

  • rootpath – path to the images dataset root directory (so you can use relative paths in annotations)

  • clip_range (Tuple[int, int]) – lower and upper interval edges; image values outside the interval are clipped to the interval edges

Transforms

TensorToImage

class catalyst.data.cv.transforms.albumentations.TensorToImage(denormalize: bool = False, move_channels_dim: bool = True, always_apply: bool = False, p: float = 1.0)[source]

Bases: albumentations.core.transforms_interface.ImageOnlyTransform

Casts torch.tensor to numpy.array.

__init__(denormalize: bool = False, move_channels_dim: bool = True, always_apply: bool = False, p: float = 1.0)[source]
Parameters
  • denormalize – if True, multiply image(s) by ImageNet std and add ImageNet mean

  • move_channels_dim – if True, convert [B]xCxHxW tensor to [B]xHxWxC format

  • always_apply – if True, always apply this transform

  • p – probability for this transform

apply(img: torch.Tensor, **params) → numpy.ndarray[source]

Apply the transform to the image.

ImageToTensor

class catalyst.data.cv.transforms.albumentations.ImageToTensor(move_channels_dim: bool = True, always_apply: bool = False, p: float = 1.0)[source]

Bases: albumentations.pytorch.transforms.ToTensorV2

Casts numpy.array to torch.tensor.

__init__(move_channels_dim: bool = True, always_apply: bool = False, p: float = 1.0)[source]
Parameters
  • move_channels_dim – if False, casts the numpy array to torch.tensor but does not move the channels dim

  • always_apply – if True, always apply this transform

  • p – probability for this transform

apply(img: numpy.ndarray, **params) → torch.Tensor[source]

Apply the transform to the image.

apply_to_mask(mask: numpy.ndarray, **params) → torch.Tensor[source]

Apply the transform to the mask.

get_transform_init_args_names() → tuple[source]

Get transform init args names.

Compose

class catalyst.data.cv.transforms.torch.Compose(transforms)[source]

Bases: object

Composes several transforms together.

__init__(transforms)[source]
Parameters

transforms – list of transforms to compose.

Example

>>> Compose([ToTensor(), Normalize()])

Normalize

class catalyst.data.cv.transforms.torch.Normalize(mean, std, inplace=False)[source]

Bases: object

Normalize a tensor image with mean and standard deviation.

Given mean: (mean[1],...,mean[n]) and std: (std[1],...,std[n]) for n channels, this transform will normalize each channel of the input torch.*Tensor, i.e., output[channel] = (input[channel] - mean[channel]) / std[channel]

Note

This transform acts out of place, i.e., it does not mutate the input tensor.

__init__(mean, std, inplace=False)[source]
Parameters
  • mean – Sequence of means for each channel.

  • std – Sequence of standard deviations for each channel.

  • inplace (bool, optional) – Bool to make this operation in-place.
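The per-channel formula can be checked by hand with a tiny pure-Python sketch. normalize below is a hypothetical helper that works on nested lists in C x H x W layout instead of tensors; it only illustrates the arithmetic.

```python
from typing import List, Sequence

Image = List[List[List[float]]]  # C x H x W as nested lists


def normalize(image: Image, mean: Sequence[float], std: Sequence[float]) -> Image:
    """Apply output[c] = (input[c] - mean[c]) / std[c] to every channel."""
    return [
        [[(pixel - channel_mean) / channel_std for pixel in row] for row in channel]
        for channel, channel_mean, channel_std in zip(image, mean, std)
    ]


image = [[[0.0, 1.0]], [[0.5, 0.5]]]  # 2 channels, 1 x 2 pixels each
normalized = normalize(image, mean=[0.5, 0.5], std=[0.5, 0.25])
```

The function builds new lists, so the input stays unchanged, which corresponds to the out-of-place behaviour noted above.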

ToTensor

class catalyst.data.cv.transforms.torch.ToTensor[source]

Bases: object

Convert a numpy.ndarray to tensor. Converts a numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] if the numpy.ndarray has dtype = np.uint8. In other cases, tensors are returned without scaling.
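The layout change and scaling can be illustrated on nested lists. to_chw is a hypothetical helper; the scale flag stands in for the dtype == np.uint8 check, which is a simplification made for this sketch.

```python
from typing import List, Sequence


def to_chw(image_hwc: Sequence, scale: bool = True) -> List:
    """Reorder H x W x C nested lists to C x H x W, optionally scaling by 255."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(image_hwc[0][0])
    divisor = 255.0 if scale else 1.0
    return [
        [
            [image_hwc[y][x][c] / divisor for x in range(width)]
            for y in range(height)
        ]
        for c in range(channels)
    ]


# one row, two pixels, three channels (H=1, W=2, C=3)
image = [[[0, 255, 51], [255, 0, 102]]]
tensor_like = to_chw(image)
```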

Natural Language Processing Extensions

Datasets

LanguageModelingDataset

class catalyst.data.nlp.dataset.language_modeling.LanguageModelingDataset(texts: Iterable[str], tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer], max_seq_length: int = None, sort: bool = True, lazy: bool = False)[source]

Bases: torch.utils.data.dataset.Dataset

Dataset for the (masked) language modeling task. Can sort sequences for efficient padding.

__init__(texts: Iterable[str], tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer], max_seq_length: int = None, sort: bool = True, lazy: bool = False)[source]
Parameters
  • texts – Iterable object with text

  • tokenizer (str or tokenizer) – pretrained Hugging Face tokenizer or model name

  • max_seq_length – max sequence length to tokenize

  • sort – If True, sort all sequences by length for efficient padding

  • lazy – If True, tokenize and encode each sequence in the __getitem__ method; otherwise tokenize in __init__. If set to True, sorting is unavailable

TextClassificationDataset

class catalyst.data.nlp.dataset.text_classification.TextClassificationDataset(texts: List[str], labels: List[str] = None, label_dict: Mapping[str, int] = None, max_seq_length: int = 512, model_name: str = 'distilbert-base-uncased')[source]

Bases: torch.utils.data.dataset.Dataset

Wrapper around Torch Dataset to perform text classification.

__init__(texts: List[str], labels: List[str] = None, label_dict: Mapping[str, int] = None, max_seq_length: int = 512, model_name: str = 'distilbert-base-uncased')[source]
Parameters
  • texts – a list with texts to classify or to train the classifier on

  • labels (List[str]) – a list with classification labels (optional)

  • label_dict – a dictionary mapping class names to class ids, to be passed to the validation data (optional)

  • max_seq_length – maximal sequence length in tokens, texts will be truncated to this length

  • model_name – transformer model name, needed to perform appropriate tokenization