Shortcuts

Data

The data subpackage contains data preprocessors and dataloader abstractions.

Scripts

You can run the scripts by typing catalyst-data in your terminal. For example:

$ catalyst-data tag2label --help

Catalyst-data scripts.

Examples

1. process-images reads raw data and outputs preprocessed resized images

$ catalyst-data process-images \
    --in-dir /path/to/raw/data/ \
    --out-dir=./data/dataset \
    --num-workers=6 \
    --max-size=224 \
    --extension=png \
    --clear-exif \
    --grayscale \
    --expand-dims

2. tag2label prepares dataset labeling as JSON like {"class_id": class_column_from_dataset}

$ catalyst-data tag2label \
    --in-dir=./data/dataset \
    --out-dataset=./data/dataset_raw.csv \
    --out-labeling=./data/tag2cls.json

3. check-images verifies that the images in your data are not broken and writes a flag: true if the image opened without an error, false otherwise

$ catalyst-data check-images \
    --in-csv=./data/dataset_raw.csv \
    --img-rootpath=./data/dataset \
    --img-col="tag" \
    --out-csv=./data/dataset_checked.csv \
    --n-cpu=4

4. split-dataframe splits your dataset into train/valid folds

$ catalyst-data split-dataframe \
    --in-csv=./data/dataset_raw.csv \
    --tag2class=./data/tag2cls.json \
    --tag-column=tag \
    --class-column=class \
    --n-folds=5 \
    --train-folds=0,1,2,3 \
    --out-csv=./data/dataset.csv

5. image2embedding embeds images from your CSV or image directory with the specified neural network architecture

$ catalyst-data image2embedding \
    --in-csv=./data/input.csv \
    --img-col="filename" \
    --img-size=64 \
    --out-npy=./embeddings.npy \
    --arch=resnet34 \
    --pooling=GlobalMaxPool2d \
    --batch-size=8 \
    --num-workers=16 \
    --verbose
catalyst.data.__main__.build_parser() → argparse.ArgumentParser[source]

@TODO: Docs. Contribution is welcome

catalyst.data.__main__.main()[source]

@TODO: Docs. Contribution is welcome

Augmentor

class catalyst.data.augmentor.Augmentor(dict_key: str, augment_fn: Callable, input_key: str = None, output_key: str = None, **kwargs)[source]

Augmentation abstraction to use with data dictionaries.

__init__(dict_key: str, augment_fn: Callable, input_key: str = None, output_key: str = None, **kwargs)[source]
Parameters
  • dict_key (str) – key to transform

  • augment_fn (Callable) – augmentation function to use

  • input_key (str) – augment_fn input key

  • output_key (str) – augment_fn output key

  • **kwargs – default kwargs for the augmentation function
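
A minimal usage sketch; the sample dict, the "features" key and the normalization function below are illustrative, not part of the API:

>>> import numpy as np
>>> from catalyst.data.augmentor import Augmentor
>>> # replace the value stored under the "features" key with its normalized version
>>> normalize = Augmentor(
>>>     dict_key="features",
>>>     augment_fn=lambda x: (x - x.mean()) / (x.std() + 1e-8),
>>> )
>>> sample = {"features": np.random.rand(3, 32, 32).astype(np.float32), "targets": 1}
>>> sample = normalize(sample)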

class catalyst.data.augmentor.AugmentorCompose(key2augment_fn: Dict[str, Callable])[source]

Compose augmentors.

__init__(key2augment_fn: Dict[str, Callable])[source]
Parameters

key2augment_fn (Dict[str, Callable]) – mapping from input key to augmentation function to apply
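
A minimal sketch, assuming each mapped callable accepts and returns a dict keyed by the corresponding input key (as an Augmentor does); the "image"/"mask" keys are illustrative:

>>> import numpy as np
>>> from catalyst.data.augmentor import Augmentor, AugmentorCompose
>>> transforms = AugmentorCompose({
>>>     "image": Augmentor(dict_key="image", augment_fn=lambda x: x / 255.0),
>>>     "mask": Augmentor(dict_key="mask", augment_fn=lambda x: (x > 0.5).astype("float32")),
>>> })
>>> sample = {"image": np.random.rand(64, 64, 3), "mask": np.random.rand(64, 64)}
>>> sample = transforms(sample)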

class catalyst.data.augmentor.AugmentorKeys(dict2fn_dict: Union[Dict[str, str], List[str]], augment_fn: Callable)[source]

Augmentation abstraction to match input and augmentation keys.

__init__(dict2fn_dict: Union[Dict[str, str], List[str]], augment_fn: Callable)[source]
Parameters
  • dict2fn_dict (Dict[str, str]) – keys matching dict {input_key: augment_fn_key}. For example: {"image": "image", "mask": "mask"}

  • augment_fn – augmentation function
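
A sketch of a typical use case, assuming an albumentations-style augment_fn that accepts keyword arguments ("image", "mask") and returns a dict with the same keys; the transform itself is illustrative:

>>> import albumentations as A
>>> import numpy as np
>>> from catalyst.data.augmentor import AugmentorKeys
>>> transform = AugmentorKeys(
>>>     dict2fn_dict={"image": "image", "mask": "mask"},
>>>     augment_fn=A.Compose([A.HorizontalFlip(p=0.5)]),
>>> )
>>> sample = {
>>>     "image": np.random.rand(64, 64, 3).astype(np.float32),
>>>     "mask": np.random.randint(0, 2, (64, 64)).astype(np.uint8),
>>> }
>>> sample = transform(sample)  # "image" and "mask" are flipped consistently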

Collate Functions

class catalyst.data.collate_fn.FilteringCollateFn(*keys)[source]

Callable object that does the job of a collate_fn (like default_collate), but does not cast batch items with the specified keys to torch.Tensor.

It only gathers them into a list. Supports only key-value (dict) format batches.

__call__(batch)[source]
Parameters

batch – current batch

Returns

batch values filtered by keys

__init__(*keys)[source]
Parameters

keys – Keys for values that will not be converted to tensor and stacked
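
A minimal sketch; the "features"/"bboxes" keys and shapes are illustrative. The "bboxes" entries vary in length, so they are returned as a list while the rest of the batch is collated as usual:

>>> import torch
>>> from torch.utils.data import DataLoader
>>> from catalyst.data.collate_fn import FilteringCollateFn
>>> samples = [
>>>     {"features": torch.randn(3, 8, 8), "bboxes": torch.randn(i + 1, 4)}
>>>     for i in range(8)
>>> ]
>>> loader = DataLoader(samples, batch_size=4, collate_fn=FilteringCollateFn("bboxes"))
>>> batch = next(iter(loader))
>>> batch["features"].shape  # torch.Size([4, 3, 8, 8])
>>> len(batch["bboxes"])     # 4 tensors with different numbers of rows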

Dataset

class catalyst.data.dataset.ListDataset(list_data: List[Dict], open_fn: Callable, dict_transform: Callable = None)[source]

Bases: torch.utils.data.dataset.Dataset

General purpose dataset class built on top of a list of data records, list_data.

__getitem__(index: int) → Any[source]

Gets element of the dataset.

Parameters

index (int) – index of the element in the dataset

Returns

Single element by index

__init__(list_data: List[Dict], open_fn: Callable, dict_transform: Callable = None)[source]
Parameters
  • list_data (List[Dict]) – list of dicts that store your data annotations (for example, paths to images, labels, bboxes, etc.)

  • open_fn (callable) – function that can open your annotations dict and transform it into the data needed by your network (for example, open an image by its path, or tokenize a read string)

  • dict_transform (callable) – transforms to apply to the dict (for example, normalize the image, add blur, crop/resize, etc.)

__len__() → int[source]
Returns

length of the dataset

Return type

int
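
A minimal sketch; the annotation fields, the "features"/"targets" output keys and the stand-in open_fn are illustrative (a real open_fn would load the image from disk):

>>> import numpy as np
>>> from catalyst.data.dataset import ListDataset
>>> list_data = [
>>>     {"path": "images/cat_0.png", "label": 0},
>>>     {"path": "images/dog_1.png", "label": 1},
>>> ]
>>> def open_fn(record):
>>>     features = np.random.rand(3, 32, 32).astype(np.float32)  # stand-in for image loading
>>>     return {"features": features, "targets": record["label"]}
>>> dataset = ListDataset(list_data=list_data, open_fn=open_fn)
>>> sample = dataset[0]  # {"features": ..., "targets": 0}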

class catalyst.data.dataset.MergeDataset(*datasets: torch.utils.data.dataset.Dataset, dict_transform: Callable = None)[source]

Bases: torch.utils.data.dataset.Dataset

Abstraction to merge several datasets into one dataset.

__getitem__(index: int) → Any[source]

Get item from all datasets.

Parameters

index (int) – index of the value in every dataset

Returns

list of values, one from each dataset

Return type

list

__init__(*datasets: torch.utils.data.dataset.Dataset, dict_transform: Callable = None)[source]
Parameters
  • datasets (List[Dataset]) – datasets to merge, passed as positional arguments

  • dict_transform (callable) – transforms common to all datasets (for example, normalize the image, add blur, crop/resize, etc.)

__len__() → int[source]
Returns

length of the dataset

Return type

int
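
A minimal sketch that merges two equal-length datasets (here NumpyDataset instances with assumed keys); element i of the merged dataset is built from element i of every source dataset:

>>> import numpy as np
>>> from catalyst.data.dataset import MergeDataset, NumpyDataset
>>> features = NumpyDataset(np.random.rand(10, 8).astype(np.float32), numpy_key="features")
>>> targets = NumpyDataset(np.arange(10), numpy_key="targets")
>>> merged = MergeDataset(features, targets)
>>> len(merged)  # 10
>>> sample = merged[0]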

class catalyst.data.dataset.NumpyDataset(numpy_data: numpy.ndarray, numpy_key: str = 'features', dict_transform: Callable = None)[source]

Bases: torch.utils.data.dataset.Dataset

General purpose dataset class to use with numpy_data.

__getitem__(index: int) → Any[source]

Gets element of the dataset.

Parameters

index (int) – index of the element in the dataset

Returns

Single element by index

__init__(numpy_data: numpy.ndarray, numpy_key: str = 'features', dict_transform: Callable = None)[source]
Parameters
  • numpy_data (np.ndarray) – numpy data (for example, embeddings, features, etc.)

  • numpy_key (str) – key to use for the output dictionary

  • dict_transform (callable) – transforms to apply to the dict (for example, normalize the vector, etc.)

__len__() → int[source]
Returns

length of the dataset

Return type

int
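
A minimal sketch; the array contents and the "embeddings" key are illustrative. Each element is exposed as a dict under numpy_key:

>>> import numpy as np
>>> from catalyst.data.dataset import NumpyDataset
>>> embeddings = np.random.rand(100, 128).astype(np.float32)
>>> dataset = NumpyDataset(numpy_data=embeddings, numpy_key="embeddings")
>>> len(dataset)  # 100
>>> dataset[0]    # {"embeddings": <row 0>}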

class catalyst.data.dataset.PathsDataset(filenames: List[Union[str, pathlib.Path]], open_fn: Callable[[dict], dict], label_fn: Callable[[Union[str, pathlib.Path]], Any], **list_dataset_params)[source]

Bases: catalyst.data.dataset.ListDataset

Dataset that derives features and targets from samples filesystem paths.

Examples

>>> label_fn = lambda x: x.split("_")[0]
>>> dataset = PathsDataset(
>>>     filenames=Path("/path/to/images/").glob("*.jpg"),
>>>     label_fn=label_fn,
>>>     open_fn=open_fn,
>>> )
__init__(filenames: List[Union[str, pathlib.Path]], open_fn: Callable[[dict], dict], label_fn: Callable[[Union[str, pathlib.Path]], Any], **list_dataset_params)[source]
Parameters
  • filenames (List[str]) – list of file paths that store information about your dataset samples; they could be images, texts or any other files in general.

  • open_fn (callable) – function that can open your annotations dict and transform it into the data needed by your network (for example, open an image by its path, or tokenize a read string)

  • label_fn (callable) – function that can extract the target value from a sample path (for example, your sample could be an image file like /path/to/your/image_1.png where the target is encoded as part of the file path)

  • list_dataset_params (dict) – base class initialization parameters.

class catalyst.data.dataset.DatasetFromSampler(sampler: torch.utils.data.sampler.Sampler)[source]

Bases: torch.utils.data.dataset.Dataset

Dataset of indexes from Sampler.

__getitem__(index: int)[source]

Gets element of the dataset.

Parameters

index (int) – index of the element in the dataset

Returns

Single element by index

__init__(sampler: torch.utils.data.sampler.Sampler)[source]
Parameters

sampler (Sampler) – @TODO: Docs. Contribution is welcome

__len__() → int[source]
Returns

length of the dataset

Return type

int

Reader

Readers are the abstraction for your dataset. They can open an element from the dataset and transform it into the data needed by your network, for example open an image by its path, or read a string and tokenize it.

class catalyst.data.reader.ReaderSpec(input_key: str, output_key: str)[source]

Reader abstraction for all Readers.

Applies a function to an element of your data, for example to a row from a CSV file, or to an image, etc.

All inherited classes have to implement __call__.

__call__(element)[source]

Reads a row from your annotations dict and transforms it into the data needed by your network, for example opens an image by its path, or reads a string and tokenizes it.

Parameters

element – elem in your dataset

Returns

Data object used for your neural network

__init__(input_key: str, output_key: str)[source]
Parameters
  • input_key (str) – input key to use from annotation dict

  • output_key (str) – output key to use to store the result

class catalyst.data.reader.ScalarReader(input_key: str, output_key: str, dtype: Type = <class 'numpy.float32'>, default_value: float = None, one_hot_classes: int = None, smoothing: float = None)[source]

Numeric data reader abstraction. Reads a single float, int, str or other value from the data.

__call__(element)[source]

Reads a row from your annotations dict and transforms it into a single value.

Parameters

element – elem in your dataset

Returns

Scalar value

Return type

dtype

__init__(input_key: str, output_key: str, dtype: Type = <class 'numpy.float32'>, default_value: float = None, one_hot_classes: int = None, smoothing: float = None)[source]
Parameters
  • input_key (str) – input key to use from annotation dict

  • output_key (str) – output key to use to store the result

  • dtype (type) – datatype of scalar values to use

  • default_value – default value to use if something goes wrong

  • one_hot_classes (int) – number of one-hot classes

  • smoothing (float, optional) – if specified, applies label smoothing to the one-hot classes
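
A minimal sketch; the "class_id" column and the "targets" output key are illustrative:

>>> import numpy as np
>>> from catalyst.data.reader import ScalarReader
>>> reader = ScalarReader(
>>>     input_key="class_id",
>>>     output_key="targets",
>>>     dtype=np.int64,
>>>     default_value=-1,
>>> )
>>> reader({"class_id": 3, "filepath": "images/cat_3.png"})  # {"targets": 3}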

class catalyst.data.reader.LambdaReader(input_key: str, output_key: str, lambda_fn: Callable = <function LambdaReader.<lambda>>, **kwargs)[source]

Reader abstraction with a lambda encoder. Can read an element from the dataset and apply the lambda_fn function to it.

__call__(element)[source]

Reads a row from your annotations dict and applies the lambda_fn function.

Parameters

element – elem in your dataset.

Returns

Value after applying lambda_fn function

__init__(input_key: str, output_key: str, lambda_fn: Callable = <function LambdaReader.<lambda>>, **kwargs)[source]
Parameters
  • input_key (str) – input key to use from annotation dict

  • output_key (str) – output key to use to store the result

  • lambda_fn (callable) – encode function used to prepare your data (for example, convert chars/words/tokens to indices, etc.)

  • kwargs – additional kwargs for the encode function
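
A minimal sketch; the "text"/"tokens" keys and the whitespace tokenizer are illustrative:

>>> from catalyst.data.reader import LambdaReader
>>> reader = LambdaReader(
>>>     input_key="text",
>>>     output_key="tokens",
>>>     lambda_fn=lambda text: text.lower().split(),
>>> )
>>> reader({"text": "Catalyst data readers", "label": 1})  # {"tokens": ["catalyst", "data", "readers"]}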

class catalyst.data.reader.ReaderCompose(readers: List[catalyst.data.reader.ReaderSpec], mixins: list = None)[source]

Abstraction to compose several readers into one open function.

__call__(element)[source]

Reads a row from your annotations dict and applies all readers and mixins to it.

Parameters

element – elem in your dataset.

Returns

Value after applying all readers and mixins

__init__(readers: List[catalyst.data.reader.ReaderSpec], mixins: list = None)[source]
Parameters
  • readers (List[ReaderSpec]) – list of readers to compose

  • mixins (list) – list of mixins to use
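
A minimal sketch of composing readers into a single open_fn; the column names are illustrative and the LambdaReader stands in for a real image loader:

>>> import numpy as np
>>> from catalyst.data.reader import LambdaReader, ReaderCompose, ScalarReader
>>> open_fn = ReaderCompose([
>>>     LambdaReader(
>>>         input_key="filepath",
>>>         output_key="features",
>>>         lambda_fn=lambda path: np.zeros((3, 32, 32), dtype=np.float32),
>>>     ),
>>>     ScalarReader(input_key="class_id", output_key="targets", dtype=np.int64),
>>> ])
>>> open_fn({"filepath": "images/cat_3.png", "class_id": 3})  # {"features": ..., "targets": 3}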

Sampler

class catalyst.data.sampler.BalanceClassSampler(labels: List[int], mode: str = 'downsampling')[source]

Abstraction over data sampler.

Allows you to create a stratified sample from unbalanced classes.

__init__(labels: List[int], mode: str = 'downsampling')[source]
Parameters
  • labels (List[int]) – list of class labels for each element in the dataset

  • mode (str) – Strategy to balance classes. Must be one of [downsampling, upsampling]

__iter__() → Iterator[int][source]
Yields

indices of stratified sample

__len__() → int[source]
Returns

length of result sample
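
A minimal sketch on a toy, heavily unbalanced dataset (sizes and tensors are illustrative):

>>> import torch
>>> from torch.utils.data import DataLoader, TensorDataset
>>> from catalyst.data.sampler import BalanceClassSampler
>>> labels = [0] * 900 + [1] * 100
>>> dataset = TensorDataset(torch.randn(1000, 8), torch.tensor(labels))
>>> # "upsampling" repeats minority-class indices so both classes are drawn equally often
>>> sampler = BalanceClassSampler(labels, mode="upsampling")
>>> loader = DataLoader(dataset, sampler=sampler, batch_size=32)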

class catalyst.data.sampler.MiniEpochSampler(data_len: int, mini_epoch_len: int, drop_last: bool = False, shuffle: str = None)[source]

Sampler that iterates over mini-epochs of the dataset, each of size mini_epoch_len.

Example

>>> MiniEpochSampler(len(dataset), mini_epoch_len=100)
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100,
>>>     drop_last=True)
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100,
>>>     shuffle="per_epoch")
__init__(data_len: int, mini_epoch_len: int, drop_last: bool = False, shuffle: str = None)[source]
Parameters
  • data_len (int) – size of the dataset

  • mini_epoch_len (int) – number of samples from the dataset used in one mini-epoch

  • drop_last (bool) – if True, the sampler will drop the last batch if its size would be less than batches_per_epoch

  • shuffle (str) – one of "per_mini_epoch", "per_epoch", or None: "per_mini_epoch" shuffles indices every mini-epoch (every __iter__ call), "per_epoch" shuffles every real epoch, None does not shuffle

__iter__() → Iterator[int][source]

@TODO: Docs. Contribution is welcome.

__len__() → int[source]
Returns

length of the mini-epoch

Return type

int

shuffle() → None[source]

@TODO: Docs. Contribution is welcome.

class catalyst.data.sampler.DistributedSamplerWrapper(sampler, num_replicas: Optional[int] = None, rank: Optional[int] = None, shuffle: bool = True)[source]

Wrapper over Sampler for distributed training. Allows you to use any sampler in distributed mode.

It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In such a case, each process can pass a DistributedSamplerWrapper instance as a DataLoader sampler, and load the subset of subsampled data of the original dataset that is exclusive to it.

Note

Sampler is assumed to be of constant size.

__init__(sampler, num_replicas: Optional[int] = None, rank: Optional[int] = None, shuffle: bool = True)[source]
Parameters
  • sampler – Sampler used for subsampling

  • num_replicas (int, optional) – Number of processes participating in distributed training

  • rank (int, optional) – Rank of the current process within num_replicas

  • shuffle (bool, optional) – If true (default), sampler will shuffle the indices

__iter__()[source]

@TODO: Docs. Contribution is welcome.
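
A minimal sketch of wrapping a non-distributed sampler; the dataset and base sampler are illustrative. In real DDP training num_replicas and rank are usually omitted and taken from the initialized process group; they are passed explicitly here only so the snippet runs standalone:

>>> import torch
>>> from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler
>>> from catalyst.data.sampler import DistributedSamplerWrapper
>>> dataset = TensorDataset(torch.randn(1000, 8))
>>> base_sampler = WeightedRandomSampler(weights=torch.rand(1000), num_samples=1000)
>>> sampler = DistributedSamplerWrapper(base_sampler, num_replicas=2, rank=0)
>>> loader = DataLoader(dataset, sampler=sampler, batch_size=32)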

class catalyst.data.sampler.DynamicLenBatchSampler(sampler, batch_size, drop_last)[source]

A dynamic batch length data sampler. Should be used with catalyst.utils.trim_tensors.

Adapted from “Dynamic minibatch trimming to improve BERT training speed” https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/94779

Parameters
  • sampler (torch.utils.data.Sampler) – Base sampler.

  • batch_size (int) – Size of minibatch.

  • drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.

Usage example:

>>> from torch.utils import data
>>> from catalyst.data import DynamicLenBatchSampler
>>> from catalyst import utils
>>> dataset = data.TensorDataset(
>>>     input_ids, input_mask, segment_ids, labels
>>> )
>>> sampler_ = data.RandomSampler(dataset)
>>> sampler = DynamicLenBatchSampler(
>>>     sampler_, batch_size=16, drop_last=False
>>> )
>>> loader = data.DataLoader(dataset, batch_sampler=sampler)
>>> for batch in loader:
>>>     tensors = utils.trim_tensors(batch)
>>>     b_input_ids, b_input_mask, b_segment_ids, b_labels = \
>>>         tuple(t.to(device) for t in tensors)
__iter__()[source]

Iteration over BatchSampler.