Data

The data subpackage contains data preprocessors and dataloader abstractions.

Scripts

Run the scripts by typing catalyst-data in your terminal. For example:

$ catalyst-data tag2label --help

Catalyst-data scripts.

Examples

1. process-images reads raw data and outputs preprocessed resized images

$ catalyst-data process-images \
    --in-dir /path/to/raw/data/ \
    --out-dir=./data/dataset \
    --num-workers=6 \
    --max-size=224 \
    --extension=png \
    --clear-exif \
    --grayscale \
    --expand-dims

2. tag2label prepares a dataset and writes its labeling to JSON like {"class_id": class_column_from_dataset}

$ catalyst-data tag2label \
    --in-dir=./data/dataset \
    --out-dataset=./data/dataset_raw.csv \
    --out-labeling=./data/tag2cls.json

3. check-images checks that the images in your data are not broken and writes a flag: true if the image opened without an error and false otherwise

$ catalyst-data check-images \
    --in-csv=./data/dataset_raw.csv \
    --img-datapath=./data/dataset \
    --img-col="tag" \
    --out-csv=./data/dataset_checked.csv \
    --n-cpu=4

4. split-dataframe splits your dataset into train/valid folds

$ catalyst-data split-dataframe \
   --in-csv=./data/dataset_raw.csv \
   --tag2class=./data/tag2cls.json \
   --tag-column=tag \
   --class-column=class \
   --n-folds=5 \
   --train-folds=0,1,2,3 \
   --out-csv=./data/dataset.csv

5. image2embedding embeds images from your csv or image directory using the specified neural net architecture

$ catalyst-data image2embedding \
    --in-csv=./data/input.csv \
    --img-col="filename" \
    --img-size=64 \
    --out-npy=./embeddings.npy \
    --arch=resnet34 \
    --pooling=GlobalMaxPool2d \
    --batch-size=8 \
    --num-workers=16 \
    --verbose

Augmentor

Legacy classes for augmentations. For modern Catalyst use albumentations.

class catalyst.data.augmentor.Augmentor(dict_key: str, augment_fn: Callable, default_kwargs: Dict = None)

Augmentation abstraction to use with data dictionaries.

__init__(dict_key: str, augment_fn: Callable, default_kwargs: Dict = None)

Parameters

dict_key – key to transform
augment_fn – augmentation function to use
default_kwargs – default kwargs for the augmentation function
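The behaviour described by these parameters can be sketched in plain Python. This is a minimal stand-in written for illustration, not Catalyst's implementation: the class name DictKeyAugmentor and the helper scale are hypothetical, and the sketch assumes that the augmentation is applied to the value stored under dict_key, with default_kwargs passed as keyword arguments.

```python
from typing import Any, Callable, Dict, Optional


class DictKeyAugmentor:
    """Hypothetical stand-in: applies augment_fn to one key of a data dict."""

    def __init__(
        self,
        dict_key: str,
        augment_fn: Callable,
        default_kwargs: Optional[Dict[str, Any]] = None,
    ):
        self.dict_key = dict_key
        self.augment_fn = augment_fn
        self.default_kwargs = default_kwargs or {}

    def __call__(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        # Copy so the caller's dict is left untouched (a choice of this sketch).
        result = dict(sample)
        result[self.dict_key] = self.augment_fn(
            sample[self.dict_key], **self.default_kwargs
        )
        return result


def scale(values, factor=1.0):
    """Toy augmentation (hypothetical): multiply every value by factor."""
    return [v * factor for v in values]


augmentor = DictKeyAugmentor("image", scale, default_kwargs={"factor": 2.0})
sample = {"image": [1.0, 2.0], "target": 1}
augmented = augmentor(sample)
```

Only the "image" entry is changed; the "target" entry passes through as-is.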

class catalyst.data.augmentor.AugmentorKeys(dict2fn_dict: Dict[str, str], augment_fn: Callable)

Augmentation abstraction to match input and augmentation keys.

__init__(dict2fn_dict: Dict[str, str], augment_fn: Callable)

Parameters

dict2fn_dict (Dict[str, str]) – keys matching dict {input_key: augment_fn_key}. For example: {"image": "image", "mask": "mask"}
augment_fn – augmentation function
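The key matching can be illustrated with a small pure-Python sketch. It is not the library's code: the function augment_with_key_mapping and the toy horizontal_flip are hypothetical, and the sketch assumes that augment_fn accepts keyword arguments and returns a dict keyed by the same names.

```python
from typing import Any, Callable, Dict


def augment_with_key_mapping(
    sample: Dict[str, Any],
    dict2fn_dict: Dict[str, str],
    augment_fn: Callable[..., Dict[str, Any]],
) -> Dict[str, Any]:
    """Rename sample keys to augment_fn argument names, call it, map back."""
    fn_inputs = {
        fn_key: sample[in_key] for in_key, fn_key in dict2fn_dict.items()
    }
    fn_outputs = augment_fn(**fn_inputs)
    mapped_back = {
        in_key: fn_outputs[fn_key] for in_key, fn_key in dict2fn_dict.items()
    }
    # Keys that are not in the mapping pass through unchanged.
    return {**sample, **mapped_back}


def horizontal_flip(image, mask):
    """Toy augmentation (hypothetical) that flips both inputs the same way."""
    return {"image": image[::-1], "mask": mask[::-1]}


sample = {"features": [1, 2, 3], "labels": [0, 0, 1], "name": "a.png"}
mapping = {"features": "image", "labels": "mask"}
flipped = augment_with_key_mapping(sample, mapping, horizontal_flip)
```

Here the sample stores its data under "features" and "labels", while the augmentation expects "image" and "mask"; the mapping bridges the two naming schemes.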

Collate Functions

class catalyst.data.collate_fn.FilteringCollateFn(*keys)

Callable object that does the job of collate_fn like default_collate, but does not cast batch items with the specified keys to torch.Tensor.

It only collects them into a list. Supports only batches in key-value format.

__call__(batch)

Parameters: batch – current batch
Returns: batch values filtered by keys

__init__(*keys)

Parameters: keys – keys for values that will not be converted to tensors and stacked
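A rough sketch of the filtering idea, written without PyTorch so that it runs anywhere. The function filtering_collate and its stack argument are hypothetical; stack stands in for the tensor conversion that default_collate would perform, and tuple is used here purely as a placeholder.

```python
from typing import Any, Callable, Dict, List, Sequence


def filtering_collate(
    batch: Sequence[Dict[str, Any]],
    keys: Sequence[str],
    stack: Callable[[List[Any]], Any] = tuple,
) -> Dict[str, Any]:
    """Collate a key-value batch, leaving values under `keys` as plain lists."""
    collated = {}
    for key in batch[0]:
        values = [item[key] for item in batch]
        # Filtered keys are only gathered into a list; the rest are stacked.
        collated[key] = values if key in keys else stack(values)
    return collated


batch = [
    {"features": 0.1, "name": "a.png"},
    {"features": 0.2, "name": "b.png"},
]
result = filtering_collate(batch, keys=["name"])
```

This is useful for values such as file names or variable-length objects that cannot be stacked into a tensor.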

Dataset

class catalyst.data.dataset.ListDataset(list_data: List[Dict], open_fn: Callable, dict_transform: Callable = None)

Bases: torch.utils.data.dataset.Dataset

General purpose dataset class with several data sources, given as list_data.

__getitem__(index: int) → Any

Gets an element of the dataset.

Parameters: index (int) – index of the element in the dataset
Returns: Single element by index

__init__(list_data: List[Dict], open_fn: Callable, dict_transform: Callable = None)

Parameters

list_data (List[Dict]) – list of dicts that store your data annotations (for example paths to images, labels, bboxes, etc.)
open_fn (callable) – function that can open your annotations dict and transform it into the data needed by your network (for example open an image by path, or tokenize a read string)
dict_transform (callable) – transforms to use on the dict (for example normalize image, add blur, crop/resize/etc.)

__len__() → int

Returns: length of the dataset
Return type: int
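How the three constructor arguments fit together can be sketched as follows. SimpleListDataset is a hypothetical pure-Python stand-in (it does not subclass the PyTorch Dataset), and the order "open_fn first, then dict_transform" is an assumption based on the parameter descriptions above.

```python
class SimpleListDataset:
    """Hypothetical stand-in that mirrors the three constructor arguments."""

    def __init__(self, list_data, open_fn, dict_transform=None):
        self.data = list_data
        self.open_fn = open_fn
        # Fall back to the identity when no transform is given.
        self.dict_transform = (
            dict_transform if dict_transform is not None else (lambda x: x)
        )

    def __getitem__(self, index):
        annotation = self.data[index]
        return self.dict_transform(self.open_fn(annotation))

    def __len__(self):
        return len(self.data)


annotations = [
    {"text": "hello world", "label": 0},
    {"text": "good bye", "label": 1},
]
# open_fn turns an annotation dict into model-ready data (here: tokens).
open_fn = lambda row: {"tokens": row["text"].split(), "label": row["label"]}
# dict_transform post-processes the opened dict (here: adds a length field).
add_length = lambda item: {**item, "length": len(item["tokens"])}

dataset = SimpleListDataset(annotations, open_fn=open_fn, dict_transform=add_length)
first = dataset[0]
```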

class catalyst.data.dataset.MergeDataset(*datasets: torch.utils.data.dataset.Dataset, dict_transform: Callable = None)

Bases: torch.utils.data.dataset.Dataset

Abstraction to merge several datasets into one dataset.

__getitem__(index: int) → Any

Gets an item from all datasets.

Parameters: index (int) – index of the value to take from all datasets
Returns: list of values from every dataset
Return type: list

__init__(*datasets: torch.utils.data.dataset.Dataset, dict_transform: Callable = None)

Parameters

datasets (List[Dataset]) – datasets to merge, passed as positional arguments
dict_transform (callable) – transforms common for all datasets (for example normalize image, add blur, crop/resize/etc.)

__len__() → int

Returns: length of the dataset
Return type: int

class catalyst.data.dataset.NumpyDataset(numpy_data: numpy.ndarray, numpy_key: str = 'features', dict_transform: Callable = None)

Bases: torch.utils.data.dataset.Dataset

General purpose dataset class to use with numpy_data.

__getitem__(index: int) → Any

Gets an element of the dataset.

Parameters: index (int) – index of the element in the dataset
Returns: Single element by index

__init__(numpy_data: numpy.ndarray, numpy_key: str = 'features', dict_transform: Callable = None)

Parameters

numpy_data (np.ndarray) – numpy data (for example embeddings, features, etc.)
numpy_key (str) – key to use for the output dictionary
dict_transform (callable) – transforms to use on the dict (for example normalize vector, etc.)

class catalyst.data.dataset.PathsDataset(filenames: List[Union[str, pathlib.Path]], open_fn: Callable[[dict], dict], label_fn: Callable[[Union[str, pathlib.Path]], Any], **list_dataset_params)

Bases: catalyst.data.dataset.ListDataset

Dataset that derives features and targets from the samples' filesystem paths.

__init__(filenames: List[Union[str, pathlib.Path]], open_fn: Callable[[dict], dict], label_fn: Callable[[Union[str, pathlib.Path]], Any], **list_dataset_params)

Parameters

filenames (List[str]) – list of file paths that store information about your dataset samples; these could be images, texts or any other files in general
open_fn (callable) – function that can open your annotations dict and transform it into the data needed by your network (for example open an image by path, or tokenize a read string)
label_fn (callable) – function that can extract the target value from a sample path (for example, your sample could be an image file like /path/to/your/image_1.png where the target is encoded as a part of the file path)
list_dataset_params (dict) – base class initialization parameters

Examples

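>>> from pathlib import Path
>>> # open_fn is your own function that opens a sample dict
>>> # Path(x).name works for both str and Path inputs
>>> label_fn = lambda x: Path(x).name.split("_")[0]
>>> dataset = PathsDataset(
...     filenames=Path("/path/to/images/").glob("*.jpg"),
...     label_fn=label_fn,
...     open_fn=open_fn,
... )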
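A label_fn of this kind can be tried out on its own, without Catalyst installed. The sketch below assumes that file names encode the class before the first underscore; the helper label_from_filename and the "features"/"targets" keys are choices of this illustration, not a statement about the library's internals.

```python
from pathlib import PurePosixPath
from typing import Union


def label_from_filename(path: Union[str, PurePosixPath]) -> str:
    """Return the part of the file name before the first underscore."""
    return PurePosixPath(path).name.split("_")[0]


paths = [
    "/path/to/images/cat_001.jpg",
    PurePosixPath("/path/to/images/dog_17.jpg"),
]
# One annotation dict per file: where the sample lives and what its target is.
samples = [
    {"features": str(p), "targets": label_from_filename(p)} for p in paths
]
```

Working on the file name rather than the full path avoids surprises when a parent directory also contains an underscore.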

Reader

Readers are the abstraction for your dataset. They can open an element of the dataset and transform it into the data needed by your network, for example open an image by path, or read a string and tokenize it.

class catalyst.data.reader.ReaderSpec(input_key: str, output_key: str)

Reader abstraction for all Readers. Applies a function to an element of your data, for example to a row from a csv, or to an image, etc.

All inherited classes have to implement __call__.

__call__(row)

Reads a row from your annotations dict and transforms it into the data needed by your network, for example opens an image by path, or reads a string and tokenizes it.

Parameters: row – element of your dataset
Returns: Data object used for your neural network

__init__(input_key: str, output_key: str)

Parameters

input_key (str) – input key to use from the annotation dict
output_key (str) – output key to use to store the result

class catalyst.data.reader.LambdaReader(input_key: str, output_key: str, encode_fn: Callable = <function LambdaReader.<lambda>>, **kwargs)

Reader abstraction with a lambda encoder. Can read an element from the dataset and apply the encode_fn function to it.

__call__(row)

Reads a row from your annotations dict and applies the encode_fn function.

Parameters: row – element of your dataset
Returns: Value after applying the encode_fn function

__init__(input_key: str, output_key: str, encode_fn: Callable = <function LambdaReader.<lambda>>, **kwargs)

Parameters

input_key (str) – input key to use from the annotation dict
output_key (str) – output key to use to store the result
encode_fn (callable) – encode function to use to prepare your data (for example convert chars/words/tokens to indices, etc.)
kwargs – kwargs for the encode function
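The read, encode and store cycle can be sketched in a few lines. KeyedLambdaReader and chars_to_indices are hypothetical names used only here, and returning the result as a dict keyed by output_key is an assumption of this sketch.

```python
from typing import Any, Callable, Dict


class KeyedLambdaReader:
    """Hypothetical stand-in: read one key, encode it, store under another."""

    def __init__(
        self,
        input_key: str,
        output_key: str,
        encode_fn: Callable = lambda x: x,
        **kwargs: Any,
    ):
        self.input_key = input_key
        self.output_key = output_key
        self.encode_fn = encode_fn
        self.kwargs = kwargs

    def __call__(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Extra kwargs given at construction time are forwarded to encode_fn.
        value = self.encode_fn(row[self.input_key], **self.kwargs)
        return {self.output_key: value}


def chars_to_indices(text, vocab):
    """Toy encoder (hypothetical): map every character to its vocabulary index."""
    return [vocab[ch] for ch in text]


reader = KeyedLambdaReader(
    "text", "indices", encode_fn=chars_to_indices, vocab={"a": 0, "b": 1, "c": 2}
)
encoded = reader({"text": "cab", "label": 1})
```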

class catalyst.data.reader.ScalarReader(input_key: str, output_key: str, dtype: Type = <class 'numpy.float32'>, default_value: float = None, one_hot_classes: int = None, smoothing: float = None)

Numeric data reader abstraction. Reads a single float, int, str or other value from the data.

__call__(row)

Reads a row from your annotations dict and transforms it into a single value.

Parameters: row – element of your dataset
Returns: Scalar value
Return type: dtype

__init__(input_key: str, output_key: str, dtype: Type = <class 'numpy.float32'>, default_value: float = None, one_hot_classes: int = None, smoothing: float = None)

Parameters

input_key (str) – input key to use from the annotation dict
output_key (str) – output key to use to store the result
dtype (type) – datatype of scalar values to use
default_value – default value to use if something goes wrong
one_hot_classes (int) – number of one-hot classes
smoothing (float, optional) – if specified, applies label smoothing to the one-hot classes
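To make the one_hot_classes and smoothing options concrete, here is a pure-Python sketch of one common label smoothing convention: the true class gets 1 - smoothing and the remaining mass is split evenly among the other classes. Whether ScalarReader uses exactly this formula is an assumption; the function one_hot is hypothetical.

```python
from typing import List, Optional


def one_hot(
    label: int, num_classes: int, smoothing: Optional[float] = None
) -> List[float]:
    """One-hot encode `label`, optionally moving `smoothing` mass to other classes."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    on_value, off_value = 1.0, 0.0
    if smoothing is not None and num_classes > 1:
        on_value = 1.0 - smoothing
        off_value = smoothing / (num_classes - 1)
    return [on_value if i == label else off_value for i in range(num_classes)]


hard = one_hot(2, num_classes=4)
soft = one_hot(2, num_classes=4, smoothing=0.1)
```

With this convention the smoothed vector still sums to one, so it remains a valid target distribution.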

class catalyst.data.reader.ImageReader(input_key: str, output_key: str, datapath: str = None, grayscale: bool = False)

Image reader abstraction. Reads images from a csv dataset.

__call__(row)

Reads a row with a filename from your annotations dict and transforms it into an image.

Parameters: row – element of your dataset
Returns: Image
Return type: np.ndarray

__init__(input_key: str, output_key: str, datapath: str = None, grayscale: bool = False)

Parameters

input_key (str) – key to use from the annotation dict
output_key (str) – key to use to store the result
datapath (str) – path to the images dataset (so you can use relative paths in annotations)
grayscale (bool) – flag if you need to work only with grayscale images

class catalyst.data.reader.ReaderCompose(readers: List[catalyst.data.reader.ReaderSpec], mixins: [] = None)

Abstraction to compose several readers into one open function.

__call__(row)

Reads a row from your annotations dict and applies all readers and mixins.

Parameters: row – element of your dataset
Returns: Value after applying all readers and mixins

__init__(readers: List[catalyst.data.reader.ReaderSpec], mixins: [] = None)

Parameters

readers (List[ReaderSpec]) – list of readers to compose
mixins – list of mixins to use
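Composition of readers into a single open function might look like the sketch below. It is an illustration under assumptions, not the library's code: compose_readers is a hypothetical helper, each reader is assumed to return a dict, and mixins are assumed to receive the accumulated result rather than the raw row.

```python
from typing import Any, Callable, Dict, List, Optional

Reader = Callable[[Dict[str, Any]], Dict[str, Any]]


def compose_readers(
    readers: List[Reader], mixins: Optional[List[Reader]] = None
) -> Reader:
    """Build one open function that merges the outputs of all readers."""
    mixins = mixins or []

    def open_fn(row: Dict[str, Any]) -> Dict[str, Any]:
        result: Dict[str, Any] = {}
        for reader in readers:
            # Every reader sees the raw row and contributes its own keys.
            result.update(reader(row))
        for mixin in mixins:
            # Mixins see what the readers produced and may add derived keys.
            result.update(mixin(result))
        return result

    return open_fn


read_text = lambda row: {"tokens": row["text"].split()}
read_label = lambda row: {"target": float(row["label"])}
add_length = lambda result: {"length": len(result["tokens"])}

open_fn = compose_readers([read_text, read_label], mixins=[add_length])
item = open_fn({"text": "a b c", "label": "1"})
```

The resulting open_fn has the shape expected by the open_fn parameter of the dataset classes described earlier on this page.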

Sampler

class catalyst.data.sampler.BalanceClassSampler(labels: List[int], mode: str = 'downsampling')

Abstraction over data sampler. Allows you to create a stratified sample on unbalanced classes.

__init__(labels: List[int], mode: str = 'downsampling')

Parameters

labels (List[int]) – list of class labels for each element in the dataset
mode (str) – strategy to balance classes. Must be one of [downsampling, upsampling]

__iter__() → Iterator[int]

Yields: indices of the stratified sample

__len__() → int

Returns: length of the resulting sample
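The two balancing strategies can be sketched with the standard library only. The function balanced_indices is hypothetical, and it assumes that downsampling cuts every class to the size of the smallest one while upsampling repeats samples up to the size of the largest one; the actual sampler may differ in details.

```python
import random
from collections import defaultdict
from typing import Dict, List


def balanced_indices(
    labels: List[int], mode: str = "downsampling", seed: int = 0
) -> List[int]:
    """Return indices with the same number of samples for every class."""
    if mode not in ("downsampling", "upsampling"):
        raise ValueError("mode must be 'downsampling' or 'upsampling'")
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    sizes = [len(indices) for indices in by_class.values()]
    per_class = min(sizes) if mode == "downsampling" else max(sizes)
    picked: List[int] = []
    for indices in by_class.values():
        if len(indices) >= per_class:
            # Enough samples: draw without replacement.
            picked.extend(rng.sample(indices, per_class))
        else:
            # Too few samples: draw with replacement.
            picked.extend(rng.choices(indices, k=per_class))
    rng.shuffle(picked)
    return picked


labels = [0, 0, 0, 0, 1, 1]
down = balanced_indices(labels, mode="downsampling")
up = balanced_indices(labels, mode="upsampling")
```

Downsampling discards data from the larger classes, while upsampling repeats items from the smaller ones, so the choice trades epoch length against duplicate samples.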

class catalyst.data.sampler.MiniEpochSampler(data_len: int, mini_epoch_len: int, drop_last: bool = False, shuffle: str = None)

Sampler that iterates over mini epochs of the dataset, each of size mini_epoch_len.

Parameters

data_len (int) – size of the dataset
mini_epoch_len (int) – number of samples from the dataset used in one mini epoch
drop_last (bool) – if True, the sampler will drop the last batches if their size would be less than batches_per_epoch
shuffle (str) – one of ["per_mini_epoch", "per_epoch", None]. The sampler will shuffle indices:
"per_mini_epoch" – every mini epoch (every __iter__ call)
"per_epoch" – every real epoch
None – don't shuffle

Example

>>> MiniEpochSampler(len(dataset), mini_epoch_len=100)
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100,
...     drop_last=True)
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100,
...     shuffle="per_epoch")

__iter__() → Iterator[int]

__len__() → int

shuffle()
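The idea of splitting one pass over the data into mini epochs can be sketched as a plain generator. This is a simplified illustration: mini_epochs is a hypothetical function, it does no shuffling, and it lists all mini epochs at once, whereas the sampler described here hands out one mini epoch per __iter__ call. The treatment of drop_last (skipping a final, shorter chunk) is also an assumption.

```python
from typing import Iterator, List


def mini_epochs(
    data_len: int, mini_epoch_len: int, drop_last: bool = False
) -> Iterator[List[int]]:
    """Yield one list of indices per mini epoch, in order and unshuffled."""
    for start in range(0, data_len, mini_epoch_len):
        chunk = list(range(start, min(start + mini_epoch_len, data_len)))
        if drop_last and len(chunk) < mini_epoch_len:
            # The final chunk is shorter than a full mini epoch: skip it.
            break
        yield chunk


all_chunks = list(mini_epochs(10, mini_epoch_len=4))
full_chunks = list(mini_epochs(10, mini_epoch_len=4, drop_last=True))
```

With 10 samples and a mini epoch length of 4, the first call yields three chunks (the last one holding two indices) and the second call yields only the two full chunks.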
