Data¶
The data subpackage has data preprocessors and dataloader abstractions.
Scripts¶
You can run the scripts by typing catalyst-data in your terminal. For example:
$ catalyst-data tag2label --help
Catalyst-data scripts.
Examples
1. process-images reads raw data and outputs preprocessed resized images
$ catalyst-data process-images \
    --in-dir /path/to/raw/data/ \
    --out-dir=./data/dataset \
    --num-workers=6 \
    --max-size=224 \
    --extension=png \
    --clear-exif \
    --grayscale \
    --expand-dims
2. tag2label prepares dataset labels as a JSON mapping like {"class_id": class_column_from_dataset}
$ catalyst-data tag2label \
    --in-dir=./data/dataset \
    --out-dataset=./data/dataset_raw.csv \
    --out-labeling=./data/tag2cls.json
3. check-images checks that the images in your data are not broken and writes a flag: true if the image opened without an error, false otherwise
$ catalyst-data check-images \
    --in-csv=./data/dataset_raw.csv \
    --img-rootpath=./data/dataset \
    --img-col="tag" \
    --out-csv=./data/dataset_checked.csv \
    --n-cpu=4
4. split-dataframe splits your dataset into train/valid folds
$ catalyst-data split-dataframe \
    --in-csv=./data/dataset_raw.csv \
    --tag2class=./data/tag2cls.json \
    --tag-column=tag \
    --class-column=class \
    --n-folds=5 \
    --train-folds=0,1,2,3 \
    --out-csv=./data/dataset.csv
5. image2embedding embeds images from your csv or image directory with the specified neural network architecture
$ catalyst-data image2embedding \
    --in-csv=./data/input.csv \
    --img-col="filename" \
    --img-size=64 \
    --out-npy=./embeddings.npy \
    --arch=resnet34 \
    --pooling=GlobalMaxPool2d \
    --batch-size=8 \
    --num-workers=16 \
    --verbose
Augmentor¶
class catalyst.data.augmentor.Augmentor(dict_key: str, augment_fn: Callable, input_key: str = None, output_key: str = None, **kwargs)[source]¶
Augmentation abstraction to use with data dictionaries.
__init__(dict_key: str, augment_fn: Callable, input_key: str = None, output_key: str = None, **kwargs)[source]¶
Augmentation abstraction to use with data dictionaries.
- Parameters
- dict_key (str) – key to transform 
- augment_fn (Callable) – augmentation function to use 
- input_key (str) – augment_fn input key
- output_key (str) – augment_fn output key
- **kwargs – default kwargs for augmentations function 
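A minimal usage sketch, assuming the Augmentor instance is applied directly to a sample dict (e.g. as a dict_transform); the sample layout, keys, and the torchvision transform are illustrative assumptions, not part of the API above:
import torch
from torchvision import transforms
from catalyst.data.augmentor import Augmentor

# wrap a tensor-level transform so it touches only sample["image"]
image_augmentor = Augmentor(
    dict_key="image",
    augment_fn=transforms.Normalize(mean=(0.5,), std=(0.5,)),
)

sample = {"image": torch.rand(1, 28, 28), "label": 3}
sample = image_augmentor(sample)  # only the "image" entry is transformed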
 
 
 
class catalyst.data.augmentor.AugmentorCompose(key2augment_fn: Dict[str, Callable])[source]¶
Compose augmentors.
Collate Functions¶
Dataset¶
class catalyst.data.dataset.ListDataset(list_data: List[Dict], open_fn: Callable, dict_transform: Callable = None)[source]¶
Bases: torch.utils.data.dataset.Dataset
General purpose dataset class with several data sources list_data.
__getitem__(index: int) → Any[source]¶
Gets element of the dataset.
- Parameters
- index (int) – index of the element in the dataset 
- Returns
- Single element by index 
 
__init__(list_data: List[Dict], open_fn: Callable, dict_transform: Callable = None)[source]¶
- Parameters
- list_data (List[Dict]) – list of dicts that store your data annotations (for example paths to images, labels, bboxes, etc.)
- open_fn (callable) – function that can open your annotation dict and transfer it to the data needed by your network (for example, open an image by path, or tokenize a read string)
- dict_transform (callable) – transforms to use on the dict (for example normalize image, add blur, crop/resize, etc.)
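A minimal sketch of how these pieces fit together; the file paths, the cv2-based open_fn, and the "features"/"targets" keys are illustrative assumptions:
import cv2
from torch.utils.data import DataLoader
from catalyst.data.dataset import ListDataset

# annotations: one dict per sample
list_data = [
    {"image": "./data/dataset/cat_0.png", "label": 0},
    {"image": "./data/dataset/dog_0.png", "label": 1},
]

def open_fn(annotation):
    # open the image by path and keep the label as the target
    image = cv2.imread(annotation["image"])
    return {"features": image, "targets": annotation["label"]}

dataset = ListDataset(list_data=list_data, open_fn=open_fn)
loader = DataLoader(dataset, batch_size=2)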
 
 
 
class catalyst.data.dataset.MergeDataset(*datasets: torch.utils.data.dataset.Dataset, dict_transform: Callable = None)[source]¶
Bases: torch.utils.data.dataset.Dataset
Abstraction to merge several datasets into one dataset.
__getitem__(index: int) → Any[source]¶
Get item from all datasets.
- Parameters
- index (int) – index to value from all datasets 
- Returns
- list of values from every dataset
- Return type
- list 
 
 
class catalyst.data.dataset.NumpyDataset(numpy_data: numpy.ndarray, numpy_key: str = 'features', dict_transform: Callable = None)[source]¶
Bases: torch.utils.data.dataset.Dataset
General purpose dataset class to use with numpy_data.
__getitem__(index: int) → Any[source]¶
Gets element of the dataset.
- Parameters
- index (int) – index of the element in the dataset 
- Returns
- Single element by index 
 
__init__(numpy_data: numpy.ndarray, numpy_key: str = 'features', dict_transform: Callable = None)[source]¶
General purpose dataset class to use with numpy_data.
- Parameters
- numpy_data (np.ndarray) – numpy data (for example path to embeddings, features, etc.) 
- numpy_key (str) – key to use for output dictionary 
- dict_transform (callable) – transforms to use on dict. (for example normalize vector, etc) 
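A minimal sketch with a synthetic feature matrix; per the numpy_key parameter above, each item should come out as a dict stored under that key:
import numpy as np
from torch.utils.data import DataLoader
from catalyst.data.dataset import NumpyDataset

embeddings = np.random.rand(1000, 128).astype(np.float32)  # synthetic data
dataset = NumpyDataset(numpy_data=embeddings, numpy_key="features")
loader = DataLoader(dataset, batch_size=32)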
 
 
 
class catalyst.data.dataset.PathsDataset(filenames: List[Union[str, pathlib.Path]], open_fn: Callable[[dict], dict], label_fn: Callable[[Union[str, pathlib.Path]], Any], **list_dataset_params)[source]¶
Bases: catalyst.data.dataset.ListDataset
Dataset that derives features and targets from samples filesystem paths.
Examples
>>> label_fn = lambda x: x.split("_")[0]
>>> dataset = PathsDataset(
>>>     filenames=Path("/path/to/images/").glob("*.jpg"),
>>>     label_fn=label_fn,
>>>     open_fn=open_fn,
>>> )
__init__(filenames: List[Union[str, pathlib.Path]], open_fn: Callable[[dict], dict], label_fn: Callable[[Union[str, pathlib.Path]], Any], **list_dataset_params)[source]¶
- Parameters
- filenames (List[str]) – list of file paths that store information about your dataset samples; it could be images, texts or any other files in general. 
- open_fn (callable) – function that can open your annotation dict and transfer it to the data needed by your network (for example, open an image by path, or tokenize a read string)
- label_fn (callable) – function that can extract the target value from a sample path (for example, your sample could be an image file like /path/to/your/image_1.png where the target is encoded as a part of the file path)
- list_dataset_params (dict) – base class initialization parameters. 
 
 
 
class catalyst.data.dataset.DatasetFromSampler(sampler: torch.utils.data.sampler.Sampler)[source]¶
Bases: torch.utils.data.dataset.Dataset
Dataset of indexes from Sampler.
__getitem__(index: int)[source]¶
Gets element of the dataset.
- Parameters
- index (int) – index of the element in the dataset 
- Returns
- Single element by index 
 
 
In-batch Samplers¶
class catalyst.data.sampler_inbatch.AllTripletsSampler(max_output_triplets: int = 9223372036854775807)[source]¶
This sampler selects all the possible triplets for the given labels.
class catalyst.data.sampler_inbatch.HardTripletsSampler(need_norm: bool = False)[source]¶
This sampler selects the hardest triplets based on distances between features: the hardest positive sample has the maximal distance to the anchor sample, and the hardest negative sample has the minimal distance to the anchor sample.
class catalyst.data.sampler_inbatch.InBatchTripletsSampler[source]¶
Base class for triplet samplers. We expect that child instances of this class will be used to form triplets inside batches. The batches must contain at least 2 samples for each class and at least 2 different classes; such behaviour can be guaranteed by using catalyst.data.sampler.BalanceBatchSampler, but you are not limited to it.
sample(features: torch.Tensor, labels: List[int]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶
- Parameters
- features – has the shape of [batch_size, feature_size] 
- labels – labels of the samples in the batch 
 
- Returns
- the batch of triplets in the order (anchor, positive, negative)
- Return type
- Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
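A minimal sketch of sampling triplets from one batch of embeddings with HardTripletsSampler; the features and labels below are random placeholders:
import torch
from catalyst.data.sampler_inbatch import HardTripletsSampler

features = torch.rand(8, 16)       # [batch_size, feature_size]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # >= 2 samples per class, >= 2 classes

sampler = HardTripletsSampler(need_norm=True)
anchor, positive, negative = sampler.sample(features=features, labels=labels)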
 
 
Loader¶
class catalyst.data.loader.BatchLimitLoaderWrapper(loader: torch.utils.data.dataloader.DataLoader, num_batches: Union[int, float])[source]¶
Bases: object
Loader wrapper. Limits number of batches used per each iteration.
For example, if you have some loader and want to use only the first 5 batches:
import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.data.loader import BatchLimitLoaderWrapper

num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loader = BatchLimitLoaderWrapper(loader, num_batches=5)
or if you would like to use only some portion of the Dataloader (we use 30% in the example below):
import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.data.loader import BatchLimitLoaderWrapper

num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loader = BatchLimitLoaderWrapper(loader, num_batches=0.3)
Note
Generally speaking, this wrapper could be used with any iterator-like object. No DataLoader-specific code is used.
__getattr__(key)[source]¶
Gets an attribute by key. First, it looks at the origin loader for the appropriate key; if nothing is found, it looks at the wrapper's attributes; if it still cannot find anything, it raises NotImplementedError.
- Parameters
- key – attribute key 
- Returns
- attribute value 
- Raises
- NotImplementedError – if the attribute could not be found in either origin or wrapper
 
__init__(loader: torch.utils.data.dataloader.DataLoader, num_batches: Union[int, float])[source]¶
Loader wrapper. Limits number of batches used per each iteration.
- Parameters
- loader (DataLoader) – torch dataloader. 
- num_batches (Union[int, float]) – number of batches to use (int), or portion of iterator (float, should be in [0;1] range) 
 
 
__len__() → int[source]¶
Returns length of the wrapper loader.
- Returns
- length of the wrapper loader 
- Return type
- int 
 
__weakref__¶
- list of weak references to the object (if defined) 
 
Reader¶
Readers are the abstraction for your dataset. They can open an element from the dataset and transform it into the data needed by your network, for example open an image by path, or read a string and tokenize it.
class catalyst.data.reader.ReaderSpec(input_key: str, output_key: str)[source]¶
Reader abstraction for all Readers.
Applies a function to an element of your data. For example to a row from csv, or to an image, etc.
All inherited classes have to implement __call__.
class catalyst.data.reader.ScalarReader(input_key: str, output_key: str, dtype: Type = <class 'numpy.float32'>, default_value: float = None, one_hot_classes: int = None, smoothing: float = None)[source]¶
Numeric data reader abstraction. Reads a single float, int, str or other value from the data.
__call__(element)[source]¶
Reads a row from your annotations dict and transfers it to a single value.
- Parameters
- element – elem in your dataset 
- Returns
- Scalar value 
- Return type
- dtype 
 
__init__(input_key: str, output_key: str, dtype: Type = <class 'numpy.float32'>, default_value: float = None, one_hot_classes: int = None, smoothing: float = None)[source]¶
- Parameters
- input_key (str) – input key to use from annotation dict 
- output_key (str) – output key to use to store the result 
- dtype (type) – datatype of scalar values to use 
- default_value – default value to use if something goes wrong 
- one_hot_classes (int) – number of one-hot classes 
- smoothing (float, optional) – if specified applies label smoothing to one_hot classes 
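A minimal sketch: reading an integer class id from an annotation row and storing it under the output key; the "class"/"targets" keys and the row layout are illustrative assumptions:
import numpy as np
from catalyst.data.reader import ScalarReader

reader = ScalarReader(
    input_key="class",
    output_key="targets",
    dtype=np.int64,
    default_value=-1,
)

row = {"tag": "images/cat_0.png", "class": 3}
output = reader(row)  # the class id is read and stored under "targets"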
 
 
 
class catalyst.data.reader.LambdaReader(input_key: str, output_key: str, lambda_fn: Callable = None, **kwargs)[source]¶
Reader abstraction with a lambda encoder. Can read an element from the dataset and apply the lambda_fn function to it.
__call__(element)[source]¶
Reads a row from your annotations dict and applies the lambda_fn function.
- Parameters
- element – elem in your dataset. 
- Returns
- Value after applying lambda_fn function 
 
__init__(input_key: str, output_key: str, lambda_fn: Callable = None, **kwargs)[source]¶
- Parameters
- input_key (str) – input key to use from annotation dict 
- output_key (str) – output key to use to store the result 
- lambda_fn (callable) – encode function to use to prepare your data (for example convert chars/words/tokens to indices, etc) 
- kwargs – kwargs for encode function 
 
 
 
class catalyst.data.reader.ReaderCompose(readers: List[catalyst.data.reader.ReaderSpec], mixins: list = None)[source]¶
Abstraction to compose several readers into one open function.
__call__(element)[source]¶
Reads a row from your annotations dict and applies all readers and mixins.
- Parameters
- element – elem in your dataset. 
- Returns
- Value after applying all readers and mixins 
 
__init__(readers: List[catalyst.data.reader.ReaderSpec], mixins: list = None)[source]¶
- Parameters
- readers (List[ReaderSpec]) – list of readers to compose
- mixins (list) – list of mixins to use 
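A minimal sketch: composing a LambdaReader and a ScalarReader into a single open function for a ListDataset. The csv-like row layout ("tag"/"class" columns), the image root path, the cv2-based lambda, and the assumption that the lambda receives the value stored under input_key are all illustrative, not guaranteed by the API above:
import cv2
import numpy as np
from catalyst.data.dataset import ListDataset
from catalyst.data.reader import LambdaReader, ReaderCompose, ScalarReader

open_fn = ReaderCompose(readers=[
    # read the image by its relative path from the "tag" column
    LambdaReader(
        input_key="tag",
        output_key="features",
        lambda_fn=lambda path: cv2.imread("./data/dataset/" + path),
    ),
    # read the class id from the "class" column
    ScalarReader(input_key="class", output_key="targets", dtype=np.int64),
])

rows = [{"tag": "cat_0.png", "class": 0}, {"tag": "dog_0.png", "class": 1}]
dataset = ListDataset(list_data=rows, open_fn=open_fn)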
 
 
 
Sampler¶
class catalyst.data.sampler.BalanceClassSampler(labels: List[int], mode: Union[str, int] = 'downsampling')[source]¶
Abstraction over data sampler.
Allows you to create a stratified sample on unbalanced classes.
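A minimal sketch with a synthetic, imbalanced dataset and the default "downsampling" mode; the dataset and labels are placeholders:
import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.data.sampler import BalanceClassSampler

labels = [0] * 900 + [1] * 100  # imbalanced binary labels
dataset = TensorDataset(torch.rand(len(labels), 10), torch.tensor(labels))

sampler = BalanceClassSampler(labels=labels, mode="downsampling")
loader = DataLoader(dataset, sampler=sampler, batch_size=32)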
class catalyst.data.sampler.BalanceBatchSampler(labels: List[int], p: int, k: int)[source]¶
This kind of sampler can be used for both metric learning and classification tasks.
Sampler with the given strategy for a dataset with C unique classes:
- Selection of P of the C classes for the 1st batch
- Selection of K instances for each class for the 1st batch
- Selection of P of the C - P remaining classes for the 2nd batch
- Selection of K instances for each class for the 2nd batch
- …
The epoch ends when there are no classes left, so the batch size is P * K except for the last one.
Thus, in each epoch, all the classes will be selected once, but this does not mean that all the instances will be selected during the epoch.
One of the purposes of this sampler is to be used for forming triplets and pos/neg pairs inside the batch. To guarantee the existence of these pairs in the batch, P and K should be > 1. (1)
Behavior in corner cases:
- If a class does not contain K instances, a choice will be made with repetition.
- If C % P == 1 then one of the classes should be dropped, otherwise statement (1) will not be met.
This type of sampling can be found in the classical Person Re-Id paper, where P equals 32 and K equals 4: In Defense of the Triplet Loss for Person Re-Identification.
__init__(labels: List[int], p: int, k: int)[source]¶
- Parameters
- labels – list of class labels for each element in the dataset
- p – number of classes in a batch, should be > 1 
- k – number of instances of each class in a batch, should be > 1 
 
 
property batch_size¶
- Returns: this value should be used in DataLoader as batch size 
property batches_in_epoch¶
- Returns: number of batches in an epoch 
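A minimal sketch wiring the sampler into a DataLoader; note that sampler.batch_size is passed as the loader's batch size, as the batch_size property above suggests (the synthetic dataset and label layout are placeholders):
import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.data.sampler import BalanceBatchSampler

labels = [i // 20 for i in range(200)]  # 10 classes, 20 samples each
dataset = TensorDataset(torch.rand(200, 16), torch.tensor(labels))

sampler = BalanceBatchSampler(labels=labels, p=4, k=8)
loader = DataLoader(dataset, sampler=sampler, batch_size=sampler.batch_size)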
 
class catalyst.data.sampler.MiniEpochSampler(data_len: int, mini_epoch_len: int, drop_last: bool = False, shuffle: str = None)[source]¶
Sampler iterates mini epochs from the dataset used by mini_epoch_len.
Example
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100)
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100, drop_last=True)
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100,
>>>     shuffle="per_epoch")
__init__(data_len: int, mini_epoch_len: int, drop_last: bool = False, shuffle: str = None)[source]¶
- Parameters
- data_len (int) – Size of the dataset 
- mini_epoch_len (int) – Num samples from the dataset used in one mini epoch. 
- drop_last (bool) – If True, the sampler will drop the last batches if their size would be less than batches_per_epoch
- shuffle (str) – one of "per_mini_epoch", "per_epoch", or None. The sampler will shuffle indices: "per_mini_epoch" – every mini epoch (every __iter__ call); "per_epoch" – every real epoch; None – don't shuffle
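A minimal sketch of plugging MiniEpochSampler into a DataLoader; the synthetic dataset is a placeholder:
import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.data.sampler import MiniEpochSampler

dataset = TensorDataset(torch.rand(10000, 10))
sampler = MiniEpochSampler(len(dataset), mini_epoch_len=1000, shuffle="per_epoch")
loader = DataLoader(dataset, sampler=sampler, batch_size=32)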
 
 
 
class catalyst.data.sampler.DistributedSamplerWrapper(sampler, num_replicas: Optional[int] = None, rank: Optional[int] = None, shuffle: bool = True)[source]¶
Wrapper over Sampler for distributed training. Allows you to use any sampler in distributed mode.
It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In such case, each process can pass a DistributedSamplerWrapper instance as a DataLoader sampler, and load a subset of subsampled data of the original dataset that is exclusive to it.
Note
Sampler is assumed to be of constant size.
__init__(sampler, num_replicas: Optional[int] = None, rank: Optional[int] = None, shuffle: bool = True)[source]¶
- Parameters
- sampler – Sampler used for subsampling 
- num_replicas (int, optional) – Number of processes participating in distributed training 
- rank (int, optional) – Rank of the current process within num_replicas
- shuffle (bool, optional) – If true (default), sampler will shuffle the indices 
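A minimal sketch: making a custom sampler usable in distributed mode by wrapping it. It assumes the default process group is already initialized (e.g. via torch.distributed.init_process_group) so that num_replicas and rank can be left at their defaults; the WeightedRandomSampler is just one example of "any sampler":
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler
from catalyst.data.sampler import DistributedSamplerWrapper

dataset = TensorDataset(torch.rand(1000, 10))
base_sampler = WeightedRandomSampler(weights=torch.ones(1000), num_samples=1000)

# each process loads only its own shard of the (re)sampled indices
sampler = DistributedSamplerWrapper(base_sampler)
loader = DataLoader(dataset, sampler=sampler, batch_size=32)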
 
 
 
class catalyst.data.sampler.DynamicLenBatchSampler(sampler, batch_size, drop_last)[source]¶
A dynamic batch length data sampler. Should be used with catalyst.utils.trim_tensors.
Adapted from Dynamic minibatch trimming to improve BERT training speed.
- Parameters
- sampler (torch.utils.data.Sampler) – Base sampler. 
- batch_size (int) – Size of minibatch. 
- drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size
 
Usage example:
>>> from torch.utils import data
>>> from catalyst.data import DynamicLenBatchSampler
>>> from catalyst import utils
>>> dataset = data.TensorDataset(
>>>     input_ids, input_mask, segment_ids, labels
>>> )
>>> sampler_ = data.RandomSampler(dataset)
>>> sampler = DynamicLenBatchSampler(
>>>     sampler_, batch_size=16, drop_last=False
>>> )
>>> loader = data.DataLoader(dataset, batch_sampler=sampler)
>>> for batch in loader:
>>>     tensors = utils.trim_tensors(batch)
>>>     b_input_ids, b_input_mask, b_segment_ids, b_labels = \
>>>         tuple(t.to(device) for t in tensors)