Data
The data subpackage contains data preprocessors and dataloader abstractions.
Scripts
You can use the scripts by typing catalyst-data in your terminal. For example:
$ catalyst-data tag2label --help
Examples
1. process-images reads raw data and outputs preprocessed, resized images
$ catalyst-data process-images \
    --in-dir /path/to/raw/data/ \
    --out-dir=./data/dataset \
    --num-workers=6 \
    --max-size=224 \
    --extension=png \
    --clear-exif \
    --grayscale \
    --expand-dims
2. tag2label prepares a dataset to json like {"class_id": class_column_from_dataset}
$ catalyst-data tag2label \
    --in-dir=./data/dataset \
    --out-dataset=./data/dataset_raw.csv \
    --out-labeling=./data/tag2cls.json
3. check-images checks that the images in your data are not broken and writes a flag: true if the image opened without errors, false otherwise
$ catalyst-data check-images \
    --in-csv=./data/dataset_raw.csv \
    --img-rootpath=./data/dataset \
    --img-col="tag" \
    --out-csv=./data/dataset_checked.csv \
    --n-cpu=4
4. split-dataframe splits your dataset into train/valid folds
$ catalyst-data split-dataframe \
    --in-csv=./data/dataset_raw.csv \
    --tag2class=./data/tag2cls.json \
    --tag-column=tag \
    --class-column=class \
    --n-folds=5 \
    --train-folds=0,1,2,3 \
    --out-csv=./data/dataset.csv
5. image2embedding embeds images from your csv or image directory with the specified neural net architecture
$ catalyst-data image2embedding \
    --in-csv=./data/input.csv \
    --img-col="filename" \
    --img-size=64 \
    --out-npy=./embeddings.npy \
    --arch=resnet34 \
    --pooling=GlobalMaxPool2d \
    --batch-size=8 \
    --num-workers=16 \
    --verbose
catalyst.data.__main__.build_parser() → argparse.ArgumentParser
@TODO: Docs. Contribution is welcome.

catalyst.data.__main__.main()
@TODO: Docs. Contribution is welcome.
Augmentor
class catalyst.data.augmentor.Augmentor(dict_key: str, augment_fn: Callable, input_key: str = None, output_key: str = None, **kwargs)
Augmentation abstraction to use with data dictionaries.

__init__(dict_key: str, augment_fn: Callable, input_key: str = None, output_key: str = None, **kwargs)
Parameters:
dict_key (str) – key to transform
augment_fn (Callable) – augmentation function to use
input_key (str) – augment_fn input key
output_key (str) – augment_fn output key
**kwargs – default kwargs for the augmentation function
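For instance, a minimal sketch, assuming Augmentor is used as a callable transform over a sample dict (e.g. as dict_transform in the datasets below); the "features" key and the normalization function are hypothetical:
>>> import numpy as np
>>> from catalyst.data.augmentor import Augmentor
>>> # scale the "features" entry of a sample dict to [0, 1]
>>> augmentor = Augmentor(
>>>     dict_key="features",
>>>     augment_fn=lambda x: x.astype(np.float32) / 255.0,
>>> )
>>> sample = {"features": np.random.randint(0, 256, size=(32, 32, 3))}
>>> sample = augmentor(sample)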
class catalyst.data.augmentor.AugmentorCompose(key2augment_fn: Dict[str, Callable])
Compose augmentors.

__init__(key2augment_fn: Dict[str, Callable])
Parameters:
key2augment_fn (Dict[str, Callable]) – mapping from input key to the augmentation function to apply
class catalyst.data.augmentor.AugmentorKeys(dict2fn_dict: Union[Dict[str, str], List[str]], augment_fn: Callable)
Augmentation abstraction to match input and augmentation keys.

__init__(dict2fn_dict: Union[Dict[str, str], List[str]], augment_fn: Callable)
Parameters:
dict2fn_dict (Dict[str, str]) – keys matching dict {input_key: augment_fn_key}, for example: {"image": "image", "mask": "mask"}
augment_fn – augmentation function
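For instance, a hedged sketch using an albumentations pipeline as augment_fn; this is only an assumption, any callable that accepts image=/mask= keyword arguments and returns a dict with the same keys would fit the mapping below:
>>> import albumentations as A
>>> from catalyst.data.augmentor import AugmentorKeys
>>> # route the "image" and "mask" entries of a sample dict into the
>>> # keyword arguments expected by the albumentations pipeline
>>> transform = AugmentorKeys(
>>>     dict2fn_dict={"image": "image", "mask": "mask"},
>>>     augment_fn=A.Compose([A.HorizontalFlip(p=0.5)]),
>>> )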
Collate Functions
class catalyst.data.collate_fn.FilteringCollateFn(*keys)
Callable object that does the job of a collate_fn (like default_collate), but does not cast batch items with the specified keys to torch.Tensor; it only gathers them into a list. Supports only key-value format batches.

__call__(batch)
Parameters:
batch – current batch
Returns:
batch values filtered by keys

__init__(*keys)
Parameters:
keys – keys for values that will not be converted to tensors and stacked
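For instance, a minimal sketch with a hypothetical toy dict-format dataset: the "filename" values stay a plain Python list, while the rest of the batch is collated as usual by the DataLoader.
>>> import torch
>>> from torch.utils.data import DataLoader
>>> from catalyst.data.collate_fn import FilteringCollateFn
>>> samples = [
>>>     {"features": torch.rand(3), "filename": f"img_{i}.png"}
>>>     for i in range(8)
>>> ]
>>> loader = DataLoader(
>>>     samples, batch_size=4, collate_fn=FilteringCollateFn("filename")
>>> )
>>> batch = next(iter(loader))
>>> # batch["features"] is a stacked tensor, batch["filename"] stays a list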
Dataset
class catalyst.data.dataset.ListDataset(list_data: List[Dict], open_fn: Callable, dict_transform: Callable = None)
Bases: torch.utils.data.dataset.Dataset
General purpose dataset class with several data sources list_data.

__getitem__(index: int) → Any
Gets an element of the dataset.
Parameters:
index (int) – index of the element in the dataset
Returns:
single element by index

__init__(list_data: List[Dict], open_fn: Callable, dict_transform: Callable = None)
Parameters:
list_data (List[Dict]) – list of dicts that store your data annotations (for example paths to images, labels, bboxes, etc.)
open_fn (callable) – function that can open your annotation dict and transfer it to the data needed by your network (for example, open an image by path, or tokenize a string)
dict_transform (callable) – transforms to use on the dict (for example normalize image, add blur, crop/resize, etc.)

__len__() → int
Returns:
length of the dataset
Return type:
int
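For instance, a minimal sketch; the annotation paths and the cv2-based open_fn below are hypothetical:
>>> import cv2
>>> from catalyst.data.dataset import ListDataset
>>> list_data = [
>>>     {"image_path": "./data/dataset/cat_001.png", "label": 0},
>>>     {"image_path": "./data/dataset/dog_001.png", "label": 1},
>>> ]
>>> def open_fn(annotation):
>>>     # open the image by path and keep the stored label
>>>     return {
>>>         "image": cv2.imread(annotation["image_path"]),
>>>         "label": annotation["label"],
>>>     }
>>> dataset = ListDataset(list_data=list_data, open_fn=open_fn)
>>> sample = dataset[0]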
class catalyst.data.dataset.MergeDataset(*datasets: torch.utils.data.dataset.Dataset, dict_transform: Callable = None)
Bases: torch.utils.data.dataset.Dataset
Abstraction to merge several datasets into one dataset.

__getitem__(index: int) → Any
Gets the item from all datasets.
Parameters:
index (int) – index to a value in every merged dataset
Returns:
list of values, one per merged dataset
Return type:
list

__init__(*datasets: torch.utils.data.dataset.Dataset, dict_transform: Callable = None)
Parameters:
datasets (List[Dataset]) – variable number of datasets to merge
dict_transform (callable) – transforms common for all datasets (for example normalize image, add blur, crop/resize, etc.)

__len__() → int
Returns:
length of the dataset
Return type:
int
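For instance, a hedged sketch that merges two NumpyDataset sources (documented below) holding toy random data, so that every index returns values from both:
>>> import numpy as np
>>> from catalyst.data.dataset import MergeDataset, NumpyDataset
>>> images = NumpyDataset(np.random.rand(100, 3, 32, 32), numpy_key="image")
>>> embeddings = NumpyDataset(np.random.rand(100, 128), numpy_key="embedding")
>>> merged = MergeDataset(images, embeddings)
>>> items = merged[0]  # list with one value per merged dataset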
class catalyst.data.dataset.NumpyDataset(numpy_data: numpy.ndarray, numpy_key: str = 'features', dict_transform: Callable = None)
Bases: torch.utils.data.dataset.Dataset
General purpose dataset class to use with numpy_data.

__getitem__(index: int) → Any
Gets an element of the dataset.
Parameters:
index (int) – index of the element in the dataset
Returns:
single element by index

__init__(numpy_data: numpy.ndarray, numpy_key: str = 'features', dict_transform: Callable = None)
Parameters:
numpy_data (np.ndarray) – numpy data (for example embeddings, features, etc.)
numpy_key (str) – key to use for the output dictionary
dict_transform (callable) – transforms to use on the dict (for example normalize vector, etc.)

__len__() → int
Returns:
length of the dataset
Return type:
int
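For instance, a minimal sketch wrapping precomputed embeddings; the ./embeddings.npy path is hypothetical, e.g. the output of the catalyst-data image2embedding script above:
>>> import numpy as np
>>> from catalyst.data.dataset import NumpyDataset
>>> embeddings = np.load("./embeddings.npy")
>>> dataset = NumpyDataset(numpy_data=embeddings, numpy_key="embedding")
>>> sample = dataset[0]  # dict with the "embedding" key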
class catalyst.data.dataset.PathsDataset(filenames: List[Union[str, pathlib.Path]], open_fn: Callable[[dict], dict], label_fn: Callable[[Union[str, pathlib.Path]], Any], **list_dataset_params)
Bases: catalyst.data.dataset.ListDataset
Dataset that derives features and targets from samples' filesystem paths.

Examples
>>> label_fn = lambda x: x.split("_")[0]
>>> dataset = PathsDataset(
>>>     filenames=Path("/path/to/images/").glob("*.jpg"),
>>>     label_fn=label_fn,
>>>     open_fn=open_fn,
>>> )

__init__(filenames: List[Union[str, pathlib.Path]], open_fn: Callable[[dict], dict], label_fn: Callable[[Union[str, pathlib.Path]], Any], **list_dataset_params)
Parameters:
filenames (List[str]) – list of file paths that store information about your dataset samples; in general these could be images, texts or any other files
open_fn (callable) – function that can open your annotation dict and transfer it to the data needed by your network (for example, open an image by path, or tokenize a string)
label_fn (callable) – function that can extract the target value from a sample path (for example, your sample could be an image file like /path/to/your/image_1.png where the target is encoded as part of the file path)
list_dataset_params (dict) – base class initialization parameters
class catalyst.data.dataset.DatasetFromSampler(sampler: torch.utils.data.sampler.Sampler)
Bases: torch.utils.data.dataset.Dataset
Dataset of indexes from Sampler.

__getitem__(index: int)
Gets an element of the dataset.
Parameters:
index (int) – index of the element in the dataset
Returns:
single element by index

__init__(sampler: torch.utils.data.sampler.Sampler)
Parameters:
sampler (Sampler) – @TODO: Docs. Contribution is welcome

__len__() → int
Returns:
length of the dataset
Return type:
int
Reader
Readers are the abstraction for your dataset. They can open an element from the dataset and transform it into the data needed by your network, for example open an image by path, or read a string and tokenize it.
class catalyst.data.reader.ReaderSpec(input_key: str, output_key: str)
Reader abstraction for all readers.
Applies a function to an element of your data, for example to a row from a csv, or to an image.
All inherited classes have to implement __call__.

__call__(element)
Reads a row from your annotations dict and transfers it to the data needed by your network, for example opens an image by path, or reads a string and tokenizes it.
Parameters:
element – element in your dataset
Returns:
data object used by your neural network

__init__(input_key: str, output_key: str)
Parameters:
input_key (str) – input key to use from the annotation dict
output_key (str) – output key to use to store the result
class catalyst.data.reader.ImageReader(input_key: str, output_key: str, rootpath: str = None, grayscale: bool = False)
Image reader abstraction. Reads images from a csv dataset.

__call__(element)
Reads a row with a filename from your annotations dict and transfers it to an image.
Parameters:
element – element in your dataset
Returns:
image
Return type:
np.ndarray

__init__(input_key: str, output_key: str, rootpath: str = None, grayscale: bool = False)
Parameters:
input_key (str) – key to use from the annotation dict
output_key (str) – key to use to store the result
rootpath (str) – path to the root directory of the images dataset (so you can use relative paths in annotations)
grayscale (bool) – flag if you need to work only with grayscale images
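For instance, a minimal sketch; the "filepath" column and the paths below are hypothetical:
>>> from catalyst.data.reader import ImageReader
>>> image_reader = ImageReader(
>>>     input_key="filepath",
>>>     output_key="image",
>>>     rootpath="./data/dataset",
>>> )
>>> row = {"filepath": "cat_001.png", "class": 0}
>>> result = image_reader(row)  # reads ./data/dataset/cat_001.png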
class catalyst.data.reader.ScalarReader(input_key: str, output_key: str, dtype: Type = <class 'numpy.float32'>, default_value: float = None, one_hot_classes: int = None, smoothing: float = None)
Numeric data reader abstraction. Reads a single float, int, str or other value from data.

__call__(element)
Reads a row from your annotations dict and transfers it to a single value.
Parameters:
element – element in your dataset
Returns:
scalar value
Return type:
dtype

__init__(input_key: str, output_key: str, dtype: Type = <class 'numpy.float32'>, default_value: float = None, one_hot_classes: int = None, smoothing: float = None)
Parameters:
input_key (str) – input key to use from the annotation dict
output_key (str) – output key to use to store the result
dtype (type) – datatype of scalar values to use
default_value – default value to use if something goes wrong
one_hot_classes (int) – number of one-hot classes
smoothing (float, optional) – if specified, applies label smoothing to the one-hot classes
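For instance, a minimal sketch reading an integer class id from an annotation row; the "class" column is hypothetical:
>>> import numpy as np
>>> from catalyst.data.reader import ScalarReader
>>> target_reader = ScalarReader(
>>>     input_key="class",
>>>     output_key="targets",
>>>     dtype=np.int64,
>>>     default_value=-1,
>>> )
>>> result = target_reader({"filepath": "cat_001.png", "class": 3})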
class catalyst.data.reader.LambdaReader(input_key: str, output_key: str, lambda_fn: Callable = <function LambdaReader.<lambda>>, **kwargs)
Reader abstraction with lambda encoders. Can read an element from the dataset and apply the encode_fn function to it.

__call__(element)
Reads a row from your annotations dict and applies the encode_fn function.
Parameters:
element – element in your dataset
Returns:
value after applying the lambda_fn function

__init__(input_key: str, output_key: str, lambda_fn: Callable = <function LambdaReader.<lambda>>, **kwargs)
Parameters:
input_key (str) – input key to use from the annotation dict
output_key (str) – output key to use to store the result
lambda_fn (callable) – encode function to use to prepare your data (for example, convert chars/words/tokens to indices, etc.)
kwargs – kwargs for the encode function
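For instance, a hedged sketch with a hypothetical "text" column, assuming lambda_fn is applied to the value stored under input_key; a whitespace tokenizer stands in for a real encode function:
>>> from catalyst.data.reader import LambdaReader
>>> tokens_reader = LambdaReader(
>>>     input_key="text",
>>>     output_key="tokens",
>>>     lambda_fn=lambda text: text.lower().split(),
>>> )
>>> result = tokens_reader({"text": "Hello Catalyst readers"})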
class catalyst.data.reader.ReaderCompose(readers: List[catalyst.data.reader.ReaderSpec], mixins: list = None)
Abstraction to compose several readers into one open function.

__call__(element)
Reads a row from your annotations dict and applies all readers and mixins.
Parameters:
element – element in your dataset
Returns:
value after applying all readers and mixins

__init__(readers: List[catalyst.data.reader.ReaderSpec], mixins: list = None)
Parameters:
readers (List[ReaderSpec]) – list of readers to compose
mixins (list) – list of mixins to use
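For instance, a hedged end-to-end sketch: compose the readers above into an open_fn for ListDataset, assuming a ./data/dataset.csv with hypothetical "filepath" and "class" columns (e.g. produced by the tag2label and split-dataframe scripts):
>>> import numpy as np
>>> import pandas as pd
>>> from catalyst.data.dataset import ListDataset
>>> from catalyst.data.reader import ImageReader, ReaderCompose, ScalarReader
>>> df = pd.read_csv("./data/dataset.csv")
>>> open_fn = ReaderCompose([
>>>     ImageReader(input_key="filepath", output_key="image",
>>>                 rootpath="./data/dataset"),
>>>     ScalarReader(input_key="class", output_key="targets", dtype=np.int64),
>>> ])
>>> dataset = ListDataset(list_data=df.to_dict("records"), open_fn=open_fn)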
Sampler
class catalyst.data.sampler.BalanceClassSampler(labels: List[int], mode: str = 'downsampling')
Abstraction over data sampler.
Allows you to create a stratified sample on unbalanced classes.

__init__(labels: List[int], mode: str = 'downsampling')
Parameters:
labels (List[int]) – list of class labels for each element in the dataset
mode (str) – strategy to balance classes; must be one of [downsampling, upsampling]

__iter__() → Iterator[int]
Yields:
indices of the stratified sample

__len__() → int
Returns:
length of the resulting sample
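For instance, a minimal sketch with toy imbalanced labels; the TensorDataset is only a stand-in for your own dataset:
>>> import torch
>>> from torch.utils.data import DataLoader, TensorDataset
>>> from catalyst.data.sampler import BalanceClassSampler
>>> labels = [0] * 900 + [1] * 100
>>> dataset = TensorDataset(torch.rand(1000, 8), torch.tensor(labels))
>>> sampler = BalanceClassSampler(labels, mode="upsampling")
>>> loader = DataLoader(dataset, batch_size=32, sampler=sampler)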
class catalyst.data.sampler.MiniEpochSampler(data_len: int, mini_epoch_len: int, drop_last: bool = False, shuffle: str = None)
Sampler that iterates mini epochs of mini_epoch_len samples from the dataset.

Example
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100)
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100,
>>>     drop_last=True)
>>> MiniEpochSampler(len(dataset), mini_epoch_len=100,
>>>     shuffle="per_epoch")

__init__(data_len: int, mini_epoch_len: int, drop_last: bool = False, shuffle: str = None)
Parameters:
data_len (int) – size of the dataset
mini_epoch_len (int) – number of samples from the dataset used in one mini epoch
drop_last (bool) – if True, the sampler will drop the last batches if their size would be less than batches_per_epoch
shuffle (str) – one of "always", "real_epoch", or None. The sampler will shuffle indices:
"per_mini_epoch" – every mini epoch (every __iter__ call)
"per_epoch" – every real epoch
None – don't shuffle

__iter__() → Iterator[int]
@TODO: Docs. Contribution is welcome.

__len__() → int
Returns:
length of the mini-epoch
Return type:
int

shuffle() → None
@TODO: Docs. Contribution is welcome.
class catalyst.data.sampler.DistributedSamplerWrapper(sampler, num_replicas: Optional[int] = None, rank: Optional[int] = None, shuffle: bool = True)
Wrapper over Sampler for distributed training. Allows you to use any sampler in distributed mode.
It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In such a case, each process can pass a DistributedSamplerWrapper instance as a DataLoader sampler, and load a subset of subsampled data of the original dataset that is exclusive to it.

Note
Sampler is assumed to be of constant size.

__init__(sampler, num_replicas: Optional[int] = None, rank: Optional[int] = None, shuffle: bool = True)
Parameters:
sampler – sampler used for subsampling
num_replicas (int, optional) – number of processes participating in distributed training
rank (int, optional) – rank of the current process within num_replicas
shuffle (bool, optional) – if true (default), sampler will shuffle the indices

__iter__()
@TODO: Docs. Contribution is welcome.
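For instance, a hedged sketch: wrap a base sampler so that, when torch.distributed is initialized, each process loads only its exclusive shard of the resampled indices; the toy dataset and labels below are hypothetical:
>>> import torch
>>> from torch.utils.data import DataLoader, TensorDataset
>>> from catalyst.data.sampler import BalanceClassSampler, DistributedSamplerWrapper
>>> labels = [0] * 900 + [1] * 100
>>> dataset = TensorDataset(torch.rand(1000, 8), torch.tensor(labels))
>>> sampler = BalanceClassSampler(labels, mode="upsampling")
>>> if torch.distributed.is_available() and torch.distributed.is_initialized():
>>>     sampler = DistributedSamplerWrapper(sampler)
>>> loader = DataLoader(dataset, batch_size=32, sampler=sampler)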