sad.generator package

Submodules

sad.generator.base module

class GeneratorBase(config: Dict, model: sad.model.base.ModelBase, task: TrainingTask)

Bases: abc.ABC

A base class that all concrete generator classes should inherit from. A generator class should also be iterable, which it achieves by implementing the __iter__ method.

A generator works as follows. Files containing training/validation samples are first added to the generator by calling self.add(file). Calling self.prepare() then informs the generator that all files have been added and that it should get ready to iterate through them and produce samples. At this point, the generator can be used in the following manner:

for features, targets in my_generator:
    # fit my model

To only iterate through training samples, one can do:

for features, targets in my_generator.get_trn():
    # fit my model

Same applies to validation.
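The add/prepare/iterate protocol described above can be sketched with a toy class. ToyGenerator below is hypothetical and does not inherit from GeneratorBase; it only mimics the call order, and the file names stand in for real data files that a concrete generator would parse.

```python
from typing import Iterator, List, Tuple


class ToyGenerator:
    """Hypothetical stand-in that mimics the add -> prepare -> iterate protocol."""

    def __init__(self, batch_size: int = 2):
        self.batch_size = batch_size
        self.input_files: List[str] = []
        self._samples: List[Tuple[str, int]] = []
        self._prepared = False

    def add(self, filename: str) -> None:
        # Only record the file; nothing is read until prepare() is called.
        self.input_files.append(filename)

    def prepare(self) -> None:
        # A real generator would parse the files here. This toy fabricates one
        # (feature, target) pair per registered file name.
        self._samples = [(name, len(name)) for name in self.input_files]
        self._prepared = True

    def get_trn(self) -> Iterator[Tuple[List[str], List[int]]]:
        if not self._prepared:
            raise RuntimeError("call prepare() before iterating")
        for start in range(0, len(self._samples), self.batch_size):
            batch = self._samples[start:start + self.batch_size]
            yield [f for f, _ in batch], [t for _, t in batch]

    def __iter__(self) -> Iterator[Tuple[List[str], List[int]]]:
        # In this toy, iterating over the generator yields training batches.
        return self.get_trn()


gen = ToyGenerator(batch_size=2)
for name in ["a.tar.gz", "bb.tar.gz", "ccc.tar.gz"]:
    gen.add(name)
gen.prepare()
batches = list(gen)
```

Raising an error when iteration starts before prepare() is a design choice of this sketch, not documented behavior of the package.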

add(filename: str)

Add a local file to the generator. The local file contains the data from which mini-batches of training/validation samples will be read.

Parameters

filename (str) – Path to the local file.

property batch_size: int

Batch size used when generating samples in mini-batches.

property config: Dict

Configuration information that is used to initialize the generator instance.

property data: Dict[str, List[str]]

A dictionary whose keys are user ids and whose values are lists of the item ids each user has interacted with. The complete lists of users and items are inferred from it. Set after self.prepare() is called.

abstract get_trn() → Iterator[Any]

Interface to generate samples for model training.

Returns

An iterator that yields training samples in mini-batches.

Return type

Iterator[Any]

abstract get_val_or_not() → Iterator[Any]

Interface to generate samples for validating the model.

Returns

An iterator that yields validation samples in mini-batches.

Return type

Iterator[Any]

property i_batch: int

The number of random items that will be chosen when working in "random" mode. Read directly from the "i_batch" field in self.spec. When not configured, it is set to 20% of the items.

property input_dir: str

Read directly from self.task.input_dir.

property input_files: List[str]

A list of files from which samples will be read.

property item_id_to_idx: Dict[str, int]

A dictionary with keys being item id and values being the index. It is the inverse mapping of self.item_idx_to_id.

property item_idx_to_id: Dict[int, str]

A dictionary with keys being item indices from zero to n_items-1, and values being their ids. Will be set after self.prepare() is called.
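As a minimal sketch of how the two item mappings relate to each other, the snippet below derives them from a small data dictionary. Sorting the item ids to fix the index order is an assumption made here for reproducibility; this document does not specify the ordering the package actually uses.

```python
from typing import Dict, List

# Toy interaction data in the shape described for `data`: user id -> item ids.
data: Dict[str, List[str]] = {"u1": ["i2", "i1"], "u2": ["i3", "i1"]}

# Assumption: indices are assigned in sorted order of item id.
item_ids = sorted({item for items in data.values() for item in items})

# Forward mapping: index (0 .. n_items-1) -> item id.
item_idx_to_id: Dict[int, str] = dict(enumerate(item_ids))

# Inverse mapping: item id -> index.
item_id_to_idx: Dict[str, int] = {
    item_id: idx for idx, item_id in item_idx_to_id.items()
}
```

The user mappings can be built the same way from the keys of the data dictionary.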

property mode

The mode in which the generator works. Two modes are currently supported: "random" and "iteration".

  1. "random": In each iteration, self.u_batch users are randomly selected (with replacement) from the entire user set. For each selected user, self.i_batch positive items that the user has interacted with are randomly drawn (with replacement), along with the same number of negative items that the user has not interacted with. This produces triplets of samples in the format (user, item i (interacted), item j (non-interacted)).

  2. "iteration": All users are iterated through in a randomized order, and so are the items. For each positive user-item interaction, self.n_negatives non-interacted items are randomly selected.

Type

str
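The "random" mode can be illustrated with a small standalone sketch. The function name sample_random_mode and the choice to skip users who lack either interacted or non-interacted items are assumptions of this example, not behavior confirmed by the package.

```python
import random
from typing import Dict, List, Set, Tuple


def sample_random_mode(
    uidx_to_iidxs_tuple: Dict[int, Tuple[Set[int], Set[int]]],
    u_batch: int,
    i_batch: int,
    rng: random.Random,
) -> List[Tuple[int, int, int]]:
    """Draw (user, interacted item, non-interacted item) triplets."""
    user_idxs = sorted(uidx_to_iidxs_tuple)
    triplets: List[Tuple[int, int, int]] = []
    # Users are drawn with replacement, so the same user may appear twice.
    for u in rng.choices(user_idxs, k=u_batch):
        pos, neg = uidx_to_iidxs_tuple[u]
        if not pos or not neg:
            continue  # Assumption: skip users without both kinds of items.
        # Positive and negative items are also drawn with replacement.
        pos_draws = rng.choices(sorted(pos), k=i_batch)
        neg_draws = rng.choices(sorted(neg), k=i_batch)
        triplets.extend((u, i, j) for i, j in zip(pos_draws, neg_draws))
    return triplets


# Toy mapping: user idx -> (interacted item idxs, non-interacted item idxs).
toy = {0: ({0, 1}, {2, 3}), 1: ({2}, {0, 1, 3})}
triplets = sample_random_mode(toy, u_batch=3, i_batch=4, rng=random.Random(0))
```

Each call produces at most u_batch * i_batch triplets; here every user has both kinds of items, so exactly 12 are returned.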

property model: sad.model.base.ModelBase

A trainable model instance, which will be trained using samples produced by current generator instance.

property n_negatives: int

The number of negative samples that will be drawn for each positive user-item interaction. Read directly from the "n_negatives" field in self.spec. Only used when the generator is working in "iteration" mode. Defaults to 5.
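A rough sketch of negative sampling in "iteration" mode follows. The function iterate_with_negatives is hypothetical, and drawing the negatives with replacement is an assumption; the description above does not say which way the package samples them.

```python
import random
from typing import Dict, Iterator, Set, Tuple


def iterate_with_negatives(
    uidx_to_iidxs_tuple: Dict[int, Tuple[Set[int], Set[int]]],
    n_negatives: int,
    rng: random.Random,
) -> Iterator[Tuple[int, int, int]]:
    """Visit every positive interaction once, pairing it with random negatives."""
    user_idxs = sorted(uidx_to_iidxs_tuple)
    rng.shuffle(user_idxs)  # Users are visited in a randomized order.
    for u in user_idxs:
        pos, neg = uidx_to_iidxs_tuple[u]
        pos_items = sorted(pos)
        rng.shuffle(pos_items)  # So are each user's interacted items.
        neg_items = sorted(neg)
        if not neg_items:
            continue  # Nothing to contrast against for this user.
        for i in pos_items:
            # Assumption: negatives are drawn with replacement.
            for j in rng.choices(neg_items, k=n_negatives):
                yield (u, i, j)


toy = {0: ({0, 1}, {2, 3}), 1: ({2}, {0, 1, 3})}
samples = list(iterate_with_negatives(toy, n_negatives=5, rng=random.Random(0)))
```

With three positive interactions in the toy data and n_negatives set to 5, the iterator yields 15 triplets.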

property output_dir: str

Read directly from self.task.output_dir.

abstract prepare()

Inform the generator that it should set things up in order to be ready for generating samples. Concrete subclasses are responsible for implementing this method.

save(working_dir: str)

Save generator’s configuration to a folder.

Parameters

working_dir (str) – A local path where the configuration of the generator will be saved.

property spec: Dict

A reference to the "spec" field in self.config. If the field does not exist or its value is None, an empty dictionary will be created.
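The fallback to an empty dictionary can be sketched as below. The helper get_spec and the "name" key are hypothetical; whether the package stores the new dictionary back into the configuration is also an assumption of this example.

```python
from typing import Any, Dict


def get_spec(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return config["spec"], creating an empty dict when missing or None."""
    spec = config.get("spec")
    if spec is None:
        spec = {}
        # Assumption: the new dict is stored so later reads share one object.
        config["spec"] = spec
    return spec


# Hypothetical configurations; only the "spec" field comes from the text above.
config_a = {"name": "my_generator", "spec": {"n_negatives": 10}}
config_b = {"name": "my_generator", "spec": None}

# Properties such as n_negatives can then fall back to their documented default.
n_neg_a = get_spec(config_a).get("n_negatives", 5)
n_neg_b = get_spec(config_b).get("n_negatives", 5)
```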

property task: sad.tasks.training.TrainingTask

An instance of the training task associated with the generator. It is the task in which the current generator is initialized.

property tensor: numpy.ndarray

A three-way array with shape n x m x m, where n is the number of users and m is the number of items. A value of 1 at location (u, i, j) indicates that the u-th user prefers the i-th item over the j-th item; -1 indicates the opposite. A value of 0 means that no information is available to determine the preference between the two items. To save memory, the value is set after self.prepare() is called only when self.tensor_flag is True.
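To make the layout of the tensor concrete, here is a small NumPy sketch. It assumes that, for implicit feedback, an interacted item is preferred over a non-interacted one; the function build_tensor and the int8 dtype are choices of this example rather than the package's implementation.

```python
from typing import Dict, Set

import numpy as np


def build_tensor(uidx_to_items: Dict[int, Set[int]], n_items: int) -> np.ndarray:
    """Build an n x m x m tensor of pairwise preferences with values -1, 0, 1."""
    n_users = len(uidx_to_items)
    tensor = np.zeros((n_users, n_items, n_items), dtype=np.int8)
    for u, pos in uidx_to_items.items():
        pos_idx = sorted(pos)
        neg_idx = sorted(set(range(n_items)) - set(pos))
        if pos_idx and neg_idx:
            # Assumption: interacted item i is preferred over non-interacted j.
            tensor[np.ix_([u], pos_idx, neg_idx)] = 1
            # The mirrored entry (u, j, i) carries the opposite sign.
            tensor[np.ix_([u], neg_idx, pos_idx)] = -1
    return tensor


# User 0 interacted with item 0; user 1 interacted with items 1 and 2.
tensor = build_tensor({0: {0}, 1: {1, 2}}, n_items=3)
```

Pairs of two interacted items, or two non-interacted items, stay at 0 in this sketch. The memory cost grows as n x m x m: even at one byte per entry, 10,000 users and 5,000 items would need about 250 GB, which is why the tensor is optional.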

property tensor_flag: bool

A boolean flag indicating whether the three-way data tensor self.tensor will be constructed. Setting it to False skips creating the tensor to reduce memory consumption.

property u_batch: int

The number of random users that will be chosen when working in "random" mode. Read directly from the "u_batch" field in self.spec. When not configured, it is set to 20% of the users.

property uidx_to_iidxs_tuple: Dict[int, Tuple[Set[int], Set[int]]]

A dictionary mapping from user idx to a tuple in which the first element is a set of item idxs the user has interacted with, and the second one is a set of non-interacted item idxs. Will be set after self.prepare() is called.
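The shape of this mapping can be illustrated with a short sketch. The helper build_uidx_to_iidxs_tuple is hypothetical and takes the id-to-index mappings as plain arguments instead of reading them from a generator instance.

```python
from typing import Dict, List, Set, Tuple


def build_uidx_to_iidxs_tuple(
    data: Dict[str, List[str]],
    user_id_to_idx: Dict[str, int],
    item_id_to_idx: Dict[str, int],
) -> Dict[int, Tuple[Set[int], Set[int]]]:
    """Map each user idx to (interacted item idxs, non-interacted item idxs)."""
    all_items = set(item_id_to_idx.values())
    result: Dict[int, Tuple[Set[int], Set[int]]] = {}
    for user_id, item_ids in data.items():
        interacted = {item_id_to_idx[item_id] for item_id in item_ids}
        # Everything the user has not interacted with is the complement set.
        result[user_id_to_idx[user_id]] = (interacted, all_items - interacted)
    return result


data = {"u1": ["i1", "i2"], "u2": ["i3"]}
mapping = build_uidx_to_iidxs_tuple(
    data,
    user_id_to_idx={"u1": 0, "u2": 1},
    item_id_to_idx={"i1": 0, "i2": 1, "i3": 2},
)
```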

property user_id_to_idx: Dict[str, int]

A dictionary with keys being user id and values being the index. It is the inverse mapping of self.user_idx_to_id.

property user_idx_to_id: Dict[int, str]

A dictionary with keys being user indices from zero to n_users-1, and values being their ids. Will be set after self.prepare() is called.

property user_idx_to_preference: Dict[int, Dict[Tuple[str, str], int]]

A dictionary mapping each user idx to the item pairs for which the user prefers one item over the other. The item pairs are stored in a dictionary as well, with each key being a tuple of two item ids and each value being 1.

class GeneratorFactory

Bases: object

A factory class that is responsible for creating generator instances.

logger = <Logger generator.GeneratorFactory (INFO)>

Class attribute for logging.

Type

logging.Logger

classmethod produce(config: Dict, model: sad.model.base.ModelBase, task: TrainingTask) → sad.generator.base.GeneratorBase

classmethod register(wrapped_class: sad.generator.base.GeneratorBase) → sad.generator.base.GeneratorBase
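The signatures of register and produce are listed without a description of how they cooperate. A common way to build such a factory is a class-level registry filled by a decorator, and the sketch below illustrates that pattern only. ToyFactory, EchoGenerator, and the "name" key in the configuration are hypothetical; the real GeneratorFactory may select classes differently.

```python
from typing import Any, Dict, Optional


class ToyFactory:
    """Hypothetical registry-based factory; not the actual GeneratorFactory."""

    _registry: Dict[str, type] = {}

    @classmethod
    def register(cls, wrapped_class: type) -> type:
        # Used as a decorator: store the class under its name, return it unchanged.
        cls._registry[wrapped_class.__name__] = wrapped_class
        return wrapped_class

    @classmethod
    def produce(
        cls,
        config: Dict[str, Any],
        model: Optional[Any] = None,
        task: Optional[Any] = None,
    ) -> Any:
        # Assumption: the class to build is selected by a "name" key in config.
        name = config["name"]
        if name not in cls._registry:
            raise ValueError(f"unknown generator: {name}")
        return cls._registry[name](config, model, task)


@ToyFactory.register
class EchoGenerator:
    """Hypothetical generator that only stores its constructor arguments."""

    def __init__(self, config: Dict[str, Any], model: Any, task: Any):
        self.config = config
        self.model = model
        self.task = task


generator = ToyFactory.produce({"name": "EchoGenerator", "spec": {}})
```

Returning the class unchanged from register keeps the decorated class usable under its own name, which matches the listed return type of a generator class.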

sad.generator.implicit_fb module

class ImplicitFeedbackGenerator(config: dict, model: sad.model.sad.SADModel, task: TrainingTask)

Bases: sad.generator.base.GeneratorBase

A concrete generator class that handles user-item implicit feedback. After an instance of this class is created, self.add(filepath) needs to be called to add a local file to the generator. The local file is a compressed tarball containing a raw.json file and, optionally, a raw_with_rating.json file.

The raw.json file is a dictionary mapping a user (in user_id) to a list of items (in item_id) that the user has interacted with.

The optional raw_with_rating.json file is a nested dictionary. It maps each user (by user_id) to the items that the user has rated; each value is itself a dictionary mapping items (by item_id) to their rating scores.
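The two file layouts can be illustrated by writing a tiny tarball and reading it back with the standard library. The gzip compression, the placement of both files at the archive root, the file name sample.tar.gz, and the example ids and ratings are all assumptions of this sketch.

```python
import io
import json
import os
import tarfile
import tempfile

# raw.json: user_id -> list of item_ids the user has interacted with.
raw = {"user_a": ["item_1", "item_2"], "user_b": ["item_2"]}

# raw_with_rating.json: user_id -> {item_id: rating score}.
raw_with_rating = {
    "user_a": {"item_1": 5.0, "item_2": 3.0},
    "user_b": {"item_2": 4.0},
}


def add_json(tar: tarfile.TarFile, name: str, obj: dict) -> None:
    """Serialize obj as JSON and store it in the archive under the given name."""
    payload = json.dumps(obj).encode("utf-8")
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


tar_path = os.path.join(tempfile.mkdtemp(), "sample.tar.gz")
with tarfile.open(tar_path, "w:gz") as tar:
    add_json(tar, "raw.json", raw)
    add_json(tar, "raw_with_rating.json", raw_with_rating)

# Read the archive back to confirm the structure round-trips.
with tarfile.open(tar_path, "r:gz") as tar:
    names = tar.getnames()
    loaded_raw = json.load(tar.extractfile("raw.json"))
    loaded_ratings = json.load(tar.extractfile("raw_with_rating.json"))
```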

property cornac_dataset: cornac.data.dataset.Dataset

A Cornac Dataset object containing user/item pairs and the ratings associated with them. It will be used for fitting models from the cornac package.

property data_df: pandas.core.frame.DataFrame

A pandas DataFrame containing user/item pairs and the ratings associated with them. For ImplicitFeedbackGenerator, each rating is set to either 1.0 or 0.0. User and item IDs are stored in the userID and itemID columns respectively.

get_obs_uij(u_idx: int, i_idx: int, j_idx: int) → int

Get the (u, i, j)-th observation from personalized three-way tensor self.tensor. When self.tensor is pre-calculated, its value will be returned. Otherwise, self.uidx_to_iidxs_tuple will be used to infer the observation at runtime.

Parameters
  • u_idx (int) – The user idx.

  • i_idx (int) – Index of first item in comparison.

  • j_idx (int) – Index of second item in comparison.

Returns

A value from (-1, 1, 0) indicating the personalized preference between the two items. 1 indicates that the i_idx-th item is preferred over the j_idx-th item; -1 indicates the opposite; 0 indicates that such information is not available.

Return type

int
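When the tensor is not pre-calculated, the observation has to be inferred from the interacted and non-interacted sets. One plausible way to do so is sketched below; infer_obs_uij is a hypothetical standalone function, and the rule that an interacted item is preferred over a non-interacted one is an assumption about how implicit feedback is interpreted.

```python
from typing import Dict, Set, Tuple


def infer_obs_uij(
    uidx_to_iidxs_tuple: Dict[int, Tuple[Set[int], Set[int]]],
    u_idx: int,
    i_idx: int,
    j_idx: int,
) -> int:
    """Infer the (u, i, j) observation without materializing the full tensor."""
    interacted, non_interacted = uidx_to_iidxs_tuple[u_idx]
    if i_idx in interacted and j_idx in non_interacted:
        return 1  # Assumption: interacted item i is preferred over j.
    if i_idx in non_interacted and j_idx in interacted:
        return -1  # The mirrored case carries the opposite sign.
    return 0  # Both interacted or both non-interacted: no information.


toy = {0: ({0, 1}, {2})}
obs_pref = infer_obs_uij(toy, 0, 0, 2)
obs_anti = infer_obs_uij(toy, 0, 2, 0)
obs_none = infer_obs_uij(toy, 0, 0, 1)
```

Each lookup costs two set-membership checks, which trades a little time per query for not storing n x m x m entries.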

get_trn() → Iterator[Any]

Interface to generate samples for model training.

Returns

An iterator that yields training samples in mini-batches.

Return type

Iterator[Any]

get_val_or_not() → Iterator[Any]

Interface to generate samples for validating the model.

Returns

An iterator that yields validation samples in mini-batches.

Return type

Iterator[Any]

property msft_ncf_dataset: recommenders.models.ncf.dataset.Dataset

An NCF (Neural Collaborative Filtering) Dataset object implemented in the recommenders package from Microsoft. It contains user/item pairs and the ratings associated with them. It will be used for fitting an NCF model using the recommenders package.

prepare()

Instance method that is called to inform a generator instance that all raw data have been added. For this class, the raw data format is a compressed tarball containing a raw.json file and, optionally, a raw_with_rating.json file, a delete_raw.json file, and a delete_raw_with_rating.json file. The last two files contain held-out user-item interactions (and their ratings). When called, the method performs the following steps.

  1. Unzip the raw data tarball. Read the raw.json and raw_with_rating.json files. When multiple such tarballs exist, their json files are merged into one. When held-out user-item interactions exist (delete_raw.json and delete_raw_with_rating.json), those interactions are read too. Interaction data are read into the self.data_trn, self.data_val and self.data_all fields. Data with ratings are stored in self.ratings_trn, self.ratings_val, and self.ratings_all.

  2. Create the self.user_idx_to_id and self.user_id_to_idx mappings. The same mappings are created for items.

  3. Optionally create self.tensor with size n x m x m, containing personalized pairwise comparisons between items. Its values are -1, 1 and 0, meaning that the first item is less preferable, that it is more preferable, and that the preference is not available, respectively. This tensor is only created when self.tensor_flag is set to True. Large values of n and m may exhaust memory.

  4. Create self.uidx_to_iidxs_tuple, a mapping from user idx to a tuple of two sets, the first being interacted items and the second being non-interacted items, both in item_idx.

  5. Create self.user_idx_to_preference, a mapping from user idx to another dictionary, with keys being tuples of two items (in item_id) and values being 1. The order of the two items in a key indicates their preference.
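Step 1 mentions that the json files from multiple tarballs are merged into one. The sketch below shows one way such a merge could behave for raw.json dictionaries. The function merge_raw is hypothetical, and taking the union of each user's items while dropping duplicates is an assumption; the package may resolve overlapping users differently.

```python
from typing import Dict, Iterable, List


def merge_raw(dicts: Iterable[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Merge several user -> items dictionaries into a single dictionary."""
    merged: Dict[str, List[str]] = {}
    for raw in dicts:
        for user_id, item_ids in raw.items():
            items = merged.setdefault(user_id, [])
            for item_id in item_ids:
                # Assumption: an item listed in several files is kept once.
                if item_id not in items:
                    items.append(item_id)
    return merged


raw_first = {"u1": ["i1", "i2"], "u2": ["i3"]}
raw_second = {"u1": ["i2", "i4"], "u3": ["i1"]}
merged = merge_raw([raw_first, raw_second])
```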

property surprise_dataset: surprise.dataset.Dataset

A Dataset object implemented in the surprise package. It contains user/item pairs and the ratings associated with them. It will be used for fitting an SVD model using the surprise package.

sad.generator.simulation module

class SimulationGenerator(config: dict, model: sad.model.sad.SADModel, task: TrainingTask)

Bases: sad.generator.base.GeneratorBase

A concrete generator class that handles simulated data from the generative model of SAD. After an instance of this class is created, self.add(filepath) needs to be called to add a local file to the generator. The local file is a compressed tarball containing a raw.npz file, which holds the true model parameters XI0, T0, H0 and X0 (derived from the first three matrices). The raw file also contains an observation tensor Obs0, a fully observed personalized pairwise comparison taking values of -1 or 1.

One can set self.missing_ratio to control the proportion of missing data in the observations. See the missing_ratio property for details.

property H0: numpy.ndarray

The true left item matrix (k x m) containing item left vectors as columns.

property Obs0: numpy.ndarray

Three-way tensor containing observations. An alias of self.tensor.

property T0: numpy.ndarray

The true right item matrix (k x m) containing item right vectors as columns.

property X0: numpy.ndarray

The three way tensor (n x m x m) containing true preference scores.

property XI0: numpy.ndarray

The true user matrix (k x n) containing user vectors as columns.

get_obs_uij(u_idx: int, i_idx: int, j_idx: int) → int

Get the (u, i, j)-th observation from observation tensor self.Obs0.

Parameters
  • u_idx (int) – The user idx.

  • i_idx (int) – Index of first item in comparison.

  • j_idx (int) – Index of second item in comparison.

Returns

A value from (-1, 1, 0) indicating the personalized preference between the two items. 1 indicates that the i_idx-th item is preferred over the j_idx-th item; -1 indicates the opposite; 0 indicates that such information is not available.

Return type

int

get_trn() → Iterator[Any]

Interface to generate samples for model training.

Returns

An iterator that yields training samples in mini-batches.

Return type

Iterator[Any]

get_val_or_not() → Iterator[Any]

Interface to generate samples for validating the model.

Returns

An iterator that yields validation samples in mini-batches.

Return type

Iterator[Any]

property ll0: float

The log likelihood of non-missing observations under true parameter values. Its value will be set after running self.prepare().

property missing_ratio: float

Proportion of missing entries in self.Obs0. Defaults to 0, meaning that no observation is missing. Read directly from the "missing_ratio" field in self.spec. Missing entries in self.Obs0 are set to 0 when self.prepare() is invoked.
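A minimal NumPy sketch of masking a fully observed tensor is given below. The function mask_observations is hypothetical. Masking the entries (u, i, j) and (u, j, i) together, so that the result stays anti-symmetric, is an assumption of this example, as is the use of numpy.random.default_rng seeded with a value playing the role of rnd_seed.

```python
import numpy as np


def mask_observations(
    obs: np.ndarray, missing_ratio: float, rnd_seed: int
) -> np.ndarray:
    """Set a random proportion of pairwise observations to 0 (missing)."""
    rng = np.random.default_rng(rnd_seed)
    # Decide once per unordered pair (i, j) with i < j whether to keep it.
    keep_upper = np.triu(rng.random(obs.shape) >= missing_ratio, k=1)
    # Mirror the decision so (u, i, j) and (u, j, i) go missing together.
    keep = keep_upper | keep_upper.transpose(0, 2, 1)
    return obs * keep


# Toy fully observed tensor for 2 users and 4 items with values -1 or 1
# off the diagonal; the diagonal compares an item with itself and stays 0.
m = 4
base = np.sign(np.arange(m)[:, None] - np.arange(m)[None, :])
obs0 = np.stack([base, -base])

masked_none = mask_observations(obs0, missing_ratio=0.0, rnd_seed=0)
masked_half = mask_observations(obs0, missing_ratio=0.5, rnd_seed=0)
masked_all = mask_observations(obs0, missing_ratio=1.0, rnd_seed=0)
```

Because each pair is dropped independently, the realized fraction of missing entries only approximates missing_ratio on small tensors.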

prepare()

Instance method that is called to inform a generator that all raw data have been added. For this class, the raw data format is a compressed tarball containing a raw.npz file, and only one raw data file may be added to the generator. When called, the method performs the following steps.

  1. Unzip the raw data tarball. Read the true parameter values from the raw.npz file and set the corresponding attributes of the current generator.

  2. Create the self.user_idx_to_id and self.user_id_to_idx mappings. The same mappings are created for items.

  3. Randomly set a certain proportion of observations to 0, indicating that the data are missing. At the same time, calculate the log likelihood of the observed entries under the true parameter values.

  4. Create self.user_idx_to_preference, a mapping from user idx to another dictionary, with keys being tuples of two items (in item_id) and values being 1. The order of the two items in a key indicates their preference.

property rnd_seed: int

Random seed, used for reproducibility. Read directly from the "rnd_seed" field in self.spec.

Module contents