sad.generator package
Submodules
sad.generator.base module
- class GeneratorBase(config: Dict, model: sad.model.base.ModelBase, task: TrainingTask)
Bases: abc.ABC
A generator base class that all concrete generator classes should inherit from. A generator class should also be an iterable, which it achieves by implementing the __iter__ method.
A generator works as follows. File(s) containing training/validation samples are first added to the generator by calling self.add(file). Calling self.prepare() then informs the generator that all files have been added and that it is time to get ready to iterate through the files and produce samples. At this point, the generator can be used in the following manner:
    for features, targets in my_generator:
        ...  # fit my model
To iterate through training samples only, one can do:
    for features, targets in my_generator.get_trn():
        ...  # fit my model
The same applies to validation samples, using my_generator.get_val_or_not().
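The add / prepare / iterate workflow described above can be sketched with a small stand-in class. This is only an illustration under stated assumptions: ToyGenerator, its batch_size argument, and the fabricated samples are hypothetical and are not part of the sad package.

```python
from typing import Iterator, List, Tuple


class ToyGenerator:
    """Hypothetical stand-in that mimics the add/prepare/iterate protocol."""

    def __init__(self, batch_size: int = 2) -> None:
        self.batch_size = batch_size
        self.input_files: List[str] = []
        self.samples: List[Tuple[int, int]] = []
        self.prepared = False

    def add(self, filename: str) -> None:
        # Only record the path; nothing is read until prepare() is called.
        self.input_files.append(filename)

    def prepare(self) -> None:
        # A real generator would parse the added files here. This toy
        # version fabricates one (feature, target) pair per added file.
        self.samples = [(i, i % 2) for i, _ in enumerate(self.input_files)]
        self.prepared = True

    def get_trn(self) -> Iterator[Tuple[List[int], List[int]]]:
        if not self.prepared:
            raise RuntimeError("call prepare() before iterating")
        # Yield the samples in mini-batches of at most batch_size.
        for start in range(0, len(self.samples), self.batch_size):
            batch = self.samples[start:start + self.batch_size]
            yield [f for f, _ in batch], [t for _, t in batch]

    def __iter__(self) -> Iterator[Tuple[List[int], List[int]]]:
        # Iterating over the generator itself yields training batches here.
        return self.get_trn()


toy = ToyGenerator(batch_size=2)
for path in ["a.tar.gz", "b.tar.gz", "c.tar.gz"]:
    toy.add(path)
toy.prepare()
batches = list(toy)
```

Deferring all reading to prepare() keeps add() cheap and lets the generator see the complete file list before building any index.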
- add(filename: str)
Add a local file to the generator. The local file contains the data from which mini-batches of training/validation samples will be read.
- Parameters
filename (str) – Path to the local file.
- property batch_size: int
Batch size used when generating samples in mini-batches.
- property config: Dict
Configuration information that is used to initialize the generator instance.
- property data: Dict[str, List[str]]
A dictionary with keys being user ids and values being the list of item ids that the user has interacted with. The complete lists of users and items are inferred from it. Set after self.prepare() is called.
- abstract get_trn() -> Iterator[Any]
Interface to generate samples for model training.
- Returns
An iterator that yields training samples in mini-batches.
- Return type
Iterator[Any]
- abstract get_val_or_not() -> Iterator[Any]
Interface to generate samples for model validation.
- Returns
An iterator that yields validation samples in mini-batches.
- Return type
Iterator[Any]
- property i_batch: int
The number of random items that will be chosen when working in "random" mode. Read directly from the "i_batch" field in self.spec. When not configured, it is set to 20% of the items.
- property input_dir: str
Read directly from self.task.input_dir.
- property input_files: List[str]
A list of files from which samples will be read.
- property item_id_to_idx: Dict[str, int]
A dictionary with keys being item ids and values being their indices. It is the inverse mapping of self.item_idx_to_id.
- property item_idx_to_id: Dict[int, str]
A dictionary with keys being item indices from zero to n_items-1, and values being their ids. Set after self.prepare() is called.
- property mode
The mode in which the generator works. Two configurations are currently supported, "random" and "iteration".
"random": In this mode, self.u_batch random users are selected (with replacement) from the entire user set in each iteration. For each selected user, self.i_batch positive items that the user has interacted with are randomly drawn (with replacement). The same number of negative items that the user has not interacted with are randomly drawn as well, producing triplets of samples in the format (user, item i (interacted), item j (non-interacted)).
"iteration": In this mode, all users are iterated through in a randomized order, and so are items. For each positive user-item interaction, self.n_negatives non-interacted items are randomly selected.
- Type
str
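The two modes can be sketched as follows. This is a minimal illustration rather than the package's implementation: the toy uidx_to_iidxs data, the function names, the pairing of positive and negative draws by position, and drawing negatives with replacement in "iteration" mode are all assumptions.

```python
import random
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: user idx -> (interacted item idxs, non-interacted item idxs).
uidx_to_iidxs: Dict[int, Tuple[Set[int], Set[int]]] = {
    0: ({0, 1}, {2, 3}),
    1: ({2}, {0, 1, 3}),
}


def sample_random_mode(rng: random.Random, u_batch: int, i_batch: int) -> List[Tuple[int, int, int]]:
    """Draw (user, item i, item j) triplets, sampling with replacement."""
    users = sorted(uidx_to_iidxs)
    triplets = []
    for u in rng.choices(users, k=u_batch):
        pos, neg = uidx_to_iidxs[u]
        pos_draws = rng.choices(sorted(pos), k=i_batch)
        neg_draws = rng.choices(sorted(neg), k=i_batch)
        # Assumption: the k-th positive draw is paired with the k-th negative draw.
        triplets.extend((u, i, j) for i, j in zip(pos_draws, neg_draws))
    return triplets


def sample_iteration_mode(rng: random.Random, n_negatives: int) -> List[Tuple[int, int, int]]:
    """Visit every positive interaction once, in shuffled order."""
    users = sorted(uidx_to_iidxs)
    rng.shuffle(users)
    triplets = []
    for u in users:
        pos, neg = uidx_to_iidxs[u]
        items = sorted(pos)
        rng.shuffle(items)
        for i in items:
            # Assumption: negatives are drawn with replacement.
            for j in rng.choices(sorted(neg), k=n_negatives):
                triplets.append((u, i, j))
    return triplets


rng = random.Random(0)
random_triplets = sample_random_mode(rng, u_batch=3, i_batch=2)
iteration_triplets = sample_iteration_mode(rng, n_negatives=2)
```

In this sketch "random" mode produces u_batch * i_batch triplets per call, while "iteration" mode produces n_negatives triplets for every positive interaction in the data.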
- property model: sad.model.base.ModelBase
A trainable model instance, which will be trained using samples produced by current generator instance.
- property n_negatives: int
The number of negative samples drawn for each positive user-item interaction. Read directly from the "n_negatives" field in self.spec. Only used when the generator works in "iteration" mode. Defaults to five.
- property output_dir: str
Read directly from self.task.output_dir.
- abstract prepare()
Informs the generator that it should set things up in order to be ready for generating samples. Concrete subclasses are responsible for implementing this method.
- save(working_dir: str)
Save the generator’s configuration to a folder.
- Parameters
working_dir (str) – A local path where the configuration of the generator will be saved.
- property spec: Dict
A reference to the "spec" field in self.config. If the field does not exist or its value is None, an empty dictionary will be created.
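The fallback behaviour of spec, and the way other properties read from it, might look like the sketch below. The helper names are hypothetical. Storing the fallback dictionary back into config, and using integer division for the 20% default, are assumptions rather than documented behaviour.

```python
from typing import Any, Dict


def get_spec(config: Dict[str, Any]) -> Dict[str, Any]:
    # Assumption: the fallback dictionary is stored back into config so that
    # later reads return the same object.
    if config.get("spec") is None:
        config["spec"] = {}
    return config["spec"]


def get_n_negatives(config: Dict[str, Any]) -> int:
    # Documented default for "n_negatives" is five.
    return get_spec(config).get("n_negatives", 5)


def get_u_batch(config: Dict[str, Any], n_users: int) -> int:
    # Documented default is 20% of the users; integer division is an assumption.
    return get_spec(config).get("u_batch", n_users // 5)


empty_config: Dict[str, Any] = {}
spec = get_spec(empty_config)
```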
- property task: sad.tasks.training.TrainingTask
The training task instance associated with the generator. It is the task in which the current generator is initialized.
- property tensor: numpy.ndarray
A three-way array of shape n x m x m, where n is the number of users and m is the number of items. A value of 1 at location (u, i, j) suggests that the u-th user prefers the i-th item over the j-th item; -1 suggests the opposite. A value of 0 means no information is available to determine the preference between the two items. The value is optionally set after self.prepare() is called, depending on the value of self.tensor_flag; skipping it saves memory.
- property tensor_flag: bool
A boolean flag indicating whether the three-way data tensor self.tensor will be constructed. Setting it to False skips creating the tensor, which reduces memory consumption.
- property u_batch: int
The number of random users that will be chosen when working in "random" mode. Read directly from the "u_batch" field in self.spec. When not configured, it is set to 20% of the users.
- property uidx_to_iidxs_tuple: Dict[int, Tuple[Set[int], Set[int]]]
A dictionary mapping each user idx to a tuple in which the first element is the set of item idxs the user has interacted with, and the second is the set of non-interacted item idxs. Set after self.prepare() is called.
- property user_id_to_idx: Dict[str, int]
A dictionary with keys being user ids and values being their indices. It is the inverse mapping of self.user_idx_to_id.
- property user_idx_to_id: Dict[int, str]
A dictionary with keys being user indices from zero to n_users-1, and values being their ids. Set after self.prepare() is called.
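The index mappings can be derived from interaction data shaped like self.data, roughly as sketched here. The toy ids are hypothetical, and sorting the ids before numbering them is an assumption made only so that the indices are reproducible.

```python
from typing import Dict, List

# Hypothetical interaction data in the shape described for self.data.
data: Dict[str, List[str]] = {
    "u_b": ["i_2", "i_1"],
    "u_a": ["i_1", "i_3"],
}

# Assumption: ids are sorted before indices are assigned.
user_ids = sorted(data)
item_ids = sorted({item for items in data.values() for item in items})

# idx -> id mappings, numbered from zero.
user_idx_to_id: Dict[int, str] = dict(enumerate(user_ids))
item_idx_to_id: Dict[int, str] = dict(enumerate(item_ids))

# id -> idx mappings are the inverses of the ones above.
user_id_to_idx: Dict[str, int] = {uid: idx for idx, uid in user_idx_to_id.items()}
item_id_to_idx: Dict[str, int] = {iid: idx for idx, iid in item_idx_to_id.items()}
```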
- property user_idx_to_preference: Dict[int, Dict[Tuple[str, str], int]]
A dictionary mapping each user idx to the item pairs for which the user prefers one item over the other. The item pairs are stored in a dictionary as well, with each key being a tuple of two item ids and each value being 1.
- class GeneratorFactory
Bases: object
A factory class that is responsible for creating generator instances.
- logger = <Logger generator.GeneratorFactory (INFO)>
Class attribute for logging.
- Type
logging.Logger
- classmethod produce(config: Dict, model: sad.model.base.ModelBase, task: TrainingTask) -> sad.generator.base.GeneratorBase
- classmethod register(wrapped_class: sad.generator.base.GeneratorBase) -> sad.generator.base.GeneratorBase
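The signatures of register and produce are consistent with a registry-style factory, where register is used as a class decorator and produce looks a class up and instantiates it. The sketch below shows that general pattern only. ToyFactory, ToyGen, and the use of a "name" field in config to select the class are hypothetical and not confirmed by this documentation.

```python
from typing import Any, Dict


class ToyFactory:
    """Hypothetical registry-style factory; not the package's actual code."""

    _registry: Dict[str, type] = {}

    @classmethod
    def register(cls, wrapped_class: type) -> type:
        # Used as a class decorator: store the class under its own name
        # and return it unchanged.
        cls._registry[wrapped_class.__name__] = wrapped_class
        return wrapped_class

    @classmethod
    def produce(cls, config: Dict[str, Any], model: Any = None, task: Any = None) -> Any:
        # Assumption: config["name"] selects which registered class to build.
        name = config["name"]
        if name not in cls._registry:
            raise KeyError(f"unknown generator: {name}")
        return cls._registry[name](config, model, task)


@ToyFactory.register
class ToyGen:
    def __init__(self, config: Dict[str, Any], model: Any, task: Any) -> None:
        self.config = config


gen = ToyFactory.produce({"name": "ToyGen", "spec": {}})
```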
sad.generator.implicit_fb module
- class ImplicitFeedbackGenerator(config: dict, model: sad.model.sad.SADModel, task: TrainingTask)
Bases: sad.generator.base.GeneratorBase
A concrete generator class that handles user-item implicit feedback. After an instance of this class is created, self.add(filepath) needs to be called to add a local file to the generator. The local file is a compressed tarball containing a raw.json file and, optionally, a raw_with_rating.json file.
The raw.json file is a dictionary mapping a user (in user_id) to a list of items (in item_id) that the user has interacted with.
The optional raw_with_rating.json file is a nested dictionary. It maps a user (in user_id) to the items that the user has rated. Each value is another dictionary mapping items (in item_id) to their rating scores.
- property cornac_dataset: cornac.data.dataset.Dataset
A Cornac Dataset object containing user/item pairs and the ratings associated with them. Used for fitting models from the cornac package.
- property data_df: pandas.core.frame.DataFrame
A Pandas DataFrame containing user/item pairs and the ratings associated with them. For ImplicitFeedbackGenerator, the ratings are set to either 1.0 or 0.0. User and item IDs are stored under userID and itemID, respectively.
- get_obs_uij(u_idx: int, i_idx: int, j_idx: int) -> int
Get the (u, i, j)-th observation from the personalized three-way tensor self.tensor. When self.tensor is pre-calculated, its value is returned. Otherwise, self.uidx_to_iidxs_tuple is used to infer the observation at runtime.
- Parameters
u_idx (int) – The user idx.
i_idx (int) – Index of the first item in the comparison.
j_idx (int) – Index of the second item in the comparison.
- Returns
A value from (-1, 1, 0) indicating the personalized preference between the two items. 1 indicates that the i_idx-th item is preferred over the j_idx-th item; -1 suggests otherwise; 0 indicates that such information is not available.
- Return type
int
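When the tensor is not pre-calculated, an observation could be inferred from the interacted and non-interacted sets along the following lines. This sketch assumes that an interacted item is preferred over a non-interacted one, and that two items in the same set carry no preference information. That rule, the function name, and the toy data are assumptions, not documented behaviour.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy mapping: user idx -> (interacted idxs, non-interacted idxs).
uidx_to_iidxs_tuple: Dict[int, Tuple[Set[int], Set[int]]] = {
    0: ({0, 1}, {2, 3}),
}


def infer_obs_uij(u_idx: int, i_idx: int, j_idx: int) -> int:
    interacted, non_interacted = uidx_to_iidxs_tuple[u_idx]
    # Assumption: interacted items are preferred over non-interacted ones.
    if i_idx in interacted and j_idx in non_interacted:
        return 1
    if i_idx in non_interacted and j_idx in interacted:
        return -1
    # Both items fall in the same set: no preference information.
    return 0


obs = infer_obs_uij(0, 0, 2)
```

Looking the value up in two sets avoids materialising the n x m x m tensor, at the cost of a little work per query.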
- get_trn() -> Iterator[Any]
Interface to generate samples for model training.
- Returns
An iterator that yields training samples in mini-batches.
- Return type
Iterator[Any]
- get_val_or_not() -> Iterator[Any]
Interface to generate samples for model validation.
- Returns
An iterator that yields validation samples in mini-batches.
- Return type
Iterator[Any]
- property msft_ncf_dataset: recommenders.models.ncf.dataset.Dataset
An NCF (Neural Collaborative Filtering) Dataset object implemented in the recommenders package from MSFT. It contains user/item pairs and the ratings associated with them. Used for fitting an NCF model with the recommenders package.
- prepare()
Instance method that is called to inform a generator instance that all raw data have been added. For this class, the raw data is a compressed tarball containing a raw.json file and, optionally, a raw_with_rating.json file, a delete_raw.json file, and a delete_raw_with_rating.json file. The latter two files contain hold-out user-item interactions (and their ratings). When called, the following steps are performed.
1. Unzip the raw data tarball and read the raw.json and raw_with_rating.json files. When multiple such tarballs exist, their json files are merged into one. When hold-out user-item interactions exist (delete_raw.json and delete_raw_with_rating.json), those interactions are read too. Interaction data are read into the self.data_trn, self.data_val and self.data_all fields. Data with ratings are stored in self.ratings_trn, self.ratings_val, and self.ratings_all.
2. Create the self.user_idx_to_id and self.user_id_to_idx mappings. The same mappings are created for items.
3. Optionally create self.tensor, of size n x m x m, containing personalized pairwise comparisons between items. Its values are -1, 1 and 0, meaning that the first item is less preferable, that it is more preferable, and that the preference is not available, respectively. This tensor is only created when self.tensor_flag is set to True. Large values of n and m may exhaust memory.
4. Create self.uidx_to_iidxs_tuple, a mapping from user idx to a tuple of two sets, the first containing interacted items and the second containing non-interacted items, both in item_idx.
5. Create self.user_idx_to_preference, a mapping from user idx to another dictionary, with keys being tuples of two items (in item_id) and values being 1. The order of the two items in a key indicates their preference.
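The first step, reading raw.json out of each tarball and merging the results, can be sketched with the standard library alone. The helper names are hypothetical. Gzip compression, placing raw.json at the archive root, and merging by taking the de-duplicated union of each user's items are assumptions, since the documentation does not specify them.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_toy_tarball(path: Path, raw: Dict[str, List[str]]) -> None:
    # Build a gzip-compressed tarball holding a single raw.json member.
    payload = json.dumps(raw).encode("utf-8")
    info = tarfile.TarInfo(name="raw.json")
    info.size = len(payload)
    with tarfile.open(path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def read_raw(path: Path) -> Dict[str, List[str]]:
    # "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(path, "r:*") as tar:
        member = tar.extractfile("raw.json")
        return json.load(member)


def merge_raw(parts: Iterable[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    # Assumption: merging keeps each user's items once, in first-seen order.
    merged: Dict[str, List[str]] = {}
    for part in parts:
        for user, items in part.items():
            bucket = merged.setdefault(user, [])
            for item in items:
                if item not in bucket:
                    bucket.append(item)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "part1.tar.gz"
    second = Path(tmp) / "part2.tar.gz"
    write_toy_tarball(first, {"u1": ["i1", "i2"]})
    write_toy_tarball(second, {"u1": ["i2", "i3"], "u2": ["i1"]})
    merged = merge_raw([read_raw(first), read_raw(second)])
```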
- property surprise_dataset: surprise.dataset.Dataset
A Dataset object implemented in the surprise package. It contains user/item pairs and the ratings associated with them. Used for fitting an SVD model with the surprise package.
sad.generator.simulation module
- class SimulationGenerator(config: dict, model: sad.model.sad.SADModel, task: TrainingTask)
Bases: sad.generator.base.GeneratorBase
A concrete generator class that handles data simulated from the generative model of SAD. After an instance of this class is created, self.add(filepath) needs to be called to add a local file to the generator. The local file is a compressed tarball containing a raw.npz file, which holds the true model parameters XI0, T0, H0 and X0 (derived from the first three matrices). The raw file also holds an observation tensor Obs0, containing fully observed personalized pairwise comparisons that take values of -1 or 1.
One can set self.missing_ratio to control the percentage of missing data in the observations. See below for details.
- property H0: numpy.ndarray
The true left item matrix (k x m) containing item left vectors as columns.
- property Obs0: numpy.ndarray
Three-way tensor containing observations. An alias of self.tensor.
- property T0: numpy.ndarray
The true right item matrix (k x m) containing item right vectors as columns.
- property X0: numpy.ndarray
The three-way tensor (n x m x m) containing true preference scores.
- property XI0: numpy.ndarray
The true user matrix (k x n) containing user vectors as columns.
- get_obs_uij(u_idx: int, i_idx: int, j_idx: int) -> int
Get the (u, i, j)-th observation from the observation tensor self.Obs0.
- Parameters
u_idx (int) – The user idx.
i_idx (int) – Index of the first item in the comparison.
j_idx (int) – Index of the second item in the comparison.
- Returns
A value from (-1, 1, 0) indicating the personalized preference between the two items. 1 indicates that the i_idx-th item is preferred over the j_idx-th item; -1 suggests otherwise; 0 indicates that such information is not available.
- Return type
int
- get_trn() -> Iterator[Any]
Interface to generate samples for model training.
- Returns
An iterator that yields training samples in mini-batches.
- Return type
Iterator[Any]
- get_val_or_not() -> Iterator[Any]
Interface to generate samples for model validation.
- Returns
An iterator that yields validation samples in mini-batches.
- Return type
Iterator[Any]
- property ll0: float
The log likelihood of non-missing observations under the true parameter values. Set after running self.prepare().
- property missing_ratio: float
Proportion of missing entries in self.Obs0. Defaults to 0, meaning no observation is missing. Read directly from the "missing_ratio" field in self.spec. Missing entries in self.Obs0 are set to 0 when self.prepare() is invoked.
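Masking a proportion of observations can be illustrated on a flat list of values, as sketched below. The function name is hypothetical, and several details are assumptions: a flat list stands in for the three-way tensor, exactly round(missing_ratio * size) entries are masked, and a seeded random.Random instance provides reproducibility.

```python
import random
from typing import List


def mask_observations(obs: List[int], missing_ratio: float, rnd_seed: int) -> List[int]:
    rng = random.Random(rnd_seed)
    # Assumption: mask an exact count rather than flipping a coin per entry.
    n_missing = int(round(missing_ratio * len(obs)))
    missing_positions = set(rng.sample(range(len(obs)), n_missing))
    # Masked entries become 0; all other entries keep their -1 or 1 value.
    return [0 if k in missing_positions else v for k, v in enumerate(obs)]


obs0 = [1, -1, 1, 1, -1, -1, 1, -1, 1, -1]
masked = mask_observations(obs0, missing_ratio=0.3, rnd_seed=7)
```

Using the same seed yields the same masked positions, which is what makes runs with missing data comparable.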
- prepare()
Instance method that is called to inform a generator that all raw data have been added. For this class, the raw data is a compressed tarball containing a raw.npz file, and only one raw data file may be added to the generator. When called, the following steps are performed.
1. Unzip the raw data tarball, read the true parameter values from the raw.npz file, and set the corresponding attributes of the current generator.
2. Create the self.user_idx_to_id and self.user_id_to_idx mappings. The same mappings are created for items.
3. Randomly set a certain proportion of observations to 0, indicating that the data are missing. Meanwhile, calculate the log likelihood of the observed entries under the true parameter values.
4. Create self.user_idx_to_preference, a mapping from user idx to another dictionary, with keys being tuples of two items (in item_id) and values being 1. The order of the two items in a key indicates their preference.
- property rnd_seed: int
Random seed, used for reproducibility. Read directly from the "rnd_seed" field in self.spec.