Gradient boosted decision trees#

Algorithms#

class pfl.tree.federated_gbdt.GBDTClippingBound(base_value, layer_multiplier=1.0, tree_multiplier=1.0, use_tree_offset=False)#

Adapt clipping bound based on current layer and current index of tree being trained in a GBDT.

The sensitivity of the vector of gradients decreases as training progresses, because gradients shrink as predictions improve. Adapting the clipping bound during training improves the SNR.

The clipping bound is adapted as follows:

    clipping_bound = (base_value
                      * layer_multiplier ** current_depth
                      * tree_multiplier ** (current_tree - num_trees_in_model_at_start_of_training))

where base_value is the base clipping bound value, layer_multiplier and tree_multiplier are parameters controlling the impact of layer depth and number of trees on the clipping bound, current_depth is the depth of the current tree being trained, and current_tree is the id of the current tree being trained (starting at 0).
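For intuition, here is a minimal sketch of this schedule (illustrative only, not part of the pfl API):

    # Illustrative re-implementation of the clipping bound schedule above.
    def clipping_bound(base_value, layer_multiplier, tree_multiplier,
                       current_depth, current_tree, num_initial_trees=0):
        # Both multipliers lie in (0, 1], so the bound shrinks with depth
        # and with tree index, tracking the shrinking gradients.
        return (base_value
                * layer_multiplier ** current_depth
                * tree_multiplier ** (current_tree - num_initial_trees))

    # Base bound 1.0, halved per layer and per tree: depth 2, tree 1 -> 0.125.
    assert clipping_bound(1.0, 0.5, 0.5, current_depth=2, current_tree=1) == 0.125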

Warning

You must include a GBDTClippingBound object in the list of callbacks passed to the backend to ensure the clipping bound is actually updated on each central iteration of the algorithm (see the usage sketch after the parameter list below).

Parameters:
  • base_value (float) – Default value for clipping bound

  • layer_multiplier (float) – Factor used to modify the base value of clipping bound depending on the layer of a tree being trained. Should be in range (0, 1].

  • tree_multiplier (float) – Factor used to modify the base value of the clipping bound depending on the index of the tree being trained. Should be in range (0, 1].

  • use_tree_offset (bool) – Whether or not to count the trees in the GBDT initialization model when computing the clipping bound. A random initialization of a GBDT has zero trees. By default all trees are counted, as gradients depend on the total number of trees in the GBDT, not on the number of trees trained in the current training session.
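A minimal usage sketch, assuming the callback is passed alongside other callbacks to the training backend (the surrounding wiring depends on your setup; values are illustrative only):

    from pfl.tree.federated_gbdt import GBDTClippingBound

    # Start from a base bound of 2.0 and decay it by 0.7 per layer
    # and 0.9 per tree.
    clipping_callback = GBDTClippingBound(base_value=2.0,
                                          layer_multiplier=0.7,
                                          tree_multiplier=0.9)

    # The callback must be in the callbacks list so the bound is
    # updated on each central iteration (see the warning above).
    callbacks = [clipping_callback]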

value()#

The current state (inner value) of the hyperparameter.

Return type:

float

on_train_begin(*, model)#

Called before the first central iteration.

Return type:

Metrics

after_central_iteration(aggregate_metrics, model, *, central_iteration)#

Finalize any computations after each central iteration.

Parameters:
  • aggregate_metrics (Metrics) – A Metrics object with aggregated metrics accumulated from local training on users and central updates of the model.

  • model (GBDTModel) – A reference to the Model that is trained.

  • central_iteration (int) – The current central iteration number.

Return type:

Tuple[bool, Metrics]

Returns:

A tuple. The first value returned is a boolean, signaling that training should be interrupted if True. Can be useful for implementing features such as early stopping or convergence criteria. The second value returned is new metrics. Do not include any of the aggregate_metrics!

class pfl.tree.federated_gbdt.GBDTAlgorithmHyperParams(cohort_size, val_cohort_size, num_trees, cohort_size_per_layer_modifier_fn='power', l2_regularization=1, leaf_nodes_reduction_factor=1, compute_second_order_gradients=True, report_gradients_both_sides_split=True, report_node_sum_gradients=False, report_per_feature_result_difference=False, report_per_node_result_difference=False)#

Parameters for federated GBDT algorithms.

Parameters:
  • cohort_size (int) – Size of the cohort used for each training iteration. Note that the initial value of cohort_size may be modified by the FederatedGBDT algorithm during training of a GBDT.

  • val_cohort_size (int) – Size of the validation cohort used for distributed evaluation during training of a GBDT. If set to 0, no evaluation takes place.

  • num_trees (int) – Number of trees to train.

  • cohort_size_per_layer_modifier_fn (str) – Function used to modify the base value of the training cohort size depending on the layer of the tree being trained. Options are {‘none’, ‘linear’, ‘power’}, which correspond, respectively, to: not adapting the cohort size; linearly increasing the cohort size with the layer being trained; exponentially increasing the cohort size with the layer being trained. ‘power’ is the default setting, as this achieves the correct SNR during training. See _compute_cohort_size for more details.

  • compute_second_order_gradients (bool) – Boolean determining whether or not second order gradients should be computed and reported on device.

  • report_gradients_both_sides_split (bool) – Boolean determining whether or not to report gradients on both sides of each question/split asked.

  • report_node_sum_gradients (bool) – Boolean determining whether to report the sum of gradients in a node, or else the right-hand gradient for one question asked of the node, when report_gradients_both_sides_split = False.

  • report_per_feature_result_difference (bool) – Report difference in gradients of questions asked for each feature in each node.

  • report_per_node_result_difference (bool) – Report difference in gradients for all questions asked for each node.

  • leaf_nodes_reduction_factor (int) – Defines the factor by which to reduce the training cohort size from the base value when training the maximum depth of the tree, which comprises only leaf nodes. The default value is 1, i.e. no reduction takes place. However, this can be set to total_num_questions/2^(max_depth - 1), where total_num_questions is the sum of the number of questions specified for each feature used for training. leaf_nodes_reduction_factor must be an integer >= 1.
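For example, a hyperparameter configuration for training 10 trees might be constructed as follows (all values are illustrative):

    from pfl.tree.federated_gbdt import GBDTAlgorithmHyperParams

    algorithm_params = GBDTAlgorithmHyperParams(
        cohort_size=1000,        # base cohort; grows per layer under 'power'
        val_cohort_size=100,     # 0 would disable distributed evaluation
        num_trees=10,
        cohort_size_per_layer_modifier_fn='power',
        l2_regularization=1,
        compute_second_order_gradients=True)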

class pfl.tree.federated_gbdt.FederatedGBDT(features)#
simulate_one_user(model, user_dataset, central_context)#

Simulate the client-side part of the computation.

This should train a single user on its dataset (simulated as user_dataset) and return the model statistics. The statistics should be in such a form that they can be aggregated over users. For a cohort of multiple users, this method will be called multiple times.

Parameters:
  • model (TypeVar(GBDTModelType, bound= GBDTModel)) – The current state of the model.

  • user_dataset (TabularDataset) – The data simulated for a single user.

  • central_context (CentralContext[TypeVar(GBDTAlgorithmHyperParamsType, bound= GBDTAlgorithmHyperParams), TypeVar(GBDTModelHyperParamsType, bound= GBDTModelHyperParams)]) – Settings to use for this round. Contains the model params used for training/evaluating a user’s model.

Return type:

Tuple[Optional[MappedVectorStatistics], Metrics]

Returns:

A tuple (statistics, metrics), with model statistics and metrics generated from the training of a single user. Both statistics and metrics will be aggregated and the aggregate will be passed to process_aggregated_statistics.

static postprocess_training_statistics(statistics, l2_regularization=1)#

Postprocess statistics for one node.

Parameters:
  • statistics (MappedVectorStatistics) – Must have keys: ‘questions’; ‘first_order_grads_left’; ‘first_order_grads_right’; ‘second_order_grads_left’; ‘second_order_grads_right’.

  • l2_regularization (float) – L2 regularization used to reduce the complexity of the GBDT model trained, and avoid the model overfitting to the data.

Return type:

Tuple[int, float, float, float, float, float]

Returns:

Tuple of 6 scalar statistics representing a node in a tree in a GBDT: (feature, threshold, gain, value, left_child_value, right_child_value)
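For orientation, the gain and values of a split are typically derived from aggregated first and second order gradients using the standard second-order formulas of XGBoost-style GBDTs. The sketch below shows these formulas under that assumption; it is a plausible outline, not pfl's exact implementation:

    def split_gain_and_values(g_left, h_left, g_right, h_right,
                              l2_regularization=1.0):
        # Standard second-order leaf value: -G / (H + lambda).
        left_value = -g_left / (h_left + l2_regularization)
        right_value = -g_right / (h_right + l2_regularization)
        # Gain of the split relative to keeping the node unsplit.
        gain = (g_left ** 2 / (h_left + l2_regularization)
                + g_right ** 2 / (h_right + l2_regularization)
                - (g_left + g_right) ** 2
                / (h_left + h_right + l2_regularization))
        return gain, left_value, right_value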

process_aggregated_statistics(central_context, aggregate_metrics, model, statistics)#

The aggregated statistics are the aggregated first-order and, optionally, second-order gradients for each of the questions asked of the nodes to be split. For each question, a gain and a value are computed from its gradients. For each node to be split, the question with the highest gain is selected as the split question: if the node is a branch node, this split is included in the tree of the GBDT; if the node is a leaf node, the value is assigned to it as a leaf weight.

This function also calls for the model to apply the update, by adding branch or leaf nodes with the optimal splits or values as appropriate.

Return type:

Tuple[TypeVar(GBDTModelType, bound= GBDTModel), Metrics]

Models#

class pfl.tree.gbdt_model.NodeRecord(parent, decision_path, is_left, value, is_leaf)#

Node of a GBDT.

Parameters:
  • parent (Optional[Node]) – Parent node. None for root.

  • decision_path (List[List]) – List of triples [feature, threshold, is_left] defining the tests along the path to this node and the results of these tests. is_left is True if and only if the specific feature is less than or equal to the threshold, so the decision path goes left.

  • is_left (bool) – Whether this node hangs to the left of its parent. For the root node this defaults to True, but the value is not used.

  • value (Optional[float]) – Value of this node as determined by gradients aggregated for this node. For the root, it’s None.

  • is_leaf (bool) – True if the node is a leaf, False otherwise.
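A decision_path can be checked against a feature vector with plain Python; this hypothetical helper (not part of pfl) illustrates the semantics of the triples:

    def on_decision_path(x, decision_path):
        # Each triple is [feature, threshold, is_left]; is_left is True
        # iff x[feature] <= threshold, i.e. the path goes left.
        return all((x[feature] <= threshold) == is_left
                   for feature, threshold, is_left in decision_path)

    # The root has an empty decision path, so every datapoint reaches it.
    assert on_decision_path([0.3, 1.2], [])
    assert on_decision_path([0.3, 1.2], [[0, 0.5, True], [1, 1.0, False]])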

class pfl.tree.gbdt_model.GBDTModelHyperParams#
class pfl.tree.gbdt_model.GBDTClassificationModelHyperParams(evaluation_threshold)#
class pfl.tree.gbdt_model.GBDTModel(num_features, max_depth=3, learning_rate=0.9, minimum_split_loss=0, base_prediction=0, alpha=None)#

Gradient Boosted Decision Tree (GBDT) model base class.

Parameters:
  • num_features (int) – Number of features of data used to train GBDT.

  • max_depth (int) – Maximum depth of any tree trained in a GBDT.

  • learning_rate (float) – Learning rate used during boosting. Controls how much each tree contributes toward reducing the remaining error in predictions.

  • minimum_split_loss (float) – Minimum reduction in loss that the best split of a node must achieve before the node is split into two child nodes in a GBDT.

  • base_prediction (float) – Default prediction of a GBDT on datapoints before any trees are added to the GBDT.

  • alpha (Optional[float]) – Optionally used when computing the value of a node. If None, the value of a node is computed from the statistics aggregated for that node. If real-valued and greater than 0, the value of a node is the weighted sum of the statistics aggregated for the current node and for the parent node: a*parent_value + (1-a)*value, where a = alpha. See the sketch following this parameter list.
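A sketch of this alpha smoothing (illustrative only):

    def smoothed_node_value(value, parent_value, alpha=None):
        # With alpha unset, the node keeps the value computed from its
        # own aggregated statistics.
        if alpha is None or parent_value is None:
            return value
        # Weighted sum of parent and current node values.
        return alpha * parent_value + (1 - alpha) * value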

property current_depth: int#

Current depth of the tree being trained, starting at 0; a depth of 0 means no nodes have yet been added to the tree.

property current_tree: int#

Index of the current tree being trained in the GBDT, starting at 0.

abstract compute_first_order_gradient(target, prediction)#

Compute first order gradient.

abstract compute_second_order_gradient(target, prediction)#

Compute second order gradient.

save(dir_path)#

Save a GBDT model to disk.

Parameters:

dir_path (str) – Directory to which the GBDT model will be saved, as “gbdt.json”.

Return type:

None

load(path)#

Load a GBDT model from disk.

The model can be loaded from a saved XGBoost model or from a json serialisation of a GBDT model, which uses the XGBoost serialisation format: https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=dump_model#xgboost.Booster.dump_model

An XGBoost model must have a “.model” extension, e.g. “xgboost.model”. A json serialisation of the model must have a “.json” extension, e.g. “gbdt.json”.

Note that the following parameters are not saved to disk along with a GBDT model: l2 regularization; number of features; minimum split loss; maximum number of trees; maximum depth of any tree.

Further training of the GBDT model loaded using this function will result in new trees being trained and added to the GBDT model. Further training will not adapt the existing trees in the GBDT loaded using this function.

Parameters:

path (str) – Path to an XGBoost model or a json serialisation of a GBDT model. The file extension must be either .model or .json.

Return type:

None
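A typical save/load round trip using the signatures documented above (paths are placeholders):

    from pfl.tree.gbdt_model import GBDTModelClassifier

    model = GBDTModelClassifier(num_features=10, max_depth=3)
    # ... train the model ...
    model.save('/tmp/my_gbdt')            # writes /tmp/my_gbdt/gbdt.json
    model.load('/tmp/my_gbdt/gbdt.json')  # or an XGBoost '.model' file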

apply_model_update(statistics)#

Update GBDT model using best splits determined from first and second order gradients computed on training data.

The statistics define the node to be added to a tree in a GBDT.

All nodes on a single level of a tree in a GBDT are updated in one iteration of training, corresponding to one call of this function.

The parent nodes of new nodes to be added to the tree in a GBDT are defined in self._nodes_to_split. The best splits for these nodes are computed by the FederatedGBDT algorithm, determining whether another branch node or leaf node should be added as a child node to the parent node. Any child branch nodes are added to self._nodes_to_split so that they can be added to the tree in the next iteration of training.

A node can be set as a leaf node if the maximum depth of a tree has been reached or if the minimum split loss was not attained by the best split for that node.

Parameters:

statistics (MappedVectorStatistics) – MappedVectorStatistics holding (feature, threshold, gain, value, left_child_value, right_child_value) statistics for each node to be added to tree. These statistics have been computed from gradients aggregated from user devices.

Return type:

Tuple[TypeVar(GBDTModelType, bound= GBDTModel), Metrics]

predict(X)#

Make predictions for each data sample in X using GBDT.

evaluate(dataset, name_formatting_fn=<function GBDTModel.<lambda>>, eval_params=None)#

Evaluate model performance using a dataset.

The prediction of a datapoint used for evaluation is computed using:

  1. All the trees in the GBDT, when the most recently added tree of the GBDT is fully trained.

  2. All trees in the GBDT except the most recently added tree, when this most recently added tree is not fully trained.

Parameters:
  • dataset (TypeVar(AbstractDatasetType, bound= AbstractDataset)) – The dataset on which evaluation should take place. Assumes that dataset.raw_data is a list of datapoints, where each datapoint is represented by a tuple (x, y), where x is a vector of input features, and y is a scalar target value.

  • name_formatting_fn – A function that produces a TrainMetricName object from a simple string for metrics computed on a dataset.

Return type:

Metrics

Returns:

A Metrics object with performance metrics evaluated on the dataset.

class pfl.tree.gbdt_model.GBDTModelClassifier(num_features, max_depth=3, learning_rate=0.9, minimum_split_loss=0, base_prediction=0, alpha=None)#

Federated GBDT model for binary classification.

See GBDTModel for description of parameters.

compute_first_order_gradient(target, prediction)#

Compute first order gradient of log loss (cross-entropy loss for binary classification).

compute_second_order_gradient(target, prediction)#

Compute second order gradient of log loss for binary classification.
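These are the standard log-loss gradients. The sketch below assumes prediction is a raw (pre-sigmoid) score, which may differ from pfl's internal convention:

    import math

    def log_loss_gradients(target, prediction):
        # p: predicted probability of the positive class.
        p = 1.0 / (1.0 + math.exp(-prediction))
        first_order = p - target       # dL/df
        second_order = p * (1.0 - p)   # d^2L/df^2
        return first_order, second_order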

predict_classes(X)#

Round predictions to the closest of the binary values {0, 1}.

class pfl.tree.gbdt_model.GBDTModelRegressor(num_features, max_depth=3, learning_rate=0.9, minimum_split_loss=0, base_prediction=0, alpha=None)#

Federated GBDT model for regression.

See GBDTModel for description of parameters.

compute_first_order_gradient(target, prediction)#

Compute first order gradient for squared error loss function used for regression.

compute_second_order_gradient(target, prediction)#

Compute second order gradient for squared error loss function for regression.
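For squared error loss with the conventional 1/2 factor, L = 0.5 * (prediction - target)**2, these gradients reduce to the following (sketch, illustrative only):

    def squared_error_gradients(target, prediction):
        first_order = prediction - target   # dL/df
        second_order = 1.0                  # d^2L/df^2 is constant
        return first_order, second_order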

Utils#

class pfl.tree.tree_utils.Feature(feature_index, feature_range, feature_type, num_questions, question_choice_method=None)#

Represents a feature of the data used to train a GBDT. The algorithm used to train a GBDT will select feature splits which optimally partition data to reduce the loss function used for training.

Parameters:
  • feature_index (int) – Index of the feature in the data’s feature vector. Must be non-negative.

  • feature_range (Tuple[Union[int, float], Union[int, float]]) – Range of feature, [min_val, max_val].

  • feature_type (Union[Type[float], Type[int], Type[bool]]) – Type of feature: {bool, float, int} are supported.

  • num_questions (int) – Number of questions to ask for this feature during training algorithm. Must be >= 0.

  • question_choice_method (Optional[str]) – Method used to generate questions for this feature, to find the optimal feature splits to partition the data. Either ‘equidistant’ or ‘random’ for int or float features. Boolean features use a default splitting method.
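For example (feature indices, ranges and question counts are illustrative):

    from pfl.tree.tree_utils import Feature

    features = [
        Feature(feature_index=0, feature_range=(0.0, 100.0),
                feature_type=float, num_questions=8,
                question_choice_method='equidistant'),
        Feature(feature_index=1, feature_range=(0, 1),
                feature_type=bool, num_questions=1),
    ]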

generate_feature_questions(node, num_questions=None)#

Generate questions for feature, to be used in training a tree.

Parameters:
  • node (NodeRecord) – The node in a tree for which questions should be generated for this feature.

  • num_questions (Optional[int]) – Optional parameter for number of questions to generate. If not None, it will override self._num_questions. Can be 0.

Return type:

List[Union[int, float]]

pfl.tree.tree_utils.choose_questions(nodes, features)#

Select questions to be used to train a GBDT in each iteration of training.

Note: leaf nodes require only a single question, because for a leaf node only the node’s value needs to be determined, and this can be computed from the gradients of any split of the data, not necessarily the optimal split.

Parameters:
  • nodes (List[NodeRecord]) – Nodes for which questions should be generated, so that optimal parameters can be computed for these nodes and the nodes can be added to trees in the GBDT.

  • features (List[Feature]) – List of Feature objects used to train a GBDT. Questions will be asked of each Feature to identify the feature-split which achieves the best partition of the dataset to reduce the loss function used for training.

Return type:

Tuple[List[Dict[str, Any]], int]

Returns:

A tuple comprising a list of questions for each node in nodes, to be used for training, and the total number of questions asked. For each node, a dictionary is returned that includes the path to the node, under the key decisionPath, and a dictionary mapping feature indices to threshold values: these are the questions.
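A usage sketch, reusing the features list from the Feature example above:

    from pfl.tree.gbdt_model import NodeRecord
    from pfl.tree.tree_utils import choose_questions

    # Root of a new tree: no parent, empty decision path, no value yet.
    root = NodeRecord(parent=None, decision_path=[], is_left=True,
                      value=None, is_leaf=False)

    questions, total_num_questions = choose_questions([root], features)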