Tree#

Tree data structures#

class pfl.internal.tree.gbdt.GBDT(base_value=0)#

Gradient Boosted Decision Tree base class.

Parameters:

base_value (float) – Default value for all datapoints before any trees are trained and added to a GBDT. Once trees have been added to a GBDT, the prediction for a datapoint is base_value plus the sum of the leaf values assigned to that datapoint by each tree in the GBDT.

add_tree(root)#

Add the root node of a fully or partially trained tree to the GBDT.

This will be the root node of a new tree in the GBDT ensemble of trees. Ensure the previous tree in the GBDT is fully trained before adding a new tree.

Parameters:

root (Node) – Root node of a fully or partially trained tree to be added to GBDT.
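A minimal sketch of this flow, assuming the Node API documented below; GBDT itself declares an abstract evaluate(), so the sketch instantiates GBDTRegressor (also documented below):

    from pfl.internal.tree.gbdt import GBDTRegressor
    from pfl.internal.tree.node import Node

    gbdt = GBDTRegressor(base_value=0.0)

    # Root branch node: splits on feature 0 at threshold 0.5.
    root = Node(feature=0, threshold=0.5)
    # Attach a leaf to each branch, which completes the tree.
    root.add_leaf_node(is_left=True, value=-0.3)
    root.add_leaf_node(is_left=False, value=0.4)

    # The tree is fully trained, so it is safe to add it before
    # starting to train the next tree.
    gbdt.add_tree(root)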

get_max_min_predictions()#

Return the maximum and minimum predictions possible from a GBDT.

Maximum prediction = sum of highest value leaf from each tree in GBDT plus base value. Minimum prediction = sum of lowest value leaf from each tree in GBDT plus base value.

Only consider trees in GBDT which are fully trained.

Return type:

Tuple[float, float]

Returns:

Tuple of (maximum possible prediction, minimum possible prediction) from the GBDT.
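For example, for a GBDT with base value 0.5 and two fully trained trees whose leaf values are {-0.2, 0.3} and {-0.1, 0.4}, the arithmetic works out as follows (illustrative values only):

    base_value = 0.5
    tree_leaves = [[-0.2, 0.3], [-0.1, 0.4]]

    maximum = base_value + sum(max(leaves) for leaves in tree_leaves)  # 0.5 + 0.3 + 0.4 = 1.2
    minimum = base_value + sum(min(leaves) for leaves in tree_leaves)  # 0.5 - 0.2 - 0.1 = 0.2
    # get_max_min_predictions() would return (1.2, 0.2) for such a model.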

predict(X)#

Make prediction for each datapoint in X using all fully trained trees in GBDT.

The prediction of a datapoint from a GBDT is the sum of the values assigned to a datapoint from each tree in the GBDT plus the base value.

If a GBDT is empty, the prediction of a datapoint is the base value.

Only fully trained trees in a GBDT are considered during prediction.

Parameters:

X (ndarray) – (N x d) array of datapoints, where N is the number of datapoints, and d is the number of features per datapoint.

Return type:

ndarray

Returns:

Prediction from GBDT for each datapoint in X.
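A usage sketch, continuing the single-tree model built under add_tree() above; the leaf routing follows the branch-node inequality described under Node:

    import numpy as np

    # Two datapoints with d = 2 features each.
    X = np.array([[0.2, 1.0],   # feature 0 <= 0.5 -> left leaf, value -0.3
                  [0.9, 1.0]])  # feature 0 >  0.5 -> right leaf, value 0.4

    predictions = gbdt.predict(X)
    # With base_value 0.0 and the single tree above: array([-0.3, 0.4]).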

to_serialized_xgboost()#

Emulate output of xgboost dump_model() method.

Returns a list of dictionaries describing the trees in the GBDT, matching the model description produced by XGBoost's dump_model() method; see https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.dump_model

Only fully trained trees in GBDT will be converted to XGBoost format.

Return type:

List[Dict]

Returns:

List of dictionaries, one per fully trained tree in the GBDT, in the format of XGBoost’s dump_model() function.
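As an illustration, the single-split tree built above might serialize to something like the following; the keys shown mirror XGBoost's JSON dump format and are an assumption, not a guarantee of the exact output:

    [{
        "nodeid": 0,
        "split": "f0",           # feature used at the root branch node
        "split_condition": 0.5,  # threshold of the inequality
        "yes": 1,                # child taken when feature <= threshold
        "no": 2,
        "children": [
            {"nodeid": 1, "leaf": -0.3},
            {"nodeid": 2, "leaf": 0.4},
        ],
    }]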

abstract evaluate(X, y)#

Compute performance metrics for GBDT on datapoints in X, given targets in y.

Only fully trained trees in GBDT will be considered during evaluation.

Parameters:
  • X (ndarray) – (N x d) array of datapoints, where N is the number of datapoints, and d is the number of features per datapoint.

  • y (ndarray) – Targets corresponding to datapoints in X: array-like, of shape (N,) or (N, 1), where N is the number of datapoints.

Return type:

Dict[str, float]

Returns:

Dictionary of performance metrics, evaluating GBDT on datapoints in X, given targets y.

class pfl.internal.tree.gbdt.GBDTClassifier(base_value=0)#

Gradient Boosted Decision Tree class for binary classification.

Parameters:

base_value (float) – Default value for all datapoints before any trees are trained and added to a GBDT. Once trees have been added to a GBDT, the prediction for a datapoint is base_value plus the sum of the leaf values assigned to that datapoint by each tree in the GBDT.

predict(X)#

Make prediction for each datapoint in X using all fully trained trees in GBDT.

The prediction of a datapoint from a classification GBDT is the sigmoid of the sum of the values assigned to that datapoint from each tree in the GBDT plus the base value.

If a GBDT is empty, the prediction of a datapoint is the sigmoid of the base value of the GBDT.

Only fully trained trees in a GBDT are considered for prediction.

Parameters:

X (ndarray) – (N x d) array of datapoints, where N is the number of datapoints, and d is the number of features per datapoint.

Return type:

ndarray

Returns:

Prediction from GBDT for each datapoint in X.
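A sketch of the empty-model case, assuming scipy's expit as the sigmoid and reusing the feature matrix X from the predict() example above:

    import numpy as np
    from scipy.special import expit

    from pfl.internal.tree.gbdt import GBDTClassifier

    clf = GBDTClassifier(base_value=0.0)

    # With no trees added, every prediction is the sigmoid of the base value.
    probs = clf.predict(X)
    assert np.allclose(probs, expit(0.0))  # expit(0.0) == 0.5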

predict_classes(X)#

Make binary class prediction for each datapoint in X using all fully trained trees in GBDT.

The class prediction is the predicted probability rounded to the nearest of the two classes, 0 and 1: predictions in the range [0, 0.5] map to class 0, and predictions in the range (0.5, 1] map to class 1. Note that the prediction used to compute the class is the sigmoid of the sum of the values assigned to the datapoint by each tree in the GBDT, plus the base value.

Only fully trained trees in GBDT will be considered for prediction.

Parameters:

X (ndarray) – (N x d) array of datapoints, where N is the number of datapoints, and d is the number of features per datapoint.

Return type:

ndarray

Returns:

Prediction from GBDT for each datapoint in X.
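predict_classes() should therefore agree with thresholding the probabilities from predict() at 0.5; a sketch of that equivalence, stated as an assumption based on the mapping described above, continuing the classifier example:

    import numpy as np

    probs = clf.predict(X)
    classes = clf.predict_classes(X)

    # Probabilities strictly above 0.5 map to class 1, the rest to class 0.
    assert np.array_equal(classes, (probs > 0.5).astype(int))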

evaluate(X, y)#

Compute accuracy of GBDT binary classifier on data X, given targets y.

Evaluation will only be performed on fully trained trees in GBDT.

Parameters:
  • X (ndarray) – (N x d) array of datapoints, where N is the number of datapoints, and d is the number of features per datapoint.

  • y (ndarray) – Targets corresponding to datapoints in X: array-like, of shape (N,) or (N, 1), where N is the number of datapoints.

Return type:

Dict[str, float]

Returns:

Dictionary of metrics, including accuracy of GBDT on data X, with targets y.

get_max_min_predictions()#

Return the maximum and minimum predictions possible from a GBDT.

Maximum prediction = expit (sigmoid) of (base value + sum of the highest-value leaf from each tree in the GBDT). Minimum prediction = expit of (base value + sum of the lowest-value leaf from each tree in the GBDT).

Only consider trees in GBDT which are fully trained.

Return type:

Tuple[float, float]

Returns:

Tuple of (maximum possible prediction, minimum possible prediction) from the GBDT.

class pfl.internal.tree.gbdt.GBDTRegressor(base_value=0)#

Gradient Boosted Decision Tree class for regression.

Parameters:
base_value (float) – Default value for all datapoints before any trees are trained and added to a GBDT. Once trees have been added to a GBDT, the prediction for a datapoint is base_value plus the sum of the leaf values assigned to that datapoint by each tree in the GBDT.

evaluate(X, y)#

Compute the mean absolute error (MAE) and mean squared error (MSE) of the regression GBDT on data X, given targets y.

Evaluation will only be performed on fully trained trees in GBDT.

Parameters:
  • X (ndarray) – (N x d) array of datapoints, where N is the number of datapoints, and d is the number of features per datapoint.

  • y (ndarray) – Targets corresponding to datapoints in X: array-like, of shape (N,) or (N, 1), where N is the number of datapoints.

Return type:

Dict[str, float]

Returns:

Dictionary of metrics, including MAE and MSE of GBDT on data X, with targets y.
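A usage sketch, continuing the GBDTRegressor built under add_tree() above; the exact metric key names in the returned dictionary are not specified here, so the names shown are an assumption:

    import numpy as np

    X = np.array([[0.2, 1.0], [0.9, 1.0]])
    y = np.array([-0.25, 0.5])

    metrics = gbdt.evaluate(X, y)
    # e.g. {'mae': ..., 'mse': ...}; inspect the returned keys rather than
    # relying on these assumed names.
    print(metrics)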

class pfl.internal.tree.node.Node(feature=None, threshold=None, value=None, left_child=None, right_child=None)#

Represents a branch node or a leaf node in a binary decision tree.

A branch node has a left child node and a right child node, and is defined by the inequality feature <= threshold. A datapoint follows the path to the left child node if the inequality is satisfied for that datapoint; otherwise, it follows the path to the right child node.

A leaf node has no left or right child node. It is defined by a value.

is_leaf()#

Check if node is a leaf node.

A leaf node has a value and does not have a left or a right child node.

If a node has no left or right child node but also has no value, it is an incomplete branch node, with no child nodes attached yet.

Return type:

bool

Returns:

True if node is a leaf node, else False.

num_nodes()#

Return number of nodes in tree, starting with current node.

The tree does not have to be fully trained to call this function.

Either value must be set, or feature and threshold must be set for a node to be counted.

Return type:

int

max_depth()#

Return maximum depth of tree: the number of nodes along the longest path in the tree.

The tree does not have to be fully trained to call this function.

Either value must be set, or feature and threshold must be set for a node to be counted.

Return type:

int

Returns:

Maximum depth of tree from current node.

add_leaf_node(is_left, value)#

Add a leaf child node to this node.

Parameters:
  • is_left (bool) – If True, leaf node is left child of parent node. Else, leaf node is right child of parent node.

  • value (float) – Value of leaf node, used for predictions.

add_branch_node(is_left, feature, threshold)#

Add a branch child node to this node.

Parameters:
  • is_left (bool) – If True, branch node is left child of parent node. Else, branch node is right child of parent node.

  • feature (int) – Feature used for inequality condition at branch node.

  • threshold (float) – Threshold used for inequality condition at branch node.

Return type:

Node

Returns:

Branch child node just added.
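Because the newly attached branch node is returned, deeper trees can be built by chaining calls; a sketch assuming the Node API as documented:

    from pfl.internal.tree.node import Node

    root = Node(feature=0, threshold=0.5)

    # Left subtree: a further split on feature 1, then two leaves.
    left = root.add_branch_node(is_left=True, feature=1, threshold=2.0)
    left.add_leaf_node(is_left=True, value=-0.5)
    left.add_leaf_node(is_left=False, value=-0.1)

    # Right subtree: a single leaf.
    root.add_leaf_node(is_left=False, value=0.4)

    assert root.training_complete()  # full binary tree
    assert root.num_nodes() == 5
    assert root.max_depth() == 3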

training_complete()#

Return whether the tree comprising this node and all its child nodes is completely trained.

The tree is completely trained if it is a full binary tree: every node other than a leaf node has two child nodes.

Return type:

bool

get_leaf_values()#

Return list of values of all leaves in tree.

Tree can be fully or partially trained when this method is called.

Return type:

List[float]

predict(X)#

Make prediction for each datapoint in X using tree.

Each datapoint is passed through the tree, following the decision path arising from the inequality condition at each branch node, until each datapoint arrives at a leaf node, whose value is the prediction for this datapoint.

If the tree is empty, return the base prediction for all datapoints.

The tree must be fully trained to return a prediction.

Parameters:

X (ndarray) – (N x d) array of datapoints, where N is the number of datapoints and d is the number of features per datapoint.

Return type:

ndarray

Returns:

Prediction from tree for each datapoint in X.

to_serialized_xgboost()#

Emulate xgboost’s dump_model() function.

Return type:

Dict

Returns:

Nested dictionary describing the nodes in a tree.

static from_serialized_xgboost(tree)#

Extract tree from serialized XGBoost format. Uses depth-first search to process all nodes in the tree.

Parameters:

tree (Dict) – Dictionary representing tree, in serialized XGBoost format.

Return type:

Node

Returns:

Root node of the tree.
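This enables a round trip with to_serialized_xgboost(); a sketch reusing the root built in the add_branch_node() example, assuming the two formats are compatible in both directions:

    serialized = root.to_serialized_xgboost()
    rebuilt = Node.from_serialized_xgboost(serialized)

    assert rebuilt.num_nodes() == root.num_nodes()
    assert rebuilt.max_depth() == root.max_depth()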

pfl.internal.tree.node.process_xgboost_node(node)#

Process a node in serialized xgboost format.

Parameters:

node – Serialized XGBoost dictionary format for node.

Return type:

Tuple[Optional[int], Optional[float], Optional[float]]

Returns:

Tuple of (feature, threshold, value) describing the node. Note that if the node is a branch node, value will be None; if it is a leaf node, feature and threshold will be None.
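A sketch of how the tuple disambiguates the two node kinds:

    from pfl.internal.tree.node import process_xgboost_node

    def describe(node_dict):
        feature, threshold, value = process_xgboost_node(node_dict)
        if value is None:
            return f'branch: feature {feature} <= {threshold}'
        return f'leaf: value {value}'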

GBDT adaptive hyperparameters#

class pfl.internal.tree.gbdt_adaptive_hyperparameters.GBDTAdaptiveHyperparameter(base_value, *args, **kwargs)#

Base class for hyperparameters that are adapted during private federated learning (PFL) of a GBDT, as a function of the GBDT model.

Parameters:

base_value – Default value for hyperparameter.

abstract current_value(model)#

Current value of hyperparameter, which is a function of base value, and the GBDT model.

class pfl.internal.tree.gbdt_adaptive_hyperparameters.TrainCohortSize(base_value, per_layer_modifier_function='power', leaf_nodes_reduction_factor=1)#

Cohort size used for training each level of a tree in a GBDT.

The default method for adapting the training cohort size per layer of the trees in a GBDT is the “power” method, which ensures the signal-to-noise ratio (SNR) is not reduced when training deeper layers of the trees in a GBDT.

Note that fewer results are required to compute the values of leaf nodes, as fewer training statistics are aggregated to compute these values. Consequently, the cohort size for training leaf nodes is reduced compared to the base value.

Parameters:
  • base_value (int) – Initial value for training cohort size. Gradients from this number of users are aggregated to train root node, at top level of a tree in a GBDT. This base value is selected to balance the tradeoff between final performance of the trained model, the DP guarantees used during training, and the time required to wait to aggregate this number of results.

  • per_layer_modifier_function (str) – Defines the function used to modify the base value of the training cohort size depending on the layer of a tree being trained. Options include {‘none’, ‘linear’, ‘power’}, which, respectively, correspond to: not adapting the cohort size; linearly increasing the cohort size with the layer being trained; and exponentially increasing the cohort size with the layer being trained. ‘power’ is the default setting, as this helps to ensure that the SNR does not decrease with the depth of the layer being trained.

  • leaf_nodes_reduction_factor (int) – Defines the factor by which to reduce the training cohort size from the base value when training the deepest layer of the tree, which comprises only leaf nodes. The default value is 1, i.e. no reduction takes place. However, this can be set to total_num_questions/2^(max_depth - 1), where total_num_questions is the sum of the number of questions specified for each feature used for training.

current_value(model)#

Current value of hyperparameter, which is a function of base value, and the GBDT model.

Return type:

int
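A usage sketch; the exact growth schedule of the ‘power’ method is internal to the class, but doubling the cohort per layer is one plausible reading, since the number of nodes doubles with each layer of a binary tree:

    from pfl.internal.tree.gbdt_adaptive_hyperparameters import TrainCohortSize

    cohort_size = TrainCohortSize(base_value=1000,
                                  per_layer_modifier_function='power')

    # During training, the framework derives the adapted value from the
    # current GBDT model, e.g. from the depth of the layer currently being
    # trained. Purely illustrative schedule for the 'power' method:
    #   layer 0 -> 1000, layer 1 -> 2000, layer 2 -> 4000, ...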

class pfl.internal.tree.gbdt_adaptive_hyperparameters.ValidationCohortSize(base_value)#

Validation is only performed once per tree being trained.

As trees are trained layer-wise, validation is only performed while training the top layer of the tree. This is because the model’s predictions will not change until an entire new tree is trained.

Future enhancement: rdar://104295602 (GBDT evaluation should occur at lowest level of tree, when largest training cohort size is used)

Parameters:

base_value (int) – Number of users from which to gather metrics during validation iterations.

current_value(model)#

Current value of hyperparameter, which is a function of base value, and the GBDT model.

Return type:

int

class pfl.internal.tree.gbdt_adaptive_hyperparameters.ClippingBound(base_value, layer_multiplier=1.0, tree_multiplier=1.0)#

Adapt clipping bound based on current layer and current index of tree being trained in a GBDT.

The sensitivity of the vector of gradients decreases as training progresses, since gradients shrink as predictions improve. Adapting the clipping bound during training improves the SNR.

Parameters:
  • base_value (float) – Default value for clipping bound

  • layer_multiplier (float) – Factor used to modify the base value for clipping bound depending on the layer of a tree being trained. Should be in range (0, 1].

  • tree_multiplier (float) – Factor used to modify the base value for the clipping bound depending on the index of the tree currently being trained in the GBDT. Should be in range (0, 1].

current_value(model)#

Current value of hyperparameter, which is a function of base value, and the GBDT model.

Return type:

float
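One plausible schedule consistent with the parameter descriptions, stated only as an illustration (the exact formula is internal to the class):

    from pfl.internal.tree.gbdt_adaptive_hyperparameters import ClippingBound

    bound = ClippingBound(base_value=2.0,
                          layer_multiplier=0.7,
                          tree_multiplier=0.9)

    # Assumed form: current bound ~= base_value
    #                                * layer_multiplier ** layer
    #                                * tree_multiplier ** tree_index
    # so the bound shrinks as training progresses and gradients get smaller.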

class pfl.internal.tree.gbdt_adaptive_hyperparameters.ComputeSecondOrderGradients(base_value)#

Decide whether or not to compute and aggregate second order gradients during training of a GBDT.

Second order gradients improve the algorithm’s ability to identify the optimal split for a branch node, or the optimal value for a leaf node in a tree. However, including second order gradients in the vector of gradients being aggregated during training increases the sensitivity of this vector, which means that the clipping applied for DP shrinks the vector by a greater factor; this can result in a lower SNR during training.

Note that second order gradients must be aggregated for leaf nodes, so that leaf values can be computed.

Parameters:

base_value (bool) – Default setting of whether to aggregate second order gradients during training of a tree.

current_value(model)#

Current value of hyperparameter, which is a function of base value, and the GBDT model.

Return type:

bool

Tree split questions#

class pfl.internal.tree.questions.QuestionGenerator#

Abstract base class for generating questions to be used in training a GBDT with a federated algorithm.

A question is defined by a “threshold” value, which is used to separate datapoints into two subsets: (1) datapoints where datapoint[feature] <= threshold; and (2) datapoints where datapoint[feature] > threshold.

The questions must be returned as an increasing sorted list.
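In other words, each question induces exactly the same partition as a branch node; a sketch of the induced split, assuming a NumPy feature matrix:

    import numpy as np

    X = np.array([[0.2, 3.0],
                  [0.9, 1.0],
                  [0.5, 2.5]])
    feature, threshold = 0, 0.5  # one "question" on feature 0

    left = X[X[:, feature] <= threshold]   # subset (1): datapoint[feature] <= threshold
    right = X[X[:, feature] > threshold]   # subset (2): datapoint[feature] > threshold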

abstract generate(min_val, max_val, num_questions)#

Generate questions to be asked for a feature in the federated algorithm to train a GBDT.

Questions generated should be in the range [min_val, max_val]. The number of questions generated should be equal to num_questions.

Return type:

List[float]

class pfl.internal.tree.questions.FloatEquidistantQuestionGenerator#

Generate questions for a feature of type float. The questions should be spaced equidistantly between min_val and max_val.

generate(min_val, max_val, num_questions)#

Generate questions to be asked for a feature in the federated algorithm to train a GBDT.

Questions generated should be in the range [min_val, max_val]. The number of questions generated should be equal to num_questions.

Return type:

List[float]
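As an illustration, equidistant questions over [0.0, 1.0] can be pictured with np.linspace; whether the endpoints themselves are used is an implementation detail, so the exact values are an assumption:

    import numpy as np

    # Four equidistant interior thresholds between 0.0 and 1.0:
    questions = np.linspace(0.0, 1.0, num=4 + 2)[1:-1]
    # array([0.2, 0.4, 0.6, 0.8])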

class pfl.internal.tree.questions.IntEquidistantQuestionGenerator#

Generate questions for a feature of type int. The questions should be spaced equidistantly between min_val and max_val. For features of type int, at most one question is generated per integral part.

generate(min_val, max_val, num_questions)#

Generate questions to be asked for a feature in the federated algorithm to train a GBDT.

Questions generated should be in the range [min_val, max_val]. The number of questions generated should be equal to num_questions.

Return type:

List[float]

class pfl.internal.tree.questions.FloatRandomQuestionGenerator(offset_fraction=0.02)#

Generate questions for a feature of type float. The questions are randomly chosen in the range [min_val+offset, max_val-offset].

generate(min_val, max_val, num_questions)#

Generate questions to be asked for a feature in the federated algorithm to train a GBDT.

Questions generated should be in the range [min_val, max_val]. The number of questions generated should be equal to num_questions.

Return type:

List[float]

class pfl.internal.tree.questions.IntRandomQuestionGenerator#

Generate questions for a feature of type int. The questions are randomly selected from the range [min_val, max_val], and at most one question is generated per integral part.

generate(min_val, max_val, num_questions)#

Generate questions to be asked for a feature in the federated algorithm to train a GBDT.

Questions generated should be in the range [min_val, max_val]. The number of questions generated should be equal to num_questions.

Return type:

List[float]

class pfl.internal.tree.questions.BoolQuestionGenerator(false_val, true_val)#

Generate a single question for a Boolean feature. The question lies at the midpoint of the range [false_val, true_val] used to encode the Boolean variable.

generate(min_val=0, max_val=1, num_questions=1)#

If a split already exists on a Boolean feature, don’t generate a new question, since it would not be able to further separate the data. An existing split is identified when min_val > self._false_val and/or max_val < self._true_val.

Return type:

List[float]
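A sketch for a Boolean feature encoded with false_val=0 and true_val=1, where the single question sits at the midpoint:

    from pfl.internal.tree.questions import BoolQuestionGenerator

    generator = BoolQuestionGenerator(false_val=0, true_val=1)
    questions = generator.generate()  # expected: [0.5], the midpoint of [0, 1]

    # Once a split exists, the observed range narrows and no further
    # question is generated (an empty list is the assumed result).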