MERF

Mixed Effects Random Forest model.

class merf.merf.MERF(fixed_effects_model=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False), gll_early_stop_threshold=None, max_iterations=20)

This is the core class to instantiate, train, and predict using a mixed effects random forest model. It roughly adheres to the sklearn estimator API. Note that the user must pass in an already instantiated fixed_effects_model that adheres to the sklearn regression estimator API, i.e. must have a fit() and predict() method defined.

It assumes a data model of the form:

\[y = f(X) + b_i Z + e\]

y is the target variable. Only regression is currently supported, i.e. y is a continuously varying scalar value
X is the fixed effect features. Assume p dimensional
f(.) is the nonlinear fixed effects model, e.g. random forest
Z is the random effect features. Assume q dimensional.
e is iid noise ~N(0, sigma_e²)
i is the cluster index. Assume k clusters in the training.
b_i is the vector of random effect coefficients. It differs per cluster i, but all b_i are assumed to be drawn from the same distribution ~N(0, Sigma_b), where Sigma_b is learned from the data.
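To make the data model above concrete, the following sketch draws synthetic data from it with NumPy. The function f, the dimensions, and the values of Sigma_b and sigma_e are assumptions chosen only for illustration; none of this is part of the MERF API.

```python
import numpy as np

rng = np.random.default_rng(0)

# k clusters, n_per_cluster samples each, p fixed effect dims, q random effect dims
k, n_per_cluster, p, q = 5, 40, 3, 1
n = k * n_per_cluster

X = rng.normal(size=(n, p))                        # fixed effect features
Z = np.ones((n, q))                                # q = 1 column of ones: random intercept
clusters = np.repeat(np.arange(k), n_per_cluster)  # cluster index i of every sample


def f(X):
    # Hypothetical nonlinear fixed effects function, for illustration only.
    return np.sin(X[:, 0]) + X[:, 1] ** 2


Sigma_b = np.array([[1.5]])   # covariance of the random effect coefficients
sigma_e = 0.3                 # standard deviation of the iid noise

b = rng.multivariate_normal(np.zeros(q), Sigma_b, size=k)  # one b_i per cluster, shape (k, q)
e = rng.normal(scale=sigma_e, size=n)

# y = f(X) + b_i Z + e, where every row uses the b_i of its own cluster
y = f(X) + np.sum(Z * b[clusters], axis=1) + e
```

With q = 1 and Z set to a column of ones, the random effect reduces to a per-cluster intercept, which is the simplest case of the model.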

Parameters

fixed_effects_model (sklearn.base.RegressorMixin) – instantiated model class
gll_early_stop_threshold (float) – early stopping threshold on the improvement in generalized log-likelihood (GLL)
max_iterations (int) – maximum number of EM iterations to train
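The interaction between gll_early_stop_threshold and max_iterations can be sketched as a loop that stops either when the iteration budget is used up or when the GLL stops improving. The sketch below assumes the improvement is measured as the relative change between consecutive GLL values; that criterion, the function name, and the list of precomputed GLL values are assumptions for illustration, not the library's confirmed behaviour.

```python
def iterations_run(gll_values, gll_early_stop_threshold=None, max_iterations=20):
    """Return how many iterations would run for a given sequence of GLL values."""
    history = []
    for gll in gll_values[:max_iterations]:
        history.append(gll)
        # Early stopping is only considered when a threshold is set
        # and at least two GLL values are available to compare.
        if gll_early_stop_threshold is not None and len(history) > 1:
            # Assumed criterion: relative change (undefined if the previous GLL is 0).
            relative_change = abs((history[-1] - history[-2]) / history[-2])
            if relative_change < gll_early_stop_threshold:
                break
    return len(history)


gll_trace = [100.0, 50.0, 40.0, 39.9, 39.89]
stopped_early = iterations_run(gll_trace, gll_early_stop_threshold=0.01)
ran_to_end = iterations_run(gll_trace, gll_early_stop_threshold=None)
capped = iterations_run(gll_trace, max_iterations=3)
```

With the default gll_early_stop_threshold=None, only max_iterations limits the loop.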

fit(X: numpy.ndarray, Z: numpy.ndarray, clusters: pandas.core.series.Series, y: numpy.ndarray, X_val: numpy.ndarray = None, Z_val: numpy.ndarray = None, clusters_val: pandas.core.series.Series = None, y_val: numpy.ndarray = None)

Fit MERF using Expectation-Maximization algorithm.

Parameters

X (np.ndarray) – fixed effect covariates
Z (np.ndarray) – random effect covariates
clusters (pd.Series) – cluster assignments for samples
y (np.ndarray) – response/target variable
X_val (np.ndarray) – fixed effect covariates of the optional validation set
Z_val (np.ndarray) – random effect covariates of the optional validation set
clusters_val (pd.Series) – cluster assignments for the optional validation samples
y_val (np.ndarray) – response/target variable of the optional validation set

Returns

fitted model

Return type

MERF
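As a rough picture of what an Expectation-Maximization fit of this data model involves, the sketch below alternates between fitting the fixed effects on y with the current random effects removed and re-estimating b_i, sigma_e² and Sigma_b per cluster. It makes several assumptions: a linear least-squares model stands in for the random forest, in-sample fitted values are used, and the update formulas are the standard EM updates for a linear mixed model. It is not the library's implementation, and fit_em_sketch is a hypothetical name.

```python
import numpy as np


def fit_em_sketch(X, Z, clusters, y, max_iterations=20):
    n, q = Z.shape
    ids = np.unique(clusters)
    b_hat = {c: np.zeros(q) for c in ids}   # random effect estimates, start at 0
    sigma2_e = 1.0                          # noise variance
    Sigma_b = np.eye(q)                     # random effect covariance
    Xd = np.column_stack([np.ones(n), X])   # design matrix of the stand-in linear model

    for _ in range(max_iterations):
        # Remove the current random effect estimate from the target.
        y_star = y.astype(float).copy()
        for c in ids:
            idx = clusters == c
            y_star[idx] -= Z[idx] @ b_hat[c]

        # Fit the fixed effects model (stand-in for the random forest).
        beta, *_ = np.linalg.lstsq(Xd, y_star, rcond=None)
        f_hat = Xd @ beta

        # Update b_i per cluster and accumulate the variance updates.
        sigma2_sum = 0.0
        Sigma_sum = np.zeros((q, q))
        for c in ids:
            idx = clusters == c
            Zi, ni = Z[idx], int(idx.sum())
            Vi_inv = np.linalg.inv(Zi @ Sigma_b @ Zi.T + sigma2_e * np.eye(ni))
            bi = Sigma_b @ Zi.T @ Vi_inv @ (y[idx] - f_hat[idx])
            eps = y[idx] - f_hat[idx] - Zi @ bi
            sigma2_sum += eps @ eps + sigma2_e * (ni - sigma2_e * np.trace(Vi_inv))
            Sigma_sum += np.outer(bi, bi) + Sigma_b - Sigma_b @ Zi.T @ Vi_inv @ Zi @ Sigma_b
            b_hat[c] = bi
        sigma2_e = sigma2_sum / n
        Sigma_b = Sigma_sum / len(ids)

    return beta, b_hat, sigma2_e, Sigma_b


# Synthetic check: linear fixed effect plus a random intercept per cluster.
rng = np.random.default_rng(0)
k, n_i = 20, 30
clusters = np.repeat(np.arange(k), n_i)
X = rng.normal(size=(k * n_i, 1))
Z = np.ones((k * n_i, 1))
b_true = rng.normal(scale=2.0, size=k)
y = 2.0 * X[:, 0] + b_true[clusters] + rng.normal(scale=0.5, size=k * n_i)

beta, b_hat, sigma2_e, Sigma_b = fit_em_sketch(X, Z, clusters, y)
b_est = np.array([b_hat[c][0] for c in np.unique(clusters)])
```

On this synthetic data the estimated intercepts in b_est should track b_true closely, because each cluster has many samples relative to the noise level.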

get_bhat_history_df()

This function does a complicated reshape and re-indexing operation to get the list of dataframes for the b_hat_history into a multi-indexed dataframe. This dataframe is easier to work with in plotting utilities and other downstream analyses than the list of dataframes b_hat_history.

Parameters: b_hat_history (list) – list of dataframes of bhat at every iteration
Returns: multi-index dataframe with outer index as iteration, inner index as cluster
Return type: pd.DataFrame
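The reshape described above can be approximated with pandas alone. The sketch below stacks a hand-written list of per-iteration dataframes into one dataframe with an (iteration, cluster) multi-index using pd.concat. The column name b0, the cluster labels, and the values are made up for illustration, and the library may build its result differently.

```python
import pandas as pd

# Hypothetical b_hat_history: one dataframe per iteration, indexed by cluster.
b_hat_history = [
    pd.DataFrame({"b0": [0.0, 0.0]}, index=["cluster_a", "cluster_b"]),
    pd.DataFrame({"b0": [0.4, -0.3]}, index=["cluster_a", "cluster_b"]),
    pd.DataFrame({"b0": [0.5, -0.4]}, index=["cluster_a", "cluster_b"]),
]

# Outer index level: iteration number. Inner index level: cluster.
history_df = pd.concat(
    b_hat_history,
    keys=list(range(len(b_hat_history))),
    names=["iteration", "cluster"],
)

# Trajectory of one cluster's coefficient across iterations.
trace_a = history_df.xs("cluster_a", level="cluster")["b0"]
```

A frame in this shape can be sliced per cluster with xs, as in the last line, which is convenient for plotting one line per cluster.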

predict(X: numpy.ndarray, Z: numpy.ndarray, clusters: pandas.core.series.Series)

Predict using trained MERF. For known clusters the trained random effect correction is applied. For unknown clusters the pure fixed effect (RF) estimate is used.

Parameters

X (np.ndarray) – fixed effect covariates
Z (np.ndarray) – random effect covariates
clusters (pd.Series) – cluster assignments for samples

Returns

the predictions y_hat

Return type

np.ndarray
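The known/unknown cluster rule can be sketched in a few lines of NumPy. Here predict_sketch, the fixed_effects_predict callable, and the b_hat dictionary of learned coefficients are hypothetical names used for illustration; the sketch only mirrors the behaviour described above and is not the library's code.

```python
import numpy as np


def predict_sketch(fixed_effects_predict, b_hat, X, Z, clusters):
    """Fixed effect prediction plus Z b_i for clusters seen in training."""
    clusters = np.asarray(clusters)
    # Every sample starts from the pure fixed effect estimate.
    y_hat = np.asarray(fixed_effects_predict(X), dtype=float).copy()
    # Only clusters with a learned b_i receive the random effect correction.
    for cluster_id, b_i in b_hat.items():
        idx = clusters == cluster_id
        y_hat[idx] += Z[idx] @ np.asarray(b_i)
    return y_hat


X = np.array([[1.0], [2.0], [3.0]])
Z = np.ones((3, 1))
clusters = np.array([0, 1, 7])        # cluster 7 was never seen in training
b_hat = {0: [0.5], 1: [-1.0]}         # learned random intercepts per known cluster

y_hat = predict_sketch(lambda X: 2.0 * X[:, 0], b_hat, X, Z, clusters)
```

The third sample belongs to an unknown cluster, so its prediction is the fixed effect estimate alone.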

Utilities

Synthetic mixed-effects data generator.

class merf.utils.MERFDataGenerator(m, sigma_b, sigma_e)

Synthetic data generator class. It simulates samples y from K clusters according to the following equation.

\[ \begin{align}\begin{aligned}y_{ij} = m \cdot g(x_{ij}) + b_i + \epsilon_{ij}\\g(x_{ij}) = 2 x_{1ij} + x_{2ij}^2 + 4(x_{3ij} > 0) + 2 \log |x_{1ij}|x_{3ij}\\b_i \sim N(0, \sigma_b^2)\\\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)\\i = 1, ..., K\\j = 1, ..., n_i\end{aligned}\end{align} \]

Parameters

m (float) – scale parameter on fixed effect
sigma_b (float) – standard deviation of the random intercept b_i
sigma_e (float) – standard deviation of the noise
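The generating equations above translate directly into NumPy. The sketch below is an independent illustration, not the class itself: the helper names g and simulate are hypothetical, and drawing the features x from a standard normal distribution is an assumption, since the distribution of x is not stated here.

```python
import numpy as np


def g(X):
    # g(x) = 2 x1 + x2^2 + 4 (x3 > 0) + 2 log|x1| x3
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return 2 * x1 + x2 ** 2 + 4 * (x3 > 0) + 2 * np.log(np.abs(x1)) * x3


def simulate(m, sigma_b, sigma_e, n_samples_per_cluster, seed=0):
    rng = np.random.default_rng(seed)
    sizes = np.asarray(n_samples_per_cluster)
    K, n = len(sizes), int(sizes.sum())
    clusters = np.repeat(np.arange(K), sizes)   # cluster index i for each sample
    X = rng.normal(size=(n, 3))                 # assumption: standard normal features
    b = rng.normal(scale=sigma_b, size=K)       # b_i ~ N(0, sigma_b^2)
    eps = rng.normal(scale=sigma_e, size=n)     # eps_ij ~ N(0, sigma_e^2)
    y = m * g(X) + b[clusters] + eps            # y_ij = m g(x_ij) + b_i + eps_ij
    return y, X, clusters, b


y, X, clusters, b = simulate(m=0.5, sigma_b=1.0, sigma_e=0.1, n_samples_per_cluster=[3, 2])
g_check = g(np.array([[1.0, 2.0, 1.0]]))
```

For the point (1, 2, 1) the log term vanishes, so g evaluates to 2 + 4 + 4 = 10.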

static create_X_with_ohe_clusters(X, clusters, training_cluster_ids)

Helper function to join one hot encoded cluster ids with the feature matrix X.

Parameters

X (np.ndarray) – fixed effects feature matrix
clusters (np.ndarray) – array of cluster labels for data in X
training_cluster_ids – array of clusters in training data

Returns

X augmented with one hot encoded clusters

Return type

pd.DataFrame
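A minimal pandas sketch of this kind of augmentation is shown below. The function name augment_with_ohe and the column names X_0, X_1, cluster_0, ... are hypothetical; the helper's actual column naming is not documented here.

```python
import numpy as np
import pandas as pd


def augment_with_ohe(X, clusters, training_cluster_ids):
    clusters = np.asarray(clusters)
    training_cluster_ids = np.asarray(training_cluster_ids)
    # One indicator column per training cluster; unseen clusters match none of them.
    indicators = (clusters[:, None] == training_cluster_ids[None, :]).astype(int)
    ohe_df = pd.DataFrame(indicators, columns=[f"cluster_{c}" for c in training_cluster_ids])
    X_df = pd.DataFrame(X, columns=[f"X_{j}" for j in range(X.shape[1])])
    # Join features and indicators side by side.
    return pd.concat([X_df, ohe_df], axis=1)


X = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
augmented = augment_with_ohe(X, clusters=[0, 1, 9], training_cluster_ids=[0, 1])
```

The last row belongs to cluster 9, which is not a training cluster, so both of its indicator columns are zero.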

static create_cluster_sizes_array(sizes, num_clusters_per_size)

Helper function to create an array of cluster sizes.

Parameters

sizes (np.ndarray) – array of sizes
num_clusters_per_size (np.ndarray) – array of the number of clusters to make of each size

Returns

array of cluster sizes for all clusters

Return type

np.ndarray
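Read literally, the description above amounts to repeating each size as many times as there are clusters of that size. The sketch below shows that reading with np.repeat; cluster_sizes_sketch is a hypothetical name, and the ordering of the output is an assumption.

```python
import numpy as np


def cluster_sizes_sketch(sizes, num_clusters_per_size):
    # Repeat sizes[j] exactly num_clusters_per_size[j] times.
    return np.repeat(np.asarray(sizes), np.asarray(num_clusters_per_size)).astype(int)


# Two clusters of size 1, one of size 3, three of size 5.
cluster_sizes = cluster_sizes_sketch([1, 3, 5], [2, 1, 3])
```

An array in this form has one entry per cluster, which matches the per-cluster sample counts expected by generate_samples.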

generate_samples(n_samples_per_cluster)

Generate test data for the MERF algorithm.

Parameters

n_samples_per_cluster – array of numbers giving the number of samples to draw from each cluster

Returns

y (response)
X (fixed effect features)
Z (cluster assignment)
ptev (proportion of total effect variability)
prev (proportion of random effect variability)

Return type

tuple

generate_split_samples(n_training_per_cluster, n_test_known_per_cluster, n_test_new_per_cluster)

Generate samples split into training and two test sets.

Parameters

n_training_per_cluster –
n_test_known_per_cluster –
n_test_new_per_cluster –

Returns

training_data
known_cluster_test_data
new_cluster_test_data
training_cluster_ids
ptev
prev

Return type

tuple

static ohe_clusters(clusters, training_cluster_ids)

Helper function to one hot encode cluster ids based on training cluster ids. Note that for the “new” clusters this should encode to a matrix of all zeros.

Parameters

clusters (np.ndarray) – array of cluster labels for data
training_cluster_ids – array of clusters in training data

Returns

one hot encoded clusters

Return type

pd.DataFrame
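The encoding rule, including the all-zero rows for new clusters, can be illustrated with a NumPy broadcast comparison. This sketch returns a plain array rather than a dataframe, and ohe_clusters_sketch is a hypothetical name.

```python
import numpy as np


def ohe_clusters_sketch(clusters, training_cluster_ids):
    clusters = np.asarray(clusters)
    training_cluster_ids = np.asarray(training_cluster_ids)
    # Entry (r, c) is 1 when sample r belongs to training cluster c, else 0.
    return (clusters[:, None] == training_cluster_ids[None, :]).astype(int)


# Cluster 9 is "new": it does not appear among the training clusters.
encoded = ohe_clusters_sketch([0, 2, 9], [0, 1, 2])
```

Because the columns are defined by the training cluster ids only, a sample from a new cluster matches none of them and encodes to a row of zeros.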

Mixed Effects Random Forest plotting utilities.

merf.viz.plot_merf_training_stats(model, num_clusters_to_plot=5)

Plot training statistics for MERF model. This generates a plot that is rendered to the screen that has five components:

Generalized log-likelihood across iterations
trace and determinant of Sigma_b across iterations
sigma_e across iterations
b_i for num_clusters_to_plot example clusters across iterations
a histogram of the final learned b_i

Parameters

model (MERF) – trained MERF model
num_clusters_to_plot (int) – number of example b_i's to plot across iterations

Returns

figure. Also draws to display.

Return type

(matplotlib.pyplot.fig)
