MERF

Mixed Effects Random Forest model.

class merf.merf.MERF(fixed_effects_model=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False), gll_early_stop_threshold=None, max_iterations=20)

This is the core class to instantiate, train, and predict using a mixed effects random forest model. It roughly adheres to the sklearn estimator API. Note that the user must pass in an already instantiated fixed_effects_model that adheres to the sklearn regression estimator API, i.e. must have a fit() and predict() method defined.

It assumes a data model of the form:

\[y = f(X) + b_i Z + e\]
  • y is the target variable. The code currently supports regression only, i.e. a continuously varying scalar target.

  • X is the matrix of fixed effect features, assumed p-dimensional.

  • f(.) is the nonlinear fixed effects model, e.g. a random forest.

  • Z is the matrix of random effect features, assumed q-dimensional.

  • e is iid noise ~N(0, sigma_e²).

  • i is the cluster index. Assume k clusters in the training data.

  • b_i are the random effect coefficients. They differ per cluster i but are assumed to be drawn from the same distribution ~N(0, Sigma_b), where Sigma_b is learned from the data.
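To make the data model concrete, here is a minimal plain-Python sketch that simulates observations from it. It does not use the merf package. The nonlinear function f, the choice of q = 1 with Z fixed to 1 (a random intercept), and all numeric values are assumptions picked for illustration only.

```python
import math
import random


def f(x):
    """Hypothetical nonlinear fixed effect for a p = 2 feature vector."""
    return math.sin(x[0]) + x[1] ** 2


def simulate(k=3, n_per_cluster=4, sigma_b=1.0, sigma_e=0.1, seed=0):
    """Draw samples from y = f(X) + b_i * Z + e with a random intercept (Z = 1)."""
    rng = random.Random(seed)
    # One random effect coefficient per cluster, all drawn from N(0, sigma_b^2).
    b = {i: rng.gauss(0.0, sigma_b) for i in range(k)}
    rows = []
    for i in range(k):
        for _ in range(n_per_cluster):
            x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
            z = 1.0  # q = 1: random intercept only (assumption)
            e = rng.gauss(0.0, sigma_e)  # iid noise
            y = f(x) + b[i] * z + e
            rows.append({"cluster": i, "x": x, "z": z, "y": y})
    return b, rows


b, rows = simulate()
```

Every sample in a cluster shares the same b_i, so the residual y - f(x) within a cluster is centred on that cluster's coefficient rather than on zero.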

Parameters
  • fixed_effects_model (sklearn.base.RegressorMixin) – instantiated model class

  • gll_early_stop_threshold (float) – early stopping threshold on GLL improvement

  • max_iterations (int) – maximum number of EM iterations to train
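The interplay of max_iterations and gll_early_stop_threshold can be sketched as a training loop that stops at whichever limit is reached first. This is an illustrative sketch, not the package's implementation: the helper run_em is hypothetical, the GLL values are supplied by hand, and the relative-change stopping rule is an assumption about how "GLL improvement" might be measured.

```python
def run_em(gll_per_iteration, max_iterations=20, gll_early_stop_threshold=None):
    """Return the number of iterations run before stopping.

    gll_per_iteration stands in for the GLL an EM step would produce.
    The relative-change rule below is an assumption, not taken from merf.
    """
    gll_history = []
    for iteration, gll in enumerate(gll_per_iteration, start=1):
        gll_history.append(gll)
        if iteration >= max_iterations:
            break  # hard cap on the number of EM iterations
        if gll_early_stop_threshold is not None and len(gll_history) > 1:
            previous = gll_history[-2]
            relative_change = abs((gll - previous) / previous)
            if relative_change < gll_early_stop_threshold:
                break  # GLL has stopped improving meaningfully
    return len(gll_history)


gll_values = [100.0, 50.0, 49.9, 49.89]
stopped_early = run_em(gll_values, gll_early_stop_threshold=0.01)
ran_all = run_em(gll_values)
capped = run_em(gll_values, max_iterations=2)
```

With the default threshold of None, only max_iterations limits training.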

fit(X: numpy.ndarray, Z: numpy.ndarray, clusters: pandas.core.series.Series, y: numpy.ndarray, X_val: numpy.ndarray = None, Z_val: numpy.ndarray = None, clusters_val: pandas.core.series.Series = None, y_val: numpy.ndarray = None)

Fit MERF using Expectation-Maximization algorithm.

Parameters
  • X (np.ndarray) – fixed effect covariates

  • Z (np.ndarray) – random effect covariates

  • clusters (pd.Series) – cluster assignments for samples

  • y (np.ndarray) – response/target variable

Returns

fitted model

Return type

MERF

get_bhat_history_df()

This method reshapes and re-indexes b_hat_history, a list with one dataframe of b_hat per iteration, into a single multi-indexed dataframe. The result is easier to use in plotting utilities and other downstream analyses than the raw list of dataframes.

Parameters

b_hat_history (list) – list of dataframes of bhat at every iteration

Returns

multi-index dataframe with outer index as iteration, inner index as cluster

Return type

pd.DataFrame

predict(X: numpy.ndarray, Z: numpy.ndarray, clusters: pandas.core.series.Series)

Predict using trained MERF. For known clusters the trained random effect correction is applied. For unknown clusters the pure fixed effect (RF) estimate is used.

Parameters
  • X (np.ndarray) – fixed effect covariates

  • Z (np.ndarray) – random effect covariates

  • clusters (pd.Series) – cluster assignments for samples

Returns

the predictions y_hat

Return type

np.ndarray
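The prediction rule described above can be sketched in plain Python as follows. This illustrates the rule only and is not the package's code: predict_sketch, f_hat, and b_hat are hypothetical names, and the fixed effects model is replaced by a simple stand-in function.

```python
def predict_sketch(f_hat, b_hat, X, Z, clusters):
    """Combine fixed effect predictions with per-cluster random effects.

    f_hat: callable mapping one feature row to a fixed effect prediction
    b_hat: dict mapping known cluster id -> list of q random effect coefficients
    """
    y_hat = []
    for x_row, z_row, cluster in zip(X, Z, clusters):
        prediction = f_hat(x_row)
        if cluster in b_hat:
            # Known cluster: add the random effect correction z . b_i
            prediction += sum(z * b for z, b in zip(z_row, b_hat[cluster]))
        # Unknown cluster: keep the pure fixed effect estimate
        y_hat.append(prediction)
    return y_hat


def f_hat(x_row):
    """Stand-in for a trained fixed effects regressor (assumption)."""
    return 2.0 * x_row[0]


b_hat = {"a": [0.5], "b": [-1.0]}  # learned coefficients, q = 1
X = [[1.0], [1.0], [1.0]]
Z = [[1.0], [1.0], [1.0]]
clusters = ["a", "b", "new"]
y_hat = predict_sketch(f_hat, b_hat, X, Z, clusters)
```

The three rows have identical features, so the predictions differ only through the cluster-specific correction. The row labelled "new" receives none.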

Utilities

Synthetic mixed-effects data generator.

class merf.utils.MERFDataGenerator(m, sigma_b, sigma_e)

Synthetic data generator class. It simulates samples y from K clusters according to the following equations:

\[ \begin{align}\begin{aligned}y_{ij} = m \cdot g(x_{ij}) + b_i + \epsilon_{ij}\\g(x_{ij}) = 2 x_{1ij} + x_{2ij}^2 + 4(x_{3ij} > 0) + 2 \log |x_{1ij}|x_{3ij}\\b_i \sim N(0, \sigma_b^2)\\\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)\\i = 1, ..., K\\j = 1, ..., n_i\end{aligned}\end{align} \]
Parameters
  • m (float) – scale parameter on fixed effect

  • sigma_b (float) – standard deviation of the random intercept b_i

  • sigma_e (float) – standard deviation of the noise
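The generating equations can be written out directly in plain Python. The sketch below is independent of MERFDataGenerator. The helper sample_cluster is hypothetical, and the feature distribution (standard normal) is an assumption, since the equations above do not specify how x is drawn.

```python
import math
import random


def g(x1, x2, x3):
    """Nonlinear fixed effect from the generator equations."""
    indicator = 1.0 if x3 > 0 else 0.0
    return 2 * x1 + x2 ** 2 + 4 * indicator + 2 * math.log(abs(x1)) * x3


def sample_cluster(n_i, m, sigma_b, sigma_e, rng):
    """Draw n_i responses for one cluster that shares a single intercept b_i."""
    b_i = rng.gauss(0.0, sigma_b)
    samples = []
    for _ in range(n_i):
        # Assumption: features are standard normal; the source does not say.
        x = [rng.gauss(0.0, 1.0) for _ in range(3)]
        y = m * g(*x) + b_i + rng.gauss(0.0, sigma_e)
        samples.append((y, x))
    return b_i, samples


rng = random.Random(42)
b_i, samples = sample_cluster(n_i=5, m=0.8, sigma_b=0.9, sigma_e=1.0, rng=rng)
```

As a check on g, the point (1, 2, 1) gives 2 + 4 + 4 + 0 = 10, because log|1| = 0 removes the last term.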

static create_X_with_ohe_clusters(X, clusters, training_cluster_ids)

Helper function to join one hot encoded cluster ids with the feature matrix X.

Parameters
  • X (np.ndarray) – fixed effects feature matrix

  • clusters (np.ndarray) – array of cluster labels for data in X

  • training_cluster_ids – array of clusters in training data

Returns

X augmented with one hot encoded clusters

Return type

pd.DataFrame

static create_cluster_sizes_array(sizes, num_clusters_per_size)

Helper function to create an array of cluster sizes.

Parameters
  • sizes (np.ndarray) – array of sizes

  • num_clusters_per_size (np.ndarray) – array of the number of clusters to make of each size

Returns

array of cluster sizes for all clusters

Return type

np.ndarray
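One plausible reading of this helper, sketched in plain Python: each entry of sizes is repeated as many times as the matching entry of num_clusters_per_size. The function name cluster_sizes_array and this exact behavior are assumptions based on the parameter descriptions, not a copy of the library code.

```python
def cluster_sizes_array(sizes, num_clusters_per_size):
    """Repeat each cluster size once per cluster of that size."""
    cluster_sizes = []
    for size, count in zip(sizes, num_clusters_per_size):
        cluster_sizes.extend([int(size)] * int(count))
    return cluster_sizes


# Two clusters of size 1, one of size 3, three of size 5.
cluster_sizes = cluster_sizes_array([1, 3, 5], [2, 1, 3])
```

Under this reading the result has one entry per cluster, which is the shape that an argument such as n_samples_per_cluster would need.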

generate_samples(n_samples_per_cluster)

Generate test data for the MERF algorithm.

Parameters

n_samples_per_cluster – array giving the number of samples to draw from each cluster

Returns

  • y (response)

  • X (fixed effect features)

  • Z (cluster assignment)

  • ptev (proportion of total effect variability)

  • prev (proportion of random effect variability)

Return type

tuple

generate_split_samples(n_training_per_cluster, n_test_known_per_cluster, n_test_new_per_cluster)

Generate samples split into training and two test sets.

Parameters
  • n_training_per_cluster

  • n_test_known_per_cluster

  • n_test_new_per_cluster

Returns

  • training_data

  • known_cluster_test_data

  • new_cluster_test_data

  • training_cluster_ids

  • ptev

  • prev

Return type

tuple

static ohe_clusters(clusters, training_cluster_ids)

Helper function to one hot encode cluster ids based on the training cluster ids. Clusters that do not appear in the training data (“new” clusters) should encode to rows of all zeros.

Parameters
  • clusters (np.ndarray) – array of cluster labels for data

  • training_cluster_ids – array of clusters in training data

Returns

one hot encoded clusters

Return type

pd.DataFrame
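The encoding rule can be illustrated without pandas. In this minimal sketch the function one_hot_clusters is hypothetical and returns nested lists rather than a DataFrame. It shows only the property described above: the columns are fixed by the training cluster ids, so an unseen cluster matches no column.

```python
def one_hot_clusters(clusters, training_cluster_ids):
    """One-hot encode cluster labels against the clusters seen in training."""
    columns = list(training_cluster_ids)
    return [
        [1 if label == column else 0 for column in columns]
        for label in clusters
    ]


encoded = one_hot_clusters(["a", "c", "new"], ["a", "b", "c"])
```

Here the row for "new" is all zeros, while "a" and "c" each set exactly one column.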

Mixed Effects Random Forest plotting utilities.

merf.viz.plot_merf_training_stats(model, num_clusters_to_plot=5)

Plot training statistics for a MERF model. This generates a figure, rendered to the screen, with five components:

  • Generalized log-likelihood across iterations

  • trace and determinant of Sigma_b across iterations

  • sigma_e across iterations

  • b_i for num_clusters_to_plot clusters across iterations

  • a histogram of the final learned b_i

Parameters
  • model (MERF) – trained MERF model

  • num_clusters_to_plot (int) – number of example b_i to plot across iterations

Returns

figure. Also draws to display.

Return type

(matplotlib.pyplot.fig)
