MERF
Mixed Effects Random Forest model.
-
class merf.merf.MERF(fixed_effects_model=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False), gll_early_stop_threshold=None, max_iterations=20)
This is the core class for instantiating, training, and predicting with a mixed effects random forest model. It roughly adheres to the sklearn estimator API. The user must pass in an already instantiated fixed_effects_model that adheres to the sklearn regression estimator API, i.e. it must define fit() and predict() methods.
It assumes a data model of the form:
\[y = f(X) + b_i Z + e\]
y is the target variable. The current code supports regression only, i.e. a continuously varying scalar target.
X is the fixed effect features, assumed p-dimensional.
f(.) is the nonlinear fixed effects model, e.g. a random forest.
Z is the random effect features, assumed q-dimensional.
e is iid noise ~N(0, sigma_e²).
i is the cluster index. Assume k clusters in the training data.
b_i is the random effect coefficients. They differ per cluster i but are assumed to be drawn from the same distribution ~N(0, Sigma_b), where Sigma_b is learned from the data.
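The data model above can be made concrete with a small simulation. This is a minimal sketch, not part of the library: the function name simulate_cluster_data, the choice p = 2 and q = 1 (a random intercept, Z = 1), and the particular nonlinear f(.) are all assumptions made for illustration.

```python
import math
import random


def simulate_cluster_data(n_clusters, n_per_cluster, sigma_b, sigma_e, seed=0):
    """Draw samples from y = f(X) + b_i * Z + e with p = 2 and q = 1.

    Hypothetical helper for illustration only; it is not part of merf.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n_clusters):
        # One random effect coefficient per cluster: b_i ~ N(0, sigma_b^2).
        b_i = rng.gauss(0.0, sigma_b)
        for _ in range(n_per_cluster):
            # Fixed effect features X (p = 2).
            x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
            # Random effect feature Z (q = 1); Z = 1 gives a random intercept.
            z = 1.0
            # Illustrative nonlinear fixed effect f(X); any smooth function works.
            f_x = math.sin(3.0 * x[0]) + x[1] ** 2
            # iid noise e ~ N(0, sigma_e^2).
            e = rng.gauss(0.0, sigma_e)
            rows.append({"cluster": i, "X": x, "Z": z, "y": f_x + b_i * z + e})
    return rows
```

With sigma_b = 0 and sigma_e = 0 the target reduces to f(X) alone, which is a quick way to check that the three terms are combined as the equation states.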
- Parameters
fixed_effects_model (sklearn.base.RegressorMixin) – instantiated model class
gll_early_stop_threshold (float) – early stopping threshold on GLL improvement
max_iterations (int) – maximum number of EM iterations to train
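One way to read gll_early_stop_threshold is as a bound on the relative change in the generalized log-likelihood (GLL) between consecutive EM iterations. The sketch below assumes that interpretation; the function name should_stop_early is hypothetical, and the rule the library actually applies may differ.

```python
def should_stop_early(gll_history, threshold):
    """Return True when the relative GLL change between the last two
    iterations falls below threshold.

    Assumed stopping rule for illustration; not taken from the library.
    """
    # No threshold configured, or not enough history to compare.
    if threshold is None or len(gll_history) < 2:
        return False
    previous, current = gll_history[-2], gll_history[-1]
    # Avoid dividing by zero; a relative change is undefined here.
    if previous == 0:
        return False
    return abs((current - previous) / previous) < threshold
```

Under this reading, training runs for at most max_iterations and stops sooner once the check returns True.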
-
fit(X: numpy.ndarray, Z: numpy.ndarray, clusters: pandas.core.series.Series, y: numpy.ndarray, X_val: numpy.ndarray = None, Z_val: numpy.ndarray = None, clusters_val: pandas.core.series.Series = None, y_val: numpy.ndarray = None)
Fit MERF using the Expectation-Maximization algorithm.
- Parameters
X (np.ndarray) – fixed effect covariates
Z (np.ndarray) – random effect covariates
clusters (pd.Series) – cluster assignments for samples
y (np.ndarray) – response/target variable
- Returns
fitted model
- Return type
-
get_bhat_history_df()
Reshape and re-index the list of dataframes in b_hat_history into a single multi-indexed dataframe. This dataframe is easier to work with in plotting utilities and other downstream analyses than the raw list of dataframes.
- Parameters
b_hat_history (list) – list of dataframes of bhat at every iteration
- Returns
multi-index dataframe with outer index as iteration, inner index as cluster
- Return type
pd.DataFrame
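The reshape described above can be approximated with pandas.concat. This is a sketch under assumptions, not the library's implementation: the function name stack_history is hypothetical, and each per-iteration dataframe is assumed to be indexed by cluster id with one column per random effect coefficient.

```python
import pandas as pd


def stack_history(b_hat_history):
    """Stack per-iteration frames (indexed by cluster) into one frame with
    an (iteration, cluster) multi-index.

    Hypothetical helper for illustration only.
    """
    # keys labels each frame with its iteration number; names labels
    # the two resulting index levels.
    return pd.concat(
        b_hat_history,
        keys=list(range(len(b_hat_history))),
        names=["iteration", "cluster"],
    )


# Two iterations, two clusters (ids 10 and 20), one random intercept column.
history = [
    pd.DataFrame({"b0": [0.0, 0.0]}, index=[10, 20]),
    pd.DataFrame({"b0": [0.5, -0.3]}, index=[10, 20]),
]
stacked = stack_history(history)
```

Selecting stacked.loc[(1, 20), "b0"] then reads the coefficient for cluster 20 at iteration 1.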
-
predict(X: numpy.ndarray, Z: numpy.ndarray, clusters: pandas.core.series.Series)
Predict using trained MERF. For known clusters the trained random effect correction is applied. For unknown clusters the pure fixed effect (RF) estimate is used.
- Parameters
X (np.ndarray) – fixed effect covariates
Z (np.ndarray) – random effect covariates
clusters (pd.Series) – cluster assignments for samples
- Returns
the predictions y_hat
- Return type
np.ndarray
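The known versus unknown cluster rule can be sketched in plain Python. The function name combine_predictions and the dictionary b_hat (mapping each training cluster id to its learned coefficients) are assumptions for illustration; f_hat stands for the output of the fixed effects model's predict().

```python
def combine_predictions(f_hat, Z, clusters, b_hat):
    """Return y_hat = f_hat + Z . b_i for clusters seen in training,
    and y_hat = f_hat for clusters that were not.

    Hypothetical helper for illustration only.
    """
    y_hat = []
    for f_x, z_row, cluster in zip(f_hat, Z, clusters):
        if cluster in b_hat:
            # Known cluster: apply the learned random effect correction.
            correction = sum(z * b for z, b in zip(z_row, b_hat[cluster]))
        else:
            # Unknown cluster: fall back to the fixed effect estimate alone.
            correction = 0.0
        y_hat.append(f_x + correction)
    return y_hat
```

For example, with b_hat = {"a": [0.5], "b": [-1.0]} and a random intercept (Z = 1), a sample from cluster "a" is shifted by 0.5 while a sample from an unseen cluster keeps its fixed effect prediction unchanged.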
Utilities
Synthetic mixed-effects data generator.
-
class merf.utils.MERFDataGenerator(m, sigma_b, sigma_e)
Synthetic data generator class. It simulates samples y from K clusters according to the following equations.
\[
\begin{aligned}
y_{ij} &= m \cdot g(x_{ij}) + b_i + \epsilon_{ij}\\
g(x_{ij}) &= 2 x_{1ij} + x_{2ij}^2 + 4(x_{3ij} > 0) + 2 \log |x_{1ij}|x_{3ij}\\
b_i &\sim N(0, \sigma_b^2)\\
\epsilon_{ij} &\sim N(0, \sigma_\epsilon^2)\\
i &= 1, ..., K\\
j &= 1, ..., n_i
\end{aligned}
\]
- Parameters
m (float) – scale parameter on fixed effect
sigma_b (float) – standard deviation of the random intercept b_i
sigma_e (float) – standard deviation of the noise
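The generating equations can be evaluated directly for a single sample. This is a minimal sketch, not the generator's code: the names g and sample_y are hypothetical, and the last term of g is read as 2 · log|x1| · x3, which is an assumption about the intended grouping.

```python
import math
import random


def g(x1, x2, x3):
    """Nonlinear fixed effect from the equations above.

    Assumes the last term is 2 * log|x1| * x3; x1 must be nonzero.
    """
    return 2 * x1 + x2 ** 2 + 4 * (x3 > 0) + 2 * math.log(abs(x1)) * x3


def sample_y(m, b_i, sigma_e, x, rng):
    """Draw y_ij = m * g(x_ij) + b_i + eps_ij with eps_ij ~ N(0, sigma_e^2)."""
    return m * g(*x) + b_i + rng.gauss(0.0, sigma_e)


rng = random.Random(0)
# With sigma_e = 0 the noise term vanishes and y is deterministic.
y = sample_y(m=0.5, b_i=1.0, sigma_e=0.0, x=(1.0, 2.0, 3.0), rng=rng)
```

Here g(1, 2, 3) = 2 + 4 + 4 + 0 = 10, so y = 0.5 · 10 + 1 = 6.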
-
static create_X_with_ohe_clusters(X, clusters, training_cluster_ids)
Helper function to join one hot encoded cluster ids with the feature matrix X.
- Parameters
X (np.ndarray) – fixed effects feature matrix
clusters (np.ndarray) – array of cluster labels for data in X
training_cluster_ids – array of clusters in training data
- Returns
X augmented with one hot encoded clusters
- Return type
pd.DataFrame
-
static create_cluster_sizes_array(sizes, num_clusters_per_size)
Helper function to create an array of cluster sizes.
- Parameters
sizes (np.ndarray) – array of sizes
num_clusters_per_size (np.ndarray) – array of the number of clusters to make of each size
- Returns
array of cluster sizes for all clusters
- Return type
np.ndarray
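One plausible reading of this helper is that each size is repeated once per cluster of that size. The sketch below assumes that reading and uses plain lists; the function name expand_cluster_sizes is hypothetical, and for NumPy arrays numpy.repeat(sizes, num_clusters_per_size) gives the same expansion.

```python
def expand_cluster_sizes(sizes, num_clusters_per_size):
    """Repeat each size once per cluster of that size.

    Assumed behavior for illustration; not taken from the library.
    """
    return [
        size
        for size, count in zip(sizes, num_clusters_per_size)
        for _ in range(count)
    ]


# Two clusters of size 1, one of size 3, three of size 5.
cluster_sizes = expand_cluster_sizes([1, 3, 5], [2, 1, 3])
```

The result has one entry per cluster, so its length equals the sum of num_clusters_per_size.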
-
generate_samples(n_samples_per_cluster)
Generate test data for the MERF algorithm.
- Parameters
n_samples_per_cluster – array giving the number of samples to draw from each cluster
- Returns
y (response)
X (fixed effect features)
Z (cluster assignment)
ptev (proportion of total effect variability)
prev (proportion of random effect variability)
- Return type
tuple
-
generate_split_samples(n_training_per_cluster, n_test_known_per_cluster, n_test_new_per_cluster)
Generate samples split into training and two test sets.
- Parameters
n_training_per_cluster –
n_test_known_per_cluster –
n_test_new_per_cluster –
- Returns
training_data
known_cluster_test_data
new_cluster_test_data
training_cluster_ids
ptev
prev
- Return type
tuple
-
static ohe_clusters(clusters, training_cluster_ids)
Helper function to one hot encode cluster ids based on training cluster ids. Note that for the “new” clusters this should encode to a matrix of all zeros.
- Parameters
clusters (np.ndarray) – array of cluster labels for data
training_cluster_ids – array of clusters in training data
- Returns
one hot encoded clusters
- Return type
pd.DataFrame
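The all-zeros behavior for new clusters follows from building one column per training cluster id only. This is a sketch of that idea using nested lists rather than a dataframe; the function name one_hot_clusters is hypothetical.

```python
def one_hot_clusters(clusters, training_cluster_ids):
    """One hot encode cluster labels against the training cluster ids.

    A label that is not among the training ids matches no column, so its
    row is all zeros. Hypothetical helper for illustration only.
    """
    columns = list(training_cluster_ids)
    return [
        [1 if cluster == column else 0 for column in columns]
        for cluster in clusters
    ]


# Cluster 7 was not seen in training, so its row is all zeros.
encoded = one_hot_clusters([0, 2, 7], [0, 1, 2])
```

Fixing the columns to the training ids keeps the encoded width identical between training and test data, which matters when the encoding is joined onto X for a model trained on the training columns.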
Mixed Effects Random Forest plotting utilities.
-
merf.viz.plot_merf_training_stats(model, num_clusters_to_plot=5)
Plot training statistics for a MERF model. This generates a plot, rendered to the screen, that has five components:
Generalized log-likelihood across iterations
trace and determinant of Sigma_b across iterations
sigma_e across iterations
b_i for num_clusters_to_plot clusters across iterations
a histogram of the final learned b_i
- Parameters
model (MERF) – trained MERF model
num_clusters_to_plot (int) – number of example b_i to plot across iterations
- Returns
figure. Also draws to display.
- Return type
(matplotlib.pyplot.fig)