Lazy loading ensembles¶
When each ensemble contains a huge number of models, or the number of ensembles is large (e.g., > 50), it might not be viable to hold all ensembles in memory. In that case, we can ask the model to save each ensemble to disk and only load it when it is actually used. This can dramatically reduce memory usage, but it also slows down prediction, since disk I/O (reading and writing between disk and memory) becomes the bottleneck.
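Switching lazy loading on is a single constructor argument. Below is a minimal sketch, not the configuration used in this notebook (the fully commented setup appears in the experiment code further down); the plain XGBRegressor base model and the parameter values here are placeholders for brevity.
from xgboost import XGBRegressor
from stemflow.model.AdaSTEM import AdaSTEMRegressor

model = AdaSTEMRegressor(
    base_model=XGBRegressor(tree_method='hist', random_state=42, n_jobs=1),
    ensemble_fold=50,    # many ensembles -> lazy loading becomes attractive
    lazy_loading=True,   # each trained ensemble is written to disk and re-loaded only when needed
)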
We run experiments to see how this speed-memory trade-off works:
import os
import sys
import pandas as pd
import numpy as np
import random
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import matplotlib
import warnings
import pickle
import h3pandas
import copy
pd.set_option('display.max_columns', None)
# warnings.filterwarnings('ignore')
%load_ext autoreload
%autoreload 2
Download data¶
Training/test data¶
Please download the sample data from:
Suppose now it's downloaded and saved as './Sample_data_Mallard.csv'
Alternatively, you can try other species like
- Alder Flycatcher: https://figshare.com/articles/dataset/Sample_data_Alder_Flycatcher_csv/24080751
- Short-eared Owl: https://figshare.com/articles/dataset/Sample_data_Short-eared_Owl_csv/24080742
- Eurasian Tree Sparrow: https://figshare.com/articles/dataset/Sample_data_Eurasian_Tree_Sparrow_csv/24080748
Caveat: each of these bird observation files is about 200 MB.
data = pd.read_csv(f'./Sample_data_Mallard.csv')
data = data.drop('sampling_event_identifier', axis=1)
data = data.sample(frac=0.7)
Prediction set¶
The prediction set is fed into a trained AdaSTEM model to make predictions: at a given location, on a given day of year, and with the given environmental variables, how many Mallard individuals do I expect to observe?
The prediction set will be loaded after the model is trained.
Download the prediction set from: https://figshare.com/articles/dataset/Predset_2020_csv/24124980
Caveat: The file is about 700MB.
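For reference, once the model below is trained, the prediction set is used roughly as follows. This is a sketch only; it assumes the file was saved as './Predset_2020.csv' and already carries the same feature columns the model was trained on.
pred_set = pd.read_csv('./Predset_2020.csv')
pred = model.predict(pred_set)  # with lazy_loading=True, ensembles are read back from disk one by one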
Get X and y¶
X = data.drop('count', axis=1)
y = data['count'].values
X.head()
longitude | latitude | DOY | duration_minutes | Traveling | Stationary | Area | effort_distance_km | number_observers | obsvr_species_count | time_observation_started_minute_of_day | elevation_mean | slope_mean | eastness_mean | northness_mean | bio1 | bio2 | bio3 | bio4 | bio5 | bio6 | bio7 | bio8 | bio9 | bio10 | bio11 | bio12 | bio13 | bio14 | bio15 | bio16 | bio17 | bio18 | bio19 | closed_shrublands | cropland_or_natural_vegetation_mosaics | croplands | deciduous_broadleaf_forests | deciduous_needleleaf_forests | evergreen_broadleaf_forests | evergreen_needleleaf_forests | grasslands | mixed_forests | non_vegetated_lands | open_shrublands | permanent_wetlands | savannas | urban_and_built_up_lands | water_bodies | woody_savannas | entropy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
397682 | -121.243566 | 44.722404 | 73 | 20.0 | 0 | 1 | 0 | -1.00 | 3.0 | 675.0 | 946 | 528.666700 | 6.930458 | -0.004461 | -0.003129 | 12.254499 | 11.222793 | 36.556190 | 722.291583 | 30.786013 | 0.085896 | 30.700117 | 5.900058 | 22.592101 | 22.036626 | 4.241771 | 0.054554 | 0.012744 | 0.000037 | 0.000017 | 0.032447 | 0.001068 | 0.001068 | 0.022911 | 0.0 | 0.000000 | 0.888889 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.083333 | 0.0 | 0.027778 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.411314 |
52761 | -84.830546 | 10.262687 | 202 | 15.0 | 0 | 1 | 0 | -1.00 | 1.0 | 279.0 | 823 | 926.777800 | 11.270809 | -0.000824 | 0.026607 | 22.108191 | 6.112472 | 69.367717 | 87.154508 | 27.772561 | 18.960866 | 8.811695 | 20.835122 | 23.067510 | 23.229957 | 20.984801 | 0.247798 | 0.053908 | 0.000343 | 0.000307 | 0.135164 | 0.003454 | 0.005471 | 0.099995 | 0.0 | 0.000000 | 0.000000 | 0.111111 | 0.0 | 0.083333 | 0.0 | 0.027778 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.388889 | 0.000000 | 0.0 | 0.388889 | 1.285335 |
102324 | -84.410191 | 42.625370 | 138 | 41.0 | 0 | 1 | 0 | -1.00 | 1.0 | 379.0 | 1073 | 271.000000 | 0.411284 | 0.192033 | -0.162664 | 10.530639 | 7.988692 | 22.794835 | 890.107596 | 29.042612 | -6.003451 | 35.046063 | -0.510486 | 24.446625 | 22.611013 | -0.829569 | 0.115785 | 0.016661 | 0.005342 | 0.000015 | 0.047077 | 0.017823 | 0.021165 | 0.029807 | 0.0 | 0.444444 | 0.555556 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.686962 |
198809 | -123.112206 | 49.241944 | 172 | 9.0 | 1 | 0 | 0 | 0.32 | 2.0 | 222.0 | 976 | 61.833332 | 1.856077 | 0.031009 | 0.402949 | 10.378653 | 6.568099 | 31.841727 | 561.138541 | 22.256131 | 1.628799 | 20.627332 | 3.524946 | 17.947798 | 17.871489 | 3.964883 | 0.315263 | 0.070945 | 0.006232 | 0.000348 | 0.164123 | 0.023907 | 0.033864 | 0.148501 | 0.0 | 0.055556 | 0.583333 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.055556 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.083333 | 0.222222 | 0.0 | 0.000000 | 1.176882 |
318693 | -96.888985 | 42.779600 | 259 | 45.0 | 0 | 1 | 0 | -1.00 | 2.0 | 670.0 | 981 | 376.666660 | 0.206779 | 0.003448 | 0.044174 | 11.011082 | 9.531909 | 24.486426 | 1033.230340 | 30.444804 | -8.482515 | 38.927319 | 14.754586 | -1.147955 | 25.053628 | -2.344446 | 0.073579 | 0.015177 | 0.001466 | 0.000021 | 0.038704 | 0.005140 | 0.025476 | 0.006744 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | -0.000000 |
The features include:
- Spatial coordinates: longitude and latitude (used for indexing, not actual training)
- Temporal coordinate: day of year (DOY), used for both indexing and training
- Sampling parameters, quantifying how the observation was made:
    - duration_minutes: how long the observation was conducted
    - Observation protocol: Traveling, Stationary, or Area
    - effort_distance_km: how far one traveled
    - number_observers: how many observers were in the group
    - obsvr_species_count: how many bird species the observer has recorded in the past
    - time_observation_started_minute_of_day: when the observer started birding
- Topological features:
    - Features of elevation: elevation_mean
    - Features of slope magnitude and direction: slope_mean, eastness_mean, northness_mean
- Bioclimate features:
    - Summaries of yearly temperature and precipitation: bio1 to bio19
- Land cover features:
    - Percentage of cover for each land cover class, for example closed_shrublands and urban_and_built_up_lands
    - entropy: entropy of land cover
As you can see, the environmental variables are almost static. However, dynamic features (e.g., daily temperature) are fully supported as input. See Tips for data types for details.
First things first: spatiotemporal train-test split¶
from stemflow.model_selection import ST_train_test_split
X_train, X_test, y_train, y_test = ST_train_test_split(
    X, y,
    Spatio1='longitude',
    Spatio2='latitude',
    Temporal1='DOY',
    Spatio_blocks_count=50,
    Temporal_blocks_count=50,
    random_state=42,
    test_size=0.3
)
Train AdaSTEM hurdle model¶
from stemflow.model.AdaSTEM import AdaSTEM, AdaSTEMClassifier, AdaSTEMRegressor
from xgboost import XGBClassifier, XGBRegressor # remember to install xgboost if you use it as base model
from stemflow.model.Hurdle import Hurdle_for_AdaSTEM, Hurdle
import time
from memory_profiler import memory_usage
def make_model_lazyloading(ensemble_fold):
    model_lazyloading = AdaSTEMRegressor(
        base_model=Hurdle(
            classifier=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
            regressor=XGBRegressor(tree_method='hist', random_state=42, verbosity=0, n_jobs=1)
        ),                                      # hurdle model for zero-inflated problems (e.g., count data)
        save_gridding_plot=True,
        ensemble_fold=ensemble_fold,            # data are modeled ensemble_fold times, each time with jitter and rotation in the Quadtree algo
        min_ensemble_required=ensemble_fold-2,  # only points covered by at least ensemble_fold-2 ensembles will be predicted
        grid_len_upper_threshold=25,            # force splitting if the grid length exceeds 25
        grid_len_lower_threshold=5,             # stop splitting if the grid length falls below 5
        temporal_start=1,                       # the next 4 params define the temporal sliding window
        temporal_end=366,
        temporal_step=25,                       # the window takes steps of 25 DOY (see the AdaSTEM demo for details)
        temporal_bin_interval=50,               # each window will contain data of 50 DOY
        points_lower_threshold=50,              # only stixels with more than 50 samples are trained
        Spatio1='longitude',                    # the next three params give the names of the
        Spatio2='latitude',                     # spatiotemporal coordinates in the dataframe
        Temporal1='DOY',
        use_temporal_to_train=True,             # in each stixel, whether 'DOY' should be a predictor
        n_jobs=1,                               # not using parallel computing
        random_state=42,                        # the random state makes the gridding process reproducible
        lazy_loading=True,                      # use lazy loading for large ensemble counts (e.g., > 20 ensembles):
        verbosity=1                             # -- each trained ensemble is saved to disk and only loaded when needed (e.g., for prediction)
    )
    return model_lazyloading
def make_model_not_lazyloading(ensemble_fold):
    model_not_lazyloading = AdaSTEMRegressor(
        base_model=Hurdle(
            classifier=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
            regressor=XGBRegressor(tree_method='hist', random_state=42, verbosity=0, n_jobs=1)
        ),                                      # hurdle model for zero-inflated problems (e.g., count data)
        save_gridding_plot=True,
        ensemble_fold=ensemble_fold,            # data are modeled ensemble_fold times, each time with jitter and rotation in the Quadtree algo
        min_ensemble_required=ensemble_fold-2,  # only points covered by at least ensemble_fold-2 ensembles will be predicted
        grid_len_upper_threshold=25,            # force splitting if the grid length exceeds 25
        grid_len_lower_threshold=5,             # stop splitting if the grid length falls below 5
        temporal_start=1,                       # the next 4 params define the temporal sliding window
        temporal_end=366,
        temporal_step=25,                       # the window takes steps of 25 DOY (see the AdaSTEM demo for details)
        temporal_bin_interval=50,               # each window will contain data of 50 DOY
        points_lower_threshold=50,              # only stixels with more than 50 samples are trained
        Spatio1='longitude',                    # the next three params give the names of the
        Spatio2='latitude',                     # spatiotemporal coordinates in the dataframe
        Temporal1='DOY',
        use_temporal_to_train=True,             # in each stixel, whether 'DOY' should be a predictor
        n_jobs=1,                               # not using parallel computing
        random_state=42,                        # the random state makes the gridding process reproducible
        lazy_loading=False,                     # no lazy loading: all trained ensembles are kept in memory
        verbosity=1
    )
    return model_not_lazyloading
def train_model(model, X_train, y_train):
    start_time = time.time()

    def run_model(model=model, X_train=X_train, y_train=y_train):
        model.fit(X_train, y_train)
        return

    # memory_profiler samples memory usage while run_model executes
    memory_use = memory_usage((run_model, (model, X_train, y_train, )))
    memory_use = [i/1024 for i in memory_use]  # in GB
    peak_use = np.max(memory_use)
    average_use = np.mean(memory_use)
    end_time = time.time()
    training_time_use = end_time - start_time
    print('Training finish!')
    return training_time_use, peak_use, average_use
def test_model(model, X_test, y_test):
    start_time = time.time()

    def run_model(model=model, X_test=X_test):
        model.predict(X_test)
        return

    # memory_profiler samples memory usage while run_model executes
    memory_use = memory_usage((run_model, (model, X_test, )))
    memory_use = [i/1024 for i in memory_use]  # in GB
    peak_use = np.max(memory_use)
    average_use = np.mean(memory_use)
    end_time = time.time()
    prediction_time_use = end_time - start_time
    print('Prediction finish!')
    return prediction_time_use, peak_use, average_use
Run test¶
log_list = []
ensemble_fold_list = [3, 5, 10, 20, 30, 40]
for ensemble_fold in tqdm(ensemble_fold_list):
    model_not_lazyloading = make_model_not_lazyloading(ensemble_fold)
    train_time_use, peak_train_memory_use, average_train_memory_use = train_model(model_not_lazyloading, X_train, y_train)
    test_time_use, peak_test_memory_use, average_test_memory_use = test_model(model_not_lazyloading, X_test, y_test)
    log_list.append({
        'ensemble_fold': ensemble_fold,
        'lazy_loading': False,
        'train_time_use': train_time_use,
        'peak_train_memory_use': peak_train_memory_use,
        'average_train_memory_use': average_train_memory_use,
        'test_time_use': test_time_use,
        'peak_test_memory_use': peak_test_memory_use,
        'average_test_memory_use': average_test_memory_use
    })
    del model_not_lazyloading

    model_lazyloading = make_model_lazyloading(ensemble_fold)
    train_time_use, peak_train_memory_use, average_train_memory_use = train_model(model_lazyloading, X_train, y_train)
    test_time_use, peak_test_memory_use, average_test_memory_use = test_model(model_lazyloading, X_test, y_test)
    log_list.append({
        'ensemble_fold': ensemble_fold,
        'lazy_loading': True,
        'train_time_use': train_time_use,
        'peak_train_memory_use': peak_train_memory_use,
        'average_train_memory_use': average_train_memory_use,
        'test_time_use': test_time_use,
        'peak_test_memory_use': peak_test_memory_use,
        'average_test_memory_use': average_test_memory_use
    })
    del model_lazyloading
0%| | 0/6 [00:00<?, ?it/s]
Generating Ensemble: 100%|██████████| 3/3 [00:03<00:00, 1.13s/it] Training: 100%|██████████| 3/3 [02:00<00:00, 40.32s/it]
Training finish!
Predicting: 100%|██████████| 3/3 [00:03<00:00, 1.12s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 3/3 [00:03<00:00, 1.13s/it] Training: 100%|██████████| 3/3 [02:03<00:00, 41.10s/it]
Training finish!
Predicting: 100%|██████████| 3/3 [00:06<00:00, 2.18s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 5/5 [00:05<00:00, 1.13s/it] Training: 100%|██████████| 5/5 [03:28<00:00, 41.76s/it]
Training finish!
Predicting: 100%|██████████| 5/5 [00:05<00:00, 1.13s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 5/5 [00:05<00:00, 1.14s/it] Training: 100%|██████████| 5/5 [03:30<00:00, 42.10s/it]
Training finish!
Predicting: 100%|██████████| 5/5 [00:10<00:00, 2.12s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 10/10 [00:11<00:00, 1.15s/it] Training: 100%|██████████| 10/10 [06:53<00:00, 41.31s/it]
Training finish!
Predicting: 100%|██████████| 10/10 [00:11<00:00, 1.15s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 10/10 [00:11<00:00, 1.20s/it] Training: 100%|██████████| 10/10 [06:55<00:00, 41.52s/it]
Training finish!
Predicting: 100%|██████████| 10/10 [00:21<00:00, 2.16s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 20/20 [00:23<00:00, 1.19s/it] Training: 100%|██████████| 20/20 [13:45<00:00, 41.25s/it]
Training finish!
Predicting: 100%|██████████| 20/20 [00:22<00:00, 1.12s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 20/20 [00:22<00:00, 1.15s/it] Training: 100%|██████████| 20/20 [13:52<00:00, 41.63s/it]
Training finish!
Predicting: 100%|██████████| 20/20 [00:43<00:00, 2.19s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 30/30 [00:36<00:00, 1.21s/it] Training: 100%|██████████| 30/30 [20:39<00:00, 41.31s/it]
Training finish!
Predicting: 100%|██████████| 30/30 [00:34<00:00, 1.14s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 30/30 [00:37<00:00, 1.25s/it] Training: 100%|██████████| 30/30 [20:51<00:00, 41.73s/it]
Training finish!
Predicting: 100%|██████████| 30/30 [01:03<00:00, 2.13s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 40/40 [00:50<00:00, 1.26s/it] Training: 100%|██████████| 40/40 [27:14<00:00, 40.87s/it]
Training finish!
Predicting: 100%|██████████| 40/40 [00:44<00:00, 1.12s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 40/40 [00:48<00:00, 1.20s/it] Training: 100%|██████████| 40/40 [27:34<00:00, 41.36s/it]
Training finish!
Predicting: 100%|██████████| 40/40 [01:25<00:00, 2.15s/it]
Prediction finish!
log_df = pd.DataFrame(log_list)
log_df.to_csv('tmp_log_df.csv', index=False)
log_df
ensemble_fold | lazy_loading | train_time_use | peak_train_memory_use | average_train_memory_use | test_time_use | peak_test_memory_use | average_test_memory_use | |
---|---|---|---|---|---|---|---|---|
0 | 3 | False | 124.643243 | 2.142380 | 1.776680 | 3.752282 | 2.219360 | 2.211794 |
1 | 3 | True | 127.121982 | 2.891403 | 2.499677 | 6.961078 | 3.016815 | 2.899677 |
2 | 5 | False | 214.892371 | 3.385010 | 3.014771 | 6.143069 | 3.386505 | 3.385702 |
3 | 5 | True | 216.611525 | 3.400070 | 3.321900 | 11.085089 | 3.239746 | 2.909443 |
4 | 10 | False | 425.086186 | 3.564209 | 3.024224 | 11.998664 | 2.775177 | 2.343333 |
5 | 10 | True | 427.636022 | 3.242584 | 2.792008 | 22.192055 | 3.300964 | 3.262257 |
6 | 20 | False | 849.239290 | 3.253647 | 2.356456 | 22.949365 | 4.707184 | 4.254614 |
7 | 20 | True | 856.095079 | 3.199005 | 2.433513 | 44.455940 | 2.685013 | 2.607614 |
8 | 30 | False | 1276.125569 | 3.290817 | 2.319885 | 34.866566 | 6.339859 | 4.900478 |
9 | 30 | True | 1289.964938 | 3.367615 | 2.650008 | 64.755426 | 2.727402 | 2.603010 |
10 | 40 | False | 1685.935830 | 3.607254 | 2.748954 | 45.656563 | 8.115158 | 5.828427 |
11 | 40 | True | 1703.282001 | 4.484177 | 3.213610 | 86.798334 | 2.951721 | 2.789791 |
Plotting experiment results¶
fig, ax = plt.subplots(2, 3, figsize=(15, 9))

for var_id, var_ in enumerate(['train_time_use', 'peak_train_memory_use', 'average_train_memory_use',
                               'test_time_use', 'peak_test_memory_use', 'average_test_memory_use']):
    ax[var_id//3, var_id%3].plot(
        log_df[log_df['lazy_loading']==False]['ensemble_fold'],
        log_df[log_df['lazy_loading']==False][var_],
        label='non-lazy'
    )
    ax[var_id//3, var_id%3].scatter(
        log_df[log_df['lazy_loading']==False]['ensemble_fold'],
        log_df[log_df['lazy_loading']==False][var_],
    )
    ax[var_id//3, var_id%3].plot(
        log_df[log_df['lazy_loading']==True]['ensemble_fold'],
        log_df[log_df['lazy_loading']==True][var_],
        label='lazy'
    )
    ax[var_id//3, var_id%3].scatter(
        log_df[log_df['lazy_loading']==True]['ensemble_fold'],
        log_df[log_df['lazy_loading']==True][var_],
    )
    ax[var_id//3, var_id%3].legend()
    ax[var_id//3, var_id%3].set_title(var_)
    if 'time' in var_:
        ax[var_id//3, var_id%3].set_ylabel('Seconds')
    elif 'memory' in var_:
        ax[var_id//3, var_id%3].set_ylabel('GB')
    ax[var_id//3, var_id%3].set_xlabel('Ensemble fold')

plt.subplots_adjust(wspace=0.2, hspace=0.3)
Speed vs. memory usage¶
From the results, the trade-off is clear. Using lazy loading:
- has little impact on training speed;
- has a large impact on testing (prediction) speed: prediction takes more than twice as long in our case;
- keeps memory use roughly constant as the ensemble fold increases (around 3 GB in our case), whereas without lazy loading the peak prediction memory grows roughly linearly with the number of ensembles.
Save model¶
If you are using lazy_loading=True, try:
model.save(tar_gz_file='./my_model.tar.gz', remove_temporary_file=True)
# After removing temporary files, you can no longer access the ensembles saved on disk!
# Instead, they are in the .tar.gz file now.
# load model
model = AdaSTEM.load(tar_gz_file='./my_model.tar.gz', new_lazy_loading_path='./new_lazyloading_ensemble_folder', remove_original_file=False)
# if new_lazy_loading_path is not specified, a random folder will be created under your current working directory
model.lazy_loading_dir # notice that your lazy loading folder now is changed!
'new_lazyloading_ensemble_folder'
Make sure you use the same version of stemflow for writing and loading models.
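One simple way to record the version you trained with, so it can be checked before loading, is a minimal sketch using only the standard library:
from importlib.metadata import version
print(version('stemflow'))  # e.g., '1.1.2' in this notebook; store it alongside the .tar.gz file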
Concluding mark¶
Please open an issue if you have any questions.
Cheers!
from watermark import watermark
print(watermark())
print(watermark(packages="stemflow,numpy,scipy,pandas,xgboost,tqdm,matplotlib,h3pandas,geopandas,scikit-learn"))
Last updated: 2024-10-27T16:06:11.216365-05:00

Python implementation: CPython
Python version       : 3.11.10
IPython version      : 8.22.1

Compiler    : Clang 17.0.6
OS          : Darwin
Release     : 23.1.0
Machine     : arm64
Processor   : arm
CPU cores   : 14
Architecture: 64bit

stemflow    : 1.1.2
numpy       : 1.26.4
scipy       : 1.14.1
pandas      : 2.2.3
xgboost     : 2.0.3
tqdm        : 4.66.5
matplotlib  : 3.9.2
h3pandas    : 0.2.6
geopandas   : 0.14.3
scikit-learn: 1.5.2