Lazy loading ensembles¶
When each ensemble contains a huge number of models, or the number of ensembles is large (e.g., > 50), it may not be viable to hold all ensembles in memory. In that case, we can ask the model to save every ensemble to disk and load it only when it is actually used. This can dramatically reduce memory usage, but it also slows down prediction, since disk I/O (reading and writing between disk and memory) becomes the bottleneck.
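Enabling the behavior is a single constructor flag. Below is a minimal, illustrative sketch (the variable name sketch_model is arbitrary, and defaults are assumed for all arguments not shown; the full configuration used in the experiments appears later in this page):

# Minimal sketch: turn on lazy loading when constructing the model.
# With lazy_loading=True, each trained ensemble is written to disk after fitting
# and read back only when it is needed (e.g., at prediction time).
from stemflow.model.AdaSTEM import AdaSTEMRegressor
from xgboost import XGBRegressor

sketch_model = AdaSTEMRegressor(
    base_model=XGBRegressor(tree_method='hist', random_state=42, n_jobs=1),
    ensemble_fold=50,   # many ensembles would otherwise be expensive to keep in RAM
    lazy_loading=True,  # save ensembles to disk; load on demand
)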
We run experiments to see how this speed-memory trade-off works:
import os
import sys
import pandas as pd
import numpy as np
import random
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import matplotlib
import warnings
import pickle
import h3pandas
import copy
pd.set_option('display.max_columns', None)
# warnings.filterwarnings('ignore')
%matplotlib inline
%load_ext autoreload
%autoreload 2
Download data¶
Training/test data¶
Please download the sample data from:
Suppose it has been downloaded and saved as './Sample_data_Mallard.csv'.
Alternatively, you can try other species like
- Alder Flycatcher: https://figshare.com/articles/dataset/Sample_data_Alder_Flycatcher_csv/24080751
- Short-eared Owl: https://figshare.com/articles/dataset/Sample_data_Short-eared_Owl_csv/24080742
- Eurasian Tree Sparrow: https://figshare.com/articles/dataset/Sample_data_Eurasian_Tree_Sparrow_csv/24080748
Caveat: each of these bird observation files is about 200 MB.
data = pd.read_csv('./Sample_data_Mallard.csv')
data = data.drop('sampling_event_identifier', axis=1)  # drop the identifier column; not used as a feature
data = data.sample(frac=0.7)  # randomly subsample 70% of the rows
Prediction set¶
The prediction set is fed into a trained AdaSTEM model to make predictions: at a given location, on a given day of year, and given the environmental variables, how many Mallard individuals do we expect to observe?
The prediction set will be loaded after the model is trained.
Download the prediction set from: https://figshare.com/articles/dataset/Predset_2020_csv/24124980
Caveat: The file is about 700MB.
Get X and y¶
X = data.drop('count', axis=1)
y = data['count'].values
X.head()
longitude | latitude | DOY | duration_minutes | Traveling | Stationary | Area | effort_distance_km | number_observers | obsvr_species_count | time_observation_started_minute_of_day | elevation_mean | slope_mean | eastness_mean | northness_mean | bio1 | bio2 | bio3 | bio4 | bio5 | bio6 | bio7 | bio8 | bio9 | bio10 | bio11 | bio12 | bio13 | bio14 | bio15 | bio16 | bio17 | bio18 | bio19 | closed_shrublands | cropland_or_natural_vegetation_mosaics | croplands | deciduous_broadleaf_forests | deciduous_needleleaf_forests | evergreen_broadleaf_forests | evergreen_needleleaf_forests | grasslands | mixed_forests | non_vegetated_lands | open_shrublands | permanent_wetlands | savannas | urban_and_built_up_lands | water_bodies | woody_savannas | entropy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
397682 | -121.243566 | 44.722404 | 73 | 20.0 | 0 | 1 | 0 | -1.00 | 3.0 | 675.0 | 946 | 528.666700 | 6.930458 | -0.004461 | -0.003129 | 12.254499 | 11.222793 | 36.556190 | 722.291583 | 30.786013 | 0.085896 | 30.700117 | 5.900058 | 22.592101 | 22.036626 | 4.241771 | 0.054554 | 0.012744 | 0.000037 | 0.000017 | 0.032447 | 0.001068 | 0.001068 | 0.022911 | 0.0 | 0.000000 | 0.888889 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.083333 | 0.0 | 0.027778 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.411314 |
52761 | -84.830546 | 10.262687 | 202 | 15.0 | 0 | 1 | 0 | -1.00 | 1.0 | 279.0 | 823 | 926.777800 | 11.270809 | -0.000824 | 0.026607 | 22.108191 | 6.112472 | 69.367717 | 87.154508 | 27.772561 | 18.960866 | 8.811695 | 20.835122 | 23.067510 | 23.229957 | 20.984801 | 0.247798 | 0.053908 | 0.000343 | 0.000307 | 0.135164 | 0.003454 | 0.005471 | 0.099995 | 0.0 | 0.000000 | 0.000000 | 0.111111 | 0.0 | 0.083333 | 0.0 | 0.027778 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.388889 | 0.000000 | 0.0 | 0.388889 | 1.285335 |
102324 | -84.410191 | 42.625370 | 138 | 41.0 | 0 | 1 | 0 | -1.00 | 1.0 | 379.0 | 1073 | 271.000000 | 0.411284 | 0.192033 | -0.162664 | 10.530639 | 7.988692 | 22.794835 | 890.107596 | 29.042612 | -6.003451 | 35.046063 | -0.510486 | 24.446625 | 22.611013 | -0.829569 | 0.115785 | 0.016661 | 0.005342 | 0.000015 | 0.047077 | 0.017823 | 0.021165 | 0.029807 | 0.0 | 0.444444 | 0.555556 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.686962 |
198809 | -123.112206 | 49.241944 | 172 | 9.0 | 1 | 0 | 0 | 0.32 | 2.0 | 222.0 | 976 | 61.833332 | 1.856077 | 0.031009 | 0.402949 | 10.378653 | 6.568099 | 31.841727 | 561.138541 | 22.256131 | 1.628799 | 20.627332 | 3.524946 | 17.947798 | 17.871489 | 3.964883 | 0.315263 | 0.070945 | 0.006232 | 0.000348 | 0.164123 | 0.023907 | 0.033864 | 0.148501 | 0.0 | 0.055556 | 0.583333 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.055556 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.083333 | 0.222222 | 0.0 | 0.000000 | 1.176882 |
318693 | -96.888985 | 42.779600 | 259 | 45.0 | 0 | 1 | 0 | -1.00 | 2.0 | 670.0 | 981 | 376.666660 | 0.206779 | 0.003448 | 0.044174 | 11.011082 | 9.531909 | 24.486426 | 1033.230340 | 30.444804 | -8.482515 | 38.927319 | 14.754586 | -1.147955 | 25.053628 | -2.344446 | 0.073579 | 0.015177 | 0.001466 | 0.000021 | 0.038704 | 0.005140 | 0.025476 | 0.006744 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | -0.000000 |
The features include:

Spatial coordinates:
- longitude and latitude (used for indexing, not actual training)

Temporal coordinate:
- day of year (DOY): used for both indexing and training

Sampling parameters (quantifying how the observation was made):
- duration_minutes: how long the observation was conducted
- Observation protocol: Traveling, Stationary, or Area
- effort_distance_km: how far the observer traveled
- number_observers: how many observers are in the group
- obsvr_species_count: how many bird species the observer has recorded in the past
- time_observation_started_minute_of_day: when the observer started birding

Topographic features:
- Elevation: elevation_mean
- Slope magnitude and direction: slope_mean, eastness_mean, northness_mean

Bioclimatic features:
- Summaries of yearly temperature and precipitation: bio1 through bio19

Land cover features:
- Percentage cover of each land cover type, e.g., closed_shrublands, urban_and_built_up_lands
- entropy: entropy of land cover
As you can see, the environmental variables are largely static. However, dynamic features (e.g., daily temperature) are fully supported as input. See Tips for data types for details.
First things first: spatiotemporal train-test split¶
from stemflow.model_selection import ST_train_test_split
X_train, X_test, y_train, y_test = ST_train_test_split(X, y,
Spatio1 = 'longitude',
Spatio2 = 'latitude',
Temporal1 = 'DOY',
Spatio_blocks_count = 50, Temporal_blocks_count=50,
random_state=42, test_size=0.3)
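A quick sanity check of the split (illustrative; exact row counts depend on the random subsample above):

# Roughly a 70/30 split, blocked by spatial (longitude/latitude) and temporal (DOY) bins
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)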
Train AdaSTEM hurdle model¶
from stemflow.model.AdaSTEM import AdaSTEM, AdaSTEMClassifier, AdaSTEMRegressor
from xgboost import XGBClassifier, XGBRegressor # remember to install xgboost if you use it as the base model
from stemflow.model.Hurdle import Hurdle_for_AdaSTEM, Hurdle
import time
from memory_profiler import memory_usage
def make_model_lazyloading(ensemble_fold):
    model_lazyloading = AdaSTEMRegressor(
        base_model=Hurdle(
            classifier=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
            regressor=XGBRegressor(tree_method='hist', random_state=42, verbosity=0, n_jobs=1)
        ),  # hurdle model for zero-inflated problems (e.g., counts)
        save_gridding_plot=True,
        ensemble_fold=ensemble_fold,  # data are modeled ensemble_fold times, each time with jitter and rotation in the Quadtree algo
        min_ensemble_required=ensemble_fold-2,  # only points covered by at least ensemble_fold-2 ensembles will be predicted
        grid_len_upper_threshold=25,  # force splitting if the grid length exceeds 25
        grid_len_lower_threshold=5,  # stop splitting if the grid length falls below 5
        temporal_start=1,  # the next 4 params define the temporal sliding window
        temporal_end=366,
        temporal_step=25,  # the window takes steps of 25 DOY (see AdaSTEM demo for details)
        temporal_bin_interval=50,  # each window will contain data of 50 DOY
        points_lower_threshold=50,  # only stixels with more than 50 samples are trained
        Spatio1='longitude',  # the next three params define the names of the
        Spatio2='latitude',   # spatiotemporal coordinates in the dataframe
        Temporal1='DOY',
        use_temporal_to_train=True,  # in each stixel, whether 'DOY' should be a predictor
        n_jobs=1,  # not using parallel computing
        random_state=42,  # the random state makes the gridding process reproducible
        lazy_loading=True,  # lazy loading for a large number of ensembles (e.g., >20): each trained ensemble is saved to disk and loaded only when needed (e.g., for prediction)
        verbosity=1
    )
    return model_lazyloading
def make_model_not_lazyloading(ensemble_fold):
    model_not_lazyloading = AdaSTEMRegressor(
        base_model=Hurdle(
            classifier=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
            regressor=XGBRegressor(tree_method='hist', random_state=42, verbosity=0, n_jobs=1)
        ),  # hurdle model for zero-inflated problems (e.g., counts)
        save_gridding_plot=True,
        ensemble_fold=ensemble_fold,  # data are modeled ensemble_fold times, each time with jitter and rotation in the Quadtree algo
        min_ensemble_required=ensemble_fold-2,  # only points covered by at least ensemble_fold-2 ensembles will be predicted
        grid_len_upper_threshold=25,  # force splitting if the grid length exceeds 25
        grid_len_lower_threshold=5,  # stop splitting if the grid length falls below 5
        temporal_start=1,  # the next 4 params define the temporal sliding window
        temporal_end=366,
        temporal_step=25,  # the window takes steps of 25 DOY (see AdaSTEM demo for details)
        temporal_bin_interval=50,  # each window will contain data of 50 DOY
        points_lower_threshold=50,  # only stixels with more than 50 samples are trained
        Spatio1='longitude',  # the next three params define the names of the
        Spatio2='latitude',   # spatiotemporal coordinates in the dataframe
        Temporal1='DOY',
        use_temporal_to_train=True,  # in each stixel, whether 'DOY' should be a predictor
        n_jobs=1,  # not using parallel computing
        random_state=42,  # the random state makes the gridding process reproducible
        lazy_loading=False,  # no lazy loading: all trained ensembles are kept in memory
        verbosity=1
    )
    return model_not_lazyloading
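The two factory functions above differ only in the lazy_loading flag. If you prefer, a single parameterized helper avoids the duplication (a sketch, equivalent to the two functions above):

def make_model(ensemble_fold, lazy_loading):
    # Same configuration as above; only the lazy_loading flag varies.
    return AdaSTEMRegressor(
        base_model=Hurdle(
            classifier=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
            regressor=XGBRegressor(tree_method='hist', random_state=42, verbosity=0, n_jobs=1)
        ),
        save_gridding_plot=True,
        ensemble_fold=ensemble_fold,
        min_ensemble_required=ensemble_fold-2,
        grid_len_upper_threshold=25,
        grid_len_lower_threshold=5,
        temporal_start=1, temporal_end=366, temporal_step=25, temporal_bin_interval=50,
        points_lower_threshold=50,
        Spatio1='longitude', Spatio2='latitude', Temporal1='DOY',
        use_temporal_to_train=True,
        n_jobs=1, random_state=42,
        lazy_loading=lazy_loading,
        verbosity=1
    )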
def train_model(model, X_train, y_train):
    start_time = time.time()

    def run_model(model=model, X_train=X_train, y_train=y_train):
        model.fit(X_train, y_train)
        return

    # memory_usage samples the process memory (in MiB) while run_model executes
    memory_use = memory_usage((run_model, (model, X_train, y_train, )))
    memory_use = [i/1024 for i in memory_use]  # convert MiB to GiB
    peak_use = np.max(memory_use)
    average_use = np.mean(memory_use)
    end_time = time.time()
    training_time_use = end_time - start_time
    print('Training finish!')
    return training_time_use, peak_use, average_use
def test_model(model, X_test, y_test):
    start_time = time.time()

    def run_model(model=model, X_test=X_test):
        model.predict(X_test)
        return

    # memory_usage samples the process memory (in MiB) while run_model executes
    memory_use = memory_usage((run_model, (model, X_test, )))
    memory_use = [i/1024 for i in memory_use]  # convert MiB to GiB
    peak_use = np.max(memory_use)
    average_use = np.mean(memory_use)
    end_time = time.time()
    prediction_time_use = end_time - start_time
    print('Prediction finish!')
    return prediction_time_use, peak_use, average_use
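For reference, memory_usage from memory_profiler returns a list of memory samples (in MiB) taken while the supplied callable runs, which is why the helpers above divide by 1024 to report GiB. A tiny standalone example (illustrative only; the allocation size is arbitrary):

from memory_profiler import memory_usage

def allocate():
    big = [0] * 50_000_000  # roughly 0.4 GB of list storage
    return len(big)

# sample memory every 0.1 s while allocate() runs
samples = memory_usage((allocate, ()), interval=0.1)
print(f'{len(samples)} samples, peak ~ {max(samples)/1024:.2f} GiB')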
Run test¶
log_list = []
ensemble_fold_list = [3, 5, 10, 20, 30, 40]

for ensemble_fold in tqdm(ensemble_fold_list):
    model_not_lazyloading = make_model_not_lazyloading(ensemble_fold)
    train_time_use, peak_train_memory_use, average_train_memory_use = train_model(model_not_lazyloading, X_train, y_train)
    test_time_use, peak_test_memory_use, average_test_memory_use = test_model(model_not_lazyloading, X_test, y_test)
    log_list.append({
        'ensemble_fold': ensemble_fold,
        'lazy_loading': False,
        'train_time_use': train_time_use,
        'peak_train_memory_use': peak_train_memory_use,
        'average_train_memory_use': average_train_memory_use,
        'test_time_use': test_time_use,
        'peak_test_memory_use': peak_test_memory_use,
        'average_test_memory_use': average_test_memory_use
    })
    del model_not_lazyloading

    model_lazyloading = make_model_lazyloading(ensemble_fold)
    train_time_use, peak_train_memory_use, average_train_memory_use = train_model(model_lazyloading, X_train, y_train)
    test_time_use, peak_test_memory_use, average_test_memory_use = test_model(model_lazyloading, X_test, y_test)
    log_list.append({
        'ensemble_fold': ensemble_fold,
        'lazy_loading': True,
        'train_time_use': train_time_use,
        'peak_train_memory_use': peak_train_memory_use,
        'average_train_memory_use': average_train_memory_use,
        'test_time_use': test_time_use,
        'peak_test_memory_use': peak_test_memory_use,
        'average_test_memory_use': average_test_memory_use
    })
    del model_lazyloading
0%| | 0/6 [00:00<?, ?it/s]
Generating Ensemble: 100%|██████████| 3/3 [00:03<00:00, 1.13s/it] Training: 100%|██████████| 3/3 [02:00<00:00, 40.32s/it]
Training finish!
Predicting: 100%|██████████| 3/3 [00:03<00:00, 1.12s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 3/3 [00:03<00:00, 1.13s/it] Training: 100%|██████████| 3/3 [02:03<00:00, 41.10s/it]
Training finish!
Predicting: 100%|██████████| 3/3 [00:06<00:00, 2.18s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 5/5 [00:05<00:00, 1.13s/it] Training: 100%|██████████| 5/5 [03:28<00:00, 41.76s/it]
Training finish!
Predicting: 100%|██████████| 5/5 [00:05<00:00, 1.13s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 5/5 [00:05<00:00, 1.14s/it] Training: 100%|██████████| 5/5 [03:30<00:00, 42.10s/it]
Training finish!
Predicting: 100%|██████████| 5/5 [00:10<00:00, 2.12s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 10/10 [00:11<00:00, 1.15s/it] Training: 100%|██████████| 10/10 [06:53<00:00, 41.31s/it]
Training finish!
Predicting: 100%|██████████| 10/10 [00:11<00:00, 1.15s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 10/10 [00:11<00:00, 1.20s/it] Training: 100%|██████████| 10/10 [06:55<00:00, 41.52s/it]
Training finish!
Predicting: 100%|██████████| 10/10 [00:21<00:00, 2.16s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 20/20 [00:23<00:00, 1.19s/it] Training: 100%|██████████| 20/20 [13:45<00:00, 41.25s/it]
Training finish!
Predicting: 100%|██████████| 20/20 [00:22<00:00, 1.12s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 20/20 [00:22<00:00, 1.15s/it] Training: 100%|██████████| 20/20 [13:52<00:00, 41.63s/it]
Training finish!
Predicting: 100%|██████████| 20/20 [00:43<00:00, 2.19s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 30/30 [00:36<00:00, 1.21s/it] Training: 100%|██████████| 30/30 [20:39<00:00, 41.31s/it]
Training finish!
Predicting: 100%|██████████| 30/30 [00:34<00:00, 1.14s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 30/30 [00:37<00:00, 1.25s/it] Training: 100%|██████████| 30/30 [20:51<00:00, 41.73s/it]
Training finish!
Predicting: 100%|██████████| 30/30 [01:03<00:00, 2.13s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 40/40 [00:50<00:00, 1.26s/it] Training: 100%|██████████| 40/40 [27:14<00:00, 40.87s/it]
Training finish!
Predicting: 100%|██████████| 40/40 [00:44<00:00, 1.12s/it]
Prediction finish!
Generating Ensemble: 100%|██████████| 40/40 [00:48<00:00, 1.20s/it] Training: 100%|██████████| 40/40 [27:34<00:00, 41.36s/it]
Training finish!
Predicting: 100%|██████████| 40/40 [01:25<00:00, 2.15s/it]
Prediction finish!
log_df = pd.DataFrame(log_list)
log_df.to_csv('tmp_log_df.csv', index=False)
log_df
ensemble_fold | lazy_loading | train_time_use | peak_train_memory_use | average_train_memory_use | test_time_use | peak_test_memory_use | average_test_memory_use | |
---|---|---|---|---|---|---|---|---|
0 | 3 | False | 124.643243 | 2.142380 | 1.776680 | 3.752282 | 2.219360 | 2.211794 |
1 | 3 | True | 127.121982 | 2.891403 | 2.499677 | 6.961078 | 3.016815 | 2.899677 |
2 | 5 | False | 214.892371 | 3.385010 | 3.014771 | 6.143069 | 3.386505 | 3.385702 |
3 | 5 | True | 216.611525 | 3.400070 | 3.321900 | 11.085089 | 3.239746 | 2.909443 |
4 | 10 | False | 425.086186 | 3.564209 | 3.024224 | 11.998664 | 2.775177 | 2.343333 |
5 | 10 | True | 427.636022 | 3.242584 | 2.792008 | 22.192055 | 3.300964 | 3.262257 |
6 | 20 | False | 849.239290 | 3.253647 | 2.356456 | 22.949365 | 4.707184 | 4.254614 |
7 | 20 | True | 856.095079 | 3.199005 | 2.433513 | 44.455940 | 2.685013 | 2.607614 |
8 | 30 | False | 1276.125569 | 3.290817 | 2.319885 | 34.866566 | 6.339859 | 4.900478 |
9 | 30 | True | 1289.964938 | 3.367615 | 2.650008 | 64.755426 | 2.727402 | 2.603010 |
10 | 40 | False | 1685.935830 | 3.607254 | 2.748954 | 45.656563 | 8.115158 | 5.828427 |
11 | 40 | True | 1703.282001 | 4.484177 | 3.213610 | 86.798334 | 2.951721 | 2.789791 |
Plotting experiment results¶
fig, ax = plt.subplots(2, 3, figsize=(15, 9))

for var_id, var_ in enumerate(['train_time_use', 'peak_train_memory_use', 'average_train_memory_use',
                               'test_time_use', 'peak_test_memory_use', 'average_test_memory_use']):
    ax[var_id//3, var_id%3].plot(
        log_df[log_df['lazy_loading']==False]['ensemble_fold'],
        log_df[log_df['lazy_loading']==False][var_],
        label='non-lazy'
    )
    ax[var_id//3, var_id%3].scatter(
        log_df[log_df['lazy_loading']==False]['ensemble_fold'],
        log_df[log_df['lazy_loading']==False][var_],
    )
    ax[var_id//3, var_id%3].plot(
        log_df[log_df['lazy_loading']==True]['ensemble_fold'],
        log_df[log_df['lazy_loading']==True][var_],
        label='lazy'
    )
    ax[var_id//3, var_id%3].scatter(
        log_df[log_df['lazy_loading']==True]['ensemble_fold'],
        log_df[log_df['lazy_loading']==True][var_],
    )
    ax[var_id//3, var_id%3].legend()
    ax[var_id//3, var_id%3].set_title(var_)
    if 'time' in var_:
        ax[var_id//3, var_id%3].set_ylabel('Seconds')
    elif 'memory' in var_:
        ax[var_id//3, var_id%3].set_ylabel('GB')
    ax[var_id//3, var_id%3].set_xlabel('Ensemble fold')

plt.subplots_adjust(wspace=0.2, hspace=0.3)
Speed vs. memory usage¶
From the results we can clearly see the trade-off. Using lazy-loading:
- Has little impact on training speed.
- Has a large impact on testing (prediction) speed: prediction time roughly doubles in our case.
- Keeps memory use roughly constant as the ensemble fold increases (around 3 GB in our case), whereas without lazy loading memory consumption grows linearly with the number of ensembles.
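To quantify the last point, one can fit a line to the non-lazy peak prediction memory recorded in log_df (illustrative):

non_lazy = log_df[log_df['lazy_loading'] == False]
slope, intercept = np.polyfit(non_lazy['ensemble_fold'], non_lazy['peak_test_memory_use'], 1)
print(f'non-lazy peak prediction memory grows by ~{slope:.2f} GB per additional ensemble')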
Save model¶
If you are using lazy_loading=True, try:
model.save(tar_gz_file='./my_model.tar.gz', remove_temporary_file=True)
# After removing temporary files, you can no longer access the ensembles saved on disk!
# Instead, they now live inside the .tar.gz file.

# load model
model = AdaSTEM.load(tar_gz_file='./my_model.tar.gz', new_lazy_loading_path='./new_lazyloading_ensemble_folder', remove_original_file=False)
# if new_lazy_loading_path is not specified, a random folder will be created under your current working directory

model.lazy_loading_dir  # notice that your lazy loading folder has now changed!
'new_lazyloading_ensemble_folder'
Make sure you use the same version of stemflow to save and load models.
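A quick way to check the installed version before loading (illustrative):

from importlib.metadata import version
print(version('stemflow'))  # should match the version used when the model was saved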
Concluding remarks¶
Please open an issue if you have any questions.
Cheers!
from watermark import watermark
print(watermark())
print(watermark(packages="stemflow,numpy,scipy,pandas,xgboost,tqdm,matplotlib,h3pandas,geopandas,scikit-learn"))
Last updated: 2024-10-27T16:06:11.216365-05:00

Python implementation: CPython
Python version       : 3.11.10
IPython version      : 8.22.1

Compiler    : Clang 17.0.6
OS          : Darwin
Release     : 23.1.0
Machine     : arm64
Processor   : arm
CPU cores   : 14
Architecture: 64bit

stemflow    : 1.1.2
numpy       : 1.26.4
scipy       : 1.14.1
pandas      : 2.2.3
xgboost     : 2.0.3
tqdm        : 4.66.5
matplotlib  : 3.9.2
h3pandas    : 0.2.6
geopandas   : 0.14.3
scikit-learn: 1.5.2