Optimizing stixel size¶
Yangkang Chen
Jan 12, 2024
This notebook explores why and how we should tune the gridding hyperparameters of the Quadtree algorithm.
Why should we even care about the grid size parameter?¶
As mentioned in the brief introduction to stemflow, the main target of the AdaSTEM modeling framework is to solve the long-distance/long-range prediction problem.
This is realized by gridding the data into smaller spatial and temporal units (which is what the temporal sliding window and the spatial Quadtree algorithm do), so that each prediction relies only on adjacency information in space and time learned during training.
It is similar to the concept of spatial regression or spatial/temporal lag, except that in our case the lag is a sharp truncation (although after averaging across ensembles, the edges are softened).
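To see how ensemble averaging softens that sharp truncation, here is a toy sketch (illustrative only, not stemflow code): each ensemble places its stixel boundary at a slightly shifted location, and averaging turns the hard step into a gradual ramp.
import numpy as np
x = np.linspace(0, 10, 101)
edges = np.linspace(4, 6, 10)  # each ensemble cuts the space at a shifted position
preds = [np.where(x < e, 1.0, 0.0) for e in edges]  # sharp per-ensemble steps
soft = np.mean(preds, axis=0)  # the averaged edge is a smooth transition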
Two extreme examples illustrate the necessity of setting good gridding parameters:
- Extremely large: imagine setting an extremely large sliding-window interval and grid size for the Quadtree algorithm – there would be only one stixel. That is, it would be equivalent to bulk regression without partitioning, which is clearly not what we want.
- Extremely small: imagine grids so small that each contains only one sample – that would be 100% overfitting.
Therefore, we need to assess which parameter setting best serves our modeling goals:
- improve model performance
- avoid overfitting
- strike a balance between accuracy and efficiency
import pandas as pd
import numpy as np
import random
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import matplotlib
import warnings
import pickle
import os
import time
from stemflow.model.AdaSTEM import AdaSTEM, AdaSTEMClassifier, AdaSTEMRegressor
from xgboost import XGBClassifier, XGBRegressor # remember to install xgboost if you use it as base model
from stemflow.model.Hurdle import Hurdle_for_AdaSTEM, Hurdle
# warnings.filterwarnings('ignore')
%matplotlib inline
%load_ext autoreload
%autoreload 2
Download data¶
Please download the sample data from:
Suppose now it's downloaded and saved as './Sample_data_Mallard.csv'
Alternatively, you can try other species like
- Alder Flycatcher: https://figshare.com/articles/dataset/Sample_data_Alder_Flycatcher_csv/24080751
- Short-eared Owl: https://figshare.com/articles/dataset/Sample_data_Short-eared_Owl_csv/24080742
- Eurasian Tree Sparrow: https://figshare.com/articles/dataset/Sample_data_Eurasian_Tree_Sparrow_csv/24080748
Caveat: these bird observation files are about 200 MB each.
Load data¶
data = pd.read_csv('./Sample_data_Mallard.csv')
data = data.drop('sampling_event_identifier', axis=1)
data.head()
longitude | latitude | count | DOY | duration_minutes | Traveling | Stationary | Area | effort_distance_km | number_observers | ... | grasslands | mixed_forests | non_vegetated_lands | open_shrublands | permanent_wetlands | savannas | urban_and_built_up_lands | water_bodies | woody_savannas | entropy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -83.472224 | 8.859308 | 0.0 | 22 | 300.0 | 1 | 0 | 0 | 4.828 | 5.0 | ... | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.777778 | 0.000000 | 0.000000 | 0.083333 | 0.000000 | 0.676720 |
1 | -2.687724 | 43.373323 | 9.0 | 290 | 90.0 | 1 | 0 | 0 | 0.570 | 2.0 | ... | 0.000000 | 0.083333 | 0.0 | 0.0 | 0.000000 | 0.194444 | 0.027778 | 0.000000 | 0.361111 | 1.359063 |
2 | -89.884770 | 35.087255 | 0.0 | 141 | 10.0 | 0 | 1 | 0 | -1.000 | 2.0 | ... | 0.305556 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.527778 | 0.000000 | 0.000000 | 0.111111 | 1.104278 |
3 | -99.216873 | 31.218510 | 0.0 | 104 | 9.0 | 1 | 0 | 0 | 0.805 | 2.0 | ... | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.000000 |
4 | -124.426730 | 43.065847 | 2.0 | 96 | 30.0 | 1 | 0 | 0 | 0.161 | 2.0 | ... | 0.000000 | 0.472222 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.020754 |
5 rows × 52 columns
Train AdaSTEMRegressor with hurdle model as base model¶
stemflow has two important gridding parameters: the maximum grid length and the minimum grid length.
Now we set the gradient of values that we want to explore: 5, 10, 25, 50, and 100 degree units.
import itertools
# Create the parameter grid ensuring the upper threshold is greater than the lower threshold
params = [
{'grid_len_lower_threshold': [lower], 'grid_len_upper_threshold': [upper]}
for lower, upper in itertools.product([5, 10, 25, 50, 100], [5, 10, 25, 50, 100])
if upper > lower
]
print('Gridsearch params: ',params)
Gridsearch params: [{'grid_len_lower_threshold': [5], 'grid_len_upper_threshold': [10]}, {'grid_len_lower_threshold': [5], 'grid_len_upper_threshold': [25]}, {'grid_len_lower_threshold': [5], 'grid_len_upper_threshold': [50]}, {'grid_len_lower_threshold': [5], 'grid_len_upper_threshold': [100]}, {'grid_len_lower_threshold': [10], 'grid_len_upper_threshold': [25]}, {'grid_len_lower_threshold': [10], 'grid_len_upper_threshold': [50]}, {'grid_len_lower_threshold': [10], 'grid_len_upper_threshold': [100]}, {'grid_len_lower_threshold': [25], 'grid_len_upper_threshold': [50]}, {'grid_len_lower_threshold': [25], 'grid_len_upper_threshold': [100]}, {'grid_len_lower_threshold': [50], 'grid_len_upper_threshold': [100]}]
We made a grid of the parameters "grid_len_lower_threshold" and "grid_len_upper_threshold", keeping only the valid pairs where the upper threshold exceeds the lower one. Now we train a model for each of these combinations and find the best one.
Parameter Gridsearch¶
To speed things up, you could use a subset of the data, e.g., a 20% random sample (commented out below); here we use the full dataset:
# data_sub = data.sample(frac=0.2, replace=False)
# X = data_sub.drop('count', axis=1)
# y = data_sub['count'].values
X = data.drop('count', axis=1)
y = data['count'].values
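As an aside, bird count data like this are often zero-inflated (many checklists record absence), which is why a hurdle base model is used below; a quick check on the target, assuming data is loaded as above:
zero_frac = (y == 0).mean()  # fraction of zero counts in the target
print(f'{zero_frac:.1%} of samples are zero counts')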
First things first: Spatiotemporal train-test split
The following generator will generate a tuple of (training_data_indexes, test_data_indexes) each time it is called, until we run out of samples.
from stemflow.model_selection import ST_KFold
ST_KFold_generator = ST_KFold(
Spatio1 = "longitude",
Spatio2 = "latitude",
Temporal1 = "DOY",
Spatio_blocks_count = 50,
Temporal_blocks_count = 50,
random_state = 42,
n_splits = 3
).split(X)
The ST_KFold.split method returns a one-off generator, which we pass to GridSearchCV below. Here we use three folds (n_splits = 3).
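Because the generator is one-off, it is exhausted once GridSearchCV has iterated through it; if you need the same folds again (e.g., for a second search), recreate it. A minimal sketch using the same settings (the helper name is ours):
def make_cv_generator():
    # Re-create the one-off fold generator with identical settings
    return ST_KFold(
        Spatio1="longitude", Spatio2="latitude", Temporal1="DOY",
        Spatio_blocks_count=50, Temporal_blocks_count=50,
        random_state=42, n_splits=3
    ).split(X)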
We need to make a custom scorer for the "Poisson deviance explained" metric, which will be passed to GridSearchCV:
# Make scorer
from stemflow.model.AdaSTEM import AdaSTEM
from scipy.stats import spearmanr
from sklearn.metrics import (
make_scorer,
d2_tweedie_score,
mean_squared_error,
r2_score,
roc_auc_score,
cohen_kappa_score,
f1_score,
average_precision_score
)
def _dropna_xy(x, y):
    # Align truths (x) and predictions (y), dropping any pair containing NaN
    a = pd.DataFrame({'x': x, 'y': y}).dropna()
    return np.array(a.x), np.array(a.y)
def _binarize(v):
    # Presence/absence: positive values become 1, zeros stay 0
    return np.where(v > 0, 1, 0)
def spearmanr_score(x, y):
    x, y = _dropna_xy(x, y)
    return spearmanr(x, y)[0]
def MSE_score(x, y):
    x, y = _dropna_xy(x, y)
    return mean_squared_error(x, y)
def PD_score(x, y):
    # Poisson deviance explained (d2_tweedie_score with power=1);
    # requires non-negative truths and strictly positive predictions
    x, y = _dropna_xy(x, y)
    mask = (x >= 0) & (y > 0)
    return d2_tweedie_score(x[mask], y[mask], power=1)
def r2_score_(x, y):
    x, y = _dropna_xy(x, y)
    return r2_score(x, y)
def roc_auc_score_(x, y):
    x, y = _dropna_xy(x, y)
    return roc_auc_score(_binarize(x), _binarize(y))
def cohen_kappa_score_(x, y):
    x, y = _dropna_xy(x, y)
    return cohen_kappa_score(_binarize(x), _binarize(y))
def f1_score_(x, y):
    x, y = _dropna_xy(x, y)
    return f1_score(_binarize(x), _binarize(y))
def average_precision_score_(x, y):
    x, y = _dropna_xy(x, y)
    return average_precision_score(_binarize(x), _binarize(y))
my_scoring = {'spearmanr':make_scorer(spearmanr_score, greater_is_better=True),
'MSE':make_scorer(MSE_score, greater_is_better=False),
'PD':make_scorer(PD_score, greater_is_better=True),
'R2':make_scorer(r2_score_, greater_is_better=True),
'AUC':make_scorer(roc_auc_score_, greater_is_better=True),
'cohen_kappa':make_scorer(cohen_kappa_score_, greater_is_better=True),
'f1':make_scorer(f1_score_, greater_is_better=True),
'average_precision':make_scorer(average_precision_score_, greater_is_better=True),
}
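As a quick sanity check, all scorers drop NaN pairs before scoring; a toy example (hypothetical values):
y_true = np.array([0.0, 2.0, 5.0, np.nan])
y_pred = np.array([0.1, 1.5, 4.0, 3.0])
print(spearmanr_score(y_true, y_pred))  # 1.0 -- the NaN pair is dropped, the rest is monotonic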
Finally, we can perform the gridsearch:
# Define model
model = AdaSTEMRegressor(
base_model=Hurdle(
classifier=XGBClassifier(tree_method='hist',random_state=42, verbosity = 0, n_jobs=1),
regressor=XGBRegressor(tree_method='hist',random_state=42, verbosity = 0, n_jobs=1)
),
save_gridding_plot = True,
ensemble_fold=10,
min_ensemble_required=7,
# grid_len_upper_threshold=max_grid, # This is a gridsearch target
# grid_len_lower_threshold=min_grid, # This is a gridsearch target
points_lower_threshold=50,
Spatio1='longitude',
Spatio2 = 'latitude',
Temporal1 = 'DOY',
use_temporal_to_train=True,
n_jobs=4
)
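The two grid-length thresholds are left commented out because GridSearchCV assigns them per parameter combination on clones of the estimator. A sketch of what happens under the hood (hypothetical values):
from sklearn.base import clone
# For each combination, GridSearchCV clones the estimator and sets the parameters on the clone
m = clone(model).set_params(grid_len_lower_threshold=10, grid_len_upper_threshold=50)
print(m.get_params()['grid_len_upper_threshold'])  # 50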
# Perform gridsearch
from sklearn.model_selection import GridSearchCV
gridsearch = GridSearchCV(model, params, scoring=my_scoring, cv=ST_KFold_generator, refit=False)
gridsearch.fit(X, np.log(y + 1), verbosity=0)  # y is fitted on the log scale
GridSearchCV(cv=<generator object ST_KFold.split at 0x17c07a2c0>, estimator=AdaSTEMRegressor(base_model=Hurdle(classifier=XGBClassifier(...), regressor=XGBRegressor(...)), ...), scoring={'AUC': ..., 'MSE': ..., 'PD': ..., 'R2': ..., 'average_precision': ..., 'cohen_kappa': ..., 'f1': ..., 'spearmanr': ...})
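Note that the gridsearch target is np.log(y + 1), so the CV metrics above are computed on the log scale; to express such predictions as counts, the inverse transform would be np.expm1 (a minimal sketch with hypothetical values):
pred_log = np.array([0.5, 1.2])  # predictions on the log(y + 1) scale
pred_count = np.expm1(pred_log)  # inverse of np.log(y + 1)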
Summarize and see the results:
pd.set_option('display.max_columns', None)
cv_results = pd.DataFrame(gridsearch.cv_results_)
cv_results
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_grid_len_lower_threshold | param_grid_len_upper_threshold | params | split0_test_spearmanr | split1_test_spearmanr | split2_test_spearmanr | mean_test_spearmanr | std_test_spearmanr | rank_test_spearmanr | split0_test_MSE | split1_test_MSE | split2_test_MSE | mean_test_MSE | std_test_MSE | rank_test_MSE | split0_test_PD | split1_test_PD | split2_test_PD | mean_test_PD | std_test_PD | rank_test_PD | split0_test_R2 | split1_test_R2 | split2_test_R2 | mean_test_R2 | std_test_R2 | rank_test_R2 | split0_test_AUC | split1_test_AUC | split2_test_AUC | mean_test_AUC | std_test_AUC | rank_test_AUC | split0_test_cohen_kappa | split1_test_cohen_kappa | split2_test_cohen_kappa | mean_test_cohen_kappa | std_test_cohen_kappa | rank_test_cohen_kappa | split0_test_f1 | split1_test_f1 | split2_test_f1 | mean_test_f1 | std_test_f1 | rank_test_f1 | split0_test_average_precision | split1_test_average_precision | split2_test_average_precision | mean_test_average_precision | std_test_average_precision | rank_test_average_precision | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 218.272017 | 1.420731 | 20.046579 | 0.102875 | 5 | 10 | {'grid_len_lower_threshold': 5, 'grid_len_uppe... | 0.480224 | 0.479757 | 0.476468 | 0.478816 | 0.001671 | 6 | -0.637476 | -0.646614 | -0.640127 | -0.641406 | 0.003838 | 7 | 0.044668 | 0.027008 | 0.024234 | 0.031970 | 0.009050 | 2 | 0.239227 | 0.227532 | 0.231092 | 0.232617 | 0.004895 | 7 | 0.767697 | 0.765100 | 0.763683 | 0.765493 | 0.001662 | 8 | 0.382611 | 0.386255 | 0.379867 | 0.382911 | 0.002616 | 1 | 0.519841 | 0.525674 | 0.521139 | 0.522218 | 0.002501 | 2 | 0.340727 | 0.346915 | 0.342835 | 0.343492 | 0.002569 | 2 |
1 | 217.042172 | 1.759909 | 20.449399 | 0.767189 | 5 | 25 | {'grid_len_lower_threshold': 5, 'grid_len_uppe... | 0.480897 | 0.480847 | 0.473253 | 0.478332 | 0.003592 | 10 | -0.636580 | -0.645693 | -0.640233 | -0.640835 | 0.003744 | 5 | 0.040950 | 0.022381 | 0.028896 | 0.030742 | 0.007692 | 5 | 0.240138 | 0.228775 | 0.230528 | 0.233147 | 0.004995 | 5 | 0.768442 | 0.765952 | 0.761300 | 0.765231 | 0.002960 | 10 | 0.383198 | 0.385025 | 0.377348 | 0.381857 | 0.003274 | 8 | 0.520386 | 0.525242 | 0.518940 | 0.521523 | 0.002696 | 10 | 0.341257 | 0.346639 | 0.340771 | 0.342889 | 0.002659 | 10 |
2 | 214.022419 | 1.903922 | 19.413781 | 0.260689 | 5 | 50 | {'grid_len_lower_threshold': 5, 'grid_len_uppe... | 0.480159 | 0.483764 | 0.472423 | 0.478782 | 0.004731 | 7 | -0.638353 | -0.645326 | -0.641852 | -0.641843 | 0.002847 | 10 | 0.038271 | 0.019981 | 0.029494 | 0.029249 | 0.007469 | 8 | 0.238124 | 0.229372 | 0.228444 | 0.231980 | 0.004361 | 10 | 0.768335 | 0.768429 | 0.760858 | 0.765874 | 0.003547 | 4 | 0.381689 | 0.387667 | 0.377277 | 0.382211 | 0.004258 | 5 | 0.519461 | 0.527503 | 0.518775 | 0.521913 | 0.003962 | 4 | 0.340491 | 0.348781 | 0.340599 | 0.343290 | 0.003883 | 4 |
3 | 216.172207 | 4.619032 | 19.466128 | 0.300229 | 5 | 100 | {'grid_len_lower_threshold': 5, 'grid_len_uppe... | 0.480596 | 0.483419 | 0.473618 | 0.479211 | 0.004119 | 4 | -0.638459 | -0.645217 | -0.641344 | -0.641673 | 0.002769 | 9 | 0.040570 | 0.016574 | 0.023548 | 0.026897 | 0.010078 | 10 | 0.237735 | 0.229369 | 0.229269 | 0.232124 | 0.003967 | 9 | 0.768546 | 0.767857 | 0.762124 | 0.766176 | 0.002879 | 1 | 0.384552 | 0.386498 | 0.377480 | 0.382843 | 0.003875 | 2 | 0.521157 | 0.526651 | 0.519326 | 0.522378 | 0.003113 | 1 | 0.341879 | 0.348006 | 0.341190 | 0.343692 | 0.003064 | 1 |
4 | 215.107877 | 4.181419 | 20.273383 | 0.557603 | 10 | 25 | {'grid_len_lower_threshold': 10, 'grid_len_upp... | 0.482321 | 0.483158 | 0.473463 | 0.479647 | 0.004386 | 1 | -0.636647 | -0.644747 | -0.640509 | -0.640634 | 0.003308 | 3 | 0.042446 | 0.023040 | 0.029125 | 0.031537 | 0.008104 | 4 | 0.239677 | 0.230368 | 0.230200 | 0.233415 | 0.004428 | 3 | 0.769428 | 0.767579 | 0.761230 | 0.766079 | 0.003511 | 3 | 0.382140 | 0.386979 | 0.377967 | 0.382362 | 0.003682 | 3 | 0.519993 | 0.526874 | 0.519292 | 0.522053 | 0.003421 | 3 | 0.341045 | 0.348174 | 0.341064 | 0.343428 | 0.003356 | 3 |
5 | 213.352489 | 2.193241 | 19.362665 | 0.505505 | 10 | 50 | {'grid_len_lower_threshold': 10, 'grid_len_upp... | 0.481020 | 0.482094 | 0.473762 | 0.478959 | 0.003701 | 5 | -0.636243 | -0.643681 | -0.640688 | -0.640204 | 0.003056 | 1 | 0.040334 | 0.028420 | 0.026523 | 0.031759 | 0.006113 | 3 | 0.240317 | 0.231341 | 0.230629 | 0.234096 | 0.004409 | 1 | 0.768764 | 0.766218 | 0.761671 | 0.765551 | 0.002934 | 6 | 0.383390 | 0.383986 | 0.377840 | 0.381738 | 0.002768 | 10 | 0.520507 | 0.524712 | 0.519428 | 0.521549 | 0.002279 | 9 | 0.341365 | 0.346217 | 0.341245 | 0.342942 | 0.002316 | 9 |
6 | 212.409846 | 1.791777 | 19.197133 | 0.329267 | 10 | 100 | {'grid_len_lower_threshold': 10, 'grid_len_upp... | 0.482231 | 0.483458 | 0.473078 | 0.479589 | 0.004631 | 2 | -0.634884 | -0.646097 | -0.640164 | -0.640382 | 0.004580 | 2 | 0.047673 | 0.011060 | 0.022171 | 0.026968 | 0.015327 | 9 | 0.242213 | 0.228560 | 0.230660 | 0.233811 | 0.006003 | 2 | 0.769214 | 0.767891 | 0.761302 | 0.766136 | 0.003460 | 2 | 0.381944 | 0.386624 | 0.376668 | 0.381746 | 0.004067 | 9 | 0.519889 | 0.526760 | 0.518590 | 0.521747 | 0.003585 | 7 | 0.340960 | 0.348110 | 0.340496 | 0.343189 | 0.003485 | 6 |
7 | 213.293670 | 4.804161 | 19.170226 | 0.311369 | 25 | 50 | {'grid_len_lower_threshold': 25, 'grid_len_upp... | 0.482532 | 0.480622 | 0.474632 | 0.479262 | 0.003365 | 3 | -0.637352 | -0.644668 | -0.640170 | -0.640730 | 0.003013 | 4 | 0.042056 | 0.033681 | 0.025022 | 0.033587 | 0.006954 | 1 | 0.239183 | 0.230004 | 0.230808 | 0.233332 | 0.004150 | 4 | 0.769941 | 0.765217 | 0.761939 | 0.765699 | 0.003285 | 5 | 0.383998 | 0.383063 | 0.379359 | 0.382140 | 0.002003 | 7 | 0.521312 | 0.523952 | 0.520331 | 0.521865 | 0.001529 | 5 | 0.342212 | 0.345509 | 0.342002 | 0.343241 | 0.001606 | 5 |
8 | 214.590513 | 2.080030 | 19.493644 | 0.539657 | 25 | 100 | {'grid_len_lower_threshold': 25, 'grid_len_upp... | 0.481007 | 0.482448 | 0.472626 | 0.478693 | 0.004331 | 8 | -0.637054 | -0.644291 | -0.643013 | -0.641453 | 0.003154 | 8 | 0.040531 | 0.028586 | 0.022551 | 0.030556 | 0.007472 | 6 | 0.239600 | 0.230487 | 0.227569 | 0.232552 | 0.005124 | 8 | 0.768641 | 0.766892 | 0.761085 | 0.765540 | 0.003230 | 7 | 0.381809 | 0.385155 | 0.379586 | 0.382183 | 0.002289 | 6 | 0.519655 | 0.525611 | 0.520109 | 0.521792 | 0.002707 | 6 | 0.340700 | 0.347045 | 0.341710 | 0.343152 | 0.002784 | 7 |
9 | 213.452177 | 1.717779 | 20.315590 | 1.204029 | 50 | 100 | {'grid_len_lower_threshold': 50, 'grid_len_upp... | 0.481174 | 0.483147 | 0.471024 | 0.478448 | 0.005311 | 9 | -0.637902 | -0.644339 | -0.641084 | -0.641109 | 0.002628 | 6 | 0.034179 | 0.023032 | 0.033411 | 0.030207 | 0.005083 | 7 | 0.238569 | 0.230036 | 0.229567 | 0.232724 | 0.004138 | 6 | 0.769075 | 0.767635 | 0.759114 | 0.765275 | 0.004396 | 9 | 0.383413 | 0.387336 | 0.375940 | 0.382229 | 0.004727 | 4 | 0.520644 | 0.526995 | 0.517515 | 0.521718 | 0.003944 | 8 | 0.341529 | 0.348245 | 0.339413 | 0.343062 | 0.003765 | 8 |
cv_results.to_csv('./grid_size_CV.csv', index=False)
cv_results = cv_results[['mean_fit_time','mean_score_time',
'param_grid_len_lower_threshold', 'param_grid_len_upper_threshold',
'rank_test_spearmanr','rank_test_MSE','rank_test_PD','rank_test_R2',
'rank_test_AUC','rank_test_cohen_kappa','rank_test_f1','rank_test_average_precision'
]]
cv_results.columns = [i.replace('rank_test_','') for i in cv_results.columns]
Plot results¶
Now we plot these metrics to see which parameter set is optimal.
Ranking the performance for tasks and efficiency¶
We can average the metrics by target category:
- Classification
- Regression
- Training efficiency
- Testing (prediction) efficiency
cv_results['min_max'] = cv_results.apply(
lambda x:str(int(x['param_grid_len_lower_threshold']))+'_'+str(int(x['param_grid_len_upper_threshold'])),
axis=1
)
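The resulting labels encode each parameter pair as "lower_upper"; a quick look (values match the tables above):
print(cv_results['min_max'].tolist())
# ['5_10', '5_25', '5_50', '5_100', '10_25', '10_50', '10_100', '25_50', '25_100', '50_100']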
Ranking for classification tasks¶
classification_metrics = ['AUC', 'cohen_kappa', 'f1', 'average_precision']
cv_results['mean_ranking_cls'] = cv_results[classification_metrics].mean(axis=1)
cv_results.sort_values(by='mean_ranking_cls')
mean_fit_time | mean_score_time | param_grid_len_lower_threshold | param_grid_len_upper_threshold | spearmanr | MSE | PD | R2 | AUC | cohen_kappa | f1 | average_precision | min_max | mean_ranking_cls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 216.172207 | 19.466128 | 5 | 100 | 4 | 9 | 10 | 9 | 1 | 2 | 1 | 1 | 5_100 | 1.25 |
4 | 215.107877 | 20.273383 | 10 | 25 | 1 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 10_25 | 3.00 |
0 | 218.272017 | 20.046579 | 5 | 10 | 6 | 7 | 2 | 7 | 8 | 1 | 2 | 2 | 5_10 | 3.25 |
2 | 214.022419 | 19.413781 | 5 | 50 | 7 | 10 | 8 | 10 | 4 | 5 | 4 | 4 | 5_50 | 4.25 |
7 | 213.293670 | 19.170226 | 25 | 50 | 3 | 4 | 1 | 4 | 5 | 7 | 5 | 5 | 25_50 | 5.50 |
6 | 212.409846 | 19.197133 | 10 | 100 | 2 | 2 | 9 | 2 | 2 | 9 | 7 | 6 | 10_100 | 6.00 |
8 | 214.590513 | 19.493644 | 25 | 100 | 8 | 8 | 6 | 8 | 7 | 6 | 6 | 7 | 25_100 | 6.50 |
9 | 213.452177 | 20.315590 | 50 | 100 | 9 | 6 | 7 | 6 | 9 | 4 | 8 | 8 | 50_100 | 7.25 |
5 | 213.352489 | 19.362665 | 10 | 50 | 5 | 1 | 3 | 1 | 6 | 10 | 9 | 9 | 10_50 | 8.50 |
1 | 217.042172 | 20.449399 | 5 | 25 | 10 | 5 | 5 | 5 | 10 | 8 | 10 | 10 | 5_25 | 9.50 |
Ranking for regression tasks¶
regression_metrics = ['spearmanr','MSE','PD','R2']
cv_results['mean_ranking_reg'] = cv_results[regression_metrics].mean(axis=1)
cv_results.sort_values(by='mean_ranking_reg')
mean_fit_time | mean_score_time | param_grid_len_lower_threshold | param_grid_len_upper_threshold | spearmanr | MSE | PD | R2 | AUC | cohen_kappa | f1 | average_precision | min_max | mean_ranking_cls | mean_ranking_reg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 213.352489 | 19.362665 | 10 | 50 | 5 | 1 | 3 | 1 | 6 | 10 | 9 | 9 | 10_50 | 8.50 | 2.50 |
4 | 215.107877 | 20.273383 | 10 | 25 | 1 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 10_25 | 3.00 | 2.75 |
7 | 213.293670 | 19.170226 | 25 | 50 | 3 | 4 | 1 | 4 | 5 | 7 | 5 | 5 | 25_50 | 5.50 | 3.00 |
6 | 212.409846 | 19.197133 | 10 | 100 | 2 | 2 | 9 | 2 | 2 | 9 | 7 | 6 | 10_100 | 6.00 | 3.75 |
0 | 218.272017 | 20.046579 | 5 | 10 | 6 | 7 | 2 | 7 | 8 | 1 | 2 | 2 | 5_10 | 3.25 | 5.50 |
1 | 217.042172 | 20.449399 | 5 | 25 | 10 | 5 | 5 | 5 | 10 | 8 | 10 | 10 | 5_25 | 9.50 | 6.25 |
9 | 213.452177 | 20.315590 | 50 | 100 | 9 | 6 | 7 | 6 | 9 | 4 | 8 | 8 | 50_100 | 7.25 | 7.00 |
8 | 214.590513 | 19.493644 | 25 | 100 | 8 | 8 | 6 | 8 | 7 | 6 | 6 | 7 | 25_100 | 6.50 | 7.50 |
3 | 216.172207 | 19.466128 | 5 | 100 | 4 | 9 | 10 | 9 | 1 | 2 | 1 | 1 | 5_100 | 1.25 | 8.00 |
2 | 214.022419 | 19.413781 | 5 | 50 | 7 | 10 | 8 | 10 | 4 | 5 | 4 | 4 | 5_50 | 4.25 | 8.75 |
Ranking for time consumption¶
cv_results['mean_ranking_fit_time'] = cv_results['mean_fit_time'].rank()
cv_results.sort_values(by='mean_ranking_fit_time')
mean_fit_time | mean_score_time | param_grid_len_lower_threshold | param_grid_len_upper_threshold | spearmanr | MSE | PD | R2 | AUC | cohen_kappa | f1 | average_precision | min_max | mean_ranking_cls | mean_ranking_reg | mean_ranking_fit_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 212.409846 | 19.197133 | 10 | 100 | 2 | 2 | 9 | 2 | 2 | 9 | 7 | 6 | 10_100 | 6.00 | 3.75 | 1.0 |
7 | 213.293670 | 19.170226 | 25 | 50 | 3 | 4 | 1 | 4 | 5 | 7 | 5 | 5 | 25_50 | 5.50 | 3.00 | 2.0 |
5 | 213.352489 | 19.362665 | 10 | 50 | 5 | 1 | 3 | 1 | 6 | 10 | 9 | 9 | 10_50 | 8.50 | 2.50 | 3.0 |
9 | 213.452177 | 20.315590 | 50 | 100 | 9 | 6 | 7 | 6 | 9 | 4 | 8 | 8 | 50_100 | 7.25 | 7.00 | 4.0 |
2 | 214.022419 | 19.413781 | 5 | 50 | 7 | 10 | 8 | 10 | 4 | 5 | 4 | 4 | 5_50 | 4.25 | 8.75 | 5.0 |
8 | 214.590513 | 19.493644 | 25 | 100 | 8 | 8 | 6 | 8 | 7 | 6 | 6 | 7 | 25_100 | 6.50 | 7.50 | 6.0 |
4 | 215.107877 | 20.273383 | 10 | 25 | 1 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 10_25 | 3.00 | 2.75 | 7.0 |
3 | 216.172207 | 19.466128 | 5 | 100 | 4 | 9 | 10 | 9 | 1 | 2 | 1 | 1 | 5_100 | 1.25 | 8.00 | 8.0 |
1 | 217.042172 | 20.449399 | 5 | 25 | 10 | 5 | 5 | 5 | 10 | 8 | 10 | 10 | 5_25 | 9.50 | 6.25 | 9.0 |
0 | 218.272017 | 20.046579 | 5 | 10 | 6 | 7 | 2 | 7 | 8 | 1 | 2 | 2 | 5_10 | 3.25 | 5.50 | 10.0 |
cv_results['mean_ranking_score_time'] = cv_results['mean_score_time'].rank()
cv_results.sort_values(by='mean_ranking_score_time')
mean_fit_time | mean_score_time | param_grid_len_lower_threshold | param_grid_len_upper_threshold | spearmanr | MSE | PD | R2 | AUC | cohen_kappa | f1 | average_precision | min_max | mean_ranking_cls | mean_ranking_reg | mean_ranking_fit_time | mean_ranking_score_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 213.293670 | 19.170226 | 25 | 50 | 3 | 4 | 1 | 4 | 5 | 7 | 5 | 5 | 25_50 | 5.50 | 3.00 | 2.0 | 1.0 |
6 | 212.409846 | 19.197133 | 10 | 100 | 2 | 2 | 9 | 2 | 2 | 9 | 7 | 6 | 10_100 | 6.00 | 3.75 | 1.0 | 2.0 |
5 | 213.352489 | 19.362665 | 10 | 50 | 5 | 1 | 3 | 1 | 6 | 10 | 9 | 9 | 10_50 | 8.50 | 2.50 | 3.0 | 3.0 |
2 | 214.022419 | 19.413781 | 5 | 50 | 7 | 10 | 8 | 10 | 4 | 5 | 4 | 4 | 5_50 | 4.25 | 8.75 | 5.0 | 4.0 |
3 | 216.172207 | 19.466128 | 5 | 100 | 4 | 9 | 10 | 9 | 1 | 2 | 1 | 1 | 5_100 | 1.25 | 8.00 | 8.0 | 5.0 |
8 | 214.590513 | 19.493644 | 25 | 100 | 8 | 8 | 6 | 8 | 7 | 6 | 6 | 7 | 25_100 | 6.50 | 7.50 | 6.0 | 6.0 |
0 | 218.272017 | 20.046579 | 5 | 10 | 6 | 7 | 2 | 7 | 8 | 1 | 2 | 2 | 5_10 | 3.25 | 5.50 | 10.0 | 7.0 |
4 | 215.107877 | 20.273383 | 10 | 25 | 1 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 10_25 | 3.00 | 2.75 | 7.0 | 8.0 |
9 | 213.452177 | 20.315590 | 50 | 100 | 9 | 6 | 7 | 6 | 9 | 4 | 8 | 8 | 50_100 | 7.25 | 7.00 | 4.0 | 9.0 |
1 | 217.042172 | 20.449399 | 5 | 25 | 10 | 5 | 5 | 5 | 10 | 8 | 10 | 10 | 5_25 | 9.50 | 6.25 | 9.0 | 10.0 |
Finally, we plot these rankings as a bump chart:¶
fig, ax = plt.subplots(figsize=(8, 10))
to_plot = pd.concat([
cv_results.set_index('min_max')[['mean_ranking_cls']],
cv_results.set_index('min_max')[['mean_ranking_reg']],
cv_results.set_index('min_max')[['mean_ranking_fit_time']],
cv_results.set_index('min_max')[['mean_ranking_score_time']]
], axis=1).sort_values(by='mean_ranking_cls')
for index, line in to_plot.iterrows():
ax.plot([1,2,3,4], [line['mean_ranking_cls'], line['mean_ranking_reg'], line['mean_ranking_fit_time'], line['mean_ranking_score_time']], 'o-')
ax.set_yticks(to_plot['mean_ranking_cls'], list(to_plot.index))
yax2 = ax.secondary_yaxis("right")
yax2.set_yticks(to_plot['mean_ranking_score_time'], list(to_plot.index))
ax.set_xticks([1,2,3,4], ['Classification\nperformance', 'Regression\nperformance', 'Training\nefficiency', 'Testing\nefficiency'])
ax.set_ylabel('Ranking')
ax.invert_yaxis()
plt.show()
Conclusion¶
Reading the chart, 5_100
(minimum grid length of 5 units and maximum of 100) shows the best performance in classification, and 10_50
shows the best in regression, averaging the metrics across different measurements.
While the rankings vary, the absolute differences are small in our case (e.g., R2 varies only at the 0.01 level), meaning that all of these parameter sets may be acceptable. At the end of the day, the choice may depend more on your domain knowledge than on leaving the question to a gridsearch.
The above situation and rankings may vary greatly on a case-by-case basis, but in general I expect a medium grid size to strike the best balance between overfitting and underfitting.
From another perspective, you may want to use a larger grid size for efficiency, since AdaSTEM's speed bottleneck is the prediction/testing step. A larger grid size also gives you more flexibility for longer-distance prediction, because larger spatiotemporal regions are covered. Adjust the grid based on your needs.
Finally, refit the model with the best parameters:¶
# Here we choose the best parameters for regression, although the differences are really quite small.
best_params = {
'grid_len_lower_threshold':cv_results.sort_values(by='mean_ranking_reg')['param_grid_len_lower_threshold'].iloc[0],
'grid_len_upper_threshold':cv_results.sort_values(by='mean_ranking_reg')['param_grid_len_upper_threshold'].iloc[0]
}
print('Best parameters: ',best_params)
model = AdaSTEMRegressor(
base_model=Hurdle(
classifier=XGBClassifier(tree_method='hist',random_state=42, verbosity = 0, n_jobs=1),
regressor=XGBRegressor(tree_method='hist',random_state=42, verbosity = 0, n_jobs=1)
),
save_gridding_plot = True,
ensemble_fold=10,
min_ensemble_required=7,
# grid_len_upper_threshold=max_grid, # This is a gridsearch target
# grid_len_lower_threshold=min_grid, # This is a gridsearch target
points_lower_threshold=50,
Spatio1='longitude',
Spatio2 = 'latitude',
Temporal1 = 'DOY',
use_temporal_to_train=True,
n_jobs=4
)
from stemflow.model_selection import ST_train_test_split
X_train, X_test, y_train, y_test = ST_train_test_split(
X, y, Spatio_blocks_count=50, Temporal_blocks_count=50, random_state=42, test_size=0.3
)
model = model.set_params(**best_params)
model = model.fit(X_train, y_train)
Best parameters: {'grid_len_lower_threshold': 10, 'grid_len_upper_threshold': 50}
pred = model.predict(X_test)
pred = np.where(pred<0, 0, pred)
model.eval_STEM_res('hurdle', y_test.flatten()[pred>=0], pred[pred>=0]) # filter out NA
{'AUC': 0.7734218247408502, 'kappa': 0.3919169104759538, 'f1': 0.5259200702678963, 'precision': 0.3953359391506444, 'recall': 0.785320812129479, 'average_precision': 0.34602773988349644, 'Spearman_r': 0.47852384111905466, 'Pearson_r': 0.21818167240532746, 'R2': 0.008694469621954526, 'MAE': 4.090158236189759, 'MSE': 1256.4220053678623, 'poisson_deviance_explained': 0.12527780086832163}
Concluding mark¶
from watermark import watermark
print(watermark())
print(watermark(packages="stemflow,numpy,scipy,pandas,xgboost,tqdm,matplotlib,h3pandas,geopandas,sklearn"))
Last updated: 2024-09-21T15:46:56.990649-05:00

Python implementation: CPython
Python version       : 3.11.8
IPython version      : 8.22.1

Compiler    : Clang 16.0.6
OS          : Darwin
Release     : 23.1.0
Machine     : arm64
Processor   : arm
CPU cores   : 14
Architecture: 64bit

stemflow  : 1.1.1
numpy     : 1.26.4
scipy     : 1.14.1
pandas    : 2.2.3
xgboost   : 2.0.3
tqdm      : 4.66.5
matplotlib: 3.9.2
h3pandas  : 0.2.6
geopandas : 0.14.3
sklearn   : 1.5.2