Using Maxent as base models¶
Yangkang Chen
Jan 12, 2024
This notebook is to show how Maxent model in package elapid
could be used as the base model
of AdaSTEM framework.
The task we explore here will be the similar to AdaSTEM demo: Predict the occurrence of Mallard (a bird species) based on environmental variables. That is – species distribution model (SDM) (what is species distribution model and what is Maxent?).
The data were requested from eBird, a citizen science project for bird observation, and with some variable annotation.
import pandas as pd
import numpy as np
import random
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import matplotlib
import warnings
import pickle
import os
import json
from stemflow.model.AdaSTEM import AdaSTEM, AdaSTEMClassifier, AdaSTEMRegressor
from xgboost import XGBClassifier, XGBRegressor
from stemflow.model.Hurdle import Hurdle_for_AdaSTEM, Hurdle
import elapid as ela # Please install elapid if you haven't
%matplotlib inline
%load_ext autoreload
%autoreload 2
Please download the sample data from:
Suppose now it's downloaded and saved as './Sample_data_Mallard.csv'
Alternatively, you can try other species like
- Alder Flycatcher: https://figshare.com/articles/dataset/Sample_data_Alder_Flycatcher_csv/24080751
- Short-eared Owl: https://figshare.com/articles/dataset/Sample_data_Short-eared_Owl_csv/24080742
- Eurasian Tree Sparrow: https://figshare.com/articles/dataset/Sample_data_Eurasian_Tree_Sparrow_csv/24080748
Caveat: These bird observation data are about 200MB each file.
Load data¶
data = pd.read_csv(f'./Sample_data_Mallard.csv')
data = data.drop('sampling_event_identifier', axis=1)
data.head()
longitude | latitude | count | DOY | duration_minutes | Traveling | Stationary | Area | effort_distance_km | number_observers | obsvr_species_count | time_observation_started_minute_of_day | elevation_mean | slope_mean | eastness_mean | northness_mean | bio1 | bio2 | bio3 | bio4 | bio5 | bio6 | bio7 | bio8 | bio9 | bio10 | bio11 | bio12 | bio13 | bio14 | bio15 | bio16 | bio17 | bio18 | bio19 | closed_shrublands | cropland_or_natural_vegetation_mosaics | croplands | deciduous_broadleaf_forests | deciduous_needleleaf_forests | evergreen_broadleaf_forests | evergreen_needleleaf_forests | grasslands | mixed_forests | non_vegetated_lands | open_shrublands | permanent_wetlands | savannas | urban_and_built_up_lands | water_bodies | woody_savannas | entropy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -83.472224 | 8.859308 | 0.0 | 22 | 300.0 | 1 | 0 | 0 | 4.828 | 5.0 | 34.0 | 476 | 7.555556 | 0.758156 | 0.036083 | -0.021484 | 24.883502 | 5.174890 | 59.628088 | 93.482247 | 30.529131 | 21.850519 | 8.678612 | 24.302626 | 26.536822 | 26.213334 | 23.864924 | 0.720487 | 0.127594 | 0.003156 | 0.001451 | 0.332425 | 0.026401 | 0.044218 | 0.260672 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.138889 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.777778 | 0.000000 | 0.000000 | 0.083333 | 0.000000 | 0.676720 |
1 | -2.687724 | 43.373323 | 9.0 | 290 | 90.0 | 1 | 0 | 0 | 0.570 | 2.0 | 151.0 | 1075 | 30.833336 | 3.376527 | 0.050544 | -0.099299 | 14.107917 | 5.224109 | 31.174167 | 376.543853 | 23.219421 | 6.461607 | 16.757814 | 9.048385 | 19.092725 | 19.236082 | 9.287841 | 0.171423 | 0.035598 | 0.004512 | 0.000081 | 0.084657 | 0.018400 | 0.030210 | 0.065007 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.000000 | 0.000000 | 0.083333 | 0.0 | 0.0 | 0.000000 | 0.194444 | 0.027778 | 0.000000 | 0.361111 | 1.359063 |
2 | -89.884770 | 35.087255 | 0.0 | 141 | 10.0 | 0 | 1 | 0 | -1.000 | 2.0 | 678.0 | 575 | 91.777780 | 0.558100 | -0.187924 | -0.269078 | 17.396487 | 8.673912 | 28.688889 | 718.996078 | 32.948335 | 2.713938 | 30.234397 | 14.741099 | 13.759220 | 26.795849 | 7.747272 | 0.187089 | 0.031802 | 0.005878 | 0.000044 | 0.073328 | 0.026618 | 0.039616 | 0.059673 | 0.0 | 0.055556 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.305556 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.527778 | 0.000000 | 0.000000 | 0.111111 | 1.104278 |
3 | -99.216873 | 31.218510 | 0.0 | 104 | 9.0 | 1 | 0 | 0 | 0.805 | 2.0 | 976.0 | 657 | 553.166700 | 0.856235 | -0.347514 | -0.342971 | 20.740836 | 10.665164 | 35.409121 | 666.796919 | 35.909941 | 5.790119 | 30.119822 | 18.444353 | 30.734456 | 29.546417 | 11.701038 | 0.084375 | 0.025289 | 0.000791 | 0.000052 | 0.052866 | 0.004096 | 0.006064 | 0.015965 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.000000 |
4 | -124.426730 | 43.065847 | 2.0 | 96 | 30.0 | 1 | 0 | 0 | 0.161 | 2.0 | 654.0 | 600 | 6.500000 | 0.491816 | -0.347794 | -0.007017 | 11.822340 | 6.766870 | 35.672897 | 396.157833 | 22.608788 | 3.639569 | 18.969219 | 8.184412 | 16.290802 | 17.258721 | 7.319234 | 0.144122 | 0.044062 | 0.000211 | 0.000147 | 0.089238 | 0.004435 | 0.004822 | 0.040621 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.361111 | 0.166667 | 0.000000 | 0.472222 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.020754 |
Get X and y¶
X = data.drop('count', axis=1)
y = data['count'].values
First thing first: Spatio-temporal train test split¶
from stemflow.model_selection import ST_CV
CV = 5
CV_generator = ST_CV(X, y,
Spatio_blocks_count = 50, Temporal_blocks_count=50,
random_state=42, CV=CV)
We use a spatial temporal train-test-split here. Each dimension of space and time are split into 50 bins, and 30% blocks are randomly selected as test data, with the rest as training data. For more discussion of these parameters please refer to AdaSTEM demo
Train an AdaSTEM hurdle model¶
We create a model instance and pass ela.MaxentModel
model class as base model:
## create model instance
model = AdaSTEMClassifier(base_model=ela.MaxentModel(transform='cloglog', beta_multiplier=2.0),
save_gridding_plot = True,
ensemble_fold=10,
min_ensemble_required=7,
grid_len_upper_threshold=50,
grid_len_lower_threshold=5,
temporal_step=50,
temporal_bin_interval=50,
points_lower_threshold=100, n_jobs=1)
For more discussion on parameters please see AdaSTEM demo and Optimizing stixel size.
Next, we can train the model and evaluate it using test set on each CV partition:
metric_dict_list = []
for X_train, X_test, y_train, y_test in tqdm(CV_generator, total=CV):
## fit model
model.fit(X_train, np.where(y_train>0,1,0), verbosity=0)
## predict
pred_adastem = model.predict(X_test, verbosity=0)
## save prediction results
pred_df = pd.DataFrame({
'y_true':y_test.flatten(),
'y_pred_adastem':np.where(pred_adastem.flatten()<0, 0, pred_adastem.flatten()),
}).dropna()
## calculate metrics
metric_dict = AdaSTEM.eval_STEM_res('hurdle', np.array(pred_df.y_true).flatten(),
np.where(np.array(pred_df.y_pred_adastem).flatten()<0, 0, np.array(pred_df.y_pred_adastem).flatten())
)
metric_dict_list.append(metric_dict)
100%|██████████| 5/5 [3:14:47<00:00, 2337.55s/it]
Take a look at the metrics:
adastem_metrics = pd.DataFrame(metric_dict_list)[['AUC','kappa','f1','precision','recall','average_precision']]
adastem_metrics
AUC | kappa | f1 | precision | recall | average_precision | |
---|---|---|---|---|---|---|
0 | 0.736316 | 0.401915 | 0.526669 | 0.445225 | 0.644579 | 0.349655 |
1 | 0.720966 | 0.377922 | 0.509609 | 0.432525 | 0.620129 | 0.336472 |
2 | 0.727614 | 0.387215 | 0.520642 | 0.439436 | 0.638664 | 0.347057 |
3 | 0.724253 | 0.382455 | 0.512133 | 0.433795 | 0.625000 | 0.337823 |
4 | 0.727759 | 0.390803 | 0.521628 | 0.443649 | 0.632864 | 0.347826 |
Save the dataframe:
adastem_metrics.to_csv('./Maxent_AdasTEM_metrics.csv', index=False)
Model the occurrence using basic Maxent model¶
To compare with AdaSTEM model, we also train a basic Maxent model:
## create model instance
model_me = ela.MaxentModel(transform='cloglog', beta_multiplier=2.0)
CV_generator = ST_CV(X, y,
Spatio_blocks_count = 50, Temporal_blocks_count=50,
random_state=42, CV=CV)
metric_dict_list = []
for X_train, X_test, y_train, y_test in tqdm(CV_generator, total=CV):
## fit model
model_me.fit(X_train.drop(['longitude','latitude'], axis=1), np.where(y_train>0,1,0))
## predict on test set
pred_me = model_me.predict(X_test.drop(['longitude','latitude'], axis=1))
## save prediction results
pred_df = pd.DataFrame({
'y_true':y_test.flatten(),
'y_pred':np.where(pred_me.flatten()>0.5, 1, 0)
}).dropna()
## calculate metrics
metrics_me = AdaSTEM.eval_STEM_res('hurdle', pred_df.y_true, pred_df.y_pred)
metric_dict_list.append(metrics_me)
0%| | 0/5 [00:00<?, ?it/s]
100%|██████████| 5/5 [3:48:12<00:00, 2738.41s/it]
maxent_metrics = pd.DataFrame(metric_dict_list)[['AUC','kappa','f1','precision','recall','average_precision']]
maxent_metrics
AUC | kappa | f1 | precision | recall | average_precision | |
---|---|---|---|---|---|---|
0 | 0.721753 | 0.325763 | 0.471622 | 0.359857 | 0.684088 | 0.298320 |
1 | 0.706076 | 0.306849 | 0.458735 | 0.352637 | 0.656152 | 0.289303 |
2 | 0.714194 | 0.318858 | 0.471624 | 0.362087 | 0.676178 | 0.300594 |
3 | 0.708874 | 0.320062 | 0.465677 | 0.364793 | 0.643690 | 0.294589 |
4 | 0.711749 | 0.316275 | 0.468793 | 0.360737 | 0.669266 | 0.298200 |
maxent_metrics.to_csv('./Maxent_simple_metrics.csv', index=False)
Comparing AdaSTEM-Maxent & Maxent¶
Now we can visualize the comparison:
import seaborn as sns
adastem_metrics['model'] = 'AdaSTEM-Maxent'
maxent_metrics['model'] = 'Maxent'
new_dat = pd.concat([adastem_metrics, maxent_metrics], axis=0)
dat_for_plot = []
for index,row in new_dat.iterrows():
for metric_name in ['AUC', 'kappa', 'f1', 'precision', 'recall', 'average_precision']:
dat_for_plot.append({
'model':row['model'],
'metric_name':metric_name,
'value':row[metric_name]
})
dat_for_plot = pd.DataFrame(dat_for_plot)
sns.barplot(data=dat_for_plot, errorbar=('ci', 95),
x="metric_name", y="value", hue="model")
plt.title('Model performance')
plt.legend(bbox_to_anchor=(1,1))
plt.show()
Conclusion¶
AdaSTEM-Maxent model outperform the naive Maxent model in all metrics except recall. Interestingly, this indicate that AdaSTEM wrapper makes the Maxent model more conserve, but increase the overall performance in terms of AUC, kappa, and average precision. The improvement ranges from little to median level.
Besides, the two modeling framework consumes similar resources: both about 3 hours. Consequently, it could be a good choice to wrap the Maxent with AdaSTEM if needed.
from watermark import watermark
print(watermark())
print(watermark(packages="stemflow,numpy,scipy,pandas,xgboost,tqdm,matplotlib,h3pandas,geopandas,scikit-learn,watermark"))
Last updated: 2023-09-15T10:05:30.738065+08:00 Python implementation: CPython Python version : 3.11.5 IPython version : 8.15.0 Compiler : Clang 15.0.7 OS : Darwin Release : 21.6.0 Machine : arm64 Processor : arm CPU cores : 8 Architecture: 64bit stemflow : 0.0.20 numpy : 1.25.2 scipy : 1.11.2 pandas : 2.1.0 xgboost : 2.0.0 tqdm : 4.66.1 matplotlib : 3.7.3 h3pandas : 0.2.4 geopandas : 0.13.2 scikit-learn: 1.3.0 watermark : 2.4.3