Using Maxent as base models¶

Yangkang Chen
Jan 12, 2024

This notebook is to show how Maxent model in package elapid could be used as the base model of AdaSTEM framework.

The task we explore here will be the similar to AdaSTEM demo: Predict the occurrence of Mallard (a bird species) based on environmental variables. That is – species distribution model (SDM) (what is species distribution model and what is Maxent?).

The data were requested from eBird, a citizen science project for bird observation, and with some variable annotation.

In [2]:

Copied!





import pandas as pd
import numpy as np
import random
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import matplotlib
import warnings
import pickle
import os
import json

from stemflow.model.AdaSTEM import AdaSTEM, AdaSTEMClassifier, AdaSTEMRegressor
from xgboost import XGBClassifier, XGBRegressor
from stemflow.model.Hurdle import Hurdle_for_AdaSTEM, Hurdle
import elapid as ela # Please install elapid if you haven't
import pandas as pd
import numpy as np
import random
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import matplotlib
import warnings
import pickle
import os
import json

from stemflow.model.AdaSTEM import AdaSTEM, AdaSTEMClassifier, AdaSTEMRegressor
from xgboost import XGBClassifier, XGBRegressor
from stemflow.model.Hurdle import Hurdle_for_AdaSTEM, Hurdle
import elapid as ela # Please install elapid if you haven't

In [3]:

Copied!

%load_ext autoreload
%autoreload 2
%load_ext autoreload
%autoreload 2

Please download the sample data from:

Mallard: https://figshare.com/articles/dataset/Sample_data_Mallard_csv/24080745

Suppose now it's downloaded and saved as './Sample_data_Mallard.csv'

Alternatively, you can try other species like

Alder Flycatcher: https://figshare.com/articles/dataset/Sample_data_Alder_Flycatcher_csv/24080751
Short-eared Owl: https://figshare.com/articles/dataset/Sample_data_Short-eared_Owl_csv/24080742
Eurasian Tree Sparrow: https://figshare.com/articles/dataset/Sample_data_Eurasian_Tree_Sparrow_csv/24080748

Caveat: These bird observation data are about 200MB each file.

Load data¶

In [4]:

Copied!

data = pd.read_csv(f'./Sample_data_Mallard.csv')
data = data.drop('sampling_event_identifier', axis=1)
data = pd.read_csv(f'./Sample_data_Mallard.csv')
data = data.drop('sampling_event_identifier', axis=1)

In [5]:

Copied!

data.head()
data.head()

Out[5]:

	longitude	latitude	count	DOY	duration_minutes	Traveling	Stationary	effort_distance_km	number_observers	obsvr_species_count	time_observation_started_minute_of_day	elevation_mean	slope_mean	eastness_mean	northness_mean	bio1	bio2	bio3	bio4	bio5	bio6	bio7	bio8	bio9	bio10	bio11	bio12	bio13	bio14	bio15	bio16	bio17	bio18	bio19	cropland_or_natural_vegetation_mosaics	evergreen_broadleaf_forests	evergreen_needleleaf_forests	grasslands	mixed_forests	permanent_wetlands	savannas	urban_and_built_up_lands	water_bodies	woody_savannas	entropy
0	-83.472224	8.859308	0.0	22	300.0	1	0	4.828	5.0	34.0	476	7.555556	0.758156	0.036083	-0.021484	24.883502	5.174890	59.628088	93.482247	30.529131	21.850519	8.678612	24.302626	26.536822	26.213334	23.864924	0.720487	0.127594	0.003156	0.001451	0.332425	0.026401	0.044218	0.260672	0.000000	0.138889	0.000000	0.000000	0.000000	0.777778	0.000000	0.000000	0.083333	0.000000	0.676720
1	-2.687724	43.373323	9.0	290	90.0	1	0	0.570	2.0	151.0	1075	30.833336	3.376527	0.050544	-0.099299	14.107917	5.224109	31.174167	376.543853	23.219421	6.461607	16.757814	9.048385	19.092725	19.236082	9.287841	0.171423	0.035598	0.004512	0.000081	0.084657	0.018400	0.030210	0.065007	0.000000	0.333333	0.000000	0.000000	0.083333	0.000000	0.194444	0.027778	0.000000	0.361111	1.359063
2	-89.884770	35.087255	0.0	141	10.0	0	1	-1.000	2.0	678.0	575	91.777780	0.558100	-0.187924	-0.269078	17.396487	8.673912	28.688889	718.996078	32.948335	2.713938	30.234397	14.741099	13.759220	26.795849	7.747272	0.187089	0.031802	0.005878	0.000044	0.073328	0.026618	0.039616	0.059673	0.055556	0.000000	0.000000	0.305556	0.000000	0.000000	0.527778	0.000000	0.000000	0.111111	1.104278
3	-99.216873	31.218510	0.0	104	9.0	1	0	0.805	2.0	976.0	657	553.166700	0.856235	-0.347514	-0.342971	20.740836	10.665164	35.409121	666.796919	35.909941	5.790119	30.119822	18.444353	30.734456	29.546417	11.701038	0.084375	0.025289	0.000791	0.000052	0.052866	0.004096	0.006064	0.015965	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-0.000000
4	-124.426730	43.065847	2.0	96	30.0	1	0	0.161	2.0	654.0	600	6.500000	0.491816	-0.347794	-0.007017	11.822340	6.766870	35.672897	396.157833	22.608788	3.639569	18.969219	8.184412	16.290802	17.258721	7.319234	0.144122	0.044062	0.000211	0.000147	0.089238	0.004435	0.004822	0.040621	0.000000	0.361111	0.166667	0.000000	0.472222	0.000000	0.000000	0.000000	0.000000	0.000000	1.020754

Get X and y¶

In [6]:

Copied!

X = data.drop('count', axis=1)
y = data['count'].values
X = data.drop('count', axis=1)
y = data['count'].values

First thing first: Spatio-temporal train test split¶

In [35]:

Copied!





from stemflow.model_selection import ST_CV
CV = 5
CV_generator = ST_CV(X, y, 
                    Spatio_blocks_count = 50, Temporal_blocks_count=50,
                    random_state=42, CV=CV)
from stemflow.model_selection import ST_CV
CV = 5
CV_generator = ST_CV(X, y, 
                    Spatio_blocks_count = 50, Temporal_blocks_count=50,
                    random_state=42, CV=CV)

We use a spatial temporal train-test-split here. Each dimension of space and time are split into 50 bins, and 30% blocks are randomly selected as test data, with the rest as training data. For more discussion of these parameters please refer to AdaSTEM demo

Train an AdaSTEM hurdle model¶

We create a model instance and pass ela.MaxentModel model class as base model:

In [47]:

Copied!





## create model instance
model = AdaSTEMClassifier(base_model=ela.MaxentModel(transform='cloglog', beta_multiplier=2.0),
                                save_gridding_plot = True,
                                ensemble_fold=10, 
                                min_ensemble_required=7,
                                grid_len_upper_threshold=50,
                                grid_len_lower_threshold=5,
                                temporal_step=50,
                                temporal_bin_interval=50,
                                points_lower_threshold=100, n_jobs=1)
## create model instance
model = AdaSTEMClassifier(base_model=ela.MaxentModel(transform='cloglog', beta_multiplier=2.0),
                                save_gridding_plot = True,
                                ensemble_fold=10, 
                                min_ensemble_required=7,
                                grid_len_upper_threshold=50,
                                grid_len_lower_threshold=5,
                                temporal_step=50,
                                temporal_bin_interval=50,
                                points_lower_threshold=100, n_jobs=1)

For more discussion on parameters please see AdaSTEM demo and Optimizing stixel size.

Next, we can train the model and evaluate it using test set on each CV partition:

In [48]:

Copied!





metric_dict_list = []
for X_train, X_test, y_train, y_test in tqdm(CV_generator, total=CV):
    ## fit model
    model.fit(X_train, np.where(y_train>0,1,0), verbosity=0)
    ## predict
    pred_adastem = model.predict(X_test, verbosity=0)
    ## save prediction results
    pred_df = pd.DataFrame({
                'y_true':y_test.flatten(),
                'y_pred_adastem':np.where(pred_adastem.flatten()<0, 0, pred_adastem.flatten()),
            }).dropna()

    ## calculate metrics
    metric_dict = AdaSTEM.eval_STEM_res('hurdle', np.array(pred_df.y_true).flatten(), 
                                                np.where(np.array(pred_df.y_pred_adastem).flatten()<0, 0, np.array(pred_df.y_pred_adastem).flatten())
                                                )
    metric_dict_list.append(metric_dict)
    
metric_dict_list = []
for X_train, X_test, y_train, y_test in tqdm(CV_generator, total=CV):
    ## fit model
    model.fit(X_train, np.where(y_train>0,1,0), verbosity=0)
    ## predict
    pred_adastem = model.predict(X_test, verbosity=0)
    ## save prediction results
    pred_df = pd.DataFrame({
                'y_true':y_test.flatten(),
                'y_pred_adastem':np.where(pred_adastem.flatten()<0, 0, pred_adastem.flatten()),
            }).dropna()

    ## calculate metrics
    metric_dict = AdaSTEM.eval_STEM_res('hurdle', np.array(pred_df.y_true).flatten(), 
                                                np.where(np.array(pred_df.y_pred_adastem).flatten()<0, 0, np.array(pred_df.y_pred_adastem).flatten())
                                                )
    metric_dict_list.append(metric_dict)
    

100%|██████████| 5/5 [3:14:47<00:00, 2337.55s/it]

Take a look at the metrics:

In [49]:

Copied!

adastem_metrics = pd.DataFrame(metric_dict_list)[['AUC','kappa','f1','precision','recall','average_precision']]
adastem_metrics
adastem_metrics = pd.DataFrame(metric_dict_list)[['AUC','kappa','f1','precision','recall','average_precision']]
adastem_metrics

Out[49]:

	AUC	kappa	f1	precision	recall	average_precision
0	0.736316	0.401915	0.526669	0.445225	0.644579	0.349655
1	0.720966	0.377922	0.509609	0.432525	0.620129	0.336472
2	0.727614	0.387215	0.520642	0.439436	0.638664	0.347057
3	0.724253	0.382455	0.512133	0.433795	0.625000	0.337823
4	0.727759	0.390803	0.521628	0.443649	0.632864	0.347826

Save the dataframe:

In [50]:

Copied!

adastem_metrics.to_csv('./Maxent_AdasTEM_metrics.csv', index=False)
adastem_metrics.to_csv('./Maxent_AdasTEM_metrics.csv', index=False)

Model the occurrence using basic Maxent model¶

To compare with AdaSTEM model, we also train a basic Maxent model:

In [41]:

Copied!

## create model instance
model_me = ela.MaxentModel(transform='cloglog', beta_multiplier=2.0)
## create model instance
model_me = ela.MaxentModel(transform='cloglog', beta_multiplier=2.0)

In [43]:

Copied!





CV_generator = ST_CV(X, y, 
                    Spatio_blocks_count = 50, Temporal_blocks_count=50,
                    random_state=42, CV=CV)

metric_dict_list = []
for X_train, X_test, y_train, y_test in tqdm(CV_generator, total=CV):
    ## fit model
    model_me.fit(X_train.drop(['longitude','latitude'], axis=1), np.where(y_train>0,1,0))
    ## predict on test set
    pred_me = model_me.predict(X_test.drop(['longitude','latitude'], axis=1))
    ## save prediction results
    pred_df = pd.DataFrame({
        'y_true':y_test.flatten(),
        'y_pred':np.where(pred_me.flatten()>0.5, 1, 0)
    }).dropna()
    ## calculate metrics
    metrics_me = AdaSTEM.eval_STEM_res('hurdle', pred_df.y_true, pred_df.y_pred)
    metric_dict_list.append(metrics_me)
    
CV_generator = ST_CV(X, y, 
                    Spatio_blocks_count = 50, Temporal_blocks_count=50,
                    random_state=42, CV=CV)

metric_dict_list = []
for X_train, X_test, y_train, y_test in tqdm(CV_generator, total=CV):
    ## fit model
    model_me.fit(X_train.drop(['longitude','latitude'], axis=1), np.where(y_train>0,1,0))
    ## predict on test set
    pred_me = model_me.predict(X_test.drop(['longitude','latitude'], axis=1))
    ## save prediction results
    pred_df = pd.DataFrame({
        'y_true':y_test.flatten(),
        'y_pred':np.where(pred_me.flatten()>0.5, 1, 0)
    }).dropna()
    ## calculate metrics
    metrics_me = AdaSTEM.eval_STEM_res('hurdle', pred_df.y_true, pred_df.y_pred)
    metric_dict_list.append(metrics_me)
    

  0%|          | 0/5 [00:00<?, ?it/s]

100%|██████████| 5/5 [3:48:12<00:00, 2738.41s/it]

In [44]:

Copied!

maxent_metrics = pd.DataFrame(metric_dict_list)[['AUC','kappa','f1','precision','recall','average_precision']]
maxent_metrics
maxent_metrics = pd.DataFrame(metric_dict_list)[['AUC','kappa','f1','precision','recall','average_precision']]
maxent_metrics

Out[44]:

	AUC	kappa	f1	precision	recall	average_precision
0	0.721753	0.325763	0.471622	0.359857	0.684088	0.298320
1	0.706076	0.306849	0.458735	0.352637	0.656152	0.289303
2	0.714194	0.318858	0.471624	0.362087	0.676178	0.300594
3	0.708874	0.320062	0.465677	0.364793	0.643690	0.294589
4	0.711749	0.316275	0.468793	0.360737	0.669266	0.298200

In [45]:

Copied!

maxent_metrics.to_csv('./Maxent_simple_metrics.csv', index=False)
maxent_metrics.to_csv('./Maxent_simple_metrics.csv', index=False)

Comparing AdaSTEM-Maxent & Maxent¶

Now we can visualize the comparison:

In [71]:

Copied!





import seaborn as sns
adastem_metrics['model'] = 'AdaSTEM-Maxent'
maxent_metrics['model'] = 'Maxent'
new_dat = pd.concat([adastem_metrics, maxent_metrics], axis=0)
dat_for_plot = []
for index,row in new_dat.iterrows():
    for metric_name in ['AUC', 'kappa', 'f1', 'precision', 'recall', 'average_precision']:
        dat_for_plot.append({
            'model':row['model'],
            'metric_name':metric_name,
            'value':row[metric_name]
        })
dat_for_plot = pd.DataFrame(dat_for_plot)

sns.barplot(data=dat_for_plot, errorbar=('ci', 95), 
            x="metric_name", y="value", hue="model")
plt.title('Model performance')
plt.legend(bbox_to_anchor=(1,1))
plt.show()

import seaborn as sns
adastem_metrics['model'] = 'AdaSTEM-Maxent'
maxent_metrics['model'] = 'Maxent'
new_dat = pd.concat([adastem_metrics, maxent_metrics], axis=0)
dat_for_plot = []
for index,row in new_dat.iterrows():
    for metric_name in ['AUC', 'kappa', 'f1', 'precision', 'recall', 'average_precision']:
        dat_for_plot.append({
            'model':row['model'],
            'metric_name':metric_name,
            'value':row[metric_name]
        })
dat_for_plot = pd.DataFrame(dat_for_plot)

sns.barplot(data=dat_for_plot, errorbar=('ci', 95), 
            x="metric_name", y="value", hue="model")
plt.title('Model performance')
plt.legend(bbox_to_anchor=(1,1))
plt.show()

No description has been provided for this image

Conclusion¶

AdaSTEM-Maxent model outperform the naive Maxent model in all metrics except recall. Interestingly, this indicate that AdaSTEM wrapper makes the Maxent model more conserve, but increase the overall performance in terms of AUC, kappa, and average precision. The improvement ranges from little to median level.

Besides, the two modeling framework consumes similar resources: both about 3 hours. Consequently, it could be a good choice to wrap the Maxent with AdaSTEM if needed.

In [72]:

Copied!

from watermark import watermark
print(watermark())
print(watermark(packages="stemflow,numpy,scipy,pandas,xgboost,tqdm,matplotlib,h3pandas,geopandas,scikit-learn,watermark"))
from watermark import watermark
print(watermark())
print(watermark(packages="stemflow,numpy,scipy,pandas,xgboost,tqdm,matplotlib,h3pandas,geopandas,scikit-learn,watermark"))

Last updated: 2023-09-15T10:05:30.738065+08:00

Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.15.0

Compiler    : Clang 15.0.7 
OS          : Darwin
Release     : 21.6.0
Machine     : arm64
Processor   : arm
CPU cores   : 8
Architecture: 64bit

stemflow    : 0.0.20
numpy       : 1.25.2
scipy       : 1.11.2
pandas      : 2.1.0
xgboost     : 2.0.0
tqdm        : 4.66.1
matplotlib  : 3.7.3
h3pandas    : 0.2.4
geopandas   : 0.13.2
scikit-learn: 1.3.0
watermark   : 2.4.3

In [ ]: