Machine Learning On Marble#
This tutorial seeks to outline the steps typically taken in building a machine learning model. You will be introduced to terms and common libraries used for model building.
Before diving into any machine learning endeavor, it’s crucial to have a clear understanding of the target variable you aim to predict or map. In this context, our goal is to uncover which atmospheric variables hold the most influence in predicting total monthly surface precipitation in Singapore. Since precipitation is a continuous value, our task calls for a regression model.
To embark on this journey, let’s kick things off by importing the essential libraries needed to conduct our analysis.
#scientific computing
import numpy as np
import pandas as pd
import xarray as xr
import pooch
from matplotlib import pyplot as plt
%matplotlib inline
#mapping
import cartopy.crs as ccrs
import cartopy as cp
import cartopy.feature as cfeature
#EDA
from ydata_profiling import ProfileReport
#filter warnings
import warnings
warnings.filterwarnings('ignore')
#time
from datetime import datetime
#models
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import HistGradientBoostingRegressor, AdaBoostRegressor
#metrics and model tuning
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
#random_state / random_seed
seed = 0
Among the familiar tools like numpy and xarray, you’ll discover a league of powerful newcomers: xgboost, sklearn, lightgbm, and catboost. These libraries hold the keys to the world of gradient boosting tree models. Imagine these models as a band of specialists, each refining its craft to mend the errors made by its predecessors. It’s like building an army of sharp minds where each new recruit shores up the weaknesses of the previous ones. Together, they create an ensemble model that’s far mightier than any individual learner. After all, when it comes to creating impactful models, strength truly lies in numbers!
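To make the boosting idea concrete, here is a minimal hand-rolled sketch on a synthetic toy dataset (purely illustrative, not part of the tutorial's pipeline): each new shallow tree is fit to the residual errors of the ensemble built so far, and its scaled predictions are added to the running total. The dedicated libraries above do essentially this, with many refinements.
#minimal sketch of boosting on a synthetic toy dataset (illustrative only)
from sklearn.tree import DecisionTreeRegressor
rng = np.random.default_rng(seed)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, 200)
learning_rate = 0.1
prediction = np.zeros_like(y_toy)               #the ensemble starts by predicting zero
for _ in range(50):
    residual = y_toy - prediction               #errors left by the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    prediction += learning_rate * tree.predict(X_toy)   #each new tree corrects its predecessors
print(f"Training MSE after boosting: {np.mean((y_toy - prediction)**2):.4f}")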
Loading Data#
%%time
data_dict={'tas':'https://redoak.cs.toronto.edu/twitcher/ows/proxy/thredds/fileServer/datasets/CMIP6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Amon/tas/gn/v20190514/tas_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc',
'psl':'https://redoak.cs.toronto.edu/twitcher/ows/proxy/thredds/fileServer/datasets/CMIP6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Amon/psl/gn/v20190514/psl_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc',
'pr':'https://redoak.cs.toronto.edu/twitcher/ows/proxy/thredds/fileServer/datasets/CMIP6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Amon/pr/gn/v20190514/pr_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc',
'huss':'https://redoak.cs.toronto.edu/twitcher/ows/proxy/thredds/fileServer/datasets/CMIP6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Amon/huss/gn/v20190514/huss_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc',
'hurs':'https://redoak.cs.toronto.edu/twitcher/ows/proxy/thredds/fileServer/datasets/CMIP6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Amon/hurs/gn/v20190514/hurs_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc',
'clt':'https://redoak.cs.toronto.edu/twitcher/ows/proxy/thredds/fileServer/datasets/CMIP6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Amon/clt/gn/v20190514/clt_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc'
}
data_paths = {}
for variable, url in data_dict.items():
    path = pooch.retrieve(url, known_hash=None)
    data_paths[variable] = path
data_paths
CPU times: user 552 µs, sys: 1.13 ms, total: 1.69 ms
Wall time: 1.26 ms
{'tas': '/home/jovyan/.cache/pooch/2e6aa3d311a810a2e568c56910e8c7f3-tas_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc',
'psl': '/home/jovyan/.cache/pooch/55a2f0c6b1eec6b87f61bea0c8f42bac-psl_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc',
'pr': '/home/jovyan/.cache/pooch/1910086169d33e6536538196b07a41b1-pr_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc',
'huss': '/home/jovyan/.cache/pooch/39ad9111eaf9d3a09ec78f48697f679d-huss_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc',
'hurs': '/home/jovyan/.cache/pooch/8d3d8e88e8bc674f79ece1924eca3777-hurs_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc',
'clt': '/home/jovyan/.cache/pooch/4e9fc634b3f7956b2499e90edbe04832-clt_Amon_CESM2_historical_r11i1p1f1_gn_185001-201412.nc'}
Here, we retrieve our data from a THREDDS server and organize it into a dictionary that pairs variable names with their respective data file locations. Leveraging this setup, we employ xarray.open_mfdataset to efficiently load and merge these files into a single dataset, simplifying data manipulation.
%%time
#open and merge all variable files into a single dataset
data = xr.open_mfdataset(data_paths.values())
data
CPU times: user 28.7 s, sys: 3.35 s, total: 32 s
Wall time: 28.5 s
<xarray.Dataset>
Dimensions:    (time: 1980, lat: 192, lon: 288, nbnd: 2)
Coordinates:
  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
  * time       (time) object 1850-01-15 12:00:00 ... 2014-12-15 12:00:00
Dimensions without coordinates: nbnd
Data variables:
    clt        (time, lat, lon) float32 dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    time_bnds  (time, nbnd) object dask.array<chunksize=(1, 2), meta=np.ndarray>
    lat_bnds   (lat, nbnd) float64 dask.array<chunksize=(192, 2), meta=np.ndarray>
    lon_bnds   (lon, nbnd) float64 dask.array<chunksize=(288, 2), meta=np.ndarray>
    hurs       (time, lat, lon) float32 dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    huss       (time, lat, lon) float32 dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    pr         (time, lat, lon) float32 dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    psl        (time, lat, lon) float32 dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
    tas        (time, lat, lon) float32 dask.array<chunksize=(1, 192, 288), meta=np.ndarray>
Attributes: (12/45)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   674885.0
    branch_time_in_parent:  219000.0
    case_id:                972
    ...                     ...
    sub_experiment_id:      none
    table_id:               Amon
    variable_id:            clt
    variant_info:           CMIP6 20th century experiments (1850-2014) with C...
    variant_label:          r11i1p1f1
    history:                This file was generated on Fri Nov 3 21:15:57 20...
You might have observed a distinctive aspect of this xarray dataset. The xarray.open_mfdataset function seamlessly integrates with the dask library, which is specifically designed for handling big data computations. It conveniently partitions your data into manageable chunks, preventing the need to load the entire set into memory simultaneously. This approach significantly improves resource management.
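If you want finer control over how dask splits the data, open_mfdataset also accepts an explicit chunks argument. The 120-month chunk below (ten years of monthly data) is just an illustrative choice, not a recommendation:
#optional: open the same files with explicit dask chunks (illustrative chunk size)
data_chunked = xr.open_mfdataset(data_paths.values(), chunks={"time": 120})
print(data_chunked["pr"].chunks)                 #chunk sizes along each dimension of pr
print(f"{data_chunked.nbytes / 1e9:.2f} GB if fully loaded into memory")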
Let’s visualise our dataset to get a better idea of our variables.
%%time
#make a list of variables to plot
vars = list(data_paths.keys())
plt.figure(figsize=(15,12))
#iterate over the variables, one map subplot per variable
for i, var in enumerate(vars):
    ax = plt.subplot(3, 2, i+1, projection=ccrs.PlateCarree())
    #plot coastlines, states & borders
    ax.add_feature(cfeature.STATES, edgecolor='black', linewidth=0.5)
    ax.add_feature(cfeature.BORDERS, edgecolor='black', linewidth=1)
    ax.coastlines()
    #plot the time-mean field for this variable
    data[var].mean(dim='time').plot(ax=ax, transform=ccrs.PlateCarree())
    ax.set_title(f"Mean of {var}")
    ax.gridlines(crs=ccrs.PlateCarree(), draw_labels=True, linewidth=1, color='gray', alpha=0.5, linestyle='-')
plt.tight_layout()
plt.show()
CPU times: user 45 s, sys: 3.35 s, total: 48.4 s
Wall time: 30.6 s
This analysis provides valuable insights into each variable’s distribution:
tas: Indicates a consistent decrease in temperature moving towards the poles.
psl: Illustrates the locations of high-pressure zones across the surface.
pr: Reveals persistent high rainfall around the equator and on the eastern and northern peripheries of high-pressure zones.
huss: Highest in the tropics, indicating regions where the air can retain more water vapor.
hurs: Demonstrates lower average values over continents along high-pressure belts.
clt: Shows a similar distribution to hurs, with fewer clouds in areas with lower humidity and/or higher pressure.
This understanding helps in interpreting the climatic conditions and dynamics across different geographical regions.
However, that is not our focus here; what we want to know is which of these variables has the greatest power in predicting precipitation over Singapore over time.
%%time
#singapore coordinates (approximately 1.35°N, 103.82°E)
lat = 1.3521
lon = 103.8198
singapore = data[vars].sel(lat=lat, lon=lon, method='nearest')
singapore
CPU times: user 21.8 ms, sys: 367 µs, total: 22.2 ms
Wall time: 21.2 ms
<xarray.Dataset>
Dimensions:  (time: 1980)
Coordinates:
    lat      float64 1.414
    lon      float64 1.25
  * time     (time) object 1850-01-15 12:00:00 ... 2014-12-15 12:00:00
Data variables:
    tas      (time) float32 dask.array<chunksize=(1,), meta=np.ndarray>
    psl      (time) float32 dask.array<chunksize=(1,), meta=np.ndarray>
    pr       (time) float32 dask.array<chunksize=(1,), meta=np.ndarray>
    huss     (time) float32 dask.array<chunksize=(1,), meta=np.ndarray>
    hurs     (time) float32 dask.array<chunksize=(1,), meta=np.ndarray>
    clt      (time) float32 dask.array<chunksize=(1,), meta=np.ndarray>
Attributes: (12/45)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   674885.0
    branch_time_in_parent:  219000.0
    case_id:                972
    ...                     ...
    sub_experiment_id:      none
    table_id:               Amon
    variable_id:            clt
    variant_info:           CMIP6 20th century experiments (1850-2014) with C...
    variant_label:          r11i1p1f1
    history:                This file was generated on Fri Nov 3 21:15:57 20...
We’ve selected the data point nearest to our specified location for the variables of interest. You might wonder why we don’t take a global approach. The reason is straightforward: tree models operate on tabular data, a two-dimensional table of samples by features, and preparing a higher-dimensional dataset is beyond the scope of this tutorial. Techniques for handling higher dimensions will be covered in subsequent tutorials.
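A quick back-of-the-envelope check, using the dimension sizes reported above, shows how large a fully flattened global table would be and why we stick to a single grid point here:
#rough size check: one row per (time, lat, lon) combination
n_rows = data.sizes["time"] * data.sizes["lat"] * data.sizes["lon"]
print(f"{n_rows:,} rows if the whole globe were flattened into a single table")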
Let’s take a look at the metadata for each variable.
Manipulating Data#
%%time
#take a look at the units of variables
for variable in singapore.variables:
    #get units and variable description
    units = singapore[variable].attrs.get('units')
    description = singapore[variable].attrs.get('description')
    #write data to metadata txt file
    with open('singapore_metadata.txt', 'a+') as file:
        file.write(f"{variable} is {description} measured in {units}. \n\n")
    #print information
    print(f"{variable} is {description} measured in {units}.")
    print()
tas is near-surface (usually, 2 meter) air temperature measured in K.
psl is not, in general, the same as surface pressure measured in Pa.
pr is at surface; includes both liquid and solid phases from all types of clouds (both large-scale and convective) measured in kg m-2 s-1.
huss is Near-surface (usually, 2 meter) specific humidity. measured in 1.
hurs is This is the relative humidity with respect to liquid water for T> 0 C, and with respect to ice for T<0 C. measured in %.
clt is for the whole atmospheric column, as seen from the surface or the top of the atmosphere. Include both large-scale and convective cloud. measured in %.
lat is None measured in degrees_north.
lon is None measured in degrees_east.
time is None measured in None.
CPU times: user 0 ns, sys: 1.01 ms, total: 1.01 ms
Wall time: 805 µs
As we convert the data from an xarray dataset to a dataframe, some information loss is anticipated. To preserve vital metadata about our variables, we’ve maintained this information in a text file for future reference if needed.
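If you need that metadata later, for example when labelling plots, the saved file can simply be read back:
#read the saved metadata back at any point
with open('singapore_metadata.txt') as file:
    print(file.read())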
Now let’s convert our dataset to a pandas dataframe for easier manipulation.
%%time
sing_df = singapore.to_dataframe()
sing_df = sing_df.drop(['lat','lon'], axis=1)
sing_df
CPU times: user 32.7 s, sys: 1.87 s, total: 34.6 s
Wall time: 31.5 s
time | tas | psl | pr | huss | hurs | clt
---|---|---|---|---|---|---
1850-01-15 12:00:00 | 300.332672 | 100981.664062 | 0.000031 | 0.018161 | 80.826218 | 67.595139
1850-02-14 00:00:00 | 301.102051 | 100931.570312 | 0.000020 | 0.018143 | 77.156830 | 69.740829
1850-03-15 12:00:00 | 301.172913 | 100871.906250 | 0.000053 | 0.019471 | 82.353477 | 75.408829
1850-04-15 00:00:00 | 301.175446 | 101016.468750 | 0.000035 | 0.019206 | 81.336685 | 69.285049
1850-05-15 12:00:00 | 300.771332 | 100964.578125 | 0.000042 | 0.019539 | 84.693420 | 69.559113
... | ... | ... | ... | ... | ... | ...
2014-08-15 12:00:00 | 299.182434 | 101339.984375 | 0.000002 | 0.016927 | 80.921547 | 63.258583
2014-09-15 00:00:00 | 299.601135 | 101311.554688 | 0.000002 | 0.017491 | 81.500290 | 65.061989
2014-10-15 12:00:00 | 300.147461 | 101159.484375 | 0.000004 | 0.017855 | 80.409378 | 47.490864
2014-11-15 00:00:00 | 300.769775 | 101101.570312 | 0.000074 | 0.019193 | 83.324203 | 61.768677
2014-12-15 12:00:00 | 300.938385 | 101096.953125 | 0.000033 | 0.019048 | 81.893417 | 70.255272

1980 rows × 6 columns
#reset the index of the dataframe
sing_df = sing_df.reset_index()
sing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980 entries, 0 to 1979
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 1980 non-null object
1 tas 1980 non-null float32
2 psl 1980 non-null float32
3 pr 1980 non-null float32
4 huss 1980 non-null float32
5 hurs 1980 non-null float32
6 clt 1980 non-null float32
dtypes: float32(6), object(1)
memory usage: 62.0+ KB
Now that the dataset has been transformed into a dataframe, we’re poised to start exploring and refining the data as needed. One notable observation is that the time information is currently stored as an object rather than in a datetime format. Let’s see how we can change this.
%%time
#cast the object-typed time values to strings, then parse them as pandas datetimes
sing_df.time = sing_df.time.astype(str)
sing_df.time = pd.to_datetime(sing_df.time)
sing_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980 entries, 0 to 1979
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 1980 non-null datetime64[ns]
1 tas 1980 non-null float32
2 psl 1980 non-null float32
3 pr 1980 non-null float32
4 huss 1980 non-null float32
5 hurs 1980 non-null float32
6 clt 1980 non-null float32
dtypes: datetime64[ns](1), float32(6)
memory usage: 62.0 KB
CPU times: user 38.8 ms, sys: 0 ns, total: 38.8 ms
Wall time: 38.5 ms
Addressing data irregularities is of utmost importance before delving into advanced model building. It’s crucial to meticulously handle data types, address missing values, and ensure accurate data representations. In machine learning, always remember: ‘garbage in, garbage out!’
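A couple of quick, inexpensive checks along these lines (shown here only as a sketch) are counting missing values and scanning the summary statistics for implausible ranges:
#quick sanity checks before modelling
print(sing_df.isna().sum())        #missing values per column
print(sing_df.describe())          #value ranges help spot implausible entries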
Exploratory Data Analysis#
Let’s start by visualising how our variables change with time.
sing_time = sing_df.set_index('time')
sing_time.head()
time | tas | psl | pr | huss | hurs | clt
---|---|---|---|---|---|---
1850-01-15 12:00:00 | 300.332672 | 100981.664062 | 0.000031 | 0.018161 | 80.826218 | 67.595139
1850-02-14 00:00:00 | 301.102051 | 100931.570312 | 0.000020 | 0.018143 | 77.156830 | 69.740829
1850-03-15 12:00:00 | 301.172913 | 100871.906250 | 0.000053 | 0.019471 | 82.353477 | 75.408829
1850-04-15 00:00:00 | 301.175446 | 101016.468750 | 0.000035 | 0.019206 | 81.336685 | 69.285049
1850-05-15 12:00:00 | 300.771332 | 100964.578125 | 0.000042 | 0.019539 | 84.693420 | 69.559113
%%time
#plot the monthly climatology (mean over all years) of each variable
cols = list(sing_time.columns)
plt.figure(figsize=(15,12))
for i, var in enumerate(cols):
    ax = plt.subplot(3, 2, i+1)
    sing_time[var].groupby(sing_time.index.month).mean().plot(ax=ax)
    ax.set_title(f"Monthly mean {var}")
plt.tight_layout()
plt.show()
CPU times: user 890 ms, sys: 396 ms, total: 1.29 s
Wall time: 795 ms
Our dataset contains monthly averages spanning 165 years (1850–2014). By calculating the mean value for each calendar month across this timeline, we aimed to uncover any discernible seasonal patterns.
The plots reveal an intriguing trend: notably, precipitation spikes align with months exhibiting higher temperatures, humidity levels, and increased cloud cover. While this correlation might seem evident to seasoned climate researchers, it highlights potential underlying interactions between these variables when viewed through the lens of machine learning.
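One quick way to make these suspected relationships more tangible is to look at simple linear correlations with the target, keeping in mind that correlation only captures linear dependence, while the tree models used later can also exploit non-linear structure:
#linear correlation of each variable with precipitation
print(sing_time.corr()['pr'].sort_values(ascending=False))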
Moreover, the calendar month itself emerges as a pivotal factor in predicting these variables, so let’s derive a new column to hold it.
#making a new column
sing_time['month'] = sing_time.index.month
sing_time.head()
time | tas | psl | pr | huss | hurs | clt | month
---|---|---|---|---|---|---|---
1850-01-15 12:00:00 | 300.332672 | 100981.664062 | 0.000031 | 0.018161 | 80.826218 | 67.595139 | 1
1850-02-14 00:00:00 | 301.102051 | 100931.570312 | 0.000020 | 0.018143 | 77.156830 | 69.740829 | 2
1850-03-15 12:00:00 | 301.172913 | 100871.906250 | 0.000053 | 0.019471 | 82.353477 | 75.408829 | 3
1850-04-15 00:00:00 | 301.175446 | 101016.468750 | 0.000035 | 0.019206 | 81.336685 | 69.285049 | 4
1850-05-15 12:00:00 | 300.771332 | 100964.578125 | 0.000042 | 0.019539 | 84.693420 | 69.559113 | 5
Each row in our dataset corresponds to the specific month when the data was recorded. To accelerate our exploratory data analysis (EDA), we’ll utilize ydata_profiling. This tool seamlessly integrates with pandas, generating comprehensive reports that delve into the statistical properties of each column along with potential interactions. While this approach significantly accelerates the model-building process, it’s essential to note that for exceedingly large datasets, this method can be computationally intensive and time-consuming.
For a deeper dive into ydata_profiling, I encourage exploring its comprehensive documentation.
sing_report = ProfileReport(sing_time)
sing_report.to_notebook_iframe()
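If you would like to keep the report outside the notebook as well, ydata_profiling can write it to a standalone HTML file (the filename below is just an example):
#optionally save the profiling report as a standalone HTML file
sing_report.to_file("singapore_report.html")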