Sap Flow Prediction Analysis

Introduction

This notebook walks through an analysis of sap flow from maple trees in the northeastern US and in Quebec (Canada). Specifically, the methodology proposed by Houle et al. (2015) is applied to sap flow data published by Stinson et al. (2019). The analysis predicts whether or not sap will be collected during a given week using a logistic regression model based on two derived weather features: growing degree days (GDD) and freeze-thaw cycles (frthw). The precision achieved by Houle et al. (2015) on their original data set is then compared with the precision obtained on the current data set using the same modelling parameters.

Data Preparation

The following code chunk loads the tables which have been created from the data set provided by Stinson et al. (2019) on ScienceBase. The scripts used to create these tables can be found in the src directory.

Locations

The following plot shows the location of each study site used for sap data collection, along with the NOAA weather station which has been associated with each site for the purposes of this analysis.

Analysis Table Creation

The following code chunk creates a dataframe containing the features required for the prediction model used by Houle et al. (2015). These features are:

More information on these features is provided in the following section. Additional features, including the measurement dates and locations, are also given in the dataframe full_sap.

Analysis

Logistic Regression Model

Houle et al. (2015) created a logistic regression model to predict the presence or absence of maple syrup production in a given week. The following linear function was developed based on their measured data:

$$ P = -5.09 + 0.722F - 0.014F^2 - 0.07G$$

Where:

$P$ = Predictor of whether there will or will not be sap flowing in a given week (variable is labelled 'Production' in Houle et al., 2015)

$F$ = Cumulative number of freeze/thaw events since the beginning of the year (January 1st). A freeze/thaw event is counted if the temperature rises above a given threshold ($T_{thresh}$) and drops below it again. A threshold of 3°C has been used as in Houle et al., 2015. The temperature measurements used to derive this feature were generally taken every 15 minutes.

$G$ = Cumulative number of growing degree days since the beginning of the year (January 1st) using a 5°C base temperature ($T_{base}$). Each day, the maximum air temperature ($T_{max}$) is extracted and, if it is above $T_{base}$, a value of $T_{max} - T_{base}$ is added to the running total of growing degree days ($G$).
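The two feature definitions above can be sketched in plain Python. This is a minimal illustration rather than the notebook's own code: the function names `count_freeze_thaw` and `cumulative_gdd` are hypothetical, a series that starts above the threshold is assumed to count as already "risen", and a reading exactly equal to the threshold is assumed to leave the state unchanged.

```python
from itertools import accumulate

T_THRESH = 3.0  # °C, freeze/thaw threshold stated in the text
T_BASE = 5.0    # °C, growing-degree-day base temperature stated in the text


def count_freeze_thaw(temps, t_thresh=T_THRESH):
    """Count events where the temperature rises above t_thresh and then drops below it."""
    events = 0
    above = False  # True once the temperature has been above the threshold
    for t in temps:
        if t > t_thresh:
            above = True
        elif t < t_thresh and above:
            # The temperature dropped back below the threshold: one full event
            events += 1
            above = False
    return events


def cumulative_gdd(daily_tmax, t_base=T_BASE):
    """Running total of growing degree days from daily maximum temperatures."""
    # Days at or below the base temperature contribute nothing
    daily = [max(0.0, tmax - t_base) for tmax in daily_tmax]
    return list(accumulate(daily))
```

For example, `count_freeze_thaw` would be applied to the sub-daily temperature readings from January 1st up to the week of interest to obtain $F$, and the last element of `cumulative_gdd` over the same period gives $G$.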

Passing $P$ into a sigmoid function and applying a threshold of 0.51, we end up with a prediction of whether there will or will not be sap flow in a given week.

$$\hat{Y} = \begin{cases} 1 & \text{if} \ \frac{1}{1+e^{-P}} \geq 0.51 \\ 0 & \text{if} \ \frac{1}{1+e^{-P}} < 0.51 \end{cases} $$

Note that in the subsequent tables, $S$ is used to denote the output of the sigmoid function $\frac{1}{1+e^{-P}}$.
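As a minimal sketch of the prediction step, the snippet below evaluates $P$, $S$, and $\hat{Y}$ for a single week. The coefficients and the 0.51 threshold are taken from the equations above; the function names are hypothetical and chosen only for illustration.

```python
import math

# Coefficients of the linear function and the decision threshold given in the text
B0, B1, B2, B3 = -5.09, 0.722, -0.014, -0.07
S_THRESHOLD = 0.51


def linear_predictor(F, G):
    """P = b0 + b1*F + b2*F^2 + b3*G."""
    return B0 + B1 * F + B2 * F ** 2 + B3 * G


def sigmoid(p):
    """S = 1 / (1 + exp(-P))."""
    return 1.0 / (1.0 + math.exp(-p))


def predict_sap(F, G, threshold=S_THRESHOLD):
    """Return 1 if sap flow is predicted for the week, otherwise 0."""
    S = sigmoid(linear_predictor(F, G))
    return 1 if S >= threshold else 0
```

With $F = 10$ and $G = 5$, for instance, $P = -5.09 + 7.22 - 1.4 - 0.35 = 0.38$, so $S \approx 0.59$ and sap flow is predicted for that week.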

The code chunks below manually compute $P$ and the resulting predictions for the data provided by Stinson et al. (2019) and store the full result in a dataframe called LR_table. A summary of the prediction results for each tap and year is stored in LR_summary, including the classification of predictions as true positives, true negatives, false positives, or false negatives, and the precision of the predictions with respect to weeks when sap flow did occur (precision_1) and weeks when sap flow did not occur (precision_0).
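The summary quantities can be illustrated as follows. This sketch assumes the standard definition of precision (correct predictions of a class divided by all predictions of that class); the notebook's own calculation is not reproduced here and may differ. The function name `summarize_predictions` is hypothetical.

```python
def summarize_predictions(y_true, y_pred):
    """Count TP/TN/FP/FN and compute per-class precision for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        # Assumed definition: share of predicted sap weeks that had sap flow
        "precision_1": tp / (tp + fp) if (tp + fp) else None,
        # Assumed definition: share of predicted no-sap weeks that had no sap flow
        "precision_0": tn / (tn + fn) if (tn + fn) else None,
    }
```

A precision is returned as `None` when the model never predicts the corresponding class, since the ratio is undefined in that case.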

Logistic Regression Predictions

The following plot illustrates the predictions of the logistic regression model for a single tap ('QC1A' from the Boris/Quebec site) in a single year (2015). The plot displays the output of the sigmoid function ($S$) for the values of the linear function ($P$). The threshold $S$ value for predicting sap in a given week was 0.51, as specified by Houle et al. (2015), and is plotted as a dashed black line.

Precision of Predictions

The following plot shows the precision of the predictions generated with the model proposed by Houle et al. (2015) for each tap and each year. Houle et al. (2015) stated that 'The global model accurately predicted 83% of the production weeks and 95% of the non-production weeks.' Assuming that these values represent the precision with respect to production weeks and non-production weeks, respectively, they have been plotted as dashed red lines for reference.

Site Specific Logistic Regression

Using the same model setup as Houle et al. (2015), a separate logistic regression model was fit for each site to determine how much the predictions improve when the model is fit to local data. The resulting model precisions are plotted against the precision of predictions made with the model coefficients suggested by Houle et al. (2015). Note that the parameters which were fit in these site-specific models were the parameters of the regression equation ($\beta_0, \beta_1, \beta_2,$ and $\beta_3$):

$$ P = \beta_0 + \beta_1F + \beta_2F^2 + \beta_3G$$

where Houle et al. (2015) used the following values:

$$ \beta_0 = -5.09, \quad \beta_1 = 0.722, \quad \beta_2 = -0.014, \quad \beta_3 = -0.07 $$

Note that the features of the model were not adjusted as part of the site-specific refitting. In particular, the freeze-thaw threshold temperatures and the growing-degree base temperature were not changed from the values used by Houle et al. (2015).
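To make the refitting step concrete, the sketch below estimates $\beta_0$ to $\beta_3$ for one site from lists of $F$, $G$, and observed weekly sap flow (0 or 1). It is an assumption-laden illustration, not the notebook's fitting code: plain gradient descent on standardized features is swapped in for whatever routine the notebook uses, and the function name `fit_site_model` and its `lr` and `n_iter` parameters are hypothetical.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_site_model(F, G, y, lr=0.5, n_iter=2000):
    """Fit [b0, b1, b2, b3] of P = b0 + b1*F + b2*F^2 + b3*G by gradient descent."""
    raw = [[f, f * f, g] for f, g in zip(F, G)]
    n = len(raw)
    # Standardize each column so that plain gradient descent is well conditioned
    means = [sum(col) / n for col in zip(*raw)]
    sds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
           for col, m in zip(zip(*raw), means)]
    X = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in raw]

    w0, w = 0.0, [0.0, 0.0, 0.0]
    for _ in range(n_iter):
        # Residuals (predicted probability minus observed label) at the current weights
        errs = [_sigmoid(w0 + sum(wj * xj for wj, xj in zip(w, x))) - yi
                for x, yi in zip(X, y)]
        w0 -= lr * sum(errs) / n
        w = [wj - lr * sum(e * x[j] for e, x in zip(errs, X)) / n
             for j, wj in enumerate(w)]

    # Map the weights back to the original (unstandardized) feature scale
    betas = [wj / s for wj, s in zip(w, sds)]
    beta_0 = w0 - sum(b * m for b, m in zip(betas, means))
    return [beta_0] + betas
```

Because only the coefficients are estimated, the inputs $F$ and $G$ would still be computed with the 3°C freeze/thaw threshold and the 5°C base temperature described earlier.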

Results

Overall Precision

Houle et al. (2015) stated that their model 'accurately predicted 83% of the production weeks and 95% of the non-production weeks.' It is assumed that these values represent the precision with respect to production weeks and non-production weeks, respectively. The table below compares the global precision of the model fit by Houle et al. (2015) on their original data, the model fit by Houle et al. (2015) on the data from Stinson et al. (2019), and a series of models fit to each site location using the data from Stinson et al. (2019). Note that the training data used to fit the site-specific models is excluded from the precision calculations for those models.

Model Comparison

From the table above, we see that the overall precision attained by Houle et al. (2015) on their original data set was not replicated when applying their method to the data sets from Stinson et al. (2019). In particular, the precision of predicting weeks without sap production was substantially lower (0.68 vs 0.83) when the model was applied to the data from Stinson et al. (2019). Refitting the regression coefficients of the model to the data from each site resulted in some improvement in model performance. In this case, the site-specific models achieved a higher precision for the weeks without sap production than the Houle et al. (2015) model did on the original data set. The precision for the weeks with sap production, however, remained lower in the site-specific models than in the original work of Houle et al. (2015).