Towards a Probabilistic Fusion Approach for Robust Battery Prognostics

Batteries are a key enabling technology for the decarbonization of the transport and energy sectors. The safe and reliable operation of batteries is crucial for battery-powered systems. In this direction, the development of accurate and robust battery state-of-health prognostics models can unlock the potential of autonomous systems for complex, remote and reliable operations. The combination of Neural Networks, Bayesian modelling concepts and ensemble learning strategies forms a valuable prognostics framework to quantify uncertainty in a robust and accurate manner. Accordingly, this paper introduces a Bayesian ensemble learning approach to predict the capacity depletion of lithium-ion batteries. The approach accurately predicts the capacity fade and quantifies the uncertainty associated with battery design and degradation processes. The proposed Bayesian ensemble methodology employs a stacking technique, integrating multiple Bayesian neural networks (BNNs) as base learners, which have been trained on diverse data. The proposed method has been validated using a battery aging dataset collected by the NASA Ames Prognostics Center of Excellence. The obtained results demonstrate the improved accuracy and robustness of the proposed probabilistic fusion approach with respect to (i) a single BNN model and (ii) a classical stacking strategy based on different BNNs.


INTRODUCTION
Batteries are key components in the transition towards a sustainable carbon-free economy. In this transition, the development of remaining useful life (RUL) prediction models for batteries is a crucial activity. The accuracy and reliability of the RUL prediction models are essential to build trust in the predictions (Liu et al., 2023). In this context, robust and reliable battery prognostics models support the development of accurate monitoring strategies and cost-effective solutions.

Jokin Alcibar et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
The estimation of the state-of-health (SOH) of batteries is a key activity for the design of RUL prognostics models. SOH-based prognostics models focus on capturing the run-to-failure ageing dynamics and battery health state estimation (Toughzaoui et al., 2022). SOH is frequently used to determine age-related degradation that reduces energy capacity and raises safety risks, including overheating and explosions (Wang et al., 2022). Therefore, accurate SOH monitoring and forecasting are key activities to design and operate safe, reliable and effective battery-powered systems (H. Zhao et al., 2023).
SOH estimation is an ongoing area of research (Yang, Chen, Chen, & Huang, 2023). SOH refers to the ratio of the current maximum capacity to the original specified capacity (X. Zhao, Wang, Li, & Miao, 2024). SOH can be quantified through different factors, including resistance and maximum power. However, discharge capacity is the most common definition (Vanem, Salucci, Bakdi, & Alnes, 2021), and it is adopted in this research.
Recent data-driven approaches have focused on modeling the capacity degradation of lithium-ion (Li-ion) batteries. (Lee, Kwon, & Lee, 2023) used a convolutional neural network (CNN) to estimate the future SOH value of Li-ion batteries, transforming the capacity degradation data into two-dimensional images. Estimates of the SOH and RUL are commonly found together in the literature. For example, (Toughzaoui et al., 2022) developed a CNN-LSTM architecture, and (Wei & Wu, 2023) presented a graph CNN complemented by dual attention mechanisms for the estimation of the SOH and RUL of batteries. However, due to the variability inherent in battery manufacturing processes, it is essential to quantify this uncertainty to ensure robust and reliable prognostics predictions (Abdar et al., 2021; Nemani et al., 2023).
There are different sources of uncertainty present in the design, operation and maintenance of batteries (Hadigol, Maute, & Doostan, 2015). (Y. Zhang, Zhang, Liu, Feng, & Xu, 2024) introduced a SOH assessment method that estimates uncertainty through the quantile distribution of deep features, which are inferred from a Residual Neural Network (ResNet) architecture. This approach generates SOH values accompanied by confidence intervals. However, the proposed ResNet architecture lacks probabilistic layers, overlooking the uncertainty inherent in the model parameters. (Che et al., 2024) developed a prognostic framework to assess battery aging using a CNN-LSTM Bayesian neural network. However, this approach limits the uncertainty to the final dense layers, which are the only components modeled probabilistically.
With the aim of capturing the uncertainty associated with complex processes, recent studies in the broader machine learning (ML) community have focused on ensembles of probabilistic models. (Fan, Olson, & Evans, 2017) introduced a Bayesian posterior predictive framework for weighting ensemble climate models. (Cobb et al., 2019) presented a new ML retrieval method based on an ensemble of Bayesian Neural Networks (BNNs). In this scenario, the overall output of the ensemble is treated as a Gaussian mixture model. However, the models are equally weighted, with no adaptation to the observed data. (S. Zhang, Liu, & Su, 2022) presented a Bayesian Mixture Neural Network (BMNN) for Li-ion battery RUL prediction. The BMNN framework incorporates a Bayesian Convolutional Neural Network as a feature extractor and a Bayesian Long Short-Term Memory network to learn degradation patterns over time. However, the absence of a weighted model combination limits the analysis of individual model contributions.
Alternatively, (Bai & Chandra, 2023) described a Bayesian ensemble learning framework that uses gradient boosting by combining multiple Neural Networks trained by Markov Chain Monte Carlo (MCMC) sampling. Finally, (Dai, Pollock, & Roberts, 2023) demonstrated the robustness of Bayesian fusion by embedding the Monte Carlo fusion framework within a sequential Monte Carlo algorithm.
In this context, inspired by the use of probabilistic ensemble models to capture model uncertainty, the main contribution of this research is the development of a novel probabilistic model fusion approach for battery SOH predictions. Bayesian convolutional neural networks (BCNNs) are used as base models for SOH prediction, and the fusion approach integrates the individual BCNN probabilistic predictions. The fusion strategy balances the precision and reliability of individual predictions, adopting an optimal trade-off between the accuracy and uncertainty of predictions through the proposed stacking approach.
The proposed approach has been compared with (i) individual BCNN models and (ii) fusion strategies based on the stacking of BCNN models using point prediction information. The obtained results confirm that the proposed framework infers accurate, well-calibrated, and reliable probabilistic predictions, which improve predictive performance and contribute to estimating uncertainty in a robust and reliable manner in complex data-driven tasks. The proposed approach has been tested and validated with the publicly available NASA battery dataset (Saha & Goebel, 2007).
The remainder of this article is organized as follows. Section 2 outlines our probabilistic fusion approach for robust battery prognostics. Section 3 describes a case study to demonstrate the application of our methodology. Section 4 presents and analyzes the results obtained from the case study. Section 5 discusses the implications of these findings. The article concludes with Section 6, summarizing our main conclusions and suggesting avenues for future work.

PROBABILISTIC FUSION APPROACH FOR ROBUST BATTERY PROGNOSTICS
The proposed probabilistic fusion framework integrates BCNNs with probabilistic ensemble strategies. The main objective of the integration is to generate accurate predictions with robust uncertainty quantification, thanks to the uncertainty quantification of Bayesian modelling (Blundell, Cornebise, Kavukcuoglu, & Wierstra, 2015) and the robustness and accuracy of ensemble strategies (S. Zhang et al., 2022).
The approach is divided into offline and online stages. Starting from a set of battery datasets, in the offline process, the data pre-processing and model training steps are completed. In the online process, the trained models are stacked in an ensemble model according to the computed weights and stacking criteria. The outcome of the approach is a one-step-ahead probabilistic capacity estimate. Figure 1 shows the high-level block diagram of the proposed approach. The high-level concepts in Figure 1 are implemented through the detailed model architecture shown in Figure 2.

Overview
The base models are BCNN models, which are trained (offline) through a leave-one-out cross-validation (LOOCV) process. The probabilistic results of the individual BCNN models are aggregated through a stacking process that includes accuracy and uncertainty metrics. In the testing (online) phase, the weights of each BCNN model are computed using the learned models (log-score weights), and the stacking model is designed to combine them and generate a distribution from a mixture model. The following subsections explain in detail the main parts of the approach.

Offline Phase
During the offline phase, starting from a battery dataset with different run-to-failure trajectories on the same type of batteries, different base models are designed through a training strategy which seeks diversity in the training set to develop complementary predictive models.

Ensemble Base Models: BCNNs
BCNN models are a Bayesian extension of classical CNN models that includes the uncertainty associated with parameter estimation. This requires modifying the classical backpropagation algorithm with Bayesian techniques: the weights are treated as random variables, and variational inference is applied to approximate their posterior distributions. This results in a more robust model that predicts the complete probability density function (PDF).
Consequently, BCNN models have been selected to improve the robustness and accuracy of the model predictions. To this end, BCNNs make use of probabilistic distributions to model the parameters and the uncertainty related to their training process, and prior distributions to incorporate previous knowledge, generate uncertainty estimations and mitigate over-fitting (Blundell et al., 2015). In contrast, classical learning models, e.g. non-Bayesian CNN models, focus on maximum likelihood estimation (MLE) and overlook prior and posterior distributions. This leads to increased error and decreased model robustness in high-uncertainty contexts, e.g. out-of-distribution data or manufacturing drifts.
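To make the idea concrete, the following minimal NumPy sketch illustrates treating weights as random variables via the reparameterization trick on a toy linear model; the weight posterior values and the model itself are illustrative, not part of the proposed architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Variational posterior over the weights of a toy linear model:
# independent Gaussians N(mu_j, sigma_j^2) per weight (illustrative values).
mu = np.array([0.5, -0.2, 0.1])
sigma = np.array([0.05, 0.05, 0.05])

def sample_weights():
    # Reparameterization trick: w = mu + sigma * eps, with eps ~ N(0, 1),
    # so gradients can flow through mu and sigma during training.
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def predict(x, n_mc=500):
    # Monte Carlo predictive distribution: repeated forward passes with
    # freshly sampled weights approximate the posterior predictive PDF.
    preds = np.array([sample_weights() @ x for _ in range(n_mc)])
    return preds.mean(), preds.std()

x = np.array([1.0, 2.0, 3.0])
mean, std = predict(x)  # mean near mu @ x; std reflects parameter uncertainty
```

Repeated stochastic forward passes of this kind are what allow a BCNN to output a full predictive PDF rather than a point estimate.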
The proposed approach utilizes data pre-processing techniques to standardize the length of discharge cycles through padding. This technique involves repeating the last discharge value until the desired cycle length is reached, ensuring consistent input dimensions for all models. Additionally, normalization is carried out by scaling the discharge values between 0 and 1.
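A minimal sketch of this pre-processing step (the cycle values and target length are illustrative):

```python
import numpy as np

def pad_cycle(values, target_len):
    # Repeat the last measured value until the cycle reaches target_len,
    # giving every discharge cycle the same input dimension.
    values = list(values)
    return np.array(values + [values[-1]] * (target_len - len(values)))

def min_max_scale(x):
    # Scale the discharge values to the [0, 1] range.
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

cycle = [4.2, 3.9, 3.5, 2.7]      # illustrative voltage samples of one cycle
padded = pad_cycle(cycle, 6)      # last value repeated until length 6
scaled = min_max_scale(padded)    # values now span [0, 1]
```

In the actual pipeline, the same target length is applied to every cycle so that all BCNN inputs share one tensor shape.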
The architecture of the BCNN models is shown in Figure 3 and is defined as follows:

• Input data: the input data for the BCNN is structured in a tensor format. The rows represent data samples of discharge cycles, and the columns correspond to features, such as the voltage and temperature over time. Notably, the input does not include the discharge current, as it remains constant in this scenario.

• Convolutional 1D Reparameterization: this layer creates a convolution kernel that is applied to the input data. During the forward pass, kernel and bias parameters are drawn from a Gaussian distribution. It uses the reparameterization estimator to approximate distributions through Monte Carlo trials, integrating over the kernel and bias.

• Global Average Pooling 1D: this layer performs average pooling specifically for temporal data. It reduces the spatial dimensions of the input data to a single value per channel by calculating the average over the temporal dimension.

• Flatten: this layer reshapes the input data into a one-dimensional array, enabling compatibility between the Bayesian convolutional layers and the subsequent layers.

• Distribution Lambda: this layer is responsible for producing the final results given the inputs and the learned weights from the previous layers. The output layer consists of two neurons representing the mean, ŷ, and variance, σ², in order to quantify the expected value and its associated uncertainty. To ensure a positive variance, the neuron is activated using an exponential function.
BCNN models combine the feature extraction capabilities of classical CNN models with the uncertainty quantification of Bayesian theory. The proposed architecture is built using the Bayesian layers of TensorFlow Probability in Python (Dillon et al., 2017).

Training for Diversity
Model diversity is a key concept for effective ensemble models (Nam, Yoon, Lee, & Lee, 2021). Accordingly, in this case, the training set for each battery model is modified to learn different battery aging properties. Historical capacity fading data are used to build aging models for each battery in the dataset.
Namely, using the LOOCV strategy, if K run-to-failure trajectories are available, K diverse BCNN models are built by changing the training set in each iteration (cf. Figure 2). That is, each model is trained on all batteries except one, which is held out as a test set. This process is repeated so that each battery serves as the test set exactly once. Thus, all available data are used for training, maximizing the diversity of training scenarios.
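The split construction can be sketched as follows (the battery identifiers are taken from the case study; the logic is a minimal illustration):

```python
def loocv_splits(batteries):
    # Yield (training set, held-out battery): each battery is the
    # test set exactly once, producing K diverse training sets.
    for i, held_out in enumerate(batteries):
        train = batteries[:i] + batteries[i + 1:]
        yield train, held_out

batteries = ["#5", "#6", "#7", "#18"]
splits = list(loocv_splits(batteries))
# K = 4 diverse BCNN models, one trained per split.
```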
Training the BCNN models through the LOOCV strategy enhances the ability of the individual models to generalize across different battery types and manufacturing conditions.
This stage completes the offline training process, which results in a set of BCNN models:

$$\mathcal{M} = \{ BCNN_1, BCNN_2, \ldots, BCNN_K \}, \qquad (1)$$

which are used in the subsequent online inference process to build ensemble models.

Online: Stacking of Predictive Distribution
During the online phase, the proposed stacking of predictive distributions strategy is designed and tested. The proposed approach takes as input the individual base models [cf. Eq. (1)] and the monitored data up to the prediction instant t, which are used to forecast the probability density function (PDF) of the capacity at t + 1, $\hat{y}_{PDF}(t+1)$. The objective of the stacking process is to integrate the predictive distributions of the different base models and propagate all the information end-to-end.
For comparison and benchmarking purposes, an alternative stacking approach, named stacking of point prediction, is also implemented (cf. Subsection 3.3).

Log-Score Weights
The optimal way to combine a set of Bayesian posterior predictive distributions is to use the logarithmic score (Yao, Vehtari, Simpson, & Gelman, 2018). This method maximizes the average log-likelihood of the observed data, which is a proper scoring rule used to evaluate the accuracy of probabilistic forecasts. It measures the accuracy of a forecast and penalizes both overconfidence and underconfidence in the predicted probability. The log-score weights are defined as follows:

$$\hat{w} = \arg\max_{w}\ \frac{1}{N} \sum_{i=1}^{N} \log \sum_{k=1}^{K} w_k\, p(y_i \mid y_{-i}, M_k) - \lambda_{reg} \sum_{k=1}^{K} w_k^2, \qquad (2)$$

subject to $w_k \geq 0$ and $\sum_{k=1}^{K} w_k = 1$, where N denotes the total number of data points and K denotes the total number of base models. The leave-one-out predictive distribution of each model, i.e. $p(y_i \mid y_{-i}, M_k)$, is used to compute the model's prediction for the data point i. To avoid overfitting, a regularization term $\lambda_{reg}$ is added to the likelihood function, penalizing large weights.
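A minimal NumPy sketch of this weight selection, assuming the leave-one-out predictive densities have already been evaluated; the softmax-parametrized gradient ascent and the example densities are illustrative choices, not the paper's implementation:

```python
import numpy as np

def log_score_weights(dens, lam=0.01, steps=2000, lr=0.1):
    # Maximize (1/N) sum_i log sum_k w_k * dens[i, k] - lam * ||w||^2
    # over the probability simplex, where dens[i, k] is the leave-one-out
    # predictive density p(y_i | y_-i, M_k).
    n, k = dens.shape
    theta = np.zeros(k)  # softmax parametrization keeps w >= 0, sum w = 1
    for _ in range(steps):
        w = np.exp(theta - theta.max())
        w /= w.sum()
        mix = dens @ w                                  # mixture density
        grad_w = (dens / mix[:, None]).mean(axis=0) - 2 * lam * w
        theta += lr * w * (grad_w - w @ grad_w)         # softmax Jacobian
    w = np.exp(theta - theta.max())
    return w / w.sum()

# Model 1 assigns a higher density to every observation than model 2,
# so it should dominate the stacked mixture.
dens = np.array([[0.9, 0.2],
                 [0.8, 0.3],
                 [0.7, 0.4]])
w = log_score_weights(dens)
```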

Stacking
Stacking is a method to average point estimates from several models (LeBlanc & Tibshirani, 1996). In its simplest form, it can be seen as a weighted average method. Through the weighted average, it facilitates the construction of ensembles that incorporate predictions from multiple models. In the proposed framework, the goal of the weighted average ensemble is to leverage the predictive capabilities of the K pre-trained BCNN models [cf. Eq. (1)]. It seeks to mitigate forecasting errors by assigning weights to the linear combination of these models, thereby enhancing the accuracy of predictions.
In the Bayesian framework, stacking extends beyond the limitations of averaging point predictions by combining multiple Bayesian posterior predictive distributions. This approach develops a stacking model that leverages the strengths of the various predictive models, enhancing overall predictive accuracy.
The stacking of the predictive distribution enables the fusion of uncertainties from various models into a unified predictive framework. This approach improves the accuracy of forecasts and offers a comprehensive evaluation of the uncertainty associated with these forecasts, providing advantages across diverse decision-making scenarios. The fundamental equation governing this process is defined as follows:

$$p(\tilde{y} \mid y) = \sum_{k=1}^{K} \omega_k\, p(\tilde{y} \mid y, M_k), \qquad (3)$$

where $p(\tilde{y} \mid y)$ represents the aggregate probability estimation based on the ensemble model, $\omega_k$ denotes the weight assigned to the k-th component within the ensemble, and $p(\tilde{y} \mid y, M_k)$ refers to the probabilistic forecast generated by each base model, denoted as $BCNN_k$, given the observed data y.
This probabilistic prediction indicates the likelihood of observing the predicted outcome ỹ, dependent on the specific base model employed.
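Under Gaussian base forecasts, the stacked prediction is a Gaussian mixture. A minimal sketch (the weights, means, and standard deviations are illustrative):

```python
import math

def mixture_pdf(y, weights, means, stds):
    # p(y | data) = sum_k w_k * N(y; mu_k, sigma_k^2): the stacked
    # predictive distribution is a weighted Gaussian mixture.
    return sum(w / (s * math.sqrt(2 * math.pi))
               * math.exp(-0.5 * ((y - m) / s) ** 2)
               for w, m, s in zip(weights, means, stds))

def mixture_mean(weights, means):
    # Point forecast implied by the mixture: the weighted mean.
    return sum(w * m for w, m in zip(weights, means))

# Illustrative base-model forecasts of the next-cycle capacity (Ah).
weights = [0.5, 0.3, 0.2]
means   = [1.61, 1.63, 1.60]
stds    = [0.02, 0.03, 0.02]
mu = mixture_mean(weights, means)
density = mixture_pdf(mu, weights, means, stds)
```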

Forecasting
Online forecasting is computed for one-step-ahead predictions. In order to forecast the battery capacity at instant t + 1, data up to instant t are used, plus an uncertainty factor expressed as noise:

$$X(t) = [V(t), T(t)] + \epsilon, \qquad (4)$$

where {V(t), T(t)} denote the values of voltage and temperature at instant t, and ϵ denotes the Gaussian noise term, N(0, σ) with σ = 0.1, that introduces variability in the progression of X over time.
The one-step-ahead capacity distribution prediction is thus defined as follows:

$$\hat{y}_{PDF}(t+1) = f(X(t)), \qquad (5)$$

where f(·) denotes the designed ensemble model, and $\hat{y}_{PDF}(t+1)$ is the distribution of the capacity estimate at t + 1.
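A minimal sketch of this one-step-ahead prediction step; the ensemble function below is a stand-in for the stacked BCNNs, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(42)

def build_input(voltage_t, temperature_t, sigma=0.1):
    # x(t) = [V(t), T(t)] + eps, eps ~ N(0, sigma): the noise term
    # introduces variability in the progression of the input over time.
    x = np.array([voltage_t, temperature_t], dtype=float)
    return x + rng.normal(0.0, sigma, size=x.shape)

# Stand-in for the stacked BCNN ensemble: maps the input at t to the
# parameters of the capacity PDF at t + 1 (placeholder, not the real model).
ensemble = lambda x: {"mean": 1.6 - 1e-3 * x[1], "std": 0.02}

x_t = build_input(voltage_t=3.8, temperature_t=31.0)
pred = ensemble(x_t)  # one-step-ahead probabilistic capacity estimate
```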
It is possible to perform SOH predictions over longer prediction horizons through a recursive forecasting strategy. However, due to the accumulation of individual forecasting errors, this approach may degrade long-term forecasting performance. Long-term SOH forecasting activities are left open for future work.
This approach allows the model to learn continuously and adapt to changing conditions. Online forecasting is particularly beneficial in environments that require immediate decision-making based on the latest available data.

Dataset description
The effectiveness of the proposed method has been tested using a battery dataset from the NASA Ames Prognostics Center of Excellence (Saha & Goebel, 2007).
A subset of the available battery data has been selected, focusing on batteries #5, #6, #7 and #18. Each battery is operated under various conditions, including charging, discharging, and impedance analysis. Throughout the charge and discharge cycles, temperature, current, and voltage were meticulously recorded. During charging, a constant current mode at 1.5 A was maintained until the voltage reached 4.2 V, followed by a switch to constant voltage mode until the current dropped to 20 mA. Discharge cycles involved a constant load mode at 2 A until the voltage levels reached 2.7 V, 2.5 V, 2.2 V and 2.5 V for batteries #5, #6, #7 and #18, respectively. The experiment ended once the battery capacity had decreased by 30%. These batteries had a maximum capacity of 2 Ah, with an end-of-life capacity set at 1.4 Ah.
Figures 4(a), 4(b) and 4(c) show the evolution of the voltage, current (constant), and temperature measurements with the increment of discharge cycles for battery #5. Figure 5 shows variations in the capacity degradation rates of identical batteries. This is an indicator of the uncertainty inherent in the manufacturing process, which affects SOH estimates.
Figure 5. Capacity degradation data of Li-ion batteries.

BCNN structure and hyperparameters
The design of the base BCNN model structure is developed through experimentation. The BCNN architecture for SOH forecasting is detailed in Table 1, where 'None' indicates the batch size. The input for the model comprises 371 data points per discharge cycle, with each point aggregating 3 features: voltage, temperature, and time.
The proposed structure encompasses a total of 1,300 trainable parameters, designed to extract features from battery discharge cycle data for forecasting purposes. Figure 3 details the convolutional layer hyperparameters, which include 16 kernels, each with a dimension of 3, adopting a Laplace distribution for the prior and employing a ReLU activation function. In addition, the model incorporates Bayesian dense layers with 16 units, the Adam optimizer, a learning rate of 0.01, and the Evidence Lower Bound (ELBO) as its loss function (S. Zhang et al., 2022).

Benchmarking
In order to compare the designed stacking approach with alternative stacking strategies, another stacking approach has been designed using point prediction information instead of the full distribution.

Stacking of Point Prediction
An effective method for determining the weight of each model in the stacking process is to minimize the leave-one-out mean squared error with an L2 regularization term, $\lambda_{reg}$. The purpose of this term is to penalize large weights, thus preventing overfitting and balancing the individual model contributions.
The weights are obtained through the following optimization problem:

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{N} \Big( y_i - \sum_{k=1}^{K} w_k\, f_k^{(-i)}(x_i) \Big)^2 + \lambda_{reg} \sum_{k=1}^{K} w_k^2, \qquad (6)$$

where $f_k^{(-i)}(x_i)$ represents the predicted value of the k-th model when the i-th observation is left out of the training set. The regularization parameter, $\lambda_{reg}$, controls the strength of the regularization applied. To ensure a feasible solution, the weights are restricted to $w_k \geq 0$ and $\sum_{k=1}^{K} w_k = 1$. Accordingly, the stacking of point prediction approach is defined as follows:

$$\hat{y} = \sum_{k=1}^{K} \hat{w}_k\, f_k(x \mid \theta_k), \qquad (7)$$

where ŷ represents the prediction of the ensemble for the test battery capacity, $\hat{w}_k$ denotes the weight assigned to the k-th battery base model, and $f_k(x \mid \theta_k)$ is the prediction made by the corresponding base model ($BCNN_k$).
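A minimal NumPy sketch of this constrained, regularized least-squares problem; projected gradient descent is an illustrative solver choice, and the example predictions are synthetic:

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {w : w >= 0, sum(w) = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def point_stack_weights(preds, y, lam=0.01, steps=2000, lr=0.05):
    # Minimize (1/n) * ||y - preds @ w||^2 + lam * ||w||^2 subject to
    # w >= 0 and sum(w) = 1, via projected gradient descent.
    # preds[i, k] = leave-one-out prediction of model k for observation i.
    n, k = preds.shape
    w = np.full(k, 1.0 / k)
    for _ in range(steps):
        grad = 2.0 * preds.T @ (preds @ w - y) / n + 2.0 * lam * w
        w = project_simplex(w - lr * grad)
    return w

y = np.array([1.0, 2.0, 3.0])
preds = np.column_stack([y + 0.05, y - 0.4])  # model 1 is far more accurate
w = point_stack_weights(preds, y)             # most weight goes to model 1
```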

Evaluation criteria
The accuracy of the regression is measured by the Mean Squared Error and the Coefficient of Determination, while the Negative Log-Likelihood assesses model performance by quantifying prediction probabilities. Finally, the correctness of the probabilistic predictions is assessed through the CRPS, together with calibration and sharpness.
Mean Squared Error (MSE) is a metric for measuring the quality of an estimator. It is calculated as the average of the squared differences between the predicted values and the actual values (Hodson, 2022):

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2, \qquad (8)$$

where n represents the number of observations, $Y_i$ denotes the actual value for the i-th observation, and $\hat{Y}_i$ signifies the predicted value for the i-th observation.
Coefficient of Determination (R²) is a metric used to assess the goodness of fit of the model. It provides a measure of how well the observed outcomes are replicated by the model, based on the proportion of the total variation of outcomes explained by the model (Barrett, 1974):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}, \qquad (9)$$

where n is the number of observations, $Y_i$ is the actual value, $\hat{Y}_i$ the predicted value for the i-th observation, and $\bar{Y}$ the mean of Y. An R² of 1 implies perfect model predictions, while 0 means no explained variability.
Continuous Ranked Probability Score (CRPS) can be formally expressed as a quadratic measure of the discrepancy between the predicted Cumulative Distribution Function (CDF), F(·), and the observed empirical CDF for a given scalar observation y (Zamo & Naveau, 2018):

$$CRPS(F, y) = \int_{-\infty}^{\infty} \big( F(x) - \mathbb{I}(x \geq y) \big)^2\, dx, \qquad (10)$$

where $\mathbb{I}(x \geq y)$ is the indicator function, which models the empirical CDF. To obtain a single score value from Eq. (10), an average is calculated over the individual observations of the test set (Gneiting, Raftery, Westveld, & Goldman, 2005):

$$\overline{CRPS} = \frac{1}{N} \sum_{i=1}^{N} CRPS(F_i, y_i), \qquad (11)$$

where N denotes the total number of predictions.
Negative Log-Likelihood (NLL) assesses probabilistic models by using the likelihood concept, which indicates how likely the observed data are given the model parameters (Bosman & Thierens, 2000). The likelihood (L) is the product of each observation's probability density function (PDF), expressed mathematically as

$$L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta), \qquad (12)$$

where θ denotes the model parameters and X includes N data points. NLL is preferred for optimization since minimizing the NLL is equivalent to maximizing the log-likelihood, facilitating the discovery of the model parameters that best explain the observed data:

$$NLL = -\log L(\theta) = -\sum_{i=1}^{N} \log p(x_i \mid \theta). \qquad (13)$$

Calibration refers to the statistical consistency between the predictive distributions and the actual observations. It represents a joint property of forecasts and empirical data (Jung, Jo, Choo, & Lee, 2022). Namely, the model is said to be calibrated if (Kuleshov, Fenner, & Ermon, 2018):

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{I}\{ y_t \leq F_t^{-1}(p) \} \rightarrow p \quad \text{for all } p \in [0, 1], \text{ as } T \rightarrow \infty. \qquad (14)$$

In this expression, T refers to the total number of data points, while the indicator function $\mathbb{I}\{y_t \leq F_t^{-1}(p)\}$ takes a value of 1 when the condition $y_t \leq F_t^{-1}(p)$ is true, and 0 otherwise. Given this condition, $y_t$ expresses the observed outcome at time t, and $F_t^{-1}(p)$ is the inverse of the CDF for the forecast, evaluated at probability p. Therefore, the condition represents the threshold below which a random sample from the distribution would occur with probability p.
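For reference, the point and probabilistic metrics above can be sketched as follows; the sample-based CRPS estimator is a standard empirical approximation, and the example values are illustrative:

```python
import numpy as np

def mse(y, yhat):
    # Average of the squared differences between predictions and truth.
    return float(np.mean((y - yhat) ** 2))

def r2(y, yhat):
    # Proportion of the total variation explained by the model.
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def gaussian_nll(y, mu, sigma):
    # Negative log-likelihood under per-point Gaussian predictions.
    return float(np.sum(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                        + (y - mu) ** 2 / (2.0 * sigma ** 2)))

def crps_samples(y_obs, samples):
    # Sample-based CRPS estimator for one observation:
    # E|X - y| - 0.5 * E|X - X'|, with X, X' forecast samples.
    s = np.asarray(samples, dtype=float)
    return float(np.mean(np.abs(s - y_obs))
                 - 0.5 * np.mean(np.abs(s[:, None] - s[None, :])))

def coverage(y, inv_cdf_at_p):
    # Empirical frequency of y_t <= F_t^{-1}(p); calibration requires
    # this frequency to approach p for every p in [0, 1].
    return float(np.mean(y <= inv_cdf_at_p))

y    = np.array([1.60, 1.58, 1.55])   # illustrative observed capacities (Ah)
yhat = np.array([1.61, 1.57, 1.56])   # illustrative point forecasts
err  = mse(y, yhat)
fit  = r2(y, yhat)
```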
Sharpness means that the confidence intervals should be optimized for minimal width around a singular value. That is, the goal is to reduce the variance, denoted as $var(F_n)$, of the random variable characterized by the cumulative distribution function $F_n$ (Kuleshov et al., 2018; Tran et al., 2020):

$$sha(F_1, \ldots, F_N) = \frac{1}{N} \sum_{n=1}^{N} var(F_n). \qquad (15)$$

RESULTS

To evaluate the proposed approach, firstly, different ensemble strategies are compared to evaluate their strengths and identify the most suitable approach. Subsequently, a sensitivity analysis is developed with respect to the contribution of the individual base models to the overall ensemble.

Probabilistic Ensemble Strategies
This section focuses on the comparison between (i) the baseline model, i.e. the BCNN model trained with all available data, (ii) the ensemble of point predictions, and (iii) the proposed ensemble method (cf. Figure 2), to further evaluate the improvement of the ensemble strategies over the baseline model.
Table 2 presents a comparative analysis in terms of accuracy and probabilistic metrics.This comparison highlights that, for different test scenarios, the ensemble methodologies enhance the performance of the baseline model.
A notable observation from the results in Table 2 is the variance between the proposed ensemble approach (cf. Figure 2) and the benchmarking ensemble model (cf. Subsection 3.3) in specific scenarios. For batteries #5 and #6, the proposed approach exhibited superior outcomes, particularly in the probabilistic metrics (NLL and CRPS). This suggests that, within a Bayesian framework, prioritizing likelihood maximization leads to accurately modelling uncertainty, and is therefore more advantageous than focusing on MSE minimization (as in Subsection 3.3).
The model optimization criterion has a direct impact on the performance of the tested methods and on the effectiveness of the ensemble approach. However, for batteries #7 and #18, no significant differences were observed between the tested ensemble approaches, which indicates that the results are associated with the prior models. That is, it is possible that the same prior model minimizes the MSE and maximizes the likelihood at the same time.

Figure 6 compares the predictions of the evaluated approaches. The baseline model shows limitations in uncertainty quantification. This is indicated by the positioning of the ground truth (dashed lines) at the limit of the lower boundary in Figure 6(c), which means that the uncertainty does not accurately cover the observed values. That is, the uncertainty bounds are not well-calibrated, compromising the model's ability to accurately represent the underlying variability in the data and in the model compared to the ensemble strategies.
Figure 6(a) shows an improvement in the prediction accuracy. However, it simultaneously introduces a higher level of uncertainty compared to the proposed ensemble method in Figure 6(b). This is reflected in the NLL and CRPS metrics, where the stacking of the predictive distribution demonstrates superior performance (cf. Table 2). Such probabilistic metrics indicate that the model parameters make the observed data more probable, indicating a good fit.
The evaluation of the shape of the PDF is a crucial aspect of uncertainty quantification. Accordingly, the calibration and sharpness assessment of the PDFs is performed through a Python toolbox for predictive uncertainty quantification (Chung, Char, Guo, Schneider, & Neiswanger, 2021). Figure 7 shows the calibration and sharpness of the analysed ensemble methods designed for probabilistic forecasting for battery #5.
The calibration plot for the point-prediction ensemble model [cf. Figure 7(a)] reveals a miscalibration area of 0.26, indicating a gap between predicted probabilities and actual outcomes, generally overestimating event probabilities.On the contrary, the proposed ensemble model [cf. Figure 7(b)] shows better calibration with a miscalibration area of 0.12, aligning closer to the ideal, especially in midrange probabilities.
In terms of sharpness, the predictions of the point-prediction based ensemble model have a mean sharpness value of 0.06 and are right-skewed, reflecting higher uncertainty.However, the proposed ensemble model has a mean sharpness value of 0.05, with a slightly left-skewed distribution, indicating more predictions with lower uncertainty and greater confidence.

Sensitivity of the Ensemble Strategy to the Base Models
To evaluate the contribution of each individual BCNN model to the ensemble approach, a sensitivity assessment has been performed. Namely, the performance of the different leave-one-out iterations has been evaluated, sequentially training with different battery datasets and testing with the left-out battery dataset. This has been compared with the proposed ensemble approach results to identify the individual contributions of the different models. Table 3 displays the obtained results. The ensemble model shows a notable improvement in the NLL metric, suggesting a more reliable uncertainty estimation. Additionally, by achieving the lowest CRPS, it emphasizes its proficiency in probabilistic forecasting and precise uncertainty quantification. Overall, the ensemble method outperforms the individual models, highlighting its effectiveness in contexts that require high accuracy and reliability.
Figure 8 presents the forecasts generated by the individual models for battery #5 (cf. Table 3). It can be seen that the ensemble effectively combines the characteristics of models 2 and 3, thereby improving the overall performance of the final forecast of the ensemble.

DISCUSSION
The proposed research work demonstrates that the stacking of predictive distributions based on a Bayesian framework improves the accuracy and robustness of predictions compared with the stacking of point predictions. Furthermore, it has been observed that the use of an ensemble of BCNN models improves the modeling of uncertainty when compared to relying on a single BCNN model (baseline). However, before drawing definitive conclusions about the application of the proposed solution in real-world settings, further work is necessary to test its robustness, scalability, and sensitivity with respect to noise.

Robustness
Credible intervals reflect the uncertainty associated with the data and the model (cf. Figure 6). The robustness of the proposed approach is therefore directly dependent on model and data uncertainty. The reduction of credible intervals aligns with the objective of increasing robustness. To this end, increasing the number of observations would reduce the uncertainty attributed to the model, resulting in more precise credible intervals. Additionally, employing priors such as maximum entropy priors or weakly informative priors may further tighten credible intervals, thereby improving the reliability of the model predictions.

Scalability
To analyze larger fleets of batteries, instead of using leave-one-out methodologies, it may be more appropriate to develop generalized training methodologies. In this direction, one approach would be to cluster batteries that exhibit similar operation and degradation conditions. This strategy would capture data diversity, which is a key property for ensemble strategies. Alternatively, a hierarchical modelling strategy may be adopted. This method involves a global model for overall battery behavior, supplemented by smaller models for specific groups, enabling precise adaptations without the need for a separate model per battery. This strategy ensures scalability and flexibility in handling various battery operation and degradation conditions efficiently.
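The clustering idea can be sketched with a lightweight stand-in: grouping batteries by their fitted linear fade rate. The battery names, capacity curves, and quantile binning below are illustrative only; a fleet-scale pipeline would more likely apply k-means or similar to richer degradation features:

```python
import numpy as np

def fade_rate(capacity):
    """Slope of a linear fit to the capacity-vs-cycle curve (Ah/cycle)."""
    cycles = np.arange(len(capacity))
    return np.polyfit(cycles, capacity, 1)[0]

def cluster_by_fade(fleet, n_bins=2):
    """Group batteries with similar degradation speed by binning their
    fade rates into n_bins quantile-based clusters."""
    rates = {name: fade_rate(c) for name, c in fleet.items()}
    edges = np.quantile(list(rates.values()), np.linspace(0.0, 1.0, n_bins + 1))
    # Assign each battery the index of the quantile bin its rate falls into.
    return {name: int(np.searchsorted(edges[1:-1], r, side="right"))
            for name, r in rates.items()}
```

Each resulting cluster could then train one ensemble member (or one group-level model in the hierarchical variant), preserving the data diversity that the stacking strategy relies on.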

Noise Sensitivity
The proposed approach assumes Gaussian noise to model the variability of the underlying process and the measurements [cf. Eq. (4)]. To analyze the impact of the Gaussian noise level on the prediction results, a sensitivity analysis has been performed. Figure 9 shows the obtained results.
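The shape of such a sensitivity sweep can be illustrated with a toy stand-in for the trained model (a fixed linear capacity-fade function here, not the paper's BCNN; all numbers are illustrative): test inputs are perturbed with zero-mean Gaussian noise of increasing standard deviation and the resulting predictive spread is recorded:

```python
import numpy as np

rng = np.random.default_rng(1)

def predictive_std(x_test, noise_sigma, n_draws=2000):
    """Average spread of a toy capacity model's predictions when the test
    inputs are perturbed with N(0, noise_sigma^2) noise."""
    x_noisy = x_test[None, :] + rng.normal(0.0, noise_sigma,
                                           (n_draws, x_test.size))
    preds = 2.0 - 0.005 * x_noisy  # toy fade model with fixed weights
    return preds.std(axis=0).mean()

cycles = np.linspace(0, 150, 50)
spreads = [predictive_std(cycles, s) for s in (0.0, 0.1, 0.5, 1.0)]
```

As in Figure 9, the predictive spread grows monotonically with the injected noise level, which is the behaviour the sensitivity analysis quantifies for the actual BCNN ensemble.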

Application Limits
Some of the adopted practices may limit the applicability of the proposed framework in real-world applications. The experimental setup, conducted in a controlled environment with specified load conditions, may not fully replicate the diverse sources of uncertainty present in real-world applications. Such controlled conditions could skew the understanding of uncertainty arising from environmental and operational variability. Consequently, the predictive performance observed in this study may differ under less predictable conditions. For controlled operating environments, the complexity of the proposed approach may be reduced. However, the complexity of the proposed methodology is designed to capture the wide range of uncertainties found in real operating systems.

CONCLUSION AND FUTURE WORK
Batteries are key components in power and energy systems, and ensuring robust and reliable remaining useful life (RUL) prediction of batteries is crucial to develop accurate monitoring strategies and build cost-effective solutions.
In this context, battery RUL prediction models generally focus on individual prediction models. While these may capture uncertainty associated with the battery ageing process, their ability to model and capture uncertainty is limited to the individual model. This research presents a probabilistic ensemble prognostics approach that combines Bayesian Convolutional Neural Network (BCNN) models in a probabilistic stacking strategy. The proposed framework leverages the probabilistic predictive information of individual BCNN models, which are integrated through a probabilistic stacking approach that balances the accuracy and robustness of probabilistic predictions.
The proposed approach has been tested on NASA's battery dataset. Obtained results show that the proposed probabilistic stacking approach improves the accuracy and uncertainty of predictions with respect to other ensemble strategies and individual BCNN models.
This research study contributes towards understanding and predicting the capacity fade in Li-ion batteries. Namely, it highlights the role of probabilistic approaches and ensemble methods in modelling the uncertainties inherent in battery manufacturing and operation.
Looking forward, there are different opportunities to expand the scope and applicability of this work. On the one hand, the use of a larger battery dataset, covering diverse environmental and operational conditions, would allow for a more comprehensive understanding of capacity fade across various scenarios. On the other hand, a more exhaustive comparative analysis of different fusion strategies could be performed, including Bayesian Model Averaging, Pseudo Bayesian Model Averaging, and Mixture Models. This comparison would provide further insights into the optimal approaches for integrating predictive models in the context of battery life prediction, enhancing both the accuracy and reliability of capacity fade forecasts.

ACKNOWLEDGEMENTS
This publication is part of the research projects KK-2023-00041, IT1451-22 and IT1676-22 funded by the Basque Government. J. I. Aizpurua is funded by a Juan de la Cierva Incorporacion Fellowship, Spanish State Research Agency (grant No. IJC2019-039183-I).

Figure 1. High-level block diagram of the proposed approach.

Figure 2. Block diagram of the proposed approach.

Figure 3. Schematic of the Bayesian convolutional neural network.

Figure 4. Feature variations due to an increasing number of discharge cycles in battery #5.

Figure 6. (a) Ensemble model generated by stacking point predictions (cf. Subsection 3.3); (b) ensemble model generated through stacking of predictive distributions (cf. Figure 2); (c) individual BCNN trained with the entire dataset (e.g., for battery #5, trained with batteries #6, #7, and #18).

Figure 8. Capacity fade forecasting for battery #5 employing an ensemble of BCNN models: Figure 8(a) shows the combined forecast of the ensemble model and Figures 8(b)-8(d) show the individual models (cf. Table 3).

Figure 9. Impact of Gaussian noise on predictive modeling of battery capacity degradation. Obtained results indicate that, when the testing data diverge from the training data, the epistemic uncertainty increases. Higher Gaussian noise causes a greater deviation and, therefore, a significant rise in epistemic uncertainty. Analysing the model's behaviour in the presence of different types of uncertainty is crucial to evaluate its robustness and to determine whether additional training stages are needed to enhance its reliability. Consequently, this research adopts a noise level of 0.1 as a trade-off between prediction accuracy and uncertainty.

Table 2. Comparison of different ensemble strategies for different batteries used as test.

Table 3. Performance evaluation of BCNN models and the ensemble approach.