Sound-Dr: Reliable Sound Dataset and Baseline Artificial Intelligence System for Respiratory Illnesses

As respiratory diseases continue to burden societies worldwide, this paper proposes a high-quality and reliable dataset of human sounds for studying respiratory illnesses, including pneumonia and COVID-19. It consists of coughing, mouth breathing, and nose breathing sounds together with metadata on related clinical characteristics. We also develop a proof-of-concept system for establishing baselines and benchmarking against multiple datasets, such as Coswara and COUGHVID. Our comprehensive experiments show that the Sound-Dr dataset has richer features, yields better performance, and is more robust to dataset shift across various machine learning tasks, making it promising for a wide range of real-time applications on mobile devices. The proposed dataset and system will serve as practical tools to support healthcare professionals in diagnosing respiratory disorders. The dataset and code are publicly available here: https://github.com/ReML-AI/Sound-Dr/.


INTRODUCTION
Abnormalities can be discovered in the respiratory sounds of individuals with fever, asthma, tuberculosis, pneumonia, and COVID-19 compared to the sounds of those without these conditions. A solid body of literature has shown the effectiveness of respiratory sounds in disease detection with the use of artificial intelligence (AI) (Song, 2015; Sakkatos et al., 2019; Yang et al., 2022). Furthermore, AI systems and data can be periodically updated, thereby improving accuracy and reliability. In real-world situations, sound-based medical screening tools can be widely deployed in multiple locations, such as airports, factories, and supermarkets.
Truong V. Hoang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
To date, there are several respiratory sound datasets for detecting diseases, such as the International Conference on Biomedical and Health Informatics (ICBHI) dataset (Rocha et al., 2019), in which each audio recording identifies whether the patient is healthy or exhibits one of the following respiratory diseases or conditions: COPD, bronchiectasis, asthma, upper or lower respiratory tract infection, pneumonia, or bronchiolitis. To detect COVID-19, there are also two well-known datasets from New York (NYU Breathing Sounds for COVID-19, 2020) and Cambridge (Brown et al., 2020) universities. These respiratory datasets, however, are likely prone to reliability issues, as shown in our later experiments using a dataset shift detection method (Rabanser, Günnemann, & Lipton, 2019).
In order to build a high-quality and reliable dataset, we developed a system to collect respiratory sound data efficiently, for example by recording each sound multiple times to reduce the impact of unwanted noise and by capturing longer average sample durations for better reliability. As a result, our dataset, named Sound-Dr, covers many different conditions, such as fever, asthma, and COVID-19, enabling researchers to address various machine-learning problems related to respiratory diseases, including disease classification and anomaly detection.
The Sound-Dr dataset contains three types of respiration sounds, including nose breathing, mouth breathing, and coughing, with extensive lengths to foster more possibilities in machine learning algorithms and model deployments. Besides the audio recordings, metadata and health-related characteristics (e.g., smoking, insomnia) are included with high quality and data richness, which can be useful for multiple machine learning tasks on medical diseases related to respiratory systems.
4th Asia Pacific Conference of the Prognostics and Health Management, Tokyo, Japan, September 11-14, 2023 (R02-16)

Compared to existing datasets, the Sound-Dr dataset has multiple advantages with the following contributions:
• Besides the audio recordings, the dataset also provides metadata and health-related characteristics for various tasks in machine learning, including but not limited to disease classification, anomaly detection, and symptom recognition for respiratory illnesses. It is suitable for large-scale adoption and deployment via smart devices for homes or businesses.
• This paper establishes an open baseline framework to facilitate benchmarking the Sound-Dr dataset and other datasets in terms of performance and robustness, as well as the efficiency of the data collection.

Studies of Human Sounds for Medical Screening
In medicine, human sounds have been well studied as viable inputs for identifying vocal fold pathology, which involves either subjective or objective assessments. In subjective approaches, a skilled medical professional listens to the sound signal and determines whether it is diseased or normal based on their prior training and experience. Nevertheless, depending on the level of experience, this type of evaluation may differ from doctor to doctor (Kreiman, Gerratt, & Precoda, 1990). As a result, both medical and engineering professionals are paying more and more attention to objective approaches to voice pathology.
Many medical conditions can be accurately identified using computer-aided voice pathology classification tools and deep learning techniques; see, for example, a recent study (Deb et al., 2022).

Machine learning datasets for respiratory diseases
The use of machine learning has become increasingly promising for the detection and monitoring of respiratory illnesses. A recent work (Pham et al., 2022) presented an exploration of various deep-learning models for detecting respiratory anomalies from auditory recordings; the authors used the ICBHI 2017 dataset (Rocha et al., 2019). There are also recent datasets for respiratory diseases, such as COUGHVID (Orlandic, Teijeiro, & Atienza, 2021) and Coswara (N. Sharma et al., 2020). COUGHVID is a global dataset of cough signal recordings for COVID-19 detection with some clinical information and metadata. Coswara is another dataset composed of voice samples from healthy individuals, including breathing sounds (fast and slow), cough sounds (deep and shallow), phonation of sustained vowels, and counting numbers. These datasets are large-scale and regularly updated; nevertheless, they are susceptible to reliability issues due to their intrinsic properties and distributional characteristics. We use these two datasets for performance benchmarking and to evaluate the dataset shift problem.

Sound-Dr Dataset Collection
The Sound-Dr dataset is a project of the AI Center of FPT Software Company Limited (FPT, 1999). The application collects users' demographic information, medical history, and recordings of voices and respiratory sounds. The Sound-Dr dataset was collected solely by FPT for community purposes during the peak of the COVID-19 pandemic in Vietnam, from August 2021 to October 2021. We treat ethical issues as important: users were prompted to read our terms and conditions and give their consent solely for development and community purposes. Therefore, the dataset has the agreement and consent of all users.
For the data collection, we developed web-based and mobile-based applications for users to easily interact with and record three different sounds: (1) mouth breathing, (2) nose breathing, and (3) coughing. With the involvement of medical experts in the field, for each audio type, users are requested to record at least three times with a minimum duration of 5 seconds per turn. The sample rate is set to 48,000 Hz by default, and no noise reduction is applied in either application, in order to preserve the true nature of the data. Additionally, some user metadata is also collected via a survey form, which includes personal information (e.g., age and gender), related respiratory illness symptoms, smoking status, and COVID-19 diagnosis, as shown in Table 1.

Descriptive Statistics of Sound-Dr Dataset
We obtained a dataset of 3,930 sound recordings; the distribution of coughing, mouth breathing, and nose breathing is presented in Figure 1. There are 1,310 subjects in total, with the gender distribution shown in Figure 2 and the age distribution presented in Figure 4. More males (60%) than females (40%) participated in our program. In terms of age groups, subjects between 20 and 40 years old are dominant. Regarding smoking status, Figure 3 indicates that 90% of the subjects are non-smokers. Among the 346 COVID-19-positive subjects, 293 had symptoms and 193 did not.
Statistics of the recording durations, shown in Figure 5, reveal several interesting characteristics of the above-mentioned datasets. Some audio samples in Coswara and COUGHVID are shorter than 1 s, which may make them unsuitable for model training and cause errors when reading the input data. In contrast, the Sound-Dr dataset has longer durations, ensuring that training data can be split into even parts without requiring padding when the sound length is unsatisfactory. Moreover, regarding sampling rates, the Coswara and COUGHVID datasets have audio sampling rates of 44,100 Hz and 22,050 Hz, respectively. The Sound-Dr dataset has a higher sampling rate; therefore, our training data can be easily converted to different sampling rates for benchmarking. Our data collection system was carefully designed to ensure the best quality for machine learning tasks.
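As an illustration of the duration concern above, a minimal filter could drop clips that are too short for training; the 1 s threshold and the `usable` helper are illustrative assumptions, not part of the released pipeline:

```python
import numpy as np

# Threshold is an assumption for illustration; the text only notes that
# some Coswara/COUGHVID clips fall below 1 s.
MIN_DURATION_S = 1.0

def usable(waveform: np.ndarray, sr: int) -> bool:
    """Return True if a clip is long enough for the training pipeline."""
    return len(waveform) / sr >= MIN_DURATION_S

# A 5 s clip at Sound-Dr's 48 kHz default vs. a 0.5 s clip at COUGHVID's 22.05 kHz.
ok = usable(np.zeros(48_000 * 5), 48_000)          # long enough
too_short = usable(np.zeros(22_050 // 2), 22_050)  # below threshold
```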

Task Definition
Given the Sound-Dr dataset, we propose three main tasks: (I) detect COVID-19-negative or COVID-19-positive subjects, (II) detect subjects with and without related respiratory symptoms, and (III) detect normal and anomalous subjects (i.e., anomalous subjects are COVID-19 positive or present related respiratory symptoms). For each task, the audio inputs of coughing, mouth breathing, and nose breathing are evaluated independently.
Based on the metadata shown in Table 1, the 1,310 subjects are separated into COVID-19-negative and COVID-19-positive subjects, subjects with and without symptoms, and normal and anomalous subjects for Tasks I, II, and III, respectively, as shown in Figure 6.
To evaluate the Sound-Dr dataset on each defined task, we apply 5-fold cross-validation, where the final result is the average over all folds. Random seeds are used to ensure that results are reproducible and that the data is divided the same way for all methods being benchmarked. Every reported result is an average over 5 different seeds. The evaluation metrics are Accuracy (Acc), F1 score (Sasaki, 2007), and AUC (Bradley, 1997). All model training, benchmarking, and evaluation tasks were executed on a system with Ubuntu 18.04, 12 GB RAM, and an NVIDIA GTX 1080 GPU.
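The evaluation protocol above (5-fold cross-validation, averaged over 5 seeds, reporting Acc/F1/AUC) can be sketched with scikit-learn; the random-forest classifier and synthetic data below are placeholders, not the paper's actual features or models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def evaluate_5fold(X, y, seed):
    """Run 5-fold CV with a fixed seed and return mean Acc/F1/AUC."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    accs, f1s, aucs = [], [], []
    for train_idx, test_idx in skf.split(X, y):
        clf = RandomForestClassifier(random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        prob = clf.predict_proba(X[test_idx])[:, 1]
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred))
        aucs.append(roc_auc_score(y[test_idx], prob))
    return np.mean(accs), np.mean(f1s), np.mean(aucs)

# Average over 5 different seeds, as in the evaluation protocol.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                          # stand-in features
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
results = np.array([evaluate_5fold(X, y, seed) for seed in range(5)])
acc, f1, auc = results.mean(axis=0)
```

Fixing `random_state` in both the splitter and the classifier is what makes the fold assignment identical across benchmarked methods.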

BASELINE SYSTEM
Given the Sound-Dr dataset, we develop a deep-learning-based framework, referred to as the baseline. The baseline framework can be separated into two main steps: feature extraction and classification.

Feature Extraction
The raw single-channel (mono) audio is first re-sampled to 16,000 Hz using the Librosa toolkit (McFee et al., 2015). The re-sampled audio recordings are then fed into a pre-trained model to extract embedding features. In this paper, the pre-trained models are TRILL (Shor et al., 2020) and FRILL (Peplinski, Shor, Joglekar, Garrison, & Patel, 2021), which are recommended for downstream tasks on non-semantic speech signals. Using TRILL to extract features from cough sounds for detecting COVID-19 has also been proven effective (Hoang, Pham, Ngo, & Nguyen, 2022).
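The paper performs this step with Librosa; as a dependency-light sketch, the equivalent 48 kHz to 16 kHz conversion can be done with SciPy's polyphase resampler (the `to_mono_16k` helper and the test tone are illustrative assumptions):

```python
import numpy as np
from scipy.signal import resample_poly

SR_IN, SR_OUT = 48_000, 16_000  # recording rate -> model input rate

def to_mono_16k(audio, sr=SR_IN):
    """Mix down to mono and resample to 16 kHz (48 kHz -> 16 kHz is 1:3)."""
    if audio.ndim == 2:              # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    g = np.gcd(sr, SR_OUT)
    return resample_poly(audio, SR_OUT // g, sr // g)

# One second of a 440 Hz tone "recorded" at 48 kHz.
t = np.arange(SR_IN) / SR_IN
wav = np.sin(2 * np.pi * 440 * t)
wav16k = to_mono_16k(wav)
```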
While the pre-trained TRILL model is based on the ResNet architecture and has a large footprint, the pre-trained FRILL model is built on the MobileNet architecture, leveraging knowledge distillation from the pre-trained TRILL model. As a result, the pre-trained FRILL model is suitable for real-time applications on edge devices: it is 32 times faster on a Pixel 1 smartphone and only 40% of TRILL's size, yet remains competitive with TRILL, with an average decrease of only 2% in accuracy.
The output of both pre-trained models is a time series of embeddings: we obtain one embedding (a 2048-dimensional vector) for every second of audio when feeding recordings of different lengths from the Sound-Dr dataset into the models. Hence, multiple embeddings represent one audio recording. We then compute two statistical features, the mean and standard deviation, across the time axis and concatenate them to create the final embedding (a 4096-dimensional vector).
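The mean/std pooling described above reduces to a few lines of NumPy; the `pool_embeddings` name is a hypothetical helper, and the random frames stand in for real per-second embeddings:

```python
import numpy as np

EMB_DIM = 2048  # per-second embedding size reported for the pre-trained models

def pool_embeddings(frames):
    """Collapse a (T, 2048) sequence of per-second embeddings into one
    4096-dimensional clip-level vector: [mean; std] over the time axis."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

frames = np.random.default_rng(0).normal(size=(7, EMB_DIM))  # 7-second clip
clip_vec = pool_embeddings(frames)
```

This pooling makes clips of any duration comparable, since every recording maps to a fixed-length vector regardless of T.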

Classification
We conducted experiments with the Support Vector Machine, Random Forest, Multilayer Perceptron, ExtraTrees Classifier, LightGBM, and XGB Classifier. Since anomaly detection mostly focuses on unsupervised or semi-supervised settings, we also use Isolation Forest (Liu, Ting, & Zhou, 2008) and XGBOD (Zhao & Hryniewicki, 2018) to assess the dataset's suitability for anomaly detection, i.e., recognizing outliers. However, we only achieved a good score with the XGB Classifier on both Coswara and COUGHVID. Therefore, to build a baseline system and classify the extracted embedding features into the groups defined in Section 3.3, we use the XGB Classifier (Friedman, 2000). To fine-tune the hyper-parameters of this classifier, shown in Table 2, we make use of the Optuna framework (Akiba, Sano, Yanase, Ohta, & Koyama, 2019) with the Grid Search algorithm. These classification models are implemented using the XGBoost library (Chen & Guestrin, 2016) for the XGB Classifier, the Python Outlier Detection library (Zhao, Nasrullah, & Li, 2019) for XGBOD, and the Scikit-Learn toolkit (Pedregosa et al., 2011) for the others.
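The hyper-parameter tuning step can be sketched as follows; the paper uses Optuna's grid search over an XGB Classifier, while this dependency-light stand-in uses scikit-learn's grid search and gradient boosting with a hypothetical search space:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical search space; the actual tuned parameters are in Table 2.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))               # stand-in embedding features
y = (X[:, 0] - X[:, 1] > 0).astype(int)     # stand-in labels

# Exhaustive grid search with 3-fold CV, selecting by AUC.
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
best_params = search.best_params_
```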

Experimental Results and Discussion
We experimented with the COVID-19 detection task on the three collected sound types: coughing, mouth breathing, and nose breathing, as illustrated in Table 3. The performance using mouth and nose breathing is lower than that using cough sounds. The best performance on cough sounds scores 88.44 AUC, 73.13 F1, and 86.06 Accuracy. Although TRILL outperforms FRILL on Accuracy by about 0.2% (86.06 vs. 86.26 Acc), FRILL performs about 2% better on the F1 and AUC metrics. Therefore, we use FRILL for our baseline model, as it suits real environments that need fast, accurate detection, especially on mobile devices.
In addition, we also experimented with abnormality detection in respiratory sounds by combining the COVID-19-positive and symptomatic statuses into an Abnormal label. Using the XGB Classifier with the hyper-parameters shown in Table 2, we achieve promising results of 81.16 AUC, 68.12 F1, and 77.18 Accuracy. With XGBOD, results of 82.95 AUC, 70.02 F1, and 79.77 Accuracy show that such outlier-detection settings can also be used on this dataset for anomaly detection. The performance comparison is described in Table 4. This shows that our dataset has potential for more reliable outcomes on multiple tasks, such as outlier detection and anomaly detection in respiration sounds. We hope that models built on the Sound-Dr dataset will support doctors in diagnosing diseases faster and more accurately in the future.

Dataset Shift Detection
Besides evaluating the effectiveness of the models applied to the three datasets, we also consider the dataset shift problem, which contributes to measuring a dataset's robustness. Dataset shift arises from differing distributional characteristics between the training and test sets (Quionero-Candela, Sugiyama, Schwaighofer, & Lawrence, 2009). Many machine learning algorithms assume that training and test data are drawn from the same distribution; thus, dataset shift can lead to severe performance degradation. We quantify the robustness of the Sound-Dr, Coswara, and COUGHVID datasets by detecting accuracy shifts, indicating the degree of distribution shift in each dataset.
We conducted several experiments based on a pipeline for detecting dataset shift with a two-sample-testing approach, using pre-trained classifiers for dimensionality reduction (Rabanser et al., 2019). Specifically, the train and test sets are reduced in dimension and subsequently analyzed via statistical hypothesis testing. We investigate the equivalence of the source distribution (from which training data is sampled) and the target distribution (from which real-world data is sampled). Dataset shift is evaluated with various sample sizes: 50, 100, 500, 1,000, and 10,000. Table 5 shows that the Sound-Dr dataset exhibits less shift between the train and test distributions. Across sample sizes, our results are better by about 15% and 26% relative to Coswara and COUGHVID, respectively, implying a lower risk of drift and greater reliability for real-world deployment.
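A minimal version of the two-sample-testing pipeline can be sketched as follows, substituting PCA for the paper's pre-trained-classifier reduction and using a per-component Kolmogorov-Smirnov test with Bonferroni correction; all of these choices (and the `detect_shift` helper) are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

def detect_shift(source, target, n_components=5, alpha=0.05):
    """Two-sample shift test in the style of 'Failing Loudly': reduce
    dimensionality, run a KS test per component with Bonferroni
    correction, and report a shift if any component rejects."""
    pca = PCA(n_components=n_components).fit(source)
    s, t = pca.transform(source), pca.transform(target)
    p_values = [ks_2samp(s[:, i], t[:, i]).pvalue for i in range(n_components)]
    return min(p_values) < alpha / n_components

rng = np.random.default_rng(0)
# Identical distributions: a shift should (usually) not be flagged.
same = detect_shift(rng.normal(size=(1000, 20)), rng.normal(size=(1000, 20)))
# Mean-shifted target: a shift should be flagged.
shifted = detect_shift(rng.normal(size=(1000, 20)),
                       rng.normal(loc=1.0, size=(1000, 20)))
```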

Task Performance
This paper aims to establish performance benchmarks for multiple machine-learning tasks. We exploit the extracted features through SVMs with linear kernels for the classification tasks. Specifically, we use several extraction methods, including FRILL, OpenSmile (Eyben, Wöllmer, & Schuller, 2010), OpenXBOW (Schmitt & Schuller, 2017), and Deep Spectrum (Amiriparian et al., 2017), to extract feature representations from the preprocessed raw audio data. The acquired representations are scaled to zero mean and unit standard deviation using parameters computed from the respective training set. These normalized features are fed to the SVM model provided by the Scikit-Learn toolkit (Pedregosa et al., 2011) via its LinearSVC class, with an optimized complexity parameter C. We conduct experiments in these settings and summarize the results in Table 6.
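The scale-then-classify setup above can be sketched as a scikit-learn pipeline, which guarantees the scaler statistics come from the training set only; `C=1.0` is a placeholder for the tuned complexity parameter, and the random features stand in for the extracted representations:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 32))            # stand-in training features
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(100, 32))             # stand-in test features

# Zero-mean/unit-variance scaling fitted on the training set only,
# followed by a linear-kernel SVM.
model = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
model.fit(X_train, y_train)
preds = model.predict(X_test)
```

Wrapping the scaler and classifier in one pipeline prevents test-set statistics from leaking into the normalization step.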
We utilise the same feature extraction process and classifier (SVM) for the COVID-19 and abnormality detection tasks on all datasets. The experimental results on the Sound-Dr dataset are better than those on the other two datasets, as shown in Table 6. The task performance improvements are statistically significant for both COVID-19 detection and abnormality detection relative to the Coswara and COUGHVID datasets, respectively.
This demonstrates that the Sound-Dr dataset may provide useful features for detecting anomalies in respiratory sounds such as coughs and breaths. In addition, the better results on the Sound-Dr dataset indicate that our data was processed well during collection to obtain high-quality samples.

CONCLUSION
High-quality respiratory sound data that can be used to detect patient symptoms is in demand; thus, the Sound-Dr dataset is essential for researchers building health applications. We also build a system to evaluate multiple datasets and create the first baseline system for future research and benchmarking. Based on our comprehensive experimental results, the Sound-Dr dataset outperforms multiple existing datasets under both unsupervised and supervised methods. The Sound-Dr dataset was collected efficiently, with extended recording lengths to minimize various noises. Furthermore, our dataset's unique properties and metadata of health-related characteristics make it more reliable against dataset shifts.
We build a model using FRILL embeddings and an XGBoost classifier for real-life contexts that necessitate rapid and accurate detection. It also makes it easy for researchers to explore improvements over the baselines. With the baseline system and dataset available, researchers can rapidly develop solutions that are in high demand. With the Sound-Dr dataset, we hope that researchers will accelerate the building of artificial intelligence models that support doctors in diagnosing diseases faster and more accurately. The Sound-Dr dataset was collected from various mobile devices with rigorous data collection methods, promising wide applicability in real-world situations.
Given the increasing impact of respiratory illnesses, the Sound-Dr dataset was proposed in collaboration with medical experts to study respiratory anomalies, including pneumonia and COVID-19. As a baseline, this dataset can be useful for respiratory disease screening, abnormality detection, and symptom classification. In real-world scenarios, the dataset has been used in multiple medical apps for rapid screening of respiratory diseases, COVID-19, and respiratory anomalies, owing to its quality and robustness.
In our pipeline, more data from the field are needed to train neural networks optimally. Although we provide an additional respiration dataset to broaden the distribution of data, more data are needed across many countries with a larger number of subjects. By collecting data from subjects in Southeast Asia, our research aims to provide the groundwork for future advancements in information processing and machine learning.
Figure 1. Histograms of three types of audio recording duration.

Figure 5. The distribution of the duration of the datasets.
Figure 6. The number of subjects for each defined task.

Table 1. Metadata fields of the Sound-Dr dataset.

Table 3. Experimental results on the Sound-Dr dataset over five runs. Bold font marks the best result for the same (fair) task.

Table 4. Benchmark results of unsupervised methods on the datasets over five runs for the abnormality detection task (Symptom + COVID-19). Bold font marks the best result for the same (fair) dataset.

Table 5. Detecting dataset shift using Failing Loudly.

Table 6. Benchmark results of supervised methods on the datasets over five runs. Bold font marks the best result for the same (fair) feature.