  This is an Open Access journal
ISSN (Online) 2306-5729
Published by MDPI
  • Data, Vol. 8, Pages 135: Enhancing Small Tabular Clinical Trial Dataset
           through Hybrid Data Augmentation: Combining SMOTE and WCGAN-GP

    • Authors: Winston Wang, Tun-Wen Pai
      First page: 135
      Abstract: This study addressed the challenge of training generative adversarial networks (GANs) on small tabular clinical trial datasets for data augmentation, which are known to pose difficulties in training due to limited sample sizes. To overcome this obstacle, a hybrid approach is proposed, combining the synthetic minority oversampling technique (SMOTE) to initially augment the original data to a more substantial size for improving the subsequent GAN training with a Wasserstein conditional generative adversarial network with gradient penalty (WCGAN-GP), proven for its state-of-the-art performance and enhanced stability. The ultimate objective of this research was to demonstrate that the quality of synthetic tabular data generated by the final WCGAN-GP model maintains the structural integrity and statistical representation of the original small dataset using this hybrid approach. This focus is particularly relevant for clinical trials, where limited data availability due to privacy concerns and restricted accessibility to subject enrollment pose common challenges. Despite the limitation of data, the findings demonstrate that the hybrid approach successfully generates synthetic data that closely preserves the characteristics of the original small dataset. By harnessing the power of this hybrid approach to generate faithful synthetic data, the potential for enhancing data-driven research in drug clinical trials becomes evident. This includes enabling a robust analysis on small datasets, supplementing the lack of clinical trial data, facilitating its utility in machine learning tasks, even extending to using the model for anomaly detection to ensure better quality control during clinical trial data collection, all while prioritizing data privacy and implementing strict data protection measures.
      Citation: Data
      PubDate: 2023-08-23
      DOI: 10.3390/data8090135
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 136: Knowledge Graph Dataset for Semantic Enrichment
           of Picture Description in NAPS Database

    • Authors: Marko Horvat, Gordan Gledec, Tomislav Jagušt, Zoran Kalafatić
      First page: 136
      Abstract: This data description introduces a comprehensive knowledge graph (KG) dataset with detailed information about the relevant high-level semantics of visual stimuli used to induce emotional states stored in the Nencki Affective Picture System (NAPS) repository. The dataset contains 6808 systematically manually assigned annotations for 1356 NAPS pictures in 5 categories, linked to WordNet synsets and Suggested Upper Merged Ontology (SUMO) concepts presented in a tabular format. Both knowledge databases provide an extensive and supervised taxonomy glossary suitable for describing picture semantics. The annotation glossary consists of 935 WordNet and 513 SUMO entities. A description of the dataset and the specific processes used to collect, process, review, and publish the dataset as open data are also provided. This dataset is unique in that it captures complex objects, scenes, actions, and the overall context of emotional stimuli with knowledge taxonomies at a high level of quality. It provides a valuable resource for a variety of projects investigating emotion, attention, and related phenomena. In addition, researchers can use this dataset to explore the relationship between emotions and high-level semantics or to develop data-retrieval tools to generate personalized stimuli sequences. The dataset is freely available in common formats (Excel and CSV).
      Citation: Data
      PubDate: 2023-08-24
      DOI: 10.3390/data8090136
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 137: A Framework for Evaluating Renewable Energy for
           Decision-Making Integrating a Hybrid FAHP-TOPSIS Approach: A Case Study in
           Valle del Cauca, Colombia

    • Authors: Mateo Barrera-Zapata, Fabian Zuñiga-Cortes, Eduardo Caicedo-Bravo
      First page: 137
      Abstract: At present, the energy landscape of many countries faces transformational challenges driven by sustainable development objectives, supported by the implementation of clean technologies, such as renewable energy sources, to meet the flexibility and diversification needs of the traditional energy mix. However, integrating these technologies requires a thorough study of the context in which they are developed. Furthermore, it is necessary to carry out an analysis from a sustainable approach that quantifies the impact of proposals on multiple objectives established by stakeholders. This article presents a framework for analysis that integrates a method for evaluating the technical feasibility of resources for photovoltaic solar, wind, small hydroelectric power, and biomass generation. These resources are used to construct a set of alternatives and are evaluated using a hybrid FAHP-TOPSIS approach. FAHP-TOPSIS is used as a comparison technique among a collection of technical, economic, and environmental criteria, ranking the alternatives considering their level of trade-off between criteria. The results of a case study in Valle del Cauca (Colombia) offer a wide range of alternatives and indicate a combination of 50% biomass and 50% solar as the best, assisting in decision-making for the correct use of available resources and maximizing the benefits for stakeholders.
      Citation: Data
      PubDate: 2023-08-30
      DOI: 10.3390/data8090137
      Issue No: Vol. 8, No. 9 (2023)
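
The TOPSIS ranking step named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the three alternatives, their criterion values, and the weights below are hypothetical placeholders, and the fuzzy AHP stage that would normally produce the weights is omitted (the weights are simply assumed).

```python
import math

def topsis(matrix, weights, benefit):
    """Return TOPSIS closeness scores in [0, 1], one per alternative (row)."""
    n_crit = len(weights)
    # Vector-normalise each criterion column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal and anti-ideal points depend on whether a criterion is a benefit or a cost.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_pos = math.dist(row, ideal)  # distance to the ideal point
        d_neg = math.dist(row, anti)   # distance to the anti-ideal point
        scores.append(d_neg / (d_pos + d_neg))
    return scores

# Hypothetical alternatives scored on [energy yield, cost, emissions].
alternatives = {
    "A": [0.9, 0.2, 0.1],
    "B": [0.5, 0.5, 0.5],
    "C": [0.1, 0.9, 0.9],
}
weights = [0.5, 0.3, 0.2]        # assumed; FAHP would supply these in practice
benefit = [True, False, False]   # yield is a benefit, cost and emissions are costs

scores = topsis(list(alternatives.values()), weights, benefit)
ranking = sorted(zip(alternatives, scores), key=lambda p: p[1], reverse=True)
print(ranking)
```

An alternative that is best on every criterion coincides with the ideal point and scores 1, while one that is worst on every criterion scores 0; real alternatives fall in between, which is the trade-off the ranking captures.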
  • Data, Vol. 8, Pages 138: Using Landsat-5 for Accurate Historical LULC
           Classification: A Comparison of Machine Learning Models

    • Authors: Denis Krivoguz, Sergei G. Chernyi, Elena Zinchenko, Artem Silkin, Anton Zinchenko
      First page: 138
      Abstract: This study investigates the application of various machine learning models for land use and land cover (LULC) classification in the Kerch Peninsula. The study utilizes archival field data, cadastral data, and published scientific literature for model training and testing, using Landsat-5 imagery from 1990 as input data. Four machine learning models (deep neural network, Random Forest, support vector machine (SVM), and AdaBoost) are employed, and their hyperparameters are tuned using random search and grid search. Model performance is evaluated through cross-validation and confusion matrices. The deep neural network achieves the highest accuracy (96.2%) and performs well in classifying water, urban lands, open soils, and high vegetation. However, it faces challenges in classifying grasslands, bare lands, and agricultural areas. The Random Forest model achieves an accuracy of 90.5% but struggles with differentiating high vegetation from agricultural lands. The SVM model achieves an accuracy of 86.1%, while the AdaBoost model performs the lowest with an accuracy of 58.4%. The novel contributions of this study include the comparison and evaluation of multiple machine learning models for land use classification in the Kerch Peninsula. The deep neural network and Random Forest models outperform SVM and AdaBoost in terms of accuracy. However, the use of limited data sources such as cadastral data and scientific articles may introduce limitations and potential errors. Future research should consider incorporating field studies and additional data sources for improved accuracy. This study provides valuable insights for land use classification, facilitating the assessment and management of natural resources in the Kerch Peninsula. The findings contribute to informed decision-making processes and lay the groundwork for further research in the field.
      Citation: Data
      PubDate: 2023-08-30
      DOI: 10.3390/data8090138
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 139: Dataset of Multi-Aspect Integrated Migration

    • Authors: Diletta Goglia, Laura Pollacci, Alina Sîrbu
      First page: 139
      Abstract: Nowadays, new branches of research are proposing the use of non-traditional data sources for the study of migration trends in order to find an original methodology to answer open questions about cross-border human mobility. New knowledge extracted from these data must be validated using traditional data, which are however distributed across different sources and difficult to integrate. In this context we present the Multi-aspect Integrated Migration Indicators (MIMI) dataset, a new dataset of migration indicators (flows and stocks) and possible migration drivers (cultural, economic, demographic and geographic indicators). This was obtained through acquisition, transformation and integration of disparate traditional datasets together with social network data from Facebook (Social Connectedness Index). This article describes the process of gathering, embedding and merging traditional and novel variables, resulting in this new multidisciplinary dataset that we believe could significantly contribute to nowcast/forecast bilateral migration trends and migration drivers.
      Citation: Data
      PubDate: 2023-08-31
      DOI: 10.3390/data8090139
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 140: Employing Source Code Quality Analytics for
           Enriching Code Snippets Data

    • Authors: Thomas Karanikiotis, Themistoklis Diamantopoulos, Andreas Symeonidis
      First page: 140
      Abstract: The availability of code snippets in online repositories like GitHub has led to an uptick in code reuse, this way further supporting an open-source component-based development paradigm. The likelihood of code reuse rises when the code components or snippets are of high quality, especially in terms of readability, making their integration and upkeep simpler. Toward this direction, we have developed a dataset of code snippets that takes into account both the functional and the quality characteristics of the snippets. The dataset is based on the CodeSearchNet corpus and comprises additional information, including static analysis metrics, code violations, readability assessments, and source code similarity metrics. Thus, using this dataset, both software researchers and practitioners can conveniently find and employ code snippets that satisfy diverse functional needs while also demonstrating excellent readability and maintainability.
      Citation: Data
      PubDate: 2023-08-31
      DOI: 10.3390/data8090140
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 141: Thailand Raw Water Quality Dataset Analysis and

    • Authors: Jaturapith Krohkaew, Pongpon Nilaphruek, Niti Witthayawiroj, Sakchai Uapipatanakul, Yamin Thwe, Padma Nyoman Crisnapati
      First page: 141
      Abstract: Sustainable water quality data are important for understanding historical variability and trends in river regimes, as well as the impact of industrial waste on the health of aquatic ecosystems. Sustainable water management practices heavily depend on reliable and comprehensive data, prompting the need for accurate monitoring and assessment of water quality parameters. This research describes a reconstructed daily water quality dataset that complements rare historical observations for six station points along the Chao Phraya River in Thailand. Internet of Things technology and a Eureka water probe sensor are used to collect and reconstruct the water quality dataset for the period from June 2022 to February 2023, with Turbidity, Optical Dissolved Oxygen, Dissolved Oxygen Saturation, Spatial Conductivity, Acidity/Basicity, Total Dissolved Solids, Salinity, Temperature, Chlorophyll, and Depth as the recorded parameters from six different stations. The presented dataset comprises a total of 211,322 data points, which are separated into six CSV files. The dataset is then evaluated using the Long Short-Term Memory (LSTM) algorithm with a Mean Squared Error (MSE) of 0.0012256, and Root Mean Squared Error (RMSE) of 0.0350080. The proposed dataset provides valuable insights for researchers studying river ecosystems, supporting informed decision-making and sustainable water management practices.
      Citation: Data
      PubDate: 2023-09-04
      DOI: 10.3390/data8090141
      Issue No: Vol. 8, No. 9 (2023)
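
Since RMSE is by definition the square root of MSE, the two error figures quoted in the abstract above can be cross-checked with a few lines of arithmetic. The sketch below uses only the two reported values; the tolerance is an assumption chosen to reflect the rounding of the published MSE.

```python
import math

# Values as reported in the abstract above.
reported_mse = 0.0012256
reported_rmse = 0.0350080

# RMSE is the square root of MSE.
derived_rmse = math.sqrt(reported_mse)
difference = abs(derived_rmse - reported_rmse)

# The MSE is published to 7 decimal places, so agreement to roughly 1e-6
# is about as close as the rounding allows (assumed tolerance).
consistent = difference < 1e-6
print(f"sqrt(MSE) = {derived_rmse:.7f}, reported RMSE = {reported_rmse:.7f}")
print("consistent within rounding:", consistent)
```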
  • Data, Vol. 8, Pages 142: Update of Dietary Supplement Label Database
           Addressing on Coding in Italy

    • Authors: Giorgia Perelli, Roberta Bernini, Massimo Lucarini, Alessandra Durazzo
      First page: 142
      Abstract: Harmonized composition data for foods and dietary supplements are needed for research and for policy decision making. For a correct assessment of dietary intake, the categorization and the classification of food products and dietary supplements are necessary. In recent decades, the marketing of dietary supplements has increased. A food supplements-based database has, as a principal feature, an intrinsic dynamism related to the continuous changes in formulations, which consequently leads to the need for constant monitoring of the market and for regular updates of the database. This study presents an update to the Dietary Supplement Label Database in Italy focused on dietary supplements coding. The updated dataset, presented here for the first time, consists of the codes of 216 dietary supplements currently on the market in Italy that have functional foods as their characterizing ingredients, according to the two most commonly used description and classification systems: LanguaL™ and FoodEx2. This update represents a unique tool and guideline for other compilers and users for applying classification coding systems to dietary supplements. Moreover, this updated dataset represents a valuable resource for several applications such as epidemiological investigations, exposure studies, and dietary assessment.
      Citation: Data
      PubDate: 2023-09-13
      DOI: 10.3390/data8090142
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 143: A New Odd Beta Prime-Burr X Distribution with
           Applications to Petroleum Rock Sample Data and COVID-19 Mortality Rate

    • Authors: Ahmad Abubakar Suleiman, Hanita Daud, Narinderjit Singh Sawaran Singh, Aliyu Ismail Ishaq, Mahmod Othman
      First page: 143
      Abstract: In this article, we pioneer a new Burr X distribution using the odd beta prime generalized (OBP-G) family of distributions called the OBP-Burr X (OBPBX) distribution. The density function of this model is symmetric, left-skewed, right-skewed, and reversed-J, while the hazard function is monotonically increasing, decreasing, bathtub, and N-shaped, making it suitable for modeling skewed data and failure rates. Various statistical properties of the new model are obtained, such as moments, moment-generating function, entropies, quantile function, and limit behavior. The maximum-likelihood-estimation procedure is utilized to determine the parameters of the model. A Monte Carlo simulation study is implemented to ascertain the efficiency of maximum-likelihood estimators. The findings demonstrate the empirical application and flexibility of the OBPBX distribution, as showcased through its analysis of petroleum rock samples and COVID-19 mortality data, along with its superior performance compared to well-known extended versions of the Burr X distribution. We anticipate that the new distribution will attract a wider readership and provide a vital tool for modeling various phenomena in different domains.
      Citation: Data
      PubDate: 2023-09-19
      DOI: 10.3390/data8090143
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 144: Potential Range Map Dataset of Indian Birds

    • Authors: Arpit Deomurari, Ajay Sharma, Dipankar Ghose, Randeep Singh
      First page: 144
      Abstract: Conservation management heavily relies on accurate species distribution data. However, distributional information for most species is limited to distributional range maps, which may not have enough resolution to support conservation action or to establish current distribution status. In many cases, distribution maps are difficult to access in proper data formats for analysis and conservation planning of species. In this study, we addressed this issue by developing Species Distribution Models (SDMs) that integrate species presence data from various citizen science initiatives. This allowed us to systematically construct current distribution maps for 1091 bird species across India. To create these SDMs, we used MaxEnt 3.4.4 (Maximum Entropy) as the base for species distribution modelling and combined it with multiple citizen science datasets containing information on species occurrence and 29 environmental variables. Using this method, we were able to estimate species distribution maps at both a national scale and a high spatial resolution of 1 km². Thus, the results of our study provide current species distribution maps for 968 bird species found in India. These maps significantly improve our knowledge of the geographic distribution of about 75% of India’s bird species and are essential for addressing spatial knowledge gaps for conservation issues. Additionally, by superimposing the distribution maps of different species, we can locate hotspots for bird diversity and align conservation action.
      Citation: Data
      PubDate: 2023-09-21
      DOI: 10.3390/data8090144
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 123: Blockchain Payment Services in the Hospitality
           Sector: The Mediating Role of Data Security on Utilisation Efficiency of
           the Customer

    • Authors: Ankit Dhiraj, Sanjeev Kumar, Divya Rani, Simon Grima, Kiran Sood
      First page: 123
      Abstract: Blockchain technology has the potential to completely transform the hospitality sector by offering a safe, open, and effective method of payment. Increased customer utilisation efficiency may result from this. This study looks into how blockchain payment methods affect hotel customers’ intentions to stay loyal by devising four hypotheses. A questionnaire was specifically created and self-administered for this study as a data-gathering tool and distributed to hotel customers. The IBM SPSS and Amos software packages were used to analyse the data from the 301 valid responses. Findings show that hospitality customers may use blockchain payment services if they are satisfied with the data security of this payment system. The study also highlighted that customer data security mediated the association between utilisation efficiency and blockchain payment systems. Blockchain payment services can affect visitors’ intentions to stay loyal by impacting data security and consumer happiness. Results suggest that blockchain payment systems can be useful for hospitality firms looking to increase client utilisation efficiency. Blockchain can simplify visitor booking and payment processes by providing a safe, open, and effective transacting method. This may result in a satisfying encounter that visitors are more inclined to recall and repeat.
      Citation: Data
      PubDate: 2023-07-30
      DOI: 10.3390/data8080123
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 124: Measuring the Effect of Fraud on Data-Quality

    • Authors: Samiha Brahimi, Mariam Elhussein
      First page: 124
      Abstract: Data preprocessing moves the data from raw to ready for analysis. Data resulting from fraud compromises the quality of the data and the resulting analysis. It can exist in datasets such that it goes undetected since it is included in the analysis. This study proposed a process for measuring the effect of fraudulent data during data preparation and its possible influence on quality. The five-step process begins with identifying the business rules related to the business process(es) affected by fraud and their associated quality dimensions. This is followed by measuring the business rules in the specified timeframe, detecting fraudulent data, cleaning them, and measuring their quality after cleaning. The process was implemented in the case of occupational fraud within a hospital context and the illegal issuance of undeserved sick leave. The aim of the application is to identify the quality dimensions that are influenced by the injected fraudulent data and how these dimensions are affected. This study agrees with the existing literature and confirms the effects of fraud on timeliness, coherence, believability, and interpretability. However, it did not show any effect on consistency. Further studies are needed to arrive at a generalizable list of the quality dimensions that fraud can affect.
      Citation: Data
      PubDate: 2023-07-30
      DOI: 10.3390/data8080124
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 125: Quantitative Metabolomic Dataset of Avian Eye

    • Authors: Ekaterina A. Zelentsova, Sofia S. Mariasina, Vadim V. Yanshole, Lyudmila V. Yanshole, Nataliya A. Osik, Kirill A. Sharshov, Yuri P. Tsentalovich
      First page: 125
      Abstract: Metabolomics is a powerful set of methods that uses analytical techniques to identify and quantify metabolites in biological samples, providing a snapshot of the metabolic state of a biological system. In medicine, metabolomics may help to reveal the molecular basis of a disease, make a diagnosis, and monitor treatment responses, while in agriculture, it can improve crop yields and plant breeding. However, animal metabolomics faces several challenges due to the complexity and diversity of animal metabolomes, the lack of standardized protocols, and the difficulty in interpreting metabolomic data. The current dataset includes quantitative metabolomic profiles of eye lenses from 26 bird species (111 specimens) that can aid researchers in developing new experiments, mathematical models, and integrating with other “-omics” data. The dataset includes raw 1H NMR spectra, protocols for sample preparation, and data preprocessing, with the final table containing information on the abundance of 89 reliably identified and quantified metabolites. The dataset is quantitative, making it relevant for supplementing with new specimens or comparison groups, followed by data mining and expected new interpretations. The data were obtained using the bird specimens collected in compliance with ethical standards and revealed potential differences in metabolic pathways due to phylogenetic differences or environmental exposure.
      Citation: Data
      PubDate: 2023-07-31
      DOI: 10.3390/data8080125
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 126: Datasets of Simulated Exhaled Aerosol Images from
           Normal and Diseased Lungs with Multi-Level Similarities for Neural Network
           Training/Testing and Continuous Learning

    • Authors: Mohamed Talaat, Xiuhua Si, Jinxiang Xi
      First page: 126
      Abstract: Although exhaled aerosols and their patterns may seem chaotic in appearance, they inherently contain information related to the underlying respiratory physiology and anatomy. This study presented a multi-level database of simulated exhaled aerosol images from both normal and diseased lungs. An anatomically accurate mouth-lung geometry extending to G9 was modified to model two stages of obstructions in small airways and physiology-based simulations were utilized to capture the fluid-particle dynamics and exhaled aerosol images from varying breath tests. The dataset was designed to test two performance metrics of convolutional neural network (CNN) models when used for transfer learning: interpolation and extrapolation. To this aim, three testing datasets with decreasing image similarities were developed (i.e., level 1, inbox, and outbox). Four network models (AlexNet, ResNet-50, MobileNet, and EfficientNet) were tested and the performances of all models decreased for the outbox test images, which were outside the design space. The effect of continuous learning was also assessed for each model by adding new images into the training dataset and the newly trained network was tested at multiple levels. Among the four network models, ResNet-50 excelled in performance in both multi-level testing and continuous learning, the latter of which enhanced the accuracy of the most challenging classification task (i.e., 3-class with outbox test images) from 60.65% to 98.92%. The datasets can serve as a benchmark training/testing database for validating existent CNN models or quantifying the performance metrics of new CNN models.
      Citation: Data
      PubDate: 2023-07-31
      DOI: 10.3390/data8080126
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 127: eMailMe: A Method to Build Datasets of Corporate
           Emails in Portuguese

    • Authors: Akira A. de Moura Galvão Uematsu, Anarosa A. F. Brandão
      First page: 127
      Abstract: One of the areas in which knowledge management has application is in companies that are concerned with maintaining and disseminating their practices among their members. However, studies involving these two domains may end up suffering from the issue of data confidentiality. Furthermore, it is difficult to find data regarding organizations processes and associated knowledge. Therefore, this paper presents a method to support the generation of a labeled dataset composed of texts that simulate corporate emails containing sensitive information regarding disclosure, written in Portuguese. The method begins with the definition of the dataset’s size and content distribution; the structure of its emails’ texts; and the guidelines for specialists to build the emails’ texts. It aims to create datasets that can be used in the validation of a tacit knowledge extraction process considering the 5W1H approach for the resulting base. The method was applied to create a dataset with content related to several domains, such as Federal Court and Registry Office and Marketing, giving it diversity and realism, while simulating real-world situations in the specialists’ professional life. The dataset generated is available in an open-access repository so that it can be downloaded and, eventually, expanded.
      Citation: Data
      PubDate: 2023-07-31
      DOI: 10.3390/data8080127
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 128: VEPL Dataset: A Vegetation Encroachment in Power
           Line Corridors Dataset for Semantic Segmentation of Drone Aerial

    • Authors: Mateo Cano-Solis, John R. Ballesteros, John W. Branch-Bedoya
      First page: 128
      Abstract: Vegetation encroachment in power line corridors poses multiple problems for modern energy-dependent societies. Failures due to the contact between power lines and vegetation can result in power outages and millions of dollars in losses. To address this problem, UAVs have emerged as a promising solution due to their ability to quickly and affordably monitor long corridors through autonomous flights or being remotely piloted. However, the extensive manual task of analyzing every image acquired by the UAVs when searching for vegetation encroachment has led many authors to propose the use of Deep Learning to automate the detection process. Despite the advantages of using a combination of UAV imagery and Deep Learning, there is currently a lack of datasets that help to train Deep Learning models for this specific problem. This paper presents a dataset for the semantic segmentation of vegetation encroachment in power line corridors. RGB orthomosaics were obtained for a rural road area using a commercial UAV. The dataset is composed of pairs of tessellated RGB images, coming from the orthomosaic and corresponding multi-color masks representing three different classes: vegetation, power lines, and the background. A detailed description of the image acquisition process is provided, as well as the labeling task and the data augmentation techniques, among other relevant details to produce the dataset. Researchers would benefit from using the proposed dataset by developing and improving strategies for vegetation encroachment monitoring using UAVs and Deep Learning.
      Citation: Data
      PubDate: 2023-08-04
      DOI: 10.3390/data8080128
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 129: Anomaly Detection in Student Activity in Solving
           Unique Programming Exercises: Motivated Students against Suspicious Ones

    • Authors: Liliya A. Demidova, Peter N. Sovietov, Elena G. Andrianova, Anna A. Demidova
      First page: 129
      Abstract: This article presents a dataset containing messages from the Digital Teaching Assistant (DTA) system, which records the results from the automatic verification of students’ solutions to unique programming exercises of 11 various types. These results are automatically generated by the system, which automates a massive Python programming course at MIREA—Russian Technological University (RTU MIREA). The DTA system is trained to distinguish between approaches to solve programming exercises, as well as to identify correct and incorrect solutions, using intelligent algorithms responsible for analyzing the source code in the DTA system using vector representations of programs based on Markov chains, calculating pairwise Jensen–Shannon distances for programs and using a hierarchical clustering algorithm to detect high-level approaches used by students in solving unique programming exercises. In the process of learning, each student must correctly solve 11 unique exercises in order to receive admission to the intermediate certification in the form of a test. In addition, a motivated student may try to find additional approaches to solve exercises they have already solved. At the same time, not all students are able or willing to solve the 11 unique exercises proposed to them; some will resort to outside help in solving all or part of the exercises. Since all information about the interactions of the students with the DTA system is recorded, it is possible to identify different types of students. First of all, the students can be classified into 2 classes: those who failed to solve 11 exercises and those who received admission to the intermediate certification in the form of a test, having solved the 11 unique exercises correctly. However, it is possible to identify classes of typical, motivated and suspicious students among the latter group based on the proposed dataset. 
      The proposed dataset can be used to develop regression models that will predict outbursts of student activity when interacting with the DTA system, to solve clustering problems, to identify groups of students with a similar behavior model in the learning process and to develop intelligent data classifiers that predict the students’ behavior model and draw appropriate conclusions, not only at the end of the learning process but also during the course of it in order to motivate all students, even those who are classified as suspicious, to visualize the results of the learning process using various tools.
      Citation: Data
      PubDate: 2023-08-08
      DOI: 10.3390/data8080129
      Issue No: Vol. 8, No. 8 (2023)
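
The abstract above mentions comparing programs through Markov-chain-based vector representations and pairwise Jensen–Shannon distances. A minimal sketch of that idea follows. It is a simplification under stated assumptions, not the DTA system's code: programs are reduced to hypothetical token lists, and the "Markov" representation is taken to be the relative frequency of adjacent token pairs.

```python
import math
from collections import Counter

def transition_distribution(tokens):
    """Relative frequencies of adjacent token pairs (assumed first-order Markov view)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    total = sum(pairs.values())
    if total == 0:  # fewer than two tokens: no transitions to count
        return {}
    return {pair: count / total for pair, count in pairs.items()}

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            divergence += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * math.log2(qk / mk)
    return math.sqrt(max(divergence, 0.0))

# Hypothetical token sequences standing in for parsed student solutions.
loop_a = ["for", "if", "return"]
loop_b = ["for", "if", "return"]
recursive = ["if", "return", "call"]

d_same = js_distance(transition_distribution(loop_a), transition_distribution(loop_b))
d_diff = js_distance(transition_distribution(loop_a), transition_distribution(recursive))
print(d_same, d_diff)
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (no shared transitions), so a matrix of such pairwise distances could be passed to a hierarchical clustering routine to group solutions by approach.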
  • Data, Vol. 8, Pages 130: Towards Action-State Process Model Discovery

    • Authors: Alessio Bottrighi, Marco Guazzone, Giorgio Leonardi, Stefania Montani, Manuel Striani, Paolo Terenziani
      First page: 130
      Abstract: Process model discovery covers the different methodologies used to mine a process model from traces of process executions, and it has an important role in artificial intelligence research. Current approaches in this area, with a few exceptions, focus on determining a model of the flow of actions only. However, in several contexts, (i) restricting the attention to actions is quite limiting, since the effects of such actions also have to be analyzed, and (ii) traces provide additional pieces of information in the form of states (i.e., values of parameters possibly affected by the actions); for instance, in several medical domains, the traces include both actions and measurements of patient parameters. In this paper, we propose AS-SIM (Action-State SIM), the first approach able to mine a process model that comprehends two distinct classes of nodes, to capture both actions and states.
      Citation: Data
      PubDate: 2023-08-09
      DOI: 10.3390/data8080130
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 131: Draft Genome Sequence Data of Streptomyces
           anulatus, Strain K-31

    • Authors: Andrey P. Bogoyavlenskiy, Madina S. Alexyuk, Amankeldi K. Sadanov, Vladimir E. Berezin, Lyudmila P. Trenozhnikova, Gul B. Baymakhanova
      First page: 131
      Abstract: Streptomyces anulatus is a typical representative of the Streptomyces genus, synthesizing a large number of biologically active compounds. In this study, the draft genome of Streptomyces anulatus strain K-31 is presented, assembled from Illumina reads using SPAdes software. The size of the assembled genome was 8.548838 Mb. Annotation of the S. anulatus genome assembly identified 7749 genes, including 7149 protein-coding genes and 92 RNA genes. This genome will help to further the understanding of Streptomyces genetics and evolution and can be useful for obtaining biologically active compounds.
      Citation: Data
      PubDate: 2023-08-10
      DOI: 10.3390/data8080131
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 132: VR Traffic Dataset on Broad Range of End-User

    • Authors: Marina Polupanova
      First page: 132
      Abstract: With the emergence of new internet traffic types in modern transport networks, it has become critical for service providers to understand the structure of that traffic and predict peaks of that load for planning infrastructure expansion. Several studies have investigated traffic parameters for Virtual Reality (VR) applications. Still, most of them test only a partial range of user activities during a limited time interval. This work creates a dataset of captures from a broader spectrum of VR activities performed with a Meta Quest 2 headset, with each real residential user session recorded for at least half an hour. The newly collected data helped show that some gaming VR traffic activities have a high share of uplink traffic and require symmetric user links. We also found that the gaming phase of the overall gameplay is more sensitive to a reduction in channel resources than the higher-bitrate game launch phase. Hence, we recommend it as a source of traffic distribution for channel sizing model creation. From the gaming phase, capture intervals of more than 100 s contain the most representative information for modeling activity.
      Citation: Data
      PubDate: 2023-08-17
      DOI: 10.3390/data8080132
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 133: Leveraging Return Prediction Approaches for
           Improved Value-at-Risk Estimation

    • Authors: Farid Bagheri, Diego Reforgiato Recupero, Espen Sirnes
      First page: 133
      Abstract: Value at risk (VaR) is a statistic used to anticipate the largest possible losses over a specific time frame and within some level of confidence, usually 95% or 99%. For risk management and regulators, it offers a solution for trustworthy quantitative risk management tools. VaR has become the most widely used and accepted indicator of downside risk. Today, commercial banks and financial institutions utilize it as a tool to estimate the size and probability of upcoming losses in portfolios and, as a result, to estimate and manage the degree of risk exposure. The goal is to obtain the average number of VaR “failures” or “breaches” (losses that are more than the VaR) as near to the target rate as possible. It is also desired that the losses be as evenly distributed as possible. VaR can be modeled in a variety of ways. The simplest method is to estimate volatility based on prior returns according to the assumption that volatility is constant. Otherwise, the volatility process can be modeled using the GARCH model. Machine learning techniques have been used in recent years to carry out stock market forecasts based on historical time series. A machine learning system is often trained on an in-sample dataset, where it can adjust and improve specific hyperparameters in accordance with the underlying metric. The trained model is tested on an out-of-sample dataset. We compared the baselines for the VaR estimation of a day (d) according to different metrics (i) to their respective variants that included stock return forecast information of d and stock return data of the days before d and (ii) to a GARCH model that included return prediction information of d and stock return data of the days before d. Various strategies such as ARIMA and a proposed ensemble of regressors have been employed to predict stock returns.
We observed that the versions of the univariate techniques and GARCH integrated with return predictions outperformed the baselines in four different marketplaces.
      Citation: Data
      PubDate: 2023-08-17
      DOI: 10.3390/data8080133
      Issue No: Vol. 8, No. 8 (2023)
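The abstract above describes the simplest VaR estimate (constant volatility from prior returns) and the notion of VaR breaches. A minimal sketch of both follows. It assumes normally distributed returns and a mean-adjusted parametric formula, which the abstract does not specify; the function names are hypothetical and this is not the authors' model.

```python
import statistics
from statistics import NormalDist


def constant_vol_var(returns, confidence=0.99):
    """Parametric VaR under a constant-volatility, normal-returns assumption.

    Returns a positive number: the loss threshold expected to be exceeded
    with probability (1 - confidence).
    """
    mu = statistics.fmean(returns)
    sigma = statistics.stdev(returns)
    z = NormalDist().inv_cdf(confidence)
    return z * sigma - mu


def breach_rate(returns, var):
    """Share of observations whose loss (negative return) exceeds the VaR."""
    breaches = sum(1 for r in returns if -r > var)
    return breaches / len(returns)
```

For a well-calibrated 99% VaR, the breach rate measured on out-of-sample returns should be close to the 1% target rate.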
  • Data, Vol. 8, Pages 134: Quantifying Webpage Performance: A Comparative
           Analysis of TCP/IP and QUIC Communication Protocols for Improved

    • Authors: Thyago Celso Cavalcante Nepomuceno, Késsia Thais Cavalcanti Nepomuceno, Fabiano Carlos da Silva, Silas Garrido Teixeira de Carvalho Santos
      First page: 134
      Abstract: Browsing is a prevalent activity on the World Wide Web, and users usually demonstrate significant expectations for expeditious information retrieval and seamless transactions. This article presents a comprehensive performance evaluation of the most frequently accessed webpages in recent years using Data Envelopment Analysis (DEA) adapted to the context (inverse DEA), comparing their performance under two distinct communication protocols: TCP/IP and QUIC. To assess performance disparities, parametric and non-parametric hypothesis tests are employed to investigate the appropriateness of each website’s communication protocols. We provide data on the inputs, outputs, and efficiency scores for 82 out of the world’s top 100 most-accessed websites, describing how experiments and analyses were conducted. The evaluation yields quantitative metrics pertaining to the technical efficiency of the websites and efficient benchmarks for best practices. Nine websites are considered efficient from the point of view of at least one of the communication protocols. Under TCP/IP, about 80.5% of all units (66 webpages) need to reduce their page load time by more than 50% to be competitive, while under the QUIC communication protocol this share is 28.05% (23 webpages). In addition, results suggest that the TCP/IP protocol has an unfavorable effect on the overall distribution of inefficiencies.
      Citation: Data
      PubDate: 2023-08-19
      DOI: 10.3390/data8080134
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 113: VPTD: Human Face Video Dataset for Personality
           Traits Detection

    • Authors: Kenan Kassab, Alexey Kashevnik, Alexander Mayatin, Dmitry Zubok
      First page: 113
      Abstract: In this paper, we propose a dataset for personality traits detection based on human face videos. Ground truth data have been annotated using the IPIP-50 personality test that every participant completed. To collect the dataset, we developed a web-based platform that allows us to acquire spontaneous answers to predefined questions from the respondents. The website allows the participants to record an interactive interview in order to imitate a real-life interview. The dataset includes 38 videos (2 min on average) of people of different races, genders, and ages. In the paper, we report the top five personality traits calculated based on the test, as well as the top five personality traits calculated by our own model, which determines this information based on video analysis. We present a statistical analysis of the collected dataset, and we also apply a K-means clustering algorithm to cluster the data and present the clustering results.
      Citation: Data
      PubDate: 2023-06-22
      DOI: 10.3390/data8070113
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 114: A Survey Dataset Evaluating Perceptions of Civil
           Engineering Students about Building Information Modelling (BIM)

    • Authors: Diego Maria Barbieri, Baowen Lou, Marco Passavanti, Aurora Barbieri, Fredrik Bjørheim
      First page: 114
      Abstract: The implementation of Building Information Modelling (BIM) technologies has become increasingly central in the design, construction and maintenance of both civil structures and infrastructures. As more and more software houses develop new BIM software solutions and a wide range of private and public stakeholders employ them, several educational institutes across the globe strive to expand their teaching portfolio to encompass learning and teaching of BIM. This dataset deals with the perceptions expressed by all the civil engineering undergraduate students who attended an academic course specifically about BIM at the University of Stavanger (UiS), Norway, during the second semester of 2022. The survey was divided into five parts and collected information regarding as many overarching aspects: socio-demographic data, perceptions about BIM before and after course attendance, satisfaction with the academic course and the way it was conducted. Considering the very moderate sample size (28 students) and potential biases due to the specific context of the University of Stavanger, the dataset can provide a useful insight into teaching approaches and future curriculum development, rather than indicating major and generalized trends in BIM education. As the questionnaire responses shed light on the feedback and perceptions expressed by university students dealing with BIM for the first time, the dataset can offer a straightforward appreciation of students’ cognitive behaviour in BIM education.
      Citation: Data
      PubDate: 2023-06-28
      DOI: 10.3390/data8070114
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 115: Factory-Based Vibration Data for Bearing-Fault

    • Authors: Adam Lundström, Mattias O’Nils
      First page: 115
      Abstract: The importance of preventing failures in bearings has led to a large amount of research being conducted to find methods for fault diagnostics and prognostics. Many of these solutions, such as deep learning methods, require a significant amount of data to perform well. This is a reason why publicly available data are important, and there currently exist several open datasets that contain different conditions and faults. However, one challenge is that almost all of these data come from a laboratory setting, where conditions might differ from those found in an industrial environment where the methods are intended to be used. This also means that there may be characteristics of the industrial data that are important to take into account. Therefore, this study describes a completely new dataset for bearing faults from a pulp mill. The analysis of the data shows that the faults vary significantly in terms of fault development, rotation speed, and the amplitude of the vibration signal. It also suggests that methods built for this environment need to consider that no historical examples of faults in the target domain exist and that external events can occur that are not related to any condition of the bearing.
      Citation: Data
      PubDate: 2023-06-28
      DOI: 10.3390/data8070115
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 116: Dataset of Linkability Networks of Ethereum
           Accounts Involved in NFT Trading of Top 15 NFT Collections

    • Authors: Aleksandar Tošić, Niki Hrovatin, Jernej Vičič
      First page: 116
      Abstract: In this paper, we present subgraphs of Ethereum wallets involved in NFT trades of the top 15 ERC721 NFT collections. To obtain the subgraphs, we have extracted the Ethereum transaction graph from a live Ethereum node and filtered out exchanges, mining pools, and smart contracts. For each of the selected collections, we identified the set of accounts involved in NFT trading, which we used to perform a breadth-first search in the Ethereum transaction graph to obtain a subgraph. These subgraphs can offer insight into the linkability of accounts participating in NFT trading on the Ethereum blockchain.
      Citation: Data
      PubDate: 2023-06-28
      DOI: 10.3390/data8070116
      Issue No: Vol. 8, No. 7 (2023)
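The abstract above describes a breadth-first search in the transaction graph, started from the accounts involved in NFT trading, to obtain a subgraph. A minimal sketch of that step is shown below. The depth limit, the undirected treatment of transfers, and the function name are assumptions made for illustration; the abstract does not state how the search was bounded.

```python
from collections import deque


def bfs_subgraph(edges, seeds, max_depth):
    """Collect accounts within max_depth hops of the seed accounts.

    edges: iterable of (sender, receiver) pairs, assumed already filtered
    (exchanges, mining pools, and smart contracts removed).
    """
    adjacency = {}
    for src, dst in edges:
        # Assumption: a transfer links both accounts regardless of direction.
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    depth = {s: 0 for s in seeds}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)

    kept_edges = [(s, d) for s, d in edges if s in depth and d in depth]
    return set(depth), kept_edges
```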
  • Data, Vol. 8, Pages 117: Assessment of Maize Silage Quality Under
           Different Pre-Ensiling Conditions

    • Authors: Lorenzo Serva, Igino Andrighetto, Severino Segato, Giorgio Marchesini, Maria Chinello, Luisa Magrin
      First page: 117
      Abstract: Maize silage suffers from several factors that affect the final quality and, to some extent, pre-ensiled conditions that can be potentially tuned during harvesting. After assessing new indices for silage quality under lab-scale conditions, several trials have been conducted to find associations between fresh maize characteristics and silage features. Among the former, we included field input levels, FAO class, maturity stage, use of bacterial inoculants, sealing delay and chemical traits, whereas, among the latter, we assessed density and porosity, pH, fermentative profile, dry matter loss and aerobic stability. The trials were conducted using vacuum bags or mini silo buckets. More than 1500 maize samples harvested in Northeast Italy were analysed during the 2016–2022 period. Moreover, to evaluate silage aerobic stability, the fermentative profile and temperature were measured 14 days after the opening of the silo. The association between silage quality and aerobic stability was assessed, and a prognostic risk score was used to calculate the probability of aerobic instability. The dataset could provide baseline information to promote the continuous improvement of maize silage management from different botanical and crop fields, thus improving agronomic and animal farm resource allocation from a precision agriculture perspective.
      Citation: Data
      PubDate: 2023-07-02
      DOI: 10.3390/data8070117
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 118: A Semantically Annotated 15-Class Ground Truth
           Dataset for Substation Equipment to Train Semantic Segmentation Models

    • Authors: Andreas Anael Pereira Gomes, Francisco Itamarati Secolo Ganacim, Fabiano Gustavo Silveira Magrin, Nara Bobko, Leonardo Göbel Fernandes, Anselmo Pombeiro, Eduardo Félix Ribeiro Romaneli
      First page: 118
      Abstract: The lack of annotated semantic segmentation datasets for electrical substations in the literature poses a significant problem for machine learning tasks; before training a model, a dataset is needed. This paper presents a new dataset of electric substations with 1660 images annotated with 15 classes, including insulators, disconnect switches, transformers and other equipment commonly found in substation environments. The images were captured using a combination of human, fixed and AGV-mounted cameras at different times of the day, providing a diverse set of training and testing data for algorithm development. In total, 50,705 annotations were created by a team of experienced annotators, using a standardized process to ensure accuracy across the dataset. The resulting dataset provides a valuable resource for researchers and practitioners working in the fields of substation automation, substation monitoring and computer vision. Its availability has the potential to advance the state of the art in this important area.
      Citation: Data
      PubDate: 2023-07-05
      DOI: 10.3390/data8070118
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 119: Proteomic Shift in Mouse Embryonic Fibroblasts
           Pfa1 during Erastin, ML210, and BSO-Induced Ferroptosis

    • Authors: Olga M. Kudryashova, Alexey M. Nesterenko, Dmitry A. Korzhenevskii, Valeriy K. Sulyagin, Vasilisa M. Tereshchuk, Vsevolod V. Belousov, Arina G. Shokhina
      First page: 119
      Abstract: Ferroptosis is a unique variety of non-apoptotic cell death, driven by massive lipid oxidation in an iron-dependent manner. Since ferroptosis was introduced as a concept in 2012, it has been shown to play an essential role in the pathogenesis of neurodegenerative diseases and an important role in therapy-resistant cancer cells. Thus, detailed molecular understanding of both canonical and alternative ferroptosis pathways is required. There is a set of widely used chemical agents to modulate ferroptosis using different pathway targets: erastin blocks the cystine–glutamate antiporter, system xc-; ML210 directly inactivates GPX4; and L-buthionine sulfoximine (BSO) inhibits γ-glutamylcysteine synthetase, an essential enzyme for de novo glutathione synthesis. Most studies have focused on the lipidomic profiling of model systems undergoing death in a ferroptotic modality. In this study, we developed high-quality shotgun proteome sequencing during ferroptosis induction by three widely used chemical agents (erastin, ML210, and BSO) before and after 24 and 48 h of treatment. Chromato-mass spectra were registered in DDA mode and are suitable for further label-free quantification. Both processed and raw files are publicly available and could be a valuable dynamic proteome map for further ferroptosis investigation.
      Citation: Data
      PubDate: 2023-07-12
      DOI: 10.3390/data8070119
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 120: PoPu-Data: A Multilayered, Simultaneously
           Collected Lying Position Dataset

    • Authors: Luís Fonseca, Fernando Ribeiro, José Metrôlho, Adriana Santos, Rogério Dionisio, Mohammad Mohammad Amini, Arlindo F. Silva, Ahmad Reza Heravi, Davood Fanaei Sheikholeslami, Filipe Fidalgo, Francisco B. Rodrigues, Osvaldo Santos, Patrícia Coelho, Seyyed Sajjad Aemmi
      First page: 120
      Abstract: This study presents a dataset containing three layers of data that are useful for body position classification and all uses related to it. The PoPu dataset contains simultaneously collected data from two different sensor sheets—one placed over and one placed under a mattress; furthermore, a segmentation data layer was added where different body parts are identified using the pressure data from the sensors over the mattress. The data included were gathered from 60 healthy volunteers distributed among the different gathered characteristics: namely sex, weight, and height. This dataset can be used for position classification, assessing the viability of sensors placed under a mattress, and in applications regarding bedded or lying people or sleep related disorders.
      Citation: Data
      PubDate: 2023-07-16
      DOI: 10.3390/data8070120
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 121: Knowledge Discovery and Dataset for the
           Improvement of Digital Literacy Skills in Undergraduate Students

    • Authors: Pongpon Nilaphruek, Pattama Charoenporn
      First page: 121
      Abstract: For over two decades, scholars and practitioners have emphasized the importance of digital literacy, yet the existing datasets are insufficient for establishing learning analytics in Thailand. Learning analytics focuses on gathering and analyzing student data to optimize learning tools and activities to improve students’ learning experiences. The main problem is that the ICT skill levels of the youth are rather low in Thailand. To facilitate research in this field, this study has compiled a dataset containing information from the IC3 digital literacy certification delivered at the Rajamangala University of Technology Thanyaburi (RMUTT) in Thailand between 2016 and 2023. This dataset is unique since it includes demographic and academic records about undergraduate students. The dataset was collected and underwent a preparation process, including data cleansing, anonymization, and release. These data enable the examination of student learning outcomes through 45,603 records of students’ certification assessment scores. This compiled dataset provides a rich resource for researchers studying digital literacy and learning analytics. It offers researchers the opportunity to gain valuable insights, inform evidence-based educational practices, and contribute to the ongoing efforts to improve digital literacy education in Thailand and beyond.
      Citation: Data
      PubDate: 2023-07-20
      DOI: 10.3390/data8070121
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 122: A Wavelet-Decomposed WD-ARMA-GARCH-EVT Model
           Approach to Comparing the Riskiness of the BitCoin and South African Rand
           Exchange Rates

    • Authors: Thabani Ndlovu, Delson Chikobvu
      First page: 122
      Abstract: In this paper, a hybrid of a Wavelet Decomposition–Generalised Auto-Regressive Conditional Heteroscedasticity–Extreme Value Theory (WD-ARMA-GARCH-EVT) model is applied to estimate the Value at Risk (VaR) of BitCoin (BTC/USD) and the South African Rand (ZAR/USD). The aim is to measure and compare the riskiness of the two currencies. New and improved estimation techniques for VaR have been suggested in the last decade in the aftermath of the global financial crisis of 2008. This paper aims to provide an improved alternative to the already existing statistical tools in estimating a currency VaR empirically. Maximal Overlap Discrete Wavelet Transform (MODWT) and two mother wavelet filters on the returns series are considered in this paper, viz., the Haar and Daubechies (d4). The findings show that BitCoin/USD is riskier than ZAR/USD since it has a higher VaR per unit invested in each currency. At the 99% significance level, BitCoin/USD has average values of VaR of 2.71% and 4.98% for the WD-ARMA-GARCH-GPD and WD-ARMA-GARCH-GEVD models, respectively; and this is slightly higher than the respective 2.69% and 3.59% for the ZAR/USD. The average BitCoin/USD returns of 0.001990 are higher than ZAR/USD returns of −0.000125. These findings are consistent with the mean-variance portfolio theory, which suggests a higher yield for riskier assets. Based on the p-values of the Kupiec likelihood ratio test, the hybrid model adequacy is largely accepted, as p-values are greater than 0.05, except for the WD-ARMA-GARCH-GEVD models at a 99% significance level for both currencies. The findings are helpful to financial risk practitioners and forex traders in formulating their diversification and hedging strategies and ascertaining the risk-adjusted capital requirement to be set aside as a cushion in the event of the occurrence of an actual loss.
      Citation: Data
      PubDate: 2023-07-24
      DOI: 10.3390/data8070122
      Issue No: Vol. 8, No. 7 (2023)
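The abstract above judges model adequacy by the p-values of the Kupiec likelihood ratio test. The sketch below shows the standard proportion-of-failures form of that test, with the p-value taken from a chi-square distribution with one degree of freedom. It is an illustration under those assumptions, not the authors' code, and the function name is hypothetical.

```python
import math


def kupiec_pof(n_obs, n_breaches, var_confidence):
    """Kupiec proportion-of-failures test for a VaR model.

    Returns (LR statistic, p-value). A p-value above 0.05 means the observed
    breach rate is statistically consistent with the target rate.
    """
    target = 1.0 - var_confidence  # expected breach probability
    n, x = n_obs, n_breaches

    def log_lik(prob):
        # Binomial log-likelihood without the constant term; 0 * log(0) is 0.
        ll = 0.0
        if x > 0:
            ll += x * math.log(prob)
        if n - x > 0:
            ll += (n - x) * math.log(1.0 - prob)
        return ll

    lr = max(0.0, -2.0 * (log_lik(target) - log_lik(x / n)))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value
```

For example, 30 breaches in 1000 observations against a 99% VaR gives a large statistic and a p-value far below 0.05, so that model would be rejected.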
  • Data, Vol. 8, Pages 93: Target Screening of Chemicals of Emerging Concern
           (CECs) in Surface Waters of the Swedish West Coast

    • Authors: Pedro A. Inostroza, Eric Carmona, Åsa Arrhenius, Martin Krauss, Werner Brack, Thomas Backhaus
      First page: 93
      Abstract: The aquatic environment faces increasing threats from a variety of unregulated organic chemicals originating from human activities, collectively known as chemicals of emerging concern (CECs). These include pharmaceuticals, personal-care products, pesticides, surfactants, industrial chemicals, and their transformation products. CECs enter aquatic environments through various sources, including effluents from wastewater treatment plants, industrial facilities, runoff from agricultural and residential areas, as well as accidental spills. Data on the occurrence of CECs in the marine environment are scarce, and more information is needed to assess the chemical and ecological status of water bodies, and to prioritize toxic chemicals for further studies or risk assessment. In this study, we describe a monitoring campaign targeting CECs in surface waters at the Swedish west coast using, for the first time, an on-site large volume solid phase extraction (LVSPE) device. We detected up to 80 and 227 CECs in marine sites and the wastewater treatment plant (WWTP) effluent, respectively. The dataset will contribute to defining pollution fingerprints and assessing the chemical status of marine and freshwater systems affected by industrial hubs, agricultural areas, and the discharge of urban wastewater.
      Citation: Data
      PubDate: 2023-05-25
      DOI: 10.3390/data8060093
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 94: MicroRNA Profiling of Fresh Lung Adenocarcinoma
           and Adjacent Normal Tissues from Ten Korean Patients Using miRNA-Seq

    • Authors: Jihye Park, Sae Jung Na, Jung Sook Yoon, Seoree Kim, Sang Hoon Chun, Jae Jun Kim, Young-Du Kim, Young-Ho Ahn, Keunsoo Kang, Yoon Ho Ko
      First page: 94
      Abstract: MicroRNA transcriptomes from fresh tumors and the adjacent normal tissues were profiled in 10 Korean patients diagnosed with lung adenocarcinoma using a next-generation sequencing (NGS) technique called miRNA-seq. The sequencing quality was assessed using FastQC, and low-quality or adapter-contaminated portions of the reads were removed using Trim Galore. Quality-assured reads were analyzed using miRDeep2 and Bowtie. The abundance of known miRNAs was estimated using the reads per million (RPM) normalization method. Subsequently, using DESeq2 and Wx, we identified differentially expressed miRNAs and potential miRNA biomarkers for lung adenocarcinoma tissues compared to adjacent normal tissues, respectively. We defined reliable miRNA biomarkers for lung adenocarcinoma as those detected by both methods. The miRNA-seq data are available in the Gene Expression Omnibus (GEO) database under accession number GSE196633, and all processed data can be accessed via the Mendeley data website.
      Citation: Data
      PubDate: 2023-05-25
      DOI: 10.3390/data8060094
      Issue No: Vol. 8, No. 6 (2023)
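The abstract above estimates miRNA abundance with reads per million (RPM) normalization. A minimal sketch of that calculation follows. It assumes the denominator is the total number of reads assigned to known miRNAs in the sample, which the abstract does not state; the function name and the miRNA identifiers in the usage note are hypothetical.

```python
def reads_per_million(counts):
    """Scale raw read counts so that each sample sums to one million.

    counts: mapping from miRNA identifier to raw read count for one sample.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {name: count * 1_000_000 / total for name, count in counts.items()}
```

Calling `reads_per_million({"miR-A": 250, "miR-B": 750})` scales the two placeholder counts to 250,000 and 750,000, which makes samples with different sequencing depths comparable.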
  • Data, Vol. 8, Pages 95: A Dataset of Scalp EEG Recordings of
           Alzheimer’s Disease, Frontotemporal Dementia and Healthy Subjects
           from Routine EEG

    • Authors: Andreas Miltiadous, Katerina D. Tzimourta, Theodora Afrantou, Panagiotis Ioannidis, Nikolaos Grigoriadis, Dimitrios G. Tsalikakis, Pantelis Angelidis, Markos G. Tsipouras, Euripidis Glavas, Nikolaos Giannakeas, Alexandros T. Tzallas
      First page: 95
      Abstract: Recently, there has been a growing research interest in utilizing the electroencephalogram (EEG) as a non-invasive diagnostic tool for neurodegenerative diseases. This article provides a detailed description of a resting-state EEG dataset of individuals with Alzheimer’s disease and frontotemporal dementia, and healthy controls. The dataset was collected using a clinical EEG system with 19 scalp electrodes while participants were in a resting state with their eyes closed. The data collection process included rigorous quality control measures to ensure data accuracy and consistency. The dataset contains recordings of 36 Alzheimer’s patients, 23 frontotemporal dementia patients, and 29 healthy age-matched subjects. For each subject, the Mini-Mental State Examination score is reported. A monopolar montage was used to collect the signals. Raw and preprocessed EEG recordings are included in the standard BIDS format. For the preprocessed signals, established methods such as artifact subspace reconstruction and independent component analysis have been employed for denoising. The dataset has significant reuse potential since Alzheimer’s EEG machine learning studies are increasing in popularity and there is a lack of publicly available EEG datasets. The resting-state EEG data can be used to explore alterations in brain activity and connectivity in these conditions, and to develop new diagnostic and treatment approaches. Additionally, the dataset can be used to compare EEG characteristics between different types of dementia, which could provide insights into the underlying mechanisms of these conditions.
      Citation: Data
      PubDate: 2023-05-27
      DOI: 10.3390/data8060095
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 96: Exploring the Evolution of Sentiment in Spanish
           Pandemic Tweets: A Data Analysis Based on a Fine-Tuned BERT Architecture

    • Authors: Carlos Henríquez Miranda, German Sanchez-Torres, Dixon Salcedo
      First page: 96
      Abstract: The COVID-19 pandemic has had a significant impact on various aspects of society, including economic, health, political, and work-related domains. The pandemic has also caused an emotional effect on individuals, reflected in their opinions and comments on social media platforms, such as Twitter. This study explores the evolution of sentiment in Spanish pandemic tweets through a data analysis based on a fine-tuned BERT architecture. A total of six million tweets were collected using web scraping techniques, and pre-processing was applied to filter and clean the data. The fine-tuned BERT architecture was utilized to perform sentiment analysis, which allowed for a deep-learning approach to sentiment classification. The analysis results were graphically represented based on search criteria, such as “COVID-19” and “coronavirus”. This study reveals sentiment trends, significant concerns, relationship with announced news, public reactions, and information dissemination, among other aspects. These findings provide insight into the emotional impact of the COVID-19 pandemic on individuals and the corresponding impact on social media platforms.
      Citation: Data
      PubDate: 2023-05-29
      DOI: 10.3390/data8060096
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 97: A Fast Deep Learning ECG Sex Identifier Based on
           Wavelet RGB Image Classification

    • Authors: Jose-Luis Cabra Lopez, Carlos Parra, Gonzalo Forero
      First page: 97
      Abstract: Human sex recognition with electrocardiogram signals is an emerging area in machine learning, mostly oriented toward neural network approaches. It might be the beginning of a field of heart behavior analysis focused on sex. However, a person’s heartbeat changes during daily activities, which could compromise the classification. In this paper, with the intention of capturing heartbeat dynamics, we divided the heart rate into different intervals, creating a specialized identification model for each interval. The sexual differentiation for each model was performed with a deep convolutional neural network from images that represented the RGB wavelet transformation of ECG pseudo-orthogonal X, Y, and Z signals, using sufficient samples to train the network. Our database included 202 people, with a female-to-male population ratio of 49.5–50.5% and an observation period of 24 h per person. As our main goal, we looked for periods of time during which the classification rate of sex recognition was higher and the process was faster; in fact, we identified intervals in which only one heartbeat was required. We found that for each heart rate interval, the best accuracy score varied depending on the number of heartbeats collected. Furthermore, our findings indicated that as the heart rate increased, fewer heartbeats were needed for analysis. On average, our proposed model reached an accuracy of 94.82% ± 1.96%. The findings of this investigation provide a heartbeat acquisition procedure for ECG sex recognition systems. In addition, our results encourage future research to include sex as a soft biometric characteristic in person identification scenarios and for cardiology studies, in which the detection of specific male or female anomalies could help autonomous learning machines move toward specialized health applications.
      Citation: Data
      PubDate: 2023-05-29
      DOI: 10.3390/data8060097
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 98: Unmanned Aerial Vehicle (UAV) and Spectral
           Datasets in South Africa for Precision Agriculture

    • Authors: Cilence Munghemezulu, Zinhle Mashaba-Munghemezulu, Phathutshedzo Eugene Ratshiedana, Eric Economon, George Chirima, Sipho Sibanda
      First page: 98
      Abstract: Remote sensing data play a crucial role in precision agriculture and natural resource monitoring. The use of unmanned aerial vehicles (UAVs) can provide solutions to challenges faced by farmers and natural resource managers due to their high spatial resolution and flexibility compared to satellite remote sensing. This paper presents UAV and spectral datasets collected from different provinces in South Africa, covering different crops at the farm level as well as natural resources. UAV datasets consist of five multispectral bands corrected for atmospheric effects using the PIX4D mapper software to produce surface reflectance images. The spectral datasets are filtered using a Savitzky–Golay filter and corrected using Multiplicative Scatter Correction (MSC). The first and second derivatives and the Continuous Wavelet Transform (CWT) spectra are also calculated. These datasets can provide baseline information for developing solutions for precision agriculture and natural resource challenges. For example, UAV and spectral data of different crop fields captured at spatial and temporal resolutions can contribute towards calibrating satellite images, thus improving the accuracy of the derived satellite products.
      Citation: Data
      PubDate: 2023-05-30
      DOI: 10.3390/data8060098
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 99: Classification of Cocoa Pod Maturity Using
           Similarity Tools on an Image Database: Comparison of Feature Extractors
           and Color Spaces

    • Authors: Kacoutchy Jean Ayikpa, Diarra Mamadou, Pierre Gouton, Kablan Jérôme Adou
      First page: 99
      Abstract: Côte d’Ivoire, the world’s largest cocoa producer, faces the challenge of quality production. Immature or overripe pods cannot produce quality cocoa beans, resulting in losses and an unprofitable harvest. To help farmer cooperatives determine the maturity of cocoa pods in time, our study evaluates the use of automation tools based on similarity measures. Although standard techniques, such as visual inspection and weighing, are commonly used to identify the maturity of cocoa pods, the use of automation tools based on similarity measures can improve the efficiency and accuracy of this process. We set up a database of cocoa pod images and used two feature extractors: one based on convolutional neural networks (CNN), in particular, MobileNet, and the other based on texture analysis using a gray-level co-occurrence matrix (GLCM). We evaluated the impact of different color spaces and feature extraction methods on our database. We used mathematical similarity measurement tools, such as the Euclidean distance, correlation distance, and chi-square distance, to classify cocoa pod images. Our experiments showed that the chi-square distance measurement offered the best accuracy, with a score of 99.61%, when we used GLCM as a feature extractor and the Lab color space. Using automation tools based on similarity measures can improve the efficiency and accuracy of cocoa pod maturity determination. The results of our experiments prove that the chi-square distance is the most appropriate measure of similarity for this task.
      Citation: Data
      PubDate: 2023-05-30
      DOI: 10.3390/data8060099
      Issue No: Vol. 8, No. 6 (2023)
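The similarity-based classification described in the abstract above can be illustrated with a minimal sketch. It assumes each image has already been reduced to a non-negative feature vector (for example, normalized GLCM or histogram features) and assigns a query the label of its nearest reference vector under one common form of the chi-square distance. The function names, class labels, and feature values are hypothetical and are not taken from the paper.

```python
def chi_square_distance(p, q, eps=1e-10):
    """One common form of the chi-square distance between two
    non-negative feature vectors of equal length."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))


def classify(query, references):
    """Return the label of the reference vector closest to `query`."""
    label, _ = min(
        ((lbl, chi_square_distance(query, vec)) for lbl, vec in references),
        key=lambda pair: pair[1],
    )
    return label


# Made-up feature vectors for illustration only (assumption, not paper data).
references = [
    ("immature", [0.70, 0.20, 0.10]),
    ("mature", [0.20, 0.50, 0.30]),
    ("overripe", [0.05, 0.25, 0.70]),
]

print(classify([0.25, 0.45, 0.30], references))
```

Swapping `chi_square_distance` for a Euclidean or correlation distance is the only change needed to compare the measures the abstract mentions.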
  • Data, Vol. 8, Pages 100: Progress in the Cost-Optimal Methodology
           Implementation in Europe: Datasets Insights and Perspectives in Member

    • Authors: Paolo Zangheri, Delia D’Agostino, Roberto Armani, Carmen Maduta, Paolo Bertoldi
      First page: 100
      Abstract: This data article relates to the paper “Review of the cost-optimal methodology implementation in Member States in compliance with the Energy Performance of Buildings Directive”. Datasets linked with this article refer to the analysis of the latest national cost-optimal reports, providing an assessment of the implementation of the cost-optimal methodology, as established by the Energy Performance of Buildings Directive (EPBD). Based on the latest national reports, the data provide a comprehensive update on the cost-optimal methodology implementation throughout Europe, which currently lacks harmonization. Datasets give an overview of the status of the cost-optimal methodology implementation in Europe with details on the calculations carried out (e.g., multi-stage, dynamic, macroeconomic, and financial perspectives, included energy uses, and full-cost approach). Data relate to the implemented methodology, reference buildings, assessed cost-optimal levels, energy performance, costs, and sensitivity analysis. Data also provide insight into energy consumption, efficiency measures for residential and non-residential buildings, nearly zero energy buildings (NZEBs) levels, and global costs. The reported data can be useful to quantify the cost-optimal levels for different building types, both residential (average cost-optimal level 80 kWh/m²y for new, 130 kWh/m²y for existing buildings) and non-residential buildings (140 kWh/m²y for new, 180 kWh/m²y for existing buildings). Data outline weak and strong points of the methodology, as well as future developments in the light of the methodology revision foreseen in 2026. The data support energy efficiency and energy policies related to buildings toward the EU building stock decarbonization goal by 2050.
      Citation: Data
      PubDate: 2023-05-31
      DOI: 10.3390/data8060100
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 101: Labelled Indoor Point Cloud Dataset for BIM
           Related Applications

    • Authors: Nuno Abreu, Rayssa Souza, Andry Pinto, Anibal Matos, Miguel Pires
      First page: 101
      Abstract: BIM (building information modelling) has gained wider acceptance in the AEC (architecture, engineering, and construction) industry. Conversion from 3D point cloud data to vector BIM data remains a challenging and labour-intensive process, but it is particularly relevant during various stages of a project lifecycle. While the challenges associated with processing very large 3D point cloud datasets are widely known, there is a pressing need for intelligent geometric feature extraction and reconstruction algorithms for automated point cloud processing. Compared to outdoor scene reconstruction, indoor scenes are challenging since they usually contain high amounts of clutter. This dataset comprises the indoor point cloud obtained by scanning four different rooms (including a hallway): two office workspaces, a workshop, and a laboratory including a water tank. The scanned space is located at the Electrical and Computer Engineering department of the Faculty of Engineering of the University of Porto. The dataset is fully labelled, containing major structural elements like walls, floor, ceiling, windows, and doors, as well as furniture, movable objects, clutter, and scanning noise. The dataset also contains an as-built BIM that can be used as a reference, making it suitable for use in Scan-to-BIM and Scan-vs-BIM applications. For demonstration purposes, a Scan-vs-BIM change detection application is described, detailing each of the main data processing steps.
      Citation: Data
      PubDate: 2023-06-01
      DOI: 10.3390/data8060101
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 102: A Self-Attention-Based Imputation Technique for
           Enhancing Tabular Data Quality

    • Authors: Do-Hoon Lee, Han-joon Kim
      First page: 102
      Abstract: Recently, data-driven decision-making has attracted great interest; this requires high-quality datasets. However, real-world datasets often feature missing values for unknown or intentional reasons, rendering data-driven decision-making inaccurate. If a machine learning model is trained using incomplete datasets with missing values, the inferred results may be biased. In this case, a commonly used technique is the missing value imputation (MVI), which fills missing data with possible values estimated based on observed values. Various data imputation methods using machine learning, statistical inference, and relational database theories have been developed. Among them, conventional machine learning based imputation methods that handle tabular data can deal with only numerical columns or are time-consuming and cumbersome because they create an individualized predictive model for each column. Therefore, we have developed a novel imputational neural network that we term the Denoising Self-Attention Network (DSAN). Our proposed DSAN can deal with tabular datasets containing both numerical and categorical columns; it considers discretized numerical values as categorical values for embedding and self-attention layers. Furthermore, the DSAN learns robust feature expression vectors by combining self-attention and denoising techniques, and can predict multiple, appropriate substituted values simultaneously (via multi-task learning). To verify the validity of the method, we performed data imputation experiments after arbitrarily generating missing values for several real-world tabular datasets. We evaluated both imputational and downstream task performances, and we have seen that the DSAN outperformed the other models, especially in terms of category variable imputation.
      Citation: Data
      PubDate: 2023-06-04
      DOI: 10.3390/data8060102
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 103: Physico-Chemical Quality and Physiological
           Profiles of Microbial Communities in Freshwater Systems of Mega Manila,

    • Authors: Marie Christine M. Obusan, Arizaldo E. Castro, Ren Mark D. Villanueva, Margareth Del E. Isagan, Jamaica Ann A. Caras, Jessica F. Simbahan
      First page: 103
      Abstract: Studying the quality of freshwater systems and drinking water in highly urbanized megalopolises around the world remains a challenge. This article reports data on the quality of select freshwater systems in Mega Manila, Philippines. Water samples collected between 2020 and 2021 were analyzed for physico-chemical parameters and microbial community metabolic fingerprints, i.e., carbon substrate utilization patterns (CSUPs). The detection of arsenic, lead, cadmium, mercury, polyaromatic hydrocarbons (PAHs), and organochlorine pesticides (OCPs) was carried out using standard chromatography- and spectroscopy-based protocols. Physiological profiles were determined using the Biolog EcoPlate™ system. Eight samples were free of heavy metals, and none contained PAHs or OCPs. Fourteen samples had high microbial activity, as indicated by average well color development (AWCD) and community metabolic diversity (CMD) values. Community-level physiological profiling (CLPP) revealed that (1) samples clustered as groups according to shared CSUPs, and (2) microbial communities in non-drinking samples actively utilized all six substrate classes compared to drinking samples. The data reported here can provide a baseline or a comparator for prospective quality assessments of drinking water and freshwater sources in the region. Metabolic fingerprinting using CSUPs is a simple and cheap phenotypic analysis of microbial communities and their physiological activity in aquatic environments.
      Citation: Data
      PubDate: 2023-06-04
      DOI: 10.3390/data8060103
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 104: Comparison of ARIMA and LSTM in Predicting
           Structural Deformation of Tunnels during Operation Period

    • Authors: Chuangfeng Duan, Min Hu, Haozuan Zhang
      First page: 104
      Abstract: Accurately predicting the structural deformation trend of tunnels during operation is important for putting tunnel safety maintenance on a more scientific footing. With the development of data science, structural deformation prediction methods based on time-series data have attracted attention. The Autoregressive Integrated Moving Average (ARIMA) model is a classical statistical model suitable for processing non-stationary time-series data. Long Short-Term Memory (LSTM) is a special recurrent neural network that can learn long-term dependencies in time series. Both are widely used in the field of temporal prediction. In view of the lack of time-series prediction in the tunnel deformation field, this paper uses historical data of the Xinjian Road and the Dalian Road tunnel in Shanghai to propose a new way of modeling based on single points and road sections. ARIMA and LSTM models are applied in comprehensive experiments, and the results show that: (1) Both LSTM and ARIMA models perform well for settlement and convergence deformation. (2) The overall robustness of ARIMA is better than that of LSTM, and it is more adaptable to the datasets. (3) The model prediction performance is closely related to the data quality. ARIMA performs more stably when data volume is limited, while LSTM performs better with high-quality data and has a higher upper limit.
      Citation: Data
      PubDate: 2023-06-13
      DOI: 10.3390/data8060104
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 105: Assessing the Effectiveness of Masking and
           Encryption in Safeguarding the Identity of Social Media Publishers from
           Advanced Metadata Analysis

    • Authors: Mohammed Khader, Marcel Karam
      First page: 105
      Abstract: Machine learning algorithms, such as KNN, SVM, MLP, RF, and MLR, are used to extract valuable information from shared digital data on social media platforms through their APIs in an effort to identify anonymous publishers or online users. This can leave these anonymous publishers vulnerable to privacy-related attacks, as identifying information can be revealed. Twitter is an example of such a platform where identifying anonymous users/publishers is made possible by using machine learning techniques. To provide these anonymous users with stronger protection, we have examined the effectiveness of these techniques when critical fields in the metadata are masked or encrypted using tweets (text and images) from Twitter. Our results show that SVM achieved the highest accuracy rate of 95.81% without using data masking or encryption, while SVM achieved the highest identity recognition rate of 50.24% when using data masking and AES encryption algorithm. This indicates that data masking and encryption of metadata of tweets (text and images) can provide promising protection for the anonymity of users’ identities.
      Citation: Data
      PubDate: 2023-06-13
      DOI: 10.3390/data8060105
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 106: Curated Dataset for Red Blood Cell Tracking from
           Video Sequences of Flow in Microfluidic Devices

    • Authors: Ivan Cimrák, Peter Tarábek, František Kajánek
      First page: 106
      Abstract: This work presents a dataset comprising images, annotations, and velocity fields for benchmarking cell detection and cell tracking algorithms. The dataset includes two video sequences captured during laboratory experiments, showcasing the flow of red blood cells (RBC) in microfluidic channels. From the first video 300 frames and from the second video 150 frames are annotated with bounding boxes around the cells, as well as tracks depicting the movement of individual cells throughout the video. The dataset encompasses approximately 20,000 bounding boxes and 350 tracks. Additionally, computational fluid dynamics simulations were utilized to generate 2D velocity fields representing the flow within the channels. These velocity fields are included in the dataset. The velocity field has been employed to improve cell tracking by predicting the positions of cells across frames. The paper also provides a comprehensive discussion on the utilization of the flow matrix in the tracking steps.
      Citation: Data
      PubDate: 2023-06-13
      DOI: 10.3390/data8060106
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 107: A Preliminary Investigation of a Single Shock
           Impact on Italian Mortality Rates Using STMF Data: A Case Study of

    • Authors: Maria Francesca Carfora, Albina Orlando
      First page: 107
      Abstract: Mortality shocks, such as pandemics, threaten the consolidated longevity improvements, confirmed in the last decades for the majority of western countries. Indeed, just before the COVID-19 pandemic, mortality was falling for all ages, with a different behavior according to different ages and countries. It is indubitable that the changes in the population longevity induced by shock events, even transitory ones, affecting demographic projections, have financial implications in public spending as well as in pension plans and life insurance. The Short Term Mortality Fluctuations (STMF) data series, providing data of all-cause mortality fluctuations by week within each calendar year for 38 countries worldwide, offers a powerful tool to timely analyze the effects of the mortality shock caused by the COVID-19 pandemic on Italian mortality rates. This dataset, recently made available as a new component of the Human Mortality Database, is described and techniques for the integration of its data with the historical mortality time series are proposed. Then, to forecast mortality rates, the well-known stochastic mortality model proposed by Lee and Carter in 1992 is first considered, to be consistent with the internal processing of the Human Mortality Database, where exposures are estimated by the Lee–Carter model; empirical results are discussed both on the estimation of the model coefficients and on the forecast of the mortality rates. In detail, we show how the integration of the yearly aggregated STMF data in the HMD database allows the Lee–Carter model to capture the complex evolution of the Italian mortality rates, including the higher lethality for males and older people, in the years that follow a large shock event such as the COVID-19 pandemic. Finally, we discuss some key points concerning the improvement of existing models to take into account mortality shocks and evaluate their impact on future mortality dynamics.
      Citation: Data
      PubDate: 2023-06-13
      DOI: 10.3390/data8060107
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 108: How Expert Is the Crowd? Insights into Crowd
           Opinions on the Severity of Earthquake Damage

    • Authors: Motti Zohar, Amos Salamon, Carmit Rapaport
      First page: 108
      Abstract: The evaluation of earthquake damage is central to assessing its severity and damage characteristics. However, the methods of assessment encounter difficulties concerning the subjective judgments and interpretation of the evaluators. Thus, it is mainly geologists, seismologists, and engineers who perform this exhausting task. Here, we explore whether an evaluation made by semiskilled people and by the crowd is equivalent to the experts’ opinions and, thus, can be harnessed as part of the process. Therefore, we conducted surveys in which a cohort of graduate students studying natural hazards (n = 44) and an online crowd (n = 610) were asked to evaluate the level of severity of earthquake damage. The two outcome datasets were then compared with the evaluation made by two of the present authors, who are considered experts in the field. Interestingly, the evaluations of both the semiskilled cohort and the crowd were found to be fairly similar to those of the experts, thus suggesting that they can provide an interpretation close enough to an expert’s opinion on the severity level of earthquake damage. Such an understanding may indicate that although our analysis is preliminary and requires more case studies for this to be verified, there is vast potential encapsulated in crowd-sourced opinion on simple earthquake-related damage, especially if a large amount of data is to be handled.
      Citation: Data
      PubDate: 2023-06-14
      DOI: 10.3390/data8060108
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 109: Dataset of Program Source Codes Solving Unique
           Programming Exercises Generated by Digital Teaching Assistant

    • Authors: Liliya A. Demidova, Elena G. Andrianova, Peter N. Sovietov, Artyom V. Gorchakov
      First page: 109
      Abstract: This paper presents a dataset containing automatically collected source codes solving unique programming exercises of different types. The programming exercises were automatically generated by the Digital Teaching Assistant (DTA) system that automates a massive Python programming course at MIREA—Russian Technological University (RTU MIREA). Source codes of the small programs grouped by the type of the solved task can be used for benchmarking source code classification and clustering algorithms. Moreover, the data can be used for training intelligent program synthesizers or benchmarking mutation testing frameworks, and more applications are yet to be discovered. We describe the architecture of the DTA system, aiming to provide detailed insight regarding how and why the dataset was collected. In addition, we describe the algorithms responsible for source code analysis in the DTA system. These algorithms use vector representations of programs based on Markov chains, compute pairwise Jensen–Shannon divergences of programs, and apply hierarchical clustering algorithms in order to automatically discover high-level concepts used by students while solving unique tasks. The proposed approach can be incorporated into massive programming courses when there is a need to identify approaches implemented by students.
      Citation: Data
      PubDate: 2023-06-14
      DOI: 10.3390/data8060109
      Issue No: Vol. 8, No. 6 (2023)
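The analysis pipeline summarized in the abstract above (Markov-chain-based program vectors compared by Jensen–Shannon divergence) can be sketched as follows. This is a simplified illustration, not the DTA system's implementation: it assumes each program has already been reduced to a sequence of tokens (for example, syntax-tree node type names), and it uses relative frequencies of consecutive token pairs as a stand-in for a full transition matrix. All names and sample sequences are hypothetical.

```python
import math
from collections import Counter


def transition_distribution(tokens):
    """Relative frequencies of consecutive token pairs, a simplified
    first-order Markov-chain view of a program (assumption)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    distributions given as dicts mapping outcome -> probability."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence from the mixture m to a.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Made-up token sequences for two tiny programs (illustration only).
prog_a = ["FunctionDef", "For", "If", "Return"]
prog_b = ["FunctionDef", "While", "If", "Return"]

d = js_divergence(transition_distribution(prog_a),
                  transition_distribution(prog_b))
print(round(d, 3))
```

A matrix of such pairwise divergences is the kind of input a hierarchical clustering routine would take to group similar solutions.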
  • Data, Vol. 8, Pages 110: Deep Learning-Based Black Spot Identification on
           Greek Road Networks

    • Authors: Ioannis Karamanlis, Alexandros Kokkalis, Vassilios Profillidis, George Botzoris, Chairi Kiourt, Vasileios Sevetlidis, George Pavlidis
      First page: 110
      Abstract: Black spot identification, a spatiotemporal phenomenon, involves analysing the geographical location and time-based occurrence of road accidents. Typically, this analysis examines specific locations on road networks during set time periods to pinpoint areas with a higher concentration of accidents, known as black spots. By evaluating these problem areas, researchers can uncover the underlying causes and reasons for increased collision rates, such as road design, traffic volume, driver behaviour, weather, and infrastructure. However, challenges in identifying black spots include limited data availability, data quality, and assessing contributing factors. Additionally, evolving road design, infrastructure, and vehicle safety technology can affect black spot analysis and determination. This study focused on traffic accidents in Greek road networks to recognize black spots, utilizing data from police and government-issued car crash reports. The study produced a publicly available dataset called Black Spots of North Greece (BSNG) and a highly accurate identification method.
      Citation: Data
      PubDate: 2023-06-16
      DOI: 10.3390/data8060110
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 111: Self-Reported Mental Health and Psychosocial
           Correlates during the COVID-19 Pandemic: Data from the General Population
           in Italy

    • Authors: Daniela Marchetti, Roberta Maiella, Rocco Palumbo, Melissa D’Ettorre, Irene Ceccato, Marco Colasanti, Adolfo Di Crosta, Pasquale La Malva, Emanuela Bartolini, Daniela Biasone, Nicola Mammarella, Piero Porcelli, Alberto Di Domenico, Maria Cristina Verrocchio
      First page: 111
      Abstract: The COVID-19 pandemic tremendously impacted people’s day-to-day activities and mental health. This article describes the dataset used to investigate the psychological impact of the first national lockdown on the general Italian population. For this purpose, an online survey was disseminated via Qualtrics between 1 April and 20 April 2020, to record various socio-demographic and psychological variables. The measures included both validated (namely, the Impact of the Event Scale-Revised, the Perceived Stress Scale, the nine-item Patient Health Questionnaire, the seven-item Generalized Anxiety Disorder scale, the Big Five Inventory 10-Item, and the Whiteley Index-7) and ad hoc questionnaires (nine items to investigate in-group and out-group trust). The final sample comprised 4081 participants (18–85 years old). The dataset could be helpful to other researchers in understanding the psychological impact of the COVID-19 pandemic and its related preventive and protective measures. Furthermore, the present data might help shed some light on the role of individual differences in response to traumatic events. Finally, this dataset can increase the knowledge in investigating psychological distress, health anxiety, and personality traits.
      Citation: Data
      PubDate: 2023-06-16
      DOI: 10.3390/data8060111
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 112: RipSetCocoaCNCH12: Labeled Dataset for Ripeness
           Stage Detection, Semantic and Instance Segmentation of Cocoa Pods

    • Authors: Juan Felipe Restrepo-Arias, María Isabel Salinas-Agudelo, María Isabel Hernandez-Pérez, Alejandro Marulanda-Tobón, María Camila Giraldo-Carvajal
      First page: 112
      Abstract: Fruit counting and ripeness detection are computer vision applications that have gained strength in recent years due to the advancement of new algorithms, especially those based on artificial neural networks (ANNs), better known as deep learning. In agriculture, those algorithms capable of fruit counting, including information about their ripeness, are mainly applied to make production forecasts or plan different activities such as fertilization or crop harvest. This paper presents the RipSetCocoaCNCH12 dataset of cocoa pods labeled at four different ripeness stages: stage 1 (0–2 months), stage 2 (2–4 months), stage 3 (4–6 months), and harvest stage (>6 months). An additional class was also included for pods aborted by plants in the early stage of development. A total of 4116 images were labeled to train algorithms that mainly perform semantic and instance segmentation. The labeling was carried out with CVAT (Computer Vision Annotation Tool). The dataset, therefore, includes labeling in two formats: COCO 1.0 and segmentation mask 1.1. The images were taken with different mobile devices (smartphones), in field conditions, during the harvest season at different times of the day, which could allow the algorithms to be trained with data that includes many variations in lighting, colors, textures, and sizes of the cocoa pods. As far as we know, this is the first openly available dataset for cocoa pod detection with semantic segmentation for five classes, 4116 images, and 7917 instances, comprising RGB images and two different formats for labels. With the publication of this dataset, we expect that researchers in smart farming, especially in cocoa cultivation, can benefit from the quantity and variety of images it contains.
      Citation: Data
      PubDate: 2023-06-18
      DOI: 10.3390/data8060112
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 74: MN-DS: A Multilabeled News Dataset for News
           Article Hierarchical Classification

    • Authors: Alina Petukhova, Nuno Fachada
      First page: 74
      Abstract: This article presents a dataset of 10,917 news articles with hierarchical news categories collected between 1 January 2019 and 31 December 2019. We manually labeled the articles based on a hierarchical taxonomy with 17 first-level and 109 second-level categories. This dataset can be used to train machine learning models for automatically classifying news articles by topic. This dataset can be helpful for researchers working on news structuring, classification, and predicting future events based on released news.
      Citation: Data
      PubDate: 2023-04-23
      DOI: 10.3390/data8050074
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 75: Data on 33 Years of Erroneous Usage of Rainfall
           Erosivity Equations

    • Authors: Nejc Bezak, Klaudija Lebar, Yu-Chieh Huang, Walter Chen
      First page: 75
      Abstract: This paper describes the data gathered for a paper published in Earth-Science Reviews (DOI: 10.1016/j.earscirev.2023.104339) to address the problem of studies using incorrect equations to calculate rainfall erosivity (R factor), which can lead to issues related to land degradation, soil productivity loss, and biodiversity loss. The aim was to locate articles containing the incorrect equations and create a relational database that could be used to perform an in-depth analysis of the errors. Because the search target is an equation, it is impossible to directly query any literature database for the articles that contain the incorrect R equations. Therefore, a manual search of multiple databases was conducted. Subsequently, the literature search was broadened to identify the origin of the misuse of the R equations, and SQL (Structured Query Language) queries were formulated to understand why the errors continued to persist for a minimum of 33 years. The resulting entity-relationship-based Microsoft Access database was determined to be a valuable tool for performing in-depth analysis. It can be used to add incorrect studies and perform further analysis. It is suggested that further research should be conducted to determine the extent of the impact of these errors on soil erosion, ecosystems, and the environment.
      Citation: Data
      PubDate: 2023-04-24
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 76: A Dataset of Marine Macroinvertebrate Diversity
           from Mozambique and São Tomé and Príncipe

    • Authors: Marta Bento, Henrique Niza, Alexandra Cartaxana, Salomão Bandeira, José Paula, Alexandra Marçal Correia
      First page: 76
      Abstract: Marine macroinvertebrate communities play a key role in ecosystem functioning by regulating flows of energy and materials and providing numerous ecosystem services. In Mozambique and São Tomé and Príncipe, marine macroinvertebrates are important for the livelihood and food security of local populations. We compiled a dataset on marine invertebrates from Mozambique and São Tomé and Príncipe through an extensive data search of digital platforms, scientific literature, and natural history collections (NHC). This dataset encompasses data from 1816 to 2023 and comprises 20,122 records, representing 617 families, 1552 genera, and 2137 species, and providing species occurrence in mangrove forests, seagrass beds, coral reefs, and other coastal and offshore habitats. The dataset has a Darwin Core standard format and has been fully released in the Global Biodiversity Information Facility (GBIF). It is accessible through the GBIF portal under the Creative Commons Attribution 4.0 International license. The data are standardized and validated with tools such as WoRMS, GEOLocate, and Google Maps. Therefore, they can be readily used for further studies on species richness, distribution, and functional traits. Overall, this dataset contributes baseline information on marine biodiversity for future research.
      Citation: Data
      PubDate: 2023-04-25
      DOI: 10.3390/data8050076
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 77: Dataset of Specific Total Embodied Energy and
           Specific Total Weight of 40 Buildings from the Last Four Decades in the
           Andean Region of Ecuador

    • Authors: Jefferson Torres-Quezada, Tatiana Sánchez-Quezada
      First page: 77
      Abstract: This article presents the Specific Total Embodied Energy (STEE) and Specific Total Weight (STW) of 40 Andean residential buildings in Ecuador, from 1980 to 2020. First, the bill of materials (BoM) of ten buildings from each decade was obtained through fieldwork carried out in three urban sectors of this city. Second, the specific embodied energy and specific weight of every material found in the 40 samples were obtained from the literature. Finally, the calculation for each building was divided into three components: Structure, Envelope, and Finishes. The analyzed data show a detailed collection of the different materials and construction typologies used in these four decades, and their impact on embodied energy and weight. Moreover, this article gives a Specific Embodied Energy and Specific Weight database of 25 materials that are extensively used in Andean regions. The results show several changes related to the introduction of new materials, but also to the adoption of new architectural models. The most important changes in the analyzed period have been the use of concrete and metal in the structure instead of wood, the increase in the glass surface in the envelope, and the replacement of wood by particleboard in the finishes. In conclusion, the STEE of the entire building has increased by a factor of 2.19 in the last four decades. The STW value has also increased, but to a lesser extent (by a factor of 1.36).
      Citation: Data
      PubDate: 2023-04-26
      DOI: 10.3390/data8050077
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 78: Assessing and Forecasting the Long-Term Impact of
           the Global Financial Crisis on New Car Sales in South Africa

    • Authors: Tendai Makoni, Delson Chikobvu
      First page: 78
      Abstract: In both developed and developing nations, with South Africa (SA) being one of the latter, the motor vehicle industry is one of the most important sectors. The SA automobile industry was not unaffected by the 2007/2008 global financial crisis (GFC). This study aims to assess the impact of the GFC on new car sales in SA through statistical modeling, an impact that has not previously been investigated or quantified. The data obtained indicate that the optimal model for assessing the aforementioned impact is the SARIMA(0,1,1)(0,0,2)₁₂ model. This model’s suitability was confirmed using Akaike information criterion (AIC) and Bayesian information criterion (BIC), as well as the root mean square error (RMSE) and the mean absolute percentage error (MAPE). An upward trend is projected for new car sales in SA, which has positive implications for SA and its economy. The projections indicate that the new car sales rate has increased and has somewhat recovered, but it has not yet reached the levels expected had the GFC not occurred. This shows that SA’s new car industry has been negatively and severely impacted by the GFC and that the effects of the latter still linger today. The findings of this study will assist new car manufacturing companies in SA to better understand their industry, to prepare for future negative shocks, to formulate potential policies for stocking inventories, and to optimize marketing and production levels. Indeed, the information presented in this study provides talking points that should be considered in future government relief packages.
      Citation: Data
      PubDate: 2023-04-27
      DOI: 10.3390/data8050078
      Issue No: Vol. 8, No. 5 (2023)
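      Note: The car sales abstract above names RMSE and MAPE as its forecast accuracy measures. A minimal sketch of the two standard definitions follows; the function names and the sample numbers are hypothetical and are not taken from the article.

```python
import math


def rmse(actual, forecast):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero (the ratio is undefined otherwise).
    """
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical monthly sales figures, for illustration only.
observed = [100.0, 200.0]
predicted = [110.0, 180.0]
print(rmse(observed, predicted), mape(observed, predicted))
```

      RMSE is expressed in the units of the series, whereas MAPE is scale-free, which is one reason the two are often reported together.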
  • Data, Vol. 8, Pages 79: A Tumour and Liver Automatic Segmentation (ATLAS)
           Dataset on Contrast-Enhanced Magnetic Resonance Imaging for Hepatocellular

    • Authors: Félix Quinton, Romain Popoff, Benoît Presles, Sarah Leclerc, Fabrice Meriaudeau, Guillaume Nodari, Olivier Lopez, Julie Pellegrinelli, Olivier Chevallier, Dominique Ginhac, Jean-Marc Vrigneaud, Jean-Louis Alberini
      First page: 79
      Abstract: Liver cancer is the sixth most common cancer in the world and the fourth leading cause of cancer mortality. In unresectable liver cancers, especially hepatocellular carcinoma (HCC), transarterial radioembolisation (TARE) can be considered for treatment. TARE treatment involves a contrast-enhanced magnetic resonance imaging (CE-MRI) exam performed beforehand to delineate the liver and tumour(s) in order to perform dosimetry calculation. Due to the significant amount of time and expertise required to perform the delineation process, there is a strong need for automation. Unfortunately, the lack of publicly available CE-MRI datasets with liver tumour annotations has hindered the development of fully automatic solutions for liver and tumour segmentation. The “Tumour and Liver Automatic Segmentation” (ATLAS) dataset that we present consists of 90 liver-focused CE-MRI covering the entire liver of 90 patients with unresectable HCC, along with 90 liver and liver tumour segmentation masks. To the best of our knowledge, the ATLAS dataset is the first public dataset providing CE-MRI of HCC with annotations. The public availability of this dataset should greatly facilitate the development of automated tools designed to optimise the delineation process, which is essential for treatment planning in liver cancer patients.
      Citation: Data
      PubDate: 2023-04-27
      DOI: 10.3390/data8050079
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 80: Remote Sensing Data Preparation for Recognition
           and Classification of Building Roofs

    • Authors: Emil Hristov, Dessislava Petrova-Antonova, Aleksandar Petrov, Milena Borukova, Evgeny Shirinyan
      First page: 80
      Abstract: Buildings are among the most significant urban infrastructure that directly affects citizens’ livelihood. Knowledge about their rooftops is essential not only for implementing different Levels of Detail (LoD) in 3D city models but also for performing urban analyses related to usage potential (solar, green, social), construction assessment, maintenance, etc. At the same time, the more detailed information we have about the urban environment, the more adequate urban digital twins we can create. This paper proposes an approach for dataset preparation using an orthophoto with a resolution of 10 cm. The goal is to obtain roof images into separate GeoTIFFs categorised by type (flat, pitched, complex) in a way suitable for feeding rooftop classification models. Although the dataset is initially elaborated for rooftop classification, it can be applied to developing other deep-learning models related to roof recognition, segmentation, and usage potential estimation. The dataset consists of 3617 roofs covering the Lozenets district of Sofia, Bulgaria. During its preparation, the local-specific context is considered.
      Citation: Data
      PubDate: 2023-04-28
      DOI: 10.3390/data8050080
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 81: Dataset of Fluorescence EEM and UV Spectroscopy
           Data of Olive Oils during Ageing

    • Authors: Francesca Venturini, Silvan Fluri, Michael Baumgartner
      First page: 81
      Abstract: The dataset presented in this study encompasses fluorescence excitation–emission matrices (EEMs) and UV-spectroscopy data of 24 extra virgin olive oils (EVOOs) commercially available at supermarkets in Switzerland. To investigate the effect of thermal degradation, the samples were exposed to accelerated ageing at 60 °C for up to 53 days. EEMs and UV absorption parameters were measured in 10 ageing steps. The dataset can be used, for example, to predict one or multiple chemical parameters or to classify samples based on their quality from fluorescence spectra.
      Citation: Data
      PubDate: 2023-04-29
      DOI: 10.3390/data8050081
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 82: Exploring Spatial Patterns in Sensor Data for
           Humidity, Temperature, and RSSI Measurements

    • Authors: Juan Botero-Valencia, Adrian Martinez-Perez, Ruber Hernández-García, Luis Castano-Londono
      First page: 82
      Abstract: The Internet of Things (IoT) is one of the fastest-growing research areas in recent years and is strongly linked to the development of smart cities, smart homes, and factories. IoT can be defined as connecting devices, sensors, and physical objects that can collect and transmit data across a network, enabling increased automation and better decision-making. In several IoT applications, humidity and temperature are some of the most used variables for adjusting system configurations and understanding their performance because they are related to various physical processes, human comfort, manufacturing processes, and 3D printing, among other things. In addition, one of the biggest problems associated with IoT is the excessive production of data, so it is necessary to develop methodologies to optimize the process of collecting information. This work presents a new dataset comprising almost 55 million values of temperature, relative humidity, and RSSI (Received Signal Strength Indicator) collected in two indoor spaces for longer than 3915 h at 10 s intervals. For each experiment, we captured the information from 13 previously calibrated sensors suspended from the ceiling at the same height and with a known relative position. The proposed dataset aims to contribute a benchmark for evaluating indoor temperature and humidity-controlled systems. The collected data allow the validation and improvement of the acquisition process for IoT applications.
      Citation: Data
      PubDate: 2023-04-29
      DOI: 10.3390/data8050082
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 83: Cloud-Based Smart Contract Analysis in FinTech
           Using IoT-Integrated Federated Learning in Intrusion Detection

    • Authors: Venkatagurunatham Naidu Kollu, Vijayaraj Janarthanan, Muthulakshmi Karupusamy, Manikandan Ramachandran
      First page: 83
      Abstract: Data sharing is proposed because the issue of data islands hinders advancement of artificial intelligence technology in the 5G era. Sharing high-quality data has a direct impact on how well machine-learning models work, but there will always be misuse and leakage of data. The field of financial technology, or FinTech, has received a lot of attention and is growing quickly. This field has seen the introduction of new terms as a result of its ongoing expansion. One example of such terminology is “FinTech”. This term is used to describe a variety of procedures utilized frequently in the financial technology industry. This study aims to create a cloud-based intrusion detection system based on IoT federated learning architecture as well as smart contract analysis. This study proposes a novel method for detecting intrusions using a cyber-threat federated graphical authentication system and cloud-based smart contracts in FinTech data. Users are required to create a route on a world map as their credentials under this scheme. We had 120 people participate in the evaluation, 60 of whom had a background in finance or FinTech. The simulation was then carried out in Python using a variety of FinTech cyber-attack datasets for accuracy, precision, recall, F-measure, AUC (Area under the ROC Curve), trust value, scalability, and integrity. The proposed technique attained accuracy of 95%, precision of 85%, RMSE of 59%, recall of 68%, F-measure of 83%, AUC of 79%, trust value of 65%, scalability of 91%, and integrity of 83%.
      Citation: Data
      PubDate: 2023-04-29
      DOI: 10.3390/data8050083
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 84: Biotechnology and Bio-Based Products Perceptions
           in the Community of Madrid: A Representative Survey Dataset

    • Authors: Juan Romero-Luis, Manuel Gertrudix, María del Carmen Gertrudis Casado, Alejandro Carbonell-Alcocer
      First page: 84
      Abstract: (1) Background: Bioeconomy aims to reduce dependence on non-renewable resources and foster economic growth through the development of new bio-based products and services. Achieving this goal requires social acceptance and stakeholder engagement in the development of sustainable technologies. The objective of this data article is to provide a dataset derived from a survey with a representative sample of 500 citizens over 18 years old based in the Community of Madrid. (2) Methods: We created a questionnaire on the social acceptance of technologies and bio-based products to later gather the responses using a SurveyMonkey panel for the Community of Madrid through an online CAWI survey; (3) Results: A dataset with a total of 82 columns with all responses is the result of this study. (4) Conclusions: This data article provides not only a valuable representative dataset of citizens of the Community of Madrid but also sufficient resources to replicate the same study in other regions.
      Citation: Data
      PubDate: 2023-05-01
      DOI: 10.3390/data8050084
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 85: Emission Inventory for Maritime Shipping Emissions
           in the North and Baltic Sea

    • Authors: Franziska Dettner, Simon Hilpert
      First page: 85
      Abstract: A high temporal and spatial resolution emission inventory for the North Sea and Baltic Sea was compiled using current emission factors and ship activity data. The inventory includes seagoing vessels over 100 GT registered with the International Maritime Organization traversing in the North and Baltic Seas. A bottom-up approach was chosen for the compilation of the inventory, which provides emission levels of the air pollutants CO2, NOx, SO2, PM2.5, CO, BC, Ash, NMVOC, and POA, as well as the speed-dependent fuel and energy consumption. Input data come from both main and auxiliary engines, as well as well-to-tank and tank-to-propeller emission and energy and fuel consumption quantities. The georeferenced data are provided in a temporal resolution of five minutes. The data can be used to assess, inter alia, the health effects of maritime emissions, the social costs of maritime transport, emission mitigation effects of alternative fuel scenarios, and shore-to-ship power supply.
      Citation: Data
      PubDate: 2023-05-01
      DOI: 10.3390/data8050085
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 86: RaspberrySet: Dataset of Annotated Raspberry
           Images for Object Detection

    • Authors: Sarmīte Strautiņa, Ieva Kalniņa, Edīte Kaufmane, Kaspars Sudars, Ivars Namatēvs, Arturs Nikulins, Edgars Edelmers
      First page: 86
      Abstract: The RaspberrySet dataset is a valuable resource for those working in the field of agriculture, particularly in the selection and breeding of ecologically adaptable berry cultivars. This is because long-term changes in temperature and weather patterns have made it increasingly important for crops to be able to adapt to their environment. To assess the suitability of different cultivars or to make yield predictions, it is necessary to describe and evaluate berries’ characteristics at various growth stages. This process is typically carried out visually, but it can be time-consuming and labor-intensive, requiring significant expert knowledge. The RaspberrySet dataset was created to assist with this process, and it includes images of raspberry berries at five different stages of development. These stages are flower buds, flowers, unripe berries, and ripe berries. The images were classified into five classes (buds, damaged buds, flowers, unripe berries, and ripe berries), annotated with ground-truth regions of interest (ROI), and presented in YOLO format. The dataset includes 2039 high-resolution RGB images, with a total of 46,659 annotations provided by experts using Label Studio software (1.7.1). The images were taken in various weather conditions, at different times of the day, and from different angles, and they include fully visible buds, flowers, berries, and partially obscured buds. This dataset is intended to improve the efficiency of berry breeding and yield estimation and to identify the raspberry phenotype more accurately. It may also be useful for breeding other fruit crops, as it allows for the reliable detection and phenotyping of yield components at different stages of development. By providing a homogenized dataset of images taken on-site at the Institute of Horticulture in Dobele, Latvia, the RaspberrySet dataset offers a valuable resource for those working in horticulture.
      Citation: Data
      PubDate: 2023-05-10
      DOI: 10.3390/data8050086
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 87: The Effect of Short-Term Transcutaneous Electrical
           Stimulation of Auricular Vagus Nerve on Parameters of Heart Rate

    • Authors: Vladimir Shvartz, Eldar Sizhazhev, Maria Sokolskaya, Svetlana Koroleva, Soslan Enginoev, Sofia Kruchinova, Elena Shvartz, Elena Golukhova
      First page: 87
      Abstract: Many previous studies have demonstrated that transcutaneous vagus nerve stimulation (VNS) has the potential to exhibit therapeutic effects similar to its invasive counterpart. An objective assessment of VNS requires a reliable biomarker of successful vagal activation. Although many potential biomarkers have been proposed, most studies have focused on heart rate variability (HRV). Despite the physiological rationale for HRV as a biomarker for assessing vagal stimulation, data on its effects on HRV are equivocal. To further advance this field, future studies investigating VNS should contain adequate methodological specifics that make it possible to compare the results between studies, to replicate studies, and to enhance the safety of study participants. This article describes the design and methodology of a randomized study evaluating the effect of short-term noninvasive stimulation of the auricular branch of the vagus nerve on parameters of HRV. Primary records of rhythmograms of all the subjects, as well as a dataset with clinical, instrumental, and laboratory data of all the current study subjects are in the public domain for possible secondary analysis to all interested researchers. The physiological interpretation of the obtained data is not considered in the article.
      Citation: Data
      PubDate: 2023-05-11
      DOI: 10.3390/data8050087
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 88: A Multispectral UAV Imagery Dataset of Wheat,
           Soybean and Barley Crops in East Kazakhstan

    • Authors: Almasbek Maulit, Aliya Nugumanova, Kurmash Apayev, Yerzhan Baiburin, Maxim Sutula
      First page: 88
      Abstract: This study introduces a dataset of crop imagery captured during the 2022 growing season in the Eastern Kazakhstan region. The images were acquired using a multispectral camera mounted on an unmanned aerial vehicle (DJI Phantom 4). The agricultural land, encompassing 27 hectares and cultivated with wheat, barley, and soybean, was subjected to five aerial multispectral photography sessions throughout the growing season. This facilitated thorough monitoring of the most important phenological stages of crop development in the experimental design, which consisted of 27 plots, each covering one hectare. The collected imagery underwent enhancement and expansion, integrating a sixth band that embodies the normalized difference vegetation index (NDVI) values in conjunction with the original five multispectral bands (Blue, Green, Red, Red Edge, and Near Infrared Red). This amplification enables a more effective evaluation of vegetation health and growth, rendering the enriched dataset a valuable resource for the progression and validation of crop monitoring and yield prediction models, as well as for the exploration of precision agriculture methodologies.
      Citation: Data
      PubDate: 2023-05-11
      DOI: 10.3390/data8050088
      Issue No: Vol. 8, No. 5 (2023)
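      Note: The UAV imagery abstract above describes appending an NDVI band to five multispectral bands. A minimal per-pixel sketch using the standard NDVI definition, (NIR − Red) / (NIR + Red), follows; the function names, the band order in the tuple, and the zero-denominator handling are assumptions for illustration, not the authors' processing pipeline.

```python
def ndvi(nir, red):
    """Standard NDVI for one pixel: (NIR - Red) / (NIR + Red).

    Returns 0.0 when both reflectances are zero (an assumed convention).
    """
    total = nir + red
    if total == 0:
        return 0.0
    return (nir - red) / total


def add_ndvi_band(pixel):
    """Append NDVI as a sixth value to a five-band pixel.

    Assumed band order: (blue, green, red, red_edge, nir).
    """
    blue, green, red, red_edge, nir = pixel
    return (blue, green, red, red_edge, nir, ndvi(nir, red))


# Hypothetical reflectance values, for illustration only.
print(add_ndvi_band((0.1, 0.2, 0.25, 0.3, 0.75)))
```

      NDVI ranges from −1 to 1; dense, healthy vegetation reflects strongly in the near infrared and therefore scores closer to 1.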
  • Data, Vol. 8, Pages 89: A Comprehensive Dataset of Spelling Errors and
           Users’ Corrections in Croatian Language

    • Authors: Gordan Gledec, Marko Horvat, Miljenko Mikuc, Bruno Blašković
      First page: 89
      Abstract: This paper presents a unique and extensive dataset containing over 33 million entries with pairs in the form “spelling error → correction” from ispravi.me, the most popular Croatian online spellchecking service, collected since 2008. The dataset, compiled from the contribution of nearly 900,000 users, is a valuable resource for researchers and developers in the field of natural language processing (NLP), improving spellcheck accuracy, and language learning applications. The dataset may be used to accomplish several goals: (1) improving spellchecking accuracy by incorporating common user corrections and reducing false positives and negatives; (2) helping language learners identify common errors and learn correct spelling through targeted feedback; (3) analyzing data trends and patterns to uncover the most common spelling errors and their underlying causes; (4) identifying and evaluating factors that influence typing input; (5) improving NLP applications such as text recognition and machine translation. Tasks specific to the Croatian language include the creation of a letter-level confusion matrix and the refinement of word suggestions based on historical usage of the service. This comprehensive dataset provides researchers and practitioners with a wealth of information, opening the path for advancements in spellchecking, language learning, and NLP applications in the Croatian language.
      Citation: Data
      PubDate: 2023-05-12
      DOI: 10.3390/data8050089
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 90: An Efficient Deep Learning for Thai Sentiment

    • Authors: Nattawat Khamphakdee, Pusadee Seresangtakul
      First page: 90
      Abstract: The number of reviews from customers on travel websites and platforms is quickly increasing. They provide people with the ability to write reviews about their experience with respect to service quality, location, room, and cleanliness, thereby helping others before booking hotels. Many people fail to consider hotel bookings because the numerous reviews take a long time to read, and many are in a non-native language. Thus, hotel businesses need an efficient process to analyze and categorize the polarity of reviews as positive, negative, or neutral. In particular, low-resource languages such as Thai have greater limitations in terms of resources to classify sentiment polarity. In this paper, a sentiment analysis method is proposed for Thai sentiment classification in the hotel domain. Firstly, the Word2Vec technique (the continuous bag-of-words (CBOW) and skip-gram approaches) was applied to create word embeddings of different vector dimensions. Secondly, each word embedding model was combined with deep learning (DL) models to observe the impact of each word vector dimension result. We compared the performance of nine DL models (CNN, LSTM, Bi-LSTM, GRU, Bi-GRU, CNN-LSTM, CNN-BiLSTM, CNN-GRU, and CNN-BiGRU) with different numbers of layers to evaluate their performance in polarity classification. The dataset was classified using the FastText and BERT pre-trained models to carry out the sentiment polarity classification. Finally, our experimental results show that the WangchanBERTa model slightly improved the accuracy, producing a value of 0.9225, and the skip-gram and CNN model combination outperformed other DL models, reaching an accuracy of 0.9170. From the experiments, we found that the word vector dimensions, hyperparameter values, and the number of layers of the DL models affected the performance of sentiment classification. Our research provides guidance for setting suitable hyperparameter values to improve the accuracy of sentiment classification for the Thai language in the hotel domain.
      Citation: Data
      PubDate: 2023-05-13
      DOI: 10.3390/data8050090
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 91: A Set of Geophysical Fields for Modeling of the
           Lithosphere Structure and Dynamics in the Russian Arctic Zone

    • Authors: Anatoly Soloviev, Alexey Petrunin, Sofia Gvozdik, Roman Sidorov
      First page: 91
      Abstract: This paper presents a set of various geological and geophysical data for the Arctic zone, including some detailed models for the eastern part of the Russian Arctic zone. This hard-to-access territory has a complex geological structure, which is poorly studied by direct geophysical methods. Therefore, these data can be used in an integrative analysis for different purposes. These are the gravity field, heat flow, and various seismic tomography models. The gravity field data include several reductions calculated during our preceding studies, which are more appropriate for the study of the Earth’s interiors than the initial free air anomalies. Specifically, these are the Bouguer, isostatic, and decompensative gravity anomalies. A surface heat flow map included in the dataset is based on a joint inversion of multiple geophysical data constrained by the observations from the International Heat Flow Commission catalog. Available seismic tomography models were analyzed to select the best one for further investigation. We provide the models for the sedimentary cover and the Moho depth, which are significantly improved compared to the existing ones. The database provides a basis for qualitative and quantitative analysis of the region.
      Citation: Data
      PubDate: 2023-05-14
      DOI: 10.3390/data8050091
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 92: Low-Dose Radiation-Induced Transcriptomic Changes
           in Diabetic Aortic Endothelial Cells

    • Authors: Jihye Park, Kyuho Kang, Yeonghoon Son, Kwang Seok Kim, Keunsoo Kang, Hae-June Lee
      First page: 92
      Abstract: Low-dose radiation refers to exposure to ionizing radiation at levels that are generally considered safe and not expected to cause immediate health effects. However, the effects of low-dose radiation are still not fully understood, and research in this area is ongoing. In this study, we investigated the alterations in gene expression profiles of human aortic endothelial cells (HAECs) and diabetic human aortic endothelial cells (T2D-HAECs) derived from patients with type 2 diabetes. To this end, we used RNA-seq to profile the transcriptomes of cells exposed to varying doses of low-dose radiation (0.1 Gy, 0.5 Gy, and 2.0 Gy) and compared them to a control group with no radiation exposure. Differentially expressed genes and enriched pathways were identified using the DESeq2 and gene set enrichment analysis (GSEA) methods, respectively. The data generated in this study are publicly available through the gene expression omnibus (GEO) database with the accession number GSE228572. This study provides a valuable resource for examining the effects of low-dose radiation on HAECs and T2D-HAECs, thereby contributing to a better understanding of the potential human health risks associated with low-dose radiation exposure.
      Citation: Data
      PubDate: 2023-05-18
      DOI: 10.3390/data8050092
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 63: Batik Nitik 960 Dataset for Classification,
           Retrieval, and Generator

    • Authors: Agus Eko Minarno, Indah Soesanti, Hanung Adi Nugroho
      First page: 63
      Abstract: Batik is one of the traditional heritages of Indonesia, with each motif of batik having a profound cultural and philosophical significance. This article introduces Batik Nitik 960 dataset from Yogyakarta, Indonesia. The dataset was extracted from a piece of fabric with 60 Nitik patterns. The dataset was supplied by the Paguyuban Pecinta Batik Indonesia (PPBI) Sekar Jagad Yogyakarta collection of Winotosasto Batik and the data were extracted from the APIPS Gallery. Each of the 60 categories in the collection contains 16 photographs, for a total of 960 images. The photographs were acquired with a Sony Alpha a6400, illuminated with a Godox SK II 400, and the data were compressed using the jpg file format. Each category contains four motifs rotated by 90, 180, and 270 degrees. Thus, the total number of images per motif is 16. Each class has a specific philosophical significance associated with the motif’s origins. This dataset aims to enable the training and evaluation of machine learning models for classification, retrieval, or generation of a new batik pattern using a generative adversarial network. To our knowledge, this study is the first to present a Batik Nitik dataset equipped with philosophical significance that is freely accessible.
      Citation: Data
      PubDate: 2023-03-24
      DOI: 10.3390/data8040063
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 64: Improving an Acoustic Vehicle Detector Using an
           Iterative Self-Supervision Procedure

    • Authors: Birdy Phathanapirom, Jason Hite, Kenneth Dayman, David Chichester, Jared Johnson
      First page: 64
      Abstract: In many non-canonical data science scenarios, obtaining, detecting, attributing, and annotating enough high-quality training data is the primary barrier to developing highly effective models. Moreover, in many problems that are not sufficiently defined or constrained, manually developing a training dataset can often overlook interesting phenomena that should be included. To this end, we have developed and demonstrated an iterative self-supervised learning procedure, whereby models are successfully trained and applied to new data to extract new training examples that are added to the corpus of training data. Successive generations of classifiers are then trained on this augmented corpus. Using low-frequency acoustic data collected by a network of infrasound sensors deployed around the High Flux Isotope Reactor and Radiochemical Engineering Development Center at Oak Ridge National Laboratory, we test the viability of our proposed approach to develop a powerful classifier with the goal of identifying vehicles from continuously streamed data and differentiating these from other sources of noise such as tools, people, airplanes, and wind. Using a small collection of exhaustively manually labeled data, we test several implementation details of the procedure and demonstrate its success regardless of the fidelity of the initial model used to seed the iterative procedure. Finally, we demonstrate the method’s ability to update a model to accommodate changes in the data-generating distribution encountered during long-term persistent data collection.
      Citation: Data
      PubDate: 2023-03-25
      DOI: 10.3390/data8040064
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 65: CyL-GHI: Global Horizontal Irradiance Dataset
           Containing 18 Years of Refined Data at 30-Min Granularity from 37 Stations
           Located in Castile and León (Spain)

    • Authors: Llinet Benavides Cesar, Miguel Ángel Manso Callejo, Calimanut-Ionut Cira, Ramon Alcarria
      First page: 65
      Abstract: Accurate solar forecasting lately relies on advances in the field of artificial intelligence and on the availability of databases with large amounts of information on meteorological variables. In this paper, we present the methodology applied to introduce a large-scale, public solar irradiance dataset, CyL-GHI, containing refined data from 37 stations found within the Spanish region of Castile and León (Spanish: Castilla y León, or CyL). In addition to the data cleaning steps, the procedure also features steps that enable the addition of meteorological and geographical variables that complement the value of the initial data. The proposed dataset, resulting from applying the processing methodology, is delivered both in raw format and with the quality processing applied, and continuously covers 18 years (the period from 1 January 2002 to 31 December 2019), with a temporal resolution of 30 min. CyL-GHI can be of great importance in studies focused on the spatial-temporal characteristics of solar irradiance data, because the geographical information considered enables a regional analysis of the phenomena (the 37 stations cover a land area larger than 94,226 km²). Afterwards, three popular artificial intelligence algorithms were optimised and tested on CyL-GHI, their performance values being offered as baselines to compare other forecasting implementations. Furthermore, the ERA5 values corresponding to the studied area were analysed and compared with performance values delivered by the trained models. The inclusion of previous observations of neighbours as input to an optimised Random Forest model (applying a spatio-temporal approach) improved the predictive capability of the machine learning models by almost 3%.
      Citation: Data
      PubDate: 2023-03-26
      DOI: 10.3390/data8040065
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 66: Satellite-Derived Annual Glacier Surface Flow
           Velocity Products for the European Alps, 2015–2021

    • Authors: Antoine Rabatel, Etienne Ducasse, Romain Millan, Jérémie Mouginot
      First page: 66
      Abstract: Documenting glacier surface flow velocity from a longer-term perspective is highly relevant to evaluate the past and current state of glaciers worldwide. For this purpose, satellite data are widely used to obtain region-wide coverage of glacier velocity data. Well-established image correlation methods allow for the automated measurement of glacier surface displacements from satellite data (optical and radar) acquired at different dates. Although computationally expensive, image correlation is nowadays relatively simple to implement and allows two-dimensional displacement measurements. Here, we present a data set of annual glacier surface flow velocity maps at the European Alps scale, covering the period 2015–2021 at a 50 m × 50 m resolution. This data set has been quantified by applying the normalized cross-correlation approach on Sentinel-2 optical data. Parameters of the cross-correlation method (e.g., window size, sampling resolution) have been optimized, and the results have been validated by comparing them with in situ data on monitored glaciers showing an RMSE of 10 m/yr. These data can be used to evaluate glacier dynamics and its spatial and temporal evolution (e.g., quantify mass fluxes or calving) or can be used as an input for model calibration/validation or for the early detection of regional hazards associated with glacier destabilization.
      Citation: Data
      PubDate: 2023-03-27
      DOI: 10.3390/data8040066
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 67: NGS Reads Dataset of Sunflower Interspecific

    • Authors: Maksim S. Makarenko, Vera A. Gavrilova
      First page: 67
      Abstract: The sunflower (Helianthus annuus), which belongs to the family Asteraceae, is a crop grown worldwide for consumption by humans and livestock. Interspecific hybridization is widespread for sunflowers both in wild populations and commercial breeding. The current dataset comprises 250 bp and 76 paired-end NGS reads for six interspecific sunflower hybrids (F1). The dataset aims to expand genomic information on Helianthus species and benefit genetic research, and it is useful for investigating the features of alloploids and for studying nuclear–organelle interactions. Mitochondrial genomes of perennial sunflower hybrids H. annuus × H. strumosus and H. annuus × H. occidentalis were assembled and compared with parental forms.
      Citation: Data
      PubDate: 2023-03-27
      DOI: 10.3390/data8040067
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 68: Sentiment Analysis of Multilingual Dataset of
           Bahraini Dialects, Arabic, and English

    • Authors: Thuraya Omran, Baraa Sharef, Crina Grosan, Yongmin Li
      First page: 68
      Abstract: Sentiment analysis is an application of natural language processing (NLP) that requires a machine learning algorithm and a dataset. In some cases, the dataset availability is scarce, particularly with Arabic dialects, precisely the Bahraini ones, which necessitates using an approach such as translation, where a rich source language is exploited to create the target language dataset. In this study, a dataset of Amazon product reviews in Bahraini dialects is presented. This dataset was generated using two cascading stages of translation—a machine translation followed by a manual one. Machine translation was applied using Google Translate to translate English Amazon product reviews into Standard Arabic. In contrast, the manual approach was applied to translate the resulting Arabic reviews into Bahraini ones by qualified native speakers utilizing constructed customized forms. The resulting parallel dataset of English, Standard Arabic, and Bahraini dialects is called English_Modern Standard Arabic_Bahraini Dialects product reviews for sentiment analysis “E_MSA_BDs-PR-SA”. The dataset is balanced, composed of 2500 positive and 2500 negative reviews. The sentiment analysis process was implemented using a stacked LSTM deep learning model. The Bahraini dialect product dataset can be utilized in the transfer learning process for sentimentally analyzing another dataset in Bahraini dialects.
      Citation: Data
      PubDate: 2023-03-30
      DOI: 10.3390/data8040068
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 69: Froth Images from Flotation Laboratory Test in
           Magotteaux Cell

    • Authors: Carlos Yantén, Willy Kracht, Gonzalo Díaz, Pía Lois-Morales, Alvaro Egaña
      First page: 69
      Abstract: Froth flotation is a widely used method for the concentration of sulfide minerals. The structure of the superficial froth is an indicator of the performance of froth flotation, along with the operational conditions in which this process is carried out. The aim of this study is to explore how the different operational conditions that can be managed in a flotation plant could directly influence the observable characteristics of the superficial froth. For this purpose, a froth image database was created using a special laboratory cell, designed to emulate the conditions seen in an industrial flotation cell. The database contains 2250 images, distributed in 45 categories; each category has a specific combination of testing conditions, and the main visual characteristics are observed. It also includes a methodology used to assess the quality of each corresponding category.
      Citation: Data
      PubDate: 2023-03-31
      DOI: 10.3390/data8040069
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 70: Clinical Trial Data on the Mechanical Removal of
           14-Day-Old Dental Plaque Using Accelerated Micro-Droplets of Air and Water

    • Authors: Yumi C. Del Rey, Pernille D. Rikvold, Karina K. Johnsen, Sebastian Schlafer
      First page: 70
      Abstract: Novel strategies to combat dental biofilms aim at reducing biofilm stability with the ultimate goal of facilitating mechanical cleaning. To test the stability of dental biofilms, they need to be subjected to a defined mechanical stress. Here, we employed an oral care device (Airfloss) that emits microbursts of compressed air and water to apply a defined mechanical shear to 14-day-old dental plaque in 20 healthy participants with no signs of oral diseases (clinical trial no. NCT05082103). Exclusion criteria included pregnant or nursing women, users of oral prostheses, retainers or orthodontic appliances, and recent antimicrobial or anti-inflammatory therapy. Plaque accumulation, before and after treatment, was assessed using fluorescence images of disclosed dental plaque on the central incisor, first premolar, and first molar in the third quadrant (120 images). For each tooth, the pre- and post-treatment plaque percentage index (PPI) and Turesky modification of the Quigley-Hein plaque index (TM-QHPI) were recorded. The mean TM-QHPI significantly decreased after treatment (p = 0.03; one-sample sign test), but no significant difference between the mean pre- and post-treatment PPI was observed (p = 0.09; one-sample t-test). These data are of value for researchers that seek to apply a defined mechanical shear to remove and/or disrupt dental biofilms.
      Citation: Data
      PubDate: 2023-03-31
      DOI: 10.3390/data8040070
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 71: Proteomic Shotgun and Targeted Mass Spectrometric
           Datasets of Cerebrospinal Fluid (Liquor) Derived from Patients with
           Vestibular Schwannoma

    • Authors: Svetlana Novikova, Natalia Soloveva, Tatiana Farafonova, Olga Tikhonova, Vadim Shimansky, Ivan Kugushev, Victor Zgoda
      First page: 71
      Abstract: Vestibular schwannomas are relatively rare intracranial tumors compared to other brain tumors. Data on the molecular features, especially on the schwannoma proteome, are scarce. The 41 cerebrospinal fluid (liquor) samples were obtained during the surgical removal of vestibular schwannoma. Obtained peptide samples were analyzed by shotgun LC-MS/MS high-resolution mass spectrometry. The same peptide samples were spiked with 148 stable isotopically labeled peptide standards (SIS) followed by alkaline fractionation and scheduled multiple reaction monitoring (MRM) for quantitative analysis. The natural counterparts of SIS peptides were mapped onto 111 proteins that were Food and Drug Administration (FDA)-approved for diagnostic use. As a result, 525 proteins were identified by shotgun LC-MS/MS with high confidence (at least two peptides per protein, FDR < 1%) in liquor samples. Absolute quantitative concentrations were obtained for 54 FDA-approved proteins detected in at least five experimental samples. Since there is a lack of data on the molecular landscape of vestibular schwannoma, the obtained datasets are unique and among the first in their field.
      Citation: Data
      PubDate: 2023-04-06
      DOI: 10.3390/data8040071
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 72: Collecting and Pre-Processing Data for Industry
           4.0 Implementation Using Hydraulic Press

    • Authors: Radim Hercik, Radek Svoboda
      First page: 72
      Abstract: More and more activities are being undertaken to implement the Industry 4.0 concept in industrial practice. One of the biggest challenges is the digitization of existing industrial systems and heavy industry operations, where there is huge potential for optimizing and managing these processes more efficiently, but this requires collecting large amounts of data, understanding, and evaluating it so that we can add value back based on it. This paper focuses on the collection, local pre-processing of data, and its subsequent transfer to the cloud from an industrial hydraulic press to create a comprehensive dataset that forms the basis for further digitization of the operation. The novelty lies mainly in the process of data collection and pre-processing in the framework of edge computing of large amounts of data. In the data pre-processing, data normalization methods are applied, which allow the data to be logically sorted, tagged, and linked, which also allows the data to be efficiently compressed, thus, dynamically creating a complex dataset for later use in the process digitization.
      Citation: Data
      PubDate: 2023-04-15
      DOI: 10.3390/data8040072
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 73: Digital Twin Application and Bibliometric Analysis
           for Digitization and Intelligence Studies in Geology and Deep Underground
           Research Areas

    • Authors: Eun-Young Ahn, Seong-Yong Kim
      First page: 73
      Abstract: As deep underground digital twins have not yet been established worldwide, this study extracted keywords from national or city-led digital twin practices and elements of digital twins and through these keywords selected research papers and topics that could contribute to the establishment of deep underground digital twins in the future. We applied the concept of digital twins in geology and underground research to collect 1702 papers from the Web of Science and conducted semantic network analysis and topic modeling. The keywords digital, three dimensions, and real time were placed in the middle and have many links in the word network. Artificial intelligence, deep learning, and neural networks all showed a low degree of centrality. As a result of topic modeling using Latent Dirichlet allocation (LDA), topics related to topography, geological structure, and rock distribution, which are the basic data for building a deep underground digital twin, were noted, and topics related to earthquakes/vibrations, landslides, groundwater, and volcanoes were identified. Energy resources and space utilization have emerged as the main themes.
      Citation: Data
      PubDate: 2023-04-20
      DOI: 10.3390/data8040073
      Issue No: Vol. 8, No. 4 (2023)
  • Data, Vol. 8, Pages 48: Reconstructed River Water Temperature Dataset for
           Western Canada 1980–2018

    • Authors: Rajesh R. Shrestha, Jennifer C. Pesklevits
      First page: 48
      Abstract: Continuous water temperature data are important for understanding historical variability and trends of river thermal regime, as well as impacts of warming climate on aquatic ecosystem health. We describe a reconstructed daily water temperature dataset that supplements sparse historical observations for 55 river stations across western Canada. We employed the air2stream model to reconstruct the water temperature dataset over the period 1980–2018, with air temperature and discharge data used as model inputs. The model was calibrated and validated by comparing with observed water temperature records, and the results indicate a reasonable statistical performance. We also present historical trends over the ice-free summer months from June to September using the reconstructed dataset, which indicate significantly increasing water temperature trends for most stations. Besides trend analysis, the dataset could be used for various applications, such as calculation of heat fluxes, calibration/validation of process-based water temperature models, establishment of baseline condition for future climate projections, and assessment of impacts on ecosystem health and water quality.
      Citation: Data
      PubDate: 2023-02-26
      DOI: 10.3390/data8030048
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 49: Data Balancing Techniques for Predicting Student
           Dropout Using Machine Learning

    • Authors: Neema Mduma
      First page: 49
      Abstract: Predicting student dropout is a challenging problem in the education sector. This is due to an imbalance in student dropout data, mainly because the number of registered students is always higher than the number of dropout students. Developing a model without taking the data imbalance issue into account may lead to an ungeneralized model. In this study, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining a satisfactory overall classification performance. Random Over Sampling, Random Under Sampling, Synthetic Minority Over Sampling, SMOTE with Edited Nearest Neighbor and SMOTE with Tomek links were tested, along with three popular classification models: Logistic Regression, Random Forest, and Multi-Layer Perceptron. Publicly accessible datasets from Tanzania and India were used to evaluate the effectiveness of balancing techniques and prediction models. The results indicate that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample. Furthermore, Logistic Regression correctly classified the largest number of dropout students (57348 for the Uwezo dataset and 13430 for the India dataset) using the confusion matrix as the evaluation matrix. The applications of these models allow for the precise prediction of at-risk students and the reduction of dropout rates.
      Citation: Data
      PubDate: 2023-02-27
      DOI: 10.3390/data8030049
      Issue No: Vol. 8, No. 3 (2023)
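The simplest of the balancing techniques named in the abstract above, Random Over Sampling, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `random_oversample`, the toy rows, and the fixed seed are all hypothetical, and only the Python standard library is used.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):  # zero iterations for the majority class
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: six enrolled students (label 0), two dropouts (label 1).
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling of this kind only repeats existing minority rows; SMOTE and its variants, also listed in the abstract, instead synthesise new points between neighbouring minority samples.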
  • Data, Vol. 8, Pages 50: Dataset of Partial Analytical Validation of the
           1,2-O-Dilauryl-Rac-Glycero-3-Glutaric Acid-(6′-Methylresorufin)
           Ester (DGGR) Lipase Assay in Equine Plasma

    • Authors: Laureen Michèle Peters, Judith Howard
      First page: 50
      Abstract: Laboratory assays require analytical validation to prove they are providing accurate results. This dataset describes the partial analytical validation of lipase activity, measured with the 1,2-o-dilauryl-rac-glycero-3-glutaric acid-(6′-methylresorufin) ester (DGGR) lipase assay in equine plasma. Samples with low (approx. 12 U/L), moderately increased (approx. 79 U/L), and markedly increased lipase activity (approx. 298 U/L) were chosen. Linearity was assessed in samples of ascending dilution prepared by mixing samples with low and high lipase activity in different proportions. Repeatability or intra-assay replication was evaluated by measuring each level in 25 replicates within the same run. Reproducibility or inter-assay replication was calculated by measuring each level in five replicates on five consecutive days. The assay was linear in the range of 12–298 U/L (R2 = 0.9998) with a <2.3% deviation from the calculated value at any point. Within-run coefficients of variation were 4.43%, 0.69%, and 1.00% for the low, medium, and high samples, respectively. Between-run coefficients of variation were 3.57%, 1.42%, and 1.16%, respectively. To our knowledge, these are the first published data on the analytical validation of the DGGR lipase assay in horses, which may be of interest to veterinary clinical pathologists and equine clinicians measuring DGGR lipase in equine blood for diagnostic and research purposes.
      Citation: Data
      PubDate: 2023-02-28
      DOI: 10.3390/data8030050
      Issue No: Vol. 8, No. 3 (2023)
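The within-run and between-run figures quoted in the abstract above are coefficients of variation. As a minimal sketch, assuming the usual definition (sample standard deviation divided by the mean, expressed in percent), the calculation looks as follows. The replicate values are hypothetical and are not taken from the dataset.

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV in percent: sample standard deviation / mean * 100."""
    mean = statistics.mean(values)
    return statistics.stdev(values) / mean * 100.0


# Hypothetical replicate lipase activities (U/L) for one sample level.
replicates = [78.0, 79.5, 80.0, 79.0, 78.5]
cv = coefficient_of_variation(replicates)
print(f"CV = {cv:.2f}%")
```

The same function applies to repeatability (replicates within one run) and to reproducibility (replicates pooled across days); only the grouping of the input values changes.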
  • Data, Vol. 8, Pages 51: Correction: Michel et al. SEN2VENµS, a
           Dataset for the Training of Sentinel-2 Super-Resolution Algorithms. Data
           2022, 7, 96

    • Authors: Julien Michel, Juan Vinasco-Salinas, Jordi Inglada, Olivier Hagolle
      First page: 51
      Abstract: There was an error in the original publication [...]
      Citation: Data
      PubDate: 2023-02-28
      DOI: 10.3390/data8030051
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 52: Dataset on SCADA Data of an Urban Small Wind
           Turbine Operation in São Paulo, Brazil

    • Authors: Welson Bassi, Alcantaro Lemes Rodrigues, Ildo Luis Sauer
      First page: 52
      Abstract: Small wind turbines (SWTs) represent an opportunity to promote energy generation technologies from low-carbon renewable sources in cities. Tall buildings are inherently suitable for placing SWTs in urban environments. Thus, the Institute of Energy and Environment of the University of São Paulo (IEE-USP) has installed an SWT in an existing high-height High Voltage Laboratory building on its campus in São Paulo, Brazil. The dataset file contains data regarding the actual electrical and mechanical operational quantities and control parameters obtained and recorded by the internal inverter of a Skystream 3.7 SWT, with 1.8 kW rated power, from 2017 to 2022. The main electrical parameters are the generated energy, voltages, currents, and power frequency in the connection grid point. Rotation, referential wind speed, and temperatures measured in some points at the inverter and in the nacelle are also recorded. Several other parameters concerning the SWT inverter operation, including alarms and status codes, are also presented. This dataset can be helpful for reanalysis, to access information, such as capacity factor, and can also be used as overall input data of actual SWT operation quantities.
      Citation: Data
      PubDate: 2023-02-28
      DOI: 10.3390/data8030052
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 53: Toward a Spatially Segregated Urban Growth?
           Austerity, Poverty, and the Demographic Decline of Metropolitan Greece

    • Authors: Kostas Rontos, Enrico Maria Mosconi, Mattia Gianvincenzi, Simona Moretti, Luca Salvati
      First page: 53
      Abstract: Metropolitan decline in southern Europe was documented in few cases, being less intensively investigated than in other regions of the continent. Likely for the first time in recent history, the aftermath of the 2007 recession was a time period associated with economic and demographic decline in Mediterranean Europe. However, the impacts and consequences of the great crisis were occasionally verified and quantified, both in strictly urban contexts and in the surrounding rural areas. By exploiting official statistics, our study delineates sequential stages of demographic growth and decline in a large metropolitan region (Athens, Greece) as a response to economic expansion and stagnation. Having important implications for the extent and spatial direction of metropolitan cycles, the Athens’ case—taken as an example of urban cycles in Mediterranean Europe—indicates a possibly new dimension of urban shrinkage, with spatially varying population growth and decline along a geographical gradient of income and wealth. Heterogeneous dynamics led to a leapfrog urban expansion decoupled from agglomeration and scale, the factors most likely shaping long-term metropolitan expansion in advanced economies. Demographic decline in urban contexts was associated with multidimensional socioeconomic processes resulting in spatially complex demographic outcomes that require appropriate, and possibly more specific, regulation policies. By shedding further light on recession-driven metropolitan decline in advanced economies, the present study contributes to re-thinking short-term development mechanisms and medium-term demographic scenarios in Mediterranean Europe.
      Citation: Data
      PubDate: 2023-03-01
      DOI: 10.3390/data8030053
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 54: Manual of GUI Program Governing ABAQUS Simulations
           of Bar Impact Test for Calibrating Bar Properties, Measured Strain, and
           Impact Velocity

    • Authors: Hyunho Shin
      First page: 54
      Abstract: Bar impact instruments, such as the (split) Hopkinson bars and direct impact Hopkinson bars, measure blast/impact waves or mechanical properties of materials at high strain rates. To effectively use such instruments, it is essential to know (i) the elastic properties of the bar, (ii) the correction factor of the measured strain, and (iii) information on impact velocity. This paper presents a graphic-user-interface (GUI) program prepared for solving these fundamental issues. We describe the directory structure of the program, roles and relations of associated files, GUI panels, algorithm, and execution procedure of the program. This program employs a separately measured bar density value and governs the ABAQUS simulations (explicit finite element analyses) of the bar impact test at a given impact velocity for a range of bar properties (elastic modulus and Poisson’s ratio) and two correction factors (in compression and tension) of the measured strain. The simulation is repeated until the predicted elastic wave profile in the bar is reasonably consistent with the experimental counterpart. The bar properties and correction factors are determined as the calibrated values when the two wave profiles are reasonably consistent. The program is also capable of impact velocity calibration with reference to a reliably measured bar strain wave. The quantities of a 19.1 mm diameter bar (maraging steel) were successfully calibrated using the presented GUI program. The GUI program, auxiliary programs, pre-processing files, and an example ABAQUS input file are available in a publicly accessible data repository.
      Citation: Data
      PubDate: 2023-03-01
      DOI: 10.3390/data8030054
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 55: Dataset AqADAPT: Physicochemical Parameters,
           Vibrio Abundance, and Species Determination in Water Columns of Two
           Adriatic Sea Aquaculture Sites

    • Authors: Marija Purgar, Damir Kapetanović, Ana Gavrilović, Branimir K. Hackenberger, Božidar Kurtović, Ines Haberle, Jadranka Pečar Ilić, Sunčana Geček, Domagoj K. Hackenberger, Tamara Djerdj, Lav Bavčević, Jakov Žunić, Fran Barac, Zvjezdana Šoštarić Vulić, Tin Klanjšček
      First page: 55
      Abstract: Aquaculture provides more than 50% of all seafood for human consumption. This important industrial sector is already under pressure from climate-change-induced shifts in water column temperature, nutrient loads, precipitation patterns, microbial community composition, and ocean acidification, all affecting fish welfare. Disease-related risks are also shifting with important implications for risk from vibriosis, a disease that can lead to massive economic losses. Adaptation to these pressures poses numerous challenges for aquaculture producers, policy makers, and researchers. The dataset AqADAPT aims to help the development of management and adaptation tools by providing (i) measurements of physicochemical (temperature, salinity, total dissolved solids, pH, dissolved oxygen, conductivity, transparency, total nitrogen, ammonia, nitrate, nitrite, total phosphorus, total particulate matter, particulate organic matter, and particulate inorganic matter) and microbiological (heterotrophic (total) bacteria, fecal indicators, and Vibrio abundance) parameters of seawater and (ii) biochemical determination of culturable bacteria in two locations near floating cage fish farms in the Adriatic Sea. Water sampling was conducted seasonally between 2019 and 2021 in two fish farms (Cres and Vrgada) and corresponding reference (control) sites, at four vertical layers (the surface, 6 m, 12 m, and the bottom), for a total of 108 observations.
      Citation: Data
      PubDate: 2023-03-03
      DOI: 10.3390/data8030055
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 56: Learned Sorted Table Search and Static Indexes in
           Small-Space Data Models

    • Authors: Domenico Amato, Raffaele Giancarlo, Giosué Lo Bosco
      First page: 56
      Abstract: Machine-learning techniques, properly combined with data structures, have resulted in Learned Static Indexes, innovative and powerful tools that speed up Binary Searches with the use of additional space with respect to the table being searched into. Such space is devoted to the machine-learning models. Although in their infancy, these are methodologically and practically important, due to the pervasiveness of Sorted Table Search procedures. In modern applications, model space is a key factor, and a major open question concerning this area is to assess to what extent one can enjoy the speeding up of Binary Searches achieved by Learned Indexes while using constant or nearly constant-space models. In this paper, we investigate the mentioned question by (a) introducing two new models, i.e., the Learned k-ary Search Model and the Synoptic Recursive Model Index; and (b) systematically exploring the time–space trade-offs of a hierarchy of existing models, i.e., the ones in the reference software platform Searching on Sorted Data, together with the new ones proposed here. We document a novel and rather complex time–space trade-off picture, which is informative for users as well as designers of Learned Indexing data structures. By adhering to and extending the current benchmarking methodology, we experimentally show that the Learned k-ary Search Model is competitive in time with respect to Binary Search in constant additional space. Our second model, together with the bi-criteria Piece-wise Geometric Model Index, can achieve speeding up of Binary Search with a model space of 0.05% more than the one taken by the table, thereby, being competitive in terms of the time–space trade-off with existing proposals. The Synoptic Recursive Model Index and the bi-criteria Piece-wise Geometric Model complement each other quite well across the various levels of the internal memory hierarchy. Finally, our findings stimulate research in this area since they highlight the need for further studies regarding the time–space relation in Learned Indexes.
      Citation: Data
      PubDate: 2023-03-03
      DOI: 10.3390/data8030056
      Issue No: Vol. 8, No. 3 (2023)
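The general idea behind a learned index over a sorted table, as discussed in the abstract above, can be sketched with a single linear model: the model predicts a key's position, and a binary search is then confined to a small window around that prediction. This is a generic constant-space illustration, not the Learned k-ary Search Model or the Synoptic Recursive Model Index from the paper. The function names and the sample table are hypothetical.

```python
import bisect


def fit_linear_model(table):
    """Fit position ~ slope * key + intercept by least squares and record
    an integer bound on the prediction error over the table."""
    n = len(table)
    mean_x = sum(table) / n
    mean_y = (n - 1) / 2
    var_x = sum((x - mean_x) ** 2 for x in table)
    cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(table))
    slope = cov / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x
    max_err = max(abs(i - (slope * x + intercept)) for i, x in enumerate(table))
    return slope, intercept, int(max_err) + 1  # three numbers: constant space


def learned_search(table, key, model):
    """Return an index of key in the sorted table, or -1 if it is absent."""
    slope, intercept, eps = model
    n = len(table)
    guess = int(slope * key + intercept)
    lo = min(max(0, guess - eps), n)
    hi = max(lo, min(n, guess + eps + 1))
    # Binary search restricted to the window [lo, hi) around the prediction.
    i = bisect.bisect_left(table, key, lo, hi)
    return i if i < n and table[i] == key else -1


table = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # hypothetical sorted table
model = fit_linear_model(table)
print(learned_search(table, 17, model), learned_search(table, 4, model))
```

The window width depends on how well the model fits the key distribution, which is one face of the time–space trade-off the abstract refers to: richer models shrink the window but cost more space.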
  • Data, Vol. 8, Pages 57: Dataset for Spectroscopic, Structural and Dynamic
           Analysis of Human Fe(II)/2OG-Dependent Dioxygenase ALKBH3

    • Authors: Lyubov Yu. Kanazhevskaya, Alexey A. Gorbunov, Polina V. Zhdanova, Vladimir V. Koval
      First page: 57
      Abstract: Fe(II)/2OG-dependent dioxygenases of the AlkB family catalyze the direct removal of alkylation damage in the course of DNA and RNA repair. ALKBH3, a human homolog of the E. coli AlkB protein, is able to hydroxylate N1-methyladenine, N3-methylcytosine, and N1-methylguanine in single-stranded DNA and RNA. Due to its contribution to antitumor drug resistance, this enzyme is considered a promising therapeutic target. The elucidation of ALKBH3’s structural peculiarities is important for establishing a detailed mechanism of damaged DNA recognition and processing, as well as for developing specific inhibitors. This work presents new data on the wild-type ALKBH3 protein and its four mutant forms (Y143F, Y143A, L177A, and H191A) obtained by circular dichroism (CD) spectroscopy. The dataset includes the CD spectra of proteins measured at different temperatures and a 3D visualization of the ALKBH3–DNA complex where the mutated amino acid residues are marked. These results show how substitution of the key amino acids influences the secondary structure content of the protein.
      Citation: Data
      PubDate: 2023-03-03
      DOI: 10.3390/data8030057
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 58: Home Comfort Dataset: Acquired from SGH

    • Authors: Mariana Santos, Mário Antunes, Diogo Gomes, Rui L. Aguiar
      First page: 58
      Abstract: In this work, we share the dataset collected during the Smart Green Homes (SGH) project. The project’s goal was to develop integrated products and technology solutions for households, as well as to improve the standards of comfort and user satisfaction. This was to be achieved while improving household energy efficiency and reducing the usage of gaseous pollutants, in response to the planet’s sustainability issues. One of the tasks executed within the project was the collection of data from volunteers’ homes, including environmental information and the level of comfort as perceived by the volunteers themselves. While used in the original project, the resulting dataset contains valuable information that could not be explored at the time. We now share this dataset with the community, which can be used for various scenarios. These may include heating appliance optimisation, presence detection and environmental prediction.
      Citation: Data
      PubDate: 2023-03-03
      DOI: 10.3390/data8030058
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 59: WaRM: A Roof Material Spectral Library for
           Wallonia, Belgium

    • Authors: Coraline Wyard, Rodolphe Marion, Eric Hallot
      First page: 59
      Abstract: The exploitation of urban-material spectral properties is of increasing importance for a broad range of applications, such as urban climate-change modeling and mitigation or specific/dangerous roof-material detection and inventory. A new spectral library dedicated to the detection of roof material was created to reflect the regional diversity of materials employed in Wallonia, Belgium. The Walloon Roof Material (WaRM) spectral library accounts for 26 roof material spectra in the spectral range 350–2500 nm. Spectra were acquired using an ASD FieldSpec3 Hi-Res spectrometer in laboratory conditions, using a spectral sampling interval of 1 nm. The analysis of the spectra shows that spectral signatures are strongly influenced by the color of the roof materials, at least in the VIS spectral range. The SWIR spectral range is in general more relevant to distinguishing the different types of material. Exceptions are the similar properties and very close spectra of several black materials, meaning that their spectral signatures are not sufficiently different to distinguish them from each other. Although building materials can vary regionally due to different available construction materials, the WaRM spectral library can certainly be used for wider applications; Wallonia has always been strongly connected to the surrounding regions and has always encountered climatic conditions similar to all of Northwest Europe.
      Citation: Data
      PubDate: 2023-03-07
      DOI: 10.3390/data8030059
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 60: Development of a Machine-Learning-Based Novel
           Framework for Travel Time Distribution Determination Using Probe Vehicle

    • Authors: Gurmesh Sihag, Praveen Kumar, Manoranjan Parida
      First page: 60
      Abstract: Investigating travel time variability is critical for pre-trip planning, reliable route selection, traffic management, and the development of control strategies to mitigate traffic congestion problems cost-effectively. Hence, a large number of studies are available in the literature which determine the most suitable distribution to fit the travel time data, but these studies recommend different distributions for the travel time data, and there is a disagreement on the best distribution option for fitting to the travel time data. The present study proposes a novel framework to determine the best distribution to represent the travel time data obtained from probe vehicles by using the modern machine learning technique. This study employs vast travel time data collected by fitting GPS tracking units on the probe vehicles and offers a comprehensive investigation of travel time distribution in different scenarios generated due to spatiotemporal variation of the travel time. The study also considers the effect of weather and uses the three most commonly used non-parametric goodness-of-fit tests (namely, Kolmogorov–Smirnov test, Anderson–Darling test, and chi-squared test) to fit and rank a comprehensive set of around 60 unimodal statistical distributions. The framework proposed in the study can determine the travel time distribution with 91% accuracy. Additionally, the distribution determined by the framework has an acceptance rate of 98.4%, which is better than the acceptance rates of the distributions recommended in existing studies. Because of its robustness and applicability in many different traffic situations, the proposed framework can also be used in developing countries with heterogeneous disordered traffic conditions to evaluate the road network’s performance in terms of travel time reliability.
      Citation: Data
      PubDate: 2023-03-14
      DOI: 10.3390/data8030060
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 61: TKGQA Dataset: Using Question Answering to Guide
           and Validate the Evolution of Temporal Knowledge Graph

    • Authors: Ryan Ong, Jiahao Sun, Ovidiu Șerban, Yi-Ke Guo
      First page: 61
      Abstract: Temporal knowledge graphs can be used to represent the current state of the world and, as daily events happen, the need to update the temporal knowledge graph, in order to stay consistent with the state of the world, becomes very important. However, there is currently no reliable method to accurately validate the update and evolution of knowledge graphs. There has been a recent development in text summarisation, whereby question answering is used to both guide and fact-check summarisation quality. The same process can be applied to the temporal knowledge graph update process. To the best of our knowledge, there is currently no dataset that connects temporal knowledge graphs with documents and question–answer pairs. In this paper, we propose the TKGQA dataset, consisting of over 5000 financial news documents related to M&A. Each document has extracted facts, question–answer pairs, and before and after temporal knowledge graphs, to highlight the state of temporal knowledge and any changes caused by the facts extracted from the document. As we parse through each document, we use question answering to check and guide the update process of the temporal knowledge graph.
      Citation: Data
      PubDate: 2023-03-14
      DOI: 10.3390/data8030061
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 62: Instance and Data Generation for the Offline
           Nanosatellite Task Scheduling Problem

    • Authors: Cezar Antônio Rigo, Edemar Morsch Filho, Laio Oriel Seman, Luís Loures, Valderi Reis Quietinho Leithardt
      First page: 62
      Abstract: This paper discusses several cases of the Offline Nanosatellite Task Scheduling (ONTS) optimization problem, which seeks to schedule the start and finish timings of payloads on a nanosatellite. Modeled after the FloripaSat-I mission, a nanosatellite, the examples were built expressly to test the performance of various solutions to the ONTS problem. Realistic input data for power harvesting calculations were used to generate the instances, and an instance creation procedure was employed to increase the instances’ difficulty. The instances are made accessible to the public to facilitate a fair comparison of various solutions and to aid in establishing a baseline for the ONTS problem. Additionally, the study discusses the various orbit types and their effects on energy harvesting and mission performance.
      Citation: Data
      PubDate: 2023-03-21
      DOI: 10.3390/data8030062
      Issue No: Vol. 8, No. 3 (2023)
  • Data, Vol. 8, Pages 145: Attention-Based Human Age Estimation from Face
           Images to Enhance Public Security

    • Authors: Md. Ashiqur Rahman, Shuhena Salam Aonty, Kaushik Deb, Iqbal H. Sarker
      First page: 145
      Abstract: Age estimation from facial images has gained significant attention due to its practical applications such as public security. However, one of the major challenges faced in this field is the limited availability of comprehensive training data. Moreover, due to the gradual nature of aging, similar-aged faces tend to share similarities despite their race, gender, or location. Recent studies on age estimation utilize convolutional neural networks (CNN), treating every facial region equally and disregarding potentially informative patches that contain age-specific details. Therefore, an attention module can be used to focus extra attention on important patches in the image. In this study, tests are conducted on different attention modules, namely CBAM, SENet, and Self-attention, implemented with a convolutional neural network. The focus is on developing a lightweight model that requires a low number of parameters. A merged dataset and other cutting-edge datasets are used to test the proposed model’s performance. In addition, transfer learning is used alongside the scratch CNN model to achieve optimal performance more efficiently. Experimental results on different aging face databases show the remarkable advantages of the proposed attention-based CNN model over the conventional CNN model by attaining the lowest mean absolute error and the lowest number of parameters with a better cumulative score.
      Citation: Data
      PubDate: 2023-09-25
      DOI: 10.3390/data8100145
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 146: Synthetic Data Generation for Data Envelopment
           Analysis

    • Authors: Andrey V. Lychev
      First page: 146
      Abstract: The paper is devoted to the problem of generating artificial datasets for data envelopment analysis (DEA), which can be used for testing DEA models and methods. In particular, papers that applied DEA to big data often used synthetic data generation to obtain large-scale datasets because large real datasets available in the public domain are extremely rare. This paper proposes an algorithm that takes a real dataset as input and complements it with artificial efficient and inefficient units. The generation process extends the efficient part of the frontier by inserting artificial efficient units, keeping the original efficient frontier unchanged. For this purpose, the algorithm uses the assurance region method and consistently relaxes weight restrictions during the iterations. This approach produces synthetic datasets that are closer to real ones, compared to other algorithms that generate data from scratch. The proposed algorithm is applied to a pair of small real-life datasets. As a result, the datasets were expanded to 50K units. Computational experiments show that artificially generated DMUs preserve isotonicity and do not increase the collinearity of the original data as a whole.
      Citation: Data
      PubDate: 2023-09-27
      DOI: 10.3390/data8100146
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 147: A Retinal Oct-Angiography and Cardiovascular
           STAtus (RASTA) Dataset of Swept-Source Microvascular Imaging for
           Cardiovascular Risk Assessment

    • Authors: Germanèse, Meriaudeau, Eid, Tadayoni, Ginhac, Anwer, Laure-Anne, Guenancia, Creuzot-Garcher, Gabrielle, Arnould
      First page: 147
      Abstract: In the context of exponential demographic growth, the imbalance between human resources and public health problems impels us to envision other solutions to the difficulties faced in the diagnosis, prevention, and large-scale management of the most common diseases. Cardiovascular diseases represent the leading cause of morbidity and mortality worldwide. A large-scale screening program would make it possible to promptly identify patients with high cardiovascular risk in order to manage them adequately. Optical coherence tomography angiography (OCT-A), as a window into the state of the cardiovascular system, is a rapid, reliable, and reproducible imaging examination that enables the prompt identification of at-risk patients through the use of automated classification models. One challenge that limits the development of computer-aided diagnostic programs is the small number of open-source OCT-A acquisitions available. To facilitate the development of such models, we have assembled a set of images of the retinal microvascular system from 499 patients. It consists of 814 angiocubes as well as 2005 en face images. Angiocubes were captured with a swept-source OCT-A device of patients with varying overall cardiovascular risk. To the best of our knowledge, our dataset, Retinal oct-Angiography and cardiovascular STAtus (RASTA), is the only publicly available dataset comprising such a variety of images from healthy and at-risk patients. This dataset will enable the development of generalizable models for screening cardiovascular diseases from OCT-A retinal images.
      Citation: Data
      PubDate: 2023-09-28
      DOI: 10.3390/data8100147
      Issue No: Vol. 8, No. 10 (2023)
