Authors:Muhammad Adnan Aslam, Fiza Murtaza, Muhammad Ehatisham Ul Haq, Amanullah Yasin, Numan Ali First page: 27 Abstract: Education is crucial for leading a productive life and obtaining necessary resources. Higher education institutions are progressively incorporating artificial intelligence into conventional teaching methods as a result of innovations in technology. As a high academic record raises a university’s ranking and increases student career chances, predicting learning success has been a central focus in education. Both performance analysis and providing high-quality instruction are challenges faced by modern schools. Maintaining high academic standards, juggling life and academics, and adjusting to technology are problems that students must overcome. In this study, we present a comprehensive dataset, SAPEx-D (Student Academic Performance Exploration), designed to predict student performance, encompassing a wide array of personal, familial, academic, and behavioral factors. Our data collection effort at Air University, Islamabad, Pakistan, involved both online and paper questionnaires completed by students across multiple departments, ensuring diverse representation. After meticulous preprocessing to remove duplicates and entries with significant missing values, we retained 494 valid responses. The dataset includes detailed attributes such as demographic information, parental education and occupation, study habits, reading frequencies, and transportation modes. To facilitate robust analysis, we encoded ordinal attributes using label encoding and nominal attributes using one-hot encoding, expanding our dataset from 38 to 88 attributes. Feature scaling was performed to standardize the range and distribution of data, using a normalization technique. Our analysis revealed that factors such as degree major, parental education, reading frequency, and scholarship type significantly influence student performance. 
The machine learning models applied to this dataset, including Gradient Boosting and Random Forest, demonstrated high accuracy and robustness, underscoring the dataset’s potential for insightful academic performance prediction. In terms of model performance, Gradient Boosting achieved an accuracy of 68.7% and an F1-score of 68% for the eight-class classification task. For the three-class classification, Random Forest outperformed other models, reaching an accuracy of 80.8% and an F1-score of 78%. These findings highlight the importance of comprehensive data in understanding and predicting academic outcomes, paving the way for more personalized and effective educational strategies. Citation: Data PubDate: 2025-02-20 DOI: 10.3390/data10030027 Issue No:Vol. 10, No. 3 (2025)
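The encoding scheme described in the abstract above (label encoding for ordinal attributes, one-hot encoding for nominal ones, followed by normalization) can be sketched in plain Python; the attribute names, category orders, and values below are illustrative assumptions, not taken from SAPEx-D:

```python
# Minimal sketch of the described preprocessing (hypothetical attributes).

def label_encode(values, order):
    """Map an ordinal attribute onto integer ranks given a category order."""
    rank = {v: i for i, v in enumerate(order)}
    return [rank[v] for v in values]

def one_hot_encode(values, categories):
    """Expand a nominal attribute into one binary column per category
    (this column expansion is how 38 raw attributes can grow to 88)."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def min_max_scale(xs):
    """Normalize a numeric column to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs] if hi > lo else [0.0] * len(xs)

# Ordinal: parental education has a natural order -> label encoding.
education = label_encode(["BSc", "MSc", "PhD", "BSc"], ["BSc", "MSc", "PhD"])
# Nominal: transport mode has no order -> one-hot encoding.
transport = one_hot_encode(["bus", "car", "bus"], ["bus", "car", "walk"])
scaled = min_max_scale(education)
```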
Authors:Themistoklis Diamantopoulos, Andreas L. Symeonidis First page: 28 Abstract: The amount of software engineering data is constantly growing, as more and more developers employ online services to store their code, keep track of bugs, or even discuss issues. The data residing in these services can be mined to address different research challenges; therefore, certain initiatives have been established to encourage researchers to collect and share research datasets. In this work, we investigate the effect of such an initiative; we create a directory that includes the papers and the corresponding datasets of the data track of the Mining Software Repositories (MSR) conference. Specifically, our directory includes metadata and citation information for the papers of all data tracks throughout the last twelve years. We also annotate the datasets according to the data source and further assess their compliance with the FAIR principles. Using our directory, researchers can find useful datasets for their research, or even design methodologies for assessing their quality, especially in the software engineering domain. Moreover, the directory can be used for analyzing the citations of data papers, especially with regard to different data categories, as well as for examining their FAIRness scores throughout the years, along with the effect on the usage/citation of the datasets. Citation: Data PubDate: 2025-02-20 DOI: 10.3390/data10030028 Issue No:Vol. 10, No. 3 (2025)
Authors:Delfina Soares, Joana Carvalho, Dimitrios Sarantis First page: 29 Abstract: The Health Online Service Provision Index (HOSPI) is an instrument to assess and monitor hospitals’ websites. The index comprises four criteria—Content, Services, Community Interaction and Technology Features—each with a subset of indicators and sub-indicators. HOSPI was applied to the Portuguese hospitals’ websites in 2023, resulting in the dataset described in this article. The article also provides a detailed account of the data collection process, which involved direct observation of the websites and specific treatment methods, ensuring the reliability and validity of the dataset. It underscores the relevance of having this data available and how it can improve service provision online in health facilities and support policymaking. Citation: Data PubDate: 2025-02-21 DOI: 10.3390/data10030029 Issue No:Vol. 10, No. 3 (2025)
Authors:Patrizia Gasparini, Lucio Di Cosmo, Antonio Floris, Federica Murgia, Maria Rizzo First page: 30 Abstract: Forest ecosystems are important for biodiversity conservation, climate regulation and climate change mitigation, soil and water protection, recreation, and the provision of raw materials. This paper presents a dataset on forest type and tree species composition for 934 georeferenced plots located in Italy. The forest type is classified in the field consistently with the Italian National Forest Inventory (NFI) based on the dominant tree species or species group. Tree species composition is provided as the percent crown cover of the five main species in the plot. Additional data on conifer and broadleaf pure/mixed condition, total tree and shrub cover, forest structure, silvicultural system, development stage, and local land position are provided. The surveyed plots are distributed in the central–eastern Alps, in the central Apennines, and in the southern Apennines; they represent a wide range of species composition, ecological conditions, and silvicultural practices. Data were collected as part of a project aimed at developing a classification algorithm based on hyperspectral data. The dataset was made publicly available as it refers to forest types and species widespread in many countries of Central and Southern Europe and is potentially useful to other researchers for the study of forest biodiversity or for remote sensing applications. Citation: Data PubDate: 2025-02-21 DOI: 10.3390/data10030030 Issue No:Vol. 10, No. 3 (2025)
Authors:Reno Filla First page: 31 Abstract: In moving vehicles, the dominating energy losses are due to interactions with the environment: air resistance and rolling resistance. It is known that weather has a significant impact, yet there is a lack of literature showing how the wealth of openly available data from professional weather observations can be used in this context. This article gives an overview of how such data are structured and how they can be accessed in order to augment logs recorded during vehicle operation or simulated trips. Two efficient algorithms for such data extraction and augmentation are discussed, and several examples of use are provided, also demonstrating that some caveats exist with respect to the source of the weather data. Citation: Data PubDate: 2025-02-24 DOI: 10.3390/data10030031 Issue No:Vol. 10, No. 3 (2025)
Authors:Hannah Jona von Czettritz, Sandra Uthes, Johannes Schuler, Kurt-Christian Kersebaum, Peter Zander First page: 32 Abstract: Coherent spatial data are crucial for informed land use and regional planning decisions, particularly in the context of securing a crisis-proof food supply and adapting to climate change. This dataset provides spatial information on climate-robust and high-yield agricultural arable land in Brandenburg, Germany, based on the results of a classification using bio-economic climate simulations. The dataset is intended to support regional planning and policy makers in zoning decisions (e.g., photovoltaic power plants) by identifying climate-robust arable land with high current and stable future production potential that should be reserved for agricultural use. The classification method used to generate the dataset draws on a wide range of indicators, including established approaches, such as a soil quality index and drought, water erosion, and wind erosion risk, as well as a dynamic approach using bio-economic simulations, which determine the production potential under future climate scenarios. The dataset is a valuable resource for spatial planning and climate change adaptation, contributing to long-term food security, especially in dry areas such as the state of Brandenburg that face increased production risk under future climatic conditions; it thereby serves globally as an example for addressing land use planning challenges related to climate change. Citation: Data PubDate: 2025-02-25 DOI: 10.3390/data10030032 Issue No:Vol. 10, No. 3 (2025)
Authors:José Camacho, Rafael A. Rodríguez-Gómez First page: 33 Abstract: Network traffic datasets are essential for the construction of traffic models, often using machine learning (ML) techniques. Among other applications, these models can be employed to solve complex optimization problems or to identify anomalous behaviors, i.e., behaviors that deviate from the established model. However, the performance of the ML model depends, among other factors, on the quality of the data used to train it. Benchmark datasets, with a profound impact on research findings, are often assumed to be of good quality by default. In this paper, we derive four variants of a benchmark dataset in network anomaly detection (UGR’16, a flow-based real-world traffic dataset designed for anomaly detection), and show that the choice among variants has a larger impact on model performance than the ML technique used to build the model. To analyze this phenomenon, we propose a methodology to investigate the causes of these differences and to assess the quality of the data labeling. Our results underline the importance of paying more attention to data quality assessment in network anomaly detection. Citation: Data PubDate: 2025-02-25 DOI: 10.3390/data10030033 Issue No:Vol. 10, No. 3 (2025)
Authors:José Augusto Ramírez-Trujillo, Maria Guadalupe Castillo-Texta, Mario Ramírez-Yáñez, Ramón Suárez-Rodríguez First page: 34 Abstract: In this work, we report the draft genome sequence of Ensifer sp. P24N7, a symbiotic nitrogen-fixing bacterium isolated from nodules of Phaseolus vulgaris var. Negro Jamapa planted in pots that contained mining tailings from Huautla, Morelos, México. The genomic DNA was sequenced on an Illumina NovaSeq 6000 using the 250 bp paired-end protocol, obtaining 1,188,899 reads. An assembly generated with SPAdes v. 3.15.4 resulted in a genome length of 7,165,722 bp composed of 181 contigs with an N50 of 323,467 bp, a coverage of 76X, and a GC content of 61.96%. The genome was annotated with the NCBI Prokaryotic Genome Annotation Pipeline and contains 6631 protein-coding sequences, 3 complete rRNAs, 52 tRNAs, and 4 non-coding RNAs. The Ensifer sp. P24N7 genome has 59 genes related to heavy metal tolerance predicted by the RAST server. These data may be useful to the scientific community because they can be used as a reference for other works related to heavy metals, including works in Huautla, Morelos. Citation: Data PubDate: 2025-02-27 DOI: 10.3390/data10030034 Issue No:Vol. 10, No. 3 (2025)
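Assembly statistics like the N50 reported above can be recomputed from a list of contig lengths: N50 is the contig length at which the cumulative sum of lengths (sorted longest first) first reaches half the total assembly size. A minimal sketch with a toy contig set (not the actual 181-contig P24N7 assembly):

```python
def n50(contig_lengths):
    """N50: the contig length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    cumulative = 0
    for length in lengths:
        cumulative += length
        if cumulative >= half_total:
            return length

# Toy contig set: total 1000 bp, so N50 is reached at the 300 bp contig.
print(n50([100, 200, 300, 400]))  # -> 300
```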
Authors:Sheik Murad Hassan Anik, Xinghua Gao, Na Meng First page: 35 Abstract: The paper describes a dataset comprising indoor environmental factors such as temperature, humidity, air quality, and noise levels. The data were collected from 10 sensing devices installed in various locations within three single-family houses in Virginia, USA. The objective of the data collection was to study the indoor environmental conditions of the houses over time. The data were collected at a frequency of one record per minute for a year, amounting to a total of over 2.5 million records. The paper provides actual floor plans with sensor placements to aid researchers and practitioners in creating reliable building performance models. The techniques used to collect and verify the data are also explained in the paper. The resulting dataset can be employed to enhance models for building energy consumption, occupant behavior, predictive maintenance, and other relevant purposes. Citation: Data PubDate: 2025-03-05 DOI: 10.3390/data10030035 Issue No:Vol. 10, No. 3 (2025)
Authors:Hyeongbok Kim, Eunbi Kim, Sanghoon Ahn, Beomjin Kim, Sung Jin Kim, Tae Kyung Sung, Lingling Zhao, Xiaohong Su, Gilmu Dong First page: 36 Abstract: Comprehensive datasets are crucial for developing advanced AI solutions in road infrastructure, yet most existing resources focus narrowly on vehicles or a limited set of object categories. To address this gap, we introduce the Korean Road Infrastructure Dataset (KRID), a large-scale dataset designed for real-world road maintenance and safety applications. Our dataset covers highways, national roads, and local roads in both city and non-city areas, comprising 34 distinct types of road infrastructure—from common elements (e.g., traffic signals, gaze-directed poles) to specialized structures (e.g., tunnels, guardrails). Each instance is annotated with either bounding boxes or polygon segmentation masks under stringent quality control and privacy protocols. To demonstrate the utility of this resource, we conducted object detection and segmentation experiments using YOLO-based models, focusing on guardrail damage detection and traffic sign recognition. Preliminary results confirm its suitability for complex, safety-critical scenarios in intelligent transportation systems. Our main contributions include: (1) a broader range of infrastructure classes than conventional “driving perception” datasets, (2) high-resolution, privacy-compliant annotations across diverse road conditions, and (3) open-access availability through AI Hub and GitHub. By highlighting critical yet often overlooked infrastructure elements, this dataset paves the way for AI-driven maintenance workflows, hazard detection, and further innovations in road safety. Citation: Data PubDate: 2025-03-14 DOI: 10.3390/data10030036 Issue No:Vol. 10, No. 3 (2025)
Authors:Christine M. Hale, Kyle J. Beauchemin, Courtney L. Brann, Julie K. Moulton, Ramaz Geguchadze, Benjamin J. Harrison, Geoffrey K. Ganter First page: 11 Abstract: To prepare to address the mechanisms of injury-induced nociceptor sensitization, we sequenced the translatome of the nociceptors of injured Drosophila larvae and those of uninjured larvae. Third-instar larvae expressing a green fluorescent protein (GFP)-tagged ribosomal subunit specifically in Class 4 dendritic arborization neurons, recognized as pickpocket-expressing primary nociceptors, via the GAL4/UAS method, were injured by ultraviolet light or sham-injured. Larvae were subjected to translating ribosome affinity purification for the GFP tag, and nociceptor-specific ribosome-bound RNA was sequenced. Citation: Data PubDate: 2025-01-21 DOI: 10.3390/data10020011 Issue No:Vol. 10, No. 2 (2025)
Authors:Marjolène Jatteau, Jean Cauzid, Cécile Fabre, Panagiotis Voudouris, Georgios Soulamidis, Alexandre Tarantola First page: 12 Abstract: Strategic metals are indispensable for meeting the needs of modern society. It is therefore necessary to reassess the potential of such metals in Europe. For the exploration of strategic metals, portable XRF (X-Ray Fluorescence) and LIBS (Laser-Induced Breakdown Spectroscopy) are powerful techniques allowing multi-elementary analysis. This paper presents a database providing more than 2000 pXRF data points and more than 4000 pLIBS spectra acquired on minerals from the Mineralogy and Petrology Museum of the National and Kapodistrian University of Athens (NKUA), selected based on their potential for bearing strategic metals. The combination of these two portable techniques, along with an expanding dataset on strategic metal-rich minerals, provides valuable insights into strategic metal affinities and demonstrates the effectiveness of portable tools for exploring strategic raw materials. Indeed, such a database helps strengthen knowledge on strategic metals by enabling statistical and chemometric analyses (e.g., boxplots, PCA, PLS) of their distribution. Citation: Data PubDate: 2025-01-27 DOI: 10.3390/data10020012 Issue No:Vol. 10, No. 2 (2025)
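One of the chemometric analyses mentioned in the abstract above, PCA, can be sketched directly with NumPy via an SVD of the centered data matrix; the concentration table below is hypothetical, not the actual NKUA pXRF data:

```python
import numpy as np

# Hypothetical pXRF concentration table: rows = mineral samples,
# columns = element concentrations (illustrative values only).
X = np.array([[1.0, 0.2, 0.1],
              [0.9, 0.3, 0.1],
              [0.1, 1.0, 0.8],
              [0.2, 0.9, 0.9]])

# PCA by SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T               # sample coordinates on the principal axes
explained = S**2 / np.sum(S**2)  # fraction of variance per component
```

Here the two sample clusters separate along the first component, which captures nearly all of the variance — the kind of structure a PCA of element distributions is meant to reveal.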
Authors:Ignacio Molina, Brian Keith, Mauricio Matus First page: 13 Abstract: This paper presents a multimodal dataset capturing fact-checked news coverage of Chile’s constitutional processes from 2019 to 2023. The collection comprises 300 articles from three sources: Fast Check, Fact Checking UC, and BioBioChile, containing 242,687 words of text and visual content in 168 entries. The dataset implements advanced natural language processing through RoBERTa and computer vision techniques via EfficientNet, with unified multimodal analysis using the CLIP model. Technical validation through clustering analysis and expert review demonstrates the dataset’s effectiveness in identifying narrative patterns within constitutional process coverage. The structured format includes verification metadata, precomputed embeddings, and documented relationships between textual and visual elements. This enables research into how misinformation propagates through multiple channels during significant political events. This paper details the dataset’s composition, collection methodology, and validation while acknowledging specific limitations. This contribution addresses a gap in current research resources by providing verified multimodal content spanning two constitutional processes, supporting investigations in computational social science and misinformation studies. Citation: Data PubDate: 2025-01-28 DOI: 10.3390/data10020013 Issue No:Vol. 10, No. 2 (2025)
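Precomputed CLIP-style embeddings, such as those this dataset ships, are typically compared by cosine similarity, which measures how well an article's text and image agree in the shared embedding space. A minimal sketch with made-up low-dimensional vectors (real CLIP embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1 = aligned)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical precomputed embeddings for one article's text and image.
text_emb = np.array([0.1, 0.8, 0.3])
image_emb = np.array([0.2, 0.7, 0.4])
sim = cosine_similarity(text_emb, image_emb)  # close to 1 => modalities agree
```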
Authors:Milan S. Dimitrijević, Magdalena D. Christova, Cristina Yubero, Sylvie Sahal-Bréchot First page: 14 Abstract: Data on spectral line widths and shifts broadened by interactions with charged particles, for 44 lines in the spectrum of ionized tin, for collisions with electrons and with H II and He II ions, are presented as online available tables. We obtained them by employing the semiclassical perturbation theory for temperatures, T, within the 5000–100,000 K range, and for a grid of perturber densities from 10^14 cm^−3 to 10^20 cm^−3. The presented Stark broadening data are of interest for the analysis and synthesis of ionized tin lines in the spectra of hot and dense stars, such as white dwarfs and hot subdwarfs, and for the modelling of their atmospheres. They are also useful for the diagnostics of laser-induced plasmas for high-order harmonic generation in ablated materials. Citation: Data PubDate: 2025-01-28 DOI: 10.3390/data10020014 Issue No:Vol. 10, No. 2 (2025)
Authors:Ebru Kirezci, Ian Young, Roshanka Ranasinghe, Yiqun Chen, Yibo Zhang, Abbas Rajabifard First page: 15 Abstract: A global database of coastal flooding impacts resulting from extreme sea levels is developed for the present day and for the years 2050 and 2100. The database consists of three sub-datasets: the extreme sea levels, the coastal areas flooded by these extreme sea levels, and the resulting socioeconomic implications. The extreme sea levels consider the processes of storm surge, tide levels, breaking wave setup and relative sea level rise. The socioeconomic implications are expressed in terms of Expected Annual Population Affected (EAPA) and Expected Annual Damage (EAD), and presented at the global, regional and national scales. The EAPA and EAD are determined both for existing coastal defence levels and assuming two plausible adaptation scenarios, along with socioeconomic development narratives. All the sub-datasets can be visualized with a Digital Twin platform based on a GIS-based mapping host. This publicly available database provides a first-pass assessment, enabling users to extract and identify global and national coastal hotspots under different projections of sea level rise and socioeconomic developments. Citation: Data PubDate: 2025-01-28 DOI: 10.3390/data10020015 Issue No:Vol. 10, No. 2 (2025)
Authors:Jorge Quijano, Nohemi Torres Cruz, Leslie Quijano-Quian, Eduardo Rafael Poblano-Ojinaga, Salvador Anacleto Noriega Noriega Morales First page: 16 Abstract: Optimizing production efficiency in Surface-Mount Technology (SMT) manufacturing is a critical challenge, particularly in high-mix environments where frequent product changeovers can lead to significant downtime. This study presents a scheduling algorithm that minimizes changeover times on SMT lines by leveraging the commonality of Surface-Mount Device (SMD) reel part numbers across product Bills of Materials (BOMs). The algorithm’s capabilities were demonstrated through both simulated datasets and practical validation trials, providing a comprehensive evaluation framework. In the practical implementation, the algorithm successfully aligned predicted and measured changeover times, highlighting its applicability and accuracy in operational settings. The proposed approach integrates heuristic and optimization techniques to identify scheduling strategies that not only minimize reel changes but also support production scalability and operational flexibility. This framework offers a robust solution for optimizing SMT workflows, enhancing productivity, and reducing resource inefficiencies in both greenfield projects and established manufacturing environments. Citation: Data PubDate: 2025-01-29 DOI: 10.3390/data10020016 Issue No:Vol. 10, No. 2 (2025)
Authors:Ivana Patente Torres, Roberto Avelino Cecílio, Laura Thebit de Almeida, Marcel Carvalho Abreu, Demetrius David da Silva, Sidney Sara Zanetti, Alexandre Cândido Xavier First page: 17 Abstract: This is a database containing rainfall intensity–duration–frequency equations (IDF equations) for 6550 pluviographic and pluviometric stations in Brazil. The database was compiled from 370 different publications and contains the following information: station identification, geographic position, size and period of the rainfall series used, parameters of the IDF equations, and literature references. The database is available on Mendeley Data (DOI: 10.17632/378bdcmnc8.1) in the form of spreadsheets and vector files. Since the launch of the Pluvio 2.1 software in 2006, which included 549 IDF equations obtained in the country, this is the largest and most accessible database of IDF equations in Brazil. The data provided may be useful, among other purposes, for designing hydraulic structures, controlling water erosion, planning land use, and water resource planning and management. Citation: Data PubDate: 2025-01-29 Issue No:Vol. 10, No. 2 (2025)
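IDF equations of the kind catalogued in this database are commonly written in the power-law form i = K·T^a / (t + b)^c, with return period T in years and duration t in minutes. A minimal sketch with illustrative parameter values (not taken from the database itself):

```python
def idf_intensity(T, t, K, a, b, c):
    """Rainfall intensity (mm/h) from the common power-law IDF form
    i = K * T**a / (t + b)**c, where T is the return period in years
    and t is the rainfall duration in minutes. Parameter values are
    fitted per station; the ones used below are purely illustrative."""
    return K * T ** a / (t + b) ** c

# Hypothetical station: 10-year return period, 60-minute storm.
i_10yr_60min = idf_intensity(T=10, t=60, K=1000.0, a=0.15, b=10.0, c=0.75)
```

As expected from the form of the equation, intensity decreases with duration and increases with return period.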
Authors:Paola G. Ferrario, Maik Döring, Christian Ritz First page: 18 Abstract: In clinical nutrition, it is regularly observed that individuals respond differently to a dietary treatment. Personalized nutrition aims to consider such variability in response by delivering personalized nutritional recommendations. Ideally, the optimal treatment for each individual will be selected and then dispensed according to the specific individual’s characteristics. The aim of this paper is to discuss and apply existing statistical methods, which can be adequately used in the context of personalized nutrition. We discuss the estimation of individualized treatment rules (ITRs) as we wish to favor one out of two interventions. The applicability of the methods is demonstrated by reusing two public datasets: one in the context of a parallel group design and one in the context of a crossover design. The bias of the estimator of the ITRs’ underlying parameters is evaluated in a simulation study. Citation: Data PubDate: 2025-01-30 DOI: 10.3390/data10020018 Issue No:Vol. 10, No. 2 (2025)
Authors:Mohsen Khezri First page: 19 Abstract: This study introduces an innovative empirical methodology by integrating spatial panel models with satellite imagery data from 1970 to 2019. This innovative approach illuminates the effects of greenhouse gas emissions, deforestation, and various global variables on regional temperature shifts and the environmental repercussions of land-use alterations, establishing a substantial empirical basis for climate change. The results revealed that global variables such as sunspot activity, the length of day (LOD), and the Global Mean Sea Level (GMSL) have negligible impacts on global temperature variations. This model uncovers the nuanced effect of deforestation on global temperatures, highlighting a decrease in temperature following deforestation above 40° N latitude, contrary to the warming effect observed in lower latitudes. Exceptionally, deforestation within the 10° N to 10° S tropical bands results in a temperature decrease, challenging established theories. The results suggest that converting forests to grass/shrublands and croplands plays a significant role in these temperature dynamics. Citation: Data PubDate: 2025-01-31 DOI: 10.3390/data10020019 Issue No:Vol. 10, No. 2 (2025)
Authors:Fernanda Véliz, Thulasi Bikku, Davor Ibarra-Pérez, Valentina Hernández-Muñoz, Alysia Garmulewicz, Felipe Herrera First page: 20 Abstract: Automated analysis of the scientific literature using natural language processing (NLP) can accelerate the identification of potentially unexplored formulations that enable innovations in materials engineering with fewer experimentation and testing cycles. This strategy has been successful for specific classes of inorganic materials, but their general application in broader material domains such as bioplastics remains challenging. To begin addressing this gap, we explore correlations between the ingredients and physicochemical properties of seaweed-based biofilms from a corpus of 2000 article abstracts from the scientific literature since 1958, using a supervised word co-occurrence analysis and an unsupervised approach based on the language model MatBERT without fine-tuning. Using known relations between ingredients and properties for test scenarios, we discuss the potential and limitations of these NLP approaches for identifying novel combinations of polysaccharides, plasticizers, and additives that are related to the functionality of seaweed biofilms. The model demonstrates a valuable predictive ability to identify ingredients associated with increased water vapor permeability, suggesting its potential utility in optimizing formulations for future research. Using the model further revealed alternative combinations that are underrepresented in the literature. This automated method facilitates the mapping of relationships between ingredients and properties, guiding the development of seaweed bioplastic formulations. The unstructured and heterogeneous nature of the literature on bioplastics represents a particular challenge that demands ad hoc fine-tuning strategies for state-of-the-art language models for advancing the field of seaweed bioplastics. Citation: Data PubDate: 2025-02-01 DOI: 10.3390/data10020020 Issue No:Vol. 10, No. 2 (2025)
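The supervised word co-occurrence analysis mentioned above can be sketched as counting how often ingredient terms and property terms appear in the same abstract; the toy abstracts and term lists below are assumptions for illustration, not the authors' corpus or vocabularies:

```python
from collections import Counter

# Toy abstracts standing in for the ~2000-article corpus.
abstracts = [
    "agar glycerol film permeability",
    "carrageenan glycerol film strength",
    "agar sorbitol film permeability",
]

# Hypothetical term lists: ingredients vs. physicochemical properties.
ingredients = {"agar", "carrageenan", "glycerol", "sorbitol"}
properties = {"permeability", "strength"}

# Count ingredient/property pairs co-occurring within one abstract.
cooc = Counter()
for text in abstracts:
    tokens = set(text.split())
    for ing in tokens & ingredients:
        for prop in tokens & properties:
            cooc[(ing, prop)] += 1
```

High counts (here, agar with permeability) flag candidate ingredient–property associations for closer inspection, which is the intuition behind the supervised analysis.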
Authors:Pedro Cavadia, José M. Benjumea, Oscar Begambre, Edison Osorio, María A. Mantilla First page: 21 Abstract: Due to climate change, the temperature monitoring of reinforced-concrete (RC) structures is becoming critical for preventive maintenance and extending their lifespan. Significant temperature variations in RC elements can affect their natural frequencies and modulus of elasticity or generate abnormal stress levels, potentially leading to structural damage. Data from thermal monitoring systems are invaluable for testing and validating numerical methodologies for estimating internal thermal responses and aiding in prevention/maintenance decision making. Despite its importance, few experimental outdoor data on the internal and external temperatures of concrete structures are available. This study presents a comprehensive dataset from a 120-day temperature-monitoring campaign on a 1.2 m long reinforced-concrete slab-on-I-beam model under tropical conditions in Bucaramanga, Colombia. The monitoring system measured the internal temperatures at 40 points using embedded thermocouples, while the surface temperatures were recorded with handheld and drone-mounted thermal cameras. Simultaneously, the ambient temperature, solar radiation, rainfall, wind velocity, and other parameters were monitored using a weather station. The instrumentation ensured the synchronization and high spatial resolution of the thermal data. The data, collected at 30 min intervals, are openly available in CSV format, offering valuable resources for validating numerical models, studying thermal gradients, and enhancing structural health-monitoring frameworks. Citation: Data PubDate: 2025-02-04 DOI: 10.3390/data10020021 Issue No:Vol. 10, No. 2 (2025)
Authors:Rodolfo Bojorque, Fernando Moscoso, Fernando Pesántez, Ángela Flores First page: 22 Abstract: This study investigates stressors in higher education, focusing on their impact on students and faculty at Universidad Politécnica Salesiana (UPS) and using eight years of comprehensive data. Employing data mining techniques, the research analyzed enrollment, retention, graduation, employability, socioeconomic status, academic performance, and faculty workload to uncover patterns affecting academic outcomes. The study found that UPS exhibits a stable educational system, maintaining consistent metrics across student success indicators. However, the COVID-19 pandemic presented unique stressors, evidenced by a paradoxical increase in student grades during heightened faculty stress levels. This anomaly suggests a potential link between academic rigor and faculty well-being during systemic disruptions. Stressors affecting students directly correlated with reduced academic performance, highlighting the importance of early detection and intervention. Conversely, faculty stress was reflected in adjustments to grading practices, raising questions about institutional pressures and faculty motivation. These findings emphasize the value of proactive data analytics in identifying stress-induced anomalies to support student success and faculty well-being. The study advocates for further research on faculty burnout, motivation, and institutional strategies to mitigate stressors, underscoring the potential of data-driven approaches to enhance the quality and sustainability of higher education ecosystems. Citation: Data PubDate: 2025-02-07 DOI: 10.3390/data10020022 Issue No:Vol. 10, No. 2 (2025)
Authors:Moeketsi Mosia First page: 23 Abstract: Early detection of academically at-risk students is crucial for designing timely interventions that improve educational outcomes. However, many existing approaches either ignore the temporal evolution of student performance or rely on “black box” models that sacrifice interpretability. In this study, we develop a dynamic hierarchical logistic regression model in a fully Bayesian framework to address these shortcomings. Our method leverages partial pooling across students and employs a state-space formulation, allowing each student’s log-odds of failure to evolve over multiple assessments. By using Markov chain Monte Carlo for inference, we obtain robust posterior estimates and credible intervals for both population-level and individual-specific effects, while posterior predictive checks ensure model adequacy and calibration. Results from simulated and real-world datasets indicate that the proposed approach more accurately tracks fluctuations in student risk compared to static logistic regression, and it yields interpretable insights into how engagement patterns and demographic factors influence failure probability. We conclude that a Bayesian dynamic hierarchical model not only enhances prediction of at-risk students but also provides actionable feedback for instructors and administrators seeking evidence-based interventions. Citation: Data PubDate: 2025-02-07 DOI: 10.3390/data10020023 Issue No:Vol. 10, No. 2 (2025)
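The state-space idea in the abstract above — each student's log-odds of failure evolving across assessments — can be illustrated with a minimal forward simulation; the random-walk dynamics and all parameter values below are illustrative assumptions, not the paper's fitted Bayesian model:

```python
import math
import random

def simulate_risk_path(n_assessments, mu=-1.0, sigma=0.5, seed=0):
    """Simulate one student's failure probabilities over time:
    a latent log-odds state starts at a population-level intercept mu
    and evolves as a Gaussian random walk with step size sigma."""
    rng = random.Random(seed)
    theta = mu                                        # latent log-odds state
    probs = []
    for _ in range(n_assessments):
        theta += rng.gauss(0.0, sigma)                # state evolves per assessment
        probs.append(1.0 / (1.0 + math.exp(-theta)))  # inverse-logit -> probability
    return probs

path = simulate_risk_path(5)  # five assessments for one simulated student
```

Tracking this evolving probability, rather than a single static logistic estimate, is what lets the dynamic model follow fluctuations in student risk.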
Authors:Ersin Aytaç, Mohamed Khayet First page: 24 Abstract: Social media has revolutionized the dissemination of information, enabling the rapid and widespread sharing of news, concepts, technologies, and ideas. YouTube is one of the most important online video sharing platforms of our time. In this research, we investigate the visibility of separation through membrane distillation (MD) on YouTube using statistical methods and natural language processing. The dataset, collected on 4 January 2024, includes 212 videos with key characteristics such as duration, views, subscribers, number of comments, and likes. The results show that the number of videos is still small, but there is an increasing trend, especially since 2019. The high number of channels offering information about MD technology in countries such as the USA, India, and Canada indicates that these countries have recognized the practical benefits of this technology, especially in areas such as water treatment, desalination, and industrial applications. This suggests that MD could play a pivotal role in finding solutions to global water challenges. Word cloud analysis showed that terms such as “water”, “treatment”, “desalination”, and “separation” were prominent, indicating that the videos focused mainly on the principles and applications of MD. The sentiment of the comments is mostly positive, and the dominant emotion is neutral, revealing that viewers generally have a positive attitude towards MD. The narrative intensity metric evaluates the information transfer efficiency of the videos and provides a guide for effective content creation strategies. The analyses revealed that social media awareness of MD technology is still insufficient and that content development and sharing strategies should focus on bringing the technology to a wider audience. Citation: Data PubDate: 2025-02-08 DOI: 10.3390/data10020024 Issue No:Vol. 10, No. 2 (2025)
Authors:Mohammad Badhruddouza Khan, Salwa Tamkin, Jinat Ara, Mobashwer Alam, Hanif Bhuiyan First page: 25 Abstract: Crop failure is defined as crop production that is significantly lower than anticipated, resulting from plants that are harmed, diseased, destroyed, or influenced by climatic circumstances. With rising global food security concerns, early detection of crop diseases has proven pivotal for the agriculture industry, both to address the global food crisis and to protect on-farm data, needs that can be met with a privacy-preserving deep learning model. However, deep learning models are largely complex black boxes, making interpretability a prerequisite for their adoption. Considering this, the aim of this study was to establish a robust custom deep learning model named CropsDisNet, evaluated on a large-scale dataset named “New Bangladeshi Crop Disease Dataset (corn, potato and wheat)”, which contains a total of 8946 images. Integrating a differential privacy algorithm into our CropsDisNet model delivers the benefits of automated crop disease classification without compromising on-farm data privacy, by reducing training data leakage. To classify corn, potato, and wheat leaf diseases, we used three representative CNN models for image classification (VGG16, Inception Resnet V2, Inception V3) along with our custom model, and the classification accuracy for these three crops varied from 92.09% to 98.29%. In addition, demonstrating the model’s interpretability gave us insight into its decision making and classification results, allowing farmers to understand predictions and take appropriate precautions in the event of early widespread harvest failure and food crises. Citation: Data PubDate: 2025-02-18 DOI: 10.3390/data10020025 Issue No:Vol. 10, No. 2 (2025)
Authors:Kazeem A. Dauda, Rasheed K. Lamidi First page: 26 Abstract: High-dimensional survival data, such as microarray datasets, present significant challenges in variable selection and model performance due to their complexity and dimensionality. Identifying important genes and understanding how these genes influence the survival of patients with cancer are of great interest and a major challenge to biomedical scientists, healthcare practitioners, and oncologists. Therefore, this study combined the strengths of two complementary feature selection methodologies: a filtering (correlation-based) approach and a wrapper method based on Iterative Bayesian Model Averaging (IBMA). This new approach, termed Correlation-Based IBMA, offers a highly efficient and effective means of selecting the most important and influential genes for predicting the survival of patients with cancer. The efficiency and consistency of the method were demonstrated using diffuse large B-cell lymphoma cancer data. The results revealed that the 15 most important genes out of 3835 gene features were consistently selected at a threshold p-value of 0.001, with genes with posterior probabilities below 1% being removed. The influence of these 15 genes on patient survival was assessed using the Cox Proportional Hazards (Cox-PH) Model. The results further revealed that eight genes were highly associated with patient survival at a 0.05 level of significance. Finally, these findings underscore the importance of integrating feature selection with robust modeling approaches to enhance accuracy and interpretability in high-dimensional survival data analysis. Citation: Data PubDate: 2025-02-18 DOI: 10.3390/data10020026 Issue No:Vol. 10, No. 2 (2025)
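The filtering (correlation-based) stage described in the abstract above can be sketched as ranking gene columns by absolute correlation with the outcome and keeping the top candidates. This is an illustrative toy on simulated data, not the authors' pipeline: the IBMA wrapper step and the Cox-PH survival modelling are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def correlation_filter(X, y, top_k=15):
    """Rank gene-expression columns of X by absolute Pearson correlation
    with outcome y and return the indices of the top_k columns."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc**2).sum(axis=0) * (yc**2).sum())
    return np.argsort(-np.abs(corr))[:top_k]

X = rng.normal(size=(60, 200))                       # 60 patients, 200 toy "genes"
y = X[:, 7] * 2.0 + rng.normal(scale=0.1, size=60)   # gene 7 drives the outcome
selected = correlation_filter(X, y)
```

A wrapper method such as IBMA would then refine this filtered subset by evaluating candidate models rather than marginal correlations alone.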
Authors:Zhengxiao Yang, Hao Zhou, Sudesh Srivastav, Jeffrey G. Shaffer, Kuukua E. Abraham, Samuel M. Naandam, Samuel Kakraba First page: 4 Abstract: Patient-level grouped data are prevalent in public health and medical fields, and multiple instance learning (MIL) offers a framework to address the challenges associated with this type of data structure. This study compares four data aggregation methods designed to tackle the grouped structure in classification tasks: post-mean, post-max, post-min, and pre-mean aggregation. We developed a customized AI pipeline that incorporates twelve machine learning algorithms along with the four aggregation methods to detect Parkinson’s disease (PD) using multiple voice recordings from individuals available in the UCI Machine Learning Repository, which includes 756 voice recordings from 188 PD patients and 64 healthy individuals. Seven performance metrics—accuracy, precision, sensitivity, specificity, F1 score, AUC, and MCC—were utilized for model evaluation. Various techniques, such as Bag Over-Sampling (BOS), cross-validation, and grid search, were implemented to enhance classification performance. Among the four aggregation methods, post-mean aggregation combined with XGBoost achieved the highest accuracy (0.880), F1 score (0.922), and MCC (0.672). Furthermore, we identified potential trends in selecting aggregation methods that are suitable for imbalanced data, particularly based on their differences in sensitivity and specificity. These findings provide meaningful implications for the further exploration of grouped imbalanced data. Citation: Data PubDate: 2025-01-02 DOI: 10.3390/data10010004 Issue No:Vol. 10, No. 1 (2025)
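The distinction between post- and pre-aggregation in the abstract above can be sketched in a few lines: post-aggregation pools instance-level predictions within a bag, while pre-aggregation pools instance features before classification. This is a minimal illustration of the two schemes, not the study's full pipeline (no XGBoost, BOS, or cross-validation).

```python
import numpy as np

def post_mean(instance_probs):
    """Post-aggregation: average instance-level predicted probabilities
    within a bag (e.g., one patient's recordings) into one bag score."""
    return float(np.mean(instance_probs))

def pre_mean(instance_features):
    """Pre-aggregation: average instance features first, so one feature
    vector represents the whole bag before classification."""
    return np.mean(instance_features, axis=0)

# Toy bag: 3 voice recordings from one subject
probs = [0.9, 0.7, 0.8]          # per-recording classifier outputs
feats = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])   # per-recording feature vectors

bag_score = post_mean(probs)     # one probability for the subject
bag_feats = pre_mean(feats)      # one feature vector for the subject
```

Post-max and post-min follow the same pattern with `np.max` and `np.min` in place of the mean.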
Authors: Racinskis, Krasnikovs, Arents, Greitans First page: 5 Abstract: This paper accompanies the initial public release of the EDI multi-modal SLAM dataset, a collection of long tracks recorded with a portable sensor package. These include two global shutter RGB camera feeds, LiDAR scans, as well as inertial and GNSS data from an RTK-enabled IMU-GNSS positioning module—both as satellite fixes and internally fused interpolated pose estimates. The tracks are formatted as ROS1 and ROS2 bags, with separately available calibration and ground truth data. In addition to the filtered positioning module outputs, a second form of sparse ground truth pose annotation is provided using independently surveyed visual fiducial markers as a reference. This enables the meaningful evaluation of systems that directly incorporate data from the positioning module into their localization estimates, and serves as an alternative when the GNSS reference is disrupted by intermittent signals or multipath scattering. In this paper, we describe the methods used to collect the dataset, its contents, and its intended use. Citation: Data PubDate: 2025-01-07 DOI: 10.3390/data10010005 Issue No:Vol. 10, No. 1 (2025)
Authors:Salem Ahmed Alabdali, Salvatore Flavio Pileggi, Gnana Bharathy First page: 6 Abstract: This paper describes a dataset for the Sustainable Development of remote and rural areas. Version 1.0 includes self-reported data, with a total of 212 valid responses collected in 2024 across different sectors (education, healthcare, and business) from people living in rural and remote areas in Saudi Arabia. The structured survey is intended to support research and policy making that addresses the distinctive characteristics of those regions. The 40 core questions, in addition to the detailed demographic questions, aim to capture different perspectives and perceptions on innovative and sustainable solutions. Overall, the dataset offers valuable strategic insights to be integrated with other sources of information, as well as the opportunity to incrementally generate extensive and diverse knowledge in the field. The major limitation is inherently related to the local context, as the data come from the most educated respondents with access to digital resources. Additionally, the dataset may be considered relatively small, and there is some gender imbalance due to cultural factors. Citation: Data PubDate: 2025-01-08 DOI: 10.3390/data10010006 Issue No:Vol. 10, No. 1 (2025)
Authors:Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, Herag Arabian, Alberto Battistel, Paul David Docherty, Hisham ElMoaqet, Thomas Neumuth, Knut Moeller First page: 7 Abstract: Surgical data analysis is crucial for developing and integrating context-aware systems (CAS) in advanced operating rooms. Automatic detection of surgical tools is an essential component of CAS, as it enables the recognition of surgical activities and understanding of the contextual status of the procedure. Acquiring surgical data is challenging due to ethical constraints and the complexity of establishing data recording infrastructures. For machine learning tasks, there is also the large burden of data labelling. Although a relatively large dataset, namely Cholec80, is publicly available, it is limited to binary labels indicating surgical tool presence. In this work, 15,691 frames from five videos in the dataset have been labelled with bounding boxes for surgical tool localisation. These newly labelled data support future research in developing and evaluating object detection models, particularly in the laparoscopic image data analysis domain. Citation: Data PubDate: 2025-01-08 DOI: 10.3390/data10010007 Issue No:Vol. 10, No. 1 (2025)
Authors:Damar David Wilson, Gebrekidan Worku Tefera, Ram L. Ray First page: 8 Abstract: Google Earth Engine (GEE) is a cloud-based platform built on Google’s infrastructure that is revolutionizing geospatial analysis by providing access to vast satellite datasets and the computational capabilities needed to monitor environmental and societal issues. It incorporates machine learning (ML) techniques and algorithms as part of its tools for analyzing and processing large geospatial data. This review explores the diverse applications of GEE in monitoring and mitigating greenhouse gas (GHG) emissions and uptakes. By leveraging GEE’s capabilities, researchers have developed tools and algorithms to analyze remotely sensed data and accurately quantify GHG emissions and uptakes. This review examines progress and trends in GEE applications, focusing on monitoring carbon dioxide (CO2), methane (CH4), and nitrous oxide/nitrogen dioxide (N2O/NO2) emissions. It discusses the integration of GEE with different machine learning methods and the challenges and opportunities in optimizing algorithms and ensuring data interoperability. Furthermore, it highlights GEE’s role in pinpointing emission hotspots, as demonstrated in studies monitoring uptakes. By providing insights into GEE’s capabilities for precise monitoring and mapping of GHGs, this review aims to advance environmental research and decision-making processes in mitigating climate change. Citation: Data PubDate: 2025-01-11 DOI: 10.3390/data10010008 Issue No:Vol. 10, No. 1 (2025)
Authors:Bingya Wu, Zhihui Hu, Zhouyi Gu, Yuxi Zheng, Jiayan Lv First page: 9 Abstract: Technology-based small and micro enterprises play a crucial role in national economic and social development. Managing their credit risk effectively is key to ensuring their healthy growth. This study is based on corporate credit management theory and Wu’s three-dimensional credit theory. It clarifies the credit concept and measurement logic of these enterprises, considering their unique development characteristics in China. A credit evaluation system is constructed, and an innovative method combining machine learning with comprehensive evaluation is proposed. This approach aims to assess the credit status of technology-based small and micro enterprises in a thorough and objective manner. The study finds that, first, the credit level of these enterprises is currently moderate, with little variation. Second, financial information remains a key factor in credit evaluation. Third, the ML-AHP (Machine Learning-Analytic Hierarchy Process) combined weighting method effectively integrates subjective experience with objective data, providing a more rational assessment. The findings provide theoretical references and practical guidance for the healthy development of technology-based small and micro enterprises, early credit risk warning, and improved financing efficiency. Citation: Data PubDate: 2025-01-14 DOI: 10.3390/data10010009 Issue No:Vol. 10, No. 1 (2025)
Authors:Wai Yan Siu, Man Li, Arthur J. Caplan First page: 10 Abstract: Grid-cell data are increasingly used in research due to the growing availability and accessibility of remote sensing products. However, grid-cell data often fail to represent the actual decision-making unit, leading to biased estimates in socio-economic analysis. To this end, this paper presents a comprehensive parcel-level dataset for Salt Lake County, Utah, spanning from 2008 to 2018. This dataset combines detailed spatial and temporal data on land ownership, land use, and preferential farmland tax assessments under the Greenbelt program. Compiled from multiple geospatial sources, the dataset includes nearly 200,000 parcel-year observations, providing valuable insights into landowner decision-making and the impact of tax abatement incentives at the decision-making level. This resource is beneficial for researchers, educators, and practitioners in sustainable development, environmental studies, and farmland conservation. Citation: Data PubDate: 2025-01-17 DOI: 10.3390/data10010010 Issue No:Vol. 10, No. 1 (2025)
Authors: Ruiz-de-Alarcón-Quintero, De-la-Cruz-Torres First page: 102 Abstract: Introduction. Football analysis is an applied research area that has seen a huge upsurge in recent years. More complex analyses are required to understand players’ and teams’ performances during matches. The objective of this study was to prove the usefulness of the expected goals on target (xGOT) metric as a good indicator of a soccer team’s performance in the professional Spanish football leagues, in both the women’s and men’s categories. Method. The data for the Spanish teams were collected from the statistical website Football Reference (https://www.fbref.com). The 2023/24 season was analyzed for the Spanish leagues, in both the women’s and men’s categories (LigaF and LaLiga, respectively). For all teams, the following variables were calculated: goals, possession value (PV), expected goals (xG), and xGOT. All data obtained for each variable were normalized by match (90 min). A descriptive and correlational statistical analysis was carried out. Results. In the men’s league, this study found a high correlation between goals per match and xGOT (R2 = 0.9248), while in the women’s league, there were high correlations between goals per match and xG (R2 = 0.9820) and between goals per match and xGOT (R2 = 0.9574). Conclusions. In LaLiga, xGOT was the metric that best represented the match result, while in LigaF, xG and xGOT were the metrics that best represented the match score. Citation: Data PubDate: 2024-08-28 DOI: 10.3390/data9090102 Issue No:Vol. 9, No. 9 (2024)
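The R2 values reported in the abstract above come from a correlational analysis between per-match metrics. A minimal sketch of how such a coefficient of determination can be computed from a simple linear fit is shown below; the data are toy values, not the study's league figures.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination for a simple linear fit of y on x,
    as used when relating goals per match to xG or xGOT."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y - y.mean())**2)
    return 1.0 - ss_res / ss_tot

# Toy per-match values: goals exactly proportional to xGOT -> R2 of 1
xgot = np.array([1.2, 0.9, 1.8, 2.1, 1.5])
goals = 1.05 * xgot
r2 = r_squared(xgot, goals)
```

With real league data the fit is imperfect, so R2 falls below 1, as in the 0.92–0.98 range the study reports.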
Authors:Yingxun Wang, Adnan Mahmood, Mohamad Faizrizwan Mohd Sabri, Hushairi Zen First page: 103 Abstract: The emerging and promising paradigm of the Internet of Vehicles (IoV) employs vehicle-to-everything communication to facilitate vehicles communicating not only with one another but also with the supporting roadside infrastructure, vulnerable pedestrians, and the backbone network, primarily in a bid to address a number of safety-critical vehicular applications. Nevertheless, owing to the inherent characteristics of IoV networks, in particular, of being (a) highly dynamic in nature, which results in a continual change in the network topology, and (b) non-deterministic owing to the intricate nature of their entities and interrelationships, they are susceptible to a number of malicious attacks. Such attacks, if and when materialized, jeopardize the entire IoV network, thereby putting human lives at risk. Whilst cryptographic mechanisms are capable of mitigating external attacks, internal attacks are extremely hard to tackle. Trust, therefore, is an indispensable tool, since it facilitates the timely identification and eradication of malicious entities responsible for launching internal attacks in an IoV network. To date, there is no dataset pertinent to trust management in the context of IoV networks, and this has proven to be a bottleneck for conducting in-depth research in this domain. The manuscript at hand, accordingly, presents a first-of-its-kind trust-based IoV dataset encompassing 96,707 interactions amongst 79 vehicles at different time instances. The dataset involves nine salient trust parameters, i.e., packet delivery ratio, similarity, external similarity, internal similarity, familiarity, external familiarity, internal familiarity, reward/punishment, and context, which play a considerable role in ascertaining the trust of a vehicle within an IoV network. 
Citation: Data PubDate: 2024-08-31 DOI: 10.3390/data9090103 Issue No:Vol. 9, No. 9 (2024)
Authors:Daniel Doyle, Ovidiu Şerban First page: 104 Abstract: Despite the widespread development and use of chatbots, there is a lack of audio-based interruption datasets. This study provides a dataset of 200 manually annotated interruptions from a broader set of 355 data points of overlapping utterances. The dataset is derived from the Group Affect and Performance dataset managed by the University of the Fraser Valley, Canada. It includes both audio files and transcripts, allowing for multi-modal analysis. Given the extensive literature and the varied definitions of interruptions, it was necessary to establish precise definitions. The study aims to provide a comprehensive dataset for researchers to build and improve interruption prediction models. The findings demonstrate that classification models can generalize well to identify interruptions based on this dataset’s audio. This opens up research avenues with respect to interruption-related topics, ranging from multi-modal interruption classification using text and audio modalities to the analysis of group dynamics. Citation: Data PubDate: 2024-08-31 DOI: 10.3390/data9090104 Issue No:Vol. 9, No. 9 (2024)
Authors:Sebastian-Camilo Vanegas-Ayala, Julio Barón-Velandia, Oscar-Mauricio Garcia-Chavez, Adrian Romero-Palencia, Daniel-David Leal-Lara First page: 105 Abstract: Greenhouse cultivation is one of the current strategies to address the challenges of food production, sustainability, and food quality. Similarly, the use of technological tools to automate greenhouse environments through a set of sensors and actuators allows for the control and improvement of processes within this environment. This document presents data collected from the sensors and actuators of two identical greenhouse environments, one with the cultivation of stringless blue lake beans and the other without cultivation. The aim is that this dataset will provide a broader characterization of the behavior of climatic variables inside greenhouse environments and how they are impacted by control actions, subsequently contributing to the development of new research on implementations of or improvements to control, supervision, management, and automation actions in greenhouse environments. Citation: Data PubDate: 2024-09-04 DOI: 10.3390/data9090105 Issue No:Vol. 9, No. 9 (2024)
Authors:Anderson Carlos de Oliveira, Abel Cavalcante Lima Filho, Francisco Antonio Belo, André Victor Oliveira Cadena First page: 106 Abstract: This work presents an electrical measurement dataset from a split-system air conditioner under normal operating conditions and with specific faults, such as incrustation at the condenser and evaporator air inlets with different levels of blockage, which often occurs in this type of equipment. We also added compressor capacitor degradation, a very common fault in this type of equipment that is nonetheless scarcely addressed in research. The data were obtained through a non-invasive current sensor and a grain-oriented voltage sensor, capturing the current and voltage values of equipment that was installed in the field and tested at different levels of these fault conditions. This work not only explains how the entire data collection process was carried out but also presents two examples of fast Fourier transform (FFT) applications for the detection and diagnosis of faults through the analyzed electrical measurements, which proved effective in our studies. Citation: Data PubDate: 2024-09-13 DOI: 10.3390/data9090106 Issue No:Vol. 9, No. 9 (2024)
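The FFT-based detection mentioned in the abstract above rests on inspecting the frequency spectrum of the measured current. The sketch below applies an FFT to a synthetic current signal; the sampling rate, mains frequency, and harmonic amplitudes are assumed toy values, not figures from the dataset.

```python
import numpy as np

fs = 2000                        # assumed sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)   # one second of samples

# Toy compressor current: a 60 Hz mains component plus a small 120 Hz
# harmonic standing in for a hypothetical fault signature
current = 5.0 * np.sin(2 * np.pi * 60 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# One-sided amplitude spectrum and matching frequency bins
spectrum = np.abs(np.fft.rfft(current)) / len(t)
freqs = np.fft.rfftfreq(len(t), 1 / fs)

dominant = freqs[np.argmax(spectrum)]   # frequency of the largest peak
```

Fault diagnosis then amounts to tracking how the amplitudes at characteristic frequencies shift between normal and faulty conditions.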
Authors:Anna Speckert, Hui Ji, Kelly Payette, Patrice Grehten, Raimund Kottke, Samuel Ackermann, Beth Padden, Luca Mazzone, Ueli Moehrlen, Spina Bifida Study Group Zurich, Andras Jakab First page: 107 Abstract: We present the Open Spina Bifida Aperta (OSBA) atlas, an open atlas and set of neuroimaging templates for spina bifida aperta (SBA). Traditional brain atlases may not adequately capture anatomical variations present in pediatric or disease-specific cohorts. The OSBA atlas fills this gap by representing the computationally averaged anatomy of the neonatal brain with SBA after fetal surgical repair. The OSBA atlas was constructed using structural T2-weighted and diffusion tensor MRIs of 28 newborns with SBA who underwent prenatal surgical correction. The corrected gestational age at MRI was 38.1 ± 1.1 weeks (mean ± SD). The OSBA atlas consists of T2-weighted and fractional anisotropy templates, along with nine tissue prior maps and region of interest (ROI) delineations. The OSBA atlas offers a standardized reference space for spatial normalization and anatomical ROI definition. Our image segmentation and cortical ribbon definition are based on a human-in-the-loop approach, which includes manual segmentation. The precise alignment of the ROIs was achieved by a combination of manual image alignment and automated, non-linear image registration. From the clinical and neuroimaging perspective, the OSBA atlas enables more accurate spatial standardization and ROI-based analyses and supports advanced analyses such as diffusion tractography and connectomic studies in newborns affected by this condition. Citation: Data PubDate: 2024-09-17 DOI: 10.3390/data9090107 Issue No:Vol. 9, No. 9 (2024)
Authors:Roberto Sánchez-Cabrero, Elena López-de-Arana Prado, Pilar Aramburuzabala, Rosario Cerrillo First page: 108 Abstract: This dataset shows the original validation and standardization of the Questionnaire for the Self-Assessment of Service-Learning Experiences in Higher Education (QaSLu). The QaSLu is the first instrument to measure university service-learning (USL), validated through a rigorous qualitative and quantitative process by a sample of experts in USL, and it generates rating scales for different profiles of professors. The Delphi method was used for the qualitative validation by 16 academic experts, who evaluated the relevance and clarity of the items. After two consultation rounds, 45 items were qualitatively validated, generating the QaSLu-45. Then, 118 instructors from 43 universities took part as the sample in the quantitative validation procedure. Quantitative validation was carried out through goodness-of-fit measures using confirmatory factor analysis, and the final configuration was optimized using one-factor robust exploratory factor analysis, determining the optimal version of the questionnaire under the law of parsimony, the QaSLu-27, with only 27 items and better psychometric properties. Finally, rating scales were calculated to compare different profiles of USL professors. These findings offer a valid, strong, and trustworthy instrument. The QaSLu-27 may be helpful for the design of USL experiences, in addition to facilitating the assessment of such programs to enhance teaching and learning processes. Citation: Data PubDate: 2024-09-19 DOI: 10.3390/data9090108 Issue No:Vol. 9, No. 9 (2024)
Authors: Pfunzo, Bahta, Jordaan First page: 109 Abstract: The purpose of the Social Accounting Matrix (SAM) is to improve the quality of the database for modelling, including, but not limited to, policy analysis, multiplier analysis, price analysis, and Computable General Equilibrium. This article contributes to constructing the 2017 national SAM for South Africa, incorporating regional accounts. Only in Limpopo Province of South Africa are agricultural industries, labour, and households captured at the district level, while agricultural industry, labour, and household accounts in other provinces remain unchanged. The main data for constructing a SAM come from different sources, such as Supply and Use Tables, National Accounts, the Census of Commercial Agriculture, the Quarterly Labour Force Survey, the South African Revenue Service, Global Insight (regional explorer), and the South African Reserve Bank. The dataset records that land returns for irrigation agriculture were highest (18.2%) in the Northern Cape Province of South Africa compared to other provinces, whereas rainfed agriculture in the Free State Province of South Africa had the largest share (22%) of payments to land. Regarding intermediate inputs, rainfed agriculture in the Western Cape, Free State, and KwaZulu-Natal Provinces paid approximately 0.4% for the use of intermediate inputs. In terms of districts, land returns for irrigation were highest in the Vhembe district of Limpopo Province of South Africa, at 0.3%. Despite the Mopani district of Limpopo Province of South Africa having the lowest land returns for irrigation agriculture, it has the highest share (1.6%) of payments to land from rainfed agriculture. The manufacturing and community service sectors had a trade deficit, whereas other sectors experienced a trade surplus. 
The main challenges found in developing a SAM are the scarcity of data needed for disaggregating the sub-matrices and insufficient information from the different data sources for estimating missing values so as to ensure that the row and column totals of the SAM are consistent and complete. Citation: Data PubDate: 2024-09-20 DOI: 10.3390/data9090109 Issue No:Vol. 9, No. 9 (2024)
Authors:Christian Odenwald, Moritz Beeking First page: 90 Abstract: While cycling presents environmental benefits and promotes a healthy lifestyle, the risks associated with overtaking maneuvers by motorized vehicles represent a significant barrier for many potential cyclists. A large-scale analysis of overtaking maneuvers could inform traffic researchers and city planners how to reduce these risks by better understanding these maneuvers. Drawing from the fields of sensor-based cycling research and LiDAR-based traffic datasets, this paper provides a step towards addressing these safety concerns by introducing the Salzburg Bicycle 3d (SaBi3d) dataset, which consists of LiDAR point clouds capturing car-to-bicycle overtaking maneuvers. The dataset, collected using a LiDAR-equipped bicycle, facilitates the detailed analysis of a large quantity of overtaking maneuvers without the need for manual annotation by enabling automatic labeling through a neural network. Additionally, a benchmark result for 3D object detection using a competitive neural network is provided as a baseline for future research. The SaBi3d dataset is structured identically to the nuScenes dataset, and therefore offers compatibility with numerous existing object detection systems. This work provides valuable resources for future researchers to better understand cycling infrastructure and mitigate risks, thus promoting cycling as a viable mode of transportation. Citation: Data PubDate: 2024-07-24 DOI: 10.3390/data9080090 Issue No:Vol. 9, No. 8 (2024)
Authors:Alexandre Vilhena Silva-Neto, Gabriel Santos Mouta, Antônio Alcirley Silva Balieiro, Jady Shayenne Mota Cordeiro, Patricia Carvalho Silva Balieiro, Tatyana Costa Amorin Ramos, Djane Clarys Baia-da-Silva, Élisson Silva Rocha, Patricia Takako Endo, Theo Lynn, Wuelton Marcelo Monteiro, Vanderson Souza Sampaio First page: 91 Abstract: Snakebite envenomations (SBE) are a significant global public health threat due to their morbidity and mortality, and a neglected public health issue in many tropical and subtropical countries. Brazil is among the ten countries most affected by SBE, with 32,160 cases reported in 2020 alone, posing a high burden on this population. In this paper, we describe the data structure of snakebite records from 2007 to 2020 in the Notifiable Disease Information System (SINAN), made available by the Brazilian Ministry of Health (MoH). In addition, we provide R scripts that allow quick and automatic updating of the data from SINAN as they become available. The data presented in this work comprise clinical and demographic information on SBE cases, along with data on outcomes, laboratory results, and treatment. The dataset is freely accessible; however, preprocessing, adjustments, and standardization are necessary due to incompleteness and inconsistencies. Regardless of these limitations, it provides a solid basis for assessing different aspects and the national burden of envenoming. Citation: Data PubDate: 2024-07-24 DOI: 10.3390/data9080091 Issue No:Vol. 9, No. 8 (2024)
Authors:Cleopatra Christina Moshona, Frederic Rudawski, André Fiebig, Ennes Sarradj First page: 92 Abstract: In this article, we introduce the Berlin Dataset of Lombard and Masked Speech (BELMASK), a phonetically controlled audiovisual dataset of speech produced in adverse speaking conditions, and describe the development of the related speech task. The dataset contains in total 128 min of audio and video recordings of 10 German native speakers (4 female, 6 male) with a mean age of 30.2 years (SD: 6.3 years), uttering matrix sentences in cued, uninstructed speech in four conditions: (i) with a Filtering Facepiece P2 (FFP2) mask in silence, (ii) without an FFP2 mask in silence, (iii) with an FFP2 mask while exposed to noise, (iv) without an FFP2 mask while exposed to noise. Noise consisted of mixed-gender six-talker babble played over headphones to the speakers, triggering the Lombard effect. All conditions are readily available in face-and-voice and voice-only formats. The speech material is annotated, employing a multi-layer architecture, and was originally conceptualized to be used for the administration of a working memory task. The dataset is stored in a restricted-access Zenodo repository and is available for academic research in the area of speech communication, acoustics, psychology and related disciplines upon request, after signing an End User License Agreement (EULA). Citation: Data PubDate: 2024-07-24 DOI: 10.3390/data9080092 Issue No:Vol. 9, No. 8 (2024)
Authors:Maryam Abbasi, Marco V. Bernardo, Paulo Váz, José Silva, Pedro Martins First page: 93 Abstract: Complex event processing (CEP) systems have gained significant importance in various domains, such as finance, logistics, and security, where the real-time analysis of event streams is crucial. However, as the volume and complexity of event data continue to grow, optimizing the performance of CEP systems becomes a critical challenge. This paper investigates the impact of indexing strategies on the performance of databases handling complex event processing. We propose a novel indexing technique, called Hierarchical Temporal Indexing (HTI), specifically designed for the efficient processing of complex event queries. HTI leverages the temporal nature of event data and employs a multi-level indexing approach to optimize query execution. By combining temporal indexing with spatial- and attribute-based indexing, HTI aims to accelerate the retrieval and processing of relevant events, thereby improving overall query performance. In this study, we evaluate the effectiveness of HTI by implementing complex event queries on various CEP systems with different indexing strategies. We conduct a comprehensive performance analysis, measuring the query execution times and resource utilization (CPU, memory, etc.), and analyzing the execution plans and query optimization techniques employed by each system. Our experimental results demonstrate that the proposed HTI indexing strategy outperforms traditional indexing approaches, particularly for complex event queries involving temporal constraints and multi-dimensional event attributes. We provide insights into the strengths and weaknesses of each indexing strategy, identifying the factors that influence performance, such as data volume, query complexity, and event characteristics. 
Furthermore, we discuss the implications of our findings for the design and optimization of CEP systems, offering recommendations for indexing strategy selection based on the specific requirements and workload characteristics. Finally, we outline the potential limitations of our study and suggest future research directions in this domain. Citation: Data PubDate: 2024-07-24 DOI: 10.3390/data9080093 Issue No:Vol. 9, No. 8 (2024)
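The multi-level idea behind HTI, a coarse temporal partition with a finer timestamp-sorted level beneath it, can be illustrated with a minimal sketch. This is not the paper's implementation; all names and parameters below are assumptions, and a real CEP index would combine this temporal level with the spatial and attribute indexes the abstract describes.

```python
import bisect
from collections import defaultdict

class TemporalIndex:
    """Toy two-level temporal index: events are grouped into coarse time
    buckets, and each bucket keeps a timestamp-sorted list, so a range
    query only touches the relevant buckets."""

    def __init__(self, bucket_width=3600):
        self.bucket_width = bucket_width      # coarse level, e.g. 1-hour buckets
        self.buckets = defaultdict(list)      # bucket id -> [(ts, event), ...]

    def insert(self, ts, event):
        # keep the fine level sorted by timestamp
        bisect.insort(self.buckets[ts // self.bucket_width], (ts, event))

    def range_query(self, t_start, t_end):
        out = []
        for b in range(t_start // self.bucket_width,
                       t_end // self.bucket_width + 1):
            rows = self.buckets.get(b, [])
            # binary-search into the bucket, then scan only in-range rows
            for ts, ev in rows[bisect.bisect_left(rows, (t_start, "")):]:
                if ts > t_end:
                    break
                out.append(ev)
        return out

idx = TemporalIndex(bucket_width=10)
for t, e in [(1, "a"), (12, "b"), (15, "c"), (31, "d")]:
    idx.insert(t, e)
print(idx.range_query(10, 20))  # -> ['b', 'c']
```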
Authors:Bernd Accou, Lies Bollens, Marlies Gillis, Wendy Verheijen, Hugo Van hamme, Tom Francart First page: 94 Abstract: Researchers investigating the neural mechanisms underlying speech perception often employ electroencephalography (EEG) to record brain activity while participants listen to spoken language. The high temporal resolution of EEG enables the study of neural responses to fast and dynamic speech signals. Previous studies have successfully extracted speech characteristics from EEG data and, conversely, predicted EEG activity from speech features. Machine learning techniques are generally employed to construct encoding and decoding models, which necessitate a substantial quantity of data. We present SparrKULee, a Speech-evoked Auditory Repository of EEG data, measured at KU Leuven, comprising 64-channel EEG recordings from 85 young individuals with normal hearing, each of whom listened to 90–150 min of natural speech. This dataset is more extensive than any currently available dataset in terms of both the number of participants and the quantity of data per participant. It is suitable for training larger machine learning models. We evaluate the dataset using linear and state-of-the-art non-linear models in a speech encoding/decoding and match/mismatch paradigm, providing benchmark scores for future research. Citation: Data PubDate: 2024-07-26 DOI: 10.3390/data9080094 Issue No:Vol. 9, No. 8 (2024)
Authors:Joanna Kostanek, Kamil Karolczak, Wiktor Kuliczkowski, Cezary Watala First page: 95 Abstract: In today’s research environment characterized by exponential data growth and increasing complexity, the selection of appropriate statistical tests, tailored to research objectives and data distributions, is paramount for rigorous analysis and accurate interpretation. This article explores the growing prominence of bootstrapping, an advanced statistical technique for multiple comparisons analysis, offering flexibility and customization by estimating sample distributions without assuming population distributions, thus serving as a valuable alternative to traditional methods in various data scenarios. Computer simulations were conducted using data from cardiovascular disease patients. Two approaches, a spontaneous, partly controlled simulation and a fully constrained simulation using self-written R scripts, were utilized to generate datasets with specified distributions and analyze the data using tests for comparing more than two groups. The utilization of the bootstrap method greatly improves statistical analysis, especially in overcoming the constraints of conventional parametric tests. Our research showcased its effectiveness in comparing multiple scenarios, yielding strong findings across diverse distributions, even with minor inflation in p-values. Serving as a valuable substitute for parametric approaches, bootstrapping promotes careful consideration when rejecting hypotheses, thus fostering a deeper understanding of statistical nuances and bolstering analytical rigor. Citation: Data PubDate: 2024-07-26 DOI: 10.3390/data9080095 Issue No:Vol. 9, No. 8 (2024)
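The core bootstrap idea for group comparison, resampling under the null hypothesis instead of assuming a population distribution, can be sketched in a toy two-group form. The study's simulations use self-written R scripts and cover more than two groups; the data and function below are purely illustrative.

```python
import random

def bootstrap_two_group_p(x, y, n_boot=2000, seed=0):
    """Bootstrap p-value for a difference in group means: resample both
    groups from the pooled data (the null hypothesis of one common
    distribution) and count how often the resampled difference is at
    least as extreme as the observed one."""
    rng = random.Random(seed)
    obs = sum(x) / len(x) - sum(y) / len(y)
    pooled = x + y
    extreme = 0
    for _ in range(n_boot):
        bx = [rng.choice(pooled) for _ in x]
        by = [rng.choice(pooled) for _ in y]
        if abs(sum(bx) / len(bx) - sum(by) / len(by)) >= abs(obs):
            extreme += 1
    return extreme / n_boot

p = bootstrap_two_group_p([5.1, 4.8, 5.4, 5.0], [6.2, 6.0, 6.5, 5.9])
print(p)  # a small p-value suggests the group means differ
```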
Authors:Zahra Shiri, Aymen Frija, Hichem Rejeb, Hassen Ouerghemmi, Quang Bao Le First page: 96 Abstract: Understanding past landscape changes is crucial to promote agroecological landscape transitions. This study analyzes past land cover changes (LCCs) alongside subsequent degradation and improvements in the study area. The input land cover (LC) data were taken from ESRI’s ArcGIS Living Atlas of the World and then assessed for accuracy using ground truth data points randomly selected from high-resolution images on the Google Earth Engine. The LCC analyses were performed on QGIS 3.28.15 using the Semi-Automatic Classification Plugin (SCP) to generate LCC data. The degradation or improvement derived from the analyzed data was subsequently assessed using the UNCCD Good Practice Guidance to generate land cover degradation data. Using the Landscape Ecology Statistics (LecoS) plugin in QGIS, the input LC data were processed to provide landscape metrics. The data presented in this article show that the studied landscape is not static, even over a short-term time horizon (2017–2022). The transition from one LC class to another had an impact on the ecosystem and induced different states of degradation. For the three main LC classes (forest, crops, and rangeland) representing 98.9% of the total area in 2022, the landscape metrics, especially the number of patches, reflected a 105% increase in landscape fragmentation between 2017 and 2022. Citation: Data PubDate: 2024-07-29 DOI: 10.3390/data9080096 Issue No:Vol. 9, No. 8 (2024)
Authors:Leopoldo Palma, Yolanda Bel, Baltasar Escriche First page: 97 Abstract: Bacillus thuringiensis (Bt) is a Gram-positive, spore-forming, and ubiquitous bacterium harboring plasmids encoding a variety of proteins with insecticidal activity, but also with activity against nematodes. The aim of this work was to perform the genome sequencing and analysis of a native Bt strain showing bipyramidal parasporal crystals and designated V-CO3.3, which was isolated from the dust of a grain storehouse in Córdoba (Spain). Its genome comprised 99 high-quality assembled contigs accounting for a total size of 5.2 Mb and 35.1% G + C. Phylogenetic analyses suggested that this strain should be renamed as Bacillus cereus s.s. biovar Thuringiensis. Gene annotation revealed a total of 5495 genes, among which, 1 was identified as encoding a Cry5Ba homolog protein with well-documented toxicity against nematodes. These results suggest that this Bt strain has interesting potential for nematode biocontrol. Citation: Data PubDate: 2024-07-29 DOI: 10.3390/data9080097 Issue No:Vol. 9, No. 8 (2024)
Authors:Eman Naser-Karajah, Nabil Arman First page: 98 Abstract: Lexical substitution aims to generate a list of equivalent substitutions (i.e., synonyms) for a sentence’s target word or phrase while preserving the sentence’s meaning, in order to improve writing, enhance language understanding, improve natural language processing models, and handle ambiguity. This task has recently attracted much attention in many languages. Despite the richness of Arabic vocabulary, limited research has been performed on the lexical substitution task due to the lack of annotated data. To bridge this gap, we present the first Arabic lexical substitution benchmark dataset, AraLexSubD, for benchmarking lexical substitution pipelines. AraLexSubD was manually built by eight native Arabic speakers (six linguist annotators, a doctor, and an economist), who annotated the 630 sentences. AraLexSubD covers three domains: general, finance, and medical. It encompasses 2476 substitution candidates ranked according to their semantic relatedness. We also present the first Arabic lexical substitution pipeline, AraLexSub, which uses the AraBERT pre-trained language model. The pipeline consists of several modules: substitute generation, substitute filtering, and candidate ranking. The filtering step shows its effectiveness by achieving an increase of 1.6 in the F1 score on the entire AraLexSubD dataset. Additionally, an error analysis of the experiment is reported. To our knowledge, this is the first study on Arabic lexical substitution. Citation: Data PubDate: 2024-07-30 DOI: 10.3390/data9080098 Issue No:Vol. 9, No. 8 (2024)
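The three-module pipeline structure described here (substitute generation, filtering, candidate ranking) can be sketched with stand-ins. The real AraLexSub uses AraBERT for generation and scoring; the lexicon, vocabulary, and scores below are purely illustrative assumptions.

```python
# Stand-ins for the language-model resources (illustrative only).
LEXICON = {"big": ["large", "huge", "grand", "bigly"]}   # generation source
VOCAB = {"large", "huge", "grand"}                       # filtering resource

def substitutes(word, score):
    """Generate, filter, and rank substitution candidates for a word."""
    candidates = LEXICON.get(word, [])                   # 1. substitute generation
    candidates = [c for c in candidates if c in VOCAB]   # 2. substitute filtering
    return sorted(candidates, key=score, reverse=True)   # 3. candidate ranking

scores = {"large": 0.9, "huge": 0.7, "grand": 0.3}       # stand-in for model scores
print(substitutes("big", scores.get))  # -> ['large', 'huge', 'grand']
```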
Authors:Fred Eduardo Revoredo Rabelo Ferreira, Robson do Nascimento Fidalgo First page: 99 Abstract: A Data Warehouse (DW) is a centralized database that stores large volumes of historical data for analysis and reporting. In a world where enterprise data grows exponentially, new architectures are being investigated to overcome the deficiencies of traditional Database Management Systems (DBMSs), driving a shift towards more modern, cloud-based solutions that provide resources such as distributed processing, columnar storage, and horizontal scalability without the overhead of physical hardware management, i.e., a Database as a Service (DBaaS). Choosing the appropriate class of DBMS is a critical decision for organizations, and there are important differences that impact data volume and query performance (e.g., architecture, data models, and storage) to support analytics in a distributed cloud environment efficiently. In this sense, we carry out an experimental evaluation to analyze the performance of several DBaaS and the impact of data modeling, specifically the usage of a partially normalized Star Schema and a fully denormalized Flat Table Schema, to further comprehend their behavior in different configurations and designs in terms of data schema, storage form, memory availability, and cluster size. The analysis is done in two volumes of data generated by a well-established benchmark, comparing the performance of the DW in terms of average execution time, memory usage, data volume, and loading time. Our results provide guidelines for efficient DW design, showing, for example, that the denormalization of the schema does not guarantee improved performance, as solutions performed differently depending on its architecture. 
We also show that a Hybrid Processing (HTAP) NewSQL solution can outperform solutions that support only Online Analytical Processing (OLAP) in terms of overall execution time, but that the performance of each query is deeply influenced by its selectivity and by the number of join functions. Citation: Data PubDate: 2024-08-05 DOI: 10.3390/data9080099 Issue No:Vol. 9, No. 8 (2024)
Authors:Dominika Petríková, Ivan Cimrák, Katarína Tobiášová, Lukáš Plank First page: 100 Abstract: In this work, we describe a dataset suitable for analyzing the extent to which hematoxylin–eosin (HE)-stained tissue contains information about the expression of Ki67 in immunohistochemistry staining. The dataset provides images of corresponding pairs of HE and Ki67 stainings and is complemented by algorithms for computing the Ki67 index. We introduce a dataset of high-resolution histological images of testicular seminoma tissue. The dataset comprises digitized histology slides from 77 conventional testicular seminoma patients, obtained via surgical resection. For each patient, two physically adjacent tissue sections are stained: one with hematoxylin and eosin, and one with Ki67 immunohistochemistry staining. This results in a total of 154 high-resolution images. The images are provided in PNG format, facilitating ease of use for image analysis compared to the original scanner output formats. Each image contains enough tissue to generate thousands of non-overlapping 224 × 224 pixel patches. This shows the potential to generate more than 50,000 pairs of patches, one with HE staining and a corresponding Ki67 patch that depicts a very similar part of the tissue. Finally, we present the results of applying a ResNet neural network for the classification of HE patches into categories according to their Ki67 label. Citation: Data PubDate: 2024-08-07 DOI: 10.3390/data9080100 Issue No:Vol. 9, No. 8 (2024)
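The patch arithmetic behind the "thousands of non-overlapping 224 × 224 pixel patches" claim is easy to sketch; the image size used below is an assumption for illustration, not the dataset's actual slide dimensions.

```python
def tile_coordinates(width, height, patch=224):
    """Top-left corners of every non-overlapping patch x patch tile
    that fits fully inside a width x height image."""
    return [(x, y)
            for y in range(0, height - patch + 1, patch)
            for x in range(0, width - patch + 1, patch)]

# An assumed 50,000 x 30,000 pixel slide region (illustrative size only).
coords = tile_coordinates(50_000, 30_000)
print(len(coords))  # 223 * 133 = 29,659 patches from this assumed size
```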
Authors:Nilesh Kumar, M. Shahid Mukhtar First page: 101 Abstract: Network centrality analyses have proven to be successful in identifying important nodes in diverse host–pathogen interactomes. The current study presents a comprehensive investigation of the human interactome and SARS-CoV-2 host targets. We first constructed a comprehensive human interactome by compiling experimentally validated protein–protein interactions (PPIs) from eight distinct sources. Additionally, we compiled a comprehensive list of 1449 SARS-CoV-2 host proteins and analyzed their interactions within the human interactome, which identified enriched biological processes and pathways. Seven diverse topological features were employed to reveal the enrichment of the SARS-CoV-2 targets in the human interactome, with closeness centrality emerging as the most effective metric. Furthermore, a novel approach called CentralityCosDist was employed to predict SARS-CoV-2 targets, which proved to be effective in expanding the pool of predicted targets. Pathway enrichment analyses further elucidated the functional roles and potential mechanisms associated with predicted targets. Overall, this study provides valuable insights into the complex interplay between SARS-CoV-2 and the host’s cellular machinery, contributing to a deeper understanding of viral infection and immune response modulation. Citation: Data PubDate: 2024-08-20 DOI: 10.3390/data9080101 Issue No:Vol. 9, No. 8 (2024)
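Closeness centrality, the metric the study found most effective, is straightforward to compute on an unweighted interactome via breadth-first search. A minimal sketch on a toy graph (not the actual interactome):

```python
from collections import deque

def closeness(adj, node):
    """Closeness centrality of `node` in an unweighted graph given as an
    adjacency dict: (n - 1) divided by the sum of shortest-path
    distances to all reachable nodes."""
    dist = {node: 0}
    q = deque([node])
    while q:                       # breadth-first search for distances
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Star graph: the hub is closest to everything.
g = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
print(closeness(g, "hub"))  # -> 1.0
print(closeness(g, "a"))    # 3 / (1 + 2 + 2) = 0.6
```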
Authors:Joanna Choueiri, Pascal Petit, Franck Balducci, Dominique J. Bicout, Christine Demeilliers First page: 89 Abstract: Populations are exposed daily to numerous environmental pollutants, particularly through food. To address environmental issues, many agricultural production methods have been developed, including organic farming. To date, there is no exhaustive inventory of the contamination of organic foods as there is for conventional foods. The main objective of this work was to construct a growing and updatable database on chemical substances and their levels in organic foods consumed in Europe. To this end, a literature search was conducted, resulting in a total of 1207 concentration values from 823 food–substance pairs involving 166 food matrices and 209 chemical substances, among which 95% were not authorized in organic farming and 80% were pesticides. The most frequently encountered substance groups are “inorganic contaminants” and “organophosphate”, and the most studied food groups are “fruit used as fruit” and “Cereals and cereal primary derivatives”. Further studies are needed to continue updating the database with robust and comprehensive data on organic food contamination. This database could be used to study the health risks associated with these contaminants. Citation: Data PubDate: 2024-07-03 DOI: 10.3390/data9070089 Issue No:Vol. 9, No. 7 (2024)
Authors:Christian Gück, Cyriana M. A. Roelofs, Stefan Faulstich First page: 138 Abstract: Early fault detection plays a crucial role in the field of predictive maintenance for wind turbines, yet the comparison of different algorithms poses a difficult task because domain-specific public datasets are scarce. Many comparisons of different approaches either use benchmarks composed of data from many different domains, inaccessible data, or one of the few publicly available datasets that lack detailed information about the faults. Moreover, many publications highlight only a couple of case studies where fault detection was successful. With this paper, we publish a high-quality dataset that contains data from 36 wind turbines across 3 different wind farms as well as, to the best of our knowledge, the most detailed fault information of any public wind turbine dataset. The new dataset contains 89 years’ worth of real-world operating data of wind turbines, distributed across 44 labeled time frames for anomalies that led up to faults, as well as 51 time series representing normal behavior. Additionally, the quality of training data is ensured by turbine-status-based labels for each data point. Furthermore, we propose a new scoring method, called CARE (Coverage, Accuracy, Reliability and Earliness), which takes advantage of the information depth that is present in the dataset to identify good early fault detection models for wind turbines. This score considers the anomaly detection performance, the ability to recognize normal behavior properly, and the capability to raise as few false alarms as possible while simultaneously detecting anomalies early. Citation: Data PubDate: 2024-11-23 DOI: 10.3390/data9120138 Issue No:Vol. 9, No. 12 (2024)
Authors:Marcello Buoncristiano, Giansalvatore Mecca, Donatello Santoro, Enzo Veltri First page: 139 Abstract: In the era of Big Data, entity resolution (ER), i.e., the process of identifying which records refer to the same entity in the real world, plays a critical role in data-integration tasks, especially in mission-critical applications where accuracy is mandatory, since we want to avoid integrating different entities or missing matches. However, existing approaches struggle with the challenges posed by rapidly changing data and the presence of dirtiness, which requires iterative refinement over time. We present Detective Gadget, a novel system for iterative ER that seamlessly integrates data cleaning into the ER workflow. Detective Gadget employs an alias-based hashing mechanism for fast and scalable matching, check functions to detect and correct mismatches, and a human-in-the-loop framework to refine results through expert feedback. The system iteratively improves data quality and matching accuracy by leveraging evidence from both automated and manual decisions. Extensive experiments across diverse real-world scenarios demonstrate its effectiveness, achieving high accuracy and efficiency while adapting to evolving datasets. Citation: Data PubDate: 2024-11-25 DOI: 10.3390/data9120139 Issue No:Vol. 9, No. 12 (2024)
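The alias-based hashing idea, bucketing records under a canonical key so that candidate matching is a hash lookup rather than a pairwise comparison, can be sketched as follows. The normalization rule and the records are illustrative assumptions, not Detective Gadget's actual alias functions.

```python
import re

def alias_key(text):
    """Toy canonical alias: lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def hash_match(records):
    """Bucket (id, text) records by alias key; records sharing a bucket
    become candidate matches in a single O(n) pass."""
    buckets = {}
    for rid, text in records:
        buckets.setdefault(alias_key(text), []).append(rid)
    return [ids for ids in buckets.values() if len(ids) > 1]

recs = [(1, "ACME Corp."), (2, "acme corp"), (3, "Beta LLC")]
print(hash_match(recs))  # -> [[1, 2]]
```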
Authors:Simin Huang, Zhiying Yang First page: 140 Abstract: Simplifying trajectory data can improve the efficiency of trajectory data analysis and querying and reduce the communication cost and computational overhead of trajectory data. In this paper, a real-time trajectory simplification algorithm (SSFI) based on the spatio-temporal feature information of implicit trajectory points is proposed. The algorithm constructs a preselected area through the error measurement method based on the feature information of implicit trajectory points (IEDs) proposed in this paper, predicts where upcoming trajectory points will fall, and realizes a one-way, error-bounded trajectory simplification. Experiments show that the algorithm offers clear improvements in three respects: running speed, compression accuracy, and simplification rate. When the trajectory data scale is large, its performance is far better than that of other line-segment simplification algorithms. GPS error is unavoidable; smoothing the trajectory with a Kalman filter effectively eliminates the influence of noise and significantly improves the performance of the simplification algorithm. Based on the characteristics of the trajectory data, this paper constructs an accurate mathematical model of the objects’ motion state, so that the Kalman filter outperforms other filters when smoothing trajectory data. Smoothing experiments in which random Gaussian noise was added to the trajectory data confirm that, under this mathematical model, the Kalman filter performs better than other filters. Citation: Data PubDate: 2024-11-29 DOI: 10.3390/data9120140 Issue No:Vol. 9, No. 12 (2024)
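A generic one-pass, error-bounded simplifier conveys the flavor of this class of algorithms. This is a plain sliding-window scheme under assumed data, not SSFI itself, which additionally exploits implicit spatio-temporal features to predict upcoming points.

```python
def perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dy * px - dx * py + bx * ay - by * ax) / (dx * dx + dy * dy) ** 0.5

def simplify(points, eps):
    """One-pass, error-bounded simplification: keep an anchor and extend
    the current segment while every intermediate point stays within eps
    of it; otherwise start a new segment at the previous point."""
    kept = [points[0]]
    anchor = 0
    for i in range(2, len(points)):
        if any(perp_dist(points[j], points[anchor], points[i]) > eps
               for j in range(anchor + 1, i)):
            kept.append(points[i - 1])
            anchor = i - 1
    kept.append(points[-1])
    return kept

track = [(0, 0), (1, 0.05), (2, -0.04), (3, 0.02), (4, 3.0), (5, 3.1)]
print(simplify(track, eps=0.1))  # -> [(0, 0), (3, 0.02), (4, 3.0), (5, 3.1)]
```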
Authors:Chunjing Wang, Wuxian Yan, Jizhong Wan First page: 141 Abstract: This comprehensive dataset on the number of plant species, genera, and families in 383 national nature reserves in China has been compiled based on the available literature. Heilongjiang Province and the Guangxi Zhuang Autonomous Region have the highest number of nature reserves. Species richness is relatively high in the Jinfoshan, Dabashan, Wenshan, Hupingshan, and Shennongjia Nature Reserves. This dataset provides important baseline information on plant species richness coupling with genus and family numbers in Chinese national nature reserves and should help researchers and environmentalists understand the dynamic species changes in various nature reserves. This detailed and reliable information may serve as the foundation for future plant research in Chinese nature reserves and play a positive role in promoting more effective natural protection, biological distribution, and biodiversity conservation in these areas. Citation: Data PubDate: 2024-11-30 DOI: 10.3390/data9120141 Issue No:Vol. 9, No. 12 (2024)
Authors:Yiya Diao, Changhe Li, Sanyou Zeng, Shengxiang Yang First page: 142 Abstract: Contaminant Source Identification in Water Distribution Network (CSWIDN) is critical for ensuring public health, and optimization algorithms are commonly used to solve this complex problem. However, these algorithms are highly sensitive to the problem’s landscape features, which has limited their effectiveness in practice. Despite this, there has been little experimental analysis of the fitness landscape for CSWIDN, particularly given its mixed-encoding nature. This study addresses this gap by conducting a comprehensive fitness landscape analysis of CSWIDN using the Nearest-Better Network (NBN), the only applicable method for mixed-encoding problems. Our analysis reveals for the first time that CSWIDN exhibits landscape features including neutrality, ruggedness, modality, dynamic change, and separability. These findings not only deepen our understanding of the problem’s inherent landscape features but also provide quantitative insights into how these features influence algorithm performance. Additionally, based on these insights, we propose specific algorithm design recommendations that are better suited to the unique challenges of the CSWIDN problem. This work advances the knowledge of CSWIDN optimization by both qualitatively characterizing its landscape and quantitatively linking these features to algorithm behavior. Citation: Data PubDate: 2024-12-06 DOI: 10.3390/data9120142 Issue No:Vol. 9, No. 12 (2024)
Authors:Paulina Grigusova, Christian Beilschmidt, Maik Dobbermann, Johannes Drönner, Michael Mattig, Pablo Sanchez, Nina Farwig, Jörg Bendix First page: 143 Abstract: Over almost 20 years, a data storage, analysis, and project administration engine (TMFdw) has been continuously developed in a series of several consecutive interdisciplinary research projects on functional biodiversity of the southern Andes of Ecuador. Starting as a “working database”, the system now includes program management modules and literature databases, which are all accessible via a web interface. Originally designed to manage data in the ecological Research Unit 816 (SE Ecuador), the open software is now being used in several other environmental research programs, demonstrating its broad applicability. While the system was initially developed mainly for abiotic and biotic tabular data, the new research program demands full support for area-wide, high-resolution model outputs and remote sensing raster data. Thus, a raster engine was recently implemented based on the Geo Engine technology. The great variety of pre-implemented desktop GIS-like analysis options for raster, point, and vector data is an important incentive for researchers to use the system. A second incentive is to implement use cases prioritized by the researchers. As an example, we present machine learning models to generate high-resolution (30 m) microclimate raster layers for the study area at different temporal aggregation levels for the most important variables of air temperature, humidity, precipitation, and solar radiation. The models implemented as use cases outperform similar models developed in other research programs. Citation: Data PubDate: 2024-12-06 DOI: 10.3390/data9120143 Issue No:Vol. 9, No. 12 (2024)
Authors:Thomas Pfitzinger, Marcel Koch, Fabian Schlenke, Hendrik Wöhrle First page: 144 Abstract: The detection of human activities is an important step in automated systems to understand the context of given situations. It can be useful for applications like healthcare monitoring, smart homes, and energy management systems for buildings. To achieve this, a sufficient data basis is required. The presented dataset contains labeled recordings of 25 different activities of daily living performed individually by 14 participants. The data were captured by five multisensors in supervised sessions in which a participant repeated each activity several times. Flawed recordings were removed, and the different data types were synchronized to provide multi-modal data for each activity instance. Apart from this, the data are presented in raw form, and no further filtering was performed. The dataset comprises ambient audio and vibration, as well as infrared array data, light color and environmental measurements. Overall, 8615 activity instances are included, each captured by the five multisensor devices. These multi-modal and multi-channel data allow various machine learning approaches to the recognition of human activities, for example, federated learning and sensor fusion. Citation: Data PubDate: 2024-12-09 DOI: 10.3390/data9120144 Issue No:Vol. 9, No. 12 (2024)
Authors:Daria Bogatova, Stanislav Ogorodov First page: 145 Abstract: This study aimed to develop a methodological framework for predicting shoreline dynamics using machine learning techniques, focusing on analyzing generalized data without distinguishing areas with higher or lower retreat rates. Three sites along the southwestern Kara Sea coast were selected for this investigation. The study analyzed key coastal features, including lithology, permafrost, and geomorphology, using a combination of field studies and remote sensing data. Essential datasets were compiled and formatted for computer-based analysis. These datasets included information on permafrost and the geomorphological characteristics of the coastal zone, climatic factors influencing the shoreline, and measurements of bluff top positions and retreat rates over defined time periods. The positions of the bluff tops were determined through a combination of imagery with varying resolutions and field measurements. A novel aspect of the study involved employing geostatistical methods to analyze erosion rates, providing new insights into the shoreline dynamics. The data analysis allowed us to identify coastal areas experiencing the most significant changes. By continually refining neural network models with these datasets, we can improve our understanding of the complex interactions between natural factors and shoreline evolution, ultimately aiding in developing effective coastal management strategies. Citation: Data PubDate: 2024-12-09 DOI: 10.3390/data9120145 Issue No:Vol. 9, No. 12 (2024)
Authors:Jiajian Ke, Tian Chen First page: 146 Abstract: Accurate wind power forecasting is essential for maintaining the stability of a power system and enhancing scheduling efficiency in the power sector. To enhance prediction accuracy, this paper presents a hybrid wind power prediction model that integrates the improved complementary ensemble empirical mode decomposition (ICEEMDAN), the RIME optimization algorithm (RIME), sample entropy (SE), the improved dung beetle optimization (IDBO) algorithm, the bidirectional long short-term memory (BiLSTM) network, and multi-head attention (MHA). In this model, RIME is utilized to improve the parameters of ICEEMDAN, reducing data decomposition complexity and effectively capturing the original data information. The IDBO algorithm is then utilized to improve the hyperparameters of the MHA-BiLSTM model. The proposed RIME-ICEEMDAN-IDBO-MHA-BiLSTM model is contrasted with ten others in ablation experiments to validate its performance. The experimental findings prove that the proposed model achieves MAPE values of 5.2%, 6.3%, 8.3%, and 5.8% across four datasets, confirming its superior predictive performance and higher accuracy. Citation: Data PubDate: 2024-12-09 DOI: 10.3390/data9120146 Issue No:Vol. 9, No. 12 (2024)
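The reported accuracy metric, MAPE, is worth stating precisely; a minimal sketch follows (the data are illustrative, and pairs with a zero actual value are simply skipped here).

```python
def mape(actual, predicted):
    """Mean absolute percentage error; pairs with a zero actual value
    are skipped to avoid division by zero."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)

print(mape([100, 200, 400], [95, 210, 380]))  # -> approximately 5.0
```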
Authors:Francisco Zorrilla Briones, Inocente Yuliana Meléndez Pastrana, Manuel Alonso Rodríguez Morachis, José Luís Anaya Carrasco First page: 147 Abstract: Experimentation is a powerful methodology for improving and optimizing processes. Nevertheless, in many real-life settings, the dynamics of production demands and other restrictions inhibit its use, because experimentation implies stopping production, generating scrap, and jeopardizing the fulfillment of demand, among other problems. Proposed here is an alternative methodology to search for the best process variable levels and optimize the response of the process without the need to stop production. The algorithm is based on the principles of the variable simplex developed by Nelder and Mead and on the continuous iterative EVOP process developed by Box, later modified into a simplex method by Spendley. It is named a parallel simplex because three independent simplexes search for the same response simultaneously. The algorithm was designed for three simplexes of two input variables each. The documented case study shows that it is efficient and effective. Citation: Data PubDate: 2024-12-10 DOI: 10.3390/data9120147 Issue No:Vol. 9, No. 12 (2024)
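The basic simplex move this methodology builds on, reflecting the worst vertex through the centroid of the remaining vertices, with three simplexes run in parallel, can be sketched as follows. This is an illustrative fixed-size reflection scheme on an assumed test function, not the authors' full algorithm.

```python
def reflect_step(simplex, f):
    """One step of a fixed-size simplex search in two variables: replace
    the worst vertex by its reflection through the centroid of the two
    better vertices, if that improves the response (here: minimizes f)."""
    simplex = sorted(simplex, key=f)                 # best, good, worst
    (bx, by), (gx, gy), (wx, wy) = simplex
    cx, cy = (bx + gx) / 2, (by + gy) / 2            # centroid of the two best
    reflected = (2 * cx - wx, 2 * cy - wy)
    if f(reflected) < f((wx, wy)):
        simplex[2] = reflected
    return simplex

# Response surface to minimize (illustrative), optimum at (3, 2).
f = lambda p: (p[0] - 3) ** 2 + (p[1] - 2) ** 2

# Three independent simplexes searching the same response in parallel.
simplexes = [[(0, 0), (1, 0), (0, 1)],
             [(6, 6), (5, 6), (6, 5)],
             [(3, -2), (4, -2), (3, -1)]]
for _ in range(40):
    simplexes = [reflect_step(s, f) for s in simplexes]

best = min((f(v), v) for s in simplexes for v in s)
print(best)  # best (response, vertex) found across the three simplexes
```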
Authors:Salomon Obahoundje, Arona Diedhiou, Alberto Troccoli, Penny Boorman, Taofic Abdel Fabrice Alabi, Sandrine Anquetin, Louise Crochemore, Wanignon Ferdinand Fassinou, Benoit Hingray, Daouda Koné, Chérif Mamadou, Fatogoma Sorho First page: 148 Abstract: To address the growing electricity demand driven by population growth and economic development while mitigating climate change, West and Central African countries are increasingly prioritizing renewable energy as part of their Nationally Determined Contributions (NDCs). This study evaluates the implications of climate change on renewable energy potential using ten downscaled and bias-adjusted CMIP6 models (CDFt method). Key climate variables—temperature, solar radiation, and wind speed—were analyzed and integrated into the Teal-WCA platform to aid in energy resource planning. Projected temperature increases of 0.5–2.7 °C (2040–2069) and 0.7–5.2 °C (2070–2099) relative to 1985–2014 underscore the need for strategies to manage the rising demand for cooling. Solar radiation reductions (~15 W/m2) may lower photovoltaic (PV) efficiency by 1–8.75%, particularly in high-emission scenarios, requiring a focus on system optimization and diversification. Conversely, wind speeds are expected to increase, especially in coastal regions, enhancing wind power potential by 12–50% across most countries and by 25–100% in coastal nations. These findings highlight the necessity of integrating climate-resilient energy policies that leverage wind energy growth while mitigating challenges posed by reduced solar radiation. By providing a nuanced understanding of the renewable energy potential under changing climatic conditions, this study offers actionable insights for sustainable energy planning in West and Central Africa. Citation: Data PubDate: 2024-12-10 DOI: 10.3390/data9120148 Issue No:Vol. 9, No. 12 (2024)
Authors:Sara Quaresima, Pasquale Nino, Concetta Cardillo, Arianna Di Paola First page: 149 Abstract: Italy is divided into 773 Agricultural Regions (ARs) based on shared physical and agronomic characteristics. These regions offer a valuable tool for analyzing various geographical, socio-economic, and environmental aspects of agriculture, including the climate. However, the ARs have lacked geospatial data, limiting their analytical potential. This study introduces the “Italian ARs Dataset”, a georeferenced shapefile defining the boundaries of each AR. This dataset facilitates geographical assessments of Italy’s complex agricultural sector. It also unlocks the potential for integrating AR data with other datasets like the Farm Accounting Data Network (FADN) dataset, in Italy represented by the Rete di Informazione Contabile Agricola (RICA), which samples hundreds of thousands of farms annually. To demonstrate the dataset’s utility, a large sample of RICA data encompassing 179 irrigated crops from 2011 to 2021, covering all of Italy, was retrieved. Validation confirmed successful assignment of all ARs present in the RICA sample to the corresponding shapefile. Additionally, to encourage the use of the ARs Dataset with gridded data, different spatial-scale resolutions were tested to identify a suitable threshold. The minimal spatial scale identified is 0.11 degrees, a scale commonly adopted by several climate datasets within the EURO-CORDEX and COPERNICUS programs. Citation: Data PubDate: 2024-12-13 DOI: 10.3390/data9120149 Issue No:Vol. 9, No. 12 (2024)
Authors:Jim Smith, Priyadarshana Ajithkumar, Emma J. Wilkinson, Atreyi Dutta, Sai Shyam Vasantharajan, Angela Yee, Gregory Gimenez, Rathan M. Subramaniam, Michael Lau, Amir D. Zarrabi, Euan J. Rodger, Aniruddha Chatterjee First page: 150 Abstract: Prostate cancer (PCa) is a major health burden worldwide, and despite early treatment, many patients present with biochemical recurrence (BCR) post-treatment, reflected by a rise in prostate-specific antigen (PSA) over a clinical threshold. Novel transcriptomic and epigenomic biomarkers can provide powerful tools for the clinical management of PCa. Here, we provide matched RNA sequencing and array-based genome-wide DNA methylome data of PCa patients (n = 17) with or without evidence of BCR following radical prostatectomy. Formalin-fixed paraffin-embedded (FFPE) tissues were used to generate these data, which included technical replicates to provide further validity of the data. We describe the sample features, experimental design, methods and bioinformatic pipelines for processing these multi-omic data. Importantly, comprehensive clinical, histopathological, and follow-up data for each patient were provided to enable the correlation of transcriptome and methylome features with clinical features. Our data will contribute towards the efforts of developing epigenomic and transcriptomic markers for BCR and also facilitate a deeper understanding of the molecular basis of PCa recurrence. Citation: Data PubDate: 2024-12-16 DOI: 10.3390/data9120150 Issue No:Vol. 9, No. 12 (2024)
Authors:Russell Miller, Harvey Whelan, Michael Chrubasik, David Whittaker, Paul Duncan, João Gregório First page: 151 Abstract: This paper presents a comprehensive exploration of data quality terminology, revealing a significant lack of standardisation in the field. The goal of this work was to conduct a comparative analysis of data quality terminology across different domains and structure it into a hierarchical data model. We propose a novel approach for aggregating disparate data quality terms used to describe the multiple facets of data quality under common umbrella terms with a focus on the ISO 25012 standard. We introduce four additional data quality dimensions: governance, usefulness, quantity, and semantics. These dimensions enhance specificity, complementing the framework established by the ISO 25012 standard, as well as contribute to a broad understanding of data quality aspects. The ISO 25012 standard, a general standard for managing data quality in information systems, offers a foundation for the development of our proposed Data Quality Data Model. This is due to the prevalent nature of digital systems across a multitude of domains. In contrast, frameworks such as ALCOA+, which were originally developed for specific regulated industries, can be applied more broadly but may not always be generalisable. Ultimately, the model we propose aggregates and classifies data quality terminology, facilitating seamless communication about data quality between different domains when collaboration is required to tackle cross-domain projects or challenges. By establishing this hierarchical model, we aim to improve understanding and implementation of data quality practices, thereby addressing critical issues in various domains. Citation: Data PubDate: 2024-12-18 DOI: 10.3390/data9120151 Issue No:Vol. 9, No. 12 (2024)
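The umbrella-term aggregation this abstract describes can be pictured as a small hierarchical mapping from dimensions to the disparate domain terms they absorb. The dimension names below mirror the four proposed additions, but the term lists and function name are illustrative placeholders, not the paper's actual Data Quality Data Model.

```python
# Hypothetical excerpt of a hierarchical data-quality model: umbrella
# dimensions (two ISO 25012-style examples plus the four dimensions the
# paper proposes) mapped to illustrative domain-specific terms.
DATA_QUALITY_MODEL = {
    "accuracy": ["correctness", "validity", "precision"],
    "completeness": ["coverage", "missingness"],
    "governance": ["ownership", "stewardship", "provenance"],
    "usefulness": ["relevance", "fitness for purpose"],
    "quantity": ["volume", "sample size"],
    "semantics": ["meaning", "interpretability"],
}

def umbrella_for(term):
    """Map a domain-specific data-quality term to its umbrella dimension,
    or return None if the term is not in the model."""
    for dimension, terms in DATA_QUALITY_MODEL.items():
        if term.lower() in terms:
            return dimension
    return None
```

A lookup of this kind is what lets two domains that say "provenance" and "stewardship" recognise that they are both talking about governance.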
Authors:Marine Cornet, Arnaud Morin, Jean-Philippe Poirot-Crouvezier, Yann Bultel First page: 152 Abstract: This work focuses on the study of operating heterogeneities on a large MEA’s active surface area in a PEMFC stack. An advanced methodology is developed, aimed at predicting local operating conditions such as temperature, relative humidity and species concentration. A physics-based pseudo-3D model developed in COMSOL Multiphysics allows for the observation of heterogeneities over the entire active surface area. Once predicted, these local operating conditions are experimentally emulated using a differential cell to provide the local polarization curves and electrochemical impedance spectra. Coupling simulation and experiment, thirty-seven local operating conditions are emulated to examine the degree of correlation between local operating conditions and PEMFC performance. Researchers and engineers can use the polarization curves and Electrochemical Impedance Spectroscopy diagrams to fit the variables of an empirical model or to validate the results of a theoretical model. Citation: Data PubDate: 2024-12-20 DOI: 10.3390/data9120152 Issue No:Vol. 9, No. 12 (2024)
Authors:Dirk Steinke, Sujeevan Ratnasingham, Jireh Agda, Hamzah Ait Boutou, Isaiah C. H. Box, Mary Boyle, Dean Chan, Corey Feng, Scott C. Lowe, Jaclyn T. A. McKeown, Joschka McLeod, Alan Sanchez, Ian Smith, Spencer Walker, Catherine Y.-Y. Wei, Paul D. N. Hebert First page: 122 Abstract: The taxonomic identification of organisms from images is an active research area within the machine learning community. Current algorithms are very effective for object recognition and discrimination, but they require extensive training datasets to generate reliable assignments. This study releases 5.6 million images with representatives from 10 arthropod classes and 26 insect orders. All images were taken using a Keyence VHX-7000 Digital Microscope system with an automatic stage to permit high-resolution (4K) microphotography. Providing phenotypic data for 324,000 species derived from 48 countries, this release represents, by far, the largest dataset of standardized arthropod images. As such, this dataset is well suited for testing the efficacy of machine learning algorithms for assigning specimens to higher taxonomic categories. Citation: Data PubDate: 2024-10-25 DOI: 10.3390/data9110122 Issue No:Vol. 9, No. 11 (2024)
Authors:Sreten Jevremović, Carol Kachadoorian, Filip Arnaut, Aleksandra Kolarski, Vladimir A. Srećković First page: 123 Abstract: Cycling is a sustainable and healthy form of transportation that is gradually becoming the primary means of transportation over shorter distances in many countries. This paper describes the dataset used to determine the cycling characteristics of seniors in the USA and Canada. For these purposes, a specially created questionnaire was used in a survey conducted from August 2021 to July 2022. The questionnaire contained sections related to the general socio-demographic characteristics of the respondents, general characteristics of cycling (type of bicycle, cycle time, mileage, etc.), and specific characteristics of cycling (riding in night conditions, termination of cycling, motivating and demotivating factors for cycling, etc.). The total sample consisted of 5096 respondents (50+ years old). This database is particularly significant because it represents the first set of publicly available data related to the cycling characteristics of older adults. The database can be used by various researchers dealing with this topic, but also by the decision-makers who want to design a sustainable and accessible cycling infrastructure, respecting the requirements of this category of users. Finally, this dataset can serve as an adequate basis in the process of determining the specificities and understanding the needs of older cyclists in traffic. Citation: Data PubDate: 2024-10-25 DOI: 10.3390/data9110123 Issue No:Vol. 9, No. 11 (2024)
Authors:Aleksandar Kondinski, Nadiia Gumerova, Annette Rompel First page: 124 Abstract: Reticular and cluster materials often feature complex formulas, making a comprehensive overview challenging due to the need to consult various resources. While datasets have been collected for metal-organic frameworks (MOFs), covalent organic frameworks (COFs), and zeolites, among others, there remains a gap in systematically organized information for polyoxometalates. This paper introduces a carefully curated dataset of 1984 polyoxometalate (POM) and related cluster metal oxide formula instances, currently connecting over 2500 POM material instances. These POM instances incorporate 75 different chemical elements, with compositions ranging from binary to octonary element clusters. This dataset not only enhances accessibility to polyoxometalate data but also aims to facilitate further research and development in the study of these complex inorganic compounds. Citation: Data PubDate: 2024-10-29 DOI: 10.3390/data9110124 Issue No:Vol. 9, No. 11 (2024)
Authors:Gerda Viira, Maarten Marx First page: 125 Abstract: In the Netherlands, the Open Government Act (Wet openbare overheid or Woo/Wob in Dutch) is in effect, with the primary objective of ensuring a more transparent government. In line with the legislation, a search engine named Woogle has been designed and developed to centralize documents published under the Open Government Act. The Estonian Public Information Act serves a similar purpose and requires all public institutions to publish information generated during official duties, fostering transparency and public oversight. Currently, Estonia’s document repositories are decentralized, and content search is not supported, which hinders people’s ability to efficiently locate information. This study aims to assess public information accessibility in Estonia and to apply Woogle’s design and techniques to Estonia’s document repositories, thereby evaluating its potential for broader European implementation. The methodology involved web scraping data and documents from 57 Estonian public institutions’ document repositories. The results indicate that Woogle’s design and techniques can be implemented in Estonia. From a technical perspective, the alignment of the fields was successful, while it was found that content-wise, the Estonian data present challenges due to inconsistencies and lack of comprehensive categorization. The findings suggest potential scalability across European countries, pointing to a broader applicability of the Woogle model for creating a corpus of Freedom of Information Act documents in Europe. The collected data are available as a dataset. Citation: Data PubDate: 2024-10-29 DOI: 10.3390/data9110125 Issue No:Vol. 9, No. 11 (2024)
Authors:Alina A. Corcoran, Marcela Saracco Alvarez, Taryn Cornell, Isidora Echenique-Subiabre, Julia Gerber, Stephanie Getto, Ahlem Jebali, Heather Martinez, Jakob O. Nalley, Charles J. O’Kelly, Aidan Ryan, Jonathan B. Shurin, Shawn R. Starkenburg First page: 126 Abstract: The project “Optimizing Selection Pressures and Pest Management to Maximize Cultivation Yield” (OSPREY, award #DE-EE08902) was undertaken to enhance the annual productivity, stability, and quality of algal production strains for biofuels and bioproducts. The foundation of this project was the year-round cultivation of a Nannochloropsis strain across three outdoor systems in California, Hawaii, and New Mexico. We aimed to leverage environmental selection pressures to drive strain improvement and use metagenomic techniques to inform pest management tools. The resulting dataset includes environmental and biological parameters from these cultivation campaigns, captured in a single CSV file. This dataset aims to serve a wide range of end users, from biologists to algal farmers, addressing the scarcity of publicly available data on algae cultivation. Further data releases will include 16S rRNA amplicon sequencing and shotgun sequencing datasets. Citation: Data PubDate: 2024-10-29 DOI: 10.3390/data9110126 Issue No:Vol. 9, No. 11 (2024)
Authors:Paolo Maria Congedo, Cristina Baglivo, Delia D’Agostino, Paola Maria Albanese First page: 127 Abstract: Building energy regulations are essential for reducing energy consumption in the European Union (EU) and achieving climate neutrality goals. This data article supplements the “Overview of EU Building Envelope Energy Requirement for Climate Neutrality” by presenting a detailed dataset on building regulations across all 27 EU member states, with a focus on building envelope efficiency. The data include thermal transmittance limits for windows, walls, floors, and roofs, offering insights into regulatory differences and potential opportunities for harmonization. Information was sourced from the Energy Performance of Buildings Directive (EPBD) database, national reports, and scientific literature to ensure comprehensive coverage. Key aspects of each country’s regulations are summarized in tables, covering both new constructions and renovations. The inclusion of Köppen–Geiger climate classifications allows for climate-specific analyses, providing valuable context for researchers, policymakers, and construction professionals. This dataset enables comparative studies, helping to identify best practices and inform policy interventions aimed at enhancing energy efficiency across Europe. It also supports the development of tailored strategies to improve building performance in different environmental conditions, ultimately contributing to the EU’s energy and climate targets. Citation: Data PubDate: 2024-10-31 DOI: 10.3390/data9110127 Issue No:Vol. 9, No. 11 (2024)
Authors:Thomas Reiser, Jens Dörpinghaus, Petra Steiner, Michael Tiemann First page: 128 Abstract: The digitization of historical documents has gained particular interest in recent years in the digital humanities. The goal is to digitize historical documents by extracting and structuring text from scanned images. Here, we focus on the processing of historical German VET (vocational education and training) and CVET (continuing vocational education and training) regulations to support educational research. This dataset contains data from 1908 to the present and includes 2125 documents as PDF, 983 fully converted XML documents, and additional metadata for 7090 documents from the archive. We present an overview of the historical background and the challenges of processing different historical documents from three different federal states. Citation: Data PubDate: 2024-11-03 DOI: 10.3390/data9110128 Issue No:Vol. 9, No. 11 (2024)
Authors:Shiva Zargar, Miyuru Kannangara, Giovanna Gonzales-Calienes, Jianjun Yang, Jalil Shadbahr, Cyrille Decès-Petit, Farid Bensebaa First page: 129 Abstract: Life cycle assessment, which evaluates the complete life cycle of a product, is considered the standard methodological framework to evaluate the environmental performance of climate change solutions. However, significant challenges exist related to datasets used to quantify these environmental indicators. Although extensive research and commercial data on climate change technologies, pathways, and facilities exist, they are not readily available to practitioners of life cycle assessment in the right format and structure using an open platform. In this study, we propose a new open data hub platform for life cycle assessment, considering a hierarchical data flow starting with raw data collected on climate change technologies at laboratory, pilot, demonstration, or commercial scales to provide the information required for policy and decision-making. This platform makes data accessible at multiple levels for practitioners of life cycle assessment, while making data interoperable across platforms. The proposed data hub platform and workflow are explained using polymer electrolyte membrane electrolysis for hydrogen production as a case study. A climate change impact of 1.17 ± 0.03 kg CO2 eq./kg H2 was calculated for the case study. The current data hub platform is limited to evaluating environmental impacts; however, future additions of economic and social aspects are envisaged. Citation: Data PubDate: 2024-11-05 DOI: 10.3390/data9110129 Issue No:Vol. 9, No. 11 (2024)
Authors:Caroline Marc, Bertrand Marcon, Louis Denaud, Stéphane Girardon First page: 130 Abstract: Wood density measurement plays a crucial role in assessing wood quality and predicting its mechanical performance. This dataset was collected to compare the accuracy and reliability of two non-destructive techniques, X-rays and terahertz waves, for measuring wood density. While X-rays have been commonly used in the industry due to their effectiveness, they pose health risks due to ionizing radiation. Terahertz waves, on the other hand, are non-ionizing and offer high spatial resolution. This article presents a database of wood sample measurements obtained using both techniques on the same 110 samples, with precise localization of the measuring points, covering a wide range of wood species (tropical and temperate) and densities, from 111 kg·m⁻³ to 1086 kg·m⁻³. The database includes X-ray and terahertz scans, sample dimensions, moisture content, and color photographs. Citation: Data PubDate: 2024-11-05 DOI: 10.3390/data9110130 Issue No:Vol. 9, No. 11 (2024)
Authors:Paola M. Ortiz-Grisales, Leidy Gutiérrez-León, Carlos D. Zuluaga-Ríos First page: 131 Abstract: Cities globally must make urgent decisions to ensure a sustainable future as rising pollution, particularly PM2.5, poses severe health risks like respiratory and heart diseases. PM2.5’s harmful composition also impacts vegetation and the environment. Immediate government intervention is necessary to mitigate these effects. This study tackles the urgent problem of reducing PM2.5 levels in Medellín’s urban and indoor environments, where pollution presents serious health risks. To explore effective solutions, this research provides new data on the interaction between particulate matter from various pollutants and negative ions under different temperature conditions, offering valuable insights into air quality improvement strategies. Using a high-voltage system, ions bind to pollutants, accelerating their removal. Experiments measured temperature, humidity, formaldehyde, volatile organic compounds, negative ions, and PM2.5 in a 40 cm³ chamber across various conditions. Pollutants tested included cigarette smoke, incense, charcoal, and gasoline at two voltage levels and three temperature ranges. The data, available in CSV format, were based on 36,000 samples and repeated tests for reliability. This resource is designed to support studies investigating particulate matter control in urban and indoor environments, as well as to improve our understanding of negative ion-based air purification processes. The data are publicly available and structured in formats compatible with leading data analysis platforms. Citation: Data PubDate: 2024-11-08 DOI: 10.3390/data9110131 Issue No:Vol. 9, No. 11 (2024)
Authors:Believe Ayodele, Victor Buttigieg First page: 132 Abstract: Virtualisation has received widespread adoption and deployment across a wide range of enterprises and industries throughout the years. Network Function Virtualisation (NFV) is a technical concept that presents a method for dynamically delivering network functions as virtualised or software components. Virtualised Network Functions (VNFs) have distinct advantages, but they also face serious security challenges. Cyberattacks such as Denial of Service (DoS), malware/rootkit injection, port scanning, and so on can target VNF appliances just like any other network infrastructure. To train machine or deep learning (ML/DL) models to combat cyberattacks in VNF, a suitable dataset (VNFCYBERDATA) is required that reflects, or comes reasonably close to reflecting, the actual problem the ML/DL model is meant to address. This article describes a real VNF dataset that contains over seven million data points and twenty-five cyberattacks generated from five VNF appliances. To facilitate a realistic examination of VNF traffic, the dataset includes both benign and malicious traffic. Citation: Data PubDate: 2024-11-08 DOI: 10.3390/data9110132 Issue No:Vol. 9, No. 11 (2024)
Authors:Teresa M. Esman, Alexa J. Halford, Jeff Klenzing, Angeline G. Burrell First page: 133 Abstract: The Space Physics Data Facility (SPDF) is a digital archive of space physics data and is useful for the storage, analysis, and dissemination of data. We discuss the process used to create an amended dataset and store it on the SPDF. The operational software that generates the archival data uses the open-source Python package pysat, and an end-user module has been added to the pysatNASA package. The result is the addition of data products to the Mars Global Surveyor (MGS) magnetometer (MAG) dataset, its archival location on SPDF, and pysat compatibility. The primary data and metadata formats increase the convenience and efficiency for users of the MGS MAG data. The storage of planetary and heliophysics data in one location supports the use of data throughout the solar system for comparison, while pysat compatibility enables loading data in an identical format for ease of processing. We encourage the use of the outlined process for past, present, and future space science missions of all sizes and funding levels, from balloon experiments to Flagship-class missions. Citation: Data PubDate: 2024-11-08 DOI: 10.3390/data9110133 Issue No:Vol. 9, No. 11 (2024)
Authors:Mamtimin Qasim, Wushour Silamu, Minghui Qiu First page: 134 Abstract: Script identification (SI) is easier to implement than language identification, and its identification rate is very high. Moreover, the fewer languages a language identification algorithm must distinguish, the higher its identification rate. However, no systematic study has examined SI across multiple languages or determined how to construct the corresponding language identification datasets. Therefore, in this paper, we discuss and design a script identification algorithm and the construction of a language identification dataset based on script groups. The data sources in this paper comprise text corpora for 261 different languages from the Leipzig Corpora Collection, which are grouped into 23 different script groups. In the Unicode encoding scheme, different scripts are arranged into different code regions. Based on this feature, we propose a written-script identification algorithm based on regular expression matching, the micro F-score of which reaches 0.9929 in sentence-level script identification experiments. To reduce noise when constructing the language identification dataset for each script, the script identification algorithm is used to filter out other-script content in each text. Citation: Data PubDate: 2024-11-11 DOI: 10.3390/data9110134 Issue No:Vol. 9, No. 11 (2024)
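Because Unicode allocates each script its own code regions, script membership can be tested with character-range regular expressions, which is the feature this abstract exploits. The sketch below illustrates the idea with a handful of assumed ranges and a majority vote; it is not the paper's 23-group algorithm, and the table entries are illustrative.

```python
import re

# Illustrative (not the authors') code-range table: each script's
# characters fall into known Unicode blocks, so membership can be
# tested with character ranges in a regular expression.
SCRIPT_RANGES = {
    "Latin": re.compile(r"[A-Za-z\u00C0-\u024F]"),
    "Cyrillic": re.compile(r"[\u0400-\u04FF]"),
    "Arabic": re.compile(r"[\u0600-\u06FF]"),
    "Han": re.compile(r"[\u4E00-\u9FFF]"),
}

def identify_script(sentence):
    """Return the script whose characters dominate the sentence,
    or None if no known script character is present."""
    counts = {name: len(rx.findall(sentence))
              for name, rx in SCRIPT_RANGES.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

The same per-character matching supports the filtering step the abstract mentions: characters outside a corpus's expected script ranges can be stripped before the language identification dataset is assembled.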
Authors:Rong Zhang, Qi Zhang, Conghe Song, Li An First page: 135 Abstract: Green initiatives are popular mechanisms globally to enhance environmental and human wellbeing. However, multiple green initiatives, when overlapping geographically and targeting the same participants, may interact with each other, giving rise to what is termed “spillover effects”, where one initiative and its outcomes influence another. This study examines the spillover effects among four major concurrent initiatives in the United States (U.S.) and China using a comprehensive dataset. In the U.S., we analysed county-level data in 2018 for the Conservation Reserve Program (CRP) and the Environmental Quality Incentives Program (EQIP), both operational for over 25 years. In China, data from Fanjingshan and Tianma National Nature Reserves (2014–2015) were used to evaluate the Grain-to-Green Program (GTGP) and the Forest Ecological Benefit Compensation (FEBC) program. The dataset comprises 3106 records for the U.S. and 711 plots for China, including several socio-economic variables. The results of multivariate linear regression indicate that there exist significant spillover effects between CRP & EQIP and GTGP & FEBC, with one initiative potentially enhancing or offsetting another’s impacts by 22% to 100%. This dataset provides valuable insights for researchers and policymakers to optimize the effectiveness and resilience of concurrent green initiatives. Citation: Data PubDate: 2024-11-13 DOI: 10.3390/data9110135 Issue No:Vol. 9, No. 11 (2024)
Authors:Ludovica De Gregorio, Giovanni Cuozzo, Riccardo Barella, Francisco Corvalán, Felix Greifeneder, Peter Grosse, Abraham Mejia-Aguilar, Georg Niedrist, Valentina Premier, Paul Schattan, Alessandro Zandonai, Claudia Notarnicola First page: 136 Abstract: In this work, we present two datasets for specific areas located on the Alpine arc that can be exploited to monitor and understand water resource dynamics in mountain regions. The idea is to provide the reader with information about the different sources of water supply over five defined test areas over the South Tyrol (Italy) and Tyrol (Austria) areas in alpine environments. The snow cover fraction (SCF) and Soil Moisture Content (SMC) datasets are derived from machine learning algorithms based on remote sensing data. Both SCF and SMC products are characterized by a spatial resolution of 20 m and are provided for the period from October 2020 to May 2023 (SCF) and from October 2019 to September 2022 (SMC), respectively, covering winter seasons for SCF and spring–summer seasons for SMC. For SCF maps, the validation with very high-resolution images shows high correlation coefficients of around 0.9. The SMC products were originally produced with an algorithm validated at a global scale, but here, to obtain more insights into the specific alpine mountain environment, the values estimated from the maps are compared with ground measurements of automatic stations located at different altitudes and characterized by different aspects in the Val Mazia catchment in South Tyrol (Italy). In this case, an MAE between 0.05 and 0.08 and an unbiased RMSE between 0.05 and 0.09 m³·m⁻³ were achieved. The datasets presented can be used as input for hydrological models and to hydrologically characterize the alpine study area starting from different sources of information. Citation: Data PubDate: 2024-11-16 DOI: 10.3390/data9110136 Issue No:Vol. 9, No. 11 (2024)
Authors:Mailen Ortega-Cuadros, Laurine Chir, Sophie Aligon, Nubia Velasquez, Tatiana Arias, Jerome Verdier, Philippe Grappin First page: 137 Abstract: Alternaria brassicicola is a seed-borne pathogen that causes black spot disease in Brassica crops, yet the seed defense mechanisms against this fungus remain poorly understood. Building upon recent reports that highlighted the involvement of indole pathways in seeds infected by Alternaria, this study provides transcriptomic resources to further elucidate the role of these metabolic pathways during the interaction between seeds and fungal pathogens. Using RNA sequencing, we examined the gene expression of glucosinolate-deficient mutant lines (cyp79B2/cyp79B3 and qko) and a camalexin-deficient line (pad3), generating a dataset from 14 samples. These samples were inoculated with Alternaria or water, and collected at 3, 6, and 10 days after sowing to extract total RNA. Sequencing was performed using DNBseq™ technology, followed by bioinformatics analyses with tools such as FastQC (version 0.11.9), multiQC (version 1.13), Venny (version 2.0), Salmon software (version 0.14.1), and R packages DESeq2 (version 1.36.0), ClusterProfiler (version 4.12.6) and ggplot2 (version 3.4.0). By providing this valuable dataset, we aim to contribute to a deeper understanding of seed defense mechanisms against Alternaria, leveraging RNA-seq for various analyses, including differential gene expression and co-expression correlation. This work serves as a foundation for a more comprehensive grasp of the interactions during seed infection and highlights potential targets for enhancing crop protection and management. Citation: Data PubDate: 2024-11-18 DOI: 10.3390/data9110137 Issue No:Vol. 9, No. 11 (2024)
Authors:Bernhard Zagel, Hans Wiesenegger, Robert R. Junker, Gerhard Ehgartner First page: 110 Abstract: This article provides a comprehensive overview of all currently available datasets of the Long-term Ecosystem Research (LTER) site Oberes Stubachtal. The site is located in the Hohe Tauern mountain range (Eastern Alps, Austria) and includes both protected areas (Hohe Tauern National Park) and unprotected areas (Stubach valley). While the main research focus of the site is on high mountains, glaciology, glacial hydrology, and biodiversity, the eLTER Whole-System Approach (WAILS) was used for data selection. This approach involves a systematic screening of all available data to assess their suitability as eLTER Standard Observations (SOs). This includes the geosphere, atmosphere, hydrosphere, biosphere, and sociosphere. These SOs are fundamental to the development of a comprehensive long-term ecosystem research framework. In total, more than 40 datasets have been collated for the LTER site Oberes Stubachtal and included in the Dynamic Ecological Information Management System—Site and Data Registry (DEIMS-SDR), the eLTER’s data platform. This paper provides a detailed inventory of the datasets and their primary attributes, evaluates them against the WAILS-required observation data, and offers insights into strategies for future initiatives. All datasets are made available through dedicated repositories for FAIR (findable, accessible, interoperable, reusable) use. Citation: Data PubDate: 2024-09-25 DOI: 10.3390/data9100110 Issue No:Vol. 9, No. 10 (2024)
Authors:Shuangmei Tian, Ziyu Zhao, Beibei Ren, Degeng Wang First page: 111 Abstract: MicroRNAs (miRNA) exert regulatory actions via base pairing with their binding sites on target mRNAs. Cooperative binding, i.e., synergism, among binding sites on an mRNA is biochemically well characterized. We studied whether this synergism is reflected in the global relationship between miRNA-mediated regulatory activity and miRNA binding site count on the target mRNAs, i.e., leading to a non-linear relationship between the two. Recently, using our own and public datasets, we have enquired into miRNA regulatory actions: first, we analyzed the power-law distribution pattern of miRNA binding sites; second, we found that, strikingly, mRNAs for core miRNA regulatory apparatus proteins have extraordinarily high binding site counts, forming self-feedback-control loops; third, we revealed that tumor suppressor mRNAs generally have more sites than oncogene mRNAs; and fourth, we characterized enrichment of miRNA-targeted mRNAs in translationally less active polysomes relative to more active polysomes. In these four studies, we qualitatively observed obvious positive correlation between the extent to which an mRNA is miRNA-regulated and its binding site count. This paper summarizes the datasets used. We also quantitatively analyzed the correlation by comparative linear and non-linear regression analyses. Non-linear relationships, i.e., accelerating rise of regulatory activity as binding site count increases, fit the data much better, conceivably a transcriptome-level reflection of cooperative binding among miRNA binding sites on a target mRNA. This observation is potentially a guide for integrative quantitative modeling of the miRNA regulatory system. Citation: Data PubDate: 2024-09-25 DOI: 10.3390/data9100111 Issue No:Vol. 9, No. 10 (2024)
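The comparative linear vs. non-linear regression this abstract mentions can be illustrated with ordinary least-squares polynomial fits on synthetic data (not the study's dataset): when regulatory activity rises at an accelerating rate with binding-site count, a quadratic model leaves a far smaller residual than a linear one. The function names and degree choice below are assumptions made for the sketch.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations
    (V^T V) c = V^T y, solved with Gaussian elimination."""
    n = degree + 1
    # Normal-equation matrix A[i][j] = sum x^(i+j) and RHS b[i] = sum y*x^i.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution yields coefficients c[0] + c[1]x + ... + c[deg]x^deg.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        coeffs[i] = (b[i] - sum(A[i][j] * coeffs[j]
                                for j in range(i + 1, n))) / A[i][i]
    return coeffs

def sse(xs, ys, coeffs):
    """Residual sum of squares of a fitted polynomial."""
    return sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
               for x, y in zip(xs, ys))
```

On accelerating data such as ys = [x² for x in 0..5], `sse` for the degree-2 fit is near zero while the degree-1 fit retains a large residual, which is the shape of evidence the authors use to argue for cooperative (synergistic) binding.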
Authors:Carlos Hernández-Nava, Miguel-Félix Mata-Rivera, Sergio Flores-Hernández First page: 112 Abstract: The increasing prevalence of diabetes worldwide, including in Mexico, presents significant challenges to healthcare systems. This has a notable impact on hospital admissions, as diabetes is considered an ambulatory care-sensitive condition, meaning that hospitalizations could be avoided. This is just one example of many challenges faced in the medical and public health fields. Traditional healthcare methods have been effective in managing diabetes and preventing complications. However, they often encounter limitations when it comes to analyzing large amounts of health data to effectively identify and address diseases. This paper aims to bridge this gap by outlining a comprehensive methodology for non-physicians, particularly data scientists, working in healthcare. As a case study, this paper utilizes hospital diabetes discharge records from 2010 to 2023, totaling 36,665,793 records from medical units under the Ministry of Health of Mexico. We aim to highlight the importance for data scientists to understand the problem and its implications. By doing so, insights can be generated to inform policy decisions and reduce the burden of avoidable hospitalizations. The approach primarily relies on stratification and standardization to uncover rates based on sex and age groups. This study provides a foundation for data scientists to approach health data in a new way. Citation: Data PubDate: 2024-09-27 DOI: 10.3390/data9100112 Issue No:Vol. 9, No. 10 (2024)
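The stratification-and-rate workflow this methodology relies on can be sketched with pandas; the records, strata, and population figures below are hypothetical and do not reflect the actual Ministry of Health schema:

```python
import pandas as pd

# Hypothetical discharge records (column names are assumptions).
records = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F", "M"],
    "age_group": ["40-59", "60+", "40-59", "60+", "60+", "60+"],
})
# Hypothetical mid-year population per stratum, used as rate denominators.
population = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "age_group": ["40-59", "60+", "40-59", "60+"],
    "pop": [50_000, 30_000, 48_000, 27_000],
})

# Stratification: count discharges per sex/age stratum, then convert the
# counts to rates per 100,000 population.
counts = records.groupby(["sex", "age_group"]).size().rename("cases").reset_index()
rates = counts.merge(population, on=["sex", "age_group"])
rates["rate_per_100k"] = rates["cases"] / rates["pop"] * 100_000
```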
Authors:Joylan Nunes Maciel, Jorge Javier Gimenez Ledesma, Oswaldo Hideo Ando Junior First page: 113 Abstract: Prediction of solar irradiance is crucial for photovoltaic energy generation, as it helps mitigate intermittencies caused by atmospheric fluctuations such as clouds, wind, and temperature. Numerous studies have applied machine learning and deep learning techniques from artificial intelligence to address this challenge. Based on the recently proposed Hybrid Prediction Method (HPM), this paper presents an original and comprehensive dataset of nine attributes extracted from all-sky images using image processing techniques. This dataset and the analysis of its attributes offer new avenues for research into solar irradiance forecasting. To ensure reproducibility, the data processing workflow and the standardized dataset have been meticulously detailed and made available to the scientific community to promote further research into prediction methods for photovoltaic energy generation. Citation: Data PubDate: 2024-09-29 DOI: 10.3390/data9100113 Issue No:Vol. 9, No. 10 (2024)
Authors:Francesco Gagliardi, Michele Dei First page: 114 Abstract: This study introduces a collaborative and open dataset designed to classify operational transconductance amplifiers (OTAs) in switched-capacitor applications. The dataset comprises a diverse collection of OTA designs sourced from the literature, facilitating benchmarking, analysis and innovation in analog and mixed-signal integrated circuit design. Various evaluation methodologies, implemented through a companion Python notebook script, are discussed to assess OTA performances across different operating conditions and specifications. Several Figures of Merit (FoMs) are utilized as performance metrics to achieve significant performance classification. This study also uncovers intriguing behaviors and correlations among FoMs, providing valuable insights into OTA design considerations. By making the dataset openly available on platforms like GitHub, this work encourages collaboration and knowledge sharing within the integrated circuit design community, thereby enhancing transparency, reproducibility and innovation in OTA design research. Citation: Data PubDate: 2024-10-03 DOI: 10.3390/data9100114 Issue No:Vol. 9, No. 10 (2024)
Authors:Iván Juan Carlos Pérez-Olguín, Consuelo Catalina Fernández-Gaxiola, Luis Alberto Rodríguez-Picón, Luis Carlos Méndez-González First page: 115 Abstract: This research explores the torque–angle behavior of M2/M3 screws in automotive applications, focusing on ensuring component reliability and manufacturing precision within the recommended assembly specification limits. M2/M3 screws, often used in tight spaces, are susceptible to issues like stripped threads and inconsistent torque, which can compromise safety and performance. The study’s primary objective is to develop a comprehensive dataset of torque–angle measurements for these screws, facilitating the analysis of key parameters such as torque-to-seat, torque-to-fail, and process windows. By applying Gaussian curve fitting and Gaussian process regression, the research models and simulates torque behavior to clarify torque dynamics in small fasteners, and it highlights the potential of statistical methods in torque analysis, offering insights for improving manufacturing practices. As a result, it can be concluded that the proposed stochastic methodologies improve the fail-to-seat ratio, allow inference, reduce the sample size needed in incoming test studies, and minimize the number of destructive test samples required. Citation: Data PubDate: 2024-10-06 DOI: 10.3390/data9100115 Issue No:Vol. 9, No. 10 (2024)
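Gaussian curve fitting of a torque–angle trace, as named in this abstract, can be sketched with SciPy; the trace below is synthetic and every parameter value is an assumption, not a measurement from the dataset:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(angle, amplitude, center, width):
    """Gaussian model for a torque-vs-angle curve."""
    return amplitude * np.exp(-((angle - center) ** 2) / (2 * width ** 2))

# Synthetic torque-angle trace (degrees, N*cm) standing in for an M2 screw run.
rng = np.random.default_rng(0)
angle = np.linspace(0, 720, 200)
torque = gaussian(angle, 12.0, 540.0, 80.0) + rng.normal(0, 0.05, angle.size)

# Fit the Gaussian; p0 is a rough initial guess for (amplitude, center, width).
params, _ = curve_fit(gaussian, angle, torque, p0=[10.0, 500.0, 100.0])
amplitude_fit, center_fit, width_fit = params
```

The fitted parameters recover the curve's peak torque, peak angle, and spread, which is the kind of summary that feeds the process-window analysis the authors describe.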
Authors:Angelica Lermann Henestrosa, Joachim Kimmerle First page: 116 Abstract: With the release of ChatGPT, text-generating AI became accessible to the general public virtually overnight, and automated text generation (ATG) became the focus of public debate. Previously, however, little attention had been paid to this area of AI, resulting in a gap in the research on people’s attitudes and perceptions of this technology. Therefore, two representative surveys among the German population were conducted before (March 2022) and after (July 2023) the release of ChatGPT to investigate people’s attitudes, concepts, and knowledge on ATG in detail. This data descriptor depicts the structure of the two datasets, the measures collected, and potential analysis approaches beyond the existing research paper. Other researchers are encouraged to take up these datasets and explore them further as suggested or as they deem appropriate. Citation: Data PubDate: 2024-10-11 DOI: 10.3390/data9100116 Issue No:Vol. 9, No. 10 (2024)
Authors:Christian Vidal-Cabo, Enrique Alfonso Sánchez-Pérez, Antonia Ferrer-Sapena First page: 117 Abstract: Introduction. Open Government is a form of public policy based on the pillars of collaboration and citizen participation, transparency and the right of access to public information. With the help of information and communication technologies, governments and administrations carry out open data initiatives, making reusable datasets available to all citizens. The academic community, highly qualified personnel, can become potential reusers of this data, which would lead to its use for scientific research, generating knowledge, and for teaching, improving the training of university students and promoting the reuse of open data in the future. Method. This study was developed using a quantitative research methodology (survey), which was distributed by email in one context block and six technical blocks, with a total of 30 questions. The data collection period was between 15 March and 10 May 2021. Analysis. The data obtained through this quantitative methodology were processed, normalised, and analysed. Results. A total of 783 responses were obtained, from 34 Spanish provinces. The researchers come from 47 Spanish universities and 21 research centres, and 19 research areas of the State Research Agency are represented. In addition, a platform was developed with the data for the purpose of visualising the results of the survey. Conclusions. The sample thus obtained is representative and the conclusions can be extrapolated to the rest of the Spanish university teaching staff. In terms of gender, the study is balanced between men and women (41.76% W vs. 56.58% M). In general, researchers responding to the survey know what open data is (79.31%), but only 50.57% reuse open data. The main conclusion is that open government data prove to be useful sources of information for science, especially in areas such as Social Sciences, Industrial Production, Engineering and Engineering for Society, Information and Communication Technologies, Economics and Environmental Sciences. Citation: Data PubDate: 2024-10-11 DOI: 10.3390/data9100117 Issue No:Vol. 9, No. 10 (2024)
Authors:Andrea C. O’Neill, Kees Nederhoff, Li H. Erikson, Jennifer A. Thomas, Patrick L. Barnard First page: 118 Abstract: Here, we describe a dataset of two-dimensional (2D) XBeach model files that were developed for the Coastal Storm Modeling System (CoSMoS) in northern California as an update to an earlier CoSMoS implementation that relied on one-dimensional (1D) modeling methods. We provide details on the data and their application, such that they might be useful to end-users for other coastal studies. Modeling methods and outputs are presented for Humboldt Bay, California, in which we compare output from a nested 1D modeling approach to 2D model results, demonstrating that the 2D method, while more computationally expensive, results in a more cohesive and directly mappable flood hazard result. Citation: Data PubDate: 2024-10-11 DOI: 10.3390/data9100118 Issue No:Vol. 9, No. 10 (2024)
Authors:Roman Banakh, Elena Nyemkova, Connie Justice, Andrian Piskozub, Yuriy Lakh First page: 119 Abstract: Cyber security solutions for wireless networks offering open internet access have become critically important for personal data security. The newest WPA3 network security protocol is used to maximize this protection; however, attackers can mount an Evil Twin attack to replace a legitimate access point. The article is devoted to solving the problem of intrusion detection at the physical layer of the OSI model. To this end, a hardware–software complex has been developed to collect information about the signal strength of Wi-Fi access points using wireless sensor networks. The collected data were supplemented with a generative algorithm considering all possible combinations of signal strength. A k-nearest neighbor model was trained on the resulting data to distinguish the signal strength of legitimate access points from illegitimate ones. To verify the approach, an Evil Twin attack was physically simulated, and the machine learning model analyzed the data from the sensors. As a result, the Evil Twin attack was successfully identified based on signal strength in the radio spectrum. The proposed model can be used in open access points as well as in large corporate and home Wi-Fi networks to detect intrusions aimed at substituting devices in the radio spectrum where IEEE 802.11 networking equipment operates. Citation: Data PubDate: 2024-10-14 DOI: 10.3390/data9100119 Issue No:Vol. 9, No. 10 (2024)
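A minimal sketch of the k-nearest neighbor step, assuming three fixed sensors reporting RSSI in dBm; the sensor layout, feature set, and readings are illustrative assumptions, since the abstract does not specify them:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical RSSI fingerprints (dBm) from three sensors.
legit = np.array([[-40, -55, -62], [-42, -54, -61], [-41, -56, -63],
                  [-39, -55, -60], [-43, -53, -62]])
evil = np.array([[-70, -45, -50], [-72, -44, -52], [-69, -46, -51],
                 [-71, -43, -49], [-70, -45, -53]])
X = np.vstack([legit, evil])
y = np.array([0] * len(legit) + [1] * len(evil))  # 0 = legitimate, 1 = Evil Twin

# k-NN separates the two signal-strength fingerprints in RSSI space.
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)
flag = model.predict([[-70, -44, -51]])[0]  # a reading near the rogue profile
```

Because an Evil Twin transmits from a different physical location, its RSSI pattern across the sensors differs from the legitimate access point's pattern, which is what makes a simple distance-based classifier viable here.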
Authors:Mariza P. Oliveira-Roza, Roberto A. Cecílio, David B. S. Teixeira, Michel C. Moreira, André Q. Almeida, Alexandre C. Xavier, Sidney S. Zanetti First page: 120 Abstract: Rainfall erosivity (RE) represents the potential of rainfall to cause soil erosion, and understanding its impact is essential for the adoption of soil and water conservation practices. Although several studies have estimated RE for Brazil, currently, no single reliable and easily accessible database exists for the country. To fill this gap, this work aimed to review the research and generate a rainfall erosivity database for Brazil. Data were gathered from studies that determined rainfall erosivity from observed rainfall records and synthetic rainfall series. Monthly and annual rainfall erosivity values were organized on a spreadsheet and in the shapefile format. In total, 54 studies from 1990 to 2023 were analyzed, resulting in the compilation of 5516 erosivity values for Brazil, of which 6.3% were pluviographic, and 93.7% were synthetic. The regions with the highest availability of information were the Northeast (35.6%), Southeast (30.1%), South (19.9%), Central-West (7.7%), and North (6.7%). The database, which can be accessed on the Mendeley Data platform, can aid professionals and researchers in adopting public policies and carrying out studies aimed at environmental conservation and basin management. Citation: Data PubDate: 2024-10-14 DOI: 10.3390/data9100120 Issue No:Vol. 9, No. 10 (2024)
Authors:Simona Colucci, Francesco Maria Donini, Eugenio Di Sciascio First page: 121 Abstract: Clustering is a very common means of analyzing the data present in large datasets, with the aims of understanding and summarizing the data and discovering similarities, among other goals. However, despite the present success of subsymbolic methods for data clustering, a description of the obtained clusters cannot rely on the intricacies of the subsymbolic processing. For clusters of data expressed in the Resource Description Framework (RDF), we extend and implement an optimized version of a previously proposed logic-based methodology that computes a structure, called a Common Subsumer, describing the commonalities among all resources. We tested our implementation with two open, and very different, datasets: one devoted to public procurement, the other to drugs in pharmacology. For both datasets, we were able to provide reasonably concise and readable descriptions of clusters with up to 1800 resources. Our analysis shows the viability of our methodology and computation, and paves the way for general cluster explanations to be provided to lay users. Citation: Data PubDate: 2024-10-20 DOI: 10.3390/data9100121 Issue No:Vol. 9, No. 10 (2024)
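As a toy illustration only: if RDF resources are flattened to sets of property–value pairs, the shared description reduces to a set intersection. The real methodology computes Common Subsumers over RDF graphs with logic-based reasoning, which this sketch does not capture; all names below are invented:

```python
from functools import reduce

# Toy RDF-like resources, each flattened to a set of (property, value) pairs.
resources = [
    {("type", "Contract"), ("country", "IT"), ("sector", "health")},
    {("type", "Contract"), ("country", "IT"), ("sector", "transport")},
    {("type", "Contract"), ("country", "IT"), ("year", "2020")},
]

# The naive "common description" of the cluster: pairs shared by every resource.
common = reduce(set.intersection, resources)
```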
Authors:Achille Felicetti, Franco Niccolucci First page: 1 Abstract: This study builds upon the Reactive Heritage Digital Twin paradigm established in prior research, exploring the role of artificial intelligence in expanding and enhancing its capabilities. After providing an overview of the ontological model underlying the RHDT paradigm, this paper investigates the application of AI to improve data analysis and predictive capabilities of Heritage Digital Twins in synergy with the previously defined RHDTO semantic model. The structured nature of ontologies is highlighted as essential for enabling AIs to operate transparently, minimising hallucinations and other errors that are characteristic challenges of these technologies. New classes and properties within RHDTO are introduced to represent the AI-enhanced functions. Finally, some case studies are provided to illustrate how integrating AI within the RHDT framework can contribute to enriching the understanding of cultural information through interconnected data and facilitate real-time monitoring and preservation of cultural objects. Citation: Data PubDate: 2024-12-26 DOI: 10.3390/data10010001 Issue No:Vol. 10, No. 1 (2024)
Authors:Oleg S. Alexandrov, Dmitry V. Romanov First page: 2 Abstract: Minisatellites are widespread tandem DNA repeats in the genome with a monomer length of 10 to 100 bp. The high variability of minisatellite loci makes them attractive for the development of molecular markers. Minisatellites are used as markers according to three strategies: marking of digested genomic DNA with minisatellite-based probes; amplification with primers based on the sequences of the minisatellites themselves; and amplification with primers designed for the borders upstream and downstream of the minisatellite locus. In this study, a minisatellite dataset was obtained from the analysis of the Citrus limon (L.) Osbeck genome using Tandem Repeat Finder (TRF) and GMATA software. The minisatellite loci found were used to develop molecular markers that were tested in GMATA using electronic PCR (e-PCR). The obtained dataset includes the sequences of the extracted minisatellites and their characteristics (start and end nucleotide positions on the chromosome, monomer length, number of repetitions, and array length), as well as the sequences of the developed primers, expected amplicon lengths, and e-PCR results. The presented dataset can be used for marking lemon samples according to any of the three strategies. It provides a useful basis for lemon variety certification, identification of samples, verification of collections, lemon genome mapping, saturation of already created maps, study of the lemon genome architecture, etc. Citation: Data PubDate: 2024-12-28 DOI: 10.3390/data10010002 Issue No:Vol. 10, No. 1 (2024)
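A toy version of the tandem-repeat scan: the study uses Tandem Repeat Finder and GMATA, so this regex-based sketch only illustrates the core idea of detecting a 10–100 bp monomer repeated in tandem, on an invented sequence:

```python
import re

def find_minisatellites(seq, min_monomer=10, max_monomer=100, min_copies=3):
    """Report (start, monomer, copies) for tandem arrays of a 10-100 bp
    monomer -- a toy stand-in for what TRF does."""
    pattern = r"(?=(([ACGT]{%d,%d}?)\2{%d,}))" % (
        min_monomer, max_monomer, min_copies - 1)
    hits = []
    for m in re.finditer(pattern, seq):
        array, monomer = m.group(1), m.group(2)
        hits.append((m.start(), monomer, len(array) // len(monomer)))
    return hits

# Invented sequence containing a 12-bp monomer repeated three times.
seq = "ACGT" * 2 + "AGGTCCTAGGAC" * 3 + "TTTT"
hits = find_minisatellites(seq)
```

Real tools additionally tolerate imperfect copies and score candidate arrays, which a plain backreference regex cannot do.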
Authors:Sebastian Valencia-Garzon, Esteban Gonzalez-Valencia, Nelson Gómez-Cardona, Andres Calvo-Salcedo, J. A. Jaramillo-Villegas, Jorge Montoya-Cardona, Erick Reyes-Vera First page: 3 Abstract: This study focuses on the analysis of the spectral response of all-pass micro-ring resonators (MRRs), which are essential in photonic device applications such as telecommunications, sensing, and optical frequency comb generation. The aim of this work is to generate a synthetic dataset that explores the spectral characteristics of the expected transmission spectra of MRRs by varying their structural parameters. Using numerical simulations, the dataset will allow the optimization of MRR performance metrics such as free spectral range (FSR), full width at half maximum (FWHM), and quality factor (Q-factor). The results confirm that variations in geometric configurations can significantly affect MRR performance, and the dataset provides valuable insights into the optimization process. Furthermore, machine learning techniques can be applied to the dataset to automate and improve the design process, reducing simulation times and increasing accuracy. This work contributes to the development of photonic devices by providing a broad dataset for further analysis and optimization. Citation: Data PubDate: 2024-12-30 DOI: 10.3390/data10010003 Issue No:Vol. 10, No. 1 (2024)
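The figures of merit named in this abstract have standard analytic expressions for an all-pass ring (as given in the silicon microring literature); the parameter values below are assumptions for illustration, not values taken from the dataset:

```python
import numpy as np

# Assumed parameters for an all-pass micro-ring resonator:
wavelength = 1.55e-6   # resonance wavelength, m
n_g = 4.2              # group index
radius = 10e-6         # ring radius, m
r = 0.98               # self-coupling coefficient
a = 0.99               # single-pass amplitude transmission

L = 2 * np.pi * radius  # round-trip length, m

# Standard all-pass ring figures of merit:
fsr = wavelength**2 / (n_g * L)                                 # free spectral range
fwhm = (1 - r * a) * wavelength**2 / (np.pi * n_g * L * np.sqrt(r * a))
q_factor = wavelength / fwhm                                    # quality factor
```

With these assumed values, the FSR comes out near 9 nm and the Q-factor in the tens of thousands, showing how the geometric and coupling parameters varied in the dataset map directly onto the performance metrics.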