Data
ISSN (Online) 2306-5729 Published by MDPI
- Data, Vol. 8, Pages 135: Enhancing Small Tabular Clinical Trial Dataset
through Hybrid Data Augmentation: Combining SMOTE and WCGAN-GP
Authors: Winston Wang, Tun-Wen Pai
First page: 135
Abstract: This study addressed the challenge of training generative adversarial networks (GANs) on small tabular clinical trial datasets for data augmentation, which are known to pose difficulties in training due to limited sample sizes. To overcome this obstacle, a hybrid approach is proposed, combining the synthetic minority oversampling technique (SMOTE) to initially augment the original data to a more substantial size for improving the subsequent GAN training with a Wasserstein conditional generative adversarial network with gradient penalty (WCGAN-GP), proven for its state-of-the-art performance and enhanced stability. The ultimate objective of this research was to demonstrate that the quality of synthetic tabular data generated by the final WCGAN-GP model maintains the structural integrity and statistical representation of the original small dataset using this hybrid approach. This focus is particularly relevant for clinical trials, where limited data availability due to privacy concerns and restricted accessibility to subject enrollment pose common challenges. Despite the limitation of data, the findings demonstrate that the hybrid approach successfully generates synthetic data that closely preserves the characteristics of the original small dataset. By harnessing the power of this hybrid approach to generate faithful synthetic data, the potential for enhancing data-driven research in drug clinical trials becomes evident. This includes enabling a robust analysis on small datasets, supplementing the lack of clinical trial data, facilitating its utility in machine learning tasks, even extending to using the model for anomaly detection to ensure better quality control during clinical trial data collection, all while prioritizing data privacy and implementing strict data protection measures.
Citation: Data
PubDate: 2023-08-23
DOI: 10.3390/data8090135
Issue No: Vol. 8, No. 9 (2023)
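The SMOTE stage mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote_sketch`, the toy rows, and the choice of `k` are assumptions made for illustration, and the WCGAN-GP stage is not shown. The sketch only demonstrates the core idea of SMOTE, which is to create a synthetic row on the line segment between a minority-class row and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Return n_new synthetic rows interpolated between minority rows.

    Hypothetical helper for illustration only: each synthetic row lies on
    the segment between a randomly chosen row and one of its k nearest
    neighbours (Euclidean distance) within the same minority set.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority row, nearest first.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy two-feature minority class (made-up values).
rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_sketch(rows, n_new=6)
```

Because every synthetic row is a convex combination of two real rows, each feature stays within the range observed in the original data, which is one reason the technique suits a first-pass enlargement of a small table.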
- Data, Vol. 8, Pages 136: Knowledge Graph Dataset for Semantic Enrichment
of Picture Description in NAPS Database
Authors: Marko Horvat, Gordan Gledec, Tomislav Jagušt, Zoran Kalafatić
First page: 136
Abstract: This data description introduces a comprehensive knowledge graph (KG) dataset with detailed information about the relevant high-level semantics of visual stimuli used to induce emotional states stored in the Nencki Affective Picture System (NAPS) repository. The dataset contains 6808 systematically manually assigned annotations for 1356 NAPS pictures in 5 categories, linked to WordNet synsets and Suggested Upper Merged Ontology (SUMO) concepts presented in a tabular format. Both knowledge databases provide an extensive and supervised taxonomy glossary suitable for describing picture semantics. The annotation glossary consists of 935 WordNet and 513 SUMO entities. A description of the dataset and the specific processes used to collect, process, review, and publish the dataset as open data are also provided. This dataset is unique in that it captures complex objects, scenes, actions, and the overall context of emotional stimuli with knowledge taxonomies at a high level of quality. It provides a valuable resource for a variety of projects investigating emotion, attention, and related phenomena. In addition, researchers can use this dataset to explore the relationship between emotions and high-level semantics or to develop data-retrieval tools to generate personalized stimuli sequences. The dataset is freely available in common formats (Excel and CSV).
Citation: Data
PubDate: 2023-08-24
DOI: 10.3390/data8090136
Issue No: Vol. 8, No. 9 (2023)
- Data, Vol. 8, Pages 137: A Framework for Evaluating Renewable Energy for
Decision-Making Integrating a Hybrid FAHP-TOPSIS Approach: A Case Study in
Valle del Cauca, Colombia
Authors: Mateo Barrera-Zapata, Fabian Zuñiga-Cortes, Eduardo Caicedo-Bravo
First page: 137
Abstract: At present, the energy landscape of many countries faces transformational challenges driven by sustainable development objectives, supported by the implementation of clean technologies, such as renewable energy sources, to meet the flexibility and diversification needs of the traditional energy mix. However, integrating these technologies requires a thorough study of the context in which they are developed. Furthermore, it is necessary to carry out an analysis from a sustainable approach that quantifies the impact of proposals on multiple objectives established by stakeholders. This article presents a framework for analysis that integrates a method for evaluating the technical feasibility of resources for photovoltaic solar, wind, small hydroelectric power, and biomass generation. These resources are used to construct a set of alternatives and are evaluated using a hybrid FAHP-TOPSIS approach. FAHP-TOPSIS is used as a comparison technique among a collection of technical, economic, and environmental criteria, ranking the alternatives considering their level of trade-off between criteria. The results of a case study in Valle del Cauca (Colombia) offer a wide range of alternatives and indicate a combination of 50% biomass, and 50% solar as the best, assisting in decision-making for the correct use of available resources and maximizing the benefits for stakeholders.
Citation: Data
PubDate: 2023-08-30
DOI: 10.3390/data8090137
Issue No: Vol. 8, No. 9 (2023)
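As a rough illustration of the ranking step described in the abstract above, the following sketch implements plain TOPSIS. It assumes the criterion weights are already available (in the paper they would come from the fuzzy AHP stage, which is not reproduced here); the function name `topsis`, the weights, and the decision matrix are hypothetical values chosen for the example, not data from the case study.

```python
import math


def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] for each alternative (row).

    matrix  : rows are alternatives, columns are criteria
    weights : one weight per criterion (assumed to be given)
    benefit : True if larger is better for that criterion, else False
    """
    n_crit = len(weights)
    # Vector-normalise each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    # Ideal and anti-ideal points, criterion by criterion.
    cols = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)  # distance to the ideal point
        d_neg = math.dist(row, anti)   # distance to the anti-ideal point
        scores.append(d_neg / (d_pos + d_neg))
    return scores


# Hypothetical alternatives scored on (energy yield, cost, emissions).
matrix = [
    [80.0, 100.0, 20.0],
    [60.0, 120.0, 30.0],
    [90.0, 90.0, 10.0],
]
scores = topsis(matrix, weights=[0.5, 0.3, 0.2], benefit=[True, False, False])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy matrix the third alternative is best on every criterion, so it coincides with the ideal point and ranks first; with real data the alternatives trade off against each other and the weights decide the order.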
- Data, Vol. 8, Pages 138: Using Landsat-5 for Accurate Historical LULC
Classification: A Comparison of Machine Learning Models
Authors: Denis Krivoguz, Sergei G. Chernyi, Elena Zinchenko, Artem Silkin, Anton Zinchenko
First page: 138
Abstract: This study investigates the application of various machine learning models for land use and land cover (LULC) classification in the Kerch Peninsula. The study utilizes archival field data, cadastral data, and published scientific literature for model training and testing, using Landsat-5 imagery from 1990 as input data. Four machine learning models (deep neural network, Random Forest, support vector machine (SVM), and AdaBoost) are employed, and their hyperparameters are tuned using random search and grid search. Model performance is evaluated through cross-validation and confusion matrices. The deep neural network achieves the highest accuracy (96.2%) and performs well in classifying water, urban lands, open soils, and high vegetation. However, it faces challenges in classifying grasslands, bare lands, and agricultural areas. The Random Forest model achieves an accuracy of 90.5% but struggles with differentiating high vegetation from agricultural lands. The SVM model achieves an accuracy of 86.1%, while the AdaBoost model performs the lowest with an accuracy of 58.4%. The novel contributions of this study include the comparison and evaluation of multiple machine learning models for land use classification in the Kerch Peninsula. The deep neural network and Random Forest models outperform SVM and AdaBoost in terms of accuracy. However, the use of limited data sources such as cadastral data and scientific articles may introduce limitations and potential errors. Future research should consider incorporating field studies and additional data sources for improved accuracy. This study provides valuable insights for land use classification, facilitating the assessment and management of natural resources in the Kerch Peninsula. The findings contribute to informed decision-making processes and lay the groundwork for further research in the field.
Citation: Data
PubDate: 2023-08-30
DOI: 10.3390/data8090138
Issue No: Vol. 8, No. 9 (2023)
- Data, Vol. 8, Pages 139: Dataset of Multi-Aspect Integrated Migration
Indicators
Authors: Diletta Goglia, Laura Pollacci, Alina Sîrbu
First page: 139
Abstract: Nowadays, new branches of research are proposing the use of non-traditional data sources for the study of migration trends in order to find an original methodology to answer open questions about cross-border human mobility. New knowledge extracted from these data must be validated using traditional data, which are however distributed across different sources and difficult to integrate. In this context we present the Multi-aspect Integrated Migration Indicators (MIMI) dataset, a new dataset of migration indicators (flows and stocks) and possible migration drivers (cultural, economic, demographic and geographic indicators). This was obtained through acquisition, transformation and integration of disparate traditional datasets together with social network data from Facebook (Social Connectedness Index). This article describes the process of gathering, embedding and merging traditional and novel variables, resulting in this new multidisciplinary dataset that we believe could significantly contribute to nowcast/forecast bilateral migration trends and migration drivers.
Citation: Data
PubDate: 2023-08-31
DOI: 10.3390/data8090139
Issue No: Vol. 8, No. 9 (2023)
- Data, Vol. 8, Pages 140: Employing Source Code Quality Analytics for
Enriching Code Snippets Data
Authors: Thomas Karanikiotis, Themistoklis Diamantopoulos, Andreas Symeonidis
First page: 140
Abstract: The availability of code snippets in online repositories like GitHub has led to an uptick in code reuse, this way further supporting an open-source component-based development paradigm. The likelihood of code reuse rises when the code components or snippets are of high quality, especially in terms of readability, making their integration and upkeep simpler. Toward this direction, we have developed a dataset of code snippets that takes into account both the functional and the quality characteristics of the snippets. The dataset is based on the CodeSearchNet corpus and comprises additional information, including static analysis metrics, code violations, readability assessments, and source code similarity metrics. Thus, using this dataset, both software researchers and practitioners can conveniently find and employ code snippets that satisfy diverse functional needs while also demonstrating excellent readability and maintainability.
Citation: Data
PubDate: 2023-08-31
DOI: 10.3390/data8090140
Issue No: Vol. 8, No. 9 (2023)
- Data, Vol. 8, Pages 141: Thailand Raw Water Quality Dataset Analysis and
Evaluation
Authors: Jaturapith Krohkaew, Pongpon Nilaphruek, Niti Witthayawiroj, Sakchai Uapipatanakul, Yamin Thwe, Padma Nyoman Crisnapati
First page: 141
Abstract: Sustainable water quality data are important for understanding historical variability and trends in river regimes, as well as the impact of industrial waste on the health of aquatic ecosystems. Sustainable water management practices heavily depend on reliable and comprehensive data, prompting the need for accurate monitoring and assessment of water quality parameters. This research describes a reconstructed daily water quality dataset that complements rare historical observations for six station points along the Chao Phraya River in Thailand. Internet of Things technology and a Eureka water probe sensor are used to collect and reconstruct the water quality dataset for the period from June 2022 to February 2023, with Turbidity, Optical Dissolved Oxygen, Dissolved Oxygen Saturation, Spatial Conductivity, Acidity/Basicity, Total Dissolved Solids, Salinity, Temperature, Chlorophyll, and Depth as the recorded parameters from six different stations. The presented dataset comprises a total of 211,322 data points, which are separated into six CSV files. The dataset is then evaluated using the Long Short-Term Memory (LSTM) algorithm with a Mean Squared Error (MSE) of 0.0012256, and Root Mean Squared Error (RMSE) of 0.0350080. The proposed dataset provides valuable insights for researchers studying river ecosystems, supporting informed decision-making and sustainable water management practices.
Citation: Data
PubDate: 2023-09-04
DOI: 10.3390/data8090141
Issue No: Vol. 8, No. 9 (2023)
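The two error metrics quoted in the abstract above are related by RMSE = sqrt(MSE). The short sketch below defines both (the helper names `mse` and `rmse` are illustrative, and no LSTM model is involved) and checks that the reported values are mutually consistent up to rounding.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Values quoted in the abstract; the square root of the reported MSE
# should match the reported RMSE up to the rounding of the MSE.
reported_mse = 0.0012256
reported_rmse = 0.0350080
implied_rmse = math.sqrt(reported_mse)
consistent = abs(implied_rmse - reported_rmse) < 1e-5
```

Because RMSE is expressed in the same units as the target variable, it is usually the easier of the two figures to interpret.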
- Data, Vol. 8, Pages 142: Update of Dietary Supplement Label Database
Addressing on Coding in Italy
Authors: Giorgia Perelli, Roberta Bernini, Massimo Lucarini, Alessandra Durazzo
First page: 142
Abstract: Harmonized composition data for foods and dietary supplements are needed for research and for policy decision making. For a correct assessment of dietary intake, the categorization and the classification of food products and dietary supplements are necessary. In recent decades, the marketing of dietary supplements has increased. A food supplements-based database has, as a principal feature, an intrinsic dynamism related to the continuous changes in formulations, which consequently leads to the need for constant monitoring of the market and for regular updates of the database. This study presents an update to the Dietary Supplement Label Database in Italy focused on dietary supplements coding. The updated dataset, presented here for the first time, consists of the codes of 216 dietary supplements currently on the market in Italy that have functional foods as their characterizing ingredients, assigned through the two most commonly used description and classification systems: LanguaL™ and FoodEx2. This update represents a unique tool and guideline for other compilers and users for applying classification coding systems to dietary supplements. Moreover, this updated dataset represents a valuable resource for several applications such as epidemiological investigations, exposure studies, and dietary assessment.
Citation: Data
PubDate: 2023-09-13
DOI: 10.3390/data8090142
Issue No: Vol. 8, No. 9 (2023)
- Data, Vol. 8, Pages 143: A New Odd Beta Prime-Burr X Distribution with
Applications to Petroleum Rock Sample Data and COVID-19 Mortality Rate
Authors: Ahmad Abubakar Suleiman, Hanita Daud, Narinderjit Singh Sawaran Singh, Aliyu Ismail Ishaq, Mahmod Othman
First page: 143
Abstract: In this article, we pioneer a new Burr X distribution using the odd beta prime generalized (OBP-G) family of distributions called the OBP-Burr X (OBPBX) distribution. The density function of this model is symmetric, left-skewed, right-skewed, and reversed-J, while the hazard function is monotonically increasing, decreasing, bathtub, and N-shaped, making it suitable for modeling skewed data and failure rates. Various statistical properties of the new model are obtained, such as moments, moment-generating function, entropies, quantile function, and limit behavior. The maximum-likelihood-estimation procedure is utilized to determine the parameters of the model. A Monte Carlo simulation study is implemented to ascertain the efficiency of maximum-likelihood estimators. The findings demonstrate the empirical application and flexibility of the OBPBX distribution, as showcased through its analysis of petroleum rock samples and COVID-19 mortality data, along with its superior performance compared to well-known extended versions of the Burr X distribution. We anticipate that the new distribution will attract a wider readership and provide a vital tool for modeling various phenomena in different domains.
Citation: Data
PubDate: 2023-09-19
DOI: 10.3390/data8090143
Issue No: Vol. 8, No. 9 (2023)
- Data, Vol. 8, Pages 144: Potential Range Map Dataset of Indian Birds
Authors: Arpit Deomurari, Ajay Sharma, Dipankar Ghose, Randeep Singh
First page: 144
Abstract: Conservation management heavily relies on accurate species distribution data. However, distributional information for most species is limited to distributional range maps, which may not have enough resolution to guide conservation action or to establish current distribution status. In many cases, distribution maps are difficult to access in proper data formats for analysis and conservation planning of species. In this study, we addressed this issue by developing Species Distribution Models (SDMs) that integrate species presence data from various citizen science initiatives. This allowed us to systematically construct current distribution maps for 1091 bird species across India. To create these SDMs, we used MaxEnt 3.4.4 (Maximum Entropy) as the base for species distribution modelling and combined it with multiple citizen science datasets containing information on species occurrence and 29 environmental variables. Using this method, we were able to estimate species distribution maps at both a national scale and a high spatial resolution of 1 km². Thus, the results of our study provide current species distribution maps for 968 bird species found in India. These maps significantly improve our knowledge of the geographic distribution of about 75% of India’s bird species and are essential for addressing spatial knowledge gaps for conservation issues. Additionally, by superimposing the distribution maps of different species, we can locate hotspots for bird diversity and align conservation action.
Citation: Data
PubDate: 2023-09-21
DOI: 10.3390/data8090144
Issue No: Vol. 8, No. 9 (2023)
- Data, Vol. 8, Pages 123: Blockchain Payment Services in the Hospitality
Sector: The Mediating Role of Data Security on Utilisation Efficiency of
the Customer
Authors: Ankit Dhiraj, Sanjeev Kumar, Divya Rani, Simon Grima, Kiran Sood
First page: 123
Abstract: Blockchain technology has the potential to completely transform the hospitality sector by offering a safe, open, and effective method of payment. Increased customer utilisation efficiency may result from this. This study looks into how blockchain payment methods affect hotel customers’ intentions to stay loyal by devising four hypotheses. A questionnaire was specifically created and self-administered for this study as a data-gathering tool and distributed to hotel customers. The I.B.M. SPSS and Amos software packages were used to analyse the data of the 301 valid responses. Findings show that hospitality customers may use blockchain payment services if the customer is satisfied with the data security of this payment system. The study also highlighted that customer data security mediated the association between utilisation efficiency and blockchain payment systems. Blockchain payment services can affect visitors’ intentions to stay loyal by impacting data security and consumer happiness. Results suggest that blockchain payment systems can be useful for hospitality firms looking to increase client utilisation efficiency. Blockchain can simplify visitor booking and payment processes by providing a safe, open, and effective transacting method. This may result in a satisfying encounter that visitors are more inclined to recall and repeat.
Citation: Data
PubDate: 2023-07-30
DOI: 10.3390/data8080123
Issue No: Vol. 8, No. 8 (2023)
- Data, Vol. 8, Pages 124: Measuring the Effect of Fraud on Data-Quality
Dimensions
Authors: Samiha Brahimi, Mariam Elhussein
First page: 124
Abstract: Data preprocessing moves the data from raw to ready for analysis. Data resulting from fraud compromises the quality of the data and the resulting analysis. It can remain undetected in datasets and thus be included in the analysis. This study proposed a process for measuring the effect of fraudulent data during data preparation and its possible influence on quality. The five-step process begins with identifying the business rules related to the business process(es) affected by fraud and their associated quality dimensions. This is followed by measuring the business rules in the specified timeframe, detecting fraudulent data, cleaning them, and measuring their quality after cleaning. The process was implemented in the case of occupational fraud within a hospital context and the illegal issuance of undeserved sick leave. The aim of the application is to identify the quality dimensions that are influenced by the injected fraudulent data and how these dimensions are affected. This study agrees with the existing literature and confirms the effects of fraud on timeliness, coherence, believability, and interpretability. However, it did not show any effect on consistency. Further studies are needed to arrive at a generalizable list of the quality dimensions that fraud can affect.
Citation: Data
PubDate: 2023-07-30
DOI: 10.3390/data8080124
Issue No: Vol. 8, No. 8 (2023)
- Data, Vol. 8, Pages 125: Quantitative Metabolomic Dataset of Avian Eye
Lenses
Authors: Ekaterina A. Zelentsova, Sofia S. Mariasina, Vadim V. Yanshole, Lyudmila V. Yanshole, Nataliya A. Osik, Kirill A. Sharshov, Yuri P. Tsentalovich
First page: 125
Abstract: Metabolomics is a powerful set of methods that uses analytical techniques to identify and quantify metabolites in biological samples, providing a snapshot of the metabolic state of a biological system. In medicine, metabolomics may help to reveal the molecular basis of a disease, make a diagnosis, and monitor treatment responses, while in agriculture, it can improve crop yields and plant breeding. However, animal metabolomics faces several challenges due to the complexity and diversity of animal metabolomes, the lack of standardized protocols, and the difficulty in interpreting metabolomic data. The current dataset includes quantitative metabolomic profiles of eye lenses from 26 bird species (111 specimens) that can aid researchers in developing new experiments and mathematical models and in integrating the profiles with other “-omics” data. The dataset includes raw ¹H NMR spectra, protocols for sample preparation, and data preprocessing, with the final table containing information on the abundance of 89 reliably identified and quantified metabolites. The dataset is quantitative, making it relevant for supplementing with new specimens or comparison groups, followed by data mining and expected new interpretations. The data were obtained using the bird specimens collected in compliance with ethical standards and revealed potential differences in metabolic pathways due to phylogenetic differences or environmental exposure.
Citation: Data
PubDate: 2023-07-31
DOI: 10.3390/data8080125
Issue No: Vol. 8, No. 8 (2023)
- Data, Vol. 8, Pages 126: Datasets of Simulated Exhaled Aerosol Images from
Normal and Diseased Lungs with Multi-Level Similarities for Neural Network
Training/Testing and Continuous Learning
Authors: Mohamed Talaat, Xiuhua Si, Jinxiang Xi
First page: 126
Abstract: Although exhaled aerosols and their patterns may seem chaotic in appearance, they inherently contain information related to the underlying respiratory physiology and anatomy. This study presented a multi-level database of simulated exhaled aerosol images from both normal and diseased lungs. An anatomically accurate mouth-lung geometry extending to G9 was modified to model two stages of obstructions in small airways and physiology-based simulations were utilized to capture the fluid-particle dynamics and exhaled aerosol images from varying breath tests. The dataset was designed to test two performance metrics of convolutional neural network (CNN) models when used for transfer learning: interpolation and extrapolation. To this aim, three testing datasets with decreasing image similarities were developed (i.e., level 1, inbox, and outbox). Four network models (AlexNet, ResNet-50, MobileNet, and EfficientNet) were tested and the performances of all models decreased for the outbox test images, which were outside the design space. The effect of continuous learning was also assessed for each model by adding new images into the training dataset and the newly trained network was tested at multiple levels. Among the four network models, ResNet-50 excelled in performance in both multi-level testing and continuous learning, the latter of which enhanced the accuracy of the most challenging classification task (i.e., 3-class with outbox test images) from 60.65% to 98.92%. The datasets can serve as a benchmark training/testing database for validating existent CNN models or quantifying the performance metrics of new CNN models.
Citation: Data
PubDate: 2023-07-31
DOI: 10.3390/data8080126
Issue No: Vol. 8, No. 8 (2023)
- Data, Vol. 8, Pages 127: eMailMe: A Method to Build Datasets of Corporate
Emails in Portuguese
Authors: Akira A. de Moura Galvão Uematsu, Anarosa A. F. Brandão
First page: 127
Abstract: One of the areas in which knowledge management has application is in companies that are concerned with maintaining and disseminating their practices among their members. However, studies involving these two domains may end up suffering from the issue of data confidentiality. Furthermore, it is difficult to find data regarding organizations’ processes and associated knowledge. Therefore, this paper presents a method to support the generation of a labeled dataset composed of texts that simulate corporate emails containing sensitive information regarding disclosure, written in Portuguese. The method begins with the definition of the dataset’s size and content distribution; the structure of its emails’ texts; and the guidelines for specialists to build the emails’ texts. It aims to create datasets that can be used in the validation of a tacit knowledge extraction process considering the 5W1H approach for the resulting base. The method was applied to create a dataset with content related to several domains, such as Federal Court and Registry Office and Marketing, giving it diversity and realism, while simulating real-world situations in the specialists’ professional life. The dataset generated is available in an open-access repository so that it can be downloaded and, eventually, expanded.
Citation: Data
PubDate: 2023-07-31
DOI: 10.3390/data8080127
Issue No: Vol. 8, No. 8 (2023)
- Data, Vol. 8, Pages 128: VEPL Dataset: A Vegetation Encroachment in Power
Line Corridors Dataset for Semantic Segmentation of Drone Aerial
Orthomosaics
Authors: Mateo Cano-Solis, John R. Ballesteros, John W. Branch-Bedoya
First page: 128
Abstract: Vegetation encroachment in power line corridors poses multiple problems for modern energy-dependent societies. Failures due to the contact between power lines and vegetation can result in power outages and millions of dollars in losses. To address this problem, UAVs have emerged as a promising solution due to their ability to quickly and affordably monitor long corridors through autonomous flights or being remotely piloted. However, the extensive manual task of analyzing every image acquired by the UAVs when searching for vegetation encroachment has led many authors to propose the use of Deep Learning to automate the detection process. Despite the advantages of using a combination of UAV imagery and Deep Learning, there is currently a lack of datasets that help to train Deep Learning models for this specific problem. This paper presents a dataset for the semantic segmentation of vegetation encroachment in power line corridors. RGB orthomosaics were obtained for a rural road area using a commercial UAV. The dataset is composed of pairs of tessellated RGB images, coming from the orthomosaic and corresponding multi-color masks representing three different classes: vegetation, power lines, and the background. A detailed description of the image acquisition process is provided, as well as the labeling task and the data augmentation techniques, among other relevant details to produce the dataset. Researchers would benefit from using the proposed dataset by developing and improving strategies for vegetation encroachment monitoring using UAVs and Deep Learning.
Citation: Data
PubDate: 2023-08-04
DOI: 10.3390/data8080128
Issue No: Vol. 8, No. 8 (2023)
- Data, Vol. 8, Pages 129: Anomaly Detection in Student Activity in Solving
Unique Programming Exercises: Motivated Students against Suspicious Ones
Authors: Liliya A. Demidova, Peter N. Sovietov, Elena G. Andrianova, Anna A. Demidova
First page: 129
Abstract: This article presents a dataset containing messages from the Digital Teaching Assistant (DTA) system, which records the results from the automatic verification of students’ solutions to unique programming exercises of 11 various types. These results are automatically generated by the system, which automates a massive Python programming course at MIREA—Russian Technological University (RTU MIREA). The DTA system is trained to distinguish between approaches to solve programming exercises, as well as to identify correct and incorrect solutions, using intelligent algorithms responsible for analyzing the source code in the DTA system using vector representations of programs based on Markov chains, calculating pairwise Jensen–Shannon distances for programs and using a hierarchical clustering algorithm to detect high-level approaches used by students in solving unique programming exercises. In the process of learning, each student must correctly solve 11 unique exercises in order to receive admission to the intermediate certification in the form of a test. In addition, a motivated student may try to find additional approaches to solve exercises they have already solved. At the same time, not all students are able or willing to solve the 11 unique exercises proposed to them; some will resort to outside help in solving all or part of the exercises. Since all information about the interactions of the students with the DTA system is recorded, it is possible to identify different types of students. First of all, the students can be classified into 2 classes: those who failed to solve 11 exercises and those who received admission to the intermediate certification in the form of a test, having solved the 11 unique exercises correctly. However, it is possible to identify classes of typical, motivated and suspicious students among the latter group based on the proposed dataset. 
The proposed dataset can be used to develop regression models that will predict outbursts of student activity when interacting with the DTA system, to solve clustering problems, to identify groups of students with a similar behavior model in the learning process and to develop intelligent data classifiers that predict the students’ behavior model and draw appropriate conclusions, not only at the end of the learning process but also during the course of it in order to motivate all students, even those who are classified as suspicious, to visualize the results of the learning process using various tools.
Citation: Data
PubDate: 2023-08-08
DOI: 10.3390/data8080129
Issue No: Vol. 8, No. 8 (2023)
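The abstract above mentions pairwise Jensen–Shannon distances between vector representations of programs. A minimal sketch of that distance follows, assuming each program has already been reduced to a discrete probability vector; how the DTA system derives those vectors from Markov chains is not reproduced here, and the names `js_distance`, `pairwise_js`, and the toy vectors are hypothetical.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions.

    Uses base-2 logarithms, so the result lies in [0, 1]. The distance is
    the square root of the Jensen-Shannon divergence.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negatives


def pairwise_js(vectors):
    """Symmetric matrix of Jensen-Shannon distances for a list of vectors."""
    return [[js_distance(p, q) for q in vectors] for p in vectors]


# Toy probability vectors standing in for three programs (made-up values).
programs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
distances = pairwise_js(programs)
```

A matrix like `distances` is the kind of input a hierarchical clustering routine would take when grouping solutions that follow a similar approach.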
- Data, Vol. 8, Pages 130: Towards Action-State Process Model Discovery
Authors: Alessio Bottrighi, Marco Guazzone, Giorgio Leonardi, Stefania Montani, Manuel Striani, Paolo Terenziani
First page: 130
Abstract: Process model discovery covers the different methodologies used to mine a process model from traces of process executions, and it has an important role in artificial intelligence research. Current approaches in this area, with a few exceptions, focus on determining a model of the flow of actions only. However, in several contexts, (i) restricting the attention to actions is quite limiting, since the effects of such actions also have to be analyzed, and (ii) traces provide additional pieces of information in the form of states (i.e., values of parameters possibly affected by the actions); for instance, in several medical domains, the traces include both actions and measurements of patient parameters. In this paper, we propose AS-SIM (Action-State SIM), the first approach able to mine a process model that comprehends two distinct classes of nodes, to capture both actions and states.
Citation: Data
PubDate: 2023-08-09
DOI: 10.3390/data8080130
Issue No: Vol. 8, No. 8 (2023)
- Data, Vol. 8, Pages 131: Draft Genome Sequence Data of Streptomyces
anulatus, Strain K-31
Authors: Andrey P. Bogoyavlenskiy, Madina S. Alexyuk, Amankeldi K. Sadanov, Vladimir E. Berezin, Lyudmila P. Trenozhnikova, Gul B. Baymakhanova
First page: 131
Abstract: Streptomyces anulatus is a typical representative of the Streptomyces genus, synthesizing a large number of biologically active compounds. In this study, the draft genome of Streptomyces anulatus strain K-31 is presented, generated from Illumina reads using the SPAdes software. The size of the assembled genome was 8.548838 Mb. Annotation of the S. anulatus genome assembly identified 7749 genes, including 7149 protein-coding genes and 92 RNA genes. This genome will help to further understand Streptomyces genetics and evolution and can be useful for obtaining biologically active compounds.
Citation: Data
PubDate: 2023-08-10
DOI: 10.3390/data8080131
Issue No: Vol. 8, No. 8 (2023)
- Data, Vol. 8, Pages 132: VR Traffic Dataset on Broad Range of End-User
Activities
Authors: Marina Polupanova
First page: 132
Abstract: With the emergence of new internet traffic types in modern transport networks, it has become critical for service providers to understand the structure of that traffic and predict peaks of that load for planning infrastructure expansion. Several studies have investigated traffic parameters for Virtual Reality (VR) applications. Still, most of them test only a partial range of user activities during a limited time interval. This work creates a dataset of captures from a broader spectrum of VR activities performed with a Meta Quest 2 headset, with the duration of each real residential user session recorded for at least half an hour. The newly collected data helped show that some gaming VR traffic activities have a high share of uplink traffic and require symmetric user links. We also found that the gaming phase of the overall gameplay is more sensitive to a reduction in channel resources than the higher-bitrate game launch phase. Hence, we recommend it as a source of traffic distribution for channel sizing model creation. From the gaming phase, capture intervals of more than 100 s contain the most representative information for modeling activity.
Citation: Data
PubDate: 2023-08-17
DOI: 10.3390/data8080132
Issue No: Vol. 8, No. 8 (2023)
- Data, Vol. 8, Pages 133: Leveraging Return Prediction Approaches for
Improved Value-at-Risk Estimation
Authors: Farid Bagheri, Diego Reforgiato Recupero, Espen Sirnes
First page: 133
Abstract: Value at risk is a statistic used to anticipate the largest possible losses over a specific time frame and within some level of confidence, usually 95% or 99%. For risk management and regulators, it offers a solution for trustworthy quantitative risk management tools. VaR has become the most widely used and accepted indicator of downside risk. Today, commercial banks and financial institutions utilize it as a tool to estimate the size and probability of upcoming losses in portfolios and, as a result, to estimate and manage the degree of risk exposure. The goal is to obtain the average number of VaR “failures” or “breaches” (losses that are more than the VaR) as near to the target rate as possible. It is also desired that the losses be as evenly distributed as possible. VaR can be modeled in a variety of ways. The simplest method is to estimate volatility based on prior returns according to the assumption that volatility is constant. Otherwise, the volatility process can be modeled using the GARCH model. Machine learning techniques have been used in recent years to carry out stock market forecasts based on historical time series. A machine learning system is often trained on an in-sample dataset, where it can adjust and improve specific hyperparameters in accordance with the underlying metric. The trained model is tested on an out-of-sample dataset. We compared the baselines for the VaR estimation of a day (d) according to different metrics (i) to their respective variants that included stock return forecast information of d and stock return data of the days before d and (ii) to a GARCH model that included return prediction information of d and stock return data of the days before d. Various strategies such as ARIMA and a proposed ensemble of regressors have been employed to predict stock returns. We observed that the versions of the univariate techniques and GARCH integrated with return predictions outperformed the baselines in four different marketplaces.
Citation: Data
PubDate: 2023-08-17
DOI: 10.3390/data8080133
Issue No: Vol. 8, No. 8 (2023)
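The breach-rate criterion described in the abstract above can be illustrated with a minimal sketch. It is not the authors' method: it assumes the simplest baseline mentioned there (constant volatility estimated from prior returns), adds a normality assumption to turn that volatility into a VaR, and runs on simulated returns. The function names, the 250-day window, and the simulated data are all hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev


def gaussian_var(returns, confidence=0.95):
    """One-day VaR, expressed as a positive loss, assuming i.i.d. normal returns."""
    z = NormalDist().inv_cdf(confidence)
    return z * stdev(returns) - mean(returns)


def breach_rate(returns, window=250, confidence=0.95):
    """Share of days whose loss exceeded the VaR estimated from the preceding window."""
    breaches = 0
    tested = 0
    for t in range(window, len(returns)):
        var_t = gaussian_var(returns[t - window:t], confidence)
        if -returns[t] > var_t:  # a loss larger than the VaR is a breach
            breaches += 1
        tested += 1
    return breaches / tested


# Simulated daily returns (an assumption; no real market data is used here).
random.seed(0)
simulated = [random.gauss(0.0, 0.01) for _ in range(1500)]
rate = breach_rate(simulated)
print(f"observed breach rate: {rate:.3f}, target rate: {1 - 0.95:.2f}")
```

A well-calibrated VaR model should give an observed breach rate close to the target rate (here 5%); a rate well above it suggests that risk is underestimated.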
- Data, Vol. 8, Pages 134: Quantifying Webpage Performance: A Comparative
Analysis of TCP/IP and QUIC Communication Protocols for Improved
Efficiency
Authors: Thyago Celso Cavalcante Nepomuceno, Késsia Thais Cavalcanti Nepomuceno, Fabiano Carlos da Silva, Silas Garrido Teixeira de Carvalho Santos
First page: 134
Abstract: Browsing is a prevalent activity on the World Wide Web, and users usually demonstrate significant expectations for expeditious information retrieval and seamless transactions. This article presents a comprehensive performance evaluation of the most frequently accessed webpages in recent years using Data Envelopment Analysis (DEA) adapted to the context (inverse DEA), comparing their performance under two distinct communication protocols: TCP/IP and QUIC. To assess performance disparities, parametric and non-parametric hypothesis tests are employed to investigate the appropriateness of each website’s communication protocols. We provide data on the inputs, outputs, and efficiency scores for 82 out of the world’s top 100 most-accessed websites, describing how experiments and analyses were conducted. The evaluation yields quantitative metrics pertaining to the technical efficiency of the websites and efficient benchmarks for best practices. Nine websites are considered efficient from the point of view of at least one of the communication protocols. Considering TCP/IP, about 80.5% of all units (66 webpages) need to reduce more than 50% of their page load time to be competitive, while this number is 28.05% (23 webpages), considering QUIC communication protocol. In addition, results suggest that TCP/IP protocol has an unfavorable effect on the overall distribution of inefficiencies.
Citation: Data
PubDate: 2023-08-19
DOI: 10.3390/data8080134
Issue No: Vol. 8, No. 8 (2023)
- Data, Vol. 8, Pages 113: VPTD: Human Face Video Dataset for Personality
Traits Detection
Authors: Kenan Kassab, Alexey Kashevnik, Alexander Mayatin, Dmitry Zubok
First page: 113
Abstract: In this paper, we propose a dataset for personality traits detection based on human face videos. Ground truth data have been annotated using the IPIP-50 personality test that every participant completed. To collect the dataset, we developed a web-based platform that allows us to acquire spontaneous answers to predefined questions from the respondents. The website allows the participants to record an interactive interview in order to imitate a real-life interview. The dataset includes 38 videos (2 min on average) of people of different races, genders, and ages. In the paper, we present the top five personality traits calculated based on the test, as well as the top five personality traits calculated by our own model, which determines this information based on video analysis. We introduce a statistical analysis of the collected dataset, and we also apply a K-means clustering algorithm to cluster the data and present the clustering results.
Citation: Data
PubDate: 2023-06-22
DOI: 10.3390/data8070113
Issue No: Vol. 8, No. 7 (2023)
- Data, Vol. 8, Pages 114: A Survey Dataset Evaluating Perceptions of Civil
Engineering Students about Building Information Modelling (BIM)
Authors: Diego Maria Barbieri, Baowen Lou, Marco Passavanti, Aurora Barbieri, Fredrik Bjørheim
First page: 114
Abstract: The implementation of Building Information Modelling (BIM) technologies has become increasingly central in the design, construction and maintenance of both civil structures and infrastructures. As more and more software houses develop new BIM software solutions and a wide range of private and public stakeholders employ them, several educational institutes across the globe strive to expand their teaching portfolio to encompass learning and teaching of BIM. This dataset deals with the perceptions expressed by all the civil engineering undergraduate students who attended an academic course specifically about BIM at University of Stavanger (UiS), Norway, during the second semester of 2022. The survey was divided into five parts and collected information regarding as many overarching aspects: socio-demographic data, perceptions about BIM before and after course attendance, satisfaction about the academic course and the way it was conducted. Considering the very moderate sample size (28 students) and potential biases due to the specific context of the University of Stavanger, the dataset can provide a useful insight into teaching approaches and future curriculum development, rather than indicating major and generalized trends in BIM education. As the questionnaire responses shed light on the feedback and perceptions expressed by university students dealing with BIM for the first time, the formed dataset can offer a straightforward appreciation of students’ cognitive behaviour in BIM education.
Citation: Data
PubDate: 2023-06-28
DOI: 10.3390/data8070114
Issue No: Vol. 8, No. 7 (2023)
- Data, Vol. 8, Pages 115: Factory-Based Vibration Data for Bearing-Fault
Detection
Authors: Adam Lundström, Mattias O’Nils
First page: 115
Abstract: The importance of preventing failures in bearings has led to a large amount of research being conducted to find methods for fault diagnostics and prognostics. Many of these solutions, such as deep learning methods, require a significant amount of data to perform well. This is a reason why publicly available data are important, and there currently exist several open datasets that contain different conditions and faults. However, one challenge is that almost all of these data come from a laboratory setting, where conditions might differ from those found in an industrial environment where the methods are intended to be used. This also means that there may be characteristics of the industrial data that are important to take into account. Therefore, this study describes a completely new dataset for bearing faults from a pulp mill. The analysis of the data shows that the faults vary significantly in terms of fault development, rotation speed, and the amplitude of the vibration signal. It also suggests that methods built for this environment need to consider that no historical examples of faults in the target domain exist and that external events can occur that are not related to any condition of the bearing.
Citation: Data
PubDate: 2023-06-28
DOI: 10.3390/data8070115
Issue No: Vol. 8, No. 7 (2023)
- Data, Vol. 8, Pages 116: Dataset of Linkability Networks of Ethereum
Accounts Involved in NFT Trading of Top 15 NFT Collections
Authors: Aleksandar Tošić, Niki Hrovatin, Jernej Vičič
First page: 116
Abstract: In this paper, we present subgraphs of Ethereum wallets involved in NFT trades of the top 15 ERC721 NFT collections. To obtain the subgraphs, we have extracted the Ethereum transaction graph from a live Ethereum node and filtered out exchanges, mining pools, and smart contracts. For each of the selected collections, we identified the set of accounts involved in NFT trading, which we used to perform a breadth-first search in the Ethereum transaction graph to obtain a subgraph. These subgraphs can offer insight into the linkability of accounts participating in NFT trading on the Ethereum blockchain.
Citation: Data
PubDate: 2023-06-28
DOI: 10.3390/data8070116
Issue No: Vol. 8, No. 7 (2023)
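The breadth-first search step mentioned in the abstract above can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the adjacency-dictionary representation, the directed-edge interpretation, the `max_depth` cut-off, and the toy account labels are all assumptions introduced here.

```python
from collections import deque


def bfs_subgraph(adjacency, seeds, max_depth):
    """Collect nodes and edges reachable from the seed accounts within max_depth hops.

    adjacency maps an account to the accounts it sent transactions to
    (a hypothetical, already-filtered transaction graph).
    """
    depth = {seed: 0 for seed in seeds}
    queue = deque(depth)  # each seed is enqueued exactly once
    edges = set()
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in adjacency.get(node, ()):
            edges.add((node, neighbour))
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return set(depth), edges


# Toy graph with made-up account labels.
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"], "E": []}
nodes, edges = bfs_subgraph(toy_graph, seeds=["A"], max_depth=2)
print(sorted(nodes))
print(sorted(edges))
```

With a depth limit of 2, account E is not reached because it lies three hops from the seed. Filtering out exchanges, mining pools, and smart contracts, as the abstract describes, would happen before the adjacency structure is built.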
- Data, Vol. 8, Pages 117: Assessment of Maize Silage Quality Under
Different Pre-Ensiling Conditions
Authors: Lorenzo Serva, Igino Andrighetto, Severino Segato, Giorgio Marchesini, Maria Chinello, Luisa Magrin
First page: 117
Abstract: Maize silage quality is affected by several factors, including, to some extent, pre-ensiling conditions that can potentially be tuned during harvesting. After assessing new indices for silage quality under lab-scale conditions, several trials have been conducted to find associations between fresh maize characteristics and silage features. Among the former, we included field input levels, FAO class, maturity stage, use of bacterial inoculants, sealing delay and chemical traits, whereas, among the latter, we assessed density and porosity, pH, fermentative profile, dry matter loss and aerobic stability. The trials were conducted using vacuum bags or mini silo buckets. More than 1500 maize samples harvested in Northeast Italy were analysed during the 2016–2022 period. Moreover, to evaluate silage aerobic stability, the fermentative profile and temperature were measured 14 days after the opening of the silo. The association between silage quality and aerobic stability was assessed, and a prognostic risk score was used to calculate the probability of aerobic instability. The dataset could provide baseline information to promote the continuous improvement of maize silage management from different botanical and crop fields, thus improving agronomic and animal farm resource allocation from a precision agriculture perspective.
Citation: Data
PubDate: 2023-07-02
DOI: 10.3390/data8070117
Issue No: Vol. 8, No. 7 (2023)
- Data, Vol. 8, Pages 118: A Semantically Annotated 15-Class Ground Truth
Dataset for Substation Equipment to Train Semantic Segmentation Models
Authors: Andreas Anael Pereira Gomes, Francisco Itamarati Secolo Ganacim, Fabiano Gustavo Silveira Magrin, Nara Bobko, Leonardo Göbel Fernandes, Anselmo Pombeiro, Eduardo Félix Ribeiro Romaneli
First page: 118
Abstract: The lack of annotated semantic segmentation datasets for electrical substations in the literature poses a significant problem for machine learning tasks; before training a model, a dataset is needed. This paper presents a new dataset of electric substations with 1660 images annotated with 15 classes, including insulators, disconnect switches, transformers and other equipment commonly found in substation environments. The images were captured using a combination of human, fixed and AGV-mounted cameras at different times of the day, providing a diverse set of training and testing data for algorithm development. In total, 50,705 annotations were created by a team of experienced annotators, using a standardized process to ensure accuracy across the dataset. The resulting dataset provides a valuable resource for researchers and practitioners working in the fields of substation automation, substation monitoring and computer vision. Its availability has the potential to advance the state of the art in this important area.
Citation: Data
PubDate: 2023-07-05
DOI: 10.3390/data8070118
Issue No: Vol. 8, No. 7 (2023)
- Data, Vol. 8, Pages 119: Proteomic Shift in Mouse Embryonic Fibroblasts
Pfa1 during Erastin, ML210, and BSO-Induced Ferroptosis
Authors: Olga M. Kudryashova, Alexey M. Nesterenko, Dmitry A. Korzhenevskii, Valeriy K. Sulyagin, Vasilisa M. Tereshchuk, Vsevolod V. Belousov, Arina G. Shokhina
First page: 119
Abstract: Ferroptosis is a unique variety of non-apoptotic cell death, driven by massive lipid oxidation in an iron-dependent manner. Since ferroptosis was introduced as a concept in 2012, it has demonstrated its essential role in the pathogenesis of neurodegenerative diseases and an important role in therapy-resistant cancer cells. Thus, detailed molecular understanding of both canonical and alternative ferroptosis pathways is required. There is a set of widely used chemical agents to modulate ferroptosis using different pathway targets: erastin blocks cystine–glutamate antiporter, system xc-; ML210 directly inactivates GPX4; and L-buthionine sulfoximine (BSO) inhibits γ-glutamylcysteine synthetase, an essential enzyme for glutathione synthesis de novo. Most studies have focused on the lipidomic profiling of model systems undergoing death in a ferroptotic modality. In this study, we developed high-quality shotgun proteome sequencing during ferroptosis induction by three widely used chemical agents (erastin, ML210, and BSO) before and after 24 and 48 h of treatment. Chromato-mass spectra were registered in DDA mode and are suitable for further label-free quantification. Both processed and raw files are publicly available and could be a valuable dynamic proteome map for further ferroptosis investigation.
Citation: Data
PubDate: 2023-07-12
DOI: 10.3390/data8070119
Issue No: Vol. 8, No. 7 (2023)
- Data, Vol. 8, Pages 120: PoPu-Data: A Multilayered, Simultaneously
Collected Lying Position Dataset
Authors: Luís Fonseca, Fernando Ribeiro, José Metrôlho, Adriana Santos, Rogério Dionisio, Mohammad Mohammad Amini, Arlindo F. Silva, Ahmad Reza Heravi, Davood Fanaei Sheikholeslami, Filipe Fidalgo, Francisco B. Rodrigues, Osvaldo Santos, Patrícia Coelho, Seyyed Sajjad Aemmi
First page: 120
Abstract: This study presents a dataset containing three layers of data that are useful for body position classification and all uses related to it. The PoPu dataset contains simultaneously collected data from two different sensor sheets—one placed over and one placed under a mattress; furthermore, a segmentation data layer was added where different body parts are identified using the pressure data from the sensors over the mattress. The data included were gathered from 60 healthy volunteers distributed among the different gathered characteristics: namely sex, weight, and height. This dataset can be used for position classification, assessing the viability of sensors placed under a mattress, and in applications regarding bedded or lying people or sleep related disorders.
Citation: Data
PubDate: 2023-07-16
DOI: 10.3390/data8070120
Issue No: Vol. 8, No. 7 (2023)
- Data, Vol. 8, Pages 121: Knowledge Discovery and Dataset for the
Improvement of Digital Literacy Skills in Undergraduate Students
Authors: Pongpon Nilaphruek, Pattama Charoenporn
First page: 121
Abstract: For over two decades, scholars and practitioners have emphasized the importance of digital literacy, yet the existing datasets are insufficient for establishing learning analytics in Thailand. Learning analytics focuses on gathering and analyzing student data to optimize learning tools and activities to improve students’ learning experiences. The main problem is that the ICT skill levels of the youth are rather low in Thailand. To facilitate research in this field, this study has compiled a dataset containing information from the IC3 digital literacy certification delivered at the Rajamangala University of Technology Thanyaburi (RMUTT) in Thailand between 2016 and 2023. This dataset is unique since it includes demographic and academic records about undergraduate students. The dataset was collected and underwent a preparation process, including data cleansing, anonymization, and release. The data, comprising 45,603 records with students’ certification assessment scores, enable the examination of student learning outcomes. This compiled dataset provides a rich resource for researchers studying digital literacy and learning analytics. It offers researchers the opportunity to gain valuable insights, inform evidence-based educational practices, and contribute to the ongoing efforts to improve digital literacy education in Thailand and beyond.
Citation: Data
PubDate: 2023-07-20
DOI: 10.3390/data8070121
Issue No: Vol. 8, No. 7 (2023)
- Data, Vol. 8, Pages 122: A Wavelet-Decomposed WD-ARMA-GARCH-EVT Model
Approach to Comparing the Riskiness of the BitCoin and South African Rand
Exchange Rates
Authors: Thabani Ndlovu, Delson Chikobvu
First page: 122
Abstract: In this paper, a hybrid of a Wavelet Decomposition–Generalised Auto-Regressive Conditional Heteroscedasticity–Extreme Value Theory (WD-ARMA-GARCH-EVT) model is applied to estimate the Value at Risk (VaR) of BitCoin (BTC/USD) and the South African Rand (ZAR/USD). The aim is to measure and compare the riskiness of the two currencies. New and improved estimation techniques for VaR have been suggested in the last decade in the aftermath of the global financial crisis of 2008. This paper aims to provide an improved alternative to the already existing statistical tools in estimating a currency VaR empirically. Maximal Overlap Discrete Wavelet Transform (MODWT) and two mother wavelet filters on the returns series are considered in this paper, viz., the Haar and Daubechies (d4). The findings show that BitCoin/USD is riskier than ZAR/USD since it has a higher VaR per unit invested in each currency. At the 99% significance level, BitCoin/USD has average values of VaR of 2.71% and 4.98% for the WD-ARMA-GARCH-GPD and WD-ARMA-GARCH-GEVD models, respectively; and this is slightly higher than the respective 2.69% and 3.59% for the ZAR/USD. The average BitCoin/USD returns of 0.001990 are higher than ZAR/USD returns of −0.000125. These findings are consistent with the mean-variance portfolio theory, which suggests a higher yield for riskier assets. Based on the p-values of the Kupiec likelihood ratio test, the hybrid model adequacy is largely accepted, as p-values are greater than 0.05, except for the WD-ARMA-GARCH-GEVD models at a 99% significance level for both currencies. The findings are helpful to financial risk practitioners and forex traders in formulating their diversification and hedging strategies and ascertaining the risk-adjusted capital requirement to be set aside as a cushion in the event of the occurrence of an actual loss.
Citation: Data
PubDate: 2023-07-24
DOI: 10.3390/data8070122
Issue No: Vol. 8, No. 7 (2023)
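The Kupiec likelihood ratio test referred to in the abstract above can be illustrated with a minimal sketch of its standard proportion-of-failures form. The function name, the breach counts, and the sample size below are hypothetical; none of them come from the paper. The p-value uses the fact that the statistic is asymptotically chi-square distributed with one degree of freedom.

```python
import math


def _xlogy(x, y):
    """Return x * log(y), with the convention that the term is 0 when x == 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_pof(n_obs, n_breaches, var_level):
    """Kupiec proportion-of-failures test for a VaR model.

    var_level is the VaR confidence level (for example 0.99), so the
    expected breach probability under the null hypothesis is 1 - var_level.
    Returns the likelihood ratio statistic and its chi-square(1) p-value.
    """
    p = 1.0 - var_level
    p_hat = n_breaches / n_obs
    n_ok = n_obs - n_breaches
    log_null = _xlogy(n_ok, 1.0 - p) + _xlogy(n_breaches, p)
    log_alt = _xlogy(n_ok, 1.0 - p_hat) + _xlogy(n_breaches, p_hat)
    lr = max(-2.0 * (log_null - log_alt), 0.0)  # guard against rounding below zero
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Hypothetical back-test: 1000 days, 99% VaR, two possible breach counts.
for breaches in (10, 25):
    lr, p_value = kupiec_pof(1000, breaches, 0.99)
    verdict = "adequate" if p_value > 0.05 else "rejected"
    print(f"{breaches} breaches: LR = {lr:.2f}, p = {p_value:.4f}, model {verdict}")
```

As in the abstract, a p-value above 0.05 means the observed number of breaches is statistically compatible with the nominal VaR level, so the model's adequacy is not rejected.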
- Data, Vol. 8, Pages 93: Target Screening of Chemicals of Emerging Concern
(CECs) in Surface Waters of the Swedish West Coast
Authors: Pedro A. Inostroza, Eric Carmona, Åsa Arrhenius, Martin Krauss, Werner Brack, Thomas Backhaus
First page: 93
Abstract: The aquatic environment faces increasing threats from a variety of unregulated organic chemicals originating from human activities, collectively known as chemicals of emerging concern (CECs). These include pharmaceuticals, personal-care products, pesticides, surfactants, industrial chemicals, and their transformation products. CECs enter aquatic environments through various sources, including effluents from wastewater treatment plants, industrial facilities, runoff from agricultural and residential areas, as well as accidental spills. Data on the occurrence of CECs in the marine environment are scarce, and more information is needed to assess the chemical and ecological status of water bodies, and to prioritize toxic chemicals for further studies or risk assessment. In this study, we describe a monitoring campaign targeting CECs in surface waters at the Swedish west coast using, for the first time, an on-site large volume solid phase extraction (LVSPE) device. We detected up to 80 and 227 CECs in marine sites and the wastewater treatment plant (WWTP) effluent, respectively. The dataset will contribute to defining pollution fingerprints and assessing the chemical status of marine and freshwater systems affected by industrial hubs, agricultural areas, and the discharge of urban wastewater.
Citation: Data
PubDate: 2023-05-25
DOI: 10.3390/data8060093
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 94: MicroRNA Profiling of Fresh Lung Adenocarcinoma
and Adjacent Normal Tissues from Ten Korean Patients Using miRNA-Seq
Authors: Jihye Park, Sae Jung Na, Jung Sook Yoon, Seoree Kim, Sang Hoon Chun, Jae Jun Kim, Young-Du Kim, Young-Ho Ahn, Keunsoo Kang, Yoon Ho Ko
First page: 94
Abstract: MicroRNA transcriptomes from fresh tumors and the adjacent normal tissues were profiled in 10 Korean patients diagnosed with lung adenocarcinoma using a next-generation sequencing (NGS) technique called miRNA-seq. The sequencing quality was assessed using FastQC, and low-quality or adapter-contaminated portions of the reads were removed using Trim Galore. Quality-assured reads were analyzed using miRDeep2 and Bowtie. The abundance of known miRNAs was estimated using the reads per million (RPM) normalization method. Subsequently, using DESeq2 and Wx, we identified differentially expressed miRNAs and potential miRNA biomarkers for lung adenocarcinoma tissues compared to adjacent normal tissues, respectively. We defined reliable miRNA biomarkers for lung adenocarcinoma as those detected by both methods. The miRNA-seq data are available in the Gene Expression Omnibus (GEO) database under accession number GSE196633, and all processed data can be accessed via the Mendeley data website.
Citation: Data
PubDate: 2023-05-25
DOI: 10.3390/data8060094
Issue No: Vol. 8, No. 6 (2023)
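The reads per million (RPM) normalization named in the abstract above can be sketched in a few lines. The sketch assumes the denominator is the total number of reads assigned to the miRNAs in the sample; pipelines differ on this choice (some use all mapped reads), and the abstract does not say which one was used. The miRNA labels and counts are made-up placeholders.

```python
def reads_per_million(counts):
    """Scale raw read counts so that each sample's counts sum to one million.

    counts maps a miRNA identifier to its raw read count in one sample.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize a sample with zero reads")
    return {name: count * 1_000_000 / total for name, count in counts.items()}


# Placeholder counts for one sample (not real data).
raw_counts = {"mirna_A": 500, "mirna_B": 1500, "mirna_C": 8000}
rpm = reads_per_million(raw_counts)
for name, value in rpm.items():
    print(f"{name}: {value:.0f} RPM")
```

Because RPM only corrects for sequencing depth, it makes abundances comparable across samples; differential expression tools such as DESeq2 apply their own normalization to raw counts.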
- Data, Vol. 8, Pages 95: A Dataset of Scalp EEG Recordings of
Alzheimer’s Disease, Frontotemporal Dementia and Healthy Subjects
from Routine EEG
Authors: Andreas Miltiadous, Katerina D. Tzimourta, Theodora Afrantou, Panagiotis Ioannidis, Nikolaos Grigoriadis, Dimitrios G. Tsalikakis, Pantelis Angelidis, Markos G. Tsipouras, Euripidis Glavas, Nikolaos Giannakeas, Alexandros T. Tzallas
First page: 95
Abstract: Recently, there has been a growing research interest in utilizing the electroencephalogram (EEG) as a non-invasive diagnostic tool for neurodegenerative diseases. This article provides a detailed description of a resting-state EEG dataset of individuals with Alzheimer’s disease and frontotemporal dementia, and healthy controls. The dataset was collected using a clinical EEG system with 19 scalp electrodes while participants were in a resting state with their eyes closed. The data collection process included rigorous quality control measures to ensure data accuracy and consistency. The dataset contains recordings of 36 Alzheimer’s patients, 23 frontotemporal dementia patients, and 29 healthy age-matched subjects. For each subject, the Mini-Mental State Examination score is reported. A monopolar montage was used to collect the signals. Both raw and preprocessed EEG recordings are included in the standard BIDS format. For the preprocessed signals, established methods such as artifact subspace reconstruction and independent component analysis have been employed for denoising. The dataset has significant reuse potential since Alzheimer’s EEG Machine Learning studies are increasing in popularity and there is a lack of publicly available EEG datasets. The resting-state EEG data can be used to explore alterations in brain activity and connectivity in these conditions, and to develop new diagnostic and treatment approaches. Additionally, the dataset can be used to compare EEG characteristics between different types of dementia, which could provide insights into the underlying mechanisms of these conditions.
Citation: Data
PubDate: 2023-05-27
DOI: 10.3390/data8060095
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 96: Exploring the Evolution of Sentiment in Spanish
Pandemic Tweets: A Data Analysis Based on a Fine-Tuned BERT Architecture
Authors: Carlos Henríquez Miranda, German Sanchez-Torres, Dixon Salcedo
First page: 96
Abstract: The COVID-19 pandemic has had a significant impact on various aspects of society, including economic, health, political, and work-related domains. The pandemic has also caused an emotional effect on individuals, reflected in their opinions and comments on social media platforms, such as Twitter. This study explores the evolution of sentiment in Spanish pandemic tweets through a data analysis based on a fine-tuned BERT architecture. A total of six million tweets were collected using web scraping techniques, and pre-processing was applied to filter and clean the data. The fine-tuned BERT architecture was utilized to perform sentiment analysis, which allowed for a deep-learning approach to sentiment classification. The analysis results were graphically represented based on search criteria, such as “COVID-19” and “coronavirus”. This study reveals sentiment trends, significant concerns, relationship with announced news, public reactions, and information dissemination, among other aspects. These findings provide insight into the emotional impact of the COVID-19 pandemic on individuals and the corresponding impact on social media platforms.
Citation: Data
PubDate: 2023-05-29
DOI: 10.3390/data8060096
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 97: A Fast Deep Learning ECG Sex Identifier Based on
Wavelet RGB Image Classification
Authors: Jose-Luis Cabra Lopez, Carlos Parra, Gonzalo Forero
First page: 97
Abstract: Human sex recognition with electrocardiogram signals is an emerging area in machine learning, mostly oriented toward neural network approaches. It might be the beginning of a field of heart behavior analysis focused on sex. However, a person’s heartbeat changes during daily activities, which could compromise the classification. In this paper, with the intention of capturing heartbeat dynamics, we divided the heart rate into different intervals, creating a specialized identification model for each interval. The sexual differentiation for each model was performed with a deep convolutional neural network from images that represented the RGB wavelet transformation of ECG pseudo-orthogonal X, Y, and Z signals, using sufficient samples to train the network. Our database included 202 people, with a female-to-male population ratio of 49.5–50.5% and an observation period of 24 h per person. As our main goal, we looked for periods of time during which the classification rate of sex recognition was higher and the process was faster; in fact, we identified intervals in which only one heartbeat was required. We found that for each heart rate interval, the best accuracy score varied depending on the number of heartbeats collected. Furthermore, our findings indicated that as the heart rate increased, fewer heartbeats were needed for analysis. On average, our proposed model reached an accuracy of 94.82% ± 1.96%. The findings of this investigation provide a heartbeat acquisition procedure for ECG sex recognition systems. In addition, our results encourage future research to include sex as a soft biometric characteristic in person identification scenarios and for cardiology studies, in which the detection of specific male or female anomalies could help autonomous learning machines move toward specialized health applications.
Citation: Data
PubDate: 2023-05-29
DOI: 10.3390/data8060097
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 98: Unmanned Aerial Vehicle (UAV) and Spectral
Datasets in South Africa for Precision Agriculture
Authors: Cilence Munghemezulu, Zinhle Mashaba-Munghemezulu, Phathutshedzo Eugene Ratshiedana, Eric Economon, George Chirima, Sipho Sibanda
First page: 98
Abstract: Remote sensing data play a crucial role in precision agriculture and natural resource monitoring. The use of unmanned aerial vehicles (UAVs) can provide solutions to challenges faced by farmers and natural resource managers due to their high spatial resolution and flexibility compared to satellite remote sensing. This paper presents UAV and spectral datasets collected from different provinces in South Africa, covering different crops at the farm level as well as natural resources. UAV datasets consist of five multispectral bands corrected for atmospheric effects using the PIX4D mapper software to produce surface reflectance images. The spectral datasets are filtered using a Savitzky–Golay filter and corrected using Multiplicative Scatter Correction (MSC). The first and second derivatives and the Continuous Wavelet Transform (CWT) spectra are also calculated. These datasets can provide baseline information for developing solutions for precision agriculture and natural resource challenges. For example, UAV and spectral data of different crop fields captured at spatial and temporal resolutions can contribute towards calibrating satellite images, thus improving the accuracy of the derived satellite products.
Citation: Data
PubDate: 2023-05-30
DOI: 10.3390/data8060098
Issue No: Vol. 8, No. 6 (2023)
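The multiplicative scatter correction named in the UAV and spectral datasets abstract above can be illustrated with a small sketch. This is not the authors' pipeline: it assumes the common textbook form of MSC, in which each spectrum is regressed on the mean spectrum and then rescaled by the fitted intercept and slope. The function name `msc` and the sample values are hypothetical.

```python
def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    Each spectrum s is modelled as s = intercept + slope * reference and
    corrected to (s - intercept) / slope. Assumes non-constant spectra.
    """
    n = len(spectra)
    bands = len(spectra[0])
    # Reference spectrum: band-wise mean over all samples.
    ref = [sum(s[k] for s in spectra) / n for k in range(bands)]
    ref_mean = sum(ref) / bands
    ref_var = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / bands
        # Ordinary least-squares fit of the spectrum on the reference.
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / ref_var
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


# Hypothetical example: the second spectrum is a scaled and shifted copy of the first.
base = [1.0, 2.0, 4.0, 3.0]
scattered = [2.0 * v + 0.5 for v in base]
result = msc([base, scattered])
```

After correction, both spectra in this toy example collapse onto the same curve, which is the effect MSC is meant to achieve for purely additive and multiplicative scatter.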
- Data, Vol. 8, Pages 99: Classification of Cocoa Pod Maturity Using
Similarity Tools on an Image Database: Comparison of Feature Extractors
and Color Spaces
Authors: Kacoutchy Jean Ayikpa, Diarra Mamadou, Pierre Gouton, Kablan Jérôme Adou
First page: 99
Abstract: Côte d’Ivoire, the world’s largest cocoa producer, faces the challenge of quality production. Immature or overripe pods cannot produce quality cocoa beans, resulting in losses and an unprofitable harvest. To help farmer cooperatives determine the maturity of cocoa pods in time, our study evaluates the use of automation tools based on similarity measures. Although standard techniques, such as visual inspection and weighing, are commonly used to identify the maturity of cocoa pods, the use of automation tools based on similarity measures can improve the efficiency and accuracy of this process. We set up a database of cocoa pod images and used two feature extractors: one based on convolutional neural networks (CNN), in particular, MobileNet, and the other based on texture analysis using a gray-level co-occurrence matrix (GLCM). We evaluated the impact of different color spaces and feature extraction methods on our database. We used mathematical similarity measurement tools, such as the Euclidean distance, correlation distance, and chi-square distance, to classify cocoa pod images. Our experiments showed that the chi-square distance measurement offered the best accuracy, with a score of 99.61%, when we used GLCM as a feature extractor and the Lab color space. Using automation tools based on similarity measures can improve the efficiency and accuracy of cocoa pod maturity determination. The results of our experiments prove that the chi-square distance is the most appropriate measure of similarity for this task.
Citation: Data
PubDate: 2023-05-30
DOI: 10.3390/data8060099
Issue No: Vol. 8, No. 6 (2023)
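The similarity-based classification described in the cocoa pod maturity abstract above can be sketched as a nearest-reference lookup under the chi-square distance. This is a minimal illustration, not the authors' implementation: the feature vectors, the class labels, and the 1/2 scaling factor (one of several conventions for this distance) are assumptions.

```python
def chi_square_distance(p, q, eps=1e-12):
    """Chi-square distance between two non-negative feature vectors."""
    if len(p) != len(q):
        raise ValueError("feature vectors must have the same length")
    # eps guards against division by zero when both entries are zero.
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))


def classify(query, references):
    """Return the label of the reference vector closest to `query`."""
    best_label, _ = min(
        ((label, chi_square_distance(query, vec)) for label, vec in references),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical normalized texture histograms, one per class.
references = [
    ("immature", [0.70, 0.20, 0.10]),
    ("mature", [0.20, 0.50, 0.30]),
    ("overripe", [0.10, 0.20, 0.70]),
]
label = classify([0.25, 0.45, 0.30], references)
```

In practice the vectors would come from a feature extractor such as GLCM statistics or CNN embeddings, and a query would typically be compared against many labeled images rather than one per class.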
- Data, Vol. 8, Pages 100: Progress in the Cost-Optimal Methodology
Implementation in Europe: Datasets Insights and Perspectives in Member
States
Authors: Paolo Zangheri, Delia D’Agostino, Roberto Armani, Carmen Maduta, Paolo Bertoldi
First page: 100
Abstract: This data article relates to the paper “Review of the cost-optimal methodology implementation in Member States in compliance with the Energy Performance of Buildings Directive”. Datasets linked with this article refer to the analysis of the latest national cost-optimal reports, providing an assessment of the implementation of the cost-optimal methodology, as established by the Energy Performance of Buildings Directive (EPBD). Based on the latest national reports, the data provide a comprehensive update on the cost-optimal methodology implementation throughout Europe, which is currently lacking harmonization. Datasets give an overview of the status of the cost-optimal methodology implementation in Europe with details on the calculations carried out (e.g., multi-stage, dynamic, macroeconomic, and financial perspectives, included energy uses, and full-cost approach). Data relate to the implemented methodology, reference buildings, assessed cost-optimal levels, energy performance, costs, and sensitivity analysis. Data also provide insight into energy consumption, efficiency measures for residential and non-residential buildings, nearly zero energy buildings (NZEBs) levels, and global costs. The reported data can be useful to quantify the cost-optimal levels for different building types, both residential (average cost-optimal level 80 kWh/m²y for new, 130 kWh/m²y for existing buildings) and non-residential buildings (140 kWh/m²y for new, 180 kWh/m²y for existing buildings). Data outline weak and strong points of the methodology, as well as future developments in the light of the methodology revision foreseen in 2026. The data support energy efficiency and energy policies related to buildings toward the EU building stock decarbonization goal by 2050.
Citation: Data
PubDate: 2023-05-31
DOI: 10.3390/data8060100
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 101: Labelled Indoor Point Cloud Dataset for BIM
Related Applications
Authors: Nuno Abreu, Rayssa Souza, Andry Pinto, Anibal Matos, Miguel Pires
First page: 101
Abstract: BIM (building information modelling) has gained wider acceptance in the AEC (architecture, engineering, and construction) industry. Conversion from 3D point cloud data to vector BIM data remains a challenging and labour-intensive process, but particularly relevant during various stages of a project lifecycle. While the challenges associated with processing very large 3D point cloud datasets are widely known, there is a pressing need for intelligent geometric feature extraction and reconstruction algorithms for automated point cloud processing. Compared to outdoor scene reconstruction, indoor scenes are challenging since they usually contain high amounts of clutter. This dataset comprises the indoor point cloud obtained by scanning four different rooms (including a hallway): two office workspaces, a workshop, and a laboratory including a water tank. The scanned space is located at the Electrical and Computer Engineering department of the Faculty of Engineering of the University of Porto. The dataset is fully labelled, containing major structural elements like walls, floor, ceiling, windows, and doors, as well as furniture, movable objects, clutter, and scanning noise. The dataset also contains an as-built BIM that can be used as a reference, making it suitable for being used in Scan-to-BIM and Scan-vs-BIM applications. For demonstration purposes, a Scan-vs-BIM change detection application is described, detailing each of the main data processing steps.
Citation: Data
PubDate: 2023-06-01
DOI: 10.3390/data8060101
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 102: A Self-Attention-Based Imputation Technique for
Enhancing Tabular Data Quality
Authors: Do-Hoon Lee, Han-joon Kim
First page: 102
Abstract: Recently, data-driven decision-making has attracted great interest; this requires high-quality datasets. However, real-world datasets often feature missing values for unknown or intentional reasons, rendering data-driven decision-making inaccurate. If a machine learning model is trained using incomplete datasets with missing values, the inferred results may be biased. In this case, a commonly used technique is the missing value imputation (MVI), which fills missing data with possible values estimated based on observed values. Various data imputation methods using machine learning, statistical inference, and relational database theories have been developed. Among them, conventional machine learning based imputation methods that handle tabular data can deal with only numerical columns or are time-consuming and cumbersome because they create an individualized predictive model for each column. Therefore, we have developed a novel imputational neural network that we term the Denoising Self-Attention Network (DSAN). Our proposed DSAN can deal with tabular datasets containing both numerical and categorical columns; it considers discretized numerical values as categorical values for embedding and self-attention layers. Furthermore, the DSAN learns robust feature expression vectors by combining self-attention and denoising techniques, and can predict multiple, appropriate substituted values simultaneously (via multi-task learning). To verify the validity of the method, we performed data imputation experiments after arbitrarily generating missing values for several real-world tabular datasets. We evaluated both imputational and downstream task performances, and we have seen that the DSAN outperformed the other models, especially in terms of category variable imputation.
Citation: Data
PubDate: 2023-06-04
DOI: 10.3390/data8060102
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 103: Physico-Chemical Quality and Physiological
Profiles of Microbial Communities in Freshwater Systems of Mega Manila,
Philippines
Authors: Marie Christine M. Obusan, Arizaldo E. Castro, Ren Mark D. Villanueva, Margareth Del E. Isagan, Jamaica Ann A. Caras, Jessica F. Simbahan
First page: 103
Abstract: Studying the quality of freshwater systems and drinking water in highly urbanized megalopolises around the world remains a challenge. This article reports data on the quality of select freshwater systems in Mega Manila, Philippines. Water samples collected between 2020 and 2021 were analyzed for physico-chemical parameters and microbial community metabolic fingerprints, i.e., carbon substrate utilization patterns (CSUPs). The detection of arsenic, lead, cadmium, mercury, polyaromatic hydrocarbons (PAHs), and organochlorine pesticides (OCPs) was carried out using standard chromatography- and spectroscopy-based protocols. Physiological profiles were determined using the Biolog EcoPlate™ system. Eight samples were free of heavy metals, and none contained PAHs or OCPs. Fourteen samples had high microbial activity, as indicated by average well color development (AWCD) and community metabolic diversity (CMD) values. Community-level physiological profiling (CLPP) revealed that (1) samples clustered as groups according to shared CSUPs, and (2) microbial communities in non-drinking samples actively utilized all six substrate classes compared to drinking samples. The data reported here can provide a baseline or a comparator for prospective quality assessments of drinking water and freshwater sources in the region. Metabolic fingerprinting using CSUPs is a simple and cheap phenotypic analysis of microbial communities and their physiological activity in aquatic environments.
Citation: Data
PubDate: 2023-06-04
DOI: 10.3390/data8060103
Issue No: Vol. 8, No. 6 (2023)
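The average well color development (AWCD) value mentioned in the Mega Manila freshwater abstract above is usually computed as the mean control-corrected absorbance over the substrate wells of a plate. The sketch below assumes that common definition, including the frequent convention of clipping negative corrected values to zero; the abstract does not state which variant the authors used, and the function name and readings are hypothetical.

```python
def awcd(well_od, control_od):
    """Average well color development for one plate reading.

    well_od: optical densities of the substrate wells.
    control_od: optical density of the blank (water) control well.
    """
    # Subtract the control and clip negatives to zero (assumed convention).
    corrected = [max(od - control_od, 0.0) for od in well_od]
    return sum(corrected) / len(corrected)


# Hypothetical readings for three substrate wells and one control well.
value = awcd([0.5, 0.3, 0.1], control_od=0.2)
```

A full EcoPlate reading would pass 31 substrate wells per replicate instead of the three shown here.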
- Data, Vol. 8, Pages 104: Comparison of ARIMA and LSTM in Predicting
Structural Deformation of Tunnels during Operation Period
Authors: Chuangfeng Duan, Min Hu, Haozuan Zhang
First page: 104
Abstract: Accurately predicting the structural deformation trend of tunnels during operation is important for making tunnel safety maintenance more rigorous. With the development of data science, structural deformation prediction methods based on time-series data have attracted attention. The autoregressive integrated moving average (ARIMA) model is a classical statistical analysis model that is suitable for processing non-stationary time-series data. Long short-term memory (LSTM) is a special recurrent neural network that can learn long-term dependencies in time series. Both are widely used in the field of temporal prediction. In view of the lack of time-series prediction in the tunnel deformation field, this paper uses historical data of the Xinjian Road and the Dalian Road tunnels in Shanghai to propose a new way of modeling based on single points and road sections. ARIMA and LSTM models are applied in comprehensive experiments, and the results show that: (1) Both LSTM and ARIMA models perform well for settlement and convergence deformation. (2) The overall robustness of ARIMA is better than that of LSTM, and it is more adaptable to the datasets. (3) The model prediction performance is closely related to the data quality. ARIMA performs more stably when data volume is limited, while LSTM performs better with high-quality data and has a higher upper limit.
Citation: Data
PubDate: 2023-06-13
DOI: 10.3390/data8060104
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 105: Assessing the Effectiveness of Masking and
Encryption in Safeguarding the Identity of Social Media Publishers from
Advanced Metadata Analysis
Authors: Mohammed Khader, Marcel Karam
First page: 105
Abstract: Machine learning algorithms, such as KNN, SVM, MLP, RF, and MLR, are used to extract valuable information from shared digital data on social media platforms through their APIs in an effort to identify anonymous publishers or online users. This can leave these anonymous publishers vulnerable to privacy-related attacks, as identifying information can be revealed. Twitter is an example of such a platform where identifying anonymous users/publishers is made possible by using machine learning techniques. To provide these anonymous users with stronger protection, we have examined the effectiveness of these techniques when critical fields in the metadata are masked or encrypted, using tweets (text and images) from Twitter. Our results show that SVM achieved the highest accuracy rate of 95.81% without data masking or encryption, while the highest identity recognition rate, also achieved by SVM, fell to 50.24% when data masking and the AES encryption algorithm were used. This indicates that data masking and encryption of the metadata of tweets (text and images) can provide promising protection for the anonymity of users’ identities.
Citation: Data
PubDate: 2023-06-13
DOI: 10.3390/data8060105
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 106: Curated Dataset for Red Blood Cell Tracking from
Video Sequences of Flow in Microfluidic Devices
Authors: Ivan Cimrák, Peter Tarábek, František Kajánek
First page: 106
Abstract: This work presents a dataset comprising images, annotations, and velocity fields for benchmarking cell detection and cell tracking algorithms. The dataset includes two video sequences captured during laboratory experiments, showcasing the flow of red blood cells (RBC) in microfluidic channels. From the first video 300 frames and from the second video 150 frames are annotated with bounding boxes around the cells, as well as tracks depicting the movement of individual cells throughout the video. The dataset encompasses approximately 20,000 bounding boxes and 350 tracks. Additionally, computational fluid dynamics simulations were utilized to generate 2D velocity fields representing the flow within the channels. These velocity fields are included in the dataset. The velocity field has been employed to improve cell tracking by predicting the positions of cells across frames. The paper also provides a comprehensive discussion on the utilization of the flow matrix in the tracking steps.
Citation: Data
PubDate: 2023-06-13
DOI: 10.3390/data8060106
Issue No: Vol. 8, No. 6 (2023)
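The idea in the red blood cell tracking abstract above, using a velocity field to predict where a cell will appear in the next frame, can be sketched as follows. This is only an illustration under simple assumptions: the field is a hypothetical 2D grid of (vx, vy) pairs in pixels per frame, the lookup uses the nearest grid cell, and the step is a single explicit Euler update. The paper's actual use of the flow matrix may differ.

```python
def predict_position(pos, field, dt=1.0):
    """Advance (x, y) by the velocity sampled at the nearest grid cell."""
    x, y = pos
    rows, cols = len(field), len(field[0])
    # Clamp the lookup index so positions at the border stay inside the grid.
    i = min(max(int(round(y)), 0), rows - 1)
    j = min(max(int(round(x)), 0), cols - 1)
    vx, vy = field[i][j]
    return (x + vx * dt, y + vy * dt)


# Hypothetical uniform flow of 2 pixels per frame to the right on a 3x4 grid.
field = [[(2.0, 0.0)] * 4 for _ in range(3)]
predicted = predict_position((1.0, 1.0), field)
```

A tracker could then match each predicted position to the nearest detected bounding box in the following frame, which narrows the search compared with matching on raw positions alone.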
- Data, Vol. 8, Pages 107: A Preliminary Investigation of a Single Shock
Impact on Italian Mortality Rates Using STMF Data: A Case Study of
COVID-19
Authors: Maria Francesca Carfora, Albina Orlando
First page: 107
Abstract: Mortality shocks, such as pandemics, threaten the consolidated longevity improvements, confirmed in the last decades for the majority of western countries. Indeed, just before the COVID-19 pandemic, mortality was falling for all ages, with a different behavior according to different ages and countries. It is indubitable that the changes in the population longevity induced by shock events, even transitory ones, affecting demographic projections, have financial implications in public spending as well as in pension plans and life insurance. The Short Term Mortality Fluctuations (STMF) data series, providing data of all-cause mortality fluctuations by week within each calendar year for 38 countries worldwide, offers a powerful tool to timely analyze the effects of the mortality shock caused by the COVID-19 pandemic on Italian mortality rates. This dataset, recently made available as a new component of the Human Mortality Database, is described and techniques for the integration of its data with the historical mortality time series are proposed. Then, to forecast mortality rates, the well-known stochastic mortality model proposed by Lee and Carter in 1992 is first considered, to be consistent with the internal processing of the Human Mortality Database, where exposures are estimated by the Lee–Carter model; empirical results are discussed both on the estimation of the model coefficients and on the forecast of the mortality rates. In detail, we show how the integration of the yearly aggregated STMF data in the HMD database allows the Lee–Carter model to capture the complex evolution of the Italian mortality rates, including the higher lethality for males and older people, in the years that follow a large shock event such as the COVID-19 pandemic. Finally, we discuss some key points concerning the improvement of existing models to take into account mortality shocks and evaluate their impact on future mortality dynamics.
Citation: Data
PubDate: 2023-06-13
DOI: 10.3390/data8060107
Issue No: Vol. 8, No. 6 (2023)
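For reference, the Lee–Carter model named in the abstract above is usually written in the following standard form, where m_{x,t} is the central death rate at age x in year t. The notation is the conventional one and is not taken from the paper itself; the random walk with drift for k_t is the usual forecasting choice and is an assumption here.

```latex
\ln m_{x,t} = a_x + b_x k_t + \varepsilon_{x,t},
\qquad \sum_x b_x = 1, \qquad \sum_t k_t = 0

k_t = k_{t-1} + d + e_t
```

Here a_x is the average log-mortality profile by age, k_t is the time index of the overall mortality level, b_x measures how strongly each age responds to changes in k_t, and d is the drift used when forecasting k_t.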
- Data, Vol. 8, Pages 108: How Expert Is the Crowd? Insights into Crowd
Opinions on the Severity of Earthquake Damage
Authors: Motti Zohar, Amos Salamon, Carmit Rapaport
First page: 108
Abstract: The evaluation of earthquake damage is central to assessing its severity and damage characteristics. However, the methods of assessment encounter difficulties concerning the subjective judgments and interpretation of the evaluators. Thus, it is mainly geologists, seismologists, and engineers who perform this exhausting task. Here, we explore whether an evaluation made by semiskilled people and by the crowd is equivalent to the experts’ opinions and, thus, can be harnessed as part of the process. Therefore, we conducted surveys in which a cohort of graduate students studying natural hazards (n = 44) and an online crowd (n = 610) were asked to evaluate the level of severity of earthquake damage. The two outcome datasets were then compared with the evaluation made by two of the present authors, who are considered experts in the field. Interestingly, the evaluations of both the semiskilled cohort and the crowd were found to be fairly similar to those of the experts, thus suggesting that they can provide an interpretation close enough to an expert’s opinion on the severity level of earthquake damage. Such an understanding may indicate that although our analysis is preliminary and requires more case studies for this to be verified, there is vast potential encapsulated in crowd-sourced opinion on simple earthquake-related damage, especially if a large amount of data is to be handled.
Citation: Data
PubDate: 2023-06-14
DOI: 10.3390/data8060108
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 109: Dataset of Program Source Codes Solving Unique
Programming Exercises Generated by Digital Teaching Assistant
Authors: Liliya A. Demidova, Elena G. Andrianova, Peter N. Sovietov, Artyom V. Gorchakov
First page: 109
Abstract: This paper presents a dataset containing automatically collected source codes solving unique programming exercises of different types. The programming exercises were automatically generated by the Digital Teaching Assistant (DTA) system that automates a massive Python programming course at MIREA—Russian Technological University (RTU MIREA). Source codes of the small programs grouped by the type of the solved task can be used for benchmarking source code classification and clustering algorithms. Moreover, the data can be used for training intelligent program synthesizers or benchmarking mutation testing frameworks, and more applications are yet to be discovered. We describe the architecture of the DTA system, aiming to provide detailed insight regarding how and why the dataset was collected. In addition, we describe the algorithms responsible for source code analysis in the DTA system. These algorithms use vector representations of programs based on Markov chains, compute pairwise Jensen–Shannon divergences of programs, and apply hierarchical clustering algorithms in order to automatically discover high-level concepts used by students while solving unique tasks. The proposed approach can be incorporated into massive programming courses when there is a need to identify approaches implemented by students.
Citation: Data
PubDate: 2023-06-14
DOI: 10.3390/data8060109
Issue No: Vol. 8, No. 6 (2023)
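The analysis outlined in the programming exercises abstract above, Markov-chain-based program vectors compared by Jensen–Shannon divergence, can be sketched in a simplified form. The sketch assumes programs are already reduced to token sequences and represents each one by the normalized counts of adjacent token pairs; the DTA system's actual representation is not specified in the abstract, and all names and token lists here are hypothetical.

```python
import math
from collections import Counter


def transition_distribution(tokens):
    """Normalized counts of adjacent token pairs (a first-order Markov view)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of supports.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical token sequences for two solutions to the same exercise.
loop_style = ["def", "for", "if", "append", "return"]
comprehension_style = ["def", "return", "listcomp", "if"]
d_same = js_divergence(
    transition_distribution(loop_style), transition_distribution(loop_style)
)
d_diff = js_divergence(
    transition_distribution(loop_style), transition_distribution(comprehension_style)
)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint supports), so a matrix of pairwise values can be passed directly to a hierarchical clustering routine as a dissimilarity measure.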
- Data, Vol. 8, Pages 110: Deep Learning-Based Black Spot Identification on
Greek Road Networks
Authors: Ioannis Karamanlis, Alexandros Kokkalis, Vassilios Profillidis, George Botzoris, Chairi Kiourt, Vasileios Sevetlidis, George Pavlidis
First page: 110
Abstract: Black spot identification, a spatiotemporal phenomenon, involves analysing the geographical location and time-based occurrence of road accidents. Typically, this analysis examines specific locations on road networks during set time periods to pinpoint areas with a higher concentration of accidents, known as black spots. By evaluating these problem areas, researchers can uncover the underlying causes and reasons for increased collision rates, such as road design, traffic volume, driver behaviour, weather, and infrastructure. However, challenges in identifying black spots include limited data availability, data quality, and assessing contributing factors. Additionally, evolving road design, infrastructure, and vehicle safety technology can affect black spot analysis and determination. This study focused on traffic accidents in Greek road networks to recognize black spots, utilizing data from police and government-issued car crash reports. The study produced a publicly available dataset called Black Spots of North Greece (BSNG) and a highly accurate identification method.
Citation: Data
PubDate: 2023-06-16
DOI: 10.3390/data8060110
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 111: Self-Reported Mental Health and Psychosocial
Correlates during the COVID-19 Pandemic: Data from the General Population
in Italy
Authors: Daniela Marchetti, Roberta Maiella, Rocco Palumbo, Melissa D’Ettorre, Irene Ceccato, Marco Colasanti, Adolfo Di Crosta, Pasquale La Malva, Emanuela Bartolini, Daniela Biasone, Nicola Mammarella, Piero Porcelli, Alberto Di Domenico, Maria Cristina Verrocchio
First page: 111
Abstract: The COVID-19 pandemic tremendously impacted people’s day-to-day activities and mental health. This article describes the dataset used to investigate the psychological impact of the first national lockdown on the general Italian population. For this purpose, an online survey was disseminated via Qualtrics between 1 April and 20 April 2020, to record various socio-demographic and psychological variables. The measures included both validated (namely, the Impact of the Event Scale-Revised, the Perceived Stress Scale, the nine-item Patient Health Questionnaire, the seven-item Generalized Anxiety Disorder scale, the Big Five Inventory 10-Item, and the Whiteley Index-7) and ad hoc questionnaires (nine items to investigate in-group and out-group trust). The final sample comprised 4081 participants (18–85 years old). The dataset could be helpful to other researchers in understanding the psychological impact of the COVID-19 pandemic and its related preventive and protective measures. Furthermore, the present data might help shed some light on the role of individual differences in response to traumatic events. Finally, this dataset can increase the knowledge in investigating psychological distress, health anxiety, and personality traits.
Citation: Data
PubDate: 2023-06-16
DOI: 10.3390/data8060111
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 112: RipSetCocoaCNCH12: Labeled Dataset for Ripeness
Stage Detection, Semantic and Instance Segmentation of Cocoa Pods
Authors: Juan Felipe Restrepo-Arias, María Isabel Salinas-Agudelo, María Isabel Hernandez-Pérez, Alejandro Marulanda-Tobón, María Camila Giraldo-Carvajal
First page: 112
Abstract: Fruit counting and ripeness detection are computer vision applications that have gained strength in recent years due to the advancement of new algorithms, especially those based on artificial neural networks (ANNs), better known as deep learning. In agriculture, those algorithms capable of fruit counting, including information about their ripeness, are mainly applied to make production forecasts or plan different activities such as fertilization or crop harvest. This paper presents the RipSetCocoaCNCH12 dataset of cocoa pods labeled at four different ripeness stages: stage 1 (0–2 months), stage 2 (2–4 months), stage 3 (4–6 months), and harvest stage (>6 months). An additional class was also included for pods aborted by plants in the early stage of development. A total of 4116 images were labeled to train algorithms that mainly perform semantic and instance segmentation. The labeling was carried out with CVAT (Computer Vision Annotation Tool). The dataset, therefore, includes labeling in two formats: COCO 1.0 and segmentation mask 1.1. The images were taken with different mobile devices (smartphones), in field conditions, during the harvest season at different times of the day, which could allow the algorithms to be trained with data that includes many variations in lighting, colors, textures, and sizes of the cocoa pods. As far as we know, this is the first openly available dataset for cocoa pod detection with semantic segmentation for five classes, 4116 images, and 7917 instances, comprising RGB images and two different formats for labels. With the publication of this dataset, we expect that researchers in smart farming, especially in cocoa cultivation, can benefit from the quantity and variety of images it contains.
Citation: Data
PubDate: 2023-06-18
DOI: 10.3390/data8060112
Issue No: Vol. 8, No. 6 (2023)
- Data, Vol. 8, Pages 81: Dataset of Fluorescence EEM and UV Spectroscopy
Data of Olive Oils during Ageing
Authors: Francesca Venturini, Silvan Fluri, Michael Baumgartner
First page: 81
Abstract: The dataset presented in this study encompasses fluorescence excitation–emission matrices (EEMs) and UV-spectroscopy data of 24 extra virgin olive oils (EVOOs) commercially available at supermarkets in Switzerland. To investigate the effect of thermal degradation, the samples were exposed to accelerated ageing at 60 °C up to 53 days. EEMs and UV absorption parameters were measured in 10 ageing steps. The dataset can be used, for example, to predict one or multiple chemical parameters or to classify samples based on their quality from fluorescence spectra.
Citation: Data
PubDate: 2023-04-29
DOI: 10.3390/data8050081
Issue No: Vol. 8, No. 5 (2023)
- Data, Vol. 8, Pages 82: Exploring Spatial Patterns in Sensor Data for
Humidity, Temperature, and RSSI Measurements
Authors: Juan Botero-Valencia, Adrian Martinez-Perez, Ruber Hernández-García, Luis Castano-Londono
First page: 82
Abstract: The Internet of Things (IoT) is one of the fastest-growing research areas in recent years and is strongly linked to the development of smart cities, smart homes, and factories. IoT can be defined as connecting devices, sensors, and physical objects that can collect and transmit data across a network, enabling increased automation and better decision-making. In several IoT applications, humidity and temperature are some of the most used variables for adjusting system configurations and understanding their performance because they are related to various physical processes, human comfort, manufacturing processes, and 3D printing, among other things. In addition, one of the biggest problems associated with IoT is the excessive production of data, so it is necessary to develop methodologies to optimize the process of collecting information. This work presents a new dataset comprising almost 55 million values of temperature, relative humidity, and RSSI (Received Signal Strength Indicator) collected in two indoor spaces for longer than 3915 h at 10 s intervals. For each experiment, we captured the information from 13 previously calibrated sensors suspended from the ceiling at the same height and with a known relative position. The proposed dataset aims to contribute a benchmark for evaluating indoor temperature and humidity-controlled systems. The collected data allow the validation and improvement of the acquisition process for IoT applications.
Citation: Data
PubDate: 2023-04-29
DOI: 10.3390/data8050082
Issue No: Vol. 8, No. 5 (2023)
- Data, Vol. 8, Pages 83: Cloud-Based Smart Contract Analysis in FinTech
Using IoT-Integrated Federated Learning in Intrusion Detection
Authors: Venkatagurunatham Naidu Kollu, Vijayaraj Janarthanan, Muthulakshmi Karupusamy, Manikandan Ramachandran
First page: 83
Abstract: Data sharing is proposed because the issue of data islands hinders advancement of artificial intelligence technology in the 5G era. Sharing high-quality data has a direct impact on how well machine-learning models work, but there will always be misuse and leakage of data. The field of financial technology, or FinTech, has received a lot of attention and is growing quickly. This field has seen the introduction of new terms as a result of its ongoing expansion. One example of such terminology is “FinTech”. This term is used to describe a variety of procedures utilized frequently in the financial technology industry. This study aims to create a cloud-based intrusion detection system based on IoT federated learning architecture as well as smart contract analysis. This study proposes a novel method for detecting intrusions using a cyber-threat federated graphical authentication system and cloud-based smart contracts in FinTech data. Users are required to create a route on a world map as their credentials under this scheme. We had 120 people participate in the evaluation, 60 of whom had a background in finance or FinTech. The simulation was then carried out in Python using a variety of FinTech cyber-attack datasets for accuracy, precision, recall, F-measure, AUC (Area under the ROC Curve), trust value, scalability, and integrity. The proposed technique attained accuracy of 95%, precision of 85%, RMSE of 59%, recall of 68%, F-measure of 83%, AUC of 79%, trust value of 65%, scalability of 91%, and integrity of 83%.
Citation: Data
PubDate: 2023-04-29
DOI: 10.3390/data8050083
Issue No: Vol. 8, No. 5 (2023)
- Data, Vol. 8, Pages 84: Biotechnology and Bio-Based Products Perceptions
in the Community of Madrid: A Representative Survey Dataset
Authors: Juan Romero-Luis, Manuel Gertrudix, María del Carmen Gertrudis Casado, Alejandro Carbonell-Alcocer
First page: 84
Abstract: (1) Background: Bioeconomy aims to reduce dependence on non-renewable resources and foster economic growth through the development of new bio-based products and services. Achieving this goal requires social acceptance and stakeholder engagement in the development of sustainable technologies. The objective of this data article is to provide a dataset derived from a survey with a representative sample of 500 citizens over 18 years old based in the Community of Madrid. (2) Methods: We created a questionnaire on the social acceptance of technologies and bio-based products to later gather the responses using a SurveyMonkey panel for the Community of Madrid through an online CAWI survey; (3) Results: A dataset with a total of 82 columns with all responses is the result of this study. (4) Conclusions: This data article provides not only a valuable representative dataset of citizens of the Community of Madrid but also sufficient resources to replicate the same study in other regions.
Citation: Data
PubDate: 2023-05-01
DOI: 10.3390/data8050084
Issue No: Vol. 8, No. 5 (2023)
- Data, Vol. 8, Pages 85: Emission Inventory for Maritime Shipping Emissions
in the North and Baltic Sea
Authors: Franziska Dettner, Simon Hilpert
First page: 85
Abstract: A high temporal and spatial resolution emission inventory for the North Sea and Baltic Sea was compiled using current emission factors and ship activity data. The inventory includes seagoing vessels over 100 GT registered with the International Maritime Organization traversing in the North and Baltic Seas. A bottom-up approach was chosen for the compilation of the inventory, which provides emission levels of the air pollutants CO2, NOx, SO2, PM2.5, CO, BC, Ash, NMVOC, and POA, as well as the speed-dependent fuel and energy consumption. Input data come from both main and auxiliary engines, as well as well-to-tank and tank-to-propeller emission and energy and fuel consumption quantities. The georeferenced data are provided in a temporal resolution of five minutes. The data can be used to assess, inter alia, the health effects of maritime emissions, the social costs of maritime transport, emission mitigation effects of alternative fuel scenarios, and shore-to-ship power supply.
Citation: Data
PubDate: 2023-05-01
DOI: 10.3390/data8050085
Issue No: Vol. 8, No. 5 (2023)
- Data, Vol. 8, Pages 86: RaspberrySet: Dataset of Annotated Raspberry
Images for Object Detection
Authors: Sarmīte Strautiņa, Ieva Kalniņa, Edīte Kaufmane, Kaspars Sudars, Ivars Namatēvs, Arturs Nikulins, Edgars Edelmers
First page: 86
Abstract: The RaspberrySet dataset is a valuable resource for those working in the field of agriculture, particularly in the selection and breeding of ecologically adaptable berry cultivars. This is because long-term changes in temperature and weather patterns have made it increasingly important for crops to be able to adapt to their environment. To assess the suitability of different cultivars or to make yield predictions, it is necessary to describe and evaluate berries’ characteristics at various growth stages. This process is typically carried out visually, but it can be time-consuming and labor-intensive, requiring significant expert knowledge. The RaspberrySet dataset was created to assist with this process, and it includes images of raspberries in five classes covering different stages of development: buds, damaged buds, flowers, unripe berries, and ripe berries. All images were annotated using ground truth ROI and are presented in YOLO format. The dataset includes 2039 high-resolution RGB images, with a total of 46,659 annotations provided by experts using Label Studio software (1.7.1). The images were taken in various weather conditions, at different times of the day, and from different angles, and they include fully visible buds, flowers, berries, and partially obscured buds. This dataset is intended to improve the efficiency of berry breeding and yield estimation and to identify the raspberry phenotype more accurately. It may also be useful for breeding other fruit crops, as it allows for the reliable detection and phenotyping of yield components at different stages of development. By providing a homogenized dataset of images taken on-site at the Institute of Horticulture in Dobele, Latvia, the RaspberrySet dataset offers a valuable resource for those working in horticulture.
Citation: Data
PubDate: 2023-05-10
DOI: 10.3390/data8050086
Issue No: Vol. 8, No. 5 (2023)
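The RaspberrySet abstract above says the annotations are presented in YOLO format. As a minimal sketch of what reading such a label might look like, the snippet below converts one label line (class id, then box centre and size as fractions of the image) into pixel coordinates. The class order in `CLASS_NAMES`, the function name, and the sample line are assumptions for illustration only; the dataset's own class file defines the real mapping.

```python
# Hypothetical class order; the dataset's own class list is authoritative.
CLASS_NAMES = ["bud", "damaged bud", "flower", "unripe berry", "ripe berry"]


def parse_yolo_line(line, image_width, image_height):
    """Convert one YOLO label line to (class_name, x_min, y_min, x_max, y_max) in pixels."""
    parts = line.split()
    class_id = int(parts[0])
    x_center, y_center, width, height = (float(v) for v in parts[1:5])
    # YOLO stores the box centre and size as fractions of the image dimensions.
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    x_max = (x_center + width / 2) * image_width
    y_max = (y_center + height / 2) * image_height
    return CLASS_NAMES[class_id], x_min, y_min, x_max, y_max


# Invented sample line, not taken from the dataset.
label = parse_yolo_line("4 0.5 0.5 0.2 0.1", image_width=1000, image_height=800)
print(label)
```

Keeping the coordinates normalized on disk and converting only at load time means the same label file stays valid if the images are resized.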
- Data, Vol. 8, Pages 87: The Effect of Short-Term Transcutaneous Electrical
Stimulation of Auricular Vagus Nerve on Parameters of Heart Rate
Variability
Authors: Vladimir Shvartz, Eldar Sizhazhev, Maria Sokolskaya, Svetlana Koroleva, Soslan Enginoev, Sofia Kruchinova, Elena Shvartz, Elena Golukhova
First page: 87
Abstract: Many previous studies have demonstrated that transcutaneous vagus nerve stimulation (VNS) has the potential to exhibit therapeutic effects similar to its invasive counterpart. An objective assessment of VNS requires a reliable biomarker of successful vagal activation. Although many potential biomarkers have been proposed, most studies have focused on heart rate variability (HRV). Despite the physiological rationale for HRV as a biomarker for assessing vagal stimulation, data on its effects on HRV are equivocal. To further advance this field, future studies investigating VNS should contain adequate methodological specifics that make it possible to compare the results between studies, to replicate studies, and to enhance the safety of study participants. This article describes the design and methodology of a randomized study evaluating the effect of short-term noninvasive stimulation of the auricular branch of the vagus nerve on parameters of HRV. Primary records of rhythmograms of all the subjects, as well as a dataset with clinical, instrumental, and laboratory data of all the current study subjects are in the public domain for possible secondary analysis to all interested researchers. The physiological interpretation of the obtained data is not considered in the article.
Citation: Data
PubDate: 2023-05-11
DOI: 10.3390/data8050087
Issue No: Vol. 8, No. 5 (2023)
- Data, Vol. 8, Pages 88: A Multispectral UAV Imagery Dataset of Wheat,
Soybean and Barley Crops in East Kazakhstan
Authors: Almasbek Maulit, Aliya Nugumanova, Kurmash Apayev, Yerzhan Baiburin, Maxim Sutula
First page: 88
Abstract: This study introduces a dataset of crop imagery captured during the 2022 growing season in the Eastern Kazakhstan region. The images were acquired using a multispectral camera mounted on an unmanned aerial vehicle (DJI Phantom 4). The agricultural land, encompassing 27 hectares and cultivated with wheat, barley, and soybean, was subjected to five aerial multispectral photography sessions throughout the growing season. This facilitated thorough monitoring of the most important phenological stages of crop development in the experimental design, which consisted of 27 plots, each covering one hectare. The collected imagery underwent enhancement and expansion, integrating a sixth band that embodies the normalized difference vegetation index (NDVI) values in conjunction with the original five multispectral bands (Blue, Green, Red, Red Edge, and Near Infrared Red). This amplification enables a more effective evaluation of vegetation health and growth, rendering the enriched dataset a valuable resource for the progression and validation of crop monitoring and yield prediction models, as well as for the exploration of precision agriculture methodologies.
Citation: Data
PubDate: 2023-05-11
DOI: 10.3390/data8050088
Issue No: Vol. 8, No. 5 (2023)
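The multispectral UAV abstract above describes a sixth band holding NDVI values derived from the original bands. NDVI is conventionally computed per pixel as (NIR - Red) / (NIR + Red). The sketch below shows that calculation on small nested lists; the function name, the zero-denominator handling, and the sample reflectance values are assumptions for illustration and are not taken from the dataset or the authors' processing chain.

```python
def ndvi(nir, red):
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red); returns 0.0 where both bands are zero."""
    result = []
    for nir_row, red_row in zip(nir, red):
        row = []
        for n, r in zip(nir_row, red_row):
            total = n + r
            # Guard against division by zero on empty (no-data) pixels.
            row.append((n - r) / total if total != 0 else 0.0)
        result.append(row)
    return result


# Invented 2x2 reflectance values for the two input bands.
nir_band = [[0.6, 0.5], [0.0, 0.3]]
red_band = [[0.2, 0.5], [0.0, 0.1]]
ndvi_band = ndvi(nir_band, red_band)
print(ndvi_band)
```

Values near 1 indicate dense, healthy vegetation, while values near 0 or below indicate bare soil, water, or no data.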
- Data, Vol. 8, Pages 89: A Comprehensive Dataset of Spelling Errors and
Users’ Corrections in Croatian Language
Authors: Gordan Gledec, Marko Horvat, Miljenko Mikuc, Bruno Blašković
First page: 89
Abstract: This paper presents a unique and extensive dataset containing over 33 million entries with pairs in the form “spelling error → correction” from ispravi.me, the most popular Croatian online spellchecking service, collected since 2008. The dataset, compiled from the contribution of nearly 900,000 users, is a valuable resource for researchers and developers in the field of natural language processing (NLP), improving spellcheck accuracy, and language learning applications. The dataset may be used to accomplish several goals: (1) improving spellchecking accuracy by incorporating common user corrections and reducing false positives and negatives; (2) helping language learners identify common errors and learn correct spelling through targeted feedback; (3) analyzing data trends and patterns to uncover the most common spelling errors and their underlying causes; (4) identifying and evaluating factors that influence typing input; (5) improving NLP applications such as text recognition and machine translation. Tasks specific to the Croatian language include the creation of a letter-level confusion matrix and the refinement of word suggestions based on historical usage of the service. This comprehensive dataset provides researchers and practitioners with a wealth of information, opening the path for advancements in spellchecking, language learning, and NLP applications in the Croatian language.
Citation: Data
PubDate: 2023-05-12
DOI: 10.3390/data8050089
Issue No: Vol. 8, No. 5 (2023)
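The Croatian spelling dataset abstract above mentions building a letter-level confusion matrix from "spelling error → correction" pairs. One simple way to approximate this, sketched below, is to compare pairs of equal length position by position and count each (intended letter, typed letter) substitution. This is an assumed simplification: pairs of different lengths (insertions and deletions) are skipped, the function name is hypothetical, and the sample pairs are invented rather than drawn from the dataset.

```python
from collections import Counter


def letter_confusions(pairs):
    """Count (intended, typed) letter substitutions in equal-length (error, correction) pairs."""
    counts = Counter()
    for error, correction in pairs:
        if len(error) != len(correction):
            # Insertions and deletions need an alignment step; skipped in this sketch.
            continue
        for typed, intended in zip(error, correction):
            if typed != intended:
                counts[(intended, typed)] += 1
    return counts


# Invented example pairs in the form (spelling error, correction).
sample_pairs = [
    ("cesto", "često"),
    ("zivot", "život"),
    ("cekati", "čekati"),
    ("kuca", "kuća"),
    ("bijo", "bio"),  # different lengths, so this pair is skipped
]
confusions = letter_confusions(sample_pairs)
for (intended, typed), count in confusions.most_common():
    print(f"{intended} typed as {typed}: {count}")
```

A fuller treatment would align each pair with an edit-distance algorithm so that inserted and deleted letters are counted as well.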
- Data, Vol. 8, Pages 90: An Efficient Deep Learning for Thai Sentiment
Analysis
Authors: Nattawat Khamphakdee, Pusadee Seresangtakul
First page: 90
Abstract: The number of reviews from customers on travel websites and platforms is quickly increasing. They provide people with the ability to write reviews about their experience with respect to service quality, location, room, and cleanliness, thereby helping others before booking hotels. Many people fail to consider hotel bookings because the numerous reviews take a long time to read, and many are in a non-native language. Thus, hotel businesses need an efficient process to analyze and categorize the polarity of reviews as positive, negative, or neutral. In particular, low-resource languages such as Thai have greater limitations in terms of resources to classify sentiment polarity. In this paper, a sentiment analysis method is proposed for Thai sentiment classification in the hotel domain. Firstly, the Word2Vec technique (the continuous bag-of-words (CBOW) and skip-gram approaches) was applied to create word embeddings of different vector dimensions. Secondly, each word embedding model was combined with deep learning (DL) models to observe the impact of each word vector dimension result. We compared the performance of nine DL models (CNN, LSTM, Bi-LSTM, GRU, Bi-GRU, CNN-LSTM, CNN-BiLSTM, CNN-GRU, and CNN-BiGRU) with different numbers of layers to evaluate their performance in polarity classification. The dataset was classified using the FastText and BERT pre-trained models to carry out the sentiment polarity classification. Finally, our experimental results show that the WangchanBERTa model slightly improved the accuracy, producing a value of 0.9225, and the skip-gram and CNN model combination outperformed other DL models, reaching an accuracy of 0.9170. From the experiments, we found that the word vector dimensions, hyperparameter values, and the number of layers of the DL models affected the performance of sentiment classification. 
Our research provides guidance for setting suitable hyperparameter values to improve the accuracy of sentiment classification for the Thai language in the hotel domain.
Citation: Data
PubDate: 2023-05-13
DOI: 10.3390/data8050090
Issue No: Vol. 8, No. 5 (2023)
- Data, Vol. 8, Pages 91: A Set of Geophysical Fields for Modeling of the
Lithosphere Structure and Dynamics in the Russian Arctic Zone
Authors: Anatoly Soloviev, Alexey Petrunin, Sofia Gvozdik, Roman Sidorov
First page: 91
Abstract: This paper presents a set of various geological and geophysical data for the Arctic zone, including some detailed models for the eastern part of the Russian Arctic zone. This hard-to-access territory has a complex geological structure, which is poorly studied by direct geophysical methods. Therefore, these data can be used in an integrative analysis for different purposes. These are the gravity field, heat flow, and various seismic tomography models. The gravity field data include several reductions calculated during our preceding studies, which are more appropriate for the study of the Earth’s interiors than the initial free air anomalies. Specifically, these are the Bouguer, isostatic, and decompensative gravity anomalies. A surface heat flow map included in the dataset is based on a joint inversion of multiple geophysical data constrained by the observations from the International Heat Flow Commission catalog. Available seismic tomography models were analyzed to select the best one for further investigation. We provide the models for the sedimentary cover and the Moho depth, which are significantly improved compared to the existing ones. The database provides a basis for qualitative and quantitative analysis of the region.
Citation: Data
PubDate: 2023-05-14
DOI: 10.3390/data8050091
Issue No: Vol. 8, No. 5 (2023)
- Data, Vol. 8, Pages 92: Low-Dose Radiation-Induced Transcriptomic Changes
in Diabetic Aortic Endothelial Cells
Authors: Jihye Park, Kyuho Kang, Yeonghoon Son, Kwang Seok Kim, Keunsoo Kang, Hae-June Lee
First page: 92
Abstract: Low-dose radiation refers to exposure to ionizing radiation at levels that are generally considered safe and not expected to cause immediate health effects. However, the effects of low-dose radiation are still not fully understood, and research in this area is ongoing. In this study, we investigated the alterations in gene expression profiles of human aortic endothelial cells (HAECs) and diabetic human aortic endothelial cells (T2D-HAECs) derived from patients with type 2 diabetes. To this end, we used RNA-seq to profile the transcriptomes of cells exposed to varying doses of low-dose radiation (0.1 Gy, 0.5 Gy, and 2.0 Gy) and compared them to a control group with no radiation exposure. Differentially expressed genes and enriched pathways were identified using the DESeq2 and gene set enrichment analysis (GSEA) methods, respectively. The data generated in this study are publicly available through the gene expression omnibus (GEO) database with the accession number GSE228572. This study provides a valuable resource for examining the effects of low-dose radiation on HAECs and T2D-HAECs, thereby contributing to a better understanding of the potential human health risks associated with low-dose radiation exposure.
Citation: Data
PubDate: 2023-05-18
DOI: 10.3390/data8050092
Issue No: Vol. 8, No. 5 (2023)
- Data, Vol. 8, Pages 174: Machine Learning Applications to Identify Young
Offenders Using Data from Cognitive Function Tests
Authors: María Claudia Bonfante, Juan Contreras Montes, Mariana Pino, Ronald Ruiz, Gabriel González
First page: 174
Abstract: Machine learning techniques can be used to identify whether deficits in cognitive functions contribute to antisocial and aggressive behavior. This paper initially presents the results of tests conducted on delinquent and nondelinquent youths to assess their cognitive functions. The dataset extracted from these assessments, consisting of 37 predictor variables and one target, was used to train three algorithms which aim to predict whether the data correspond to those of a young offender or a nonoffending youth. Prior to this, statistical tests were conducted on the data to identify characteristics which exhibited significant differences in order to select the most relevant features and optimize the prediction results. Additionally, other feature selection methods, such as Boruta, RFE, and filter, were applied, and their effects on the accuracy of each of the three machine learning models used (SVM, RF, and KNN) were compared. In total, 80% of the data were utilized for training, while the remaining 20% were used for validation. The best result was achieved by the K-NN model, trained with 19 features selected by the Boruta method, followed by the SVM model, trained with 24 features selected by the filter method.
Citation: Data
PubDate: 2023-11-21
DOI: 10.3390/data8120174
Issue No: Vol. 8, No. 12 (2023)
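The young-offender abstract above describes an 80%/20% train/validation split and a K-nearest-neighbours classifier among its models. The sketch below illustrates both steps in plain Python on invented two-feature data. It is not the authors' pipeline: the function names, the seed, the value of `k`, and the toy points are all assumptions, and the real study used 37 predictor variables and feature selection.

```python
import math
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out test_fraction of them."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, features, k=3):
    """Return the majority label among the k training rows closest to features."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented, well-separated toy data: (features, label).
rows = [((0, 0), "group_a"), ((0, 1), "group_a"), ((1, 0), "group_a"),
        ((1, 1), "group_a"), ((0.5, 0.5), "group_a"),
        ((10, 10), "group_b"), ((10, 11), "group_b"), ((11, 10), "group_b"),
        ((11, 11), "group_b"), ((10.5, 10.5), "group_b")]

train_rows, test_rows = train_test_split(rows, test_fraction=0.2, seed=0)
correct = sum(knn_predict(train_rows, feats) == label for feats, label in test_rows)
print(f"validation accuracy: {correct / len(test_rows):.2f}")
```

Fixing the seed keeps the split reproducible, which matters when comparing feature-selection methods on the same held-out rows.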
- Data, Vol. 8, Pages 175: Long-Term Spatiotemporal Oceanographic Data from
the Northeast Pacific Ocean: 1980–2022 Reconstruction Based on the
Korea Oceanographic Data Center (KODC) Dataset
Authors: Seong-Hyeon Kim, Hansoo Kim
First page: 175
Abstract: The Korea Oceanographic Data Center (KODC), overseen by the National Institute of Fisheries Science (NIFS), is a pivotal hub for collecting, processing, and disseminating marine science data. By digitizing and subjecting observational data to rigorous quality control, the KODC ensures accurate information in line with international standards. The center actively engages in global partnerships and fosters marine data exchange. A wide array of marine information is provided through the KODC website, including observational metadata, coastal oceanographic data, real-time buoy records, and fishery environmental data. Coastal oceanographic observational data from 207 stations across various sea regions have been collected biannually since 1961. This dataset covers 14 standard water depths; includes essential parameters, such as temperature, salinity, nutrients, and pH; serves as the foundation for news, reports, and analyses by the NIFS; and is widely employed to study seasonal and regional marine variations, with researchers supplementing the limited data for comprehensive insights. The dataset offers information for each water depth at a 1 m interval over 1980–2022, facilitating research across disciplines. Data processing, including interpolation and quality control, is based on MATLAB. These data are classified by region and accessible online; hence, researchers can easily explore spatiotemporal trends in marine environments.
Citation: Data
PubDate: 2023-11-23
DOI: 10.3390/data8120175
Issue No: Vol. 8, No. 12 (2023)
- Data, Vol. 8, Pages 176: Model Design and Applied Methodology in
Geothermal Simulations in Very Low Enthalpy for Big Data Applications
Authors: Roberto Arranz-Revenga, María Pilar Dorrego de Luxán, Juan Herrera Herbert, Luis Enrique García Cambronero
First page: 176
Abstract: Low-enthalpy geothermal installations for heating, air conditioning, and domestic hot water are gaining traction due to efforts towards energy decarbonization. This article is part of a broader research project aimed at employing artificial intelligence and big data techniques to develop a predictive system for the thermal behavior of the ground in very low-enthalpy geothermal applications. In this initial article, a summarized process is outlined to generate large quantities of synthetic data through a ground simulation method. The proposed theoretical model allows simulation of the soil’s thermal behavior using an electrical equivalent. The electrical circuit derived is loaded into a simulation program along with an input function representing the system’s thermal load pattern. The simulator responds with another function that calculates the values of the ground over time. Some examples of value conversion and the utility of the input function system to encode thermal loads during simulation are demonstrated. The model is not valid in the presence of underground water currents. Model validation is pending, and once defined, a corresponding testing plan will be proposed for its validation.
Citation: Data
PubDate: 2023-11-23
DOI: 10.3390/data8120176
Issue No: Vol. 8, No. 12 (2023)
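The geothermal abstract above describes simulating the soil's thermal behavior through an electrical equivalent driven by a thermal load function. In the usual thermal-electrical analogy, temperature plays the role of voltage, heat flow the role of current, and the ground is represented by thermal resistances and capacitances. The sketch below is a deliberately simplified one-node version, C dT/dt = q(t) - (T - T_far) / R, integrated with explicit Euler steps. It is not the authors' circuit: the single-node structure, the function name, and all parameter values are assumptions chosen for illustration.

```python
def simulate_ground_node(heat_load, r_thermal, c_thermal, t_far, dt):
    """Integrate a one-node RC thermal model and return the temperature after each step.

    heat_load: heat injected per step in W (the input function, sampled every dt seconds)
    r_thermal: thermal resistance to the undisturbed ground in K/W
    c_thermal: thermal capacitance of the node in J/K
    t_far:     undisturbed ground temperature in degrees C
    """
    temperature = t_far
    history = []
    for q in heat_load:
        # C * dT/dt = q - (T - T_far) / R, integrated with explicit Euler.
        d_temp = (q - (temperature - t_far) / r_thermal) / c_thermal
        temperature += d_temp * dt
        history.append(temperature)
    return history


# Assumed values: constant 10 W load, time constant R*C = 1000 s, 10 s steps.
load = [10.0] * 2000
temps = simulate_ground_node(load, r_thermal=0.5, c_thermal=2000.0, t_far=12.0, dt=10.0)
print(f"final temperature: {temps[-1]:.3f} C")
```

Under a constant load the node settles at `t_far + q * r_thermal`, which gives a quick sanity check before feeding in a time-varying load pattern. The time step should stay well below `r_thermal * c_thermal` for the explicit scheme to remain stable.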
- Data, Vol. 8, Pages 177: Dataset: Impact of β-galactosylceramidase
Overexpression on the Protein Profile of Braf(V600E) Mutated Melanoma
Cells
Authors: Davide Capoferri, Paola Chiodelli, Stefano Calza, Marcello Manfredi, Marco Presta
First page: 177
Abstract: β-Galactosylceramidase (GALC) is a lysosomal enzyme involved in sphingolipid metabolism by removing β-galactosyl moieties from β-galactosyl ceramide and β-galactosyl sphingosine. Previous observations have shown that GALC exerts a pro-oncogenic activity in human melanoma. Here, the impact of GALC overexpression on the proteomic landscape of BRAF-mutated A2058 and A375 human melanoma cell lines was investigated by liquid chromatography–tandem mass spectrometry analysis of the cell extracts. The results indicate that GALC overexpression causes the upregulation/downregulation of 172/99 proteins in GALC-transduced cells when compared to control cells. Gene ontology categorization of up/down-regulated proteins indicates that GALC may modulate the protein landscape in BRAF-mutated melanoma cells by affecting various biological processes, including RNA metabolism, cell organelle fate, and intracellular redox status. Overall, these data provide further insights into the pro-oncogenic functions of the sphingolipid metabolizing enzyme GALC in human melanoma.
Citation: Data
PubDate: 2023-11-24
DOI: 10.3390/data8120177
Issue No: Vol. 8, No. 12 (2023)
- Data, Vol. 8, Pages 178: In Vivo Drug Testing during Embryonic Wound
Healing: Establishing the Avian Model
Authors: Martin Bablok, Beate Brand-Saberi, Morris Gellisch, Gabriela Morosan-Puopolo
First page: 178
Abstract: The relevance of identifying pathological processes in the context of embryonic development is increasingly gaining attention in terms of professionalized prenatal care. To analyze local effects of prenatally administered drugs during embryonic development, the model organism of the chicken embryo can be used in a first exploratory approach. For the examination of local dexamethasone administration—as an exemplary drug—common bead implantation protocols have been adapted to serve as an in vivo technique for local drug testing during embryonic skin regeneration. For this, acrylic beads were soaked in a dexamethasone solution and implanted into skin incisional wounds of 4-day-old chicken embryos. After further incubation, the effects of the applied substance on the process of embryonic skin regeneration were analyzed using histological and molecular biological techniques. This data descriptor contains a detailed microsurgical protocol, a representative video demonstration, and exemplary results of local glucocorticoid-induced changes during embryonic wound healing. To conclude, this method allows for the analysis of the local effects of a particular substance on a cellular level and can be extended to serve as an in vivo technique for numerous other drugs to be tested on embryonic tissue.
Citation: Data
PubDate: 2023-11-25
DOI: 10.3390/data8120178
Issue No: Vol. 8, No. 12 (2023)
- Data, Vol. 8, Pages 179: A Tourist-Based Framework for Developing Digital
Marketing for Small and Medium-Sized Enterprises in the Tourism Sector in
Saudi Arabia
Authors: Rishaa Abdulaziz Alnajim, Bahjat Fakieh
First page: 179
Abstract: Social media has become an essential tool for travel planning, with tourists increasingly using it to research destinations, book accommodation, and make travel arrangements. However, little is known about how tourists use social media for travel planning and what factors influence their intentions to use social media for this purpose. This thesis aims to understand tourists’ intentions to use social media for travel planning. Specifically, it investigates the factors influencing tourists’ intentions to use social media for planning travel to Saudi Arabia. It develops a machine learning (ML) classification model to assist Saudi tourism SMEs in creating effective digital marketing strategies for social media platforms. A survey was conducted with 573 tourists interested in visiting Saudi Arabia, using the Design Science Research (DSR) approach. The findings support the tourist-based theoretical framework, showing that perceived usefulness (PU), perceived ease of use (PEOU), satisfaction (SAT), marketing-generated content (MGC), and user-generated content (UGC) significantly impact tourists’ intentions to use social media for travel planning. Tourists’ characteristics and visit characteristics influenced their intentions to use MGC but not UGC. The tourist-based ML classification model, developed using the LinearSVC algorithm, achieved an accuracy of 99% when evaluated using the K-Fold Cross-Validation (KF-CV) technique. The findings of this study have several implications for Saudi tourism SMEs. First, the results suggest that SMEs should focus on developing social media content that is perceived as useful, easy to use, and satisfying. Second, the findings suggest that SMEs should focus on using MGC in their social media marketing campaigns. Third, the results suggest that SMEs should tailor their social media marketing campaigns to the characteristics of their target tourists. 
This study contributes to the literature on tourism marketing and social media by providing a better understanding of how tourists use social media for travel planning. Saudi tourism SMEs can use the findings of this study to develop more effective digital marketing strategies for social media platforms.
Citation: Data
PubDate: 2023-11-28
DOI: 10.3390/data8120179
Issue No: Vol. 8, No. 12 (2023)
- Data, Vol. 8, Pages 180: Public Perception of ChatGPT and Transfer
Learning for Tweets Sentiment Analysis Using Wolfram Mathematica
Authors: Yankang Su, Zbigniew J. Kabala
First page: 180
Abstract: Understanding public opinion on ChatGPT is crucial for recognizing its strengths and areas of concern. By utilizing natural language processing (NLP), this study delves into tweets regarding ChatGPT to determine temporal patterns, content features, and topic modeling and perform a sentiment analysis. Analyzing a dataset of 500,000 tweets, our research shifts from conventional data science tools like Python and R to exploit Wolfram Mathematica’s robust capabilities. Additionally, with the aim of solving the problem of ignoring semantic information in the LDA model feature extraction, a synergistic methodology entwining LDA, GloVe embeddings, and K-Nearest Neighbors (KNN) clustering is proposed to categorize topics within ChatGPT-related tweets. This comprehensive strategy ensures semantic, syntactic, and topical congruence within classified groups by utilizing the strengths of probabilistic modeling, semantic embeddings, and similarity-based clustering. While built-in sentiment classifiers often fall short in accuracy, we introduce four transfer learning techniques from the Wolfram Neural Net Repository to address this gap. Two of these techniques involve transferring static word embeddings, “GloVe” and “ConceptNet”, which are further processed using an LSTM layer. The remaining techniques center on fine-tuning pre-trained models using scantily annotated data; one refines embeddings from language models (ELMo), while the other fine-tunes bidirectional encoder representations from transformers (BERT). Our experiments on the dataset underscore the effectiveness of the four methods for the sentiment analysis of tweets. This investigation augments our comprehension of user sentiment towards ChatGPT and emphasizes the continued significance of exploration in this domain. 
Furthermore, this work serves as a pivotal reference for scholars who are accustomed to using Wolfram Mathematica in other research domains, aiding their efforts in text analytics on social media platforms.
Citation: Data
PubDate: 2023-11-28
DOI: 10.3390/data8120180
Issue No: Vol. 8, No. 12 (2023)
- Data, Vol. 8, Pages 159: DataPLAN: A Web-Based Data Management Plan
Generator for the Plant Sciences
Authors: Xiao-Ran Zhou, Sebastian Beier, Dominik Brilhaus, Cristina Martins Rodrigues, Timo Mühlhaus, Dirk von Suchodoletz, Richard M. Twyman, Björn Usadel, Angela Kranz
First page: 159
Abstract: Research data management (RDM) combines a set of practices for the organization, storage and preservation of data from research projects. The RDM strategy of a project is usually formalized as a data management plan (DMP)—a document that sets out procedures to ensure data findability, accessibility, interoperability and reusability (FAIR-ness). Many aspects of RDM are standardized across disciplines so that data and metadata are reusable, but the components of DMPs in the plant sciences are often disconnected. The inability to reuse plant-specific DMP content across projects and funding sources requires additional time and effort to write unique DMPs for different settings. To address this issue, we developed DataPLAN—an open-source tool incorporating prewritten DMP content for the plant sciences that can be used online or offline to prepare multiple DMPs. The current version of DataPLAN supports Horizon 2020 and Horizon Europe projects, as well as projects funded by the German Research Foundation (DFG). Furthermore, DataPLAN offers the option for users to customize their own templates. Additional templates to accommodate other funding schemes will be added in the future. DataPLAN reduces the workload needed to create or update DMPs in the plant sciences by presenting standardized RDM practices optimized for different funding contexts.
Citation: Data
PubDate: 2023-10-24
DOI: 10.3390/data8110159
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 160: Fabaceae: South African Medicinal Plant Species
Used in the Treatment and Management of Sexually Transmitted and Related
Opportunistic Infections Associated with HIV-AIDS
Authors: Nkoana Ishmael Mongalo, Maropeng Vellry Raletsena
First page: 160
Abstract: The use of medicinal plants, particularly in the treatment of sexually transmitted and related infections, is ancient. These plants may well be used as alternative and complementary medicine to a variety of antibiotics that may possess limitations mainly due to an emerging enormous antimicrobial resistance. Several computerized database literature sources such as ScienceDirect, Scopus, Scielo, PubMed, and Google Scholar were used to retrieve information on Fabaceae species used in the treatment and management of sexually transmitted and related infections in South Africa. The other information was sourced from various academic dissertations, theses, and botanical books. A total of 42 medicinal plant species belonging to the Fabaceae family, used in the treatment of sexually transmitted and related opportunistic infections associated with HIV-AIDS, have been documented. Trees were the most reported life form, yielding 47.62%, while Senna and Vachellia were the frequently cited genera yielding six and three species, respectively. Peltophorum africanum Sond. was the most preferred medicinal plant, yielding a frequency of citation of 14, while Vachellia karoo (Hayne) Banfi and Glasso as well as Elephantorrhiza burkei Benth. yielded 12 citations each. The most frequently used plant parts were roots, yielding 57.14%, while most of the plant species were administered orally after boiling (51.16%) until the infection subsided. Amazingly, many of the medicinal plant species are recommended for use to treat impotence (29.87%), while most common STI infections such as chlamydia (7.79%), gonorrhea (6.49%), syphilis (5.19%), genital warts (2.60%), and many other unidentified STIs that may include “Makgoma” and “Divhu” were less cited. 
Although there are widespread data on the in vitro evidence of the use of the Fabaceae species in the treatment of sexually transmitted and related infections, there is a need to explore the in vivo studies to further ascertain the use of species as a possible complementary and alternative medicine to the currently used antibiotics in both developing and underdeveloped countries. Furthermore, the toxicological profiles of many of these studies need to be further explored. The safety and efficacy of over-the-counter pharmaceutical products developed using these species also need to be explored.
Citation: Data
PubDate: 2023-10-24
DOI: 10.3390/data8110160
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 161: Dataset: Biodiversity of Ground Beetles
(Coleoptera, Carabidae) of the Republic of Mordovia (Russia)
Authors: Leonid V. Egorov, Viktor V. Aleksanov, Sergei K. Alekseev, Alexander B. Ruchin, Oleg N. Artaev, Mikhail N. Esin, Sergei V. Lukiyanov, Evgeniy A. Lobachev, Gennadiy B. Semishin
First page: 161
Abstract: (1) Background: Carabidae is one of the most diverse families of Coleoptera. Many species of Carabidae are sensitive to anthropogenic impacts and are indicators of their environmental state. Some species of large beetles are on the verge of extinction. The aim of this research is to describe the Carabidae fauna of the Republic of Mordovia (central part of European Russia); (2) Methods: The research was carried out in April-September 1979, 1987, 2000, 2001, 2005, 2007–2022. Collections were performed using a variety of methods (light trapping, soil traps, window traps, etc.). For each observation, the coordinates of the sampling location, abundance, and dates were recorded; (3) Results: The dataset contains data on 251 species of Carabidae from 12 subfamilies and 4576 occurrences. A total of 66,378 specimens of Carabidae were studied. Another 29 species are additionally known from other publications. Also, twenty-two species were excluded from the fauna of the region, as they were determined earlier by mistake; (4) Conclusions: The biodiversity of Carabidae in the Republic of Mordovia included 280 species from 12 subfamilies. Four species (Agonum scitulum, Lebia scapularis, Bembidion humerale, and Bembidion tenellum) were identified for the first time in the Republic of Mordovia.
Citation: Data
PubDate: 2023-10-24
DOI: 10.3390/data8110161
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 162: The Development of a Water Resource Monitoring
Ontology as a Research Tool for Sustainable Regional Development
Authors: Assel Ospan, Madina Mansurova, Vladimir Barakhnin, Aliya Nugumanova, Roman Titkov
First page: 162
Abstract: The development of knowledge graphs about water resources as a tool for studying the sustainable development of a region is currently an urgent task, because the growing deterioration of the state of water bodies affects the ecology, economy, and health of the population of the region. This study presents a new ontological approach to water resource monitoring in Kazakhstan, providing data integration from heterogeneous sources, semantic analysis, decision support, and querying and searching and presenting new knowledge in the field of water monitoring. The contribution of this work is the integration of table extraction and understanding, semantic web rule language, semantic sensor network, time ontology methods, and the inclusion of a module of socioeconomic indicators that reveal the impact of water quality on the quality of life of the population. Using machine learning methods, the study derived six ontological rules to establish new knowledge about water resource monitoring. The results of the queries demonstrate the effectiveness of the proposed method, demonstrating its potential to improve water monitoring practices, promote sustainable resource management, and support decision-making processes in Kazakhstan, and can also be integrated into the ontology of water resources at the scale of Central Asia.
Citation: Data
PubDate: 2023-10-26
DOI: 10.3390/data8110162
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 163: A Large-Scale Dataset of Search Interests Related
to Disease X Originating from Different Geographic Regions
Authors: Nirmalya Thakur, Shuqi Cui, Kesha A. Patel, Isabella Hall, Yuvraj Nihal Duggal
First page: 163
Abstract: The World Health Organization (WHO) added Disease X to their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. During different virus outbreaks of the past, such as COVID-19, Influenza, Lyme Disease, and Zika virus, researchers from various disciplines utilized Google Trends to mine multimodal components of web behavior to study, investigate, and analyze the global awareness, preparedness, and response associated with these respective virus outbreaks. As the world prepares for Disease X, a dataset on web behavior related to Disease X would be crucial to contribute towards the timely advancement of research in this field. Furthermore, none of the prior works in this field have focused on the development of a dataset to compile relevant web behavior data, which would help to prepare for Disease X. To address these research challenges, this work presents a dataset of web behavior related to Disease X, which emerged from different geographic regions of the world, between February 2018 and August 2023. Specifically, this dataset presents the search interests related to Disease X from 94 geographic regions. These regions were chosen for data mining as these regions recorded significant search interests related to Disease X during this timeframe. The dataset was developed by collecting data using Google Trends. The relevant search interests for all these regions for each month in this time range are available in this dataset. This paper also discusses the compliance of this dataset with the FAIR principles of scientific data management. Finally, an analysis of this dataset is presented to uphold the applicability, relevance, and usefulness of this dataset for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, and Data Analysis with a specific focus on Disease X.
Citation: Data
PubDate: 2023-10-26
DOI: 10.3390/data8110163
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 164: Information Competences and Academic Achievement:
A Dataset
Authors: Jacqueline Köhler, Roberto González-Ibáñez
First page: 164
Abstract: Information literacy (IL) is becoming fundamental in the modern world. Although several IL standards and assessments have been developed for secondary and higher education, there is still no agreement about the possible associations between IL and both academic achievement and student dropout rates. In this article, we present a dataset including IL competences measurements, as well as academic achievement and socioeconomic indicators for 153 Chilean first- and second-year engineering students. The dataset is intended to allow researchers to use machine learning methods to study to what extent, if any, IL and academic achievement are related.
Citation: Data
PubDate: 2023-10-27
DOI: 10.3390/data8110164
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 165: Can We Mathematically Spot the Possible
Manipulation of Results in Research Manuscripts Using Benford’s Law?
Authors: Teddy Lazebnik, Dan Gorlitsky
First page: 165
Abstract: The reproducibility of academic research has long been a persistent issue, contradicting one of the fundamental principles of science. Recently, there has been an increasing number of false claims found in academic manuscripts, casting doubt on the validity of reported results. In this paper, we utilize an adapted version of Benford’s law, a statistical phenomenon that describes the distribution of leading digits in naturally occurring datasets, to identify the potential manipulation of results in research manuscripts, solely using the aggregated data presented in those manuscripts rather than the commonly unavailable raw datasets. Our methodology applies the principles of Benford’s law to commonly employed analyses in academic manuscripts, thus reducing the need for the raw data itself. To validate our approach, we employed 100 open-source datasets and accurately predicted 79% of them using our rules. Moreover, we tested the proposed method on known retracted manuscripts, showing that around half (48.6%) can be detected using the proposed method. Additionally, we analyzed 100 manuscripts published in the last two years across ten prominent economic journals, with 10 manuscripts randomly sampled from each journal. Our analysis predicted a 3% occurrence of results manipulation with a 96% confidence level. Our findings show that Benford’s law, adapted for aggregated data, can be an initial tool for identifying data manipulation; however, it is not a silver bullet, requiring further investigation for each flagged manuscript due to the relatively low prediction accuracy.
Citation: Data
PubDate: 2023-10-31
DOI: 10.3390/data8110165
Issue No: Vol. 8, No. 11 (2023)
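A minimal sketch of the classical first-digit form of Benford’s law mentioned in the abstract above, in which the expected share of leading digit d is log10(1 + 1/d). This is not the authors' adapted method for aggregated data; the function names and the mean-absolute-deviation score are assumptions made for illustration only.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Leading non-zero decimal digit of a non-zero number."""
    if x == 0:
        raise ValueError("zero has no leading digit")
    # Scientific notation puts the leading digit first, e.g. '9.87...e+02'.
    return int(f"{abs(x):.16e}"[0])


def first_digit_frequencies(values):
    """Observed share of each leading digit, ignoring zeros."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def mean_absolute_deviation(values):
    """Hypothetical conformity score: mean |observed - expected| over digits."""
    observed = first_digit_frequencies(values)
    expected = benford_expected()
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9
```

A small score suggests the leading digits are close to the Benford distribution; a large score is only a prompt for closer inspection, not evidence of manipulation by itself.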
- Data, Vol. 8, Pages 166: A Scalable Data Structure for Efficient Graph
Analytics and In-Place Mutations
Authors: Soukaina Firmli, Dalila Chiadmi
First page: 166
Abstract: The graph model enables a broad range of analyses; thus, graph processing (GP) is an invaluable tool in data analytics. At the heart of every GP system lies a concurrent graph data structure that stores the graph. Such a data structure needs to be highly efficient for both graph algorithms and queries. Due to the continuous evolution, the sparsity, and the scale-free nature of real-world graphs, GP systems face the challenge of providing an appropriate graph data structure that enables both fast analytical workloads and fast, low-memory graph mutations. Existing graph structures offer a hard tradeoff among read-only performance, update friendliness, and memory consumption upon updates. In this paper, we introduce CSR++, a new graph data structure that removes these tradeoffs and enables both fast read-only analytics, and quick and memory-friendly mutations. CSR++ combines ideas from CSR, the fastest read-only data structure, and adjacency lists (ALs) to achieve the best of both worlds. We compare CSR++ to CSR, ALs from the Boost Graph Library (BGL), and the following state-of-the-art update-friendly graph structures: LLAMA, STINGER, GraphOne, and Teseo. In our evaluation, which is based on popular GP algorithms executed over real-world graphs, we show that CSR++ remains close to CSR in read-only concurrent performance (within 10% on average) while significantly outperforming CSR (by an order of magnitude) and LLAMA (by almost 2×) with frequent updates. We also show that both CSR++’s update throughput and analytics performance exceed those of several state-of-the-art graph structures while maintaining low memory consumption when the workload includes updates.
Citation: Data
PubDate: 2023-11-03
DOI: 10.3390/data8110166
Issue No: Vol. 8, No. 11 (2023)
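For readers unfamiliar with the baseline named in the abstract above, the following is a minimal sketch of a plain compressed sparse row (CSR) graph built from an edge list. It illustrates only the classical read-only layout, not CSR++ or its update mechanism; the function names are assumptions made for illustration.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) from directed (src, dst) pairs."""
    # Count the out-degree of each vertex, shifted by one position.
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix sum: offsets[v] is where vertex v's neighbor run starts.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Scatter each destination into its source's run.
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets, targets, v):
    """Out-neighbors of v, read as one contiguous slice."""
    return targets[offsets[v]:offsets[v + 1]]
```

The contiguous layout makes neighbor scans fast, but inserting an edge into this plain form requires shifting the tail of the targets array, which is the kind of update cost the abstract describes as a tradeoff.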
- Data, Vol. 8, Pages 167: Draft Genome Sequence Data of Lysinibacillus
sphaericus Strain 1795 with Insecticidal Properties
Authors: Maria N. Romanenko, Maksim A. Nesterenko, Anton E. Shikov, Anton A. Nizhnikov, Kirill S. Antonets
First page: 167
Abstract: Lysinibacillus sphaericus is of significant agricultural importance because it can produce insecticidal toxins and chemical moieties of varying antibacterial and fungicidal activities. In this study, the genome of the L. sphaericus strain 1795 is presented. Illumina short reads sequenced on the HiSeq X platform were used to obtain the genome’s assembly by applying the SPAdes v3.15.4 software. The genome size based on a cumulative length of 23 contigs reached 4.74 Mb, with a respective N50 of 1.34 Mb. The assembled genome carried 4672 genes, including 4643 protein-encoding ones, 5 of which represented loci coding for insecticidal toxins active against the orders Diptera, Lepidoptera, and Blattodea. We also revealed biosynthetic gene clusters responsible for the synthesis of secondary metabolites with predicted antibacterial, fungicidal, and growth-promoting properties. The genomic data provided will be helpful for deepening our understanding of genetic markers determining the efficient application of the L. sphaericus strain 1795 primarily for biocontrol purposes in veterinary and medical applications against several groups of blood-sucking insects.
Citation: Data
PubDate: 2023-11-03
DOI: 10.3390/data8110167
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 168: Applying Eye Tracking with Deep Learning
Techniques for Early-Stage Detection of Autism Spectrum Disorders
Authors: Zeyad A. T. Ahmed, Eid Albalawi, Theyazn H. H. Aldhyani, Mukti E. Jadhav, Prachi Janrao, Mansour Ratib Mohammad Obeidat
First page: 168
Abstract: Autism spectrum disorder (ASD) poses a complex challenge to researchers and practitioners, with its multifaceted etiology and varied manifestations. Timely intervention is critical in enhancing the developmental outcomes of individuals with ASD. This paper underscores the paramount significance of early detection and diagnosis as a pivotal precursor to effective intervention. To this end, integrating advanced technological tools, specifically eye-tracking technology and deep learning algorithms, is investigated for its potential to discriminate between children with ASD and their typically developing (TD) peers. By employing these methods, the research aims to contribute to refining early detection strategies and support mechanisms. This study introduces innovative deep learning models grounded in convolutional neural network (CNN) and recurrent neural network (RNN) architectures, employing an eye-tracking dataset for training. Of note, performance outcomes have been realised, with the bidirectional long short-term memory (BiLSTM) achieving an accuracy of 96.44%, the gated recurrent unit (GRU) attaining 97.49%, the CNN-LSTM hybridising to 97.94%, and the LSTM achieving the most remarkable accuracy result of 98.33%. These outcomes underscore the efficacy of the applied methodologies and the potential of advanced computational frameworks in achieving substantial accuracy levels in ASD detection and classification.
Citation: Data
PubDate: 2023-11-03
DOI: 10.3390/data8110168
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 169: Machine Learning for Credit Risk Prediction: A
Systematic Literature Review
Authors: Jomark Pablo Noriega, Luis Antonio Rivera, José Alfredo Herrera
First page: 169
Abstract: In this systematic review of the literature on using Machine Learning (ML) for credit risk prediction, we raise the need for financial institutions to use Artificial Intelligence (AI) and ML to assess credit risk, analyzing large volumes of information. We posed research questions about algorithms, metrics, results, datasets, variables, and related limitations in predicting credit risk. In addition, we searched renowned databases responding to them and identified 52 relevant studies within the credit industry of microfinance. Challenges and approaches in credit risk prediction using ML models were identified; we had difficulties with the implemented models such as the black box model, the need for explanatory artificial intelligence, the importance of selecting relevant features, addressing multicollinearity, and the problem of the imbalance in the input data. By answering the inquiries, we identified that the Boosted Category is the most researched family of ML models; the most commonly used metrics for evaluation are Area Under Curve (AUC), Accuracy (ACC), Recall, precision measure F1 (F1), and Precision. Research mainly uses public datasets to compare models, and private ones to generate new knowledge when applied to the real world. The most significant limitation identified is the representativeness of reality, and the variables primarily used in the microcredit industry are data related to the Demographic, Operation, and Payment behavior. This study aims to guide developers of credit risk management tools and software towards the existing ability of ML methods, metrics, and techniques used to forecast it, thereby minimizing possible losses due to default and guiding risk appetite.
Citation: Data
PubDate: 2023-11-07
DOI: 10.3390/data8110169
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 170: Introducing DeReKoGram: A Novel Frequency Dataset
with Lemma and Part-of-Speech Information for German
Authors: Sascha Wolfer, Alexander Koplenig, Marc Kupietz, Carolin Müller-Spitzer
First page: 170
Abstract: We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
Citation: Data
PubDate: 2023-11-10
DOI: 10.3390/data8110170
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 171: ChatGPT across Arabic Twitter: A Study of Topics,
Sentiments, and Sarcasm
Authors: Shahad Al-Khalifa, Fatima Alhumaidhi, Hind Alotaibi, Hend S. Al-Khalifa
First page: 171
Abstract: While ChatGPT has gained global significance and widespread adoption, its exploration within specific cultural contexts, particularly within the Arab world, remains relatively limited. This study investigates the discussions among early Arab users in Arabic tweets related to ChatGPT, focusing on topics, sentiments, and the presence of sarcasm. Data analysis and topic-modeling techniques were employed to examine 34,760 Arabic tweets collected using specific keywords. This study revealed a strong interest within the Arabic-speaking community in ChatGPT technology, with prevalent discussions spanning various topics, including controversies, regional relevance, fake content, and sector-specific dialogues. Despite the enthusiasm, concerns regarding ethical risks and negative implications of ChatGPT’s emergence were highlighted, indicating apprehension toward advanced artificial intelligence (AI) technology in language generation. Region-specific discussions underscored the diverse adoption of AI applications and ChatGPT technology. Sentiment analysis of the tweets demonstrated a predominantly neutral sentiment distribution (92.8%), suggesting a focus on objectivity and factuality over emotional expression. The prevalence of neutral sentiments indicated a preference for evidence-based reasoning and logical arguments, fostering constructive discussions influenced by cultural norms. Sarcasm was found in 4% of the tweets, distributed across various topics but not dominating the conversation. This study’s implications include the need for AI developers to address ethical concerns and the importance of educating users about the technology’s ethical considerations and risks. Policymakers should consider the regional relevance and potential scams, emphasizing the necessity for ethical guidelines and regulations.
Citation: Data
PubDate: 2023-11-14
DOI: 10.3390/data8110171
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 172: Testate Amoebae (Amphitremida, Arcellinida,
Euglyphida) in Sphagnum Bogs: The Dataset from Eastern Fennoscandia
Authors: Aleksandr Ivanovskii, Kirill Babeshko, Viktor Chernyshov, Anton Esaulov, Aleksandr Komarov, Elena Malysheva, Natalia Mazei, Diana Meskhadze, Damir Saldaev, Andrey N. Tsyganov, Yuri Mazei
First page: 172
Abstract: The paper describes a dataset comprising 236 surface moss samples and 143 testate amoeba taxa. The samples were collected in 11 Sphagnum-dominated bogs during frost-free seasons of 2004, 2007, 2009, 2017, and 2022. For the whole dataset, the sampling effort was sufficient in terms of observed species richness (143 species in total), though a regional species pool is deemed to be discovered incompletely (143 species is its lower 95% confidence limit using Chao’s estimator). The local community composition demonstrated high heterogeneity in a reduced ordination space. It supports the opinion that the high versatility of bog ecosystems should be taken into account during ecological studies.
Citation: Data
PubDate: 2023-11-15
DOI: 10.3390/data8110172
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 173: Biodiversity of Terrestrial Testate Amoebae in
Western Siberia Lowland Peatlands
Authors: Damir Saldaev, Kirill Babeshko, Viktor Chernyshov, Anton Esaulov, Xiuyuan Gu, Nikita Kriuchkov, Natalia Mazei, Nailia Saldaeva, Jiahui Su, Andrey Tsyganov, Basil Yakimov, Svetlana Yushkovets, Yuri Mazei
First page: 173
Abstract: Testate amoebae are unicellular eukaryotic organisms covered with an external skeleton called a shell. They are an important component of many terrestrial ecosystems, especially peatlands, where they can be preserved in peat deposits and used as a proxy of surface wetness in paleoecological reconstructions. Here, we present a database from a vast but poorly studied region of the Western Siberia Lowland containing information on testate amoeba (TA) occurrences in relation to substrate moisture and water table depth (WTD). The dataset includes 88 species from 32 genera, with 2181 incidences and 21,562 counted individuals. All samples were collected in oligotrophic peatlands and prepared using the method of wet sieving with a subsequent sedimentation of aqueous suspensions. This database contributes to the understanding of the distribution of testate amoebae and can be further used in large-scale investigations.
Citation: Data
PubDate: 2023-11-17
DOI: 10.3390/data8110173
Issue No: Vol. 8, No. 11 (2023)
- Data, Vol. 8, Pages 145: Attention-Based Human Age Estimation from Face
Images to Enhance Public Security
Authors: Md. Ashiqur Rahman, Shuhena Salam Aonty, Kaushik Deb, Iqbal H. Sarker
First page: 145
Abstract: Age estimation from facial images has gained significant attention due to its practical applications such as public security. However, one of the major challenges faced in this field is the limited availability of comprehensive training data. Moreover, due to the gradual nature of aging, similar-aged faces tend to share similarities despite their race, gender, or location. Recent studies on age estimation utilize convolutional neural networks (CNN), treating every facial region equally and disregarding potentially informative patches that contain age-specific details. Therefore, an attention module can be used to focus extra attention on important patches in the image. In this study, tests are conducted on different attention modules, namely CBAM, SENet, and Self-attention, implemented with a convolutional neural network. The focus is on developing a lightweight model that requires a low number of parameters. A merged dataset and other cutting-edge datasets are used to test the proposed model’s performance. In addition, transfer learning is used alongside the scratch CNN model to achieve optimal performance more efficiently. Experimental results on different aging face databases show the remarkable advantages of the proposed attention-based CNN model over the conventional CNN model by attaining the lowest mean absolute error and the lowest number of parameters with a better cumulative score.
Citation: Data
PubDate: 2023-09-25
DOI: 10.3390/data8100145
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 146: Synthetic Data Generation for Data Envelopment
Analysis
Authors: Andrey V. Lychev
First page: 146
Abstract: The paper is devoted to the problem of generating artificial datasets for data envelopment analysis (DEA), which can be used for testing DEA models and methods. In particular, the papers that applied DEA to big data often used synthetic data generation to obtain large-scale datasets because real datasets of large size, available in the public domain, are extremely rare. This paper proposes the algorithm which takes as input some real dataset and complements it by artificial efficient and inefficient units. The generation process extends the efficient part of the frontier by inserting artificial efficient units, keeping the original efficient frontier unchanged. For this purpose, the algorithm uses the assurance region method and consistently relaxes weight restrictions during the iterations. This approach produces synthetic datasets that are closer to real ones, compared to other algorithms that generate data from scratch. The proposed algorithm is applied to a pair of small real-life datasets. As a result, the datasets were expanded to 50K units. Computational experiments show that artificially generated DMUs preserve isotonicity and do not increase the collinearity of the original data as a whole.
Citation: Data
PubDate: 2023-09-27
DOI: 10.3390/data8100146
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 147: A Retinal Oct-Angiography and Cardiovascular
STAtus (RASTA) Dataset of Swept-Source Microvascular Imaging for
Cardiovascular Risk Assessment
Authors: Germanèse, Meriaudeau, Eid, Tadayoni, Ginhac, Anwer, Laure-Anne, Guenancia, Creuzot-Garcher, Gabrielle, Arnould
First page: 147
Abstract: In the context of exponential demographic growth, the imbalance between human resources and public health problems impels us to envision other solutions to the difficulties faced in the diagnosis, prevention, and large-scale management of the most common diseases. Cardiovascular diseases represent the leading cause of morbidity and mortality worldwide. A large-scale screening program would make it possible to promptly identify patients with high cardiovascular risk in order to manage them adequately. Optical coherence tomography angiography (OCT-A), as a window into the state of the cardiovascular system, is a rapid, reliable, and reproducible imaging examination that enables the prompt identification of at-risk patients through the use of automated classification models. One challenge that limits the development of computer-aided diagnostic programs is the small number of open-source OCT-A acquisitions available. To facilitate the development of such models, we have assembled a set of images of the retinal microvascular system from 499 patients. It consists of 814 angiocubes as well as 2005 en face images. Angiocubes were captured with a swept-source OCT-A device of patients with varying overall cardiovascular risk. To the best of our knowledge, our dataset, Retinal oct-Angiography and cardiovascular STAtus (RASTA), is the only publicly available dataset comprising such a variety of images from healthy and at-risk patients. This dataset will enable the development of generalizable models for screening cardiovascular diseases from OCT-A retinal images.
Citation: Data
PubDate: 2023-09-28
DOI: 10.3390/data8100147
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 148: Towards Data Storage, Scalability, and
Availability in Blockchain Systems: A Bibliometric Analysis
Authors: Meenakshi Kandpal, Veena Goswami, Rojalina Priyadarshini, Rabindra Kumar Barik
First page: 148
Abstract: In recent years, blockchain research has drawn attention from all across the world. It is a decentralized competence that is spread out and uncertain. Several nations and scholars have already successfully applied blockchain in numerous arenas. Blockchain is essential in delicate situations because it secures data and keeps it from being altered or forged. In addition, the market’s increased demand for data is driving demand for data scaling across all industries. Researchers from many nations have used blockchain in various sectors over time, thus bringing extreme focus to this newly escalating blockchain domain. Every research project begins with in-depth knowledge about the working domain, and new interest information about blockchain is quite scattered. This study analyzes academic literature on blockchain technology, emphasizing three key aspects: blockchain storage, scalability, and availability. These are critical areas within the broader field of blockchain technology. This study employs CiteSpace and VOSviewer to understand the current state of research in these areas comprehensively. These are bibliometric analysis tools commonly used in academic research to examine patterns and relationships within scientific literature. Thus, to visualize a way to store data with scalability and availability while keeping the security of the blockchain in sync, the required research has been performed on the storage, scalability, and availability of data in the blockchain environment. The ultimate goal is to contribute to developing secure and efficient data storage solutions within blockchain technology.
Citation: Data
PubDate: 2023-10-02
DOI: 10.3390/data8100148
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 149: Fast Radius Outlier Filter Variant for Large
Point Clouds
Authors: Péter Szutor, Marianna Zichar
First page: 149
Abstract: Currently, several devices (such as laser scanners, Kinect, time of flight cameras, medical imaging equipment (CT, MRI, intraoral scanners)), and technologies (e.g., photogrammetry) are capable of generating 3D point clouds. Each point cloud type has its unique structure or characteristics, but they have a common point: they may be loaded with errors. Before further data processing, these unwanted portions of the data must be removed with filtering and outlier detection. There are several algorithms for detecting outliers, but their performances decrease when the size of the point cloud increases. The industry has a high demand for efficient algorithms to deal with large point clouds.
Citation: Data
PubDate: 2023-10-02
DOI: 10.3390/data8100149
Issue No: Vol. 8, No. 10 (2023)
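The abstract above is cut off before it describes the proposed variant, so the following is only a sketch of the conventional radius outlier filter it builds on: a point is kept if at least a given number of other points lie within a given radius. The uniform grid used to limit the neighbor search, the function name, and the parameters are assumptions made for illustration and are not the authors' algorithm.

```python
from collections import defaultdict
from itertools import product


def radius_outlier_filter(points, radius, min_neighbors):
    """Keep 3D points that have at least min_neighbors other points within radius."""

    def cell_of(p):
        # Grid cells have edge length equal to radius, so every neighbor
        # within radius lies in the same cell or one of the 26 adjacent cells.
        return tuple(int(c // radius) for c in p)

    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[cell_of(p)].append(i)

    r2 = radius * radius
    kept = []
    for i, p in enumerate(points):
        cx, cy, cz = cell_of(p)
        count = 0
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                if j == i:
                    continue
                dist2 = sum((a - b) ** 2 for a, b in zip(p, points[j]))
                if dist2 <= r2:
                    count += 1
        if count >= min_neighbors:
            kept.append(p)
    return kept
```

Bucketing points into cells avoids comparing every pair of points, which matters as the cloud grows; a production implementation would more likely rely on a spatial index from an established library.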
- Data, Vol. 8, Pages 150: Power-Flow Simulations for Integrating Renewable
Distributed Generation from Biogas, Photovoltaic, and Small Wind Sources
on an Underground Distribution Feeder
Authors: Welson Bassi, Igor Cordeiro, Ildo Luis Sauer
First page: 150
Abstract: The rapid expansion of distributed generation leads to the integration of an increasing number of energy generation sources. However, integrating these sources into electrical distribution networks presents specific challenges to ensure that the distribution networks can effectively accommodate the associated distributed energy and power. Thus, it is crucial to evaluate the electrical effects of power along the conductors, components, and loads. Power-flow analysis is a well-established numerical methodology for assessing parameters and quantities within power systems during steady-state operation. The University of São Paulo’s Cidade Universitária “Armando de Salles Oliveira” (CUASO) campus in São Paulo, Brazil, features an underground power distribution system. The Institute of Energy and Environment (IEE) leads the integration of several distributed generation (DG) sources, including a biogas plant, photovoltaic installations, and a small wind turbine, into one of the CUASO’s feeders, referred to as “USP-105”. Load-flow simulations were conducted using the PowerWorld™ Simulator v.23, considering the interconnection of these sources. This dataset provides comprehensive information and computational files utilized in the simulations. It serves as a valuable resource for reanalysis, didactic purposes, and the dissemination of technical insights related to DG implementation.
Citation: Data
PubDate: 2023-10-07
DOI: 10.3390/data8100150
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 151: Tracking a Decade of Hydrogeological Emergencies
in Italian Municipalities
Authors: Alessio Gatto, Stefano Clò, Federico Martellozzo, Samuele Segoni
First page: 151
Abstract: This dataset collects tabular and geographical information about all hydrogeological disasters (landslides and floods) that occurred in Italy from 2013 to 2022 that caused such severe impacts as to require the declaration of national-level emergencies. The severity and spatiotemporal extension of each emergency are characterized in terms of duration and timing, funds requested by local administrations, funds approved by the national government, and municipalities and provinces hit by the event (further subdivided between those included in the emergency and those not, depending on whether relevant impacts were ascertained). Italian exposure to hydrogeological risk is portrayed strikingly: in the covered period, 123 emergencies affected Italy, all regions were struck at least once, and some provinces were struck more than 10 times. Damage declared by local institutions adds up to EUR 11,000,000,000, while national recovery funds add up to EUR 1,000,000,000. The dataset may foster further research on risk assessment, econometric analysis, public policy support, and decision-making implementation. Moreover, it provides systematic evidence helpful in raising awareness about hydrogeological risks affecting Italy.
Citation: Data
PubDate: 2023-10-11
DOI: 10.3390/data8100151
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 152: Dataset of Contamination (2009–2022) Legacy
Contaminants (PCB and DDT) in Zooplankton of Lake Maggiore (CIPAIS,
International Commission for the Protection of Italian-Swiss Waters)
Authors: Roberta Bettinetti, Roberta Piscia, Marina Manca, Silvana Galassi, Silvia Quadroni, Carlo Dossi, Rossella Perna, Emanuela Boggio, Ginevra Boldrocchi, Michela Mazzoni, Benedetta Villa
First page: 152
Abstract: In this paper, we describe a 13-year (2009–2022) dataset of legacy POP concentrations (DDTtot and sumPCB14; from 2016, isomer and congener concentrations are also reported) in the planktonic crustaceans of Lake Maggiore (≥450 µm size fraction). The data were collected in the framework of a monitoring program aimed at assessing the presence of pollutants in the lake biota, including zooplankton organisms directly preyed upon by fish. The data report both the concentration of DDTtot and sumPCB14 in the zooplankton and the standing stock density and biomass of the population in each season. The dataset allows for detecting changes in the concentration over the long term and within a year, thus providing evidence for the seasonal and the plurennial variations in the presence of these pollutants in the lake. They also provide a basis for further studies aimed at modeling paths and the fate of persistent organic pollutants, for which the amount of toxicants stocked in the zooplankton compartment linked to fish is a crucial estimate.
Citation: Data
PubDate: 2023-10-12
DOI: 10.3390/data8100152
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 153: USC-DCT: A Collection of Diverse Classification
Tasks
Authors: Adam M. Jones, Gozde Sahin, Zachary W. Murdock, Yunhao Ge, Ao Xu, Yuecheng Li, Di Wu, Shuo Ni, Po-Hsuan Huang, Kiran Lekkala, Laurent Itti
First page: 153
Abstract: Machine learning is a crucial tool for both academic and real-world applications. Classification problems are often used as the preferred showcase in this space, which has led to a wide variety of datasets being collected and utilized for a myriad of applications. Unfortunately, there is very little standardization in how these datasets are collected, processed, and disseminated. As new learning paradigms like lifelong or meta-learning become more popular, the demand for merging tasks for at-scale evaluation of algorithms has also increased. This paper provides a methodology for processing and cleaning datasets that can be applied to existing or new classification tasks as well as implements these practices in a collection of diverse classification tasks called USC-DCT. Constructed using 107 classification tasks collected from the internet, this collection provides a transparent and standardized pipeline that can be useful for many different applications and frameworks. While there are currently 107 tasks, USC-DCT is designed to enable future growth. Additional discussion provides explanations of applications in machine learning paradigms such as transfer, lifelong, or meta-learning, how revisions to the collection will be handled, and further tips for curating and using classification tasks at this scale.
Citation: Data
PubDate: 2023-10-12
DOI: 10.3390/data8100153
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 154: A Dataset of Non-Indigenous and Native Fish of
the Volga and Kama Rivers (European Russia)
Authors: Dmitry P. Karabanov, Dmitry D. Pavlov, Yury Y. Dgebuadze, Mikhail I. Bazarov, Elena A. Borovikova, Yuriy V. Gerasimov, Yulia V. Kodukhova, Pavel B. Mikheev, Eduard V. Nikitin, Tatyana L. Opaleva, Yuri A. Severov, Rimma Z. Sabitova, Alexey K. Smirnov, Yury I. Solomatin, Igor A. Stolbunov, Alexander I. Tsvetkov, Stanislav A. Vlasenko, Irina S. Voroshilova, Wenjun Zhong, Xiaowei Zhang, Alexey A. Kotov
First page: 154
Abstract: Fish in the Volga-Kama River System (the largest river system in Europe) are important as a crucial food source for local populations; fish have the highest trophic level among hydrobionts. The purpose of this research is to describe the diversity of non-indigenous and native fish in the Volga and Kama Rivers, in the European part of Russia. This dataset encompasses data from June 2001 to September 2021 and comprises 1888 records (36,376 individual observations) for littoral and pelagic habitats from 143 sampling sites, representing 52 species from 42 genera in 22 families. The dataset follows the Darwin Core standard format and has been fully released in the Global Biodiversity Information Facility (GBIF) under a CC-BY 4.0 International license. The data are validated against several international databases such as FishBase, Eschmeyer’s Catalog of Fishes, and the Barcode of Life Data System, as well as the SAS.Planet geoinformation system. Newly established populations have been found for several species belonging to the following Actinopteri families: Alosidae, Anguillidae, Cichlidae, Ehiravidae, Gobiidae, Odontobutidae, Syngnathidae, and Xenocyprididae. Therefore, this dataset can be used in species distribution analyses for particular taxa, which are especially important for non-indigenous species.
Citation: Data
PubDate: 2023-10-18
DOI: 10.3390/data8100154
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 155: A Data-Driven Exploration of a New Islamic Fatwas
Dataset for Arabic NLP Tasks
Authors: Ohoud Alyemny, Hend Al-Khalifa, Abdulrahman Mirza
First page: 155
Abstract: Islamic content is a broad and diverse domain that encompasses various sources, topics, and perspectives. However, there is a lack of comprehensive and reliable datasets that can facilitate studies on Islamic content. In this paper, we present fatwaset, the first public Arabic dataset of Islamic fatwas. It contains Islamic fatwas that we collected from various trusted and authenticated sources in the Islamic fatwa domain, such as agencies, religious scholars, and websites. Fatwaset is a rich resource, as it contains not only fatwas but also a considerable set of their surrounding metadata. It can be used for many natural language processing (NLP) tasks, such as language modeling, question answering, author attribution, topic identification, text classification, and text summarization. It can also support other domains that are related to Islamic culture, such as philosophy and language art. We describe the methodology and criteria we used to select the content, as well as the challenges and limitations we faced. Additionally, we perform an Exploratory Data Analysis (EDA), which investigates the dataset from different perspectives. The results of the EDA reveal important information that greatly benefits researchers in this area.
Citation: Data
PubDate: 2023-10-19
DOI: 10.3390/data8100155
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 156: Cybersecurity Risk Assessments within Critical
Infrastructure Social Networks
Authors: Alimbubi Aktayeva, Yerkhan Makatov, Akku Kubigenova Tulegenovna, Aibek Dautov, Rozamgul Niyazova, Maxud Zhamankarin, Sergey Khan
First page: 156
Abstract: Cybersecurity social networking is a new scientific and engineering discipline that was interdisciplinary in its early days, but is now transdisciplinary. Reviewing and analyzing the principal tasks related to information collection, monitoring of social networks, assessment methods, and preventing and combating cybersecurity threats is, therefore, essential and pending. There is a need to design certain methods, models, and program complexes aimed at estimating risks related to the cyberspace of social networks and the support of their activities. This study considers a risk to be the combination of the consequences of a given event (or incident) and its likelihood of occurrence, while risk assessment is the general process of identifying, estimating, and evaluating risk. The findings of the study show that the technique of cognitive modeling for risk assessment is part of a comprehensive cybersecurity approach included in the requirements of basic IT standards, including IT security risk management. The study presents a comprehensive approach in the field of cybersecurity in social networks that allows for consideration of all the elements that constitute cybersecurity as a complex, interconnected system. The ultimate goal of this approach to cybersecurity is the organization of an uninterrupted scheme of protection against any impacts related to physical, hardware, software, network, and human objects or resources of the critical infrastructure of social networks, as well as the integration of various levels and means of protection.
Citation: Data
PubDate: 2023-10-19
DOI: 10.3390/data8100156
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 157: Industrial Environment Multi-Sensor Dataset for
Vehicle Indoor Tracking with Wi-Fi, Inertial and Odometry Data
Authors: Ivo Silva, Cristiano Pendão, Joaquín Torres-Sospedra, Adriano Moreira
First page: 157
Abstract: This paper describes a dataset collected in an industrial setting using a mobile unit resembling an industrial vehicle equipped with several sensors. Wi-Fi interfaces collect signals from available Access Points (APs), while motion sensors collect data regarding the mobile unit’s movement (orientation and displacement). The distinctive features of this dataset include synchronous data collection from multiple sensors, such as Wi-Fi data acquired from multiple interfaces (including a radio map), orientation provided by two low-cost Inertial Measurement Unit (IMU) sensors, and displacement (travelled distance) measured by an absolute encoder attached to the mobile unit’s wheel. Accurate ground-truth information was determined using a computer vision approach that recorded timestamps as the mobile unit passed through reference locations. We assessed the quality of the proposed dataset by applying baseline methods for dead reckoning and Wi-Fi fingerprinting. The average positioning error for simple dead reckoning, without using any other absolute positioning technique, is 8.25 m and 11.66 m for IMU1 and IMU2, respectively. The average positioning error for simple Wi-Fi fingerprinting is 2.19 m when combining the RSSI information from five Wi-Fi interfaces. This dataset contributes to the fields of Industry 4.0 and mobile sensing, providing researchers with a resource to develop, test, and evaluate indoor tracking solutions for industrial vehicles.
Citation: Data
PubDate: 2023-10-23
DOI: 10.3390/data8100157
Issue No: Vol. 8, No. 10 (2023)
- Data, Vol. 8, Pages 158: Panel Regression Modelling for COVID-19
Infections and Deaths in Tamil Nadu, India
Authors: Rajarathinam Arunachalam
First page: 158
Abstract: The impacts of the coronavirus disease 2019 (COVID-19) pandemic have been extremely severe, with both economic and health crises experienced worldwide. Based on the panel regression model, this study examined the trends and correlations in the number of COVID-19-related deaths and the number of COVID-19-infected cases in all 37 regions of the Tamil Nadu state in India, in August 2020. The fixed effects model had the greatest R² value of 78% and exhibited significant results. The slope coefficient was also highly significant, showing a considerable variation in the relationship between new COVID-19 cases and deaths. Additionally, for every unit increase in COVID-19-infected cases, the death rate increased by 0.02%.
Citation: Data
PubDate: 2023-10-23
DOI: 10.3390/data8100158
Issue No: Vol. 8, No. 10 (2023)