Subjects -> COMPUTER SCIENCE (Total: 2313 journals)
    - ANIMATION AND SIMULATION (33 journals)
    - ARTIFICIAL INTELLIGENCE (133 journals)
    - AUTOMATION AND ROBOTICS (116 journals)
    - COMPUTER ARCHITECTURE (11 journals)
    - COMPUTER ENGINEERING (12 journals)
    - COMPUTER GAMES (23 journals)
    - COMPUTER PROGRAMMING (25 journals)
    - COMPUTER SCIENCE (1305 journals)
    - COMPUTER SECURITY (59 journals)
    - DATA BASE MANAGEMENT (21 journals)
    - DATA MINING (50 journals)
    - E-BUSINESS (21 journals)
    - E-LEARNING (30 journals)
    - IMAGE AND VIDEO PROCESSING (42 journals)
    - INFORMATION SYSTEMS (109 journals)
    - INTERNET (111 journals)
    - SOCIAL WEB (61 journals)
    - SOFTWARE (43 journals)
    - THEORY OF COMPUTING (10 journals)

COMPUTER SCIENCE (1305 journals)

Showing 201 - 400 of 872 Journals sorted alphabetically
Computational Communication Research     Open Access   (Followers: 2)
Computational Complexity     Hybrid Journal   (Followers: 5)
Computational Condensed Matter     Open Access   (Followers: 1)
Computational Ecology and Software     Open Access   (Followers: 9)
Computational Economics     Hybrid Journal   (Followers: 13)
Computational Geosciences     Hybrid Journal   (Followers: 16)
Computational Linguistics     Open Access   (Followers: 26)
Computational Management Science     Hybrid Journal  
Computational Mathematics and Modeling     Hybrid Journal   (Followers: 8)
Computational Mechanics     Hybrid Journal   (Followers: 13)
Computational Methods and Function Theory     Hybrid Journal  
Computational Molecular Bioscience     Open Access   (Followers: 1)
Computational Optimization and Applications     Hybrid Journal   (Followers: 10)
Computational Particle Mechanics     Hybrid Journal   (Followers: 1)
Computational Science and Discovery     Full-text available via subscription   (Followers: 3)
Computational Science and Techniques     Open Access  
Computational Statistics     Hybrid Journal   (Followers: 16)
Computational Statistics & Data Analysis     Hybrid Journal   (Followers: 37)
Computational Toxicology     Hybrid Journal  
Computer     Full-text available via subscription   (Followers: 178)
Computer Aided Surgery     Open Access   (Followers: 5)
Computer Applications in Engineering Education     Hybrid Journal   (Followers: 6)
Computer Communications     Hybrid Journal   (Followers: 18)
Computer Engineering and Applications Journal     Open Access   (Followers: 8)
Computer Journal     Hybrid Journal   (Followers: 7)
Computer Methods in Applied Mechanics and Engineering     Hybrid Journal   (Followers: 29)
Computer Methods in Biomechanics and Biomedical Engineering     Hybrid Journal   (Followers: 12)
Computer Methods in Biomechanics and Biomedical Engineering : Imaging & Visualization     Hybrid Journal  
Computer Music Journal     Hybrid Journal   (Followers: 22)
Computer Physics Communications     Hybrid Journal   (Followers: 11)
Computer Science - Research and Development     Hybrid Journal   (Followers: 9)
Computer Science and Engineering     Open Access   (Followers: 14)
Computer Science and Information Technology     Open Access   (Followers: 11)
Computer Science Education     Hybrid Journal   (Followers: 18)
Computer Science Journal     Open Access   (Followers: 22)
Computer Science Review     Hybrid Journal   (Followers: 12)
Computer Standards & Interfaces     Hybrid Journal   (Followers: 3)
Computer Supported Cooperative Work (CSCW)     Hybrid Journal   (Followers: 10)
Computer-aided Civil and Infrastructure Engineering     Hybrid Journal   (Followers: 9)
Computer-Aided Design and Applications     Hybrid Journal   (Followers: 6)
Computers     Open Access   (Followers: 1)
Computers & Chemical Engineering     Hybrid Journal   (Followers: 11)
Computers & Education     Hybrid Journal   (Followers: 94)
Computers & Electrical Engineering     Hybrid Journal   (Followers: 11)
Computers & Geosciences     Hybrid Journal   (Followers: 30)
Computers & Mathematics with Applications     Full-text available via subscription   (Followers: 11)
Computers & Structures     Hybrid Journal   (Followers: 46)
Computers & Education Open     Open Access   (Followers: 4)
Computers & Industrial Engineering     Hybrid Journal   (Followers: 7)
Computers and Composition     Hybrid Journal   (Followers: 13)
Computers and Education: Artificial Intelligence     Open Access   (Followers: 7)
Computers and Electronics in Agriculture     Hybrid Journal   (Followers: 11)
Computers and Geotechnics     Hybrid Journal   (Followers: 13)
Computers in Biology and Medicine     Hybrid Journal   (Followers: 8)
Computers in Entertainment     Hybrid Journal   (Followers: 3)
Computers in Human Behavior Reports     Open Access   (Followers: 1)
Computers in Industry     Hybrid Journal   (Followers: 7)
Computers in the Schools     Hybrid Journal   (Followers: 8)
Computers, Environment and Urban Systems     Hybrid Journal   (Followers: 13)
Computerworld Magazine     Free   (Followers: 2)
Computing     Hybrid Journal   (Followers: 2)
Computing and Software for Big Science     Hybrid Journal   (Followers: 1)
Computing and Visualization in Science     Hybrid Journal   (Followers: 4)
Computing in Science & Engineering     Full-text available via subscription   (Followers: 31)
Computing Reviews     Full-text available via subscription   (Followers: 1)
Concurrency and Computation: Practice & Experience     Hybrid Journal   (Followers: 1)
Connection Science     Open Access  
Control Engineering Practice     Hybrid Journal   (Followers: 49)
Cryptologia     Hybrid Journal   (Followers: 3)
CSI Transactions on ICT     Hybrid Journal   (Followers: 2)
Cuadernos de Documentación Multimedia     Open Access  
Current Science     Open Access   (Followers: 147)
Cyber-Physical Systems     Hybrid Journal  
Cyberspace : Jurnal Pendidikan Teknologi Informasi     Open Access  
DAIMI Report Series     Open Access  
Data     Open Access   (Followers: 4)
Data & Policy     Open Access   (Followers: 4)
Data Science     Open Access   (Followers: 5)
Data Science and Engineering     Open Access   (Followers: 3)
Data Technologies and Applications     Hybrid Journal   (Followers: 244)
Data-Centric Engineering     Open Access   (Followers: 2)
Datenbank-Spektrum     Hybrid Journal   (Followers: 1)
Datenschutz und Datensicherheit - DuD     Hybrid Journal  
Decision Analytics     Open Access   (Followers: 3)
Decision Support Systems     Hybrid Journal   (Followers: 14)
Design Journal : An International Journal for All Aspects of Design     Hybrid Journal   (Followers: 38)
Digital Biomarkers     Open Access   (Followers: 1)
Digital Chemical Engineering     Open Access   (Followers: 2)
Digital Chinese Medicine     Open Access  
Digital Creativity     Hybrid Journal   (Followers: 12)
Digital Experiences in Mathematics Education     Hybrid Journal   (Followers: 3)
Digital Finance : Smart Data Analytics, Investment Innovation, and Financial Technology     Hybrid Journal   (Followers: 3)
Digital Geography and Society     Open Access   (Followers: 3)
Digital Government : Research and Practice     Open Access   (Followers: 2)
Digital Health     Open Access   (Followers: 10)
Digital Journalism     Hybrid Journal   (Followers: 8)
Digital Medicine     Open Access   (Followers: 3)
Digital Platform: Information Technologies in Sociocultural Sphere     Open Access   (Followers: 4)
Digital Policy, Regulation and Governance     Hybrid Journal   (Followers: 3)
Digital War     Hybrid Journal   (Followers: 2)
Digitale Welt : Das Wirtschaftsmagazin zur Digitalisierung     Hybrid Journal  
Digitális Bölcsészet / Digital Humanities     Open Access   (Followers: 2)
Disaster Prevention and Management     Hybrid Journal   (Followers: 27)
Discours     Open Access   (Followers: 1)
Discourse & Communication     Hybrid Journal   (Followers: 27)
Discover Internet of Things     Open Access   (Followers: 3)
Discrete and Continuous Models and Applied Computational Science     Open Access  
Discrete Event Dynamic Systems     Hybrid Journal   (Followers: 3)
Discrete Mathematics & Theoretical Computer Science     Open Access   (Followers: 1)
Discrete Optimization     Full-text available via subscription   (Followers: 6)
Displays     Hybrid Journal  
Distributed and Parallel Databases     Hybrid Journal   (Followers: 2)
e-learning and education (eleed)     Open Access   (Followers: 40)
Ecological Indicators     Hybrid Journal   (Followers: 22)
Ecological Informatics     Hybrid Journal   (Followers: 4)
Ecological Management & Restoration     Hybrid Journal   (Followers: 16)
Ecosystems     Hybrid Journal   (Followers: 33)
Edu Komputika Journal     Open Access   (Followers: 1)
Education and Information Technologies     Hybrid Journal   (Followers: 54)
Educational Philosophy and Theory     Hybrid Journal   (Followers: 11)
Educational Psychology in Practice: theory, research and practice in educational psychology     Hybrid Journal   (Followers: 13)
Educational Research and Evaluation: An International Journal on Theory and Practice     Hybrid Journal   (Followers: 8)
Educational Theory     Hybrid Journal   (Followers: 9)
Egyptian Informatics Journal     Open Access   (Followers: 6)
Electronic Commerce Research and Applications     Hybrid Journal   (Followers: 5)
Electronic Design     Partially Free   (Followers: 155)
electronic Journal of Health Informatics     Open Access   (Followers: 7)
Electronic Letters on Computer Vision and Image Analysis     Open Access   (Followers: 10)
Elektron     Open Access  
Empirical Software Engineering     Hybrid Journal   (Followers: 10)
Energy for Sustainable Development     Hybrid Journal   (Followers: 13)
Engineering & Technology     Hybrid Journal   (Followers: 23)
Engineering Applications of Computational Fluid Mechanics     Open Access   (Followers: 23)
Engineering Computations     Hybrid Journal   (Followers: 3)
Engineering Economist, The     Hybrid Journal   (Followers: 4)
Engineering Optimization     Hybrid Journal   (Followers: 11)
Engineering With Computers     Hybrid Journal   (Followers: 5)
Enterprise Information Systems     Hybrid Journal   (Followers: 2)
Entertainment Computing     Hybrid Journal   (Followers: 2)
Environmental and Ecological Statistics     Hybrid Journal   (Followers: 7)
Environmental Communication: A Journal of Nature and Culture     Hybrid Journal   (Followers: 16)
EPJ Data Science     Open Access   (Followers: 11)
ESAIM: Control Optimisation and Calculus of Variations     Open Access   (Followers: 3)
Ethics and Information Technology     Hybrid Journal   (Followers: 66)
eTransportation     Open Access   (Followers: 1)
EURO Journal on Computational Optimization     Open Access   (Followers: 4)
EuroCALL Review     Open Access   (Followers: 1)
European Food Research and Technology     Hybrid Journal   (Followers: 8)
European Journal of Combinatorics     Full-text available via subscription   (Followers: 3)
European Journal of Computational Mechanics     Hybrid Journal   (Followers: 1)
European Journal of Information Systems     Hybrid Journal   (Followers: 97)
European Journal of Law and Technology     Open Access   (Followers: 21)
European Journal of Political Theory     Hybrid Journal   (Followers: 31)
Evolutionary Computation     Hybrid Journal   (Followers: 12)
Fibreculture Journal     Open Access   (Followers: 9)
Finite Fields and Their Applications     Full-text available via subscription   (Followers: 5)
Fixed Point Theory and Applications     Open Access  
Focus on Catalysts     Full-text available via subscription  
Focus on Pigments     Full-text available via subscription   (Followers: 3)
Focus on Powder Coatings     Full-text available via subscription   (Followers: 5)
Forensic Science International: Digital Investigation     Full-text available via subscription   (Followers: 363)
Formal Aspects of Computing     Hybrid Journal   (Followers: 3)
Formal Methods in System Design     Hybrid Journal   (Followers: 7)
Forschung     Hybrid Journal   (Followers: 1)
Foundations and Trends® in Communications and Information Theory     Full-text available via subscription   (Followers: 6)
Foundations and Trends® in Databases     Full-text available via subscription   (Followers: 2)
Foundations and Trends® in Human-Computer Interaction     Full-text available via subscription   (Followers: 5)
Foundations and Trends® in Information Retrieval     Full-text available via subscription   (Followers: 30)
Foundations and Trends® in Networking     Full-text available via subscription   (Followers: 1)
Foundations and Trends® in Signal Processing     Full-text available via subscription   (Followers: 6)
Foundations and Trends® in Theoretical Computer Science     Full-text available via subscription   (Followers: 1)
Foundations of Computational Mathematics     Hybrid Journal   (Followers: 1)
Foundations of Computing and Decision Sciences     Open Access  
Frontiers in Computational Neuroscience     Open Access   (Followers: 24)
Frontiers in Computer Science     Open Access   (Followers: 1)
Frontiers in Digital Health     Open Access   (Followers: 4)
Frontiers in Digital Humanities     Open Access   (Followers: 9)
Frontiers in ICT     Open Access  
Frontiers in Neuromorphic Engineering     Open Access   (Followers: 2)
Frontiers in Research Metrics and Analytics     Open Access   (Followers: 5)
Frontiers of Computer Science in China     Hybrid Journal   (Followers: 2)
Frontiers of Environmental Science & Engineering     Hybrid Journal   (Followers: 3)
Frontiers of Information Technology & Electronic Engineering     Hybrid Journal  
Fuel Cells Bulletin     Full-text available via subscription   (Followers: 10)
Functional Analysis and Its Applications     Hybrid Journal   (Followers: 2)
Future Computing and Informatics Journal     Open Access   (Followers: 1)
Future Generation Computer Systems     Hybrid Journal   (Followers: 2)
Geo-spatial Information Science     Open Access   (Followers: 8)
Geoforum Perspektiv     Open Access   (Followers: 1)
GeoInformatica     Hybrid Journal   (Followers: 7)
Geoinformatics FCE CTU     Open Access   (Followers: 5)
GetMobile : Mobile Computing and Communications     Full-text available via subscription   (Followers: 2)
Government Information Quarterly     Hybrid Journal   (Followers: 29)
Granular Computing     Hybrid Journal  
Graphics and Visual Computing     Open Access  
Grey Room     Hybrid Journal   (Followers: 21)
Group Dynamics : Theory, Research, and Practice     Full-text available via subscription   (Followers: 16)
Groups, Complexity, Cryptology     Open Access   (Followers: 2)
HardwareX     Open Access  
Harvard Data Science Review     Open Access   (Followers: 2)

Number of Followers: 4  

  This is an Open Access journal
ISSN (Online) 2306-5729
Published by MDPI  [258 journals]
  • Data, Vol. 8, Pages 135: Enhancing Small Tabular Clinical Trial Dataset
           through Hybrid Data Augmentation: Combining SMOTE and WCGAN-GP

    • Authors: Winston Wang, Tun-Wen Pai
      First page: 135
      Abstract: This study addressed the challenge of training generative adversarial networks (GANs) on small tabular clinical trial datasets for data augmentation, which are known to pose difficulties in training due to limited sample sizes. To overcome this obstacle, a hybrid approach is proposed, combining the synthetic minority oversampling technique (SMOTE) to initially augment the original data to a more substantial size for improving the subsequent GAN training with a Wasserstein conditional generative adversarial network with gradient penalty (WCGAN-GP), proven for its state-of-the-art performance and enhanced stability. The ultimate objective of this research was to demonstrate that the quality of synthetic tabular data generated by the final WCGAN-GP model maintains the structural integrity and statistical representation of the original small dataset using this hybrid approach. This focus is particularly relevant for clinical trials, where limited data availability due to privacy concerns and restricted accessibility to subject enrollment pose common challenges. Despite the limitation of data, the findings demonstrate that the hybrid approach successfully generates synthetic data that closely preserves the characteristics of the original small dataset. By harnessing the power of this hybrid approach to generate faithful synthetic data, the potential for enhancing data-driven research in drug clinical trials becomes evident. This includes enabling a robust analysis on small datasets, supplementing the lack of clinical trial data, facilitating its utility in machine learning tasks, even extending to using the model for anomaly detection to ensure better quality control during clinical trial data collection, all while prioritizing data privacy and implementing strict data protection measures.
      Citation: Data
      PubDate: 2023-08-23
      DOI: 10.3390/data8090135
      Issue No: Vol. 8, No. 9 (2023)
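
The SMOTE step named in the abstract above creates synthetic minority samples by interpolating between a real sample and one of its nearest neighbours. The following is a minimal sketch of that interpolation only, not the authors' pipeline: the function name `smote_sample`, the toy data, and the choice of `k` are assumptions made for illustration, and the WCGAN-GP stage is not shown.

```python
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical toy minority class with two numeric features.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
```

Each synthetic point lies on the segment between two real minority points, so it stays inside the range already covered by the original data.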
  • Data, Vol. 8, Pages 136: Knowledge Graph Dataset for Semantic Enrichment
           of Picture Description in NAPS Database

    • Authors: Marko Horvat, Gordan Gledec, Tomislav Jagušt, Zoran Kalafatić
      First page: 136
      Abstract: This data description introduces a comprehensive knowledge graph (KG) dataset with detailed information about the relevant high-level semantics of visual stimuli used to induce emotional states stored in the Nencki Affective Picture System (NAPS) repository. The dataset contains 6808 systematically manually assigned annotations for 1356 NAPS pictures in 5 categories, linked to WordNet synsets and Suggested Upper Merged Ontology (SUMO) concepts presented in a tabular format. Both knowledge databases provide an extensive and supervised taxonomy glossary suitable for describing picture semantics. The annotation glossary consists of 935 WordNet and 513 SUMO entities. A description of the dataset and the specific processes used to collect, process, review, and publish the dataset as open data are also provided. This dataset is unique in that it captures complex objects, scenes, actions, and the overall context of emotional stimuli with knowledge taxonomies at a high level of quality. It provides a valuable resource for a variety of projects investigating emotion, attention, and related phenomena. In addition, researchers can use this dataset to explore the relationship between emotions and high-level semantics or to develop data-retrieval tools to generate personalized stimuli sequences. The dataset is freely available in common formats (Excel and CSV).
      Citation: Data
      PubDate: 2023-08-24
      DOI: 10.3390/data8090136
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 137: A Framework for Evaluating Renewable Energy for
           Decision-Making Integrating a Hybrid FAHP-TOPSIS Approach: A Case Study in
           Valle del Cauca, Colombia

    • Authors: Mateo Barrera-Zapata, Fabian Zuñiga-Cortes, Eduardo Caicedo-Bravo
      First page: 137
      Abstract: At present, the energy landscape of many countries faces transformational challenges driven by sustainable development objectives, supported by the implementation of clean technologies, such as renewable energy sources, to meet the flexibility and diversification needs of the traditional energy mix. However, integrating these technologies requires a thorough study of the context in which they are developed. Furthermore, it is necessary to carry out an analysis from a sustainable approach that quantifies the impact of proposals on multiple objectives established by stakeholders. This article presents a framework for analysis that integrates a method for evaluating the technical feasibility of resources for photovoltaic solar, wind, small hydroelectric power, and biomass generation. These resources are used to construct a set of alternatives and are evaluated using a hybrid FAHP-TOPSIS approach. FAHP-TOPSIS is used as a comparison technique among a collection of technical, economic, and environmental criteria, ranking the alternatives considering their level of trade-off between criteria. The results of a case study in Valle del Cauca (Colombia) offer a wide range of alternatives and indicate a combination of 50% biomass, and 50% solar as the best, assisting in decision-making for the correct use of available resources and maximizing the benefits for stakeholders.
      Citation: Data
      PubDate: 2023-08-30
      DOI: 10.3390/data8090137
      Issue No: Vol. 8, No. 9 (2023)
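
The TOPSIS half of the hybrid approach described above ranks alternatives by their closeness to an ideal solution. The sketch below shows classic (crisp) TOPSIS under stated assumptions: the criterion weights are taken as given (in the article they would come from the fuzzy AHP step, which is not reproduced here), and the function name `topsis`, the three alternatives, and all numbers are hypothetical.

```python
import math

def topsis(matrix, weights, benefit):
    """Score alternatives (rows) over criteria (columns) with classic TOPSIS.
    benefit[j] is True when a larger value of criterion j is better."""
    n_crit = len(weights)
    # Vector-normalise each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal and anti-ideal values depend on whether a criterion is a benefit or a cost.
    cols = list(zip(*v))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        # Relative closeness: 1.0 is the ideal, 0.0 the anti-ideal.
        scores.append(d_worst / (d_best + d_worst))
    return scores

# Hypothetical alternatives scored on (energy yield, cost, emissions).
matrix = [[70, 40, 30], [50, 20, 20], [90, 80, 60]]
weights = [0.5, 0.3, 0.2]          # assumed output of a weighting step
benefit = [True, False, False]     # yield is a benefit; cost and emissions are costs
scores = topsis(matrix, weights, benefit)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The alternative with the highest closeness score is ranked first; changing the weights changes the trade-off between criteria and can change the ranking.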
  • Data, Vol. 8, Pages 138: Using Landsat-5 for Accurate Historical LULC
           Classification: A Comparison of Machine Learning Models

    • Authors: Denis Krivoguz, Sergei G. Chernyi, Elena Zinchenko, Artem Silkin, Anton Zinchenko
      First page: 138
      Abstract: This study investigates the application of various machine learning models for land use and land cover (LULC) classification in the Kerch Peninsula. The study utilizes archival field data, cadastral data, and published scientific literature for model training and testing, using Landsat-5 imagery from 1990 as input data. Four machine learning models (deep neural network, Random Forest, support vector machine (SVM), and AdaBoost) are employed, and their hyperparameters are tuned using random search and grid search. Model performance is evaluated through cross-validation and confusion matrices. The deep neural network achieves the highest accuracy (96.2%) and performs well in classifying water, urban lands, open soils, and high vegetation. However, it faces challenges in classifying grasslands, bare lands, and agricultural areas. The Random Forest model achieves an accuracy of 90.5% but struggles with differentiating high vegetation from agricultural lands. The SVM model achieves an accuracy of 86.1%, while the AdaBoost model performs the lowest with an accuracy of 58.4%. The novel contributions of this study include the comparison and evaluation of multiple machine learning models for land use classification in the Kerch Peninsula. The deep neural network and Random Forest models outperform SVM and AdaBoost in terms of accuracy. However, the use of limited data sources such as cadastral data and scientific articles may introduce limitations and potential errors. Future research should consider incorporating field studies and additional data sources for improved accuracy. This study provides valuable insights for land use classification, facilitating the assessment and management of natural resources in the Kerch Peninsula. The findings contribute to informed decision-making processes and lay the groundwork for further research in the field.
      Citation: Data
      PubDate: 2023-08-30
      DOI: 10.3390/data8090138
      Issue No: Vol. 8, No. 9 (2023)
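
The abstract above evaluates models through confusion matrices. As a small illustration of how overall accuracy and per-class recall are read off such a matrix, here is a sketch with an invented three-class matrix; the counts are hypothetical and are not taken from the study.

```python
def accuracy_and_recall(cm):
    """cm[i][j] counts samples of true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    # Recall per class: correctly predicted samples over all samples of that class.
    recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return correct / total, recall

# Hypothetical matrix: rows are true classes, columns are predictions.
cm = [
    [50, 0, 0],
    [2, 40, 8],
    [1, 9, 40],
]
acc, rec = accuracy_and_recall(cm)
```

A high overall accuracy can hide weak classes, which is why the per-class values are worth inspecting alongside it.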
  • Data, Vol. 8, Pages 139: Dataset of Multi-Aspect Integrated Migration

    • Authors: Diletta Goglia, Laura Pollacci, Alina Sîrbu
      First page: 139
      Abstract: Nowadays, new branches of research are proposing the use of non-traditional data sources for the study of migration trends in order to find an original methodology to answer open questions about cross-border human mobility. New knowledge extracted from these data must be validated using traditional data, which are however distributed across different sources and difficult to integrate. In this context we present the Multi-aspect Integrated Migration Indicators (MIMI) dataset, a new dataset of migration indicators (flows and stocks) and possible migration drivers (cultural, economic, demographic and geographic indicators). This was obtained through acquisition, transformation and integration of disparate traditional datasets together with social network data from Facebook (Social Connectedness Index). This article describes the process of gathering, embedding and merging traditional and novel variables, resulting in this new multidisciplinary dataset that we believe could significantly contribute to nowcast/forecast bilateral migration trends and migration drivers.
      Citation: Data
      PubDate: 2023-08-31
      DOI: 10.3390/data8090139
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 140: Employing Source Code Quality Analytics for
           Enriching Code Snippets Data

    • Authors: Thomas Karanikiotis, Themistoklis Diamantopoulos, Andreas Symeonidis
      First page: 140
      Abstract: The availability of code snippets in online repositories like GitHub has led to an uptick in code reuse, this way further supporting an open-source component-based development paradigm. The likelihood of code reuse rises when the code components or snippets are of high quality, especially in terms of readability, making their integration and upkeep simpler. Toward this direction, we have developed a dataset of code snippets that takes into account both the functional and the quality characteristics of the snippets. The dataset is based on the CodeSearchNet corpus and comprises additional information, including static analysis metrics, code violations, readability assessments, and source code similarity metrics. Thus, using this dataset, both software researchers and practitioners can conveniently find and employ code snippets that satisfy diverse functional needs while also demonstrating excellent readability and maintainability.
      Citation: Data
      PubDate: 2023-08-31
      DOI: 10.3390/data8090140
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 141: Thailand Raw Water Quality Dataset Analysis and

    • Authors: Jaturapith Krohkaew, Pongpon Nilaphruek, Niti Witthayawiroj, Sakchai Uapipatanakul, Yamin Thwe, Padma Nyoman Crisnapati
      First page: 141
      Abstract: Sustainable water quality data are important for understanding historical variability and trends in river regimes, as well as the impact of industrial waste on the health of aquatic ecosystems. Sustainable water management practices heavily depend on reliable and comprehensive data, prompting the need for accurate monitoring and assessment of water quality parameters. This research describes a reconstructed daily water quality dataset that complements rare historical observations for six station points along the Chao Phraya River in Thailand. Internet of Things technology and a Eureka water probe sensor are used to collect and reconstruct the water quality dataset for the period from June 2022 to February 2023, with Turbidity, Optical Dissolved Oxygen, Dissolved Oxygen Saturation, Spatial Conductivity, Acidity/Basicity, Total Dissolved Solids, Salinity, Temperature, Chlorophyll, and Depth as the recorded parameters from six different stations. The presented dataset comprises a total of 211,322 data points, which are separated into six CSV files. The dataset is then evaluated using the Long Short-Term Memory (LSTM) algorithm with a Mean Squared Error (MSE) of 0.0012256, and Root Mean Squared Error (RMSE) of 0.0350080. The proposed dataset provides valuable insights for researchers studying river ecosystems, supporting informed decision-making and sustainable water management practices.
      Citation: Data
      PubDate: 2023-09-04
      DOI: 10.3390/data8090141
      Issue No: Vol. 8, No. 9 (2023)
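
The two error figures quoted in the abstract above are related by RMSE = sqrt(MSE). The short check below uses only the two reported values; the variable names are illustrative.

```python
import math

mse_reported = 0.0012256    # MSE as quoted in the abstract
rmse_reported = 0.0350080   # RMSE as quoted in the abstract

# RMSE is the square root of MSE, so the two figures should match
# up to the rounding applied to the published MSE.
rmse_from_mse = math.sqrt(mse_reported)
difference = abs(rmse_from_mse - rmse_reported)
```

The difference is below 1e-6, which is consistent with the MSE having been rounded to seven decimal places before publication.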
  • Data, Vol. 8, Pages 142: Update of Dietary Supplement Label Database
           Addressing on Coding in Italy

    • Authors: Giorgia Perelli, Roberta Bernini, Massimo Lucarini, Alessandra Durazzo
      First page: 142
      Abstract: Harmonized composition data for foods and dietary supplements are needed for research and for policy decision making. For a correct assessment of dietary intake, the categorization and the classification of food products and dietary supplements are necessary. In recent decades, the marketing of dietary supplements has increased. A food supplements-based database has, as a principal feature, an intrinsic dynamism related to the continuous changes in formulations, which consequently leads to the need for constant monitoring of the market and for regular updates of the database. This study presents an update to the Dietary Supplement Label Database in Italy focused on dietary supplements coding. The updated dataset, presented here for the first time, consists of the codes of 216 dietary supplements currently on the market in Italy that have functional foods as their characterizing ingredients, using the two most commonly used description and classification systems: LanguaL™ and FoodEx2. This update represents a unique tool and guideline for other compilers and users for applying classification coding systems to dietary supplements. Moreover, this updated dataset represents a valuable resource for several applications such as epidemiological investigations, exposure studies, and dietary assessment.
      Citation: Data
      PubDate: 2023-09-13
      DOI: 10.3390/data8090142
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 143: A New Odd Beta Prime-Burr X Distribution with
           Applications to Petroleum Rock Sample Data and COVID-19 Mortality Rate

    • Authors: Ahmad Abubakar Suleiman, Hanita Daud, Narinderjit Singh Sawaran Singh, Aliyu Ismail Ishaq, Mahmod Othman
      First page: 143
      Abstract: In this article, we pioneer a new Burr X distribution using the odd beta prime generalized (OBP-G) family of distributions called the OBP-Burr X (OBPBX) distribution. The density function of this model is symmetric, left-skewed, right-skewed, and reversed-J, while the hazard function is monotonically increasing, decreasing, bathtub, and N-shaped, making it suitable for modeling skewed data and failure rates. Various statistical properties of the new model are obtained, such as moments, moment-generating function, entropies, quantile function, and limit behavior. The maximum-likelihood-estimation procedure is utilized to determine the parameters of the model. A Monte Carlo simulation study is implemented to ascertain the efficiency of maximum-likelihood estimators. The findings demonstrate the empirical application and flexibility of the OBPBX distribution, as showcased through its analysis of petroleum rock samples and COVID-19 mortality data, along with its superior performance compared to well-known extended versions of the Burr X distribution. We anticipate that the new distribution will attract a wider readership and provide a vital tool for modeling various phenomena in different domains.
      Citation: Data
      PubDate: 2023-09-19
      DOI: 10.3390/data8090143
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 144: Potential Range Map Dataset of Indian Birds

    • Authors: Arpit Deomurari, Ajay Sharma, Dipankar Ghose, Randeep Singh
      First page: 144
      Abstract: Conservation management heavily relies on accurate species distribution data. However, distributional information for most species is limited to distributional range maps, which may lack the resolution needed to take conservation action and assess current distribution status. In many cases, distribution maps are difficult to access in proper data formats for analysis and conservation planning of species. In this study, we addressed this issue by developing Species Distribution Models (SDMs) that integrate species presence data from various citizen science initiatives. This allowed us to systematically construct current distribution maps for 1091 bird species across India. To create these SDMs, we used MaxEnt 3.4.4 (Maximum Entropy) as the base for species distribution modelling and combined it with multiple citizen science datasets containing information on species occurrence and 29 environmental variables. Using this method, we were able to estimate species distribution maps at both a national scale and a high spatial resolution of 1 km². Thus, the results of our study provide current species distribution maps for 968 bird species found in India. These maps significantly improve our knowledge of the geographic distribution of about 75% of India’s bird species and are essential for addressing spatial knowledge gaps for conservation issues. Additionally, by superimposing the distribution maps of different species, we can locate hotspots for bird diversity and align conservation action.
      Citation: Data
      PubDate: 2023-09-21
      DOI: 10.3390/data8090144
      Issue No: Vol. 8, No. 9 (2023)
  • Data, Vol. 8, Pages 123: Blockchain Payment Services in the Hospitality
           Sector: The Mediating Role of Data Security on Utilisation Efficiency of
           the Customer

    • Authors: Ankit Dhiraj, Sanjeev Kumar, Divya Rani, Simon Grima, Kiran Sood
      First page: 123
      Abstract: Blockchain technology has the potential to completely transform the hospitality sector by offering a safe, open, and effective method of payment. Increased customer utilisation efficiency may result from this. This study looks into how blockchain payment methods affect hotel customers’ intentions to stay loyal by devising four hypotheses. A questionnaire was specifically created and self-administered for this study as a data-gathering tool and distributed to hotel customers. The I.B.M. SPSS and Amos software packages were used to analyse the data of the 301 valid responses. Findings show that hospitality customers may use blockchain payment services if the customer is satisfied with the data security of this payment system. The study also highlighted that customer data security mediated the association between utilisation efficiency and blockchain payment systems. Blockchain payment services can affect visitors’ intentions to stay loyal by impacting data security and consumer happiness. Results suggest that blockchain payment systems can be useful for hospitality firms looking to increase client utilisation efficiency. Blockchain can simplify visitor booking and payment processes by providing a safe, open, and effective transacting method. This may result in a satisfying encounter that visitors are more inclined to recall and repeat.
      Citation: Data
      PubDate: 2023-07-30
      DOI: 10.3390/data8080123
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 124: Measuring the Effect of Fraud on Data-Quality

    • Authors: Samiha Brahimi, Mariam Elhussein
      First page: 124
      Abstract: Data preprocessing moves the data from raw to ready for analysis. Data resulting from fraud compromises the quality of the data and the resulting analysis. Fraudulent data can remain in a dataset undetected and therefore be included in the analysis. This study proposed a process for measuring the effect of fraudulent data during data preparation and its possible influence on quality. The five-step process begins with identifying the business rules related to the business process(es) affected by fraud and their associated quality dimensions. This is followed by measuring the business rules in the specified timeframe, detecting fraudulent data, cleaning them, and measuring their quality after cleaning. The process was implemented in the case of occupational fraud within a hospital context and the illegal issuance of undeserved sick leave. The aim of the application is to identify the quality dimensions that are influenced by the injected fraudulent data and how these dimensions are affected. This study agrees with the existing literature and confirms the effects of fraud on timeliness, coherence, believability, and interpretability. However, it did not show any effect on consistency. Further studies are needed to arrive at a generalizable list of the quality dimensions that fraud can affect.
      Citation: Data
      PubDate: 2023-07-30
      DOI: 10.3390/data8080124
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 125: Quantitative Metabolomic Dataset of Avian Eye

    • Authors: Ekaterina A. Zelentsova, Sofia S. Mariasina, Vadim V. Yanshole, Lyudmila V. Yanshole, Nataliya A. Osik, Kirill A. Sharshov, Yuri P. Tsentalovich
      First page: 125
      Abstract: Metabolomics is a powerful set of methods that uses analytical techniques to identify and quantify metabolites in biological samples, providing a snapshot of the metabolic state of a biological system. In medicine, metabolomics may help to reveal the molecular basis of a disease, make a diagnosis, and monitor treatment responses, while in agriculture, it can improve crop yields and plant breeding. However, animal metabolomics faces several challenges due to the complexity and diversity of animal metabolomes, the lack of standardized protocols, and the difficulty in interpreting metabolomic data. The current dataset includes quantitative metabolomic profiles of eye lenses from 26 bird species (111 specimens) that can aid researchers in developing new experiments, mathematical models, and integrating with other “-omics” data. The dataset includes raw 1H NMR spectra, protocols for sample preparation, and data preprocessing, with the final table containing information on the abundance of 89 reliably identified and quantified metabolites. The dataset is quantitative, making it relevant for supplementing with new specimens or comparison groups, followed by data mining and expected new interpretations. The data were obtained using the bird specimens collected in compliance with ethical standards and revealed potential differences in metabolic pathways due to phylogenetic differences or environmental exposure.
      Citation: Data
      PubDate: 2023-07-31
      DOI: 10.3390/data8080125
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 126: Datasets of Simulated Exhaled Aerosol Images from
           Normal and Diseased Lungs with Multi-Level Similarities for Neural Network
           Training/Testing and Continuous Learning

    • Authors: Mohamed Talaat, Xiuhua Si, Jinxiang Xi
      First page: 126
      Abstract: Although exhaled aerosols and their patterns may seem chaotic in appearance, they inherently contain information related to the underlying respiratory physiology and anatomy. This study presented a multi-level database of simulated exhaled aerosol images from both normal and diseased lungs. An anatomically accurate mouth-lung geometry extending to G9 was modified to model two stages of obstructions in small airways and physiology-based simulations were utilized to capture the fluid-particle dynamics and exhaled aerosol images from varying breath tests. The dataset was designed to test two performance metrics of convolutional neural network (CNN) models when used for transfer learning: interpolation and extrapolation. To this aim, three testing datasets with decreasing image similarities were developed (i.e., level 1, inbox, and outbox). Four network models (AlexNet, ResNet-50, MobileNet, and EfficientNet) were tested and the performances of all models decreased for the outbox test images, which were outside the design space. The effect of continuous learning was also assessed for each model by adding new images into the training dataset and the newly trained network was tested at multiple levels. Among the four network models, ResNet-50 excelled in performance in both multi-level testing and continuous learning, the latter of which enhanced the accuracy of the most challenging classification task (i.e., 3-class with outbox test images) from 60.65% to 98.92%. The datasets can serve as a benchmark training/testing database for validating existent CNN models or quantifying the performance metrics of new CNN models.
      Citation: Data
      PubDate: 2023-07-31
      DOI: 10.3390/data8080126
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 127: eMailMe: A Method to Build Datasets of Corporate
           Emails in Portuguese

    • Authors: Akira A. de Moura Galvão Uematsu, Anarosa A. F. Brandão
      First page: 127
      Abstract: One of the areas in which knowledge management has application is in companies that are concerned with maintaining and disseminating their practices among their members. However, studies involving these two domains may end up suffering from the issue of data confidentiality. Furthermore, it is difficult to find data regarding organizations’ processes and associated knowledge. Therefore, this paper presents a method to support the generation of a labeled dataset composed of texts that simulate corporate emails containing sensitive information regarding disclosure, written in Portuguese. The method begins with the definition of the dataset’s size and content distribution; the structure of its emails’ texts; and the guidelines for specialists to build the emails’ texts. It aims to create datasets that can be used in the validation of a tacit knowledge extraction process considering the 5W1H approach for the resulting base. The method was applied to create a dataset with content related to several domains, such as Federal Court, Registry Office, and Marketing, giving it diversity and realism, while simulating real-world situations in the specialists’ professional life. The dataset generated is available in an open-access repository so that it can be downloaded and, eventually, expanded.
      Citation: Data
      PubDate: 2023-07-31
      DOI: 10.3390/data8080127
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 128: VEPL Dataset: A Vegetation Encroachment in Power
           Line Corridors Dataset for Semantic Segmentation of Drone Aerial

    • Authors: Mateo Cano-Solis, John R. Ballesteros, John W. Branch-Bedoya
      First page: 128
      Abstract: Vegetation encroachment in power line corridors poses multiple problems for modern energy-dependent societies. Failures due to the contact between power lines and vegetation can result in power outages and millions of dollars in losses. To address this problem, UAVs have emerged as a promising solution due to their ability to quickly and affordably monitor long corridors through autonomous or remotely piloted flights. However, the extensive manual task of analyzing every image acquired by the UAVs when searching for vegetation encroachment has led many authors to propose the use of Deep Learning to automate the detection process. Despite the advantages of using a combination of UAV imagery and Deep Learning, there is currently a lack of datasets that help to train Deep Learning models for this specific problem. This paper presents a dataset for the semantic segmentation of vegetation encroachment in power line corridors. RGB orthomosaics were obtained for a rural road area using a commercial UAV. The dataset is composed of pairs of tessellated RGB images, coming from the orthomosaic, and corresponding multi-color masks representing three different classes: vegetation, power lines, and the background. A detailed description of the image acquisition process is provided, as well as the labeling task and the data augmentation techniques, among other relevant details to produce the dataset. Researchers would benefit from using the proposed dataset by developing and improving strategies for vegetation encroachment monitoring using UAVs and Deep Learning.
      Citation: Data
      PubDate: 2023-08-04
      DOI: 10.3390/data8080128
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 129: Anomaly Detection in Student Activity in Solving
           Unique Programming Exercises: Motivated Students against Suspicious Ones

    • Authors: Liliya A. Demidova, Peter N. Sovietov, Elena G. Andrianova, Anna A. Demidova
      First page: 129
      Abstract: This article presents a dataset containing messages from the Digital Teaching Assistant (DTA) system, which records the results from the automatic verification of students’ solutions to unique programming exercises of 11 various types. These results are automatically generated by the system, which automates a massive Python programming course at MIREA—Russian Technological University (RTU MIREA). The DTA system is trained to distinguish between approaches to solve programming exercises, as well as to identify correct and incorrect solutions, using intelligent algorithms responsible for analyzing the source code in the DTA system using vector representations of programs based on Markov chains, calculating pairwise Jensen–Shannon distances for programs and using a hierarchical clustering algorithm to detect high-level approaches used by students in solving unique programming exercises. In the process of learning, each student must correctly solve 11 unique exercises in order to receive admission to the intermediate certification in the form of a test. In addition, a motivated student may try to find additional approaches to solve exercises they have already solved. At the same time, not all students are able or willing to solve the 11 unique exercises proposed to them; some will resort to outside help in solving all or part of the exercises. Since all information about the interactions of the students with the DTA system is recorded, it is possible to identify different types of students. First of all, the students can be classified into 2 classes: those who failed to solve 11 exercises and those who received admission to the intermediate certification in the form of a test, having solved the 11 unique exercises correctly. However, it is possible to identify classes of typical, motivated and suspicious students among the latter group based on the proposed dataset. 
      The proposed dataset can be used to develop regression models that will predict outbursts of student activity when interacting with the DTA system, to solve clustering problems, to identify groups of students with a similar behavior model in the learning process and to develop intelligent data classifiers that predict the students’ behavior model and draw appropriate conclusions, not only at the end of the learning process but also during the course of it in order to motivate all students, even those who are classified as suspicious, to visualize the results of the learning process using various tools.
      Citation: Data
      PubDate: 2023-08-08
      DOI: 10.3390/data8080129
      Issue No: Vol. 8, No. 8 (2023)
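
The abstract above mentions Markov-chain-based vector representations of programs and pairwise Jensen–Shannon distances between them. The following is a minimal sketch of that distance computation only, not the DTA system's implementation. It assumes each program has already been reduced to a sequence of tokens; the helper names `transition_distribution` and `js_distance` are hypothetical, the representation (relative frequencies of adjacent token pairs) is a simplification chosen for illustration, and the hierarchical clustering step is omitted.

```python
import math
from collections import Counter


def transition_distribution(tokens):
    """Relative frequency of each adjacent token pair (a simplified,
    first-order transition profile; an assumption of this sketch)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 divergence)
    between two discrete distributions given as dicts."""
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            divergence += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * math.log2(qk / mk)
    return math.sqrt(max(divergence, 0.0))


if __name__ == "__main__":
    # Hypothetical token sequences standing in for two student solutions.
    a = ["def", "for", "if", "return", "for", "if", "return"]
    b = ["def", "while", "if", "return"]
    print(js_distance(transition_distribution(a), transition_distribution(b)))
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (no shared token pairs), which makes the pairwise values directly comparable when fed to a clustering routine.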
  • Data, Vol. 8, Pages 130: Towards Action-State Process Model Discovery

    • Authors: Alessio Bottrighi, Marco Guazzone, Giorgio Leonardi, Stefania Montani, Manuel Striani, Paolo Terenziani
      First page: 130
      Abstract: Process model discovery covers the different methodologies used to mine a process model from traces of process executions, and it has an important role in artificial intelligence research. Current approaches in this area, with a few exceptions, focus on determining a model of the flow of actions only. However, in several contexts, (i) restricting the attention to actions is quite limiting, since the effects of such actions also have to be analyzed, and (ii) traces provide additional pieces of information in the form of states (i.e., values of parameters possibly affected by the actions); for instance, in several medical domains, the traces include both actions and measurements of patient parameters. In this paper, we propose AS-SIM (Action-State SIM), the first approach able to mine a process model that comprehends two distinct classes of nodes, to capture both actions and states.
      Citation: Data
      PubDate: 2023-08-09
      DOI: 10.3390/data8080130
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 131: Draft Genome Sequence Data of Streptomyces
           anulatus, Strain K-31

    • Authors: Andrey P. Bogoyavlenskiy, Madina S. Alexyuk, Amankeldi K. Sadanov, Vladimir E. Berezin, Lyudmila P. Trenozhnikova, Gul B. Baymakhanova
      First page: 131
      Abstract: Streptomyces anulatus is a typical representative of the Streptomyces genus synthesizing a large number of biologically active compounds. In this study, the draft genome of Streptomyces anulatus, strain K-31, is presented, generated from Illumina reads by SPAdes software. The size of the assembled genome was 8.548838 Mb. Annotation of the S. anulatus genome assembly identified 7749 genes, including 7149 protein-coding genes and 92 RNA genes. This genome will be helpful to further understand Streptomyces genetics and evolution and can be useful for obtaining biologically active compounds.
      Citation: Data
      PubDate: 2023-08-10
      DOI: 10.3390/data8080131
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 132: VR Traffic Dataset on Broad Range of End-User

    • Authors: Marina Polupanova
      First page: 132
      Abstract: With the emergence of new internet traffic types in modern transport networks, it has become critical for service providers to understand the structure of that traffic and predict peaks of that load for planning infrastructure expansion. Several studies have investigated traffic parameters for Virtual Reality (VR) applications. Still, most of them test only a partial range of user activities during a limited time interval. This work creates a dataset of captures from a broader spectrum of VR activities performed with a Meta Quest 2 headset, with the duration of each real residential user session recorded for at least half an hour. Newly collected data helped show that some gaming VR traffic activities have a high share of uplink traffic and require symmetric user links. Also, we have figured out that the gaming phase of the overall gameplay is more sensitive to the channel resources reduction than the higher bitrate game launch phase. Hence, we recommend it as a source of traffic distribution for channel sizing model creation. From the gaming phase, capture intervals of more than 100 s contain the most representative information for modeling activity.
      Citation: Data
      PubDate: 2023-08-17
      DOI: 10.3390/data8080132
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 133: Leveraging Return Prediction Approaches for
           Improved Value-at-Risk Estimation

    • Authors: Farid Bagheri, Diego Reforgiato Recupero, Espen Sirnes
      First page: 133
      Abstract: Value at risk (VaR) is a statistic used to anticipate the largest possible losses over a specific time frame and within some level of confidence, usually 95% or 99%. For risk management and regulators, it offers a solution for trustworthy quantitative risk management tools. VaR has become the most widely used and accepted indicator of downside risk. Today, commercial banks and financial institutions utilize it as a tool to estimate the size and probability of upcoming losses in portfolios and, as a result, to estimate and manage the degree of risk exposure. The goal is to obtain the average number of VaR “failures” or “breaches” (losses that are more than the VaR) as near to the target rate as possible. It is also desired that the losses be as evenly distributed as possible. VaR can be modeled in a variety of ways. The simplest method is to estimate volatility based on prior returns according to the assumption that volatility is constant. Otherwise, the volatility process can be modeled using the GARCH model. Machine learning techniques have been used in recent years to carry out stock market forecasts based on historical time series. A machine learning system is often trained on an in-sample dataset, where it can adjust and improve specific hyperparameters in accordance with the underlying metric. The trained model is tested on an out-of-sample dataset. We compared the baselines for the VaR estimation of a day (d) according to different metrics (i) to their respective variants that included stock return forecast information of d and stock return data of the days before d and (ii) to a GARCH model that included return prediction information of d and stock return data of the days before d. Various strategies such as ARIMA and a proposed ensemble of regressors have been employed to predict stock returns.
      We observed that the versions of the univariate techniques and GARCH integrated with return predictions outperformed the baselines in four different marketplaces.
      Citation: Data
      PubDate: 2023-08-17
      DOI: 10.3390/data8080133
      Issue No: Vol. 8, No. 8 (2023)
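
The abstract above describes the simplest VaR approach (constant volatility estimated from prior returns) and the goal of keeping the breach rate near its target. The sketch below illustrates that baseline only; it is not the authors' code. It assumes zero-mean, normally distributed daily returns, and the function names, the 250-day window, and the synthetic return series are all hypothetical choices made for the example.

```python
import random
from statistics import NormalDist, stdev


def parametric_var(returns, confidence=0.95):
    """One-day VaR as a positive loss, assuming zero-mean normal returns
    with constant volatility estimated from the given sample."""
    return NormalDist().inv_cdf(confidence) * stdev(returns)


def rolling_var(returns, window=250, confidence=0.95):
    """VaR estimate for each day d, using only the `window` days before d."""
    return [parametric_var(returns[d - window:d], confidence)
            for d in range(window, len(returns))]


def breach_rate(returns, var_estimates):
    """Fraction of days on which the realized loss exceeded the VaR."""
    breaches = sum(1 for r, v in zip(returns, var_estimates) if -r > v)
    return breaches / len(var_estimates)


if __name__ == "__main__":
    random.seed(0)
    # Synthetic daily returns, used here only so the sketch runs end to end.
    series = [random.gauss(0.0, 0.01) for _ in range(1000)]
    estimates = rolling_var(series, window=250, confidence=0.95)
    # For a well-calibrated 95% VaR the breach rate should sit near 5%.
    print(breach_rate(series[250:], estimates))
```

A GARCH-based variant would replace the constant `stdev` estimate with a time-varying volatility forecast while leaving the breach-rate check unchanged.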
  • Data, Vol. 8, Pages 134: Quantifying Webpage Performance: A Comparative
           Analysis of TCP/IP and QUIC Communication Protocols for Improved

    • Authors: Thyago Celso Cavalcante Nepomuceno, Késsia Thais Cavalcanti Nepomuceno, Fabiano Carlos da Silva, Silas Garrido Teixeira de Carvalho Santos
      First page: 134
      Abstract: Browsing is a prevalent activity on the World Wide Web, and users usually demonstrate significant expectations for expeditious information retrieval and seamless transactions. This article presents a comprehensive performance evaluation of the most frequently accessed webpages in recent years using Data Envelopment Analysis (DEA) adapted to the context (inverse DEA), comparing their performance under two distinct communication protocols: TCP/IP and QUIC. To assess performance disparities, parametric and non-parametric hypothesis tests are employed to investigate the appropriateness of each website’s communication protocols. We provide data on the inputs, outputs, and efficiency scores for 82 out of the world’s top 100 most-accessed websites, describing how experiments and analyses were conducted. The evaluation yields quantitative metrics pertaining to the technical efficiency of the websites and efficient benchmarks for best practices. Nine websites are considered efficient from the point of view of at least one of the communication protocols. Considering TCP/IP, about 80.5% of all units (66 webpages) need to reduce more than 50% of their page load time to be competitive, while this number is 28.05% (23 webpages), considering QUIC communication protocol. In addition, results suggest that TCP/IP protocol has an unfavorable effect on the overall distribution of inefficiencies.
      Citation: Data
      PubDate: 2023-08-19
      DOI: 10.3390/data8080134
      Issue No: Vol. 8, No. 8 (2023)
  • Data, Vol. 8, Pages 113: VPTD: Human Face Video Dataset for Personality
           Traits Detection

    • Authors: Kenan Kassab, Alexey Kashevnik, Alexander Mayatin, Dmitry Zubok
      First page: 113
      Abstract: In this paper, we propose a dataset for personality traits detection based on human face videos. Ground truth data have been annotated using the IPIP-50 personality test completed by every participant. To collect the dataset, we developed a web-based platform that allows us to acquire spontaneous answers to predefined questions from the respondents. The website allows the participants to record an interactive interview in order to imitate a real-life interview. The dataset includes 38 videos (2 min on average) of people of different races, genders, and ages. In the paper, we propose the top five personality traits calculated based on the test, as well as the top five personality traits calculated by our own developed model that determines this information based on video analysis. We introduced a statistical analysis for the collected dataset, and we also applied a K-means clustering algorithm to cluster the data and present the clustering results.
      Citation: Data
      PubDate: 2023-06-22
      DOI: 10.3390/data8070113
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 114: A Survey Dataset Evaluating Perceptions of Civil
           Engineering Students about Building Information Modelling (BIM)

    • Authors: Diego Maria Barbieri, Baowen Lou, Marco Passavanti, Aurora Barbieri, Fredrik Bjørheim
      First page: 114
      Abstract: The implementation of Building Information Modelling (BIM) technologies has become increasingly central in the design, construction and maintenance of both civil structures and infrastructures. As more and more software houses develop new BIM software solutions and a wide range of private and public stakeholders employ them, several educational institutes across the globe strive to expand their teaching portfolio to encompass learning and teaching of BIM. This dataset deals with the perceptions expressed by all the civil engineering undergraduate students who attended an academic course specifically about BIM at University of Stavanger (UiS), Norway, during the second semester 2022. The survey was divided into five parts and collected information regarding as many overarching aspects: socio-demographic data, perceptions about BIM before and after course attendance, satisfaction about the academic course and the way it was conducted. Considering the very moderate sample size (28 students) and potential biases due to the specific context of the University of Stavanger, the dataset can provide a useful insight into teaching approaches and future curriculum development, rather than indicating major and generalized trends in BIM education. As the questionnaire responses shed light on the feedbacks and perceptions expressed by university students dealing with BIM for their first time, the formed dataset can offer a straightforward appreciation of students’ cognitive behaviour in BIM education.
      Citation: Data
      PubDate: 2023-06-28
      DOI: 10.3390/data8070114
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 115: Factory-Based Vibration Data for Bearing-Fault

    • Authors: Adam Lundström, Mattias O’Nils
      First page: 115
      Abstract: The importance of preventing failures in bearings has led to a large amount of research being conducted to find methods for fault diagnostics and prognostics. Many of these solutions, such as deep learning methods, require a significant amount of data to perform well. This is a reason why publicly available data are important, and there currently exist several open datasets that contain different conditions and faults. However, one challenge is that almost all of these data come from a laboratory setting, where conditions might differ from those found in an industrial environment where the methods are intended to be used. This also means that there may be characteristics of the industrial data that are important to take into account. Therefore, this study describes a completely new dataset for bearing faults from a pulp mill. The analysis of the data shows that the faults vary significantly in terms of fault development, rotation speed, and the amplitude of the vibration signal. It also suggests that methods built for this environment need to consider that no historical examples of faults in the target domain exist and that external events can occur that are not related to any condition of the bearing.
      Citation: Data
      PubDate: 2023-06-28
      DOI: 10.3390/data8070115
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 116: Dataset of Linkability Networks of Ethereum
           Accounts Involved in NFT Trading of Top 15 NFT Collections

    • Authors: Aleksandar Tošić, Niki Hrovatin, Jernej Vičič
      First page: 116
      Abstract: In this paper, we present subgraphs of Ethereum wallets involved in NFT trades of the top 15 ERC721 NFT collections. To obtain the subgraphs, we have extracted the Ethereum transaction graph from a live Ethereum node and filtered out exchanges, mining pools, and smart contracts. For each of the selected collections, we identified the set of accounts involved in NFT trading, which we used to perform a breadth-first search in the Ethereum transaction graph to obtain a subgraph. These subgraphs can offer insight into the linkability of accounts participating in NFT trading on the Ethereum blockchain.
      Citation: Data
      PubDate: 2023-06-28
      DOI: 10.3390/data8070116
      Issue No: Vol. 8, No. 7 (2023)
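
The abstract above outlines a procedure: filter out exchanges, mining pools, and smart contracts, then run a breadth-first search from the accounts involved in NFT trading to obtain a subgraph. A minimal sketch of that idea follows, under stated assumptions: the transaction graph is given as (sender, receiver) pairs, edges are traversed in both directions, and the search stops at a fixed depth. The function name `bfs_subgraph`, the `max_depth` parameter, and the toy addresses are hypothetical and not taken from the paper.

```python
from collections import deque


def bfs_subgraph(edges, seeds, excluded=frozenset(), max_depth=2):
    """Return the edges among accounts reachable from `seeds` within
    `max_depth` hops, ignoring any account listed in `excluded`."""
    # Build an undirected adjacency map, skipping filtered accounts.
    adjacency = {}
    for sender, receiver in edges:
        if sender in excluded or receiver in excluded:
            continue
        adjacency.setdefault(sender, set()).add(receiver)
        adjacency.setdefault(receiver, set()).add(sender)

    # Standard breadth-first search that records the hop count per account.
    depth = {seed: 0 for seed in seeds if seed not in excluded}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue
        for neighbour in adjacency.get(node, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)

    # Keep only transactions whose endpoints were both reached.
    return {(a, b) for a, b in edges if a in depth and b in depth}


if __name__ == "__main__":
    # Toy transaction list with placeholder account labels.
    transactions = [("a", "b"), ("b", "c"), ("c", "d"),
                    ("a", "exchange"), ("exchange", "z")]
    print(bfs_subgraph(transactions, seeds={"a"}, excluded={"exchange"}))
```

Removing the excluded accounts before the search matters: a high-degree exchange address would otherwise connect almost every account within two hops and hide the linkability structure of interest.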
  • Data, Vol. 8, Pages 117: Assessment of Maize Silage Quality Under
           Different Pre-Ensiling Conditions

    • Authors: Lorenzo Serva, Igino Andrighetto, Severino Segato, Giorgio Marchesini, Maria Chinello, Luisa Magrin
      First page: 117
      Abstract: Maize silage suffers from several factors that affect the final quality and, to some extent, pre-ensiled conditions that can be potentially tuned during harvesting. After assessing new indices for silage quality under lab-scale conditions, several trials have been conducted to find associations between fresh maize characteristics and silage features. Among the first, we included field input levels, FAO class, maturity stage, use of bacterial inoculants, sealing delay and chemical traits, whereas, among the latter, we assessed density and porosity, pH, fermentative profile, dry matter loss and aerobic stability. The trials were conducted using vacuum bags or mini silo buckets. More than 1500 maize samples harvested in Northeast Italy were analysed during the 2016–2022 period. Moreover, to evaluate silage aerobic stability, the fermentative profile and temperature were measured 14 days after the opening of the silo. The association between silage quality and aerobic stability was assessed, and a prognostic risk score was used to calculate the probability of aerobic instability. The dataset could provide baseline information to promote the continuous improvement of maize silage management from different botanical and crop fields, thus improving agronomic and animal farm resource allocation from a precision agriculture perspective.
      Citation: Data
      PubDate: 2023-07-02
      DOI: 10.3390/data8070117
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 118: A Semantically Annotated 15-Class Ground Truth
           Dataset for Substation Equipment to Train Semantic Segmentation Models

    • Authors: Andreas Anael Pereira Gomes, Francisco Itamarati Secolo Ganacim, Fabiano Gustavo Silveira Magrin, Nara Bobko, Leonardo Göbel Fernandes, Anselmo Pombeiro, Eduardo Félix Ribeiro Romaneli
      First page: 118
      Abstract: The lack of annotated semantic segmentation datasets for electrical substations in the literature poses a significant problem for machine learning tasks; before training a model, a dataset is needed. This paper presents a new dataset of electric substations with 1660 images annotated with 15 classes, including insulators, disconnect switches, transformers and other equipment commonly found in substation environments. The images were captured using a combination of human, fixed and AGV-mounted cameras at different times of the day, providing a diverse set of training and testing data for algorithm development. In total, 50,705 annotations were created by a team of experienced annotators, using a standardized process to ensure accuracy across the dataset. The resulting dataset provides a valuable resource for researchers and practitioners working in the fields of substation automation, substation monitoring and computer vision. Its availability has the potential to advance the state of the art in this important area.
      Citation: Data
      PubDate: 2023-07-05
      DOI: 10.3390/data8070118
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 119: Proteomic Shift in Mouse Embryonic Fibroblasts
           Pfa1 during Erastin, ML210, and BSO-Induced Ferroptosis

    • Authors: Olga M. Kudryashova, Alexey M. Nesterenko, Dmitry A. Korzhenevskii, Valeriy K. Sulyagin, Vasilisa M. Tereshchuk, Vsevolod V. Belousov, Arina G. Shokhina
      First page: 119
      Abstract: Ferroptosis is a unique variety of non-apoptotic cell death, driven by massive lipid oxidation in an iron-dependent manner. Since ferroptosis was introduced as a concept in 2012, it has demonstrated its essential role in the pathogenesis in neurodegenerative diseases and an important role in therapy-resistant cancer cells. Thus, detailed molecular understanding of both canonical and alternative ferroptosis pathways is required. There is a set of widely used chemical agents to modulate ferroptosis using different pathway targets: erastin blocks cystine–glutamate antiporter, system xc-; ML210 directly inactivates GPX4; and L-buthionine sulfoximine (BSO) inhibits γ-glutamylcysteine synthetase, an essential enzyme for glutathione synthesis de novo. Most studies have focused on the lipidomic profiling of model systems undergoing death in a ferroptotic modality. In this study, we developed high-quality shotgun proteome sequencing during ferroptosis induction by three widely used chemical agents (erastin, ML210, and BSO) before and after 24 and 48 h of treatment. Chromato-mass spectra were registered in DDA mode and are suitable for further label-free quantification. Both processed and raw files are publicly available and could be a valuable dynamic proteome map for further ferroptosis investigation.
      Citation: Data
      PubDate: 2023-07-12
      DOI: 10.3390/data8070119
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 120: PoPu-Data: A Multilayered, Simultaneously
           Collected Lying Position Dataset

    • Authors: Luís Fonseca, Fernando Ribeiro, José Metrôlho, Adriana Santos, Rogério Dionisio, Mohammad Mohammad Amini, Arlindo F. Silva, Ahmad Reza Heravi, Davood Fanaei Sheikholeslami, Filipe Fidalgo, Francisco B. Rodrigues, Osvaldo Santos, Patrícia Coelho, Seyyed Sajjad Aemmi
      First page: 120
      Abstract: This study presents a dataset containing three layers of data that are useful for body position classification and related uses. The PoPu dataset contains simultaneously collected data from two different sensor sheets, one placed over and one placed under a mattress; furthermore, a segmentation data layer was added in which different body parts are identified using the pressure data from the sensors over the mattress. The data were gathered from 60 healthy volunteers distributed across the recorded characteristics, namely sex, weight, and height. This dataset can be used for position classification, for assessing the viability of sensors placed under a mattress, and in applications regarding bedded or lying people or sleep-related disorders.
      Citation: Data
      PubDate: 2023-07-16
      DOI: 10.3390/data8070120
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 121: Knowledge Discovery and Dataset for the
           Improvement of Digital Literacy Skills in Undergraduate Students

    • Authors: Pongpon Nilaphruek, Pattama Charoenporn
      First page: 121
      Abstract: For over two decades, scholars and practitioners have emphasized the importance of digital literacy, yet the existing datasets are insufficient for establishing learning analytics in Thailand. Learning analytics focuses on gathering and analyzing student data to optimize learning tools and activities to improve students’ learning experiences. The main problem is that the ICT skill levels of the youth are rather low in Thailand. To facilitate research in this field, this study has compiled a dataset containing information from the IC3 digital literacy certification delivered at the Rajamangala University of Technology Thanyaburi (RMUTT) in Thailand between 2016 and 2023. This dataset is unique since it includes demographic and academic records about undergraduate students. The dataset was collected and underwent a preparation process, including data cleansing, anonymization, and release. This data enables the examination of student learning outcomes, represented by a dataset containing information about 45,603 records with students’ certification assessment scores. This compiled dataset provides a rich resource for researchers studying digital literacy and learning analytics. It offers researchers the opportunity to gain valuable insights, inform evidence-based educational practices, and contribute to the ongoing efforts to improve digital literacy education in Thailand and beyond.
      Citation: Data
      PubDate: 2023-07-20
      DOI: 10.3390/data8070121
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 122: A Wavelet-Decomposed WD-ARMA-GARCH-EVT Model
           Approach to Comparing the Riskiness of the BitCoin and South African Rand
           Exchange Rates

    • Authors: Thabani Ndlovu, Delson Chikobvu
      First page: 122
      Abstract: In this paper, a hybrid of a Wavelet Decomposition–Generalised Auto-Regressive Conditional Heteroscedasticity–Extreme Value Theory (WD-ARMA-GARCH-EVT) model is applied to estimate the Value at Risk (VaR) of BitCoin (BTC/USD) and the South African Rand (ZAR/USD). The aim is to measure and compare the riskiness of the two currencies. New and improved estimation techniques for VaR have been suggested in the last decade in the aftermath of the global financial crisis of 2008. This paper aims to provide an improved alternative to the already existing statistical tools in estimating a currency VaR empirically. Maximal Overlap Discrete Wavelet Transform (MODWT) and two mother wavelet filters on the returns series are considered in this paper, viz., the Haar and Daubechies (d4). The findings show that BitCoin/USD is riskier than ZAR/USD since it has a higher VaR per unit invested in each currency. At the 99% significance level, BitCoin/USD has average values of VaR of 2.71% and 4.98% for the WD-ARMA-GARCH-GPD and WD-ARMA-GARCH-GEVD models, respectively; and this is slightly higher than the respective 2.69% and 3.59% for the ZAR/USD. The average BitCoin/USD returns of 0.001990 are higher than ZAR/USD returns of −0.000125. These findings are consistent with the mean-variance portfolio theory, which suggests a higher yield for riskier assets. Based on the p-values of the Kupiec likelihood ratio test, the hybrid model adequacy is largely accepted, as p-values are greater than 0.05, except for the WD-ARMA-GARCH-GEVD models at a 99% significance level for both currencies. The findings are helpful to financial risk practitioners and forex traders in formulating their diversification and hedging strategies and ascertaining the risk-adjusted capital requirement to be set aside as a cushion in the event of the occurrence of an actual loss.
      Citation: Data
      PubDate: 2023-07-24
      DOI: 10.3390/data8070122
      Issue No: Vol. 8, No. 7 (2023)
  • Data, Vol. 8, Pages 93: Target Screening of Chemicals of Emerging Concern
           (CECs) in Surface Waters of the Swedish West Coast

    • Authors: Pedro A. Inostroza, Eric Carmona, Åsa Arrhenius, Martin Krauss, Werner Brack, Thomas Backhaus
      First page: 93
      Abstract: The aquatic environment faces increasing threats from a variety of unregulated organic chemicals originating from human activities, collectively known as chemicals of emerging concern (CECs). These include pharmaceuticals, personal-care products, pesticides, surfactants, industrial chemicals, and their transformation products. CECs enter aquatic environments through various sources, including effluents from wastewater treatment plants, industrial facilities, runoff from agricultural and residential areas, as well as accidental spills. Data on the occurrence of CECs in the marine environment are scarce, and more information is needed to assess the chemical and ecological status of water bodies, and to prioritize toxic chemicals for further studies or risk assessment. In this study, we describe a monitoring campaign targeting CECs in surface waters at the Swedish west coast using, for the first time, an on-site large volume solid phase extraction (LVSPE) device. We detected up to 80 and 227 CECs in marine sites and the wastewater treatment plant (WWTP) effluent, respectively. The dataset will contribute to defining pollution fingerprints and assessing the chemical status of marine and freshwater systems affected by industrial hubs, agricultural areas, and the discharge of urban wastewater.
      Citation: Data
      PubDate: 2023-05-25
      DOI: 10.3390/data8060093
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 94: MicroRNA Profiling of Fresh Lung Adenocarcinoma
           and Adjacent Normal Tissues from Ten Korean Patients Using miRNA-Seq

    • Authors: Jihye Park, Sae Jung Na, Jung Sook Yoon, Seoree Kim, Sang Hoon Chun, Jae Jun Kim, Young-Du Kim, Young-Ho Ahn, Keunsoo Kang, Yoon Ho Ko
      First page: 94
      Abstract: MicroRNA transcriptomes from fresh tumors and the adjacent normal tissues were profiled in 10 Korean patients diagnosed with lung adenocarcinoma using a next-generation sequencing (NGS) technique called miRNA-seq. The sequencing quality was assessed using FastQC, and low-quality or adapter-contaminated portions of the reads were removed using Trim Galore. Quality-assured reads were analyzed using miRDeep2 and Bowtie. The abundance of known miRNAs was estimated using the reads per million (RPM) normalization method. Subsequently, using DESeq2 and Wx, we identified differentially expressed miRNAs and potential miRNA biomarkers for lung adenocarcinoma tissues compared to adjacent normal tissues, respectively. We defined reliable miRNA biomarkers for lung adenocarcinoma as those detected by both methods. The miRNA-seq data are available in the Gene Expression Omnibus (GEO) database under accession number GSE196633, and all processed data can be accessed via the Mendeley data website.
      Citation: Data
      PubDate: 2023-05-25
      DOI: 10.3390/data8060094
      Issue No: Vol. 8, No. 6 (2023)
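      The reads per million (RPM) normalization named in the abstract above can be illustrated with a minimal sketch. The function name and the read counts below are invented for illustration and are not taken from the paper or from miRDeep2; the sketch only shows the usual scaling of each raw count by the sample's total mapped reads.

```python
def reads_per_million(counts):
    """Scale raw read counts so that the sample sums to one million.

    `counts` maps a miRNA name to its raw read count (hypothetical data).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {name: count * 1_000_000 / total for name, count in counts.items()}


# Made-up counts for a single sample, for illustration only.
sample = {"miRNA_A": 500, "miRNA_B": 300, "miRNA_C": 200}
rpm = reads_per_million(sample)
```

      Because every sample is rescaled to the same total, RPM values can be compared across libraries of different sequencing depth.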
  • Data, Vol. 8, Pages 95: A Dataset of Scalp EEG Recordings of
           Alzheimer’s Disease, Frontotemporal Dementia and Healthy Subjects
           from Routine EEG

    • Authors: Andreas Miltiadous, Katerina D. Tzimourta, Theodora Afrantou, Panagiotis Ioannidis, Nikolaos Grigoriadis, Dimitrios G. Tsalikakis, Pantelis Angelidis, Markos G. Tsipouras, Euripidis Glavas, Nikolaos Giannakeas, Alexandros T. Tzallas
      First page: 95
      Abstract: Recently, there has been a growing research interest in utilizing the electroencephalogram (EEG) as a non-invasive diagnostic tool for neurodegenerative diseases. This article provides a detailed description of a resting-state EEG dataset of individuals with Alzheimer’s disease and frontotemporal dementia, and healthy controls. The dataset was collected using a clinical EEG system with 19 scalp electrodes while participants were in a resting state with their eyes closed. The data collection process included rigorous quality control measures to ensure data accuracy and consistency. The dataset contains recordings of 36 Alzheimer’s patients, 23 frontotemporal dementia patients, and 29 healthy age-matched subjects. For each subject, the Mini-Mental State Examination score is reported. A monopolar montage was used to collect the signals. A raw and preprocessed EEG is included in the standard BIDS format. For the preprocessed signals, established methods such as artifact subspace reconstruction and an independent component analysis have been employed for denoising. The dataset has significant reuse potential since Alzheimer’s EEG Machine Learning studies are increasing in popularity and there is a lack of publicly available EEG datasets. The resting-state EEG data can be used to explore alterations in brain activity and connectivity in these conditions, and to develop new diagnostic and treatment approaches. Additionally, the dataset can be used to compare EEG characteristics between different types of dementia, which could provide insights into the underlying mechanisms of these conditions.
      Citation: Data
      PubDate: 2023-05-27
      DOI: 10.3390/data8060095
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 96: Exploring the Evolution of Sentiment in Spanish
           Pandemic Tweets: A Data Analysis Based on a Fine-Tuned BERT Architecture

    • Authors: Carlos Henríquez Miranda, German Sanchez-Torres, Dixon Salcedo
      First page: 96
      Abstract: The COVID-19 pandemic has had a significant impact on various aspects of society, including economic, health, political, and work-related domains. The pandemic has also caused an emotional effect on individuals, reflected in their opinions and comments on social media platforms, such as Twitter. This study explores the evolution of sentiment in Spanish pandemic tweets through a data analysis based on a fine-tuned BERT architecture. A total of six million tweets were collected using web scraping techniques, and pre-processing was applied to filter and clean the data. The fine-tuned BERT architecture was utilized to perform sentiment analysis, which allowed for a deep-learning approach to sentiment classification. The analysis results were graphically represented based on search criteria, such as “COVID-19” and “coronavirus”. This study reveals sentiment trends, significant concerns, relationship with announced news, public reactions, and information dissemination, among other aspects. These findings provide insight into the emotional impact of the COVID-19 pandemic on individuals and the corresponding impact on social media platforms.
      Citation: Data
      PubDate: 2023-05-29
      DOI: 10.3390/data8060096
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 97: A Fast Deep Learning ECG Sex Identifier Based on
           Wavelet RGB Image Classification

    • Authors: Jose-Luis Cabra Lopez, Carlos Parra, Gonzalo Forero
      First page: 97
      Abstract: Human sex recognition with electrocardiogram signals is an emerging area in machine learning, mostly oriented toward neural network approaches. It might be the beginning of a field of heart behavior analysis focused on sex. However, a person’s heartbeat changes during daily activities, which could compromise the classification. In this paper, with the intention of capturing heartbeat dynamics, we divided the heart rate into different intervals, creating a specialized identification model for each interval. The sexual differentiation for each model was performed with a deep convolutional neural network from images that represented the RGB wavelet transformation of ECG pseudo-orthogonal X, Y, and Z signals, using sufficient samples to train the network. Our database included 202 people, with a female-to-male population ratio of 49.5–50.5% and an observation period of 24 h per person. As our main goal, we looked for periods of time during which the classification rate of sex recognition was higher and the process was faster; in fact, we identified intervals in which only one heartbeat was required. We found that for each heart rate interval, the best accuracy score varied depending on the number of heartbeats collected. Furthermore, our findings indicated that as the heart rate increased, fewer heartbeats were needed for analysis. On average, our proposed model reached an accuracy of 94.82% ± 1.96%. The findings of this investigation provide a heartbeat acquisition procedure for ECG sex recognition systems. In addition, our results encourage future research to include sex as a soft biometric characteristic in person identification scenarios and for cardiology studies, in which the detection of specific male or female anomalies could help autonomous learning machines move toward specialized health applications.
      Citation: Data
      PubDate: 2023-05-29
      DOI: 10.3390/data8060097
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 98: Unmanned Aerial Vehicle (UAV) and Spectral
           Datasets in South Africa for Precision Agriculture

    • Authors: Cilence Munghemezulu, Zinhle Mashaba-Munghemezulu, Phathutshedzo Eugene Ratshiedana, Eric Economon, George Chirima, Sipho Sibanda
      First page: 98
      Abstract: Remote sensing data play a crucial role in precision agriculture and natural resource monitoring. The use of unmanned aerial vehicles (UAVs) can provide solutions to challenges faced by farmers and natural resource managers due to its high spatial resolution and flexibility compared to satellite remote sensing. This paper presents UAV and spectral datasets collected from different provinces in South Africa, covering different crops at the farm level as well as natural resources. UAV datasets consist of five multispectral bands corrected for atmospheric effects using the PIX4D mapper software to produce surface reflectance images. The spectral datasets are filtered using a Savitzky–Golay filter, corrected for Multiplicative Scatter Correction (MSC). The first and second derivatives and the Continuous Wavelet Transform (CWT) spectra are also calculated. These datasets can provide baseline information for developing solutions for precision agriculture and natural resource challenges. For example, UAV and spectral data of different crop fields captured at spatial and temporal resolutions can contribute towards calibrating satellite images, thus improving the accuracy of the derived satellite products.
      Citation: Data
      PubDate: 2023-05-30
      DOI: 10.3390/data8060098
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 99: Classification of Cocoa Pod Maturity Using
           Similarity Tools on an Image Database: Comparison of Feature Extractors
           and Color Spaces

    • Authors: Kacoutchy Jean Ayikpa, Diarra Mamadou, Pierre Gouton, Kablan Jérôme Adou
      First page: 99
      Abstract: Côte d’Ivoire, the world’s largest cocoa producer, faces the challenge of quality production. Immature or overripe pods cannot produce quality cocoa beans, resulting in losses and an unprofitable harvest. To help farmer cooperatives determine the maturity of cocoa pods in time, our study evaluates the use of automation tools based on similarity measures. Although standard techniques, such as visual inspection and weighing, are commonly used to identify the maturity of cocoa pods, the use of automation tools based on similarity measures can improve the efficiency and accuracy of this process. We set up a database of cocoa pod images and used two feature extractors: one based on convolutional neural networks (CNN), in particular, MobileNet, and the other based on texture analysis using a gray-level co-occurrence matrix (GLCM). We evaluated the impact of different color spaces and feature extraction methods on our database. We used mathematical similarity measurement tools, such as the Euclidean distance, correlation distance, and chi-square distance, to classify cocoa pod images. Our experiments showed that the chi-square distance measurement offered the best accuracy, with a score of 99.61%, when we used GLCM as a feature extractor and the Lab color space. Using automation tools based on similarity measures can improve the efficiency and accuracy of cocoa pod maturity determination. The results of our experiments prove that the chi-square distance is the most appropriate measure of similarity for this task.
      Citation: Data
      PubDate: 2023-05-30
      DOI: 10.3390/data8060099
      Issue No: Vol. 8, No. 6 (2023)
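      As a rough illustration of classification by similarity measure, the sketch below assigns a query feature vector the label of its nearest reference under a chi-square distance. The distance uses one common symmetric form, 0.5 * sum((x - y)^2 / (x + y)); the paper's exact definition, its GLCM features, and the labels and vectors shown here are assumptions, not the authors' implementation.

```python
def chi_square_distance(x, y, eps=1e-12):
    """Symmetric chi-square distance between two non-negative feature vectors.

    `eps` avoids division by zero when both components are zero.
    """
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(x, y))


def classify(query, references):
    """Return the label of the reference vector closest to `query`."""
    label, _ = min(references, key=lambda item: chi_square_distance(query, item[1]))
    return label


# Hypothetical normalized feature vectors, for illustration only.
references = [
    ("ripe", [0.1, 0.2, 0.7]),
    ("unripe", [0.6, 0.3, 0.1]),
]
label = classify([0.15, 0.25, 0.6], references)
```

      Swapping `chi_square_distance` for a Euclidean or correlation distance would give the other comparisons the abstract mentions, with the rest of the sketch unchanged.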
  • Data, Vol. 8, Pages 100: Progress in the Cost-Optimal Methodology
           Implementation in Europe: Datasets Insights and Perspectives in Member

    • Authors: Paolo Zangheri, Delia D’Agostino, Roberto Armani, Carmen Maduta, Paolo Bertoldi
      First page: 100
      Abstract: This data article relates to the paper “Review of the cost-optimal methodology implementation in Member States in compliance with the Energy Performance of Buildings Directive”. Datasets linked with this article refer to the analysis of the latest national cost-optimal reports, providing an assessment of the implementation of the cost-optimal methodology, as established by the Energy Performance of Building Directive (EPBD). Based on latest national reports, the data provided a comprehensive update to the cost-optimal methodology implementation throughout Europe, which is currently lacking harmonization. Datasets allow an overall overview of the status of the cost-optimal methodology implementation in Europe with details on the calculations carried out (e.g., multi-stage, dynamic, macroeconomic, and financial perspectives, included energy uses, and full-cost approach). Data relate to the implemented methodology, reference buildings, assessed cost-optimal levels, energy performance, costs, and sensitivity analysis. Data also provide insight into energy consumption, efficiency measures for residential and non-residential buildings, nearly zero energy buildings (NZEBs) levels, and global costs. The reported data can be useful to quantify the cost-optimal levels for different building types, both residential (average cost-optimal level 80 kWh/m2y for new, 130 kWh/m2y for existing buildings) and non-residential buildings (140 kWh/m2y for new, 180 kWh/m2y for existing buildings). Data outline weak and strong points of the methodology, as well as future developments in the light of the methodology revision foreseen in 2026. The data support energy efficiency and energy policies related to buildings toward the EU building stock decarbonization goal within 2050.
      Citation: Data
      PubDate: 2023-05-31
      DOI: 10.3390/data8060100
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 101: Labelled Indoor Point Cloud Dataset for BIM
           Related Applications

    • Authors: Nuno Abreu, Rayssa Souza, Andry Pinto, Anibal Matos, Miguel Pires
      First page: 101
      Abstract: BIM (building information modelling) has gained wider acceptance in the AEC (architecture, engineering, and construction) industry. Conversion from 3D point cloud data to vector BIM data remains a challenging and labour-intensive process, but particularly relevant during various stages of a project lifecycle. While the challenges associated with processing very large 3D point cloud datasets are widely known, there is a pressing need for intelligent geometric feature extraction and reconstruction algorithms for automated point cloud processing. Compared to outdoor scene reconstruction, indoor scenes are challenging since they usually contain high amounts of clutter. This dataset comprises the indoor point cloud obtained by scanning four different rooms (including a hallway): two office workspaces, a workshop, and a laboratory including a water tank. The scanned space is located at the Electrical and Computer Engineering department of the Faculty of Engineering of the University of Porto. The dataset is fully labelled, containing major structural elements like walls, floor, ceiling, windows, and doors, as well as furniture, movable objects, clutter, and scanning noise. The dataset also contains an as-built BIM that can be used as a reference, making it suitable for being used in Scan-to-BIM and Scan-vs-BIM applications. For demonstration purposes, a Scan-vs-BIM change detection application is described, detailing each of the main data processing steps.
      Citation: Data
      PubDate: 2023-06-01
      DOI: 10.3390/data8060101
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 102: A Self-Attention-Based Imputation Technique for
           Enhancing Tabular Data Quality

    • Authors: Do-Hoon Lee, Han-joon Kim
      First page: 102
      Abstract: Recently, data-driven decision-making has attracted great interest; this requires high-quality datasets. However, real-world datasets often feature missing values for unknown or intentional reasons, rendering data-driven decision-making inaccurate. If a machine learning model is trained using incomplete datasets with missing values, the inferred results may be biased. In this case, a commonly used technique is the missing value imputation (MVI), which fills missing data with possible values estimated based on observed values. Various data imputation methods using machine learning, statistical inference, and relational database theories have been developed. Among them, conventional machine learning based imputation methods that handle tabular data can deal with only numerical columns or are time-consuming and cumbersome because they create an individualized predictive model for each column. Therefore, we have developed a novel imputational neural network that we term the Denoising Self-Attention Network (DSAN). Our proposed DSAN can deal with tabular datasets containing both numerical and categorical columns; it considers discretized numerical values as categorical values for embedding and self-attention layers. Furthermore, the DSAN learns robust feature expression vectors by combining self-attention and denoising techniques, and can predict multiple, appropriate substituted values simultaneously (via multi-task learning). To verify the validity of the method, we performed data imputation experiments after arbitrarily generating missing values for several real-world tabular datasets. We evaluated both imputational and downstream task performances, and we have seen that the DSAN outperformed the other models, especially in terms of category variable imputation.
      Citation: Data
      PubDate: 2023-06-04
      DOI: 10.3390/data8060102
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 103: Physico-Chemical Quality and Physiological
           Profiles of Microbial Communities in Freshwater Systems of Mega Manila,

    • Authors: Marie Christine M. Obusan, Arizaldo E. Castro, Ren Mark D. Villanueva, Margareth Del E. Isagan, Jamaica Ann A. Caras, Jessica F. Simbahan
      First page: 103
      Abstract: Studying the quality of freshwater systems and drinking water in highly urbanized megalopolises around the world remains a challenge. This article reports data on the quality of select freshwater systems in Mega Manila, Philippines. Water samples collected between 2020 and 2021 were analyzed for physico-chemical parameters and microbial community metabolic fingerprints, i.e., carbon substrate utilization patterns (CSUPs). The detection of arsenic, lead, cadmium, mercury, polyaromatic hydrocarbons (PAHs), and organochlorine pesticides (OCPs) was carried out using standard chromatography- and spectroscopy-based protocols. Physiological profiles were determined using the Biolog EcoPlate™ system. Eight samples were free of heavy metals, and none contained PAHs or OCPs. Fourteen samples had high microbial activity, as indicated by average well color development (AWCD) and community metabolic diversity (CMD) values. Community-level physiological profiling (CLPP) revealed that (1) samples clustered as groups according to shared CSUPs, and (2) microbial communities in non-drinking samples actively utilized all six substrate classes compared to drinking samples. The data reported here can provide a baseline or a comparator for prospective quality assessments of drinking water and freshwater sources in the region. Metabolic fingerprinting using CSUPs is a simple and cheap phenotypic analysis of microbial communities and their physiological activity in aquatic environments.
      Citation: Data
      PubDate: 2023-06-04
      DOI: 10.3390/data8060103
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 104: Comparison of ARIMA and LSTM in Predicting
           Structural Deformation of Tunnels during Operation Period

    • Authors: Chuangfeng Duan, Min Hu, Haozuan Zhang
      First page: 104
      Abstract: Accurately predicting the structural deformation trend of tunnels during operation is important for putting tunnel safety maintenance on a sounder scientific footing. With the development of data science, structural deformation prediction methods based on time-series data have attracted attention. The Auto-Regressive Integrated Moving Average (ARIMA) model is a classical statistical model that is suitable for processing non-stationary time-series data. Long Short-Term Memory (LSTM) is a special recurrent neural network that can learn long-term dependencies in time series. Both are widely used in temporal prediction. In view of the lack of time-series prediction in the tunnel deformation field, this paper uses historical data from the Xinjian Road and Dalian Road tunnels in Shanghai to propose a new way of modeling based on single points and road sections. ARIMA and LSTM models are applied in comprehensive experiments, and the results show that: (1) Both LSTM and ARIMA models perform well for settlement and convergence deformation. (2) The overall robustness of ARIMA is better than that of LSTM, and it is more adaptable to the datasets. (3) The model prediction performance is closely related to the data quality. ARIMA performs more stably when data volume is limited, while LSTM performs better with high-quality data and has a higher upper limit.
      Citation: Data
      PubDate: 2023-06-13
      DOI: 10.3390/data8060104
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 105: Assessing the Effectiveness of Masking and
           Encryption in Safeguarding the Identity of Social Media Publishers from
           Advanced Metadata Analysis

    • Authors: Mohammed Khader, Marcel Karam
      First page: 105
      Abstract: Machine learning algorithms, such as KNN, SVM, MLP, RF, and MLR, are used to extract valuable information from shared digital data on social media platforms through their APIs in an effort to identify anonymous publishers or online users. This can leave these anonymous publishers vulnerable to privacy-related attacks, as identifying information can be revealed. Twitter is an example of such a platform where identifying anonymous users/publishers is made possible by using machine learning techniques. To provide these anonymous users with stronger protection, we have examined the effectiveness of these techniques when critical fields in the metadata are masked or encrypted using tweets (text and images) from Twitter. Our results show that SVM achieved the highest accuracy rate of 95.81% without using data masking or encryption, while SVM achieved the highest identity recognition rate of 50.24% when using data masking and AES encryption algorithm. This indicates that data masking and encryption of metadata of tweets (text and images) can provide promising protection for the anonymity of users’ identities.
      Citation: Data
      PubDate: 2023-06-13
      DOI: 10.3390/data8060105
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 106: Curated Dataset for Red Blood Cell Tracking from
           Video Sequences of Flow in Microfluidic Devices

    • Authors: Ivan Cimrák, Peter Tarábek, František Kajánek
      First page: 106
      Abstract: This work presents a dataset comprising images, annotations, and velocity fields for benchmarking cell detection and cell tracking algorithms. The dataset includes two video sequences captured during laboratory experiments, showcasing the flow of red blood cells (RBC) in microfluidic channels. From the first video 300 frames and from the second video 150 frames are annotated with bounding boxes around the cells, as well as tracks depicting the movement of individual cells throughout the video. The dataset encompasses approximately 20,000 bounding boxes and 350 tracks. Additionally, computational fluid dynamics simulations were utilized to generate 2D velocity fields representing the flow within the channels. These velocity fields are included in the dataset. The velocity field has been employed to improve cell tracking by predicting the positions of cells across frames. The paper also provides a comprehensive discussion on the utilization of the flow matrix in the tracking steps.
      Citation: Data
      PubDate: 2023-06-13
      DOI: 10.3390/data8060106
      Issue No: Vol. 8, No. 6 (2023)
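      One simple way to picture how a 2D velocity field can predict cell positions across frames is a single forward step: look up the velocity at the cell's current position and advance by one frame interval. The grid layout, the nearest-cell lookup, and all names and numbers below are assumptions made for illustration; the paper's actual use of the flow matrix may differ.

```python
def predict_position(pos, field, cell_size, dt):
    """Advance a position by one time step using the local field velocity.

    `field[row][col]` holds a (vx, vy) pair for a square grid cell of side
    `cell_size`; positions outside the grid are clamped to the nearest cell.
    """
    x, y = pos
    row = min(max(int(y // cell_size), 0), len(field) - 1)
    col = min(max(int(x // cell_size), 0), len(field[0]) - 1)
    vx, vy = field[row][col]
    return (x + vx * dt, y + vy * dt)


# Hypothetical 2x2 velocity grid (pixels per frame), for illustration only.
field = [
    [(2.0, 0.0), (3.0, 0.5)],
    [(2.0, 0.0), (3.0, -0.5)],
]
predicted = predict_position((12.0, 4.0), field, cell_size=10.0, dt=1.0)
```

      A tracker could then match each predicted position to the nearest detected bounding box in the next frame.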
  • Data, Vol. 8, Pages 107: A Preliminary Investigation of a Single Shock
           Impact on Italian Mortality Rates Using STMF Data: A Case Study of

    • Authors: Maria Francesca Carfora, Albina Orlando
      First page: 107
      Abstract: Mortality shocks, such as pandemics, threaten the consolidated longevity improvements, confirmed in the last decades for the majority of western countries. Indeed, just before the COVID-19 pandemic, mortality was falling for all ages, with a different behavior according to different ages and countries. It is indubitable that the changes in the population longevity induced by shock events, even transitory ones, affecting demographic projections, have financial implications in public spending as well as in pension plans and life insurance. The Short Term Mortality Fluctuations (STMF) data series, providing data of all-cause mortality fluctuations by week within each calendar year for 38 countries worldwide, offers a powerful tool to timely analyze the effects of the mortality shock caused by the COVID-19 pandemic on Italian mortality rates. This dataset, recently made available as a new component of the Human Mortality Database, is described and techniques for the integration of its data with the historical mortality time series are proposed. Then, to forecast mortality rates, the well-known stochastic mortality model proposed by Lee and Carter in 1992 is first considered, to be consistent with the internal processing of the Human Mortality Database, where exposures are estimated by the Lee–Carter model; empirical results are discussed both on the estimation of the model coefficients and on the forecast of the mortality rates. In detail, we show how the integration of the yearly aggregated STMF data in the HMD database allows the Lee–Carter model to capture the complex evolution of the Italian mortality rates, including the higher lethality for males and older people, in the years that follow a large shock event such as the COVID-19 pandemic. Finally, we discuss some key points concerning the improvement of existing models to take into account mortality shocks and evaluate their impact on future mortality dynamics.
      Citation: Data
      PubDate: 2023-06-13
      DOI: 10.3390/data8060107
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 108: How Expert Is the Crowd? Insights into Crowd
           Opinions on the Severity of Earthquake Damage

    • Authors: Motti Zohar, Amos Salamon, Carmit Rapaport
      First page: 108
      Abstract: The evaluation of earthquake damage is central to assessing its severity and damage characteristics. However, the methods of assessment encounter difficulties concerning the subjective judgments and interpretation of the evaluators. Thus, it is mainly geologists, seismologists, and engineers who perform this exhausting task. Here, we explore whether an evaluation made by semiskilled people and by the crowd is equivalent to the experts’ opinions and, thus, can be harnessed as part of the process. Therefore, we conducted surveys in which a cohort of graduate students studying natural hazards (n = 44) and an online crowd (n = 610) were asked to evaluate the level of severity of earthquake damage. The two outcome datasets were then compared with the evaluation made by two of the present authors, who are considered experts in the field. Interestingly, the evaluations of both the semiskilled cohort and the crowd were found to be fairly similar to those of the experts, thus suggesting that they can provide an interpretation close enough to an expert’s opinion on the severity level of earthquake damage. Such an understanding may indicate that although our analysis is preliminary and requires more case studies for this to be verified, there is vast potential encapsulated in crowd-sourced opinion on simple earthquake-related damage, especially if a large amount of data is to be handled.
      Citation: Data
      PubDate: 2023-06-14
      DOI: 10.3390/data8060108
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 109: Dataset of Program Source Codes Solving Unique
           Programming Exercises Generated by Digital Teaching Assistant

    • Authors: Liliya A. Demidova, Elena G. Andrianova, Peter N. Sovietov, Artyom V. Gorchakov
      First page: 109
      Abstract: This paper presents a dataset containing automatically collected source codes solving unique programming exercises of different types. The programming exercises were automatically generated by the Digital Teaching Assistant (DTA) system that automates a massive Python programming course at MIREA—Russian Technological University (RTU MIREA). Source codes of the small programs grouped by the type of the solved task can be used for benchmarking source code classification and clustering algorithms. Moreover, the data can be used for training intelligent program synthesizers or benchmarking mutation testing frameworks, and more applications are yet to be discovered. We describe the architecture of the DTA system, aiming to provide detailed insight regarding how and why the dataset was collected. In addition, we describe the algorithms responsible for source code analysis in the DTA system. These algorithms use vector representations of programs based on Markov chains, compute pairwise Jensen–Shannon divergences of programs, and apply hierarchical clustering algorithms in order to automatically discover high-level concepts used by students while solving unique tasks. The proposed approach can be incorporated into massive programming courses when there is a need to identify approaches implemented by students.
      Citation: Data
      PubDate: 2023-06-14
      DOI: 10.3390/data8060109
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 110: Deep Learning-Based Black Spot Identification on
           Greek Road Networks

    • Authors: Ioannis Karamanlis, Alexandros Kokkalis, Vassilios Profillidis, George Botzoris, Chairi Kiourt, Vasileios Sevetlidis, George Pavlidis
      First page: 110
      Abstract: Black spot identification, a spatiotemporal phenomenon, involves analysing the geographical location and time-based occurrence of road accidents. Typically, this analysis examines specific locations on road networks during set time periods to pinpoint areas with a higher concentration of accidents, known as black spots. By evaluating these problem areas, researchers can uncover the underlying causes and reasons for increased collision rates, such as road design, traffic volume, driver behaviour, weather, and infrastructure. However, challenges in identifying black spots include limited data availability, data quality, and assessing contributing factors. Additionally, evolving road design, infrastructure, and vehicle safety technology can affect black spot analysis and determination. This study focused on traffic accidents in Greek road networks to recognize black spots, utilizing data from police and government-issued car crash reports. The study produced a publicly available dataset called Black Spots of North Greece (BSNG) and a highly accurate identification method.
      Citation: Data
      PubDate: 2023-06-16
      DOI: 10.3390/data8060110
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 111: Self-Reported Mental Health and Psychosocial
           Correlates during the COVID-19 Pandemic: Data from the General Population
           in Italy

    • Authors: Daniela Marchetti, Roberta Maiella, Rocco Palumbo, Melissa D’Ettorre, Irene Ceccato, Marco Colasanti, Adolfo Di Crosta, Pasquale La Malva, Emanuela Bartolini, Daniela Biasone, Nicola Mammarella, Piero Porcelli, Alberto Di Domenico, Maria Cristina Verrocchio
      First page: 111
      Abstract: The COVID-19 pandemic tremendously impacted people’s day-to-day activities and mental health. This article describes the dataset used to investigate the psychological impact of the first national lockdown on the general Italian population. For this purpose, an online survey was disseminated via Qualtrics between 1 April and 20 April 2020, to record various socio-demographic and psychological variables. The measures included both validated (namely, the Impact of the Event Scale-Revised, the Perceived Stress Scale, the nine-item Patient Health Questionnaire, the seven-item Generalized Anxiety Disorder scale, the Big Five Inventory 10-Item, and the Whiteley Index-7) and ad hoc questionnaires (nine items to investigate in-group and out-group trust). The final sample comprised 4081 participants (18–85 years old). The dataset could be helpful to other researchers in understanding the psychological impact of the COVID-19 pandemic and its related preventive and protective measures. Furthermore, the present data might help shed some light on the role of individual differences in response to traumatic events. Finally, this dataset can increase the knowledge in investigating psychological distress, health anxiety, and personality traits.
      Citation: Data
      PubDate: 2023-06-16
      DOI: 10.3390/data8060111
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 112: RipSetCocoaCNCH12: Labeled Dataset for Ripeness
           Stage Detection, Semantic and Instance Segmentation of Cocoa Pods

    • Authors: Juan Felipe Restrepo-Arias, María Isabel Salinas-Agudelo, María Isabel Hernandez-Pérez, Alejandro Marulanda-Tobón, María Camila Giraldo-Carvajal
      First page: 112
      Abstract: Fruit counting and ripeness detection are computer vision applications that have gained strength in recent years due to the advancement of new algorithms, especially those based on artificial neural networks (ANNs), better known as deep learning. In agriculture, those algorithms capable of fruit counting, including information about their ripeness, are mainly applied to make production forecasts or plan different activities such as fertilization or crop harvest. This paper presents the RipSetCocoaCNCH12 dataset of cocoa pods labeled at four different ripeness stages: stage 1 (0–2 months), stage 2 (2–4 months), stage 3 (4–6 months), and harvest stage (>6 months). An additional class was also included for pods aborted by plants in the early stage of development. A total of 4116 images were labeled to train algorithms that mainly perform semantic and instance segmentation. The labeling was carried out with CVAT (Computer Vision Annotation Tool). The dataset, therefore, includes labeling in two formats: COCO 1.0 and segmentation mask 1.1. The images were taken with different mobile devices (smartphones), in field conditions, during the harvest season at different times of the day, which could allow the algorithms to be trained with data that includes many variations in lighting, colors, textures, and sizes of the cocoa pods. As far as we know, this is the first openly available dataset for cocoa pod detection with semantic segmentation for five classes, 4116 images, and 7917 instances, comprising RGB images and two different formats for labels. With the publication of this dataset, we expect that researchers in smart farming, especially in cocoa cultivation, can benefit from the quantity and variety of images it contains.
      Citation: Data
      PubDate: 2023-06-18
      DOI: 10.3390/data8060112
      Issue No: Vol. 8, No. 6 (2023)
  • Data, Vol. 8, Pages 81: Dataset of Fluorescence EEM and UV Spectroscopy
           Data of Olive Oils during Ageing

    • Authors: Francesca Venturini, Silvan Fluri, Michael Baumgartner
      First page: 81
      Abstract: The dataset presented in this study encompasses fluorescence excitation–emission matrices (EEMs) and UV-spectroscopy data of 24 extra virgin olive oils (EVOOs) commercially available at supermarkets in Switzerland. To investigate the effect of thermal degradation, the samples were exposed to accelerated ageing at 60 °C up to 53 days. EEMs and UV absorption parameters were measured in 10 ageing steps. The dataset can be used, for example, to predict one or multiple chemical parameters or to classify samples based on their quality from fluorescence spectra.
      Citation: Data
      PubDate: 2023-04-29
      DOI: 10.3390/data8050081
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 82: Exploring Spatial Patterns in Sensor Data for
           Humidity, Temperature, and RSSI Measurements

    • Authors: Juan Botero-Valencia, Adrian Martinez-Perez, Ruber Hernández-García, Luis Castano-Londono
      First page: 82
      Abstract: The Internet of Things (IoT) is one of the fastest-growing research areas in recent years and is strongly linked to the development of smart cities, smart homes, and factories. IoT can be defined as connecting devices, sensors, and physical objects that can collect and transmit data across a network, enabling increased automation and better decision-making. In several IoT applications, humidity and temperature are some of the most used variables for adjusting system configurations and understanding their performance because they are related to various physical processes, human comfort, manufacturing processes, and 3D printing, among other things. In addition, one of the biggest problems associated with IoT is the excessive production of data, so it is necessary to develop methodologies to optimize the process of collecting information. This work presents a new dataset comprising almost 55 million values of temperature, relative humidity, and RSSI (Received Signal Strength Indicator) collected in two indoor spaces for longer than 3915 h at 10 s intervals. For each experiment, we captured the information from 13 previously calibrated sensors suspended from the ceiling at the same height and with a known relative position. The proposed dataset aims to contribute a benchmark for evaluating indoor temperature and humidity-controlled systems. The collected data allow the validation and improvement of the acquisition process for IoT applications.
      Citation: Data
      PubDate: 2023-04-29
      DOI: 10.3390/data8050082
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 83: Cloud-Based Smart Contract Analysis in FinTech
           Using IoT-Integrated Federated Learning in Intrusion Detection

    • Authors: Venkatagurunatham Naidu Kollu, Vijayaraj Janarthanan, Muthulakshmi Karupusamy, Manikandan Ramachandran
      First page: 83
      Abstract: Data sharing is proposed because the issue of data islands hinders advancement of artificial intelligence technology in the 5G era. Sharing high-quality data has a direct impact on how well machine-learning models work, but there will always be misuse and leakage of data. The field of financial technology, or FinTech, has received a lot of attention and is growing quickly. This field has seen the introduction of new terms as a result of its ongoing expansion. One example of such terminology is “FinTech”. This term is used to describe a variety of procedures utilized frequently in the financial technology industry. This study aims to create a cloud-based intrusion detection system based on IoT federated learning architecture as well as smart contract analysis. This study proposes a novel method for detecting intrusions using a cyber-threat federated graphical authentication system and cloud-based smart contracts in FinTech data. Users are required to create a route on a world map as their credentials under this scheme. We had 120 people participate in the evaluation, 60 of whom had a background in finance or FinTech. The simulation was then carried out in Python using a variety of FinTech cyber-attack datasets for accuracy, precision, recall, F-measure, AUC (Area under the ROC Curve), trust value, scalability, and integrity. The proposed technique attained accuracy of 95%, precision of 85%, RMSE of 59%, recall of 68%, F-measure of 83%, AUC of 79%, trust value of 65%, scalability of 91%, and integrity of 83%.
      Citation: Data
      PubDate: 2023-04-29
      DOI: 10.3390/data8050083
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 84: Biotechnology and Bio-Based Products Perceptions
           in the Community of Madrid: A Representative Survey Dataset

    • Authors: Juan Romero-Luis, Manuel Gertrudix, María del Carmen Gertrudis Casado, Alejandro Carbonell-Alcocer
      First page: 84
      Abstract: (1) Background: Bioeconomy aims to reduce dependence on non-renewable resources and foster economic growth through the development of new bio-based products and services. Achieving this goal requires social acceptance and stakeholder engagement in the development of sustainable technologies. The objective of this data article is to provide a dataset derived from a survey with a representative sample of 500 citizens over 18 years old based in the Community of Madrid. (2) Methods: We created a questionnaire on the social acceptance of technologies and bio-based products to later gather the responses using a SurveyMonkey panel for the Community of Madrid through an online CAWI survey; (3) Results: A dataset with a total of 82 columns with all responses is the result of this study. (4) Conclusions: This data article provides not only a valuable representative dataset of citizens of the Community of Madrid but also sufficient resources to replicate the same study in other regions.
      Citation: Data
      PubDate: 2023-05-01
      DOI: 10.3390/data8050084
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 85: Emission Inventory for Maritime Shipping Emissions
           in the North and Baltic Sea

    • Authors: Franziska Dettner, Simon Hilpert
      First page: 85
      Abstract: A high temporal and spatial resolution emission inventory for the North Sea and Baltic Sea was compiled using current emission factors and ship activity data. The inventory includes seagoing vessels over 100 GT registered with the International Maritime Organization traversing in the North and Baltic Seas. A bottom-up approach was chosen for the compilation of the inventory, which provides emission levels of the air pollutants CO2, NOx, SO2, PM2.5, CO, BC, Ash, NMVOC, and POA, as well as the speed-dependent fuel and energy consumption. Input data come from both main and auxiliary engines, as well as well-to-tank and tank-to-propeller emission and energy and fuel consumption quantities. The georeferenced data are provided in a temporal resolution of five minutes. The data can be used to assess, inter alia, the health effects of maritime emissions, the social costs of maritime transport, emission mitigation effects of alternative fuel scenarios, and shore-to-ship power supply.
      Citation: Data
      PubDate: 2023-05-01
      DOI: 10.3390/data8050085
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 86: RaspberrySet: Dataset of Annotated Raspberry
           Images for Object Detection

    • Authors: Sarmīte Strautiņa, Ieva Kalniņa, Edīte Kaufmane, Kaspars Sudars, Ivars Namatēvs, Arturs Nikulins, Edgars Edelmers
      First page: 86
      Abstract: The RaspberrySet dataset is a valuable resource for those working in the field of agriculture, particularly in the selection and breeding of ecologically adaptable berry cultivars. This is because long-term changes in temperature and weather patterns have made it increasingly important for crops to be able to adapt to their environment. To assess the suitability of different cultivars or to make yield predictions, it is necessary to describe and evaluate berries’ characteristics at various growth stages. This process is typically carried out visually, but it can be time-consuming and labor-intensive, requiring significant expert knowledge. The RaspberrySet dataset was created to assist with this process, and it includes images of raspberry berries at five different stages of development. These stages are flower buds, flowers, unripe berries, and ripe berries. The images were classified as buds, damaged buds, flowers, unripe berries, and ripe berries, annotated using ground truth ROI, and presented in YOLO format. The dataset includes 2039 high-resolution RGB images, with a total of 46,659 annotations provided by experts using Label Studio software (1.7.1). The images were taken in various weather conditions, at different times of the day, and from different angles, and they include fully visible buds, flowers, berries, and partially obscured buds. This dataset is intended to improve the efficiency of berry breeding and yield estimation and to identify the raspberry phenotype more accurately. It may also be useful for breeding other fruit crops, as it allows for the reliable detection and phenotyping of yield components at different stages of development. By providing a homogenized dataset of images taken on-site at the Institute of Horticulture in Dobele, Latvia, the RaspberrySet dataset offers a valuable resource for those working in horticulture.
      Citation: Data
      PubDate: 2023-05-10
      DOI: 10.3390/data8050086
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 87: The Effect of Short-Term Transcutaneous Electrical
           Stimulation of Auricular Vagus Nerve on Parameters of Heart Rate

    • Authors: Vladimir Shvartz, Eldar Sizhazhev, Maria Sokolskaya, Svetlana Koroleva, Soslan Enginoev, Sofia Kruchinova, Elena Shvartz, Elena Golukhova
      First page: 87
      Abstract: Many previous studies have demonstrated that transcutaneous vagus nerve stimulation (VNS) has the potential to exhibit therapeutic effects similar to its invasive counterpart. An objective assessment of VNS requires a reliable biomarker of successful vagal activation. Although many potential biomarkers have been proposed, most studies have focused on heart rate variability (HRV). Despite the physiological rationale for HRV as a biomarker for assessing vagal stimulation, data on its effects on HRV are equivocal. To further advance this field, future studies investigating VNS should contain adequate methodological specifics that make it possible to compare the results between studies, to replicate studies, and to enhance the safety of study participants. This article describes the design and methodology of a randomized study evaluating the effect of short-term noninvasive stimulation of the auricular branch of the vagus nerve on parameters of HRV. Primary records of rhythmograms of all the subjects, as well as a dataset with clinical, instrumental, and laboratory data of all the current study subjects are in the public domain for possible secondary analysis to all interested researchers. The physiological interpretation of the obtained data is not considered in the article.
      Citation: Data
      PubDate: 2023-05-11
      DOI: 10.3390/data8050087
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 88: A Multispectral UAV Imagery Dataset of Wheat,
           Soybean and Barley Crops in East Kazakhstan

    • Authors: Almasbek Maulit, Aliya Nugumanova, Kurmash Apayev, Yerzhan Baiburin, Maxim Sutula
      First page: 88
      Abstract: This study introduces a dataset of crop imagery captured during the 2022 growing season in the Eastern Kazakhstan region. The images were acquired using a multispectral camera mounted on an unmanned aerial vehicle (DJI Phantom 4). The agricultural land, encompassing 27 hectares and cultivated with wheat, barley, and soybean, was subjected to five aerial multispectral photography sessions throughout the growing season. This facilitated thorough monitoring of the most important phenological stages of crop development in the experimental design, which consisted of 27 plots, each covering one hectare. The collected imagery underwent enhancement and expansion, integrating a sixth band that embodies the normalized difference vegetation index (NDVI) values in conjunction with the original five multispectral bands (Blue, Green, Red, Red Edge, and Near Infrared Red). This amplification enables a more effective evaluation of vegetation health and growth, rendering the enriched dataset a valuable resource for the progression and validation of crop monitoring and yield prediction models, as well as for the exploration of precision agriculture methodologies.
      Citation: Data
      PubDate: 2023-05-11
      DOI: 10.3390/data8050088
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 89: A Comprehensive Dataset of Spelling Errors and
           Users’ Corrections in Croatian Language

    • Authors: Gordan Gledec, Marko Horvat, Miljenko Mikuc, Bruno Blašković
      First page: 89
      Abstract: This paper presents a unique and extensive dataset containing over 33 million entries with pairs in the form “spelling error → correction” from the most popular Croatian online spellchecking service, collected since 2008. The dataset, compiled from the contribution of nearly 900,000 users, is a valuable resource for researchers and developers in the field of natural language processing (NLP), improving spellcheck accuracy, and language learning applications. The dataset may be used to accomplish several goals: (1) improving spellchecking accuracy by incorporating common user corrections and reducing false positives and negatives; (2) helping language learners identify common errors and learn correct spelling through targeted feedback; (3) analyzing data trends and patterns to uncover the most common spelling errors and their underlying causes; (4) identifying and evaluating factors that influence typing input; (5) improving NLP applications such as text recognition and machine translation. Tasks specific to the Croatian language include the creation of a letter-level confusion matrix and the refinement of word suggestions based on historical usage of the service. This comprehensive dataset provides researchers and practitioners with a wealth of information, opening the path for advancements in spellchecking, language learning, and NLP applications in the Croatian language.
      Citation: Data
      PubDate: 2023-05-12
      DOI: 10.3390/data8050089
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 90: An Efficient Deep Learning for Thai Sentiment

    • Authors: Nattawat Khamphakdee, Pusadee Seresangtakul
      First page: 90
      Abstract: The number of reviews from customers on travel websites and platforms is quickly increasing. They provide people with the ability to write reviews about their experience with respect to service quality, location, room, and cleanliness, thereby helping others before booking hotels. Many people fail to consider hotel bookings because the numerous reviews take a long time to read, and many are in a non-native language. Thus, hotel businesses need an efficient process to analyze and categorize the polarity of reviews as positive, negative, or neutral. In particular, low-resource languages such as Thai have greater limitations in terms of resources to classify sentiment polarity. In this paper, a sentiment analysis method is proposed for Thai sentiment classification in the hotel domain. Firstly, the Word2Vec technique (the continuous bag-of-words (CBOW) and skip-gram approaches) was applied to create word embeddings of different vector dimensions. Secondly, each word embedding model was combined with deep learning (DL) models to observe the impact of each word vector dimension result. We compared the performance of nine DL models (CNN, LSTM, Bi-LSTM, GRU, Bi-GRU, CNN-LSTM, CNN-BiLSTM, CNN-GRU, and CNN-BiGRU) with different numbers of layers to evaluate their performance in polarity classification. The dataset was classified using the FastText and BERT pre-trained models to carry out the sentiment polarity classification. Finally, our experimental results show that the WangchanBERTa model slightly improved the accuracy, producing a value of 0.9225, and the skip-gram and CNN model combination outperformed other DL models, reaching an accuracy of 0.9170. From the experiments, we found that the word vector dimensions, hyperparameter values, and the number of layers of the DL models affected the performance of sentiment classification. Our research provides guidance for setting suitable hyperparameter values to improve the accuracy of sentiment classification for the Thai language in the hotel domain.
      Citation: Data
      PubDate: 2023-05-13
      DOI: 10.3390/data8050090
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 91: A Set of Geophysical Fields for Modeling of the
           Lithosphere Structure and Dynamics in the Russian Arctic Zone

    • Authors: Anatoly Soloviev, Alexey Petrunin, Sofia Gvozdik, Roman Sidorov
      First page: 91
      Abstract: This paper presents a set of various geological and geophysical data for the Arctic zone, including some detailed models for the eastern part of the Russian Arctic zone. This hard-to-access territory has a complex geological structure, which is poorly studied by direct geophysical methods. Therefore, these data can be used in an integrative analysis for different purposes. These are the gravity field, heat flow, and various seismic tomography models. The gravity field data include several reductions calculated during our preceding studies, which are more appropriate for the study of the Earth’s interiors than the initial free air anomalies. Specifically, these are the Bouguer, isostatic, and decompensative gravity anomalies. A surface heat flow map included in the dataset is based on a joint inversion of multiple geophysical data constrained by the observations from the International Heat Flow Commission catalog. Available seismic tomography models were analyzed to select the best one for further investigation. We provide the models for the sedimentary cover and the Moho depth, which are significantly improved compared to the existing ones. The database provides a basis for qualitative and quantitative analysis of the region.
      Citation: Data
      PubDate: 2023-05-14
      DOI: 10.3390/data8050091
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 92: Low-Dose Radiation-Induced Transcriptomic Changes
           in Diabetic Aortic Endothelial Cells

    • Authors: Jihye Park, Kyuho Kang, Yeonghoon Son, Kwang Seok Kim, Keunsoo Kang, Hae-June Lee
      First page: 92
      Abstract: Low-dose radiation refers to exposure to ionizing radiation at levels that are generally considered safe and not expected to cause immediate health effects. However, the effects of low-dose radiation are still not fully understood, and research in this area is ongoing. In this study, we investigated the alterations in gene expression profiles of human aortic endothelial cells (HAECs) and diabetic human aortic endothelial cells (T2D-HAECs) derived from patients with type 2 diabetes. To this end, we used RNA-seq to profile the transcriptomes of cells exposed to varying doses of low-dose radiation (0.1 Gy, 0.5 Gy, and 2.0 Gy) and compared them to a control group with no radiation exposure. Differentially expressed genes and enriched pathways were identified using the DESeq2 and gene set enrichment analysis (GSEA) methods, respectively. The data generated in this study are publicly available through the gene expression omnibus (GEO) database with the accession number GSE228572. This study provides a valuable resource for examining the effects of low-dose radiation on HAECs and T2D-HAECs, thereby contributing to a better understanding of the potential human health risks associated with low-dose radiation exposure.
      Citation: Data
      PubDate: 2023-05-18
      DOI: 10.3390/data8050092
      Issue No: Vol. 8, No. 5 (2023)
  • Data, Vol. 8, Pages 174: Machine Learning Applications to Identify Young
           Offenders Using Data from Cognitive Function Tests

    • Authors: María Claudia Bonfante, Juan Contreras Montes, Mariana Pino, Ronald Ruiz, Gabriel González
      First page: 174
      Abstract: Machine learning techniques can be used to identify whether deficits in cognitive functions contribute to antisocial and aggressive behavior. This paper initially presents the results of tests conducted on delinquent and nondelinquent youths to assess their cognitive functions. The dataset extracted from these assessments, consisting of 37 predictor variables and one target, was used to train three algorithms which aim to predict whether the data correspond to those of a young offender or a nonoffending youth. Prior to this, statistical tests were conducted on the data to identify characteristics which exhibited significant differences in order to select the most relevant features and optimize the prediction results. Additionally, other feature selection methods, such as Boruta, RFE, and filter, were applied, and their effects on the accuracy of each of the three machine learning models used (SVM, RF, and KNN) were compared. In total, 80% of the data were utilized for training, while the remaining 20% were used for validation. The best result was achieved by the K-NN model, trained with 19 features selected by the Boruta method, followed by the SVM model, trained with 24 features selected by the filter method.
      Citation: Data
      PubDate: 2023-11-21
      DOI: 10.3390/data8120174
      Issue No: Vol. 8, No. 12 (2023)
  • Data, Vol. 8, Pages 175: Long-Term Spatiotemporal Oceanographic Data from
           the Northeast Pacific Ocean: 1980–2022 Reconstruction Based on the
           Korea Oceanographic Data Center (KODC) Dataset

    • Authors: Seong-Hyeon Kim, Hansoo Kim
      First page: 175
      Abstract: The Korea Oceanographic Data Center (KODC), overseen by the National Institute of Fisheries Science (NIFS), is a pivotal hub for collecting, processing, and disseminating marine science data. By digitizing and subjecting observational data to rigorous quality control, the KODC ensures accurate information in line with international standards. The center actively engages in global partnerships and fosters marine data exchange. A wide array of marine information is provided through the KODC website, including observational metadata, coastal oceanographic data, real-time buoy records, and fishery environmental data. Coastal oceanographic observational data from 207 stations across various sea regions have been collected biannually since 1961. This dataset covers 14 standard water depths; includes essential parameters, such as temperature, salinity, nutrients, and pH; serves as the foundation for news, reports, and analyses by the NIFS; and is widely employed to study seasonal and regional marine variations, with researchers supplementing the limited data for comprehensive insights. The dataset offers information for each water depth at a 1 m interval over 1980–2022, facilitating research across disciplines. Data processing, including interpolation and quality control, is based on MATLAB. These data are classified by region and accessible online; hence, researchers can easily explore spatiotemporal trends in marine environments.
      Citation: Data
      PubDate: 2023-11-23
      DOI: 10.3390/data8120175
      Issue No: Vol. 8, No. 12 (2023)
  • Data, Vol. 8, Pages 176: Model Design and Applied Methodology in
           Geothermal Simulations in Very Low Enthalpy for Big Data Applications

    • Authors: Roberto Arranz-Revenga, María Pilar Dorrego de Luxán, Juan Herrera Herbert, Luis Enrique García Cambronero
      First page: 176
      Abstract: Low-enthalpy geothermal installations for heating, air conditioning, and domestic hot water are gaining traction due to efforts towards energy decarbonization. This article is part of a broader research project aimed at employing artificial intelligence and big data techniques to develop a predictive system for the thermal behavior of the ground in very low-enthalpy geothermal applications. In this initial article, a summarized process is outlined to generate large quantities of synthetic data through a ground simulation method. The proposed theoretical model allows simulation of the soil’s thermal behavior using an electrical equivalent. The electrical circuit derived is loaded into a simulation program along with an input function representing the system’s thermal load pattern. The simulator responds with another function that calculates the values of the ground over time. Some examples of value conversion and the utility of the input function system to encode thermal loads during simulation are demonstrated. It bears the limitation of invalidity in the presence of underground water currents. Model validation is pending, and once defined, a corresponding testing plan will be proposed for its validation.
      Citation: Data
      PubDate: 2023-11-23
      DOI: 10.3390/data8120176
      Issue No: Vol. 8, No. 12 (2023)
  • Data, Vol. 8, Pages 177: Dataset: Impact of β-galactosylceramidase
           Overexpression on the Protein Profile of Braf(V600E) Mutated Melanoma

    • Authors: Davide Capoferri, Paola Chiodelli, Stefano Calza, Marcello Manfredi, Marco Presta
      First page: 177
      Abstract: β-Galactosylceramidase (GALC) is a lysosomal enzyme involved in sphingolipid metabolism by removing β-galactosyl moieties from β-galactosyl ceramide and β-galactosyl sphingosine. Previous observations have shown that GALC exerts a pro-oncogenic activity in human melanoma. Here, the impact of GALC overexpression on the proteomic landscape of BRAF-mutated A2058 and A375 human melanoma cell lines was investigated by liquid chromatography–tandem mass spectrometry analysis of the cell extracts. The results indicate that GALC overexpression causes the upregulation/downregulation of 172/99 proteins in GALC-transduced cells when compared to control cells. Gene ontology categorization of up/down-regulated proteins indicates that GALC may modulate the protein landscape in BRAF-mutated melanoma cells by affecting various biological processes, including RNA metabolism, cell organelle fate, and intracellular redox status. Overall, these data provide further insights into the pro-oncogenic functions of the sphingolipid metabolizing enzyme GALC in human melanoma.
      Citation: Data
      PubDate: 2023-11-24
      DOI: 10.3390/data8120177
      Issue No: Vol. 8, No. 12 (2023)
  • Data, Vol. 8, Pages 178: In Vivo Drug Testing during Embryonic Wound
           Healing: Establishing the Avian Model

    • Authors: Martin Bablok, Beate Brand-Saberi, Morris Gellisch, Gabriela Morosan-Puopolo
      First page: 178
      Abstract: The relevance of identifying pathological processes in the context of embryonic development is increasingly gaining attention in terms of professionalized prenatal care. To analyze local effects of prenatally administered drugs during embryonic development, the model organism of the chicken embryo can be used in a first exploratory approach. For the examination of local dexamethasone administration—as an exemplary drug—common bead implantation protocols have been adapted to serve as an in vivo technique for local drug testing during embryonic skin regeneration. For this, acrylic beads were soaked in a dexamethasone solution and implanted into skin incisional wounds of 4-day-old chicken embryos. After further incubation, the effects of the applied substance on the process of embryonic skin regeneration were analyzed using histological and molecular biological techniques. This data descriptor contains a detailed microsurgical protocol, a representative video demonstration, and exemplary results of local glucocorticoid-induced changes during embryonic wound healing. To conclude, this method allows for the analysis of the local effects of a particular substance on a cellular level and can be extended to serve as an in vivo technique for numerous other drugs to be tested on embryonic tissue.
      Citation: Data
      PubDate: 2023-11-25
      DOI: 10.3390/data8120178
      Issue No: Vol. 8, No. 12 (2023)
  • Data, Vol. 8, Pages 179: A Tourist-Based Framework for Developing Digital
           Marketing for Small and Medium-Sized Enterprises in the Tourism Sector in
           Saudi Arabia

    • Authors: Rishaa Abdulaziz Alnajim, Bahjat Fakieh
      First page: 179
      Abstract: Social media has become an essential tool for travel planning, with tourists increasingly using it to research destinations, book accommodation, and make travel arrangements. However, little is known about how tourists use social media for travel planning and what factors influence their intentions to use social media for this purpose. This thesis aims to understand tourists’ intentions to use social media for travel planning. Specifically, it investigates the factors influencing tourists’ intentions to use social media for planning travel to Saudi Arabia. It develops a machine learning (ML) classification model to assist Saudi tourism SMEs in creating effective digital marketing strategies for social media platforms. A survey was conducted with 573 tourists interested in visiting Saudi Arabia, using the Design Science Research (DSR) approach. The findings support the tourist-based theoretical framework, showing that perceived usefulness (PU), perceived ease of use (PEOU), satisfaction (SAT), marketing-generated content (MGC), and user-generated content (UGC) significantly impact tourists’ intentions to use social media for travel planning. Tourists’ characteristics and visit characteristics influenced their intentions to use MGC but not UGC. The tourist-based ML classification model, developed using the LinearSVC algorithm, achieved an accuracy of 99% when evaluated using the K-Fold Cross-Validation (KF-CV) technique. The findings of this study have several implications for Saudi tourism SMEs. First, the results suggest that SMEs should focus on developing social media content that is perceived as useful, easy to use, and satisfying. Second, the findings suggest that SMEs should focus on using MGC in their social media marketing campaigns. Third, the results suggest that SMEs should tailor their social media marketing campaigns to the characteristics of their target tourists. 
      This study contributes to the literature on tourism marketing and social media by providing a better understanding of how tourists use social media for travel planning. Saudi tourism SMEs can use the findings of this study to develop more effective digital marketing strategies for social media platforms.
      Citation: Data
      PubDate: 2023-11-28
      DOI: 10.3390/data8120179
      Issue No: Vol. 8, No. 12 (2023)
  • Data, Vol. 8, Pages 180: Public Perception of ChatGPT and Transfer
           Learning for Tweets Sentiment Analysis Using Wolfram Mathematica

    • Authors: Yankang Su, Zbigniew J. Kabala
      First page: 180
      Abstract: Understanding public opinion on ChatGPT is crucial for recognizing its strengths and areas of concern. By utilizing natural language processing (NLP), this study delves into tweets regarding ChatGPT to determine temporal patterns, content features, and topic modeling and perform a sentiment analysis. Analyzing a dataset of 500,000 tweets, our research shifts from conventional data science tools like Python and R to exploit Wolfram Mathematica’s robust capabilities. Additionally, with the aim of solving the problem of ignoring semantic information in the LDA model feature extraction, a synergistic methodology entwining LDA, GloVe embeddings, and K-Nearest Neighbors (KNN) clustering is proposed to categorize topics within ChatGPT-related tweets. This comprehensive strategy ensures semantic, syntactic, and topical congruence within classified groups by utilizing the strengths of probabilistic modeling, semantic embeddings, and similarity-based clustering. While built-in sentiment classifiers often fall short in accuracy, we introduce four transfer learning techniques from the Wolfram Neural Net Repository to address this gap. Two of these techniques involve transferring static word embeddings, “GloVe” and “ConceptNet”, which are further processed using an LSTM layer. The remaining techniques center on fine-tuning pre-trained models using scantily annotated data; one refines embeddings from language models (ELMo), while the other fine-tunes bidirectional encoder representations from transformers (BERT). Our experiments on the dataset underscore the effectiveness of the four methods for the sentiment analysis of tweets. This investigation augments our comprehension of user sentiment towards ChatGPT and emphasizes the continued significance of exploration in this domain. 
      Furthermore, this work serves as a pivotal reference for scholars who are accustomed to using Wolfram Mathematica in other research domains, aiding their efforts in text analytics on social media platforms.
      Citation: Data
      PubDate: 2023-11-28
      DOI: 10.3390/data8120180
      Issue No: Vol. 8, No. 12 (2023)
  • Data, Vol. 8, Pages 159: DataPLAN: A Web-Based Data Management Plan
           Generator for the Plant Sciences

    • Authors: Xiao-Ran Zhou, Sebastian Beier, Dominik Brilhaus, Cristina Martins Rodrigues, Timo Mühlhaus, Dirk von Suchodoletz, Richard M. Twyman, Björn Usadel, Angela Kranz
      First page: 159
      Abstract: Research data management (RDM) combines a set of practices for the organization, storage and preservation of data from research projects. The RDM strategy of a project is usually formalized as a data management plan (DMP)—a document that sets out procedures to ensure data findability, accessibility, interoperability and reusability (FAIR-ness). Many aspects of RDM are standardized across disciplines so that data and metadata are reusable, but the components of DMPs in the plant sciences are often disconnected. The inability to reuse plant-specific DMP content across projects and funding sources requires additional time and effort to write unique DMPs for different settings. To address this issue, we developed DataPLAN—an open-source tool incorporating prewritten DMP content for the plant sciences that can be used online or offline to prepare multiple DMPs. The current version of DataPLAN supports Horizon 2020 and Horizon Europe projects, as well as projects funded by the German Research Foundation (DFG). Furthermore, DataPLAN offers the option for users to customize their own templates. Additional templates to accommodate other funding schemes will be added in the future. DataPLAN reduces the workload needed to create or update DMPs in the plant sciences by presenting standardized RDM practices optimized for different funding contexts.
      Citation: Data
      PubDate: 2023-10-24
      DOI: 10.3390/data8110159
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 160: Fabaceae: South African Medicinal Plant Species
           Used in the Treatment and Management of Sexually Transmitted and Related
           Opportunistic Infections Associated with HIV-AIDS

    • Authors: Nkoana Ishmael Mongalo, Maropeng Vellry Raletsena
      First page: 160
      Abstract: The use of medicinal plants, particularly in the treatment of sexually transmitted and related infections, is ancient. These plants may well be used as alternative and complementary medicine to a variety of antibiotics that may possess limitations mainly due to an emerging enormous antimicrobial resistance. Several computerized database literature sources such as ScienceDirect, Scopus, Scielo, PubMed, and Google Scholar were used to retrieve information on Fabaceae species used in the treatment and management of sexually transmitted and related infections in South Africa. The other information was sourced from various academic dissertations, theses, and botanical books. A total of 42 medicinal plant species belonging to the Fabaceae family, used in the treatment of sexually transmitted and related opportunistic infections associated with HIV-AIDS, have been documented. Trees were the most reported life form, yielding 47.62%, while Senna and Vachellia were the frequently cited genera yielding six and three species, respectively. Peltophorum africanum Sond. was the most preferred medicinal plant, yielding a frequency of citation of 14, while Vachellia karoo (Hayne) Banfi and Glasso as well as Elephantorrhiza burkei Benth. yielded 12 citations each. The most frequently used plant parts were roots, yielding 57.14%, while most of the plant species were administered orally after boiling (51.16%) until the infection subsided. Amazingly, many of the medicinal plant species are recommended for use to treat impotence (29.87%), while most common STI infections such as chlamydia (7.79%), gonorrhea (6.49%), syphilis (5.19%), genital warts (2.60%), and many other unidentified STIs that may include “Makgoma” and “Divhu” were less cited. 
      Although there are widespread data on the in vitro evidence of the use of the Fabaceae species in the treatment of sexually transmitted and related infections, there is a need to explore the in vivo studies to further ascertain the use of species as a possible complementary and alternative medicine to the currently used antibiotics in both developing and underdeveloped countries. Furthermore, the toxicological profiles of many of these studies need to be further explored. The safety and efficacy of over-the-counter pharmaceutical products developed using these species also need to be explored.
      Citation: Data
      PubDate: 2023-10-24
      DOI: 10.3390/data8110160
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 161: Dataset: Biodiversity of Ground Beetles
           (Coleoptera, Carabidae) of the Republic of Mordovia (Russia)

    • Authors: Leonid V. Egorov, Viktor V. Aleksanov, Sergei K. Alekseev, Alexander B. Ruchin, Oleg N. Artaev, Mikhail N. Esin, Sergei V. Lukiyanov, Evgeniy A. Lobachev, Gennadiy B. Semishin
      First page: 161
      Abstract: (1) Background: Carabidae is one of the most diverse families of Coleoptera. Many species of Carabidae are sensitive to anthropogenic impacts and are indicators of the state of their environment. Some species of large beetles are on the verge of extinction. The aim of this research is to describe the Carabidae fauna of the Republic of Mordovia (central part of European Russia); (2) Methods: The research was carried out in April–September 1979, 1987, 2000, 2001, 2005, and 2007–2022. Collections were performed using a variety of methods (light trapping, soil traps, window traps, etc.). For each observation, the coordinates of the sampling location, abundance, and dates were recorded; (3) Results: The dataset contains data on 251 species of Carabidae from 12 subfamilies and 4576 occurrences. A total of 66,378 specimens of Carabidae were studied. Another 29 species are additionally known from other publications. Also, twenty-two species were excluded from the fauna of the region, as they had earlier been identified by mistake; (4) Conclusions: The biodiversity of Carabidae in the Republic of Mordovia included 280 species from 12 subfamilies. Four species (Agonum scitulum, Lebia scapularis, Bembidion humerale, and Bembidion tenellum) were identified for the first time in the Republic of Mordovia.
      Citation: Data
      PubDate: 2023-10-24
      DOI: 10.3390/data8110161
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 162: The Development of a Water Resource Monitoring
           Ontology as a Research Tool for Sustainable Regional Development

    • Authors: Assel Ospan, Madina Mansurova, Vladimir Barakhnin, Aliya Nugumanova, Roman Titkov
      First page: 162
      Abstract: The development of knowledge graphs about water resources as a tool for studying the sustainable development of a region is currently an urgent task, because the growing deterioration of the state of water bodies affects the ecology, economy, and health of the population of the region. This study presents a new ontological approach to water resource monitoring in Kazakhstan, providing data integration from heterogeneous sources, semantic analysis, decision support, querying and searching, and the presentation of new knowledge in the field of water monitoring. The contribution of this work is the integration of table extraction and understanding, semantic web rule language, semantic sensor network, time ontology methods, and the inclusion of a module of socioeconomic indicators that reveal the impact of water quality on the quality of life of the population. Using machine learning methods, the study derived six ontological rules to establish new knowledge about water resource monitoring. The results of the queries demonstrate the effectiveness of the proposed method and its potential to improve water monitoring practices, promote sustainable resource management, and support decision-making processes in Kazakhstan. The proposed ontology can also be integrated into the ontology of water resources at the scale of Central Asia.
      Citation: Data
      PubDate: 2023-10-26
      DOI: 10.3390/data8110162
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 163: A Large-Scale Dataset of Search Interests Related
           to Disease X Originating from Different Geographic Regions

    • Authors: Nirmalya Thakur, Shuqi Cui, Kesha A. Patel, Isabella Hall, Yuvraj Nihal Duggal
      First page: 163
      Abstract: The World Health Organization (WHO) added Disease X to their shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic. During different virus outbreaks of the past, such as COVID-19, Influenza, Lyme Disease, and Zika virus, researchers from various disciplines utilized Google Trends to mine multimodal components of web behavior to study, investigate, and analyze the global awareness, preparedness, and response associated with these respective virus outbreaks. As the world prepares for Disease X, a dataset on web behavior related to Disease X would be crucial to contribute towards the timely advancement of research in this field. Furthermore, none of the prior works in this field have focused on the development of a dataset to compile relevant web behavior data, which would help to prepare for Disease X. To address these research challenges, this work presents a dataset of web behavior related to Disease X, which emerged from different geographic regions of the world, between February 2018 and August 2023. Specifically, this dataset presents the search interests related to Disease X from 94 geographic regions. These regions were chosen for data mining as these regions recorded significant search interests related to Disease X during this timeframe. The dataset was developed by collecting data using Google Trends. The relevant search interests for all these regions for each month in this time range are available in this dataset. This paper also discusses the compliance of this dataset with the FAIR principles of scientific data management. Finally, an analysis of this dataset is presented to uphold the applicability, relevance, and usefulness of this dataset for the investigation of different research questions in the interrelated fields of Big Data, Data Mining, Healthcare, Epidemiology, and Data Analysis with a specific focus on Disease X.
      Citation: Data
      PubDate: 2023-10-26
      DOI: 10.3390/data8110163
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 164: Information Competences and Academic Achievement:
           A Dataset

    • Authors: Jacqueline Köhler, Roberto González-Ibáñez
      First page: 164
      Abstract: Information literacy (IL) is becoming fundamental in the modern world. Although several IL standards and assessments have been developed for secondary and higher education, there is still no agreement about the possible associations between IL and both academic achievement and student dropout rates. In this article, we present a dataset including IL competences measurements, as well as academic achievement and socioeconomic indicators for 153 Chilean first- and second-year engineering students. The dataset is intended to allow researchers to use machine learning methods to study to what extent, if any, IL and academic achievement are related.
      Citation: Data
      PubDate: 2023-10-27
      DOI: 10.3390/data8110164
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 165: Can We Mathematically Spot the Possible
           Manipulation of Results in Research Manuscripts Using Benford’s Law?

    • Authors: Teddy Lazebnik, Dan Gorlitsky
      First page: 165
      Abstract: The reproducibility of academic research has long been a persistent issue, contradicting one of the fundamental principles of science. Recently, there has been an increasing number of false claims found in academic manuscripts, casting doubt on the validity of reported results. In this paper, we utilize an adapted version of Benford’s law, a statistical phenomenon that describes the distribution of leading digits in naturally occurring datasets, to identify the potential manipulation of results in research manuscripts, solely using the aggregated data presented in those manuscripts rather than the commonly unavailable raw datasets. Our methodology applies the principles of Benford’s law to commonly employed analyses in academic manuscripts, thus reducing the need for the raw data itself. To validate our approach, we employed 100 open-source datasets and accurately predicted 79% of them using our rules. Moreover, we tested the proposed method on known retracted manuscripts, showing that around half (48.6%) can be detected using the proposed method. Additionally, we analyzed 100 manuscripts published in the last two years across ten prominent economic journals, with 10 manuscripts randomly sampled from each journal. Our analysis predicted a 3% occurrence of results manipulation with a 96% confidence level. Our findings show that Benford’s law, adapted for aggregated data, can be an initial tool for identifying data manipulation; however, it is not a silver bullet, and each flagged manuscript requires further investigation because of the relatively low prediction accuracy.
      Citation: Data
      PubDate: 2023-10-31
      DOI: 10.3390/data8110165
      Issue No: Vol. 8, No. 11 (2023)
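      Illustrative note on the entry above: in its classical form, Benford’s law gives the expected share of values whose leading digit is d as log10(1 + 1/d). The following is a minimal Python sketch of that classical first-digit check only. It is not the authors’ adapted method for aggregated data, and the function names and the deviation measure are assumptions chosen for illustration.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities under classical Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first significant digit of a nonzero number."""
    s = f"{abs(x):.15e}"  # scientific notation, e.g. '3.140000000000000e+02'
    return int(s[0])


def first_digit_frequencies(values):
    """Observed share of each leading digit 1-9, ignoring zeros."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one nonzero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def mean_absolute_deviation(values):
    """Average absolute gap between observed and expected digit shares.

    This is a hypothetical, simple distance measure; larger values mean
    the data look less Benford-like.
    """
    expected = benford_expected()
    observed = first_digit_frequencies(values)
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9
```

      A large deviation is only a prompt for closer inspection, in line with the abstract’s caution that the approach is not a silver bullet.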
  • Data, Vol. 8, Pages 166: A Scalable Data Structure for Efficient Graph
           Analytics and In-Place Mutations

    • Authors: Soukaina Firmli, Dalila Chiadmi
      First page: 166
      Abstract: The graph model enables a broad range of analyses; thus, graph processing (GP) is an invaluable tool in data analytics. At the heart of every GP system lies a concurrent graph data structure that stores the graph. Such a data structure needs to be highly efficient for both graph algorithms and queries. Due to the continuous evolution, the sparsity, and the scale-free nature of real-world graphs, GP systems face the challenge of providing an appropriate graph data structure that enables both fast analytical workloads and fast, low-memory graph mutations. Existing graph structures offer a hard tradeoff among read-only performance, update friendliness, and memory consumption upon updates. In this paper, we introduce CSR++, a new graph data structure that removes these tradeoffs and enables both fast read-only analytics, and quick and memory-friendly mutations. CSR++ combines ideas from CSR, the fastest read-only data structure, and adjacency lists (ALs) to achieve the best of both worlds. We compare CSR++ to CSR, ALs from the Boost Graph Library (BGL), and the following state-of-the-art update-friendly graph structures: LLAMA, STINGER, GraphOne, and Teseo. In our evaluation, which is based on popular GP algorithms executed over real-world graphs, we show that CSR++ remains close to CSR in read-only concurrent performance (within 10% on average) while significantly outperforming CSR (by an order of magnitude) and LLAMA (by almost 2×) with frequent updates. We also show that both CSR++’s update throughput and analytics performance exceed those of several state-of-the-art graph structures while maintaining low memory consumption when the workload includes updates.
      Citation: Data
      PubDate: 2023-11-03
      DOI: 10.3390/data8110166
      Issue No: Vol. 8, No. 11 (2023)
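      Illustrative note on the entry above: compressed sparse row (CSR), the read-only baseline named in the abstract, stores a graph as two flat arrays, one of per-vertex offsets and one of edge targets. The following minimal Python sketch shows plain CSR only. It does not reproduce CSR++, and the function names are hypothetical.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) from directed (src, dst) edges."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1  # count the out-degree of each vertex
    # A prefix sum turns per-vertex degrees into start positions.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets, targets, v):
    """Out-neighbors of v are one contiguous slice of the targets array."""
    return targets[offsets[v]:offsets[v + 1]]
```

      Reading a neighbor list is a single contiguous slice, which is why CSR is fast for analytics. Adding an edge to a vertex in the middle means shifting the targets array and updating every later offset, which is the update cost that mutable structures try to avoid.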
  • Data, Vol. 8, Pages 167: Draft Genome Sequence Data of Lysinibacillus
           sphaericus Strain 1795 with Insecticidal Properties

    • Authors: Maria N. Romanenko, Maksim A. Nesterenko, Anton E. Shikov, Anton A. Nizhnikov, Kirill S. Antonets
      First page: 167
      Abstract: Lysinibacillus sphaericus is of significant agricultural importance because it can produce insecticidal toxins and chemical moieties of varying antibacterial and fungicidal activities. In this study, the genome of the L. sphaericus strain 1795 is presented. Illumina short reads sequenced on the HiSeq X platform were assembled using the SPAdes v3.15.4 software. The genome size based on a cumulative length of 23 contigs reached 4.74 Mb, with a respective N50 of 1.34 Mb. The assembled genome carried 4672 genes, including 4643 protein-encoding ones, 5 of which represented loci coding for insecticidal toxins active against the orders Diptera, Lepidoptera, and Blattodea. We also revealed biosynthetic gene clusters responsible for the synthesis of secondary metabolites with predicted antibacterial, fungicidal, and growth-promoting properties. The genomic data provided will be helpful for deepening our understanding of genetic markers determining the efficient application of the L. sphaericus strain 1795 primarily for biocontrol purposes in veterinary and medical applications against several groups of blood-sucking insects.
      Citation: Data
      PubDate: 2023-11-03
      DOI: 10.3390/data8110167
      Issue No: Vol. 8, No. 11 (2023)
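      Illustrative note on the entry above: the N50 reported in the abstract is a standard assembly statistic, namely the length of the shortest contig among the longest contigs that together cover at least half of the total assembly length. The following minimal Python sketch uses made-up contig lengths, not the data of strain 1795, and the function name is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length  # accumulate from the longest contig downwards
        if running >= half:
            return length


# Made-up example: total length 32, so the two contigs of length 8
# already cover half of the assembly.
example_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```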
  • Data, Vol. 8, Pages 168: Applying Eye Tracking with Deep Learning
           Techniques for Early-Stage Detection of Autism Spectrum Disorders

    • Authors: Zeyad A. T. Ahmed, Eid Albalawi, Theyazn H. H. Aldhyani, Mukti E. Jadhav, Prachi Janrao, Mansour Ratib Mohammad Obeidat
      First page: 168
      Abstract: Autism spectrum disorder (ASD) poses a complex challenge to researchers and practitioners, with its multifaceted etiology and varied manifestations. Timely intervention is critical in enhancing the developmental outcomes of individuals with ASD. This paper underscores the paramount significance of early detection and diagnosis as a pivotal precursor to effective intervention. To this end, integrating advanced technological tools, specifically eye-tracking technology and deep learning algorithms, is investigated for its potential to discriminate between children with ASD and their typically developing (TD) peers. By employing these methods, the research aims to contribute to refining early detection strategies and support mechanisms. This study introduces innovative deep learning models grounded in convolutional neural network (CNN) and recurrent neural network (RNN) architectures, employing an eye-tracking dataset for training. Of note, performance outcomes have been realised, with the bidirectional long short-term memory (BiLSTM) achieving an accuracy of 96.44%, the gated recurrent unit (GRU) attaining 97.49%, the CNN-LSTM hybridising to 97.94%, and the LSTM achieving the most remarkable accuracy result of 98.33%. These outcomes underscore the efficacy of the applied methodologies and the potential of advanced computational frameworks in achieving substantial accuracy levels in ASD detection and classification.
      Citation: Data
      PubDate: 2023-11-03
      DOI: 10.3390/data8110168
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 169: Machine Learning for Credit Risk Prediction: A
           Systematic Literature Review

    • Authors: Jomark Pablo Noriega, Luis Antonio Rivera, José Alfredo Herrera
      First page: 169
      Abstract: In this systematic review of the literature on using Machine Learning (ML) for credit risk prediction, we raise the need for financial institutions to use Artificial Intelligence (AI) and ML to assess credit risk, analyzing large volumes of information. We posed research questions about algorithms, metrics, results, datasets, variables, and related limitations in predicting credit risk. In addition, we searched renowned databases responding to them and identified 52 relevant studies within the credit industry of microfinance. Challenges and approaches in credit risk prediction using ML models were identified; we had difficulties with the implemented models such as the black box model, the need for explanatory artificial intelligence, the importance of selecting relevant features, addressing multicollinearity, and the problem of the imbalance in the input data. By answering the inquiries, we identified that the Boosted Category is the most researched family of ML models; the most commonly used metrics for evaluation are Area Under Curve (AUC), Accuracy (ACC), Recall, precision measure F1 (F1), and Precision. Research mainly uses public datasets to compare models, and private ones to generate new knowledge when applied to the real world. The most significant limitation identified is the representativeness of reality, and the variables primarily used in the microcredit industry are data related to the Demographic, Operation, and Payment behavior. This study aims to guide developers of credit risk management tools and software towards the existing ability of ML methods, metrics, and techniques used to forecast it, thereby minimizing possible losses due to default and guiding risk appetite.
      Citation: Data
      PubDate: 2023-11-07
      DOI: 10.3390/data8110169
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 170: Introducing DeReKoGram: A Novel Frequency Dataset
           with Lemma and Part-of-Speech Information for German

    • Authors: Sascha Wolfer, Alexander Koplenig, Marc Kupietz, Carolin Müller-Spitzer
      First page: 170
      Abstract: We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
      Citation: Data
      PubDate: 2023-11-10
      DOI: 10.3390/data8110170
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 171: ChatGPT across Arabic Twitter: A Study of Topics,
           Sentiments, and Sarcasm

    • Authors: Shahad Al-Khalifa, Fatima Alhumaidhi, Hind Alotaibi, Hend S. Al-Khalifa
      First page: 171
      Abstract: While ChatGPT has gained global significance and widespread adoption, its exploration within specific cultural contexts, particularly within the Arab world, remains relatively limited. This study investigates the discussions among early Arab users in Arabic tweets related to ChatGPT, focusing on topics, sentiments, and the presence of sarcasm. Data analysis and topic-modeling techniques were employed to examine 34,760 Arabic tweets collected using specific keywords. This study revealed a strong interest within the Arabic-speaking community in ChatGPT technology, with prevalent discussions spanning various topics, including controversies, regional relevance, fake content, and sector-specific dialogues. Despite the enthusiasm, concerns regarding ethical risks and negative implications of ChatGPT’s emergence were highlighted, indicating apprehension toward advanced artificial intelligence (AI) technology in language generation. Region-specific discussions underscored the diverse adoption of AI applications and ChatGPT technology. Sentiment analysis of the tweets demonstrated a predominantly neutral sentiment distribution (92.8%), suggesting a focus on objectivity and factuality over emotional expression. The prevalence of neutral sentiments indicated a preference for evidence-based reasoning and logical arguments, fostering constructive discussions influenced by cultural norms. Sarcasm was found in 4% of the tweets, distributed across various topics but not dominating the conversation. This study’s implications include the need for AI developers to address ethical concerns and the importance of educating users about the technology’s ethical considerations and risks. Policymakers should consider the regional relevance and potential scams, emphasizing the necessity for ethical guidelines and regulations.
      Citation: Data
      PubDate: 2023-11-14
      DOI: 10.3390/data8110171
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 172: Testate Amoebae (Amphitremida, Arcellinida,
           Euglyphida) in Sphagnum Bogs: The Dataset from Eastern Fennoscandia

    • Authors: Aleksandr Ivanovskii, Kirill Babeshko, Viktor Chernyshov, Anton Esaulov, Aleksandr Komarov, Elena Malysheva, Natalia Mazei, Diana Meskhadze, Damir Saldaev, Andrey N. Tsyganov, Yuri Mazei
      First page: 172
      Abstract: The paper describes a dataset comprising 236 surface moss samples and 143 testate amoeba taxa. The samples were collected in 11 Sphagnum-dominated bogs during frost-free seasons of 2004, 2007, 2009, 2017, and 2022. For the whole dataset, the sampling effort was sufficient in terms of observed species richness (143 species in total), though a regional species pool is deemed to be discovered incompletely (143 species is its lower 95% confidence limit using Chao’s estimator). The local community composition demonstrated high heterogeneity in a reduced ordination space. It supports the opinion that the high versatility of bog ecosystems should be taken into account during ecological studies.
      Citation: Data
      PubDate: 2023-11-15
      DOI: 10.3390/data8110172
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 173: Biodiversity of Terrestrial Testate Amoebae in
           Western Siberia Lowland Peatlands

    • Authors: Damir Saldaev, Kirill Babeshko, Viktor Chernyshov, Anton Esaulov, Xiuyuan Gu, Nikita Kriuchkov, Natalia Mazei, Nailia Saldaeva, Jiahui Su, Andrey Tsyganov, Basil Yakimov, Svetlana Yushkovets, Yuri Mazei
      First page: 173
      Abstract: Testate amoebae are unicellular eukaryotic organisms covered with an external skeleton called a shell. They are an important component of many terrestrial ecosystems, especially peatlands, where they can be preserved in peat deposits and used as a proxy of surface wetness in paleoecological reconstructions. Here, we present a database from a vast but poorly studied region of the Western Siberia Lowland containing information on testate amoeba (TA) occurrences in relation to substrate moisture and water table depth (WTD). The dataset includes 88 species from 32 genera, with 2181 incidences and 21,562 counted individuals. All samples were collected in oligotrophic peatlands and prepared using the method of wet sieving with a subsequent sedimentation of aqueous suspensions. This database contributes to the understanding of the distribution of testate amoebae and can be further used in large-scale investigations.
      Citation: Data
      PubDate: 2023-11-17
      DOI: 10.3390/data8110173
      Issue No: Vol. 8, No. 11 (2023)
  • Data, Vol. 8, Pages 145: Attention-Based Human Age Estimation from Face
           Images to Enhance Public Security

    • Authors: Md. Ashiqur Rahman, Shuhena Salam Aonty, Kaushik Deb, Iqbal H. Sarker
      First page: 145
      Abstract: Age estimation from facial images has gained significant attention due to its practical applications such as public security. However, one of the major challenges faced in this field is the limited availability of comprehensive training data. Moreover, due to the gradual nature of aging, similar-aged faces tend to share similarities despite their race, gender, or location. Recent studies on age estimation utilize convolutional neural networks (CNN), treating every facial region equally and disregarding potentially informative patches that contain age-specific details. Therefore, an attention module can be used to focus extra attention on important patches in the image. In this study, tests are conducted on different attention modules, namely CBAM, SENet, and Self-attention, implemented with a convolutional neural network. The focus is on developing a lightweight model that requires a low number of parameters. A merged dataset and other cutting-edge datasets are used to test the proposed model’s performance. In addition, transfer learning is used alongside the scratch CNN model to achieve optimal performance more efficiently. Experimental results on different aging face databases show the remarkable advantages of the proposed attention-based CNN model over the conventional CNN model by attaining the lowest mean absolute error and the lowest number of parameters with a better cumulative score.
      Citation: Data
      PubDate: 2023-09-25
      DOI: 10.3390/data8100145
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 146: Synthetic Data Generation for Data Envelopment

    • Authors: Andrey V. Lychev
      First page: 146
      Abstract: The paper is devoted to the problem of generating artificial datasets for data envelopment analysis (DEA), which can be used for testing DEA models and methods. In particular, the papers that applied DEA to big data often used synthetic data generation to obtain large-scale datasets because real datasets of large size, available in the public domain, are extremely rare. This paper proposes the algorithm which takes as input some real dataset and complements it by artificial efficient and inefficient units. The generation process extends the efficient part of the frontier by inserting artificial efficient units, keeping the original efficient frontier unchanged. For this purpose, the algorithm uses the assurance region method and consistently relaxes weight restrictions during the iterations. This approach produces synthetic datasets that are closer to real ones, compared to other algorithms that generate data from scratch. The proposed algorithm is applied to a pair of small real-life datasets. As a result, the datasets were expanded to 50K units. Computational experiments show that artificially generated DMUs preserve isotonicity and do not increase the collinearity of the original data as a whole.
      Citation: Data
      PubDate: 2023-09-27
      DOI: 10.3390/data8100146
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 147: A Retinal Oct-Angiography and Cardiovascular
           STAtus (RASTA) Dataset of Swept-Source Microvascular Imaging for
           Cardiovascular Risk Assessment

    • Authors: Germanèse, Meriaudeau, Eid, Tadayoni, Ginhac, Anwer, Laure-Anne, Guenancia, Creuzot-Garcher, Gabrielle, Arnould
      First page: 147
      Abstract: In the context of exponential demographic growth, the imbalance between human resources and public health problems impels us to envision other solutions to the difficulties faced in the diagnosis, prevention, and large-scale management of the most common diseases. Cardiovascular diseases represent the leading cause of morbidity and mortality worldwide. A large-scale screening program would make it possible to promptly identify patients with high cardiovascular risk in order to manage them adequately. Optical coherence tomography angiography (OCT-A), as a window into the state of the cardiovascular system, is a rapid, reliable, and reproducible imaging examination that enables the prompt identification of at-risk patients through the use of automated classification models. One challenge that limits the development of computer-aided diagnostic programs is the small number of open-source OCT-A acquisitions available. To facilitate the development of such models, we have assembled a set of images of the retinal microvascular system from 499 patients. It consists of 814 angiocubes as well as 2005 en face images. Angiocubes were captured with a swept-source OCT-A device of patients with varying overall cardiovascular risk. To the best of our knowledge, our dataset, Retinal oct-Angiography and cardiovascular STAtus (RASTA), is the only publicly available dataset comprising such a variety of images from healthy and at-risk patients. This dataset will enable the development of generalizable models for screening cardiovascular diseases from OCT-A retinal images.
      Citation: Data
      PubDate: 2023-09-28
      DOI: 10.3390/data8100147
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 148: Towards Data Storage, Scalability, and
           Availability in Blockchain Systems: A Bibliometric Analysis

    • Authors: Meenakshi Kandpal, Veena Goswami, Rojalina Priyadarshini, Rabindra Kumar Barik
      First page: 148
      Abstract: In recent years, blockchain research has drawn attention from all across the world. It is a decentralized competence that is spread out and uncertain. Several nations and scholars have already successfully applied blockchain in numerous arenas. Blockchain is essential in delicate situations because it secures data and keeps it from being altered or forged. In addition, the market’s increased demand for data is driving demand for data scaling across all industries. Researchers from many nations have used blockchain in various sectors over time, thus bringing extreme focus to this newly escalating blockchain domain. Every research project begins with in-depth knowledge about the working domain, and new interest information about blockchain is quite scattered. This study analyzes academic literature on blockchain technology, emphasizing three key aspects: blockchain storage, scalability, and availability. These are critical areas within the broader field of blockchain technology. This study employs CiteSpace and VOSviewer to understand the current state of research in these areas comprehensively. These are bibliometric analysis tools commonly used in academic research to examine patterns and relationships within scientific literature. Thus, to visualize a way to store data with scalability and availability while keeping the security of the blockchain in sync, the required research has been performed on the storage, scalability, and availability of data in the blockchain environment. The ultimate goal is to contribute to developing secure and efficient data storage solutions within blockchain technology.
      Citation: Data
      PubDate: 2023-10-02
      DOI: 10.3390/data8100148
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 149: Fast Radius Outlier Filter Variant for Large
           Point Clouds

    • Authors: Péter Szutor, Marianna Zichar
      First page: 149
      Abstract: Currently, several devices (such as laser scanners, Kinect, time of flight cameras, medical imaging equipment (CT, MRI, intraoral scanners)), and technologies (e.g., photogrammetry) are capable of generating 3D point clouds. Each point cloud type has its unique structure or characteristics, but they have a common point: they may be loaded with errors. Before further data processing, these unwanted portions of the data must be removed with filtering and outlier detection. There are several algorithms for detecting outliers, but their performances decrease when the size of the point cloud increases. The industry has a high demand for efficient algorithms to deal with large point clouds.
      Citation: Data
      PubDate: 2023-10-02
      DOI: 10.3390/data8100149
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 150: Power-Flow Simulations for Integrating Renewable
           Distributed Generation from Biogas, Photovoltaic, and Small Wind Sources
           on an Underground Distribution Feeder

    • Authors: Welson Bassi, Igor Cordeiro, Ildo Luis Sauer
      First page: 150
      Abstract: The rapid expansion of distributed generation leads to the integration of an increasing number of energy generation sources. However, integrating these sources into electrical distribution networks presents specific challenges to ensure that the distribution networks can effectively accommodate the associated distributed energy and power. Thus, it is crucial to evaluate the electrical effects of power along the conductors, components, and loads. Power-flow analysis is a well-established numerical methodology for assessing parameters and quantities within power systems during steady-state operation. The University of São Paulo’s Cidade Universitária “Armando de Salles Oliveira” (CUASO) campus in São Paulo, Brazil, features an underground power distribution system. The Institute of Energy and Environment (IEE) leads the integration of several distributed generation (DG) sources, including a biogas plant, photovoltaic installations, and a small wind turbine, into one of the CUASO’s feeders, referred to as “USP-105”. Load-flow simulations were conducted using the PowerWorld™ Simulator v.23, considering the interconnection of these sources. This dataset provides comprehensive information and computational files utilized in the simulations. It serves as a valuable resource for reanalysis, didactic purposes, and the dissemination of technical insights related to DG implementation.
      Citation: Data
      PubDate: 2023-10-07
      DOI: 10.3390/data8100150
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 151: Tracking a Decade of Hydrogeological Emergencies
           in Italian Municipalities

    • Authors: Alessio Gatto, Stefano Clò, Federico Martellozzo, Samuele Segoni
      First page: 151
      Abstract: This dataset collects tabular and geographical information about all hydrogeological disasters (landslides and floods) that occurred in Italy from 2013 to 2022 that caused such severe impacts as to require the declaration of national-level emergencies. The severity and spatiotemporal extension of each emergency are characterized in terms of duration and timing, funds requested by local administrations, funds approved by the national government, and municipalities and provinces hit by the event (further subdivided between those included in the emergency and those not, depending on whether relevant impacts were ascertained). Italian exposure to hydrogeological risk is portrayed strikingly: in the covered period, 123 emergencies affected Italy, all regions were struck at least once, and some provinces were struck more than 10 times. Damage declared by local institutions adds up to EUR 11,000,000,000, while national recovery funds add up to EUR 1,000,000,000. The dataset may foster further research on risk assessment, econometric analysis, public policy support, and decision-making implementation. Moreover, it provides systematic evidence helpful in raising awareness about hydrogeological risks affecting Italy.
      Citation: Data
      PubDate: 2023-10-11
      DOI: 10.3390/data8100151
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 152: Dataset of Contamination (2009–2022) Legacy
           Contaminants (PCB and DDT) in Zooplankton of Lake Maggiore (CIPAIS,
           International Commission for the Protection of Italian-Swiss Waters)

    • Authors: Roberta Bettinetti, Roberta Piscia, Marina Manca, Silvana Galassi, Silvia Quadroni, Carlo Dossi, Rossella Perna, Emanuela Boggio, Ginevra Boldrocchi, Michela Mazzoni, Benedetta Villa
      First page: 152
      Abstract: In this paper, we describe a 13-year (2009–2022) dataset of legacy POP concentrations (DDTtot and sumPCB14 from 2016 isomers and congeners concentrations are also reported) in the planktonic crustaceans of Lake Maggiore (≥450 µm size fraction). The data were collected in the framework of a monitoring program finalized to assess the presence of pollutants in the lake biota, including zooplankton organisms directly preyed by fish. The data report both concentration of DDTtot and sumPCB14 in the zooplankton and the standing stock density and biomass of the population in each season. The dataset allows for detecting changes in the concentration over the long term and within a year, thus providing evidence for the seasonal and the plurennial variations in the presence of these pollutants in the lake. They also provide a basis for further studies aimed at modeling paths and the fate of persistent organic pollutants, for which the amount of toxicants stocked in the zooplankton compartment linked to fish is a crucial estimate.
      Citation: Data
      PubDate: 2023-10-12
      DOI: 10.3390/data8100152
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 153: USC-DCT: A Collection of Diverse Classification

    • Authors: Adam M. Jones, Gozde Sahin, Zachary W. Murdock, Yunhao Ge, Ao Xu, Yuecheng Li, Di Wu, Shuo Ni, Po-Hsuan Huang, Kiran Lekkala, Laurent Itti
      First page: 153
      Abstract: Machine learning is a crucial tool for both academic and real-world applications. Classification problems are often used as the preferred showcase in this space, which has led to a wide variety of datasets being collected and utilized for a myriad of applications. Unfortunately, there is very little standardization in how these datasets are collected, processed, and disseminated. As new learning paradigms like lifelong or meta-learning become more popular, the demand for merging tasks for at-scale evaluation of algorithms has also increased. This paper provides a methodology for processing and cleaning datasets that can be applied to existing or new classification tasks as well as implements these practices in a collection of diverse classification tasks called USC-DCT. Constructed using 107 classification tasks collected from the internet, this collection provides a transparent and standardized pipeline that can be useful for many different applications and frameworks. While there are currently 107 tasks, USC-DCT is designed to enable future growth. Additional discussion provides explanations of applications in machine learning paradigms such as transfer, lifelong, or meta-learning, how revisions to the collection will be handled, and further tips for curating and using classification tasks at this scale.
      Citation: Data
      PubDate: 2023-10-12
      DOI: 10.3390/data8100153
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 154: A Dataset of Non-Indigenous and Native Fish of
           the Volga and Kama Rivers (European Russia)

    • Authors: Dmitry P. Karabanov, Dmitry D. Pavlov, Yury Y. Dgebuadze, Mikhail I. Bazarov, Elena A. Borovikova, Yuriy V. Gerasimov, Yulia V. Kodukhova, Pavel B. Mikheev, Eduard V. Nikitin, Tatyana L. Opaleva, Yuri A. Severov, Rimma Z. Sabitova, Alexey K. Smirnov, Yury I. Solomatin, Igor A. Stolbunov, Alexander I. Tsvetkov, Stanislav A. Vlasenko, Irina S. Voroshilova, Wenjun Zhong, Xiaowei Zhang, Alexey A. Kotov
      First page: 154
      Abstract: Fish in the Volga-Kama River System (the largest river system in Europe) are important as a crucial food source for local populations; fish have the highest trophic level among hydrobionts. The purpose of this research is to describe the diversity of non-indigenous and native fish in the Volga and Kama Rivers, in the European part of Russia. This dataset encompasses data from June 2001 to September 2021 and comprises 1888 records (36,376 individual observations) for littoral and pelagic habitats from 143 sampling sites, representing 52 species from 42 genera in 22 families. The dataset has a Darwin Core standard format and has been fully released in the Global Biodiversity Information Facility (GBIF) under a CC-BY 4.0 International license. The data are validated with several international databases such as FishBase, Eschmeyer’s Catalog of Fishes, the Barcode of Life Data System, and the SAS.Planet geoinformation system. Newly established populations have been found for several species belonging to the following Actinopteri families: Alosidae, Anguillidae, Cichlidae, Ehiravidae, Gobiidae, Odontobutidae, Syngnathidae, and Xenocyprididae. Therefore, this dataset can be used in species distribution analyses for particular taxa, which are especially important for non-indigenous species.
      Citation: Data
      PubDate: 2023-10-18
      DOI: 10.3390/data8100154
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 155: A Data-Driven Exploration of a New Islamic Fatwas
           Dataset for Arabic NLP Tasks

    • Authors: Ohoud Alyemny, Hend Al-Khalifa, Abdulrahman Mirza
      First page: 155
      Abstract: Islamic content is a broad and diverse domain that encompasses various sources, topics, and perspectives. However, there is a lack of comprehensive and reliable datasets that can facilitate conducting studies on Islamic content. In this paper, we present fatwaset, the first public Arabic dataset of Islamic fatwas. It contains Islamic fatwas that we collected from various trusted and authenticated sources in the Islamic fatwa domain, such as agencies, religious scholars, and websites. Fatwaset is a rich resource as it does not only contain fatwas but also includes a considerable set of their surrounding metadata. It can be used for many natural language processing (NLP) tasks, such as language modeling, question answering, author attribution, topic identification, text classification, and text summarization. It can also support other domains that are related to Islamic culture, such as philosophy and language art. We describe the methodology and criteria we used to select the content, as well as the challenges and limitations we faced. Additionally, we perform an Exploratory Data Analysis (EDA), which investigates the dataset from different perspectives. The results of the EDA reveal important information that greatly benefits researchers in this area.
      Citation: Data
      PubDate: 2023-10-19
      DOI: 10.3390/data8100155
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 156: Cybersecurity Risk Assessments within Critical
           Infrastructure Social Networks

    • Authors: Alimbubi Aktayeva, Yerkhan Makatov, Akku Kubigenova Tulegenovna, Aibek Dautov, Rozamgul Niyazova, Maxud Zhamankarin, Sergey Khan
      First page: 156
      Abstract: Cybersecurity social networking is a new scientific and engineering discipline that was interdisciplinary in its early days, but is now transdisciplinary. The issues of reviewing and analyzing of principal tasks related to information collection, monitoring of social networks, assessment methods, and preventing and combating cybersecurity threats are, therefore, essential and pending. There is a need to design certain methods, models, and program complexes aimed at estimating risks related to the cyberspace of social networks and the support of their activities. This study considers a risk to be the combination of consequences of a given event (or incident) with a probable occurrence (likelihood of occurrence) involved, while risk assessment is a general issue of identification, estimation, and evaluation of risk. The findings of the study made it possible to elucidate that the technique of cognitive modeling for risk assessment is part of a comprehensive cybersecurity approach included in the requirements of basic IT standards, including IT security risk management. The study presents a comprehensive approach in the field of cybersecurity in social networks that allows for consideration of all the elements that constitute cybersecurity as a complex, interconnected system. The ultimate goal of this approach to cybersecurity is the organization of an uninterrupted scheme of protection against any impacts related to physical, hardware, software, network, and human objects or resources of the critical infrastructure of social networks, as well as the integration of various levels and means of protection.
      Citation: Data
      PubDate: 2023-10-19
      DOI: 10.3390/data8100156
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 157: Industrial Environment Multi-Sensor Dataset for
           Vehicle Indoor Tracking with Wi-Fi, Inertial and Odometry Data

    • Authors: Ivo Silva, Cristiano Pendão, Joaquín Torres-Sospedra, Adriano Moreira
      First page: 157
      Abstract: This paper describes a dataset collected in an industrial setting using a mobile unit resembling an industrial vehicle equipped with several sensors. Wi-Fi interfaces collect signals from available Access Points (APs), while motion sensors collect data regarding the mobile unit’s movement (orientation and displacement). The distinctive features of this dataset include synchronous data collection from multiple sensors, such as Wi-Fi data acquired from multiple interfaces (including a radio map), orientation provided by two low-cost Inertial Measurement Unit (IMU) sensors, and displacement (travelled distance) measured by an absolute encoder attached to the mobile unit’s wheel. Accurate ground-truth information was determined using a computer vision approach that recorded timestamps as the mobile unit passed through reference locations. We assessed the quality of the proposed dataset by applying baseline methods for dead reckoning and Wi-Fi fingerprinting. The average positioning error for simple dead reckoning, without using any other absolute positioning technique, is 8.25 m and 11.66 m for IMU1 and IMU2, respectively. The average positioning error for simple Wi-Fi fingerprinting is 2.19 m when combining the RSSI information from five Wi-Fi interfaces. This dataset contributes to the fields of Industry 4.0 and mobile sensing, providing researchers with a resource to develop, test, and evaluate indoor tracking solutions for industrial vehicles.
      Citation: Data
      PubDate: 2023-10-23
      DOI: 10.3390/data8100157
      Issue No: Vol. 8, No. 10 (2023)
  • Data, Vol. 8, Pages 158: Panel Regression Modelling for COVID-19
           Infections and Deaths in Tamil Nadu, India

    • Authors: Rajarathinam Arunachalam
      First page: 158
      Abstract: The impacts of the coronavirus disease 2019 (COVID-19) pandemic have been extremely severe, with both economic and health crises experienced worldwide. Based on the panel regression model, this study examined the trends and correlations in the number of COVID-19-related deaths and the number of COVID-19-infected cases in all 37 regions of the Tamil Nadu state in India, in August 2020. The fixed effects model had the greatest R² value of 78% and exhibited significant results. The slope coefficient was also highly significant, showing a considerable variation in the relationship between new COVID-19 cases and deaths. Additionally, for every unit increase in COVID-19-infected cases, the death rate increased by 0.02%.
      Citation: Data
      PubDate: 2023-10-23
      DOI: 10.3390/data8100158
      Issue No: Vol. 8, No. 10 (2023)