Authors:Indranil Ghosh; Saralees Nadarajah Abstract: In this paper, we provide some new results for the Weibull-R family of distributions (Alzaghal et al. in Int J Stat Probab 5:139–149, 2016). We derive new structural properties of the family and provide various characterizations via conditional moments, functions of order statistics and record values. PubDate: 2018-01-13 DOI: 10.1007/s40745-018-0142-7

Authors:André G. C. Pacheco; Renato A. Krohling Abstract: In classification problems, when multiple algorithms are applied to different benchmarks, a difficult issue arises: how can we rank the algorithms? In machine learning, it is common to run the algorithms several times and then compute a statistic in terms of means and standard deviations. To compare the performance of the algorithms, it is very common to employ statistical tests. However, these tests may also present limitations, since they consider only the means and not the standard deviations of the obtained results. In this paper, we present the so-called A-TOPSIS, based on the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), to solve the problem of ranking and comparing classification algorithms in terms of means and standard deviations. We use two case studies to illustrate A-TOPSIS for ranking classification algorithms, and the results show its suitability for this task. The presented approach can be applied to compare the performance of stochastic algorithms in machine learning. Lastly, to encourage researchers to use A-TOPSIS for ranking algorithms, we also present an easy-to-use A-TOPSIS web framework. PubDate: 2018-01-13 DOI: 10.1007/s40745-018-0136-5
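A-TOPSIS builds on the classic TOPSIS procedure. A minimal sketch of the underlying TOPSIS computation, treating each algorithm's mean accuracy as a benefit criterion and its standard deviation as a cost criterion (the weights and performance figures are illustrative, not taken from the paper):

```python
import math

def topsis(matrix, weights, benefit):
    """Rank alternatives with classic TOPSIS.
    matrix: rows = alternatives, cols = criteria.
    benefit[j] is True if criterion j is to be maximized."""
    m, n = len(matrix), len(matrix[0])
    # Vector-normalize each column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    v = [[weights[j] * matrix[i][j] / norms[j] for j in range(n)] for i in range(m)]
    # Ideal and anti-ideal points per criterion.
    best = [max(col) if benefit[j] else min(col) for j, col in enumerate(zip(*v))]
    worst = [min(col) if benefit[j] else max(col) for j, col in enumerate(zip(*v))]
    # Relative closeness to the ideal solution.
    scores = []
    for row in v:
        d_best = math.sqrt(sum((x - b) ** 2 for x, b in zip(row, best)))
        d_worst = math.sqrt(sum((x - w) ** 2 for x, w in zip(row, worst)))
        scores.append(d_worst / (d_best + d_worst))
    return scores

# Accuracy mean (maximize) and standard deviation (minimize) per algorithm.
perf = [[0.91, 0.05], [0.89, 0.01], [0.93, 0.10]]
scores = topsis(perf, weights=[0.5, 0.5], benefit=[True, False])
ranking = sorted(range(len(perf)), key=lambda i: -scores[i])
```

With equal weights, the low-variance algorithm ranks first even though its mean is not the highest, which is exactly the effect of taking standard deviations into account.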

Authors:A. Iduseri; J. E. Osemwenkhae Abstract: The focus of predictive discriminant analysis is to improve classification accuracy, yet obtaining a statistically optimal classification accuracy, or hit rate, remains a challenge due to the inherent variability of most real-life datasets. Improving classification accuracy is usually achieved with the best subset of relevant predictors obtained by using classical variable selection methods. The goal of variable selection methods is to choose the best subset (or training sample) of relevant variables that reduces the complexity of a model and makes it easier to interpret, improves the classification accuracy of the model and reduces the training time. However, a statistically optimal hit rate can be achieved if the training sample meets a near-optimal condition by resolving any significant differences in the variances of the groups formed by the dependent variable. This paper proposes a new approach for obtaining a near-optimal training sample that will produce a statistically optimal hit rate, using a modified winsorization with a graphical diagnostic. In application to real-life data sets, the proposed approach was able to identify and remove legitimate contaminants in one or more predictors in the training sample, thereby resolving any significant differences in the variances of the groups formed by the dependent variable. In addition, the graphical diagnostic associated with the new approach provides a useful visual tool that serves as an alternative graphical test for homogeneity of variances. PubDate: 2018-01-12 DOI: 10.1007/s40745-018-0140-9
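The paper's modified winsorization is not specified in the abstract; a minimal sketch of plain percentile winsorization, which likewise clamps extreme training-sample values and thereby reduces a group's variance (the percentile limits and data below are illustrative):

```python
import statistics

def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp extreme observations to the given percentile limits
    (plain winsorization; the paper's modification goes further)."""
    s = sorted(values)
    n = len(s)
    lo = s[int(lower_pct * (n - 1))]
    hi = s[int(upper_pct * (n - 1))]
    return [min(max(x, lo), hi) for x in values]

# One extreme but legitimate value inflates the group variance.
group = [5.1, 5.3, 4.9, 5.0, 5.2, 12.0]
cleaned = winsorize(group)
var_before = statistics.pvariance(group)
var_after = statistics.pvariance(cleaned)
```

Shrinking within-group variance this way is what moves the groups toward the homogeneity-of-variance condition the paper targets.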

Authors:Lu Wei; Jianping Li; Xiaoqian Zhu Abstract: This paper is the first to provide a comprehensive overview of the worldwide operational loss data collection exercises (LDCEs) covering internal loss data, external loss data, scenario analysis and business environment and internal control factors (BEICFs). Based on an analysis of operational risk-related articles from 2002 to March 2017 and a survey of a large amount of other information, the various sources of operational risk data are classified into five types: individual banks, regulatory authorities, consortia of financial institutions, commercial vendors and researchers. By reviewing operational risk databases from these five data sources, we summarize and describe 32 internal databases, 26 external databases, 7 scenario databases and 1 BEICFs database. We also find that, compared with developing countries, developed countries have performed relatively better in operational risk LDCEs. In addition, the two subjective data elements, scenario analysis and BEICFs, are less used in operational risk estimation than the two objective data elements, internal and external loss data. PubDate: 2018-01-12 DOI: 10.1007/s40745-018-0139-2

Authors:Hongxia Zhang; Liu Liu; Jin Yue; Xin Lai Abstract: Because the parameters of cardiopulmonary function are diverse and change slowly in pathology, we apply the multivariate exponentially weighted moving average (MEWMA) control chart to monitor the state of the lungs. This paper considers five indicators of cardiopulmonary function, uses a principal component test to diagnose whether they follow a multivariate normal distribution, clarifies the relationship between the control limit and the weight coefficient of the MEWMA control chart, and draws the control chart for monitoring. The process stays in control for the first 103 observations; however, the statistic exceeds the control limit from the 104th observation and raises an alarm. This means there is a problem with cardiopulmonary function starting on the 103rd sample. The control chart has a good warning function because it can raise the alarm before cardiopulmonary function develops a serious problem. Using the MEWMA control chart for monitoring can reduce the cost and frequency of medical examinations, improve the utilization rate of hospital resources and help confirm the case, so that the best treatment time is not missed. PubDate: 2018-01-11 DOI: 10.1007/s40745-018-0137-4
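A minimal sketch of the MEWMA statistic being monitored, for a two-dimensional process with known in-control mean and covariance (the smoothing constant and control limit below are illustrative; in practice the limit is chosen from ARL tables, and the paper monitors five indicators rather than two):

```python
LAM = 0.2   # smoothing constant lambda
H = 10.0    # control limit h (illustrative; taken from ARL tables in practice)

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mewma(observations, mean, cov):
    """Yield the MEWMA T^2 statistic for each 2-D observation,
    using the asymptotic covariance (lambda / (2 - lambda)) * cov."""
    scaled = [[cov[i][j] * LAM / (2 - LAM) for j in range(2)] for i in range(2)]
    sinv = inv2(scaled)
    z = [0.0, 0.0]
    for x in observations:
        dev = [x[k] - mean[k] for k in range(2)]
        z = [LAM * dev[k] + (1 - LAM) * z[k] for k in range(2)]
        # Quadratic form z' * Sigma_z^{-1} * z.
        yield sum(z[i] * sinv[i][j] * z[j] for i in range(2) for j in range(2))

# Ten in-control observations followed by a sustained mean shift.
obs = [[0.1, -0.1]] * 10 + [[3.0, 3.0]] * 5
t2s = list(mewma(obs, mean=[0.0, 0.0], cov=[[1.0, 0.0], [0.0, 1.0]]))
```

The statistic stays small for the in-control observations and crosses the limit shortly after the shift, which is the alarm behavior the abstract describes.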

Authors:Yong Shi; Zhiguang Shan; Jianping Li; Yufei Fang Pages: 433 - 440 Abstract: On September 5, 2015, the State Council of the Chinese Government, China's cabinet, formally announced its Action Framework for Promoting Big Data (www.gov.cn, 2015). This is a milestone for China in catching up with the global wave of big data. Since 2012, big data has been a hot issue for scientific communities as well as the governments of many countries (Lazer et al. in Science 343:1203–1205, 2014; Einav et al. in Science 345:715, 2014; Cate in Science 346:818, 2014; Khoury and Ioannidis in Science 346:1054–1055, 2014). At the 2013 G8 Summit, the leaders of Canada, France, Germany, Italy, Japan, Russia, the U.S.A. and the United Kingdom agreed on an "open government plan" (www.gov.uk/government/publications/open-data-charter/g8-open-data-charter-and-technical-annex, 2013). China's framework, however, mainly emphasizes the integration of all trans-departmental data and establishes a number of government-driven national big data platforms so as to provide big data services to research, the public and enterprises. The framework not only demonstrates a strong commitment by the Chinese government to big data, but also covers a far wider range of governmental branches, enterprises and institutions than those of other countries. In addition, the framework shows an interpretation of big data that differs from other countries'. If its objective is achieved, China will become a strong "big data country". PubDate: 2017-12-01 DOI: 10.1007/s40745-017-0129-9 Issue No:Vol. 4, No. 4 (2017)

Authors:Sanku Dey; Chunfang Zhang; A. Asgharzadeh; M. Ghorbannezhad Pages: 441 - 455 Abstract: The extended exponential distribution due to Nadarajah and Haghighi (Stat J Theor Appl Stat 45(6):543–558, 2011) is an alternative to the gamma, Weibull and generalized exponential distributions, and always provides better fits than they do whenever the data contain zero values. This article addresses different methods of estimation of the unknown parameters of the Nadarajah and Haghighi (NH, for short) distribution from both frequentist and Bayesian viewpoints. We briefly describe different frequentist approaches, namely maximum likelihood estimators, moment estimators, percentile estimators, and least squares and weighted least squares estimators, and compare them using extensive numerical simulations. Next, we consider Bayes estimation under different types of loss functions (symmetric and asymmetric) using gamma priors for both the shape and scale parameters. Besides the asymptotic confidence intervals, two parametric bootstrap confidence intervals using frequentist approaches are provided for comparison with the Bayes credible intervals. Furthermore, the Bayes estimators and their respective posterior risks are computed and compared using a Markov chain Monte Carlo algorithm. Finally, two real data sets are analyzed for illustrative purposes. PubDate: 2017-12-01 DOI: 10.1007/s40745-017-0114-3 Issue No:Vol. 4, No. 4 (2017)
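A sketch of likelihood-based estimation for the NH distribution, assuming the density f(x) = a*l*(1 + l*x)^(a-1) * exp{1 - (1 + l*x)^a} from the cited 2011 paper; a crude grid search stands in for a proper numerical optimizer, and the data are illustrative (note the zero value, which the NH density accommodates):

```python
import math

def nh_loglik(alpha, lam, data):
    """Log-likelihood of the Nadarajah-Haghighi distribution with
    density f(x) = a*l*(1+l*x)**(a-1) * exp(1 - (1+l*x)**a), x >= 0."""
    ll = len(data) * math.log(alpha * lam)
    for x in data:
        u = 1.0 + lam * x
        ll += (alpha - 1.0) * math.log(u) + 1.0 - u ** alpha
    return ll

# Crude grid search over (alpha, lambda) standing in for a real optimizer.
data = [0.0, 0.2, 0.5, 0.9, 1.4, 2.1, 3.0]
grid = [(a / 10, l / 10) for a in range(2, 31) for l in range(2, 31)]
a_hat, l_hat = max(grid, key=lambda p: nh_loglik(p[0], p[1], data))
```

In practice the maximum likelihood estimates are found with a numerical optimizer rather than a grid, but the objective function is the same.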

Authors:Ramazan S. Aygun Pages: 503 - 531 Abstract: In this paper, we evaluate maximum subarrays for approximate string matching and alignment. The global alignment score as well as local sub-alignments are indicators of good alignment. After showing how maximum subarrays can be used for string matching, we provide several ways of using them: long, short, loose, strict, and top-k. While the long version extends the local sub-alignments, the short method avoids extensions that would not increase the alignment score. The loose method tries to achieve a high global score, whereas the strict method converts the output of loose alignment by minimizing unnecessary gaps. The top-k method finds the top-k sub-alignments. The results are compared with two global and local dynamic programming methods that use gap penalties, in addition to one of the state-of-the-art methods. In our experiments, using maximum subarrays generated good overall as well as local sub-alignments without requiring gap penalties. PubDate: 2017-12-01 DOI: 10.1007/s40745-017-0117-0 Issue No:Vol. 4, No. 4 (2017)
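The long, short, loose and strict variants all build on the maximum-subarray computation; a minimal sketch using Kadane's algorithm on a +1/−1 match score sequence (the scoring scheme and strings are illustrative):

```python
def max_subarray(scores):
    """Kadane's algorithm: best-scoring contiguous run and its bounds."""
    best, best_lo, best_hi = scores[0], 0, 0
    cur, lo = scores[0], 0
    for i in range(1, len(scores)):
        if cur < 0:
            cur, lo = scores[i], i   # restart the run at position i
        else:
            cur += scores[i]         # extend the current run
        if cur > best:
            best, best_lo, best_hi = cur, lo, i
    return best, best_lo, best_hi

# Score +1 for a match and -1 for a mismatch between two strings.
a, b = "ACCGTTAC", "ACTGTTGC"
scores = [1 if x == y else -1 for x, y in zip(a, b)]
best_score, lo, hi = max_subarray(scores)
```

The returned span is the best local sub-alignment under this scoring; the long/short variants differ in how aggressively such spans are extended.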

Authors:Suresh Dara; Haider Banka; Chandra Sekhara Rao Annavarapu Pages: 341 - 360 Abstract: Feature selection in high-dimensional data, particularly in gene expression data, is one of the challenging tasks in bioinformatics due to the curse of dimensionality, data redundancy and noise. In gene expression data, insignificant features cause poor classification, so feature selection reduces the feature subset and improves classification accuracy. Existing feature selection algorithms for gene expression data (such as filter-based, wrapper-based and hybrid methods) achieve poor accuracy, while a few methods take too much time to converge to an acceptable result; NSGA-II, for example, needs over 10,000 generations on average to converge in the search space, which incurs increased computational time. We propose a rough-set-based hybrid binary PSO algorithm, which uses a heuristic fast-processing strategy to reduce the crude domain features by statistically eliminating redundant features; the remaining features are then discretized into a binary table, known as the distinction table in rough set theory. This distinction table is later used as input to evaluate and optimize the objective functions, i.e., to generate a reduct in rough set theory. The proposed hybrid binary PSO is then used to tune the objective functions and choose the most important features (i.e., the reduct). The fitness function is designed so that it reduces the cardinality of the feature set and, at the same time, improves classification performance. Results on three existing benchmark datasets from the literature (colon cancer, lymphoma and leukemia) demonstrate the effectiveness of the proposed method. PubDate: 2017-09-01 DOI: 10.1007/s40745-017-0106-3 Issue No:Vol. 4, No. 3 (2017)
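A minimal sketch of binary PSO with a sigmoid transfer function, the core search mechanism named above; the toy fitness function here merely stands in for the paper's reduct-based objective (it rewards covering a hypothetical set of informative features while penalizing subset size), and all constants are illustrative:

```python
import math
import random

random.seed(42)

N_FEATURES = 20
INFORMATIVE = {1, 4, 7}  # toy ground truth standing in for a rough-set reduct

def fitness(mask):
    """Toy objective: reward covering informative features, penalize size
    (stands in for the paper's reduct-based fitness)."""
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.05 * sum(mask)

def binary_pso(n_particles=15, iters=40, w=0.7, c1=1.5, c2=1.5):
    pos = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(n_particles)]
    vel = [[0.0] * N_FEATURES for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = max(pos, key=fitness)[:]
    for _ in range(iters):
        for p in range(n_particles):
            for d in range(N_FEATURES):
                vel[p][d] = (w * vel[p][d]
                             + c1 * random.random() * (pbest[p][d] - pos[p][d])
                             + c2 * random.random() * (gbest[d] - pos[p][d]))
                # Sigmoid transfer turns the velocity into a bit probability.
                pos[p][d] = 1 if random.random() < 1 / (1 + math.exp(-vel[p][d])) else 0
            if fitness(pos[p]) > fitness(pbest[p]):
                pbest[p] = pos[p][:]
                if fitness(pbest[p]) > fitness(gbest):
                    gbest = pbest[p][:]
    return gbest

selected = binary_pso()
```

The cardinality penalty in the fitness is what drives the swarm toward small feature subsets, mirroring the trade-off the abstract describes.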

Authors:Ramesh Naidu Balaka; Prasad Babu Maddali Surendra Pages: 383 - 404 Abstract: Biometric authentication plays a pivotal role in providing security in any industry. In previous work, biometric authentication systems were developed using a password, PIN or signature as a single source of identification (i.e., a unimodal biometric system). However, such systems can be noisy, lost, stolen or subjected to spoofing attacks. This paper proposes a multimodal biometric authentication system, which uses more than one biometric trait for recognition and is more effective than previous work. The proposed system is robust against attacks because authentication is performed using multimodal biometric traits. The present system handles two traits, face and fingerprint; these are preprocessed to remove noise and compress the traits, and features are then extracted using the Histogram of Oriented Gradients (HOG) technique. Probability density function (PDF) values are obtained from the HOG features using a Gaussian mixture model, and the PDF values are fused using score-level fusion. Finally, correlation compares the traits of the training and testing datasets. Identification of biometric traits is performed with the multimodal biometric system, and the results show better recognition performance than existing methods. Experiments were also conducted on different parametric measures such as RMSE, PSNR and CR, and it was observed that the DCT performs better than the existing Haar wavelet transform. The proposed work is useful for reducing the size of the database, utilizing bandwidth, identifying traits, and authentication in banking, crime investigation, etc. PubDate: 2017-09-01 DOI: 10.1007/s40745-017-0110-7 Issue No:Vol. 4, No. 3 (2017)

Authors:Li Cai; Sijin Li; Shipu Wang; Yu Liang Abstract: The trajectory data of taxis, containing temporal and spatial information, is an important kind of traffic data. How to obtain valuable information from these data has become a hot topic in the field of intelligent transportation. Existing trajectory clustering algorithms can only compute similarities using partial characteristics of the trajectory data, leading to inaccurate clustering results. This study proposes a novel trajectory clustering algorithm named GLTC, which can obtain a more accurate number of clusters based on the global and local characteristics of trajectories. The study intuitively displays the laws and knowledge in the clustering results using visualization techniques. Experimental results reveal that the GLTC algorithm can discover more accurate clustering results, effectively display spatial-temporal change trends in GPS data, and better assist in analyzing the flow of urban citizens and urban traffic conditions using visualization methods. PubDate: 2017-11-21 DOI: 10.1007/s40745-017-0131-2

Authors:Li Cai; Yifan Zhou; Yu Liang; Jing He Abstract: Taxi trajectory data is a kind of massive traffic data with spatial–temporal dimensions, and plays a key role in traffic management, travel analysis and route recommendation for residents. Analyzing trajectory data with traditional methods is complicated, but visualization techniques can intuitively reflect the change trends of spatial–temporal data and facilitate the mining of knowledge and laws in the data. A novel taxi trajectory data visualization and analysis system, TaxiVis, has been designed and developed in this study. The system not only displays the traveling route of every taxi on the map at the micro level, dynamically analyzing each taxi's operating indicators over time, but also displays the operating statistics of every taxi company at the macro level. In addition, TaxiVis provides route inquiry and recommendation functions for users via the GLTC algorithm. The front-end functions of the system are implemented with Node.js, D3.js and Baidu Map, and the trajectory data is stored in a MySQL database. We evaluate TaxiVis with a trajectory dataset collected from 6599 taxis in Kunming. Experimental results show that the system can effectively process and analyze trajectory data, and provide precise data support and presentation for the comprehensive evaluation of taxi operation efficiency and the mining of drivers' intelligence. PubDate: 2017-11-17 DOI: 10.1007/s40745-017-0132-1

Authors:Dan Zhou; Liu Liu; Xin Lai Abstract: The validity of the EWMA chart is doubtful when observations from heteroscedastic processes violate the assumption of identical distribution. In this paper, we discuss the effect of heteroscedasticity on the performance of the conventional EWMA chart. We then analyze the principle of an improved EWMA chart for monitoring heteroscedastic processes, and compare its detection performance with that of the conventional EWMA chart using a criterion based on the average run length (ARL). Finally, an example is given to show the effectiveness of the proposed method and to analyze the best trading time of a stock. PubDate: 2017-11-14 DOI: 10.1007/s40745-017-0133-0
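A minimal sketch of the conventional EWMA chart with time-varying control limits, which the paper takes as its baseline (the smoothing constant, limit multiplier and data are illustrative; the paper's improvement for heteroscedastic processes is not shown):

```python
import math

def ewma_chart(data, mean, sigma, lam=0.2, L=3.0):
    """Classic EWMA statistics with time-varying control limits.
    Returns (z_values, limits); a point signals when |z - mean| > limit."""
    z = mean
    zs, limits = [], []
    for i, x in enumerate(data, start=1):
        z = lam * x + (1 - lam) * z
        width = L * sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
        zs.append(z)
        limits.append(width)
    return zs, limits

# In-control points followed by a small sustained mean shift.
data = [0.2, -0.1, 0.1, 0.0, -0.2] + [1.5] * 8
zs, limits = ewma_chart(data, mean=0.0, sigma=1.0)
signals = [i for i, (z, w) in enumerate(zip(zs, limits)) if abs(z) > w]
```

The constant sigma here is exactly the identical-distribution assumption the abstract questions; under heteroscedasticity these limits are miscalibrated, which motivates the improved chart.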

Authors:K. Ramadan; M. I. Dessouky; S. Elagooz; M. Elkordy; F. E. Abd El-Samie Abstract: Due to noise enhancement, conventional Zero Forcing (ZF) equalizers are not suitable for wireless Underwater Acoustic (UWA) Orthogonal Frequency Division Multiplexing (OFDM) communication systems. Furthermore, these systems suffer from increasing complexity due to the large number of subcarriers, especially in Multiple-Input Multiple-Output (MIMO) systems. On the other hand, the Minimum Mean Square Error equalizer suffers from high complexity, and this type of equalizer needs an estimate of the operating Signal-to-Noise Ratio to work properly. In this paper, we propose a Joint Low-Complexity Regularized ZF equalizer for MIMO UWA-OFDM systems to cope with these problems. The main objective of the proposed equalizer is to enhance the system performance at lower complexity by performing equalization in two steps: the co-channel interference is mitigated in the first step, and a regularization term is added in the second step to avoid noise enhancement. Simulation results show that the proposed equalization scheme is able to enhance the UWA system performance with low complexity. PubDate: 2017-09-06 DOI: 10.1007/s40745-017-0127-y

Authors:Firuz Kamalov; Fadi Thabtah Abstract: One of the major aspects of any classification process is selecting the relevant set of features to be used by a classification algorithm. This initial step in data analysis is called feature selection. Disposing of the irrelevant features in the dataset reduces the complexity of the classification task and increases the robustness of the decision rules when applied to the test set. This paper proposes a new filtering method that combines and normalizes the scores of three major feature selection methods: information gain, the chi-squared statistic and inter-correlation. Our method utilizes the strengths of each of these methods to maximum advantage while avoiding their drawbacks, especially the disparity of the results they produce. Our filtering method stabilizes each variable's score and gives it its true rank among the input data's variables, maximizing the stability of the variables' scores without losing the overall accuracy of the predictive model. A number of experiments on datasets from various domains have shown that features chosen by the proposed method are highly predictive compared with features selected by other existing filtering methods. The evaluation of the filtering phase was conducted via thorough experimentation using a number of predictive classification algorithms, in addition to statistical analysis of the filtering methods' scores. PubDate: 2017-07-29 DOI: 10.1007/s40745-017-0116-1
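A minimal sketch of the combine-and-normalize idea: min-max normalize each filter's scores onto a common scale and average them per feature (the per-feature scores below are hypothetical, and the paper's exact normalization and combination rule may differ):

```python
def minmax(scores):
    """Rescale a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combined_rank(*score_lists):
    """Average the min-max-normalized scores of several filter methods
    and rank features by the combined score (highest first)."""
    norm = [minmax(s) for s in score_lists]
    combined = [sum(col) / len(col) for col in zip(*norm)]
    order = sorted(range(len(combined)), key=lambda i: -combined[i])
    return order, combined

# Hypothetical per-feature scores from three filters on four features.
info_gain = [0.02, 0.40, 0.15, 0.33]
chi2      = [1.1, 9.8, 2.5, 7.0]
corr      = [0.05, 0.72, 0.30, 0.61]
ranking, combined = combined_rank(info_gain, chi2, corr)
```

Normalization is what removes the disparity between the raw scales of the three filters, so no single method dominates the combined rank.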

Authors:Sakshi Agarwal; Shikha Mehta Abstract: Shortest-distance queries are widely used in large-scale networks. Numerous approaches in the literature approximate the distance between two query nodes, the most popular being the landmark embedding scheme. In this technique, the selection of optimal landmarks is an NP-hard problem. Heuristics available to locate optimal landmarks include random selection, degree, closeness centrality, betweenness and eccentricity. In this paper, we propose a k-medoids clustering based approach to improve distance estimation accuracy over local landmark embedding techniques. In particular, it is observed that global selection of the seed landmarks causes large relative error, which is further reduced using local landmark embedding. The efficacy of the proposed approach is analyzed against conventional graph embedding techniques on six large-scale networks. Results show that the proposed landmark selection scheme reduces the shortest-distance estimation error considerably, by up to 29% with respect to the other graph embedding techniques. PubDate: 2017-07-22 DOI: 10.1007/s40745-017-0119-y
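A minimal sketch of landmark-based distance estimation, the scheme the proposed selection method plugs into: precompute BFS distances from each landmark and bound d(u, v) by the best detour through a landmark (the toy graph and landmark choice are illustrative, and the k-medoids selection step is not shown):

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every node of an unweighted graph."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def landmark_estimate(adj, landmarks, u, v):
    """Upper-bound estimate: d(u,v) <= min over landmarks l of d(u,l) + d(l,v)."""
    tables = [bfs_distances(adj, l) for l in landmarks]
    return min(t[u] + t[v] for t in tables)

# Small toy graph; node 2 is a hub and a natural landmark.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2, 5], 4: [2], 5: [3]}
est = landmark_estimate(adj, landmarks=[2], u=0, v=5)
```

The estimate is exact whenever a landmark lies on a shortest u-v path, which is why landmark placement (the paper's contribution) drives the approximation error.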

Authors:Abdullah-Al Nahid; Tariq M. Khan; Yinan Kong Abstract: Bone fracture detection by digital image segmentation is a well-known image processing application frequently used on biomedical images. Hardware realization of image processing algorithms, especially on Field Programmable Gate Arrays (FPGAs), has gained great interest among researchers. An FPGA has significant features, such as spatial and temporal parallelism, that make it well suited to real-time image processing. To gain the benefit of these characteristics of FPGAs, a new method for bone fracture detection is proposed and its performance is validated through real-time implementation. Simulation results show that the proposed method gives superior performance to the existing method. PubDate: 2017-07-21 DOI: 10.1007/s40745-017-0118-z

Authors:Sanku Dey; Mazen Nassar; Devendra Kumar Abstract: In this paper, a new three-parameter distribution, called the \(\alpha \) logarithmic transformed generalized exponential ( \(\alpha LTGE\) ) distribution, is proposed. Various properties of the proposed distribution are derived, including explicit expressions for the moments, quantiles, moment generating function, mean deviation about the mean and median, mean residual life, Bonferroni curve, Lorenz curve, Gini index, Rényi entropy, stochastic ordering and order statistics. The distribution is capable of allowing monotonically increasing, decreasing, bathtub and upside-down bathtub shaped hazard rates depending on its parameters. The maximum likelihood estimators of the unknown parameters cannot be obtained in explicit form and have to be obtained by solving non-linear equations. Asymptotic confidence intervals for the parameters are also obtained based on the asymptotic variance–covariance matrix. Finally, two empirical applications of the new model to real data are presented for illustrative purposes. PubDate: 2017-07-21 DOI: 10.1007/s40745-017-0115-2

Authors:Reza Mokarram; Mehdi Emadi Abstract: Classification is one of the most important issues that have gained attention in various fields such as health and medicine. In survival models especially, classification represents a main objective, and it is also one of the main purposes of data mining. Among the data mining methods used for classification, the decision tree, due to its simplicity and its understandable and accurate results, has gained much attention and popularity. In this paper, we first generate observations by Monte Carlo simulation from hazard models with three degrees of complexity at censoring levels from 0 to 70%. The classification accuracy of the Cox model and the decision tree model is then compared for sample sizes of 1000, 5000 and 10,000 using the area under the ROC curve (AUC) and the ROC test. PubDate: 2017-07-12 DOI: 10.1007/s40745-017-0105-4
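A minimal sketch of the AUC computation used for such comparisons, via its Mann-Whitney interpretation (the scores below are hypothetical, not from the paper's simulations):

```python
def auc(pos_scores, neg_scores):
    """AUC as the Mann-Whitney probability that a random positive
    outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical risk scores for event and non-event subjects.
pos = [0.9, 0.8, 0.7, 0.6]
neg = [0.5, 0.4, 0.65, 0.2]
score = auc(pos, neg)  # 15 of 16 pairs ordered correctly -> 0.9375
```

Computing this for the Cox model and the decision tree on the same simulated subjects gives the paired AUC values that the ROC test then compares.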

Authors:Yuanyuan Zhang; Saralees Nadarajah Abstract: The Pareto type I distribution (also known as the power law distribution and Zipf's law) appears to be the main distribution used to model heavy-tailed phenomena in the big data literature. Being one of the oldest heavy-tailed distributions, the Pareto type I distribution is not very flexible. Here, we show the flexibility of four other heavy-tailed distributions for modeling four big data sets from social networks. The Pareto type I distribution is shown not to provide the best, or even an adequate, fit for any of the data sets. PubDate: 2017-06-10 DOI: 10.1007/s40745-017-0113-4
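A minimal sketch of the baseline Pareto type I maximum-likelihood fit against which alternative heavy-tailed models are judged (the data are illustrative; the paper's four competing distributions are not shown):

```python
import math

def pareto_mle(data):
    """Maximum-likelihood fit of the Pareto type I distribution:
    scale x_m is the sample minimum, shape alpha = n / sum(log(x_i / x_m))."""
    xm = min(data)
    alpha = len(data) / sum(math.log(x / xm) for x in data)
    return xm, alpha

# Illustrative heavy-tailed sample (e.g., node degrees or post counts).
data = [1.0, 1.2, 1.5, 2.0, 3.5, 6.0, 11.0]
xm, alpha = pareto_mle(data)
```

A small fitted alpha indicates a very heavy tail; goodness-of-fit checks (e.g., comparing log-likelihoods or AIC across candidate distributions) are what reveal the inadequacy the abstract reports.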