Authors: Hossein Hassani; Xu Huang; Mansi Ghodsi Abstract: Causality analysis remains one of the fundamental research questions and the ultimate objective of a tremendous number of scientific studies. In line with the rapid progress of science and technology, the age of big data has significantly influenced causality analysis in various disciplines, especially over the last decade, because the complexity and difficulty of identifying causality in big data have increased dramatically. Data mining, the process of uncovering hidden information from big data, is now an important tool for causality analysis and has been extensively exploited by scholars around the world. The primary aim of this paper is to provide a concise review of causality analysis in big data. To this end, the paper reviews recent significant applications of data mining techniques in causality analysis, covering a substantial quantity of research to date, presented in chronological order with an overview table of data mining applications in the causality analysis domain as a reference directory. PubDate: 2017-08-01 DOI: 10.1007/s40745-017-0122-3

Authors: S. Viswanadha Raju; K. K. V. V. S. Reddy; Chinta Someswara Rao Abstract: String matching is the technique of searching for a pattern in a text. It is the basic concept for extracting useful information from large volumes of text and is used in applications such as text processing, information retrieval, text mining, pattern recognition, DNA sequencing and data cleaning. Although some simple mechanisms are reported to perform very well in practice, plenty of research has been published on the subject, the area is still active, and there are ample opportunities to develop new techniques. For this purpose, this paper proposes linear array based string matching, string matching with a butterfly model, and string matching with divide and conquer models for sequential and parallel environments. To assess the efficiency of the proposed models, genome sequences of different sizes (10–100 Mb) are taken as the input data set. The experimental results show that the proposed string matching algorithms perform very well compared with the brute force, KMP and Boyer–Moore string matching algorithms. PubDate: 2017-07-29 DOI: 10.1007/s40745-017-0124-1
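For orientation, one of the baselines named in the abstract above, Knuth–Morris–Pratt (KMP), can be sketched as follows. This is the textbook baseline only, not the linear array, butterfly or divide and conquer models the paper proposes; the function name `kmp_search` and the toy DNA string are illustrative assumptions.

```python
def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    # failure[i]: length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Scan the text once; on a mismatch fall back through the failure table
    # instead of re-reading text characters.
    matches = []
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]
    return matches


if __name__ == "__main__":
    # Hypothetical toy sequence, not data from the paper.
    print(kmp_search("ACGTACGTAC", "ACGT"))
```

The scan takes time linear in the combined length of text and pattern, which is why KMP is a common reference point when new matching schemes are benchmarked.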

Authors: Firuz Kamalov; Fadi Thabtah Abstract: One of the major aspects of any classification process is selecting the relevant set of features to be used in a classification algorithm. This initial step in data analysis is called the feature selection process. Removing the irrelevant features from the dataset reduces the complexity of the classification task and increases the robustness of the decision rules when applied to the test set. This paper proposes a new filtering method that combines and normalizes the scores of three major feature selection methods: information gain, chi-squared statistic and inter-correlation. Our method utilizes the strengths of each of the aforementioned methods to maximum advantage while avoiding their drawbacks, especially the disparity of the results produced by these methods. Our filtering method stabilizes each variable's score and gives it its true rank among the input data's available variables. Hence it maximizes the stability of the variables' scores without losing the overall accuracy of the predictive model. A number of experiments on different datasets from various domains have shown that features chosen by the proposed method are highly predictive when compared with features selected by other existing filtering methods. The filtering phase was evaluated through thorough experimentation using a number of predictive classification algorithms, in addition to statistical analysis of the filtering methods' scores. PubDate: 2017-07-29 DOI: 10.1007/s40745-017-0116-1
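The idea of normalizing and combining several filter scores can be illustrated with a minimal sketch. The abstract does not state the exact normalization or combination rule, so min-max scaling and a plain average are assumptions here, as are the function names, the toy scores, and the convention that a higher score means a more relevant feature.

```python
def min_max(scores):
    """Rescale a {feature: score} table to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {f: ((s - lo) / span if span else 0.0) for f, s in scores.items()}


def combined_rank(*score_tables):
    """Average the normalized scores and return features, best first."""
    normalized = [min_max(table) for table in score_tables]
    features = normalized[0].keys()
    average = {
        f: sum(table[f] for table in normalized) / len(normalized)
        for f in features
    }
    return sorted(average, key=average.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical scores for three features under three filter methods.
    info_gain = {"a": 0.9, "b": 0.1, "c": 0.5}
    chi_squared = {"a": 30.0, "b": 5.0, "c": 40.0}
    correlation = {"a": 0.7, "b": 0.2, "c": 0.3}
    print(combined_rank(info_gain, chi_squared, correlation))
```

Normalizing first keeps a method with a large raw scale (here the chi-squared values) from dominating the combined rank.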

Authors: Tariku Tessema Abstract: Community acquired pneumonia refers to pneumonia acquired outside of hospitals or extended health facilities, and it is a leading infectious disease. This study aims to model the mortality of hospitalized pneumonia patients under five years of age and to investigate potential risk factors associated with child mortality due to pneumonia. The study was a retrospective study of 305 sampled hospitalized under-five patients with community acquired pneumonia. A cross-classified multilevel logistic regression was employed, with residence and hospital classified at the second level. A Bayesian estimation method was applied in which the posterior distribution was simulated via Markov chain Monte Carlo. The variability attributable to hospital was found to be larger than the variability attributable to residence. The odds of dying from community acquired pneumonia were higher among patients who were diagnosed in the spring season, had complications with malaria, AGE and AFI, were in the neonatal age group, or were diagnosed late (after more than a week). The risk of mortality was also found to be high for lower nurse-to-patient and physician-to-patient ratios. PubDate: 2017-07-28 DOI: 10.1007/s40745-017-0121-4

Authors: Harihara Santosh Dadi; Gopala Krishna Mohan Pillutla; Madhavi Latha Makkena Abstract: Tracking and recognizing humans in public places using surveillance cameras is an active research topic in computer vision. Recognizing a human and then tracking that person completes a video surveillance system. A novel algorithm for face recognition and human tracking is presented in this article. The human is tracked using a Gaussian mixture model (GMM). To track a specific human, the GMM template is divided into four regions, which are placed one above the other and tracked simultaneously. To recognize the human, the histogram of oriented gradients features of the face region are given to a support vector machine (SVM) classifier. Three experiments are conducted in selecting the training faces: every \(10{\mathrm{th}}\) frame, every \(5{\mathrm{th}}\) frame and every \(3{\mathrm{rd}}\) frame of the first 100 frames are considered. The other frames in the video are used for testing with the SVM classifier. Three datasets, namely AITAM1 (simple), AITAM2 (moderate) and AITAM3 (complex), are used in this work. The experimental results show that the performance metrics decrease as the complexity of the dataset increases. The more training faces used in preparing the classifier, the better the face recognition rate; this holds for all three datasets. The performance results show that the combination of the tracking algorithm and the face recognition algorithm not only tracks the person but also recognizes the person. This combination of tracking and recognition makes the method well suited to video surveillance applications. PubDate: 2017-07-25 DOI: 10.1007/s40745-017-0123-2

Authors: Chandrakant; M. K. Rastogi; Y. M. Tripathi Abstract: In this paper we study various reliability properties of a Weibull inverse exponential distribution. The maximum likelihood and Bayes estimates of the unknown parameters and reliability characteristics are obtained. Bayes estimates are obtained with respect to the squared error loss function under proper and improper priors. We use the Lindley method and the Metropolis–Hastings algorithm to compute the Bayes estimates. Interval estimation is also considered: asymptotic and highest posterior density intervals of the unknown parameters are constructed. We perform a numerical study to compare the performance of all methods and comment on the findings. We also analyze two real data sets for illustration purposes. Finally, a conclusion is presented. PubDate: 2017-07-24 DOI: 10.1007/s40745-017-0125-0

Authors: Sakshi Agarwal; Shikha Mehta Abstract: The shortest distance query is widely used in large-scale networks. Numerous approaches are present in the literature to approximate the distance between two query nodes. The most popular distance approximation approach is the landmark embedding scheme. In this technique, the selection of optimal landmarks is an NP-hard problem. Heuristics available to locate good landmarks include random selection, degree, closeness centrality, betweenness and eccentricity. In this paper, we propose a k-medoids clustering based approach to improve distance estimation accuracy over local landmark embedding techniques. In particular, it is observed that global selection of the seed landmarks causes a large relative error, which is further reduced using local landmark embedding. The efficacy of the proposed approach is analyzed with respect to conventional graph embedding techniques on six large-scale networks. The results show that the proposed landmark selection scheme reduces the shortest distance estimation error considerably. The proposed technique reduces the approximation error of the shortest distance by up to 29% with respect to the other graph embedding techniques. PubDate: 2017-07-22 DOI: 10.1007/s40745-017-0119-y
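The basic landmark embedding scheme mentioned in the abstract above can be sketched as follows: precompute distances from a few landmark nodes, then bound the distance between any two query nodes through the triangle inequality. In this sketch the landmark is picked by hand on a hypothetical toy graph; it does not implement the k-medoids selection or the local embedding studied in the paper, and the graph is assumed to be unweighted and connected.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def estimate_distance(embedding, u, v):
    """Upper bound on d(u, v): the minimum over landmarks l of d(u, l) + d(l, v)."""
    return min(d[u] + d[v] for d in embedding.values() if u in d and v in d)


if __name__ == "__main__":
    # Hypothetical path graph 0 - 1 - 2 - 3 - 4 as an adjacency list.
    graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    landmarks = [2]  # chosen by hand for illustration
    embedding = {l: bfs_distances(graph, l) for l in landmarks}
    print(estimate_distance(embedding, 0, 4))  # landmark lies on the shortest path
    print(estimate_distance(embedding, 0, 1))  # overestimate: true distance is 1
```

The estimate is exact when some landmark lies on a shortest path between the two nodes and an overestimate otherwise, which is why the choice of landmarks drives the approximation error.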

Authors: Abdullah-Al Nahid; Tariq M. Khan; Yinan Kong Abstract: Bone fracture detection through digital image segmentation is a well-known image processing application that is frequently used to process biomedical images. Hardware realization of image processing algorithms, especially on Field Programmable Gate Arrays (FPGAs), has gained great interest among researchers. An FPGA has significant features, such as spatial and temporal parallelism, that make it well suited to real-time implementation of image processing. To benefit from these characteristics of an FPGA, a new method for bone fracture detection is proposed and its performance is validated through real-time implementation. Simulation results show that the proposed method gives better performance than the existing method. PubDate: 2017-07-21 DOI: 10.1007/s40745-017-0118-z

Authors: Sanku Dey; Tanmay Kayal; Yogesh Mani Tripathi Abstract: This article addresses different methods of estimating the probability density function and the cumulative distribution function of the Gompertz distribution. The following estimation methods are considered: maximum likelihood estimators, uniformly minimum variance unbiased estimators, least squares estimators, weighted least squares estimators, percentile estimators, maximum product of spacings estimators, Cramér–von Mises estimators and Anderson–Darling estimators. Monte Carlo simulations are performed to compare the behavior of the proposed methods of estimation for different sample sizes. Finally, one real data set and one simulated data set are analyzed for illustrative purposes. PubDate: 2017-07-21 DOI: 10.1007/s40745-017-0126-z

Authors: Sanku Dey; Mazen Nassar; Devendra Kumar Abstract: In this paper, a new three-parameter distribution, called the \(\alpha\) logarithmic transformed generalized exponential distribution (\(\alpha LTGE\)), is proposed. Various properties of the proposed distribution are derived, including explicit expressions for the moments, quantiles, moment generating function, mean deviation about the mean and median, mean residual life, Bonferroni curve, Lorenz curve, Gini index, Rényi entropy, stochastic ordering and order statistics. Depending on its parameters, the distribution appears capable of producing monotonically increasing, decreasing, bathtub and upside-down bathtub shaped hazard rates. The maximum likelihood estimators of the unknown parameters cannot be obtained in explicit form and have to be obtained numerically by solving nonlinear equations. The asymptotic confidence intervals for the parameters are also obtained based on the asymptotic variance-covariance matrix. Finally, two empirical applications of the new model to real data are presented for illustrative purposes. PubDate: 2017-07-21 DOI: 10.1007/s40745-017-0115-2

Authors: Pramendra Singh Pundir; Puneet Kumar Gupta Abstract: This study deals with the reliability analysis of a multi-component load sharing system in which the failure of any component induces a higher failure rate on the remaining surviving components. It is assumed that each component failure time follows the Chen distribution. In the classical setup, the maximum likelihood estimates of the load sharing parameters, system reliability and hazard rate, along with their standard errors, are computed. Since the maximum likelihood estimates are not available in closed form, asymptotic confidence intervals and two bootstrap confidence intervals for the unknown parameters have also been constructed. Further, by assuming both informative and non-informative priors for the unknown parameters, Bayes estimates along with their posterior standard errors and HPD intervals of the parameters are obtained. Thereafter, a simulation study illustrates the theoretical developments. Finally, a real data analysis establishes the applicability of the proposed theory. PubDate: 2017-07-20 DOI: 10.1007/s40745-017-0120-5

Authors: Ramazan S. Aygun Abstract: In this paper, we evaluate maximum subarrays for approximate string matching and alignment. The global alignment score as well as local sub-alignments are indicators of good alignment. After showing how maximum subarrays can be used for string matching, we provide several ways of using maximum subarrays: long, short, loose, strict, and top-k. While the long version extends the local sub-alignments, the short method avoids extensions that would not increase the alignment score. The loose method tries to achieve a high global score, whereas the strict method converts the output of loose alignment by minimizing unnecessary gaps. The top-k method is used to find the top-k sub-alignments. The results are compared with two global and local dynamic programming methods that use gap penalties, in addition to one of the state-of-the-art methods. In our experiments, using maximum subarrays generated good overall as well as local sub-alignments without requiring gap penalties. PubDate: 2017-07-19 DOI: 10.1007/s40745-017-0117-0
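As a rough illustration of the building block named in the abstract above, the sketch below scores two equal-length strings position by position and finds the maximum subarray of those scores with Kadane's algorithm. The +1 and -1 scoring, the function names and the toy strings are assumptions for illustration; the long, short, loose, strict and top-k variants of the paper are not implemented here.

```python
def match_scores(a, b, match=1, mismatch=-1):
    """Per-position score for two strings compared without gaps."""
    return [match if x == y else mismatch for x, y in zip(a, b)]


def max_subarray(scores):
    """Kadane's algorithm: (best_sum, start, end) with end exclusive."""
    best_sum, best_start, best_end = scores[0], 0, 1
    cur_sum, cur_start = scores[0], 0
    for i in range(1, len(scores)):
        if cur_sum < 0:
            # A negative running sum can only hurt: restart at position i.
            cur_sum, cur_start = scores[i], i
        else:
            cur_sum += scores[i]
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_sum, best_start, best_end


if __name__ == "__main__":
    # Hypothetical sequences, not data from the paper.
    scores = match_scores("GATTACAGG", "CATTACTGG")
    print(max_subarray(scores))
```

The returned span marks the best-scoring contiguous region of the comparison, found in a single linear pass and without any gap penalty parameter.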

Authors: Sanku Dey; Chunfang Zhang; A. Asgharzadeh; M. Ghorbannezhad Abstract: The extended exponential distribution due to Nadarajah and Haghighi (Stat J Theor Appl Stat 45(6):543–558, 2011) is an alternative to the gamma, Weibull and generalized exponential distributions, and always provides better fits than these whenever the data contain zero values. This article addresses different methods of estimating the unknown parameters of the Nadarajah and Haghighi (NH for short) distribution from both frequentist and Bayesian viewpoints. We briefly describe different frequentist approaches, namely maximum likelihood estimators, moment estimators, percentile estimators, and least squares and weighted least squares estimators, and compare them using extensive numerical simulations. Next we consider Bayes estimation under different types of loss functions (symmetric and asymmetric) using gamma priors for both the shape and scale parameters. Besides the asymptotic confidence intervals, two parametric bootstrap confidence intervals using frequentist approaches are provided for comparison with Bayes credible intervals. Furthermore, the Bayes estimators and their respective posterior risks are computed and compared using a Markov chain Monte Carlo algorithm. Finally, two real data sets are analyzed for illustrative purposes. PubDate: 2017-07-17 DOI: 10.1007/s40745-017-0114-3

Authors: Reza Mokarram; Mehdi Emadi Abstract: Classification is one of the most important issues and has gained much attention in various fields such as health and medicine. In survival models especially, classification is a main objective, and it is also one of the main purposes of data mining. Among the data mining methods used for classification, the decision tree has gained much attention and popularity because of its simplicity and its understandable and accurate results. In this paper, we first generate observations by Monte Carlo simulation from a hazard model with three degrees of complexity at censoring levels from 0 to 70%. Then the classification accuracy of the Cox and decision tree models is compared for sample sizes of 1000, 5000 and 10,000 using the area under the ROC curve (AUC) and the ROC test. PubDate: 2017-07-12 DOI: 10.1007/s40745-017-0105-4

Authors: Yuanyuan Zhang; Saralees Nadarajah Abstract: The Pareto type I distribution (also known as the power law distribution and Zipf’s law) appears to be the main distribution used to model heavy tailed phenomena in the big data literature. The Pareto type I distribution, being one of the oldest heavy tailed distributions, is not very flexible. Here, we show the flexibility of four other heavy tailed distributions for modeling four big data sets in social networks. The Pareto type I distribution is shown not to provide the best or even an adequate fit for any of the data sets. PubDate: 2017-06-10 DOI: 10.1007/s40745-017-0113-4
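For reference, fitting the Pareto type I distribution itself is simple, which partly explains its popularity. With density f(x) = alpha * x_min^alpha / x^(alpha + 1) for x >= x_min, the standard maximum likelihood estimates are x_min = min(x_i) and alpha = n / sum(ln(x_i / x_min)). The sketch below applies these formulas to a hypothetical sample; it does not reproduce the four alternative distributions or the data sets of the paper.

```python
import math


def pareto_mle(sample):
    """Maximum likelihood estimates (x_min, alpha) for Pareto type I."""
    x_min = min(sample)
    # Sum of log-ratios; it is zero if every observation equals x_min,
    # in which case alpha is undefined and a ZeroDivisionError is raised.
    log_sum = sum(math.log(x / x_min) for x in sample)
    alpha = len(sample) / log_sum
    return x_min, alpha


if __name__ == "__main__":
    # Hypothetical sample chosen so the log-ratios are 0, 1 and 2.
    sample = [1.0, math.e, math.e ** 2]
    print(pareto_mle(sample))
```

Because the whole fit reduces to one shape parameter above a threshold, a poor fit in the body or tail of the data cannot be corrected within this family, which is the inflexibility the abstract refers to.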

Authors: Feng Liu; Yong Shi; Ying Liu Abstract: Although artificial intelligence (AI) is currently one of the most interesting areas in scientific research, the potential threats posed by emerging AI systems remain a source of persistent controversy. To address the issue of AI threat, this study proposes a “standard intelligence model” that unifies AI and human characteristics in terms of four aspects of knowledge, i.e., input, output, mastery, and creation. Using this model, we observe three challenges, namely, expanding the von Neumann architecture; testing and ranking the intelligence quotient (IQ) of naturally and artificially intelligent systems, including humans, Google, Microsoft’s Bing, Baidu, and Siri; and finally, dividing artificially intelligent systems into seven grades, from robots to Google Brain. Based on this, we conclude that Google’s AlphaGo belongs to the third grade. PubDate: 2017-05-16 DOI: 10.1007/s40745-017-0109-0

Authors: James M. Tien Abstract: In several earlier papers, the author defined and detailed the concept of a servgood, which can be thought of as a physical good or product enveloped by a services-oriented layer that makes the good smarter or more adaptable and customizable for a particular use. Adding another layer of physical sensors could then enhance its smartness and intelligence, especially if it were connected with other servgoods, thus constituting an Internet of Things (IoT) or servgoods. More importantly, real-time decision making is central to the Internet of Things; it is about decision informatics and embraces the advanced technologies of sensing (i.e., Big Data), processing (i.e., real-time analytics), reacting (i.e., real-time decision making), and learning (i.e., deep learning). Indeed, real-time decision making (RTDM) is becoming an integral aspect of IoT and artificial intelligence (AI), including its improving abilities at voice and video recognition, speech and predictive synthesis, and language and social-media understanding. These three key and mutually supportive technologies (IoT, RTDM, and AI) are considered herein, including their progress to date. PubDate: 2017-05-16 DOI: 10.1007/s40745-017-0112-5

Authors: Pei-Zhuang Wang; Ho-Chung Lui; Hai-Tao Liu; Si-Cong Guo Abstract: An algorithm named Gravity Sliding is presented in this paper. It emulates the gravity sliding motion in a feasible region D constrained by a group of hyperplanes in \(R^{m}\). At each stage point P of the sliding path, we need to calculate the projection of the gravity vector g on the constraint planes: if a constraint plane blocks the way at P, then it may change the direction of the sliding path. The core technique is the combined treatment of multiple blocking planes, which is a basic problem of structural adjustment in practice, while the whole path provides the solution of a linear programming (LP) problem. Existing LP algorithms have no intuitive vision with which to emulate gravity sliding; therefore, their paths cannot avoid circling and roving, and they cannot provide a best direction at each step for structural adjustment. The first author presented the algorithm Cone Cutting (Wang in Inf Technol Decis Mak 10(1):65–82, 2011), which provides an intuitive explanation for Simplex pivoting. The algorithm Gradient Falling (Wang in Ann Data Sci 1(1):41–71, 2014. doi:10.1007/s40745-014-0005-9) was then presented, which emulates the gradient motion on the feasible region. This paper improves on the Gradient Falling algorithm: instead of focusing the description on the null subspace of norm vectors, we focus it on the subspace spanned by those same vectors. This makes the projection calculation easier and faster. We conjecture that the sliding path realized by the algorithm is the optimal path and that the number of stage points of the path is bounded by a polynomial function of the dimension and the number of constraint planes. PubDate: 2017-04-05 DOI: 10.1007/s40745-017-0108-1

Authors: Zhuopei Yang; Yanmei Zhang; Hengyue Jia Abstract: The low success rate of lending is the main obstacle to the development of online P2P lending platforms in China. Based on the theory of social capital, this study analyses the factors that influence the success rate of P2P lending platforms in China, using the social network method and a multiple linear regression model. Soft information, such as the bidding record, is employed in a novel way to study these topics. The data used in this study come from the largest online P2P lending platform in China. The results show that, compared with other influencing factors, the bidding record has a more significant effect on the success rate, and users depend more on social capital; the bidding records reduce information asymmetry, help increase the success rate of lending, and decrease the cost of online P2P lending. PubDate: 2017-03-10 DOI: 10.1007/s40745-017-0103-6

Authors: F. Maleki; E. Deiri Abstract: In this paper, we consider the estimation of the PDF and the CDF of the Fréchet distribution. In this regard, the following estimators are considered: the uniformly minimum variance unbiased estimator, maximum likelihood estimator, percentile estimator, least squares estimator and weighted least squares estimator. To compare them, analytical expressions are derived for the bias and the mean squared error. As the results of simulation studies and real data applications indicate, the ML estimator performs better than the others. PubDate: 2017-02-27 DOI: 10.1007/s40745-017-0100-9