
Publisher: Springer-Verlag (Total: 2350 journals)

Annals of Data Science. Hybrid journal (may contain Open Access articles). ISSN (Print) 2198-5804; ISSN (Online) 2198-5812. Published by Springer-Verlag.
• The Improved EWMA Chart for Heteroscedasticity Process
• Authors: Dan Zhou; Liu Liu; Xin Lai
Pages: 21 - 27
Abstract: The validity of the EWMA chart is questionable when observations come from a heteroscedastic process, since such observations violate the assumption of identical distribution. In this paper, we discuss the effect of heteroscedasticity on the performance of the conventional EWMA chart and analyze the principle of an improved EWMA chart for monitoring heteroscedastic processes. We then compare the detection performance of the improved and conventional EWMA charts using a criterion based on the average run length (ARL). Finally, an example illustrates the effectiveness of the proposed method and its use in identifying the best trading time of a stock.
PubDate: 2018-03-01
DOI: 10.1007/s40745-017-0133-0
Issue No: Vol. 5, No. 1 (2018)
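As a concrete reference point, the conventional EWMA statistic and its time-varying control limits can be sketched as below. This is the generic textbook chart, not the paper's improved heteroscedasticity variant; the smoothing constant `lam`, width `L` and the simulated in-control data are illustrative assumptions.

```python
import numpy as np

def ewma_chart(x, mu0, sigma, lam=0.2, L=3.0):
    """Conventional EWMA chart: z_t = lam*x_t + (1-lam)*z_{t-1}, with
    control limits mu0 +/- L*sigma*sqrt(lam/(2-lam)*(1-(1-lam)^(2t)))."""
    x = np.asarray(x, dtype=float)
    z = np.empty_like(x)
    prev = mu0
    for i, xt in enumerate(x):
        prev = lam * xt + (1 - lam) * prev
        z[i] = prev
    t = np.arange(1, len(x) + 1)
    width = L * sigma * np.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
    return z, mu0 - width, mu0 + width

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)       # simulated in-control observations
z, lcl, ucl = ewma_chart(x, mu0=0.0, sigma=1.0)
signal = np.any((z < lcl) | (z > ucl))  # True would flag a mean shift
```

Under heteroscedasticity, the constant `sigma` in the limit formula is exactly the questionable assumption the paper targets; an improved chart would replace it with a time-varying variance estimate.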

• GPS Trajectory Clustering and Visualization Analysis
• Authors: Li Cai; Sijin Li; Shipu Wang; Yu Liang
Pages: 29 - 42
Abstract: Taxi trajectory data, which contains both temporal and spatial information, is an important kind of traffic data. How to extract valuable information from such data has become a hot topic in intelligent transportation. Existing trajectory clustering algorithms compute similarities using only partial characteristics of the trajectory data, leading to inaccurate clustering results. This study proposes a novel trajectory clustering algorithm named GLTC, which obtains a more accurate number of clusters based on both the global and local characteristics of trajectories, and intuitively displays the laws and knowledge in the clustering results using visualization techniques. Experimental results reveal that the GLTC algorithm discovers more accurate clusters, effectively displays spatial-temporal change trends in GPS data, and better supports analysis of urban citizen flows and urban traffic conditions through visualization methods.
PubDate: 2018-03-01
DOI: 10.1007/s40745-017-0131-2
Issue No: Vol. 5, No. 1 (2018)
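The GLTC algorithm itself is not reproduced in the abstract. As a minimal sketch of the kind of global trajectory similarity such clustering builds on, one can resample each GPS polyline to a fixed number of points along its arc length and average the pointwise distances; the resampling length and the toy trajectories below are illustrative assumptions, not the paper's actual measure.

```python
import numpy as np

def resample(traj, n=20):
    """Linearly resample a polyline (k x 2 array of points) to n points
    equally spaced along its arc length."""
    traj = np.asarray(traj, dtype=float)
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    target = np.linspace(0.0, s[-1], n)
    return np.column_stack([np.interp(target, s, traj[:, i]) for i in range(2)])

def traj_distance(a, b, n=20):
    """Mean pointwise distance between two resampled trajectories."""
    return float(np.linalg.norm(resample(a, n) - resample(b, n), axis=1).mean())

a = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
b = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
d = traj_distance(a, b)  # two parallel tracks offset by 1
```

The resulting pairwise distance matrix could then feed any standard clustering routine; GLTC additionally exploits local characteristics to choose the number of clusters.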

• Research and Application of GPS Trajectory Data Visualization
• Authors: Li Cai; Yifan Zhou; Yu Liang; Jing He
Pages: 43 - 57
Abstract: Taxi trajectory data is a kind of massive traffic data with spatial–temporal dimensions, and plays a key role in traffic management, travel analysis and route recommendation for residents. Analyzing trajectory data with traditional methods is complicated, but visualization techniques can intuitively reflect the change trends of spatial–temporal data and facilitate the mining of knowledge and laws in the data. A novel taxi trajectory data visualization and analysis system, TaxiVis, has been designed and developed in this study. The system not only displays the traveling routes of every taxi on the map at the micro level, dynamically analyzing each taxi's operating indicators over time, but also displays the operating statistics of every taxi company at the macro level. In addition, TaxiVis provides route inquiry and recommendation functions for users based on the GLTC algorithm. Implementation of the front-end functions of the system is based on Node.js, D3.js and Baidu Map, and the trajectory data is stored in a MySQL database. We evaluate TaxiVis with a trajectory dataset collected from 6599 taxis in Kunming. Experimental results show that the system can effectively process and analyze trajectory data, and provide precise data support and presentation for the comprehensive evaluation of taxi operating efficiency and the mining of drivers' intelligence.
PubDate: 2018-03-01
DOI: 10.1007/s40745-017-0132-1
Issue No: Vol. 5, No. 1 (2018)

• Reliability Estimation in Load-Sharing System Model with Application to
Real Data
• Authors: Pramendra Singh Pundir; Puneet Kumar Gupta
Pages: 69 - 91
Abstract: This study deals with the reliability analysis of a multi-component load-sharing system in which the failure of any component induces a higher failure rate in the remaining surviving components. It is assumed that each component failure time follows the Chen distribution. In the classical setup, the maximum likelihood estimates of the load-sharing parameters, system reliability and hazard rate, along with their standard errors, are computed. Since the maximum likelihood estimates are not in closed form, asymptotic confidence intervals and two bootstrap confidence intervals for the unknown parameters are also constructed. Further, assuming both informative and non-informative priors for the unknown parameters, Bayes estimates along with their posterior standard errors and HPD intervals are obtained. A simulation study then illustrates the theoretical developments, and a real data analysis establishes the applicability of the proposed theory.
PubDate: 2018-03-01
DOI: 10.1007/s40745-017-0120-5
Issue No: Vol. 5, No. 1 (2018)
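The component lifetime model can be made concrete: the abstract assumes each component failure time follows the Chen distribution, whose survival and hazard functions are sketched below with illustrative parameter values. The load-sharing mechanism itself (inflating the hazard of the survivors after each failure) is the paper's contribution and is not implemented here.

```python
import numpy as np

def chen_survival(t, lam, beta):
    """Chen distribution survival function S(t) = exp(lam*(1 - exp(t**beta)))."""
    t = np.asarray(t, dtype=float)
    return np.exp(lam * (1.0 - np.exp(t ** beta)))

def chen_hazard(t, lam, beta):
    """Chen hazard rate h(t) = lam*beta*t**(beta-1)*exp(t**beta); for
    beta < 1 this is bathtub-shaped."""
    t = np.asarray(t, dtype=float)
    return lam * beta * t ** (beta - 1.0) * np.exp(t ** beta)

t = np.linspace(0.1, 2.0, 50)
S = chen_survival(t, lam=0.5, beta=0.7)  # monotonically decreasing in t
```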

• Power Lindley-G Family of Distributions
• Authors: Amal S. Hassan; Said G. Nassr
Abstract: In this paper, we introduce a new family of probability distributions generated from a power Lindley random variable, called the power Lindley-generated family. The new family extends several classical distributions and generalizes the odd Lindley family proposed by Silva et al. (Austrian J Stat 46:65–87, 2017). Some mathematical properties are obtained, including moments, incomplete moments, the quantile function and order statistics. Four new distributions are provided as special models of the family. The model parameters are estimated by the maximum likelihood technique. An application to a real data set and a simulation study demonstrate the flexibility and interest of one special model of the suggested family.
PubDate: 2018-03-16
DOI: 10.1007/s40745-018-0159-y

• Statistical Inference and Optimum Life Testing Plans Under Type-II Hybrid
Censoring Scheme
• Authors: Tanmay Sen; Yogesh Mani Tripathi; Ritwik Bhattacharya
Abstract: This article considers estimation of unknown parameters and prediction of future observations of a generalized exponential distribution based on Type-II hybrid censored data. Bayes point and HPD interval estimates of the unknown parameters are obtained under the assumption of independent gamma priors. Different classical and Bayesian point predictors and prediction intervals are obtained in the two-sample situation under a squared error loss function. Optimum censoring schemes are computed under various optimality criteria. Monte Carlo simulations are performed to compare the different methods, and two data sets are analyzed for illustrative purposes.
PubDate: 2018-03-14
DOI: 10.1007/s40745-018-0158-z
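To make the data-generating setup concrete: a Type-II hybrid censored experiment terminates at T* = max(x_(r), T), which guarantees at least r observed failures. The sketch below simulates this for the generalized exponential distribution F(x) = (1 - e^(-lam*x))^alpha via inverse-CDF sampling; the sample size, censoring parameters and distribution parameters are illustrative assumptions.

```python
import numpy as np

def sample_gen_exponential(n, alpha, lam, rng):
    """Inverse-CDF draws from the generalized exponential distribution
    F(x) = (1 - exp(-lam*x))**alpha."""
    u = rng.random(n)
    return -np.log(1.0 - u ** (1.0 / alpha)) / lam

def type2_hybrid_censor(x, r, T):
    """Type-II hybrid censoring: observation stops at T* = max(x_(r), T);
    returns the observed failure times and the stopping time T*."""
    xs = np.sort(np.asarray(x, dtype=float))
    t_star = max(xs[r - 1], T)
    return xs[xs <= t_star], t_star

rng = np.random.default_rng(1)
x = sample_gen_exponential(30, alpha=1.5, lam=1.0, rng=rng)
obs, t_star = type2_hybrid_censor(x, r=10, T=0.8)  # at least 10 failures seen
```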

• Forecasting the Volatility of Ethiopian Birr/Euro Exchange Rate Using
Garch-Type Models
• Authors: Desa Daba Fufa; Belianeh Legesse Zeleke
Abstract: This paper provides a robust analysis of volatility forecasting of the Euro-ETB exchange rate using weekly data spanning the period January 3, 2000–December 2, 2015. The forecasting performance of various GARCH-type models is investigated based on forecasting performance criteria such as MSE- and MAE-based tests and alternative measures of realized volatility. To our knowledge, this is the first study that focuses on the Euro-ETB exchange rate using high-frequency data and a range of econometric models and forecast performance criteria. The empirical results indicate that the Euro-ETB exchange rate series exhibits persistent volatility clustering over the study period. We document evidence that the ARCH (8), GARCH (1, 1), EGARCH (1, 1) and GJR-GARCH (2, 2) models with normal, Student's-t and GED error distributions are the best in-sample estimation models in terms of the volatility behavior of the series. Among these models, GJR-GARCH (2, 2) and GARCH (1, 1) with Student's t-distribution perform best in one-step-ahead forecasting based on realized volatility calculated from the underlying daily data and on squared weekly first differences of the logarithm of the series, respectively. The one-step-ahead forecast of the conditional variance of the weekly Euro-ETB exchange rate shows large spikes around 2010, and it is evident that the weekly Euro-ETB exchange rate is volatile. These large spikes indicate devaluation of the Ethiopian birr against the Euro. This volatility may affect international foreign investment and the trade balance of the country. Therefore, the GJR-GARCH (2, 2) model with Student's t-distribution is the best model, both in terms of the stylized facts and in terms of the forecasting performance of the volatility of the Ethiopian Birr/Euro exchange rate.
PubDate: 2018-03-13
DOI: 10.1007/s40745-018-0151-6
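As a hedged illustration of the in-sample estimation step, the Gaussian GARCH(1, 1) likelihood can be written down and maximized directly. The data here are simulated white noise standing in for log-returns, and the asymmetric GJR term and Student's-t errors used in the paper are omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def garch11_nll(params, r):
    """Gaussian negative log-likelihood of a GARCH(1,1):
    sigma2_t = omega + alpha*r_{t-1}**2 + beta*sigma2_{t-1}."""
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf  # enforce positivity and covariance stationarity
    sigma2 = np.empty_like(r)
    sigma2[0] = r.var()  # common initialization: unconditional variance
    for t in range(1, len(r)):
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + r ** 2 / sigma2)

rng = np.random.default_rng(2)
r = rng.normal(0.0, 1.0, size=500)  # simulated stand-in for weekly log-returns
res = minimize(garch11_nll, x0=[0.1, 0.05, 0.8], args=(r,), method="Nelder-Mead")
omega, alpha, beta = res.x          # fitted GARCH(1,1) parameters
```

The fitted `sigma2` recursion, iterated one step past the sample, gives the one-step-ahead conditional variance forecast the abstract evaluates.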

• Joint Modeling of Longitudinal CD4 Count and Time-to-Death of HIV/TB
Co-infected Patients: A Case of Jimma University Specialized Hospital
• Authors: Aboma Temesgen; Abdisa Gurmesa; Yehenew Getchew
Abstract: Tuberculosis (TB) and HIV have been closely linked since the emergence of AIDS: TB enhances HIV replication by accelerating the natural evolution of HIV infection and is a leading cause of sickness and death among people living with HIV/AIDS. Co-infected patients are started on antiretroviral treatment (ART) to improve their lives, and once a patient is on ART it is common to measure CD4 counts and other clinical outcomes that are correlated with survival time. Separate analysis of such data does not handle the association between the longitudinally measured outcome and the time-to-event, whereas joint modeling does, yielding valid and efficient inference on survival time. This study jointly models longitudinally measured CD4 counts and time to death to understand their association, and identifies factors affecting the mean change in square-root CD4 measurements over time as well as risk factors for the survival time of HIV/TB co-infected patients. The study consists of 254 HIV/TB co-infected patients aged 18 years or older who were on antiretroviral treatment follow-up from 1 February 2009 to 1 July 2014 at Jimma University Specialized Hospital, West Ethiopia. First, the data were analyzed separately using a linear mixed model and survival models. After selecting appropriate separate models using the Akaike information criterion, different joint models were fitted with different random-effects longitudinal submodels and different shared-parameter association structures for the survival submodel, and compared using the deviance information criterion. The linear mixed model showed that functional status, weight, and linear and quadratic time effects have significant effects on the mean change of CD4 counts over time.
The Cox and Weibull survival models showed that baseline weight, baseline smoking, the separated marital status group and baseline functional status have significant effects on the hazard function, whereas the joint model showed that the subject-specific baseline value and the subject-specific linear and quadratic slopes of the CD4 measurements were significantly associated with the survival time of co-infected patients at the 5% significance level. The longitudinally measured CD4 count process is significantly associated with time to death, and the subject-specific quadratic slope of CD4 growth, baseline clinical stage IV and smoking are the main risk factors that shorten the survival time of HIV/TB co-infected patients. Since the longitudinally measured CD4 count is correlated with survival time, joint modeling should be used to handle the association between these two processes and obtain valid and efficient inference on survival time.
PubDate: 2018-03-12
DOI: 10.1007/s40745-018-0157-0

• Transmuted Kumaraswamy Quasi Lindley Distribution with Applications
• Authors: M. Elgarhy; I. Elbatal; Muhammad Ahsan ul Haq; Amal S. Hassan
Abstract: The Lindley distribution is one of the widely used models in reliability modeling, and several researchers have proposed new classes of distributions based on modifications of the quasi Lindley distribution. In this article, a new generalized distribution, the transmuted Kumaraswamy quasi Lindley (TKQL) distribution, is introduced. Various statistical properties of the TKQL distribution are provided, and its rth moment and moment generating function are derived. Moreover, estimation of the model parameters is discussed via the method of maximum likelihood. Applications to real data demonstrate the flexibility of the TKQL distribution in comparison with some sub-models.
PubDate: 2018-03-12
DOI: 10.1007/s40745-018-0153-4

• Fractal Dimension Calculation for Big Data Using Box Locality Index
• Authors: Rong Liu; Robert Rallo; Yoram Cohen
Abstract: The box-counting approach for fractal dimension calculation is scaled up for big data using a data structure named the box locality index (BLI). The BLI is constructed as key-value pairs, with the key indexing the location of a "box" (i.e., a grid cell in the multi-dimensional space) and the value counting the number of data points inside the box (i.e., the "box occupancy"). This key-value pair structure significantly simplifies the traditionally used hierarchical structure and encodes only the information required by the box-counting approach for fractal dimension calculation. Moreover, as the box occupancies (i.e., the values) associated with the same index (i.e., the key) are aggregatable, the BLI grants the box-counting approach the scalability needed for fractal dimension calculation on big data using distributed computing techniques (e.g., MapReduce and Spark). Taking advantage of the BLI, MapReduce and Spark methods for fractal dimension calculation of big data are developed, which conduct box-counting for each grid level as a cascade of MapReduce/Spark jobs in a bottom-up fashion. In an empirical validation, the MapReduce and Spark methods demonstrated good effectiveness and efficiency on a big synthetic dataset. In summary, this work provides an efficient solution for estimating the intrinsic dimension of big data, which is essential for many machine learning methods and data analytics tasks, including feature selection and dimensionality reduction.
PubDate: 2018-03-10
DOI: 10.1007/s40745-018-0152-5
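The key-value idea is easy to sketch on a single machine: a Python `Counter` keyed by grid-cell index plays the role of the BLI, and because occupancy counts for the same key simply add, the same structure merges naturally in a MapReduce/Spark reduce step. The cell sizes and synthetic uniform data below are illustrative; the slope estimated is the correlation dimension D2, one of the fractal dimensions obtainable from box occupancies.

```python
import numpy as np
from collections import Counter

def box_locality_index(points, cell):
    """BLI as key-value pairs: key = grid-cell index tuple, value = box
    occupancy. Counters for the same key add, so partial BLIs computed on
    different workers can be merged in a reduce step."""
    keys = map(tuple, np.floor(np.asarray(points, float) / cell).astype(int))
    return Counter(keys)

def log_s2(points, cell):
    """log of S2 = sum of squared box occupancies; the slope of log S2
    against log cell estimates the correlation dimension D2."""
    bli = box_locality_index(points, cell)
    return np.log(sum(v * v for v in bli.values()))

rng = np.random.default_rng(3)
pts = rng.random((5000, 2))    # uniform points in the unit square: D2 ~ 2
cells = [0.2, 0.1, 0.05]
d2 = np.polyfit(np.log(cells), [log_s2(pts, c) for c in cells], 1)[0]
```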

• Artificial Neural Network Classification of High Dimensional Data with
Novel Optimization Approach of Dimension Reduction
• Authors: Rabia Aziz; C. K. Verma; Namita Srivastava
Abstract: Classification of high dimensional data is a very crucial task in bioinformatics. Cancer classification from microarrays is a typical application of machine learning due to the large number of genes. Feature (gene) selection and classification with computationally intelligent techniques play an important role in the diagnosis and prediction of disease from microarray data. Artificial neural networks (ANNs) are an artificial intelligence technique for classification, image processing and prediction. This paper evaluates the performance of an ANN classifier using six different hybrid feature selection techniques for gene selection in microarray data. These hybrid techniques use independent component analysis (ICA) as an extraction technique, popular filter techniques, and bio-inspired algorithms for optimization of the ICA feature vector. Five binary gene expression microarray datasets are used to compare the performance of these techniques and determine how they improve the performance of the ANN classifier. These techniques can be extremely useful in feature selection because they achieve the highest classification accuracy along with the lowest average number of selected genes. Furthermore, to check for significant differences between the algorithms, a statistical hypothesis test was employed at a given level of confidence. The experimental results show that the combination of ICA with the genetic bee colony algorithm performs best, as it heuristically removes non-contributing features to improve classifier performance.
PubDate: 2018-03-09
DOI: 10.1007/s40745-018-0155-2

• On Optimal Progressive Censoring Schemes for Normal Distribution
• Authors: U. H. Salemi; S. Rezaei; Y. Si; S. Nadarajah
Abstract: Selection of optimal progressive censoring schemes for the normal distribution is discussed under maximum likelihood estimation and best linear unbiased estimation. The selection is based on the variances of the estimators of the two parameters of the normal distribution. The extreme left censoring scheme is shown to be an optimal progressive censoring scheme, while the usual type-II right censoring case is shown to be the worst progressive censoring scheme for estimating the scale parameter, as it can greatly increase the variance of the estimators.
PubDate: 2018-03-09
DOI: 10.1007/s40745-018-0156-1

• A Novel Multiview Topic Model to Compute Correlation of Heterogeneous Data
• Authors: Jinsheng Shen; Mingmin Chi
Abstract: With the fast development of Internet technologies and sensor techniques, it is much easier to acquire data from different sources at different dates and times. However, computing the correlation of such heterogeneous data is a big challenge for data mining and information retrieval. Here, the data features from one source are called a view, and the multiview features describe the same data point. In this paper, the hidden correlation of two-view features is used to construct a Heterogeneous (multiview) Topic Model (HTM). In particular, a probabilistic topic model is utilized for the different views, since generative models usually provide much richer features when handling high-dimensional data such as text. Nevertheless, most existing probabilistic topic models, such as latent Dirichlet allocation, require the form of the probability distribution to be known. To avoid this limitation of probabilistic topic models, the HTM is reduced to a non-negative matrix tri-factorization problem with certain constraints, so that the proposed approach can be used with an arbitrary model.
PubDate: 2018-02-20
DOI: 10.1007/s40745-017-0135-y
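The constrained tri-factorization X ≈ F S Gᵀ is the paper's contribution and is not reproduced here. As a simplified stand-in, classical two-factor non-negative matrix factorization with Lee-Seung multiplicative updates shows the flavor of such alternating non-negative updates; the dimensions, rank and iteration count are illustrative.

```python
import numpy as np

def nmf(X, k, iters=500, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for X ~= W @ H with W, H >= 0.
    A two-factor simplification of the constrained tri-factorization
    X ~= F S G^T used by the HTM."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # update keeps H non-negative
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # update keeps W non-negative
    return W, H

rng = np.random.default_rng(4)
X = rng.random((20, 4)) @ rng.random((4, 30))  # exactly rank-4, non-negative
W, H = nmf(X, k=4)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```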

• Assessing Survival Time of Women with Cervical Cancer Using Various
Parametric Frailty Models: A Case Study at Tikur Anbessa Specialized Hospital
• Authors: Selamawit Endale Gurmu
Abstract: Cervical cancer is one of the leading causes of death in the world and represents a tremendous burden on patients, families and societies. It is estimated that over one million women worldwide currently have cervical cancer; most of them have not been diagnosed or have no access to treatment that could cure them or prolong their lives. The goal of this study is to investigate potential risk factors affecting the survival time of women with cervical cancer at Tikur Anbessa Specialized Hospital. Data were taken from the medical record cards of patients enrolled from September 2011 to September 2015. The Kaplan–Meier estimation method, the Cox proportional hazards model and parametric shared frailty models were used to analyze the survival times of cervical cancer patients. The study subjects came from a clustered community, so the survival data are correlated at the regional level; parametric frailty models were therefore explored, assuming that women within the same cluster (region, for this study) share similar risk factors. We used exponential, Weibull, log-logistic and log-normal baseline distributions, and the models were compared on the AIC criterion. The log-normal inverse Gaussian frailty model had the minimum AIC value among the models compared. The results implied that not giving birth before the end of the study and marrying after age twenty significantly prolong the survival time of patients, while the age classes 51–60, 61–70 and > 70, smoking cigarettes, stage III and IV disease, a family history of cervical cancer, a history of abortion and living with HIV/AIDS significantly shorten it. The findings of this study suggest that age, smoking, stage, family history, abortion history, living with HIV/AIDS, age at first marriage and age at first birth are major factors in the survival time of patients.
There is heterogeneity between the regions in the survival time of cervical cancer patients, indicating that one needs to account for this clustering variable using frailty models. The fit statistics showed that the log-normal inverse Gaussian frailty model described the survival times in the cervical cancer patient dataset better than the other distributions used in this study.
PubDate: 2018-02-17
DOI: 10.1007/s40745-018-0150-7

• Build a Tourism-Specific Sentiment Lexicon Via Word2vec
• Authors: Wei Li; Luyao Zhu; Kun Guo; Yong Shi; Yuanchun Zheng
Abstract: Online travel and online travel culture have developed fast in China in recent years, yet much useful knowledge is still hidden in the large number of tourism reviews. We therefore need effective sentiment analysis methods to mine knowledge that can help tourism websites make decisions and improve their travel products. Some data-driven sentiment lexicons perform poorly on sentiment polarity classification due to a lack of semantic information. Thus, we propose an effective data-driven sentiment lexicon construction method that incorporates manually labeled sentiment scores with the semantic similarity information provided by the machine learning method word2vec. Experimental results demonstrate that our method significantly improves the performance of tourism sentiment analysis.
PubDate: 2018-02-16
DOI: 10.1007/s40745-017-0130-3
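A minimal sketch of the scoring idea: given word vectors and a handful of manually scored seed words, an unlabeled word's sentiment can be taken as a similarity-weighted average of the seed scores. The eight-dimensional toy vectors and the words `great`/`terrible`/`scenic` below are hypothetical; in practice the vectors would come from word2vec trained on tourism reviews, and the paper's exact combination rule may differ.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def score_word(word, seeds, vecs):
    """Sentiment score of `word` as a similarity-weighted average of the
    manually labeled seed scores (each in [-1, 1])."""
    sims = {s: cosine(vecs[word], vecs[s]) for s in seeds}
    total = sum(abs(v) for v in sims.values()) or 1.0
    return sum(sims[s] * seeds[s] for s in seeds) / total

# Toy 8-dimensional embeddings; real ones would come from word2vec.
rng = np.random.default_rng(5)
base = rng.normal(size=8)
vecs = {
    "great":    base + 0.1 * rng.normal(size=8),
    "terrible": -base + 0.1 * rng.normal(size=8),
    "scenic":   base + 0.2 * rng.normal(size=8),  # unlabeled tourism word
}
seeds = {"great": 1.0, "terrible": -1.0}
s = score_word("scenic", seeds, vecs)  # positive: "scenic" sits near "great"
```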

• User Data Can Tell Defaulters in P2P Lending
• Authors: Jackson J. Mi; Tianxiao Hu; Luke Deer
Abstract: Online peer-to-peer (P2P) lending services are a new type of financial platform that enables individuals to borrow and lend money directly from one to another. As P2P lending develops rapidly, a number of rating systems for borrowers' creditworthiness have been published by different P2P lending companies. However, whether these rating systems truly reflect the creditworthiness and loan risk of borrowers is unconfirmed. In this paper, we analyze the differences between credit levels and user distributions on CPLP to evaluate whether the credit levels truly reflect borrowers' credit. We use soft factors to establish a model that can identify borrowers who are likely to default. Further, we propose some strategies for constructing and improving the risk control of P2P lending platforms based on the results of our research.
PubDate: 2018-02-05
DOI: 10.1007/s40745-017-0134-z

• Enhancing Situation Awareness Using Semantic Web Technologies and Complex
Event Processing
• Authors: Havva Alizadeh Noughabi; Mohsen Kahani; Alireza Shakibamanesh
Abstract: Data fusion techniques combine raw data from multiple sources and collect associated data to achieve more specific inferences than could be attained with a single source. Situational awareness is one of the levels of the JDL model, a mature information fusion model; its aim is to understand the developing relationships of interest between entities within a specific time and space. The present research shows how semantic web technologies, i.e. ontologies and a semantic reasoner, can be used to describe situations and increase situational awareness. As the situation awareness level receives data streams from numerous distributed sources, it is necessary to manage these streams with data stream processing engines such as Esper. In addition, this research uses complex event processing, a technique for recognizing relevant situations in real time, whose main aim is to automatically generate actionable abstractions from event streams. The proposed approach combines complex event processing and semantic web technologies to achieve better situational awareness. To show the functionality of the proposed approach in practice, some simple examples are discussed.
PubDate: 2018-02-05
DOI: 10.1007/s40745-018-0148-1
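Complex event processing engines like Esper express such rules declaratively in EPL. As a language-neutral toy stand-in (not Esper's actual API), a sliding-window rule that raises a "situation" whenever enough same-type events arrive within a time window can be written as:

```python
from collections import deque

def detect_bursts(events, window, threshold):
    """Toy CEP rule: emit a 'situation' whenever at least `threshold`
    events of the same type arrive within `window` seconds. Events are
    (timestamp, type) pairs in time order."""
    recent = {}       # event type -> deque of recent timestamps
    situations = []
    for ts, etype in events:
        q = recent.setdefault(etype, deque())
        q.append(ts)
        while q and q[0] < ts - window:
            q.popleft()           # expire events outside the window
        if len(q) >= threshold:
            situations.append((ts, etype))
    return situations

events = [(0, "sensor_a"), (1, "sensor_a"), (2, "sensor_b"), (3, "sensor_a")]
hits = detect_bursts(events, window=5, threshold=3)  # [(3, "sensor_a")]
```

In the paper's architecture, such detected abstractions would then be asserted into the ontology for the semantic reasoner to interpret.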

• A New Family of Generalized Distributions Based on Alpha Power
Transformation with Application to Cancer Data
• Authors: M. Nassar; A. Alzaatreh; O. Abo-Kasem; M. Mead; M. Mansoor
Abstract: In this paper, we propose a new method for generating distributions based on the idea of the alpha power transformation introduced by Mahdavi and Kundu (Commun Stat Theory Methods 46(13):6543–6557, 2017). The new method can be applied to any distribution by inverting its quantile function as a function of the alpha power transformation. We apply the proposed method to the Weibull distribution to obtain a three-parameter alpha power Weibull distribution. The new distribution possesses very flexible density and hazard rate function shapes, which are very useful in cancer research: the hazard rate function can be increasing, decreasing, bathtub or upside-down bathtub shaped. We derive some general properties of the proposed distribution, including moments, the moment generating function, quantiles and the Shannon entropy. The maximum likelihood method is used to estimate the parameters. We illustrate the applicability of the proposed distribution on complete and censored cancer data sets.
PubDate: 2018-02-03
DOI: 10.1007/s40745-018-0144-5
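The generating mechanism can be shown directly: the alpha power transformation of Mahdavi and Kundu maps a base CDF F to (alpha^F - 1)/(alpha - 1) for alpha ≠ 1. Applying it to a Weibull base CDF, as sketched below with illustrative parameter values, yields the kind of three-parameter alpha power Weibull model the paper builds on.

```python
import numpy as np

def apt_cdf(F, alpha):
    """Alpha power transformation of a base CDF value F:
    (alpha**F - 1) / (alpha - 1) for alpha != 1, else F itself."""
    F = np.asarray(F, dtype=float)
    if alpha == 1.0:
        return F
    return (alpha ** F - 1.0) / (alpha - 1.0)

def weibull_cdf(x, k, lam):
    """Weibull CDF with shape k and scale lam."""
    x = np.asarray(x, dtype=float)
    return 1.0 - np.exp(-((x / lam) ** k))

x = np.linspace(0.0, 5.0, 100)
G = apt_cdf(weibull_cdf(x, k=1.5, lam=2.0), alpha=3.0)  # alpha power Weibull
```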

• $$\ell _1$$ -Norm Based Central Point Analysis for Asymmetric Radial
Data
• Authors: Qi An; Shu-Cherng Fang; Tiantian Nie; Shan Jiang
Abstract: Multivariate asymmetric radial data clouds with irregularly positioned “spokes” and “clutters” are commonly seen in real life applications. In identifying the spoke directions of such data, a key initial step is to locate a central point from which each spoke extends and diverges. In this technical note, we propose a novel method that features a preselection procedure to screen out candidate points that have sufficiently many data points in the vicinity and identifies the central point by solving an $$\ell _1$$ -norm constrained discrete optimization program. Extensive numerical experiments show that the proposed method is capable of providing central points with superior accuracy and robustness compared with other known methods and is computationally efficient for implementation.
PubDate: 2018-01-29
DOI: 10.1007/s40745-018-0147-2
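A heavily simplified sketch of the two-step idea, replacing the paper's ℓ1-norm constrained discrete optimization with a direct search: preselect candidate points that have enough neighbors in their vicinity, then pick the candidate minimizing the total ℓ1 distance to all points. The radius, neighbor threshold and the synthetic three-spoke data are illustrative assumptions.

```python
import numpy as np

def central_point(X, radius, min_neighbors):
    """Preselect candidates with >= min_neighbors points within an l1
    radius, then return the candidate minimizing total l1 distance."""
    X = np.asarray(X, dtype=float)
    d1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # pairwise l1
    counts = (d1 <= radius).sum(axis=1) - 1   # exclude the point itself
    candidates = np.where(counts >= min_neighbors)[0]
    if candidates.size == 0:
        candidates = np.arange(len(X))        # fall back to all points
    best = candidates[np.argmin(d1[candidates].sum(axis=1))]
    return X[best]

rng = np.random.default_rng(6)
dirs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
t = np.linspace(0.0, 1.0, 40)[:, None]
# Three noisy "spokes" radiating from the origin
spokes = np.concatenate([t * d + 0.02 * rng.normal(size=(40, 2)) for d in dirs])
c = central_point(spokes, radius=0.3, min_neighbors=5)  # near the origin
```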

• Ranking of Classification Algorithms in Terms of Mean–Standard
Deviation Using A-TOPSIS
• Authors: André G. C. Pacheco; Renato A. Krohling
Abstract: In classification problems, when multiple algorithms are applied to different benchmarks a difficult issue arises: how can we rank the algorithms? In machine learning, it is common to run the algorithms several times and then compute statistics in terms of means and standard deviations. To compare the performance of the algorithms, it is very common to employ statistical tests. However, these tests may have limitations, since they consider only the means and not the standard deviations of the obtained results. In this paper, we present the so-called A-TOPSIS, based on the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), to solve the problem of ranking and comparing classification algorithms in terms of means and standard deviations. We use two case studies to illustrate A-TOPSIS for ranking classification algorithms, and the results show its suitability for this task. The presented approach can be applied to compare the performance of stochastic algorithms in machine learning. Lastly, to encourage researchers to use A-TOPSIS for ranking algorithms, we also present an easy-to-use A-TOPSIS web framework.
PubDate: 2018-01-13
DOI: 10.1007/s40745-018-0136-5
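A-TOPSIS builds on standard TOPSIS with two criteria per algorithm: mean accuracy (a benefit criterion) and standard deviation (a cost criterion). The sketch below implements plain TOPSIS closeness scores on a hypothetical mean/std decision matrix; the equal weights and the numbers are illustrative, and A-TOPSIS's specific aggregation may differ.

```python
import numpy as np

def topsis(M, weights, benefit):
    """Standard TOPSIS closeness scores; M is alternatives x criteria and
    benefit[j] is True when higher is better for criterion j."""
    M = np.asarray(M, dtype=float)
    R = M / np.linalg.norm(M, axis=0)          # vector-normalize each column
    V = R * np.asarray(weights, dtype=float)   # weighted normalized matrix
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)  # distance to ideal solution
    d_neg = np.linalg.norm(V - anti, axis=1)   # distance to anti-ideal
    return d_neg / (d_pos + d_neg)

# Hypothetical results: rows = classifiers, columns = (mean acc, std dev)
M = np.array([[0.90, 0.05],
              [0.88, 0.01],
              [0.85, 0.02]])
scores = topsis(M, weights=[0.5, 0.5], benefit=np.array([True, False]))
ranking = np.argsort(scores)[::-1]  # best classifier first
```

Here the second classifier wins: its mean is nearly the best while its standard deviation is the lowest, which is exactly the mean-vs-stability trade-off A-TOPSIS is designed to arbitrate.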

JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +00 44 (0)131 4513762
Fax: +00 44 (0)131 4513327
