Publisher: Springer-Verlag (Total: 2349 journals)

Annals of Data Science
Hybrid journal (it can contain Open Access articles)
ISSN (Print) 2198-5804 - ISSN (Online) 2198-5812
Published by Springer-Verlag
• Collective Anomaly Detection Techniques for Network Traffic Analysis
• Authors: Mohiuddin Ahmed
Pages: 497 - 512
Abstract: In certain cyber-attack scenarios, such as flooding denial of service attacks, the data distribution changes significantly. This forms a collective anomaly, where some similar kinds of normal data instances appear in abnormally large numbers. Since they are not rare anomalies, existing anomaly detection techniques cannot properly identify them. This paper investigates detecting this behaviour using existing clustering and co-clustering based techniques, and uses network traffic modelling via the Hurst parameter to propose a more effective algorithm that combines clustering with the Hurst parameter. Experimental analysis shows that the proposed Hurst parameter-based technique outperforms existing collective and rare anomaly detection techniques in terms of detection accuracy and false positive rates. The experimental results are based on benchmark datasets such as KDD Cup 1999 and UNSW-NB15.
PubDate: 2018-12-01
DOI: 10.1007/s40745-018-0149-0
Issue No: Vol. 5, No. 4 (2018)

• Assessing Survival Time of Women with Cervical Cancer Using Various
Parametric Frailty Models: A Case Study at Tikur Anbessa Specialized
• Authors: Selamawit Endale Gurmu
Pages: 513 - 527
Abstract: Cervical cancer is one of the leading causes of death in the world and represents a tremendous burden on patients, families and societies. It is estimated that over one million women worldwide currently have cervical cancer; most of them have not been diagnosed or have no access to treatment that could cure them or prolong their lives. The goal of this study is to investigate potential risk factors affecting the survival time of women with cervical cancer at Tikur Anbessa Specialized Hospital. Data were taken from the medical record cards of patients enrolled between September 2011 and September 2015. The Kaplan–Meier estimation method, the Cox proportional hazards model and parametric shared frailty models were used to analyze the survival time of cervical cancer patients. The study subjects came from clustered communities, and hence the clustered survival data are correlated at the regional level. Parametric frailty models were explored assuming that women within the same cluster (region, in this study) share similar risk factors. We used exponential, Weibull, log-logistic and log-normal distributions, and all models were compared on the basis of AIC. The lognormal inverse Gaussian model has the minimum AIC value among the models compared. The results implied that not having given birth by the end of the study and marrying after the age of twenty significantly prolonged the survival time of patients, while age classes 51–60, 61–70 and > 70, smoking cigarettes, stage III and IV disease, a family history of cervical cancer, a history of abortion and living with HIV/AIDS significantly shortened the survival time of patients. The findings of this study suggest that age, smoking cigarettes, stage, family history, abortion history, living with HIV/AIDS, age at first marriage and age at first birth were major factors affecting the survival time of patients.
There was heterogeneity between the regions in the survival time of cervical cancer patients, indicating that one needs to account for this clustering variable using frailty models. The fit statistics showed that the lognormal inverse-Gaussian frailty model described the survival time data of cervical cancer patients better than the other distributions used in this study.
PubDate: 2018-12-01
DOI: 10.1007/s40745-018-0150-7
Issue No: Vol. 5, No. 4 (2018)

• Forecasting the Volatility of Ethiopian Birr/Euro Exchange Rate Using
Garch-Type Models
• Authors: Desa Daba Fufa; Belianeh Legesse Zeleke
Pages: 529 - 547
Abstract: This paper provides a robust analysis of volatility forecasting of the Euro-ETB exchange rate using weekly data spanning the period January 3, 2000–December 2, 2015. The forecasting performance of various GARCH-type models is investigated based on forecasting performance criteria such as MSE- and MAE-based tests, and alternative measures of realized volatility. To our knowledge, this is the first study that focuses on the Euro-ETB exchange rate using high frequency data, and a range of econometric models and forecast performance criteria. The empirical results indicate that the Euro-ETB exchange rate series exhibits persistent volatility clustering over the study period. We document evidence that ARCH (8), GARCH (1, 1), EGARCH (1, 1) and GJR-GARCH (2, 2) models with normal distribution, Student’s t-distribution and GED are the best in-sample estimation models in terms of the volatility behavior of the series. Amongst these models, GJR-GARCH (2, 2) and GARCH (1, 1) with Student’s t-distribution are found to perform best in terms of one-step-ahead forecasting based on realized volatility calculated from the underlying daily data and on the squared weekly first difference of the logarithm of the series, respectively. The one-step-ahead forecasted conditional variance of the weekly Euro-ETB exchange rate shows large spikes around 2010, and it is evident that the weekly Euro-ETB exchange rate is volatile. These large spikes indicate devaluation of the Ethiopian birr against the Euro. This volatility behavior may affect foreign investment and the trade balance of the country. Therefore, GJR-GARCH (2, 2) with Student’s t-distribution is the best model among those considered, both in terms of the stylized facts and in terms of forecasting performance for the volatility of the Ethiopian Birr/Euro exchange rate.
PubDate: 2018-12-01
DOI: 10.1007/s40745-018-0151-6
Issue No: Vol. 5, No. 4 (2018)
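
As an illustration of the simplest model family named in the abstract above, the textbook GARCH(1, 1) conditional variance recursion is sigma2[t] = omega + alpha * r[t-1]^2 + beta * sigma2[t-1]. The sketch below is not the authors' code: the function name and the parameter values in the usage note are hypothetical, the returns are assumed to have zero mean, and the recursion is started at the unconditional variance, which requires alpha + beta < 1.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variances of a zero-mean GARCH(1, 1) process.

    Returns a list of len(returns) + 1 values; the last entry is the
    one-step-ahead variance forecast. Parameters are taken as given
    (hypothetical), not estimated from data.
    """
    if alpha + beta >= 1.0:
        raise ValueError("alpha + beta must be < 1 for a finite unconditional variance")
    # Assumption: start the recursion at the unconditional variance.
    sigma2 = [omega / (1.0 - alpha - beta)]
    for r in returns:
        # Yesterday's squared return and yesterday's variance drive today's variance.
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2
```

For example, `garch11_variance(weekly_log_returns, omega=0.1, alpha=0.1, beta=0.8)[-1]` would give a one-step-ahead variance forecast for a hypothetical series `weekly_log_returns`. A large squared return raises the next variance, which is the volatility clustering the abstract describes.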

• Fractal Dimension Calculation for Big Data Using Box Locality Index
• Authors: Rong Liu; Robert Rallo; Yoram Cohen
Pages: 549 - 563
Abstract: The box-counting approach for fractal dimension calculation is scaled up for big data using a data structure named box locality index (BLI). The BLI is constructed as key-value pairs, with the key indexing the location of a “box” (i.e., a grid cell in the multi-dimensional space) and the value counting the number of data points inside the box (i.e., the “box occupancy”). This key-value structure significantly simplifies the traditionally used hierarchical structure and encodes only the information required by the box-counting approach for fractal dimension calculation. Moreover, as the box occupancies (i.e., the values) associated with the same index (i.e., the key) can be aggregated, the BLI gives the box-counting approach the scalability needed for fractal dimension calculation of big data using distributed computing techniques (e.g., MapReduce and Spark). Taking advantage of the BLI, MapReduce and Spark methods for fractal dimension calculation of big data are developed, which conduct box-counting for each grid level as a cascade of MapReduce/Spark jobs in a bottom-up fashion. In an empirical validation, the MapReduce and Spark methods demonstrated good effectiveness and efficiency in fractal dimension calculation for a big synthetic dataset. In summary, this work provides an efficient solution for estimating the intrinsic dimension of big data, which is essential for many machine learning methods and data analytics tasks, including feature selection and dimensionality reduction.
PubDate: 2018-12-01
DOI: 10.1007/s40745-018-0152-5
Issue No: Vol. 5, No. 4 (2018)
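
The key-value idea in the abstract above can be sketched on a single machine. The code below is an illustrative approximation, not the authors' MapReduce/Spark implementation: a `Counter` stands in for the distributed aggregation step, all function names are hypothetical, and each coarser level is assumed to double the cell size, so a coarser box key is the finer key integer-divided by 2.

```python
import math
from collections import Counter

def box_locality_index(points, cell_size):
    """Key: integer grid coordinates of a box. Value: number of points in it."""
    return Counter(
        tuple(math.floor(c / cell_size) for c in p) for p in points
    )

def coarsen(bli):
    """Aggregate occupancies bottom-up into boxes of twice the cell size."""
    coarse = Counter()
    for key, count in bli.items():
        coarse[tuple(k // 2 for k in key)] += count
    return coarse

def box_counting_dimension(points, finest_size, levels):
    """Least-squares slope of log(occupied boxes) against log(1 / cell size)."""
    bli = box_locality_index(points, finest_size)
    xs, ys, size = [], [], finest_size
    for _ in range(levels):
        xs.append(math.log(1.0 / size))
        ys.append(math.log(len(bli)))  # number of occupied boxes at this level
        bli = coarsen(bli)
        size *= 2
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

Because occupancies for the same key simply add, `coarsen` never needs the raw points again, which is the property that lets the per-level counting be expressed as a chain of reduce-by-key jobs.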

• Transmuted Kumaraswamy Quasi Lindley Distribution with Applications
• Authors: M. Elgarhy; I. Elbatal; Muhammad Ahsan ul Haq; Amal S. Hassan
Pages: 565 - 581
Abstract: The Lindley distribution is one of the widely used models in reliability modeling. In addition, several researchers have proposed new classes of distributions based on modifications of the quasi Lindley distribution. In this article, a new generalized distribution named the transmuted Kumaraswamy quasi Lindley (TKQL) distribution is introduced. Various statistical properties of the TKQL distribution are provided. The rth moment of the TKQL distribution and its moment generating function are explored. Moreover, estimation of the model parameters is discussed via the method of maximum likelihood. Applications to real data are performed to illustrate the flexibility of the TKQL distribution in comparison with some sub-models.
PubDate: 2018-12-01
DOI: 10.1007/s40745-018-0153-4
Issue No: Vol. 5, No. 4 (2018)

• Exponentiated Power Lindley Logarithmic: Model, Properties and
Applications
Pages: 583 - 613
Abstract: A new class of lifetime distributions is proposed. Closed form expressions are provided for the density, cumulative distribution, survival and hazard rate functions. Maximum likelihood estimation is discussed and formulas for the elements of the observed information matrix are provided. Simulation studies are conducted. Finally, two real data applications are given, showing the flexibility and potential of the new distribution.
PubDate: 2018-12-01
DOI: 10.1007/s40745-018-0154-3
Issue No: Vol. 5, No. 4 (2018)

• Artificial Neural Network Classification of High Dimensional Data with
Novel Optimization Approach of Dimension Reduction
• Authors: Rabia Aziz; C. K. Verma; Namita Srivastava
Pages: 615 - 635
Abstract: Classification of high dimensional data is a crucial task in bioinformatics. Cancer classification of microarray data is a typical application of machine learning because of the large number of genes. Feature (gene) selection and classification with computational intelligence techniques play an important role in the diagnosis and prediction of disease from microarray data. The artificial neural network (ANN) is an artificial intelligence technique for classification, image processing and prediction. This paper evaluates the performance of an ANN classifier using six different hybrid feature selection techniques for gene selection of microarray data. These hybrid techniques use independent component analysis (ICA) as an extraction technique, popular filter techniques, and bio-inspired algorithms for optimization of the ICA feature vector. Five binary gene expression microarray datasets are used to compare the performance of these techniques and determine how they improve the performance of the ANN classifier. These techniques can be extremely useful in feature selection because they achieve the highest classification accuracy along with the lowest average number of selected genes. Furthermore, to check for significant differences between the algorithms, a statistical hypothesis test was employed at a certain level of confidence. The experimental results show that a combination of ICA with the genetic bee colony algorithm gives superior performance, as it heuristically removes non-contributing features to improve the performance of classifiers.
PubDate: 2018-12-01
DOI: 10.1007/s40745-018-0155-2
Issue No: Vol. 5, No. 4 (2018)

• On Optimal Progressive Censoring Schemes for Normal Distribution
• Authors: U. H. Salemi; S. Rezaei; Y. Si; S. Nadarajah
Pages: 637 - 658
Abstract: Selection of optimal progressive censoring schemes for the normal distribution is discussed according to maximum likelihood estimation and best linear unbiased estimation. The selection is based on variances of the estimators of the two parameters of the normal distribution. The extreme left censoring scheme is shown to be an optimal progressive censoring scheme. The usual type-II right censoring case is shown to be the worst progressive censoring scheme for estimating the scale parameter. It can greatly increase the variance of estimators.
PubDate: 2018-12-01
DOI: 10.1007/s40745-018-0156-1
Issue No: Vol. 5, No. 4 (2018)

• Joint Modeling of Longitudinal CD4 Count and Time-to-Death of HIV/TB
Co-infected Patients: A Case of Jimma University Specialized Hospital
• Authors: Aboma Temesgen; Abdisa Gurmesa; Yehenew Getchew
Pages: 659 - 678
Abstract: Tuberculosis (TB) and HIV have been closely linked since the emergence of AIDS; TB enhances HIV replication by accelerating the natural evolution of HIV infection, which is the leading cause of sickness and death among people living with HIV/AIDS. To improve their lives, co-infected patients are started on antiretroviral treatment (ART), and once a patient starts ART it is common to measure CD4 count and other clinical outcomes that are correlated with survival time. However, separate analysis of such data does not handle the association between the longitudinally measured outcome and the time-to-event, whereas joint modeling does, giving valid and efficient estimates of survival time. This study jointly models longitudinally measured CD4 count and time-to-death to understand their association. Furthermore, the study identifies factors affecting the mean change in square root CD4 measurement over time and risk factors for the survival time of HIV/TB co-infected patients. The study consists of 254 HIV/TB co-infected patients who were 18 years old or older and who were on antiretroviral treatment follow-up from 1 February 2009 to 1 July 2014 at Jimma University Specialized Hospital, West Ethiopia. First, data were analyzed using a linear mixed model and survival models separately. After selecting appropriate separate models using the Akaike information criterion, joint models with different random-effects structures for the longitudinal submodel and different shared-parameter association structures for the survival submodel were fitted and compared using the deviance information criterion. The linear mixed model showed that functional status, weight, linear time and quadratic time have a significant effect on the mean change of CD4 measurement over time.
The Cox and Weibull survival models showed that baseline weight, baseline smoking, the separated marital status group and baseline functional status have a significant effect on the hazard function of the survival time, whereas the joint model showed that the subject-specific baseline value and the subject-specific linear and quadratic slopes of the CD4 measurement were significantly associated with the survival time of co-infected patients at the 5% significance level. The longitudinally measured CD4 count marker process is significantly associated with time to death, and the subject-specific quadratic slope of the CD4 measurement, baseline clinical stage IV and smoking are the high risk factors that lower the survival time of HIV/TB co-infected patients. Since the longitudinally measured CD4 count is correlated with survival time, joint modeling is used to handle the association between these two processes and to obtain valid and efficient estimates of survival time.
PubDate: 2018-12-01
DOI: 10.1007/s40745-018-0157-0
Issue No: Vol. 5, No. 4 (2018)

• Statistical Inference and Optimum Life Testing Plans Under Type-II Hybrid
Censoring Scheme
• Authors: Tanmay Sen; Yogesh Mani Tripathi; Ritwik Bhattacharya
Pages: 679 - 708
Abstract: This article considers estimation of unknown parameters and prediction of future observations of a generalized exponential distribution based on Type-II hybrid censored data. Bayes point and HPD interval estimates of the unknown parameters are obtained under the assumption of independent gamma priors. Different classical and Bayesian point predictors and prediction intervals are obtained in the two-sample situation under the squared error loss function. The optimum censoring schemes are computed under various optimality criteria. Monte Carlo simulations are performed to compare the different methods, and two data sets are analyzed for illustrative purposes.
PubDate: 2018-12-01
DOI: 10.1007/s40745-018-0158-z
Issue No: Vol. 5, No. 4 (2018)

• On Some Further Properties and Application of Weibull-R Family of
Distributions
• Authors: Indranil Ghosh; Saralees Nadarajah
Pages: 387 - 399
Abstract: In this paper, we provide some new results for the Weibull-R family of distributions (Alzaghal et al. in Int J Stat Probab 5:139–149, 2016). We derive some new structural properties of the Weibull-R family of distributions. We provide various characterizations of the family via conditional moments, some functions of order statistics and via record values.
PubDate: 2018-09-01
DOI: 10.1007/s40745-018-0142-7
Issue No: Vol. 5, No. 3 (2018)

• A New Family of Generalized Distributions Based on Alpha Power
Transformation with Application to Cancer Data
• Authors: M. Nassar; A. Alzaatreh; O. Abo-Kasem; M. Mead; M. Mansoor
Pages: 421 - 436
Abstract: In this paper, we propose a new method for generating distributions based on the idea of alpha power transformation introduced by Mahdavi and Kundu (Commun Stat Theory Methods 46(13):6543–6557, 2017). The new method can be applied to any distribution by inverting its quantile function as a function of alpha power transformation. We apply the proposed method to the Weibull distribution to obtain a three-parameter alpha power within Weibull quantile function. The new distribution possesses very flexible density and hazard rate function shapes, which are very useful in cancer research. The hazard rate function can be increasing, decreasing, bathtub shaped or upside-down bathtub shaped. We derive some general properties of the proposed distribution, including moments, moment generating function, quantiles and Shannon entropy. The maximum likelihood estimation method is used to estimate the parameters. We illustrate the applicability of the proposed distribution to complete and censored cancer data sets.
PubDate: 2018-09-01
DOI: 10.1007/s40745-018-0144-5
Issue No: Vol. 5, No. 3 (2018)

• Region Based Instance Document (RID) Approach Using Compression Features
• Authors: N. V. Ganapathi Raju; Someswara Rao Chinta
Pages: 437 - 451
Abstract: Authorship attribution is concerned with identifying the authors of disputed or anonymous documents, which are potentially conspicuous in legal and criminal/civil cases, threatening letters and terroristic communications, as well as in computer forensics. There are two basic approaches to authorship attribution: instance based (each training text is treated individually) and profile based (the training texts are treated cumulatively). Both methods have their own advantages and disadvantages. The present paper proposes a new region based document model for authorship identification, to address the dimensionality problem of instance based approaches and the scalability problem of profile based approaches. The proposed model concatenates a set of ‘n’ individual instance documents of an author into a single region based instance document (RID). A compression based similarity distance method is then applied to the RID. Compression based methods require no pre-processing and are easy to apply. This paper uses the Gzip compression algorithm with two compression based similarity measures, NCD and CDM. The proposed compression model is character based and can automatically capture non-word features such as word stems and punctuation. The only disadvantage of compression models is their high complexity. The proposed RID approach addresses this issue by reducing the repeated words in the document. The approach was tested on English editorial columns. We achieved approximately 98% accuracy in identifying the author.
PubDate: 2018-09-01
DOI: 10.1007/s40745-018-0145-4
Issue No: Vol. 5, No. 3 (2018)
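
The two measures named in the abstract above have standard definitions: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) and CDM(x, y) = C(xy) / (C(x) + C(y)), where C is the compressed size. The sketch below applies them with Python's `gzip` module. It is a minimal illustration rather than the paper's pipeline: the `attribute` helper and the dictionary of per-author concatenated texts are hypothetical stand-ins for the RID step.

```python
import gzip

def csize(data: bytes) -> int:
    """Compressed size in bytes, used as a proxy for information content."""
    return len(gzip.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance; smaller means more similar."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity measure; smaller means more similar."""
    return csize(x + y) / (csize(x) + csize(y))

def attribute(unknown: bytes, rids: dict) -> str:
    """Return the author whose concatenated text is closest to `unknown` by NCD.

    `rids` maps an author name to that author's texts joined into one byte
    string (a hypothetical stand-in for a region based instance document).
    """
    return min(rids, key=lambda author: ncd(rids[author], unknown))
```

Both measures work on raw bytes, which is why no tokenization or other pre-processing is needed; the cost is one compression call per pair of texts compared.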

• Development of Optimal ANN Model to Estimate the Thermal Performance of
Roughened Solar Air Heater Using Two different Learning Algorithms
Pages: 453 - 467
Abstract: In the present study, an artificial neural network (ANN) model has been developed with two different training algorithms to predict the thermal efficiency of a wire rib roughened solar air heater. A total of 50 sets of data have been taken from experiments with three different types of absorber plate. The experimental data and calculated values of collector efficiency were used to develop the ANN model. Scaled conjugate gradient (SCG) and Levenberg–Marquardt (LM) learning algorithms were used. It has been found that TRAINLM with 6 neurons and TRAINSCG with 7 neurons are the optimal models on the basis of statistical error analysis. The performance of both models has been compared with actual data, and TRAINLM was found to perform better than TRAINSCG. The value of the coefficient of determination $$(R^{2})$$ for LM-6 is 0.99882, which indicates satisfactory performance. The proposed MLP ANN model with the LM learning algorithm seems more reliable for predicting the performance of a solar air heater.
PubDate: 2018-09-01
DOI: 10.1007/s40745-018-0146-3
Issue No: Vol. 5, No. 3 (2018)

• $$\ell_1$$-Norm Based Central Point Analysis for Asymmetric Radial
Data
• Authors: Qi An; Shu-Cherng Fang; Tiantian Nie; Shan Jiang
Pages: 469 - 486
Abstract: Multivariate asymmetric radial data clouds with irregularly positioned “spokes” and “clutters” are commonly seen in real life applications. In identifying the spoke directions of such data, a key initial step is to locate a central point from which each spoke extends and diverges. In this technical note, we propose a novel method that features a preselection procedure to screen out candidate points that have sufficiently many data points in the vicinity and identifies the central point by solving an $$\ell_1$$-norm constrained discrete optimization program. Extensive numerical experiments show that the proposed method is capable of providing central points with superior accuracy and robustness compared with other known methods and is computationally efficient for implementation.
PubDate: 2018-09-01
DOI: 10.1007/s40745-018-0147-2
Issue No: Vol. 5, No. 3 (2018)

• Enhancing Situation Awareness Using Semantic Web Technologies and Complex
Event Processing
• Authors: Havva Alizadeh Noughabi; Mohsen Kahani; Alireza Shakibamanesh
Pages: 487 - 496
Abstract: Data fusion techniques combine raw data from multiple sources and collect associated data to achieve more specific inferences than could be attained with a single source. Situational awareness is one of the levels of the JDL model, a mature information fusion model. The aim of situational awareness is to understand the developing relationships of interest between entities within a specific time and space. The present research shows how semantic web technologies, i.e. ontologies and semantic reasoners, can be used to describe situations and increase awareness of the situation. As the situation awareness level receives data streams from numerous distributed sources, it is necessary to manage these streams using data stream processing engines such as Esper. In addition, this research uses complex event processing, a technique for detecting related situations in real time whose main aim is to generate actionable abstractions from event streams automatically. The proposed approach combines complex event processing and semantic web technologies to achieve better situational awareness. To show the functionality of the proposed approach in practice, some simple examples are discussed.
PubDate: 2018-09-01
DOI: 10.1007/s40745-018-0148-1
Issue No: Vol. 5, No. 3 (2018)

• Field Assignment, Field Choice and Preference Matching of Ethiopian High
School Students
• Authors: Derbachew Asfaw; Zeytu Gashaw
Abstract: We examined the determinants of the admission of university students into their top wished-fields of study, using data from the Ethiopian National Educational Assessment and Examination Agency. The analysis is based on a 2016 cohort of 41,371 applicants in Social Science and 92,135 applicants in Natural Science who were admitted to public universities in Ethiopia. We used a binary logistic regression model applied to four broadly defined fields in Social Science streaming and found that students’ place of residence, gender, EHEECE admission grade and age have a significant positive impact on the decision process for admitting students into their top wished-fields. The results also showed that there were significant positive interaction effects of EHEECE admission grade, gender and wished-fields on the decision process. We noticed a fair selection between girls and boys into the fields of Law and of Theatrical Fine Art and Music. For girls, the odds of being admitted into the field of Other Social Science and Humanities were relatively better than the odds of being admitted into Business and Economics. We used a polytomous logit regression model applied to seven broadly defined fields in Natural Science streaming and found no selection bias between girls and boys in admitting applicants into the fields of their first and second ordered preferences, whilst there was variation among the fields ranked thereafter.
PubDate: 2018-10-03
DOI: 10.1007/s40745-018-0182-z

• A New Generalization of the Extended Exponential Distribution with an
Application
• Authors: Devendra Kumar; Manoj Kumar
Abstract: We introduce a new lifetime distribution, namely the transmuted extended exponential distribution, which generalizes the extended exponential distribution proposed by Nadarajah and Haghighi (Statistics 45:543–558, 2011) with an additional parameter using the quadratic rank transmutation map studied by Shaw and Buckley (The alchemy of probability distributions: beyond Gram-Charlier expansions, and a skew-kurtotic-normal distribution from a rank transmutation map, 2009. arXiv:0901.0434) to provide greater flexibility in modeling data from a practical point of view. In this paper, our main focus is on estimation from a frequentist point of view; nevertheless, some statistical and reliability characteristics of the model are derived. We briefly describe different estimation procedures, namely the method of maximum likelihood estimation, maximum product of spacings estimation and least squares estimation. Monte Carlo simulations are performed to compare the performance of the proposed estimation methods for both small and large samples. Finally, the potential of the model is analyzed by means of one real data set.
PubDate: 2018-09-22
DOI: 10.1007/s40745-018-0181-0
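
For reference, the construction described in the abstract above can be written out as follows. This is a sketch in assumed notation, which may differ from the paper's: $$\theta$$ denotes the transmutation parameter, $$G$$ and $$g$$ denote the baseline CDF and density, and $$\alpha$$ and $$\lambda$$ denote the parameters of the Nadarajah–Haghighi extended exponential baseline.

$$F(x) = (1+\theta)\,G(x) - \theta\,G(x)^{2}, \qquad |\theta| \le 1,$$

$$f(x) = g(x)\,\bigl[\,1 + \theta - 2\theta\,G(x)\,\bigr],$$

$$G(x) = 1 - \exp\{\,1 - (1+\lambda x)^{\alpha}\,\}, \qquad x > 0,\ \alpha > 0,\ \lambda > 0.$$

Setting $$\theta = 0$$ gives $$F = G$$, so the extended exponential distribution is recovered as a special case.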

• A Primer on a Flexible Bivariate Time Series Model for Analyzing First and
Second Half Football Goal Scores: The Case of the Big 3 London Rivals in
the EPL
• Authors: Yuvraj Sunecher; Naushad Mamode Khan; Vandna Jowaheer; Marcelo Bourguignon; Mohammad Arashi
Abstract: The ranking of some English Premier League (EPL) clubs during the football season is of keen interest to many stakeholders, with special attention to the London rivals: Arsenal, Chelsea and Tottenham. In particular, the first (GF) and second half (GS) scores, besides being inter-related, are perceived as a convenient measure of the clubs’ potential. This paper studies the contributory effects of the possible factors that commonly influence the clubs’ scoring capacity in the two halves, along with forecast diagnostic measures, via a novel flexible bivariate time series model with COM-Poisson innovations, using data from August 2014 to December 2017.
PubDate: 2018-09-11
DOI: 10.1007/s40745-018-0180-1

• Treatment Effect Decomposition and Bootstrap Hypothesis Testing in
Observational Studies
• Authors: Hee Youn Kwon; Jason J. Sauppe; Sheldon H. Jacobson
Abstract: Causal inference with observational data has drawn attention across various fields. These observational studies typically use matching methods, which find matched pairs with similar covariate values. However, matching methods may not directly achieve covariate balance, a measure of matching effectiveness. As an alternative, the Balance Optimization Subset Selection (BOSS) framework, which seeks optimal covariate balance directly, has been proposed. This paper extends BOSS by estimating and decomposing a treatment effect as a combination of heterogeneous treatment effects from a partitioned set. Our method differs from the traditional propensity score subclassification method in that we find a subset in each subclass using BOSS instead of using the stratum determined by the propensity score. Then, by conducting a bootstrap hypothesis test on each component, we check the statistical significance of these treatment effects. These methods are applied to a dataset from the National Supported Work Demonstration (NSW) program, which was conducted in the 1970s. By examining the statistical significance, we show that the program was not significantly effective for a specific subgroup composed of those who were already employed. This differs from the combined estimate: the NSW program was effective when considering all the individuals. Lastly, we provide results that are obtained when these steps are repeated with sub-samples.
PubDate: 2018-09-08
DOI: 10.1007/s40745-018-0179-7

JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +00 44 (0)131 4513762
Fax: +00 44 (0)131 4513327
