Abstract: The Hoeffding tree is a method for incrementally building decision trees. A common approach to handling numerical attributes in Hoeffding trees is to represent their sufficient statistics as Gaussian distributions. Our contribution in this paper is to prove that, when Gaussian distributions are used as sufficient statistics and misclassification error as the impurity measure, there is an analytical method to calculate the best splitting values exactly. Three different approaches for using this theorem are proposed, and all three are tested on both synthetic and real datasets. The experiments suggest that this approach can create smaller trees, learn faster, and achieve higher accuracy in most problems. PubDate: 2019-06-06
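
As a rough illustration of what "Gaussian sufficient statistics" means in a streaming setting, the sketch below keeps a running count, mean and variance per class for one numerical attribute, using Welford's online update. The class name `GaussianStats` and the toy stream are assumptions made for this example; the sketch does not reproduce the paper's theorem or its three splitting approaches.

```python
from collections import defaultdict


class GaussianStats:
    """Running Gaussian summary of one attribute (hypothetical helper)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's online update: one pass, constant memory per class.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until two values have arrived.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# Toy stream of (attribute value, class label) pairs, invented for illustration.
stream = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (10.0, "b"), (12.0, "b")]

stats = defaultdict(GaussianStats)
for value, label in stream:
    stats[label].update(value)
```

A leaf that stores one such summary per class and attribute never has to revisit past examples, which is what makes the representation attractive for incremental trees.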

Abstract: Patient discharge is one of the critical processes by which medical providers at a health facility transfer the care of a patient to another care provider after hospitalisation. The discharge plan, final clinical and physical checks, patient education, patient readiness, and general practitioner appointments play an important role in the success of this procedure. However, the process has loopholes that need to be addressed to lessen the complexity of managing it. When these are left unchecked, serious consequences and challenges may occur, such as re-hospitalisation and financial pressure. This research investigates the application of machine learning to the problem of patient discharge using a real dataset. In particular, the applicability of techniques including Decision Trees, Bayes Net, and Random Forest has been investigated in order to predict the discharge outcome of a patient after surgery. The results of the analysis show that Bayes Net performed better than Decision Tree and Random Forest in predicting the response variable (class), with respect to classification accuracy under tenfold cross validation. The target audience of this research is staff working in a healthcare facility, such as clinicians, chief medical officers, and physicians, among others. PubDate: 2019-06-05

Abstract: Over the last 10 years, soaring housing prices have raised concerns over ‘affordability’ in the Chinese housing market, although the concept is still not enshrined in agreed standards, partly because of different opinions about how it should be measured. To overcome the inadequacy of a single index, we examine the housing affordability of 35 large and medium cities in China from 2009 to 2016 using the price-to-income ratio (PIR), the monthly payment-to-income ratio (MIR) and the residual income approach (RI). Taking into account the characteristics of China’s real estate market, we re-examine the reasonable range of each index. The comparison of single indexes between cities shows significant periodicity, and multi-index clustering analysis reveals regional characteristics, both of which further the understanding of housing affordability. Finally, policy recommendations on reforming the Chinese urban housing system are suggested according to the differences and changing patterns of housing affordability among cities. PubDate: 2019-06-01
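
The three indexes named above can be sketched with generic textbook definitions. The formulas, function names and figures below are assumptions for illustration; the paper may define PIR, MIR and RI differently (for example in its choice of dwelling size, down payment or living-cost standard).

```python
def price_to_income_ratio(house_price, annual_household_income):
    """PIR: total house price divided by annual household income."""
    return house_price / annual_household_income


def monthly_payment(principal, annual_rate, years):
    """Level monthly payment on a fully amortising loan (standard annuity formula)."""
    n = years * 12
    r = annual_rate / 12
    if r == 0:
        return principal / n
    growth = (1 + r) ** n
    return principal * r * growth / (growth - 1)


def payment_to_income_ratio(payment, monthly_income):
    """MIR: monthly mortgage payment divided by monthly household income."""
    return payment / monthly_income


def residual_income(monthly_income, payment, basic_living_cost):
    """RI: income left after housing payment and basic non-housing costs."""
    return monthly_income - payment - basic_living_cost


# Hypothetical household: all figures are invented, in arbitrary currency units.
pir = price_to_income_ratio(1_000_000, 100_000)
payment = monthly_payment(principal=100_000, annual_rate=0.06, years=30)
mir = payment_to_income_ratio(payment, monthly_income=2_000)
ri = residual_income(monthly_income=2_000, payment=payment, basic_living_cost=900)
```

A negative residual income flags a household that cannot cover basic living costs after paying for housing, which a ratio alone does not show.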

Abstract: We define and study a new distribution called the Type II half logistic exponential (TIIHLE) distribution. Several mathematical properties of the TIIHLE distribution are investigated: moments, probability weighted moments, mean deviation, quantile function, and Rényi entropy. Expressions for the order statistics are derived. The parameters of the distribution are estimated using the maximum likelihood method. The importance of the proposed distribution is illustrated with two datasets. PubDate: 2019-06-01

Abstract: In this article, we introduce the inverse Gompertz distribution with two parameters. Some statistical properties are presented, such as the hazard rate function, quantiles, probability weighted moments, skewness, kurtosis, entropy functions, mean residual lifetime and mean inactive lifetime. The model parameters are estimated by the methods of maximum likelihood, bootstrap, least squares, weighted least squares and Cramér-von Mises. Further, Monte Carlo simulations are carried out to compare the long-run performance of the estimators based on complete and type II right censored data. Finally, we estimate the parameters using behavioral sciences data and censored data on the fatigue life, in hours, of 10 bearings of a certain type, which show that the model fits the data better than some competing models. PubDate: 2019-06-01

Abstract: In this paper, we introduce a new family of probability distributions generated from a power Lindley random variable, called the power Lindley-generated family. The new family extends several classical distributions and generalizes the odd Lindley family proposed by Silva et al. (Austrian J Stat 46:65–87, 2017). Some of its mathematical properties are obtained, including moments, incomplete moments, the quantile function and order statistics. Four new distributions are provided as special models of the family. The model parameters of the family are estimated by the maximum likelihood technique. An application to a real data set and a simulation study are provided to demonstrate the flexibility and interest of one special model of the suggested family. PubDate: 2019-06-01

Abstract: The Lagos annual maximum rainfall is modeled by the generalized extreme value distribution. Hydrologic risk measures such as the probability of exceedance or recurrence, the return period, and the return level are given. PubDate: 2019-06-01
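
The risk measures listed above follow directly from the fitted generalized extreme value (GEV) parameters. The sketch below uses the standard GEV quantile formula, with location `mu`, scale `sigma` and shape `xi` (positive `xi` meaning a heavy upper tail). The parameter values are hypothetical and are not the Lagos estimates, which the abstract does not report.

```python
import math


def gev_cdf(z, mu, sigma, xi):
    """P(annual maximum <= z) under a GEV(mu, sigma, xi) model."""
    if xi == 0:
        return math.exp(-math.exp(-(z - mu) / sigma))  # Gumbel limit
    t = 1 + xi * (z - mu) / sigma
    if t <= 0:
        # Outside the support: below the lower bound (xi > 0) or above the upper bound (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1 / xi))


def return_level(return_period, mu, sigma, xi):
    """Level exceeded on average once every `return_period` years."""
    y = -math.log(1 - 1 / return_period)  # 1/return_period is the exceedance probability
    if xi == 0:
        return mu - sigma * math.log(y)
    return mu + sigma / xi * (y ** (-xi) - 1)


# Hypothetical parameters (e.g. rainfall in mm), chosen only for illustration.
mu, sigma, xi = 100.0, 20.0, 0.1
level_50 = return_level(50, mu, sigma, xi)
exceedance = 1 - gev_cdf(level_50, mu, sigma, xi)  # recovers 1/50 by construction
```

Some software uses the opposite sign convention for the shape parameter, so fitted values should be checked against the convention assumed here before they are plugged in.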

Abstract: AdEater is an early browsing assistant that automatically removes advertisement images from internet pages. It works by generating rules from training data and implementing these rules when browsing the internet. Advertisement images on web pages are replaced by transparent images that display on the image the word “ad”, and where images are misclassified, non-advertisement images on a webpage will also be replaced by transparent images displaying “ad”. This paper critically examines the dataset derived from a trial of AdEater and tries to build a robust image classifier. We apply data mining techniques to uncover associations between features of advertisements and non-advertisements and try to predict whether the images are advertisements or non-advertisements based on three classification methods. We achieve classification accuracy of 96.5%, using k-fold cross validation to train and test the model. PubDate: 2019-06-01

Abstract: We introduce and study a new three-parameter lifetime distribution named the inverse power Lomax. The proposed distribution is obtained as the inverse form of the power Lomax distribution. Some statistical properties of the inverse power Lomax model are derived. Based on censored samples, maximum likelihood estimators of the model parameters are obtained. An intensive simulation study is performed to evaluate the behavior of the estimators in terms of their biases and mean square errors. The superiority of the new model over some well-known distributions is illustrated by means of real data sets. The results show that the suggested model can produce better fits than some well-known distributions. PubDate: 2019-06-01

Abstract: In this paper, we propose a new conjugate prior probability distribution for many likelihood distributions. In particular, we use the weighted Lindley distribution as a conjugate prior distribution. The weighted Lindley distribution can be viewed as a mixture of two gamma distributions with known weights. The weighted Lindley class of conjugate priors offers a more flexible class of priors than the class of gamma prior distributions. The results are illustrated for the problem of inference for Poisson and normal parameters. PubDate: 2019-06-01
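
The mixture representation and the Poisson case can be written out as follows. This is a sketch under an assumed parameterization: the commonly used two-parameter weighted Lindley density with shape c and rate θ. The paper's own notation, and its treatment of the normal case, may differ.

```latex
% Assumed weighted Lindley density, shape c > 0 and rate \theta > 0:
f(x \mid \theta, c) = \frac{\theta^{c+1}}{(\theta + c)\,\Gamma(c)}\; x^{c-1}\,(1 + x)\,e^{-\theta x}, \qquad x > 0.

% Splitting (1 + x) gives a two-component gamma mixture with known weights:
f(x \mid \theta, c) = \frac{\theta}{\theta + c}\,\mathrm{Gamma}(x \mid c, \theta)
                    + \frac{c}{\theta + c}\,\mathrm{Gamma}(x \mid c + 1, \theta).

% Poisson sample x_1, \dots, x_n with mean \lambda and prior \lambda \sim \mathrm{WL}(\theta, c):
\pi(\lambda \mid x_1, \dots, x_n) \;\propto\; \lambda^{\,c + \sum_i x_i - 1}\,(1 + \lambda)\,e^{-(\theta + n)\lambda},
\qquad\text{so}\qquad
\lambda \mid x_1, \dots, x_n \sim \mathrm{WL}\!\left(\theta + n,\; c + \textstyle\sum_i x_i\right).
```

Under this parameterization the posterior stays in the weighted Lindley family, with the same update rule for the two parameters as the gamma prior has for its shape and rate.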

Abstract: In this paper, a new extension of the Rayleigh distribution, called the Hyperbolic Sine-Rayleigh distribution, is introduced and studied. The proposed model is very flexible and is capable of modeling data with increasing and unimodal hazard rates. A comprehensive treatment of its mathematical properties is provided, including explicit expressions for the moments, quantiles, moment generating function, entropy and order statistics. Maximum likelihood estimates of the model parameters are obtained. Furthermore, a simulation study is conducted to assess the behavior of the maximum likelihood estimators. Finally, the superiority of the proposed model over other distributions is illustrated empirically by analyzing a real-life application. PubDate: 2019-06-01

Abstract: This paper presents a random projection scheme for cancelable iris recognition. Instead of using the original iris features, masked versions of the features are generated through random projection in order to increase the security of the iris recognition system. The proposed framework for iris recognition includes iris localization, sector selection of the iris to avoid eyelid and eyelash effects, normalization, segmentation of the normalized iris region into halves, selection of the upper half to further reduce eyelid and eyelash effects, feature extraction with a Gabor filter, and finally random projection. This framework guarantees exclusion of eyelid and eyelash effects, and masks the original Gabor features to increase the level of security. Matching is performed with a Hamming Distance (HD) metric. The proposed framework achieves a promising recognition rate of 99.67% and a leading Equal Error Rate (EER) of 0.58%. PubDate: 2019-06-01
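
The last two stages of the pipeline, masking by random projection and matching by Hamming distance, can be sketched as follows. The seeded Gaussian matrix, the sign binarization and all function names are assumptions made for this example, and the feature vector is a stand-in for real Gabor features; the abstract does not give the paper's actual projection or encoding.

```python
import random


def projection_matrix(rows, cols, seed):
    """User-specific random matrix; the seed plays the role of a revocable key."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project_and_binarize(features, matrix):
    """Project the feature vector and keep only the sign of each component."""
    return [
        1 if sum(w * f for w, f in zip(row, features)) >= 0 else 0
        for row in matrix
    ]


def hamming_distance(code_a, code_b):
    """Fraction of positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b)) / len(code_a)


# Stand-in feature vector; real input would come from the Gabor filtering stage.
features = [0.8, -1.2, 0.3, 2.1, -0.5, 0.9, -1.7, 0.4]

enrolled = project_and_binarize(features, projection_matrix(16, 8, seed=42))
probe = project_and_binarize(features, projection_matrix(16, 8, seed=42))
same_key_distance = hamming_distance(enrolled, probe)
```

If a stored template is compromised, issuing a new seed yields a different template from the same iris, which is the sense in which such a scheme is cancelable.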

Abstract: In this paper, our main objective is to illustrate the flexibility of the wider class of generalized gamma distributions for modeling right censored survival data. This distribution contains the commonly used gamma, Weibull, and lognormal distributions as particular cases, and this flexibility allows us to carry out model discrimination within its class to choose the lifetime distribution that provides the best fit to a given dataset. A detailed Monte Carlo simulation study is carried out to display the flexibility of the generalized distribution using the likelihood ratio test and information-based criteria. The maximum likelihood estimates of the parameters are obtained by using built-in optimization techniques available in the R statistical software. We also display the performance of the estimation technique by calculating the bias, mean square error, and coverage probabilities of the confidence intervals for different confidence levels. Finally, we illustrate the advantage of using the generalized gamma distribution with two real datasets, and we motivate the use of an extended version of the generalized gamma distribution. PubDate: 2019-05-31

Abstract: The body mass index (BMI) is calculated as weight in kilograms divided by the square of height in meters (\( \text{kg}/\text{m}^{2} \)). Its formula was developed by the Belgian statistician Adolphe Quetelet (1796–1874) and was known as the Quetelet Index (http://www.cdc.gov/healthyweight/assessing/bmi/childrens_bmi/about_childrens_bmi.htm). It provides a reliable indicator of body fatness for most people and is used to screen for weight categories that may lead to health problems. BMI is an internationally used measure of the health status of an individual. This study models longitudinal factors affecting the BMI of children under five years of age in Bahir Dar Districts using a first-order transition model. The study is based on 1900 observations (475 per visit) of children enrolled in the first four visits of a 4-year longitudinal dataset of children in Bahir Dar Districts. The first-order transition model was used to describe the relationships between children's BMI and some covariates, accounting for the correlation among the repeated observations for a given child. Time, Sachet (plump nut), age, residence, antiretroviral therapy, diarrhea and previous BMI had a statistically significant (p value < 0.05) effect on the variation in children's BMI, whereas fever, cough, mid-upper arm circumference and sex had a statistically insignificant (p value > 0.05) effect. According to the findings of this study, about 29.28% of the children were normal weight, 67% were underweight, 2.52% were overweight and only 1.21% were obese. Consequently, the study suggests that concerned bodies should focus on awareness creation to bring enough food to children under five years of age in Bahir Dar Districts, especially in rural areas. PubDate: 2019-05-23

Abstract: In this paper, we introduce a flexible modified beta linear exponential (MBLE) distribution. Our motivation, among others, is its usefulness in hydrology applications. To support such applications, we investigate a set of its statistical properties, including moments, the moment generating function, conditional moments, mean deviations, entropy, mean and variance (reversed) residual life, and maximum likelihood estimators with the observed information matrix. The distribution can accommodate both decreasing and increasing hazard rates as well as upside-down bathtub and bathtub-shaped hazard rates. Moreover, several distributions arise as special cases of the distribution. The MBLE distribution and others are fitted to two hydrology data sets. It is shown that the MBLE distribution is the best fit among the compared distributions based on nine goodness-of-fit statistics, among them the corrected Akaike information criterion, the Hannan–Quinn information criterion, and the Anderson–Darling and Kolmogorov–Smirnov p values. Consequently, some quantities for these data are obtained, such as the return level, the conditional mean, the mean deviation about the return level, and the risk of failure for designing hydraulic structures. Finally, we hope that this model will find wider applicability in hydrology and other areas. PubDate: 2019-05-18

Abstract: In the present paper, we introduce a new lifetime distribution based on the general odd hyperbolic cosine-FG model. Some important properties of the proposed model are obtained, including the survival function, quantile function, hazard function and order statistics. In addition, estimation of the unknown parameters of this model is examined from the perspective of classical and Bayesian statistics. Moreover, an example of a real data set is studied; point and interval estimates of all parameters are obtained by maximum likelihood, bootstrap (parametric and non-parametric) and Bayesian procedures. Finally, the superiority of the proposed model with an exponential parent distribution over other fundamental statistical distributions is shown via the example of real observations. PubDate: 2019-05-17

Abstract: A method for developing generalized parametric regression models for count data is proposed and studied. The method is based on the framework of the T-geometric family of distributions. A T-geometric family consists of discrete distributions that are analogues of the continuous distributions for the random variable T. The general methodology is applied to derive some generalized regression models for count data. These regression models can fit count data that are under-dispersed, equi-dispersed or over-dispersed. The extension to model truncated or inflated data is addressed. Some new generalized T-geometric regression models are applied to real-world data sets to illustrate the flexibility of the models. The models were fitted to four response variables from health care data and their performance was compared. No single regression model outperforms the other models for all four response variables. Thus, a researcher should evaluate different models before selecting a final regression model for a count response variable. PubDate: 2019-05-16

Abstract: We present results for Shannon entropy from environmental data, such as air temperature, relative humidity, rainfall and wind speed. We use hourly generated time-series hydrological model data covering the whole of Tasmania, a state of Australia, and employ concepts from statistical mechanics in our calculations. We also present enthalpy and heat capacitance equivalent quantities for the environment. The results capture interesting seasonal fluctuations in environmental parameters over time. Our results also present an indication that corresponds to a slight increase in the number of microstates due to air temperature over the duration of data considered in this work. PubDate: 2019-05-15
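
As a minimal illustration of computing Shannon entropy from an environmental time series, the sketch below bins the readings, estimates each bin's probability from its relative frequency, and sums -p log2 p. The bin width, the function name and the temperature values are assumptions for this example; the abstract does not state the paper's discretization, logarithm base or statistical-mechanics formulation.

```python
import math
from collections import Counter


def shannon_entropy(values, bin_width):
    """Entropy in bits of a series discretized into fixed-width bins."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Invented hourly air temperatures in degrees Celsius, for illustration only.
temperatures = [10.2, 10.7, 15.1, 15.9]
entropy_bits = shannon_entropy(temperatures, bin_width=1.0)
```

The result depends on the bin width, so entropies are comparable across seasons or variables only when the same discretization is used throughout.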

Abstract: Machine learning algorithms (MLAs) usually process large and complex datasets containing a substantial number of features to extract meaningful information about the target concept (a.k.a. class). In most cases, MLAs suffer from latency and computational complexity issues while processing such complex datasets due to the presence of lesser-weight (i.e., irrelevant or redundant) features. The computing time of MLAs increases explosively with the number of features, feature dependence, the number of records, the types of features, and the nested feature categories present in such datasets. Appropriate feature selection before applying an MLA is a handy way to resolve the trade-off between computing speed and accuracy when processing large and complex datasets. However, selecting features that are sufficient, necessary, and highly correlated with the target concept is very challenging. This paper presents an efficient feature selection algorithm based on random forest to improve the performance of MLAs without sacrificing the guarantees on accuracy while processing large and complex datasets. The proposed feature selection algorithm yields unique features that are closely related to the target concept (i.e., class). The proposed algorithm significantly reduces the computing time of MLAs without degrading accuracy much while learning the target concept from large and complex datasets. The simulation results support the efficacy and effectiveness of the proposed algorithm. PubDate: 2019-05-02

Abstract: Women have always faced a number of disadvantages in the labour market, and gender gaps in the workplace have not narrowed substantially in labour markets throughout the world. Many women in developing countries are domestic workers or informal factory workers, while others are unpaid workers in family enterprises and on family farms. Agriculture is the primary sector of women’s employment; Sub-Saharan Africa is among the regions with the highest proportion of women employed in the agriculture sector. This research was conducted on 274 sampled households with the objectives of determining the factors associated with women’s employment status and examining whether the parameter estimates for a logistic regression model obtained by the Bayesian and maximum likelihood estimation approaches are similar. The research revealed that about 144 (52.6%) of the sampled women were unemployed, that is, they were not involved in any income-earning activity during the data collection. The inferential analysis using both Bayesian and maximum likelihood estimation schemes indicated that pregnancy, age, education level, husband/partner occupation, marital status, family size, training opportunity and having a child less than 5 years old had a statistically significant (p < 0.05) effect on the employment status of women. The maximum likelihood estimates and the Bayesian estimates with a non-informative prior do not differ considerably. PubDate: 2019-04-30