Journal of Information Science     [SJR: 1.199]   [H-I: 35]
   Hybrid journal (it can contain Open Access articles)
   ISSN (Print) 0165-5515 - ISSN (Online) 1741-6485
   Published by Sage Publications  [756 journals]
  • A novel feature selection method for text classification using association rules and clustering
    • Authors: Sheydaei, N.; Saraee, M.; Shahgholian, A.
      Pages: 3 - 15
      Abstract: Readability and accuracy are two important features of any good classifier. For reasons such as acceptable accuracy, rapid training and high interpretability, associative classifiers have recently been used in many categorization tasks. Although features could be very useful in text classification, both training time and the number of produced rules will increase significantly owing to the high dimensionality of text documents. In this paper an association classification algorithm for text classification is proposed that includes a feature selection phase to select important features and a clustering phase based on class labels to tackle this shortcoming. The experimental results from applying the proposed algorithm in comparison with the results of selected well-known classification algorithms show that our approach outperforms others both in efficiency and in performance.
      PubDate: 2015-01-08T06:38:32-08:00
      DOI: 10.1177/0165551514550143
      Issue No: Vol. 41, No. 1 (2015)
       
  • K-State automaton burst detection model based on KOS: Emerging trends in cancer field
    • Authors: Wu, Q.; Zhang, H.; Lan, J.
      Pages: 16 - 26
      Abstract: Burst topic detection aims to extract rapidly emerging topics from large volumes of text streams, including scientific literature. Currently there are several burst models and detection algorithms based on different burst definitions, which share the common deficiency that semantic information of topics is not taken into consideration, which results in noisy bursts in identified burst topics. In this paper, a K-state automaton burst detection model based on a KOS (knowledge organization system) is proposed and applied in detecting emerging trends and burst topics in the cancer field. Experiments showed that the K-state automaton burst detection model can better represent the variety of bursts and detect burst concepts with maximal confidence. Furthermore, the application of KOS in the process of concept extraction could effectively remove noisy concepts and enhance the accuracy of identifying burst concepts.
      PubDate: 2015-01-08T06:38:32-08:00
      DOI: 10.1177/0165551514551500
      Issue No: Vol. 41, No. 1 (2015)
       
  • LDA-AdaBoost.MH: Accelerated AdaBoost.MH based on latent Dirichlet allocation for text categorization
    • Authors: Al-Salemi, B.; Ab Aziz, M. J.; Noah, S. A.
      Pages: 27 - 40
      Abstract: AdaBoost.MH is a boosting algorithm that is considered to be one of the most accurate algorithms for multilabel classification. It works by iteratively building a committee of weak hypotheses of decision stumps. To build the weak hypotheses, in each iteration, AdaBoost.MH obtains the whole extracted features and examines them one by one to check their ability to characterize the appropriate category. Using Bag-Of-Words for text representation dramatically increases the computational time of AdaBoost.MH learning, especially for large-scale datasets. In this paper we demonstrate how to improve the efficiency and effectiveness of AdaBoost.MH using latent topics, rather than words. A well-known probabilistic topic modelling method, Latent Dirichlet Allocation, is used to estimate the latent topics in the corpus as features for AdaBoost.MH. To evaluate LDA-AdaBoost.MH, the following four datasets have been used: Reuters-21578-ModApte, WebKB, 20-Newsgroups and a collection of Arabic news. The experimental results confirmed that representing the texts as a small number of latent topics, rather than a large number of words, significantly decreased the computational time of AdaBoost.MH learning and improved its performance for text categorization.
      PubDate: 2015-01-08T06:38:32-08:00
      DOI: 10.1177/0165551514551496
      Issue No: Vol. 41, No. 1 (2015)
       
  • On methods and tools of table detection, extraction and annotation in PDF documents
    • Authors: Khusro, S.; Latif, A.; Ullah, I.
      Pages: 41 - 57
      Abstract: Table detection, extraction and annotation have been an important research problem for years. To handle this issue, different approaches have been designed for different types of documents. Among these PDF is a widely used format for preserving and presenting different types of documents. We investigate the state of the art in table detection, extraction and annotation in PDF documents. Because of varying table structural anatomy, the state of the art in table-related research enumerates a number of approaches that are critically and analytically investigated for identifying their strengths and limitations as well as for making recommendations for further improvement. An evaluation framework is contributed that compares different information extraction tools that may be used in table detection, extraction and annotation. We found very limited attention towards these aspects in books, especially books in PDF format. There is no searching solution that can find books having tables that are semantically related to a table in a given book.
      PubDate: 2015-01-08T06:38:32-08:00
      DOI: 10.1177/0165551514551903
      Issue No: Vol. 41, No. 1 (2015)
       
  • Ontology alignment based on instance using NSGA-II
    • Authors: Xue, X.; Wang, Y.
      Pages: 58 - 70
      Abstract: Nowadays, ontologies are widely used to solve data heterogeneity problems on the Semantic Web. However, simple use of these ontologies may raise the heterogeneity problem to a higher level. Addressing this problem requires identification of correspondences between the entities of various ontologies. Since the real semantics of a concept is often better defined by the actual instances assigned to it, instance, as an important element of ontology, contains a great quantity of knowledge that should be utilized to obtain the ontology alignment. To this end, in this paper, we propose a novel instance-based aligning approach using NSGA-II to determine the optimal instance correspondences and a similarity propagation algorithm that makes use of various semantic relations to propagate the similarity values to other entities of ontologies. The experiment of comparing our approach with the participants of OAEI 2012 has demonstrated that our method is an effective approach that can obtain the alignment with high precision value.
      PubDate: 2015-01-08T06:38:32-08:00
      DOI: 10.1177/0165551514550142
      Issue No: Vol. 41, No. 1 (2015)
       
  • Bringing life to dead: Role of Wayback Machine in retrieving vanished URLs
    • Authors: Sampath Kumar, B. T.; Prithviraj, K. R.
      Pages: 71 - 81
      Abstract: The paper makes an attempt to examine the decay and half-life of URL citations cited in articles of conference proceedings. The main focus of the paper is to explore the possibilities of recovering inactive URL citations through the Wayback Machine. The study collected a total of 5698 URLs cited in the 1700 articles published in three Indian LIS conference proceedings published during 2001–2010. Results of the study show that only 49.91% (2844 out of 5698) of URL citations remained active whereas the remaining 2854 (50.09%) were found to have vanished. The paper argues that, as the age of URLs increases, the disappearance of URL citations also increases (r = 0.861, p = 0.003). The study also found that there was an increase in the percentage of active URLs from 2844 (49.91%) to 4506 (79.08%) after the recovery of vanished URLs through the Wayback Machine. The average half-life of URLs before the recovery of vanished URLs and after the recovery of vanished URLs was 4.94 and 14.99 years, respectively (t = –6.720, d.f. = 9, p = 0.000).
      PubDate: 2015-01-08T06:38:32-08:00
      DOI: 10.1177/0165551514552752
      Issue No: Vol. 41, No. 1 (2015)
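
The percentages quoted in the abstract above can be reproduced from the stated counts. The short sketch below only re-derives those figures; the constant and function names are illustrative and do not come from the paper, and the recovered-URL count is inferred here from the before and after totals rather than stated in the abstract.

```python
# Counts as stated in the abstract.
TOTAL_URLS = 5698      # URL citations collected
ACTIVE_BEFORE = 2844   # active before Wayback Machine recovery
ACTIVE_AFTER = 4506    # active after Wayback Machine recovery


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to 2 decimals."""
    return round(100 * part / whole, 2)


# Derived values (not stated directly in the abstract).
vanished = TOTAL_URLS - ACTIVE_BEFORE    # URLs that had vanished
recovered = ACTIVE_AFTER - ACTIVE_BEFORE  # URLs recovered via the archive

print(percent(ACTIVE_BEFORE, TOTAL_URLS))  # 49.91
print(percent(vanished, TOTAL_URLS))       # 50.09
print(percent(ACTIVE_AFTER, TOTAL_URLS))   # 79.08
```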
       
  • Hybrid string matching algorithm with a pivot
    • Authors: Al-Ssulami, A. M.
      Pages: 82 - 88
      Abstract: Pattern matching is important in text processing, molecular biology, operating systems and web search engines. Many algorithms have been developed to search for a specific pattern in a text, but the need for an efficient algorithm is an outstanding issue. In this paper, we present a simple and practical string matching algorithm. The proposed algorithm is a hybrid that combines our modification of Horspool’s algorithm with two observations on string matching. The algorithm scans the text from left to right and matches the pattern from right to left. Experimental results on natural language texts, genomes and human proteins demonstrate that the new algorithm is competitive with practical algorithms.
      PubDate: 2015-01-08T06:38:32-08:00
      DOI: 10.1177/0165551514555668
      Issue No: Vol. 41, No. 1 (2015)
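
The abstract above names Horspool's algorithm as one ingredient of the hybrid, but this listing does not describe the authors' modification or their two observations. As background only, here is a minimal sketch of the standard, unmodified Horspool search, which likewise moves through the text from left to right while comparing the pattern from right to left. The function name `horspool_search` is a hypothetical choice for illustration, and this is not the algorithm proposed in the paper.

```python
def horspool_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text.

    Standard Horspool search (NOT the paper's hybrid variant).
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Shift table: for every character in the pattern except the last,
    # the distance from its rightmost occurrence to the pattern's end.
    # Characters absent from the table shift by the full pattern length.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    matches = []
    pos = 0  # current alignment of the pattern's first character
    while pos <= n - m:
        # Compare right to left within the current window.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)
        # Slide the window using the text character under the pattern's end.
        pos += shift.get(text[pos + m - 1], m)
    return matches


print(horspool_search("abracadabra", "abra"))  # [0, 7]
```

The shift depends only on the text character aligned with the last pattern position, which is what keeps the preprocessing to a single small table.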
       
  • Modelling liking networks in an online healthcare community: An exponential random graph model analysis approach
    • Authors: Song, X.; Yan, X.; Li, Y.
      Pages: 89 - 96
      Abstract: The value and role of the Like button in social media have gained increased attention/focus, yet we know little about how liking relations form between Likers and Likeds. We study this problem in an online healthcare context from a social network perspective. Taking into account the effects of both the network structures and the attributes of Likers and Likeds, we utilize a theory-grounded statistical modelling approach, Exponential Random Graph Models (ERGMs), to model the liking network in an online healthcare community. The results of ERGM analysis reveal that, while network degree exhibits a big effect in the liking process, individual attributes like the level of past involvement and degree of activity also positively influence members’ future liking behaviour and performance. The evaluation indicates that our model is an effective method to identify the formation of liking networks. The findings extend the understanding of online liking behaviour and provide insights into harnessing the power of liking.
      PubDate: 2015-01-08T06:38:32-08:00
      DOI: 10.1177/0165551514558179
      Issue No: Vol. 41, No. 1 (2015)
       
  • Retrieving haystacks: a data driven information needs model for faceted search
    • Authors: Cleverley, P. H.; Burnett, S.
      Pages: 97 - 113
      Abstract: The research aim was to develop an understanding of information need characteristics for word co-occurrence-based search result filters (facets). No prior research has been identified into what enterprise searchers may find useful for exploratory search and why. Various word co-occurrence techniques were applied to results from sample queries performed on industry membership content. The results were used in an international survey of 54 practising petroleum engineers from 32 organizations. Subject familiarity, job role, personality and query specificity are possible causes for survey response variation. An information needs model is presented: Broad, Rich, Intriguing, Descriptive, General, Expert and Situational (BRIDGES). This may help professionals to more effectively meet their information needs and stimulate new needs, improving a system’s ability to facilitate serendipity. This research has implications for faceted search in enterprise search and digital library deployments.
      PubDate: 2015-01-08T06:38:32-08:00
      DOI: 10.1177/0165551514554522
      Issue No: Vol. 41, No. 1 (2015)
       
  • Automatic Arabic text categorization: A comprehensive comparative study
    • Authors: Hmeidi, I.; Al-Ayyoub, M.; Abdulla, N. A.; Almodawar, A. A.; Abooraig, R.; Mahyoub, N. A.
      Pages: 114 - 124
      Abstract: Text categorization or classification (TC) is concerned with placing text documents in their proper category according to their contents. Owing to the various applications of TC and the large volume of text documents uploaded on the Internet daily, the need for such an automated method stems from the difficulty and tedium of performing such a process manually. The usefulness of TC is manifested in different fields and needs. For instance, the ability to automatically classify an article or an email into its right class (Arts, Economics, Politics, Sports, etc.) would be appreciated by individual users as well as companies. This paper is concerned with TC of Arabic articles. It contains a comparison of the five best known algorithms for TC. It also studies the effects of utilizing different Arabic stemmers (light and root-based stemmers) on the effectiveness of these classifiers. Furthermore, a comparison between different data mining software tools (Weka and RapidMiner) is presented. The results illustrate the good accuracy provided by the SVM classifier, especially when used with the light10 stemmer. This outcome can be used in future as a baseline to compare with other unexplored classifiers and Arabic stemmers.
      PubDate: 2015-01-08T06:38:32-08:00
      DOI: 10.1177/0165551514558172
      Issue No: Vol. 41, No. 1 (2015)
       
 
 