Information Retrieval
  [SJR: 0.589]   [H-I: 44]   [191 followers]
   Hybrid journal (it can contain Open Access articles)
   ISSN (Print) 1386-4564 - ISSN (Online) 1573-7659
   Published by Springer-Verlag  [2354 journals]
  • A unified score propagation model for web spam demotion algorithm
    • Authors: Xu Zhuang; Yan Zhu; Chin-Chen Chang; Qiang Peng; Faisal Khurshid
      Pages: 547 - 574
      Abstract: Web spam pages exploit the biases of search engine algorithms to get higher than their deserved rankings in search results by using several types of spamming techniques. Many web spam demotion algorithms have been developed to combat spam via the use of the web link structure, from which the goodness or badness score of each web page is evaluated. Those scores are then used to identify spam pages or punish their rankings in search engine results. However, most of the published spam demotion algorithms differ from their base models by only very limited improvements and still suffer from some common score manipulation methods. The lack of a general framework for this field makes the task of designing high-performance spam demotion algorithms very inefficient. In this paper, we propose a unified score propagation model for web spam demotion algorithms by abstracting the score propagation process of relevant models with a forward score propagation function and a backward score propagation function, each of which can further be expressed as three sub-functions: a splitting function, an accepting function and a combination function. On the basis of the proposed model, we develop two new web spam demotion algorithms named Supervised Forward and Backward score Ranking (SFBR) and Unsupervised Forward and Backward score Ranking (UFBR). Our experiments, conducted on three large-scale public datasets, show that (1) SFBR is very robust and apparently outperforms other algorithms and (2) UFBR can obtain results comparable to some well-known supervised algorithms in the spam demotion task even if the UFBR is unsupervised.
      PubDate: 2017-12-01
      DOI: 10.1007/s10791-017-9307-9
      Issue No: Vol. 20, No. 6 (2017)
  • The orchestration of a collaborative information seeking learning task
    • Authors: Simon Knight; Bart Rienties; Karen Littleton; Dirk Tempelaar; Matthew Mitsui; Chirag Shah
      Pages: 480 - 505
      Abstract: The paper describes our novel perspective on ‘searching to learn’ through collaborative information seeking (CIS). We describe this perspective, which motivated empirical work to ‘orchestrate’ a CIS searching to learn session. The work is described through the lens of orchestration, an approach which brings to the fore the ways in which: background context—including practical classroom constraints, and theoretical perspective; actors—including the educators, researchers, and technologies; and activities that are to be completed, are brought into alignment. The orchestration is exemplified through the description of research work designed to explore a pedagogically salient construct (epistemic cognition), in a particular institutional setting. Evaluation of the session indicated satisfaction with the orchestration from students, with written feedback indicating reflection from them on features of the orchestration. We foreground this approach to demonstrate the potential of orchestration as a design approach for researching and implementing CIS as a ‘searching to learn’ context.
      PubDate: 2017-10-01
      DOI: 10.1007/s10791-017-9304-z
      Issue No: Vol. 20, No. 5 (2017)
  • Neural information retrieval: introduction to the special issue
    • PubDate: 2017-11-10
      DOI: 10.1007/s10791-017-9323-9
  • Neural information retrieval: at the end of the early years
    • Abstract: A recent “third wave” of neural network (NN) approaches now delivers state-of-the-art performance in many machine learning tasks, spanning speech recognition, computer vision, and natural language processing. Because these modern NNs often comprise multiple interconnected layers, work in this area is often referred to as deep learning. Recent years have witnessed an explosive growth of research into NN-based approaches to information retrieval (IR). A significant body of work has now been created. In this paper, we survey the current landscape of Neural IR research, paying special attention to the use of learned distributed representations of textual units. We highlight the successes of neural IR thus far, catalog obstacles to its wider adoption, and suggest potentially promising directions for future research.
      PubDate: 2017-11-10
      DOI: 10.1007/s10791-017-9321-y
  • Using word embeddings in Twitter election classification
    • Abstract: Word embeddings and convolutional neural networks (CNN) have attracted extensive attention in various classification tasks for Twitter, e.g. sentiment classification. However, the effect of the configuration used to generate the word embeddings on the classification performance has not been studied in the existing literature. In this paper, using a Twitter election classification task that aims to detect election-related tweets, we investigate the impact of the background dataset used to train the embedding models, as well as the parameters of the word embedding training process, namely the context window size, the dimensionality and the number of negative samples, on the attained classification performance. By comparing the classification results of word embedding models that have been trained using different background corpora (e.g. Wikipedia articles and Twitter microposts), we show that the background data should align with the Twitter classification dataset both in data type and time period to achieve significantly better performance compared to baselines such as SVM with TF-IDF. Moreover, by evaluating the results of word embedding models trained using various context window sizes and dimensionalities, we find that large context window and dimension sizes are preferable to improve the performance. However, the number of negative samples parameter does not significantly affect the performance of the CNN classifiers. Our experimental results also show that choosing the correct word embedding model for use with CNN leads to statistically significant improvements over various baselines such as random, SVM with TF-IDF and SVM with word embeddings. Finally, for out-of-vocabulary (OOV) words that are not available in the learned word embedding models, we show that a simple OOV strategy to randomly initialise the OOV words without any prior knowledge is sufficient to attain a good classification performance among the current OOV strategies (e.g. a random initialisation using statistics of the pre-trained word embedding models).
      PubDate: 2017-11-09
      DOI: 10.1007/s10791-017-9319-5
  • A study of untrained models for multimodal information retrieval
    • Abstract: Operational multimodal information retrieval systems have to deal with increasingly complex document collections and queries that are composed of a large set of textual and non-textual modalities such as ratings, prices, timestamps, geographical coordinates, etc. The resulting combinatorial explosion of modality combinations makes it intractable to treat each modality individually and to obtain suitable training data. As a consequence, instead of finding and training new models for each individual modality or combination of modalities, it is crucial to establish unified models, and fuse their outputs in a robust way. Since the most popular weighting schemes for textual retrieval have in the past generalized well to many retrieval tasks, we demonstrate how they can be adapted to be used with non-textual modalities, which is a first step towards finding such a unified model. We demonstrate that the popular weighting scheme BM25 is suitable to be used for multimodal IR systems and analyze the underlying assumptions of the BM25 formula with respect to merging modalities under the so-called raw-score merging hypothesis, which requires no training. We establish a multimodal baseline for two multimodal test collections, show how modalities differ with respect to their contribution to relevance and the difficulty of treating modalities with overlapping information. Our experiments demonstrate that our multimodal baseline with no training achieves a significantly higher retrieval effectiveness than using just the textual modality for the social book search 2016 collection and lies in the range of a trained multimodal approach using the optimal linear combination of the modality scores.
      PubDate: 2017-11-03
      DOI: 10.1007/s10791-017-9322-x
  • Picture it in your mind: generating high level visual representations from textual descriptions
    • Authors: Fabio Carrara; Andrea Esuli; Tiziano Fagni; Fabrizio Falchi; Alejandro Moreo Fernández
      Abstract: In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the (typically huge) image collection on which the search is performed. We propose various neural network models of increasing complexity that learn to generate, from a short descriptive text, a high level visual representation in a visual feature space such as the pool5 layer of the ResNet-152 or the fc6–fc7 layers of an AlexNet trained on ILSVRC12 and Places databases. The Text2Vis models we explore include (1) a relatively simple regressor network relying on a bag-of-words representation for the textual descriptors, (2) a deep recurrent network that is sensitive to word order, and (3) a wide and deep model that combines a stacked LSTM deep network with a wide regressor network. We compare the models we propose with other search strategies, also including textual search methods that exploit state-of-the-art caption generation models to index the image collection.
      PubDate: 2017-10-14
      DOI: 10.1007/s10791-017-9318-6
  • Sequence-based context-aware music recommendation
    • Authors: Dongjing Wang; Shuiguang Deng; Guandong Xu
      Abstract: Contextual factors greatly affect users’ preferences for music, so they can benefit music recommendation and music retrieval. However, how to acquire and utilize the contextual information is still facing challenges. This paper proposes a novel approach for context-aware music recommendation, which infers users’ preferences for music, and then recommends music pieces that fit their real-time requirements. Specifically, the proposed approach first learns the low dimensional representations of music pieces from users’ music listening sequences using neural network models. Based on the learned representations, it then infers and models users’ general and contextual preferences for music from users’ historical listening records. Finally, music pieces in accordance with user’s preferences are recommended to the target user. Extensive experiments are conducted on real world datasets to compare the proposed method with other state-of-the-art recommendation methods. The results demonstrate that the proposed method significantly outperforms those baselines, especially on sparse data.
      PubDate: 2017-10-13
      DOI: 10.1007/s10791-017-9317-7
  • Session search modeling by partially observable Markov decision process
    • Authors: Grace Hui Yang; Xuchu Dong; Jiyun Luo; Sicong Zhang
      Abstract: Session search, the task of document retrieval for a series of queries in a session, has been receiving increasing attention from the information retrieval research community. Session search exhibits the properties of rich user-system interactions and temporal dependency. These properties lead to our proposal of using a partially observable Markov decision process (POMDP) to model session search. On the basis of a design choice schema for states, actions and rewards, we evaluate different combinations of these choices over the TREC 2012 and 2013 session track datasets. According to the experimental results, practical design recommendations for using POMDP in session search are discussed.
      PubDate: 2017-10-11
      DOI: 10.1007/s10791-017-9316-8
  • Retrieving and classifying instances of source code plagiarism
    • Authors: Debasis Ganguly; Gareth J. F. Jones; Aarón Ramírez-de-la-Cruz; Gabriela Ramírez-de-la-Rosa; Esaú Villatoro-Tello
      Abstract: Automatic detection of source code plagiarism is an important research field for both the commercial software industry and within the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well for large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective. We retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation model for such documents is likely to result in many false positives being retrieved, because of the use of identical programming language specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised in nature. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) on the top ranked retrieved documents. We report experiments on the SOCO-2014 dataset comprising 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., the AST based approach is able to identify a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier, trained on features extracted from sample plagiarized source code pairs, is shown to effectively filter and thus further improve the ranked list of retrieved candidate plagiarized documents.
      PubDate: 2017-09-13
      DOI: 10.1007/s10791-017-9313-y
  • Introduction to the special issue on search as learning
    • PubDate: 2017-09-09
      DOI: 10.1007/s10791-017-9315-9
  • Machine learning techniques for XML (co-)clustering by structure-constrained phrases
    • Authors: Gianni Costa; Riccardo Ortale
      Abstract: A new method is proposed for clustering XML documents by structure-constrained phrases. It is implemented by three machine-learning approaches previously unexplored in the XML domain, namely non-negative matrix (tri-)factorization, co-clustering and automatic transactional clustering. A novel class of XML features approximately captures structure-constrained phrases as n-grams contextualized by root-to-leaf paths. Experiments over real-world benchmark XML corpora show that the effectiveness of the three approaches improves with contextualized n-grams of suitable length. This confirms the validity of the devised method from multiple clustering perspectives. Two of the approaches outperform several state-of-the-art competitors in effectiveness. The scalability of the three approaches is also investigated.
      PubDate: 2017-08-04
      DOI: 10.1007/s10791-017-9314-x
  • Statistical biases in Information Retrieval metrics for recommender
    • Authors: Alejandro Bellogín; Pablo Castells; Iván Cantador
      Abstract: There is an increasing consensus in the Recommender Systems community that the dominant error-based evaluation metrics are insufficient, and mostly inadequate, to properly assess the practical effectiveness of recommendations. Seeking to evaluate recommendation rankings—which largely determine the effective accuracy in matching user needs—rather than predicted rating values, Information Retrieval metrics have started to be applied for the evaluation of recommender systems. In this paper we analyse the main issues and potential divergences in the application of Information Retrieval methodologies to recommender system evaluation, and provide a systematic characterisation of experimental design alternatives for this adaptation. We lay out an experimental configuration framework upon which we identify and analyse specific statistical biases arising in the adaptation of Information Retrieval metrics to recommendation tasks, namely sparsity and popularity biases. These biases considerably distort the empirical measurements, hindering the interpretation and comparison of results across experiments. We develop a formal characterisation and analysis of the biases upon which we analyse their causes and main factors, as well as their impact on evaluation metrics under different experimental configurations, illustrating the theoretical findings with empirical evidence. We propose two experimental design approaches that effectively neutralise such biases to a large extent. We report experiments validating our proposed experimental variants, and comparing them to alternative approaches and metrics that have been defined in the literature with similar or related purposes.
      PubDate: 2017-07-27
      DOI: 10.1007/s10791-017-9312-z
  • Product review summarization through question retrieval and
    • Authors: Mengwen Liu; Yi Fang; Alexander G. Choulos; Dae Hoon Park; Xiaohua Hu
      Abstract: Product reviews have become an important resource for customers before they make purchase decisions. However, the abundance of reviews makes it difficult for customers to digest them and make informed choices. In our study, we aim to help customers who want to quickly capture the main idea of a lengthy product review before they read the details. In contrast with existing work on review analysis and document summarization, we aim to retrieve a set of real-world user questions to summarize a review. In this way, users would know what questions a given review can address and they may further read the review only if they have similar questions about the product. Specifically, we design a two-stage approach which consists of question selection and question diversification. For question selection phase, we first employ probabilistic retrieval models to locate candidate questions that are relevant to a given review. A Recurrent Neural Network Encoder–Decoder is utilized to measure the “answerability” of questions to a review. We then design a set function to re-rank the questions with the goal of rewarding diversity in the final question set. The set function satisfies submodularity and monotonicity, which results in an efficient greedy algorithm of submodular optimization. Evaluation on product reviews from two categories shows that the proposed approach is effective for discovering meaningful questions that are representative of individual reviews.
      PubDate: 2017-07-24
      DOI: 10.1007/s10791-017-9311-0
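      The greedy selection that the abstract above associates with a monotone submodular set function can be sketched as follows. This is a minimal illustration, not the authors' method: the coverage objective (number of distinct aspects covered), the function name `greedy_select`, and the toy questions and aspects are all assumptions made for the example.

```python
def greedy_select(candidates, k):
    """Greedily pick up to k questions that maximise aspect coverage.

    candidates: dict mapping a question id to the set of aspects it covers
    (a hypothetical stand-in for a real diversity-rewarding set function).
    The coverage objective is monotone and submodular, so each step simply
    takes the question with the largest marginal gain.
    """
    chosen, covered = [], set()
    remaining = dict(candidates)
    while remaining and len(chosen) < k:
        # Marginal gain = number of aspects not yet covered by earlier picks.
        best = max(remaining, key=lambda q: len(remaining[q] - covered))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Toy data (assumed): four candidate questions about a product review.
toy_questions = {
    "q1": {"battery", "screen"},
    "q2": {"battery"},
    "q3": {"price"},
    "q4": {"screen", "battery"},
}
selected = greedy_select(toy_questions, 2)
```

      In this toy run the second pick is the question about price rather than another battery or screen question, because the latter add nothing new once `q1` is chosen. For monotone submodular objectives under a size limit, this kind of greedy procedure is known to reach at least a (1 - 1/e) fraction of the optimal value.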
  • Online searching and learning: YUM and other search tools for children and
    • Authors: Ion Madrazo Azpiazu; Nevena Dragovic; Maria Soledad Pera; Jerry Alan Fails
      Abstract: Information discovery tasks using online search tools are performed on a regular basis by school-age children. However, these tools are not necessarily designed to both explicitly facilitate the retrieval of resources these young users can comprehend and aid low-literacy searchers. This is of particular concern for educational environments, as there is an inherent expectation that these tools facilitate effective learning. In this manuscript we present an initial assessment conducted over (1) children-oriented search tools based on queries generated by K-9 students, analyzing features such as readability and adequacy of retrieved results, and (2) tools used by teachers in their classrooms, analyzing their main purpose and target audience’s age range. Among the examined tools, we include YouUnderstood.Me, an enhanced search environment, which is the result of our ongoing efforts on the development of a search environment tailored to 5-15 year-olds that can foster learning through the retrieval of materials that not only satisfy the information needs of these users but also match their reading abilities. The results of these studies highlight the fact that search results presented to children have average reading levels that do not match the target audience. In addition, tools oriented to teachers do not go beyond showing the progress of their students, and seldom provide a simple way of retrieving class contents that fit current needs of students. These facts further showcase the need for developing a dual environment oriented to both teachers and students.
      PubDate: 2017-07-21
      DOI: 10.1007/s10791-017-9310-1
  • The role of domain knowledge in cognitive modeling of information search
    • Authors: Saraschandra Karanam; Guillermo Jorge-Botana; Ricardo Olmos; Herre van Oostendorp
      Abstract: Computational cognitive models developed so far do not incorporate individual differences in domain knowledge in predicting user clicks on search result pages. We address this problem using a cognitive model of information search which enables us to use two semantic spaces having a low (non-expert semantic space) and a high (expert semantic space) amount of medical and health related information to represent respectively low and high knowledge of users in this domain. We also investigated two different processes along which one can gain a larger amount of knowledge in a domain: an evolutionary and a common core process. Simulations of model click behavior on difficult information search tasks and subsequent matching with actual behavioral data from users (divided into low and high domain knowledge groups based on a domain knowledge test) were conducted. Results showed that the efficacy of modeling for high domain knowledge participants (in terms of the number of matches between the model predictions and the actual user clicks on search result pages) was higher with the expert semantic space compared to the non-expert semantic space while for low domain knowledge participants it was the other way around. When the process of knowledge acquisition was taken into account, the effect of using a semantic space based on high domain knowledge was significant only for high domain knowledge participants, irrespective of the knowledge acquisition process. The implications of these outcomes for support tools that can be built based on these models are discussed.
      PubDate: 2017-05-24
      DOI: 10.1007/s10791-017-9308-8
  • Optimizing search results for human learning goals
    • Authors: Rohail Syed; Kevyn Collins-Thompson
      Abstract: While past research has shown that learning outcomes can be influenced by the amount of effort students invest during the learning process, there has been little research into this question for scenarios where people use search engines to learn. In fact, learning-related tasks represent a significant fraction of the time users spend using Web search, so methods for evaluating and optimizing search engines to maximize learning are likely to have broad impact. Thus, we introduce and evaluate a retrieval algorithm designed to maximize educational utility for a vocabulary learning task, in which users learn a set of important keywords for a given topic by reading representative documents on diverse aspects of the topic. Using a crowdsourced pilot study, we compare the learning outcomes of users across four conditions corresponding to rankings that optimize for different levels of keyword density. We find that adding keyword density to the retrieval objective gave significant learning gains on some topics, with higher levels of keyword density generally corresponding to more time spent reading per word, and stronger learning gains per word read. We conclude that our approach to optimizing search ranking for educational utility leads to retrieved document sets that ultimately may result in more efficient learning of important concepts.
      PubDate: 2017-05-12
      DOI: 10.1007/s10791-017-9303-0
  • Personalized Information Seeking Assistant (PiSA): from programming information seeking to learning
    • Authors: Yihan Lu; I-Han Hsiao
      Abstract: Online programming discussion forums have grown steadily and now form sizable repositories of problem-solving solutions. In this paper, we investigate programming learners’ information seeking behaviors in online discussion forums, and provide visual navigational support to facilitate information seeking. We design engines to collect students’ information seeking behaviors, and model these behaviors with sequence pattern mining techniques. The results show that programming learners do seek information from discussion forums by actively searching and reading progressively according to course schedule topics. Advanced students consistently perform query refinements, examine search results and commit to reading; novices, however, do not. Finally, according to the lessons learned, we propose, design and evaluate the Personalized Information Seeking Assistant system, which helps query refinement by summarizing the search results and provides social-based browsing history. Findings suggest that paying attention to the query history may lead to further reading events, which subsequently result in potential learning activities.
      PubDate: 2017-05-09
      DOI: 10.1007/s10791-017-9305-y
  • There’s a creepy guy on the other end at Google!: engaging middle school students in a drawing activity to elicit their mental models of Google
    • Authors: Christie Kodama; Beth St. Jean; Mega Subramaniam; Natalie Greene Taylor
      Abstract: Although youth are increasingly going online to fulfill their needs for information, many youth struggle with information and digital literacy skills, such as the abilities to conduct a search and assess the credibility of online information. Ideally, these skills encompass an accurate and comprehensive understanding of the ways in which a system, such as a Web search engine, functions. In order to investigate youths’ conceptions of the Google search engine, a drawing activity was conducted with 26 HackHealth after-school program participants to elicit their mental models of Google. The findings revealed that many participants personified Google and emphasized anthropomorphic elements, computing equipment, and/or connections (such as cables, satellites and antennas) in their drawings. Far fewer participants focused their drawings on the actual Google interface or on computer code. Overall, their drawings suggest a limited understanding of Google and the ways in which it actually works. However, an understanding of youths’ conceptions of Google can enable educators to better tailor their digital literacy instruction efforts and can inform search engine developers and search engine interface designers in making the inner workings of the engine more transparent and their output more trustworthy to young users. With a better understanding of how Google works, young users will be better able to construct effective queries, assess search results, and ultimately find relevant and trustworthy information that will be of use to them.
      PubDate: 2017-05-05
      DOI: 10.1007/s10791-017-9306-x
  • Collaborator recommendation in heterogeneous bibliographic networks using random walks
    • Authors: Xing Zhou; Lixin Ding; Zhaokui Li; Runze Wan
      Abstract: The growing popularity of collaboration among researchers and the increasing information overload in big scholarly data make it imperative to develop a collaborator recommendation system for researchers to find potential partners. Existing works usually study this task as a link prediction problem in a homogeneous network with a single object type (i.e., author) and a single link type (i.e., co-authorship). However, a real-world academic social network often involves several object types, e.g., papers, terms, and venues, as well as multiple relationships among different objects. This paper proposes an RWR-CR (standing for random walk with restart-based collaborator recommendation) algorithm in a heterogeneous bibliographic network to address this problem. First, we construct a heterogeneous network with multiple types of nodes and links with a simplified network structure by removing the citing paper nodes. Then, two importance measures are used to weight edges in the network, which will bias a random walker’s behaviors. Finally, we employ a random walk with restart to retrieve relevant authors and output an ordered recommendation list in terms of ranking scores. Experimental results on DBLP and hep-th datasets demonstrate the effectiveness of our methodology and its promising performance in collaborator prediction.
      PubDate: 2017-03-29
      DOI: 10.1007/s10791-017-9300-3
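      The generic random walk with restart named in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' RWR-CR implementation: the toy graph, its edge weights, the restart probability of 0.15, and the function names `random_walk_with_restart` and `recommend` are all hypothetical.

```python
def random_walk_with_restart(weights, source, restart=0.15, iters=100):
    """Power iteration for a random walk with restart on a weighted graph.

    weights: dict mapping node -> dict of neighbour -> positive edge weight.
    Every neighbour is assumed to also appear as a key of `weights`.
    Returns a dict mapping node -> visiting probability after `iters` steps.
    """
    nodes = list(weights)
    scores = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            total = sum(weights[u].values())
            if total == 0:
                # Node without outgoing edges: return its mass to the source.
                nxt[source] += (1 - restart) * scores[u]
                continue
            for v, w in weights[u].items():
                # Heavier edges attract a larger share of the walker's mass.
                nxt[v] += (1 - restart) * scores[u] * w / total
        # With probability `restart` the walker jumps back to the source.
        nxt[source] += restart
        scores = nxt
    return scores


def recommend(weights, source, top_n=10):
    """Rank all nodes other than the source by their walk score."""
    scores = random_walk_with_restart(weights, source)
    others = [n for n in scores if n != source]
    return sorted(others, key=lambda n: scores[n], reverse=True)[:top_n]


# Toy graph (assumed): author "a" is tied more strongly to "b" than to "c".
toy_graph = {
    "a": {"b": 2.0, "c": 1.0},
    "b": {"a": 2.0},
    "c": {"a": 1.0},
}
toy_scores = random_walk_with_restart(toy_graph, "a")
toy_ranking = recommend(toy_graph, "a")
```

      In the toy graph the heavier edge makes "b" rank above "c", which mirrors how weighted edges bias the walker toward more important connections.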