Information Retrieval
  [SJR: 0.589]   [H-I: 44]
   Hybrid journal (it can contain Open Access articles)
   ISSN (Print) 1386-4564 - ISSN (Online) 1573-7659
   Published by Springer-Verlag  [2352 journals]
  • The orchestration of a collaborative information seeking learning task
    • Authors: Simon Knight; Bart Rienties; Karen Littleton; Dirk Tempelaar; Matthew Mitsui; Chirag Shah
      Pages: 480 - 505
      Abstract: The paper describes our novel perspective on ‘searching to learn’ through collaborative information seeking (CIS). We describe this perspective, which motivated empirical work to ‘orchestrate’ a CIS searching to learn session. The work is described through the lens of orchestration, an approach which brings to the fore the ways in which background context (including practical classroom constraints and theoretical perspective), actors (including the educators, researchers, and technologies), and the activities that are to be completed are brought into alignment. The orchestration is exemplified through the description of research work designed to explore a pedagogically salient construct (epistemic cognition) in a particular institutional setting. Evaluation of the session indicated satisfaction with the orchestration from students, with written feedback indicating reflection from them on features of the orchestration. We foreground this approach to demonstrate the potential of orchestration as a design approach for researching and implementing CIS as a ‘searching to learn’ context.
      PubDate: 2017-10-01
      DOI: 10.1007/s10791-017-9304-z
      Issue No: Vol. 20, No. 5 (2017)
  • Sequence-based context-aware music recommendation
    • Authors: Dongjing Wang; Shuiguang Deng; Guandong Xu
      Abstract: Contextual factors greatly affect users’ preferences for music, so they can benefit music recommendation and music retrieval. However, acquiring and utilizing contextual information remains challenging. This paper proposes a novel approach for context-aware music recommendation, which infers users’ preferences for music, and then recommends music pieces that fit their real-time requirements. Specifically, the proposed approach first learns the low dimensional representations of music pieces from users’ music listening sequences using neural network models. Based on the learned representations, it then infers and models users’ general and contextual preferences for music from users’ historical listening records. Finally, music pieces in accordance with the user’s preferences are recommended to the target user. Extensive experiments are conducted on real world datasets to compare the proposed method with other state-of-the-art recommendation methods. The results demonstrate that the proposed method significantly outperforms those baselines, especially on sparse data.
      PubDate: 2017-10-13
      DOI: 10.1007/s10791-017-9317-7
  • Session search modeling by partially observable Markov decision process
    • Authors: Grace Hui Yang; Xuchu Dong; Jiyun Luo; Sicong Zhang
      Abstract: Session search, the task of document retrieval for a series of queries in a session, has been receiving increasing attention from the information retrieval research community. Session search exhibits the properties of rich user-system interactions and temporal dependency. These properties lead to our proposal of using a partially observable Markov decision process (POMDP) to model session search. On the basis of a design choice schema for states, actions and rewards, we evaluate different combinations of these choices over the TREC 2012 and 2013 session track datasets. According to the experimental results, practical design recommendations for using POMDP in session search are discussed.
      PubDate: 2017-10-11
      DOI: 10.1007/s10791-017-9316-8
  • Retrieving and classifying instances of source code plagiarism
    • Authors: Debasis Ganguly; Gareth J. F. Jones; Aarón Ramírez-de-la-Cruz; Gabriela Ramírez-de-la-Rosa; Esaú Villatoro-Tello
      Abstract: Automatic detection of source code plagiarism is an important research field for both the commercial software industry and within the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well for large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective. We retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation model for such documents is likely to result in many false positives being retrieved, because of the use of identical programming language specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised in nature. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) on the top ranked retrieved documents. We report experiments on the SOCO-2014 dataset comprising 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., the AST based approach is able to identify a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier, trained on features extracted from sample plagiarized source code pairs, is shown to effectively filter and thus further improve the ranked list of retrieved candidate plagiarized documents.
      PubDate: 2017-09-13
      DOI: 10.1007/s10791-017-9313-y
  • Introduction to the special issue on search as learning
    • PubDate: 2017-09-09
      DOI: 10.1007/s10791-017-9315-9
  • Machine learning techniques for XML (co-)clustering by structure-constrained phrases
    • Authors: Gianni Costa; Riccardo Ortale
      Abstract: A new method is proposed for clustering XML documents by structure-constrained phrases. It is implemented by three machine-learning approaches previously unexplored in the XML domain, namely non-negative matrix (tri-)factorization, co-clustering and automatic transactional clustering. A novel class of XML features approximately captures structure-constrained phrases as n-grams contextualized by root-to-leaf paths. Experiments over real-world benchmark XML corpora show that the effectiveness of the three approaches improves with contextualized n-grams of suitable length. This confirms the validity of the devised method from multiple clustering perspectives. Two of the approaches outperform several state-of-the-art competitors in effectiveness. The scalability of the three approaches is also investigated.
      PubDate: 2017-08-04
      DOI: 10.1007/s10791-017-9314-x
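The abstract above describes features built from n-grams contextualized by root-to-leaf paths. As a rough illustration only, the sketch below pairs each word n-gram found in a leaf element with the tag path leading to that leaf. The function name `contextualized_ngrams` and the sample document are hypothetical, text in mixed-content (non-leaf) elements is ignored, and nothing here should be read as the feature extraction actually used in the paper.

```python
import xml.etree.ElementTree as ET


def contextualized_ngrams(xml_text, n=2):
    """Return (root-to-leaf path, word n-gram) pairs for every leaf element.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    root = ET.fromstring(xml_text)
    features = []

    def visit(element, path):
        path = path + [element.tag]
        children = list(element)
        if not children:
            # Leaf element: split its text into words and slide a window of size n.
            tokens = (element.text or "").split()
            for i in range(len(tokens) - n + 1):
                features.append(("/".join(path), tuple(tokens[i:i + n])))
        for child in children:
            visit(child, path)

    visit(root, [])
    return features


sample = "<book><title>deep learning methods</title><author><name>Ada</name></author></book>"
for path, gram in contextualized_ngrams(sample, n=2):
    print(path, gram)
```

Leaves with fewer than `n` words (such as `name` in the sample) contribute no features under this simplification.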
  • Statistical biases in Information Retrieval metrics for recommender
    • Authors: Alejandro Bellogín; Pablo Castells; Iván Cantador
      Abstract: There is an increasing consensus in the Recommender Systems community that the dominant error-based evaluation metrics are insufficient, and mostly inadequate, to properly assess the practical effectiveness of recommendations. Seeking to evaluate recommendation rankings (which largely determine the effective accuracy in matching user needs) rather than predicted rating values, Information Retrieval metrics have started to be applied for the evaluation of recommender systems. In this paper we analyse the main issues and potential divergences in the application of Information Retrieval methodologies to recommender system evaluation, and provide a systematic characterisation of experimental design alternatives for this adaptation. We lay out an experimental configuration framework upon which we identify and analyse specific statistical biases arising in the adaptation of Information Retrieval metrics to recommendation tasks, namely sparsity and popularity biases. These biases considerably distort the empirical measurements, hindering the interpretation and comparison of results across experiments. We develop a formal characterisation and analysis of the biases upon which we analyse their causes and main factors, as well as their impact on evaluation metrics under different experimental configurations, illustrating the theoretical findings with empirical evidence. We propose two experimental design approaches that effectively neutralise such biases to a large extent. We report experiments validating our proposed experimental variants, and comparing them to alternative approaches and metrics that have been defined in the literature with similar or related purposes.
      PubDate: 2017-07-27
      DOI: 10.1007/s10791-017-9312-z
  • Product review summarization through question retrieval and
    • Authors: Mengwen Liu; Yi Fang; Alexander G. Choulos; Dae Hoon Park; Xiaohua Hu
      Abstract: Product reviews have become an important resource for customers before they make purchase decisions. However, the abundance of reviews makes it difficult for customers to digest them and make informed choices. In our study, we aim to help customers who want to quickly capture the main idea of a lengthy product review before they read the details. In contrast with existing work on review analysis and document summarization, we aim to retrieve a set of real-world user questions to summarize a review. In this way, users would know what questions a given review can address and they may further read the review only if they have similar questions about the product. Specifically, we design a two-stage approach which consists of question selection and question diversification. In the question selection phase, we first employ probabilistic retrieval models to locate candidate questions that are relevant to a given review. A Recurrent Neural Network Encoder–Decoder is utilized to measure the “answerability” of questions to a review. We then design a set function to re-rank the questions with the goal of rewarding diversity in the final question set. The set function satisfies submodularity and monotonicity, which results in an efficient greedy algorithm of submodular optimization. Evaluation on product reviews from two categories shows that the proposed approach is effective for discovering meaningful questions that are representative of individual reviews.
      PubDate: 2017-07-24
      DOI: 10.1007/s10791-017-9311-0
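The abstract above relies on greedy optimization of a monotone submodular set function. The sketch below shows the general greedy pattern on a toy objective: the sum of per-question relevance scores plus a bonus for each newly covered topic, which is monotone and submodular. The objective, the `diversity_weight` parameter, the function name `greedy_select`, and the sample questions are all assumptions made for illustration; the paper's actual set function is not given in the abstract.

```python
def greedy_select(candidates, k, diversity_weight=1.0):
    """Greedily pick up to k questions by largest marginal gain.

    candidates maps a question to (relevance score, set of topics).
    Toy objective (an assumption, not the paper's): total relevance plus
    diversity_weight times the number of distinct topics covered.
    """
    selected = []
    covered = set()
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def gain(question):
            relevance, topics = remaining[question]
            # Marginal gain: relevance plus reward for topics not yet covered.
            return relevance + diversity_weight * len(topics - covered)

        best = max(remaining, key=gain)
        selected.append(best)
        covered |= remaining.pop(best)[1]
    return selected


questions = {
    "How long does the battery last?": (0.9, {"battery"}),
    "Is the battery replaceable?": (0.8, {"battery"}),
    "Is the screen readable outdoors?": (0.5, {"screen"}),
}
print(greedy_select(questions, k=2))
```

With the diversity bonus, the second pick moves to the screen question even though its relevance score is lower; setting `diversity_weight=0` reduces the procedure to a plain relevance ranking.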
  • Online searching and learning: YUM and other search tools for children and
    • Authors: Ion Madrazo Azpiazu; Nevena Dragovic; Maria Soledad Pera; Jerry Alan Fails
      Abstract: Information discovery tasks using online search tools are performed on a regular basis by school-age children. However, these tools are not necessarily designed to both explicitly facilitate the retrieval of resources these young users can comprehend and aid low-literacy searchers. This is of particular concern for educational environments, as there is an inherent expectation that these tools facilitate effective learning. In this manuscript we present an initial assessment conducted over (1) children-oriented search tools based on queries generated by K-9 students, analyzing features such as readability and adequacy of retrieved results, and (2) tools used by teachers in their classrooms, analyzing their main purpose and target audience’s age range. Among the examined tools, we include YouUnderstood.Me, an enhanced search environment, which is the result of our ongoing efforts on the development of a search environment tailored to 5- to 15-year-olds that can foster learning through the retrieval of materials that not only satisfy the information needs of these users but also match their reading abilities. The results of these studies highlight the fact that search results presented to children have average reading levels that do not match the target audience. In addition, tools oriented to teachers do not go beyond showing the progress of their students, and seldom provide a simple way of retrieving class contents that fit current needs of students. These facts further showcase the need for developing a dual environment oriented to both teachers and students.
      PubDate: 2017-07-21
      DOI: 10.1007/s10791-017-9310-1
  • The role of domain knowledge in cognitive modeling of information search
    • Authors: Saraschandra Karanam; Guillermo Jorge-Botana; Ricardo Olmos; Herre van Oostendorp
      Abstract: Computational cognitive models developed so far do not incorporate individual differences in domain knowledge in predicting user clicks on search result pages. We address this problem using a cognitive model of information search which enables us to use two semantic spaces having a low (non-expert semantic space) and a high (expert semantic space) amount of medical and health related information to represent respectively low and high knowledge of users in this domain. We also investigated two different processes along which one can gain a larger amount of knowledge in a domain: an evolutionary and a common core process. Simulations of model click behavior on difficult information search tasks and subsequent matching with actual behavioral data from users (divided into low and high domain knowledge groups based on a domain knowledge test) were conducted. Results showed that the efficacy of modeling for high domain knowledge participants (in terms of the number of matches between the model predictions and the actual user clicks on search result pages) was higher with the expert semantic space compared to the non-expert semantic space while for low domain knowledge participants it was the other way around. When the process of knowledge acquisition was taken into account, the effect of using a semantic space based on high domain knowledge was significant only for high domain knowledge participants, irrespective of the knowledge acquisition process. The implications of these outcomes for support tools that can be built based on these models are discussed.
      PubDate: 2017-05-24
      DOI: 10.1007/s10791-017-9308-8
  • Efficiency in information retrieval: introduction to special issue
    • Authors: David Hawking; Alistair Moffat; Andrew Trotman
      PubDate: 2017-05-20
      DOI: 10.1007/s10791-017-9309-7
  • Optimizing search results for human learning goals
    • Authors: Rohail Syed; Kevyn Collins-Thompson
      Abstract: While past research has shown that learning outcomes can be influenced by the amount of effort students invest during the learning process, there has been little research into this question for scenarios where people use search engines to learn. In fact, learning-related tasks represent a significant fraction of the time users spend using Web search, so methods for evaluating and optimizing search engines to maximize learning are likely to have broad impact. Thus, we introduce and evaluate a retrieval algorithm designed to maximize educational utility for a vocabulary learning task, in which users learn a set of important keywords for a given topic by reading representative documents on diverse aspects of the topic. Using a crowdsourced pilot study, we compare the learning outcomes of users across four conditions corresponding to rankings that optimize for different levels of keyword density. We find that adding keyword density to the retrieval objective gave significant learning gains on some topics, with higher levels of keyword density generally corresponding to more time spent reading per word, and stronger learning gains per word read. We conclude that our approach to optimizing search ranking for educational utility leads to retrieved document sets that ultimately may result in more efficient learning of important concepts.
      PubDate: 2017-05-12
      DOI: 10.1007/s10791-017-9303-0
  • Personalized Information Seeking Assistant (PiSA): from programming information seeking to learning
    • Authors: Yihan Lu; I-Han Hsiao
      Abstract: Online programming discussion forums have grown steadily and now form sizable repositories of problem-solving solutions. In this paper, we investigate programming learners’ information seeking behaviors in online discussion forums, and provide visual navigational support to facilitate information seeking. We design engines to collect students’ information seeking behaviors, and model these behaviors with sequence pattern mining techniques. The results show that programming learners do seek information from discussion forums by actively searching and reading progressively according to course schedule topics. Advanced students consistently perform query refinements, examine search results and commit to reading; novices, however, do not. Finally, according to the lessons learned, we propose, design and evaluate the Personalized Information Seeking Assistant system to help query refinement by summarizing the search results and to provide social-based browsing history. Findings suggest that paying attention to the query history may lead to further reading events, which subsequently result in potential learning activities.
      PubDate: 2017-05-09
      DOI: 10.1007/s10791-017-9305-y
  • Validating simulated interaction for retrieval evaluation
    • Authors: Teemu Pääkkönen; Jaana Kekäläinen; Heikki Keskustalo; Leif Azzopardi; David Maxwell; Kalervo Järvelin
      Abstract: A searcher’s interaction with a retrieval system consists of actions such as query formulation, search result list interaction and document interaction. The simulation of searcher interaction has recently gained momentum in the analysis and evaluation of interactive information retrieval (IIR). However, a key issue that has not yet been adequately addressed is the validity of such IIR simulations and whether they reliably predict the performance obtained by a searcher across the session. The aim of this paper is to determine the validity of the common interaction model (CIM) typically used for simulating multi-query sessions. We focus on search result interactions, i.e., inspecting snippets, examining documents and deciding when to stop examining the results of a single query, or when to stop the whole session. To this end, we run a series of simulations grounded by real world behavioral data to show how accurate and responsive the model is to various experimental conditions under which the data were produced. We then validate on a second real world data set derived under similar experimental conditions. We seek to predict cumulated gain across the session. We find that the interaction model with a query-level stopping strategy based on consecutive non-relevant snippets leads to the highest prediction accuracy, and lowest deviation from ground truth, around 9 to 15% depending on the experimental conditions. To our knowledge, the present study is the first validation effort of the CIM that shows that the model’s acceptance and use is justified within IIR evaluations. We also identify and discuss ways to further improve the CIM and its behavioral parameters for more accurate simulations.
      PubDate: 2017-05-06
      DOI: 10.1007/s10791-017-9301-2
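The abstract above mentions a query-level stopping strategy based on consecutive non-relevant snippets. A minimal sketch of one such rule follows: the simulated searcher scans snippets in rank order and abandons the query once a run of non-relevant snippets reaches a threshold. The function name `snippets_examined` and the threshold `max_nonrelevant_run` are hypothetical, and the paper's model includes further behavioral parameters that are not reproduced here.

```python
def snippets_examined(snippet_relevance, max_nonrelevant_run=3):
    """Return how many snippets are scanned before the query is abandoned.

    snippet_relevance is a rank-ordered list of booleans (True = relevant).
    Scanning stops after max_nonrelevant_run consecutive non-relevant
    snippets, or at the end of the result list. Illustrative rule only.
    """
    run = 0
    for examined, relevant in enumerate(snippet_relevance, start=1):
        # A relevant snippet resets the run; a non-relevant one extends it.
        run = 0 if relevant else run + 1
        if run >= max_nonrelevant_run:
            return examined
    return len(snippet_relevance)


ranking = [True, False, False, True, False, False, False, True]
print(snippets_examined(ranking, max_nonrelevant_run=3))
```

In this toy ranking the searcher stops at the seventh snippet and never sees the relevant result at rank eight, which is the kind of effect a stopping parameter has on cumulated gain.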
  • There’s a creepy guy on the other end at Google!: engaging middle school students in a drawing activity to elicit their mental models of Google
    • Authors: Christie Kodama; Beth St. Jean; Mega Subramaniam; Natalie Greene Taylor
      Abstract: Although youth are increasingly going online to fulfill their needs for information, many youth struggle with information and digital literacy skills, such as the abilities to conduct a search and assess the credibility of online information. Ideally, these skills encompass an accurate and comprehensive understanding of the ways in which a system, such as a Web search engine, functions. In order to investigate youths’ conceptions of the Google search engine, a drawing activity was conducted with 26 HackHealth after-school program participants to elicit their mental models of Google. The findings revealed that many participants personified Google and emphasized anthropomorphic elements, computing equipment, and/or connections (such as cables, satellites and antennas) in their drawings. Far fewer participants focused their drawings on the actual Google interface or on computer code. Overall, their drawings suggest a limited understanding of Google and the ways in which it actually works. However, an understanding of youths’ conceptions of Google can enable educators to better tailor their digital literacy instruction efforts and can inform search engine developers and search engine interface designers in making the inner workings of the engine more transparent and their output more trustworthy to young users. With a better understanding of how Google works, young users will be better able to construct effective queries, assess search results, and ultimately find relevant and trustworthy information that will be of use to them.
      PubDate: 2017-05-05
      DOI: 10.1007/s10791-017-9306-x
  • Identifying top relevant dates for implicit time sensitive queries
    • Authors: Ricardo Campos; Gaël Dias; Alípio Mário Jorge; Célia Nunes
      Abstract: Despite a clear improvement of search and retrieval temporal applications, current search engines are still mostly unaware of the temporal dimension. Indeed, in most cases, systems are limited to offering the user the chance to restrict the search to a particular time period or to simply rely on an explicitly specified time span. If the user is not explicit in his/her search intents (e.g., “philip seymour hoffman”) search engines may likely fail to present an overall historic perspective of the topic. In most such cases, they are limited to retrieving the most recent results. One possible solution to this shortcoming is to understand the different time periods of the query. In this context, most state-of-the-art methodologies consider any occurrence of temporal expressions in web documents and other web data as equally relevant to an implicit time sensitive query. To approach this problem in a more adequate manner, we propose in this paper the detection of temporal expressions that are relevant to the query. Unlike previous metadata and query log-based approaches, we show how to achieve this goal based on information extracted from document content. However, instead of simply focusing on the detection of the most obvious date we are also interested in retrieving the set of dates that are relevant to the query. Towards this goal, we define a general similarity measure that makes use of co-occurrences of words and years based on corpus statistics and a classification methodology that is able to identify the set of top relevant dates for a given implicit time sensitive query, while filtering out the non-relevant ones. Through extensive experimental evaluation over several baselines on web corpora collections, we demonstrate that our approach offers promising results in the field of temporal information retrieval (T-IR).
      PubDate: 2017-05-05
      DOI: 10.1007/s10791-017-9302-1
  • Collaborator recommendation in heterogeneous bibliographic networks using random walks
    • Authors: Xing Zhou; Lixin Ding; Zhaokui Li; Runze Wan
      Abstract: The growing popularity of collaboration among researchers and the increasing information overload in big scholarly data make it imperative to develop a collaborator recommendation system for researchers to find potential partners. Existing works always study this task as a link prediction problem in a homogeneous network with a single object type (i.e., author) and a single link type (i.e., co-authorship). However, a real-world academic social network often involves several object types, e.g., papers, terms, and venues, as well as multiple relationships among different objects. This paper proposes an RWR-CR (standing for random walk with restart-based collaborator recommendation) algorithm in a heterogeneous bibliographic network to address this problem. First, we construct a heterogeneous network with multiple types of nodes and links with a simplified network structure by removing the citing paper nodes. Then, two importance measures are used to weight edges in the network, which will bias a random walker’s behaviors. Finally, we employ a random walk with restart to retrieve relevant authors and output an ordered recommendation list in terms of ranking scores. Experimental results on DBLP and hep-th datasets demonstrate the effectiveness of our methodology and its promising performance in collaborator prediction.
      PubDate: 2017-03-29
      DOI: 10.1007/s10791-017-9300-3
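The abstract above ranks candidate collaborators with a random walk with restart over a weighted network. The sketch below shows the generic technique by power iteration on a small weighted graph. The function name `random_walk_with_restart`, the `restart` probability of 0.15, the fixed iteration count, and the toy graph and its weights are all assumptions for illustration; the paper's heterogeneous network construction and its two edge importance measures are not reproduced.

```python
def random_walk_with_restart(graph, start, restart=0.15, iterations=100):
    """Approximate random-walk-with-restart scores by power iteration.

    graph maps each node to a dict {neighbor: positive edge weight}; every
    neighbor must also appear as a key. Generic sketch, not the RWR-CR
    algorithm itself.
    """
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            neighbors = graph[node]
            total = sum(neighbors.values())
            if total == 0:
                # Dangling node: send its walking mass back to the start node.
                updated[start] += (1 - restart) * mass
                continue
            for neighbor, weight in neighbors.items():
                # Walk along an edge with probability proportional to its weight.
                updated[neighbor] += (1 - restart) * mass * weight / total
        # With probability `restart` the walker jumps back to the start node.
        updated[start] += restart
        scores = updated
    return scores


toy_graph = {
    "alice": {"bob": 3.0, "carol": 1.0},
    "bob": {"alice": 3.0},
    "carol": {"alice": 1.0},
}
scores = random_walk_with_restart(toy_graph, "alice")
ranking = sorted((n for n in scores if n != "alice"), key=scores.get, reverse=True)
print(ranking)
```

Heavier edges attract more of the walker's probability mass, so in the toy graph `bob` ranks above `carol` for the start node `alice`.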
  • Waves: a fast multi-tier top-k query processing algorithm
    • Authors: Caio Moura Daoud; Edleno Silva de Moura; David Fernandes; Altigran Soares da Silva; Cristian Rossi; Andre Carvalho
      Abstract: In this paper, we present Waves, a novel document-at-a-time algorithm for fast computation of top-k query results in search systems. The Waves algorithm uses multi-tier indexes for processing queries. It performs successive tentative evaluations of results which we call waves. Each wave traverses the index, starting from a specific tier level i. Each wave i may insert only those documents that occur in that tier level into the answer. After processing a wave, the algorithm checks whether the answer achieved might be changed by successive waves or not. A new wave is started only if it has a chance of changing the top-k scores. We show through experiments that such a lazy query processing strategy results in smaller query processing times when compared to previous approaches proposed in the literature. We present experiments to compare Waves’ performance to the state-of-the-art document-at-a-time query processing methods that preserve top-k results and show scenarios where the method can be a good alternative algorithm for computing top-k results.
      PubDate: 2017-03-13
      DOI: 10.1007/s10791-017-9298-6
  • Performance improvements for search systems using an integrated cache of lists + intersections
    • Authors: Gabriel Tolosa; Esteban Feuerstein; Luca Becchetti; Alberto Marchetti-Spaccamela
      Abstract: Modern information retrieval systems use several levels of caching to speed up computation by exploiting frequent, recent or costly data used in the past. Previous studies show that the use of caching techniques is crucial in search engines, as it helps reduce query response times and processing workloads on search servers. In this work we propose and evaluate a static cache that acts simultaneously as list and intersection cache, offering a more efficient way of handling cache space. We also use a query resolution strategy that takes advantage of the existence of this cache to reorder the query execution sequence. In addition, we propose effective strategies to select the term pairs that should populate the cache. We also represent the data in cache in both raw and compressed forms and evaluate the differences between them using different configurations of cache sizes. The results show that the proposed Integrated Cache outperforms the standard posting lists cache in most of the cases, taking advantage not only of the intersection cache but also of the query resolution strategy.
      PubDate: 2017-03-11
      DOI: 10.1007/s10791-017-9299-5
  • The role of index compression in score-at-a-time query evaluation
    • Authors: Jimmy Lin; Andrew Trotman
      Abstract: This paper explores the performance of top-k document retrieval with score-at-a-time query evaluation on impact-ordered indexes in main memory. To better understand execution efficiency in the context of modern processor architectures, we examine the role of index compression on query evaluation latency. Experiments include compressing postings with variable byte encoding, Simple-8b, variants of the QMX compression scheme, as well as a condition that is less often considered: no compression. Across four web test collections, we find that the highest query evaluation speed is achieved by simply leaving the postings lists uncompressed, although the performance advantage over a state-of-the-art compression scheme is relatively small and the index is considerably larger. We explain this finding in terms of the design of modern processor architectures: Index segments with high impact scores are usually short and inherently benefit from cache locality. Index segments with lower impact scores may be quite long, but modern architectures have sufficient memory bandwidth (coupled with prefetching) to “keep up” with the processor. Our results highlight the importance of “architecture affinity” when designing high-performance search engines.
      PubDate: 2017-01-25
      DOI: 10.1007/s10791-016-9291-5
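Variable byte encoding, one of the compression schemes named in the abstract above, stores each integer in as many bytes as it needs, using seven payload bits per byte and one flag bit. The sketch below is one common variant, in which the high bit marks the final byte of each integer; the function names `vbyte_encode` and `vbyte_decode` are hypothetical and the byte layout is an assumption, not necessarily the one evaluated in the paper.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers with 7 payload bits per byte.

    Assumed convention: the high bit is set on the last byte of each integer.
    """
    out = bytearray()
    for number in numbers:
        if number < 0:
            raise ValueError("variable byte encoding needs non-negative integers")
        chunk = []
        while True:
            chunk.append(number & 0x7F)  # take the low 7 bits
            number >>= 7
            if number == 0:
                break
        chunk.reverse()    # emit the most significant group first
        chunk[-1] |= 0x80  # flag the terminating byte
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    """Invert vbyte_encode, returning the list of integers."""
    numbers = []
    current = 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:  # terminating byte reached
            numbers.append(current)
            current = 0
    return numbers


gaps = [5, 130, 16384]
encoded = vbyte_encode(gaps)
print(len(encoded), vbyte_decode(encoded) == gaps)
```

Small values, which dominate gap-encoded postings lists, take a single byte, while larger values grow one byte per additional seven bits.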