Information Retrieval
  [SJR: 0.589]   [H-I: 44]
   Hybrid journal (it can contain Open Access articles)
   ISSN (Print) 1386-4564 - ISSN (Online) 1573-7659
   Published by Springer-Verlag  [2355 journals]
  • A language model-based framework for multi-publisher content-based recommender systems
    • Authors: Hamed Zamani; Azadeh Shakery
      Abstract: The rapid growth of the Web has increased the difficulty of finding the information that can address the users’ information needs. A number of recommendation approaches have been developed to tackle this problem. The increase in the number of data providers has necessitated the development of multi-publisher recommender systems; systems that include more than one item/data provider. In such environments, preserving the privacy of both publishers and subscribers is a key and challenging point. In this paper, we propose a multi-publisher framework for recommender systems based on a client–server architecture, which preserves the privacy of both data providers and subscribers. We develop our framework as a content-based filtering system using the statistical language modeling framework. We also introduce AUTO, a simple yet effective threshold optimization algorithm, to find a dissemination threshold for making acceptance and rejection decisions for new published documents. We further propose a language model sketching technique to reduce the network traffic between servers and clients in the proposed framework. Extensive experiments using the TREC-9 Filtering Track and the CLEF 2008-09 INFILE Track collections indicate the effectiveness of the proposed models in both single- and multi-publisher settings.
      PubDate: 2018-02-06
      DOI: 10.1007/s10791-018-9327-0
  • An artist ranking system based on social media mining
    • Authors: Amalia F. Foka
      Abstract: Users on social media currently post their opinions and feelings about almost everything. This online behavior has led to numerous applications where social media data are used to measure public opinion in a similar way as a poll or a survey. In this paper, we will present an application of social media mining for the art market. To the best of our knowledge, this will be the first attempt to mine social media to extract quantitative and qualitative data for the art market. Although there are previous works on analyzing and predicting other markets, these methodologies cannot be applied directly to the art market. In our proposed methodology, artists will be treated as brands. That is, we will mine Twitter posts that mention specific artists’ names and attempt to rank artists in a similar manner as brand equity and awareness would be measured. The particularities of the art market are considered mainly in the construction of a topic-specific user network where user expertise and influence are evaluated and later used to rank artists. The proposed ranking system is evaluated against two other available systems to identify the advantages it can offer.
      PubDate: 2018-02-05
      DOI: 10.1007/s10791-018-9328-z
  • Hybrid query expansion model for text and microblog information retrieval
    • Authors: Meriem Amina Zingla; Chiraz Latiri; Philippe Mulhem; Catherine Berrut; Yahya Slimani
      Abstract: Query expansion (QE) is an important process in information retrieval applications that improves the user query and helps in retrieving relevant results. In this paper, we introduce a hybrid query expansion model (HQE) that investigates how external resources can be combined with association rules mining and used to enhance expansion terms generation and selection. The HQE model can be processed in different configurations, starting from methods based on association rules and combining them with external knowledge. The HQE model handles the two main phases of a QE process, namely the candidate terms generation phase and the selection phase. For the first phase, we propose statistical, semantic and conceptual methods to generate new related terms for a given query. For the second phase, we introduce a similarity measure, ESAC, based on Explicit Semantic Analysis, that computes the relatedness between a query and the set of candidate terms. The performance of the proposed HQE model is evaluated within two experimental validations. The first one addresses the tweet search task proposed by TREC Microblog Track 2011 and an ad-hoc IR task related to the hard topics of the TREC Robust 2004. The second experimental validation concerns the tweet contextualization task organized by INEX 2014. Global results highlighted the effectiveness of our HQE model and of association rules mining for QE combined with external resources.
      PubDate: 2018-02-03
      DOI: 10.1007/s10791-017-9326-6
  • A unified score propagation model for web spam demotion algorithm
    • Authors: Xu Zhuang; Yan Zhu; Chin-Chen Chang; Qiang Peng; Faisal Khurshid
      Pages: 547 - 574
      Abstract: Web spam pages exploit the biases of search engine algorithms to get higher than their deserved rankings in search results by using several types of spamming techniques. Many web spam demotion algorithms have been developed to combat spam via the use of the web link structure, from which the goodness or badness score of each web page is evaluated. Those scores are then used to identify spam pages or punish their rankings in search engine results. However, most of the published spam demotion algorithms differ from their base models by only very limited improvements and still suffer from some common score manipulation methods. The lack of a general framework for this field makes the task of designing high-performance spam demotion algorithms very inefficient. In this paper, we propose a unified score propagation model for web spam demotion algorithms by abstracting the score propagation process of relevant models with a forward score propagation function and a backward score propagation function, each of which can further be expressed as three sub-functions: a splitting function, an accepting function and a combination function. On the basis of the proposed model, we develop two new web spam demotion algorithms named Supervised Forward and Backward score Ranking (SFBR) and Unsupervised Forward and Backward score Ranking (UFBR). Our experiments, conducted on three large-scale public datasets, show that (1) SFBR is very robust and apparently outperforms other algorithms and (2) UFBR can obtain results comparable to some well-known supervised algorithms in the spam demotion task even if the UFBR is unsupervised.
      PubDate: 2017-12-01
      DOI: 10.1007/s10791-017-9307-9
      Issue No: Vol. 20, No. 6 (2017)
  • EveTAR: building a large-scale multi-task test collection over Arabic tweets
    • Authors: Maram Hasanain; Reem Suwaileh; Tamer Elsayed; Mucahid Kutlu; Hind Almerekhi
      Abstract: This article introduces a new language-independent approach for creating a large-scale high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevancy and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology on Arabic tweets resulted in EveTAR, the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets.
      PubDate: 2017-12-21
      DOI: 10.1007/s10791-017-9325-7
  • Clustering small-sized collections of short texts
    • Authors: Lili Kotlerman; Ido Dagan; Oren Kurland
      Abstract: The need to cluster small text corpora composed of a few hundred short texts arises in various applications; e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging due to the vocabulary mismatch between short texts and the insufficient corpus-based statistics (e.g., term co-occurrence statistics) due to the corpus size. We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach with respect to applying clustering algorithms directly on the text corpus, and using state-of-the-art co-clustering and topic modeling methods.
      PubDate: 2017-11-30
      DOI: 10.1007/s10791-017-9324-8
  • Website replica detection with distant supervision
    • Authors: Cristiano Carvalho; Edleno Silva de Moura; Adriano Veloso; Nivio Ziviani
      Abstract: Duplicate content on the Web occurs within the same website or across multiple websites. The latter is mainly associated with the existence of website replicas, i.e., sites that are perceptibly similar. Replication may be accidental, intentional or malicious, but no matter the reason, search engines suffer greatly either from unnecessarily storing and moving duplicate data, or from providing search results that do not offer real value to the users. In this paper, we model the detection of website replicas as a pairwise classification problem with distant supervision. That is, (heuristically) finding obvious replica and non-replica cases is trivial, but learning effective classifiers requires a representative set of non-obvious labeled examples, which are hard to obtain. We employ efficient Expectation-Maximization (EM) algorithms in order to find non-obvious examples from obvious ones, enlarging the training set and improving the classifiers iteratively. Our classifiers employ association rules and are thus incrementally updated as the EM process iterates, making our algorithms time-efficient. Experiments show that: (1) replicas are fully eliminated at a false-positive rate lower than 0.005, yielding a 19% reduction in the number of duplicate URLs, (2) the reduction increases to 21% by using our site-level algorithms in conjunction with existing URL-level algorithms, and (3) our classifiers are more than two orders of magnitude faster than semi-supervised alternative solutions.
      PubDate: 2017-11-29
      DOI: 10.1007/s10791-017-9320-z
  • Neural information retrieval: introduction to the special issue
    • PubDate: 2017-11-10
      DOI: 10.1007/s10791-017-9323-9
  • Neural information retrieval: at the end of the early years
    • Abstract: A recent “third wave” of neural network (NN) approaches now delivers state-of-the-art performance in many machine learning tasks, spanning speech recognition, computer vision, and natural language processing. Because these modern NNs often comprise multiple interconnected layers, work in this area is often referred to as deep learning. Recent years have witnessed an explosive growth of research into NN-based approaches to information retrieval (IR). A significant body of work has now been created. In this paper, we survey the current landscape of Neural IR research, paying special attention to the use of learned distributed representations of textual units. We highlight the successes of neural IR thus far, catalog obstacles to its wider adoption, and suggest potentially promising directions for future research.
      PubDate: 2017-11-10
      DOI: 10.1007/s10791-017-9321-y
  • Using word embeddings in Twitter election classification
    • Abstract: Word embeddings and convolutional neural networks (CNN) have attracted extensive attention in various classification tasks for Twitter, e.g. sentiment classification. However, the effect of the configuration used to generate the word embeddings on the classification performance has not been studied in the existing literature. In this paper, using a Twitter election classification task that aims to detect election-related tweets, we investigate the impact of the background dataset used to train the embedding models, as well as the parameters of the word embedding training process, namely the context window size, the dimensionality and the number of negative samples, on the attained classification performance. By comparing the classification results of word embedding models that have been trained using different background corpora (e.g. Wikipedia articles and Twitter microposts), we show that the background data should align with the Twitter classification dataset both in data type and time period to achieve significantly better performance compared to baselines such as SVM with TF-IDF. Moreover, by evaluating the results of word embedding models trained using various context window sizes and dimensionalities, we find that large context window and dimension sizes are preferable to improve the performance. However, the number of negative samples parameter does not significantly affect the performance of the CNN classifiers. Our experimental results also show that choosing the correct word embedding model for use with CNN leads to statistically significant improvements over various baselines such as random, SVM with TF-IDF and SVM with word embeddings. Finally, for out-of-vocabulary (OOV) words that are not available in the learned word embedding models, we show that a simple OOV strategy to randomly initialise the OOV words without any prior knowledge is sufficient to attain a good classification performance among the current OOV strategies (e.g. a random initialisation using statistics of the pre-trained word embedding models).
      PubDate: 2017-11-09
      DOI: 10.1007/s10791-017-9319-5
  • A study of untrained models for multimodal information retrieval
    • Abstract: Operational multimodal information retrieval systems have to deal with increasingly complex document collections and queries that are composed of a large set of textual and non-textual modalities such as ratings, prices, timestamps, geographical coordinates, etc. The resulting combinatorial explosion of modality combinations makes it intractable to treat each modality individually and to obtain suitable training data. As a consequence, instead of finding and training new models for each individual modality or combination of modalities, it is crucial to establish unified models, and fuse their outputs in a robust way. Since the most popular weighting schemes for textual retrieval have in the past generalized well to many retrieval tasks, we demonstrate how they can be adapted to be used with non-textual modalities, which is a first step towards finding such a unified model. We demonstrate that the popular weighting scheme BM25 is suitable to be used for multimodal IR systems and analyze the underlying assumptions of the BM25 formula with respect to merging modalities under the so-called raw-score merging hypothesis, which requires no training. We establish a multimodal baseline for two multimodal test collections, show how modalities differ with respect to their contribution to relevance and the difficulty of treating modalities with overlapping information. Our experiments demonstrate that our multimodal baseline with no training achieves a significantly higher retrieval effectiveness than using just the textual modality for the social book search 2016 collection and lies in the range of a trained multimodal approach using the optimal linear combination of the modality scores.
      PubDate: 2017-11-03
      DOI: 10.1007/s10791-017-9322-x
  • Picture it in your mind: generating high level visual representations from textual descriptions
    • Authors: Fabio Carrara; Andrea Esuli; Tiziano Fagni; Fabrizio Falchi; Alejandro Moreo Fernández
      Abstract: In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the (typically huge) image collection on which the search is performed. We propose various neural network models of increasing complexity that learn to generate, from a short descriptive text, a high level visual representation in a visual feature space such as the pool5 layer of the ResNet-152 or the fc6–fc7 layers of an AlexNet trained on ILSVRC12 and Places databases. The Text2Vis models we explore include (1) a relatively simple regressor network relying on a bag-of-words representation for the textual descriptors, (2) a deep recurrent network that is sensitive to word order, and (3) a wide and deep model that combines a stacked LSTM deep network with a wide regressor network. We compare the models we propose with other search strategies, also including textual search methods that exploit state-of-the-art caption generation models to index the image collection.
      PubDate: 2017-10-14
      DOI: 10.1007/s10791-017-9318-6
  • Sequence-based context-aware music recommendation
    • Authors: Dongjing Wang; Shuiguang Deng; Guandong Xu
      Abstract: Contextual factors greatly affect users’ preferences for music, so they can benefit music recommendation and music retrieval. However, acquiring and utilizing contextual information remains challenging. This paper proposes a novel approach for context-aware music recommendation, which infers users’ preferences for music, and then recommends music pieces that fit their real-time requirements. Specifically, the proposed approach first learns the low dimensional representations of music pieces from users’ music listening sequences using neural network models. Based on the learned representations, it then infers and models users’ general and contextual preferences for music from users’ historical listening records. Finally, music pieces in accordance with the user’s preferences are recommended to the target user. Extensive experiments are conducted on real world datasets to compare the proposed method with other state-of-the-art recommendation methods. The results demonstrate that the proposed method significantly outperforms those baselines, especially on sparse data.
      PubDate: 2017-10-13
      DOI: 10.1007/s10791-017-9317-7
  • Session search modeling by partially observable Markov decision process
    • Authors: Grace Hui Yang; Xuchu Dong; Jiyun Luo; Sicong Zhang
      Abstract: Session search, the task of document retrieval for a series of queries in a session, has been receiving increasing attention from the information retrieval research community. Session search exhibits the properties of rich user-system interactions and temporal dependency. These properties lead to our proposal of using a partially observable Markov decision process (POMDP) to model session search. On the basis of a design choice schema for states, actions and rewards, we evaluate different combinations of these choices over the TREC 2012 and 2013 session track datasets. According to the experimental results, practical design recommendations for using POMDP in session search are discussed.
      PubDate: 2017-10-11
      DOI: 10.1007/s10791-017-9316-8
  • Introduction to the special issue on search as learning
    • PubDate: 2017-09-09
      DOI: 10.1007/s10791-017-9315-9
  • Statistical biases in Information Retrieval metrics for recommender systems
    • Authors: Alejandro Bellogín; Pablo Castells; Iván Cantador
      Abstract: There is an increasing consensus in the Recommender Systems community that the dominant error-based evaluation metrics are insufficient, and mostly inadequate, to properly assess the practical effectiveness of recommendations. Seeking to evaluate recommendation rankings—which largely determine the effective accuracy in matching user needs—rather than predicted rating values, Information Retrieval metrics have started to be applied for the evaluation of recommender systems. In this paper we analyse the main issues and potential divergences in the application of Information Retrieval methodologies to recommender system evaluation, and provide a systematic characterisation of experimental design alternatives for this adaptation. We lay out an experimental configuration framework upon which we identify and analyse specific statistical biases arising in the adaptation of Information Retrieval metrics to recommendation tasks, namely sparsity and popularity biases. These biases considerably distort the empirical measurements, hindering the interpretation and comparison of results across experiments. We develop a formal characterisation and analysis of the biases upon which we analyse their causes and main factors, as well as their impact on evaluation metrics under different experimental configurations, illustrating the theoretical findings with empirical evidence. We propose two experimental design approaches that effectively neutralise such biases to a large extent. We report experiments validating our proposed experimental variants, and comparing them to alternative approaches and metrics that have been defined in the literature with similar or related purposes.
      PubDate: 2017-07-27
      DOI: 10.1007/s10791-017-9312-z
  • Product review summarization through question retrieval and diversification
    • Authors: Mengwen Liu; Yi Fang; Alexander G. Choulos; Dae Hoon Park; Xiaohua Hu
      Abstract: Product reviews have become an important resource for customers before they make purchase decisions. However, the abundance of reviews makes it difficult for customers to digest them and make informed choices. In our study, we aim to help customers who want to quickly capture the main idea of a lengthy product review before they read the details. In contrast with existing work on review analysis and document summarization, we aim to retrieve a set of real-world user questions to summarize a review. In this way, users would know what questions a given review can address and they may further read the review only if they have similar questions about the product. Specifically, we design a two-stage approach which consists of question selection and question diversification. For the question selection phase, we first employ probabilistic retrieval models to locate candidate questions that are relevant to a given review. A Recurrent Neural Network Encoder–Decoder is utilized to measure the “answerability” of questions to a review. We then design a set function to re-rank the questions with the goal of rewarding diversity in the final question set. The set function satisfies submodularity and monotonicity, which results in an efficient greedy algorithm of submodular optimization. Evaluation on product reviews from two categories shows that the proposed approach is effective for discovering meaningful questions that are representative of individual reviews.
      PubDate: 2017-07-24
      DOI: 10.1007/s10791-017-9311-0
  • The role of domain knowledge in cognitive modeling of information search
    • Authors: Saraschandra Karanam; Guillermo Jorge-Botana; Ricardo Olmos; Herre van Oostendorp
      Abstract: Computational cognitive models developed so far do not incorporate individual differences in domain knowledge in predicting user clicks on search result pages. We address this problem using a cognitive model of information search which enables us to use two semantic spaces having a low (non-expert semantic space) and a high (expert semantic space) amount of medical and health related information to represent respectively low and high knowledge of users in this domain. We also investigated two different processes along which one can gain a larger amount of knowledge in a domain: an evolutionary and a common core process. Simulations of model click behavior on difficult information search tasks and subsequent matching with actual behavioral data from users (divided into low and high domain knowledge groups based on a domain knowledge test) were conducted. Results showed that the efficacy of modeling for high domain knowledge participants (in terms of the number of matches between the model predictions and the actual user clicks on search result pages) was higher with the expert semantic space compared to the non-expert semantic space while for low domain knowledge participants it was the other way around. When the process of knowledge acquisition was taken into account, the effect of using a semantic space based on high domain knowledge was significant only for high domain knowledge participants, irrespective of the knowledge acquisition process. The implications of these outcomes for support tools that can be built based on these models are discussed.
      PubDate: 2017-05-24
      DOI: 10.1007/s10791-017-9308-8
  • Optimizing search results for human learning goals
    • Authors: Rohail Syed; Kevyn Collins-Thompson
      Abstract: While past research has shown that learning outcomes can be influenced by the amount of effort students invest during the learning process, there has been little research into this question for scenarios where people use search engines to learn. In fact, learning-related tasks represent a significant fraction of the time users spend using Web search, so methods for evaluating and optimizing search engines to maximize learning are likely to have broad impact. Thus, we introduce and evaluate a retrieval algorithm designed to maximize educational utility for a vocabulary learning task, in which users learn a set of important keywords for a given topic by reading representative documents on diverse aspects of the topic. Using a crowdsourced pilot study, we compare the learning outcomes of users across four conditions corresponding to rankings that optimize for different levels of keyword density. We find that adding keyword density to the retrieval objective gave significant learning gains on some topics, with higher levels of keyword density generally corresponding to more time spent reading per word, and stronger learning gains per word read. We conclude that our approach to optimizing search ranking for educational utility leads to retrieved document sets that ultimately may result in more efficient learning of important concepts.
      PubDate: 2017-05-12
      DOI: 10.1007/s10791-017-9303-0
  • Personalized Information Seeking Assistant (PiSA): from programming information seeking to learning
    • Authors: Yihan Lu; I-Han Hsiao
      Abstract: Online programming discussion forums have grown steadily and formed sizable repositories of problem-solving solutions. In this paper, we investigate programming learners’ information seeking behaviors in online discussion forums, and provide visual navigational support to facilitate information seeking. We design engines to collect students’ information seeking behaviors, and model these behaviors with sequence pattern mining techniques. The results show that programming learners indeed seek information from discussion forums by actively searching and reading progressively according to course schedule topics. Advanced students consistently perform query refinements, examine search results and commit to reading; novices, however, do not. Finally, according to the lessons learned, we propose, design and evaluate the Personalized Information Seeking Assistant system to help query refinement by summarizing the search results and to provide social-based browsing history. Findings suggest that paying attention to the query history may lead to further reading events, which subsequently result in potential learning activities.
      PubDate: 2017-05-09
      DOI: 10.1007/s10791-017-9305-y