Information Retrieval
Journal Prestige (SJR): 0.352
Citation Impact (citeScore): 2
Number of Followers: 271  
  Hybrid journal (may contain Open Access articles)
ISSN (Print) 1386-4564 - ISSN (Online) 1573-7659
Published by Springer-Verlag [2350 journals]
  • A language model-based framework for multi-publisher content-based recommender systems
    • Authors: Hamed Zamani; Azadeh Shakery
      Pages: 369 - 409
      Abstract: The rapid growth of the Web has made it harder to find information that addresses users’ information needs. A number of recommendation approaches have been developed to tackle this problem. The increase in the number of data providers has necessitated the development of multi-publisher recommender systems, that is, systems that include more than one item/data provider. In such environments, preserving the privacy of both publishers and subscribers is a key and challenging point. In this paper, we propose a multi-publisher framework for recommender systems based on a client–server architecture, which preserves the privacy of both data providers and subscribers. We develop our framework as a content-based filtering system using the statistical language modeling framework. We also introduce AUTO, a simple yet effective threshold optimization algorithm, to find a dissemination threshold for making acceptance and rejection decisions for newly published documents. We further propose a language model sketching technique to reduce the network traffic between servers and clients in the proposed framework. Extensive experiments using the TREC-9 Filtering Track and the CLEF 2008-09 INFILE Track collections indicate the effectiveness of the proposed models in both single- and multi-publisher settings.
      PubDate: 2018-10-01
      DOI: 10.1007/s10791-018-9327-0
      Issue No: Vol. 21, No. 5 (2018)
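The content-based filtering idea in the abstract above (score a newly published document against a subscriber's language model, then accept or reject it against a dissemination threshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' framework: the function names, the Jelinek-Mercer style mixing weight `lam`, the `eps` floor for unseen terms, and the fixed `threshold` are all assumptions, and the fixed threshold merely stands in for the AUTO optimization algorithm, which the abstract does not specify.

```python
import math
from collections import Counter


def unigram_model(texts):
    """Maximum-likelihood unigram model: term -> probability."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def score(doc, profile, background, lam=0.5, eps=1e-9):
    """Average per-token log-likelihood ratio: smoothed profile vs. background."""
    tokens = doc.lower().split()
    if not tokens:
        return float("-inf")
    total = 0.0
    for tok in tokens:
        # Terms unseen in the background get a small floor probability (assumption).
        p_bg = background.get(tok, eps)
        # Mix the profile model with the background model to avoid zero probabilities.
        p_profile = lam * profile.get(tok, 0.0) + (1 - lam) * p_bg
        total += math.log(p_profile / p_bg)
    return total / len(tokens)


def accept(doc, profile, background, threshold=0.0):
    """Deliver the document only if its score reaches the (hypothetical) threshold."""
    return score(doc, profile, background) >= threshold


liked = ["language model retrieval", "retrieval model smoothing"]
other = ["football match result", "stock market news"]
profile = unigram_model(liked)
background = unigram_model(liked + other)
```

With these toy inputs, a document such as "retrieval model" scores above zero and is accepted, while "football match" scores below zero and is rejected. In a real system the threshold would be tuned on feedback data rather than fixed.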
  • An artist ranking system based on social media mining
    • Authors: Amalia F. Foka
      Pages: 410 - 448
      Abstract: Currently, users on social media post their opinions and feelings about almost everything. This online behavior has led to numerous applications where social media data are used to measure public opinion in a similar way as a poll or a survey. In this paper, we will present an application of social media mining for the art market. To the best of our knowledge, this will be the first attempt to mine social media to extract quantitative and qualitative data for the art market. Although there are previous works on analyzing and predicting other markets, these methodologies cannot be applied directly to the art market. In our proposed methodology, artists will be treated as brands. That is, we will mine Twitter posts that mention specific artists’ names and attempt to rank artists in a similar manner as brand equity and awareness would be measured. The particularities of the art market are considered mainly in the construction of a topic-specific user network where user expertise and influence are evaluated and later used to rank artists. The proposed ranking system is evaluated against two other available systems to identify the advantages it can offer.
      PubDate: 2018-10-01
      DOI: 10.1007/s10791-018-9328-z
      Issue No: Vol. 21, No. 5 (2018)
  • A non-parametric topical relevance model
    • Authors: Debasis Ganguly; Gareth J. F. Jones
      Pages: 449 - 479
      Abstract: An information retrieval (IR) system can often fail to retrieve relevant documents due to the incomplete specification of information need in the user’s query. Pseudo-relevance feedback (PRF) aims to improve IR effectiveness by exploiting potentially relevant aspects of the information need present in the documents retrieved in an initial search. Standard PRF approaches utilize the information contained in these top ranked documents from the initial search with the assumption that documents as a whole are relevant to the information need. However, in practice, documents are often multi-topical, so that only a portion of a document may be relevant to the query. In this situation, exploitation of the topical composition of the top ranked documents, estimated with statistical topic modeling based approaches, can potentially be a useful cue to improve PRF effectiveness. The key idea behind our PRF method is to use the term-topic and the document-topic distributions obtained from topic modeling over the set of top ranked documents to re-rank the initially retrieved documents. The objective is to improve the ranks of documents that are primarily composed of the relevant topics expressed in the information need of the query. Our RF model can further be improved by making use of non-parametric topic modeling, where the number of topics can grow according to the document contents, thus giving the RF model the capability to adjust the number of topics based on the content of the top ranked documents. We empirically validate our topic model based RF approach on two document collections of diverse length and topical composition characteristics: (1) ad-hoc retrieval using the TREC 6-8 and the TREC Robust ’04 datasets, and (2) tweet retrieval using the TREC Microblog ’11 dataset. Results indicate that our proposed approach increases MAP by up to 9% in comparison to the results obtained with an LDA based language model (for initial retrieval) coupled with the relevance model (for feedback). Moreover, the non-parametric version of our proposed approach is shown to be more effective than its parametric counterpart due to its advantage of adapting the number of topics, improving results by up to 5.6% of MAP compared to the parametric version.
      PubDate: 2018-10-01
      DOI: 10.1007/s10791-018-9329-y
      Issue No: Vol. 21, No. 5 (2018)
  • Website replica detection with distant supervision
    • Authors: Cristiano Carvalho; Edleno Silva de Moura; Adriano Veloso; Nivio Ziviani
      Pages: 253 - 272
      Abstract: Duplicate content on the Web occurs within the same website or across multiple websites. The latter is mainly associated with the existence of website replicas, that is, sites that are perceptibly similar. Replication may be accidental, intentional or malicious, but no matter the reason, search engines suffer greatly either from unnecessarily storing and moving duplicate data, or from providing search results that do not offer real value to the users. In this paper, we model the detection of website replicas as a pairwise classification problem with distant supervision. That is, (heuristically) finding obvious replica and non-replica cases is trivial, but learning effective classifiers requires a representative set of non-obvious labeled examples, which are hard to obtain. We employ efficient Expectation-Maximization (EM) algorithms in order to find non-obvious examples from obvious ones, enlarging the training set and improving the classifiers iteratively. Our classifiers employ association rules and are thus incrementally updated as the EM process iterates, making our algorithms time-efficient. Experiments show that: (1) replicas are fully eliminated at a false-positive rate lower than 0.005, yielding a 19% reduction in the number of duplicate URLs, (2) the reduction increases to 21% by using our site-level algorithms in conjunction with existing URL-level algorithms, and (3) our classifiers are more than two orders of magnitude faster than semi-supervised alternative solutions.
      PubDate: 2018-08-01
      DOI: 10.1007/s10791-017-9320-z
      Issue No: Vol. 21, No. 4 (2018)
  • Clustering small-sized collections of short texts
    • Authors: Lili Kotlerman; Ido Dagan; Oren Kurland
      Pages: 273 - 306
      Abstract: The need to cluster small text corpora composed of a few hundred short texts arises in various applications; e.g., clustering top-retrieved documents based on their snippets. This clustering task is challenging due to the vocabulary mismatch between short texts and the insufficient corpus-based statistics (e.g., term co-occurrence statistics) due to the corpus size. We address this clustering challenge using a framework that utilizes a set of external knowledge resources that provide information about term relations. Specifically, we use information induced from the resources to estimate similarity between terms and produce term clusters. We also utilize the resources to expand the vocabulary used in the given corpus and thus enhance term clustering. We then project the texts in the corpus onto the term clusters to cluster the texts. We evaluate various instantiations of the proposed framework by varying the term clustering method used, the approach of projecting the texts onto the term clusters, and the way of applying external knowledge resources. Extensive empirical evaluation demonstrates the merits of our approach with respect to applying clustering algorithms directly on the text corpus, and using state-of-the-art co-clustering and topic modeling methods.
      PubDate: 2018-08-01
      DOI: 10.1007/s10791-017-9324-8
      Issue No: Vol. 21, No. 4 (2018)
  • EveTAR: building a large-scale multi-task test collection over Arabic
    • Authors: Maram Hasanain; Reem Suwaileh; Tamer Elsayed; Mucahid Kutlu; Hind Almerekhi
      Pages: 307 - 336
      Abstract: This article introduces a new language-independent approach for creating a large-scale high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevancy and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology to Arabic tweets resulted in EveTAR, the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets.
      PubDate: 2018-08-01
      DOI: 10.1007/s10791-017-9325-7
      Issue No: Vol. 21, No. 4 (2018)
  • Search bias quantification: investigating political bias in social media and web search
    • Authors: Juhi Kulshrestha; Motahhare Eslami; Johnnatan Messias; Muhammad Bilal Zafar; Saptarshi Ghosh; Krishna P. Gummadi; Karrie Karahalios
      Abstract: Users frequently use search systems on the Web as well as online social media to learn about ongoing events and public opinion on personalities. Prior studies have shown that the top-ranked results returned by these search engines can shape user opinion about the topic (e.g., event or person) being searched. In the case of polarizing topics like politics, where multiple competing perspectives exist, the political bias in the top search results can play a significant role in shaping public opinion towards (or away from) certain perspectives. Given the considerable impact that search bias can have on the user, we propose a generalizable search bias quantification framework that not only measures the political bias in the ranked list output by the search system but also decouples the bias introduced by the different sources: input data and ranking system. We apply our framework to study the political bias in searches related to the 2016 US Presidential primaries in Twitter social media search and find that both input data and ranking system matter in determining the final search output bias seen by the users. Finally, we use the framework to compare the relative bias for two popular search systems, Twitter social media search and Google web search, for queries related to politicians and political events. We end by discussing some potential solutions to signal the bias in the search results to make the users more aware of it.
      PubDate: 2018-08-21
      DOI: 10.1007/s10791-018-9341-2
  • Those were the days: learning to rank social media posts for reminiscence
    • Authors: Kaweh Djafari Naini; Ricardo Kawase; Nattiya Kanhabua; Claudia Niederée; Ismail Sengor Altingovde
      Abstract: Social media posts are a great source for life summaries aggregating activities, events, interactions and thoughts of the last months or years. They can be used for personal reminiscence as well as for keeping track of developments in the lives of not-so-close friends. One of the core challenges of automatically creating such summaries is to decide which posts are memorable, i.e., should be considered for retention, and which ones to forget. To address this challenge, we design and conduct user evaluation studies and construct a corpus that captures human expectations towards content retention. We analyze this corpus to identify a small set of seed features that are most likely to characterize memorable posts. Next, we compile a broader set of features that are leveraged to build general and personalized machine-learning models to rank posts for retention. By applying feature selection, we identify a compact yet effective subset of these features. The models trained with the presented feature sets outperform the baseline models exploiting an intuitive set of temporal and social features.
      PubDate: 2018-08-11
      DOI: 10.1007/s10791-018-9339-9
  • Beyond word embeddings: learning entity and concept representations from large scale knowledge bases
    • Authors: Walid Shalaby; Wlodek Zadrozny; Hongxia Jin
      Abstract: Text representations using neural word embeddings have proven effective in many NLP applications. Recent research adapts the traditional word embedding models to learn vectors of multiword expressions (concepts/entities). However, these methods are limited to textual knowledge bases (e.g., Wikipedia). In this paper, we propose a novel and simple technique for integrating the knowledge about concepts from two large scale knowledge bases of different structure (Wikipedia and Probase) in order to learn concept representations. We adapt the efficient skip-gram model to seamlessly learn from the knowledge in Wikipedia text and the Probase concept graph. We evaluate our concept embedding models on two tasks: (1) analogical reasoning, where we achieve a state-of-the-art performance of 91% on semantic analogies, and (2) concept categorization, where we achieve state-of-the-art performance on two benchmark datasets, with categorization accuracy of 100% on one and 98% on the other. Additionally, we present a case study to evaluate our model on unsupervised argument type identification for neural semantic parsing. We demonstrate the competitive accuracy of our unsupervised method and its ability to better generalize to out-of-vocabulary entity mentions compared to the tedious and error-prone methods which depend on gazetteers and regular expressions.
      PubDate: 2018-08-11
      DOI: 10.1007/s10791-018-9340-3
  • User interest prediction over future unobserved topics on social networks
    • Authors: Fattane Zarrinkalam; Mohsen Kahani; Ebrahim Bagheri
      Abstract: The accurate prediction of users’ future interests on social networks allows one to perform future planning by studying how users will react if certain topics emerge in the future. It can improve areas such as targeted advertising and the efficient delivery of services. Despite the importance of predicting users’ future interests on social networks, existing works mainly focus on identifying users’ current interests, and little work has been done on the prediction of users’ potential interests in the future. There have been works that attempt to identify a user’s future interests; however, they cannot predict user interests with regard to new topics, since these topics have never received any feedback from users in the past. In this paper, we propose a framework that works on the basis of the temporal evolution of user interests and utilizes semantic information from knowledge bases such as Wikipedia to predict users’ future interests and overcome the cold item problem. Through extensive experiments on a real-world Twitter dataset, we demonstrate the effectiveness of our approach in predicting future interests of users compared to state-of-the-art baselines. Moreover, we further show that the impact of our work is especially meaningful in the case of cold items.
      PubDate: 2018-07-10
      DOI: 10.1007/s10791-018-9337-y
  • Predicting trading interactions in an online marketplace through location-based and online social networks
    • Authors: Lukas Eberhard; Christoph Trattner; Martin Atzmueller
      Abstract: Link prediction is a prominent research direction, e.g., for inferring upcoming interactions to be used in recommender systems. Although the problem of predicting links between users has been extensively studied in the past, research investigating this issue simultaneously in multiplex networks is rather rare so far. This is the focus of this paper. We investigate the extent to which trading interactions between sellers and buyers within an online marketplace platform can be predicted based on three different but overlapping networks: an online social network, a location-based social network and a trading network. In particular, we conducted the study in the context of the virtual world Second Life. For that, we crawled the corresponding data of the online social network, obtained user information of the location-based social network through specialized bots, and extracted purchases of the trading network. Overall, we generated and used 57 topological and homophilic features in different constellations to predict trading interactions between user pairs. We focused on both unsupervised and supervised learning methods. For supervised learning, we achieved accuracy values of up to 92.5%; for unsupervised learning, we obtained nDCG values of over 97% and MAP values of up to 75%.
      PubDate: 2018-07-09
      DOI: 10.1007/s10791-018-9336-z
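The nDCG and MAP figures reported in the abstract above are standard ranking metrics. As a rough reminder of how they are usually computed, here is a small sketch; it assumes the common log2 discount for DCG and an average precision that is normalized by the number of relevant items found in the ranked list, which may differ from the exact variants the authors used.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def average_precision(rels):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: the mean of average precision over several ranked lists."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

For example, the binary relevance list `[1, 0, 1]` has an average precision of (1/1 + 2/3) / 2, and a ranking that already places all relevant items first has an nDCG of 1.0.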
  • Influence me! Predicting links to influential users
    • Authors: Ariel Monteserin; Marcelo G. Armentano
      Abstract: In addition to being used to keep in contact with friends, online social networks are commonly used as a source of information, suggestions and recommendations from members of the community. Whenever we accept a suggestion or perform any action because it was recommended by a “friend”, we are being influenced by him/her. For this reason, it is useful for users seeking interesting information to identify and connect to such influential users. In this context, we propose an approach to predict links to influential users. Compared to approaches that identify general influential users in a network, our approach seeks to identify users who might have some kind of influence on individual (target) users. To carry out this goal, we adapted an influence maximization algorithm to find new influential users from the set of current influential users of the target user. Moreover, we compared the results obtained with different metrics for link prediction and analyzed in which contexts these metrics obtained better results.
      PubDate: 2018-07-06
      DOI: 10.1007/s10791-018-9335-0
  • Determining the interests of social media users: two approaches
    • Authors: Nacéra Bennacer Seghouani; Coriane Nana Jipmo; Gianluca Quercini
      Abstract: Although social media platforms serve diverse purposes, from social and professional networking to photo sharing and blogging, people frequently use them to share their thoughts and opinions and, most importantly, their interests (e.g., politics, economy, sports). Understanding the interests of social media users is key to many applications that need to characterize them to recommend some services and find other individuals with similar interests. In this paper, we propose two approaches to the automatic determination of the interests of social media users. The first, which we named Frisk, is an unsupervised multilingual approach that determines the interests of a user from the explicit meaning of the words that occur in the user’s posts. The second, which we termed Ascertain, is a supervised approach that resorts to the hidden dimensions of the words that several studies indicated to be capable of revealing some of the psychological processes and personality traits of a person. In our evaluation, which we performed on two datasets obtained from Twitter, we show that Frisk is capable of inferring the interests in a multilingual context with good accuracy and that the psychological dimensions used by Ascertain are also good predictors of a user’s interests.
      PubDate: 2018-07-05
      DOI: 10.1007/s10791-018-9338-x
  • A systematic approach to normalization in probabilistic models
    • Authors: Aldo Lipani; Thomas Roelleke; Mihai Lupu; Allan Hanbury
      Abstract: Every information retrieval (IR) model embeds in its scoring function a form of term frequency (TF) quantification. The contribution of the term frequency is determined by the properties of the function of the chosen TF quantification, and by its TF normalization. The first defines how independent the occurrences of multiple terms are, while the second acts on mitigating the a priori probability of having a high term frequency in a document (estimation usually based on the document length). New test collections, coming from different domains (e.g. medical, legal), give evidence that not only document length, but in addition, verboseness of documents should be explicitly considered. Therefore we propose and investigate a systematic combination of document verboseness and length. To theoretically justify the combination, we show the duality between document verboseness and length. In addition, we investigate the duality between verboseness and other components of IR models. We test these new TF normalizations on four suitable test collections. We do this on a well defined spectrum of TF quantifications. Finally, based on the theoretical and experimental observations, we show how the two components of this new normalization, document verboseness and length, interact with each other. Our experiments demonstrate that the new models never underperform existing models, while sometimes introducing statistically significantly better results, at no additional computational cost.
      PubDate: 2018-06-30
      DOI: 10.1007/s10791-018-9334-1
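To make the terms "TF quantification" and "TF normalization" in the abstract above concrete, the sketch below shows the familiar BM25-style term-frequency component, which saturates with raw frequency and normalizes by document length. It covers only the standard length-based normalization; the verboseness-based normalization proposed in the paper is not reproduced here, and the function name and default parameter values are illustrative assumptions.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating TF quantification with pivoted document-length normalization."""
    if tf <= 0:
        return 0.0
    # Length normalizer: 1.0 for an average-length document, larger for longer ones.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # The result grows with tf but stays below 1 (BM25 often multiplies it by k1 + 1).
    return tf / (tf + k1 * norm)
```

With these defaults, the same raw frequency contributes less in a document twice the average length than in an average-length one, and the contribution of repeated occurrences levels off rather than growing linearly.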
  • A topic recommender for journalists
    • Authors: Alessandro Cucchiarelli; Christian Morbidoni; Giovanni Stilo; Paola Velardi
      Abstract: The way in which people gather information about events and form their own opinion on them has changed dramatically with the advent of social media. For many readers, the news gathered from online sources has become an opportunity to share points of view and information within micro-blogging platforms such as Twitter, mainly aimed at satisfying their communication needs. Furthermore, the need to deepen the aspects related to news stimulates a demand for additional information which is often met through online encyclopedias, such as Wikipedia. This behaviour has also influenced the way in which journalists write their articles, requiring a careful assessment of what actually interests the readers. The goal of this paper is to present a recommender system, What to Write and Why, capable of suggesting to a journalist, for a given event, the aspects still uncovered in news articles on which the readers focus their interest. The basic idea is to characterize an event according to the echo it receives in online news sources and associate it with the corresponding readers’ communicative and informative patterns, detected through the analysis of Twitter and Wikipedia, respectively. Our methodology temporally aligns the results of this analysis and recommends the concepts that emerge as topics of interest from Twitter and Wikipedia, either not covered or poorly covered in the published news articles.
      PubDate: 2018-06-14
      DOI: 10.1007/s10791-018-9333-2
  • Hybrid query expansion model for text and microblog information retrieval
    • Authors: Meriem Amina Zingla; Chiraz Latiri; Philippe Mulhem; Catherine Berrut; Yahya Slimani
      Abstract: Query expansion (QE) is an important process in information retrieval applications that improves the user query and helps in retrieving relevant results. In this paper, we introduce a hybrid query expansion model (HQE) that investigates how external resources can be combined with association rule mining and used to enhance the generation and selection of expansion terms. The HQE model can be run in different configurations, starting from methods based on association rules and combining them with external knowledge. The HQE model handles the two main phases of a QE process, namely the candidate term generation phase and the selection phase. For the first phase, we propose statistical, semantic and conceptual methods to generate new related terms for a given query. For the second phase, we introduce a similarity measure, ESAC, based on Explicit Semantic Analysis, that computes the relatedness between a query and the set of candidate terms. The performance of the proposed HQE model is evaluated in two experimental validations. The first addresses the tweet search task proposed by the TREC Microblog Track 2011 and an ad-hoc IR task related to the hard topics of TREC Robust 2004. The second concerns the tweet contextualization task organized by INEX 2014. Overall results highlight the effectiveness of our HQE model and of association rule mining for QE combined with external resources.
      PubDate: 2018-02-03
      DOI: 10.1007/s10791-017-9326-6
  • Neural information retrieval: introduction to the special issue
    • PubDate: 2017-11-10
      DOI: 10.1007/s10791-017-9323-9
  • Neural information retrieval: at the end of the early years
    • Abstract: A recent “third wave” of neural network (NN) approaches now delivers state-of-the-art performance in many machine learning tasks, spanning speech recognition, computer vision, and natural language processing. Because these modern NNs often comprise multiple interconnected layers, work in this area is often referred to as deep learning. Recent years have witnessed an explosive growth of research into NN-based approaches to information retrieval (IR). A significant body of work has now been created. In this paper, we survey the current landscape of Neural IR research, paying special attention to the use of learned distributed representations of textual units. We highlight the successes of neural IR thus far, catalog obstacles to its wider adoption, and suggest potentially promising directions for future research.
      PubDate: 2017-11-10
      DOI: 10.1007/s10791-017-9321-y
  • Using word embeddings in Twitter election classification
    • Abstract: Word embeddings and convolutional neural networks (CNN) have attracted extensive attention in various classification tasks for Twitter, e.g. sentiment classification. However, the effect of the configuration used to generate the word embeddings on the classification performance has not been studied in the existing literature. In this paper, using a Twitter election classification task that aims to detect election-related tweets, we investigate the impact of the background dataset used to train the embedding models, as well as the parameters of the word embedding training process, namely the context window size, the dimensionality and the number of negative samples, on the attained classification performance. By comparing the classification results of word embedding models that have been trained using different background corpora (e.g. Wikipedia articles and Twitter microposts), we show that the background data should align with the Twitter classification dataset both in data type and time period to achieve significantly better performance compared to baselines such as SVM with TF-IDF. Moreover, by evaluating the results of word embedding models trained using various context window sizes and dimensionalities, we find that large context window and dimension sizes are preferable to improve the performance. However, the number of negative samples parameter does not significantly affect the performance of the CNN classifiers. Our experimental results also show that choosing the correct word embedding model for use with CNN leads to statistically significant improvements over various baselines such as random, SVM with TF-IDF and SVM with word embeddings. Finally, for out-of-vocabulary (OOV) words that are not available in the learned word embedding models, we show that a simple OOV strategy to randomly initialise the OOV words without any prior knowledge is sufficient to attain a good classification performance among the current OOV strategies (e.g. a random initialisation using statistics of the pre-trained word embedding models).
      PubDate: 2017-11-09
      DOI: 10.1007/s10791-017-9319-5
  • Picture it in your mind: generating high level visual representations from textual descriptions
    • Authors: Fabio Carrara; Andrea Esuli; Tiziano Fagni; Fabrizio Falchi; Alejandro Moreo Fernández
      Abstract: In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that an update to the translation model does not require reprocessing the (typically huge) image collection on which the search is performed. We propose various neural network models of increasing complexity that learn to generate, from a short descriptive text, a high level visual representation in a visual feature space such as the pool5 layer of the ResNet-152 or the fc6–fc7 layers of an AlexNet trained on ILSVRC12 and Places databases. The Text2Vis models we explore include (1) a relatively simple regressor network relying on a bag-of-words representation for the textual descriptors, (2) a deep recurrent network that is sensitive to word order, and (3) a wide and deep model that combines a stacked LSTM deep network with a wide regressor network. We compare the models we propose with other search strategies, also including textual search methods that exploit state-of-the-art caption generation models to index the image collection.
      PubDate: 2017-10-14
      DOI: 10.1007/s10791-017-9318-6