Similar Journals
Information Retrieval Journal
Journal Prestige (SJR): 0.352; Citation Impact (CiteScore): 2; Number of Followers: 683
Hybrid journal (may contain Open Access articles)
ISSN (Print): 1386-4564; ISSN (Online): 1573-7659
Published by Springer-Verlag
• Demographic differences in search engine use with implications for cohort
selection
• Abstract: The correlation between the demographics of users and the text they write has been investigated through literary texts and, more recently, social media. However, differences in language use in search engines have not been thoroughly analyzed, especially with respect to age and gender. Such differences are important due to the growing use of search engine data in the study of human health, where queries are used to identify patient populations. Using three datasets comprising queries from multiple general-purpose Internet search engines, we investigate the correlation between demography (age, gender, and income) and the text of queries submitted to search engines. Our results show that females and younger people use longer queries; females make approximately 25% more queries with 10 or more words. Among queries which identify users as having specific medical conditions, we find that females make 53% more queries than expected, resulting in patient cohorts that are highly skewed in gender and age compared to known patient populations. We show that cohort selection methods which use additional information beyond queries in which users indicate their condition are less skewed. Finally, we show that biased training cohorts can lead to differential performance of models designed to detect disease from search engine queries. Our results indicate that in studies where demographic representation is important, such as studies of users' health or fairness evaluations of search engines, care should be taken in the selection of search engine data so as to create a representative dataset.
PubDate: 2019-12-01

• Beyond word embeddings: learning entity and concept representations from
large scale knowledge bases
• Abstract: Text representations using neural word embeddings have proven effective in many NLP applications. Recent research adapts traditional word embedding models to learn vectors for multiword expressions (concepts/entities). However, these methods are limited to textual knowledge bases (e.g., Wikipedia). In this paper, we propose a novel and simple technique for integrating knowledge about concepts from two large scale knowledge bases of different structure (Wikipedia and Probase) in order to learn concept representations. We adapt the efficient skip-gram model to seamlessly learn from the knowledge in Wikipedia text and the Probase concept graph. We evaluate our concept embedding models on two tasks: (1) analogical reasoning, where we achieve state-of-the-art performance of 91% on semantic analogies, and (2) concept categorization, where we achieve state-of-the-art performance on two benchmark datasets, with categorization accuracy of 100% on one and 98% on the other. Additionally, we present a case study evaluating our model on unsupervised argument type identification for neural semantic parsing. We demonstrate the competitive accuracy of our unsupervised method and its ability to generalize better to out-of-vocabulary entity mentions than tedious and error-prone methods which depend on gazetteers and regular expressions.
PubDate: 2019-12-01
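As a rough illustration of the adaptation described above, here is a minimal sketch of skip-gram training-pair generation in which multiword concept mentions are assumed to be pre-collapsed into single tokens; the function name, window size, and example tokens are illustrative, not taken from the paper:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as in the skip-gram model.

    Multiword concepts are assumed to be pre-collapsed into single tokens
    (e.g. "Barack_Obama"), so concepts and plain words are treated
    uniformly, as in concept-embedding adaptations of skip-gram.
    """
    pairs = []
    for i, target in enumerate(tokens):
        # Context is a symmetric window around the target token.
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs
```

In a full model these pairs would feed a negative-sampling objective; here they only show how concept tokens enter the same pipeline as ordinary words.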

• A comparison of filtering evaluation metrics based on formal constraints
• Abstract: Although document filtering is simple to define, a wide range of evaluation measures has been proposed in the literature, all of which have been subject to criticism. Our goal is to compare metrics from a formal point of view, in order to understand whether, why, and when each metric is appropriate, and to achieve a better understanding of the similarities and differences between metrics. Our formal study leads to a typology of measures for document filtering based on (1) a formal constraint that must be satisfied by any suitable evaluation measure, and (2) a set of three mutually exclusive formal properties which help to explain the fundamental differences between measures and to determine which ones are more appropriate depending on the application scenario. As far as we know, this is the first in-depth study of how filtering metrics can be categorized according to their appropriateness for different scenarios. Two main findings derive from our study. First, not every measure satisfies the basic constraint, but problematic measures can be adapted using smoothing techniques that make them compliant with the basic constraint while preserving their original properties. Second, all metrics except one can be grouped into three families, each satisfying one of the three mutually exclusive formal properties. In cases where the application scenario is clearly defined, this classification of metrics should help in choosing an adequate evaluation measure. The exception is the Reliability/Sensitivity metric pair, which does not fit into any of the three families but has two valuable empirical properties: it is strict (i.e., a good result according to Reliability/Sensitivity ensures a good result according to all other metrics) and is more robust than all other measures considered in our study.
PubDate: 2019-12-01
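The paper's specific smoothing scheme is not detailed in the abstract; the following is a generic additive-smoothing sketch of the idea, with F-measure standing in for a filtering metric. The epsilon constant is an illustrative choice, not a value from the article:

```python
def f1(tp, fp, fn, eps=1e-9):
    """F-measure for a filtering run, with additive smoothing.

    Without smoothing the measure is undefined when the system retrieves
    nothing (tp + fp = 0); a small eps keeps the value well defined on
    such edge cases while barely affecting ordinary runs.
    """
    precision = (tp + eps) / (tp + fp + 2 * eps)
    recall = (tp + eps) / (tp + fn + 2 * eps)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```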

• A selective approach to index term weighting for robust information
retrieval based on the frequency distributions of query terms
• Abstract: A typical information retrieval (IR) system applies a single retrieval strategy to every information need of its users. However, the results of past IR experiments show that a particular retrieval strategy is in general good at fulfilling some types of information needs while failing to fulfil others, i.e., retrieval effectiveness varies highly across information needs. The same results also show that an information need that one retrieval strategy fails to fulfil can often be fulfilled by another existing retrieval strategy. The challenge is therefore to determine in advance which retrieval strategy should be applied to which information need. This challenge is related to the robustness of IR systems in retrieval effectiveness: for an IR system, robustness can be defined as fulfilling every information need of its users with an acceptable level of satisfaction. Maintaining robustness in retrieval effectiveness is a long-standing challenge, and in this article we propose a simple but powerful remedy. The method is a selective approach to index term weighting: for any given query (i.e., information need), it predicts the "best" term weighting model amongst a set of alternatives, on the basis of the frequency distributions of the query terms on a target document collection. To predict the best term weighting model, the method uses the Chi-square statistic of the Chi-square goodness-of-fit test. The results of experiments, performed using the official query sets of the TREC Web track and the Million Query track, reveal that the frequency distributions of query terms provide relevant information about the retrieval effectiveness of term weighting models. In particular, the results show that the selective approach proposed in this article is, on average, more effective and more robust than the most effective single term weighting model.
PubDate: 2019-12-01
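The selection mechanism can be sketched as follows. The threshold, the candidate weighting models, and the binning of observed frequencies are illustrative placeholders, not values or choices from the article; only the use of Pearson's Chi-square goodness-of-fit statistic follows the abstract:

```python
def chi_square(observed, expected):
    """Pearson's chi-square goodness-of-fit statistic over binned counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def select_weighting_model(observed, expected, threshold=50.0,
                           models=("BM25", "PL2")):
    """Pick one of two term weighting models based on how well the
    query-term frequency distribution fits the expected one.

    threshold and the model pair are hypothetical placeholders used
    only to illustrate the selective mechanism.
    """
    return models[0] if chi_square(observed, expected) > threshold else models[1]
```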

• Informational, transactional, and navigational need of information:
relevance of search intention in search engine advertising
PubDate: 2019-11-26

• Preference-based interactive multi-document summarisation
• Abstract: Interactive NLP is a promising paradigm to close the gap between automatic NLP systems and the human upper bound. Preference-based interactive learning has been successfully applied, but existing methods require several thousand interaction rounds even in simulations with perfect user feedback. In this paper, we study preference-based interactive summarisation. To reduce the number of interaction rounds, we propose the Active Preference-based ReInforcement Learning (APRIL) framework. APRIL uses active learning to query the user, preference learning to learn a summary ranking function from the preferences, and neural reinforcement learning to efficiently search for the (near-)optimal summary. Our results show that users can easily provide reliable preferences over summaries and that APRIL outperforms the state-of-the-art preference-based interactive method in both simulation and real-user experiments.
PubDate: 2019-11-19
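A toy sketch of the preference-learning stage, assuming a linear scoring function over summary feature vectors and a logistic (Bradley-Terry style) likelihood; the actual APRIL components are more elaborate, and the feature representation here is left abstract:

```python
import math

def train_ranker(preferences, dim, lr=0.1, epochs=50):
    """Learn a linear summary-scoring weight vector from pairwise
    preferences [(features_of_preferred, features_of_other), ...]
    by gradient ascent on a logistic (Bradley-Terry) likelihood.
    A toy stand-in for the preference-learning stage of an
    APRIL-like loop."""
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, other in preferences:
            diff = [a - b for a, b in zip(preferred, other)]
            s = sum(wi * di for wi, di in zip(w, diff))
            grad = 1.0 - 1.0 / (1.0 + math.exp(-s))  # d log sigmoid(s) / ds
            w = [wi + lr * grad * di for wi, di in zip(w, diff)]
    return w

def score(w, features):
    """Score a candidate summary's feature vector with the learned weights."""
    return sum(wi * fi for wi, fi in zip(w, features))
```

In a full interactive loop, active learning would choose which summary pair to show the user next, and the learned scorer would guide a reinforcement-learning search over summaries.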

• Boosting learning to rank with user dynamics and continuation methods
• Abstract: Learning to rank (LtR) techniques leverage assessed samples of query-document relevance to learn effective ranking functions able to exploit the noisy signals hidden in the features used to represent queries and documents. In this paper we explore how to enhance the state-of-the-art LambdaMART LtR algorithm by integrating into the training process an explicit knowledge of the underlying user-interaction model and the possibility of targeting different objective functions that can effectively drive the algorithm towards promising areas of the search space. We enrich the iterative process followed by the learning algorithm in two ways: (1) by considering complex query-based user dynamics instead of simply discounting the gain by rank position, and (2) by designing a learning path across different loss functions that can capture different signals in the training data. Our extensive experiments, conducted on publicly available datasets, show that the proposed solution improves various ranking quality measures by statistically significant margins.
PubDate: 2019-11-05
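The contrast between a plain rank-position discount and a user-dynamics discount can be sketched as follows; the continuation probabilities are hypothetical stand-ins for quantities that would be estimated from query logs, not the paper's actual model:

```python
import math

def discounted_gain(gains, discounts):
    """Sum of relevance gains weighted by a per-position discount."""
    return sum(g * d for g, d in zip(gains, discounts))

def rank_discounts(n):
    """Classic DCG-style discount, depending only on rank: 1 / log2(rank + 1)."""
    return [1.0 / math.log2(r + 1) for r in range(1, n + 1)]

def dynamics_discounts(continuation):
    """Discounts from a (hypothetical) user model: the probability that the
    user is still examining each position, given per-position continuation
    probabilities. Unlike rank_discounts, these can vary per query."""
    probs, p = [], 1.0
    for c in continuation:
        probs.append(p)
        p *= c
    return probs
```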

• Optimizing the recency-relevance-diversity trade-offs in non-personalized
news recommendations
• Abstract: Online news media sites are emerging as the primary source of news for a large number of users. Because of the large number of stories published on these sites, users usually rely on news recommendation systems to find important news. In this work, we focus on automatically recommending news stories to all users of such media websites, where the selection is not influenced by any particular user's reading habits. When recommending news stories in such a non-personalized manner, there are three basic metrics of interest: recency, importance (analogous to relevance in personalized recommendation), and diversity of the recommended news. Ideally, recommender systems should recommend the most important stories soon after they are published. However, the importance of a story only becomes evident as the story ages, creating a tension between recency and importance. A systematic analysis of popular recommendation strategies in use today reveals that they lead to poor trade-offs between recency and importance in practice. We therefore propose a new recommendation strategy, called Highest Future-Impact, which attempts to optimize along both axes. To implement the proposed strategy in practice, we propose two approaches to predict the future impact of news stories: using crowd-sourced popularity signals and observing editorial selection in past news data. Finally, we propose approaches to incorporate diversity in the recommended news so as to maintain a balanced proportion of news from different news sections. Evaluations over real-world news datasets show that our implementations achieve good performance in recommending news stories.
PubDate: 2019-10-01
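A minimal sketch of impact-ranked selection with a per-section diversity cap; the cap, the impact scores, and the section labels are illustrative, not the paper's actual mechanism, and prediction of future impact is assumed to happen elsewhere:

```python
def recommend(stories, k, max_per_section=2):
    """Pick k stories in order of predicted future impact while capping
    how many may come from any one news section (cap is illustrative).

    stories: list of (story_id, section, predicted_impact) tuples.
    """
    chosen, per_section = [], {}
    # Greedily scan stories from highest to lowest predicted impact.
    for sid, section, impact in sorted(stories, key=lambda s: -s[2]):
        if per_section.get(section, 0) < max_per_section:
            chosen.append(sid)
            per_section[section] = per_section.get(section, 0) + 1
        if len(chosen) == k:
            break
    return chosen
```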

• On the impact of group size on collaborative search effectiveness
• Abstract: While today's web search engines are designed for single-user search, research efforts over the years have shown that complex information needs (explorative, open-ended and multi-faceted) can be answered more efficiently and effectively when searching in collaboration. Collaborative search (and sensemaking) research has investigated techniques, algorithms and interface affordances to gain insights and improve the collaborative search process. It is not hard to imagine that the size of the group collaborating on a search task significantly influences the group's behaviour and search effectiveness. However, a common denominator across almost all existing studies is a fixed group size: usually pairs or triads, or in a few cases four collaborating users. Larger group sizes, and the impact of group size on users' behaviour and search metrics, have so far rarely been considered, and when they have, only in a simulation setup. In this work, we investigate in a large-scale user experiment to what extent those simulation results carry over to the real world. To this end, we extended our collaborative search framework SearchX with algorithmic mediation features and ran a large-scale experiment with more than 300 crowd workers. We consider collaboration group size as an independent variable, and investigate collaborations in groups of up to six people. We find that most prior simulation-based results on the impact of collaboration group size on behaviour and search effectiveness cannot be reproduced in our user experiment.
PubDate: 2019-10-01

• ReBoost: a retrieval-boosted sequence-to-sequence model for neural
response generation
• Abstract: Human–computer conversation is an active research topic in natural language processing. One representative method for building conversation systems uses the sequence-to-sequence (Seq2seq) model through neural networks. However, with limited input information, the Seq2seq model tends to generate meaningless and trivial responses. It can be greatly enhanced if more supplementary information is provided in the generation process. In this work, we propose to utilize retrieved responses to boost the Seq2seq model for generating more informative replies. Our method, called ReBoost, incorporates retrieved results into the Seq2seq model through a hierarchical structure. The input message and retrieved results influence the generation process jointly. Experiments on two benchmark datasets demonstrate that our model generates more informative responses in both automatic and human evaluations and outperforms state-of-the-art response generation models.
PubDate: 2019-09-23

• Evaluation measures for quantification: an axiomatic approach
• Abstract: Quantification is the task of estimating, given a set $$\sigma$$ of unlabelled items and a set of classes $${\mathcal {C}}=\{c_{1}, \ldots , c_{|{\mathcal {C}}|}\}$$ , the prevalence (or "relative frequency") in $$\sigma$$ of each class $$c_{i}\in {\mathcal {C}}$$ . While quantification may in principle be solved by classifying each item in $$\sigma$$ and counting how many such items have been labelled with $$c_{i}$$ , it has long been shown that this "classify and count" method yields suboptimal quantification accuracy. As a result, quantification is no longer considered a mere byproduct of classification, and has evolved as a task of its own. While the scientific community has devoted a lot of attention to devising more accurate quantification methods, it has devoted little to discussing what properties an evaluation measure for quantification (EMQ) should enjoy, and which EMQs should be adopted as a result. This paper lays down a number of interesting properties that an EMQ may or may not enjoy, discusses if (and when) each of these properties is desirable, surveys the EMQs that have been used so far, and discusses whether or not they enjoy the above properties. As a result of this investigation, some of the EMQs that have been used in the literature turn out to be severely unfit, while others emerge as closer to what the quantification community actually needs. However, a significant result is that no existing EMQ satisfies all the properties identified as desirable, indicating that more research is needed in order to identify (or synthesize) a truly adequate EMQ.
PubDate: 2019-09-21
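The "classify and count" baseline mentioned above can be sketched alongside the standard adjusted-count correction; the correction shown is the well-known ACC method, used here only to illustrate why quantification is more than counting classifier outputs (binary case, with true/false positive rates assumed known from validation data):

```python
def classify_and_count(predictions):
    """Naive prevalence estimate: the fraction of items classified positive."""
    return sum(predictions) / len(predictions)

def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct the classify-and-count estimate using the classifier's
    true positive rate and false positive rate (the standard
    'adjusted count' correction), clipping to a valid prevalence."""
    cc = classify_and_count(predictions)
    p = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))
```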

• Special issue on de-personalisation, diversification, filter bubbles and
search
• PubDate: 2019-09-18

• How do interval scales help us with better understanding IR evaluation
measures?
• Abstract: Evaluation measures are the basis for quantifying the performance of IR systems, and the way their values can be processed in statistical analyses depends on the scales on which the measures are defined; for example, mean and variance should be computed only for interval scales. In our previous work we defined a theory of IR evaluation measures, based on the representational theory of measurement, which allowed us to determine whether and when IR measures are interval scales. We found that common set-based retrieval measures—namely precision, recall, and F-measure—are always interval scales in the case of binary relevance, while this does not hold in the multi-graded relevance case. Among rank-based retrieval measures—namely AP, gRBP, DCG, and ERR—only gRBP is an interval scale, and only when we choose a specific value of the parameter p and define a specific total order among systems; all the other measures are not interval scales. In this work, we build on our previous findings and carry out an extensive evaluation, based on standard TREC collections, to study how our theoretical findings impact the experimental ones. In particular, we conduct a correlation analysis to study the relationship among the above-mentioned state-of-the-art evaluation measures and their scales. We study how the scales of evaluation measures affect non-parametric and parametric statistical tests for multiple comparisons of IR system performance. Finally, we analyse how incomplete information and pool downsampling affect different scales and evaluation measures.
PubDate: 2019-09-04
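The set-based measures discussed above can be stated concretely; a minimal sketch under binary relevance, with retrieved and relevant given as sets of document identifiers:

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F-measure under binary relevance
    (the measures whose scale properties the work above analyses).

    retrieved, relevant: collections of document identifiers.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # relevant documents that were retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```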

• Evaluating sentence-level relevance feedback for high-recall information
retrieval
• Abstract: This study uses a novel simulation framework to evaluate whether the time and effort necessary to achieve high recall using active learning is reduced by presenting the reviewer with isolated sentences, as opposed to full documents, for relevance feedback. Under the weak assumption that more time and effort is required to review an entire document than a single sentence, simulation results indicate that the use of isolated sentences for relevance feedback can yield comparable accuracy and higher efficiency, relative to the state-of-the-art baseline model implementation (BMI) of the AutoTAR continuous active learning ("CAL") method employed in the TREC 2015 and 2016 Total Recall Track.
PubDate: 2019-08-13

• Deep cross-platform product matching in e-commerce
• Abstract: Online shopping has become more and more popular in recent years, leading to a boom in online platforms. Identical products are often offered by many sellers across multiple platforms, so comparing products across platforms has become a basic need for both consumers and sellers. However, identifying identical products on multiple platforms is difficult because the descriptions of a given product can vary. In this work, we propose a novel neural matching model to solve this problem. Two kinds of descriptions widely provided on online platforms, product titles and attributes, are considered in our method. We conduct experiments on a real-world data set which contains thousands of products on two online e-commerce platforms. The experimental results show that our method can make use of the product information contained in both titles and attributes and significantly outperforms state-of-the-art matching models.
PubDate: 2019-08-13

• Payoffs and pitfalls in using knowledge-bases for consumer health search
• Abstract: Consumer health search (CHS) is a challenging domain, with vocabulary mismatch and the considerable domain expertise required hampering people's ability to formulate effective queries. We posit that using knowledge bases for query reformulation may help alleviate this problem. How to exploit knowledge bases for effective CHS is nontrivial, involving a swathe of key choices and design decisions (many of which have not been explored in the literature). Here we rigorously and empirically evaluate the impact these choices have on retrieval effectiveness. A state-of-the-art knowledge-base retrieval model—the Entity Query Feature Expansion model—was used to evaluate the choices, which include: which knowledge base to use (specialised vs. general purpose), how to construct the knowledge base, how to extract entities from queries and map them to entities in the knowledge base, which part of the knowledge base to use for query expansion, and whether to augment the knowledge base search process with relevance feedback. While knowledge base retrieval has been proposed as a solution for CHS, this paper delves into the finer details of doing so effectively, highlighting both payoffs and pitfalls. It aims to provide lessons to others in advancing the state-of-the-art in CHS.
PubDate: 2019-08-01

• Neural variational entity set expansion for automatically populated
knowledge graphs
• Abstract: We propose Neural variational set expansion to extract actionable information from a noisy knowledge graph (KG), and propose a general approach for increasing the interpretability of recommendation systems. We demonstrate the usefulness of applying a variational autoencoder to the entity set expansion task, based on a realistic automatically generated KG.
PubDate: 2019-08-01

• Abstraction of query auto completion logs for anonymity-preserving
analysis
• Abstract: Query auto completion (QAC) is used in search interfaces to interactively offer a list of suggestions to users as they enter queries. The suggested completions are updated each time the user modifies their partial query, as they either add further keystrokes or interact directly with completions that have been offered. In this work we use a state model to capture the possible interactions that can occur in a QAC environment. Using this model, we show how an abstract QAC log can be derived from a sequence of QAC interactions; this log does not contain the actual characters entered, but records only the sequence of types of interaction, thus preserving user anonymity with extremely high confidence. To validate the usefulness of the approach, we use a large scale abstract QAC log collected from a popular commercial search engine to demonstrate how previous and new knowledge about QAC behavior can be inferred without knowledge of the queries being entered. An interaction model is then derived from this log to demonstrate its validity, and we report observations on user behavior with QAC systems based on the interaction model that is proposed.
PubDate: 2019-06-06
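The abstraction step described above can be sketched as follows, with illustrative event names rather than the paper's exact state labels; the point is that only interaction types survive, never the characters typed:

```python
from collections import Counter

def abstract_qac_log(events):
    """Reduce a QAC interaction sequence to an anonymity-preserving
    abstract log: only the *type* of each interaction is kept, never
    the characters of the partial query.

    events: (kind, payload) tuples, e.g. ("keystroke", "a"),
    ("click_suggestion", 3), ("submit", None). Event names are
    illustrative placeholders.
    """
    return [kind for kind, _payload in events]

def transition_counts(abstract_log):
    """Count consecutive interaction-type transitions: the kind of
    aggregate behaviour one can study from an abstract log without
    ever seeing the query text."""
    return Counter(zip(abstract_log, abstract_log[1:]))
```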

• The impact of result diversification on search behaviour and performance
PubDate: 2019-05-16

• Special issue on knowledge graphs and semantics in text analysis and
retrieval
• PubDate: 2019-03-04

JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +00 44 (0)131 4513762