Abstract: Typically, a Reinforcement Learning (RL) algorithm focuses on learning a single deployable policy as the end product. Depending on the initialization method and random seed, learning a single policy can converge to different local optima across runs, especially when the algorithm is sensitive to hyper-parameter tuning. Motivated by the capability of Generative Adversarial Networks (GANs) to learn complex data manifolds, the adversarial training procedure can instead be used to learn a population of well-performing policies. We extend the teacher-student methodology from the Knowledge Distillation literature on standard deep neural network prediction tasks to the RL paradigm. Instead of learning a single compressed student network, an adversarially trained generative model (a hypernetwork) is learned to output the network weights of a population of well-performing policy networks, representing a school of apprentices. Our proposed framework, named Teacher-Apprentices RL (TARL), is modular and can be used in conjunction with many existing RL algorithms. We illustrate the performance gain and improved robustness obtained by combining TARL with various types of RL algorithms, including the direct policy search Cross-Entropy Method, Q-learning, Actor-Critic, and policy gradient-based methods. PubDate: 2023-04-28
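As an illustration of the hypernetwork idea described above, the following minimal sketch (not TARL itself; the latent dimension, layer sizes, and class name are assumptions) shows a generator mapping a latent code to the flat weight vector of a small policy network, so that different latent codes yield different members of a policy population:

```python
import numpy as np

class HyperPolicyGenerator:
    """Sketch of a hypernetwork: latent code -> weights of one policy network.
    Sizes are illustrative, not taken from the paper."""

    def __init__(self, latent_dim=8, obs_dim=4, hidden=16, n_actions=2, seed=0):
        rng = np.random.default_rng(seed)
        self.shapes = [(obs_dim, hidden), (hidden,), (hidden, n_actions), (n_actions,)]
        self.n_params = sum(int(np.prod(s)) for s in self.shapes)
        # A single linear generator, purely for illustration; TARL trains its generator adversarially.
        self.G = rng.normal(scale=0.1, size=(latent_dim, self.n_params))

    def sample_policy(self, z):
        """Reshape the generated flat weight vector into one apprentice policy."""
        flat = z @ self.G
        params, i = [], 0
        for s in self.shapes:
            n = int(np.prod(s))
            params.append(flat[i:i + n].reshape(s))
            i += n
        W1, b1, W2, b2 = params

        def policy(obs):
            h = np.tanh(obs @ W1 + b1)
            return int(np.argmax(h @ W2 + b2))  # greedy action
        return policy

gen = HyperPolicyGenerator()
pi = gen.sample_policy(np.random.default_rng(1).normal(size=8))
print(pi(np.zeros(4)))  # one sampled apprentice acting on a dummy observation
```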
Abstract: We propose a novel multi-objective reinforcement learning algorithm that successfully learns the optimal policy even for non-linear utility functions. Non-linear utility functions pose a challenge for state-of-the-art approaches, both in terms of learning efficiency and the solution concept. A key insight is that, by proposing a critic that learns a multivariate distribution over the returns, which is then combined with the accumulated rewards, we can directly optimize the utility function, even if it is non-linear. This vastly increases the range of problems that can be solved compared to single-objective methods or multi-objective methods requiring linear utility functions, while avoiding the need to learn the full Pareto front. We demonstrate our method on multiple multi-objective benchmarks, and show that it learns effectively where baseline approaches fail. PubDate: 2023-04-28
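A minimal sketch of the core mechanism (the Gaussian return model and the utility function below are illustrative assumptions standing in for the paper's learned distributional critic): actions are scored by the expected nonlinear utility of the accrued rewards plus sampled future returns, rather than by a per-objective expected value:

```python
import numpy as np

def expected_utility(accrued, future_mean, future_cov, utility, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[u(accrued + future return)].
    future_mean/future_cov stand in for a learned multivariate return distribution."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(future_mean, future_cov, size=n_samples)
    return float(np.mean([utility(accrued + s) for s in samples]))

# Illustrative non-linear utility over two objectives (e.g. reward vs. cost).
u = lambda r: min(r[0], 10.0) - 0.1 * r[1] ** 2

accrued = np.array([2.0, 1.0])
print(expected_utility(accrued, np.array([5.0, 3.0]), np.eye(2), u))
```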
Abstract: Support agents that help users in their daily lives need to take into account not only the user’s characteristics, but also the social situation of the user. Existing work on including social context uses some type of situation cue as an input to information processing techniques in order to assess the expected behavior of the user. However, research shows that it is important to also determine the meaning of a situation, a step which we refer to as social situation comprehension. We propose using psychological characteristics of situations, which have been proposed in social science for ascribing meaning to situations, as the basis for social situation comprehension. Using data from user studies, we evaluate this proposal from two perspectives. First, from a technical perspective, we show that psychological characteristics of situations can be used as input to predict the priority of social situations, and that psychological characteristics of situations can be predicted from the features of a social situation. Second, we investigate the role of the comprehension step in human–machine meaning making. We show that psychological characteristics can be successfully used as a basis for explanations given to users about the decisions of an agenda management personal assistant agent. PubDate: 2023-04-28
Abstract: In many risk-aware and multi-objective reinforcement learning settings, the utility of the user is derived from a single execution of a policy. In these settings, making decisions based on the average future returns is not suitable. For example, in a medical setting a patient may only have one opportunity to treat their illness. Making decisions using only the expected future returns (known in reinforcement learning as the value) cannot account for the potential range of adverse or positive outcomes a decision may have. Therefore, we should use the distribution over future returns differently, to represent the critical information the agent requires at decision time by taking both the future and accrued returns into consideration. In this paper, we propose two novel Monte Carlo tree search algorithms. First, we present a Monte Carlo tree search algorithm that can compute policies for nonlinear utility functions (NLU-MCTS) by optimising the utility of the different possible returns attainable from individual policy executions, resulting in good policies for both risk-aware and multi-objective settings. Second, we propose a distributional Monte Carlo tree search algorithm (DMCTS) which extends NLU-MCTS. DMCTS computes an approximate posterior distribution over the utility of the returns, and utilises Thompson sampling during planning to compute policies in risk-aware and multi-objective settings. Both algorithms outperform the state of the art in multi-objective reinforcement learning for the expected utility of the returns. PubDate: 2023-04-28
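The following small sketch shows the ingredient NLU-MCTS backs up through the tree: a simulated playout is scored by the utility of the total return (accrued plus future), not by an expected value. The risk-averse utility used here is an illustrative assumption:

```python
import numpy as np

def rollout_utility(accrued_return, rollout_return, utility):
    """Score one playout by u(accrued + future) rather than by expectation."""
    total = np.asarray(accrued_return, dtype=float) + np.asarray(rollout_return, dtype=float)
    return utility(total)

# Risk-averse example: losses count double, so variance-heavy actions score lower.
u = lambda r: float(r[0]) if r[0] >= 0 else 2.0 * float(r[0])
print(rollout_utility([1.0], [-3.0], u))  # -4.0
```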
Abstract: In one of its simplest forms, Team Formation involves deploying the least expensive team of agents that covers a given set of skills. While current algorithms are reasonably successful at computing the best teams, the resilience of such solutions to change remains an important concern: once a team has been formed, some of the agents selected at the start may turn out to be defective, leaving some skills uncovered. Two recently introduced solution concepts deal with this issue proactively: (1) form a team that is robust to changes, so that after some agent losses all skills remain covered, and (2) opt for a recoverable team, i.e., one that can be "repaired" in the worst case by hiring new agents while keeping the overall deployment cost minimal. In this paper, we introduce the problem of partially robust team formation (PR-TF). Partial robustness is a weaker form of robustness which guarantees a certain degree of skill coverage after some agents are lost. We analyze the computational complexity of PR-TF and provide two complete algorithms for it. We compare the performance of our algorithms with the existing methods for robust and recoverable team formation on several existing and newly introduced benchmarks. Our empirical study demonstrates that partial robustness offers an interesting trade-off between (full) robustness and recoverability in terms of computational efficiency, skill coverage guaranteed after agent losses, and repairability. This paper is an extended and revised version of the work reported by Schwind et al. (Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS'21), pp. 1154–1162, 2021). PubDate: 2023-04-25
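As a toy illustration of the partial-robustness idea (a brute-force sketch under assumed inputs, not the PR-TF algorithms of the paper), the check below asks whether a team still covers a given fraction of the required skills after losing any k agents:

```python
from itertools import combinations

def is_partially_robust(team_skills, required_skills, k, coverage_ratio):
    """team_skills: dict agent -> set of skills (illustrative encoding).
    Returns True if every loss of k agents leaves at least coverage_ratio
    of the required skills covered."""
    agents = list(team_skills)
    required = set(required_skills)
    threshold = coverage_ratio * len(required)
    for lost in combinations(agents, min(k, len(agents))):
        survivors = [a for a in agents if a not in lost]
        covered = set().union(*(team_skills[a] for a in survivors)) if survivors else set()
        if len(covered & required) < threshold:
            return False
    return True

team = {"a1": {"s1", "s2"}, "a2": {"s2", "s3"}, "a3": {"s3", "s1"}}
print(is_partially_robust(team, {"s1", "s2", "s3"}, k=1, coverage_ratio=2 / 3))  # True
```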
Abstract: Given a set of agents qualifying or disqualifying each other, group identification is the task of identifying a socially qualified subgroup of agents. Social qualification depends on the specific rule used to aggregate individual qualifications. The classical bribery problem in this context asks how many agents need to change their qualifications in order to change the outcome in a certain way. Complementing previous results showing polynomial-time solvability or NP-hardness of bribery for various social rules in the constructive (aiming at making specific agents socially qualified) or destructive (aiming at making specific agents socially disqualified) setting, we provide a comprehensive picture of the parameterized computational complexity landscape. Conceptually, we also consider a more fine-grained concept of bribery cost (asking how many individual qualifications need to be changed), non-unit prices for different bribery actions, and a more general bribery goal that combines the constructive and destructive settings. PubDate: 2023-03-24 DOI: 10.1007/s10458-023-09597-7
Abstract: Robotic football has long been seen as a grand challenge in artificial intelligence. Despite the recent success of learned policies over heuristics and handcrafted rules in general, current teams in the simulated RoboCup football leagues, where autonomous agents compete against each other, still rely on handcrafted strategies, with only a few using reinforcement learning directly. This limits a learning agent's ability to find stronger high-level strategies for the full game. In this paper, we show that it is possible for agents to learn competent football strategies in a full 22-player setting using limited computation resources (one GPU and one CPU), from tabula rasa through self-play. To do this, we build a 2D football simulator with faster simulation times than the RoboCup simulator. We propose various improvements to the standard single-agent PPO training algorithm which help it scale to our multi-agent setting. These improvements include (1) using a policy and critic network with an attention mechanism that scales linearly in the number of agents, (2) sharing networks between agents, which allows for faster throughput using batching, and (3) using Polyak-averaged opponents, league opponents, and freezing the opponent team when necessary. We show through experimental results that stable training in the full 22-player setting is possible. Agents trained in the 22-player setting learn to defeat a variety of handcrafted strategies, and also achieve a higher win rate than agents trained in the 4-player setting and evaluated in the full game. PubDate: 2023-03-24 DOI: 10.1007/s10458-023-09603-y
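As an example of improvement (3), the sketch below maintains a Polyak-averaged copy of the learner's weights to serve as a slowly moving self-play opponent; the coefficient and the dict-of-arrays parameter format are assumptions, not the paper's exact setup:

```python
import numpy as np

def polyak_update(opponent_params, learner_params, tau=0.005):
    """Move the opponent's weights a small step towards the learner's.
    Both arguments are dicts of name -> array; tau is illustrative."""
    return {name: (1.0 - tau) * opponent_params[name] + tau * learner_params[name]
            for name in opponent_params}

opponent = {"w": np.zeros(3)}
learner = {"w": np.ones(3)}
opponent = polyak_update(opponent, learner, tau=0.1)  # called after each learner update
print(opponent["w"])  # [0.1 0.1 0.1]
```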
Abstract: We consider a natural generalization of the fundamental electoral manipulation problem, in which a briber can change the opinions or preferences of voters through influence. This is motivated by modern political campaigns, where candidates try to convince voters through media such as TV, newspapers, and the Internet. Compared with the classical bribery problem, we do not assume the briber directly exchanges money for votes from individual voters, but rather that the briber has a set of potential campaign strategies. Each campaign strategy represents some way of casting influence on voters. A campaign strategy has some cost and can influence a subset of voters. If a voter belongs to the audience of a campaign strategy, then he/she will be influenced. A voter is more likely to change his/her opinion or preference after receiving influence from a larger number of campaign strategies. We model this through an independent activation model, which is widely adopted in social science research, and study the computational complexity. In this paper, we give a full characterization by showing NP-hardness results and establishing a near-optimal fixed-parameter tractable algorithm that gives a solution arbitrarily close to the optimal one. PubDate: 2023-03-08 DOI: 10.1007/s10458-023-09602-z
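Under an independent activation reading of the model (a sketch; the exact parameterization in the paper may differ), the probability that a voter changes their opinion is one minus the probability that every campaign strategy reaching them fails independently:

```python
def flip_probability(per_strategy_probs):
    """per_strategy_probs: activation probabilities of the campaign strategies
    whose audience contains this voter. More strategies -> higher flip chance."""
    stay = 1.0
    for p in per_strategy_probs:
        stay *= (1.0 - p)
    return 1.0 - stay

print(flip_probability([0.2, 0.3]))  # 0.44 when two strategies reach the voter
```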
Abstract: Public announcement logic (PAL) extends multi-agent epistemic logic with dynamic operators modelling the effects of public communication. Allowing quantification over public announcements lets us reason about the existence of an announcement that reaches a certain epistemic goal. Two notable examples of logics of quantified announcements are arbitrary public announcement logic (APAL) and group announcement logic (GAL). While the notion of common knowledge plays an important role in PAL, in particular in characterisations of epistemic states that an agent or a group of agents can bring about by performing public announcements, extensions of APAL and GAL with common knowledge have not yet been studied in detail. That is what we do in this paper. In particular, we consider both conservative extensions, where the semantics of the quantifiers is not changed, and extensions where the scope of quantification also includes common knowledge formulas. We compare the expressivity of these extensions relative to each other and to other connected logics, and provide sound and complete axiomatisations. Finally, we show how the completeness results can be used for other logics with quantification over information change. PubDate: 2023-03-08 DOI: 10.1007/s10458-023-09601-0
Abstract: Deep reinforcement learning has contributed to dramatic advances in many tasks, such as playing games, controlling robots, and navigating complex environments. However, it requires many interactions with the environment. This differs from the human learning process, since humans can use prior knowledge, which significantly speeds up learning by avoiding unnecessary exploration. Previous work integrating knowledge into RL did not model the uncertainty in human cognition, which reduces the reliability of that knowledge. In this paper, we propose the knowledge-guided policy network, a novel framework that combines suboptimal human knowledge with reinforcement learning. Our framework consists of a fuzzy rule controller representing human knowledge and a refinement module that fine-tunes suboptimal prior knowledge. The proposed framework is end-to-end and can be combined with existing reinforcement learning algorithms such as PPO, AC, and SAC. We conduct experiments on both discrete and continuous control tasks. The empirical results show that our approach, which combines suboptimal human knowledge and RL, significantly improves the learning efficiency of basic RL algorithms, even with low-quality human prior knowledge. Additional experiments on the number of fuzzy rules and the interpretability of the policy make our proposed framework more complete and well-founded. The code for this research is released under the project page of https://github.com/yuyuanq/reinforcement-learning-using-knowledge-controller. PubDate: 2023-02-18 DOI: 10.1007/s10458-023-09600-1
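To make the architecture concrete, here is a minimal sketch (memberships, rules, and the blending weight are illustrative assumptions, not the paper's network) of a fuzzy rule controller whose action preference is blended with a learned policy's action distribution:

```python
import numpy as np

def fuzzy_rule_preference(pole_angle):
    """Toy fuzzy controller for a cart-pole-like task: one rule per push
    direction, with clipped-linear memberships over the pole angle."""
    left = max(0.0, min(1.0, -pole_angle / 0.2))
    right = max(0.0, min(1.0, pole_angle / 0.2))
    return np.array([left, right])  # preference over {push left, push right}

def combined_action(policy_logits, pole_angle, knowledge_weight=0.3):
    """Blend the RL policy with the (possibly suboptimal) rule preference."""
    probs = np.exp(policy_logits - policy_logits.max())
    probs /= probs.sum()
    mixed = (1 - knowledge_weight) * probs + knowledge_weight * fuzzy_rule_preference(pole_angle)
    return int(np.argmax(mixed))

print(combined_action(np.array([0.1, 0.0]), pole_angle=0.15))  # the rule nudges towards action 1
```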
Abstract: Egalitarian considerations play a central role in many areas of social choice theory. Applications of egalitarian principles range from ensuring everyone gets an equal share of a cake when deciding how to divide it, to guaranteeing balance with respect to gender or ethnicity in committee elections. Yet the egalitarian approach has received little attention in judgment aggregation—a powerful framework for aggregating logically interconnected issues. We take the first steps towards filling that gap. We introduce axioms capturing two classical interpretations of egalitarianism in judgment aggregation and situate these within the context of existing axioms in the pertinent framework of belief merging. We then explore the relationship between these axioms and several notions of strategyproofness from social choice theory at large. Finally, a novel egalitarian judgment aggregation rule stems from our analysis; we present complexity results concerning both outcome determination and strategic manipulation for that rule. PubDate: 2023-02-02 DOI: 10.1007/s10458-023-09598-6
Abstract: We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a non-oblivious strategic adversary who follows a no-external-regret algorithm. In this setting, we first demonstrate that MDP-Expert, an existing algorithm that works well with oblivious adversaries, can still apply and achieves a policy regret bound of \({\mathcal {O}}(\sqrt{T \log (L)}+\tau ^2\sqrt{ T \log (\vert A \vert )})\) where L is the size of the adversary's pure strategy set and \(\vert A \vert\) denotes the size of the agent's action space. Considering real-world games where the support size of a Nash equilibrium (NE) is small, we further propose a new algorithm, MDP-Online Oracle Expert (MDP-OOE), that achieves a policy regret bound of \({\mathcal {O}}(\sqrt{T\log (L)}+\tau ^2\sqrt{ T k \log (k)})\) where k depends only on the support size of the NE. MDP-OOE leverages the key benefit of Double Oracle in game theory and can thus solve games with prohibitively large action spaces. Finally, to better understand the learning dynamics of no-regret methods, under the same setting of a no-external-regret adversary in OMDPs, we introduce an algorithm that achieves last-round convergence to an NE. To the best of our knowledge, this is the first work leading to a last-iterate result in OMDPs. PubDate: 2023-01-27 DOI: 10.1007/s10458-023-09599-5
Abstract: Reinforcement Learning (RL) has been widely used to solve sequential decision-making problems. However, it often suffers from slow learning speed in complex scenarios. Teacher–student frameworks address this issue by enabling agents to ask for and give advice so that a student agent can leverage the knowledge of a teacher agent to facilitate its learning. In this paper, we consider the effect of reusing previous advice, and propose a novel memory-based teacher–student framework such that student agents can memorize and reuse the previous advice from teacher agents. In particular, we propose two methods to decide whether previous advice should be reused: Q-Change per Step that reuses the advice if it leads to an increase in Q-values, and Decay Reusing Probability that reuses the advice with a decaying probability. The experiments on diverse RL tasks (Mario, Predator–Prey and Half Field Offense) confirm that our proposed framework significantly outperforms the existing frameworks in which previous advice is not reused. PubDate: 2022-12-29 DOI: 10.1007/s10458-022-09595-1
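Paraphrasing the two reuse tests as code (a sketch; thresholds and decay constants are assumptions), the framework would consult something like the following before replaying stored advice:

```python
import random

def reuse_by_q_change(q_now, q_when_stored):
    """Q-Change per Step (sketch): reuse only if the advised action's
    Q-value has increased since the advice was memorized."""
    return q_now > q_when_stored

def reuse_by_decay(times_reused, initial_prob=0.9, decay=0.95, rng=random):
    """Decay Reusing Probability (sketch): reuse with a probability that
    shrinks each time the advice is replayed."""
    return rng.random() < initial_prob * (decay ** times_reused)

print(reuse_by_q_change(q_now=1.2, q_when_stored=0.8))  # True
print(reuse_by_decay(times_reused=10))                  # stochastic
```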
Abstract: We study the NP-hard Fair Connected Districting problem recently proposed by Stoica et al. [AAMAS 2020]: partition a vertex-colored graph into k connected components (subsequently referred to as districts) so that in every district the most frequent color occurs at most a given number of times more often than the second most frequent color. Fair Connected Districting is motivated by various real-world scenarios where agents of different types, which are one-to-one represented by nodes in a network, have to be partitioned into disjoint districts. Herein, one strives for "fair districts" without any type being in a dominating majority in any of the districts. This is meant, for example, to prevent segregation or the political domination of a single party. We conduct a fine-grained analysis of the (parameterized) computational complexity of Fair Connected Districting. In particular, we prove that it is polynomial-time solvable on paths, cycles, stars, and caterpillars, but already becomes NP-hard on trees. Motivated by the latter negative result, we perform a parameterized complexity analysis with respect to various graph parameters, including treewidth, and problem-specific parameters, including the numbers of colors and districts. We obtain a rich and diverse, close to complete picture of the corresponding parameterized complexity landscape (that is, a classification along the complexity classes FPT, XP, W[1]-hard, and para-NP-hard). PubDate: 2022-12-28 DOI: 10.1007/s10458-022-09594-2
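The fairness condition on a single district can be checked directly; the sketch below uses color lists and a margin k as an illustrative encoding:

```python
from collections import Counter

def district_is_fair(district_colors, k):
    """True iff the most frequent color exceeds the second most frequent
    by at most k occurrences in this district."""
    counts = Counter(district_colors).most_common()
    top = counts[0][1]
    second = counts[1][1] if len(counts) > 1 else 0
    return top - second <= k

print(district_is_fair(["red", "red", "blue", "blue", "blue"], k=1))  # True
print(district_is_fair(["red", "red", "red", "blue"], k=1))           # False
```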
Abstract: Multi-Agent Path Finding (MAPF) is a well-studied problem that aims to generate collision-free paths for multiple agents while optimizing a single objective. However, many real-world applications require the consideration of multiple objectives. In this paper, we address a novel extension of MAPF, Multi-Objective MAPF (MOMAPF), that aims to optimize multiple given objectives while computing collision-free paths for all agents. In particular, we aim to incorporate the preferences of a decision maker into multi-agent path planning. Thus, we propose to solve a scalarized MOMAPF, whereby the given preferences of a decision maker are reflected by a weight associated with each objective and all weighted objectives are combined into one scalar. We introduce a solver for scalarized MOMAPF based on Conflict-Based Search (CBS) that incorporates an adapted path planner based on an evolutionary algorithm, the Genetic Algorithm (GA). We also introduce three practical objectives to consider in path planning: efficiency, safety, and smoothness. We evaluate the performance of our proposed method as a function of the input parameters of the GA in simulation experiments, and we analyze its efficiency in providing conflict-free solutions within a fixed time. PubDate: 2022-12-23 DOI: 10.1007/s10458-022-09593-3
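The scalarization step itself is simple; the sketch below (with assumed objective names and weights, not those tuned in the paper) shows the single scalar the CBS-with-GA planner would minimize:

```python
def scalarize(objective_costs, weights):
    """Weighted-sum scalarization of per-objective path costs."""
    assert len(objective_costs) == len(weights)
    return sum(w * c for w, c in zip(weights, objective_costs))

# A decision maker who cares mostly about safety:
costs = {"efficiency": 12.0, "safety": 3.5, "smoothness": 1.2}
weights = {"efficiency": 0.2, "safety": 0.7, "smoothness": 0.1}
print(scalarize(list(costs.values()), list(weights.values())))  # 4.97
```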
Abstract: Balancing multiple competing and conflicting objectives is an essential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises not only from misalignment between individuals with competing values, but also from conflicting value systems held by a single human. Starting from the principle of loss aversion, we designed a set of soft maximin function approaches to multi-objective decision-making. Benchmarking these functions in a set of previously developed environments, we found that one new approach in particular, 'split-function exp-log loss aversion' (SFELLA), learns faster than the state-of-the-art thresholded alignment objective method of Vamplew (Engineering Applications of Artificial Intelligence 100:104186, 2021) on three of the four tasks it was tested on, and achieved the same optimal performance after learning. SFELLA also showed relative robustness improvements against changes in objective scale, which may highlight an advantage in dealing with distribution shifts in the environment dynamics. We further compared SFELLA to the multi-objective reward exponentials (MORE) approach, and found that SFELLA performs similarly to MORE in a simple previously described foraging task; however, in a modified foraging environment with a new resource that was not depleted as the agent worked, SFELLA collected more of the new resource at very little cost in terms of the old resource. Overall, we found SFELLA useful for avoiding problems that sometimes occur with a thresholded approach, and more reward-responsive than MORE while retaining its conservative, loss-averse incentive structure. PubDate: 2022-12-21 DOI: 10.1007/s10458-022-09586-2
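For intuition only, here is a generic exp-log soft maximin aggregator (explicitly not the paper's SFELLA formula, whose exact form is not given in the abstract): it smoothly approximates the worst objective's return, so the agent is loss-averse yet still receives signal from all objectives:

```python
import numpy as np

def soft_maximin(objective_returns, temperature=1.0):
    """Smooth approximation of min_i r_i via a negative log-sum-exp."""
    r = np.asarray(objective_returns, dtype=float) / temperature
    return -temperature * float(np.log(np.sum(np.exp(-r))))

print(soft_maximin([5.0, 0.5]))  # ~0.49, dominated by the worst objective
print(soft_maximin([5.0, 5.0]))  # ~4.31, ties are pulled down slightly
```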
Abstract: We address the problem of strategyproof (SP) facility location mechanisms on discrete trees. Our main result is a full characterization of onto and SP mechanisms. In particular, we prove that when a single agent significantly affects the outcome, the trajectory of the facility is almost contained in the trajectory of the agent, and both move in the same direction along the common edges. We show tight relations of our characterization to previous results on discrete lines and on continuous trees. We then derive further implications of the main result for infinite discrete lines. PubDate: 2022-12-08 DOI: 10.1007/s10458-022-09592-4
Abstract: Ranking is a prerequisite for making decisions, and is therefore a frequently applied activity carrying considerable responsibility. This study considers fairness issues in a multi-criteria decision-making (MCDM) method called VIKOR (in Serbian, VIšekriterijumska optimizacija i KOmpromisno Rešenje, meaning Multiple Criteria Optimization and Compromise Solution). The method is distinctive in its original property of searching for first-ranked compromise solutions based on the parameter v. The VIKOR method was modified in this paper to rank all the alternatives and find compromise solutions for each rank. The obtained ranks were then used to satisfy fairness constraints (i.e., the desired level of disparate impact) through criteria-weight optimization. We built three types of mathematical models, depending on decision makers’ (DMs’) preferences regarding the definition of the compromise parameter v. Metaheuristic optimization algorithms were explored in order to minimize the differences between the VIKOR ranking before and after optimization. The proposed postprocessing reranking approach ensures fair ranking (i.e., ranking without discrimination). The conducted experiments involve three real-life datasets of different sizes that are well known in the literature. Comparisons with popular fair ranking algorithms, based on several rank-based metrics intended to measure accuracy and fairness, indicate the strong performance of the suggested approach. The most significant contributions include developing automated and adaptive optimization procedures that can be further adjusted to DMs’ preferences, and matching fairness metrics with traditional MCDM goals in a comprehensive full VIKOR ranking. PubDate: 2022-12-05 DOI: 10.1007/s10458-022-09591-5
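For reference, the standard VIKOR index on which the modification builds can be sketched as follows (benefit criteria, illustrative data and weights; the paper's fairness-driven weight optimization then adjusts the weights):

```python
import numpy as np

def vikor_q(decision_matrix, weights, v=0.5):
    """Standard VIKOR: compute group utility S, individual regret R, and the
    compromise index Q (lower Q ranks better). Rows = alternatives,
    columns = benefit criteria (larger is better)."""
    X = np.asarray(decision_matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    f_best, f_worst = X.max(axis=0), X.min(axis=0)
    span = np.where(f_best == f_worst, 1.0, f_best - f_worst)  # avoid division by zero
    D = w * (f_best - X) / span          # weighted normalized distances to the ideal
    S, R = D.sum(axis=1), D.max(axis=1)  # group utility and individual regret
    Sn = (S - S.min()) / max(S.max() - S.min(), 1e-12)
    Rn = (R - R.min()) / max(R.max() - R.min(), 1e-12)
    return v * Sn + (1 - v) * Rn

X = [[8, 7, 2], [5, 9, 6], [7, 6, 8]]
print(vikor_q(X, weights=[0.5, 0.3, 0.2], v=0.5))
```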
Abstract: We study fair allocations of indivisible goods and chores in conjunction with system efficiency, measured by two social welfare functions, namely utilitarian and egalitarian welfare. To model preference, each agent is associated with a cardinal and additive valuation function. The fairness criteria we are concerned with are equitability up to any item (EQX) and equitability up to one item (EQ1). For the trade-off between fairness and efficiency, we investigate efficiency loss under these fairness constraints and establish the price of fairness. From the computational perspective, we provide a complete picture of the computational complexity of (i) deciding the existence of an EQX/EQ1 and welfare-maximizing allocation; (ii) computing a welfare maximizer among all EQX/EQ1 allocations. PubDate: 2022-12-02 DOI: 10.1007/s10458-022-09587-1
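To make the fairness notions concrete, a sketch of the EQ1 test for goods with additive valuations is given below (EQX would remove the least valuable item instead of the most valuable one; the encoding is illustrative):

```python
def is_eq1(allocation, valuations):
    """EQ1 for goods (sketch): every agent i values its own bundle at least as
    much as any agent j values j's bundle after removing j's most valuable item.
    allocation: dict agent -> list of items; valuations: dict agent -> dict item -> value."""
    def value(agent, bundle):
        return sum(valuations[agent][item] for item in bundle)

    for i in allocation:
        for j in allocation:
            if i == j:
                continue
            best_removal = max((valuations[j][item] for item in allocation[j]), default=0)
            if value(i, allocation[i]) < value(j, allocation[j]) - best_removal:
                return False
    return True

alloc = {"i": ["g1"], "j": ["g2", "g3"]}
vals = {"i": {"g1": 3, "g2": 1, "g3": 1}, "j": {"g1": 2, "g2": 2, "g3": 2}}
print(is_eq1(alloc, vals))  # True
```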
Abstract: This work proposes a notion of accountability for multi-agent systems that supports the development of robust distributed systems. Accountability is grounded in responsibility and encompasses both a normative dimension and a structural dimension. To realize robust distributed systems, conceived as agent systems or organizations, it is necessary to maintain an appropriate level of situational awareness through the introduction of means for gathering and propagating accounts, upon which actions can be taken. This paper presents a formalization of accountability, including the accountability lifecycle, for the design of robust agent organizations. Particular attention is given to the interplay of accountability and goals, by describing typical patterns in which accountability affects the state of an agent's goals and vice versa. We illustrate the practical aspects of the proposal by means of JaCaMo (Boissier et al. Sci Comput Program 78(6):747–761, 2013. https://doi.org/10.1016/j.scico.2011.10.004). PubDate: 2022-11-28 DOI: 10.1007/s10458-022-09590-6