Project


Neural models of probabilistic reinforcement learning

English title Neural models of probabilistic reinforcement learning
Applicant Pouget Alexandre
Number 197296
Funding scheme Project funding (Div. I-III)
Research institution Dépt des Neurosciences Fondamentales Faculté de Médecine Université de Genève
Institution of higher education University of Geneva - GE
Main discipline Neurophysiology and Brain Research
Start/End 01.01.2021 - 31.12.2024
Approved amount 632'000.00

All Disciplines (2)

Discipline
Neurophysiology and Brain Research
Mathematics

Keywords (5)

neuroscience; computational; learning; information; uncertainty

Lay Summary (translated from French)

Lead
This project explores the neural bases of reinforcement learning, a form of learning widely used by animals and whose computer implementations have already allowed artificial systems to beat humans at games such as chess and Go.
Lay summary

Reinforcement learning makes it possible to learn a sequence of actions leading to a reward, as in chess, where the reward only becomes known at the end of a long series of moves on the board. This type of learning requires very little supervision, since the agent only needs to know whether it won or lost, without having to be told what the intermediate decisions (in chess, the successive moves) should have been. But this is also what makes this kind of learning particularly difficult, since the agent must work out how much each move contributed to the outcome of the game.

In classical approaches to reinforcement learning, the agent is characterized by its current state (e.g., the configuration of the chessboard at each point in the game), and the goal is to predict the average reward associated with each state. This makes it possible to choose actions that lead to the states associated with the highest rewards. Recent work has shown, however, that it is more efficient to represent the probability distribution of rewards rather than just the average reward. In chess, for example, the goal would be to determine, for each board configuration, the probability that the game ends in a win, a draw or a loss. This approach leads to faster and more effective learning. Our research will explore the neural bases of this type of probabilistic reinforcement learning.
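The difference between the two representations can be made concrete with a toy numerical sketch (purely illustrative; the outcome probabilities and the +1/0/-1 reward convention below are assumptions, not data from the project):

# Toy sketch: the same chess-like position summarized either by its average
# reward or by its full reward distribution (all numbers are made up).
rewards = {"win": 1.0, "draw": 0.0, "loss": -1.0}          # assumed convention
outcome_probs = {"win": 0.25, "draw": 0.50, "loss": 0.25}  # hypothetical state

# Classical RL keeps only the expected (average) reward of the state...
mean_value = sum(outcome_probs[o] * rewards[o] for o in rewards)

# ...whereas distributional RL keeps the whole distribution over outcomes.
print(f"mean value: {mean_value:+.2f}")   # +0.00
print(f"distribution: {outcome_probs}")   # win / draw / loss probabilities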

 

Last update: 19.10.2020

Responsible applicant and co-applicants

Associated projects

Number Title Start Funding scheme
165831 Probabilistic approaches to synaptic learning 01.01.2017 Project funding (Div. I-III)

Abstract

Reinforcement learning (RL) is a powerful form of learning which allows animals and artificial systems alike to learn from sparse and impoverished supervision. Spectacular progress has been made over the last few decades on the theoretical foundations of RL as well as its neural implementation. Nonetheless, these theories are still largely limited to situations in which the state of the agent is known with certainty, rewards are deterministic, and coding capacities are virtually infinite. Yet, in real-world situations, these assumptions are rarely valid. This proposal explores extensions of existing algorithms to incorporate reward and state uncertainty as well as coding limitations. Specifically, we intend to explore the following three main aims.

Aim 1: Probabilistic RL and its neural implementation. Recent work suggests that the ventral tegmental area (VTA) encodes a probability distribution over rewards, or rather, over reward prediction errors (Dabney et al., 2020). This form of RL is known as distributional RL and has been shown to vastly improve performance on a variety of benchmarks. Aim 1 of this proposal will extend this theory to the case in which the states of the agent are themselves uncertain, which we refer to as probabilistic RL. There is indeed strong experimental evidence that the brain encodes state uncertainty. It is unclear, however, whether this uncertainty influences the response of VTA neurons and how neural circuits could implement the computation required to go from a probabilistic representation of states to a probabilistic representation of rewards and reward prediction errors. Once we understand the nature of the computation involved, we will derive an extension of RL that can learn the proper transformation. We will initially work with expectile codes (Dabney et al., 2020) for the probabilistic representation of reward in the VTA before exploring another type of code, the sigmoidal code, which is more consistent with existing codes in the brain and which has the potential of providing an even richer representation of cumulative rewards and sequences of rewards. These new algorithms will be applied to a perceptual decision-making task in which the uncertainty of the sensory stimulus can be tightly controlled. These simulations will be used to generate experimental predictions to be tested in the Uchida lab at Harvard University.

Aim 2: Successor representation for probabilistic RL. In so-called model-based RL, the agent builds a model of the world, which can be used for offline learning, planning, and fast generalization to new reward contingencies. However, building a model of a complex environment can sometimes be computationally costly and time consuming. In standard RL, the successor representation provides a compromise between model-free and model-based RL and affords interesting generalization properties while being reasonably simple. This part of the proposal will focus on generalizing the concept of the successor representation to probabilistic RL and on generating experimental predictions, again to be tested in the Uchida lab.

Aim 3: Planning in probabilistic RL. RL models generally assume that key variables such as state or state-action values (i.e., Q-values) can be encoded with nearly infinite precision. The brain, however, can only encode these variables with limited precision because of energetic costs that come, for instance, from recruiting more neurons for representation and computation or creating the relevant synaptic connections between existing neurons for learning. In order to plan efficiently in this framework of bounded rationality (Gershman et al., 2015), it is essential to allocate these resources in a way that maximizes expected reward. We intend to investigate this question by designing new cost functions that can adjudicate between maximizing reward and limiting coding resources.
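As background for Aims 1 and 2, the two short Python sketches below illustrate the standard building blocks the proposal starts from: an expectile-style distributional TD rule in the spirit of Dabney et al. (2020), and the classic successor representation. They are minimal sketches under stated assumptions, not implementations of the proposed probabilistic extensions; they include neither state uncertainty nor coding constraints, and all parameters (number of channels, learning rates, transition matrix, rewards) are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Expectile-style distributional update: each value channel i has its own
# asymmetry taus[i], so channel i converges to the taus[i]-expectile of the
# reward distribution (the spread across channels encodes reward uncertainty).
N_CHANNELS = 7                              # assumed number of channels
ALPHA = 0.05                                # assumed base learning rate
taus = np.linspace(0.1, 0.9, N_CHANNELS)    # per-channel asymmetries
values = np.zeros(N_CHANNELS)               # expectile estimates

def sample_reward():
    """Hypothetical stochastic reward: 1 with probability 0.3, else 0."""
    return float(rng.random() < 0.3)

for _ in range(20000):
    delta = sample_reward() - values        # per-channel prediction errors
    # Positive errors are scaled by tau, negative errors by (1 - tau).
    step = np.where(delta > 0, ALPHA * taus, ALPHA * (1.0 - taus))
    values += step * delta

print(np.round(values, 3))                  # increasing expectiles of reward

For Aim 2, the standard successor representation can be written in closed form when the transition matrix is known; the probabilistic generalization targeted by the proposal is not shown:

# Successor representation for a made-up 3-state Markov chain.
GAMMA = 0.9                                  # assumed discount factor
P = np.array([[0.1, 0.8, 0.1],               # hypothetical transition matrix
              [0.0, 0.2, 0.8],               # (rows sum to 1)
              [0.5, 0.0, 0.5]])

# M[s, s'] = expected discounted number of future visits to s' starting from s.
M = np.linalg.inv(np.eye(3) - GAMMA * P)

# Values for any new reward vector follow from a single matrix product,
# which is what gives the successor representation its fast generalization.
r = np.array([0.0, 0.0, 1.0])                # assumed reward in the last state
V = M @ r
print(np.round(V, 3))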