Object-centric proto-symbolic behavioural reasoning from pixels
Ruben van Bergen^a, Justus Hübotter^a, Pablo Lanillos*^{b,a}
^a Donders Institute, Radboud University, Nijmegen, The Netherlands
^b Cajal International Neuroscience Center, Spanish National Research Council, Madrid, Spain
Abstract
Autonomous intelligent agents must bridge computational challenges at disparate levels of abstraction, from the
low-level spaces of sensory input and motor commands to the high-level domain of abstract reasoning and planning. A
key question in designing such agents is how best to instantiate the representational space that will interface between
these two levels—ideally without requiring supervision in the form of expensive data annotations. These objectives
can be efficiently achieved by representing the world in terms of objects (grounded in perception and action). In this
work, we present a novel, brain-inspired, deep-learning architecture that learns from pixels to interpret, control, and
reason about its environment, using object-centric representations. We show the utility of our approach through tasks in
synthetic environments that require a combination of (high-level) logical reasoning and (low-level) continuous control.
Results show that the agent can learn emergent conditional behavioural reasoning, such as (A→B)∧(¬A→C), as
well as logical composition (A→B)∧(A→C)⊢A→(B∧C) and XOR operations, and successfully controls its
environment to satisfy objectives deduced from these logical rules. The agent can adapt online to unexpected changes in
its environment and is robust to mild violations of its world model, thanks to dynamic internal desired goal generation.
While the present results are limited to synthetic settings (2D and 3D activated versions of dSprites), which fall short of
real-world levels of complexity, the proposed architecture shows how to manipulate grounded object representations, as
a key inductive bias for unsupervised learning, to enable behavioral reasoning.
Keywords: Object-centric reasoning, Brain-inspired perception and control, Deep learning architectures.
1. Introduction
A one-year-old infant, before language is expressed [1], learns to map the efferent-afferent patterns of sensory stimulation
and motor commands to a neural representation of the environment. The pathway from the sensorium to abstract
thought, and back to the minutiae of the sensorimotor domain, defines the feats of higher-order cognition and structures
the different levels of abstraction at which intelligent agents must operate when they interact with the environment [2].
How sub-symbolic computations transform into higher-level cognitive representations of structures is far from being
understood [3]. However, it is commonly accepted that intermediate representations
bridge cognitive reasoning and behaviour [4]. Hence, in the design of artificial agents, a key challenge is first to find a
representational space that provides an effective interface between these disparate demands, as well as a mechanism
that makes proper use of this representation to interact with the environment and produce behaviour. In this sense, we
approach reasoning from a behavioral perspective where the output of a reasoning process is always an action that
physically interacts with the environment.
To address these challenges in a single, comprehensive system, here, we introduce a novel, brain-inspired neural
network architecture that spans the domains of cognitive reasoning, perceptual inference, planning and continuous
control. We follow both the emergentist and probabilistic approach to cognition, where the representational format, and
the machinery to transform sensory observations into this space, is learned unsupervised but allows reasoning as an
inference process. This also avoids the dependency on ground-truth labels furnished by human annotators, as these are
highly labor-intensive to produce, especially when labels must cover all the relevant variables in a scene. The proposed
architecture leverages the core inductive bias that the environment can be partitioned into discrete entities, or objects,
that obey many useful symmetries and invariances [5]. Object-based representations and reasoning are a key tenet of
perception and cognition in humans [6]. Objects constitute the environment’s separable, movable parts, as well as the
logical units of reasoning and planning. They naturally take on properties of symbols, as different objects obey the same
laws of physics and possess similar or analogous properties [7]. At the same time, each object representation is tethered
to the sensorimotor domain via explicit attention maps. Objects are thus an ideal level of abstraction to interface
between the disparate levels of computation that we require. We show the promise of our approach by evaluating it in
tasks that require both high-level reasoning and continuous control in synthetically generated environments.
Preprint submitted to under review, February 12, 2025. arXiv:2411.17438v2 [cs.AI] 11 Feb 2025
2. Related work, challenges and contribution
Structured representation learning has emerged as a powerful way to introduce inductive biases, scale artificial
intelligence to high-level cognition [8], and exploit the symmetries and invariances that can be leveraged by decomposing
scenes into their natural constituents [9]. In particular, object-centric representations provide a natural interface
between bottom-up (e.g., emergentist, connectionist) and top-down (e.g., graph-based, probabilistic) approaches [7].
We summarize the closely related works that influenced the proposed architecture, classified into four topics¹: scene
understanding (e.g., visual segmentation), physics-informed and simulation-based methods, robotics, and reasoning
(e.g., object relations). We analyse these works through the prism of behavioural reasoning.
Visual segmentation and tracking. Visual segmentation methods have reached maturity, with high segmentation
accuracy in both static images [10] and videos [11]. Starting from IODINE [10], which used iterative amortized inference,
recent methods have evolved into slot-attention architectures [12] and diffusion approaches [13]. These methods, which can track
2D and 3D objects in cluttered scenes [14], capture object collisions and discover new objects [15], are designed for
perception and not for reasoning about, or controlling, the segmented objects.
Physics-informed and simulation-based methods. These approaches have also shown great potential in handling
complex non-linear interactions, and are strongly influenced by the simulation approach to cognition [4]. For
instance, interaction [16] and propagation [17] networks are able to understand complex scene dynamics with multiple
objects and even control them to reach a desired known state. Unfortunately, these methods require full observation of
the object states. Approaches that can work directly on pixels make use of ground-truth segmentation masks [18]
or visual interaction networks [19]. While these approaches perform some kind of physical “reasoning”, a full-fledged
architecture that performs first-order logic reasoning from pixels and transforms it into behaviour (control) has not yet
been fully accomplished.
Robotics. These works focus on moving the objects meaningfully and usually have full access to the object states (e.g.,
[20]). OP3 [21] is an outstanding exception. However, it has two key limitations. First, it is restricted to a discrete
action space of picking up and placing objects, with no continuous control. Second, it has no ability to learn tasks: it
can only plan actions towards objectives specified ad hoc by means of a goal image that shows exactly what the scene
in question ought to look like. Novel approaches use ReFs to describe the 3D representation of the object, with control
solved by RRTs in the latent prediction dynamics and Model Predictive Control (MPC) to execute the actions [22]. Still,
ground-truth object masks are used for training and inference. Some methods have overcome this constraint through
entity-based segmentation (a simplification of object-centric representation that uses the center of the object as
the location of the entity) and RL to generate meaningful behaviours [23]. Unfortunately,
this restricts objects to “point-mass”-like shapes. In addition, there are object-centric approaches that, instead of
focusing on the interactive behaviour between the agent and the objects, focus on view-point matching [24] and agent
navigation [25].
Reasoning. Leaving aside the literature on visual scene understanding with language, reasoning with object-centric
representations as neurosymbols is the most promising yet least investigated area [26]. There are just a few
object-centric studies on visual reasoning in static images [27]. Furthermore, there are relevant works from Large
Language Model (LLM) research (e.g., [28]), but they do not adhere to the grounding paradigm proposed here, where reasoning
should appear as an emergent property of learning a physical world model [29]. For instance, in [30] the reasoning
capabilities are much more expressive than in our proposed approach, but they rely on pre-trained behaviours, which are
conditioned on the language generated by the LLM. Conversely, our architecture learns the world dynamics and
interaction through unsupervised learning, harnessing the construction of the proto-symbols while interacting. This
does not prevent connecting the proposed architecture to an LLM, similarly to [31], but through the
preference network, which is already grounded.
¹ Some of the works may have overlapping features in other topics.
2.1. Current challenges and architecture decisions
There are two major challenges in unsupervised learning with object-centric representations: working with
complex naturalistic scene images [13], and generating meaningful physical actions (i.e., interacting through
continuous control) derived from the reasoning process when the input is high-dimensional (i.e., images) and the
environment is partially observable. This work focuses on the latter. We show that simple conditional reasoning problems,
such as “if there is a heart move squares to the left, otherwise move squares to the right” ((A→B)∧(¬A→C)),
are already unsolvable for state-of-the-art (SOTA) algorithms that do not use object-centric representations (see Sec. 4).
An underlying problem for both challenges is to achieve high performance in world state dynamics estimation (world model
learning) while allowing adaptation and generalization. We designed the object-centric backbone architecture to incorporate adaptation,
following the free energy principle theory of neural information processing [32,33], where the brain estimates the
world by continuously approximating its internal model, learned by experience, to the real world through approximate
Bayesian computation. To implement it, we opted for an iterative variational inference approach inspired by the
IODINE [10] architecture. Whilst iterative methods have shown lower performance than amortized approaches (e.g.,
the slot-attention architecture [12]), at least in perceptual tasks, they provide Bayesian filtering and smoothing, thus allowing
better counterfactual reasoning. Finally, another relevant challenge, which is not addressed in this work, is dealing with
the complex non-linearities of physical contact, such as collisions. Physics-based simulation approaches [18] and
Deep Reinforcement Learning (RL) [23] are showing the best results, but they still rely on ground-truth object
masks or entail oversimplified segmentation of the visual input, respectively.
2.2. Contribution
This work provides a neural network architecture (Fig. 1) that can learn first-order conditional behavioural
reasoning, where proto-symbols² take the form of object-centric representations, and perception, control and preferences
(the desired internal state of the agent) are dynamical processes computed through approximate Bayesian inference, in
an unsupervised learning scheme from pixels. Our approach draws inspiration from natural intelligence, where the
agent’s function is to generate behaviours through reasoning conditioned on perceptual cues by optimizing a single
quantity: the variational free energy (or evidence lower bound) [32]. To this end, we combine bottom-up and top-down
object-centric approaches [35,36], following the emergentist approach (with unsupervised learning from pixels) but
allowing probabilistic inference on the structured representation. This permits the agent to learn: i) what an object is
(scene understanding), ii) how to mentally manipulate it through object-centric rules (reasoning), and iii) how to generate
physical behavior (i.e., continuous control). The agent can only perceive through visual input (a 2D image projection) of
a synthetic world composed of 2D or 3D objects (see Fig. 2a for a schematic). It can interact with the environment by
applying forces to locations in the image, which correspond to forces in the 2D/3D world. This architecture allows
the agent to learn complex conditional rules, such as (A→B∧C)∧(¬A→D∧E), to perform logical composition
(A→B)∧(A→C)⊢A→(B∧C), by learning two different rules separately that then emerge combined during
execution, and to learn XOR operations, such as ((A∨B)→C)∧((A∧B)→D). Once the agent learns the rules, reasoning
transforms into a behavioural response that changes the environment towards its internal preference. Furthermore, the
agent, using the proposed architecture, shows online adaptation to unexpected changes in its environment, is robust to
mild violations of its world model, and is invariant to the number of objects.
3. Proposed neural network model
The proposed Object-centric Behavioural Reasoner (OBR), depicted in Fig. 1, is an object-centric deep learning
architecture consisting of two interconnected modules: i) perceptual inference, in charge of learning object
representations (properties and dynamics) and producing top-down attention, and ii) action inference, comprising reasoning and
control, in charge of learning proto-symbolic rules at the level of object representations and transforming them into
meaningful behaviors (i.e., control commands).
² We use the term proto-symbols by analogy with the use of proto-objects in visual segmentation [34], thus describing candidates for symbols and objects.
Figure 1: OBR model architecture. Illustration of an example instance of the interconnected OBR perceptual and reasoning modules with K=3 slots
and an inference window of two time points. The perceptual module exploits iterative amortized inference (through a refinement network). The action
module reasons about the preferred internal state (through online goal imagination) and generates the continuous control actions accordingly,
to obtain the desired state in the real world (through the minimization of the variational free energy). OBR uses object-centric representations,
dynamics, reasoning, and control. For clarity, the computation of object action beliefs is not included here (see Appendix 6).
3.1. Perceptual inference and generative model
The perception module infers the object state and action beliefs from incoming sensory data, e.g., one RGB image
(frame) per timepoint, and the agent’s own motor efferents (see Appendix 6.1).
Object-centric representation. We define the $k$-th object at time $t$ as a state vector $s^{\dagger(k)}_t$. This vector is expressed
in second-order generalized coordinates, that is, it is a concatenation of the current state and its derivative:
$s^{\dagger(k)}_t = \left[ s^{(k)\top}_t, s'^{(k)\top}_t \right]^\top$. Objects are influenced by actions from the agent. The action (or action-effect) on object $k$ at time $t$ is
denoted by $a^{(k)}_t$. OBR represents its knowledge about object states and actions through variational beliefs $\{q(s^{\dagger(k)}_t)\}$ and
$\{q(a^{(k)}_t)\}$, which are mean-field Gaussian distributions parameterized by a mean and variance for each latent dimension.
To learn object-centric representations and perform inference from pixels, we use a set of $K$ weight-sharing iterative
variational autoencoders (itVAEs) [37,10], where $K$ is the number of object representations to be inferred. These
itVAEs together invert a generative model in which image pixels are drawn from a Gaussian mixture distribution:
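$$p\left(o_i \mid \{s^{(k)}\}_{k \in 1:K}\right) = \sum_k \hat{m}_{ik}\, \mathcal{N}\left(g_i(s^{(k)}), \sigma_o^2\right) \tag{1}$$
$$\hat{m}_{ik} = p\left(m_i \mid \{s^{(k)}\}_{k \in 1:K}\right) = \mathrm{Cat}\left(\mathrm{Softmax}\left(\{\pi_i(s^{(k)})\}_{k \in 1:K}\right)\right) \tag{2}$$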
where $o_i$ is the value of the $i$-th image pixel, $g_i(\cdot)$ is a decoder function that translates an object state to a predicted
mean value at pixel $i$, $\sigma_o^2$ is the variability of pixels around their mean values, and $\pi_i(\cdot)$ maps an object state to a
log-probability at pixel $i$, which defines the predicted probability $\hat{m}_{ik}$ that the pixel belongs to that object (its segmentation
mask). We implement $g_i(\cdot)$ and $\pi_i(\cdot)$ jointly in the decoder of the itVAE, which thus outputs 4 channels per pixel (3
RGB color values + 1 mask logit). Each itVAE thereby predicts its object’s subimage and segmentation mask, and a
full image prediction can be obtained as a segmentation-weighted superposition of subimages.
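The observation model of Eqs. (1) and (2) can be illustrated with a minimal NumPy sketch. It assumes the decoder output is already available as an array of shape (K, H, W, 4), treats the three color channels of a pixel as independent Gaussians with a shared, hypothetical standard deviation `sigma_o`, and uses random numbers in place of a trained decoder. The function names are illustrative and are not taken from the OBR code base.

```python
import numpy as np

def log_softmax(x, axis=0):
    # Numerically stable log-softmax over the slot axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def compose_image(decoder_out):
    """decoder_out: (K, H, W, 4) = 3 RGB means + 1 mask logit per slot."""
    rgb = decoder_out[..., :3]                                # g_i(s^(k))
    masks = np.exp(log_softmax(decoder_out[..., 3], axis=0))  # m_hat_ik, Eq. (2)
    # Segmentation-weighted superposition of the K subimages.
    image = (masks[..., None] * rgb).sum(axis=0)
    return image, masks

def mixture_log_likelihood(obs, decoder_out, sigma_o=0.3):
    """Log-likelihood of an (H, W, 3) image under the mixture of Eq. (1)."""
    rgb = decoder_out[..., :3]
    log_m = log_softmax(decoder_out[..., 3], axis=0)          # (K, H, W)
    # Per-slot Gaussian log-density, summed over the RGB channels (assumption).
    log_n = (-0.5 * ((obs[None] - rgb) / sigma_o) ** 2
             - np.log(sigma_o) - 0.5 * np.log(2.0 * np.pi)).sum(axis=-1)
    joint = log_m + log_n
    top = joint.max(axis=0)
    # log sum_k m_hat_ik N(...) per pixel (log-sum-exp), then summed over pixels.
    return float((top + np.log(np.exp(joint - top).sum(axis=0))).sum())

rng = np.random.default_rng(0)
decoder_out = rng.normal(size=(3, 8, 8, 4))   # stand-in for K=3 decoded slots
image, masks = compose_image(decoder_out)
```

Because the masks are a softmax over slots, they sum to one at every pixel, so the composed image is a convex combination of the per-slot subimages.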
Object dynamics. To complete the world model, this per-frame observation model is combined with linearized
object-centric dynamics $p(s^{\dagger(k)}_t \mid s^{\dagger(k)}_{t-1}, a^{(k)}_{t-1})$, defined by the following equations:
$$s'^{(k)}_t = s'^{(k)}_{t-1} + D(a^{(k)}_{t-1}) + \sigma_s \epsilon_{1t}, \qquad s^{(k)}_t = s^{(k)}_{t-1} + s'^{(k)}_t + \sigma_s \epsilon_{2t} \tag{3}$$
where $\epsilon_{1t}$ and $\epsilon_{2t}$ are noise realizations drawn from a Normal distribution. Note that we are assuming that the latent
representation dynamics can be captured with second-order generalized coordinates with additive noise, but this is not
imposed on the environment³. The action $a^{(k)}_t$ on object $k$ at time $t$ is a (2-D or 3-D) vector that specifies the control
command (e.g., acceleration) on the object in environment coordinates. Finally, $D$ is a learned function that transforms
the control command to its effect in the model’s latent space.
Learning and inference. Inference for each object is performed within a sliding window spanning the present and a
number of past time points. Each itVAE infers beliefs for a single object and time point, taking a total of 8 inference
iterations to refine these beliefs to their optimal (minimum-ELBO) values. Inference is coupled between itVAEs through
information flow between represented objects and between time points in the inference window (see Appendix 6). OBR
optimizes a composite ELBO loss $\mathcal{L}_{comp} = \sum_{n=1}^{N_{iter}} \frac{n}{N_{iter}} \mathcal{L}^{(n)}$ for both learning and inference, where $n$ indexes inference
iterations and $\mathcal{L}$ is:
$$\mathcal{L} = -\sum_{t=0}^{T} \Bigg[ \underbrace{H\left[ q\left( \{s^{\dagger(k)}_t, a^{(k)}_t\} \right) \right]}_{\text{Complexity}} + \beta \underbrace{E_{q(\{s^{(k)}_t\})}\left[ \log p(o_t \mid \{s^{(k)}_t\}) \right]}_{\text{Reconstruction accuracy}} + \underbrace{\sum_k E_{q(a^{(k)}_t)}\left[ \log p(a^{(k)}_t \mid \Psi_t) \right]}_{\text{Action inference accuracy}} + \underbrace{\sum_k E_{q(s^{\dagger(k)}_t, s^{\dagger(k)}_{t-1}, a^{(k)}_{t-1})}\left[ \log p(s^{\dagger(k)}_t \mid s^{\dagger(k)}_{t-1}, a^{(k)}_{t-1}) \right]}_{\text{Temporal consistency}} \Bigg] \tag{4}$$
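The iteration weighting in the composite loss can be made concrete with a short sketch. It assumes that the per-iteration losses $\mathcal{L}^{(n)}$ have already been computed and only illustrates the weights $n / N_{iter}$; the function name is hypothetical.

```python
import numpy as np

def composite_loss(per_iteration_losses):
    """Weighted sum of per-iteration losses with weights n / N_iter."""
    losses = np.asarray(per_iteration_losses, dtype=float)
    n_iter = len(losses)
    weights = np.arange(1, n_iter + 1) / n_iter   # 1/N, 2/N, ..., 1
    return float((weights * losses).sum())

# Eight refinement iterations, as used for inference in the text.
example = composite_loss([1.0] * 8)
```

Later refinement iterations receive larger weights, so the final beliefs contribute most to the objective while earlier iterations still receive a training signal.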
3.2. Reasoning and control
OBR’s generative model allows it to infer the current state of the world, as well as predict future states that would
arise as the result of the agent’s actions. In a (model-based) RL framework, we could now define rewards and learn
a value function on which to base a behavioral policy. Here, we take a slightly different approach, which is central
to the proposed architecture, based on the influential computational neuroscience framework of Active Inference
[39,33]. It posits that agents, through their actions, aim to minimize a variational free energy with respect to a
probability distribution over states that they prefer to find themselves in [40,41,42]. This preference distribution
entails a biased (conditional) prior on environmental states, which acts as an attractor for planning future behavior.
Once this preference (or soft goal) is learnt, the rest of the architecture's machinery drives the system (agent-world) to
generate object-centric actions that minimize the discrepancy between the current and the desired state.
³ Robotic experiments have shown that second-order generalized coordinates are enough to track a dynamical system [38], depending on the nature of
the noise.
Preference network. We assume that the preference distribution is given by $\tilde{p}(\{s^{\dagger(k)}_{T>t}\}_{k \in 1:K} \mid \{\lambda^{(k)}_t\}_{k \in 1:K}) = \prod_k \mathcal{N}(\tilde{\mu}^{(k)}, \tilde{\sigma}^{(k)2})$,
where $T$ is some time horizon and $t$ denotes the present. Note that this distribution is conditioned on the agent’s
internal beliefs, via the variational parameters. We implement this mapping through a set-structured MLP network $\phi$
(architectural details in Appendix 6.1), which preserves the order of objects from input to output, and is invariant to the
number of objects:
$$\nu^{(k)} = \phi_{enc}(\lambda^{(k)}), \qquad c = \frac{1}{K} \sum_k W_{ctx} \nu^{(k)} + b_{ctx} \tag{5}$$
$$\tilde{\mu}^{(k)}, \tilde{\sigma}^{(k)} = \phi_{dec}\left( \nu^{(k)}, c \right) \tag{6}$$
In words, each $\lambda^{(k)}$ is projected into an embedding space by an encoder network. Object-wise embedding vectors
$\{\nu^{(k)}\}$ are then linearly transformed and aggregated into a global context vector $c$, which is appended to each object
embedding. Finally, the concatenated object+context embeddings are passed through a decoder network, to obtain
object-wise preference statistics. Conceptually, since OBR determines its current goals by repeatedly applying the
same local operation to each object representation (rather than a global operation on the full state of the environment),
this may be interpreted as a form of (proto-)symbolic reasoning.
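The set-structured mapping of Eqs. (5) and (6) can be sketched in NumPy as below. Layer sizes, the tanh nonlinearity, the random (untrained) weights and the exponential used to keep the standard deviations positive are all assumptions of this sketch; the actual architectural details are those given in Appendix 6.1.

```python
import numpy as np

def init_mlp(sizes, rng):
    # Random weights only; a real model would be trained.
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)          # hypothetical choice of nonlinearity
    return x

class PreferenceNet:
    def __init__(self, lam_dim, state_dim, emb_dim=16, ctx_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.enc = init_mlp([lam_dim, 32, emb_dim], rng)
        self.W_ctx = rng.standard_normal((emb_dim, ctx_dim)) / np.sqrt(emb_dim)
        self.b_ctx = np.zeros(ctx_dim)
        self.dec = init_mlp([emb_dim + ctx_dim, 32, 2 * state_dim], rng)
        self.state_dim = state_dim

    def __call__(self, lam):
        """lam: (K, lam_dim) variational parameters, one row per object."""
        nu = mlp(self.enc, lam)                                # Eq. (5), encoder
        c = (nu @ self.W_ctx).mean(axis=0) + self.b_ctx        # Eq. (5), context
        ctx = np.broadcast_to(c, (lam.shape[0], c.shape[0]))
        out = mlp(self.dec, np.concatenate([nu, ctx], axis=1)) # Eq. (6)
        mu = out[:, :self.state_dim]
        sigma = np.exp(out[:, self.state_dim:])                # positivity (assumption)
        return mu, sigma

net = PreferenceNet(lam_dim=6, state_dim=4)
lam = np.random.default_rng(1).standard_normal((5, 6))
mu, sigma = net(lam)
```

Since the same encoder and decoder are applied to every object and the context is a mean over objects, permuting the input rows permutes the output rows in the same way, and the same weights can be applied to scenes with any number of objects.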
Note that the preference network depends on the world model in an unsupervised learning scheme. The latent space
geometry that the world model ends up learning is unknown a priori. Therefore, the preference network cannot learn a
mapping within this latent space (from current to desired states), before this latent space has been defined. Interestingly,
multiple preference models can be trained “on top of” the same world model, allowing fast acquisition of novel tasks
within the same environment. Practically, the one main difference from a common value network or critic is that our
preference network does not assign scores to states, but rather furnishes the agent with a desired state conditioned on its
current context. This obviates the need to unroll (many) possible futures to evaluate which of these would be more
desirable.
Control. Given a preference distribution $\tilde{p}(s^\dagger)$⁴, the linearized dynamics model enables the agent to plan actions
efficiently in closed form (without the need for rollouts). Specifically, the action plan $\pi = [a_{t+1}, \ldots, a_{t+T}]^\top$ is computed
by minimizing the path integral of the variational free energy [41] over some future time horizon (similarly to model
predictive control)⁵:
$$\pi^* = \arg\min_\pi \sum_{\tau=1}^{T} D_{KL}\left( q(s^\dagger_{t+\tau} \mid \pi) \,\|\, \tilde{p}(s^\dagger) \right) = (U^\top L U + \lambda_a I)^{-1} U^\top L e \tag{7}$$
Importantly, this optimization can be performed in closed form, by computing the discrepancy between the preferred
state and the sequence of states that will unfold if no action is taken, and projecting this discrepancy ($e$) onto the
pseudo-inverse of a matrix $U$ that maps actions within the planning horizon to their (cumulative) effects over that
time window. $L = \mathrm{diag}(\tilde{\sigma}^{(k)-2})$ is a diagonal matrix containing the precisions (inverse variances) of the preference
distributions, and $\lambda_a$ controls the strength with which actions are regularized (shrunk) towards a zero-mean prior.
Further details are provided in Appendix 7.1.
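The closed-form plan of Eq. (7) can be sketched for a single object with a one-dimensional position and velocity. The sketch assumes noise-free dynamics as in Eq. (3) with D taken to be the identity, ignores the variance of the predicted beliefs so that the KL terms reduce to precision-weighted squared errors, and builds the matrix U explicitly; the exact construction used by the authors is the one described in Appendix 7.1.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # [s, s'] transition implied by Eq. (3)
B = np.array([1.0, 1.0])                 # effect of one action (D = identity, assumption)

def plan_actions(x0, mu_pref, sigma_pref, horizon, lam_a=1e-3):
    A_pow = [np.eye(2)]
    for _ in range(horizon):
        A_pow.append(A @ A_pow[-1])
    # U maps the action sequence to its cumulative effect on all future states.
    U = np.zeros((2 * horizon, horizon))
    for tau in range(1, horizon + 1):
        for j in range(tau):
            U[2 * (tau - 1):2 * tau, j] = A_pow[tau - 1 - j] @ B
    # States that unfold if no action is taken, and their discrepancy e.
    free = np.concatenate([A_pow[tau] @ x0 for tau in range(1, horizon + 1)])
    e = np.tile(mu_pref, horizon) - free
    L = np.diag(np.tile(1.0 / sigma_pref ** 2, horizon))   # preference precisions
    plan = np.linalg.solve(U.T @ L @ U + lam_a * np.eye(horizon), U.T @ L @ e)
    return plan, U, free

def rollout(x0, plan):
    # Step-by-step simulation, used only to check the linear map U.
    x = np.asarray(x0, dtype=float)
    states = []
    for a in plan:
        v = x[1] + a
        x = np.array([x[0] + v, v])
        states.append(x)
    return np.array(states)

x0 = np.array([0.0, 0.0])
mu_pref, sigma_pref = np.array([3.0, 0.0]), np.array([0.1, 0.1])
plan, U, free = plan_actions(x0, mu_pref, sigma_pref, horizon=6)
states = rollout(x0, plan)
```

The whole plan is obtained from a single linear solve, which is what removes the need to unroll and score candidate futures.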
3.3. Training
The perception module is trained using pre-generated 4-frame videos with sparse actions, to minimize a bespoke
ELBO loss that drives the model to reconstruct video frames accurately, while employing representations that are
consistent with the dynamics model. Full details on the training procedure may be found in Appendix 6.3. The
preference network is trained separately, in a self-supervised fashion, to learn the state-preference mapping—OBR does
not copy any behavior from a teacher agent. The model is shown example task episodes where objects start in a random
configuration with random velocities, and are simulated for a few frames without any goal-directed interference. In the
final two frames, the objects are moved (by an oracle agent) to their target locations (with a final velocity of 0). The
perception module encodes each video frame in the latent space, and the preference network is trained to minimize the
discrepancy (KL divergence) between its preference prediction for the “seed frames”, and the representation of the
target frame.
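The discrepancy used to train the preference network can be illustrated with the KL divergence between two diagonal Gaussians. The direction of the divergence, from the encoded target-frame belief to the predicted preference, is an assumption of this sketch, since the text does not specify it; the function name is illustrative.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over dimensions."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    kl = np.log(sigma_p / sigma_q) + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5
    return float(np.sum(kl))

# Hypothetical target-frame belief and preference prediction for one object.
target_mu, target_sigma = np.array([0.0]), np.array([1.0])
pref_mu, pref_sigma = np.array([1.0]), np.array([1.0])
loss = kl_diag_gaussians(target_mu, target_sigma, pref_mu, pref_sigma)
```

The loss vanishes only when the predicted preference matches the target belief in both mean and variance.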
⁴ Note that in this notation, we do not condition the preference on the current state belief as above, as the planning procedure described here can
be applied to any preference distribution.
⁵ Note that the notation in this section omits the object index $k$ for legibility, as actions can be planned independently between objects (a strength
of our approach).
4. Results
We evaluated OBR qualitatively and quantitatively, focusing on its ability to perform conditional
behavioural reasoning using visual cues (Figure 2a), that is, on the capacity of the agent to learn
proto-symbolic rules that follow first-order logic. An exemplary conditional rule is “If there is a Half-torus, move Boxes to the
Top-Left, Cones to the Bottom-Left and the Half-torus to the Middle-Right; otherwise, move Boxes and Cones to the Right”⁶.
Furthermore, we analyzed its generalization across the number of objects and its adaptation to changes in the environment.
As an additional result, although it is not the focus of this work, an analysis of world model learning (i.e., segmentation and
video prediction) can be found in Appendix 6.1.1.
Figure 2: Conditional behavioural reasoning experiments. (a) 3D Active dSprites. The agent’s visual input is the RGB image of the projected
objects. Objects have 1st order dynamics with friction and the agent can apply forces. The behavioural reasoning is conditional on the objects present
in the environment and the learnt rules. That is, the agent should move the objects differently depending on the proto-symbolic rules and the presence
of objects. (b) Types of conditional rules that the agent should be able to learn: simple and complex conditional rules, composition of two learnt rules,
and advanced logic XOR rules.
Figure 2b details the types of rules that the agent can learn. These include conditional rules, such as (A→B)∧(¬A→C):
“If there is a Half-torus, move Boxes to the Top-left and Cones to the Center-left; otherwise, move Boxes to the
Top-right”. It can also perform logical composition (A→B)∧(A→C)⊢A→(B∧C), by learning two different rules
separately that then emerge combined during execution. Furthermore, it can also learn XOR operations, such as
((A∨B)→C)∧((A∧B)→D). Once the agent learns the rules, reasoning transforms into a behavioural response that
changes the environment towards its internal preference (Fig. 2a). Figure 2c shows an OBR agent solving several
conditional reasoning instances with different numbers of objects in the scene.
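To make the logical structure of these tasks explicit, the rules can be written as ground-truth goal assignments over the set of shapes in a scene. This is only a symbolic description of the task, of the kind an oracle could use to generate targets; it is not how OBR represents the rules, which are learned from pixels without any textual description. The function names and the string labels for target regions are hypothetical.

```python
def conditional_rule(shapes):
    """(A -> B) and (not A -> C): goals depend on whether a half-torus is present."""
    has_torus = "half-torus" in shapes
    goals = []
    for shape in shapes:
        if has_torus and shape == "box":
            goals.append("top-left")
        elif has_torus and shape == "cone":
            goals.append("center-left")
        elif not has_torus and shape == "box":
            goals.append("top-right")
        else:
            goals.append(None)          # no goal: the object is left alone
    return goals

def boxes_rule(shapes):
    # A -> B, specified on its own.
    has_torus = "half-torus" in shapes
    return ["top-left" if has_torus and s == "box" else None for s in shapes]

def cones_rule(shapes):
    # A -> C, specified on its own.
    has_torus = "half-torus" in shapes
    return ["center-left" if has_torus and s == "cone" else None for s in shapes]

def compose(rule_1, rule_2):
    """(A -> B) and (A -> C) entail A -> (B and C): merge per-object goals."""
    def combined(shapes):
        return [g1 if g1 is not None else g2
                for g1, g2 in zip(rule_1(shapes), rule_2(shapes))]
    return combined

with_torus = conditional_rule(["half-torus", "box", "cone"])
without_torus = conditional_rule(["box", "cone", "box"])
composed = compose(boxes_rule, cones_rule)(["half-torus", "box", "cone"])
```

Each rule is applied object by object and works for any number of objects, which mirrors the object-wise structure of the preference network.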
4.1. Environment
We developed Active dSprites, an “activated” version of the various multi-dSprites datasets that have been
used in previous work on object-based visual inference (e.g., [10,12]). Not only does it include (continuous) dynamics,
but these dynamics can be acted on by an agent (through continuous control actions that accelerate the objects).
Thus, Active dSprites is an interactive environment, rather than a dataset. We implemented two different scenarios in
the Active dSprites environment. In the first, objects are 2.5-D shapes (e.g., squares, ellipses and hearts): they have
no depth dimension of their own, but can occlude each other within the depth dimension of the image. In the second,
objects are 3D shapes (e.g., boxes, half-tori, cones) that can move in a 10×10×10 m cubic space. These objects
are much more complex to learn than the 2D version, as they have lighting and shadows, producing color gradients
that need to be encoded. To randomize the environment, when an Active dSprites instance is initialized, object shapes,
positions, sizes and colors are all sampled uniformly at random. Initial velocities are drawn from a Normal distribution.
Shape colors are sampled at discrete intervals spanning the full range of RGB colors, while a background color is
drawn from a set of evenly spaced grayscale values between black and white. Shapes are presented in random depth
order. In the 3D environment the lighting sources are fixed. Code for the Active dSprites environment can be found at
github.com/neuro-ai-robotics/OBR.
⁶ While we use words to describe the rule, the agent learns the rule through unsupervised learning and there is no textual description of it.
4.2. Problem complexity analysis through baselines comparison
We evaluated the complexity of solving a basic conditional rule, “If Heart then move Boxes to the Right,
otherwise move Ellipses to the Left”, by comparing different oracle and baseline algorithms with our approach (Fig. 3).
With full observability (i.e., with access to object locations), we evaluated Linear-Quadratic Regulator (LQR)
control [43,44] (implemented in the Python 3 control systems library), Soft Actor-Critic (SAC) [45] and Proximal
Policy Optimization (PPO) [46] (both taken from the Stable Baselines3 library [47]), and our OBR with perfect access to
the true environmental state. With partial observability (i.e., the agent only has access to the RGB image pixels), we evaluated
SAC, PPO and OBR. For the RL algorithms, we gave dense rewards based on the distance between the objects' current
and desired states. For OBR, we only allow actions after frame 5. Here, objects are 2.5-dimensional and, during
training, shapes, colors and positions are randomized and velocities are sampled from a Normal distribution with
mean 0.
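The dense reward given to the RL baselines and the error metric reported in the evaluation can be sketched as follows. The text only states that the reward is based on the distance between current and desired object states, so the negative mean Euclidean distance used here is an assumption, as are the function names.

```python
import numpy as np

def location_mse(positions, goals):
    """Mean squared error between object locations and goal locations."""
    return float(np.mean((positions - goals) ** 2))

def dense_reward(positions, goals):
    """Hypothetical dense reward: negative mean Euclidean distance to the goals."""
    return -float(np.mean(np.linalg.norm(positions - goals, axis=-1)))

positions = np.array([[0.0, 0.0], [1.0, 1.0]])   # two objects, (x, y)
goals = np.array([[3.0, 4.0], [1.0, 1.0]])
mse = location_mse(positions, goals)
reward = dense_reward(positions, goals)
```

A reward of this form is available at every frame, in contrast to a sparse signal that is only given once all objects have reached their targets.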
Figure 3: Complexity analysis through baseline comparison. Oracle algorithms have access to the object locations and dynamics, and define
the upper-bound performance. RL algorithms (PPO and SAC) are tested with full observability and partial observability (RGB image pixels). (a)
Performance of the algorithms at each frame of execution for a single learnt conditional rule (A→B)∧(¬A→C). Colored lines are the Mean Squared
Error (MSE) between the object locations and the idealized goal locations for 512 tested instances after training. Error bars are the standard deviation.
(b) Average performance of the algorithms at the final frame. Lower is better.
Results are shown in Fig. 3. The left panel shows the task accuracy (Mean Squared Error, MSE, between the current
object locations and the desired state in the projected image)⁷. All methods with full observability can solve the task,
with similar performance. When the input is restricted to RGB images (pixels), standard SAC and PPO
cannot learn the conditional behavioural rule. We can see that the MSE increases over time. OBR, while not
as accurate as the oracle, can learn the rule. The right panel shows the final performance of the compared algorithms.
Oracle algorithms show the upper-bound performance, as it is constrained by the dynamics of the environment.
These results first show, in line with previous works [21], that manipulating objects in the environment cannot
be solved through non-object-centric approaches, and highlight an important weakness of current pixel-based RL
baselines. Second, they show that a basic reasoning rule that implies interaction with the environment is already
sufficiently complex. Thus, learning multiple rules and composition in this Active dSprites synthetic environment is
relevant.
4.2.1. Abstract behavioural reasoning with conditional rules
We evaluated the capacity of the architecture to learn proto-symbolic behavioural rules, and its generalization to
different number of objects. We evaluated both 2D and 3D scenarios. Figure 4 shows an OBR agent execution for
an“IfHeart” conditional ( A→B)∧(¬A→C) and the XOR rules, and two randomized scenarios for each rule. In
Fig. 4a, the first scenario, there is no heart, and boxes should be moved to the upper-left and ellipses to the middle-left.
7Note that this error is over the ground truth goal as we can use the environment information to obtain the true image projection of the goal.
Page 9:
[Figure 4 plot panels omitted. Panels a and b: rows "environment", "imagined goal" and "true goal", with the passive and active phases marked. Panels c and d: task accuracy (MSE) versus time (frames), one line per number of objects (2 to 5).]
Figure 4: Behavioural reasoning with 2D objects. (a) Two instances of the active 2D dSprites solved by the OBR agent which has learnt the
“IfHeart” conditional rule: (Heart → Squares to top-right ∧ Ellipses to middle-right) ∧ (¬Heart → Squares to top-left ∧ Ellipses to middle-left ∧
Hearts to down-left). Environment row shows the real scenario with the objects and Imagined goal row show the desired preference of the agent.
The red arrows are the control actions applied to the objects in every frame. The agent is passive until the 5th frame and the preference network
waits three frames to provide an output (this allows perception to be stable). The true goal is the idealized desired location of the objects. (b) Two
instances of the active 2D dSprites solved by the OBR agent which has learnt the Heart XOR Square task. (c and d) Statistical evaluation of both
tasks accuracy (MSE with respect to the idealized true goal) for different number of objects in the scene. The training was performed with 3 objects.
Colored lines describe the MSE between ground-truth goals and actual object positions and velocities. Error bars reflect +/- 1 SEM.
Page 10:
In the second one, there is a heart present, and boxes, ellipses and hearts should be moved to the top-left, center-left
and bottom-left, respectively. Thus, an object’s own shape determines its goal position along the vertical axis, while the
presence of a heart anywhere in the scene determines all objects’ goal positions along the horizontal axis. The agent is
passive until the 5th frame. The environment row shows the visual input, and the red arrows describe the forces applied.
The imagined goal row shows the output of the preference network at each frame. Note that the preference is
encoded only in the latent representation; here, for visualization purposes, we show its reconstruction using the
decoder network. We let the agent perform three frames of perceptual inference before activating the preference
network to infer the current goal. The true goal is the idealized solution of this “IfHeart” task. Analogously, in Fig. 4b,
the OBR agent, which has learnt the Heart XOR Square task, solves two active 2D dSprites scenarios.
Figure 4c,d shows the statistical performance of OBR for different numbers of objects for the two conditional
rules. The network is trained with three objects and then evaluated on two, three, four and five objects in the scene.
Again, the agent is passive until the fifth frame. The error curves show that OBR interprets and manipulates the
environment using the learnt conditional rule.
Figure 5 shows a similar analysis of OBR in a 3D environment. The 3D scenario is considerably more
complex than the 2D one, as objects have an extra dimension in which to move and colors are affected by lighting. Here, the
conditional rule learnt depends on the appearance of a HalfTorus object. The output of the OBR preference network is again
decoded to show its visual interpretation. Notably, the second instance shows that OBR can deal with the absence or multiple
appearances of objects in the scene: there are two HalfTorus and no Cones in the bottom instance of Fig. 5a. Figure 5b
shows the statistical performance of OBR for different numbers of objects. The network is again trained with three
objects and then evaluated on two, three, four and five objects in the scene. The agent is passive until the fifth frame.
The error curves (decreasing towards zero) show that OBR interprets and manipulates the environment using the learnt
conditional rule. Figure 2c shows 16 randomized OBR agent executions in 3D for 2, 3, 4 and 5 objects in the scene. For
completeness, several randomized executions in 3D with different rules can be inspected as animations in the GitHub
repository.
4.2.2. Adaptation to changes in the environment
We evaluated the capacity of the OBR agent to recover from unexpected changes in the environment. For that
purpose, we generated instances of the active dSprites that perform an object substitution in the middle of the execution.
For instance, a Heart is replaced by a Square during the execution of a conditional rule. Figure 6a shows two instances
of the adaptation environment and Fig. 6b the statistical performance over 512 randomized instances. The object
substitution experiment tests three components: the perception module, which needs to adapt quickly to estimate the new
situation; the preference network, which has to adapt the imagined goal depending on the objects that appear in the scene; and
the control module, which should provide the correct continuous actions to correct for the previously generated object
movements. The true goal describes the idealized desired object locations after object substitution. The task error
shows a decrease in performance until the system is able to recover and generate the proper behaviour to fulfill the
conditional rule.
4.2.3. Logical composition emergence
We further evaluated the capacity of the agent to perform logical composition, (A→B)∧(A→C) ⊢ A→(B∧C),
as described in Fig. 7. We trained the preference reasoning module on two rules separately and then evaluated, in
the testing phase, the capacity of the agent to combine both. For instance, Rule 1: if Heart, move Squares to Top-left,
otherwise to Middle-right; and Rule 2: if Heart, move Ellipses to Middle-left, otherwise to Middle-right. In the testing
phase, when both squares and ellipses are present, the agent should compose both rules into: if Heart, move Squares to
Top-left and Ellipses to Middle-left, otherwise to Middle-right. Fig. 7 shows the task performance, computed as the
MSE to the idealized desired location of the objects, as if the two rules were being computed. The OBR agent is passive
until the 5th frame. The plot shows that the actions performed on the objects actually solve the task, confirming that the
agent's reasoning combines both rules.
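As a sanity check on the logical form itself (not on the agent), the composition (A→B)∧(A→C) ⊢ A→(B∧C) can be verified by enumerating all truth assignments. The helper name `implies` is introduced here only for illustration.

```python
from itertools import product

def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q

# The entailment holds if, in every assignment where both premises are true,
# the conclusion is true as well.
valid = all(
    implies(implies(a, b) and implies(a, c), implies(a, b and c))
    for a, b, c in product([False, True], repeat=3)
)
print(valid)  # True
```

In the behavioural setting, A stands for "a heart is present", while B and C stand for the goal assignments of the two separately trained rules.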
5. Discussion
We described a brain-inspired deep learning architecture that allows grounded conditional abstract reasoning and
behaviour by leveraging object-centric learned representations as the units of “mental” manipulation. Both perceptual
Page 11:
Figure 5: Behavioural reasoning with 3D objects. (a) Two instances of the active 3D dSprites solved by the OBR agent which has learnt the
“IfHalfTorus” conditional rule: (HalfTorus → Boxes to top-right-middle ∧ Cones to middle-right-middle) ∧ (¬HalfTorus → Boxes to top-left-front ∧
Cones to middle-left-front ∧ HalfTorus to down-left-front). Note that in the second instance there are only Boxes and HalfTorus and no Cones, but
OBR still solves the conditional task. The Environment row shows the real scenario with the 3D objects and the Imagined goal row shows
the desired preference of the agent. The red arrows are the control actions applied to the objects in every frame. The agent is passive until the 5th
frame and the preference network waits three frames to provide an output (this allows perception to be stable). The true goal is the idealized desired
location of the objects. (b) Statistical evaluation of the task accuracy (MSE with respect to the idealized true goal) for different numbers of objects in
the scene. Blue lines describe the mean and the error bars denote the standard deviation. (c) Instances of the OBR agent's sequential behaviour with an
IfHalfTorus rule with different numbers of objects in the scene. The goal represents the ideal location of the objects given the rule. The desired goals
of the agent (preferred internal state) are not shown and are learnt through unsupervised learning using the preference network.
[Figure 6 plot panels omitted. Panel a: rows "environment", "imagined goal" and "true goal". Panel b: task accuracy (error) versus time (frames), with the substitution and the true goal marked.]
Figure 6: Adaptation. (a) Two instances of the active dSprites solved by the OBR agent which has learnt the “IfHeart” conditional rule: (Heart →
Squares to top-right ∧ Ellipses to middle-right) ∧ (¬Heart → Squares to top-left ∧ Ellipses to middle-left ∧ Hearts to down-left). At the 4th frame, one
of the objects (in this case the heart) is substituted by a square, which forces a change not only in the object-centric perception but also in the
conditional behaviour, as the resulting reasoning depends on the appearance of the Heart. (b) Statistical performance evaluation of 512 instances of
the object substitution experiment. The blue line is the median, and error bars show the inter-quartile range, to deal with the skewed data due to failures.
Page 12:
Figure 7: Logical composition . The OBR agent preference reasoning module is trained with two distinct rules and then tested in execution if the
composed rule emerges. Task accuracy over 512 instances shows that OBR is combining both rules in the test phase.
and proto-symbolic goals were learnt in an unsupervised fashion from visual information (pixels). Results show
that the system can learn and resolve conditional behaviours similar to first-order logic: simple A→B, complex
(A→B∧C)∧(¬A→D∧E), XOR operations ((A∨B→C), (A∧B→D)) and logical composition (A→B, A→C ⊢ A→(B∧C)).
We analysed the model's generalization to different numbers of objects (both in 2D and 3D) and object
properties (e.g., color, shapes) and, critically, its adaptation to changes in the environment (e.g., object substitution), which is
enabled by the iterative inference and the preference network. OBR shows a promising direction for object-based behavioral
reasoning from pixels, mainly by establishing grounded object-centric representations as a key inductive
bias during learning, perception, reasoning and control.
5.1. Object-centric representations as symbols
We showed that connectionist representation learning with inductive biases, in particular object-centric representations,
can work as a bridge between symbols and behaviour. Because objects are entities that live in both the real world and
the internal representation of the agent, we can learn trajectories in the belief space that produce first-order-logic-like
behaviours, where variables are substituted by object representations. Logical operators are encoded by learning the
latent desired states (or preferences) of the agent and how to traverse the latent manifold to go from the current state to
the desired state using the generative model.
5.2. Grounded protosymbols and preferences in the latent space
A key feature of OBR is that it learns grounded representations and manipulates them in the latent manifold, and that
it also generates the agent's preferences (goals) in the object-centric latent representation. The agent's internal goals are
tied to the world and the actions that it can perform in it. This brings benefits but also restrictions, as it forces
the system to perform embodied reasoning. Learned protosymbolic rules depend on the potentially learnable object
features and how the agent can act on them (see the latent space traversals in Appendix 8). For instance, the agent cannot
learn to change the color of an object if all objects are in grey-scale or if there is no policy that can cause the desired
effect. As a consequence, conditional behaviours that cannot be interpreted in the object feature space cannot be
properly resolved in execution. On the positive side, goal-conditioned behaviours are always grounded, and therefore the
gap between reasoning and control vanishes.
5.3. Shortcomings and future work
Following the brain inspiration, we employed an iterative inference procedure in order to benefit from postdictive
inference (“smoothing”) in a sliding-window approach. The resulting architecture is limited by the accuracy of
the perception module. Thus, while this work focused on a proof of principle of the advantage of object-centric
representations in behavioural reasoning, more powerful and sophisticated architectures (e.g. vision transformers) could
be used to improve perceptual inference and scale to naturalistic images. Second, in the experiments, the effect of the
real actions on the latent representation is assumed to be linear. While this could fit a hierarchical interpretation of
cognition, it prevents the agent from properly handling complex non-linear dynamics. While OBR can recover from collisions,
it cannot plan ahead with them. Further work should investigate complex collisions, for instance, by incorporating
Page 13:
other inductive biases, such as interaction networks [16] or sparsely changing latent representations to encode relevant
events [48]. Third, the achieved agent behaviours are much less complex than those of current SOTA LLM/VLM approaches
such as RT architectures [49]. OBR provides, however, goal-conditioned behavioural reasoning, which is a major
challenge in the field [50]. Increased expressivity of the reasoning should be investigated, for instance, by using OBR
as the interface between symbolic (e.g., graph-based, language) and subsymbolic representations. Finally, and highly
relevant for operationalizing this approach in robotics applications, the agent needs to account for the bodily constraints of
performing an action in the world [51, 29]. This can be done by replacing the action field mapping with a robotic arm
controller.
Acknowledgments
This work has been funded by the SPIKEFERENCE project, co-funded by the Human Brain Project (HBP) Specific
Grant Agreement 3 (ID: 945539).
6. Appendix
6.1. Network architecture
OBR’s perceptual inference network consists of two separate Iterative Amortized Inference (IAI) modules, each of
which in turn contains a refinement and a decoder module. The first IAI module concerns the inference of the state
beliefs $q(\{s^{\dagger(k)}\})$; we term this the perceptual inference module. The second IAI module infers the object action beliefs
$q(\{a^{(k)}\})$, and we refer to this as the action inference module. In addition, OBR comprises a preference network that
maps the current latent state beliefs to a predicted preference distribution over future object states. Code for the OBR
architecture can be found at github.com/neuro-ai-robotics/OBR.
6.1.1. Perceptual inference module
This module used a latent dimension of 16. Note that, in the output of the refinement network, this number is
doubled once as each latent belief is encoded by a mean and variance, and then doubled again as we represent (and
infer) both the states and their first-order derivatives. In the decoder, the latent dimension is doubled only once, as the
state derivatives do not enter into the reconstruction of a video frame. As in [10], we use a spatial broadcast decoder,
meaning that the latent beliefs are copied along a spatial grid with the same dimensions as a video frame, and each
latent vector is concatenated with the (x, y) coordinate of its grid location, before passing through a stack of transposed
convolution layers. Decoder and refinement network architectures are summarized in the tables below. The refinement
network takes in 16 image-sized inputs, which are identical to those used in [10], except that we omit the leave-one-out
likelihoods. Vector-sized inputs join the network after the convolutional stage (which processes only the image-sized
inputs), and consist of the variational parameters and (stochastic estimates of) their gradients.
Decoder (states)
Type         Size / #Chan.   Act. func.           Comment
Input (λ)    32              -                    -
Broadcast    34              -                    Appends coordinate channels
ConvT 5×5    32              ELU                  -
ConvT 5×5    32              ELU                  -
ConvT 5×5    32              ELU                  -
ConvT 5×5    32              ELU                  -
ConvT 5×5    4               Sigmoid or Softmax   Outputs RGB + mask
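The broadcast step listed in the decoder table can be sketched as follows. This is a minimal NumPy illustration and not the authors' code: the function name `spatial_broadcast`, the [-1, 1] coordinate range and the 64×64 grid are assumptions made for the example. Only the channel counts (32 input values plus 2 coordinate channels, giving 34) come from the table.

```python
import numpy as np

def spatial_broadcast(latent, height, width):
    """Tile a latent vector over an image grid and append (x, y) coordinates.

    latent: array of shape (D,).
    Returns an array of shape (D + 2, height, width).
    """
    d = latent.shape[0]
    # Copy the same latent vector to every grid location.
    tiled = np.broadcast_to(latent[:, None, None], (d, height, width))
    # Coordinate channels; the [-1, 1] range is an assumed convention.
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, height),
        np.linspace(-1.0, 1.0, width),
        indexing="ij",
    )
    return np.concatenate([tiled, xs[None], ys[None]], axis=0)

latent = np.random.default_rng(0).normal(size=32)  # 32 input values, as in the table
grid = spatial_broadcast(latent, 64, 64)           # 64x64 is an assumed frame size
print(grid.shape)  # (34, 64, 64)
```

The resulting 34-channel grid is what a stack of transposed convolutions would then map to RGB and mask channels.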
Refinement network (states)
Page 14:
Type                     Size / #Chan.   Act. func.   Comment
Linear                   64              -            -
LSTM                     128             tanh         -
Concat [..., λ, ∇λL]     256             -            Appends vector-sized inputs
Linear                   128             ELU          -
Flatten                  800             -            -
Conv 5×5                 32              ELU          -
Conv 5×5                 32              ELU          -
Conv 5×5                 32              ELU          -
Inputs                   16              -            -
Segmentation and prediction. Figure 8 illustrates the internal mechanism of the perceptual module: the segmentation of
the scene into attentive masks and the prediction of the objects' dynamics in the environment.
Figure 8: Object-centric segmentation and prediction. (a) Observation model (b) Three examples of attentive masks output by the OBR perceptual
module with an IAI backbone. (c) Dynamics model. (d) Two examples of the perception predicting ahead in time the dynamics of the environment.
Prediction robustness. To enable action planning across episodes of non-trivial length, it is important that OBR’s
perceptual inference and predictive abilities remain stable across longer time windows. In addition, a great benefit of
OBR’s slot-based architecture is that object slots can be added or removed at will, without having to learn additional
connection weights [ 36]. However, it is not a given that performance will be robust to such variations. Here, we test
the robustness of OBR’s world model, as measured by segmentation and reconstruction accuracy, to both of these
dimensions (Fig. 9). Having been trained on 4-frame videos with 3 objects, OBR generalizes well to longer videos
with fewer or more objects present in the scene. A slight drop-off in performance towards later frames in the 12-frame
testing videos is easily remedied by a very short bout (3 epochs) of additional training with an additional set of longer
(also 12-frame) training videos. Generalization to different numbers of objects is characterized by a slight drop in
performance as the number increases, but this might also be (partly) attributed to increased complexity of the resulting
images.
Page 15:
[Figure 9 plot panels omitted. Rows: segmentation accuracy (F-ARI) and reconstruction error (MSE; ×10⁻³) versus time (frames), one line per number of objects (2 to 5). Columns: "No extra training" and "Minimal stability training".]
Figure 9: Robustness to variations in video length and number of objects. Segmentation accuracy (F-ARI score) and reconstruction error (MSE)
across video frames that extend beyond the duration of the training videos (4 frames), for scenes with various numbers of objects (training videos
contained 3 objects). Left panels show the performance of the network after training on 4-frame videos only. Right panels show the performance
after just 3 epochs of additional training with 12-frame videos. Error bars depict the mean +/- 1 SEM.
Page 16:
6.1.2. Action inference module
A key challenge in multi-object environments such as the evaluated environment, is that an agent’s internal
representation of objects has an unknown (and possibly imperfect) correspondence to the objects in the environment.
Even if the object features are inferred with perfect accuracy, the order of the objects in the representation is arbitrary.
To solve this correspondence problem, we define an action space in which accelerations can be placed at pixel locations
within the environment’s image grid, and objects receive the sum of all accelerations that coincide with their visible
pixels. Specifically, we introduce the notion of an action field $\Psi = [\psi_1, \ldots, \psi_M]^T$: an [M × 2] matrix (with M the
number of pixels in an image or video frame), such that the i-th row in this matrix ($\psi_i$) specifies the (x,y,z)-acceleration
applied at pixel i. The action on the k-th object is then given by:
$$a_t^{(k)} = \sum_i [m_i = k]\, \psi_i \tag{8}$$
where $m_i$ is a categorical variable that indicates which object pixel i belongs to.⁸ This definition of actions in pixel
space provides an unambiguous interface for the agent to interact with its environment.
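Equation 8 amounts to a masked sum over pixels, which the sketch below illustrates for the 2D case. The function name `object_actions`, the toy pixel assignment and the convention that slot 0 is the background are assumptions made for this example.

```python
import numpy as np

def object_actions(action_field, pixel_to_object, num_objects):
    """Sum the accelerations placed on each object's visible pixels (eq. 8).

    action_field: (M, 2) array, one acceleration vector per pixel.
    pixel_to_object: (M,) integer array, m_i = index of the object at pixel i.
    Returns a (num_objects, 2) array with one summed action per object.
    """
    actions = np.zeros((num_objects, action_field.shape[1]))
    # np.add.at accumulates all rows that share the same object index.
    np.add.at(actions, pixel_to_object, action_field)
    return actions

# Toy example with 6 pixels; slot 0 plays the role of the background here.
m = np.array([0, 1, 1, 2, 0, 2])
psi = np.zeros((6, 2))
psi[1] = [0.5, 0.0]    # acceleration placed on a pixel of object 1
psi[2] = [0.25, -1.0]  # second acceleration on object 1
psi[5] = [0.0, 2.0]    # acceleration placed on a pixel of object 2
a = object_actions(psi, m, num_objects=3)
print(a)
```

Because accelerations are addressed by pixel rather than by slot index, the result does not depend on the arbitrary ordering of objects in the agent's internal representation.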
The action inference module does not incorporate a decoder network, as the quality of the action beliefs is computed
by evaluating equation 8 and plugging this into the ELBO loss from equation 4. While this requires some additional
sampling operations (see Appendix 7), no neural network is required for this. This module does include a (shallow)
refinement network, which is summarized in the table below. This network takes as input the current variational
parameters $\lambda_{a^{(k)}}$ (2 means and 2 variances), their gradients, and the ‘expected object action’,
$\sum_i \hat{m}_{ik} \psi_i$.
Refinement network (actions)
Type      Size / #Chan.   Act. func.   Comment
Linear    4               -            -
LSTM      32              tanh         -
Inputs    10              -            -
6.1.3. Preference network
The preference network consists of encoder and decoder networks that operate on single object representations,
as well as a context module that aggregates object embeddings produced by the encoder into a single context vector,
which is then broadcast and appended to the object embeddings that are fed into the decoder. Details for the two-layer
encoder and decoder networks are listed below. The context module consists of a single Linear layer with input and
output sizes both equal to 64, and no activation function.
Encoder
Type      Size / #Chan.   Act. func.   Comment
Linear    64              ELU          -
Linear    64              ELU          -
Inputs    64              -            -
Decoder (preferences)
Type      Size / #Chan.   Act. func.   Comment
Linear    64              ELU          -
Linear    64              ELU          -
Inputs    128             -            -
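The data flow through the preference network (per-object encoder, shared context vector, per-object decoder) can be sketched as below. This is an illustration under stated assumptions rather than the released implementation: the weights are random stand-ins for trained parameters, and averaging the embeddings over slots before the context layer is an assumed aggregation, since the text does not specify how the embeddings are pooled. Layer sizes follow the tables above.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def linear(rng, n_in, n_out):
    """Random weights standing in for trained parameters."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def mlp(x, layers):
    for w, b in layers:
        x = elu(x @ w + b)
    return x

rng = np.random.default_rng(0)
encoder = [linear(rng, 64, 64), linear(rng, 64, 64)]
context_w, context_b = linear(rng, 64, 64)  # single Linear layer, no activation
decoder = [linear(rng, 128, 64), linear(rng, 64, 64)]

def preference_network(beliefs):
    """beliefs: (K, 64) variational parameters, one row per object slot."""
    emb = mlp(beliefs, encoder)                         # (K, 64)
    # ASSUMPTION: embeddings are aggregated by averaging over slots.
    context = emb.mean(axis=0) @ context_w + context_b  # (64,)
    ctx = np.broadcast_to(context, emb.shape)           # same context for every slot
    return mlp(np.concatenate([emb, ctx], axis=1), decoder)  # (K, 64)

out3 = preference_network(rng.normal(size=(3, 64)))
out5 = preference_network(rng.normal(size=(5, 64)))
print(out3.shape, out5.shape)  # (3, 64) (5, 64)
```

Since every weight is shared across slots and the context is a symmetric function of the embeddings, the same network applies to any number of objects, which is the property exploited in the generalization experiments.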
⁸ Note the use of Iverson-bracket notation; the bracket term is binary and evaluates to 1 iff the expression inside the brackets is true.
Page 17:
6.2. Inference procedure
Inference is performed in a recurrent algorithm that loops through the decoder and refinement subnetworks. For a
single video frame, the inference starts by initializing the object-wise latent beliefs. For the first frame in an episode,
beliefs are initialized to a learned default vector of variational parameters λ0. For subsequent frames, beliefs are
initialized by extrapolating the inferred dynamics from the previous frame.
From each belief $q(s^{\dagger(k)})$, we then (as in [10]) sample a state vector (using the reparameterization trick) and run
this through the decoder to obtain, for each object, a predicted sub-image $o^{(k)}$ and segmentation logits $\hat{m}^{(k)}$. These
decoder outputs are then fed, separately for each object, into the refinement subnetwork, along with additional inputs.
Image-sized inputs are fed into the bottom convolutional layer, while vector-sized inputs are appended to the output of
the final convolutional layer and processed by the final LSTM and Linear layers. We use the same selection of auxiliary
inputs as [10], except that we omit the leave-one-out likelihoods.
Inference across multiple video frames is performed in a sliding window. This window slides “into” and “out
of” the full video or episode. That is, the first window includes only the first video frame, which is processed for L
iterations, before moving the inference window ahead by one frame. Thus, it takes L × F inference steps before the
window reaches its full extent, comprising F frames. The reverse happens when we reach the end of an episode or
video. The lagging end of the window keeps advancing until there is only a single frame left within the window, which
gets processed for a final L iterations. Thus, each video frame gets processed for exactly L × F inference steps.
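The window schedule described above can be written out explicitly. The following is a sketch of the bookkeeping only (no inference is performed); the generator name `sliding_window_schedule` and the example values of T, F and L are illustrative and are not the settings used in the paper.

```python
def sliding_window_schedule(num_frames, window_size, iters_per_position):
    """Yield (first_frame, last_frame) for every inference iteration.

    The window slides "into" the video (growing from one frame), moves along
    at full width, then slides "out" (shrinking back to one frame).
    Frame indices are 0-based and inclusive.
    """
    for lead in range(num_frames + window_size - 1):
        first = max(0, lead - window_size + 1)
        last = min(lead, num_frames - 1)
        for _ in range(iters_per_position):
            yield first, last

T, F, L = 6, 3, 4  # illustrative values only
counts = [0] * T
for first, last in sliding_window_schedule(T, F, L):
    for t in range(first, last + 1):
        counts[t] += 1
print(counts)  # [12, 12, 12, 12, 12, 12], i.e. L x F iterations per frame
```

Counting the iterations per frame confirms that every frame, including the first and last, is processed for exactly L × F steps.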
6.3. Training procedure
The above network architecture was trained on pre-generated experience with the active dSprites environment. The
main training set, used to train the world model, comprised 50,000 videos of 4 frames each. An additional validation set
of 10,000 videos was sampled identically and independently to the training set. Each video was generated as follows.
First, an instance of the active dSprites environment was randomly initialized, with three objects. Object shapes (square,
ellipse or heart) were drawn uniformly. Colors were uniformly sampled from 5 evenly spaced values, independently
for the R, G and B channels (a uniform grayscale background color was sampled from the same 5 intensity values).
Orientations were sampled uniformly from the interval [0, 2π]. Object sizes were sampled uniformly from [1/6, 1/3],
where 1 is the size of the image frame. Positions were sampled uniformly from the interval [0.2, 0.8]² (where 0 and 1
are the edges of the frame), while velocities were drawn from a Normal distribution with mean 0 and standard deviation
0.0625. Video frames were then generated by simulating 4 frames of these objects’ dynamics. Between the 2nd and 3rd
frames, a single action (acceleration) was randomly sampled for each object, from a Normal distribution with mean 0
and s.d. 0.0625, and placed at a random pixel location within the object’s segmentation mask. If an object happened to
be completely invisible (e.g. because it was fully occluded by another), then it received no action. An additional action
was sampled for a random background pixel, to encourage the model to learn to accurately segment the background
into a separate object slot (rather than grouping the background with one of the objects). Training with these 4-frame
videos was followed by a brief period of training with longer videos (12 frames, with actions after every second frame)
that were otherwise identically sampled.
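The initial-state sampling described in this paragraph can be sketched as follows. This is a hedged illustration, not the environment's actual code: the function name `sample_scene` and the dictionary layout are invented for the example, and the assumption that the 5 evenly spaced colour intensities span [0, 1] is not stated in the text.

```python
import numpy as np

SHAPES = ("square", "ellipse", "heart")
# ASSUMPTION: the 5 evenly spaced colour intensities span [0, 1].
INTENSITIES = np.linspace(0.0, 1.0, 5)

def sample_scene(rng, num_objects=3):
    """Draw initial object properties following the description above."""
    return {
        "shape": rng.choice(SHAPES, size=num_objects),
        "color": rng.choice(INTENSITIES, size=(num_objects, 3)),  # R, G, B
        "background": rng.choice(INTENSITIES),                    # grayscale
        "orientation": rng.uniform(0.0, 2 * np.pi, size=num_objects),
        "size": rng.uniform(1 / 6, 1 / 3, size=num_objects),
        "position": rng.uniform(0.2, 0.8, size=(num_objects, 2)),
        "velocity": rng.normal(0.0, 0.0625, size=(num_objects, 2)),
    }

scene = sample_scene(np.random.default_rng(0))
print(scene["shape"], scene["position"].shape)
```

Actions and the subsequent dynamics simulation are omitted here; they would be applied to the sampled state by the environment.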
Additional sets of 50,000 training videos (one for each task; as well as accompanying sets of 10,000 identically and
independently sampled validation videos) were used to train OBR’s preference network. Each of these videos was 8
frames long and generated as follows. First, an instance of the active dSprites environment was randomly initialized
as before (but with a different procedure for sampling shapes; see below). Then, 6 frames of this environment were
simulated with random actions inserted before the 3rd and 5th frames. To prevent objects from leaving the frame
(which is likely in longer videos and becomes problematic in this setting), we additionally placed actions as needed
(specifically, if an object was on course to move to a coordinate outside the range
[0.05, 0.95]² in the next frame, an action was generated that resulted in the object appearing to “bounce” against an
invisible wall instead). Before the 7th and 8th frames, actions were generated that brought the objects to their target
positions and velocities within the context of a task (see below). The first of these actions brought the objects to their
target positions, while the second decelerated them to 0 velocity. Thus, in the final frame (and only then), the objects
reached their target configuration. Each frame was encoded (using the normal inference procedure) as a set of latent
beliefs in OBR’s latent space, and the preference network was then tasked to map the representation of the first 7 “seed
Page 18:
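frames” to that of the final target frame, by minimizing the following KL-divergence loss:
$$\mathcal{L}_{\mathrm{pref}} = \sum_{t=1}^{T-1} \sum_k D_{\mathrm{KL}}\left( \tilde{p}_t\left(s^{\dagger(k)}\right) \,\middle\|\, q\left(s_T^{\dagger(k)}\right) \right) \tag{9}$$
$$\tilde{p}_t\left(s^{\dagger(k)}\right) = \mathcal{N}\left( s^{\dagger(k)};\ \tilde{\mu}^{(k)}(\lambda_t),\ \tilde{\sigma}^{(k)}(\lambda_t)^2 \right) \tag{10}$$
where $\lambda_t$ are the parameters of the variational beliefs for frame t, and we make explicit the fact that the parameters
(means and variances) of the predicted preference distribution are a function (via the preference network) of these
beliefs about prior frames.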
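For Gaussian beliefs with diagonal covariance, the loss in equation 9 has a closed form. The sketch below assumes that both the predicted preference and the target belief are diagonal Gaussians parameterized by means and variances; the function names, array shapes and example dimensions are illustrative.

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for diagonal Gaussians, summed over the last axis."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0,
        axis=-1,
    )

def preference_loss(pred_mu, pred_var, target_mu, target_var):
    """Eq. 9: sum of KL terms over seed frames t and objects k.

    pred_mu, pred_var: (T-1, K, D) predicted preference parameters per seed frame.
    target_mu, target_var: (K, D) inferred beliefs for the final (target) frame.
    """
    kl = kl_diag_gaussians(pred_mu, pred_var, target_mu[None], target_var[None])
    return kl.sum()

rng = np.random.default_rng(0)
target_mu, target_var = rng.normal(size=(3, 32)), np.full((3, 32), 0.5)
pred_mu = np.repeat(target_mu[None], 7, axis=0)  # perfect prediction from 7 seed frames
pred_var = np.full((7, 3, 32), 0.5)
print(preference_loss(pred_mu, pred_var, target_mu, target_var))  # 0.0
```

The loss vanishes only when the predicted preference matches the target belief in both mean and variance, and it grows as the predicted means move away from the target.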
Object target positions depended on the task. Specifically, the vertical target coordinates of any square, ellipse and
heart shapes in the scene were fixed to 0.2, 0.5 and 0.8, respectively. If, under the task, objects were to go to the left of
the frame, their target horizontal coordinate was 0.2 – if they were to go to the right, it was 0.8. Object shapes were
pseudo-randomized to evenly sample task conditions. In the IfHeart task, we ensured that training episodes had a
50% probability of containing at least one heart. In the HeartXORSquare task, we sampled four conditions with equal
probability: (1) there being neither a heart nor a square in the scene; (2) at least one heart but no squares present; (3) at
least one square but no hearts present; and (4) at least one heart and one square present in the scene. Outside of these
restrictions, shapes were sampled randomly.
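The target coordinates described above can be expressed as a small function for the IfHeart task. This is a sketch under assumptions: the function name `if_heart_targets` is invented here, and which side corresponds to "heart present" is exposed as a parameter, with defaults that follow the rule as written in the Fig. 4 caption.

```python
VERTICAL_TARGET = {"square": 0.2, "ellipse": 0.5, "heart": 0.8}
LEFT, RIGHT = 0.2, 0.8

def if_heart_targets(shapes, side_if_heart=RIGHT, side_otherwise=LEFT):
    """Return one (horizontal, vertical) target per object for the IfHeart task.

    ASSUMPTION: the side chosen when a heart is present is a parameter here;
    the defaults follow the rule as written in the Fig. 4 caption.
    """
    x = side_if_heart if "heart" in shapes else side_otherwise
    return [(x, VERTICAL_TARGET[s]) for s in shapes]

print(if_heart_targets(["square", "ellipse", "square"]))
# [(0.2, 0.2), (0.2, 0.5), (0.2, 0.2)]
print(if_heart_targets(["heart", "square", "ellipse"]))
# [(0.8, 0.8), (0.8, 0.2), (0.8, 0.5)]
```

The vertical coordinate depends only on the object's own shape, while the horizontal coordinate depends on a property of the whole scene, which is what makes the rule conditional.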
Training was performed using the ADAM optimizer [52] with default parameters and an initial learning rate of
3 × 10⁻⁴. This learning rate was reduced automatically by a factor of 3 whenever the validation loss had not decreased in
the last 10 training epochs, down to a minimum learning rate of 3 × 10⁻⁵. Training was performed with a batch size of
64 (16 × 4 GPUs). OBR’s world model was deemed to have converged after 241 epochs, which required approximately
24 hours to train on 4 Nvidia A100 GPUs. Preference networks for the IfHeart and HeartXORSquare tasks were trained
for 90 and 120 epochs, respectively (18 and 24 hours).
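The learning-rate schedule described above can be sketched in plain Python. The class name `ReduceOnPlateau` and the exact bookkeeping (resetting the counter after each reduction, requiring a strict improvement on the best validation loss) are assumptions made for this illustration; the initial rate, factor, patience and minimum are taken from the text.

```python
class ReduceOnPlateau:
    """Divide the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=3e-4, factor=3.0, patience=10, min_lr=3e-5):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.epochs_since_best = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        if self.epochs_since_best >= self.patience:
            self.lr = max(self.lr / self.factor, self.min_lr)
            self.epochs_since_best = 0
        return self.lr

sched = ReduceOnPlateau()
lrs = [sched.step(1.0) for _ in range(35)]  # a validation loss that never improves
print(lrs[0], lrs[10], lrs[20], lrs[30])
```

With a stagnating validation loss, the rate drops from 3 × 10⁻⁴ to 10⁻⁴, then to about 3.3 × 10⁻⁵, and is then clamped at the minimum of 3 × 10⁻⁵.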
6.3.1. ELBO loss
OBR optimizes the following (weighted) ELBO loss for both learning and inference:
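$$\begin{aligned}
\mathcal{L} = -\sum_{t=0}^{T} \Bigg[ & \underbrace{H\left( q\left(\{s_t^{\dagger(k)}, a_t^{(k)}\}\right) \right)}_{\text{Complexity}} + \beta\, \underbrace{\mathbb{E}_{q(\{s_t^{(k)}\})}\left[ \log p\left(o_t \mid \{s_t^{(k)}\}\right) \right]}_{\text{Reconstruction accuracy}} \\
& + \underbrace{\sum_k \mathbb{E}_{q(a_t^{(k)})}\left[ \log p\left(a_t^{(k)} \mid \Psi_t\right) \right]}_{\text{Action inference accuracy}} + \underbrace{\sum_k \mathbb{E}_{q\left(s_t^{\dagger(k)},\, s_{t-1}^{\dagger(k)},\, a_{t-1}^{(k)}\right)}\left[ \log p\left(s_t^{\dagger(k)} \mid s_{t-1}^{\dagger(k)}, a_{t-1}^{(k)}\right) \right]}_{\text{Temporal consistency}} \Bigg]
\end{aligned} \tag{11}$$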
where H(·) denotes entropy. Similar to previous work (e.g. [10]), we up-weight the reconstruction accuracy term in
this loss by a factor β. We train the network to minimize not just the loss at the end of the inference iterations through
the network, but a composite loss that also includes the loss after earlier iterations. Let $\mathcal{L}_\beta^{(n)}$ be the loss after n inference
iterations; then the composite loss is given by:
$$\mathcal{L}_{\mathrm{comp}} = \sum_{n=1}^{N_{\mathrm{iter}}} \frac{n}{N_{\mathrm{iter}}}\, \mathcal{L}_\beta^{(n)} \tag{12}$$
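As a worked example of equation 12, the composite loss weights the loss after iteration n by n / N_iter, so later (more refined) iterations count more. The function name `composite_loss` and the loss values are illustrative.

```python
def composite_loss(per_iteration_losses):
    """Weight the loss after iteration n by n / N_iter and sum (eq. 12)."""
    n_iter = len(per_iteration_losses)
    return sum(
        (n / n_iter) * loss
        for n, loss in enumerate(per_iteration_losses, start=1)
    )

losses = [4.0, 2.0, 1.0, 1.0]  # made-up losses after iterations 1..4
print(composite_loss(losses))  # 0.25*4 + 0.5*2 + 0.75*1 + 1.0*1 = 3.75
```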
6.3.2. Hyperparameters
OBR includes a total of 4 hyperparameters: (1) the loss-reweighting coefficient β (see above); (2) the variance of
the pixels around their predicted values, $\sigma_o^2$; (3) the variance of the noise in the latent space dynamics, $\sigma_s^2$; and (4)
the variance of the noise in the object actions, $\sigma_\psi^2$. The results described in the current work were achieved with the
following settings:
Param.          Value
$\beta$         5.0
$\sigma_o$      0.3
$\sigma_s$      0.1
$\sigma_\psi$   0.3
7. Computing $\mathbb{E}_{q(a^{(k)})}[\log p(a^{(k)} \mid \Psi)]$
The expectation under $q(a^{(k)})$ of $\log p(a^{(k)} \mid \Psi)$, which appears in the ELBO loss (eq. 4), cannot be computed
in closed form, because the latter log probability requires us to marginalize over all possible configurations of the
pixel-to-object assignments, and to do so inside of the logarithm. That is:
\begin{align}
\log p(a^{(k)} \mid \Psi) &= \log \sum_m p(a^{(k)} \mid \Psi, m)\, p(m \mid \{s^{(k)}\}) \tag{13} \\
&= \log\Big(\mathbb{E}_{p(m \mid \{s^{(k)}\})}\big[p(a^{(k)} \mid \Psi, m)\big]\Big) \tag{14}
\end{align}
However, note that within the ELBO loss, we want to maximize the expected value of this quantity (as its negative
appears in the ELBO, which we want to minimize). From Jensen’s inequality, we have:
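$$
\mathbb{E}_{p(m \mid \{s^{(k)}\})}\big[\log p(a^{(k)} \mid \Psi, m)\big] \le \log\Big(\mathbb{E}_{p(m \mid \{s^{(k)}\})}\big[p(a^{(k)} \mid \Psi, m)\big]\Big) \tag{15}
$$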
Therefore, the l.h.s. of this equation provides a lower bound on the quantity we want to maximize. Thus, we can
approximate our goal by maximizing this lower bound instead. This is convenient, because this lower bound, and its
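expectation under $q(a^{(k)})$, can be approximated through sampling:
\begin{align}
\mathbb{E}_{q(a^{(k)})}\Big[\mathbb{E}_{p(m \mid \{s^{(k)}\})}\big[\log p(a^{(k)} \mid \Psi, m)\big]\Big]
&\approx \frac{1}{N_{\text{samples}}} \sum_j \log p(a_j^{(k)*} \mid \Psi, m_j^*) \tag{16} \\
&= \frac{1}{N_{\text{samples}}} \sum_j \log \mathcal{N}\Big(a_j^{(k)*};\ \sum_i \hat{m}_{jk}^{*(i)} \psi_i,\ \sigma_\psi^2 I\Big) \tag{17} \\
\hat{m}_j^{*(i)} \sim p(m_i \mid \{s^{(k)}\}), \quad a_j^{(k)*} &\sim q(a^{(k)}), \quad s_j^{(k)*} \sim q(s^{(k)}) \tag{18}
\end{align}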
where we slightly abuse notation in the sampling of the pixel assignments, as a vector is sampled from a distribution
over a categorical variable. This results in a vector because the sampling step uses the Gumbel-Softmax
trick [53], which is a differentiable method for sampling categorical variables as "approximately one-hot" vectors.
Thus, for every pixel $i$, we sample a vector $\hat{m}_j^{*(i)}$, such that the $k$-th entry of this vector, $\hat{m}_{jk}^{*(i)}$, denotes the "soft-binary"
condition of whether pixel $i$ belongs to object $k$. In practice, we use $N_{\text{samples}} = 1$, based on the intuition that this will
still yield a good approximation over many training instances, and that we rely on the refinement network to learn to
infer good beliefs. The Gumbel-Softmax sampling method depends on a temperature $\tau$, which we gradually reduce
across training epochs, so that the samples gradually better approximate the ideal one-hot vectors.
It is worth noting that, as the entropy of $p(m \mid \{s^{(k)}\})$ decreases (i.e. as object slots "become more certain" about
which pixels are theirs), the bound in equation 15 becomes tighter. In the limit as the entropy goes to 0, the network is
perfectly certain about the pixel assignments, and so the distribution collapses to a point mass. The expectation then
becomes trivial, and so the two sides of eq. 15 become equal. Sampling the pixel assignments is equally trivial in this
case, as the distribution has collapsed to permit only a single value for each assignment. In short, at this extreme point,
the procedure becomes entirely deterministic. In our data, we typically observe very low entropy for $p(m \mid \{s^{(k)}\})$, and so
we likely operate in a regime close to the deterministic one, where the approximation is very accurate.
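The single-sample estimate in equations 16 to 18 can be illustrated with a short standard-library sketch. It is not the authors' implementation: the function names, the representation of the assignment distribution as per-pixel logits, and the treatment of each $\psi_i$ as a small per-pixel action vector are all assumptions made for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one 'approximately one-hot' sample from a categorical distribution."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def log_gaussian(x, mean, var):
    """Log density of an isotropic Gaussian N(x; mean, var * I)."""
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (len(x) * math.log(2.0 * math.pi * var) + sq / var)


def action_log_lik(a_k, k, pixel_logits, psi, sigma_psi, tau, rng):
    """One-sample estimate of log p(a^(k) | Psi, m), following eq. 17.

    a_k          : a sampled action for object k (assumed drawn from q(a^(k)))
    pixel_logits : per-pixel logits over object slots (assumed parameterization)
    psi          : per-pixel action vectors psi_i (assumed shape)
    """
    mean = [0.0] * len(a_k)
    for logits_i, psi_i in zip(pixel_logits, psi):
        m_hat = gumbel_softmax(logits_i, tau, rng)  # soft assignment of pixel i
        for dim in range(len(mean)):
            mean[dim] += m_hat[k] * psi_i[dim]
    return log_gaussian(a_k, mean, sigma_psi ** 2)


# Example: three pixels, two object slots, nearly deterministic assignments.
rng = random.Random(0)
logits = [[50.0, -50.0], [-50.0, 50.0], [50.0, -50.0]]
psi = [[1.0, 0.0], [0.0, 5.0], [0.5, 0.0]]
print(action_log_lik([1.5, 0.0], 0, logits, psi, sigma_psi=0.3, tau=0.5, rng=rng))
```

In this example the assignment logits are far apart, so the sampled vectors are effectively one-hot and the estimate is essentially deterministic, which mirrors the low-entropy regime discussed above.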
7.1. Planning
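OBR’s linearized dynamics model allows us to find an optimal action plan $\pi^*$ (minimizing equation 7) in closed
form, as follows:
$$
d^{(T)} = \begin{bmatrix} 1 \\ 2 \\ \vdots \\ T \end{bmatrix}, \qquad
\Omega_d^{(T)} = \begin{bmatrix}
1 & 0 & \cdots & 0 & 0 \\
1 & 0 & \cdots & 0 & 0 \\
2 & 1 & \cdots & 0 & 0 \\
1 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & & \vdots & \vdots \\
T-1 & T-2 & \cdots & 1 & 0 \\
1 & 1 & \cdots & 1 & 0 \\
T & T-1 & \cdots & 2 & 1 \\
1 & 1 & \cdots & 1 & 1
\end{bmatrix} \tag{19}
$$
$$
U = (\Omega_d^{(T)} \otimes D), \qquad L = \operatorname{diag}(\mathbf{1} \otimes \tilde{\sigma})^{-2} \tag{20}
$$
$$
\pi^* = (U^\top L U + \lambda_a I)^{-1} U^\top L \left( \mathbf{1} \otimes (\tilde{\mu} - \mu_{s_t^\dagger}) - d^{(T)} \otimes \begin{bmatrix} \mu_{s'_t} \\ 0 \end{bmatrix} \right) \tag{21}
$$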
where $\mathbf{1}$ denotes a column vector of 1s of length $T$, $\otimes$ denotes the Kronecker product, and $\mu_{s_t^\dagger}$ and $\mu_{s'_t}$ denote, respectively,
the means of the variational beliefs about the full object state in generalized coordinates ($s^\dagger$), and the means of the
beliefs about the derivatives ($s'$) only. In words, $U$ is a matrix that maps actions within the planning horizon to their
(cumulative) effects over that time window. These effects are translated to the network’s latent space, through the
multiplication by $D$ in equation 20. The optimal actions are obtained by projecting the current (precision-weighted)
error (i.e., the discrepancy between the desired state and the sequence of states that will unfold if no action is taken)
onto the (precision-weighted) pseudoinverse of $U$. This projection is optionally shrunk towards a zero-mean Gaussian
prior over actions, with precision $\lambda_a$, to regularize the scale of the actions.
In the experiments reported in this work, we used a planning horizon of $T = 3$ and an action regularization strength
$\lambda_a = 0.1$. We also set $L = I$, as we empirically find this leads to better and more stable performance.
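To make the closed-form plan concrete, the following standard-library sketch evaluates equation 21 with $L = I$, as in the reported experiments. It is an illustration under stated assumptions rather than the authors' code: the helper names, the layout of the state as [values, derivatives], the alternating value/derivative row ordering of the effect matrix, and the toy matrix `D` in the example are all assumed.

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def transpose(A):
    return [list(col) for col in zip(*A)]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def omega(T):
    """Effect matrix of eq. 19: per time step, one value row and one derivative row."""
    rows = []
    for tau in range(1, T + 1):
        rows.append([float(max(tau - j + 1, 0)) for j in range(1, T + 1)])
        rows.append([1.0 if j <= tau else 0.0 for j in range(1, T + 1)])
    return rows


def plan(mu_target, mu_state, D, T=3, lam=0.1):
    """Regularized least-squares action plan (eq. 21 with L = I).

    mu_target, mu_state : [values..., derivatives...], each of length 2 * d
    D                   : d x d_a matrix mapping actions to latent effects (assumed)
    """
    d = len(D)
    U = kron(omega(T), D)
    vel = mu_state[d:]
    # Error between the target and the states that unfold if no action is taken.
    err = []
    for tau in range(1, T + 1):
        drift = [tau * v for v in vel] + [0.0] * d
        err += [t - s - dr for t, s, dr in zip(mu_target, mu_state, drift)]
    Ut = transpose(U)
    A = [[sum(x * y for x, y in zip(ri, rj)) for rj in Ut] for ri in Ut]
    for i in range(len(A)):
        A[i][i] += lam  # zero-mean Gaussian prior over actions
    rhs = [sum(u * e for u, e in zip(row, err)) for row in Ut]
    return solve(A, rhs)


# Example: a 1-D latent value at 0 with derivative 0.5, target value 2 at rest.
print(plan([2.0, 0.0], [0.0, 0.5], D=[[1.0]], T=3, lam=0.1))
```

Because $\lambda_a > 0$ makes the system matrix positive definite, the linear solve is always well posed in this sketch, even when $U$ itself is rank deficient.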
8. Latent space traversals
OBR is trained without supervision, and so the structure of its latent representational space is a priori unknown.
Here, we explore this space by systematically traversing it one dimension at a time. We first let OBR encode an image
into its latent space. Subsequently, for each object representation, we increment the target dimension’s encoded (mean)
value by each of 20 evenly spaced values in the range [−1, 1]. We then feed the resulting, perturbed latent vectors
through OBR’s decoder to obtain 20 images that vary systematically (and for all objects) along the target dimension.
The images in Fig. 10 show the results of this procedure for three different scenes, for each of the 16 latent dimensions
of the trained OBR model. Of particular note are dimensions 3 and 9, which have learned to encode the objects’
positions (along an approximately orthogonal set of axes at an arbitrary angle to the pixel grid). Other recognizable
variations can be seen in color (dimensions 4, 6 and 11) and size (14). Shape and orientation appear to be entangled
along multiple dimensions. Other dimensions do not obviously encode anything – they may be truly non-coding, or
their role might not be visible in combination with the other latent values that we happened to sample here. In particular,
the depth value of an object is almost certainly encoded in one or more latent dimensions, but this is not apparent from
these traversals, as they target all objects simultaneously, and thus should not affect their occlusion patterns (which
object is in front of which other object(s)).
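The traversal procedure can be sketched as follows. This is a hypothetical helper, not part of OBR: it only constructs the perturbed latent means, and the decoding step (feeding each perturbed set of latents through the trained decoder) is left out because the decoder interface is not specified here.

```python
def traverse(latent_means, dim, n_steps=20, low=-1.0, high=1.0):
    """Build perturbed copies of the object latents along one dimension.

    latent_means : list of per-object latent mean vectors (one list per object)
    dim          : index of the latent dimension to traverse
    Returns a list of `n_steps` frames; in each frame, the same offset is added
    to dimension `dim` of every object, and all other dimensions are unchanged.
    """
    offsets = [low + (high - low) * i / (n_steps - 1) for i in range(n_steps)]
    frames = []
    for offset in offsets:
        frame = [
            [z + offset if d == dim else z for d, z in enumerate(obj)]
            for obj in latent_means
        ]
        frames.append(frame)
    return frames


# Example: two objects with 16 latent dimensions each, traversing dimension 3.
latents = [[0.0] * 16, [0.5] * 16]
frames = traverse(latents, dim=3)
print(len(frames), frames[0][0][3], frames[-1][0][3])
```

Each returned frame would then be decoded into one image, giving 20 images per latent dimension and scene.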
References
[1] S. J. Cowley, How human infants deal with symbol grounding, Interaction Studies 8 (1) (2007) 83–104.
[2] E. Gibney, This ai learnt language by seeing the world through a baby’s eyes, Nature 626 (7999) (2024) 465–466.
[3] S. T. Piantadosi, The computational origin of representation, Minds and machines 31 (2021) 1–58.
[4] P. W. Battaglia, J. B. Hamrick, J. B. Tenenbaum, Simulation as an engine of physical scene understanding, Proceedings of the National
Academy of Sciences 110 (45) (2013) 18327–18332.
[5] K. Greff, S. Van Steenkiste, J. Schmidhuber, On the binding problem in artificial neural networks, arXiv preprint arXiv:2012.05208 (2020).
[6] B. Peters, N. Kriegeskorte, Capturing the objects of vision with neural networks, Nature Human Behaviour 5 (2021) 1127–1144.
[7] T. L. Griffiths, N. Chater, C. Kemp, A. Perfors, J. B. Tenenbaum, Probabilistic models of cognition: Exploring representations and inductive
biases, Trends in cognitive sciences 14 (8) (2010) 357–364.
[8] A. Goyal, Y. Bengio, Inductive biases for deep learning of higher-level cognition, Proceedings of the Royal Society A 478 (2266) (2022)
20210068.
[9] S. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al., Attend, infer, repeat: Fast scene understanding with generative
models, Advances in neural information processing systems 29 (2016).
[10] K. Greff, R. L. Kaufman, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, A. Lerchner, Multi-object representation
learning with iterative variational inference, in: International conference on machine learning, PMLR, 2019, pp. 2424–2433.
[11] G. Elsayed, A. Mahendran, S. van Steenkiste, K. Greff, M. C. Mozer, T. Kipf, Savi++: Towards end-to-end object-centric learning from
real-world videos, Advances in Neural Information Processing Systems 35 (2022) 28940–28954.
[12] F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, T. Kipf, Object-centric learning with
slot attention, Advances in Neural Information Processing Systems 33 (2020) 11525–11538.
[13] J. Jiang, F. Deng, G. Singh, S. Ahn, Object-centric slot diffusion, arXiv preprint arXiv:2303.10834 (2023).
[14] M. Traub, S. Otte, T. Menge, M. Karlbauer, J. Thuemmel, M. V. Butz, Learning what and where: Disentangling location and identity tracking
without supervision, arXiv preprint arXiv:2205.13349 (2022).
[15] H.-X. Yu, L. J. Guibas, J. Wu, Unsupervised discovery of object radiance fields, in: International Conference on Learning Representations,
2022.
[16] P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende, et al., Interaction networks for learning about objects, relations and physics, Advances in
neural information processing systems 29 (2016).
[17] Y. Li, J. Wu, J.-Y. Zhu, J. B. Tenenbaum, A. Torralba, R. Tedrake, Propagation networks for model-based control under partial observation, in:
2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 1205–1211.
[18] L. S. Piloto, A. Weinstein, P. Battaglia, M. Botvinick, Intuitive physics learning in a deep-learning model inspired by developmental psychology,
Nature human behaviour 6 (9) (2022) 1257–1267.
[19] N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu, A. Tacchetti, Visual interaction networks: Learning a physics simulator from video,
Advances in neural information processing systems 30 (2017).
[20] C. Sancaktar, S. Blaes, G. Martius, Curious exploration via structured world models yields zero-shot object manipulation, Advances in Neural
Information Processing Systems 35 (2022) 24170–24183.
[21] R. Veerapaneni, J. D. Co-Reyes, M. Chang, M. Janner, C. Finn, J. Wu, J. Tenenbaum, S. Levine, Entity abstraction in visual model-based
reinforcement learning, in: Conference on Robot Learning, PMLR, 2020, pp. 1439–1456.
[22] D. Driess, Z. Huang, Y. Li, R. Tedrake, M. Toussaint, Learning multi-object dynamics with compositional neural radiance fields, in: Conference
on robot learning, PMLR, 2023, pp. 1755–1768.
[23] D. Haramati, T. Daniel, A. Tamar, Entity-centric reinforcement learning for object manipulation from pixels, arXiv preprint arXiv:2404.01220
(2024).
[24] T. Van de Maele, T. Verbelen, P. Mazzaglia, S. Ferraro, B. Dhoedt, Object-centric scene representations using active inference, Neural
Computation 36 (4) (2024) 677–704.
[25] J. Huo, Q. Sun, B. Jiang, H. Lin, Y. Fu, Geovln: Learning geometry-enhanced visual representation with slot attention for vision-and-language
navigation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23212–23221.
[26] R. Assouel, P. Rodriguez, P. Taslakian, D. Vazquez, Y. Bengio, Object-centric compositional imagination for visual abstract reasoning, in:
ICLR2022 Workshop on the Elements of Reasoning: Objects, Structure and Causality, 2022.
[27] T. Webb, S. S. Mondal, J. D. Cohen, Systematic visual reasoning through object-centric relational abstraction, Advances in Neural Information
Processing Systems 36 (2024).
[28] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., Palm-e: An embodied
multimodal language model, arXiv preprint arXiv:2303.03378 (2023).
[29] T. Taniguchi, S. Murata, M. Suzuki, D. Ognibene, P. Lanillos, E. Ugur, L. Jamone, T. Nakamura, A. Ciria, B. Lara, et al., World models and
predictive coding for cognitive and developmental robotics: Frontiers and challenges, Advanced Robotics 37 (13) (2023) 780–806.
[30] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al., Do as i can, not as i
say: Grounding language in robotic affordances, arXiv preprint arXiv:2204.01691 (2022).
[31] Y. Xu, W. Li, P. Vaezipoor, S. Sanner, E. B. Khalil, Llms and the abstraction and reasoning corpus: Successes, failures, and the importance of
object-based representations, arXiv preprint arXiv:2305.18354 (2023).
[32] K. Friston, The free-energy principle: a unified brain theory?, Nature reviews neuroscience 11 (2) (2010) 127–138.
[33] P. Lanillos, C. Meo, C. Pezzato, A. A. Meera, M. Baioumy, W. Ohata, A. Tschantz, B. Millidge, M. Wisse, C. L. Buckley, et al., Active
inference in robotics and artificial agents: Survey and challenges, arXiv preprint arXiv:2112.01871 (2021).
[34] D. Walther, C. Koch, Modeling attention to salient proto-objects, Neural networks 19 (9) (2006) 1395–1407.
[35] A. Rasouli, P. Lanillos, G. Cheng, J. K. Tsotsos, Attention-based active visual search for mobile robots, Autonomous Robots 44 (2) (2020)
131–146.
[Figure 10: Latent space traversals. Panels (a) to (h) show latent dimensions 0 to 7; a second set of panels (a) to (h) shows latent dimensions 8 to 15.]
[36] R. S. van Bergen, P. Lanillos, Object-based active inference, in: Active Inference: Third International Workshop, IWAI 2022, Grenoble, France,
September 19, 2022, Revised Selected Papers, Springer, 2023, pp. 50–64.
[37] J. Marino, Y. Yue, S. Mandt, Iterative amortized inference, 35th International Conference on Machine Learning, ICML 2018 8 (2018)
5444–5462.
[38] F. Bos, A. A. Meera, D. Benders, M. Wisse, Free energy principle for state and input estimation of a quadcopter flying in wind, in: 2022
International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 5389–5395.
[39] T. Parr, G. Pezzulo, K. J. Friston, Active inference: the free energy principle in mind, brain, and behavior, MIT Press, 2022.
[40] O. van der Himst, P. Lanillos, Deep active inference for partially observable mdps, in: Active Inference: First International Workshop, IWAI
2020, Co-located with ECML/PKDD 2020, Ghent, Belgium, September 14, 2020, Proceedings 1, Springer, 2020, pp. 61–71.
[41] B. Millidge, A. Tschantz, C. L. Buckley, Whence the expected free energy?, Neural Computation 33 (2) (2021) 447–482.
[42] Z. Fountas, N. Sajid, P. Mediano, K. Friston, Deep active inference agents using monte-carlo methods, Advances in neural information
processing systems 33 (2020) 11662–11675.
[43] R. E. Kalman, et al., Contributions to the theory of optimal control, Bol. soc. mat. mexicana 5 (2) (1960) 102–119.
[44] D. E. Kirk, Optimal control theory: an introduction, Courier Corporation, 2004.
[45] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,
in: International conference on machine learning, PMLR, 2018, pp. 1861–1870.
[46] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
[47] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-baselines3: Reliable reinforcement learning implementations,
Journal of Machine Learning Research 22 (268) (2021) 1–8.
[48] C. Gumbsch, M. V. Butz, G. Martius, Sparsely changing latent states for prediction and planning in partially observable domains, Advances in
Neural Information Processing Systems 34 (2021) 17518–17531.
[49] J. Gu, S. Kirmani, P. Wohlhart, Y. Lu, M. G. Arenas, K. Rao, W. Yu, C. Fu, K. Gopalakrishnan, Z. Xu, et al., Rt-trajectory: Robotic task
generalization via hindsight trajectory sketches, arXiv preprint arXiv:2311.01977 (2023).
[50] P. Sundaresan, Q. Vuong, J. Gu, P. Xu, T. Xiao, S. Kirmani, T. Yu, M. Stark, A. Jain, K. Hausman, et al., Rt-sketch: Goal-conditioned imitation
learning from hand-drawn sketches, arXiv preprint arXiv:2403.02709 (2024).
[51] P. Lanillos, M. van Gerven, Neuroscience-inspired perception-action in robotics: applying active inference for state estimation, control and
self-perception, arXiv preprint arXiv:2105.04261 (2021).
[52] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[53] E. Jang, S. Gu, B. Poole, Categorical reparameterization with gumbel-softmax, arXiv preprint arXiv:1611.01144 (2016).