arxiv

Paper 2411.17438

Object-centric proto-symbolic behavioural reasoning from pixels

Authors: Ruben van Bergen, Justus Hübotter, Pablo Lanillos

Published: 2024-11-26

Abstract:

Autonomous intelligent agents must bridge computational challenges at disparate levels of abstraction, from the low-level spaces of sensory input and motor commands to the high-level domain of abstract reasoning and planning. A key question in designing such agents is how best to instantiate the representational space that will interface between these two levels -- ideally without requiring supervision in the form of expensive data annotations. These objectives can be efficiently achieved by representing the world in terms of objects (grounded in perception and action). In this work, we present a novel, brain-inspired, deep-learning architecture that learns from pixels to interpret, control, and reason about its environment, using object-centric representations. We show the utility of our approach through tasks in synthetic environments that require a combination of (high-level) logical reasoning and (low-level) continuous control. Results show that the agent can learn emergent conditional behavioural reasoning, such as $(A \to B) \land (\neg A \to C)$, as well as logical composition $(A \to B) \land (A \to C) \vdash A \to (B \land C)$ and XOR operations, and successfully controls its environment to satisfy objectives deduced from these logical rules. The agent can adapt online to unexpected changes in its environment and is robust to mild violations of its world model, thanks to dynamic internal desired goal generation. While the present results are limited to synthetic settings (2D and 3D activated versions of dSprites), which fall short of real-world levels of complexity, the proposed architecture shows how to manipulate grounded object representations, as a key inductive bias for unsupervised learning, to enable behavioral reasoning.

Paper Content:
Affiliations: Ruben van Bergen (a), Justus Hübotter (a), Pablo Lanillos* (b, a). (a) Donders Institute, Radboud University, Nijmegen, The Netherlands. (b) Cajal International Neuroscience Center, Spanish National Research Council, Madrid, Spain.

Keywords: Object-centric reasoning, Brain-inspired perception and control, Deep learning architectures.

1. Introduction

A one-year-old infant, before language is expressed [1], learns to map the efferent-afferent patterns of sensory stimulation and motor commands to a neural representation of the environment. The pathway from the sensorium, to abstract thought, and back to the minutiae of the sensorimotor domain defines the feats of higher-order cognition and structures the different levels of abstraction at which intelligent agents must operate when they interact with the environment [2]. How sub-symbolic computations transform into higher-level cognitive representations of structures is far from being understood [3]. However, it is commonly accepted that intermediate representations bridge cognitive reasoning and behaviour [4]. Hence, in the design of artificial agents, a key challenge is to find a representational space that provides an effective interface between these disparate demands, as well as a mechanism that makes proper use of this representation to interact with the environment and produce behaviour. In this sense, we approach reasoning from a behavioural perspective, where the output of a reasoning process is always an action that physically interacts with the environment.

To address these challenges in a single, comprehensive system, we introduce a novel, brain-inspired neural network architecture that spans the domains of cognitive reasoning, perceptual inference, planning and continuous control.
We follow both the emergentist and the probabilistic approach to cognition: the representational format, and the machinery that transforms sensory observations into this space, are learned without supervision but allow reasoning as an inference process. This also avoids the dependency on ground-truth labels furnished by human annotators, which are highly labor-intensive to produce, especially when labels must cover all the relevant variables in a scene.

The proposed architecture leverages the core inductive bias that the environment can be partitioned into discrete entities, or objects, that obey many useful symmetries and invariances [5]. Object-based representations and reasoning are a key tenet of perception and cognition in humans [6]. Objects constitute the environment's separable, movable parts, as well as the logical units of reasoning and planning. They naturally take on properties of symbols, as different objects obey the same laws of physics and possess similar or analogous properties [7]. At the same time, each object representation is tethered to the sensorimotor domain via explicit attention maps. Objects, thus, are an ideal level of abstraction to interface between the disparate levels of computation that we require. We show the promise of our approach by evaluating it in tasks that require both high-level reasoning and continuous control in synthetically generated environments.

Preprint submitted (under review), February 12, 2025. arXiv:2411.17438v2 [cs.AI], 11 Feb 2025.

2. Related work, challenges and contribution

Structured representation learning has emerged as a powerful way to introduce inductive biases, to scale artificial intelligence to high-level cognition [8], and to exploit the symmetries and invariances that become available when scenes are decomposed into their natural constituents [9].
Particularly, object-centric representations provide a natural interface between bottom-up (e.g., emergentist, connectionist) and top-down (e.g., graph-based, probabilistic) approaches [7]. We summarize the closely related works that influenced the proposed architecture, classified into four topics (see footnote 1): scene understanding (e.g., visual segmentation), physics-informed and simulation-based methods, robotics, and reasoning (e.g., object relations). We analyse these works through the prism of behavioural reasoning.

Visual segmentation and tracking. Visual segmentation methods have reached maturity, with high segmentation accuracy in both static images [10] and videos [11]. From IODINE [10], which used iterative amortized inference, recent methods have evolved to slot-attention architectures [12] and diffusion approaches [13]. These methods, which can track 2D and 3D objects in cluttered scenes [14], capture object collisions and discover new objects [15], are designed for perception and not for reasoning about, or controlling, the segmented objects.

Physics-informed and simulation-based methods. These approaches have also shown great potential in handling complex non-linear interactions, and are strongly influenced by the simulation approach to cognition [4]. For instance, interaction [16] and propagation [17] networks are able to understand complex scene dynamics with multiple objects and even control them to reach a desired known state. Unfortunately, these methods require full observation of the object states. Approaches that can work directly on pixels make use of ground-truth segmentation masks [18] or visual interaction networks [19]. While these approaches have some kind of physical "reasoning", a full-fledged architecture that performs first-order logic reasoning from pixels and transforms it into behaviour (control) has not yet been fully accomplished.

Robotics.
These works focus on moving objects meaningfully and usually have full access to the object states (e.g., [20]). OP3 [21] is an outstanding exception. However, it has two key limitations. First, it is restricted to a discrete action space of picking up and placing objects, with no continuous control. Second, it has no ability to learn tasks: it can only plan actions towards objectives specified ad hoc by means of a goal image that shows exactly what the scene in question ought to look like. Novel approaches use ReFs to describe the 3D representation of the object, with control solved by RRTs in the latent prediction dynamics and Model Predictive Control (MPC) to execute the actions [22]. Still, ground-truth object masks are used for training and inference. Some methods have overcome this constraint through entity-based segmentation (a simplification of object-centric representation that uses the center of the object as the location of the entity) and RL to generate meaningful behaviours [23]. Unfortunately, this restricts the shape of objects to "point-mass"-like entities. Besides, there are object-centric approaches that, instead of focusing on the interactive behaviour between the agent and the objects, focus on view-point matching [24] and agent navigation [25].

Reasoning. Leaving aside the literature on visual scene understanding with language, reasoning with object-centric representations as neurosymbols is, while the most promising, the least investigated area [26]. There are just a few object-centric studies on visual reasoning in static images [27]. Furthermore, there are relevant works from Large Language Model research (e.g., [28]), but they do not adhere to the grounding paradigm proposed here, where reasoning should appear as an emergent property of learning a physical world model [29].
For instance, in [30] the reasoning capabilities are much more expressive than in our proposed approach, but they rely on pre-trained behaviours, which are conditioned on the language generated by the LLM. Conversely, our architecture learns the world dynamics and interaction through unsupervised learning, harnessing the construction of the proto-symbols while interacting. This does not preclude connecting the proposed architecture to an LLM, similarly to [31], but through the preference network, which is already grounded.

[Footnote 1: Some of the works may have overlapping features in other topics.]

2.1. Current challenges and architecture decisions

There are two major challenges in unsupervised learning with object-centric representations: working with complex naturalistic scene images [13], and generating meaningful physical actions (i.e., interacting through continuous control) derived from the reasoning process when the input is high-dimensional (i.e., an image) and the environment is partially observable. This work focuses on the latter. We show that simple conditional reasoning problems, such as "if there is a heart move squares to the left, otherwise move squares to the right" ($(A \to B) \land (\neg A \to C)$), are already unsolvable for state-of-the-art (SOTA) algorithms that do not use object-centric representations (see Sec. 4). An underlying problem for both challenges is to achieve high performance in world-state dynamics estimation (world model learning) while allowing adaptation and generalization. We designed the object-centric backbone architecture to incorporate adaptation, following the free energy principle theory of neural information processing [32,33], where the brain estimates the world by continuously approximating its internal model, learned by experience, to the real world through approximate Bayesian computation. To implement it, we opted for an iterative variational inference approach inspired by the IODINE [10] architecture.
Whilst iterative methods have shown lower performance than amortized approaches (e.g., the slot-attention architecture [12]), at least on perceptual tasks, they provide Bayesian filtering and smoothing, thus allowing better counterfactual reasoning. Finally, another relevant challenge, which is not addressed in this work, is dealing with the complex non-linearities of physical contact, such as collisions. Physics-based simulation approaches [18] and deep Reinforcement Learning (RL) [23] show the best results, but they still require ground-truth object masks or entail oversimplified segmentation of the visual input, respectively.

2.2. Contribution

This work provides a neural network architecture (Fig. 1) that can learn first-order conditional behavioural reasoning, where proto-symbols (see footnote 2) take the form of object-centric representations, and where perception, control and preferences (the desired internal state of the agent) are dynamical processes computed through approximate Bayesian inference, in an unsupervised learning scheme from pixels. Our approach draws inspiration from natural intelligence, where the agent's function is to generate behaviours through reasoning conditioned on perceptual cues by optimizing a single quantity: the variational free energy (or evidence lower bound) [32]. To this end, we combine bottom-up and top-down object-centric approaches [35,36], following the emergentist approach (with unsupervised learning from pixels) but allowing probabilistic inference on the structured representation. This permits the agent to learn i) what an object is (scene understanding), ii) how to mentally manipulate it through object-centric rules (reasoning), and iii) how to generate physical behavior (i.e., continuous control). The agent can only perceive through visual input (a 2D image projection) of a synthetic world composed of 2D or 3D objects (see Fig. 2a for a schematic).
It can interact with the environment by applying forces to locations in the image, which correspond to forces in the 2D/3D world.

This architecture allows the agent to learn complex conditional rules, such as $(A \to B \land C) \land (\neg A \to D \land E)$; to perform logical composition, $(A \to B) \land (A \to C) \vdash A \to (B \land C)$, by learning two different rules separately that then emerge combined during execution; and to learn XOR operations, such as $((A \lor B) \to C) \land ((A \land B) \to D)$. Once the agent learns the rules, reasoning transforms into a behavioural response that changes the environment towards its internal preference. Furthermore, the agent, using the proposed architecture, shows online adaptation to unexpected changes in its environment, is robust to mild violations of its world model, and is invariant to the number of objects.

3. Proposed neural network model

The proposed Object-centric Behavioural Reasoner (OBR), depicted in Fig. 1, is an object-centric deep learning architecture consisting of two interconnected modules: i) perceptual inference, in charge of learning object representations (properties and dynamics) and producing top-down attention, and ii) action inference, comprising reasoning and control, in charge of learning proto-symbolic rules at the level of object representations and transforming them into meaningful behaviors (i.e., control commands).

[Footnote 2: We use the term proto-symbols by analogy with proto-objects in visual segmentation [34], that is, to describe candidates for symbols and objects.]

Figure 1: OBR model architecture. Illustration of an example instance of the OBR perceptual and reasoning interconnected modules with K=3 slots and an inference window of two time points. The perceptual module exploits iterative amortized inference (through a refinement network).
The action module reasons about the preferred internal state (through online goal imagination) and generates continuous control actions accordingly, to obtain the desired state in the real world (through the minimization of the variational free energy). OBR uses object-centric representations, dynamics, reasoning, and control. For clarity, the computation of object action beliefs is not included here (see Appendix 6).

3.1. Perceptual inference and generative model

The perception module infers the object state and action beliefs from incoming sensory data, e.g., one RGB image (frame) per time point, and the agent's own motor efferents (see Appendix 6.1).

Object-centric representation. We define the $k$-th object at time $t$ as a state vector $s^{\dagger(k)}_t$. This vector is expressed in second-order generalized coordinates, that is, it is a concatenation of the current state and its derivative: $s^{\dagger(k)}_t = \big[ s^{(k)\top}_t, s'^{(k)\top}_t \big]^\top$. Objects are influenced by actions from the agent. The action (or action-effect) on object $k$ at time $t$ is denoted by $a^{(k)}_t$. OBR represents its knowledge about object states and actions through variational beliefs $\{q(s^{\dagger(k)}_t)\}$ and $\{q(a^{(k)}_t)\}$, which are mean-field Gaussian distributions parameterized by a mean and variance for each latent dimension. To learn object-centric representations and perform inference from pixels, we use a set of $K$ weight-sharing iterative variational autoencoders (itVAEs) [37,10], where $K$ is the number of object representations to be inferred.
These itVAEs together invert a generative model in which image pixels are drawn from a Gaussian mixture distribution:

$$p(o_i \mid \{s^{(k)}\}_{k \in 1:K}) = \sum_k \hat{m}_{ik}\, \mathcal{N}\big(g_i(s^{(k)}), \sigma_o^2\big) \qquad (1)$$

$$\hat{m}_{ik} = p(m_i \mid \{s^{(k)}\}_{k \in 1:K}) = \mathrm{Cat}\big(\mathrm{Softmax}(\{\pi_i(s^{(k)})\}_{k \in 1:K})\big) \qquad (2)$$

where $o_i$ is the value of the $i$-th image pixel, $g_i(\cdot)$ is a decoder function that translates an object state to a predicted mean value at pixel $i$, $\sigma_o^2$ is the variability of pixels around their mean values, and $\pi_i(\cdot)$ maps an object state to a log-probability at pixel $i$, which defines the predicted probability $\hat{m}_{ik}$ that the pixel belongs to that object (its segmentation mask). We implement $g_i(\cdot)$ and $\pi_i(\cdot)$ jointly in the decoder of the itVAE, which thus outputs 4 channels per pixel (3 RGB color values + 1 mask logit). Each itVAE thereby predicts its object's subimage and segmentation mask, and a full image prediction can be obtained as a segmentation-weighted superposition of subimages.

Object dynamics. To complete the world model, this per-frame observation model is combined with linearized object-centric dynamics $p(s^{\dagger(k)}_t \mid s^{\dagger(k)}_{t-1}, a^{(k)}_{t-1})$ defined by the following equations:

$$s'^{(k)}_t = s'^{(k)}_{t-1} + D(a^{(k)}_{t-1}) + \sigma_s \epsilon_{1t}, \qquad s^{(k)}_t = s^{(k)}_{t-1} + s'^{(k)}_t + \sigma_s \epsilon_{2t} \qquad (3)$$

where $\epsilon_{1t}$ and $\epsilon_{2t}$ are noise realizations drawn from a Normal distribution. Note that we are assuming that the latent representation dynamics can be captured with second-order generalized coordinates with additive noise, but this is not imposed on the environment (see footnote 3). The action $a^{(k)}_t$ on object $k$ at time $t$ is a (2-D or 3-D) vector that specifies the control command (e.g., acceleration) on the object in environment coordinates. Finally, $D$ is a learned function that transforms the control command to its effect in the model's latent space.

Learning and inference. Inference for each object is performed within a sliding window spanning the present and a number of past time points.
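To make the observation model of Eqs. (1) and (2) concrete, the following is a minimal single-pixel sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, the decoder outputs (per-slot RGB means and mask logits) are supplied by hand rather than produced by an itVAE, and the Gaussian is assumed to factorize over the three color channels with a shared standard deviation `sigma_o`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the K per-slot mask logits of one pixel."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_pdf(value, mean, sigma):
    """Density of a univariate Normal(mean, sigma^2) evaluated at `value`."""
    z = (value - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def pixel_likelihood(pixel_rgb, slot_rgb_means, slot_mask_logits, sigma_o):
    """Eq. (1) for one pixel: mask-weighted sum of per-slot Gaussian densities."""
    mask = softmax(slot_mask_logits)  # Eq. (2): predicted mask probabilities
    total = 0.0
    for m_k, mean_rgb in zip(mask, slot_rgb_means):
        density = 1.0
        # Assumption: independent channels with a shared sigma_o.
        for channel, mean in zip(pixel_rgb, mean_rgb):
            density *= gaussian_pdf(channel, mean, sigma_o)
        total += m_k * density
    return total


def compose_pixel(slot_rgb_means, slot_mask_logits):
    """Segmentation-weighted superposition of the K per-slot RGB predictions."""
    mask = softmax(slot_mask_logits)
    return [
        sum(m_k * mean_rgb[c] for m_k, mean_rgb in zip(mask, slot_rgb_means))
        for c in range(3)
    ]


# Hand-made example with K=3 slots: red object, blue object, gray background.
means = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
logits = [4.0, 0.0, 0.0]  # slot 0 dominates this pixel
print(compose_pixel(means, logits))
print(pixel_likelihood([1.0, 0.0, 0.0], means, logits, sigma_o=0.1))
```

In a full model the same computation would run for every pixel, with the means and logits coming from the 4-channel decoder output described above.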
Each itVAE infers beliefs for a single object and time point, taking a total of 8 inference iterations to refine these beliefs to their optimal (minimum ELBO-loss) values. Inference is coupled between itVAEs through information flow between represented objects and between time points in the inference window (see Appendix 6). OBR optimizes a composite ELBO loss $\mathcal{L}_{\mathrm{comp}} = \sum_{n=1}^{N_{\mathrm{iter}}} \frac{n}{N_{\mathrm{iter}}} \mathcal{L}^{(n)}$ for both learning and inference, where $n$ indexes inference iterations and $\mathcal{L}$ is:

$$\mathcal{L} = -\sum_{t=0}^{T} \Big[ \underbrace{H\big(q(\{s^{\dagger(k)}_t, a^{(k)}_t\})\big)}_{\text{Complexity}} + \underbrace{\beta\, \mathbb{E}_{q(\{s^{(k)}_t\})}\big[\log p(o_t \mid \{s^{(k)}_t\})\big]}_{\text{Reconstruction accuracy}} + \underbrace{\sum_k \mathbb{E}_{q(a^{(k)}_t)}\big[\log p(a^{(k)}_t \mid \Psi_t)\big]}_{\text{Action inference accuracy}} + \underbrace{\sum_k \mathbb{E}_{q(s^{\dagger(k)}_t, s^{\dagger(k)}_{t-1}, a^{(k)}_{t-1})}\big[\log p(s^{\dagger(k)}_t \mid s^{\dagger(k)}_{t-1}, a^{(k)}_{t-1})\big]}_{\text{Temporal consistency}} \Big] \qquad (4)$$

3.2. Reasoning and control

OBR's generative model allows it to infer the current state of the world, as well as to predict future states that would arise as the result of the agent's actions. In a (model-based) RL framework, we could now define rewards and learn a value function on which to base a behavioral policy. Here, we take a slightly different approach, which is central to the proposed architecture, based on the influential computational neuroscience framework of Active Inference [39,33]. It posits that agents, through their actions, aim to minimize a variational free energy with respect to a probability distribution over states that they prefer to find themselves in [40,41,42]. This preference distribution entails a biased (conditional) prior on environmental states, which acts as an attractor for planning future behavior. Once this preference (or soft goal) is learnt, the rest of the architecture's machinery drives the system (agent-world) to generate object-centric actions that minimize the discrepancy between the current and the desired state.

Preference network. We assume that the preference distribution is given by $\tilde{p}(\{s^{\dagger(k)}_{T>t}\}_{k \in 1:K} \mid \{\lambda^{(k)}_t\}_{k \in 1:K}) = \prod_k \mathcal{N}(\tilde{\mu}^{(k)}, \tilde{\sigma}^{(k)2})$, where $T$ is some time horizon and $t$ denotes the present.
Note that this distribution is conditioned on the agent's internal beliefs, via the variational parameters $\lambda^{(k)}_t$. We implement this mapping through a set-structured MLP network $\phi$ (architectural details in Appendix 6.1), which preserves the order of objects from input to output and is invariant to the number of objects:

$$\nu^{(k)} = \phi_{\mathrm{enc}}(\lambda^{(k)}), \qquad c = \frac{1}{K} \sum_k W_{\mathrm{ctx}}\, \nu^{(k)} + b_{\mathrm{ctx}} \qquad (5)$$

$$\tilde{\mu}^{(k)}, \tilde{\sigma}^{(k)} = \phi_{\mathrm{dec}}\big(\nu^{(k)}, c\big) \qquad (6)$$

In words, each $\lambda^{(k)}$ is projected into an embedding space by an encoder network. Object-wise embedding vectors $\{\nu^{(k)}\}$ are then linearly transformed and aggregated into a global context vector $c$, which is appended to each object embedding. Finally, the concatenated object + context embeddings are passed through a decoder network to obtain object-wise preference statistics. Conceptually, since OBR determines its current goals by repeatedly applying the same local operation to each object representation (rather than a global operation on the full state of the environment), this may be interpreted as a form of (proto-)symbolic reasoning.

[Footnote 3: Robotic experiments have shown that second-order generalized coordinates are enough to track a dynamical system [38], depending on the nature of the noise.]

Note that the preference network depends on the world model in an unsupervised learning scheme. The latent space geometry that the world model ends up learning is unknown a priori. Therefore, the preference network cannot learn a mapping within this latent space (from current to desired states) before this latent space has been defined. Interestingly, multiple preference models can be trained "on top of" the same world model, allowing fast acquisition of novel tasks within the same environment. Practically, the one main difference from a common value network or critic is that our preference network does not assign scores to states, but rather furnishes the agent with a desired state conditioned on its current context.
This obviates the need to unroll (many) possible futures to evaluate which of these would be more desirable.

Control. Given a preference distribution $\tilde{p}(s^{\dagger})$ (see footnote 4), the linearized dynamics model enables the agent to plan actions efficiently in closed form (without the need for rollouts). Specifically, the action plan $\pi = [a_{t+1}, \ldots, a_{t+T}]^\top$ is computed by minimizing the path integral of the variational free energy [41] over some future time horizon, similarly to model predictive control (see footnote 5):

$$\pi^{*} = \arg\min_{\pi} \sum_{\tau=1}^{T} D_{\mathrm{KL}}\big(q(s^{\dagger}_{t+\tau} \mid \pi) \,\|\, \tilde{p}(s^{\dagger})\big) = (U^\top L U + \lambda_a I)^{-1} U^\top L e \qquad (7)$$

Importantly, this optimization can be performed in closed form, by computing the discrepancy between the preferred state and the sequence of states that will unfold if no action is taken, and projecting this discrepancy ($e$) onto the pseudo-inverse of a matrix $U$ that maps actions within the planning horizon to their (cumulative) effects over that time window. $L = \mathrm{diag}(\tilde{\sigma}^{(k)-2})$ is a diagonal matrix containing the precisions (inverse variances) of the preference distributions, and $\lambda_a$ controls the strength with which actions are regularized (shrunk) towards a zero-mean prior. Further details are provided in Appendix 7.1.

3.3. Training

The perception module is trained using pre-generated 4-frame videos with sparse actions, to minimize a bespoke ELBO loss that drives the model to reconstruct video frames accurately while employing representations that are consistent with the dynamics model. Full details on the training procedure may be found in Appendix 6.3. The preference network is trained separately, in a self-supervised fashion, to learn the state-preference mapping; OBR does not copy any behavior from a teacher agent. The model is shown example task episodes where objects start in a random configuration with random velocities, and are simulated for a few frames without any goal-directed interference. In the final two frames, the objects are moved (by an oracle agent) to their target locations (with a final velocity of 0).
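Bringing an object to a target location with zero final velocity is also the kind of problem that the closed-form planner of Eq. (7) addresses. The sketch below works through that formula for a single object with one scalar position and one scalar velocity. It is an illustration under stated assumptions, not the authors' planner or the oracle agent: the action-to-latent map $D$ is taken to be the identity, noise is dropped, only the mean-dependent part of the KL term is kept (assuming the predicted covariance does not depend on the actions), actions are indexed from 0, and all function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def plan_actions(pos, vel, goal_pos, goal_vel, horizon,
                 prec_pos=1.0, prec_vel=1.0, lam_a=1e-3):
    """Return the plan (U^T L U + lam_a I)^-1 U^T L e for one scalar object."""
    U, e, prec = [], [], []
    for tau in range(1, horizon + 1):
        # Position row: action j changes the velocity from step j + 1 onwards,
        # so it is accumulated (tau - j) times into the position at step tau.
        U.append([float(tau - j) if j < tau else 0.0 for j in range(horizon)])
        e.append(goal_pos - (pos + tau * vel))  # passive position error
        prec.append(prec_pos)
        # Velocity row: every action taken before step tau adds once.
        U.append([1.0 if j < tau else 0.0 for j in range(horizon)])
        e.append(goal_vel - vel)                # passive velocity error
        prec.append(prec_vel)
    rows = range(len(U))
    A = [[sum(prec[r] * U[r][i] * U[r][j] for r in rows)
          + (lam_a if i == j else 0.0)
          for j in range(horizon)] for i in range(horizon)]
    b = [sum(prec[r] * U[r][i] * e[r] for r in rows) for i in range(horizon)]
    return solve_linear(A, b)


def rollout(pos, vel, actions):
    """Noise-free version of Eq. (3) with D taken as the identity."""
    states = []
    for a in actions:
        vel = vel + a
        pos = pos + vel
        states.append((pos, vel))
    return states


plan = plan_actions(pos=0.0, vel=0.0, goal_pos=1.0, goal_vel=0.0, horizon=4)
print(plan)
print(rollout(0.0, 0.0, plan))
```

Because the preference precisions enter only through the diagonal weights, lowering `prec_vel` relative to `prec_pos` in this sketch trades velocity accuracy for position accuracy, which mirrors the role of $L$ in Eq. (7).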
The perception module encodes each video frame in the latent space, and the preference network is trained to minimize the discrepancy (KL divergence) between its preference prediction for the "seed frames" and the representation of the target frame.

[Footnote 4: Note that in this notation, we do not condition the preference on the current state belief as above, as the planning procedure described here can be applied to any preference distribution.]

[Footnote 5: Note that the notation in this section omits the object index $k$ for legibility, as actions can be planned independently between objects (a strength of our approach).]

4. Results

We evaluated OBR qualitatively and quantitatively, focusing on its ability to perform conditional behavioural reasoning using visual cues (Figure 2a). That is, we evaluated the capacity of the agent to learn proto-symbolic rules that follow first-order logic. An exemplary conditional rule is "If there is a Half-torus move Boxes to the Top-Left and Cones to the Bottom-Left and Half-torus to Middle-Right, otherwise move Boxes and Cones to the Right" (see footnote 6). Furthermore, we analyzed its generalization over the number of objects and its adaptation to changes in the environment. As an additional result, although it is not the focus of this work, an analysis of world model learning (i.e., segmentation and video prediction) can be found in Appendix 6.1.1.

Figure 2: Conditional behavioural reasoning experiments. (a) 3D Active dSprites. The agent's visual input is the RGB image of the projected objects. Objects have first-order dynamics with friction, and the agent can apply forces. The behavioural reasoning is conditional on the objects present in the environment and the learnt rules. That is, the agent should move the objects differently depending on the proto-symbolic rules and the presence of objects.
(b) Types of conditional rules that the agent should be able to learn: simple and complex conditional rules, composition of two learnt rules, and advanced logic XOR rules.

Figure 2b details the types of rules that the agent can learn. It can learn conditional rules, such as $(A \to B) \land (\neg A \to C)$: "If there is a Half-torus move Boxes to the Top-left and Cones to the Center-left, otherwise move Boxes to the Top-right". It can also perform logical composition, $(A \to B) \land (A \to C) \vdash A \to (B \land C)$, by learning two different rules separately that then emerge combined during execution. Furthermore, it can learn XOR operations, such as $((A \lor B) \to C) \land ((A \land B) \to D)$. Once the agent learns the rules, reasoning transforms into a behavioural response that changes the environment towards its internal preference (Fig. 2a). Figure 2c shows an OBR agent solving several conditional reasoning instances with different numbers of objects in the scene.

4.1. Environment

We developed Active dSprites, an "activated" version of the various multi-dSprites datasets that have been used in previous work on object-based visual inference (e.g., [10,12]). Not only does it include (continuous) dynamics, but these dynamics can be acted on by an agent (through continuous control actions that accelerate the objects). Thus, active dSprites is an interactive environment rather than a dataset. We implemented two different scenarios in the active dSprites environment. In the first, objects are 2.5-D shapes (e.g., squares, ellipses and hearts): they have no depth dimension of their own, but can occlude each other within the depth dimension of the image. In the second, objects are 3D shapes (e.g., boxes, half-tori, cones) that can move in a 10 × 10 × 10 m cubic space. These objects are much more complex to learn than the 2D version, as they have lighting and shadows, producing color gradients that need to be encoded.
To randomize the environment, when an active dSprites instance is initialized, object shapes, positions, sizes and colors are all sampled uniformly at random. Initial velocities are drawn from a Normal distribution. Shape colors are sampled at discrete intervals spanning the full range of RGB colors, while the background color is drawn from a set of evenly spaced grayscale values between black and white. Shapes are presented in random depth order. In the 3D environment the lighting sources are fixed. Code for the active dSprites environment can be found at github.com/neuro-ai-robotics/OBR.

[Footnote 6: While we use words to describe the rule, the agent learns the rule through unsupervised learning, and there is no textual description of it.]

4.2. Problem complexity analysis through baselines comparison

We evaluated the problem-solving complexity of a basic conditional rule, "If Heart then move Boxes to the Right, otherwise move Ellipses to the Left", by comparing different oracle and baseline algorithms with our approach (Fig. 3). With full observability (i.e., with access to object locations), we evaluated Linear-Quadratic Regulator control [43,44] (implemented in the Python 3 control systems library), Soft Actor-Critic (SAC) [45] and Proximal Policy Optimization (PPO) [46] (both taken from the Stable-Baselines3 library [47]), and our OBR with perfect access to the true environmental state. With partial observability (i.e., the agent only has access to the RGB image pixels), we evaluated SAC, PPO and OBR. For the RL algorithms we gave dense rewards based on the object distance between the current state and the desired state. For OBR we only allow actions after frame 5. Here objects are 2.5-dimensional; during training the shapes, colors and positions are randomized, and velocities are sampled from a Normal distribution with mean 0.

Figure 3: Complexity analysis through baseline comparison.
Oracle algorithms have access to the objects' location and dynamics, and define the upper-bound performance. RL algorithms (PPO and SAC) are tested with full observability and partial observability (RGB image pixels). (a) Algorithm performance at each frame of execution for a learnt single conditional rule $(A \to B) \land (\neg A \to C)$. Colored lines are the Mean Squared Error (MSE) between the objects' location and the idealized goal location for 512 tested instances after training. Error bars are the standard deviation. (b) Average performance of the algorithms at the final frame. The lower the better.

Results are shown in Fig. 3. The left panel shows the task accuracy (Mean Squared Error, MSE) between the current object locations and the desired state in the projected image (see footnote 7). All methods with full observability can solve the task, with similar performance. When the input is restricted to RGB images (pixels), standard SAC and PPO cannot learn the conditional behavioural rule: their MSE increases over time. OBR, while not as accurate as the oracle, can learn the rule. The right panel shows the final performance of the compared algorithms. Oracle algorithms show the upper-bound performance, as it is constrained by the dynamics of the environment. These results first show, in line with previous works [21], that manipulating objects in the environment cannot be solved through non-object-centric approaches, and highlight an important weakness of current pixel-based RL baselines. Second, they show that a basic reasoning rule that implies interaction with the environment is already sufficiently complex. Thus, learning multiple rules and composition in this active dSprites synthetic environment is relevant.

4.2.1. Abstract behavioural reasoning with conditional rules

We evaluated the capacity of the architecture to learn proto-symbolic behavioural rules, and its generalization to different numbers of objects. We evaluated both 2D and 3D scenarios.
Figure 4 shows an OBR agent execution for an “IfHeart” conditional (A→B)∧(¬A→C) and the XOR rules, with two randomized scenarios for each rule. In Fig. 4a, in the first scenario, there is no heart, and boxes should be moved to the upper-left and ellipses to the middle-left.

Footnote 7: Note that this error is computed over the ground-truth goal, as we can use the environment information to obtain the true image projection of the goal.

[Figure 4 panels. (a, b) Rows: environment, imagined goal, true goal; phases: passive, active. (c, d) Task accuracy (MSE) against Time (frames); legend: # objects (2, 3, 4, 5).]

Figure 4: Behavioural reasoning with 2D objects. (a) Two instances of the active 2D dSprites solved by the OBR agent which has learnt the “IfHeart” conditional rule: (Heart → Squares to top-right ∧ Ellipses to middle-right) ∧ (¬Heart → Squares to top-left ∧ Ellipses to middle-left ∧ Hearts to down-left).
The Environment row shows the real scenario with the objects and the Imagined goal row shows the desired preference of the agent. The red arrows are the control actions applied to the objects in every frame. The agent is passive until the 5th frame and the preference network waits three frames to provide an output (this allows perception to be stable). The true goal is the idealized desired location of the objects. (b) Two instances of the active 2D dSprites solved by the OBR agent which has learnt the Heart XOR Square task. (c and d) Statistical evaluation of task accuracy for both tasks (MSE with respect to the idealized true goal) for different numbers of objects in the scene. Training was performed with 3 objects. Colored lines describe the MSE between ground-truth goals and actual object positions and velocities. Error bars reflect +/- 1 SEM.

In the second one, there is a heart present, and boxes, ellipses and hearts should be moved to the top-left, center-left and bottom-left, respectively. Thus, an object’s own shape determines its goal position along the vertical axis, while the presence of a heart anywhere in the scene determines all objects’ goal positions along the horizontal axis. The agent is passive until the 5th frame. The environment row shows the visual input and the red arrows describe the forces applied. The imagined goal row shows the output from the preference network at each frame. Note that the preference is encoded only in the latent representation; here, for visualization purposes, we show its reconstruction using the decoder network. We let the agent perform three frames of perceptual inference before activating the preference network to infer the current goal. The true goal is the idealized solution of this “IfHeart” task. Analogously, in Fig. 4b, the OBR agent, which has learnt the Heart XOR Square task, solves two 2D active dSprites scenarios.
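The goal structure of the IfHeart rule described above (shape sets the vertical target, heart presence sets the horizontal side) can be sketched in a few lines. This is only an illustration and not the authors' code: the function `if_heart_goals` and its arguments are hypothetical names, the coordinates are borrowed from the training procedure in the appendix (vertical targets 0.2, 0.5 and 0.8 for squares, ellipses and hearts; horizontal targets 0.2 for left and 0.8 for right), and the side chosen for each condition is passed in explicitly rather than assumed. Note also that OBR learns these goals in latent space without supervision; the explicit lookup below only describes the idealized "true goal".

```python
from typing import Dict, List, Sequence, Tuple

# Target coordinates taken from the training-procedure description in the
# appendix; 0 and 1 are the edges of the frame.
VERTICAL_TARGET: Dict[str, float] = {"square": 0.2, "ellipse": 0.5, "heart": 0.8}
HORIZONTAL_TARGET: Dict[str, float] = {"left": 0.2, "right": 0.8}


def if_heart_goals(
    shapes: Sequence[str],
    side_with_heart: str,
    side_without_heart: str,
) -> List[Tuple[float, float]]:
    """Return idealized (x, y) goal positions for every object in the scene.

    The horizontal target is shared by all objects and depends only on
    whether any heart is present; the vertical target depends on each
    object's own shape. Which side corresponds to which condition is an
    argument here (hypothetical interface), not a claim about the paper.
    """
    side = side_with_heart if "heart" in shapes else side_without_heart
    x = HORIZONTAL_TARGET[side]
    return [(x, VERTICAL_TARGET[shape]) for shape in shapes]


# Example: the same rule applied to a scene without and with a heart.
no_heart = if_heart_goals(["square", "ellipse"], "right", "left")
with_heart = if_heart_goals(["square", "heart"], "right", "left")
```

Adding or removing a single heart flips the horizontal target of every object at once, which is why the rule cannot be solved by treating objects independently.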
Figure 4c,d shows the statistical performance of OBR for different numbers of objects for the two conditional rules. The network is trained with three objects and then evaluated on two, three, four and five objects in the scene. Again, the agent is passive until the fifth frame. The error curves show that OBR interprets and manipulates the environment using the learnt conditional rule.

Figure 5 shows a similar analysis of OBR in a 3D environment. The 3D scenario is considerably more complex than the 2D one, as objects have an extra dimension to move in and colors are affected by lighting. Here, the conditional rule learnt depends on the appearance of a HalfTorus object. The OBR preference network output is also decoded to show its visual interpretation. Notably, the second instance shows that OBR can deal both with the absence of an object type and with multiple instances of the same type in the scene: there are two HalfTorus objects and no Cones in the bottom instance of Fig. 5a. Figure 5b shows the statistical performance of OBR for different numbers of objects. The network is again trained with three objects and then evaluated on two, three, four and five objects in the scene. The agent is passive until the fifth frame. The error curves (tending towards zero) show that OBR interprets and manipulates the environment using the learnt conditional rule. Figure 5c shows 16 randomized OBR agent executions in 3D for 2, 3, 4 and 5 objects in the scene. For completeness, several randomized executions in 3D with different rules can be inspected as animations in the GitHub repository.

4.2.2. Adaptation to changes in the environment

We evaluated the capacity of the OBR agent to recover from unexpected changes in the environment. For that purpose, we generated instances of the active dSprites environment that perform an object substitution in the middle of the execution. For instance, a Heart is replaced by a Square during a conditional rule execution. Figure 6a describes two instances of the adaptation environment and Fig.
6b the statistical performance over 512 randomized instances. The object substitution experiment tests the perception module, which needs to adapt quickly to estimate the new situation; the preference network, which has to adapt the imagined goal depending on the objects that appear in the scene; and the control module, which should provide the correct continuous actions to correct for the previously generated object movements. The true goal describes the idealized desired object locations after object substitution. The task error shows a decrease in performance until the system is able to recover and generate the proper behaviour to fulfill the conditional rule.

4.2.3. Logical composition emergence

We further evaluated the capacity of the agent to perform logical composition (A→B)∧(A→C)⊢A→(B∧C), as described in Fig. 7. We trained the preference reasoning module on two rules separately and then evaluated, in the testing phase, the capacity of the agent to combine both. For instance, Rule 1: If Heart, move Squares to Top-left, otherwise to Middle-right; and Rule 2: If Heart, move Ellipses to Middle-left, otherwise to Middle-right. In the testing phase, when there are both squares and ellipses, the agent should compose both rules into: If Heart, move Squares to Top-left and Ellipses to Middle-left, otherwise to Middle-right. Fig. 7 shows the task performance, computed as the MSE to the idealized desired location of the objects, as if the two rules were being computed. The OBR agent is passive until the 5th frame. The plot shows that the actions performed on the objects actually solve the task, confirming that the agent’s reasoning combines both rules.

Figure 5: Behavioural reasoning with 3D objects. (a) Two instances of the active 3D dSprites solved by the OBR agent which has learnt the “IfHalfTorus” conditional rule: (HalfTorus → Boxes to top-right-middle ∧ Cones to middle-right-middle) ∧ (¬HalfTorus → Boxes to top-left-front ∧ Cones to middle-left-front ∧ HalfTorus to down-left-front). Note that in the second instance there are only Boxes and HalfTorus objects and no Cones, but OBR solves the conditional task without trouble. The Environment row shows the real scenario with the 3D objects and the Imagined goal row shows the desired preference of the agent. The red arrows are the control actions applied to the objects in every frame. The agent is passive until the 5th frame and the preference network waits three frames to provide an output (this allows perception to be stable). The true goal is the idealized desired location of the objects. (b) Statistical evaluation of the task accuracy (MSE with respect to the idealized true goal) for different numbers of objects in the scene. Blue lines describe the mean and the error bars denote the standard deviation. (c) Instances of the OBR agent’s sequential behaviour with an IfHalfTorus rule with different numbers of objects in the scene. The goal represents the ideal location of the objects given the rule. The desired goals of the agent (preferred internal state) are not shown and are learnt through unsupervised learning using the preference network.

[Figure 6 panels. (a) Rows: environment, imagined goal, true goal. (b) Task accuracy (error) against Time (frames); markers: true goal, substitution.]

Figure 6: Adaptation. (a) Two instances of the active dSprites solved by the OBR agent which has learnt the “IfHeart” conditional rule: (Heart → Squares to top-right ∧ Ellipses to middle-right) ∧ (¬Heart → Squares to top-left ∧ Ellipses to middle-left ∧ Hearts to down-left). At the 4th frame one of the objects (in this case the heart) is substituted by a square, forcing a change not only in the object-centric perception but also in the conditional behaviour, as the resulting reasoning depends on the appearance of the Heart. (b) Statistical performance evaluation of 512 instances of the object substitution experiment. The blue line is the median and error bars show the inter-quartile range, which is used because failures skew the data.

Figure 7: Logical composition. The OBR agent’s preference reasoning module is trained with two distinct rules and then tested in execution to see whether the composed rule emerges. Task accuracy over 512 instances shows that OBR is combining both rules in the test phase.

5. Discussion

We described a brain-inspired deep learning architecture that allows grounded conditional abstract reasoning and behaviour by leveraging object-centric learned representations as the units of “mental” manipulation. Both perceptual and proto-symbolic goals were learnt in an unsupervised fashion from visual information (pixels).
Results show that the system can learn and resolve conditional behaviours similar to first-order logic: simple A→B, complex (A→B∧C)∧(¬A→D∧E), XOR operation (A∨B→C, A∧B→D) and logical composition (A→B, A→C ⊢ A→(B∧C)). We analysed the model’s generalization to different numbers of objects (both in 2D and 3D), to object properties (e.g., color, shape) and, critically, its adaptation to changes in the environment (e.g., object substitution), which is enabled by the iterative inference and the preference network. OBR shows a promising direction in object-based behavioural reasoning from pixels, chiefly by establishing grounded object-centric representations as a key inductive bias during learning, perception, reasoning and control.

5.1. Object-centric representations as symbols

We showed that connectionist representation learning with inductive biases, in particular object-centric representations, can work as a bridge between symbols and behaviour. Because objects are entities that live in both the real world and the internal representation of the agent, we can learn trajectories in the belief space that produce first-order-logic-like behaviours, where variables are substituted by object representations. Logical operators are encoded by learning the latent desired states (or preferences) of the agent and how to traverse the latent manifold to go from the current state to the desired state using the generative model.

5.2. Grounded protosymbols and preferences in the latent space

A key feature of OBR is that it learns grounded representations and manipulates them in the latent manifold, and that it also generates the agent’s preferences (goals) in the object-centric latent representation. The agent’s internal goals are tied to the world and the actions that it can perform in it. This brings benefits but also restrictions, as it forces the system to perform embodied reasoning.
Learned protosymbolic rules depend on the potentially learnable object features and how the agent can act on them (see latent space traversals in Appendix 8). For instance, the agent cannot learn to change the color of an object if all objects are in grayscale or if there is no policy that can cause the desired effect. As a consequence, conditional behaviours that cannot be interpreted in the object feature space cannot be properly resolved in execution. On the positive side, goal-conditioned behaviours are always grounded and therefore the gap between reasoning and control vanishes.

5.3. Shortcomings and future work

Following the brain inspiration, we employed an iterative inference procedure in order to benefit from postdictive inference (“smoothing”) in a sliding window approach. The resulting architecture is limited by the accuracy of the perception module. Thus, while this work focused on a proof of principle of the advantage of object-centric representations in behavioural reasoning, more powerful and sophisticated architectures (e.g. vision transformers) can be used to improve perceptual inference and scale to naturalistic images. Second, in the experiments, the effect of the real actions on the latent representation is assumed to be linear. While this could fit a hierarchical interpretation of cognition, it prevents the proper handling of complex non-linear dynamics. While OBR can recover from collisions, it cannot plan ahead with them. Further work should investigate complex collisions, for instance, by incorporating other inductive biases, such as interaction networks [16] or sparsely changing latent representations to encode relevant events [48]. Third, the achieved agent behaviours are much less complex than those achieved by current SOTA LLM/VLM approaches such as RT architectures [49]. OBR provides, however, goal-conditioned behavioural reasoning, which is a major challenge in the field [50].
Increased expressivity of the reasoning should be investigated, for instance, by using OBR as the interface between symbolic (e.g., graph-based, language) and subsymbolic representations. Finally, and highly relevant for operationalizing this approach in robotics applications, the agent needs to include the body restrictions of performing an action in the world [51,29]. This can be done by replacing the action field mapping with a robotic arm controller.

Acknowledgments

This work has been funded by the SPIKEFERENCE project, co-funded by the Human Brain Project (HBP) Specific Grant Agreement 3 (ID: 945539).

6. Appendix

6.1. Network architecture

OBR’s perceptual inference network consists of two separate Iterative Amortized Inference (IAI) modules, each of which in turn contains a refinement and a decoder module. The first IAI module concerns the inference of the state beliefs $q(\{s^{\dagger(k)}\})$; we term this the perceptual inference module. The second IAI module infers the object action beliefs $q(\{a^{(k)}\})$, and we refer to this as the action inference module. In addition, OBR comprises a preference network that maps the current latent state beliefs to a predicted preference distribution over future object states. Code for the OBR architecture can be found at github.com/neuro-ai-robotics/OBR.

6.1.1. Perceptual inference module

This module used a latent dimension of 16. Note that, in the output of the refinement network, this number is doubled once as each latent belief is encoded by a mean and variance, and then doubled again as we represent (and infer) both the states and their first-order derivatives. In the decoder, the latent dimension is doubled only once, as the state derivatives do not enter into the reconstruction of a video frame.
As in [10], we use a spatial broadcast decoder, meaning that the latent beliefs are copied along a spatial grid with the same dimensions as a video frame, and each latent vector is concatenated with the (x, y) coordinate of its grid location, before passing through a stack of transposed convolution layers. Decoder and refinement network architectures are summarized in the tables below. The refinement network takes in 16 image-sized inputs, which are identical to those used in [10], except that we omit the leave-one-out likelihoods. Vector-sized inputs join the network after the convolutional stage (which processes only the image-sized inputs), and consist of the variational parameters and (stochastic estimates of) their gradients.

Decoder (states)

Type         Size/#Chan.   Act. func.           Comment
Input (λ)    32
Broadcast    34                                 Appends coordinate channels
ConvT 5×5    32            ELU
ConvT 5×5    32            ELU
ConvT 5×5    32            ELU
ConvT 5×5    32            ELU
ConvT 5×5    4             Sigmoid or Softmax   Outputs RGB + mask

Refinement network (states)

Type                      Size/#Chan.   Act. func.   Comment
Linear                    64
LSTM                      128           tanh
Concat [..., λ, ∇_λ L]    256                        Appends vector-sized inputs
Linear                    128           ELU
Flatten                   800
Conv 5×5                  32            ELU
Conv 5×5                  32            ELU
Conv 5×5                  32            ELU
Inputs                    16

Segmentation and prediction. Figure 8 describes the internal mechanism of the perceptual module, with the attentive mask segmentation and the prediction of objects in the environment.

Figure 8: Object-centric segmentation and prediction. (a) Observation model. (b) Three examples of attentive masks output by the OBR perceptual module with an IAI backbone. (c) Dynamics model. (d) Two examples of the perception module predicting the dynamics of the environment ahead in time.

Prediction robustness. To enable action planning across episodes of non-trivial length, it is important that OBR’s perceptual inference and predictive abilities remain stable across longer time windows.
In addition, a great benefit of OBR’s slot-based architecture is that object slots can be added or removed at will, without having to learn additional connection weights [36]. However, it is not a given that performance will be robust to such variations. Here, we test the robustness of OBR’s world model, as measured by segmentation and reconstruction accuracy, to both of these dimensions, namely video length and number of objects (Fig. 9). Having been trained on 4-frame videos with 3 objects, OBR generalizes well to longer videos with fewer or more objects present in the scene. A slight drop-off in performance towards later frames in the 12-frame testing videos is easily remedied by a very short bout (3 epochs) of additional training with an additional set of longer (also 12-frame) training videos. Generalization to different numbers of objects is characterized by a slight drop in performance as the number increases, but this might also be (partly) attributed to increased complexity of the resulting images.

[Figure 9 panels: “No extra training” (left) and “Minimal stability training” (right). Vertical axes: Segmentation accuracy (F-ARI) and Reconstruction error (MSE; × 10^-3). Horizontal axis: Time (frames). Legend: # objects (2, 3, 4, 5).]

Figure 9: Robustness to variations in video length and number of objects. Segmentation accuracy (F-ARI score) and reconstruction error (MSE) across video frames that extend beyond the duration of the training videos (4 frames), for scenes with various numbers of objects (training videos contained 3 objects). Left panels show the performance of the network after training on 4-frame videos only. Right panels show the performance after just 3 epochs of additional training with 12-frame videos. Error bars depict the mean +/- 1 SEM.

6.1.2. Action inference module

A key challenge in multi-object environments, such as the one evaluated here, is that an agent’s internal representation of objects has an unknown (and possibly imperfect) correspondence to the objects in the environment. Even if the object features are inferred with perfect accuracy, the order of the objects in the representation is arbitrary. To solve this correspondence problem, we define an action space in which accelerations can be placed at pixel locations within the environment’s image grid, and objects receive the sum of all accelerations that coincide with their visible pixels. Specifically, we introduce the notion of an action field $\Psi = [\psi_1, \ldots, \psi_M]^\top$: an [M×2] matrix (with M the number of pixels in an image or video frame), such that the i-th row in this matrix ($\psi_i$) specifies the (x,y,z)-acceleration applied at pixel i. The action on the k-th object is then given by:

$$a^{(k)}_t = \sum_i [m_i = k]\,\psi_i \qquad (8)$$

where $m_i$ is a categorical variable that indicates which object pixel i belongs to (see footnote 8).
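To make equation 8 concrete, the sketch below sums an action field over the pixels assigned to each object. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `object_actions` is hypothetical, the image is flattened to a list of M pixels, the mask holds one hard object index per pixel, and accelerations are 2D tuples.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, ...]


def object_actions(
    mask: Sequence[int],
    action_field: Sequence[Vec],
    num_objects: int,
) -> List[Vec]:
    """Aggregate per-pixel accelerations into one action per object (eq. 8).

    mask[i] is the index m_i of the object that pixel i belongs to, and
    action_field[i] is the acceleration psi_i placed at pixel i.
    """
    if len(mask) != len(action_field):
        raise ValueError("mask and action_field must cover the same pixels")
    dim = len(action_field[0]) if action_field else 0
    totals = [[0.0] * dim for _ in range(num_objects)]
    for m_i, psi_i in zip(mask, action_field):
        # Iverson bracket [m_i = k]: pixel i contributes only to object m_i.
        for d in range(dim):
            totals[m_i][d] += psi_i[d]
    return [tuple(total) for total in totals]


# Six pixels, three object slots; most pixels carry no acceleration.
example_mask = [0, 0, 1, 1, 2, 0]
example_field = [
    (1.0, 1.0), (0.0, 0.0), (0.5, 0.0),
    (0.25, -1.0), (0.0, 2.0), (0.0, 0.0),
]
print(object_actions(example_mask, example_field, 3))
# prints [(1.0, 1.0), (0.75, -1.0), (0.0, 2.0)]
```

Because the sum runs over pixels rather than over slot indices, the result does not depend on the arbitrary ordering of slots in the agent's internal representation: an acceleration placed on a visible pixel reaches whichever object owns that pixel.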
This definition of actions in pixel space provides an unambiguous interface for the agent to interact with its environment. The action inference module does not incorporate a decoder network, as the quality of the action beliefs is computed by evaluating equation 8 and plugging this into the ELBO loss from equation 4. While this requires some additional sampling operations (see Appendix 7), no neural network is required for this. This module does include a (shallow) refinement network, which is summarized in the table below. This network takes as input the current variational parameters $\lambda_{a^{(k)}}$ (2 means and 2 variances), their gradients, and the ‘expected object action’, $\sum_i \hat{m}_{ik}\psi_i$.

Refinement network (actions)

Type      Size/#Chan.   Act. func.   Comment
Linear    4
LSTM      32            tanh
Inputs    10

6.1.3. Preference network

The preference network consists of encoder and decoder networks that operate on single object representations, as well as a context module that aggregates object embeddings produced by the encoder into a single context vector, which is then broadcast and appended to the object embeddings that are fed into the decoder. Details for the two-layer encoder and decoder networks are listed below. The context module consists of a single Linear layer with input and output sizes both equal to 64, and no activation function.

Encoder

Type      Size/#Chan.   Act. func.   Comment
Linear    64            ELU
Linear    64            ELU
Inputs    64

Decoder (preferences)

Type      Size/#Chan.   Act. func.   Comment
Linear    64            ELU
Linear    64            ELU
Inputs    128

Footnote 8: Note the use of Iverson-bracket notation; the bracket term is binary and evaluates to 1 iff the expression inside the brackets is true.

6.2. Inference procedure

Inference is performed in a recurrent algorithm that loops through the decoder and refinement subnetworks. For a single video frame, the inference starts by initializing the object-wise latent beliefs.
For the first frame in an episode, beliefs are initialized to a learned default vector of variational parameters $\lambda_0$. For subsequent frames, beliefs are initialized by extrapolating the inferred dynamics from the previous frame. From each belief $q(s^{\dagger(k)})$, we then (as in [10]) sample a state vector (using the reparameterization trick) and run this through the decoder to obtain, for each object, a predicted sub-image $o^{(k)}$ and segmentation logits $\hat{m}^{(k)}$. These decoder outputs are then fed, separately for each object, into the refinement subnetwork, along with additional inputs. Image-sized inputs are fed into the bottom convolutional layer, while vector-sized inputs are appended to the output of the final convolutional layer and processed by the final LSTM and Linear layers. We use the same selection of auxiliary inputs as [10], except that we omit the leave-one-out likelihoods.

Inference across multiple video frames is performed in a sliding window. This window slides “into” and “out of” the full video or episode. That is, the first window includes only the first video frame, which is processed for L iterations, before moving the inference window ahead by one frame. Thus, it takes L×F inference steps before the window reaches its full extent, comprising F frames. The reverse happens when we reach the end of an episode or video. The lagging end of the window keeps advancing until there is only a single frame left within the window, which gets processed for a final L iterations. Thus, each video frame gets processed for exactly L×F inference steps.

6.3. Training procedure

The above network architecture was trained on pre-generated experience with the active dSprites environment. The main training set, used to train the world model, comprised 50,000 videos of 4 frames each. An additional validation set of 10,000 videos was sampled identically to, and independently of, the training set. Each video was generated as follows.
First, an instance of the active dSprites environment was randomly initialized, with three objects. Object shapes (square, ellipse or heart) were drawn uniformly. Colors were uniformly sampled from 5 evenly spaced values, independently for the R, G and B channels (a uniform grayscale background color was sampled from the same 5 intensity values). Orientations were sampled uniformly from the interval [0, 2π]. Object sizes were sampled uniformly from [1/6, 1/3], where 1 is the size of the image frame. Positions were sampled uniformly from the interval [0.2, 0.8]^2 (where 0 and 1 are the edges of the frame), while velocities were drawn from a Normal distribution with mean 0 and standard deviation 0.0625. Video frames were then generated by simulating 4 frames of these objects’ dynamics. Between the 2nd and 3rd frames, a single action (acceleration) was randomly sampled for each object, from a Normal distribution with mean 0 and s.d. 0.0625, and placed at a random pixel location within the object’s segmentation mask. If an object happened to be completely invisible (e.g. because it was fully occluded by another), then it received no action. An additional action was sampled for a random background pixel, to encourage the model to learn to accurately segment the background into a separate object slot (rather than grouping the background with one of the objects). Training with these 4-frame videos was followed by a brief period of training with longer videos (12 frames, with actions after every second frame) that were otherwise identically sampled.

Additional sets of 50,000 training videos (one for each task, as well as accompanying sets of 10,000 identically and independently sampled validation videos) were used to train OBR’s preference network. Each of these videos was 8 frames long and generated as follows. First, an instance of the active dSprites environment was randomly initialized as before (but with a different procedure for sampling shapes; see below).
Then, 6 frames of this environment were simulated with random actions inserted before the 3rd and 5th frames. To prevent objects from leaving the frame (which is likely in longer videos and becomes problematic in this setting), we additionally placed actions as needed to prevent this from happening (specifically, if an object was on course to move to a coordinate outside the range [0.05, 0.95]^2 in the next frame, an action was generated that resulted in the object appearing to “bounce” against an invisible wall instead). Before the 7th and 8th frames, actions were generated that brought the objects to their target positions and velocities within the context of a task (see below). The first of these actions brought the objects to their target positions, while the second decelerated them to 0 velocity. Thus, in the final frame (and only then), the objects reached their target configuration. Each frame was encoded (using the normal inference procedure) as a set of latent beliefs in OBR’s latent space, and the preference network was then tasked to map the representation of the first 7 “seed frames” to that of the final target frame, by minimizing the following KL-divergence loss:

$$\mathcal{L}_{\text{pref}} = \sum_{t=1}^{T-1} \sum_k D_{\mathrm{KL}}\left( \tilde{p}_t\left(s^{\dagger(k)}\right) \,\middle\|\, q\left(s^{\dagger(k)}_T\right) \right) \qquad (9)$$

$$\tilde{p}_t\left(s^{\dagger(k)}\right) = \mathcal{N}\left(s^{\dagger(k)};\ \tilde{\mu}^{(k)}(\lambda_t),\ \tilde{\sigma}^{(k)}(\lambda_t)^2\right) \qquad (10)$$

where $\lambda_t$ are the parameters of the variational beliefs for frame t, and we make explicit the fact that the parameters (means and variances) of the predicted preference distribution are a function (via the preference network) of these beliefs about prior frames.

Object target positions depended on the task. Specifically, the vertical target coordinates of any square, ellipse and heart shapes in the scene were fixed to 0.2, 0.5 and 0.8, respectively. If, under the task, objects were to go to the left of the frame, their target horizontal coordinate was 0.2; if they were to go to the right, it was 0.8. Object shapes were pseudo-randomized to evenly sample task conditions.
In the IfHeart task, we ensured that training episodes had a 50% probability of containing at least one heart. In the HeartXORSquare task, we sampled four conditions with equal probability: (1) there being neither a heart nor a square in the scene; (2) at least one heart but no squares present; (3) at least one square but no hearts present; and (4) at least one heart and one square present in the scene. Outside of these restrictions, shapes were sampled randomly.

Training was performed using the ADAM optimizer [52] with default parameters and an initial learning rate of $3\times10^{-4}$. This learning rate was reduced automatically by a factor of 3 whenever the validation loss had not decreased in the last 10 training epochs, down to a minimum learning rate of $3\times10^{-5}$. Training was performed with a batch size of 64 (16×4 GPUs). OBR’s world model was deemed to have converged after 241 epochs, which required approximately 24 hours to train on 4 Nvidia A100 GPUs. Preference networks for the IfHeart and HeartXORSquare tasks were trained for 90 and 120 epochs, respectively (18 and 24 hours).

6.3.1. ELBO loss

OBR optimizes the following (weighted) ELBO loss for both learning and inference:

$$\mathcal{L} = -\sum_{t=0}^{T} \Bigg[ \underbrace{H\left(q\left(\{s^{\dagger(k)}_t, a^{(k)}_t\}\right)\right)}_{\text{Complexity}} + \underbrace{\beta\, E_{q(\{s^{(k)}_t\})}\left[\log p\left(o_t \mid \{s^{(k)}_t\}\right)\right]}_{\text{Reconstruction accuracy}} + \underbrace{\sum_k E_{q(a^{(k)}_t)}\left[\log p\left(a^{(k)}_t \mid \Psi_t\right)\right]}_{\text{Action inference accuracy}} + \underbrace{\sum_k E_{q\left(s^{\dagger(k)}_t, s^{\dagger(k)}_{t-1}, a^{(k)}_{t-1}\right)}\left[\log p\left(s^{\dagger(k)}_t \mid s^{\dagger(k)}_{t-1}, a^{(k)}_{t-1}\right)\right]}_{\text{Temporal consistency}} \Bigg] \qquad (11)$$

where $H(\cdot)$ denotes entropy. Similar to previous work (e.g. [10]) we up-weight the reconstruction accuracy term in this loss by a factor β. We train the network to minimize not just the loss at the end of the inference iterations through the network, but a composite loss that also includes the loss after earlier iterations. Let $\mathcal{L}^{(n)}_\beta$ be the loss after n inference iterations; then the composite loss is given by:

$$\mathcal{L}_{\text{comp}} = \sum_{n=1}^{N_{\text{iter}}} \frac{n}{N_{\text{iter}}}\, \mathcal{L}^{(n)}_\beta \qquad (12)$$

6.3.2.
Hyperparameters OBR includes a total of 4 hyperparameters: (1) the loss-reweighting coe fficientβ(see above); (2) the variance of the pixels around their predicted values, σ2 o; (3) the variance of the noise in the latent space dynamics, σ2 s; and (4) the variance of the noise in the object actions, σ2 ψ. The results described in the current work were achieved with the following settings: 18 Page 19: Param. Value β 5.0 σo 0.3 σs 0.1 σψ 0.3 7. Computing Eq(a(k))[logp(a(k)|Ψ)] The expectation under q(a(k)) of logp(a(k)|Ψ), which appears in the ELBO loss (eq. 4), cannot be computed in closed form, because the latter log probability requires us to marginalize over all possible configurations of the pixel-to-object assignments, and to do so inside of the logarithm. That is: logp(a(k)|Ψ)=logX mp(a(k)|Ψ,m)p(m|{s(k)}) (13) =log Ep(m|{s(k)})[p(a(k)|Ψ,m)] (14) However, note that within the ELBO loss, we want to maximize the expected value of this quantity (as its negative appears in the ELBO, which we want to minimize). From Jensen’s inequality, we have: Ep(m|{s(k)})[logp(a(k)|Ψ,m)]≤log Ep(m|{s(k)})[p(a(k)|Ψ,m)] (15) Therefore, the l.h.s. of this equation provides a lower bound on the quantity we want to maximize. Thus, we can approximate our goal by maximizing this lower bound instead. This is convenient, because this lower bound, and its expectation under q(a(k)) can be approximated through sampling: Eq(a(k))h Ep(m|{s(k)})[logp(a(k)|Ψ,m)]i ≈1 NsamplesX jlogp(a(k)∗ j|Ψ,m∗ j) (16) =1 NsamplesX jlogNa(k)∗ j;X iˆm∗(i) jkψi,σ2 ψI (17) ˆm∗(i) j∼p(mi|{s(k)}),a(k)∗ j∼q(a(k)),s(k)∗ j∼q(s(k)) (18) where we slightly abuse notation in the sampling of the pixel assignments, as a vector is sampled from a distribution over a categorical variable. The reason this results in a vector is because this sampling step uses the Gumbel-Softmax trick [ 53], which is a di fferentiable method for sampling categorical variables as ”approximately one-hot” vectors. 
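To make the sampling step concrete, the sketch below draws relaxed pixel assignments with the Gumbel-Softmax trick and evaluates the Gaussian log-likelihood of eq. 17 for a single sample. It is an illustrative NumPy version only, so it shows the forward computation and not the gradients; the function names, the array shapes, and the representation of $\Psi$ as one action vector $\psi_i$ per pixel are assumptions made for this example.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw relaxed ("approximately one-hot") categorical samples.

    logits: (n_pixels, n_objects) unnormalized log-probabilities standing in
            for p(m_i | {s}) (assumed shape).
    tau:    temperature; smaller values give samples closer to one-hot.
    """
    u = np.clip(rng.random(logits.shape), 1e-12, 1.0 - 1e-12)
    gumbel_noise = -np.log(-np.log(u))
    y = (logits + gumbel_noise) / tau
    y = y - y.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def action_log_likelihood(a_k, m_hat_k, psi, sigma_psi):
    """log N(a_k; sum_i m_hat_k[i] * psi[i], sigma_psi^2 I), as in eq. 17.

    a_k:     (action_dim,) sampled action of object k.
    m_hat_k: (n_pixels,) soft assignment of each pixel to object k.
    psi:     (n_pixels, action_dim) per-pixel action field (assumed layout).
    """
    mean = m_hat_k @ psi
    d = a_k.shape[0]
    sq_err = np.sum((a_k - mean) ** 2)
    return -0.5 * sq_err / sigma_psi**2 - 0.5 * d * np.log(2.0 * np.pi * sigma_psi**2)

# Single-sample estimate on toy inputs (all sizes are arbitrary).
rng = np.random.default_rng(0)
logits = rng.normal(size=(64 * 64, 4))   # toy assignment logits for 4 object slots
psi = rng.normal(size=(64 * 64, 2))      # toy per-pixel action field
m_hat = gumbel_softmax(logits, tau=0.5, rng=rng)
a_k = rng.normal(size=2)                 # stands in for a sample from q(a^(k))
estimate = action_log_likelihood(a_k, m_hat[:, 0], psi, sigma_psi=0.3)
```

Lowering `tau` pushes each row of `m_hat` towards a one-hot vector, at the cost of higher-variance gradients in a differentiable implementation.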
Thus, for every pixel $i$, we sample a vector $\hat{m}_j^{*(i)}$, such that the $k$-th entry of this vector, $\hat{m}_{jk}^{*(i)}$, denotes the "soft-binary" condition of whether pixel $i$ belongs to object $k$. In practice, we use $N_{\text{samples}} = 1$, based on the intuition that this will still yield a good approximation over many training instances, and that we rely on the refinement network to learn to infer good beliefs. The Gumbel-Softmax sampling method depends on a temperature $\tau$, which we gradually reduce across training epochs, so that the samples gradually better approximate the ideal one-hot vectors.

It is worth noting that, as the entropy of $p(m \mid \{s^{(k)}\})$ decreases (i.e. as object slots "become more certain" about which pixels are theirs), the bound in equation 15 becomes tighter. In the limit as the entropy goes to 0, the network is perfectly certain about the pixel assignments, and so the distribution collapses to a point mass. The expectation then becomes trivial, and so the two sides of eq. 15 become equal. Sampling the pixel assignments is equally trivial in this case, as the distribution has collapsed to permit only a single value for each assignment. In short, at this extreme point, the procedure becomes entirely deterministic. In our data, we typically observe very low entropy for $p(m \mid \{s^{(k)}\})$, and so we likely operate in a regime close to the deterministic one, where the approximation is very accurate.

7.1. Planning

OBR's linearized dynamics model allows us to find an optimal action plan $\pi^*$ (minimizing equation 7) in closed form, as follows:

$$d^{(T)} = \begin{bmatrix} 1 \\ 2 \\ \vdots \\ T \end{bmatrix}, \qquad \Omega_d^{(T)} = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \\ 2 & 1 & \cdots & 0 & 0 \\ 1 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ T-1 & T-2 & \cdots & 1 & 0 \\ 1 & 1 & \cdots & 1 & 0 \\ T & T-1 & \cdots & 2 & 1 \\ 1 & 1 & \cdots & 1 & 1 \end{bmatrix} \tag{19}$$

$$U = \big(\Omega_d^{(T)} \otimes D\big), \qquad L = \operatorname{diag}(\mathbf{1} \otimes \tilde{\sigma})^{-2} \tag{20}$$

$$\pi^* = \big(U^\top L U + \lambda_a I\big)^{-1} U^\top L \left( \mathbf{1} \otimes \big(\tilde{\mu} - \mu_{s_t^\dagger}\big) - d^{(T)} \otimes \begin{bmatrix} \mu_{s'_t} \\ 0 \end{bmatrix} \right) \tag{21}$$

where $\mathbf{1}$ denotes a column vector of 1s of length $T$, $\otimes$ denotes the Kronecker product, and $\mu_{s_t^\dagger}$ and $\mu_{s'_t}$ denote, respectively, the means of the variational beliefs about the full object state in generalized coordinates ($s^\dagger$), and the means of the beliefs about the derivatives ($s'$) only. In words, $U$ is a matrix that maps actions within the planning horizon to their (cumulative) effects over that time window. These effects are translated to the network's latent space, through the multiplication by $D$ in equation 20. The optimal actions are obtained by projecting the current (precision-weighted) error (i.e., the discrepancy between the desired state and the sequence of states that will unfold if no action is taken) onto the (precision-weighted) pseudoinverse of $U$. This projection is optionally shrunk towards a zero-mean Gaussian prior over actions, with precision $\lambda_a$, to regularize the scale of the actions. In the experiments reported in this work, we used a planning horizon of $T = 3$ and an action regularization strength $\lambda_a = 0.1$. We also set $L = I$, as we empirically find this leads to better and more stable performance.

8. Latent space traversals

OBR is trained without supervision, and so the structure of its latent representational space is a priori unknown. Here, we explore this space by systematically traversing it one dimension at a time. We first let OBR encode an image into its latent space. Subsequently, for each object representation, we increment the target dimension's encoded (mean) value by each of 20 evenly spaced values in the range $[-1, 1]$. We then feed the resulting, perturbed latent vectors through OBR's decoder to obtain 20 images that vary systematically (and for all objects) along the target dimension. The images in Fig. 10 show the results of this procedure for three different scenes, for each of the 16 latent dimensions of the trained OBR model.
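The traversal procedure described above can be sketched as follows. The `decode` callable stands in for OBR's decoder and, like the function name and the array shapes, is a hypothetical placeholder rather than part of the published code.

```python
import numpy as np

def traverse_dimension(latent_means, dim, decode, n_steps=20, lo=-1.0, hi=1.0):
    """Decode images while shifting one latent dimension for all objects.

    latent_means: (K, D) encoded mean per object slot (assumed shape).
    dim:          index of the latent dimension to traverse.
    decode:       callable mapping a (K, D) array to an image (placeholder
                  for the trained decoder).
    """
    offsets = np.linspace(lo, hi, n_steps)  # 20 evenly spaced increments by default
    images = []
    for delta in offsets:
        perturbed = latent_means.copy()     # leave the original encoding untouched
        perturbed[:, dim] += delta          # same increment applied to every object
        images.append(decode(perturbed))
    return np.stack(images)
```

Calling this once per latent dimension (16 in the trained model) yields one row of 20 images per dimension for a given scene.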
Of particular note are dimensions 3 and 9, which have learned to encode the objects' positions (along an approximately orthogonal set of axes at an arbitrary angle to the pixel grid). Other recognizable variations can be seen in color (dimensions 4, 6 and 11) and size (14). Shape and orientation appear to be entangled along multiple dimensions. Other dimensions do not obviously encode anything: they may be truly non-coding, or their role might not be visible in combination with the other latent values that we happened to sample here. In particular, the depth value of an object is almost certainly encoded in one or more latent dimensions, but this is not apparent from these traversals, as they target all objects simultaneously, and thus should not affect their occlusion patterns (which object is in front of which other object(s)).

Figure 10: Panels (a)-(h): latent dimensions 0-7. Panels (a)-(h), continued: latent dimensions 8-15.

References

[1] S. J. Cowley, How human infants deal with symbol grounding, Interaction Studies 8 (1) (2007) 83–104.
[2] E. Gibney, This ai learnt language by seeing the world through a baby's eyes, Nature 626 (7999) (2024) 465–466.
[3] S. T. Piantadosi, The computational origin of representation, Minds and machines 31 (2021) 1–58.
[4] P. W. Battaglia, J. B. Hamrick, J. B. Tenenbaum, Simulation as an engine of physical scene understanding, Proceedings of the National Academy of Sciences 110 (45) (2013) 18327–18332.
[5] K. Greff, S. Van Steenkiste, J. Schmidhuber, On the binding problem in artificial neural networks, arXiv preprint arXiv:2012.05208 (2020).
[6] B. Peters, N. Kriegeskorte, Capturing the objects of vision with neural networks, Nature Human Behaviour 5 (2021) 1127–1144.
[7] T. L. Griffiths, N. Chater, C. Kemp, A. Perfors, J. B. Tenenbaum, Probabilistic models of cognition: Exploring representations and inductive biases, Trends in cognitive sciences 14 (8) (2010) 357–364.
[8] A. Goyal, Y. Bengio, Inductive biases for deep learning of higher-level cognition, Proceedings of the Royal Society A 478 (2266) (2022) 20210068.
[9] S. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al., Attend, infer, repeat: Fast scene understanding with generative models, Advances in neural information processing systems 29 (2016).
[10] K. Greff, R. L. Kaufman, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, A. Lerchner, Multi-object representation learning with iterative variational inference, in: International conference on machine learning, PMLR, 2019, pp. 2424–2433.
[11] G. Elsayed, A. Mahendran, S. van Steenkiste, K. Greff, M. C. Mozer, T. Kipf, Savi++: Towards end-to-end object-centric learning from real-world videos, Advances in Neural Information Processing Systems 35 (2022) 28940–28954.
[12] F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, T. Kipf, Object-centric learning with slot attention, Advances in Neural Information Processing Systems 33 (2020) 11525–11538.
[13] J. Jiang, F. Deng, G. Singh, S. Ahn, Object-centric slot diffusion, arXiv preprint arXiv:2303.10834 (2023).
[14] M. Traub, S. Otte, T. Menge, M. Karlbauer, J. Thuemmel, M. V. Butz, Learning what and where: Disentangling location and identity tracking without supervision, arXiv preprint arXiv:2205.13349 (2022).
[15] H.-X. Yu, L. J. Guibas, J. Wu, Unsupervised discovery of object radiance fields, in: International Conference on Learning Representations, 2022.
[16] P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende, et al., Interaction networks for learning about objects, relations and physics, Advances in neural information processing systems 29 (2016).
[17] Y. Li, J. Wu, J.-Y. Zhu, J. B. Tenenbaum, A. Torralba, R. Tedrake, Propagation networks for model-based control under partial observation, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 1205–1211.
[18] L. S. Piloto, A. Weinstein, P. Battaglia, M. Botvinick, Intuitive physics learning in a deep-learning model inspired by developmental psychology, Nature human behaviour 6 (9) (2022) 1257–1267.
[19] N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu, A. Tacchetti, Visual interaction networks: Learning a physics simulator from video, Advances in neural information processing systems 30 (2017).
[20] C. Sancaktar, S. Blaes, G. Martius, Curious exploration via structured world models yields zero-shot object manipulation, Advances in Neural Information Processing Systems 35 (2022) 24170–24183.
[21] R. Veerapaneni, J. D. Co-Reyes, M. Chang, M. Janner, C. Finn, J. Wu, J. Tenenbaum, S. Levine, Entity abstraction in visual model-based reinforcement learning, in: Conference on Robot Learning, PMLR, 2020, pp. 1439–1456.
[22] D. Driess, Z. Huang, Y. Li, R. Tedrake, M. Toussaint, Learning multi-object dynamics with compositional neural radiance fields, in: Conference on robot learning, PMLR, 2023, pp. 1755–1768.
[23] D. Haramati, T. Daniel, A. Tamar, Entity-centric reinforcement learning for object manipulation from pixels, arXiv preprint arXiv:2404.01220 (2024).
[24] T. Van de Maele, T. Verbelen, P. Mazzaglia, S. Ferraro, B. Dhoedt, Object-centric scene representations using active inference, Neural Computation 36 (4) (2024) 677–704.
[25] J. Huo, Q. Sun, B. Jiang, H. Lin, Y. Fu, Geovln: Learning geometry-enhanced visual representation with slot attention for vision-and-language navigation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23212–23221.
[26] R. Assouel, P. Rodriguez, P. Taslakian, D. Vazquez, Y. Bengio, Object-centric compositional imagination for visual abstract reasoning, in: ICLR2022 Workshop on the Elements of Reasoning: Objects, Structure and Causality, 2022.
[27] T. Webb, S. S. Mondal, J. D. Cohen, Systematic visual reasoning through object-centric relational abstraction, Advances in Neural Information Processing Systems 36 (2024).
[28] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., Palm-e: An embodied multimodal language model, arXiv preprint arXiv:2303.03378 (2023).
[29] T. Taniguchi, S. Murata, M. Suzuki, D. Ognibene, P. Lanillos, E. Ugur, L. Jamone, T. Nakamura, A. Ciria, B. Lara, et al., World models and predictive coding for cognitive and developmental robotics: Frontiers and challenges, Advanced Robotics 37 (13) (2023) 780–806.
[30] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al., Do as i can, not as i say: Grounding language in robotic affordances, arXiv preprint arXiv:2204.01691 (2022).
[31] Y. Xu, W. Li, P. Vaezipoor, S. Sanner, E. B. Khalil, Llms and the abstraction and reasoning corpus: Successes, failures, and the importance of object-based representations, arXiv preprint arXiv:2305.18354 (2023).
[32] K. Friston, The free-energy principle: a unified brain theory?, Nature reviews neuroscience 11 (2) (2010) 127–138.
[33] P. Lanillos, C. Meo, C. Pezzato, A. A. Meera, M. Baioumy, W. Ohata, A. Tschantz, B. Millidge, M. Wisse, C. L. Buckley, et al., Active inference in robotics and artificial agents: Survey and challenges, arXiv preprint arXiv:2112.01871 (2021).
[34] D. Walther, C. Koch, Modeling attention to salient proto-objects, Neural networks 19 (9) (2006) 1395–1407.
[35] A. Rasouli, P. Lanillos, G. Cheng, J. K. Tsotsos, Attention-based active visual search for mobile robots, Autonomous Robots 44 (2) (2020) 131–146.
[36] R. S. van Bergen, P. Lanillos, Object-based active inference, in: Active Inference: Third International Workshop, IWAI 2022, Grenoble, France, September 19, 2022, Revised Selected Papers, Springer, 2023, pp. 50–64.
[37] J. Marino, Y. Yue, S. Mandt, Iterative amortized inference, 35th International Conference on Machine Learning, ICML 2018 8 (2018) 5444–5462.
[38] F. Bos, A. A. Meera, D. Benders, M. Wisse, Free energy principle for state and input estimation of a quadcopter flying in wind, in: 2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 5389–5395.
[39] T. Parr, G. Pezzulo, K. J. Friston, Active inference: the free energy principle in mind, brain, and behavior, MIT Press, 2022.
[40] O. van der Himst, P. Lanillos, Deep active inference for partially observable mdps, in: Active Inference: First International Workshop, IWAI 2020, Co-located with ECML/PKDD 2020, Ghent, Belgium, September 14, 2020, Proceedings 1, Springer, 2020, pp. 61–71.
[41] B. Millidge, A. Tschantz, C. L. Buckley, Whence the expected free energy?, Neural Computation 33 (2) (2021) 447–482.
[42] Z. Fountas, N. Sajid, P. Mediano, K. Friston, Deep active inference agents using monte-carlo methods, Advances in neural information processing systems 33 (2020) 11662–11675.
[43] R. E. Kalman, et al., Contributions to the theory of optimal control, Bol. soc. mat. mexicana 5 (2) (1960) 102–119.
[44] D. E. Kirk, Optimal control theory: an introduction, Courier Corporation, 2004.
[45] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: International conference on machine learning, PMLR, 2018, pp. 1861–1870.
[46] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
[47] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-baselines3: Reliable reinforcement learning implementations, Journal of Machine Learning Research 22 (268) (2021) 1–8.
[48] C. Gumbsch, M. V. Butz, G. Martius, Sparsely changing latent states for prediction and planning in partially observable domains, Advances in Neural Information Processing Systems 34 (2021) 17518–17531.
[49] J. Gu, S. Kirmani, P. Wohlhart, Y. Lu, M. G. Arenas, K. Rao, W. Yu, C. Fu, K. Gopalakrishnan, Z. Xu, et al., Rt-trajectory: Robotic task generalization via hindsight trajectory sketches, arXiv preprint arXiv:2311.01977 (2023).
[50] P. Sundaresan, Q. Vuong, J. Gu, P. Xu, T. Xiao, S. Kirmani, T. Yu, M. Stark, A. Jain, K. Hausman, et al., Rt-sketch: Goal-conditioned imitation learning from hand-drawn sketches, arXiv preprint arXiv:2403.02709 (2024).
[51] P. Lanillos, M. van Gerven, Neuroscience-inspired perception-action in robotics: applying active inference for state estimation, control and self-perception, arXiv preprint arXiv:2105.04261 (2021).
[52] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[53] E. Jang, S. Gu, B. Poole, Categorical reparameterization with gumbel-softmax, arXiv preprint arXiv:1611.01144 (2016).

---