Look at this picture and find what’s out of place. Think about signaling to someone that an item is not where it should be. Through sight, we immediately notice that shoes don’t belong on the table. Entering a friend’s apartment and seeing this, we’d point and question their placement. Instinctively, we’d likely move them to the floor. It’s natural for us to grasp a situation’s context, convey our intentions, and take action.

[Image: Shoes]

As we traverse a world where digital and physical boundaries increasingly blur, machines are learning to see, and that shift is set to reshape how we interact and operate. Although integral to our human experience, vision is no longer exclusively ours – machines are rapidly catching up and acting with greater autonomy. As they weave into our interactions, we find ourselves in a realm where our visual capabilities are paralleled, and sometimes outmatched, by robotic perception and decision-making.

Pointing is one of the first signs that infants have developed social and communication skills, requiring a visual cue and a physical response. This act of asking for something – imperative pointing – necessitates some visual trigger that prompts a baby to make a request. As the Hanen Centre notes, it also allows adults to label objects for the baby, fostering vocabulary and language development. Pointing serves as one of the primal forms of user input and action, evolving through childhood into adulthood.

The advent of touchscreen smartphones transitioned pointing from a physical to a digital gesture, replacing speech as our primary interaction mode. Now, emerging technology shifts the paradigm once more, positioning vision as the main user input channel. Unlike a touchscreen or mouse, which demands a sequence of intent, realization, seeing, pointing, and clicking, vision-based systems eliminate the need for the last two steps. The unveiling of Apple Vision Pro, Meta Quest 3, and Ray-Ban Meta smart glasses has reignited consumer interest in mixed reality and the metaverse, heralding a transformation in our interaction with technology.

[Image: Baby]

Consumer virtual reality will likely concentrate among established players who leverage existing distribution channels and familiar operating systems. The iPhone has changed little over the years, and that is precisely why we keep buying it: familiarity has made it the default input device. Knowing how to navigate the home screen, find contacts, and adjust settings outweighs the effort of adapting to something new. Upcoming iterations of consumer vision systems will sift through what works and what doesn’t, eventually delivering a default product we all come to embrace.

The act of seeing is core to our existence. From our first gaze, we begin amassing and cataloging visual data, constructing a repository of experiences and contexts to navigate the world. Over time, humans have developed intricate mechanisms to interpret their surroundings. The next frontier is augmenting this reality within virtual realms. While new software will enhance our efficiency and consumption, we don’t need a visual cue to tell us that shoes don’t belong on the table. We already know that.

Exciting new applications and platforms will emerge with vision as the default input. Yet they overshadow the true value of vision going mainstream: machines using sight to interact with their physical environment.

Historically, disorganized tasks have been a human domain, as they pose too much complexity for robots trained on finite datasets. Humans have a structural advantage in vision due to a lifetime of contextualization. Traits like object permanence, depth perception, and scene understanding are acquired over developmental years. Akin to how a baby points at objects to learn from caregivers, AI models fed with vast data can analyze, discern, and act. We’re headed towards a future where cross-learning between systems will enable critical thinking long elusive to machines.

Unlike human vision, which is inherently multimodal and leverages a full sensory range for complete understanding, machines are only now exploring this domain. The rise of multimodal deep learning models, fusing data from diverse modalities like images and text, aims to emulate human-like comprehension of varied inputs. Text-based LLMs, trained on extensive datasets from across the web, pave the way for multimodal models like GPT-4, which are trained on a mix of media types and offer more generalized interpretations.
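To make “fusing data from diverse modalities” concrete, here is a minimal late-fusion sketch in PyTorch: encode the image and the text separately, then combine the two embeddings before a shared prediction head. It illustrates the general pattern only; the dimensions and layer names are my own assumptions, not how GPT-4 or any production model is built.

```python
import torch
import torch.nn as nn

# Minimal late-fusion sketch: one encoder output per modality, embeddings
# concatenated and passed through a shared head. Dimensions are arbitrary
# and purely illustrative.
class LateFusionModel(nn.Module):
    def __init__(self, image_dim=512, text_dim=768, hidden=256, num_classes=10):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, hidden)   # project image features
        self.text_proj = nn.Linear(text_dim, hidden)     # project text features
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, num_classes),          # joint prediction head
        )

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.image_proj(image_feats),
                           self.text_proj(text_feats)], dim=-1)
        return self.head(fused)

# Usage: in practice the features would come from pretrained image and text encoders.
model = LateFusionModel()
logits = model(torch.randn(1, 512), torch.randn(1, 768))
```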

[Image: Eye]

Things get interesting when a task becomes more open-ended and pushes the boundaries of current vision systems’ capabilities. Imagine a physical representation of the apartment with shoes on the table. If you asked a friend to “put everything where it belongs,” they would quickly notice that the shoes don’t belong, move them to the floor, and then scan the environment for what else needs to be fixed. A robotic system using visual inputs may recognize the shoes and table, but not know that the shoes shouldn’t be there, let alone how to take action to remedy the situation. However, a robot trained to repeatedly remove shoes from the table could perform that task flawlessly, and nothing else.

Even if the robot could visually ingest its environment and the command “put everything where it belongs,” the nuance of the situation makes the task far more complex. Does “everything” include things like the ceiling and walls? Do objects “belong” in more than one place? How would it even execute the action to accomplish the task? These and more are questions that are part of an unstructured world – one that makes sense to humans, but not machines. While humans are generalists, robots are specialists – veering outside a specific task’s scope is a challenge for a robot.

Enter vision-language-action (VLA) models like Google DeepMind’s Robotic Transformer 2 (RT-2), which learn from both web and robotics data and translate that knowledge into generalized robotic control instructions while retaining web-scale capabilities. RT-2 extends what was previously possible from its limited robotic data by building on a vision-language model, enabling it to interpret new commands and perform rudimentary reasoning about the environment and the context of a user’s text-based input. The model can use chain-of-thought reasoning to carry out more involved commands that require making intermediate decisions.
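RT-2’s central trick is to express robot actions in the same token space as text, so the model that reads web data can also emit motor commands. A rough sketch of what that tokenization might look like; the bin count and action layout here are assumptions for illustration, not RT-2’s actual implementation:

```python
import numpy as np

# Hypothetical sketch: discretize a continuous robot action into integer
# "action tokens", the way RT-2-style VLA models fold control into the same
# vocabulary as text. Bin count and action layout are assumptions.
NUM_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # normalized action range

def action_to_tokens(action: np.ndarray) -> list[int]:
    """Map each action dimension (e.g. dx, dy, dz, droll, dpitch, dyaw, gripper)
    to one of NUM_BINS discrete bins."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # -> [0, 1]
    return (scaled * (NUM_BINS - 1)).round().astype(int).tolist()

def tokens_to_action(tokens: list[int]) -> np.ndarray:
    """Invert the binning to recover an (approximate) continuous action."""
    scaled = np.array(tokens, dtype=float) / (NUM_BINS - 1)
    return scaled * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# A 6-DoF end-effector delta plus gripper command, round-tripped through tokens.
action = np.array([0.02, -0.10, 0.35, 0.0, 0.0, 0.15, 1.0])
tokens = action_to_tokens(action)
print(tokens, tokens_to_action(tokens))
```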

A recent iteration of this is Open X-Embodiment and RT-X: an open-source dataset of robot demonstrations and an advanced robotics transformer model, respectively. Open X-Embodiment includes over one million robot trajectories, from single-armed robots to quadrupeds. Combining data from different types of robots in RT-X’s training teaches emergent skills not found in the original datasets. This creates a flywheel: new robotic actions are added to Open X-Embodiment, providing more data for RT-X to train on and draw from, which gives more robots access to a new set of actions, and so on.
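The practical work behind pooling data like this is getting trajectories from very different robots into one schema that a single model can train on. A minimal sketch of what such a record might look like; the field names are illustrative, not the actual Open X-Embodiment format:

```python
from dataclasses import dataclass, field

# Hypothetical unified trajectory record for cross-embodiment training.
# Field names are illustrative only, not the Open X-Embodiment schema.
@dataclass
class Step:
    image: bytes            # camera observation at this timestep
    instruction: str        # natural-language task description
    action: list[float]     # normalized action vector for this embodiment

@dataclass
class Trajectory:
    embodiment: str                              # e.g. "single-arm", "quadruped"
    steps: list[Step] = field(default_factory=list)

def merge(datasets: list[list[Trajectory]]) -> list[Trajectory]:
    """Pool trajectories from many robot types into one training corpus,
    so a single policy can pick up skills no individual dataset contains."""
    return [traj for ds in datasets for traj in ds]
```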

Using RT-X, robots that have learned from one another through cross-training have begun to exhibit spatial reasoning – a biological trait. Suddenly, “putting everything where it belongs” becomes a feasible prompt. A robot with access to this ecosystem can reach back into its dataset to check whether similar questions have been asked in the past or the current scene has been interacted with before. It can determine whether other robots have already moved in the way required, even if that movement is outside its own experience. By sharing, learning, seeing, and asking, robots are mimicking human behavior.
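One way to picture “reaching back” into this ecosystem is as a nearest-neighbor lookup: embed the current scene and command, then check whether the pooled experience contains something close enough to reuse. A hedged sketch, where the embeddings are assumed to come from some multimodal encoder:

```python
import numpy as np

# Hypothetical retrieval over pooled robot experience: given an embedding of
# the current scene plus command (from any multimodal encoder), find the past
# episodes that look most similar and could be reused as guidance.
def most_similar(query: np.ndarray,
                 memory: list[tuple[np.ndarray, dict]],
                 k: int = 3) -> list[dict]:
    """Rank stored (embedding, episode-metadata) pairs by cosine similarity
    to the query embedding and return the top-k episodes."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    ranked = sorted(memory, key=lambda item: cosine(query, item[0]), reverse=True)
    return [episode for _, episode in ranked[:k]]

# Usage sketch: `memory` holds experience pooled from many robots; the query
# is an embedding of the current scene and "put everything where it belongs".
```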

[Image: Robot]

The next domain of generative models will be in physical space. With this, real-world agents become not just possible, but probable, as robots are able to work alongside and learn from one another. As with training AI, the bottleneck for real-world agents is hardware. While RT-X and Open X-Embodiment transform what’s possible, they hinge on physical actions to enrich the dataset, colliding with the resource limitations inherent in actual robotic systems and lab environments. Until these resources can scale into the commercial sector and produce data at scale, most of the training will come from just a handful of robots working on analogous problems.

A team from UC Berkeley has engineered a universal simulator, called UniSim, enabling agents to train exclusively in simulation, with the capability to transfer this learning to an actual robot. UniSim navigates from a starting image to a destination image, finding an optimal decision path to the intended outcome. The simulator isn’t constrained to low-level reinforcement learning; it can work through high-level tasks as well. Low-level controls and high-level plans work together to manipulate the physical world until it resembles the destination image, without requiring that any particular action already exist in the robot’s dataset. Consequently, UniSim becomes a conduit for emulating how humans and robotic agents interact in a physical space, visualizing the outcome of those interactions.
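One way to read this loop is as model-predictive control against a goal image: propose candidate action sequences, roll each one forward in the learned simulator, and keep the sequence whose predicted final frame lands closest to the destination. The sketch below illustrates that idea; `simulate` and `sample_actions` are hypothetical stand-ins, not UniSim’s actual interface:

```python
import numpy as np

def plan_to_goal(start_image: np.ndarray,
                 goal_image: np.ndarray,
                 simulate,          # learned simulator: (image, actions) -> predicted final image
                 sample_actions,    # proposal distribution over action sequences
                 num_candidates: int = 64):
    """Pick the candidate action sequence whose simulated outcome is closest
    to the goal image. Both `simulate` and `sample_actions` stand in for
    learned models; this is an illustration of the idea, not UniSim's API."""
    best_actions, best_error = None, float("inf")
    for _ in range(num_candidates):
        actions = sample_actions()
        predicted = simulate(start_image, actions)
        error = float(np.mean((predicted - goal_image) ** 2))  # pixel-space distance
        if error < best_error:
            best_actions, best_error = actions, error
    return best_actions
```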

UniSim does all of its training virtually, removing the limitations of RT-X. Only when an action is needed does the transfer to a physical robotic system occur. Now we can imagine our scene with shoes on a table, as well as a previously captured scene of the tidy apartment. When asked to “put everything where it belongs,” the robot never needs to have seen the images or performed the task before. Instead, it understands what steps it needs to take to manipulate the environment and achieve the outcome we’ve been after the whole time: no shoes on the table. UniSim enables real-world action at software scale.

Robotic vision agents will differ in their distribution from consumer vision products. Big tech will use its channels to sell as many consumer hardware devices as possible, creating a network of users that third-party developers will build software on top of. Commoditized hardware is the vector for innovative software in this case. Physical robots are commodities themselves, but developing valuable software for them does not require existing distribution. There’s no hardware buy-in required from the end customer; instead, the software developer sells a full-stack robotic system as a single package. The ideal use cases will not be to sell commoditized hardware, nor to build physical robots for specific tasks, but to sell a packaged service into market verticals. In this case, hardware becomes a variable cost.

Take a company I’ve spent some time with called Emancro. Emancro develops general-purpose robotic systems to address labor shortages in hospital pharmacies. Its robot can assemble medication trays, restock cabinets, and will eventually scale horizontally across the hospital to automate other disorganized tasks. The robot itself is largely off-the-shelf, with the innovation coming from a custom manipulation AI and an in-house end effector. Emancro isn’t selling a piece of hardware per se – it’s using hardware as a way to distribute its software.

I believe that we are going to see a lot more Emancros developing models for general-purpose robotics across disorganized settings. To me, the chaos of a packed bar on a Friday night is the ultimate example of disorganized order. Each agent (bartender) is programmed for a specific task (serve drinks). But sometimes a bottle runs empty, a keg kicks, or one side of the bar becomes busier than the other. The agents are required to make changes on the fly – reach into the liquor cabinet, tap a new keg, or help a colleague take orders. Agents can run countless virtual simulations of a late-night shift, and then apply those actions in the real world. The optionality of this type of system is endless, and the best companies will be the ones that target the highest-ROI verticals.

Technology will increasingly intertwine with our visual faculties as we navigate the interplay between physical and digital worlds. The advent of robots that not only see, but act, propels us into a domain where our inherent human vision is pushed to the brink of being outperformed. Our mastery of perception, understanding, and action still reigns supreme in this co-evolving dynamic, but to overlook the technological innovations coming from tech giants and research labs is to dismiss an inevitable future. As we transition into a realm where our visual and physical domains meld with computers and their expanding capabilities, our role evolves into one of guidance and control. It’s merely a matter of time before a robot beats us to removing the shoes from the table.