June 17, 2026

The Third Wave of Physical Intelligence

Vikram Venkat

Efforts to digitize the physical world have been underway for several decades – physical AI is only the latest, and third, wave in this sequence. While early solutions captured and analyzed data, only recently have technical advances enabled physical AI to emerge as a new operational intelligence layer that closes the loop by delivering outcomes and actions.

First wave – Digital Systems of Record

Starting in the 1990s, various enterprise systems of record (primarily ERPs, MES, and SCADA) turned information from physical processes into structured digital data streams. With further technological advances, telemetry solutions enabled a shift from manual, batched inputs to automated, continuous inputs. These included the first wave of IoT sensors, telematics, and early warehouse automation. Increased adoption of these inputs enabled continuous remote monitoring and alerting. However, these solutions continued to feed measured values and identified anomalies into the same systems of record without providing any understanding of the causes behind them. Early visual data from cameras, especially those used in security, traffic, and industrial contexts, helped add some context from the physical world to the measurements; however, these cameras were again purely fixed pipelines of data that could only be correlated manually with data from the systems of record.

Second wave – Early Computer Vision

The first wave of digital systems of record had proved the importance of understanding the physical world. Starting in the late 2000s, deep learning algorithms helped automate the analysis and understanding of visual data – Convolutional Neural Networks (CNNs), together with techniques such as Simultaneous Localization and Mapping (SLAM) and early multimodal models, laid the foundation for broader application of computer vision. This also allowed the shift from manual analysis and hand-engineered features to automated learning, and paved the way for new solutions in medical imaging, warehouse robotics, and even early autonomous vehicles and drones.

However, these approaches were massively data-hungry: every new use case needed additional data collection and annotation before a model could be retrained. These systems were also more effective at retrospective analysis than at proactive decision-making. The second wave thus enabled the move from visualization (as enabled by the first wave) to perception, but not yet to true reasoning and action.

Third wave – Physical AI

Physical AI solutions began emerging in the late 2010s and are still undergoing development. Three main shifts were needed to enable this new wave of solutions:

From perception to world models: World models build a persistent understanding of the world state, which can be queried and analyzed in real time – an upgrade on the narrowly trained, use-case-specific models of the previous wave.
From modular datasets to integrated pipelines: True understanding of the world state requires correlating observations across various datasets, including IoT sensors, process data from ERPs, and visual data. These need to be captured and fused in real time, enabling continuous data loops instead of fixed training data sets.
From open-loop to closed-loop control systems: Systems operating in the real world need to perform real-time inference, state estimation, and risk-adjusted modeling to ensure strict adherence to safety constraints, especially in dynamically changing environments.
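
To make the three shifts more concrete, here is a minimal Python sketch of a closed loop: a persistent state is updated from two fused inputs, a decision is made under a hard safety constraint, and the resulting action changes what is sensed next. Everything in it is an assumption made for illustration (the conveyor scenario, the names WorldState, fuse, decide, and simulate, and the smoothing and step parameters); real physical AI systems rely on far richer world models and state estimators.

```python
from dataclasses import dataclass


@dataclass
class WorldState:
    """Hypothetical persistent state, updated on every observation."""
    conveyor_speed: float = 0.0   # m/s, smoothed estimate from a speed sensor
    person_in_zone: bool = False  # latest flag from a (hypothetical) vision model


def fuse(state, sensor_speed, vision_person, alpha=0.5):
    """Fuse two modalities into one state: exponentially smooth the
    sensor reading and overwrite the vision flag with the newest value."""
    smoothed = alpha * sensor_speed + (1 - alpha) * state.conveyor_speed
    return WorldState(conveyor_speed=smoothed, person_in_zone=vision_person)


def decide(state, target_speed, max_step=0.2):
    """Return a speed command. The safety check comes first and
    always overrides the efficiency goal."""
    if state.person_in_zone:
        return 0.0  # hard safety constraint: stop the conveyor
    error = target_speed - state.conveyor_speed
    step = max(-max_step, min(max_step, error))  # limit change per tick
    return state.conveyor_speed + step


def simulate(steps, person_steps, target_speed=1.0):
    """Run sense -> estimate -> decide -> act against a toy plant.
    person_steps is the set of ticks at which vision reports a person."""
    true_speed = 0.0  # toy plant: the conveyor reaches each command instantly
    state = WorldState()
    commands = []
    for t in range(steps):
        state = fuse(state, true_speed, t in person_steps)  # sense and estimate
        command = decide(state, target_speed)               # decide
        true_speed = command                                # act: closes the loop
        commands.append(command)
    return commands
```

Calling simulate(steps=4, person_steps={2}) ramps the conveyor up, commands a full stop at the tick where the vision flag is set, and resumes ramping afterwards. The point of the sketch is the ordering: the safety constraint is evaluated against the fused state before any efficiency objective is considered.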

Physical AI solutions can truly understand the context around any data captured from the physical world and make decisions or take actions based on this understanding. Visual data, captured by cameras of various kinds, is crucial for enabling true reasoning and action.

The criticality and ubiquity of vision

Visual data provides a rich depth that cannot be obtained from other modalities or sources.

First, visual data enables a semantic understanding of the physical world. Physical sensors from the first wave can measure vibrations in a gearbox and identify that there is an abnormality; however, vision data can pinpoint that the vibration is caused by a particular misaligned bearing in the gearbox. This is especially powerful in complex ecosystems (typical in industrial, retail, security, and healthcare applications), where the large volumes of data across different subparts cannot be manually observed by humans to make sense of sensor measurements.

Second, visual data can capture behavioral intent. Traditional sensors detect measurable quantities but cannot contextualize the cause or intention behind them; visual data can capture deviations from expected workflows or processes and even analyze human posture, gestures, and movements to provide a deeper understanding.

Third, visual data can generalize much more easily across time, use cases, and observed variables. Unlike typical sensors that measure a single metric (for example, thermometers that measure temperature), the same camera feed in a factory can be used to monitor safety, optimize process efficiency, detect quality issues, and provide operational analytics. The same feed can also be compared across time periods to measure changes along the temporal dimension.

Visual data can provide rich intelligence, which is essential across different use cases, especially in the real world. These use cases cut across various sectors, including:

Manufacturing and industrial: Physical AI solutions can drive truly autonomous industrial operations, including:

Robotics in industrial and warehouse automation: Pick-and-place, sorting, welding
Process monitoring: Measuring cycle times, checking assembly correctness, and identifying deviations from set processes or expected outputs
Predictive maintenance: Identifying the need for maintenance based on visual wear, damage, corrosion, leaks, or other similar flaws
Automating quality inspection: Identifying surface defects, missed or misaligned components, and other similar factors
Safety and compliance: Detecting PPE, identifying unsafe or non-compliant behavior, and enforcing movement restrictions

Healthcare: Third-wave vision solutions provide non-invasive, continuous observation at scale, augmenting the capabilities of healthcare providers and clinicians through:

Patient monitoring: Mobility tracking, fall detection, and other visible conditions
Surgical intelligence: Guiding surgical staff through clearer visualization of anatomical features and flows, and supporting robotic surgeries
Clinical workflow optimization: Optimizing room utilization, controlling access, ensuring effective disinfection, and hand-hygiene compliance
Medical imaging: Radiology and pathology across therapeutic areas

Security and defense: Third-wave vision solutions enable active threat identification and classification through:

Perimeter and access control: Facial recognition, motion tracking, tailgating detection
Crowd analytics: Measuring density and flow
Anomaly detection: Unusual movement patterns, abandoned objects, gun detection

Mobility: Vision is a critical part of the autonomous navigation stack, proactively identifying relevant environmental context and enabling real-time decision-making while in motion:

Advanced Driver Assistance Systems (ADAS)
Intelligent traffic systems
Fleet monitoring

Multiple other use cases exist across retail, agriculture, energy, and more. Physical AI solutions not only help improve efficiency in existing processes; they also unlock new use cases and processes that could not have been performed before. This is especially relevant given recent labor shortages across manufacturing (where ~400,000 jobs in the US remain unfilled, and the gap is expected to widen to ~2.1 million by 2030, per the National Association of Manufacturers), healthcare (McKinsey estimates put the global healthcare worker shortage at over 10 million by 2030), and most other major sectors. Physical AI solutions that can improve the productivity and efficiency of human workers and support them in delivering outcomes are critical to filling some of these labor gaps.

A net-new world

Physical AI has only become possible recently due to a confluence of major factors. First, the number of cameras installed for commercial use cases across industry and government has grown rapidly over the past decade – conservative estimates put this number in the low billions. Further, there have been massive advancements in camera technology that allow capturing different types of scenes with increased accuracy and resolution – we will touch upon some of these in a subsequent article.

Second, this has been augmented and amplified by developments in software infrastructure and algorithms, including sensor fusion models, causal inference architectures, and Vision Language Models (VLMs). The feed from the growing number of cameras has been a critical source of real-world data that can act as a base to train and fine-tune advanced vision-based AI solutions.

Finally, advancements in broader technical infrastructure, such as improvements in data compression, storage, retrieval, and transfer, as well as edge processing, have enabled physical AI on various edge devices, including autonomous vehicles, robots, and cameras – making these solutions both accessible and ubiquitous.

Given the complexity of capturing data about the real world, interpreting it accurately in real time, and making decisions or taking actions based on this interpretation, a whole ecosystem of different forms of hardware and software is essential. In our next article, we will explore the ideal physical AI hardware and software stacks, and the way forward for physical AI.
