Unified multimodal foundation (early fusion)
Vision and language are trained jointly in a shared representation space. Compared with pipelines that stitch together separately built vision and language stages, this reduces the mismatch and information loss at the hand-off between stages, so visual understanding feeds directly into reasoning and action flows.
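
The idea of a shared representation space can be illustrated with a minimal sketch. The source does not describe the actual architecture, so everything here is an assumption made for illustration: the names (`D_MODEL`, `patch_proj`, `token_table`, `early_fusion_sequence`), the sizes, and the random weights are all hypothetical. The sketch shows one common way to realize early fusion: image patches and text tokens are mapped to vectors of the same width and concatenated into a single sequence, which one model would then process as a whole.

```python
import random

D_MODEL = 8     # hypothetical width of the shared representation space
PATCH_DIM = 12  # hypothetical size of one flattened image patch
VOCAB = 50      # hypothetical text vocabulary size


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of random stand-in weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(vec, weight):
    """Multiply a length-n vector by an n x d matrix, giving a length-d vector."""
    return [
        sum(v * row[j] for v, row in zip(vec, weight))
        for j in range(len(weight[0]))
    ]


def early_fusion_sequence(patches, token_ids, patch_proj, token_table):
    """Build one sequence in which image and text share the same vector width."""
    image_part = [project(p, patch_proj) for p in patches]  # patches -> D_MODEL
    text_part = [token_table[t] for t in token_ids]         # tokens  -> D_MODEL
    return image_part + text_part                           # single fused sequence


rng = random.Random(0)
patch_proj = make_matrix(PATCH_DIM, D_MODEL, rng)   # stand-in for a learned projection
token_table = make_matrix(VOCAB, D_MODEL, rng)      # stand-in for a learned embedding table
patches = make_matrix(4, PATCH_DIM, rng)            # four fake image patches
token_ids = [3, 17, 42]                             # three fake text tokens

sequence = early_fusion_sequence(patches, token_ids, patch_proj, token_table)
print(len(sequence), len(sequence[0]))  # 7 positions, each of width 8
```

In a stitched two-stage design, by contrast, the vision stage would typically emit a finished output (for example a caption or a fixed feature vector) that the language stage consumes afterwards; the sketch above has no such hand-off, since both modalities enter the same sequence before any further processing.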
