Gap Declaration
Scene descriptions in our study did not tap into precise compositional processes, such as symmetric relationships (Hafri et al.), thematic role assignment in agentive verbs (Koring, Mak & Reuland), or ergative verbs, which emphasize the object role in an action (see Khatin-Zadeh, Hu, Eskandari, Banaruee, Yanjiao, Farsani & He for an example involving gestures). Furthermore, since production and comprehension are not distinct and likely share predictive forward models (e.g., Kempen, Olsthoorn & Sprenger; Pickering & Gambi), a visual-world version of our study could better elucidate the moment-by-moment online integration of overt attention and sentence understanding (Snedeker & Trueswell), and thus examine whether the evidence of semantic dominance observed in our data still holds. Future research could, for example, develop experimental designs that elicit descriptions involving actions, agents, and patients across languages, yielding a more comprehensive understanding of the LoV underpinning event representation and its reflection in overt attention. Additionally, by systematically manipulating the complexity of syntactic constructions, it may be possible to map their cartographic distances as a function of gaze alignment. While we expect semantic similarity to remain dominant, this controlled setting could isolate structural distinctions linked to predictable gaze shifts, effectively testing the comprehension–production interface (Pickering & Gambi).

[...]

Computationally, the current study did not directly investigate real-time alignment between visual and linguistic processing; instead, we focused on the association between the two information streams, using sentence similarity as a predictor of scan-pattern similarity. However, our reliance on global similarity metrics may overlook fine-grained individual differences in attentional strategies. Future research could benefit from network-based scan-path analyses, in which eye movements are modeled as transition networks whose structural metrics (e.g., centrality, density) can distinguish processing styles (Ma, Liu, Clariana, Gu & Li). Additionally, recent advances in deep semantic gaze embeddings (Castner, Kuebler, Scheiter, Richter, Eder, Hüttig, Keutel & Kasneci) and personalized scan-path prediction models using observer encoders (Chen, Jiang & Zhao; Xue, Xu, Mondal, Le, Zelinsky, Hoai & Samaras) may offer higher granularity in distinguishing participants who rely on top-down linguistic guidance from those driven by bottom-up visual saliency. In parallel, the past decade has seen rapid advances in representing visual information using formalisms previously applied to linguistic information (e.g., dependency grammars; Elliott & Keller), leading to new developments in multimodal modeling (Koh, Fried & Salakhutdinov), with applications to tasks such as image captioning (Elliott & de Vries), text-to-image generation (Rombach, Blattmann, Lorenz, Esser & Ommer), and sign language recognition (Li, Duan, Fang, Gong & Jiang) (see Wu, Gan, Chen, Wan & Philip for a review of the state of the art).
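To make the computational directions above concrete, here is a minimal Python sketch of the two analyses the paragraph names: correlating pairwise sentence-embedding similarity with scan-pattern similarity, and modeling a scan path as a transition network with structural metrics (density, centrality) in the spirit of Ma et al. All sentences, AOI labels, the embedding model name, and the choice of normalized edit distance as the scan-path metric are illustrative assumptions, not the study's actual pipeline.

```python
# Hedged sketch: sentence similarity vs. scan-pattern similarity, plus a
# network-based scan-path metric. Data and model choice are hypothetical.
import itertools
import numpy as np
import networkx as nx
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Hypothetical trials: each scene description is paired with the sequence
# of areas of interest (AOIs) fixated while producing it.
sentences = [
    "A man hands a book to a woman",
    "A woman receives a book from a man",
    "Two people stand near a bookshelf",
]
scanpaths = [
    ["man", "book", "woman", "book"],
    ["woman", "book", "man", "woman"],
    ["shelf", "man", "woman", "shelf"],
]

# 1) Sentence similarity: cosine similarity of sentence embeddings
# (assumed model choice; any sentence encoder would do here).
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(sentences)
sent_sim = cosine_similarity(emb)

# 2) Scan-pattern similarity: normalized Levenshtein distance over AOI
# sequences, one of several common scan-path comparison metrics.
def edit_distance(a, b):
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            d[i, j] = min(d[i - 1, j] + 1,      # deletion
                          d[i, j - 1] + 1,      # insertion
                          d[i - 1, j - 1] + (x != y))  # substitution
    return d[-1, -1]

def scan_sim(a, b):
    return 1 - edit_distance(a, b) / max(len(a), len(b))

# Correlate the two similarity structures over all trial pairs.
pairs = list(itertools.combinations(range(len(sentences)), 2))
x = [sent_sim[i, j] for i, j in pairs]
y = [scan_sim(scanpaths[i], scanpaths[j]) for i, j in pairs]
rho, p = spearmanr(x, y)
print(f"sentence vs. scan-pattern similarity: rho={rho:.2f}, p={p:.3f}")

# 3) Network-based scan-path analysis: each scan path becomes a directed
# transition network; density and centrality summarize its structure.
for path in scanpaths:
    g = nx.DiGraph()
    g.add_edges_from(zip(path, path[1:]))
    print(f"density={nx.density(g):.2f}", nx.degree_centrality(g))
```

With realistic data, the per-pair correlation in step 2 would be computed within participants and languages, and the network metrics in step 3 could enter a classifier or mixed model to separate top-down from bottom-up attentional styles.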
Gateway: future research
Type: scope
Section: conclusions
Phase: 1
Confidence: 1.0
Abstract
A central question in cognition is how representations are integrated across different modalities, such as language and vision. One prominent hypothesis posits the existence of an abstract, prelinguistic "language of vision" as a representational system that organizes meaning compositionally, enabling cross-modal integration. This hypothesis predicts that the language of vision operates universally, independent of linguistic surface features such as word order. We conducted eye-tracking experiments where participants described visual scenes in English, Portuguese, and Japanese. By analyzing spoken descriptions alongside eye-movement sequences divided into planning and articulation phases, we demonstrate that semantic similarity between sentences strongly predicts the similarity of…
Conclusions / Discussion
This study provided empirical evidence that the similarity of participants' scan patterns is predicted by the semantic similarity of the associated scene descriptions across typologically diverse languages (English, Portuguese, and Japanese). Our results extend previous work (Coco & Keller) to languages beyond English. As discussed in the Introduction, our findings enable us to adjudicate between the LoV hypothesis (Cavanagh), a strictly syntactic hypothesis, and interactionist accounts regarding the representational systems that coordinate overt attention and speech production. The idea of an LoV builds on the LoT hypothesis (Fodor) in assuming the semantic compositionality of prelinguistic representations. Our results support this hypothesis and suggest that the relevant semantic representations are shared across languages and grounded in the perceptual system (Wilson et al.; Quilty-Dunn et al.). This is also consistent with the idea that visual information shapes the language used to describe it (Talmy; Landau & Jackendoff; Hafri et al.; Ünal et al.; Coventry et al.) and with recent evidence showing that speakers consistently prioritize affordance-based objec…
Keeper Review
The Appreciated Gateway must be evaluated by a human keeper.
Does this declaration represent a genuine open research gap?
Structural Hole: 65% bridge
Origin: computer science
Crossings: psychology, criminal justice, epidemiology, genomics, bioinformatics

The technique originates in computer science; functional analogues are absent from the psychology and criminal justice literatures.

NAUGHT — Open Opportunity

No paper has claimed this gap. Appreciate the opportunity.

Provenance
Gap ID: 45
Paper ID: 57
PMCID: PMC12930141
AI Check: Interrogated — no signals
Detected: 2026-04-11
Verdict: pending
Gap Type: scope