CHEF-VL: Detecting Cognitive Sequencing Errors in Cooking with Vision-language Models
Ruiqi Wang, Peiqi Gao, Patrick Lynch, and 5 more authors
Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., Dec 2025
Minimally obtrusive support for individuals with subjective cognitive decline (SCD) is important for fostering independence in completing daily tasks. When overseeing these tasks, occupational therapists may choose to intervene as errors arise and suggest corrective courses of action. To do so, therapists must be able to recognize task-specific actions as well as the sequence in which they should occur. However, manual monitoring by therapists is not always feasible in real-world environments, motivating the need for automated systems capable of recognizing actions and detecting sequencing errors. To address this need, we present CHEF-VL, an online Cognitive Human Error Detection Framework with Vision-Language Models in smart kitchen environments. CHEF-VL combines two novel vision-language models: one fine-tuned for online human action recognition and the other specially engineered to track key environmental states. An Action-State Merger integrates these two streams of information to reduce prediction noise and correct misrecognized actions. A two-year occupational therapy project involving over 100 participants with and without SCD was organized to collect video data for evaluation. Empirical results demonstrate that CHEF-VL improves both action recognition and sequencing error detection performance, offering a promising solution for real-world assistive technologies in smart home settings.
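
The abstract does not specify how the Action-State Merger or the sequencing check works internally. As a rough illustration of the general idea only, the Python sketch below combines a per-frame action prediction with an observed environment state change and then checks step order against a recipe. Every name in it (PREREQUISITES, STATE_IMPLIES_ACTION, merge, detect_sequencing_errors), the toy recipe, and the confidence threshold are assumptions made for this example, not the authors' implementation.

```python
# Hypothetical toy recipe: each step maps to the steps that must precede it.
PREREQUISITES = {
    "fill_pot": set(),
    "turn_on_stove": {"fill_pot"},
    "add_oats": {"turn_on_stove"},
}

# Hypothetical mapping from an observed state change to the action it implies.
STATE_IMPLIES_ACTION = {
    "pot_filled": "fill_pot",
    "stove_on": "turn_on_stove",
}


def merge(action_pred, action_conf, state_change, threshold=0.6):
    """Fall back to the state-implied action when the recognizer is unsure.

    The threshold rule is an assumption chosen for illustration.
    """
    implied = STATE_IMPLIES_ACTION.get(state_change)
    if implied is not None and (action_pred is None or action_conf < threshold):
        return implied
    return action_pred


def detect_sequencing_errors(observations):
    """Return (time step, action, missing prerequisites) for out-of-order steps.

    `observations` is a list of (predicted action, confidence, state change)
    tuples, one per time step; the state change may be None.
    """
    done = set()
    errors = []
    for t, (pred, conf, state) in enumerate(observations):
        action = merge(pred, conf, state)
        if action is None or action in done:
            continue  # nothing recognized, or a step already completed
        missing = PREREQUISITES.get(action, set()) - done
        if missing:
            errors.append((t, action, sorted(missing)))
        done.add(action)
    return errors


# A low-confidence "add_oats" is corrected to "fill_pot" by the state stream,
# so the sequence below is treated as correctly ordered.
corrected = detect_sequencing_errors([
    ("add_oats", 0.4, "pot_filled"),
    ("turn_on_stove", 0.9, "stove_on"),
    ("add_oats", 0.8, None),
])

# A confident "add_oats" before the stove is on is flagged as a sequencing error.
flagged = detect_sequencing_errors([
    ("add_oats", 0.9, None),
    ("fill_pot", 0.9, "pot_filled"),
])
```

In this sketch the state stream only overrides the action stream when the recognizer's confidence is low, which is one simple way to reduce prediction noise without discarding confident predictions.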