CoReTab: Improving Multimodal Table Understanding with Code-driven Reasoning
Improving multimodal table understanding with code-driven reasoning.
A comprehensive benchmark and a training-free method for 360° image perception using MLLMs.
Introduces TB-Bench to train and evaluate multimodal agents for understanding complex traffic behaviors captured by dashcams.
Presents GRIT, a dual-feature transformer that improves both speed and accuracy for image captioning.
Enhances interactive instruction following agents with wide-context perception and iterative reasoning.
Introduces an efficient attention design capturing full interactions in visual dialog systems.
Revisits single-stage detectors and boosts their effectiveness on face detection benchmarks.
Applies capsule networks to the challenging task of recognizing subtle micro-expressions.
Proposes a semi-supervised multi-label learning framework that explicitly models label-feature relationships.
Introduces a lifelong topic modeling pipeline tailored for Vietnamese multi-label text classification.