Thesis

Josh worked on a code generation and computer vision project for his M.S. thesis at the University of Illinois Urbana-Champaign and for a subsequent paper.

Abstract

End-to-end vision-language models often fail on compositional tasks, so more complex problems call for alternative approaches. Building on the visual programming paradigm, we propose a novel method that composes foundational vision models through program generation to tackle compositional tasks. We investigate prompting and execution strategies that let trainable large language models synthesize fine-tunable code, with the aim of making the generated programs more effective at solving vision-language tasks.

Capitalizing on the robust compositional reasoning capabilities of large language models (LLMs), we use pre-trained LLMs to generate programs built from a catalog of pre-defined atomic functions. These atomic functions, implemented with pre-trained vision models, serve as the building blocks for the visual programs generated by our system. Our methodology supports programs in various formats while always preserving the flexibility to fine-tune both the constituent vision models and the LLM code generator.
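
As an illustration of the idea above, the following is a minimal sketch of a visual program composed from atomic functions. It is not the thesis implementation: the names `detect`, `filter_color`, `count`, and `run_program` are hypothetical, the atomic functions are plain stubs standing in for pre-trained vision models, and the program string stands in for what an LLM might generate for the question "How many red cups are there?".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical stand-in for an image: a list of labelled, coloured objects.
@dataclass
class Box:
    label: str
    color: str


Image = List[Box]


# Hypothetical atomic functions. In the system described above these would
# wrap pre-trained vision models; here they are simple stubs.
def detect(image: Image, label: str) -> List[Box]:
    return [box for box in image if box.label == label]


def filter_color(boxes: List[Box], color: str) -> List[Box]:
    return [box for box in boxes if box.color == color]


def count(boxes: List[Box]) -> int:
    return len(boxes)


# The catalog of atomic functions exposed to generated programs.
ATOMIC_FUNCTIONS: Dict[str, Callable] = {
    "detect": detect,
    "filter_color": filter_color,
    "count": count,
}

# Stand-in for LLM output (assumed format: a Python function named `answer`).
GENERATED_PROGRAM = """
def answer(image):
    cups = detect(image, "cup")
    red_cups = filter_color(cups, "red")
    return count(red_cups)
"""


def run_program(source: str, image: Image):
    # Execute the generated source with only the catalog in scope.
    # Note: exec is not a security sandbox; this is for illustration only.
    namespace = dict(ATOMIC_FUNCTIONS)
    exec(source, namespace)
    return namespace["answer"](image)


image = [Box("cup", "red"), Box("cup", "blue"), Box("plate", "red"), Box("cup", "red")]
result = run_program(GENERATED_PROGRAM, image)
print(result)  # 2
```

Because the program only calls functions from the catalog, each atomic function could in principle be swapped for a learned model without changing the generated code.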

This study concentrates on image-based question answering, a task that demands advanced compositional reasoning to interpret and respond to complex visual queries. We evaluate both the executability and the correctness of the generated programs to assess the effectiveness of our approach.
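
The two evaluation criteria could be measured roughly as in the sketch below. The abstract does not define the metrics, so this assumes that a program counts as executable if it runs without raising an exception and as correct if its output equals a reference answer. The function name `evaluate_programs`, the `answer` entry point, and the toy data are all hypothetical.

```python
from typing import Any, Dict, List


def evaluate_programs(
    sources: List[str],
    inputs: List[Any],
    references: List[Any],
    atomic_functions: Dict[str, Any],
) -> Dict[str, float]:
    """Return the fraction of programs that run and the fraction that are correct."""
    total = len(sources)
    if total == 0:
        return {"executability": 0.0, "accuracy": 0.0}

    executable = 0
    correct = 0
    for source, item, reference in zip(sources, inputs, references):
        try:
            # Fresh namespace per program so definitions do not leak between runs.
            namespace = dict(atomic_functions)
            exec(source, namespace)
            prediction = namespace["answer"](item)
        except Exception:
            # Syntax errors, missing functions, and runtime failures all count
            # as non-executable under the assumption stated above.
            continue
        executable += 1
        if prediction == reference:
            correct += 1

    return {"executability": executable / total, "accuracy": correct / total}


# Toy data: one atomic function and four candidate programs.
atomic = {"count": len}
sources = [
    "def answer(x):\n    return count(x)",      # runs, correct
    "def answer(x):\n    return count(x) + 1",  # runs, wrong answer
    "def answer(x):\n    return missing(x)",    # fails: undefined function
    "def answer(x) return count(x)",            # fails: syntax error
]
inputs = [[1, 2]] * 4
references = [2] * 4

scores = evaluate_programs(sources, inputs, references, atomic)
print(scores)  # {'executability': 0.5, 'accuracy': 0.25}
```

Here accuracy is computed over all programs, so a non-executable program also counts as incorrect; reporting accuracy over executable programs only would be an equally reasonable choice.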

This paper lays the groundwork for a subsequent investigation into the joint training of the LLMs and atomic functions, setting the stage for significant advancements in program generation and compositional reasoning in computer vision.