Image Caption Evaluation

I started this research during my internship at Microsoft Research Lab. With an eye on vision-language research, this project grew out of a question about the evaluation metrics popularly used to measure image captioning system performance: is it faithful enough to quantify the quality of an image description purely based on text matching, especially on n-gram overlap?

Two considerations led me to this question. First, as the old adage goes, “a picture is worth a thousand words”: even human-written references may not fully cover the image content. The information loss in the references can bias the evaluation process, and such text-level comparisons must also contend with the challenge of language ambiguity. In addition to this concern about existing text-based evaluation strategies, I’m also interested in exploring other dimensions that should be considered to judge, more comprehensively, how well a machine perceives visual information when generating text.
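To make the concern concrete, here is a minimal sketch (not part of the original project) of how an n-gram metric such as BLEU can reward surface word overlap rather than faithfulness to the image. The captions below are invented for illustration: a correct paraphrase with little lexical overlap scores lower than a caption that copies most of the reference wording but gets a key object wrong.

```python
# Sketch: n-gram overlap vs. semantic faithfulness, using NLTK's sentence-level BLEU.
# The example captions are hypothetical and chosen only to illustrate the failure mode.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a man is riding a brown horse on the beach".split()
paraphrase = "someone rides a horse along the shore".split()            # faithful, but little n-gram overlap
wrong_but_similar = "a man is riding a brown bike on the beach".split()  # one wrong object, high overlap

smooth = SmoothingFunction().method1  # smoothing avoids zero scores for short sentences
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))         # low score despite being correct
print(sentence_bleu([reference], wrong_but_similar, smoothing_function=smooth))  # high score despite the error
```

The point is not that BLEU is useless, but that a purely text-matching score cannot tell these two failure modes apart without looking at the image itself.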