Abstracts: NeurIPS 2024 with Pranjal Chitale

Dec 06, 2024, 11 min

Episode description

Pranjal Chitale discusses the 2024 NeurIPS work CVQA. Spanning 31 languages and the cultures of 30 countries, this VQA benchmark was created with native speakers and cultural experts to evaluate model performance across diverse linguistic and cultural contexts.

Read the paper

Get the dataset

Transcript

GRETCHEN HUIZINGA

Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers. [MUSIC FADES]

Today I'm talking to Pranjal Chitale, a research fellow at Microsoft Research India. Pranjal is coauthor of a paper called “CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark,” and this paper is an oral presentation at this week's 38th annual Conference on Neural Information Processing Systems, or NeurIPS, in Vancouver, BC. Pranjal, thanks for joining us today on Abstracts!

PRANJAL CHITALE

Hi, Gretchen. Thanks for having me.

HUIZINGA

So, Pranjal, give us an overview of this paper. In a couple sentences, what problem are you trying to solve, and why should people care about it?

CHITALE

So we are witnessing some exciting times as LLMs are rapidly evolving as tools for countless use cases. While most of these LLMs were initially leveraged for natural language processing tasks, they are now expanded across languages and modalities. However, a major gap lies in the availability of multimodal data for non-English languages. Therefore, most multimodal models might not have coverage for non-English languages altogether or might just heavily rely on translations of the associated text in English-centric datasets so as to support multiple languages. The drawback of this approach is that it often misses the cultural nuances of local languages. And another reason why this is not optimal is the images are mostly Western-centric [and] therefore would not be well reflective of the local culture of a lot of regions.

So this kind of bias can skew these models towards a Western perspective, raising concerns about inclusivity and safety of the content which they generate when serving a global population, which involves multicultural and multilingual users. Therefore, for a truly inclusive AI ecosystem, models must demonstrate cultural understanding to ensure that the generated content is safe, respectful for diverse communities. Evaluating cultural awareness, though, is extremely challenging because how to define culture itself is an unsolved problem. However, in this work, we are trying to take a step towards having a proxy which could measure cultural understanding.

HUIZINGA

Well, talk about how you did this. What methodology did you use for this paper, and what were your major findings?

CHITALE

Now that we have defined our broader problem, it is important to decide the scope of our solution because, as we discussed, culture is an umbrella term. So we need to define a smaller scope for this problem. We chose visual question answering, which is a multimodal task, and it is one of the most critical multimodal tasks for the scope of this work. So recognizing the limitations of existing VQA benchmarks, which often rely on translations and lack cultural representation, we developed CVQA, which is Culturally-diverse multilingual VQA benchmark. CVQA spans 30 countries, 31 languages, and has over 10,000 culturally nuanced questions, which were crafted by native speakers and cultural experts.

So our focus was on creating questions which required what we term as cultural common sense to answer. For instance, with just the image, it is not possible to answer the question. You need some cultural awareness about the local culture to be able to answer the question. So these questions draw inspiration from knowledge of local culture. So one important aspect of this dataset is that we include both local language as well as English variants of the same question to allow robust testing of models across linguistic concepts.

I would say the crux of this effort is that while most of the prior efforts may be small in terms of language (it could be language-group specific or country specific for most), we wanted this to be a much larger global-scale collaborative effort. So this covers 31 languages across 30 countries. So to build CVQA, we worked with qualified volunteers from diverse age group and genders, ensuring that the questions authentically represented their cultures. So images which were collected, those were ensured to be copyright free, grounded in culture, and safe for work with strict guidelines to ensure that we avoid images which reflect some stereotypes or privacy violations. And we also had 10 categories, which involved topics ranging from daily life, sports, cuisine to history of the region, so a holistic view of the culture of the region.

So each question was crafted as a multiple-choice task with challenging answer options which required both the image as well as cultural knowledge to solve. We also employed a maker-checker approach to ensure quality and consistency.
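
As an illustration of the structure described above, here is a minimal sketch of how one multiple-choice item with paired English and local-language variants might be represented and turned into a text prompt. The field names, the `CVQAItem` class, the `render_prompt` helper, and the prompt wording are all assumptions made for this example; they are not the dataset's actual schema or the authors' evaluation code, and the item contents are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CVQAItem:
    """Hypothetical record layout for one question (not the real schema)."""
    image_path: str
    category: str
    question_en: str
    question_local: str
    options_en: tuple[str, ...]
    options_local: tuple[str, ...]
    answer_index: int  # position of the correct option


def render_prompt(item: CVQAItem, use_local: bool) -> str:
    """Build the text part of a multiple-choice prompt in one language."""
    question = item.question_local if use_local else item.question_en
    options = item.options_local if use_local else item.options_en
    lines = [question]
    for letter, option in zip("ABCD", options):
        lines.append(f"{letter}. {option}")
    # Assumed instruction wording; a real setup might localize this too.
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


# Placeholder item: the strings stand in for real questions and options.
example = CVQAItem(
    image_path="images/placeholder.jpg",
    category="Cooking and food",
    question_en="<question in English>",
    question_local="<question in the local language>",
    options_en=("<option 1>", "<option 2>", "<option 3>", "<option 4>"),
    options_local=("<opción 1>", "<opción 2>", "<opción 3>", "<opción 4>"),
    answer_index=0,
)

english_prompt = render_prompt(example, use_local=False)
local_prompt = render_prompt(example, use_local=True)
```

Keeping both variants on the same record means the same image and answer key can be scored under either prompt language, which is what allows a like-for-like comparison between English and local-language prompting.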

HUIZINGA

So you've created the benchmark. You've tested it. What were your major findings?

CHITALE

Now that we have created a benchmark, the next step is to evaluate how these multimodal models are performing on this benchmark. So we benchmark several state-of-the-art multimodal models, which include both open-source offerings like CLIP, BLIP, LLaVA-1.5, and proprietary offerings like GPT-4o or Gemini 1.5 Flash. So what we observed is there is a huge gap when it comes … in performance when we compare these proprietary offerings versus the open-source models. So GPT-4o was the highest-performing model with 75.4% accuracy on English prompts and 74.3% accuracy on local prompts.

However, the story is completely different when we go to open-source models. These open-source models significantly lag behind the proprietary models. And one key finding over these open-source models is that these models perform even worse when prompted in the native language when we compare it to prompting in English. This potentially highlights that these models lack multilingual understanding capabilities, which may be because multilingual training data is pretty scarce.

HUIZINGA

Yeah.

CHITALE

So LLaVA-1.5 turned out to be the best open-source model. So one thing to notice, LLaVA-1.5 performs well across a large set of English VQA benchmarks, but when it comes to cultural understanding, it is a pretty weak model. Further, we also did some ablations to understand if adding location-specific information to the textual prompts has some impact or not, but we identified that it does not result in any significant performance improvements.

Further, we also conducted a category-wise analysis. So, as we had mentioned, there are 10 categories to which these images belong. So what we observed is that certain categories, like people and everyday life, consistently saw higher accuracy across a large set of models. This may be likely due to abundance of human activity data in training datasets. However, when it comes to niche categories like cooking and food, pop culture, which are much more challenging, especially in local languages, these models struggle. Therefore, these are the kind of highly diverse cultural contexts which need improvement.
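
The comparisons described above (English versus local-language prompts, and the category-wise breakdown) come down to grouping per-question outcomes and computing accuracy within each group. The following is a minimal sketch of that bookkeeping, assuming each model prediction has already been scored as correct or incorrect. The `accuracy_by_group` helper, the record keys, and the four toy records are hypothetical and exist only to show the computation; they are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for records grouped by the given key.

    Each record is a dict with a boolean 'correct' field plus the
    grouping field (for example 'prompt_lang' or 'category').
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy, made-up outcomes used only to exercise the function.
records = [
    {"prompt_lang": "english", "category": "Cooking and food", "correct": True},
    {"prompt_lang": "local", "category": "Cooking and food", "correct": False},
    {"prompt_lang": "english", "category": "People and everyday life", "correct": True},
    {"prompt_lang": "local", "category": "People and everyday life", "correct": True},
]

by_prompt_language = accuracy_by_group(records, "prompt_lang")
by_category = accuracy_by_group(records, "category")
```

The same helper could be applied to a third grouping key, such as whether location information was added to the prompt, to mirror the ablation mentioned above.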

HUIZINGA

How’s this work going to make an impact outside the lab and in the real world?

CHITALE

CVQA is significant because it addresses a fundamental gap in how we evaluate vision-language and multimodal models today. While proprietary models are making impressive strides, open-source models, which are more accessible and easier to deploy, significantly lag behind in terms of cultural awareness and safety. So CVQA fills this gap and provides a much-needed benchmark to help us identify these gaps in the first place. So as to fix them, we first need to identify the gaps, and whether we are progressing or not can be captured by this benchmark.

So for the real world, this benchmark does have some far-reaching implications. Models which understand culture are not just technically better, but they would create interactions which are far more engaging, natural, and safe for users from diverse backgrounds. So this benchmark offers entirely new axis for improvement, cultural awareness, and linguistic diversity. Therefore, by improving a model's ability to handle culturally nuanced questions, CVQA ensures researchers and developers think beyond accuracy and also focus on cultural awareness and inclusivity before shipping these models into production.

HUIZINGA

Pranjal, what are the unanswered questions or unsolved problems in this field, and what do you plan to do about it?

CHITALE

So while CVQA makes some strides in addressing cultural and linguistic diversity, there is still much more to explore in this space. So this dataset only covers 31 languages and cultures, but this is just, like, a subset of the incredible diversity that exists globally. Many languages and cultures remain underrepresented, especially some of them are endangered or have limited digital resources. So expanding CVQA to include more of these languages would be a natural next step.

Secondly, CVQA just focuses on single-turn question-answer pairs. But in reality, human interaction is often multi-turn and conversational in nature. So a multi-turn version of CVQA could better simulate real-world use cases and challenge models to maintain cultural and contextual awareness over extended dialogues. Another interesting area is personalization. So it would be very interesting if we could teach models to adapt to a user's cultural background, preferences, or even regional nuances in real time. This remains a significant challenge, although this benchmark could help us move a step towards our broader goal. [MUSIC]

HUIZINGA

Well, Pranjal Chitale, this is super important research and thank you for joining us today. To our listeners, thanks for tuning in. If you're interested in learning more about this paper, you can find it at aka.ms/abstracts. You can also find it on arXiv and on the NeurIPS website. And if you're at NeurIPS, you can also go hear about it. See you next time on Abstracts!

[MUSIC FADES]
