
Answer relevancy metric is worse in languages other than English #1829

Closed
icejean opened this issue Jan 10, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@icejean

icejean commented Jan 10, 2025

I just tried ragas to evaluate my GraphRAG app in Chinese and found that the answer relevancy metric is worse for every question. The cause is that the generated question is in English, so the embeddings of the original question and the generated question are quite different. To address this, I modified the function in ~/ragas/prompt/pydantic_prompt.py to ask the LLM to output the generated question in Chinese, and it does work.

    def _generate_output_signature(self, indent: int = 4) -> str:
        return (
            f"Please return the output in a JSON format that complies with the "
            f"following schema as specified in JSON Schema and the generated question in Chinese:\n"
            f"{self.output_model.model_json_schema()}"
        )
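For context on why this helps: answer relevancy is essentially the mean cosine similarity between the embedding of the original user input and the embeddings of the questions the LLM generates back from the response, so an English generated question paired with a Chinese original scores low. A rough sketch of that computation (not the exact ragas internals):

# Rough sketch: mean cosine similarity between the original question embedding
# and the embeddings of questions generated from the answer.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy_score(original_q: np.ndarray, generated_qs: list[np.ndarray]) -> float:
    # Cross-language pairs (Chinese original vs. English generated) tend to score lower.
    return float(np.mean([cosine(original_q, g) for g in generated_qs]))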

But I know this function is not called only for this metric, and a solution is needed that supports all languages, so I'm writing the issue down here.
Best regards
Jean from China

icejean added the bug label Jan 10, 2025
@jjmachan
Member

hey there Jean 👋🏽

Have you tried adapting the metrics into Chinese? Is the score still going down with that?
https://docs.ragas.io/en/stable/howtos/customizations/metrics/_metrics_language_adaptation/
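From that guide, the adaptation flow looks roughly like the sketch below (a minimal sketch; llm_factory's default model and exact availability depend on the installed ragas version):

# Rough sketch of the metric-language-adaptation flow from the linked guide.
# The model name passed to llm_factory() is an assumption; check your ragas version.
import asyncio

from ragas.llms import llm_factory
from ragas.metrics import AnswerRelevancy

# Translate the metric's prompts into the target language, then install them back.
metric = AnswerRelevancy(strictness=3)
adapted = asyncio.run(
    metric.adapt_prompts(language="chinese", llm=llm_factory("gpt-4o-mini"))
)
metric.set_prompts(**adapted)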

@icejean
Author

icejean commented Jan 11, 2025

hey there Jean 👋🏽

Have you tried adapting the metrics into Chinese? Is the score still going down with that? https://docs.ragas.io/en/stable/howtos/customizations/metrics/_metrics_language_adaptation/

Not yet. I'm new to ragas and didn't know about that feature; I'll give it a try soon. Thanks for your reply. 8-)

@chungyuan

hey there Jean 👋🏽

Have you tried adapting the metrics into Chinese? Is the score still going down with that? https://docs.ragas.io/en/stable/howtos/customizations/metrics/_metrics_language_adaptation/

I would like to follow up on the questions raised by this author.

  1. When evaluating documents that contain multiple languages (e.g., English, Chinese, Japanese, etc.), it seems that "adapting metrics to target languages" cannot handle multiple target languages simultaneously. Is that correct?
  2. Given that large language models are being used, why are they still so sensitive to language differences?

@icejean
Author

icejean commented Jan 15, 2025

With ragas 0.2.11, the class is not found:

from ragas.metrics import SimpleCriteriaScoreWithReference

scorer = SimpleCriteriaScoreWithReference(
    name="course_grained_score", definition="Score 0 to 5 by similarity"
)

scorer.get_prompts()

Throws exception:

>>> from ragas.metrics import SimpleCriteriaScoreWithReference
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'SimpleCriteriaScoreWithReference' from 'ragas.metrics' (/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/ragas/metrics/__init__.py)

It's not mentioned in the metrics documentation either.
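In recent 0.2.x releases this metric may have been renamed; if so, the import below should work instead (an assumption to verify against the installed version's docs):

# Assumption: ragas 0.2.x may expose the criteria metric as SimpleCriteriaScore
# instead of SimpleCriteriaScoreWithReference; verify against your installed version.
from ragas.metrics import SimpleCriteriaScore

scorer = SimpleCriteriaScore(
    name="coarse_grained_score", definition="Score 0 to 5 by similarity"
)
scorer.get_prompts()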

@icejean
Author

icejean commented Jan 15, 2025

As for the answer relevancy metric, I wrote this code snippet:

# llm2 and embeddings are already initialized correctly
from ragas.metrics import AnswerRelevancy

answer_relevancy = AnswerRelevancy(
    name="answer_relevancy", strictness=3, embeddings=embeddings
)

prompts = answer_relevancy.get_prompts()
print(prompts)

import asyncio

async def adapt_prompt():
    adapted_prompts = await answer_relevancy.adapt_prompts(language="Chinese", llm=llm2)
    print(adapted_prompts)
    return adapted_prompts

# run adapt_prompt() with  asyncio.run()
adapted_prompts = asyncio.run(adapt_prompt())

Getting the prompts in English works fine:

>>> print(prompts)
>>> 
{'response_relevance_prompt': ResponseRelevancePrompt(instruction=Generate a question for the given answer and Identify if answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers, examples=[(ResponseRelevanceInput(response='Albert Einstein was born in Germany.'), ResponseRelevanceOutput(question='Where was Albert Einstein born?', noncommittal=0)), (ResponseRelevanceInput(response="I don't know about the  groundbreaking feature of the smartphone invented in 2023 as am unaware of information beyond 2022. "), ResponseRelevanceOutput(question='What was the groundbreaking feature of the smartphone invented in 2023?', noncommittal=1))], language=english)}

But calling the LLM to translate them into Chinese fails:

>>> import asyncio
>>> 
>>> async def adapt_prompt():
...     adapted_prompts = await answer_relevancy.adapt_prompts(language="Chinese", llm=llm2)
...     print(adapted_prompts)
...     return adapted_prompts
...     
... # run adapt_prompt() with  asyncio.run()
... adapted_prompts = asyncio.run(adapt_prompt())
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/nest_asyncio.py", line 30, in run
    return loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/asyncio/tasks.py", line 277, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "<string>", line 2, in adapt_prompt
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/ragas/prompt/mixin.py", line 77, in adapt_prompts
    adapted_prompt = await prompt.adapt(language, llm, adapt_instruction)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/ragas/prompt/pydantic_prompt.py", line 233, in adapt
    translated_strings = await translate_statements_prompt.generate(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/ragas/prompt/pydantic_prompt.py", line 128, in generate
    output_single = await self.generate_multiple(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/ragas/prompt/pydantic_prompt.py", line 189, in generate_multiple
    resp = await llm.generate(
                 ^^^^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 627, in generate
    batch_size=len(messages),
               ^^^^^^^^^^^^^
TypeError: object of type 'StringPromptValue' has no len()

Is there anything wrong in my code snippet?
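A likely cause, judging from the traceback: llm2 is a raw LangChain chat model, so ragas hands it a StringPromptValue that LangChain's generate() cannot batch into message lists. Wrapping it in ragas' LangchainLLMWrapper should fix that (a sketch reusing llm2 and answer_relevancy from the snippet above; the next comment confirms this approach):

from ragas.llms import LangchainLLMWrapper

# Wrap the raw LangChain chat model so ragas can call it through its own
# BaseRagasLLM interface instead of LangChain's generate() directly.
llm3 = LangchainLLMWrapper(llm2)
adapted_prompts = asyncio.run(
    answer_relevancy.adapt_prompts(language="chinese", llm=llm3)
)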

@icejean
Author

icejean commented Jan 16, 2025

Well, I got it working at last. Two key points:

  1. Use llm_factory() to create an LLM instance to translate the prompts, as mentioned in the document Adapting metrics to target language, or use LangchainLLMWrapper to wrap a LangChain base model.
  2. The target language is "chinese" in lowercase, not "Chinese" with a capital C.

# This LLM instance is used to run the assessments; loadLLM2 is my own function.
# llm2 = loadLLM2("OpenAI") # OpenAI, Ali

from ragas.metrics import faithfulness,context_entity_recall
from ragas import evaluate
from ragas.cost import get_token_usage_for_openai

from ragas.llms import LangchainLLMWrapper

# This LLM instance is used to translate the prompts into the target language.
llm3 = LangchainLLMWrapper(llm2)  # gpt-4o-mini

from ragas.metrics import AnswerRelevancy

answer_relevancy = AnswerRelevancy(
    name="answer_relevancy", strictness=3, embeddings=embeddings
)

prompts = answer_relevancy.get_prompts()

import asyncio

async def adapt_prompt():
    adapted_prompts = await answer_relevancy.adapt_prompts(language="chinese", llm=llm3)
    print(adapted_prompts)
    return adapted_prompts

# run adapt_prompt() with  asyncio.run()
adapted_prompts = asyncio.run(adapt_prompt())

answer_relevancy.set_prompts(**adapted_prompts)

# embeddings is initialized somewhere else
score = evaluate(
    dataset=dataset,
    # metrics=[faithfulness, answer_relevancy,context_entity_recall],
    metrics=[answer_relevancy],
    llm=llm2,
    embeddings=embeddings,
    token_usage_parser=get_token_usage_for_openai, # Only estimate token usage when using OpenAI.
)

# df = score.to_pandas()[['faithfulness', 'answer_relevancy','context_entity_recall']]
df = score.to_pandas()[['answer_relevancy']]
df

The adapted prompt:

... adapted_prompts = asyncio.run(adapt_prompt())
{'response_relevance_prompt': ResponseRelevancePrompt(instruction=Generate a question for the given answer and Identify if answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers, examples=[(ResponseRelevanceInput(response='阿尔伯特·爱因斯坦出生在德国。'), ResponseRelevanceOutput(question='阿尔伯特·爱因斯坦出生在哪里?', noncommittal=0)), (ResponseRelevanceInput(response='我不知道2023年发明的智能手机的突破性功能,因为我对2022年以后的信息一无所知。'), ResponseRelevanceOutput(question='2023年发明的智能手机的突破性功能是什么?', noncommittal=1))], language=chinese)}

The result:

>>> df = score.to_pandas()[['answer_relevancy']]
>>> df
   answer_relevancy
0          0.952645

icejean closed this as completed Jan 19, 2025
@jjmachan
Member

hey @icejean just wanted to check if you found the solution for the issue you were facing?

@icejean
Author

icejean commented Jan 22, 2025

hey @icejean just wanted to check if you found the solution for the issue you were facing?

Yes, it's addressed with the code snippet above, as mentioned in the document: Adapting metrics to target language.

@jjmachan
Member

awesome - great to hear that 🙂
