
Answer relevancy metric is worse in languages other than English #1829

Closed
icejean opened this issue Jan 10, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@icejean

icejean commented Jan 10, 2025

I just tried ragas to evaluate my GraphRAG app in Chinese and found that the answer relevancy metric is worse for every question. The cause is that the generated question is in English, so the embeddings of the original question and the generated question are quite different. To address this, I modified the function in ~/ragas/prompt/pydantic_prompt.py to ask the LLM to output the generated question in Chinese, and it does work.

    def _generate_output_signature(self, indent: int = 4) -> str:
        return (
            f"Please return the output in a JSON format that complies with the "
            f"following schema as specified in JSON Schema and the generated question in Chinese:\n"
            f"{self.output_model.model_json_schema()}"
        )
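For context on why this helps: answer relevancy is essentially the mean cosine similarity between the embedding of the original user input and the embeddings of the questions the LLM generates back from the response, so an English generated question paired with a Chinese original scores low. A rough sketch of that computation (not the exact ragas internals):

# Rough sketch: mean cosine similarity between the original question embedding
# and the embeddings of questions generated from the answer.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy_score(original_q: np.ndarray, generated_qs: list[np.ndarray]) -> float:
    # Cross-language pairs (Chinese original vs. English generated) tend to score lower.
    return float(np.mean([cosine(original_q, g) for g in generated_qs]))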

But I know this function is not called only for this metric, and a solution is needed that supports all languages, so I'm writing the issue down here.
Best regards
Jean from China

icejean added the bug label Jan 10, 2025
@jjmachan
Member

hey there Jean 👋🏽

Have you tried adapting the metrics into Chinese? Is the score still going down with that?
https://docs.ragas.io/en/stable/howtos/customizations/metrics/_metrics_language_adaptation/
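From that guide, the adaptation flow looks roughly like the sketch below (a minimal sketch; llm_factory's default model and exact availability depend on the installed ragas version):

# Rough sketch of the metric-language-adaptation flow from the linked guide.
# The model name passed to llm_factory() is an assumption; check your ragas version.
import asyncio

from ragas.llms import llm_factory
from ragas.metrics import AnswerRelevancy

# Translate the metric's prompts into the target language, then install them back.
metric = AnswerRelevancy(strictness=3)
adapted = asyncio.run(
    metric.adapt_prompts(language="chinese", llm=llm_factory("gpt-4o-mini"))
)
metric.set_prompts(**adapted)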

@icejean
Author

icejean commented Jan 11, 2025

hey there Jean 👋🏽

Have you tried adapting the metrics into Chinese? Is the score still going down with that? https://docs.ragas.io/en/stable/howtos/customizations/metrics/_metrics_language_adaptation/

Not yet. I'm new to ragas and didn't know about that feature; I'll give it a try soon. Thanks for your reply. 8-)

@chungyuan

hey there Jean 👋🏽

Have you tried adapting the metrics into Chinese? Is the score still going down with that? https://docs.ragas.io/en/stable/howtos/customizations/metrics/_metrics_language_adaptation/

I would like to follow up on the questions raised by this author.

  1. When evaluating documents that contain multiple languages (e.g., English, Chinese, Japanese, etc.), it seems that "adapting metrics to target languages" cannot handle multiple target languages simultaneously. Is that correct?
  2. Given that large language models are being used, why are they still so sensitive to language differences?

@icejean
Author

icejean commented Jan 15, 2025

With ragas 0.2.11, the class is not found:

from ragas.metrics import SimpleCriteriaScoreWithReference

scorer = SimpleCriteriaScoreWithReference(
    name="course_grained_score", definition="Score 0 to 5 by similarity"
)

scorer.get_prompts()

Throws exception:

>>> from ragas.metrics import SimpleCriteriaScoreWithReference
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'SimpleCriteriaScoreWithReference' from 'ragas.metrics' (/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/ragas/metrics/__init__.py)

It's not mentioned in the metrics documentation either.
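In recent 0.2.x releases this metric may have been renamed; if so, the import below should work instead (an assumption to verify against the installed version's docs):

# Assumption: ragas 0.2.x may expose the criteria metric as SimpleCriteriaScore
# instead of SimpleCriteriaScoreWithReference; verify against your installed version.
from ragas.metrics import SimpleCriteriaScore

scorer = SimpleCriteriaScore(
    name="coarse_grained_score", definition="Score 0 to 5 by similarity"
)
scorer.get_prompts()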

@icejean
Author

icejean commented Jan 15, 2025

As for the answer relevancy metric, I wrote this code snippet:

# llm2 and embeddings are already initialized correctly
from ragas.metrics import AnswerRelevancy

answer_relevancy = AnswerRelevancy(
    name="answer_relevancy", strictness=3, embeddings=embeddings
)

prompts = answer_relevancy.get_prompts()
print(prompts)

import asyncio

async def adapt_prompt():
    adapted_prompts = await answer_relevancy.adapt_prompts(language="Chinese", llm=llm2)
    print(adapted_prompts)
    return adapted_prompts

# run adapt_prompt() with  asyncio.run()
adapted_prompts = asyncio.run(adapt_prompt())

Getting the prompts in English works fine:

>>> print(prompts)
>>> 
{'response_relevance_prompt': ResponseRelevancePrompt(instruction=Generate a question for the given answer and Identify if answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers, examples=[(ResponseRelevanceInput(response='Albert Einstein was born in Germany.'), ResponseRelevanceOutput(question='Where was Albert Einstein born?', noncommittal=0)), (ResponseRelevanceInput(response="I don't know about the  groundbreaking feature of the smartphone invented in 2023 as am unaware of information beyond 2022. "), ResponseRelevanceOutput(question='What was the groundbreaking feature of the smartphone invented in 2023?', noncommittal=1))], language=english)}

But calling the LLM to translate them into Chinese fails:

>>> import asyncio
>>> 
>>> async def adapt_prompt():
...     adapted_prompts = await answer_relevancy.adapt_prompts(language="Chinese", llm=llm2)
...     print(adapted_prompts)
...     return adapted_prompts
...     
... # run adapt_prompt() with  asyncio.run()
... adapted_prompts = asyncio.run(adapt_prompt())
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/nest_asyncio.py", line 30, in run
    return loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/asyncio/tasks.py", line 277, in __step
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "<string>", line 2, in adapt_prompt
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/ragas/prompt/mixin.py", line 77, in adapt_prompts
    adapted_prompt = await prompt.adapt(language, llm, adapt_instruction)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/ragas/prompt/pydantic_prompt.py", line 233, in adapt
    translated_strings = await translate_statements_prompt.generate(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/ragas/prompt/pydantic_prompt.py", line 128, in generate
    output_single = await self.generate_multiple(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/ragas/prompt/pydantic_prompt.py", line 189, in generate_multiple
    resp = await llm.generate(
                 ^^^^^^^^^^^^^
  File "/usr/lib64/anaconda3/envs/pytorch/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 627, in generate
    batch_size=len(messages),
               ^^^^^^^^^^^^^
TypeError: object of type 'StringPromptValue' has no len()

Is there anything wrong in my code snippet?
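A likely cause, judging from the traceback: llm2 is a raw LangChain chat model, so ragas hands it a StringPromptValue that LangChain's generate() cannot batch into message lists. Wrapping it in ragas' LangchainLLMWrapper should fix that (a sketch reusing llm2 and answer_relevancy from the snippet above; the next comment confirms this approach):

from ragas.llms import LangchainLLMWrapper

# Wrap the raw LangChain chat model so ragas can call it through its own
# BaseRagasLLM interface instead of LangChain's generate() directly.
llm3 = LangchainLLMWrapper(llm2)
adapted_prompts = asyncio.run(
    answer_relevancy.adapt_prompts(language="chinese", llm=llm3)
)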

@icejean
Author

icejean commented Jan 16, 2025

Well, I got it working at last. Two key points:

  1. Use llm_factory() to create an LLM instance to translate the prompts, as mentioned in the document Adapting metrics to target language, or use LangchainLLMWrapper to wrap a LangChain base model.
  2. The target language is "chinese" in lowercase, not "Chinese" with a capital C.

# This LLM instance is used to run the assessments; loadLLM2 is my own function.
# llm2 = loadLLM2("OpenAI") # OpenAI, Ali

from ragas.metrics import faithfulness,context_entity_recall
from ragas import evaluate
from ragas.cost import get_token_usage_for_openai

from ragas.llms import LangchainLLMWrapper

# This LLM instance is used to translate the prompts into the target language.
llm3 = LangchainLLMWrapper(llm2)  # gpt-4o-mini

from ragas.metrics import AnswerRelevancy

answer_relevancy = AnswerRelevancy(
    name="answer_relevancy", strictness=3, embeddings=embeddings
)

prompts = answer_relevancy.get_prompts()

import asyncio

async def adapt_prompt():
    adapted_prompts = await answer_relevancy.adapt_prompts(language="chinese", llm=llm3)
    print(adapted_prompts)
    return adapted_prompts

# run adapt_prompt() with  asyncio.run()
adapted_prompts = asyncio.run(adapt_prompt())

answer_relevancy.set_prompts(**adapted_prompts)

# embeddings is initialized somewhere else
score = evaluate(
    dataset=dataset,
    # metrics=[faithfulness, answer_relevancy,context_entity_recall],
    metrics=[answer_relevancy],
    llm=llm2,
    embeddings=embeddings,
    token_usage_parser=get_token_usage_for_openai, # Only estimate token usage when using OpenAI.
)

# df = score.to_pandas()[['faithfulness', 'answer_relevancy','context_entity_recall']]
df = score.to_pandas()[['answer_relevancy']]
df

The adapted prompt:

... adapted_prompts = asyncio.run(adapt_prompt())
{'response_relevance_prompt': ResponseRelevancePrompt(instruction=Generate a question for the given answer and Identify if answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers, examples=[(ResponseRelevanceInput(response='阿尔伯特·爱因斯坦出生在德国。'), ResponseRelevanceOutput(question='阿尔伯特·爱因斯坦出生在哪里?', noncommittal=0)), (ResponseRelevanceInput(response='我不知道2023年发明的智能手机的突破性功能,因为我对2022年以后的信息一无所知。'), ResponseRelevanceOutput(question='2023年发明的智能手机的突破性功能是什么?', noncommittal=1))], language=chinese)}

The result:

>>> df = score.to_pandas()[['answer_relevancy']]
>>> df
   answer_relevancy
0          0.952645

icejean closed this as completed Jan 19, 2025
@jjmachan
Member

hey @icejean just wanted to check if you found the solution for the issue you were facing?

@icejean
Author

icejean commented Jan 22, 2025

hey @icejean just wanted to check if you found the solution for the issue you were facing?

Yes, it's addressed with the code snippet above, as mentioned in the document: Adapting metrics to target language.

@jjmachan
Member

awesome - great to hear that 🙂
