Incomplete search results? #16

Open
gitiray opened this issue Apr 13, 2022 · 1 comment

gitiray commented Apr 13, 2022

Bug description
Search results are incomplete: some keywords return no results, or only a few.

Steps to reproduce
Steps to reproduce the bug:

  1. tg_searcher version used:
Latest Docker image 53bbaf311b3b
  2. Configuration file used:
common:
  name: name1
  runtime_dir: /app/config/tg_searcher_data
  proxy:
  api_id: xxx
  api_hash: xxx
 
sessions:
  - name: session1
    phone: 'xxx'
 
backends:
  - id: backend1
    use_session: session1
    config:
      monitor_all: false
 
frontends:
  - type: bot
    id: private
    use_backend: backend1
    config:
      admin_id: xxx
      bot_token: xxx
      page_len: 10
      redis: redis:6379
      private_mode: true
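  3. To keep unrelated content out of the logs, monitor_all is turned off, so the test chat is monitored manually by sending the bot: /monitor_chat <test chat ID>

  4. Send four test messages to the test chat group:

  • 我知道
  • 我会知道
  • 我知道了
  • 我想知了

  5. Search the bot for the following keywords:

  • 知道

Expected behavior

  • Keyword "知": should match messages 1 to 4, but only message 4 is returned ❌
  • Keyword "道": should match messages 1 to 3, but no results are returned ❌
  • Keyword "知道": should match messages 1 to 3, and messages 1 to 3 are returned ✔️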
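
The expectation above amounts to plain substring matching over the message text. A minimal sketch of that reading, using only the four test messages from the reproduction steps (the `messages` dict and the `expected_hits` helper are hypothetical names for illustration, not part of tg_searcher):

```python
# The four test messages from the reproduction steps, keyed by message number.
messages = {1: "我知道", 2: "我会知道", 3: "我知道了", 4: "我想知了"}


def expected_hits(keyword):
    """Return the numbers of the messages that contain `keyword` as a substring."""
    return [num for num, text in messages.items() if keyword in text]


for keyword in ("知", "道", "知道"):
    print(keyword, expected_hits(keyword))
```

Under this reading, "知" matches all four messages, while "道" and "知道" each match messages 1 to 3.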

Logs

INFO:bot-backend:Init backend bot
INFO:bot-frontend:private:Start init frontend bot
INFO:bot-frontend:private:Start login to bot
INFO:telethon.network.mtprotosender:Connecting to :443/TcpFull...
INFO:telethon.network.mtprotosender:Connection to :443/TcpFull complete!
INFO:bot-frontend:private:Bot account login ok
INFO:bot-frontend:private:Register bot commands ok
INFO:root:Initialization ok
INFO:bot-frontend:private:Admin xxx sends "/monitor_chat xxx"
INFO:bot-frontend:private:add xxx to monitored_chat
/usr/local/lib/python3.9/site-packages/tg_searcher/frontend_bot.py:311: RuntimeWarning: coroutine 'BackendBot.format_dialog_html' was never awaited
  await self._admin_msg_handler(event)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
INFO:bot-backend:New msg https://t.me/c/xxxxx from "xxx": "我知道"
Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
DEBUG:jieba:Loading model from cache /tmp/jieba.cache
Loading model cost 1.490 seconds.
DEBUG:jieba:Loading model cost 1.490 seconds.
Prefix dict has been built successfully.
DEBUG:jieba:Prefix dict has been built successfully.
INFO:bot-backend:New msg https://t.me/c/xxxxx from "xxx": "我会知道"
INFO:bot-backend:New msg https://t.me/c/xxxxx from "xxx": "我知道了"
INFO:bot-backend:New msg https://t.me/c/xxxxx from "xxx": "我想知了"
INFO:bot-frontend:private:Admin xxx sends "知"
INFO:bot-frontend:private:User xxx (in xxx) sends "知"
start search
INFO:bot-frontend:private:Search "知" in chats None
INFO:bot-frontend:private:Admin xxx sends "道"
INFO:bot-frontend:private:User xxx (in xxx) sends "道"
start search
INFO:bot-frontend:private:Search "道" in chats None
INFO:bot-frontend:private:Admin xxx sends "知道"
INFO:bot-frontend:private:User xxx (in xxx) sends "知道"
start search
INFO:bot-frontend:private:Search "知道" in chats None
...

SharzyL (Owner) commented Apr 13, 2022

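This is a long-standing problem. The cause is that whoosh, the library the backend uses, stores messages in an inverted index and searches by looking terms up in that index. 「我会知道」 is tokenized as ["我", "会", "知道"], so the index stores the message id under the three terms 我, 会, and 知道. Of the keywords tested, only 「知道」 is one of those terms, so only it finds the message through the index. Fixing this requires changing the storage logic, which is fairly troublesome.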
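
The behavior described above can be reproduced with a toy inverted index in plain Python. This is only a sketch of the idea, not whoosh's actual implementation: `build_index` and `search` are hypothetical helpers, and of the token lists below only the split of 「我会知道」 comes from this thread. The other three are assumptions chosen to be consistent with the results reported in the issue.

```python
from collections import defaultdict

# Assumed tokenizer output per message number. Only the split of message 2
# ("我会知道") is stated in this thread; the rest are guesses that agree with
# the reported search results.
ASSUMED_TOKENS = {
    1: ["我", "知道"],
    2: ["我", "会", "知道"],
    3: ["我", "知道", "了"],
    4: ["我", "想", "知", "了"],
}


def build_index(tokens_by_id):
    """Map each token to the set of message numbers that contain it."""
    index = defaultdict(set)
    for msg_id, tokens in tokens_by_id.items():
        for token in tokens:
            index[token].add(msg_id)
    return index


def search(index, keyword):
    """Exact term lookup: the keyword must equal a stored token."""
    return sorted(index.get(keyword, set()))


index = build_index(ASSUMED_TOKENS)
for keyword in ("知", "道", "知道"):
    print(keyword, search(index, keyword))
```

With these assumed tokens, "知" finds only message 4, "道" finds nothing, and "知道" finds messages 1 to 3, which matches the report: a keyword is found only if it was stored as a whole token.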