This repo consists of four parts:
- Code for crawling paper abstracts, keywords, authors, and titles from Baidu Academic and CNKI using a user-defined query, forked from this website with some modifications
- Code to extract new words from the information collected in the previous step
- Code to crawl explanations of the new words from Wikipedia and BaiduBaike
- A dialogue bot to explain the new words
There should be a `data/` folder where the crawled files will be stored in `.pkl` format.
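For a quick sanity check, a crawled file can be inspected with the standard `pickle` module. This is a minimal sketch; the file name below is a placeholder, and the exact record layout depends on the spider that produced it:

```python
import pickle

# Placeholder path -- actual file names depend on your queries.
with open("data/example_query.pkl", "rb") as f:
    records = pickle.load(f)

# The records are assumed to hold abstract/keyword/author/title fields.
print(type(records))
```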
Requirements:
```
selenium == 3.141.0
urllib3 == 1.24.3
openpyxl == 3.0.5
smoothNLP == 0.4.0
ltp == 4.0.10
torch == 1.6.0
elasticsearch == 7.10.0
transformers == 3.3.1
```
You can install them with `pip install -r requirement.txt`. A Chrome WebDriver will also be needed for `selenium`.
You can construct your own `queries.xlsx` to crawl the specific sites you want; just follow the format below. You can also generate the file automatically with `openpyxl` using the code snippet below, replacing the `terms` as you wish:
```python
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
sheet.title = "Sheet1"
terms = ["computer architecture", "Software Engineering"]
for each in terms:
    # One row per query: the term and its Baidu Academic search URL.
    url = "https://xueshu.baidu.com/s?wd=" + each
    sheet.append([each, url])
wb.save(r'queries.xlsx')
```
Please note that there should not be any blank rows. The URLs can include search filters: apply the filter in your browser and copy the resulting URL into the sheet.
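If you want to verify the file before crawling, a minimal sketch like the following will do; it only assumes the two-column term/URL layout shown above:

```python
from openpyxl import load_workbook

wb = load_workbook("queries.xlsx")
sheet = wb.active
for row in sheet.iter_rows(values_only=True):
    term, url = row
    # A blank row would break the crawler, so fail early.
    assert term and url, "queries.xlsx must not contain blank rows"
    print(term, "->", url)
```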
Usage:
Paper Crawling
To crawl papers from Baidu Academic, run `python3 baidySpider.py`
To crawl papers from CNKI, run `python3 word.py`
Extract New Words from Crawled Papers
To extract new words from the data, run `python3 extract_new_words.py`. You will need to prepare a word list to filter out words that are not newly coined; you are free to remove the filtering step if you prefer.
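The filtering idea is simply set subtraction against the prepared word list. A minimal sketch, where the file names `known_words.txt` and `candidate_words.txt` are hypothetical and the real script may organize this differently:

```python
# Hypothetical file names; the actual extraction script may differ.
with open("known_words.txt", encoding="utf-8") as f:
    known = {line.strip() for line in f if line.strip()}

with open("candidate_words.txt", encoding="utf-8") as f:
    candidates = [line.strip() for line in f if line.strip()]

# Keep only words absent from the prepared word list.
new_words = [w for w in candidates if w not in known]
print(f"{len(new_words)} candidate new words kept")
```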
Crawl Explanations from Wikipedia and BaiduBaike
To crawl Wikipedia using the keywords from the steps above, run `python3 crawlwikipedia.py`
To crawl BaiduBaike using the keywords from the steps above, run `python3 crawlbaidupedia.py`
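As a rough illustration of what such a crawl retrieves, here is a sketch that fetches a summary from Wikipedia's public REST endpoint; this is not necessarily how `crawlwikipedia.py` itself works:

```python
import json
import urllib.parse
import urllib.request

def wiki_summary(term: str) -> str:
    # Public REST endpoint; the repo's own crawler may work differently.
    url = ("https://en.wikipedia.org/api/rest_v1/page/summary/"
           + urllib.parse.quote(term))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp).get("extract", "")

print(wiki_summary("Software engineering"))
```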
Run Dialogue Bot
You should install and run `elasticsearch` before running the Python script.
For macOS users, we recommend installing it with `brew install elasticsearch`, then typing `elasticsearch` in your command line; the service will then be running on your local machine.
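You can check that the service is reachable before starting the bot. A minimal sketch using the `elasticsearch` Python client pinned above, assuming the default local port 9200:

```python
from elasticsearch import Elasticsearch

# Default local deployment; adjust the host/port if yours differs.
es = Elasticsearch(["http://localhost:9200"])
print("Elasticsearch is up" if es.ping() else "Elasticsearch is not reachable")
```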
First-time run: run `python dialog_bot.py --json_file [yourfile]` to create the model file, which takes a while. The model is the BERT representation of all the keywords. The format of the JSON file can be found in `crawlbaidupedia.py` at lines 51-55.
After that, run `python dialog_bot.py --model_file [your model]` to chat interactively with the bot.
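For intuition, here is a minimal sketch of how BERT representations of keywords can be built and matched with `transformers`; it mirrors the idea described above, not necessarily the exact code in `dialog_bot.py`:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def embed(text: str) -> torch.Tensor:
    # Use the [CLS] vector as a sentence-level representation.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs[0][:, 0, :].squeeze(0)

keywords = ["computer architecture", "software engineering"]  # example keywords
vectors = {kw: embed(kw) for kw in keywords}

# Match a user query to the closest keyword by cosine similarity.
query_vec = embed("what is software engineering")
best = max(vectors, key=lambda kw: torch.cosine_similarity(
    query_vec.unsqueeze(0), vectors[kw].unsqueeze(0)).item())
print(best)
```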
Enjoy chatting!