We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
示例:http://blog.csdn.net/levy_cui/article/details/51481306 去除后Content仍包含<div id="article_content" class="article_content">以及<pre code_snippet_id="1693397" snippet_file_name="blog_20160523_1_4170383" name="code" class="python">:
<div id="article_content" class="article_content">
<pre code_snippet_id="1693397" snippet_file_name="blog_20160523_1_4170383" name="code" class="python">
<div id="article_content" class="article_content"><span> </span><p>架构<br></p><span> </span><p><img src="http://img.blog.csdn.net/20160523141938618" alt=""><br></p><span> </span><p>基于行块分布函数的通用网页正文抽取</p><span> http://wenku.baidu.com/link?url=TOBoIHWT_k68h5z8k_Pmqr-wJMPfCy2q64yzS8hxsgTg4lMNH84YVfOCWUfvfORTlccMWe5Bd1BNVf9dqIgh75t4VQ728fY2Rte3x3CQhaS</span><br><span> </span><br><span> 网页正文及内容图片提取算法</span><br><span> </span><p>http://www.jianshu.com/p/d43422081e4b</p><span> </span><span> </span><p>这一算法的主要原理基于两点:<br>正文区密度:在去除HTML中所有tag之后,正文区字符密度更高,较少出现多行空白;<br>行块长度:非正文区域的内容一般单独标签(行块)中较短。<br></p><span> 测试源码:</span><br><span> https://github.com/rainyear/cix-extractor-py/blob/master/extractor.py#L9</span><br><span> </span><span> </span><pre code_snippet_id="1693397" snippet_file_name="blog_20160523_1_4170383" name="code" class="python">#! /usr/bin/env python3 # -*- coding: utf-8 -*- import requests as req import re DBUG = 0 reBODY = r'<body.*?>([\s\S]*?)<\/body>' reCOMM = r'<!--.*?-->' reTRIM = r'<{0}.*?>([\s\S]*?)<\/{0}>' reTAG = r'<[\s\S]*?>|[ \t\r\f\v]' reIMG = re.compile(r'<img[\s\S]*?src=[\'|"]([\s\S]*?)[\'|"][\s\S]*?>') class Extractor(): def __init__(self, url = "", blockSize=3, timeout=5, image=False): self.url = url self.blockSize = blockSize self.timeout = timeout self.saveImage = image self.rawPage = "" self.ctexts = [] self.cblocks = [] def getRawPage(self): try: resp = req.get(self.url, timeout=self.timeout) except Exception as e: raise e if DBUG: print(resp.encoding) resp.encoding = "UTF-8" return resp.status_code, resp.text #去除所有tag,包括样式、Js脚本内容等,但保留原有的换行符\n: def processTags(self): self.body = re.sub(reCOMM, "", self.body) self.body = re.sub(reTRIM.format("script"), "" ,re.sub(reTRIM.format("style"), "", self.body)) # self.body = re.sub(r"[\n]+","\n", re.sub(reTAG, "", self.body)) self.body = re.sub(reTAG, "", self.body) #将网页内容按行分割,定义行块 blocki 为第 [i,i+blockSize] 行文本之和并给出行块长度基于行号的分布函数: def processBlocks(self): self.ctexts = self.body.split("\n") self.textLens = [len(text) for text in self.ctexts] self.cblocks = [0]*(len(self.ctexts) - self.blockSize - 1) lines = len(self.ctexts) for i in range(self.blockSize): self.cblocks = list(map(lambda x,y: x+y, self.textLens[i : lines-1-self.blockSize+i], self.cblocks)) maxTextLen = max(self.cblocks) if DBUG: print(maxTextLen) self.start = self.end = self.cblocks.index(maxTextLen) while self.start > 0 and self.cblocks[self.start] > min(self.textLens): self.start -= 1 while self.end < lines - self.blockSize and self.cblocks[self.end] > min(self.textLens): self.end += 1 return "".join(self.ctexts[self.start:self.end]) #如果需要提取正文区域出现的图片,只需要在第一步去除tag时保留<img>标签的内容: def processImages(self): self.body = reIMG.sub(r'{{\1}}', self.body) #正文出现在最长的行块,截取两边至行块长度为 0 的范围: def getContext(self): code, self.rawPage = self.getRawPage() self.body = re.findall(reBODY, self.rawPage)[0] if DBUG: print(code, self.rawPage) if self.saveImage: self.processImages() self.processTags() return self.processBlocks() # print(len(self.body.strip("\n"))) if __name__ == '__main__': ext = Extractor(url="http://blog.rainy.im/2015/09/02/web-content-and-main-image-extractor/",blockSize=5, image=False) print(ext.getContext())</pre><br><span> </span><p>总结<br>以上算法基本可以应对大部分(中文)网页正文的提取,针对有些网站正文图片多于文字的情况,可以采用保留<img> 标签中图片链接的方法,增加正文密度。目前少量测试发现的问题有:1)文章分页或动态加载的网页;2)评论长度过长喧 宾夺主的网页。<br></p><span> </span><span> </span><p>web正文提取接口使用说明<br></p><span> </span><p>http://www.weixinxi.wang/open/extract.html</p><span> </span><span> </span><p>Newspaper/python-readability库也可以实现<br></p><span> </span><span> </span><pre code_snippet_id="1693397" snippet_file_name="blog_20160620_2_3856204" name="code" class="python">#!/usr/bin/python # -*- coding:UTF-8 -*- from newspaper import Article url = 'http://www.cankaoxiaoxi.com/roll10/20160619/1197379.shtml' article = Article(url, language='zh') article.download() article.parse() print(article.text) </pre><span>https://github.com/codelucas/newspaper</span><br><span> </span><span> </span></div>
The text was updated successfully, but these errors were encountered:
No branches or pull requests
示例:http://blog.csdn.net/levy_cui/article/details/51481306
去除后Content仍包含
<div id="article_content" class="article_content">
以及<pre code_snippet_id="1693397" snippet_file_name="blog_20160523_1_4170383" name="code" class="python">
:The text was updated successfully, but these errors were encountered: