This paper adopts a topic-specific Internet information collection strategy that combines general-purpose search engines with vertical search engines, and proposes an anti-blocking solution for web crawling that combines several anti-blocking techniques. It improves a text-density-based method for extracting the main body text of web pages, and implements content-based title deduplication using a vector space model built on word segmentation together with the cosine formula for the angle between vectors. Finally, it designs a topic-specific Internet information collection system for overseas Chinese affairs.
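The title deduplication step can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `threshold` value of 0.8, raw term counts as vector weights, and whitespace tokenization (standing in for a real Chinese word segmenter) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(title):
    # Assumption: whitespace splitting stands in for a real word segmenter.
    return title.lower().split()


def cosine_similarity(tokens_a, tokens_b):
    # Represent each title as a term-frequency vector (vector space model).
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the terms the two titles share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty title is treated as similar to nothing
    return dot / (norm_a * norm_b)


def deduplicate_titles(titles, threshold=0.8):
    # Keep a title only if it is below the (hypothetical) similarity
    # threshold against every title kept so far.
    kept, kept_tokens = [], []
    for title in titles:
        tokens = tokenize(title)
        if all(cosine_similarity(tokens, prev) < threshold for prev in kept_tokens):
            kept.append(title)
            kept_tokens.append(tokens)
    return kept


if __name__ == "__main__":
    sample = [
        "overseas Chinese policy news update",
        "Overseas Chinese policy news update today",
        "weather report for Fuzhou",
    ]
    print(deduplicate_titles(sample))
```

In this sketch each new title is compared against every title already kept, which is quadratic in the number of titles. A production crawler would likely need an index or hashing scheme to scale, and might use weighted terms rather than raw counts.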