Poor matching on blogs and on pages with too much text, images, or script code #1

Open
GoogleCodeExporter opened this issue Mar 15, 2015 · 6 comments

Comments

@GoogleCodeExporter

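The cx-extractor algorithm is good and offers a new line of attack. My own earlier approach was to parse out every TABLE and DIV block in the page and judge each block by how much text it contained.

I implemented the cx-extractor algorithm in C# and ran into the following problems: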

1. preProcess does not filter out tags that contain script, such as the IMG tags on this page:

   http://developer.51cto.com/art/201012/236066.htm

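2. Would you consider two further improvements, applied as additional filtering passes after the first match fails?
   1. Body text is usually wrapped in DIV or TABLE (TR/TD) elements. Replace these tags with special markers, and use the markers as a reference boundary when merging lines and blocks.
   2. On pages like the one below, the body uses <p> heavily. The tags inside each P can be stripped out, and runs of consecutive P tags counted.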

   http://hi.baidu.com/jrckkyy/blog/item/a0c70a995e3579196f068c4e.html

3. Results on blogs are still not ideal:

   http://www.cnblogs.com/zhoujg/archive/2010/12/04/1895887.html 
   http://sarin.javaeye.com/blog/830831
   http://blog.sina.com.cn/s/blog_4c4fd3070100nbvt.html?tj=1

4. This news article also seems to go wrong:

   http://news.sina.com.cn/c/2010-12-04/100718432475s.shtml  

Original issue reported on code.google.com by [email protected] on 4 Dec 2010 at 3:07
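
The two ideas in suggestion 2 can be sketched roughly as follows. This is a simplified illustration in Python, not code from cx-extractor: the sentinel `BOUNDARY` and the function names are hypothetical, the tags are matched with plain regular expressions, and the container markers are used here simply to split the page into candidate blocks rather than as one signal inside the line and block merging step.

```python
import re

# Hypothetical sentinel: any string that cannot occur in the page would do.
BOUNDARY = "\x00"
CONTAINER_TAG = re.compile(r"</?(?:div|table|tr|td)\b[^>]*>", re.IGNORECASE)
P_ELEMENT = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]*>")


def _text(fragment):
    """Drop remaining tags and collapse whitespace to single spaces."""
    return " ".join(ANY_TAG.sub(" ", fragment).split())


def candidate_blocks(html):
    """Suggestion 2.1: treat DIV/TABLE/TR/TD tags as block boundaries."""
    marked = CONTAINER_TAG.sub(BOUNDARY, html)
    blocks = [_text(chunk) for chunk in marked.split(BOUNDARY)]
    return [b for b in blocks if b]


def longest_p_run(html):
    """Suggestion 2.2: find the longest run of back-to-back <p> elements."""
    best, run, prev_end = [], [], None
    for m in P_ELEMENT.finditer(html):
        # A run continues only if nothing but whitespace separates two <p>s.
        if prev_end is not None and html[prev_end:m.start()].strip():
            run = []
        run.append(_text(m.group(1)))
        if len(run) > len(best):
            best = list(run)
        prev_end = m.end()
    return best
```

Either function could serve as a fallback pass when the first extraction attempt returns nothing usable, for example by taking `max(candidate_blocks(html), key=len)`.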

@GoogleCodeExporter

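A follow-up: part of the poor matching came from not specifying the encoding.

The main remaining problem is noise removal; merging and boundary detection still have some issues.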

Original comment by [email protected] on 4 Dec 2010 at 3:33
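
On the encoding point, one way to avoid feeding mis-decoded text to the extractor is to decode the raw bytes explicitly. The sketch below is an assumption about how that could be done, not what the reporter's C# code or cx-extractor does: `decode_page` and its fallback list are hypothetical, and it only looks for a charset declared in a `<meta>` tag near the top of the page.

```python
import re

# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE
)


def decode_page(raw, fallbacks=("utf-8", "gb18030")):
    """Decode page bytes using the declared charset, then the fallbacks."""
    declared = META_CHARSET.search(raw[:4096])
    candidates = []
    if declared:
        candidates.append(declared.group(1).decode("ascii"))
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess: try the next one
    # Last resort: keep going with replacement characters.
    return raw.decode("utf-8", errors="replace")
```

The fallback order is a guess suited to Chinese pages like the ones linked in this issue; an HTTP `Content-Type` header, where available, would normally take priority over the `<meta>` tag.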

@GoogleCodeExporter

Fixed problem 1 that you reported. All the other links you gave now extract to an acceptable result. Try the latest version of demo/Java/TextExtract.java

Original comment by [email protected] on 14 Mar 2011 at 7:58
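
For readers wondering what problem 1 can look like in practice, here is a minimal sketch. It is not the change made in TextExtract.java, which this thread does not show. It assumes the failure comes from a tag whose attribute holds inline script containing a `>` character, which a naive `<[^>]*>` pattern cuts short; the function name `pre_process` and the sample markup are illustrative only.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
SCRIPT_OR_STYLE = re.compile(
    r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# A tag may contain quoted attribute values, and those values may contain '>'.
TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")


def pre_process(html):
    """Remove comments, script/style elements, then every remaining tag."""
    html = COMMENT.sub("", html)
    html = SCRIPT_OR_STYLE.sub("", html)
    return TAG.sub("", html)
```

With a naive tag pattern, an element such as `<img onload="if (this.width > 500) ...">` would leave the tail of the handler in the output; the quoted-value alternatives in `TAG` consume the whole tag instead.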

@GoogleCodeExporter

Java's jsoup and htmlParse already provide a lot of this functionality...

Original comment by [email protected] on 30 Jan 2012 at 6:58

@GoogleCodeExporter

All the spaces in English text get stripped out.

Original comment by ygmpkk on 7 Feb 2012 at 3:10

@GoogleCodeExporter

Is there a version that can handle English? The spaces all disappear...

Original comment by [email protected] on 25 Mar 2012 at 2:20
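
The two reports above suggest that whitespace is being removed from the text that is returned, not only from the text that is measured. A possible workaround, sketched here under the assumption that the extractor deletes whitespace from each line before counting characters (the thread does not show the source), is to keep two views of every line: one for scoring and one for output. The function names are hypothetical.

```python
import re

WHITESPACE = re.compile(r"\s+")


def line_weight(line):
    """Number of non-whitespace characters; used only for scoring a line."""
    return len(WHITESPACE.sub("", line))


def clean_line(line):
    """Text to emit: whitespace runs collapsed to one space, not deleted."""
    return WHITESPACE.sub(" ", line).strip()
```

Scoring with `line_weight` keeps the density measure unchanged for Chinese text, while emitting `clean_line` preserves the word breaks that English needs.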
