Volume 1 Number 1 (May 2012)
IJCCE 2012 Vol.1(1): 13-15 ISSN: 2010-3743
DOI: 10.7763/IJCCE.2012.V1.4

A Method Based on Statistics for Extracting Text from Web Pages

Wang Nan and Xiao Chun
Abstract—This paper presents a statistics-based method for extracting text from news web pages. First, the web document is converted into a DOM tree according to its HTML tags. Next, a feature vector is constructed for each leaf node from punctuation statistics. The similarity between leaf nodes, and from it the weight of each node, is then calculated using these feature vectors. Finally, text is extracted from the leaf nodes whose weights exceed a threshold value. The results show that the method achieves good accuracy in text extraction and better generality.

Index Terms—Information extraction, DOM tree, statistical characteristics, similarity.
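
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the punctuation features, the similarity measure, the weighting scheme, or the threshold, so everything below is an assumption made for illustration: the feature vector counts three punctuation groups, the similarity is cosine similarity, each leaf's weight is its similarity to the leaf with the most punctuation, and the threshold of 0.8 is arbitrary. The names `LeafTextCollector`, `feature_vector`, `cosine_similarity`, and `extract_text` are hypothetical, and the sketch collects text nodes (the leaves of the DOM tree) directly rather than materializing the full tree.

```python
import math
from html.parser import HTMLParser


class LeafTextCollector(HTMLParser):
    """Collects non-empty text nodes, which are the leaves of the DOM tree."""

    SKIPPED_TAGS = {"script", "style"}  # assumed to never hold article text

    def __init__(self):
        super().__init__()
        self.leaves = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.leaves.append(text)


def feature_vector(text):
    """Assumed features: counts of commas, full stops, and other marks."""
    commas = sum(text.count(c) for c in ",\uff0c\u3001")
    stops = sum(text.count(c) for c in ".\u3002")
    others = sum(text.count(c) for c in ";:!?\uff1b\uff1a\uff01\uff1f")
    return [commas, stops, others]


def cosine_similarity(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def extract_text(html, threshold=0.8):
    """Keeps leaves whose weight (similarity to the reference leaf) passes the threshold."""
    collector = LeafTextCollector()
    collector.feed(html)
    collector.close()
    leaves = collector.leaves
    vectors = [feature_vector(text) for text in leaves]
    if not vectors or max(sum(v) for v in vectors) == 0:
        return []  # no leaf contains any punctuation
    # Assumption: the leaf with the most punctuation is representative body text.
    reference = max(vectors, key=sum)
    return [
        text
        for text, vector in zip(leaves, vectors)
        if cosine_similarity(vector, reference) >= threshold
    ]
```

Calling `extract_text(page_html)` returns the retained text strings in document order. Under these assumptions, navigation links and short labels without punctuation receive a weight of zero and are dropped, while punctuation-rich paragraphs are kept.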

Wang Nan (e-mail: 582952853@qq.com, tel.: +8615200321822).


Cite: Wang Nan and Xiao Chun, "A Method Based on Statistics for Extracting Text from Web Pages," International Journal of Computer and Communication Engineering, vol. 1, no. 1, pp. 13-15, 2012.

General Information

ISSN: 2010-3743
Frequency: Quarterly
Editor-in-Chief: Dr. Maode Ma
Abstracting/Indexing: EI (INSPEC, IET), Google Scholar, Crossref, ProQuest, and Electronic Journals Library
E-mail: ijcce@iap.org