Volume 1 Number 1 (May 2012)
IJCCE 2012 Vol.1(1): 13-15 ISSN: 2010-3743
DOI: 10.7763/IJCCE.2012.V1.4

A Method Based on Statistics for Extracting Text from Web Pages

Wang Nan and Xiao Chun
Abstract—This paper presents a statistics-based method for extracting text from news web pages. First, the web document is parsed into a DOM tree according to its HTML tags. Next, a feature vector is constructed for each leaf node from punctuation statistics. Then the similarity between pairs of leaf nodes, and a weight for each leaf node, are calculated from the feature vectors. Finally, text is extracted from the leaf nodes whose weights exceed a threshold value. The results show that the method extracts text accurately and has good generality.

Index Terms—Information extraction, DOM tree, statistical characteristics, similarity.
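
The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact features, similarity measure, or weighting formula, so everything specific here is an assumption made for illustration: the punctuation set in `PUNCTUATION`, the use of cosine similarity, the choice of the most punctuation-rich leaf as a reference ("seed") node, the weight defined as similarity to that seed, the default threshold of 0.5, and all function and class names. It is not the authors' implementation.

```python
import math
from html.parser import HTMLParser

# Assumed feature set: one count per punctuation mark (not taken from the paper).
PUNCTUATION = ",.;:!?"
# Tags whose text content is never page text.
SKIP_TAGS = {"script", "style"}


class LeafTextCollector(HTMLParser):
    """Walks the tag structure and collects the text of each leaf (text) node."""

    def __init__(self):
        super().__init__()
        self.leaves = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.leaves.append(text)


def feature_vector(text):
    """Punctuation statistics for one leaf: a count per mark in PUNCTUATION."""
    return [text.count(mark) for mark in PUNCTUATION]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def extract_text(html, threshold=0.5):
    """Return leaf texts whose weight reaches the threshold.

    Assumption: a leaf's weight is its cosine similarity to the leaf with the
    most punctuation marks, which is taken as a likely body-text paragraph.
    """
    collector = LeafTextCollector()
    collector.feed(html)
    collector.close()
    vectors = [feature_vector(text) for text in collector.leaves]
    if not vectors:
        return []
    seed = max(vectors, key=sum)
    return [
        text
        for text, vector in zip(collector.leaves, vectors)
        if cosine_similarity(vector, seed) >= threshold
    ]
```

Calling `extract_text(page_html)` on a news page returns the text of the leaf nodes judged to be body text. Under these assumptions, navigation links and copyright lines usually contain no punctuation, so their feature vectors are zero and they receive a weight of 0, while paragraphs with a comma-and-period profile similar to the seed are kept.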

Wang Nan (e-mail: 582952853@qq.com, tel.: +8615200321822).


Cite: Wang Nan and Xiao Chun, "A Method Based on Statistics for Extracting Text from Web Pages," International Journal of Computer and Communication Engineering, vol. 1, no. 1, pp. 13-15, 2012.
