New approaches to automatic headline generation for Arabic documents

Fahad Alotaiby; Salah Foda; Ibrahim Alkharashi

doi:10.5897/JECI11.053

Journal of Engineering and Computer Innovations

Abbreviation: J. Eng. Comput. Innov.
Language: English
ISSN: 2141-6508
DOI: 10.5897/JECI
Start Year: 2010
Published Articles: 32

Full Length Research Paper

Fahad Alotaiby¹*, Salah Foda¹ and Ibrahim Alkharashi²

¹Department of Electrical Engineering, College of Engineering, King Saud University, Riyadh, Saudi Arabia.
²Computer Research Institute, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia.

Article Number - 3D60A928637
Vol. 3(1), pp. 11-25, February 2012
https://doi.org/10.5897/JECI11.053

Accepted: 23 September 2011
Published: 28 February 2012

Copyright © 2024 Author(s) retain the copyright of this article.
This article is published under the terms of the Creative Commons Attribution License 4.0.

Abstract

A headline is a condensed summary of a document. Demand for automatic headline generation has grown with the need to handle very large numbers of documents, a tedious and time-consuming task. Instead of reading every document, a reader can use the headline to decide which documents contain important and relevant information. There are two major approaches to automatic headline generation. The first is linguistic and draws on knowledge of the structure of the language itself. The second is statistical and comprises all quantitative approaches to automated language processing. However, Arabic has a different statistical structure from English and requires special treatment for headline generation, especially since no dedicated technique exists for the Arabic language. Therefore, two new statistical methods for automatic headline generation have been developed to create representative headlines for textual documents in Arabic. The first is an extractive method based on character cross-correlation (CCC), and the second is an abstractive method based on the hidden Markov model (HMM). The extractive method achieved a ROUGE-L score of 0.1938 and the HMM method achieved a ROUGE-L score of 0.2332. In addition, human examiners evaluated the headlines produced by both techniques.
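For readers unfamiliar with the metric reported above, ROUGE-L scores a generated headline by the longest common subsequence (LCS) of words it shares with a reference headline. The following is a minimal sketch of the standard LCS-based F-measure, not the authors' evaluation setup: the function names, the whitespace tokenization, and the default `beta = 1.0` are assumptions made for illustration, and the abstract does not state which ROUGE configuration was used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between a candidate and a reference headline.

    Assumption: tokens are obtained by whitespace splitting; a real
    evaluation may normalize or stem the text first.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)       # share of reference words recovered in order
    precision = lcs / len(cand)   # share of candidate words that match in order
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    print(rouge_l("the cat sat", "the cat sat on the mat"))
```

Because the LCS only requires matching words to appear in the same order, not contiguously, the score rewards headlines that preserve the reference's word sequence even when extra words intervene.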

Key words: Summarization, automatic headline generation, hidden Markov model, language model

Abbreviations

CCC, Character cross-correlation; DUC, Document Understanding Conference; EWM, exact word matching; HMM, hidden Markov model; HTK, Hidden Markov Model Toolkit; LDC, Linguistic Data Consortium; LM, language model; MSA, Modern Standard Arabic; NIST, National Institute of Standards and Technology; NLP, natural language processing; ROUGE, Recall-Oriented Understudy for Gisting Evaluation.
