Journal of Engineering and Computer Innovations

Article Number - 3D60A928637


Vol. 3(1), pp. 11-25, February 2012
DOI: 10.5897/JECI11.053
ISSN: 2141-6508



Full Length Research Paper

New approaches to automatic headline generation for Arabic documents


Fahad Alotaiby1*, Salah Foda1 and Ibrahim Alkharashi2




 

1Department of Electrical Engineering, College of Engineering, King Saud University, Riyadh, Saudi Arabia.

2Computer Research Institute, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia.


Email: falotaiby@hotmail.com






Accepted: 23 September 2011; Published: 28 February 2012

Copyright © 2012 Author(s) retain the copyright of this article.
This article is published under the terms of the Creative Commons Attribution License 4.0


 

A headline can be regarded as a condensed summary of a document. Demand for automatic headline generation has grown with the need to handle very large numbers of documents, since reading each one in full is tedious and time-consuming. Instead of reading every document, a reader can use the headline to decide which documents contain important and relevant information. There are two major approaches to automatic headline generation. The first is linguistic and draws on knowledge about the structure of the language itself. The second is statistical and comprises all quantitative approaches to automated language processing. However, the Arabic language has a different statistical structure from English, and generating Arabic headlines requires special treatment, especially as there is no dedicated technique for Arabic. Therefore, two new statistical methods for automatic headline generation have been developed to create representative headlines for textual documents in Arabic. The first is an extractive method based on character cross-correlation (CCC), and the second is an abstractive method based on the hidden Markov model (HMM). The extractive method achieved a ROUGE-L score of 0.1938 and the HMM method achieved a ROUGE-L score of 0.2332. In addition, both techniques were assessed by human examiners who evaluated the resulting headlines.
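
For readers unfamiliar with the metric, ROUGE-L scores a candidate headline by the longest common subsequence (LCS) it shares with a reference headline. The Python sketch below is illustrative only and is not the authors' evaluation setup: the whitespace tokenization, the function names `lcs_length` and `rouge_l`, and the balanced weighting `beta = 1.0` are all assumptions, and the scores reported above depend on the ROUGE configuration actually used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Tokens match: extend the LCS found without either token.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between a candidate and a reference headline.

    beta weights recall against precision; beta = 1.0 (an assumption here)
    gives the balanced harmonic mean.
    """
    cand = candidate.split()  # assumed tokenization: split on whitespace
    ref = reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

Because the comparison is over whole tokens, a score computed this way is sensitive to how Arabic words are segmented and normalized before scoring, which this sketch does not attempt.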

 

Key words: Summarization, automatic headline generation, hidden Markov model, language model

Abbreviations:

CCC, Character cross-correlation; DUC, Document Understanding Conference; EWM, exact word matching; HMM, hidden Markov model; HTK, Hidden Markov Model Toolkit; LDC, Linguistic Data Consortium; LM, language model; MSA, Modern Standard Arabic; NIST, National Institute of Standards and Technology; NLP, natural language processing; ROUGE, Recall-Oriented Understudy for Gisting Evaluation



APA Alotaiby, F., Foda, S., & Alkharashi, I. (2012). New approaches to automatic headline generation for Arabic documents. Journal of Engineering and Computer Innovations, 3(1), 11-25.
Chicago Fahad Alotaiby, Salah Foda and Ibrahim Alkharashi. "New approaches to automatic headline generation for Arabic documents." Journal of Engineering and Computer Innovations 3, no. 1 (2012): 11-25.
MLA Fahad Alotaiby, Salah Foda and Ibrahim Alkharashi. "New approaches to automatic headline generation for Arabic documents." Journal of Engineering and Computer Innovations 3.1 (2012): 11-25.
   
DOI 10.5897/JECI11.053
URL http://academicjournals.org/journal/JECI/article-abstract/3D60A928637
