Automatically extracting news articles from the Internet

Arnaud Jasselette
Mathieu Vanderwhale

Student thesis: Master types › Master in Computer science

Abstract

As information of interest is scattered around the World Wide Web, the need for fully automatic extraction processes to fetch relevant data cannot be ignored. Nowadays, five billion pages are available on the Internet and almost two million new pages are added daily. This thesis aims to define a comprehensive approach to extracting news articles specifically, from the early classification of significant pages to the article retrieval itself. We developed News Ripper, a "wrapper" that achieves this Web mining task by clustering similar news pages and then comparing their layouts to bring the articles to light.

Date of Award	2005
Original language	English
Supervisor	Monique Fraiture (Supervisor)

Documents

2005_JasseletteA_memoire
File: application/pdf, 51 MB
Type: Thesis