Automatically extracting news articles from the Internet

Arnaud Jasselette
Mathieu Vanderwhale

Student thesis: Master types › Master in Computer Science

Abstract

As information of interest is scattered across the World Wide Web, the need for fully automatic extraction processes to fetch relevant data cannot be ignored. Nowadays, five billion pages are available on the Internet and almost two million new pages are added daily. This thesis aims to define a comprehensive approach to extracting news articles specifically, from the early classification of significant pages to the article retrieval itself. We developed News Ripper, a "wrapper" that achieves this Web mining task by clustering similar news pages and then comparing their layouts to bring the articles to light.

Date of award	2005
Original language	English
Supervisor	Monique Fraiture (Promoter)

Documents

2005_JasseletteA_memoire
File: application/pdf, 51 MB
Type: Thesis