Automatically extracting news articles from the Internet

  • Arnaud Jasselette
  • Mathieu Vanderwhale

    Student thesis: Master types, Master in Computer Science

    Abstract

    As information of interest is scattered across the World Wide Web, the need for fully automatic extraction processes to fetch relevant data cannot be ignored. Nowadays, five billion pages are available on the Internet, and almost two million new pages are added daily. This thesis aims to define a comprehensive solution for extracting news articles specifically, from the early classification of significant pages to the article retrieval proper. We developed News Ripper, a "wrapper" that achieves this Web mining task by clustering similar news pages and then comparing their layouts to bring the articles to light.
    Date of award: 2005
    Original language: English
    Supervisor: Monique Fraiture (Promoter)
