Automatically extracting news articles from the Internet

  • Arnaud Jasselette
  • Mathieu Vanderwhale

    Student thesis: Master types, Master in Computer Science


    As information of interest is scattered across the World Wide Web, the need for fully automatic extraction processes to fetch relevant data cannot be ignored. Nowadays, five billion pages are available on the Internet and almost two million new pages are added daily. This thesis aims to define a comprehensive approach to extracting news articles specifically, from the early classification of significant pages to the retrieval of the articles themselves. We developed News Ripper, a "wrapper" that achieves this Web mining task by clustering similar news pages and then comparing their layouts to bring the articles to light.
    Date of Award: 2005
    Original language: English
    Supervisor: Monique Fraiture

