Extraction de données de sites web : méthodologie, outils et étude de cas [Data extraction from web sites: methodology, tools and case study]

  • Jean-Roch Meurisse

Student thesis: Master types › Master in Computer science

Abstract

This document presents a method for extracting data and their semantic structure from web sites. The pages of a site are classified into page types according to their informational content. Each page type is described in an XML document written in a formalism called Meta. The Meta document does two things: it lists, interprets and organises into a hierarchy each concept identified in the page type, and it locates the corresponding data in the HTML tree. The Meta document is then used to derive an XML Schema describing the data structure and an XML document, valid against that schema, containing the data gathered from the HTML pages. All XML Schemas are then integrated into a single conceptual schema that represents the whole application domain. From this conceptual schema, a database is designed to store the extracted data. The method is illustrated in a case study using both existing and specifically developed tools.
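
The extraction step described above can be pictured with a minimal sketch. It is not the thesis's Meta formalism or its tools, whose details are not given here: the `META` mapping, the `PAGE` sample, the `book` page type and the `extract` function are all hypothetical names chosen for illustration. The sketch assumes a well-formed page and uses only Python's standard `xml.etree.ElementTree` module to map named concepts to locations in the HTML tree and to emit the gathered data as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical page of type "book"; kept well-formed so the stdlib XML parser can read it.
PAGE = """<html><body>
  <h1>Database Design</h1>
  <div class="author">J. Smith</div>
  <ul><li>Chapter 1</li><li>Chapter 2</li></ul>
</body></html>"""

# Hypothetical stand-in for a Meta description (assumption, not the real formalism):
# each concept name is mapped to the location of its data in the HTML tree.
META = {
    "title": "./body/h1",
    "author": "./body/div[@class='author']",
    "chapter": "./body/ul/li",
}

def extract(page_html, meta, page_type):
    """Build an XML record for one page from a concept-to-path mapping."""
    root = ET.fromstring(page_html)
    record = ET.Element(page_type)
    for concept, path in meta.items():
        # A concept may occur several times on a page (e.g. chapters).
        for node in root.findall(path):
            child = ET.SubElement(record, concept)
            child.text = (node.text or "").strip()
    return record

record = extract(PAGE, META, "book")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

In this sketch a repeated concept such as `chapter` simply yields several sibling elements. Deriving an XML Schema and integrating the schemas of several page types into one conceptual schema, as the abstract describes, are left out.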
Date of Award: 2004
Original language: French
Supervisor: Jean-Luc Hainaut
