This document presents a method for extracting data and their semantic structure from web sites. The pages of a site are classified into page types according to their informational content. Each page type is described in an XML document written in a formalism called Meta. The Meta document lists, interprets and organises into a hierarchy each concept identified in the page type, and it locates the corresponding data in the HTML tree. The Meta document is then used to derive an XML Schema describing the data structure and an XML document, valid against that schema, containing the data gathered from the HTML pages. All the XML Schemas are then integrated into a single conceptual schema that represents the whole application domain. From this conceptual schema, a database is designed to store the extracted data.
The method is illustrated in a case study using both existing tools and tools developed specifically for this work.
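The extraction step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the Meta formalism is an XML document whose syntax is not given here, so the `PAGE_TYPE` dictionary, the concept names, the paths and the sample page are all hypothetical stand-ins. The sketch shows a page-type description that locates each concept in the HTML tree and is used to produce an XML document holding the extracted data.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a Meta description of one page type:
# each concept name maps to a path locating its data in the HTML tree.
# The actual Meta formalism is XML-based and is not reproduced here.
PAGE_TYPE = {
    "name": "product",
    "concepts": {
        "title": "./body/h1",
        "price": "./body/div/span",
    },
}

# Made-up, well-formed sample page. Real-world HTML is often not
# well-formed and would need a more tolerant parser than ElementTree.
SAMPLE_PAGE = """<html><body>
<h1>Blue teapot</h1>
<div><span>12.50</span></div>
</body></html>"""


def extract(page_source, page_type):
    """Build an XML element holding the data located by each concept path."""
    html_root = ET.fromstring(page_source)
    record = ET.Element(page_type["name"])
    for concept, path in page_type["concepts"].items():
        node = html_root.find(path)
        # Skip concepts whose location is absent or empty on this page.
        if node is not None and node.text:
            ET.SubElement(record, concept).text = node.text.strip()
    return record


record = extract(SAMPLE_PAGE, PAGE_TYPE)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the page-type description as data, separate from the extraction function, mirrors the idea that one description serves every page of the same type. Deriving the XML Schema and integrating the schemas into a conceptual schema are not covered by this sketch.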
Extraction de données de sites web : méthodologie, outils et étude de cas (Data extraction from web sites: methodology, tools and case study)
Meurisse, J. (Author). 2004
Student thesis: Master in Computer Science