Semi-Automated extraction of targeted data from web pages

Fabrice Estiévenart, Jean-Roch Meurisse, Jean-Luc Hainaut, Philippe Thiran

Research output: Contribution to a book, catalogue, report, or conference proceedings; article in conference proceedings


The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet while HTML, the main publication standard on the Web, is well suited to human reading, its poor semantics make it difficult for computers to process and use embedded data in a smart and automated way. In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of recurring types of information in a cluster of similar Web documents, together with the location of that information in those documents. Relying on these rules, HTML-embedded data can be extracted into a more computable format. The definition of mapping rules is based on direct user input, mainly for the interpretation part, and on automatic computing for the location of data in HTML tree structures. This approach is supported by a user-friendly tool called Retrozilla.
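To make the idea of a mapping rule concrete, here is a minimal sketch in Python. It pairs a concept name (the semantic interpretation) with a path into the document tree (the location) and applies the rules to one page. The rule format, the names `MAPPING_RULES`, `extract` and `PAGE`, and the sample page are all hypothetical and chosen for illustration; they are not the rule format used by Retrozilla. The sketch also assumes well-formed (X)HTML so that the standard library XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping rules: concept name -> location in the document tree.
# The path syntax is ElementTree's limited XPath subset, picked for this
# sketch only; it is not the rule format of the tool described above.
MAPPING_RULES = {
    "title": "./body/h1",
    "price": "./body/div[@class='offer']/span",
}


def extract(page_source, rules):
    """Apply each rule to a well-formed (X)HTML page and return a record."""
    root = ET.fromstring(page_source)
    record = {}
    for concept, path in rules.items():
        node = root.find(path)
        # A rule that matches nothing (or an empty node) yields None.
        if node is not None and node.text:
            record[concept] = node.text.strip()
        else:
            record[concept] = None
    return record


# Hypothetical page standing in for one document of a cluster of similar pages.
PAGE = """<html><body>
<h1>Blue Kettle</h1>
<div class="offer"><span> 24.90 </span></div>
</body></html>"""

record = extract(PAGE, MAPPING_RULES)
print(record)
```

Because the rules are kept separate from the extraction code, the same set of rules could in principle be applied to every page of a cluster that shares the same layout, which is the property the abstract relies on.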

Original language: English
Title of host publication: ICDEW 2006 - Proceedings of the 22nd International Conference on Data Engineering Workshops
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 0769525717, 9780769525716
Publication status: Published - 1 Jan 2006
Event: 22nd International Conference on Data Engineering Workshops, ICDEW 2006 - Atlanta, United States
Duration: 3 Apr 2006 to 7 Apr 2006
