Semi-Automated Extraction of Targeted Data from Web Pages

Fabrice Estiévenart, Jean-Roch Meurisse, Jean-Luc Hainaut, Philippe Thiran

Research output: Contribution in Book/Catalog/Report/Conference proceeding › Conference contribution


Abstract

The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet, while the main standard of publication on the Web (HTML) is well suited to human reading, its poor semantics makes it difficult for computers to process and use embedded data in a smart and automated way. In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of recurring types of information in a cluster of similar Web documents, together with the location of that information in those documents. Relying on these rules, HTML-embedded data can be extracted into a more computable format. The definition of mapping rules is based on direct user input, mainly for the interpretation part, and on automatic computation for the location of data in HTML tree structures. This approach is supported by a user-friendly tool called Retrozilla.
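The idea of mapping rules pairing a semantic label with a location in the HTML tree can be illustrated by a minimal sketch. This is not the paper's actual rule language or the Retrozilla implementation; it assumes well-formed XHTML input and uses ElementTree's limited path syntax purely for illustration, with hypothetical labels (`title`, `price`):

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping rules: each pairs a semantic label with a
# location path into the document tree. The paper's rules are richer
# (and derived semi-automatically); this is only an illustrative stand-in.
MAPPING_RULES = {
    "title": ".//h1",
    "price": ".//span",
}

def extract(xhtml: str, rules: dict) -> dict:
    """Apply mapping rules to one page (assumed well-formed XHTML),
    returning extracted data in a computable key/value form."""
    root = ET.fromstring(xhtml)
    return {label: root.find(path).text for label, path in rules.items()}

page = "<html><body><h1>Blue Widget</h1><span>9.99</span></body></html>"
print(extract(page, MAPPING_RULES))
```

Because the rules are defined once for a cluster of similar pages, the same `extract` call can be reused across every page sharing that structure.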
Original language: English
Title of host publication: IEEE ICDE Workshop Proceedings
Subtitle of host publication: Workshop on Challenges in Web Information Retrieval and Integration
Editors: R. S. Barga, X. Zhou
Place of publication: Los Alamitos, California
Publisher: IEEE Computer Science Press
Publication status: Published - 2006
