Semi-Automated extraction of targeted data from web pages

Fabrice Estíevenart; Jean Roch Meurisse; Jean Luc Hainaut; Philippe Thiran

doi:10.1109/ICDEW.2006.135

Université de Namur

Research output: Contribution to a book/catalogue/report/conference proceedings › Paper in conference proceedings

Abstract

The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet, if the main standard of publication on the Web (HTML) is quite suited to human reading, its poor semantics makes it difficult for computers to process and use embedded data in a smart and automated way. In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of recurring types of information in a cluster of similar Web documents and their location in those documents. Relying on these rules, HTML-embedded data can be extracted towards a more computable format. The definition of mapping rules is based on direct user input mainly for the interpretation part, and on automatic computing for the location of data in HTML tree structures. This approach is supported by a user-friendly tool called Retrozilla.
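The idea of a mapping rule, as the abstract describes it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation or the Retrozilla rule format: the `MappingRule` class, the `extract` function, the sample page, and the use of ElementTree paths to express a location in the HTML tree are all hypothetical. Each rule pairs a user-supplied semantic label with a location in the document tree, and applying the rules to a page yields a plain, computable record.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRule:
    """Hypothetical rule: a semantic label plus a location in the HTML tree."""
    concept: str  # semantic interpretation, supplied by the user
    path: str     # ElementTree path locating the data in the tree


def extract(page: str, rules: list[MappingRule]) -> dict[str, list[str]]:
    """Apply every rule to one well-formed page and collect the matched text."""
    root = ET.fromstring(page)
    record = {}
    for rule in rules:
        record[rule.concept] = [
            "".join(node.itertext()).strip() for node in root.findall(rule.path)
        ]
    return record


# Invented sample page; real pages would first need to be parsed into a
# well-formed tree, since ElementTree only accepts well-formed markup.
PAGE = """
<html><body>
  <h1>Blade Runner</h1>
  <ul>
    <li class="actor">Harrison Ford</li>
    <li class="actor">Rutger Hauer</li>
  </ul>
</body></html>
"""

RULES = [
    MappingRule("title", "./body/h1"),
    MappingRule("actor", "./body/ul/li[@class='actor']"),
]

result = extract(PAGE, RULES)
```

Because the rules refer to positions in the tree rather than to specific text, the same rule set could be reused on any page of a cluster that shares this layout.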

Original language	English
Title of host publication	ICDEW 2006 - Proceedings of the 22nd International Conference on Data Engineering Workshops
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	0769525717, 9780769525716
DOIs	https://doi.org/10.1109/ICDEW.2006.135
Publication status	Published - 1 Jan 2006
Event	22nd International Conference on Data Engineering Workshops, ICDEW 2006 - Atlanta, United States. Duration: 3 Apr 2006 → 7 Apr 2006

Conference

Conference	22nd International Conference on Data Engineering Workshops, ICDEW 2006
Country/Territory	United States
City	Atlanta
Period	3 Apr 2006 → 7 Apr 2006

Cite this

@inproceedings{0797b84f6d5d4c778cc4d843ce79eaab,

title = "Semi-Automated extraction of targeted data from web pages",

abstract = "The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet, if the main standard of publication on the Web (HTML) is quite suited to human reading, its poor semantics makes it difficult for computers to process and use embedded data in a smart and automated way. In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of recurring types of information in a cluster of similar Web documents and their location in those documents. Relying on these rules, HTML-embedded data can be extracted towards a more computable format. The definition of mapping rules is based on direct user input mainly for the interpretation part, and on automatic computing for the location of data in HTML tree structures. This approach is supported by a user-friendly tool called Retrozilla.",

author = "Fabrice Est{\'i}evenart and Meurisse, {Jean Roch} and Hainaut, {Jean Luc} and Philippe Thiran",

year = "2006",

month = jan,

day = "1",

doi = "10.1109/ICDEW.2006.135",

language = "English",

booktitle = "ICDEW 2006 - Proceedings of the 22nd International Conference on Data Engineering Workshops",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

address = "United States",

note = "22nd International Conference on Data Engineering Workshops, ICDEW 2006 ; Conference date: 03-04-2006 Through 07-04-2006",

}

Estíevenart, F, Meurisse, JR, Hainaut, JL & Thiran, P 2006, Semi-Automated extraction of targeted data from web pages. In ICDEW 2006 - Proceedings of the 22nd International Conference on Data Engineering Workshops., 1623843, Institute of Electrical and Electronics Engineers Inc., 22nd International Conference on Data Engineering Workshops, ICDEW 2006, Atlanta, United States, 3 Apr 2006. https://doi.org/10.1109/ICDEW.2006.135

Semi-Automated extraction of targeted data from web pages. / Estíevenart, Fabrice; Meurisse, Jean Roch; Hainaut, Jean Luc et al.
ICDEW 2006 - Proceedings of the 22nd International Conference on Data Engineering Workshops. Institute of Electrical and Electronics Engineers Inc., 2006. 1623843.

TY - GEN

T1 - Semi-Automated extraction of targeted data from web pages

AU - Estíevenart, Fabrice

AU - Meurisse, Jean Roch

AU - Hainaut, Jean Luc

AU - Thiran, Philippe

PY - 2006/1/1

Y1 - 2006/1/1

N2 - The World Wide Web can be considered an infinite source of information for both individuals and organizations. Yet, if the main standard of publication on the Web (HTML) is quite suited to human reading, its poor semantics makes it difficult for computers to process and use embedded data in a smart and automated way. In this paper, we propose to build a bridge between HTML documents and external applications by means of so-called mapping rules. Such rules mainly record a semantic interpretation of recurring types of information in a cluster of similar Web documents and their location in those documents. Relying on these rules, HTML-embedded data can be extracted towards a more computable format. The definition of mapping rules is based on direct user input mainly for the interpretation part, and on automatic computing for the location of data in HTML tree structures. This approach is supported by a user-friendly tool called Retrozilla.

UR - http://www.scopus.com/inward/record.url?scp=84990954681&partnerID=8YFLogxK

U2 - 10.1109/ICDEW.2006.135

DO - 10.1109/ICDEW.2006.135

M3 - Conference contribution

AN - SCOPUS:84990954681

BT - ICDEW 2006 - Proceedings of the 22nd International Conference on Data Engineering Workshops

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 22nd International Conference on Data Engineering Workshops, ICDEW 2006

Y2 - 3 April 2006 through 7 April 2006

ER -
