A Tool-Supported Method to Extract Data and Schema from Web Sites

Fabrice Estiévenart, Aurore François, Jean Henrard, Jean-Luc Hainaut

Research output: Contribution in Book/Catalog/Report/Conference proceeding > Conference contribution

Abstract

Modern technologies allow web sites to be managed dynamically by building pages on the fly through scripts that fetch data from a database. Separating data from layout directives makes data easy to update and presentation homogeneous. However, many web sites are still made of static HTML pages in which data and layout information are interleaved. This leads to out-of-date information, inconsistent style, and tricky, expensive maintenance. This paper presents a tool-supported methodology to reengineer web sites, that is, to extract the page contents as XML documents structured by expressive DTDs or XML Schemas. All the pages recognized as expressing the same application (sub)domain are analyzed in order to derive their common structure. This structure is formalized by an XML document, called META, which is then used to extract an XML document that contains the data of the pages and an XML Schema that validates these data. The META document can describe various structures, such as alternative layouts and data structures for the same concept, structure multiplicity, and the separation between layout and informational content. The XML Schemas extracted from different page types are integrated and conceptualised into a single schema describing the domain covered by the whole web site. Finally, the data are converted according to this new schema so that they can be used to produce the renovated web site. These principles are illustrated through a case study using the tools that create the META document and extract the data and the XML Schema.
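
As a rough illustration of the extraction step described in the abstract, the following Python sketch pulls the data out of one static page into a layout-free XML record. It is not the authors' tool or the actual META syntax: the `META` dictionary, its field names, the `extract` function and the sample page are all invented for this example, and the page is assumed to be well-formed XHTML so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a META description of one page type: each concept
# is mapped to the place where its value sits in the page layout.
# This dict format is an assumption for illustration, not the paper's META syntax.
META = {
    "page_type": "staff",
    "fields": [
        {"name": "name", "path": ".//h1", "multiple": False},
        {"name": "email", "path": ".//p[@class='mail']", "multiple": False},
        {"name": "course", "path": ".//ul/li", "multiple": True},
    ],
}

# Invented sample page in which data and layout are interleaved.
PAGE = """<html><body>
<h1>Jane Doe</h1>
<p class="mail">jane.doe@example.org</p>
<ul><li>Databases</li><li>Web engineering</li></ul>
</body></html>"""


def extract(page_source: str, meta: dict) -> ET.Element:
    """Copy the values named in `meta` from one page into a data-only XML element."""
    page = ET.fromstring(page_source)  # assumes the page is well-formed XHTML
    record = ET.Element(meta["page_type"])
    for field in meta["fields"]:
        matches = page.findall(field["path"])
        if not field["multiple"]:
            matches = matches[:1]  # single-valued concept: keep the first match only
        for node in matches:
            child = ET.SubElement(record, field["name"])
            child.text = "".join(node.itertext()).strip()
    return record


record = extract(PAGE, META)
print(ET.tostring(record, encoding="unicode"))
```

The `multiple` flag is a simplified nod to the structure multiplicity mentioned in the abstract; deriving an XML Schema and integrating schemas across page types are outside the scope of this sketch.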

Original language: English
Title of host publication: Proc. of the 5th International Workshop on Web Site Evolution
Place of publication: Amsterdam
Publisher: IEEE CS Press
Pages: 3-11
Number of pages: 9
Publication status: Published - 2003

Keywords

  • XML
  • data extraction
  • reengineering
  • web site

Cite this

Estiévenart, F., François, A., Henrard, J., & Hainaut, J-L. (2003). A Tool-Supported Method to Extract Data and Schema from Web Sites. In Proc. of the 5th International Workshop on Web Site Evolution (pp. 3-11). Amsterdam: IEEE CS Press.
@inproceedings{67986255f4384c0995053ef6ae3764d1,
title = "A Tool-Supported Method to Extract Data and Schema from Web Sites",
keywords = "XML, data extraction, reengineering, web site",
author = "Fabrice Esti{\'e}venart and Aurore Fran{\c{c}}ois and Jean Henrard and Jean-Luc Hainaut",
year = "2003",
language = "English",
pages = "3--11",
booktitle = "Proc. of the 5th International Workshop on Web Site Evolution",
publisher = "IEEE CS Press",

}
