Mining Structured Data in Natural Language Artifacts with Island Parsing

Alberto Bacchelli; Andrea Mocci; Anthony Cleve; Michele Lanza

doi:10.1016/j.scico.2017.06.009

Research output: Contribution to journal › Article › peer-review

Abstract

Software repositories typically store data composed of structured and unstructured parts. Researchers mine this data to empirically validate research ideas and to support practitioners' activities. Structured data (e.g., source code) has a formal syntax and is straightforward to analyze; unstructured data (e.g., documentation) is a mix of natural language, noise, and snippets of structured data, and it is harder to analyze. Especially the structured content (e.g., code snippets) in unstructured data contains valuable information. Researchers have proposed several approaches to recognize, extract, and analyze structured data embedded in natural language. We analyze these approaches and investigate their drawbacks. Subsequently, we present two novel methods, based on scannerless generalized LR (SGLR) and Parsing Expression Grammars (PEGs), to address these drawbacks and to mine structured fragments within unstructured data. We validate and compare these approaches on development emails and Stack Overflow posts with JAVA code fragments. Both approaches achieve high precision and recall values, but the PEG-based one achieves better computational performances and simplicity in engineering.

Original language	English
Pages (from-to)	31-55
Number of pages	25
Journal	Science of Computer Programming
Volume	150
DOIs	https://doi.org/10.1016/j.scico.2017.06.009
Publication status	Published - 15 Dec 2017

Keywords

Island parsing
Mining software repositories
Unstructured data

Cite this

@article{cc1423360136466f920359b13cc05db6,

title = "Mining Structured Data in Natural Language Artifacts with Island Parsing",

abstract = "Software repositories typically store data composed of structured and unstructured parts. Researchers mine this data to empirically validate research ideas and to support practitioners' activities. Structured data (e.g., source code) has a formal syntax and is straightforward to analyze; unstructured data (e.g., documentation) is a mix of natural language, noise, and snippets of structured data, and it is harder to analyze. Especially the structured content (e.g., code snippets) in unstructured data contains valuable information. Researchers have proposed several approaches to recognize, extract, and analyze structured data embedded in natural language. We analyze these approaches and investigate their drawbacks. Subsequently, we present two novel methods, based on scannerless generalized LR (SGLR) and Parsing Expression Grammars (PEGs), to address these drawbacks and to mine structured fragments within unstructured data. We validate and compare these approaches on development emails and Stack Overflow posts with JAVA code fragments. Both approaches achieve high precision and recall values, but the PEG-based one achieves better computational performances and simplicity in engineering.",

keywords = "Island parsing, Mining software repositories, Unstructured data",

author = "Alberto Bacchelli and Andrea Mocci and Anthony Cleve and Michele Lanza",

note = "Publisher Copyright: {\textcopyright} 2017",

year = "2017",

month = dec,

day = "15",

doi = "10.1016/j.scico.2017.06.009",

language = "English",

volume = "150",

pages = "31--55",

journal = "Science of Computer Programming",

issn = "0167-6423",

publisher = "Elsevier",

}

TY - JOUR

T1 - Mining Structured Data in Natural Language Artifacts with Island Parsing

AU - Bacchelli, Alberto

AU - Mocci, Andrea

AU - Cleve, Anthony

AU - Lanza, Michele

PY - 2017/12/15

Y1 - 2017/12/15

N2 - Software repositories typically store data composed of structured and unstructured parts. Researchers mine this data to empirically validate research ideas and to support practitioners' activities. Structured data (e.g., source code) has a formal syntax and is straightforward to analyze; unstructured data (e.g., documentation) is a mix of natural language, noise, and snippets of structured data, and it is harder to analyze. Especially the structured content (e.g., code snippets) in unstructured data contains valuable information. Researchers have proposed several approaches to recognize, extract, and analyze structured data embedded in natural language. We analyze these approaches and investigate their drawbacks. Subsequently, we present two novel methods, based on scannerless generalized LR (SGLR) and Parsing Expression Grammars (PEGs), to address these drawbacks and to mine structured fragments within unstructured data. We validate and compare these approaches on development emails and Stack Overflow posts with JAVA code fragments. Both approaches achieve high precision and recall values, but the PEG-based one achieves better computational performances and simplicity in engineering.

AB - Software repositories typically store data composed of structured and unstructured parts. Researchers mine this data to empirically validate research ideas and to support practitioners' activities. Structured data (e.g., source code) has a formal syntax and is straightforward to analyze; unstructured data (e.g., documentation) is a mix of natural language, noise, and snippets of structured data, and it is harder to analyze. Especially the structured content (e.g., code snippets) in unstructured data contains valuable information. Researchers have proposed several approaches to recognize, extract, and analyze structured data embedded in natural language. We analyze these approaches and investigate their drawbacks. Subsequently, we present two novel methods, based on scannerless generalized LR (SGLR) and Parsing Expression Grammars (PEGs), to address these drawbacks and to mine structured fragments within unstructured data. We validate and compare these approaches on development emails and Stack Overflow posts with JAVA code fragments. Both approaches achieve high precision and recall values, but the PEG-based one achieves better computational performances and simplicity in engineering.

KW - Island parsing

KW - Mining software repositories

KW - Unstructured data

UR - http://www.scopus.com/inward/record.url?scp=85027453302&partnerID=8YFLogxK

U2 - 10.1016/j.scico.2017.06.009

DO - 10.1016/j.scico.2017.06.009

M3 - Article

SN - 0167-6423

VL - 150

SP - 31

EP - 55

JO - Science of Computer Programming

JF - Science of Computer Programming

ER -
