TY - JOUR
T1 - Mining Structured Data in Natural Language Artifacts with Island Parsing
AU - Bacchelli, Alberto
AU - Mocci, Andrea
AU - Cleve, Anthony
AU - Lanza, Michele
N1 - Publisher Copyright: © 2017
PY - 2017/12/15
Y1 - 2017/12/15
N2 - Software repositories typically store data composed of structured and unstructured parts. Researchers mine this data to empirically validate research ideas and to support practitioners' activities. Structured data (e.g., source code) has a formal syntax and is straightforward to analyze; unstructured data (e.g., documentation) is a mix of natural language, noise, and snippets of structured data, and it is harder to analyze. The structured content (e.g., code snippets) embedded in unstructured data is especially valuable. Researchers have proposed several approaches to recognize, extract, and analyze structured data embedded in natural language. We analyze these approaches and investigate their drawbacks. Subsequently, we present two novel methods, based on scannerless generalized LR (SGLR) and Parsing Expression Grammars (PEGs), to address these drawbacks and to mine structured fragments within unstructured data. We validate and compare these approaches on development emails and Stack Overflow posts with Java code fragments. Both approaches achieve high precision and recall, but the PEG-based one achieves better computational performance and is simpler to engineer.
AB - Software repositories typically store data composed of structured and unstructured parts. Researchers mine this data to empirically validate research ideas and to support practitioners' activities. Structured data (e.g., source code) has a formal syntax and is straightforward to analyze; unstructured data (e.g., documentation) is a mix of natural language, noise, and snippets of structured data, and it is harder to analyze. The structured content (e.g., code snippets) embedded in unstructured data is especially valuable. Researchers have proposed several approaches to recognize, extract, and analyze structured data embedded in natural language. We analyze these approaches and investigate their drawbacks. Subsequently, we present two novel methods, based on scannerless generalized LR (SGLR) and Parsing Expression Grammars (PEGs), to address these drawbacks and to mine structured fragments within unstructured data. We validate and compare these approaches on development emails and Stack Overflow posts with Java code fragments. Both approaches achieve high precision and recall, but the PEG-based one achieves better computational performance and is simpler to engineer.
KW - Island parsing
KW - Mining software repositories
KW - Unstructured data
UR - http://www.scopus.com/inward/record.url?scp=85027453302&partnerID=8YFLogxK
U2 - 10.1016/j.scico.2017.06.009
DO - 10.1016/j.scico.2017.06.009
M3 - Article
SN - 0167-6423
VL - 150
SP - 31
EP - 55
JO - Science of Computer Programming
JF - Science of Computer Programming
ER -