Learning Restricted Regular Expressions with Interleaving from XML Data

0
74

Authors: Haiming Chen, Han Xu, Xiaolan Zhang, Xiaoying Mou, Yeting Li

Tags: 2018, conceptual modeling

The presence of a schema for XML documents has numerous advantages. However, many XML documents in practice are not accompanied by a schema or by a valid schema. Therefore, it is essential to devise algorithms to learn a schema from XML documents. The fundamental task in XML schema learning is inferring restricted subclasses of regular expressions. Previous work in this direction lacks support of interleaving. In this paper, based on the analysis of real data, we first propose a new subclass of regular expressions with interleaving, named as Extended Subclass of Regular Expressions with Interleaving (ESIRE). Then, based on single occurrence automaton and maximum independent set, we propose an algorithm GenESIRE to infer ESIREs from a set of given samples. Finally, we conduct a series of experiments to analyze the inference effectness of GenESIRE. Experimental results show that regular expressions inferred by GenESIRE are more precise compared with other methods, measured by different indicators.

Read the full paper here: https://link-springer-com.proxy2.hec.ca/chapter/10.1007/978-3-030-00847-5_43