Extraction of Partial XML Documents Using IR-Based Structure and Contents Analysis

Authors: Hiroko Kinutani, Kenji Hatano, Masatoshi Yoshikawa, Shunsuke Uemura

Tags: 2001, conceptual modeling

As Internet technologies develop, XML is becoming widely used as a standard data/document format. Although XML documents have attracted public attention, the application of IR technologies to XML document retrieval is still at an early stage. We foresee that typical XML queries from end users will be very terse, like those used with current Web search engines. Therefore, an XML search engine should be able to find appropriate results using only a few keywords. In this paper, we introduce the notion of context nodes. Context nodes are used to automatically extract coherent partial documents without knowledge of the XML document structure. This method is useful because it does not require domain analysts to analyze DTDs and specify candidate partial documents beforehand. We use the term “context search” for search methods that employ the notion of context nodes. As an instantiation of context search methods, we have developed algorithms to identify result partial documents in the vector space model. We conducted a performance evaluation to verify the effectiveness of our method.
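To make the idea of keyword retrieval over partial documents concrete, the following is a minimal sketch, not the paper's context-node algorithm. It assumes the simplest possible setup: every element subtree is treated as a candidate partial document, represented as a raw term-frequency vector, and ranked by cosine similarity against the keyword query. The names `best_subtree`, `cosine`, and `SAMPLE_DOC` are hypothetical and chosen only for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document, used only for illustration.
SAMPLE_DOC = """
<book>
  <chapter><title>XML retrieval</title><para>keyword search over XML documents</para></chapter>
  <chapter><title>Cooking</title><para>pasta recipes and sauces</para></chapter>
</book>
"""

def term_vector(text):
    """Lowercased whitespace tokens with raw term frequencies (an assumption)."""
    return Counter(word.lower() for word in text.split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_subtree(xml_string, query):
    """Return (element, score) for the subtree most similar to the query.

    Every element is a candidate partial document; its text is the
    concatenation of all text in its subtree. No DTD knowledge is used.
    """
    root = ET.fromstring(xml_string)
    query_vec = term_vector(query)
    best, best_score = None, 0.0
    for elem in root.iter():
        score = cosine(query_vec, term_vector(" ".join(elem.itertext())))
        if score > best_score:
            best, best_score = elem, score
    return best, best_score
```

Calling `best_subtree(SAMPLE_DOC, "xml retrieval")` selects the first `title` element, since its text matches the query terms exactly. Scoring every subtree independently tends to favor very small fragments, which is one reason a method for identifying coherent partial documents, such as the context nodes described in the abstract, is needed.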

Read the full paper here: https://link.springer.com/chapter/10.1007/3-540-46140-X_26