Data & Knowledge Engineering

Conceptual-model-based data extraction from multiple-record Web pages

October 22, 2020


Authors: D. M. Campbell, D. W. Embley, D. W. Lonsdale, R. D. Smith, S. W. Liddle, Y.-K. Ng, Y. S. Jiang

Tags: 1999, conceptual modeling

Electronically available data on the Web is exploding at an ever-increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document’s content. For these kinds of data-rich, multiple-record documents (e.g., advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data automatically. The approach is based on an ontology (a conceptual model instance) that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents and structure it according to the generated database scheme. Experiments show that it is possible to achieve good recall and precision ratios for documents that are rich in recognizable constants and narrow in ontological breadth. Our approach is less labor-intensive than other approaches that manually or semiautomatically generate wrappers, and it is generally insensitive to changes in Web-page format.
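
To make the pipeline in the abstract concrete, here is a minimal Python sketch of the general idea: an ontology that lists constant patterns and context keywords is turned into a database scheme and a set of recognizers, which are then applied to a multiple-record document. This is not the authors' implementation. The toy car-advertisement ontology, the function names (`database_scheme`, `compile_recognizers`, `extract_records`), the blank-line record separator, and the fixed keyword window are all assumptions made for illustration.

```python
import re

# Hypothetical toy ontology for car advertisements (an assumption, not taken
# from the paper). Each object set has a regular expression for its constants
# and, optionally, a context keyword that must appear right after the constant.
ONTOLOGY = {
    "Year": {"constant": r"\b(?:19|20)\d{2}\b", "keywords": None},
    "Price": {"constant": r"\$\d{1,3}(?:,\d{3})*", "keywords": None},
    "Mileage": {"constant": r"\b\d{1,3}(?:,\d{3})*\b", "keywords": r"\bmiles\b"},
}


def database_scheme(ontology):
    """Derive a flat relation scheme (a list of column names) from the ontology."""
    return list(ontology)


def compile_recognizers(ontology):
    """Build one (constant, keyword) pair of compiled regexes per object set."""
    return {
        name: (
            re.compile(spec["constant"]),
            re.compile(spec["keywords"], re.IGNORECASE) if spec["keywords"] else None,
        )
        for name, spec in ontology.items()
    }


def extract_record(text, recognizers, window=10):
    """Fill one record: take the first constant whose context keyword, if any,
    occurs within `window` characters after the match."""
    record = {}
    for name, (constant, keyword) in recognizers.items():
        record[name] = None
        for match in constant.finditer(text):
            context = text[match.end():match.end() + window]
            if keyword is None or keyword.search(context):
                record[name] = match.group()
                break
    return record


def extract_records(document, ontology, separator="\n\n"):
    """Split the document into record-sized chunks (here, assumed to be
    separated by blank lines) and extract one record from each chunk."""
    recognizers = compile_recognizers(ontology)
    return [
        extract_record(chunk, recognizers)
        for chunk in document.split(separator)
        if chunk.strip()
    ]
```

Calling `extract_records(page_text, ONTOLOGY)` returns one dictionary per record, keyed by the names in `database_scheme(ONTOLOGY)`. Because the recognizers depend only on constants and nearby keywords rather than on page markup, a sketch like this illustrates why such an approach can tolerate changes in page layout; a realistic system would also need a more robust way to find record boundaries.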

Read the full paper here: https://reader.elsevier.com/reader/sd/pii/S0169023X99000270?token=3E89BD5E913083633E23D65265F47FB69A022521C0436F019DA278144FE29B2AAB46F19C506767F78042CCAB704E6872

