Automatic Image & Data Extraction

System automatically crops and labels figures for downstream analytics.

Problem Statement

Extracting meaningful images and captions from dense documents is tedious but essential for evidence-based reporting. We developed deep learning workflows that localize figures, read surrounding text, and store the resulting assets inside searchable datasets.

Approach

  • Layout-aware CNN detectors identify figures, tables, and highlight regions
  • OCR and NLP modules parse captions, labels, and narratives
  • A data pipeline organizes the extracted assets into structured repositories for downstream analytics

Outcomes

The system dramatically reduced the manual effort required to build training corpora for our healthcare models, enabling rapid experimentation on newly published literature and internal reports.

Description

Recognizing and extracting interested images and captions in a document is difficult but ideal work. We are studying on extracting, analyzing data using deep learning algorithms and then storing it automatically.

Cheolsoo Park
Cheolsoo Park
Professor

His research interests include machine learning, adaptive signal processing, computational neuroscience, and wearable technology.