Automatic Image & Data Extraction

Aug 1, 2018 Machine Learning / Deep Learning

The system automatically crops and labels figures for downstream analytics.

Problem Statement

Extracting meaningful images and captions from dense documents is tedious but essential for evidence-based reporting. We developed deep learning workflows that localize figures, read surrounding text, and store the resulting assets inside searchable datasets.

Approach

Layout-aware CNN detectors identify figures, tables, and highlight regions
OCR and NLP modules parse captions, labels, and narratives
A data pipeline organizes the extracted assets into structured repositories for downstream analytics
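
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `Box` and `FigureRecord` types, the `pair_captions` and `to_jsonl` helpers, and the "nearest text block below the region" heuristic are all hypothetical, and the detector and OCR outputs are stubbed with made-up values rather than produced by real models.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class Box:
    """Axis-aligned region on a page, in pixels (origin at top-left)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class FigureRecord:
    page: int
    kind: str      # e.g. "figure" or "table"
    box: Box
    caption: str


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width of the horizontal overlap between two boxes (0 if disjoint)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def pair_captions(page, detections, text_blocks):
    """Attach to each detected region the closest text block below it.

    detections: list of (kind, Box), as a layout detector might return.
    text_blocks: list of (Box, text), as an OCR step might return.
    Assumed heuristic: a caption sits directly under its figure or table.
    """
    records = []
    for kind, region in detections:
        candidates = [
            (block.y0 - region.y1, text)
            for block, text in text_blocks
            if block.y0 >= region.y1 and horizontal_overlap(region, block) > 0
        ]
        caption = min(candidates)[1] if candidates else ""
        records.append(FigureRecord(page, kind, region, caption))
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one asset per line."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


# Stubbed detector and OCR outputs for a single page (made-up values).
detections = [
    ("figure", Box(50, 100, 300, 250)),
    ("table", Box(320, 100, 560, 220)),
]
text_blocks = [
    (Box(50, 260, 300, 280), "Figure 1. Example caption."),
    (Box(320, 230, 560, 250), "Table 1. Example caption."),
    (Box(50, 300, 560, 500), "Body text that is not a caption."),
]

records = pair_captions(1, detections, text_blocks)
print(to_jsonl(records))
```

Writing one JSON object per extracted asset keeps the output easy to index and filter later. In a real system the pairing rule would likely need to handle captions placed above or beside a figure as well.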

Outcomes

The system dramatically reduced the manual effort required to build training corpora for our healthcare models, enabling rapid experimentation on newly published literature and internal reports.

Description

Recognizing and extracting the images and captions of interest in a document is difficult but valuable work. We are studying how to extract and analyze such data with deep learning algorithms and then store it automatically.

Computer Vision, Document AI, Deep Learning, Data Mining

Cheolsoo Park

Professor