Data Extraction Basics for Docs and Images with OCR and NER
Become a Data Extraction Expert with Python, Pandas, OCR, NER, and spaCy: Learn to Train and Build Real-World Solutions
Description
Master Smart Data Extraction from PDFs and Images with Python, Pandas, OCR, Tesseract, PyTesseract, OpenCV, spaCy, and NER
Gain a competitive edge in the world of computer vision by learning how to extract data from PDFs and images intelligently. In this comprehensive course, you'll learn how to use a variety of powerful tools and techniques, including:
- Python: A versatile and widely used programming language for data science and machine learning
- Pandas: A powerful library for data manipulation and analysis
- OCR: Optical character recognition, used to convert images of text into machine-readable text
- Tesseract: A popular open-source OCR engine
- PyTesseract: A Python wrapper for Tesseract
- OpenCV: A computer vision library
- spaCy: A natural language processing (NLP) library
- NER: Named entity recognition, used to identify and classify named entities in text
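To make the Tesseract and PyTesseract entries concrete, here is a minimal sketch of how an OCR call might be configured. It assumes Tesseract, `pytesseract`, and Pillow are installed; the helper names `tesseract_config` and `ocr_image` are hypothetical and chosen only for illustration. The `--psm` and `--oem` flags are Tesseract's page segmentation mode and OCR engine mode options, which the course covers later.

```python
def tesseract_config(psm: int = 3, oem: int = 3) -> str:
    """Build a Tesseract config string from a page segmentation mode and an engine mode."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


def ocr_image(image_path: str, psm: int = 3, oem: int = 3, lang: str = "eng") -> str:
    """Run OCR on one image file and return the recognized text."""
    # Imported lazily so the config helper above works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=tesseract_config(psm, oem)
        )
```

For example, `ocr_image("scan.png", psm=6)` asks Tesseract to treat the page as a single uniform block of text, which may suit simple scanned forms; `scan.png` is a placeholder file name.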
You'll also learn how to build a common pipeline for data extraction from different types of input documents, including structured PDF documents, scanned PDF documents, and Word documents. By the end of the course, you'll be able to develop robust data extraction solutions for a variety of real-world applications.
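One way to picture a common pipeline is a small dispatcher that routes each input type to its own extractor and returns the result in a single shared format. The sketch below is illustrative only and is not the course's actual code: the extractor functions are hypothetical placeholders that return dummy text, and the `has_text_layer` flag stands in for whatever check a real pipeline would use to tell a structured PDF from a scanned one.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict


@dataclass
class ExtractedDocument:
    """Common output format shared by every input type."""
    source: str
    kind: str
    text: str


# Hypothetical placeholder extractors; real ones would call a PDF parser,
# an OCR engine, or a Word document reader.
def extract_structured_pdf(path: Path) -> str:
    return f"[text layer of {path.name}]"


def extract_scanned_pdf(path: Path) -> str:
    return f"[OCR text of {path.name}]"


def extract_word(path: Path) -> str:
    return f"[paragraphs of {path.name}]"


EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    "structured_pdf": extract_structured_pdf,
    "scanned_pdf": extract_scanned_pdf,
    "word": extract_word,
}


def classify(path: Path, has_text_layer: bool = True) -> str:
    """Pick a document kind from the file extension and the text-layer flag."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return "structured_pdf" if has_text_layer else "scanned_pdf"
    if suffix == ".docx":
        return "word"
    raise ValueError(f"unsupported input type: {suffix!r}")


def run_pipeline(path: str, has_text_layer: bool = True) -> ExtractedDocument:
    """Route one input file to its extractor and wrap the result in the common format."""
    file_path = Path(path)
    kind = classify(file_path, has_text_layer)
    return ExtractedDocument(source=path, kind=kind, text=EXTRACTORS[kind](file_path))
```

Because every branch returns an `ExtractedDocument`, later steps such as entity labeling or CSV export do not need to know which kind of file the text came from.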
Unique Offerings:
- Code walkthrough of a working pipeline that performs operations on documents such as conversion, extraction, and labeling
- Line-by-line code walkthrough of the operations performed at each step
- The end product you build with us by the end of the course is in working condition, and support is provided within 24 hours for any issues you face
- Detailed explanation of the steps required to train spaCy for NER
Key Topics:
- Understanding Data Conversion
- Conversion and Extraction from Structured PDF Documents
- Conversion of Scanned PDF Documents
- Conversion and Extraction of Data from Word Documents
- Common Format for Pipeline
- Image Reading using PIL and OpenCV
- Tesseract for Extraction
- Tesseract Page Segmentation Mode (PSM) and OCR Engine Mode (OEM)
- Extraction of Data from Images
- PyTesseract Operations
- Named Entity Recognition (NER)
- spaCy Entity Types
- IOB Format
- Labeling with spaCy for NER
- Training a spaCy Model on Custom Data Using NER
- Predicting Using a Trained spaCy Model
- Pandas
- Convert Data to CSV Output
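As a small illustration of the IOB format listed above: each token is tagged `B-` at the beginning of an entity, `I-` inside one, and `O` outside any entity. The sketch below is a simplified, pure-Python version that assumes whitespace tokenization and character-offset spans; the function name `to_iob` and the sample sentence are made up for this example, and the `ORG` and `DATE` labels follow the style of spaCy's entity types.

```python
from typing import List, Tuple


def to_iob(text: str, spans: List[Tuple[int, int, str]]) -> List[Tuple[str, str]]:
    """Tag whitespace tokens with IOB labels from (start, end, label) character spans."""
    tagged = []
    cursor = 0
    for token in text.split():
        start = text.index(token, cursor)  # locate this token at or after the cursor
        end = start + len(token)
        cursor = end
        tag = "O"
        for span_start, span_end, label in spans:
            # A token is tagged only if it lies fully inside the span.
            if start >= span_start and end <= span_end:
                tag = ("B-" if start == span_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


text = "Invoice issued by Acme Corp on 12 March 2024"
org_start = text.index("Acme Corp")
date_start = text.index("12 March 2024")
spans = [
    (org_start, org_start + len("Acme Corp"), "ORG"),
    (date_start, date_start + len("12 March 2024"), "DATE"),
]

for token, tag in to_iob(text, spans):
    print(f"{token}\t{tag}")
```

Here "Acme" is tagged `B-ORG` and "Corp" `I-ORG`, while "12", "March", and "2024" become `B-DATE`, `I-DATE`, and `I-DATE`. A production tokenizer would also split punctuation from words, which this sketch ignores.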
What You Will Learn!
- Learn how to extract data from PDFs, Word docs, scanned images, and more with ease.
- Use Tesseract and PyTesseract to perform accurate optical character recognition (OCR) on images.
- Develop a common pipeline for data extraction from different types of input documents.
- Learn how to develop a robust data extraction workflow.
- Get started using spaCy efficiently for labeling.
- Learn how to train spaCy on your own data set.
- Use Pandas to convert extracted data to CSV format.
- Design a customizable technical OCR solution for data extraction.
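The final export step can be sketched with Pandas as follows. This is a minimal example under stated assumptions, not the course's implementation: the record layout (one dictionary per document, keyed by entity label), the column names, and the helper name `entities_to_csv` are all hypothetical.

```python
from typing import Dict, List

import pandas as pd


def entities_to_csv(records: List[Dict[str, str]], columns: List[str]) -> str:
    """Turn per-document entity records into CSV text with a fixed column order."""
    frame = pd.DataFrame(records, columns=columns)
    # With no path given, to_csv returns the CSV as a string;
    # missing values are written as empty fields.
    return frame.to_csv(index=False)


records = [
    {"file": "invoice_001.pdf", "ORG": "Acme Corp", "DATE": "12 March 2024"},
    {"file": "invoice_002.pdf", "ORG": "Globex"},  # no date was extracted
]
csv_text = entities_to_csv(records, columns=["file", "ORG", "DATE"])
print(csv_text)
```

To write a file instead of returning a string, pass a path as the first argument to `DataFrame.to_csv`, for example `frame.to_csv("output.csv", index=False)`.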
Who Should Attend!
- Python developers who need to extract data from various sources for their work.
- Students who are interested in learning about data extraction and how it can be used to solve real-world problems.
- Anyone who is curious about data extraction and wants to learn more about it.