This project is a Python-based application that processes and analyzes images, specifically census cards. It includes functionality for image preprocessing, text detection, and a user interface for card selection.
- Image Preprocessing: Jupyter notebooks (`Header Text Detection.ipynb`, `Image Deskew and Trim.ipynb`, `Train Checkbox Model.ipynb`) perform image preprocessing tasks such as deskewing, trimming, and text detection.
- Text Detection: Document text detection uses the Google Cloud Vision API; refer to the `detect_document` function in `Header Text Detection.ipynb` (a sketch of this call appears below).
- Card Selection User Interface: A user interface for selecting sub-data fields on each census card. The controls for this interface are detailed in `src/help_menu.txt`.
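As a rough illustration of the text-detection step, the sketch below shows how a `detect_document`-style call can be made with the Google Cloud Vision Python client (`google-cloud-vision`). The actual implementation lives in `Header Text Detection.ipynb` and may differ in details.

```python
# Sketch only: assumes the google-cloud-vision package is installed and
# GOOGLE_APPLICATION_CREDENTIALS points at a valid service-account key.
from google.cloud import vision


def detect_document(image_path: str) -> str:
    """Return the full text the Vision API finds in a local image file."""
    client = vision.ImageAnnotatorClient()

    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)

    # full_text_annotation is organized as pages -> blocks -> paragraphs -> words.
    return response.full_text_annotation.text
```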
The codebase is organized into several Python scripts, Jupyter notebooks, and configuration files. The main scripts include:
- `src/main.py`: the main entry point of the application.
- `src/analyze_cards.py`: code for analyzing the cards.
- `src/card_selection_ui.py`: the user interface for card selection.
The Jupyter notebooks contain code for image preprocessing, text detection, and checkbox classification.
To run the project, execute the `main.py` script with the appropriate command-line arguments. For example:
`python -m src.main --images_dir path/to/images --save_dir path/to/save`
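The exact interface is defined in `src/main.py`; purely as an illustrative, hypothetical sketch, the flags above could be wired up with `argparse` roughly like this:

```python
# Hypothetical sketch of the CLI wiring; the real src/main.py may differ.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Process and analyze census card images.")
    parser.add_argument("--images_dir", required=True,
                        help="Directory containing the source card images.")
    parser.add_argument("--save_dir", required=True,
                        help="Directory where processed output is written.")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Reading images from {args.images_dir}, saving to {args.save_dir}")
```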
Please ensure you have the necessary permissions and environment variables set up for using the Google Cloud Vision API (see the example below).
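For reference, the standard way to authenticate the Vision client is to point the `GOOGLE_APPLICATION_CREDENTIALS` environment variable at a service-account key file. One way to do that from Python (the path shown is a placeholder):

```python
import os

# Placeholder path: substitute the location of your own service-account key.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

from google.cloud import vision

client = vision.ImageAnnotatorClient()  # picks up the credentials automatically
```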
- First, the source images need to be cleaned. Run them through `notebooks/01_Image_Deskew_and_Trim.ipynb` to produce folders of processed images (a minimal deskew sketch appears after this list).
- Then, use the card box selection UI to select the key areas of each box. It uses image similarity to pinpoint the bounding-box anchor points so the cropped/sliced cards are precisely aligned (see the template-matching sketch after this list). Runtime varies, but expect roughly 30 minutes per box per 1,000 images; with around 30 boxes and 3,000 images, a full run takes about 45 hours.
- Then, run the UI with `--no_find_vertex` on the subfolders that contain the cropped boxes, outlining the text areas and checkboxes and naming them.
- If no checkbox identification model has been trained yet, one must be trained. Using the data created in the previous step, manually copy around 20-30 boxes with checks and 20-30 without checks into two folders, then adjust and run `notebooks/02_Train_Checkbox_Model.ipynb` to train the model (see the training sketch after this list).
- Then, initialize the CSV by running `notebooks/03_Header_Text_Detection.ipynb`; this writes the image index and relevant header information into a table.
- Then, modify and run `notebooks/04_Body_Detect_Text_And_Checkbox.ipynb` to classify the checkboxes and detect the text, saving the results to the CSV.
- Finally, run `notebooks/05_Analyze_Data.ipynb` to generate insights and visualizations from the retrieved card data.
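For the deskewing step, a common approach (a sketch of what `01_Image_Deskew_and_Trim.ipynb` may do; the notebook itself is authoritative) is to estimate the dominant skew angle of the ink pixels with `cv2.minAreaRect` and rotate the page to compensate:

```python
import cv2
import numpy as np


def deskew(image: np.ndarray) -> np.ndarray:
    """Rotate a BGR scan so its text lines are horizontal (sketch only)."""
    # Binarize and collect the coordinates of all foreground (ink) pixels.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)

    # cv2.minAreaRect returns a rotation angle; its convention changed in
    # OpenCV 4.5, so verify the sign/mapping below against your version.
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle

    # Rotate about the image center to remove the skew.
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(image, matrix, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```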
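For the box-selection step, the "image similarity" anchor search could, for example, be implemented with normalized cross-correlation template matching. This is a hedged sketch, not necessarily what `src/card_selection_ui.py` actually does:

```python
import cv2
import numpy as np


def find_anchor(page: np.ndarray, anchor_template: np.ndarray) -> tuple:
    """Locate the top-left corner of the best match for a small anchor crop."""
    page_gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    template_gray = cv2.cvtColor(anchor_template, cv2.COLOR_BGR2GRAY)

    # Normalized cross-correlation: a score of 1.0 means a perfect match.
    result = cv2.matchTemplate(page_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)

    if max_val < 0.6:  # arbitrary threshold; tune for your scans
        raise ValueError(f"Anchor not found (best score {max_val:.2f})")
    return max_loc  # (x, y) of the matched region's top-left corner
```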
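For the checkbox-model step, a small binary image classifier trained on the two folders (checked / unchecked crops) is sufficient. The sketch below uses Keras with hypothetical folder paths; the actual `02_Train_Checkbox_Model.ipynb` may use a different framework or architecture:

```python
import tensorflow as tf

# Assumes crops were copied into data/checkboxes/checked and
# data/checkboxes/unchecked (hypothetical paths); folder names become labels.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/checkboxes", image_size=(32, 32), color_mode="grayscale",
    validation_split=0.2, subset="training", seed=42)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/checkboxes", image_size=(32, 32), color_mode="grayscale",
    validation_split=0.2, subset="validation", seed=42)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # 1 = checked, 0 = unchecked
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
model.save("checkbox_model.keras")
```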