This project is a Python-based application that processes and analyzes images, specifically census cards. It includes functionality for image preprocessing, text detection, and a user interface for card selection.
- Image Preprocessing: Jupyter notebooks (`Header Text Detection.ipynb`, `Image Deskew and Trim.ipynb`, `Train Checkbox Model.ipynb`) perform image preprocessing tasks such as deskewing, trimming, and text detection.
- Text Detection: Document text detection uses the Google Cloud Vision API; refer to the `detect_document` function in `Header Text Detection.ipynb` (a sketch of this call appears below).
- Card Selection User Interface: A user interface for selecting sub-data fields on each census card. The controls for this interface are detailed in `src/help_menu.txt`.
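As a rough illustration of the text-detection step, the sketch below shows how a `detect_document`-style call can be made with the Google Cloud Vision Python client (`google-cloud-vision`). The actual implementation lives in `Header Text Detection.ipynb` and may differ in details.

```python
# Sketch only: assumes the google-cloud-vision package is installed and
# GOOGLE_APPLICATION_CREDENTIALS points at a valid service-account key.
from google.cloud import vision


def detect_document(image_path: str) -> str:
    """Return the full text the Vision API finds in a local image file."""
    client = vision.ImageAnnotatorClient()

    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)

    # full_text_annotation is organized as pages -> blocks -> paragraphs -> words.
    return response.full_text_annotation.text
```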
The codebase is organized into several Python scripts, Jupyter notebooks, and configuration files. The main scripts include:
- `src/main.py`: the main entry point of the application.
- `src/analyze_cards.py`: code for analyzing the cards.
- `src/card_selection_ui.py`: the user interface for card selection.
The Jupyter notebooks contain code for image preprocessing, text detection, and checkbox classification.
To run the project, execute the `main.py` script with the appropriate command-line arguments. For example:
`python -m src.main --images_dir path/to/images --save_dir path/to/save`
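The exact interface is defined in `src/main.py`; purely as an illustrative, hypothetical sketch, the flags above could be wired up with `argparse` roughly like this:

```python
# Hypothetical sketch of the CLI wiring; the real src/main.py may differ.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Process and analyze census card images.")
    parser.add_argument("--images_dir", required=True,
                        help="Directory containing the source card images.")
    parser.add_argument("--save_dir", required=True,
                        help="Directory where processed output is written.")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Reading images from {args.images_dir}, saving to {args.save_dir}")
```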
Please ensure you have the necessary permissions and environment variables set up for using the Google Cloud Vision API (see the example below).
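For reference, the standard way to authenticate the Vision client is to point the `GOOGLE_APPLICATION_CREDENTIALS` environment variable at a service-account key file. One way to do that from Python (the path shown is a placeholder):

```python
import os

# Placeholder path: substitute the location of your own service-account key.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

from google.cloud import vision

client = vision.ImageAnnotatorClient()  # picks up the credentials automatically
```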
- First, the source images need to be cleaned. Run them through `notebooks/01_Image_Deskew_and_Trim.ipynb` to produce folders of processed images (a minimal deskew sketch appears after this list).
- Then, use the card box selection UI to select the key areas of each box. It uses image similarity to pinpoint the bounding-box anchor points so the cropped/sliced cards are precisely aligned (see the template-matching sketch after this list). Runtime varies, but expect roughly 30 minutes per box per 1,000 images; with around 30 boxes and 3,000 images, a full run takes about 45 hours.
- Then, run the UI with `--no_find_vertex` on the subfolders that contain the cropped boxes, outlining the text areas and checkboxes and naming them.
- If no checkbox identification model has been trained yet, one must be trained. Using the data created in the previous step, manually copy around 20-30 boxes with checks and 20-30 without checks into two folders, then adjust and run `notebooks/02_Train_Checkbox_Model.ipynb` to train the model (see the training sketch after this list).
- Then, initialize the CSV by running `notebooks/03_Header_Text_Detection.ipynb`; this writes the image index and relevant header information into a table.
- Then, modify and run `notebooks/04_Body_Detect_Text_And_Checkbox.ipynb` to classify the checkboxes and detect the text, saving the results to the CSV.
- Finally, run `notebooks/05_Analyze_Data.ipynb` to generate insights and visualizations from the retrieved card data.
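For the deskewing step, a common approach (a sketch of what `01_Image_Deskew_and_Trim.ipynb` may do; the notebook itself is authoritative) is to estimate the dominant skew angle of the ink pixels with `cv2.minAreaRect` and rotate the page to compensate:

```python
import cv2
import numpy as np


def deskew(image: np.ndarray) -> np.ndarray:
    """Rotate a BGR scan so its text lines are horizontal (sketch only)."""
    # Binarize and collect the coordinates of all foreground (ink) pixels.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)

    # cv2.minAreaRect returns a rotation angle; its convention changed in
    # OpenCV 4.5, so verify the sign/mapping below against your version.
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle

    # Rotate about the image center to remove the skew.
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(image, matrix, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```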
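For the box-selection step, the "image similarity" anchor search could, for example, be implemented with normalized cross-correlation template matching. This is a hedged sketch, not necessarily what `src/card_selection_ui.py` actually does:

```python
import cv2
import numpy as np


def find_anchor(page: np.ndarray, anchor_template: np.ndarray) -> tuple:
    """Locate the top-left corner of the best match for a small anchor crop."""
    page_gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    template_gray = cv2.cvtColor(anchor_template, cv2.COLOR_BGR2GRAY)

    # Normalized cross-correlation: a score of 1.0 means a perfect match.
    result = cv2.matchTemplate(page_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)

    if max_val < 0.6:  # arbitrary threshold; tune for your scans
        raise ValueError(f"Anchor not found (best score {max_val:.2f})")
    return max_loc  # (x, y) of the matched region's top-left corner
```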
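For the checkbox-model step, a small binary image classifier trained on the two folders (checked / unchecked crops) is sufficient. The sketch below uses Keras with hypothetical folder paths; the actual `02_Train_Checkbox_Model.ipynb` may use a different framework or architecture:

```python
import tensorflow as tf

# Assumes crops were copied into data/checkboxes/checked and
# data/checkboxes/unchecked (hypothetical paths); folder names become labels.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/checkboxes", image_size=(32, 32), color_mode="grayscale",
    validation_split=0.2, subset="training", seed=42)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/checkboxes", image_size=(32, 32), color_mode="grayscale",
    validation_split=0.2, subset="validation", seed=42)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(32, 32, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # 1 = checked, 0 = unchecked
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
model.save("checkbox_model.keras")
```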