Converts .docx files to structured .html using LibreOffice inside a Docker container. Runs on Python 3.12.0.
This project is designed to preserve key document structures needed for downstream JSON transformation, including:
- Text hierarchy (e.g. headings, bold/italic formatting)
- Structured tables
- Question-label conventions
- Style-based cues (e.g. color indicating codes vs. types)
The working test file is:
svb_qnr_-_main_-_english__february_2024_for_ingestion.docx
See docker-setup.md for full Docker installation instructions on CentOS Stream 10.
See conventions.md for encoding rules used in .docx formatting.
Folders needed:
- data/docx_input: Stores the original Word questionnaires.
- data/html_output: Stores the .html result converted with LibreOffice.
- data/html_questions, data/html_sections: Hold the questions and sections extracted from the converted .html file in docx-to-html/data/html_output.
- data/json_chunks_from_html: Holds the .json chunks produced by Claude-3.5 Haiku (should be repurposed to hold the converted HTML questions and HTML sections separately).
- data/json_final: Will contain the final .json output.
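The folder layout above can be created in one step. The following is a minimal sketch, not part of the repository: the helper name ensure_data_folders and the choice of root directory are assumptions, and the folder names are copied from the list above.

```python
from pathlib import Path

# Folder names taken from the "Folders needed" list.
DATA_FOLDERS = [
    "data/docx_input",
    "data/html_output",
    "data/html_questions",
    "data/html_sections",
    "data/json_chunks_from_html",
    "data/json_final",
]


def ensure_data_folders(root: Path) -> list[Path]:
    """Create every expected data folder under root (hypothetical helper).

    Existing folders are left untouched, so this is safe to run repeatedly.
    """
    created = []
    for name in DATA_FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling ensure_data_folders(Path.cwd()) from the project root would create the folders there. Adjust the root if the data folders live under docx-to-html/ instead.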
Libraries needed:
- Install all libraries in requirements-dev.txt
- Install Docker on root to use LibreOffice
Files needed:
- .env file storing the Anthropic API key as the environment variable ANTHROPIC_API_KEY
- .gitignore file excluding data/ and .env
- html_splitter and chunk_validator (deprecated): The old method of chunking the .html form of svb and validating that the splitting was done correctly.
- config.py: Contains old environment variable names for the folders storing data. Can be deprecated or overhauled, as the folders will no longer point to my home folder.
- conversions.py (deprecated): The old methods of attempting to convert .docx to .html before LibreOffice was used.
- libre_office_convert.sh: Converts .docx files into .html using LibreOffice.
- clean_html.py: Takes all .html files within a specified folder and writes cleaned versions into another specified folder. An uncleaned file named {filename}.html corresponds to its cleaned version, {filename}_cleaned.html.
- extract_html_questions.py and extract_html_sections.py (in use): The current method for extracting questions and sections from the .html file. Requires an uncleaned .html file that still contains the conventions (i.e., the HTML lines that mark questions and sections).
- full_pipeline.py: Calls both libre_office_convert.sh and clean_html.py on an input and output folder. To obtain the uncleaned .html file used in the current extraction process, run the pipeline without clean_html.py.
- anthropic_diagnostic.py: Simple test script that calls Claude-3.5 Haiku (or any other Anthropic model) and gets a response. Can be used to check that your key is valid, that the servers are up, etc.
- file_debugger.py, token_estimator.py, test.py, tokenizer_prompt.txt (deprecated): Remnants of debugging that are no longer useful, as the corresponding fixes have been made.
- html_prompt.txt: The prompt containing instructions to Claude for converting each .html chunk into JSON.
- questions.json, sections.json (in use): The current modified schemas, one for converting questions and one for section metadata.
- schema.json (deprecated): The old plain Draft7 JSON schema used in the .html to .json conversion.
- html_prompt_builder.py: Defines build_messages, the function that inserts the chosen JSON schema and HTML chunk into html_prompt.txt to be sent to Claude.
- send_to_anthropic.py: Calls build_messages and defines the paths required for building the message. Requires an input folder of all .html chunks and an output folder to hold the returned .json chunks. File names are preserved.
- validate.py: Used to validate returned JSON against the Draft7 schema (might be deprecated, depending on the new schemas used).
- json_combiner.py: Combines all .json chunks in a specified folder and writes the final output to another specified folder.
- For now, markdown conversion will not be used (.html yielded better results). In any case, markdown_to_json/ is effectively a simplified copy of html_to_json/, with the addition of Turndown to perform the .html to .md conversion.
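
To illustrate the conversion and cleaning steps described for libre_office_convert.sh and clean_html.py, here is a rough sketch under stated assumptions. It is not the code in those files: the function names build_convert_command and cleaned_path are hypothetical, the soffice binary is assumed to be on the PATH inside the container, and the real script may pass different options. Only the {filename}.html to {filename}_cleaned.html naming rule is taken from the description above.

```python
from pathlib import Path


def build_convert_command(docx_path: Path, out_dir: Path) -> list[str]:
    """Build a LibreOffice headless command that converts one .docx to .html.

    Assumption: the executable is called "soffice"; the actual shell script
    in this repository may invoke LibreOffice differently.
    """
    return [
        "soffice",
        "--headless",             # run without a GUI
        "--convert-to", "html",   # target format
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def cleaned_path(html_path: Path, cleaned_dir: Path) -> Path:
    """Map {filename}.html to {filename}_cleaned.html inside cleaned_dir."""
    return cleaned_dir / f"{html_path.stem}_cleaned{html_path.suffix}"
```

The command list can be passed to subprocess.run(cmd, check=True) where LibreOffice is installed. Keeping the command builder separate from the call makes the naming and argument logic easy to check without running a conversion.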