PDF Summarizer for COXIT
A tool for automated PDF document processing and summarization using LLMs. It processes PDF documents, extracts their content, and generates structured summaries with advanced language models.
Note: The tool is specifically designed for COXIT documents and is not intended for generic PDF documents.
Note: A paid Gemini plan is recommended, as it has higher rate limits.
- 📄 Automated PDF document processing
- 🔍 Intelligent section detection and organization
- 🤖 LLM-powered content summarization
- 📊 Structured output in CSV format
- 👀 Real-time document monitoring
- 🚀 Multi-threaded & asynchronous processing pipeline
- Document Monitoring: Watches the target directory for new PDF files
- PDF Processing: Extracts and processes text from PDF documents
- Step 1: Initial content analysis and section detection
- Step 2: Section-based summarization
- Output Formatting: Generates structured CSV output
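To make the flow above concrete, here is a minimal sketch of such a pipeline in Python. Every name in it (`extract_text`, `detect_sections`, `summarize_section`, `write_csv`, `watch_directory`) is a hypothetical placeholder for illustration, not the project's actual API, and each stage is stubbed out.

```python
import asyncio
from pathlib import Path

# Minimal sketch of the pipeline described above. All helper names are
# hypothetical placeholders, not the project's real API; stages are stubbed.

def extract_text(pdf_path: Path) -> str:
    """PDF Processing: extract raw text from the document (stubbed)."""
    return f"<text of {pdf_path.name}>"

async def detect_sections(text: str) -> list[str]:
    """Step 1: initial content analysis and section detection (stubbed)."""
    return ["Section A", "Section B"]

async def summarize_section(section: str) -> str:
    """Step 2: LLM-powered, section-based summarization (stubbed)."""
    return f"summary of {section}"

def write_csv(pdf_path: Path, sections: list[str], summaries: list[str]) -> None:
    """Output Formatting: write structured CSV output (stubbed)."""
    print(pdf_path.name, list(zip(sections, summaries)))

async def process_pdf(pdf_path: Path) -> None:
    text = extract_text(pdf_path)
    sections = await detect_sections(text)
    summaries = await asyncio.gather(*(summarize_section(s) for s in sections))
    write_csv(pdf_path, sections, summaries)

async def watch_directory(target_dir: Path) -> None:
    """Document Monitoring: poll the target directory for new PDFs."""
    seen: set[Path] = set()
    while True:
        for pdf in target_dir.glob("*.pdf"):
            if pdf not in seen:
                seen.add(pdf)
                asyncio.create_task(process_pdf(pdf))
        await asyncio.sleep(5)

if __name__ == "__main__":
    asyncio.run(watch_directory(Path("target_dir")))
```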
- Install uv (a modern replacement for pip, pipx, poetry, and venv)

  ```bash
  wget -qO- https://astral.sh/uv/install.sh | sh
  ```

- Clone the repository

  ```bash
  git clone https://github.com/valaises/pdf-summ-coxit.git
  ```

- Install pdf-summ-coxit

  ```bash
  uv sync && pip install -e .
  ```

- Set ENV variables

  ```bash
  export GEMINI_API_KEY=
  export OPENAI_API_KEY=
  ```

- Run the summarizer

  ```bash
  python -m src.core.main -d /path/to/your/pdfs
  ```

  Arguments:
  - `-d`, `--target-dir`: Directory to monitor for PDF files (required)
  - `--debug`: Enable debug logging (optional)
(assuming docker is installed)

- Build an image

  ```bash
  cd ~/code/pdf-summ-coxit
  docker build -t pdf-summ .
  ```

- Specify variables in .env

  ```bash
  cp .env.example .env
  vim .env
  ```

- Start container (detached mode)

  ```bash
  docker run -d --env-file .env -v /path/to/your/pdfs:/app/target_dir pdf-summ
  ```

(assuming docker & docker compose are installed)
- Set ENV variables in .env & .env.compose

  ```bash
  cp .env.example .env
  cp .env.compose.example .env.compose
  vim .env
  vim .env.compose
  ```

- Start container (detached mode)

  ```bash
  docker compose --env-file .env.compose up
  ```

After each document is processed, output.csv, output_parts.csv, and usage.csv are automatically re-generated as artifacts inside the directory specified by the `-d` / `--output_dir` argument.
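For a quick look at what was produced, the artifacts can be inspected with the Python standard library. Only the file names below come from this README; the column layout is not documented here, so the snippet just prints whichever headers and row counts it finds.

```python
import csv
from pathlib import Path

# Quick inspection of the generated artifacts. Only the file names come from
# the docs above; the column layout is whatever the summarizer produced.
output_dir = Path("/path/to/your/pdfs")  # directory passed via -d / --output_dir

for name in ("output.csv", "output_parts.csv", "usage.csv"):
    path = output_dir / name
    if not path.exists():
        continue
    with path.open(newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        print(f"{name}: empty")
        continue
    header, body = rows[0], rows[1:]
    print(f"{name}: {len(body)} rows, columns: {header}")
```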
- The number of requests needed to summarize a document is, in most cases, page_count + sections_count (e.g. a 20-page document with 8 detected sections takes roughly 28 requests).
- The model is generally `gemini-2`; if it fails to generate valid JSON, `gpt-4o` is used instead.
- The time needed to summarize all documents != sum(t for t in doc.usage), as documents are processed asynchronously.
- Cost is calculated using data provided in `assets/model_list.json`.
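As an illustration of how such a cost lookup could work, here is a sketch that assumes `assets/model_list.json` maps model names to per-token prices; that schema is an assumption made for the example, not documented behavior of the tool.

```python
import json
from pathlib import Path

# Sketch of a per-request cost calculation based on assets/model_list.json.
# ASSUMPTION: the file maps model names to per-token prices, e.g.
# {"gpt-4o": {"input_cost_per_token": 2.5e-06, "output_cost_per_token": 1e-05}}.
# The real schema of model_list.json may differ.

def request_cost(model: str, prompt_tokens: int, completion_tokens: int,
                 model_list_path: str = "assets/model_list.json") -> float:
    prices = json.loads(Path(model_list_path).read_text())[model]
    return (prompt_tokens * prices["input_cost_per_token"]
            + completion_tokens * prices["output_cost_per_token"])

# Example: one request with 1,200 prompt tokens and 300 completion tokens.
# print(request_cost("gpt-4o", 1200, 300))
```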
`eval.py` compares target results from `tests/expected.json` with the results generated by the summarizer.
Note: Eval only works while a summarization is in progress, e.g. if you start eval.py after the summarizer has finished its work, eval.py will show nothing.
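Conceptually, the comparison can be pictured like the sketch below. It assumes `tests/expected.json` maps document names to expected field values and that results land in output.csv with matching column names; both are assumptions for illustration, and the real `tests/eval.py` may structure things differently.

```python
import csv
import json
from pathlib import Path

# Rough sketch of the expected-vs-actual comparison that eval performs.
# ASSUMPTIONS: tests/expected.json maps a document name to expected field
# values, and output.csv has one row per document with matching columns.
# The real tests/eval.py may structure both files differently.

def compare(expected_path: str = "tests/expected.json",
            output_csv: str = "dataset/output.csv") -> None:
    expected = json.loads(Path(expected_path).read_text())
    with open(output_csv, newline="") as f:
        rows = list(csv.DictReader(f))

    for row in rows:
        doc = row.get("document", "")          # hypothetical column name
        want = expected.get(doc, {})
        hits = sum(1 for k, v in want.items()
                   if str(row.get(k, "")).strip() == str(v).strip())
        print(f"{doc}: {hits}/{len(want)} expected fields match")

if __name__ == "__main__":
    compare()
```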
To run the evaluation locally:

- Export ENV variables

  ```bash
  export GEMINI_API_KEY=
  export OPENAI_API_KEY=
  ```

- Copy PDFs from your dataset into the dataset dir

  ```bash
  mkdir ~/code/pdf-summ-coxit/dataset && cp /path/to/your/pdfs/* ~/code/pdf-summ-coxit/dataset
  ```

- Run eval.py

  ```bash
  cd ~/code/pdf-summ-coxit
  python tests/eval.py -d dataset
  ```

- Run the summarizer

  ```bash
  python -m src.core.main -d dataset
  ```

As PDFs get processed, watch the STDOUT of eval.py for results, and output.csv / output_parts.csv in ~/code/pdf-summ-coxit/dataset.
To run the evaluation inside Docker:

- Start the container (see the docker / docker compose usage above)
- Inside the container, run eval.py

  ```bash
  docker exec -it <container_name> bash
  python tests/eval.py -d target_dir
  ```