Advanced PDF Summarizer for specific COXIT documents using latest LLMs and semi-agentic workflows
PDF Summarizer for COXIT

A powerful tool for automated PDF document processing and summarization using large language models. It processes PDF documents, extracts their content, and generates structured summaries.

Note: The tool is specifically designed for COXIT documents and will not be of use for generic PDF documents.

Note: A paid Gemini plan is recommended, as it has higher rate limits.

Features

  • 📄 Automated PDF document processing
  • 🔍 Intelligent section detection and organization
  • 🤖 LLM-powered content summarization
  • 📊 Structured output in CSV format
  • 👀 Real-time document monitoring
  • 🚀 Multi-threaded & asynchronous processing pipeline

How It Works

The tool operates in a pipeline:

  1. Document Monitoring: Watches the target directory for new PDF files
  2. PDF Processing: Extracts and processes text from PDF documents
  3. Step 1: Initial content analysis and section detection
  4. Step 2: Section-based summarization
  5. Output Formatting: Generates structured CSV output
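The five stages above can be sketched as a chain of transformations over a document. This is an illustrative sketch only: the function and class names below are hypothetical and do not reflect the tool's actual internal API, and the extraction and summarization stages are stubbed out where the real tool reads PDFs and calls an LLM.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    path: str
    text: str = ""
    sections: list = field(default_factory=list)
    summaries: list = field(default_factory=list)

def extract_text(doc: Document) -> Document:
    # Stage 2: PDF text extraction (stubbed; the real tool parses the PDF)
    doc.text = f"contents of {doc.path}"
    return doc

def detect_sections(doc: Document) -> Document:
    # Stage 3 (Step 1): initial content analysis and section detection
    doc.sections = [s for s in doc.text.split(". ") if s]
    return doc

def summarize_sections(doc: Document) -> Document:
    # Stage 4 (Step 2): per-section summarization (an LLM call in the real tool)
    doc.summaries = [s[:40] for s in doc.sections]
    return doc

def to_csv_rows(doc: Document) -> list:
    # Stage 5: structured CSV output, one row per section summary
    return [[doc.path, summary] for summary in doc.summaries]

def process(path: str) -> list:
    # Stage 1 (monitoring) would call this for each new PDF it detects
    doc = Document(path)
    for stage in (extract_text, detect_sections, summarize_sections):
        doc = stage(doc)
    return to_csv_rows(doc)
```

In the real pipeline, stage 1 runs continuously as a directory watcher and dispatches each new file to this kind of per-document flow.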

On-Device Installation & Usage

  1. Install uv (a modern replacement for pip, pipx, poetry, and venv)
wget -qO- https://astral.sh/uv/install.sh | sh
  2. Clone the repository
git clone https://github.com/valaises/pdf-summ-coxit.git
  3. Install pdf-summ-coxit
uv sync && pip install -e .
  4. Set ENV variables
export GEMINI_API_KEY=
export OPENAI_API_KEY=
  5. Run the summarizer
python -m src.core.main -d /path/to/your/pdfs

Command Line Arguments

  • -d, --target-dir: Directory to monitor for PDF files (required)
  • --debug: Enable debug logging (optional)
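The two flags above can be mirrored with a minimal argparse sketch. This is an assumption-laden illustration: the tool's real parser lives in src.core.main and may define these options differently or accept more of them.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags
    parser = argparse.ArgumentParser(prog="pdf-summ-coxit")
    parser.add_argument("-d", "--target-dir", required=True,
                        help="Directory to monitor for PDF files")
    parser.add_argument("--debug", action="store_true",
                        help="Enable debug logging")
    return parser
```

For example, `build_parser().parse_args(["-d", "/path/to/pdfs"])` yields a namespace with `target_dir="/path/to/pdfs"` and `debug=False`.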

Docker Usage

(assuming docker is installed)

  1. Build an image
cd ~/code/pdf-summ-coxit
docker build -t pdf-summ .
  2. Specify variables in .env
cp .env.example .env
vim .env
  3. Start the container (detached mode)
docker run -d --env-file .env -v /path/to/your/pdfs:/app/target_dir pdf-summ

Docker Compose Usage

(assuming docker & docker compose are installed)

  1. Set ENV variables in .env & .env.compose
cp .env.example .env
cp .env.compose.example .env.compose
vim .env
vim .env.compose
  2. Start the container (detached mode)
docker compose --env-file .env.compose up -d

Getting results

After each document is processed, output.csv, output_parts.csv, and usage.csv are automatically regenerated in an artifacts directory inside the directory specified by the -d, --target-dir argument.

Notes about usage:

  • The number of requests needed to summarize a document is, in most cases, page_count + sections_count
  • The model is generally gemini-2; if it fails to generate valid JSON, the tool falls back to gpt-4o
  • The time needed to summarize all documents is not the sum of the per-document times, as documents are processed asynchronously
  • Cost is calculated using the pricing data in assets/model_list.json
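The cost note above can be illustrated with a small per-token pricing calculation. The prices and the schema below are hypothetical placeholders; the real numbers come from assets/model_list.json, whose actual format may differ.

```python
# Hypothetical per-million-token prices; real values live in assets/model_list.json
PRICING = {
    "gemini-2": {"input_per_1m": 0.10, "output_per_1m": 0.40},
    "gpt-4o":   {"input_per_1m": 2.50, "output_per_1m": 10.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Cost = tokens consumed at the input rate + tokens produced at the output rate
    p = PRICING[model]
    return (input_tokens / 1_000_000 * p["input_per_1m"]
            + output_tokens / 1_000_000 * p["output_per_1m"])
```

With these placeholder rates, a document that consumes 500k input tokens and produces 50k output tokens on gemini-2 would cost 0.5 × 0.10 + 0.05 × 0.40 = 0.07.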

Evaluation, locally

What does the evaluation do?

It compares the expected results from tests/expected.json with the results generated by the summarizer.
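That comparison can be sketched as matching generated entries against the expected ones. This is an illustrative approximation only: the real logic lives in tests/eval.py, and the actual schema of tests/expected.json is not shown here.

```python
def compare(expected: dict, generated: dict) -> dict:
    # Count entries whose generated value matches the expected value exactly,
    # and report which expected keys were never produced at all.
    matched = {k for k in expected if generated.get(k) == expected[k]}
    return {
        "matched": len(matched),
        "total": len(expected),
        "missing": sorted(set(expected) - set(generated)),
    }
```

A score such as `matched / total` can then be tracked as documents stream through the summarizer.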

Attention

Eval only works while a summarization is in progress: if you start eval.py after the summarizer has finished its work, eval.py will show nothing.

  1. Export ENV variables
export GEMINI_API_KEY=
export OPENAI_API_KEY=
  2. Copy PDFs from the dataset into the dataset dir
mkdir ~/code/pdf-summ-coxit/dataset && cp /path/to/your/pdfs/* ~/code/pdf-summ-coxit/dataset
  3. Run eval.py
cd ~/code/pdf-summ-coxit
python tests/eval.py -d dataset
  4. Run the summarizer
python -m src.core.main -d dataset

As the PDFs are being processed, watch the STDOUT of eval.py for results, and output.csv and output_parts.csv in ~/code/pdf-summ-coxit/dataset.

Evaluation, docker / compose

  1. Start the container (see Docker / Docker Compose usage above)
  2. Inside the container, run eval.py
docker exec -it <container_name> bash
python tests/eval.py -d target_dir

Video: how to run eval

Screen.Recording.2025-02-17.at.12.1.mp4
