Fast PDF2Text converter to be used for ML. Processes whole directories of PDFs at once.

PDF2Text

I've converted 8 GB of PDFs to text in one afternoon on a decade-old x270 using this script. Performant enough imho. Try getting 8 GB into your LLM and getting it to actually use it. That's the challenge.

Convert all PDFs to text

This is a script for converting a batch of PDFs to text for machine learning. It only has two dependencies:

  • python3
  • pdf.miner (python requirement, specified in requirements.txt file)

Installation

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Usage

Activate your virtual environment.

source .venv/bin/activate
./pdf2text [source/destination dir]

You read that correctly: the source directory is also the destination directory, so the text files end up in the same directory as the PDFs.
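
For readers who want a picture of what "source is also destination" means in practice, here is a minimal sketch of an in-place batch conversion. It is not the actual pdf2text script. It assumes the dependency is pdfminer.six (using its `extract_text` helper), that subdirectories are searched too, and that each text file sits next to its PDF with a `.txt` suffix. The names `iter_pdfs` and `convert_directory` are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterator, Optional


def iter_pdfs(root: Path) -> Iterator[Path]:
    """Yield every PDF below root in sorted order (suffix match ignores case)."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


def convert_directory(root: Path,
                      extract: Optional[Callable[[str], str]] = None) -> int:
    """Write a .txt file next to each PDF under root; return how many were written."""
    if extract is None:
        # Assumption: pdfminer.six is the installed dependency.
        from pdfminer.high_level import extract_text
        extract = extract_text

    converted = 0
    for pdf in iter_pdfs(root):
        target = pdf.with_suffix(".txt")  # same directory, same base name
        if target.exists():
            continue  # already converted on an earlier run
        try:
            text = extract(str(pdf))
        except Exception as exc:
            # One broken PDF should not stop the whole batch.
            print(f"skipped {pdf}: {exc}")
            continue
        target.write_text(text, encoding="utf-8")
        converted += 1
    return converted
```

Calling `convert_directory(Path("some/dir"))` would then leave `report.txt` beside `report.pdf`. Skipping existing `.txt` files is a design choice of this sketch so that an interrupted run can be resumed. The real script may behave differently.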