Compare commits

2 Commits

Author SHA1 Message Date
746f6da5d5 Added workflow 2024-11-22 20:45:58 +01:00
(Some checks failed: pdf2text test / Compile (push) has been cancelled)
994d5495b2 Provided links 2024-11-22 20:41:31 +01:00
2 changed files with 26 additions and 3 deletions

@@ -0,0 +1,20 @@
+name: pdf2text test
+run-name: syntax check
+on: [push]
+jobs:
+  Compile:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out repository code
+        uses: actions/checkout@v4
+      - name: List files in the repository
+        run: |
+          ls ${{ gitea.workspace }}
+      - run: echo "Install dependencies."
+      - run: apt update
+      - run: apt install -y python3
+      - run: python3 -m pip install -r requirements.txt
+      - run: echo "Check if starts correctly. Syntax check."
+      - run: ./pdf2text .
+      - run: echo "This job's status is ${{ job.status }}."
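
Review note: the workflow's syntax check works by actually running `./pdf2text .`, so it only succeeds once the dependencies are installed. A lighter check could parse the script without executing it. The following is a minimal sketch, assuming `pdf2text` is a plain Python source file (the README lists python3 as a dependency); `check_syntax` is a hypothetical helper and not part of the repository.

```python
import ast
from pathlib import Path


def check_syntax(path):
    """Return None if the file parses as Python, else a short error string.

    Hypothetical helper: it only parses the source, it never runs it,
    so third-party imports do not need to be installed.
    """
    source = Path(path).read_text(encoding="utf-8")
    try:
        ast.parse(source, filename=str(path))
    except SyntaxError as exc:
        return f"{path}:{exc.lineno}: {exc.msg}"
    return None
```

A workflow step could call `check_syntax("pdf2text")` and fail the job when the result is not `None`. This catches syntax errors only; it does not replace the start-up check the workflow performs.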

@@ -3,10 +3,10 @@
 I've converted 8gb of PDF's to text in one afternoon on a decade old x270 using this script. Performant enough imho. Try to get 8Gb in your LLM and getting it to actually use it. That's the challenge.
 ## Convert all PDF's to text
-This is a script for converting a batch of PDF's to text for machine learning.
+This is a [script](/pdf2text) for converting a batch of PDF's to text for machine learning.
 It only has two dependencies:
-- python3
-- pdf.miner (python requirement, specified in requirements.txt file)
+- `python3`
+- `pdf.miner` (python requirement, specified in [requirements.txt](/requirements.txt) file)
 ## Installation
 ```bash
@@ -22,3 +22,6 @@ source .venv/bin/activate
 ./pdf2text [source/destination dir]
 ```
 You read that correctly, the source directory is also the destination directory.
+## Todo:
+Make decent python package so it's installable on system without having to load environment first. Not sure if worth it, it's not something you daily use.
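
Review note: the README states that the source directory is also the destination directory. A rough sketch of that in-place behaviour is below. It is not the script's actual code: it assumes the `pdf.miner` dependency refers to pdfminer.six and its `extract_text` function, and the recursive search and the skipping of already converted files are also assumptions. `convert_directory` and `default_extractor` are hypothetical names.

```python
from pathlib import Path


def default_extractor(pdf_path):
    """Extract text with pdfminer.six (assumed backend, imported lazily)."""
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def convert_directory(directory, extractor=default_extractor):
    """Write a .txt file next to every .pdf under `directory`.

    Returns the list of text files written during this call.
    """
    written = []
    for pdf_path in sorted(Path(directory).rglob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        if txt_path.exists():
            continue  # assumption: skip files converted on an earlier run
        txt_path.write_text(extractor(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Passing the extractor as a parameter keeps the directory walk testable without real PDFs, which might also suit the workflow added in this change.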