Compare commits
2 Commits
711c3b4802...746f6da5d5
| Author | SHA1 | Date |
|---|---|---|
| | 746f6da5d5 | |
| | 994d5495b2 | |
20
.gitea/workflows/test.yaml
Normal file
@@ -0,0 +1,20 @@
+name: pdf2text test
+run-name: syntax check
+on: [push]
+
+jobs:
+  Compile:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out repository code
+        uses: actions/checkout@v4
+      - name: List files in the repository
+        run: |
+          ls ${{ gitea.workspace }}
+      - run: echo "Install dependencies."
+      - run: apt update
+      - run: apt install -y python3
+      - run: python3 -m pip install -r requirements.txt
+      - run: echo "Check if it starts correctly. Syntax check."
+      - run: ./pdf2text .
+      - run: echo "This job's status is ${{ job.status }}."
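The workflow's "syntax check" step runs the script against the repository directory. A lighter check that only parses the script without executing it might look like the sketch below. It assumes `pdf2text` is a Python script (the README lists `python3` as a dependency); `check_syntax` is a hypothetical helper name and not part of the repository.

```python
import sys


def check_syntax(path: str) -> bool:
    """Return True if the file at `path` parses as Python source."""
    with open(path, "rb") as handle:
        source = handle.read()
    try:
        # compile() parses the source without running any of it.
        compile(source, path, "exec")
    except SyntaxError as error:
        print(f"{path}:{error.lineno}: {error.msg}", file=sys.stderr)
        return False
    return True
```

Calling `check_syntax("pdf2text")` from a workflow step and exiting non-zero on `False` would fail the job on a syntax error without touching any PDF files.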
@@ -3,10 +3,10 @@
 I've converted 8gb of PDF's to text in one afternoon on a decade old x270 using this script. Performant enough imho. Try to get 8Gb in your LLM and getting it to actually use it. That's the challenge.
 
 ## Convert all PDF's to text
-This is an script for converting a batch of PDF's to text for machine learning.
+This is a [script](/pdf2text) for converting a batch of PDF's to text for machine learning.
 It only has two dependencies:
-- python3
-- pdf.miner (python requirement, specified in requirements.txt file)
+- `python3`
+- `pdf.miner` (python requirement, specified in [requirements.txt](/requirements.txt) file)
 
 ## Installation
 ```bash
@@ -22,3 +22,6 @@ source .venv/bin/activate
 ./pdf2text [source/destination dir]
 ```
 You read that correctly, the source directory is also the destination directory.
+
+## Todo:
+Make a decent Python package so it's installable system-wide without having to load the environment first. Not sure if it's worth it; it's not something you use daily.
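Because the README states that the source directory is also the destination directory, the output path for each PDF can be derived from its input path alone. The sketch below shows one way that could work; it is not the script's actual code. `plan_outputs` and `convert_all` are hypothetical names, and both the recursion into subdirectories and the skipping of already-converted files are assumptions. The extraction call assumes the `pdfminer.six` package, whose `pdfminer.high_level.extract_text` function returns a PDF's text as a string.

```python
from pathlib import Path


def plan_outputs(root: str) -> list[tuple[Path, Path]]:
    """Pair every PDF under `root` with a .txt path in the same directory."""
    pairs = []
    for pdf_path in sorted(Path(root).rglob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # Assumption: an existing .txt means the PDF was already converted.
        if not txt_path.exists():
            pairs.append((pdf_path, txt_path))
    return pairs


def convert_all(root: str) -> int:
    """Convert each planned PDF and return the number of files written."""
    # Imported here so plan_outputs() works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    pairs = plan_outputs(root)
    for pdf_path, txt_path in pairs:
        txt_path.write_text(extract_text(str(pdf_path)), encoding="utf-8")
    return len(pairs)
```

Keeping the path planning separate from the extraction makes it possible to preview what a run would write before any file is touched.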