How to split large merged PDF files with Gemini AI and pdftk

Sometimes it's the small things that help us most in our daily lives.
Here is a small “how-to” that helps me run a paperless office day to day.
Imagine the following:
You have a document scanner and many documents, letters, invoices, and other stuff: a nice mix of everything.
There are two main ways to work with a document scanner:
- scan each document, save it, and give it a meaningful name
- scan all documents at once, save a large merged PDF file, and split it afterward
One may argue that it doesn't matter which method you prefer; there is no time saved here. If you have only 3 or 4 documents, I agree. But what if you have been lazy and now have a stack of 10 or more documents on your table?
In that case, I prefer option 2.
In the past, here is what I did:
- open the large PDF
- extract (drag and drop) each page into a new file
- save it with a meaningful name + the date (mentioned in the document) at the end, for example invoice_2024-10-01.pdf
- (find the correct destination folder and save)
But man, that takes time.
Here's how this can be done in a minute:
Because Google Gemini handles PDF files extremely well, you can upload your large PDF file (containing all documents) and give it this prompt:
I have a PDF that is a merge of multiple different PDFs. Documents with varying numbers of pages were combined into one PDF. Please analyze the entire PDF and determine which pages belong together.
Please help me with:
1. Identifying which pages belong to the same document
2. Creating appropriate filenames based on these rules:
- only lowercase letters
- no spaces
- short descriptive names
- date at the end reflecting when the document/invoice/letter was created (format: YYYY-MM-DD)
3. Providing a bash script that handles the splitting and correct naming
Here's an example of the script structure I'm looking for:
#!/bin/bash
# Input PDF file
input_pdf="sample_document.pdf"
# Output files with dates
insurance_policy="car_insurance_2024-11-05.pdf"
bank_statement="bank_statement_2024-11.pdf"
medical_invoice_oct="medical_invoice_2024-10-21.pdf"
medical_invoice_nov="medical_invoice_2024-11-03.pdf"
utility_bill="electric_bill_2024-11-10.pdf"
investment_report="investment_report_2024-11-15.pdf"
# Splitting PDF pages with pdftk
pdftk "$input_pdf" cat 1-2 output "$insurance_policy"
pdftk "$input_pdf" cat 3-4 output "$bank_statement"
pdftk "$input_pdf" cat 5-6 output "$medical_invoice_oct"
pdftk "$input_pdf" cat 7-8 output "$medical_invoice_nov"
pdftk "$input_pdf" cat 9-12 output "$utility_bill"
pdftk "$input_pdf" cat 13-15 output "$investment_report"
echo "PDFs split successfully."
Note: I'm using Ubuntu Linux, so please only use tools available for this environment
Adjust the prompt to your needs!
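Because the page ranges in the returned script come from the model, it can be worth a quick sanity check before splitting anything. The following is only a minimal sketch, not part of my original workflow: the function names page_count and check_ranges are hypothetical, and it assumes the ranges are listed in page order, as in the example script above. It reads the page count from pdftk's dump_data report and checks that the ranges cover every page exactly once.

```bash
#!/bin/bash
# Sketch: verify that the suggested page ranges cover the merged PDF exactly once.
# The function names are hypothetical helpers, not part of pdftk.

# Print the page count of a PDF, taken from pdftk's dump_data report.
page_count() {
    pdftk "$1" dump_data | awk '/^NumberOfPages:/ {print $2}'
}

# check_ranges TOTAL RANGE...
# Each RANGE is "start-end" or a single page number, listed in page order.
# Returns 0 only if the ranges cover pages 1..TOTAL without gaps or overlaps.
check_ranges() {
    local total=$1 expected=1 range start end
    shift
    for range in "$@"; do
        start=${range%-*}   # text before the dash (or the whole value)
        end=${range#*-}     # text after the dash (or the whole value)
        if (( start != expected || end < start )); then
            echo "Problem at range $range: expected it to start at page $expected" >&2
            return 1
        fi
        expected=$(( end + 1 ))
    done
    if (( expected != total + 1 )); then
        echo "Ranges end at page $(( expected - 1 )), but the PDF has $total pages" >&2
        return 1
    fi
    echo "All $total pages are covered exactly once."
}
```

For the example script above, the call would be check_ranges "$(page_count sample_document.pdf)" 1-2 3-4 5-6 7-8 9-12 13-15. If a page is skipped or used twice, the function names the offending range instead of printing the success message.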
Copy the script Gemini returns into the folder that contains your merged PDF and run it: DONE.
This may look complicated at first, but believe me, it saves so much time 😉
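Running the script only works if pdftk is installed. Here is a small sketch of how that step could look. The script name split_pdfs.sh and both function names are my own placeholders, and as far as I know the Ubuntu package is called pdftk-java on recent releases, so treat that as an assumption and check with apt search pdftk.

```bash
#!/bin/bash
# Sketch: run the generated split script only if the required tool is available.
# split_pdfs.sh and both function names are hypothetical placeholders.

# require_tool NAME: succeed if NAME is on the PATH, otherwise print a hint and fail.
require_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1 not found, please install it with your package manager first" >&2
        return 1
    fi
}

# run_split [SCRIPT]: check for pdftk, then run the script (default: split_pdfs.sh).
run_split() {
    require_tool pdftk && bash "${1:-split_pdfs.sh}"
}
```

Calling run_split from the folder that holds both the merged PDF and the script either splits the file or tells you that pdftk is missing.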
If you have a detailed folder structure (e.g., invoices, private, business, etc.), you can extend this script by adding rules with a destination folder so the documents are moved to their final destination after splitting. However, as this setup is highly individual, I am focusing solely on the splitting process here.
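To give a rough idea of what such an extension could look like, here is a minimal sketch. The folder names, the filename patterns, the BASE_DIR variable, and the function names are all hypothetical examples that you would replace with your own structure.

```bash
#!/bin/bash
# Sketch: move split PDFs into folders chosen by filename.
# Folder names and patterns are hypothetical examples, adjust them to your setup.

# Print the destination folder for a given filename.
destination_for() {
    case "$1" in
        *invoice*|*bill*)     echo "invoices" ;;
        *insurance*)          echo "private/insurance" ;;
        *statement*|*report*) echo "finance" ;;
        *)                    echo "unsorted" ;;
    esac
}

# Move each given PDF into its destination below BASE_DIR (default: current folder).
# Note: plain mv overwrites a file of the same name that already exists there.
file_documents() {
    local base=${BASE_DIR:-.} file dest
    for file in "$@"; do
        dest="$base/$(destination_for "$(basename "$file")")"
        mkdir -p "$dest"
        mv -- "$file" "$dest/"
    done
}
```

After splitting, a call like BASE_DIR="$HOME/Documents" file_documents *_20*.pdf would sort the new files. Anything that matches no rule lands in the unsorted folder, so nothing gets lost.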
As always, I hope you learned something, or at least that this small tip saves you a lot of time in the future.
Feel free to clap and/or follow me on X, Mastodon, BlueSky, or LinkedIn.
Cu! 👋