How to split large merged PDF files with Gemini AI and pdftk

3 min read · Nov 2, 2024

Sometimes it is the small things that help us most in our daily lives.
Here is a small “how-to” that helps me every day in running a paperless office.

Imagine the following:

You have a document scanner and many documents, letters, invoices, and other stuff: a nice mix of everything.

Working with a document scanner can be handled in two main ways:

1. scan each document, save it, and give it a meaningful name
2. scan all documents at once, save a large merged PDF file, and split it afterward

One may argue that it doesn't matter which method you prefer, since neither saves any time. If you have only 3 or 4 documents, I agree. But what if you have been lazy and now have a stack of 10 or more documents on your table?

In that case, I prefer option 2.

In the past, here is what I did:

open the large PDF
extract (drag and drop) each page into a new file
save it with a meaningful name + the date (mentioned in the document) at the end, for example invoice_2024-10-01.pdf
(find the correct destination folder and save)

But man, that takes time.

Here's how this can be done in a minute:

Because Google Gemini handles PDF files extremely well, you can upload your large PDF file (containing all documents) and give it this prompt:

I have a PDF that is a merge of multiple different PDFs. Documents with varying numbers of pages were combined into one PDF. Please analyze the entire PDF and determine which pages belong together.

Please help me with:
1. Identifying which pages belong to the same document
2. Creating appropriate filenames based on these rules:
   - only lowercase letters
   - no spaces
   - short descriptive names
   - date at the end reflecting when the document/invoice/letter was created (format: YYYY-MM-DD)
3. Providing a bash script that handles the splitting and correct naming

Here's an example of the script structure I'm looking for:

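#!/bin/bash

# Stop at the first failed command so the success message is only printed if every split worked
set -e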

# Input PDF file
input_pdf="sample_document.pdf"

# Output files with dates
insurance_policy="car_insurance_2024-11-05.pdf"
bank_statement="bank_statement_2024-11-30.pdf"
medical_invoice_oct="medical_invoice_2024-10-21.pdf"
medical_invoice_nov="medical_invoice_2024-11-03.pdf"
utility_bill="electric_bill_2024-11-10.pdf"
investment_report="investment_report_2024-11-15.pdf"

# Splitting PDF pages with pdftk
pdftk "$input_pdf" cat 1-2 output "$insurance_policy"
pdftk "$input_pdf" cat 3-4 output "$bank_statement"
pdftk "$input_pdf" cat 5-6 output "$medical_invoice_oct"
pdftk "$input_pdf" cat 7-8 output "$medical_invoice_nov"
pdftk "$input_pdf" cat 9-12 output "$utility_bill"
pdftk "$input_pdf" cat 13-15 output "$investment_report"

echo "PDFs split successfully."

Note: I'm using Ubuntu Linux, so please only use tools available for this environment

Adjust the prompt to your needs!
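
Before trusting the names Gemini suggests, it can help to check them against the naming rules from the prompt. The following is only a minimal sketch: the function name is_valid_name is made up for illustration, and it assumes that digits and underscores are allowed in addition to lowercase letters, as in the example filenames above.

```shell
#!/bin/bash

# Hypothetical helper (not part of pdftk or Gemini): returns 0 if a filename
# follows the naming rules from the prompt.
# Assumption: lowercase letters, digits, and underscores are allowed, and the
# name must end in _YYYY-MM-DD.pdf.
is_valid_name() {
    local LC_ALL=C   # make [a-z] mean plain ASCII lowercase only
    local name="$1"
    local re='^[a-z0-9_]+_[0-9]{4}-[0-9]{2}-[0-9]{2}\.pdf$'
    [[ "$name" =~ $re ]]
}

# Check a few candidate names and report which ones break the rules.
for name in "car_insurance_2024-11-05.pdf" "Bank Statement.pdf" "bank_statement_2024-11.pdf"; do
    if is_valid_name "$name"; then
        echo "ok:  $name"
    else
        echo "bad: $name"
    fi
done
```

Note that the check only looks at the shape of the date (four digits, two digits, two digits), not at whether it is a real calendar date.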

Save the script Gemini returns in the same folder as your merged PDF, make sure the input_pdf line matches the name of your file and that pdftk is installed, then run it: DONE.
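
Since the page ranges come from an AI model, a quick sanity check before running the script does not hurt. Here is a small sketch, assuming you copy the ranges out of the generated script by hand; check_ranges is a hypothetical helper, and the optional last part relies on the NumberOfPages line that pdftk's dump_data operation prints.

```shell
#!/bin/bash

# Hypothetical helper: checks that page ranges such as "1-2 3-4 5-6" start at
# page 1, leave no gaps or overlaps, and end at the expected total page count.
# A single page can be given as a bare number, e.g. "5".
check_ranges() {
    local total="$1"; shift
    local expected=1 range first last
    for range in "$@"; do
        first="${range%-*}"   # part before the dash
        last="${range#*-}"    # part after the dash
        if (( first != expected || last < first )); then
            echo "unexpected range: $range (expected start: $expected)"
            return 1
        fi
        expected=$(( last + 1 ))
    done
    if (( expected - 1 != total )); then
        echo "ranges end at page $(( expected - 1 )), but the PDF has $total pages"
        return 1
    fi
    echo "all $total pages are covered"
}

# Ranges from the example script in the prompt, which splits 15 pages.
check_ranges 15 1-2 3-4 5-6 7-8 9-12 13-15

# With pdftk installed, the real page count can be read from the merged PDF
# (sample_document.pdf is the placeholder name from the prompt).
input_pdf="sample_document.pdf"
if command -v pdftk >/dev/null 2>&1 && [[ -f "$input_pdf" ]]; then
    total=$(pdftk "$input_pdf" dump_data | awk '/^NumberOfPages:/ {print $2}')
    echo "$input_pdf has $total pages"
fi
```

If the check complains, a page was probably skipped or assigned twice, and it is worth opening the merged PDF to see which document that page belongs to.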

This may look complicated at first, but believe me, it saves so much time 😉

If you have a detailed folder structure (e.g., invoices, private, business, etc.), you can extend this script by adding rules with a destination folder so the documents are moved to their final destination after splitting. However, as this setup is highly individual, I am focusing solely on the splitting process here.
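
For anyone who wants to try the routing idea anyway, here is a rough sketch of how such rules could look. Everything in it is an assumption: dest_for is a made-up function, and the folder names and filename patterns are placeholders that you would replace with your own structure. It only prints where each file would go and does not move anything.

```shell
#!/bin/bash

# Hypothetical mapping from filename pattern to destination folder.
# The folder names are placeholders, not a recommended structure.
dest_for() {
    case "$1" in
        *invoice*|*bill*)      echo "invoices" ;;
        *insurance*|*policy*)  echo "private/insurance" ;;
        *statement*|*report*)  echo "private/finance" ;;
        *)                     echo "unsorted" ;;
    esac
}

# Dry run over a few of the example names from the prompt:
# print the planned destination for each file.
for file in car_insurance_2024-11-05.pdf medical_invoice_2024-10-21.pdf electric_bill_2024-11-10.pdf; do
    echo "$file -> $(dest_for "$file")/"
done
```

Once the printed destinations look right, the echo line can be replaced with mkdir -p and mv calls that use the same dest_for result.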

As always, I hope you learned something, or that this small tip at least saves you a lot of time in the future.

Feel free to clap and/or follow me on X, Mastodon, Bluesky, or LinkedIn.

Cu! 👋


Written by Micha(el) Bladowski
