pdf.tocgen

pdf.tocgen is a set of command-line tools for extracting and generating the table of contents from a PDF file.

Install it with pip: pip install pdf.tocgen. This provides three commands, pdfxmeta, pdftocgen, and pdftocio, which are used below to build a table of contents (ToC) for a PDF that doesn’t have one.

First, create a recipe file, which describes the text attributes (such as font name and size) that identify chapter headings. The command below searches page 2 of _intro-prob.pdf for the word Introduction and appends the characteristics of the matching text to recipe.toml as a level-1 heading rule:

pdfxmeta --auto 1 --page 2 _intro-prob.pdf "Introduction" >> recipe.toml
[[heading]]
# Introduction
level = 1
greedy = true
font.name = "CMBX12"
font.size = 14.344016075134277
# font.size_tolerance = 1e-5
# font.color = 0x000000
# font.superscript = false
# font.italic = false
# font.serif = true
# font.monospace = false
# font.bold = true
# bbox.left = 96.19835662841797
# bbox.top = 109.06538391113281
# bbox.right = 185.04519653320312
# bbox.bottom = 123.42374420166016
# bbox.tolerance = 1e-5
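
To make the idea behind the rule concrete, here is a minimal Python sketch of how such a filter could be applied: a piece of text counts as a heading when its font name equals the rule's and its font size lies within the tolerance. This illustrates the concept only and is not pdf.tocgen's actual implementation. The Span class, the matches function, the sample spans, and the use of 1e-5 as the tolerance (taken from the commented-out font.size_tolerance line) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A run of text with the font attributes a rule can test (hypothetical type)."""
    text: str
    font_name: str
    font_size: float


# Mirrors the uncommented fields of the rule above; the tolerance is assumed
# to be the 1e-5 shown in the commented-out font.size_tolerance line.
RULE = {
    "level": 1,
    "font_name": "CMBX12",
    "font_size": 14.344016075134277,
    "size_tolerance": 1e-5,
}


def matches(span: Span, rule: dict) -> bool:
    """Return True if the span's font name and size satisfy the rule."""
    same_font = span.font_name == rule["font_name"]
    close_size = abs(span.font_size - rule["font_size"]) <= rule["size_tolerance"]
    return same_font and close_size


# Made-up spans standing in for text extracted from a PDF.
spans = [
    Span("1 Introduction", "CMBX12", 14.344016075134277),
    Span("Some body text", "CMR10", 9.962599754333496),
    Span("2 Combinatorics", "CMBX12", 14.344020),
]

headings = [s.text for s in spans if matches(s, RULE)]
print(headings)
```

Because only the font name and size are left uncommented in the rule, every piece of text set in that font at that size is treated as a level-1 heading, wherever it appears on the page.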

You can add more rules for other heading levels, but assuming this one is good enough, generate a table of contents for the document with pdftocgen:

pdftocgen _intro-prob.pdf < recipe.toml
"1 Introduction" 2
"2 Combinatorics" 4
"3 Axioms of Probability" 12
"4 Conditional Probability and Independence" 25
"Interlude: Practice Midterm 1" 40
"5 Discrete Random Variables" 46
"6 Continuous Random Variables" 59
"7 Joint Distributions and Independence" 71
"Interlude: Practice Midterm 2" 82
"8 More on Expectation and Limit Theorems" 88
"Interlude: Practice Final" 105
"9 Convergence in probability" 113
"10 Moment generating functions" 121
"11 Computing probabilities and expectations by conditioning" 128
"Interlude: Practice Midterm 1" 140
"12 Markov Chains: Introduction" 146
"13 Markov Chains: Classification of States" 154
"14 Branching processes" 164
"15 Markov Chains: Limiting Probabilities" 170
"Interlude: Practice Midterm 2" 179
"16 Markov Chains: Reversibility" 185
"17 Three Applications" 191
"18 Poisson Process" 199
"Interlude: Practice Final" 211

Verify that it looks correct, then write it to a file called toc:

pdftocgen _intro-prob.pdf < recipe.toml > toc
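
Checking a long ToC by eye is error-prone, so a small script can help with the verification step. The sketch below parses entries of the form shown in the output above (a quoted title followed by a page number) and reports any entry whose page number is lower than that of the entry before it. The function names are hypothetical, and two details are assumptions rather than facts taken from the output above: nested headings are treated as indented by four spaces per level, and an optional trailing number after the page is tolerated and ignored.

```python
import re

# Matches: optional indentation, a quoted title, a page number, and an
# optional trailing number (assumed to be a vertical position; ignored here).
LINE = re.compile(r'^(?P<indent> *)"(?P<title>.*)" (?P<page>\d+)(?: [\d.]+)?$')


def parse_toc(text):
    """Parse ToC text into a list of (level, title, page) tuples."""
    entries = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        m = LINE.match(raw)
        if m is None:
            raise ValueError(f"unrecognised ToC line: {raw!r}")
        # Assumption: four spaces of indentation per heading level.
        level = len(m["indent"]) // 4 + 1
        entries.append((level, m["title"], int(m["page"])))
    return entries


def out_of_order(entries):
    """Return entries whose page number is lower than the previous entry's."""
    return [
        later
        for earlier, later in zip(entries, entries[1:])
        if later[2] < earlier[2]
    ]


sample = '''"1 Introduction" 2
"2 Combinatorics" 4
"3 Axioms of Probability" 12
'''

entries = parse_toc(sample)
print(entries)
print(out_of_order(entries))
```

To check the real file, read it with open("toc", encoding="utf-8").read() and pass the text to parse_toc. An empty result from out_of_order means the page numbers never decrease, which is what a correctly extracted ToC should show.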

Finally, import it into the original PDF with pdftocio, which writes the result to a copy called output.pdf:

pdftocio -o output.pdf _intro-prob.pdf < toc

The copy, output.pdf, should now list the chapters in its table of contents.