PDF parsing ---- PDF text extraction

References

Tool list:

Roundup of PDF information extraction techniques (PDF信息提取技术的汇总) - Zhihu

Chinese PDF extraction tools

pdfact
paddleocr
grobid
- poor Chinese support
pdf2docx
- poor layout information; text comes out in the wrong order
chinese_science_paper_to_text
- https://github.com/flyingwaters/chinese_science_paper_to_text/blob/main/extract.py
pdfplumber
- tested; works on Chinese patents
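
For the pdfplumber entry above, a minimal sketch of pulling text and tables page by page. It relies only on `pdfplumber.open`, `page.extract_text()` and `page.extract_tables()`; the helper names `extract_page_content` and `extract_pdf` are made up for illustration, and how well it works on a given Chinese PDF still has to be checked by hand.

```python
def extract_page_content(page):
    """Return the text and tables of one pdfplumber page as a dict."""
    # extract_text() can come back empty for pages without a text layer,
    # so normalize to an empty string.
    text = page.extract_text() or ""
    # extract_tables() returns a list of tables; each table is a list of rows.
    tables = page.extract_tables()
    return {"text": text, "tables": tables}


def extract_pdf(path):
    """Open a PDF with pdfplumber and extract every page (hypothetical helper)."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        return [extract_page_content(page) for page in pdf.pages]
```

Usage would look like `pages = extract_pdf("patent.pdf")`, where the file name is a placeholder.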

Tool comparison

Reference:

=pdfminer.six= provides the foundation for =pdfplumber=. It primarily
focuses on parsing PDFs, analyzing PDF layouts and object positioning,
and extracting text. It does not provide tools for table extraction or
visual debugging.

=PyPDF2= is a pure-Python library "capable of splitting, merging,
cropping, and transforming the pages of PDF files. It can also add
custom data, viewing options, and passwords to PDF files." It can
extract page text, but does not provide easy access to shape objects
(rectangles, lines, etc.), table extraction, or visual debugging
tools.

Speed
=pymupdf= is ~substantially faster than pdfminer.six~ (and thus also
pdfplumber) and can generate and modify PDFs, but the library requires
installation of non-Python software (MuPDF). It also does not enable
easy access to shape objects (rectangles, lines, etc.), and does not
provide table-extraction or visual debugging tools.

Table extraction
=camelot=, =tabula-py=, and =pdftables= all focus primarily on extracting tables. In
some cases, they may be better suited to the particular tables you are
trying to extract.

Tool list

13 Best Open Source Free PDF OCR Text Extractors

Tools for adding an OCR layer to PDFs

OCRmyPDF
pdfocr
- GitHub - gkovacs/pdfocr: Adds text to PDF files using the cuneiform OCR software

What is a PDF unit

Reference
- https://support.activepdf.com/hc/en-us/articles/360002401633-What-are-PDF-Units-and-Coordinates
1 PDF unit == 1/72 inch (one point)
Usage example
- grobid
  - https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/
    - contrary to usage, the origin of a document is at the upper left corner; the x-axis extends to the right and the y-axis extends downward
    - all locations and sizes are stored in an abstract value called a PDF unit
    - PDF documents do not have a resolution: to convert a PDF unit to a physical value such as pixels, an external value must be provided for the resolution

DPI issues

grobid coordinate conversion: 72 DPI (1 PDF unit = 1 pixel)
printers: typically 300 DPI
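
The conversion behind these two numbers can be sketched as follows, assuming the default of 72 PDF units per inch. The function names are hypothetical, and the box is taken as (x, y, width, height) with a top-left origin, as in the grobid description above.

```python
PDF_UNITS_PER_INCH = 72  # default PDF user space: 1 unit = 1/72 inch


def units_to_pixels(value, dpi):
    """Convert a length in PDF units to pixels at the given resolution."""
    return value * dpi / PDF_UNITS_PER_INCH


def box_to_pixels(x, y, w, h, dpi=300):
    """Scale an (x, y, width, height) box from PDF units to whole pixels."""
    return tuple(round(units_to_pixels(v, dpi)) for v in (x, y, w, h))


# At 72 DPI one PDF unit maps to one pixel; at 300 DPI one inch (72 units)
# maps to 300 pixels.
print(units_to_pixels(72, 72))            # 72.0
print(units_to_pixels(72, 300))           # 300.0
print(box_to_pixels(72, 36, 144, 18))     # (300, 150, 600, 75)
```

The same scale factor has to be applied when mapping coordinates onto a page image rendered at a DPI other than 72.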

Converting PDF to images

Via Ghostscript

Reference:

How To Convert PDFs to Images for ML Projects Using Ghostscript and Multiproc…
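
A minimal sketch of rendering a PDF to one PNG per page by calling the `gs` command-line tool from Python. This is not the exact command from the linked article; the output pattern, the 300 DPI default and the function names are assumptions, and Ghostscript has to be installed and on the PATH.

```python
import subprocess


def ghostscript_command(pdf_path, output_pattern="page-%03d.png", dpi=300):
    """Build the gs argument list that renders each page to a PNG file."""
    return [
        "gs",
        "-dSAFER",                         # restrict file access while rendering
        "-dBATCH",                         # exit after the last file
        "-dNOPAUSE",                       # do not wait for a key press per page
        "-sDEVICE=png16m",                 # 24-bit color PNG output device
        f"-r{dpi}",                        # rendering resolution in DPI
        f"-sOutputFile={output_pattern}",  # %03d expands to the page number
        pdf_path,
    ]


def pdf_to_images(pdf_path, **kwargs):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(ghostscript_command(pdf_path, **kwargs), check=True)
```

Because each call is an independent process, several PDFs can be converted in parallel by mapping `pdf_to_images` over a list of paths with a process pool.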

Table and figure extraction

GitHub - allenai/pdffigures2: Given a scholarly PDF, extract figures, tables,…
- captions
- region coordinates
- page numbers
GitHub - titipata/scipdf_parser: Python PDF parser for scientific publication…
- text + figures
- built on grobid + pdffigures2
Table detection: microsoft/table-transformer-detection · Hugging Face
- detects table regions in images

pdfact

Parses PDF into JSON

Results:

supports non-paper PDFs
supports Chinese
- but in testing, results on Chinese papers are sometimes very poor

Reference:

Using pdfact with Docker

Image: docker run -it --rm -p 80:80 dnlbauer/pdfact-service

Usage: curl -H "Accept: application/json" -F file=@testfile.pdf localhost:80/analyze
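
The curl call above can be reproduced from Python with the standard library alone. This is a sketch: the endpoint, the `file` form field and the `Accept` header are copied from the curl command, the helper names are hypothetical, and the shape of the returned JSON is not assumed.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field, filename, data, boundary=None):
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def analyze(pdf_path, url="http://localhost:80/analyze"):
    """POST a PDF to a running pdfact-service container and parse the JSON reply."""
    with open(pdf_path, "rb") as f:
        body, content_type = build_multipart(
            "file", os.path.basename(pdf_path), f.read()
        )
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container from the docker command running, `analyze("testfile.pdf")` should return the parsed JSON document.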

MinerU + PDF-Extract-Kit notes

MinerU code analysis

Processing of layout results:
- magic_pdf/model/magic_model.py::MagicModel.__init__()
Entry-point code for handling layout information:
- magic_pdf/pdf_parse_union_core.py::pdf_parse_union()

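Chinese PDF recognition and extraction

Layout analysis models

Reference:

GitHub - RapidAI/RapidLayout: Analysis of Chinese and English layouts 中英文版面分析
- collects Chinese models for multiple scenarios

Baidu
- picodet_lcnet_x1_0_fgd_layout_cdla: 9.7M, a Chinese layout analysis model trained on the CDLA dataset; it segments pages into 10 region classes: text, title, figure, figure caption, table, table caption, header, footer, reference, and equation
360
- rapid layout
  - separate models are provided for Chinese and English papers and for general scenarios
Shanghai AI Laboratory
- LayoutLMv3-SFT:
  - GitHub - opendatalab/PDF-Extract-Kit: A Comprehensive Toolkit for High-Qualit…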

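Annotation methods and tools

Summary of datasets and annotation methods
- Datasets and annotations for layout analysis of scientific articles | Interna…
M2Doc layout analysis model
- GitHub - johnning2333/M2Doc

Datasets

[2305.08719] M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout…
- M6Doc
- created by Chinese researchers; covers multiple document types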
