References

Tool list:

Chinese PDF extraction tools

PDF to image

Tools

Table extraction

Tool comparison

References:

=pdfminer.six= provides the foundation for =pdfplumber=. It primarily
focuses on parsing PDFs, analyzing PDF layouts and object positioning,
and extracting text. It does not provide tools for table extraction or
visual debugging.

=PyPDF2= is a pure-Python library "capable of splitting, merging,
cropping, and transforming the pages of PDF files. It can also add
custom data, viewing options, and passwords to PDF files." It can
extract page text, but does not provide easy access to shape objects
(rectangles, lines, etc.), table extraction, or visual debugging
tools.

Fast
=pymupdf= is ~substantially faster than pdfminer.six~ (and thus also
pdfplumber) and can generate and modify PDFs, but the library requires
installation of non-Python software (MuPDF). It also does not enable
easy access to shape objects (rectangles, lines, etc.), and does not
provide table-extraction or visual debugging tools.

Table extraction
=camelot=, =tabula-py=, and =pdftables= all focus primarily on extracting tables. In
some cases, they may be better suited to the particular tables you are
trying to extract.

What is a PDF unit

DPI issues

  1. GROBID coordinates are converted assuming 72 DPI
  2. Printers typically use 300 DPI
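The two DPI figures above imply a simple scale factor when mapping coordinates onto a rendered page image. A minimal sketch, assuming coordinates are given in PDF points (1/72 inch) and that a box is a plain (x, y, width, height) tuple; the function names are made up for illustration and do not come from GROBID:

```python
# Assumption: 1 PDF unit = 1 point = 1/72 inch.
PDF_POINTS_PER_INCH = 72


def points_to_pixels(value_pt, dpi=300):
    """Convert a length in PDF points to pixels at the given DPI."""
    return value_pt * dpi / PDF_POINTS_PER_INCH


def scale_box(box_pt, dpi=300):
    """Scale an (x, y, width, height) box from points to pixels.

    The tuple layout is an assumption made for this example.
    """
    return tuple(points_to_pixels(v, dpi) for v in box_pt)


if __name__ == "__main__":
    # A box one inch from the left edge, rendered at 300 DPI.
    print(scale_box((72, 36, 144, 18), dpi=300))
```

With the 300 DPI default, every coordinate is multiplied by 300 / 72, so a box must be rescaled before it is drawn on a page image rendered at that resolution.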

pdfact

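Parses a PDF into JSON

Results:

  • Supports PDFs that are not academic papers
  • Supports Chinese

    • In testing, however, results on Chinese papers are sometimes very poor

References: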

Using pdfact with Docker

Image: docker run -it --rm -p 80:80 dnlbauer/pdfact-service

Usage: curl -H "Accept: application/json" -F file=@testfile.pdf localhost:80/analyze
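The same call can be made from Python. A minimal sketch using only the standard library, assuming the service accepts the same multipart upload as the curl command above; the helper name build_analyze_request is made up for illustration, and the shape of the returned JSON is not described in these notes:

```python
import urllib.request
import uuid


def build_analyze_request(pdf_bytes, filename="testfile.pdf",
                          base_url="http://localhost:80"):
    """Build (but do not send) a multipart POST mirroring the curl call."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # The form field name "file" matches `-F file=@testfile.pdf`.
        ('Content-Disposition: form-data; name="file"; '
         f'filename="{filename}"\r\n').encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    req = urllib.request.Request(f"{base_url}/analyze", data=body,
                                 method="POST")
    req.add_header("Accept", "application/json")
    req.add_header("Content-Type",
                   f"multipart/form-data; boundary={boundary}")
    return req
```

To send it, pass the request to urllib.request.urlopen(req) while the container is running and decode the response body with json.loads.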

MinerU + PDF-Extract-Kit notes

MinerU code analysis

  1. Analysis and processing of layout results

    • magic_pdf/model/magic_model.py::MagicModel.__init__()
  2. Entry-point code for processing layout information:

    • magic_pdf/pdf_parse_union_core.py::pdf_parse_union()

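Chinese PDF recognition and extraction

Layout analysis models

References:

  1. Baidu

    • picodet_lcnet_x1_0_fgd_layout_cdla: 9.7M, a Chinese layout analysis model trained on the CDLA dataset; it divides a page into 10 region classes: text, title, figure, figure caption, table, table caption, header, footer, reference, and equation
  2. 360

    • rapid layout

      • Provides separate models for Chinese/English papers and for general scenarios
  3. Shanghai AI Laboratory

Annotation methods and tools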