Table

A Collection of Table Extraction Tools

Roundup sites
- How to Extract Tables from PDF - PDF to Table Extractor
  - Includes a comparison of different tools
tabula-py
- GitHub - chezou/tabula-py: Simple wrapper of tabula-java: extract table from …
- Calls tabula-java under the hood
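
Because tabula-py is a thin wrapper, a typical call looks roughly like the sketch below. It assumes the `tabula-py` package and a Java runtime are installed. The function names `read_tables_with_tabula` and `drop_empty_tables` are illustrative helpers, not part of the library.

```python
from typing import Any, List


def read_tables_with_tabula(pdf_path: str, pages: str = "all") -> List[Any]:
    """Return one pandas DataFrame per table that tabula-java detects."""
    # Imported lazily: this needs the tabula-py package and a Java runtime.
    import tabula

    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


def drop_empty_tables(tables: List[Any]) -> List[Any]:
    """Illustrative helper: discard detected tables that contain no rows."""
    return [table for table in tables if len(table) > 0]
```

With a hypothetical file, usage would be `drop_empty_tables(read_tables_with_tabula("report.pdf"))`.
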
PDFPatcher
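- GitHub - wmjordan/PDFPatcher: PDF补丁丁, a PDF toolbox that can edit bookmarks, crop and rotate pages, remove restrictions, extract or merge documents, inspect document…
- Supports OCR for tables in images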
camelot
- GitHub - atlanhq/camelot: Camelot: PDF Table Extraction for Humans
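
A minimal sketch of reading tables with camelot, assuming the package is installed. `camelot.read_pdf` and the per-table `parsing_report` come from the library. The wrapper names and the accuracy threshold of 90 are assumptions chosen for illustration.

```python
from typing import Any, Dict, List


def read_tables_with_camelot(pdf_path: str, pages: str = "1", flavor: str = "lattice") -> Any:
    """Return camelot's TableList for the requested pages."""
    # Imported lazily: this needs the camelot package and its dependencies.
    import camelot

    return camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)


def indices_of_accurate_tables(reports: List[Dict[str, float]], min_accuracy: float = 90.0) -> List[int]:
    """Illustrative helper: keep tables whose parsing report meets the threshold."""
    return [
        index
        for index, report in enumerate(reports)
        if report.get("accuracy", 0.0) >= min_accuracy
    ]
```

The reports can be gathered with `[t.parsing_report for t in tables]`, and each kept table is available as a DataFrame through `tables[i].df`.
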
pdftables
- GitHub - drj11/pdftables: A library for extracting tables from PDF files
- No longer maintained; the project has become a commercial website
- Uses pdfminer under the hood
docparser
- GitHub - DS3Lab/DocParser
- Supports OCR recognition of tables in images
ocr-table
- GitHub - cseas/ocr-table: Extract tables from scanned image PDFs using Optica…
- Supports OCR recognition with tesseract-ocr
table transformer
- Transformers-Tutorials/Using_Table_Transformer_for_table_detection_and_table_…
- A deep-learning library for table parsing
TIES-2.0
- GitHub - shahrukhqasim/TIES-2.0: Code for: S.R. Qasim, H. Mahmood, and F. Sha…
- Paper: Rethinking Table Recognition using Graph Neural Networks https://arxiv.org/pdf/1905.13391.pdf

pdfplumber

GitHub - jsvine/pdfplumber: Plumb a PDF for detailed information about each c…
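
A short sketch of table extraction with pdfplumber, assuming the package is installed. `pdfplumber.open` and `page.extract_tables()` are library calls. `table_to_records` is an illustrative helper that assumes the first row of each table is a header.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def extract_tables_with_pdfplumber(pdf_path: str) -> List[List[Row]]:
    """Collect every table pdfplumber finds, page by page."""
    # Imported lazily: this needs the pdfplumber package.
    import pdfplumber

    tables: List[List[Row]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def table_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Illustrative helper: treat row 0 as the header, map later rows to dicts."""
    if not table:
        return []
    # pdfplumber returns None for empty cells, so normalize to empty strings.
    header = [(cell or "").strip() for cell in table[0]]
    return [
        {name: (cell or "").strip() for name, cell in zip(header, row)}
        for row in table[1:]
    ]
```

For a hypothetical `report.pdf`, `[table_to_records(t) for t in extract_tables_with_pdfplumber("report.pdf")]` would give one list of row dicts per detected table.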

Comparison

pdfminer.six provides the foundation for pdfplumber. It primarily focuses on parsing PDFs, analyzing PDF layouts and object positioning, and extracting text. It does not provide tools for table extraction or visual debugging.

PyPDF2 is a pure-Python library "capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files." It can extract page text, but does not provide easy access to shape objects (rectangles, lines, etc.), table extraction, or visual debugging tools.

pymupdf is substantially faster than pdfminer.six (and thus also pdfplumber) and can generate and modify PDFs, but the library requires installation of non-Python software (MuPDF). It also does not enable easy access to shape objects (rectangles, lines, etc.), and does not provide table-extraction or visual debugging tools.

camelot, tabula-py, and pdftables all focus primarily on extracting tables. In some cases, they may be better suited to the particular tables you are trying to extract.
