Table
Collection of table extraction tools
Listing sites
tabula-py
PDFPatcher
camelot
pdftables
- GitHub - drj11/pdftables: A library for extracting tables from PDF files
- No longer updated; it has turned into a commercial website
- Built on pdfminer under the hood
docparser
- GitHub - DS3Lab/DocParser
- Supports OCR recognition of tables in images
ocr-table
- GitHub - cseas/ocr-table: Extract tables from scanned image PDFs using Optica…
- Supports OCR recognition with tesseract-ocr
table transformer
TIES-2.0
pdfplumber
- GitHub - jsvine/pdfplumber: Plumb a PDF for detailed information about each c…
Comparison
- pdfminer.six provides the foundation for pdfplumber. It primarily focuses on parsing PDFs, analyzing PDF layouts and object positioning, and extracting text. It does not provide tools for table extraction or visual debugging.
- PyPDF2 is a pure-Python library "capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files." It can extract page text, but does not provide easy access to shape objects (rectangles, lines, etc.), table-extraction, or visually debugging tools.
- pymupdf is substantially faster than pdfminer.six (and thus also pdfplumber) and can generate and modify PDFs, but the library requires installation of non-Python software (MuPDF). It also does not enable easy access to shape objects (rectangles, lines, etc.), and does not provide table-extraction or visual debugging tools.
- camelot, tabula-py, and pdftables all focus primarily on extracting tables. In some cases, they may be better suited to the particular tables you are trying to extract.
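To make the comparison concrete, here is a minimal sketch of the table extraction that pdfplumber adds on top of pdfminer.six. It relies on `pdfplumber.open` and `page.extract_table()`. The helper names `rows_to_records` and `extract_first_table` are hypothetical, invented for this illustration, and the sketch assumes the first row of the table is a header row, which will not hold for every PDF.

```python
from typing import Dict, List, Optional

# A table row as returned by pdfplumber: a list of cell strings, where a cell may be None.
Row = List[Optional[str]]


def rows_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Turn a header row plus data rows into a list of dicts.

    Hypothetical helper: assumes the first row holds the column names.
    None cells are normalised to empty strings.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_first_table(pdf_path: str, page_index: int = 0) -> List[Dict[str, str]]:
    """Extract the first table pdfplumber finds on one page of a PDF."""
    import pdfplumber  # third-party dependency: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # extract_table() returns a list of rows, or None when no table is found
        table = page.extract_table()
    return rows_to_records(table or [])
```

Calling `extract_first_table("report.pdf")` (the file name is a placeholder) would return one dict per data row. For scanned PDFs with no text layer, an OCR-based tool such as those listed above is likely the better fit.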
Related blogs
- A series of introductory posts on pdfplumber, camelot, and similar tools
Author
Last updated 2023-02-01 (9aed3e4)