Layout Parser ---- a general-purpose toolkit for document image analysis

References

Examples
- layout-parser/examples at main · Layout-Parser/layout-parser · GitHub
- Layout detection + text extraction: layout-parser/Deep Layout Parsing.ipynb at main · Layout-Parser/layout-parser…
- Table OCR recognition and parsing
repo: GitHub - Layout-Parser/layout-parser: A Unified Toolkit for Deep Learning Bas…
Model zoo: Model Zoo - Layout Parser 0.3.2 documentation
- HJDataset: historical Japanese documents
- PubLayNet: PubMed literature
- PrimaLayout: Pattern Recognition & Image Analysis Research
  - A research lab
  - Covers many document types; particular emphasis is placed on magazines and technical/scientific publications
- NewspaperNavigator: historical American newspapers spanning many years
- TableBank: table dataset built from Word and LaTeX documents found on the web
  - Note: detects tables only
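
As a rough sketch, one of these pretrained models might be loaded as follows. The `lp://` config paths and label maps are written from memory of the model zoo page and should be checked against it. `MODEL_ZOO` and `load_model` are hypothetical names made up for this note, and the 0.8 score threshold is an arbitrary choice. It assumes `layoutparser` and Detectron2 are installed.

```python
# Hypothetical lookup table for this note; verify paths and labels
# against the model zoo page before relying on them.
MODEL_ZOO = {
    "PubLayNet": {
        "config": "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        "label_map": {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    },
    "TableBank": {
        "config": "lp://TableBank/faster_rcnn_R_50_FPN_3x/config",
        "label_map": {0: "Table"},  # tables only
    },
}


def load_model(dataset, score_threshold=0.8):
    """Build a Detectron2-based layout model for one of the datasets above."""
    if dataset not in MODEL_ZOO:
        raise KeyError(f"unknown dataset: {dataset!r}")
    entry = MODEL_ZOO[dataset]

    # Imported lazily so the table can be inspected without the heavy dependencies.
    import layoutparser as lp

    return lp.Detectron2LayoutModel(
        entry["config"],
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", score_threshold],
        label_map=entry["label_map"],
    )
```

With a model in hand, `load_model("PubLayNet").detect(image)` should return the detected layout for an image array.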

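Accepts image objects only
It is only a layout detection tool
- The output is layout data
- Text extraction, reading order, and similar steps require writing your own rules
  - Crop the relevant region of the image
  - Run OCR on it yourself using the OCR helper utilities the library provides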
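
The crop-then-OCR steps above can be sketched roughly as below. This is a minimal sketch with two assumptions: the reading-order rule (top to bottom, then left to right) only suits single-column pages, and the 5-pixel padding is an arbitrary choice. `reading_order` and `ocr_blocks` are hypothetical helper names. The `coordinates`, `pad`, `crop_image`, and `detect` calls follow the usage in the Deep Layout Parsing example notebook.

```python
def reading_order(blocks):
    """Sort blocks top-to-bottom, then left-to-right (single-column assumption)."""
    # block.coordinates is (x_1, y_1, x_2, y_2)
    return sorted(blocks, key=lambda b: (b.coordinates[1], b.coordinates[0]))


def ocr_blocks(image, blocks, ocr_agent, padding=5):
    """Crop each block out of the page image and OCR it, in reading order."""
    texts = []
    for block in reading_order(blocks):
        # Pad the box slightly so characters on the border are not cut off.
        segment = block.pad(
            left=padding, right=padding, top=padding, bottom=padding
        ).crop_image(image)
        texts.append(ocr_agent.detect(segment))
    return texts
```

In practice `blocks` would be the detected layout filtered to the types of interest, for example `[b for b in layout if b.type == "Text"]`, and `image` the same array that was passed to the detector. Multi-column pages need a smarter ordering rule, such as grouping blocks by column first.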
Models
- Built-in models
- Custom models can be used
- Community-shared models can be used
OCR engines supported out of the box
- Tesseract
- Google Cloud Vision
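
A small sketch of how either engine might be selected. `make_ocr_agent` is a hypothetical helper written for this note. `lp.TesseractAgent` and `lp.GCVAgent.with_credential` are the constructors as I understand them from the Layout Parser OCR documentation. The default language codes (`eng` for Tesseract, `en` for Google Cloud Vision) are assumptions, and each engine needs its own backend package installed.

```python
def make_ocr_agent(engine, languages=None, credential_path=None):
    """Return a Layout Parser OCR agent for 'tesseract' or 'gcv'."""
    engine = engine.lower()
    if engine not in ("tesseract", "gcv"):
        raise ValueError(f"unsupported OCR engine: {engine!r}")
    if engine == "gcv" and credential_path is None:
        raise ValueError("Google Cloud Vision needs a credential file path")

    # Imported lazily so argument validation works without the library installed.
    import layoutparser as lp

    if engine == "tesseract":
        # Assumed default: Tesseract-style language code.
        return lp.TesseractAgent(languages=languages or "eng")
    # Assumed default: Google Cloud Vision language hint.
    return lp.GCVAgent.with_credential(credential_path, languages=languages or ["en"])
```

Either agent exposes `detect(image)`, so the rest of a pipeline does not need to know which engine is behind it.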