
Pdf Ocr Skill Usage Guide

2026-03-29 Source: 网淘吧

Scanned PDF OCR Conversion Skill 📄

Configuration

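  • Baidu OCR API Key: vOBOM7tO0lL8cKMJdZy453Ai
  • Baidu OCR Secret Key: bib8MvDPTfXXdPz4JyzIyDCvCeKxtpyu
  • Free quota: 1,000 calls/month (1 call = 1 page); a document of up to 592 pages runs to completion in one pass at no cost
  • Endpoint: General Text Recognition (High-Accuracy Edition), accurate_basic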
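
The configuration above corresponds to two HTTP calls: exchanging the API Key and Secret Key for an access token, then posting a base64-encoded page image to accurate_basic. The following is a minimal sketch that only builds those requests and checks the quota. It assumes Baidu's public OAuth and OCR URLs and the `words_result` response field as commonly documented; the function names are hypothetical and this is not the skill's own script.

```python
import base64
import urllib.parse
import urllib.request

# Assumed public Baidu endpoints; verify against the current Baidu documentation.
TOKEN_URL = "https://aip.baidubce.com/oauth/2.0/token"
OCR_URL = "https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic"
FREE_CALLS_PER_MONTH = 1000  # from the quota above: 1 call = 1 page


def build_token_request(api_key, secret_key):
    """Build (but do not send) the request that trades the key pair for a token."""
    query = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": api_key,
        "client_secret": secret_key,
    })
    return urllib.request.Request(f"{TOKEN_URL}?{query}", data=b"", method="POST")


def build_ocr_request(access_token, image_bytes):
    """Build (but do not send) an accurate_basic request for one page image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    body = urllib.parse.urlencode({"image": encoded}).encode("ascii")
    query = urllib.parse.urlencode({"access_token": access_token})
    return urllib.request.Request(
        f"{OCR_URL}?{query}",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def extract_lines(response_json):
    """Pull the recognized text lines out of a parsed OCR response."""
    return [item["words"] for item in response_json.get("words_result", [])]


def fits_free_quota(page_count, already_used=0):
    """True when the whole document still fits in this month's free calls."""
    return page_count + already_used <= FREE_CALLS_PER_MONTH
```

Sending either request is a matter of passing it to `urllib.request.urlopen` and parsing the JSON body. Reading the keys from environment variables rather than hard-coding them keeps them out of shared files.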

Installing dependencies

pip install pymupdf python-docx pillow

Usage

python3 {baseDir}/scripts/pdf_to_docx.py <PDF path> [output directory]

The output file is written to [output directory]/xxx_全文_ocr.docx. If the file is large, compress its images with this script:

python3 {baseDir}/scripts/compress_docx.py <docx path> <output path>

Processing strategy

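Page type | Detection method | Handling
Body page | Default | Crop the top 6% (header) and bottom 4% (footer), then OCR the text
Illustration page | OCR returns no text | Keep as an image embedded in the Word document
Colored cover/chapter page | Color pixel ratio > 25% | Keep as an image and add a gray annotation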
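
The three rules above can be sketched as plain functions. The source gives only the 25% threshold and the 6%/4% crop ratios, so the definition of a "colored" pixel (channel spread above `min_spread`) and the order of the checks are assumptions, and every name here is hypothetical rather than taken from pdf_to_docx.py.

```python
def crop_box(width, height, header_ratio=0.06, footer_ratio=0.04):
    """Return (left, top, right, bottom) with header and footer strips removed."""
    top = round(height * header_ratio)
    bottom = height - round(height * footer_ratio)
    return (0, top, width, bottom)


def color_pixel_ratio(pixels, min_spread=30):
    """Fraction of (r, g, b) pixels that look colored rather than gray.

    Assumption: a pixel counts as colored when its brightest and darkest
    channels differ by more than min_spread.
    """
    pixels = list(pixels)
    if not pixels:
        return 0.0
    colored = sum(1 for r, g, b in pixels if max(r, g, b) - min(r, g, b) > min_spread)
    return colored / len(pixels)


def classify_page(pixels, ocr_text, color_threshold=0.25):
    """Map a page to 'color', 'illustration', or 'body' per the table above."""
    if color_pixel_ratio(pixels) > color_threshold:
        return "color"          # cover or chapter page: keep as image
    if not ocr_text.strip():
        return "illustration"   # OCR produced no text: keep as image
    return "body"               # default: use the OCR text
```

Checking the color ratio first is a design choice in this sketch: it lets a caller skip the OCR call, and the quota it costs, for pages that will be kept as images anyway.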

Known limitations

  • Pages that mix text and images (text inside charts or tables): OCR treats the text inside charts and tables as body text, so these pages need manual replacement
    • Solution: the user identifies the problem page, notes its PDF page number, takes a screenshot of it, and replaces the page manually
  • Table of contents pages on a white background: these are not detected as special pages and are OCRed like body pages, with mediocre results
    • Solution: after conversion, manually replace the table of contents page with an image

Practical case ("The Book of Predictions", 592 pages)

  • Processing time: approximately 20 minutes (including the 0.6 s/page interval)
  • Original output size: 303 MB (144 embedded images)
  • Compressed size: 3.4 MB (images reduced to 600 px width at 60% quality)
  • Recognition quality: high accuracy for body text; chart pages require manual processing
  • Progress is saved automatically every 50 pages, so a mid-run crash does not lose completed work
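
Saving progress every 50 pages could look roughly like the sketch below. The JSON checkpoint format, the file handling, and all function names are assumptions for illustration; the source does not describe how the script stores its progress.

```python
import json
import os

CHECKPOINT_EVERY = 50  # pages between saves, as stated above


def save_checkpoint(path, next_page, results):
    """Write progress to a temp file, then swap it in so a crash cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"next_page": next_page, "results": results}, f, ensure_ascii=False)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return (next_page, results); start from page 0 when no checkpoint exists."""
    if not os.path.exists(path):
        return 0, []
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return state["next_page"], state["results"]


def run(page_count, ocr_page, checkpoint_path):
    """Process pages in order, resuming from the last saved checkpoint."""
    start, results = load_checkpoint(checkpoint_path)
    for page in range(start, page_count):
        results.append(ocr_page(page))  # ocr_page is a caller-supplied callable
        if (page + 1) % CHECKPOINT_EVERY == 0:
            save_checkpoint(checkpoint_path, page + 1, results)
    save_checkpoint(checkpoint_path, page_count, results)
    return results
```

With this layout a crash loses at most 49 pages of work, which at 1 call per page also bounds how much free quota a rerun spends again.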

Precautions

  • The free tier is limited to QPS = 2; the script already waits 0.6 seconds between pages
  • The crop ratios (header 6%, footer 4%) can be adjusted at the top of the script
  • After OCR completes, spot-check a few pages to verify accuracy
  • The original high-resolution version is kept on the server; the compressed version is for distribution
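
A 0.6-second gap caps the request rate at about 1.67 per second, under the QPS = 2 limit, and over 592 pages it adds 592 × 0.6 = 355.2 seconds of waiting. The sketch below shows one way to enforce such a gap; the `Throttle` class and helper names are hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without real delays.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between consecutive calls."""

    def __init__(self, min_interval=0.6, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block just long enough to honor the interval, then record the call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def within_qps_limit(interval_seconds, qps_limit):
    """True when one request per interval never exceeds the QPS limit."""
    return interval_seconds >= 1.0 / qps_limit


def total_wait_seconds(page_count, interval_seconds):
    """Upper bound on time spent purely waiting between pages."""
    return page_count * interval_seconds
```

Calling `throttle.wait()` immediately before each OCR request is enough; time already spent on the previous request counts toward the interval, so the loop does not sleep longer than needed.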
