PDF OCR Skill Usage Guide
2026-03-29
Source: 网淘吧
Scanned PDF OCR Conversion Skill 📄
Configuration
- Baidu OCR API Key: vOBOM7tO0lL8cKMJdZy453Ai
- Baidu OCR Secret Key: bib8MvDPTfXXdPz4JyzIyDCvCeKxtpyu
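- Free quota: 1,000 calls/month (1 call = 1 page); a PDF of up to 592 pages can be processed in a single run at no charge
- Endpoint: General Text Recognition (high-accuracy version), accurate_basic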
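As a rough illustration of how the key pair and the accurate_basic endpoint fit together, here is a minimal sketch using only the standard library. It assumes Baidu AI Cloud's public OAuth token URL and OCR URL and the `words_result` response field as documented by Baidu. The function names are hypothetical and are not taken from pdf_to_docx.py.

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumed public Baidu AI Cloud endpoints (not read from the skill's script).
TOKEN_URL = "https://aip.baidubce.com/oauth/2.0/token"
OCR_URL = "https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic"


def build_token_request(api_key: str, secret_key: str) -> urllib.request.Request:
    """Build the request that exchanges the key pair for an access token."""
    query = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": api_key,
        "client_secret": secret_key,
    })
    return urllib.request.Request(f"{TOKEN_URL}?{query}", data=b"", method="POST")


def build_ocr_request(access_token: str, image_bytes: bytes) -> urllib.request.Request:
    """Build the form-encoded OCR request for one page image."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    body = urllib.parse.urlencode({"image": image_b64}).encode("ascii")
    url = f"{OCR_URL}?access_token={urllib.parse.quote(access_token)}"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def extract_lines(response_json: dict) -> list[str]:
    """Pull the recognized text lines out of a parsed OCR response."""
    return [item["words"] for item in response_json.get("words_result", [])]


def ocr_page(api_key: str, secret_key: str, image_bytes: bytes) -> list[str]:
    """Fetch a token, then OCR one page image (performs network calls)."""
    with urllib.request.urlopen(build_token_request(api_key, secret_key)) as resp:
        token = json.load(resp)["access_token"]
    with urllib.request.urlopen(build_ocr_request(token, image_bytes)) as resp:
        return extract_lines(json.load(resp))
```

Each call to `ocr_page` sends one page image and so uses one unit of the monthly quota. A real script would probably fetch the token once and reuse it across pages rather than request it per page as this sketch does.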
Installing dependencies
pip install pymupdf python-docx pillow
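Usage
python3 {baseDir}/scripts/pdf_to_docx.py <PDF path> [output directory]
The output file is written to [output directory]/xxx_全文_ocr.docx. If the file is large, compress its images with the script:
python3 {baseDir}/scripts/compress_docx.py <docx path> <output path>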
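To give a feel for what image compression of this kind involves, the sketch below downsizes a single image to the 600 px width and 60% JPEG quality reported in the practical case. It is not the code of compress_docx.py. The helper names are hypothetical, and Pillow (already in the dependency list) is imported only inside the function that needs it.

```python
import io


def target_size(width: int, height: int, max_width: int = 600) -> tuple[int, int]:
    """Scale (width, height) down to max_width, keeping the aspect ratio."""
    if width <= max_width:
        return width, height  # never upscale small images
    scale = max_width / width
    return max_width, max(1, round(height * scale))


def recompress(image_bytes: bytes, max_width: int = 600, quality: int = 60) -> bytes:
    """Return the image re-encoded as a smaller JPEG."""
    from PIL import Image  # Pillow; imported lazily so target_size works without it

    with Image.open(io.BytesIO(image_bytes)) as img:
        size = target_size(img.width, img.height, max_width)
        resized = img.convert("RGB").resize(size, Image.LANCZOS)
    out = io.BytesIO()
    resized.save(out, format="JPEG", quality=quality)
    return out.getvalue()
```

A .docx file is a ZIP archive, so a full compressor would also need to rewrite the image entries inside it. That step is left out here.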
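Processing strategy
| Page type | How it is detected | How it is handled |
|---|---|---|
| Body page | Default | Crop the top 6% (header) and bottom 4% (footer), then OCR the text |
| Illustration page | OCR returns no text | Kept as an image embedded in the Word document |
| Color cover/chapter page | Color pixel ratio > 25% | Kept as an image, with a gray annotation added |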
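The three rules in the table can be sketched as a small classifier. This is an illustration under assumptions, not the script's actual logic. The per-pixel test for "colored" (channel spread above a tolerance of 30), the order of the checks, and all function names are hypothetical. Only the 6%/4% crop and the 25% threshold come from the table.

```python
def crop_box(width, height, top=0.06, bottom=0.04):
    """Return a (left, upper, right, lower) box that drops header and footer bands."""
    return (0, round(height * top), width, height - round(height * bottom))


def color_ratio(pixels, tolerance=30):
    """Fraction of (r, g, b) pixels whose channels differ by more than `tolerance`."""
    total = colored = 0
    for r, g, b in pixels:
        total += 1
        if max(r, g, b) - min(r, g, b) > tolerance:
            colored += 1
    return colored / total if total else 0.0


def classify_page(pixels, ocr_lines, color_threshold=0.25):
    """Return "color", "illustration", or "body" for one page."""
    if color_ratio(pixels) > color_threshold:
        return "color"  # cover or chapter page: keep as an image
    if not ocr_lines:
        return "illustration"  # OCR found nothing: keep as an image
    return "body"  # default: use the OCR text
```

The box returned by `crop_box` has the same (left, upper, right, lower) layout that Pillow's `Image.crop` expects, so it could be passed to it directly.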
Known limitations
- Pages mixing text and images (text inside charts/tables): OCR treats text inside charts and tables as body text, so these pages need manual replacement
  - Workaround: the user identifies the problem page, reports its PDF page number, takes a screenshot of it, and replaces the page manually
- Table of contents pages on a white background: these are not detected as special pages and are OCRed like body text, with mediocre results
  - Workaround: after conversion, manually replace the table of contents page with an image
Practical case ("The Book of Predictions", 592 pages)
- Processing time: about 20 minutes (including the 0.6 s/page interval)
- Original output size: 303 MB (144 embedded images)
- Compressed size: 3.4 MB (images reduced to 600 px width at 60% quality)
- Recognition quality: high accuracy for body text; chart pages need manual processing
- Progress is saved automatically every 50 pages, so a mid-run crash does not lose all completed work
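Saving progress every 50 pages can be sketched as follows. The JSON progress file, its field names, and the `run` helper are assumptions made for illustration. How the actual script stores its progress is not described here.

```python
import json
import os


def load_progress(path):
    """Load saved state, or start fresh if no progress file exists."""
    if not os.path.exists(path):
        return {"next_page": 0, "pages": []}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_progress(path, state):
    """Write state to a temp file, then swap it in so a crash cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh, ensure_ascii=False)
    os.replace(tmp, path)


def run(page_count, process_page, path, every=50):
    """Process pages in order, checkpointing after every `every` pages."""
    state = load_progress(path)
    for page in range(state["next_page"], page_count):
        state["pages"].append(process_page(page))
        state["next_page"] = page + 1
        if state["next_page"] % every == 0:
            save_progress(path, state)
    save_progress(path, state)  # final save covers the last partial batch
    return state
```

With this layout, a crash at page 120 of a 592-page run costs at most the 20 pages processed since the checkpoint at page 100. Rerunning with the same progress file resumes from there.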
Precautions
- The free tier is limited to QPS = 2; the script already waits 0.6 seconds between pages
- The cropping ratios (header 6%, footer 4%) can be adjusted at the top of the script
- After OCR finishes, spot-check a few pages to verify accuracy
- The original high-resolution version is kept on the server; the compressed version is for distribution
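A limit of QPS = 2 means requests must be at least 0.5 s apart, so a 0.6 s interval leaves a small margin. The sketch below shows one way to enforce that spacing and to estimate how much of a run is spent waiting. The `Throttle` class is hypothetical. Unlike a fixed sleep per page, it sleeps only for whatever part of the interval has not already elapsed.

```python
import time


class Throttle:
    """Ensure at least `interval` seconds pass between consecutive calls."""

    def __init__(self, interval=0.6, clock=time.monotonic, sleep=time.sleep):
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def total_pause_seconds(pages, interval=0.6):
    """Upper bound on time spent pausing when every page waits the full interval."""
    return pages * interval
```

For the 592-page case, `total_pause_seconds(592)` comes to about 355 seconds, or roughly 6 of the reported 20 minutes. The remainder would be rendering, upload, and recognition time.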
