Reading the Text Content of a PDF File in C# – 珊瑚贝

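This article applies to PDF files converted from Word documents and similar sources, where the text is stored as actual text. If your PDF is a scanned, image-based document, the code in this article cannot extract any text from it; what you need in that case is OCR.

NuGet: https://www.nuget.org/packages/itext7/

After adding the itext7 package through NuGet (official site: https://itextpdf.com/), the following code extracts the text from a PDF file, one string per page.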

using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfHelper
{
    // Lazily yields the extracted text of each page, in page order.
    public static IEnumerable<string> ExtractText(string filename)
    {
        using (var r = new PdfReader(filename))
        using (var doc = new PdfDocument(r))
        {
            // Page numbers are 1-based, so the last page is GetNumberOfPages() itself.
            int pageCount = doc.GetNumberOfPages();
            for (int i = 1; i <= pageCount; i++)
            {
                // Use a fresh strategy per page so text from earlier pages is not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(doc.GetPage(i), strategy);
                yield return text;
            }
        }
    }
}

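Usage (ToList() requires using System.Linq;):

// Each element holds the text of one page. Replace the placeholder with the path to your PDF file.
var pages = PdfHelper.ExtractText("{path to PDF file}").ToList();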
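
Because scanned, image-based pages yield no text, it can be useful to check which pages came back empty before deciding whether OCR is needed. The following is a minimal sketch of such a check, not part of itext7: the class name `PageTextReport` and the method `FindEmptyPages` are hypothetical names introduced here for illustration, and treating "empty or whitespace-only text" as "probably a scanned page" is only a heuristic assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an itext7 API): inspects per-page text such as
// the sequence returned by PdfHelper.ExtractText.
public static class PageTextReport
{
    // Returns the 1-based numbers of the pages whose extracted text is
    // null, empty, or whitespace only.
    public static List<int> FindEmptyPages(IEnumerable<string> pageTexts)
    {
        return pageTexts
            .Select((text, index) => new { Text = text, PageNumber = index + 1 })
            .Where(p => string.IsNullOrWhiteSpace(p.Text))
            .Select(p => p.PageNumber)
            .ToList();
    }
}
```

It could be called as `var emptyPages = PageTextReport.FindEmptyPages(PdfHelper.ExtractText(path));`, where `path` is your own variable. An empty result does not prove that the text is complete, and a non-empty result may simply point to pages that are blank, so treat the output as a hint only.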

Source: https://www.02405.com/archives/7323