Does Dify have a workflow or plugin to convert Word (doc and docx) to PDF or directly parse Word (doc and docx)?

Does Dify have a workflow or plugin that can convert Word (doc and docx) files to PDF, or directly parse Word (doc and docx) files into a page-by-page format such as “Page 1: Content. Page 2: Content.”?

Yes, both are possible. For file conversion, you can use the Markdown Exporter plugin; for content parsing, you can use the built-in Doc Extractor node. If your PDF is a scanned (image-only) file, you can try the Unstructured plugin, or search for OCR plugins such as “OCR recognition service”. Related post: Uploaded a file and a question together in a workflow, but the LLM says in its reasoning that it received no uploaded document - #12 by lysonober

@Dify_Forum_Helper, please summarize the Q&A.

Okay, here is the Q&A summary for this topic:

Question
The user asks if there is a Dify workflow or plugin that can achieve the following functions:

  1. Convert Word documents (doc/docx) to PDF.
  2. Directly parse Word document content, preferably extracting it page by page (e.g., “Page 1: Content”).

Answer

  1. Direct Content Parsing (Core Solution):
    • Use the built-in Doc Extractor node in the Dify workflow. This is the most direct method for parsing document content.
  2. File Format Conversion:
    • You can use the Markdown Exporter plugin from the Dify plugin marketplace to handle file conversion requirements.
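  3. Handling Image/Scanned PDFs:
    • If the PDF is a scanned (image-only) file, try the Unstructured plugin, or search the plugin marketplace for an OCR plugin such as “OCR recognition service”.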

📚 Related Resources:

  • Plugin Marketplace: Search for “OCR” or check the “Tools” category.
  • Built-in Nodes: Look for Doc Extractor in the workflow editor.
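
For the “Page 1: Content” format in the question, here is a minimal sketch of one way to do the split outside Dify, using only the Python standard library. It rests on several assumptions: the file is a .docx (the legacy .doc format is not handled), the author inserted explicit page breaks, and the function names `split_docx_pages` and `format_pages` are hypothetical helpers, not part of Dify or any plugin. Word computes pagination at render time, so a document without explicit breaks comes back as a single page.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_docx_pages(docx_bytes: bytes) -> list[str]:
    """Split the text of a .docx at explicit page breaks (hypothetical helper)."""
    # A .docx file is a zip archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    pages = [[]]  # one list of text chunks per page
    for el in root.iter():  # document order
        if el.tag == NS + "p" and pages[-1]:
            pages[-1].append("\n")  # new paragraph on the current page
        elif el.tag == NS + "t" and el.text:
            pages[-1].append(el.text)  # run text
        elif el.tag == NS + "br" and el.get(NS + "type") == "page":
            pages.append([])  # explicit page break starts a new page
    return ["".join(chunks).strip() for chunks in pages]


def format_pages(pages: list[str]) -> str:
    """Label each page as 'Page N: ...' (hypothetical helper)."""
    return "\n".join(f"Page {i}: {text}" for i, text in enumerate(pages, start=1))
```

Usage would look like `format_pages(split_docx_pages(open("report.docx", "rb").read()))`, where `report.docx` is a placeholder file name. Whether this can run inside a Dify workflow depends on your deployment, so treat it as a pre-processing step rather than a built-in feature.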

Thank you very much for the detailed answer. I’ll give it a try. 👍👍👍

You can try PaddleOCR as well.
