A file and a question were uploaded simultaneously in the workflow, but the LLM mentioned during reasoning that no uploaded documents were received.

Below is a reusable troubleshooting process, organized chronologically “from error to success,” for the configuration of this Unstructured plugin integration. Future colleagues can follow it step-by-step.


I. Starting Point: LLM Says “No File Received”

  • Scenario: In a workflow, the user uploaded a scanned PDF plus a question, with the built-in “Doc Extractor” node placed before the LLM.
  • Symptom: In its reasoning, the LLM reported “No uploaded documents received.”
  • Root Cause: A scanned PDF contains only images and has no text layer. The built-in Doc Extractor does not perform OCR, so its output text is empty. The prompt then references that empty text, so the LLM naturally says it “didn’t see the file.”

Conclusion: To solve the scanned PDF scenario, it’s necessary to switch to a tool node that supports OCR (such as the Unstructured plugin, or other OCR plugins), rather than relying solely on the built-in document extractor.


II. Phase One: Deploying Unstructured Service & Plugin Integration

1. Starting Unstructured Service Locally

  • The user started an unstructured container locally via Docker (port 8000).
  • This service is on the same Docker network as other Dify containers.

Key point here:
The “service address” configured in the plugin later must be accessible within the container network, not just from the host machine.

2. Installing Unstructured Plugin in Dify

  • Install the official Unstructured plugin from “Plugins / Marketplace.”
  • In the plugin’s API configuration pop-up, you need to fill in:
    • An “Unstructured Service API URL”;
    • Select “Service Type” (local deployment);
    • If the service requires additional authentication, fill in the Token (in this example, a local open service is used, so it can be left blank).

III. Phase Two: Pitfalls Related to URL / FILES_URL Configuration

1. Incorrect API URL Leading to 404

The initial configuration was similar to:

Unstructured Service API URL: http://unstructured:8000/general/v0/general

The problem is:
The plugin internally appends paths, for example, adding /general/v0/general again, so the final request URL becomes:

http://unstructured:8000/general/v0/general/general/v0/general

A large number of 404s can be seen in the Unstructured container logs, indicating that the path was repeatedly appended.

Correction Method:

Only keep the service root address:

http://unstructured:8000

Subsequent paths will be handled by the plugin itself. After correction, the 404s disappeared, and the parameter validation phase began.

Tip: If you later see paths like /xxx/xxx/xxx/xxx/xxx being repeatedly appended and returning 404, you can immediately check if you hardcoded the full path in the plugin.
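
The doubled path above can be reproduced with a minimal sketch. It assumes the plugin simply appends /general/v0/general to whatever URL was configured, which is what the 404 logs suggest. The function names are hypothetical and are not part of the plugin.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the path the plugin appends on its own, taken from the 404 example above.
PLUGIN_PATH = "/general/v0/general"


def final_request_url(configured_url: str) -> str:
    """Mimic a client that appends its own path to the configured URL."""
    return configured_url.rstrip("/") + PLUGIN_PATH


def service_root(configured_url: str) -> str:
    """Drop path, query, and fragment so only scheme://host:port remains."""
    parts = urlsplit(configured_url)
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


bad = "http://unstructured:8000/general/v0/general"
print(final_request_url(bad))                # doubled path, the 404 case
print(final_request_url(service_root(bad)))  # single path after trimming to the root
```

Running service_root on a pasted URL before saving it in the plugin form is one way to avoid the mistake.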

2. FILES_URL / File Access Approach

Although the main issue this time focused on the API URL, your feedback reveals another common pitfall: writing localhost for FILES_URL in .env.

In multi-container deployments:

  • localhost refers to “itself” for each container, not the host machine, nor the Dify Web container;
  • The result is: when the plugin’s container tries to access http://localhost:xxx/..., it cannot reach the file download address exposed by Dify at all.

A more robust approach is:

  • Set FILES_URL in .env to a service name resolvable within the container network, for example (depending on your compose):
FILES_URL=http://web:3000
# Or http://nginx:80, etc., depending on which service you actually expose

As long as the container that fetches the file (the plugin’s container, in the case described above) can curl this address and retrieve the uploaded file, the plugin will work correctly.
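
A quick way to catch this before restarting anything is to check whether the configured value points at a loopback address. The following is a minimal sketch using only the Python standard library. The helper name is hypothetical, and the sample URLs are illustrations rather than values from any particular compose file.

```python
import ipaddress
from urllib.parse import urlsplit


def points_at_loopback(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback IP address."""
    host = urlsplit(url).hostname or ""
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, e.g. a compose service name such as "web".
        return False


for candidate in ("http://localhost:3000", "http://127.0.0.1:3000", "http://web:3000"):
    verdict = "unreachable from other containers" if points_at_loopback(candidate) else "not loopback"
    print(candidate, "->", verdict)
```

A "not loopback" result only rules out this one pitfall. Whether the name actually resolves inside the container network still has to be tested from within a container.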


IV. Phase Three: Unstructured Parameter (chunking_strategy) Error

After correcting the URL, the next error became parameter validation:

An error occurred in ... Partition request failed.
msg:{
  "detail":[
    {
      "type":"literal_error",
      "loc":["body","chunking_strategy"],
      "msg":"Input should be 'by_title'",
      "input":"by_page",
      "ctx":{"expected":"'by_title'"}
    }
  ]
}

The meaning is straightforward:

  • You filled by_page for chunking strategy in the plugin node;
  • The Unstructured interface currently used only accepts 'by_title' (or a limited set of values), thus returning 422 / 400.

Correction Method:

  • Change chunking_strategy to a value actually supported by the interface (e.g., by_title), or simply leave it blank to use the default value for now;
  • First ensure “no errors, results can be produced,” then gradually adjust advanced parameters.

Suggested debugging order:

  1. First, fill in only the minimum parameters (OCR strategy, language, file input) to get the node running successfully;
  2. Then, add chunking_strategy, chunk_size, overlap, etc., one by one according to the official documentation;
  3. Test run after each addition to ensure no 4xx/5xx errors.
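
When several parameters are added at once, it helps to pull the rejected fields out of the error body instead of reading the JSON by eye. The sketch below assumes the error body has the same shape as the message shown earlier in this section (a "detail" list whose items carry "loc", "input", and "msg"). The function name is hypothetical.

```python
import json


def rejected_fields(error_body: str) -> dict:
    """Map each rejected field name to (value that was sent, server's message)."""
    detail = json.loads(error_body).get("detail", [])
    return {
        str(item["loc"][-1]): (item.get("input"), item.get("msg"))
        for item in detail
    }


# The validation error from this section, with the "msg:" prefix removed.
sample = """
{
  "detail": [
    {
      "type": "literal_error",
      "loc": ["body", "chunking_strategy"],
      "msg": "Input should be 'by_title'",
      "input": "by_page",
      "ctx": {"expected": "'by_title'"}
    }
  ]
}
"""

for field, (sent, message) in rejected_fields(sample).items():
    print(f"{field}: sent {sent!r}; server says: {message}")
```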

V. Phase Four: Output Results & Final Form of the Node Link

After correcting the URL + parameters:

  • The Unstructured partition node can now successfully parse scanned PDFs;

  • In the output structure, you see a JSON object, roughly containing:

    • text: A whole segment / multiple concatenated segments of plain text (OCR content);
    • files: [] (empty list);
    • images: May be empty or contain image references;
    • elements: List of structured elements;
    • json: More raw structured results.

1. Why is files empty?

  • This is a unified schema field to be compatible with more complex scenarios like “compressed packages, multiple files, documents with attachments”;
  • You are currently passing a single PDF without nested attachments, so naturally there are no sub-files to output, hence an empty list;
  • It has nothing to do with whether OCR succeeded; the truly useful parts are text / elements / json.

2. Is a subsequent “Doc Extractor” node still needed?

In your scenario:

  • Unstructured has already completed the entire process of “file → OCR → structured elements → text concatenation”;
  • Adding another built-in “Doc Extractor” node would merely process the text result again without additional benefit for scanned PDFs.

Recommended final workflow:

  1. User input node: Upload scanned PDF (user_files).
  2. Unstructured partition node:
    • Input file: Reference {{ user_files[0].file }} (or your current variable).
    • Configure OCR + chunking strategy (ensure parameters are valid).
  3. LLM node:
    • Directly reference {{ partition.text }} in the system / user prompt (using your node’s output name).
  4. For advanced needs:
    • Use elements / json for more granular filtering (e.g., taking only certain pages, certain element types).
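
For the advanced case in step 4, filtering could look like the sketch below, for example in a code node placed after the partition node. It assumes each entry in elements is a dict with "type", "text", and "metadata.page_number" keys, which is the general shape of Unstructured's JSON elements. The exact structure exposed by the plugin may differ, so inspect a real output first. The sample data and the function name are made up for illustration.

```python
def select_text(elements, types=None, pages=None):
    """Join the text of elements that match the given types and page numbers.

    Passing None for a filter means "do not filter on this attribute".
    """
    picked = []
    for el in elements:
        page = el.get("metadata", {}).get("page_number")
        if types is not None and el.get("type") not in types:
            continue
        if pages is not None and page not in pages:
            continue
        picked.append(el.get("text", ""))
    return "\n".join(picked)


# Hypothetical elements, shaped like Unstructured JSON output.
sample_elements = [
    {"type": "Title", "text": "Invoice", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Amount due: 120", "metadata": {"page_number": 1}},
    {"type": "Footer", "text": "Page 1 of 2", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Terms and conditions", "metadata": {"page_number": 2}},
]

print(select_text(sample_elements, types={"Title", "NarrativeText"}, pages={1}))
```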

VI. “Quick Self-Check Checklist” for Future Colleagues

When encountering issues similar to “scanned PDF + Unstructured plugin,” you can troubleshoot in this order:

  1. Service Connectivity

    • The local Unstructured container is running normally on port 8000.
    • curl http://unstructured:8000 is successful from other containers on the same network.
  2. Plugin API URL

    • Only fill in the root address: http://unstructured:8000;
    • Do not manually append paths like /general/v0/general, otherwise it will be repeatedly appended, leading to 404.
  3. FILES_URL / File Access

    • FILES_URL in .env does not use localhost, but rather a service name accessible from other containers, like http://web:3000;
    • The container that fetches the file (the plugin’s container, in the case described earlier) can download files uploaded to Dify using this URL.
  4. Parameter Configuration

    • During initial testing, specify few or no advanced parameters, so that the interface does not return 4xx errors;
    • When providing fields like chunking_strategy, refer to the official documentation for supported values (e.g., by_title), and check the expected hint in the error message if an error occurs.
  5. Workflow Usage

    • In scanned PDF scenarios, directly use the text output from the Unstructured node to the LLM;
    • The built-in document extractor can be omitted in this pipeline;
    • Don’t dwell on files being empty; focus on whether text has content.
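
Item 1 of the checklist can also be checked without curl, for instance from a container that has Python but no curl. The following is a minimal sketch that only tests whether a TCP connection can be opened. It does not verify that the service answers HTTP requests correctly, and the function name is hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Called from inside another container on the same Docker network, can_connect("unstructured", 8000) should return True. A False result points to a network or service-name problem rather than a plugin parameter problem.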

This comprehensive review covers the path from “LLM not seeing files” to “Unstructured OCR running successfully” and the pitfalls encountered along the way. Moving forward, by checking URL → Network → Parameters → Output in this sequence, everyone should be able to quickly pinpoint issues.
