I need to parse scanned PDFs with a document parser. Do I have to deploy the Unstructured service separately? This is a very common requirement, but I couldn’t find any explanation in the official documentation!
Refer to: Document Extractor - Dify Docs
I have already installed and run unstructured-api locally, but why is the document parser component still unable to recognize the content in scanned PDF files?
The .env file configuration is as follows:
The workflow in Dify is as follows:
I’ve pasted the URL for the English version of the page. It seems that the URL may have changed due to the translation tool on your side.
Here’s the Chinese page:
Since the PDF is scanned, I assume that it contains images rather than actual text data.
In this case, some kind of OCR is required to extract the text. As far as I know, Unstructured also doesn’t perform OCR by default. I believe you need to configure the OCR Agent to use OCR in the OSS version:
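A quick way to confirm that a PDF really is image-only is to check whether any page yields embedded text. Below is a minimal sketch, assuming pypdf is installed; the helper names and the 20-character threshold are made up for illustration and are not part of Dify or Unstructured.

```python
def extract_page_texts(pdf_path):
    """Return the embedded text of each page (empty string if a page has none)."""
    # pypdf is a third-party package; it is imported lazily so that
    # looks_scanned() can also be fed text from any other extractor.
    from pypdf import PdfReader

    return [(page.extract_text() or "") for page in PdfReader(pdf_path).pages]


def looks_scanned(page_texts, min_chars_per_page=20):
    """Heuristic: True if no page carries a meaningful amount of text.

    min_chars_per_page is an arbitrary, illustrative threshold.
    """
    return all(len(text.strip()) < min_chars_per_page for text in page_texts)


# Example usage (the file name is hypothetical):
# if looks_scanned(extract_page_texts("scan.pdf")):
#     print("No text layer found; OCR is needed.")
```

If this reports no text layer, a plain text extractor will return an empty result no matter how it is configured, which matches the symptom described here.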
Thank you, I will study it further. I really appreciate it. The file I uploaded is indeed a scanned PDF, which contains no text—only images.
I tried following your article, but it still didn’t work. Here’s my docker-compose.yml for the Dify platform:
I’ve already configured the environment variables as per the article and restarted the unstructured container (using docker compose down and up -d), but the final test result is still the same:
I also checked the Python environment inside the docker-unstructured-1 container, and it seems fine:
~ $ python -V
Python 3.12.12
~ $ pip list
Package Version
accelerate 1.12.0
aiofiles 25.1.0
annotated-doc 0.0.4
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.12.0
backoff 2.2.1
beautifulsoup4 4.14.3
cachetools 6.2.4
certifi 2025.11.12
cffi 2.0.0
charset-normalizer 3.4.4
click 8.3.1
coloredlogs 15.0.1
contourpy 1.3.3
cryptography 46.0.3
cycler 0.12.1
dataclasses-json 0.6.7
Deprecated 1.3.1
effdet 0.4.1
emoji 2.15.0
et_xmlfile 2.0.0
fastapi 0.128.0
filelock 3.20.1
filetype 1.2.0
flatbuffers 25.12.19
fonttools 4.61.1
fsspec 2025.12.0
google-api-core 2.28.1
google-auth 2.45.0
google-cloud-vision 3.11.0
googleapis-common-protos 1.72.0
gpg 1.24.3
grpcio 1.76.0
grpcio-status 1.76.0
h11 0.16.0
hf-xet 1.2.0
html5lib 1.1
httpcore 1.0.9
httpx 0.28.1
huggingface-hub 0.36.0
humanfriendly 10.0
idna 3.11
Jinja2 3.1.6
joblib 1.5.3
kiwisolver 1.4.9
langdetect 1.0.9
lxml 6.0.2
Markdown 3.10
MarkupSafe 3.0.3
marshmallow 3.26.2
matplotlib 3.10.8
ml_dtypes 0.5.4
mpmath 1.3.0
msoffcrypto-tool 5.4.2
mypy_extensions 1.1.0
networkx 3.6.1
nltk 3.9.2
numpy 1.26.4
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-nccl-cu12 2.27.5
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvshmem-cu12 3.3.20
nvidia-nvtx-cu12 12.8.90
olefile 0.47
omegaconf 2.3.0
onnx 1.20.0
onnxruntime 1.23.2
opencv-python 4.11.0.86
openpyxl 3.1.5
packaging 25.0
pandas 2.3.3
pdf2image 1.17.0
pdfminer.six 20260107
pi_heif 1.1.1
pikepdf 10.1.0
pillow 12.0.0
pip 25.1.1
proto-plus 1.27.0
protobuf 6.33.2
psutil 7.2.1
pyasn1 0.6.1
pyasn1_modules 0.4.2
pycocotools 2.0.11
pycparser 2.23
pycryptodome 3.23.0
pydantic 2.12.5
pydantic_core 2.41.5
pypandoc 1.16.2
pyparsing 3.3.1
pypdf 6.5.0
pypdfium2 5.2.0
python-dateutil 2.9.0.post0
python-docx 1.2.0
python-iso639 2025.11.16
python-magic 0.4.27
python-multipart 0.0.21
python-oxmsg 0.0.2
python-pptx 1.0.2
pytz 2025.2
PyYAML 6.0.3
RapidFuzz 3.14.3
ratelimit 2.2.1
regex 2025.11.3
requests 2.32.5
requests-toolbelt 1.0.0
rsa 4.9.1
safetensors 0.7.0
scipy 1.16.3
setuptools 80.9.0.post20251111
six 1.17.0
soupsieve 2.8.1
starlette 0.41.2
sympy 1.14.0
timm 1.0.22
tokenizers 0.22.1
torch 2.9.1
torchvision 0.24.1
tqdm 4.67.1
transformers 4.57.3
triton 3.5.1
typing_extensions 4.15.0
typing-inspect 0.9.0
typing-inspection 0.4.2
tzdata 2025.3
unstructured 0.18.24
unstructured-client 0.42.6
unstructured_inference 1.1.1
unstructured.pytesseract 0.3.15
urllib3 2.6.2
uvicorn 0.40.0
webencodings 0.5.1
wrapt 2.0.1
xlrd 2.0.2
xlsxwriter 3.2.9
Thank you for the update.
I need to share a few facts with you.
- Basically, Unstructured will not be used unless you set ETL_TYPE in your .env file to Unstructured.
- However, the purpose of the ETL_TYPE setting is specifically to process documents that you have uploaded to Knowledge, and not for text extraction nodes within a workflow.
- Therefore, Unstructured can’t be used in text extraction nodes within a workflow.
- Also, even if you upload a file to Knowledge, if that file is a PDF, pypdfium2 will be used for text extraction regardless of the ETL_TYPE setting. Unstructured will not be used.
As a workaround, since you’ve already gone to the trouble of deploying Unstructured, you might want to consider using the following tool in place of the text extraction node.
Since you can also configure the strategy in the tool, I think it can handle OCR as well. However, I’ve never tried it myself, so my ability to help is a little limited.
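To check whether the locally deployed Unstructured service can OCR the file at all, independently of Dify, one option is to call its HTTP API directly with an explicit strategy. The following is only a sketch under assumptions: the URL in the usage comment (localhost, port 8000) depends on how your container is exposed, and the function name is hypothetical. It uses the API's multipart fields files and strategy, and builds the request without sending it.

```python
import urllib.request
import uuid


def build_partition_request(api_url, pdf_bytes, filename,
                            strategy="ocr_only", api_key=""):
    """Build (but do not send) a multipart POST for an Unstructured API server."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field selecting the partitioning strategy
        # (for example "ocr_only" or "hi_res").
        (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="strategy"\r\n\r\n'
            f"{strategy}\r\n"
        ).encode(),
        # The uploaded file, sent under the "files" field.
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode(),
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Accept": "application/json",
    }
    if api_key:
        # Only needed if the server is configured to require a key.
        headers["unstructured-api-key"] = api_key
    return urllib.request.Request(api_url, data=body, headers=headers, method="POST")


# Example usage (URL and file name are assumptions about your setup):
# import json
# with open("scan.pdf", "rb") as f:
#     req = build_partition_request(
#         "http://localhost:8000/general/v0/general", f.read(), "scan.pdf")
# with urllib.request.urlopen(req) as resp:
#     elements = json.load(resp)  # list of extracted elements
```

If this returns text for the scanned PDF, the Unstructured side is working and the remaining question is only which Dify component actually calls it.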
Also, if you search the marketplace using keywords like OCR or PDF, you should be able to find some OCR tools created by the community.
Hope this helps.
Thank you very much for your reply.
Based on this diagram, I always thought my current configuration would work, but upon closer inspection, there are two points of doubt:
- This unstructured can be used for the document extractor node, which conflicts with what you said about it not being usable in workflows!
- Can it only be used for the document extractor (when the uploaded file is a PPTX?)
@Dify_Forum_Helper Summarize the Q&A.
Here is the summary of the Q&A session:
Issue
User zhouciming was unable to extract text from scanned PDFs using the Document Extractor node in a Dify workflow. Despite deploying the local Unstructured service and configuring ETL_TYPE=Unstructured (along with OCR environment variables) in .env and docker-compose.yaml, the output remained empty.
Diagnosis
Community member kurokobo clarified the limitations of the native configuration:
- Setting Scope: The ETL_TYPE=Unstructured setting in .env is primarily designed for Knowledge Base indexing (dataset uploads), not for the runtime “Document Extractor” node within Workflows.
- PDF Handling: Even within Knowledge Bases, Dify often defaults to pypdfium2 for PDFs regardless of the ETL_TYPE setting, meaning the Unstructured service (and its OCR) is not invoked.
- Docs vs Reality: The user noted that .env comments mention Unstructured support for the “document extractor node for pptx,” confirming that its application to other file types (like PDF) in the workflow node is not supported by default.
Solution / Workaround
To perform OCR on scanned PDFs within a workflow:
- Recommendation: Instead of the native Document Extractor node, use the Unstructured Plugin (available in the Dify Marketplace) or other community-created OCR tools.
- Why: Plugins allow for explicit configuration of processing strategies (e.g., enabling OCR) that the native node does not expose or respect via global environment variables.
My explanation was incomplete. To be more precise, whether Unstructured can be used in text extraction nodes depends on the file type.
This time, the question was about PDFs, so I answered “it can’t be used in text extraction nodes” under the assumption of PDF files.
However, as noted in the comments of .env.example, for ppt and pptx, as well as doc and epub files, text extraction is performed using UNSTRUCTURED_API_URL and UNSTRUCTURED_API_KEY.
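The file-type rule described above can be written down as a small decision function. This is only a model of what was said in this thread, not Dify's actual code; the function name is hypothetical, and treating an empty UNSTRUCTURED_API_URL as "Unstructured is not called" is an assumption.

```python
import os

# File types for which, per the .env.example comments cited in this thread,
# the text extraction node calls the Unstructured API.
UNSTRUCTURED_EXTENSIONS = {".ppt", ".pptx", ".doc", ".epub"}


def uses_unstructured(filename, unstructured_api_url):
    """Return True if, by the thread's description, Unstructured would be called.

    Assumption: Unstructured is only reachable when an API URL is configured.
    """
    extension = os.path.splitext(filename)[1].lower()
    return bool(unstructured_api_url) and extension in UNSTRUCTURED_EXTENSIONS


# Example usage (the URL is an assumed value for a docker-compose deployment):
# uses_unstructured("slides.pptx", "http://unstructured:8000/general/v0/general")
# uses_unstructured("scan.pdf", "http://unstructured:8000/general/v0/general")
```

Under this model a PDF never reaches Unstructured through the text extraction node, which is why the OCR settings on the Unstructured container had no visible effect in the workflow.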






