Markdown converter can't get file path, output variable unavailable?

In the output variables of the Markdown converter, the three corresponding keys are text, files, and json. Among them, files is an array, containing basic file information and the URL.

In the next node, when I use files as context, an error occurs, as follows:

Run failed: Invalid context structure: dify_model_identity='__dify__file__' id=None tenant_id='ccd7d781-3643-4d21-a0e9-72ad023' type=<FileType.DOCUMENT: 'document'> transfer_method=<FileTransferMethod.TOOL_FILE: 'tool_file'> remote_url=None related_id='9eb8576-436c-b83e-529cba96c66c' filename='20260121_170237.xlsx' extension='.xlsx' mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' size=9845

Why can’t I extract the URL I want, and Dify also reports an error??? Please, experts, advise!

Looking at the screenshot, it seems you want to extract content from xlsx and then feed it to an LLM for analysis? Don’t use ‘context’ to reference files; for the correct method, refer to: Document Extractor - Dify Docs

@Dify_Forum_Helper, please provide some information, combining basic knowledge and story points.

Additionally, the Marketplace also has some interesting Plugins related to Excel that can be used for future exploration :grinning_face_with_smiling_eyes: @Dify_Forum_Helper, you can also search for them.

There are indeed two easily confused points here: “file objects” and “text content usable by LLM”.

1. Why does using files as “context” cause an error?

The output of the Markdown converter is roughly structured like this:

{
  "text": "...",
  "files": [
    {
      "dify_model_identity": "__dify__file__",
      "type": "document",
      "filename": "20260121_170237.xlsx",
      "extension": ".xlsx",
      "mime_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
      "size": 9845,
      "url": "https://...signed_link..."
    }
  ],
  "json": [...]
}
  • Each item in files is essentially a Dify internal file object / handle, containing information like __dify__file__, FileType.DOCUMENT, etc.
  • The “context” of an LLM node expects string text or specific structured text, not this kind of “file object”.

So, when you bind the entire files array directly to the “context” in the next LLM node, the LLM node receives a “file object array”, whose structure does not match what it expects, leading to the error you saw:

Run failed: Invalid context structure: dify_model_identity='__dify__file__' ... type=<FileType.DOCUMENT: 'document'> ...

It’s not that the url cannot be retrieved, but rather: this entire object should not be passed as context to the LLM at all.

Analogy: You are currently passing “an Excel file handle + metadata” to the LLM, not “table content”, so the model naturally “doesn’t understand”.


2. If you want the LLM to analyze xlsx content, what is the correct approach?

Based on the screenshot/description, your goal should be:
To have the LLM read the table content from an xlsx file and then perform analysis.
In this case, do not directly reference files using “context”. Instead, you should:

  1. Use a “Document Extractor node” to parse the file content

    • Input: The file variable output by the upstream node (can be an array) – for example, the files from the Markdown converter, or files uploaded via the Start node.
    • Output: Plain text (e.g., converting Excel to Markdown table text).
  2. In the LLM node, use the text output by the Document Extractor node as prompt / context
    For example, in the system prompt or user prompt, write something like:

    Below is the Excel content uploaded by the user (converted to a Markdown table):
    {{ doc_extractor.text }}
    Please answer the user's questions based on the table above...
    

This way, the LLM receives plain text tables, and the Invalid context structure error will no longer be triggered.


3. What if you just want to get the file’s URL?

If you merely want to get a field like files[0].url, rather than directly “feeding” this object to the LLM, you can:

  • Parse it in a code node (Python / JS):

    def main(files: list) -> dict:  # assumes the converter's files output is bound to an input variable named "files"
        return {"file_url": files[0]["url"]}
    
  • Then output this file_url for subsequent nodes to use (e.g., calling your own service, etc.).

However, the URL is not suitable for direct use as LLM “context” either, because it points to a binary Excel file, which the LLM itself still cannot read. The file must first be converted to text by an intermediate “parsing node”.
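
If the goal is to collect the URLs and turn relative ones into absolute download links, the extraction can be sketched as below. This is a minimal sketch, assuming each item in files is a dict with a url key (as in the structure shown earlier) and that relative URLs should be resolved against your own Dify address; the helper name collect_file_urls and the base URL are hypothetical.

```python
from urllib.parse import urljoin


def collect_file_urls(files: list, base_url: str) -> list:
    """Return one absolute URL per file object that carries a "url" field."""
    urls = []
    for item in files:
        url = item.get("url")  # assumption: file objects arrive as dicts
        if not url:
            continue  # skip entries without a URL instead of failing
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        urls.append(urljoin(base_url, url))
    return urls
```

Skipping entries without a URL keeps the node from failing when the converter returns a file object that has no link attached.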


4. Extended usage for Excel

In addition to the general method of using a “Document Extractor node”, Dify’s Marketplace also offers some plugins for Excel (e.g., more flexible reading, writing, querying by sheet name/cell range).
If you want to perform more complex Excel processing later, you can consider:

  • Using Excel-related plugins for structured processing
  • Or combining multiple steps: Plugin → Document Extractor → LLM

Summary

  1. files is an array of file objects and cannot be used directly as LLM context, which is why it reports Invalid context structure.
  2. To have the LLM read xlsx content:
    Markdown / Start node (produces files) → Document Extractor node → Outputs text → Used as context / Prompt for the LLM node.
  3. To simply get the URL: Use a code node to extract it from files[i].url, do not directly pass the entire files to the LLM.

If you wish, you can paste a brief structure of your current workflow (each node type + key connections), and I can help you write a specific connection and variable reference example for “uploading Excel to LLM analysis”.

Thanks for the answer!

My requirement is just to extract the file URL returned by the Markdown converter, to concatenate it into a complete file download URL (absolute path).

Currently, I have obtained files[i].url using a code block, so the problem is solved.

Another question arises: in the upstream LLM node, I requested the LLM to reply with a standard Markdown table, with the goal of having the next Markdown converter node generate Excel.

By checking the data format returned by the LLM, I believe there’s no issue. For example, the text in the image below conforms to Markdown syntax, but the Markdown converter did not generate corresponding Excel sheets based on the ## heading syntax. How should I handle this?

image

@Dify_Forum_Helper

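Conclusion first: your current approach (having the LLM output Markdown with ## headings and expecting the Markdown converter to split Sheets by ## automatically) is not guaranteed to work in Dify. The Markdown converter currently has no “officially agreed” rule that ## always becomes an Excel worksheet name.

In other words:

  • It can indeed convert Markdown tables to Excel;
  • But the “multiple Sheets + Sheet naming logic” is currently something of a black box, not a publicly documented, configurable capability, so seeing “the Markdown looks fine, but it was not split into Sheets by ##” is consistent with the current implementation.

Below are several ways you can handle this.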


1. Understanding the Current Behavior

Based on the structure you provided:

{
  "text": "## XXX\n\n| 列1 | 列2 |\n| --- | --- |\n| ... | ... |"
}

Your expectation is:

  • ## XXX → An Excel Sheet named XXX appears
  • ## YYY → A new Sheet is created

However, the Markdown converter currently behaves more like this:

  • It treats the entire Markdown section as a single document;
  • The focus is on the ‘table itself’ rather than ‘heading level → Sheet’.

Therefore, common phenomena are:

  • It might only generate one Sheet;
  • Or it uses a default Sheet name (e.g., Sheet1) instead of the text within ##.

This is not because your Markdown is incorrect, but rather because the current product design does not treat ## as a formal protocol for ‘Sheet boundary + Sheet name’.
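
To see which of these cases you are hitting, you can list the worksheet names in the file the converter actually produced. The following is a minimal sketch using only the Python standard library, assuming you already have the .xlsx bytes at hand (for example, downloaded from the file URL); list_sheet_names is a hypothetical helper name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by xl/workbook.xml in typical (transitional) .xlsx files
MAIN_NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def list_sheet_names(xlsx_bytes: bytes) -> list:
    """Return the worksheet names stored in an .xlsx file, in workbook order."""
    # An .xlsx file is a zip archive; the sheet list lives in xl/workbook.xml
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        workbook_xml = package.read("xl/workbook.xml")
    root = ET.fromstring(workbook_xml)
    return [sheet.get("name") for sheet in root.findall("m:sheets/m:sheet", MAIN_NS)]
```

A single entry, or names that do not match your ## headings, would confirm that the headings were not used as Sheet boundaries for that input.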


2. Solutions for Achieving ‘Split Sheets by Heading’

If you have a strong requirement to ‘split Sheets by ##’, you can consider bypassing the Markdown converter’s default rules by using a ‘code node + Excel library’ approach to explicitly write your desired structure into multiple Sheets.

Approach A: LLM Outputs Structured JSON, Then Code Generates Excel

  1. In the LLM node, do not directly have it output Markdown; instead, have it output structured JSON, for example:

    {
      "sheets": [
        {
          "name": "SheetA",
          "table": [
            ["列1", "列2"],
            ["a1", "a2"],
            ["b1", "b2"]
          ]
        },
        {
          "name": "SheetB",
          "table": [
            ["列1", "列2"],
            ["x1", "x2"]
          ]
        }
      ]
    }
    
  2. Use a code node (Python recommended) to parse this JSON, and use libraries like openpyxl or pandas to create the Excel file yourself, with full control over multiple tables and Sheets.

    Example sketch (assumes openpyxl is installed in the code node's sandbox):

    import io
    import json

    from openpyxl import Workbook


    def main(llm_text: str) -> dict:
        # Assumes llm_text is exactly the JSON shown above
        data = json.loads(llm_text)

        wb = Workbook()
        wb.remove(wb.active)  # delete the default sheet

        for sheet in data["sheets"]:
            # Excel sheet names are max 31 characters
            ws = wb.create_sheet(title=sheet["name"][:31])
            for row in sheet["table"]:
                ws.append(row)

        # Save to memory and return to subsequent nodes as a file
        buffer = io.BytesIO()
        wb.save(buffer)

        # Assumption: the node accepts this dict shape; if raw bytes are rejected,
        # base64-encode the content instead
        return {
            "excel_file": {
                "type": "document",
                "filename": "result.xlsx",
                "content": buffer.getvalue(),
            }
        }
    
  3. Subsequently, you can provide this excel_file as a regular file for users to download, or pass it to other nodes.

Advantages:

  • Completely independent of the Markdown converter’s internal rules;
  • Sheet names, number of Sheets, and content of each Sheet are all under your control;
  • The LLM’s task is also clearer: it’s only responsible for ‘structured planning,’ not Excel details.
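
Since Sheet names are now under your control, it may be worth normalizing whatever names the LLM proposes before writing them. The sketch below is one possible approach, not part of Dify or openpyxl: it replaces the characters Excel rejects in worksheet names, enforces the 31-character limit, and makes duplicates unique. The helper name safe_sheet_names and the "_2" suffix convention are assumptions.

```python
import re

# Characters Excel does not allow in worksheet names: \ / * ? : [ ]
_INVALID_CHARS = re.compile(r"[\\/*?:\[\]]")
MAX_LEN = 31  # Excel's length limit for worksheet names


def safe_sheet_names(names: list) -> list:
    """Make names valid and unique so each can be used as a worksheet title."""
    result = []
    seen = set()
    for index, raw in enumerate(names, start=1):
        base = _INVALID_CHARS.sub("_", raw).strip().strip("'")[:MAX_LEN]
        if not base:
            base = f"Sheet{index}"  # fall back when nothing usable is left
        candidate, counter = base, 1
        # Excel compares sheet names case-insensitively
        while candidate.lower() in seen:
            counter += 1
            suffix = f"_{counter}"
            candidate = base[: MAX_LEN - len(suffix)] + suffix
        seen.add(candidate.lower())
        result.append(candidate)
    return result
```

You would call this on the list of names from the LLM output and use the returned names, in the same order, when creating the sheets.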

Approach B: Continue Having LLM Output Markdown, But Use Code to Split and Write Excel

If your current LLM Prompt is already fixed to output Markdown, and you prefer Markdown’s readability, you can:

  1. Still require the LLM to use a similar structure:

    ## SheetA
    
    | 列1 | 列2 |
    | --- | --- |
    | a1  | a2  |
    | b1  | b2  |
    
    ## SheetB
    
    | 列1 | 列2 |
    | --- | --- |
    | x1  | x2  |
    
  2. Add a code node downstream to do two things:

    • Use regular expressions / Markdown parsing libraries to split the text into blocks by ## headings;
    • Extract the first table in each block as a 2D array, then write multiple Excel Sheets in a similar way to the above.

    For example, a simple regex approach (a sketch; parse_markdown_table is a helper you supply yourself):

    import re


    def split_into_sheets(md: str) -> list:
        """Split LLM Markdown on "## " headings into [{"name": ..., "table": ...}]."""
        # Split into sheet blocks; blocks[0] is any content before the first ## and is ignored
        blocks = re.split(r'^##\s+', md, flags=re.MULTILINE)
        sheet_blocks = blocks[1:]

        sheets = []
        for block in sheet_blocks:
            lines = block.splitlines()
            if not lines:
                continue  # heading marker with nothing after it
            # The first line of the block is the sheet name
            sheet_name = lines[0].strip()
            sheet_body = "\n".join(lines[1:])

            # Parse the first Markdown table in sheet_body into a 2D array
            # (write your own parser or use an existing library)
            table = parse_markdown_table(sheet_body)

            sheets.append({"name": sheet_name, "table": table})
        return sheets

    # Then reuse the openpyxl writing logic from above
    
  3. Ultimately, this code node will generate the Excel file, rather than relying on the Markdown converter to automatically infer Sheets.
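
The parse_markdown_table helper mentioned above is not a library function; you have to provide it. Here is a minimal sketch of one possible version. It assumes every table row starts with a pipe character and does not handle escaped pipes inside cells, so treat it as a starting point rather than a complete Markdown parser.

```python
def parse_markdown_table(text: str) -> list:
    """Return the first Markdown pipe table in text as a list of rows."""
    rows = []
    separator_seen = False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("|"):
            if rows:
                break  # the first table has ended
            continue  # still looking for the start of the table
        cells = [cell.strip() for cell in stripped.strip("|").split("|")]
        # Skip the separator row (e.g. | --- | :---: |) directly under the header
        is_separator = all(cell and set(cell) <= set("-:") for cell in cells)
        if len(rows) == 1 and not separator_seen and is_separator:
            separator_seen = True
            continue
        rows.append(cells)
    return rows
```

The first returned row is the header and the remaining rows are data, which matches the "table" layout used in the JSON structure of Approach A.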


3. What If You Absolutely Must Continue to Rely on the Markdown Converter?

Based on currently available information, there is no authoritative documentation describing:

  • The mapping rules between headings like ##, ###, etc., and Sheets;
  • Or a switch that can enable ‘splitting Sheets by heading’.

Therefore:

  • Even if you currently discover a way that ‘seems to split Sheets by heading,’ it might be an implementation detail and not necessarily stable in future versions;
  • For production scenarios, it’s still recommended to use the aforementioned ‘LLM + code-generated Excel’ approach to avoid relying on undocumented behavior.

4. Practical Advice (Minimal Change Version)

On your existing workflow, if you don’t want to make major changes, you can fine-tune it like this:

  1. Upstream LLM:

    • Continue to output ‘Markdown with ##’;

    • But additionally, in the Prompt, ensure the model outputs a JSON structure simultaneously, for example:

      First, provide the Markdown for readability;
      Then, provide a JSON code block with the following structure:
      ```json
      { "sheets": [ { "name": "...",
                      "table": [["表头1","表头2"],["...","..."]] } ] }
      ```
      
  2. Downstream, add a new code node:

    • Specifically read the JSON section, ignoring the Markdown section;
    • Generate the file using the JSON → multi-Sheet Excel logic;
    • No longer use the Markdown converter to convert to Excel.

This way:

  • Humans can still view elegant Markdown in chat records;
  • When the machine generates Excel, it follows a ‘structured + controllable’ path.
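
Reading only the JSON section of such a mixed reply could look roughly like this. It is a minimal sketch, assuming the model really wraps the JSON in a fenced block tagged json and that the last such block is the one you want; extract_sheets is a hypothetical helper name, and the validation is deliberately shallow.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here to keep this example easy to embed
_JSON_BLOCK = re.compile(FENCE + r"json\s*\n(.*?)\n\s*" + FENCE, re.DOTALL)


def extract_sheets(llm_text: str) -> list:
    """Pull the "sheets" list out of the last json code block in the LLM reply."""
    blocks = _JSON_BLOCK.findall(llm_text)
    if not blocks:
        raise ValueError("no json code block found in the LLM output")
    data = json.loads(blocks[-1])  # the Markdown part of the reply is ignored
    sheets = data.get("sheets") if isinstance(data, dict) else None
    if not isinstance(sheets, list):
        raise ValueError('expected a top-level "sheets" list')
    for sheet in sheets:
        valid = (
            isinstance(sheet, dict)
            and isinstance(sheet.get("name"), str)
            and isinstance(sheet.get("table"), list)
        )
        if not valid:
            raise ValueError("each sheet needs a string name and a list table")
    return sheets
```

Raising an error when the block is missing or malformed makes the node fail visibly, which is usually easier to debug than an empty Excel file.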

Summary

  • The core issue you’re facing is: The Markdown converter does not have a reliable, publicly documented rule for ‘## → Excel sheet’, so it won’t create multiple Sheets in the way you expect.
  • To stably control Sheet names and structure, it’s recommended to use:
    • ‘LLM outputs structured JSON → code node generates Excel’; or
    • ‘LLM outputs Markdown with ## → code node splits by ## → writes Excel itself’.
  • This way, you don’t rely on undocumented internal implementation details, and future version upgrades will be safer.

If it’s convenient for you, you can paste a small, complete example of your current LLM output (after anonymization), and I can directly help you write a corresponding code node example to convert this output into a multi-Sheet Excel file.

I am considering switching to a code block to parse JSON or Markdown myself to generate Excel.

If the Markdown converter cannot ensure support for such usage, then I think the description information here in Dify should at least be modified to avoid causing confusion for users.

@Dify_Forum_Helper

Your follow-up feedback is important, and the text in your screenshot can indeed easily be misread as a firm commitment:

“All heading text (e.g., #, ##) will be used sequentially as worksheet names.”

Combining this with your actual test results just now, we can say two things quite clearly:

  1. From the perspective of “actual product behavior”

    • The Markdown converter currently has no problem converting Markdown tables to Excel.
    • However, “splitting sheets by heading and strictly using heading text as sheet names” is not a reliable, stable contract in the current implementation (especially when using LLM to automatically generate Markdown, the model can easily produce extra blank lines, explanatory text, code blocks, etc., all of which can affect parsing).
    • The actual effect is often:
      • Sometimes only one sheet is generated;
      • Or the sheet name is inconsistent with the heading;
      • Combinations of multi-level headings/multiple tables may not be split as you expect.
  2. From the perspective of “consistency between text and actual behavior”

    • The tone of the description you saw is “will definitely”:

      “All heading text (e.g., #, ##) will be used sequentially as worksheet names.”

    • But the current implementation is closer to “in some scenarios/under limited rules, it attempts to use headings as worksheet names,” rather than “unconditionally guaranteed.”
    • For scenarios like yours, where LLMs automatically generate Markdown, this description indeed makes people mistakenly believe that “as long as there’s ##, you’ll 100% get a corresponding Sheet.”

Therefore, your current plan to “switch to using a code block to parse JSON or Markdown yourself to generate Excel” is a more robust choice from an engineering practice perspective—especially when you want precise control over multiple sheets and naming.


How it can be implemented & How I would suggest the official team handles it

  1. Your own implementation choice (a reasonable path)

    • Use LLM to output structured JSON / or a pre-defined Markdown structure;
    • In the code node, you can:
      • Parse out the sheet list ([{name, table}, ...]);
      • Manually write Excel using libraries like openpyxl / pandas, with full control over multiple sheets and naming.
    • This approach avoids relying on any “undocumented details,” which is best for production stability.
  2. A suggestion regarding the text / product (I will summarize it internally)

    • There is indeed a discrepancy between the current text and the truly guaranteed behavior.
    • A safer wording should be similar to:
      • “When the Markdown structure meets specific rules, it will attempt to use heading text as the worksheet name”
        or
      • Simply not promise automatic sheet splitting by heading, but rather emphasize:
        “Currently, it primarily supports converting Markdown tables to Excel worksheets; if precise control over multiple worksheets and naming is required, it is recommended to use a code node or plugin to generate Excel yourself.”
    • Your feedback in the post is inherently very valuable:
      • It helps the product/documentation team realize that “the promise made in the text > the behavior the code can actually guarantee”;
      • And it can prevent more users from falling into the same trap later.

Summary to help you confirm the current situation

  • Your choice to “generate Excel yourself through code” is correct, especially in scenarios where you need precise control over sheet names/quantities.
  • Based on your current experience, this UI description of the Markdown converter does indeed cause misunderstanding, and your suggestion (to adjust or weaken this text) is a very reasonable product improvement point.

If you later need a concrete Python example that turns your current LLM output (Markdown or JSON) into a multi-sheet Excel file in the code node, you can paste a small de-identified snippet of the actual output, and I can write a ready-to-use code node example for your format.