Paper-Essence Workflow Setup Tutorial for Paper Digest Push

:open_book: Project Introduction

Paper-Essence is an automated paper push workflow built on the Dify platform. This workflow can:

  • :one_o_clock: Regularly fetch the latest papers from specified research areas on ArXiv daily
  • :robot: Intelligently filter out the most valuable papers using large models
  • :page_facing_up: Parse paper PDFs via OCR to extract key technical details
  • :e_mail: Generate structured paper daily reports and push them automatically via email

GitHub Repository: https://github.com/LiaoYFBH/PaperFlow. You can import prj\Paper-Essence-CN.yml or prj\Paper-Essence-EN.yml directly into Dify.


:hammer_and_wrench: Prerequisites

1. Platform and Account Preparation

  • Dify Platform Account: Ensure you have registered and logged into the Dify platform (Docker deployment is also an option)
  • Email Account: An email account that supports SMTP is required (this tutorial uses 163 Mail)
  • Large Model API: Requires configuring the API from Wenxin Feipaddle Xinghe Community (文心飞桨星河社区)

2. Install Necessary Plugins

Install the following plugins from the Dify Plugin Marketplace:

| Plugin Name | Purpose |
| --- | --- |
| PaddleOCR | PDF/image OCR parsing |
| 163SMTP Email Sending | SMTP email sending via 163 Mail |
| Supabase | Database storage (records pushed papers) |
| Wenxin Feipaddle Xinghe Community (文心飞桨星河社区) | Xinghe Community API access to Baidu's Wenxin large models |

3. Prepare Supabase Database

To avoid re-pushing papers that users have already received, this workflow uses the hosted database Supabase.

Step 1: Log in and Create a Project

Visit supabase.com, click login in the top right corner, then click Start your project

Step 2: Create a Data Table

Execute the following SQL statement in the SQL Editor:

This creates a table named pushed_papers that records push history. It has two core fields, arxiv_id and pushed_at, with a primary-key constraint on arxiv_id so each paper can be recorded only once, ensuring the uniqueness and validity of push records.

create table pushed_papers (
  arxiv_id text not null,
  pushed_at timestamp default now(),
  primary key (arxiv_id)
);

Step 3: Obtain API Key

Record the following information in Project Settings → API:

  • NEXT_PUBLIC_SUPABASE_URL → Corresponds to the Supabase URL for the Dify Supabase plugin
  • NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY → Corresponds to the Supabase Key for the Dify Supabase plugin

Step 4: Configure the Supabase Plugin in Dify

Configure the Supabase plugin in Dify Plugin Management, filling in the URL and Key mentioned above.


:bar_chart: Overall Workflow Architecture

Process Description

| Stage | Node | Function |
| --- | --- | --- |
| Trigger | Scheduled Trigger | Starts automatically at a specified time daily |
| Configuration | Configuration Node | Reads all environment variables and outputs them for subsequent nodes |
| Translation | LLM Translation | Translates the research topic into English |
| Search | Get Rows → Preprocessing → HTTP Request → Post-processing | Queries pushed records, then searches ArXiv for new papers |
| Initial Review | LLM Initial Review | Uses an LLM to filter the Top 3 papers |
| Iteration | Iteration Node | For each paper: Unpack → Record → OCR → Analyze → Assemble |
| Output | Template Conversion → Email Sending | Generates the formatted report and pushes it via email |

Full view of this workflow:


:wrench: Detailed Setup Steps

Step One: Create Workflow

  1. Log in to the Dify platform
  2. Click “Studio” → “Create App” → Select “Workflow” type
  3. Enter application name
  4. Select trigger type

Step Two: Configure Environment Variables

Click the settings button in the top right corner of the UI, and add environment variables:

| Variable Name | Type | Description | Example Value |
| --- | --- | --- | --- |
| table_name | string | Supabase table name | pushed_papers |
| SMTP_PORT | string | Email SMTP port | 465 |
| SMTP_SERVER | string | SMTP server | smtp.163.com |
| SMTP_PASSWORD | secret | Email authorization code | (your authorization code) |
| SMTP_USER | secret | Email account | your_email@163.com |
| MY_RAW_TOPIC | string | Research topic | agent memory |

Get email authorization code: Log in to 163 Mail → Settings → POP3/SMTP/IMAP → Enable service and get authorization code


Step Three: Scheduled Trigger Node

Node Name: Scheduled Trigger
Configuration Items:

  • Trigger Frequency: Daily
  • Trigger Time: 8:59 AM (or adjust as needed)

Step Four: Configuration Node (Code Node)

Node Name: Configuration
Node Type: code

This node is responsible for reading all environment variables and outputting them for subsequent nodes.

Input Variables:

  • Read from environment variables: SMTP_USER, MY_RAW_TOPIC, SMTP_PORT, SMTP_SERVER, SMTP_PASSWORD, table_name

Output Variables:

  • raw_topic: Research topic
  • user_email: Recipient email
  • fetch_count: Number of papers to search (default 50)
  • push_limit: Push quantity limit (default 3)
  • days_lookback: Days to look back (default 30)
  • and SMTP related configurations

Code:

def main(
    SMTP_USER: str,
    MY_RAW_TOPIC: str,
    SMTP_PORT: str,
    SMTP_SERVER: str,
    SMTP_PASSWORD: str,
    table_name: str
) -> dict:
    # Pass the environment variables through unchanged, and attach
    # fixed defaults for fetch size, push limit, and look-back window.
    return {
        "raw_topic": MY_RAW_TOPIC,
        "user_email": SMTP_USER,      # recipient defaults to the sender account
        "smtp_port": SMTP_PORT,
        "smtp_server": SMTP_SERVER,
        "smtp_password": SMTP_PASSWORD,
        "fetch_count": 50,
        "push_limit": 3,
        "days_lookback": 30,
        "table_name": table_name
    }

Step Five: Research Field Translation (LLM Node)

Node Name: Research Field LLM Translation
Node Type: llm

Translates Chinese research topics into English boolean query strings recognizable by the ArXiv API.

Model Configuration:

  • Model: ernie-4.5-turbo-128k or ernie-5.0-thinking-preview
  • Temperature: 0.7

System Prompt:

You are a professional academic translation assistant. Please translate the user's Chinese research topic into English and convert it into a boolean query string format suitable for the ArXiv API.

Requirements:
1. Identify core keywords and translate them into English
2. Use AND/OR to connect multiple keywords
3. Output plain text, without additional explanations
4. Example: Input "多模态大模型" → Output "multimodal AND large language model"

Step Six: Query Pushed Records (Supabase Node)

Node Name: Get Rows
Node Type: tool (Supabase)

Query pushed paper records from the Supabase database to avoid duplicate pushes.

Configuration:

  • Table Name: {{table_name}} (obtained from the configuration node)

Step Seven: Search Papers (Split into 3 Nodes)

To improve the stability and maintainability of the workflow, the search function is split into three consecutive nodes: “Preprocessing” → “HTTP Request” → “Post-processing”.

7.1 Search Paper Node Preprocessing (Code Node)

Node Name: Search Paper Node Preprocessing
Node Type: code

This node is responsible for preparing search parameters, calculating the date range, and constructing the ArXiv API query string.

Input Variables:

  • topic: Translated English search term
  • days_lookback: Days to look back
  • count: Number of searches
  • supabase_output: Pushed records (for deduplication)

Code Logic:

  1. Calculate look-back date (cutoff_date)
  2. Parse the list of pushed paper IDs returned by Supabase
  3. Construct a boolean query string based on the topic (supports AND/OR logic)
  4. Add ArXiv category restrictions based on topic keywords (e.g., cs.CV, cs.CL, etc.)
  5. Extract search keywords for subsequent filtering

Output Variables:

  • base_query: Constructed query string
  • pushed_ids: List of pushed IDs
  • cutoff_str: Cutoff date string
  • search_keywords: List of search keywords
  • fetch_limit: API fetch quantity limit
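The steps above can be sketched as a code node like the following. This is a minimal illustration, not the repository's actual node code: the argument names mirror the input variables listed above, the `ti:`/`abs:` field prefixes are standard ArXiv query syntax, and the category-restriction step (step 4) and the over-fetch factor of 2 are assumptions.

```python
import json
from datetime import datetime, timedelta, timezone

def main(topic: str, days_lookback: int, count: int, supabase_output: str) -> dict:
    # 1. Compute the look-back cutoff date
    cutoff = datetime.now(timezone.utc) - timedelta(days=days_lookback)
    cutoff_str = cutoff.strftime("%Y-%m-%d")

    # 2. Parse the already-pushed IDs (Supabase rows arrive as a JSON string)
    try:
        rows = json.loads(supabase_output) if supabase_output else []
    except json.JSONDecodeError:
        rows = []
    pushed_ids = [r.get("arxiv_id") for r in rows if isinstance(r, dict)]

    # 3. Build a boolean query over title and abstract for each keyword
    keywords = [k.strip() for k in topic.split(" AND ") if k.strip()]
    base_query = " AND ".join(f'(ti:"{k}" OR abs:"{k}")' for k in keywords)

    # 5. Over-fetch so downstream filtering still leaves enough papers
    return {
        "base_query": base_query,
        "pushed_ids": pushed_ids,
        "cutoff_str": cutoff_str,
        "search_keywords": keywords,
        "fetch_limit": count * 2,
    }
```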

7.2 HTTP Request (HTTP Node)

Node Name: HTTP Request
Node Type: http-request

Directly calls the ArXiv API to get raw XML data.

Configuration:

  • API URL: https://export.arxiv.org/api/query
  • Method: GET
  • Query Parameters:
    • search_query: {{base_query}}
    • start: 0
    • max_results: {{fetch_limit}}
    • sortBy: submittedDate
    • sortOrder: descending
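For clarity, these query parameters assemble into a single GET URL as shown below (the demo query string is an arbitrary example, not part of the workflow):

```python
from urllib.parse import urlencode

def build_arxiv_url(base_query: str, fetch_limit: int) -> str:
    # Same parameters as the Dify HTTP node, assembled into one GET URL
    params = {
        "search_query": base_query,
        "start": 0,
        "max_results": fetch_limit,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return "https://export.arxiv.org/api/query?" + urlencode(params)
```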

7.3 Search Paper Node Post-processing (Code Node)

Node Name: Search Paper Node Post-processing
Node Type: code

Parses the XML data returned by the API and performs fine-grained filtering.

Input Variables:

  • http_response_body: HTTP node response body
  • and all output variables from the preprocessing node

Code Logic:

  1. Parse XML response
  2. Deduplication Filter: Remove papers found in pushed_ids
  3. Date Filter: Remove papers older than cutoff_date
  4. Keyword Filter: Ensure the title or abstract contains at least one search keyword
  5. Format output as a list of JSON objects

Output Variables:

  • result: Final filtered paper list (JSON string)
  • count: Final paper count
  • debug: Debugging information (includes filtering statistics)
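A simplified sketch of the parsing and filtering logic, assuming the standard Atom feed that the ArXiv API returns (the ID-extraction heuristic handles new-style IDs like 2401.00001v1 only; the real node also applies the keyword filter and emits debug statistics):

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def parse_feed(xml_text: str, pushed_ids: list, cutoff_str: str) -> list:
    papers = []
    for entry in ET.fromstring(xml_text).iter(ATOM + "entry"):
        raw_id = entry.findtext(ATOM + "id", "")   # e.g. http://arxiv.org/abs/2401.00001v1
        arxiv_id = raw_id.rsplit("/", 1)[-1].split("v")[0]
        published = entry.findtext(ATOM + "published", "")[:10]
        if arxiv_id in pushed_ids or published < cutoff_str:
            continue                               # dedup filter + date filter
        papers.append({
            "arxiv_id": arxiv_id,
            "title": " ".join(entry.findtext(ATOM + "title", "").split()),
            "summary": entry.findtext(ATOM + "summary", "").strip(),
            "published": published,
        })
    return papers
```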

Step Eight: LLM Initial Review and Filtering (LLM Node)

Node Name: LLM Initial Review
Node Type: llm

Uses LLM to conduct an initial review of papers and filter out the most valuable ones.

System Prompt:

You are a senior academic researcher, skilled at quickly evaluating paper value.

Task: Filter out the Top 3 most valuable papers from the given list.

Evaluation Criteria:
1. Innovativeness: Is the method novel?
2. Practicality: Does it have practical application value?
3. Impact: Author's institution, publication status
4. Technical Depth: Does it have technical breakthroughs?

Output Requirements:
- Pure JSON array format
- Retain all original fields
- Output Top 3 papers

Step Nine: JSON Parsing (Code Node)

Node Name: JSON Parsing
Node Type: code

Parses the JSON string output by the LLM, handling various possible formats.

Core Logic:

  • Handle nested JSON
  • Support papers or top_papers fields
  • Fault tolerance
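A minimal sketch of this logic (the fallback behavior and exact key names are assumptions; see the repository for the actual node code):

```python
import json
import re

def parse_llm_json(raw: str) -> list:
    # Strip markdown code fences that the LLM may wrap around its output
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # fault tolerance: fall back to an empty list
    # Accept a bare array, or an object wrapping it as "papers"/"top_papers"
    if isinstance(data, dict):
        data = data.get("papers") or data.get("top_papers") or []
    return data if isinstance(data, list) else []
```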

Step Ten: Iteration Node

Node Name: Iteration
Node Type: iteration

Iterate through each filtered paper.

Configuration:

  • Input: top_papers (array of papers)
  • Output: merged_paper (processed paper object)
  • Parallel Mode: Off (sequential execution)
  • Error Handling: Terminate on error

Detailed Internal Iteration Process

| No. | Node Name | Type | Function |
| --- | --- | --- | --- |
| 1 | DataUnpack | code | Unpacks iteration items into independent variables |
| 2 | Create a Row | tool | Records arxiv_id to Supabase to prevent duplication |
| 3 | Large Model Document Parsing | tool | PaddleOCR parses the PDF to extract body text |
| 4 | get_footnote_text | code | Extracts footnote information (for institution identification) |
| 5 | truncated_text | code | Trims OCR text (controls LLM input length) |
| 6 | (LLM) Analysis | llm | Deeply analyzes the paper to extract key information |
| 7 | Data Assembly | code | Assembles the final paper object |

10.1 DataUnpack (Code Node)

Deconstruct iterative items into independent variables.

Output:

  • title_str: Paper title
  • pdf_url: PDF link
  • summary_str: Abstract
  • published: Publication date
  • authors: Authors
  • arxiv_id: ArXiv ID
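A plausible sketch of this code node, assuming the iteration item carries the field names produced by the search post-processing step:

```python
def main(item: dict) -> dict:
    # Unpack one iteration item into independent, directly referenceable variables
    return {
        "title_str": item.get("title", ""),
        "pdf_url": item.get("pdf_url", ""),
        "summary_str": item.get("summary", ""),
        "published": item.get("published", ""),
        "authors": item.get("authors", ""),
        "arxiv_id": item.get("arxiv_id", ""),
    }
```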

10.2 Create a Row (Supabase Node)

Record the paper’s ArXiv ID into the database to prevent duplicate pushes.

Configuration:

  • Table name: Obtained from the configuration node
  • Data: {"arxiv_id": "{{arxiv_id}}"}

10.3 Large Model Document Parsing (PaddleOCR Node)

Node Name: Large Model Document Parsing
Node Type: tool (PaddleOCR)

Use PaddleOCR to parse the paper PDF and extract the main body content.

Configuration:

  • file: PDF URL
  • fileType: 0 (PDF file)
  • useLayoutDetection: true (Enable layout detection)
  • prettifyMarkdown: true (Beautify output)

10.4 get_footnote_text (Code Node)

Extract footnote information from the OCR text for subsequent institution recognition.

10.5 truncated_text (Code Node)

Crop OCR text to control LLM input length and avoid exceeding token limits.
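A minimal sketch of such a truncation node; the character budget and the head-only strategy are assumptions (the opening sections of a paper usually carry most of the method description the analysis LLM needs):

```python
def truncate_text(text: str, max_chars: int = 30000) -> str:
    # Keep only the head of the OCR text to stay under the LLM token limit
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n...[truncated]"
```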

10.6 LLM Deep Analysis

Node Name: (LLM) Analysis
Node Type: llm

Perform deep analysis on the paper to extract key information.

Extracted Fields:

  1. One_Liner: One-liner problem and solution
  2. Architecture: Model architecture and key innovations
  3. Dataset: Data source and scale
  4. Metrics: Core performance metrics
  5. Chinese_Abstract: Chinese abstract translation
  6. Affiliation: Author affiliation
  7. Code_Url: Code link

Core Principles:

  • No fluff: Directly state specific methods
  • Dig deep into details: Summarize algorithm logic, loss function design
  • Data first: Show the improvement margin compared to SOTA
  • No N/A: Reasonably infer

Output Format: Pure JSON object

10.7 Data Assembly (Code Node)

Node Name: Data Assembly
Node Type: code

Assemble all information into a structured paper object.

Core Functions:

  1. Parse publication status (identify top-tier conference papers)
  2. Parse LLM output JSON
  3. Extract code links
  4. Assemble the final paper object

Output Fields:

  • title: Title
  • authors: Authors
  • affiliation: Affiliation
  • pdf_url: PDF link
  • summary: English abstract
  • published: Publication status
  • github_stats: Code status
  • code_url: Code link
  • ai_evaluation: AI analysis results
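The assembly step might look like the sketch below. The field names (One_Liner, Code_Url, etc.) follow the analysis prompt above, but the venue list, the fence-stripping regex, and the github_stats wording are illustrative assumptions; the repository's node code is authoritative.

```python
import json
import re

def assemble(title_str: str, authors: str, summary_str: str,
             pdf_url: str, published: str, llm_output: str) -> dict:
    # Parse the analysis LLM's JSON, tolerating markdown code fences
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", llm_output.strip())
    try:
        ai = json.loads(text)
    except json.JSONDecodeError:
        ai = {}

    # Flag likely top-venue acceptances mentioned in the publication status
    venues = ("CVPR", "ICCV", "NeurIPS", "ICML", "ICLR", "ACL", "AAAI")
    status = published + (" (top-tier venue)" if any(v in published for v in venues) else "")

    code_url = ai.get("Code_Url", "N/A")
    return {
        "title": title_str,
        "authors": authors,
        "affiliation": ai.get("Affiliation", "Unknown"),
        "pdf_url": pdf_url,
        "summary": summary_str,
        "published": status,
        "github_stats": "Code available" if code_url not in ("", "N/A") else "No code released",
        "code_url": code_url,
        "ai_evaluation": ai,
    }
```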

Step Eleven: Template Transformation

Node Name: Template Transformation
Node Type: template-transform

Use a Jinja2 template to convert paper data into formatted email content.

Template Structure:

📅 PaperEssence Research Daily
Based on your specified research content "{{ raw_topic }}", 3 papers selected from arXiv updates in the last 30 days are pushed daily.
--------------------------------------------------
<small><i>⚠️ Note: Content is AI-generated and for academic reference only. Before citing or conducting in-depth research, please click the PDF link to verify with the original paper.</i></small>
Generation Date: {{ items.target_date | default('Today') }}
==================================================

{% set final_list = items.paper | default(items) %}

{% for item in final_list %}
📄 [{{ loop.index }}] {{ item.title }}
--------------------------------------------------
👤 Authors: {{ item.authors }}
🏢 Affiliation: {{ item.affiliation }}
🔗 PDF: {{ item.pdf_url }}
📅 Status: {{ item.published }}
{% if item.code_url and item.code_url != 'N/A' %}
📦 Code: {{ item.github_stats }}
   🔗 {{ item.code_url }}
{% else %}
📦 Code: {{ item.github_stats }}
{% endif %}

English Abstract:
{{ item.summary | replace('\n', ' ') }}

Chinese Abstract:
{{ item.ai_evaluation.Chinese_Abstract }}

🚀 Core Innovation:
{{ item.ai_evaluation.One_Liner }}

📊 Summary:
--------------------------------------------------
🏗️ Architecture:
{{ item.ai_evaluation.Architecture | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

💾 Data:
{{ item.ai_evaluation.Dataset | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

📈 Metrics:
{{ item.ai_evaluation.Metrics | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

==================================================
{% else %}
⚠️ No new paper updates today.
{% endfor %}

Step Twelve: Email Sending

Node Name: 163SMTP Email Sending
Node Type: tool (163-smtp-send-mail)

Configuration:

  • username_send: Sender email (read from environment variables)
  • authorization_code: Email authorization code (read from environment variables)
  • username_recv: Recipient email
  • subject: PaperEssence-{{cutoff_str}}-{{today_str}}
  • content: Content after template transformation

Step Thirteen: Output Node

Node Name: Output
Node Type: end

Output the final result for easy debugging and verification.


:outbox_tray: Publish Workflow and Get API

After the workflow is debugged and passed, click the publish button in the upper right corner.

Record the following information:

  • API Endpoint: https://api.dify.ai/v1/workflows/run (for self-hosted deployments, replace the host with your own domain)
  • API Key: A string like app-xxxxxxxxxxxx

:alarm_clock: Configure Daily Automatic Run

Since the scheduled trigger on the Dify cloud platform may be limited on the free plan, you can use Windows Task Scheduler with a script to trigger the workflow daily.

Prerequisite: Install Git for Windows

This solution uses Git Bash to execute curl commands, so Git for Windows needs to be installed first.

Notes during installation:

  • It is recommended to choose the default installation path (e.g., C:\Program Files\Git) or a custom path (e.g., D:\ProgramFiles\Git)
  • Ensure “Git Bash Here” is checked in the installation options

Configure Windows Task Scheduler

  1. Press Win + R → Type taskschd.msc, press Enter, open Task Scheduler
  2. Click “Create Task” on the right

General Tab:

  • Name: Paper-Essence Daily Run
  • Check “Run with highest privileges”

Triggers Tab:

  1. Click “New”
  2. In the “Begin the task” dropdown, select “On a schedule”
  3. Select “Daily”, set the trigger time (recommended to be consistent with the timer time in the Dify workflow, e.g., 20:55)
  4. Click “OK”

Actions Tab:

  1. Click “New”

  2. For Action, select “Start a program”

  3. Program/script: Enter your Git Bash path, for example: D:\ProgramFiles\Git\bin\bash.exe

  4. Add arguments:

    curl -N -X POST "<your API endpoint>" -H "Authorization: Bearer app-<your API key>" -H "Content-Type: application/json" -d '{ "inputs": {}, "response_mode": "streaming", "user": "cron-job" }'
    
  5. Click “OK” to save the task


:test_tube: Testing and Debugging

Manual Testing

  1. Click the “Run” button in the upper right corner of the workflow editor
  2. Observe the execution status of each node
  3. Check if the output of each node meets expectations

Successful Run Effect

After the workflow executes successfully, you will receive a formatted paper daily report email in your inbox, including:

  • Paper title, authors, affiliation
  • PDF link and code link
  • Chinese and English abstracts
  • AI-analyzed core innovations, architecture, dataset, and performance metrics

:memo: Summary

This tutorial covers YAML node configuration and scheduling, environment variables, and Supabase usage, building an end-to-end pipeline from ArXiv scraping → PaddleOCR parsing → LLM analysis → Jinja2 templating → SMTP delivery, with deduplication and error handling along the way.

Acknowledgements

Thanks to Teacher Zhang Jing, Teacher Guan Mu, and Teacher Yang Youzhi for their guidance.
