Paper-Essence Workflow Setup Tutorial for Paper Digest Push
Project Introduction
Paper-Essence is an automated paper push workflow built on the Dify platform. This workflow can:
Regularly fetch the latest papers from specified research areas on ArXiv daily
Intelligently filter out the most valuable papers using large models
Parse paper PDFs via OCR to extract key technical details
Generate structured paper daily reports and push them automatically via email
GitHub Repository: https://github.com/LiaoYFBH/PaperFlow. You can import prj\Paper-Essence-CN.yml or prj\Paper-Essence-EN.yml into Dify directly.
Prerequisites
1. Platform and Account Preparation
- Dify Platform Account: Ensure you have registered and logged into the Dify platform (Docker deployment is also an option)
- Email Account: An email account that supports SMTP is required (this tutorial uses 163 Mail)
- Large Model API: Requires configuring the API from Wenxin Feipaddle Xinghe Community (文心飞桨星河社区)
2. Install Necessary Plugins
Install the following plugins from the Dify Plugin Marketplace:
| Plugin Name | Purpose |
|---|---|
| PaddleOCR | PDF/Image OCR Parsing |
| 163SMTP Email Sending | 163 Mail SMTP Sending |
| Supabase | Database Storage (records pushed papers) |
| Wenxin Feipaddle Xinghe Community (文心飞桨星河社区) | Xinghe Community API call for Baidu Wenxin Large Model |
3. Prepare Supabase Database
To filter out papers that have already been pushed to users, we use the cloud database Supabase here.
Step 1: Log in and Create a Project
Visit supabase.com, click login in the top right corner, then click Start your project
Step 2: Create a Data Table
Execute the following SQL statement in the SQL Editor:
This creates a table named pushed_papers with two core fields; the primary-key constraint on arxiv_id guarantees that each paper is recorded only once, keeping push records unique and valid.
create table pushed_papers (
  arxiv_id text not null,
  pushed_at timestamp default now(),
  primary key (arxiv_id)
);
Step 3: Obtain API Key
Record the following information in Project Settings → API:
- NEXT_PUBLIC_SUPABASE_URL → the Supabase URL for the Dify Supabase plugin
- NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY → the Supabase Key for the Dify Supabase plugin
Step 4: Configure the Supabase Plugin in Dify
Configure the Supabase plugin in Dify Plugin Management, filling in the URL and Key mentioned above.
Overall Workflow Architecture
Process Description
| Stage | Node | Function Description |
|---|---|---|
| Trigger | Scheduled Trigger | Automatically starts at a specified time daily |
| Configuration | Configuration Node | Reads all environment variables and outputs them for subsequent use |
| Translation | LLM Translation | Translates research topics into English |
| Search | Get Rows → Preprocessing → HTTP Request → Post-processing | Queries pushed records, searches for new ArXiv papers |
| Initial Review | LLM Initial Review | Uses LLM to filter Top 3 papers |
| Iteration | Iteration Node | Performs for each paper: Unpack → Record → OCR → Analyze → Assemble |
| Output | Template Conversion → Email Sending | Generates formatted report and pushes via email |
Full view of this workflow:
Detailed Setup Steps
Step One: Create Workflow
- Log in to the Dify platform
- Click “Studio” → “Create App” → Select “Workflow” type
- Enter application name
- Select trigger type
Step Two: Configure Environment Variables
Click the settings button in the top right corner of the UI, and add environment variables:
| Variable Name | Type | Description | Example Value |
|---|---|---|---|
| table_name | string | Supabase Table Name | pushed_papers |
| SMTP_PORT | string | Email SMTP Port | 465 |
| SMTP_SERVER | string | SMTP Server | smtp.163.com |
| SMTP_PASSWORD | secret | Email Authorization Code | (Your authorization code) |
| SMTP_USER | secret | Email Account | your_email@163.com |
| MY_RAW_TOPIC | string | Research Topic | agent memory |
Get email authorization code: Log in to 163 Mail → Settings → POP3/SMTP/IMAP → Enable service and get authorization code
Step Three: Scheduled Trigger Node
Node Name: Scheduled Trigger
Configuration Items:
- Trigger Frequency: Daily
- Trigger Time: 8:59 AM (or adjust as needed)
Step Four: Configuration Node (Code Node)
Node Name: Configuration
Node Type: code
This node is responsible for reading all environment variables and outputting them for subsequent nodes.
Input Variables:
- Read from environment variables: SMTP_USER, MY_RAW_TOPIC, SMTP_PORT, SMTP_SERVER, SMTP_PASSWORD, table_name
Output Variables:
- raw_topic: Research topic
- user_email: Recipient email
- fetch_count: Number of papers to fetch (default 50)
- push_limit: Push quantity limit (default 3)
- days_lookback: Days to look back (default 30)
- SMTP-related configurations
Code:
def main(
    SMTP_USER: str,
    MY_RAW_TOPIC: str,
    SMTP_PORT: str,
    SMTP_SERVER: str,
    SMTP_PASSWORD: str,
    table_name: str
) -> dict:
    # Pass the environment variables through unchanged, together with the
    # workflow defaults, so downstream nodes read from a single place.
    return {
        "raw_topic": MY_RAW_TOPIC,
        "user_email": SMTP_USER,
        "smtp_port": SMTP_PORT,
        "smtp_server": SMTP_SERVER,
        "smtp_password": SMTP_PASSWORD,
        "fetch_count": 50,      # papers to fetch from ArXiv
        "push_limit": 3,        # papers pushed per day
        "days_lookback": 30,    # look-back window in days
        "table_name": table_name
    }
Step Five: Research Field Translation (LLM Node)
Node Name: Research Field LLM Translation
Node Type: llm
Translates Chinese research topics into English boolean query strings recognizable by the ArXiv API.
Model Configuration:
- Model: ernie-4.5-turbo-128k or ernie-5.0-thinking-preview
- Temperature: 0.7
System Prompt:
You are a professional academic translation assistant. Please translate the user's Chinese research topic into English and convert it into a boolean query string format suitable for the ArXiv API.
Requirements:
1. Identify core keywords and translate them into English
2. Use AND/OR to connect multiple keywords
3. Output plain text, without additional explanations
4. Example: Input "多模态大模型" → Output "multimodal AND large language model"
Step Six: Query Pushed Records (Supabase Node)
Node Name: Get Rows
Node Type: tool (Supabase)
Query pushed paper records from the Supabase database to avoid duplicate pushes.
Configuration:
- Table Name: {{table_name}} (obtained from the configuration node)
Step Seven: Search Papers (Split into 3 Nodes)
To improve the stability and maintainability of the workflow, the search function is split into three consecutive nodes: “Preprocessing” → “HTTP Request” → “Post-processing”.
7.1 Search Paper Node Preprocessing (Code Node)
Node Name: Search Paper Node Preprocessing
Node Type: code
This node is responsible for preparing search parameters, calculating the date range, and constructing the ArXiv API query string.
Input Variables:
- topic: Translated English search term
- days_lookback: Days to look back
- count: Number of papers to fetch
- supabase_output: Pushed records (for deduplication)
Code Logic:
- Calculate look-back date (cutoff_date)
- Parse the list of pushed paper IDs returned by Supabase
- Construct a boolean query string based on the topic (supports AND/OR logic)
- Add ArXiv category restrictions based on topic keywords (e.g., cs.CV, cs.CL, etc.)
- Extract search keywords for subsequent filtering
Output Variables:
- base_query: Constructed query string
- pushed_ids: List of pushed IDs
- cutoff_str: Cutoff date string
- search_keywords: List of search keywords
- fetch_limit: API fetch quantity limit
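The preprocessing logic above can be sketched as follows. This is a minimal sketch, not the node's exact code: the shape of `supabase_output` (a JSON list of rows) and the simplification of joining all terms with AND are assumptions based on the description.

```python
import json
import re
from datetime import datetime, timedelta, timezone

def preprocess(topic: str, days_lookback: int, supabase_output: str) -> dict:
    # Cutoff date for the look-back window, formatted for later comparison.
    cutoff = datetime.now(timezone.utc) - timedelta(days=days_lookback)
    cutoff_str = cutoff.strftime("%Y-%m-%d")

    # Parse the pushed rows returned by Supabase (assumed: a JSON list of rows).
    try:
        rows = json.loads(supabase_output) if supabase_output else []
    except json.JSONDecodeError:
        rows = []
    pushed_ids = [r["arxiv_id"] for r in rows if r.get("arxiv_id")]

    # Split the translated topic on AND/OR and join all terms with AND
    # as abstract-field searches (a simplification of the node's full logic).
    terms = [t.strip() for t in re.split(r"\s+(?:AND|OR)\s+", topic) if t.strip()]
    base_query = " AND ".join(f'abs:"{t}"' for t in terms)

    return {
        "base_query": base_query,
        "pushed_ids": pushed_ids,
        "cutoff_str": cutoff_str,
        "search_keywords": [t.lower() for t in terms],
    }
```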
7.2 HTTP Request (HTTP Node)
Node Name: HTTP Request
Node Type: http-request
Directly calls the ArXiv API to get raw XML data.
Configuration:
- API URL: export.arxiv.org/api/query
- Method: GET
- Query Parameters:
  - search_query: {{base_query}}
  - start: 0
  - max_results: {{fetch_limit}}
  - sortBy: submittedDate
  - sortOrder: descending
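For reference, the full URL the HTTP node issues can be reproduced with the standard library; the parameter names are those of the ArXiv API query interface:

```python
from urllib.parse import urlencode

def build_arxiv_url(base_query: str, fetch_limit: int) -> str:
    # Assemble the ArXiv API request with the same query
    # parameters configured on the HTTP node.
    params = {
        "search_query": base_query,
        "start": 0,
        "max_results": fetch_limit,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return "https://export.arxiv.org/api/query?" + urlencode(params)
```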
7.3 Search Paper Node Post-processing (Code Node)
Node Name: Search Paper Node Post-processing
Node Type: code
Parses the XML data returned by the API and performs fine-grained filtering.
Input Variables:
- http_response_body: HTTP node response body
- all output variables from the preprocessing node
Code Logic:
- Parse XML response
- Deduplication filter: remove papers found in pushed_ids
- Date filter: remove papers older than cutoff_date
- Keyword filter: ensure the title or abstract contains at least one search keyword
- Format output as a list of JSON objects
Output Variables:
- result: Final filtered paper list (JSON string)
- count: Final paper count
- debug: Debugging information (includes filtering statistics)
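A minimal sketch of this post-processing, using the Atom namespace that the ArXiv API returns. The output field names follow the list above; the exact node code may differ.

```python
import json
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def postprocess(xml_body, pushed_ids, cutoff_str, search_keywords):
    papers = []
    root = ET.fromstring(xml_body)
    for entry in root.iter(ATOM + "entry"):
        arxiv_id = entry.findtext(ATOM + "id", "").rsplit("/", 1)[-1]
        title = " ".join(entry.findtext(ATOM + "title", "").split())
        summary = " ".join(entry.findtext(ATOM + "summary", "").split())
        published = entry.findtext(ATOM + "published", "")[:10]
        # Deduplication filter: skip papers already pushed.
        if arxiv_id in pushed_ids:
            continue
        # Date filter: skip papers older than the cutoff date.
        if published < cutoff_str:
            continue
        # Keyword filter: title or abstract must contain a search keyword.
        text = (title + " " + summary).lower()
        if search_keywords and not any(k in text for k in search_keywords):
            continue
        papers.append({"arxiv_id": arxiv_id, "title": title,
                       "summary": summary, "published": published})
    return {"result": json.dumps(papers), "count": len(papers)}
```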
Step Eight: LLM Initial Review and Filtering (LLM Node)
Node Name: LLM Initial Review
Node Type: llm
Uses LLM to conduct an initial review of papers and filter out the most valuable ones.
System Prompt:
You are a senior academic researcher, skilled at quickly evaluating paper value.
Task: Filter out the Top 3 most valuable papers from the given list.
Evaluation Criteria:
1. Innovativeness: Is the method novel?
2. Practicality: Does it have practical application value?
3. Impact: Author's institution, publication status
4. Technical Depth: Does it have technical breakthroughs?
Output Requirements:
- Pure JSON array format
- Retain all original fields
- Output Top 3 papers
Step Nine: JSON Parsing (Code Node)
Node Name: JSON Parsing
Node Type: code
Parses the JSON string output by the LLM, handling various possible formats.
Core Logic:
- Handle nested JSON
- Support papers or top_papers fields
- Fault tolerance
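One way to implement this tolerance is sketched below. The fence stripping and the papers/top_papers fallback mirror the logic described above, but this is an illustrative version, not the workflow's exact code:

```python
import json
import re

def parse_llm_json(text: str) -> list:
    # Strip markdown code fences the LLM may wrap around its JSON.
    cleaned = re.sub(r"```(?:json)?", "", text).strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fault tolerance: fall back to the first [...] span in the text.
        match = re.search(r"\[.*\]", cleaned, re.DOTALL)
        if not match:
            return []
        try:
            data = json.loads(match.group(0))
        except json.JSONDecodeError:
            return []
    # Handle nested JSON: accept a bare array or an object wrapper.
    if isinstance(data, dict):
        data = data.get("papers") or data.get("top_papers") or []
    return data
```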
Step Ten: Iteration Node
Node Name: Iteration
Node Type: iteration
Iterate through each filtered paper.
Configuration:
- Input: top_papers (array of papers)
- Output: merged_paper (processed paper object)
- Parallel Mode: Off (sequential execution)
- Error Handling: Terminate on error
Detailed Internal Iteration Process
| No. | Node Name | Type | Function |
|---|---|---|---|
| 1 | DataUnpack | code | Unpacks iteration items into independent variables |
| 2 | Create a Row | tool | Records arxiv_id to Supabase to prevent duplication |
| 3 | Large Model Document Parsing | tool | PaddleOCR parses PDF to extract body text |
| 4 | get_footnote_text | code | Extracts footnote information (for institution identification) |
| 5 | truncated_text | code | Trims OCR text (controls LLM input length) |
| 6 | (LLM) Analysis | llm | Deeply analyzes papers to extract key information |
| 7 | Data Assembly | code | Assembles the final paper object |
10.1 DataUnpack (Code Node)
Deconstructs each iteration item into independent variables.
Output:
- title_str: Paper title
- pdf_url: PDF link
- summary_str: Abstract
- published: Publication date
- authors: Authors
- arxiv_id: ArXiv ID
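The unpacking step can be sketched as a small code node like the following. The input key names (`title`, `pdf_url`, etc.) are assumptions matching the paper objects built in Step Seven:

```python
def main(item: dict) -> dict:
    # Flatten the current iteration item into independent variables
    # so downstream nodes can reference each field directly.
    return {
        "title_str": item.get("title", ""),
        "pdf_url": item.get("pdf_url", ""),
        "summary_str": item.get("summary", ""),
        "published": item.get("published", ""),
        "authors": item.get("authors", ""),
        "arxiv_id": item.get("arxiv_id", ""),
    }
```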
10.2 Create a Row (Supabase Node)
Record the paper’s ArXiv ID into the database to prevent duplicate pushes.
Configuration:
- Table name: Obtained from the configuration node
- Data: {"arxiv_id": "{{arxiv_id}}"}
10.3 Large Model Document Parsing (PaddleOCR Node)
Node Name: Large Model Document Parsing
Node Type: tool (PaddleOCR)
Use PaddleOCR to parse the paper PDF and extract the main body content.
Configuration:
- file: PDF URL
- fileType: 0 (PDF file)
- useLayoutDetection: true (enable layout detection)
- prettifyMarkdown: true (beautify output)
10.4 get_footnote_text (Code Node)
Extract footnote information from the OCR text for subsequent institution recognition.
10.5 truncated_text (Code Node)
Crop OCR text to control LLM input length and avoid exceeding token limits.
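The trimming can be as simple as a character-based cut. A sketch follows; the 30,000-character default is an assumption to be tuned to your model's context window:

```python
def truncate_text(text: str, max_chars: int = 30000) -> str:
    # Keep the head of the document, where the abstract, method and
    # architecture sections usually appear, and cut cleanly at a newline.
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    newline = cut.rfind("\n")
    return cut[:newline] if newline > 0 else cut
```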
10.6 LLM Deep Analysis
Node Name: (LLM) Analysis
Node Type: llm
Perform deep analysis on the paper to extract key information.
Extracted Fields:
- One_Liner: One-liner problem and solution
- Architecture: Model architecture and key innovations
- Dataset: Data source and scale
- Metrics: Core performance metrics
- Chinese_Abstract: Chinese abstract translation
- Affiliation: Author affiliation
- Code_Url: Code link
Core Principles:
- No fluff: Directly state specific methods
- Dig deep into details: Summarize algorithm logic, loss function design
- Data first: Show the improvement margin compared to SOTA
- No N/A: Reasonably infer
Output Format: Pure JSON object
10.7 Data Assembly (Code Node)
Node Name: Data Assembly
Node Type: code
Assemble all information into a structured paper object.
Core Functions:
- Parse publication status (identify top-tier conference papers)
- Parse LLM output JSON
- Extract code links
- Assemble the final paper object
Output Fields:
- title: Title
- authors: Authors
- affiliation: Affiliation
- pdf_url: PDF link
- summary: English abstract
- published: Publication status
- github_stats: Code status
- code_url: Code link
- ai_evaluation: AI analysis results
Step Eleven: Template Transformation
Node Name: Template Transformation
Node Type: template-transform
Use a Jinja2 template to convert paper data into formatted email content.
Template Structure:
📅 PaperEssence Research Daily
Based on your specified research content "{{ raw_topic }}", 3 papers selected from arXiv updates in the last 30 days are pushed daily.
--------------------------------------------------
<small><i>⚠️ Note: Content is AI-generated and for academic reference only. Before citing or conducting in-depth research, please click the PDF link to verify with the original paper.</i></small>
Generation Date: {{ items.target_date | default('Today') }}
==================================================
{% set final_list = items.paper | default(items) %}
{% for item in final_list %}
📄 [{{ loop.index }}] {{ item.title }}
--------------------------------------------------
👤 Authors: {{ item.authors }}
🏢 Affiliation: {{ item.affiliation }}
🔗 PDF: {{ item.pdf_url }}
📅 Status: {{ item.published }}
{% if item.code_url and item.code_url != 'N/A' %}
📦 Code: {{ item.github_stats }}
🔗 {{ item.code_url }}
{% else %}
📦 Code: {{ item.github_stats }}
{% endif %}
English Abstract:
{{ item.summary | replace('\n', ' ') }}
Chinese Abstract:
{{ item.ai_evaluation.Chinese_Abstract }}
🚀 Core Innovation:
{{ item.ai_evaluation.One_Liner }}
📊 Summary:
--------------------------------------------------
🏗️ Architecture:
{{ item.ai_evaluation.Architecture | replace('\n- ', '\n\n 🔹 ') | replace('- ', ' 🔹 ') }}
💾 Data:
{{ item.ai_evaluation.Dataset | replace('\n- ', '\n\n 🔹 ') | replace('- ', ' 🔹 ') }}
📈 Metrics:
{{ item.ai_evaluation.Metrics | replace('\n- ', '\n\n 🔹 ') | replace('- ', ' 🔹 ') }}
==================================================
{% else %}
⚠️ No new paper updates today.
{% endfor %}
Step Twelve: Email Sending
Node Name: 163SMTP Email Sending
Node Type: tool (163-smtp-send-mail)
Configuration:
- username_send: Sender email (read from environment variables)
- authorization_code: Email authorization code (read from environment variables)
- username_recv: Recipient email
- subject: PaperEssence-{{cutoff_str}}-{{today_str}}
- content: Content after template transformation
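Outside Dify, the same delivery step can be reproduced with the standard library. The sketch below builds the message and shows an SSL send matching the 163 Mail settings above (port 465, authorization code as the SMTP password); it is an illustration, not the plugin's implementation.

```python
import smtplib
from email.message import EmailMessage

def build_report_email(sender, recipient, subject, content):
    # Assemble a plain-text report email matching the node's fields.
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(content)
    return msg

def send_report(msg, smtp_server, smtp_port, password):
    # Port 465 uses implicit SSL; 163 Mail expects the authorization
    # code (not the login password) as the SMTP password.
    with smtplib.SMTP_SSL(smtp_server, int(smtp_port)) as server:
        server.login(msg["From"], password)
        server.send_message(msg)
```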
Step Thirteen: Output Node
Node Name: Output
Node Type: end
Output the final result for easy debugging and verification.
Publish Workflow and Get API
After the workflow is debugged and passed, click the publish button in the upper right corner.
Record the following information:
- API Endpoint: api.dify.ai/v1/workflows/run
- API Key: A string like app-xxxxxxxxxxxx
Configure Daily Automatic Run
Since the timed trigger on the Dify cloud platform may have limitations in the free version, you can use Windows Task Scheduler with a script to achieve daily timed workflow triggering.
Prerequisite: Install Git for Windows
This solution uses Git Bash to execute curl commands, so Git for Windows needs to be installed first.
Notes during installation:
- It is recommended to choose the default installation path (e.g., C:\Program Files\Git) or a custom path (e.g., D:\ProgramFiles\Git)
- Ensure “Git Bash Here” is checked in the installation options
Configure Windows Task Scheduler
- Press Win + R, type taskschd.msc, and press Enter to open Task Scheduler
- Click “Create Task” on the right
General Tab:
- Name: Paper-Essence Daily Run
- Check “Run with highest privileges”
Triggers Tab:
- Click “New”
- In the “Begin the task” dropdown, select “On a schedule”
- Select “Daily” and set the trigger time (recommended to match the timer in the Dify workflow, e.g., 20:55)
- Click “OK”
Actions Tab:
- Click “New”
- For Action, select “Start a program”
- Program/script: enter your Git Bash path, for example: D:\ProgramFiles\Git\bin\bash.exe
- Add arguments (bash needs the -c flag to run a command string):
  -c "curl -N -X POST 'YOUR_API_ENDPOINT' -H 'Authorization: Bearer app-YOUR_API_KEY' -H 'Content-Type: application/json' -d '{ \"inputs\": {}, \"response_mode\": \"streaming\", \"user\": \"cron-job\" }'"
- Click “OK” to save the task
Click “OK” to save the task
Testing and Debugging
Manual Testing
- Click the “Run” button in the upper right corner of the workflow editor
- Observe the execution status of each node
- Check if the output of each node meets expectations
Successful Run Effect
After the workflow executes successfully, you will receive a formatted paper daily report email in your inbox, including:
- Paper title, authors, affiliation
- PDF link and code link
- Chinese and English abstracts
- AI-analyzed core innovations, architecture, dataset, and performance metrics
Summary
This tutorial covers YAML node configuration and scheduling, environment variables and Supabase usage, and builds an end-to-end pipeline from ArXiv scraping → PaddleOCR parsing → LLM analysis → Jinja2 templating → SMTP delivery, while also enhancing deduplication and error handling.
Acknowledgements
Thanks to Teacher Zhang Jing, Teacher Guan Mu, and Teacher Yang Youzhi for their guidance.