Paper-Essence Workflow Setup Tutorial for Paper Digest Push

:open_book: Project Introduction

Paper-Essence is an automated paper push workflow built on the Dify platform. This workflow can:

  • :one_o_clock: Regularly fetch the latest papers from specified research areas on ArXiv daily
  • :robot: Intelligently filter out the most valuable papers using large models
  • :page_facing_up: Parse paper PDFs via OCR to extract key technical details
  • :e_mail: Generate structured paper daily reports and push them automatically via email

GitHub Repository: https://github.com/LiaoYFBH/PaperFlow. You can import prj\Paper-Essence-CN.yml or prj\Paper-Essence-EN.yml directly into Dify.


:hammer_and_wrench: Prerequisites

1. Platform and Account Preparation

  • Dify Platform Account: Ensure you have registered and logged into the Dify platform (Docker deployment is also an option)
  • Email Account: An email account that supports SMTP is required (this tutorial uses 163 Mail)
  • Large Model API: Requires configuring the API from Wenxin Feipaddle Xinghe Community (文心飞桨星河社区)

2. Install Necessary Plugins

Install the following plugins from the Dify Plugin Marketplace:

| Plugin Name | Purpose |
| --- | --- |
| PaddleOCR | PDF/image OCR parsing |
| 163SMTP Email Sending | SMTP email sending via 163 Mail |
| Supabase | Database storage (records pushed papers) |
| Wenxin Feipaddle Xinghe Community (文心飞桨星河社区) | Xinghe Community API access to Baidu's Wenxin large models |

3. Prepare Supabase Database

To avoid re-pushing papers that users have already received, this workflow uses the hosted database Supabase.

Step 1: Log in and Create a Project

Visit supabase.com, click login in the top right corner, then click Start your project

Step 2: Create a Data Table

Execute the following SQL statement in the SQL Editor:

This creates a table named pushed_papers that records push history. It has two core fields, arxiv_id and pushed_at, with a primary-key constraint on arxiv_id so each paper can be recorded only once, ensuring the uniqueness and validity of push records.

create table pushed_papers (
  arxiv_id text not null,
  pushed_at timestamp default now(),
  primary key (arxiv_id)
);

Step 3: Obtain API Key

Record the following information in Project Settings → API:

  • NEXT_PUBLIC_SUPABASE_URL → Corresponds to the Supabase URL for the Dify Supabase plugin
  • NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY → Corresponds to the Supabase Key for the Dify Supabase plugin

Step 4: Configure the Supabase Plugin in Dify

Configure the Supabase plugin in Dify Plugin Management, filling in the URL and Key mentioned above.


:bar_chart: Overall Workflow Architecture

Process Description

| Stage | Node | Function |
| --- | --- | --- |
| Trigger | Scheduled Trigger | Starts automatically at a specified time daily |
| Configuration | Configuration Node | Reads all environment variables and outputs them for subsequent nodes |
| Translation | LLM Translation | Translates the research topic into English |
| Search | Get Rows → Preprocessing → HTTP Request → Post-processing | Queries pushed records, then searches ArXiv for new papers |
| Initial Review | LLM Initial Review | Uses an LLM to filter the Top 3 papers |
| Iteration | Iteration Node | For each paper: Unpack → Record → OCR → Analyze → Assemble |
| Output | Template Conversion → Email Sending | Generates the formatted report and pushes it via email |

Full view of this workflow:


:wrench: Detailed Setup Steps

Step One: Create Workflow

  1. Log in to the Dify platform
  2. Click “Studio” → “Create App” → Select “Workflow” type
  3. Enter application name
  4. Select trigger type

Step Two: Configure Environment Variables

Click the settings button in the top right corner of the UI, and add environment variables:

| Variable Name | Type | Description | Example Value |
| --- | --- | --- | --- |
| table_name | string | Supabase table name | pushed_papers |
| SMTP_PORT | string | Email SMTP port | 465 |
| SMTP_SERVER | string | SMTP server | smtp.163.com |
| SMTP_PASSWORD | secret | Email authorization code | (your authorization code) |
| SMTP_USER | secret | Email account | your_email@163.com |
| MY_RAW_TOPIC | string | Research topic | agent memory |

Get email authorization code: Log in to 163 Mail → Settings → POP3/SMTP/IMAP → Enable service and get authorization code


Step Three: Scheduled Trigger Node

Node Name: Scheduled Trigger
Configuration Items:

  • Trigger Frequency: Daily
  • Trigger Time: 8:59 AM (or adjust as needed)

Step Four: Configuration Node (Code Node)

Node Name: Configuration
Node Type: code

This node is responsible for reading all environment variables and outputting them for subsequent nodes.

Input Variables:

  • Read from environment variables: SMTP_USER, MY_RAW_TOPIC, SMTP_PORT, SMTP_SERVER, SMTP_PASSWORD, table_name

Output Variables:

  • raw_topic: Research topic
  • user_email: Recipient email
  • fetch_count: Number of papers to search (default 50)
  • push_limit: Push quantity limit (default 3)
  • days_lookback: Days to look back (default 30)
  • and SMTP related configurations

Code:

def main(
    SMTP_USER: str,
    MY_RAW_TOPIC: str,
    SMTP_PORT: str,
    SMTP_SERVER: str,
    SMTP_PASSWORD: str,
    table_name: str
) -> dict:
    # Pass the environment variables through unchanged, and attach
    # fixed defaults for fetch size, push limit, and look-back window.
    return {
        "raw_topic": MY_RAW_TOPIC,
        "user_email": SMTP_USER,      # recipient defaults to the sender account
        "smtp_port": SMTP_PORT,
        "smtp_server": SMTP_SERVER,
        "smtp_password": SMTP_PASSWORD,
        "fetch_count": 50,
        "push_limit": 3,
        "days_lookback": 30,
        "table_name": table_name
    }

Step Five: Research Field Translation (LLM Node)

Node Name: Research Field LLM Translation
Node Type: llm

Translates Chinese research topics into English boolean query strings recognizable by the ArXiv API.

Model Configuration:

  • Model: ernie-4.5-turbo-128k or ernie-5.0-thinking-preview
  • Temperature: 0.7

System Prompt:

You are a professional academic translation assistant. Please translate the user's Chinese research topic into English and convert it into a boolean query string format suitable for the ArXiv API.

Requirements:
1. Identify core keywords and translate them into English
2. Use AND/OR to connect multiple keywords
3. Output plain text, without additional explanations
4. Example: Input "多模态大模型" → Output "multimodal AND large language model"

Step Six: Query Pushed Records (Supabase Node)

Node Name: Get Rows
Node Type: tool (Supabase)

Query pushed paper records from the Supabase database to avoid duplicate pushes.

Configuration:

  • Table Name: {{table_name}} (obtained from the configuration node)

Step Seven: Search Papers (Split into 3 Nodes)

To improve the stability and maintainability of the workflow, the search function is split into three consecutive nodes: “Preprocessing” → “HTTP Request” → “Post-processing”.

7.1 Search Paper Node Preprocessing (Code Node)

Node Name: Search Paper Node Preprocessing
Node Type: code

This node is responsible for preparing search parameters, calculating the date range, and constructing the ArXiv API query string.

Input Variables:

  • topic: Translated English search term
  • days_lookback: Days to look back
  • count: Number of searches
  • supabase_output: Pushed records (for deduplication)

Code Logic:

  1. Calculate look-back date (cutoff_date)
  2. Parse the list of pushed paper IDs returned by Supabase
  3. Construct a boolean query string based on the topic (supports AND/OR logic)
  4. Add ArXiv category restrictions based on topic keywords (e.g., cs.CV, cs.CL, etc.)
  5. Extract search keywords for subsequent filtering

Output Variables:

  • base_query: Constructed query string
  • pushed_ids: List of pushed IDs
  • cutoff_str: Cutoff date string
  • search_keywords: List of search keywords
  • fetch_limit: API fetch quantity limit
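The steps above can be sketched as a code node like the following. This is a minimal illustration, not the repository's actual node code: the argument names mirror the input variables listed above, the `ti:`/`abs:` field prefixes are standard ArXiv query syntax, and the category-restriction step (step 4) and the over-fetch factor of 2 are assumptions.

```python
import json
from datetime import datetime, timedelta, timezone

def main(topic: str, days_lookback: int, count: int, supabase_output: str) -> dict:
    # 1. Compute the look-back cutoff date
    cutoff = datetime.now(timezone.utc) - timedelta(days=days_lookback)
    cutoff_str = cutoff.strftime("%Y-%m-%d")

    # 2. Parse the already-pushed IDs (Supabase rows arrive as a JSON string)
    try:
        rows = json.loads(supabase_output) if supabase_output else []
    except json.JSONDecodeError:
        rows = []
    pushed_ids = [r.get("arxiv_id") for r in rows if isinstance(r, dict)]

    # 3. Build a boolean query over title and abstract for each keyword
    keywords = [k.strip() for k in topic.split(" AND ") if k.strip()]
    base_query = " AND ".join(f'(ti:"{k}" OR abs:"{k}")' for k in keywords)

    # 5. Over-fetch so downstream filtering still leaves enough papers
    return {
        "base_query": base_query,
        "pushed_ids": pushed_ids,
        "cutoff_str": cutoff_str,
        "search_keywords": keywords,
        "fetch_limit": count * 2,
    }
```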

7.2 HTTP Request (HTTP Node)

Node Name: HTTP Request
Node Type: http-request

Directly calls the ArXiv API to get raw XML data.

Configuration:

  • API URL: https://export.arxiv.org/api/query
  • Method: GET
  • Query Parameters:
    • search_query: {{base_query}}
    • start: 0
    • max_results: {{fetch_limit}}
    • sortBy: submittedDate
    • sortOrder: descending
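For clarity, these query parameters assemble into a single GET URL as shown below (the demo query string is an arbitrary example, not part of the workflow):

```python
from urllib.parse import urlencode

def build_arxiv_url(base_query: str, fetch_limit: int) -> str:
    # Same parameters as the Dify HTTP node, assembled into one GET URL
    params = {
        "search_query": base_query,
        "start": 0,
        "max_results": fetch_limit,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return "https://export.arxiv.org/api/query?" + urlencode(params)
```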

7.3 Search Paper Node Post-processing (Code Node)

Node Name: Search Paper Node Post-processing
Node Type: code

Parses the XML data returned by the API and performs fine-grained filtering.

Input Variables:

  • http_response_body: HTTP node response body
  • and all output variables from the preprocessing node

Code Logic:

  1. Parse XML response
  2. Deduplication Filter: Remove papers found in pushed_ids
  3. Date Filter: Remove papers older than cutoff_date
  4. Keyword Filter: Ensure the title or abstract contains at least one search keyword
  5. Format output as a list of JSON objects

Output Variables:

  • result: Final filtered paper list (JSON string)
  • count: Final paper count
  • debug: Debugging information (includes filtering statistics)
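A simplified sketch of the parsing and filtering logic, assuming the standard Atom feed that the ArXiv API returns (the ID-extraction heuristic handles new-style IDs like 2401.00001v1 only; the real node also applies the keyword filter and emits debug statistics):

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def parse_feed(xml_text: str, pushed_ids: list, cutoff_str: str) -> list:
    papers = []
    for entry in ET.fromstring(xml_text).iter(ATOM + "entry"):
        raw_id = entry.findtext(ATOM + "id", "")   # e.g. http://arxiv.org/abs/2401.00001v1
        arxiv_id = raw_id.rsplit("/", 1)[-1].split("v")[0]
        published = entry.findtext(ATOM + "published", "")[:10]
        if arxiv_id in pushed_ids or published < cutoff_str:
            continue                               # dedup filter + date filter
        papers.append({
            "arxiv_id": arxiv_id,
            "title": " ".join(entry.findtext(ATOM + "title", "").split()),
            "summary": entry.findtext(ATOM + "summary", "").strip(),
            "published": published,
        })
    return papers
```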

Step Eight: LLM Initial Review and Filtering (LLM Node)

Node Name: LLM Initial Review
Node Type: llm

Uses LLM to conduct an initial review of papers and filter out the most valuable ones.

System Prompt:

You are a senior academic researcher, skilled at quickly evaluating paper value.

Task: Filter out the Top 3 most valuable papers from the given list.

Evaluation Criteria:
1. Innovativeness: Is the method novel?
2. Practicality: Does it have practical application value?
3. Impact: Author's institution, publication status
4. Technical Depth: Does it have technical breakthroughs?

Output Requirements:
- Pure JSON array format
- Retain all original fields
- Output Top 3 papers

Step Nine: JSON Parsing (Code Node)

Node Name: JSON Parsing
Node Type: code

Parses the JSON string output by the LLM, handling various possible formats.

Core Logic:

  • Handle nested JSON
  • Support papers or top_papers fields
  • Fault tolerance
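A minimal sketch of this logic (the fallback behavior and exact key names are assumptions; see the repository for the actual node code):

```python
import json
import re

def parse_llm_json(raw: str) -> list:
    # Strip markdown code fences that the LLM may wrap around its output
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # fault tolerance: fall back to an empty list
    # Accept a bare array, or an object wrapping it as "papers"/"top_papers"
    if isinstance(data, dict):
        data = data.get("papers") or data.get("top_papers") or []
    return data if isinstance(data, list) else []
```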

Step Ten: Iteration Node

Node Name: Iteration
Node Type: iteration

Iterate through each filtered paper.

Configuration:

  • Input: top_papers (array of papers)
  • Output: merged_paper (processed paper object)
  • Parallel Mode: Off (sequential execution)
  • Error Handling: Terminate on error

Detailed Internal Iteration Process

| No. | Node Name | Type | Function |
| --- | --- | --- | --- |
| 1 | DataUnpack | code | Unpacks iteration items into independent variables |
| 2 | Create a Row | tool | Records arxiv_id to Supabase to prevent duplication |
| 3 | Large Model Document Parsing | tool | PaddleOCR parses the PDF to extract body text |
| 4 | get_footnote_text | code | Extracts footnote information (for institution identification) |
| 5 | truncated_text | code | Trims OCR text (controls LLM input length) |
| 6 | (LLM) Analysis | llm | Deeply analyzes the paper to extract key information |
| 7 | Data Assembly | code | Assembles the final paper object |

10.1 DataUnpack (Code Node)

Deconstruct iterative items into independent variables.

Output:

  • title_str: Paper title
  • pdf_url: PDF link
  • summary_str: Abstract
  • published: Publication date
  • authors: Authors
  • arxiv_id: ArXiv ID
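A plausible sketch of this code node, assuming the iteration item carries the field names produced by the search post-processing step:

```python
def main(item: dict) -> dict:
    # Unpack one iteration item into independent, directly referenceable variables
    return {
        "title_str": item.get("title", ""),
        "pdf_url": item.get("pdf_url", ""),
        "summary_str": item.get("summary", ""),
        "published": item.get("published", ""),
        "authors": item.get("authors", ""),
        "arxiv_id": item.get("arxiv_id", ""),
    }
```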

10.2 Create a Row (Supabase Node)

Record the paper’s ArXiv ID into the database to prevent duplicate pushes.

Configuration:

  • Table name: Obtained from the configuration node
  • Data: {"arxiv_id": "{{arxiv_id}}"}

10.3 Large Model Document Parsing (PaddleOCR Node)

Node Name: Large Model Document Parsing
Node Type: tool (PaddleOCR)

Use PaddleOCR to parse the paper PDF and extract the main body content.

Configuration:

  • file: PDF URL
  • fileType: 0 (PDF file)
  • useLayoutDetection: true (Enable layout detection)
  • prettifyMarkdown: true (Beautify output)

10.4 get_footnote_text (Code Node)

Extract footnote information from the OCR text for subsequent institution recognition.

10.5 truncated_text (Code Node)

Crop OCR text to control LLM input length and avoid exceeding token limits.
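A minimal sketch of such a truncation node; the character budget and the head-only strategy are assumptions (the opening sections of a paper usually carry most of the method description the analysis LLM needs):

```python
def truncate_text(text: str, max_chars: int = 30000) -> str:
    # Keep only the head of the OCR text to stay under the LLM token limit
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n...[truncated]"
```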

10.6 LLM Deep Analysis

Node Name: (LLM) Analysis
Node Type: llm

Perform deep analysis on the paper to extract key information.

Extracted Fields:

  1. One_Liner: One-liner problem and solution
  2. Architecture: Model architecture and key innovations
  3. Dataset: Data source and scale
  4. Metrics: Core performance metrics
  5. Chinese_Abstract: Chinese abstract translation
  6. Affiliation: Author affiliation
  7. Code_Url: Code link

Core Principles:

  • No fluff: Directly state specific methods
  • Dig deep into details: Summarize algorithm logic, loss function design
  • Data first: Show the improvement margin compared to SOTA
  • No N/A: Reasonably infer

Output Format: Pure JSON object

10.7 Data Assembly (Code Node)

Node Name: Data Assembly
Node Type: code

Assemble all information into a structured paper object.

Core Functions:

  1. Parse publication status (identify top-tier conference papers)
  2. Parse LLM output JSON
  3. Extract code links
  4. Assemble the final paper object

Output Fields:

  • title: Title
  • authors: Authors
  • affiliation: Affiliation
  • pdf_url: PDF link
  • summary: English abstract
  • published: Publication status
  • github_stats: Code status
  • code_url: Code link
  • ai_evaluation: AI analysis results
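The assembly step might look like the sketch below. The field names (One_Liner, Code_Url, etc.) follow the analysis prompt above, but the venue list, the fence-stripping regex, and the github_stats wording are illustrative assumptions; the repository's node code is authoritative.

```python
import json
import re

def assemble(title_str: str, authors: str, summary_str: str,
             pdf_url: str, published: str, llm_output: str) -> dict:
    # Parse the analysis LLM's JSON, tolerating markdown code fences
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", llm_output.strip())
    try:
        ai = json.loads(text)
    except json.JSONDecodeError:
        ai = {}

    # Flag likely top-venue acceptances mentioned in the publication status
    venues = ("CVPR", "ICCV", "NeurIPS", "ICML", "ICLR", "ACL", "AAAI")
    status = published + (" (top-tier venue)" if any(v in published for v in venues) else "")

    code_url = ai.get("Code_Url", "N/A")
    return {
        "title": title_str,
        "authors": authors,
        "affiliation": ai.get("Affiliation", "Unknown"),
        "pdf_url": pdf_url,
        "summary": summary_str,
        "published": status,
        "github_stats": "Code available" if code_url not in ("", "N/A") else "No code released",
        "code_url": code_url,
        "ai_evaluation": ai,
    }
```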

Step Eleven: Template Transformation

Node Name: Template Transformation
Node Type: template-transform

Use a Jinja2 template to convert paper data into formatted email content.

Template Structure:

📅 PaperEssence Research Daily
Based on your specified research content "{{ raw_topic }}", 3 papers selected from arXiv updates in the last 30 days are pushed daily.
--------------------------------------------------
<small><i>⚠️ Note: Content is AI-generated and for academic reference only. Before citing or conducting in-depth research, please click the PDF link to verify with the original paper.</i></small>
Generation Date: {{ items.target_date | default('Today') }}
==================================================

{% set final_list = items.paper | default(items) %}

{% for item in final_list %}
📄 [{{ loop.index }}] {{ item.title }}
--------------------------------------------------
👤 Authors: {{ item.authors }}
🏢 Affiliation: {{ item.affiliation }}
🔗 PDF: {{ item.pdf_url }}
📅 Status: {{ item.published }}
{% if item.code_url and item.code_url != 'N/A' %}
📦 Code: {{ item.github_stats }}
   🔗 {{ item.code_url }}
{% else %}
📦 Code: {{ item.github_stats }}
{% endif %}

English Abstract:
{{ item.summary | replace('\n', ' ') }}

Chinese Abstract:
{{ item.ai_evaluation.Chinese_Abstract }}

🚀 Core Innovation:
{{ item.ai_evaluation.One_Liner }}

📊 Summary:
--------------------------------------------------
🏗️ Architecture:
{{ item.ai_evaluation.Architecture | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

💾 Data:
{{ item.ai_evaluation.Dataset | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

📈 Metrics:
{{ item.ai_evaluation.Metrics | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

==================================================
{% else %}
⚠️ No new paper updates today.
{% endfor %}

Step Twelve: Email Sending

Node Name: 163SMTP Email Sending
Node Type: tool (163-smtp-send-mail)

Configuration:

  • username_send: Sender email (read from environment variables)
  • authorization_code: Email authorization code (read from environment variables)
  • username_recv: Recipient email
  • subject: PaperEssence-{{cutoff_str}}-{{today_str}}
  • content: Content after template transformation

Step Thirteen: Output Node

Node Name: Output
Node Type: end

Output the final result for easy debugging and verification.


:outbox_tray: Publish Workflow and Get API

After the workflow is debugged and passed, click the publish button in the upper right corner.

Record the following information:

  • API Endpoint: https://api.dify.ai/v1/workflows/run (for self-hosted deployments, replace the host with your own domain)
  • API Key: A string like app-xxxxxxxxxxxx

:alarm_clock: Configure Daily Automatic Run

Since the scheduled trigger on the Dify cloud platform may be limited on the free plan, you can use Windows Task Scheduler with a script to trigger the workflow daily.

Prerequisite: Install Git for Windows

This solution uses Git Bash to execute curl commands, so Git for Windows needs to be installed first.

Notes during installation:

  • It is recommended to choose the default installation path (e.g., C:\Program Files\Git) or a custom path (e.g., D:\ProgramFiles\Git)
  • Ensure “Git Bash Here” is checked in the installation options

Configure Windows Task Scheduler

  1. Press Win + R → Type taskschd.msc, press Enter, open Task Scheduler
  2. Click “Create Task” on the right

General Tab:

  • Name: Paper-Essence Daily Run
  • Check “Run with highest privileges”

Triggers Tab:

  1. Click “New”
  2. In the “Begin the task” dropdown, select “On a schedule”
  3. Select “Daily”, set the trigger time (recommended to be consistent with the timer time in the Dify workflow, e.g., 20:55)
  4. Click “OK”

Actions Tab:

  1. Click “New”

  2. For Action, select “Start a program”

  3. Program/script: Enter your Git Bash path, for example: D:\ProgramFiles\Git\bin\bash.exe

  4. Add arguments:

    curl -N -X POST "<your API endpoint>" -H "Authorization: Bearer app-<your API key>" -H "Content-Type: application/json" -d '{ "inputs": {}, "response_mode": "streaming", "user": "cron-job" }'
    
  5. Click “OK” to save the task


:test_tube: Testing and Debugging

Manual Testing

  1. Click the “Run” button in the upper right corner of the workflow editor
  2. Observe the execution status of each node
  3. Check if the output of each node meets expectations

Successful Run Effect

After the workflow executes successfully, you will receive a formatted paper daily report email in your inbox, including:

  • Paper title, authors, affiliation
  • PDF link and code link
  • Chinese and English abstracts
  • AI-analyzed core innovations, architecture, dataset, and performance metrics

:memo: Summary

This tutorial covers YAML node configuration and scheduling, environment variables, and Supabase usage, building an end-to-end pipeline from ArXiv scraping → PaddleOCR parsing → LLM analysis → Jinja2 templating → SMTP delivery, with deduplication and error handling along the way.

Acknowledgements

Thanks to Teacher Zhang Jing, Teacher Guan Mu, and Teacher Yang Youzhi for their guidance.
