Paper-Essence — Tutorial: Building an Automated Paper Digest Workflow

:open_book: Project Overview

Paper-Essence is an automated paper-digest workflow built on the Dify platform. This workflow can:

  • :one_o_clock: Fetch the latest papers from arXiv for specified research areas on a daily schedule
  • :robot: Use large language models to filter and select the most valuable papers
  • :page_facing_up: Parse PDF papers with OCR to extract technical details
  • :e_mail: Generate a structured daily digest and send it by email

GitHub repository: github.com/LiaoYFBH/PaperFlow — you can import prj/Paper-Essence-CN.yml or prj/Paper-Essence-EN.yml directly.


:hammer_and_wrench: Prerequisites

1. Platform and Accounts

  • Dify account: Register and log in to Dify
  • Email account: An SMTP-capable email (this tutorial uses 163 Mail)
  • LLM API: Configure Xinghe Community API

2. Install Required Plugins

Install the following plugins from the Dify plugin marketplace:

| Plugin | Purpose |
| --- | --- |
| paddle-aistudio/ernie-paddle-aistudio | Xinghe Community API |
| langgenius/paddleocr | OCR for PDFs and images |
| wjdsg/163-smtp-send-mail | 163 SMTP email sending |
| langgenius/supabase | Database storage for pushed records |

3. Prepare Supabase Database

We use a cloud database (Supabase) to record papers that have already been pushed to avoid duplicates.

Step 1: Login and Create Project

Visit supabase.com to create an account and start a new project.

Step 2: Create Table

In the SQL Editor, run the following SQL statement to create a table that records pushed paper IDs to ensure no duplicates:

create table pushed_papers (
  arxiv_id text not null,
  pushed_at timestamp default now(),
  primary key (arxiv_id)
);

Step 3: Get API Keys

Navigate to Project Settings → API section to find your credentials.

Record the following information:

  • NEXT_PUBLIC_SUPABASE_URL → Supabase URL for Dify plugin
  • NEXT_PUBLIC_SUPABASE_PUBLISHABLE_DEFAULT_KEY → Supabase Key for Dify plugin

Step 4: Configure Supabase Plugin in Dify

In Dify, go to the plugin marketplace, find the Supabase plugin, and enter your URL and Key from the previous step.

(Optional) Deploy Dify with Docker

Environment Setup

This tutorial uses WSL + Docker. You can refer to Microsoft’s documentation for WSL and Docker configuration.

Clone the Dify Repository

First, clone the Dify repository. If you haven’t configured Git, you can directly download the ZIP file from the repository page and extract it.

If you have Git configured, run the following commands in your terminal:

# Clone the Dify repository
git clone https://github.com/langgenius/dify.git

# Navigate to the Docker deployment directory
cd dify/docker

# Copy the environment configuration file
cp .env.example .env

First, open Docker Desktop, then enter in the terminal:

# Start Dify (this will automatically pull images and start all services)
docker compose up -d

Check the status:

docker compose ps

Access the application at: http://localhost/


:bar_chart: Workflow Architecture

The workflow follows this core flow:

Schedule Trigger → Configuration (code) → LLM trans (llm: translate topic) → Get Rows (tool: check pushed papers) → Search Pre-process (code) → HTTP Request (http-request) → Search Post-process (code) → LLM Initial Review (llm) → JSON Parse (code) → Iteration (for each paper: DataUnpack (code) → Create a Row (tool) → Document Parsing (tool: OCR) → get_footnote_text (code) → truncated_text (code) → Analysis (llm) → Data Assembly (code)) → Template Transform (template-transform) → 163 SMTP Email Sender (tool) → Output (end)

The complete view of this workflow:


:wrench: Step-by-step Setup

Step 1 — Create the Workflow

  1. Log in to Dify
  2. In the Studio, click “Create App” → choose “Workflow”
  3. Enter an application name
  4. Choose the Trigger type for the workflow

Step 2 — Configure Environment Variables

Click the Settings button in the top-right corner and then select “Add Variable”.

These environment variables will be read by the Configuration node (a Code node) that processes and outputs them for downstream nodes.

Key variables:

| Name | Type | Description | Example |
| --- | --- | --- | --- |
| table_name | string | Supabase table name | pushed_papers |
| SMTP_PORT | string | SMTP port | 465 |
| SMTP_SERVER | string | SMTP server | smtp.163.com |
| SMTP_PASSWORD | secret | SMTP authorization code | (your auth code) |
| SMTP_USER | secret | SMTP user/email | your_email@163.com |
| MY_RAW_TOPIC | string | Research topic | agent memory |

To get your email authorization code, log in to your email provider’s settings and generate an SMTP authorization code for third-party applications.


Step 3 — Scheduled Trigger

Node name: Scheduled Trigger

Configuration:

  • Trigger Frequency: Daily
  • Trigger Time: 8:59 AM (or adjust as needed)

Step 4 — Configuration (Code Node)

Node name: Configuration (Type: code) — This Code node reads environment variables, processes them, and outputs configuration values for downstream nodes.

Input Variables:

  • From environment variables: SMTP_PORT, SMTP_SERVER, SMTP_USER, SMTP_PASSWORD, MY_RAW_TOPIC, table_name

Output Variables:

  • raw_topic: Research topic
  • user_email: Recipient email
  • fetch_count: Number of papers to fetch (default: 50)
  • push_limit: Push limit (default: 3)
  • days_lookback: Days to look back (default: 30)
  • Plus SMTP configuration
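A Dify Code node exposes a `main` function whose keyword arguments are the node's input variables and whose returned dict becomes its output variables. A minimal sketch of the Configuration node under that convention (the defaults 50/3/30 come from the list above; sending the digest to `SMTP_USER` itself is an assumption — point `user_email` wherever you like):

```python
def main(SMTP_PORT: str, SMTP_SERVER: str, SMTP_USER: str,
         SMTP_PASSWORD: str, MY_RAW_TOPIC: str, table_name: str) -> dict:
    # Pass environment values through and attach workflow defaults
    # so downstream nodes read one consistent configuration object.
    return {
        "raw_topic": MY_RAW_TOPIC,
        "user_email": SMTP_USER,      # assumption: send the digest to yourself
        "fetch_count": 50,            # papers to fetch from arXiv
        "push_limit": 3,              # papers to include in the digest
        "days_lookback": 30,          # recency window in days
        "smtp_port": SMTP_PORT,
        "smtp_server": SMTP_SERVER,
        "smtp_user": SMTP_USER,
        "smtp_password": SMTP_PASSWORD,
        "table_name": table_name,
    }
```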

Step 5 — LLM Translation (LLM Node)

Node name: LLM trans (Type: llm) — Converts the research topic into an optimized English boolean query for arXiv.

Model Configuration:

  • Model: ernie-4.5-turbo-128k or ernie-5.0-thinking-preview. If you choose the deep-thinking model (ernie-5.0-thinking-preview), enable reasoning format separation so the thinking trace is kept out of the query output.
  • Temperature: 0.7

Prompt Rules: Extract core concepts, translate terms (if necessary), construct boolean logic using AND/OR, wrap phrases in quotes, and output only the query string.


Step 6 — Query Pushed Records (Supabase Node)

Node name: Get Rows (Type: tool - Supabase plugin) — Fetches existing pushed arXiv IDs to avoid duplicates.

Configuration:

  • Table name: {{table_name}} (from Configuration node)

Step 7 — Search Papers (3 Nodes)

To improve stability and maintainability, the search function is split into “Pre-process” → “HTTP Request” → “Post-process”.

7.1 Search Pre-process (Code Node)

Node name: Search Pre-process (Type: code) — Builds the arXiv API request and prepares search parameters.

Input Variables:

  • topic: Translated English search term
  • days_lookback: Days to look back
  • count: Number of papers to fetch
  • supabase_output: Already pushed records (for deduplication)

Code Logic:

  1. Calculate cutoff date (cutoff_date)
  2. Parse Supabase returned pushed paper ID list
  3. Build boolean query string based on topic (supports AND/OR logic)
  4. Add arXiv category restrictions based on topic keywords (e.g., cs.CV, cs.CL)
  5. Extract search keywords for subsequent filtering

Output Variables:

  • base_query: Constructed query string
  • pushed_ids: List of already pushed IDs
  • cutoff_str: Cutoff date string
  • search_keywords: List of search keywords
  • fetch_limit: API fetch limit
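The logic above can be sketched as a Dify Code node. This is a simplified illustration, not the repository's exact code: the category-restriction step (cs.CV, cs.CL, etc.) is omitted, and it assumes the Supabase output arrives as a JSON list of row objects and the topic as an AND-joined string of quoted phrases:

```python
import json
from datetime import datetime, timedelta, timezone

def main(topic: str, days_lookback: int, count: int, supabase_output: str) -> dict:
    # 1. Cutoff date for the recency filter.
    cutoff = datetime.now(timezone.utc) - timedelta(days=int(days_lookback))
    cutoff_str = cutoff.strftime("%Y-%m-%d")
    # 2. IDs already pushed (assumed JSON list of rows from Supabase).
    try:
        rows = json.loads(supabase_output) if supabase_output else []
    except (TypeError, json.JSONDecodeError):
        rows = []
    pushed_ids = [r["arxiv_id"] for r in rows
                  if isinstance(r, dict) and "arxiv_id" in r]
    # 3. Boolean query over title and abstract; quoted phrases stay intact.
    phrases = [p.strip() for p in topic.split(" AND ") if p.strip()]
    base_query = " AND ".join(f"(ti:{p} OR abs:{p})" for p in phrases)
    # 5. Plain keywords for post-filtering (quotes stripped, lower-cased).
    search_keywords = [p.strip('"').lower() for p in phrases]
    return {
        "base_query": base_query,
        "pushed_ids": pushed_ids,
        "cutoff_str": cutoff_str,
        "search_keywords": search_keywords,
        "fetch_limit": int(count),
    }
```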

7.2 HTTP Request (HTTP Node)

Node name: HTTP Request (Type: http-request) — Calls arXiv API to get raw XML data.

Configuration:

  • API URL: http://export.arxiv.org/api/query
  • Method: GET
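The arXiv Atom API accepts `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters. A sketch of how the final request URL is assembled; the `search_query` value shown is a placeholder, since in the workflow it comes from the pre-process node's `base_query` output:

```python
from urllib.parse import urlencode

params = {
    "search_query": '(ti:"agent memory" OR abs:"agent memory")',  # base_query
    "start": 0,
    "max_results": 50,            # fetch_limit from the pre-process node
    "sortBy": "submittedDate",    # newest submissions first
    "sortOrder": "descending",
}
url = "http://export.arxiv.org/api/query?" + urlencode(params)
```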

7.3 Search Post-process (Code Node)

Node name: Search Post-process (Type: code) — Parses XML response and filters papers.

Input Variables:

  • http_response_body: HTTP node response body
  • Plus all output variables from the pre-process node

Code Logic:

  1. Parse XML response
  2. Deduplication filtering: Remove papers in pushed_ids
  3. Date filtering: Remove papers earlier than cutoff_date
  4. Keyword filtering: Ensure title or abstract contains at least one search keyword
  5. Format output as JSON object list

Output Variables:

  • result: Final filtered paper list (JSON string)
  • count: Final paper count
  • debug: Debug information (including filtering statistics)
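A sketch of the post-process core, assuming the response body is a standard Atom feed. The date filter and debug statistics are omitted for brevity, and `parse_and_filter` is an illustrative name rather than the repository's exact function:

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def parse_and_filter(xml_text: str, pushed_ids: list, keywords: list) -> list:
    papers = []
    for entry in ET.fromstring(xml_text).findall("atom:entry", ATOM):
        # arXiv entry IDs look like http://arxiv.org/abs/2401.00001v1
        arxiv_id = entry.find("atom:id", ATOM).text.rsplit("/", 1)[-1]
        title = " ".join(entry.find("atom:title", ATOM).text.split())
        summary = " ".join(entry.find("atom:summary", ATOM).text.split())
        if arxiv_id in pushed_ids:               # deduplication filter
            continue
        text = (title + " " + summary).lower()
        if keywords and not any(k in text for k in keywords):
            continue                             # keyword filter
        papers.append({"arxiv_id": arxiv_id, "title": title,
                       "summary": summary,
                       "published": entry.find("atom:published", ATOM).text})
    return papers
```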

Step 8 — LLM Initial Review

Node name: LLM Initial Review (Type: llm) — Uses an LLM to score and select the top papers (Top 3).

Output Requirements:

  • Clean JSON array format
  • Preserve all original fields
  • Output Top 3 papers

Step 9 — JSON Parsing (Code Node)

Node name: JSON Parse (Type: code) — Tolerant parsing of LLM outputs into a normalized list of papers.

Core Logic:

  • Handle nested JSON
  • Support papers or top_papers fields
  • Error-tolerant processing
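A tolerant parser might look like the following illustrative sketch: it strips the markdown fences a model sometimes wraps around its JSON, accepts either a bare array or a `papers`/`top_papers` field, and degrades to an empty list instead of crashing:

```python
import json
import re

def parse_llm_json(text: str) -> list:
    # Remove ``` / ```json fences the LLM may add around its output.
    cleaned = re.sub(r"```(?:json)?", "", text).strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return []  # error-tolerant: unparseable output yields no papers
    if isinstance(data, dict):
        # Support both "papers" and "top_papers" wrapper fields.
        data = data.get("papers") or data.get("top_papers") or []
    return data if isinstance(data, list) else []
```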

Step 10 — Iteration Node

Node name: Iteration — Processes each selected paper sequentially (unpack, record to Supabase, OCR the PDF, analyze with LLMs, assemble the final object).

Configuration:

  • Input: top_papers (paper array)
  • Output: merged_paper (processed paper object)
  • Parallel Mode: Off (sequential execution)
  • Error Handling: Stop on error

Iteration Internal Flow

| # | Node Name | Type | Function |
| --- | --- | --- | --- |
| 1 | DataUnpack | code | Unpack the iteration item into individual variables |
| 2 | Create a Row | tool | Record arxiv_id to Supabase to prevent duplicates |
| 3 | Document Parsing | tool | PaddleOCR parses the PDF to extract text |
| 4 | get_footnote_text | code | Extract footnote information (for affiliation recognition) |
| 5 | truncated_text | code | Truncate OCR text (control LLM input length) |
| 6 | Analysis | llm | Deep analysis to extract key information |
| 7 | Data Assembly | code | Assemble the final paper object |

10.1 DataUnpack (Code Node)

Unpacks the iteration item into individual variables.

Output:

  • title_str: Paper title
  • pdf_url: PDF link
  • summary_str: Abstract
  • published: Publication date
  • authors: Authors
  • arxiv_id: ArXiv ID

10.2 Create a Row (Supabase Node)

Records the paper ArXiv ID to the database to prevent duplicate pushes.

Configuration:

  • Table name: From Configuration node
  • Data: {"arxiv_id": "{{arxiv_id}}"}

10.3 Document Parsing (PaddleOCR Node)

Node name: Document Parsing (Type: tool - PaddleOCR)

Uses PaddleOCR (not an LLM) to parse the paper PDF and extract text content via OCR.

Configuration:

  • file: PDF URL
  • fileType: 0 (PDF file)
  • useLayoutDetection: true (enable layout detection)
  • prettifyMarkdown: true (beautify output)

10.4 get_footnote_text (Code Node)

Extracts footnote information from OCR text for subsequent affiliation recognition.
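One illustrative heuristic (an assumption, not the repository's exact code): keep the head of the OCR output, where affiliations usually appear on page 1, plus any lines that start with common footnote markers:

```python
import re

def get_footnote_text(ocr_text: str, head_chars: int = 2000) -> str:
    # Collect lines beginning with footnote markers (*, †, ‡) anywhere
    # in the document, since author notes often carry affiliations.
    marker_lines = [ln for ln in ocr_text.splitlines()
                    if re.match(r"^\s*[*\u2020\u2021]", ln)]
    # head_chars is an assumed budget covering the title/author block.
    return ocr_text[:head_chars] + "\n" + "\n".join(marker_lines[:10])
```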

10.5 truncated_text (Code Node)

Truncates OCR text to control LLM input length and avoid exceeding token limits.
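A minimal sketch of such a node; the 30,000-character budget is an assumption to tune against your model's context window:

```python
def main(ocr_text: str, max_chars: int = 30000) -> dict:
    # Keep the head of the paper, where the title, abstract, and method
    # sections usually live; mark the cut so the LLM knows text is missing.
    if len(ocr_text) <= max_chars:
        return {"truncated_text": ocr_text}
    return {"truncated_text": ocr_text[:max_chars] + "\n...[truncated]"}
```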

10.6 Analysis (LLM Node)

Node name: Analysis (Type: llm)

Performs deep analysis of the paper to extract key information.

Extracted Fields:

  1. One_Liner: One-sentence pain point and solution
  2. Architecture: Model architecture and key innovations
  3. Dataset: Data sources and scale
  4. Metrics: Core performance metrics
  5. Affiliation: Author affiliations
  6. Code_Url: Code repository link

Core Principles:

  • No fluff: Directly state specific methods
  • Deep dive into details: Summarize algorithm logic, loss function design
  • Data first: Show improvement over SOTA
  • No N/A: Make reasonable inferences

Output Format: Pure JSON object

10.7 Data Assembly (Code Node)

Node name: Data Assembly (Type: code)

Assembles all information into a structured paper object.

Core Functions:

  1. Parse publication status (identify top conference papers)
  2. Parse LLM output JSON
  3. Extract code links
  4. Assemble final paper object

Output Fields:

  • title: Title
  • authors: Authors
  • affiliation: Affiliation
  • pdf_url: PDF link
  • summary: English abstract
  • published: Publication status
  • github_stats: Code status
  • code_url: Code link
  • ai_evaluation: AI analysis results
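As one example of function 3 above, `extract_code_url` is a hypothetical helper illustrating how code links might be resolved: prefer the link the Analysis LLM extracted, and fall back to scanning the OCR text for a GitHub URL:

```python
import re

def extract_code_url(llm_output: dict, ocr_text: str) -> str:
    # Prefer the Code_Url field from the Analysis node's JSON.
    url = (llm_output.get("Code_Url") or "").strip()
    if url.startswith("http"):
        return url
    # Fallback: scan the OCR text for the first GitHub repository link.
    m = re.search(r"https?://github\.com/[\w.\-]+/[\w.\-]+", ocr_text)
    return m.group(0) if m else "N/A"
```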

Step 11 — Template Transform

Node name: Template Transform (Type: template-transform)

Uses a Jinja2 template to convert paper data into formatted email content.

Template Structure:

📅 PaperEssence Daily
Based on your specified research topic "{{ raw_topic }}", here are the top 3 papers selected from recent ArXiv updates.
--------------------------------------------------
--------------------------------------------------
⚠️ Note: Content is AI-generated and for academic reference only. Please verify by checking the original PDF before citing or conducting in-depth research.
Date: {{ items.target_date | default('Today') }}
==================================================

{# Automatically adapt data structure #}
{% set final_list = items.paper | default(items) %}

{% for item in final_list %}
📄 [{{ loop.index }}] {{ item.title }}
--------------------------------------------------
👤 Authors: {{ item.authors }}
🏢 Affiliation: {{ item.affiliation }}
🔗 PDF: {{ item.pdf_url }}
📅 Status: {{ item.published }}
{% if item.code_url and item.code_url != 'N/A' %}
📦 Code: {{ item.github_stats }}
   🔗 {{ item.code_url }}
{% else %}
📦 Code: {{ item.github_stats }}
{% endif %}
Abstract:
{{ item.summary | replace('\n', ' ') }}

🚀 Core Innovation:
{{ item.ai_evaluation.One_Liner }}

📊 Summary:
{# Use newlines and indentation for hierarchy, not relying on HTML tags #}
--------------------------------------------------
🏗️ Architecture:
{{ item.ai_evaluation.Architecture | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

💾 Dataset:
{{ item.ai_evaluation.Dataset | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

📈 Metrics:
{{ item.ai_evaluation.Metrics | replace('\n- ', '\n\n   🔹 ') | replace('- ', '   🔹 ') }}

==================================================
{% else %}
⚠️ No new papers today.
{% endfor %}

Step 12 — Send Email (163 SMTP)

Node name: 163 SMTP Email Sender (Type: tool - 163-smtp-send-mail)

Configuration:

  • username_send: Sender email (from environment variables via SMTP_USER)
  • authorization_code: Email authorization code (from environment variables via SMTP_PASSWORD)
  • username_recv: Recipient email (from Configuration node)
  • subject: PaperEssence-{{cutoff_str}}-{{today_str}}
  • content: Content from template transform

Note: These parameter names (username_send, authorization_code, username_recv) are specific to the 163 SMTP plugin.


Step 13 — Output Node

Node name: Output (Type: end)

Outputs the final result for debugging and verification.


:outbox_tray: Publishing and Getting the Workflow API

After testing and confirming the workflow works correctly, click the Publish button in the top-right corner.

Record the following information:

  • API endpoint: https://api.dify.ai/v1/workflows/run (for Dify cloud deployment) or your private deployment URL (e.g., http://localhost/v1/workflows/run)
  • API key: app-xxxxxxxxxxxx

:alarm_clock: Alternative: Local Schedule Trigger (Windows Task Scheduler)

If Dify cloud scheduling is restricted on the free tier, you can use Windows Task Scheduler to trigger the workflow via a curl POST.

Prerequisite: Install Git for Windows

This solution uses Git Bash to execute curl commands, so you need to install Git for Windows first (search “Git for Windows” to download).

Installation Notes:

  • Recommended to use default installation path (e.g., C:\Program Files\Git) or custom path (e.g., D:\ProgramFiles\Git)
  • Ensure “Git Bash Here” is checked during installation

Configure Windows Task Scheduler

  1. Press Win + R → type taskschd.msc → Enter
  2. Click “Create Task”

General Tab:

  • Name: Paper-Essence Daily Run
  • Check “Run with highest privileges”

Triggers Tab:

  1. Click “New”
  2. Select “On a schedule”
  3. Choose “Daily”, set time (recommended to match Dify workflow timer, e.g., 20:55)
  4. Click “OK”

Actions Tab:

  1. Click “New”

  2. Action: “Start a program”

  3. Program/script: Enter your Git Bash path, for example:

    D:\ProgramFiles\Git\bin\bash.exe
    

    or default installation path:

    C:\Program Files\Git\bin\bash.exe
    
  4. Add arguments: a curl command that POSTs to your workflow API endpoint with your API key, for example:

    -c "curl -s -X POST 'https://api.dify.ai/v1/workflows/run' -H 'Authorization: Bearer app-xxxxxxxxxxxx' -H 'Content-Type: application/json' -d '{\"inputs\": {}, \"response_mode\": \"blocking\", \"user\": \"task-scheduler\"}'"

    :warning: Note: Replace the API endpoint and key with your actual values from the previous step. The JSON body follows Dify's workflow-run API (inputs, response_mode, user).

Conditions Tab (Optional):

  • Uncheck “Start the task only if the computer is on AC power” so the task also runs when a laptop is on battery

Settings Tab (Optional):

  • Check “If the task fails, restart every” and set a retry interval

Finally, click “OK” to save the task.

:test_tube: Testing and Debugging

Manual Test

  1. Click “Run” in the workflow editor (top-right)
  2. Observe node execution and outputs
  3. Verify each node’s output meets expectations

Success Result

When the workflow executes successfully, you will receive an email with the daily paper digest.


:memo: Summary

This tutorial builds a complete end-to-end pipeline: scheduled arXiv fetching → LLM filtering → PaddleOCR parsing → LLM analysis → Jinja2 templating → SMTP delivery, with Supabase-based deduplication. Along the way it covers Dify node configuration, environment variables, and third-party plugin integration.

The provided prj/Paper-Essence-CN.yml and prj/Paper-Essence-EN.yml can be imported into a Dify workspace to reproduce the workflow.


Acknowledgments

Special thanks to Alex Zhang, Guan Mu, and Yang Youzhi for their guidance.
