diff --git a/packages/web-crawler/README.md b/packages/web-crawler/README.md
index cc8c274603f2a..d21e9bc99da66 100644
--- a/packages/web-crawler/README.md
+++ b/packages/web-crawler/README.md
@@ -1,34 +1,61 @@
 # @lobechat/web-crawler
 
-LobeChat 内置的网页抓取模块,用于从网页中提取结构化内容,并转换为 Markdown 格式。
+LobeChat's built-in web crawling module, used to intelligently extract web page content and convert it to Markdown.
 
-## 📝 简介
+## 📝 Introduction
 
-`@lobechat/web-crawler` 是 LobeChat 项目的内部组件,专门负责网页内容的抓取和处理。它能够智能地从各种网页中提取有意义的内容,剔除广告、导航栏等干扰元素,并将结果转换为结构良好的 Markdown 文本。
+`@lobechat/web-crawler` is a core component of LobeChat responsible for intelligent crawling and processing of web content. It extracts valuable content from a wide range of web pages, filters out distracting elements, and produces well-structured Markdown text.
 
-## 🔍 主要功能
+## 🛠️ Core Features
 
-- **网页内容抓取**:支持从各类网站获取原始 HTML 内容
-- **智能内容提取**:使用 Mozilla 的 Readability 算法识别页面中的主要内容
-- **降级处理机制**:当标准抓取失败时,自动切换到 Browserless.io 服务进行渲染抓取(需要自行配置环境变量)
-- **Markdown 转换**:将提取的 HTML 内容转换为易于 AI 处理的 Markdown 格式
+- **Intelligent Content Extraction**: Identifies the main content of a page using Mozilla's Readability algorithm
+- **Multi-level Crawling Strategy**: Supports multiple crawling implementations, including basic (naive) crawling, Jina, and Browserless rendering
+- **Custom URL Rules**: Handles site-specific crawling logic through a flexible rule system
 
-## 🛠️ 技术实现
+## 🤝 Contributing
 
-该模块主要依赖以下技术:
+Web page structures are diverse and complex, so we welcome community contributions of crawling rules for specific websites. You can get involved in the following ways:
 
-- **@mozilla/readability**:提供了强大的内容提取算法
-- **happy-dom**:轻量级的服务端 DOM 实现
-- **node-html-markdown**:高效的 HTML 到 Markdown 转换工具
+### How to Contribute URL Rules
 
-## 🤝 共建改进
+1. Add new rules to the [urlRules.ts](https://github.com/lobehub/lobe-chat/blob/main/packages/web-crawler/src/urlRules.ts) file
+2. Rule example:
 
-由于网页结构的多样性和复杂性,内容提取可能会遇到各种挑战。如果您发现某些网站的抓取效果不佳,欢迎通过以下方式参与改进:
+```typescript
+// Example: handling a specific website
+const crawUrlRules: CrawlUrlRule[] = [
+  // ... other URL matching rules
+  {
+    // URL matching pattern (regular expression)
+    urlPattern: 'https://example.com/articles/(.*)',
 
-1. 提交具体的问题网址和期望的输出结果
-2. 分享您对特定网站类型的处理经验
-3. 提出针对性的算法或配置调整建议
+    // Optional: URL transformation, redirecting to a version that is easier to crawl
+    urlTransform: 'https://example.com/print/$1',
 
-## 📌 注意事项
+    // Optional: crawling implementations to use; 'naive', 'jina', and 'browserless' are supported
+    impls: ['naive', 'jina', 'browserless'],
 
-这是 LobeHub 的内部模块(`"private": true`),不作为独立包发布使用。它专为 LobeChat 的特定需求设计,与其他系统组件紧密集成。
+    // Optional: content filtering configuration
+    filterOptions: {
+      // Whether to enable the Readability algorithm to filter out distracting elements
+      enableReadability: true,
+      // Whether to convert the result to plain text
+      pureText: false,
+    },
+  },
+];
+```
+
+### Rule Submission Process
+
+1. Fork the [LobeChat repository](https://github.com/lobehub/lobe-chat)
+2. Add or modify URL rules
+3. Submit a Pull Request describing:
+
+- Target website characteristics
+- Problems solved by the rule
+- Test cases (example URLs)
+
+## 📌 Note
+
+This is an internal module of LobeHub (`"private": true`), designed specifically for LobeChat and not published as a standalone package.
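A note for rule contributors: the capture groups matched by `urlPattern` feed the `$1`-style placeholders in `urlTransform`. The sketch below illustrates that substitution; `UrlRuleLike` and `applyUrlRule` are hypothetical names for illustration, not part of the package's public API, and the real resolution logic inside the crawler may differ:

```typescript
// Hypothetical sketch of how a rule's urlPattern/urlTransform pair resolves.
interface UrlRuleLike {
  urlPattern: string;
  urlTransform?: string;
}

const applyUrlRule = (url: string, rule: UrlRuleLike): string | undefined => {
  const match = url.match(new RegExp(rule.urlPattern));
  if (!match) return undefined;

  // Without a transform, the matched URL is crawled as-is.
  if (!rule.urlTransform) return url;

  // Substitute $1, $2, ... with the pattern's capture groups.
  return rule.urlTransform.replace(/\$(\d+)/g, (_, index) => match[Number(index)] ?? '');
};

// 'https://example.com/articles/123' -> 'https://example.com/print/123'
console.log(
  applyUrlRule('https://example.com/articles/123', {
    urlPattern: 'https://example.com/articles/(.*)',
    urlTransform: 'https://example.com/print/$1',
  }),
);
```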
diff --git a/packages/web-crawler/README.zh-CN.md b/packages/web-crawler/README.zh-CN.md
new file mode 100644
index 0000000000000..c480cc614711f
--- /dev/null
+++ b/packages/web-crawler/README.zh-CN.md
@@ -0,0 +1,61 @@
+# @lobechat/web-crawler
+
+LobeChat 内置的网页抓取模块,用于智能提取网页内容并转换为 Markdown 格式。
+
+## 📝 简介
+
+`@lobechat/web-crawler` 是 LobeChat 的核心组件,负责网页内容的智能抓取与处理。它能够从各类网页中提取有价值的内容,过滤掉干扰元素,并生成结构化的 Markdown 文本。
+
+## 🛠️ 核心功能
+
+- **智能内容提取**:基于 Mozilla Readability 算法识别主要内容
+- **多级抓取策略**:支持多种抓取实现,包括基础抓取、Jina 和 Browserless 渲染抓取
+- **自定义 URL 规则**:通过灵活的规则系统处理特定网站的抓取逻辑
+
+## 🤝 参与共建
+
+网页结构多样复杂,我们欢迎社区贡献特定网站的抓取规则。您可以通过以下方式参与改进:
+
+### 如何贡献 URL 规则
+
+1. 在 [urlRules.ts](https://github.com/lobehub/lobe-chat/blob/main/packages/web-crawler/src/urlRules.ts) 文件中添加新规则
+2. 规则示例:
+
+```typescript
+// 示例:处理特定网站
+const crawUrlRules: CrawlUrlRule[] = [
+  // ... 其他 URL 匹配规则
+  {
+    // URL 匹配模式,仅支持正则表达式
+    urlPattern: 'https://example.com/articles/(.*)',
+
+    // 可选:URL 转换,用于重定向到更易抓取的版本
+    urlTransform: 'https://example.com/print/$1',
+
+    // 可选:指定抓取实现方式,支持 'naive'、'jina' 和 'browserless' 三种
+    impls: ['naive', 'jina', 'browserless'],
+
+    // 可选:内容过滤配置
+    filterOptions: {
+      // 是否启用 Readability 算法,用于过滤干扰元素
+      enableReadability: true,
+      // 是否转换为纯文本
+      pureText: false,
+    },
+  },
+];
+```
+
+### 规则提交流程
+
+1. Fork [LobeChat 仓库](https://github.com/lobehub/lobe-chat)
+2. 添加或修改 URL 规则
+3. 提交 Pull Request 并描述:
+
+- 目标网站特点
+- 规则解决的问题
+- 测试用例(示例 URL)
+
+## 📌 注意事项
+
+这是 LobeHub 的内部模块(`"private": true`),专为 LobeChat 设计,不作为独立包发布使用。
diff --git a/packages/web-crawler/src/__test__/crawler.test.ts b/packages/web-crawler/src/__test__/crawler.test.ts
index 221388718df62..ab725d51d8e9d 100644
--- a/packages/web-crawler/src/__test__/crawler.test.ts
+++ b/packages/web-crawler/src/__test__/crawler.test.ts
@@ -80,9 +80,12 @@ describe('Crawler', () => {
     });
 
     expect(result).toEqual({
-      content: 'Fail to crawl the page. Error type: CrawlError, error message: Crawl failed',
-      errorMessage: 'Crawl failed',
-      errorType: 'CrawlError',
+      crawler: 'browserless',
+      data: {
+        content: 'Fail to crawl the page. Error type: CrawlError, error message: Crawl failed',
+        errorMessage: 'Crawl failed',
+        errorType: 'CrawlError',
+      },
       originalUrl: 'https://example.com',
       transformedUrl: undefined,
     });
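The updated test above pins down the new contract: a failed crawl is no longer returned as a bare error object but is wrapped in the same envelope as a success, with the error payload nested under `data` and the last-attempted implementation recorded in `crawler`. A sketch of that envelope, with field names taken from the diff (the interface name itself is illustrative, not an exported type):

```typescript
// Illustrative shape of the crawl result envelope after this change.
interface CrawlResultEnvelope {
  // Implementation that produced (or last attempted) the crawl, e.g. 'browserless'.
  crawler?: string;
  // Success payload, or an error payload carrying errorType/errorMessage.
  data: {
    content: string;
    errorMessage?: string;
    errorType?: string;
  };
  originalUrl: string;
  // Only set when a URL rule rewrote the original URL.
  transformedUrl?: string;
}
```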
diff --git a/packages/web-crawler/src/crawler.ts b/packages/web-crawler/src/crawler.ts
index a4f9075607647..86e809672bc2b 100644
--- a/packages/web-crawler/src/crawler.ts
+++ b/packages/web-crawler/src/crawler.ts
@@ -32,6 +32,7 @@ export class Crawler {
       ...userFilterOptions,
     };
 
+    let finalCrawler: string | undefined;
     let finalError: Error | undefined;
 
     const systemImpls = (ruleImpls ?? this.impls) as CrawlImplType[];
@@ -55,6 +56,7 @@ export class Crawler {
       } catch (error) {
         console.error(error);
         finalError = error as Error;
+        finalCrawler = impl;
       }
     }
 
@@ -62,9 +64,12 @@ export class Crawler {
     const errorMessage = finalError?.message;
 
     return {
-      content: `Fail to crawl the page. Error type: ${errorType}, error message: ${errorMessage}`,
-      errorMessage: errorMessage,
-      errorType,
+      crawler: finalCrawler,
+      data: {
+        content: `Fail to crawl the page. Error type: ${errorType}, error message: ${errorMessage}`,
+        errorMessage: errorMessage,
+        errorType,
+      },
       originalUrl: url,
       transformedUrl: transformedUrl !== url ? transformedUrl : undefined,
     };
diff --git a/packages/web-crawler/src/type.ts b/packages/web-crawler/src/type.ts
index b675b26a6d5ee..3e42567918aa0 100644
--- a/packages/web-crawler/src/type.ts
+++ b/packages/web-crawler/src/type.ts
@@ -11,6 +11,7 @@ export interface CrawlSuccessResult {
 
 export interface CrawlErrorResult {
   content: string;
   errorMessage: string;
+  errorType: string;
   url: string;
 }
@@ -36,9 +37,7 @@ export interface CrawlUrlRule {
   // Content filter options (optional)
   filterOptions?: FilterOptions;
   impls?: CrawlImplType[];
-  // Whether to match with a regular expression (defaults to glob mode)
-  isRegex?: boolean;
-  // URL matching pattern; supports glob patterns or regular expressions
+  // URL matching pattern; only regular expressions are supported
   urlPattern: string;
   // URL transform template (optional); applied when provided
   urlTransform?: string;
diff --git a/packages/web-crawler/src/urlRules.ts b/packages/web-crawler/src/urlRules.ts
index 01c3420dcd49d..414022587ed35 100644
--- a/packages/web-crawler/src/urlRules.ts
+++ b/packages/web-crawler/src/urlRules.ts
@@ -22,6 +22,11 @@ export const crawUrlRules: CrawlUrlRule[] = [
     impls: ['jina'],
     urlPattern: 'https://(.*).pdf',
   },
+  // arXiv PDFs: use jina
+  {
+    impls: ['jina'],
+    urlPattern: 'https://arxiv.org/pdf/(.*)',
+  },
   // Zhihu has anti-crawler protection; use jina
   {
     impls: ['jina'],
diff --git a/src/tools/web-browsing/Portal/PageContent/index.tsx b/src/tools/web-browsing/Portal/PageContent/index.tsx
index 57d56a789f821..3be138d98e97b 100644
--- a/src/tools/web-browsing/Portal/PageContent/index.tsx
+++ b/src/tools/web-browsing/Portal/PageContent/index.tsx
@@ -1,4 +1,4 @@
-import { Alert, CopyButton, Icon, Markdown } from '@lobehub/ui';
+import { Alert, CopyButton, Highlighter, Icon, Markdown } from '@lobehub/ui';
 import { Descriptions, Segmented, Typography } from 'antd';
 import { createStyles } from 'antd-style';
 import { ExternalLink } from 'lucide-react';
@@ -90,7 +90,42 @@ const PageContent = memo(({ result }) => {
   const { styles } = useStyles();
   const [display, setDisplay] = useState('render');
 
-  if (!result) return undefined;
+  if (!result || !result.data) return undefined;
+
+  if ('errorType' in result.data) {
+    return (
+      <Flexbox>
+        <Alert
+          extra={
+            <Highlighter language={'json'}>
+              {JSON.stringify(result.data, null, 2)}
+            </Highlighter>
+          }
+          message={
+            <div>
+              {result.data.errorMessage || result.data.content}
+            </div>
+          }
+          type={'error'}
+        />
+      </Flexbox>
+    );
+  }
 
   const { url, title, description, content } = result.data;
 
   return (
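The `'errorType' in result.data` guard above works because TypeScript narrows a union by property presence: only the error branch declares `errorType`. The same guard reappears in the render components below. A minimal, self-contained sketch of the pattern, with simplified stand-ins for the package's `CrawlSuccessResult`/`CrawlErrorResult` types:

```typescript
// Simplified stand-ins for the crawler's result types.
interface SuccessLike {
  content: string;
  title?: string;
  url: string;
}

interface ErrorLike {
  content: string;
  errorMessage: string;
  errorType: string;
  url: string;
}

const describeResult = (data: SuccessLike | ErrorLike): string => {
  if ('errorType' in data) {
    // Narrowed to ErrorLike: error fields are safely accessible here.
    return `Crawl failed: ${data.errorType} (${data.errorMessage})`;
  }
  // Narrowed to SuccessLike.
  return `Crawled ${data.url}: ${data.content.length} characters`;
};
```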
diff --git a/src/tools/web-browsing/Render/PageContent/Result.tsx b/src/tools/web-browsing/Render/PageContent/Result.tsx
index 9fc6f7a0535b3..214ce5fd049ac 100644
--- a/src/tools/web-browsing/Render/PageContent/Result.tsx
+++ b/src/tools/web-browsing/Render/PageContent/Result.tsx
@@ -1,7 +1,7 @@
 'use client';
 
-import { CrawlSuccessResult } from '@lobechat/web-crawler';
-import { Icon } from '@lobehub/ui';
+import { CrawlErrorResult, CrawlSuccessResult } from '@lobechat/web-crawler';
+import { Alert, Highlighter, Icon } from '@lobehub/ui';
 import { Descriptions, Typography } from 'antd';
 import { createStyles } from 'antd-style';
 import { ExternalLink } from 'lucide-react';
@@ -82,7 +82,7 @@ interface CrawlerData {
   crawler: string;
   messageId: string;
   originalUrl: string;
-  result: CrawlSuccessResult;
+  result: CrawlSuccessResult | CrawlErrorResult;
 }
 
 const CrawlerResultCard = memo<CrawlerData>(({ result, messageId, crawler, originalUrl }) => {
@@ -90,6 +90,39 @@ const CrawlerResultCard = memo<CrawlerData>(({ result, messageId, crawler, originalUrl }) => {
   const { styles } = useStyles();
   const [openToolUI, togglePageContent] = useChatStore((s) => [s.openToolUI, s.togglePageContent]);
 
+  if ('errorType' in result) {
+    return (
+      <Flexbox>
+        <Alert
+          extra={
+            <Highlighter language={'json'}>
+              {JSON.stringify(result, null, 2)}
+            </Highlighter>
+          }
+          message={
+            <div>
+              {result.errorMessage || result.content}
+            </div>
+          }
+          type={'error'}
+        />
+      </Flexbox>
+    );
+  }
+
 const { url, title, description } = result;
 
   return (
diff --git a/src/tools/web-browsing/Render/PageContent/index.tsx b/src/tools/web-browsing/Render/PageContent/index.tsx
index a72281bb3a3d8..91043701f20e1 100644
--- a/src/tools/web-browsing/Render/PageContent/index.tsx
+++ b/src/tools/web-browsing/Render/PageContent/index.tsx
@@ -1,3 +1,4 @@
+import { CrawlErrorResult } from '@lobechat/web-crawler';
 import { memo } from 'react';
 import { Flexbox } from 'react-layout-kit';
 
@@ -31,7 +32,16 @@ const PagesContent = memo(({ results, messageId, urls }) => {
         <CrawlerResultCard
           key={result.originalUrl}
           messageId={messageId}
           originalUrl={result.originalUrl}
-          result={result.data}
+          result={
+            result.data ||
+            // TODO: Remove this in v2 as it's deprecated
+            ({
+              content: (result as any)?.content,
+              errorMessage: (result as any)?.errorMessage,
+              errorType: (result as any)?.errorType,
+              url: result.originalUrl,
+            } as CrawlErrorResult)
+          }
         />
       ))}
diff --git a/src/types/tool/crawler.ts b/src/types/tool/crawler.ts
index 75b4d8179626d..38e370ce90940 100644
--- a/src/types/tool/crawler.ts
+++ b/src/types/tool/crawler.ts
@@ -1,4 +1,4 @@
-import { CrawlSuccessResult } from '@lobechat/web-crawler';
+import { CrawlErrorResult, CrawlSuccessResult } from '@lobechat/web-crawler';
 
 export interface CrawlSinglePageQuery {
   url: string;
@@ -10,7 +10,7 @@ export interface CrawlMultiPagesQuery {
 
 export interface CrawlResult {
   crawler: string;
-  data: CrawlSuccessResult;
+  data: CrawlSuccessResult | CrawlErrorResult;
   originalUrl: string;
 }
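The render-layer fallback above exists because messages persisted before this change stored error fields at the top level of the result rather than under `data`. Written out as a standalone helper, the normalization looks roughly like this; `LegacyCrawlResult` and `normalizeLegacyResult` are illustrative names, not code from the repository:

```typescript
import type { CrawlErrorResult } from '@lobechat/web-crawler';

// Illustrative shape of a legacy (pre-change) persisted error result.
interface LegacyCrawlResult {
  content?: string;
  errorMessage?: string;
  errorType?: string;
  originalUrl: string;
}

// Map a legacy top-level error payload onto the new CrawlErrorResult shape.
const normalizeLegacyResult = (result: LegacyCrawlResult): CrawlErrorResult => ({
  content: result.content ?? '',
  errorMessage: result.errorMessage ?? '',
  errorType: result.errorType ?? 'UnknownError',
  url: result.originalUrl,
});
```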