🐛 fix: Fix page crash with crawler error (lobehub#6662)
* try to fix issue

* fix

* fix types

* fix tests

* update docs
arvinxx authored Mar 3, 2025
1 parent a5fc714 commit 0c24251
Showing 10 changed files with 215 additions and 37 deletions.
67 changes: 47 additions & 20 deletions packages/web-crawler/README.md
@@ -1,34 +1,61 @@
# @lobechat/web-crawler

LobeChat's built-in web crawling module for extracting structured content from webpages and converting it into Markdown format.
LobeChat's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.

## 📝 Introduction

`@lobechat/web-crawler` is an internal component of the LobeChat project, dedicated to crawling and processing web content. It intelligently extracts meaningful content from all kinds of webpages, strips out distracting elements such as ads and navigation bars, and converts the result into well-structured Markdown text.
`@lobechat/web-crawler` is a core component of LobeChat responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.

## 🔍 Main Features
## 🛠️ Core Features

- **Web Content Crawling**: fetches raw HTML content from all kinds of websites
- **Intelligent Content Extraction**: uses Mozilla's Readability algorithm to identify the main content of a page
- **Fallback Mechanism**: automatically switches to the Browserless.io rendering service when standard crawling fails (requires configuring the environment variables yourself)
- **Markdown Conversion**: converts the extracted HTML into Markdown that is easy for AI to process
- **Intelligent Content Extraction**: Identifies main content based on Mozilla Readability algorithm
- **Multi-level Crawling Strategy**: Supports multiple crawling implementations including basic crawling, Jina, and Browserless rendering
- **Custom URL Rules**: Handles specific website crawling logic through a flexible rule system

## 🛠️ Technical Implementation

This module mainly relies on the following technologies:

- **@mozilla/readability**: provides a powerful content-extraction algorithm
- **happy-dom**: a lightweight server-side DOM implementation
- **node-html-markdown**: an efficient HTML-to-Markdown converter

## 🤝 Contributing Improvements

Because webpage structures are diverse and complex, content extraction can run into various challenges. If you find that certain websites crawl poorly, you are welcome to help improve things by:

1. Submitting the problematic URL and the expected output
2. Sharing your experience handling specific types of websites
3. Proposing targeted algorithm or configuration adjustments

## 📌 Notes

This is an internal LobeHub module (`"private": true`) and is not published as a standalone package. It is designed for LobeChat's specific needs and tightly integrated with other system components.

## 🤝 Contribution

Web structures are diverse and complex. We welcome community contributions for specific website crawling rules. You can participate in improvements through:

### How to Contribute URL Rules

1. Add new rules to the [urlRules.ts](https://github.com/lobehub/lobe-chat/blob/main/packages/web-crawler/src/urlRules.ts) file
2. Rule example:

```typescript
// Example: handling specific websites
const url = [
  // ... other URL matching rules
  {
    // URL matching pattern (regular expressions only)
    urlPattern: 'https://example.com/articles/(.*)',

    // Optional: URL transformation, redirects to an easier-to-crawl version
    urlTransform: 'https://example.com/print/$1',

    // Optional: crawling implementation; supports 'naive', 'jina', and 'browserless'
    impls: ['naive', 'jina', 'browserless'],

    // Optional: content filtering configuration
    filterOptions: {
      // Whether to enable the Readability algorithm to filter out distracting elements
      enableReadability: true,
      // Whether to convert to plain text
      pureText: false,
    },
  },
];
```
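The `urlPattern`/`urlTransform` pair works as a regex capture-and-substitute: the pattern's capture groups are spliced into the transform template via `$1`, `$2`, and so on. A minimal sketch of that matching logic — `UrlRule` and `applyRule` are illustrative names here, not the package's internals:

```typescript
// Hypothetical helper, not the package's real API.
interface UrlRule {
  urlPattern: string;
  urlTransform?: string;
}

// Match a URL against a rule's pattern; on a hit, substitute capture groups
// ($1, $2, ...) into the optional transform template.
function applyRule(url: string, rule: UrlRule): string | undefined {
  const match = url.match(new RegExp(`^${rule.urlPattern}$`));
  if (!match) return undefined;
  if (!rule.urlTransform) return url;
  return rule.urlTransform.replace(/\$(\d+)/g, (_m: string, i: string) => match[Number(i)] ?? '');
}

const transformed = applyRule('https://example.com/articles/42', {
  urlPattern: 'https://example.com/articles/(.*)',
  urlTransform: 'https://example.com/print/$1',
});
```

Here a matching article URL would be rewritten to its print-friendly variant before crawling, which is exactly the kind of redirect `urlTransform` is meant for.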

### Rule Submission Process

1. Fork the [LobeChat repository](https://github.com/lobehub/lobe-chat)
2. Add or modify URL rules
3. Submit a Pull Request describing:

- Target website characteristics
- Problems solved by the rule
- Test cases (example URLs)

## 📌 Note

This is an internal module of LobeHub (`"private": true`), designed specifically for LobeChat and not published as a standalone package.
61 changes: 61 additions & 0 deletions packages/web-crawler/README.zh-CN.md
@@ -0,0 +1,61 @@
# @lobechat/web-crawler

LobeChat's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.

## 📝 Introduction

`@lobechat/web-crawler` is a core component of LobeChat responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.

## 🛠️ Core Features

- **Intelligent Content Extraction**: identifies main content based on the Mozilla Readability algorithm
- **Multi-level Crawling Strategy**: supports multiple crawling implementations, including basic crawling, Jina, and Browserless rendering
- **Custom URL Rules**: handles site-specific crawling logic through a flexible rule system

## 🤝 Contribution

Web structures are diverse and complex. We welcome community contributions of crawling rules for specific websites. You can participate in improvements through:

### How to Contribute URL Rules

1. Add new rules to the [urlRules.ts](https://github.com/lobehub/lobe-chat/blob/main/packages/web-crawler/src/urlRules.ts) file
2. Rule example:

```typescript
// Example: handling specific websites
const url = [
  // ... other URL matching rules
  {
    // URL matching pattern (regular expressions only)
    urlPattern: 'https://example.com/articles/(.*)',

    // Optional: URL transformation, redirects to an easier-to-crawl version
    urlTransform: 'https://example.com/print/$1',

    // Optional: crawling implementation; supports 'naive', 'jina', and 'browserless'
    impls: ['naive', 'jina', 'browserless'],

    // Optional: content filtering configuration
    filterOptions: {
      // Whether to enable the Readability algorithm to filter out distracting elements
      enableReadability: true,
      // Whether to convert to plain text
      pureText: false,
    },
  },
];
```

### Rule Submission Process

1. Fork the [LobeChat repository](https://github.com/lobehub/lobe-chat)
2. Add or modify URL rules
3. Submit a Pull Request describing:

   - Target website characteristics
   - Problems the rule solves
   - Test cases (example URLs)

## 📌 Note

This is an internal module of LobeHub (`"private": true`), designed specifically for LobeChat and not published as a standalone package.
9 changes: 6 additions & 3 deletions packages/web-crawler/src/__test__/crawler.test.ts
@@ -80,9 +80,12 @@ describe('Crawler', () => {
});

expect(result).toEqual({
content: 'Fail to crawl the page. Error type: CrawlError, error message: Crawl failed',
errorMessage: 'Crawl failed',
errorType: 'CrawlError',
crawler: 'browserless',
data: {
content: 'Fail to crawl the page. Error type: CrawlError, error message: Crawl failed',
errorMessage: 'Crawl failed',
errorType: 'CrawlError',
},
originalUrl: 'https://example.com',
transformedUrl: undefined,
});
11 changes: 8 additions & 3 deletions packages/web-crawler/src/crawler.ts
@@ -32,6 +32,7 @@ export class Crawler {
...userFilterOptions,
};

let finalCrawler: string | undefined;
let finalError: Error | undefined;

const systemImpls = (ruleImpls ?? this.impls) as CrawlImplType[];
@@ -55,16 +56,20 @@
} catch (error) {
console.error(error);
finalError = error as Error;
finalCrawler = impl;
}
}

const errorType = finalError?.name || 'UnknownError';
const errorMessage = finalError?.message;

return {
content: `Fail to crawl the page. Error type: ${errorType}, error message: ${errorMessage}`,
errorMessage: errorMessage,
errorType,
crawler: finalCrawler,
data: {
content: `Fail to crawl the page. Error type: ${errorType}, error message: ${errorMessage}`,
errorMessage: errorMessage,
errorType,
},
originalUrl: url,
transformedUrl: transformedUrl !== url ? transformedUrl : undefined,
};
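The change above threads a `finalCrawler` variable through the crawler's fallback loop so that, when every implementation fails, the error payload can report which implementation produced the last error. A simplified synchronous sketch of that loop's shape — `crawlWithFallback` and `CrawlImpl` are illustrative names, not the actual API:

```typescript
type CrawlImpl = (url: string) => { content: string };

// Try each implementation in order; on total failure, report which crawler
// produced the last error alongside the nested error payload.
function crawlWithFallback(url: string, impls: Record<string, CrawlImpl>, order: string[]) {
  let finalCrawler: string | undefined;
  let finalError: Error | undefined;

  for (const name of order) {
    try {
      return { crawler: name, data: impls[name](url), originalUrl: url };
    } catch (error) {
      finalError = error as Error;
      finalCrawler = name; // remember the last impl that failed
    }
  }

  const errorType = finalError?.name || 'UnknownError';
  return {
    crawler: finalCrawler,
    data: {
      content: `Fail to crawl the page. Error type: ${errorType}, error message: ${finalError?.message}`,
      errorMessage: finalError?.message,
      errorType,
    },
    originalUrl: url,
  };
}
```

Nesting the error fields under `data` mirrors the success shape, which is what lets the test above assert on a uniform `{ crawler, data, originalUrl }` result.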
5 changes: 2 additions & 3 deletions packages/web-crawler/src/type.ts
@@ -11,6 +11,7 @@ export interface CrawlSuccessResult {
export interface CrawlErrorResult {
content: string;
errorMessage: string;
errorType: string;
url: string;
}

@@ -36,9 +37,7 @@ export interface CrawlUrlRule {
// Content filtering configuration (optional)
filterOptions?: FilterOptions;
impls?: CrawlImplType[];
// Whether to use regular-expression matching (defaults to glob mode)
isRegex?: boolean;
// URL matching pattern; supports glob patterns or regular expressions
// URL matching pattern; regular expressions only
urlPattern: string;
// URL转换模板(可选),如果提供则进行URL转换
urlTransform?: string;
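With `errorType` added to `CrawlErrorResult`, consumers can narrow the `CrawlSuccessResult | CrawlErrorResult` union with an `in` check, which is how the UI components in this commit branch on errors. A small self-contained sketch — the interfaces are abbreviated from type.ts and `summarize` is a hypothetical consumer:

```typescript
interface CrawlSuccessResult {
  content: string;
  url: string;
}

interface CrawlErrorResult {
  content: string;
  errorMessage: string;
  errorType: string;
  url: string;
}

function summarize(result: CrawlSuccessResult | CrawlErrorResult): string {
  // `'errorType' in result` narrows the union: only the error branch has the field.
  if ('errorType' in result) {
    return `error(${result.errorType}): ${result.errorMessage}`;
  }
  return `ok: ${result.content.length} chars`;
}
```

Because the discriminating field lives only on the error shape, no extra tag property is needed to tell the two result kinds apart.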
5 changes: 5 additions & 0 deletions packages/web-crawler/src/urlRules.ts
@@ -22,6 +22,11 @@ export const crawUrlRules: CrawlUrlRule[] = [
impls: ['jina'],
urlPattern: 'https://(.*).pdf',
},
// arXiv PDFs: use jina
{
impls: ['jina'],
urlPattern: 'https://arxiv.org/pdf/(.*)',
},
// Zhihu has anti-crawler protection; use jina
{
impls: ['jina'],
39 changes: 37 additions & 2 deletions src/tools/web-browsing/Portal/PageContent/index.tsx
@@ -1,4 +1,4 @@
import { Alert, CopyButton, Icon, Markdown } from '@lobehub/ui';
import { Alert, CopyButton, Highlighter, Icon, Markdown } from '@lobehub/ui';
import { Descriptions, Segmented, Typography } from 'antd';
import { createStyles } from 'antd-style';
import { ExternalLink } from 'lucide-react';
@@ -90,7 +90,42 @@ const PageContent = memo<PageContentProps>(({ result }) => {
const { styles } = useStyles();
const [display, setDisplay] = useState('render');

if (!result) return undefined;
if (!result || !result.data) return undefined;

if ('errorType' in result.data) {
return (
<Flexbox className={styles.footer} gap={4}>
<div>
<Descriptions
classNames={{
content: styles.footerText,
}}
column={1}
items={[
{
children: result.crawler,
label: t('search.crawPages.meta.crawler'),
},
]}
size="small"
/>
</div>
<Alert
extra={
<div style={{ maxWidth: 500, overflowX: 'scroll' }}>
<Highlighter language={'json'}>{JSON.stringify(result.data, null, 2)}</Highlighter>
</div>
}
message={
<div style={{ textAlign: 'start' }}>
{result.data.errorMessage || result.data.content}
</div>
}
type={'error'}
/>
</Flexbox>
);
}

const { url, title, description, content } = result.data;
return (
39 changes: 36 additions & 3 deletions src/tools/web-browsing/Render/PageContent/Result.tsx
@@ -1,7 +1,7 @@
'use client';

import { CrawlSuccessResult } from '@lobechat/web-crawler';
import { Icon } from '@lobehub/ui';
import { CrawlErrorResult, CrawlSuccessResult } from '@lobechat/web-crawler';
import { Alert, Highlighter, Icon } from '@lobehub/ui';
import { Descriptions, Typography } from 'antd';
import { createStyles } from 'antd-style';
import { ExternalLink } from 'lucide-react';
@@ -82,14 +82,47 @@ interface CrawlerData {
crawler: string;
messageId: string;
originalUrl: string;
result: CrawlSuccessResult;
result: CrawlSuccessResult | CrawlErrorResult;
}

const CrawlerResultCard = memo<CrawlerData>(({ result, messageId, crawler, originalUrl }) => {
const { t } = useTranslation('plugin');
const { styles } = useStyles();
const [openToolUI, togglePageContent] = useChatStore((s) => [s.openToolUI, s.togglePageContent]);

if ('errorType' in result) {
return (
<Flexbox className={styles.footer} gap={4}>
<div>
<Descriptions
classNames={{
content: styles.footerText,
}}
column={1}
items={[
{
children: crawler,
label: t('search.crawPages.meta.crawler'),
},
]}
size="small"
/>
</div>
<Alert
extra={
<div style={{ maxWidth: 500, overflowX: 'scroll' }}>
<Highlighter language={'json'}>{JSON.stringify(result, null, 2)}</Highlighter>
</div>
}
message={
<div style={{ textAlign: 'start' }}>{result.errorMessage || result.content}</div>
}
type={'error'}
/>
</Flexbox>
);
}

const { url, title, description } = result;

return (
12 changes: 11 additions & 1 deletion src/tools/web-browsing/Render/PageContent/index.tsx
@@ -1,3 +1,4 @@
import { CrawlErrorResult } from '@lobechat/web-crawler';
import { memo } from 'react';
import { Flexbox } from 'react-layout-kit';

@@ -31,7 +32,16 @@ const PagesContent = memo<PagesContentProps>(({ results, messageId, urls }) => {
key={result.originalUrl}
messageId={messageId}
originalUrl={result.originalUrl}
result={result.data}
result={
result.data ||
// TODO: Remove this in v2 as it's deprecated
({
content: (result as any)?.content,
errorMessage: (result as any)?.errorMessage,
errorType: (result as any)?.errorType,
url: result.originalUrl,
} as CrawlErrorResult)
}
/>
))}
</Flexbox>
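The `result.data || …` fallback above bridges the deprecated flat result shape (error fields at the top level) to the new nested one. That normalization, sketched on its own — `normalizeResult` is an illustrative name, not part of the codebase:

```typescript
interface CrawlErrorResult {
  content?: string;
  errorMessage?: string;
  errorType?: string;
  url: string;
}

// New results nest the payload under `data`; deprecated ones kept the error
// fields at the top level, so lift those into a CrawlErrorResult shape.
function normalizeResult(result: any): unknown {
  if (result.data) return result.data;
  return {
    content: result?.content,
    errorMessage: result?.errorMessage,
    errorType: result?.errorType,
    url: result.originalUrl,
  } as CrawlErrorResult;
}
```

Keeping the shim at the render boundary means stored legacy messages still display without a data migration, and the TODO marks it for removal in v2.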
4 changes: 2 additions & 2 deletions src/types/tool/crawler.ts
@@ -1,4 +1,4 @@
import { CrawlSuccessResult } from '@lobechat/web-crawler';
import { CrawlErrorResult, CrawlSuccessResult } from '@lobechat/web-crawler';

export interface CrawlSinglePageQuery {
url: string;
@@ -10,7 +10,7 @@ export interface CrawlMultiPagesQuery {

export interface CrawlResult {
crawler: string;
data: CrawlSuccessResult;
data: CrawlSuccessResult | CrawlErrorResult;
originalUrl: string;
}

