update ai data masking doc (#1310)
johnlanni authored Sep 13, 2024
1 parent 452bd4e commit 7610c9f
Showing 2 changed files with 154 additions and 32 deletions.
55 changes: 23 additions & 32 deletions plugins/wasm-rust/extensions/ai-data-masking/README.md
@@ -1,44 +1,35 @@
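# Function Description
---
title: AI Data Masking
keywords: [higress,ai data masking]
description: AI Data Masking Plugin Configuration Reference
---

## Function Description

Intercepts and replaces sensitive words in requests/responses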

![image](https://github.com/user-attachments/assets/f281c8c3-9613-4053-94aa-067694cc5fd4)


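```mermaid
sequenceDiagram
participant User
participant Plugin as Sensitive Word Plugin
participant Backend as Backend Service
User->>Plugin: Request data (e.g., containing [email protected])
Plugin->>Plugin: Parse data
opt If a blocked word is present
Plugin-->>User: Return preset error message (intercepted)
end
opt Replace sensitive words
Plugin->>Backend: Request data after keyword replacement ([email protected] replaced with ****@gmail.com)
Backend->>Plugin: Original response (contains ****@gmail.com)
Plugin->>User: Response data after restoration (****@gmail.com restored to [email protected])
end
```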
![image](https://img.alicdn.com/imgextra/i4/O1CN0156Wtko1T9JO0RiWow_!!6000000002339-0-tps-1314-638.jpg)

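## Data Handling Scope
### Data Handling Scope
- openai protocol: request/response conversation content
- jsonpath: only the specified fields are processed
- raw: the entire request/response body

## Sensitive Word Interception
### Sensitive Word Interception
- If a sensitive word appears within the data handling scope, the request is intercepted directly and a preset error message is returned
- Supports the built-in system sensitive word library and custom sensitive words

## Sensitive Word Replacement
### Sensitive Word Replacement
- Sensitive words in the request data are replaced with masked strings before the data is passed to the backend service, which keeps sensitive data from leaving the domain
- Some masked data can be restored after the backend service returns its response
- Custom rules support standard regular expressions and grok rules; replacement strings support variable substitution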

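## Execution Properties

Plugin execution phase: `Authentication Phase`
Plugin execution priority: `991`

# Configuration Fields
## Configuration Fields

| Name | Data Type | Default Value | Description |
| -------- | -------- | -------- | -------- |
@@ -57,7 +48,7 @@ sequenceDiagram
| replace_roles.restore | bool | false | Whether to restore |
| replace_roles.value | string | - | Replacement value (supports regex variables) |

# Configuration Example
## Configuration Example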

```yaml
system_deny: true
@@ -98,17 +89,17 @@ replace_roles:
# The hashed value is sent to the large model; hash values in the data returned by the large model are restored to the original values
```

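# Sensitive Word Replacement Example
## Sensitive Word Replacement Example

## User Request Content
### User Request Content

Please change `curl http://172.20.5.14/api/openai/v1/chat/completions -H "Authorization: sk-12345" -H "Auth: [email protected]"` to the POST method

## Processed Request Sent to the Large Model
### Processed Request Sent to the Large Model

Change `curl http://***.***.***.***/api/openai/v1/chat/completions -H "Authorization: 48a7e98a91d93896d8dac522c5853948" -H "Auth: ****@gmail.com"` to the POST method

## Content Returned by the Large Model
### Content Returned by the Large Model

You want to convert a `curl` GET request into a POST request that sends data to a specific API. Below is the modified `curl` command, which sends the request using POST: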

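@@ -129,7 +120,7 @@ curl -X POST \

Please note that you need to replace `"key":"value"` with the actual data you want to send. If your API accepts a different data structure or requires specific fields, adjust this part accordingly.

## Processed Content Returned to the User
### Processed Content Returned to the User

You want to convert a `curl` GET request into a POST request that sends data to a specific API. Below is the modified `curl` command, which sends the request using POST:

@@ -151,7 +142,7 @@ curl -X POST \
Please note that you need to replace `"key":"value"` with the actual data you want to send. If your API accepts a different data structure or requires specific fields, adjust this part accordingly.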


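# Related Notes
## Related Notes

- In streaming mode, if a masked word is split across multiple chunks, it may not be possible to restore it
- In streaming mode, if a sensitive word is split across multiple chunks, part of the sensitive word may be returned to the user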
131 changes: 131 additions & 0 deletions plugins/wasm-rust/extensions/ai-data-masking/README_EN.md
@@ -0,0 +1,131 @@
---
title: AI Data Masking
keywords: [higress, ai data masking]
description: AI Data Masking Plugin Configuration Reference
---
## Function Description
Intercepts and replaces sensitive words in requests and responses.
![image](https://img.alicdn.com/imgextra/i4/O1CN0156Wtko1T9JO0RiWow_!!6000000002339-0-tps-1314-638.jpg)

### Data Handling Scope
- openai protocol: Request/response conversation content
- jsonpath: Only process specified fields
- raw: Entire request/response body

### Sensitive Word Interception
- If a sensitive word appears within the data handling scope, the request is intercepted directly and a preset error message is returned
- Supports the built-in system sensitive word library and custom sensitive words
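
The interception behavior described above can be sketched roughly as follows. This is a minimal illustration and not the plugin's implementation: the function name `check_deny`, the case-sensitive substring matching, and the returned tuple shape are all assumptions. Only the default values are taken from the configuration table in this README, and only the non-openai (raw) response is shown.

```python
# Defaults copied from the configuration table in this README.
DEFAULTS = {
    "deny_code": 200,
    "deny_content_type": "application/json",
    "deny_raw_message": '{"errmsg":"Sensitive words found in the question or answer have been blocked"}',
}


def check_deny(text, deny_words, config=DEFAULTS):
    """Return (status, content_type, body) if text must be blocked, else None.

    Hypothetical helper: plain substring matching is an assumption of this sketch.
    """
    if any(word in text for word in deny_words):
        return (
            config["deny_code"],
            config["deny_content_type"],
            config["deny_raw_message"],
        )
    return None


words = ["Custom sensitive word 1", "Custom sensitive word 2"]
blocked = check_deny("this contains Custom sensitive word 1", words)
allowed = check_deny("a harmless question", words)
```

Here `blocked` carries the preset response built from `deny_code`, `deny_content_type`, and `deny_raw_message`, while `allowed` is `None`, meaning the request would continue to the replacement step.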

### Sensitive Word Replacement
- Sensitive words in the request data are replaced with masked strings before the data is passed to the backend service, which keeps sensitive data from leaving the domain
- Some masked data can be restored after the backend service returns its response
- Custom rules support standard regular expressions and grok rules; replacement strings support variable substitution
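
The mask-then-restore round trip can be sketched as below, using the `sk-[0-9a-zA-Z]*` rule that appears in this README's configuration example. Everything else is an assumption made for illustration: the helper names `mask` and `restore`, the in-memory dictionary, and in particular SHA-256, which is used only because it is always available in Python. The plugin's actual hash function is not stated in this document.

```python
import hashlib
import re

KEY_RULE = re.compile(r"sk-[0-9a-zA-Z]*")  # regex taken from the configuration example


def mask(text, restore_map):
    """Replace each match with a digest and remember how to undo it."""
    def _sub(match):
        # SHA-256 is an assumption of this sketch, not the plugin's documented choice.
        digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()
        restore_map[digest] = match.group(0)
        return digest
    return KEY_RULE.sub(_sub, text)


def restore(text, restore_map):
    """Swap digests found in the backend response back to the original values."""
    for digest, original in restore_map.items():
        text = text.replace(digest, original)
    return text


mapping = {}
request = 'curl -H "Authorization: sk-12345"'
masked_request = mask(request, mapping)               # what the backend would see
response = restore("Use this: " + masked_request, mapping)  # what the user would see
```

The backend only ever receives the digest, and the original key reappears only on the way back to the user, which matches the behavior described for rules with `restore: true`.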

## Execution Properties
Plugin Execution Phase: `Authentication Phase`
Plugin Execution Priority: `991`

## Configuration Fields
| Name | Data Type | Default Value | Description |
| ---------------------- | ---------------- | -------------- | ------------------------------------ |
| deny_openai | bool | true | Intercept openai protocol |
| deny_jsonpath          | array of string  | []             | Intercept specified jsonpath         |
| deny_raw | bool | false | Intercept raw body |
| system_deny | bool | true | Enable built-in interception rules |
| deny_code | int | 200 | HTTP status code when intercepted |
| deny_message           | string           | Sensitive words found in the question or answer have been blocked | Message returned in the AI response when a request is intercepted |
| deny_raw_message       | string           | {"errmsg":"Sensitive words found in the question or answer have been blocked"} | Response body returned when a non-openai request is intercepted |
| deny_content_type      | string           | application/json | Content-Type header returned when a non-openai request is intercepted |
| deny_words | array of string | [] | Custom sensitive word list |
| replace_roles | array | - | Custom sensitive word regex replacement |
| replace_roles.regex    | string           | -              | Rule regex (built-in GROK rules are supported) |
| replace_roles.type | [replace, hash] | - | Replacement type |
| replace_roles.restore | bool | false | Whether to restore |
| replace_roles.value | string | - | Replacement value (supports regex variables) |

## Configuration Example
```yaml
system_deny: true
deny_openai: true
deny_jsonpath:
- "$.messages[*].content"
deny_raw: true
deny_code: 200
deny_message: "Sensitive words found in the question or answer have been blocked"
deny_raw_message: "{\"errmsg\":\"Sensitive words found in the question or answer have been blocked\"}"
deny_content_type: "application/json"
deny_words:
- "Custom sensitive word 1"
- "Custom sensitive word 2"
replace_roles:
- regex: "%{MOBILE}"
type: "replace"
value: "****"
# Mobile number 13800138000 -> ****
- regex: "%{EMAILLOCALPART}@%{HOSTNAME:domain}"
type: "replace"
restore: true
value: "****@$domain"
# Email [email protected] -> ****@gmail.com
- regex: "%{IP}"
type: "replace"
restore: true
value: "***.***.***.***"
# IP 192.168.0.1 -> ***.***.***.***
- regex: "%{IDCARD}"
type: "replace"
value: "****"
# ID card number 110000000000000000 -> ****
- regex: "sk-[0-9a-zA-Z]*"
restore: true
type: "hash"
# hash sk-12345 -> 9cb495455da32f41567dab1d07f1973d
# The hashed value is provided to the large model, and the hash value will be restored to the original value from the data returned by the large model
```
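
To make the `%{NAME:alias}` and `$alias` notation in the example more concrete, here is a rough sketch of how a grok-style rule could be expanded into a plain regular expression and how the replacement value could pick up the captured variable. The pattern table is a hypothetical, heavily simplified stand-in for the real grok library, and `expand_grok` and `apply_rule` are illustrative names, not plugin APIs.

```python
import re

# Hypothetical, simplified pattern table; real grok definitions are more thorough.
GROK = {
    "EMAILLOCALPART": r"[A-Za-z0-9._%+-]+",
    "HOSTNAME": r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+",
}
TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand_grok(rule):
    """Turn %{NAME} and %{NAME:alias} tokens into regex groups."""
    def _sub(match):
        name, alias = match.group(1), match.group(2)
        body = GROK[name]
        return f"(?P<{alias}>{body})" if alias else f"(?:{body})"
    return TOKEN.sub(_sub, rule)


def apply_rule(rule, value, text):
    """Replace every match of rule in text, filling $alias variables in value."""
    pattern = re.compile(expand_grok(rule))

    def _sub(match):
        return re.sub(r"\$(\w+)", lambda var: match.group(var.group(1)), value)
    return pattern.sub(_sub, text)


masked = apply_rule(
    "%{EMAILLOCALPART}@%{HOSTNAME:domain}",
    "****@$domain",
    "mail bob@example.org now",
)
```

With these assumed patterns, the local part of the address is hidden while the captured `domain` is carried into the replacement, mirroring the `****@$domain` rule in the configuration above.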

## Sensitive Word Replacement Example
### User Request Content
Please change `curl http://172.20.5.14/api/openai/v1/chat/completions -H "Authorization: sk-12345" -H "Auth: [email protected]"` to the POST method

### Processed Request Sent to the Large Model
Change `curl http://***.***.***.***/api/openai/v1/chat/completions -H "Authorization: 48a7e98a91d93896d8dac522c5853948" -H "Auth: ****@gmail.com"` to the POST method

### Content Returned by the Large Model
You want to convert a `curl` GET request into a POST request that sends data to a specific API. Below is the modified `curl` command, which sends the request using POST:
```sh
curl -X POST \
-H "Authorization: 48a7e98a91d93896d8dac522c5853948" \
-H "Auth: ****@gmail.com" \
-H "Content-Type: application/json" \
-d '{"key":"value"}' \
http://***.***.***.***/api/openai/v1/chat/completions
```
The following modifications were made:
- `-X POST` sets the request method to POST.
- `-H "Content-Type: application/json"` sets the `Content-Type` in the request header to `application/json`, which is typically used to inform the server that the data you are sending is in JSON format.
- `-d '{"key":"value"}'` sets the data to be sent, where `'{"key":"value"}'` is a simple example of a JSON object. You need to replace it with the actual data you want to send.

Please note that you need to replace `"key":"value"` with the actual data content you want to send. If your API accepts a different data structure or requires specific fields, please adjust this part according to your actual situation.

### Processed Content Returned to the User
You want to convert a `curl` GET request into a POST request that sends data to a specific API. Below is the modified `curl` command, which sends the request using POST:
```sh
curl -X POST \
-H "Authorization: sk-12345" \
-H "Auth: [email protected]" \
-H "Content-Type: application/json" \
-d '{"key":"value"}' \
http://172.20.5.14/api/openai/v1/chat/completions
```
The following modifications were made:
- `-X POST` sets the request method to POST.
- `-H "Content-Type: application/json"` sets the `Content-Type` in the request header to `application/json`, which is typically used to inform the server that the data you are sending is in JSON format.
- `-d '{"key":"value"}'` sets the data to be sent, where `'{"key":"value"}'` is a simple example of a JSON object. You need to replace it with the actual data you want to send.

Please note that you need to replace `"key":"value"` with the actual data content you want to send. If your API accepts a different data structure or requires specific fields, please adjust this part according to your actual situation.

## Related Notes
- In streaming mode, if the masked words are split across multiple chunks, restoration may not be possible
- In streaming mode, if sensitive words are split across multiple chunks, there may be cases where part of the sensitive word is returned to the user
- Grok built-in rule list: https://help.aliyun.com/zh/sls/user-guide/grok-patterns
- Built-in sensitive word library data source: https://github.com/houbb/sensitive-word/tree/master/src/main/resources
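
The streaming caveat in the first two notes can be illustrated with a small sketch. It assumes, purely for illustration, that restoration is a per-chunk string replacement driven by a map of masked values to originals; `restore_chunk`, the map contents, and the example address are hypothetical.

```python
def restore_chunk(chunk, restore_map):
    """Restore masked values that appear whole inside a single chunk."""
    for masked, original in restore_map.items():
        chunk = chunk.replace(masked, original)
    return chunk


restore_map = {"****@example.com": "alice@example.com"}

# The masked value arrives inside one chunk, so it can be restored.
whole = [restore_chunk(c, restore_map) for c in ["Auth: ****@example.com", " done"]]

# The masked value is split across two chunks, so neither chunk matches.
split = [restore_chunk(c, restore_map) for c in ["Auth: ****@exam", "ple.com done"]]
```

Joining `whole` yields the restored address, while joining `split` still shows the masked value. The same reasoning applies in the other direction: a sensitive word split across chunks would not match a per-chunk check, so part of it could pass through.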
