Merge pull request #157 from huawei-noah/zjj_release_1.7.1
release 1.7.1
zhangjiajin authored Oct 11, 2021
2 parents 013590c + e81a831 commit a4d5059
Showing 23 changed files with 911 additions and 379 deletions.
10 changes: 5 additions & 5 deletions README.cn.md
@@ -9,13 +9,13 @@

---

**Vega ver1.7.0 released**
**Vega ver1.7.1 released**

- Feature enhancement:
- Bug fixes:

- Releases Ascend MindStudio version.
- Provides data parallel training capabilities for Horovod (GPU) and HCCL (NPU).
- Fixed a bug where the BOHB algorithm might not stop automatically after more than three rounds.
- Added a limit on the maximum number of evaluation service attempts.
- Load YAML files with SafeLoader.
- Added exception handling for evaluation service input parameters.

---

10 changes: 5 additions & 5 deletions README.md
@@ -8,13 +8,13 @@

---

**Vega ver1.7.0 released**
**Vega ver1.7.1 released**

- Feature enhancement:
- Bug fixes:

- Releases Ascend MindStudio version.
- Provides data parallel training capabilities for Horovod (GPU) and HCCL (NPU).
- Fixed a bug where the BOHB algorithm might not stop automatically after more than three rounds.
- Added a limit on the maximum number of evaluation service attempts.
- Load YAML files with SafeLoader.
- Added exception handling for evaluation service input parameters.

---

2 changes: 1 addition & 1 deletion RELEASE.md
@@ -1,4 +1,4 @@
**Vega ver1.7.0 released:**
**Vega ver1.7.1 released:**

**Introduction**

82 changes: 69 additions & 13 deletions docs/cn/user/security_configure.md
@@ -1,21 +1,51 @@
# vega security configuration
## User data protection
The model scripts/files used for training, the pre-trained models, and the datasets are important data files and need to be protected. Setting correct file permissions improves their security. You can set the correct file permissions with the following command:
```
```shell
chmod 640 -R "file_path"
```
## Training server
### Training server security configuration
When a training node runs multi-card training, it starts dask and zmq services that listen on randomly chosen ports in the range 27000-34000 on the local address 127.0.0.1. To protect these services from malicious attacks, configure the firewall as follows:

## Security configuration file
On startup, vega tries to read the configuration in ```~/.vega/vega.ini```. If this file does not exist or its configuration is invalid, vega reports an error and exits.

After installing vega, you can initialize this file with the command ```vega-security-config -i```. After initialization, the file contains:
```ini
[security]
enable = True

[https]
cert_pem_file =
secret_key_file =
```
iptables -I OUTPUT -p tcp -m owner --uid-owner "user_id" -d 127.0.0.1 --match multiport --dports 27000:34000 -j ACCEPT
iptables -A OUTPUT -p tcp --match multiport -d 127.0.0.1 --dports 27000:34000 -j DROP
```[security] -> enable``` defaults to True. In this case you also need to configure ```cert_pem_file``` and ```secret_key_file``` under the ```[https]``` section. See the sections below for how to generate these two files. Once they are generated, you can either edit vega.ini directly to set these two items or configure them with the following command:
```shell
vega-security-config -m https -c "cert_file_path" -k "key_file_path"
# Replace "cert_file_path" and "key_file_path" with the real file paths
```
Here ```"user_id"``` must be replaced with the user's actual id, which you can look up by running ```id "username"```.
> Note: This configuration blocks all other users from accessing ports 27000-34000. In a multi-user environment, if another user also needs to run vega training jobs, run the first command with that user's id so that the user is added to the firewall whitelist.

> Note: You can also disable the security configuration by running ```vega-security-config -s 0```. Once it is disabled, communication between the training server and the inference server no longer uses https but plain http, so its security cannot be guaranteed.
>
> After disabling the security configuration, you can re-enable it with ```vega-security-config -s 1```.
>
The vega-security-config commands for managing the vega.ini file are summarized below; a combined usage example follows the block:
```shell
# 1. Initialize the vega.ini file
vega-security-config -i
# 2. Disable the security configuration
vega-security-config -s 0
# 3. Enable the security configuration
vega-security-config -s 1
# 4. Query whether the security switch is currently on
vega-security-config -q sec
# 5. Query the https certificate and key configuration
vega-security-config -q https
# 6. Configure the https certificate and key file paths
vega-security-config -m https -c "cert_file_path" -k "key_file_path"
# 7. Configure only the https certificate path (on the training server)
vega-security-config -m https -c "cert_file_path"
```
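As a purely illustrative sketch (the example_crt.pem and example_key.pem names are the placeholder files generated in the certificate section below, not files shipped with vega), a first-time setup on the node that holds both the certificate and the key (typically the evaluation server) could chain these commands:
```shell
# Illustrative sequence only; the certificate/key paths are placeholders.
vega-security-config -i                                   # create ~/.vega/vega.ini with security enabled
vega-security-config -m https -c "/home/<username>/.vega/example_crt.pem" \
                     -k "/home/<username>/.vega/example_key.pem"   # register certificate and key
vega-security-config -q sec                               # confirm the security switch is on
vega-security-config -q https                             # confirm the configured certificate/key paths
```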

## Evaluation server
### Evaluation server https security configuration
#### Generate the evaluation server key and certificate
@@ -25,30 +25,30 @@ iptables -A OUTPUT -p tcp --match multiport -d 127.0.0.1 --dports 27000:34000 -j
1. Copy /etc/pki/tls/openssl.cnf or /etc/ssl/openssl.cnf to the current folder

2. Edit the openssl.cnf file in the current directory and add the following to the [ v3_ca ] section
```
```ini
subjectAltName = IP:xx.xx.xx.xx
```
> Note: Replace xx.xx.xx.xx with the IP address of the inference server
>
3. Generate the server key
```
```shell
openssl genrsa -aes-256-ofb -out example_key.pem 4096
```
> Note: At this step you are prompted for a password that protects the key. You must remember this password yourself, and it must meet the password strength requirements; the detailed requirements are given in the section below on starting the evaluation server
>
4. Generate the certificate signing request
```
```shell
openssl req -new -key example_key.pem -out example.csr -extensions v3_ca \
-config openssl.cnf
```
5. Generate the self-signed certificate
```
```shell
openssl x509 -req -days 365 -in example.csr -signkey example_key.pem \
-out example_crt.pem -extensions v3_ca -extfile openssl.cnf
```
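As an optional check that is not part of the original steps, you can confirm that the self-signed certificate carries the subjectAltName configured earlier:
```shell
# Optional verification; prints the Subject Alternative Name section of the generated certificate.
openssl x509 -in example_crt.pem -noout -text | grep -A1 "Subject Alternative Name"
# Expected to show:  IP Address:xx.xx.xx.xx
```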
6. Set the key/certificate permissions
To keep the system secure, the permissions of the key/certificate files must be configured correctly. You can do so with the following command (an optional check follows):
```
```shell
sudo chmod 600 example_key.pem example_crt.pem
```
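As a quick sanity check (not part of the original procedure), list the files to verify the resulting mode:
```shell
ls -l example_key.pem example_crt.pem
# Expected mode after chmod 600: read/write for the owner only, e.g.
# -rw------- 1 <user> <group> ... example_key.pem
# -rw------- 1 <user> <group> ... example_crt.pem
```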

@@ -116,3 +116,29 @@ max_content_length=100000 # Limit the maximum request size to 100 KB
3. Must contain at least one lowercase letter
4. Must contain at least one digit
```
## Training server
### Training server security configuration
The training server must be configured with the inference server's certificate before it can send inference requests to the inference server. Configure it as follows:
Edit the configuration file `~/.vega/vega.ini` to set the key and certificate
```ini
[security]
enable = True # Must be set to True to enable encrypted https communication
[https]
cert_pem_file = /home/<username>/.vega/example_crt.pem # Replace username and the certificate file name
```
> Note: example_crt.pem here is the certificate file generated in the steps above. You need to copy it manually to the corresponding directory on the training node.
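A minimal sketch of that copy step, assuming the certificate was generated on the evaluation server; the host name eval-server, the user name, and the source path are placeholders:
```shell
# Run on the training node; eval-server, <username>, and the source path are placeholders.
scp <username>@eval-server:~/example_crt.pem /home/<username>/.vega/example_crt.pem
chmod 600 /home/<username>/.vega/example_crt.pem   # keep the copied certificate readable by the owner only
```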
### Training server firewall settings
When a training node runs multi-card training, it starts dask and zmq services that listen on randomly chosen ports in the range 27000-34000 on the local address 127.0.0.1. To protect these services from malicious attacks, configure the firewall as follows:

```shell
iptables -I OUTPUT -p tcp -m owner --uid-owner "user_id" -d 127.0.0.1 --match multiport --dports 27000:34000 -j ACCEPT
iptables -A OUTPUT -p tcp --match multiport -d 127.0.0.1 --dports 27000:34000 -j DROP
```
Here ```"user_id"``` must be replaced with the user's actual id, which you can look up by running ```id "username"``` (see the example below).
> Note: This configuration blocks all other users from accessing ports 27000-34000. In a multi-user environment, if another user also needs to run vega training jobs, run the first command with that user's id so that the user is added to the firewall whitelist.
>
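For example, assuming ```id``` reports uid 1000 for the training user (a hypothetical value), the rules would be entered as:
```shell
id "username"
# uid=1000(username) gid=1000(username) groups=1000(username)   <- example output; the uid here is 1000
iptables -I OUTPUT -p tcp -m owner --uid-owner 1000 -d 127.0.0.1 --match multiport --dports 27000:34000 -j ACCEPT
iptables -A OUTPUT -p tcp --match multiport -d 127.0.0.1 --dports 27000:34000 -j DROP
```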
4 changes: 2 additions & 2 deletions evaluate_service/hardwares/davinci/compile_atlas200.sh
@@ -25,11 +25,11 @@ mkdir -p build/intermediates/device
mkdir -p build/intermediates/host

cd build/intermediates/device
cmake ../../../src -Dtype=device -Dtarget=RC -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++
cmake ../../../src -Dtype=device -Dtarget=RC -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ -DCMAKE_CXX_FLAGS="-s" -DCMAKE_C_FLAGS="-s"
make install
echo "[INFO] build the device sucess"
cd ../host
cmake ../../../src -Dtype=host -Dtarget=RC -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++
cmake ../../../src -Dtype=host -Dtarget=RC -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ -DCMAKE_CXX_FLAGS="-s" -DCMAKE_C_FLAGS="-s"
make install
echo "[INFO] build the host sucess"

2 changes: 1 addition & 1 deletion evaluate_service/hardwares/davinci/compile_atlas300.sh
@@ -4,7 +4,7 @@ SAVE_PATH=$2
cd $EXAMPLE_DIR/
mkdir -p build/intermediates/host
cd build/intermediates/host
cmake ../../../src -DCMAKE_CXX_COMPILER=g++ -DCMAKE_SKIP_RPATH=TRUE
cmake ../../../src -DCMAKE_CXX_COMPILER=g++ -DCMAKE_SKIP_RPATH=TRUE -DCMAKE_CXX_FLAGS="-s" -DCMAKE_C_FLAGS="-s"
make

cd ../../../out
18 changes: 12 additions & 6 deletions evaluate_service/main.py
@@ -60,10 +60,16 @@ def _add_params(cls, work_path, optional_params):

    def post(self):
        """Interface to response to the post request of the client."""
        self.parse_paras()
        self.upload_files()

        self.hardware_instance = ClassFactory.get_cls(self.hardware)(self.optional_params)
        try:
            self.parse_paras()
            self.upload_files()
            self.hardware_instance = ClassFactory.get_cls(self.hardware)(self.optional_params)
        except Exception:
            self.result["status"] = "Params error."
            self.result["error_message"] = traceback.format_exc()
            logging.error("[ERROR] Params error!")
            traceback.print_exc()
            return self.result

        if self.reuse_model == "True":
            logging.warning("Reuse the model, no need to convert the model.")
@@ -77,9 +83,10 @@ def post(self):
                self.result["error_message"] = traceback.format_exc()
                logging.error("[ERROR] Model convert failed!")
                traceback.print_exc()
                return self.result
        try:
            latency_sum = 0
            for repeat in range(self.repeat_times):
            for repeat in range(min(self.repeat_times, 10)):
                latency, output = self.hardware_instance.inference(converted_model=self.share_dir,
                                                                   input_data=self.input_data)
                latency_sum += float(latency)
@@ -90,7 +97,6 @@ def post(self):
            self.result["error_message"] = traceback.format_exc()
            logging.error("[ERROR] Inference failed! ")
            traceback.print_exc()

        return self.result

def parse_paras(self):
74 changes: 37 additions & 37 deletions examples/features/custom_dataset/classification_dataset.yml
@@ -1,43 +1,43 @@
# ClassificationDataset is used to import image files.
# These files must be stored in a specified folder format.
#
# └─ custom_dataset
# ├─ train # Train dataset folder.
# │ ├─ class_1
# │ │ image 1.jpg
# │ │ image 2.jpeg
# │ │ image 3.png
# │ ├─ class_2
# │ │ image 1.jpg
# │ │ image 2.jpeg
# │ │ image 3.png
# │ └─ class_3
# │ │ image 1.jpg
# │ │ image 2.jpeg
# │ │ image 3.png
# ├─ val # This folder is optional. If the directory does not exist, you need to specify `portion` parameter.
# │ ├─ class_1
# │ │ image 1.jpg
# │ │ image 2.jpeg
# │ │ image 3.png
# │ ├─ class_2
# │ │ image 1.jpg
# │ │ image 2.jpeg
# │ │ image 3.png
# │ └─ class_3
# image 1.jpg
# image 2.jpeg
# image 3.png
# └─ test # Test dataset folder.
# ├─ class_1
# image 1.jpg
# image 2.jpeg
# image 3.png
# ├─ class_2
# image 1.jpg
# image 2.jpeg
# image 3.png
# └─ class_3
# +- custom_dataset
# +- train # Train dataset folder.
# | +- class_1
# | | image 1.jpg
# | | image 2.jpeg
# | | image 3.png
# | +- class_2
# | | image 1.jpg
# | | image 2.jpeg
# | | image 3.png
# | +- class_3
# | | image 1.jpg
# | | image 2.jpeg
# | | image 3.png
# +- val # This folder is optional. If the directory does not exist, you need to specify `portion` parameter.
# | +- class_1
# | | image 1.jpg
# | | image 2.jpeg
# | | image 3.png
# | +- class_2
# | | image 1.jpg
# | | image 2.jpeg
# | | image 3.png
# | +- class_3
# | image 1.jpg
# | image 2.jpeg
# | image 3.png
# +- test # Test dataset folder.
# +- class_1
# | image 1.jpg
# | image 2.jpeg
# | image 3.png
# +- class_2
# | image 1.jpg
# | image 2.jpeg
# | image 3.png
# +- class_3
# image 1.jpg
# image 2.jpeg
# image 3.png
2 changes: 1 addition & 1 deletion setup.py
@@ -23,7 +23,7 @@

setuptools.setup(
name="noah-vega",
version="1.7.0",
version="1.7.1",
packages=["vega", "evaluate_service"],
include_package_data=True,
python_requires=">=3.6",
2 changes: 1 addition & 1 deletion vega/__init__.py
@@ -1,4 +1,4 @@
__version__ = "1.7.0"
__version__ = "1.7.1"


import sys
3 changes: 2 additions & 1 deletion vega/algorithms/compression/__init__.py
@@ -16,5 +16,6 @@
"prune_ea": ["PruneCodec", "PruneEA", "PruneSearchSpace", "PruneTrainerCallback"],
"prune_ea_mobilenet": ["PruneMobilenetCodec", "PruneMobilenetTrainerCallback"],
"quant_ea": ["QuantCodec", "QuantEA", "QuantTrainerCallback"],
"prune_dag": ["PruneDAGSearchSpace", "AdaptiveBatchNormalizationCallback", "SCOPDAGSearchSpace"],
"prune_dag": ["PruneDAGSearchSpace", "AdaptiveBatchNormalizationCallback", "SCOPDAGSearchSpace",
"KnockoffFeaturesCallback"],
})
4 changes: 3 additions & 1 deletion vega/algorithms/compression/prune_dag/__init__.py
@@ -1,3 +1,5 @@
from .prune_dag import PruneDAGSearchSpace, AdaptiveBatchNormalizationCallback, SCOPDAGSearchSpace
from .knockoff_callback import KnockoffFeaturesCallback

__all__ = ["PruneDAGSearchSpace", "AdaptiveBatchNormalizationCallback", "SCOPDAGSearchSpace"]
__all__ = ["PruneDAGSearchSpace", "AdaptiveBatchNormalizationCallback", "SCOPDAGSearchSpace",
"KnockoffFeaturesCallback"]