Merge pull request #157 from huawei-noah/zjj_release_1.7.1
release 1.7.1
zhangjiajin authored Oct 11, 2021
2 parents 013590c + e81a831 commit a4d5059
Showing 23 changed files with 911 additions and 379 deletions.
10 changes: 5 additions & 5 deletions README.cn.md
@@ -9,13 +9,13 @@

---

**Vega ver1.7.0 released**
**Vega ver1.7.1 released**

- Feature enhancement:
- Bug fixes:

- Releases Ascend MindStudio version.
- Provides data parallel training capabilities for Horovod (GPU) and HCCL (NPU).
- Fixed a bug where the BOHB algorithm might not stop automatically after more than three rounds.
- Added a limit on the maximum number of evaluation service attempts.
- Load YAML files with SafeLoader.
- Added exception handling for evaluation service input parameters.

---

10 changes: 5 additions & 5 deletions README.md
@@ -8,13 +8,13 @@

---

**Vega ver1.7.0 released**
**Vega ver1.7.1 released**

- Feature enhancement:
- Bug fixes:

- Releases Ascend MindStudio version.
- Provides data parallel training capabilities for Horovod (GPU) and HCCL (NPU).
- Fixed a bug where the BOHB algorithm might not stop automatically after more than three rounds.
- Added a limit on the maximum number of evaluation service attempts.
- Load YAML files with SafeLoader.
- Added exception handling for evaluation service input parameters.

---

2 changes: 1 addition & 1 deletion RELEASE.md
@@ -1,4 +1,4 @@
**Vega ver1.7.0 released:**
**Vega ver1.7.1 released:**

**Introduction**

82 changes: 69 additions & 13 deletions docs/cn/user/security_configure.md
@@ -1,21 +1,51 @@
# vega security configuration
## User data protection
The model scripts/files used for training, the pre-trained models, and the datasets are important data files and need to be protected. Setting correct file permissions improves their security. You can set the correct file permissions with the following command:
```
```shell
chmod 640 -R "file_path"
```
## Training server
### Training server security configuration
When a training node runs multi-card training, it starts dask and zmq services that listen on randomly chosen ports in the range 27000-34000 on the local address 127.0.0.1. To protect these services from malicious attacks, configure the firewall as follows:

## Security configuration file
On startup, vega tries to read the configuration in ```~/.vega/vega.ini```. If this file does not exist or its configuration is invalid, vega reports an error and exits.

After installing vega, you can initialize this file with the command ```vega-security-config -i```. After initialization, the file contains:
```ini
[security]
enable = True

[https]
cert_pem_file =
secret_key_file =
```
iptables -I OUTPUT -p tcp -m owner --uid-owner "user_id" -d 127.0.0.1 --match multiport --dports 27000:34000 -j ACCEPT
iptables -A OUTPUT -p tcp --match multiport -d 127.0.0.1 --dports 27000:34000 -j DROP
```[security] -> enable``` defaults to True. In this case you also need to configure ```cert_pem_file``` and ```secret_key_file``` under the ```[https]``` section. See the sections below for how to generate these two files. Once they are generated, you can either edit vega.ini directly to set these two items or configure them with the following command:
```shell
vega-security-config -m https -c "cert_file_path" -k "key_file_path"
# Replace "cert_file_path" and "key_file_path" with the real file paths
```
Here ```"user_id"``` must be replaced with the user's actual id, which you can look up by running ```id "username"```.
> Note: This configuration blocks all other users from accessing ports 27000-34000. In a multi-user environment, if another user also needs to run vega training jobs, run the first command with that user's id so that the user is added to the firewall whitelist.

> Note: You can also disable the security configuration by running ```vega-security-config -s 0```. Once it is disabled, communication between the training server and the inference server no longer uses https but plain http, so its security cannot be guaranteed.
>
> After disabling the security configuration, you can re-enable it with ```vega-security-config -s 1```.
>
The vega-security-config commands for managing the vega.ini file are summarized below; a combined usage example follows the block:
```shell
# 1. Initialize the vega.ini file
vega-security-config -i
# 2. Disable the security configuration
vega-security-config -s 0
# 3. Enable the security configuration
vega-security-config -s 1
# 4. Query whether the security switch is currently on
vega-security-config -q sec
# 5. Query the https certificate and key configuration
vega-security-config -q https
# 6. Configure the https certificate and key file paths
vega-security-config -m https -c "cert_file_path" -k "key_file_path"
# 7. Configure only the https certificate path (on the training server)
vega-security-config -m https -c "cert_file_path"
```
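As a purely illustrative sketch (the example_crt.pem and example_key.pem names are the placeholder files generated in the certificate section below, not files shipped with vega), a first-time setup on the node that holds both the certificate and the key (typically the evaluation server) could chain these commands:
```shell
# Illustrative sequence only; the certificate/key paths are placeholders.
vega-security-config -i                                   # create ~/.vega/vega.ini with security enabled
vega-security-config -m https -c "/home/<username>/.vega/example_crt.pem" \
                     -k "/home/<username>/.vega/example_key.pem"   # register certificate and key
vega-security-config -q sec                               # confirm the security switch is on
vega-security-config -q https                             # confirm the configured certificate/key paths
```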

## Evaluation server
### Evaluation server https security configuration
#### Generate the evaluation server key and certificate
@@ -25,30 +25,30 @@ iptables -A OUTPUT -p tcp --match multiport -d 127.0.0.1 --dports 27000:34000 -j
1. Copy /etc/pki/tls/openssl.cnf or /etc/ssl/openssl.cnf to the current folder

2. Edit the openssl.cnf file in the current directory and add the following to the [ v3_ca ] section
```
```ini
subjectAltName = IP:xx.xx.xx.xx
```
> Note: Replace xx.xx.xx.xx with the IP address of the inference server
>
3. Generate the server key
```
```shell
openssl genrsa -aes-256-ofb -out example_key.pem 4096
```
> Note: At this step you are prompted for a password that protects the key. You must remember this password yourself, and it must meet the password strength requirements; the detailed requirements are given in the section below on starting the evaluation server
>
4. Generate the certificate signing request
```
```shell
openssl req -new -key example_key.pem -out example.csr -extensions v3_ca \
-config openssl.cnf
```
5. Generate the self-signed certificate
```
```shell
openssl x509 -req -days 365 -in example.csr -signkey example_key.pem \
-out example_crt.pem -extensions v3_ca -extfile openssl.cnf
```
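As an optional check that is not part of the original steps, you can confirm that the self-signed certificate carries the subjectAltName configured earlier:
```shell
# Optional verification; prints the Subject Alternative Name section of the generated certificate.
openssl x509 -in example_crt.pem -noout -text | grep -A1 "Subject Alternative Name"
# Expected to show:  IP Address:xx.xx.xx.xx
```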
6. Set the key/certificate permissions
To keep the system secure, the permissions of the key/certificate files must be configured correctly. You can do so with the following command (an optional check follows):
```
```shell
sudo chmod 600 example_key.pem example_crt.pem
```
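As a quick sanity check (not part of the original procedure), list the files to verify the resulting mode:
```shell
ls -l example_key.pem example_crt.pem
# Expected mode after chmod 600: read/write for the owner only, e.g.
# -rw------- 1 <user> <group> ... example_key.pem
# -rw------- 1 <user> <group> ... example_crt.pem
```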

@@ -116,3 +116,29 @@ max_content_length=100000 # Limit the maximum request size to 100 KB
3. Must contain at least one lowercase letter
4. Must contain at least one digit
```
## Training server
### Training server security configuration
The training server must be configured with the inference server's certificate before it can send inference requests to the inference server. Configure it as follows:
Edit the configuration file `~/.vega/vega.ini` to set the key and certificate
```ini
[security]
enable = True # Must be set to True to enable encrypted https communication
[https]
cert_pem_file = /home/<username>/.vega/example_crt.pem # Replace username and the certificate file name
```
> Note: example_crt.pem here is the certificate file generated in the steps above. You need to copy it manually to the corresponding directory on the training node.
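A minimal sketch of that copy step, assuming the certificate was generated on the evaluation server; the host name eval-server, the user name, and the source path are placeholders:
```shell
# Run on the training node; eval-server, <username>, and the source path are placeholders.
scp <username>@eval-server:~/example_crt.pem /home/<username>/.vega/example_crt.pem
chmod 600 /home/<username>/.vega/example_crt.pem   # keep the copied certificate readable by the owner only
```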
### Training server firewall settings
When a training node runs multi-card training, it starts dask and zmq services that listen on randomly chosen ports in the range 27000-34000 on the local address 127.0.0.1. To protect these services from malicious attacks, configure the firewall as follows:

```shell
iptables -I OUTPUT -p tcp -m owner --uid-owner "user_id" -d 127.0.0.1 --match multiport --dports 27000:34000 -j ACCEPT
iptables -A OUTPUT -p tcp --match multiport -d 127.0.0.1 --dports 27000:34000 -j DROP
```
Here ```"user_id"``` must be replaced with the user's actual id, which you can look up by running ```id "username"``` (see the example below).
> Note: This configuration blocks all other users from accessing ports 27000-34000. In a multi-user environment, if another user also needs to run vega training jobs, run the first command with that user's id so that the user is added to the firewall whitelist.
>
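For example, assuming ```id``` reports uid 1000 for the training user (a hypothetical value), the rules would be entered as:
```shell
id "username"
# uid=1000(username) gid=1000(username) groups=1000(username)   <- example output; the uid here is 1000
iptables -I OUTPUT -p tcp -m owner --uid-owner 1000 -d 127.0.0.1 --match multiport --dports 27000:34000 -j ACCEPT
iptables -A OUTPUT -p tcp --match multiport -d 127.0.0.1 --dports 27000:34000 -j DROP
```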
4 changes: 2 additions & 2 deletions evaluate_service/hardwares/davinci/compile_atlas200.sh
@@ -25,11 +25,11 @@ mkdir -p build/intermediates/device
mkdir -p build/intermediates/host

cd build/intermediates/device
cmake ../../../src -Dtype=device -Dtarget=RC -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++
cmake ../../../src -Dtype=device -Dtarget=RC -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ -DCMAKE_CXX_FLAGS="-s" -DCMAKE_C_FLAGS="-s"
make install
echo "[INFO] build the device sucess"
cd ../host
cmake ../../../src -Dtype=host -Dtarget=RC -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++
cmake ../../../src -Dtype=host -Dtarget=RC -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ -DCMAKE_CXX_FLAGS="-s" -DCMAKE_C_FLAGS="-s"
make install
echo "[INFO] build the host sucess"

2 changes: 1 addition & 1 deletion evaluate_service/hardwares/davinci/compile_atlas300.sh
@@ -4,7 +4,7 @@ SAVE_PATH=$2
cd $EXAMPLE_DIR/
mkdir -p build/intermediates/host
cd build/intermediates/host
cmake ../../../src -DCMAKE_CXX_COMPILER=g++ -DCMAKE_SKIP_RPATH=TRUE
cmake ../../../src -DCMAKE_CXX_COMPILER=g++ -DCMAKE_SKIP_RPATH=TRUE -DCMAKE_CXX_FLAGS="-s" -DCMAKE_C_FLAGS="-s"
make

cd ../../../out
18 changes: 12 additions & 6 deletions evaluate_service/main.py
@@ -60,10 +60,16 @@ def _add_params(cls, work_path, optional_params):

    def post(self):
        """Interface to response to the post request of the client."""
        self.parse_paras()
        self.upload_files()

        self.hardware_instance = ClassFactory.get_cls(self.hardware)(self.optional_params)
        try:
            self.parse_paras()
            self.upload_files()
            self.hardware_instance = ClassFactory.get_cls(self.hardware)(self.optional_params)
        except Exception:
            self.result["status"] = "Params error."
            self.result["error_message"] = traceback.format_exc()
            logging.error("[ERROR] Params error!")
            traceback.print_exc()
            return self.result

        if self.reuse_model == "True":
            logging.warning("Reuse the model, no need to convert the model.")
@@ -77,9 +83,10 @@ def post(self):
                self.result["error_message"] = traceback.format_exc()
                logging.error("[ERROR] Model convert failed!")
                traceback.print_exc()
                return self.result
        try:
            latency_sum = 0
            for repeat in range(self.repeat_times):
            for repeat in range(min(self.repeat_times, 10)):
                latency, output = self.hardware_instance.inference(converted_model=self.share_dir,
                                                                   input_data=self.input_data)
                latency_sum += float(latency)
@@ -90,7 +97,6 @@ def post(self):
            self.result["error_message"] = traceback.format_exc()
            logging.error("[ERROR] Inference failed! ")
            traceback.print_exc()

        return self.result

def parse_paras(self):
74 changes: 37 additions & 37 deletions examples/features/custom_dataset/classification_dataset.yml
@@ -1,43 +1,43 @@
# ClassificationDataset is used to import image files.
# These files must be stored in a specified folder format.
#
# └─ custom_dataset
# ├─ train # Train dataset folder.
# │ ├─ class_1
# │ │ image 1.jpg
# │ │ image 2.jpeg
# │ │ image 3.png
# │ ├─ class_2
# │ │ image 1.jpg
# │ │ image 2.jpeg
# │ │ image 3.png
# │ └─ class_3
# │ │ image 1.jpg
# │ │ image 2.jpeg
# │ │ image 3.png
# ├─ val # This folder is optional. If the directory does not exist, you need to specify `portion` parameter.
# │ ├─ class_1
# │ │ image 1.jpg
# │ │ image 2.jpeg
# │ │ image 3.png
# │ ├─ class_2
# │ │ image 1.jpg
# │ │ image 2.jpeg
# │ │ image 3.png
# │ └─ class_3
# image 1.jpg
# image 2.jpeg
# image 3.png
# └─ test # Test dataset folder.
# ├─ class_1
# image 1.jpg
# image 2.jpeg
# image 3.png
# ├─ class_2
# image 1.jpg
# image 2.jpeg
# image 3.png
# └─ class_3
# +- custom_dataset
# +- train # Train dataset folder.
# | +- class_1
# | | image 1.jpg
# | | image 2.jpeg
# | | image 3.png
# | +- class_2
# | | image 1.jpg
# | | image 2.jpeg
# | | image 3.png
# | +- class_3
# | | image 1.jpg
# | | image 2.jpeg
# | | image 3.png
# +- val # This folder is optional. If the directory does not exist, you need to specify `portion` parameter.
# | +- class_1
# | | image 1.jpg
# | | image 2.jpeg
# | | image 3.png
# | +- class_2
# | | image 1.jpg
# | | image 2.jpeg
# | | image 3.png
# | +- class_3
# | image 1.jpg
# | image 2.jpeg
# | image 3.png
# +- test # Test dataset folder.
# +- class_1
# | image 1.jpg
# | image 2.jpeg
# | image 3.png
# +- class_2
# | image 1.jpg
# | image 2.jpeg
# | image 3.png
# +- class_3
# image 1.jpg
# image 2.jpeg
# image 3.png
2 changes: 1 addition & 1 deletion setup.py
@@ -23,7 +23,7 @@

setuptools.setup(
name="noah-vega",
version="1.7.0",
version="1.7.1",
packages=["vega", "evaluate_service"],
include_package_data=True,
python_requires=">=3.6",
2 changes: 1 addition & 1 deletion vega/__init__.py
@@ -1,4 +1,4 @@
__version__ = "1.7.0"
__version__ = "1.7.1"


import sys
3 changes: 2 additions & 1 deletion vega/algorithms/compression/__init__.py
@@ -16,5 +16,6 @@
"prune_ea": ["PruneCodec", "PruneEA", "PruneSearchSpace", "PruneTrainerCallback"],
"prune_ea_mobilenet": ["PruneMobilenetCodec", "PruneMobilenetTrainerCallback"],
"quant_ea": ["QuantCodec", "QuantEA", "QuantTrainerCallback"],
"prune_dag": ["PruneDAGSearchSpace", "AdaptiveBatchNormalizationCallback", "SCOPDAGSearchSpace"],
"prune_dag": ["PruneDAGSearchSpace", "AdaptiveBatchNormalizationCallback", "SCOPDAGSearchSpace",
"KnockoffFeaturesCallback"],
})
4 changes: 3 additions & 1 deletion vega/algorithms/compression/prune_dag/__init__.py
@@ -1,3 +1,5 @@
from .prune_dag import PruneDAGSearchSpace, AdaptiveBatchNormalizationCallback, SCOPDAGSearchSpace
from .knockoff_callback import KnockoffFeaturesCallback

__all__ = ["PruneDAGSearchSpace", "AdaptiveBatchNormalizationCallback", "SCOPDAGSearchSpace"]
__all__ = ["PruneDAGSearchSpace", "AdaptiveBatchNormalizationCallback", "SCOPDAGSearchSpace",
"KnockoffFeaturesCallback"]