Add generate endpoint to Tutorial (#71)

triton-inference-server · Nov 14, 2023 · 8a2d268
1 parent c55bd64
commit 8a2d268

Showing 2 changed files with 9 additions and 4 deletions.
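
For readers who would rather call the generate endpoint from Python than from curl, the request this commit adds to the README can be sketched with the standard library alone. This is a minimal sketch, not part of the commit: the helper names `build_generate_request` and `generate` are hypothetical, the host `localhost:8000` and model name `falcon7b` are taken from the README's curl example, and the shape of the JSON response is assumed to depend on the outputs the deployed model defines, so the sketch returns the parsed body as-is.

```python
import json
import urllib.request


def build_generate_request(model_name, prompt, host="localhost:8000"):
    """Return (url, body) for a POST to a model's generate endpoint.

    Hypothetical helper: mirrors the README's curl call, which sends
    {"text_input": "<prompt>"} to /v2/models/<model>/generate.
    """
    url = f"http://{host}/v2/models/{model_name}/generate"
    body = json.dumps({"text_input": prompt}).encode("utf-8")
    return url, body


def generate(model_name, prompt, host="localhost:8000", timeout=60.0):
    """Send the prompt to a running Triton server and return the parsed JSON.

    Assumes a server from the 23.10 release or later is already serving
    the named model; the response fields are not interpreted here.
    """
    url, body = build_generate_request(model_name, prompt, host)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from the tutorial running, `generate("falcon7b", "How can you be")` would issue the same request as the curl command added to the README. Keeping the URL and payload construction in its own function makes it possible to check the request shape without a live server.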
diff --git a/Quick_Deploy/HuggingFaceTransformers/Dockerfile b/Quick_Deploy/HuggingFaceTransformers/Dockerfile
@@ -23,5 +23,5 @@
 # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-FROM nvcr.io/nvidia/tritonserver:23.09-py3
+FROM nvcr.io/nvidia/tritonserver:23.10-py3
 RUN pip install transformers==4.34.0 protobuf==3.20.3 sentencepiece==0.1.99 accelerate==0.23.0 einops==0.6.1
diff --git a/Quick_Deploy/HuggingFaceTransformers/README.md b/Quick_Deploy/HuggingFaceTransformers/README.md
@@ -93,7 +93,7 @@ I0922 23:28:40.395611 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8
 
 Now we can query the server using curl, specifying the server address and input details:
 
-```json
+```bash
 curl -X POST localhost:8000/v2/models/falcon7b/infer -d '{"inputs": [{"name":"text_input","datatype":"BYTES","shape":[1],"data":["I am going"]}]}'
 ```
 In our testing, the server returned the following result (formatted for legibility):
@@ -135,7 +135,7 @@ Again, launch the server by invoking the `docker run` command from above and wai
 that the server has launched successfully.
 
 Query the server making sure to change the host address for each model:
-```json
+```bash
 curl -X POST localhost:8000/v2/models/falcon7b/infer -d '{"inputs": [{"name":"text_input","datatype":"BYTES","shape":[1],"data":["How can you be"]}]}'
 curl -X POST localhost:8000/v2/models/persimmon8b/infer -d '{"inputs": [{"name":"text_input","datatype":"BYTES","shape":[1],"data":["Where is the nearest"]}]}'
 ```
@@ -147,7 +147,12 @@ In our testing, these queries returned the following parsed results:
 # persimmon8b
 "Where is the nearest starbucks?"
 ```
+Beginning in the 23.10 release, users can now interact with large language models (LLMs) hosted
+by Triton in a simplified fashion by using Triton's generate endpoint:
 
+```bash
+curl -X POST localhost:8000/v2/models/falcon7b/generate -d '{"text_input":"How can you be"}'
+```
 ## 'Day Zero' Support
 
 The latest transformer models may not always be supported in the most recent, official
@@ -206,7 +211,7 @@ the Triton server using the `docker run` command from above.
 Once Triton launches successfully, start a Triton SDK container by running the following in a separate window:
 
 ```bash
-docker run -it --net=host nvcr.io/nvidia/tritonserver:23.09-py3-sdk bash
+docker run -it --net=host nvcr.io/nvidia/tritonserver:23.10-py3-sdk bash
 ```
 This container comes with all of Triton's deployment analyzers pre-installed, meaning
 we can simply enter the following to get feedback on our model's inference performance: