
🦙 Llama-based Question Answering Service

This project leverages the Llama model to generate answers to user requests. It consists of two primary components, a Processing Server and a Model Server, which work together to provide seamless and safe interactions.


📜 Architecture Overview

1. Processing Server

Handles user input and response processing, with two core tasks:

  • Preprocessing: Validates input for prohibited or offensive words (see the sketch after this list).
  • Postprocessing: Checks the toxicity level of the model's response and rechecks it for prohibited or offensive words.
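
A minimal sketch of what the preprocessing check might look like, assuming a simple whole-word filter; the word list and function name are illustrative, not the repository's actual implementation:

import re

# Placeholder list; the repository defines its own prohibited words.
PROHIBITED_WORDS = {"badword1", "badword2"}

def contains_prohibited(text: str) -> bool:
    # Match whole words, not substrings, so a harmless word never
    # trips the filter just because it contains a prohibited one.
    tokens = re.findall(r"[\w']+", text.lower())
    return any(token in PROHIBITED_WORDS for token in tokens)

If this check fails, the request can be rejected before it ever reaches the Model Server.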

2. Model Server

Hosts the LLama model and generates responses to user inputs.


✨ Features

  • Language Support: English only.
  • Toxicity Detection: Ensures safe responses by checking for offensive content both before and after generation (a sketch follows this list).
  • Dockerized Setup: Simplified deployment using a pre-built Docker image.
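
The README does not name the toxicity detector, so the classifier below is only an assumed stand-in; any Hugging Face text-classification model trained for toxicity follows the same pattern:

from transformers import pipeline

# "unitary/toxic-bert" is an assumption for illustration, not necessarily
# the detector this repository actually uses.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def is_toxic(text: str, threshold: float = 0.5) -> bool:
    result = toxicity(text)[0]  # e.g. {"label": "toxic", "score": 0.97}
    return result["score"] > threshold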

🚀 Installation Instructions

  1. Clone the repository:

    git clone https://github.com/sillymultifora/fluffy-octo-dollop.git
    cd fluffy-octo-dollop
  2. Install the required dependencies:

    pip install -r requirements.txt

The repository has been tested with Python 3.8 and CUDA 12.1.

Note:
This service uses the meta-llama/Meta-Llama-3.1-8B-Instruct model. To use this model, you must first be granted access on Hugging Face.
If you'd prefer a different model, simply update the model name in the configuration.
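
Once access is granted, authenticate locally so the server can download the weights. One common way, assuming the huggingface_hub CLI is installed:

huggingface-cli login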


🛠️ Running the Servers

Start the servers with a single command:

bash start_servers.sh

Alternatively, start the servers manually:

  1. Start the Model Server (vLLM server) in the background:

    vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --api-key my_token
  2. Start the Processing Server in the background:

    python processing_server.py --api-key my_token --api-base http://localhost:8000/v1/ --model-name meta-llama/Meta-Llama-3.1-8B-Instruct
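
Because vLLM exposes an OpenAI-compatible API, the Model Server can also be queried directly, bypassing the Processing Server's safety checks. A minimal sketch using the openai Python package; the api_key must match the --api-key passed to vllm serve:

from openai import OpenAI

# Talks to the vLLM server directly, so no pre- or postprocessing is applied.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="my_token")

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(completion.choices[0].message.content)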

📡 Sending Requests

You can send a request to the processing server using curl:

curl -X POST http://localhost:5000/process \
  -H "Content-Type: application/json" \
  -d '{"input": "Who are you?"}'

This command sends an input prompt to the Processing Server, which runs preprocessing, forwards the prompt to the Llama model, and returns the response after postprocessing.
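
The same request can be sent from Python with the requests library (a sketch; the endpoint and JSON shape mirror the curl example above, while the exact response format depends on the server):

import requests

# POST a JSON prompt to the Processing Server, mirroring the curl example.
response = requests.post(
    "http://localhost:5000/process",
    json={"input": "Who are you?"},
)
print(response.json())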

