File Embedding Processor

This project generates embeddings for the files in a specified directory, stores them in a PostgreSQL database, and lets you search that database with a query to retrieve the most relevant file paths and their embeddings. It uses HuggingFace models running on PyTorch to generate embeddings and psycopg2 to talk to the database.

Features

Embedding Generation: Generates file embeddings using HuggingFace's transformer models or a custom embedding generation function.
Database Storage: Embeddings are stored in a PostgreSQL database, allowing for fast retrieval.
Search: Query the database using a user-defined question, retrieve the most relevant file embeddings, and display the corresponding file paths.

Prerequisites

Python 3.7+: This project uses Python for scripting.
PostgreSQL: You will need a running PostgreSQL instance to store embeddings.
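CUDA (optional): With a CUDA-capable NVIDIA GPU and a CUDA-enabled PyTorch build, embedding generation can run on the GPU, which is considerably faster than the CPU. Without one, everything still runs on the CPU.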
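
The GPU check described above can be sketched as follows. This is a minimal illustration, not the code in main.py: the helper name pick_device is hypothetical, and the fallback branch assumes you may want the snippet to run even where PyTorch is not installed.

```python
# Sketch: decide where embedding generation would run.
# Falls back to "cpu" when PyTorch is missing or sees no CUDA device.
try:
    import torch
except ImportError:  # PyTorch not installed, so there is nothing to probe
    torch = None


def pick_device() -> str:
    """Return "cuda" if a CUDA-enabled PyTorch build sees a GPU, else "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"Embeddings would be generated on: {device}")
```

A model would then typically be moved to that device before inference; whether the project scripts do this themselves is not stated here.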

Requirements

Python Dependencies

The following Python packages are required:

psycopg2: PostgreSQL database adapter for Python.
numpy: Package for numerical operations (used for embedding normalization).
torch: PyTorch, used for running HuggingFace models.
transformers: HuggingFace library for pretrained transformer models.
argparse: Command-line argument parser (part of the Python standard library, so it needs no separate installation).
langchain: For embedding models (if using the HuggingFaceEmbeddings class from langchain).

Install these packages using the requirements.txt file:

pip install -r requirements.txt

Setup

Step 1: Set up PostgreSQL Database

Install PostgreSQL on your machine, if it's not already installed.
Create a new database (file_vector) to store the embeddings.
Run the following SQL to create the required table:

CREATE TABLE embeddings_table (
    file_path TEXT PRIMARY KEY,
    embedding FLOAT8[]
);

Alternatively, you can use the provided setup_database.sh script to automatically set up the database:

sh setup_database.sh
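
Once the table exists, writing a row could look like the sketch below. It is an assumption about how storage might be done, not the code in main.py: the function name store_embedding is hypothetical, and the cursor argument is assumed to be a psycopg2 cursor. An upsert is used so that re-processing a file replaces its old row instead of violating the primary key on file_path.

```python
from typing import Any, Sequence

# Upsert: insert a new row, or overwrite the embedding if the path already exists.
UPSERT_SQL = """
INSERT INTO embeddings_table (file_path, embedding)
VALUES (%s, %s)
ON CONFLICT (file_path) DO UPDATE SET embedding = EXCLUDED.embedding;
"""


def store_embedding(cursor: Any, file_path: str, embedding: Sequence[float]) -> None:
    """Write one row; `cursor` is assumed to be a psycopg2 cursor."""
    # psycopg2 adapts a Python list of floats to a PostgreSQL array,
    # which fits the FLOAT8[] column defined above.
    cursor.execute(UPSERT_SQL, (file_path, [float(x) for x in embedding]))
```

Remember to commit the transaction on the connection afterwards, since psycopg2 does not autocommit by default.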

Step 2: Set up Virtual Environment

Create a virtual environment:
```
python -m venv venv
```
Activate the virtual environment:
- On macOS/Linux:
```
source venv/bin/activate
```
Install required Python dependencies:
```
pip install -r requirements.txt
```

Step 3: Configuration

Update the database connection parameters in both main.py and question.py to match your PostgreSQL setup:

DB_HOST = "localhost"
DB_PORT = "5432"
DB_NAME = "file_vector"
DB_USER = "your_database_user"
DB_PASSWORD = "your_database_password"
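
If you would rather not keep credentials in two source files, one option is to read them from environment variables. The sketch below is a hypothetical alternative, not something the scripts do as described: the helper load_db_config is invented for illustration, and the environment variable names simply mirror the constants above.

```python
import os


def load_db_config(env=os.environ) -> dict:
    """Collect connection settings, falling back to the values shown above."""
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": env.get("DB_PORT", "5432"),
        "dbname": env.get("DB_NAME", "file_vector"),
        "user": env.get("DB_USER", "your_database_user"),
        "password": env.get("DB_PASSWORD", "your_database_password"),
    }


config = load_db_config()
# With psycopg2 installed, these keys match its keyword arguments:
#   conn = psycopg2.connect(**config)
print({k: v for k, v in config.items() if k != "password"})
```

Both main.py and question.py could then share the same helper, so the settings live in one place.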

Step 4: Prepare Ignore List (Optional)

If you want to exclude certain files or directories from processing, create a text file (e.g., ignore_list.txt) containing newline-separated file or directory paths to ignore.

Example ignore_list.txt:

path/to/ignore1
file_to_ignore.txt
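
How entries are matched is not specified here, so the sketch below assumes one plausible rule: a path is skipped if its file name, the path itself, or any parent directory appears in the list. The helper names load_ignore_list and is_ignored are hypothetical.

```python
from pathlib import Path


def load_ignore_list(text: str) -> set:
    """Parse newline-separated entries, skipping blank lines."""
    return {line.strip().rstrip("/") for line in text.splitlines() if line.strip()}


def is_ignored(rel_path: str, ignored: set) -> bool:
    """True if the file name, the path, or any parent directory is listed."""
    path = Path(rel_path)
    if path.name in ignored:
        return True
    # Check the path and every ancestor, so "path/to/ignore1/notes.md"
    # is skipped because "path/to/ignore1" is listed.
    candidates = [path, *path.parents]
    return any(c.as_posix() in ignored for c in candidates)


ignored = load_ignore_list("path/to/ignore1\nfile_to_ignore.txt\n")
print(is_ignored("path/to/ignore1/notes.md", ignored))  # True
print(is_ignored("src/file_to_ignore.txt", ignored))    # True
print(is_ignored("src/main.py", ignored))               # False
```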

Step 5: Run the Embedding Generation Script

The main.py script processes files in the specified directory, generates embeddings, and stores them in the database. You can pass an optional ignore list file.

To run the embedding generation:

python main.py --path /path/to/start/directory --ignore-list ignore_list.txt

Where:

/path/to/start/directory is the path to the directory you want to process.
ignore_list.txt is an optional file containing a list of files or directories to ignore.
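
For orientation, a command line with these two options can be declared with argparse roughly as follows. This is a sketch rather than the parser in main.py; in particular, treating --path as required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the two options used in the command above."""
    parser = argparse.ArgumentParser(description="Generate embeddings for files.")
    # Assumption: the start directory is mandatory.
    parser.add_argument("--path", required=True, help="Directory to process.")
    # argparse exposes --ignore-list as the attribute `ignore_list`.
    parser.add_argument("--ignore-list", default=None, help="Optional file of paths to skip.")
    return parser


args = build_parser().parse_args(["--path", "/path/to/start/directory"])
print(args.path, args.ignore_list)
```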

Step 6: Perform a Search

The question.py script allows you to search the database using a query string. It generates an embedding for the query and retrieves the most relevant files based on vector similarity.

To run a query:

python question.py --question "your search query"

This will generate an embedding for the search query and return the top 5 matching files with their embedding values.
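
To make "vector similarity" concrete, the sketch below ranks stored embeddings by cosine distance in plain Python. The metric is an assumption, chosen because the manual SQL query in this README uses the <=> operator, which is cosine distance in pgvector. The function names and the toy three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # treat zero vectors as unrelated
    return 1.0 - dot / norm


def top_matches(
    query: Sequence[float], stored: Dict[str, Sequence[float]], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k (path, distance) pairs closest to the query."""
    scored = ((path, cosine_distance(query, emb)) for path, emb in stored.items())
    return sorted(scored, key=lambda pair: pair[1])[:k]


stored = {
    "docs/etl.md": [0.9, 0.1, 0.0],
    "src/plot.py": [0.0, 1.0, 0.0],
    "src/clean.py": [0.8, 0.2, 0.1],
}
for path, dist in top_matches([1.0, 0.0, 0.0], stored, k=2):
    print(f"{path}  distance={dist:.3f}")
```

In the project itself the ranking is done by the database, so only the top rows travel back to Python.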

Example Usage

Generate Embeddings for Files:

python main.py --path /path/to/files --ignore-list ignore_list.txt

Search for Relevant Files Based on Query:

python question.py --question "What are the important files related to data processing?"

Database Query Example

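To query the database manually, order the rows by their distance to a query embedding. The <=> operator (cosine distance) and the vector type come from the pgvector extension, so the extension must be enabled in the database (CREATE EXTENSION vector;). The %s is a psycopg2 placeholder; in psql, replace it with a vector literal such as '[0.1,0.2,0.3]' that has the same number of dimensions as the stored embeddings.

SELECT file_path, embedding
FROM embeddings_table
ORDER BY embedding::vector <=> %s::vector  -- explicit cast on the column is needed if it is FLOAT8[] as in Step 1
LIMIT 5;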
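
From Python, the query embedding has to be passed in a form that can be cast to vector. A minimal sketch, assuming pgvector's bracketed text format and a psycopg2 cursor at the call site: the helpers to_vector_literal and build_query are hypothetical, and the extra embedding::vector cast assumes the column is FLOAT8[] as created in Step 1.

```python
from typing import Sequence, Tuple

QUERY_SQL = """
SELECT file_path, embedding
FROM embeddings_table
ORDER BY embedding::vector <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats in pgvector's text form, e.g. "[0.1,0.2,0.3]"."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_query(embedding: Sequence[float], limit: int = 5) -> Tuple[str, tuple]:
    """Return the SQL and the parameters to hand to cursor.execute()."""
    return QUERY_SQL, (to_vector_literal(embedding), limit)


sql, params = build_query([0.1, 0.2, 0.3])
print(params)  # ('[0.1,0.2,0.3]', 5)
```

With a live connection this would be run as cursor.execute(sql, params) followed by cursor.fetchall().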

License

This project is licensed under the MIT License - see the LICENSE file for details.

CoderCookE/vectorFileSearch
