This topic can quickly become extensive, so my aim here is narrow: to document my experiment using llamaindex, ollama, and chroma_db for RAG.
In essence, RAG (retrieval-augmented generation) means conversing with an AI about documents that have been processed into a searchable store. My focus is not the intricacies of processing documents into vector stores, but rather engaging with the data through conversation with a large language model (LLM).
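The pattern itself is simple enough to sketch without any framework: retrieve the most relevant chunk, then hand it to the model inside the prompt. Below is a toy stdlib version, where keyword overlap stands in for real embedding similarity and the prompt would be sent to a model via ollama rather than anything shown here; the chunk texts and function names are illustrative, not from my actual pipeline:

```python
def retrieve(query: str, chunks: list[str]) -> str:
    """Pick the chunk sharing the most words with the query.
    A real pipeline would rank by embedding similarity instead."""
    q = set(query.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

def build_prompt(query: str, context: str) -> str:
    """Standard RAG prompt shape: answer only from the supplied context."""
    return (
        "Answer using only the context below.\n"
        f"Context: {context}\n"
        f"Question: {query}"
    )

chunks = [
    "chroma is a lightweight vector database",
    "ollama serves quantized models locally",
]
question = "which models does ollama serve?"
context = retrieve(question, chunks)
prompt = build_prompt(question, context)
```

Everything the model knows about your documents arrives through that `context` slot, which is why the retrieval and chunking choices discussed below matter as much as the model itself.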
I experimented with various LLMs to observe their “personality” during conversations. The models ran locally on a 3080 GPU, quantized to fit, with ollama handling the serving. The embedding models’ impact on quality and performance remains a point of interest. I tested two medium-sized models:

- llama3
- wizardlm2

and two small models:

- gemma
- phi3
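For reference, all four came straight from the ollama registry; pulling the default quantized builds is a one-liner each:

```shell
# pull the four models tested (ollama's default quantizations)
ollama pull llama3
ollama pull wizardlm2
ollama pull gemma
ollama pull phi3

# quick sanity check: chat with one interactively
ollama run llama3
```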
The larger models (llama3-8b and wizardlm2-7b) demonstrated competence, providing accurate responses in a concise manner. While wizardlm2 was more verbose, I found no clear preference between the two. If pressed to choose, I lean towards llama3 due to its more direct responses.
Conversely, the smaller models (gemma and phi3) exhibited excessive hallucination, possibly influenced by various factors including data processing and prompt templates. Larger models, benefiting from extensive training, outperformed them as expected. However, exploring smaller models remains valuable for local AI deployment, offering speed and resource efficiency.
Although smaller models aren’t currently suitable for RAG, future iterations might prove beneficial, especially with fine-tuning. Data accuracy remains crucial, and larger models like wizardlm2 and llama3 excel in this regard. RAG performance depends on both the LLM and the framework around it: tuning chunk and overlap sizes against each model’s maximum context window can improve both response time and answer accuracy.
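To make the chunking knobs concrete, here is a stdlib sketch of fixed-size chunking with overlap. The `chunk_size`/`chunk_overlap` names mirror the parameters llamaindex exposes on its node parsers, but this character-based implementation is a simplified stand-in, not the library's actual splitter:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters, each window
    starting chunk_size - chunk_overlap after the previous one.
    A short stub chunk may appear at the end of the text."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# each chunk repeats the last 2 characters of the previous one
chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Smaller chunks with some overlap retrieve more precisely but mean more pieces to embed and search; larger chunks risk diluting relevance and, past the model's max tokens, getting truncated out of the prompt entirely.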
Performance issues may also stem from the choice of vector database. While my experience is limited, chroma’s simplicity and acceptable performance make it a viable option.
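What chroma does at its core, nearest-neighbour search over embeddings, can be illustrated with a stdlib toy. The hand-made 2-d vectors and the `ToyVectorStore` class are illustrative inventions; a real vector database adds persistence, approximate-nearest-neighbour indexing, and metadata filtering on top of this idea:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

class ToyVectorStore:
    """Minimal in-memory stand-in for a vector database."""
    def __init__(self) -> None:
        self.items: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], document: str) -> None:
        self.items.append((embedding, document))

    def query(self, embedding: list[float], n_results: int = 1) -> list[str]:
        """Return the n_results documents most similar to the query vector."""
        ranked = sorted(self.items, key=lambda it: cosine(embedding, it[0]),
                        reverse=True)
        return [doc for _, doc in ranked[:n_results]]

store = ToyVectorStore()
store.add([1.0, 0.0], "doc about GPUs")
store.add([0.0, 1.0], "doc about embeddings")
hits = store.query([0.9, 0.1], n_results=1)
```

The brute-force `sorted` scan here is exactly what becomes a bottleneck at scale, and the indexing strategies databases use to avoid it are one reason the choice of vector store affects performance.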
My experiment aimed to assess how different models present retrieved results in conversation. Free and open-source models like llama3 and wizardlm2 show promise, despite minor shortcomings. With further refinement, smaller models could also enhance RAG applications. For now, llama3 or wizardlm2 are reliable choices.