Using AI to talk to books
Day 10 / 365
I have made several small AI tools so far, so today I wanted to challenge myself. I set out to build a tool that you can give a PDF file to and then ask any question about its contents.
There are several use cases. For instance, all of the NCERT books are available in PDF format. With such a tool, a student can get their doubts about any chapter of a book answered more easily.
The Complexity
So what makes this problem not so straightforward?
Firstly, let's see where ChatGPT gets its info from. ChatGPT works using an AI model that has been trained on a large set of text from all over the internet. So if your question is related to text that is freely available on the internet, it will be able to give you a good enough answer.
But the catch here is that the training data for ChatGPT has a cut-off date.
GPT-3.5 knows only about things on the internet up to September 2021. So if your question is about something after that, it would not be able to help you with it.
Providing extra info to GPT
One solution could be to provide GPT with the additional info in the prompt itself, and then ask it your question.
So in the case of the NCERT example, I could copy and paste the whole text from a chapter into my prompt and it would work fine.
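As a rough sketch of what that looks like with the openai Python client (the file name and the question here are made-up placeholders, not from my actual app):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical file holding the full text of one NCERT chapter
chapter_text = open("ncert_chapter.txt").read()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer the question using only the chapter text provided."},
        {"role": "user", "content": f"Chapter:\n{chapter_text}\n\nQuestion: What is photosynthesis?"},
    ],
)
print(response.choices[0].message.content)
```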
The problem arises when the text becomes too long. LLMs like GPT have a limit on the amount of text they can remember at any given time. This is called the context window. For most models, it is no more than a few thousand words, which translates to only 3–4 pages.
So we need to think of a better solution.
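To get a feel for the limit, here is a small sketch that counts tokens with the tiktoken library. The chapter file is a hypothetical placeholder, and the 4,096-token figure is the context window of the original gpt-3.5-turbo model:

```python
import tiktoken

# Tokenizer used by the GPT-3.5 / GPT-4 family of models
enc = tiktoken.get_encoding("cl100k_base")

chapter_text = open("ncert_chapter.txt").read()  # hypothetical chapter file
num_tokens = len(enc.encode(chapter_text))

# The original gpt-3.5-turbo context window is 4,096 tokens,
# and that budget has to cover the question and the answer too.
print(f"The chapter alone takes up {num_tokens} of the ~4,096 available tokens.")
```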
Embeddings and Vector Databases
Suppose I had a book with 100 pages, and I wanted to ask a specific question about it. It could be that out of those 100 pages, only 2–3 had info related to my question. Instead of passing the whole text to GPT, we just need to pass the relevant pages. This way we will stay within the context window.
This is where vector databases come in. I would need a whole other blog post to explain what they are, but for now, we just need to know the following -
- We can convert text from the book into numeric vectors (lists of floating-point numbers) using an embedding model and store them in a so-called vector database
- We can convert our questions into vectors as well
- By measuring the distance between our question vector and the vectors in our database, we can find text that is relevant to our question.
Similar texts will be close to each other in the vector space, as the sketch below shows.
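Here is a tiny sketch of that third point, using an OpenAI embedding model and plain cosine similarity. The example sentences are made up, the model name is my guess at what such an app would use, and a real vector database would replace the brute-force comparison:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text):
    """Turn a piece of text into a vector of floats using an OpenAI embedding model."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
    return np.array(resp.data[0].embedding)

# Toy "pages" standing in for chunks of the book
chunks = [
    "Mr. McCaskey came home expecting the usual quarrel with his wife.",
    "The lighthouse keeper watched the storm roll in from the sea.",
    "A recipe for lentil soup with cumin and garlic.",
]
chunk_vectors = [embed(c) for c in chunks]

question_vector = embed("What surprise did Mr. McCaskey get when he came back home?")

# Cosine similarity: higher means the texts are closer in the vector space
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(question_vector, v) for v in chunk_vectors]
print(chunks[int(np.argmax(scores))])  # the chunk most relevant to the question
```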
Putting it all together
Once I had the solution in my mind, I gathered the tools to code it (a rough sketch of how they fit together follows the list):
- I used Streamlit for the UI
- OpenAI embedding model to create the vectors
- PyPDF to extract text from PDF
- FAISS as my vector database
- GPT-3.5 as my LLM
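Stripped of the Streamlit UI, the pieces connect roughly like this. It is a simplified sketch: the one-chunk-per-page splitting, the model names, and the file name are my choices for illustration, not necessarily what the deployed app does:

```python
import numpy as np
import faiss
from pypdf import PdfReader
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Turn a list of strings into float32 vectors with an OpenAI embedding model."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# 1. Extract text from the PDF, using one page as one chunk (crude but simple)
reader = PdfReader("short_stories.pdf")  # hypothetical local copy of the PDF
pages = [page.extract_text() or "" for page in reader.pages]
chunks = [text for text in pages if text.strip()]

# 2. Embed every chunk and store the vectors in a FAISS index
vectors = embed(chunks)
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# 3. Embed the question and retrieve the few closest chunks
question = "What surprise did Mr. McCaskey get when he came back home?"
_, ids = index.search(embed([question]), k=3)
context = "\n\n".join(chunks[i] for i in ids[0])

# 4. Pass only those chunks, not the whole book, to GPT-3.5
answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided excerpts from the book."},
        {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```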
Here’s the final product — https://talk-to-doc.streamlit.app/
To test it out, I gave it a 100-page PDF with short stories — https://www.nipccd.nic.in/uploads/page/Short-stories-from-100-Selected-Storiespdf-958b29ac59dc03ab693cca052b4036e2.pdf
For the question I chose a random page and looked at this paragraph —
So my question was —
“What surprise did Mr. McCaskey get when he came back home?”
And within seconds, it gave me the correct answer!
“The surprise Mr. McCaskey got when he came back home was that instead of the usual stove-lid or potato-masher being thrown at him by his wife, Mrs. McCaskey, he was met with only words. Thereafter, an argument ensued between them which escalated into a food fight.”
I have to say that this worked way better than I had expected!
I know that this is not a new idea. Many people built apps to talk to PDF files last year. But I am happy that I was able to not only understand how they work but also make one of my own from scratch. This opens so many more possibilities for future apps.