I have 200 mixed documents. Can I build an AI assistant that answers from them, cites the source, and admits when it does not know?
Posted 4 May 2026 · Deep dive
Yes, you can. The hard part is not getting AI to read the documents. The hard part is knowing when the answer is trustworthy, when the documents disagree, and when the answer is not in the material at all.
For most readers, a finished tool gets you most of the way there in an afternoon. If you are using this in work where a wrong answer would cause harm, the extra effort goes into testing rather than tools.
What you are really asking
You have a real-world pile: mixed formats, some old, some superseded, some scanned. You want plain questions answered from the material, and you want to be told when the assistant does not know. That is the grown-up version of the AI question, and it deserves a grown-up answer.
Can I just use a finished tool?
For personal research, yes. Google's NotebookLM is built for exactly this job. You give it your sources, you ask questions, and the answers come back with citations that link to the passage it pulled from. When the answer is not in the material, NotebookLM is unusually honest about saying so. The free tier handles a respectable number of sources per notebook; check the current limits on the plans page, because both the source cap and the pricing change over time.
If your documents already live inside Microsoft 365 (SharePoint, OneDrive, Outlook), Microsoft 365 Copilot indexes them in place and answers with citations. The data stays in your tenant. As of 2026 it can ground on scanned PDFs and image-based documents, which closes a long-standing gap.
Both are good enough for personal research, learning, summarising, and exploring. They are not, on their own, safe enough for legal, medical, financial, audit, or compliance work. That distinction matters and the rest of the answer keeps coming back to it.
What could go wrong
The trouble starts when the tool looks like it is working but quietly is not. Five examples of things that look right on the surface:
Mixed-up old and new. If the assistant reads both a 2019 policy and a 2024 update, it may quote both as if they are equally current. That is how a clever-looking system gives a wrong answer.
OCR muddle. Scanned pages and screenshots have to be turned into text by a process called OCR (optical character recognition: the trick that turns a picture of text into actual text). Default OCR is decent, not perfect. A "1" can become an "l" in a contract clause; a footnote can vanish entirely. The assistant then quotes the wrong number with a confident citation. (A quick way to flag suspect words is sketched just after this list.)
Looking but not finding. When the answer really is not in the documents, a well-behaved assistant should say so. Some products do; others quietly invent something plausible. NotebookLM is comparatively well-behaved here. Custom GPTs are more likely to wander.
Citations that do not say what the assistant says they say. The model can produce a citation that points to a real passage and then summarise the passage incorrectly. A reader who sees the citation badge and stops there has been quietly misled.
Documents disagreeing. Two of your sources contradict each other. The assistant picks one silently and answers as if the other does not exist. You only spot this if you already knew the second one was there.
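On the OCR point, one quick way to see where a scanned page is likely to lie to you is to ask the OCR engine which words it was unsure about. The sketch below uses pytesseract, one common OCR library; the filename and the confidence threshold are illustrative assumptions, and this is a rough sanity check rather than a fix.

```python
# Rough OCR sanity check with pytesseract. Assumes Tesseract is installed
# and "scanned_page.png" is one of your scanned pages (both are assumptions).
import pytesseract
from PIL import Image

data = pytesseract.image_to_data(
    Image.open("scanned_page.png"),
    output_type=pytesseract.Output.DICT,
)

# pytesseract reports a confidence per recognised word (-1 for non-text boxes).
suspect = [
    (word, conf)
    for word, conf in zip(data["text"], data["conf"])
    if word.strip() and int(conf) != -1 and int(conf) < 70  # 70 is an arbitrary cut-off
]

for word, conf in suspect:
    print(f"check by eye: {word!r} (confidence {conf})")
```

Anything the engine itself was unsure about is exactly where a "1" becomes an "l", so those are the pages worth checking against the original scan.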
When you need to be more careful
If any of these are true, the finished-tool approach starts to get thin: the documents are confidential and cannot leave your premises; they change often and the assistant must stay current; you need an audit trail showing which passage every answer came from; or you have legal, financial, or regulatory obligations attached to the answers.
That is the moment to look at a more serious setup. The shorthand for it is RAG, retrieval-augmented generation: the standard pattern where the AI looks things up in your documents first, then writes the answer using only what it found. Tools like LlamaIndex and LangChain stitch the pieces together. The pieces themselves include:
A way to chunk documents (split each one into smaller passages so the system can pull just the relevant bits). An embedding step (turning each passage into a list of numbers, so the system can find related passages by mathematical distance, not just exact keyword matches). A vector database (a database designed to store those number-lists and find nearest matches quickly). A re-ranker (a second pass that re-orders the candidate passages to put the most relevant ones first). And the language model itself, with strict instructions about answering only from the supplied passages and refusing otherwise.
Each piece has trade-offs. Chunk size affects whether tables get sliced in half. The embedding model affects whether your assistant finds a passage about "termination" when the source uses "cancellation". The re-ranker often matters more than the language model. None of this matters for personal research; all of it matters once a wrong answer costs you something.
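To make that concrete, here is a minimal sketch of the retrieve-then-answer pattern in plain Python, using sentence-transformers for the embedding step and a simple in-memory matrix in place of a vector database (which is plenty at 200 documents). It skips the re-ranker, and the model name, chunk size, and the ask_llm() placeholder are illustrative assumptions rather than recommendations; a real build with LlamaIndex or LangChain replaces most of this plumbing.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def chunk(text, size=800, overlap=200):
    """Split a document into overlapping passages of roughly `size` characters."""
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

def build_index(documents):
    """documents: {"policy_2024.pdf": "full extracted text", ...} (names invented)."""
    passages = [
        {"source": name, "text": piece}
        for name, text in documents.items()
        for piece in chunk(text)
    ]
    vectors = embedder.encode([p["text"] for p in passages], normalize_embeddings=True)
    return passages, vectors

def retrieve(question, passages, vectors, top_k=5):
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = vectors @ q  # cosine similarity, since the vectors are normalised
    best = np.argsort(scores)[::-1][:top_k]
    return [passages[i] for i in best]

def answer(question, passages, vectors):
    found = retrieve(question, passages, vectors)
    context = "\n\n".join(f"[{p['source']}] {p['text']}" for p in found)
    prompt = (
        "Answer only from the passages below. Cite the source in brackets. "
        "If the answer is not in the passages, say you do not know.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)  # placeholder: whatever language-model API you use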
This is real engineering. Two days for a developer to stand up; weeks of work to make trustworthy. If you do not have a developer and you are in this category, bring one in rather than build it yourself.
How would I know whether it is working?
This is the part most projects skip and then regret. The discipline is borrowed from software testing and is now standard practice for any AI system people rely on. The idea is simple: you decide in advance what good looks like, and you measure.
Build a small set of question-and-answer pairs you already know the answer to. For each one, write down the question, the correct answer, and a pointer to where in the documents that answer lives. Mix in three kinds:
Ordinary questions whose answer is in one place. Questions whose answer is genuinely not in any of the documents (the assistant should refuse). And questions where two documents disagree, where you know which one is current.
Ask the assistant each one and write down what it does. Three things to watch: did it get the right answer; did the cited passage actually contain the answer; did it say "I do not know" when it should have. The exact pass mark depends on how serious the use case is. For personal research, a couple of misses out of twenty is fine. For anything legally or financially consequential, even one confident wrong answer is a failure.
Run the set again whenever something changes: new documents added, system prompt edited, model upgraded. Open-source frameworks like Ragas automate this for custom builds. For NotebookLM and Copilot the test is manual, which sounds heavier than it is. Twenty questions take an hour to run; discovering a regression without them takes far longer.
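Even for the manual case, a small script or spreadsheet keeps the re-runs honest. Here is one possible shape for it; ask_assistant() is a placeholder for however you query your setup, the example questions and filenames are invented, and the string checks are deliberately crude (in practice you read the answers yourself, or use something like Ragas for a custom build).

```python
test_set = [
    {
        "question": "What is the notice period for termination?",
        "expect": "30 days",                 # correct answer, known in advance
        "source": "contract_2024.pdf, clause 8.2",
        "kind": "ordinary",
    },
    {
        "question": "What is our policy on remote work in France?",
        "expect": None,                      # not in the documents: should refuse
        "source": None,
        "kind": "should_refuse",
    },
    # ... plus questions where two documents disagree and you know which is current
]

REFUSALS = ("i do not know", "not in the documents", "cannot find")

def run(test_set):
    failures = []
    for case in test_set:
        reply = ask_assistant(case["question"])  # placeholder for your tool
        text = reply.lower()
        if case["kind"] == "should_refuse":
            ok = any(phrase in text for phrase in REFUSALS)
        else:
            ok = case["expect"].lower() in text
        if not ok:
            failures.append((case["question"], reply))
    print(f"{len(test_set) - len(failures)}/{len(test_set)} passed")
    return failures
```

Keep the file of questions under version control alongside the documents, and the hour of re-running becomes routine rather than a chore.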
A simple test you can run before trusting it
Before you act on anything the assistant tells you, do this once. Ask it ten questions whose answers you already know. Ask it five questions whose answers you know are not in the documents. Ask it three questions where you have deliberately included a 2019 version and a more recent update of the same source.
If it gets the first ten right, refuses the second five, and picks the recent version on the third three, you have something worth using for the kind of work you have. If it fails any of those, do not yet trust it for anything where a wrong answer matters. Find out which step in the chain failed (Was the OCR wrong? Did it retrieve the wrong passage? Did it cite correctly but summarise wrongly?) and fix that one thing before retrying.
What I'd avoid
Uploading confidential files to any tool without first checking where the data goes and who can see it. Treating the presence of a citation as proof the answer is right. Building a custom system before testing whether NotebookLM or Copilot already does the job. Trusting an AI assistant that has not been tested against questions you already know the answers to. And the consultant version of "have you tried turning it off and on again": anyone who hears "200 mixed documents" and answers "just put it in NotebookLM" or "just use a Custom GPT" without first asking about volume, sensitivity, update frequency, and what counts as a wrong answer in your world has not yet understood the question.
The short version
Start with NotebookLM (or Microsoft Copilot, if your documents live in Microsoft 365). For learning, summarising, exploring, and personal research, that is most of the answer. Before you act on what it tells you for anything that matters, run the ten/five/three test above. If it passes, you have something useful. If your work needs to be confidential, audited, current, or legally accountable, treat the finished tool as a prototype rather than the destination, and bring in someone who can build and test the more serious version. The danger is not that AI cannot read your documents. The danger is the moment you stop checking.
Got a question?
Send it through the feedback link. No signup, no list. I'll add it to the queue.