Querying Transcripts in Snowflake with Claude

Mike Nucci

June 11, 2026


A follow-up to Load Testing Conversation Analysis in Snowflake.

‍

Earlier this week I posted about load-testing Snowflake Cortex, testing whether I could stand up a conversation-intelligence app on Cortex itself.

‍

This time I wanted to try a different flavor of the same idea: what if you just pull the raw transcripts out of Snowflake and unleash Claude on them?

‍

Point the model straight at the warehouse, ask it questions a CEO or CTO actually wants to know, and let Claude read and reason on its own.

‍

It seems like a completely reasonable thing to do.

‍

1,854 of our own customer calls
Real questions that require reading all of the transcripts
Across every Claude entry point
- Chat app
- Cowork
- Claude Code

‍

Here’s the TLDR:

‍

Whichever surface you use (the chat app, Cowork, or Claude Code), you land in one of two places.

‍

Sampling: Almost always, it doesn't read your transcripts at all: it keyword-samples and hands you a confident answer that changes every time you ask and quietly skips the cases that matter most.
Cost Prohibitive: On the rare path where you force it to actually read everything, it has to fan out an army of sub-agents at a cost that's absurd for answering a single question.

‍

So for a really important question, your choices are cheap and unreliable, or reliable and priced like a second mortgage.

‍

The setup

‍

We froze a slice of our own Gong transcripts in Snowflake: 1,854 customer calls from last quarter, about 16 million tokens of raw conversation.

‍

We connected Claude (Opus 4.8) through the production Snowflake connector and asked, in plain English:

‍

"Which of our customers are seriously considering building this kind of capability in-house instead of buying a vendor? List them."

‍

Right now, this is a really important question for a SaaS company. It's also exactly the kind of question that's invisible to a keyword search: someone saying "we're thinking about just doing this ourselves in Claude" never uses the word "churn" or "competitor." You have to actually read the conversation to catch it.

‍

Can Claude read all of last quarter’s 1,854 conversations and answer it?

‍

Last month’s 589 conversations?

‍

Last week’s 167 conversations?

‍

Yesterday’s 34 conversations?

‍

Finding 1: it doesn't read them

‍

Not "it reads them slowly." It doesn't read them.

‍

Across every single run, on every surface, Claude pulled essentially 0% of the transcripts into context.

‍

What it does instead is write a SQL ILIKE '%in-house%' filter, grab a few hundred characters of text around each keyword hit, and reason over those snippets. The full conversations never get read. They can't; they don't fit in the context window, so the model quietly samples and moves on.
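
To make that concrete, here is a minimal Python sketch of what an ILIKE '%in-house%' filter plus snippet extraction amounts to. It is an illustration, not the query Claude actually wrote: the function name, the 40-character window, and the two sample utterances are all assumptions made up for this example.

```python
def keyword_snippets(transcripts, keyword, window=40):
    """Mimic ILIKE '%keyword%': case-insensitive substring match,
    keeping only the text around the first hit in each transcript."""
    hits = {}
    needle = keyword.lower()
    for call_id, text in transcripts.items():
        pos = text.lower().find(needle)
        if pos == -1:
            continue  # no literal match, so this call is never looked at
        start = max(0, pos - window)
        hits[call_id] = text[start:pos + len(needle) + window]
    return hits

# Hypothetical utterances, written for illustration only.
transcripts = {
    "call_a": "Our CTO wants to take this in-house next year.",
    "call_b": "We're thinking about just doing this ourselves in Claude.",
}

snippets = keyword_snippets(transcripts, "in-house")
print(sorted(snippets))
```

Only call_a comes back. call_b says the same thing in different words, so a literal filter never surfaces it, and nothing downstream can reason about text it never received.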

‍

‍

That chart is the whole ballgame. Bigger context windows didn't change it. The agentic surfaces didn't change it. At 34 calls it sampled; at 1,854 calls it sampled harder. The single highest-coverage run in the entire study read about 2% of the data.

‍

Here is what Claude admitted when I asked, “If I ask Claude how it answered and analyzed 100% of the transcripts will it answer precisely?”

‍

‍

Finding 2: ask it three times, get three answers

‍

Here's the part that should make anyone nervous about running their business on this.

‍

I asked that identical build-in-house question, over the identical data, with the identical model, three times, each in a fresh chat. It returned 7 customers. Then 25. Then 13.

‍

‍

And it's not just the count that moved. The membership barely overlapped. Only about four accounts showed up in all three runs.

Why?

‍

Because each time, Claude samples differently.

‍

Run it Monday, run it Friday, and you're looking at two different realities — with no flag that either one is a 2% sample.

‍

Finding 3: it missed the customer who already left

‍

This is the one that stuck with me. Buried in those transcripts is Customer XYZ, telling us, in plain language, that they'd "moved off Rippit into some internal tool." A customer who already churned to an in-house build. That is precisely the signal this question exists to surface.

‍

Of the eight keyword runs, six missed it entirely. Every run in the chat app missed it. It only surfaced in the two runs where the model happened to cast its widest net.

‍

When the stakes are "which six-figure account is about to leave," I don't want to hope I get lucky.

‍

Finding 4: the agentic surfaces are slower, not better

‍

The obvious pushback: "sure, the chat app is limited, but Cowork and Claude Code have huge context windows and can write code and loop. Surely they read everything."

‍

They don't.

‍

They were more consistent (they clustered around 20-some accounts instead of swinging 7-to-25) and they were genuinely more rigorous — Claude Code even caught a data-quality bug in our table on its own.

‍

But they read the same ~0% of transcripts.

‍

They just spent five to ten times longer doing more elaborate keyword passes. Our slowest Code run took ten and a half minutes to read nothing in full.

‍

‍

More effort, more patience, more tokens - same coverage.

‍

Finding 5: the only way to make it read costs a fortune — for one question

‍

So I tried the obvious thing: I told Claude Code to read all 1,854 transcripts in full. No sampling, no keyword filter — read every word.

‍

Claude correctly scoped the job and said: “~16 million tokens, too large for even a 1M-token window, so the only honest way to do it is to fan out across 100 sub-agents, each reading a batch, then stitch the results back together.”
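
The arithmetic behind that plan is easy to check. Here is a rough Python sketch using only the figures quoted in this post; the even split across sub-agents is an assumption, since Claude did not say how it would batch the calls.

```python
import math

TOTAL_TOKENS = 16_000_000   # from the post: about 16 million tokens
TOTAL_CALLS = 1_854         # from the post
SUB_AGENTS = 100            # from Claude's proposed plan
CONTEXT_WINDOW = 1_000_000  # the 1M-token window Claude mentioned

avg_tokens_per_call = TOTAL_TOKENS / TOTAL_CALLS
tokens_per_agent = TOTAL_TOKENS / SUB_AGENTS           # assumes an even split
calls_per_agent = math.ceil(TOTAL_CALLS / SUB_AGENTS)  # largest batch size
min_windows = math.ceil(TOTAL_TOKENS / CONTEXT_WINDOW)

print(f"average call: ~{avg_tokens_per_call:,.0f} tokens")
print(f"each sub-agent: ~{tokens_per_agent:,.0f} tokens, up to {calls_per_agent} calls")
print(f"1M-token windows needed just to hold the text: {min_windows}")
```

Even before any reasoning happens, the raw text alone would fill sixteen 1M-token windows, which is why a single pass is off the table.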

‍

To actually read everything, you don't run a query, you stand up a multi-agent pipeline.

‍

I was about to tell Claude to go ahead but decided to ask how much it would cost first.

‍

Claude said it was going to cost between $300 and $550 to answer 1 question.

‍

‍

Ask ten questions this way and you're looking at $3,000 to $5,500, north of $4,000 at the midpoint.
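
The bill scales linearly because nothing is reused between questions. A tiny Python sketch of that math, taking Claude's quoted range at face value (the function name is made up for illustration):

```python
COST_PER_QUESTION_USD = (300, 550)  # range Claude quoted for one full read

def cost_range(num_questions, per_question=COST_PER_QUESTION_USD):
    """Total spend if every question triggers a fresh full read."""
    low, high = per_question
    return num_questions * low, num_questions * high

low, high = cost_range(10)
midpoint = (low + high) / 2
print(f"10 questions: ${low:,} to ${high:,} (midpoint ${midpoint:,.0f})")
```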

‍

So if you really want the correct answer to your question, Claude has to do this exorbitant fan-out, and you re-pay the whole bill for every new question. There's no setting where pointing Claude at your raw warehouse is both reliable and affordable for this kind of work.

‍

Finding 6: it works great — for a different kind of question

‍

I want to be fair, because this matters. Claude + Snowflake is genuinely excellent at one class of question. When I asked "how many calls mention Competitor XYZ?" — a keyword question — it nailed it: one SQL filter, 38 seconds, ~100% recall. Every mention, found.

‍

That's the tell. The whole study collapses onto a single axis:

‍

‍

If your question is keyword-shaped or sits in a structured column, Claude + Snowflake is fast and right.

‍

The moment your question requires reading and judgment: who's frustrated, who's building instead of buying, how is sentiment really trending — it samples, and it degrades.

‍

Asked for the top churn reasons, the keyword layer flagged ~1,100 calls; the real number was closer to 15–20.

‍

Asked for a sentiment trend, it produced a clean, confident monthly chart that was essentially noise, because the words "churn" and "frustrated" saturate our transcripts as product vocabulary, not customer mood.

‍

The failure mode isn't that it breaks loudly. It's that it hands you a polished chart that's quietly wrong.

‍

The part I didn't expect

‍

Here's the kicker. Over and over, unprompted, Claude itself kept pointing back at the architecture we'd told it not to use. On the churn question it said the right source was the enriched, pre-labeled data, not the raw text. On sentiment it flat-out said the role-aware, pre-computed score "is the right source," because the raw transcripts "can't cleanly separate" a frustrated customer from a rep demoing our frustration-detection feature.

‍

Claude - under explicit instruction to just pull Snowflake transcripts and provide the analysis - made the argument for using Rippit instead.

‍

That's not a knock on Claude. We love Claude and use it all the time. The lesson isn't "Claude can't do this." Pairing Claude with Snowflake is the wrong marriage.

Why Rippit Reads 100% of Conversations and Claude + Snowflake Doesn't

Sampling gives you an answer. It doesn't give you the right answer — and that's the difference between Claude + Snowflake and Claude + Rippit.

Connect Claude to Snowflake and ask, "What are the top three reasons customers are churning?"

‍

You'll get a confident answer that’s, on its face, incredibly compelling, with beautiful graphs and formatting.

‍

But read the fine print. For example, Claude may say: "The average transcript is ~34K characters. If I pulled the whole April/May window into context, that's ~340M chars, totally infeasible to read directly. So I need a sampling strategy."

‍

I randomly sampled 50 conversations.
I used keyword analysis to select the sample.
I read the first comment of each support case.
I will select one call per account.

‍

Why 50? Why keyword search? Why the first comment?

‍

Why not 100% of all 10,000 conversations?

‍

It's not the model. It's the tools.

‍

Same brain, different tools

‍

The model is the brain, the tools are the arms and legs.

‍

Claude and Rippit share the same brain.

‍

So does Claude connected to any data source - Snowflake, a helpdesk, a call recorder, a chat platform, a CRM.

‍

What's different is how Rippit can orchestrate the brain.

‍

What happens when Claude talks to Snowflake

‍

Claude writes SQL and Snowflake runs the query. Rows come back - including raw transcripts.

‍

Claude has to load and read the transcripts into its context window and reason over them at query time.

‍

That's the bottleneck.

‍

A transcript could be 2,000 - 10,000 tokens, maybe more. Reading 50,000 of them to answer one question is both technically possible and economically insane.

‍

Claude, or any LLM, cannot actually load 100% of the transcripts into its context window, so it has to sample the transcripts in some way.

‍

The model summarizes the sample and reports back. It takes shortcuts because it has to.

‍

Again, it will give you what seems to be an amazing answer that may look totally defensible. But when you dig into what Claude actually did, you see all the shortcuts it took.

‍

The same thing happens with any conversation source

‍

Different tool, same shape.

‍

Connect Claude to Zendesk. Connect it to Gong. Connect it to Intercom, Salesforce Service Cloud, Front, Dialpad, Slack, Teams, your in-product chats, your agent-to-agent messages. Pick any conversation source.

‍

Zendesk and Gong aren’t data platforms like Snowflake that can run analysis over 100% of the data on the fly, so Claude has to sample from those sources and fit what it can into the context window, which is a tiny percentage of the total conversation volume.

‍

Snowflake can enrich 100% of conversations, but it isn’t designed for this, so it’s too slow to be feasible and requires more complex internal building.

‍

What Rippit does differently — part one, pre-enrichment

‍

Every conversation that lands in Rippit gets processed once, at ingestion. Topics. Intents. Sentiment. Escalations. Outcomes. Themes. Entities.

‍

We’ve determined a number of data points that most people will need, and we pre-enrich the conversations based on what we’ve learned over the past 10 years.

‍

However, we also give customers the ability to build customized pre-enrichment prompts. Every business is unique, and so are the conversations it is having, so you should be able to customize what insights you want out of your data.

‍

You define the questions you actually care about ("did the customer mention a competitor by name," "did the agent quote pricing," "was a regulatory term invoked," "did this conversation contain a renewal objection," "what were our customers' thoughts on X new product") and we run those against every single conversation at ingestion.

‍

A custom question becomes a custom dimension. Asked once, answered across 100% of the corpus.

‍

Those become structured dimensions and measures that Rippit can leverage when someone asks a question, which makes each question cheaper and faster.
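
Here is a minimal Python sketch of the general pattern: answer each custom question once per conversation at ingestion, store the result as a column, and aggregate later. It is not Rippit's actual implementation. The function names are hypothetical, the sample conversations are invented, and the "model call" is a trivial stand-in so the example runs offline.

```python
from collections import Counter

def mentions_competitor(transcript: str) -> bool:
    """Stand-in for an LLM call that answers one custom question.
    A real pipeline would send the transcript and the question to a model."""
    return "competitor" in transcript.lower()

def enrich_at_ingestion(conversations, questions):
    """Run every custom question once per conversation and keep the answers
    as structured columns alongside the conversation id."""
    return [
        {"id": c["id"], **{name: ask(c["transcript"]) for name, ask in questions.items()}}
        for c in conversations
    ]

# Hypothetical conversations, for illustration only.
conversations = [
    {"id": 1, "transcript": "We are also evaluating a competitor this quarter."},
    {"id": 2, "transcript": "Can you walk me through the new dashboard?"},
    {"id": 3, "transcript": "A competitor quoted us a lower price."},
]
questions = {"mentions_competitor": mentions_competitor}

enriched = enrich_at_ingestion(conversations, questions)

# At query time the answer is a cheap aggregate over stored columns;
# no transcript is re-read.
counts = Counter(row["mentions_competitor"] for row in enriched)
print(counts[True], "of", len(enriched), "conversations mention a competitor")
```

The design point is where the model cost lands: once per conversation when it arrives, instead of once per conversation per question.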

‍

This is why Rippit is better than the "Claude + warehouse" setup. The generic setup gives you raw text and asks the model to figure out your business at query time, every time, over a sample. Rippit lets you encode your business once and run it over everything.

‍

The per-conversation model cost was paid once, asynchronously, at ingestion. Not on every analytical pass.

‍

If you do take the time to build LLM pipelines, in Snowflake for example, you’ll find the cost wildly different from what it would be in Rippit. We pass through what the model providers charge us, 1:1, whereas Snowflake marks up these tokens and recognizes that markup as revenue. So your LLM token costs are always 10-25% higher than with Rippit, and once you get hooked and want more and more insight out of your data, that 10-25% turns into a much higher annual cost.

‍

What Rippit does differently — part two, on-the-fly enrichment

‍

Pre-enrichment only covers some of the use cases customers ask about. You also need on-the-fly enrichment.

‍

When you ask Rippit some questions, the model will need to read the raw transcripts to figure out the answer.

‍

Rippit will run query-time, question-specific LLM calls over every conversation that has been selected for this analysis.
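
In outline, that is a map step over every selected conversation followed by a reduce. The Python sketch below shows the shape only, under loud assumptions: `ask_model` is a placeholder rule standing in for a real per-conversation LLM call, and the function names, worker count, and sample rows are hypothetical, not Rippit's code.

```python
from concurrent.futures import ThreadPoolExecutor

def ask_model(question: str, transcript: str) -> bool:
    """Placeholder for a per-conversation LLM call; here, a trivial rule
    so the example runs without any external service."""
    text = transcript.lower()
    return "ourselves" in text or "internal tool" in text

def answer_over_all(question, selected, max_workers=8):
    """Map the question over every selected conversation, then reduce."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(lambda c: ask_model(question, c["transcript"]), selected))
    matched = [c["account"] for c, flag in zip(selected, flags) if flag]
    return {"read": len(selected), "matched": matched}

# Hypothetical conversations, for illustration only.
selected = [
    {"account": "A", "transcript": "We moved off the vendor into some internal tool."},
    {"account": "B", "transcript": "Renewal looks fine, no concerns."},
    {"account": "C", "transcript": "We're thinking about just doing this ourselves."},
]

result = answer_over_all("Is this customer building in-house?", selected)
print(result)
```

The "read" count in the result is the part that matters: every selected conversation was looked at, so the denominator is known.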

‍

Pre-enrichment gives you a better baseline than what exists in your CRM or Phone System and on-the-fly enrichment is what gives you depth and 100% question coverage.

‍

Sampling gives you an answer. It doesn't give you the right answer.

‍

This is the part that should bother you most.

‍

When the model samples 50 conversations and tells you the top three reasons customers are churning, it isn't lying. It found patterns. The patterns are real in those 50 conversations.

‍

But it’s not verifiably accurate.

‍

It's anecdotal evidence confidently articulated as analytics.

‍

An LLM with a confident voice and three bullet points feels like a research report. It isn't. It's the equivalent of asking a consultant to talk to a couple of customers at random and come back with a strategy memo. The summary will sound smart. The conclusions might even be directionally correct. They also might be completely wrong, and you have no way to know which.

‍

You're going to take this answer into a board meeting. Or a roadmap review. Or a renewal conversation. Decisions get made on it. And the underlying evidence base is fifty conversations out of fifty thousand, picked by a heuristic nobody audited.

‍

Here are just three examples of where the anecdote-vs-statistics gap actually bites:

‍

Rare events. If 2% of conversations contain a churn signal, a 50-conversation sample catches one of them, maybe. You'll never see the pattern. Compliance violations, executive escalations, regulatory disclosures - exactly the things you can't afford to miss, exactly the things sampling guarantees you'll miss.

‍

Trend detection. Comparing two periods requires comparable coverage. Sample 50 last quarter and 50 this quarter, and the noise swamps any signal smaller than 20 points. With 100%, a 2-point shift in resolution rate is real.

‍

Segmented analysis. "What are enterprise customers saying about pricing?" needs volume inside the segment. Sample globally and filter, and you've got three conversations from your top tier. Sample inside the segment, and you had to know the segment mattered before you sampled — which means you've already lost the ability to ask unanticipated questions.
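
The first two examples can be put in numbers. The Python sketch below uses textbook approximations: independent random draws for the rare-event case, and a 95% normal-approximation margin for the difference of two proportions at the worst case p = 0.5. Those modeling choices are assumptions for illustration, not figures from our data.

```python
import math

def prob_sample_misses_all(rate, sample_size):
    """Chance a random sample contains zero conversations with the signal,
    treating draws as independent (reasonable for a large corpus)."""
    return (1 - rate) ** sample_size

def diff_margin_points(sample_size, p=0.5, z=1.96):
    """Approximate 95% margin of error, in percentage points, for the
    difference between two proportions measured on equal-size samples."""
    standard_error = math.sqrt(2 * p * (1 - p) / sample_size)
    return 100 * z * standard_error

miss = prob_sample_misses_all(0.02, 50)
margin = diff_margin_points(50)
print(f"P(sample contains no signal at all) = {miss:.0%}")
print(f"Margin on a period-over-period change = {margin:.1f} points")
```

Under these assumptions, a 50-conversation sample has roughly a one-in-three chance of containing no churn signal at all, and the margin on a 50-versus-50 comparison comes out just under 20 points, in line with the figure above.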

‍

The tradeoff is real

‍

Nothing comes for free. 100% coverage costs more. We’re not talking about what we charge; it literally costs more in AI tokens.

‍

But you get an answer you can trust and make better decisions with.

‍

The summary

‍

Claude + Snowflake is a brain reaching for a tool that doesn't understand conversations.

‍

Claude + Rippit is a brain reaching for a tool that already does.

‍

Same model. Different tools. Different answers.

‍

Snowflake Example from Claude

‍

‍

Prompt:

Can you use snowflake to tell me how customers are talking about Rippit and how that's changed over the last month or two?

‍

Claude states that it has to sample and is just trying to determine the best approach:

‍

So the windows look like:

‍

Last ~2 months (focus): April + May 2026 = 1,099 calls across ~340 distinct accounts
Prior comparable window: Feb + March 2026 = 1,318 calls across ~370 accounts

‍

Average transcript is ~34K characters. If I pulled the whole April/May window into context, that's ~37M chars — totally infeasible to read directly. So I need a sampling strategy.

‍

Where I'd actually start. Cheap signals first to find which calls are even worth opening, then read only those. Roughly:

‍

Stratify by account, not by call. A single big account can produce 30 calls in a month and dominate any naive random sample. I'd partition into buckets — e.g. (a) top-ARR accounts, (b) churn-cohort accounts (which I do happen to know from Salesforce), (c) new logos, (d) everyone else — and sample within each.
Use cheap lexical filters on the raw transcript to find calls that are likely to be about Rippit-as-a-product (vs. random discovery calls, demos, internal syncs that got recorded). Keywords like "Rippit", "MaestroQA", "the product", "feature", "broken", "doesn't work", "love", "cancel", "renewal", "competitor", etc. That gets you from ~1,099 → maybe 300-500 candidate calls.
From candidates, take a stratified random sample small enough to actually summarize — maybe 30-50 calls per window. Pull the first ~5K chars of each, dump into a single context, and synthesize themes by hand (or with one LLM call).

‍

The methodology it chose

‍

Step 1: Scope the windows and drop unattributable calls. April-May 2026 = 1,099 calls, but ~250 of those have no ACCOUNT_ID (internal syncs, recordings that never got matched to a Salesforce account). Drop them — you can't talk about "customers" without knowing which customer. That leaves 852 attributed calls.

‍

Step 2: Pick a role lens. CSM calls are where existing customers talk about the product. SDR/AE calls are pitch motion — different signal, separate analysis. For "what are customers saying about Rippit," restrict to CSM + Head of CS. That's 508 calls in Apr-May.

‍

Step 3: Lexical filters to find candidate calls. Cheap and noisy, but better than random. The big lesson: most sentiment keywords are useless. "Love"/"great"/"amazing" matched 92% of all calls — they're filler. The keywords that do discriminate are concrete: product/brand mentions, named competitors, integration names (Gladly, Kustomer, Zendesk), specific surfaces (auto-QA, coaching, calibration, screen capture).

‍

Step 4: Stratified one-call-per-account sampling. A naive random sample over-weights chatty accounts (Mindbody had 8 product-mention calls in 2 months). Take the most recent call per account so each customer voice gets weighted once.

‍

Step 5: Snippet extraction, not full transcripts. Pull ±400 chars around the first brand mention. ~900 chars per call × 50 calls = 45K chars total — fits in one context window for thematic coding.

‍

Step 6: Track keyword rates month-over-month. Raw counts are misleading because call volumes swing 30%+ month-over-month. Normalize as percent of CSM calls.
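
Steps 4 through 6 are mechanical enough to sketch in a few lines of Python. This illustrates the logic described above and is not the code Claude ran; the field names (account, date, text), the sample rows, and the counts passed to `keyword_rate` are all made up.

```python
from datetime import date

def latest_call_per_account(calls):
    """Step 4: keep only the most recent call for each account."""
    latest = {}
    for call in calls:
        current = latest.get(call["account"])
        if current is None or call["date"] > current["date"]:
            latest[call["account"]] = call
    return list(latest.values())

def snippet_around(text, term, radius=400):
    """Step 5: return up to `radius` characters either side of the first
    case-insensitive mention of `term`, or None if it never appears."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return None
    return text[max(0, pos - radius): pos + len(term) + radius]

def keyword_rate(matching_calls, total_calls):
    """Step 6: express keyword hits as a percent of calls in the period."""
    return 100 * matching_calls / total_calls if total_calls else 0.0

# Hypothetical rows, for illustration only.
calls = [
    {"account": "acme", "date": date(2026, 4, 3), "text": "x" * 1000 + "Rippit" + "y" * 1000},
    {"account": "acme", "date": date(2026, 5, 20), "text": "We use Rippit daily."},
    {"account": "globex", "date": date(2026, 5, 2), "text": "No brand mention here."},
]

sampled = latest_call_per_account(calls)
snippets = [snippet_around(c["text"], "Rippit") for c in sampled]
print(len(sampled), snippets)
print(keyword_rate(30, 200), keyword_rate(30, 260))
```

Note what falls out of the pipeline by construction: the earlier acme call is dropped entirely, the call with no literal brand mention yields no snippet, and at most about 800 characters of any transcript survive.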

‍

What the methodology misses (in Claude’s own words):

‍

Sentiment direction. Frustrated/confused word counts are flat, but that doesn't mean sentiment is flat — those words are too generic. A keyword can’t capture the target of the frustration.
Defect categorization. I can count "Integration Partner X" mentions but not split "X integration broken" vs "evaluating X as a competitor" vs "X is fine, talking about something else."
Quote extraction. Surfacing the actual damning customer line requires reading the transcript manually for every flagged call.
Aggregation across accounts. "What are top 5 themes driving churn risk" is a one-line SQL query, but how do you determine what counts as a churn risk in a conversation? Without running an LLM over the entire transcript, you’re just using some keyword-search heuristic.
‍

Realistic conclusion: the lexical approach gets you the integration health questions (which CRM/help desk is the friction) and the vocabulary tracking questions (rebrand, named competitors) cheaply. It can't replace enrichment for sentiment, defect taxonomy, or churn-driver attribution - those genuinely need the LLM pass.

‍

Check out the video of the second attempt of the prompt using the Snowflake MCP – the outputs still highlight the limitations outlined above:

Load Testing Conversation Analysis in Snowflake

We tried to rebuild Rippit inside Snowflake to see if Cortex could do it alone. It didn't crash at 200 users — but "it didn't crash" is a low bar.

I really enjoy using Snowflake. I'm in it constantly. It's where our data lives, and most days it just quietly does what I need it to do.

‍

I run finance and operations at Rippit. Part of what we do is conversation intelligence. Point us at your sales and support calls and we turn them into answers.

‍

And Snowflake has a built-in AI service, Cortex, sitting right next to all that conversation data.

‍

Which raises an obvious, slightly terrifying question:

‍

Could you just… do this yourself, in Snowflake, without Rippit?

‍

I genuinely wanted to know. Part curiosity, part self-preservation: if Snowflake can already do this on its own, Rippit's in real trouble.

‍

So I tried to build Rippit in Snowflake, or at least the core of it, and push it until something broke.

‍

The experiment

‍

The dream version is simple. Someone types a plain-English question — how are customers talking about us this month? which deals are slipping, and why? — and the system calls an LLM against the actual transcripts, on the fly, and hands back an answer.

‍

All inside Snowflake. No extra infrastructure, no data leaving the building.

‍

But before any of that, before Cortex reads a single word, I had to get the conversations into a shape the LLM could even analyze.

‍

Our conversations don't live in one tidy place. Sales calls come from Gong, support chats from Intercom. Both sync into Snowflake through Fivetran, but they land raw. And raw is a mess.

‍

A Gong call isn't a transcript; it's hundreds of rows, one per spoken sentence, scattered across tables with speaker IDs and timestamps.

‍

Intercom is the same idea in a different shape: each conversation is a pile of message "parts," wrapped in HTML, tagged by whether an agent, a customer, or a bot wrote them.

‍

So before I could analyze anything, I had to do real data modeling.

‍

Stitch the sentence rows back into one ordered transcript per call.
Label who's speaking, [agent] vs. [customer], so the model isn't guessing.
Strip the HTML out of the Intercom threads.
Force two completely different sources into one common shape: one row per conversation, one clean transcript field, and shared columns like account, date, and channel.
And join all of it back to Salesforce, so a question like "which deals are slipping" even knows what a deal is.
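
To give a feel for the first three steps, here is a small Python sketch of stitching and cleaning. It is an illustration, not the dbt models themselves: the column names (speaker_id, start_ms, text) and the sample rows are assumptions, not the real Gong or Intercom schema.

```python
import html
import re

def stitch_transcript(sentence_rows, customer_speaker_ids):
    """Order sentence rows by start time and prefix each with a role label."""
    lines = []
    for row in sorted(sentence_rows, key=lambda r: r["start_ms"]):
        role = "customer" if row["speaker_id"] in customer_speaker_ids else "agent"
        lines.append(f"[{role}] {row['text']}")
    return "\n".join(lines)

def strip_html(fragment):
    """Drop tags and decode entities from a chat message part.
    A regex is enough for a sketch; a real pipeline may want an HTML parser."""
    no_tags = re.sub(r"<[^>]+>", " ", fragment)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()

# Hypothetical rows; the column names are assumptions, not the real schema.
rows = [
    {"speaker_id": "s2", "start_ms": 4200, "text": "Sure, what's the issue?"},
    {"speaker_id": "s1", "start_ms": 1000, "text": "Hi, I have a question."},
]

transcript = stitch_transcript(rows, customer_speaker_ids={"s1"})
cleaned = strip_html("<p>Thanks &amp; talk soon<br></p>")
print(transcript)
print(cleaned)
```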

‍

That's a stack of dbt models and a daily pipeline standing between raw data and the very first question. None of it is hard, exactly, but it's hours of work and design decisions that I had to make. The LLM couldn't read what I hadn’t cleaned yet.

‍

With the data finally in one clean, queryable place, I wired it up and started turning the dial: one user, then ten, then a hundred, then two hundred all asking at once.

‍

I tracked how fast it answered, what it cost, and what fell over.

‍

A few things surprised me.

‍

Buying a bigger computer did almost nothing

‍

When it felt slow, my first instinct was the lazy one: throw more hardware at it.

‍

I 4x’d the compute. It only got 8% faster.

‍

Here's why. The AI doesn't actually run on the computer you're paying to scale up. It runs on a separate, shared service behind the platform. Your compute just takes the order and carries the plate. The kitchen is somewhere else, and it cooks at its own pace.

‍

A fancier waiter doesn't make the kitchen cook faster.
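
One way to read that result is through Amdahl's law: if only a fraction of each request's time runs on the compute you scaled, the rest caps the speedup. The Python sketch below solves for that fraction. It assumes "8% faster" means latency fell by 8% and that the scaled portion speeds up in proportion to the compute; both are my simplifications, not measurements.

```python
def implied_scalable_fraction(compute_multiplier, latency_reduction):
    """Amdahl's law, solved for the share of request time that speeds up.

    With old_time = 1:
        new_time = (1 - f) + f / compute_multiplier
        latency_reduction = 1 - new_time
    """
    return latency_reduction / (1 - 1 / compute_multiplier)

f = implied_scalable_fraction(compute_multiplier=4, latency_reduction=0.08)
print(f"Share of request time on the scaled compute: {f:.1%}")
```

Under those assumptions, only about a tenth of each request was ever on the hardware I was paying to scale. The rest was the kitchen.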

‍

The good questions are the expensive ones

‍

One answer is cheap. Reading a single transcript to answer a question costs about four cents on a premium model. Nobody's going to notice four cents.

‍

But the questions worth asking don't read one transcript. They read all of them.

‍

"Give me a competitive rundown across last quarter's calls" means actually reading last quarter's calls. Two hundred and fifty of them run about $11 and fifty minutes.

‍

And it runs again, from scratch, every time someone tweaks the question or the data refreshes.

‍

A premium model does give you a better answer: sharper, better at holding a complicated instruction. You just pay for that quality every time, across every conversation, and nothing warns you when you've wandered onto the expensive path.

‍

It's a slow cooker, not a microwave

‍

We're used to AI replying in a second or two. That holds for one short question. It stops holding the moment you ask it to read a lot.

‍

‍

Twelve seconds per conversation. Fine on its own. But 250 conversations is nearly an hour.

‍

The time grows in a straight line with the volume. There's no setting that makes "read everything" fast.
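
The straight line is easy to write down. A quick Python sketch using the rounded per-transcript figures from this post, assuming transcripts are read one after another (the function name is made up for illustration):

```python
COST_PER_TRANSCRIPT_USD = 0.04   # "about four cents" on a premium model
SECONDS_PER_TRANSCRIPT = 12      # measured per-conversation latency

def full_read_estimate(num_transcripts):
    """Cost and wall-clock time when transcripts are read sequentially."""
    cost = num_transcripts * COST_PER_TRANSCRIPT_USD
    minutes = num_transcripts * SECONDS_PER_TRANSCRIPT / 60
    return cost, minutes

cost, minutes = full_read_estimate(250)
print(f"250 transcripts: ~${cost:.0f}, ~{minutes:.0f} minutes")
```

With the rounded four-cent figure this lands at about $10 and 50 minutes, close to the roughly $11 I measured. Double the transcripts and both numbers double.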

‍

200 people at once? It didn't break. It was really slow.

‍

This was the real test I cared about. Picture a sales team on a Monday morning: dozens of reps all opening the tool and asking about their own accounts at the same time.

‍

So I built up to it: a hundred simultaneous users, then two hundred.

‍

‍

Nothing broke. No errors, no "try again later." Double the users and the total output doubled too. Honestly, I was impressed — Snowflake didn't flinch at 200.

‍

But it was slow the whole way: about thirty seconds for a quick question, a full minute for a deep one, whether 100 people were asking or 200.

‍

And that's for reading one transcript.

‍

Now picture a real question, "summarize every competitor mentioned across my accounts," that has to read a hundred transcripts, asked by a hundred people at once. That's ten thousand analyses demanded on the spot.

‍

At the speeds I measured, the folks at the back of that line are waiting tens of minutes, not seconds. (Rough estimate, I didn't run that exact test, but the throughput math is the throughput math.)
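
For the curious, here is the back-of-envelope version in Python. It assumes the system keeps sustaining 200 analyses at a time at the per-transcript latencies I measured (30 seconds quick, 60 seconds deep). That is an extrapolation, not a test I ran.

```python
def minutes_to_drain(total_analyses, concurrent_slots, seconds_per_analysis):
    """Time for the last request to finish if `concurrent_slots` analyses
    run at once and each takes `seconds_per_analysis`."""
    throughput_per_second = concurrent_slots / seconds_per_analysis
    return total_analyses / throughput_per_second / 60

total = 100 * 100  # 100 people, each needing 100 transcripts read
quick = minutes_to_drain(total, concurrent_slots=200, seconds_per_analysis=30)
deep = minutes_to_drain(total, concurrent_slots=200, seconds_per_analysis=60)
print(f"{quick:.0f} to {deep:.0f} minutes for the back of the line")
```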

‍

And to even reach 200, I had to stand up eight separate compute clusters and babysit them by hand, because a single one only handles a few dozen people at a time. Eight clusters, all running, all billing, for the length of the test.

‍

It's not the model. It's everything around it.

‍

So — does Rippit get to exist?

‍

Here's what I walked away believing.

‍

The raw capability is real. Snowflake and Cortex shrugged off 200 people without blinking, and I still love it.

‍

But "it didn't crash" is a low bar for something people are supposed to enjoy using.

‍

The distance between one person asking one question on one transcript and an AI product is almost invisible, right up until you're at scale. Then it’s the whole thing.

‍

Keeping answers fast when everyone shows up at once.

‍

Keeping the bill from quietly compounding.

‍

Building an entire orchestration layer to spread the work across all that compute.

‍

Reading 100% of the data instead of a convenient sample, without it taking six hours.

‍

None of that is the model. The model is the easy part.

‍

Everything around it — the tools, the environment, the engineering — is the actual job.

‍

And yeah: turns out Rippit gets to exist. I’ll keep stress-testing.
