Load Testing Conversation Analysis in Snowflake

Mike Nucci

June 8, 2026

I really enjoy using Snowflake. I'm in it constantly. It's where our data lives, and most days it just quietly does what I need it to do.

I run finance and operations at Rippit. Part of what we do is conversation intelligence. Point us at your sales and support calls and we turn them into answers.

And Snowflake has a built-in AI service, Cortex, sitting right next to all that conversation data.

Which raises an obvious, slightly terrifying question:

Could you just… do this yourself, in Snowflake, without Rippit?

I genuinely wanted to know. Part curiosity, part self-preservation: if Snowflake can already do this on its own, Rippit's in real trouble.

So I tried to build Rippit in Snowflake, or at least the core of it, and push it until something broke.

The experiment

The dream version is simple. Someone types a plain-English question — how are customers talking about us this month? which deals are slipping, and why? — and the system calls an LLM against the actual transcripts, on the fly, and hands back an answer.

All inside Snowflake. No extra infrastructure, no data leaving the building.

But before any of that, before Cortex read a single word, I had to get the conversations into a shape the LLM could actually analyze.

Our conversations don't live in one tidy place. Sales calls come from Gong, support chats from Intercom. Both sync into Snowflake through Fivetran, but they land raw. And raw is a mess.

A Gong call isn't a transcript; it's hundreds of rows, one per spoken sentence, scattered across tables with speaker IDs and timestamps.

Intercom is the same idea in a different shape: each conversation is a pile of message "parts," wrapped in HTML, tagged by whether an agent, a customer, or a bot wrote them.

So before I could analyze anything, I had to do real data modeling.

Stitch the sentence rows back into one ordered transcript per call.
Label who's speaking, [agent] vs. [customer], so the model isn't guessing.
Strip the HTML out of the Intercom threads.
Force two completely different sources into one common shape: one row per conversation, one clean transcript field, and shared columns like account, date, and channel.
And join all of it back to Salesforce, so a question like "which deals are slipping" even knows what a deal is.
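
For a sense of what that modeling involves, here is a minimal Python sketch of the stitch, label, and strip steps. The row shape, field names, and sample rows are made up for illustration; the real work was done in dbt models against the actual Gong and Intercom tables, which are not reproduced here.

```python
import html
import re
from collections import defaultdict

# Hypothetical row shape: (conversation_id, start_seconds, speaker_role, text).
# Real Gong/Intercom tables synced by Fivetran have different columns.
rows = [
    ("call-1", 4.2, "customer", "We are also looking at another vendor."),
    ("call-1", 0.0, "agent", "Thanks for joining today."),
    ("chat-9", 1.0, "customer", "<p>My export is <b>broken</b>&nbsp;again.</p>"),
]

TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text: str) -> str:
    """Drop tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return " ".join(html.unescape(no_tags).split())

def build_transcripts(rows):
    """Return {conversation_id: transcript}: one ordered, speaker-labeled string each."""
    by_conversation = defaultdict(list)
    for conversation_id, start, role, text in rows:
        by_conversation[conversation_id].append((start, role, strip_html(text)))
    return {
        conversation_id: "\n".join(f"[{role}] {text}" for _, role, text in sorted(parts))
        for conversation_id, parts in by_conversation.items()
    }

transcripts = build_transcripts(rows)
print(transcripts["call-1"])
print(transcripts["chat-9"])
```

Even this toy version has to decide how to order rows, how to label speakers, and what counts as markup, which is the kind of design decision the list above is describing.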

That's a stack of dbt models and a daily pipeline standing between raw data and the very first question. None of it is hard, exactly, but it's hours of work and a string of design decisions I had to make. The LLM can't read what I haven't cleaned yet.

With the data finally in one clean, queryable place, I wired it up and started turning the dial: one user, then ten, then a hundred, then two hundred all asking at once.

I tracked how fast it answered, what it cost, and what fell over.

A few things surprised me.

Buying a bigger computer did almost nothing

When it felt slow, my first instinct was the lazy one: throw more hardware at it.

I 4x’d the compute. It only got 8% faster.

Here's why. The AI doesn't actually run on the computer you're paying to scale up. It runs on a separate, shared service behind the platform. Your compute just takes the order and carries the plate. The kitchen is somewhere else, and it cooks at its own pace.

A fancier waiter doesn't make the kitchen cook faster.

The good questions are the expensive ones

One answer is cheap. Reading a single transcript to answer a question costs about four cents on a premium model. Nobody's going to notice four cents.

But the questions worth asking don't read one transcript. They read all of them.

"Give me a competitive rundown across last quarter's calls" means actually reading last quarter's calls. Two hundred and fifty of them cost about $11 and take about fifty minutes.

And it runs again, from scratch, every time someone tweaks the question or the data refreshes.

A premium model does give you a better answer: sharper, better at holding a complicated instruction. You just pay for that quality every time, across every conversation, and nothing warns you when you've wandered onto the expensive path.

It's a slow cooker, not a microwave

We're used to AI replying in a second or two. That holds for one short question. It stops holding the moment you ask it to read a lot.

Twelve seconds per conversation. Fine on its own. But 250 conversations is nearly an hour.

The time grows in a straight line with the volume. There's no setting that makes "read everything" fast.
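
The straight-line math is simple enough to sketch in a few lines of Python. The two constants are the per-conversation figures quoted in this post; the function name is made up, and the sketch assumes conversations are read one after another, which is what "twelve seconds each, fifty minutes for 250" implies.

```python
SECONDS_PER_CONVERSATION = 12   # per-conversation latency quoted in this post
COST_PER_CONVERSATION = 0.04    # "about four cents" on a premium model

def full_read(n_conversations: int) -> tuple[float, float]:
    """Return (minutes, dollars) for reading every conversation one after another."""
    minutes = n_conversations * SECONDS_PER_CONVERSATION / 60
    dollars = n_conversations * COST_PER_CONVERSATION
    return minutes, dollars

for n in (1, 250, 1000):
    minutes, dollars = full_read(n)
    print(f"{n:>5} conversations: {minutes:6.1f} min, ${dollars:,.2f}")
```

At 250 conversations this gives 50 minutes and roughly $10, close to the figures reported above. Quadruple the volume and both numbers quadruple with it.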

200 people at once? It didn't break. It was really slow.

This was the real test I cared about. Picture a sales team on a Monday morning: dozens of reps all opening the tool and asking about their own accounts at the same time.

So I built up to it: a hundred simultaneous users, then two hundred.

Nothing broke. No errors, no "try again later." Double the users and the total output doubled too. Honestly, I was impressed — Snowflake didn't flinch at 200.

But it was slow the whole way: about thirty seconds for a quick question, a full minute for a deep one, whether 100 people were asking or 200.

And that's for reading one transcript.

Now picture a real question, "summarize every competitor mentioned across my accounts," that has to read a hundred transcripts, asked by a hundred people at once. That's ten thousand analyses demanded on the spot.

At the speeds I measured, the folks at the back of that line are waiting tens of minutes, not seconds. (Rough estimate, I didn't run that exact test, but the throughput math is the throughput math.)
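
Here is that throughput math as a small Python sketch. It is an estimate, not a measurement: it assumes the system keeps about 200 single-transcript analyses in flight at once (the level reached in the 200-user test) at the thirty-second and one-minute latencies quoted above. The function name is made up.

```python
def wait_minutes(total_analyses: int, concurrent_slots: int, seconds_each: float) -> float:
    """Minutes until the last analysis finishes if `concurrent_slots` run at a time."""
    throughput_per_second = concurrent_slots / seconds_each
    return total_analyses / throughput_per_second / 60

analyses = 100 * 100  # 100 people, each asking a question that reads 100 transcripts

# Assumption: about 200 analyses in flight, as in the 200-user test.
quick = wait_minutes(analyses, concurrent_slots=200, seconds_each=30)
deep = wait_minutes(analyses, concurrent_slots=200, seconds_each=60)
print(f"quick questions: ~{quick:.0f} min, deep questions: ~{deep:.0f} min")
```

Under those assumptions the back of the line waits about 25 minutes for quick questions and about 50 for deep ones.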

And to even reach 200, I had to stand up eight separate compute clusters and babysit them by hand, because a single one only handles a few dozen people at a time. Eight clusters, all running, all billing, for the length of the test.

It's not the model. It's everything around it.

So — does Rippit get to exist?

Here's what I walked away believing.

The raw capability is real. Snowflake and Cortex shrugged off 200 people without blinking, and I still love it.

But "it didn't crash" is a low bar for something people are supposed to enjoy using.

The distance between one person asking one question on one transcript and an AI product is almost invisible, right up until you're at scale. Then it’s the whole thing.

Keeping answers fast when everyone shows up at once.

Keeping the bill from quietly compounding.

Building an entire orchestration layer to spread the work across all that compute.

Reading 100% of the data instead of a convenient sample, without it taking six hours.

None of that is the model. The model is the easy part.

Everything around it — the tools, the environment, the engineering — is the actual job.

And yeah: turns out Rippit gets to exist. I’ll keep stress-testing.

Why Rippit Reads 100% of Conversations and Claude + Snowflake Doesn't

Sampling gives you an answer. It doesn't give you the right answer — and that's the difference between Claude + Snowflake and Claude + Rippit.

Connect Claude to Snowflake and ask, "What are the top three reasons customers are churning?"

You'll get a confident answer that’s, on its face, incredibly compelling, with beautiful graphs and formatting.

But read the fine print. For example, Claude may say that the average transcript is ~34K characters, so pulling the whole April/May window into context would mean ~37M chars, which is totally infeasible to read directly, and that it therefore needs a sampling strategy. Then it tells you which shortcut it took:

I randomly sampled 50 conversations.
I used keyword analysis to select the sample.
I read the first comment of each support case.
I will select one call per account.

Why 50? Why keyword search? Why the first comment?

Why not 100% of all 10,000 conversations?

It's not the model. It's the tools.

Same brain, different tools

The model is the brain; the tools are the arms and legs.

Claude and Rippit share the same brain.

So does Claude connected to any data source: Snowflake, a helpdesk, a call recorder, a chat platform, a CRM.

What's different is how Rippit can orchestrate the brain.

What happens when Claude talks to Snowflake

Claude writes SQL and Snowflake runs the query. Rows come back, including raw transcripts.

Claude then has to load those transcripts into its context window and reason over them at query time.

That's the bottleneck.

A transcript could be 2,000 to 10,000 tokens, maybe more. Reading 50,000 of them to answer one question is technically possible and economically insane.

Neither Claude nor any other LLM can load 100% of the transcripts into its context window, so the LLM has to sample them somehow.
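
To put rough numbers on that limit, here is a small Python sketch. The 50,000 transcripts and the 2,000 to 10,000 token range come from the text above; the 200,000-token context window is an assumed, illustrative figure, since window sizes vary by model.

```python
TRANSCRIPTS = 50_000
TOKENS_PER_TRANSCRIPT = (2_000, 10_000)  # range quoted above
CONTEXT_WINDOW_TOKENS = 200_000          # assumption: illustrative size, varies by model

low_total = TRANSCRIPTS * TOKENS_PER_TRANSCRIPT[0]
high_total = TRANSCRIPTS * TOKENS_PER_TRANSCRIPT[1]

# How many whole transcripts fit in one window, best and worst case.
fit_best = CONTEXT_WINDOW_TOKENS // TOKENS_PER_TRANSCRIPT[0]
fit_worst = CONTEXT_WINDOW_TOKENS // TOKENS_PER_TRANSCRIPT[1]

print(f"corpus size: {low_total:,} to {high_total:,} tokens")
print(f"transcripts per window: {fit_worst} to {fit_best}")
print(f"share of corpus in one window: {fit_worst / TRANSCRIPTS:.2%} to {fit_best / TRANSCRIPTS:.2%}")
```

With those assumptions, a single window holds somewhere between 20 and 100 transcripts, a fraction of a percent of the corpus. That is where the sample sizes in the examples above come from.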

The model summarizes the sample and reports back. It takes shortcuts because it has to.

Again, it will give you what seems to be an amazing answer, one that may look totally defensible. But when you dig into what Claude actually did, you see all the shortcuts it took.

The same thing happens with any conversation source

Different tool, same shape.

Connect Claude to Zendesk. Connect it to Gong. Connect it to Intercom, Salesforce Service Cloud, Front, Dialpad, Slack, Teams, your in-product chats, your agent-to-agent messages. Pick any conversation source.

Zendesk and Gong aren’t data platforms like Snowflake, so they can't run 100% analysis on the fly for you. Claude has to sample from those sources and fit what it can into the context window, which is a tiny percentage of the total conversation volume.

Snowflake has the ability to enrich 100% of conversations, but it isn't designed for this, so it’s too slow to be feasible and requires more complex internal building.

What Rippit does differently — part one, pre-enrichment

Every conversation that lands in Rippit gets processed once, at ingestion. Topics. Intents. Sentiment. Escalations. Outcomes. Themes. Entities.

We’ve identified a set of data points that most people will need, and we pre-enrich every conversation with them, based on what we’ve learned over the past 10 years.

However, we also give customers the ability to build customized pre-enrichment prompts. Every business is unique, and so are the conversations it has, so you should be able to customize which insights you pull out of your data.

You define the questions you actually care about ("did the customer mention a competitor by name," "did the agent quote pricing," "was a regulatory term invoked," "did this conversation contain a renewal objection," "what were our customers' thoughts on new product X") and we run those against every single conversation at ingestion.

A custom question becomes a custom dimension. Asked once, answered across 100% of the corpus.

Those become structured dimensions and measures that Rippit can draw on when someone asks a question, which makes each question cheaper and faster.

This is why Rippit is better than the "Claude + warehouse" setup. The generic setup gives you raw text and asks the model to figure out your business at query time, every time, over a sample. Rippit lets you encode your business once and run it over everything.

The per-conversation model cost is paid once, asynchronously, at ingestion. Not on every analytical pass.
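
A toy Python sketch of the enrich-once idea, not Rippit's actual implementation. The `mentions_competitor` function is a hypothetical stand-in for a per-conversation LLM call (here just a keyword check so the example runs), and the conversations are made up. What it illustrates is the call count: the model runs once per conversation at ingestion, and later questions only aggregate the stored column.

```python
from collections import Counter

model_calls = 0

def mentions_competitor(transcript: str) -> bool:
    """Stand-in for one LLM call answering a custom question about one conversation."""
    global model_calls
    model_calls += 1
    return "competitor" in transcript.lower()  # placeholder logic, not a real model

def ingest(conversations: dict[str, str]) -> dict[str, dict]:
    """Enrich every conversation once and store the answer as a structured column."""
    return {
        conv_id: {"mentions_competitor": mentions_competitor(text)}
        for conv_id, text in conversations.items()
    }

conversations = {
    "c1": "We are comparing you with a competitor on price.",
    "c2": "How do I reset my password?",
    "c3": "A Competitor offered us a discount.",
}
enriched = ingest(conversations)

# Later questions aggregate the stored column; no transcript is re-read.
for _ in range(5):
    counts = Counter(row["mentions_competitor"] for row in enriched.values())

print(counts[True], "of", len(enriched), "conversations mention a competitor")
print("model calls:", model_calls)
```

Five analytical passes later, the model has still been called only three times, once per conversation.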

If you do take the time to build LLM pipelines yourself, in Snowflake for example, you’ll find the cost differs from what it would be in Rippit. We pass through what the model providers charge us, 1:1, whereas Snowflake marks up these tokens and recognizes that markup as revenue. As a result, your LLM token costs there run 10-25% higher than with Rippit. Once you get hooked and want more and more insight out of your data, that 10-25% turns into a much higher annual cost.

What Rippit does differently — part two, on-the-fly enrichment

Pre-enrichment only covers some of the use cases customers ask about. You also need on-the-fly enrichment.

For some questions, the model needs to read the raw transcripts to figure out the answer.

For those, Rippit runs query-time, question-specific LLM calls over every conversation selected for the analysis.
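
As a rough Python sketch of what "every selected conversation" means in practice, and again not Rippit's actual implementation: `ask_model` below is a hypothetical placeholder for a real LLM call, the conversations are invented, and the thread pool simply stands in for whatever fans the work out.

```python
from concurrent.futures import ThreadPoolExecutor

def ask_model(question: str, transcript: str) -> str:
    """Placeholder for a real LLM call; returns a canned label for illustration."""
    return "yes" if "refund" in transcript.lower() else "no"

def answer_over_all(question: str, selected: dict[str, str], workers: int = 8) -> dict[str, str]:
    """Run the question against every selected conversation, not a sample."""
    ids = list(selected)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = pool.map(lambda cid: ask_model(question, selected[cid]), ids)
        return dict(zip(ids, answers))

selected = {
    "c1": "I want a refund for last month.",
    "c2": "Can you add SSO?",
    "c3": "Still waiting on my REFUND.",
}
results = answer_over_all("Did the customer ask for a refund?", selected)
coverage = len(results) / len(selected)
print(results, f"coverage={coverage:.0%}")
```

The property that matters is the last line: one answer per selected conversation, so coverage is 100% by construction.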

Pre-enrichment gives you a better baseline than what exists in your CRM or phone system, and on-the-fly enrichment is what gives you depth and 100% question coverage.

Sampling gives you an answer. It doesn't give you the right answer.

This is the part that should bother you most.

When the model samples 50 conversations and tells you the top three reasons customers are churning, it isn't lying. It found patterns. The patterns are real in those 50 conversations.

But it’s not verifiably accurate.

It's anecdotal evidence confidently articulated as analytics.

An LLM with a confident voice and three bullet points feels like a research report. It isn't. It's the equivalent of asking a consultant to talk to a couple of customers at random and come back with a strategy memo. The summary will sound smart. The conclusions might even be directionally correct. They also might be completely wrong, and you have no way to know which.

You're going to take this answer into a board meeting. Or a roadmap review. Or a renewal conversation. Decisions get made on it. And the underlying evidence base is fifty conversations out of fifty thousand, picked by a heuristic nobody audited.

Here are just three examples of where the anecdote-vs-statistics gap actually bites:

Rare events. If 2% of conversations contain a churn signal, a 50-conversation sample catches one of them, maybe. You'll never see the pattern. Compliance violations, executive escalations, regulatory disclosures: exactly the things you can't afford to miss, and exactly the things a small sample is most likely to miss.

Trend detection. Comparing two periods requires comparable coverage. Sample 50 last quarter and 50 this quarter, and the noise swamps any signal smaller than 20 points. With 100%, a 2-point shift in resolution rate is real.

Segmented analysis. "What are enterprise customers saying about pricing?" needs volume inside the segment. Sample globally and filter, and you've got three conversations from your top tier. Sample inside the segment, and you had to know the segment mattered before you sampled — which means you've already lost the ability to ask unanticipated questions.
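
The first two examples can be checked with textbook statistics. The Python sketch below assumes simple random sampling, a binomial model for rare events, and a normal approximation with a worst-case 50% base rate for the trend comparison; the function names are made up, and a real sampling heuristic picked by a model may behave worse than random.

```python
from math import comb, sqrt

def p_at_most(hits: int, sample_size: int, rate: float) -> float:
    """Binomial probability of seeing `hits` or fewer positives in a random sample."""
    return sum(
        comb(sample_size, i) * rate**i * (1 - rate) ** (sample_size - i)
        for i in range(hits + 1)
    )

def diff_margin(n_per_period: int, rate: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error, in percentage points, for the difference
    between two independent sample proportions of equal size."""
    standard_error = sqrt(2 * rate * (1 - rate) / n_per_period)
    return 100 * z * standard_error

# Rare events: a 2% signal in a 50-conversation sample.
p_none = p_at_most(0, 50, 0.02)
p_one_or_none = p_at_most(1, 50, 0.02)
print(f"chance the sample contains no churn signal: {p_none:.0%}")
print(f"chance it contains at most one: {p_one_or_none:.0%}")

# Trend detection: noise floor when comparing two sampled periods.
for n in (50, 500, 5000):
    print(f"n={n:>5} per period: +/-{diff_margin(n):.1f} points")
```

Under these assumptions, a 50-conversation sample has roughly a one-in-three chance of containing no churn signal at all, and about a three-in-four chance of containing one or none. The margin on a 50-versus-50 comparison comes out at about 20 points, which is where the figure above comes from. When every conversation is read, there is no sampling error left to swamp a small shift.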

The tradeoff is real

Nothing comes for free. 100% coverage costs more. We’re not talking about what we charge; it literally costs more in AI tokens.

But you get an answer you can trust and make better decisions with.

The summary

Claude + Snowflake is a brain reaching for a tool that doesn't understand conversations.

Claude + Rippit is a brain reaching for a tool that already does.

Same model. Different tools. Different answers.

Snowflake Example from Claude

Prompt:

Can you use snowflake to tell me how customers are talking about Rippit and how that's changed over the last month or two?

Claude states that it has to sample and is just trying to determine the best approach.

So the windows look like:

Last ~2 months (focus): April + May 2026 = 1,099 calls across ~340 distinct accounts
Prior comparable window: Feb + March 2026 = 1,318 calls across ~370 accounts

Average transcript is ~34K characters. If I pulled the whole April/May window into context, that's ~37M chars — totally infeasible to read directly. So I need a sampling strategy.

Where I'd actually start. Cheap signals first to find which calls are even worth opening, then read only those. Roughly:

Stratify by account, not by call. A single big account can produce 30 calls in a month and dominate any naive random sample. I'd partition into buckets — e.g. (a) top-ARR accounts, (b) churn-cohort accounts (which I do happen to know from Salesforce), (c) new logos, (d) everyone else — and sample within each.
Use cheap lexical filters on the raw transcript to find calls that are likely to be about Rippit-as-a-product (vs. random discovery calls, demos, internal syncs that got recorded). Keywords like "Rippit", "MaestroQA", "the product", "feature", "broken", "doesn't work", "love", "cancel", "renewal", "competitor", etc. That gets you from ~1,099 → maybe 300-500 candidate calls.
From candidates, take a stratified random sample small enough to actually summarize — maybe 30-50 calls per window. Pull the first ~5K chars of each, dump into a single context, and synthesize themes by hand (or with one LLM call).
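
The lexical filter in that plan is easy to picture as code. This Python sketch uses a handful of the keywords Claude listed; the transcripts and the function name are made up, and it is an illustration of the technique rather than the query Claude actually ran.

```python
import re

# Keywords taken from the list above; the transcripts are made-up examples.
KEYWORDS = ["rippit", "renewal", "cancel", "competitor", "doesn't work"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def candidate_calls(transcripts: dict[str, str]) -> list[str]:
    """Return ids of calls whose raw text contains at least one keyword."""
    return [call_id for call_id, text in transcripts.items() if PATTERN.search(text)]

transcripts = {
    "a": "We might cancel before the renewal.",
    "b": "Let's sync on the quarterly offsite agenda.",
    "c": "Honestly we are thinking about leaving for another vendor.",
}
print(candidate_calls(transcripts))
```

Call "c" is a churn risk by any human reading, but it never uses one of the keywords, so it is filtered out before anything reads it.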

The methodology it chose

Step 1: Scope the windows and drop unattributable calls. April-May 2026 = 1,099 calls, but ~250 of those have no ACCOUNT_ID (internal syncs, recordings that never got matched to a Salesforce account). Drop them — you can't talk about "customers" without knowing which customer. That leaves 852 attributed calls.

Step 2: Pick a role lens. CSM calls are where existing customers talk about the product. SDR/AE calls are pitch motion — different signal, separate analysis. For "what are customers saying about Rippit," restrict to CSM + Head of CS. That's 508 calls in Apr-May.

Step 3: Lexical filters to find candidate calls. Cheap and noisy, but better than random. The big lesson: most sentiment keywords are useless. "Love"/"great"/"amazing" matched 92% of all calls — they're filler. The keywords that do discriminate are concrete: product/brand mentions, named competitors, integration names (Gladly, Kustomer, Zendesk), specific surfaces (auto-QA, coaching, calibration, screen capture).

Step 4: Stratified one-call-per-account sampling. A naive random sample over-weights chatty accounts (Mindbody had 8 product-mention calls in 2 months). Take the most recent call per account so each customer voice gets weighted once.

Step 5: Snippet extraction, not full transcripts. Pull ±400 chars around the first brand mention. ~900 chars per call × 50 calls = 45K chars total — fits in one context window for thematic coding.

Step 6: Track keyword rates month-over-month. Raw counts are misleading because call volumes swing 30%+ month-over-month. Normalize as percent of CSM calls.
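
Steps 4 through 6 are simple to sketch in Python, which also makes it easy to see what each one throws away. The call records, account names, and monthly tallies below are hypothetical, and the function names are made up; this is not the SQL Claude ran.

```python
from datetime import date
from typing import Optional

# Hypothetical records: (call_id, account, call_date, transcript).
calls = [
    ("g1", "Acme", date(2026, 4, 3), "Rippit export failed again."),
    ("g2", "Acme", date(2026, 5, 20),
     "Rippit is fine. " + "filler " * 100 + "We may cancel next quarter."),
    ("g3", "Globex", date(2026, 4, 18), "Mostly small talk about the weather."),
]

def latest_call_per_account(calls):
    """Step 4: keep only the most recent call for each account."""
    latest = {}
    for call in calls:
        _, account, call_date, _ = call
        if account not in latest or call_date > latest[account][2]:
            latest[account] = call
    return list(latest.values())

def snippet_around(transcript: str, term: str, radius: int = 400) -> Optional[str]:
    """Step 5: up to `radius` characters on each side of the first mention of `term`."""
    index = transcript.lower().find(term.lower())
    if index == -1:
        return None  # no mention, so the call contributes nothing
    return transcript[max(0, index - radius): index + len(term) + radius]

def mention_rate(matching_calls: int, total_calls: int) -> float:
    """Step 6: keyword mentions as a percentage of calls."""
    return 100 * matching_calls / total_calls if total_calls else 0.0

sampled = latest_call_per_account(calls)
snippets = {call_id: snippet_around(text, "Rippit") for call_id, _, _, text in sampled}
print([call_id for call_id, *_ in sampled])
print({call_id: (len(s) if s else None) for call_id, s in snippets.items()})

# Hypothetical tallies: raw mentions fall from 30 to 24 while the rate rises.
print(mention_rate(30, 300), mention_rate(24, 200))
```

In this toy data, step 4 drops the earlier Acme call with the export failure, and step 5 cuts off the later call before "We may cancel next quarter." Each step is reasonable on its own; together they decide what the model never sees.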

What the methodology misses (in Claude’s own words):

Sentiment direction. Frustrated/confused word counts are flat, but that doesn't mean sentiment is flat — those words are too generic. A keyword can’t capture the target of the frustration.
Defect categorization. I can count "Integration Partner X" mentions but not split "X integration broken" vs "evaluating X as a competitor" vs "X is fine, talking about something else."
Quote extraction. Surfacing the actual damning customer line requires reading the transcript manually for every flagged call.
Aggregation across accounts. "What are top 5 themes driving churn risk" is a one-line SQL query, but how do you determine what counts as a churn risk in a conversation? Without running an LLM over the entire transcript, you’re just using some keyword-search heuristic.

Realistic conclusion: the lexical approach gets you the integration health questions (which CRM/help desk is the friction) and the vocabulary tracking questions (rebrand, named competitors) cheaply. It can't replace enrichment for sentiment, defect taxonomy, or churn-driver attribution - those genuinely need the LLM pass.

Check out the video of the second attempt of the prompt using the Snowflake MCP – the outputs still highlight the limitations outlined above:

Great AI Products

Great AI products shouldn't require a services team to set up—if you can’t learn it yourself in 10 minutes, the product is failing you.

Every software company is building AI into their product. As we embark on the Rippit journey, something we have spent a lot of time thinking about over the last 2-3 years is: what are the attributes of great products?

In the technology industry, there has been a lot of talk about Forward-Deployed Engineers (FDEs). These are people who work with customers to customize the product to each customer's needs.

What they are primarily doing is building prompts for LLMs on behalf of customers.

This sounds like a great deal for the customer, and it often leads to customers giving their business to the company that offers more services and help.

Our pre-Rippit experience suggests that this is a death trap for the company and customer within 24-36 months.

When a software company is small, early customers get the best employees as Forward-Deployed Engineers, and because the company is desperate for early customers, they also under-charge for the human resources they provide. Both of these things have to change for the software company to succeed over time.

The average customer loses the best FDE involved in their success. The average customer has to pay more for people resources.

Because the company has committed to a strategy where humans fill in product gaps, the product gets harder and harder to use over time.

What ends up happening is only the most important customers have good customer experiences.

I’ve lived this. I don’t think the FDE model can deliver a high-quality customer experience unless the customer is paying $500,000 per year (±$250,000). It’s very hard to hire enough high-quality Forward-Deployed Engineers to match your growing customer count.

This is why I think Great AI Products have to be easy enough to use that a customer could learn them on their own. It is the most important criterion when judging a Great AI Product.

That’s why we’ve committed to this strategy at Rippit — we’re not where we want to be yet, but we are making progress.

This also seems to be the defining attribute of the best software products we use internally, like Figma, ChatGPT, Claude, Ramp, Cursor, Snowflake, AWS, and more.

We felt even higher conviction when we evaluated the alternatives for Voice of Customer software, Quality Assurance software, Experience Management software, and Conversation Intelligence software. All of them require FDEs — the signal was that they all had minimum price points of $25,000 to $50,000. That is way too much friction for many potential customers.

I’m not sure a single one lets you sign up for the product without talking to a human, which is often another signal of a complex product. The ones I researched were Qualtrics, Medallia, Enterpret, Chattermill, Unwrap, Loris, Level AI, Observe.AI, Cresta, MaestroQA, Balto, and unitQ.

I think Great AI Products have to be so easy to use that you can sign up and learn them yourself within 10 minutes.

Not everyone in the industry agrees with me — in fact, I think most won’t.

Some will argue it’s different for products selling to B2B or certain industries.

Some will argue that companies with FDEs will use the learnings from customers to build an easier product. The reality is it’s very hard to keep the plane flying just as fast while building an easier-to-use product — and it requires a different product culture. It requires a founder to potentially slow growth down and rearchitect everything, and only the exceptions have the courage to do that.

Everyone will admit that deploying AI in all situations will get easier over time — similar to how making a website got easier from the 1990s to 2020.

There might be an exception but I’m not letting the exception be the rule.

Lastly, similar to how it’s harder to write fewer words than many words to get a point across, it’s harder to make easy-to-use software than hard-to-use software. I think people who argue for complexity are often scared to step up to the engineering challenge.
