WEBVTT

1
00:00:00.000 --> 00:00:02.440
We've taken a week or so off.

2
00:00:02.440 --> 00:00:04.500
Hope everybody had a good Thanksgiving.

3
00:00:04.500 --> 00:00:05.340
We just got back.

4
00:00:05.340 --> 00:00:06.760
We're kind of ready to hit the ground running

5
00:00:06.760 --> 00:00:08.840
for the last like two weeks,

6
00:00:08.840 --> 00:00:10.840
two, three weeks of like full weeks of December.

7
00:00:10.840 --> 00:00:14.800
So today we have our lead developer, Luchen,

8
00:00:14.800 --> 00:00:17.280
who's coming all the way from Montreal, Canada

9
00:00:17.280 --> 00:00:18.600
and excited to have him on here.

10
00:00:18.600 --> 00:00:19.520
He's gonna be talking about

11
00:00:19.520 --> 00:00:24.280
how we do our automated test workflow using some AI bots.

12
00:00:24.280 --> 00:00:25.960
And I'm gonna turn it over to him.

13
00:00:25.960 --> 00:00:28.080
And then we may do some follow-up questions,

14
00:00:28.080 --> 00:00:30.280
but he's got a little presentation to walk through.

15
00:00:30.280 --> 00:00:32.119
And yeah, I'm excited to hear about it.

16
00:00:32.119 --> 00:00:33.440
So over to you, Luchen.

17
00:00:35.960 --> 00:00:37.000
Hey, what's up guys?

18
00:00:37.000 --> 00:00:40.480
So I was developing this.

19
00:00:41.420 --> 00:00:43.160
This is gonna be a very like technical,

20
00:00:43.160 --> 00:00:46.500
like we're gonna go pretty deep in this.

21
00:00:46.500 --> 00:00:49.640
So I was developing this AI chat bot for a client

22
00:00:49.640 --> 00:00:52.920
and they have like a fairly complex workflow.

23
00:00:52.920 --> 00:00:53.800
We use like RAG.

24
00:00:53.800 --> 00:00:57.000
We have like an ingestion pipeline for their documents

25
00:00:57.000 --> 00:00:59.080
and then they're able to ask questions

26
00:00:59.080 --> 00:01:00.720
about their documents.

27
00:01:00.720 --> 00:01:05.239
So developing like the prompts and the whole like AI agent,

28
00:01:05.239 --> 00:01:09.640
the whole flow was like,

29
00:01:09.640 --> 00:01:11.240
we would have to like manually test it.

30
00:01:11.240 --> 00:01:13.880
So that was a bit tedious.

31
00:01:13.880 --> 00:01:16.320
And I was thinking of,

32
00:01:16.320 --> 00:01:20.840
we have like our like tried and true API test,

33
00:01:20.840 --> 00:01:23.960
like framework and like our whole way of doing things.

34
00:01:23.960 --> 00:01:27.480
So I was thinking, how can I reuse the lessons learned

35
00:01:27.480 --> 00:01:32.480
from it to put in place like a testing approach

36
00:01:33.480 --> 00:01:37.120
for AI chat bots and save myself some time and headache

37
00:01:37.120 --> 00:01:40.000
and not have to do as much like manual flow.

38
00:01:40.000 --> 00:01:45.000
So this is more like a backend oriented presentation.

39
00:01:46.000 --> 00:01:47.340
So let's go through it.

40
00:01:48.300 --> 00:01:49.740
Do you know?

41
00:01:49.740 --> 00:01:54.740
So why this was hard?

42
00:01:56.420 --> 00:02:00.000
Like it's really about like reducing friction

43
00:02:00.000 --> 00:02:05.000
and removing as much like manual testing as we can.

44
00:02:06.020 --> 00:02:11.020
So it's quite complex to like upload like files

45
00:02:11.300 --> 00:02:14.980
and we'd have to think in advance of how to,

46
00:02:14.980 --> 00:02:19.080
like which cases we want to test for.

47
00:02:19.080 --> 00:02:20.940
Like some documents are hard to read

48
00:02:20.940 --> 00:02:24.200
or maybe they contain irrelevant info.

49
00:02:24.200 --> 00:02:27.980
So thinking in advance of like kind of this setup

50
00:02:27.980 --> 00:02:32.900
for each test case and having to do over and over

51
00:02:32.900 --> 00:02:34.860
and kind of keeping it in working memory

52
00:02:34.860 --> 00:02:38.300
is a bit time consuming.

53
00:02:38.300 --> 00:02:40.260
And then long feedback loop,

54
00:02:40.260 --> 00:02:44.180
like all of the manual steps I would have to take

55
00:02:44.180 --> 00:02:47.500
and then every time that we like test manually

56
00:02:47.500 --> 00:02:49.340
we might miss edge cases

57
00:02:49.340 --> 00:02:51.340
and not go through all of the cases.

58
00:02:51.340 --> 00:02:53.060
So we might kind of break something

59
00:02:53.060 --> 00:02:55.860
when we modify the prompt or something like that.

60
00:02:55.860 --> 00:02:59.420
And the LLM responses vary.

61
00:02:59.420 --> 00:03:02.220
So sometimes you're not really sure,

62
00:03:02.220 --> 00:03:04.620
like, is this a good response?

63
00:03:04.620 --> 00:03:07.500
So this is kind of the challenges going into it

64
00:03:07.500 --> 00:03:10.100
and like how our feature works.

65
00:03:10.100 --> 00:03:12.720
I kind of gave you an overview.

66
00:03:12.720 --> 00:03:15.760
Users, they upload some medical documents

67
00:03:15.760 --> 00:03:19.760
like a doctor visit, like doctor notes or things like that.

68
00:03:19.760 --> 00:03:23.240
And then we have like another user that asks questions

69
00:03:23.240 --> 00:03:24.780
about their documents.

70
00:03:24.780 --> 00:03:27.720
Maybe they want to get like, I don't know,

71
00:03:27.720 --> 00:03:31.240
a quick overview of like the most severe conditions

72
00:03:31.240 --> 00:03:32.480
or something like that.

73
00:03:32.480 --> 00:03:36.020
And then the LLM reads the files' content and answers queries.

74
00:03:36.020 --> 00:03:40.900
So our backend has like tool calling and connections.

75
00:03:41.780 --> 00:03:44.900
So we had like some chatbots

76
00:03:44.900 --> 00:03:47.740
that they're just about like taking an input

77
00:03:47.740 --> 00:03:49.940
and then doing an action on it.

78
00:03:49.940 --> 00:03:52.820
But this, since it has like a more complex flow

79
00:03:52.820 --> 00:03:55.200
that also goes into a consideration

80
00:03:55.200 --> 00:03:57.380
like how we're gonna test this.

81
00:03:57.380 --> 00:04:00.700
And we have like a capability here,

82
00:04:00.700 --> 00:04:03.940
like file selection that a user can like choose

83
00:04:03.940 --> 00:04:06.340
specific files to ask questions about.

84
00:04:06.340 --> 00:04:11.340
That's kind of like a special feature

85
00:04:11.380 --> 00:04:14.660
that we're gonna be testing in our backend.

86
00:04:14.660 --> 00:04:19.660
And so in our normal API tests,

87
00:04:20.440 --> 00:04:23.820
we try to get as much, as close as we can

88
00:04:23.820 --> 00:04:26.560
to like the real behavior.

89
00:04:26.560 --> 00:04:29.740
So we use like a real database,

90
00:04:29.740 --> 00:04:31.300
but it's like a separate database.

91
00:04:31.300 --> 00:04:34.540
It's like a test database that gets wiped

92
00:04:34.540 --> 00:04:36.060
in between each run,

93
00:04:36.740 --> 00:04:37.580
but it's a real database.

94
00:04:37.580 --> 00:04:40.980
So if we write like bad SQL queries,

95
00:04:40.980 --> 00:04:43.540
then we're gonna catch that in our test database,

96
00:04:43.540 --> 00:04:46.460
which you wouldn't get if you would mock

97
00:04:46.460 --> 00:04:50.080
like in some teams, in some approaches, they do that.
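
NOTE
Editor's sketch, not shown in the talk: one way to get a real test database that is wiped between runs.
It assumes an in-memory SQLite database as a stand-in for the team's actual engine, and the table
name file_pages and its columns are hypothetical.
```python
import sqlite3
# Hypothetical schema, used only for illustration.
SCHEMA = "CREATE TABLE IF NOT EXISTS file_pages (file_name TEXT, page_number INTEGER, content TEXT)"
def fresh_test_db(conn):
    """Wipe and recreate the test table so every run starts from a known state."""
    conn.execute("DROP TABLE IF EXISTS file_pages")
    conn.execute(SCHEMA)
    conn.commit()
    return conn
def count_pages(conn, file_name):
    """Real SQL against a real engine: a typo in this query would fail the test."""
    row = conn.execute(
        "SELECT COUNT(*) FROM file_pages WHERE file_name = ?", (file_name,)
    ).fetchone()
    return row[0]
conn = fresh_test_db(sqlite3.connect(":memory:"))
conn.execute("INSERT INTO file_pages VALUES (?, ?, ?)", ("visit.pdf", 1, "Doctor notes"))
pages_before_wipe = count_pages(conn, "visit.pdf")
pages_after_wipe = count_pages(fresh_test_db(conn), "visit.pdf")
```
Because the queries run against a real engine, malformed SQL raises an error instead of passing silently, which a mocked database layer would likely not catch.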

98
00:04:51.900 --> 00:04:55.940
So the thing is about being pragmatic

99
00:04:55.940 --> 00:04:57.780
about like what we're gonna mock

100
00:04:57.780 --> 00:04:59.580
and what we're gonna let...

101
00:05:00.000 --> 00:05:05.840
um, let be like the real back-end behavior. So

102
00:05:05.840 --> 00:05:11.760
we, I already put in place like, um, a test for the data ingestion pipeline.

103
00:05:11.760 --> 00:05:16.800
So I already know that it's like writing to the database in a specific table.

104
00:05:16.800 --> 00:05:22.720
Like the file content takes, um, certain like PDF files and then parses

105
00:05:22.720 --> 00:05:26.160
them, and for each page it takes the page content and that

106
00:05:26.160 --> 00:05:29.440
becomes like a row in the database and then another row

107
00:05:29.440 --> 00:05:34.000
for the next page, etc. So the ingestion

108
00:05:34.000 --> 00:05:37.440
piece is already tested, so there's no need to like include it

109
00:05:37.440 --> 00:05:42.000
to be like kind of end-to-end. We, we're gonna go with a starting point that like

110
00:05:42.000 --> 00:05:45.280
the database is already pre-populated with

111
00:05:45.280 --> 00:05:49.120
um, file contents, and we have control over

112
00:05:49.120 --> 00:05:52.400
what files and what's their content, so that way we can

113
00:05:52.400 --> 00:05:56.880
start designing test cases. Um,

114
00:05:56.880 --> 00:06:00.640
then we make real API calls through the controller, so we

115
00:06:00.640 --> 00:06:07.760
actually let it, um, call OpenAI, and then we capture the

116
00:06:07.760 --> 00:06:13.200
LLM responses, and then we, we put it in a structured file

117
00:06:13.200 --> 00:06:16.560
so that we can review each test case like

118
00:06:16.560 --> 00:06:20.400
what was the setup, the scenario, and then what was the actual

119
00:06:20.400 --> 00:06:26.160
response. So in the setup, like the

120
00:06:26.160 --> 00:06:29.440
files that we start off with in the knowledge base

121
00:06:29.440 --> 00:06:34.560
we kind of hard-code this. And so we started with like hard-coding

122
00:06:34.560 --> 00:06:39.200
two, um, so simulate as if they already went

123
00:06:39.200 --> 00:06:44.480
through the ingestion pipeline. Uh, we have full control over the contents,

124
00:06:44.480 --> 00:06:48.240
and then if we want we can push it further, but I'm just going to show you

125
00:06:48.240 --> 00:06:52.240
like a basic one. Um, in our case, like we're adding two

126
00:06:52.240 --> 00:06:55.680
files, and then we choose to ask a question about only

127
00:06:55.680 --> 00:07:00.800
one file, and then we assert that, um, the second

128
00:07:00.800 --> 00:07:05.600
file is not included in the response, so that the

129
00:07:05.600 --> 00:07:09.520
tool calling and like everything internally is hooked up properly.

130
00:07:09.520 --> 00:07:15.440
And that way we get the maximum confidence that our API is like, is

131
00:07:15.440 --> 00:07:20.480
working properly. So to make this real API call, we,

132
00:07:20.480 --> 00:07:24.400
we actually call the like chat controller kind of manually

133
00:07:24.400 --> 00:07:27.520
and it goes through the real logic. It calls the,

134
00:07:27.520 --> 00:07:32.720
the LLM with the real file contents, and so it's kind of like the complete

135
00:07:32.720 --> 00:07:37.840
flow, uh, for that, like the chat API.

136
00:07:37.840 --> 00:07:43.360
And then once we get that back, we, uh, we just like create like a Markdown

137
00:07:43.360 --> 00:07:49.280
file that says like what it was, uh, the expected behavior,

138
00:07:49.280 --> 00:07:53.120
like the second file content shouldn't be there, the first file content should

139
00:07:53.120 --> 00:07:56.240
be there, uh, the actual response that we got from

140
00:07:56.240 --> 00:08:00.880
the LLM, like the real, uh, OpenAI call, and then what we can

141
00:08:00.880 --> 00:08:05.840
check for manually. And so this makes us like

142
00:08:05.840 --> 00:08:10.000
way faster to test, and we can add like other cases,

143
00:08:10.000 --> 00:08:13.040
uh, with full control and confidence that like

144
00:08:13.040 --> 00:08:19.920
our API is working properly. And we can like repeat it as much as we

145
00:08:19.920 --> 00:08:23.600
want. Like when we tweak prompts, we can see the effect almost in real

146
00:08:23.600 --> 00:08:27.600
time, and it's the real behavior, and it's like

147
00:08:27.600 --> 00:08:30.400
real readable. To me, what's important in writing

148
00:08:30.400 --> 00:08:32.640
tests is that it's like readable, kind of like,

149
00:08:32.640 --> 00:08:36.480
like plain English, like a story. You know, we started off with

150
00:08:36.480 --> 00:08:40.480
two medical files and we chose to read one, and then we got

151
00:08:40.480 --> 00:08:47.360
this output, and then we can evaluate if it's, if it makes sense.
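
NOTE
Editor's sketch, not shown in the talk: the file-selection scenario written as a small, story-like test.
The function names are hypothetical, and echo_llm is an offline stand-in for the real OpenAI call that
the speaker makes through the chat controller, used here only so the sketch runs without network access.
```python
from typing import Callable, Dict, List
def answer_question(files: Dict[str, str], selected: List[str], query: str, llm: Callable[[str], str]) -> str:
    """Build the prompt from the selected files only, then ask the model."""
    context = "\n".join(f"[{name}] {files[name]}" for name in selected)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
def render_report(setup: str, expected: str, actual: str) -> str:
    """Markdown report a reviewer can open after each run."""
    return f"# Test result\n## Setup\n{setup}\n## Expected\n{expected}\n## Actual\n{actual}\n"
def echo_llm(prompt: str) -> str:
    """Stand-in for the real model call: it returns only the context it was given."""
    return prompt.split("Question:")[0]
# Seed two files, as if they had already gone through ingestion.
files = {"ptsd.pdf": "PTSD treatment: cognitive therapy.", "pain.pdf": "Chronic knee pain notes."}
# Select one file, ask the question, and capture the response.
actual = answer_question(files, ["ptsd.pdf"], "What are the current treatment approaches?", echo_llm)
report = render_report("Two files seeded, only ptsd.pdf selected.", "Mentions PTSD, not knee pain.", actual)
```
With a real model in place of echo_llm, the report would hold the live response next to the setup and expected behavior for manual review.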

152
00:08:47.360 --> 00:08:53.920
So file selection feature, uh, kind of went already through this.

153
00:08:53.920 --> 00:08:57.440
Current state: so we have like multiple test scenarios.

154
00:08:57.440 --> 00:09:01.520
Um, and another thing that I didn't put in the slides, that

155
00:09:01.520 --> 00:09:05.440
we actually skipped this from continuous integration, because

156
00:09:05.440 --> 00:09:10.000
since this test makes like a real API call, it's going to be slower,

157
00:09:10.000 --> 00:09:13.520
and sometimes it could be also flaky, and it might like

158
00:09:13.520 --> 00:09:17.360
require manual review, so it doesn't really make sense to put in continuous

159
00:09:17.360 --> 00:09:22.240
integration. This is a, for, for now our setup is,

160
00:09:22.240 --> 00:09:26.160
if you're working in the, in the chatbot API,

161
00:09:26.160 --> 00:09:29.920
you're going to be running it manual. You choose to like

162
00:09:29.920 --> 00:09:33.600
turn it on and run it manually every time.

163
00:09:33.600 --> 00:09:39.200
Um, but it doesn't make sense to put in, in CI right now. There could be a

164
00:09:39.200 --> 00:09:44.240
separate setup for that. And this next step, I, I tweak with this,

165
00:09:44.240 --> 00:09:49.040
but like have another LLM to like read

166
00:09:49.040 --> 00:09:54.800
the output and like expected and actual, and, and tell us like if it passed or

167
00:09:54.800 --> 00:09:58.720
failed. Um, I kind of play with this a little bit.

168
00:09:58.720 --> 00:10:01.760
It's a bit, uh,

169
00:10:00.000 --> 00:10:05.760
flaky sometimes. There's some prompt engineering to do, but that can be like a bit of a time saver.
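
NOTE
Editor's sketch, not shown in the talk: one common way to keep slow, real-LLM tests out of CI while
letting a developer turn them on by hand. The environment variable RUN_LLM_TESTS and the test class
name are hypothetical, and the talk does not say which mechanism the team actually uses.
```python
import os
import unittest
def llm_tests_enabled(env=None) -> bool:
    """True only when the developer explicitly opts in via RUN_LLM_TESTS=1 (hypothetical flag)."""
    env = os.environ if env is None else env
    return env.get("RUN_LLM_TESTS") == "1"
@unittest.skipUnless(llm_tests_enabled(), "set RUN_LLM_TESTS=1 to run real LLM tests")
class ChatFileSelectionTest(unittest.TestCase):
    def test_only_selected_file_is_used(self):
        # The real controller call and its assertions would go here.
        self.assertTrue(llm_tests_enabled())
```
CI never sets the flag, so the class is reported as skipped there, while a developer working on the chatbot API can export the variable locally and run it on demand.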

170
00:10:07.440 --> 00:10:16.240
Um, so when to use this. It's good for like, um, when the API is like pretty simple, it has just

171
00:10:16.240 --> 00:10:25.280
like one case, then it's okay to like, uh, test it manually, but almost like you're almost never

172
00:10:25.280 --> 00:10:30.320
going to be in that kind of situation. So I would say it's almost always like worth it to write

173
00:10:30.320 --> 00:10:35.840
tests, because you're saving yourself so much effort and you're getting more confidence about

174
00:10:35.840 --> 00:10:43.040
the stability of what you're building. Uh, features where you need many test scenarios. So faster

175
00:10:43.040 --> 00:10:51.440
iteration is really the key here, and confidence by using like a realistic, uh, case. Uh, simple API,

176
00:10:52.080 --> 00:11:00.960
influence, UI-specific. There's, I'm not really touching like, um, front-end tests in this

177
00:11:00.960 --> 00:11:07.600
presentation. This is just back-end. So, but we have like another way of testing, uh, which I

178
00:11:08.160 --> 00:11:15.120
won't go into right now. So let me go in, um, in the test here. I'll show you the code.

179
00:11:16.080 --> 00:11:27.520
Um, so here we have the, the controller. So here we have like the database, like a helper

180
00:11:28.320 --> 00:11:34.560
for like the, the real database that gets like wiped. This is where we do the wiping of the

181
00:11:34.560 --> 00:11:43.120
database. And then in here, like we set up the user with their like case and like a

182
00:11:43.840 --> 00:11:50.720
chat, like an empty chat history, and then we create like the, the file. So notice already it just kind

183
00:11:50.720 --> 00:11:56.880
of reads as a like plain English like story. We create a user, we create a PTSD file for them,

184
00:11:56.880 --> 00:12:02.160
we create a file about chronic pain, and then we're asking, what are the current treatment approaches?

185
00:12:03.120 --> 00:12:11.200
And then we're like mocking this, and then we select just the PTSD file, and then we call

186
00:12:11.840 --> 00:12:19.680
the controller, the API endpoint, and then we get the chat messages just to like unwrap the,

187
00:12:20.800 --> 00:12:25.600
the response, or like get them from the database, because we save the chat history.

188
00:12:25.680 --> 00:12:33.360
So we save like the messages that was, uh, responded with, and then we, we write like, okay,

189
00:12:33.360 --> 00:12:39.280
what was the setup about, what is this test about. We have like these files, and then we selected

190
00:12:40.000 --> 00:12:45.920
this file, we have this query, and then the expected behavior: it should only be about the PTSD file

191
00:12:46.480 --> 00:12:52.800
and not about like, mention like a chronic pain. And then just the utility, like a helper

192
00:12:53.360 --> 00:13:00.960
to write to a file. And then I created this just to play around with this, but this is like the,

193
00:13:00.960 --> 00:13:09.680
the second LLM call to like validate. So we have, let me just run it as a demo.

194
00:13:11.680 --> 00:13:19.600
So it goes ahead, it makes the call here, takes like five seconds, and then here I'll kind of

195
00:13:19.600 --> 00:13:27.280
not scroll, we're getting a bit of a spoiler. It fails. So test result is written here. Let me go in

196
00:13:27.280 --> 00:13:32.000
like preview mode. So this is what, like if you're gonna do manual review, this kind of what it looks

197
00:13:32.000 --> 00:13:38.320
like. You can just open it after each run and then see, okay, you get like a refresher in case you

198
00:13:38.320 --> 00:13:47.920
weren't reading the code. Um, so we have this, this is the setup, and then we expect this, and we should

199
00:13:47.920 --> 00:13:53.280
not see like knee pain, and then we can review what the, what the chat is actually going to be

200
00:13:53.280 --> 00:14:02.400
responding. So the treatment, we have some medicine, cognitive therapy, and PTSD nightmares, blah blah.

201
00:14:02.400 --> 00:14:07.920
So I'm not seeing anything knee, and we see the source. That's great. Uh, so it seems like it's

202
00:14:07.920 --> 00:14:14.320
passing. So, uh, my setup like saved me a bunch of time of like not having to do this through the UI

203
00:14:14.960 --> 00:14:25.600
and getting a really controlled, um, experience. And this, I went with the test result file.

204
00:14:25.600 --> 00:14:31.600
Um, you know, that's another one, validate response content. So this makes like another, um,

205
00:14:33.680 --> 00:14:37.280
like LLM call. You validate test results, analyze the Markdown report,

206
00:14:37.920 --> 00:14:42.800
and then it, should mention items, should not mention, check these topics are absent.
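
NOTE
Editor's sketch, not shown in the talk: a deterministic keyword check for the "should mention" and
"should not mention" lists. This is a simpler swapped-in technique, not the second LLM call the
speaker demonstrates, and the function name check_topics is hypothetical.
```python
def check_topics(response, should_mention, should_not_mention):
    """Return a list of problems; an empty list means the response passed."""
    text = response.lower()
    problems = [f"missing: {t}" for t in should_mention if t.lower() not in text]
    problems += [f"unexpected: {t}" for t in should_not_mention if t.lower() in text]
    return problems
good = check_topics("PTSD is treated with cognitive therapy.", ["PTSD"], ["knee pain"])
bad = check_topics("Knee pain is managed with rest.", ["PTSD"], ["knee pain"])
```
A plain substring check like this cannot judge paraphrases, so it may work best as a cheap first pass before the LLM-based validation or a manual review.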

207
00:14:43.760 --> 00:14:53.760
And it's a bit flaky, like it tells us some things. Um, so it's, it's not perfect. Some prompt

208
00:14:53.760 --> 00:14:59.840
engineering has to be done here, so I say it's good enough for now, but, um, there,

209
00:15:00.000 --> 00:15:03.500
we'll keep like incrementing on it,

210
00:15:03.500 --> 00:15:06.660
but right now it saves a lot of time

211
00:15:06.660 --> 00:15:10.820
and the kind of foundation, the framework is set up

212
00:15:10.820 --> 00:15:14.140
for us to add like more scenarios

213
00:15:14.140 --> 00:15:17.340
because we could have like adding

214
00:15:17.340 --> 00:15:19.680
like different data sources to the RAG,

215
00:15:19.680 --> 00:15:24.100
like not only from like documents,

216
00:15:24.100 --> 00:15:25.420
uploaded medical documents,

217
00:15:25.420 --> 00:15:30.420
you can also ask questions from like call transcripts,

218
00:15:31.100 --> 00:15:35.380
maybe like the medical provider did a call with the client

219
00:15:35.380 --> 00:15:36.660
and that was like transcribed

220
00:15:36.660 --> 00:15:39.340
and we wanna ask questions about that as well.

221
00:15:39.340 --> 00:15:42.920
So that would be very easy to add from here.

222
00:15:42.920 --> 00:15:47.200
And from here, like tests are readable.

223
00:15:47.200 --> 00:15:50.300
So the main points is that it's gonna increase

224
00:15:50.300 --> 00:15:52.660
your like feedback loop,

225
00:15:52.660 --> 00:15:56.460
keep as much as you can, not mocked.

226
00:15:56.460 --> 00:15:57.820
So that way it gives you confidence

227
00:15:57.820 --> 00:16:00.580
that everything internally is hooked up together properly

228
00:16:00.580 --> 00:16:04.520
and that your prompts are doing what they should

229
00:16:04.520 --> 00:16:06.860
and make it as readable as you can,

230
00:16:06.860 --> 00:16:09.380
like kind of like plain English story,

231
00:16:09.380 --> 00:16:11.540
more or less as much as possible.

232
00:16:11.540 --> 00:16:13.060
So that's it for me.

233
00:16:14.420 --> 00:16:16.540
If you guys have like questions from the team

234
00:16:16.540 --> 00:16:19.660
or anything like that, I'll be welcome.

235
00:16:19.660 --> 00:16:20.500
That's great.

236
00:16:20.500 --> 00:16:21.320
Thanks for doing that.

237
00:16:21.920 --> 00:16:23.880
This is, I know we did something

238
00:16:23.880 --> 00:16:25.560
kind of similar to this as well.

239
00:16:26.760 --> 00:16:29.680
Sorry, when you were running the test,

240
00:16:29.680 --> 00:16:33.040
like we did something where we were essentially

241
00:16:33.040 --> 00:16:36.000
scraping even the JPEGs of these documents

242
00:16:36.000 --> 00:16:37.120
that weren't readable.

243
00:16:37.120 --> 00:16:38.360
It's like another test.

244
00:16:38.360 --> 00:16:40.260
You remember between you and I,

245
00:16:40.260 --> 00:16:41.920
we created like a test suite

246
00:16:41.920 --> 00:16:45.120
because we were basically having the same job

247
00:16:45.120 --> 00:16:46.680
or the same tool call would go off

248
00:16:46.680 --> 00:16:50.280
and like process lots of different types of files.

249
00:16:50.280 --> 00:16:53.520
So what you could do is we wrote like multiple tests

250
00:16:53.520 --> 00:16:55.960
and then gave it, basically use this prompt,

251
00:16:55.960 --> 00:16:59.380
but run it against these five different files.

252
00:16:59.380 --> 00:17:01.940
That way you could just tweak the prompts in one place

253
00:17:01.940 --> 00:17:03.800
and then it would, you could kind of run it again

254
00:17:03.800 --> 00:17:06.720
against all five different scenarios.
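
NOTE
Editor's sketch, not shown in the talk: keeping one prompt in one place and running it against several
sample files. The prompt text, file names, and fake_llm are hypothetical, and fake_llm is an offline
stand-in for the real model call so that the sketch runs as written.
```python
from typing import Callable, Dict
PROMPT = "Extract the readable text from this page."  # hypothetical prompt, tweaked in one place
def run_prompt_over_files(prompt: str, files: Dict[str, str], llm: Callable[[str, str], str]) -> Dict[str, str]:
    """Apply the same prompt to every sample file and collect outputs by file name."""
    return {name: llm(prompt, content) for name, content in files.items()}
def fake_llm(prompt: str, content: str) -> str:
    """Stand-in for the real model call: returns the page text, or a marker for blank pages."""
    return content.strip() or "(blank page)"
samples = {"blank.jpg": "  ", "legible.jpg": "BP 120/80", "form.jpg": "Name: J. Doe"}
results = run_prompt_over_files(PROMPT, samples, fake_llm)
```
Changing PROMPT once and re-running gives a fresh output for every sample, which is the side-by-side comparison the speakers describe.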

255
00:17:06.720 --> 00:17:09.319
I think, is that similar in some ways

256
00:17:09.319 --> 00:17:10.440
what you're talking about here?

257
00:17:10.440 --> 00:17:12.839
Because I think we did something similar to that.

258
00:17:12.839 --> 00:17:14.520
Yeah, what you're talking about

259
00:17:14.520 --> 00:17:16.000
is like on the ingestion piece

260
00:17:16.000 --> 00:17:18.599
and we had like a really nice setup there.

261
00:17:18.599 --> 00:17:22.280
Even we had like a kind of reporting dashboard

262
00:17:22.280 --> 00:17:25.440
like that showed a preview of the page.

263
00:17:25.440 --> 00:17:30.200
Like that was, it's kind of like a picture of the page

264
00:17:30.200 --> 00:17:33.320
and like the doctor had some handwriting and whatever.

265
00:17:33.320 --> 00:17:36.200
And you could see like the preview and output

266
00:17:36.200 --> 00:17:38.080
of the doc ingestion.

267
00:17:38.080 --> 00:17:39.680
That was quite good.

268
00:17:39.680 --> 00:17:43.840
And we even like leveraged AI

269
00:17:43.840 --> 00:17:47.480
to like build that like simple reporting dashboard.

270
00:17:47.520 --> 00:17:50.720
That was a really good setup on the ingestion.

271
00:17:50.720 --> 00:17:52.160
I don't know if we would want to like,

272
00:17:52.160 --> 00:17:54.320
kind of find it again,

273
00:17:54.320 --> 00:17:56.920
like kind of parse through the code to try to find it.

274
00:17:56.920 --> 00:17:59.880
I didn't prepare like a presentation about that piece,

275
00:17:59.880 --> 00:18:03.320
but that was a really good setup as well.

276
00:18:03.320 --> 00:18:05.620
Cause we had just in one,

277
00:18:05.620 --> 00:18:06.880
like you could just scroll the page

278
00:18:06.880 --> 00:18:09.600
and see like the multiple different cases,

279
00:18:09.600 --> 00:18:10.520
like multiple different,

280
00:18:10.520 --> 00:18:14.880
like a page is like blank or it has illegible handwriting

281
00:18:14.880 --> 00:18:17.960
or another one that has like legible handwriting,

282
00:18:17.960 --> 00:18:20.360
one that had like forms filled out.

283
00:18:20.360 --> 00:18:23.960
So we had all of these just kind of lined up

284
00:18:23.960 --> 00:18:27.120
and we were seeing like when we tweak the prompt

285
00:18:27.120 --> 00:18:28.880
to parse these pages,

286
00:18:28.880 --> 00:18:32.560
then how the doc ingestion pipeline would work.

287
00:18:32.560 --> 00:18:34.760
And it worked really well.

288
00:18:34.760 --> 00:18:35.680
Nice.

289
00:18:35.680 --> 00:18:38.160
Yeah, I think this is helpful when you're,

290
00:18:39.600 --> 00:18:42.200
you know, the typical scenario is like we have a client

291
00:18:42.200 --> 00:18:45.280
who is using ChatGPT on their own, right?

292
00:18:45.280 --> 00:18:47.120
They're just going to ChatGPT.com

293
00:18:47.120 --> 00:18:48.680
and they're like throwing in a few files

294
00:18:48.680 --> 00:18:51.360
and they're just like asking questions against it.

295
00:18:51.360 --> 00:18:54.160
And so now when you try to bring that to like a company,

296
00:18:55.200 --> 00:18:56.160
it makes it more complicated

297
00:18:56.160 --> 00:18:57.720
cause you're trying to standardize a lot of things.

298
00:18:57.720 --> 00:19:01.560
So we have to, that's where the RAG piece comes in.

299
00:19:01.560 --> 00:19:03.880
So this has been a helpful tool I know for you

300
00:19:03.880 --> 00:19:05.880
cause you can quickly get lost

301
00:19:05.880 --> 00:19:08.880
in sort of an infinite amount of like variables here

302
00:19:08.880 --> 00:19:12.200
because you're working with like natural language

303
00:19:12.200 --> 00:19:14.920
versus like a very digital, you know,

304
00:19:14.920 --> 00:19:16.680
like either it's true or false,

305
00:19:16.680 --> 00:19:19.400
like this kind of natural language responses.

306
00:19:19.400 --> 00:19:21.760
So having this type of thing, I think it's helpful

307
00:19:21.760 --> 00:19:24.520
because I think if I understood your demo there,

308
00:19:24.520 --> 00:19:25.840
like you were running the test,

309
00:19:25.840 --> 00:19:26.880
it was coming back response

310
00:19:26.880 --> 00:19:28.520
and then you were even having it validate,

311
00:19:28.520 --> 00:19:30.480
like use the response that was coming back

312
00:19:30.480 --> 00:19:32.560
and then you're using AI to kind of interpret that.

313
00:19:32.560 --> 00:19:34.280
Is that what was happening?

314
00:19:34.280 --> 00:19:35.120
Yes.

315
00:19:35.120 --> 00:19:37.880
Like I said, that piece is like a little bit flaky

316
00:19:37.880 --> 00:19:40.680
and I wouldn't like trust it a hundred percent yet

317
00:19:40.680 --> 00:19:42.640
until I iterate more and like-

318
00:19:42.640 --> 00:19:44.040
Still AI, you wouldn't want to-

319
00:19:44.040 --> 00:19:49.040
Yeah, so like what we had in the ingestion pipeline

320
00:19:49.240 --> 00:19:52.640
is probably the best trade-off

321
00:19:52.640 --> 00:19:56.000
that you see like all the cases in one line

322
00:19:56.000 --> 00:19:57.760
and like all the outputs

323
00:19:57.760 --> 00:19:59.960
and then you can just give it like thumbs up.

324
00:20:00.000 --> 00:20:05.100
Um, that that's kind of like, uh, emerging, like pattern that, uh, I was

325
00:20:05.100 --> 00:20:10.140
reading recently. They call it evals and datasets, and there's a bunch of like

326
00:20:10.140 --> 00:20:12.740
good documentation online about it.

327
00:20:13.080 --> 00:20:14.420
But yeah, that's the main idea.

328
00:20:14.420 --> 00:20:19.220
Like you, you set up your kind of scenarios, your expected and your

329
00:20:19.220 --> 00:20:22.900
actual, and then you just kind of like eyeball them quickly and see like,

330
00:20:22.900 --> 00:20:25.300
okay, yeah, each one of them is good.

331
00:20:25.980 --> 00:20:27.940
Um, or some of them are not good.

332
00:20:27.940 --> 00:20:29.640
And then you go back and tweak.

333
00:20:29.840 --> 00:20:33.320
So yeah, like just not having like so many variables and being like kind of

334
00:20:33.640 --> 00:20:37.520
spread thin and holding all the kind of working context in your mind.

335
00:20:37.520 --> 00:20:38.400
It's really good for that.

336
00:20:39.400 --> 00:20:39.760
Nice.

337
00:20:40.360 --> 00:20:40.760
That's good.

338
00:20:40.760 --> 00:20:41.000
Yeah.

339
00:20:41.000 --> 00:20:46.640
If you guys, um, Ira and Keith just joined. Leon, I don't know, if you guys have

340
00:20:46.640 --> 00:20:49.320
any questions, let me know in the chat, I can bring you on and we can just chat.

341
00:20:49.360 --> 00:20:51.120
I think this is kind of casual.

342
00:20:51.560 --> 00:20:56.440
Um, I know we're doing a couple of implementations like this, um,

343
00:20:56.440 --> 00:20:57.960
on some other projects, but yeah.

344
00:20:58.000 --> 00:21:01.400
If you guys have questions for, uh, let me know.

345
00:21:01.440 --> 00:21:03.120
I just need to bring you on here.

346
00:21:03.560 --> 00:21:10.840
Let me, uh, awesome.

347
00:21:12.080 --> 00:21:14.840
Um, yeah, thanks for putting that together.

348
00:21:16.560 --> 00:21:18.600
And let me just see it real quick.

349
00:21:19.120 --> 00:21:25.080
Um, so, um, yeah, do you guys have any questions, thoughts?

350
00:21:25.080 --> 00:21:27.080
I know some of you guys joined halfway through, so that

351
00:21:27.080 --> 00:21:28.120
probably made no sense.

352
00:21:28.240 --> 00:21:29.120
Uh, halfway through.

353
00:21:31.120 --> 00:21:32.400
Any other thoughts, questions?

354
00:21:34.040 --> 00:21:38.400
I know, Ira, you've been doing a lot with the same client.

355
00:21:38.880 --> 00:21:44.560
Um, I don't know if you've touched any of these, these, uh, AI automated tests yet.

356
00:21:46.600 --> 00:21:49.640
No, not yet, but this was, uh, some really good context for me.

357
00:21:51.320 --> 00:21:51.640
Sweet.

358
00:21:53.440 --> 00:21:54.840
Yeah, that's true.

359
00:21:54.840 --> 00:21:56.240
I really liked your approach.

360
00:21:56.240 --> 00:22:01.240
Like, uh, I was even like, uh, you mentioned evals.

361
00:22:01.280 --> 00:22:06.320
I was, I was reading about it and like, uh, I think the whole way you are

362
00:22:06.320 --> 00:22:12.040
thinking about testing, it's more or less like how they suggest that we test it.

363
00:22:12.040 --> 00:22:15.120
So I think it's very, very, uh, interesting.

364
00:22:15.160 --> 00:22:21.520
I'm curious to, to use it in the Hackstack project for, for the chat part that you

365
00:22:21.520 --> 00:22:27.040
have, it's a kind of a bit forgotten because we haven't touched it for some

366
00:22:27.040 --> 00:22:32.320
time, but I think once we, we have some time to get back at it, I will, I will

367
00:22:32.320 --> 00:22:39.640
for sure want to, to try some of it because that's a very core problem of

368
00:22:39.640 --> 00:22:44.720
working with LLMs, like it's, it can be like very inconsistent and like, for

369
00:22:44.720 --> 00:22:49.480
instance, when you are, when you want to, to given in our case, for instance, we

370
00:22:49.480 --> 00:22:57.880
have the ability to kind of, uh, do, do estimates.

371
00:22:57.920 --> 00:23:00.360
So we do a budget for the customer.

372
00:23:00.560 --> 00:23:03.240
Like we can make calculations wrong.

373
00:23:03.240 --> 00:23:07.800
Like it can be wrong on the, on this part and we need for, for, for it to be

374
00:23:07.800 --> 00:23:08.440
consistent.

375
00:23:08.440 --> 00:23:11.840
So I really liked the overall approach in general.

376
00:23:11.840 --> 00:23:13.120
I'm curious to test it.

377
00:23:15.680 --> 00:23:15.960
Yeah.

378
00:23:15.960 --> 00:23:18.160
It really shines when you have like tool calling.

379
00:23:18.600 --> 00:23:22.400
So let's say like you kind of pre load the database with like your financial

380
00:23:22.400 --> 00:23:26.640
reports and then you like prompt it, like, Hey, like calculate this.

381
00:23:27.040 --> 00:23:29.680
And there was like no other way that it could possibly know, except

382
00:23:29.680 --> 00:23:30.920
for making the tool calls.

383
00:23:31.080 --> 00:23:32.680
So you're set up for testing that.

384
00:23:34.120 --> 00:23:37.600
There's so much we could talk about as far as the RAG stuff goes.

385
00:23:37.680 --> 00:23:40.920
So this is just totally an intro to it. But yeah, Greg?

386
00:23:41.800 --> 00:23:45.240
Uh, Hey, I joined halfway through, sorry.

387
00:23:45.240 --> 00:23:48.240
Um, there are people doing stuff like this.

388
00:23:48.240 --> 00:23:53.280
Even when I caught Luchen, there's a local startup here where I'm at called

389
00:23:53.280 --> 00:23:59.720
Ordinal, and their whole thing is bringing all the city documentation, all the

390
00:23:59.720 --> 00:24:05.320
different, you know, volumes and volumes of ordinances and all that stuff.

391
00:24:05.760 --> 00:24:14.040
And basically sort of digitizing that and then having an LLM go through it and send

392
00:24:14.080 --> 00:24:15.520
it back to the LLM again.

393
00:24:15.520 --> 00:24:18.320
And, you know, a lot, and be like: is this right?

394
00:24:18.360 --> 00:24:19.400
Are you sure it's right?

395
00:24:19.400 --> 00:24:20.160
Where is this at?

396
00:24:20.160 --> 00:24:22.480
If you don't know, say you don't know.

397
00:24:22.920 --> 00:24:26.280
So I think it's definitely one of the interesting use cases to get something

398
00:24:26.280 --> 00:24:30.200
from the LLM and then send it back to the same one or a different one and go, okay,

399
00:24:30.200 --> 00:24:31.040
what do you think?

400
00:24:31.240 --> 00:24:32.160
This is what I wanted.

401
00:24:32.280 --> 00:24:33.920
Did you give it to me or not?

402
00:24:33.920 --> 00:24:35.560
And it's like, "Oh yeah, no, sorry."

403
00:24:36.960 --> 00:24:37.360
It was good.

404
00:24:38.520 --> 00:24:39.880
And you're absolutely right.

405
00:24:40.480 --> 00:24:41.960
And that's a great point.

406
00:24:41.960 --> 00:24:45.840
It kind of brings me, like, because you can use this approach on the ingestion

407
00:24:45.840 --> 00:24:48.960
pipeline as well as on the chat side.

408
00:24:49.440 --> 00:24:53.480
And, you know, I was reading about different patterns. Like you're

409
00:24:53.480 --> 00:24:57.160
mentioning, it's kind of like chaining, you know, like you read it and then you

410
00:24:57.160 --> 00:25:02.840
put it in the,

411
00:25:00.000 --> 00:25:05.840
pass it to another step, like the output is, "Oh, review this, does it make sense?" And

412
00:25:05.840 --> 00:25:10.480
what's really cool also is that you can use this approach for testing just

413
00:25:10.480 --> 00:25:16.400
intermediate steps, you know, still making the real response. But, you know, if you're going to write

414
00:25:16.400 --> 00:25:21.440
fewer tests, I would say the whole end-to-end thing is a good starting point.

415
00:25:22.800 --> 00:25:26.480
And that's good as well, because I think when you're also developing with AI,

416
00:25:26.480 --> 00:25:31.120
it's very collaborative with the clients. It's kind of like you're

417
00:25:31.120 --> 00:25:38.160
working with this other third party, which is the AI. So we're trying to train

418
00:25:38.160 --> 00:25:42.480
it, and then they'll go and use it a bunch of times. So when they come back and say, "Hey, I'm

419
00:25:42.480 --> 00:25:46.560
getting a bug when I ask this question, I always get this," that would be a good point

420
00:25:46.560 --> 00:25:50.560
to say, okay, let's add that to the test and then make sure that it tests against that. I think that's

421
00:25:50.560 --> 00:25:57.280
kind of the workflow. Is that what you're thinking? Yeah, it's almost like you don't even

422
00:25:57.920 --> 00:26:03.200
invent. It's almost like you just take the client's words and that becomes the test, you know.

423
00:26:03.200 --> 00:26:07.520
You're just asking them, "Hey, how are you using it today?" and they'll show you a few examples. That's

424
00:26:07.520 --> 00:26:12.960
when you pay the most attention, and then you just kind of forward that and they become

425
00:26:12.960 --> 00:26:22.000
test cases. And the kind of next level is putting in place kind of client-driven or

426
00:26:22.000 --> 00:26:28.880
user-driven evals. Like in Claude or in OpenAI, in the normal chat, there's a

427
00:26:28.880 --> 00:26:34.240
thumbs up, thumbs down when you get a response, and you can give feedback. So I'm sure you can use

428
00:26:34.240 --> 00:26:40.880
these to drive your tests further, like, "Oh, we didn't catch this kind of edge case."

429
00:26:40.880 --> 00:26:48.240
So that can be a valuable data point. Could you set it up in a way and motivate them to use

430
00:26:49.680 --> 00:26:54.080
your own client rather than just going straight to ChatGPT, and then you can actually capture

431
00:26:54.080 --> 00:26:59.200
the input and see what they did wrong? Yeah, I mean, since we're building, like, their custom,

432
00:26:59.200 --> 00:27:04.560
so we have full control. And I'm thinking, in some scenarios, we know in

433
00:27:04.560 --> 00:27:08.960
advance that some users are kind of the power users, so their feedback is

434
00:27:08.960 --> 00:27:15.760
disproportionately impactful. So we can even kind of put, if you will, user feature flags, like

435
00:27:15.760 --> 00:27:22.160
the feedback mechanism is only shown to some users. And normally they would use the thumbs up,

436
00:27:22.160 --> 00:27:28.800
thumbs down only when something goes wrong, so you're sure to get kind of good quality data.

437
00:27:30.720 --> 00:27:31.220
Cool.

438
00:27:31.220 --> 00:27:41.460
Very cool. Anything else? We can still hang out and chat, but just related to that, that's good.
