|
798 | 798 |
|
799 | 799 | 00:19:53 Well, I'll link that as well.
|
800 | 800 |
|
801 |
| -00:19:56 Now, let's dive into the whole NLP and spacey side of things. |
| 801 | +00:19:56 Now, let's dive into the whole NLP and spaCy side of things. |
802 | 802 |
|
803 | 803 | 00:20:00 I had Ines from Explosion on just a couple of months back, in June.
|
804 | 804 |
|
|
808 | 808 |
|
809 | 809 | 00:20:13 So two to three months ago.
|
810 | 810 |
|
811 |
| -00:20:15 Anyway, we talked more about LLMs, not so much spacey, even though she's behind it. |
| 811 | +00:20:15 Anyway, we talked more about LLMs, not so much spaCy, even though she's behind it. |
812 | 812 |
|
813 |
| -00:20:20 So give people a sense of what is spacey. |
| 813 | +00:20:20 So give people a sense of what is spaCy. |
814 | 814 |
|
815 | 815 | 00:20:23 We just talked about scikit-learn and the types of problems it solves.
|
816 | 816 |
|
817 |
| -00:20:26 What about spacey? |
| 817 | +00:20:26 What about spaCy? |
818 | 818 |
|
819 | 819 | 00:20:28 There's a couple of stories that could be told about it.
|
820 | 820 |
|
|
834 | 834 |
|
835 | 835 | 00:20:56 And it was definitely kind of useful, but it wasn't necessarily a coherent pipeline.
|
836 | 836 |
|
837 |
| -00:20:59 And one way to, I think, historically describe spacey, it was like a very honest, good attempt to make a pipeline for all these different NLP components that kind of click together. |
| 837 | +00:20:59 And one way to, I think, historically describe spaCy, it was like a very honest, good attempt to make a pipeline for all these different NLP components that kind of click together. |
838 | 838 |
|
839 |
| -00:21:09 And the first component inside of spacey that made it popular was basically a tokenizer. |
| 839 | +00:21:09 And the first component inside of spaCy that made it popular was basically a tokenizer. |
840 | 840 |
|
841 | 841 | 00:21:15 Something that can take text and split it up into separate words.
|
842 | 842 |
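For a concrete picture of that tokenizer idea, here is a minimal sketch (mine, not from the episode; it assumes spaCy is installed and uses spacy.blank, which gives you a pipeline with only the tokenizer):

    import spacy

    # A blank English pipeline contains just the tokenizer,
    # with no trained components attached.
    nlp = spacy.blank("en")

    doc = nlp("Take text and split it up into separate words.")
    print([token.text for token in doc])
    # ['Take', 'text', 'and', 'split', 'it', 'up', 'into', 'separate', 'words', '.']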
|
|
870 | 870 |
|
871 | 871 | 00:21:46 Because if you went back to when I worked at the company (I used to work at Explosion, just for context).
|
872 | 872 |
|
873 |
| -00:21:51 They would emphasize like the way you spell spacey is not with a capital S, it's with a capital C. |
| 873 | +00:21:51 They would emphasize like the way you spell spaCy is not with a capital S, it's with a capital C. |
874 | 874 |
|
875 | 875 | 00:21:55 It's like when you go and put what your location is in your social media.
|
876 | 876 |
|
|
932 | 932 |
|
933 | 933 | 00:23:01 This is going to happen.
|
934 | 934 |
|
935 |
| -00:23:02 But anyway, but back to spacey, I suppose. |
| 935 | +00:23:02 But anyway, back to spaCy, I suppose. |
936 | 936 |
|
937 | 937 | 00:23:04 Like this is sort of the origin story.
|
938 | 938 |
|
|
1702 | 1702 |
|
1703 | 1703 | 00:41:11 This is the thing that people don't always recognize.
|
1704 | 1704 |
|
1705 |
| -00:41:12 But the way that spacey is made, if you're from scikit-learn, this sounds a bit surprising |
| 1705 | +00:41:12 But the way that spaCy is made, if you're from scikit-learn, this sounds a bit surprising |
1706 | 1706 |
|
1707 | 1707 | 00:41:17 because in scikit-learn land, you are typically used to the fact that you do batching and stuff
|
1708 | 1708 |
|
1709 | 1709 | 00:41:21 that's vectorized in numpy, and that's sort of the way you would do it.
|
1710 | 1710 |
|
1711 |
| -00:41:23 But spacey actually has a small preference to using generators. |
| 1711 | +00:41:23 But spaCy actually has a small preference for using generators. |
1712 | 1712 |
|
1713 | 1713 | 00:41:27 And the whole thinking is that in natural language problems, you are typically dealing
|
1714 | 1714 |
|
|
1762 | 1762 |
|
1763 | 1763 | 00:42:46 that.
|
1764 | 1764 |
|
1765 |
| -00:42:46 But my spacey habit would always be do the generator thing. |
| 1765 | +00:42:46 But my spaCy habit would always be to do the generator thing. |
1766 | 1766 |
|
1767 | 1767 | 00:42:49 Yeah.
|
1768 | 1768 |
|
|
1782 | 1782 |
|
1783 | 1783 | 00:43:12 nested data structures as well.
|
1784 | 1784 |
|
1785 |
| -00:43:13 So that's the first thing that I usually end up doing when I'm doing something with spacey. |
| 1785 | +00:43:13 So that's the first thing that I usually end up doing when I'm doing something with spaCy. |
1786 | 1786 |
|
1787 | 1787 | 00:43:17 Just get it into a generator.
|
1788 | 1788 |
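As a rough sketch of that generator habit (my illustration, not code from the episode; corpus.txt is a made-up file name), nlp.pipe consumes a lazy stream of texts and yields Doc objects just as lazily:

    import spacy

    nlp = spacy.blank("en")

    # Texts are produced one line at a time, so the whole corpus
    # never has to sit in memory at once.
    def stream_texts(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.strip()

    # nlp.pipe takes the generator and yields processed Doc objects
    # one by one, batching internally for speed.
    for doc in nlp.pipe(stream_texts("corpus.txt")):
        print(len(doc))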
|
|
2330 | 2330 |
|
2331 | 2331 | 00:57:17 A trick that I always like to use in terms of what examples should I annotate first?
|
2332 | 2332 |
|
2333 |
| -00:57:22 At some point, you got to imagine I have some sort of spacey model. |
| 2333 | +00:57:22 At some point, you got to imagine I have some sort of spaCy model. |
2334 | 2334 |
|
2335 | 2335 | 00:57:25 Maybe it has like 200 data points of labels.
|
2336 | 2336 |
|
|
2340 | 2340 |
|
2341 | 2341 | 00:57:31 When those two models disagree, something interesting is usually happening.
|
2342 | 2342 |
|
2343 |
| -00:57:35 Because the LLM model is pretty good and the spacey model is pretty good. |
| 2343 | +00:57:35 Because the LLM model is pretty good and the spaCy model is pretty good. |
2344 | 2344 |
|
2345 | 2345 | 00:57:38 But when they disagree, then I'm probably dealing with either a model that can be improved or a data point that's just kind of tricky or something like that.
|
2346 | 2346 |
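A minimal sketch of that disagreement trick (hypothetical code, not from the episode; spacy_predict and llm_predict are assumed stand-ins for the two models):

    # Surface the examples where the small trained spaCy model and the
    # LLM-backed labeler disagree; those go to the top of the queue.
    def disagreements(texts, spacy_predict, llm_predict):
        # Both arguments are assumed helpers that map a text to a label.
        for text in texts:
            a, b = spacy_predict(text), llm_predict(text)
            if a != b:
                yield text, a, b  # annotate these first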
|
|