Changes for spacy 3.0 by datanizing · Pull Request #322 · chartbeat-labs/textacy

datanizing · 2021-03-14T13:45:32Z

Hi Burton,

as discussed with Jens I have created a few changes to make (parts of) textacy work with spacy 3.0.

Description

Take care of API changes in spacy 3.0
Make passing of numpy arrays possible

Motivation and Context

Make it work with latest spacy

How Has This Been Tested?

Ran the normal examples

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation, and I have updated it accordingly.

bdewilde

Hi @datanizing , thanks for the PR. I have to get the develop branch set up for spaCy v3 — it's been on my backburner for a while, but going to make some time soon — then we'll want to rebase these changes onto that. Will report back when that branch is ready for you, okay?

bdewilde · 2021-03-16T20:34:30Z

src/textacy/corpus.py

        self.n_sents = 0
        self.n_tokens = 0
-        if data:
+        if data is not None:


Is this a v3-specific issue, or just an edge case bug that has never been properly handled? Could you give a mini-example showing how this failed for you?

Update: I'm guessing from the commit message that you added this for numpy's sake, e.g.

In [1]: import numpy as np In [2]: arr = np.ndarray([1,2,3]) In [3]: if arr: print("okay") --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-3-a19c660a75e5> in <module> ----> 1 if arr: print("okay") ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Which makes this an edge case that I never properly handled. I'm going to just add it to the latest develop branch, which is now working (with minimal hacks...) for spacy v3, since the other two changes in this PR were incidentally handled on the way to making the code run and tests pass.

bdewilde · 2021-03-16T20:39:43Z

src/textacy/extract.py

+        freqs = itertoolz.frequencies(ngram.lemma_.lower() for ngram in ngrams_)
+        ngrams_ = (ngram for ngram in ngrams_ if freqs[ngram.lemma_.lower()] >= min_freq)


Again, is this a spaCy v3 issue, or just a difference in expectations? Lemmatizing and lowercasing under the hood when determining ngram frequencies felt like more magic than I would expect, so I stopped at just lowercasing. I'm open to reconsidering, but would want to check that lemmas are available for all languages — I don't think so, in which case this feature would fail for certain langs, but I'm not sure.

Okay, I just saw that they removed Span.lower_ from the v3 API, so the direct equivalent here would be to use .text.lower(). I'm currently working through a pile of bugs/errors on the v3-based develop branch, so I'll fix it there — just have to get everything working again before I start making larger v3-oriented changes. :)

bdewilde · 2021-03-16T20:41:00Z

src/textacy/extract.py

        )
    matcher = Matcher(doc.vocab)
-    matcher.add("match", on_match, *patterns)
+    matcher.add("match", patterns, on_match=on_match)


Noted! I remember seeing this in spacy's changelog.

bdewilde · 2021-03-19T16:19:39Z

Hi @datanizing , your changes (or in one case, a basically equivalent change) are now live in the develop branch — plus a big update for the preprocessing subpackage. As such, I'm going to close this PR, but will give you proper credit in the next release notes. Thanks for jump-starting this process!

datanizing added 2 commits March 14, 2021 14:34

allow passing numpy array

eab5375

changes for spacy 3.0

26aeef7

bdewilde reviewed Mar 16, 2021

View reviewed changes

bdewilde closed this Mar 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changes for spacy 3.0#322

Changes for spacy 3.0#322
datanizing wants to merge 2 commits intochartbeat-labs:masterfrom
datanizing:master

datanizing commented Mar 14, 2021

Uh oh!

bdewilde left a comment

Uh oh!

bdewilde Mar 16, 2021

Uh oh!

bdewilde Mar 17, 2021

Uh oh!

bdewilde Mar 16, 2021

Uh oh!

bdewilde Mar 17, 2021

Uh oh!

bdewilde Mar 16, 2021

Uh oh!

bdewilde commented Mar 19, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		freqs = itertoolz.frequencies(ngram.lemma_.lower() for ngram in ngrams_)
		ngrams_ = (ngram for ngram in ngrams_ if freqs[ngram.lemma_.lower()] >= min_freq)

Conversation

datanizing commented Mar 14, 2021

Description

Motivation and Context

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

Checklist:

Uh oh!

bdewilde left a comment

Choose a reason for hiding this comment

Uh oh!

bdewilde Mar 16, 2021

Choose a reason for hiding this comment

Uh oh!

bdewilde Mar 17, 2021

Choose a reason for hiding this comment

Uh oh!

bdewilde Mar 16, 2021

Choose a reason for hiding this comment

Uh oh!

bdewilde Mar 17, 2021

Choose a reason for hiding this comment

Uh oh!

bdewilde Mar 16, 2021

Choose a reason for hiding this comment

Uh oh!

bdewilde commented Mar 19, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants