Skip to content

Changes for spacy 3.0#322

Closed
datanizing wants to merge 2 commits intochartbeat-labs:masterfrom
datanizing:master
Closed

Changes for spacy 3.0#322
datanizing wants to merge 2 commits intochartbeat-labs:masterfrom
datanizing:master

Conversation

@datanizing
Copy link

Hi Burton,

as discussed with Jens I have created a few changes to make (parts of) textacy work with spacy 3.0.

Description

Take care of API changes in spacy 3.0
Make passing of numpy arrays possible

Motivation and Context

Make it work with latest spacy

How Has This Been Tested?

Ran the normal examples

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation, and I have updated it accordingly.

Copy link
Collaborator

@bdewilde bdewilde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @datanizing , thanks for the PR. I have to get the develop branch set up for spaCy v3 — it's been on my backburner for a while, but going to make some time soon — then we'll want to rebase these changes onto that. Will report back when that branch is ready for you, okay?

self.n_sents = 0
self.n_tokens = 0
if data:
if data is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a v3-specific issue, or just an edge case bug that has never been properly handled? Could you give a mini-example showing how this failed for you?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I'm guessing from the commit message that you added this for numpy's sake, e.g.

In [1]: import numpy as np

In [2]: arr = np.ndarray([1,2,3])

In [3]: if arr: print("okay")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-a19c660a75e5> in <module>
----> 1 if arr: print("okay")

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Which makes this an edge case that I never properly handled. I'm going to just add it to the latest develop branch, which is now working (with minimal hacks...) for spacy v3, since the other two changes in this PR were incidentally handled on the way to making the code run and tests pass.

Comment on lines +165 to +166
freqs = itertoolz.frequencies(ngram.lemma_.lower() for ngram in ngrams_)
ngrams_ = (ngram for ngram in ngrams_ if freqs[ngram.lemma_.lower()] >= min_freq)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, is this a spaCy v3 issue, or just a difference in expectations? Lemmatizing and lowercasing under the hood when determining ngram frequencies felt like more magic than I would expect, so I stopped at just lowercasing. I'm open to reconsidering, but would want to check that lemmas are available for all languages — I don't think so, in which case this feature would fail for certain langs, but I'm not sure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I just saw that they removed Span.lower_ from the v3 API, so the direct equivalent here would be to use .text.lower(). I'm currently working through a pile of bugs/errors on the v3-based develop branch, so I'll fix it there — just have to get everything working again before I start making larger v3-oriented changes. :)

)
matcher = Matcher(doc.vocab)
matcher.add("match", on_match, *patterns)
matcher.add("match", patterns, on_match=on_match)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted! I remember seeing this in spacy's changelog.

@bdewilde
Copy link
Collaborator

Hi @datanizing , your changes (or in one case, a basically equivalent change) are now live in the develop branch — plus a big update for the preprocessing subpackage. As such, I'm going to close this PR, but will give you proper credit in the next release notes. Thanks for jump-starting this process!

@bdewilde bdewilde closed this Mar 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants