Commit 40718fe

Author: Guanheng Zhang
Commit message: checkpoint
1 parent 9e01aa7 commit 40718fe

File tree

1 file changed: +70 −68 lines

beginner_source/text_sentiment_ngrams_tutorial.py

Lines changed: 70 additions & 68 deletions
@@ -7,60 +7,63 @@
 - Access to the raw data as an iterator
 - Build data processing pipeline to convert the raw text strings into torch.Tensor that can be used to train the model
 - Shuffle and iterate the data with torch.utils.data.DataLoader
-
-Access to the raw dataset iterators
------------------------------------
-
-For some advanced users, they prefer to work on the raw data strings with their custom data process pipeline. The new torchtext library provides a few raw dataset iterators, which yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.
-
 """
 
+######################################################################
+# Access to the raw dataset iterators
+# -----------------------------------
+#
+# Some advanced users prefer to work on the raw data strings with their custom data processing pipelines. The new torchtext library provides a few raw dataset iterators, which yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.
+
 import torch
 # With torchtext 0.9.0 rc
 # from torchtext.datasets import AG_NEWS
 from torchtext.experimental.datasets.raw import AG_NEWS
 train_iter, = AG_NEWS(split=('train'))
 
-"""
-next(train_iter)
->>> (3, "Wall St. Bears Claw Back Into the Black (Reuters) Reuters -
-Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green
-again.")
-
-next(train_iter)
->>> (3, 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private
-investment firm Carlyle Group,\\which has a reputation for making well-timed
-and occasionally\\controversial plays in the defense industry, has quietly
-placed\\its bets on another part of the market.')
-
-next(train_iter)
->>> (3, "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring
-crude prices plus worries\\about the economy and the outlook for earnings are
-expected to\\hang over the stock market next week during the depth of
-the\\summer doldrums.")
-
-Prepare data processing pipelines
----------------------------------
-
-We have revisited the very basic components of the torchtext library, including vocab, word vectors, tokenizer backed by regular expression, and sentencepiece. Those are the basic data processing building blocks for raw text string.
-
-Here is an example for typical NLP data processing with tokenizer and vocabulary. The first step is to build a vocabulary with the raw training dataset. We provide a function build_vocab_from_iterator to build the vocabulary from a text iterator. Users can set up the minimum frequency for the tokens to be included.
-"""
+
+# next(train_iter)
+# >>> (3, "Wall St. Bears Claw Back Into the Black (Reuters) Reuters -
+# Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green
+# again.")
+#
+# next(train_iter)
+# >>> (3, 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private
+# investment firm Carlyle Group,\\which has a reputation for making well-timed
+# and occasionally\\controversial plays in the defense industry, has quietly
+# placed\\its bets on another part of the market.')
+#
+# next(train_iter)
+# >>> (3, "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring
+# crude prices plus worries\\about the economy and the outlook for earnings are
+# expected to\\hang over the stock market next week during the depth of
+# the\\summer doldrums.")
+
+
+######################################################################
+# Prepare data processing pipelines
+# ---------------------------------
+#
+# We have revisited the very basic components of the torchtext library, including the vocab, word vectors, the tokenizer backed by regular expressions, and sentencepiece. Those are the basic data processing building blocks for raw text strings.
+#
+# Here is an example of typical NLP data processing with a tokenizer and vocabulary. The first step is to build a vocabulary with the raw training dataset. We provide the function build_vocab_from_iterator to build the vocabulary from a text iterator. Users can set the minimum frequency for a token to be included.
+
 
 from torchtext.experimental.vocab import build_vocab_from_iterator
 from torchtext.experimental.transforms import basic_english_normalize
 tokenizer = basic_english_normalize()
 train_iter, = AG_NEWS(split=('train',))
 vocab = build_vocab_from_iterator(iter(tokenizer(line) for label, line in train_iter), min_freq=1)
 
-"""
-The vocabulary block converts a list of tokens into integers.
 
-vocab(['here', 'is', 'an', 'example'])
->>> [475, 21, 30, 5286]
+######################################################################
+# The vocabulary block converts a list of tokens into integers.
+#
+# vocab(['here', 'is', 'an', 'example'])
+# >>> [475, 21, 30, 5286]
+#
+# Prepare the data pipeline with the tokenizer and vocabulary. The pipelines will be used for the raw data strings from the dataset iterators.
 
-Prepare data pipeline with the tokenizer and vocabulary. The pipelines will be used for the raw data strings from the dataset iterators.
-"""
 
 def generate_text_pipeline(tokenizer, vocab):
     def _forward(text):
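The body of _forward is unchanged context that this hunk elides. For orientation, a minimal sketch of what such a pipeline factory plausibly looks like, assuming the body simply chains the tokenizer and the vocab lookup shown above; the file's actual implementation is not visible in this diff:

    def generate_text_pipeline(tokenizer, vocab):
        def _forward(text):
            # Tokenize the raw string, then map each token to its vocab index.
            return vocab(tokenizer(text))
        return _forward

With this shape, text_pipeline('here is an example') would return the integer ids shown in the comments above.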
@@ -69,23 +72,26 @@ def _forward(text):
 text_pipeline = generate_text_pipeline(basic_english_normalize(), vocab)
 label_pipeline = lambda x: int(x) - 1
 
-"""
-The text pipeline converts a text string into a list of integers based on the lookup defined in the vocab. The label pipeline converts the label into integers. For example,
 
-text_pipeline('here is the an example')
->>> [475, 21, 2, 30, 5286]
-label_pipeline('10')
->>> 9
-
-Generate data batch and iterator
---------------------------------
-
-The PyTorch data loading utility is the torch.utils.data.DataLoader class. It works with a map-style dataset that implements the getitem() and len() protocols, and represents a map from indices/keys to data samples. It also works with an iterable datasets with the shuffle argumnet of False.
+######################################################################
+# The text pipeline converts a text string into a list of integers based on the lookup defined in the vocab. The label pipeline converts the label into an integer. For example,
+#
+# text_pipeline('here is the an example')
+# >>> [475, 21, 2, 30, 5286]
+# label_pipeline('10')
+# >>> 9
+#
+######################################################################
+# Generate data batch and iterator
+# --------------------------------
+#
+# The PyTorch data loading utility is the torch.utils.data.DataLoader class. It works with a map-style dataset that implements the __getitem__() and __len__() protocols, and represents a map from indices/keys to data samples. It also works with an iterable dataset with the shuffle argument set to False.
+#
+# Before sending them to the model, the collate_fn function works on a batch of samples generated from DataLoader. The input to collate_fn is a batch of data with the batch size in DataLoader, and collate_fn processes them according to the data processing pipelines declared earlier. Pay attention here and make sure that collate_fn is declared as a top-level def. This ensures that the function is available in each worker.
+#
+# In this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of nn.EmbeddingBag. The offset is a tensor of delimiters representing the beginning index of each individual sequence in the text tensor. Label is a tensor saving the labels of the individual text entries.
 
-Before sending to the model, collate_fn function works on a batch of samples generated from DataLoader. The input to collat_fn is a batch of data with the batch size in DataLoader, and collate_fn processes them according to the data processing pipelines declared on Step 2. Pay attention here and make sure that collate_fn is declared as a top level def. This ensures that the function is available in each worker.
 
-In this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of nn.EmbeddingBag. The offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor. Label is a tensor saving the labels of indidividual text entries.
-"""
 from torch.utils.data import DataLoader
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
@@ -385,15 +391,13 @@ def predict(text, text_pipeline):
 print("This is a %s news" %ag_news_label[predict(ex_text_str, text_pipeline)])
 
 
-
-"""
-Other data processing pipeline - SentencePiece
-----------------------------------------------
-
-SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. For sentencepiece transforms in torchtext, both subword units (e.g., byte-pair-encoding (BPE) ) and unigram language model are supported. We provide a few pretrained SentencePiece models and they are accessable from PRETRAINED_SP_MODEL. Here is an example to apply SentencePiece transform to build the dataset.
-
-By using spm_transform transform in collate_batch function, you can re-run the tutorial with slightly improved results.
-"""
+##############################################
+# Other data processing pipeline - SentencePiece
+# ----------------------------------------------
+#
+# SentencePiece is an unsupervised text tokenizer and detokenizer mainly for neural-network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. For the sentencepiece transforms in torchtext, both subword units (e.g., byte-pair encoding (BPE)) and the unigram language model are supported. We provide a few pretrained SentencePiece models, and they are accessible from PRETRAINED_SP_MODEL. Here is an example of applying the SentencePiece transform to build the dataset.
+#
+# By using the spm_transform in the collate_batch function, you can re-run the tutorial with slightly improved results.
 
 from torchtext.experimental.transforms import (
     PRETRAINED_SP_MODEL,
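The import statement continues in the next hunk, and the code that produces spm_filepath is elided between the two hunks. A hedged sketch of what that setup could look like, assuming torchtext's download_from_url helper and a hypothetical key into the PRETRAINED_SP_MODEL dict of pretrained model URLs:

    from torchtext.utils import download_from_url

    # 'text_unigram_15000' is an assumed key name; PRETRAINED_SP_MODEL maps
    # pretrained SentencePiece model names to download URLs per the prose above.
    spm_filepath = download_from_url(PRETRAINED_SP_MODEL['text_unigram_15000'])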
@@ -405,11 +409,9 @@ def predict(text, text_pipeline):
 spm_transform = sentencepiece_processor(spm_filepath)
 sp_model = load_sp_model(spm_filepath)
 
-"""
-The sentecepiece processor converts a text string into a list of integers. You can use the decode method to convert a list of integers back to the original string.
-
-spm_transform('here is the an example')
->>> [130, 46, 9, 76, 1798]
-spm_transform.decode([6468, 17151, 4024, 8246, 16887, 87, 23985, 12, 581, 15120])
->>> 'torchtext sentencepiece processor can encode and decode'
-"""
+# The sentencepiece processor converts a text string into a list of integers. You can use the decode method to convert a list of integers back to the original string.
+#
+# spm_transform('here is the an example')
+# >>> [130, 46, 9, 76, 1798]
+# spm_transform.decode([6468, 17151, 4024, 8246, 16887, 87, 23985, 12, 581, 15120])
+# >>> 'torchtext sentencepiece processor can encode and decode'
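To re-run the tutorial with SentencePiece as the note in the previous hunk suggests, only the encoding step of the collate function has to change, since spm_transform maps a raw string directly to subword ids with no separate vocab lookup. A hedged one-line swap, assuming the collate_batch sketched earlier:

    # Inside collate_batch, replace the tokenizer + vocab pipeline:
    # before: processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
    processed_text = torch.tensor(spm_transform(_text), dtype=torch.int64)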
