beginner_source/text_sentiment_ngrams_tutorial.py (70 additions, 68 deletions)
#    - Access to the raw data as an iterator
#    - Build a data processing pipeline to convert the raw text strings into ``torch.Tensor`` that can be used to train the model
#    - Shuffle and iterate over the data with ``torch.utils.data.DataLoader``


######################################################################
# Access to the raw dataset iterators
# -----------------------------------
#
# Some advanced users prefer to work on the raw data strings with their own custom data processing pipelines. The torchtext library provides a few raw dataset iterators, which yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.
# ::
#
#     investment firm Carlyle Group,\\which has a reputation for making well-timed
#     and occasionally\\controversial plays in the defense industry, has quietly
#     placed\\its bets on another part of the market.')
#
#     next(train_iter)
#     >>> (3, "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring
#     crude prices plus worries\\about the economy and the outlook for earnings are
#     expected to\\hang over the stock market next week during the depth of
#     the\\summer doldrums.")
#     next(train_iter)
#     >>> (3, "Wall St. Bears Claw Back Into the Black (Reuters) Reuters -
#     Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green

######################################################################
# Prepare data processing pipelines
# ---------------------------------
#
# We have revisited the very basic components of the torchtext library, including the vocab, word vectors, a tokenizer backed by regular expressions, and sentencepiece. These are the basic data processing building blocks for raw text strings.
#
# Here is an example of typical NLP data processing with a tokenizer and vocabulary. The first step is to build a vocabulary from the raw training dataset. We provide a function ``build_vocab_from_iterator`` to build the vocabulary from a text iterator. Users can set the minimum frequency required for a token to be included.
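The vocabulary step can be sketched without torchtext at all. The following is a minimal pure-Python stand-in: the function name and ``min_freq`` parameter mirror torchtext's ``build_vocab_from_iterator``, but the implementation, the toy corpus, and the use of ``str.split`` as the tokenizer are simplified assumptions, not the library's actual code.

```python
from collections import Counter

def build_vocab_from_iterator(token_iter, min_freq=1, specials=("<unk>",)):
    """Toy stand-in for torchtext's build_vocab_from_iterator: assigns an
    integer index to every token seen at least min_freq times."""
    counter = Counter()
    for tokens in token_iter:
        counter.update(tokens)
    # Special tokens come first, then tokens ordered by descending frequency.
    itos = list(specials) + [tok for tok, freq in counter.most_common()
                             if freq >= min_freq]
    return {tok: idx for idx, tok in enumerate(itos)}

# A tiny corpus standing in for the AG_NEWS training iterator.
corpus = [
    "wall st bears claw back into the black",
    "oil and economy cloud stocks outlook",
    "the economy and the market",
]
tokenizer = str.split  # stand-in for get_tokenizer('basic_english')
vocab = build_vocab_from_iterator((tokenizer(line) for line in corpus))
print(vocab["<unk>"])  # 0: the special token gets the first index
print(vocab["the"])    # 1: the most frequent corpus token follows the specials
```

Mapping unknown words to a dedicated ``<unk>`` index, as here, is also why the real tutorial calls ``vocab.set_default_index(vocab["<unk>"])``.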
######################################################################
# The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. The label pipeline converts the label into an integer. For example:
#
# ::
#
#     text_pipeline('here is the an example')
#     >>> [475, 21, 2, 30, 5286]
#     label_pipeline('10')
#     >>> 9
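Assuming a toy vocabulary (the real tutorial builds one from AG_NEWS, so the actual integer ids such as ``[475, 21, 2, 30, 5286]`` depend on corpus frequencies), the two pipelines can be sketched as:

```python
# Toy vocabulary standing in for one built from the AG_NEWS training set.
vocab = {"<unk>": 0, "here": 1, "is": 2, "the": 3, "an": 4, "example": 5}

def text_pipeline(text):
    # Map each whitespace token to its vocabulary index, falling back to <unk>.
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

def label_pipeline(label):
    # Labels arrive as strings like '10'; shift them to 0-based integers.
    return int(label) - 1

print(text_pipeline("here is the an example"))  # [1, 2, 3, 4, 5]
print(label_pipeline("10"))                     # 9
```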
######################################################################
# Generate data batch and iterator
# --------------------------------
#
# ``torch.utils.data.DataLoader`` is the PyTorch data loading utility. It works with a map-style dataset that implements the ``__getitem__()`` and ``__len__()`` protocols and represents a map from indices/keys to data samples. It also works with an iterable dataset when the ``shuffle`` argument is ``False``.
#
# Before sending the samples to the model, the ``collate_fn`` function works on a batch of samples generated from ``DataLoader``. The input to ``collate_fn`` is a batch of data with the batch size specified in ``DataLoader``, and ``collate_fn`` processes it according to the data processing pipelines declared previously. Make sure that ``collate_fn`` is declared as a top-level ``def``; this ensures that the function is available in each worker process.
#
# In this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of ``nn.EmbeddingBag``. The offsets are a tensor of delimiters representing the beginning index of each individual sequence in the text tensor. The label is a tensor holding the labels of the individual text entries.
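The offsets bookkeeping described above can be sketched with plain lists. This is a simplified stand-in for the tutorial's tensor-based ``collate_fn`` (the batch contents below are made up), showing only how the per-entry token-id lists are flattened and where each entry's starting index comes from:

```python
def collate_batch(batch):
    """Flatten a batch of (label, token_ids) pairs the way nn.EmbeddingBag
    expects: one flat id sequence plus offsets marking each entry's start."""
    label_list, text_list, offsets = [], [], [0]
    for label, token_ids in batch:
        label_list.append(label)
        text_list.extend(token_ids)
        offsets.append(offsets[-1] + len(token_ids))
    # The final cumulative sum is the total length, not an entry start,
    # so it is dropped (the tutorial does the same with cumsum(...)[:-1]).
    return label_list, text_list, offsets[:-1]

batch = [(2, [4, 7, 1]), (0, [9]), (3, [5, 5, 2, 8])]
labels, flat_text, offsets = collate_batch(batch)
print(flat_text)  # [4, 7, 1, 9, 5, 5, 2, 8]
print(offsets)    # [0, 3, 4] -> entry i starts at flat_text[offsets[i]]
```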
print("This is a %s news" % ag_news_label[predict(ex_text_str, text_pipeline)])
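For the print statement above to run standalone, here is a hedged sketch of the pieces it assumes: the ``ag_news_label`` lookup uses the AG_NEWS dataset's four class names, while ``predict``, ``text_pipeline``, and ``ex_text_str`` are stubs (in the real tutorial, ``predict`` runs the trained model and returns the argmax class):

```python
# AG_NEWS class-id to name lookup (labels are 1-based in the dataset).
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}

def text_pipeline(text):
    return text.split()  # stub for the tokenizer-to-ids pipeline

def predict(text, pipeline):
    # Stub: pretend the trained model predicted class 3 for this input.
    return 3

ex_text_str = "placeholder news text about markets and earnings"
print("This is a %s news" % ag_news_label[predict(ex_text_str, text_pipeline)])
# prints: This is a Business news
```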
######################################################################
# Other data processing pipeline - SentencePiece
# ----------------------------------------------
#
# SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems where the vocabulary size is predetermined before the neural model is trained. The sentencepiece transforms in torchtext support both subword units (e.g., byte-pair encoding (BPE)) and the unigram language model. We provide a few pretrained SentencePiece models, accessible from ``PRETRAINED_SP_MODEL``. Here is an example of applying the SentencePiece transform to build the dataset.
#
# By using the ``spm_transform`` transform in the ``collate_batch`` function, you can re-run the tutorial with slightly improved results.
######################################################################
# The sentencepiece processor converts a text string into a list of integers. You can use the ``decode`` method to convert a list of integers back to the original string.
#
# ::
#
#     >>> 'torchtext sentencepiece processor can encode and decode'
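The encode/decode round trip described above can be illustrated with a toy stand-in. This is not the sentencepiece library: real SentencePiece learns subword pieces from data, whereas the fixed word-level piece table below is an invented example used only to show the round-trip contract between ``encode`` and ``decode``.

```python
class ToySentencePiece:
    """Toy stand-in for a sentencepiece processor: a fixed piece<->id table
    with encode/decode methods illustrating the round trip."""

    def __init__(self, pieces):
        self.id_to_piece = list(pieces)
        self.piece_to_id = {p: i for i, p in enumerate(pieces)}

    def encode(self, text):
        # Text string -> list of integer piece ids.
        return [self.piece_to_id[tok] for tok in text.split()]

    def decode(self, ids):
        # List of integer piece ids -> original text string.
        return " ".join(self.id_to_piece[i] for i in ids)

sp = ToySentencePiece(
    ["torchtext", "sentencepiece", "processor", "can", "encode", "and", "decode"]
)
ids = sp.encode("torchtext sentencepiece processor can encode and decode")
print(ids)            # [0, 1, 2, 3, 4, 5, 6]
print(sp.decode(ids)) # 'torchtext sentencepiece processor can encode and decode'
```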