PyiSA is a Python package providing access to the iSAX algorithm for supervised, aggregated text classification developed by VOICES from the Blogs.
This package is for academic use only; commercial use of iSA/iSAX is protected by U.S. provisional patent application No. 62/215264. PyiSA is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The iSA(X) algorithm is described in the paper "iSA: A fast, scalable and accurate algorithm for sentiment analysis of social media content", Information Sciences (2016).
In essence, iSA is an algorithm for providing aggregated classification of text documents based on a supervised (human coded) sample. Unlike general-purpose classifiers (e.g. naive Bayes, support vector machines, decision trees or neural networks), iSA is designed to give an good estimate of the distribution of categories across a full corpus of documents, not to provide accurate per-document catgeory predictions.
iSA is particularly effective when working with small training corpora (i.e. the number of human-coded documents is relatively low), and the iSAX variant adds a step which augments small documents (e.g. Tweets or other short texts), compensating for their short length and improving prediction accuracy.
The original implementation of iSAX by the paper authors (Andrea Ceron, Luigi Curini and Stefano Iacus) is the iSAX R package.
PyiSA is a Python implementation which replicates the core functionality of iSAX; it has been designed to mimic the
arguments and keywords of the R version where this is possible and sensible. However, a number of changes have been
made to render the package more "pythonic"; most notably, the prep_data() function has been reduced in its scope,
allowing users to choose their own approach to building a Term-Document Matrix from the several excellent packages
available for this purpose in Python. The test_isa.py file gives one example of this using Scikit-Learn and NLTK to
effectively construct a matrix for the English language, but other languages will require different approaches.
At present, PyiSA may be installed simply by dropping the ./pyisax/ folder and its contents in your project. Please
ensure that you install the following dependencies (using pip or another package manager):
pandas
numpy
scipy
quadprog
A future version of PyiSA will be installable automatically using the pip command.
Import PyiSA to your project using the following command:
from pyisax import PyiSAYou can now directly access the PyiSA.prep_data() function. This function expects to be passed a term-document
matrix (documents in rows, features/vocabulary in columns), and accepts either a Numpy array or a Scipy CSR sparse
matrix. It returns a list of string representations of each document which may be passed directly to the main iSA
object.
To use iSA, first create an instance of the algorithm object with the settings you wish to use:
my_isa = PyiSA(boot_count=1000, predict=False, sequence_length=5, sparse=False,
verbose=False, tolerance=0)Parameters: (all are optional)
boot_count: Controls the number of runs the algorithm will attempt - higher figures may yield more accurate results at the expense of processing time.predict: If True, will populate thepredict_catsattribute with each stem's predicted category after processing. These predictions are provided for informational purposes only.sequence_length: Controls the length of the sub-sequences which should be used to augment the feature space of the data in the iSAX step. Set this to 0 to skip iSAX and perform "vanilla" iSA on un-augmented data.sparse: Experimental; if set to True, will use Pandas sparse arrays for some internal processing. May help if memory constraints are very tight. Probably doesn't, though.verbose: Give text feedback on various elements of the iSA algorithm's progress through stdout.tolerance: Lower bound for the determinant of P'*P; the feature space matrix is considered uninvertible below this, raising an error in the algorithm. A value of 0 means this will never raise an error; generally best left alone.
Once you have created a PyiSA object with your required data, use it to predict category distribution as follows:
my_isa.fit(X_train, X_test, y_train)X_train and X_test are lists of document strings received from the prep_data function, the former being the documents
for which categorisation data exists, the latter being the remainder of the corpus. y_train is a list of categories for
the X_train documents (i.e. target variables).
After fitting, the results of the algorithm can be accessed from the following attributes:
| Attribute | Description |
|---|---|
best |
Best estimation of incidence of all categories across the corpus |
best_table |
Details of best estimation, including standard error and P-value |
estimate |
First estimate of category incidence. If boot_count is 0, this will be the same as best. |
estimate_table |
Detailed statistics for estimate. |
boot |
Results from each run of the equation (averaged to discover best). |
predict_cats |
If predict parameter is set, predicted category for each 'stem'. |
elapsed_time |
Time in seconds it took to fit the model to the data. |
iSAX has been used successfully with languages such as Chinese and Japanese; the core algorithm is perfectly suited to these languages, but the process of creating a Term-Document Matrix is more challenging, requiring the use of custom tokenisation software to extract stems from the corpus.
For Japanese, either the mecab-python3 package (requiring the
installation of external software - easy on Linux or macOS, very challenging on Windows) or the janome package (no
external dependencies) are recommended.
Questions or bug reports specific to the Python package should be directed through Github.
If you have a general question about iSA, or are interested in using iSA technology in an enterprise environment, please contact iSA@voices-int.com.