# Analysis Tools
These scripts analyze input data and run cross-validation tests using the Universal Recommender. The analysis shows information about usage data, and the CV tests use a technique called MAP@k to test the predictiveness of the many types of indicators used in a deployment of the UR. The procedure has three phases:
1. Split: Split the data into train and test sets and calculate usage stats.
2. Train and Deploy: Import the "train" dataset into the EventServer, then train and deploy the model. This can also be scripted but is not part of these scripts.
3. Analyze: Run several forms of MAP@k predictiveness tests on the splits, collect stats, and create a report for Excel.
Note: This toolset does #1 and #3 but does not do #2.
Since #1 can take a significant amount of time, it is advised to run it once, then train and deploy once, and run #3 with different configs if needed. There is no need to retrain in most cases. This will create a completely reproducible split so you can safely compare results from any run of #3.
If any change is made to engine.json, repeat #1-#3.
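For intuition about the metric, MAP@k can be computed with the ml_metrics package installed in the Setup section below. A minimal sketch with made-up users and item IDs (not part of the toolset):

```python
# Minimal MAP@k illustration using ml_metrics (installed in Setup below).
# "actual" holds each user's held-out test items; "predicted" holds the ranked
# recommendations returned for the same users by the deployed model.
import ml_metrics

actual    = [["i1", "i3"], ["i2"], ["i4", "i5"]]
predicted = [["i1", "i2", "i3"], ["i7", "i2"], ["i9", "i8"]]

print(ml_metrics.mapk(actual, predicted, k=10))  # mean average precision at k=10
```
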
# Setup
- Install Spark, PredictionIO v0.9.6 or greater, and the Universal Recommender v0.3.0 or greater. Make sure `pio status` completes with no errors and the integration test for the UR runs correctly.

- Install Python and check the version:

      python --version

  If the version is less than 2.7.9, upgrade to the most recent stable version of Python using your system's package management tools, such as `apt-get` for Ubuntu Linux or `brew` for the Mac.

- Install the Python libraries using the Python package manager `pip`:

      sudo pip install numpy scipy pandas ml_metrics predictionio tqdm click openpyxl

- Set up the Spark and PySpark paths in `.bashrc` (Linux) or `.bash_profile` (Mac):

      export SPARK_HOME=/path/to/spark
      export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/build/:$PYTHONPATH

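A quick optional check (not part of the toolset) that everything installed above is visible to the Python interpreter that will run the scripts:

```python
# Optional sanity check: the pip-installed libraries should import cleanly with
# the same Python that will run map_test.py.
import numpy, scipy, pandas, ml_metrics, predictionio, tqdm, click, openpyxl

# pyspark resolves only if the SPARK_HOME paths above are set; the py4j zip from
# $SPARK_HOME/python/lib may also need to be on PYTHONPATH (see the commands below).
import pyspark

print("All analysis-tool dependencies imported correctly.")
```
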
The analysis script should be run from the UR (Universal Recommender) folder. It uses two configuration files:

- `engine.json` (the configuration of the UR; this file is used to read the event list and the primary event)
- `config.json` (all other configuration, including the engine.json path if necessary)
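As a rough illustration (not the toolset's actual code) of how the two files relate: engine.json supplies the event list, whose first entry the UR treats as the primary event, while config.json carries everything else. The engine.json layout used below is an assumption about a typical UR configuration:

```python
import json

# Illustration only: read the event list and primary event from engine.json
# (assumed UR layout: algorithms[0].params.eventNames), everything else from config.json.
with open("./engine.json") as f:
    engine = json.load(f)
event_names = engine["algorithms"][0]["params"]["eventNames"]
primary_event = event_names[0]  # the UR treats the first event name as primary

with open("./config.json") as f:
    config = json.load(f)

print("events:", event_names)
print("primary event:", primary_event)
print("train file:", config["splitting"]["train_file"])
```
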
## Configuration options
config.json has the following structure:

    {
      "engine_config": "./engine.json",
      "splitting": {
        "version": "1",
        "source_file": "hdfs:...<PUT SOME PATH>...",
        "train_file": "hdfs:...<PUT SOME PATH>...train",
        "test_file": "hdfs:...<PUT SOME PATH>...test",
        "type": "date",
        "train_ratio": 0.8,
        "random_seed": 29750,
        "split_event": "<SOME NAME>"
      },
      "reporting": {
        "file": "./report.xlsx"
      },
      "testing": {
        "map_k": 10,
        "non_zero_users_file": "./non_zero_users.dat",
        "consider_non_zero_scores_only": true,
        "custom_combos": {
          "event_groups": [["ev2", "ev3"], ["ev6", "ev8", "ev9"]]
        }
      },
      "spark": {
        "master": "spark://<some-url>:7077"
      }
    }

- engine_config - file to be used as engine.json (see the configuration of the UR)
- splitting - this section controls splitting the data into train and test sets
  - version - version string appended to the train and test file names (helpful if different tests with different split configurations are run)
  - source_file - file with the data to be split
  - train_file - file with training data to be produced (note that the version will be appended to the file name)
  - test_file - file with test data to be produced (note that the version will be appended to the file name)
  - type - split type; can be "date" (in this case eventTime is used to make the split) or "random"
  - train_ratio - float in (0..1), the share of training samples
  - random_seed - seed for the random split
  - split_event - in the case of type = "date", this is the event used to find the split date: all events with this name are ordered by eventTime, and the time that divides them into a first train_ratio portion and a last (1 - train_ratio) portion is used to split all the rest of the data (events of all names) into the training set and the test set; a sketch of this logic follows this list
- reporting - reporting settings
  - file - Excel file to write the report to
  - csv_dir - directory name for CSV reporting
  - use_uuid - append a unique UUID associated with the script run to the CSV file names (useful for managing different results and reports)
- testing - this section describes the different tests and how to perform them
  - map_k - maximum MAP@k to be reported
  - non_zero_users_file - file in which to save users with scores != 0 after the first run of the test set with the primary event; this set may be much smaller than the initial set of users, so saving and reusing it can save much time
  - consider_non_zero_scores_only (default: true) - whether to take into account only users with non-zero scores (i.e. users for which recommendations exist)
  - custom_combos
    - event_groups - groups of events to be additionally tested if necessary
- spark - Apache Spark configuration
  - master - for now only the master URL is configurable
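To make the date-based split concrete, here is a rough sketch of the split_event logic described above. It is an illustration only (assuming the exported events sit in a pandas DataFrame with "event" and "eventTime" columns), not the toolset's actual implementation:

```python
import pandas as pd

def date_split(events: pd.DataFrame, split_event: str, train_ratio: float):
    """Illustration of the date-based split; `events` is assumed to have
    "event" and "eventTime" (parsed as datetime) columns."""
    # Order the occurrences of split_event by time and find the timestamp that
    # divides them into the first train_ratio and the last (1 - train_ratio).
    times = events.loc[events["event"] == split_event, "eventTime"].sort_values()
    split_time = times.iloc[int(len(times) * train_ratio)]
    # Events of every name before the split time go to train, the rest to test.
    train = events[events["eventTime"] < split_time]
    test = events[events["eventTime"] >= split_time]
    return train, test
```
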
## Split the Data
Get data from the EventServer with:

    pio export --output path/to/store/events

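pio export writes the events as JSON, one event per line, into part files under the chosen output directory. A small sketch for peeking at what will be split (the path is an example; field names follow PredictionIO's standard event format):

```python
import json

# Peek at a few exported events: pio export writes one JSON event per line
# into part-* files under the output directory given above.
with open("path/to/store/events/part-00000") as f:
    for line in list(f)[:3]:
        event = json.loads(line)
        print(event["event"], event["entityId"], event.get("targetEntityId"), event["eventTime"])
```
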
Use this command to split the data into "train" and "test" sets:

    SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py split

Additional options are available:

- `--csv_report` - write the report to a CSV file instead of Excel
- `--intersections` - calculate train / test event intersection data (advanced)
## Train a Model
The above command creates the test and training splits in the locations specified in config.json. Now you must import the "train" data, set up engine.json, then train and deploy the model so that the rest of the MAP@k tests can query it.
## Test and Analysis
To run all tests:

    SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py test --all

Additional options are available and may be used to run only some of the tests:

- `--csv_report` - write the report to a CSV file instead of Excel
- `--dummy_test` - run the dummy test
- `--separate_test` - run a test for each separate event
- `--all_but_test` - run a test with all events plus tests with all but each single event
- `--primary_pairs_test` - run tests with all pairs of events that include the primary event
- `--custom_combos_test` - run custom combo tests as configured in config.json
- `--non_zero_users_from_file` - use the list of users from the file prepared on a previous script run to save time
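For reference, the tests query the deployed model the way any client would. A minimal sketch using the PredictionIO Python SDK, assuming the UR is deployed on the default port 8000 and using a hypothetical user id:

```python
import predictionio

# Query the deployed UR for ranked recommendations; the returned itemScores list
# is what the MAP@k tests compare against each user's held-out test events.
engine_client = predictionio.EngineClient(url="http://localhost:8000")
result = engine_client.send_query({"user": "u-123", "num": 10})
print([item["item"] for item in result["itemScores"]])
```
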
## Generated Report
### Old IPython analysis script

This is the old, not recommended approach of running an IPython notebook:

    IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --master spark://spark-url:7077