The purpose of this project is to recreate the Word Power paper by Jegadeesh and Wu. A documented Jupyter Notebook in `notebooks/Word Power.ipynb` can be run to generate results, but it cannot use multiprocessing, so it is slower than the main program. To take advantage of multiprocessing, run the main program via `$ python main.py`.
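As a rough illustration of the multiprocessing speed-up, 10-K filings can be analyzed in parallel with Python's `multiprocessing` module. The sketch below is hypothetical: `analyze_filing` is only a stand-in for the per-filing work that the actual program delegates to the `WordPower` class.

```python
# Hypothetical sketch of parallel 10-K processing; analyze_filing is only a
# stand-in for the real per-filing analysis, not the project's actual code.
import multiprocessing
from pathlib import Path

def analyze_filing(path):
    # Placeholder work: count the words in a filing.
    text = Path(path).read_text(errors="ignore")
    return path, len(text.split())

if __name__ == "__main__":
    filings = [str(p) for p in Path("SEC-Edgar-data").rglob("*.txt")]
    with multiprocessing.Pool() as pool:
        for path, word_count in pool.imap_unordered(analyze_filing, filings):
            print(path, word_count)
```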
Because Python and Redis need to load the data into system memory (RAM), the program can be very memory intensive. At least 16 GB of RAM is required to run the full program (years 1995-2008). Running the program also requires a working Redis installation. On Windows, Redis can be installed through Chocolatey via `C:\> choco install redis-64`. On macOS, you can install Redis via `$ brew install redis`. On Linux or another Unix, you should be able to install Redis through your package manager or compile it from source.
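Before starting a long run, it can be worth verifying that the Redis server is up and reachable from Python. A minimal check using the `redis` package might look like the following, assuming the default host, port, and database.

```python
# Quick sanity check that a local Redis server is reachable (assumes the
# default localhost:6379, db 0; adjust if your setup differs).
import redis

client = redis.StrictRedis(host="localhost", port=6379, db=0)
try:
    client.ping()
    print("Redis is running and reachable.")
except redis.ConnectionError as exc:
    print("Could not connect to Redis:", exc)
```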
The program requires the following software and packages to run. First, you need Python 3.5.2 installed. Next, install all Python dependencies by running `$ pip install -r requirements.txt`. To get the lxml package installed on Windows, it may be necessary to install the .whl file located in the project's `lib` directory via `C:\> pip install lib/lxml-3.6.4-cp35-cp35m-win_amd64.whl`. We had some issues with the third-party SECEdgar package and had to modify it to get it to work properly. Once the package is installed via pip, you can copy our version from the `lib/SECEdgar` folder over the pip-installed version if needed.
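An optional way to confirm the environment is set up correctly is to check the interpreter version and try importing the packages mentioned above; the snippet below is only a convenience check, not part of the project.

```python
# Optional environment check: confirm the Python version and that the key
# third-party packages import cleanly.
import importlib
import sys

if sys.version_info[:3] != (3, 5, 2):
    print("Warning: Python 3.5.2 is expected, found", sys.version.split()[0])

for package in ("lxml", "redis"):  # extend with other entries from requirements.txt
    try:
        importlib.import_module(package)
        print(package, "imported OK")
    except ImportError as exc:
        print(package, "is missing:", exc)
```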
The project is structured as follows:
- `data` - This folder contains the data needed to run the analysis. The merged CRSP and Compustat data file is too large to include in the project, so it is necessary to run the SAS program (`CRSP+Comp.sas`) to generate the `crsp_comp.sas7bdat` data file first (a short pandas sketch for reading this file follows the list).
- `data/_amended` - This folder holds 10-K files that are amended 10-Ks.
- `data/_error` - This folder holds 10-K files that contained errors that made them impossible to analyze.
- `data/_nostockdata` - This folder contains 10-K files for which we had no stock information for the company on the filing date.
- `data/_outofrange` - This folder contains 10-K files that fall outside of the date range we are examining.
- `data/SEC-Edgar-data` - This folder is created by the Jupyter Notebook program to hold the downloaded 10-K files.
- `lib` - This folder contains library files that may be needed or helpful.
- `notebooks` - This folder contains the Jupyter Notebook file used in development of the algorithm.
- `SEC-Edgar-data` - This folder contains the 10-K files downloaded by the `main.py` program.
- `CRSP+Comp.egp` - The SAS Enterprise Guide project file that can be used to generate the CRSP and Compustat data.
- `CRSP+Comp.sas` - The SAS code file that can be run to generate the CRSP and Compustat data.
- `main.py` - The main file that runs the program version of the algorithm.
- `requirements.txt` - This file lists all dependencies, which can be installed via `$ pip install -r requirements.txt`.
- `Word Power - A New Approach for Content Analysis.pdf` - The PDF version of the paper we are recreating.
- `WordPower.py` - The Python file containing the `WordPower` class that `main.py` uses.
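As a quick check that the generated `crsp_comp.sas7bdat` file is usable from Python, it can be read with pandas' SAS reader. This is only a sketch and assumes pandas is available; the actual columns depend on the output of `CRSP+Comp.sas`.

```python
# Sanity check for the SAS output: read data/crsp_comp.sas7bdat with pandas.
# Assumes pandas is installed; the exact columns depend on CRSP+Comp.sas.
import pandas as pd

crsp_comp = pd.read_sas("data/crsp_comp.sas7bdat", format="sas7bdat")
print(crsp_comp.shape)
print(crsp_comp.head())
```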
For any issues, or if you have trouble obtaining the correct data, please contact Andrew Jarrett at andrew.jarrett@gatech.edu.