mdubatto/DSI_Notes

Table of Contents:

  • List of things to look into
  • Links
  • Terminal commands
  • Git
  • Probability
  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • AWS
  • Docker
  • Bias Variance Tradeoff
  • Cross-validation
  • Sklearn
  • KNN (k-Nearest Neighbors)
  • Ridge

List of things to look into:

  • MyPy (Dropbox is currently working on mypyc)
  • Download p4merge
  • look into internet extenders
  • find out what map() is
  • pyodbc
  • crontab, Luigi, Apache Airflow, or JAMS for automation
  • DBeaver to get column names
  • Postman for interpreting APIs
  • toptal
  • haversine

Links:


  • bash profile location on OSX: ~/.bash_profile
  • for zsh use: ~/.zshrc

How to make a bash / zsh function

function gitadder(){
    git pull
    git add -A
    if [ "$1" != "" ]; then
        git commit -m "$1: $(date '+%b %d, %Y %H:%M:%S')"
    else
        git commit -m "Auto Update: $(date '+%b %d, %Y %H:%M:%S')"
    fi
    git push
}

Call this function using

gitadder "Enter update text"

# Or just...

gitadder

Back to top


  • echo "Hello World" - prints message to screen
  • echo "Hello World" > hello.txt - prints message to file (creates new file or overwrites existing)
  • echo "goodbye" >> hello.txt - appends message to the end of file
  • cat hello.txt - prints file to terminal
  • man ls - print out a help menu for the command
  • history > history.txt - saves command history in a file
  • grep <filter> - looks at sheet of text and outputs only lines that have the keyword you are looking for
  • !<history number> - runs command from a certain line in history
  • open https://google.com & - opens google

Back to top


Git

  • git status - Displays the state of the working directory and the staging area: which changes have been staged, which haven't, and which files aren't being tracked by Git
  • git log - View version history of current branch (to exit this type q)
  • git reset [file] - Unstages the file, but it preserves the file contents.
  • git reset [commit] - Undoes all the commits after the specified commit and preserves the changes locally.
  • git add . - Adds all changed files in the current directory (and below) to the staging area
  • git add -A - Adds all changes in the entire working tree to the staging area
  • git add [filename1 filename2] - Adds only listed files to the staging area
  • git commit -m "[Enter commit message]" - Creates a new commit containing the current contents of the index
  • git pull - Fetches and merges changes on the remote server to your working directory
  • git push - Send your updates and new files in your commit from your local machine to the remote repository
  • git revert [commit] - Creates a new commit that undoes the changes introduced by the specified commit

Working on a team

  • git clone the person's repo

  • git branch [branchname]

  • git checkout [branchname]

  • git add .

  • git commit -m "adding new branch to remote repo"

  • git push --set-upstream origin [branchname]

  • git checkout master

Back to top


Probability

Conditional probability

The probability of event A given event B. When the two events are independent, the probability is simply P(A).

P(A | B) = P(A ∩ B) / P(B)

Bayes' theorem

Describes the probability of a posterior event, based on previous conditions related to the event. For example, given the probabilities P(A), P(B), and P(B | A), Bayes' theorem can be applied to calculate P(A | B).

Tip: it's a Bayes problem if there are two different conditional probabilities in play (you're given P(B | A) and asked for P(A | B)).

P(A | B) = P(B | A) * P(A) / P(B)

Law of total probability

P(B) = Σ_i P(B | A_i) * P(A_i)

Back to top


Python

Generator expressions

Some simple generators can be coded succinctly as expressions using a syntax similar to list comprehensions but with parentheses instead of square brackets. These expressions are designed for situations where the generator is used right away by an enclosing function. Generator expressions are more compact but less versatile than full generator definitions and tend to be more memory friendly than equivalent list comprehensions.

sum(i*i for i in range(10))         # sum of squares

# instead of...

sum([i*i for i in range(10)])       # needlessly allocates a list in memory
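
For comparison, the same computation as a full generator definition (a minimal sketch; squares is just an illustrative name):

def squares(n):
    # yields one square at a time instead of building a list in memory
    for i in range(n):
        yield i*i

sum(squares(10))                    # same result, evaluated lazily
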
  • dir(var_name) - shows all methods and attributes available for a variable
  • who - lists defined variable names (IPython/Jupyter magic)

Pandas

Create a DataFrame

  • from a dictionary (creating by columns)

    df = pd.DataFrame({'Letters': ['a','b','c'],
                       'Numbers': [1,2,3]})
  • from list of lists (creating by rows)

    df = pd.DataFrame([['a', 1],
                       ['b', 2],
                       ['c', 3]],
                       columns=['Letters','Numbers'])
  • from csv (use parameter sep='\t' for txt file)

    df_csv = pd.read_csv('filename.csv')
    df_txt = pd.read_csv('filename.txt', sep='\t')

Summary statistics (numerical variables)

df.describe()

Correlation -- Pearson and Spearman (numerical variables)

df.corr(method='pearson').round(3)

Drop a column

df.drop('col_1', axis = 1)

Rename columns

df.rename(columns={'old_name': 'new_name'})

Value counts for a column (i.e. series)

s.value_counts(dropna=False)

Fill NA with mean (mean can be replaced with other stat functions)

s.fillna(s.mean())

Convert a series of str formatted dates into a datetime type

norm_dates = pd.to_datetime(str_dates, format='%Y%m%d')

Back to top


NumPy

Create an array

a = np.array([1,2,3,4,5,6])     # 1D
b = np.array([[1,2,3],          # 2D
              [4,5,6]])
c = np.array([[[1,2,3],         # 3D
               [4,5,6]],
              [[7,8,9],
               [10,11,12]]])

Reshape array

a = np.array([1,2,3,4,5,6]).reshape(2,3)  # 2 rows, 3 cols

Create an array of evenly spaced values (by step)

a = np.arange(1,10,2)

Create an array of evenly spaced values (by # of samples)

a = np.linspace(1,10,101)    # odd numbers for the 3rd parameter work best

np.random.uniform & np.random.normal

The np.random subpackage contains some functions for creating arrays of random numbers. These two are the most useful, but there are more!

unif = np.random.uniform(low=0.0, high=1.0, size=10**7)

fig, ax = plt.subplots(figsize=(10, 4))
_ = ax.hist(unif, bins=100, color="green")

norm = np.random.normal(loc=0.0, scale=1.0, size=10**7)

fig, ax = plt.subplots(figsize=(10, 4))
_ = ax.hist(norm, bins=100, color="green")

Create a new array based on conditions

arr = np.arange(10)
out = np.where(arr % 2 == 1, -1, arr)
print(arr)
out
#> [0 1 2 3 4 5 6 7 8 9]
# array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])

Back to top


Matplotlib

The Object Oriented interface

# this is a shorter way to create 
# a single blank figure & axes
fig, ax = plt.subplots()

ax.plot(x_data,y_data_1)

ax.set_title('some stuff')
ax.set_xlabel('horizontal axis')
ax.set_ylabel('vertical axis')

# let's add another plot on these axes
ax.scatter(x_data, y_data_2, color='b')

Subplots

  • plt.subplots(n,m) specifies a grid with n rows and m columns, with an axes object in each grid square. These axes objects are returned in a numpy array.
  • the figsize=(width, height) parameter in subplots() controls the figure size
  • plt.tight_layout() prevents ugly overlapping text
  • Remember you can use the dir() function on any object to see its attributes and methods!!!
fig, axs = plt.subplots(2,4, figsize=(10,4))

axs[0,2].plot(x_data,y_data_2)
axs[0,2].set_title('cool')

plt.tight_layout()

Iterating over subplots

fig, axs = plt.subplots(2,4, figsize=(10,4))

for i, ax in enumerate(axs.flatten()):
    ax.scatter(x_data,y_data_3)
    ax.set_title(f'Plot number {i}')

plt.tight_layout()

You can also use a list comprehension

m, n = 5,7
xx = np.linspace(0,10,40)
yy = np.random.random(size = (len(xx), m*n))
fig, axs = plt.subplots(m,n, figsize = (10,4))

[ax.scatter(xx, yy[:,i]) for i, ax in enumerate(axs.flatten())];

Scatterplot

plt.scatter(
    x,              # scalar or array
    y,              # scalar or array
    s=None,         # marker size
    c=None,         # color
    marker=None,    # marker shape (ex. 'o'(default),'x','*','v','^')
    cmap=None,      # a `.Colormap` instance or registered colormap name; cmap is
                    # only used if c is an array of floats
    norm=None,      # used to scale luminance data between 0 and 1 (only used when c is an array of floats)
    vmin=None,      # only used with norm
    vmax=None,      # only used with norm
    alpha=None,     # The alpha blending value, between 0 (transparent) and 1 (opaque).
    linewidths=None,# border width
    edgecolors=None,# color of border
    *,
    plotnonfinite=False,
    data=None,
    **kwargs,
)
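
A minimal usage example (x and y here are assumed to be equal-length arrays):

# Example
plt.scatter(x, y, s=20, c='red', marker='x', alpha=0.5)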

plot() draws a line between the points instead of just plotting the points

# Example
plt.plot(x, 
         y, 
         color='k', 
         linestyle='--',
         linewidth=2,
         marker='^', 
         markersize=8)

Plot() Documentation

Back to top


AWS

Connecting to an AWS VM over SSH:

In ~/.ssh/config add...

# "examplename" is the alias you'll type after ssh;
# replace the HostName with your instance's public IP from AWS
Host examplename
    HostName 52.27.155.84
    User ubuntu
    IdentityFile ~/.ssh/rft5.pem

Call using ssh examplename

Making a bucket and adding files to it

import boto3

s3 = boto3.client('s3')

remote_file_name = 'cancer_rates.png'
local_file_name = 'cancer_rates.png'
bucket_name = 'mdubatto1'

s3.create_bucket(Bucket=bucket_name)

s3.upload_file(Filename=local_file_name, 
               Bucket=bucket_name, 
               Key=remote_file_name)
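
Note: outside of us-east-1, create_bucket may also need a CreateBucketConfiguration with a LocationConstraint for your region.

To pull a file back down later, the client has a matching download_file method (a minimal sketch reusing the names above; the 'downloaded_' prefix is just illustrative):

s3.download_file(Bucket=bucket_name,
                 Key=remote_file_name,
                 Filename='downloaded_' + local_file_name)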

Copying a .py file from a local environment to an AWS machine

scp local_file.py examplename:/home/ubuntu/remote_file.py

Back to top


Docker

Postgres container

docker run --name pgserv -d -p 5432:5432 -v "$PWD":/home/data -e POSTGRES_PASSWORD='password' postgres
  • the -d flag means "run this container in the background"
  • -p 5432:5432 means "connect port 5432 from this computer (localhost) to the container's port 5432". This will allow us to connect to the Postgres server (which is inside the container) from services running outside of the container (such as python, as we'll see later).
    • Most services expect to find Postgres running on port 5432. If you have a previous installation of Postgres on your machine, you may want to map a different host port instead, e.g. -p 5435:5432. If you do, remember to specify that port whenever you connect from your system.
  • the -v flag connects the filesystem in the container to your computer's filesystem. See the documentation for docker volumes.
    • Here, the container's folder /home/data will be mapped to whichever folder you ran the docker run command from ($PWD). If you want to make your entire home folder visible to the docker container, navigate to ~ before running the above command. If you only want the container to see, say, a folder you cloned from github, navigate to /path/to/repo_folder first. Any changes made to files in this folder are immediately visible to the container and your native file system. This is important for the step of loading data into the database
  • the -e flag sets an environment variable, here the default password for Postgres. You most likely want to choose something better than the password above.
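
Once the container is up, you can connect from Python. A minimal sketch using psycopg2 (assuming it is installed and you kept the port and password above):

import psycopg2

# connect to the containerized Postgres server through the mapped port
conn = psycopg2.connect(host='localhost', port=5432,
                        user='postgres', password='password',
                        dbname='postgres')
cur = conn.cursor()
cur.execute('SELECT version();')
print(cur.fetchone())
conn.close()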

PySpark container

docker run --name sparkbook -p 8881:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook start.sh jupyter lab --LabApp.token=''
  • this will take a while to download the first time you run it
  • here I've given this container the name sparkbook. You can call it whatever you like.
  • the -p flag maps host port 8881 to the container's port 8888, so as not to collide with any other notebooks you have running.
  • the -v flag connects the filesystem in the container to your computer's filesystem. See the documentation for docker volumes.
    • Here, the container's folder /home/jovyan/work will be mapped to whichever folder you ran the docker run command from ($PWD). If you want to make your entire home folder visible to the docker container, navigate to ~ before running the above command. If you only want the container to see, say, a folder you cloned from github, navigate to /path/to/repo_folder first. Then you can make changes from inside the container and run git commands outside the container

Open a bash shell in a container

docker exec -it <container name> bash

Back to top


Bias Variance Tradeoff

High Bias:

  • Model is underfit
  • Line is too rigid
  • Not enough features
  • Errors tend towards one side or the other in blocks
  • Error magnitude not randomly distributed

High Variance:

  • Model is overfit
  • Line is too flexible
  • Too many features
  • Errors tend to alternate positive/negative
  • Error magnitudes normally distributed

Conclusion:

  • Optimal model has minimum total error
  • Neither overfit nor underfit
  • It is necessary to do train-test split to observe error on unseen data
  • All else being equal, prefer simpler models to more complex

Back to top


Cross-validation

We use cross-validation for two things:

  1. Attempting to quantify how well a model (of some given complexity) will predict on an unseen data set
  2. Tuning hyperparameters of models to get best predictions.

The code snippet below loads the iris dataset, splits it into a train/validation set and a test set, instantiates a logistic regression model, and runs cross-validation on the train/validation set with 5 folds.

from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, test_size=0.33, random_state=42)

logreg = LogisticRegression()
scores = cross_val_score(logreg, X_trainval, y_trainval, cv=5)
print(f"Cross-validation scores: {scores}")

Note: Don't freak out if you have testing error lower than training error...it means your outliers probably all got randomly selected in the training set

KFolds - k = 5 is common
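
For the hyperparameter-tuning use case (item 2 above), sklearn's GridSearchCV wraps this loop. A minimal sketch reusing the imports and data from the snippet above (the C grid is just an illustration):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10]}   # hypothetical grid for LogisticRegression's C
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_trainval, y_trainval)
print(grid.best_params_, grid.best_score_)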

Back to top


Sklearn

train_test_split

from sklearn.model_selection import train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Get list of scorers

import sklearn

sorted(sklearn.metrics.SCORERS.keys())

Building a model

  • similar process for all sklearn models...just different parameters
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error: {rmse}")

Back to top


KNN (k-Nearest Neighbors)

Methods for calculating distance: Distance equations

Pros:

  • Super simple
  • Training is trivial (store the data)
  • Works with any number of classes
  • Easy to add more data
  • Few hyperparameters:
    • distance metric
    • k

Cons:

  • High prediction cost (especially for large datasets)
  • Bad with high dimensions
  • Categorical features don’t work well

Other notes:

  • Rule of thumb: k = n**0.5 (see the sketch below)
  • Don't forget to scale your data!!
  • Works well for dimensions < 5 (curse of dimensionality)
  • The more dimensions you have, the more data points are needed to maintain density: going from d_original dimensions to d_new dimensions takes roughly n**(d_new/d_original) points
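
A minimal sklearn sketch tying these notes together (scale first, then k = n**0.5; X_train/y_train/X_test/y_test are assumed to exist as in the Sklearn section):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# distance-based models are sensitive to feature scale, so scale first
k = int(len(X_train) ** 0.5)        # rule-of-thumb k
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))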

Back to top

Ridge

A greater regularization strength lambda (called alpha in sklearn) shrinks the coefficients more, which means a simpler model.
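
A minimal sklearn sketch (alpha plays the role of lambda; X_train/y_train as above):

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=10.0)   # larger alpha -> stronger shrinkage -> simpler model
ridge.fit(X_train, y_train)
print(ridge.coef_)          # coefficients pulled toward zero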

About

This is to track my notes for the Galvanize DSI.
