- MyPy (Dropbox is currently working on MyPyC)
- Download p4merge
- look into internet extenders
- find out what map() is
- pyodbc
- crontab, Luigi, Apache, or JAMS for automation
- dbeaver to get column names
- Postman for interpreting APIs
- toptal
- haversine
- DateTime
- Git Reference
- *args and **kwargs gist
- Unix Cheat Sheet
- Markdown cheat sheet
- Python Tutor
- Color brewer
- Univariate distribution relationships
- Visualizing scipy.stats distributions
- MathJax basic tutorial and quick reference
- bash profile location on OSX: `~/.bash_profile` (for zsh use: `~/.zshrc`)
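One of the to-dos above is finding out what `map()` is; a minimal sketch:

```python
nums = [1, 2, 3, 4]

# map() applies a function to every item of an iterable and returns a
# lazy map object; wrap it in list() to materialize the results
squares = list(map(lambda x: x**2, nums))
print(squares)  # [1, 4, 9, 16]
```

It's equivalent to the list comprehension `[x**2 for x in nums]`, which is often considered more readable in Python.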
```
function gitadder(){
    git pull
    git add -A
    if [ "$1" != "" ]; then
        git commit -m "$1: $(date '+%b %d, %Y %H:%M:%S')"
    else
        git commit -m "Auto Update: $(date '+%b %d, %Y %H:%M:%S')"
    fi
    git push
}
```

```
gitadder "Enter update text"
# Or just...
gitadder
```

- `echo "Hello World"` - prints message to screen
- `echo "Hello World" > hello.txt` - prints message to file (creates new file or overwrites existing)
- `echo "goodbye" >> hello.txt` - appends message to the end of file
- `cat hello.txt` - prints file to terminal
- `man ls` - prints out a help menu for the command
- `history > history.txt` - saves command history in a file
- `grep <filter>` - looks at a sheet of text and outputs only lines that have the keyword you are looking for
- `!<history number>` - runs command from a certain line in history
- `open https://google.com &` - opens google
- `git status` - Display the state of the working directory and the staging area; see which changes have been staged, which haven't, and which files aren't being tracked by Git
- `git log` - View version history of current branch (to exit this type `q`)
- `git reset [file]` - Unstages the file, but preserves the file contents
- `git reset [commit]` - Undoes all the commits after the specified commit and preserves the changes locally
- `git add .` - Adds all files from the root directory to the staging area
- `git add -A` - Adds all files from root and sub directories to the staging area
- `git add [filename1 filename2]` - Adds only listed files to the staging area
- `git commit -m "[Enter commit message]"` - Creates a new commit containing the current contents of the index
- `git pull` - Fetches and merges changes on the remote server to your working directory
- `git push` - Sends your updates and new files in your commit from your local machine to the remote repository
- `git revert [commit]` - Undoes the commit
- `git clone` the person's repo
- `git branch [branchname]`
- `git checkout [branchname]`
- `git add .`
- `git commit -m "adding new branch to remote repo"`
- `git push --set-upstream origin [branchname]`
- `git checkout master`
The probability of event A given B. When the two events are independent, the probability is simply P(A).
Describes the probability of a posterior event, based on previous conditions related to the event. For example, given the probabilities P(B) and P(B | A), Bayes' theorem can be applied to calculate P(A | B).
Tip: it's a Bayes problem if there are two different…
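Bayes' theorem can be checked with a few lines; the probabilities below are made-up illustration values (a rare condition and an imperfect test):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a = 0.01              # P(A): prior probability of the condition (assumed)
p_b_given_a = 0.99      # P(B|A): positive test given the condition (assumed)
p_b_given_not_a = 0.05  # P(B|~A): false positive rate (assumed)

# law of total probability gives P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))
```

Even with a 99%-sensitive test, the posterior is only about 1/6 here, because the condition is rare.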
Some simple generators can be coded succinctly as expressions using a syntax similar to list comprehensions but with parentheses instead of square brackets. These expressions are designed for situations where the generator is used right away by an enclosing function. Generator expressions are more compact but less versatile than full generator definitions and tend to be more memory friendly than equivalent list comprehensions.
```
sum(i*i for i in range(10))      # sum of squares
# instead of...
sum([i*i for i in range(10)])    # needlessly allocates a list in memory
```

- `dir(var_name)` - shows all methods available for a variable
- `who` - list variable names
- from a dictionary (creating by columns)

```
df = pd.DataFrame({'Letters': ['a','b','c'], 'Numbers': [1,2,3]})
```

- from list of lists (creating by rows)

```
df = pd.DataFrame([['a', 1], ['b', 2], ['c', 3]], columns=['Letters','Numbers'])
```

- from csv (use parameter `sep='\t'` for txt file)

```
df_csv = pd.read_csv('filename.csv')
df_txt = pd.read_csv('filename.txt', sep='\t')
```
- `df.describe()`
- `df.corr(method='pearson').round(3)`
- `df.drop('col_1', axis=1)`
- `df.rename(columns={'old_name': 'new_name'})`
- `s.value_counts(dropna=False)`
- `s.fillna(s.mean())`
- `norm_dates = pd.to_datetime(str_dates, format='%Y%m%d')`

```
a = np.array([1,2,3,4,5,6])      # 1D
b = np.array([[1,2,3],           # 2D
              [4,5,6]])
c = np.array([[[1,2,3],          # 3D
               [4,5,6]],
              [[7,8,9],
               [10,11,12]]])
a = np.array([1,2,3,4,5,6]).reshape(2,3)  # 2 rows, 3 cols
a = np.arange(1,10,2)
a = np.linspace(1,10,101)  # odd numbers for 3rd parameter work best
```

The np.random subpackage contains some functions for creating arrays of random numbers. These two are the most useful, but there are more!
```
unif = np.random.uniform(low=0.0, high=1.0, size=10**7)
fig, ax = plt.subplots(figsize=(10, 4))
_ = ax.hist(unif, bins=100, color="green")

norm = np.random.normal(loc=0.0, scale=1.0, size=10**7)
fig, ax = plt.subplots(figsize=(10, 4))
_ = ax.hist(norm, bins=100, color="green")
```

```
arr = np.arange(10)
out = np.where(arr % 2 == 1, -1, arr)
print(arr)
out
#> [0 1 2 3 4 5 6 7 8 9]
#> array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])
```

```
# this is a shorter way to create
# a single blank figure & axes
fig, ax = plt.subplots()
ax.plot(x_data, y_data_1)
ax.set_title('some stuff')
ax.set_xlabel('horizontal axis')
ax.set_ylabel('vertical axis')
# let's add another plot on these axes
ax.scatter(x_data, y_data_2, color='b')
```

- `plt.subplots(n,m)` specifies a grid with n rows and m columns, with an axes object in each grid square. These axes objects are returned in a numpy array.
- the `figsize=(width, height)` parameter in subplots() controls the size
- `plt.tight_layout()` prevents ugly overlapping text
- Remember you can use the `dir()` function on any object to see its attributes and methods!!!
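As a quick illustration of the `dir()` tip, listing the public methods of Python's built-in list:

```python
# dir() on any object (or class) lists its attributes and methods;
# filtering out the underscore-prefixed names leaves the public API
methods = [m for m in dir(list) if not m.startswith('_')]
print(methods)
```

The same trick works on a matplotlib axes, a DataFrame, or anything else you're exploring interactively.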
```
fig, axs = plt.subplots(2,4, figsize=(10,4))
axs[0,2].plot(x_data, y_data_2)
axs[0,2].set_title('cool')
plt.tight_layout()
```

```
fig, axs = plt.subplots(2,4, figsize=(10,4))
for i, ax in enumerate(axs.flatten()):
    ax.scatter(x_data, y_data_3)
    ax.set_title(f'Plot number {i}')
plt.tight_layout()
```

```
m, n = 5, 7
xx = np.linspace(0, 10, 40)
yy = np.random.random(size=(len(xx), m*n))
fig, axs = plt.subplots(m, n, figsize=(10,4))
[ax.scatter(xx, yy[:,i]) for i, ax in enumerate(axs.flatten())];
```

```
plt.scatter(
    x,               # scalar or array
    y,               # scalar or array
    s=None,          # marker size
    c=None,          # color
    marker=None,     # marker shape (ex. 'o'(default),'x','*','v','^')
    cmap=None,       # a `.Colormap` instance or registered colormap name;
                     # only used if c is an array of floats
    norm=None,       # used to scale luminance data between 0 and 1 (only used when c is an array of floats)
    vmin=None,       # only used with norm
    vmax=None,       # only used with norm
    alpha=None,      # the alpha blending value, between 0 (transparent) and 1 (opaque)
    linewidths=None, # border width
    edgecolors=None, # color of border
    *,
    plotnonfinite=False,
    data=None,
    **kwargs,
)
```

```
# Example
plt.plot(x,
         y,
         color='k',
         linestyle='--',
         linewidth=2,
         marker='^',
         markersize=8)
```

In ~/.ssh/config add...
```
Host examplename                 # name of host
    HostName 52.27.155.84        # replace with path from AWS
    User ubuntu
    IdentityFile ~/.ssh/rft5.pem
```

Call using `ssh examplename`
```
import boto3

s3 = boto3.client('s3')
remote_file_name = 'cancer_rates.png'
local_file_name = 'cancer_rates.png'
bucket_name = 'mdubatto1'
s3.create_bucket(Bucket=bucket_name)
s3.upload_file(Filename=local_file_name,
               Bucket=bucket_name,
               Key=remote_file_name)
```

```
scp local_file.py examplename:/home/ubuntu/remote_file.py
```

```
docker run --name pgserv -d -p 5432:5432 -v "$PWD":/home/data -e POSTGRES_PASSWORD='password' postgres
```

- the `-d` flag means "run this container in the background"
- `-p 5432:5432` means "connect port 5432 from this computer (localhost) to the container's port 5432". This will allow us to connect to the Postgres server (which is inside the container) from services running outside of the container (such as python, as we'll see later).
  - Most services expect to find Postgres running on port 5432. If you have any previous installation of Postgres, you may want to change this to 5435. If you do, you will have to remember to specify which port to connect to when working from your system.
- the `-v` flag connects the filesystem in the container to your computer's filesystem. See the documentation for docker volumes.
  - Here, the container's folder `/home/data` will be mapped to whichever folder you ran the `docker run` command from (`$PWD`). If you want to make your entire home folder visible to the docker container, navigate to `~` before running the above command. If you only want the container to see, say, a folder you cloned from github, navigate to `/path/to/repo_folder` first. Any changes made to files in this folder are immediately visible to the container and your native file system. This is important for the step of loading data into the database.
- the `-e` flag sets an environment variable, here the default password for postgres. You most likely want to choose something better than the default `password` above.
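To connect from Python, you'd pass settings matching those flags to a Postgres driver such as psycopg2. The values below are the defaults from the `docker run` command above and are placeholders for whatever you actually chose; this sketch only builds the connection (DSN) string and doesn't open a connection:

```python
# With a driver like psycopg2 installed you'd pass these straight to
# psycopg2.connect(**params); the values are assumptions mirroring the
# docker run flags above.
params = {
    "host": "localhost",
    "port": 5432,            # 5435 if you remapped to dodge a local install
    "user": "postgres",      # default superuser of the official image
    "password": "password",  # the value passed via -e POSTGRES_PASSWORD
    "dbname": "postgres",    # default database
}

# psycopg2 also accepts a single space-separated DSN string
dsn = " ".join(f"{k}={v}" for k, v in params.items())
print(dsn)
```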
```
docker run --name sparkbook -p 8881:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook start.sh jupyter lab --LabApp.token=''
```

- this will take a while to download the first time you run it
- here I've given this container the name `sparkbook`. You can call it whatever you like.
- the `-p` flag is exposing port `8881`, so as not to collide with any other notebooks you have running.
- the `-v` flag connects the filesystem in the container to your computer's filesystem. See the documentation for docker volumes.
  - Here, the container's folder `/home/jovyan/work` will be mapped to whichever folder you ran the `docker run` command from (`$PWD`). If you want to make your entire home folder visible to the docker container, navigate to `~` before running the above command. If you only want the container to see, say, a folder you cloned from github, navigate to `/path/to/repo_folder` first. Then you can make changes from inside the container and run git commands outside the container.

```
docker exec -it <container name> bash
```

High Bias:
- Model is underfit
- Line is too rigid
- Not enough features
- Errors tend towards one side or the other in blocks
- Error magnitude not randomly distributed
High Variance:
- Model is overfit
- Line is too flexible
- Too many features
- Errors tend to alternate positive/negative
- Error magnitudes normally distributed
Conclusion:
- Optimal model has minimum total error
- Neither overfit nor underfit
- It is necessary to do train-test split to observe error on unseen data
- All else being equal, prefer simpler models to more complex
We use cross-validation for two things:
- Attempting to quantify how well a model (of some given complexity) will predict on an unseen data set
- Tuning hyperparameters of models to get best predictions.
The code snippet below loads data on the iris training set, splits it into a train/validation and test set, instantiates a logistic regression model and runs cross validation on the train/validation set with 5 folds.
```
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, test_size=0.33, random_state=42)
logreg = LogisticRegression()
scores = cross_val_score(logreg, X_trainval, y_trainval, cv=5)
print(f"Cross-validation scores: {scores}")
```

Note: Don't freak out if you have testing error lower than training error... it probably means your outliers all got randomly selected into the training set.
KFolds - k = 5 is common

```
from sklearn.model_selection import train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```

```
import sklearn
sorted(sklearn.metrics.SCORERS.keys())
```

- similar process for all sklearn models... just different parameters
```
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error: {rmse}")
```

Methods for calculating distance:

Pros:
- Super simple
- Training is trivial (store the data)
- Works with any number of classes
- Easy to add more data
- Few hyperparameters:
- distance metric
- k
Cons:
- High prediction cost (especially for large datasets)
- Bad with high dimensions
- Categorical features don’t work well
Other notes:
- Rule of thumb - k = n**0.5
- Don't forget to scale your data!!
- Works well for dimensions < 5 (curse of dimensionality)
- The more dimensions you have the more data points are needed to maintain density
- n**(dnew/doriginal)
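The notes above can be sketched end to end; a minimal example assuming scikit-learn, with the iris dataset and 0.25 test split chosen purely for illustration (k comes from the n**0.5 rule of thumb):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# don't forget to scale! (fit the scaler on train only to avoid leakage)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

k = int(len(X_train) ** 0.5)   # rule of thumb: k = n**0.5
knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
acc = knn.score(X_test_s, y_test)
print(f"k={k}, test accuracy={acc:.3f}")
```

Note the pattern matches the notes: training really is just storing the data, and prediction is where the cost lives.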
Greater lambda (regularization strength) means a simpler model.
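A quick sketch of that effect: in scikit-learn, lambda is the `alpha` parameter of Ridge/Lasso, and the synthetic data below is an assumption made purely for illustration:

```python
# Larger alpha shrinks the learned coefficients toward zero,
# i.e. a simpler, more rigid model.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_coefs = np.array([3.0, -2.0, 1.0, 0.5, 0.0])
y = X @ true_coefs + rng.normal(scale=0.5, size=100)

small = Ridge(alpha=0.01).fit(X, y)   # almost no regularization
large = Ridge(alpha=1000).fit(X, y)   # heavy regularization
print(np.abs(small.coef_).sum(), np.abs(large.coef_).sum())
```

The coefficient magnitudes under `alpha=1000` come out much smaller than under `alpha=0.01`, which is the "simpler model" the note describes.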



