Commit bc7b4c4

authored
Update README.md
1 parent 226dcd2 commit bc7b4c4

1 file changed

+0
-354
lines changed

README.md

Lines changed: 0 additions & 354 deletions
@@ -3,8 +3,6 @@ Please visit the [wiki](https://github.com/ksator/Machine_Learning_with_Python/w
33

44
# Documentation structure
55

6-
- [Introduction to arrays using numpy](#introduction-to-arrays-using-numpy)
7-
- [visualize a dataset using seaborn](#visualize-a-dataset-using-seaborn)
86
- [manipulate dataset with pandas](#manipulate-dataset-with-pandas)
97
- [iris flowers classification](#iris-flowers-classification)
108
- [iris flowers data set](#iris-flowers-data-set)
@@ -23,358 +21,6 @@ Please visit the [wiki](https://github.com/ksator/Machine_Learning_with_Python/w
2321

2422

2523

26-
# manipulate dataset with pandas
27-
pandas is a Python library for data manipulation
28-
29-
```
30-
>>> import pandas as pd
31-
```
32-
```
33-
>>> bear_family = [
34-
... [100, 5 , 20, 80],
35-
... [50 , 2.5, 10, 40],
36-
... [110, 6 , 22, 80]]
37-
>>> bear_family
38-
[[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]]
39-
>>> type(bear_family)
40-
<class 'list'>
41-
```
42-
use the DataFrame class
43-
```
44-
>>> bear_family_df = pd.DataFrame(bear_family)
45-
>>> type(bear_family_df)
46-
<class 'pandas.core.frame.DataFrame'>
47-
>>> bear_family_df
48-
0 1 2 3
49-
0 100 5.0 20 80
50-
1 50 2.5 10 40
51-
2 110 6.0 22 80
52-
```
53-
We can specify column and row names
54-
```
55-
>>> bear_family_df = pd.DataFrame(bear_family, index = ['mom', 'baby', 'dad'], columns = ['leg', 'hair','tail', 'belly'])
56-
>>> bear_family_df
57-
leg hair tail belly
58-
mom 100 5.0 20 80
59-
baby 50 2.5 10 40
60-
dad 110 6.0 22 80
61-
```
62-
access the leg column of the table
63-
```
64-
>>> bear_family_df.leg
65-
mom 100
66-
baby 50
67-
dad 110
68-
Name: leg, dtype: int64
69-
>>> bear_family_df["leg"]
70-
mom 100
71-
baby 50
72-
dad 110
73-
Name: leg, dtype: int64
74-
>>> bear_family_df["leg"].values
75-
array([100, 50, 110])
76-
```
77-
Let's now access dad bear: first by his position (2), then by his name "dad"
78-
```
79-
>>> bear_family_df.iloc[2]
80-
leg 110.0
81-
hair 6.0
82-
tail 22.0
83-
belly 80.0
84-
Name: dad, dtype: float64
85-
>>> bear_family_df.loc["dad"]
86-
leg 110.0
87-
hair 6.0
88-
tail 22.0
89-
belly 80.0
90-
Name: dad, dtype: float64
91-
```
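The same two accessors also take a column, so a single cell or a sub-table can be selected in one step. A minimal sketch, rebuilding the toy table so it runs on its own (the variable names for the results are illustrative):

```python
import pandas as pd

# Same toy data as in the examples above.
bear_family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

# Label-based selection: row "dad", column "leg".
dad_leg_by_label = bear_family_df.loc["dad", "leg"]

# Position-based selection: third row (2), first column (0).
dad_leg_by_position = bear_family_df.iloc[2, 0]

# Lists of labels return a smaller DataFrame.
parents_subset = bear_family_df.loc[["mom", "dad"], ["leg", "tail"]]

print(dad_leg_by_label, dad_leg_by_position)
print(parents_subset)
```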
92-
find out which bear has a leg of 110:
93-
```
94-
>>> bear_family_df["leg"] == 110
95-
mom False
96-
baby False
97-
dad True
98-
Name: leg, dtype: bool
99-
```
100-
filter rows
101-
select the bears that have a belly size of 80
102-
```
103-
>>> mask = bear_family_df["belly"] == 80
104-
>>> bears_80 = bear_family_df[mask]
105-
>>> bears_80
106-
leg hair tail belly
107-
mom 100 5.0 20 80
108-
dad 110 6.0 22 80
109-
```
110-
use the operator `~` to select the bears that don't have a belly size of 80
111-
```
112-
>>> bear_family_df[~mask]
113-
leg hair tail belly
114-
baby 50 2.5 10 40
115-
116-
```
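Boolean masks can be combined with `&` (and) and `|` (or), in addition to `~` (not). A minimal sketch on the same toy table; the mask names are illustrative:

```python
import pandas as pd

bear_family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

# One boolean Series per condition.
big_belly = bear_family_df["belly"] == 80
long_leg = bear_family_df["leg"] > 100

# Rows where both conditions hold, and rows where at least one holds.
big_and_long = bear_family_df[big_belly & long_leg]
big_or_long = bear_family_df[big_belly | long_leg]

# The index of a filtered DataFrame gives the names of the matching bears.
names = list(big_and_long.index)
print(names)
```

When the conditions are written inline, each comparison needs its own parentheses, for example `df[(df["belly"] == 80) & (df["leg"] > 100)]`, because `&` binds more tightly than `==`.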
117-
create a new dataframe with 2 new bears
118-
use the same columns as bear_family_df
119-
```
120-
>>> some_bears = pd.DataFrame([[105,4,19,80],[100,5,20,80]], columns = bear_family_df.columns)
121-
>>> some_bears
122-
leg hair tail belly
123-
0 105 4 19 80
124-
1 100 5 20 80
125-
```
126-
assemble the two DataFrames together
127-
```
128-
>>> all_bears = pd.concat([bear_family_df, some_bears])
129-
>>> all_bears
130-
leg hair tail belly
131-
mom 100 5.0 20 80
132-
baby 50 2.5 10 40
133-
dad 110 6.0 22 80
134-
0 105 4.0 19 80
135-
1 100 5.0 20 80
136-
```
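`DataFrame.append` was removed in pandas 2.0; `pd.concat` performs the same assembly. The sketch below also shows `ignore_index=True`, which renumbers the rows instead of keeping the mixed labels (`mom`, `baby`, `dad`, `0`, `1`). The name `all_bears_renumbered` is illustrative.

```python
import pandas as pd

columns = ["leg", "hair", "tail", "belly"]
bear_family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=columns,
)
some_bears = pd.DataFrame([[105, 4, 19, 80], [100, 5, 20, 80]], columns=columns)

# Stack the rows and keep the original row labels.
all_bears = pd.concat([bear_family_df, some_bears])

# Stack the rows and renumber them 0..4.
all_bears_renumbered = pd.concat([bear_family_df, some_bears], ignore_index=True)

print(all_bears)
print(all_bears_renumbered)
```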
137-
In the DataFrame all_bears, the first bear (mom) and the last bear have exactly the same measurements
138-
drop duplicates
139-
```
140-
>>> all_bears = all_bears.drop_duplicates()
141-
>>> all_bears
142-
leg hair tail belly
143-
mom 100 5.0 20 80
144-
baby 50 2.5 10 40
145-
dad 110 6.0 22 80
146-
0 105 4.0 19 80
147-
```
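`drop_duplicates` keeps the first occurrence by default, which is why `mom` stays and row `1` goes. A minimal sketch of the related options, assuming the same five bears; the result names are illustrative:

```python
import pandas as pd

all_bears = pd.DataFrame(
    [
        [100, 5.0, 20, 80],
        [50, 2.5, 10, 40],
        [110, 6.0, 22, 80],
        [105, 4.0, 19, 80],
        [100, 5.0, 20, 80],
    ],
    index=["mom", "baby", "dad", 0, 1],
    columns=["leg", "hair", "tail", "belly"],
)

# True for every row that repeats an earlier row.
dup_mask = all_bears.duplicated()

# Keep the last occurrence instead of the first.
kept_last = all_bears.drop_duplicates(keep="last")

# Compare rows on a subset of the columns only.
unique_belly = all_bears.drop_duplicates(subset=["belly"])

print(dup_mask)
print(kept_last)
print(unique_belly)
```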
148-
get names of columns
149-
```
150-
>>> bear_family_df.columns
151-
Index(['leg', 'hair', 'tail', 'belly'], dtype='object')
152-
```
153-
add a new column to a DataFrame
154-
mom and baby are female, dad is male
155-
```
156-
>>> bear_family_df["sex"] = ["f", "f", "m"]
157-
>>> bear_family_df
158-
leg hair tail belly sex
159-
mom 100 5.0 20 80 f
160-
baby 50 2.5 10 40 f
161-
dad 110 6.0 22 80 m
162-
```
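A new column can also be computed from existing columns instead of being typed in by hand. A minimal sketch; the column name `leg_per_tail` is made up for the illustration:

```python
import pandas as pd

bear_family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

# Element-wise division of two columns, stored as a new column.
bear_family_df["leg_per_tail"] = bear_family_df["leg"] / bear_family_df["tail"]

print(bear_family_df)
```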
163-
get the number of items
164-
```
165-
>>> len(bear_family_df)
166-
3
167-
```
168-
get the distinct values of a column
169-
```
170-
>>> bear_family_df.belly.unique()
171-
array([80, 40])
172-
```
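Two related methods may be useful next to `unique`: `nunique` counts the distinct values and `value_counts` counts how often each one appears. A minimal sketch on the same table:

```python
import pandas as pd

bear_family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

# Number of distinct belly sizes.
n_distinct = bear_family_df["belly"].nunique()

# Number of bears per belly size, most frequent first.
counts = bear_family_df["belly"].value_counts()

print(n_distinct)
print(counts)
```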
173-
read a CSV file with pandas
174-
```
175-
>>> import os
176-
>>> os.getcwd()
177-
'/home/ksator'
178-
>>> data = pd.read_csv("seaborn-data/iris.csv", sep=",")
179-
```
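The path given to `read_csv` is resolved relative to the current working directory, which is why `os.getcwd()` is checked first. To try `read_csv` without a file on disk, the sketch below reads from an in-memory string; the three rows are made-up sample data, not the real iris file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""

# read_csv accepts a path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text), sep=",")

print(data.shape)
print(data.dtypes)
```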
180-
load the titanic dataset
181-
```
182-
>>> import seaborn as sns
183-
>>> titanic = sns.load_dataset('titanic')
184-
```
185-
display the first rows of the DataFrame
186-
```
187-
>>> titanic.head(5)
188-
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
189-
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
190-
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
191-
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
192-
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
193-
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
194-
>>> titanic.age.head(5)
195-
0 22.0
196-
1 38.0
197-
2 26.0
198-
3 35.0
199-
4 35.0
200-
Name: age, dtype: float64
201-
```
202-
display the last rows of the DataFrame
203-
```
204-
>>> titanic.tail(5)
205-
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
206-
886 0 2 male 27.0 0 0 13.00 S Second man True NaN Southampton no True
207-
887 1 1 female 19.0 0 0 30.00 S First woman False B Southampton yes True
208-
888 0 3 female NaN 1 2 23.45 S Third woman False NaN Southampton no False
209-
889 1 1 male 26.0 0 0 30.00 C First man True C Cherbourg yes True
210-
890 0 3 male 32.0 0 0 7.75 Q Third man True NaN Queenstown no True
211-
>>> titanic.age.tail(5)
212-
886 27.0
213-
887 19.0
214-
888 NaN
215-
889 26.0
216-
890 32.0
217-
Name: age, dtype: float64
218-
```
219-
220-
return the unique values present in a column
221-
```
222-
>>> titanic.age.unique()
223-
array([22. , 38. , 26. , 35. , nan, 54. , 2. , 27. , 14. ,
224-
4. , 58. , 20. , 39. , 55. , 31. , 34. , 15. , 28. ,
225-
8. , 19. , 40. , 66. , 42. , 21. , 18. , 3. , 7. ,
226-
49. , 29. , 65. , 28.5 , 5. , 11. , 45. , 17. , 32. ,
227-
16. , 25. , 0.83, 30. , 33. , 23. , 24. , 46. , 59. ,
228-
71. , 37. , 47. , 14.5 , 70.5 , 32.5 , 12. , 9. , 36.5 ,
229-
51. , 55.5 , 40.5 , 44. , 1. , 61. , 56. , 50. , 36. ,
230-
45.5 , 20.5 , 62. , 41. , 52. , 63. , 23.5 , 0.92, 43. ,
231-
60. , 10. , 64. , 13. , 48. , 0.75, 53. , 57. , 80. ,
232-
70. , 24.5 , 6. , 0.67, 30.5 , 0.42, 34.5 , 74. ])
233-
```
234-
235-
The method `describe` provides various statistics (average, maximum, minimum, etc.) on the data in each column
236-
```
237-
>>> titanic.describe(include="all")
238-
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
239-
count 891.000000 891.000000 891 714.000000 891.000000 891.000000 891.000000 889 891 891 891 203 889 891 891
240-
unique NaN NaN 2 NaN NaN NaN NaN 3 3 3 2 7 3 2 2
241-
top NaN NaN male NaN NaN NaN NaN S Third man True C Southampton no True
242-
freq NaN NaN 577 NaN NaN NaN NaN 644 491 537 537 59 644 549 537
243-
mean 0.383838 2.308642 NaN 29.699118 0.523008 0.381594 32.204208 NaN NaN NaN NaN NaN NaN NaN NaN
244-
std 0.486592 0.836071 NaN 14.526497 1.102743 0.806057 49.693429 NaN NaN NaN NaN NaN NaN NaN NaN
245-
min 0.000000 1.000000 NaN 0.420000 0.000000 0.000000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN
246-
25% 0.000000 2.000000 NaN 20.125000 0.000000 0.000000 7.910400 NaN NaN NaN NaN NaN NaN NaN NaN
247-
50% 0.000000 3.000000 NaN 28.000000 0.000000 0.000000 14.454200 NaN NaN NaN NaN NaN NaN NaN NaN
248-
75% 1.000000 3.000000 NaN 38.000000 1.000000 0.000000 31.000000 NaN NaN NaN NaN NaN NaN NaN NaN
249-
max 1.000000 3.000000 NaN 80.000000 8.000000 6.000000 512.329200 NaN NaN NaN NaN NaN NaN NaN NaN
250-
```
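Without `include="all"`, `describe` only reports on the numeric columns. A minimal sketch of the difference on a small made-up table (the values are illustrative, not taken from the titanic dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [22.0, 38.0, np.nan, 35.0],
        "sex": ["male", "female", "female", "male"],
    }
)

# Default: numeric columns only; NaN is ignored in the count.
numeric_stats = df.describe()

# include="all": text columns get count, unique, top and freq.
all_stats = df.describe(include="all")

print(numeric_stats)
print(all_stats)
```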
251-
NaN stands for Not a Number; pandas uses it to mark missing values
252-
```
253-
>>> titanic.age.head(10)
254-
0 22.0
255-
1 38.0
256-
2 26.0
257-
3 35.0
258-
4 35.0
259-
5 NaN
260-
6 54.0
261-
7 2.0
262-
8 27.0
263-
9 14.0
264-
Name: age, dtype: float64
265-
```
266-
use the fillna method to replace NaN with other values
267-
This returns a DataFrame where all NaN in the age column have been replaced by 0.
268-
```
269-
>>> titanic.fillna(value={"age": 0}).age.head(10)
270-
0 22.0
271-
1 38.0
272-
2 26.0
273-
3 35.0
274-
4 35.0
275-
5 0.0
276-
6 54.0
277-
7 2.0
278-
8 27.0
279-
9 14.0
280-
Name: age, dtype: float64
281-
```
282-
This returns a DataFrame where each NaN has been replaced with the previous valid value in its column (shown here for the age column)
283-
```
284-
>>> titanic.ffill().age.head(10)
285-
0 22.0
286-
1 38.0
287-
2 26.0
288-
3 35.0
289-
4 35.0
290-
5 35.0
291-
6 54.0
292-
7 2.0
293-
8 27.0
294-
9 14.0
295-
Name: age, dtype: float64
296-
```
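The filling strategies can be compared side by side on a small Series. A minimal sketch with made-up ages; `ffill` is the forward-fill method, and filling with the mean is shown as one more common option:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

# Replace every NaN with a constant.
filled_zero = ages.fillna(0)

# Replace every NaN with the previous valid value (forward fill).
filled_forward = ages.ffill()

# Replace every NaN with the mean of the valid values.
filled_mean = ages.fillna(ages.mean())

print(filled_zero.tolist())
print(filled_forward.tolist())
print(filled_mean.tolist())
```

Each call returns a new Series; `ages` itself still contains its two NaN values.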
297-
use the dropna method to delete the rows or columns that contain NaN
298-
By default, it deletes the rows that contain NaN
299-
```
300-
>>>
301-
>>> titanic.dropna().head(10)
302-
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
303-
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
304-
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
305-
6 0 1 male 54.0 0 0 51.8625 S First man True E Southampton no True
306-
10 1 3 female 4.0 1 1 16.7000 S Third child False G Southampton yes False
307-
11 1 1 female 58.0 0 0 26.5500 S First woman False C Southampton yes True
308-
21 1 2 male 34.0 0 0 13.0000 S Second man True D Southampton yes True
309-
23 1 1 male 28.0 0 0 35.5000 S First man True A Southampton yes True
310-
27 0 1 male 19.0 3 2 263.0000 S First man True C Southampton no False
311-
52 1 1 female 49.0 1 0 76.7292 C First woman False D Cherbourg yes False
312-
54 0 1 male 65.0 0 1 61.9792 C First man True B Cherbourg no False
313-
```
314-
we can also delete the columns that contain NaN
315-
```
316-
>>> titanic.dropna(axis="columns").head()
317-
survived pclass sex sibsp parch fare class who adult_male alive alone
318-
0 0 3 male 1 0 7.2500 Third man True no False
319-
1 1 1 female 1 0 71.2833 First woman False yes False
320-
2 1 3 female 0 0 7.9250 Third woman False yes True
321-
3 1 1 female 1 0 53.1000 First woman False yes False
322-
4 0 3 male 0 0 8.0500 Third man True no True
323-
```
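Dropping every row that has a NaN anywhere can discard a lot of data, as the jump in index values above suggests. A minimal sketch of a narrower alternative, `subset`, which only looks at the listed columns (the small table is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 26.0],
        "deck": [np.nan, "C", np.nan],
        "fare": [7.25, 71.28, 7.92],
    }
)

# Default: drop rows with a NaN in any column (here, every row has one).
any_nan = df.dropna()

# Drop only the rows where "age" is missing.
only_age = df.dropna(subset=["age"])

# Drop the columns that contain a NaN.
complete_cols = df.dropna(axis="columns")

print(len(any_nan))
print(only_age)
print(complete_cols)
```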
324-
rename a column
325-
```
326-
>>> titanic.rename(columns={"sex":"gender"}).head(5)
327-
survived pclass gender age sibsp parch fare embarked class who adult_male deck embark_town alive alone
328-
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
329-
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
330-
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
331-
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
332-
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
333-
```
334-
335-
delete the row with an index equal to 0
336-
```
337-
>>> titanic.drop(0)
338-
```
339-
delete the column "age"
340-
```
341-
>>> titanic.drop(columns=["age"])
342-
```
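`rename` and `drop` return a new DataFrame and leave the original untouched, so the result has to be assigned to a name to be kept. A minimal sketch on a small made-up table:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female"], "age": [22.0, 38.0]})

# Each call returns a modified copy.
renamed = df.rename(columns={"sex": "gender"})
without_age = df.drop(columns=["age"])
without_first = df.drop(0)

# The original DataFrame still has both columns and both rows.
print(df)
print(renamed)
print(without_age)
print(without_first)
```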
343-
see the distribution of survivors by gender and ticket type
344-
the column survived uses 0s and 1s (0 means died and 1 means survived)
345-
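the result is an average (pivot_table aggregates with the mean by default), so each cell is a survival rate
346-
so 50% of females in third class survived and 50% died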
347-
```
348-
>>> titanic.pivot_table('survived', index='sex', columns='class')
349-
class First Second Third
350-
sex
351-
female 0.968085 0.921053 0.500000
352-
male 0.368852 0.157407 0.135447
353-
```
354-
get the total number of survivors in each case
355-
the column survived uses 0s and 1s
356-
let's use the sum function
357-
```
358-
>>> titanic.pivot_table('survived', index='sex', columns='class', aggfunc="sum")
359-
class First Second Third
360-
sex
361-
female 91 70 72
362-
male 45 17 47
363-
```
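A pivot table of this kind is a group-by followed by a reshape. A minimal sketch on six made-up passengers, showing that `pivot_table` and `groupby(...).mean().unstack()` give the same numbers:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sex": ["female", "female", "male", "male", "male", "female"],
        "class": ["First", "Third", "First", "Third", "Third", "Third"],
        "survived": [1, 0, 0, 0, 1, 1],
    }
)

# Mean of "survived" per (sex, class): the survival rate.
rate = df.pivot_table("survived", index="sex", columns="class")

# The same computation spelled out with groupby.
same_rate = df.groupby(["sex", "class"])["survived"].mean().unstack()

# Sum of "survived" per (sex, class): the number of survivors.
count = df.pivot_table("survived", index="sex", columns="class", aggfunc="sum")

print(rate)
print(same_rate)
print(count)
```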
364-
remove the rows with NaN
365-
group the ages into two categories: (0, 18] and (18, 80]
366-
use the cut function to segment data values
367-
```
368-
>>> titanic.dropna(inplace=True)
369-
>>> age = pd.cut(titanic['age'], [0, 18, 80])
370-
>>> titanic.pivot_table('survived', ['sex', age], 'class')
371-
class First Second Third
372-
sex age
373-
female (0, 18] 0.909091 1.000000 0.500000
374-
(18, 80] 0.968254 0.875000 0.666667
375-
male (0, 18] 0.800000 1.000000 1.000000
376-
(18, 80] 0.397436 0.333333 0.250000
377-
```
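The bins passed to `cut` are closed on the right by default, so an age of exactly 18 falls in `(0, 18]`. A minimal sketch on a few made-up ages; the labels `child` and `adult` are illustrative names for the two bins:

```python
import pandas as pd

ages = pd.Series([2, 18, 19, 35, 80])

# Two bins: (0, 18] and (18, 80].
groups = pd.cut(ages, [0, 18, 80])

# Same bins, with readable labels instead of intervals.
labelled = pd.cut(ages, [0, 18, 80], labels=["child", "adult"])

print(groups)
print(labelled)
```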
37824

37925

38026
# iris flowers classification
