@@ -3,8 +3,6 @@ Please visit the [wiki](https://github.com/ksator/Machine_Learning_with_Python/w
# Documentation structure
- - [Introduction to arrays using numpy](#introduction-to-arrays-using-numpy)
- - [visualize a dataset using seaborn](#visualize-a-dataset-using-seaborn)
- [manipulate dataset with pandas](#manipulate-dataset-with-pandas)
- [iris flowers classification](#iris-flowers-classification)
- [iris flowers data set](#iris-flowers-data-set)
@@ -23,358 +21,6 @@ Please visit the [wiki](https://github.com/ksator/Machine_Learning_with_Python/w
- # manipulate dataset with pandas
- Pandas is a Python library for data manipulation.
-
- ```
- >>> import pandas as pd
- ```
- ```
- >>> bear_family = [
- ...     [100, 5  , 20, 80],
- ...     [50 , 2.5, 10, 40],
- ...     [110, 6  , 22, 80]]
- >>> bear_family
- [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]]
- >>> type(bear_family)
- <class 'list'>
- ```
- Use the DataFrame class to turn the list into a table:
- ```
- >>> bear_family_df = pd.DataFrame(bear_family)
- >>> type(bear_family_df)
- <class 'pandas.core.frame.DataFrame'>
- >>> bear_family_df
-      0    1   2   3
- 0  100  5.0  20  80
- 1   50  2.5  10  40
- 2  110  6.0  22  80
- ```
- We can specify row names (the index) and column names:
- ```
- >>> bear_family_df = pd.DataFrame(bear_family, index = ['mom', 'baby', 'dad'], columns = ['leg', 'hair','tail', 'belly'])
- >>> bear_family_df
-       leg  hair  tail  belly
- mom   100   5.0    20     80
- baby   50   2.5    10     40
- dad   110   6.0    22     80
- ```
- Access the leg column of the table, either as an attribute or by name; `.values` returns the underlying NumPy array:
- ```
- >>> bear_family_df.leg
- mom     100
- baby     50
- dad     110
- Name: leg, dtype: int64
- >>> bear_family_df["leg"]
- mom     100
- baby     50
- dad     110
- Name: leg, dtype: int64
- >>> bear_family_df["leg"].values
- array([100,  50, 110])
- ```
- Let's now access dad bear: first by his position (2) with `iloc`, then by his name "dad" with `loc`:
- ```
- >>> bear_family_df.iloc[2]
- leg      110.0
- hair       6.0
- tail      22.0
- belly     80.0
- Name: dad, dtype: float64
- >>> bear_family_df.loc["dad"]
- leg      110.0
- hair       6.0
- tail      22.0
- belly     80.0
- Name: dad, dtype: float64
- ```
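As a complement to the row lookups above, here is a minimal sketch of selecting a single cell or a sub-table with `loc` and `iloc`. The variable names (`bears`, `dad_leg`, `parents`) are illustrative and not part of the original text; the measurements are the ones used above.

```python
import pandas as pd

# Same measurements as the bear family above (names are illustrative).
bears = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

# Label-based lookup: row "dad", column "leg".
dad_leg = bears.loc["dad", "leg"]

# Position-based lookup: third row (2), first column (0).
dad_leg_by_position = bears.iloc[2, 0]

# Lists of labels select several rows and columns and return a DataFrame.
parents = bears.loc[["mom", "dad"], ["leg", "belly"]]

print(dad_leg, dad_leg_by_position)
print(parents)
```

Selecting a single cell returns a scalar, so the float conversion seen when a whole mixed-type row is returned does not apply here.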
- Find out which bear has a leg of 110:
- ```
- >>> bear_family_df["leg"] == 110
- mom     False
- baby    False
- dad      True
- Name: leg, dtype: bool
- ```
- Filter rows:
- select the bears that have a belly size of 80
- ```
- >>> mask = bear_family_df["belly"] == 80
- >>> bears_80 = bear_family_df[mask]
- >>> bears_80
-      leg  hair  tail  belly
- mom  100   5.0    20     80
- dad  110   6.0    22     80
- ```
- Use the operator `~` to select the bears that don't have a belly size of 80:
- ```
- >>> bear_family_df[~mask]
-       leg  hair  tail  belly
- baby   50   2.5    10     40
- ```
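Boolean masks like the ones above can also be combined. The following is a small sketch, assuming the same bear measurements; the names `big_belly`, `long_leg`, `both` and `either` are made up for the example.

```python
import pandas as pd

bears = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

big_belly = bears["belly"] == 80
long_leg = bears["leg"] > 100

# & means "and": belly of 80 AND leg longer than 100.
both = bears[big_belly & long_leg]

# | means "or". Inline conditions need parentheses because
# & and | bind more tightly than == and >.
either = bears[(bears["belly"] == 40) | (bears["leg"] > 100)]

print(both)
print(either)
```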
- Create a new DataFrame with 2 new bears,
- using the same columns as bear_family_df:
- ```
- >>> some_bears = pd.DataFrame([[105,4,19,80],[100,5,20,80]], columns = bear_family_df.columns)
- >>> some_bears
-    leg  hair  tail  belly
- 0  105     4    19     80
- 1  100     5    20     80
- ```
- Assemble the two DataFrames together with `pd.concat` (`DataFrame.append` was removed in pandas 2.0):
- ```
- >>> all_bears = pd.concat([bear_family_df, some_bears])
- >>> all_bears
-       leg  hair  tail  belly
- mom   100   5.0    20     80
- baby   50   2.5    10     40
- dad   110   6.0    22     80
- 0     105   4.0    19     80
- 1     100   5.0    20     80
- ```
- In the DataFrame all_bears, the first bear (mom) and the last bear have exactly the same measurements.
- Drop the duplicates (the first occurrence is kept):
- ```
- >>> all_bears = all_bears.drop_duplicates()
- >>> all_bears
-       leg  hair  tail  belly
- mom   100   5.0    20     80
- baby   50   2.5    10     40
- dad   110   6.0    22     80
- 0     105   4.0    19     80
- ```
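The sketch below revisits the two steps above (assembling, then dropping duplicates) with two optional parameters, `ignore_index` and `keep`. It is only an illustration under the same bear measurements; `family`, `others`, `unique_bears` and `renumbered` are hypothetical names.

```python
import pandas as pd

family = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)
others = pd.DataFrame([[105, 4, 19, 80], [100, 5, 20, 80]], columns=family.columns)

# Stack the rows of both tables; the original row labels are kept.
all_bears = pd.concat([family, others])

# By default the first occurrence of a duplicated row is kept ("mom").
unique_bears = all_bears.drop_duplicates()

# ignore_index=True renumbers the rows 0..4; keep="last" keeps the
# later copy of a duplicated row instead of the earlier one.
renumbered = pd.concat([family, others], ignore_index=True).drop_duplicates(keep="last")

print(unique_bears)
print(renumbered)
```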
- Get the names of the columns:
- ```
- >>> bear_family_df.columns
- Index(['leg', 'hair', 'tail', 'belly'], dtype='object')
- ```
- Add a new column to a DataFrame:
- mom and baby are female, dad is male
- ```
- >>> bear_family_df["sex"] = ["f", "f", "m"]
- >>> bear_family_df
-       leg  hair  tail  belly  sex
- mom   100   5.0    20     80    f
- baby   50   2.5    10     40    f
- dad   110   6.0    22     80    m
- ```
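A new column can also be computed from existing ones. This is a minimal sketch; the column names `leg_to_tail` and `is_adult`, and the threshold of 100, are invented for the example.

```python
import pandas as pd

bears = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["leg", "hair", "tail", "belly"],
)

# Arithmetic between columns is applied row by row.
bears["leg_to_tail"] = bears["leg"] / bears["tail"]

# assign returns a new DataFrame with the extra column,
# which is convenient when chaining operations.
bears = bears.assign(is_adult=bears["leg"] >= 100)

print(bears)
```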
- Get the number of rows:
- ```
- >>> len(bear_family_df)
- 3
- ```
- Get the distinct values of a column:
- ```
- >>> bear_family_df.belly.unique()
- array([80, 40])
- ```
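Alongside `unique`, pandas can count the distinct values or report how often each one occurs. A short sketch, using a hand-built Series with the belly sizes from above (the name `belly` is illustrative):

```python
import pandas as pd

belly = pd.Series([80, 40, 80], index=["mom", "baby", "dad"], name="belly")

# Distinct values, in order of first appearance.
distinct = belly.unique()

# Number of distinct values.
how_many = belly.nunique()

# How many rows carry each value, most frequent first.
counts = belly.value_counts()

print(distinct, how_many)
print(counts)
```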
- Read a CSV file with pandas (the relative path is resolved from the current working directory):
- ```
- >>> import os
- >>> os.getcwd()
- '/home/ksator'
- >>> data = pd.read_csv("seaborn-data/iris.csv", sep=",")
- ```
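To try `read_csv` without a file on disk, the text can be read from an in-memory buffer. This is only a sketch: the three rows below were typed by hand for illustration, and the column layout is assumed to match the iris file mentioned above.

```python
import io

import pandas as pd

# Hand-typed sample in an assumed iris-like layout (not the real file).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts a path or any file-like object; the first line
# is used as the header and column types are inferred.
data = pd.read_csv(io.StringIO(csv_text), sep=",")

print(data.shape)
print(data.dtypes)
```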
- Load the titanic dataset with seaborn:
- ```
- >>> import seaborn as sns
- >>> titanic = sns.load_dataset('titanic')
- ```
- Display the first rows of the DataFrame:
- ```
- >>> titanic.head(5)
- survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
- 0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
- 1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
- 2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
- 3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
- 4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
- >>> titanic.age.head(5)
- 0    22.0
- 1    38.0
- 2    26.0
- 3    35.0
- 4    35.0
- Name: age, dtype: float64
- ```
- Display the last rows of the DataFrame:
- ```
- >>> titanic.tail(5)
- survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
- 886 0 2 male 27.0 0 0 13.00 S Second man True NaN Southampton no True
- 887 1 1 female 19.0 0 0 30.00 S First woman False B Southampton yes True
- 888 0 3 female NaN 1 2 23.45 S Third woman False NaN Southampton no False
- 889 1 1 male 26.0 0 0 30.00 C First man True C Cherbourg yes True
- 890 0 3 male 32.0 0 0 7.75 Q Third man True NaN Queenstown no True
- >>> titanic.age.tail(5)
- 886    27.0
- 887    19.0
- 888     NaN
- 889    26.0
- 890    32.0
- Name: age, dtype: float64
- ```
-
- The method `unique` returns the distinct values present in a pandas data structure:
- ```
- >>> titanic.age.unique()
- array([22. , 38. , 26. , 35. , nan, 54. , 2. , 27. , 14. ,
- 4. , 58. , 20. , 39. , 55. , 31. , 34. , 15. , 28. ,
- 8. , 19. , 40. , 66. , 42. , 21. , 18. , 3. , 7. ,
- 49. , 29. , 65. , 28.5 , 5. , 11. , 45. , 17. , 32. ,
- 16. , 25. , 0.83, 30. , 33. , 23. , 24. , 46. , 59. ,
- 71. , 37. , 47. , 14.5 , 70.5 , 32.5 , 12. , 9. , 36.5 ,
- 51. , 55.5 , 40.5 , 44. , 1. , 61. , 56. , 50. , 36. ,
- 45.5 , 20.5 , 62. , 41. , 52. , 63. , 23.5 , 0.92, 43. ,
- 60. , 10. , 64. , 13. , 48. , 0.75, 53. , 57. , 80. ,
- 70. , 24.5 , 6. , 0.67, 30.5 , 0.42, 34.5 , 74. ])
- ```
-
- The method `describe` provides various statistics (average, maximum, minimum, etc.) on the data in each column:
- ```
- >>> titanic.describe(include="all")
- survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
- count 891.000000 891.000000 891 714.000000 891.000000 891.000000 891.000000 889 891 891 891 203 889 891 891
- unique NaN NaN 2 NaN NaN NaN NaN 3 3 3 2 7 3 2 2
- top NaN NaN male NaN NaN NaN NaN S Third man True C Southampton no True
- freq NaN NaN 577 NaN NaN NaN NaN 644 491 537 537 59 644 549 537
- mean 0.383838 2.308642 NaN 29.699118 0.523008 0.381594 32.204208 NaN NaN NaN NaN NaN NaN NaN NaN
- std 0.486592 0.836071 NaN 14.526497 1.102743 0.806057 49.693429 NaN NaN NaN NaN NaN NaN NaN NaN
- min 0.000000 1.000000 NaN 0.420000 0.000000 0.000000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN
- 25% 0.000000 2.000000 NaN 20.125000 0.000000 0.000000 7.910400 NaN NaN NaN NaN NaN NaN NaN NaN
- 50% 0.000000 3.000000 NaN 28.000000 0.000000 0.000000 14.454200 NaN NaN NaN NaN NaN NaN NaN NaN
- 75% 1.000000 3.000000 NaN 38.000000 1.000000 0.000000 31.000000 NaN NaN NaN NaN NaN NaN NaN NaN
- max 1.000000 3.000000 NaN 80.000000 8.000000 6.000000 512.329200 NaN NaN NaN NaN NaN NaN NaN NaN
- ```
- NaN stands for Not a Number; pandas uses it to mark missing values:
- ```
- >>> titanic.age.head(10)
- 0    22.0
- 1    38.0
- 2    26.0
- 3    35.0
- 4    35.0
- 5     NaN
- 6    54.0
- 7     2.0
- 8    27.0
- 9    14.0
- Name: age, dtype: float64
- ```
- Use the fillna method to replace NaN with other values.
- This returns a DataFrame where all NaN in the age column have been replaced by 0:
- ```
- >>> titanic.fillna(value={"age": 0}).age.head(10)
- 0    22.0
- 1    38.0
- 2    26.0
- 3    35.0
- 4    35.0
- 5     0.0
- 6    54.0
- 7     2.0
- 8    27.0
- 9    14.0
- Name: age, dtype: float64
- ```
- This returns a DataFrame where each NaN has been replaced with the previous valid value in its column (forward fill):
- ```
- >>> titanic.ffill().age.head(10)
- 0    22.0
- 1    38.0
- 2    26.0
- 3    35.0
- 4    35.0
- 5    35.0
- 6    54.0
- 7     2.0
- 8    27.0
- 9    14.0
- Name: age, dtype: float64
- ```
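The filling strategies above can be compared side by side on a tiny table. This is a sketch on made-up ages rather than the titanic data; filling with the column mean is an extra option shown for illustration, not something the original text uses.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values.
ages = pd.DataFrame({"age": [22.0, np.nan, 26.0, np.nan, 36.0]})

# Replace missing ages with a constant.
with_zero = ages.fillna(value={"age": 0})

# Replace each missing age with the previous valid one (forward fill).
with_previous = ages.ffill()

# Replace missing ages with the mean of the known ages (28.0 here).
with_mean = ages.fillna(value={"age": ages["age"].mean()})

print(with_zero["age"].tolist())
print(with_previous["age"].tolist())
print(with_mean["age"].tolist())
```

None of these calls modifies `ages`; each one returns a new DataFrame.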
- Use the dropna method to delete rows or columns that contain NaN.
- By default, it deletes the rows that contain NaN:
- ```
- >>> titanic.dropna().head(10)
- survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
- 1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
- 3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
- 6 0 1 male 54.0 0 0 51.8625 S First man True E Southampton no True
- 10 1 3 female 4.0 1 1 16.7000 S Third child False G Southampton yes False
- 11 1 1 female 58.0 0 0 26.5500 S First woman False C Southampton yes True
- 21 1 2 male 34.0 0 0 13.0000 S Second man True D Southampton yes True
- 23 1 1 male 28.0 0 0 35.5000 S First man True A Southampton yes True
- 27 0 1 male 19.0 3 2 263.0000 S First man True C Southampton no False
- 52 1 1 female 49.0 1 0 76.7292 C First woman False D Cherbourg yes False
- 54 0 1 male 65.0 0 1 61.9792 C First man True B Cherbourg no False
- ```
- We can also delete the columns that contain NaN:
- ```
- >>> titanic.dropna(axis="columns").head()
- survived pclass sex sibsp parch fare class who adult_male alive alone
- 0 0 3 male 1 0 7.2500 Third man True no False
- 1 1 1 female 1 0 71.2833 First woman False yes False
- 2 1 3 female 0 0 7.9250 Third woman False yes True
- 3 1 1 female 1 0 53.1000 First woman False yes False
- 4 0 3 male 0 0 8.0500 Third man True no True
- ```
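Dropping every row with any NaN can discard a lot of data. As a sketch of a narrower option, the `subset` parameter restricts the check to chosen columns. The small table and its values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: age and deck are sometimes unknown.
df = pd.DataFrame(
    {
        "age": [22.0, 38.0, np.nan],
        "deck": [np.nan, "C", np.nan],
        "fare": [7.25, 71.28, 8.05],
    }
)

# Default: drop every row that has at least one NaN.
complete_rows = df.dropna()

# Only look at the age column when deciding which rows to drop.
rows_with_age = df.dropna(subset=["age"])

# Drop the columns that contain at least one NaN.
complete_columns = df.dropna(axis="columns")

print(complete_rows)
print(rows_with_age)
print(complete_columns)
```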
- Rename a column:
- ```
- >>> titanic.rename(columns={"sex":"gender"}).head(5)
- survived pclass gender age sibsp parch fare embarked class who adult_male deck embark_town alive alone
- 0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
- 1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
- 2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
- 3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
- 4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
- ```
-
- Delete the row with an index equal to 0 (this returns a new DataFrame and leaves titanic unchanged):
- ```
- >>> titanic.drop(0)
- ```
- Delete the column "age" (again, a new DataFrame is returned):
- ```
- >>> titanic.drop(columns=["age"])
- ```
- See the survival rate by gender and ticket class.
- The column survived uses 0s and 1s (0 means died and 1 means survived),
- and pivot_table averages the values by default, so each cell is the fraction of survivors:
- for example, 50% of females in third class survived (and 50% died).
- ```
- >>> titanic.pivot_table('survived', index='sex', columns='class')
- class      First    Second     Third
- sex
- female  0.968085  0.921053  0.500000
- male    0.368852  0.157407  0.135447
- ```
- Get the total number of survivors in each case.
- Since the column survived uses 0s and 1s, let's use the sum function:
- ```
- >>> titanic.pivot_table('survived', index='sex', columns='class', aggfunc="sum")
- class   First  Second  Third
- sex
- female     91      70     72
- male       45      17     47
- ```
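To see what the two pivot tables above compute, the sketch below runs them on six made-up passengers and compares the default result with an explicit group-by and mean. The data and the names `passengers`, `rate`, `count` and `same_rate` are hypothetical.

```python
import pandas as pd

# Six hypothetical passengers (1 = survived, 0 = died).
passengers = pd.DataFrame(
    {
        "sex": ["female", "female", "male", "male", "male", "female"],
        "class": ["First", "Third", "First", "Third", "Third", "Third"],
        "survived": [1, 0, 0, 0, 1, 1],
    }
)

# Default aggregation is the mean, i.e. the survival rate per cell.
rate = passengers.pivot_table("survived", index="sex", columns="class")

# aggfunc="sum" counts the survivors per cell instead.
count = passengers.pivot_table("survived", index="sex", columns="class", aggfunc="sum")

# The same rates, written as a group-by followed by unstack.
same_rate = passengers.groupby(["sex", "class"])["survived"].mean().unstack()

print(rate)
print(count)
```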
- Remove the rows with NaN,
- then group the ages into two categories, (0, 18] and (18, 80],
- using the cut function to segment data values:
- ```
- >>> titanic.dropna(inplace=True)
- >>> age = pd.cut(titanic['age'], [0, 18, 80])
- >>> titanic.pivot_table('survived', ['sex', age], 'class')
- class               First    Second     Third
- sex    age
- female (0, 18]   0.909091  1.000000  0.500000
-        (18, 80]  0.968254  0.875000  0.666667
- male   (0, 18]   0.800000  1.000000  1.000000
-        (18, 80]  0.397436  0.333333  0.250000
- ```
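The binning done by `cut` above can be checked on a handful of values. A minimal sketch with made-up ages; the labels "minor" and "adult" are illustrative and not from the original text.

```python
import pandas as pd

ages = pd.Series([5, 18, 19, 40, 80])

# Bin edges 0, 18, 80 give two intervals: (0, 18] and (18, 80].
# Intervals are closed on the right by default, so 18 falls in (0, 18].
groups = pd.cut(ages, [0, 18, 80])

# Optional human-readable labels, one per interval.
labelled = pd.cut(ages, [0, 18, 80], labels=["minor", "adult"])

print(groups)
print(labelled)
```

Values outside the outermost edges (for example an age of 0 or 81) would become NaN, which is worth checking before pivoting on the result.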
# iris flowers classification