@@ -35,11 +35,13 @@ import qualified Data.Text.IO as TIO
df <- D.readCsv "../dataframe/data/housing.csv"

TIO.putStrLn $ D.toMarkdownTable $ D.frequencies "ocean_proximity" df
+
```
- > | Statistic<br>Text | <1H OCEAN<br>Any | INLAND<br>Any | ISLAND<br>Any | NEAR BAY<br>Any | NEAR OCEAN<br>Any |
- > | ------------------ | ------------------ | --------------- | --------------- | ----------------- | ------------------ |
- > | Count | 9136 | 6551 | 5 | 2290 | 2658 |
- > | Percentage (%) | 44.26% | 31.74% | 0.02% | 11.09% | 12.88% |
+
+ > | Statistic<br>Text | ocean_proximity<br>Any |
+ > | ------------------ | ----------------------- |
+ > | Count | 20640 |
+ > | Percentage (%) | 100.00% |


We can also build similar tables for non-categorical data that takes only a small set of values, e.g. shoe sizes.
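To make the idea concrete, here is a minimal sketch in plain Haskell of what such a frequency table computes for a numeric column with few distinct values. It does not use the dataframe library; `frequencyTable` and `shoeSizes` are hypothetical names, and the shoe sizes are made up for illustration.

```haskell
import Data.List (group, sort)

-- | Count how often each distinct value occurs and what share of the
-- whole it represents. Returns (value, count, percentage) triples in
-- ascending order of value.
frequencyTable :: Ord a => [a] -> [(a, Int, Double)]
frequencyTable xs =
    [ (v, count, 100 * fromIntegral count / total)
    | g@(v : _) <- group (sort xs)
    , let count = length g
    ]
  where
    total = fromIntegral (length xs) :: Double

-- Hypothetical shoe sizes, made up for illustration.
shoeSizes :: [Int]
shoeSizes = [7, 8, 8, 9, 8]
```

Evaluating `frequencyTable shoeSizes` gives `[(7,1,20.0),(8,3,60.0),(9,1,20.0)]`: one row per distinct size with its count and percentage.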
@@ -61,7 +63,9 @@ Arguably the first thing to do when presented with a datset is check for null va

```haskell
TIO.putStrLn $ D.toMarkdownTable $ D.describeColumns df
+
```
+
> | Column Name<br>Text | # Non-null Values<br>Int | # Null Values<br>Int | Type<br>Text |
> | -------------------- | ------------------------ | -------------------- | ------------- |
> | total_bedrooms | 20433 | 207 | Maybe Double |
@@ -90,7 +94,9 @@ import qualified DataFrame.Functions as F
D.mean (F.col @Double "housing_median_age") df

D.median (F.col @Double "housing_median_age") df
+
```
+
> 28.639486434108527
> 29.0

@@ -115,7 +121,9 @@ TIO.putStrLn $ D.toMarkdownTable $
    df |> D.derive "deviation" (abs (median_house_value - (F.mean median_house_value)))
       |> D.select ["median_house_value", "deviation"]
       |> D.take 10
+
```
+
> | median_house_value<br>Double | deviation<br>Double |
> | ----------------------------- | -------------------- |
> | 452600.0 | 245744.18309108526 |
@@ -143,7 +151,9 @@ From the small sample it does seem like there are some wild deviations. The firs
df |> D.derive "deviation" (abs (median_house_value - (F.mean median_house_value)))
   |> D.select ["median_house_value", "deviation"]
   |> D.mean (F.col @Double "deviation")
+
```
+
> 91170.43994367118


@@ -172,7 +182,9 @@ sumOfSqureDifferences = withDeviation |> D.derive "deviation^2" (F.pow deviation
n = fromIntegral (fst (D.dimensions df) - 1)

sqrt (sumOfSqureDifferences / n)
+
```
+
> 2765.8049483764235

The standard deviation being larger than the mean absolute deviation suggests that we do have some outliers. However, since the difference is fairly small, we can conclude that there aren't many outliers in our dataset.
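The comparison between the two measures can be sketched on a plain list, independent of the dataframe API. This is a minimal illustration that assumes the mean absolute deviation is taken about the mean and the standard deviation uses the sample (n - 1) denominator, as in the snippets above; the function names and the sample values are made up.

```haskell
-- | Arithmetic mean of a non-empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- | Mean absolute deviation about the mean.
meanAbsoluteDeviation :: [Double] -> Double
meanAbsoluteDeviation xs = mean [abs (x - m) | x <- xs]
  where
    m = mean xs

-- | Sample standard deviation (divides by n - 1; needs at least two values).
sampleStdDev :: [Double] -> Double
sampleStdDev xs =
    sqrt (sum [(x - m) ^ (2 :: Int) | x <- xs] / fromIntegral (length xs - 1))
  where
    m = mean xs

-- Made-up values: four small numbers and one large outlier.
withOutlier :: [Double]
withOutlier = [1, 2, 3, 4, 100]
```

For `withOutlier` the mean absolute deviation is 31.2, while the sample standard deviation is roughly 43.6. Squaring the deviations gives the single outlier more weight, which is why the gap between the two measures grows as outliers become more extreme.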
@@ -182,7 +194,9 @@ We can calculate the standard deviation in one line as follows:

```haskell
D.standardDeviation (F.col @Double "median_house_value") df
+
```
+
> 115395.61587441359


@@ -196,7 +210,9 @@ For our dataset:

```haskell
D.interQuartileRange (F.col @Double "median_house_value") df
+
```
+
> 145125.0


@@ -210,7 +226,9 @@ In our example it's a very large number:

```haskell
D.variance (F.col @Double "median_house_value") df
+
```
+
> 1.3316148163035213e10


@@ -228,7 +246,9 @@ A skewness score between -0.5 and 0.5 means the data has little skew. A score be

```haskell
D.skewness (F.col @Double "median_house_value") df
+
```
+
> 0.977668529406543

So the median house value is moderately skewed to the right (positive skew). That is, most houses are cheaper than the mean value, and there is a long tail of expensive outliers. Having lived in California, I can confirm that this data reflects reality.
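To see why a tail of expensive houses produces a positive score, here is a small sketch of one common skewness estimator, the moment coefficient (third central moment divided by the variance raised to the power 1.5). This is an assumption for illustration: `D.skewness` may use a different estimator or bias correction, and `skewnessOf` and the sample lists are hypothetical.

```haskell
-- | Moment coefficient of skewness: the third central moment divided by
-- the second central moment (population variance) raised to the power 1.5.
skewnessOf :: [Double] -> Double
skewnessOf xs = m3 / (m2 ** 1.5)
  where
    n = fromIntegral (length xs)
    m = sum xs / n
    m2 = sum [(x - m) ^ (2 :: Int) | x <- xs] / n
    m3 = sum [(x - m) ^ (3 :: Int) | x <- xs] / n

-- Made-up samples: one with a long right tail, one with a long left tail.
rightTailed, leftTailed :: [Double]
rightTailed = [1, 2, 3, 4, 100]
leftTailed = [-100, 1, 2, 3, 4]
```

`skewnessOf rightTailed` is positive and `skewnessOf leftTailed` is negative, because cubing the deviations preserves their sign and lets the far-away value dominate. A symmetric sample such as `[1, 2, 3]` scores zero.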
@@ -241,7 +261,9 @@ We can get all these statistics with a single command:

```haskell
TIO.putStrLn $ D.toMarkdownTable $ D.summarize df
- ```
+
+ ```
+
> | Statistic<br>Text | longitude<br>Double | latitude<br>Double | housing_median_age<br>Double | total_rooms<br>Double | total_bedrooms<br>Double | population<br>Double | households<br>Double | median_income<br>Double | median_house_value<br>Double |
> | ------------------ | --------------------- | -------------------- | ------------------------------ | ----------------------- | -------------------------- | ---------------------- | ---------------------- | ------------------------- | ----------------------------- |
> | Count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
@@ -278,34 +300,112 @@ range of value. Going back to our california housing dataset, we can plot a hist


```haskell
- D.plotHistogram "median_house_value" df
+ -- cabal: build-depends: granite
+ import Granite.Svg
+ import qualified Data.Text.IO as T
+ import qualified Data.Text as T
+
+ let houseValues = D.columnAsList (F.col @Double "median_house_value") df
+
+ T.putStrLn $
+     histogram
+         (bins 30 140000 502000)
+         houseValues
+         defPlot
+             { widthChars = 68
+             , heightChars = 18
+             , legendPos = LegendBottom
+             , xFormatter = \_ _ v -> T.pack (show (round v :: Int))
+             , xNumTicks = 10
+             , yNumTicks = 5
+             , plotTitle = "Median House Prices of California Houses ($)"
+             }
+
```
283- > ```
284- > 1501.0│ ▁▁██
285- > │ ▂▂████
286- > │ ██ ▂▂████████
287- > │ ██▅▅██████████
288- > │ ██████████████
289- > │ ██████████████ ▄▄ ▁▁
290- > │ ▄▄██████████████ ██ ██
291- > │ ████████████████▂▂██▆▆ ██
292- > │ ██████████████████████ ██
293- > │ ██████████████████████ ██
294- > 750.5│ ██████████████████████████▆▆ ██
295- > │ ████████████████████████████▂▂ ██
296- > │ ██████████████████████████████ ██
297- > │ ██████████████████████████████ ██
298- > │ ██████████████████████████████▅▅ ▅▅██ ██
299- > │ ████████████████████████████████▇▇████▁▁ ██
300- > │ ████████████████████████████████████████▁▁ ██
301- > │ ██████████████████████████████████████████▆▆▂▂ ▁▁ ██
302- > │ ▄▄██████████████████████████████████████████████▇▇██▂▂▁▁██
303- > 0.0│▂▂██████████████████████████████████████████████████████████
304- > └────────────────────────────────────────────────────────────
305- > 1.5e4 2.6e5 5.0e5
306- >
307- > ⣿ count
308- > ```
325+
326+ > <svg xmlns =" http://www.w3.org/2000/svg " viewBox =" 0 0 770 394 " width =" 770 " height =" 394 " font-family =" system-ui, -apple-system, sans-serif " >
327+ > <rect width =" 100% " height =" 100% " fill =" white " />
328+ > <text x =" 410 " y =" 26 " text-anchor =" middle " fill =" #222 " font-size =" 14 " >Median House Prices of California Houses ($)</text >
329+ > <line x1 =" 70 " y1 =" 322 " x2 =" 750 " y2 =" 322 " stroke =" #aaa " stroke-width =" 1 " />
330+ > <line x1 =" 70 " y1 =" 34 " x2 =" 70 " y2 =" 322 " stroke =" #aaa " stroke-width =" 1 " />
331+ > <line x1 =" 70 " y1 =" 34 " x2 =" 66 " y2 =" 34 " stroke =" #aaa " stroke-width =" 1 " />
332+ > <text x =" 62 " y =" 38 " text-anchor =" end " fill =" #555 " font-size =" 11 " >1252.0</text >
333+ > <line x1 =" 70 " y1 =" 34 " x2 =" 750 " y2 =" 34 " stroke =" #eee " stroke-width =" 0.50 " />
334+ > <line x1 =" 70 " y1 =" 106.25 " x2 =" 66 " y2 =" 106.25 " stroke =" #aaa " stroke-width =" 1 " />
335+ > <text x =" 62 " y =" 110.25 " text-anchor =" end " fill =" #555 " font-size =" 11 " >939.0</text >
336+ > <line x1 =" 70 " y1 =" 106.25 " x2 =" 750 " y2 =" 106.25 " stroke =" #eee " stroke-width =" 0.50 " />
337+ > <line x1 =" 70 " y1 =" 178.50 " x2 =" 66 " y2 =" 178.50 " stroke =" #aaa " stroke-width =" 1 " />
338+ > <text x =" 62 " y =" 182.50 " text-anchor =" end " fill =" #555 " font-size =" 11 " >626.0</text >
339+ > <line x1 =" 70 " y1 =" 178.50 " x2 =" 750 " y2 =" 178.50 " stroke =" #eee " stroke-width =" 0.50 " />
340+ > <line x1 =" 70 " y1 =" 249.75 " x2 =" 66 " y2 =" 249.75 " stroke =" #aaa " stroke-width =" 1 " />
341+ > <text x =" 62 " y =" 253.75 " text-anchor =" end " fill =" #555 " font-size =" 11 " >313.0</text >
342+ > <line x1 =" 70 " y1 =" 249.75 " x2 =" 750 " y2 =" 249.75 " stroke =" #eee " stroke-width =" 0.50 " />
343+ > <line x1 =" 70 " y1 =" 322 " x2 =" 66 " y2 =" 322 " stroke =" #aaa " stroke-width =" 1 " />
344+ > <text x =" 62 " y =" 326 " text-anchor =" end " fill =" #555 " font-size =" 11 " >0.0</text >
345+ > <line x1 =" 70 " y1 =" 322 " x2 =" 750 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
346+ > <line x1 =" 70 " y1 =" 322 " x2 =" 70 " y2 =" 326 " stroke =" #aaa " stroke-width =" 1 " />
347+ > <text x =" 70 " y =" 338 " text-anchor =" middle " fill =" #555 " font-size =" 11 " >140000</text >
348+ > <line x1 =" 70 " y1 =" 34 " x2 =" 70 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
349+ > <line x1 =" 145.11 " y1 =" 322 " x2 =" 145.11 " y2 =" 326 " stroke =" #aaa " stroke-width =" 1 " />
350+ > <text x =" 145.11 " y =" 338 " text-anchor =" middle " fill =" #555 " font-size =" 11 " >180222</text >
351+ > <line x1 =" 145.11 " y1 =" 34 " x2 =" 145.11 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
352+ > <line x1 =" 221.22 " y1 =" 322 " x2 =" 221.22 " y2 =" 326 " stroke =" #aaa " stroke-width =" 1 " />
353+ > <text x =" 221.22 " y =" 338 " text-anchor =" middle " fill =" #555 " font-size =" 11 " >220444</text >
354+ > <line x1 =" 221.22 " y1 =" 34 " x2 =" 221.22 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
355+ > <line x1 =" 296.33 " y1 =" 322 " x2 =" 296.33 " y2 =" 326 " stroke =" #aaa " stroke-width =" 1 " />
356+ > <text x =" 296.33 " y =" 338 " text-anchor =" middle " fill =" #555 " font-size =" 11 " >260667</text >
357+ > <line x1 =" 296.33 " y1 =" 34 " x2 =" 296.33 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
358+ > <line x1 =" 372.44 " y1 =" 322 " x2 =" 372.44 " y2 =" 326 " stroke =" #aaa " stroke-width =" 1 " />
359+ > <text x =" 372.44 " y =" 338 " text-anchor =" middle " fill =" #555 " font-size =" 11 " >300889</text >
360+ > <line x1 =" 372.44 " y1 =" 34 " x2 =" 372.44 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
361+ > <line x1 =" 447.56 " y1 =" 322 " x2 =" 447.56 " y2 =" 326 " stroke =" #aaa " stroke-width =" 1 " />
362+ > <text x =" 447.56 " y =" 338 " text-anchor =" middle " fill =" #555 " font-size =" 11 " >341111</text >
363+ > <line x1 =" 447.56 " y1 =" 34 " x2 =" 447.56 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
364+ > <line x1 =" 523.67 " y1 =" 322 " x2 =" 523.67 " y2 =" 326 " stroke =" #aaa " stroke-width =" 1 " />
365+ > <text x =" 523.67 " y =" 338 " text-anchor =" middle " fill =" #555 " font-size =" 11 " >381333</text >
366+ > <line x1 =" 523.67 " y1 =" 34 " x2 =" 523.67 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
367+ > <line x1 =" 598.78 " y1 =" 322 " x2 =" 598.78 " y2 =" 326 " stroke =" #aaa " stroke-width =" 1 " />
368+ > <text x =" 598.78 " y =" 338 " text-anchor =" middle " fill =" #555 " font-size =" 11 " >421556</text >
369+ > <line x1 =" 598.78 " y1 =" 34 " x2 =" 598.78 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
370+ > <line x1 =" 674.89 " y1 =" 322 " x2 =" 674.89 " y2 =" 326 " stroke =" #aaa " stroke-width =" 1 " />
371+ > <text x =" 674.89 " y =" 338 " text-anchor =" middle " fill =" #555 " font-size =" 11 " >461778</text >
372+ > <line x1 =" 674.89 " y1 =" 34 " x2 =" 674.89 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
373+ > <line x1 =" 750 " y1 =" 322 " x2 =" 750 " y2 =" 326 " stroke =" #aaa " stroke-width =" 1 " />
374+ > <text x =" 750 " y =" 338 " text-anchor =" middle " fill =" #555 " font-size =" 11 " >502000</text >
375+ > <line x1 =" 750 " y1 =" 34 " x2 =" 750 " y2 =" 322 " stroke =" #eee " stroke-width =" 0.50 " />
376+ > <rect x =" 70 " y =" 86.91 " width =" 21.67 " height =" 235.09 " fill =" #1abc9c " />
377+ > <rect x =" 92.67 " y =" 34.00 " width =" 21.67 " height =" 288.00 " fill =" #1abc9c " />
378+ > <rect x =" 115.33 " y =" 83.92 " width =" 21.67 " height =" 238.08 " fill =" #1abc9c " />
379+ > <rect x =" 138 " y =" 88.98 " width =" 21.67 " height =" 233.02 " fill =" #1abc9c " />
380+ > <rect x =" 160.67 " y =" 123.94 " width =" 21.67 " height =" 198.06 " fill =" #1abc9c " />
381+ > <rect x =" 183.33 " y =" 182.60 " width =" 21.67 " height =" 139.40 " fill =" #1abc9c " />
382+ > <rect x =" 206 " y =" 140.04 " width =" 21.67 " height =" 181.96 " fill =" #1abc9c " />
383+ > <rect x =" 228.67 " y =" 138.66 " width =" 21.67 " height =" 183.34 " fill =" #1abc9c " />
384+ > <rect x =" 251.33 " y =" 175.70 " width =" 21.67 " height =" 146.30 " fill =" #1abc9c " />
385+ > <rect x =" 274 " y =" 202.84 " width =" 21.67 " height =" 119.16 " fill =" #1abc9c " />
386+ > <rect x =" 296.67 " y =" 192.72 " width =" 21.67 " height =" 129.28 " fill =" #1abc9c " />
387+ > <rect x =" 319.33 " y =" 211.81 " width =" 21.67 " height =" 110.19 " fill =" #1abc9c " />
388+ > <rect x =" 342 " y =" 230.91 " width =" 21.67 " height =" 91.09 " fill =" #1abc9c " />
389+ > <rect x =" 364.67 " y =" 260.81 " width =" 21.67 " height =" 61.19 " fill =" #1abc9c " />
390+ > <rect x =" 387.33 " y =" 257.13 " width =" 21.67 " height =" 64.87 " fill =" #1abc9c " />
391+ > <rect x =" 410 " y =" 253.22 " width =" 21.67 " height =" 68.78 " fill =" #1abc9c " />
392+ > <rect x =" 432.67 " y =" 249.08 " width =" 21.67 " height =" 72.92 " fill =" #1abc9c " />
393+ > <rect x =" 455.33 " y =" 246.09 " width =" 21.67 " height =" 75.91 " fill =" #1abc9c " />
394+ > <rect x =" 478 " y =" 265.87 " width =" 21.67 " height =" 56.13 " fill =" #1abc9c " />
395+ > <rect x =" 500.67 " y =" 281.28 " width =" 21.67 " height =" 40.72 " fill =" #1abc9c " />
396+ > <rect x =" 523.33 " y =" 287.96 " width =" 21.67 " height =" 34.04 " fill =" #1abc9c " />
397+ > <rect x =" 546 " y =" 285.88 " width =" 21.67 " height =" 36.12 " fill =" #1abc9c " />
398+ > <rect x =" 568.67 " y =" 292.79 " width =" 21.67 " height =" 29.21 " fill =" #1abc9c " />
399+ > <rect x =" 591.33 " y =" 297.16 " width =" 21.67 " height =" 24.84 " fill =" #1abc9c " />
400+ > <rect x =" 614 " y =" 296.24 " width =" 21.67 " height =" 25.76 " fill =" #1abc9c " />
401+ > <rect x =" 636.67 " y =" 293.71 " width =" 21.67 " height =" 28.29 " fill =" #1abc9c " />
402+ > <rect x =" 659.33 " y =" 305.67 " width =" 21.67 " height =" 16.33 " fill =" #1abc9c " />
403+ > <rect x =" 682 " y =" 305.44 " width =" 21.67 " height =" 16.56 " fill =" #1abc9c " />
404+ > <rect x =" 704.67 " y =" 309.81 " width =" 21.67 " height =" 12.19 " fill =" #1abc9c " />
405+ > <rect x =" 727.33 " y =" 84.61 " width =" 21.67 " height =" 237.39 " fill =" #1abc9c " />
406+ > <rect x =" 377.50 " y =" 375 " width =" 12 " height =" 12 " fill =" #1abc9c " />
407+ > <text x =" 393.50 " y =" 385 " text-anchor =" start " fill =" #555 " font-size =" 11 " >count</text >
408+ > </svg >
309409
310410
From the histogram above we can already tell whether there are outliers, where the data is centred, and how widely it is spread.
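Under the hood, a histogram is just a count of values per bin. The following is a rough sketch of equal-width binning in plain Haskell; it is not Granite's implementation. `binCounts` is a hypothetical helper, and dropping out-of-range values is an assumption made here for simplicity.

```haskell
-- | Count how many values fall into each of n equal-width bins on [lo, hi].
-- Values outside the range are dropped; hi itself goes in the last bin.
binCounts :: Int -> Double -> Double -> [Double] -> [Int]
binCounts n lo hi xs = [length (filter (== i) indices) | i <- [0 .. n - 1]]
  where
    width = (hi - lo) / fromIntegral n
    indices =
        [ min (n - 1) (floor ((x - lo) / width))
        | x <- xs
        , x >= lo
        , x <= hi
        ]
```

For example, `binCounts 3 0 3 [0.5, 1.5, 1.7, 3.0, -1, 10]` evaluates to `[1,2,1]`: the two out-of-range values are ignored and the upper bound lands in the last bin. The bar heights in a histogram are these counts, so a tall bar far from the bulk of the data is a hint of outliers or of values capped at a limit.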