Height. The weight. Three neighbors

In search of an interesting and simple DataSet, I came across this handsome man .


About this beauty


It contains data on the growth and weight of 10,000 men and women . No description. Nothing extra". Only height, weight and floor mark. I liked this mysterious simplicity.


Well, let's get started!


What was interesting to me?



Let's go!


logo


First look


To start, load the necessary modules


#      import pandas as pd #     import matplotlib.pyplot as plt %matplotlib inline #    «  » from sklearn.neighbors import KNeighborsRegressor #         from sklearn.model_selection import train_test_split 

When the libraries stood up exactly - it was time to load the DataSet itself and look at the first 10 elements. This is necessary so that our gut is calm, that we have loaded everything correctly.


By the way, do not be alarmed that height and weight differ from what we are used to. This is because of a different measurement system: inches and pounds , instead of centimeters and kilograms .


 data = pd.read_csv('weight-height.csv') data.head(10) 

GenderHeightWeight
0Male74242
oneMale69162
2Male74213
3Male72220
fourMale70206
5Male68152
6Male69184
7Male68168
8Male67176
9Male63156

Good! We see that the first ten entries are “men”. We see their height (height) and weight (weight). Data loaded up well.


Now you can look at the number of rows in the set.


 data.shape >> (10000, 3) 

Ten thousand lines / records. And each has three parameters . Exactly what is needed!


It's time to fix the measurement system. Now here are centimeters and kilograms.


 data['Height'] *= 2.54 data['Weight'] /= 2.205 #    data.head(10) 

GenderHeightWeight
0Male188110
oneMale17574
2Male18896
3Male182one hundred
fourMale17794
5Male17169
6Male17583
7Male17476
8Male17080
9Male16171

Now it has become more familiar. And the first record tells us about a man with a height of ~ 190cm and a weight of ~ 110kg. Big man. Let's call him Bob.


But how to understand: is it a lot or a little compared to the rest? Is it possible that we are all plus or minus Beans? This is a bit later.


Now let's find out how symmetrical the two genders are in this dataset?


 data['Gender'].value_counts() >> Male 5000 Female 5000 Name: Gender, dtype: int64 

Ideally equally divided. And this is good, because if there were: 9999 men and 1 woman, then there would be no sense in pretending that this DataSet reveals both sexes equally well. In our case, everything is OK!


Divide and learn!


Now intuition tells us that it will be correct to separate the two sexes and explore separately. Indeed, in life we ​​often see that men and women have plus or minus different height and weight


 #  data_male = data[data['Gender'] == 'Male'].copy() #  data_female = data[data['Gender'] == 'Female'].copy() 

Let's take a look at the small descriptive statistics that the pandas module offers us.


Men :


 data_male.describe() 

HeightWeight
count50005000
mean17585
std79
min14851
25%17179
fifty%17585
75%18091
max201122

Women :


 data_female.describe() 

HeightWeight
count50005000
mean16262
std79
min13829th
25%15756
fifty%16262
75%16767
max18692

A small educational program on inf above

In plain language:


Descriptive statistics are a set of numbers / characteristics for a description. Perhaps this is the easiest to understand type of statistics.


Imagine that you are describing the parameters of a ball. He can be:


  • big small
  • smooth / rough
  • blue red
  • bouncing / and not really.

With a strong simplification, we can say that descriptive statistics is engaged in this . But he does this not with balls, but with data.


And here are the parameters from the table above:


  • count - The number of instances.
  • mean - The average or sum of all values ​​divided by their number.
  • std - The standard deviation or root of the variance. Shows the range of values ​​relative to the average.
  • min - The minimum value or minimum.
  • 25% - First quartile. Shows a value below which 25% of the records are.
  • 50% - Second quartile or median. Shows a value above and below which the same number of entries.
  • 75% - Third Quartile. By anology with the first quartile, but below 75% of the records.
  • max - The maximum value or maximum.

The average value is very sensitive to emissions! If four people receive a salary of 10,000 ₽, and the fifth - 460,000 ₽. That average will be - 100 000 ₽. And the median will remain the same - 10 000 ₽.


This does not mean that average is a bad indicator. It needs to be treated more carefully.


By the way, there’s also a snag with the median.


If the number of measurements is odd. That median is the value in the middle, if you put the data "by growth".


And if it’s even, then the median is the average between the two “most central” ones.


If the data set contains only integers and the median is fractional, do not be surprised. Most likely the number of measurements is even.


An example :


The son brought marks from school. There were five lessons he received: 1, 5, 3, 2, 4
Five ratings → odd amount
Growth: 1, 2, 3, 4, 5
Take the central - 3
Median score - 3


The next day, the son brought from school new grades: 4, 2, 3, 5
Four ratings → odd amount
We build by growth: 2, 3, 4, 5
Take the centerpieces: 3, 4
We find their average: 3.5
Median - 3.5


Conclusion: Well done son :)


We see that in men the average and median are 175cm and 85kg. And in women : 162cm and 62kg. This tells us that there are no strong emissions. Or they are symmetrical on both sides of the median. Which is very rare.


But both sexes have slight deviations of the mean from the median. But they are insignificant and they are visible only on hundredths. Move on!


Histogram


This is a graph that plots the values ​​from minimum to maximum in order of growth, and shows the number of individual instances.


 fig, axes = plt.subplots(2,2, figsize=(20,10)) plt.subplots_adjust(wspace=0, hspace=0) axes[0,0].hist(data_male['Height'], label='Male Height', bins=100, color='red') axes[0,1].hist(data_male['Weight'], label='Male Weight', bins=100, color='red', alpha=0.4) axes[1,0].hist(data_female['Height'], label='Female Height', bins=100, color='blue') axes[1,1].hist(data_female['Weight'], label='Female Weight', bins=100, color='blue', alpha=0.4) axes[0,0].legend(loc=2, fontsize=20) axes[0,1].legend(loc=2, fontsize=20) axes[1,0].legend(loc=2, fontsize=20) axes[1,1].legend(loc=2, fontsize=20) plt.savefig('plt_histogram.png') plt.show() 

hist


Data is distributed bell-shaped. Very similar to normal distribution .


In addition to statistical tests for normal distribution, there is a visual test. If the distribution by type and logic seems to be normal - we can assume with some assumptions that we are dealing with it.


One could do a statistical normality test and determine p-value, but I do not know how this is beyond the scope of the article.


Learning to work with pens


Pandas can count a lot for us. But you need to at least once count some statistics yourself. Now I will show how to calculate the standard deviation .


Let's do it on the example of men and the characteristic - growth.


Average


Formula:


M= frac1N sum limitsi=1Nni


where



The code:


 mean = data_male['Height'].mean() print('mean:\t{:.2f}'.format(mean)) >> mean: 175.33 

Average height - 175cm


Deviation squared


di=(niM)2


where



The code:


 data_male['Height_d'] = (data_male['Height'] - mean) ** 2 data_male['Height_d'].head(10) >> 0 149.927893 1 0.385495 2 166.739089 3 47.193692 4 4.721246 5 20.288347 6 0.375539 7 2.964214 8 25.997623 9 200.149603 Name: Height_d, dtype: float64 

Dispersion


Formula:


D= frac1N sum limitsi=1Ndi


where



The code:


 disp = data_male['Height_d'].mean() print('disp:\t{:.2f}'.format(disp)) >> disp: 52.89 

Dispersion - 53


Standard deviation


Formula:


std= sqrtD


where



The code:


 std = disp ** 0.5 print('std:\t{:.2f}'.format(std)) >> std: 7.27 

Standard Deviation - 7


Confidence Intervals


Now we will find out in what ranges of growth and weight 68%, 95% and 99.7% of men and women are .


This is not so difficult - you need to add and subtract the standard deviation from the average. It looks like this:



We write an auxiliary function that will consider this:


 def get_stats(series, title='noname'): #    print('= {} =\n'.format(title.upper())) #     pandas descr = series.describe() #   mean = descr['mean'] print('= Mean:\t{:.0f}'.format(mean)) #    std = descr['std'] print('= Std:\t{:.0f}'.format(std)) #    print('\n= = = =\n') #   ## 68% devi_1 = [mean - std, mean + std] ## 95% devi_2 = [mean - 2 * std, mean + 2 * std] ## 99.7% devi_3 = [mean - 3 * std, mean + 3 * std] #   print('= 68% is from\t\t{:.0f} to {:.0f}'.format(devi_1[0], devi_1[1])) print('= 95% is from\t\t{:.0f} to {:.0f}'.format(devi_2[0], devi_2[1])) print('= 99.7% is from\t\t{:.0f} to {:.0f}'.format(devi_3[0], devi_3[1])) 

Well, apply it to the data:


Men | Height


 get_stats(data_male['Height'], title='Male Height') >> = MALE HEIGHT = = Mean: 175 = Std: 7 = = = = = 68% is from 168 to 183 = 95% is from 161 to 190 = 99.7% is from 154 to 197 

Men | The weight


 get_stats(data_male['Height'], title='Male Height') >> = MALE WEIGHT = = Mean: 85 = Std: 9 = = = = = 68% is from 76 to 94 = 95% is from 67 to 103 = 99.7% is from 58 to 112 

Women | Height


 get_stats(data_male['Height'], title='Male Height') >> = FEMALE HEIGHT = = Mean: 162 = Std: 7 = = = = = 68% is from 155 to 169 = 95% is from 148 to 176 = 99.7% is from 141 to 182 

Women | The weight


 get_stats(data_male['Height'], title='Male Height') >> = FEMALE WEIGHT = = Mean: 62 = Std: 9 = = = = = 68% is from 53 to 70 = 95% is from 44 to 79 = 99.7% is from 36 to 87 

Hence the conclusions:



Now it remains only to apply machine learning to this set and try to predict weight by height.


Nearest neighbors


The algorithm "To the nearest neighbors" is simple. It exists for classification tasks - to distinguish a cat from a dog - and for regression tasks - to guess the weight by height. That's what we need!


For regression, he uses the following algorithm:



First you need to divide the data set into the training and test parts and test the algorithm


Experimenting on men


 X_train, X_test, y_train, y_test = train_test_split(data_male['Height'], data_male['Weight']) 

Divided, it is time to try.


 #   knr3 = KNeighborsRegressor(n_neighbors=3) knr3.fit(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) knr3.score(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) >> 0.8298400793623182 #   knr5 = KNeighborsRegressor(n_neighbors=5) knr5.fit(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) knr5.score(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) >> 0.7958051642678619 #   knr7 = KNeighborsRegressor(n_neighbors=7) knr7.fit(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) knr7.score(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) >> 0.7769249318420969 

We will not go far and stop at three neighbors. But the question is: can such a model guess my weight?


 knr3.predict([[180]])[0, 0] >> 88.67596236265881 

88kg is very close. This second, my weight is 89.8kg


Prediction chart for men


The time to build my favorite part of science is graphics.


 array_male = [] #   99.7% xaxis = range(154, 198) for h in xaxis: ans = knr3.predict([[h]]) array_male.append(ans[0, 0]) plt.figure(figsize=(20,10)) plt.plot(xaxis, array_male, 'r-', linewidth=4) plt.title('Male heght-weight dependence', fontsize=30) plt.xlabel('Height', fontsize=30) plt.ylabel('Weight', fontsize=30) plt.grid() plt.savefig('plt_knn_male.png') plt.show() 

male_plot


Model and prediction chart for women


 X_train, X_test, y_train, y_test = train_test_split(data_female['Height'], data_female['Weight']) knr3 = KNeighborsRegressor(n_neighbors=3) knr3.fit(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) knr3.score(X_train.values.reshape(-1,1), y_train.values.reshape(-1,1)) >> 0.8135681584074799 

 array_female = [] #   99.7% xaxis = range(141, 183) for h in xaxis: ans = knr3.predict([[h]]) array_female.append(ans[0, 0]) plt.figure(figsize=(20,10)) plt.plot(xaxis, array_female, 'b-', linewidth=4) plt.title('Female heght-weight dependence', fontsize=30) plt.xlabel('Height', fontsize=30) plt.ylabel('Weight', fontsize=30) plt.grid() plt.savefig('plt_knn_female.png') plt.show() 

female_plot


And of course it’s interesting how these graphs look together:


 #      xaxis = range(154, 183) plt.figure(figsize=(20,10)) plt.plot(xaxis, array_male[:-15], 'r-', linewidth=4) plt.plot(xaxis, array_female[13:], 'b-', linewidth=4) plt.title('Together heght-weight dependence', fontsize=30) plt.xlabel('Height', fontsize=30) plt.ylabel('Weight', fontsize=30) plt.grid() plt.savefig('plt_knn_together.png') plt.show() 

together_plot


Answers on questions


- What is the range of weight and height for most men and women?


99.7% of men: from 154cm to 197cm and from 58kg to 112kg.
And 99.7% of women: from 141cm to 182cm and from 36kg to 87kg.


- What kind of “average” man and “average” woman are they?


The average man is 175cm and 85kg.
And the average woman is 162cm and 62kg.


- Can the simple KNN machine learning model from these data predict weight by height?


Yes, the model predicted 88kg, and I have 89.8kg.


All that I did, I collected here


Cons of the article



Epilogue


Like if you hit the 99.7% interval



Source: https://habr.com/ru/post/466123/


All Articles