
The second lesson is devoted to data visualization in Python. First we will look at the main methods of the Seaborn and Plotly libraries, then we will analyze the telecom customer churn dataset familiar from the first article, and finally we will peek into n-dimensional space with the t-SNE algorithm. There is also a video recording of a lecture based on this article, made during the second run of the open course (September to November 2017).
UPD: the course is now in English under the mlcourse.ai brand, with articles on Medium and materials on Kaggle (Dataset) and GitHub.
The article will turn out fairly long. Ready? Let's go!
Demonstration of the main Seaborn and Plotly methods
First, as always, we set up the environment: import all the necessary libraries and tweak the default image display.
After that we load into a DataFrame the data we are going to work with. As an example I picked the video game sales and ratings data from Kaggle Datasets.
import pandas as pd

df = pd.read_csv('../../data/video_games_sales.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
Name               16717 non-null object
Platform           16719 non-null object
Year_of_Release    16450 non-null float64
Genre              16717 non-null object
Publisher          16665 non-null object
NA_Sales           16719 non-null float64
EU_Sales           16719 non-null float64
JP_Sales           16719 non-null float64
Other_Sales        16719 non-null float64
Global_Sales       16719 non-null float64
Critic_Score       8137 non-null float64
Critic_Count       8137 non-null float64
User_Score         10015 non-null object
User_Count         7590 non-null float64
Developer          10096 non-null object
Rating             9950 non-null object
dtypes: float64(9), object(7)
memory usage: 2.0+ MB
Ratings are not available for every game, so let's keep only the records without missing values by calling the dropna method. This step has to come before the type casts, because a column that still holds NaN cannot be cast to an integer type.
df = df.dropna()
print(df.shape)
(6825, 16)
Some of the features that pandas read as object we now cast explicitly to float or int.
df['User_Score'] = df.User_Score.astype('float64')
df['Year_of_Release'] = df.Year_of_Release.astype('int64')
df['User_Count'] = df.User_Count.astype('int64')
df['Critic_Count'] = df.Critic_Count.astype('int64')
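To see how missing values and integer casts interact, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one gap in each numeric column
toy = pd.DataFrame({
    "Year_of_Release": [2006.0, np.nan, 2008.0],
    "User_Count": [322.0, 709.0, np.nan],
})

try:
    toy["Year_of_Release"].astype("int64")
    cast_failed = False
except ValueError:
    # NaN has no integer representation, so pandas refuses the cast
    cast_failed = True

# After dropping the incomplete rows the same cast goes through
clean = toy.dropna().astype({"Year_of_Release": "int64", "User_Count": "int64"})

print(cast_failed)
print(clean.dtypes)
```

Only the first row survives dropna here, and both columns end up as int64.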
In total the table holds 6825 objects with 16 features each. Let's look at the first few records with the head method to make sure everything was parsed correctly. For convenience I kept only the features we will use later.
useful_cols = ['Name', 'Platform', 'Year_of_Release', 'Genre',
               'Global_Sales', 'Critic_Score', 'Critic_Count',
               'User_Score', 'User_Count', 'Rating']
df[useful_cols].head()

Before moving on to the seaborn and plotly methods, let's discuss the simplest and often most convenient way to visualize data from a pandas DataFrame: the plot function.
As an example, let's plot video game sales in different countries by year. First we select only the columns we need, then compute the total sales per year, and call plot on the resulting DataFrame without any parameters.
sales_df = df[[x for x in df.columns if 'Sales' in x] + ['Year_of_Release']]
sales_df.groupby('Year_of_Release').sum().plot()
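It may help to look at the table that plot receives. The sketch below repeats the same selection and aggregation on a hypothetical mini-sample (the numbers are invented); plot would then draw one line per column, with the index on the x axis.

```python
import pandas as pd

# Hypothetical mini-sample that follows the column naming of the dataset
toy = pd.DataFrame({
    "Year_of_Release": [2006, 2006, 2007],
    "NA_Sales": [1.0, 2.0, 4.0],
    "EU_Sales": [0.5, 0.5, 1.0],
    "Genre": ["Sports", "Racing", "Sports"],
})

# Keep every column whose name contains 'Sales', plus the year
sales_cols = [c for c in toy.columns if "Sales" in c]
yearly = toy[sales_cols + ["Year_of_Release"]].groupby("Year_of_Release").sum()

print(yearly)
```

The result has one row per year and one column per region, which is exactly the shape a line chart needs.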
The plot function in pandas is implemented on top of the matplotlib library.

With the kind parameter you can change the chart type, for example to a bar chart. Matplotlib allows very flexible customization of charts. Almost anything on a chart can be changed, but you will have to dig through the documentation to find the right parameters. For example, the rot parameter controls the tilt angle of the tick labels on the x axis.
sales_df.groupby('Year_of_Release').sum().plot(kind='bar', rot=45)

Seaborn
Now let's move on to the seaborn library. Seaborn is essentially a higher-level API built on top of matplotlib. It ships with more sensible default chart settings, and it also offers fairly complex visualization types that would take a lot of code in matplotlib.
Let's meet the first such "complex" chart type, the pair plot (scatter plot matrix). This visualization helps us see on a single picture how the different features relate to each other.
import seaborn as sns

cols = ['Global_Sales', 'Critic_Score', 'Critic_Count', 'User_Score', 'User_Count']
sns_plot = sns.pairplot(df[cols])
sns_plot.savefig('pairplot.png')
As you can see, the histograms of the feature distributions sit on the diagonal of the chart matrix. The remaining charts are ordinary scatter plots for the corresponding pairs of features.
To save a chart to a file, use the savefig method.

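With seaborn you can also plot a distribution with a dist plot. As an example, let's look at the distribution of the critics' ratings, Critic_Score. By default the chart shows a histogram together with a kernel density estimate.
# distplot is deprecated in recent seaborn releases; sns.histplot(df.Critic_Score, kde=True) is the closest replacement
sns.distplot(df.Critic_Score)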

To look more closely at the relationship between two numerical features there is also the joint plot, a hybrid of a scatter plot and a histogram. Let's see how the critics' rating Critic_Score and the users' rating User_Score are related.

Another useful chart type is the box plot. Let's compare the critics' ratings of games for the top 5 largest gaming platforms.
top_platforms = df.Platform.value_counts().sort_values(ascending=False).head(5).index.values
sns.boxplot(y="Platform", x="Critic_Score",
            data=df[df.Platform.isin(top_platforms)], orient="h")

I think it is worth spending a little more time on how to read a box plot. A box plot consists of a box (hence the name), whiskers and points. The box shows the interquartile range of the distribution, that is, the 25th (Q1) and 75th (Q3) percentiles. The line inside the box marks the median of the distribution.
With the box sorted out, let's move on to the whiskers. The whiskers show the whole spread of the points except for the outliers, that is, the minimum and maximum values that fall into the interval (Q1 - 1.5*IQR, Q3 + 1.5*IQR), where IQR = Q3 - Q1 is the interquartile range. The points on the chart mark the outliers, the values that do not fit into the range set by the whiskers.
It is better to see it once, so here is a picture from Wikipedia as well:

One more chart type (the last one we cover in this article) is the heat map. A heat map shows the distribution of a numerical feature across two categorical ones. Let's visualize the total game sales by genre and gaming platform.
platform_genre_sales = df.pivot_table(
    index='Platform',
    columns='Genre',
    values='Global_Sales',
    aggfunc='sum').fillna(0).astype(float)
sns.heatmap(platform_genre_sales, annot=True, fmt=".1f", linewidths=.5)
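The table behind the heat map is an ordinary pivot table. A toy version with invented sales figures shows what each cell contains, including the zero that fillna puts where a platform has no games of a genre.

```python
import pandas as pd

# Hypothetical mini-sample: five games on two platforms
toy = pd.DataFrame({
    "Platform": ["PS2", "PS2", "X360", "X360", "PS2"],
    "Genre": ["Action", "Sports", "Action", "Action", "Action"],
    "Global_Sales": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Rows are platforms, columns are genres, cells hold summed sales
table = toy.pivot_table(
    index="Platform", columns="Genre",
    values="Global_Sales", aggfunc="sum",
).fillna(0)

print(table)
```

Each cell of the heat map is then colored according to the corresponding number in this table.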

Plotly
So far we have looked at visualizations built on the matplotlib library. It is not the only option for plotting in python, however. Let's also get to know the plotly library. Plotly is an open-source library that lets you build interactive charts in a Jupyter notebook without having to dig into JavaScript code.
The beauty of interactive charts is that you can see the exact numerical value on mouse hover, hide series that are not interesting, zoom into a particular part of the chart, and so on.
Before we start, we import all the necessary modules and initialize plotly with the init_notebook_mode command.
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly
import plotly.graph_objs as go

init_notebook_mode(connected=True)
To begin with, let's build a line plot with the dynamics of the number of games released and their sales by year.
In plotly a visualization is built as a Figure object, which consists of the data (an array of lines that the library calls traces) and the styling, which is handled by the layout object. In simple cases you can call the iplot function on the array of traces alone.
The show_link parameter controls the links to the online platform plot.ly on the charts. Since this functionality is usually not needed, I prefer to hide it to prevent accidental clicks.
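The data-plus-layout structure described above might look like this on a toy sample. The sketch describes traces as plain dicts, which plotly accepts in place of go.Scatter objects, so that it runs without plotly installed; the sample values, the years_df name and the title text are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical mini-sample of games
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year_of_Release": [2006, 2006, 2007, 2007],
    "Global_Sales": [10.0, 5.0, 2.0, 1.0],
})

# Total sales and number of games released per year
by_year = toy.groupby("Year_of_Release")
years_df = pd.DataFrame({
    "Global_Sales": by_year["Global_Sales"].sum(),
    "Number_of_Games": by_year["Name"].count(),
})

# One trace per line on the chart
traces = [
    dict(type="scatter", mode="lines",
         x=years_df.index.tolist(), y=years_df[col].tolist(), name=col)
    for col in years_df.columns
]

# A figure is the list of traces plus a layout that holds the styling
fig = dict(data=traces, layout=dict(title="Statistics of video games"))
```

In a notebook with plotly set up as above, such a figure could be rendered with iplot(fig, show_link=False).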

You can save the chart right away as an html file.
plotly.offline.plot(fig, filename='years_stats.html', show_link=False)
Let's also look at the market share of the gaming platforms, computed by the number of games released and by total revenue. For that we build a bar chart.
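One possible way to prepare the two shares, sketched on invented data: aggregate per platform, divide by the column totals, and describe each column as a bar trace. Plain dicts are used for the traces here instead of go.Bar objects, and the names platforms_df and shares are hypothetical.

```python
import pandas as pd

# Hypothetical mini-sample: five games on three platforms
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Platform": ["PS2", "PS2", "PS2", "X360", "Wii"],
    "Global_Sales": [3.0, 2.0, 1.0, 3.0, 1.0],
})

# Revenue and number of games per platform
by_platform = toy.groupby("Platform")
platforms_df = pd.DataFrame({
    "Global_Sales": by_platform["Global_Sales"].sum(),
    "Number_of_Games": by_platform["Name"].count(),
})

# Turn absolute numbers into shares of the column totals, largest revenue first
shares = (platforms_df / platforms_df.sum()).sort_values("Global_Sales", ascending=False)

# One group of bars per column
traces = [
    dict(type="bar", x=shares.index.tolist(), y=shares[col].tolist(), name=col)
    for col in shares.columns
]
fig = dict(data=traces, layout=dict(title="Market share by platform"))
```

Each column of shares sums to 1, so the two groups of bars are directly comparable.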

In plotly you can build a box plot as well. Let's look at the distribution of the critics' ratings by game genre.
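For a box plot every genre becomes its own trace holding the raw ratings, and plotly computes the quartiles itself. A minimal sketch on made-up ratings, again with dicts standing in for go.Box objects:

```python
import pandas as pd

# Hypothetical ratings for three genres
toy = pd.DataFrame({
    "Genre": ["Action", "Sports", "Action", "Racing", "Sports"],
    "Critic_Score": [80.0, 70.0, 90.0, 65.0, 75.0],
})

# One box trace per genre, in order of first appearance
data = [
    dict(type="box",
         y=toy.loc[toy.Genre == genre, "Critic_Score"].tolist(),
         name=genre)
    for genre in toy.Genre.unique()
]
```

Since no layout is needed here, the list of traces alone could be passed to iplot.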

With plotly you can build other kinds of visualizations too. The charts look quite nice with the default settings. The library also lets you flexibly configure various visualization options: colors, fonts, captions, annotations and much more.
An example of visual data analysis
Let's read into a DataFrame the telecom operator customer churn data that we know well from the first article.
df = pd.read_csv('../../data/telecom_churn.csv')
Let's check that everything was read correctly by looking at the first 5 rows (the head method).
df.head()
The number of rows (customers) and columns (features):
df.shape
(3333, 20)
Let's look at the features and make sure that none of them has missing values: there are 3333 records everywhere.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
State                     3333 non-null object
Account length            3333 non-null int64
Area code                 3333 non-null int64
International plan        3333 non-null object
Voice mail plan           3333 non-null object
Number vmail messages     3333 non-null int64
Total day minutes         3333 non-null float64
Total day calls           3333 non-null int64
Total day charge          3333 non-null float64
Total eve minutes         3333 non-null float64
Total eve calls           3333 non-null int64
Total eve charge          3333 non-null float64
Total night minutes       3333 non-null float64
Total night calls         3333 non-null int64
Total night charge        3333 non-null float64
Total intl minutes        3333 non-null float64
Total intl calls          3333 non-null int64
Total intl charge         3333 non-null float64
Customer service calls    3333 non-null int64
Churn                     3333 non-null bool
dtypes: bool(1), float64(8), int64(8), object(3)
memory usage: 498.1+ KB
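Feature descriptions
Name | Description | Type |
---|---|---|
State | Letter code of the state | categorical |
Account length | How long the customer has been served by the company | quantitative |
Area code | Phone number prefix | quantitative |
International plan | International roaming (enabled / not enabled) | binary |
Voice mail plan | Voice mail (enabled / not enabled) | binary |
Number vmail messages | Number of voice messages | quantitative |
Total day minutes | Total duration of daytime calls | quantitative |
Total day calls | Total number of daytime calls | quantitative |
Total day charge | Total amount paid for daytime services | quantitative |
Total eve minutes | Total duration of evening calls | quantitative |
Total eve calls | Total number of evening calls | quantitative |
Total eve charge | Total amount paid for evening services | quantitative |
Total night minutes | Total duration of night calls | quantitative |
Total night calls | Total number of night calls | quantitative |
Total night charge | Total amount paid for night services | quantitative |
Total intl minutes | Total duration of international calls | quantitative |
Total intl calls | Total number of international calls | quantitative |
Total intl charge | Total amount paid for international calls | quantitative |
Customer service calls | Number of calls to the service center | quantitative |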
Target variable: Churn, the churn indicator, binary (1 means the customer was lost, i.e. churned). Later we will build models that predict this feature from the others, which is why we call it the target.
Let's look at the distribution of the target class, customer churn.
df['Churn'].value_counts()
False    2850
True      483
Name: Churn, dtype: int64
import matplotlib.pyplot as plt

df['Churn'].value_counts().plot(kind='bar', label='Churn')
plt.legend()
plt.title(' ');
次ã®ã°ã«ãŒãã®ãµã€ã³ãåºå¥ããŸãïŒ ãã£ãŒã³ãé€ããã¹ãŠïŒã
- ãã€ããªïŒ åœéãã©ã³ ã ãã€ã¹ã¡ãŒã«ãã©ã³
- ã«ããŽãªïŒ ç¶æ
- é åºïŒ ã«ã¹ã¿ããŒãµãŒãã¹ã®åŒã³åºã
- å®éçïŒå
šå¡
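Let's look at the correlations between the quantitative features. The colored correlation matrix shows that features such as Total day charge are computed from the minutes spoken (Total day minutes). So 4 features can be thrown away: they carry no useful information.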
corr_matrix = df.drop(['State', 'International plan', 'Voice mail plan', 'Area code'], axis=1).corr()
sns.heatmap(corr_matrix);
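Why a charge column shows up as perfectly correlated with its minutes column can be checked on synthetic numbers: a feature that is a fixed multiple of another has a Pearson correlation of 1 with it. The per-minute rate and the usage values below are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(17)

minutes = rng.uniform(0, 350, size=1000)  # hypothetical usage values
rate = 0.17                               # hypothetical price per minute
charge = rate * minutes                   # charge is fully determined by minutes
noise = rng.normal(size=1000)             # an unrelated feature, for contrast

corr_charge = np.corrcoef(minutes, charge)[0, 1]
corr_noise = np.corrcoef(minutes, noise)[0, 1]

print(round(corr_charge, 6), round(corr_noise, 3))
```

A correlation of 1 means the second feature adds nothing that the first one does not already contain, which is the argument for dropping it.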
次ã«ãé¢å¿ã®ãããã¹ãŠã®éçç¹æ§ã®ååžãèŠãŠã¿ãŸãããã ãã€ããª/ã«ããŽãª/é åºã®èšå·ãåå¥ã«èŠãŠãããŸãã
features = list(set(df.columns) - set(['State', 'International plan', 'Voice mail plan', 'Area code', 'Total day charge', 'Total eve charge', 'Total night charge', 'Total intl charge', 'Churn'])) df[features].hist(figsize=(20,12));
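We can see that most of the features look normally distributed. The exceptions are the number of calls to the service center, Customer service calls (a Poisson distribution fits better here), and the number of voice messages, Number vmail messages (a peak at zero, i.e. the people who have no voice mail). The distribution of the number of international calls (Total intl calls) is skewed as well.
It is still useful to build pictures like the following one, with the feature distributions drawn on the main diagonal and scatter plots for pairs of features off the diagonal. Sometimes this leads to some conclusions, but in this case everything is more or less clear, with no surprises.
sns.pairplot(df[features + ['Churn']], hue='Churn');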
次ã«ãå
åãã©ã®ããã«ã¿ãŒã²ããã«é¢é£ä»ããããŠãããã確èªããŸã-æµåºã
å¿ å®ãªé¡§å®¢ãšå»ã£ã顧客ã®éã®2ã€ã®ã°ã«ãŒãã§ã®éç屿§ã®ååžã®çµ±èšãèšè¿°ããç®±ã²ãå³ãäœæããŸãããã
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10)) for idx, feat in enumerate(features): sns.boxplot(x='Churn', y=feat, data=df, ax=axes[idx / 4, idx % 4]) axes[idx / 4, idx % 4].legend() axes[idx / 4, idx % 4].set_xlabel('Churn') axes[idx / 4, idx % 4].set_ylabel(feat);
At a glance, the biggest differences show up for the features Total day minutes, Customer service calls and Number vmail messages. Later we will learn how to determine feature importance in a classification task with a random forest (or gradient boosting), and it will turn out that the first two are indeed very important features for predicting churn.
Let's look separately at the pictures with the distribution of the number of minutes spoken during the day among the loyal customers and those who left. On the left are the box plots we already know, on the right are smoothed histograms of the distribution of the numerical feature in the two groups (more of a pretty picture: everything is already clear from the box plot).
An interesting observation: on average, the customers who left use the service more. Perhaps they are unhappy with the rates, and one of the measures against churn would be to lower the rates (the cost of mobile communication). But that is a matter for the company to settle with additional economic analysis of whether such measures would really pay off.
_, axes = plt.subplots(1, 2, sharey=True, figsize=(16,6))
sns.boxplot(x='Churn', y='Total day minutes', data=df, ax=axes[0]);
sns.violinplot(x='Churn', y='Total day minutes', data=df, ax=axes[1]);
次ã«ããµãŒãã¹ã»ã³ã¿ãŒãžã®åŒã³åºãæ°ã®ååžã瀺ããŸãïŒæåã®èšäºã§ãã®ãããªå³ãäœæããŸããïŒã 屿§ã®äžæã®å€ã¯å€ããããŸããïŒå±æ§ã¯éçæŽæ°ãŸãã¯é åºãšèŠãªãããšãã§ããŸãïŒ countplot
ã䜿çšããŠååžãããæç¢ºã«è¡šããŸãã 芳å¯ïŒæµåºã®å²åã¯ããµãŒãã¹ã»ã³ã¿ãŒãžã®4åã®åŒã³åºããã倧ããå¢å ããŸãã
sns.countplot(x='Customer service calls', hue='Churn', data=df);
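A count plot shows absolute counts, so the share of churn has to be read off by comparing bar heights. It can also be computed directly: the mean of a boolean column within a group is the fraction of True values. A sketch on a hypothetical handful of customers:

```python
import pandas as pd

# Hypothetical customers: number of service calls and whether they left
toy = pd.DataFrame({
    "Customer service calls": [0, 0, 1, 1, 1, 4, 4, 5],
    "Churn": [False, False, False, True, False, True, False, True],
})

# Fraction of churned customers for each number of calls
churn_share = toy.groupby("Customer service calls")["Churn"].mean()

print(churn_share)
```

Applied to the real data, the same expression gives the churn share per number of calls that the observation above refers to.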
Now let's look at how the binary features International plan and Voice mail plan relate to churn. Observation: when roaming is enabled, the share of churn is much higher, i.e. having international roaming is a strong feature. The same cannot be said about voice mail.
_, axes = plt.subplots(1, 2, sharey=True, figsize=(16,6))
sns.countplot(x='International plan', hue='Churn', data=df, ax=axes[0]);
sns.countplot(x='Voice mail plan', hue='Churn', data=df, ax=axes[1]);
Finally, let's see how the categorical feature State relates to churn. It is less pleasant to work with, because the number of unique states is fairly large: 51. We could start with a summary table or compute the churn percentage for each state. But the data for each state taken separately are scarce (there are only 3 to 17 churned customers per state), so perhaps the State feature should not be added to the classification models later because of the risk of overfitting (we will check this with cross-validation though, stay tuned!).
Churn shares for each state:
df.groupby(['State'])['Churn'].agg(['mean']).sort_values(by='mean', ascending=False).T
We can see that in New Jersey and California the churn share is above 25%, while in Hawaii and Alaska it is below 5%. But these conclusions rest on statistics that are far too modest, and this may well be just a peculiarity of the data at hand (here one could test hypotheses about the Matthews and Cramér correlations, but that is beyond the scope of this article).
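Peeking into n-dimensional space with t-SNE
Let's build a t-SNE representation of the same churn data. The name of the method is complicated, t-distributed Stochastic Neighbor Embedding, and so is the math (we will not go into it; for those who want to, there is the original article by Geoffrey Hinton and his graduate student in JMLR). The main idea, however, is as simple as it gets: find a mapping from the multidimensional feature space to a plane (or to 3D, but 2D is chosen almost always) such that points that are far from each other end up far apart on the plane as well, and close points end up close. In other words, neighbor embedding is a search for a new representation of the data that preserves neighborhood.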
A few details: we drop the State and Churn features and convert the binary Yes/No features to numbers (pd.factorize). We also need to scale the sample: subtract the mean from each feature and divide by its standard deviation, which is what StandardScaler does.
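What these two preprocessing steps do to the numbers can be seen on a toy frame. The values are invented, and the scaling is written out by hand with the population standard deviation (the variant StandardScaler uses) rather than through scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with one binary and one quantitative feature
toy = pd.DataFrame({
    "International plan": ["No", "Yes", "No", "No"],
    "Total day minutes": [100.0, 200.0, 300.0, 400.0],
})

# factorize replaces each distinct label with an integer code,
# numbered in order of first appearance
codes, uniques = pd.factorize(toy["International plan"])
toy["International plan"] = codes

# Standardization: subtract the column mean, divide by the column std
X_toy = toy.to_numpy(dtype=float)
toy_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(codes, list(uniques))
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```

After scaling every column has zero mean and unit standard deviation, so no single feature dominates the distances that t-SNE works with.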
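from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# X_scaled is not defined anywhere in the text; the lines below follow the steps
# described above (drop State and Churn, encode Yes/No as numbers, scale) and are an assumption
X = df.drop(['Churn', 'State'], axis=1)
X['International plan'] = pd.factorize(X['International plan'])[0]
X['Voice mail plan'] = pd.factorize(X['Voice mail plan'])[0]
X_scaled = StandardScaler().fit_transform(X)
%%time
tsne = TSNE(random_state=17)
tsne_representation = tsne.fit_transform(X_scaled)
CPU times: user 20 s, sys: 2.41 s, total: 22.4 s
Wall time: 21.9 s
plt.scatter(tsne_representation[:, 0], tsne_representation[:, 1]);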
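Let's color the resulting t-SNE representation of the data by churn (blue for loyal customers, orange for those who left).
# Churn is a boolean column, so the color map is keyed by False/True
plt.scatter(tsne_representation[:, 0], tsne_representation[:, 1],
            c=df['Churn'].map({False: 'blue', True: 'orange'}));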
We can see that the customers who left are mostly "clustered" in certain regions of the feature space.
To understand the picture better, we can also color it by the remaining binary features, roaming and voice mail. The blue regions correspond to the objects that have the binary feature.
_, axes = plt.subplots(1, 2, sharey=True, figsize=(16,6))
axes[0].scatter(tsne_representation[:, 0], tsne_representation[:, 1],
                c=df['International plan'].map({'Yes': 'blue', 'No': 'orange'}));
axes[1].scatter(tsne_representation[:, 0], tsne_representation[:, 1],
                c=df['Voice mail plan'].map({'Yes': 'blue', 'No': 'orange'}));
axes[0].set_title('International plan');
axes[1].set_title('Voice mail plan');
Now it is clear that, for example, many of the customers who left are grouped in the left cluster of people who have roaming enabled but no voice mail.
Finally, let's note the drawbacks of t-SNE (yes, it deserves a separate article):
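- High computational complexity. The sklearn implementation is slow on larger samples, where Multicore-TSNE is worth a look;
- The picture can change noticeably with a different random seed, which complicates the interpretation of t-SNE plots.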
yorko ( ).