When you create a ticket in a project-management and issue-tracking system, you would love to see an estimate of how long it will take to resolve. When a team receives a stream of incoming tickets, someone has to triage them, ordering the queue by priority and by the expected time to resolution. This lets both sides plan their time more effectively.

Below, I will talk about how I analyzed the data and trained ML models that predict the resolution time of tickets filed with our team.
I work in an SRE position on a team called LAB. We receive requests from both developers and QA: deploy a new test environment, update it to the latest release version, investigate and resolve various problems, and so on. These tasks are quite heterogeneous and, naturally, take different amounts of time to complete. Our team has existed for several years, and over that time a solid base of requests has accumulated. I decided to analyze that base and, building on it, use machine learning to create a model that predicts the expected resolution time of an issue (ticket).

We use JIRA at work, but the model described in this article is not tied to any particular product: the necessary information can be extracted from any database.
So, let's move from words to action.
Preliminary data analysis
We import everything we need and print the versions of the packages used.
Source code
import warnings
warnings.simplefilter('ignore')
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import datetime
import nltk
import sklearn
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from datetime import time, date

for package in [pd, np, matplotlib, sklearn, nltk]:
    print(package.__name__, 'version:', package.__version__)
pandas version: 0.23.4
numpy version: 1.15.0
matplotlib version: 2.2.2
sklearn version: 0.19.2
nltk version: 3.3
We load the data from a csv file. It contains information about tickets closed over the past one and a half years. Before being written to the file, the data went through a little preprocessing: for instance, commas and periods were removed from the free-text description fields. That is only preliminary, though; the text will be cleaned up further later on.
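That preliminary cleanup can be sketched roughly like this (a hypothetical illustration; the column name and the exact stripping rule are assumptions, since the article does not show the extraction code):

```python
import pandas as pd

# Hypothetical sketch of the preprocessing described above: stripping
# commas and periods from the free-text fields before writing the csv.
raw = pd.DataFrame({'Description': ['Env is down, please fix.',
                                    'Update env. to v2, thanks.']})
raw['Description'] = raw['Description'].str.replace(r'[.,]', '', regex=True)
print(raw['Description'].tolist())
```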
Let's look at the contents of the dataset. In total it holds 10,783 tickets.
Field | Description
Created | Date and time the ticket was created.
Resolved | Date and time the ticket was closed.
Resolution_time | Number of minutes elapsed between the ticket's creation and its closing. This is calendar time, because the company has offices in different countries working across different time zones, and the department has no single fixed working schedule.
Engineer_N | The "encoded" name of an engineer (quite a bit of the data in this article is "encoded" so that no personal or confidential information slips out by accident; in reality the names were simply changed). Each of these fields contains the number of tickets that engineer had "in progress" at the moment the given ticket arrived on the indicated date. These fields deserve special attention, and toward the end of the article I will explain them separately.
Assignee | The employee engaged in resolving the issue.
Issue_type | The ticket type.
Environment | The name of the test environment the ticket was created for (it can refer to a specific environment or to a whole location, such as a data center).
Priority | The ticket priority.
Worktype | The type of work expected for the ticket (adding or removing a server, updating an environment, monitoring work, and so on).
Description | The description.
Summary | The ticket title.
Watchers | The number of people watching the ticket, i.e. receiving e-mail notifications on every ticket activity.
Votes | The number of people who voted for the ticket, indicating its importance and their interest in it.
Reporter | The person who filed the ticket.
Engineer_N_vacation | Whether the engineer was on vacation when the ticket was filed.
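For instance, the Resolution_time column could have been derived like this (a minimal sketch; the article does not show the extraction code, so the details are assumed):

```python
import pandas as pd

# Minimal sketch: Resolution_time as calendar minutes between creation
# and closing (no business-hours calendar -- the team spans time zones).
tickets = pd.DataFrame({'Created':  ['2018-03-01 10:00:00'],
                        'Resolved': ['2018-03-02 11:30:00']})
delta = pd.to_datetime(tickets['Resolved']) - pd.to_datetime(tickets['Created'])
tickets['Resolution_time'] = (delta.dt.total_seconds() // 60).astype('int64')
print(tickets['Resolution_time'].tolist())  # one day plus 1.5 hours = 1530 minutes
```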
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10783 entries, ENV-36273 to ENV-49164
Data columns (total 37 columns):
Created                 10783 non-null object
Resolved                10783 non-null object
Resolution_time         10783 non-null int64
engineer_1              10783 non-null int64
engineer_2              10783 non-null int64
engineer_3              10783 non-null int64
engineer_4              10783 non-null int64
engineer_5              10783 non-null int64
engineer_6              10783 non-null int64
engineer_7              10783 non-null int64
engineer_8              10783 non-null int64
engineer_9              10783 non-null int64
engineer_10             10783 non-null int64
engineer_11             10783 non-null int64
engineer_12             10783 non-null int64
Assignee                10783 non-null object
Issue_type              10783 non-null object
Environment             10771 non-null object
Priority                10783 non-null object
Worktype                7273 non-null object
Description             10263 non-null object
Summary                 10783 non-null object
Watchers                10783 non-null int64
Votes                   10783 non-null int64
Reporter                10783 non-null object
engineer_1_vacation     10783 non-null int64
engineer_2_vacation     10783 non-null int64
engineer_3_vacation     10783 non-null int64
engineer_4_vacation     10783 non-null int64
engineer_5_vacation     10783 non-null int64
engineer_6_vacation     10783 non-null int64
engineer_7_vacation     10783 non-null int64
engineer_8_vacation     10783 non-null int64
engineer_9_vacation     10783 non-null int64
engineer_10_vacation    10783 non-null int64
engineer_11_vacation    10783 non-null int64
engineer_12_vacation    10783 non-null int64
dtypes: float64(12), int64(15), object(10)
memory usage: 3.1+ MB
In total we have 10 object fields (i.e. containing text values) and 27 numeric fields.
First of all, let's look for outliers in the data. As you can see, there are tickets whose resolution time runs into millions of minutes. That is clearly not relevant information; such data would only get in the way of building the model. It ended up here because data was collected from JIRA by querying on the Resolved field rather than the Created field: every ticket closed within the last 1.5 years landed in the dataset, but some of them could have been opened much earlier. It's time to get rid of them: we discard tickets created before June 1, 2017, and are left with 9,493 tickets.

As for how this happens: in any long-lived project you can easily find requests that hung open for a very long time for all sorts of reasons, and that were often closed not because the problem was solved but simply by a "statute of limitations".
Source code
df[['Created', 'Resolved', 'Resolution_time']].sort_values('Resolution_time', ascending=False).head()

Source code
df = df[df['Created'] >= '2017-06-01 00:00:00']
print(df.shape)
(9493, 33)
Now let's look for something interesting in the data. First, the simplest things: the most popular environments among the tickets, the most active reporters, and so on.
Source code
df.describe(include=['object'])

Source code
df['Environment'].value_counts().head(10)
Environment_104 442 ALL 368 Location02 367 Environment_99 342 Location03 342 Environment_31 322 Environment_14 254 Environment_1 232 Environment_87 227 Location01 202 Name: Environment, dtype: int64
Source code
df['Reporter'].value_counts().head()
Reporter_16 388 Reporter_97 199 Reporter_04 147 Reporter_110 145 Reporter_133 138 Name: Reporter, dtype: int64
Source code
df['Worktype'].value_counts()
Support 2482 Infrastructure 1655 Update environment 1138 Monitoring 388 QA 300 Numbers 110 Create environment 95 Tools 62 Delete environment 24 Name: Worktype, dtype: int64
Source code
df['Priority'].value_counts().plot(kind='bar', figsize=(12,7), rot=0, fontsize=14, title=' ');

What we have already learned: in most cases a ticket's priority is Normal; High occurs roughly half as often, and higher and lower priorities are rarer still. Low priority is almost never used; apparently people are afraid to set it, believing that such a ticket would hang in the queue for a long time and its resolution would be delayed. Later, once the model is built and we analyze its results, we will see that a low priority really does affect a task's time frame — and certainly not in the direction of speeding it up — so such fears may not be groundless after all.
Comparing the most popular environments and the most active reporters, you can see that Reporter_16 is far ahead of everyone else, while Environment_104 comes first among environments. If you haven't guessed yet, let me share a little secret: that reporter is from the team working on that very environment.
Let's see which environment the most critical tickets come from.
Source code
df[df['Priority'] == 'Critical']['Environment'].value_counts().index[0]
'Environment_91'
Now let's print how tickets of different priorities are distributed for that most "critical" environment.
Source code
df[df['Environment'] == df[df['Priority'] == 'Critical']['Environment'].value_counts().index[0]]['Priority'].value_counts()
High 62 Critical 57 Normal 46 Name: Priority, dtype: int64
Let's look at ticket resolution time in the context of priority. It is amusing to notice, for example, that the mean resolution time of low-priority tickets is over 70,000 minutes (about one and a half months). The dependence of resolution time on priority is easy to trace here.
Source code
df.groupby(['Priority'])['Resolution_time'].describe()

Or the same thing as a chart, this time by the median. As you can see, the picture has not changed much, so the remaining outliers do not seriously distort the distribution.
Source code
df.groupby(['Priority'])['Resolution_time'].median().sort_values().plot(kind='bar', figsize=(12,7), rot=0, fontsize=14);

Next, let's look at each engineer's mean ticket resolution time as a function of how many tickets that engineer already had in progress at the moment. These charts, surprisingly, are far from uniform: for some engineers resolution time grows as their current workload increases, for others the relationship is inverted, and for some no dependence can be traced at all.

Looking ahead, though: having this feature in the dataset improved the model's accuracy more than twofold, so it definitely influences resolution time. We just don't see the dependence — but the model does.
Source code
engineers = [i.replace('_vacation', '') for i in df.columns if 'vacation' in i]
cols = 2
rows = int(len(engineers) / cols)
fig, axes = plt.subplots(nrows=rows, ncols=cols, figsize=(16, 24))
for i in range(rows):
    for j in range(cols):
        df.groupby(engineers[i * cols + j])['Resolution_time'].mean().plot(
            kind='bar', rot=0, ax=axes[i, j]).set_xlabel('Engineer_' + str(i * cols + j + 1))
del cols, rows, fig, axes
And to wrap up the overview, let's build a small pairwise scatter matrix for the following features: ticket resolution time, number of votes, and number of watchers. The diagonal shows the distribution of each attribute.

Among other things, you can glimpse a relationship between resolution time and the number of watchers, and you can also see that people are not very active in using votes.
Source code
pd.plotting.scatter_matrix(df[['Resolution_time', 'Watchers', 'Votes']], figsize=(15, 15), diagonal='hist');
This completes the small preliminary data analysis. We checked for existing dependencies between the target attribute — the time it takes to resolve a ticket — and features such as the number of votes a ticket gets, the number of watchers behind it, its priority, and so on. Moving on.
Building the model. Preparing the features
Let's proceed to building the model itself. First, however, the features have to be brought into a form the model can understand: categorical attributes decomposed into one-hot vectors, and everything superfluous removed. For example, the model has no need for the ticket's creation and closing timestamps or for the Assignee field: in the end we will use the model to predict the resolution time of freshly filed tickets that have not yet been assigned to anyone.
As mentioned earlier, the target attribute is the time in which an issue is resolved, so we separate it out as its own vector and remove it from the main dataset. In addition, some fields turned out to be empty, since reporters do not always fill in the Description field when filing a ticket. In such cases pandas sets the value to NaN, and we replace those NaNs with empty strings.
Source code
y = df['Resolution_time']
df.drop(['Created', 'Resolved', 'Resolution_time', 'Assignee'], axis=1, inplace=True)
df['Description'].fillna('', inplace=True)
df['Summary'].fillna('', inplace=True)
We decompose the categorical features into one-hot vectors (one-hot encoding), leaving the ticket's Description and Summary fields aside for now; we will handle them separately and a bit differently. Some reporter names contain "[X]": that is how JIRA marks inactive employees who no longer work at the company. When the model is used in the future, tickets from these employees will no longer appear, so the data could be purged of them; instead I decided to strip the marker and keep them among the features.
Source code
def create_df(dic, feature_list):
    out = pd.DataFrame(dic)
    out = pd.concat([out, pd.get_dummies(out[feature_list])], axis=1)
    out.drop(feature_list, axis=1, inplace=True)
    return out

X = create_df(df, df.columns[df.dtypes == 'object'].drop(['Description', 'Summary']))
X.columns = X.columns.str.replace(' \[X\]', '')
Now let's deal with the ticket's Description field. We will work with it in one of the simplest possible ways: collect all the words used across the tickets, count the most popular ones, and discard the "extra" words that obviously have no bearing on the outcome — for example the word "please" (all communication in JIRA is strictly in English), which turns out to be extremely popular. Yes, we are polite people.
We also remove stop words, as defined by the nltk library, and clean the text of unneeded symbols as thoroughly as possible. Let me note that this is the simplest thing one can do with the text: the words could additionally be stemmed, and we could count the most popular N-grams instead of single words, but we will limit ourselves to the simple approach.
Source code
all_words = np.concatenate(df['Description'].apply(lambda s: s.split()).values)
stop_words = stopwords.words('english')
stop_words.extend(['please', 'hi', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '(', ')', '=', '{', '}'])
stop_words.extend(['h3', '+', '-', '@', '!', '#', '$', '%', '^', '&', '*', '(for', 'output)'])
stop_symbols = ['=>', '|', '[', ']', '#', '*', '\\', '/', '->', '>', '<', '&']
words_series = pd.Series(list(all_words))
del all_words
words_series = words_series[~words_series.isin(stop_words)]
for symbol in stop_symbols:
    words_series = words_series[~words_series.str.contains(symbol, regex=False, na=False)]
After all this we have a pandas.Series object containing all the words used. Let's look at the most popular of them, take the first 50 from the list, and try using them as features. For each ticket we check whether each word is used in its description; if it is, we put 1 in the corresponding column, otherwise 0.
Source code
usefull_words = list(words_series.value_counts().head(50).index)
print(usefull_words[0:10])
['error', 'account', 'info', 'call', '{code}', 'behavior', 'array', 'update', 'env', 'actual']
In the main dataset we create a separate column for each selected word. After that, the Description field itself can be dropped.
Source code
for word in usefull_words:
    X['Description_' + word] = X['Description'].str.contains(word).astype('int64')
X.drop('Description', axis=1, inplace=True)
We do the same with the ticket's Summary (title) field.
Source code
all_words = np.concatenate(df['Summary'].apply(lambda s: s.split()).values)
words_series = pd.Series(list(all_words))
del all_words
words_series = words_series[~words_series.isin(stop_words)]
for symbol in stop_symbols:
    words_series = words_series[~words_series.str.contains(symbol, regex=False, na=False)]
usefull_words = list(words_series.value_counts().head(50).index)
for word in usefull_words:
    X['Summary_' + word] = X['Summary'].str.contains(word).astype('int64')
X.drop('Summary', axis=1, inplace=True)
Let's check the result: the shapes of the feature matrix X and the answer vector y (X.shape, y.shape).
((9493, 1114), (9493,))
Next we split the data into a training sample and a holdout (test) sample in a 75/25 ratio: we will train on 7,119 examples and evaluate the models on 2,374. Note also that after the one-hot expansion of the categorical attributes, the dimensionality of the feature matrix has grown to 1,114.
Source code
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.25, random_state=17)
print(X_train.shape, X_holdout.shape)
((7119, 1114), (2374, 1114))
Now let's train the models.
Linear regression
We start with the lightest and (predictably) least accurate model: linear regression. We will evaluate it both on the training data and on the holdout sample, i.e. data the model has not seen.

On the training data the model looks more or less tolerable, but its accuracy on the holdout sample turns out catastrophically low: even worse than simply predicting the overall mean across all tickets.
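The "just predict the overall mean" baseline mentioned above can be reproduced with sklearn's DummyRegressor — a toy illustration on synthetic numbers, not the article's data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Tiny synthetic target just to show the mechanics of the baseline.
y_train = np.array([100.0, 200.0, 300.0, 400.0])
y_test = np.array([150.0, 350.0])

baseline = DummyRegressor(strategy='mean')
baseline.fit(np.zeros((len(y_train), 1)), y_train)   # the features are ignored
pred = baseline.predict(np.zeros((len(y_test), 1)))  # always the train mean, 250.0
print(mean_absolute_error(y_test, pred))
```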
Here we need to take a short break and explain how the models evaluate their quality via the score method. The evaluation uses the coefficient of determination:

R² = 1 − Σᵢ (yᵢ − aᵢ)² / Σᵢ (yᵢ − ȳ)²

where aᵢ is the model's prediction for the i-th example, yᵢ is the true value, and ȳ is the mean over the whole sample.

We will not dwell on this coefficient for now. Just note that it does not fully reflect the accuracy we are interested in, so in parallel we will also evaluate the models with the mean absolute error (MAE) and rely mainly on it.
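A quick sanity check of the two metrics on toy numbers — the manual formula against sklearn's implementations:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

y = np.array([1.0, 2.0, 3.0, 4.0])  # true values
a = np.array([1.5, 2.0, 2.5, 4.5])  # model predictions

# R^2 = 1 - sum((y_i - a_i)^2) / sum((y_i - y_mean)^2)
r2_manual = 1 - ((y - a) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(r2_manual, r2_score(y, a))  # both 0.85

# MAE = mean(|y_i - a_i|) -- the metric we will mostly rely on.
print(mean_absolute_error(y, a))  # 0.375
```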
Source code
lr = LinearRegression()
lr.fit(X_train, y_train)
print('R^2 train:', lr.score(X_train, y_train))
print('R^2 test:', lr.score(X_holdout, y_holdout))
print('MAE train:', mean_absolute_error(lr.predict(X_train), y_train))
print('MAE test:', mean_absolute_error(lr.predict(X_holdout), y_holdout))
R^2 train: 0.3884389470220214
R^2 test: -6.652435243123196e+17
MAE train: 8503.67256637168
MAE test: 1710257520060.8154
Gradient boosting

Well, where would we be without gradient boosting? Let's train a model and see what happens. We will use the notorious XGBoost for this, starting with its default hyperparameter settings.
Source code
import xgboost
xgb = xgboost.XGBRegressor()
xgb.fit(X_train, y_train)
print('R^2 train:', xgb.score(X_train, y_train))
print('R^2 test:', xgb.score(X_holdout, y_holdout))
print('MAE train:', mean_absolute_error(xgb.predict(X_train), y_train))
print('MAE test:', mean_absolute_error(xgb.predict(X_holdout), y_holdout))
R^2 train: 0.5138516547636054
R^2 test: 0.12965507684512545
MAE train: 7108.165167471887
MAE test: 8343.433260957032
The out-of-the-box result is already not bad. Let's try to improve it by tuning the hyperparameters (n_estimators, learning_rate, max_depth). We end up with the values 150, 0.1, and 3 respectively: they showed the best result on the test sample without the model badly overfitting the training data.
Selecting n_estimators (note: in the figure below the y-axis is labeled "R^2 Score"; the score plotted is actually MAE).
xgb_model_abs_testing = list()
xgb_model_abs_training = list()
rng = np.arange(1, 151)
for i in rng:
    xgb = xgboost.XGBRegressor(n_estimators=i)
    xgb.fit(X_train, y_train)
    xgb_model_abs_testing.append(mean_absolute_error(xgb.predict(X_holdout), y_holdout))
    xgb_model_abs_training.append(mean_absolute_error(xgb.predict(X_train), y_train))
plt.figure(figsize=(14, 8))
plt.plot(rng, xgb_model_abs_testing, label='MAE test')
plt.plot(rng, xgb_model_abs_training, label='MAE train')
plt.xlabel('Number of estimators')
plt.ylabel('MAE')
plt.legend(loc='best')
plt.show()

Selecting learning_rate

xgb_model_abs_testing = list()
xgb_model_abs_training = list()
rng = np.arange(0.05, 0.65, 0.05)
for i in rng:
    xgb = xgboost.XGBRegressor(n_estimators=150, random_state=17, learning_rate=i)
    xgb.fit(X_train, y_train)
    xgb_model_abs_testing.append(mean_absolute_error(xgb.predict(X_holdout), y_holdout))
    xgb_model_abs_training.append(mean_absolute_error(xgb.predict(X_train), y_train))
plt.figure(figsize=(14, 8))
plt.plot(rng, xgb_model_abs_testing, label='MAE test')
plt.plot(rng, xgb_model_abs_training, label='MAE train')
plt.xlabel('Learning rate')
plt.ylabel('MAE')
plt.legend(loc='best')
plt.show()

Selecting max_depth

xgb_model_abs_testing = list()
xgb_model_abs_training = list()
rng = np.arange(1, 11)
for i in rng:
    xgb = xgboost.XGBRegressor(n_estimators=150, random_state=17, learning_rate=0.1, max_depth=i)
    xgb.fit(X_train, y_train)
    xgb_model_abs_testing.append(mean_absolute_error(xgb.predict(X_holdout), y_holdout))
    xgb_model_abs_training.append(mean_absolute_error(xgb.predict(X_train), y_train))
plt.figure(figsize=(14, 8))
plt.plot(rng, xgb_model_abs_testing, label='MAE test')
plt.plot(rng, xgb_model_abs_training, label='MAE train')
plt.xlabel('Maximum depth')
plt.ylabel('MAE')
plt.legend(loc='best')
plt.show()

Now we train the model with the selected hyperparameters.
Source code
xgb = xgboost.XGBRegressor(n_estimators=150, random_state=17, learning_rate=0.1, max_depth=3)
xgb.fit(X_train, y_train)
print('R^2 train:', xgb.score(X_train, y_train))
print('R^2 test:', xgb.score(X_holdout, y_holdout))
print('MAE train:', mean_absolute_error(xgb.predict(X_train), y_train))
print('MAE test:', mean_absolute_error(xgb.predict(X_holdout), y_holdout))
R^2 train: 0.6745967150462303
R^2 test: 0.15415143189670344
MAE train: 6328.384400466232
MAE test: 8217.07897417256
éžæããããã©ã¡ãŒã¿ãŒãšèŠèŠåæ©èœã®éèŠæ§ãæã€æçµçµæ-ã¢ãã«ã«å¿ããæšèã®éèŠæ§ã ãããããã±ãããªãã¶ãŒããŒã®æ°ã§ããã4人ã®ãšã³ãžãã¢ãããã«è¡ããŸãã ãããã£ãŠããã±ããã®éçšæéã¯ããšã³ãžãã¢ã®éçšã«ãã£ãŠéåžžã«åŒ·ã圱é¿ãåããå¯èœæ§ããããŸãã ãããŠããããã®ããã€ãã®èªç±æéãããéèŠã§ããããšã¯è«ççã§ãã ããŒã ã«ã·ãã¢ãšã³ãžãã¢ãšããã«ããããšããçç±ã ãã§ïŒããŒã ã«ãžã¥ãã¢ãããªãå ŽåïŒã ã¡ãªã¿ã«ãç§å¯è£ã«ãæåã®å Žæã®ãšã³ãžãã¢ïŒãªã¬ã³ãžè²ã®ããŒïŒã¯ãããŒã å
šäœã§æãçµéšè±å¯ãªãšã³ãžãã¢ã®1人ã§ãã ããã«ããããã®ãšã³ãžãã¢ã®4人å
šå¡ã®åœ¹è·ã«ã·ãã¢ãã¬ãã£ãã¯ã¹ãä»ããŠããŸãã ã¢ãã«ããããããäžåºŠç¢ºèªããããšãããããŸãã
Source code
features_df = pd.DataFrame(data=xgb.feature_importances_.reshape(1, -1), columns=X.columns).sort_values(axis=1, by=[0], ascending=False)
features_df.loc[0][0:10].plot(kind='bar', figsize=(16, 8), rot=75, fontsize=14);

Neural networks

We will not stop at gradient boosting alone, however: let's try to train a neural network — more precisely, a multilayer perceptron, a fully connected feed-forward network. This time we do not start from the default hyperparameters: in the sklearn library we are using, the default is a single hidden layer of 100 neurons, and during training the model warned that it failed to converge within the standard 200 iterations. So we will go straight to three hidden layers of 300, 200, and 100 neurons.
As a result we see that the model overfits the training sample heavily — there was no avoiding that — and on the test sample it shows a result somewhat worse than gradient boosting.
Source code
from sklearn.neural_network import MLPRegressor
nn = MLPRegressor(random_state=17, hidden_layer_sizes=(300, 200, 100), alpha=0.03,
                  learning_rate='adaptive', learning_rate_init=0.0005,
                  max_iter=200, momentum=0.9, nesterovs_momentum=True)
nn.fit(X_train, y_train)
print('R^2 train:', nn.score(X_train, y_train))
print('R^2 test:', nn.score(X_holdout, y_holdout))
print('MAE train:', mean_absolute_error(nn.predict(X_train), y_train))
print('MAE test:', mean_absolute_error(nn.predict(X_holdout), y_holdout))
R^2 train: 0.9771443840549647
R^2 test: -0.15166596239118246
MAE train: 1627.3212161350423
MAE test: 8816.204561947616
Let's see what can be achieved by choosing a better network architecture. First we train several models with one and two hidden layers, to verify once more that a single-layer network does not manage to converge in 200 iterations; the plots confirm that it converges very slowly. Adding another layer already helps considerably.
Source code and chart
plt.figure(figsize=(14, 8))
for i in [(500,), (750,), (1000,), (500, 500)]:
    nn = MLPRegressor(random_state=17, hidden_layer_sizes=i, alpha=0.03,
                      learning_rate='adaptive', learning_rate_init=0.0005,
                      max_iter=200, momentum=0.9, nesterovs_momentum=True)
    nn.fit(X_train, y_train)
    plt.plot(nn.loss_curve_, label=str(i))
plt.xlabel('Iterations')
plt.ylabel('MSE')
plt.legend(loc='best')
plt.show()

And now we train many models with quite different architectures, from 3 to 10 hidden layers.

plt.figure(figsize=(14, 8))
for i in [(500, 300, 100),
          (80, 60, 60, 60, 40, 40, 40, 40, 20, 10),
          (80, 60, 60, 40, 40, 40, 20, 10),
          (150, 100, 80, 60, 40, 40, 20, 10),
          (200, 100, 100, 100, 80, 80, 80, 40, 20),
          (80, 40, 20, 20, 10, 5),
          (300, 250, 200, 100, 80, 80, 80, 40, 20)]:
    nn = MLPRegressor(random_state=17, hidden_layer_sizes=i, alpha=0.03,
                      learning_rate='adaptive', learning_rate_init=0.001,
                      max_iter=200, momentum=0.9, nesterovs_momentum=True)
    nn.fit(X_train, y_train)
    plt.plot(nn.loss_curve_, label=str(i))
plt.xlabel('Iterations')
plt.ylabel('MSE')
plt.legend(loc='best')
plt.show()

"" (200, 100, 100, 100, 80, 80, 80, 40, 20) :
2506
7351
, , . learning rate .
Source code
nn = MLPRegressor(random_state=17, hidden_layer_sizes=(200, 100, 100, 100, 80, 80, 80, 40, 20),
                  alpha=0.1, learning_rate='adaptive', learning_rate_init=0.007,
                  max_iter=200, momentum=0.9, nesterovs_momentum=True)
nn.fit(X_train, y_train)
print('R^2 train:', nn.score(X_train, y_train))
print('R^2 test:', nn.score(X_holdout, y_holdout))
print('MAE train:', mean_absolute_error(nn.predict(X_train), y_train))
print('MAE test:', mean_absolute_error(nn.predict(X_holdout), y_holdout))
R^2 train: 0.836204705204337
R^2 test: 0.15858607391959356
MAE train: 4075.8553476632796
MAE test: 7530.502826043687
The result is now a bit better than gradient boosting's, although the model still overfits. Let's also look at which features this network considers important. sklearn's MLPRegressor does not expose feature importances directly, but we can peek at the weights of the first hidden layer (200 neurons here): for each neuron, take the input feature with the largest absolute weight, then count how often each feature comes out on top. Curiously, a noticeable share of the neurons — roughly 30 of the 200 — react most strongly to the issue type Epic: the network apparently figured out on its own that Epics are a special kind of ticket living by different rules. The five most frequent "winning" features are plotted below.
Source code
pd.Series([X_train.columns[abs(nn.coefs_[0][:, i]).argmax()]
           for i in range(nn.hidden_layer_sizes[0])]) \
    .value_counts().head(5).sort_values() \
    .plot(kind='barh', title='Feature importance', fontsize=14, figsize=(14, 8));

Ensembling. Why? Our best MAE scores are 7530 for the neural network and 8217 for boosting. Averaging those numbers gives (7530 + 8217) / 2 = 7873, so surely combining the models cannot beat the network alone? No, that's not how it works: if we average the two models' predictions rather than their scores, the resulting MAE drops to 7526 — better than either model on its own.

This simple trick — averaging the predictions of models from different families, whose errors partially cancel each other out — is used all the time, for example in kaggle competitions.
Source code
nn_predict = nn.predict(X_holdout)
xgb_predict = xgb.predict(X_holdout)
print('NN MSE:', mean_squared_error(nn_predict, y_holdout))
print('XGB MSE:', mean_squared_error(xgb_predict, y_holdout))
print('Ensemble:', mean_squared_error((nn_predict + xgb_predict) / 2, y_holdout))
print('NN MAE:', mean_absolute_error(nn_predict, y_holdout))
print('XGB MAE:', mean_absolute_error(xgb_predict, y_holdout))
print('Ensemble:', mean_absolute_error((nn_predict + xgb_predict) / 2, y_holdout))
NN MSE: 628107316.262393
XGB MSE: 631417733.4224195
Ensemble: 593516226.8298339
NN MAE: 7530.502826043687
XGB MAE: 8217.07897417256
Ensemble: 7526.763569558157
Analyzing the results
So what do we have in the end? The ensemble's mean absolute error is about 7,500 minutes, i.e. on average the prediction misses by roughly 5 calendar days. Is that a lot? Let's look at where the error actually comes from.
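The arithmetic behind "roughly 5 days" (remember that Resolution_time counts calendar minutes):

```python
# 7500 calendar minutes, converted to days.
print(7500 / 60 / 24)  # ~5.2 days
```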
The ten largest errors (in minutes):
Source code
((nn_predict + xgb_predict) / 2 - y_holdout).apply(np.abs).sort_values(ascending=False).head(10).values
[469132.30504392, 454064.03521379, 252946.87342439, 251786.22682697, 224012.59016987, 15671.21520735, 13201.12440327, 203548.46460229, 172427.32150665, 171088.75543224]
Let's print information about these tickets and see what they are.
Source code
df.loc[((nn_predict + xgb_predict) / 2 - y_holdout).apply(np.abs).sort_values(ascending=False).head(10).index][['Issue_type', 'Priority', 'Worktype', 'Summary', 'Watchers']]

As you can see, these are tickets that for one reason or another hung open for an extremely long time; no model could have anticipated that from the information available at creation time.
Now the tickets with the smallest prediction errors.
Source code
print(((nn_predict + xgb_predict) / 2 - y_holdout).apply(np.abs).sort_values().head(10).values)
df.loc[((nn_predict + xgb_predict) / 2 - y_holdout).apply(np.abs).sort_values().head(10).index][['Issue_type', 'Priority', 'Worktype', 'Summary', 'Watchers']]
[ 1.24606014, 2.6723969, 4.51969139, 10.04159236, 11.14335444, 14.4951508, 16.51012874, 17.78445744, 21.56106258, 24.78219295]

Here, too, no obvious pattern emerges: the tickets that are predicted almost exactly do not stand out by type, priority, or work type. So it is hard to say in advance for which tickets the forecast will be spot on and for which it will miss badly.
The Engineer_N fields
As promised, it is time to explain where the 'Engineer_N' fields came from and why they are there. The idea is that a ticket's resolution time should depend on how loaded the team is at the moment the ticket arrives: each of these fields stores the number of tickets the corresponding engineer had "In Progress" at the moment the given ticket was created.

Collecting this retroactively is the hard part. JIRA does not store such snapshots directly, but they can be recovered from the issue history: for each of the 12 engineers and each ticket, a JQL query of the following form was run:

assignee was engineer_N during (ticket_creation_date) and status was "In Progress"

That makes 10783 * 12 = 129396 queries in total, so gathering this data took a long time. As mentioned earlier, it was worth it: adding these features roughly doubled the model's accuracy.
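The queries themselves ran against JIRA, but the logic of the feature is easy to reproduce offline. Below is a hypothetical sketch assuming we already have a history table with each ticket's assignee and its "In Progress" interval; the table layout and names are illustrative assumptions, not the article's actual data:

```python
import pandas as pd

# Hypothetical ticket history: who worked on each ticket and when it
# sat in the "In Progress" status.
history = pd.DataFrame({
    'assignee': ['engineer_1', 'engineer_1', 'engineer_2'],
    'start': pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-02']),
    'end':   pd.to_datetime(['2018-01-10', '2018-01-06', '2018-01-03']),
})

def in_progress_count(engineer, moment):
    """Offline equivalent of the JQL query above: how many tickets the
    engineer had 'In Progress' at the given moment."""
    mask = ((history['assignee'] == engineer)
            & (history['start'] <= moment)
            & (history['end'] >= moment))
    return int(mask.sum())

moment = pd.Timestamp('2018-01-05')
print(in_progress_count('engineer_1', moment))  # 2: both tickets active
print(in_progress_count('engineer_2', moment))  # 0: finished on Jan 3
```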
Results and plans for the future

So we now have a model that predicts, at the moment a ticket is filed, how long it will take to resolve, with a mean absolute error of about five days. It can already be used to set reporters' expectations and to keep an eye on our SLO.

Going forward, the plan is to enrich the dataset with additional features, retrain the models periodically on fresh data, and see how much further the prediction quality can be pushed.