
The traditional tools of data science are languages such as R and Python: their relaxed syntax and a wealth of machine learning and data processing libraries make it possible to get a working solution quickly. However, there are situations where the limitations of these tools become a serious obstacle, above all when you need high processing speed or have to work with very large datasets. In that case the specialist reluctantly has to turn to the "dark side" and bring in tools from the "industrial" programming languages: Scala, Java, and C++.
But is that side really so dark? Over years of development, the tools of "industrial" data science have come a long way and today differ considerably from their own versions of two or three years ago. Using the SNA Hackathon 2019 task as an example, let's see how well the Scala + Spark ecosystem measures up to Python data science.
In SNA Hackathon 2019, participants solve the problem of sorting a social network user's news feed in one of three "disciplines": using data from texts, images, or feature logs. In this article we look at how to solve the log-based task in Spark using classic machine learning tools.
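To solve the problem, we follow the standard path that any data analyst takes when developing a model:
- Carry out exploratory data analysis and build charts.
- Analyze the statistical properties of the features in the data and look at how they differ between the training and test sets.
- Make an initial feature selection based on the statistical properties.
- Compute correlations between the features and the target variable, as well as cross-correlations among the features.
- Form the final feature set, train the model, and check its quality.
- Analyze the internal structure of the model to identify areas for improvement.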
Along the way we will get acquainted with tools such as the Zeppelin interactive notebook, the Spark ML machine learning library and its extension PravdaML, the GraphX graph package, the Vegas visualization library and, of course, Apache Spark itself. All the code and experiment results are available on the Zepl collaborative notebook platform.
Loading the data
A peculiarity of the data published for SNA Hackathon 2019 is that it can be processed directly with Python, but the source data is packed very efficiently thanks to the Apache Parquet columnar format, and when read into memory naively it expands to several tens of gigabytes. With Apache Spark there is no need to load the data into memory completely: the Spark architecture is designed to process data in chunks, loading them from disk as needed.
So the first step, checking the distribution of the data by day, is easy to do with out-of-the-box tools:
import org.apache.spark.sql.functions

// Read the training set and count rows, distinct users, objects and owners per day
val train = sqlContext.read.parquet("/events/hackatons/SNAHackathon/2019/collabTrain")

z.show(train.groupBy($"date").agg(
    functions.count($"instanceId_userId").as("count"),
    functions.countDistinct($"instanceId_userId").as("users"),
    functions.countDistinct($"instanceId_objectId").as("objects"),
    functions.countDistinct($"metadata_ownerId").as("owners"))
  .orderBy("date"))
The corresponding chart as displayed in Zeppelin:

Scala's syntax is quite flexible, and the same code could look, for example, like this:
import org.apache.spark.sql.functions._

// Same query written in infix notation
val train = sqlContext.read.parquet("/events/hackatons/SNAHackathon/2019/collabTrain")

z.show(
  train groupBy $"date" agg(
    count($"instanceId_userId") as "count",
    countDistinct($"instanceId_userId") as "users",
    countDistinct($"instanceId_objectId") as "objects",
    countDistinct($"metadata_ownerId") as "owners")
  orderBy "date"
)
An important warning is in order here: in a large team where everyone writes Scala code purely according to their own taste, communication becomes much harder. It is therefore better to agree on a unified code style.
But back to the task. The simple per-day analysis revealed anomalous points on February 17 and 18. Probably incomplete data was collected on those days, and the feature distributions may be shifted. This should be taken into account in further analysis. In addition, the number of unique users is very close to the number of objects, so it makes sense to study how users are distributed by number of objects:
z.show(filteredTrain
  .groupBy($"instanceId_userId").count
  .groupBy("count").agg(functions.log(functions.count("count")).as("withCount"))
  .orderBy($"withCount".desc)
  .limit(100)
  .orderBy($"count"))

As expected, the distribution is close to exponential, with a very long tail. In tasks like this, quality can usually be improved by segmenting the models for users with different levels of activity. To check whether this is worth doing, we compare against the distribution of the number of objects per user in the test set:

The comparison shows that test users have at least two objects in their logs (since the hackathon task is a ranking problem, this is a necessary condition for evaluating quality). From here on, I recommend looking more closely at such users in the training set, for which we declare a user-defined function (UDF) with a filter.
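The UDF itself is not reproduced in this text. As a rough, plain-Scala illustration of the kind of filter involved, the sketch below keeps only users whose log holds at least two objects and, as described later for "test-like" users, both positive and negative examples. The `Event` type, the `TestLikeFilter` object and the use of the `"Liked"` value as the positive marker are assumptions made for the example, not the article's code.

```scala
// Hypothetical record type: one row of the feature log (names are illustrative).
final case class Event(userId: Long, objectId: Long, feedback: Seq[String])

object TestLikeFilter {
  // A user is kept when the log holds at least two objects for them and
  // those objects include both a "Liked" and a non-"Liked" example.
  def isTestLike(events: Seq[Event]): Boolean = {
    val liked = events.count(_.feedback.contains("Liked"))
    events.size >= 2 && liked > 0 && liked < events.size
  }

  // Keep only the events of users that pass the check.
  def filterUsers(events: Seq[Event]): Seq[Event] = {
    val keep = events.groupBy(_.userId).collect {
      case (userId, userEvents) if isTestLike(userEvents) => userId
    }.toSet
    events.filter(e => keep.contains(e.userId))
  }
}
```

In Spark the same predicate would typically be registered as a UDF and applied to per-user aggregates rather than to an in-memory collection.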
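Another important remark is in order here: it is precisely in defining UDFs that using Spark from Scala/Java differs markedly from using it from Python. As long as PySpark code sticks to the built-in functionality, everything runs at almost the same speed, but as soon as custom functions appear, PySpark performance degrades by an order of magnitude.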
The first ML pipeline
In the next step we will try to compute basic statistics on actions and features. For this we need the capabilities of SparkML, so let's first look at its general architecture:

SparkML is built on the following concepts:
- Transformer: takes a dataset as input and returns a modified dataset (transform). As a rule, it is used to implement pre- and post-processing algorithms and feature extraction, and it can also represent the resulting ML models.
- Estimator: takes a dataset as input and returns a Transformer (fit). Naturally, an Estimator can represent an ML algorithm.
- Pipeline: a special case of an Estimator, consisting of a chain of Transformers and Estimators. When fit is called, it walks the chain: if it meets a Transformer, it applies it to the data; if it meets an Estimator, it trains a Transformer with it, applies that to the data, and moves on.
- PipelineModel: the result of fitting a Pipeline. It also contains a chain inside, but one made up of Transformers only. Accordingly, a PipelineModel is itself a Transformer.
This approach to building ML algorithms helps achieve a clear modular structure and good reproducibility: both models and pipelines can be saved.
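The way fit walks the chain can be shown with a toy model of these four concepts. This is a minimal sketch in plain Scala, not Spark's implementation: `Dataset` is just a vector of numbers standing in for a DataFrame, and `Log1p` and `MeanCentering` are made-up stages.

```scala
object ToyPipeline {
  type Dataset = Vector[Double] // stand-in for a Spark DataFrame

  sealed trait Stage
  trait Transformer extends Stage { def transform(data: Dataset): Dataset }
  trait Estimator extends Stage { def fit(data: Dataset): Transformer }

  // A fitted pipeline holds transformers only, so it is itself a transformer.
  final case class PipelineModel(stages: Vector[Transformer]) extends Transformer {
    def transform(data: Dataset): Dataset =
      stages.foldLeft(data)((d, t) => t.transform(d))
  }

  // fit walks the chain: transformers are applied as they are, estimators are
  // first trained on the data produced so far, and the result is applied.
  final case class Pipeline(stages: Vector[Stage]) extends Estimator {
    def fit(data: Dataset): PipelineModel = {
      val (_, fitted) = stages.foldLeft((data, Vector.empty[Transformer])) {
        case ((current, acc), stage) =>
          val t = stage match {
            case tr: Transformer => tr
            case es: Estimator   => es.fit(current)
          }
          (t.transform(current), acc :+ t)
      }
      PipelineModel(fitted)
    }
  }

  // Example transformer: no training needed.
  object Log1p extends Transformer {
    def transform(data: Dataset): Dataset = data.map(v => math.log1p(v))
  }

  // Example estimator: learns the mean, returns a transformer that subtracts it.
  object MeanCentering extends Estimator {
    def fit(data: Dataset): Transformer = {
      val mean = data.sum / data.size
      new Transformer { def transform(d: Dataset): Dataset = d.map(_ - mean) }
    }
  }
}
```

Calling `ToyPipeline.Pipeline(Vector(ToyPipeline.Log1p, ToyPipeline.MeanCentering)).fit(data)` yields a `PipelineModel` with two transformers, the second of which remembers the mean learned from the log-transformed training data.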
To begin with, let's build a simple pipeline that computes statistics on the distribution of user actions (the feedback field) in the training set:
val feedbackAggregator = new Pipeline().setStages(Array(
This pipeline makes heavy use of PravdaML, a library of extended, more convenient building blocks for SparkML, namely:
- MultinominalExtractor encodes a feature of type "array of strings" into a vector on the one-hot principle. It is the only estimator in the pipeline (to build the encoding, the unique strings have to be collected from the dataset).
- VectorStatCollector computes statistics over vectors.
- VectorExplode converts the result into a format convenient for visualization.
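To make the encoding step concrete, here is a small plain-Scala sketch of the idea behind it: a fit step that collects the distinct strings, a transform step that turns each array into a 0/1 vector, and a column-wise mean as the simplest vector statistic. This only illustrates the principle; it is not how PravdaML implements MultinominalExtractor or VectorStatCollector, and the `MultiHot` object is a hypothetical name.

```scala
object MultiHot {
  // "fit": collect the distinct strings and give each one a vector index.
  def fit(rows: Seq[Seq[String]]): Map[String, Int] =
    rows.flatten.distinct.sorted.zipWithIndex.toMap

  // "transform": mark each known string of a row in a dense 0/1 vector.
  def transform(vocab: Map[String, Int])(row: Seq[String]): Vector[Double] = {
    val indices = row.flatMap(s => vocab.get(s)).toSet
    Vector.tabulate(vocab.size)(i => if (indices(i)) 1.0 else 0.0)
  }

  // Column-wise mean of the encoded rows: the share of rows containing each value.
  def columnMeans(vectors: Seq[Vector[Double]]): Vector[Double] =
    vectors.transpose.map(col => col.sum / col.size).toVector
}
```

Applied to the feedback field, the column means directly give the share of rows carrying each action, which is the kind of class-balance picture discussed next.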
The result is a chart showing that the classes in the dataset are imbalanced, although the imbalance for the target Liked class is not extreme:

Analyzing the same distribution among users similar to those in the test set (having both "positive" and "negative" examples in their logs) shows that it is shifted toward the positive class:

Statistical analysis of features
In the next stage we carry out a detailed analysis of the statistical properties of the features. This time we need a bigger pipeline:
val statsAggregator = new Pipeline().setStages(Array( new NullToDefaultReplacer(),
Since we now need to process all the features at once rather than a single field, we use two more handy PravdaML utilities:
- NullToDefaultReplacer replaces missing elements in the data with default values (0 for numbers, false for booleans, and so on). Without this conversion, NaN values appear in the resulting vectors, which is fatal for many algorithms (although XGBoost, for example, can tolerate them). Instead of zeros you can substitute the mean, which is implemented in NaNToMeanReplacerEstimator.
- AutoAssembler is a very powerful utility that analyzes the table schema and picks, for each column, a vectorization scheme that matches the column's type.
Using the resulting pipeline, we compute statistics for three sets (training, training with the user filter, and test) and save them to separate files.
With three datasets of feature statistics in hand, we analyze the following:
- Are there features with large outliers?
  - Such features should be clipped, or the records with outliers filtered out.
- Are there features whose mean is strongly shifted relative to the median?
  - Such a shift often occurs with power-law distributions, and it makes sense to take the logarithm of these features.
- Is there a shift in the mean between the training and test sets?
- How densely is the feature matrix filled?
The following query helps clarify these aspects:
def compareWithTest(data: DataFrame) : DataFrame = { data.where("date = 'All'") .select( $"features",
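At this stage visualization becomes a pressing issue: with Zeppelin's built-in tools it is hard to display all these aspects at once, and notebooks with a huge number of charts start to slow down noticeably because of the bloated DOM. Vegas, a Scala DSL library for building vega-lite specifications, can solve this problem. Vegas not only provides rich visualization capabilities (comparable to matplotlib) but also draws on a Canvas without bloating the DOM :).
The specification of the chart we are interested in looks like this: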
vegas.Vegas(width = 1024, height = 648)
The chart below should be read as follows:
- The X axis shows the shift of the distribution centers between the test and training sets (the closer to 0, the more stable the feature).
- The Y axis shows the percentage of non-zero elements (the higher the value, the more points have data for the feature).
- The size shows the shift of the mean relative to the median (the larger the point, the more likely a power-law distribution).
- The color indicates the presence of outliers (the redder, the more outliers).
- Finally, the shape distinguishes the comparison mode: with or without the user filter on the training set.

So we can draw the following conclusions:
- Some features need an outlier filter: we will cap their maximum values at the 90th percentile.
- Some features show a distribution close to exponential: we will take their logarithm.
- Some features are absent from the test set: we will exclude them from training.
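The first two fixes are simple enough to sketch in plain Scala. The example below is an illustration under assumptions, not the article's feature-extraction code: it uses a nearest-rank percentile and `log(1 + x)` for the logarithm, and the `FeatureFixes` object is a hypothetical helper.

```scala
object FeatureFixes {
  // Nearest-rank percentile of a non-empty sample, p in (0, 100].
  def percentile(values: Seq[Double], p: Double): Double = {
    val sorted = values.sorted
    val rank = math.ceil(p * sorted.size / 100.0).toInt
    sorted(math.max(rank, 1) - 1)
  }

  // Cap every value at the given upper bound (for example the 90th percentile).
  def clip(values: Seq[Double], upper: Double): Seq[Double] =
    values.map(v => math.min(v, upper))

  // log(1 + x) compresses a long right tail of non-negative values and keeps 0 at 0.
  def logScale(values: Seq[Double]): Seq[Double] = values.map(v => math.log1p(v))

  def mean(values: Seq[Double]): Double = values.sum / values.size
  def median(values: Seq[Double]): Double = percentile(values, 50)
}
```

For a sample such as 1, 2, ..., 9, 1000 the mean sits far above the median because of a single outlier; capping at the 90th percentile brings the two back together, which is exactly the symptom the chart encodes as point size and color.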
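Correlation analysis
Having got a general idea of how the features are distributed and how they relate between the training and test sets, let's analyze correlations. To do this, we configure feature extraction based on the previous observations.
Among the new machinery in this pipeline, the SQLTransformer utility is worth noting: it allows arbitrary SQL transformations of the input table.
When analyzing correlations, it is important to filter out the noise created by the natural correlation between one-hot features. To do this, we need to know which elements of the vector correspond to which source columns. In Spark this is done using column metadata (stored together with the data) and attribute groups. The following code block is used to filter out pairs of attribute names that originate from the same column of type String: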
val attributes = AttributeGroup.fromStructField(raw.schema("features")).attributes.get

val originMap = filteredTrain
  .schema.filter(_.dataType == StringType)
  .flatMap(x => attributes.map(_.name.get).filter(_.startsWith(x.name + "_")).map(_ -> x.name))
  .toMap
With a dataset containing a vector column at hand, computing cross-correlations with Spark is quite simple, but the result is a matrix, and unrolling it into a set of pairs takes a little extra work:
val pearsonCorrelation =
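The rest of that computation is not included in this text. To show what "unrolling the matrix into pairs" amounts to, here is a minimal plain-Scala sketch that computes Pearson correlations for named columns and emits one row per feature pair. It is an illustration on in-memory sequences, not the Spark code; `CorrelationPairs` and the tuple layout are assumptions for the example.

```scala
object CorrelationPairs {
  // Pearson correlation of two equally long samples
  // (NaN when one of the samples is constant).
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    val n = x.size
    val mx = x.sum / n
    val my = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  // Unroll the upper triangle of the correlation matrix
  // into (feature1, feature2, corr) rows.
  def pairs(columns: Seq[(String, Seq[Double])]): Seq[(String, String, Double)] =
    for {
      ((name1, col1), i) <- columns.zipWithIndex
      (name2, col2)      <- columns.drop(i + 1)
    } yield (name1, name2, pearson(col1, col2))
}
```

Pairs whose two names map to the same source column in `originMap` would then be dropped before plotting.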
And, of course, visualization: once again we need the help of Vegas, this time to draw a heatmap:
vegas.Vegas("Pearson correlation heatmap")
  .withDataFrame(pearsonCorrelation
    .withColumn("isPositive", $"corr" > 0)
    .withColumn("abs_corr", functions.abs($"corr"))
    .where("feature1 < feature2 AND abs_corr > 0.05")
    .orderBy("feature1", "feature2"))
  .encodeX("feature1", Nom)
  .encodeY("feature2", Nom)
  .encodeColor("abs_corr", Quant, scale=Scale(rangeNominals=List("#FFFFFF", "#FF0000")))
  .encodeShape("isPositive", Nom)
  .mark(vegas.Point)
  .show
The result is best viewed in Zepl. For a general impression:

The heatmap shows that some correlations are clearly present. Let's try to pick out the blocks of most strongly correlated features. For this we use the GraphX library: we turn the correlation matrix into a graph, filter the edges by weight, then find the connected components and keep only the non-degenerate ones (those with more than one element). This procedure is essentially similar to applying the DBSCAN algorithm.
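The GraphX code is not reproduced in this text. The procedure itself can be sketched in plain Scala with a union-find structure: threshold the edges, merge the endpoints of every surviving edge, and keep components with more than one member. This is an in-memory illustration of the same idea rather than the GraphX implementation; `CorrelatedBlocks` and the threshold value are assumptions.

```scala
object CorrelatedBlocks {
  // Edges are (feature1, feature2, correlation). Keep those with |corr| >= threshold,
  // then return the connected components that contain more than one feature.
  def blocks(edges: Seq[(String, String, Double)], threshold: Double): Set[Set[String]] = {
    val strong = edges.filter { case (_, _, c) => math.abs(c) >= threshold }
    val parent = scala.collection.mutable.Map.empty[String, String]

    // Union-find "find": follow parent links to the representative of a component,
    // compressing the path along the way.
    def find(x: String): String = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x
      else {
        val root = find(p)
        parent(x) = root
        root
      }
    }

    // Union: merge the components of both endpoints of every strong edge.
    strong.foreach { case (a, b, _) =>
      val (ra, rb) = (find(a), find(b))
      if (ra != rb) parent(ra) = rb
    }

    parent.keys.toSeq.groupBy(k => find(k)).values.map(_.toSet).filter(_.size > 1).toSet
  }
}
```

On a cluster, GraphX's `connectedComponents()` plays the role of the union-find part, with the edge filter applied to the graph beforehand.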
The result is presented as a table:

From the clustering results we can conclude that the most correlated groups formed around the features associated with group membership (membership_status_A) and object type (instanceId_objectType). To model the interaction of features better, it makes sense to segment the models: train separate models for different object types, separately for groups the user is a member of and groups they are not.
Machine learning
Now for the most interesting part: machine learning. The pipeline for training the simplest model (logistic regression) using SparkML and the PravdaML extensions looks like this:
new Pipeline().setStages(Array(
  new SQLTransformer().setStatement(
    """SELECT *, IF(array_contains(feedback, 'Liked'), 1.0, 0.0) AS label FROM __THIS__"""),
  new NullToDefaultReplacer(),
  new AutoAssembler()
    .setColumnsToExclude("date", "instanceId_userId", "instanceId_objectId", "feedback", "label")
    .setOutputCol("features"),
  Scaler.scale(Interceptor.intercept(UnwrappedStage.repartition(
    new LogisticRegressionLBFSG(), numPartitions = 127)))
))
Here we see many familiar elements as well as a few new ones:
- LogisticRegressionLBFSG is an estimator with distributed training of logistic regression.
- To get maximum performance from distributed ML algorithms, the data must be optimally distributed across partitions. The UnwrappedStage.repartition utility helps with this: it adds a repartition operation to the pipeline in such a way that it is used only at the training stage (after all, it is no longer needed when making predictions).
- For a linear model to give good results, the data must be scaled, which is the job of the Scaler.scale utility. However, having two consecutive linear transformations (scaling and multiplication by the regression weights) incurs unnecessary overhead, so it is desirable to collapse them into one. With PravdaML the output is a clean model with a single transformation :).
- And of course, such models need an intercept (free term), which we add with the Interceptor.intercept operation.
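Why the two linear transformations can be collapsed is easy to verify by hand. Assuming the scaling is a standardization x' = (x - mean) / std (an assumption for this example; the scaler's exact formula is not given here), the model w·x' + b equals w*·x + b* with w* = w / std and b* = b - sum(w · mean / std). The sketch below checks this numerically; `FoldScaling` is a hypothetical helper, not part of PravdaML.

```scala
object FoldScaling {
  final case class Linear(weights: Vector[Double], intercept: Double) {
    def score(x: Vector[Double]): Double =
      weights.zip(x).map { case (w, v) => w * v }.sum + intercept
  }

  // Standardise a feature vector: subtract the mean, divide by the deviation.
  def scale(x: Vector[Double], mean: Vector[Double], std: Vector[Double]): Vector[Double] =
    x.indices.map(i => (x(i) - mean(i)) / std(i)).toVector

  // Fold the standardisation into the model:
  // w* = w / std, b* = b - sum(w * mean / std).
  def fold(model: Linear, mean: Vector[Double], std: Vector[Double]): Linear = {
    val idx = model.weights.indices
    val w = idx.map(i => model.weights(i) / std(i)).toVector
    val b = model.intercept - idx.map(i => model.weights(i) * mean(i) / std(i)).sum
    Linear(w, b)
  }
}
```

Scoring raw features with the folded model gives the same value as scaling first and then applying the trained model, so the prediction path needs only one multiplication per feature.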
The resulting pipeline, applied to all the data, gives a per-user AUC of 0.6889 (the validation code is available on Zepl). It remains to apply all our findings: data filtering, feature transformations, and segmented models. The final pipeline looks like this:
new Pipeline().setStages(Array(
  new SQLTransformer().setStatement(s"SELECT instanceId_userId, instanceId_objectId, ${expressions.mkString(", ")} FROM __THIS__"),
  new SQLTransformer().setStatement("""SELECT *, IF(array_contains(feedback, 'Liked'), 1.0, 0.0) AS label, concat(IF(membership_status = 'A', 'OwnGroup_', 'NonUser_'), instanceId_objectType) AS type FROM __THIS__"""),
  new NullToDefaultReplacer(),
  new AutoAssembler()
    .setColumnsToExclude("date", "instanceId_userId", "instanceId_objectId", "feedback", "label", "type", "instanceId_objectType")
    .setOutputCol("features"),
  CombinedModel.perType(
    Scaler.scale(Interceptor.intercept(UnwrappedStage.repartition(
      new LogisticRegressionLBFSG(), numPartitions = 127))),
    numThreads = 6)
))
This pipeline uses one more PravdaML utility, CombinedModel.perType, here with numThreads = 6.
The result is a per-user AUC of 0.7004. The next pipeline uses XGBoost instead:
new Pipeline().setStages(Array(
  new SQLTransformer().setStatement("""SELECT *, IF(array_contains(feedback, 'Liked'), 1.0, 0.0) AS label FROM __THIS__"""),
  new NullToDefaultReplacer(),
  new AutoAssembler()
    .setColumnsToExclude("date", "instanceId_userId", "instanceId_objectId", "feedback", "label")
    .setOutputCol("features"),
  new XGBoostRegressor()
    .setNumRounds(100)
    .setMaxDepth(15)
    .setObjective("reg:logistic")
    .setNumWorkers(17)
    .setNthread(4)
    .setTrackerConf(600000L, "scala")
))
That is XGBoost on Spark, used here through DMLC and PravdaML. In this configuration XGBoost gives a per-user AUC of 0.6981.
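For readers unfamiliar with the metric quoted throughout this section, per-user AUC can be sketched as follows: compute an AUC for each user's ranked list separately and average over the users for which it is defined. The plain-Scala version below is a minimal illustration under that assumption (simple unweighted mean, ties counted as half); the actual validation code lives in the Zepl notebook and may differ. `PerUserAuc` is a hypothetical name.

```scala
object PerUserAuc {
  // AUC of one ranked list: the share of (positive, negative) pairs ordered correctly,
  // counting ties as half. None when the list lacks positives or negatives.
  def auc(scored: Seq[(Double, Boolean)]): Option[Double] = {
    val pos = scored.collect { case (s, true) => s }
    val neg = scored.collect { case (s, false) => s }
    if (pos.isEmpty || neg.isEmpty) None
    else {
      val wins = for (p <- pos; n <- neg)
        yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
      Some(wins.sum / (pos.size.toDouble * neg.size))
    }
  }

  // Rows are (userId, predicted score, liked). Average the AUC over the users
  // for which it is defined (NaN if there are none).
  def perUser(rows: Seq[(Long, Double, Boolean)]): Double = {
    val aucs = rows.groupBy(_._1).values.flatMap { userRows =>
      auc(userRows.map(r => (r._2, r._3)))
    }.toSeq
    aucs.sum / aucs.size
  }
}
```

This also shows why users need both positive and negative examples in their logs: without both, the per-user value is undefined and the user contributes nothing to the score.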
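Analyzing the results
To look inside the trained models we again rely on SparkML and PravdaML: the model data is stored in Parquet and can be read with Spark.
After reading the Parquet data, we apply one more PravdaML utility, TopKTransformer.
The chart is drawn with Vegas (available in Zepl):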

And what about XGBoost?
val significance = sqlContext.read.parquet(
  "sna2019/xgBoost15_100_raw/stages/*/featuresSignificance")

vegas.Vegas()
  .withDataFrame(significance.na.drop.orderBy($"significance".desc).limit(40))
  .encodeX("name", Nom, sortField = Sort("significance", AggOps.Mean))
  .encodeY("significance", Quant)
  .mark(vegas.Bar)
  .show

, , XGBoost, , . . , XGBoost , , .
Conclusion
, :). :
- , Scala Spark , , , , .
- Scala Spark Python: ETL ML, , , .
- , , , (, ) , , .
- , , . , , , -, .
, , , , -. , , " Scala " Newprolab.
, , â SNA Hackathon 2019 .