
Analyze this. Lenta.ru (part 1)
What, how and why

For those too lazy to read: a link to the dataset is at the end of the article.

What: an analysis of the articles of the news resource Lenta.ru for the last 18 years (since September 1, 1999). How: using the R language (with Yandex's MyStem program in a separate section). Why... In my case the short answer to "why" is "to gain experience" with big data. A longer explanation is "to carry out a real task in which I can apply the skills acquired during training and get a result that works both as a check of those skills and as something I can show afterwards".
My background: 15 years as a 1C programmer, plus the first five courses of the Data Science specialization on Coursera.org, which mostly cover the basics of working with R. Despite the fact that in my current profession I have nearly reached level 80, no recruiter from Google has called yet. So, since the entry threshold is relatively low compared to, say, Java, it was decided to drive into the mainstream on my own and pick up the missing know-how along the way.
Of course, having decided for myself that practice and a portfolio were needed, it would have been worth taking one of the datasets floating around and analyzing, analyzing, analyzing... But, having estimated that my current skills would let me grasp the analysis itself in 20-30 hours, while the rest of the work is searching, collecting, cleaning and preparing the data, I decided to take on that second part. And, in contrast to the statistics of airline tickets sold in the US over the last 30 years or arrest statistics, I wanted something of my own, something out of the ordinary and more interesting.
Lenta.ru was chosen as the research object. Partly because I have been a long-time reader of it, even though its editors regularly churn out irritating trash; partly because data mining there looked relatively easy. But, to be honest, when picking the object I did not really consider the questions "what can be done with this data" or "what should I ask it". The reason is that at the moment I have only mastered data acquisition and cleaning, and my knowledge of analysis is still very thin. Of course, I assumed I could at least answer the question "how has the average number of news items published per day changed over the last 5-10 years", but I did not think any further than that.
So in this article I will concentrate on extracting and refining data from Lenta.ru that is suitable for analysis.
Grabbing
The first thing to decide was how to fetch and parse the content of the site's pages. Google suggested using the rvest package, which lets you fetch a page's text by its address and pull out the content of the required fields with XPath. Of course, looking ahead, this task should have been split in two (fetching the pages and the parsing proper), but I realized that only later. For now, the first step was to get the list of links to the articles themselves.

After a short investigation an "archive" section was discovered on the site: a simple script redirects you to a page that contains links to all the news for a given date. The path to such a page looks like https://lenta.ru/2017/07/01/ or https://lenta.ru/2017/03/09/. All that remained was to walk through all these pages and collect the news links from them.
For these purposes (fetching and parsing) I used the following packages:
require(lubridate)
require(rvest)
require(dplyr)
require(tidyr)
require(purrr)
require(XML)
require(data.table)
require(stringr)
require(jsonlite)
require(reshape2)
The following (not particularly elegant) code allowed me to get links to all the articles published over the last eight years:
articlesStartDate <- as.Date("2010-01-01")
articlesEndDate <- as.Date("2017-06-30")

## STEP 1. Prepare articles links list
# Dowload list of pages with archived articles.
# Takes about 40 minutes
GetNewsListForPeriod <- function() {
  timestamp()

  # Prepare vector of links of archive pages in https://lenta.ru//yyyy/mm/dd/ format
  dayArray <- seq(as.Date(articlesStartDate), as.Date(articlesEndDate), by="days")
  archivePagesLinks <- paste0(baseURL, "/", year(dayArray),
                              "/", formatC(month(dayArray), width = 2, format = "d", flag = "0"),
                              "/", formatC(day(dayArray), width = 2, format = "d", flag = "0"),
                              "/")

  # Go through all pages and extract all news links
  articlesLinks <- c()
  for (i in 1:length(archivePagesLinks)) {
    pg <- read_html(archivePagesLinks[i], encoding = "UTF-8")
    linksOnPage <- html_nodes(pg,
                              xpath=".//section[@class='b-longgrid-column']//div[@class='titles']//a") %>%
      html_attr("href")
    articlesLinks <- c(articlesLinks, linksOnPage)
    saveRDS(articlesLinks, file.path(tempDataFolder, "tempArticlesLinks.rds"))
  }

  # Add root and write down all the news links
  articlesLinks <- paste0(baseURL, articlesLinks)
  writeLines(articlesLinks, file.path(tempDataFolder, "articles.urls"))
  timestamp()
}
Having generated an array of dates from 2010-01-01 to 2017-06-30 and transformed it, I got the variable archivePagesLinks with links to all the "archive pages":
> head(archivePagesLinks)
[1] "https://lenta.ru/2010/01/01/"
[2] "https://lenta.ru/2010/01/02/"
[3] "https://lenta.ru/2010/01/03/"
[4] "https://lenta.ru/2010/01/04/"
[5] "https://lenta.ru/2010/01/05/"
[6] "https://lenta.ru/2010/01/06/"
> length(archivePagesLinks)
[1] 2738
Then, in a loop, the read_html method downloaded the content of each page into a buffer, and the html_nodes and html_attr methods extracted the direct links to the articles:
> head(articlesLinks)
[1] "https://lenta.ru/news/2009/12/31/kids/"
[2] "https://lenta.ru/news/2009/12/31/silvio/"
[3] "https://lenta.ru/news/2009/12/31/postpone/"
[4] "https://lenta.ru/photo/2009/12/31/meeting/"
[5] "https://lenta.ru/news/2009/12/31/boeviks/"
[6] "https://lenta.ru/news/2010/01/01/celebrate/"
> length(articlesLinks)
[1] 379862
After getting this first result I ran into a problem. The code above took about 40 minutes to run. Given that it processed 2738 links in that time, processing all 379862 links would take about 5550 minutes, or 92 hours, which is almost four days. That clearly would not do...

The built-in readLines {base} and download.file {utils} methods, which simply fetch the text, gave similar results. The htmlParse {XML} method, which downloads the content and goes straight on to parsing, did not improve the situation either. The same result with getURL {RCurl}. A dead end.

Looking for a way out, Google and I decided to look towards "parallel" execution of the requests, since neither the network, nor memory, nor the CPU was loaded while the code was running. Google suggested digging into parallel-package {parallel}. Several hours of study and tests showed that launching even two "parallel" threads gave no gain. According to the scraps of information Google provided, the package lets you parallelize computations over data that is already in memory, but when working with the disk or external sources all requests end up executed in the same process and are serialized (I may have misunderstood the situation). And anyway, as far as I understood, even if it had worked, with the available cores, that is 8 of them (which I did not have), and truly parallel processing, I would still have had to wait about 690 minutes.
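For the record, the kind of test that showed no gain looked roughly like this; a sketch rather than the code I kept, assuming the archivePagesLinks vector from above. In my hands two such "parallel" workers finished in practically the same wall-clock time as one:

require(parallel)

# Spread a hundred archive pages over two worker processes
cl <- makeCluster(2)
clusterEvalQ(cl, library(rvest))
linksList <- parLapply(cl, archivePagesLinks[1:100], function(link) {
  pg <- read_html(link, encoding = "UTF-8")
  html_nodes(pg, xpath = ".//section[@class='b-longgrid-column']//div[@class='titles']//a") %>%
    html_attr("href")
})
stopCluster(cl)
articlesLinks <- unlist(linksList)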
The next idea was to run several R processes side by side, each handling its own chunk of the big list of links. But Google had nothing to say to the question "how do I start several new R sessions from an R session". I also thought about running an R script from the command line, but my CMD experience was at the level of "type dir to get the list of files in a folder". I was stuck again.
When Google stopped giving new results, I decided, with a clear conscience, to ask the audience for help. Since Google kept pointing to stackoverflow, I decided to try my luck there. Having some experience of communicating on specialized forums and knowing how questions from newcomers are usually received, I tried to state my problem as clearly and precisely as I could. And lo and behold, a few hours later I received a detailed, almost step-by-step answer from Bob Rudis. After substituting it into my code my problem was almost completely solved. "Almost", because I had a sense of how it worked but did not fully understand it. Hearing about wget for the first time, I could not figure out what was being done with WARC in that code, or how a function could be passed to a method (I repeat: in my previous language I had not touched anything "pointer"-like since my student days). However, staring at code long enough eventually brings enlightenment. By taking it apart function by function and adding attempts to run it piecemeal, I managed to get somewhere. And the same Google helped me deal with wget.
Despite the fact that all development was done on macOS, the need to use wget (and, later on, MyStem) meant that this part of the runtime had to be tied to Windows. There was a thought that curl could do something similar and that it would be worth spending time on implementing it, but for now I decided to move on without stopping.
As a result, the essence of the solution boils down to this: a file with the article links, prepared in advance, is fed to the wget command. The wget call itself looked like this:
system("wget --warc-file=lenta -i lenta.urls", intern = FALSE)
After the run I had a large number of files with the html content of the web pages (one per submitted link), plus a WARC file containing both a log of the communication with the resource and the full content of the pages. It was this WARC that Bob Rudis suggested parsing. Exactly what was needed. As a bonus, since I now had local copies of the pages I had fetched, I could go back and read them again at any time.
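Looking ahead a little: a WARC file is essentially plain text, request and response records separated by headers, so it can be inspected even without special packages. A minimal sketch, assuming the archive produced by the wget call above (the exact file name depends on the wget build and options):

# R reads gzip-compressed files transparently, so the .warc.gz can be
# opened directly with readLines
warcLines <- readLines("lenta.warc.gz", warn = FALSE, encoding = "UTF-8")

# Each saved page starts with a "WARC-Type: response" record
responsePositions <- which(warcLines == "WARC-Type: response")
length(responsePositions)

# The target URL of each record sits in the WARC-Target-URI header
head(grep("^WARC-Target-URI: ", warcLines, value = TRUE))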
The first performance measurement showed that downloading 2000 links "head-on" (I had long wanted to use that word) took about 10 minutes; extrapolated to all the articles that came to roughly 1890 minutes, almost three times faster than before, but still not enough. Going a few steps back and re-testing the new mechanism with parallel-package {parallel} in mind, I found that there was no gain to be had there either.
A knight's move was needed: running several truly parallel processes. Considering that I had already taken one step away from "reproducible research" (a principle hammered in by the courses) by using the external wget program and tying the actual run to a Windows environment, I decided to take one more step and return to the idea of parallel processes, but now outside R. The question was how to make a CMD file run several commands one after another without waiting for the previous one to finish (that is, in parallel). A bit of googling showed that the START command runs a command in a separate window. Armed with this knowledge, the following code was born:
## STEP 2. Prepare wget CMD files for parallel downloading
# Create CMD file.
# Downloading process for 400K pages takes about 3 hours. Expect about 70GB
# in html files and 12GB in compressed WARC files
CreateWgetCMDFiles <- function() {
  timestamp()
  articlesLinks <- readLines(file.path(tempDataFolder, "articles.urls"))
  dir.create(warcFolder, showWarnings = FALSE)

  # Split up articles links array by 10K links
  numberOfLinks <- length(articlesLinks)
  digitNumber <- nchar(numberOfLinks)
  groupSize <- 10000
  filesGroup <- seq(from = 1, to = numberOfLinks, by = groupSize)
  cmdCodeAll <- c()

  for (i in 1:length(filesGroup)) {

    # Prepare folder name as 00001-10000, 10001-20000 etc
    firstFileInGroup <- filesGroup[i]
    lastFileInGroup <- min(firstFileInGroup + groupSize - 1, numberOfLinks)
    leftPartFolderName <- formatC(firstFileInGroup, width = digitNumber,
                                  format = "d", flag = "0")
    rigthPartFolderName <- formatC(lastFileInGroup, width = digitNumber,
                                   format = "d", flag = "0")
    subFolderName <- paste0(leftPartFolderName, "-", rigthPartFolderName)
    subFolderPath <- file.path(downloadedArticlesFolder, subFolderName)
    dir.create(subFolderPath)

    # Write articles.urls for each 10K folders that contains 10K articles urls
    writeLines(articlesLinks[firstFileInGroup:lastFileInGroup],
               file.path(subFolderPath, "articles.urls"))

    # Add command line in CMD file that will looks like:
    # 'START wget --warc-file=warc\000001-010000 -i 000001-010000\list.urls -P 000001-010000'
    cmdCode <- paste0("START ..\\wget -i ", subFolderName, "\\", "articles.urls -P ", subFolderName)
    # Use commented code below for downloading with WARC files:
    #cmdCode <- paste0("START ..\\wget --warc-file=warc\\", subFolderName, " -i ",
    #                  subFolderName, "\\", "articles.urls -P ", subFolderName)

    cmdCodeAll <- c(cmdCodeAll, cmdCode)
  }

  # Write down command file
  cmdFile <- file.path(downloadedArticlesFolder, "start.cmd")
  writeLines(cmdCodeAll, cmdFile)
  print(paste0("Run ", cmdFile, " to start downloading."))
  print("wget.exe should be placed in working directory.")
  timestamp()
}
This code splits the array of links into blocks of 10,000 (in my case, 38 blocks). In the loop, a folder of the form 00001-10000, 10001-20000 and so on is created for each block, its own articles.urls file (with that block's set of 10,000 links) is placed inside, and the downloaded files will land in the same folder. In the same loop a CMD file is assembled that launches all 38 windows at once:
START ..\wget --warc-file=warc\000001-010000 -i 000001-010000\articles.urls -P 000001-010000
START ..\wget --warc-file=warc\010001-020000 -i 010001-020000\articles.urls -P 010001-020000
START ..\wget --warc-file=warc\020001-030000 -i 020001-030000\articles.urls -P 020001-030000
...
START ..\wget --warc-file=warc\350001-360000 -i 350001-360000\articles.urls -P 350001-360000
START ..\wget --warc-file=warc\360001-370000 -i 360001-370000\articles.urls -P 360001-370000
START ..\wget --warc-file=warc\370001-379862 -i 370001-379862\articles.urls -P 370001-379862
Launching the generated CMD file opened the expected 38 wget windows and produced the following load on the machine (3.5GHz Xeon E-1240 v5, 32Gb, SSD, Windows Server 2012):

The total time came to 180 minutes, or 3 hours. This "quick-and-dirty" simultaneous execution gave almost a 10x gain over single-threaded wget and roughly a 30x gain compared to the original read_html {rvest} approach. It was the first small victory, and a similar "quick-and-dirty" approach had to be applied several more times later.
The result on the hard drive looked like this:
> indexFiles <- list.files(downloadedArticlesFolder, full.names = TRUE, recursive = TRUE, pattern = "index")
> length(indexFiles)
[1] 379703
> sum(file.size(indexFiles))/1024/1024
[1] 66713.61
> warcFiles <- list.files(downloadedArticlesFolder, full.names = TRUE, recursive = TRUE, pattern = "warc")
> length(warcFiles)
[1] 38
> sum(file.size(warcFiles))/1024/1024
[1] 18770.4
Which means 379703 web pages with a total size of 66713.61MB were downloaded, plus 38 compressed WARC files with a total size of 18770.4MB. Simple arithmetic showed that 159 pages had been "lost". Their fate could probably be learned by parsing the WARC files, following Bob Rudis's example, but I wrote them off as acceptable losses and decided to go my own way, working directly with the 379703 downloaded files.
Parsing
Before pulling anything out of the downloaded pages, I had to decide what exactly to extract and what information was of interest. After studying the page structure for quite a while, I prepared the following code:
# Parse srecific file # Parse srecific file ReadFile <- function(filename) { pg <- read_html(filename, encoding = "UTF-8") # Extract Title, Type, Description metaTitle <- html_nodes(pg, xpath=".//meta[@property='og:title']") %>% html_attr("content") %>% SetNAIfZeroLength() metaType <- html_nodes(pg, xpath=".//meta[@property='og:type']") %>% html_attr("content") %>% SetNAIfZeroLength() metaDescription <- html_nodes(pg, xpath=".//meta[@property='og:description']") %>% html_attr("content") %>% SetNAIfZeroLength() # Extract script contect that contains rubric and subrubric data scriptContent <- html_nodes(pg, xpath=".//script[contains(text(),'chapters: [')]") %>% html_text() %>% strsplit("\n") %>% unlist() if (is.null(scriptContent[1])) { chapters <- NA } else if (is.na(scriptContent[1])) { chapters <- NA } else { chapters <- scriptContent[grep("chapters: ", scriptContent)] %>% unique() } articleBodyNode <- html_nodes(pg, xpath=".//div[@itemprop='articleBody']") # Extract articles body plaintext <- html_nodes(articleBodyNode, xpath=".//p") %>% html_text() %>% paste0(collapse="") if (plaintext == "") { plaintext <- NA } # Extract links from articles body plaintextLinks <- html_nodes(articleBodyNode, xpath=".//a") %>% html_attr("href") %>% unique() %>% paste0(collapse=" ") if (plaintextLinks == "") { plaintextLinks <- NA } # Extract links related to articles additionalLinks <- html_nodes(pg, xpath=".//section/div[@class='item']/div/..//a") %>% html_attr("href") %>% unique() %>% paste0(collapse=" ") if (additionalLinks == "") { additionalLinks <- NA } # Extract image Description and Credits imageNodes <- html_nodes(pg, xpath=".//div[@class='b-topic__title-image']") imageDescription <- html_nodes(imageNodes, xpath="div//div[@class='b-label__caption']") %>% html_text() %>% unique() %>% SetNAIfZeroLength() imageCredits <- html_nodes(imageNodes, xpath="div//div[@class='b-label__credits']") %>% html_text() %>% unique() %>% SetNAIfZeroLength() # Extract video Description and Credits if (is.na(imageDescription)&is.na(imageCredits)) { videoNodes <- html_nodes(pg, xpath=".//div[@class='b-video-box__info']") videoDescription <- html_nodes(videoNodes, xpath="div[@class='b-video-box__caption']") %>% html_text() %>% unique() %>% SetNAIfZeroLength() videoCredits <- html_nodes(videoNodes, xpath="div[@class='b-video-box__credits']") %>% html_text() %>% unique() %>% SetNAIfZeroLength() } else { videoDescription <- NA videoCredits <- NA } # Extract articles url url <- html_nodes(pg, xpath=".//head/link[@rel='canonical']") %>% html_attr("href") %>% SetNAIfZeroLength() # Extract authors authorSection <- html_nodes(pg, xpath=".//p[@class='b-topic__content__author']") authors <- html_nodes(authorSection, xpath="//span[@class='name']") %>% html_text() %>% SetNAIfZeroLength() if (length(authors) > 1) { authors <- paste0(authors, collapse = "|") } authorLinks <- html_nodes(authorSection, xpath="a") %>% html_attr("href") %>% SetNAIfZeroLength() if (length(authorLinks) > 1) { authorLinks <- paste0(authorLinks, collapse = "|") } # Extract publish date and time datetimeString <- html_nodes(pg, xpath=".//div[@class='b-topic__info']/time[@class='g-date']") %>% html_text() %>% unique() %>% SetNAIfZeroLength() datetime <- html_nodes(pg, xpath=".//div[@class='b-topic__info']/time[@class='g-date']") %>% html_attr("datetime") %>% unique() %>% SetNAIfZeroLength() if (is.na(datetimeString)) { datetimeString <- html_nodes(pg, xpath=".//div[@class='b-topic__date']") %>% html_text() %>% unique() %>% SetNAIfZeroLength() } data.frame(url 
= url, filename = filename, metaTitle= metaTitle, metaType= metaType, metaDescription= metaDescription, chapters = chapters, datetime = datetime, datetimeString = datetimeString, plaintext = plaintext, authors = authors, authorLinks = authorLinks, plaintextLinks = plaintextLinks, additionalLinks = additionalLinks, imageDescription = imageDescription, imageCredits = imageCredits, videoDescription = videoDescription, videoCredits = videoCredits, stringsAsFactors=FALSE) }
To start with, the headline. At first I took it from the title tag, which has the form "...: ...: Lenta.ru", hoping to split it into the headline proper and the rubric/subrubric parts. Later, however, to be on the safe side, I decided to keep the headline in its pure form, taken from the page metadata.
<time class="g-date" datetime="2017-07-10T12:35:00Z" itemprop="datePublished" pubdate=""> 15:35, 10 2017</time>
From this element I took the date and time. Besides the machine-readable form 2017-07-10T12:35:00Z, I decided, just in case, to also keep the text presentation "15:35, 10 2017"; this text form made it possible to recover an article's time when, for whatever reason, the time[@class='g-date'] element was missing from the page.
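For illustration, pulling both forms out of a page could look roughly like this; a minimal sketch, assuming pg is a page already read with read_html() (the node names follow the sample above):

require(rvest)

timeNode <- html_nodes(pg, xpath = ".//div[@class='b-topic__info']/time[@class='g-date']")
# Machine-readable form: "2017-07-10T12:35:00Z"
datetime <- html_attr(timeNode, "datetime")
# Text form kept as a fallback: "15:35, 10 ... 2017"
datetimeString <- html_text(timeNode)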
Article authors are quite rare, but I decided to extract that information just in case. I was also interested in the article text itself, the links inside it, and the links shown in the "related links" section. And, also just in case, I spent a little extra time on correctly parsing the information about the photo or video at the top of an article (description and credits).
To get the rubric and subrubric (at first I wanted to carve them out of the headline), I decided to play it safe and store the whole line chapters: ["","","lenta.ru:_:_:_______"], // Chapters from the embedded script; pulling "World" or "Society" out of it is a bit easier than out of the title.
What interested me in particular were the share counts, the number of comments on an article and, of course, the comments themselves (their vocabulary, the bursts of activity): the only information about how readers reacted to an article. And the most interesting part is exactly where I failed. The share and comment counters are set by scripts that run after the page has loaded, while my clever code grabbed the pages before that moment, leaving the corresponding fields empty. The comments themselves are also loaded by a script and, worse, after a while comments on an article are closed and can no longer be retrieved at all. But since I had noticed a strong dependence of the flood of comments on the presence of certain words (Ukraine, Poroshenko and the like), I have not given up on this problem yet.
Actually, thanks to a colleague's hint, I did later get access to this information, but more on that below.
That is basically all the information that seemed useful to me.
The following code let me start parsing the files in the first of the 38 folders:
folderNumber <- 1
# Read and parse files in folder with provided number
ReadFilesInFolder <- function(folderNumber) {
  timestamp()
  # Get name of folder that have to be parsed
  folders <- list.files(downloadedArticlesFolder, full.names = FALSE,
                        recursive = FALSE, pattern = "-")
  folderName <- folders[folderNumber]
  currentFolder <- file.path(downloadedArticlesFolder, folderName)
  files <- list.files(currentFolder, full.names = TRUE,
                      recursive = FALSE, pattern = "index")

  # Split files in folder in 1000 chunks and parse them using ReadFile
  numberOfFiles <- length(files)
  print(numberOfFiles)
  groupSize <- 1000
  filesGroup <- seq(from = 1, to = numberOfFiles, by = groupSize)
  dfList <- list()
  for (i in 1:length(filesGroup)) {
    firstFileInGroup <- filesGroup[i]
    lastFileInGroup <- min(firstFileInGroup + groupSize - 1, numberOfFiles)
    print(paste0(firstFileInGroup, "-", lastFileInGroup))
    dfList[[i]] <- map_df(files[firstFileInGroup:lastFileInGroup], ReadFile)
  }

  # combine rows in data frame and write down
  df <- bind_rows(dfList)
  write.csv(df, file.path(parsedArticlesFolder, paste0(folderName, ".csv")),
            fileEncoding = "UTF-8")
}
Having obtained the name of a folder of the form 00001-10000 and the list of its files, I split the file array into blocks of 1000 and, using map_df, ran the ReadFile function over each such block.
Measurements showed that it took about 8 minutes to process 10000 articles (with the read_html method eating up 95% of the time). Processed head-on, all the articles would therefore take about 300 minutes, or 5 hours.
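Such a timing is easy to reproduce on a single block; a minimal check, assuming the files vector and the ReadFile function from the listings above:

require(purrr)

# Time one 1000-file chunk; with the numbers above this comes out
# to roughly 45-50 seconds per 1000 files
system.time(dfChunk <- map_df(files[1:1000], ReadFile))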
And here is the second trick promised earlier: launching truly parallel R sessions (the CMD groundwork was already in place). So the following script produced the CMD file I needed:
## STEP 3. Parse downloaded articles
# Create CMD file for parallel articles parsing.
# Parsing process takes about 1 hour. Expect about 1.7Gb in parsed files
CreateCMDForParsing <- function() {
  timestamp()
  # Get list of folders that contain downloaded articles
  folders <- list.files(downloadedArticlesFolder, full.names = FALSE,
                        recursive = FALSE, pattern = "-")

  # Create CMD contains commands to run parse.R script with specified folder number
  nn <- 1:length(folders)
  cmdCodeAll <- paste0("start C:/R/R-3.4.0/bin/Rscript.exe ",
                       file.path(getwd(), "parse.R "), nn)

  cmdFile <- file.path(downloadedArticlesFolder, "parsing.cmd")
  writeLines(cmdCodeAll, cmdFile)
  print(paste0("Run ", cmdFile, " to start parsing."))
  timestamp()
}
Which looks like this:
start C:/R/R-3.4.0/bin/Rscript.exe C:/Users/ildar/lenta/parse.R 1
start C:/R/R-3.4.0/bin/Rscript.exe C:/Users/ildar/lenta/parse.R 2
start C:/R/R-3.4.0/bin/Rscript.exe C:/Users/ildar/lenta/parse.R 3
...
start C:/R/R-3.4.0/bin/Rscript.exe C:/Users/ildar/lenta/parse.R 36
start C:/R/R-3.4.0/bin/Rscript.exe C:/Users/ildar/lenta/parse.R 37
start C:/R/R-3.4.0/bin/Rscript.exe C:/Users/ildar/lenta/parse.R 38
This, in turn, launches 38 copies of the parse.R script:
args <- commandArgs(TRUE)
n <- as.integer(args[1])

# Set workling directory and locale for macOS and Windows
if (Sys.info()['sysname'] == "Windows") {
  workingDirectory <- paste0(Sys.getenv("HOMEPATH"), "\\lenta")
  Sys.setlocale("LC_ALL", "Russian")
} else {
  workingDirectory <- ("~/lenta")
  Sys.setlocale("LC_ALL", "ru_RU.UTF-8")
}
setwd(workingDirectory)

source("get_lenta_articles_list.R")

ReadFilesInFolder(n)

This parallelization loaded the server at 100% and completed the whole task in 30 minutes. Another 10x gain.
All that remained was to run a script that combines the 38 newly created files:
## STEP 4. Prepare combined articles data
# Read all parsed csv and combine them in one.
# Expect about 1.7Gb in combined file
UnionData <- function() {
  timestamp()
  files <- list.files(parsedArticlesFolder, full.names = TRUE, recursive = FALSE)
  dfList <- c()
  for (i in 1:length(files)) {
    file <- files[i]
    print(file)
    dfList[[i]] <- read.csv(file, stringsAsFactors = FALSE, encoding = "UTF-8")
  }
  df <- bind_rows(dfList)
  write.csv(df, file.path(parsedArticlesFolder, "untidy_articles_data.csv"),
            fileEncoding = "UTF-8")
  timestamp()
}
And finally I had 1739MB of untidy, not yet cleaned data sitting on disk:
> file.size(file.path(parsedArticlesFolder, "untidy_articles_data.csv"))/1024/1024 [1] 1739.047
What is inside?
> str(dfM, vec.len = 1) 'data.frame': 379746 obs. of 21 variables: $ X.1 : int 1 2 ... $ X : int 1 2 ... $ url : chr "https://lenta.ru/news/2009/12/31/kids/" ... $ filename : chr "C:/Users/ildar/lenta/downloaded_articles/000001-010000/index.html" ... $ metaTitle : chr " " ... $ metaType : chr "article" ... $ metaDescription : chr " . "| __truncated__ ... $ rubric : logi NA ... $ chapters : chr " chapters: [\"\",\"_\",\"lenta.ru:_:_____"| __truncated__ ... $ datetime : chr "2009-12-31T21:24:33Z" ... $ datetimeString : chr " 00:24, 1 2010" ... $ title : chr " : : Lenta.ru" ... $ plaintext : chr " . "| __truncated__ ... $ authors : chr NA ... $ authorLinks : chr NA ... $ plaintextLinks : chr "http://www.interfax.ru/" ... $ additionalLinks : chr "https://lenta.ru/news/2009/12/29/golovan/ https://lenta.ru/news/2009/09/01/children/" ... $ imageDescription: chr NA ... $ imageCredits : chr NA ... $ videoDescription: chr NA ... $ videoCredits : chr NA ...
This is already usable, but before cleaning it up I wanted to come back to the social data that my parser could not reach (share counts and comment counts). This is where the colleague's hint mentioned above came in: just watch, in the Network tab of Google Chrome, which requests the page fires while it loads. It turned out the counters are pulled from the social networks by requests of the form:
https://graph.facebook.com/?id=https%3A%2F%2Flenta.ru%2Fnews%2F2017%2F08%2F10%2Fudostov%2F
The response to which looks like this:
{ "share": { "comment_count": 0, "share_count": 243 }, "og_object": { "id": "1959067174107041", "description": ..., "title": ..., "type": "article", "updated_time": "2017-08-10T09:21:29+0000" }, "id": "https://lenta.ru/news/2017/08/10/udostov/" }
Similar counters turned out to be available for VK, Odnoklassniki and the Rambler comments widget: four sources in all. The only catch is Facebook, where you additionally have to obtain an access token and substitute it into the request (the "PlaceYourTokenHere" placeholder in the code below). So, by analogy with the article download, four sets of CMD files for parallel downloading were prepared:
## STEP 5. Prepare wget CMD files for parallel downloading social # Create CMD file. CreateWgetCMDFilesForSocial <- function() { timestamp() articlesLinks <- readLines(file.path(tempDataFolder, "articles.urls")) dir.create(warcFolder, showWarnings = FALSE) dir.create(warcFolderForFB, showWarnings = FALSE) dir.create(warcFolderForVK, showWarnings = FALSE) dir.create(warcFolderForOK, showWarnings = FALSE) dir.create(warcFolderForCom, showWarnings = FALSE) # split up articles links array by 10K links numberOfLinks <- length(articlesLinks) digitNumber <- nchar(numberOfLinks) groupSize <- 10000 filesGroup <- seq(from = 1, to = numberOfLinks, by = groupSize) cmdCodeAll <- c() cmdCodeAllFB <- c() cmdCodeAllVK <- c() cmdCodeAllOK <- c() cmdCodeAllCom <- c() for (i in 1:length(filesGroup)) { # Prepare folder name as 00001-10000, 10001-20000 etc firstFileInGroup <- filesGroup[i] lastFileInGroup <- min(firstFileInGroup + groupSize - 1, numberOfLinks) leftPartFolderName <- formatC(firstFileInGroup, width = digitNumber, format = "d", flag = "0") rigthPartFolderName <- formatC(lastFileInGroup, width = digitNumber, format = "d", flag = "0") subFolderName <- paste0(leftPartFolderName, "-", rigthPartFolderName) subFolderPathFB <- file.path(downloadedArticlesFolderForFB, subFolderName) dir.create(subFolderPathFB) subFolderPathVK <- file.path(downloadedArticlesFolderForVK, subFolderName) dir.create(subFolderPathVK) subFolderPathOK <- file.path(downloadedArticlesFolderForOK, subFolderName) dir.create(subFolderPathOK) subFolderPathCom <- file.path(downloadedArticlesFolderForCom, subFolderName) dir.create(subFolderPathCom) # Encode and write down articles.urls for each 10K folders that contains # 10K articles urls. # For FB it has to be done in a bit different way because FB allows to pass # up to 50 links as a request parameter. 
articlesLinksFB <- articlesLinks[firstFileInGroup:lastFileInGroup] numberOfLinksFB <- length(articlesLinksFB) digitNumberFB <- nchar(numberOfLinksFB) groupSizeFB <- 50 filesGroupFB <- seq(from = 1, to = numberOfLinksFB, by = groupSizeFB) articlesLinksFBEncoded <- c() for (k in 1:length(filesGroupFB )) { firstFileInGroupFB <- filesGroupFB[k] lastFileInGroupFB <- min(firstFileInGroupFB + groupSizeFB - 1, numberOfLinksFB) articlesLinksFBGroup <- paste0(articlesLinksFB[firstFileInGroupFB:lastFileInGroupFB], collapse = ",") articlesLinksFBGroup <- URLencode(articlesLinksFBGroup , reserved = TRUE) articlesLinksFBGroup <- paste0("https://graph.facebook.com/?fields=engagement&access_token=PlaceYourTokenHere&ids=", articlesLinksFBGroup) articlesLinksFBEncoded <- c(articlesLinksFBEncoded, articlesLinksFBGroup) } articlesLinksVK <- paste0("https://vk.com/share.php?act=count&index=1&url=", sapply(articlesLinks[firstFileInGroup:lastFileInGroup], URLencode, reserved = TRUE), "&format=json") articlesLinksOK <- paste0("https://connect.ok.ru/dk?st.cmd=extLike&uid=okLenta&ref=", sapply(articlesLinks[firstFileInGroup:lastFileInGroup], URLencode, reserved = TRUE), "") articlesLinksCom <- paste0("https://c.rambler.ru/api/app/126/comments-count?xid=", sapply(articlesLinks[firstFileInGroup:lastFileInGroup], URLencode, reserved = TRUE), "") writeLines(articlesLinksFBEncoded, file.path(subFolderPathFB, "articles.urls")) writeLines(articlesLinksVK, file.path(subFolderPathVK, "articles.urls")) writeLines(articlesLinksOK, file.path(subFolderPathOK, "articles.urls")) writeLines(articlesLinksCom, file.path(subFolderPathCom, "articles.urls")) # Add command line in CMD file cmdCode <-paste0("START ..\\..\\wget --warc-file=warc\\", subFolderName," -i ", subFolderName, "\\", "articles.urls -P ", subFolderName, " --output-document=", subFolderName, "\\", "index") cmdCodeAll <- c(cmdCodeAll, cmdCode) } cmdFile <- file.path(downloadedArticlesFolderForFB, "start.cmd") print(paste0("Run ", cmdFile, " to start downloading.")) writeLines(cmdCodeAll, cmdFile) cmdFile <- file.path(downloadedArticlesFolderForVK, "start.cmd") writeLines(cmdCodeAll, cmdFile) print(paste0("Run ", cmdFile, " to start downloading.")) cmdFile <- file.path(downloadedArticlesFolderForOK, "start.cmd") writeLines(cmdCodeAll, cmdFile) print(paste0("Run ", cmdFile, " to start downloading.")) cmdFile <- file.path(downloadedArticlesFolderForCom, "start.cmd") writeLines(cmdCodeAll, cmdFile) print(paste0("Run ", cmdFile, " to start downloading.")) print("wget.exe should be placed in working directory.") timestamp() }
The scheme is the same as before: the links are split into blocks of 10,000, one folder per block and per source, and the download is driven by wget in parallel windows. This time it is the WARC files that matter, since all the responses of a block are written into a single index document. The whole download took around 100 minutes. Facebook is the only special case: the Graph API accepts up to 50 links per request, so its request URLs are assembled in batches of 50. Parsing what came back:
## Parse downloaded articles social ReadSocial <- function() { timestamp() # Read and parse all warc files in FB folder dfList <- list() dfN <- 0 warcs <- list.files(file.path(downloadedArticlesFolderForFB, "warc"), full.names = TRUE, recursive = FALSE) for (i in 1:length(warcs)) { filename <- warcs[i] print(filename) res <- readLines(filename, warn = FALSE, encoding = "UTF-8") anchorPositions <- which(res == "WARC-Type: response") responsesJSON <- res[anchorPositions + 28] getID <- function(responses) { links <- sapply(responses, function(x){x$id}, USE.NAMES = FALSE) %>% unname() links} getQuantity <- function(responses) { links <- sapply(responses, function(x){x$engagement$share_count}, USE.NAMES = FALSE) %>% unname() links} for(k in 1:length(responsesJSON)) { if(responsesJSON[k]==""){ next } responses <- fromJSON(responsesJSON[k]) if(!is.null(responses$error)) { next } links <- sapply(responses, function(x){x$id}, USE.NAMES = FALSE) %>% unname() %>% unlist() quantities <- sapply(responses, function(x){x$engagement$share_count}, USE.NAMES = FALSE) %>% unname() %>% unlist() df <- data.frame(link = links, quantity = quantities, social = "FB", stringsAsFactors = FALSE) dfN <- dfN + 1 dfList[[dfN]] <- df } } dfFB <- bind_rows(dfList) # Read and parse all warc files in VK folder dfList <- list() dfN <- 0 warcs <- list.files(file.path(downloadedArticlesFolderForVK, "warc"), full.names = TRUE, recursive = FALSE) for (i in 1:length(warcs)) { filename <- warcs[i] print(filename) res <- readLines(filename, warn = FALSE, encoding = "UTF-8") anchorPositions <- which(res == "WARC-Type: response") links <- res[anchorPositions + 4] %>% str_replace_all("WARC-Target-URI: https://vk.com/share.php\\?act=count&index=1&url=|&format=json", "") %>% sapply(URLdecode) %>% unname() quantities <- res[anchorPositions + 24] %>% str_replace_all(" |.*\\,|\\);", "") %>% as.integer() df <- data.frame(link = links, quantity = quantities, social = "VK", stringsAsFactors = FALSE) dfN <- dfN + 1 dfList[[dfN]] <- df } dfVK <- bind_rows(dfList) # Read and parse all warc files in OK folder dfList <- list() dfN <- 0 warcs <- list.files(file.path(downloadedArticlesFolderForOK, "warc"), full.names = TRUE, recursive = FALSE) for (i in 1:length(warcs)) { filename <- warcs[i] print(filename) res <- readLines(filename, warn = FALSE, encoding = "UTF-8") anchorPositions <- which(res == "WARC-Type: response") links <- res[anchorPositions + 4] %>% str_replace_all("WARC-Target-URI: https://connect.ok.ru/dk\\?st.cmd=extLike&uid=okLenta&ref=", "") %>% sapply(URLdecode) %>% unname() quantities <- res[anchorPositions + 22] %>% str_replace_all(" |.*\\,|\\);|'", "") %>% as.integer() df <- data.frame(link = links, quantity = quantities, social = "OK", stringsAsFactors = FALSE) dfN <- dfN + 1 dfList[[dfN]] <- df } dfOK <- bind_rows(dfList) # Read and parse all warc files in Com folder dfList <- list() dfN <- 0 warcs <- list.files(file.path(downloadedArticlesFolderForCom, "warc"), full.names = TRUE, recursive = FALSE) x <- c() for (i in 1:length(warcs)) { filename <- warcs[i] print(filename) res <- readLines(filename, warn = FALSE, encoding = "UTF-8") anchorPositions <- which(str_sub(res, start = 1, end = 9) == '{"xids":{') x <- c(x, res[anchorPositions]) } for (i in 1:length(warcs)) { filename <- warcs[i] print(filename) res <- readLines(filename, warn = FALSE, encoding = "UTF-8") anchorPositions <- which(str_sub(res, start = 1, end = 9) == '{"xids":{') x <- c(x, res[anchorPositions]) responses <- res[anchorPositions] %>% 
str_replace_all('\\{\\"xids\\":\\{|\\}', "") if(responses==""){ next } links <- str_replace_all(responses, "(\":[^ ]+)|\"", "") quantities <- str_replace_all(responses, ".*:", "") %>% as.integer() df <- data.frame(link = links, quantity = quantities, social = "Com", stringsAsFactors = FALSE) dfN <- dfN + 1 dfList[[dfN]] <- df } dfCom <- bind_rows(dfList) dfCom <- dfCom[dfCom$link!="",] # Combine dfs and reshape them into "link", "FB", "VK", "OK", "Com" dfList <- list() dfList[[1]] <- dfFB dfList[[2]] <- dfVK dfList[[3]] <- dfOK dfList[[4]] <- dfCom df <- bind_rows(dfList) dfCasted <- dcast(df, link ~ social, value.var = "quantity") dfCasted <- dfCasted[order(dfCasted$link),] write.csv(dfCasted, file.path(parsedArticlesFolder, "social_articles.csv"), fileEncoding = "UTF-8") timestamp() }
The result is a single table with the columns link, FB, VK, OK and Com, written to social_articles.csv. Now on to the cleaning.
Cleaning
So, there is an untidy table of 379746 obs. of 21 variables with a size of about 1.7GB on disk, and it has to be brought into shape. The standard read.csv took ages to load it, so I tried fread {data.table} instead. Compare:
> system.time(dfM <- read.csv(untidyDataFile, stringsAsFactors = FALSE, encoding = "UTF-8"))
   user  system elapsed
 133.17    1.50  134.67
> system.time(dfM <- fread(untidyDataFile, stringsAsFactors = FALSE, encoding = "UTF-8"))
Read 379746 rows and 21 (of 21) columns from 1.698 GB file in 00:00:18
   user  system elapsed
  17.67    0.54   18.22
The table itself is full of NA values and other loose ends, the price of the head-on approach at the Parsing stage: duplicate rows, missing dates, the packed chapters column, messy links and image credits. Once it was clear what had to be fixed, the following code took shape:
# Load required packages require(lubridate) require(dplyr) require(tidyr) require(data.table) require(tldextract) require(XML) require(stringr) require(tm) # Set workling directory and locale for macOS and Windows if (Sys.info()['sysname'] == "Windows") { workingDirectory <- paste0(Sys.getenv("HOMEPATH"), "\\lenta") Sys.setlocale("LC_ALL", "Russian") } else { workingDirectory <- ("~/lenta") Sys.setlocale("LC_ALL", "ru_RU.UTF-8") } setwd(workingDirectory) # Set common variables parsedArticlesFolder <- file.path(getwd(), "parsed_articles") tidyArticlesFolder <- file.path(getwd(), "tidy_articles") # Creare required folders if not exist dir.create(tidyArticlesFolder, showWarnings = FALSE) ## STEP 5. Clear and tidy data # Section 7 takes about 2-4 hours TityData <- function() { dfM <- fread(file.path(parsedArticlesFolder, "untidy_articles_data.csv"), stringsAsFactors = FALSE, encoding = "UTF-8") # SECTION 1 print(paste0("1 ",Sys.time())) # Remove duplicate rows, remove rows with url = NA, create urlKey column as a key dtD <- dfM %>% select(-V1,-X) %>% distinct(url, .keep_all = TRUE) %>% na.omit(cols="url") %>% mutate(urlKey = gsub(":|\\.|/", "", url)) # Function SplitChapters is used to process formatted chapter column and retrive rubric # and subrubric SplitChapters <- function(x) { splitOne <- strsplit(x, "lenta.ru:_")[[1]] splitLeft <- strsplit(splitOne[1], ",")[[1]] splitLeft <- unlist(strsplit(splitLeft, ":_")) splitRight <- strsplit(splitOne[2], ":_")[[1]] splitRight <- splitRight[splitRight %in% splitLeft] splitRight <- gsub("_", " ", splitRight) paste0(splitRight, collapse = "|") } # SECTION 2 print(paste0("2 ",Sys.time())) # Process chapter column to retrive rubric and subrubric # Column value such as: # chapters: ["_","","lenta.ru:__:_:______"], // Chapters # should be represented as rubric value " " # and subrubric value "" dtD <- dtD %>% mutate(chapters = gsub('\"|\\[|\\]| |chapters:', "", chapters)) %>% mutate(chaptersFormatted = as.character(sapply(chapters, SplitChapters))) %>% separate(col = "chaptersFormatted", into = c("rubric", "subrubric") , sep = "\\|", extra = "drop", fill = "right", remove = FALSE) %>% filter(!rubric == "NA") %>% select(-chapters, -chaptersFormatted) # SECTION 3 print(paste0("3 ",Sys.time())) # Process imageCredits column and split into imageCreditsPerson # and imageCreditsCompany # Column value such as: ": / " should be represented # as imageCreditsPerson value " " and # imageCreditsCompany value " " pattern <- ': | |: |: |, |()||«|»|\\(|)|\"' dtD <- dtD %>% mutate(imageCredits = gsub(pattern, "", imageCredits)) %>% separate(col = "imageCredits", into = c("imageCreditsPerson", "imageCreditsCompany") , sep = "/", extra = "drop", fill = "left", remove = FALSE) %>% mutate(imageCreditsPerson = as.character(sapply(imageCreditsPerson, trimws))) %>% mutate(imageCreditsCompany = as.character(sapply(imageCreditsCompany, trimws))) %>% select(-imageCredits) # SECTION 4 print(paste0("4 ",Sys.time())) # Function UpdateDatetime is used to process missed values in datetime column # and fill them up with date and time retrived from string presentation # such as "13:47, 18 2017" or from url such # as https://lenta.ru/news/2017/07/18/frg/. 
Hours and Minutes set randomly # from 8 to 21 in last case months <- c("", "", "", "", "", "", "", "", "", "", "", "") UpdateDatetime <- function (datetime, datetimeString, url) { datetimeNew <- datetime if (is.na(datetime)) { if (is.na(datetimeString)) { parsedURL <- strsplit(url, "/")[[1]] parsedURLLength <- length(parsedURL) d <- parsedURL[parsedURLLength-1] m <- parsedURL[parsedURLLength-2] y <- parsedURL[parsedURLLength-3] H <- round(runif(1, 8, 21)) M <- round(runif(1, 1, 59)) S <- 0 datetimeString <- paste0(paste0(c(y, m, d), collapse = "-"), " ", paste0(c(H, M, S), collapse = ":")) datetimeNew <- ymd_hms(datetimeString, tz = "Europe/Moscow", quiet = TRUE) } else { parsedDatetimeString <- unlist(strsplit(datetimeString, ",")) %>% trimws %>% strsplit(" ") %>% unlist() monthNumber <- which(grepl(parsedDatetimeString[3], months)) dateString <- paste0(c(parsedDatetimeString[4], monthNumber, parsedDatetimeString[2]), collapse = "-") datetimeString <- paste0(dateString, " ", parsedDatetimeString[1], ":00") datetimeNew <- ymd_hms(datetimeString, tz = "Europe/Moscow", quiet = TRUE) } } datetimeNew } # Process datetime and fill up missed values dtD <- dtD %>% mutate(datetime = ymd_hms(datetime, tz = "Europe/Moscow", quiet = TRUE)) %>% mutate(datetimeNew = mapply(UpdateDatetime, datetime, datetimeString, url)) %>% mutate(datetime = as.POSIXct(datetimeNew, tz = "Europe/Moscow",origin = "1970-01-01")) # SECTION 5 print(paste0("5 ",Sys.time())) # Remove rows with missed datetime values, rename metaTitle to title, # remove columns that we do not need anymore dtD <- dtD %>% as.data.table() %>% na.omit(cols="datetime") %>% select(-filename, -metaType, -datetimeString, -datetimeNew) %>% rename(title = metaTitle) %>% select(url, urlKey, datetime, rubric, subrubric, title, metaDescription, plaintext, authorLinks, additionalLinks, plaintextLinks, imageDescription, imageCreditsPerson, imageCreditsCompany, videoDescription, videoCredits) # SECTION 6 print(paste0("6 ",Sys.time())) # Clean additionalLinks and plaintextLinks symbolsToRemove <- "href=|-â-|«|»|âŠ|,|â¢|â|â|\n|\"|,|[|]|<a|<br" symbolsHttp <- "http:\\\\\\\\|:http://|-http://|.http://" symbolsHttp2 <- "http://http://|https://https://" symbolsReplace <- "[-|-|#!]" dtD <- dtD %>% mutate(plaintextLinks = gsub(symbolsToRemove,"", plaintextLinks)) %>% mutate(plaintextLinks = gsub(symbolsHttp, "http://", plaintextLinks)) %>% mutate(plaintextLinks = gsub(symbolsReplace, "e", plaintextLinks)) %>% mutate(plaintextLinks = gsub(symbolsHttp2, "http://", plaintextLinks)) %>% mutate(additionalLinks = gsub(symbolsToRemove,"", additionalLinks)) %>% mutate(additionalLinks = gsub(symbolsHttp, "http://", additionalLinks)) %>% mutate(additionalLinks = gsub(symbolsReplace, "e", additionalLinks)) %>% mutate(additionalLinks = gsub(symbolsHttp2, "http://", additionalLinks)) # SECTION 7 print(paste0("7 ",Sys.time())) # Clean additionalLinks and plaintextLinks using UpdateAdditionalLinks # function. 
Links such as: # "http://www.dw.com/ru/../B2 https://www.welt.de/politik/.../de/" # should be represented as "dw.com welt.de" # Function UpdateAdditionalLinks is used to process and clean additionalLinks # and plaintextLinks UpdateAdditionalLinks <- function(additionalLinks, url) { if (is.na(additionalLinks)) { return(NA) } additionalLinksSplitted <- gsub("http://|https://|http:///|https:///"," ", additionalLinks) additionalLinksSplitted <- gsub("http:/|https:/|htt://","", additionalLinksSplitted) additionalLinksSplitted <- trimws(additionalLinksSplitted) additionalLinksSplitted <- unlist(strsplit(additionalLinksSplitted, " ")) additionalLinksSplitted <- additionalLinksSplitted[!additionalLinksSplitted==""] additionalLinksSplitted <- additionalLinksSplitted[!grepl("lenta.", additionalLinksSplitted)] additionalLinksSplitted <- unlist(strsplit(additionalLinksSplitted, "/[^/]*$")) additionalLinksSplitted <- paste0("http://", additionalLinksSplitted) if (!length(additionalLinksSplitted) == 0) { URLSplitted <- c() for(i in 1:length(additionalLinksSplitted)) { parsed <- tryCatch(parseURI(additionalLinksSplitted[i]), error = function(x) {return(NA)}) parsedURL <- parsed["server"] if (!is.na(parsedURL)) { URLSplitted <- c(URLSplitted, parsedURL) } } if (length(URLSplitted)==0){ NA } else { URLSplitted <- URLSplitted[!is.na(URLSplitted)] paste0(URLSplitted, collapse = " ") } } else { NA } } # Function UpdateAdditionalLinksDomain is used to process additionalLinks # and plaintextLinks and retrive source domain name UpdateAdditionalLinksDomain <- function(additionalLinks, url) { if (is.na(additionalLinks)|(additionalLinks=="NA")) { return(NA) } additionalLinksSplitted <- unlist(strsplit(additionalLinks, " ")) if (!length(additionalLinksSplitted) == 0) { parsedDomain <- tryCatch(tldextract(additionalLinksSplitted), error = function(x) {data_frame(domain = NA, tld = NA)}) parsedDomain <- parsedDomain[!is.na(parsedDomain$domain), ] if (nrow(parsedDomain)==0) { #print("--------") #print(additionalLinks) return(NA) } domain <- paste0(parsedDomain$domain, ".", parsedDomain$tld) domain <- unique(domain) domain <- paste0(domain, collapse = " ") return(domain) } else { return(NA) } } dtD <- dtD %>% mutate(plaintextLinks = mapply(UpdateAdditionalLinks, plaintextLinks, url)) %>% mutate(additionalLinks = mapply(UpdateAdditionalLinks, additionalLinks, url)) # Retrive domain from external links using updateAdditionalLinksDomain # function. Links such as: # "http://www.dw.com/ru/../B2 https://www.welt.de/politik/.../de/" # should be represented as "dw.com welt.de" numberOfLinks <- nrow(dtD) groupSize <- 10000 groupsN <- seq(from = 1, to = numberOfLinks, by = groupSize) for (i in 1:length(groupsN)) { n1 <- groupsN[i] n2 <- min(n1 + groupSize - 1, numberOfLinks) dtD$additionalLinks[n1:n2] <- mapply(UpdateAdditionalLinksDomain, dtD$additionalLinks[n1:n2], dtD$url[n1:n2]) dtD$plaintextLinks[n1:n2] <- mapply(UpdateAdditionalLinksDomain, dtD$plaintextLinks[n1:n2], dtD$url[n1:n2]) } # SECTION 8 print(paste0("8 ",Sys.time())) # Clean title, descriprion and plain text. Remove puntuation and stop words. 
# Prepare for the stem step stopWords <- readLines("stop_words.txt", warn = FALSE, encoding = "UTF-8") dtD <- dtD %>% as.tbl() %>% mutate(stemTitle = tolower(title), stemMetaDescription = tolower(metaDescription), stemPlaintext = tolower(plaintext)) dtD <- dtD %>% as.tbl() %>% mutate(stemTitle = enc2utf8(stemTitle), stemMetaDescription = enc2utf8(stemMetaDescription), stemPlaintext = enc2utf8(stemPlaintext)) dtD <- dtD %>% as.tbl() %>% mutate(stemTitle = removeWords(stemTitle, stopWords), stemMetaDescription = removeWords(stemMetaDescription, stopWords), stemPlaintext = removeWords(stemPlaintext, stopWords)) dtD <- dtD %>% as.tbl() %>% mutate(stemTitle = removePunctuation(stemTitle), stemMetaDescription = removePunctuation(stemMetaDescription), stemPlaintext = removePunctuation(stemPlaintext)) dtD <- dtD %>% as.tbl() %>% mutate(stemTitle = str_replace_all(stemTitle, "\\s+", " "), stemMetaDescription = str_replace_all(stemMetaDescription, "\\s+", " "), stemPlaintext = str_replace_all(stemPlaintext, "\\s+", " ")) dtD <- dtD %>% as.tbl() %>% mutate(stemTitle = str_trim(stemTitle, side = "both"), stemMetaDescription = str_trim(stemMetaDescription, side = "both"), stemPlaintext = str_trim(stemPlaintext, side = "both")) # SECTION 9 print(paste0("9 ",Sys.time())) write.csv(dtD, file.path(tidyArticlesFolder, "tidy_articles_data.csv"), fileEncoding = "UTF-8") # SECTION 10 Finish print(paste0("10 ",Sys.time())) # SECTION 11 Adding social dfM <- read.csv(file.path(tidyArticlesFolder, "tidy_articles_data.csv"), stringsAsFactors = FALSE, encoding = "UTF-8") dfS <- read.csv(file.path(parsedArticlesFolder, "social_articles.csv"), stringsAsFactors = FALSE, encoding = "UTF-8") dt <- as.tbl(dfM) dtS <- as.tbl(dfS) %>% rename(url = link) %>% select(url, FB, VK, OK, Com) dtG <- left_join(dt, dtS, by = "url") write.csv(dtG, file.path(tidyArticlesFolder, "tidy_articles_data.csv"), fileEncoding = "UTF-8") }
To see where the time goes, each section prints a time stamp via print(paste0("1 ", Sys.time())). On a 2.7GHz i5, 16Gb Ram, SSD, macOS 10.12, R version 3.4.0 machine the run looked like this:
[1] "1 2017-07-21 16:36:59" [1] "2 2017-07-21 16:37:13" [1] "3 2017-07-21 16:38:15" [1] "4 2017-07-21 16:39:11" [1] "5 2017-07-21 16:42:58" [1] "6 2017-07-21 16:42:58" [1] "7 2017-07-21 16:43:35" [1] "8 2017-07-21 18:41:25" [1] "9 2017-07-21 19:00:32" [1] "10 2017-07-21 19:01:04"
And on the 3.5GHz Xeon E-1240 v5, 32Gb, SSD, Windows Server 2012:
[1] "1 2017-07-21 14:36:44" [1] "2 2017-07-21 14:37:08" [1] "3 2017-07-21 14:38:23" [1] "4 2017-07-21 14:41:24" [1] "5 2017-07-21 14:46:02" [1] "6 2017-07-21 14:46:02" [1] "7 2017-07-21 14:46:57" [1] "8 2017-07-21 18:58:04" [1] "9 2017-07-21 19:30:27" [1] "10 2017-07-21 19:35:18"
Unexpectedly (for me), the slowest part turned out to be the domain extraction in UpdateAdditionalLinksDomain, or rather the tldextract {tldextract} call inside it. Why section 7 takes about 4 hours on the more powerful server against 2 hours on the laptop is a separate question I left for later. In any case, the cleaning works. The result:
> str(dfM, vec.len = 1) 'data.frame': 379746 obs. of 21 variables: $ X.1 : int 1 2 ... $ X : int 1 2 ... $ url : chr "https://lenta.ru/news/2009/12/31/kids/" ... $ filename : chr "C:/Users/ildar/lenta/downloaded_articles/000001-010000/index.html" ... > str(dfM, vec.len = 1) Classes 'data.table' and 'data.frame': 376913 obs. of 19 variables: $ url : chr "https://lenta.ru/news/2009/12/31/kids/" ... $ urlKey : chr "httpslentarunews20091231kids" ... $ datetime : chr "2010-01-01 00:24:33" ... $ rubric : chr "" ... $ subrubric : chr NA ... $ title : chr " " ... $ metaDescription : chr " . "| __truncated__ ... $ plaintext : chr " . "| __truncated__ ... $ authorLinks : chr NA ... $ additionalLinks : chr NA ... $ plaintextLinks : chr "interfax.ru" ... $ imageDescription : chr NA ... $ imageCreditsPerson : chr NA ... $ imageCreditsCompany: chr NA ... $ videoDescription : chr NA ... $ videoCredits : chr NA ... $ stemTitle : chr " " ... $ stemMetaDescription: chr " "| __truncated__ ... $ stemPlaintext : chr " "| __truncated__ ...
A reasonably tidy table of 376913 obs. of 19 variables. Its size on disk:
> file.size(file.path(tidyArticlesFolder, "tidy_articles_data.csv"))/1024/1024 [1] 2741.01
REPRODUCIBLE RESEARCH
While preparing this publication (everything up to the STEMMING section was already done), I decided to check how reproducible the whole pipeline really was and to re-run it from scratch, this time not from 2010 but from the very first day, September 1, 1999. That gave about 700 thousand article links, almost twice as many as before:
> head(articlesLinks)
[1] "https://lenta.ru/news/1999/08/31/stancia_mir/"
[2] "https://lenta.ru/news/1999/08/31/vzriv/"
[3] "https://lenta.ru/news/1999/08/31/credit_japs/"
[4] "https://lenta.ru/news/1999/08/31/fsb/"
[5] "https://lenta.ru/news/1999/09/01/dagestan/"
[6] "https://lenta.ru/news/1999/09/01/kirgiz1/"
> length(articlesLinks)
[1] 702246
Downloading the 700,000 pages (18 years of news) in about 70 parallel wget windows took around 4.5 hours. The result:
> indexFiles <- list.files(downloadedArticlesFolder, full.names = TRUE, recursive = TRUE, pattern = "index")
> length(indexFiles)
[1] 702246
> sum(file.size(indexFiles))/1024/1024
[1] 123682.1
That is 123GB of raw web pages. Parsing them, again in about 70 parallel R sessions, took around 60 minutes and produced roughly 2.8GB of parsed csv files; combined into a single untidy file:
> file.size(file.path(parsedArticlesFolder, "untidy_articles_data.csv"))/1024/1024 [1] 3001.875
Cleaning took about 8 hours (most of it, as before, in section 7). The timings on the 3.5GHz Xeon E-1240 v5, 32Gb, SSD, Windows Server 2012:
[1] "1 2017-07-27 08:21:46" [1] "2 2017-07-27 08:22:44" [1] "3 2017-07-27 08:25:21" [1] "4 2017-07-27 08:30:56" [1] "5 2017-07-27 08:38:29" [1] "6 2017-07-27 08:38:29" [1] "7 2017-07-27 08:40:01" [1] "8 2017-07-27 15:55:18" [1] "9 2017-07-27 16:44:49" [1] "10 2017-07-27 16:53:02"
The tidy file grew to about 4.5GB:
> file.size(file.path(tidyArticlesFolder, "tidy_articles_data.csv"))/1024/1024 [1] 4534.328
STEMMING
The next step is stemming (or, more precisely, lemmatization): to simplify further analysis, the words in the title, description and body text are reduced to their base forms, so that different inflections of the same word are counted as one. Yandex's MyStem does this job. The wrapper code:
# Load required packages require(data.table) require(dplyr) require(tidyr) require(stringr) require(gdata) # Set workling directory and locale for macOS and Windows if (Sys.info()['sysname'] == "Windows") { workingDirectory <- paste0(Sys.getenv("HOMEPATH"), "\\lenta") Sys.setlocale("LC_ALL", "Russian") } else { workingDirectory <- ("~/lenta") Sys.setlocale("LC_ALL", "ru_RU.UTF-8") } setwd(workingDirectory) # Load library that helps to chunk vectors source("chunk.R") # Set common variables tidyArticlesFolder <- file.path(getwd(), "tidy_articles") stemedArticlesFolder <- file.path(getwd(), "stemed_articles") # Create required folders if not exist dir.create(stemedArticlesFolder, showWarnings = FALSE) ## STEP 6. Stem title, description and plain text # Write columns on disk, run mystem, read stemed data and add to data.table StemArticlesData <- function() { # Read tidy data and keep only column that have to be stemed. # Add === separate rows in stem output. # dt that takes about 5GB RAM for 700000 obs. of 25 variables # and 2.2GB for 700000 obs. of 5 variables as tbl timestamp(prefix = "## START reading file ") tidyDataFile <- file.path(tidyArticlesFolder, "tidy_articles_data.csv") dt <- fread(tidyDataFile, stringsAsFactors = FALSE, encoding = "UTF-8") %>% as.tbl() dt <- dt %>% mutate(sep = "===") %>% select(sep, X, stemTitle, stemMetaDescription, stemPlaintext) # Check memory usage print(ll(unit = "MB")) # Prepare the list that helps us to stem 3 column sectionList <- list() sectionList[[1]] <- list(columnToStem = "stemTitle", stemedColumn = "stemedTitle", sourceFile = file.path(stemedArticlesFolder, "stem_titles.txt"), stemedFile = file.path(stemedArticlesFolder, "stemed_titles.txt")) sectionList[[2]] <- list(columnToStem = "stemMetaDescription", stemedColumn = "stemedMetaDescription", sourceFile = file.path(stemedArticlesFolder, "stem_metadescriptions.txt"), stemedFile = file.path(stemedArticlesFolder, "stemed_metadescriptions.txt")) sectionList[[3]] <- list(columnToStem = "stemPlaintext", stemedColumn = "stemedPlaintext", sourceFile = file.path(stemedArticlesFolder, "stem_plaintext.txt"), stemedFile = file.path(stemedArticlesFolder, "stemed_plaintext.txt")) timestamp(prefix = "## steming file ") # Write the table with sep, X, columnToStem columns and run mystem. # It takes about 30 min to process Title, MetaDescription and Plaintext # in 700K rows table. # https://tech.yandex.ru/mystem/ for (i in 1:length(sectionList)) { write.table(dt[, c("sep","X", sectionList[[i]]$columnToStem)], sectionList[[i]]$sourceFile, fileEncoding = "UTF-8", sep = ",", quote = FALSE, row.names = FALSE, col.names = FALSE) system(paste0("mystem -nlc ", sectionList[[i]]$sourceFile, " ", sectionList[[i]]$stemedFile), intern = FALSE) } # Remove dt from memory and call garbage collection rm(dt) gc() # Check memory usage print(ll(unit = "MB")) timestamp(prefix = "## process file ") # Process stemed files. 
it takes about 60 min to process 3 stemed files for (i in 1:length(sectionList)) { stemedText <- readLines(sectionList[[i]]$stemedFile, warn = FALSE, encoding = "UTF-8") # Split stemed text in chunks chunkList <- chunk(stemedText, chunk.size = 10000000) # Clean chunks one by one and remove characters that were added by mystem resLines <- c() for (j in 1:length(chunkList)) { resTemp <- chunkList[[j]] %>% str_replace_all("===,", "===") %>% strsplit(split = "\\\\n|,") %>% unlist() %>% str_replace_all("(\\|[^ ]+)|(\\\\[^ ]+)|\\?|,|_", "") resLines <- c(resLines, resTemp[resTemp!=""]) } # Split processed text in rows using === added at the beginnig chunkedRes <- chunk(resLines, chunk.delimiter = "===", fixed.delimiter = FALSE, keep.delimiter = TRUE) # Process each row and extract key (row number) and stemed content stemedList <- lapply(chunkedRes, function(x) { data.frame(key = as.integer(str_replace_all(x[1], "===", "")), content = paste0(x[2:length(x)], collapse = " "), stringsAsFactors = FALSE)}) # Combine all rows in data frame with key and content colums sectionList[[i]]$dt <- bind_rows(stemedList) colnames(sectionList[[i]]$dt) <- c("key", sectionList[[i]]$stemedColumn) } # Remove variables used in loop and call garbage collection rm(stemedText, chunkList, resLines, chunkedRes, stemedList) gc() # Check memory usage print(ll(unit = "MB")) # read tidy data again timestamp(prefix = "## reading file (again)") dt <- fread(tidyDataFile, stringsAsFactors = FALSE, encoding = "UTF-8") %>% as.tbl() # add key column as a key and add tables with stemed data to tidy data timestamp(prefix = paste0("## combining tables ")) dt <- dt %>% mutate(key = X) dt <- left_join(dt, sectionList[[1]]$dt, by = "key") dt <- left_join(dt, sectionList[[2]]$dt, by = "key") dt <- left_join(dt, sectionList[[3]]$dt, by = "key") sectionList[[1]]$dt <- "" sectionList[[2]]$dt <- "" sectionList[[3]]$dt <- "" dt <- dt %>% select(-V1, -X, -urlKey, -metaDescription, -plaintext, -stemTitle, -stemMetaDescription, -stemPlaintext, - key) write.csv(dt, file.path(stemedArticlesFolder, "stemed_articles_data.csv"), fileEncoding = "UTF-8") file.remove(sectionList[[1]]$sourceFile) file.remove(sectionList[[2]]$sourceFile) file.remove(sectionList[[3]]$sourceFile) file.remove(sectionList[[1]]$stemedFile) file.remove(sectionList[[2]]$stemedFile) file.remove(sectionList[[3]]$stemedFile) # Remove dt, sectionList and call garbage collection rm(dt) gc() # Check memory usage print(ll(unit = "MB")) timestamp(prefix = "## END ") }
After the run, the result looks like this:
> file.size(file.path(stemedArticlesFolder, ))/1024/1024 [1] 2273.52 > str(x, vec.len = 1) Classes 'data.table' and 'data.frame': 697601 obs. of 21 variables: $ V1 : chr ... $ url : chr ... $ datetime : chr ... $ rubric : chr ... $ subrubric : chr NA ... $ title : chr \\\\\\\\ ... $ authorLinks : chr NA ... $ additionalLinks : chr NA ... $ plaintextLinks : chr NA ... $ imageDescription : chr NA ... $ imageCreditsPerson : chr NA ... $ imageCreditsCompany : chr NA ... $ videoDescription : chr NA ... $ videoCredits : chr NA ... $ FB : int 0 0 ... $ VK : int NA 0 ... $ OK : int 0 0 ... $ Com : int NA NA ... $ stemedTitle : chr ... $ stemedMetaDescription: chr | __truncated__ ... $ stemedPlaintext : chr | __truncated__ ... - attr(*, )=<externalptr>
And that completes the data preparation. The analysis itself is still ahead, but that, as they say, is a completely different story.
PS
A few questions to experienced readers:
- How would you have solved the "parallel download and parse" problem? On macOS, Linux, Windows?
- Is there a more idiomatic way to profile the sections than timestamp() and print("section N")?
- Why is section 7 roughly twice as slow on the 3.5GHz Xeon E-1240 v5, 32Gb, SSD, Windows Server 2012 box than on the 2.7GHz i5, 16Gb Ram, SSD, macOS 10.12 laptop?
- What "code smells" do you see in the code above?
PPS - , , , , , . ⊠.
Continuation: Lenta-anal.ru (part 2).
Data Engineering
To go further I needed at least basic administration skills (the analysis itself was still ahead), so I worked through the Linux Foundation Certified System Administrator (LFCS) video course by Sander van Vugt, some 15 hours of material covering everything from the basics up to the kernel. After that a small server (2 cores, 2GB of RAM) was set up with RStudio Server and MongoDB, and the one-off scripts were turned into a pipeline that runs continuously, keeps the data up to date and can be restarted from any point.
Four collections were defined:
DefCollections <- function() { collections <- c("c01_daytobeprocessed", "c02_linkstobeprocessed", "c03_pagestobeprocessed", "c04_articlestobeprocessed") return(collections) }
And four scripts to go with them:
DefScripts <- function() { scripts <- c("01_days_process.R", "02_links_process.R", "03_pages_process.R", "04_articles_process.R") return(scripts) }
The chain works like this (a sketch of one such worker follows the list):
- The collection c01_daytobeprocessed holds the days that still have to be processed, as records like 2010-01-10.
- 01_days_process.R takes days from c01_daytobeprocessed, fetches the archive page for each day, extracts the article links and puts them into c02_linkstobeprocessed.
- 02_links_process.R takes links from c02_linkstobeprocessed, downloads the corresponding pages and stores their content in c03_pagestobeprocessed.
- 03_pages_process.R takes pages from c03_pagestobeprocessed, parses out the fields and puts the resulting articles into c04_articlestobeprocessed.
- 04_articles_process.R, finally, picks up the articles from c04_articlestobeprocessed for the last processing step.
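A minimal sketch of one such worker, assuming the mongolite package and a local MongoDB; the collection names come from DefCollections() above, while the database name "lenta" and the day/status field names are illustrative:

require(mongolite)
require(rvest)

daysQueue  <- mongo(collection = "c01_daytobeprocessed",   db = "lenta")
linksQueue <- mongo(collection = "c02_linkstobeprocessed", db = "lenta")

# Take one unprocessed day, collect its article links and queue them up
day <- daysQueue$find(query = '{"status": 0}', limit = 1)
if (nrow(day) == 1) {
  archiveURL <- paste0("https://lenta.ru/", format(as.Date(day$day), "%Y/%m/%d"), "/")
  pg <- read_html(archiveURL, encoding = "UTF-8")
  links <- html_nodes(pg, xpath = ".//section[@class='b-longgrid-column']//div[@class='titles']//a") %>%
    html_attr("href")
  linksQueue$insert(data.frame(link = paste0("https://lenta.ru", links),
                               status = 0, stringsAsFactors = FALSE))
  daysQueue$update(paste0('{"day": "', day$day, '"}'), '{"$set": {"status": 1}}')
}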
Each of these scripts can be run in any number of copies, since every copy grabs its own portion of records from its collection, and cron makes the whole thing tick over on its own. To (re)build the dataset it is enough to drop the full range of days, c("1999-09-01", as.Date(Sys.time())), into the first collection and wait.
The launching itself is organised like this:
- Every 5 minutes cron runs a small dispatcher script.
- If the CPU load is below 80% and there are unprocessed records in c01_daytobeprocessed, it starts up to 10 copies of 01_days_process.R; the other queues are served in the same way.
- So every 5 minutes, as long as the load stays under 80%, more workers are added, up to about 100 running at once.
A rough sketch of such a dispatcher is shown below.
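This is only an illustration of the idea, assuming a Linux host (the load is read from /proc/loadavg), the mongolite setup from the sketch above and the worker scripts listed earlier; the 80% threshold and the limit of 10 copies follow the description, the paths and names are hypothetical:

#!/usr/bin/Rscript
# Assumed to be run by cron every 5 minutes, e.g.:
# */5 * * * * Rscript /home/user/lenta/dispatcher.R

require(mongolite)

# 1-minute load average as a percentage of the available cores
loadAvg <- as.numeric(strsplit(readLines("/proc/loadavg"), " ")[[1]][1])
loadPct <- 100 * loadAvg / parallel::detectCores()

if (loadPct < 80) {
  daysQueue <- mongo(collection = "c01_daytobeprocessed", db = "lenta")
  if (daysQueue$count('{"status": 0}') > 0) {
    # Start up to 10 background copies of the first worker
    for (i in 1:10) {
      system("Rscript 01_days_process.R", wait = FALSE)
    }
  }
}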

, c("1999-09-01", as.Date(Sys.time())
. , 2 2 , 16, 8. , 700 , .
â , .
. , CSV, , , .
The result can be explored at LENTA-ANAL.RU. The full dataset (01-09-1999 to 04-12-2017) is available as lenta-ru-data-set_19990901_20171204.zip.
A sample record looks like this*:
[ { "_id": { "$oid": "5a0b067b537f258f034ffc51" }, "link": "https://lenta.ru/news/2017/11/14/cazino/", "linkDate": "20171114", "status": 0, "updated_at": "20171204154248 ICT", "process": "", "page": [ { "urlKey": "httpslentarunews20171114cazino", "url": "https://lenta.ru/news/2017/11/14/cazino/", "metaTitle": " ", "metaType": "article", "metaDescription": "17 , â , . 260 « », . .", "datetime": "2017-11-14 17:40:00", "datetimeString": " 17:40, 14 2017", "plaintext": " (), . , 14 , «.» () . , 2015 , 51- 33- . 15 , . , , , 66 . , , , 260 , , , , . 12 , , . «.», , . , 200 . « , , » (171.2 ) « » (210 ). «.» , . . 31 , . . 1 2009 .", "imageCreditsPerson": " ", "imageCreditsCompany": " ", "dateToUse": "20171114", "rubric": " ", "subrubric": " ", "stemedTitle": " ", "stemedMetaDescription": "17 260 ", "stemedPlaintext": " 14 2015 51 33 15 66 260 12 200 1712 210 31 1 2009 " } ], "social": [ { "FB": 1, "VK": 0, "OK": 1, "Com": 1 } ], "comments": [ { "id": 30154446, "hasLink": false, "hasGreyWord": false, "text": " . , , . , . , , , . , ; , â . , , , , , , ...\n\n â ", "moderation": "approved", "createdAt": "2017-11-14 23:10:31", "sessionSourceIcon": "livejournal", "userId": 2577791, "userpic": "https://avatars-c.rambler.ru/ca/v3/c/av/6d3dcf4b71dfde1edcfe149198747a48099f867369dfdb4ad8e2be69e41e2276b288a819cb98d3f3dafd72e1a2d46bf6529d74e4d245e4ada673fc057a48bfd5ba3c2f1d1f39fc57c9ec3d3f3a0ea8d1", "displayName": "vladtmb", "level": 0, "childrenCount": 0, "hasChild": false, "stemedText": " " } ] } ]
* For those who do not want to download the whole archive, there is also a sample (100 records).
, - . , .