Best packages for data manipulation in R, part 1

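R has two excellent packages for data manipulation: dplyr and data.table. Each has its own strengths. dplyr is elegant and reads much like natural language, while data.table is concise and can do a lot in a single line. data.table is also faster in some cases (benchmark comparisons are available), which can settle the choice when memory or performance is constrained. Comparisons of dplyr and data.table can also be found on Stack Overflow and Quora.

A manual and a short introduction are available for data.table, and the same exists for dplyr. dplyr is also covered on DataScience+.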

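Background

I have used both dplyr and data.table for data manipulation for a long time. If you know only one of the two packages, looking at code that does the same thing in both is a useful way to learn the other.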

dplyr

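dplyr has five verbs designed to cover most data manipulation tasks. select: pick one or more columns. filter: pick rows that meet a condition. arrange: sort the data by one or more columns, in ascending or descending order. mutate: add new columns to the data. summarise: reduce groups of rows to summary values.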
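The five verbs can be sketched on a small table. This is only an illustration: the `claims` data frame and its columns are made up for the example and are not part of the hospital data used later.

```r
library(dplyr)

# Hypothetical toy data, not taken from the hospital data set
claims <- data.frame(
  state  = c("AL", "AL", "AK", "AK"),
  type   = c("Inpatient", "Outpatient", "Inpatient", "Outpatient"),
  amount = c(100, 40, 120, 60)
)

picked   <- select(claims, state, amount)             # keep two columns
big      <- filter(claims, amount > 50)               # keep rows matching a condition
sorted   <- arrange(claims, desc(amount))             # sort by amount, descending
with_k   <- mutate(claims, amount_k = amount / 1000)  # add a derived column
by_state <- summarise(group_by(claims, state),        # one row per state
                      total = sum(amount))
```

Each verb takes a data frame as its first argument and returns a new one, so `claims` itself is left unchanged.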

data.table

The general form of a data.table call is very short: DT[i, j, by]. It can be read as: take DT, select rows using i, then compute j, grouped by by.
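A minimal sketch of the three slots working together, again on made-up data (`claims_DT` and its columns are hypothetical):

```r
library(data.table)

# Hypothetical toy data
claims_DT <- data.table(
  state  = c("AL", "AL", "AK", "AK"),
  type   = c("Inpatient", "Outpatient", "Inpatient", "Outpatient"),
  amount = c(100, 40, 120, 60)
)

# i:  keep rows with amount > 50
# j:  compute the total of the remaining amounts
# by: produce one result row per state
totals <- claims_DT[amount > 50, .(total = sum(amount)), by = state]
```

Any of the three slots can be left empty; for example, `claims_DT[, .(total = sum(amount))]` computes a single total over all rows.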

Working with the data

First, load the packages used in this project (install any that are missing beforehand).

 library(dplyr)
 library(data.table)
 library(lubridate)
 library(jsonlite)
 library(tidyr)
 library(ggplot2)
 library(compare)

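We will use data from DATA.GOV: Medicare hospital spending by claim, downloaded from the URL in the code below. The data is loaded in JSON format with the fromJSON function from the jsonlite package. JSON is a standard data format for asynchronous communication between browser and server, so it is worth understanding the code used to fetch the data. Tutorials on working with JSON data through jsonlite are available if you want more detail. If you only want to focus on dplyr and data.table, however, you can safely run the two code blocks below and skip the details.

 spending = fromJSON("https://data.medicare.gov/api/views/nrth-mfg3/rows.json?accessType=DOWNLOAD")
 names(spending)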

 "meta" "data"

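 meta = spending$meta
 hospital_spending = data.frame(spending$data)
 # take the column names from the metadata and make them valid R names
 colnames(hospital_spending) = make.names(meta$view$columns$name)
 # drop the leading metadata columns (sid through meta)
 hospital_spending = select(hospital_spending, -c(sid:meta))
 glimpse(hospital_spending)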

 Observations: 70598
 Variables:
 $ Hospital.Name                       (fctr) SOUTHEAST ALABAMA MEDICAL CENT...
 $ Provider.Number.                    (fctr) 010001, 010001, 010001, 010001...
 $ State                               (fctr) AL, AL, AL, AL, AL, AL, AL, AL...
 $ Period                              (fctr) 1 to 3 days Prior to Index Hos...
 $ Claim.Type                          (fctr) Home Health Agency, Hospice, I...
 $ Avg.Spending.Per.Episode..Hospital. (fctr) 12, 1, 6, 160, 1, 6, 462, 0, 0...
 $ Avg.Spending.Per.Episode..State.    (fctr) 14, 1, 6, 85, 2, 9, 492, 0, 0,...
 $ Avg.Spending.Per.Episode..Nation.   (fctr) 13, 1, 5, 117, 2, 9, 532, 0, 0...
 $ Percent.of.Spending..Hospital.      (fctr) 0.06, 0.01, 0.03, 0.84, 0.01, ...
 $ Percent.of.Spending..State.         (fctr) 0.07, 0.01, 0.03, 0.46, 0.01, ...
 $ Percent.of.Spending..Nation.        (fctr) 0.07, 0.00, 0.03, 0.58, 0.01, ...
 $ Measure.Start.Date                  (fctr) 2014-01-01T00:00:00, 2014-01-0...
 $ Measure.End.Date                    (fctr) 2014-12-31T00:00:00, 2014-12-3...

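As shown above, every column was imported as a factor. Let's convert the columns that hold numbers to numeric.

 cols = 6:11  # columns to convert to numeric
 # as.numeric() on a factor returns level codes, so go through character first
 hospital_spending[,cols] <- lapply(hospital_spending[,cols], function(x) as.numeric(as.character(x)))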
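One thing to watch when converting factor columns to numbers: as.numeric() applied directly to a factor returns the internal level codes, not the values the labels spell out. A minimal sketch with a made-up factor:

```r
# Hypothetical factor holding numbers as labels
f <- factor(c("462", "12", "6"))

codes  <- as.numeric(f)                # internal level codes, not 462, 12, 6
values <- as.numeric(as.character(f))  # the numbers the labels spell out
```

Converting to character first (or using `as.numeric(levels(f))[f]`) recovers the original values.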

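The last two columns hold the start and end of the measurement period. Let's use the lubridate package to convert them to date-times.

 cols = 12:13  # columns to convert to date-times
 hospital_spending[,cols] <- lapply(hospital_spending[,cols], ymd_hms)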

Now let's check that the columns have the right types.

 sapply(hospital_spending, class)

 $Hospital.Name                        "factor"
 $Provider.Number.                     "factor"
 $State                                "factor"
 $Period                               "factor"
 $Claim.Type                           "factor"
 $Avg.Spending.Per.Episode..Hospital.  "numeric"
 $Avg.Spending.Per.Episode..State.     "numeric"
 $Avg.Spending.Per.Episode..Nation.    "numeric"
 $Percent.of.Spending..Hospital.       "numeric"
 $Percent.of.Spending..State.          "numeric"
 $Percent.of.Spending..Nation.         "numeric"
 $Measure.Start.Date                   "POSIXct" "POSIXt"
 $Measure.End.Date                     "POSIXct" "POSIXt"

Creating a data table

A data table (data.table) can be created with the data.table() function.

 hospital_spending_DT = data.table(hospital_spending)
 class(hospital_spending_DT)

 "data.table" "data.frame"

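Selecting columns

To select columns in dplyr, use the select verb. In data.table, by contrast, the column names are given directly in j.

Selecting a single variable

Let's select the Hospital.Name variable.

 from_dplyr = select(hospital_spending, Hospital.Name)
 from_data_table = hospital_spending_DT[,.(Hospital.Name)]

Now let's check that the dplyr and data.table results are the same.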

 compare(from_dplyr,from_data_table, allowAll=TRUE)

 TRUE dropped attributes

Removing a single variable

 from_dplyr = select(hospital_spending, -Hospital.Name)
 from_data_table = hospital_spending_DT[,!c("Hospital.Name"),with=FALSE]
 compare(from_dplyr,from_data_table, allowAll=TRUE)

 TRUE dropped attributes

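We can also use the := function, which modifies the input data table (data.table) by reference. Here it is combined with copy(), which makes a copy of the original object. Subsequent by-reference operations on the copy do not affect the original object.

 DT = copy(hospital_spending_DT)
 DT = DT[,Hospital.Name:=NULL]
 "Hospital.Name" %in% names(DT)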

 FALSE

Similarly, several variables can be removed at once.

 DT = copy(hospital_spending_DT)
 DT = DT[,c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date"):=NULL]
 c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date") %in% names(DT)

 FALSE FALSE FALSE FALSE
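Why copy() matters with := can be seen in a small sketch. The tables below are made up for illustration. Plain assignment of a data.table does not copy it, so a by-reference change made through the second name also shows up under the first.

```r
library(data.table)

# Without copy(): both names refer to the same table
original <- data.table(a = 1:3, b = letters[1:3])
alias <- original
alias[, b := NULL]    # removes b by reference
names(original)       # "a": the original lost the column too

# With copy(): the original is left alone
original2 <- data.table(a = 1:3, b = letters[1:3])
safe <- copy(original2)
safe[, b := NULL]
names(original2)      # "a" "b"
```

Because := already modifies the table in place, the reassignment `DT = DT[, ... := NULL]` is not strictly needed; calling `DT[, ... := NULL]` on its own has the same effect.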

Selecting multiple variables

Let's select the variables Hospital.Name, State, Measure.Start.Date and Measure.End.Date.

 from_dplyr = select(hospital_spending, Hospital.Name,State,Measure.Start.Date,Measure.End.Date)
 from_data_table = hospital_spending_DT[,.(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)]
 compare(from_dplyr,from_data_table, allowAll=TRUE)

 TRUE dropped attributes

Removing multiple variables

Let's remove the variables Hospital.Name, State, Measure.Start.Date and Measure.End.Date from the original hospital_spending data set and from the hospital_spending_DT data table.

 from_dplyr = select(hospital_spending, -c(Hospital.Name,State,Measure.Start.Date,Measure.End.Date))
 from_data_table = hospital_spending_DT[,!c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date"),with=FALSE]
 compare(from_dplyr,from_data_table, allowAll=TRUE)

 TRUE dropped attributes

dplyr provides the helpers contains(), starts_with() and ends_with() for use with the select verb. In data.table, regular expressions can be used instead. As an example, let's select the columns whose names contain the word "Date".

 from_dplyr = select(hospital_spending,contains("Date"))
 from_data_table = subset(hospital_spending_DT,select=grep("Date",names(hospital_spending_DT)))
 compare(from_dplyr,from_data_table, allowAll=TRUE)

 TRUE dropped attributes

 names(from_dplyr)

 "Measure.Start.Date" "Measure.End.Date"

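The same regular-expression approach can stand in for starts_with() and ends_with() by anchoring the pattern. The sketch below uses a made-up table (the `Date.Loaded` column is hypothetical) and passes the matching column positions to .SDcols, which is one of several ways to select columns by pattern in data.table.

```r
library(data.table)

# Hypothetical toy table; Date.Loaded is invented for the example
DT <- data.table(
  Measure.Start.Date = as.Date("2014-01-01"),
  Measure.End.Date   = as.Date("2014-12-31"),
  Date.Loaded        = as.Date("2015-06-01"),
  State              = "AL"
)

# "Date$" matches names ending in Date, like ends_with("Date")
ends_in_date <- DT[, .SD, .SDcols = grep("Date$", names(DT))]

# "^Date" matches names starting with Date, like starts_with("Date")
starts_with_date <- DT[, .SD, .SDcols = grep("^Date", names(DT))]
```
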
Renaming columns

 setnames(hospital_spending_DT,
          c("Hospital.Name", "Measure.Start.Date","Measure.End.Date"),
          c("Hospital","Start_Date","End_Date"))
 names(hospital_spending_DT)

 "Hospital" "Provider.Number." "State" "Period" "Claim.Type" "Avg.Spending.Per.Episode..Hospital." "Avg.Spending.Per.Episode..State." "Avg.Spending.Per.Episode..Nation." "Percent.of.Spending..Hospital." "Percent.of.Spending..State." "Percent.of.Spending..Nation." "Start_Date" "End_Date"

 hospital_spending = rename(hospital_spending, Hospital=Hospital.Name, Start_Date=Measure.Start.Date, End_Date=Measure.End.Date)
 compare(hospital_spending,hospital_spending_DT, allowAll=TRUE)

 TRUE dropped attributes
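The two renaming calls differ in one respect worth noting: setnames() changes the data.table in place, while rename() returns a new data frame, which is why the dplyr call assigns its result back. A small sketch on made-up data:

```r
library(dplyr)
library(data.table)

# Hypothetical one-row table
df <- data.frame(Hospital.Name = "A", State = "AL")
DT <- as.data.table(df)

renamed_df <- rename(df, Hospital = Hospital.Name)  # df itself is unchanged
setnames(DT, "Hospital.Name", "Hospital")           # DT is changed in place

names(df)   # "Hospital.Name" "State"
names(DT)   # "Hospital" "State"
```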
