
Everyone knows how to count, but not everyone knows how to count fast. In this article we take a detailed look at ways to optimize counting in PostgreSQL. There are tricks that can speed up row counts by orders of magnitude.
To tackle the problem seriously, we have to distinguish several kinds of count, each with its own methods. The questions to settle are:
- whether you need an exact row count or an estimate is good enough;
- whether duplicates should be counted or only distinct values matter;
- whether you need to count every row in the table or only the rows that satisfy a condition.
We will analyze the solutions for each situation and compare their speed and resource consumption. After covering a single database, we will use Citus to demonstrate parallel execution of count in a distributed database.
Preparing the database
The tests use a database named count, initialized with pgbench.
[user@comp ~]$ pgbench -i count
Create the test table.
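One way to build such a table is sketched below. It assumes the table has the same two columns and uses the same random generator as the INSERT statement shown later in the article (an integer column n and a text column s, one million rows). The VACUUM ANALYZE step is an added assumption, included so that planner statistics are fresh before measuring.

```sql
-- Assumed definition: columns n and s mirror the INSERT used later in the article.
CREATE TABLE items AS
  SELECT
    (random() * 1000000)::integer AS n,  -- random integers, duplicates expected
    md5(random()::text)           AS s   -- 32-character hex strings
  FROM generate_series(1, 1000000);

-- Refresh planner statistics and the visibility map (assumption, not from the text).
VACUUM ANALYZE items;
```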
Counting with duplicates
Exact counts
Let's start at the beginning: getting the exact number of rows, duplicates included, in a whole table or part of it, with good old count(*). The run time of this command gives us a baseline for judging the speed of the other counting methods.
pgbench is a convenient tool for running a query repeatedly and collecting performance statistics.
A note on count(1) vs count(*). You might think that count(1) is faster, because count(*) has to process the values of every column in the current row. In fact the opposite is true. Unlike in SELECT *, the asterisk in count(*) means nothing. PostgreSQL treats the expression count(*) as a special case of count with no arguments. (It would be more accurate to write this expression as count().) count(1), on the other hand, takes one argument, and PostgreSQL has to check on every row that this argument (1) really is not NULL.
Repeating the previous test with count(1) gave an average of about 99 ms, against 85 ms for count(*) (both figures appear in the summary table).
In any case, both count(1) and count(*) are slow by definition. To keep concurrent transactions consistent, PostgreSQL uses Multiversion Concurrency Control (MVCC). This means that each transaction may see different rows, and even a different number of rows, in a table. So there is no single correct row count that the DBMS could cache; the system has to scan all the rows and work out which of them are visible to the given transaction. The run time of an exact count grows linearly with the size of the table.
EXPLAIN SELECT count(*) FROM items;

Aggregate  (cost=20834.00..20834.01 rows=1 width=0)
  ->  Seq Scan on items  (cost=0.00..18334.00 rows=1000000 width=0)
The scan accounts for 88% of the query cost (18334 of 20834). If the table size doubles, the cost of the scan and of the aggregation grows proportionally, and the query takes roughly twice as long.
Rows | Average time |
---|---|
1 million | 85 ms |
2 million | 161 ms |
4 million | 343 ms |
How can we speed this up? There are two options: decide that an estimate is good enough, or cache the row count ourselves. In the second case we would have to store a separate value for each table and for each WHERE expression for which we want a fast count.
Let's look at an example of manually caching the value of count(*) for the whole items table. The following trigger-based solution is an adaptation of a method proposed by A. Elein Mustain. PostgreSQL's MVCC machinery keeps items and the table holding the row count consistent with each other.
BEGIN;

CREATE TABLE row_counts (
  relname   text PRIMARY KEY,
  reltuples bigint
);
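The remaining steps of such a trigger-based cache can be sketched as follows. This is a minimal sketch, not necessarily the listing the author used: the function name adjust_count and the trigger name items_count are placeholders, and it assumes the row_counts table above and the items test table already exist. It seeds the cache with the current count, keeps it in step on every insert and delete, and commits the transaction opened above.

```sql
-- Continues the transaction opened above; assumes row_counts and items exist.

-- Seed the cache with the current exact count.
INSERT INTO row_counts (relname, reltuples)
VALUES ('items', (SELECT count(*) FROM items));

-- Hypothetical helper: adjust the cached count for the table that fired the trigger.
CREATE OR REPLACE FUNCTION adjust_count() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE row_counts SET reltuples = reltuples + 1
     WHERE relname = TG_TABLE_NAME::text;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE row_counts SET reltuples = reltuples - 1
     WHERE relname = TG_TABLE_NAME::text;
  END IF;
  RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_count
  AFTER INSERT OR DELETE ON items
  FOR EACH ROW EXECUTE PROCEDURE adjust_count();

COMMIT;

-- Reading the cached count is now a primary-key lookup.
SELECT reltuples FROM row_counts WHERE relname = 'items';
```

Every insert or delete now also updates the single row in row_counts, which is where the extra write cost of this approach comes from.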
Reading and updating the cached value does not depend on the size of the table, so fetching the row count is very fast. However, this technique adds overhead to insert and delete operations. Without the trigger, the following command runs in 4.7 seconds; with the trigger, the insert is 50 times slower.
INSERT INTO items (n, s)
  SELECT
    (random()*1000000)::integer AS n,
    md5(random()::text) AS s
  FROM generate_series(1,1000000);
Estimated counts
Whole-table estimates
Caching the table's row count slows down insert operations. If you are willing to settle for an estimate instead of the exact number, you can get fast reads without any effect on insert time. For this we can use the bookkeeping data that PostgreSQL collects anyway. Its sources are the statistics collector and the autovacuum daemon.
There are several options for getting an estimate.
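Two such lookups are sketched below. The sketch assumes the estimates meant here are the statistics collector's n_live_tup counter and the reltuples value that VACUUM and ANALYZE store in pg_class. Both are standard PostgreSQL catalogs, and items is the test table.

```sql
-- Estimate maintained by the statistics collector.
SELECT n_live_tup
  FROM pg_stat_all_tables
 WHERE relname = 'items';

-- Estimate stored by the most recent VACUUM or ANALYZE.
SELECT reltuples
  FROM pg_class
 WHERE relname = 'items';
```

If several schemas contain a table named items, each query returns one row per table, so a schema filter may be needed.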
But there is a more reliable source, whose data is refreshed more often. Advice from Andrew Gierth (RhodiumToad):
Remember: the planner doesn't actually use reltuples as such; it multiplies the ratio reltuples / relpages by the current number of pages.
The logic is this: as the amount of data in a table grows, the average number of rows that fit in a physical page usually changes less than their total number does. To get a more accurate estimate of the current row count, we can multiply the average number of rows per page by the up-to-date number of pages the table currently occupies.
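That calculation can be written as a single query. This is a sketch under the assumption that the table has been vacuumed or analyzed at least once, so that relpages is non-zero. pg_relation_size and the block_size setting supply the current page count, and NULLIF is added here only to avoid a division by zero.

```sql
SELECT (reltuples / NULLIF(relpages, 0))             -- average rows per page at the last VACUUM/ANALYZE
       * (pg_relation_size('items')
          / current_setting('block_size')::integer)  -- pages the table occupies right now
       AS estimated_rows
  FROM pg_class
 WHERE relname = 'items';
```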
Filtered estimates
In the previous section we obtained an estimated row count for a whole table. Can we do the same for only those rows that match a WHERE condition? Michael Fuhr came up with an interesting method: run EXPLAIN on the query and parse the result.
CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
  rec  record;
  rows integer;
BEGIN
  FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
    rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
    EXIT WHEN rows IS NOT NULL;
  END LOOP;
  RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;
The function can be used like this:
SELECT count_estimate('SELECT 1 FROM items WHERE n < 1000');
The accuracy of this method depends on the planner, which estimates the selectivity of the WHERE clause in various ways and so determines the number of rows the query is expected to return.
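As a variation on the same idea, the plan can be requested in JSON so that no text parsing is needed. This is a sketch rather than part of the original method: the name count_estimate_json is made up for illustration, and it relies on EXPLAIN (FORMAT JSON) reporting the top node's estimate under the key "Plan Rows".

```sql
-- Hypothetical variant of count_estimate that reads the JSON plan instead of parsing text.
CREATE FUNCTION count_estimate_json(query text) RETURNS bigint AS $$
DECLARE
  plan json;
BEGIN
  -- EXPLAIN (FORMAT JSON) returns one row holding a JSON array with a single plan.
  EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
  -- The top-level plan node carries the planner's row estimate.
  RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;

SELECT count_estimate_json('SELECT 1 FROM items WHERE n < 1000');
```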
Distinct counts (no duplicates)
Exact counts
Default behavior under low memory
Counting with duplicates can be slow, but counting distinct values is far worse. With limited working memory and no indexes, PostgreSQL cannot do it efficiently. In its default configuration the DBMS puts a tight memory limit on each concurrent query (work_mem). On my development machine the default was 4 megabytes.
Let's measure the performance of processing one million rows with the factory work_mem setting.
echo "SELECT count(DISTINCT n) FROM items;" | pgbench -d count -t 50 -P 1 -f -
Running EXPLAIN shows that most of the query's execution time goes into aggregation. Note also that counting distinct values in a text column is much slower than in an integer column.

What is going on under the hood? The way EXPLAIN describes this step is opaque. Analyzing a similar query helps to understand what is happening: let's replace count distinct with select distinct.

EXPLAIN (ANALYZE, VERBOSE) SELECT DISTINCT n FROM items;

Unique  (cost=131666.34..136666.34 rows=498824 width=4) (actual time=766.775..1229.040 rows=631846 loops=1)
  Output: n
  ->  Sort  (cost=131666.34..134166.34 rows=1000000 width=4) (actual time=766.774..1075.712 rows=1000000 loops=1)
        Output: n
        Sort Key: items.n
        Sort Method: external merge  Disk: 13632kB
        ->  Seq Scan on public.items  (cost=0.00..18334.00 rows=1000000 width=4) (actual time=0.006..178.153 rows=1000000 loops=1)
              Output: n
When work_mem is insufficient and there are no external data structures (such as indexes), PostgreSQL merge-sorts the table between memory and disk and then iterates over the result removing duplicates, much like the classic Unix combination sort | uniq.
Sorting takes up most of the query's execution time, especially when we use the string column s rather than the integer column n. Removing the duplicates (the Unique filter) runs at about the same speed in both cases.
Custom aggregate
To count distinct values, Tomas Vondra created a custom aggregate that works with types of limited length (no more than 64 bits). Even without more working memory or an index, this method is faster than the default sort-based one. Install it as follows:
- Clone the tvondra/count_distinct project.
- Switch to the stable branch: git checkout REL2_0_STABLE
- Run make install.
- In your database, run CREATE EXTENSION count_distinct;
Tomas explains how the aggregate works in a separate article. In short, his method builds a sorted array of the unique elements in memory, compacting it as it goes.
echo "SELECT COUNT_DISTINCT(n) FROM items;" | pgbench -d count -t 50 -P 1 -f -
This is faster than the standard count distinct, which averages 742 ms on our test data. Note that extensions written in C, such as count_distinct, are not bound by the work_mem parameter, so the array built along the way may use more memory per connection than you originally planned for.
HashAggregate
If all the columns being counted fit in work_mem, PostgreSQL uses a hash table to obtain the distinct values:

SET work_mem='1GB';

EXPLAIN SELECT DISTINCT n FROM items;

HashAggregate  (cost=20834.00..25822.24 rows=498824 width=4)
  Group Key: n
  ->  Seq Scan on items  (cost=0.00..18334.00 rows=1000000 width=4)
This is the fastest of the methods examined so far: it runs in 372 ms on average for n and 23 seconds for s. The queries select distinct n and select count(distinct n) take about the same time, provided the count distinct aggregate is applied to the result of a HashAggregate.
Caution: setting a high working-memory limit can have unpleasant consequences, because work_mem applies to every concurrent query. Besides, we can come up with something better.
Index-only scan
This feature appeared in PostgreSQL 9.2. If an index contains all the data a query needs, the system can use the index alone without touching the table itself (the "heap"). The index type has to support index-only scans (btree does, for example). GiST and SP-GiST indexes support index-only scans only for certain operator classes.
Let's create btree indexes on columns n and s:
CREATE INDEX items_n_idx ON items USING btree (n);
CREATE INDEX items_s_idx ON items USING btree (s);
A different strategy is now used to select the distinct values of these columns:

EXPLAIN SELECT DISTINCT n FROM items;

Unique  (cost=0.42..28480.42 rows=491891 width=4)
  ->  Index Only Scan using items_n_idx on items  (cost=0.42..25980.42 rows=1000000 width=4)
But here we run into a strange problem: SELECT COUNT(DISTINCT n) FROM items does not use the index, even though SELECT DISTINCT n does so by default. Following advice found on blogs ("The trick that makes postgres 50 times faster!"), we can give the planner a hint by rewriting count distinct as a count over a subquery.
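The rewrite can be sketched as follows. The subquery alias t is arbitrary.

```sql
-- Instead of: SELECT count(DISTINCT n) FROM items;
-- SELECT DISTINCT can use an index-only scan, and the outer count(*) simply counts its rows.
-- Note: count(DISTINCT n) ignores NULLs, while this form counts NULL as one value.
EXPLAIN
SELECT count(*)
  FROM (SELECT DISTINCT n FROM items) AS t;
```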
An in-order traversal of a binary tree is fast. This query takes 177 ms on average (270 ms for column s).
Note: if work_mem is large enough to hold the whole table, PostgreSQL chooses HashAggregate even when an index exists. Paradoxically, giving the system more memory leads it to pick a worse query plan. You can force an index-only scan by running SET enable_hashagg=false; just don't forget to set it back to true afterwards, so as not to spoil the plans of other queries.
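One way to avoid forgetting the reset is to scope the setting to a single transaction with SET LOCAL, which PostgreSQL reverts automatically at COMMIT or ROLLBACK. A minimal sketch, assuming the items table and the index on n created above:

```sql
BEGIN;
-- Affects only this transaction; the previous value returns at COMMIT or ROLLBACK.
SET LOCAL enable_hashagg = off;
SELECT count(*)
  FROM (SELECT DISTINCT n FROM items) AS t;
COMMIT;

-- Outside the transaction the planner setting is back to its earlier value.
SHOW enable_hashagg;
```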
Estimated counts
HyperLogLog
The methods considered so far rely on indexes, hash tables, sorted in-memory arrays, or the statistics tables of a centralized database. When there is a really large amount of data, and/or it is spread over several nodes of a distributed database, these methods no longer suit us.
In that case probabilistic data structures come to the rescue: they produce quick approximate estimates and parallelize well. Let's try one such structure on count distinct. Consider the cardinality estimator called HyperLogLog (HLL). It uses a small amount of memory to represent a set of elements. Its union operation is lossless, so arbitrary HLL values can be combined without degrading the accuracy of the cardinality estimate.
HLL exploits the properties of "good" hash functions, in particular the distances between hashed values. A function that distributes values uniformly tends to spread them as far apart as possible. As new hashes are added, the free space shrinks and the elements start to crowd together. By analyzing the smallest distances between hashed values, the algorithm can estimate the most likely number of source elements.
Let's measure the speed. First, install the PostgreSQL extension:
- Clone the postgresql-hll project.
- Run make install.
- Create the hll extension in your database: CREATE EXTENSION hll;
HLL performs fast aggregation of the data during a sequential scan of the table:
EXPLAIN SELECT
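With postgresql-hll, a distinct-count estimate over column n might be requested as sketched below. The functions hll_hash_integer, hll_add_agg and hll_cardinality are the ones that appear in the Citus plan later in this article. Using them directly on the single-node items table, and using hll_hash_text for the text column, are assumptions.

```sql
-- Hash every value of n, fold the hashes into one HLL sketch,
-- then read the estimated number of distinct values from it.
SELECT hll_cardinality(hll_add_agg(hll_hash_integer(n))) AS approx_distinct_n
  FROM items;

-- For the text column s, hll_hash_text is the matching hash function.
SELECT hll_cardinality(hll_add_agg(hll_hash_text(s))) AS approx_distinct_s
  FROM items;
```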
Running count distinct with HLL took 239 ms on average for column n and 284 ms for s, slightly slower than an index-only scan on a million records. HLL's real strength lies in its union operations, which are associative, commutative, and lossless. This means they can be run in parallel and then combined to compute the final result.
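The merge step can be illustrated on a single node. In this sketch the table is split into four arbitrary groups (n % 4 stands in for shards, an assumption made only for illustration). One HLL sketch is built per group, and hll_union_agg combines them before hll_cardinality reads the estimate.

```sql
WITH partial AS (
  -- One sketch per group, as separate nodes would each build for their own shard.
  SELECT n % 4 AS part,
         hll_add_agg(hll_hash_integer(n)) AS sketch
    FROM items
   GROUP BY n % 4
)
-- Merging the sketches should give the same estimate as one sketch built over all rows.
SELECT hll_cardinality(hll_union_agg(sketch)) AS approx_distinct_n
  FROM partial;
```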
䞊åå
Googleã¢ããªãã£ã¯ã¹ãªã©ã®ãªã¢ã«ã¿ã€ã åæãåéããã¢ããªã±ãŒã·ã§ã³ã¯ã countãç©æ¥µçã«äœ¿çšãããã®æäœã¯ååã«äžŠååãããŠããŸãã ãã®ã»ã¯ã·ã§ã³ã§ã¯ã Citus Cloudã«ãããã€ãããå°ããªCitusã¯ã©ã¹ã¿ãŒã«åºã¥ããŠãããã€ãã®è¡ã«ãŠã³ãæ¹æ³ã®ããã©ãŒãã³ã¹ã枬å®ããŸãã
ã¢ã€ãã¢ã¯ã忣ããŒã¿ããŒã¹ããŒããè€æ°ã®ãã·ã³ã«å±éããããšã§ãã ããŒãã¯åãã¹ããŒã ãæã¡ãåããŒãã«ã¯äžè¬çãªããŒã¿ã»ããïŒã·ã£ãŒãïŒã®äžéšãå«ãŸããŸãã è¡æ°ã®ã«ãŠã³ãã¯ã䞊è¡ããŠãã€ãŸãç°ãªããã·ã³ã§åæã«å®è¡ãããŸãã
ã¯ã©ã¹ã¿ãŒã®ã»ããã¢ãã
ç§ãã¡ã®ç®æšã¯æ¯èŒããã©ãŒãã³ã¹ãè©äŸ¡ããããšã§ãããæå€§é床ãååŸããªãããããã¹ãã§ã¯å°ããªã¯ã©ã¹ã¿ãŒã®ã¿ãäœæããŸãã
Citus Cloudã§ã¯ã8å°ã®ãã·ã³ã®ã¯ã©ã¹ã¿ãŒãäœæããåãã·ã³ã«å¯èœãªéãåŒ±ãæ§æãéžæããŸããã ãã®äŸãåçŸãããå Žåã¯ã ããã«ç»é²ããŠãã ãã ã
ã¯ã©ã¹ã¿ãŒãäœæããåŸã調æŽããŒãã«æ¥ç¶ããŠSQLã¯ãšãªãå®è¡ããŸãã ãŸããããŒãã«ãäœæããŸãã
CREATE TABLE items ( n integer, s text );
At this point the table exists only in the coordinator's database. We need to split the table and place its parts on the worker nodes. Citus assigns each row to a particular shard by processing the value in the chosen distribution column. In the following example we tell Citus to distribute future rows of the items table, using a hash of the value in column n to determine which shard a row belongs to.
SELECT master_create_distributed_table('items', 'n', 'hash');
SELECT master_create_worker_shards('items', 32, 1);
Using the coordinator node, we load random data into the shards. (Citus also supports MX, a "masterless" mode used for fast data loading, but it is not of interest to us right now.)
Once you have the URL of the cluster's coordinator database, run the following code on a computer with a fast network connection. (All the generated data travels over the network from this machine, so good bandwidth matters.)
cat << EOF > randgen.sql
COPY (
  SELECT
    (random()*100000000)::integer AS n,
    md5(random()::text) AS s
  FROM
    generate_series(1,100000000)
) TO STDOUT;
EOF

psql $CITUS_URL -q -f randgen.sql | \
  psql $CITUS_URL -c "COPY items (n, s) FROM STDIN"
In the single-database example we used a million rows. This time let's take 100 million.
Exact counts
With duplicates
An ordinary count (with duplicates) poses no problem. The coordinator runs the query on all nodes and then sums the results. The EXPLAIN output shows the plan chosen on one of the worker nodes ("Distributed Query") and the plan chosen on the coordinator ("Master Query").
EXPLAIN VERBOSE SELECT count(*) FROM items;

Distributed Query into pg_merge_job_0003
  Executor: Real-Time
  Task Count: 32
  Tasks Shown: One of 32
  ->  Task
        Node: host=*** port=5432 dbname=citus
        ->  Aggregate  (cost=65159.34..65159.35 rows=1 width=0)
              Output: count(*)
              ->  Seq Scan on public.items_102009 items  (cost=0.00..57340.27 rows=3127627 width=0)
                    Output: n, s
Master Query
  ->  Aggregate  (cost=0.00..0.02 rows=1 width=0)
        Output: (sum(intermediate_column_3_0))::bigint
        ->  Seq Scan on pg_temp_2.pg_merge_job_0003  (cost=0.00..0.00 rows=0 width=0)
              Output: intermediate_column_3_0
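For reference: on our cluster this query runs in 1.2 seconds. With a distributed database, distinct counts pose a more serious problem.
Distinct (no duplicates)
The difficulty of counting the distinct values of a column in a distributed database is that duplicates have to be looked for across different nodes. When the values being counted are in the distribution column, however, this is not a problem: rows with the same value in that column are placed in the same shard, which rules out duplicates between shards.
Citus knows that to count the distinct values of the distribution column it only has to run a count distinct query on each node and add up the results. Our cluster completes this task in 3.4 seconds.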
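Finding the number of distinct values in an ordinary (non-distribution) column is harder. Logically there are two possibilities:
- Copy all the rows to the coordinator node and count them there.
- Redistribute the rows among the worker nodes so that duplicates end up in the same shard, then count on each node and add up the results. This is known as "repartitioning".
Approximation algorithms such as HLL also work for a non-distribution column: each node builds an HLL value for its own shards, and the HLL values are then merged without loss.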
Citus works together with postgresql-hll. Once citus.count_distinct_error_rate is set, Citus evaluates count distinct using HLL. For example:
SET citus.count_distinct_error_rate = 0.005;

EXPLAIN VERBOSE SELECT count(DISTINCT n) FROM items;

Distributed Query into pg_merge_job_0090
  Executor: Real-Time
  Task Count: 32
  Tasks Shown: One of 32
  ->  Task
        Node: host=*** port=5432 dbname=citus
        ->  Aggregate  (cost=72978.41..72978.42 rows=1 width=4)
              Output: hll_add_agg(hll_hash_integer(n, 0), 15)
              ->  Seq Scan on public.items_102009 items  (cost=0.00..57340.27 rows=3127627 width=4)
                    Output: n, s
Master Query
  ->  Aggregate  (cost=0.00..0.02 rows=1 width=0)
        Output: (hll_cardinality(hll_union_agg(intermediate_column_90_0)))::bigint
        ->  Seq Scan on pg_temp_2.pg_merge_job_0090  (cost=0.00..0.00 rows=0 width=0)
              Output: intermediate_column_90_0
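The result: 3.2 seconds for column n and 3.8 seconds for s. That is a distinct count over 100 million rows, on a non-distribution column! HLL is well suited to scaling distinct counts in a distributed database.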
Summary
Method | Time per 1M rows, ms | Exact | Filter | Distinct |
---|---|---|---|---|
PG Stats | 0.3 | - | - | - |
EXPLAIN | 0.3 | - | + | - |
Manual cache | 2 (slow inserts) | + | - | - |
count(*) | 85 | + | + | - |
count(1) | 99 | + | + | - |
Index Only Scan | 177 | + | + | + |
HLL | 239 | - | + | + |
HashAgg | 372 | + | + | + |
Custom Agg | 435 (64-bit types only) | + | + | + |
Mergesort | 742 | + | + | + |
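On a single database the index-only scan gives the fastest exact distinct counts, while HyperLogLog (HLL) comes into its own on large data sets (more than 100 million rows). In a distributed database, HLL is the practical way to get a distinct count.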