アペフチ

Apache Arrow is fast for CSV processing in Ruby

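Yesterday I wrote an article called "How to use Apache Arrow in Ruby (with Parquet too)", in which I said that when handling data in Ruby, the ranking from fastest to slowest was

FastestCSV > Arrow > Parquet > CSV

In response, @kou, a developer and maintainer of both Arrow and Ruby's CSV library, gave me the following pointer.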

I had no idea there was an API called each_chunk! So I tried it out.

                                                   user     system      total        real
CSV(Ruby standard library CSV)                22.359375   0.390625  22.750000 ( 22.763864)
CSV(FastestCSV RubyGem)                        1.531250   0.140625   1.671875 (  1.718251)
CSV(Red Arrow each_record_batch)              10.875000   1.812500  12.687500 ( 10.767980)
CSV(Red Arrow each_record)                    71.437500   1.171875  72.609375 ( 72.159371)
CSV(Red Arrow target column only each)         3.734375   1.843750   5.578125 (  3.226278)
CSV(Red Arrow target column only each_chunk)   0.859375   1.437500   2.296875 (  0.478962)
Arrow(each_record_batch)                       9.859375   0.078125   9.937500 (  9.992725)
Arrow(each_record)                            72.781250   0.171875  72.953125 ( 73.775244)
Arrow(target column only each)                 2.125000   0.015625   2.140625 (  2.133398)
Arrow(target column only each_chunk)           0.015625   0.000000   0.015625 (  0.024413)
Parquet(each_record_batch)                    11.062500   2.078125  13.140625 ( 10.599412)
Parquet(each_record)                          13.343750   0.906250  14.250000 ( 12.923848)
Parquet(target column only each)               3.703125   0.718750   4.421875 (  3.152494)
Parquet(target column only each_chunk)         0.656250   0.468750   1.125000 (  0.243396)

Every case printed the same total, 850541456, so I have left it out of the rows.

As you can see, Arrow is overwhelmingly fast. Going by the fastest result for each format, the order is now

Arrow > Parquet > FastestCSV > CSV

Even when the input is in CSV format, using the each_chunk method with Arrow is faster than FastestCSV (about 0.48 s versus 1.72 s of real time). Wonderful.
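To put rough numbers on that, the real-time column of the results above can be turned into speedup factors. This is only arithmetic on figures already reported. The REAL_SECONDS hash, its shortened labels, and the speedup helper are names I made up for this sketch.

```ruby
# Wall-clock ("real") seconds copied from the benchmark results above.
# REAL_SECONDS and speedup are illustrative names, not part of any library.
REAL_SECONDS = {
  "CSV(Ruby standard library CSV)" => 22.763864,
  "CSV(FastestCSV RubyGem)" => 1.718251,
  "CSV(Red Arrow each_chunk)" => 0.478962,
  "Arrow(each_chunk)" => 0.024413,
  "Parquet(each_chunk)" => 0.243396,
}.freeze

# How many times faster `fast` is than `slow`, going by wall-clock time.
def speedup(slow, fast)
  REAL_SECONDS.fetch(slow) / REAL_SECONDS.fetch(fast)
end

baseline = "CSV(Ruby standard library CSV)"
REAL_SECONDS.each_key do |label|
  next if label == baseline
  puts format("%-28s %8.1fx faster than the standard CSV library",
              label, speedup(baseline, label))
end
```

These ratios come from a single run on one machine, so they are best read as orders of magnitude rather than exact figures.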

Reference links

Benchmark script

require "benchmark"
require "csv"
require "fastest-csv"
require "arrow"
require "parquet"

CSVFILE = "sample-data.csv"
ARROWFILE = "sample-data.arrow"
PARQUETFILE = "sample-data.parquet"

Benchmark.bmbm do |x|
  x.report "CSV(Ruby standard library CSV)" do
    amount = 0
    CSV.foreach CSVFILE, headers: true do |row|
      amount += row[4].to_i
    end
    puts amount
  end

  x.report "CSV(FastestCSV RubyGem)" do
    amount = 0
    headers = true
    FastestCSV.foreach CSVFILE do |row|
      if headers
        headers = false
        next
      end
      amount += row[4].to_i
    end
    puts amount
  end

  x.report "CSV(Red Arrow each_record_batch)" do
    amount = 0
    Arrow::Table.load(CSVFILE).each_record_batch do |records|
      records.each do |record|
        amount += record[4].to_i
      end
    end
    puts amount
  end

  x.report "CSV(Red Arrow each_record)" do
    amount = 0
    Arrow::Table.load(CSVFILE).each_record do |record|
      amount += record[4].to_i
    end
    puts amount
  end

  x.report "CSV(Red Arrow target column only each)" do
    amount = 0
    Arrow::Table.load(CSVFILE).find_column(4).each do |record|
      amount += record.to_i
    end
    puts amount
  end

  x.report "CSV(Red Arrow target column only each_chunk)" do
    amount = 0
    Arrow::Table.load(CSVFILE)[4].data.each_chunk do |array|
      amount += array.cast(Arrow::Int64DataType.new).sum
    end
    puts amount
  end

  x.report "Arrow(each_record_batch)" do
    amount = 0
    Arrow::Table.load(ARROWFILE).each_record_batch do |records|
      records.each do |record|
        amount += record[4].to_i
      end
    end
    puts amount
  end

  x.report "Arrow(each_record)" do
    amount = 0
    Arrow::Table.load(ARROWFILE).each_record do |record|
      amount += record[4].to_i
    end
    puts amount
  end

  x.report "Arrow(target column only each)" do
    amount = 0
    Arrow::Table.load(ARROWFILE).find_column(4).each do |record|
      amount += record.to_i
    end
    puts amount
  end

  x.report "Arrow(target column only each_chunk)" do
    amount = 0
    Arrow::Table.load(ARROWFILE)[4].data.each_chunk do |array|
      amount += array.cast(Arrow::Int64DataType.new).sum
    end
    puts amount
  end

  x.report "Parquet(each_record_batch)" do
    amount = 0
    Arrow::Table.load(PARQUETFILE).each_record_batch do |records|
      records.each do |record|
        amount += record[4].to_i
      end
    end
    puts amount
  end

  x.report "Parquet(each_record)" do
    amount = 0
    Arrow::Table.load(PARQUETFILE).each_record do |record|
      amount += record[4].to_i
    end
    puts amount
  end

  x.report "Parquet(target column only each)" do
    amount = 0
    Arrow::Table.load(PARQUETFILE).find_column(4).each do |record|
      amount += record.to_i
    end
    puts amount
  end

  x.report "Parquet(target column only each_chunk)" do
    amount = 0
    Arrow::Table.load(PARQUETFILE)[4].data.each_chunk do |array|
      amount += array.cast(Arrow::Int64DataType.new).sum
    end
    puts amount
  end
end
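
The script expects sample-data.csv to exist already. The real data is not shown here, so the following is only a guess at a stand-in: a small generator that writes a CSV with a header row and an integer in the fifth column (index 4), which is the only column the benchmark reads. The column names, the values, the row count, and the write_sample_csv helper are all invented for illustration.

```ruby
require "csv"

# Hypothetical stand-in for the benchmark input. Only the fifth column
# (index 4) matters to the benchmark; every column name here is invented.
def write_sample_csv(path, rows: 1_000, seed: 42)
  rng = Random.new(seed) # fixed seed so repeated runs produce the same file
  CSV.open(path, "w") do |csv|
    csv << %w[id name category date amount]
    rows.times do |i|
      csv << [i + 1, "item#{i + 1}", "cat#{i % 5}", "2019-01-01", rng.rand(1..10_000)]
    end
  end
  path
end
```

Calling write_sample_csv("sample-data.csv", rows: 1_000_000) would give the CSV cases something to read. The .arrow and .parquet files would still have to be converted from it separately, and the timings would of course differ from the ones above.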