Archive Compression Considerations #2

@metachris

Input data

  • 14 hourly CSV files (each entry: raw transaction + timestamp + hash)
  • Transactions: 1,514,668
  • Disk usage: 1.7G
filename                            entries      size
txs-2023-08-07-10-00.csv             15,965       19M
txs-2023-08-07-11-00.csv            106,435      144M
txs-2023-08-07-12-00.csv            117,599      131M
txs-2023-08-07-13-00.csv            117,184      143M
txs-2023-08-07-14-00.csv            126,056      121M
txs-2023-08-07-15-00.csv            125,871      131M
txs-2023-08-07-16-00.csv            124,732      135M
txs-2023-08-07-17-00.csv            122,725      133M
txs-2023-08-07-18-00.csv            117,119      126M
txs-2023-08-07-19-00.csv            113,833      127M
txs-2023-08-07-20-00.csv            109,858      125M
txs-2023-08-07-21-00.csv            105,749      121M
txs-2023-08-07-22-00.csv            112,109      114M
txs-2023-08-07-23-00.csv             99,433      101M
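
A listing like the one above could be produced with a short script. This is a minimal sketch, not the tooling used for this issue: the function names `summarize_csvs` and `print_summary` are hypothetical, and it assumes each line in a file is one entry (no header row unless `has_header=True`, no newlines inside quoted fields).

```python
from pathlib import Path


def summarize_csvs(directory, pattern="txs-*.csv", has_header=False):
    """Return (filename, entries, size_bytes) per matching file, sorted by name.

    Assumption: one entry per line. Set has_header=True to exclude a header row.
    """
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        # Count lines in binary mode to avoid decoding large files.
        with path.open("rb") as f:
            lines = sum(1 for _ in f)
        entries = max(lines - 1, 0) if has_header else lines
        rows.append((path.name, entries, path.stat().st_size))
    return rows


def print_summary(rows):
    """Print a filename / entries / size table followed by a totals line."""
    print(f"{'filename':<30}{'entries':>12}{'size':>10}")
    for name, entries, size in rows:
        print(f"{name:<30}{entries:>12,}{size / 2**20:>9.0f}M")
    total_entries = sum(r[1] for r in rows)
    total_size = sum(r[2] for r in rows)
    print(f"{'total':<30}{total_entries:>12,}{total_size / 2**20:>9.0f}M")
```

Calling `print_summary(summarize_csvs("."))` in the directory holding the CSVs prints one row per file. Sizes are shown in MiB, which may differ slightly from what `du` or `ls -h` report.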

Compression

method    level    size    ratio    runtime
lz4           9    841M     0.49        38s
lz4          12    840M     0.49     1m 55s
zip           6    644M     0.38        45s
zip           9    640M     0.38     1m 23s
zstd          3    580M     0.34         9s
zstd         14    578M     0.34     2m 45s
zstd         15    577M     0.34     3m 47s
zstd         16    524M     0.31     4m 45s
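
The ratio column appears to be compressed size divided by original size (for example, 841M / 1.7G ≈ 0.49), so lower is better. The exact commands and flags behind the table are not stated in this issue. As a rough sketch of how such a measurement could be reproduced, the snippet below times a compressor and reports size and ratio. It uses `zlib` (DEFLATE, the algorithm zip typically uses) from the Python standard library as a stand-in; lz4 and zstd would need third-party packages. The names `benchmark` and `run_deflate_benchmarks` are hypothetical.

```python
import time
import zlib


def benchmark(data, compress, label):
    """Compress `data` once and report size, ratio (compressed/original), and runtime."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return {
        "method": label,
        "size": len(compressed),
        "ratio": len(compressed) / len(data),
        "runtime_s": elapsed,
    }


def run_deflate_benchmarks(data, levels=(6, 9)):
    """Run DEFLATE at each level; 6 and 9 mirror the zip rows in the table."""
    return [
        benchmark(data, lambda d, lvl=level: zlib.compress(d, lvl), f"deflate-{level}")
        for level in levels
    ]
```

For example, `run_deflate_benchmarks(open("txs-2023-08-07-10-00.csv", "rb").read())` returns one result per level for a single input file. In-memory timings like these will not match CLI runtimes exactly, since they exclude disk I/O and process startup.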

Summarizer Script

  • Runtime: 1m 12s
  • Parquet output size: 74M (using gzip compression)
