
Table of Contents
Compression technology
Code
Configure algorithm or codec
Compression level
Performance Test
File size
Write
Read
Conclusion
Implementation details

Compression algorithms in Parquet Java

Jan 20, 2025, 06:04 PM

Apache Parquet is a columnar storage format targeted at analytical workloads, but it can be used to store any type of structured data, addressing a variety of use cases.

One of its most notable features is the ability to compress data efficiently using different compression techniques at the two stages of its write process. This reduces storage costs and improves read performance.

This article explains Parquet’s file compression in Java, provides usage examples, and analyzes its performance.

Compression technology

Unlike traditional row-based storage formats, Parquet uses a columnar approach, allowing the use of more specific and efficient compression techniques based on locality and value redundancy of the same type of data.

Parquet writes information in binary format and applies compression at two different levels, each using a different technique:

  • When writing column values, it adaptively selects an encoding type based on the characteristics of the initial values: dictionary encoding, run-length encoding, bit packing, delta encoding, etc.
  • Whenever a certain number of bytes is reached (1 MB by default), a page is formed and the binary block is compressed using a programmer-configurable algorithm (no compression, GZip, Snappy, LZ4, ZSTD, etc.).

Although the compression algorithm is configured at the file level, the encoding of each column is automatically selected using an internal heuristic (at least in the parquet-java implementation).
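
As a rough sketch of where each knob lives, the snippet below (modeled on the Avro example later in this article; withPageSize and withDictionaryEncoding are standard parquet-java builder options) sets the file-level codec and the page size, while the per-column encodings remain automatic:

ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
    .withSchema(new Organization().getSchema())
    // file-level codec applied to each page once it is full
    .withCompressionCodec(CompressionCodecName.SNAPPY)
    // size threshold at which a page is closed and compressed (default ~1 MB)
    .withPageSize(1024 * 1024)
    // column encodings are chosen automatically; dictionary encoding can only be toggled on or off
    .withDictionaryEncoding(true)
    .build();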

The performance of different compression technologies depends heavily on your data, so there is no one-size-fits-all solution that guarantees the fastest processing time and lowest storage consumption. You need to perform your own tests.

Code

Configuration is simple and only requires explicit setting when writing. When reading a file, Parquet discovers which compression algorithm is used and applies the corresponding decompression algorithm.
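
For example, a minimal read sketch (using the Avro reader; like the write snippets below, it omits imports and assumes an inputFile already pointing to the Parquet file) needs no codec information at all:

try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(inputFile).build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        // decompression and decoding happen transparently, driven by the file metadata
    }
}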

Configure algorithm or codec

Whether you use Carpet, or Parquet with Avro or Protocol Buffers, configuring the compression algorithm only requires calling the builder's withCompressionCodec method:

Carpet

CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .build();

Avro

ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
    .withSchema(new Organization().getSchema())
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .build();

Protocol Buffers

ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
    .withMessage(Organization.class)
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .build();

The value must be one of the values available in the CompressionCodecName enumeration: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD, and LZ4_RAW (LZ4 is deprecated; LZ4_RAW should be used instead).

Compression level

Some compression algorithms provide a way to fine-tune the compression level. This level is usually related to how much effort they need to put into finding repeating patterns; the higher the compression level, the more time and memory the compression process requires.

Although they come with default values, they can be modified using Parquet's generic configuration mechanism, albeit using different keys for each codec.

Additionally, the values to choose are not standard and depend on each codec, so you must refer to the documentation for each algorithm to understand what each level offers.

ZSTD

To reference level configuration, the ZSTD codec declares a constant: ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL.

Possible values range from 1 to 22; the default value is 3.

Configuration conf = new Configuration();
conf.set(ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL, "6");
CarpetWriter<T> writer = new CarpetWriter.Builder<>(outputFile, clazz)
    .withCompressionCodec(CompressionCodecName.ZSTD)
    .withConf(conf) // assumes the builder accepts a Hadoop Configuration, as parquet-java writer builders do
    .build();

LZO

To reference level configuration, the LZO codec declares a constant: LzoCodec.LZO_COMPRESSION_LEVEL_KEY.

Possible values are 1 to 9, 99, and 999; the default value is 999.

Configuration conf = new Configuration();
conf.set(LzoCodec.LZO_COMPRESSION_LEVEL_KEY, "99");
ParquetWriter<Organization> writer = AvroParquetWriter.<Organization>builder(outputFile)
    .withSchema(new Organization().getSchema())
    .withCompressionCodec(CompressionCodecName.LZO)
    .withConf(conf)
    .build();

GZIP

GZIP does not declare any constant; you must use the string "zlib.compress.level" directly. Possible values range from 0 to 9, and the default value is 6.

Configuration conf = new Configuration();
conf.set("zlib.compress.level", "9");
ParquetWriter<Organization> writer = ProtoParquetWriter.<Organization>builder(outputFile)
    .withMessage(Organization.class)
    .withCompressionCodec(CompressionCodecName.GZIP)
    .withConf(conf)
    .build();

Performance Test

To analyze the performance of different compression algorithms, I will use two public datasets containing different types of data:

  • New York City Taxi Trips: a large number of numeric values and a small number of string values, spread across 23 columns; 19.6 million records.
  • Cohesion Projects of the Italian Government: many columns with floating-point values plus a large number of varied text strings, spread across 91 columns; 2 million rows.

I will evaluate some of the compression algorithms enabled in Parquet Java: UNCOMPRESSED, SNAPPY, GZIP, LZO, ZSTD, LZ4_RAW.

In every case, I will be using Carpet with the default configuration provided by parquet-java and the default compression level for each algorithm.

You can find the source code on GitHub; testing was done on a laptop with an AMD Ryzen 7 4800HS CPU and JDK 17.

File size

To understand the performance of each compression, we will use the equivalent CSV file as a reference.

Format         gov.it     NY Taxi
CSV            1761 MB    2983 MB
UNCOMPRESSED   564 MB     760 MB
SNAPPY         220 MB     542 MB
GZIP           146 MB     448 MB
ZSTD           148 MB     430 MB
LZ4_RAW        209 MB     547 MB
LZO            215 MB     518 MB

In both tests, compression with GZip and Zstandard was the most efficient.

Using Parquet's encoding techniques alone, file size can be reduced to 25%-32% of the original CSV size (for example, 564 MB / 1761 MB ≈ 32% for the gov.it dataset). Applying compression on top of that reduces it to 9%-15% of the CSV size.

Write

How much overhead does compressing information bring?

If we write the same information three times and calculate the average seconds, we get:

Algorithm      gov.it (s)   NY Taxi (s)
UNCOMPRESSED   25.0         57.9
SNAPPY         25.2         56.4
GZIP           39.3         91.1
ZSTD           27.3         64.1
LZ4_RAW        24.9         56.5
LZO            26.0         56.1

SNAPPY, LZ4_RAW, and LZO achieve times similar to no compression, while ZSTD adds some overhead. GZIP has the worst performance, with write times about 50% slower.

Read

Reading a file is faster than writing because less computation is required.

The time in seconds to read all columns in the file is:

Algorithm      gov.it (s)   NY Taxi (s)
UNCOMPRESSED   11.4         37.4
SNAPPY         12.5         39.9
GZIP           13.6         40.9
ZSTD           13.1         41.5
LZ4_RAW        12.8         41.6
LZO            13.1         41.1

The read time is close to that of uncompressed information, and the decompression overhead is between 10% and 20%.

Conclusion

No algorithm is significantly better than the others in read and write times; all fall within a similar range. In most cases, the savings in storage space (and transmission time) make up for the extra time spent compressing the information.

In these two use cases, the deciding factor when choosing one algorithm over another is probably the compression ratio achieved, with ZSTD and GZip standing out (at the cost of slower writes).

Each algorithm has its advantages, so the best option is to test it with your data and consider which factor is more important:

  • Minimizing storage usage, because you store large amounts of rarely used data.
  • Minimizing file generation time.
  • Minimizing read time, because files are read many times.

Like everything in life, it is a trade-off, and you have to see what compensates you best. Carpet uses Snappy compression by default if you don't configure anything.

Implementation details

The value must be one of the values available in the CompressionCodecName enumeration. Associated with each enumeration value is the name of the class that implements the algorithm:

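A small sketch showing that mapping (it assumes CompressionCodecName exposes getHadoopCompressionCodecClassName(), which is my recollection of the parquet-java API; for instance, SNAPPY maps to org.apache.parquet.hadoop.codec.SnappyCodec and GZIP to org.apache.hadoop.io.compress.GzipCodec):

import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ListParquetCodecs {
    public static void main(String[] args) {
        for (CompressionCodecName codec : CompressionCodecName.values()) {
            // prints each enum constant with the Hadoop codec class it delegates to
            System.out.println(codec + " -> " + codec.getHadoopCompressionCodecClassName());
        }
    }
}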

Parquet uses reflection to instantiate the specified class, which must implement the CompressionCodec interface. If you look at its source code, you'll see that it lives in the Hadoop project, not in Parquet. This shows how tightly coupled Parquet's Java implementation is to Hadoop.
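
Roughly, the instantiation looks like the sketch below (a simplified approximation of what Parquet's codec handling does, not its actual code; the class name is the one the ZSTD enum constant points to):

Configuration conf = new Configuration();
Class<?> codecClass = Class.forName("org.apache.parquet.hadoop.codec.ZstandardCodec");
// the loaded class must implement Hadoop's CompressionCodec interface
CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);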

To use one of these codecs, you must ensure that you have added the JAR containing its implementation as a dependency.

Not all implementations are present in the transitive dependencies you get when adding parquet-java, and you may also be excluding Hadoop dependencies too aggressively.

The org.apache.parquet:parquet-hadoop dependency includes the implementations of SnappyCodec, ZstandardCodec, and Lz4RawCodec, and transitively imports the snappy-java, zstd-jni, and aircompressor dependencies along with the actual implementations of these three algorithms.

The org.apache.hadoop:hadoop-common dependency contains the implementation of GzipCodec.

Where are the implementations of BrotliCodec and LzoCodec? They are not in any Parquet or Hadoop dependencies, so if you use them without adding additional dependencies, your application will not be able to use files compressed in those formats.

  • To support LZO, you need to add the dependency org.anarres.lzo:lzo-hadoop to your pom or gradle file.
  • The situation with Brotli is more complicated: the dependency is not in Maven Central and you must also add the JitPack repository.
