Conversation

@xhochy xhochy commented May 16, 2017

This mainly uses the same logic we already use for arrow-cpp

wesm commented May 16, 2017

Looks like the manylinux1 build failed.

@xhochy xhochy force-pushed the parquet-abi-version-bundling branch from 3158c4c to 4aa17f8 Compare May 17, 2017 11:43
xhochy commented May 17, 2017

Fixed the build.

@wesm wesm left a comment

+1. I will have to rebase #700

@asfgit asfgit closed this in a4f3259 May 17, 2017
jeffknupp pushed a commit to jeffknupp/arrow that referenced this pull request Jun 3, 2017
This mainly uses the same logic we already use for arrow-cpp

Author: Uwe L. Korn <uwelk@xhochy.com>

Closes apache#698 from xhochy/parquet-abi-version-bundling and squashes the following commits:

4aa17f8 [Uwe L. Korn] ARROW-1030: Python: Account for library versioning in parquet-cpp
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
## What's Changed

This PR relates to apache#698 and is the second in a series intended to
provide full Avro read / write support in native Java. It adds
round-trip tests for both schemas (Arrow schema -> Avro -> Arrow) and
data (Arrow VSR -> Avro block -> Arrow VSR). It also adds a number of
fixes and improvements to the Avro Consumers so that data arrives back
in its original form after a round trip. The main changes are:

* Added a top level method in AvroToArrow to convert Avro schema
directly to Arrow schema (this may exist elsewhere, but is needed to
provide an API that matches the logic of this implementation)
* Avro unions of [ type, null ] or [ null, type ] now have special
handling: they are interpreted as a single nullable type rather than a
union. Setting legacyMode = false in the AvroToArrowConfig object is
required to enable this behaviour; otherwise unions are interpreted
literally. Unions with more than 2 elements are always interpreted
literally (though, per apache#108, Java's current Union implementation
is probably not usable with Avro at the moment).
* Added support for new logical types (decimal 256, timestamp nano and 3
local timestamp types)
* Existing timestamp-millis and timestamp-micros values are now
interpreted as zone-aware (previously they were interpreted as local,
but now the local timestamp types are interpreted as local - I think
this is correct per the [Avro
spec](https://avro.apache.org/docs/1.12.0/specification/#timestamps)).
Requires setting legacyMode = false.
* Removed namespaces from generated Arrow field names in complex types.
E.g. the Avro field myNamespace.outerRecord.structField.intField should
be called just "intField" inside the Arrow struct. This doesn't affect
the skip field logic, which still works using the qualified names. This
requires setting legacyMode = false.
* Remove unexpected metadata in generated Arrow fields (empty alias
lists and attributes interpreted as part of the field schema). This
requires setting legacyMode = false.
* Use the expected child vector names for Arrow LIST and MAP types when
reading. For LIST, the default child vector is called "$data$" which is
illegal in Avro, so the child field name is also changed to "item" in
the producers. This requires setting legacyMode = false.
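The nullable-union rule above can be sketched in isolation: a two-element union whose members include "null" collapses to a single nullable type, while any other union (including those with more than 2 elements) is kept as a literal union. A minimal illustration in plain Java, where type-name strings stand in for real Avro Schema objects and `collapseNullableUnion` is a hypothetical helper, not part of the Arrow adapter API:

```java
import java.util.List;
import java.util.Optional;

public class UnionRule {
    // Present iff the union collapses to a single nullable type
    // (the legacyMode = false interpretation described above).
    static Optional<String> collapseNullableUnion(List<String> unionTypes) {
        if (unionTypes.size() == 2 && unionTypes.contains("null")) {
            // [type, null] or [null, type] -> nullable type
            String first = unionTypes.get(0);
            return Optional.of("null".equals(first) ? unionTypes.get(1) : first);
        }
        // Any other union is interpreted literally.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(collapseNullableUnion(List.of("long", "null")));          // Optional[long]
        System.out.println(collapseNullableUnion(List.of("null", "string")));        // Optional[string]
        System.out.println(collapseNullableUnion(List.of("int", "string", "null"))); // Optional.empty
    }
}
```

In the real adapter the collapsed case would map to a single nullable Arrow field; the literal case would map to an Arrow Union.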

Breaking changes have been removed from this PR.

Per discussion below, all breaking changes are now behind a "legacyMode"
flag in the AvroToArrowConfig object, which is enabled by default in all
the original code paths.

Closes apache#698 .

This change is meant to allow round-tripping of schemas and individual
Avro data blocks (one Avro data block -> one VSR). File-level
capabilities are not included. I have not included anything to recycle
the VSR as part of the read API; that feels like it belongs with the
file-level piece. I have also not done anything specific for enums /
dictionary encoding yet.