dynamic-pb-parser provides functionality similar to Hive's get_json_object but for dynamically parsing data described by Protobuf.
Because Protobuf-serialized data is not self-describing, users of this function must compile their own proto files into a descriptor (`.desc`) file (command shown below) and pass in the path to that `.desc` file.
For this reason, this UDF offers two integration methods:
- Use the `protoc` command to convert proto files into a descriptor file (the descriptor file name is customizable):

  ```
  protoc --include_imports -I. -otest.desc *.proto
  ```

- Usage example:

  ```java
  DynamicPBParser parser = DynamicPBParser.newBuilder()
      .descFilePath("target/test-classes/test.desc")
      .syntax("StandardSyntax")
      .build();

  // Top-level field
  parser.parse(content, "me.lihongyu.bean.Person$name");
  // Nested field
  parser.parse(content, "me.lihongyu.bean.Person$cloth.brand.type");
  // The result for proto_data is Base64-encoded, so it can be passed back in as input
  parser.parse(parser.parse(content, "me.lihongyu.bean.Person$proto_data"), "me.lihongyu.bean.AddressBook$email");
  ```
- `DynamicPBParser.parse` has two input parameters:
  - The Base64-encoded PB data.
  - The field path to be parsed.
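The first argument is a Base64 string rather than raw bytes. The following is a minimal sketch of converting between the two with the JDK's `java.util.Base64`, assuming the standard alphabet with padding (the sample string in this README is consistent with that, but the README does not state it). The class `Base64InputSketch` is illustrative and not part of dynamic-pb-parser.

```java
import java.util.Base64;

// Illustrative helper, not part of dynamic-pb-parser.
final class Base64InputSketch {

    // Encodes serialized Protobuf bytes into the Base64 string passed as the first argument.
    static String encode(byte[] serializedMessage) {
        return Base64.getEncoder().encodeToString(serializedMessage);
    }

    // Reverses encode(), e.g. to inspect the raw bytes behind a Base64 string.
    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    public static void main(String[] args) {
        // Sample Base64 string taken from this README.
        String sample = "CggKBG5pa2UQARC2YA==";
        byte[] raw = decode(sample);
        System.out.println("decoded byte count: " + raw.length);
        System.out.println("round trip matches: " + encode(raw).equals(sample));
    }
}
```

In real use the bytes would typically come from a generated message's `toByteArray()`; that step is omitted here so the sketch runs without the Protobuf runtime.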
- Field path syntax:
  - The `$` symbol separates the class name (message name) from the field name.
  - Format for nested objects: `package_name.message_name$field1_name.field2_name`
  - Format for nested arrays (repeated fields): `package_name.message_name$field1_name[*].field2_name[0]`, where `field1_name[*]` can also be shortened to `field1_name`.
  - Extension fields:
    - If message A has an extension field `x` that is defined inside message B, the data of A can be treated as B in order to parse `x`. Example:

      ```
      package a.b;
      message A {
        extensions 100 to 199;
      }

      package c.d;
      message B {
        extend A {
          optional int32 x = 100;
        }
      }

      data = Base64(A);
      result = parser.parse(data, "c.d.B$x");
      ```

    - If message A has an extension field `x` that is not defined inside any message but directly under a package, write the complete extension field path explicitly and enclose it in ASCII parentheses `()`. Example:

      ```
      package a.b;
      message A {
        extensions 100 to 199;
      }

      package c.d;
      extend A {
        optional int32 x = 100; // Protobuf considers the full path of this field to be `c.d.x`
      }

      data = Base64(A);
      result = parser.parse(data, "a.b.A$(c.d.x)");
      ```
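To make the path grammar concrete, here is a hedged sketch of how such a path string could be split into a message name and a list of field steps. It is an illustration written from the rules above, not the library's own parser; the class `FieldPathSketch`, its `Segment` type, and the error messages are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the path grammar described above; not the library's own parser.
final class FieldPathSketch {

    // One step of the path after the '$' separator.
    static final class Segment {
        final String name;       // field name, or the full extension name for "(...)" steps
        final boolean extension; // true when the step was written in parentheses
        final Integer index;     // null for "[*]" or no subscript, otherwise the element index

        Segment(String name, boolean extension, Integer index) {
            this.name = name;
            this.extension = extension;
            this.index = index;
        }
    }

    final String messageName;
    final List<Segment> segments;

    private FieldPathSketch(String messageName, List<Segment> segments) {
        this.messageName = messageName;
        this.segments = segments;
    }

    static FieldPathSketch parse(String path) {
        int sep = path.indexOf('$');
        if (sep <= 0 || sep == path.length() - 1) {
            throw new IllegalArgumentException("expected message$field, got: " + path);
        }
        String fields = path.substring(sep + 1);
        List<Segment> segments = new ArrayList<>();
        int i = 0;
        while (i < fields.length()) {
            boolean extension = fields.charAt(i) == '(';
            String name;
            int end;
            if (extension) {
                // Dots inside "(...)" belong to the extension name, so jump straight to ')'.
                int close = fields.indexOf(')', i);
                if (close < 0) throw new IllegalArgumentException("unclosed '(' in: " + path);
                name = fields.substring(i + 1, close);
                end = close + 1;
            } else {
                end = i;
                while (end < fields.length() && ".[".indexOf(fields.charAt(end)) < 0) end++;
                name = fields.substring(i, end);
            }
            if (name.isEmpty()) throw new IllegalArgumentException("empty field name in: " + path);
            Integer index = null;
            if (end < fields.length() && fields.charAt(end) == '[') {
                int close = fields.indexOf(']', end);
                if (close < 0) throw new IllegalArgumentException("unclosed '[' in: " + path);
                String subscript = fields.substring(end + 1, close);
                if (!subscript.equals("*")) index = Integer.valueOf(subscript);
                end = close + 1;
            }
            segments.add(new Segment(name, extension, index));
            if (end < fields.length()) {
                // Anything left must start with the '.' that introduces the next step.
                if (fields.charAt(end) != '.' || end == fields.length() - 1) {
                    throw new IllegalArgumentException("unexpected text after step in: " + path);
                }
                end++;
            }
            i = end;
        }
        return new FieldPathSketch(path.substring(0, sep), segments);
    }

    public static void main(String[] args) {
        FieldPathSketch p = parse("package_name.message_name$field1_name[*].field2_name[0]");
        System.out.println(p.messageName + " has " + p.segments.size() + " steps");
    }
}
```

Treating `[*]` and a missing subscript identically mirrors the rule that `field1_name[*]` can be shortened to `field1_name`.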
- Output:
  - The result is always a string.
  - If an object is returned:
    - The object's PB-serialized bytes are returned Base64-encoded, e.g. `"CggKBG5pa2UQARC2YA=="`.
    - A byte array is returned as a Base64-encoded string.
    - In all other cases, the result of `toString()` is returned.
  - If an array (repeated field) is returned, it has the format `[xxx,xxx,xxx]`:
    - If the elements are numeric, the numbers themselves are returned, e.g. `[1,2,3]`.
    - If the elements are objects, each element follows the object rule above.
  - If `java.lang.IllegalArgumentException: XXX.XXX can not be found in any description file! Please check out if it exist.` is thrown, check whether the `.desc` file contains a message definition for `XXX.XXX`, and verify `package_name` and `message_name` for typos.
  - If `IllegalArgumentException msg: xxx(fieldName) is not found in xxx(message name)` is thrown, check the field name for typos.
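The output rules above can be sketched as a small recursive formatter. This is only an illustration of the described behaviour, not the library's code: `OutputFormatSketch` and `render` are hypothetical names, and a `byte[]` stands in both for a real byte-array field and for the serialized bytes of a message object, so that the example runs without the Protobuf runtime.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical illustration of the output rules; not the library's implementation.
final class OutputFormatSketch {

    static String render(Object value) {
        if (value instanceof byte[]) {
            // Byte arrays (and, by assumption here, serialized messages) become Base64 text.
            return Base64.getEncoder().encodeToString((byte[]) value);
        }
        if (value instanceof List) {
            // Repeated fields are rendered as [a,b,c], applying the same rules to each element.
            StringJoiner joined = new StringJoiner(",", "[", "]");
            for (Object element : (List<?>) value) {
                joined.add(render(element));
            }
            return joined.toString();
        }
        // Everything else falls back to toString() (String.valueOf also tolerates null).
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(render(Arrays.asList(1, 2, 3)));
        System.out.println(render("nike"));
        System.out.println(render(new byte[] {0x0A, 0x04, 'n', 'i', 'k', 'e'}));
    }
}
```

Because every result is a string, callers that expect a nested message need to Base64-decode the value, or pass it straight back into `parse` with the nested message's path, as in the usage example.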
- support extension field
- support array type
- exception or null?
- support proto importing
- optimize the performance