lhyundeadsoul/pb-parser

dynamic-pb-parser Introduction

dynamic-pb-parser provides functionality similar to Hive's get_json_object, but for Protobuf: it extracts fields from Protobuf-serialized data dynamically, by field path.

Usage

Since Protobuf serialized data is not self-describing, users of this function need to compile their own proto files into a descriptor (.desc) file (command shown below) and pass the path to the .desc file.

Integration Methods

For the reason above, integrating this UDF takes two steps:

  1. Use the protoc command to convert proto files into a descriptor file (the descriptor file name is customizable):

    protoc --include_imports -I. -otest.desc *.proto
    
  2. Usage example:

    DynamicPBParser parser = DynamicPBParser.newBuilder()
        .descFilePath("target/test-classes/test.desc")
        .syntax("StandardSyntax")
        .build();
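    // content is the Base64-encoded PB data of a Person message
    parser.parse(content, "me.lihongyu.bean.Person$name");
    parser.parse(content, "me.lihongyu.bean.Person$cloth.brand.type");
    // The value returned for proto_data is Base64 text, so it can be passed to parse again
    parser.parse(parser.parse(content, "me.lihongyu.bean.Person$proto_data"), "me.lihongyu.bean.AddressBook$email");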
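
The paths passed to parse above combine a message name and a field path. As an illustration only, here is a minimal sketch of how such a path could be split into its parts. The class FieldPathSketch and its methods are hypothetical and are not part of dynamic-pb-parser; the sketch assumes that the first $ ends the message name and that dots inside parentheses belong to an extension name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the library's implementation.
class FieldPathSketch {

    /** Returns the message name: everything before the '$' separator. */
    static String messageName(String path) {
        int sep = path.indexOf('$');
        if (sep < 0) {
            throw new IllegalArgumentException("missing '$' in path: " + path);
        }
        return path.substring(0, sep);
    }

    /** Splits the part after '$' on dots, keeping "(pkg.ext)" extension names whole. */
    static List<String> steps(String path) {
        String fields = path.substring(messageName(path).length() + 1);
        List<String> out = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < fields.length(); i++) {
            char c = fields.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == '.' && depth == 0) {
                // A dot outside parentheses ends the current step.
                out.add(normalize(fields.substring(start, i)));
                start = i + 1;
            }
        }
        out.add(normalize(fields.substring(start)));
        return out;
    }

    /** The README treats "name[*]" and "name" as equivalent, so drop the "[*]". */
    private static String normalize(String step) {
        return step.endsWith("[*]") ? step.substring(0, step.length() - 3) : step;
    }

    public static void main(String[] args) {
        String path = "me.lihongyu.bean.Person$cloth.brand.type";
        System.out.println(messageName(path));        // me.lihongyu.bean.Person
        System.out.println(steps(path));              // [cloth, brand, type]
        System.out.println(steps("a.b.A$(c.d.x)"));   // [(c.d.x)]
    }
}
```

Splitting on the first $ keeps the message name intact even though it contains dots itself, which is presumably why the syntax uses a separator other than the dot.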

Output, Input, and Syntax

  1. DynamicPBParser.parse takes two arguments:
    1. The PB data, Base64-encoded.
    2. The path of the field to extract.
  2. Field Path Syntax:
    1. The $ symbol is used to separate the class name (message name) and the field name.
    2. Format for nested objects: package_name.message_name$field1_name.field2_name
    3. Format for nested arrays (repeated fields): package_name.message_name$field1_name[*].field2_name[0], where field1_name[*] can also be simplified to field1_name.
    4. Extension Fields:
      1. When an extension field x of message A is defined inside message B, parse x by treating the data of A as if it were B. Example:
        // First .proto file
        package a.b;
        message A {
            extensions 100 to 199;
        }
        
        // Second .proto file (it must import the first)
        package c.d;
        message B {
            extend a.b.A {
                optional int32 x = 100;
            }
        }
        
        // Pseudocode: data is the Base64-encoded serialized A
        data=Base64(A);
        result = parser.parse(data, "c.d.B$x");
      2. When an extension field x of message A is not defined inside any message but directly at package level, the path must spell out the full name of the extension field, enclosed in ASCII (half-width) parentheses (). Example:
        // First .proto file
        package a.b;
        message A {
            extensions 100 to 199;
        }
        
        // Second .proto file (it must import the first)
        package c.d;
        extend a.b.A {
            optional int32 x = 100; // Protobuf considers the full path of this field to be `c.d.x`
        }
        
        // Pseudocode: data is the Base64-encoded serialized A
        data=Base64(A);
        result = parser.parse(data, "a.b.A$(c.d.x)");
  3. Output:
    1. The result is always a string.
    2. If the path resolves to a single (non-repeated) value:
      1. A message is returned as the Base64 encoding of its PB-serialized bytes, e.g., "CggKBG5pa2UQARC2YA==".
      2. A byte array is returned as a Base64-encoded string.
      3. Any other value is returned as the result of toString().
    3. If the path resolves to an array (repeated field), the result has the format [xxx,xxx,xxx]:
      1. Numeric elements are written as the numbers themselves, e.g., [1,2,3].
      2. Object elements are each formatted according to rule 2.
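
To make the Base64 output rule concrete, the sketch below decodes the sample value quoted above using only the JDK and lists its top-level fields at the wire-format level. The class SampleResultSketch is hypothetical and is not part of dynamic-pb-parser. It handles only wire types 0 (varint) and 2 (length-delimited), and it cannot recover field names or message types, which is exactly why the parser needs a .desc file.

```java
import java.util.Base64;

// Hypothetical helper for inspecting a Base64 result; not the library's code.
class SampleResultSketch {

    /** Reads a base-128 varint starting at pos[0] and advances pos[0] past it. */
    static long readVarint(byte[] buf, int[] pos) {
        long value = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos[0]++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
    }

    /** Describes the top-level fields of a serialized message by field number. */
    static String describe(byte[] buf) {
        StringBuilder sb = new StringBuilder();
        int[] pos = {0};
        while (pos[0] < buf.length) {
            long tag = readVarint(buf, pos);
            int wireType = (int) (tag & 7);
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append("field ").append(tag >>> 3).append(": ");
            if (wireType == 0) {            // varint
                sb.append("varint ").append(readVarint(buf, pos));
            } else if (wireType == 2) {     // string, bytes, or nested message
                int len = (int) readVarint(buf, pos);
                sb.append(len).append(" bytes");
                pos[0] += len;              // skip the payload
            } else {
                throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "CggKBG5pa2UQARC2YA==";   // sample value quoted in the README
        byte[] raw = Base64.getDecoder().decode(sample);
        System.out.println(describe(raw));        // field 1: 8 bytes; field 2: varint 12342
        // Re-encoding yields the same string, so the value can be passed to parse again.
        System.out.println(Base64.getEncoder().encodeToString(raw).equals(sample)); // true
    }
}
```

Because a message-typed result is ordinary Base64 text, it can be stored in a string column or used as the first argument of another parse call.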

Q&A

  1. If you see java.lang.IllegalArgumentException with the message "XXX.XXX can not be found in any description file! Please check out if it exist.", check that the .desc file contains a message definition for XXX.XXX, and check package_name and message_name for typos.
  2. If you see IllegalArgumentException with the message "xxx(fieldName) is not found in xxx(message name)", check the field name for typos.
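
In batch jobs, one mistyped path should often not abort the whole run. The sketch below shows one possible way to turn the two exceptions above into a null result. SafeParseSketch and parseOrNull are hypothetical names, and the sketch assumes that parse has the shape String parse(String, String), so that parser::parse could be passed as the BiFunction. A stand-in lambda is used here so the example runs without the library.

```java
import java.util.function.BiFunction;

// Hypothetical wrapper; not part of dynamic-pb-parser.
class SafeParseSketch {

    /** Calls parse.apply(data, path) and returns null if it throws IllegalArgumentException. */
    static String parseOrNull(BiFunction<String, String, String> parse, String data, String path) {
        try {
            return parse.apply(data, path);
        } catch (IllegalArgumentException e) {
            // Log the failing path so that typos can still be found later.
            System.err.println("parse failed for path " + path + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        // Stand-in for parser::parse (assumption: the real method has the same shape).
        BiFunction<String, String, String> fake = (data, path) -> {
            if (!path.startsWith("a.b.A$")) {
                throw new IllegalArgumentException(path + " can not be found in any description file!");
            }
            return "42";
        };
        System.out.println(parseOrNull(fake, "ignored", "a.b.A$x"));     // 42
        System.out.println(parseOrNull(fake, "ignored", "a.b.Typo$x"));  // null
    }
}
```

Whether a failure should surface as an exception or as null is a design choice; the wrapper keeps that choice at the call site.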

Todo List

  • support extension field
  • support array type
  • exception or null?
  • support proto importing
  • optimize the performance


About

A tool for dynamically parsing Protobuf data by field path.
