Skip to content

Conversation

@arne-bdt
Copy link
Contributor

@arne-bdt arne-bdt commented Sep 28, 2024

Faster parsing of RDF/XML by avoiding duplicated resolving of IRIs and adding cache for IRIx in parsers
(Parsers: RRX.RDFXML_SAX, RRX.RDFXML_StAX_ev, RRX.RDFXML_StAX_sr )

GitHub issue resolved #2740

Pull request Description:

  • added "public Node createURI(IRIx iriX, ...);" to the ParserProfile, which simply uses the given IRI instead of resolving it again.
  • adding general IRIx caching (org.apache.jena.atlas.lib.cache.CacheSimple) in the parsers where the already cached org.apache.jena.riot.system.ParserProfileStd#resolver is not applicable
  • removed unused code and variables from ParserRRX_StAX_SR and ParserRRX_StAX_EV
  • added org.apache.jena.riot.lang.rdfxml.TestXMLParser in jena-benchmarks-jmh

Benchmark results:

Benchmark                                                       (param0_GraphUri)  (param1_ParserLang)  Mode  Cnt   Score   Error  Units
TestXMLParser.parseXML                                   ../testing/citations.rdf       RRX.RDFXML_SAX  avgt    5  47,232 ± 0,778   s/op
TestXMLParser.parseXML                                   ../testing/citations.rdf   RRX.RDFXML_StAX_ev  avgt    5  76,502 ± 4,390   s/op
TestXMLParser.parseXML                                   ../testing/citations.rdf   RRX.RDFXML_StAX_sr  avgt    5  48,689 ± 2,224   s/op
TestXMLParser.parseXML                                   ../testing/citations.rdf      RRX.RDFXML_ARP1  avgt    5  86,298 ± 2,440   s/op
TestXMLParser.parseXML                                ../testing/BSBM/bsbm-5m.xml       RRX.RDFXML_SAX  avgt    5   9,576 ± 0,402   s/op
TestXMLParser.parseXML                                ../testing/BSBM/bsbm-5m.xml   RRX.RDFXML_StAX_ev  avgt    5  11,562 ± 0,535   s/op
TestXMLParser.parseXML                                ../testing/BSBM/bsbm-5m.xml   RRX.RDFXML_StAX_sr  avgt    5   9,406 ± 0,465   s/op
TestXMLParser.parseXML                                ../testing/BSBM/bsbm-5m.xml      RRX.RDFXML_ARP1  avgt    5  19,738 ± 1,526   s/op
TestXMLParser.parseXML          CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml       RRX.RDFXML_SAX  avgt    5   0,998 ± 0,223   s/op
TestXMLParser.parseXML          CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_ev  avgt    5   1,325 ± 0,093   s/op
TestXMLParser.parseXML          CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_sr  avgt    5   0,985 ± 0,018   s/op
TestXMLParser.parseXML          CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml      RRX.RDFXML_ARP1  avgt    5   2,357 ± 0,163   s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml       RRX.RDFXML_SAX  avgt    5   0,146 ± 0,029   s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml   RRX.RDFXML_StAX_ev  avgt    5   0,192 ± 0,007   s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml   RRX.RDFXML_StAX_sr  avgt    5   0,140 ± 0,016   s/op
TestXMLParser.parseXML         CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml      RRX.RDFXML_ARP1  avgt    5   0,309 ± 0,098   s/op
TestXMLParser.parseXMLJena510                            ../testing/citations.rdf       RRX.RDFXML_SAX  avgt    5  57,690 ± 0,932   s/op
TestXMLParser.parseXMLJena510                            ../testing/citations.rdf   RRX.RDFXML_StAX_ev  avgt    5  84,579 ± 4,109   s/op
TestXMLParser.parseXMLJena510                            ../testing/citations.rdf   RRX.RDFXML_StAX_sr  avgt    5  56,949 ± 0,815   s/op
TestXMLParser.parseXMLJena510                            ../testing/citations.rdf      RRX.RDFXML_ARP1  avgt    5  82,940 ± 0,815   s/op
TestXMLParser.parseXMLJena510                         ../testing/BSBM/bsbm-5m.xml       RRX.RDFXML_SAX  avgt    5  13,280 ± 0,458   s/op
TestXMLParser.parseXMLJena510                         ../testing/BSBM/bsbm-5m.xml   RRX.RDFXML_StAX_ev  avgt    5  14,994 ± 0,803   s/op
TestXMLParser.parseXMLJena510                         ../testing/BSBM/bsbm-5m.xml   RRX.RDFXML_StAX_sr  avgt    5  13,132 ± 0,166   s/op
TestXMLParser.parseXMLJena510                         ../testing/BSBM/bsbm-5m.xml      RRX.RDFXML_ARP1  avgt    5  19,125 ± 1,044   s/op
TestXMLParser.parseXMLJena510   CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml       RRX.RDFXML_SAX  avgt    5   1,311 ± 0,018   s/op
TestXMLParser.parseXMLJena510   CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_ev  avgt    5   1,693 ± 0,021   s/op
TestXMLParser.parseXMLJena510   CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml   RRX.RDFXML_StAX_sr  avgt    5   1,332 ± 0,179   s/op
TestXMLParser.parseXMLJena510   CGMES_v2.4.15_RealGridTestConfiguration_EQ_V2.xml      RRX.RDFXML_ARP1  avgt    5   2,305 ± 0,280   s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml       RRX.RDFXML_SAX  avgt    5   0,194 ± 0,028   s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml   RRX.RDFXML_StAX_ev  avgt    5   0,227 ± 0,016   s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml   RRX.RDFXML_StAX_sr  avgt    5   0,194 ± 0,025   s/op
TestXMLParser.parseXMLJena510  CGMES_v2.4.15_RealGridTestConfiguration_SSH_V2.xml      RRX.RDFXML_ARP1  avgt    5   0,291 ± 0,039   s/op

  • Tests are included.
  • no Documentation change and updates needed
  • Commits have been squashed to remove intermediate development commit messages.
  • Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

@arne-bdt arne-bdt force-pushed the GH-2740-Faster-parsing-of-RDFXML branch from de49918 to 0feaf9e Compare September 29, 2024 12:22
@arne-bdt arne-bdt marked this pull request as draft September 29, 2024 13:54
@arne-bdt arne-bdt force-pushed the GH-2740-Faster-parsing-of-RDFXML branch from 0feaf9e to 3b9a9cf Compare September 29, 2024 14:11
@arne-bdt arne-bdt marked this pull request as ready for review September 29, 2024 14:13
@arne-bdt
Copy link
Contributor Author

arne-bdt commented Sep 29, 2024

  • undid removal of httpClient from org.apache.jena.riot.RDFParserBuilder
  • made org.apache.jena.riot.RDFParserBuilder#httpHeader available again by uncommenting it
  • moved initalization of httpClient with HttpEnv.getDftHttpClient() from org.apache.jena.riot.RDFParserBuilder#build to org.apache.jena.riot.RDFParser#openTypedInputStream in the case when the RDFParserBuilder has not been initialized with a custom httpClient. That way, it HttpEnv is not necessarily initialized when reading files but only when needed.
  • updated PR and issue descriptions accordingly

@arne-bdt arne-bdt force-pushed the GH-2740-Faster-parsing-of-RDFXML branch from 3b9a9cf to a8b14d9 Compare October 1, 2024 12:39
@arne-bdt
Copy link
Contributor Author

arne-bdt commented Oct 1, 2024

@afs
Should I maybe split this PR in two and make one part for the caching and one for the HTTPEnv initialization?

Copy link
Member

@rvesse rvesse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think it would be worth splitting this PR

The HttpClient delayed init seems like a quick and independent win whereas the RDF/XML parsing and caching changes are more involved and warrants separate review IMO

@afs
Copy link
Member

afs commented Oct 2, 2024

@afs Should I maybe split this PR in two and make one part for the caching and one for the HTTPEnv initialization?

Yes please.

@afs
Copy link
Member

afs commented Oct 2, 2024

Small request - could you put a blank line after the first line of the commit message? It separates the "subject" (the %s in log formatting) from the body (%b), otherwise the subject is the whole message.

https://git-scm.com/docs/git-commit#_discussion

@arne-bdt arne-bdt marked this pull request as draft October 2, 2024 09:25
@afs
Copy link
Member

afs commented Oct 2, 2024

The http client delayed setup looks OK but, as Rob says, it is orthogonal. I thought it might require an AtomicReference and the usual double locking pattern in HttpEnv.getDftHttpClient but the way here that utilizes class loading for locking looks OK (this is assuming the costs are indeed the java classloading of the networking code).

@arne-bdt arne-bdt force-pushed the GH-2740-Faster-parsing-of-RDFXML branch 4 times, most recently from 56dc224 to f12ffcc Compare October 4, 2024 07:59
@arne-bdt arne-bdt marked this pull request as ready for review October 4, 2024 08:03
Copy link
Member

@afs afs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and it is faster!

Just a couple of incidental javadoc suggestions.

Comment on lines 41 to 44
* Configures the parser to be safe and sets necessary properties.
* This method should be called when another factory than javax.xml.stream.XMLInputFactory#newDefaultFactory()
* shall be used.
* @param xmlInputFactory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Configures the parser to be safe and sets necessary properties.
* This method should be called when another factory than javax.xml.stream.XMLInputFactory#newDefaultFactory()
* shall be used.
* @param xmlInputFactory
* Configures the parser to be safe and sets necessary properties.
* This method should be called when a factory other than
* javax.xml.stream.XMLInputFactory#newDefaultFactory() is used.
* @param xmlInputFactory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recognized, that I kept using javax.xml.stream.XMLInputFactory#newInstance() instead of javax.xml.stream.XMLInputFactory#newDefaultFactory(), so I fixed the javadoc accordingly.

Parsers: RRX.RDFXML_SAX, RRX.RDFXML_StAX_ev, RRX.RDFXML_StAX_sr

- added "public Node createURI(IRIx iriX, ...);" to the ParserProfile, which simply uses the given IRI instead of resolving it again.
- adding general IRIx caching (org.apache.jena.atlas.lib.cache.CacheSimple) in the parsers
  where the already cached org.apache.jena.riot.system.ParserProfileStd#resolver is not applicable
- removed unused code and variables from ParserRRX_StAX_SR and ParserRRX_StAX_EV
- added `org.apache.jena.riot.lang.rdfxml.TestXMLParser` in jena-benchmarks-jmh
@arne-bdt arne-bdt force-pushed the GH-2740-Faster-parsing-of-RDFXML branch from f12ffcc to 7c43951 Compare October 7, 2024 06:30
@afs afs merged commit 2c9782c into apache:main Oct 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Faster parsing of RDF/XML by avoiding duplicated resolving of IRIs and adding cache for IRIx in parsers

3 participants