Use file content heuristics to decide file reader.#1962
Use file content heuristics to decide file reader.#1962Dimi1010 wants to merge 86 commits intoseladb:devfrom
Conversation
…sed on the magic number.
…ics detection method.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #1962 +/- ##
==========================================
+ Coverage 83.81% 83.84% +0.02%
==========================================
Files 313 313
Lines 55676 55972 +296
Branches 11606 11883 +277
==========================================
+ Hits 46665 46927 +262
- Misses 7810 7832 +22
- Partials 1201 1213 +12
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Tests/Pcap++Test/Tests/FileTests.cpp
Outdated
| PTF_ASSERT_NOT_NULL(dynamic_cast<pcpp::PcapNgFileReaderDevice*>(genericReader)); | ||
| PTF_ASSERT_TRUE(genericReader->open()); | ||
| // ------- IFileReaderDevice::createReader() Factory | ||
| // TODO: Move to a separate unit test. |
There was a problem hiding this comment.
We should add the following to get more coverage:
- Open a snoop file
- Open a file that is not any of the options
- Open pcap files with different magic numbers
- Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions
There was a problem hiding this comment.
3d713ab adds the following tests:
- Pcap, PcapNG, Zst file with correct content + extension
- Pcap, PcanNG file with correct content + wrong extension
- Bogus content file with correct extension (pcap, pcapng, zst)
- Bogus content file with wrong extension (txt)
Haven't found a snoop file to add. Do we have any?
Open pcap files with different magic numbers
Do you mean Pcap content that has just its magic number changed? Because IMO it is reasonable to consider that invalid format and fail as regular bogus data.
Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions
Pending on #1962 (comment) .
Move it out if it needs to be reused somewhere.
Libpcap supports reading this format since 0.9.1. The heuristics detection will identify such magic number as pcap and leave final support decision to the pcap backend infrastructure.
|
@Dimi1010 some CI tests fail... |
…in `createReader` instead of having the Format detector assume that is what is intended.
…t detector from libpcap behaviour.
…AII initialization.
…n tryCreateDevice.
…le-selection # Conflicts: # Pcap++/src/PcapFileDevice.cpp # Tests/Pcap++Test/Tests/FileTests.cpp
|
@seladb can we merge this? It has been sitting for a while. |
| }; | ||
|
|
||
| /// @brief Heuristic file format detector that scans the magic number of the file format header. | ||
| class CaptureFileFormatDetector |
There was a problem hiding this comment.
Since we're not parsing all formats (maybe except Zstd) in PcapPlusPlus, we can reuse the logic we already have. Maybe it can run the open() method (or extract a portion of it) for each reader type until it can to find the right type?
There was a problem hiding this comment.
WDYM, we are not parsing all formats? Did you mean "now"?
Also, the necessary logic to detect the file format is already extracted in this class. Tbh, the open() call should probably delegate the format detection to this class if more comprehensive magic number format validation is needed.
IMO, how the file is processed after format detection that is a separate concern. In the device selection that is to be handled in the createReader device factory, thus allowing looser coupling between actual device classes and format detection. (e.g it is as simple to swapping if PcapNG creates PcapDevice or PcapNGDevice as swapping a case statement).
I think integrating the functionality into open() would be suboptimal for the following reasons:
- It potentially adds more responsibilities to the function that just "open the device".
- Looping through all the devices would involve iterating through a loop of more complicated operations.
Constructing the device and possibly repeated file open / close for eachopen()call as it is designed to function independently. - An
open()call can fail for multiple other reasons, not affiliated with the file format specifically.
The PR adds heuristics based on the file content that is more robust than deciding based on the file extension.
The new decision model scans the start of the file for its magic number signature. It then compares it to the signatures of supported file types [1] and constructs a reader instance based on the result.
A new function
createReaderandtryCreateReaderhas been added due to changes in the public API of the factory.The functions differ in the error handling scheme, as
createReaderthrows andtryCreateReaderreturnsnullptron error.Method behaviour changes during erroneous scenarios:
getReadercreateReadertryCreateReadernullptrPcapFileDeviceReadernullptr