Skip to content

Comments

Use file content heuristics to decide file reader.#1962

Open
Dimi1010 wants to merge 86 commits intoseladb:devfrom
Dimi1010:feature/heuristic-file-selection
Open

Use file content heuristics to decide file reader.#1962
Dimi1010 wants to merge 86 commits intoseladb:devfrom
Dimi1010:feature/heuristic-file-selection

Conversation

@Dimi1010
Copy link
Collaborator

@Dimi1010 Dimi1010 commented Sep 12, 2025

The PR adds heuristics based on the file content that is more robust than deciding based on the file extension.

The new decision model scans the start of the file for its magic number signature. It then compares it to the signatures of supported file types [1] and constructs a reader instance based on the result.

A new function createReader and tryCreateReader has been added due to changes in the public API of the factory.
The functions differ in the error handling scheme, as createReader throws and tryCreateReader returns nullptr on error.

Method behaviour changes during erroneous scenarios:

Scenario getReader createReader tryCreateReader
File not found N/A Throws exception Return nullptr
Unsupported format Return PcapFileDeviceReader Throws exception Return nullptr

@codecov
Copy link

codecov bot commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 88.78205% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.84%. Comparing base (bcc9ffd) to head (5a62e9b).

Files with missing lines Patch % Lines
Pcap++/src/PcapFileDevice.cpp 85.97% 19 Missing and 4 partials ⚠️
Tests/Pcap++Test/Tests/FileTests.cpp 91.36% 5 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1962      +/-   ##
==========================================
+ Coverage   83.81%   83.84%   +0.02%     
==========================================
  Files         313      313              
  Lines       55676    55972     +296     
  Branches    11606    11883     +277     
==========================================
+ Hits        46665    46927     +262     
- Misses       7810     7832      +22     
- Partials     1201     1213      +12     
Flag Coverage Δ
alpine320 76.40% <79.78%> (+0.01%) ⬆️
fedora42 76.16% <80.21%> (+0.01%) ⬆️
macos-14 81.97% <82.49%> (-0.01%) ⬇️
macos-15 81.96% <83.65%> (-0.01%) ⬇️
mingw32 70.42% <79.41%> (+0.04%) ⬆️
mingw64 70.39% <79.41%> (+0.11%) ⬆️
npcap ?
rhel94 75.76% <79.21%> (+0.01%) ⬆️
ubuntu2004 59.58% <59.30%> (-0.02%) ⬇️
ubuntu2004-zstd 59.68% <57.98%> (-0.05%) ⬇️
ubuntu2204 75.70% <79.21%> (+0.01%) ⬆️
ubuntu2204-icpx 59.06% <59.32%> (-0.02%) ⬇️
ubuntu2404 76.08% <79.21%> (+0.01%) ⬆️
ubuntu2404-arm64 76.08% <79.78%> (+0.03%) ⬆️
unittest 83.84% <88.78%> (+0.02%) ⬆️
windows-2022 85.62% <85.60%> (+0.12%) ⬆️
windows-2025 85.65% <85.60%> (+0.13%) ⬆️
winpcap 85.65% <85.60%> (-0.09%) ⬇️
xdp 51.62% <0.95%> (-0.32%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Dimi1010 Dimi1010 added the API deprecation Pull requests that deprecate parts of the public interface. label Sep 12, 2025
@Dimi1010 Dimi1010 marked this pull request as ready for review September 12, 2025 11:36
@Dimi1010 Dimi1010 requested a review from seladb as a code owner September 12, 2025 11:36
PTF_ASSERT_NOT_NULL(dynamic_cast<pcpp::PcapNgFileReaderDevice*>(genericReader));
PTF_ASSERT_TRUE(genericReader->open());
// ------- IFileReaderDevice::createReader() Factory
// TODO: Move to a separate unit test.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add the following to get more coverage:

  • Open a snoop file
  • Open a file that is not any of the options
  • Open pcap files with different magic numbers
  • Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3d713ab adds the following tests:

  • Pcap, PcapNG, Zst file with correct content + extension
  • Pcap, PcanNG file with correct content + wrong extension
  • Bogus content file with correct extension (pcap, pcapng, zst)
  • Bogus content file with wrong extension (txt)

Haven't found a snoop file to add. Do we have any?

Open pcap files with different magic numbers

Do you mean Pcap content that has just its magic number changed? Because IMO it is reasonable to consider that invalid format and fail as regular bogus data.

Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions

Pending on #1962 (comment) .

Move it out if it needs to be reused somewhere.
Libpcap supports reading this format since 0.9.1. The heuristics detection will identify such magic number as pcap and leave final support decision to the pcap backend infrastructure.
@seladb
Copy link
Owner

seladb commented Sep 21, 2025

@Dimi1010 some CI tests fail...

@Dimi1010 Dimi1010 requested a review from seladb December 30, 2025 10:38
@seladb
Copy link
Owner

seladb commented Dec 31, 2025

@Dimi1010 I'm working on parsing pcap files without libpcap: #2034
Maybe we can rework this PR after my PR is merged?

@Dimi1010
Copy link
Collaborator Author

@seladb can we merge this? It has been sitting for a while.

};

/// @brief Heuristic file format detector that scans the magic number of the file format header.
class CaptureFileFormatDetector
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're not parsing all formats (maybe except Zstd) in PcapPlusPlus, we can reuse the logic we already have. Maybe it can run the open() method (or extract a portion of it) for each reader type until it can to find the right type?

Copy link
Collaborator Author

@Dimi1010 Dimi1010 Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYM, we are not parsing all formats? Did you mean "now"?

Also, the necessary logic to detect the file format is already extracted in this class. Tbh, the open() call should probably delegate the format detection to this class if more comprehensive magic number format validation is needed.

IMO, how the file is processed after format detection that is a separate concern. In the device selection that is to be handled in the createReader device factory, thus allowing looser coupling between actual device classes and format detection. (e.g it is as simple to swapping if PcapNG creates PcapDevice or PcapNGDevice as swapping a case statement).

I think integrating the functionality into open() would be suboptimal for the following reasons:

  • It potentially adds more responsibilities to the function that just "open the device".
  • Looping through all the devices would involve iterating through a loop of more complicated operations.
    Constructing the device and possibly repeated file open / close for each open() call as it is designed to function independently.
  • An open() call can fail for multiple other reasons, not affiliated with the file format specifically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API deprecation Pull requests that deprecate parts of the public interface. enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add indication if LightPcapNG backend is compiled with ZSTD compression support.

3 participants