Conversation

mohitjha-elastic
Collaborator

@mohitjha-elastic mohitjha-elastic commented Apr 24, 2025

Proposed Commit Message

atlassian_jira and atlassian_cloud: update cursor logic to remove duplicate events.

After reviewing the Atlassian Jira[1] and Atlassian Confluence[2] API
documentation, it was noticed that the APIs return data on or after the
specified start date. Currently, the date from the response body is
saved into the cursor. Because of this, the request made in the next
interval includes data we've already fetched, leading to duplicate
events being published to Elasticsearch. Hence, update the cursor logic
to add 1ms to the saved date so that the next request fetches only
later data.

This change has been tested on the data available in the test folder as
well as using the mock server.

[1]https://developer.atlassian.com/cloud/jira/platform/rest/v3/api-group-audit-records/#api-rest-api-3-auditing-record-get
[2]https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-audit/#api-wiki-rest-api-audit-get
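
A minimal sketch of the cursor change described above, written in Python for illustration only. The integration itself does not use this code; `fetch_from` is a hypothetical stand-in for an audit API with an inclusive "on or after" filter, and `advance_cursor` is an assumed name for the +1ms step.

```python
from datetime import datetime, timedelta, timezone


def advance_cursor(last_event_time: datetime) -> datetime:
    """Start time for the next request: 1 ms past the newest event already fetched."""
    return last_event_time + timedelta(milliseconds=1)


def fetch_from(events, start):
    """Hypothetical stand-in for the audit API: returns events on or after `start`."""
    return [e for e in events if e >= start]


t0 = datetime(2025, 4, 24, 11, 0, 0, tzinfo=timezone.utc)
events = [t0, t0 + timedelta(milliseconds=1), t0 + timedelta(milliseconds=250)]

# First interval: only the first two events exist yet.
first_poll = fetch_from(events[:2], t0)
cursor = advance_cursor(max(first_poll))

# Second interval: the inclusive filter no longer matches the events already fetched.
second_poll = fetch_from(events, cursor)
```

With the cursor stored as the raw last timestamp instead, the second request would return the event at `t0 + 1ms` again.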

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

How to test this PR locally

To test atlassian_jira integration

Clone integrations repo.
Install elastic-package locally.
Start the Elastic Stack using elastic-package.
Move to integrations/packages/atlassian_jira directory.
Run the following command to run tests.
elastic-package test -v

To test atlassian_confluence integration

Clone integrations repo.
Install elastic-package locally.
Start the Elastic Stack using elastic-package.
Move to integrations/packages/atlassian_confluence directory.
Run the following command to run tests.
elastic-package test -v

Related issues

@mohitjha-elastic mohitjha-elastic added the labels Integration:atlassian_jira, Integration:atlassian_confluence, bugfix, Team:Security-Service Integrations, and Team:Sit-Crest on Apr 24, 2025
@mohitjha-elastic mohitjha-elastic self-assigned this Apr 24, 2025
@mohitjha-elastic mohitjha-elastic requested a review from a team as a code owner April 24, 2025 11:12
@elasticmachine

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

Contributor

@kcreddy kcreddy left a comment

@mohitjha-elastic, while adding a fingerprint also works for this issue, can you check whether it can be solved at the cursor level, without propagating the duplicate event into the ingest pipeline and then ignoring it?

@ShourieG
Contributor

Agree with @kcreddy here. The first option would be to improve the cursor date handling. If that's not possible, we should use fingerprints.

@mohitjha-elastic
Collaborator Author

@kcreddy @ShourieG
We currently don't have any additional fields that can be stored in the cursor. One possible workaround is to add 1 second to the timestamp when storing it in the cursor. However, this risks missing events published during that one-second window (although I assume that scenario is unlikely).

Also, storing the current timestamp in the cursor does not seem feasible, since it might miss data in case of an API failure.

Since the filter fetches data on or after the start time, the only potential duplicates after each interval would be the events (there should be only a few) whose timestamp exactly matches the cursor timestamp. In that case, this approach seems quite reasonable.

Let me know what you think.
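
To make the trade-off above concrete, here is a small Python simulation, a sketch under assumptions rather than the integration's logic. Timestamps are plain millisecond integers, and `poll`, `collect`, and the event records are hypothetical. It compares storing the raw timestamp with adding 1 second or 1 millisecond to it.

```python
def poll(store, start_ms):
    """Inclusive filter: return events created on or after start_ms."""
    return [e for e in store if e["created_ms"] >= start_ms]


def collect(snapshots, step_ms):
    """Poll each snapshot in turn; store (newest timestamp + step_ms) as the cursor."""
    cursor = 0
    seen = []
    for store in snapshots:
        batch = poll(store, cursor)
        seen.extend(e["id"] for e in batch)
        if batch:
            cursor = max(e["created_ms"] for e in batch) + step_ms
    return seen


a = {"id": "a", "created_ms": 10_000}
b = {"id": "b", "created_ms": 10_500}
c = {"id": "c", "created_ms": 10_900}  # lands within 1 s of b, after the first poll
d = {"id": "d", "created_ms": 12_000}

first = [a, b]         # what the API holds at the first interval
second = [a, b, c, d]  # what it holds at the second interval

raw_cursor = collect([first, second], 0)   # b is fetched twice
plus_1s = collect([first, second], 1000)   # c is never fetched
plus_1ms = collect([first, second], 1)     # each event fetched once
```

In this scenario the raw cursor duplicates the boundary event and the one-second step skips an event inside the window. A smaller step narrows that window but does not remove it: an event recorded later with exactly the boundary timestamp would still be skipped.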

@elastic-vault-github-plugin-prod

elastic-vault-github-plugin-prod bot commented Apr 24, 2025

🚀 Benchmarks report

Package atlassian_confluence 👍(0) 💚(0) 💔(1)

Data stream    Previous EPS    New EPS    Diff (%)             Result
audit          2762.43         2159.83    -602.6 (-21.81%)     💔

Package atlassian_jira 👍(0) 💚(0) 💔(1)

Data stream    Previous EPS    New EPS    Diff (%)             Result
audit          3194.89         2257.34    -937.55 (-29.35%)    💔

To see the full report comment with /test benchmark fullreport

@kcreddy
Contributor

kcreddy commented Apr 25, 2025

@mohitjha-elastic +1 second could in fact miss events. We should increment by the lowest unit allowed by the API, which seems to be milliseconds:
https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-audit/#api-wiki-rest-api-audit-get
Can you check if this can be done to avoid missing data?

The problem with fingerprinting is that the fields we believe are unique may not always be so, and we often need to add more fields because users complain of missing data. It might not be bad to just fingerprint on the event.original field to avoid this problem. There is also a cost to fingerprinting (~25% lower indexing rate) because ingest nodes have to first check whether the document already exists in the index.
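
As a rough illustration of the fingerprint idea discussed above, the Python sketch below hashes the whole raw event (standing in for event.original) and uses the digest as the document ID. The dictionary is a hypothetical stand-in for an index, the choice of SHA-256 is an assumption, and none of this is the integration's actual pipeline.

```python
import hashlib


def fingerprint(event_original: str) -> str:
    """Digest of the full raw event, so no hand-picked field set has to be unique."""
    return hashlib.sha256(event_original.encode("utf-8")).hexdigest()


def index_events(index: dict, raw_events) -> int:
    """Store each raw event under its fingerprint; return how many were new."""
    created = 0
    for raw in raw_events:
        doc_id = fingerprint(raw)
        # This existence check is the extra per-document work mentioned above.
        if doc_id not in index:
            index[doc_id] = raw
            created += 1
    return created


index = {}
first_count = index_events(index, ['{"id": 1}', '{"id": 2}'])
second_count = index_events(index, ['{"id": 2}', '{"id": 3}'])  # '{"id": 2}' is a repeat
```

Hashing the full raw event avoids having to guess which fields are unique, at the price of treating two byte-identical events as one.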

1. Remove fingerprint.
2. Update cursor logic to add 1ms to it to remove duplicate events.
@mohitjha-elastic mohitjha-elastic force-pushed the atlassian_jira-and-atlassian_confluence-bugfix branch from d79b1a2 to eb4fc49 Compare April 25, 2025 12:34
@mohitjha-elastic mohitjha-elastic changed the title from "[Atlassian JIRA and Atlassian Conflunce] Add Fingerprint in Pipelines to Remove Duplicate Events" to "[Atlassian JIRA and Atlassian Conflunce] Update Cursor Logic to Remove Duplicate Events" Apr 25, 2025
@mohitjha-elastic
Collaborator Author

Thanks, @kcreddy, for clarifying the use case of adding fingerprints.
Yes, I checked, and the cursor is now incremented by the lowest unit (1 millisecond) to avoid missing data.
Updated the PR.

Contributor

@efd6 efd6 left a comment

Suggest the following for the commit message; line breaks and no markdown in git commit messages, otherwise unaltered.

atlassian_jira and atlassian_cloud: update cursor logic to remove duplicate events.

After reviewing the Atlassian Jira[1] and Atlassian Confluence[2] API
documentation, it was noticed that the APIs return data on or after the
specified start date. Currently, the date from the response body is
saved into the cursor. Because of this, the request made in the next
interval includes data we've already fetched, leading to duplicate
events being published to Elasticsearch. Hence, update the cursor logic
to add 1ms to the saved date so that the next request fetches only
later data.

This change has been tested on the data available in the test folder as
well as using the mock server.

[1]https://developer.atlassian.com/cloud/jira/platform/rest/v3/api-group-audit-records/#api-rest-api-3-auditing-record-get
[2]https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-audit/#api-wiki-rest-api-audit-get

Contributor

@kcreddy kcreddy left a comment

LGTM. Please resolve @efd6's comments before merging.

Add final new line in http json files.
@mohitjha-elastic mohitjha-elastic force-pushed the atlassian_jira-and-atlassian_confluence-bugfix branch from 04c7cb6 to 6955823 Compare April 28, 2025 07:03
@mohitjha-elastic mohitjha-elastic requested a review from efd6 April 28, 2025 07:03
@mohitjha-elastic
Collaborator Author

@efd6 @kcreddy
Updated the commit message and resolved the comments.
Thanks!

@elasticmachine

💚 Build Succeeded

History

cc @mohitjha-elastic

Contributor

@efd6 efd6 left a comment

Thanks

@efd6 efd6 merged commit b16e934 into elastic:main Apr 28, 2025
7 checks passed
@elastic-vault-github-plugin-prod

Package atlassian_confluence - 1.29.1 containing this change is available at https://epr.elastic.co/package/atlassian_confluence/1.29.1/

@elastic-vault-github-plugin-prod

Package atlassian_jira - 1.30.2 containing this change is available at https://epr.elastic.co/package/atlassian_jira/1.30.2/

@mohitjha-elastic mohitjha-elastic deleted the atlassian_jira-and-atlassian_confluence-bugfix branch May 15, 2025 07:23
Labels
bugfix Pull request that fixes a bug issue Integration:atlassian_confluence Atlassian Confluence (Community supported) Integration:atlassian_jira Atlassian Jira (Community supported) Team:Security-Service Integrations Security Service Integrations team [elastic/security-service-integrations] Team:Sit-Crest Crest developers on the Security Integrations team [elastic/sit-crest-contractors]
Development

Successfully merging this pull request may close these issues.

[Atlassian Jira]: Duplicate record pulls due to cursor not progressing
5 participants