A Python module to scrape arxiv.org for a date range and category

Mahdisadjadi/arxivscraper

arXivScraper

An arXiv scraper to retrieve records from given categories and a date range.

Install

Use pip (or pip3 for Python 3):

$ pip install arxivscraper

or download the source and use setup.py:

$ python setup.py install

To update the module using pip:

$ pip install arxivscraper --upgrade

Usage

Basic Example

Import arxivscraper and create a scraper to fetch preprints from a category within a date range:

import arxivscraper

scraper = arxivscraper.Scraper(
    category='cond-mat',
    date_from='2017-05-27',
    date_until='2017-06-07'
)
output = scraper.scrape()

Parameters

The Scraper class accepts the following parameters:

  • category (str): The arXiv category code, such as 'cs', 'math', 'cond-mat', or 'stat'. Supports both base categories and subcategories in multiple formats:

    • Base categories: 'cs', 'math', 'stat', etc.
    • Subcategories with dot notation: 'cs.AI', 'cs.SE', etc.
    • Subcategories with colon notation: 'cs:AI', 'stat:ML', etc.
    • Physics legacy format: 'physics:cond-mat', 'physics:astro-ph', etc.
  • date_from (str, optional): Starting date in format 'YYYY-MM-DD'. Defaults to the first day of the current month.

  • date_until (str, optional): End date in format 'YYYY-MM-DD'. Defaults to today's date.

  • t (int, optional): Waiting time in seconds between retries on HTTP 503 errors. Default: 30.

  • timeout (int, optional): Maximum time in seconds for the entire scraping operation. Default: 300.

  • filters (dict, optional): Dictionary to filter results. Keys can be: 'title', 'abstract', 'author', 'categories', or 'affiliation'. Values are lists of words to match (logical OR). Default: {} (no filtering).
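
The documented defaults for date_from and date_until can be reproduced with the standard library. The following is a minimal sketch of that behaviour, not the library's own code; the helper name default_date_range is hypothetical.

```python
from datetime import date
from typing import Optional, Tuple


def default_date_range(today: Optional[date] = None) -> Tuple[str, str]:
    """Return (date_from, date_until) as 'YYYY-MM-DD' strings.

    Hypothetical helper mirroring the documented defaults: the first day
    of the current month and today's date.
    """
    today = today or date.today()
    first_of_month = today.replace(day=1)
    return first_of_month.isoformat(), today.isoformat()


# A fixed date keeps the result reproducible.
print(default_date_range(date(2017, 6, 7)))  # ('2017-06-01', '2017-06-07')
```

Passing both dates explicitly, as in the basic example, avoids any dependence on the day the script is run.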

Output

The scrape() method returns a list of dictionaries. Each dictionary represents a paper with the following fields:

  • id: arXiv ID
  • title: Paper title
  • abstract: Paper abstract
  • categories: arXiv categories
  • authors: List of author names
  • affiliation: List of author affiliations
  • doi: Digital Object Identifier
  • created: Creation date
  • updated: Last updated date
  • url: URL to the paper on arXiv
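
Because the result is a plain list of dictionaries, it can be processed without extra dependencies. The sketch below indexes records by their id field. The sample records are placeholders invented for illustration, not real arXiv entries, and index_by_id is a hypothetical helper.

```python
from typing import Dict, List


def index_by_id(records: List[dict]) -> Dict[str, dict]:
    """Map each record's 'id' to the record; a later duplicate overwrites an earlier one."""
    return {record["id"]: record for record in records}


# Placeholder records with only some of the documented fields filled in.
sample_output = [
    {"id": "0000.00001", "title": "placeholder title a", "authors": ["a. author"]},
    {"id": "0000.00002", "title": "placeholder title b", "authors": ["b. author"]},
]

papers = index_by_id(sample_output)
print(papers["0000.00002"]["title"])  # placeholder title b
```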

Example with pandas DataFrame:

import pandas as pd

output = scraper.scrape()
df = pd.DataFrame(output)

Filtering Results

To filter results based on specific criteria, pass a filters dictionary. Filters use logical OR, so records matching any of the specified words in a filter key will be included:

scraper = arxivscraper.Scraper(
    category='stat',
    date_from='2017-08-01',
    date_until='2017-08-10',
    filters={
        'categories': ['stat.ml'],
        'abstract': ['learning']
    }
)
output = scraper.scrape()

This returns papers in the Statistics category whose categories include 'stat.ml' or whose abstract contains 'learning'.
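
The OR semantics can be illustrated in plain Python. This is a sketch of the behaviour described above, not the library's implementation. It assumes case-insensitive substring matching and looks up the record field with the same name as the filter key; matches_any and the sample records are hypothetical.

```python
from typing import Dict, List


def matches_any(record: dict, filters: Dict[str, List[str]]) -> bool:
    """Return True if any filter word occurs in the corresponding record field.

    Assumption: matching is a case-insensitive substring test.
    An empty filters dict keeps every record.
    """
    if not filters:
        return True
    for key, words in filters.items():
        value = record.get(key, "")
        # A field may hold a list of strings; join it into one string first.
        text = " ".join(value) if isinstance(value, list) else str(value)
        text = text.lower()
        if any(word.lower() in text for word in words):
            return True
    return False


filters = {"categories": ["stat.ml"], "abstract": ["learning"]}

# Placeholder records, not real arXiv entries.
by_category = {"categories": "stat.ml", "abstract": "placeholder"}
by_abstract = {"categories": "stat.ap", "abstract": "a placeholder about learning"}
no_match = {"categories": "stat.ap", "abstract": "placeholder"}

kept = [r for r in (by_category, by_abstract, no_match) if matches_any(r, filters)]
print(len(kept))  # 2
```

A record is kept as soon as one key matches, so adding more keys or words can only widen the result set.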

Contributing

Ideas/bugs/comments? Please open an issue or submit a pull request on GitHub.

How to cite

If arxivscraper was useful in your work/research, please consider citing it as:

Mahdi Sadjadi (2017). arxivscraper: Zenodo. http://doi.org/10.5281/zenodo.889853

or

@misc{msadjadi,
  author       = {Mahdi Sadjadi},
  title        = {arxivscraper},
  year         = 2017,
  doi          = {10.5281/zenodo.889853},
  url          = {https://doi.org/10.5281/zenodo.889853}
}

Author

License

This project is licensed under the MIT License - see the LICENSE file for details.

Categories

Here is a list of all categories available on arXiv. For a complete list of subcategories, see categories_v2.md. To generate this table, see arxivscraper/util/create_arxiv_category_markdown_table.

| Category Code | Category |
| --- | --- |
| cs | Computer Science |
| econ | Economics |
| eess | Electrical Engineering and Systems Science |
| math | Mathematics |
| astro-ph | Astrophysics |
| cond-mat | Condensed Matter |
| gr-qc | General Relativity and Quantum Cosmology |
| hep-ex | High Energy Physics - Experiment |
| hep-lat | High Energy Physics - Lattice |
| hep-ph | High Energy Physics - Phenomenology |
| hep-th | High Energy Physics - Theory |
| math-ph | Mathematical Physics |
| nlin | Nonlinear Sciences |
| nucl-ex | Nuclear Experiment |
| nucl-th | Nuclear Theory |
| physics | Physics (Other) |
| quant-ph | Quantum Physics |
| q-bio | Quantitative Biology |
| q-fin | Quantitative Finance |
| stat | Statistics |
