An arXiv scraper that retrieves records from given categories within a date range.
Use pip (or pip3 for Python 3):
$ pip install arxivscraper
or download the source and use setup.py:
$ python setup.py install
To update the module using pip:
$ pip install arxivscraper --upgrade

Import arxivscraper and create a scraper to fetch preprints from a category within a date range:
import arxivscraper
scraper = arxivscraper.Scraper(
    category='cond-mat',
    date_from='2017-05-27',
    date_until='2017-06-07'
)
output = scraper.scrape()

The Scraper class accepts the following parameters:
- category (str): The arXiv category code (e.g., 'cs', 'math', 'cond-mat', 'stat', etc.). Supports both base categories and subcategories in multiple formats:
  - Base categories: 'cs', 'math', 'stat', etc.
  - Subcategories with dot notation: 'cs.AI', 'cs.SE', etc.
  - Subcategories with colon notation: 'cs:AI', 'stat:ML', etc.
  - Physics legacy format: 'physics:cond-mat', 'physics:astro-ph', etc.
- date_from (str, optional): Starting date in the format 'YYYY-MM-DD'. Defaults to the first day of the current month.
- date_until (str, optional): End date in the format 'YYYY-MM-DD'. Defaults to today's date.
- t (int, optional): Waiting time in seconds between retries on HTTP 503 errors. Default: 30.
- timeout (int, optional): Maximum time in seconds for the entire scraping operation. Default: 300.
- filters (dict, optional): Dictionary to filter results. Keys can be 'title', 'abstract', 'author', 'categories', or 'affiliation'. Values are lists of words to match (logical OR). Default: {} (no filtering).
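As a sketch of how the date parameters might be filled in programmatically, the helper below builds keyword arguments for a scrape covering the most recent few days. The helper name `last_n_days_kwargs` and its `today` argument are hypothetical and not part of arxivscraper; only the parameter names `category`, `date_from`, `date_until`, `t`, and `timeout` (and the defaults 30 and 300) come from the list above.

```python
from datetime import date, timedelta
from typing import Optional


def last_n_days_kwargs(category: str, days: int, today: Optional[date] = None) -> dict:
    """Build Scraper keyword arguments spanning `days` calendar dates (hypothetical helper)."""
    if days < 1:
        raise ValueError("days must be at least 1")
    end = today or date.today()
    # Start date chosen so that start..end covers `days` calendar dates.
    start = end - timedelta(days=days - 1)
    return {
        "category": category,
        "date_from": start.isoformat(),   # ISO format is 'YYYY-MM-DD'
        "date_until": end.isoformat(),
        "t": 30,         # documented default wait between retries
        "timeout": 300,  # documented default overall timeout
    }


# A fixed `today` keeps the example reproducible.
kwargs = last_n_days_kwargs("cs.AI", days=7, today=date(2017, 6, 7))
print(kwargs["date_from"], kwargs["date_until"])
```

The resulting dictionary could then be passed on as `arxivscraper.Scraper(**kwargs)`.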
The scrape() method returns a list of dictionaries. Each dictionary represents a paper with the following fields:
- id: arXiv ID
- title: Paper title
- abstract: Paper abstract
- categories: arXiv categories
- authors: List of author names
- affiliation: List of author affiliations
- doi: Digital Object Identifier
- created: Creation date
- updated: Last updated date
- url: URL to the paper on arXiv
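Because each record is a plain dictionary, the output can also be handled with the standard library alone. The sketch below writes a few of the fields above to CSV text. The function `records_to_csv` is a hypothetical helper, and `sample_output` is hand-written placeholder data standing in for the result of `scraper.scrape()`, not real arXiv records.

```python
import csv
import io

# Hand-written placeholder records in the documented shape (not real arXiv data).
sample_output = [
    {"id": "0000.00001", "title": "example title one", "authors": ["a. author", "b. author"]},
    {"id": "0000.00002", "title": "example title two", "authors": ["c. author"]},
]


def records_to_csv(records, fields):
    """Serialize records to CSV text, joining list-valued fields with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        row = {}
        for field in fields:
            value = record.get(field, "")
            # Lists such as `authors` are flattened into a single cell.
            row[field] = "; ".join(value) if isinstance(value, list) else value
        writer.writerow(row)
    return buffer.getvalue()


csv_text = records_to_csv(sample_output, ["id", "title", "authors"])
print(csv_text)
```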
Example with pandas DataFrame:
import pandas as pd
output = scraper.scrape()
df = pd.DataFrame(output)

To filter results based on specific criteria, pass a filters dictionary. Filters use logical OR, so a record is included if it matches any of the specified words under any filter key:
scraper = arxivscraper.Scraper(
    category='stat',
    date_from='2017-08-01',
    date_until='2017-08-10',
    filters={
        'categories': ['stat.ml'],
        'abstract': ['learning']
    }
)
output = scraper.scrape()

This will return papers in the Statistics category where either the category includes 'stat.ml' OR the abstract contains 'learning'.
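The OR semantics described above can be illustrated with a small stand-alone function. This is only a sketch of the documented behavior, not arxivscraper's actual implementation: the name `matches_filters`, the case-insensitive substring matching, and the placeholder records are all assumptions made for illustration.

```python
def matches_filters(record, filters):
    """Return True if any word under any filter key occurs in that record field."""
    if not filters:
        return True  # empty filters mean no filtering
    for key, words in filters.items():
        field = record.get(key, "")
        # List-valued fields are flattened into one string before matching.
        text = " ".join(field) if isinstance(field, list) else str(field)
        text = text.lower()
        if any(word.lower() in text for word in words):
            return True  # one hit on any key is enough (logical OR)
    return False


filters = {"categories": ["stat.ml"], "abstract": ["learning"]}

# Placeholder records, not real arXiv data.
records = [
    {"categories": "stat.ml cs.lg", "abstract": "a study of kernels"},
    {"categories": "stat.me", "abstract": "learning rates for estimators"},
    {"categories": "stat.ap", "abstract": "survey sampling"},
]

kept = [r for r in records if matches_filters(r, filters)]
print(len(kept))
```

The first record is kept because of its category, the second because of its abstract, and the third matches neither filter and is dropped.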
Ideas/bugs/comments? Please open an issue or submit a pull request on GitHub.
If arxivscraper was useful in your work/research, please consider citing it as:
Mahdi Sadjadi (2017). arxivscraper: Zenodo. http://doi.org/10.5281/zenodo.889853
or
@misc{msadjadi,
author = {Mahdi Sadjadi},
title = {arxivscraper},
year = 2017,
doi = {10.5281/zenodo.889853},
url = {https://doi.org/10.5281/zenodo.889853}
}
- Mahdi Sadjadi, 2017.
- Website: mahdisadjadi.com
- Twitter: @mahdisadjadi
This project is licensed under the MIT License - see the LICENSE file for details.
Here is a list of all categories available on arXiv. For a complete list of subcategories, see categories_v2.md.
To generate this table, see arxivscraper/util/create_arxiv_category_markdown_table.
| Category Code | Category |
|---|---|
| cs | Computer Science |
| econ | Economics |
| eess | Electrical Engineering and Systems Science |
| math | Mathematics |
| astro-ph | Astrophysics |
| cond-mat | Condensed Matter |
| gr-qc | General Relativity and Quantum Cosmology |
| hep-ex | High Energy Physics - Experiment |
| hep-lat | High Energy Physics - Lattice |
| hep-ph | High Energy Physics - Phenomenology |
| hep-th | High Energy Physics - Theory |
| math-ph | Mathematical Physics |
| nlin | Nonlinear Sciences |
| nucl-ex | Nuclear Experiment |
| nucl-th | Nuclear Theory |
| physics | Physics (Other) |
| quant-ph | Quantum Physics |
| q-bio | Quantitative Biology |
| q-fin | Quantitative Finance |
| stat | Statistics |