Download metadata for all DOIs using the Crossref API
Store and process the Crossref Database
This repository downloads Crossref metadata using the Crossref API. The items retrieved are stored in MongoDB to preserve their raw structure. This design allows for flexible downstream analyses.
MongoDB
MongoDB is run via Docker. It's available on the host machine at http://localhost:27017/.
```sh
docker run \
  --name=mongo-crossref \
  --publish=27017:27017 \
  --volume=`pwd`/mongo.db:/data/db \
  --rm \
  mongo:3.4.2
```
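As a rough illustration of storing raw items once the container is up, the sketch below upserts each item keyed by its `DOI` field. It is not taken from `download.py`: the helper names `store_items` and `get_works_collection` are hypothetical, the upsert-by-DOI strategy is an assumption, and `pymongo` is assumed to be installed. The database and collection names (`crossref`, `works`) match the ones used with `mongoexport` in this README.

```python
def store_items(collection, items):
    """Upsert each raw Crossref item into `collection`, keyed by DOI.

    `collection` is expected to behave like a pymongo Collection.
    Returns the number of items written.
    """
    count = 0
    for item in items:
        # With upsert=True, replace_one inserts the document when no match
        # exists, so rerunning a batch does not create duplicates.
        collection.replace_one({"DOI": item["DOI"]}, item, upsert=True)
        count += 1
    return count


def get_works_collection(host="localhost", port=27017):
    """Connect to the Dockerized MongoDB and return the crossref.works collection."""
    # Imported lazily so store_items can be used without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(host, port)
    return client["crossref"]["works"]
```

With the container running, `store_items(get_works_collection(), items)` would write a batch of items without altering their structure.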
Execution
works
With mongo running, execute with the following commands:
```sh
# Download all works
# To start fresh, use --cursor=*
# If querying fails midway, you can extract the cursor of the
# last successful query from the tail of query-works.log.
# Then rerun download.py, passing the intermediate cursor
# to --cursor instead of *.
python download.py \
  --component=works \
  --batch-size=550 \
  --log=logs/query-works.log \
  --cursor=*

# Export mongodb works collection to JSON
mongoexport \
  --db=crossref \
  --collection=works \
  | xz > data/mongo-export/crossref-works.json.xz
```
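The cursor-and-resume behavior described in the comments above can be sketched as follows. This is a minimal illustration of cursor-based paging against the Crossref `works` endpoint, not the code in `download.py`. The function names `fetch_page` and `iter_batches` are hypothetical, and mapping `--batch-size` to the API's `rows` parameter is an assumption.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.crossref.org/works"


def fetch_page(cursor, rows=550):
    """Fetch one page of works from the Crossref API and return its `message`."""
    # urlencode also escapes the cursor, which may contain reserved characters.
    query = urllib.parse.urlencode({"rows": rows, "cursor": cursor})
    with urllib.request.urlopen(f"{API_URL}?{query}") as response:
        return json.load(response)["message"]


def iter_batches(cursor="*", fetch=fetch_page):
    """Yield (cursor, items) pairs until the API returns an empty page.

    The yielded cursor is the one that produced `items`, so logging it
    gives a value that can be passed back in to resume after a failure.
    """
    while True:
        message = fetch(cursor)
        items = message.get("items", [])
        if not items:
            break
        yield cursor, items
        cursor = message["next-cursor"]
```

Starting from `cursor="*"` walks the collection from the beginning; passing a logged cursor instead resumes from that batch.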
See `data/mongo-export` for more information on `crossref-works.json.xz`. Note that creating this file from the Crossref API takes several weeks. Users are encouraged to use the cached version available on figshare.
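For readers working from the exported file, a minimal sketch of streaming it without decompressing to disk is shown below. It assumes the export contains one JSON document per line, which is the default `mongoexport` output when `--jsonArray` is not passed. The function name `iter_works` is hypothetical.

```python
import json
import lzma


def iter_works(path):
    """Yield one work (as a dict) per line of an xz-compressed JSON export."""
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because it is a generator, `iter_works("data/mongo-export/crossref-works.json.xz")` holds only one record in memory at a time, which matters for a file of this size.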
`1.works-to-dataframe.ipynb` is a Jupyter notebook that extracts tabular datasets of works (TSVs), which are tracked using Git LFS:

- `doi.tsv.xz`: a table where each row is a work, with columns for the DOI, type, and issued date.
- `doi-to-issn.tsv.xz`: a table where each row is a work (DOI) to journal (ISSN) mapping.
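To make the shape of these two tables concrete, here is a hedged sketch of how a single raw work record could be flattened into rows. It is not the notebook's code. It relies on the Crossref record fields `DOI`, `type`, `issued` (with its `date-parts` list), and `ISSN`; the function names and the lowercase output keys (`doi`, `type`, `issued`, `issn`) are hypothetical.

```python
def issued_date(work):
    """Return the issued date as 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD', else None."""
    date_parts = (work.get("issued") or {}).get("date-parts") or [[]]
    parts = []
    for part in date_parts[0]:
        if part is None:  # missing dates can appear as [[None]]
            break
        parts.append(int(part))
    if not parts:
        return None
    year, rest = parts[0], parts[1:]
    return "-".join([f"{year:04d}"] + [f"{p:02d}" for p in rest])


def work_to_row(work):
    """Build one row for a DOI table: DOI, type, and issued date."""
    return {
        "doi": work.get("DOI"),
        "type": work.get("type"),
        "issued": issued_date(work),
    }


def work_to_issn_rows(work):
    """Build one row per (DOI, ISSN) pair; works without ISSNs give no rows."""
    return [{"doi": work.get("DOI"), "issn": issn} for issn in work.get("ISSN") or []]
```

A work with two ISSNs produces one row in the first table and two rows in the second, which is why the DOI-to-ISSN mapping is kept as a separate file.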
types
With mongo running, execute with the following command:
```sh
python download.py \
  --component=types \
  --log=logs/query-types.log
```
Environment
This repository uses conda to manage its environment as specified in `environment.yml`. Install the environment with:
```sh
conda env create --file=environment.yml
```
Then use `source activate crossref` and `source deactivate` to activate or deactivate the environment. On Windows, use `activate crossref` and `deactivate` instead.
Acknowledgements
This work is funded in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4552 to @cgreene.
To restore the repository, download the bundle `greenelab-crossref_-_2017-04-14_18-17-14.bundle` and run:
git clone greenelab-crossref_-_2017-04-14_18-17-14.bundle -b master
Source: https://github.com/greenelab/crossref
Uploader: greenelab
Upload date: 2017-04-14