Add Google Scholar notebook
arunkannawadi committed Sep 17, 2021
1 parent 1d5a512 commit 7ad7167
Showing 4 changed files with 310 additions and 5 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -2,8 +2,9 @@
# Independent citation counter

A (collection of) Jupyter notebook(s) that count independent citations from different bibliographic databases:
- SAO/NASA Astrophysics Data System (ADS) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/ads.ipynb)
- Google Scholar (coming soon)
- Google Scholar (beta version) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/gscholar.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/arunkannawadi/independent-citation-counter/master?filepath=notebooks%2Fgscholar.ipynb)
- SAO/NASA Astrophysics Data System (ADS) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/ads.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/arunkannawadi/independent-citation-counter/master?filepath=notebooks%2Fads.ipynb)


## What are independent citations?

@@ -14,4 +15,4 @@ This is less restrictive than the definition where citations by those who have n

## Why?

Some application packages need to exclude all counts of self-citations to evaluate the impact of one's research.
Some application packages need to exclude all counts of self-citations to evaluate the impact of one's research.
4 changes: 2 additions & 2 deletions docs/index.md
@@ -1,8 +1,8 @@
# Independent citation counter

A (collection of) Jupyter notebook(s) that count independent citations from different bibliographic databases:
- SAO/NASA Astrophysics Data System (ADS) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/ads.ipynb)
- Google Scholar (gscholar; coming soon)
- Google Scholar (beta version) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/gscholar.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/arunkannawadi/independent-citation-counter/master?filepath=notebooks%2Fgscholar.ipynb)
- SAO/NASA Astrophysics Data System (ADS) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/ads.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/arunkannawadi/independent-citation-counter/master?filepath=notebooks%2Fads.ipynb)

## What are independent citations?

298 changes: 298 additions & 0 deletions notebooks/gscholar.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,298 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Independent citation counter\n",
"\n",
"In this notebook, you can calculate the number of independent citations for all of your papers.\n",
"\n",
"### What will the code do?\n",
"For each entry in your Google Scholar profile, the code will output the independent citation count, the total citation count and a link to the page listing the independent citations.\n",
"\n",
"\n",
"**Sample output:**\n",
"\n",
"> The impact of cosmic variance on simulating weak lensing surveys\n",
">\n",
"> Citations: 9/15\n",
">\n",
"> Link: [http://scholar.google.com/scholar?cites=17631820148925503603&scipsc=1&q=-author:%27A%20Kannawadi%27+-author:%27R%20Mandelbaum%27+-author:%27C%20Lackner%27](http://scholar.google.com/scholar?cites=17631820148925503603&scipsc=1&q=-author:%27A%20Kannawadi%27+-author:%27R%20Mandelbaum%27+-author:%27C%20Lackner%27)\n",
"\n",
"The first line is the title of the paper, which has 9 independent citations and 15 total citations.\n",
"The link takes you to the Google Scholar page with the independent citations.\n",
"\n",
"**Note:**\n",
"Even if the program is unable to fetch independent citation counts, it will still output your total citations and provide a link to access your independent citations.\n",
"\n",
"\n",
"### How to use?\n",
"In the cell below, replace `qc6CJjYAAAAJ` with your Google Scholar profile ID.\n",
"You may also want to specify a proxy type (more details below).\n",
"Then, run all cells.\n",
"\n",
"### Troubleshooting\n",
"If you see a `MaxTriesExceededException`, it means Google Scholar caught a whiff of your action.\n",
"Try again later, or use a better proxy.\n",
"\n",
"<br>\n",
"\n",
"### Enter your Google Scholar profile ID\n",
"*unless you are Albert Einstein.*\n",
"\n",
"For example, if your Google Scholar profile URL is [`https://scholar.google.com/citations?user=qc6CJjYAAAAJ`](https://scholar.google.com/citations?user=qc6CJjYAAAAJ), then your profile ID is `qc6CJjYAAAAJ`."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# The only cell which you are expected to modify.\n",
"scholar_id = 'qc6CJjYAAAAJ'\n",
"\n",
"# `proxy_type` must be one of ScraperAPI, Luminati, FreeProxy, SingleProxy or NoProxy.\n",
"# NoProxy will give only the links to independent citations, not the counts.\n",
"proxy_type = 'NoProxy'  # Case insensitive"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"#### More on `proxy_type`\n",
"\n",
"By default, the code provides only the links to the pages containing independent citations, and does not open the pages to count them.\n",
"Google Scholar actively blocks automated requests to its citation database.\n",
"Continuous, repeated requests from a single IP address may lead to a ban.\n",
"However, if you need the counts, you may be able to circumvent this by using a proxy.\n",
"Below are a few options:\n",
"\n",
"- **FreeProxy**: Use continuously changing proxies for free.\n",
"\n",
"  This protects your IP address, but is not very effective at circumventing Google Scholar's anti-bot measures. You might want to use other options if you are unable to reach Google Scholar.\n",
"\n",
"\n",
"- **ScraperAPI** (recommended): [Create a free account](https://www.scraperapi.com/) without providing personal or payment information. The free account supports 5000 requests per month, more than sufficient to run this notebook for most researchers.\n",
"\n",
"- **Luminati** (untested): Similar to ScraperAPI, and is known to circumvent Google Scholar's anti-bot measures better. No free account is available.\n",
"\n",
"- **SingleProxy**: Use a single proxy for all requests.\n",
"\n",
"- **NoProxy** (default): Using `NoProxy` will not fetch the counts by default. You can still try to fetch the counts (at your own risk) by setting `links_only` below to `False`. Use this sparingly if `FreeProxy` does not work and you don't want to create any accounts. You may also use this safely if you are already connected to a VPN.\n",
"\n",
"\n",
"\n",
"\n",
"Read the [official scholarly documentation](https://scholarly.readthedocs.io/en/latest/quickstart.html#using-proxies) for more details."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"links_only = (proxy_type.lower() == 'noproxy')"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"### Install and import the required packages."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"! pip install -q scholarly"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"try:\n",
"    from scholarly import scholarly, ProxyGenerator  #, MaxTriesExceededException\n",
"except IndexError:\n",
"    \"\"\" Ignore the harmless IndexError occurring from a dependency\"\"\"\n",
"    pass\n",
"import time, random\n",
"from getpass import getpass\n",
"try:\n",
"    from urllib import quote  # type: ignore ; Python 2\n",
"except ImportError:\n",
"    from urllib.parse import quote  # type: ignore ; Python 3"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"def set_proxy(proxy_type='NoProxy'):\n",
"    \"\"\"Set a proxy for scraping Google Scholar.\n",
"\n",
"    Only `NoProxy`, `FreeProxy` and `ScraperAPI` have been tested.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    proxy_type : str, optional\n",
"        Type of proxy to use. Case insensitive. Options are:\n",
"        `ScraperAPI`, `Luminati`, `FreeProxy`, `SingleProxy` and\n",
"        `NoProxy` (default).\n",
"    \"\"\"\n",
"    if proxy_type.lower() == 'noproxy':\n",
"        print(\"Using no proxies!\")\n",
"        return\n",
"\n",
"    pg = ProxyGenerator()\n",
"    if proxy_type.lower() == 'scraperapi':\n",
"        payload = {'api_key': getpass(\"Enter your ScraperAPI key:\"), }\n",
"        pg.ScraperAPI(payload['api_key'])\n",
"        print(\"Using ScraperAPI!\")\n",
"    elif proxy_type.lower() == 'luminati':\n",
"        pg.Luminati(getpass(\"Enter your Luminati username:\"), getpass(\"Enter your Luminati password:\"))\n",
"        print(\"Using Luminati!\")\n",
"    elif proxy_type.lower() == 'singleproxy':\n",
"        proxy_address = getpass(\"Enter your proxy address:\")\n",
"        pg.SingleProxy(proxy_address, proxy_address)\n",
"        print(f\"Using SingleProxy: {proxy_address}\")\n",
"    else:\n",
"        pg.FreeProxies()\n",
"        print(\"Using FreeProxy!\")\n",
"\n",
"    scholarly.use_proxy(pg)\n",
"\n",
"def standardize_names(name):\n",
"    \"\"\"Reduce a full name to a URL-quoted `'F Lastname'` string.\"\"\"\n",
"    if \" \" not in name:\n",
"        return name\n",
"    try:\n",
"        parts = name.split(' ')\n",
"        firstname, lastname = parts[0], parts[-1]\n",
"        initial = firstname[0]\n",
"        return quote(f\"'{initial} {lastname}'\")\n",
"    except Exception:\n",
"        # This usually happens for collaboration papers\n",
"        print(f\"Cannot split '{name}' into initial and last names!\")\n",
"        return quote(f\"{name}\")\n",
"\n",
"\n",
"def fill_independent_citations(publication, links_only=True):\n",
"    \"\"\"Add `independent_url` and, unless `links_only`, `num_independent_citations` to `publication`.\"\"\"\n",
"    if publication[\"source\"].name != \"AUTHOR_PUBLICATION_ENTRY\":\n",
"        raise TypeError(\"Input source must be from a Google Scholar profile page\")\n",
"\n",
"    if not publication[\"filled\"]:  # TODO: Don't fill once the patch comes through\n",
"        scholarly.fill(publication)\n",
"\n",
"    citedby_url = publication.get(\"citedby_url\", None)\n",
"    if citedby_url is None:\n",
"        # If there are no citations, then there is nothing to do\n",
"        publication[\"num_independent_citations\"] = 0\n",
"        return None\n",
"\n",
"    # Exclude every co-author of the paper from the list of citing papers\n",
"    author_names = publication[\"bib\"][\"author\"].split(\" and \")\n",
"    independent_query = \"+\".join([f\"-author:{standardize_names(name)}\" for name in author_names])\n",
"    independent_url = citedby_url+\"&hl=en&scipsc=1&q=\"+independent_query\n",
"    publication[\"independent_url\"] = independent_url\n",
"\n",
"    if links_only:\n",
"        return None\n",
"\n",
"    try:\n",
"        search_results = scholarly.search_pubs_custom_url(independent_url)\n",
"        num_independent_citations = search_results.total_results if search_results.total_results else 0\n",
"    except Exception:\n",
"        # Negative sentinel: the count could not be fetched\n",
"        num_independent_citations = -99\n",
"\n",
"    publication[\"num_independent_citations\"] = num_independent_citations"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"set_proxy(proxy_type)\n",
"scholar = scholarly.search_author_id(scholar_id, filled=True)\n",
"scholar_name = scholar[\"name\"]\n",
"print(f\"Hello {scholar_name} !\")"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"if links_only:\n",
"    print(\"Fetching the independent citation counts has been turned off for your own good\"\n",
"          \" because you are not using a proxy.\"\n",
"          \" You can turn it back on at your own risk by explicitly setting `links_only` to `False`.\"\n",
"          )\n",
"else:\n",
"    print(\"You are fetching the counts in addition to the links. The code will run slowly on purpose.\")\n",
"\n",
"independent_citation_counts = []\n",
"for paper in scholar[\"publications\"]:\n",
"    if not links_only:\n",
"        # Sleep for some random time to mimic human behavior\n",
"        time.sleep(random.uniform(2, 5))\n",
"\n",
"    try:\n",
"        if paper.get(\"num_independent_citations\", -1) < 0:\n",
"            fill_independent_citations(paper, links_only=links_only)\n",
"        independent_citation_counts.append(paper.get(\"num_independent_citations\", 0))\n",
"    except Exception as err:\n",
"        print(\"Google Scholar is aggressively blocking us! Quitting for now.\")\n",
"        print(err)\n",
"    finally:\n",
"        print(\"\\n ------\\n\")\n",
"        print(paper[\"bib\"][\"title\"])\n",
"        print(f\"Citations: {paper.get('num_independent_citations', 'NA')}/{paper.get('num_citations')}\")\n",
"        independent_url = paper.get(\"independent_url\", None)\n",
"        if independent_url:\n",
"            print(\"Link: \", \"http://scholar.google.com\"+independent_url)\n",
"\n",
"print(\"\\n --- End of list ---\")\n",
"\n",
"if not links_only:\n",
"    # Skip the negative sentinel (-99) recorded when a count could not be fetched\n",
"    print(\"Total number of independent citations = \", sum(count for count in independent_citation_counts if count >= 0))\n"
],
"outputs": [],
"metadata": {}
}
],
"metadata": {
"orig_nbformat": 4,
"language_info": {
"name": "python",
"version": "3.8.5",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.5 64-bit ('citationCounts-5v719axv': pipenv)"
},
"interpreter": {
"hash": "b4453f8bf7bd99dd4af6a0448cb71b82e65d6462eebc2bf264831bb32049f637"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
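The query construction inside `fill_independent_citations` can be tried in isolation, without contacting Google Scholar. The sketch below is not part of the commit: the helper names `author_exclusion_term` and `build_independent_url` are hypothetical, and it assumes that `citedby_url` is a path of the form `/scholar?cites=<id>`, as the sample output in the notebook suggests. Unlike `standardize_names`, it prefixes `-author:` itself, splits on any whitespace, and URL-quotes single-token names as well.

```python
from urllib.parse import quote


def author_exclusion_term(name):
    """Return a URL-encoded `-author:'F Lastname'` term for one author name."""
    parts = name.split()
    if len(parts) < 2:
        # Single-token names (e.g. collaborations) are quoted as they are.
        return "-author:" + quote(name)
    initial, lastname = parts[0][0], parts[-1]
    return "-author:" + quote(f"'{initial} {lastname}'")


def build_independent_url(citedby_url, authors):
    """Append one exclusion term per author to a 'cited by' URL path."""
    query = "+".join(author_exclusion_term(name) for name in authors)
    return f"{citedby_url}&hl=en&scipsc=1&q={query}"


# Inputs taken from the sample output in the notebook's first cell.
url = build_independent_url(
    "/scholar?cites=17631820148925503603",
    ["A Kannawadi", "R Mandelbaum", "C Lackner"],
)
print("http://scholar.google.com" + url)
```

Building the link needs no network access, which is why the notebook can offer it even under `NoProxy`; only the final count requires a request to Google Scholar.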
6 changes: 6 additions & 0 deletions requirements.txt
@@ -0,0 +1,6 @@
# fuzzywuzzy
# pyyaml
scholarly
# unidecode
# xmltodict
# and a public Google Scholar profile!
