Add Google Scholar notebook

arunkannawadi · Sep 17, 2021 · 7ad7167 · 7ad7167
1 parent 1d5a512
commit 7ad7167
Show file tree

Hide file tree

Showing 4 changed files with 310 additions and 5 deletions.
diff --git a/README.md b/README.md
@@ -2,8 +2,9 @@
 # Independent citation counter
 
 A (collection of) Jupyter notebook(s) that count independent citations from different bibliographic databases:
-- SAO/NASA Astrophysics Data System (ADS)  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/ads.ipynb)
-- Google Scholar (coming soon)
+- Google Scholar (beta version) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/gscholar.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/arunkannawadi/independent-citation-counter/master?filepath=notebooks%2Fgscholar.ipynb)
+- SAO/NASA Astrophysics Data System (ADS)  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/ads.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/arunkannawadi/independent-citation-counter/master?filepath=notebooks%2Fads.ipynb)
+
 
 ## What are independent citations?
 
@@ -14,4 +15,4 @@ This is less restrictive than the definition where citations by those who have n
 
 ## Why?
 
-Some application packages need to exclude all counts of self-citations to evaluate the impact of one's research.
+Some application packages need to exclude all counts of self-citations to evaluate the impact of one's research.
diff --git a/docs/index.md b/docs/index.md
@@ -1,8 +1,8 @@
 # Independent citation counter
 
 A (collection of) Jupyter notebook(s) that count independent citations from different bibliographic databases:
-- SAO/NASA Astrophysics Data System (ADS) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/ads.ipynb)
-- Google Scholar (gscholar; coming soon)
+- Google Scholar (beta version) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/gscholar.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/arunkannawadi/independent-citation-counter/master?filepath=notebooks%2Fgscholar.ipynb)
+- SAO/NASA Astrophysics Data System (ADS)  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arunkannawadi/independent-citation-counter/blob/master/notebooks/ads.ipynb) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/arunkannawadi/independent-citation-counter/master?filepath=notebooks%2Fads.ipynb)
 
 ## What are independent citations?
 

diff --git a/notebooks/gscholar.ipynb b/notebooks/gscholar.ipynb
@@ -0,0 +1,298 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "source": [
+    "# Independent citation counter\n",
+    "\n",
+    "In this notebook, you can calculate the number of independent citations for all of your papers.\n",
+    "\n",
+    "### What the code will do?\n",
+    "For each entry in your Google Scholar profile, the code will output your independent citation count, total citation count and a link to access your independent citation counts.\n",
+    "\n",
+    "\n",
+    "**Sample output:**\n",
+    "\n",
+    "> The impact of cosmic variance on simulating weak lensing surveys\n",
+    ">\n",
+    "> Citations: 9/15\n",
+    ">\n",
+    "> Link:  [http://scholar.google.com/scholar?cites=17631820148925503603&scipsc=1&q=-author:%27A%20Kannawadi%27+-author:%27R%20Mandelbaum%27+-author:%27C%20Lackner%27](http://scholar.google.com/scholar?cites=17631820148925503603&scipsc=1&q=-author:%27A%20Kannawadi%27+-author:%27R%20Mandelbaum%27+-author:%27C%20Lackner%27)\n",
+    "\n",
+    "The first line is the title of the paper, which has 9 independent citations and 15 total citations.\n",
+    "The link takes you to the Google Scholar page with the independent citations.\n",
+    "\n",
+    "**Note:**\n",
+    "Even if the program is unable to fetch independent citation counts, it will still output your total citations and provide a link to access your independent citations.\n",
+    "\n",
+    "\n",
+    "### How to use?\n",
+    "In the cell below, replace `qc6CJjYAAAAJ` with your Google Scholar profile ID.\n",
+    "You may also want to specify a proxy type (more details below).\n",
+    "Then, run all cells.\n",
+    "\n",
+    "### Troubleshooting\n",
+    "If you see a `MaxTriesExceededException`, it means Google Scholar caught a whiff of your action.\n",
+    "Try again later, or use a better proxy.\n",
+    "\n",
+    "<br>\n",
+    "\n",
+    "### Enter your Google Scholar profile ID\n",
+    "*unless you are Albert Einstein.*\n",
+    "\n",
+    "For example, if your Google Scholar profile URL is [`https://scholar.google.com/citations?user=qc6CJjYAAAAJ`](https://scholar.google.com/citations?user=qc6CJjYAAAAJ), then your profile ID is `qc6CJjYAAAAJ`."
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "# The only cell which you are expected to modify.\n",
+    "scholar_id = 'qc6CJjYAAAAJ'\n",
+    "\n",
+    "# `proxy_type` must be one of ScraperAPI, Luminati, FreeProxy, SingleProxy or NoProxy.\n",
+    "# NoProxy will give only the links to independent, not the counts.\n",
+    "proxy_type = 'NoProxy'  # Case insensitive"
+   ],
+   "outputs": [],
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "#### More on `proxy_type`\n",
+    "\n",
+    "By default, the code provides only the links to page containing independent citations, and does not open the page to count them.\n",
+    "Google Scholar actively blocks automated requests to its citation database.\n",
+    "Continuous, repeated requests from a single IP address may lead to a ban.\n",
+    "However, if you need the counts, you may be able to circumvent this by using a proxy.\n",
+    "Below are a few options:\n",
+    "\n",
+    "- **FreeProxy**: Use continuously changing proxies for free.\n",
+    "\n",
+    "    This protects your IP address, but is not very effective at circumventing Google Scholar's anti-bot prevention. You might want to use other options if you are unable to reach Google Scholar.\n",
+    "\n",
+    "\n",
+    "- **ScraperAPI** (recommended): [Create a free account](https://www.scraperapi.com/) without providing personal and payment information. Free account supports 5000 requests per month, more that sufficient to run this notebook for most researchers.\n",
+    "\n",
+    "- **Luminati** (untested): Similar to ScraperAPI, and is known to circumvent Google Scholar's anti-bot prevention better. No free account is available.\n",
+    "\n",
+    "- **SingleProxy**: Use a single proxy for all requests.\n",
+    "\n",
+    "- **NoProxy** (default): Using `NoProxy` will not fetch the counts by default. You can still try to fetch the counts (at your own risk) by setting `links_only` below to `False`. Use this sparingly if `FreeProxy` does not work and you don't want to create any accounts. You may also use this safely if you are already connected to a VPN.\n",
+    "\n",
+    "\n",
+    "\n",
+    "\n",
+    "Read the [official scholarly documentation](https://scholarly.readthedocs.io/en/latest/quickstart.html#using-proxies) for more details."
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "links_only = (proxy_type.lower() == 'noproxy')"
+   ],
+   "outputs": [],
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "### Install and import the required packages."
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "! pip install -q scholarly"
+   ],
+   "outputs": [],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "try:\n",
+    "    from scholarly import scholarly, ProxyGenerator #, MaxTriesExceededException\n",
+    "except IndexError:\n",
+    "    \"\"\" Ignore the harmless IndexError occuring from a dependency\"\"\"\n",
+    "    pass\n",
+    "import time, random\n",
+    "from getpass import getpass\n",
+    "try:\n",
+    "    from urllib import quote  # type: ignore ; Python 2\n",
+    "except ImportError:\n",
+    "    from urllib.parse import quote  # type: ignore ; Python 3"
+   ],
+   "outputs": [],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "def set_proxy(proxy_type='NoProxy'):\n",
+    "    \"\"\"Set a proxy for to scrape Google Scholar.\n",
+    "\n",
+    "    Only `NoProxy`, `FreeProxy` and `ScraperAPI` have been tested.\n",
+    "\n",
+    "    Parameters\n",
+    "    ----------\n",
+    "    proxy_type : str, optional\n",
+    "        Type of proxy to use. Case insensitive. Options are:\n",
+    "        `ScraperAPI`, `Luminati`, `FreeProxy`, `SingleProxy` and\n",
+    "        `NoProxy` (default).\n",
+    "    \"\"\"\n",
+    "    if proxy_type.lower() == 'noproxy':\n",
+    "        print(\"Using no proxies!\")\n",
+    "        return\n",
+    "\n",
+    "    pg = ProxyGenerator()\n",
+    "    if proxy_type.lower() == 'scraperapi':\n",
+    "        payload = {'api_key': getpass(\"Enter your ScraperAPI key:\"), }\n",
+    "        pg.ScraperAPI(payload['api_key'])\n",
+    "        print(\"Using ScraperAPI!\")\n",
+    "    elif proxy_type.lower() == 'luminati':\n",
+    "        pg.Luminati(getpass(\"Enter your Luminati username:\"), getpass(\"Enter your Luminati password:\"))\n",
+    "        print(\"Using Luminati!\")\n",
+    "    elif proxy_type.lower() == 'singleproxy':\n",
+    "        proxy_address = getpass(\"Enter your proxy address:\")\n",
+    "        pg.SingleProxy(proxy_address, proxy_address)\n",
+    "        print(f\"Using SingleProxy: {proxy_address}\")\n",
+    "    else:\n",
+    "        pg.FreeProxies()\n",
+    "        print(\"Using FreeProxy!\")\n",
+    "\n",
+    "    scholarly.use_proxy(pg)\n",
+    "\n",
+    "def standardize_names(name):\n",
+    "    if not \" \" in name:\n",
+    "        return name\n",
+    "    try:\n",
+    "        parts = name.split(' ')\n",
+    "        firstname, lastname = parts[0], parts[-1]\n",
+    "        initial = firstname[0]\n",
+    "        return quote(f\"'{initial} {lastname}'\")\n",
+    "    except:\n",
+    "        # This usually happens for collaboration papers\n",
+    "        print(f\"Cannot split '{name}' into initial and last names!\")\n",
+    "        return quote(f\"{name}\")\n",
+    "\n",
+    "\n",
+    "def fill_independent_citations(publication, links_only=True):\n",
+    "    if not publication[\"source\"].name == \"AUTHOR_PUBLICATION_ENTRY\":\n",
+    "        raise TypeError(\"Input source must be from a Google Scholar profile page\")\n",
+    "\n",
+    "    if not publication[\"filled\"]:  # TODO: Don't fill once the patch comes through\n",
+    "        scholarly.fill(publication)\n",
+    "\n",
+    "    citedby_url = publication.get(\"citedby_url\", None)\n",
+    "    if citedby_url is None:\n",
+    "        # If there are no citations, then there is nothing to do\n",
+    "        publication[\"num_independent_citations\"] = 0\n",
+    "        return None\n",
+    "\n",
+    "    author_names = publication[\"bib\"][\"author\"].split(\" and \")\n",
+    "    independent_query = \"+\".join([f\"-author:{standardize_names(name)}\" for name in author_names])\n",
+    "    independent_url = citedby_url+\"&hl=en&scipsc=1&q=\"+independent_query\n",
+    "    publication[\"independent_url\"] = independent_url\n",
+    "\n",
+    "    if links_only:\n",
+    "        return None\n",
+    "\n",
+    "    try:\n",
+    "        search_results = scholarly.search_pubs_custom_url(independent_url)\n",
+    "        num_independent_citations = search_results.total_results if search_results.total_results else 0\n",
+    "    except Exception as err:\n",
+    "        num_independent_citations = -99\n",
+    "\n",
+    "    publication[\"num_independent_citations\"] = num_independent_citations"
+   ],
+   "outputs": [],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "set_proxy(proxy_type)\n",
+    "scholar = scholarly.search_author_id(scholar_id, filled=True)\n",
+    "scholar_name = scholar[\"name\"]\n",
+    "print(f\"Hello {scholar_name} !\")"
+   ],
+   "outputs": [],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "if links_only:\n",
+    "    print(\"Fetching the independent citation counts has been turned off for your own good\"\n",
+    "          \" because you are not using a proxy.\"\n",
+    "          \" You can turn it back on at your own risk by explicitly setting `links_only` to `False`.\"\n",
+    "          )\n",
+    "else:\n",
+    "    print(\"You are fetching the counts in addition to the links. The code will run slow intentionally.\")\n",
+    "\n",
+    "independent_citation_counts = []\n",
+    "for paper in scholar[\"publications\"]:\n",
+    "    if not links_only:\n",
+    "        # Sleep for some random time to mimic human behavior\n",
+    "        time.sleep(random.uniform(2, 5))\n",
+    "\n",
+    "    try:\n",
+    "        if paper.get(\"num_independent_citations\", -1) < 0:\n",
+    "            fill_independent_citations(paper, links_only=links_only)\n",
+    "            independent_citation_counts.append(paper.get(\"num_independent_citations\", 0))\n",
+    "    except Exception as err:\n",
+    "        print(\"Google Scholar is aggressively blocking us! Quitting for now.\")\n",
+    "        print(err)\n",
+    "    finally:\n",
+    "        print(\"\\n ------\\n\")\n",
+    "        print(paper[\"bib\"][\"title\"])\n",
+    "        print(f\"Citations: {paper.get('num_independent_citations', 'NA')}/{paper.get('num_citations')}\")\n",
+    "        independent_url = paper.get(\"independent_url\", None)\n",
+    "        if independent_url:\n",
+    "            print(\"Link: \", \"http://scholar.google.com\"+independent_url)\n",
+    "\n",
+    "print(\"\\n --- End of list ---\")\n",
+    "\n",
+    "if not links_only:\n",
+    "    print(\"Total number of independent citations = \", sum(independent_citation_counts))\n"
+   ],
+   "outputs": [],
+   "metadata": {}
+  }
+ ],
+ "metadata": {
+  "orig_nbformat": 4,
+  "language_info": {
+   "name": "python",
+   "version": "3.8.5",
+   "mimetype": "text/x-python",
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "pygments_lexer": "ipython3",
+   "nbconvert_exporter": "python",
+   "file_extension": ".py"
+  },
+  "kernelspec": {
+   "name": "python3",
+   "display_name": "Python 3.8.5 64-bit ('citationCounts-5v719axv': pipenv)"
+  },
+  "interpreter": {
+   "hash": "b4453f8bf7bd99dd4af6a0448cb71b82e65d6462eebc2bf264831bb32049f637"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,6 @@
+# fuzzywizzy
+# pyyaml
+scholarly
+# unidecode
+# xmltodict
+# and a public Google Scholar profile!