-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
1d5a512
commit 7ad7167
Showing
4 changed files
with
310 additions
and
5 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,298 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# Independent citation counter\n", | ||
"\n", | ||
"In this notebook, you can calculate the number of independent citations for all of your papers.\n", | ||
"\n", | ||
"### What the code will do?\n", | ||
"For each entry in your Google Scholar profile, the code will output your independent citation count, total citation count and a link to access your independent citation counts.\n", | ||
"\n", | ||
"\n", | ||
"**Sample output:**\n", | ||
"\n", | ||
"> The impact of cosmic variance on simulating weak lensing surveys\n", | ||
">\n", | ||
"> Citations: 9/15\n", | ||
">\n", | ||
"> Link: [http://scholar.google.com/scholar?cites=17631820148925503603&scipsc=1&q=-author:%27A%20Kannawadi%27+-author:%27R%20Mandelbaum%27+-author:%27C%20Lackner%27](http://scholar.google.com/scholar?cites=17631820148925503603&scipsc=1&q=-author:%27A%20Kannawadi%27+-author:%27R%20Mandelbaum%27+-author:%27C%20Lackner%27)\n", | ||
"\n", | ||
"The first line is the title of the paper, which has 9 independent citations and 15 total citations.\n", | ||
"The link takes you to the Google Scholar page with the independent citations.\n", | ||
"\n", | ||
"**Note:**\n", | ||
"Even if the program is unable to fetch independent citation counts, it will still output your total citations and provide a link to access your independent citations.\n", | ||
"\n", | ||
"\n", | ||
"### How to use?\n", | ||
"In the cell below, replace `qc6CJjYAAAAJ` with your Google Scholar profile ID.\n", | ||
"You may also want to specify a proxy type (more details below).\n", | ||
"Then, run all cells.\n", | ||
"\n", | ||
"### Troubleshooting\n", | ||
"If you see a `MaxTriesExceededException`, it means Google Scholar caught a whiff of your action.\n", | ||
"Try again later, or use a better proxy.\n", | ||
"\n", | ||
"<br>\n", | ||
"\n", | ||
"### Enter your Google Scholar profile ID\n", | ||
"*unless you are Albert Einstein.*\n", | ||
"\n", | ||
"For example, if your Google Scholar profile URL is [`https://scholar.google.com/citations?user=qc6CJjYAAAAJ`](https://scholar.google.com/citations?user=qc6CJjYAAAAJ), then your profile ID is `qc6CJjYAAAAJ`." | ||
], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"source": [ | ||
"# The only cell which you are expected to modify.\n", | ||
"scholar_id = 'qc6CJjYAAAAJ'\n", | ||
"\n", | ||
"# `proxy_type` must be one of ScraperAPI, Luminati, FreeProxy, SingleProxy or NoProxy.\n", | ||
"# NoProxy will give only the links to independent, not the counts.\n", | ||
"proxy_type = 'NoProxy' # Case insensitive" | ||
], | ||
"outputs": [], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"#### More on `proxy_type`\n", | ||
"\n", | ||
"By default, the code provides only the links to page containing independent citations, and does not open the page to count them.\n", | ||
"Google Scholar actively blocks automated requests to its citation database.\n", | ||
"Continuous, repeated requests from a single IP address may lead to a ban.\n", | ||
"However, if you need the counts, you may be able to circumvent this by using a proxy.\n", | ||
"Below are a few options:\n", | ||
"\n", | ||
"- **FreeProxy**: Use continuously changing proxies for free.\n", | ||
"\n", | ||
" This protects your IP address, but is not very effective at circumventing Google Scholar's anti-bot prevention. You might want to use other options if you are unable to reach Google Scholar.\n", | ||
"\n", | ||
"\n", | ||
"- **ScraperAPI** (recommended): [Create a free account](https://www.scraperapi.com/) without providing personal and payment information. Free account supports 5000 requests per month, more that sufficient to run this notebook for most researchers.\n", | ||
"\n", | ||
"- **Luminati** (untested): Similar to ScraperAPI, and is known to circumvent Google Scholar's anti-bot prevention better. No free account is available.\n", | ||
"\n", | ||
"- **SingleProxy**: Use a single proxy for all requests.\n", | ||
"\n", | ||
"- **NoProxy** (default): Using `NoProxy` will not fetch the counts by default. You can still try to fetch the counts (at your own risk) by setting `links_only` below to `False`. Use this sparingly if `FreeProxy` does not work and you don't want to create any accounts. You may also use this safely if you are already connected to a VPN.\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"Read the [official scholarly documentation](https://scholarly.readthedocs.io/en/latest/quickstart.html#using-proxies) for more details." | ||
], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"source": [ | ||
"links_only = (proxy_type.lower() == 'noproxy')" | ||
], | ||
"outputs": [], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"### Install and import the required packages." | ||
], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"source": [ | ||
"! pip install -q scholarly" | ||
], | ||
"outputs": [], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"source": [ | ||
"try:\n", | ||
" from scholarly import scholarly, ProxyGenerator #, MaxTriesExceededException\n", | ||
"except IndexError:\n", | ||
" \"\"\" Ignore the harmless IndexError occuring from a dependency\"\"\"\n", | ||
" pass\n", | ||
"import time, random\n", | ||
"from getpass import getpass\n", | ||
"try:\n", | ||
" from urllib import quote # type: ignore ; Python 2\n", | ||
"except ImportError:\n", | ||
" from urllib.parse import quote # type: ignore ; Python 3" | ||
], | ||
"outputs": [], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"source": [ | ||
"def set_proxy(proxy_type='NoProxy'):\n", | ||
" \"\"\"Set a proxy for to scrape Google Scholar.\n", | ||
"\n", | ||
" Only `NoProxy`, `FreeProxy` and `ScraperAPI` have been tested.\n", | ||
"\n", | ||
" Parameters\n", | ||
" ----------\n", | ||
" proxy_type : str, optional\n", | ||
" Type of proxy to use. Case insensitive. Options are:\n", | ||
" `ScraperAPI`, `Luminati`, `FreeProxy`, `SingleProxy` and\n", | ||
" `NoProxy` (default).\n", | ||
" \"\"\"\n", | ||
" if proxy_type.lower() == 'noproxy':\n", | ||
" print(\"Using no proxies!\")\n", | ||
" return\n", | ||
"\n", | ||
" pg = ProxyGenerator()\n", | ||
" if proxy_type.lower() == 'scraperapi':\n", | ||
" payload = {'api_key': getpass(\"Enter your ScraperAPI key:\"), }\n", | ||
" pg.ScraperAPI(payload['api_key'])\n", | ||
" print(\"Using ScraperAPI!\")\n", | ||
" elif proxy_type.lower() == 'luminati':\n", | ||
" pg.Luminati(getpass(\"Enter your Luminati username:\"), getpass(\"Enter your Luminati password:\"))\n", | ||
" print(\"Using Luminati!\")\n", | ||
" elif proxy_type.lower() == 'singleproxy':\n", | ||
" proxy_address = getpass(\"Enter your proxy address:\")\n", | ||
" pg.SingleProxy(proxy_address, proxy_address)\n", | ||
" print(f\"Using SingleProxy: {proxy_address}\")\n", | ||
" else:\n", | ||
" pg.FreeProxies()\n", | ||
" print(\"Using FreeProxy!\")\n", | ||
"\n", | ||
" scholarly.use_proxy(pg)\n", | ||
"\n", | ||
"def standardize_names(name):\n", | ||
" if not \" \" in name:\n", | ||
" return name\n", | ||
" try:\n", | ||
" parts = name.split(' ')\n", | ||
" firstname, lastname = parts[0], parts[-1]\n", | ||
" initial = firstname[0]\n", | ||
" return quote(f\"'{initial} {lastname}'\")\n", | ||
" except:\n", | ||
" # This usually happens for collaboration papers\n", | ||
" print(f\"Cannot split '{name}' into initial and last names!\")\n", | ||
" return quote(f\"{name}\")\n", | ||
"\n", | ||
"\n", | ||
"def fill_independent_citations(publication, links_only=True):\n", | ||
" if not publication[\"source\"].name == \"AUTHOR_PUBLICATION_ENTRY\":\n", | ||
" raise TypeError(\"Input source must be from a Google Scholar profile page\")\n", | ||
"\n", | ||
" if not publication[\"filled\"]: # TODO: Don't fill once the patch comes through\n", | ||
" scholarly.fill(publication)\n", | ||
"\n", | ||
" citedby_url = publication.get(\"citedby_url\", None)\n", | ||
" if citedby_url is None:\n", | ||
" # If there are no citations, then there is nothing to do\n", | ||
" publication[\"num_independent_citations\"] = 0\n", | ||
" return None\n", | ||
"\n", | ||
" author_names = publication[\"bib\"][\"author\"].split(\" and \")\n", | ||
" independent_query = \"+\".join([f\"-author:{standardize_names(name)}\" for name in author_names])\n", | ||
" independent_url = citedby_url+\"&hl=en&scipsc=1&q=\"+independent_query\n", | ||
" publication[\"independent_url\"] = independent_url\n", | ||
"\n", | ||
" if links_only:\n", | ||
" return None\n", | ||
"\n", | ||
" try:\n", | ||
" search_results = scholarly.search_pubs_custom_url(independent_url)\n", | ||
" num_independent_citations = search_results.total_results if search_results.total_results else 0\n", | ||
" except Exception as err:\n", | ||
" num_independent_citations = -99\n", | ||
"\n", | ||
" publication[\"num_independent_citations\"] = num_independent_citations" | ||
], | ||
"outputs": [], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"source": [ | ||
"set_proxy(proxy_type)\n", | ||
"scholar = scholarly.search_author_id(scholar_id, filled=True)\n", | ||
"scholar_name = scholar[\"name\"]\n", | ||
"print(f\"Hello {scholar_name} !\")" | ||
], | ||
"outputs": [], | ||
"metadata": {} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"source": [ | ||
"if links_only:\n", | ||
" print(\"Fetching the independent citation counts has been turned off for your own good\"\n", | ||
" \" because you are not using a proxy.\"\n", | ||
" \" You can turn it back on at your own risk by explicitly setting `links_only` to `False`.\"\n", | ||
" )\n", | ||
"else:\n", | ||
" print(\"You are fetching the counts in addition to the links. The code will run slow intentionally.\")\n", | ||
"\n", | ||
"independent_citation_counts = []\n", | ||
"for paper in scholar[\"publications\"]:\n", | ||
" if not links_only:\n", | ||
" # Sleep for some random time to mimic human behavior\n", | ||
" time.sleep(random.uniform(2, 5))\n", | ||
"\n", | ||
" try:\n", | ||
" if paper.get(\"num_independent_citations\", -1) < 0:\n", | ||
" fill_independent_citations(paper, links_only=links_only)\n", | ||
" independent_citation_counts.append(paper.get(\"num_independent_citations\", 0))\n", | ||
" except Exception as err:\n", | ||
" print(\"Google Scholar is aggressively blocking us! Quitting for now.\")\n", | ||
" print(err)\n", | ||
" finally:\n", | ||
" print(\"\\n ------\\n\")\n", | ||
" print(paper[\"bib\"][\"title\"])\n", | ||
" print(f\"Citations: {paper.get('num_independent_citations', 'NA')}/{paper.get('num_citations')}\")\n", | ||
" independent_url = paper.get(\"independent_url\", None)\n", | ||
" if independent_url:\n", | ||
" print(\"Link: \", \"http://scholar.google.com\"+independent_url)\n", | ||
"\n", | ||
"print(\"\\n --- End of list ---\")\n", | ||
"\n", | ||
"if not links_only:\n", | ||
" print(\"Total number of independent citations = \", sum(independent_citation_counts))\n" | ||
], | ||
"outputs": [], | ||
"metadata": {} | ||
} | ||
], | ||
"metadata": { | ||
"orig_nbformat": 4, | ||
"language_info": { | ||
"name": "python", | ||
"version": "3.8.5", | ||
"mimetype": "text/x-python", | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"pygments_lexer": "ipython3", | ||
"nbconvert_exporter": "python", | ||
"file_extension": ".py" | ||
}, | ||
"kernelspec": { | ||
"name": "python3", | ||
"display_name": "Python 3.8.5 64-bit ('citationCounts-5v719axv': pipenv)" | ||
}, | ||
"interpreter": { | ||
"hash": "b4453f8bf7bd99dd4af6a0448cb71b82e65d6462eebc2bf264831bb32049f637" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
# fuzzywizzy | ||
# pyyaml | ||
scholarly | ||
# unidecode | ||
# xmltodict | ||
# and a public Google Scholar profile! |