OpenAlex API support #135

Merged
merged 5 commits into from
Jan 14, 2025

corochann
Contributor

As discussed in issues #84 and #133, Semantic Scholar returns error code 429 and the program hangs completely if you do not have a Semantic Scholar API key (I could not get a single request to succeed even after waiting more than an hour). Getting a proper API key also takes a long time, even if you request one through the Semantic Scholar form.

This PR adds OpenAlex API support as an alternative literature search backend. The pyalex library is used to keep the implementation simple.
Since OpenAlex does not require an API key, this PR lets more people run the AI Scientist code by setting --engine openalex.

Please note that I have not compared the search results between Semantic Scholar and OpenAlex (since I have no Semantic Scholar API key), and citations may not work as expected. I have only confirmed that the command below runs to the end and produces a PDF paper.
OPENALEX_MAIL_ADDRESS="MY EMAIL ADDRESS" python launch_scientist.py --model "gpt-4o-2024-05-13" --experiment nanoGPT_lite --num-ideas 2 --engine openalex

Additionally, this PR makes the --skip_novelty_check option work. The option was already defined in argparse but currently has no effect.
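
For readers who want to see how the two flags mentioned above could fit together, here is a minimal sketch. It is not the code from this PR: the choice names for --engine, the default value, and the helper should_check_novelty are assumptions made for illustration, and only the flag spellings come from the description above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical subset of the launcher CLI: only the two flags discussed here."""
    parser = argparse.ArgumentParser(description="Sketch of the search-related flags")
    parser.add_argument(
        "--engine",
        choices=["semanticscholar", "openalex"],  # assumed choice names
        default="semanticscholar",  # assumed default
        help="Literature search backend.",
    )
    parser.add_argument(
        "--skip_novelty_check",
        action="store_true",
        help="Skip the literature-based novelty check.",
    )
    return parser


def should_check_novelty(args: argparse.Namespace) -> bool:
    # A flag defined in argparse does nothing unless the caller reads it.
    return not args.skip_novelty_check


args = build_parser().parse_args(["--engine", "openalex", "--skip_novelty_check"])
if should_check_novelty(args):
    print(f"Would check novelty with engine={args.engine}")
else:
    print("Novelty check skipped")
```

The point of the helper is only that the parsed flag has to be consulted somewhere before the novelty check is called; where exactly that happens in the launcher is not shown here.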

@RaInSLc

RaInSLc commented Oct 9, 2024

I tried merging your PR locally, but after generating ideas the result still looked like this. Are these results correct, or is there something I can fix?

(ai_scientist) (.venv) PS C:\Users\Administrator\PycharmProjects\AI-Scientist> python launch_scientist.py --model "gpt-4o-2024-05-13" --experiment nanoGPT_lite --num-ideas 2 --engine openalex
Using GPUs: []
Using OpenAI API with model gpt-4o-2024-05-13.

Generating idea 1/2
Iteration 1/3
Failed to generate idea: Connection error.

Generating idea 2/2
Iteration 1/3
{'Name': 'adaptive_dropout', 'Title': 'Adaptive Dropout: Dynamic Regularization for Enhanced Model Training', 'Experiment': 'Modify the model to allow dynamic adjustment of dropout rates during training. Implement a mechanism to update the dropout rate based on the current epoch or validation loss. Compare the training dynamics, convergence speed, and final performance with the baseline model with a static dropout rate.', 'Interestingness': 7, 'Feasibility': 5, 'Novelty': 5}
Iteration 2/3
{'Name': 'adaptive_dropout', 'Title': 'Adaptive Dropout: Dynamic Regularization for Enhanced Model Training', 'Experiment': 'Modify the model to allow dynamic adjustment of dropout rates during training. Implement a mechanism to linearly decrease the dropout rate from an initial higher value to a lower value over the training epochs. Compare the training dynamics, convergence speed, and final performance with the baseline model with a static dropout rate. Log the dropout rate changes and analyze their impact on performance.', 'Interestingness': 7, 'Feasibility': 6, 'Novelty': 5}
Iteration 3/3
{'Name': 'adaptive_dropout', 'Title': 'Adaptive Dropout: Dynamic Regularization for Enhanced Model Training', 'Experiment': 'Modify the model to allow dynamic adjustment of dropout rates during training. Implement a mechanism to linearly decrease the dropout rate from an initial higher value to a lower value over the training epochs. Compare the training dynamics, convergence speed, and final performance with the baseline model with a static dropout rate. Log the dropout rate changes and analyze their impact on performance.', 'Interestingness': 7, 'Feasibility': 6, 'Novelty': 5}
Idea generation converged after 3 iterations.

Checking novelty of idea 0: adaptive_block_size
[WARNING] title='Gradient-based learning applied to document recognition': len(abstract)=1510 is too long! Use first 1000 chars.
[WARNING] title='Non-Parametric Estimation of a Multivariate Probability Density': len(abstract)=29825 is too long! Use first 1000 chars.
[WARNING] title='QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials': len(abstract)=1268 is too long! Use first 1000 chars.      
[WARNING] title='GROMACS: A message-passing parallel molecular dynamics implementation': len(abstract)=1832 is too long! Use first 1000 chars.
[WARNING] title='Eigenfaces for Recognition': len(abstract)=1462 is too long! Use first 1000 chars.
[WARNING] title='An introduction to cybernetics': len(abstract)=2464 is too long! Use first 1000 chars.
[WARNING] title='CHARMM: The biomolecular simulation program': len(abstract)=1381 is too long! Use first 1000 chars.
[WARNING] title="User's guide to PHREEQC (Version 2): A computer program for speciation, batch-reaction, one-dimensional transport, and inverse geochemical calculations": len(abstract)=1985 is too long! Use first 1000 chars.
Error: HTTPSConnectionPool(host='api.openalex.org', port=443): Max retries exceeded with url: /works?search=adaptive+sequence+length+training+neural+networks&per-page=10 (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1000)')))
[WARNING] title='Gradient-based learning applied to document recognition': len(abstract)=1510 is too long! Use first 1000 chars.
[WARNING] title='Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting': len(abstract)=1521 is too long! Use first 1000 chars.
[WARNING] title='SuperGlue: Learning Feature Matching With Graph Neural Networks': len(abstract)=1058 is too long! Use first 1000 chars.
[WARNING] title='A Metaverse: Taxonomy, Components, Applications, and Open Challenges': len(abstract)=1284 is too long! Use first 1000 chars.
Error: HTTPSConnectionPool(host='api.openalex.org', port=443): Max retries exceeded with url: /works?search=adaptive+block+size+transformer&per-page=10 (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1000)')))
Error: HTTPSConnectionPool(host='api.openalex.org', port=443): Max retries exceeded with url: /works?search=variable+context+window+training+transformers&per-page=10 (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1000)')))
Error: Connection error.
[WARNING] title='Long Short-Term Memory': len(abstract)=1205 is too long! Use first 1000 chars.
[WARNING] title='Gradient-based learning applied to document recognition': len(abstract)=1510 is too long! Use first 1000 chars.
[WARNING] title='Universal Transformers': len(abstract)=1762 is too long! Use first 1000 chars.
[WARNING] title='Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks': len(abstract)=1381 is too long! Use first 1000 chars.
[WARNING] title='Detecting Falls with Wearable Sensors Using Machine Learning Techniques': len(abstract)=1647 is too long! Use first 1000 chars.
[WARNING] title='Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware': len(abstract)=2322 is too long! Use first 1000 chars.
[WARNING] title='Eigenfaces for Recognition': len(abstract)=1462 is too long! Use first 1000 chars.
Decision made: novel after round 7

Checking novelty of idea 1: layerwise_learning_rates
[WARNING] title='Gradient-based learning applied to document recognition': len(abstract)=1510 is too long! Use first 1000 chars.
[WARNING] title='Highly accurate protein structure prediction with AlphaFold': len(abstract)=1701 is too long! Use first 1000 chars.
[WARNING] title='Deformable Convolutional Networks': len(abstract)=1023 is too long! Use first 1000 chars.
[WARNING] title='Dynamic Graph CNN for Learning on Point Clouds': len(abstract)=1413 is too long! Use first 1000 chars.
[WARNING] title='Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI': len(abstract)=2238 is too long! Use first 1000 chars.
[WARNING] title='Image Segmentation Using Deep Learning: A Survey': len(abstract)=1021 is too long! Use first 1000 chars.
[WARNING] title='Geometric Deep Learning: Going beyond Euclidean data': len(abstract)=1303 is too long! Use first 1000 chars.
[WARNING] title='Gradient-based learning applied to document recognition': len(abstract)=1510 is too long! Use first 1000 chars.
[WARNING] title='Deep neural network for traffic sign recognition systems: An analysis of spatial transformers and stochastic optimisation methods': len(abstract)=1054 is too long! Use first 1000 chars.
[WARNING] title='Understanding the Difficulty of Training Transformers': len(abstract)=1194 is too long! Use first 1000 chars.
[WARNING] title='Deformable Convolutional Networks': len(abstract)=1023 is too long! Use first 1000 chars.
[WARNING] title='BioBERT: a pre-trained biomedical language representation model for biomedical text mining': len(abstract)=1800 is too long! Use first 1000 chars.
[WARNING] title='Dynamic Graph CNN for Learning on Point Clouds': len(abstract)=1413 is too long! Use first 1000 chars.
[WARNING] title='Residual Attention Network for Image Classification': len(abstract)=1326 is too long! Use first 1000 chars.
[WARNING] title='Gradient-based learning applied to document recognition': len(abstract)=1510 is too long! Use first 1000 chars.
[WARNING] title='Dynamic Graph CNN for Learning on Point Clouds': len(abstract)=1413 is too long! Use first 1000 chars.
[WARNING] title='Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI': len(abstract)=2238 is too long! Use first 1000 chars.
[WARNING] title='Image Segmentation Using Deep Learning: A Survey': len(abstract)=1021 is too long! Use first 1000 chars.
[WARNING] title='A Survey on Vision Transformer': len(abstract)=1254 is too long! Use first 1000 chars.
[WARNING] title='Applications of machine learning to machine fault diagnosis: A review and roadmap': len(abstract)=1611 is too long! Use first 1000 chars.
[WARNING] title='Deep Learning in Mobile and Wireless Networking: A Survey': len(abstract)=1554 is too long! Use first 1000 chars.
[WARNING] title='WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing': len(abstract)=1185 is too long! Use first 1000 chars.
Decision made: novel after round 3

Checking novelty of idea 2: adaptive_dropout
[WARNING] title='Effectiveness of Psychotherapy for Personality Disorders': len(abstract)=29995 is too long! Use first 1000 chars.
[WARNING] title='Squaring the circle of selection and allocation in liver transplantation for HCC: An adaptive approach': len(abstract)=36011 is too long! Use first 1000 chars.
[WARNING] title='Withdrawing From School': len(abstract)=1302 is too long! Use first 1000 chars.
[WARNING] title='Assessing “Neighborhood Effects”: Social Processes and New Directions in Research': len(abstract)=1236 is too long! Use first 1000 chars.
[WARNING] title='Hepatocellular carcinoma': len(abstract)=1820 is too long! Use first 1000 chars.
[WARNING] title='Deep Learning‐Based Crack Damage Detection Using Convolutional Neural Networks': len(abstract)=1615 is too long! Use first 1000 chars.
[WARNING] title='A Multistep, Consensus-Based Approach to Organ Allocation in Liver Transplantation: Toward a “Blended Principle Model”': len(abstract)=29633 is too long! Use first 1000 chars.
[WARNING] title='A survey of the recent architectures of deep convolutional neural networks': len(abstract)=1818 is too long! Use first 1000 chars.
[WARNING] title='Event-Triggering in Distributed Networked Control Systems': len(abstract)=1495 is too long! Use first 1000 chars.
[WARNING] title='Automatic myocardial segmentation in dynamic contrast enhanced perfusion MRI using Monte Carlo dropout in an encoder-decoder convolutional neural network': len(abstract)=1981 is too long! Use first 1000 chars.
[WARNING] title='Neuroscience-Inspired Artificial Intelligence': len(abstract)=29895 is too long! Use first 1000 chars.
[WARNING] title='The dropout learning algorithm': len(abstract)=1820 is too long! Use first 1000 chars.
[WARNING] title='A Fast Dense Spectral–Spatial Convolution Network Framework for Hyperspectral Images Classification': len(abstract)=1331 is too long! Use first 1000 chars.    
[WARNING] title='Effective Approaches to Attention-based Neural Machine Translation': len(abstract)=1020 is too long! Use first 1000 chars.
[WARNING] title='Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks': len(abstract)=1148 is too long! Use first 1000 chars.
[WARNING] title='Probabilistic Sequential Network for Deep Learning of Complex Process Data and Soft Sensor Application': len(abstract)=1170 is too long! Use first 1000 chars. 
[WARNING] title='Deep nets vs expert designed features in medical physics: An IMRT QA case study': len(abstract)=1915 is too long! Use first 1000 chars.
[WARNING] title='Dropout: a simple way to prevent neural networks from overfitting': len(abstract)=1130 is too long! Use first 1000 chars.
[WARNING] title='Adaptive dropout for training deep neural networks': len(abstract)=1297 is too long! Use first 1000 chars.
[WARNING] title='Deep Convolutional Neural Networks for Large-scale Speech Tasks': len(abstract)=1265 is too long! Use first 1000 chars.
[WARNING] title='Long Short-Term Memory Recurrent Neural Network for Remaining Useful Life Prediction of Lithium-Ion Batteries': len(abstract)=1455 is too long! Use first 1000 chars.
[WARNING] title='An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks': len(abstract)=1054 is too long! Use first 1000 chars.
[WARNING] title='EfficientNetV2: Smaller Models and Faster Training': len(abstract)=1329 is too long! Use first 1000 chars.
[WARNING] title='You Only Look Once: Unified, Real-Time Object Detection': len(abstract)=1118 is too long! Use first 1000 chars.
[WARNING] title='The dropout learning algorithm': len(abstract)=1820 is too long! Use first 1000 chars.
[WARNING] title='An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks': len(abstract)=1062 is too long! Use first 1000 chars.
Decision made: not novel after round 3
Processing idea: adaptive_block_size
Failed to evaluate idea adaptive_block_size: [Errno 2] No such file or directory: 'templates\\nanoGPT_lite\\run_0\\final_info.json'
Traceback (most recent call last):
  File "C:\Users\Administrator\PycharmProjects\AI-Scientist\launch_scientist.py", line 437, in <module>
    success = do_idea(
              ^^^^^^^^
  File "C:\Users\Administrator\PycharmProjects\AI-Scientist\launch_scientist.py", line 164, in do_idea
    with open(osp.join(base_dir, "run_0", "final_info.json"), "r") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'templates\\nanoGPT_lite\\run_0\\final_info.json'

Processing idea: layerwise_learning_rates
Failed to evaluate idea layerwise_learning_rates: [Errno 2] No such file or directory: 'templates\\nanoGPT_lite\\run_0\\final_info.json'
Traceback (most recent call last):
  File "C:\Users\Administrator\PycharmProjects\AI-Scientist\launch_scientist.py", line 437, in <module>
    success = do_idea(
              ^^^^^^^^
  File "C:\Users\Administrator\PycharmProjects\AI-Scientist\launch_scientist.py", line 164, in do_idea
    with open(osp.join(base_dir, "run_0", "final_info.json"), "r") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'templates\\nanoGPT_lite\\run_0\\final_info.json'

@corochann
Contributor Author

I think the final "No such file or directory" error is unrelated to this PR. You need to run the baseline code beforehand so that the run_0 directory exists.
https://github.com/SakanaAI/AI-Scientist?tab=readme-ov-file#create-nanogpt_lite-baseline-run-we-use-this-for-sanity-checking

@jiangzh-coder

I ran the following:

On Windows

conda create -n ai_scientist python=3.11
conda activate ai_scientist

Install PyPI requirements

pip install -r requirements.txt

Run with OpenAlex

$env:OPENALEX_MAIL_ADDRESS = "@mail.edu.cn"

GPT API key

$env:OPENAI_API_KEY = "sk-proj-Z6IpUgbbcK6a1l5h......"
python launch_scientist.py --model "gpt-4o-2024-05-13" --experiment nanoGPT_lite --num-ideas 2 --engine openalex

But I got the following errors:
image
I think this could be due to the following reasons:
image

And I think this step should be correct:
image

Can anyone help me?
Best wishes,
Jiang

@corochann
Contributor Author

I think the final "No such file or directory" error is unrelated to this PR. You need to run the baseline code beforehand so that the run_0 directory exists.
https://github.com/SakanaAI/AI-Scientist?tab=readme-ov-file#create-nanogpt_lite-baseline-run-we-use-this-for-sanity-checking

# NOTE: YOU MUST FIRST RUN THE PREPARE SCRIPTS ABOVE!
cd templates/nanoGPT_lite && python experiment.py --out_dir run_0 && python plot.py
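
If it helps, the FileNotFoundError in the traceback above can be caught earlier with a small pre-flight check. This is only a sketch: baseline_ready is a hypothetical helper, not something in the repository, and the only thing taken from the log is the path run_0/final_info.json inside the template directory.

```python
from pathlib import Path


def baseline_ready(template_dir: str) -> bool:
    """Return True if <template_dir>/run_0/final_info.json exists.

    Hypothetical helper: it mirrors the path that the traceback above
    failed to open, nothing more.
    """
    return (Path(template_dir) / "run_0" / "final_info.json").is_file()


if not baseline_ready("templates/nanoGPT_lite"):
    print("Baseline missing: run experiment.py with --out_dir run_0 first.")
```

Using pathlib also sidesteps the difference between `\` and `/` separators on Windows and Linux.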

conglu1997 added the enhancement (New feature or request) label Oct 25, 2024
@conglu1997
Collaborator

Thanks for the contribution, have you tested this out on a few papers?

@jiangzh-coder

Sure, I will test it! Best wishes! :)

@BradKML

BradKML commented Dec 21, 2024

@corochann what can be done if people want to add other open APIs? What is the intended structured output? There are currently only 6 fields (paper title, authors, venue, year of publication, citation count, paper abstract); should there be more?
Regarding future edits, search_for_papers (which makes the direct API calls) is the only place in core that needs to change, while get_citation_aider_prompt, perform_writeup, and check_idea_novelty just need an engine parameter.
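
One way to picture the "only search_for_papers needs to change" idea is a small registry of backends keyed by engine name. The sketch below is an assumption, not the structure used in the repository: register_engine, the "dummy" backend, and the default arguments are all hypothetical, and only the function name search_for_papers and the six output fields come from this thread.

```python
from typing import Callable, Dict, List, Optional

Paper = Dict[str, object]
SearchFn = Callable[[str, int], Optional[List[Paper]]]

# Hypothetical registry: engine name -> function returning a list of paper dicts.
_ENGINES: Dict[str, SearchFn] = {}


def register_engine(name: str) -> Callable[[SearchFn], SearchFn]:
    def decorator(fn: SearchFn) -> SearchFn:
        _ENGINES[name] = fn
        return fn
    return decorator


@register_engine("dummy")
def _search_dummy(query: str, result_limit: int) -> Optional[List[Paper]]:
    # Placeholder backend standing in for a real API call.
    papers = [
        dict(title=f"{query} #{i}", authors="A. Author", venue="Nowhere",
             year=2024, abstract="", citationCount=0)
        for i in range(result_limit)
    ]
    return papers or None


def search_for_papers(query: str, result_limit: int = 10,
                      engine: str = "dummy") -> Optional[List[Paper]]:
    if not query:
        return None
    try:
        backend = _ENGINES[engine]
    except KeyError:
        raise NotImplementedError(f"engine {engine!r} is not supported") from None
    return backend(query, result_limit)


print(search_for_papers("adaptive dropout", result_limit=2))
```

With this shape, a new source is one decorated function that returns the same six keys, and callers only pass an engine string through.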

@corochann
Contributor Author

corochann commented Jan 8, 2025

Sorry for not responding for a long time.

@conglu1997

have you tested this out on a few papers?

Yes, I tested 2 examples and checked that papers are generated. But I didn't (couldn't) confirm that citations work as expected compared to the Semantic Scholar API.

@BradKML

What is the intended structured output?

I think you can just return a dict containing the following keys.

paper = dict(
    title=title,
    authors=authors,
    venue=venue,
    year=work["publication_year"],
    abstract=abstract,
    citationCount=work["cited_by_count"],
)

https://github.com/SakanaAI/AI-Scientist/pull/135/files#diff-430ad3705ff4e5e5d95e20e5723c4f138bc5ad7821033b474aa4aff469817929R337
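
As a rough illustration of how those six keys could be filled from a raw OpenAlex work record, here is a sketch that works on a plain dict. It is not the code in this PR: work_to_paper and rebuild_abstract are hypothetical names, and the 1000-character cap is taken from the "Use first 1000 chars" warnings in the log earlier in this thread. The field names (authorships, primary_location, abstract_inverted_index, publication_year, cited_by_count) are the ones OpenAlex uses for works.

```python
from typing import Any, Dict, List, Optional

MAX_ABSTRACT_CHARS = 1000  # same cutoff as the warnings in the log above


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """OpenAlex stores abstracts as {word: [positions]}; restore word order."""
    if not inverted_index:
        return ""
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


def work_to_paper(work: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical converter from an OpenAlex work dict to the six-key paper dict."""
    authors = ", ".join(
        (a.get("author") or {}).get("display_name", "")
        for a in work.get("authorships") or []
    )
    source = (work.get("primary_location") or {}).get("source") or {}
    abstract = rebuild_abstract(work.get("abstract_inverted_index"))
    return dict(
        title=work.get("title"),
        authors=authors,
        venue=source.get("display_name") or "Unknown",
        year=work.get("publication_year"),
        abstract=abstract[:MAX_ABSTRACT_CHARS],
        citationCount=work.get("cited_by_count"),
    )
```

Any other source would need the same kind of adapter: whatever the upstream schema looks like, the rest of the pipeline only sees these six keys.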

If it's worth merging, I will try to resolve the conflicts.

@BradKML

BradKML commented Jan 8, 2025

It is absolutely worth merging, but I am not sure when @conglu1997 plans to update this codebase.
I ask because there are other sources worth supporting, such as CORE and those used by PaperRobot/SciPIP/SciMuse/FatcatScholar/Unpaywall/OAmg (probably all of them need a universal metadata standard): #104 (comment)

corochann added 2 commits January 8, 2025 17:00
# Conflicts:
#	README.md
#	ai_scientist/generate_ideas.py
#	ai_scientist/perform_writeup.py
@corochann
Contributor Author

I resolved the merge conflicts (but I haven't tested the latest code yet).

@conglu1997
Collaborator

conglu1997 commented Jan 8, 2025

Yes, I tested 2 examples and checked that papers are generated. But I didn't (couldn't) confirm that citations work as expected compared to the Semantic Scholar API.

@corochann Would you be able to link the papers? Tysm! :)

I can verify the outputs w.r.t. Semantic Scholar

@BradKML

BradKML commented Jan 9, 2025

@RaInSLc could you plot the distribution of abstract lengths in characters? I am very curious what the average and standard deviation are, and what cutoff would leave 90+% of papers untruncated. My bet is that ~2000 would be the sweet spot for this.
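
A quick way to get a first look at this is to pull the len(abstract)=N values out of the warning lines in the log earlier in this thread. The sketch below does that with a regular expression; abstract_lengths and coverage are hypothetical helper names, and the four sample lines are copied from that log. Keep in mind that the log only reports abstracts longer than 1000 characters, so numbers computed this way are biased upward and say nothing about shorter abstracts.

```python
import re
import statistics

LEN_PATTERN = re.compile(r"len\(abstract\)=(\d+)")


def abstract_lengths(log_text: str) -> list[int]:
    """Extract every len(abstract)=N value from the warning lines."""
    return [int(n) for n in LEN_PATTERN.findall(log_text)]


def coverage(lengths: list[int], cutoff: int) -> float:
    """Fraction of the logged abstracts that fit within `cutoff` characters."""
    return sum(n <= cutoff for n in lengths) / len(lengths)


sample_log = """
[WARNING] title='Eigenfaces for Recognition': len(abstract)=1462 is too long! Use first 1000 chars.
[WARNING] title='Universal Transformers': len(abstract)=1762 is too long! Use first 1000 chars.
[WARNING] title='An introduction to cybernetics': len(abstract)=2464 is too long! Use first 1000 chars.
[WARNING] title='Non-Parametric Estimation of a Multivariate Probability Density': len(abstract)=29825 is too long! Use first 1000 chars.
"""

lengths = abstract_lengths(sample_log)
print("mean:", statistics.mean(lengths))
print("std:", statistics.pstdev(lengths))
print("share within 2000 chars:", coverage(lengths, 2000))
```

Outliers near 30000 characters, like the last sample line, will dominate the mean and standard deviation, so a percentile or the coverage fraction is probably the more useful number for picking a cutoff.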

@corochann
Contributor Author

Sorry for the late reply, and thank you for the suggestion.
I re-ran using the latest commit of this PR; the PDF below was generated for the nanoGPT_lite template.
https://drive.google.com/file/d/1japPWE1akytcc0w-yW30z8JiEYXntqzV/view?usp=sharing

@BradKML

BradKML commented Jan 13, 2025

@corochann could you do a full experiment run on this, with all the experimental results and images filled out? Kinda curious

@corochann
Contributor Author

You are right. I don't know why the methods and other sections are not filled in; do you know the reason? How do I run a "full experiment"?
I'm running the script as shown below. I will try again.
OPENALEX_MAIL_ADDRESS="[email protected]" python launch_scientist.py --model "gpt-4o-2024-05-13" --experiment nanoGPT_lite --num-ideas 1 --engine openalex

@conglu1997
Collaborator

conglu1997 commented Jan 14, 2025

This can be stochastic; I recommend running more seeds with GPT-4o or Sonnet for a higher success rate!

I will look thru this today :) Tysm!

@conglu1997 conglu1997 merged commit c9ff305 into SakanaAI:main Jan 14, 2025
@conglu1997
Collaborator

Thank you so much, this is awesome!

@corochann
Contributor Author

Thanks for the review & merge!

Labels
enhancement New feature or request

5 participants