Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Http request / response objects and initial API endpoint #278

Merged
merged 3 commits into from
Nov 25, 2023
Merged

Conversation

jaytaph
Copy link
Member

@jaytaph jaytaph commented Nov 24, 2023

Starting on a network layer and engine api entrypoint. We use a custom http request/response since the crates found are a bit overkill for now. This PR adds a simple http response and request objects that are used for returning in the gosub engine API to the user-agents.

The current api consists of a single function:

fn fetch_url(
    method: &str,
    url: &str,
    headers: Headers,
    cookies: CookieJar,
) -> Result<FetchResponse, Error> {

That will retrieve the URL and returns a fetch response which is a collection of information (request send, response received, document tree, render tree (later), parser errors and timings.

This is not a finished system, but a simple start.

@jaytaph jaytaph marked this pull request as draft November 24, 2023 09:29
@jaytaph jaytaph marked this pull request as ready for review November 24, 2023 13:27
Cargo.toml Outdated
@@ -58,6 +58,10 @@ hickory-resolver = "0.24.0"
simple_logger = "4.2.0"
shared_singleton = "0.1.0"
testing_logger = "0.1.1"
cookie = { version = "0.18.0", features = ["secure", "private"] }
reqwest = { version = "0.11.22", features = ["blocking"] }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do blocking reqwest for now.. this will be changed into something better later on

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had reqwest in one of the binaries at one point, and that brought in Tokio. It was replaced with ureq, which is simple and synchronous and doesn't bring in an async runtime. Is there a reason ureq won't be adequate here? I'd pitch for using ureq where you're thinking of using reqwest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ureq is prob fine too. Ill change that

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it won't be hard. See the code snippet in a comment below.

@@ -275,6 +275,14 @@ impl CharIterator {
result
}

/// Read directly from bytes
pub fn read_from_bytes(&mut self, bytes: &[u8], e: Option<Encoding>) -> io::Result<()> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reqwest (and pretty much every system) will return an array of bytes, so we should be able to read that. I'm not 100% sure about ownership and who needs to take it.. We don't want this to be copied a few times.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing there will be a shared buffer behind an Arc that can be passed between threads. While things are simple and there is no threading, can probably build something off of Rc.

I think the return value here should be Result<()> (our version of Result from types).

use url::Url;

/// Response that is returned from the fetch function
pub struct FetchResponse {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the response from the gosub-engine when you fetch a url. Probably will change in the future, but I think its a start to generate data on a useragent

// measure the DNS lookup time.
fetch_response.timings.start(Timing::DnsLookup);

let mut resolver = crate::dns::Dns::new();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't do anything, but we use it anyway for timing purposes.

src/engine.rs Outdated
// Fetch the HTML document from the site
fetch_response.timings.start(Timing::ContentTransfer);

let m = Method::from_str(method).unwrap();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly converting our http request/response to reqwest objects. This can be done better, but this will be removed later on anyway

headers: HashMap<String, String>,
}

impl Headers {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very simple encapsulation of a hashmap. I think reqwest has a different way that may or may not be better.. but for now this will do.

}

pub fn sorted(&self) -> Vec<(&String, &String)> {
let mut sorted = self.headers.iter().collect::<Vec<_>>();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hashmaps are not deterministic in output/iteration, so we use a sorted() to always get the same results. A bit of overhead, but again, good enough for now

}

#[derive(Default, Debug, Clone)]
pub struct Timer {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timer is just a start and end time, and calculates the differences between (all in milliseconds)

/// Timing information for a single request. Timing is measured in milliseconds.
#[derive(Default, Debug, Clone)]
pub struct TimingTable {
/// Time spent making DNS queries.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timing sections. We probably need more but that's easy enough to do.

@jaytaph jaytaph changed the title Http req resp Http request / response objects and initial API endpoint Nov 24, 2023
src/engine.rs Outdated
url: &str,
headers: Headers,
cookies: CookieJar,
) -> Result<FetchResponse, Error> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Result<FetchResponse, Error> {
) -> Result<FetchResponse> {

To get this to work, you'd need to use the Result type from crate::types, and you might need to translate the error.

@emwalker
Copy link
Collaborator

emwalker commented Nov 24, 2023

@jaytaph I agree with implementing our own type for requests and responses. I do not agree with the reasoning, i.e., that none of the existing crates are quite adequate for what we need. The reason I think we want wrappers for this kind of thing is so that where a struct is widely used in our code, we can control the interface carefully and potentially swap out the third-party crate later on without too much hassle. (I.e., most of the code is not tightly coupled with the third party crate.)

I took a pass at the contents of this PR and have the following suggested changes, which will be easier to show in a patch rather than line by line:

diff --git a/src/engine.rs b/src/engine.rs
index d6e78d3..9ae5834 100644
--- a/src/engine.rs
+++ b/src/engine.rs
@@ -2,19 +2,18 @@ use crate::bytes::{CharIterator, Confidence, Encoding};
 use crate::dns::ResolveType;
 use crate::html5::parser::document::{Document, DocumentBuilder, DocumentHandle};
 use crate::html5::parser::Html5Parser;
-use crate::net::errors::Error;
 use crate::net::http::headers::Headers;
 use crate::net::http::request::Request;
 use crate::net::http::response::Response;
 use crate::timing::{Timing, TimingTable};
-use crate::types::ParseError;
+use crate::types::{Error, ParseError, Result};
 use cookie::CookieJar;
 use core::fmt::Debug;
-use core::str::FromStr;
-use reqwest::header::HeaderName;
-use reqwest::Method;
+use std::io::Read;
 use url::Url;
 
+const MAX_BYTES: u64 = 10_000_000;
+
 /// Response that is returned from the fetch function
 pub struct FetchResponse {
     /// Request that has been send
@@ -57,15 +56,12 @@ fn fetch_url(
     url: &str,
     headers: Headers,
     cookies: CookieJar,
-) -> Result<FetchResponse, Error> {
+) -> Result<FetchResponse> {
     let mut http_req = Request::new(method, url, "HTTP/1.1");
     http_req.headers = headers.clone();
     http_req.cookies = cookies.clone();
 
-    let parts = Url::parse(url);
-    if parts.is_err() {
-        return Err(Error::Generic(format!("Failed to parse URL: {}", url)));
-    }
+    let parts = Url::parse(url)?;
 
     let mut fetch_response = FetchResponse {
         request: http_req,
@@ -81,39 +77,47 @@ fn fetch_url(
     fetch_response.timings.start(Timing::DnsLookup);
 
     let mut resolver = crate::dns::Dns::new();
-    let res = resolver.resolve(parts.unwrap().host_str().unwrap(), ResolveType::Ipv4);
-    if res.is_err() {
-        return Err(Error::Generic(format!("Failed to resolve domain: {}", url)));
-    }
+    let hostname = parts.host_str().expect("no hostname");
+    let _ = resolver.resolve(hostname, ResolveType::Ipv4)?;
 
     fetch_response.timings.end(Timing::DnsLookup);
 
     // Fetch the HTML document from the site
     fetch_response.timings.start(Timing::ContentTransfer);
 
-    let m = Method::from_str(method).unwrap();
-    let u = Url::parse(url).unwrap();
-    let mut req = reqwest::blocking::Request::new(m, u);
+    let mut req = match method.to_ascii_lowercase().as_str() {
+        "get" => ureq::get(url),
+        "post" => ureq::post(url),
+        _ => return Err(Error::Generic(format!("unknown method: {method}"))),
+    };
+
     for (key, value) in headers.sorted() {
-        req.headers_mut()
-            .insert(HeaderName::from_str(key).unwrap(), value.parse().unwrap());
+        req = req.set(key, value);
     }
 
-    match reqwest::blocking::Client::new().execute(req) {
+    match req.call() {
         Ok(resp) => {
             fetch_response.response = Response::new();
-            fetch_response.response.status = resp.status().as_u16();
-            fetch_response.response.version = format!("{:?}", resp.version());
-            for (key, value) in resp.headers().iter() {
-                fetch_response
-                    .response
-                    .headers
-                    .set(key.as_str(), value.to_str().unwrap());
+            fetch_response.response.status = resp.status();
+            fetch_response.response.version = format!("{:?}", resp.http_version());
+            for key in &resp.headers_names() {
+                for value in resp.all(key) {
+                    fetch_response.response.headers.set(key.as_str(), value);
+                }
             }
             // for cookie in resp.cookies() {
             //     fetch_response.response.cookies.insert(cookie.name().to_string(), cookie.value().to_string());
             // }
-            fetch_response.response.body = resp.bytes().unwrap().to_vec();
+
+            let len = if let Some(header) = resp.header("Content-Length") {
+                header.parse::<usize>().unwrap_or_default()
+            } else {
+                MAX_BYTES as usize
+            };
+
+            let mut bytes: Vec<u8> = Vec::with_capacity(len);
+            resp.into_reader().take(MAX_BYTES).read_to_end(&mut bytes)?;
+            fetch_response.response.body = bytes;
         }
         Err(e) => {
             return Err(Error::Generic(format!("Failed to fetch URL: {}", e)));
diff --git a/src/types.rs b/src/types.rs
index 8e77176..bcf0560 100644
--- a/src/types.rs
+++ b/src/types.rs
@@ -61,6 +61,12 @@ pub enum Error {
 
     #[error("dns: domain not found")]
     DnsDomainNotFound,
+
+    #[error("there was a problem: {0}")]
+    Generic(String),
+
+    #[error("failed to parse url: {0}")]
+    Url(#[from] url::ParseError),
 }
 
 /// Result that can be returned which holds either T or an Error

Copy link
Member

@Sharktheone Sharktheone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some small things, so feel free to ignore

src/engine.rs Outdated
fetch_response.timings.start(Timing::DnsLookup);

let mut resolver = crate::dns::Dns::new();
let hostname = parts.host_str().expect("no hostname");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let hostname = parts.host_str().expect("no hostname");
let Some(hostname) = parts.host_str() else {
return Err(Error::Generic(format!("invalid hostname: {}", url)));
};

I guess, the let-else syntax is a convenient solution for "error" handling Options

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

src/engine.rs Outdated

let mut req = match method.to_ascii_lowercase().as_str() {
"get" => ureq::get(url),
"post" => ureq::post(url),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sticking with GET and POST? We definitely should implement the other methods (e.g PUT, DELETE, PATCH) at some point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use "request" now:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let mut req = agent.request(method, url).set("User-Agent", USER_AGENT);

Comment on lines +108 to +110
// for cookie in resp.cookies() {
// fetch_response.response.cookies.insert(cookie.name().to_string(), cookie.value().to_string());
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code. I guess it's a TODO, so maybe add the TODO note

}

#[derive(Default, Debug, Clone)]
pub struct Timer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that basically what std::time::Instant does?

    let start = Instant::now();

    start.elapsed().as_millis();

I mean you can wrap it to still have the start and end method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks similar. I will let it what it is since it works and is not that important

@jaytaph jaytaph merged commit 37f5c5c into main Nov 25, 2023
4 checks passed
@jaytaph jaytaph deleted the http-req-resp branch November 25, 2023 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants