# Performance geojson_json(sp) - transformation loop sp -> JSON -> R -> JSON #85
There are two issues:

- printing of json is now more cumbersome, but could rework which print statement is used to minify on print
- can try again #77 - will still take some time though

---
also, @ateucher - might want to try the new `json-io` branch

---
Damn you work fast @sckott. I'll have a look for sure!

---
@sckott: Great, while I was still benchmarking, you provide a ready-to-use fix. Here is some high-level benchmarking of your version for a relatively small geometry:

```r
input <- raster::getData('GADM', country='FRA', level=1)

devtools::install_github("ropensci/geojsonio")
system.time(geojsonio::geojson_json(input))
#    user  system elapsed
#  34.161   0.034  34.263

devtools::install_github("ropensci/geojsonio@json-io")
system.time(geojsonio::geojson_json(input))
#    user  system elapsed
#   4.643   0.204   4.876
```

Back to the current version of `geojsonio`, timing each step of the transformation loop separately:

```r
geojson_benchmark <- function(input) {
  dir <- tempfile()
  file <- tempfile()
  t1 <- system.time(rgdal::writeOGR(obj = input, dsn = dir, layer = "", driver = "GeoJSON"))
  t2a1 <- system.time(file.copy(dir, file, overwrite = TRUE))
  t2a2 <- system.time(js <- jsonlite::fromJSON(file, simplifyDataFrame = FALSE, simplifyMatrix = FALSE))
  t2a3 <- system.time(gj <- jsonlite::toJSON(js, digits = 22, auto_unbox = TRUE))
  t2 <- system.time(gj <- readChar(file, file.info(file)$size))
  list(writeOGR  = t1[["elapsed"]],
       file.copy = t2a1[["elapsed"]],
       fromJSON  = t2a2[["elapsed"]],
       toJSON    = t2a3[["elapsed"]],
       readChar  = t2[["elapsed"]])
}

geo_bench <- function(country, level = 1, rep = 5) {
  input <- raster::getData('GADM', country = country, level = level)
  replicate(rep, geojson_benchmark(input))
}
```
```r
> geo_bench('BEL', 1)
#            [,1]  [,2]  [,3]  [,4]  [,5]
# writeOGR  0.093 0.108 0.103 0.111 0.106
# file.copy 0.001 0.001 0.001 0.001 0.001
# fromJSON  0.137 0.13  0.126 0.127 0.129
# toJSON    0.941 0.943 0.947 0.936 0.943
# readChar  0.026 0.014 0.027 0.014 0.027

> geo_bench('SWE', 1)
#             [,1]   [,2]   [,3]   [,4]   [,5]
# writeOGR   4.473  4.356  4.352  3.981  4.204
# file.copy  0.037  0.036  0.038  0.034  0.035
# fromJSON   7.924  8.314  7.861  7.742  7.86
# toJSON    46.848 48.026 47.043 48.143 46.744
# readChar   1.322  1.249  1.175  1.3    1.28

> geo_bench('CAN', 3)
#              [,1]    [,2]    [,3]    [,4]    [,5]
# writeOGR   66.758  78.971  80.513  84.47   99.44
# file.copy   0.562   1.032   1.045   1.676   1.671
# fromJSON  125.556 128.876 134.611 137.035 135.313
# toJSON    832.98  826.846 838.868 832.268 826.4
# readChar   28.233  17.654  32.044  22.903  23.725
```

Finally, benchmarking your line-by-line read against `readChar`:

```r
input <- raster::getData('GADM', country='CAN', level=1)
dsn <- tempfile()
rgdal::writeOGR(obj = input, dsn = dsn, layer = "", driver = "GeoJSON")

readbyline <- function(x, minify = TRUE, ...) {
  paste0(readLines(x, ...), collapse = "\n")
}
readbychar <- function(x) {
  readChar(x, file.info(x)$size)
}

microbenchmark::microbenchmark(byline = readbyline(dsn),
                               bychar = readbychar(dsn),
                               times = 3)
# Unit: seconds
#    expr       min        lq      mean    median        uq      max neval cld
#  byline 13.640235 13.750737 13.807696 13.861238 13.891426 13.92161     3   b
#  bychar  7.474303  7.523513  7.638348  7.572722  7.720371  7.86802     3  a
```
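The `toJSON` step clearly dominates, so the win comes from skipping the `fromJSON()`/`toJSON()` round trip entirely. A minimal sketch of that idea (`geojson_json_direct` is a hypothetical name, not the actual `json-io` implementation):

```r
# Sketch: write the sp object to GeoJSON once, then read the file back
# as a single string instead of parsing and re-serializing it.
geojson_json_direct <- function(input) {
  dsn <- tempfile(fileext = ".geojson")
  rgdal::writeOGR(obj = input, dsn = dsn, layer = "", driver = "GeoJSON")
  js <- readChar(dsn, file.info(dsn)$size)
  structure(js, class = "json")  # the class jsonlite::toJSON() returns
}
```

---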
Thanks for the detailed benchmarks @javrucebo - I'll try implementing `readChar`.

---
I see a similar ~10x speedup on a large SpatialPolygonsDataFrame (19 MB, 48 features) in the json-io branch: ~12 seconds vs ~130 seconds with `geojson_json`. There is only a small difference between `readLines` and `readChar`, but I also tested with `readr::read_file`:

```r
input <- raster::getData('GADM', country='CAN', level=1)
dsn <- tempfile()
rgdal::writeOGR(obj = input, dsn = dsn, layer = "", driver = "GeoJSON")

readbyline <- function(x, minify = TRUE, ...) {
  paste0(readLines(x, ...), collapse = "\n")
}
readbychar <- function(x) {
  readChar(x, file.info(x)$size)
}

microbenchmark::microbenchmark(byline = readbyline(dsn),
                               bychar = readbychar(dsn),
                               readr::read_file(dsn),
                               times = 3)
```

---
And tested in context:

```r
library(geojsonio)
library(raster)
library(microbenchmark)

geojson_rw_test <- function(input, fun) {
  tmp <- tempfile(fileext = ".geojson")
  suppressMessages(geojson_write(input, file = tmp))
  fun(tmp)
}
readbyline <- function(x) {
  paste0(readLines(x), collapse = "\n")
}
readbychar <- function(x) {
  readChar(x, file.info(x)$size)
}

input <- raster::getData('GADM', country='CAN', level=1)
microbenchmark(geojson_rw_test(input, fun = readbyline),
               geojson_rw_test(input, fun = readbychar),
               geojson_rw_test(input, fun = readr::read_file),
               times = 3)
```

---
@ateucher thanks for the benchmarks.

Possibly. We already have some heavy deps, but if it is that much faster, perhaps we should think about it.

---
@javrucebo @ateucher let me know if you have any further thoughts

---
A thought - though possibly too convoluted. Could include:

```r
read_geojson_file <- function(x) {
  # use readr when it is installed, fall back to base R otherwise
  if (requireNamespace("readr", quietly = TRUE)) {
    return(readr::read_file(x))
  } else {
    return(readChar(x, nchars = file.info(x)$size))
  }
}
```

Then just include a note in the documentation that this function will be faster if you have `readr` installed.
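For context, a hypothetical call site (the surrounding flow is an assumption, just to show where such a helper would slot in):

```r
# Sketch: read back the temp file that geojson_write() produced
tmp <- tempfile(fileext = ".geojson")
geojsonio::geojson_write(input, file = tmp)
js <- read_geojson_file(tmp)
```

---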
@javrucebo does Andy's suggestion above sound good to you?

---
Andy's suggestion will certainly work; nevertheless I would favour a hard dependency. Many (if not most) R users I know personally would not read the "Suggests" section and then manually install an additional package for a performance gain they do not know about. But having dependencies automatically installed is something they know. This is personal taste, though, combined with knowing a not-so-experienced user base. Personally, as I have `readr` installed anyway, either way works for me.

---
I agree with @javrucebo - it's probably worth having `readr` as a hard dependency.

---
When you use `readr::read_file(x)`, you may want to change it to `readr::read_file(x, locale = locale())`; otherwise the user's `readr.default_locale` option is picked up. As your geojson files will end up being UTF-8 (or ASCII), if a user sets that option to an incompatible encoding you end up with messed-up characters:

```r
input <- raster::getData('GADM', country='AND', level=0)

# desired behaviour
js <- geojsonio::geojson_json(input)
stringi::stri_match_first_regex(js, '"NAME_ARABIC": "[^"]+"')
#      [,1]
# [1,] "\"NAME_ARABIC\": \"أندورا \""

# non-desired behaviour
options("readr.default_locale" = readr::locale(encoding = "ISO-8859-2"))
js <- geojsonio::geojson_json(input)
stringi::stri_match_first_regex(js, '"NAME_ARABIC": "[^"]+"')
#      [,1]
# [1,] "\"NAME_ARABIC\": \"ŘŁŮ\u0086ŘŻŮ\u0088عا \""
```
What about including an `encoding` argument:

```r
geojson_json(input, lat = NULL, lon = NULL, group = NULL,
             geometry = "point", type = "FeatureCollection",
             encoding = "UTF-8", ...)
```

and then internally use:

```r
readr::read_file(x, locale = locale(encoding = encoding))
```

Will there ever be a need for a user to change it, or is that silly?

---
Good point about encoding. I imagine we do want to allow the user to set encoding.

---
No, we do not want the user to choose the encoding of a JSON file; UTF is required (and we might not want to go down the route of allowing UTF-16/UTF-32, which might create more headaches if you play it through): https://tools.ietf.org/html/rfc7159#section-8.1

But encodings can create headaches already now:

```r
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x
# [1] "façile"
Encoding(x)
# [1] "latin1"
js <- jsonlite::toJSON(x)
js
# ["façile"]
Encoding(js)
# [1] "UTF-8"
```

But if we let `rgdal::writeOGR` write the same string:

```r
input <- SpatialPointsDataFrame(cbind(50, 10), data = data.frame(ID = x, stringsAsFactors = FALSE))
dsn <- tempfile()
rgdal::writeOGR(obj = input, dsn = dsn, layer = "", driver = "GeoJSON")
system(paste("file", dsn))
# /tmp/RtmpPDcdFQ/file488e399430d1: ISO-8859 text
```

And with `geojsonio`:

```r
# version 0.1.8 as on CRAN
geojsonio::geojson_json(input)
# Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
#   lexical error: invalid bytes in UTF8 string.
#           1, "properties": { "ID": "fa�ile" }, "geometry": { "type": "
#                      (right here) ------^

# version @json-io
js <- geojsonio::geojson_json(input)
js
# {
# "type": "FeatureCollection",
#
# "features": [
# { "type": "Feature", "id": 1, "properties": { "ID": "fa�ile" }, "geometry": { "type": "Point", "coordinates": [ 50.0, 10.0 ] } }
# ]
# }
Encoding(js)
# "UTF-8"
# which is encoded in UTF-8 but was read as the wrong characters
```

This goes beyond this thread though.

---
One can give an encoding argument to `writeOGR`:

```r
rgdal::writeOGR(obj = input, dsn = dsn, layer = "", driver = "GeoJSON", encoding = "latin1")
system(paste("file", dsn))
# /tmp/Rtmp2h1fhE/file4a2b11df724b: UTF-8 Unicode text
```

But this does not really help, as you can have mixed encodings (in which case `writeOGR` fails). In a nutshell, I would not bother with this in `geojsonio`.
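If one did want to be defensive, a sketch (an assumption on my part, not something the thread settled on) would be to normalise the attribute columns to UTF-8 before writing, assuming their encodings are correctly declared:

```r
# Sketch: re-encode all character columns of an sp object's data slot
# to UTF-8 before handing it to rgdal::writeOGR().
to_utf8 <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) enc2utf8(col) else col
  })
  df
}
input@data <- to_utf8(input@data)
```

---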
okay, so `readr` as a hard dependency it is

---
okay, pushed the change, closing this - reopen if you think it needs more work

---
Hey @javrucebo - just wanted to say thanks for catching this! It's made a really big difference when working with large objects 👍 |
---
I am wondering why in `geojson_json(sp)` we have the following transformations (sketched in code below):

1. `sp` object to `geojson` on disk through `geojson_write`
2. to an R object through `jsonlite::fromJSON`
3. back to JSON through `to_json`
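In rough terms (a paraphrase of the steps above, not the actual geojsonio internals; `sp_obj` is a placeholder):

```r
tmp <- tempfile(fileext = ".geojson")
geojsonio::geojson_write(sp_obj, file = tmp)           # 1. sp -> GeoJSON on disk
r_obj <- jsonlite::fromJSON(tmp,
                            simplifyDataFrame = FALSE,
                            simplifyMatrix = FALSE)    # 2. GeoJSON -> R object
js <- jsonlite::toJSON(r_obj, auto_unbox = TRUE)       # 3. R object -> JSON again
```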
I do not see a difference in the resulting JSON from steps 1 and 3 apart from the number of digits, but that might be different for more complex SpatialObjects? For any map I used from GADM I did not observe any difference when using the JSON from `geojson_write` directly.

The current behaviour adds quite some timing overhead; see some benchmarking I described in ateucher/rmapshaper#28 (comment) (though the geojson transformations are not isolated there).