Improve ability of command-line client to download large quantities of data [Tigris #321] #3

markphip · 2016-09-15T13:51:38Z

Description

Currently, or at least in version 0.9.13, the command line client adapter is
unable to retrieve files larger than a few megs without throwing an
OutOfMemoryError.  The reason is the results of the 'cat' call are read into a
ByteArrayOutputStream, and either at that point or later in the application, the
error is thrown.  This problem was discovered when trying to use Ivy
(http://jayasoft.org/ivy) to download large dependencies.

This patch adds a readStreamWithOverflow() method which reads the stream into a
ByteArrayOutputStream, however if more than 1 meg is read, a temporary file is
created, the byte array is dumped into it, and the rest of the stream is writted
into the temp file.  The appropriate InputStream is then returned.  This new
method is used by the getContent(SVNUrl...) method instead of the previous
implementation.  This patch makes it possible, at least in my testing, to
download any size of file without any memory problems and without any
preceivable performance penalty.

The patch is as follows from the 0.9.13 tag:

Index: CmdLineClientAdapter.java
===================================================================
--- CmdLineClientAdapter.java   (revision 1402)
+++ CmdLineClientAdapter.java   (working copy)
@@ -55,12 +55,14 @@
 package org.tigris.subversion.svnclientadapter.commandline;

 import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileNotFoundException;
 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.io.InputStream;
+import java.io.OutputStream;
 import java.util.ArrayList;
 import java.util.Iterator;
 import java.util.LinkedList;
@@ -321,11 +323,7 @@
        try {
            InputStream content = _cmd.cat(toString(arg0), toString(arg1));

-           //read byte-by-byte and put it in a vector.
-           //then take the vector and fill a byteArray.
-           byte[] byteArray;
-           byteArray = streamToByteArray(content, false);
-           return new ByteArrayInputStream(byteArray);
+            return readStreamWithOverflow(content);
        } catch (IOException e) {
            throw SVNClientException.wrapException(e);
        } catch (CmdLineException e) {
@@ -1206,4 +1204,60 @@
         // TODO Auto-generated method stub
         return null;
     }
+
+
+    /**
+     *  Reads the contents of an InputStream into a byte array, overflowing to
+     *  a temporary file if needed.  If the bytes read exceeds 1MB, a tempory 
+     *  file is created and used to store the rest of the bytes. 
+     * 
+     * @param inStream The original InputStream
+     * @return A ByteArrayInputStream of read bytes, or if overflow occurred,
+     *         a FileInputStream reading from the temporary file 
+     * 
+     */
+    private InputStream readStreamWithOverflow(InputStream inStream) throws
IOException {
+
+        try {
+            ByteArrayOutputStream outBuffer = new ByteArrayOutputStream();
+            OutputStream outStream = outBuffer;
+            File overflowFile = null;
+            byte[] buffer = new byte[4096];
+            int total = 0;
+            int overflowCutoff = 1 * 1024 * 1024;
+            int len;
+            while ((len = inStream.read(buffer)) > 0) {
+                total += len;
+                // When too many bytes have been read, shove buffer into 
+                // temporary file and write everything else to it
+                if (total > overflowCutoff && overflowFile == null) {
+
+                    byte[] b = outBuffer.toByteArray();
+                    overflowFile = File.createTempFile("ivy", ".tmp");
+                    //System.out.println("Overflow detected, using
"+overflowFile+" file");
+                    outStream = new FileOutputStream(overflowFile);
+                    outStream.write(b, 0, b.length);
+                    outBuffer = null;
+                }    
+                outStream.write(buffer, 0, len);
+            }
+            
+            if (overflowFile != null) {
+                try {
+                    outStream.close();
+                } catch (IOException ex) {
+                    // ignore
+                }
+                return new FileInputStream(overflowFile);
+            } else {
+                return new ByteArrayInputStream(outBuffer.toByteArray());
+            }    
+        }
+        finally {
+            if (inStream != null) {
+                inStream.close();
+            }
+        }
+    }
+            
 }

Metadata Imported from Tigris (Issue 321)

Creation Date: 2005-06-21 12:22:59
Reporter: mrdon
Subcomponent: svnClientAdapter
Version: current
Milestone: 1.0.0
Keywords:
Cc: conor, dlr, markphip

Comments

2005-07-02 21:07:00 by dlr

Adding myself to the CC list.

2005-07-02 21:52:40 by dlr

This issue applies not only to cat, but to diff and propget as well.

2005-07-02 22:32:57 by dlr

That data is read out of a java.lang.Process object byte-by-byte into a Vector
in CmdLineClientAdapter.streamToByteArray() is rather inefficient.  Streaming
the data to disk when it exceeds a certain threshold is not a bad idea (though
the suggested 1 MB is too large a buffer), but I wonder if we could get away
with instead returning the value of Process.getInputStream()?

The JavaDoc of Process has a little to say about that:

http://java.sun.com/j2se/1.3/docs/api/java/lang/Process.html

"All its standard io (i.e. stdin, stdout, stderr) operations will be redirected
to the parent process through three streams (Process.getOutputStream(),
Process.getInputStream(), Process.getErrorStream()). The parent process uses
these streams to feed input to and get output from the subprocess. Because some
native platforms only provide limited buffer size for standard input and output
streams, failure to promptly write the input stream or read the output stream of
the subprocess may cause the subprocess to block, and even deadlock."

Clearly, that last sentence could theoretically pose a problem.  However, I
wonder how often such a thing would come up in practice?  The InputStream is
likely to be read immediately by callers, thus avoiding a deadlock or OS buffer
overrun.  This might be further obviated by wrapping the returned InputStream in
a BufferedInputStream.

2005-07-02 22:57:54 by dlr

The suggested approach of queuing up data exceeding a threshold in a temporary
file only works okay so long as you actually have permission to write to the
local disk, and have possibly up to 2x the space available for the data read off
of the InputStream on disk.  Furthermore, it's much slower since it's possibly
involving disk (hardware) I/O, instead of attempting to use only memory.  The
provided patch is a good example of how to take this approach, but also includes
some commented-out debug statements, and uses the text "ivy" when creating the
temporary file.  Use of Commons I/O (which would be a new dependency) and/or
Lang (an existing dependency) might obviate the need for some of this patch's code.

2005-07-03 00:46:36 by mrdon

Right, the key issue is if there is a need to immediately retrieve the contents of the Process's InputStream 
or it can be returned wrapped in a buffer.  I currently use this patch to download files up to 250 MB, but 
have not tested returning a buffered input stream directly.  As you pointed out, these seems to be an issue 
in other places as well.  I was talking with Conor MacNeill from Apache Ant at Javaone, and he said he was 
having similar issues with the JNI-based svn client.  In his case, the fix involved modifying API's which 
thankfully isn't the case here.  I certainly appreciate anything you can do to resolve this issue for future 
revisions.

2005-07-14 10:55:28 by dlr

Thanks Don.

We've been continuing this discussion on the dev list.  Here's a link to the
email thread:

http://subclipse.tigris.org/servlets/BrowseList?list=dev&by=thread&from=342908


Related JavaHL issue, including a scalability patch by Conor MacNeill:

http://subversion.tigris.org/issues/show_bug.cgi?id=2299

2005-08-03 17:42:53 by dlr

SVN 2299 has been committed to SVN's trunk, and is expected to show up in a
release no later than Subversion 1.3.0.

The text was updated successfully, but these errors were encountered:

markphip added the patch label Sep 15, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve ability of command-line client to download large quantities of data [Tigris #321] #3

Improve ability of command-line client to download large quantities of data [Tigris #321] #3

markphip commented Sep 15, 2016