I've noticed that calling ADLStoreClient.enumerateDirectory with the startAfter argument causes the entire directory to be listed if startAfter contains Unicode (non-ASCII) characters. For example, in a directory where the Unicode filename "澳门.tst" is the last file, providing that path as startAfter re-lists the entire directory contents. I've reproduced this with other Unicode paths as well. This makes paging through a directory of Unicode paths impossible: any time the last path in a page is a Unicode path, the listing restarts from the beginning.
This can be seen in the Hadoop client, which uses this SDK, when it tries to list an Azure Data Lake Store directory containing 4001 files with Unicode names (one more than the hard-coded page size of 4k entries):
hdfs dfs -ls /eoneill/Test5
--
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:348)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._getText2(UTF8StreamJsonParser.java:320)
at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:274)
at com.microsoft.azure.datalake.store.Core.listStatus(Core.java:859)
at com.microsoft.azure.datalake.store.ADLStoreClient.enumerateDirectoryInternal(ADLStoreClient.java:525)
at com.microsoft.azure.datalake.store.ADLStoreClient.enumerateDirectory(ADLStoreClient.java:504)
at com.microsoft.azure.datalake.store.ADLStoreClient.enumerateDirectory(ADLStoreClient.java:368)
at org.apache.hadoop.fs.adl.AdlFileSystem.listStatus(AdlFileSystem.java:473)
at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268)
at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:373)
at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:220)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:297)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:356)
The process runs for a while and then runs out of memory, because it rescans the same 4k files over and over again. The same listing works fine with 3.9k files.
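Until the underlying behaviour is fixed, a caller that pages manually could at least fail fast instead of looping until memory is exhausted. The following is only a sketch of that idea, not the SDK's or Hadoop's actual code: `PageLister` is a hypothetical stand-in for a call such as enumerateDirectory(path, pageSize, startAfter), entries are reduced to plain name strings, and the restart check (comparing the first entry of each later page with the first entry of the first page) is an assumption about how the restart shows up.

```java
import java.util.ArrayList;
import java.util.List;

public class PagedListing {
    /** Hypothetical stand-in for one paged listing call; not part of the SDK. */
    public interface PageLister {
        List<String> list(int pageSize, String startAfter);
    }

    /**
     * Pages through a directory by passing the last name of each page as the
     * next startAfter value. Throws if a later page begins with the same entry
     * as the first page, which indicates the listing restarted from the top.
     */
    public static List<String> listAll(PageLister lister, int pageSize) {
        List<String> all = new ArrayList<>();
        String startAfter = null;   // null means "start from the beginning"
        String firstName = null;    // first entry of the first page

        while (true) {
            List<String> page = lister.list(pageSize, startAfter);
            if (page.isEmpty()) {
                break;
            }
            if (firstName == null) {
                firstName = page.get(0);
            } else if (page.get(0).equals(firstName)) {
                // The server ignored startAfter and re-listed the directory.
                throw new IllegalStateException(
                    "Listing restarted from the beginning after startAfter=" + startAfter);
            }
            all.addAll(page);
            if (page.size() < pageSize) {
                break;              // a short page is the last page
            }
            startAfter = page.get(page.size() - 1);
        }
        return all;
    }
}
```

With a guard like this, the 4001-file case would surface as an immediate exception naming the offending startAfter value rather than as an OutOfMemoryError after a long run.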