startAfter field in ADLStoreClient.enumerateDirectory ignored if unicode #37

Open
chrismackey opened this issue Dec 3, 2019 · 1 comment

I've noticed that calling ADLStoreClient.enumerateDirectory with the startAfter parameter will list the entire directory if the startAfter value contains unicode characters. For example, in a directory where the unicode filename "澳门.tst" is the last file, passing that name as startAfter re-lists the entire directory contents. I've reproduced this with other unicode paths as well. This makes paging through a directory of unicode paths impossible: any time the last path within a page is a unicode path, enumeration restarts from the beginning.
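The failure mode can be demonstrated with a short, self-contained simulation (the listing logic below is illustrative only, not the SDK's actual implementation): a caller pages by passing the last entry of the previous page as startAfter, and if the service effectively ignores a non-ASCII startAfter, the next request returns the first page again instead of an empty page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation of the reported paging behavior; NOT the SDK's internals.
public class PagingSimulation {
    static final List<String> DIR = Arrays.asList("a.tst", "b.tst", "澳门.tst");
    static final int PAGE_SIZE = 2;

    // Simulated server-side listing: returns up to PAGE_SIZE entries after
    // startAfter, but (mimicking the bug report) ignores startAfter when it
    // contains non-ASCII characters.
    static List<String> listPage(String startAfter) {
        boolean ignored = startAfter != null
                && !startAfter.chars().allMatch(c -> c < 128);
        List<String> page = new ArrayList<>();
        for (String name : DIR) {
            if (!ignored && startAfter != null && name.compareTo(startAfter) <= 0) {
                continue; // skip entries at or before the resume point
            }
            page.add(name);
            if (page.size() == PAGE_SIZE) break;
        }
        return page;
    }

    public static void main(String[] args) {
        String startAfter = null;
        for (int i = 0; i < 3; i++) {
            List<String> page = listPage(startAfter);
            System.out.println("page " + i + ": " + page);
            if (page.isEmpty()) break;
            // Last entry of this page becomes the next startAfter.
            startAfter = page.get(page.size() - 1);
        }
        // Once "澳门.tst" is the last entry of a page, the next request
        // restarts from the beginning instead of returning an empty page,
        // so the paging loop never terminates on its own.
    }
}
```

Run as-is, the third request returns page one again, which is exactly the restart-from-the-beginning behavior described above.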

@chrismackey (Author)

This can be seen in the Hadoop client, which uses this SDK: listing an Azure Data Lake directory containing 4001 files with unicode names (one more than the hard-coded page size of 4000 entries) fails:

```
hdfs dfs -ls /eoneill/Test5

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:348)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._getText2(UTF8StreamJsonParser.java:320)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:274)
	at com.microsoft.azure.datalake.store.Core.listStatus(Core.java:859)
	at com.microsoft.azure.datalake.store.ADLStoreClient.enumerateDirectoryInternal(ADLStoreClient.java:525)
	at com.microsoft.azure.datalake.store.ADLStoreClient.enumerateDirectory(ADLStoreClient.java:504)
	at com.microsoft.azure.datalake.store.ADLStoreClient.enumerateDirectory(ADLStoreClient.java:368)
	at org.apache.hadoop.fs.adl.AdlFileSystem.listStatus(AdlFileSystem.java:473)
	at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268)
	at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:373)
	at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:220)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:297)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:356)
```

The process takes a while, then throws an OutOfMemoryError as it constantly rescans the same 4000 files over and over again. It works fine with 3900 files.
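One unconfirmed hypothesis for the behavior: if the startAfter value is sent to the backing listAfter query parameter without percent-encoding its UTF-8 bytes, the server could see a mangled value it cannot match and fall back to listing from the start. The snippet below is only a diagnostic sketch showing what a correctly percent-encoded value looks like; it is not the SDK's code and does not confirm the root cause.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Diagnostic sketch: what the startAfter value should look like if it were
// correctly percent-encoded as UTF-8 before being placed in a query string.
public class EncodeCheck {
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // ASCII names pass through unchanged; unicode names must be
        // percent-encoded byte by byte.
        System.out.println(encode("a.tst"));
        System.out.println(encode("澳门.tst"));
    }
}
```

If an unencoded (or doubly-encoded) value is what actually goes over the wire, that would explain why ASCII startAfter values work while unicode ones are ignored; capturing the raw HTTP request would confirm or rule this out.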
