startAfter field in ADLStoreClient.enumerateDirectory ignored if unicode #37

Open
chrismackey opened this issue Dec 3, 2019 · 1 comment

I've noticed that calling ADLStoreClient.enumerateDirectory with the startAfter parameter will list the entire directory if the startAfter value contains unicode characters. For example, in a directory where the unicode filename "澳门.tst" is the last file, passing that name as startAfter re-lists the entire directory contents. I've reproduced this with other unicode paths as well. This makes paging through a directory of unicode paths impossible: any time the last path within a page is a unicode path, enumeration restarts from the beginning.
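The failure mode can be demonstrated with a short, self-contained simulation (the listing logic below is illustrative only, not the SDK's actual implementation): a caller pages by passing the last entry of the previous page as startAfter, and if the service effectively ignores a non-ASCII startAfter, the next request returns the first page again instead of an empty page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation of the reported paging behavior; NOT the SDK's internals.
public class PagingSimulation {
    static final List<String> DIR = Arrays.asList("a.tst", "b.tst", "澳门.tst");
    static final int PAGE_SIZE = 2;

    // Simulated server-side listing: returns up to PAGE_SIZE entries after
    // startAfter, but (mimicking the bug report) ignores startAfter when it
    // contains non-ASCII characters.
    static List<String> listPage(String startAfter) {
        boolean ignored = startAfter != null
                && !startAfter.chars().allMatch(c -> c < 128);
        List<String> page = new ArrayList<>();
        for (String name : DIR) {
            if (!ignored && startAfter != null && name.compareTo(startAfter) <= 0) {
                continue; // skip entries at or before the resume point
            }
            page.add(name);
            if (page.size() == PAGE_SIZE) break;
        }
        return page;
    }

    public static void main(String[] args) {
        String startAfter = null;
        for (int i = 0; i < 3; i++) {
            List<String> page = listPage(startAfter);
            System.out.println("page " + i + ": " + page);
            if (page.isEmpty()) break;
            // Last entry of this page becomes the next startAfter.
            startAfter = page.get(page.size() - 1);
        }
        // Once "澳门.tst" is the last entry of a page, the next request
        // restarts from the beginning instead of returning an empty page,
        // so the paging loop never terminates on its own.
    }
}
```

Run as-is, the third request returns page one again, which is exactly the restart-from-the-beginning behavior described above.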

@chrismackey (Author)

This can be seen in the Hadoop client, which uses this SDK: listing an Azure Data Lake directory containing 4001 files with unicode names (one more than the hard-coded page size of 4000 entries) fails:

```
hdfs dfs -ls /eoneill/Test5

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:348)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._getText2(UTF8StreamJsonParser.java:320)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:274)
	at com.microsoft.azure.datalake.store.Core.listStatus(Core.java:859)
	at com.microsoft.azure.datalake.store.ADLStoreClient.enumerateDirectoryInternal(ADLStoreClient.java:525)
	at com.microsoft.azure.datalake.store.ADLStoreClient.enumerateDirectory(ADLStoreClient.java:504)
	at com.microsoft.azure.datalake.store.ADLStoreClient.enumerateDirectory(ADLStoreClient.java:368)
	at org.apache.hadoop.fs.adl.AdlFileSystem.listStatus(AdlFileSystem.java:473)
	at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268)
	at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:373)
	at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:220)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:297)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:356)
```

The process takes a while, then throws an OutOfMemoryError as it constantly rescans the same 4000 files over and over again. It works fine with 3900 files.
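One unconfirmed hypothesis for the behavior: if the startAfter value is sent to the backing listAfter query parameter without percent-encoding its UTF-8 bytes, the server could see a mangled value it cannot match and fall back to listing from the start. The snippet below is only a diagnostic sketch showing what a correctly percent-encoded value looks like; it is not the SDK's code and does not confirm the root cause.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Diagnostic sketch: what the startAfter value should look like if it were
// correctly percent-encoded as UTF-8 before being placed in a query string.
public class EncodeCheck {
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // ASCII names pass through unchanged; unicode names must be
        // percent-encoded byte by byte.
        System.out.println(encode("a.tst"));
        System.out.println(encode("澳门.tst"));
    }
}
```

If an unencoded (or doubly-encoded) value is what actually goes over the wire, that would explain why ASCII startAfter values work while unicode ones are ignored; capturing the raw HTTP request would confirm or rule this out.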
