Running MLBoxes on Windows machines. #136

Open
sergey-serebryakov opened this issue Nov 10, 2020 · 7 comments

@sergey-serebryakov
Contributor

sergey-serebryakov commented Nov 10, 2020

Docker and other MLCommons-Box runners assume they run in a Linux environment. Several updates are required to support Windows machines as well. Let's use this thread to track what is required and to document the process of running boxes on Windows.

How to run Docker-based MLBoxes on Windows machines?

  • Do this ...
  • Do that ...

Fixed:

To be fixed:

  • The docker inspect command redirects its output to /dev/null, which fails on Windows with:
    Could not find a part of the path 'C:\dev\null'
    
    It seems the /dev/null redirect should either be removed on Windows, or the Docker runner needs to detect which shell it runs in (cmd or PowerShell): cmd uses NUL, while PowerShell uses $null.
  • The function that creates mount points needs to be updated. Currently, for file parameters, it embeds the full Windows host path in the container path:
    mounts:
        C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters: '/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters'
    
  • Paths on a command line need to be quoted.
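A minimal sketch for the /dev/null item, assuming the runner can call Docker through Python's `subprocess` module rather than a shell string: discarding output with `subprocess.DEVNULL` removes the `> /dev/null 2>&1` redirect from the command altogether, so the runner no longer needs to know whether it was started from cmd, PowerShell, or a Linux shell. `inspect_command` and `docker_image_exists` are hypothetical names, not functions from `mlcommons_box_docker`.

```python
import subprocess
from typing import List


def inspect_command(image: str, docker_exe: str = "docker") -> List[str]:
    """Build `docker inspect --type=image <image>` as an argument list (no shell)."""
    return [docker_exe, "inspect", "--type=image", image]


def docker_image_exists(image: str, docker_exe: str = "docker") -> bool:
    """Return True if Docker reports that `image` exists locally.

    Hypothetical helper: stdout/stderr are discarded with subprocess.DEVNULL,
    so no null-device name (/dev/null, NUL, $null) appears in the command,
    whichever shell launched the runner.
    """
    try:
        result = subprocess.run(
            inspect_command(image, docker_exe),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The docker executable itself is not on PATH.
        return False
    # `docker inspect` exits with a non-zero status when the image is missing.
    return result.returncode == 0
```

The runner would then pull the image (the 'configure' phase) only when `docker_image_exists("serebrya/mlbox_mnist:0.0.2")` returns `False`.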
@hshaikusa

Another error:

Command issued for the mnist example:

C:\mlperf\mlbox_11062020\box_examples\mnist> docker run --rm --net=host --privileged=true --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data:/mlbox_io0/data --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/download_logs:/mlbox_io1/download_logs serebrya/mlbox_mnist:0.0.2 download --data_dir=/mlbox_io0/data --log_dir=/mlbox_io1/download_logs

Here is the error:

2020-11-10 16:58:42.772479: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-11-10 16:58:42.772697: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-11-10 16:58:42.772714: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

@sergey-serebryakov
Contributor Author

@hshaikusa These messages are OK: they are warnings (note the W prefix), not errors. When no GPUs are available, TF should fall back to the CPU compute backend. I see these messages on Linux machines as well.

@hshaikusa

hshaikusa commented Nov 12, 2020

@sergey-serebryakov, OK.
Here is another error I am facing for mnist:

Command:
C:\mlperf\mlbox_11062020\box_examples\mnist> mlcommons_box_docker run --mlbox=. --platform=platforms/docker.yaml --task=run/train.yaml

Outcome:

MLBox(root=C:\mlperf\mlbox_11062020\box_examples\mnist, name=mnist, version=0.1.0, task=MLBoxTask(inputs={'data_dir': 'directory', 'parameters_file': 'file'}, outputs={'log_dir': 'directory', 'model_dir': 'directory'}), invoke=MLBoxInvoke(task_name=train, input_binding={'data_dir': '$WORKSPACE/data', 'parameters_file': '$WORKSPACE/parameters/default.parameters.yaml'}, output_binding={'log_dir': '$WORKSPACE/train_logs', 'model_dir': '$WORKSPACE/model'}), platform=<mlcommons_box.common.objects.platform_config.PlatformConfig object at 0x0000015A78854F48>)
docker inspect --type=image serebrya/mlbox_mnist:0.0.2 > /dev/null 2>&1
The system cannot find the path specified.
Docker image (serebrya/mlbox_mnist:0.0.2) does not exist. Running 'configure' phase.
docker pull serebrya/mlbox_mnist:0.0.2
0.0.2: Pulling from serebrya/mlbox_mnist
Digest: sha256:75667646473cda957bd23b52b6f660fb462986d7776d323a654ae59269ce02b9
Status: Image is up to date for serebrya/mlbox_mnist:0.0.2
docker.io/serebrya/mlbox_mnist:0.0.2
mounts={'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data': '/mlbox_io0/data', 'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters': '/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters', 'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/train_logs': '/mlbox_io2/train_logs', 'C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/model': '/mlbox_io3/model'}, args=['train', '--data_dir=/mlbox_io0/data', '--parameters_file=/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters/default.parameters.yaml', '--log_dir=/mlbox_io2/train_logs', '--model_dir=/mlbox_io3/model']
docker run --rm --net=host --privileged=true --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data:/mlbox_io0/data --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters:/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/train_logs:/mlbox_io2/train_logs --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/model:/mlbox_io3/model serebrya/mlbox_mnist:0.0.2 train --data_dir=/mlbox_io0/data --parameters_file=/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters/default.parameters.yaml --log_dir=/mlbox_io2/train_logs --model_dir=/mlbox_io3/model

docker: Error response from daemon: invalid mode: \mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters.
See 'docker run --help'.
Traceback (most recent call last):
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\envs\mlbox_11062020\Scripts\mlcommons_box_docker.exe\__main__.py", line 7, in <module>
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\mlcommons_box_docker\__main__.py", line 45, in run
    runner.run()
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\mlcommons_box_docker\docker_run.py", line 72, in run
    self._run_or_die(cmd)
  File "c:\programdata\anaconda3\envs\mlbox_11062020\lib\site-packages\mlcommons_box_docker\docker_run.py", line 117, in _run_or_die
    raise RuntimeError('Command failed: {}'.format(cmd))
RuntimeError: Command failed: docker run --rm --net=host --privileged=true --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/data:/mlbox_io0/data --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters:/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/train_logs:/mlbox_io2/train_logs --volume C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/model:/mlbox_io3/model serebrya/mlbox_mnist:0.0.2 train --data_dir=/mlbox_io0/data --parameters_file=/mlbox_io1/C:\mlperf\mlbox_11062020\box_examples\mnist\workspace/parameters/default.parameters.yaml --log_dir=/mlbox_io2/train_logs --model_dir=/mlbox_io3/model
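The failing command is logged as one long string. A sketch of building it as an argument list instead, which also covers the "Paths on a command line need to be quoted" item: when `subprocess.run` receives a list on Windows, it joins it with `subprocess.list2cmdline`, which wraps any argument containing a space or tab in quotes. `docker_run_command` is a hypothetical helper, the flags are copied from the log, and the host path `C:\My Data\workspace\data` is an invented example chosen because it contains a space.

```python
import subprocess
from typing import Dict, List


def docker_run_command(image: str, mounts: Dict[str, str], args: List[str]) -> List[str]:
    """Build a `docker run` argument list from host->container mounts.

    Hypothetical helper. Each argument stays a separate list element, so
    host paths never need manual quoting by the runner.
    """
    cmd = ["docker", "run", "--rm", "--net=host", "--privileged=true"]
    for host_dir, container_dir in mounts.items():
        cmd += ["--volume", f"{host_dir}:{container_dir}"]
    return cmd + [image] + args


mounts = {r"C:\My Data\workspace\data": "/mlbox_io0/data"}
cmd = docker_run_command(
    "serebrya/mlbox_mnist:0.0.2", mounts, ["download", "--data_dir=/mlbox_io0/data"]
)
# Show the Windows command line subprocess would build from this list;
# the --volume value is quoted because the host path contains a space.
print(subprocess.list2cmdline(cmd))
```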

@sergey-serebryakov
Contributor Author

@hshaikusa Thanks, there's one more issue to fix, related to how mount points are constructed. I updated the first message in this thread.

I cannot run Docker on my Windows laptop (probably due to McAfee). I asked our admins to allocate a Windows virtual instance that I can use for testing.
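The `invalid mode` error in the log above is consistent with the mount-point bug: Docker splits a `--volume` value on colons, so the `C:` embedded in the container path `/mlbox_io1/C:\mlperf\...` appears to make it read the remainder as the mount mode. A minimal sketch of a fix, assuming the runner keeps the behavior visible in the log (directories are mounted directly, files through their parent directory): the container directory is built from the mounted directory's name only. `make_mount` is a hypothetical helper, not the runner's actual function; it uses `PureWindowsPath` so the mixed `\` and `/` separators seen in the log are parsed correctly even when the code is tested on Linux.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath
from typing import Tuple


def make_mount(host_path: str, index: int, is_file: bool,
               windows: bool = os.name == "nt") -> Tuple[str, str, str]:
    """Return (host_dir, container_dir, container_path) for one task parameter.

    Hypothetical helper. Directories are mounted as-is and files through their
    parent directory; the container side uses only the mounted directory's
    name, so a drive letter such as 'C:' never ends up in the container path.
    """
    path = PureWindowsPath(host_path) if windows else PurePosixPath(host_path)
    host_dir = path.parent if is_file else path
    container_dir = PurePosixPath(f"/mlbox_io{index}") / host_dir.name
    container_path = container_dir / path.name if is_file else container_dir
    return str(host_dir), str(container_dir), str(container_path)
```

For the mnist parameters file this yields the mount `C:\mlperf\mlbox_11062020\box_examples\mnist\workspace\parameters:/mlbox_io1/parameters` and the argument `--parameters_file=/mlbox_io1/parameters/default.parameters.yaml`. Docker's `--mount type=bind,source=...,target=...` syntax, which separates fields with `,` and `=` rather than `:`, may also be worth checking.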

@swiftdiaries
Contributor

I think we might need to support Windows-specific file path construction. For now, while we work on stabilizing the code, a workaround could be to use WSL and add instructions for that.
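If the WSL route is taken, Windows paths have to be translated into the form WSL uses, where drive `C:` is mounted under `/mnt/c` by default (the root is configurable in the `[automount]` section of `/etc/wsl.conf`, and WSL also ships a `wslpath` utility for this conversion). A minimal pure-Python sketch, assuming the default `/mnt` root; `to_wsl_path` is a hypothetical helper:

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_wsl_path(windows_path: str, mount_root: str = "/mnt") -> str:
    """Translate an absolute drive-letter Windows path to its WSL form under mount_root.

    Hypothetical helper; assumes WSL's default automount root, /mnt.
    """
    path = PureWindowsPath(windows_path)
    # Only absolute drive-letter paths are handled; UNC and relative paths are rejected.
    if len(path.drive) != 2 or not path.root:
        raise ValueError(f"expected an absolute drive-letter path, got {windows_path!r}")
    drive_letter = path.drive[0].lower()
    # parts[0] is the anchor (drive plus root); the rest are directory/file names.
    return str(PurePosixPath(mount_root, drive_letter, *path.parts[1:]))
```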

@sergey-serebryakov
Contributor Author

Update: I got access to a Windows server and was able to install Docker. I should be able to provide a fix for Windows systems (local Docker runner) next week.

@hshaikusa

@sergey-serebryakov Cool, looking forward to the fixes. Please plan to push them to PyPI once you are done with your own validation. I would like to validate them as an outsider who downloads them following the instructions and plays with them.

relja128 added this to the Backlog milestone on Oct 22, 2021