[SlidesLive] Use token for JSON retrieval and improve metadata extraction #29958

Open
wants to merge 2 commits into
base: master
from

Conversation

dirkf
Contributor

@dirkf dirkf commented Sep 13, 2021


In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

The SlidesLive extractor now needs a player_token query parameter in the JSON URL, whose value is given by the data-player-token attribute in the HTML element (<div>) with id="player".
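
A minimal sketch of the token lookup described above, using only the standard library rather than youtube-dl's own helpers. The function names `find_player_token` and `build_player_url` and the sample HTML are hypothetical, and the endpoint is assumed to be the player URL quoted later in this conversation; the actual extractor code may differ.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint, taken from the player URL quoted later in this thread.
PLAYER_BASE = 'https://ben.slideslive.com/player/'


def find_player_token(webpage):
    """Return the data-player-token of the <div> with id="player", or None."""
    # Locate the opening <div> tag whose id is "player", in any attribute order.
    tag = re.search(r'<div\b[^>]*\bid=(["\'])player\1[^>]*>', webpage)
    if not tag:
        return None
    token = re.search(r'\bdata-player-token=(["\'])(.*?)\1', tag.group(0))
    return token.group(2) if token else None


def build_player_url(video_id, token):
    """Append the token as the player_token query parameter."""
    return PLAYER_BASE + video_id + '?' + urlencode({'player_token': token})


# Hypothetical page fragment, not copied from the real site.
sample = '<div id="player" data-player-token="abc123"></div>'
print(build_player_url('38972123', find_player_token(sample)))
```

A regular expression is used here only to keep the sketch dependency-free; a real extractor would more likely rely on the project's existing HTML helpers.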

Also, some of the expected metadata (e.g. the timestamp) wasn't being collected.

Resolves #29954.
May resolve #30881.

@dirkf dirkf force-pushed the df-slideslive-token-patch branch from 56c6a1c to 666a963 Compare September 13, 2021 14:04
@dirkf
Contributor Author

dirkf commented Sep 13, 2021

The second download test fails (E) because ffmpeg is needed to download m3u8. Isn't ffmpeg in the CI VMs?

@pukkandan
Contributor

pukkandan commented Sep 15, 2021

You are supposed to just use skip_download when downloading from manifests. Even if you have ffmpeg, the MD5 may not be consistent, making the test useless.

@dirkf
Contributor Author

dirkf commented Sep 15, 2021

Thanks, that makes sense, in that it accords with what I've seen. The second test didn't have skip_download, though (as the diffs show), so I wonder if it ever passed before.

@Wu-Chenyang

> The second test didn't have skip_download, though (as the diffs show), so I wonder if it ever passed before.

I think he means that you should add a skip_download to the second test.

@dirkf
Contributor Author

dirkf commented Sep 20, 2021

Indeed, not running the failing part of the test would stop it failing.

The point is that this test as originally written didn't skip the download test (see git blame) and so I wondered whether it ever did pass when it was written.

@pukkandan
Contributor

The download test with ffmpeg passes on the machine the test was written on. Once the ffmpeg version changes, it may no longer pass. So I guess the maintainers never tested it 🤷

@dirkf
Contributor Author

dirkf commented Sep 20, 2021

Is this ffmpeg version dependency because different versions may download different-sized initial pieces of the video, which should be addressed here?

Or is it that different versions of ffmpeg may reassemble the video in equivalent but bit-different ways? In which case we should be testing some invariant(s) rather than the MD5 of the download.

@pukkandan
Contributor

I am not entirely sure. It's just something I have noticed from experience

btw, the link in ur comment is broken

@dirkf
Contributor Author

dirkf commented Sep 20, 2021

> btw, the link in ur comment is broken

Apparently, to put an anchor in GH Markdown you say:

<a id="anchor-name">text to be target of anchor</a>

And then this makes an anchor with the actual HTML name user-content-anchor-name, which you therefore link to with something like [link text](page_url#user-content-anchor-name)

Who could guess? Anyway, better now.

@kumuji

kumuji commented Dec 6, 2021

Hi! Thank you for fixing the issue with slideslive!

I am trying to download my video I recorded for a conference: https://slideslive.com/38972123/mix3d-outofcontext-data-augmentation-for-3d-scenes
On the master branch I was having an issue with authorization (401), but with your commit it is solved.

However, now I get an error related to parsing JSON:

youtube_dl.utils.ExtractorError: 38972123: Failed to parse JSON  (caused by JSONDecodeError('Expecting value: line 1 column 1 (char 0)')); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Do you have a clue where I should look to solve it?

@dirkf
Contributor Author

dirkf commented Dec 6, 2021

The supposed JSON downloaded from 'https://ben.slideslive.com/player/' + video_id turns out to be a playlist (#EXTM3U ...). If you download this, extract the slides.json URL from the playlist, and download that, you get something like this:

https://cdn.slideslive.com/data/presentations/38972123/v2/slides.json?1638065435

{
  u'slides': [
    {
      u'image': {
        u'name': u'X52r__1637863423__0000__Adfi'
      },
      u'type': u'image',
      u'time': 0
    },
    {
      u'video': {
        u'service': u'yoda',
        u'export_as_video': True,
        u'muted': True,
        u'duration_ms': 21280,
        u'resolution': {
          u'width': 1920,
          u'height': 1080
        },
        u'id': u'vvzVOxYp_ynO'
      },
      u'type': u'video',
      u'time': 4440
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0002__tkva'
      },
      u'type': u'image',
      u'time': 25720
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0003__QKb1'
      },
      u'type': u'image',
      u'time': 30520
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0004__nShc'
      },
      u'type': u'image',
      u'time': 39240
    },
    {
      u'video': {
        u'service': u'yoda',
        u'export_as_video': True,
        u'muted': True,
        u'duration_ms': 2880,
        u'resolution': {
          u'width': 1920,
          u'height': 1080
        },
        u'id': u'dGeEm6bx_0e1'
      },
      u'type': u'video',
      u'time': 43240
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0006__Czuy'
      },
      u'type': u'image',
      u'time': 46120
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0007__POUv'
      },
      u'type': u'image',
      u'time': 52880
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0008__3R0E'
      },
      u'type': u'image',
      u'time': 57240
    },
    {
      u'video': {
        u'service': u'yoda',
        u'export_as_video': True,
        u'muted': True,
        u'duration_ms': 14320,
        u'resolution': {
          u'width': 1920,
          u'height': 1080
        },
        u'id': u'Uk9xX-x5Muo4'
      },
      u'type': u'video',
      u'time': 65480
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0010__1LjO'
      },
      u'type': u'image',
      u'time': 79800
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0011__ueTK'
      },
      u'type': u'image',
      u'time': 82320
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0012__QGwP'
      },
      u'type': u'image',
      u'time': 86920
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0013__HuJC'
      },
      u'type': u'image',
      u'time': 88520
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0014__OBFk'
      },
      u'type': u'image',
      u'time': 90240
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0015__5oLL'
      },
      u'type': u'image',
      u'time': 93440
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0016__PIRH'
      },
      u'type': u'image',
      u'time': 100920
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0017__WXeP'
      },
      u'type': u'image',
      u'time': 107120
    },
    {
      u'video': {
        u'service': u'yoda',
        u'export_as_video': True,
        u'muted': True,
        u'duration_ms': 9200,
        u'resolution': {
          u'width': 1920,
          u'height': 1080
        },
        u'id': u'DYmVED0Js_gb'
      },
      u'type': u'video',
      u'time': 122240
    },
    {
      u'image': {
        u'name': u'X52r__1637863423__0019__YgcM'
      },
      u'type': u'image',
      u'time': 131440
    }
  ],
  u'slide_qualities': [
    u'big',
    u'medium'
  ]
}
As a test I extracted the service_name (`.service`) and service_id (`.id`) from `.slides[1].video` in the above and got this:

[info] Available formats for vvzVOxYp_ynO:
format code           extension  resolution note
hls-group_A1-audio_5  mp4        audio only 
dash-5                m4a        audio only [eng] DASH audio  128k , m4a_dash container, mp4a.40.2 (44100Hz)
dash-1                mp4        416x234    [eng] DASH video  145k , mp4_dash container, avc1.4d400d, video only
hls-275               mp4        416x234     275k , avc1.4d400d, video only
dash-2                mp4        640x360    [eng] DASH video  365k , mp4_dash container, avc1.4d401e, video only
hls-495               mp4        640x360     495k , avc1.4d401e, video only
dash-3                mp4        768x432    [eng] DASH video  730k , mp4_dash container, avc1.4d401e, video only
hls-860               mp4        768x432     860k , avc1.4d401e, video only
dash-0                mp4        1280x720   [eng] DASH video 3000k , mp4_dash container, avc1.4d401f, video only
hls-3130              mp4        1280x720   3130k , avc1.4d401f, video only
dash-4                mp4        1920x1080  [eng] DASH video 4500k , mp4_dash container, avc1.4d4028, video only
hls-4629              mp4        1920x1080  4629k , avc1.4d4028, video only (best)
However, I suppose the other video items are also relevant, and the image data may be as well. Perhaps a playlist should be extracted in this case?

@dteney

dteney commented Jan 6, 2022

Hi @dirkf
Thanks for all your work. Is there currently a way to make this patch work to download SlidesLive audio/video? I'm the original poster of the issue about SlidesLive downloads being broken, so I'm still very interested in this feature. Thanks again!

@dirkf
Contributor Author

dirkf commented Jan 6, 2022

Depending on what type of yt-dl installation you have, you can replace the extractor/slideslive.py file with the PR version.

However, as you appear to have the Windows self-extracting version, it's not so easy. It is reported that WinRAR can manipulate the yt-dl self-extracting archive (~8MB). Or if Python is installed on your Windows system (rather than using the version bundled into the self-extracting archive), you can install yt-dl 2021.12.17 with pip or pip3 and then update the extractor.

When replacing the extractor, any compiled version (*.pyc, *.pyo) needs to be removed as well.
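
As a rough sketch of that clean-up step, the following removes stale bytecode for a single module. The helper name `remove_compiled` is hypothetical, and the directory passed in is assumed to be the installed `youtube_dl/extractor` directory, whose location depends on the installation.

```python
import os


def remove_compiled(extractor_dir, module='slideslive'):
    """Delete stale bytecode for one module; return the removed paths."""
    removed = []
    for root, _dirs, files in os.walk(extractor_dir):
        for name in files:
            # Matches slideslive.pyc / slideslive.pyo (Python 2 layout) and
            # slideslive.cpython-39.pyc inside __pycache__ (Python 3 layout).
            if name.startswith(module + '.') and name.endswith(('.pyc', '.pyo')):
                path = os.path.join(root, name)
                os.remove(path)
                removed.append(path)
    return removed
```

With a pip installation, the directory can usually be found next to the package's `__init__.py`, i.e. under the directory containing `youtube_dl.__file__`.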

@dteney

dteney commented Jan 7, 2022

Thanks @dirkf, I'm fine with that. I was referring to the latest issue, "Failed to parse JSON", reported above. I get the same thing. Example below.

python -m youtube_dl --verbose -x --audio-format mp3 https://slideslive.com/38957205/accessibility-hci-ml-papers
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '-x', '--audio-format', 'mp3', 'https://slideslive.com/38957205/accessibility-hci-ml-papers']
[debug] Encodings: locale cp1252, fs utf-8, out utf-8, pref cp1252
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.9.9 (CPython) - Windows-10-10.0.18363-SP0
[debug] exe versions: none
[debug] Proxy map: {}
[SlidesLive] 38957205: Downloading webpage
[SlidesLive] 38957205: Downloading JSON metadata
ERROR: 38957205: Failed to parse JSON (caused by JSONDecodeError('Expecting value: line 1 column 1 (char 0)')); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

@dirkf
Contributor Author

dirkf commented Jan 7, 2022

Ah, I understand.

Let's assume that the unparsable JSON is always a .m3u8 playlist pointing to JSON with a format similar to the one I quoted, i.e. it has a slides list, each of whose items is either an image or a video, tagged with a time, presumably the time from the start of the presentation.

So is the right thing just to extract a playlist of all the videos in time order?
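
A minimal sketch of that idea, assuming the slides JSON always has the shape quoted above. The function name `video_entries` and the output keys `start_ms` and `service` are made up for illustration and are not youtube-dl info-dict fields.

```python
def video_entries(slides_json):
    """Return the video slides as playlist-style entries, ordered by start time."""
    videos = [s for s in slides_json.get('slides', []) if s.get('type') == 'video']
    videos.sort(key=lambda s: s.get('time', 0))
    return [{
        'id': s['video']['id'],
        'service': s['video'].get('service'),
        'start_ms': s.get('time'),
        'duration_ms': s['video'].get('duration_ms'),
    } for s in videos]


# Items abridged from the slides.json quoted above, deliberately out of order.
sample = {'slides': [
    {'type': 'video', 'time': 43240,
     'video': {'service': 'yoda', 'id': 'dGeEm6bx_0e1', 'duration_ms': 2880}},
    {'type': 'image', 'time': 0, 'image': {'name': 'X52r__1637863423__0000__Adfi'}},
    {'type': 'video', 'time': 4440,
     'video': {'service': 'yoda', 'id': 'vvzVOxYp_ynO', 'duration_ms': 21280}},
]}
print([e['id'] for e in video_entries(sample)])  # ['vvzVOxYp_ynO', 'dGeEm6bx_0e1']
```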

@dteney

dteney commented Jan 7, 2022

> slides list each of whose items is either an image or a video, tagged with a time, presumably the time from the start of the presentation.
> So is the right thing just to extract a playlist of all the videos in time order?

I think that would be the right thing to do. Looking at the `duration_ms` fields, I'm guessing each chunk corresponds to a portion of the video to be synced with a specific slide. That did not use to be the case, so I wonder why it changed, given that the website has no new functionality compared to ~6 months ago. Anyway, thanks again @dirkf.

@dteney

dteney commented Jan 28, 2022

Anyone (@dirkf or others) keen to implement a fix for this? I can offer a personal donation to the developer or the project if anyone can bump this up in the list of request priorities :)

@vojtad

vojtad commented Feb 1, 2022

CTO from SlidesLive here.

When you fetch 'https://ben.slideslive.com/player/' + presentation_id using the player token, you get a custom M3U8 playlist which contains the video service (#EXT-SL-VOD-VIDEO-SERVICE-NAME) and ID (#EXT-SL-VOD-VIDEO-SERVICE-ID) for the actual video, and a link to a JSON file (#EXT-SL-VOD-SLIDES-JSON-URL) which contains information about all the slides.

When the video service is youtube, you can use the usual YouTube implementation to download the video with that ID.

When the video service is yoda, you can use one of the servers from #EXT-SL-VOD-VIDEO-SERVERS (preferably always the first one) to download that particular video using the following URL: https://$VIDEO_SERVER/$VIDEO_ID/master.m3u8, which is an Apple HLS playlist.
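
A rough sketch of how a client might read these tags. Two details are assumptions, since they are not spelled out here: that each tag uses the usual HLS `#NAME:value` form, and that the server list is comma-separated. The helper names, the sample playlist body, the video ID and the host names are all made up for illustration.

```python
def parse_sl_playlist(text):
    """Collect the #EXT-SL-VOD-* tags of the custom playlist into a dict."""
    if not text.lstrip().startswith('#EXTM3U'):
        raise ValueError('not an M3U8 playlist')
    tags = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: tags follow the usual HLS "#NAME:value" shape.
        if line.startswith('#EXT-SL-VOD-') and ':' in line:
            name, value = line[1:].split(':', 1)
            tags[name] = value.strip()
    return tags


def yoda_master_url(tags):
    """Build https://$VIDEO_SERVER/$VIDEO_ID/master.m3u8 from the first server."""
    # Assumption: the servers are a comma-separated list of host names.
    servers = [s.strip() for s in tags['EXT-SL-VOD-VIDEO-SERVERS'].split(',') if s.strip()]
    return 'https://%s/%s/master.m3u8' % (servers[0], tags['EXT-SL-VOD-VIDEO-SERVICE-ID'])


# Made-up playlist body; the video ID and host names are placeholders.
sample = '\n'.join([
    '#EXTM3U',
    '#EXT-SL-VOD-VIDEO-SERVICE-NAME:yoda',
    '#EXT-SL-VOD-VIDEO-SERVICE-ID:abc123XYZ',
    '#EXT-SL-VOD-VIDEO-SERVERS:video1.example.com,video2.example.com',
    '#EXT-SL-VOD-SLIDES-JSON-URL:https://cdn.example.com/slides.json?1',
])
tags = parse_sl_playlist(sample)
if tags['EXT-SL-VOD-VIDEO-SERVICE-NAME'] == 'yoda':
    print(yoda_master_url(tags))
```

Splitting on the first colon only keeps URL values such as the slides JSON link intact.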

The slides JSON contains images and videos. These are static slides and animations. Video slides are a little trickier to download, and right now there isn't a straightforward way to do that, unfortunately. Each slide has a time field, which is a timestamp in milliseconds from the presentation start at which it should appear. Video slides also have their duration in milliseconds in the video.duration_ms field.
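
To illustrate how the time and video.duration_ms fields could be combined, here is a small sketch that turns a slide list into display intervals. It assumes that a slide stays on screen until the next one appears, which is a guess about player behaviour and not something stated here; the function name `slide_timeline` is hypothetical.

```python
def slide_timeline(slides):
    """Return (start_ms, end_ms, type) per slide; end_ms is None when unknown."""
    ordered = sorted(slides, key=lambda s: s['time'])
    timeline = []
    for i, slide in enumerate(ordered):
        start = slide['time']
        if i + 1 < len(ordered):
            # Assumption: a slide stays up until the next one is due to appear.
            end = ordered[i + 1]['time']
        elif slide['type'] == 'video':
            # Last slide: fall back to the video's own duration.
            end = start + slide['video']['duration_ms']
        else:
            end = None
        timeline.append((start, end, slide['type']))
    return timeline


# First three slides of the JSON quoted earlier in this thread, abridged.
sample_slides = [
    {'type': 'image', 'time': 0, 'image': {'name': 'X52r__1637863423__0000__Adfi'}},
    {'type': 'video', 'time': 4440,
     'video': {'id': 'vvzVOxYp_ynO', 'duration_ms': 21280}},
    {'type': 'image', 'time': 25720, 'image': {'name': 'X52r__1637863423__0002__tkva'}},
]
print(slide_timeline(sample_slides))
```

In this sample the video slide starts at 4440 ms and lasts 21280 ms, which ends exactly where the next slide begins at 25720 ms.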

However, we are in the process of upgrading our infrastructure a little bit right now. I cannot promise any kind of compatibility, unfortunately. Also, the custom M3U8 playlist might contain slides information directly in the future instead of the JSON link.

Also, I would like to remind everyone to download videos in moderation. Traffic is not free, and if too many videos are downloaded, causing high traffic charges, we would have to start fighting against it. Also, videos are the property of conference organizers.

EDIT: I would love to help with fixing youtube-dl but it will have to wait until we finish the infrastructure upgrade.

@dirkf
Contributor Author

dirkf commented Feb 1, 2022

Your contribution is highly appreciated, as is the reminder to respect the site's resources. When a conference user archives a session for private use it may well save the site the cost of serving that session repeatedly whenever the user reviews it. And while wishing the site success and longevity, circumstances might lead to the session being unavailable online.

@dteney

dteney commented Feb 1, 2022

> Also, I would like to remind everyone to download videos in moderation. Traffic is not free, and if too many videos are downloaded, causing high traffic charges, we would have to start fighting against it. Also, videos are the property of conference organizers.
>
> EDIT: I would love to help with fixing youtube-dl but it will have to wait until we finish the infrastructure upgrade.

This is great to hear, thanks for getting involved in the discussion @vojtad.

You may be interested to hear that my use case is to download conference talks in audio format for offline/>1x listening (e.g. with an MP3 player). The possible time savings for academics make this a key feature of conferences turning online. Conferences that choose YouTube as a host make this effortless. So I think it's indeed in SlidesLive's interest to make this possible (perhaps as a built-in feature?).

Successfully merging this pull request may close these issues.

Slideslive Downloading from slideslive.com is broken
6 participants