[JSInterp] Back-port JS interpreter upgrade from yt-dlp PR #1437 #30188

Closed
wants to merge 5 commits into from


dirkf
Contributor

@dirkf dirkf commented Nov 2, 2021

Please follow the guide below


Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

As part of the PR "[youtube] Fix throttling by decrypting n-sig", @pukkandan implemented a more capable version of the JS interpreter module. Although it's not needed for the equivalent fix in yt-dl, having the same level of capability in yt-dl would facilitate other back-ports, e.g. if a more complex YouTube challenge has to be decoded in future. It might also suffice for certain other extractors where some JS trickery needs to be emulated.

This PR back-ports the jsinterp.py from yt-dlp/yt-dlp#1437 with changes to make it run in 2.6 <= Python < 3.6 as well.

  • yield from becomes a nested iterator
  • nonlocal is faked by a dummy namespace class
  • f'...' is replaced by various older format syntaxes
  • collections.abc is replaced by compat_collections_abc, added to compat.py.
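
The four rewrites listed above can be sketched in isolation as follows. This is a minimal illustration rather than code from the PR: `flatten`, `make_counter` and `describe` are hypothetical names, and the `try`/`except` import is only an assumption about how a shim like `compat_collections_abc` might be built.

```python
# Assumed shape of a collections.abc shim (the real one lives in compat.py)
try:
    from collections.abc import MutableMapping  # Python 3.3+
except ImportError:
    from collections import MutableMapping  # Python 2


class Nonlocal(object):
    """Empty holder: its attributes stand in for `nonlocal` variables."""


def flatten(stack):
    # Python 3 only: `for scope in stack: yield from scope`
    for scope in stack:
        for item in scope:
            yield item


def make_counter():
    nl = Nonlocal()
    nl.count = 0  # on Python 3: a plain local plus `nonlocal count` in bump()

    def bump():
        # Rebinding an attribute of an enclosing object needs no `nonlocal`
        nl.count += 1
        return nl.count

    return bump


def describe(stack):
    # Python 3.6+: f'LocalNameSpace{stack}'
    return 'LocalNameSpace%s' % (stack, )
```

The one-element tuple in `describe` matters: without it, a tuple-valued `stack` would be unpacked as several format arguments.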

All the tests in the unchanged back-ported test/test_jsinterp.py pass.

The diff below applies to the initial commit 7109df0.

diff yt_dlp/jsinterp.py youtube_dl/jsinterp.py

--- yt_dlp/jsinterp.py
+++ youtube_dl/jsinterp.py
@@ -1,4 +1,5 @@
-from collections.abc import MutableMapping
+from __future__ import unicode_literals
+
 import json
 import operator
 import re
@@ -7,7 +8,16 @@
     ExtractorError,
     remove_quotes,
 )
+from .compat import (
+    compat_collections_abc
+)
+MutableMapping = compat_collections_abc.MutableMapping

+
+class Nonlocal:
+    pass
+
+
 _OPERATORS = [
     ('|', operator.or_),
     ('^', operator.xor),
@@ -60,13 +70,14 @@
 
     def __iter__(self):
         for scope in self.stack:
-            yield from scope
+            for scope_item in iter(scope):
+                yield scope_item
 
     def __len__(self, key):
         return len(iter(self))
 
     def __repr__(self):
-        return f'LocalNameSpace{self.stack}'
+        return 'LocalNameSpace%s' % (self.stack, )
 
 
 class JSInterpreter(object):
@@ -80,12 +91,12 @@
 
     def _named_object(self, namespace, obj):
         self.__named_object_counter += 1
-        name = f'__yt_dlp_jsinterp_obj{self.__named_object_counter}'
+        name = '__youtube_dl_jsinterp_obj%s' % (self.__named_object_counter, )
         namespace[name] = obj
         return name
 
     @staticmethod
-    def _seperate(expr, delim=',', max_split=None):
+    def _separate(expr, delim=',', max_split=None):
         if not expr:
             return
         parens = {'(': 0, '{': 0, '[': 0, ']': 0, '}': 0, ')': 0}
@@ -111,17 +122,17 @@
         yield expr[start:]
 
     @staticmethod
-    def _seperate_at_paren(expr, delim):
-        seperated = list(JSInterpreter._seperate(expr, delim, 1))
-        if len(seperated) < 2:
-            raise ExtractorError(f'No terminating paren {delim} in {expr}')
-        return seperated[0][1:].strip(), seperated[1].strip()
+    def _separate_at_paren(expr, delim):
+        separated = list(JSInterpreter._separate(expr, delim, 1))
+        if len(separated) < 2:
+            raise ExtractorError('No terminating paren {0} in {1}'.format(delim, expr))
+        return separated[0][1:].strip(), separated[1].strip()
 
     def interpret_statement(self, stmt, local_vars, allow_recursion=100):
         if allow_recursion < 0:
             raise ExtractorError('Recursion limit reached')
 
-        sub_statements = list(self._seperate(stmt, ';'))
+        sub_statements = list(self._separate(stmt, ';'))
         stmt = (sub_statements or ['']).pop()
         for sub_stmt in sub_statements:
             ret, should_abort = self.interpret_statement(sub_stmt, local_vars, allow_recursion - 1)
@@ -151,7 +162,7 @@
             return None
 
         if expr.startswith('{'):
-            inner, outer = self._seperate_at_paren(expr, '}')
+            inner, outer = self._separate_at_paren(expr, '}')
             inner, should_abort = self.interpret_statement(inner, local_vars, allow_recursion - 1)
             if not outer or should_abort:
                 return inner
@@ -159,7 +170,7 @@
                 expr = json.dumps(inner) + outer
 
         if expr.startswith('('):
-            inner, outer = self._seperate_at_paren(expr, ')')
+            inner, outer = self._separate_at_paren(expr, ')')
             inner = self.interpret_expression(inner, local_vars, allow_recursion)
             if not outer:
                 return inner
@@ -167,16 +178,16 @@
                 expr = json.dumps(inner) + outer
 
         if expr.startswith('['):
-            inner, outer = self._seperate_at_paren(expr, ']')
+            inner, outer = self._separate_at_paren(expr, ']')
             name = self._named_object(local_vars, [
                 self.interpret_expression(item, local_vars, allow_recursion)
-                for item in self._seperate(inner)])
+                for item in self._separate(inner)])
             expr = name + outer
 
         m = re.match(r'try\s*', expr)
         if m:
             if expr[m.end()] == '{':
-                try_expr, expr = self._seperate_at_paren(expr[m.end():], '}')
+                try_expr, expr = self._separate_at_paren(expr[m.end():], '}')
             else:
                 try_expr, expr = expr[m.end() - 1:], ''
             ret, should_abort = self.interpret_statement(try_expr, local_vars, allow_recursion - 1)
@@ -184,29 +195,32 @@
                 return ret
             return self.interpret_statement(expr, local_vars, allow_recursion - 1)[0]
 
-        m = re.match(r'catch\s*\(', expr)
-        if m:
+        m = re.match(r'(?:(?P<catch>catch)|(?P<for>for)|(?P<switch>switch))\s*\(', expr)
+        md = m.groupdict() if m else {}
+        if md.get('catch'):
             # We ignore the catch block
-            _, expr = self._seperate_at_paren(expr, '}')
+            _, expr = self._separate_at_paren(expr, '}')
             return self.interpret_statement(expr, local_vars, allow_recursion - 1)[0]
 
-        m = re.match(r'for\s*\(', expr)
-        if m:
-            constructor, remaining = self._seperate_at_paren(expr[m.end() - 1:], ')')
+        elif md.get('for'):
+            def raise_constructor_error(c):
+                raise ExtractorError(
+                    'Premature return in the initialization of a for loop in {0!r}'.format(c))
+
+            constructor, remaining = self._separate_at_paren(expr[m.end() - 1:], ')')
             if remaining.startswith('{'):
-                body, expr = self._seperate_at_paren(remaining, '}')
+                body, expr = self._separate_at_paren(remaining, '}')
             else:
                 m = re.match(r'switch\s*\(', remaining)  # FIXME
                 if m:
-                    switch_val, remaining = self._seperate_at_paren(remaining[m.end() - 1:], ')')
-                    body, expr = self._seperate_at_paren(remaining, '}')
+                    switch_val, remaining = self._separate_at_paren(remaining[m.end() - 1:], ')')
+                    body, expr = self._separate_at_paren(remaining, '}')
                     body = 'switch(%s){%s}' % (switch_val, body)
                 else:
                     body, expr = remaining, ''
-            start, cndn, increment = self._seperate(constructor, ';')
+            start, cndn, increment = self._separate(constructor, ';')
             if self.interpret_statement(start, local_vars, allow_recursion - 1)[1]:
-                raise ExtractorError(
-                    f'Premature return in the initialization of a for loop in {constructor!r}')
+                raise_constructor_error(constructor)
             while True:
                 if not self.interpret_expression(cndn, local_vars, allow_recursion):
                     break
@@ -219,22 +233,20 @@
                 except JS_Continue:
                     pass
                 if self.interpret_statement(increment, local_vars, allow_recursion - 1)[1]:
-                    raise ExtractorError(
-                        f'Premature return in the initialization of a for loop in {constructor!r}')
+                    raise_constructor_error(constructor)
             return self.interpret_statement(expr, local_vars, allow_recursion - 1)[0]
 
-        m = re.match(r'switch\s*\(', expr)
-        if m:
-            switch_val, remaining = self._seperate_at_paren(expr[m.end() - 1:], ')')
+        elif md.get('switch'):
+            switch_val, remaining = self._separate_at_paren(expr[m.end() - 1:], ')')
             switch_val = self.interpret_expression(switch_val, local_vars, allow_recursion)
-            body, expr = self._seperate_at_paren(remaining, '}')
+            body, expr = self._separate_at_paren(remaining, '}')
             body, default = body.split('default:') if 'default:' in body else (body, None)
             items = body.split('case ')[1:]
             if default:
-                items.append(f'default:{default}')
+                items.append('default:%s' % (default, ))
             matched = False
             for item in items:
-                case, stmt = [i.strip() for i in self._seperate(item, ':', 1)]
+                case, stmt = [i.strip() for i in self._separate(item, ':', 1)]
                 matched = matched or case == 'default' or switch_val == self.interpret_expression(case, local_vars, allow_recursion)
                 if matched:
                     try:
@@ -245,15 +257,15 @@
                         break
             return self.interpret_statement(expr, local_vars, allow_recursion - 1)[0]
 
-        # Comma seperated statements
-        sub_expressions = list(self._seperate(expr))
+        # Comma separated statements
+        sub_expressions = list(self._separate(expr))
         expr = sub_expressions.pop().strip() if sub_expressions else ''
         for sub_expr in sub_expressions:
             self.interpret_expression(sub_expr, local_vars, allow_recursion)
 
-        for m in re.finditer(rf'''(?x)
-                (?P<pre_sign>\+\+|--)(?P<var1>{_NAME_RE})|
-                (?P<var2>{_NAME_RE})(?P<post_sign>\+\+|--)''', expr):
+        for m in re.finditer(r'''(?x)
+                (?P<pre_sign>\+\+|--)(?P<var1>%(_NAME_RE)s)|
+                (?P<var2>%(_NAME_RE)s)(?P<post_sign>\+\+|--)''' % globals(), expr):
             var = m.group('var1') or m.group('var2')
             start, end = m.span()
             sign = m.group('pre_sign') or m.group('post_sign')
@@ -276,7 +288,7 @@
                 lvar = local_vars[m.group('out')]
                 idx = self.interpret_expression(m.group('index'), local_vars, allow_recursion)
                 if not isinstance(idx, int):
-                    raise ExtractorError(f'List indices must be integers: {idx}')
+                    raise ExtractorError('List indices must be integers: %s' % (idx, ))
                 cur = lvar[idx]
                 val = opfunc(cur, right_val)
                 lvar[idx] = val
@@ -313,20 +325,23 @@
             idx = self.interpret_expression(m.group('idx'), local_vars, allow_recursion)
             return val[idx]
 
+        def raise_expr_error(where, op, exp):
+            raise ExtractorError('Premature {0} return of {1} in {2!r}'.format(where, op, exp))
+
         for op, opfunc in _OPERATORS:
-            seperated = list(self._seperate(expr, op))
-            if len(seperated) < 2:
+            separated = list(self._separate(expr, op))
+            if len(separated) < 2:
                 continue
-            right_val = seperated.pop()
-            left_val = op.join(seperated)
+            right_val = separated.pop()
+            left_val = op.join(separated)
             left_val, should_abort = self.interpret_statement(
                 left_val, local_vars, allow_recursion - 1)
             if should_abort:
-                raise ExtractorError(f'Premature left-side return of {op} in {expr!r}')
+                raise_expr_error('left-side', op, expr)
             right_val, should_abort = self.interpret_statement(
                 right_val, local_vars, allow_recursion - 1)
             if should_abort:
-                raise ExtractorError(f'Premature right-side return of {op} in {expr!r}')
+                raise_expr_error('right-side', op, expr)
             return opfunc(left_val or 0, right_val)
 
         m = re.match(
@@ -334,20 +349,23 @@
             expr)
         if m:
             variable = m.group('var')
-            member = remove_quotes(m.group('member') or m.group('member2'))
+            nl = Nonlocal()
+
+            nl.member = remove_quotes(m.group('member') or m.group('member2'))
             arg_str = expr[m.end():]
             if arg_str.startswith('('):
-                arg_str, remaining = self._seperate_at_paren(arg_str, ')')
+                arg_str, remaining = self._separate_at_paren(arg_str, ')')
             else:
                 arg_str, remaining = None, arg_str
 
             def assertion(cndn, msg):
                 """ assert, but without risk of getting optimized out """
                 if not cndn:
-                    raise ExtractorError(f'{member} {msg}: {expr}')
+                    raise ExtractorError('{0} {1}: {2}'.format(nl.member, msg, expr))
 
             def eval_method():
-                nonlocal member
+                # nonlocal member
+                member = nl.member
                 if variable == 'String':
                     obj = str
                 elif variable in local_vars:
@@ -366,13 +384,13 @@
                 # Function call
                 argvals = [
                     self.interpret_expression(v, local_vars, allow_recursion)
-                    for v in self._seperate(arg_str)]
+                    for v in self._separate(arg_str)]
 
                 if obj == str:
                     if member == 'fromCharCode':
                         assertion(argvals, 'takes one or more arguments')
                         return ''.join(map(chr, argvals))
-                    raise ExtractorError(f'Unsupported string method {member}')
+                    raise ExtractorError('Unsupported string method %s' % (member, ))
 
                 if member == 'split':
                     assertion(argvals, 'takes one or more arguments')
@@ -435,6 +453,7 @@
 
                 if isinstance(obj, list):
                     member = int(member)
+                    nl.member = member
                 return obj[member](argvals)
 
             if remaining:
@@ -449,7 +468,7 @@
             fname = m.group('func')
             argvals = tuple([
                 int(v) if v.isdigit() else local_vars[v]
-                for v in self._seperate(m.group('args'))])
+                for v in self._separate(m.group('args'))])
             if fname in local_vars:
                 return local_vars[fname](argvals)
             elif fname not in self._functions:
@@ -486,12 +505,11 @@
         """ @returns argnames, code """
         func_m = re.search(
             r'''(?x)
-                (?:function\s+%s|[{;,]\s*%s\s*=\s*function|var\s+%s\s*=\s*function)\s*
+                (?:function\s+%(f_n)s|[{;,]\s*%(f_n)s\s*=\s*function|var\s+%(f_n)s\s*=\s*function)\s*
                 \((?P<args>[^)]*)\)\s*
-                (?P<code>\{(?:(?!};)[^"]|"([^"]|\\")*")+\})''' % (
-                re.escape(funcname), re.escape(funcname), re.escape(funcname)),
+                (?P<code>\{(?:(?!};)[^"]|"([^"]|\\")*")+\})''' % {'f_n': re.escape(funcname), },
             self.code)
-        code, _ = self._seperate_at_paren(func_m.group('code'), '}')  # refine the match
+        code, _ = self._separate_at_paren(func_m.group('code'), '}')  # refine the match
         if func_m is None:
             raise ExtractorError('Could not find JS function %r' % funcname)
         return func_m.group('args').split(','), code
@@ -506,7 +524,7 @@
             if mobj is None:
                 break
             start, body_start = mobj.span()
-            body, remaining = self._seperate_at_paren(code[body_start - 1:], '}')
+            body, remaining = self._separate_at_paren(code[body_start - 1:], '}')
             name = self._named_object(
                 local_vars,
                 self.extract_function_from_code(
@@ -523,12 +541,10 @@
         local_vars = global_stack.pop(0)
 
         def resf(args, **kwargs):
-            local_vars.update({
-                **dict(zip(argnames, args)),
-                **kwargs
-            })
+            local_vars.update(dict(zip(argnames, args)))
+            local_vars.update(kwargs)
             var_stack = LocalNameSpace(local_vars, *global_stack)
-            for stmt in self._seperate(code.replace('\n', ''), ';'):
+            for stmt in self._separate(code.replace('\n', ''), ';'):
                 ret, should_abort = self.interpret_statement(stmt, var_stack)
                 if should_abort:
                     break
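
Most of the diff above renames `_seperate` to `_separate`, the helper that splits an expression on a delimiter while skipping delimiters nested inside brackets. The idea can be sketched as below. This is a simplified, hypothetical stand-in and not the function from the PR: it assumes a single-character delimiter and makes no attempt to handle quoted strings.

```python
def separate(expr, delim=',', max_split=None):
    """Yield the pieces of expr split on delim, ignoring nested delimiters."""
    if not expr:
        return
    closers = {')': '(', ']': '[', '}': '{'}
    depth = dict.fromkeys('([{', 0)  # open-bracket count per bracket type
    start, splits = 0, 0
    for idx, char in enumerate(expr):
        if char in depth:
            depth[char] += 1
        elif char in closers:
            depth[closers[char]] -= 1
        # Split only on a delimiter seen while every bracket is balanced
        if char != delim or any(depth.values()):
            continue
        yield expr[start:idx]
        start = idx + 1
        splits += 1
        if max_split and splits >= max_split:
            break
    yield expr[start:]
```

Because the bracket counts are updated before the delimiter test, a closing bracket can itself be the delimiter. `list(separate('{a;b}c', '}', 1))` gives `['{a;b', 'c']`, and dropping the first character of the first piece leaves the block body, which is how `_separate_at_paren` consumes the result.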

@pukkandan
Contributor

There are probably some things that are not handled correctly in my implementation. It works well enough for the n-sig decryption, which is why I committed it, but you may want to review the whole thing.

@dirkf
Contributor Author

dirkf commented Nov 2, 2021

Thanks. What I thought, but good to go anyway. You've obviously hit the things (comma-expressions, etc) that the previous version failed on for the n-sig. One question is whitespace.

I guess just keep adding things into the unit test until it breaks ...

@rofl0r

rofl0r commented Nov 2, 2021

@dirkf do you have a branch that has all your fixes/backports cumulatively ? i need to stick with py2-compatible youtube-dl too.

@dirkf
Contributor Author

dirkf commented Nov 2, 2021

I guess I should but I'm not looking forward to the endless conflict resolutions.

@rofl0r

rofl0r commented Nov 2, 2021

I guess I should but I'm not looking forward to the endless conflict resolutions.

how's that? do most fixes touch upon the same code locations ?
otherwise it would just be a new branch based on youtube-dl master and a couple git cherry-picked commits.

@coletdjnz
Contributor

coletdjnz commented Nov 3, 2021

Note there was another fix applied recently to support a recent player update: yt-dlp/yt-dlp@a1fc7ca

@dirkf dirkf force-pushed the pukkandan-jsinterp-patch branch from a3aaca9 to cabfef7 Compare November 27, 2021 02:13
@Botan626

Could this PR provide a compiled file, like this PR does?

@dirkf
Contributor Author

dirkf commented Nov 29, 2021

At a minimum jsinterp.py and extractor/youtube.py would have to be changed to replace the fix in PR #30184, which makes patching a bit more complex, especially as I haven't actually made the required updates to the latter file yet.

yt-dlp/yt-dlp/@06dfe0a, improve _MATCHING_PARENS
@dirkf
Contributor Author

dirkf commented Feb 16, 2022

Closed by revised #30184.

@dirkf dirkf closed this Feb 16, 2022
@dirkf dirkf mentioned this pull request Feb 17, 2022
11 tasks
@dirkf dirkf deleted the pukkandan-jsinterp-patch branch June 30, 2024 17:57