YoutubeLoader does not work on production environment. #6915

DevDeepakBhattarai · 2024-10-01T09:27:11Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain.js documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain.js rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

"use server";
import "server-only";
import { db } from "@/server/db";
import { z } from "zod";
import { auth } from "@/server/auth";
import { YoutubeLoader } from "@langchain/community/document_loaders/web/youtube";
import { SitemapLoader } from "@langchain/community/document_loaders/web/sitemap";
import { env } from "@/env";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { OpenAIEmbeddings } from "@langchain/openai";
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

const YoutubeSchema = z.object({
  url: z.string().url(),
  type: z.literal("youtube"),
});

const webPageSchema = z.object({
  url: z.string().url(),
  type: z.literal("webpage"),
});

const siteMapSchema = z.object({
  url: z.string().url(),
  type: z.literal("sitemap"),
});

const schema = z.union([YoutubeSchema, webPageSchema, siteMapSchema]);

export async function addKnowledgeFormExternalSource(
  dbiId: string,
  data: z.infer<typeof schema>,
) {

  const parsedData = await schema.safeParseAsync(data);

  if (!parsedData.success) {
    throw new Error("Invalid data");
  }

  const { url, type } = parsedData.data;

  let loader: YoutubeLoader | CheerioWebBaseLoader | SitemapLoader;
  switch (type) {
    case "youtube":
      loader = YoutubeLoader.createFromUrl(url, {
        language: "en",
        addVideoInfo: true,
      });
      break;
    case "webpage":
      // loader = new FireCrawlLoader({
      //   url: url,
      //   apiKey: env.FIRECRAWL_API_KEY,
      //   mode: "scrape",
      // });
      loader = new CheerioWebBaseLoader(url);
      break;
    case "sitemap":
      loader = new SitemapLoader(url);
      break;
  }
  const docs = await loader.load();

  console.log(docs.length, docs);

  if (docs.length < 1) {
    throw new Error("No docs found");
  }

  const embedding = new OpenAIEmbeddings();

  const fileName =
    type === "youtube"
      ? (docs[0]!.metadata?.title as string)
      : type === "webpage"
        ? (docs[0]!.metadata?.source as string)
        : (docs[0]?.metadata?.title as string);

  const fileId = await createFileRecord(fileName, dbiId);
  const pinecone = new Pinecone({
    apiKey: env.PINECONE_API_KEY,
  });
  const pineconeIndex = pinecone.Index("allweone");
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });

  const splitDocs = await splitter.splitDocuments(docs);
  const splitDocsWithMetadata = splitDocs.map((doc) => ({
    ...doc,
    metadata: {
      ...doc.metadata,
      fileId,
    },
  }));

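  // NOTE: `userId` is not defined anywhere in this excerpt; it presumably comes
  // from the session returned by `auth()` in the full application.
  const store = await PineconeStore.fromDocuments([], embedding, {
    pineconeIndex,
    namespace: `${userId}-${dbiId}`,
  });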

  await store.addDocuments(splitDocsWithMetadata, {
    ids: splitDocsWithMetadata.map((_, index) => `${fileId}#${index}`),
  });
}

Error Message and Stack Trace (if applicable)

Error: Failed to get YouTube video transcription: [YoutubeTranscript] 🚨 Transcript is disabled on this video (ejVMXu-u1hs)
at IW.load (/var/task/.next/server/chunks/6531.js:365:290354)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async y (/var/task/.next/server/chunks/1201.js:97:6103)
at async /var/task/.next/server/chunks/6531.js:368:24443
at async eg (/var/task/.next/server/chunks/6531.js:368:24270)
at async /var/task/node_modules/.pnpm/next@14.2.5_@babel+core@7.24.5_@opentelemetry [email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/app-page.runtime.prod.js:16:418
at async rR (/var/task/node_modules/.pnpm/next@14.2.5_@babel+core@7.24.5_@opentelemetry [email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/app-page.runtime.prod.js:15:7978)
at async r7 (/var/task/node_modules/.pnpm/next@14.2.5_@babel+core@7.24.5_@opentelemetry [email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/app-page.runtime.prod.js:18:1144)
at async es (/var/task/node_modules/.pnpm/next@14.2.5_@babel+core@7.24.5_@opentelemetry [email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/server.runtime.prod.js:16:26324)
at async en.responseCache.get.routeKind (/var/task/node_modules/.pnpm/next@14.2.5_@babel+core@7.24.5_@opentelemetry [email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/server.runtime.prod.js:17:1026) {
digest: '2995320257'
}

Description

I am using Next.js and LangChain.js, and hosting my app on Vercel.

I am trying to build a RAG app where I can also upload data from YouTube videos, but I am facing one issue right now.
Whenever I run this code on my local server it works, and I can successfully get the transcript data from YouTube. But whenever I try to do the same thing on the production server, I get the above error.

The same video works in development but not in production.

No environment variable is required for this, so I don't think there is any difference between production and development.

So what is the issue here?
Can anyone help me?

System Info

Next.js 14.2.5
Vercel

Local Server : Windows 11
Node 20.10.0

dosubot · 2024-10-01T09:29:10Z

The error you're encountering is related to the YouTube transcript being disabled for the video in question. This issue has been discussed in a similar thread, "TypeError: Cannot read properties of undefined (reading 'transcriptBodyRenderer') using YoutubeLoader". The error message indicates that the transcript is disabled, which can happen if the video owner has not enabled transcripts.

Here's a suggested workaround:

1. Check Video Settings: Ensure that the video has transcripts enabled. This can vary between environments if the video settings are changed or if different videos are being tested.
2. Use a Forked Package: As a temporary solution, you can use a fork of the youtube-transcript package that addresses similar issues:
```
npm i https://github.com/sinedied/youtube-transcript#a10a073ac325b3b88018f321fa1bc5d62fa69b1c
```
   This fork might handle the transcript fetching differently and could resolve the issue in production.
3. Error Handling: Ensure your code gracefully handles cases where transcripts are not available. You can catch the error and provide a fallback or notification to the user.
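
As a rough illustration of the error-handling suggestion, here is a minimal sketch that converts a failed `load()` call into a value the caller can report to the user. The `DocumentLike` and `LoaderLike` interfaces and the `loadOrExplain` helper are hypothetical names declared locally for this example; they only assume that a loader exposes an async `load()` method, as the loaders in the issue's code do.

```typescript
// Local stand-ins (assumptions) for the document and loader shapes, so the
// sketch does not depend on any LangChain package.
interface DocumentLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface LoaderLike {
  load(): Promise<DocumentLike[]>;
}

type LoadResult =
  | { ok: true; docs: DocumentLike[] }
  | { ok: false; reason: string };

// Hypothetical helper: runs the loader and returns either the documents or a
// human-readable reason instead of letting the error propagate.
async function loadOrExplain(loader: LoaderLike): Promise<LoadResult> {
  try {
    const docs = await loader.load();
    if (docs.length === 0) {
      return { ok: false, reason: "The loader returned no documents." };
    }
    return { ok: true, docs };
  } catch (error) {
    // Keep the original message so it can be logged or shown to the user.
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, reason: message };
  }
}
```

In the server action from the issue, the caller could check `result.ok` and return `result.reason` to the UI rather than throwing.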

If the issue persists, consider checking any differences in network permissions or API rate limits between your local and production environments, as these can sometimes affect API interactions.

To continue talking to Dosu, mention @dosu.

DevDeepakBhattarai · 2024-10-01T09:33:21Z

@dosu No, that is not the issue, because in the development environment I can get the transcript using the same video. So the transcript is not disabled.

dosubot · 2024-10-01T09:34:59Z

If the transcript works in development but not in production, consider these possibilities:

1. Network Differences: Check if there are any network restrictions or firewall settings in the production environment that might block requests to YouTube.
2. API Rate Limits: Ensure that your production environment isn't hitting YouTube's API rate limits. This can differ from development if more requests are being made.
3. Dependencies: Verify that all dependencies, especially youtube-transcript and youtubei.js, are the same in both environments. Differences in versions might cause discrepancies.
4. Error Handling and Logging: Enable detailed logging in production to capture any errors or warnings that might provide more insight. You can use the verbose flag or set LANGCHAIN_TRACING to true for more detailed logs.
5. Environment Variables: Double-check any environment variables that might affect network requests or API keys, even if you believe none are required.

These steps should help identify the root cause of the issue.
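
To make the logging suggestion concrete, here is a small sketch that builds a structured report from a loader failure, so that a production log line can be compared with a local one. The `describeLoadFailure` helper and the `FailureReport` shape are hypothetical. Reading the deployment region from `VERCEL_REGION` is an assumption about the hosting environment; the field falls back to `"unknown"` when the variable is absent.

```typescript
// Hypothetical report shape; adjust the fields to whatever is useful to compare
// between development and production.
interface FailureReport {
  message: string;
  looksLikeTranscriptDisabled: boolean;
  region: string;
  nodeVersion: string;
}

// The runtime is passed in explicitly so the function has no dependency on Node
// globals; in a Node.js server, `process` satisfies this shape.
function describeLoadFailure(
  error: unknown,
  runtime: { env: Record<string, string | undefined>; version: string },
): FailureReport {
  const message = error instanceof Error ? error.message : String(error);
  return {
    message,
    // Matches the wording seen in the stack trace of this issue.
    looksLikeTranscriptDisabled: /transcript is disabled/i.test(message),
    // Assumption: the host exposes its region through VERCEL_REGION.
    region: runtime.env["VERCEL_REGION"] ?? "unknown",
    nodeVersion: runtime.version,
  };
}
```

Inside a `catch` block, the report could be logged with something like `console.error(JSON.stringify(describeLoadFailure(error, process)))`. If the same video produces a "transcript is disabled" report only in production, that points to the environment rather than the video.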


jacoblee93 · 2024-10-03T02:23:36Z

Scraping is generally a bit janky unfortunately - perhaps Vercel's IP is getting blocked?

DevDeepakBhattarai · 2024-10-03T15:22:27Z

Thanks @jacoblee93.
I figured that was the case but now I am clear.

I think I will now use the real YouTube API to get the transcript.

Is there a way to do this in LangChain? It would be great if this feature were implemented.

dosubot bot added the auto:bug label (Related to a bug, vulnerability, unexpected error with an existing feature) Oct 1, 2024

DevDeepakBhattarai closed this as completed Oct 3, 2024
