YoutubeLoader does not work on production environment. #6915

Closed
5 tasks done
DevDeepakBhattarai opened this issue Oct 1, 2024 · 5 comments
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@DevDeepakBhattarai

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

"use server";
import "server-only";
import { db } from "@/server/db";
import { z } from "zod";
import { auth } from "@/server/auth";
import { YoutubeLoader } from "@langchain/community/document_loaders/web/youtube";
import { SitemapLoader } from "@langchain/community/document_loaders/web/sitemap";
import { env } from "@/env";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { OpenAIEmbeddings } from "@langchain/openai";
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

const YoutubeSchema = z.object({
  url: z.string().url(),
  type: z.literal("youtube"),
});

const webPageSchema = z.object({
  url: z.string().url(),
  type: z.literal("webpage"),
});

const siteMapSchema = z.object({
  url: z.string().url(),
  type: z.literal("sitemap"),
});

const schema = z.union([YoutubeSchema, webPageSchema, siteMapSchema]);

export async function addKnowledgeFormExternalSource(
  dbiId: string,
  data: z.infer<typeof schema>,
) {

  const parsedData = await schema.safeParseAsync(data);

  if (!parsedData.success) {
    throw new Error("Invalid data");
  }

  const { url, type } = parsedData.data;

  let loader: YoutubeLoader | CheerioWebBaseLoader | SitemapLoader;
  switch (type) {
    case "youtube":
      loader = YoutubeLoader.createFromUrl(url, {
        language: "en",
        addVideoInfo: true,
      });
      break;
    case "webpage":
      // loader = new FireCrawlLoader({
      //   url: url,
      //   apiKey: env.FIRECRAWL_API_KEY,
      //   mode: "scrape",
      // });
      loader = new CheerioWebBaseLoader(url);
      break;
    case "sitemap":
      loader = new SitemapLoader(url);
      break;
  }
  const docs = await loader.load();

  console.log(docs.length, docs);

  if (docs.length < 1) {
    throw new Error("No docs found");
  }

  const embedding = new OpenAIEmbeddings();

  const fileName =
    type === "youtube"
      ? (docs[0]!.metadata?.title as string)
      : type === "webpage"
        ? (docs[0]!.metadata?.source as string)
        : (docs[0]?.metadata?.title as string);

  const fileId = await createFileRecord(fileName, dbiId);
  const pinecone = new Pinecone({
    apiKey: env.PINECONE_API_KEY,
  });
  const pineconeIndex = pinecone.Index("allweone");
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });

  const splitDocs = await splitter.splitDocuments(docs);
  const splitDocsWithMetadata = splitDocs.map((doc) => ({
    ...doc,
    metadata: {
      ...doc.metadata,
      fileId,
    },
  }));

  const store = await PineconeStore.fromDocuments([], embedding, {
    pineconeIndex,
    namespace: `${userId}-${dbiId}`,
  });

  await store.addDocuments(splitDocsWithMetadata, {
    ids: splitDocsWithMetadata.map((_, index) => `${fileId}#${index}`),
  });
}

Error Message and Stack Trace (if applicable)

Error: Failed to get YouTube video transcription: [YoutubeTranscript] 🚨 Transcript is disabled on this video (ejVMXu-u1hs)
at IW.load (/var/task/.next/server/chunks/6531.js:365:290354)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async y (/var/task/.next/server/chunks/1201.js:97:6103)
at async /var/task/.next/server/chunks/6531.js:368:24443
at async eg (/var/task/.next/server/chunks/6531.js:368:24270)
at async /var/task/node_modules/.pnpm/next@14.2.5_@babel+core@7.24.5_@opentelemetry[email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/app-page.runtime.prod.js:16:418
at async rR (/var/task/node_modules/.pnpm/next@14.2.5_@babel+core@7.24.5_@opentelemetry[email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/app-page.runtime.prod.js:15:7978)
at async r7 (/var/task/node_modules/.pnpm/next@14.2.5_@babel+core@7.24.5_@opentelemetry[email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/app-page.runtime.prod.js:18:1144)
at async es (/var/task/node_modules/.pnpm/next@14.2.5_@babel+core@7.24.5_@opentelemetry[email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/server.runtime.prod.js:16:26324)
at async en.responseCache.get.routeKind (/var/task/node_modules/.pnpm/next@14.2.5_@babel+core@7.24.5_@opentelemetry[email protected][email protected][email protected][email protected]/node_modules/next/dist/compiled/next-server/server.runtime.prod.js:17:1026) {
digest: '2995320257'
}

Description

I am using Next.js and LangChain.js, and hosting my app on Vercel.

I am trying to build a RAG app where I can also upload data from YouTube videos, but I am facing one issue.
When I run this code on my local server, it works and I can successfully get the transcript data from YouTube. But when I do the same thing on the production server, I get the above error.

The same video works in development but not in production.

No environment variable is required for this, so I don't think there is any configuration difference between production and development.

So what is the issue here?
Can anyone help me?

System Info

Next.js 14.2.5
Vercel

Local Server : Windows 11
Node 20.10.0


dosubot bot commented Oct 1, 2024

The error you're encountering is related to the YouTube transcript being disabled for the video in question. This issue has been discussed in a similar thread: TypeError: Cannot read properties of undefined (reading 'transcriptBodyRenderer') using YoutubeLoader. The error message indicates that the transcript is disabled, which can happen if the video owner has not enabled transcripts.

Here's a suggested workaround:

  1. Check Video Settings: Ensure that the video has transcripts enabled. This can vary between environments if the video settings are changed or if different videos are being tested.

  2. Use a Forked Package: As a temporary solution, you can use a fork of the youtube-transcript package that addresses similar issues:

    npm i https://github.com/sinedied/youtube-transcript#a10a073ac325b3b88018f321fa1bc5d62fa69b1c

    This fork might handle the transcript fetching differently and could resolve the issue in production.

  3. Error Handling: Ensure your code gracefully handles cases where transcripts are not available. You can catch the error and provide a fallback or notification to the user.

If the issue persists, consider checking any differences in network permissions or API rate limits between your local and production environments, as these can sometimes affect API interactions.
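
The error-handling suggestion in item 3 can be sketched roughly as follows. This is a minimal illustration, not part of LangChain.js: `LoaderLike`, `LoadedDoc`, `LoadResult`, and `loadSafely` are hypothetical names introduced here, and the only assumption about the loader is that it exposes an async `load()` method, as the loaders in the example code do.

```typescript
// Minimal structural types so the sketch stands alone. They are assumptions
// for illustration; any loader with a compatible async load() fits.
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface LoaderLike {
  load(): Promise<LoadedDoc[]>;
}

// Either the loaded documents or a human-readable failure reason.
type LoadResult =
  | { ok: true; docs: LoadedDoc[] }
  | { ok: false; reason: string };

// Hypothetical helper: wraps loader.load() so a transcript failure becomes
// a value the caller can report instead of an unhandled exception.
async function loadSafely(loader: LoaderLike): Promise<LoadResult> {
  try {
    const docs = await loader.load();
    if (docs.length === 0) {
      return { ok: false, reason: "No documents returned" };
    }
    return { ok: true, docs };
  } catch (err) {
    // Keep the original message so the caller can log or display it.
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, reason };
  }
}
```

In the server action from the issue, the bare `await loader.load()` call could be replaced with `await loadSafely(loader)`, returning `reason` to the user when `ok` is false. This only makes the failure visible; it does not make the transcript request succeed.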


dosubot bot added the auto:bug label on Oct 1, 2024
@DevDeepakBhattarai
Author

@dosu No, that is not the issue, because in the development environment I can get the transcript for the same video. So the transcript is not disabled.


dosubot bot commented Oct 1, 2024

If the transcript works in development but not in production, consider these possibilities:

  1. Network Differences: Check if there are any network restrictions or firewall settings in the production environment that might block requests to YouTube.

  2. API Rate Limits: Ensure that your production environment isn't hitting YouTube's API rate limits. This can differ from development if more requests are being made.

  3. Dependencies: Verify that all dependencies, especially youtube-transcript and youtubei.js, are the same in both environments. Differences in versions might cause discrepancies.

  4. Error Handling and Logging: Enable detailed logging in production to capture any errors or warnings that might provide more insight. You can use the verbose flag or set LANGCHAIN_TRACING to true for more detailed logs [1].

  5. Environment Variables: Double-check any environment variables that might affect network requests or API keys, even if you believe none are required.

These steps should help identify the root cause of the issue.
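
For the logging step (item 4), one option is to record a structured report whenever the loader fails, so that production and development logs can be compared side by side. The sketch below is an assumption-laden illustration: `FailureReport` and `buildFailureReport` are hypothetical names, not LangChain.js or Vercel APIs, and the runtime details are passed in by the caller rather than read from the environment.

```typescript
// Hypothetical shape for a structured log entry; none of these names come
// from LangChain.js or Vercel.
interface FailureReport {
  stage: string; // which step failed, e.g. "youtube-load"
  environment: string; // e.g. "development" or "production"
  runtime: string; // runtime version string supplied by the caller
  message: string; // the error message as thrown
  stack: string | undefined; // stack trace when the thrown value is an Error
  timestamp: string; // ISO 8601 time of the report
}

// Builds a plain object that can be passed to console.error or a log sink.
function buildFailureReport(
  stage: string,
  err: unknown,
  runtime: { environment: string; version: string },
  now: Date = new Date(),
): FailureReport {
  return {
    stage,
    environment: runtime.environment,
    runtime: runtime.version,
    message: err instanceof Error ? err.message : String(err),
    stack: err instanceof Error ? err.stack : undefined,
    timestamp: now.toISOString(),
  };
}
```

A caller on Node could pass `{ environment: process.env.NODE_ENV ?? "unknown", version: process.version }` and log `JSON.stringify(report)` inside the `catch` block around `loader.load()`. Comparing the two environments' reports shows whether the message differs or only the environment does.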


@jacoblee93
Collaborator

Scraping is generally a bit janky unfortunately - perhaps Vercel's IP is getting blocked?

@DevDeepakBhattarai
Author

DevDeepakBhattarai commented Oct 3, 2024

Thanks @jacoblee93.
I suspected that was the case, and now it is clear.

I think I will now use the official YouTube API to get the transcript.

Is there a way to do this in LangChain? It would be great if this feature were implemented.
