FirecrawlLoader web loader not working. #6893

DevDeepakBhattarai · 2024-09-28T07:51:13Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain.js documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain.js rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

export async function addKnowledgeFormExternalSource(
  dbiId: string,
  data: z.infer<typeof schema>,
) {
  const session = await auth();
  if (!session) {
    throw new Error("Unauthorized");
  }

  const userId = session.user.id;

  const parsedData = await schema.safeParseAsync(data);

  if (!parsedData.success) {
    throw new Error("Invalid data");
  }

  const { url, type } = parsedData.data;

  let loader: YoutubeLoader | FireCrawlLoader | SitemapLoader;
  switch (type) {
    case "youtube":
      loader = YoutubeLoader.createFromUrl(url, {
        language: "en",
        addVideoInfo: true,
      });
      break;
    case "webpage":
      loader = new FireCrawlLoader({
        url: url,
        apiKey: env.FIRECRAWL_API_KEY,
        mode: "scrape",
      });
      break;
    case "sitemap":
      loader = new SitemapLoader(url);
      break;
  }
  const docs = await loader.load();

  console.log(docs.length, docs);

  if (docs.length < 1) {
    throw new Error("No docs found");
  }

  const embedding = new OpenAIEmbeddings();

  const fileName =
    type === "youtube"
      ? (docs[0]!.metadata?.title as string)
      : type === "webpage"
        ? (docs[0]!.metadata?.ogTitle as string)
        : (docs[0]?.metadata?.source as string);

  const fileId = await createFileRecord(fileName, dbiId);
  const pinecone = new Pinecone({
    apiKey: env.PINECONE_API_KEY,
  });
  const pineconeIndex = pinecone.Index("allweone");
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });

  const splitDocs = await splitter.splitDocuments(docs);
  const splitDocsWithMetadata = splitDocs.map((doc) => ({
    ...doc,
    metadata: {
      ...doc.metadata,
      fileId,
    },
  }));

  const store = await PineconeStore.fromDocuments([], embedding, {
    pineconeIndex,
    namespace: `${userId}-${dbiId}`,
  });

  await store.addDocuments(splitDocsWithMetadata, {
    ids: splitDocsWithMetadata.map((_, index) => `${fileId}#${index}`),
  });
}

Error Message and Stack Trace (if applicable)

⨯ TypeError: Cannot read properties of undefined (reading 'markdown')
at Array.map ()
at async addKnowledgeFormExternalSource (./src/app/_actions/dbi/addKnowledge.tsx:86:18)

Description

I am trying to build a RAG app that can pull in knowledge from YouTube, websites, and individual webpages.
For webpages I am using Firecrawl, but it gives me this error.

I know the Firecrawl API is working because I can see the activity in the Firecrawl logs.

System Info

pnpm 9.4.0
@langchain/community 0.2.33

dosubot · 2024-09-28T07:52:06Z

The TypeError: Cannot read properties of undefined (reading 'markdown') occurs because the firecrawlDocs array contains an element that is undefined or does not have the markdown property. This can happen if the response from the Firecrawl API does not include the expected data structure.
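The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the loader's actual code: `ScrapedDoc`, `toPageContents`, and `tryMap` are hypothetical names introduced here for illustration, and the array holding `undefined` is assumed to stand in for a response whose `data` field is missing.

```typescript
// Hypothetical shape standing in for a Firecrawl document; not the real SDK type.
interface ScrapedDoc {
  markdown?: string;
  metadata?: Record<string, unknown>;
}

// Mimics an unguarded mapping step that reads `markdown` from every element.
function toPageContents(docs: Array<ScrapedDoc | undefined>): string[] {
  // The non-null assertion satisfies the compiler but does nothing at runtime.
  return docs.map((doc) => doc!.markdown ?? "");
}

// Returns a description of the error thrown for a given input, or null if none.
function tryMap(docs: Array<ScrapedDoc | undefined>): string | null {
  try {
    toPageContents(docs);
    return null;
  } catch (err) {
    return err instanceof Error ? `${err.name}: ${err.message}` : String(err);
  }
}

// An undefined element triggers the TypeError; a well-formed one does not.
const failure = tryMap([undefined]);
const success = tryMap([{ markdown: "# Hello" }]);
console.log(failure);
console.log(success);
```

The point of the sketch is that a type cast or non-null assertion hides the problem from the compiler only, so an unexpected response shape surfaces later inside `Array.map`.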

To fix this issue, you should add a check to ensure that each document in firecrawlDocs is defined and has the markdown property before attempting to access it. Here is the modified load method with the necessary checks:

public async load(): Promise<DocumentInterface[]> {
  const params: ConstructorParameters<typeof FirecrawlApp>[0] = {
    apiKey: this.apiKey,
  };
  if (this.apiUrl !== undefined) {
    params.apiUrl = this.apiUrl;
  }
  const app = new FirecrawlApp(params);
  let firecrawlDocs: FirecrawlDocument[];

  if (this.mode === "scrape") {
    const response = await app.scrapeUrl(this.url, this.params);
    if (!response.success) {
      throw new Error(
        `Firecrawl: Failed to scrape URL. Error: ${response.error}`
      );
    }
    firecrawlDocs = [response.data as FirecrawlDocument];
  } else if (this.mode === "crawl") {
    const response = await app.crawlUrl(this.url, this.params, true);
    firecrawlDocs = response as FirecrawlDocument[];
  } else {
    throw new Error(
      `Unrecognized mode '${this.mode}'. Expected one of 'crawl', 'scrape'.`
    );
  }

  return firecrawlDocs
    .filter((doc) => doc && doc.markdown !== undefined)
    .map(
      (doc) =>
        new Document({
          pageContent: doc.markdown || "",
          metadata: doc.metadata || {},
        })
    );
}

This code filters out any undefined documents or documents that do not have the markdown property before mapping them to Document instances ^[1].
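To try the filtering idea without the Firecrawl SDK, here is a self-contained sketch using plain objects. The names `ScrapedDoc`, `PlainDocument`, `hasMarkdown`, and `toDocuments` are assumptions made for this example and are not part of LangChain.js or Firecrawl.

```typescript
// Hypothetical stand-ins for the SDK and LangChain types.
interface ScrapedDoc {
  markdown?: string;
  metadata?: Record<string, unknown>;
}

interface PlainDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Type guard: true only for defined entries that carry a markdown string.
function hasMarkdown(
  doc: ScrapedDoc | undefined | null
): doc is ScrapedDoc & { markdown: string } {
  return doc != null && typeof doc.markdown === "string";
}

// Drops unusable entries first, then maps the rest to document objects.
function toDocuments(
  docs: Array<ScrapedDoc | undefined | null>
): PlainDocument[] {
  return docs.filter(hasMarkdown).map((doc) => ({
    pageContent: doc.markdown,
    metadata: doc.metadata ?? {},
  }));
}

const converted = toDocuments([
  undefined, // e.g. a response with no data
  { metadata: { title: "no body" } }, // defined, but without markdown
  { markdown: "# Title", metadata: { ogTitle: "Title" } },
]);
console.log(converted.length, converted);
```

Note that filtering silently discards malformed entries. If the response shape itself is unexpected, the result may simply be an empty array, so the caller should still handle the "no documents" case.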

jacoblee93 · 2024-09-30T15:42:12Z

CC @nickscamara

nickscamara · 2024-09-30T16:09:56Z

@rafaelsideguide is taking a look!

dosubot added the auto:bug label ("Related to a bug, vulnerability, unexpected error with an existing feature") on Sep 28, 2024
