
Downloads stats are off #3697

Closed · mcollina opened this issue May 2, 2024 · 23 comments

mcollina (Member, Author) commented May 2, 2024

I took a look at https://nodedownloads.nodeland.dev/ and it seems the download stats for April are off.

Quite a bit of data is missing; these are the only files available for April:

nodejs.org-access.log.20240406.json
nodejs.org-access.log.20240407.json
nodejs.org-access.log.20240418.json
nodejs.org-access.log.20240419.json
nodejs.org-access.log.20240420.json
nodejs.org-access.log.20240421.json
nodejs.org-access.log.20240423.json
nodejs.org-access.log.20240424.json
nodejs.org-access.log.20240425.json
nodejs.org-access.log.20240426.json
nodejs.org-access.log.20240427.json
nodejs.org-access.log.20240428.json
nodejs.org-access.log.20240429.json

Can somebody with access retrieve the missing data?
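
For reference, here is a minimal sketch of how the missing dates can be enumerated, assuming the summaries bucket access-logs-summaries-nodejs (named later in this thread) stays publicly listable through the GCS JSON API, and Node.js 18+ for global fetch:

const prefix = "nodejs.org-access.log.202404";
const url = `https://storage.googleapis.com/storage/v1/b/access-logs-summaries-nodejs/o?prefix=${prefix}`;

// List the existing April summaries and collect their object names.
const { items = [] } = await (await fetch(url)).json();
const present = new Set(items.map((item) => item.name));

// Print every April date that has no summary file.
for (let day = 1; day <= 30; day++) {
  const name = `${prefix}${String(day).padStart(2, "0")}.json`;
  if (!present.has(name)) console.log("missing:", name);
}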

targos (Member) commented May 2, 2024

[Screenshot: CleanShot 2024-05-02 at 11 55 08@2x]

There's a problem with the log processor on GCP. I don't have time to investigate it.

targos (Member) commented May 2, 2024

If someone else with access wants to have a look: https://console.cloud.google.com/run/detail/us-central1/processlogs/metrics?project=nodejs-org

mcollina (Member, Author) commented May 6, 2024

How can I get access to look into it?

targos (Member) commented May 6, 2024

I formalized an access request: #3705

richardlau (Member) commented

I currently don't have time to look into this either, but from a quick glance at GCP, a few things probably warrant looking at:

mcollina (Member, Author) commented

I'm still planning to look into it, but things got deferred.

mcollina (Member, Author) commented

Unfortunately, I did not have time to look into this, but it looks like a serious issue: the last day we have data for is 2024-05-02. Maybe somebody else can step in? I don't think I will be able to look into it in the coming weeks/months.

mcollina (Member, Author) commented

@trivikr, could you maybe take a look?

trivikr (Member) commented Jun 14, 2024

> @trivikr, could you maybe take a look?

Sure. I created an access request at #3760 and shared the email address of my Google account with @targos in a Slack DM.

trivikr (Member) commented Jun 18, 2024

Direct link for viewing April stats: https://storage.googleapis.com/access-logs-summaries-nodejs/?marker=nodejs.org-access.log.20240331.json

Verified that the access log summaries are missing for the dates mentioned in the description above (#3697 (comment)).

[Screenshot: bucket listing of nodejs.org-access.log.202404* files]
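
That link uses the bucket's public XML listing API with a marker parameter. For reference, a minimal sketch that pulls the same listing from Node.js (assuming the bucket remains publicly readable):

const url = "https://storage.googleapis.com/access-logs-summaries-nodejs/?marker=nodejs.org-access.log.20240331.json";
const xml = await (await fetch(url)).text();
// Crude extraction of the object names from the XML listing.
for (const match of xml.matchAll(/<Key>([^<]+)<\/Key>/g)) {
  console.log(match[1]);
}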

trivikr (Member) commented Jun 18, 2024

As per the source code, it looks like the contents of the file are not valid JSON:

function createPipeline (bucket, filename, processedFile, callback) {
  const storage = new Storage({
    keyFilename: 'metrics-processor-service-key.json'
  })
  console.log('INSIDE CREATE PIPELINE')
  const readBucket = storage.bucket(bucket)
  const writeBucket = storage.bucket('processed-logs-nodejs')
  readBucket.file(filename).download(function (err, contents) {
    if (err) {
      console.log('ERROR IN DOWNLOAD ', filename, err)
      // callback(500)
      callback()
    } else {
      const stringContents = contents.toString()
      console.log('String length: ', stringContents.length)
      const contentsArray = stringContents.split('\n')
      console.log('Array Length: ', contentsArray.length)
      let results = ''
      for (const line of contentsArray) {
        try {
          const jsonparse = JSON.parse(line)
          const printout = logTransform2(jsonparse)
          if (printout !== undefined) { results = results.concat(printout) }
        } catch (err) { console.log(err) }
      }
      // ...

trivikr (Member) commented Jun 18, 2024

Example log entry which throws the error:

BUCKET cloudflare-logs-nodejs
FILENAME 20240618/20240618T225603Z_20240618T225716Z_da79c9d1.log.gz
PROCESSEDFILENAME 20240618/20240618T225603Z_20240618T225716Z
INSIDE CREATE PIPELINE
String length:  21570831
Array Length:  44744
SyntaxError: Unexpected end of JSON input 

Direct link to file: 20240618/20240618T225603Z_20240618T225716Z_da79c9d1.log.gz

Sample code which processes this file, reading it locally from test.json:

import { readFile } from "fs/promises";

const data = await readFile("test.json", "utf8");
for (const line of data.split("\n")) {
  try {
    JSON.parse(line);
  } catch (error) {
    const { message } = error;
    console.log({ message, line });
  }
}

Example output:

{ message: 'Unexpected end of JSON input', line: '' }

An easy fix would be to skip lines which don't contain any data:

// ...
for (const line of data.split("\n")) {
  if (line.length === 0) {
    continue;
  }
// ...
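
Equivalently (a sketch of an alternative, not the posted fix), the empty entries could be dropped up front with a filter:

for (const line of data.split("\n").filter((line) => line.length > 0)) {
  JSON.parse(line);
  // ...
}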

trivikr (Member) commented Jun 18, 2024

The SyntaxError is just logged to the console and the processing continues.
Skipping empty lines will not fix the download stats by itself, but it'll reduce the number of error occurrences in the logs.

for (const line of contentsArray) {
  try {
    const jsonparse = JSON.parse(line)
    const printout = logTransform2(jsonparse)
    if (printout !== undefined) { results = results.concat(printout) }
  } catch (err) { console.log(err) }
}

I posted a fix at #3765 to reduce the number of errors.

trivikr (Member) commented Jun 18, 2024

Posted another PR to log the line along with its length if parsing fails: #3766

trivikr (Member) commented Jun 18, 2024

I'll resume my deep dive once the fixes for the SyntaxError are reviewed, merged, and deployed.

trivikr (Member) commented Jun 19, 2024

I've created some bug reports based on the logs.

However, the access logs have not been created after 20240503, and Google Cloud retains logs for only 30 days.
The logs from the last successful processing, nodejs.org-access.log.20240502.json, are therefore no longer available for viewing.

The next steps would be to deep dive into #3771 and add more logging in collectData (see the sketch after the snippet below):

const [files] = await storage.bucket('processed-logs-nodejs').getFiles({ prefix: `${filePrefix}` })
for (const file of files) {
  const data = await storage.bucket('processed-logs-nodejs').file(file.name).download()
  const stringContents = data[0].toString()
  const contentsArray = stringContents.split('\n')
  for (const line of contentsArray) {
    try {
      const csvparse = csvStream(line)
      if (csvparse !== undefined && csvparse[0][0] !== '') { summary(csvparse) }
    } catch (err) { console.log(err) }
  }
}
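
The extra logging could look something like the following (a sketch; the messages and their placement are illustrative, not existing code):

const [files] = await storage.bucket('processed-logs-nodejs').getFiles({ prefix: `${filePrefix}` })
console.log(`collectData: ${files.length} files for prefix ${filePrefix}`)
for (const file of files) {
  const data = await storage.bucket('processed-logs-nodejs').file(file.name).download()
  console.log(`collectData: downloaded ${file.name}, ${data[0].length} bytes`)
  // ... same parsing loop as above
}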

I can take this up after the existing open PRs are approved.
It would also be helpful if I could get write/deploy permission for Google Cloud Run.

trivikr (Member) commented Jun 26, 2024

Updates:

trivikr (Member) commented Jul 5, 2024

No action is needed here, although increasing the timeout is recommended in #3771 (comment).

The summaries have been produced again since 20240629. I couldn't find the exact reason; the removal of the moment dependency seems to be the closest match in terms of timing.

A fix for this is pending merge in #3801.
Once merged, I'll be able to process the summaries for all missing dates.

trivikr (Member) commented Jul 5, 2024

On examining the source code of the summaries, the only thing that stands out as a long pole is the number of files it needs to process, which happens in parallel:

const [files] = await storage.bucket('processed-logs-nodejs').getFiles({ prefix: `${filePrefix}` })
for (const file of files) {
  const data = await storage.bucket('processed-logs-nodejs').file(file.name).download()
  // ...

However, I couldn't find a clear correlation between the missing summaries and the number of files processed.

Test code:

import { Storage } from "@google-cloud/storage";

const projectId = "nodejs-org";
const storage = new Storage({ projectId });

for (let i = 1; i < 31; i++) {
  const date = `202404${i.toString().padStart(2, "0")}`;

  const filePrefix = `${date}/`;
  const [files] = await storage.bucket("processed-logs-nodejs").getFiles({ prefix: filePrefix });

  const fileName = `nodejs.org-access.log.${date}.json`;
  const [exists] = await storage.bucket("access-logs-summaries-nodejs").file(fileName).exists();

  console.log(date, files.length, exists);
}

Output:

20240401 3611 false
20240402 3757 false
20240403 3774 false
20240404 3725 false
20240405 3895 false
20240406 3671 true
20240407 3402 true
20240408 3708 false
20240409 3781 false
20240410 3776 false
20240411 4204 false
20240412 4883 false
20240413 4659 false
20240414 4694 false
20240415 4452 false
20240416 2700 false
20240417 2674 false
20240418 2654 true
20240419 2643 true
20240420 2609 true
20240421 2631 true
20240422 2689 false
20240423 2663 true
20240424 2660 true
20240425 2652 true
20240426 2640 true
20240427 2618 true
20240428 2636 true
20240429 2754 true
20240430 2664 false

The failures could be caused by the content of the files, or by something else entirely.
We'll get more information once we retry running the summaries for those dates, after #3801 is merged and deployed.

trivikr (Member) commented Jul 6, 2024

The summaries have been created for the months of April through June with the fix in #3809.
The list can be viewed at https://storage.googleapis.com/access-logs-summaries-nodejs/?marker=nodejs.org-access.log.20240401.json

Test code used to call the summaries service:

import { GoogleAuth } from "google-auth-library";
import { Storage } from "@google-cloud/storage";
import { Agent } from "undici";

const projectId = "nodejs-org";
const storage = new Storage({ projectId });

const cloudRunUrl = "https://produce-summaries-kdtacnjogq-uc.a.run.app";
const auth = new GoogleAuth();
const client = await auth.getIdTokenClient(cloudRunUrl);

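// fetch() in Node.js is backed by undici, whose default response-headers
// timeout is 5 minutes, and producing a summary for one day can take longer.
// Overriding the global dispatcher through this Symbol is an undocumented
// implementation detail.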
globalThis[Symbol.for("undici.globalDispatcher.1")] = new Agent({
  headersTimeout: 30 * 60 * 1000,
});

const start = new Date("04/01/2024");
const end = new Date("06/30/2024");

for (let date = start; date <= end; date.setDate(date.getDate() + 1)) {
  const dateString = date.toISOString().split("T")[0].replace(/-/g, "");

  const fileName = `nodejs.org-access.log.${dateString}.json`;
  const [exists] = await storage.bucket("access-logs-summaries-nodejs").file(fileName).exists();

  if (!exists) {
    const startTime = Date.now();
    console.log(`Fetching for ${dateString}`);
    // Make a call to the API to populate the file
    const url = `${cloudRunUrl}/date/${dateString}`;

    try {
      const token = await client.idTokenProvider.fetchIdToken(cloudRunUrl);
      const response = await fetch(url, {
        method: "POST",
        headers: { Authorization: `Bearer ${token}` },
      });

      if (!response.ok) {
        throw new Error(`HTTP error! status: ${response.status}`);
      }

      console.log(`Successful in ${Date.now() - startTime}ms for ${dateString}`);
    } catch (error) {
      console.error(`Error in ${Date.now() - startTime}ms for ${dateString}`, error);
    }
  }
}
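
One nice property of the script: the upfront exists() check makes it idempotent, so repeated or overlapping runs only call the Cloud Run service for dates whose summary is still missing.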

The script was run in parallel for different dates. This issue can be closed.


mcollina (Member, Author) commented Jul 8, 2024

thanks @trivikr!!!
