RING-44425 - Comments for SPARK scripts #48

Open · TrevorBenson wants to merge 12 commits into master from improvement/RING-44425-S3_FSCK-scripts-add-comments
Conversation

TrevorBenson (Member) commented:

There are a few spots with ???? where either I wanted to confirm the entire operation before adding a comment, or there is a description but I am not 100% confident in its accuracy and need to compare some more current output from the scripts to be sure.

scality-fno (Contributor) left a comment:

stashing my review here — to be continued.
I'm considering going all the way down to the very variables that are used, to have a clear view of which data frames we create and filter out, etc.

Review threads on scripts/S3_FSCK/s3_fsck_p0.py (1 resolved, 4 outdated) and scripts/S3_FSCK/s3_fsck_p1.py (4 outdated).
TrevorBenson (Member, Author) commented Sep 28, 2023:

> stashing my review here — to be continued. I'm considering going all the way down to the very variables that are used, to have a clear view of which data frames we create and filter out, etc.

I've applied most of them, and posed questions on the outstanding suggestions.

TrevorBenson changed the title from "Comments for SPARK scripts" to "RING-44425 - Comments for SPARK scripts" on Oct 2, 2023.
scality-fno (Contributor) left a comment:

New pass on P0. TODO: evaluate the need for each column to exist.

Review threads on scripts/S3_FSCK/s3_fsck_p0.py (5 outdated, 4 resolved).
scality-fno marked this pull request as ready for review on October 9, 2023 at 15:05.
TrevorBenson (Member, Author) commented:

> New pass on P0. TODO: evaluate the need for each column to exist.

Let's keep the focus on comments, including whether columns are used in later steps.

Once we get this approved, we can open a new ticket to suggest minimizing the data written to just the fields actually used in later scripts. 🤞 With our comments detailing how it all works, we can get approval to change the actual scripts.

TrevorBenson force-pushed the improvement/RING-44425-S3_FSCK-scripts-add-comments branch from e99b340 to 50c5108 on October 10, 2023 at 15:46.
scality-fno (Contributor) commented:

>> New pass on P0. TODO: evaluate the need for each column to exist.
>
> Let's keep the focus on comments, including whether columns are used in later steps.
>
> Once we get this approved, we can open a new ticket to suggest minimizing the data written to just the fields actually used in later scripts. 🤞 With our comments detailing how it all works, we can get approval to change the actual scripts.

👍
I'll make the time for P1, then

TrevorBenson (Member, Author) commented Oct 10, 2023 via email.

TrevorBenson force-pushed the improvement/RING-44425-S3_FSCK-scripts-add-comments branch from 3eb5bc2 to 644b54e on October 10, 2023 at 20:50.
scality-fno (Contributor) left a comment:

"P1" review: thanks to @fra-scality and his jupyter notebooks, figured out how the dataframes were used. The more I read it, the less I understand. Either I'm missing sth obvious or the naming is terrible.

Review threads on scripts/S3_FSCK/s3_fsck_p1.py (10 outdated).
scality-fno (Contributor) left a comment:


Quite sure there's not much to say (in the current design) about P2, P3 and P4.
One question I have is: can we make the P2 anti-join operation more efficient?
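For context on that question, the usual way to express an anti-join in PySpark is a left_anti join, optionally with a broadcast hint when one side is small. A minimal sketch, with made-up dataframe and column names rather than the actual s3_fsck_p2.py variables:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Illustrative data: keys seen on the RING vs. keys known to S3
ring_keys = spark.createDataFrame([("k1",), ("k2",), ("k3",)], ["digkey"])
s3_keys = spark.createDataFrame([("k1",), ("k3",)], ["digkey"])

# left_anti keeps only the ring_keys rows with no match in s3_keys
# (candidate orphans); broadcast() can avoid a shuffle when the
# right-hand side fits in executor memory.
orphans = ring_keys.join(broadcast(s3_keys), on="digkey", how="left_anti")
orphans.show()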

Review threads on scripts/S3_FSCK/s3_fsck_p2.py (3 outdated) and scripts/S3_FSCK/s3_fsck_p4.py (1 outdated).
dfCOSsingle = dfCOSsingle.withColumn("ringkey",dfCOSsingle["_c1"])
# ???
dfCOSsingle = dfCOSsingle.withColumn("_c1",F.expr("substring(_c1, 1, length(_c1)-14)"))
TrevorBenson (Member, Author) commented:

This operation likely explains why withColumn() is used to duplicate _c1 into ringkey instead of withColumnRenamed() for dfCOSsingle.
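A minimal sketch of that difference, using an assumed single-row dataframe: the withColumn() copy keeps the full key in ringkey while _c1 is truncated in place, whereas withColumnRenamed() would leave no intact copy behind.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Illustrative key, shaped like the example keys quoted in these threads
df = spark.createDataFrame([("8000004F3F3A54FFEADF8C00000000511470C070",)], ["_c1"])

df = df.withColumn("ringkey", df["_c1"])  # full key preserved in ringkey
df = df.withColumn("_c1", F.expr("substring(_c1, 1, length(_c1)-14)"))  # _c1 loses its last 14 chars
df.show(truncate=False)  # ringkey still holds the original value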

dfnew = rdd.flatMap(lambda x: x).toDF()

single = "%s://%s/%s/s3fsck/s3-dig-keys.csv" % (PROTOCOL, PATH, RING)
# write the dataframe to a csv file with a header
# output structure: (digkey, sproxyd input key, subkey if available)
dfnew.write.format("csv").mode("overwrite").options(header="true").save(single)
TrevorBenson (Member, Author) commented:

If we want correct headers, this write operation with header="true" is where the generic _c0, _c1 column names first appear in the output. We could perform:

Suggested change
dfnew.write.format("csv").mode("overwrite").options(header="true").save(single)
dfnew = dfnew.withColumnRenamed("_c0", "digkey").withColumnRenamed("_c1", "input_key").withColumnRenamed("_c2", "subkey")
dfnew.write.format("csv").mode("overwrite").options(header="true").save(single)

Requires updating the p2 script to read the new column names instead of the generic ones.
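For illustration, the p2-side read could then look like this sketch; the placeholder config values and the assumption that p0 now writes real headers are mine, not the actual p2 code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
PROTOCOL, PATH, RING = "file", "/tmp/spark", "RING"  # placeholders; real values come from the config
single = "%s://%s/%s/s3fsck/s3-dig-keys.csv" % (PROTOCOL, PATH, RING)

df = spark.read.format("csv").option("header", "true").load(single)
df = df.select("digkey", "input_key", "subkey")  # named columns instead of _c0/_c1/_c2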

# e.g. 555555A4948FAA554034E155555555A61470C07A,8000004F3F3A54FFEADF8C00000000511470C070,g1disk1,0
# Required Fields:
# - _c1 (main chunk)
# - _c3 (FLAG)
df = spark.read.format("csv").option("header", "false").option("inferSchema", "true").option("delimiter", ",").load(files)
TrevorBenson (Member, Author) commented Oct 26, 2023:

This is another spot where we can inject valid headers prior to the later commands, making them a bit simpler to comprehend:

Suggested change
df = spark.read.format("csv").option("header", "false").option("inferSchema", "true").option("delimiter", ",").load(files)
df = spark.read.format("csv").option("header", "false").option("inferSchema", "true").option("delimiter", ",").load(files)
df = df.withColumnRenamed("_c0", "ringkey").withColumnRenamed("_c1", "mainchunk").withColumnRenamed("_c2", "disk").withColumnRenamed("_c3", "flag")

In this example I name _c0 (the ring chunk keys) as ringkey, instead of naming _c1, the main chunk, as ringkey. I think this could reduce confusion if we decide to be very specific and use explicit terms for each data type:

  • ringkey (or ring_key) # The 30-33 chunk keys and 70-7B chunk keys
  • mainchunk (or main_chunk) # The 30 or 70 main chunk (aka zero keys)
  • disk
  • flag
  • inputkey (or input_key) # The sproxyd input key
  • digkey (or dig_key) # The md5sum digged from the main chunk

I like the underscore versions for better readability.
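Purely as an illustration (no such helper exists in the scripts), the snake_case mapping could live in one shared place so every script renames the generic columns consistently:

# Hypothetical shared mapping; names follow the list above
COLUMN_NAMES = {
    "_c0": "ring_key",    # the 30-33 and 70-7B chunk keys
    "_c1": "main_chunk",  # the 30 or 70 main chunk (aka zero keys)
    "_c2": "disk",
    "_c3": "flag",
}

def with_named_columns(df, names=COLUMN_NAMES):
    # Rename only the generic columns actually present in the dataframe
    for old, new in names.items():
        if old in df.columns:
            df = df.withColumnRenamed(old, new)
    return df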

scality-fno (Contributor) commented:

whatever's easier to read is fine by me

TrevorBenson (Member, Author) commented:

"P1" review: thanks to @fra-scality and his jupyter notebooks, figured out how the dataframes were used. The more I read it, the less I understand. Either I'm missing sth obvious or the naming is terrible.

It might be useful to contribute those to the repository so we can improve upon them as needed. We might need to repeat this process for SOFS_FSCK.
