
user-specified column #9

Open
bmcgough opened this issue Jun 16, 2017 · 8 comments

@bmcgough
Contributor

Goal is to have pwalk output a column value the user supplies on the command line.

The example is file and folder ownership - suppose you have an owner for a folder structure who is not the same as the UID owning the folders. You would specify that field, and it would be appended to each line of output as an additional column. Yes, it would be exactly the same for every line of output, but if the output is being copied directly into a database, this would help avoid messiness in the shell around altering the output.

Perhaps a repeatable parameter to allow multiple columns?

Ex:
pwalk --NoSnap --maxthreads=32 --addcol 'bob' --addcol 'bigfs' /mnt/really_big_filesystem/bobs_folder
would produce lines like:
62943850,69436284,1,"/mnt/really_big_filesystem/bobs_folder/testfile.tst","tst",1287,1287,4096,8,"0040644",1475167363,1475167363,1475167363,4,3001,bob,bigfs

@malcook

malcook commented Jun 16, 2017

Messiness is subjective.

I don't think this is messy:

pwalk --NoSnap --maxthreads=32 /mnt/really_big_filesystem/bobs_folder | sed -e 's/$/,bob,bigfs/'

  • it computes what you want
  • it is fewer characters overall
  • it does not introduce new options whose meaning is non-standard
  • it is clear what it does to anyone who knows sed with no additional manpage reading
  • it is consistent with the Unix Philosophy:

Do One Thing and Do It Well

@bmcgough
Contributor Author

Yes, you are right.

I am also working with the author to eventually enable PostgreSQL binary output for direct COPY into tables.

To stick with that philosophy, a PostgreSQL binary-format manipulation utility would need to be written (I have not found one). Perhaps that is what should be done instead of adding the functionality to pwalk...
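
For context, the non-binary path is just a plain CSV COPY; a minimal sketch, assuming a hypothetical database "filedb" and table "pwalk_files":

psql -d filedb -c "\copy pwalk_files FROM 'pwalk_out.csv' WITH (FORMAT csv)"

A binary load would instead use COPY ... WITH (FORMAT binary), but that expects the input to already be in PostgreSQL's binary COPY format - which is why a tool to produce or manipulate that format would be needed.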

@malcook

malcook commented Jun 16, 2017

Hi - if you're having a conversation online about PostgreSQL loading options, I'd like to join it.

In fact, let me start one, in the wiki: Postgresql loading practices

@bmcgough
Contributor Author

bmcgough commented Jun 16, 2017 via email

@malcook

malcook commented Oct 23, 2018

@bmcgough - 2 things...

a) You agreed above with my opinion that, with a little judicious sed, this feature request might be laid to rest. If so, how about closing it?

b) I would very much like to hear about some of the challenges and solutions with queries, indexes, etc. Maybe you could add it to the wiki here? I'm especially interested in whether you have made this work with organizational meta-data management mandates (aka MMMs. Not.). But seriously, how are you using this at scale and to what ends? Or did the big vendor win the showdown? And, if so, who is the BiG VeNdOr?

@bmcgough
Contributor Author

I agree that some sed/awk foo is an acceptable workaround for my current case.

We have been using pwalk for almost two years now. It is working here. It is messy. Our views and other things are there, though there are some missing scripts (like UID and GID table creation) that we only keep locally.

We use pwalk to crawl 1.4TB in about 500 million files. The data is gathered by pwalk into a CSV file on a scratch file system. Then a pipeline of csvquote, uconv, sort, awk, and finally psql fills the database. We have to create indexes, and in some cases materialized views, to get the query performance we need from PostgreSQL. Recently we put this data into an Elasticsearch cluster and found querying to be much faster, but still not as fast as we need to put querying into the hands of our users.
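
For a sense of what that PostgreSQL tuning looks like, a rough sketch (the table and column names are made up for illustration, not pwalk's actual schema):

psql -d filedb <<'SQL'
-- hypothetical table and columns, shown only to illustrate the kind of tuning
CREATE INDEX ON pwalk_files (uid);
-- pre-aggregate per-owner usage so common reports do not scan 500 million rows
CREATE MATERIALIZED VIEW usage_by_uid AS
  SELECT uid, count(*) AS files, sum(st_size) AS bytes
  FROM pwalk_files
  GROUP BY uid;
-- refreshed after each crawl is loaded
REFRESH MATERIALIZED VIEW usage_by_uid;
SQL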

The challenges include filenames, error handling, and run time.

During this project I have learned there is only one character not permitted in a filename: /. We even have file names with byte values that do not map to known character sets (thus the uconv). We use awk to supply the additional data for each row, and awk isn't really CSV-aware, thus we need csvquote. Psql is very picky (we are using COPY), so sort eliminates any duplicates, but it is an additional step.
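
Putting those pieces together, the load step looks roughly like this - a sketch only, with made-up database and table names ("filedb", "pwalk_files"); the real scripts add error handling:

# csvquote encodes separators inside quoted fields so sort/awk stay line-oriented; -u restores them
# uconv normalizes bytes that are not valid UTF-8 (in practice an error-handling option is added)
csvquote pwalk_out.csv \
  | uconv -f UTF-8 -t UTF-8 \
  | sort -u \
  | awk -v OFS=',' '{print $0, "bob", "bigfs"}' \
  | csvquote -u \
  | psql -d filedb -c "COPY pwalk_files FROM STDIN WITH (FORMAT csv)"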

Before pwalk, I used Python with scandir to walk the tree, putting filenames onto a multiprocessing queue to be stat()ed. This meant there was a (sometimes long) window between the directory walk and the file stat, so files deleted during normal use in that window would result in an error (file not found). With pwalk things are better, but that race condition still exists and we hit it occasionally, so we have to differentiate between 'acceptable' errors and real ones.

It takes us a varying amount of time to run the crawl - 8-16 hours. This is likely all due to the underlying file system at this point, but it is always a challenge. Scheduling on our Slurm cluster has also been a challenge, as we are competing with other cluster users for resources.

We are on the cusp of launching the ES solution and using it to pull in additional metadata from our Swift cluster, S3, and our scratch file system (currently BeeGFS). But pwalk will remain the method we use to gather POSIX file metadata.

As to metadata mandates... I think getting our metadata 'ducks in a row' is step one for us. Once we have the ability to view and query all our metadata, we will finally have the tools necessary for users to begin to voluntarily manage their data and metadata. It'll be an exciting time!

@malcook

malcook commented Oct 23, 2018

This is great info - thanks @bmcgough - one question though - what does ES stand for in "the ES solution"? "Enterprise Scale"?

@bmcgough
Contributor Author

Elasticsearch. This is what @fizwit used to use to query pwalk data. We had a PostgreSQL project ongoing and decided to use that. It does work, but you do have to optimize queries to get the performance you want.

Ideas for the future include:

  • Grab md5sum where easy (S3, Swift, etc.)
  • Read file magic
  • On-demand tree walking (user-triggered from a UI)
  • Data moving (again user-triggered from a UI, but with the data avoiding client devices)
