feature request: log when loadArchives opens and closes warc files in a dir #156

Closed
dportabella opened this issue Dec 18, 2017 · 5 comments

Comments

@dportabella
Contributor

val inputWarcDir = "/data/warcs/*.warc.gz"
val webPages: RDD[ArchiveRecord] = RecordLoader.loadArchives(inputWarcDir, sc)

Is it possible to add a log message showing when a WARC file is opened and closed?

@ianmilligan1
Member

One quick solution might be calling

sc.setLogLevel("INFO")

in spark-shell, which gives you very verbose logging. The output does include input-split information like this:

2017-12-21 09:59:00,434 [Executor task launch worker for task 6] INFO  NewHadoopRDD - Input split: file:/Users/ianmilligan1/Dropbox/git/aut/example.arc.gz:0+2012526
2017-12-21 09:59:00,435 [Executor task launch worker for task 7] INFO  NewHadoopRDD - Input split: file:/Users/ianmilligan1/Dropbox/git/aut/example2.arc.gz:0+2012526

There is no file-close information, but I have used this to debug things before (such as finding bad W/ARCs).
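
For anyone who wants to pull the file names out of that verbose output, here is a minimal sketch in plain Scala. It assumes the log lines keep the shape shown above, with the split printed as `path:start+length`. The `InputSplitLog` object and `Split` case class are hypothetical helpers written for this illustration; they are not part of aut or Spark.

```scala
object InputSplitLog {
  // Matches a NewHadoopRDD INFO line ending in "Input split: <path>:<start>+<length>".
  // The greedy (.+) backtracks to the last ':' that is followed by "<digits>+<digits>",
  // so colons inside the path (e.g. "file:/...") stay part of the path.
  private val SplitPattern =
    """.*NewHadoopRDD - Input split: (.+):(\d+)\+(\d+)$""".r

  // Hypothetical container for one parsed input split.
  final case class Split(path: String, start: Long, length: Long)

  // Returns the parsed split, or None for any line that is not an input-split message.
  def parse(line: String): Option[Split] = line match {
    case SplitPattern(path, start, length) =>
      Some(Split(path, start.toLong, length.toLong))
    case _ => None
  }
}
```

Mapping `InputSplitLog.parse` over the lines of a saved log and keeping the `Some` results would give the list of archives Spark started reading, which may help narrow down which file a failing task was working on.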

@ianmilligan1
Member

Just re-pinging on this, @dportabella: is this something you want above and beyond the Spark logging options?

@dportabella
Contributor Author

> Just re-pinging on this, @dportabella: is this something you want above and beyond the Spark logging options?

No. Your solution sc.setLogLevel("INFO") is fine for me, thx!

But I would also need file-close information.
Though it is not an urgent issue.

@jrwiebe
Contributor

jrwiebe commented Jan 29, 2019

@dportabella, my recent commit addresses your request. If you set the log level to "INFO", you will see messages like this:

2019-01-29 12:54:11 INFO  ArchiveRecordInputFormat:141 - Opening archive file file:/home/jrwiebe/aut/target/test-classes/arc/example.arc.gz
2019-01-29 12:54:11 INFO  ArchiveRecordInputFormat:240 - Closed archive file file:/home/jrwiebe/aut/target/test-classes/arc/example.arc.gz
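
As a rough illustration of how these messages could be used, the sketch below scans saved log lines and reports archives that were opened but never closed. It assumes the message wording stays exactly as in the two lines above. `ArchiveOpenClose` and `unclosed` are hypothetical names made up for this example; they are not part of aut.

```scala
import scala.collection.mutable

object ArchiveOpenClose {
  // Patterns follow the two INFO messages shown above; the number after the
  // class name is the source line and is allowed to vary.
  private val Opened =
    """.*ArchiveRecordInputFormat:\d+ - Opening archive file (.+)$""".r
  private val Closed =
    """.*ArchiveRecordInputFormat:\d+ - Closed archive file (.+)$""".r

  // Returns the paths that have more "Opening" than "Closed" messages,
  // in the order they were first opened.
  def unclosed(lines: Seq[String]): List[String] = {
    // path -> number of opens not yet matched by a close
    val open = mutable.LinkedHashMap.empty[String, Int]
    lines.foreach {
      case Opened(path) =>
        open(path) = open.getOrElse(path, 0) + 1
      case Closed(path) =>
        open.get(path).foreach { n =>
          if (n <= 1) open.remove(path) else open(path) = n - 1
        }
      case _ => () // ignore every other log line
    }
    open.keys.toList
  }
}
```

An archive that shows up in the result was still open when the log ended, which could point at a file a task failed on, though a log that was cut off while a job was still running would produce the same result.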

Is this satisfactory?

@dportabella
Contributor Author

Great, thanks!

@ruebot ruebot closed this as completed in fc0178d Jan 31, 2019