-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtb-git.Rmd
859 lines (687 loc) · 19.6 KB
/
tb-git.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
---
title: "Mining repositories with Git"
author: "Diomidis Spinellis, Georgios Gousios"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
revealjs::revealjs_presentation:
css: theme.css
highlight: pygments
theme: white
reveal_options:
slideNumber: true
previewLinks: true
---
## Outline
* Basics
* Workflow
* Fetching Git data
* Selecting elements
* Processing records
* Summarizing results
* Scaling
* Debugging
## Basics
* Command line rationale
* Getting to the command line
* Workflow
* Commands
* Expansion and quoting
* Programming constructs
* Help!
## CLI rationale
* Versatility
* Performance
* Parallelism
* Efficiency
* Open science
* Availability
## Key concepts
* Break the problem into small well-understood tasks
* Each tool do should do a single thing and do it well
* One form of interface: flat text
* One record per line
* Fields split by space or other special character
* Filters
## Getting to the command line
* Windows: Cygwin + mintty or Windows Subsystem for Linux (WSL)
* Mac OS X: Applications - Utilities - Terminal
* Unix/Linux: terminal, xterm, rxvt, konsole, kvt, gnome-terminal, nxterm, eterm
## Simple commands
```sh
command
# Input redirection
command <input-file
# Output redirection
command >output-file
# Pipeline
command1 | command2
```
## Variables, expansions
```sh
# Variable assignment
e=expansion
# Variable expansion
$e
# Command expansion
$(command)
# Quoting
'literal string'
"single string with $e"
Literal \$ \' \"
# Wildcards
* ? [abc] [0-9]
```
## Looping
```sh
# Repeat while command succeeds
while command ; do
commands
done
# Process piped input
while read var ; do
commands
done
# Iterate over a set
for var in a b c; do
commands
done
```
## Help!
* *command* `--help`
* `man` *command*
* GIYF
* [The Art of the Command Line](https://github.com/jlevy/the-art-of-command-line)
* [explainshell.com](https://explainshell.com/)
* [regular expressions 101](https://regex101.com/)
## Book
![](artwork/dscl-cover.png)
Also [available online](https://www.datascienceatthecommandline.com/)
## Workflow: Fetch
![](artwork/fetch.jpg)
## Fetching Git data
* Git internals
* Git commands
* Web and JSON data
## The Git user interface
Git is a content-addressable file system, that happens to be good for versioning
text files. From the user's perspective, Git separates commands in two types:
* **User-level commands:** Contrary to common belief, Git does have a
user interface. This includes commands such as `commit`, `log`, etc.
* **Plumbing commands:** Those interact directly with the Git internals,
for example `cat-file`, `update-ref`, etc.
## Git Internals {- }
At the lowest level Git, is a Key-Value (KV) store, residing on the local
hard drive. For any given input, `git` generates a hash and stores it
into its database. We can interact with Git's KV store, using the
`hash-object` and `cat-file` plumbing commands.
```bash
# Let's write some content in git's store
$ echo "foo" | git hash-object --stdin -w
257cc5642cb1a054f08cc83f2d943e56fd3ebe99
# Verify that it was written
$ find . -name 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
./.git/objects/25/257cc5642cb1a054f08cc83f2d943e56fd3ebe99
# Retrieve contents back
git cat-file -p 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
foo
```
## Git Objects {- }
Git supports 3 types of objects in its database:
* `blobs`, or content objects. These store raw content
* `trees` store a list of named pointers to blobs (directories)
* `commits` store information about commits, including a link to
the top level tree the commit represents
```bash
$ git --bare clone https://github.com/ghtorrent/ghtorrent.org && \
cd ghtorrent.org
$ git cat-file -t 838b2b05e315713791da7e003af9cc80da64658e
tree 6754adb51be5f9d4bee3aeafc7bb67c65396a343
parent af34da36cc381b0da4ec7378ec23b52c31b53145
parent caa91d1b4052a0e34cabe2b415f0f5cf232e8837
author Georgios Gousios <[email protected]> 1517405162 +0100
committer Georgios Gousios <[email protected]> 1517405162 +0100
Merge branch 'italobelo-patch-1'
```
## Tree objects
Tree objects contain named pointers to blob objects or other tree objects.
```bash
$ git cat-file -p 4c2b7a1d59329a3d3a02e9abc5d1e878cf2b07af
100644 blob ae10be1b81757e2127a8328e16b2b2155a64edeb .gitignore
100644 blob 6904bcdd60efe9315801621c3c36d805f7963148 404.html
100644 blob 7370bac51515903c580df4e946d62cc270eb112c Gemfile
040000 tree 35dbdb9b45d275393139f03bb86617a6bac75a1a _bibliography
100644 blob 4103650b66ad802a65afc60a1088e17e06f04cc2 README.md
[...]
$ git cat-file -p 35dbdb9b45d275393139f03bb86617a6bac75a1a
100644 blob 4515579877119d3d026f1a8ed01faa60b43add0a references.bib
$ git cat-file -p 45155798771
@inproceedings{GPD14,
author = {Gousios, Georgios and Pinzger, Martin and Deursen, Arie van},
[...]
```
## Commit objects
Commit objects contain a tree, author, committer, and parents.
```bash
$ git cat-file -t 838b2b05e315713791da7e003af9cc80da64658e
tree 6754adb51be5f9d4bee3aeafc7bb67c65396a343
parent af34da36cc381b0da4ec7378ec23b52c31b53145
parent caa91d1b4052a0e34cabe2b415f0f5cf232e8837
author Georgios Gousios <[email protected]> 1517405162 +0100
committer Georgios Gousios <[email protected]> 1517405162 +0100
Merge
```
_Tags_ are a special type of commit objects: they point to a commit but
has a `tagger` rather than an `author`.
## References
Git uses _references_ to give human readable names to important commit
objects.
```bash
$ find .git/refs -type f
.git/refs/heads/master
.git/refs/heads/pr-51
.git/refs/remotes/origin/HEAD
.git/refs/remotes/origin/master
$ cat .git/refs/heads/master
eb6ba576ec6dd0be077f20ccc47f2269580d6de8
```
Branches are just references.
```bash
$ git branch
* master
pr-51
$ cat .git/refs/heads/pr-51
b058f06bd69ec7aed7a976c737eb8579904720fd
```
## Putting the pieces together
![The Git Object Model](artwork/data-model-4.png){width=80%}
The Git object model is a directed graph stored in a KV database,
featuring file system semantics that enable content de-duplication.
## Efficient mining of `git` data
Checking out specific versions to analyze is inefficient and
error-prone.
We skip this method and focus on three tools:
* `show:` To interact with blobs
* `ls-tree:` To interact with trees
* `log:` To interact with commits
## Git mining: Typical process
1. Get commits from all branches (`branch`, `log`)
2. Select specific commits to analyze (`log`)
3. Get specific/all files for those commits
4. Process the contents of those files with external tools
## Git `log`: important options
Git `log` walks the git commit graph and prints details from commit
objects. Generally, Git `log` can be configured across 3 dimensions:
* _where to find commits_: `--all`, `--remotes`
* _which commits to exclude_: `--author`, `--grep`, `--filename`
* _what details to print_: `--pretty=<format>`
```bash
# Find commits by moritzbeller, containing an issue reference
# and print their abbreviated SHA and tree SHA
$ git --no-pager log --author [email protected] --grep "Refs" \
--pretty="%h %t"
1a4dafcc 89875e32
6b8e79e6 ebb41476
ac4842e1 5107a9d0
[...]
```
## Interacting with trees
Git `ls-tree` prints the contents of a Git tree, identified by its
SHA1. Important options:
* `-r` Recursive listing of all tree items
* `--abbrev` Abbreviate hashes (usually 10 is enough)
```bash
$ git ls-tree --abbrev=10 -r aada6ffe | head -n 10
100644 blob 4c05514ff2 .gitignore
100644 blob a5b896487d .travis.yml
100644 blob 14aa34d8aa WatchDogCore/Core.iml
100644 blob 1fb2ad661b WatchDogCore/WatchDog/.classpath
100644 blob a0cb478b8a WatchDogCore/WatchDog/.project
100644 blob 62492222ad WatchDogCore/WatchDog/.settings/org.eclipse.jdt.core.prefs
100644 blob f897a7f1cb WatchDogCore/WatchDog/.settings/org.eclipse.m2e.core.prefs
100644 blob b0a8bfcbcc WatchDogCore/WatchDog/META-INF/MANIFEST.MF
[...]
```
## Interacting with blobs
Git `show` prints the contents of a blob, identified by its blob SHA.
```bash
$ git --no-pager show 841f06d4ee
package nl.tudelft.watchdog.test.timingOutput;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.util.Date;
import java.util.LinkedList;
import java.util.List;
[...]
```
## Worth knowing
* *find*: select files from checked out projects
* *ssh*: obtain data from remote host
* *mysql*: execute SQL queries on MariaDB / mySQL
## Workflow: Select
![](artwork/select.jpg)
## Tools
* grep
* fgrep
* egrep
* awk
* sed
* (cut)
* (jq)
* (xml)
## Regular expressions
* Regular characters: `a`
* Any character: `.`
* Any number of times: `k*`
* Set: `[a-z]`
* Set complement: `[^a-z]`
* Beginning of line: `^a`
* End of line: `b$`
* Escaping: `\.` `\[` `\*`
* Extended: `a|b` `c+` `d?` `{9}` `{,9}`
* Back-references: `(a.b)` `\1`
## Fixed pattern
Obtaining committer timezones
```sh
git blame --line-porcelain readme.md |
fgrep committer-tz |
sort |
uniq -c |
sort -rn |
head -4
```
Result:
```
270 committer-tz +0700
228 committer-tz +0200
50 committer-tz +0100
14 committer-tz +0000
```
## Simple RE
Assembly language lines in Fourth Research Edition Unix
```sh
git ls-tree -r --name-only Research-V4 sys |
grep '\.s$' |
xargs -I@ git show Research-V4:@ |
wc -l
```
## More complex RE
* Git built-in *grep* command
* Count the number of header include directives
```sh
git grep '^[ \t]*#[ \t]*include.*\.h' Research-V4 |
wc -l
```
## Counting files
```sh
git grep -l '^[ \t]*#[ \t]*include.*\.h' Research-V4 |
wc -l
```
## Counting elements
Number of Makefile files in the second BSD distribution (32)
```sh
git ls-tree -r --name-only BSD-2 -- |
grep -c /makefile\$
```
## Excluding patterns
Number of networks system call **uses** in BSD 4.2
```sh
for i in accept ... socketpair do
echo -n $i
# Find system call lines
git grep -P "\\b$i\\s*\\(" BSD-4_2 |
# Remove kernel source code
grep -v :usr/src/sys |
# Remove noise
egrep -v '/(games|new|pcc)/' |
# Count occurrences in C source code
grep -c ':usr/src.*\.c:'
done |
sort -k3nr
```
## Selecting with awk
* *predicate* `{` *action* `}`
* Any of the two can be missing
* Predicate:
* Regular expression
* Arithmetic expression
* Logical expression
* Auto-splitting into `$`*N*
* Change field separator: `-F,`
## awk example
Contributed subsystems larger than 100 kLoC
```sh
git ls-tree --name-only FreeBSD-release/11.1.0:contrib |
while read d ; do
echo -n "$d "
git ls-tree -r --name-only FreeBSD-release/11.1.0:contrib/$d |
xargs -I@ git show FreeBSD-release/11.1.0:contrib/$d/@ |
wc -l
done |
awk '$2 > 100000'
```
## sed
* `s/`*RE*`/`*string*`/`
* `/`*pattern*`/`*command*
* `/`*pattern*`/!`*command*
* `n`ext
* `N`ext (append)
* `d`elete
* `p`rint
## sed example
Email domains of Linux pipe.c contributors
```sh
git blame --show-email master -- fs/pipe.c |
sed -n 's/^[^ ]* (<[^@]*@\([^>]*\).*/\1/p' |
sort -u
```
## Workflow: Process
![](artwork/process.jpg)
## Tools
* sort
* comm
* join
* tr
## sort
* Ordering
* Grouping
* Top *N* / bottom *N*
* Merging
* Prepare for
* Set operations
* Joining
## Useful sort arguments
* `--numeric-sort` or `-n`
* `--reverse` or `-r`
* `--key=2nr` or `-k2nr`
* `--unique` or `-u`
* `--field-separator=:` or `-t`
* `--month-sort` or `-m`
## Grouping and ordering
Top-5 Linux committers
```sh
# Show commit emails
git log --pretty="%ae" |
sort |
# Count consecutive same records
uniq -c |
# Order numerically
sort -rn |
# Obtain top-5
head -5
```
Results:
```
17317 [email protected]
6679 [email protected]
5790 [email protected]
4163 [email protected]
4111 [email protected]
```
## Set operations with comm
* Input: two sorted files
* Output:
* Records only in first (exclude with `-1`)
* Records only in second (exclude with `-2`)
* Common records (exclude with `-3`)
* Performs set difference and intersection
## comm example
How many Linux commits affect both the kernel and drivers?
```bash
# Obtain ordered list of SHA hashes
area_commits()
{
git log --pretty='%H' master -- $1 | sort
}
# Count common elements
comm -12 <(area_commits kernel) <(area_commits drivers) | wc -l
```
Result: 2213
## join example
Number of commits by same people in Linux and Docker
```bash
# Commits per committer
committer_commits()
{
git --git-dir=$1 log --pretty='%ae' | sort | uniq -c | sort -k2
}
# Join commits by comitter email
join -j 2 <(committer_commits docker/docker) <(committer_commits torvalds/linux) |
# Order by Docker commits
sort -k3nr |
# Top-5
head -5
```
Result:
```
[email protected] 8 129
[email protected] 164 84
[email protected] 1 62
[email protected] 1 59
[email protected] 8 38
```
## tr
* Map characters
* To uppercase: `tr a-z A-Z`
* Delete newlines: `tr -d '\n'`
* Complement set: `tr -cs A-Za-z '\n'`
* This generates a list of words
## tr example
Number of structural change commits
```sh
# Project and file to analyze
cd ReactiveX/RxJava
f=src/main/java/rx/Observable.java
# Obtain list of commit hashes
git log --pretty='%H' 1.x -- $f |
while read sha ; do
# Show file at given version
git show $sha:$f |
# Convert it into structure signature
tr -cd '{}'
# Start a new line
echo
done |
# Obtain unique structures and count them
uniq |
wc -l
```
## Worth knowing
* *tsort*: topological sort
* *gvpr*: graph processing
* *diff*: file difference
* *test*: evaluate predicates
* *expr*: evaluate expressions
* *date*: convert between date formats
* *paste*: paste file records side by side
* *rev*: reverse characters of each line
* *tac*: reverse order of lines in file
## Workflow: Summarize
![](artwork/summarize.jpg)
## Simple tools
* *wc*: count records
* *head*: first *N* records
* *tail*: last *N* records
* *uniq*: unique records
## Summarizing with awk
* Associative arrays
* `BEGIN`, `END` pseudo-predicates
## awk filter example
When was each committer's latest commit
```bash
git log --date-order --date=format:%F --pretty='%ae %ad' |
awk '!seen[$1] {print; seen[$1] = 1}'
```
## awk summary example
Count common / total commits in Linux kernel / drivers
```bash
comm <(area_commits kernel) <(area_commits drivers) |
# Sum kernel, driver, and common commits
awk -F$'\t' '
$1 {k++}
$2 {d++}
$3 {c++}
END {print k + c, d + c, c}'
```
Result:
```
22645 310148 2213
```
Out of 22,645 kernel commits and 310,148 driver commits, only 2,213 are common!
## awk array example
When are the most lines committed?
```sh
# git log with hour and added/removed/file records
git log --date=format:%H --pretty=%ad --numstat |
awk '
# Keep record of time
NF == 1 {hour = $1}
# Accumulate lines added
NF == 3 {added[hour] += $1}
END {
for (i in added)
print i, added[i]
}' |
# Order by volume; get top
sort -k 2rn | head -1
```
Result: 15:00 for Linux, 21:00 for FreeBSD, 22:00 for RxJava
## Pushing scale and scope
![](artwork/scale.jpg)
## Example workflow
Find sizes for all test file versions in a Java project
```bash
git log --pretty='%t' |
while read tree; do
git ls-tree --abbrev=10 -r $tree | grep Test.java$ | awk '{print $3, $4}'
done |
sort -u |
while read blob testfile; do
echo -n "$testfile "
git show $blob | wc -l
done |
sort -k2r
```
. . .
**D:** Do you see how the code above could be improved?
## Better version
We can wrap loops into functions and apply them with `xargs` or `parallel`.
This also makes our code testable.
```bash
function tree_items {
git ls-tree --abbrev=10 -r $1 | grep Test.java$ | awk '{print $3, $4}'
}
function file_size {
set $1
echo -n "$1 $2 " ; git show $1 | wc -l
}
time git log --pretty='%t' | parallel tree_items {} |
sort -u |
parallel file_size {} |
sort -k2r
```
. . .
**D:** How can we execute this code on 100s of repos?
## Scaling the analysis
We have two options:
* Run yet another parallel loop
* Use `make`
. . .
```bash
function tree_items {...}
function file_size {...}
function per_repo {
cd $1
git log --pretty="%h,%t,%ae" |
parallel tree_items {} |
sort | uniq |
parallel file_size {} |
sort -k 2
}
export -f tree_items file_size per_repo
time ls | parallel -P 4 per_repo
```
## Repeatability with `make`
```Makefile
INPUTS = $(wildcard *.git)
OUTPUTS = $(INPUTS:.git=.ts)
TESTSIZE=/home/gousiosg/repos/testsize
%.ts : %.git
cd $< && \
$(TESTSIZE) > ../$(@F)
all: $(OUTPUTS)
clean:
rm *.ts
```
We use `make` as both a dependency tracker and a parallelization tool.
```bash
$ make -n # See what will be run
$ make -j 16 # Run 16 parallel jobs
```
## Running long tasks
* Scripting
* Ensure command isn't terminated with `nohup`
* Reconnect to session with `screen` or `tmux`
* Use cloud servers
## Mining with `libgit2`/`JGit`
Shell scripting can go a long way, but it has limitations:
* Each step of a pipeline is a process, which must be spawned. Loops can
be costly.
* Useful data structures (queues, trees, graphs) are cumbersome to implement.
* Testing/Debugging can be tricky.
In those cases, we can use `libgit2` or `JGit`, for scripting languages or
Java; we get exactly the same interface, but in a much richer programming
environment.
## Getting extra information
The selection process will return a list of projects that looks like this:
```csv
project_id, owner, repo
1334,rails,rails
1335,jenkinsm,cellulite
1337,lshift,diffa
1341,CottageLabs,cl
1342,Craga89,qTip2
1345,xforty,vagrant-drupal
```
We might need to get extra information from the GitHub API (e.g. number of
stars, language), to do the selection. This is easy to do with two basic tools:
* `curl` Retrieves contents from URLs
* `jq` Query JSON objects
## Obtain GitHub star count
```bash
cat repos.csv | cut -f 2,3 -d ',' | tr ',' '/' |
xargs -I {} curl -s https://api.github.com/repos/{} |
jq '[.stargazers_count, .language |tostring]|join(", ")' > info.csv
paste -d ',' repos.csv info.csv
```
If your project list is big, you will quickly hit the GitHub API
rate limit. You then need a GitHub API key, which you can provide
to `curl` as follows
```bash
curl -H "Authorization: token OAUTH-TOKEN" -s ...
```
## Obtaining data from diverse sources
[Perceval](https://github.com/chaoss/grimoirelab-perceval) covers tens of systems: askbot, bugzilla, bugzillarest, confluence, discourse, dockerhub, gerrit, git, github, gitlab, googlehits, groupsio, hyperkitty, jenkins, jira, launchpad, mattermost, mbox, mediawiki, meetup, nntp, phabricator, pipermail, redmine, rss, slack, stackexchange, supybot, telegram, twitter
## Cloning 1000s of repos
* Use bare format
* Script the process
```bash
awk -F, '{print $2 "/" $3}' repos.txt |
xargs -I@ -P 8 git clone --bare https://github.com/@
```
## Debugging
* Trace scripts with `-x`
* Exit on errors with `-e`
* Peek on pipeline data with `tee`
* Log to standard error with `echo ... >&2`
## Further reading
* Chacon, Scott, and Ben Straub. *Pro Git*. Berkeley, CA New York, NY: Apress,Distributed to the Book trade worldwide by Spring Science+Business Media, 2014. Print.
* Jeroen Janssens, 2014. *Data Science at the Command Line*, O’Reilly, Sebastopol, CA.
* Friedl, J. E., 2006. *Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools*, 3rd Edition. O'Reilly Media, Sebastopol, CA.
* Kernighan, Brian W., and Rob Pike. *The UNIX programming environment*. Englewood Cliffs, N.J: Prentice-Hall, 1984. Print.
* Diomidis Spinellis. [Tools and techniques for analyzing product and process data](doi:10.1016/B978-0-12-411519-4.00007-0). In Tim Menzies, Christian Bird, and Thomas Zimmermann, editors, *The Art and Science of Analyzing Software Data*, pages 161-212. Morgan-Kaufmann, 2015.
## Licensing information
![Creative Commons](artwork/by-nc-sa.eu.png){width=10%}
This work is (c) 2018 - onwards by Diomidis Spinellis and Georgios Gousios.
It is licensed
under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International](http://creativecommons.org/licenses/by-nc-sa/4.0/) license.