Build and Execute a Simple Pipeline¶
Let’s consider a simple pipeline working on two input files, a.txt
:
1 | Hello you!
|
and b.txt
:
1 2 | Hello you,
and your friends!
|
For each of the two files, the first stage of the pipeline computes the number of lines, words and characters and stores in a comma-separated file.
The second stage combines the two comma-separated files into a single comma-separated file with an extra field to indicate the source.
Build pipeline¶
The Python code dodo.py
for building this pipeline using JUDI is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | from judi import File, Task, add_param, combine_csvs
add_param([1, 2], 'n')
class GetCounts(Task):
"""Count lines, words and characters in file"""
inputs = {'inp': File('text', path=['a.txt', 'b.txt'])}
targets = {'out': File('counts.csv')}
actions = [("(echo line word char file; wc {}) | sed 's/^ \+//;s/ \+/,/g' > {}", ["$inp", "$out"])]
class CombineCounts(Task):
"""Combine counts"""
mask = ['n']
inputs = {'inp': GetCounts.targets['out']}
targets = {'out': File('result.csv', mask=mask, root='.')}
actions = [(combine_csvs, ["#inp", "#out"])]
|
Execute pipeline¶
The pipeline is executed from command line:
$ doit -f dodo.py
. GetCounts:n~1
. GetCounts:n~2
. CombineCounts:
The .
before each pipeline task denotes that the task was computed afresh.
The first stage generates two intermediate count files, judi_files/n~1/counts.csv
and
./judi_files/n~2/counts.csv
.
$ cat judi_files/n~1/counts.csv
line,word,char,file
1,2,11,a.txt
$ cat judi_files/n~2/counts.csv
line,word,char,file
2,5,29,b.txt
The second stage consolidates the counts in a file result.csv
:
$ cat result.csv
line,word,char,file,n
1,2,11,a.txt,1
2,5,29,b.txt,2
Re-execute pipeline¶
Invoking doit
again gives:
$ doit -f dodo.py
-- GetCounts:n~1
-- GetCounts:n~2
-- CombineCounts:
where --
denotes that the pipeline task was not executed.
Now let’s update the second input file b.txt
to:
1 2 3 | Hello you,
your friends,
and whole world!
|
and execute the pipeline again:
$ doit -f dodo.py
. GetCounts:n~2
-- GetCounts:n~1
. CombineCounts:
This time only the counts for b.txt
is recomputed, the unaffected part of
the pipeline for a.txt
is not executed.