Machine assisted license cleanup
--------------------------------

1. Tools

   1.1 scancode toolkit

       A license scanner tool which can be run from the command line and
       provides excellent parallelisation. While fast, it's recommended to
       run it on a machine with tons of CPUs and tons of memory.

       A run with 128 parallel scan threads takes about 15 minutes. Go
       figure how long it will take on your laptop :)

       https://github.com/nexB/scancode-toolkit

   1.2 spdx helper scripts

       A bunch of horrible python scripts with even more horrible shell
       glue.

       git://git.kernel.org/pub/scm/utils/spdx/spdx-utils

       The main workhorse is lcheck.py. I wrote it initially to gather
       statistics and other information, but over time it evolved into a
       Swiss army knife. lcheck.py --help gives you the gory details; no
       manpage, sorry.

   1.3 git

       The git tools must be available.

       A clean linux tree must be cloned. Ensure that there are no
       artifacts from editing, patch directories etc.

   To reproduce the setup (in case you have a big enough machine or
   lots of time for thumb twiddling):

    - Install scancode and git. If you need help with scancode, talk
      to Philipe.

    - Clone the linux kernel

    - Clone the spdx scripts

    - cd into the spdx scripts directory

    - invoke the runscript with:

      ./runall.sh path/to/linux/kernel

      The path can be relative or absolute

    - Wait ....

    - Check the results in the stepX directories

    - Check the results in the kernel directory (each step creates a
      branch).


   For your convenience:

     The spdx-utils repository contains, aside from the master branch, a
     branch named linux-5.0. That branch contains:

     - the scancode json files for each step
     - the stats.txt file for each step
     - the rules which are handled in each step
     - the resulting patches

     The resulting git tree is pushed to:

     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-spdx.git

     Branches step1, step2, step3 contain the steps documented below.


2. Approach

   The Documentation directory is ignored for now. That needs some extra
   care.

   2.1 Files with no license

       These files have not been touched during the first large sweep.

   2.1.1 Build files

         Make/Kconfig files without license information.

   2.1.2 Source files which have only MODULE_LICENSE("GPL") and/or
         EXPORT_SYMBOL_GPL()

         Now that MODULE_LICENSE is clarified this can be tackled.

   The scripts identify these files in the scanner result and add the
   proper license identifier (GPL-2.0-only).
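
   As an illustration of this step only (this is not the code from
   spdx-utils), a minimal Python sketch could look like the following. The
   function names spdx_line() and add_spdx() are hypothetical. The comment
   styles follow the kernel's documented convention: "//" for .c files,
   "/* */" for headers and "#" for build files.

```python
import os

SPDX_TAG = "SPDX-License-Identifier:"


def spdx_line(filename, license_id="GPL-2.0-only"):
    """Return the SPDX line in the comment style used for this file type."""
    base = os.path.basename(filename)
    ext = os.path.splitext(base)[1]
    if ext == ".c":
        return f"// {SPDX_TAG} {license_id}\n"
    if ext == ".h":
        return f"/* {SPDX_TAG} {license_id} */\n"
    if base.startswith(("Makefile", "Kconfig", "Kbuild")):
        return f"# {SPDX_TAG} {license_id}\n"
    # Anything else needs a human decision in this sketch.
    raise ValueError(f"unhandled file type: {filename}")


def add_spdx(filename, text, license_id="GPL-2.0-only"):
    """Prepend the SPDX line unless the text already carries a tag."""
    if SPDX_TAG in text:
        return text
    return spdx_line(filename, license_id) + text
```

   The sketch leaves out everything the real scripts have to care about,
   such as shebang lines, assembly files and patch generation.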

   The scripts generate patches which can be applied with quilt or imported
   into git with 'git quiltimport'.

   SPDX count goes from 22574 to 25712
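
   How the count is obtained is not spelled out here. One plausible way,
   sketched below under that assumption, is to count the files which carry
   an SPDX-License-Identifier tag within their first few lines. The names
   has_spdx() and count_spdx() and the five line window are hypothetical.

```python
import os

SPDX_TAG = "SPDX-License-Identifier:"


def has_spdx(text, max_lines=5):
    """True if the tag appears in the first max_lines lines of text."""
    return any(SPDX_TAG in line for line in text.splitlines()[:max_lines])


def count_spdx(root):
    """Count files below root which carry an SPDX tag near the top."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the git metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="replace") as f:
                    head = "".join(f.readline() for _ in range(5))
            except OSError:
                # Unreadable files and dangling symlinks are skipped.
                continue
            if has_spdx(head):
                count += 1
    return count
```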

   2.2 Files with a single license: GPL-2.0-only or GPL-2.0-or-later

       The scripts handle the following tasks:

       - Find the affected files in the scanner output

       - Generate a list of match rules, each of which represents a unique
         pattern. This is achieved by normalizing the texts (removing
         formatting, white space damage, uppercase / lowercase differences
         and punctuation damage).

       - Add the appropriate license header and remove the boilerplate
         text or the license reference.

       - Create a patch series. Each patch contains only the modifications
         for a single match rule. The rule (and any variants of it) is
         saved in the change log of each patch to ease review.

       - Once a reference data set (compliance data provided by Siemens) is
         available, the scripts will also check for conflicts with that
         data set.
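
       The normalization and grouping idea from the list above can be
       sketched in a few lines of Python. This is an assumption about the
       general technique, not the actual lcheck.py logic; normalize() and
       group_by_rule() are hypothetical names.

```python
import re
from collections import defaultdict


def normalize(text):
    """Reduce a license notice to a canonical form for comparison."""
    text = text.lower()
    # Replace every run of characters which are not letters or digits
    # (comment markers, punctuation, line breaks) with a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def group_by_rule(notices):
    """Map each normalized text to the list of files which carry it.

    notices is an iterable of (filename, raw_notice_text) pairs.
    """
    rules = defaultdict(list)
    for filename, raw in notices:
        rules[normalize(raw)].append(filename)
    return dict(rules)
```

       Each key of the resulting dictionary would then correspond to one
       match rule, and the file list under it to the files touched by one
       patch of the series.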

       This results in 515 patches at the moment.

       The scripts generate patches which can be applied with quilt or
       imported into git with 'git quiltimport'.

       SPDX count goes from 25712 to 46368

   2.3 Files with GPL-2.0-only/or-later and Linux-OpenIB

       Basically the same as above, just with dual licensing.


   2.4 More fun later :)

       I have quite a bunch of steps in preparation, but let's get the
       above agreed on and reviewed first.