diff mbox series

[FFmpeg-devel,v2] doc/developer: origin of tables should be documented.

Message ID 20200808175709.7359-1-george@nsup.org
State New
Headers show
Series [FFmpeg-devel,v2] doc/developer: origin of tables should be documented. | expand

Checks

Context Check Description
andriy/default pending
andriy/make success Make finished
andriy/make_fate success Make fate finished

Commit Message

Nicolas George Aug. 8, 2020, 5:57 p.m. UTC
Tables that were not just written by the code author are
not actually source code, otherwise,
"recode data..x1 < proprietary.o > source.c"
would be enough to launder a proprietary blob into
the source code.

Documenting the origin of the tables or the methods
for their generation is necessary to let other developers
take over if the original author is no longer available.

Signed-off-by: Nicolas George <george@nsup.org>
---
 doc/developer.texi | 8 ++++++++
 1 file changed, 8 insertions(+)


I count:

- Two objections, to which I have answered, and who have not given
  follow up.

- One objection about a typo, I fixed "engineered" and proof-read
  everything carefully.

- Two positive opinions.

Andreas:

> Not sure I agree with your definition of Libre Software and your exact

I was more trying to express the difference between Libre Software and
Open Source, which is minute in legal terms but huge in terms of
mentality. Anyway, this part is only explanatory, it does not go into
the tree.

> wording of the patch.

Have you suggestions to make it better?

> I agree that leaving a hint of where the data comes from, directly where
> the data is defined in the source code, is a good thing and probably not
> to much to ask for.

Thanks.

Comments

Mark Thompson Aug. 8, 2020, 9:26 p.m. UTC | #1
On 08/08/2020 18:57, Nicolas George wrote:
> Tables that were not just written by the code author are
> not actually source code, otherwise,
> "recode data..x1 < proprietary.o > source.c"
> would be enough to launder a proprietary blob into
> the source code.
> 
> Documenting the origin of the tables or the methods
> for their generation is necessary to let other developers
> take over if the original author is no longer available.
> 
> Signed-off-by: Nicolas George <george@nsup.org>
> ---
>   doc/developer.texi | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> 
> I count:
> 
> - Two objections, to which I have answered, and who have not given
>    follow up.
> 
> - One objection about a typo, I fixed "engineered" and proof-read
>    everything carefully.
> 
> - Two positive opinions.

+1 from me too, I think this is a good thing to include.

> 
> Andreas:
> 
>> Not sure I agree with your definition of Libre Software and your exact
> 
> I was more trying to express the difference between Libre Software and
> Open Source, which is minute in legal terms but huge in terms of
> mentality. Anyway, this part is only explanatory, it does not go into
> the tree.
> 
>> wording of the patch.
> 
> Have you suggestions to make it better?
> 
>> I agree that leaving a hint of where the data comes from, directly where
>> the data is defined in the source code, is a good thing and probably not
>> to much to ask for.
> 
> Thanks.
> 
> 
> diff --git a/doc/developer.texi b/doc/developer.texi
> index b33cab0fc7..c3103f31dc 100644
> --- a/doc/developer.texi
> +++ b/doc/developer.texi
> @@ -216,6 +216,14 @@ please use av_log() instead.
>   @item
>   Casts should be used only when necessary. Unneeded parentheses
>   should also be avoided if they don't make the code easier to understand.
> +
> +@item
> +If the code contains tables of numbers or other data, their origin should be
> +documented in a comment, so that other developers can rebuild them if
> +necessary. If they were taken from a reference, include the URL of that
> +reference. If they were computed by a tool, include the code of the tool.
> +If they were reverse-engineered, include an honest attempt at explaining the
> +methods used.

Also: if they were taken from a written specification, include the section or table number within that specification.

(A reference to "H.264 table A-1" is more useful than one to "<https://www.itu.int/rec/T-REC-H.264/en>".)

>   @end itemize
>   
>   @section Editor configuration
> 

Thanks,

- Mark
Zane van Iperen Aug. 9, 2020, 1:06 a.m. UTC | #2
On Sat, 8 Aug 2020 19:57:09 +0200
"Nicolas George" <george@nsup.org> wrote:

> 
> Tables that were not just written by the code author are
> not actually source code, otherwise,
> "recode data..x1 < proprietary.o > source.c"
> would be enough to launder a proprietary blob into
> the source code.
> 
> Documenting the origin of the tables or the methods
> for their generation is necessary to let other developers
> take over if the original author is no longer available.
> 
> Signed-off-by: Nicolas George <george@nsup.org>
> ---
>  doc/developer.texi | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> 
> I count:
> 
> - Two objections, to which I have answered, and who have not given
>   follow up.
> 
> - One objection about a typo, I fixed "engineered" and proof-read
>   everything carefully.
> 
> - Two positive opinions.
> 

I'm apprehensive about this, especially in the case of
reverse-engineered tables. It should definitely be encouraged, but not
necessarily hard-required.

If you explicitly say "Reverse Engineered from so-and-so", that seems
essentially like putting a target on FFmpeg's back.

Case-in-point: No reference decoder/encoder exists, and the tables had
to be dumped from the application binary.

Zane
Alexander Strasser Aug. 9, 2020, 11:09 a.m. UTC | #3
Hi Nicolas!

Am 8. August 2020 23:26:02 MESZ schrieb Mark Thompson <sw@jkqxz.net>:
>On 08/08/2020 18:57, Nicolas George wrote:

[...]

>> Andreas:

I'm sure the following was addressed at me.


>>> Not sure I agree with your definition of Libre Software and your
>exact
>> 
>> I was more trying to express the difference between Libre Software
>and
>> Open Source, which is minute in legal terms but huge in terms of
>> mentality. Anyway, this part is only explanatory, it does not go into
>> the tree.

Yes, that's fine.

IMHO we don't actually need ideology for this matter.


>>> wording of the patch.
>> 
>> Have you suggestions to make it better?

At least the aspect Mark mentioned below.

In general I think it could be worded a bit clearer and easier to grasp. Basically it should ask for a short comment in front of non-obvious data definitions.

I can try to write something later or tomorrow. Not sure if I can really come up with something much better, so I would be OK with your wording if Mark's comment is addresed.


  Alexander

>>> I agree that leaving a hint of where the data comes from, directly
>where
>>> the data is defined in the source code, is a good thing and probably
>not
>>> to much to ask for.
>> 
>> Thanks.
>> 
>> 
>> diff --git a/doc/developer.texi b/doc/developer.texi
>> index b33cab0fc7..c3103f31dc 100644
>> --- a/doc/developer.texi
>> +++ b/doc/developer.texi
>> @@ -216,6 +216,14 @@ please use av_log() instead.
>>   @item
>>   Casts should be used only when necessary. Unneeded parentheses
>>   should also be avoided if they don't make the code easier to
>understand.
>> +
>> +@item
>> +If the code contains tables of numbers or other data, their origin
>should be
>> +documented in a comment, so that other developers can rebuild them
>if
>> +necessary. If they were taken from a reference, include the URL of
>that
>> +reference. If they were computed by a tool, include the code of the
>tool.
>> +If they were reverse-engineered, include an honest attempt at
>explaining the
>> +methods used.
>
>Also: if they were taken from a written specification, include the
>section or table number within that specification.
>
>(A reference to "H.264 table A-1" is more useful than one to
>"<https://www.itu.int/rec/T-REC-H.264/en>".)
>
>>   @end itemize
>>   
>>   @section Editor configuration
>> 
>
>Thanks,
>
>- Mark
Nicolas George Aug. 9, 2020, 11:13 a.m. UTC | #4
Alexander Strasser (12020-08-09):
> >> Andreas:
> 
> I'm sure the following was addressed at me.

Yes, sorry to both of you.

> IMHO we don't actually need ideology for this matter.

I hope so.

> At least the aspect Mark mentioned below.
> 
> In general I think it could be worded a bit clearer and easier to grasp. Basically it should ask for a short comment in front of non-obvious data definitions.
> 
> I can try to write something later or tomorrow. Not sure if I can really come up with something much better, so I would be OK with your wording if Mark's comment is addresed.

Thanks. I will wait for your proposal, and in the meantime see if I can
come up with a better way of saying it myself.

Regards,
Nicolas George Aug. 9, 2020, 11:15 a.m. UTC | #5
Zane van Iperen (12020-08-09):
> I'm apprehensive about this, especially in the case of
> reverse-engineered tables. It should definitely be encouraged, but not
> necessarily hard-required.
> 
> If you explicitly say "Reverse Engineered from so-and-so", that seems
> essentially like putting a target on FFmpeg's back.
> 
> Case-in-point: No reference decoder/encoder exists, and the tables had
> to be dumped from the application binary.

I would say that the tables themselves put a target our back, and the
comment mitigates it, since dumping tables from a binary is legal
(tables of numbers contain no creativity and therefore create no
copyright).

But anyway, the patch says "should", and such improbable cases, if they
arise, can be discussed on a case by case basis, possibly even in
private.

Regards,
Jean-Baptiste Kempf Aug. 10, 2020, 3:34 p.m. UTC | #6
On Sat, 8 Aug 2020, at 19:57, Nicolas George wrote:
> +@item
> +If the code contains tables of numbers or other data, their origin should be
> +documented in a comment, so that other developers can rebuild them if
> +necessary.

I would add "as honestly as possible" after documented... or something like that, explaining that there are cases that canno be documented.

>+ If they were taken from a reference, include the URL of that

"from a doc/reference, name it, with section information, if applicable".

> +reference. If they were computed by a tool, include the code of the tool.

include it wehere?

> +If they were reverse-engineered, include an honest attempt at explaining the
> +methods used.

That is not even always possible. And sometimes you don't want to say how you got that table.
If a table is a math function, sure, it's doable,but if it is a decryption key, it is random...

Best,
Tomas Härdin Aug. 11, 2020, 1:31 p.m. UTC | #7
lör 2020-08-08 klockan 19:57 +0200 skrev Nicolas George:
> diff --git a/doc/developer.texi b/doc/developer.texi
> index b33cab0fc7..c3103f31dc 100644
> --- a/doc/developer.texi
> +++ b/doc/developer.texi
> @@ -216,6 +216,14 @@ please use av_log() instead.
>  @item
>  Casts should be used only when necessary. Unneeded parentheses
>  should also be avoided if they don't make the code easier to understand.
> +
> +@item
> +If the code contains tables of numbers or other data, their origin should be
> +documented in a comment, so that other developers can rebuild them if
> +necessary. If they were taken from a reference, include the URL of that
> +reference. If they were computed by a tool, include the code of the tool.
> +If they were reverse-engineered, include an honest attempt at explaining the
> +methods used.

Look OK to me beyond the possible RE issues which have already been
highlighted. I try to put references to the relevant SMPTE spec and
section/table number whenever I make changes to mxf*, but of course no
one's perfect.

/Tomas
Alexander Strasser Aug. 11, 2020, 10:29 p.m. UTC | #8
On 2020-08-09 13:13 +0200, Nicolas George wrote:
> Alexander Strasser (12020-08-09):
> > >> Andreas:
[...]
>
> > At least the aspect Mark mentioned below.
> >
> > In general I think it could be worded a bit clearer and easier to grasp. Basically it should ask for a short comment in front of non-obvious data definitions.
> >
> > I can try to write something later or tomorrow. Not sure if I can really come up with something much better, so I would be OK with your wording if Mark's comment is addresed.
>
> Thanks. I will wait for your proposal, and in the meantime see if I can
> come up with a better way of saying it myself.

I ended up with this:

    Definitions of non-obvious data should have a short comment
    explaining their origin.

    If the data is of mathematical origin, you can document that
    or use code snippets or pseudo-code. If the data was gained
    empirically, describe the methods used. If the data was taken
    from a document like a specification, reference the section
    and/or table number. A link can also be used, if there is a
    stable source and there are no better ways.

    If you generated the data with a program, consider including
    the source code in FFmpeg and reference it in the comment.

    Typical examples are tables of numbers. Here is one:

        <nice example to be found and inserted>


I feel it could well be improved, though I wasn't able to do it
myself :( Maybe others can help.

Not sure it's clearly better than your patch. It's different
and a bit longer and also includes the document section stuff.
As it splits the rule itself from the hints makes it's easier
to extract the essence.

As for the example, I couldn't readily find a good one in our
code base. So if anyone could point one out, that would help.
We could also use one of your examples, or we could leave out
the example entirely. I think it would be better to include one,
because it might raise adoption of the convention.

Anyway, use my words above as you see fit. If others agree
to this version, I can also sent a proper patch.


  Alexander
Jean-Baptiste Kempf Aug. 12, 2020, 10:32 a.m. UTC | #9
On Wed, 12 Aug 2020, at 00:29, Alexander Strasser wrote:
>     Definitions of non-obvious data should have a short comment
>     explaining their origin.
> 
>     If the data is of mathematical origin, you can document that
>     or use code snippets or pseudo-code. If the data was gained
>     empirically, describe the methods used. If the data was taken
>     from a document like a specification, reference the section
>     and/or table number. A link can also be used, if there is a
>     stable source and there are no better ways.
> 
>     If you generated the data with a program, consider including
>     the source code in FFmpeg and reference it in the comment.
> 
>     Typical examples are tables of numbers. Here is one:
> 
>         <nice example to be found and inserted>
> 
> 
> I feel it could well be improved, though I wasn't able to do it
> myself :( Maybe others can help.

What about RE values?
Alexander Strasser Aug. 12, 2020, 12:38 p.m. UTC | #10
On 2020-08-12 12:32 +0200, Jean-Baptiste Kempf wrote:
> On Wed, 12 Aug 2020, at 00:29, Alexander Strasser wrote:
> >     Definitions of non-obvious data should have a short comment
> >     explaining their origin.
> >
> >     If the data is of mathematical origin, you can document that
> >     or use code snippets or pseudo-code. If the data was gained
> >     empirically, describe the methods used. If the data was taken
> >     from a document like a specification, reference the section
> >     and/or table number. A link can also be used, if there is a
> >     stable source and there are no better ways.
> >
> >     If you generated the data with a program, consider including
> >     the source code in FFmpeg and reference it in the comment.
> >
> >     Typical examples are tables of numbers. Here is one:
> >
> >         <nice example to be found and inserted>
> >
> >
> > I feel it could well be improved, though I wasn't able to do it
> > myself :( Maybe others can help.
>
> What about RE values?

All in all it's same as Nicolas' proposal: The convention is to
document the origin of the data. It says should, which is not must.

I just dropped the explicit mention, since I feel it added no value.
There were some comments in this discussion, indicating to me that
it could even irritate more than help.

Do you think mentioning RE explicitly would be better? If yes, why?

For RE there are theoretically different ways to gain values,
if it was reconstructed by inspecting and comparing outputs of
a binary, that methods could be described or sources of the used
tools could be included.

If it is just a dump of a table from a binary or a trivial
transformation thereof, than adding a comment might not have much
value at all. Maybe if a contributor is comfortable with it, he
could mention from which binary it was dumped. In some cases
it might not really be different at all from a pointer to a spec,
if the spec doesn't reveal the origin of the included data.

If the dump has undergone non-obvious transformations, then
describing those would certainly be helpful.

As Nicolas mentioned before, for such special cases discussions
could happen on an case by case basis if a reviewer cares enough.

As I cannot judge any legal stuff surrounding this, which could
even be different in different countries, I guess it's important
to not urge individuals to write things they are not comfortable
with.


  Alexander
Jean-Baptiste Kempf Aug. 14, 2020, 9:34 a.m. UTC | #11
On Wed, 12 Aug 2020, at 14:38, Alexander Strasser wrote:
> On 2020-08-12 12:32 +0200, Jean-Baptiste Kempf wrote:
> > On Wed, 12 Aug 2020, at 00:29, Alexander Strasser wrote:
> > >     Definitions of non-obvious data should have a short comment
> > >     explaining their origin.
> > >
> > >     If the data is of mathematical origin, you can document that
> > >     or use code snippets or pseudo-code. If the data was gained
> > >     empirically, describe the methods used. If the data was taken
> > >     from a document like a specification, reference the section
> > >     and/or table number. A link can also be used, if there is a
> > >     stable source and there are no better ways.
> > >
> > >     If you generated the data with a program, consider including
> > >     the source code in FFmpeg and reference it in the comment.
> > >
> > >     Typical examples are tables of numbers. Here is one:
> > >
> > >         <nice example to be found and inserted>
> > >
> > >
> > > I feel it could well be improved, though I wasn't able to do it
> > > myself :( Maybe others can help.
> >
> > What about RE values?
> 
> All in all it's same as Nicolas' proposal: The convention is to
> document the origin of the data. It says should, which is not must.

SHOULD can mean "really mandatory, besides exceptions", so I would soften it, to explain common sense must be shared, like "if origin is mathematical or specification", or similar.

But I like your version.
Alexander Strasser Aug. 15, 2020, 8:02 a.m. UTC | #12
On 2020-08-14 11:34 +0200, Jean-Baptiste Kempf wrote:
> On Wed, 12 Aug 2020, at 14:38, Alexander Strasser wrote:
> > On 2020-08-12 12:32 +0200, Jean-Baptiste Kempf wrote:
> > > On Wed, 12 Aug 2020, at 00:29, Alexander Strasser wrote:
> > > >     Definitions of non-obvious data should have a short comment
> > > >     explaining their origin.
> > > >
> > > >     If the data is of mathematical origin, you can document that
> > > >     or use code snippets or pseudo-code. If the data was gained
> > > >     empirically, describe the methods used. If the data was taken
> > > >     from a document like a specification, reference the section
> > > >     and/or table number. A link can also be used, if there is a
> > > >     stable source and there are no better ways.
> > > >
> > > >     If you generated the data with a program, consider including
> > > >     the source code in FFmpeg and reference it in the comment.
> > > >
> > > >     Typical examples are tables of numbers. Here is one:
> > > >
> > > >         <nice example to be found and inserted>
> > > >
> > > >
> > > > I feel it could well be improved, though I wasn't able to do it
> > > > myself :( Maybe others can help.
> > >
> > > What about RE values?
> >
> > All in all it's same as Nicolas' proposal: The convention is to
> > document the origin of the data. It says should, which is not must.
>
> SHOULD can mean "really mandatory, besides exceptions", so I would soften it, to explain common sense must be shared, like "if origin is mathematical or specification", or similar.
>
> But I like your version.

Maybe instead of

    Definitions of non-obvious data should have a short comment
    explaining their origin.

changing the first paragraph to

    Consider documenting the origin of non-obvious data in a
    short commment above its definition.

would help?

It sounds less like a "formal should" and probably doesn't change the
"effective adoption rate" at all. What do you think?

Thanks for your feedback!


  Alexander
diff mbox series

Patch

diff --git a/doc/developer.texi b/doc/developer.texi
index b33cab0fc7..c3103f31dc 100644
--- a/doc/developer.texi
+++ b/doc/developer.texi
@@ -216,6 +216,14 @@  please use av_log() instead.
 @item
 Casts should be used only when necessary. Unneeded parentheses
 should also be avoided if they don't make the code easier to understand.
+
+@item
+If the code contains tables of numbers or other data, their origin should be
+documented in a comment, so that other developers can rebuild them if
+necessary. If they were taken from a reference, include the URL of that
+reference. If they were computed by a tool, include the code of the tool.
+If they were reverse-engineered, include an honest attempt at explaining the
+methods used.
 @end itemize
 
 @section Editor configuration