Talk:Find first and last set bit of a long integer

More Clarification "set bit"

The example gives

INT: find first & last set bit  |LWB|UPB|Bits
                               0| 33| 32|0

How are there any set bits in an unsigned integer? --Dgamey 03:43, 8 December 2011 (UTC)

Can you clarify? It would be an interesting aside to see a host code sample that is not for two's complement. NevilleDNZ 04:39, 8 December 2011 (UTC)

Hrm, if it's not in two's complement, what motivates you to find bits in it to begin with? --Ledrug 04:49, 8 December 2011 (UTC)

Wikipedia has a few non-two's-complement ideas:

Sometimes it is just done for fun:

NevilleDNZ 05:52, 8 December 2011 (UTC)

Clarification

The task is unclear in what number should be computed. It mentions "first" and "last" set bits, but with bits written in most-significant to least-significant order, or the other way around? (i.e. which of "upb" and "lwb" should compute the least significant set bit? and which computes the most significant set bit?) For a 32-bit integer should the least significant bit be numbered 0, 1, 31, or 32? Also, what values should be returned if there are no set bits? And for negative integers, should the most significant set bit be the sign bit?

The ALGOL example seems to indicate that the least significant bit is numbered 32 (and the most significant bit is numbered 1). This seems the reverse of most people's bit numberings (where the least significant bit should be numbered 0, and the most significant 31), and also wouldn't make sense for arbitrary-precision integers (where would you start counting?). Also, it seems that "lwb" computes the most significant set bit, while "upb" computes the least significant set bit. Is this the example we should follow for other languages? --208.80.119.67 23:38, 7 December 2011 (UTC)

The whole thing doesn't make sense for multi-precision integers. If you are willing to use a language/library bigint implementation, you basically willingly forfeit control over how it's done behind the scenes; it doesn't make sense to "efficiently find a set bit" -- who says the numbers are internally represented by bits to begin with? --Ledrug 00:30, 8 December 2011 (UTC)

re: Clarification: integers are used in the test case only to conveniently supply binary specimens for testing.

re: This seems the reverse of most people's bit numberings.

C also - correct me if I am wrong - prints the most significant bit/octal/hex digit on the left, and the least significant on the extreme right, hence the operators << and >>. Python and most other programming languages also display and (pretend to) store them this way. It's probably a human thing.

I was referring to the bit numbering, not how they are printed. From what I've seen, the least significant bit is usually numbered "0" and the most significant bit "31". --208.80.119.67 02:34, 8 December 2011 (UTC)
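
(For reference, a minimal C sketch of that convention - LSB numbered 0, MSB of a 32-bit value numbered 31, and -1 returned when no bit is set at all, the convention suggested elsewhere on this page. The names lsb_pos/msb_pos are just illustrative, not part of the task.)

<lang c>
#include <stdio.h>
#include <stdint.h>

/* Position of the lowest set bit, numbering the LSB as 0.  -1 if none set. */
int lsb_pos(uint32_t x)
{
    if (x == 0) return -1;
    int pos = 0;
    while ((x & 1u) == 0) { x >>= 1; pos++; }
    return pos;
}

/* Position of the highest set bit, numbering the LSB as 0.  -1 if none set. */
int msb_pos(uint32_t x)
{
    int pos = -1;
    while (x != 0) { x >>= 1; pos++; }
    return pos;
}

int main(void)
{
    printf("%d %d\n", lsb_pos(0xC0FFEEu), msb_pos(0xC0FFEEu));  /* 1 23 */
    printf("%d %d\n", lsb_pos(0u), msb_pos(0u));                /* -1 -1 */
    return 0;
}
</lang>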

re: And for negative integers

The task is focused on binary/bits. Whether it is signed or unsigned is not so important. The test case requires only positive values. Beyond that just do the natural thing for the specific language & CPU.

re: wouldn't make sense for arbitrary-precision integers

Agreed. It does seem odd having the least significant bit at (algol68g's) 116 for long bits/long ints. Thus this could make Arbitrary Precision integers particularly weird. But this task is about bit manipulation. The test case happens to use long int. Maybe a particular language/CPU puts these in reverse. In such a case just find the location of the first and last non-zero bits on this platform. Try not to worry about what is being represented. Try not to worry about the significance of the actual bits.

Also: The task doesn't ask for Arbitrary Precision bits per se; rather it asks for an algorithm that works for 32 bits, and then another algorithm that works for any particular precision implementation. Hopefully this would not preclude an Arbitrary Width Bits algorithm. ¢ A not of 0 on Arbitrary Precision bits is going to be very interesting! :-) ¢

FYI: Algol68 permits the coercion/casting/widening of bits to a bool array, eg
<lang algol68>
[]BOOL bools a = 2r101,
       bools b = (TRUE, FALSE, TRUE);
BITS bits c = 2r101,
     bits d = bits pack((TRUE, FALSE, TRUE)); # Hence ... #
print((bits width ELEM bits c, []BOOL(bits c)[bits width])) # prints the same last bit twice #
</lang>
Hence there are strong synergies between the bits and []bool types, but they are not the same type. Similarly the types int and bits are distinct and require explicit casting with the operators bin and abs. C - by contrast - does not have this distinction; in C an int also happens to be bits.

Good luck, and Merry Xmas!

NevilleDNZ 02:02, 8 December 2011 (UTC)
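
(A minimal C counterpart of that last point, purely illustrative: in C the same unsigned value serves as both the number and its own bit string, with a shift-and-mask standing in for Algol 68's elem. bit_elem is a made-up name.)

<lang c>
#include <stdio.h>

/* No separate BITS mode in C: test "bit n" (LSB = 0) with a shift and a mask. */
int bit_elem(unsigned x, int n) { return (x >> n) & 1u; }

int main(void)
{
    unsigned x = 0x5u;   /* Algol 68's 2r101 */
    printf("%d%d%d\n", bit_elem(x, 2), bit_elem(x, 1), bit_elem(x, 0));  /* prints 101 */
    return 0;
}
</lang>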

Initial C code specimen requires a long int algorithm

The use of #define in the C code specimen works really well when unrolling a loop.

The code issues are:

  • the code specimen lacks a more general algorithm that would work - for example - with 128- or 256-bit widths (see the sketch after this list).
  • the code confuses "lowest set bit" (a bit-order concept) with "least significant bit" (an integer concept). These are actually almost opposites on most machines.
  • ditto for "upper set bit" and "most significant bit".
  • the results are in hex, eg 0xc0ffee, not binary 0b110000001111111111101110, making it difficult to visually discover/confirm the results.
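
(On the first point, a hedged sketch only - not a replacement for the task's C entry: a #define-based binary search over halving shifts that is width-agnostic as long as the width is a power of two. MSB_POS is a made-up name, and the 128-bit part assumes a compiler that provides unsigned __int128.)

<lang c>
#include <stdio.h>

/* Index (LSB = 0) of the most significant set bit of value, or -1 if zero.
   Works for any unsigned type whose bit width is a power of two. */
#define MSB_POS(type, width, value, result)                        \
    do {                                                           \
        type v_ = (value);                                         \
        (result) = (v_ == 0) ? -1 : 0;                             \
        for (int s_ = (width) / 2; v_ != 0 && s_ > 0; s_ /= 2)     \
            if (v_ >> s_) { v_ >>= s_; (result) += s_; }           \
    } while (0)

int main(void)
{
    int r;
    MSB_POS(unsigned, 32, 0xC0FFEEu, r);
    printf("32-bit msb: %d\n", r);                        /* 23 */
#ifdef __SIZEOF_INT128__
    MSB_POS(unsigned __int128, 128, (unsigned __int128)1 << 100, r);
    printf("128-bit msb: %d\n", r);                       /* 100 */
#endif
    return 0;
}
</lang>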

Try using Duff's device to unroll wider bits types.

A thought: maybe (as C has given us << left shift and >> right shift) this bits task should be asking for the first left bit set, and the first right bit set? This would be "type agnostic", avoiding confusion with integer arithmetic properties.

NJoy NevilleDNZ 03:17, 8 December 2011 (UTC)

From a practical point of view (C is not likely your go-to language for proof-of-concept stuff), anything longer than the longest native type doesn't need its own method, because it has to be in a struct or array, and there are no atomic bit operations anyway -- we are worried about efficiency (are we?), so a generic "any length bit operation" is out of the question. Just scan whatever struct or array from one side to the other and deal with the first non-zero item. You can hardly do better than that without special hardware. Unless we are talking about, say, OpenCL, where it's a whole different matter.
For the first/last nomenclature, an integer type has a well defined least/most significant bit concept, while first/last/left/right all depend on endianness. If the bits are transmitted on a wire, we also need to know the bit ordering within a byte, so these are hazy at best. The 32-bit number 257 can be stored in memory as either "01 01 00 00" or "00 00 01 01", so what's the "last" set bit? Or the "left most" one? Are we going to consider 32 bits as the smallest unit, or 8 bits? Unless you have a spec about using these bits, lsb/msb is a better way to refer to them.
As to the hex vs binary output, I personally find reading 8 hex digits less stressful than reading 32 1s and 0s. I did provide the positions of the related bits, though. --Ledrug 04:22, 8 December 2011 (UTC)
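
(Purely as an illustration of the "scan from one side" approach described above - a sketch, not anyone's submitted solution. It assumes the value is stored as an array of uint32_t words with the least significant word first, which the task does not specify, and the function names are made up. Positions are LSB = 0; -1 means no bit set anywhere.)

<lang c>
#include <stdio.h>
#include <stdint.h>

int msb_of_words(const uint32_t *words, int nwords)
{
    for (int i = nwords - 1; i >= 0; i--)       /* first non-zero word from the top */
        if (words[i] != 0) {
            uint32_t w = words[i];
            int pos = 0;
            while (w >>= 1) pos++;              /* highest set bit within that word */
            return i * 32 + pos;
        }
    return -1;
}

int lsb_of_words(const uint32_t *words, int nwords)
{
    for (int i = 0; i < nwords; i++)            /* first non-zero word from the bottom */
        if (words[i] != 0) {
            uint32_t w = words[i];
            int pos = 0;
            while ((w & 1u) == 0) { w >>= 1; pos++; }   /* lowest set bit within that word */
            return i * 32 + pos;
        }
    return -1;
}

int main(void)
{
    uint32_t big[4] = { 0, 0x100u, 0, 0x80000000u };    /* a 128-bit specimen */
    printf("%d %d\n", lsb_of_words(big, 4), msb_of_words(big, 4));  /* 40 127 */
    return 0;
}
</lang>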

C has given us << left shift and >> right shift operators for int, long int, and long long int. Clearly C knows the difference between left and right. That could be a good starting point. NevilleDNZ 04:31, 8 December 2011 (UTC)

Errors

The C example goes one power too far and runs into problems with losing the MSB. Minor correction needed though; it's not the core task itself that is wrong.

The Algol 68 example though, I cannot see how those figures (especially the LSB) could possibly be correct on a machine that uses binary digits. They are just wildly off. I don't know exactly what's wrong, but I truly don't believe them. Or are they counting from the other end? That'd just be weird given that virtually everyone numbers bits from the LSB, as that gives consistent answers for small integers irrespective of the size of machine integers being used. –Donal Fellows 09:01, 8 December 2011 (UTC)

Compare:

  • C's unsigned has the bits organised right to left, but - IIRC - not numbered in any way.
    • The numbering is "virtual" depending on whether << or >> is used to extract the bit.
    • Maybe C's closest "ordering" concept would be: <lang c>struct {signed int a:1, b:1, c:1, d:1, e:1, f:1, g:1, h:1;} ordered;</lang>
      • IIRC this won't work: <lang c>struct {signed int bits:1[32];} ordered;</lang>
    • In particular unsigned bits (or unsigned) cannot ever be coerced to a bool array.
    • There is no need for, or use of, an actual number or index for the bits.
  • Algol68 has two places where the bit number/index is used:
    • elem operator, eg x elem 1 gives the "first" bit of x, which is the bit on the extreme left.
    • When a bits variable (eg x) is coerced to a bool array - []bool(x) - again the "first" bool - x[1] - is the extreme left bit.
    • One nice effect is x[LWB x:UPB x] would exactly extract the non-zero bits 2r11101101 of 2r0000000111011010000
  • Weirdly, when a short bits literal e.g. bits x:=2r11101101 is assigned into the bits variable x, it is padded on the left with 24 zeros and becomes 2r00000000000000000000000011101101! i.e. right-justified, just like C!
  • shl and shr work the same as C's << and >>

NevilleDNZ 12:01, 8 December 2011 (UTC)

The problem comes when comparing, say, calculations of the bits for an 8-bit number with those for a 64-bit number. The only non-crazy interpretation of upb and lwb involves giving the same values for “65” however many bits are used to represent it. This in turn enables arbitrary-precision integer arithmetic (because you can ignore the actual representation width — a good thing because it may be variable in the implementation). OTOH, it also means that you probably lose the duality between bit arrays and numbers as with numbers you enumerate the bits in logical order, whereas with bit arrays you enumerate in physical order; there's no reason to assume the two are identical (nor is it possible to know the physical order of bits in a byte with modern hardware, though it probably follows the byte order in order to keep circuitry smaller in ALUs). The long and short of it is this: the task asked about integers so the results must be sane for them even if that leads to odd results for bit strings. (I prefer the convention of using -1 for bit positions of 0, as that at least is a position that is always not in the bit sequence at all; this might be in part implied by the fact that the Tcl solution is always using arbitrary-precision math; that's how Tcl's math works.) –Donal Fellows 14:51, 8 December 2011 (UTC)

Fair enough, getting the answer for short int vs long int is useful, and OTOH it is also useful to know just the actual position of the bit without counting backwards from the last bit. Note also:

  • RFCs count from left to right (rfc2360).
  • There is a comparison in wikipedia (LSB_0_bit_numbering vs MSB_0_bit_numbering) and a description of Bit_numbering.
  • Intel asm uses the BSF & BSR op codes.
  • Maybe the lwb and upb routines can be modelled (and renamed) to match Intel's BSF/BSR to avoid confusion? (see the sketch after this list)
  • Or keep lwb and upb for absolute position (MSB_0_bit_numbering), but use lwb0 and upb0 for LSB (bits width) reverse relative positions (LSB_0_bit_numbering).
  • Maybe rlwb and rupb for the "right/reverse/relative" lwb and upb.
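
(As a hedged illustration of the BSF/BSR suggestion in the list above: a small C sketch using the GCC/Clang builtins __builtin_ctzll and __builtin_clzll, which are undefined for zero input, hence the explicit checks. bsf64, bsr64 and msb_first_pos are made-up names; the last one just shows how an "absolute" MSB-first, 1-based position could be derived from the LSB-0 result.)

<lang c>
#include <stdio.h>
#include <stdint.h>

/* BSF/BSR-style routines; positions are LSB = 0, -1 means no bit set. */
int bsf64(uint64_t x) { return x ? __builtin_ctzll(x) : -1; }        /* lowest  set bit */
int bsr64(uint64_t x) { return x ? 63 - __builtin_clzll(x) : -1; }   /* highest set bit */

/* "Absolute" MSB-first numbering for a given width: 1 is the MSB, width is the LSB. */
int msb_first_pos(int lsb0_pos, int width) { return lsb0_pos < 0 ? -1 : width - lsb0_pos; }

int main(void)
{
    uint64_t x = 0xC0FFEEu;
    printf("LSB-0:            lwb=%d upb=%d\n", bsf64(x), bsr64(x));       /* 1 23 */
    printf("64-bit MSB-first: lwb=%d upb=%d\n",
           msb_first_pos(bsr64(x), 64), msb_first_pos(bsf64(x), 64));      /* 41 63 */
    return 0;
}
</lang>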

Certainly greater minds than ours have already pondered this. cf. Lilliput and Blefuscu. :-)

BTW: What is the representation of not 0xf in "arbitrary-precision" Tcl? I (tongue in cheek) suggest hex 0xℵ0 where ℵ is the HTML entity &alefsym;... this can simply be pronounced "all F symbol", meaning an infinite number of Fs. ;-)

NevilleDNZ 23:08, 8 December 2011 (UTC)