User talk:Hailholyghost

Welch's t-test

Hello, you wrote: "Welch's t-test is only part of the calculation, it isn't the purpose of the page."

What's the purpose of the page, then?

By the way, "Given two lists of data, calculate the p-value used for Calculation of P-value." is totally meaningless. Which p-value? There are zillions of p-value, associated with zillions of statistical tests. Basically all you have to know is the probability distribution of some statistic, but there are infinitely many of them: you are not going to ask for every possible distribution, so you have to choose one. Here, all the task in its current state is about Welch's t-test and how to compute the corresponding p-value, and yet you pretend it's not the purpose of the page. Puzzling.

Eoraptor (talk) 11:37, 8 December 2017 (UTC)

Welch's t-test is easy, the reason I created this page is because I had no idea how to calculate the integration, which is the point of the page. This page is meant to calculate this the same as R's "t.test(x,y,paired=FALSE)" If this page were about the t-test, as you say it is, the C code would have 1 line, and is completely trivial. The point of the page is to show how to do the non-trivial things: integration. I spent weeks figuring out how to do this, and your title change obfuscates my hard work.--Hailholyghost (talk) 12:51, 8 December 2017 (UTC)

Then this is a very badly designed task. There are already tasks about integration. Anyway, nobody would compute the p-value this way: there are good implementations of special functions, here the incomplete beta function, and usually they make use of specific properties of the functions, like series, continued fraction, optimal rational (Pade like) or polynomial (Chebyshev equioscillation) approximations, etc. If you want the p-value, the "standard" way is to find a special function library. Integration should be a last resort for Rosetta Code, for languages that do not have readily available special functions libraries. Common languages such as C, Fortran or Python all have something (I implemented the task in Python using numpy/scipy, and there is something in Fortran using IMSL and SLATEC, both well-known libraries), statistical packages even have builtin function for the whole test (see Stata and R here).

If you want to see how to use a generalist integration routine to compute the p-value, you are doing it wrong. If you want a general task on integration, you are doing it wrong too, and there is already a task about this.

You write the C code is trivial: really? How? You will need at least the incomplete beta function from GSL (or IMSL, NAG or any library with it, even a Fortran-based library), and this will require more than one line (you may do more or less what is done here in Fortran or Python). Even getting a working GSL on Windows is not trivial at all. And what you call the non-trivial thing is really not about integration at all.

Eoraptor (talk) 13:13, 8 December 2017 (UTC)

For instance, the R implementation uses a port of TOMS 708 from the ACM. The original code is here. Notice that, as usual with special functions, there are calls to different methods according the the values of the arguments. There are reasons not to reimplement this from scratch: it requires much research work to prove a given algorithm is correct, to find necessary coefficients with enough precision, and to get a correct and efficient implementation. Here you are not even trying to investigate the convergence of the integral, and it's an inefficient way to find an accurate answer. Eoraptor (talk) 13:30, 8 December 2017 (UTC)

"You will need at least the incomplete beta function" you are contradicting yourself. The Welch's t-test requires no integration. It took me a long time to implement this in portable GNU99 C, which wouldn't require library installations. This was a particularly daunting task, because I couldn't install the libraries- they were useless. I thought the world could benefit from my work. If you have some better way of doing this, fine, re-write the C code to implement the algorithms you mention.

Nonetheless, I don't see how to alter the title of the page. It appears that this page is stuck with a wrong title.

Oh, no integration, no beta incomplete function? That is going to be quite difficult. You need the beta incomplete function, which can be written as an integral. Either you integrate, or you apply clever techniques to compute the function without integration. But you have to do something. And what you ask for is not doable anyway (no "general p-value"). You don't seem to be understanding what you are trying to do wrong. Good luck. I give up. Eoraptor (talk) 15:46, 8 December 2017 (UTC)

Hi,

Suggested reading for your Python code: PEP 8 -- Style Guide for Python Code Most notably, tabs are not recommended but 4 spaces per tab are common, line length should be limited, no spaces after opening paren and before closing paren, no space between function name and following paren... While it may look stupid to you, these rules are applied in virtually every published Python code, and this is an important part of the readability of this language. There is also a minor bug at the beginning (two tests on ARRAY1_SIZE, none on ARRAY2_SIZE), and "while (1):" would rather be written "while True:". There are other oddities (inconsistent spacing around operators at least). If you don't mind, I could rewrite this in a more "pythonistic" way.

As an aside, it's an often overlooked matter (and Burkardt was guilty as well, as far as I was able to check) : translating from a language to another language requires more than just converting syntax. It's a criticism that was also made to the famous (or infamous) Numerical Recipes : the C version looked too much like the Fortran original, and not like usual C.

I also moved the numpy solution first: not that I am eager to put "my" solution first, but: 1/ It's much shorter, so the second solution is more visible if it's the low-level implementation 2/ It's not what would actually be done in Python, for several reasons: speed (use compiled code when it's available), but most importantly because statistical computations in Python are usually done with the numpy/scipy/pandas framework, which is much closer to what's available in, say, R, than "pure Python", which would require reinventing everything.

HTH

Eoraptor (talk) 13:08, 10 March 2018 (UTC)

Hi Eoraptor, thanks for your suggestions. I've tried to re-edit my code as much as I could. I am new, and am learning from old code. I don't know how I can make the code idiomatic. I make sure that the output is correct before I post the code. Please feel free to edit what I put, I won't be insulted.--Hailholyghost (talk) 20:01, 12 March 2018 (UTC)

Done. A few notes: I use Python 3 (the only visible difference here is I write print("something"), not print "something"). I put the betain in a separate function: easier to debug, easier to reuse. While the naming convention is debatable, I feel it's more readable to have short names in long formulas. The main welch_ttest function returns what is usually printed when doing that kind of test: statistic, degrees of freedom, p-value. I have limited the output to one example, but I'll add the other ones if you wish.

I translated the betain function from Stata, as it's very little work, but there is still a problem with the license. I asked the RSS, but got no response. My problem is: the Applied Statistics routines were written by various authors, however the RSS claims to hold copyright on all the code. Why not, but there is no mention of the exact license (except that code should be freely redistributed). The GPL is much more explicit, and I don't think Burkardt had the right to reditribute a translation of AS 63 under GPL (at least, I have no evidence showing he was entitled to do so). It's a well-known problem with ancient numerical codes. By ancient, I mean 70s and 80s: code was often distributed with no specific notion of license, nothing was ever explicit. And years later software engineering was more "formalized", and the status of old code was not very clear. It's one of the reason the GSL was created: a kind of modern SLATEC, with explicit GPL license. And because old code licenses were so fuzzy, the GSL guys decided not to reuse the "golden oldies" like SLATEC, CMLIB, NSWC, MATH77... those large generalist numerical libraries from the 70s, often written by US governement personel or at least with US government funding (so there should be no problem, however the copyright holders are usually not known, and there may be surprises). From time to time other free software projects consider using them, but the answer is always the same: too unclear, too risky.

Long story short: don't use these codes in software you plan to redistribute.

Hope this helps,

Eoraptor (talk) 21:54, 12 March 2018 (UTC)

P-value correction

Hi,

In the task there are a number of p-values. Would it be possible to reproduce the original datasets? Here is the reason: these adjustments are used for multiple comparisons, and there may be routines in statistical software to do that, including the adjustment, but starting from actual data, not p-values. While I understand that the task goal is to mimick R, it's a bit silly to have to rewrite everything if there is already basically the same functionality.

I didn't look at SAS documentation yet, but in Stata there is the dunntest package from Alexis Dinno, which does exactly that. See the help (in Stata's markup language, but it should be readable).

Eoraptor (talk) 15:54, 13 March 2018 (UTC)

Actually, SAS can do both with PROC MULTTEST: usually it will be used with a dataset, but it's possible to pass p-values as well. Eoraptor (talk) 15:56, 13 March 2018 (UTC)

Hi Eoraptor, unfortunately the pvalues generated were from a random number generator in R, probably `rnorm` or something similar (I did this several months ago, I don't remember exactly).--Hailholyghost (talk) 19:55, 13 March 2018 (UTC)

Ok, thanks. That's what I suspected, but I tried and asked :) Eoraptor (talk) 21:21, 13 March 2018 (UTC)

License

Hi,

There was already a problem with Fivenum, but the code is rather short and obvious there.

There is another problem, see Talk:P-value correction#License problem and User talk:Thundergnat#Problem with GPL code inserted in RC and translated in other languages.

Please, read carefully Rosetta Code:Copyrights and don't do that again.

Now, we have implementations in several languages, which required a fair amount of work from their authors, but which are nevertheless protected by GPL because they are modified versions of GPL'd code. And this is a problem on Rosetta Code.

Eoraptor (talk) 11:28, 14 March 2018 (UTC)

I would like to add, however, that the tasks you created about statistics are interesting and useful here on RC (and I'd like to see more of them). This is a problem about how you give a reference implementation and/or documentation, not about the tasks themselves. You can't just copy/paste R source: that R is free software does not mean you can do anything you want with its source code. I'll try to make the tasks more clear, but I don't have much time to invest in this currently. At least I update the P-value correction task with full reference to the articles you pointed to, and added JSTOR links. Since I have an access to JSTOR, I can have a look to these articles and I may be able to provide code written from scratch for this. But it will take some time. Eoraptor (talk) 11:43, 14 March 2018 (UTC)

Cannot run python libraries

I moved this discussion to your talk page, as it's not a general problem with Wlech't test, but a specific you problem you had with your program

current numpy/scipy/scipy.stats will not run on my Ubuntu VM:

<lang python> true Traceback (most recent call last):

 File "numpy_scipy_pvalue.py", line 3, in <module>
   import scipy.stats
 File "/usr/lib/python2.7/dist-packages/scipy/stats/__init__.py", line 344, in <module>
   from .stats import *
 File "/usr/lib/python2.7/dist-packages/scipy/stats/stats.py", line 173, in <module>
   import scipy.special as special
 File "/usr/lib/python2.7/dist-packages/scipy/special/__init__.py", line 643, in <module>
   from ._ellip_harm import ellip_harm, ellip_harm_2, ellip_normal
 File "/usr/lib/python2.7/dist-packages/scipy/special/_ellip_harm.py", line 7, in <module>
   from ._ellip_harm_2 import _ellipsoid, _ellipsoid_norm
 File "scipy/special/_ellip_harm_2.pyx", line 5, in init scipy.special._ellip_harm_2 (scipy/special/_ellip_harm_2.c:7330)
 File "/usr/lib/python2.7/dist-packages/scipy/integrate/__init__.py", line 59, in <module>
   from ._bvp import solve_bvp
 File "/usr/lib/python2.7/dist-packages/scipy/integrate/_bvp.py", line 10, in <module>
   from scipy.sparse.linalg import splu
 File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/__init__.py", line 112, in <module>
   from .isolve import *
 File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/isolve/__init__.py", line 6, in <module>
   from .iterative import *
 File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/isolve/iterative.py", line 84, in <module>
   def bicg(A, b, x0=None, tol=1e-5, maxiter=None, xtype=None, M=None, callback=None):
 File "/usr/lib/python2.7/dist-packages/scipy/_lib/_threadsafety.py", line 59, in decorator
   return lock.decorate(func)
 File "/usr/lib/python2.7/dist-packages/scipy/_lib/_threadsafety.py", line 47, in decorate
   return scipy._lib.decorator.decorate(func, caller)
 File "/usr/lib/python2.7/dist-packages/scipy/_lib/decorator.py", line 241, in decorate
   evaldict, __wrapped__=func)
 File "/usr/lib/python2.7/dist-packages/scipy/_lib/decorator.py", line 224, in create
   self = cls(func, name, signature, defaults, doc, module)
 File "/usr/lib/python2.7/dist-packages/scipy/_lib/decorator.py", line 120, in __init__
   formatvalue=lambda val: "", *argspec[:-2])[1:-1])
 File "/usr/lib/python2.7/inspect.py", line 871, in formatargspec
   return '(' + string.join(specs, ', ') + ')'

AttributeError: 'module' object has no attribute 'join' </lang>

these libraries seem promising, but I can't use them.

--Hailholyghost (talk) 15:11, 16 March 2018 (UTC)

Can you post your own program that attempts to use these libraries? Does it fail when you only import the lib? If not, does the following fail?

import numpy as np
a = np.array([1,2,3],dtype=float)
np.sum(a)

Eoraptor (talk) 17:54, 16 March 2018 (UTC)

Hi Eoraptor,

I'm trying the same script that you're running, the problem is in scipy.stats I think:

<lang python> import numpy as np import scipy as sp import scipy.stats

def welch_ttest(x1, x2):

   n1 = x1.size
   n2 = x2.size
   m1 = np.mean(x1)
   m2 = np.mean(x2)
   v1 = np.var(x1, ddof=1)
   v2 = np.var(x2, ddof=1)
   t = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)
   df = (v1 / n1 + v2 / n2)**2 / (v1**2 / (n1**2 * (n1 - 1)) + v2**2 / (n2**2 * (n2 - 1)))
   p = 2 * sp.stats.t.cdf(-abs(t), df)
   return t, df, p

welch_ttest(np.array([3.0, 4.0, 1.0, 2.1]), np.array([490.2, 340.0, 433.9])) (-9.559497721932658, 2.0008523488562844, 0.01075156114978449)</lang>

I've installed numpy and scipy, but if I run only the top 3 lines (importing) I get the error as well.--Hailholyghost (talk) 18:18, 16 March 2018 (UTC)

Running the short script you mentioned above, with just numpy, produces the following:

<lang bash> con@con-vb:~/Scripts$ python numpy.py Traceback (most recent call last):

 File "numpy.py", line 1, in <module>
   import numpy as np
 File "/home/con/Scripts/numpy.py", line 2, in <module>
   a = np.array([1,2,3],dtype=float)

AttributeError: 'module' object has no attribute 'array' </lang>

Mmm. It looks like you are giving your programs the same names as the modules you are importing. Bad idea: Python will look first in the current directory, and guess what it will import when you do "import numpy"? Eoraptor (talk) 21:20, 16 March 2018 (UTC)

I've changed the file names, and I'm still getting the same errors :( there are no files in this directory matching num* or sci* --Hailholyghost (talk) 22:03, 16 March 2018 (UTC)

Any chance a directory in the PYTHONPATH variable contains such files? The errors mean Python does not find the functions in modules 'string' and 'numpy' respectively. The file are found (otherwise you would get for instance ModuleNotFoundError: No module named 'string'), but they do not contain what they should. There is another possibility: in Python 2, the 'string' module has a 'join' function, but not in Python 3. It should never happen, but if for some obscure reason you are running the Python 3 executable with a Python 2 library, it's not surprising you get errors. You could do 'python --version' in the console to see what you are actually running. But usually, this could not happen because Python will automatically look for the correct version of its library (may be possible to fool Python by setting some symlinks, or some other weird trick). You may also ask for help on Stack Overflow, where users of both Linux and Python may have a better answer than me (I work on Windows currently, and have not used Python 2 for years). If you try Stack Overflow, you may help by providing some information on your install, in addition to the error message: which Linux distribution, how you installed Python and numpy... But if you used the Linux package manager or Python's pip, and if you didn't do anything "by hand", the default install should be ok. Eoraptor (talk) 23:27, 16 March 2018 (UTC)

Ah, by the way: you renamed the numpy.py/scipy.py or other files you created, but you will also have to remove the pyc files: they are compiled Python files automatically created when you import a library (they may be in a subdirectory called '__pycache__', but IIRW with Python 2 they are in the same directory as the .py files - just have a look anyway). Eoraptor (talk) 23:32, 16 March 2018 (UTC)

Also, you write there is no num*/sci* file in your directory, but actually there should not be a string.py either, as it's what is causing a problem in your first error message. Check 'echo $PYTHONPATH', move to an empty directory, and retry to import some modules. Eoraptor (talk) 23:42, 16 March 2018 (UTC)

do you know if there are any better websites than StackOverflow? there are some serious issues with that site, namely down-voting with no reason, my reputation can be destroyed in a few seconds for no reason (that I'm aware of anyway), and because of this I'm very hesitant to ask a question on that website. I got a -4 vote for my last question, and I have no idea why. For that reason, I only use stackoverflow.com as a last resort. I only use that website if I've exhausted every other possible avenue. https://meta.stackexchange.com/questions/198962/anonymous-downvotes-ruining-stack-overflow-for-new-users-driving-them-away-from--Hailholyghost (talk) 02:16, 18 March 2018 (UTC)

Stack Overflow is good, but you have to be careful when you ask something. You could also ask on the Python mailing list. Anyway, see How do I ask a good question? as it will be useful wherever you ask for help. Here, however, there is a risk you get downvoted because you are in a situation that was already discussed many times on Stack Overflow: see for instance this. So often that there is something on Meta about this: Canonical Q/A for 'Module 'x' has no attribute 'y' in Python. However, you seem to be claiming there is no shadowing module. Before asking on Stack Overflow, you will have first to double- and triple-check, because, honestly, I see no other reason this error would happen. So, check that there is no Python file (.py, .pyc, .pyo, .pyd) in the directory from which you call your scripts or in any directory in the Python search path (which you can find with

import sys
sys.path

inside Python), that could be masking a Python module, and especially the ones which cause an error: here string.py and numpy.py. Because I am almost certain there is one: the error means precisely that.

You could try:

import string
string.__file__

And if the printed path is not in your Python installation directory (judging from your output above, it should be /usr/lib/python2.7/string.py), then this is the cause of your problem. Eoraptor (talk) 09:44, 18 March 2018 (UTC)

hi Eoraptor, I solved the problem. I had another program called "string.py" in the folder which was also executing with my script (I would never have expected this bizarre behavior). I am used to C and Perl, Python behaves very unexpectedly. Thanks for your help.--Hailholyghost (talk) 19:29, 19 March 2018 (UTC)

Fine! :) Note that in Python, you can put anything you want it your source files, but you can't name exactly as you want if they lie in a directory in PYTHONPATH (or in current directory). However, it's easy to solve this, if these are modules you use in your program: put them in a subdirectory. Say you have a the following directory structure, with PYTHONPATH=/home/myproject:

/home/myproject/main.py
/home/myproject/lib1.py
/home/myproject/foo/lib2.py

Now, from your main program (main.py), you can

<lang python>import lib1 import foo.lib2 as lib2</lang>

And now functions from lib1 can be called with lib1.func(), and function in lib2 as lib2.func(). There is more to know about modules, see the tutorial.

While you can use from lib1 import * and from foo.lib2 import *, it can become a bit tricky if there are not only functions or classes, but variables in your modules (you may experiment with what happens whether you do a = 1 or a = [] in lib1.py, the difference being that the first is immutable, while the second is mutable in Python). Also, as this pollutes the global namespace, I tend to stick with import lib1, unless I only need one or two functions and I do from lib1 import f1, f2.

Eoraptor (talk) 07:23, 20 March 2018 (UTC)