More

protomikron · 2025-12-01T10:27:17 1764584837

You can use ctrl+shift+t to open the recently closed tab again.

protomikron · 2025-09-17T10:51:24 1758106284

It's an interesting file browser, the default "blue" colors gives old-school vibes nowadays.

If you wonder how to quit (if started from a terminal):

It's `ESC 0`. Or "exit" like from a shell. Took me some time I have to admit (q, ctrl-c, ctrl-q, F10, ESC all did not work).

kidsil · 2025-09-17T10:58:46 1758106726

F10 always worked for me, but some terminals use F10 for some other functionality.

If that fails, you can click on "10 Quit" with the mouse (not ideal, but an immediate solution).

protomikron · 2025-09-17T11:00:16 1758106816

Ah I see, thx, it's a pretty default configured "gnome-terminal", which probably captures the F10.

protomikron · 2025-08-27T19:48:11 1756324091

Valid question i think in this context. I knew about GNU multiprecision library, but thought that couldnt be it, as it's "just" a highly optimized low level bit fiddling lib (at least thats my expectation without looking into the source), so it's strange why it could be damaging Hardware ...

account42 · 2025-08-28T10:19:11 1756376351

TFA is on gmplib.org which kind of answers the question though.

protomikron · 2025-08-14T21:53:12 1755208392

It's okayish. Considering 64G to 128G are available for (nerd) high-end consumers you're just off with a factor 5 (if we can squeeze out a little bit more performance).

Thas is pretty astonishing in my opinion.

protomikron · 2025-08-04T06:39:50 1754289590

I think there is some kind of misunderstanding. Sure, if you get somehow structured, machine-generated PDFs parsing them might be feasible.

But what about the "scanned" document part? How do you handle that? Your PDF rendering engine probably just says: image at pos x,y with size height,width.

So as parent says you have to OCR/AI that photo anyway and it seems that's also a feasible approach for "real" pdfs.

throwaway4496 · 2025-08-04T06:48:36 1754290116

Okay, this sounds like "because some part of the road is rough, why don't we just drive in the ditch along the road way all the way, we could drive a tank, that would solve it"?

Macha · 2025-08-04T11:36:08 1754307368

My experience is that “text is actually images or paths” is closer to the 40% case than the 1% case.

So you could build an approach that works for the 60% case, is more complex to build, and produces inferior results, but then you still need to also build the ocr pipeline for the other 40%. And if you’re building the ocr pipeline anyway and it produces better results, why would you not use it 100%?

protomikron · 2025-04-24T17:27:03 1745515623

Do you think as an observer of Roko's basilisk ... should I share these prompt or not?

protomikron · on April 12, 2024

This guy footnotes.

protomikron · on March 11, 2024

They don't.

protomikron · on March 11, 2024

Although this is nice, the problems with the GIL are often blown out of proportion: people stating that you couldn't do efficient (compute-bounded) multi-processing, which was never the case as the `multiprocessing` module works just fine.

ynik · on March 11, 2024

multiprocessing only works fine when you're working on problems that don't require 10+ GB of memory per process. Once you have significant memory usage, you really need to find a way to share that memory across multiple CPU cores. For non-trivial data structures partly implemented in C++ (as optimization, because pure python would be too slow), that means messing with allocators and shared memory. Such GIL-workarounds have easily cost our company several man-years of engineer time, and we still have a bunch of embarrassingly parallel stuff that we still cannot parallelize due to GIL and not yet supporting shared memory allocation for that stuff.

Once the Python ecosystem supports either subinterpreters or nogil, we'll happily migrate to those and get rid of our hacky interprocess code.

Subinterpreters with independent GILs, released with 3.12, theoretically solve our problems but practically are not yet usable, as none of Cython/pybind11/nanobind support them yet. In comparison, nogil feels like it'll be easier to support.

ebiester · on March 11, 2024

And I guess what I don't understand is why people choose Python for these use cases. I am not in the "Rustify" everything camp, but Go + C, Java + JNI, Rust, and C++ all seem like more suitable solutions.

KaiserPro · on March 11, 2024

> but Go + C, Java + JNI, Rust, and C++ all seem like more suitable solutions.

apart from go (maybe java) those are all "scary" languages that require a bunch of engineering to get to the point that you can prototype.

even then you can normally pybind the bits that are compute bound.

If Microsoft had been better back in the say, then c# should have been the goto language of choice. It has the best tradeoff of speed/handholding/rapid prototyping. Its also statically typed, unless you tell it to not be.

snovv_crash · on March 12, 2024

#pragma omp parallel for

gets you 90% of the potential performance of a full multithreaded producer/consumer setup in C++. C++ isn't as scary as it used to be.

oivey · on March 11, 2024

Notably, all of those are static languages and none of them have array types as nice as PyTorch or NumPy, among many other packages in the Python ecosystem. Those two facts are likely closely related.

abdullahkhalids · on March 11, 2024

Python is just the more popular language. Julia array manipulation is mostly better (better syntax, better integration, larger standard library) or as good as python. Julia is also dynamically typed. It is also faster than Python, except for the jit issues.

znpy · on March 11, 2024

> It is also faster than Python, except for the jit issues.

I was intrigued by Julia a while ago, but didn't have time to properly learn it.

So just out of curiosity: what's the issues with jit and Julia ?

cjalmeida · on March 11, 2024

The "issue" is Julia is not Just-in-Time, but a "Just-Ahead-of-Time" language. This means code is compiled before getting executed, and this can get expensive for interactive use.

The famous "Time To First Plot" problem was about taking several minutes to do something like `using Plots; Plots.plot(sin)`.

But to be fair recent Julia releases improved a lot of it, the code above in Julia 1.10 takes 1.5s on my 3-year old laptop

jakobnissen · on March 11, 2024

Julia's JIT compiles code when its first executed, so Julia has a noticable delay from you start the program and until it starts running. This is anywhere from a few hundred milliseconds for small scripts, to tens of seconds or even minutes for large packages.

shiroiushi · on March 12, 2024

I wonder why they don't just have an optional pre-compilation, so once you have a version you're happy with and want to run in production, you just have a fully compiled version of the code that you run.

aoanla · on March 12, 2024

Effectively, it does - one of the things recent releases of Julia have done is to add more precompilation caching on package install. Julia 1.10 feels considerably snappier than 1.0 as a result - that "first time to plot" is now only a couple of seconds thanks to this (and subsequent plots are, of course, much faster than that).

oivey · on March 11, 2024

Preaching to the choir here.

Julia’s threading API is really nice. One deficiency is that it can be tricky to maintain type stability across tasks / fetches.

samatman · on March 11, 2024

If only there were a dynamic language which performs comparably to C and Fortran, and was specifically designed to have excellent array processing facilities.

Unfortunately, the closest thing we have to that is Julia, which fails to meet none of the requirements. Alas.

rmbyrro · on March 11, 2024

If only there was a car that could fly, but was still as easy and cheap to buy and maintain :D

esafak · on March 11, 2024

Why do people use python for anything beyond glue code? Because it took off, and machine learning and data science now rely on it.

I think Python is a terrible language that exemplifies the maxim "worse is better".

https://en.wikipedia.org/wiki/Worse_is_better

nottorp · on March 11, 2024

To quote from Eric Raymond's article about python, ages ago:

"My second [surprise] came a couple of hours into the project, when I noticed (allowing for pauses needed to look up new features in Programming Python) I was generating working code nearly as fast as I could type.

When you're writing working code nearly as fast as you can type and your misstep rate is near zero, it generally means you've achieved mastery of the language. But that didn't make sense, because it was still day one and I was regularly pausing to look up new language and library features!"

Source: https://www.linuxjournal.com/article/3882

It doesn't go for large code bases, but if you need quick results using existing well tested libraries, like in machine learning and data science, I think those statements are still valid.

Obviously not when you're multiprocessing, that is going to bite you in any language.

rmbyrro · on March 11, 2024

Some speculate that universities adopted it as introductory language for its expressiveness and flat learning curve. Scientific / research projects in those unis started picking Python, since all students already knew it. And now we're here

spprashant · on March 11, 2024

I have no idea if this is verifiably true in a broad sense, but I work at the university and this is definitely the case. PhD students are predominantly using Python to develop models across domains - transportation, finance, social sciences etc. They then transition to industry, continuing to use Python for prototyping.

zamadatix · on March 11, 2024

People choose Python the use case, regardless what that is, because it's quick and easy to work with. When Python can't realistically be extended to a use case then it's lamented, when it can it's celebrated. Even Go, while probably the friendliest of that buch when it comes to parallel work, is on a different level.

pillusmany · on March 11, 2024

"Ray" can share python objects memory between processes. It's also much easier to use than multi processing.

ptx · on March 11, 2024

How does that work? I'm not familiar with Ray, but I'm assuming you might be referring to actors [1]? Isn't that basically the same idea as multiprocessing's Managers [2], which also allow client processes to manipulate a remote object through message-passing? (See also DCOM.)

[1] https://docs.ray.io/en/latest/ray-core/walkthrough.html#call...

[2] https://docs.python.org/3/library/multiprocessing.html#manag...

pillusmany · on March 11, 2024

Shared memory:

https://docs.ray.io/en/latest/ray-core/objects.html

ptx · on March 11, 2024

According to the docs, those shared memory objects have significant limitations: they are immutable and only support numpy arrays (or must be deserialized).

Sharing arrays of numbers is supported in multiprocessing as well: https://docs.python.org/3/library/multiprocessing.html#shari...

jononor · on March 12, 2024

I think that 90 or maybe even 99% of cases has under 1GB of memory per process? At least it has been the case for me the last 15 years.

Of course, getting threads to be actually useful for concurrency (GIL removed) adds another very useful tool to the performance toolkit, so that is great.

vita7777777 · on March 11, 2024

On the other hand, this particular argument also gets overused. Not all compute-bounded parallel workloads are easily solved by dropping into multiprocessing. When you need to share non-trivial data structures between the processes you may quickly run into un/marshalling issues and inefficiency.

kroolik · on March 11, 2024

Managing processes is more annoying than threads, though. Incl. data passing and so forth.

pillusmany · on March 11, 2024

The "ray" library makes running python code on multi core and clusters very easy.

kroolik · on March 11, 2024

Although its great the library helps with multicore Python, the existence of such package shouldnt be an excuse not to improve the state of things in std python

smcl · on March 11, 2024

Interesting - looking at their homepage they seem to lean heavily into the idea that it's for optimising AI/ML work, not multi-process generally.

pillusmany · on March 11, 2024

You can use just ray.core to do multi process.

You can do whatever you want in the workers, I parse JSONs and write to sqlite files.

liuliu · on March 11, 2024

`multiprocessing` works fine for serving HTTP requests or do some other subset of embarrassingly-parallel problems.

skrause · on March 11, 2024

> `multiprocessing` works fine for serving HTTP requests

Not if you use Windows, then it's a mess. I have a suspicion that people who say that the multiprocessing works just fine never had to seriously use Python on Windows.

rmbyrro · on March 11, 2024

Probably a very small minority of Python codebases run on Windows, no? That's my impression. It would explain why so many people are unaware of multiprocessing issues on Windows. I've never ran any serious Python code on windows...

ptx · on March 11, 2024

Why is it a mess? What's wrong with it on Windows?

colatkinson · on March 11, 2024

Adding on to the other comment, multiprocessing is also kinda broken on Linux/Mac.

1. Because global objects are refcounted, CoW effectively isn't a thing on Linux. They did add a way to avoid this [0], but you have to manually call it once your main imports are done.

2. On Mac, turns out a lot of the system libs aren't actually fork-safe [1]. Since these get imported inadvertently all the time, Python on Mac actually uses `spawn` [2] -- so it's roughly as slow as on Windows.

I haven't worked in Python in a couple years, but handling concurrency while supporting the major OSes was a goddamn mess and a half.

[0]: https://docs.python.org/3.12/library/gc.html#gc.freeze

[1]: https://bugs.python.org/issue33725

[2]: https://docs.python.org/3.12/library/multiprocessing.html#co...

fulafel · on March 12, 2024

Re (1), are there publicly documented cases with numbers on observed slowdowns with it?

I see this mentioned from time to time, but intuitively you'd think this wouldn't pose a big slowdown since the system builtin objects would have been allocated at the same time (startup) and densely located on smaller nr of pages. I guess if you have a lot of global state in your app it could be more significant.

Would also be interesting to see a benchmark using hugepages, you'd think this could solve remaining perf problems if they were due to large number of independent CoW page faults.

fulafel · on March 12, 2024

Replying to my self: it seems one poster case was Instagram and their very large Django app: https://bugs.python.org/issue40255#msg366835

skrause · on March 11, 2024

* A lack of fork() makes starting new processes slow.

* All Python webservers that somewhat support multiprocessing on Windows disable the IOCP asyncio event loop when using more than one process (because it breaks in random ways), so you're left with the slower select() event loop which doesn't support more than 512 connections.

jcranmer · on March 11, 2024

> as the `multiprocessing` module works just fine.

Something that tripped me up when I last did `multiprocessing` was that communication between the processes requires marshaling all the data into a binary format to be unmarshaled on the other side; if you're dealing with 100s of MB of data or more, that can be quite some significant expense.

protomikron · on March 3, 2024

Why is it always public cloud (aws, gcp, azure) vs. "bring your own hardware and deploy it in racks".

There are multiple providers that offer VPS and ingress/egress for a fraction of the cost of public clouds and they mostly have good uptime.

pclmulqdq · on March 3, 2024

I recently rented a rack with a telecom and put some of my own hardware in it (it's custom weird stuff with hardware accelerators and all the FIPS 140 level 4 requirements), but even the telecom provider was offering a managed VPS product when I got on the phone with them.

The uptime in these DCs is very good (certainly better than AWS's us-east-1), and you get a very good price with tons of bandwidth. Most datacenter and colo providers can do this now.

I think people believe that "on prem" means actually racking the servers in your closet, but you can get datacenter space with fantastic power, cooling, and security almost anywhere these days.

jupp0r · on March 3, 2024

It's a spectrum:

On top is AWS lambda or something where you are completely removed from the actual hardware that's running your code.

At the bottom is a free acre of land where you start construction and talk to utilities to get electricity and water there. You build your own data center, hire people to run and extend it, etc.

There is tons of space in between where compromises are made by either paying a provider to do something for you or doing it yourself. Is somebody from the datacenter where you rented a rack or two going in and pressing a reset button after you called them a form of cloud automation? How about you renting a root VM at Hetzner? Is that VM on prem? People who paint these tradeoffs in a black and white matter and don't acknowledge that there are different choices for different companies and scenarios are not doing the discussion a service.

On the other hand, somebody who built their business on AppEngine or CloudFlare workes could look at that other company who is renting a pet pool of EC2 instances and ask if they are even in the cloud or if they are just simulating on-prem.

ethbr1 · on March 4, 2024

I think the question people are really interested in is usually "What percentage over my costs would I pay to outsource X?" (where X is some component of the complexity stack)

Which, first order approximated, is a function of (1) how big a company you are (aka "Can you even afford to hire two people to just do X?") and (2) how competitive the market is for X.

Colo and dedicated VMs are so reasonably priced because it's a standardized, highly-competive market.

Similarly, certain managed cloud services are ridiculously expensive because they have a locked-in customer base.

Which would suggest outsourcing components that have maximum vendor competition and standardization, as they're going to be offered at the lowest margin.

pclmulqdq · on March 4, 2024

There's also a good point here (at least at the top of the stack) about reliability: the top of the spectrum goes down relatively frequently due to its dependencies, but even plain old boring EC2 has much better reliability than services like Lambda.

Zircom · on March 3, 2024

>I think people believe that "on prem" means actually racking the servers in your closet, but you can get datacenter space with fantastic power, cooling, and security almost anywhere these days.

That's because that is what on prem means. What you're describing is colocating.

pclmulqdq · on March 3, 2024

When clouds define "on-prem" in opposition to their services (for sales purposes), colo facilities are lumped into that bucket. They're not exactly wrong, except a rack at a colo is an extension of your premises with a landlord who understands your needs.

hiatus · on March 4, 2024

> it's custom weird stuff with hardware accelerators and all the FIPS 140 level 4 requirements

What kind of weird stuff are we talking?

pclmulqdq · on March 4, 2024

Servers for an API for https://arbitrand.com

Essentially very high security and throughput TRNG servers (with cryptographic signing and the like).

cangeroo · on March 3, 2024

Because their arguments are disingenuous.

It reads like propaganda sponsored by the clouds. Scaremongering.

Clouds are incredibly lucrative.

But don't worry. You can make the prices more reasonable by making a 3-year commitment to run old outdated hardware.