US politics is broken, but most AI regulations are poorly designed. Look at EU AI policy. It is not addressing any real problems and is mostly just additional paperwork.
Sounds like you should ask for your money back. Remember, OP stands by everything he creates, and if you don't like it he promises your money back, no questions asked.
> did they self-publish or went with some known publisher?
My former boss self-published a book on Leanpub. For the "known publisher", I'm a little acquainted with the author of "Docker High Performance" (Packt).
For the Leanpub one, it was easy to get the word out organically through local meetups and clients. Packt reached out to the "Docker High Performance" author: he reviewed a few books for them, then they emailed him.
> What kind money can you make on ebooks?
For the people I mentioned, I can't say. But if you do get a good book out with a good following you can hit Nathan Barry's numbers:
Self-pub, through Leanpub and eventually Kindle, Createspace, and iBooks, I'm up to around 100-200 sales/month (ASP ~$10, net ~$7), mostly by keeping my book updated with the latest version of the software I write about (so my book is the only one that's constantly relevant, unlike the ones from the major publishers). I spend maybe 2-4 hours/month updating the book, running tests and fixing things.
I don't think he means to disparage enjoying life and spending time with friends and family. It's more about good, healthy life habits: nutrition, exercise, sleep. When you think about it, all these good habits make you happier. He certainly does not advocate workaholism. At least I don't read it that way.
> You would probably handle more requests if you changed that -- I would do the file access in a run_in_executor with a max executor workers of 1000.
This is a really good point. I'm going to check this and edit the post to add this information.
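If I understand the suggestion correctly, it would look roughly like this (a sketch, untested against my actual app; the worker count of 1000 is just the number proposed above, and `read_file` is a placeholder for my handler's file access):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Blocking file reads go to a thread pool so the event loop stays free.
executor = ThreadPoolExecutor(max_workers=1000)

def read_file(path):
    # Plain blocking read; runs inside a worker thread.
    with open(path, "rb") as f:
        return f.read()

async def read_file_async(path):
    # Hand the blocking call to the executor and await its result.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, read_file, path)
```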
> Also, the placement of your semaphore acquisition doesn't make any sense to me. I would create a dedicated coroutine like this:
Looking at my semaphore code the day after writing it, I do wonder whether I'm using it correctly. I assumed it works because it fixed my "too many open files" exception, which seems to mean I'm no longer exceeding the 1024-open-files limit. Can you clarify why you think my use of the semaphore doesn't make sense, and why your suggestion is better? What is the benefit of a dedicated coroutine?
> That being said, it also doesn't make any sense to me to have the semaphore in the client code, since the error is in the server code.
I admit that I focused more on my client than server. One thing that worries me about my test server is that it does not print any exceptions. Either it does not fail at all, which seems unlikely, or it fails silently, which is more likely and is bad. So I need to check my server code to see what exactly happens there.
> it also doesn't make any sense to me to have the semaphore in the client code, since the error is in the server code.
The main reason for the semaphore in the client code is to stop the client from making over 1k connections at a time. My logic here is that if the client won't make 1k connections at a time, the server won't receive 1k connections at a time, and thus there will be no "too many open files" problem on the server (it won't have to send more than 1k responses). However, I see that this logic may not be entirely correct; another comment points out that it's possible for sockets to "hang around" after closing: https://news.ycombinator.com/item?id=11557672 so I need to review that and edit the post.
As per my comment further up, it might be interesting to spin up a handful of listening processes (e.g. 127.0.0.1 through 127.0.0.10) and a handful of clients, and have the clients pick one at random, or something like that. Not so much for "real-world testing", but just as an exercise to see if one can push the system to limits other than open connections/address pairs.
No problem, it's especially hard to find external feedback for side projects and experiments so I try to give it when I can.
> I assumed it works correctly because it fixed my "too many open files" exception
It works, so at the end of the day that's what matters. The client vs server question, from my perspective, ultimately comes down to a question of test realism; in a real-world deployment you couldn't limit connections with client-side code because there are multiple clients. That's what I mean by "it doesn't make sense given that the error is server-side".
> Can you clarify why you think my use of semaphore does not make sense and why your suggestion is better? What is the benefit of dedicated coroutine?
I'm saying that mostly, but not exclusively, from a division-of-concerns standpoint. You're acquiring the semaphore in a completely different context than you're releasing it. On the one hand, that's partly a programming style issue. On the other hand, it can also have some really important consequences: for example, it's actually the event loop itself that is releasing the semaphore for you when the task is done. Because of the way the event loop works, it's hard to say exactly when the semaphore will be released. You want to hold it for the absolute minimum time possible, since it's holding up execution of other connections in the loop.

Putting it into a dedicated coroutine makes it clearer what's going on, makes the acquirer and releaser of the semaphore the same, and means you are definitely holding the semaphore for the minimum amount of time possible (since, again, execution flow will not leave any particular coroutine until you yield/await another). In general I would say that releasing the semaphore in a callback is significantly more fragile, and mildly to moderately less performant, than creating a dedicated coroutine to hold the semaphore and handle the request.
Does that all make sense?
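Schematically, something like this (a minimal sketch: the actual connection I/O is stood in for by a sleep, and 1000 is just the limit from your post):

```python
import asyncio

async def limited_request(sem, i):
    # Acquire and release in the same coroutine: the semaphore is held
    # only while this (simulated) request is in flight.
    async with sem:
        await asyncio.sleep(0)  # stand-in for the actual connection I/O
        return i

async def main():
    sem = asyncio.Semaphore(1000)  # stay under the 1024-open-files limit
    return await asyncio.gather(*(limited_request(sem, i) for i in range(10)))

print(asyncio.run(main()))
```

The point is that whoever acquires the semaphore is also the one who releases it, with no callback in between.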
> Either it does not fail at all, which seems unlikely, or it fails silently, which is more likely and is bad.
That's a fair statement, I think. As an aside, the print statement is slow, so keep that in mind. It might actually be faster to have a single memory-mapped file for the whole thing, and then just append the error and traceback to the file. The built-in traceback library can be very useful for that. That's also a bit more realistic, since obviously IRL you wouldn't be using a print statement to keep track of errors.

On a similar note, because file access is so slow, you'd be best off figuring out some way to remove the part where the server accesses the disk once per connection entirely. On a real-world system you'd possibly use some kind of memory caching system to do that, especially if you're just reading files and not writing them. That allows you to use a little more memory (potentially as little as enough to have a single copy of the file in memory) to drastically improve performance.
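For the read-only case, the caching idea can be as simple as this (a sketch; `lru_cache` is just one easy way to get a single in-memory copy per file):

```python
import functools

@functools.lru_cache(maxsize=None)
def load_page(path):
    # First call for a given path hits the disk; every later call for the
    # same path is served straight from memory. Only safe for files that
    # don't change while the server is running.
    with open(path, "rb") as f:
        return f.read()
```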
> in a real-world deployment you couldn't limit connections with client-side code
Yeah, that's a very good point. But in a real-world scenario handling this would not be easy: limiting the number of available connections on the server side is not a trivial task to implement. Setting up your server to avoid failures and simply return 503 (service unavailable) to some clients or 429 to others would probably require quite a lot of code. It's also not very clear to me how this would be implemented; how do people implement things like this? Just putting a check on the number of open files before the line that opens the file, and returning 503 or 429 instead of opening it? That would only stop the server from opening too many "html" files, but would not stop it from getting flooded with connections. Is my aiohttp app even the right place to add checks like this? Wouldn't it be better to put haproxy, nginx, or some other load-balancing service in front of the aiohttp app and let it handle excess traffic?
Another thing that comes to mind (I need to check this later) is that perhaps some partial "handling" of cases like this could or should be implemented in the aiohttp library itself. I'm not sure how it behaves now, but maybe it should simply fail to open the file, return 500 to the client, and print a noisy traceback about open files to my logs? I didn't see this behavior in my tests, so either it didn't occur, it is not implemented in aiohttp, or it occurred and I somehow missed it. From my experience with Twisted I know this is how Twisted resources behave: if you have an unhandled exception, Twisted just returns 500 to the client and shows a traceback in the logs.
Keep in mind that 5XX error codes are for server errors and 4XX codes are for client errors. Returning 429 would imply "too many connections (from your computer)", not total for the service. Choosing to return a 503 for over-taxed servers is, as far as I can tell, done maybe half the time. Depending on the kind of service you're running, you might want to enforce a server timeout that says "after a certain number of milliseconds of local response time, return a 5XX error code and abandon the connection". That would be a particular component in an overall strategy for handling high load, which would heavily bias towards serving the easiest responses first. That may or may not be a good idea: what if the "expensive" requests are from paying customers accessing account pages, and the "inexpensive" ones are from a sudden spike in traffic to your homepage due to some good press somewhere? Of course eventually, you'd want to separate these two kinds of traffic entirely, such that customers are only affected by outages that they create. You can then focus on expanding your capacity to handle customers directly, instead of trying to lump that in with the much more unpredictable behavior of general web traffic.
> Just putting some check for number of open files before line that opens file and setting response code to 500 and 429 before opening file?
So actually this is one of the big benefits of putting the semaphore limiting file access within its own dedicated coroutine (except on the server side instead of the client). It allows you to handle the connection without having to deal with immediate responses. What that means in practice is that your server will be slower to respond under high load, but until it hits the client's (browser) timeout limit, you'll still be able to respond. It actually doesn't require any extra code to do that. Note that this isn't the only way to achieve this result, but it's probably the most direct, and simplest, especially given the approach you've taken with the code thus far.
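The server-side version of the same idea looks roughly like this (a sketch: the disk I/O is stood in for by a sleep, the page content is fake, and 1000 is illustrative):

```python
import asyncio

# File access queues on the semaphore, so under load requests wait
# instead of blowing past the open-files limit.
file_sem = asyncio.Semaphore(1000)

async def serve_page(path):
    async with file_sem:
        # In the real handler this would be the open()/read() (ideally in
        # an executor); a sleep stands in for the disk I/O here.
        await asyncio.sleep(0)
        return b"<html>contents of " + path.encode() + b"</html>"
```

Under heavy load, the 1001st request simply waits its turn here, and the response goes out late rather than never.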
A load balancer sits on top of that, ideally monitoring metrics like server CPU usage, memory load, or (most directly) request response time, and then shifts around requests between servers accordingly, to minimize the delay incurred in the aforementioned "wait for semaphore (or other synchronization primitive)" part.
At the end of the day, until you start hitting the limit of concurrent connections that others have mentioned, you don't really need to worry much about how many connections you have open at once. You just want to focus on handling every connection you have as quickly as possible.
I wonder if you can reliably classify jobs into "nonroutine" and "routine". There is an element of routine in every job, and I'm pretty sure that even the most boring and repetitive job can be done better with some degree of creativity. It would be really interesting to read more about the reasoning behind the classification presented in this article. I mean, can you seriously say there is no "routine" in programming or management?
If your work is just following instructions, then it's probably routine.
As for me, I'd like to see most middle management go away, since I largely see it as a waste (basically, if people know how to manage themselves, you can get rid of most middle-managers).
It's not a knowledge problem it's a process problem. You can get rid of most middle managers but only once you have the conditions in the business where people can be both autonomous and aligned to the business goals.
Most middle managers end up achieving neither, but a layer of management is the default solution that companies most end up with.
That is an interesting thought. I would suggest that one of the goals of the company should be to teach people to be autonomous and aligned to the business goals.
You still need management as a way of reducing communication costs. Without any management you need on the order of n(n-1)/2 communication channels in the worst case, since every pair of people may talk directly: with 100 people that's 4,950 channels. With proper management you can get down to roughly n·C for a small constant C, because each person mostly talks to their manager and direct reports.
You don't need to have a person working full-time as a communicator. Your team can have a 'designated communicator,' and that role can even be swapped around so everyone learns to do it.
Then you are dynamically creating (and, I suspect, cutting) lots of communication channels. My suspicion is that for any large group, something like that would require an extremely strong institution and a substantial paper trail. This is a noble objective, but I'm not sure it is always an option.
By the way, what is the largest organization you can think that follows that swaps 'designated communicators' roles with no management?
>By the way, what is the largest organization you can think that follows that swaps 'designated communicators' roles with no management?
Good question. I've worked at a fortune 500 company where people routinely ignored official communication channels in order to communicate with the people they needed. It becomes a lot harder to find the person you need at a large company and building relationships across departments becomes important.
It's rare in any organization that the people who have power are the same ones who get things done.
Unfortunately the PyQt docs are far from perfect. Every time I have to do something with PyQt I just go to the Qt docs and "translate" the concepts and API calls to Python. If you can read C++ syntax and translate it in your head to Python, you'll be OK. Aside from the docs, I used some links from https://wiki.python.org/moin/PyQt but you have to be careful to avoid outdated resources (the current version is 5 and many things differ between 4 and 5), and not all tutorials are high quality. BTW, I also wrote one tutorial myself: http://pawelmhm.github.io/python/pyqt/qt/webkit/2015/09/08/b...
> code did not cover some weird edge case on the scraped resource and that all data extracted was now basically untrustworthy and worthless.
Your data should not be worthless just because you don't catch some edge cases early. Sure, there are always edge cases, but the best way to handle them is to have proper validation logic in scrapy pipelines: if something is missing required fields, or you get invalid values (e.g. prices as sequences of characters without digits), you should detect that immediately, not after 50k URLs. The rule of thumb is "never trust data from the internet": always validate it carefully.
If you have validation and still encounter edge cases, you can be sure they are actual weird outliers that you can either choose to ignore or somehow force into your model of the content.
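The kind of check I mean can be very simple (a sketch; the field names are made-up examples, and in a real scrapy pipeline you'd call this from `process_item()` and raise `scrapy.exceptions.DropItem` instead of `ValueError`):

```python
import re

# Made-up example fields; adjust to whatever your spider actually extracts.
REQUIRED_FIELDS = ("title", "price", "url")

def validate_item(item):
    # Fail loudly on the first bad item instead of after 50k URLs.
    for field in REQUIRED_FIELDS:
        if not item.get(field):
            raise ValueError("missing required field: %s" % field)
    # A "price" containing no digit at all is garbage, not a price.
    if not re.search(r"\d", item["price"]):
        raise ValueError("invalid price: %r" % item["price"])
    return item
```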
Hmm, I'll have to investigate that. Any tips for validation libraries that tie in well with scrapy?
What do you do if you discover that your parsing logic needs to be changed after you've scraped a few thousand items? Re-run your spiders on the URLs that raised errors?
Where can one find an analysis of this vulnerability? There are no details in the Check Point blog post revealing the vulnerability. I assume it's serious and real if Magento is releasing patches, but it would be cool to be able to judge for myself.
How is it possible that someone can carry out this kind of attack without facing any legal consequences? I know they're in China and we're not going to start a war with them, but shit, is there really no legal authority here?