
As a happy Zerodha user since 2017, I agree with all of this. Even if the pandemic had not happened, Zerodha would still be India's largest, most profitable broker.


Potentially the most useful part of this post for me was this:

> 2ndQuadrant is working on highly efficient upgrades from earlier major releases, starting with 9.1 → 9.5/9.6.

I hadn't heard of that before. Anybody know more about this? I'm currently babysitting a 9.1 deployment which we desperately want to get upgraded. The amount of downtime this can tolerate is very limited, and I'm currently tasked with coming up with a plan. It's going to get hairy. If such a tool is really on its way, I could make a case for holding off on the upgrade for a few more months and save quite a bit of work.


That paragraph could have been clearer, but Simon is referring to supporting logical replication from 9.1 to recent releases, which essentially means backporting logical decoding (added in 9.4) to 9.1. There's no way this could get into the official 9.1 branch, as that'd be a major risk / behavior change.

Making "zero downtime" upgrades possible is part of the whole logical replication effort - both within the community and in 2ndQuadrant in particular.

Petr Jelinek actually described how to do that using UDR (uni-directional logical replication) in 2014:

https://wiki.postgresql.org/images/a/a8/Udr-pgconf.pdf

There might be a newer talk somewhere; I'm sure he has spoken about it at several events.


You can use pg_upgrade with -k (hard links); it completes within seconds. Afterwards, things will be slow until a full ANALYZE refreshes the statistics, but the upgrade itself can be done in seconds.
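
Roughly like this, for anyone who hasn't used it (paths and versions are just placeholders for whatever your layout looks like, and the new cluster needs to be initdb'ed first):

    # stop both clusters, then run pg_upgrade in hard-link mode (-k)
    pg_upgrade -k \
        -b /usr/lib/postgresql/9.1/bin  -B /usr/lib/postgresql/9.5/bin \
        -d /var/lib/postgresql/9.1/main -D /var/lib/postgresql/9.5/main

    # start the new cluster and rebuild the planner statistics
    pg_ctl -D /var/lib/postgresql/9.5/main start
    vacuumdb --all --analyze-only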

I have upgraded ~2TB of databases from 9.0 all the way to 9.5 over the years.


The problem with this is that if anything fails, you can potentially corrupt your data and have no backup plan. To make that option safe, you would have to copy your data directory first, and you need to be offline for that. So you have to add the time it takes to make that copy.


This is why I ensure that the slaves are up to date, then disconnect them, pg_upgrade the master and resync the slaves (which is required anyway). If something goes wrong, I would fail over to the slave.

Also: You don't need to be offline to copy the data directory. Check `pg_start_backup` or `pg_basebackup` (which calls the former)


That requires the master and slaves to run different versions for a while. And that is not possible with stock postgresql, is it?

Regarding your second point, I meant copying the data directory as in a 'cp' command. Or rsync if you will. The functions you mentioned are only useful when doing a dump, aren't they? And recovering from an upgrade problem using a dump is way slower than just starting the previous version in the backup data directory.


> That requires the master and slaves to run different versions for a while. And that is not possible with stock postgresql, is it?

Yes. That's not possible. But if I announce the downtime, bring master and slave down, migrate the slave and run our test-suite, migrate the master, run the test suite again and bring the site back up, then I know whether the migration worked.

If the migration on the slave fails, well, then I can figure out where the problem lies and just bring master back.

If the migration on master fails, but works on slave, then I can bring slave up as the new master.

No matter what, there's always one working copy and the downtime is limited to two `pg_upgrade -k` runs (which is measured in minutes).
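
In shell terms it's roughly this (hostnames, paths and the test-suite step are placeholders for whatever your setup looks like):

    # downtime starts: make sure the slave has caught up, then take it out of recovery
    # (pg_upgrade wants a cleanly shut down standalone cluster, hence the promote)
    ssh slave 'pg_ctl -D /var/lib/postgresql/9.1/main promote'
    ssh slave 'pg_ctl -D /var/lib/postgresql/9.1/main stop -m fast'
    ssh slave 'pg_upgrade -k -b /usr/lib/postgresql/9.1/bin -B /usr/lib/postgresql/9.5/bin -d /var/lib/postgresql/9.1/main -D /var/lib/postgresql/9.5/main'
    ssh slave 'pg_ctl -D /var/lib/postgresql/9.5/main start'
    # ...run the application test-suite against the slave...

    # if that passed: same steps on the master, test again, bring the site back up,
    # then re-seed the slave from the master (replication has to be rebuilt anyway)
    # if the master upgrade fails but the slave is fine: make the slave the new master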

> Regarding your second point, I meant copying the data directory as in a 'cp' command. Or rsync if you will.

Yes. You execute `select pg_start_backup()` to tell the server that you're now going to run cp or rsync and to thus keep the data files in a consistent state. Once you have finished cp/rsync, you execute `select pg_stop_backup()` to put the server back in the original mode.

This works while the server is running.

If you don't want the hassle of executing these commands, you can also invoke the command-line tool `pg_basebackup` which does all of this for you.
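
For completeness, the two variants look roughly like this (the label, paths and options are just examples; make sure the WAL written during the copy is archived or included, otherwise the copy isn't consistent):

    psql -c "SELECT pg_start_backup('pre_upgrade_copy');"
    rsync -a /var/lib/postgresql/9.1/main/ /backup/pg91-copy/
    psql -c "SELECT pg_stop_backup();"

    # or let pg_basebackup do all of it in one go, including the required WAL (-x)
    pg_basebackup -D /backup/pg91-copy -x -P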


Any chance you have a blog or website where you could write up/post an example of this entire process? It sounds like the details you've posted above would be extremely helpful to many others.


Yes, that does allow you to copy the database directory with the "cp" command. The command tells postgres to stop deleting obsolete WAL files until further notice. As long as you start your copy after you issue the command, and copy across at least all of the files that are present (as in, you can ignore new files that are created), then the data is safe. Just don't forget to tell postgres that the backup has finished afterwards.


pg_start_backup/pg_basebackup are used when doing an rsync-style copy. You'll end up with a copy of the data directory, rather than a dump. You can then start up a server instance directly in the resulting backup directory.
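
E.g. something like this (port and path made up), assuming the copy includes the WAL it needs, as a `pg_basebackup -x` copy does:

    pg_ctl -D /backup/pg91-copy -o "-p 5433" start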


How do I learn all this stuff, as a person that has no reason to touch Postgres other than personal interest? I never get to encounter these types of problems in my day to day.


You don't need to be offline to make a copy of the data directory. You can do that ahead of time, keeping all the WAL segments up until the point that you make the switch.

(See also https://www.postgresql.org/docs/9.5/static/continuous-archiv...)
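
The relevant knobs, roughly (9.x-style settings; the archive path is made up):

    # postgresql.conf on the old cluster -- start shipping WAL somewhere safe
    wal_level = archive
    archive_mode = on
    archive_command = 'cp %p /backup/wal_archive/%f'

    # recovery.conf placed in the copied data directory at switch-over time,
    # so it replays everything archived since the copy was taken
    restore_command = 'cp /backup/wal_archive/%f %p'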


Solution: Have a backup plan.

;)


9.1 to 9.5/9.6 upgrade should be available for customers by Oct


Yeah, I've seen multiple folks putting off reserved instance commitments and continuing to pay on-demand costs since they don't want to get locked in for a year. It's a serious commitment for smaller companies.

The No-Upfront reservation options that AWS introduced last year helped narrow the difference quite a bit, but Google's automatic sustained-use discounts are so much better and less complicated for users.


And per-minute billing, and the ability to move between zones month to month ("Oh, Haswell zone just launched? I'll be moving there..."). It's night and day, and if I could, I'd happily take the other side of any RI deal ;).

Disclosure: I work on Compute Engine.


Geographic reach is the significant thing holding GCP back at this point, despite being ahead on the tech. Not having a Singapore presence (or anything closer to India) basically kills it as a serious option for many Indian companies. I personally know of a couple of startups that would have loved to use GCP since they actually want to use things like BigQuery, but went with AWS due to this.

They have so far just mentioned they are adding two regions soon, but nothing about where. I fear it's just additional options in Europe or something... 2017 is still too far out for a startup considering options today.

Edit: Oh .... looks like I missed the announcement that the two new regions are Tokyo and Oregon. Looks like India is out of luck for now.


Hmm, our Taiwan region (asia-east1) is too far? We've had a number of customers from India, and thanks to our points of presence throughout the world it's not as simple as "AWS is in Singapore, Google is in Taiwan". That said, we hear you, and you can imagine we've done a lot of asking customers (and losing deals!) about where we are and where we could be.

Disclosure: I work on Compute Engine.


Yes, Taiwan is too far. I personally know of an adtech startup that is going through contortions because they need to stay below latency limits on specific ad exchanges, and those exchanges are located in Singapore. Taiwan is certainly way too far for them.

There are other cases where I think the additional ~50ms of latency from India to Taiwan matters much less, and I am dubious that it is material. But what matters is that the difference exists and people believe it does.


We're just about squeezing by in Taiwan; if anything changes networking-wise, though, we're going to have to diversify to more providers.

If you're struggling to respond within 50ms anyway then you're going to have a bad time with the added latency. Thankfully our 95th percentile is around 20ms.

We are, however, having to go into AWS for Aus/NZ, which is a pain.


AWS is coming to Mumbai (IN) in the near future. That will give us much better performance, so right now investing time in GCE is difficult to justify.


It looks like the devs are doing exactly the right thing: fixing the immediate problem quickly on the released branches and following up with stricter checking in the development version (which looks like it will need a lot more changes). See: https://bugzilla.mozilla.org/show_bug.cgi?id=1202447#c17 and https://bugzilla.mozilla.org/show_bug.cgi?id=1202509


Well, there are attempts to set up that kind of infrastructure (e.g. http://www.flywheel.com). You are correct that the industry as a whole needs to get behind it rather than fragmenting efforts across cities/fleets, etc.


Here is the post from the most recent maintainer on his blog: http://samuelsidler.com/2013/05/the-end-of-an-era/

Thanks to all the devs who have worked on Camino over the years.


Trello is a great product. It's great that you find it useful and worth paying for. But you lost me right here:

> it’s paid and it won’t disappear overnight.

That's a gross oversimplification. Perhaps your company paying for Trello, and you blogging about it, makes it a little more likely that Trello will stick around and will "succeed". But it certainly does not do much to _guarantee_ that they will hang around for however long you want to use it. Much larger paid-for services have disappeared. The best that they can guarantee you is that they will try to make your data available to you in some form if they ever go away.


That's correct! I made a quick update.


Do also bear in mind that Fogcreek haven't always stuck to their word on how they will keep their systems up and running (see thread [1] in story [2]).

No company, or service/facility that a company states will stick around, is guaranteed to stick around. Trawl back through the archives on HN and you'll find tale after tale of (previously) reputable companies backtracking, cancelling projects, going bust etc.

Make sure you always keep hold of your data!

[1] https://news.ycombinator.com/item?id=4722056 [2] https://news.ycombinator.com/item?id=4721028


"Fogcreek [sic] haven't always stuck to their word on how they will keep their systems up and running".

What do you mean by this? It sounds like you are suggesting that "how" we keep our systems up and running is required to be the same at all times. Why should people care about "how" we do it, just that it's done? For a while, we had a second data center in Toronto instead of LA. Trello moved to AWS, so it doesn't need a 'second' data center. I don't think we should be derided because we changed the way we keep our systems up and running.

"Make sure you always keep hold of your data!" - Business Class has a great export function :)

Also, minor point, it's Fog Creek, not Fogcreek.


Apologies for the misspelling.

You used the second data centre to sell FogBugz on Demand to your customers. You didn't tell them that you no longer had it. It went down when the storm hit NY, and there was no backup data centre.

If you change the way you keep up your systems, and keep them up, no one minds. When you don't, people rightly question what you've told them in the past and wonder if what you're telling them now fits into the same categories.

I'm aware you have export functions, that's great. My message was to your customers, who need to remember to utilise them. That's not a dig at your company; it applies to any SaaS provider.

I hope my spelling was correct this time around, apologies if not. FYI, "fog creek" is not capitalised correctly in your HN profile, just letting you know so you can keep the brand in check.


Can you clarify when FogBugz on Demand was in two data centers and when it was not, regardless of whether that was LA or Toronto? Back then it was really interesting that Joel found SQL Server mirroring unusable and instead wrote that they went with manually coded log shipping for SQL Server. At the same time I was experimenting with the same techniques and was very interested in this, so I remembered it closely and wondered about it once the bucket brigade story was told.


There will probably be a blog post in a couple of months about the latest infrastructure changes that we've added. Keep an eye out for that.


> Can you clarify when FogBugz on Demand was in two data centers and when not...

> There will probably be a blog post in a couple of months...

That'll be a "no", then.


I bet in the end it was not really such a disadvantage for the OnDemand customers. Imagine they had had the two data centers: the log shipping Joel "sold" us would have had a nasty side effect. If someone had made the call to do the switchover, everyone would have lost the last few hours of work! All customer mail, all bug events would be gone.

Well, one could say that in this particular case it might have worked out: they knew about Sandy in advance and had a bit of time before the lights went out. But if it happens unexpectedly, it's a tough call to make: keep it running, or fail over and keep only the data up to the last log shipment?

If, on the other hand, they had gone with SQL mirroring, it's likely there would have been quite some outage just because of the mirroring: if one of the two data centers went down, the whole system would be down. It's often like that with failover systems.


Apple deserves a lot of credit for their support for WebKit, particularly in its early days. But even old-time Apple WebKit devs will readily admit that the existence of Mozilla made WebKit much more viable back in the day. Macs had practically no market share in those days. The fact that Mozilla existed and had a reasonable amount of market share across platforms made it possible for Apple to ship anything other than IE as the default browser on the Mac. Mozilla maintained the incentive for webdevs to support non-IE browsers and to use standards-compliant CSS. Desktop Safari as a default browser would not have been a viable product if so many webdevs had not already worked to make their sites work well with Gecko/Firefox. WebKit sends "like Gecko" in its user-agent string for a very good reason.

In other words, a lot of the web kept using standards (and not IE's proprietary stuff) because it wanted to work in Mozilla browsers. And all of this content tended to work well in WebKit too. This made Safari useful and viable. So it's at least somewhat reasonable to say that Mozilla and Firefox are perhaps primarily responsible for keeping the desktop browser market open to non-IE products. Their existence and market share allowed the newer browsers to compete on the basis of features and performance and not be hobbled by poor website compatibility.

