zeroimpl's comments

I wonder if you could pay them to tweak the messaging about your products. So when a user asks "Is drinking Coke every day good for my health?", it starts saying yes because sugar is vital to our survival.


Doesn’t that force you to give the Agent some generic code execution environment, or does everybody already do that anyways?


I couldn’t find a library like this in PHP, but realized that for my use case I could easily hack something together. The algorithm is simply:

- trim off all trailing delimiters: },"

- then add on a fixed suffix: "]}

- then try parsing as standard JSON. Ignore the result if it fails to parse.

This works since the schema I’m parsing has a fairly simple structure where everything of interest is at a specific depth in the hierarchy and all values are strings.
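
A rough sketch of that hack, transplanted to Python for illustration (the original was PHP; the delimiter set, the suffix, and the function name below just mirror the steps above and aren't from any real library):

  import json

  def parse_truncated(payload: str):
      # Trim off any trailing delimiters left dangling by the truncation.
      trimmed = payload.rstrip('},"')
      # Then add on the fixed suffix that closes out the known schema.
      candidate = trimmed + '"]}'
      # Then try parsing as standard JSON; ignore the result if it fails.
      try:
          return json.loads(candidate)
      except json.JSONDecodeError:
          return None

  print(parse_truncated('{"items": ["alpha", "bet'))
  # {'items': ['alpha', 'bet']}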


If I’m not mistaken, even JSON couldn’t be parsed by a regex due to the recursive nature of nested objects.

But in general we aren’t trying to parse arbitrary documents, we are trying to parse a document with a somewhat-known schema. In this sense, we can parse them so long as the input matches the schema we implicitly assumed.


> If I’m not mistaken, even JSON couldn’t be parsed by a regex due to the recursive nature of nested objects.

You can parse ANY context-free language with regex so long as you're willing to put a cap on the maximum nesting depth and length of constructs in that language. You can't parse "JSON" but you can, absolutely, parse "JSON with up to 1000 nested brackets" or "JSON shorter than 10GB". The lexical complexity is irrelevant. Mathematically, whether you have JSON, XML, sexps, or whatever is irrelevant: you can describe any bounded-nesting context-free language as a regular language and parse it with a state machine.
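
As a concrete (if impractical) illustration, here is a sketch in Python that mechanically expands a simplified JSON grammar a fixed number of times, yielding one large regex for "JSON nested at most N levels deep". The depth limit and test strings are only examples:

  import re

  def bounded_json_regex(max_depth: int) -> re.Pattern:
      # Expand the "value" rule max_depth times instead of recursing.
      string = r'"(?:[^"\\]|\\.)*"'
      number = r'-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?'
      atom = rf'(?:{string}|{number}|true|false|null)'
      value = atom
      for _ in range(max_depth):
          arr = rf'\[\s*(?:{value}(?:\s*,\s*{value})*)?\s*\]'
          obj = rf'\{{\s*(?:{string}\s*:\s*{value}(?:\s*,\s*{string}\s*:\s*{value})*)?\s*\}}'
          value = rf'(?:{atom}|{arr}|{obj})'
      return re.compile(value)

  pattern = bounded_json_regex(3)
  print(bool(pattern.fullmatch('{"a": [1, 2, {"b": null}]}')))  # True
  print(bool(pattern.fullmatch('{"a": [1, [2, [3, [4]]]]}')))   # False: too deep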

It is dangerous to tell the wrong people this, but it is true.

(Similarly, you can use a context-free parser to understand a context-sensitive language provided you bound that language in some way: one example is the famous C "lexer hack" that allows a simple LALR(1) parser to understand C, which, properly understood, is a context-sensitive language in the Chomsky sense.)

The best experience for the average programmer is describing their JSON declaratively in something like Zod and having their language runtime either build the appropriate state machine (or "regex") to match that schema or, if it truly is recursive, use something else to parse --- all transparently to the programmer.


What everyone forgets is that regexes as implemented in most programming languages are a strict superset of mathematical regular expressions. E.g., PCRE has "subroutine references" that can be used to match balanced brackets, and .NET has "balancing groups" that can similarly be used to do so. In general, most programming languages can recognize at least the context-free languages.
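
For instance, with Python's third-party regex module (an assumption here; the stdlib re module has no recursion support), a PCRE-style recursive pattern can match arbitrarily deep balanced braces, something a true regular expression cannot do:

  import regex  # third-party module; supports (?R) recursion like PCRE

  balanced = regex.compile(r'\{(?:[^{}]|(?R))*\}')

  print(bool(balanced.fullmatch('{"a": {"b": {}}}')))  # True: braces balance
  print(bool(balanced.fullmatch('{"a": {')))           # False: unbalanced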


You kind of missed the “and direct actors to play it out” part. If you did all of that, that’s essentially the creator.


... What was the last word in my comment?


The parent is making a philosophical argument. The exact Hollywood definitions aren’t important, since there are far more job roles in film production than in software development. If you insist, though, just replace "creator" with "producer" in his original argument and it’s the same: you can produce a movie without doing the acting yourself.


And my response is that creation is never, ever, assigned to just a singular person. All work that goes into a work is acknowledged.

Pretending that your model use doesn't matter ignores all the people's works that are being used to construct it.

If one director or producer went around bragging that the film was all their own work, the actors' and writers' unions would tear them to pieces.

You can't pretend only you are the creator. Because it does matter.


Naively, it seems difficult to decrease the 1.8x ratio while simultaneously increasing availability. The less duplication, the greater the risk of data loss if an AZ goes down? (I thought AWS promises you have a complete independent copy in all 3 AZs, though?)

To me, though, the idea that reading a single 16MB chunk requires actually reading something like 4MB of data from 5 different hard drives, and that this is faster, is baffling.


Availability zones are not durability zones. S3 aims for objects to still be available with one AZ down, but not more than that. That does actually impose a constraint on the ratio relative to the number of AZs you shard across.

If we assume 3 AZs, then you lose 1/3 of the shards when an AZ goes down. You could do at most 6:9, which is a 1.5x byte ratio. But that's unacceptable, because you know you will temporarily lose shards to HDD failure, and this scheme doesn't permit that in the AZ-down scenario. So 1.5 is our limit.

To lower the ratio from 1.8, it's necessary to increase the denominator (the number of shards necessary to reconstruct the object). This is not possible while preserving availability guarantees with just 9 shards.

Note that Cloudflare's R2 makes no such guarantees, and so does achieve a more favorable cost with their erasure coding scheme.

Note also that if you increase the number of shards, it becomes possible to change the ratio without sacrificing availability. Example: if we have 18 shards, we can choose 11:18, which gives us about 1.64 physical bytes per logical byte. And it still takes 1 AZ + 2 shards to make an object unavailable.
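
As a quick sanity check on those numbers, here is a small sketch (assuming a k-of-n erasure code with shards spread evenly over 3 AZs, and an object readable as long as at least k shards survive):

  def scheme_stats(k: int, n: int, azs: int = 3):
      per_az = n // azs
      ratio = n / k  # physical bytes per logical byte
      # Extra shard failures tolerated on top of losing a whole AZ.
      extra_losses = (n - per_az) - k
      return round(ratio, 2), extra_losses

  print(scheme_stats(6, 9))    # (1.5, 0)  -> no headroom once an AZ is down
  print(scheme_stats(5, 9))    # (1.8, 1)  -> one more shard can fail
  print(scheme_stats(11, 18))  # (1.64, 1) -> better ratio, same headroom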

You can extrapolate from there to develop other sharding schemes that would improve the ratio and improve availability!

Another key hidden assumption is that you don't worry about correlated shard loss except in the AZ down case. HDDs fail, but these are independent events. So you can bound the probability of simultaneous shard loss using the mean time to failure and the mean time to repair that your repair system achieves.


RAID doesn’t exactly make writes faster; it can actually be slower. It depends on whether you are using RAID for mirroring or striping. When you mirror, writes are slower since you have to write to all disks.


He explicitly mentioned RAID0 though :)


Metrics from the CDN will be wildly inaccurate. Also downloading a video isn’t the same as watching it.


I think they are solving two different problems at the same time. One is the order of clauses in a single operation (SELECT, then FROM, then WHERE, etc.), and the second is the actual pipelining, which replaces the need for nested queries.

It does seem like the former could be solved by just loosening up the grammar to allow you to specify things in any order. E.g., this seems perfectly unambiguous:

  from customer
  group by c_custkey
  select c_custkey, count(*) as count_of_customers


Yeah, exactly. You don't need literal pipes.


Space-wise, as long as you compress it, it's not going to make any difference. I suspect a JSON parser is a bit slower than a CSV parser, but the slight extra CPU usage is probably worth the benefits that come with JSON.
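
As a rough illustration of the space point (entirely made-up data), gzip absorbs most of JSON's repeated keys, so the compressed sizes come out much closer than the raw ones:

  import csv, gzip, io, json

  rows = [{"id": i, "name": f"user{i}", "score": i * 37 % 100} for i in range(10_000)]

  json_bytes = json.dumps(rows).encode()

  buf = io.StringIO()
  writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
  writer.writeheader()
  writer.writerows(rows)
  csv_bytes = buf.getvalue().encode()

  for label, data in [("json", json_bytes), ("csv", csv_bytes)]:
      print(label, "raw:", len(data), "gzipped:", len(gzip.compress(data)))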

