Hacker News
Naming things (2015) [pdf] (duke.edu)
157 points by petethomas on Aug 18, 2017 | 57 comments


The real lesson of this discussion seems to be: metadata has failed our expectations.

All this ancillary stuff that we'd like attached to files, like dates, client names and projects, versions and so on, is metadata. Some systems keep metadata in files: EXIF, Word, PDF. Some systems have conventions for this instead: header blocks in source code. But if neither of those applies? The only place you can put it reliably is the filename :(


If OS vendors could get over their "not invented here" syndrome with respect to default file systems we might see something a little bit more sane than FAT as the one interchangeable format. This is necessary but not sufficient to make decent file systems widely available. Then we can start hoping for such trivial features to become available.


Everybody shops for the best filesystem around that they can use. The problem is, licensing is a problem.

In other times, the industry would solve this by creating a standard for metadata interchange, but one of the biggest players gets a too-big-to-ignore amount of revenue from FAT, and is able to block any attempt at standardizing.


Licensing is a problem for SOME formats.

ZFS, ext4, and plenty of others used by BSDs and Linux distros have no licensing issues.

Also, if any player cared to win the filesystem war, they'd open up their spec royalty-free.


>lesson of this discussion seems to be: metadata has failed our expectations.

I've written several "disk and file catalog" utilities over the years so I inevitably spent a lot of time thinking about the "metadata" problem.

I think the issue is that it's impossible to solve metadata in a universal way that satisfies all scenarios. This is why metadata often ends up being inscribed into the filename. It's the "least worse" solution.

Let's take one example of the scientific data of csv files. Typical Comma-Separated-Value files do not have metadata fields such as author, measuring device, timestamp of readings, GPS coordinates. (Yes, csv files sometimes have a first line for "column names" which is arguably metadata but that's not the higher-level metadata I'm talking about.)

Exactly where does one put that high-level metadata?

1) If one makes a new pseudo-standard that treats any lines at the top of the csv beginning with "//" as metadata, then modifying any metadata of a 100GB csv file (e.g. changing the author from "John Doe" to "Jacob Doe") means rewriting the whole 100GB file to add 1 byte. As a related issue, let's say you have a hash of "e1bb76e7391b93eb12" for the csv file. You really want a stable hash that represents the actual "raw data" of the csv file. You don't necessarily want the hash to change just because the metadata changed. In this case, embedding metadata into the file itself makes certain operations worse, since typical hash utilities don't have "intelligence" about which parts of the file are "important" for hashing. (A similar problem is scanning mp3 files for duplicates. If 2 mp3 files have bit-identical audio output but the metadata tags are different, are they the same or different?!? It depends.)

2) If you put metadata at the end, typical utilities won't know about it. (UNIX has the "tail" command but standard MS Windows does not. The tail command is also unstructured and read-only, which makes it a non-solution for managing end-of-file metadata fields. Also, the "quick" view of GUI file managers shows the top of the file and not the bottom of it.)

3) If you put metadata in a separate file, it easily gets lost. File managers like MacOS Finder and MS Windows Explorer don't know when 2 files are supposed to be "treated as one unit" vs separately.

4) If you try to put metadata in a separate special area using OS file system features such as MS "NTFS alternate data streams" or Mac OSX "resource forks", it will get lost when transferring across incompatible filesystems or uploading to Amazon S3.
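
The hashing concern in point 1 can be illustrated with a metadata-aware hash. This is only a minimal sketch, assuming the hypothetical "//" header convention described there; the function name `data_hash` and the choice of SHA-256 are illustrative, not part of any standard. It also does nothing about the other cost in point 1: changing a header line still means rewriting the file.

```python
import hashlib
import io


def data_hash(lines, marker="//"):
    """Hash only the data lines, skipping leading lines that start with `marker`.

    `marker` is the hypothetical metadata prefix from point 1 above.
    """
    digest = hashlib.sha256()
    in_header = True
    for line in lines:
        if in_header and line.startswith(marker):
            continue  # metadata line at the top: ignored for hashing
        in_header = False  # first data line seen; everything after is data
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


# Two files with identical data but different metadata headers.
a = io.StringIO("// author: John Doe\nx,y\n1,2\n")
b = io.StringIO("// author: Jacob Doe\nx,y\n1,2\n")
print(data_hash(a) == data_hash(b))
```

A plain `sha256sum` over the same two files would give different digests, which is exactly the instability the comment describes.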

If one is feeling uncharitable, one could say that MS WinFS[1] was a spectacular failed attempt at unifying metadata. (A relational database that makes metadata more of a 1st class concept.) Nobody has tried it on that level since. Even Apple's new APFS file system didn't have the same metadata ambitions as WinFS.

The combination of tradeoffs leads everybody to re-invent the idea of embedding metadata (including namespacing hierarchies) into filenames. The article's suggestions for scientific data filenames look very similar to filenames that companies end up using for ETL pipelines.[2]

[1] https://en.wikipedia.org/wiki/WinFS

[2] https://en.wikipedia.org/wiki/Extract,_transform,_load


"File managers like MacOS Finder and MS Windows Explorer don't know when 2 files are supposed to be "treated as one unit" vs separately"

They sort-of do, each in their own way. Mac OS has packages: directories with files that the Finder treats as a single item. I think this is mostly implemented outside of the file system; if you give a directory a file name with an extension that some application claims to be an extension for a package, the Finder treats it as a package.

On Windows, when you export a web page from Internet Explorer as "web page, complete", you get a file and an associated folder containing the images of the file. MS Explorer shows them as separate icons, but knows that the two form a unit; when you delete one, it informs you about the existence of the other. I don't know how this is implemented, but suspect it is 100% outside of the file system, too.

Also, classic Mac OS had resource forks: a single alternate stream for every file. Text editors used it, for example, to store the cursor position, line wrap settings, etc. across saves.


Couldn't we just come up with some convention for "expanded filenames" where the meta-data is included in the file name itself? In the UI portion, you see what you see now, no difference, but say anything after the // delimiter in the file name is considered meta data and not shown in the windows/terminal UI.

Not sure if it's a good solution, but if I were to put the metadata somewhere I would try to put it in the identifier of the file (the name), as it is data that would help me identify the file AND its content!


This is partially what filesystem forks/streams were supposed to be for. The big problem, as mentioned, is that this works for the OS, but isn't somehow transferred to third parties.

https://blogs.technet.microsoft.com/askcore/2013/03/24/alter...

https://en.wikipedia.org/wiki/Fork_(file_system)


thanks for the links!


The problem I've found with file names like those described as "awesome" in the fifth slide is that if you have a bunch of them open at once, your taskbar/switcher/windows menu truncates them all to something like "2013-06-26_BRAFWTNEG...", making finding the one you want a bit more burdensome.

Jakob Nielsen had a post (a link for which I cannot find) recommending that web-page titles put the most specific information at the beginning. Doing something similar with file names (e.g., calling them "H01_MutantFraction...2013-06-26.csv", etc) would trade some of the advantages of the proposed scheme for speed of finding and switching between files when you're actually using them.


If you look at the filename examples, there seems to be an implicit suggestion of naming a group of related files using a common prefix.

If one needs to distinguish groups of files, why not just put them in directories? That's the reason directories exist, no?

I can somewhat understand if some (bad) software is written to look for files only in a single directory and you have to put everything there. But otherwise, it seems pretty pointless to use a common prefix and make filenames longer.


You lose the information in the directory name if the file is downloaded, or when viewing the title in an application title bar.


yeah, thinking along these lines filenames themselves are the easiest way to display the contents of a file, and that data travels with each and every file no matter where they're mv'd to, uploaded, deployed, shared, etc.

It's like brand name packaging, all the information including nutrition packed neatly on the outside. You don't go to the store and buy 'bread' you buy '2017-08-18-00-natures-own-dbl-fiber-wheat'.

This whole pdf resonated with me because it made me realize I'd developed almost identical practices without knowing it. Mostly over time, through trial and error and a kind of natural selection for the sake of ease.

Cool stuff.


Because you might want to group files in an order other than first element to last element. Putting them in directories bakes in a single specific organization, rather than letting them be organized as desired on the fly.


Isn't that simply solved by (hard)linking the files in whatever directory structures needed? There is nothing really in directories that bakes in a single specific organization.


You could, and then every time you wanted to organize things differently you'd hardlink a new set of directories, and then you'd have all these directories sitting around—some of which are useful, and some of which aren't.

Which is to say: yes, you could, but it seems like a worse idea than just not putting them in directories at all, and relying on the OS's (strong) search features to get you what you want.


> and then you'd have all these directories sitting around—some of which are useful, and some of which aren't.

Directories do not really cost anything, so carrying them around doesn't really matter that much. You can think of them as simple (hierarchical) tags. If you really cared, a (scheduled) GC pass to prune empty branches shouldn't be too difficult.

Of course, the question of whether tags are actually a good way of organizing things is still open, but that is a distinctly different problem from "baking in a single specific organization"; I would even venture to say that it is almost the opposite problem, simple tags generally being too freeform and unstructured.


That seems like a lot of trouble to accomplish something not quite as good as what she's already accomplished, without any of the work.


I'm constantly harping on everybody to pay attention to their file naming.

I'm a graphic designer, so for me everything is Client/YYMM-Project/_FINAL/YYMM-COLLATERAL-NAME

Within each project there is a _PROCESS folder with a _ELEMENTS subfolder for pieces the client has given me to work with.

For invoices I do YYMMDD-ClientName-Project-Sum.pdf. When the invoice is paid, I rename the file to add -PAID- before the client name. It's simple, but it's allowed me to easily track and maintain projects and billing over the years.

If I end up working for another 83 years, I guess I'll pad the year with a 0...

Proper file management is an undervalued skill and should be taught both in school and in corporate environments. In an old tech job we had a public folder on the server that was total chaos. So many people insisted on naming their files MAY-%day%-%contents%-%personsname% -- and, as you'd expect, people spent countless hours per year trying to hunt down that one file so-and-so worked on before they left for another job.


> I'm a graphic designer, so for me everything is Client/YYMM-Project/_FINAL/YYMM-COLLATERAL-NAME

> Within each project there is a _PROCESS folder with a _ELEMENTS subfolder for pieces the client has given me to work with.

Out of curiosity, why do you have the year/month in the names of both the project folder and the collateral file?

And why do you start the names of your final/process/elements folders with underscores?


I can't speak to stevewillows' answer, but my girlfriend uses a similar naming scheme in her architecture work. The project folder contains a year-month prefix indicating when the project started, and any year-month prefixes below that indicate when that part of the project was initiated. It maps the project in time for her: this folder tells her when she first talked to the client, this folder tells her when design started, this folder tells her when she got the first construction bids. The date on the top-level folder is extremely useful when looking through old projects. Just from consulting those files so often, she knows when most of her major projects were started and finished and how long each phase took, which is something I don't remember (and can't reconstruct) for most of the projects I've worked on. She has folders like this going back ten years. I can't even imagine what that would look like for my projects. I'm a bit jealous.


Including the year and month on a project file means it can stand independent of its directory for easy distinction in search, meaning you can find relevant files in a way that is uncomplicated and easy to understand.

[Edit] As dkarl mentioned more elegantly: this system helps map a project in time.

YYMM on project files is shorter and less ambiguous than using a project name for the same function.

I'd imagine the sub-directory underscore is also used to aid searching. You can easily visually identify (or filter) sub-directories.

> The initial underscore helps by (a) keeping those directories sorted [separately] and (b) providing a visual clue that they are 'special'. [0]

0. https://www.sitepoint.com/community/t/why-do-some-folders-ha...


I keep the dates on both for when I'm searching. For instance, if I'm searching for shirts, I'll see 0701-monkey-shirt.ai, but I draw a lot of these.


Have you considered writing an automatic file syntax check/rename? Reading the thread made the idea pop into my head that it could be a useful tool, especially in fast moving or group environments.

I prefer organizing by year and by category (business, code, etc.) instead of by client, but overall I'm mostly happy to have any organization of my files at all.
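
A syntax check like the one suggested here could be as small as a regular expression plus a date check. The sketch below targets the YYMMDD-ClientName-Project-Sum.pdf invoice scheme mentioned upthread; the allowed characters per field, the numeric form of the sum, and the function name `check_invoice_name` are all assumptions made for illustration, not the poster's actual rules.

```python
import re
from datetime import datetime

# Assumed field rules: alphanumeric client and project, numeric sum,
# optional "PAID-" marker before the client name.
INVOICE = re.compile(
    r"(?P<date>\d{6})-(?P<paid>PAID-)?(?P<client>[A-Za-z0-9]+)"
    r"-(?P<project>[A-Za-z0-9]+)-(?P<sum>\d+(?:\.\d+)?)\.pdf"
)


def check_invoice_name(filename):
    """Return a list of problems; an empty list means the name passes."""
    match = INVOICE.fullmatch(filename)
    if match is None:
        return ["does not match YYMMDD-[PAID-]Client-Project-Sum.pdf"]
    try:
        # Rejects impossible dates such as month 13 or February 30.
        datetime.strptime(match["date"], "%y%m%d")
    except ValueError:
        return ["date part is not a valid YYMMDD date"]
    return []


for name in ["170818-Acme-Logo-450.pdf", "Invoice final.pdf"]:
    print(name, check_invoice_name(name))
```

Running something like this over a shared folder on a schedule would flag stray names early, before anyone has to hunt for them.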


oh man, that would be great. There's probably a way, but I wish all files had something similar to ID3 tags with renaming applications.


That's an awesome scheme. I'm going to try incorporating it into mine!


The author forgot to write down the most important reason for the whole exercise: file transfer.

Modern operating systems index the contents of your files, so finding all your files on project "foo" is only a search away. If you are a GUI user, then file naming isn't really that important for locating data on your machine.

File names matter because there are 40 years of cruft out there that absolutely refuses to move metadata along with files. So you can touch your files to set dates, organize them in directories or tag them to your heart's content in Mac OS or Windows, but you will lose all that information when you attach the file to an email or put it in Dropbox.

So you only have a choice of two places to put metadata such that it will be carried along with the file: the file name or the file contents.

Putting the data in the file contents lacks discoverability and in many cases the applications you use to manipulate the files don't allow for additional metadata anyway. Also, some file types (Word .docx files, jpegs, MP3s) get their metadata updated and/or scrambled when you open them with specific applications. So really your only valid choice is to put it in the file name.

The author's specific recommendations (use underscores and hyphens for delimiting) assume that you really want to access the files with the command line and use globbing. Other than that implication, the recommendations are sound.


> avoid [...] accented characters

It's certainly good advice and I definitely avoid using non-ASCII characters in filenames in practice. But I can't help thinking that advice like that is why support for Unicode is still buggy in many places.

I see nothing fundamentally wrong with using non-ASCII for filenames (and the slides don't give any reasoning), if only random software wouldn't mangle encodings, sorting order, or plain refuse to accept such filenames.


In my view the main problem is in sharing those files with others. Other systems may not support those chars or the other users may not be aware of how they are sorted and even how to type them in a search.


For those who don't know Jenny Bryan, she is a wonderful force for good in the R community. It seems like these filenames are a little contentious here on HN, but IMO this will always be a big improvement over the file-naming practices of someone who has given little or no thought to the topic. Which I am guessing was her intended audience.

If you're getting into R or data analysis, check out http://stat545.com/topics.html. She has put a lot of thought into the project management aspects of carrying out a data science project that don't get discussed as often as other, sexier topics.


My only real concern is with left-padding numbers with 0s when you don't know in advance how big the numbers are going to get. Do you pad to 2 or 3 digits or...


When you have a long list of numbered filenames with no padding, pipe the list through "sort -V" for "version number" ordering.

Edit: Glitchmr has a better solution below, "ls -v".


Another annoyance is sorting hostnames, which reasonably sort right-to-left with period-delimited fields.
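
One way to get that ordering in a script is to sort on the reversed list of labels. A minimal sketch; the helper name `host_key` and the sample hostnames are made up for illustration:

```python
def host_key(hostname):
    """Sort key: labels compared right-to-left (TLD first, then domain, ...)."""
    return hostname.lower().split(".")[::-1]


hosts = ["www.example.org", "mail.example.com", "example.com", "api.example.org"]
print(sorted(hosts, key=host_key))
```

This groups everything under the same domain together, which a plain left-to-right string sort does not.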


Why left-pad numbers if you could simply use better tools that properly sort numbers (instead of using ASCII order, which doesn't even make sense for ordering purposes)? For instance, use `ls -v` instead of `ls` (possibly as an alias).
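
The same numeric-aware ordering is easy to reproduce inside a script when the listing does not come from `ls`. This is a rough sketch of the idea, not a reimplementation of what `ls -v` or `sort -V` do internally; `natural_key` and the file names are illustrative.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'plot10' sorts after 'plot2'."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd indices are always the digit runs.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


files = ["plot10.png", "plot2.png", "plot1.png"]
print(sorted(files))                   # plain string order
print(sorted(files, key=natural_key))  # numeric-aware order
```

The trade-off raised in the thread still applies: unpadded names only sort correctly in tools that use a key like this, while zero-padded names sort correctly everywhere.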


The pdf didn't mention the most important point: keep it short and to the point.

It's so hard to use a CLI and type that filename every time, or to navigate to it in a file explorer, when it's so long to read. The important bits of info should always be at the start.

ISO date conventions aren't necessary since most files have metadata associated with them (created and modified dates), so adding an ISO date is redundant for human-made files. You can always use a bulk renamer at any point.

But my point still stands: just sort regularly and add some sorting identifier at the front of the file name. Depending on who is working on the file and in what context, a simple number at the front suffices (01, 02, 03, etc.), or it can be a word with a version number at the end.

Lastly, the author didn't mention folder names. Those need to be one or two words at most to help segregate information if there are lots of files in that one folder.

If you're creating machine-made / autogenerated saved reports, following a standard ISO date convention makes sense though, with regexable slugs, etc.


There's some empirical work on how developers encounter and respond to naming anti-patterns e.g. http://www.veneraarnaoudova.com/wp-content/uploads/2014/10/2... and associated googlescholar search https://scholar.google.com/scholar?hl=en&as_sdt=0,34&q=lingu...


I always name from general to specific — left to right.

Example: client_project_element-name_20170818.txt

When sorted by name, similar items are grouped together.


MMDDYYYY

I like to call this middle endian.


For some reason I want to avoid names beginning with numbers. All my filenames can be used as an identifier (except the extension part).

i.e. [a-zA-Z][0-9a-zA-Z_]+\.[a-z0-9]+
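
For anyone who wants to check existing files against that pattern, here is a small sketch using it verbatim. The helper name `is_identifier_like` is made up for illustration.

```python
import re

# The pattern from the comment above, unchanged.
IDENTIFIER_NAME = re.compile(r"[a-zA-Z][0-9a-zA-Z_]+\.[a-z0-9]+")


def is_identifier_like(filename):
    """True if the whole filename matches the identifier-style pattern."""
    return IDENTIFIER_NAME.fullmatch(filename) is not None


for name in ["report_2017.txt", "2017_report.txt", "my-file.txt", "a.txt"]:
    print(name, is_identifier_like(name))
```

Note that, as written, the pattern requires a stem of at least two characters, so a name like "a.txt" is rejected; replacing the first `+` with `*` would allow one-letter stems.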


I name my files like this:

  2017-04-04 #S4907 #Choir List of names.pdf
  2017-04-05 #EXCITE #S=5005 Notes on Data Repositories.pdf
  2017-04-06 #ARAG #S=5031 #Amount=14.2 #CUR=EUR Invoice.pdf
with the following semantics:

* date. Every file name is prefixed with the ISO date to facilitate sorting

* tags. The syntax #<tagname> to categorize documents with tags.

* key-value pairs. The syntax $<key>=value lets us attach structured information to the document name.

and keep them in a single large folder.

On top of this, I have written some shell tooling to normalize, view, and shuffle around those documents: https://github.com/HeinrichHartmann/pile

E.g. `pile extract EXCITE` will extract all files with tag #EXCITE to a separate folder named #EXCITE. There is also a HTML form that helps with proper naming of new files.

File management is still a pain for me, but this at least gives me some confidence that I can retrieve stored documents reasonably well. I hope that one day I'll be able to auto-generate expense reports and tax filings from properly tagged filenames.
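
To show how names like these can be taken apart again, here is a minimal parsing sketch. It is not how `pile` itself works; `parse_name` is a hypothetical helper, and it follows the example filenames above, where key-value pairs are written as `#key=value`.

```python
import os


def parse_name(filename):
    """Split 'DATE #tag #key=value Title.ext' into its parts."""
    stem, _ext = os.path.splitext(filename)
    date, _, rest = stem.partition(" ")
    tags, pairs, title = [], {}, []
    for token in rest.split():
        if token.startswith("#") and "=" in token:
            key, _, value = token[1:].partition("=")
            pairs[key] = value          # e.g. #CUR=EUR
        elif token.startswith("#"):
            tags.append(token[1:])      # e.g. #ARAG
        else:
            title.append(token)         # free-text part of the name
    return {"date": date, "tags": tags, "pairs": pairs, "title": " ".join(title)}


print(parse_name("2017-04-06 #ARAG #S=5031 #Amount=14.2 #CUR=EUR Invoice.pdf"))
```

With the fields available as a dictionary, summing `Amount` values per tag is a short loop, which is the kind of expense report the comment hopes for.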


You are suggesting a file name format with spaces in it?


yeah. I just can't stand all those dashes ;)

Just make sure to quote your variables in bash.


What's wrong with spaces?


I'm not a fan of dates at the beginning of the file name. If a file name needs a date I always put it at the end so future versions group when sorting the files.


Nice rules :) I especially like the use of underline-delimited metadata in filenames, saved me on a huge research deadline once.


It's a nice set of guidelines, but why the snail on page 15?


I think it's to embrace the slug: https://en.wikipedia.org/wiki/Semantic_URL#Slug


Considering that most of the world uses "little-endian" format for date writing ( https://en.wikipedia.org/wiki/File:Date_format_by_country_(n... ), how come ISO 8601 settled on "big-endian"?

Not so practical for anything else but file naming IMHO...


First off, and most importantly, ISO 8601 is a standard for data interchange, so being easy to parse visually and with a computer is a feature. ISO 8601 groups and sorts well without needing special rules and remains consistent all the way from the year to the millisecond. It is easier to parse visually when you are looking at a list of values, especially if they are similar.

Second "little endian" dates are inconsistent because the year is still big endian. If you want to remain consistent you would have to write the current year as 1720 (or even 7120!) because the years (or decades) are smaller than the century. To achieve the consistency of ISO 8601 with little endian time you would also have to write seconds before minutes and minutes before hours.
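
The "sorts well without needing special rules" point is easy to check: a plain string sort of ISO dates is already chronological, while the same sort on day-first dates is not. A small sketch with made-up dates:

```python
from datetime import date

days = [date(2017, 8, 18), date(2016, 12, 1), date(2017, 1, 30)]
chronological = sorted(days)

iso = [d.isoformat() for d in days]               # YYYY-MM-DD
little = [d.strftime("%d-%m-%Y") for d in days]   # DD-MM-YYYY

# Sort both as plain strings, the way a file listing would.
iso_ok = sorted(iso) == [d.isoformat() for d in chronological]
little_ok = sorted(little) == [d.strftime("%d-%m-%Y") for d in chronological]
print(iso_ok, little_ok)
```

This only holds while every field is zero-padded to a fixed width, which ISO 8601 requires.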


Right, very good point about the endianness of the year. Makes sense now.


A little pain converting to a non-stupid date format now would result in less pain for the rest of the history of humanity (so hopefully thousands of years). I think it's worth it.


Given that we use "big-endian" Arabic numerals, the ISO date format using those numerals makes the most sense.

Of course, when we remember that Arabic is right-to-left, we realise that in the original Arabic the least significant digit was read first ...


Those filenames look terrible to me.

BRAFWTNEGASSAY doesn't make any sense. To distinguish between the filenames you actually have to read the full filename, and with such long filenames chances are they won't be fully displayed, so you get 2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFrac... for the first four files.

Keeping cruft out of your filenames seems like a much, much better way to name files. Also, most systems keep track of the creation date, so there's no need to keep it in the filename. I think it's better to give files an id.


BRAFWTNEGASSAY would make sense to the owner of the file or someone working in that particular project. Consider it a project name, or a keyword that is relevant in that particular context. If you're working with files from different sources with multiple contributors this sort of approach works brilliantly.

You could have named it differently: 2013-06-26_KUTKLOON7_Plasmid-Cellline-100-1MutantFrac

Creation date can sometimes be lost if you copy/move the file between different mediums


> Those filenames look terrible to me.

I agree. At some point, and I think in her example, it makes sense to just change some of those '_' to '/' and boom, you now have folders.


"most systems keep track of the creation date"

You only need one tool that doesn't quite do things right to lose that:

- file copy programs need to explicitly set the creation date back to that of the original file.

- when you do save a file in an editor, it typically writes a complete new file (ideally through a write temp file/delete/rename dance). Again, the program may forget to reset the creation date of the new file to that of the original.

- I don't think git even _stores_ creation time stamps in repositories.

I've seen too many files with obviously bogus creation time stamps to put much trust in the creation dates of files.
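
The first bullet is easy to reproduce. The sketch below uses the modification time as a stand-in, since creation time cannot be set portably from Python; the file names and the old timestamp are arbitrary. `shutil.copyfile` copies only the contents, while `shutil.copy2` also tries to carry the timestamps over.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "notes.txt")
    with open(src, "w") as fh:
        fh.write("draft\n")

    old = 1_000_000_000  # an arbitrary timestamp in 2001
    os.utime(src, (old, old))  # pretend the file is old

    plain = shutil.copyfile(src, os.path.join(tmp, "plain.txt"))  # contents only
    kept = shutil.copy2(src, os.path.join(tmp, "kept.txt"))       # contents + metadata

    plain_mtime = int(os.stat(plain).st_mtime)
    kept_mtime = int(os.stat(kept).st_mtime)
    print(plain_mtime == old, kept_mtime == old)
```

A tool that takes the `copyfile` route silently stamps the copy with the current time, which is how the bogus dates described above tend to appear.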


> most systems keep track of the creation date

True, but often it makes sense to put date into the filename as well. For example, notes for an event that happened on a certain date. You might write the first draft at that date and then edit/move it later. It's still strongly related to the date the event took place, but the filesystem ctime and mtime will be different.



