- The frontpage should directly show the list of papers, like with HN. You shouldn't have to click on "trending" first. (When you are logged in, you see a list of featured papers on the homepage, which isn't as engaging as the "trending" page. Again, compare HN: Same homepage whether you're logged in or not.)
- Ranking shouldn't be based on comment activity, which favors controversial papers; instead, papers should be voted on like comments.
- It's slightly confusing that usernames allow spaces. It will also make it harder to implement some kind of @ functionality in the comments.
- Use HTML rather than PDF. Something that could be trivial with HTML, like clicking on an image to show a bigger version, requires you to awkwardly zoom in with PDF. With HTML, you would also have one column, which would fit better with the split paper/comments view.
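To illustrate the @ point in the list above, here is a minimal Python sketch. It assumes a naive mention pattern; `MENTION` and `find_mentions` are hypothetical names for illustration, not anything taken from the site. With such a pattern, a username containing a space gets cut off at the first word.

```python
import re

# Hypothetical, naive mention pattern: "@" followed by word characters.
# This is an assumption for illustration, not the site's actual parser.
MENTION = re.compile(r"@(\w+)")


def find_mentions(comment: str) -> list[str]:
    """Return the usernames that the naive pattern picks out of a comment."""
    return MENTION.findall(comment)


# A username without spaces is captured whole.
print(find_mentions("thanks @jane_doe for the pointer"))  # ['jane_doe']

# A username with a space is truncated at the first word.
print(find_mentions("thanks @Jane Doe for the pointer"))  # ['Jane']
```

Working around this would probably need either a delimiter syntax for mentions or a lookup against the list of known usernames, both of which are more work than the simple pattern.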
The PDF is the original paper, as it appears on arXiv, so using PDF is natural.
In general academics prefer PDF to HTML. In part, this is just because our tooling produces PDFs, so this is easiest. But also, we tend to prefer that the formatting be semi-canonical, so that "the bottom of page 7" or "three lines after Theorem 1.2" are meaningful things to say and ask questions about.
That said, the arXiv is rolling out an experimental LaTeX-to-HTML converter for those who prefer HTML, for those who usually prefer PDF but may be just browsing on their phone at the time, or for those who have accessibility issues with PDFs. I just checked this out for one of my own papers; it is not perfect, but it is pretty good, especially given that I did absolutely nothing to ensure that our work would look good in this format:
> In general academics prefer PDF to HTML. In part, this is just because our tooling produces PDFs, so this is easiest.
The tooling producing PDF by default absolutely makes the preference for PDF justifiable. However, tooling is driven by usage - if more papers come with rendered HTML (e.g. through Pandoc if necessary), and people start preferring to consume HTML, then tooling support for HTML will improve.
> But also, we tend to prefer that the formatting be semi-canonical, so that "the bottom of page 7" or "three lines after Theorem 1.2" are meaningful things to say and ask questions about.
Couldn't you replace references like "the bottom of page 7" with others like "two sentences after theorem 1.2" that are layout-independent? This would also make it easier to rewrite parts of the paper without having to go back and fix all of your layout-dependent references when the layout shifts.
HTML has strong advantages for both paper and electronic reading, so I think it's worth making an effort to adopt.
When I print out a paper to take notes, the margins are usually too narrow for my note-taking, and I additionally have a preference for a narrow margin on one side and a wide margin on the other (on the same side, not alternating with page parity like a book), which virtually no paper has in its PDF representation. When I read a paper electronically, I want to eliminate pagination and read the entire thing as a single long page. Both of these things are significantly easier to do with HTML than with LaTeX (and, in the "eliminate pagination" case, I've never found a way to do it with LaTeX at all).
(also, in general, HTML is just far more flexible and accessible than PDF for most people to modify to suit their preferences - I think most on HN would agree with that)
HTML still lacks one key feature: a way of storing the entire document as a single file that remains fully functional offline and can be reasonably expected to be widely supported for decades. Research papers are used both for communicating new results and for archiving them. The long-term stability needed for the latter has never been a strong point of web technology.
Indeed, I posted my first paper in 2006. It is still live on the internet in exactly the same format, and I've done absolutely nothing to maintain it.
I'm guessing there are few web pages of any significance which need to stay exactly the same for a long time. Here is one example which I've seen trotted out from time to time on HN:
This is clearly the exception. It seems that maintainers of web pages usually expect that they'll need to maintain and update them for as long as they want them to be accessible, and that's definitely not something I'd care to do for research papers.
You can make an HTML file self-contained by embedding CSS in a `<style>` tag and converting images to Base64, embedding them directly in the `<img>` tag as data URLs. This removes the need for external files, making everything contained within a single HTML file.
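As a rough sketch of that approach in Python, using only the standard library: the function names `to_data_url` and `make_self_contained`, and the `images` mapping from file name to `(bytes, MIME type)`, are assumptions made up for this example, and the regex only handles plain double-quoted `src="..."` attributes.

```python
import base64
import re


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a Base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def make_self_contained(body_html: str, css_text: str, images: dict) -> str:
    """Wrap body_html in a single HTML document with CSS and images inlined.

    images maps a file name, as it appears in src="...", to a
    (bytes, mime_type) pair. Unknown src values are left untouched.
    """

    def replace_src(match: re.Match) -> str:
        name = match.group(1)
        if name not in images:
            return match.group(0)
        data, mime_type = images[name]
        return f'src="{to_data_url(data, mime_type)}"'

    inlined_body = re.sub(r'src="([^"]+)"', replace_src, body_html)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">\n'
        f"<style>\n{css_text}\n</style>\n"
        "</head>\n"
        f"<body>\n{inlined_body}\n</body></html>\n"
    )


# Usage: placeholder bytes stand in for a real image file.
page = make_self_contained(
    '<h1>Paper</h1><img src="fig1.png">',
    "body { max-width: 40em; }",
    {"fig1.png": (b"\x89PNG placeholder", "image/png")},
)
```

The result is one file with no external dependencies, though Base64 makes embedded images roughly a third larger, and this says nothing about whether browsers decades from now will render it the same way.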
I agree that PDF is better than web technologies in terms of stability. I'm not objecting to PDFs being available (like you said, for archive purposes you want them provided by the authors), but to PDFs being the default, and oftentimes only, format available.
Note that ePub is basically just a zipped HTML file, and has become quite common for ebooks. I don’t know how that might be for archiving purposes?
I generally stick to PDF myself, but I do sometimes wish it would be more ergonomic to reflow a 2-column paper for reading on mobile on the go, for example. Also, ePub is easier to read in night mode than PDF recoloring, and seems easier to search through (try searching for a Greek letter in a PDF…).
EDIT: How is the math support in ePub though? Are people embedding KaTeX/MathJax or just relying on MathML, and how is the quality compared to TeX?
> Couldn't you replace references like "the bottom of page 7" with others like "two sentences after theorem 1.2" that are layout-independent?
Yes, but I think such references are inherently harder to locate. Personally I try to just avoid making references to specific locations in the document and instead name anything that needs to be referenced (e.g. Figure 5, Theorem 3.2).
> This would also make it easier to rewrite parts of the paper without having to go back and fix all of your layout-dependent references when the layout shifts.
Just thinking about having to change layout-dependent references, every time I add two sentences to the introduction, gives me a migraine.
I never do anything like this in the paper itself, nor does anyone else that I'm aware of. I'm thinking of informal discussions, where I ask another mathematician about something specific in a paper.
I increasingly recommend against the Arxiv HTML version. I thought it had an acceptable start and they would fix the remaining problems and rapidly become on par with the PDF, but that seems to not be happening.
The HTML version is seriously buggy; and the worst part is, a lot of those bugs take the form of silently dropping or hiding content. It's bad enough when half the paper is gone, because at least you notice that quickly, but it'll also do things like silently drop sections or figures, and you won't realize that until you hit a reference like 'as discussed in Section 3.1' and you wonder how you missed that. I filed like 25 bugs on their HTML pages, concentrating on the really big issues (minor typographic & styling issues are too legion to try to report), and AFAIK, not a single one has been fixed in a year+. Whatever resources they're devoting to it, it's apparently totally inadequate to the task.
I think development on the TeX-to-HTML compiler has slowed down at some point, and it's far from perfect yet. Some of the issues are probably HTML5 limitations, unlikely to be fixed any time soon (unless one wants formulas to become graphics).
But there is another problem: It takes too long to load on mobile and doesn't reflow. I thought mobile was one of the reasons people wanted HTML in the first place!
> Some of the issues are probably HTML5 limitations, unlikely to be fixed any time soon (unless one wants formulas to become graphics).
You can convert a lot of formulas into either Mathjax/Katex-style fonts or MathML, or even just HTML+Unicode. (I get a very long way with pure HTML+Unicode+CSS on Gwern.net, and didn't even have to write a TeX-to-HTML compiler - just a long LLM prompt: https://github.com/gwern/gwern.net/blob/master/build/latex2u... )
But that's missing the point. Who cares about all of the refinements like reflow or pretty equations, when you are routinely serving massively corrupted and silently incomplete HTML versions? I don't care how good the typography is in your book if it's missing 5% of pages at random and also doesn't have any page numbers or table of contents...
I think that's essentially only true if they are that in the original source. You can check for yourself, most papers have the TeX source available on arxiv.
I’m ok with the PDF, but the title should be in HTML. The PDF failed to load for me due to tracker blockers (also why?!), so I was confused: there was no title, but there were comments.
Counterpoint: please don't do any of the above and keep arxiv as it is. It is too valuable to mess it up, it is the few things on the internet that have not been ruined yet, and the "comment activity" can happen in the articles themselves at the scale of years, decades, and centuries.
I actually want the above, BUT I don't want it to be open to everyone. Not all gate keeping is bad.
We don't lack places that the public can engage with researchers and experts. What we do lack are places where researchers/experts can communicate with one another __and expect the other person to be a peer__. The bar to arxiv is (absurdly) low, and I think that's fine.
I'm going to go crazy if I get more GitHub issues asking where the source code is or how to fine tune a model. My research project page is not a Google Search engine nor ChatGPT...
I don't think this is an inherently better approach, but maybe there should be an option for different ranking mechanisms. You could also rank by things like cite-frequency, cite-recency, "cite pagerank", etc.
Agree, don’t sink a bunch of effort into creating a ranking algorithm. Expose metrics that users can sort or filter by, which will work for both signed-in and signed-out users. If you want to add more tools for signed-in users, then let them define their own filters that they can save, like comment activity weighted by author, commenter, recency, topic, etc. See the NNTP discussion that was on here the other day.
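A minimal Python sketch of what "expose metrics and let users sort" could look like. Everything here is hypothetical: the `Paper` fields, the `SORT_KEYS` names, and the `weighted` helper for a saved per-user filter are assumptions for illustration, not the site's data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    title: str
    submitted: date
    votes: int
    comments: int
    citations: int


# Plain metrics anyone can sort by, signed in or not.
SORT_KEYS = {
    "newest": lambda p: p.submitted,
    "votes": lambda p: p.votes,
    "comments": lambda p: p.comments,
    "citations": lambda p: p.citations,
}


def weighted(weights: dict):
    """Build a sort key from per-metric weights, e.g. a user's saved filter."""
    def key(paper: Paper) -> float:
        return sum(w * getattr(paper, name) for name, w in weights.items())
    return key


def rank(papers, key):
    """Return papers sorted from highest to lowest under the given key."""
    return sorted(papers, key=key, reverse=True)


papers = [
    Paper("A", date(2024, 1, 1), votes=5, comments=40, citations=0),
    Paper("B", date(2024, 3, 1), votes=30, comments=3, citations=2),
    Paper("C", date(2023, 6, 1), votes=10, comments=10, citations=50),
]

by_comments = [p.title for p in rank(papers, SORT_KEYS["comments"])]
saved_filter = weighted({"votes": 1.0, "citations": 0.5})
by_saved = [p.title for p in rank(papers, saved_filter)]
```

The point of the design is that no single ordering is privileged: the same list comes out differently under each key, and a saved filter is just one more key.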
It doesn't seem like citations would be good for discovery, because there must be a significant latency between when a paper is released and when citations start coming in.
Probably it would be best to just get a site on the web and expose a bunch of different metrics so people can sort by whatever.
Citations are probably not the best metric for discovery, but also this really just makes me wonder if papers are not the best thing for discovery. An academic produces ideas, not papers, those are just a side-effect. The path is something like:
* come up with an idea
* write short conference papers about it
* present it in conferences
* write journal papers about it
* maybe somebody writes a thesis about it
(Talking to people about it throughout).
If we want to discover ideas as they are being worked on, I guess we’d want some proxy that captures whether all that stuff is progressing, and if anybody has noticed…
Finding that proxy seems incredibly difficult, maybe impossible.
I'm not sure I agree about papers just being a side effect. An idea by itself has significantly less value than an idea which has been clearly documented and evaluated. I think a paper is often still the best way to do this.
Great idea, we'll look into making the home page the trending page soon.
Regarding HTML, our original site actually only supported HTML (because it was easier to build an annotator for an HTML page). The issue is that a good ~25% of these papers don't render properly, which pisses off a lot of academics. Academics spend a lot of time making their papers look nice for PDF, so when someone comes along and refactors their entire paper in HTML, not everyone is a fan.
That being said, I do think long term HTML makes a lot of sense for papers. It allows researchers to embed videos and other content (think, robotics papers!). At some point we do want to incorporate HTML papers back into the site (perhaps as a toggle).
> The frontpage should directly show the list of papers, like with HN.
I disagree. There are numerous times where I have browsed the comments on a HN post where people haven't read the article and are just responding to the comment thread. The workflow for this seems a bit different in that a person would have already read a paper and wanted to read through existing discussions or respond to discussion. With that, having the search front and center would follow as the next steps for a person who read a paper and wanted to "search" for discussions related to that paper in particular.
HN is more for aimless browsing, which is a bit different from researching a specific area or topic.
> - Ranking shouldn't be based on comment activity, which ranks controversial papers, rather papers should be voted on like comments.
How about not ranking things at all? I don't feel like things like this should be a popularity/"like" contest and instead let the content of the paper/comments speak for themselves. Yes, there will be some chaff to sort through when reading, but humanity will manage.
Just sort things by updated/created/timestamp and all the content will be equal.
> let the content of the paper/comments speak for themselves.
People can't read everything, and have to rely on others to filter up the good stuff. If you read something random, based on no recommendation, it's charity work (the odds are extremely good that it is bad) and you should recommend that thing to other people if it turns out to be useful. Ultimately, that's the entire point of any of this design: if we don't care about any of the metadata on the papers, they could just be numbered text files at an ftp site.
The fewer things I have to read to find out they're shit, the longer life I have.
I say the opposite: put a lot of thought into how papers are organized and categorized, how comments on papers are organized and categorized, the means through which papers can be suggested to users who may be interested in them, and the methods by which users can inject their opinions and comment into those processes. Figure out how to thwart ways this process can be gamed.
Treat the content equally, don't force the content to be equal. Hacker News shouldn't just be the unfiltered new page.
Most articles are not interesting, and most of the interesting ones are interesting only to a niche of a few researchers. The front page will be flooded by uninteresting stuff.
My guess is that they're trying to deal with a problem they're creating: being open to everyone. The problem with places like Twitter, Reddit, Github, HN, etc. is that you don't know whether you're talking to a peer, and the idiots asking irrelevant questions or proposing dumb things outnumber researchers a million to one. Even allowing the public to upvote or affect the rankings is not beneficial to science.
I'm all for casting wide nets and making things available to everyone, but a little gate keeping is not bad (just don't gate keep by race, class, or those things). But I'm sorry, research is hard. There's a reason people spend decades researching things that at face value look trivial. Rabbit holes are everywhere and just because you don't know about them doesn't mean your opinion has equal weight.
We seriously lack areas where experts can talk to other experts.