
Great idea.

- The frontpage should directly show the list of papers, like with HN. You shouldn't have to click on "trending" first. (When you are logged in, you see a list of featured papers on the homepage, which isn't as engaging as the "trending" page. Again, compare HN: Same homepage whether you're logged in or not.)

- Ranking shouldn't be based on comment activity, which favors controversial papers; instead, papers should be voted on like comments.

- It's slightly confusing that usernames allow spaces. It will also make it harder to implement some kind of @ functionality in the comments.

- Use HTML rather than PDF. Something that would be trivial with HTML, like clicking on an image to show a bigger version, requires you to awkwardly zoom in with PDF. With HTML, you would also have a single column, which would fit better with the split paper/comments view.



> Use HTML rather than PDF.

The PDF is the original paper, as it appears on arXiv, so using PDF is natural.

In general academics prefer PDF to HTML. In part, this is just because our tooling produces PDFs, so this is easiest. But also, we tend to prefer that the formatting be semi-canonical, so that "the bottom of page 7" or "three lines after Theorem 1.2" are meaningful things to say and ask questions about.

That said, the arXiv is rolling out an experimental LaTeX-to-HTML converter for those who prefer HTML, for those who usually prefer PDF but may be just browsing on their phone at the time, or for those who have accessibility issues with PDFs. I just checked this out for one of my own papers; it is not perfect, but it is pretty good, especially given that I did absolutely nothing to ensure that our work would look good in this format:

https://arxiv.org/html/2404.00541v1

So it looks like we're converging towards having the best of both worlds.


> In general academics prefer PDF to HTML. In part, this is just because our tooling produces PDFs, so this is easiest.

The tooling producing PDF by default absolutely makes the preference for PDF justifiable. However, tooling is driven by usage - if more papers come with rendered HTML (e.g. through Pandoc if necessary), and people start preferring to consume HTML, then tooling support for HTML will improve.

> But also, we tend to prefer that the formatting be semi-canonical, so that "the bottom of page 7" or "three lines after Theorem 1.2" are meaningful things to say and ask questions about.

Couldn't you replace references like "the bottom of page 7" with others like "two sentences after theorem 1.2" that are layout-independent? This would also make it easier to rewrite parts of the paper without having to go back and fix all of your layout-dependent references when the layout shifts.

HTML has strong advantages for both paper and electronic reading, so I think it's worth making an effort to adopt.

When I print out a paper to take notes, the margins are usually too narrow for my note-taking, and I additionally prefer a narrow margin on one side and a wide margin on the other (on the same side of every page, not alternating with page parity like a book), which virtually no paper has in its PDF representation. When I read a paper electronically, I want to eliminate pagination and read the entire thing as a single long page. Both of these things are significantly easier to do with HTML than LaTeX (and, in the case of eliminating pagination, I've never found a way to do it with LaTeX at all).

(also, in general, HTML is just far more flexible and accessible than PDF for most people to modify to suit their preferences - I think most on HN would agree with that)


HTML still lacks one key feature: a way of storing the entire document as a single file that remains fully functional offline and can be reasonably expected to be widely supported for decades. Research papers are used both for communicating new results and for archiving them. The long-term stability needed for the latter has never been a strong point of web technology.


Indeed, I posted my first paper in 2006. It is still live on the internet in exactly the same format, and I've done absolutely nothing to maintain it.

I'm guessing there are few web pages of any significance that have stayed exactly the same for a long time. Here is one example which I've seen trotted out from time to time on HN:

https://www.dolekemp96.org/main.htm

This is clearly the exception. It seems that maintainers of web pages usually expect that they'll need to maintain and update them for as long as they want them to be accessible, and that's definitely not something I'd care to do for research papers.


You can make an HTML file self-contained by embedding CSS in a `<style>` tag and converting images to Base64, embedding them directly in the `<img>` tag as data URLs. This removes the need for external files, making everything contained within a single HTML file.
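As a rough sketch of the image-inlining step (the function name and regex approach here are made up for illustration; a real tool would parse the HTML properly):

```python
import base64
import mimetypes
import re
from pathlib import Path

def inline_images(html: str, base_dir: Path) -> str:
    """Replace relative <img> src attributes with Base64 data URLs,
    so the HTML file no longer depends on external image files."""
    def to_data_url(match: re.Match) -> str:
        src = match.group(1)
        mime = mimetypes.guess_type(src)[0] or "application/octet-stream"
        payload = base64.b64encode((base_dir / src).read_bytes()).decode("ascii")
        return f'src="data:{mime};base64,{payload}"'
    # Only rewrite relative paths (no colon), leaving http(s):// URLs alone.
    return re.sub(r'src="([^":]+)"', to_data_url, html)
```

Embedding the CSS is even simpler: read the stylesheet and paste it into a `<style>` tag in the `<head>`.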


I agree that PDF is better than web technologies in terms of stability. I'm not objecting to PDFs being available (like you said, for archive purposes you want them provided by the authors), but to PDFs being the default, and oftentimes only, format available.


Note that ePub is basically just zipped HTML, and has become quite common for ebooks. I don't know how suitable that is for archiving purposes, though.

I generally stick to PDF myself, but I do sometimes wish it would be more ergonomic to reflow a 2-column paper for reading on mobile on the go, for example. Also, ePub is easier to read in night mode than PDF recoloring, and seems easier to search through (try searching for a Greek letter in a PDF…).

EDIT: How is the math support in ePub though? Are people embedding KaTeX/MathJax or just relying on MathML, and how is the quality compared to TeX?


> Couldn't you replace references like "the bottom of page 7" with others like "two sentences after theorem 1.2" that are layout-independent?

Yes, but I think such references are inherently harder to locate. Personally I try to just avoid making references to specific locations in the document and instead name anything that needs to be referenced (e.g. Figure 5, Theorem 3.2).


Yes, I absolutely agree - I just figured that there had to be a reason that someone would want to do that. Chesterton's Fence and whatnot.


> This would also make it easier to rewrite parts of the paper without having to go back and fix all of your layout-dependent references when the layout shifts.

Just thinking about having to change layout-dependent references every time I add two sentences to the introduction gives me a migraine.

I never do anything like this in the paper itself, nor does anyone else that I'm aware of. I'm thinking of informal discussions, where I ask another mathematician about something specific in a paper.


I increasingly recommend against the arXiv HTML version. I thought it had an acceptable start and that they would fix the remaining problems and rapidly reach parity with the PDF, but that doesn't seem to be happening.

The HTML version is seriously buggy; and the worst part is, a lot of those bugs take the form of silently dropping or hiding content. It's bad enough when half the paper is gone, because at least you notice that quickly, but it'll also do things like silently drop sections or figures, and you won't realize that until you hit a reference like 'as discussed in Section 3.1' and you wonder how you missed that. I filed like 25 bugs on their HTML pages, concentrating on the really big issues (minor typographic & styling issues are too legion to try to report), and AFAIK, not a single one has been fixed in a year+. Whatever resources they're devoting to it, it's apparently totally inadequate to the task.


I think development on the TeX-to-HTML compiler has slowed down, and it's still far from perfect. Some of the issues are probably HTML5 limitations, unlikely to be fixed any time soon (unless one wants formulas to become graphics).

But there is another problem: It takes too long to load on mobile and doesn't reflow. I thought mobile was one of the reasons people wanted HTML in the first place!


> Some of the issues are probably HTML5 limitations, unlikely to be fixed any time soon (unless one wants formulas to become graphics).

You can convert a lot of formulas into either Mathjax/Katex-style fonts or MathML, or even just HTML+Unicode. (I get a very long way with pure HTML+Unicode+CSS on Gwern.net, and didn't even have to write a TeX-to-HTML compiler - just a long LLM prompt: https://github.com/gwern/gwern.net/blob/master/build/latex2u... )

But that's missing the point. Who cares about all of the refinements like reflow or pretty equations, when you are routinely serving massively corrupted and silently incomplete HTML versions? I don't care how good the typography is in your book if it's missing 5% of pages at random and also doesn't have any page numbers or table of contents...


In PDFs on arXiv, syntax-highlighted code blocks are graphics.


I think that's essentially only true if they are graphics in the original source. You can check for yourself; most papers have the TeX source available on arXiv.


> That said, the arXiv is rolling out an experimental LaTeX-to-HTML

Some history: https://www.arxiv-vanity.com/


I'm OK with the PDF, but the title should be in HTML. The PDF failed to load for me due to tracker blockers (also, why?!), so I was confused: there was no title, but there were comments.


Counterpoint: please don't do any of the above; keep arXiv as it is. It is too valuable to mess up, it is one of the few things on the internet that have not been ruined yet, and the "comment activity" can happen in the articles themselves at the scale of years, decades, and centuries.


This seems to be a completely different team from arXiv, making a discussion forum on the side.

And I prefer this over discussions on 'X'.


I actually want the above, BUT I don't want it to be open to everyone. Not all gatekeeping is bad.

We don't lack places where the public can engage with researchers and experts. What we do lack are places where researchers/experts can communicate with one another __and expect the other person to be a peer__. The bar to arXiv is (absurdly) low, and I think that's fine.

Not everything has to be for everyone.

My longer comment: https://news.ycombinator.com/item?id=41484123

I'm going to go crazy if I get more GitHub issues asking where the source code is or how to fine-tune a model. My research project page is not Google Search, nor is it ChatGPT...


> it is one of the few things on the internet that have not been ruined yet

Hmm, this is an interesting point.


> papers should be voted on like comments.

I don't think this is an inherently better approach, but maybe there should be an option for different ranking mechanisms. You could also rank by things like cite-frequency, cite-recency, "cite pagerank", etc.
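For example (a toy sketch; the fields and metric names are made up), exposing multiple rankings can be as simple as a dictionary of sort keys:

```python
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    votes: int
    citations: int
    days_old: int

# Each ranking is just a sort-key function the user could pick from.
RANKINGS = {
    "votes": lambda p: p.votes,
    "cite-frequency": lambda p: p.citations,
    "cite-recency": lambda p: p.citations / (p.days_old + 1),
}

def rank(papers, by):
    """Sort papers by the chosen metric, best first."""
    return sorted(papers, key=RANKINGS[by], reverse=True)
```

Something like "cite pagerank" would need the whole citation graph, but it would slot into the same dictionary.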


Agree, don't sink a bunch of effort into creating a ranking algorithm. Expose metrics that users can sort or filter by, which will work both signed in and signed out. If you want to add more tools for signed-in users, let them define their own filters that they can save: comment activity weighted by author, commenter, recency, topic, etc. See the NNTP discussion that was on here the other day.


Yep. User-driven ranking leads to people gaming the system for internet points.


It doesn't seem like citations would be good for discovery, because there must be a significant latency between when a paper is released and when citations start coming in.


Probably it would be best to just put up a site that exposes a bunch of different metrics so people can sort by whatever.

Citations are probably not the best metric for discovery, but this really makes me wonder whether papers are the best unit for discovery at all. An academic produces ideas, not papers; papers are just a side effect. The path is something like:

* have an idea

* write short conference papers about it

* present it in conferences

* write journal papers about it

* maybe somebody writes a thesis about it

(Talking to people about it throughout).

If we want to discover ideas as they are being worked on, I guess we’d want some proxy that captures whether all that stuff is progressing, and if anybody has noticed…

Finding that proxy seems incredibly difficult, maybe impossible.


I'm not sure I agree about papers just being a side effect. An idea by itself has significantly less value than an idea which has been clearly documented and evaluated. I think a paper is often still the best way to do this.


Tiny note: Stack Exchange also allows spaces in display names, and they make @ functionality work regardless: https://meta.stackexchange.com/a/43020/297476

Agreed that it makes it more complicated though.
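For illustration only (this is not Stack Exchange's actual algorithm): one way to make @mentions work with spaced display names is to strip the spaces before prefix matching:

```python
def resolve_mention(token, display_names):
    """Resolve an @mention against display names that may contain spaces.

    Spaces are removed from each candidate name before case-insensitive
    prefix matching, so '@JohnSm' can match 'John Smith'. Returns the
    match only if it is unambiguous, otherwise None.
    """
    prefix = token.lstrip("@").lower()
    if len(prefix) < 3:  # require a minimum prefix length to cut down noise
        return None
    matches = [n for n in display_names
               if n.replace(" ", "").lower().startswith(prefix)]
    return matches[0] if len(matches) == 1 else None
```

Ambiguity is the real cost of spaces: "@JohnS" could mean "John Smith" or "John Stone", so the UI has to either refuse or ask.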


Great idea, we'll look into making the home page the trending page soon.

Regarding HTML, our original site actually only supported HTML (because it was easier to build an annotator for an HTML page). The issue is that a good ~25% of these papers don't render properly, which pisses off a lot of academics. Academics spend a lot of time making their papers look nice as PDF, so when someone comes along and reformats their entire paper in HTML, not everyone is a fan.

That being said, I do think long term HTML makes a lot of sense for papers. It allows researchers to embed videos and other content (think, robotics papers!). At some point we do want to incorporate HTML papers back into the site (perhaps as a toggle).


I apologize for changing topic here:

Did you bulk-download the arXiv metadata, PDFs, and/or LaTeX files?

I am trying to figure out the required space for just the most recent version of the PDFs.

I can find mentions of the total size of their S3 bucket, but it's unclear whether that also includes older versions of the PDFs.

I also wonder whether the Kaggle dataset is kept up to date, since it states merely 1.7M articles instead of the 2.4M I read elsewhere.

Edit: I just found the answers to my question here: https://info.arxiv.org/help/bulk_data_s3.html


> The frontpage should directly show the list of papers, like with HN.

I disagree. There are numerous times when I have browsed the comments on an HN post where people haven't read the article and are just responding to the comment thread. The workflow here seems a bit different, in that a person would have already read a paper and wants to read through existing discussion or respond to it. Given that, having search front and center follows as the next step for a person who has read a paper and wants to find discussions related to that paper in particular.

HN is more about aimless browsing, which is a bit different from researching a specific area or topic.


> - Ranking shouldn't be based on comment activity, which favors controversial papers; instead, papers should be voted on like comments.

How about not ranking things at all? I don't feel like something like this should be a popularity/"like" contest; instead, let the content of the paper/comments speak for themselves. Yes, there will be some chaff to sort through when reading, but humanity will manage.

Just sort things by updated/created/timestamp and all the content will be equal.


> let the content of the paper/comments speak for themselves.

People can't read everything, and have to rely on others to filter up the good stuff. If you read something random, based on no recommendation, it's charity work (the odds are extremely good that it is bad), and you should recommend that thing to other people if it turns out to be useful. Ultimately, that's the entire point of any of this design: if we didn't care about any of the metadata on the papers, they could just be numbered text files on an FTP site.

The fewer things I have to read to find out they're shit, the longer life I have.

I say the opposite: put a lot of thought into how papers are organized and categorized, how comments on papers are organized and categorized, the means through which papers can be suggested to users who may be interested in them, and the methods by which users can inject their opinions and comment into those processes. Figure out how to thwart ways this process can be gamed.

Treat the content equally, don't force the content to be equal. Hacker News shouldn't just be the unfiltered new page.


Sorted by "new"...

Most articles are not interesting, and most of the interesting ones are interesting only to a niche of a few researchers. The front page will be flooded with uninteresting stuff.


That's ranking by recency, which means I can abuse it by churning out low-quality content to arXiv.


> Ranking shouldn't be based on comment activity, which favors controversial papers

But don't we want people's attention drawn to controversial, conversation-generating papers? The whole point of the platform is to drive conversation.


My guess is that they're trying to deal with a problem they're creating by being open to everyone. The problem with places like Twitter, Reddit, GitHub, HN, etc. is that you don't know you're talking to a peer, and the idiots asking irrelevant questions or proposing dumb things outnumber researchers a million to one. Even allowing the public to upvote or affect the rankings is not beneficial to science.

I'm all for casting wide nets and making things available to everyone, but a little gatekeeping is not bad (just don't gatekeep by race, class, or the like). But I'm sorry, research is hard. There's a reason people spend decades researching things that at face value look trivial. Rabbit holes are everywhere, and just because you don't know about them doesn't mean your opinion has equal weight.

We seriously lack areas where experts can talk to other experts.


The concern may be about what effect this will have on future papers (just like news headlines engineered for clickbait).


Many people on earth have names with spaces, so it's good that a username can reflect a person's real name.



