mnks's comments

mnks · on Dec 20, 2022

Looking at your list you may like VisiData (https://www.visidata.org/). See the demo from 2018: https://www.youtube.com/watch?v=N1CBDTgGtOU

zJayv · on Dec 21, 2022

100%. Visidata is great. Easy to pick up, rewarding to power users.

mnks · on Dec 16, 2022

Have a look at https://beta.openai.com/tokenizer which uses javascript reimplementation of the GPT-2 / GPT-3 BPE tokenizer. In this case it's [31373, 995].

mnks · on Sept 8, 2022

I wrote a blog post [1] with an interactive widget where you can provide an encoding for a random decimal digit and see how close you can get to the theoretical log₂(10) ≈ 3.32 bits.

[1]: https://blog.kardas.org/post/entropy/ (Average Code Length section)

mnks · on Dec 26, 2021

Glad you like Papers with Code. Please check [1] for the list of scientific domains we currently support and [2] for CS in particular.

[1]: https://portal.paperswithcode.com/

[2]: https://cs.paperswithcode.com/

mnks · on Feb 29, 2020

It's not about the number of letters in the compounds, but about the number of morphemes.

Your "powychodziłybyście" example could be translated as "you (feminine, plural) would have been going out". With the word tokenization, you get (ignoring comma and brackets) 8 tokens in English and one token in Polish. Now you can have three persons, two genders, two numbers, an imperfective or perfective verb, etc. resulting in combinatorial growth of word tokens in Polish. If you have all word forms for "go out" and you want to add "go in", in English you would add a single token "in", and in Polish you add all the tokens with "-wy-" replaced by "-w-". As a result in Polish you end up with much bigger vocabulary. Additionally you need bigger training corpus as you cannot learn the tokens independently. For example, if you know the meaning of "he ate" and "she wrote", you should be able to guess the meaning of "he wrote", as you've seen all of the tokens. In Polish it's "Zjadł", "Napisała" and "Napisał" - all of the word tokens are different.

Using the subword tokenization instead of word-level tokenization is kind of similar to using a normalized database instead of unnormalized one. It's not about one form being more complex than the other as they're equivalent. After all, will written English be much more complex if we remove all whitespaces? :)

mci · on Feb 29, 2020

I agree with what you wrote. I did not object to subword tokenization that let you(?) win the competition. I objected to GP's assertion that one can add many morphemes together to create very long "words" in Polish, which made casual readers think of stringing morphemes like German compounds while the number of morphemes in Polish words is bounded by 7, maybe by 8.