The Seven Secrets of Highly Cited Scientists
A couple of years ago, I discussed with some colleagues the topic of maximizing citations for academic research papers. Here is a summary of the discussion.
Why should we want our papers to be highly cited? I assume here that we want our work to influence other researchers, and that citation count is a reasonable estimate of influence.
Survey, review, and methodology papers are often highly cited, and they certainly have merit, but the focus of this discussion is on papers that present original work. (By the way, there is evidence that citation counting is biased towards survey/review papers.)
It seems to me that these are the main factors that characterize more highly cited papers:
1. Reusability: The core idea should be relatively simple, so that other researchers can easily understand it and especially so that they can easily use it in their own research. This factor might also be called simplicity, elegance, or fertility, but I think reusability best captures what I mean. I will cite your paper if I can reuse your ideas in my own research.
2. Originality: The core idea should be original. A pioneering paper may face more challenges during reviewing (which is inherently conservative) than an incremental paper, but the most cited papers are pioneering papers. I will cite your paper if I use an idea and your paper is one of the oldest references I can find for that idea.
3. Effectiveness: There should be some experimental evidence that the core idea works better than past ideas or better than reasonable baselines. Reviewers care deeply about this. I will want to use your idea if you can show me that it works on tasks that I care about.
4. Venue: If all else is equal, a paper in a respected conference or journal will be cited more than a paper in a less respected venue. I prefer to cite respected conferences and journals, hoping that the respect for my citations will increase the respect for my own paper.
5. Accessibility: If all else is equal, online papers will be more cited. I will cite your paper if I can read it without walking to the library.
6. Timeliness: Turbo codes, a class of error correction codes, were invented in 1993. These codes approach the theoretical maximum performance (the Shannon limit). It turns out that they are similar to a class of codes called LDPC codes, invented in 1963, but ignored until the invention of Turbo codes. The LDPC codes were ignored because the hardware of the 1960s was not good enough to make LDPC codes practical. This illustrates the importance of timeliness for maximizing citation counts. I will cite your paper if I can use your ideas now.
7. Positivity: Negative results are not as popular as positive results, although there have recently been some efforts to correct this. I will cite your paper if you show me what I can do, instead of telling me what cannot be done.
It seems that venue is not as important as the other factors. When I look at the citation counts for my own papers, they are not highly correlated with the average citation counts of the venues. When I look at my favourite highly cited papers, it seems that reusability and originality are the most important factors. These are what we should strive for in our research. For increased accessibility, putting papers online is easy and makes sense. I use both arXiv and Cogprints.
Regarding negative results, when an algorithm succeeds at a task, a large number of factors have to be right. We usually don’t even know what all of the factors are, at least until years later, if ever. A researcher can publish a positive result, listing a few of the factors that were involved, and other researchers can try to replicate the result, knowing some of these factors, and knowing that a positive result is possible (i.e., we have an existence proof). When an algorithm fails at a task, any one of these factors may be responsible. Locating the exact factor may be very difficult. A negative result may scare researchers away from a whole approach, even though (for all they know) only one factor was wrong.
This is known as the Credit Assignment Problem. When a long chain of steps leads to success, we can simply distribute the reward evenly over all of the steps in the chain. But what are we to do when a long chain of steps leads to failure? How can we discover the step that caused the failure (the weakest link in the chain)? A negative result can lead to the rejection of the whole chain, due to one bad link. Instead of distributing a penalty evenly over all of the steps in the chain, it might be better to just forget about the negative result.
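To make the asymmetry concrete, here is a minimal sketch of the "distribute evenly" scheme described above. The step names are hypothetical, and the even split is only the simple scheme mentioned in the text, not any particular published algorithm.

```python
def distribute_evenly(steps, outcome):
    """Split one outcome (reward > 0, penalty < 0) equally over all steps."""
    if not steps:
        raise ValueError("steps must be non-empty")
    share = outcome / len(steps)
    return {step: share for step in steps}


# Hypothetical chain of steps behind one experimental result.
steps = ["data", "features", "algorithm", "tuning"]

# Success: every step gets an equal share of the credit.
credit = distribute_evenly(steps, 1.0)

# Failure: every step gets an equal share of the blame, even if
# only one of them (which one is unknown) actually caused it.
blame = distribute_evenly(steps, -1.0)

for step in steps:
    print(f"{step}: credit={credit[step]:+.2f}, blame={blame[step]:+.2f}")
```

After a success, an even split does little harm, since every step was at least good enough. After a failure, the same split penalizes the three sound steps exactly as much as the one bad link, which is the problem with negative results described above.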
There is an interesting discussion of journals versus conferences in Academic Careers for Experimental Computer Scientists and Engineers (Appendix B). The big problem with journals is the delay. You can often expect two years from submission to publication. The advantages of journals are more prestige, more space to explain your work, and feedback from reviewers that results in a much better final paper.
For me, the decision of conference versus journal is based on how much I have to talk about. I think most people these days (including myself) would rather read an eight-page conference paper than a thirty-page journal paper. When I read a paper, I’m looking for good ideas that I can use in my own work (i.e., reusability). Most good ideas can be expressed in eight pages (or fewer). For me, a journal paper is a last resort, to be used only when I have so much to say that it’s just impossible for me to fit it into eight pages.
Thanks to Joel Martin, David Nadeau, Daniel Lemire, Roland Kuhn, and Pierre Isabelle for their comments and contributions.
A few factors you left off:
- Prestige of the employer… MIT and Stanford researchers tend to get more citations… these days, a Googler would get more citations.
- Whether the authors are well known or not is probably a strong factor. I will almost surely browse a Turney paper I stumbled upon. Well, at least read the abstract.
- People like to cite highly cited papers. Hence, if a few people cite a given paper, it will tend to snowball, irrespective of the quality of the work. It is therefore likely to be a very nonlinear process.
- There are plenty of prestigious venues that are hardly ever cited. I do not want to include names here, but there are well-established IEEE and ACM venues faring very badly citation-wise.
- Many journals accept short papers. Some only accept short papers. Think about Information Processing Letters, or Pattern Recognition Letters. In Physics, I think it is quite common to publish short papers. IEEE also has a long tradition of limiting the length of journal articles.
- Luck is probably a strong factor too.
- I agree that short and focused papers are more “citable”.
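The snowball effect in the list above can be sketched with a small simulation. This is only an illustration under stated assumptions: every paper has identical quality, each new citation goes to a paper with probability proportional to its current citation count plus one, and the function name and parameter values are made up for the example.

```python
import random


def simulate_citations(num_papers=50, num_citations=2000, seed=0):
    """Hand out citations one at a time, favouring already-cited papers."""
    rng = random.Random(seed)
    counts = [0] * num_papers
    for _ in range(num_citations):
        # Weight = current citations + 1, so an uncited paper can still be picked.
        weights = [c + 1 for c in counts]
        cited = rng.choices(range(num_papers), weights=weights, k=1)[0]
        counts[cited] += 1
    return counts


counts = simulate_citations()
ranked = sorted(counts, reverse=True)
print("most cited:", ranked[:5])
print("least cited:", ranked[-5:])
```

Because all papers start out equal in this sketch, any gap between the most cited and least cited papers comes only from early luck being amplified, not from quality.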
“A few factors you left off”
My focus was on factors that the author/researcher can control. Some of the factors that you list are things that the author cannot readily influence. But, in any case, these are all good points.
As a reviewer, too often I read papers that clearly fail in terms of reusability, originality, or effectiveness. These are factors that the author should consider at the very beginning of a research project, long before writing the paper.
Originality can be determined very early in the process. When I read a paper, I can judge its originality level pretty quickly. At least, originality is easily falsified: if the paper presents “yet another way to compute frequent itemsets” or “yet another variant of k-means clustering”, you know it is not original.
Effectiveness can be determined as the project unfolds. In my experience, thankfully, most researchers know to provide evidence of effectiveness. This evidence can take a soft form (other people use this approach and it works well) or it can be hard evidence (we ran old-fashioned tests and here are our numbers). Theoretical computer scientists would argue that their theoretical bounds count as effectiveness measures.
It is easy to assess reusability in retrospect, but what appears complicated or possibly useless often turns into something highly reusable. How do you determine a priori reusability? How do you assess whether a paper provides something reusable? Simplicity is relative. Accessibility is relative. Computing the SVD of a large sparse matrix, is it simple and accessible? What about employing graph matching algorithms?
“It is easy to assess reusability in retrospect, but what appears complicated or possibly useless often turns into something highly reusable. How do you determine a priori reusability?”
You have a point. The mathematician G.H. Hardy wrote, “I have never done anything ‘useful’. No discovery of mine has made, or is likely to make, directly or indirectly, for good or ill, the least difference to the amenity of the world,” but he was wrong: his work has been applied widely in physics and cryptanalysis. I agree that it can be difficult to estimate reusability in some cases, but there are other cases where it is not so difficult. I believe that it is worthwhile to attempt to estimate reusability at an early stage in a research project; the fact that it may be difficult to estimate is no reason not to try.
On my home page, I have grouped my publications by topic, such as Analogies and Relational Similarity. Immediately to the right of each topic link, there is a link to applications, such as Applications of Analogies and Relational Similarity. This is my attempt to estimate the reusability of my research. (Yes, I am aware of the risk of self-deception. Time will tell.)