A multiple choice exam is looking for a sweet spot of a certain percentage passing, not too high and not too low.
The multiple choice question may be one of the most despised games ever conceived. The purpose of a multiple choice exam is to exclude people in a quantitative manner, be it for admission into schools, licensing professionals, or limiting the number of high grades in a class. Assessing a person on their individual merits is a time-consuming process, and once a school or class hits a critical mass of students, it isn’t economically reasonable to scrutinize all of them. Let’s say you’ve got 1000 participants and five people reading their results. You can cut time and costs by figuring out a way to neatly get rid of 500 because they scored under a certain amount. At the same time, a multiple choice exam cannot be so difficult that you exclude an excessive number of applicants. Most law schools, for example, have a minimum LSAT score below which denial is automatic and a high score above which acceptance is automatic. Applicants in between those scores are then addressed on an individual level and other factors are introduced. The problem with such a system is that to ensure a multiple choice test produces the right number of passing scores, you have to keep changing the questions.
An article by the National Center for Fair and Open Testing explains, “multiple-choice items are an inexpensive and efficient way to check on factual ("declarative") knowledge and routine procedures. However, they are not useful for assessing critical or higher order thinking in a subject, the ability to write, or the ability to apply knowledge or solve problems” ("Multiple-Choice Tests", Education.com). It’s for that reason that a multiple choice question is always limited in scope: it can only be about basic knowledge of a topic. The formula is to have two answers that are blatantly wrong, one that is kinda right, and one that is the most right. One instructor during a review session for the bar exam pointed out that on average a student will know the correct answer immediately to 25% of the problems, have no clue on 25%, and be able to narrow it down to the right and kinda right answers for the other 50%. So the way that you evaluate the difficulty of a multiple choice question is by how similar the right and kinda right answers are. A person who can’t narrow it down to those two doesn't know the basic material and shouldn't pass.
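The instructor's rule of thumb can be turned into a rough expected score. The sketch below is only an illustration. It assumes four answer choices per question, a blind one-in-four guess on the questions a student has no clue about, and, by default, a coin flip between the right and kinda right answers on the rest. Those guessing odds are assumptions and do not come from the instructor.

```python
from fractions import Fraction

# Shares of questions in each category, per the review-session rule of thumb.
KNOWN = Fraction(1, 4)     # answer known immediately
CLUELESS = Fraction(1, 4)  # no idea at all
NARROWED = Fraction(1, 2)  # narrowed to "right" vs. "kinda right"


def expected_score(p_narrowed=Fraction(1, 2), choices=4):
    """Expected fraction of questions answered correctly.

    p_narrowed is the assumed chance of picking the right answer
    once the choice is down to the final two.
    """
    p_blind = Fraction(1, choices)  # assumption: uniform random guess
    return KNOWN * 1 + CLUELESS * p_blind + NARROWED * p_narrowed


print(float(expected_score()))                # 0.5625
print(float(expected_score(Fraction(3, 4))))  # 0.6875
```

Under these assumptions, moving from a coin flip to three-in-four odds on the narrowed-down half lifts the expected score from about 56% to about 69%, which suggests how much of an exam's difficulty rides on the gap between the right and kinda right answers.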
Right and kinda right are abstruse concepts. They revolve around checking if a person understands the distinction between knowing something is correct and understanding why something is correct. A good analogy would be the difference between someone saying that water puts out fire and another person saying water puts out fire because it cuts off the oxygen and removes heat from the material. In that sense, people either fail on the facts or on the technicalities when choosing between these two answers. In law exams, someone can have a legal code perfectly memorized but completely miss the point when trying to apply it to a scenario. On the other hand, they can perfectly understand the situation at hand but not know how, in particular, something illegal has occurred. In the example above, I might ask you why a fire went out when Suzie dumped a bucket of water on it. The correct answer obviously needs to be a bit more than, “Because water puts out fire.”
There’s a great post by Jerard Kehoe that outlines a lot of the basic strategies when crafting these questions. The important thing to remember is that a test maker wants a certain percentage to fail, so they are going to have a balance of questions based on how often people get them right on average ("Writing Multiple Choice Exams", Practical Assessment, Research & Evaluation, 1995). A multiple choice question begins with what is called a "stem", which will be either an incomplete statement or a direct question. Incorrect answers are called "distractors". The more information that is in the stem of the problem and not in the answers, the more difficult the question. This is because no matter how good the teacher, there is always a risk of accidentally giving away the answer in the phrasing of the distractors. The words might trigger a latent memory or response instead of a factual understanding by the student.
There are too many tricks to list, but one example is a question that gives a definition and then offers several vocabulary words as answers. That question will fail significantly more students than the reverse situation because they have less to tip them off. Negative statements in multiple choice questions are useful but are very susceptible to bias. This would be the “pick the answer that is most wrong” type of question. Students just instinctively pick “correct” answers, so anytime that a negatively worded question is introduced, a teacher has to factor in that a larger percentage of people are going to get it wrong. Other examples would be having the correct answer be gibberish but the other choices be factually incorrect, or distinguishing between two correct answers by attaching a factually incorrect statement to one of them.
The goal of ensuring that the test taker’s factual knowledge is being tested, rather than luck or deductive reasoning, results in one of the weirdest elements of multiple choice tests: there is a “correct” way for people to get a question wrong. Kehoe's post explains, “The number of students choosing a distractor should depend only on deficits in the content area which the item targets and should not depend on cue biases or reading comprehension differences in 'favor' of the distractor”. In other words, you’re supposed to get it wrong because the kinda right answer is incomplete or factually wrong compared to the others. Things that tip you off to the answer, or comprehension differences, should be kept to a minimum. The test taker should only know what to do because of prior knowledge.
The biggest design issue with multiple choice tests is that writing a good, coherent multiple choice question is difficult because of the thin line between right and kinda right. Being adept at it means both specializing in the craft itself and having an encyclopedic knowledge of the field being tested. Most exams that I’ve worked with were written by very informed people who wrote frustratingly ambiguous questions. An example of a bad question would be one that distinguished the right and kinda right answers because one used the word "presumed" and the other used "inferred". While the words certainly have two different meanings, spotting the distinction had nothing to do with the subject matter of the question. Professional exam writers are aware of this problem, and many exams will now test a question out before actually counting it. Out of an exam of 100 questions, 10 will be experimental ones that exist to see how many people get them right. Once the writers have a rough percentage of how many people get a question wrong on average, they balance it against easier and more difficult questions. It goes back to the overall purpose of the exam being to exclude people but not too many people. A multiple choice exam is looking for a sweet spot of a certain percentage passing, not too high and not too low.
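The pretesting step described above can be sketched in a few lines. Here item difficulty is simply the fraction of test takers who answered an experimental question correctly. The response data, the question labels, and the 0.8 and 0.3 cutoffs are all made up for illustration and are not taken from any real exam.

```python
def item_difficulty(responses):
    """Fraction of test takers who got the item right (1 = correct, 0 = wrong)."""
    return sum(responses) / len(responses)


def classify(p, easy=0.8, hard=0.3):
    """Label an item by its proportion correct; the cutoffs are hypothetical."""
    if p >= easy:
        return "easy"
    if p <= hard:
        return "hard"
    return "medium"


# Hypothetical results for three experimental questions, ten test takers each.
pretest = {
    "Q91": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "Q92": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "Q93": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
}

labels = {name: classify(item_difficulty(r)) for name, r in pretest.items()}
print(labels)  # {'Q91': 'easy', 'Q92': 'medium', 'Q93': 'hard'}
```

With labels like these, a test writer could mix easy, medium, and hard questions until the projected pass rate lands where they want it.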
The more the questions are refined to weed people out, the more people will rely on study materials tailored to the exam to pass. These, in turn, break down the scoring process because people who can get access to the materials disproportionately pass and skew the pass rates. The questions must become more difficult to compensate for this. The problem grows because you can only maintain a test that people are breaking with study materials by changing the questions that you use. Since the body of knowledge that you’re testing is already pre-determined, there is no new material to naturally generate new questions. The MCAT can’t suddenly include a non-medical topic just to fix the pass rate. So to keep generating new questions, the tests have to keep finding ways to change the presentation of the material. The solution thus perpetuates the problem: excessively convoluted questions can only be passed through practice because they continually deviate from the subject matter’s normal presentation. This issue is particularly pronounced if the question presents extremely implausible hypothetical scenarios that a professional would never encounter in real life. Like, oh say, the MBE on the bar exam. Once a standardized test gets to a point where studying the test is just as important as knowing the material, how much of the practice materials you can afford might as well be the first question.