SFF: Distilled

On one end of the scale, I can see how this would be tantamount to the web crawlers that are so effective at helping Internet users find websites that cater to their tastes. I feel those are pretty much failures, because they have to periodically find new ways to ferret out the sites that deliberately exploit them to redirect traffic their way.

I would also think that capitalization and even truncation of words would cause the validity of this test to tank. I often have trouble searching a Word document without false positives and missed words because of some weirdness in the search function. This is also why we turn off autocorrect: search tools cannot correctly identify everything, and the problem is compounded when correction happens on the fly without any method of checking.

But let's look forward to someone running a search, finding all the science words, and replacing them with substitutes that work (though they might obviously sound silly), thereby creating a novel that has no science words by your library's definition. And then, for that matter, taking a non-science-fiction book and inserting fitting words that make it appear to be full of science.

What about books that are full of science but have no intention of being science fiction?

But my question might be more about how much time has been spent testing this on real papers from scientific journals, to determine how much or how little science an article needs before the test renders it unscientific.

I think if there were some measure of science on those, and the resulting word dictionary used terms drawn from them, perhaps we'd get closer to measuring actual science. That still doesn't make this method useful, unless you want your science fiction to read like a science article.

But let's take a look at Duke University and the research on telepathy, telekinesis, and teleportation; ESP; espers. Where do these fall in your dictionaries?

For that matter, do you include telephones, automobiles, light bulbs, calculators, magnets, magnetism, steam engines, dirigibles, weather balloons, cameras, rotoscopes, isotopes, heart monitors, defibrillator, syringes, dialysis machines, MRI, mechanical pencils, pens, stylus, tablets, laptops, desktops, OS, software, hardware, firmware, IC, integrated chip design, resistor, capacitor, toroid, flux, solder, wave solder, automated placement, integrated circuit design, CAD, transistors, inductors, A-to-D converters, binary, octal, digital, analog, diode, cathode, gate array, logic array, flashlight, battery, rechargeable battery, combustion engine, torque wrench, shaker table, environmental chamber, pressure chamber, drop table, test fixture, ...

The list goes on: things in use today that involve some form of science to build, or that are used in building the technology taking us into the future. All of that is part of making pure science fiction, because it's an extrapolation of known physics into speculation about the future.

You can't even leave out the mop, broom, mop bucket, and squeegee, because they will all be used to get us there and will likely be used after we arrive. Some of them might be integral to maintenance of the future in space.
 
I am contemplating another modification to the program: stop counting science words that are used only once in the entire book. If a word is used only once in 400 pages, how significant is it? But in the count of the total number of different words used, those singletons look more significant than they are.

In Player of Games there are 26 science words out of 60 that are used only once. Eliminating them from the count would reduce the density from 0.395 to 0.358, and the program would report 34 science words used instead of 60.

That is a 43% reduction in the number of words but only a 10% reduction in density.
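Concretely, the rule might look like the following Python sketch. This is a minimal illustration, not the actual program: the tokeniser is guessed, the dictionary is whatever word set you pass in, and the scaling (occurrences per 1,000 characters) is inferred from the densities quoted in this thread.

from collections import Counter
import re

def sf_density(text, sf_words, drop_singletons=False):
    # Tokenise into lowercase words; the real program's rules may differ.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in sf_words)
    if drop_singletons:
        # The proposed modification: ignore SF words that occur only once.
        counts = Counter({w: n for w, n in counts.items() if n > 1})
    total = sum(counts.values())
    # Density inferred as SF-word occurrences per 1,000 characters.
    density = total * 1000.0 / len(text)
    return len(counts), total, density

With drop_singletons=True, the Player of Games example above would report 34 distinct words and a density of 0.358 instead of 60 and 0.395.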

Any thoughts?

psik
 
Makes sense, especially as you cap the high end - might as well cap the low end and help cut false positives.

But what would help is to pick a top-20-to-50 list of some kind, or use an award-winners list or the SF/F Masterworks series or something, so you/we can see how a variety of familiar works shake out, then apply the changes and see what specifically results across the board.
 
Fun stuff. I cannot think of an instance where I would use it for myself, but it is a fun project full of niftiness.
 
I am contemplating another modification to the program: stop counting science words that are used only once in the entire book. If a word is used only once in 400 pages, how significant is it?

Any thoughts?

psik

Hmm, yes, an informed test run would help toward an educated guess on the matter, but deciding on the significance of a single occurrence of a word requires context. Context and the SFD may never be friends.

A cautionary example might be an author who is adept at describing or explaining with a wide variety of terms. With the stop counter you are considering, a term that is repeated would be counted each time, but the same concept described with a different term each time would not be picked up at all.

Where the limiter works to avoid large skews, the stop counter could produce misleading results.

Oh, also wondering: when you mention that longer works tend to have a lower density, why is this? I.e., a 250-page book with 50 words should score the same as a 500-page book with 100 words, right?
 
Hmm, yes, an informed test run would help toward an educated guess on the matter, but deciding on the significance of a single occurrence of a word requires context. Context and the SFD may never be friends.

A cautionary example might be an author who is adept at describing or explaining with a wide variety of terms. With the stop counter you are considering, a term that is repeated would be counted each time, but the same concept described with a different term each time would not be picked up at all.

Where the limiter works to avoid large skews, the stop counter could produce misleading results.

I presume you mean that an author using different terms for the same concept, one term used once in one place and a different term used five times for the same concept elsewhere, would throw things off.

I think we have different ideas about how good this system could possibly be, and I am willing to throw out the baby with the bathwater 1 time in 100 to get rid of the really dirty bathwater 95% of the time.

Oh, also wondering: when you mention that longer works tend to have a lower density, why is this? I.e., a 250-page book with 50 words should score the same as a 500-page book with 100 words, right?

Dune uses more SF words, more often, than Ender's Game but has a lower density. Dune is twice as long.

The input file is: FH.Dune.txt with 1182172 characters.
It uses 78 SF words 375 times for an SF density of 0.317212723698

The input file is: OSC.EndersGame.txt with 582652 characters.
It uses 42 SF words 214 times for an SF density of 0.367286133061
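(For anyone checking the arithmetic: the density appears to be SF-word occurrences per 1,000 characters, since 375 × 1000 / 1,182,172 ≈ 0.317 and 214 × 1000 / 582,652 ≈ 0.367.)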

I have only been showing data from novels, not short stories of less than 20 or 30 pages. For short stories the densities can go much higher; I think the highest score I have seen was 8-something. But I notice that eliminating single counts from short stories would cause other problems. Maybe it should be a user option: ask what the user wants before the results print. But that would be annoying in batch processing. Perhaps implement it as a command-line parameter?
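A command-line switch would keep batch runs non-interactive. Here is a rough sketch using argparse; the flag name is my invention, and it reuses the hypothetical sf_density function sketched earlier in the thread:

import argparse

# SF_WORDS stands in for however the real program loads its dictionary.
SF_WORDS = {"starship", "android", "laser"}  # hypothetical examples

parser = argparse.ArgumentParser(description="SF word density")
parser.add_argument("files", nargs="+", help="text files to analyse")
parser.add_argument("--drop-singletons", action="store_true",
                    help="ignore SF words that occur only once")
args = parser.parse_args()

for name in args.files:
    with open(name) as f:
        text = f.read()
    distinct, total, density = sf_density(text, SF_WORDS,
                                          drop_singletons=args.drop_singletons)
    print("The input file is: %s with %d characters." % (name, len(text)))
    print("It uses %d SF words %d times for an SF density of %.3f"
          % (distinct, total, density))

Because the flag defaults to off, batch processing stays non-interactive either way.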

Personally I prefer reading novels, but there is no reason not to test short stories too.

Do you have Python on your computer?

psik
 
Hmm, yes, an informed test run would help toward an educated guess on the matter, but deciding on the significance of a single occurrence of a word requires context. Context and the SFD may never be friends.

Either forget about a computer doing context, or expect sophisticated analysis and programming that is way over my head. You are talking IBM Watson-level stuff. I doubt I would apply it to something this trivial if I were that smart.

psik
 
No Python on my computer.

Not a programmer, just an observer. Enjoying watching the process though.

I agree context is asking too much; I was saying it would in fact defeat the purpose of the program. Even so, I had a fleeting notion that analysis using context could potentially arise as a by-product of future results subjected to different conditions. The notion is wispy. Maybe I need to come back up the pipeline.

Regarding densities: as long as the calculation is constant regardless of the character count (which is what I was asking), the lower densities of longer books must so far just be coincidental, no?

If I were forced to vote on whether or not to count single words, just to get this moving toward a GUI, I would say count them!
 
So what is the most convenient way for me to send you a .pyc of the program?

I'll PM you.

Regarding densities: as long as the calculation is constant regardless of the character count (which is what I was asking), the lower densities of longer books must so far just be coincidental, no?

Not entirely coincidental, as longer books are usually more padded with character and setting vs. ideas and scientific activity, so they probably have more "he felt" and "the sun set" and less "bunsen burner". But, yeah, coincidental in the sense that there's no necessary reason for it to be so; probably some long work has a high number, and many short ones could of course be very low.
 
Here are our 100 favorite books, according to Facebook

That article gave the percentage of people who included each book, so I analysed the SF entries.

The Hitchhiker’s Guide to the Galaxy—Douglas Adams (5.97%) 0.955
The Hunger Games trilogy—Suzanne Collins (5.82%) 0.106
1984—George Orwell (5.37%) 0.176
A Wrinkle in Time—Madeleine L’Engle (4.38%) 0.426
The Handmaid’s Tale—Margaret Atwood (4.27%) 0.150
The Giver—Lois Lowry (3.53%) 0.083
Ender’s Game—Orson Scott Card (3.53%) 0.367
Fahrenheit 451—Ray Bradbury (3.15%) 0.077
Dune—Frank Herbert (3.02%) 0.317
Slaughterhouse-Five—Kurt Vonnegut (2.54%) 0.278
Stranger in a Strange Land—Robert Heinlein (2.39%) 0.405
Brave New World—Aldous Huxley (2.24%) 0.327
Outlander—Diana Gabaldon (2.07%) 0.043
The Time Traveler’s Wife—Audrey Niffenegger (1.63%)

Hitch Hiker's Guide gets a score I would expect from a hard SF story, so counting words does not tell you everything. I regard it as more of a satire of science fiction than actual science fiction, though. I haven't found The Time Traveler's Wife, so no data on that. Fahrenheit 451 scores among the lowest in SF density.

psik
 
Regarding densities: as long as the calculation is constant regardless of the character count (which is what I was asking), the lower densities of longer books must so far just be coincidental, no?

On the subject of works where the Fantasy and SF densities are similar, Andre Norton's Star Man's Son meets that criterion.

The input file is: AN.StarMnSon.txt with 357994 characters.

It uses 20 SF words 63 times for an SF density of 0.176

It uses 5 Fantasy words 60 times for a Fantasy density of 0.168
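Assuming the program works like the sketch earlier in the thread, the Fantasy figure would just be a second pass with a different word list (FANTASY_WORDS here is a hypothetical second dictionary):

sf_distinct, sf_total, sf_dens = sf_density(text, SF_WORDS)
f_distinct, f_total, f_dens = sf_density(text, FANTASY_WORDS)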
 
Hitch Hiker's Guide gets a score I would expect from a hard SF story, so counting words does not tell you everything. I regard it as more of a satire of science fiction than actual science fiction, though. Fahrenheit 451 scores among the lowest in SF density.
This is what I said earlier: this is a really hard thing to do.
The Hitch Hiker's Guide to the Galaxy (the original play) was certainly meant mainly as comedy and secondarily as a send-up of SF. Douglas Adams was quite surprised by the reaction of people regarding it as SF.
Fahrenheit 451 is often thought of as SF against censorship, although Ray Bradbury publicly claimed it wasn't about censorship. It is probably SF, though in general Bradbury's SF is more a special kind of Fantasy. One of the best.
I think counting the words in a text that appear in a special dictionary isn't really going to work. Even if it did identify SF, Hard SF, Fantasy SF, Low Fantasy, High Fantasy, etc. (which it can't), it would not tell you about the quality.

There is another analysis you could do on the Hitch Hiker's Guide to the Galaxy series, which is harder programming: identify the fresh content in each succeeding book. It turns out that employing someone to produce the sixth wasn't a big issue, as even the fifth book has very little new content!
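For what it's worth, a crude version of that fresh-content measurement could track how many of each book's five-word sequences have not appeared earlier in the series. A rough sketch (the file names are invented, and this only catches near-verbatim repetition, not recycled ideas):

import re

def shingles(text, n=5):
    # Break the text into overlapping n-word sequences.
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

seen = set()
for name in ["hhgttg1.txt", "hhgttg2.txt", "hhgttg3.txt"]:  # invented names
    with open(name) as f:
        s = shingles(f.read())
    fresh = 100.0 * len(s - seen) / len(s)
    print("%s: %.1f%% fresh 5-word sequences" % (name, fresh))
    seen |= s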
 
Fahrenheit 451 is often thought of as SF against censorship, although Ray Bradbury publicly claimed it wasn't about censorship. It is probably SF, though in general Bradbury's SF is more a special kind of Fantasy. One of the best.

I think counting the words in a text that appear in a special dictionary isn't really going to work. Even if it did identify SF, Hard SF, Fantasy SF, Low Fantasy, High Fantasy, etc. (which it can't), it would not tell you about the quality.

What do you mean by WORK?

What did I ever claim this data would tell people, or where did I say it would be 100% reliable? How would I know what hard SF works normally score if I hadn't done the test? And anyone can tell by the cover that HHGttG is not serious SF in any way. Some people find it very funny; it got the occasional weak chuckle out of me. I avoided reading it for more than 20 years, and finally decided to read it so I could know what I was saying about it.

As a decades-long SF reader, I find Fahrenheit 451 odd in that it is mentioned so often, yet I found it so uninteresting, and Bradbury himself said it was his only SF novel. In his time Bradbury was more "literary" than most other authors, so he appears to have been selected as the poster boy for science fiction by the literary elite, but not by the serious SF readers of the 50s and 60s. I like his Martian Chronicles far more, though I admit it is an interesting mixture of SF and fantasy; it got an SF density of 0.590 and a Fantasy density of 0.048.

A significant amount of data must be collected before we can determine anything. But are we actually supposed to believe there is no meaningful difference between an SF work that scores 0.2 and one that scores 0.8, when the reviewers say nothing about the science or technology in the stories?

psik
 
I emailed James Gunn at the University of Kansas, where he has an SF study group. He said:

Sounds very useful. Jim

But as far as I can tell, he did not download the program.

psik
 